My problem is really in how they handled the situation once they knew that there was a problem, not even the initial manufacturing defect.
Yes, okay. They didn’t know exactly the problem, didn’t know exactly the scope, and didn’t have a fix. Fine. I get that that is a really hard problem to solve.
But they knew that there was a problem.
Putting out a list of known-affected processors and a list of known-possibly-affected processors at the earliest date would have at least let their customers do what is possible to mitigate the situation. And I personally think that they shouldn’t have been selling more of the potentially-affected processors until they’d figured out the problem sufficiently to ensure that people who bought new ones wouldn’t be affected.
And I think that, at first opportunity, they should have advised customers as to what Intel planned to do, at least within the limits of certainty (e.g. if Intel can confirm that the problem is due to an Intel manufacturing or design problem, then Intel will issue a replacement to consumers who can send in affected CPUs) and what customers should do (save purchase documentation or physical CPUs).
Those are things that Intel could certainly have done but didn’t. This is the first statement they’ve made with some of that kind of information.
It might have meant that an Intel customer holds off on an upgrade to a potentially-problematic processor. Maybe those customers would have been fine taking the risk or just waiting for Intel to figure out the issue, issue an update, and make sure that they used updated systems with the affected processors. But they would have at least been going into this with their eyes open, and been able to mitigate some of the impact.
Like, I think that in general, the expectation should be that a manufacturer who has sold a product with a defect should put out what information they can to help customers mitigate the impact, even if that information is incomplete, at the soonest opportunity. And I generally don’t think that a manufacturer should sell a product with known severe defects (of the “it might likely destroy itself in a couple months” variety).
I think that one should be able to expect that a manufacturer do so even today. If there is some reason that they are not willing to do so (e.g. concerns about any statement affecting their position in potential class-action suits), I’d like regulators to restructure the rules to eliminate that misincentive. Maybe it could be a stick, like “if you don’t issue information dealing with known product defects of severity X within N days, you are exposed to strict liability”. Or a carrot, like “any information in public statements provided to consumers with the intent of mitigating harm caused by a defective product may not be introduced as evidence in class action lawsuits over the issue”. But I want manufacturers of defective products to act, not to just sit there clammed up, even if they haven’t figured out the full extent of the problem, because they are almost certainly in a better position to figure out the problem and issue information to mitigate it than their customers individually are. In this case, Intel just silently sat there for a very long time while a lot of their customers tried to figure out the scope of what was going wrong, and often spent a lot of money trying to address the problem themselves, when more information from Intel probably would have spared them some of those costs.
To put this another way, Intel had at least three serious failures that let the problem reach this level:
A manufacturing defect that led to the flawed CPUs being produced in the first place.
A QA failure to detect the flawed CPUs initially (or to be able to quickly narrow down the likely and certain scope of the problem once the issue arose). Not to mention having a second generation of chips with the defect go out the door, I can only assume (and hope) without QA having initially identified that they were also affected.
A customer care issue, in that Intel did not promptly and publicly provide customers with information that Intel either had or should have had about the likely scope of the problem, mitigation, and, at least within some bounds of uncertainty (“if it can be proven that the problem is due to an Intel manufacturing defect on a given processor for some definition of proven, Intel will provide a replacement processor”), what Intel would do for affected customers. A lot of customers spent a lot of time duplicating effort trying to diagnose and address the problem at their level, as well as continuing to buy and use the defective CPUs. It is almost certain that some of that was not necessary.
The manufacturing failure sucks, fine. But it happens. Intel’s pushing physical limits. I accept that this kind of thing is just one thing that occasionally happens when you do that. Obviously not great, but it happens. This was an especially bad defect, but it’s within the realm of what I can understand and accept. AMD just recalled an initial batch of new CPUs (albeit way, way earlier in the generation than Intel)…they dicked something up too.
I still don’t understand how the QA failure happened to the degree that it did. Like, yes, it was a hard problem to identify, since it was progressive degradation that took some time to arise, and there were a lot of reasons for other components to potentially be at fault. And CPUs are a fast-moving market. You can’t try running a new gen of CPU for weeks or months prior to shipping, maybe. But for Intel to not have identified that they had a problem with the 13th gen, at least within certain parameters and at least after release, and then to have not held up the 14th gen until it was definitely addressed seems unfathomable to me. Like, does Intel not have a number of CPUs that they just keep hot and running to see if there are aging problems? Surely that has to be part of their QA process, right? I used to work for another PC component manufacturer and while I wasn’t involved in it, I know that they definitely did that as part of their QA process.
But as much as I think that that QA failure should not have happened, it pales in comparison to the customer care failure.
Like, there were Intel customers who kept building systems with components that Intel knew or should have known were defective. For a long time, Intel did not issue a public warning saying “we know that there is a problem with this product”. They did not pull known defective components from the market, which means that customers kept sinking money into them (and resources trying to diagnose and otherwise resolve the issues). Intel did not issue a public statement about the likely-affected components, even though they were probably in the best position to know. Again, they let customers keep building them into systems. They did not issue a statement as to what Intel would do (and I’m not saying that Intel has to conclusively determine that this is an Intel problem, but at least say “if this is shown to be an Intel defect, then we will provide a replacement for parts proven to be defective due to this cause”). They did not issue a statement telling Intel customers what to do to qualify for any such program. Those are all things that I am confident that Intel could have done much earlier and which would have substantially reduced how bad this incident was for their customers. Instead, their customers were left in isolation to try to figure out the problems individually and come up with mitigations themselves. In many cases, manufacturers of other parts were blamed, money was spent buying components unnecessarily, and people kept trying to run important services on components that Intel knew or should have known were potentially defective. Like, I expect Intel, whatever failures happen at the manufacturing or QA stages, to get the customer care done correctly. I expect that to happen even if Intel does not yet completely understand the scope of the problem or how it could be addressed. And they really did not.
I’d argue there was a fourth serious failure, and that was Intel allowing the motherboard manufacturers to go nuts and run these chips way out of spec by default. Granted, ultimately it was the motherboard manufacturers that did it, but there’s really no excuse for what these motherboards were doing by default. Yes, I get the “K” chips are unlocked, but it should be up to the user to choose to overclock their CPU and how they want to go about it. To make matters worse, a lot of these motherboards didn’t even have an easy way to put things back into spec - it was up to you to go through all the settings one by one and set them correctly.
I smell a class action lawsuit brewing
As compared to a recall and re-fitting a fab, a class action is probably the cheaper way out.
I wish companies cared about what they sold instead of picking the cheapest way out, but welcome to the world we live in.
I mean, I’m sure Intel cares.