Note: The legwork and credit for the discovery I’m going to talk about below go to Usman Pirzada of WCCFTech. I was on vacation last week when this news broke, but I ran some tests for him on an AMD laptop to make certain these findings applied to both Intel and AMD CPUs relative to the M1.
Let me be clear about the headline above: The “flaw” we’re going to talk about isn’t a problem with any specific benchmark or reviewer. It’s a difference in how the Apple M1 allocates and assigns resources versus how x86 CPUs work.
x86 CPUs from AMD and Intel are designed to use a technique known as Symmetric Multi-Threading (Intel calls this Hyper-Threading). AMD and Intel implement the feature somewhat differently, but in both cases, SMT-enabled CPUs are able to schedule work from more than one thread for execution in the same clock cycle. A CPU that does not support SMT is limited to executing instructions from the same thread in any given cycle.
Modern x86 CPUs from AMD and Intel take advantage of SMT to improve performance by an average of 20-30 percent, at a fraction of the cost or power that would be required to build an entire second core. The flip side is that a single-threaded workload can't tap any of the additional throughput SMT offers.
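If you want to see what SMT looks like from the software side, Linux exposes the thread-to-core mapping through sysfs. The snippet below is just a minimal sketch, assuming a Linux system with the standard sysfs topology files: on an SMT-enabled x86 chip, each physical core lists two logical CPUs as siblings, while a core without SMT lists only one.

```python
# Minimal sketch (Linux-only): group logical CPUs by the physical core they share,
# using the kernel's sysfs topology files. On an SMT-enabled x86 CPU each entry
# lists two logical CPUs (e.g. "0,8" or "0-1"); without SMT, each lists just one.
import glob

cores = set()
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    with open(path) as f:
        cores.add(f.read().strip())

for siblings in sorted(cores):
    print("physical core -> logical CPUs:", siblings)
```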
Apple’s M1 doesn’t have this problem. Its cores don’t implement SMT; they’re wide enough that a single thread can keep their execution resources busy on its own. Some of the reasons for the M1’s width come down to low-level design differences between the x86 and ARM instruction sets: ARM’s fixed-length instructions make it practical to build a very wide front-end, so a RISC core can generally decode more instructions per cycle from a single thread. (WCCFTech has a bit more on this.)
This is not some just-discovered flaw in the guts of Intel and AMD CPUs — it’s the entire reason Intel built HT and the reason why AMD adopted SMT as well. An x86 CPU achieves much higher overall efficiency when you run two threads through a single core, partly because they’ve been explicitly designed and optimized for it, and partly because SMT helps CPUs with decoupled CISC front-ends achieve higher IPC overall.
How This Difference Impacts Benchmark Results
In any given single-threaded (1T) performance comparison, the x86 CPUs are running at 75 to 80 percent of their effective per-core performance. The M1 doesn’t have this issue.
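That range follows more or less directly from the SMT uplift mentioned earlier: if a second thread raises a core's total throughput by roughly 20 to 30 percent, a lone thread only captures the reciprocal of that combined figure. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check: if SMT lifts a core's combined throughput by
# 20-30 percent, a lone thread sees only the reciprocal of that total.
for uplift in (0.20, 0.30):
    share = 1 / (1 + uplift)
    print(f"SMT uplift of {uplift:.0%} -> 1T delivers {share:.0%} of the core's 2T throughput")
# Prints roughly 83% and 77%, in the same ballpark as the 75-80 percent figure above.
```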
The graph below is by WCCFTech. The red data points are my own contributions to their work (which is worth reading in its own right):
This graph puts a somewhat different spin on things. When you run a second thread through the x86 CPUs, their performance improves significantly. In fact, here, the AMD Ryzen 7 4800U outperforms the M1 by a whisker.
Is this a fair comparison? That’s really going to depend on what you want to measure. Core-for-core? Yes. Thread-for-thread? No. This difference in utilization creates complications for x86-versus-M1 comparisons. The last time we dealt with anything similar in performance measurements was when AMD’s Athlon XP was facing off against the Pentium 4 with Hyper-Threading. Since AMD had to price defensively, it was sometimes possible to buy an Athlon XP that would beat an equivalently priced P4 in single-threaded performance, but lose once the P4’s Hyper-Threading came into play.
The end result of this difference is that there’s not going to be a single, simple way of comparing scaling between Apple and x86 the way we have for Intel versus AMD. Running 1T per core effectively cuts the x86 CPUs off from capabilities intended to boost their performance. Running 2T per core on both x86 and the M1 would force the Apple CPU, which has no SMT, to time-slice two threads on each core, a potentially non-optimal configuration that could degrade its performance.
Running 2T on x86 and comparing against 1T on M1 is “fair” inasmuch as it runs both cores in the manufacturer-optimized state, but this would be a comparison of single-core performance, not single-thread performance, and it’s not going to surprise people when a CPU running 2T outperforms a CPU running 1T. Finally, running 2T1C on x86 versus 2T2C on the M1 creates a variation on the original problem: The x86 CPU is being limited to the performance of a single physical CPU core, while the M1 benefits from two physical CPU cores.
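For anyone who wants to reproduce these configurations on their own hardware, the sketch below shows one way to do it on Linux by pinning worker processes to specific logical CPUs. It's a minimal illustration, not a rigorous benchmark harness: the toy workload stands in for a real test, and the CPU numbers assume that logical CPUs 0 and 1 are SMT siblings on one physical core while CPU 2 sits on a second core (check the sysfs output from the earlier snippet before trusting that numbering).

```python
# Minimal sketch (Linux-only): run the same toy workload in the 1T1C, 2T1C, and
# 2T2C configurations discussed above by pinning worker processes to specific
# logical CPUs. The CPU numbering is an assumption; verify it against
# /sys/devices/system/cpu/cpu*/topology/thread_siblings_list first.
import os
import time
from multiprocessing import Process

def worker(cpu):
    os.sched_setaffinity(0, {cpu})          # pin this process to one logical CPU
    start = time.perf_counter()
    sum(i * i for i in range(10_000_000))   # toy stand-in for a real benchmark
    print(f"  CPU {cpu}: {time.perf_counter() - start:.2f} s")

def run(label, cpus):
    print(label)
    procs = [Process(target=worker, args=(c,)) for c in cpus]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    run("1T1C: one thread on one physical core", [0])
    run("2T1C: two threads sharing one physical core via SMT", [0, 1])
    run("2T2C: two threads on two physical cores", [0, 2])
```

If the numbering assumption holds, the per-thread times in the 2T1C run should come in noticeably slower than in the 2T2C run; that gap is the SMT effect the comparisons above are wrestling with.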
The problem here is that x86 CPUs are designed to run optimally in 2T1C configurations, as a recent AnandTech deep dive into the performance advantages and disadvantages of enabling SMT indicates, while the M1 is designed to run optimally in a 1T1C configuration.
This may well be an ongoing problem for x86. Remember that per-thread scaling is far from perfect and gets worse with every thread you add. Historically, the CPU that delivers the best per-core performance in the smallest die area and with the highest performance per watt is the CPU that wins whatever “round” of the CPU wars one cares to consider. The fact that x86 requires two threads to do what Apple can do with one is not a strength. Whether loading an x86 CPU with only one thread constitutes a penalty will depend on what kind of comparison you want to make, but the difference in optimal thread counts and distribution needs to be acknowledged.
The big takeaways of the M1 remain unchanged. In many tests, the CPU shows consistently higher results than x86 CPUs when measured in terms of performance per watt. When it is outperformed by x86 CPUs, it is typically by chips that consume far more power than it does. The M1 appears to take a 20-30 percent performance hit when running applications built for Intel Macs, and it may consume more power in this mode. Apple’s emulation ecosystem and third-party support are still in early days and may not meet the needs of every user, depending on how deeply you are plugged into the overall Apple ecosystem. None of these is a direct reflection on the M1’s silicon, however, which still looks like one of the most interesting advances in CPU design in the past few decades, and a harbinger of problems to come for Intel and AMD.
Now Read:
- The New Apple M1 Reviews Put AMD, Intel Officially on Notice
- Why Apple’s M1 Chip Could be a Real Threat to Intel and AMD
- New Mac Teardowns Show Apple’s M1 Engineering Under the Hood