Thursday, 19 August 2021

Alder Lake Extravaganza: Intel Unloads Details on its Next-Gen CPU

This week, Intel shared significant details on its Alder Lake CPU family, giving us far more information than we’ve previously had about the CPU’s core designs, performance, and expected power efficiency.

This is a critical launch for Intel. The manufacturer’s desktop CPUs have been stuck on 14nm for over six years now, and the cracks in that process node have been showing for at least two. Rocket Lake is currently competitive with AMD at the midrange and lower-end of the market as long as you don’t care about power efficiency, but AMD has an advantage at the high end. Alder Lake is intended to change that. Although Pat Gelsinger hasn’t been CEO of Intel long enough to have had much input into the design, it’s still the first major launch of his tenure and the first product built on Intel’s next iteration of its 10nm node.

After six years stuck on 14nm, Intel needs to demonstrate that it can recapture process and performance leadership. No one expects this to happen overnight, but Gelsinger has bet Chipzilla’s business model on the outcome. Instead of pivoting towards pure-play foundry partnerships and away from building its own hardware, as some activist investor firms wanted, Intel has chosen to make a play for both sides of the foundry business simultaneously. It will continue to manufacture its own hardware and it will offer foundry services and license x86 core designs to companies that wish to purchase either.

Intel has little interest in being a second-tier foundry or in pursuing commodity manufacturing contracts on low-cost chips, and the capital-intensive nature of its business likely precludes such a strategy in any case. Because it builds its own chips, Intel can leverage Alder Lake as proof of improved competitiveness, provided the CPU actually delivers on that promise. Alder Lake is also the first x86 CPU to take a page from Apple’s book and deploy both “big” and “little” cores.

Meet Gracemont

Alder Lake is a hybrid CPU containing two different types of core. The Efficiency cores are based on Gracemont, Intel’s low-power architecture descended from the original Atom of 2008, while the Performance cores are based on a new architecture, Golden Cove. Both are new designs and represent Intel’s latest small-core and big-core architectures. Let’s talk about Gracemont first:


Gracemont retains some design elements in common with Tremont. Both CPU cores offer a dual 3-wide decoder unit, but Gracemont doubles up on instruction cache (64KB). This is the second time Intel has increased the L1 instruction cache; Tremont bumped from 24KB to 32KB a few years ago. The CPU contains Intel’s first on-demand instruction length decoder and a large increase in the total number of execution ports, from 10 to 17. Like Tremont, Gracemont lacks Hyper-Threading and is a single-thread CPU core.

According to Intel: “An on-demand instruction length decoder decodes instruction data to determine where instructions begin and end. The output is then used to either steer the instruction data to the decoders, or it can be saved along with the instruction bytes parallel to the instruction cache to mark the beginning/end on future fetch and decode.” This sounds like a feature built to cope with x86’s variable-length instructions: by recording where instruction boundaries fall the first time a block of code is decoded, the CPU can avoid re-deriving them on later fetches of the same code, saving both power and decode effort.
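To make that concrete, here’s a rough sketch in C of the boundary-marking idea. The toy_insn_length() helper and the encoding it assumes are invented for illustration (real x86 length decoding has to parse prefixes, opcodes, ModRM/SIB bytes, displacements, and immediates), but the principle of computing instruction boundaries once and caching the markers for later fetches is the same.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define LINE_BYTES 16   /* shortened "cache line" for the demo */

    /* Toy stand-in for x86 length decoding: here the top two bits of an
     * instruction's first byte encode a length of 1-4 bytes. */
    static int toy_insn_length(uint8_t first_byte)
    {
        return (first_byte >> 6) + 1;
    }

    /* On-demand pass: walk a fetched line once and record which byte offsets
     * begin an instruction. The markers can be stored alongside the line so
     * later fetches can steer the decoders without redoing this work. */
    static void mark_boundaries(const uint8_t line[LINE_BYTES], bool starts[LINE_BYTES])
    {
        for (int i = 0; i < LINE_BYTES; i++)
            starts[i] = false;
        for (int pos = 0; pos < LINE_BYTES; pos += toy_insn_length(line[pos]))
            starts[pos] = true;
    }

    int main(void)
    {
        uint8_t line[LINE_BYTES] = { 0x40, 0x00, 0x80, 0x00, 0x00, 0xC0 };
        bool starts[LINE_BYTES];

        mark_boundaries(line, starts);
        for (int i = 0; i < LINE_BYTES; i++)
            if (starts[i])
                printf("instruction begins at offset %d\n", i);
        return 0;
    }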

Gracemont can issue five instructions and retire eight per cycle, where Tremont could issue four and retire eight, and it can resolve two branches per clock cycle. Intel has not gone into a great deal of detail regarding when Gracemont can actually decode and use all six instructions per clock; the chip has dual 3-wide decoders, not a unified 6-wide solution. When Tremont launched, however, Intel claimed that dual three-wide decoders saved power and die space compared with a large micro-op cache or a unified six-wide decoder.

According to Intel, “four Efficient-cores offer 80 percent more performance while still consuming less power than two Skylake cores running four threads or the same throughput performance while consuming 80 percent less power.” Intel also claims that Gracemont can deliver 40 percent more single-threaded performance than Skylake in the same power envelope or identical performance in less than 40 percent the power.

One thing to keep in mind when evaluating these claims is that Intel does not give a reference clock speed or TDP. The large efficiency advantages over Skylake could be explained partly by that CPU’s weak performance in the TDP ranges Gracemont is designed to serve. The Core i3-6100U had a configurable TDP down of 7.5W and a clock speed of 800MHz at that TDP. If Intel is comparing within low TDP ranges and clocks, it would explain the tremendous efficiency improvement.
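A quick back-of-the-envelope illustrates why the reference point matters. The numbers below are normalized placeholders (Skylake = 1.0 performance at 1.0 power), not Intel figures; they simply restate the two endpoints of the single-threaded claim.

    #include <stdio.h>

    int main(void)
    {
        /* Normalized placeholders, not Intel numbers: Skylake = 1.0 performance
         * at 1.0 power. */
        double sky_perf = 1.0, sky_power = 1.0;
        double sky_perf_per_watt = sky_perf / sky_power;

        /* Claim 1: 40 percent more performance at the same power. */
        double iso_power = (sky_perf * 1.40 / sky_power) / sky_perf_per_watt;

        /* Claim 2: the same performance at less than 40 percent of the power. */
        double iso_perf = (sky_perf / (sky_power * 0.40)) / sky_perf_per_watt;

        printf("implied perf/W gain at iso-power: %.2fx\n", iso_power);  /* 1.40x */
        printf("implied perf/W gain at iso-perf:  >%.2fx\n", iso_perf);  /* >2.50x */
        return 0;
    }

The two endpoints imply very different performance-per-watt gains (1.4x versus more than 2.5x), which is exactly why the missing clocks and TDPs matter: power-performance curves are nonlinear, and the comparison point Intel chose determines how impressive the number looks.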

Gracemont has a shared L2 cache, with each four-core cluster sharing up to 4MB of L2 at a 17-cycle latency, and it supports AVX, AVX2, and AVX-VNNI. AVX-VNNI is a 256-bit (VEX-encoded) version of the VNNI instructions introduced with AVX-512, but Intel is not claiming full AVX-512 support, and there are multiple AVX-512 instructions that Gracemont cannot execute. Those workloads will be handled by Golden Cove.
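The VNNI instructions Gracemont does support revolve around a fused multiply-accumulate on packed bytes. As a rough scalar model (not a real intrinsic, and simplified relative to the actual instruction definitions), one 32-bit lane of a VPDPBUSD-style operation multiplies four unsigned bytes by four signed bytes and accumulates the sum into a 32-bit integer:

    #include <stdint.h>
    #include <stdio.h>

    /* Scalar model of one 32-bit lane of a VPDPBUSD-style VNNI operation:
     * four unsigned 8-bit values times four signed 8-bit values, summed and
     * accumulated into a 32-bit integer. A 256-bit AVX-VNNI register holds
     * eight such lanes; this only shows the per-lane math. */
    static int32_t vnni_lane(int32_t acc, const uint8_t u[4], const int8_t s[4])
    {
        for (int k = 0; k < 4; k++)
            acc += (int32_t)u[k] * (int32_t)s[k];
        return acc;
    }

    int main(void)
    {
        uint8_t activations[4] = { 10, 20, 30, 40 };
        int8_t  weights[4]     = { 1, -2, 3, -4 };
        printf("%d\n", vnni_lane(0, activations, weights)); /* 10 - 40 + 90 - 160 = -100 */
        return 0;
    }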

We’ve tucked some of Intel’s additional slides into the slideshow below if you’d like more information on Gracemont. You can click on each slide to open it, full-size, in a new window.

Greet Golden Cove

The mantra for Golden Cove development, according to Intel, was “Wider, Deeper, Smarter,” and that’s a good way of summarizing the various improvements to the CPU. Golden Cove is descended from the Willow Cove core inside Intel’s Tiger Lake CPUs, but it contains a significant number of upgrades and improvements over that design.

“Wider, Deeper, Smarter” apparently beat out “Wider, Deeper, Faster.” Can’t imagine why.

Golden Cove increases the number of front-end decoders to six, up from four, and it expands Intel’s iTLBs significantly. The CPU can now fetch 32 bytes per cycle for decode, up from 16, and the micro-op queue is slightly deeper, at 72 entries per thread versus 70, while the micro-op cache can hold 4K micro-ops, up from 2.25K. The micro-op cache hit rate and front-end bandwidth have both been improved.

There are now 12 execution ports, up from 10, with a deeper reorder buffer (512 entries, up from 352 in Sunny Cove/Willow Cove). The L1 cache now supports three load ports, up from two, and can handle 3×256-bit loads or 2×512-bit loads in a single cycle. The L1 data cache is now 96KB (Willow Cove packed 64KB), with 16 prefetchers and the ability to support four page table walks, up from two.
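One hedged way to think about the larger reorder buffer: by a rough Little’s-law argument, a window of N micro-ops sustaining W micro-ops per cycle can hide a stall of roughly N/W cycles. The width used below is an illustrative assumption, not an Intel figure.

    #include <stdio.h>

    /* Rough Little's-law estimate: a reorder buffer of N entries sustaining
     * W micro-ops per cycle can hide a stall of roughly N / W cycles before
     * the out-of-order window fills up. */
    int main(void)
    {
        double width = 6.0;  /* assumed sustained allocation width, illustrative only */
        printf("Willow Cove, 352-entry ROB: ~%.0f cycles of latency hidden\n", 352 / width);
        printf("Golden Cove, 512-entry ROB: ~%.0f cycles of latency hidden\n", 512 / width);
        return 0;
    }

On that crude estimate, the deeper buffer buys Golden Cove a few dozen additional cycles of latency tolerance, which is the point of growing these structures.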

Golden Cove will offer either 1.25MB of L2 for client computing (flat versus Tiger Lake) or 2MB in data center applications. It also supports Intel’s new Advanced Matrix Extensions (AMX), which Intel claims deliver a mammoth increase in AI performance. Using VNNI, an Intel CPU can perform 256 INT8 operations per cycle; AMX allows the same chip to execute 2,048 INT8 operations per cycle.
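AMX works on two-dimensional tile registers and performs matrix-multiply-accumulate with INT8 inputs and INT32 accumulation. The scalar sketch below only models the arithmetic being accelerated; the tile dimensions are illustrative placeholders (real AMX tiles are configurable, up to 16 rows of 64 bytes), and the actual instructions use their own register file and data layout.

    #include <stdint.h>
    #include <stdio.h>

    #define M 4  /* illustrative tile sizes, not the hardware limits */
    #define K 8
    #define N 4

    /* Scalar model of the kind of math an AMX tile multiply performs:
     * C (int32) += A (uint8) x B (int8). */
    static void tile_matmul(int32_t C[M][N], const uint8_t A[M][K], const int8_t B[K][N])
    {
        for (int m = 0; m < M; m++)
            for (int n = 0; n < N; n++)
                for (int k = 0; k < K; k++)
                    C[m][n] += (int32_t)A[m][k] * (int32_t)B[k][n];
    }

    int main(void)
    {
        uint8_t A[M][K];
        int8_t  B[K][N];
        int32_t C[M][N] = { { 0 } };

        for (int m = 0; m < M; m++)
            for (int k = 0; k < K; k++)
                A[m][k] = (uint8_t)(m + k);
        for (int k = 0; k < K; k++)
            for (int n = 0; n < N; n++)
                B[k][n] = (int8_t)(k - n);

        tile_matmul(C, A, B);
        printf("C[0][0] = %d\n", C[0][0]);
        return 0;
    }

Counting each multiply and add as separate operations, Intel’s 2,048 figure works out to 1,024 INT8 multiply-accumulates per cycle per core, an eight-fold increase over the 256-ops-per-cycle VNNI path.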

This could substantially improve Intel’s CPU-based AI performance in relevant applications, though the usual caveats about SIMD adoption and optimization apply. It may be a few years before AMX is much use in commercial applications, but the performance gains imply Intel CPUs may be a reasonable alternative to Nvidia GPUs for certain AI and machine learning-related tasks. CPUs can already perform AI inference workloads at reasonable speeds, so it’ll be interesting to see if this improves CPU performance in training AI models or if it merely makes them more competitive in inference.

Add all of this up and here’s what you get:

This slide is actually a bit misleading, in my opinion, but not in a way that favors Intel. For once, using a non-zero starting point actually makes Intel look worse, not better. The performance gap between Rocket Lake and Alder Lake in the worst-performing sub-test is ~92 percent at the far left-hand side of the graph, while Alder Lake is reportedly up to 1.6x faster in a handful of tests. The median gain is 1.19x, according to Intel.

While Intel took a lot of heat for its failure to deliver new process nodes over the past six years, a 1.19x performance increase from a new product generation is respectable. Rocket Lake increased IPC compared with Comet Lake, but Intel had to trade back cores to make the TDP work. As a result, an eight-core RKL and a 10-core CML are broadly similar in many applications. Alder Lake combines up to eight Golden Cove cores and 16 GC threads with up to eight Gracemont cores (1T each), for a grand total of 16 cores and 24 threads in a top-end SKU.

If you’d like to check out Intel’s additional Golden Cove slides, we’ve compiled them into a second slideshow below.

 

Note: After rebranding 10nm several times, Intel has settled on a new nomenclature for its process nodes. Alder Lake is built on Intel 7 (without a “nm” suffix). Intel 7 is still a 10nm node — it would’ve been branded “Enhanced SuperFin” under the old nomenclature — but Intel claims a 10-15 percent improvement in performance per watt and various FinFET transistor optimizations. More information about Intel’s long-term node update plans can be found here.

Making It All Work Together

Shuffling workloads between small and large cores requires additional support. Intel has built improved hardware scheduling support into its chips in the form of a feature dubbed Thread Director. Thread Director monitors the CPU and helps ensure that each workload ends up on the appropriate core.

While it’s difficult to show the demos Intel gave us or to evaluate them without being hands-on, the company gave an example of how Thread Director would distribute multiple threads across the Performance and Efficiency cores. In the image below, green tasks are scalar workloads, orange tasks represent a new AI workload that just launched, and blue tasks are background tasks.

Under appropriate conditions, the CPU will schedule workloads across both the Performance and Efficiency cores. Intel did not disclose how likely this was to occur under real-world conditions or what kind of performance boost it expected the P cores to gain from leveraging the additional throughput available from Gracemont.

Intel developed Thread Director in cooperation with Microsoft, and Alder Lake will run best under Windows 11, though the chip also supports Windows 10. Prior to the introduction of Thread Director, the operating system scheduler had no insight into the characteristics of the threads it was running or which type of core they should be scheduled on. According to Intel, Thread Director closes this gap and provides more information to the OS regarding scheduling. The chip is also able to inform workload scheduling decisions with microsecond fidelity, and scheduling is more fine-grained than it was prior to Windows 11.
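As a purely conceptual illustration (the enum values and the choose_core() policy below are invented for this example; the real mechanism is hardware telemetry consumed by the Windows 11 scheduler, not a public API), the division of labor can be thought of as the hardware classifying threads and the OS mapping those classes onto core types:

    #include <stdio.h>

    /* Invented thread classes and policy, loosely mirroring the green/orange/blue
     * workloads in Intel's demo slide. Not a real Thread Director interface. */
    enum thread_class { CLASS_BACKGROUND, CLASS_SCALAR, CLASS_VECTOR_AI };
    enum core_type    { CORE_EFFICIENCY, CORE_PERFORMANCE };

    static enum core_type choose_core(enum thread_class cls, int p_cores_busy)
    {
        if (cls == CLASS_BACKGROUND)
            return CORE_EFFICIENCY;        /* keep background work off the P cores */
        if (cls == CLASS_VECTOR_AI)
            return CORE_PERFORMANCE;       /* vector/AI work benefits most from Golden Cove */
        return p_cores_busy ? CORE_EFFICIENCY : CORE_PERFORMANCE;  /* scalar work spills over */
    }

    int main(void)
    {
        printf("background thread      -> %s\n",
               choose_core(CLASS_BACKGROUND, 0) == CORE_PERFORMANCE ? "P core" : "E core");
        printf("AI thread              -> %s\n",
               choose_core(CLASS_VECTOR_AI, 1) == CORE_PERFORMANCE ? "P core" : "E core");
        printf("scalar, P cores busy   -> %s\n",
               choose_core(CLASS_SCALAR, 1) == CORE_PERFORMANCE ? "P core" : "E core");
        return 0;
    }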

Tests indicated that Intel’s previous hybrid CPU, Lakefield, could pick up 5-6 percent under Windows 11 versus Windows 10. Lakefield lacks Thread Director, so we’re curious to see what the Alder Lake delta will look like between the two operating systems.

While these aspects of the system were already known, Alder Lake will introduce PCIe 5.0 support and scale across 9W to 125W TDP envelopes. ADL offers 1×16 PCIe 5.0 lanes attached to the CPU, a single x4 PCIe 4.0 connection, and 16 lanes of PCIe 3.0 and 4.0 via the southbridge. Motherboard vendors will likely have the option of running a single x16 PCIe 5.0 slot when one GPU is installed, or splitting those lanes into 2×8 PCIe 5.0 links when more than one device is attached. An x8 PCIe 5.0 link provides the same amount of bandwidth as an x16 PCIe 4.0 connection, so there ought to be no bandwidth penalty in either configuration, even in demanding workloads.
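The bandwidth math is easy to check. PCIe 4.0 signals at 16 GT/s per lane and PCIe 5.0 at 32 GT/s, both with 128b/130b encoding, so an x8 Gen 5 link and an x16 Gen 4 link deliver essentially identical per-direction throughput:

    #include <stdio.h>

    /* Per-direction link bandwidth in GB/s: lanes x transfer rate (GT/s)
     * x 128b/130b encoding efficiency / 8 bits per byte. */
    static double pcie_bandwidth_gbps(int lanes, double gt_per_s)
    {
        return lanes * gt_per_s * (128.0 / 130.0) / 8.0;
    }

    int main(void)
    {
        printf("PCIe 4.0 x16: %.1f GB/s\n", pcie_bandwidth_gbps(16, 16.0)); /* ~31.5 GB/s */
        printf("PCIe 5.0 x8:  %.1f GB/s\n", pcie_bandwidth_gbps(8, 32.0));  /* ~31.5 GB/s */
        printf("PCIe 5.0 x16: %.1f GB/s\n", pcie_bandwidth_gbps(16, 32.0)); /* ~63.0 GB/s */
        return 0;
    }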

Conclusion

While we can’t draw any conclusions about Alder Lake until we have silicon in hand, the depth and breadth of Intel’s reveal suggest the company feels confident in the final product. A 1.19x IPC uplift is quite good, especially given that RKL managed to mostly tie things up with CML last generation. If an eight-core Rocket Lake can roughly match a 10-core Comet Lake, an eight-core Alder Lake ought to be decisively faster in the majority of tasks.

Intel didn’t share any hard benchmark data or specific performance figures, but its disclosures point towards significant gains in both power efficiency and raw performance. AMD is forecasting that its V-Cache-equipped Zen 3 chips will gain roughly 1.15x in performance, but it has not yet disclosed any additional efficiency or performance-boosting changes for the future CPUs it’ll launch late this year or early in 2022.
