Monday, 10 February 2020

From 4.3GHz All-Core Overclocking to SMT Scaling: A Comprehensive Review of the AMD Threadripper 3990X

AMD has spent the last three years rewriting the rules of desktop performance. On Friday, the microprocessor manufacturer launched the Ryzen Threadripper 3990X, the world’s first single-socket 64-core CPU. I’ve already written a teaser for this article and gone over some of my early thoughts on the CPU, but here’s where we dig into the data on the chip and see what the results can tell us.

Under the right circumstances, the Ryzen Threadripper 3990X offers incredible performance. In unoptimized workloads, its performance is flat, or even declines, relative to the Ryzen Threadripper 3970X. I’ve spent a great deal of time putting the CPU through its paces, looking for scenarios where it succeeds and analyzing where it falls flat. I also threw it outside in 12-degree Fahrenheit (-11C) air and overclocked it using the Asus Zenith II Extreme, just for fun. We’ll talk about that, too. I may have lost the world record to an enterprising individual with liquid nitrogen, but the CB20 score I hit would still qualify this system for second place according to HWBot.

I’m going to assume you’re aware of the Threadripper 3990X and have read our 3970X review, plus the 3990X launch discussion from last week.

The Windows Thread Scheduler

One issue affecting our Windows 10 results is the fact that the OS doesn’t scale well above 64 threads. Windows splits CPU workloads into processor groups, with up to 64 logical processors assigned to each group. Some applications provide their own thread schedulers, but applications that don’t are often capped at ~50 percent CPU usage on the 3990X. Linux scaling is generally better; Rob Williams at Techgage has more data on this. 3D rendering applications, which put up the strongest scaling figures in our tests, are easily the best use-category for the 3990X. There have been reports that Windows 10 Enterprise may offer better scaling, but our contacts at AMD indicated there’s no reason to run Windows 10 Enterprise on a 3990X. The official guidance from AMD is that Windows 10 Pro is enough.
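To make that processor-group ceiling concrete, here’s a minimal Win32 sketch. This is my illustration, not guidance from AMD or Microsoft; the group number and mask are example values:

```cpp
#include <windows.h>
#include <cstdio>

// Minimal sketch of the processor-group API an application must use to
// see all 128 threads of a 3990X under Windows 10.
int main() {
    WORD groups = GetActiveProcessorGroupCount();
    printf("Processor groups: %u\n", groups);
    for (WORD g = 0; g < groups; ++g) {
        printf("  Group %u: %lu logical processors\n",
               g, GetActiveProcessorCount(g));
    }

    // By default, every thread in a process lands in a single group. To
    // light up the whole chip, an app must pin worker threads into the
    // other group explicitly.
    GROUP_AFFINITY affinity = {};
    affinity.Group = 1;     // example: the second group; fails on 1-group systems
    affinity.Mask  = ~0ULL; // any logical processor within that group
    if (!SetThreadGroupAffinity(GetCurrentThread(), &affinity, nullptr)) {
        printf("SetThreadGroupAffinity failed: %lu\n", GetLastError());
    }
    return 0;
}
```

Applications that never make calls like these are the ones stuck at roughly 50 percent utilization in Task Manager.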

The (Lack of) Intel Competition

We reached out to Intel to ask whether the company would provide Xeon server CPUs to benchmark against the 3990X, but it opted not to sample against the AMD CPU. While these comparisons wouldn’t have aligned on price, they would have allowed us to compare top-end solutions from both companies. Without the option to draw on Xeon, our comparison vehicle was limited to the Core i9-10980XE.

At ~$1000, the 10980XE cannot be considered fair competition for the $4000 3990X, but it’s the closest Intel CPU we have, and I wanted to give some indication of how it stacked up. Because different applications have such different responses to high core count CPUs, there are cases where the 10980XE matches or outperforms the 3990X. More cores are not always better, even today.

Results Formatting, Test Setup

This review has a larger footprint than my typical coverage, and I’ve subdivided the results into several categories. Our standard suite of tests compares the top Threadripper CPUs and the Core i9-10980XE (along with the Ryzen 9 3950X where possible) in a wide range of applications. The next section examines SMT scaling on the 3990X specifically, including evaluations in applications like DaVinci Resolve using the Puget Systems Extended Benchmarks. Finally, the overclocking section discusses our OC results, with a little help from Mother Nature.

All testbeds were equipped with 64GB of DDR4-3600 in four sticks. XMP was enabled on both the AMD and Intel systems, but the Cascade Lake-based Intel platform was limited to a DDR4-3200 memory clock rather than DDR4-3600. Both systems were benchmarked with a Corsair MP600 SSD, though the AMD system ran the drive in PCIe 4.0 mode, while the Intel rig was limited to PCIe 3.0.

An RTX 2080 was used to provide GPU testing in all cases, with Nvidia GeForce Game Ready Driver 442.19. The latest UEFI images were loaded on all motherboards.

A few application-specific notes before we get started: I’ve included MATLAB results here. MATLAB favors Intel by default, for reasons we discuss in far more detail in a separate article; in short, its math backend, Intel’s MKL, falls back to a slower code path when it detects a non-Intel CPU. I’ve benchmarked the AMD CPUs with this “Cripple AMD” behavior both enabled (the default) and disabled.
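For readers who want to replicate the “disabled” configuration: at the time of writing, the widely documented workaround is the MKL_DEBUG_CPU_TYPE=5 environment variable, which forces MKL onto its fast AVX2 path regardless of CPU vendor. The launcher below is a minimal sketch of one way to set it; the MATLAB install path is a placeholder, and setting the variable system-wide or in a shell works just as well:

```cpp
#include <cstdlib>

// Minimal launcher sketch: export MKL_DEBUG_CPU_TYPE=5 so that MKL,
// MATLAB's math backend, uses its AVX2 code path on AMD hardware.
// The MATLAB path below is a placeholder, not a tested configuration.
int main() {
    _putenv_s("MKL_DEBUG_CPU_TYPE", "5");  // inherited by the child process
    return system("\"C:\\Program Files\\MATLAB\\R2019b\\bin\\matlab.exe\"");
}
```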

We’ve also added a significant prosumer/workstation workload. Puget Systems distributes its own extended benchmark suite for various applications, including DaVinci Resolve. These tests don’t kid around: the 8K DaVinci Resolve suite requires a GPU with up to 20GB of VRAM, which precluded us from testing it. The free trial of DaVinci Resolve Studio was used, which Puget says does impact the performance of one H.264 benchmark, but the results we present are accurate relative to that version of the application, and the same workload ran in software on all of the CPUs we tested. I have data from Agisoft Metashape as well, but I realized late on Sunday that I need to re-check those results.

Arnold Render CPU Benchmark – Antonio Bosi

Special thanks to Antonio Bosi, who designed our Maya 2020 Arnold Render CPU benchmark by modifying a scene from his existing test suite. Antonio maintains a site with a number of Arnold Render tutorials, personal art, and 3D models for download. The tweaked version of the standard Fast benchmark scaled about four percent better than the default scene and, at 1.4x over the 3970X, delivered our strongest render uplift between the two CPUs.

We also downloaded a number of Blender scenes for test rendering, all of which were used in the Blender open movie “Spring.” Two screenshots of representative animation frames are shown above and below:

Test Results

Tests included: 7zip, Blender (stand-alone benchmark and full application), Cinebench R15 and R20, Handbrake 1.3.1, Indigo Bench, Maya 2020, Neat Bench, POV-Ray 3.7, and a Qt compile benchmark using MSVC 2019.

We see two different performance profiles in these results. In some applications, generally rendering applications, the 3990X is 1.3x to 1.4x faster than the 3970X. Some tests, like V-Ray, predict even stronger scaling. We’ve benchmarked a range of rendering applications to demonstrate that, in many cases, existing software does take advantage of these capabilities. In some cases, like Arnold Render, the 3990X even comes close to proving cost-effective against the 10980XE, which takes 3.63x longer to render our test scene and costs 25 percent as much. We’ll also examine a renderer that doesn’t scale in our SMT section.
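To put a number on “comes close,” here’s a quick back-of-the-envelope throughput-per-dollar comparison. It’s a sketch that uses only the approximate launch prices and the 3.63x render-time figure above; street prices will shift the result:

```cpp
#include <cstdio>

int main() {
    // Figures from this review: the 10980XE takes 3.63x as long as the
    // 3990X on our Arnold scene, at roughly one quarter of the price.
    const double price_3990x   = 4000.0; // USD, approximate
    const double price_10980xe = 1000.0; // USD, approximate
    const double time_ratio    = 3.63;   // 10980XE time / 3990X time

    // Render throughput per dollar, normalized so the 3990X = 1.0.
    const double ppd_3990x   = 1.0 / price_3990x;
    const double ppd_10980xe = (1.0 / time_ratio) / price_10980xe;

    printf("10980XE throughput-per-dollar vs. 3990X: %.2fx\n",
           ppd_10980xe / ppd_3990x); // ~1.10x: Intel barely holds the
                                     // perf/$ edge, hence "comes close."
    return 0;
}
```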

Outside of rendering applications, the 3970X is generally a better choice, especially given that it’s more cooperative with the regular version of Windows 10. Inside rendering applications, particularly at the professional level, the improved performance might be worth it to creatives with cash to burn.

I’ve fallen back to using slideshows for this article because of the amount of data, but a few of our results don’t fit well in that format for various reasons. Our MATLAB benchmark was provided by Intel for testing the Core i9-10980XE’s performance. The table below summarizes MATLAB performance on AMD hardware versus Intel.

MATLAB is an application that doesn’t scale past 64 threads on Windows 10 Pro. As a result, the 3990X is slower than the 3970X with SMT enabled, because its baseline clocks are lower even in 32-core mode. We’ve seen several examples of this pattern.

The Blender Benchmark 1.0 Beta 2 is based on an older version of Blender (2.79) and runs more slowly than the current release (2.81). It also crashes on the 10980XE for unknown reasons (the 10980XE doesn’t have this problem in the actual application). Blender is a solid win for the 3990X, and while we don’t see our best scaling in this renderer, the 3990X renders between 25 and 35 percent faster than the 3970X in these scenes.

SMT Scaling

Our standard test suite explored performance scaling between the 3970X and 3990X, generally finding that the 3970X is the stronger option for the typical user, though the 3990X delivers real gains in certain workloads. Now we’re going to look at the SMT scaling question under Windows 10.

Tests included: Blender, DaVinci Resolve (Puget Systems), Keyshot 9, Maya 2020, Maxwell Render 4.2.

There are only a handful of applications that show performance declines when SMT is enabled, but Maxwell Render 4.2 definitely does. Unfortunately, the “Benchwell” scene included in Maxwell Render 4 will no longer load in Maxwell Render 5, so I was unable to test whether the same problem occurs in the newest version of the renderer. Turning SMT off gives the 3990X a win over the 3970X, but not one large enough to justify the cost of the CPU.

Other tests, however, showed consistent gains from enabling SMT. We rendered an extensive series of Blender scenes (Junk Shop, Spring, Agent 327, and Mr. Elephant), which vary individually in their response to SMT, but the general performance improvement in the professional-level renders available via Blender Cloud is a very consistent 1.3x to 1.4x. I ran a number of additional test renders that don’t appear here, just to keep the data set manageable. Keyshot scales a bit less, at roughly 1.2x to 1.25x.

In DaVinci Resolve 16, Intel doesn’t win outright, but it does win the price/performance category. The 3990X scales modestly over the 3970X, and I don’t know that many people would pay 4x the Intel chip’s price for a 1.16x performance gain. It may well come down to which specific codec and media settings you edit with, since the 3990X does show larger gains over the 10980XE in a few specific tests.

Intel, therefore, definitely still makes a case for its own utility and relevance in these workloads. The 3970X and 3990X have carved out real territory for themselves, but they aren’t a slam dunk in every situation.

Overclocking Performance

Seventeen years ago, literally to the day, on February 10, 2003, AMD launched the Athlon XP 2500+, 2800+, and 3000+. To overclock the 3000+ (and because we all knew at the time that it couldn’t match the Pentium 4 at default clocks), I stuck it outside to OC it and pushed the CPU up to 2.6GHz, 1.2x over stock. I never did it again until this past weekend. This article was originally supposed to run on Friday, but I’m tickled that it’s actually running today; having the dates align this perfectly is icing on the cake. The 3990X is a rather good overclocking CPU, if my single sample is anything to go by.

How’d I do it? Simple. I stuck the entire system outside in 12F / -11C air. I had actually experimented with testing the system inside, putting the CPU radiator and fan assembly up against a window screen so the cooler could draw directly on outside air. This worked, to a point, but it didn’t provide the cooling I wanted. Solution: outdoor overclocking.

Besides, that sounds better than “Bathroom overclocking.”

Who has two thumbs and is unhappy that idea didn’t work? Me, that’s who.

I started my testing at an all-core 3.7GHz and 1.4v, but this was too much voltage for the Asus Zenith II Extreme; the motherboard’s overcurrent protection would trip halfway through stress tests. Lowering the voltage to 1.35v produced better results. I ran the complete Blender Benchmark 1.0 Beta 2 suite at 3.6, 3.7, 3.8, and 3.9GHz all-core, lowering the voltage at each step.

At 3.9GHz and 1.33v, I decided to leap. Since I wrote my article on Friday, the world record has been claimed by someone with a 5.3GHz overclock; I knew I wasn’t going to hit that, but I thought I just might manage to take second place. I knew that an all-core 4GHz wouldn’t break the 32K mark, which is where I needed to be to beat the (now second-highest) score. I dialed in 4.1GHz and she POSTed… but my score was still too low.

At this point, it’s about 1:30 AM on Sunday morning. Anyone driving past would have observed a remarkable sight: a brilliant star (because of course both the motherboard and RAM are LED-equipped) shining in my front yard. Had they come closer, they would have marveled at the sturdy, unassuming, and unexpectedly valuable nightstand stoutly holding about $6000 in computer equipment clear of everything computer equipment is never supposed to touch.

I considered this. I contemplated the wisdom of testing extremely expensive hardware at night, in the open air. I thought about snow and wind, and noted that I was running markedly less voltage than I had used to maintain a stable 3.7GHz OC.

If you want to be good at overclocking, you have to understand it as an art. Systems don’t just become randomly unstable; there’s an order and a hierarchy. Systems need to POST, then boot, then run benchmarks. The longer and more rigorous the test you can run, the greater the chance the overclock is stable. I knew the 3.7GHz all-core was stable enough to run through a fair number of tests, and that 3.9GHz had been stable through Blender. I also knew I couldn’t be far from tripping the motherboard’s overcurrent protection. The CPU idled at 6C in the frozen wasteland of my… front yard, but under load, she was already hitting 67C. Overclocked CPUs are often far more thermally sensitive than their stock-clocked counterparts, and 67C was already more than I was comfortable with.

Any time you push a CPU to the outer edge of the envelope (and here, that could mean anything from a stock cooler to LN2), you’re dancing on the head of a pin. It’s a gamble that you can tune the CPU just enough to eke out a test result without impacting performance in a way that kills the net effect of your improvement.

CPU power dissipation increases roughly linearly with clock speed, but it increases much more strongly with voltage: dynamic power scales with the square of the voltage. I gambled that I’d reduced the VRM load enough that she could handle a 4.3GHz all-core at 1.3275v, even though I’d seen the machine hard-off at 4GHz and 1.4v.
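For readers keeping score at home, the classic first-order model for dynamic CPU power is P ≈ C·V²·f. Here’s a quick sketch of why dropping to 1.3275v bought enough headroom for the extra 300MHz, plugging my operating points into that textbook formula:

```cpp
#include <cstdio>

// First-order dynamic power model: P ~ C * V^2 * f. The capacitance
// term C cancels out when comparing two operating points on one chip.
double relative_power(double v_old, double f_old, double v_new, double f_new) {
    return (v_new * v_new * f_new) / (v_old * v_old * f_old);
}

int main() {
    // 4.0GHz @ 1.4v (hard-off) vs. 4.3GHz @ 1.3275v (the gamble).
    double ratio = relative_power(1.4, 4.0, 1.3275, 4.3);
    printf("Relative power: %.0f%%\n", ratio * 100.0);
    // Prints ~97%: a roughly 5 percent voltage cut more than pays for
    // the extra 300MHz, at least as far as this simple model goes.
    return 0;
}
```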

Keep in mind, we’re talking about a CPU clock that’s about 1.26x higher than where I’d estimate the 3990X’s all-core clock sits on a regular basis, and we’re talking about doing it on 64 CPU cores at once. All-core 4.3GHz across 64 cores = 275.2GHz. 0.275THz. Yeah, it starts with a decimal. Don’t care.

I felt like Han Solo reaching for the hyperspace levers on the Millennium Falcon, if Han had been the biggest damn nerd on Earth. My hands were freezing, my ears were numb, my front yard sounded like a hair dryer, and the rig had already spooked a dog walking by. It was time to see what she had. I typed “43,” hit “Save and Exit,” and crossed my fingers. She POSTed. Booted.

Took a second-place world record in Cinebench R20 and in Cinebench R15, though I’m still working on the submission process to HWBot.

I’ve been a reviewer for 18.5 years. I’ve tested systems valued at over $10,000. I’ve never tested a CPU that could be called the second-fastest at anything on the planet. Even knowing that my own record will soon be broken by 3990X owners with LN2 and exotic cooling setups that don’t rely on the weather, even knowing that the result was simply in a benchmark, there’s something undeniably cool about that.

I don’t think people are going to get 4.3GHz overclocks out of the 3990X on a regular basis, but lower clocks seem eminently possible. The results above show that they can provide a sustained benefit, and the voltage required to maintain an all-core 3.7GHz is clearly well below 1.3275v, given that I used that same voltage to hit 4.3GHz. I’ll leave it to manufacturers like Boxx to figure out what the possibilities are, but these results imply they might be good.

In the right workloads, for the right buyer, overclocking the 3990X could make good sense. My performance improved by 10 to 17 percent moving from stock to a 3.7GHz all-core clock.

Conclusion

The 3990X is not the CPU for everyone. It doesn’t scale well enough to objectively justify its price, unless you shop in markets where price is no object. Even assuming better scaling from Linux or Enterprise Windows, it’s unlikely that enough applications would benefit to make the chip an objective improvement for many buyers.

All of this is completely normal for products at the top of a product stack. The Intel Xeon W-3265 is a 24-core chip at 2.7GHz base / 4.4GHz boost. The Xeon W-3275 is a 28-core CPU at 2.5GHz base / 4.4GHz boost. The W-3265 costs $3349; the W-3275 is $4449. That’s a 1.32x price increase for a 1.17x increase in core count. The Xeon Platinum 8280 is a $10,009 CPU with 28 cores; the Xeon Platinum 8270 is a $7405 CPU with 26 cores. Nobody blinks when Intel prices parts this way, even though there’s no workload on Earth where the 8280 will deliver a reasonable uplift over the 8270 from just two extra cores.

But the 3990X isn’t trying to be all things to all people. It’s the laurel wreath. It’s a victory lap. The 3970X is the CPU that’s actually intended to go toe-to-toe with what Intel has to offer; it’s the 3990X that clinches the deal, for the AMD customer for whom money is no object.

As for the significance of that? This is the first time in 15 years that AMD has had a product that competed for the “money is no object” segment in the first place. You have to go back to the days of dual-core Opteron and Athlon 64 FX, when AMD was facing off against Prescott and Smithfield, to find a time when AMD was confident enough of its endgame to launch a part in this kind of position. Other reviewers, with access to more expensive Xeons than I have, have confirmed that AMD wins benchmarks against $20K worth of Xeon CPUs in multiple areas. That’s the kind of performance disparity that can make even the “money is no object” crowd sit up and take notice.

Well played.
