Wednesday, 29 July 2020

Nvidia Crushes New MLPerf Tests, but Google’s Future Looks Promising

So far, there haven’t been any upsets in the MLPerf AI benchmarks. Nvidia not only wins everything, but it is still the only company that even competes in every category. Today’s announcement of the MLPerf Training 0.7 results isn’t much different. Nvidia started shipping its A100 GPUs in time to submit results in the Released category for commercially available products, where it put in a top-of-the-charts performance across the board. However, there were some interesting results from Google in the Research category.

MLPerf Training 0.7 Adds Three Important New Benchmarks

To help reflect the growing variety of uses for machine learning in production settings, MLPerf has added two new training benchmarks and upgraded a third. The first, Deep Learning Recommendation Model (DLRM), involves training a recommendation engine, which is particularly important in eCommerce applications, among other large categories. As a hint to its intended use, it’s trained on a massive trove of click-through-rate data.

The second addition is the training time for BERT, a widely respected natural language processing (NLP) model. While BERT itself has been extended to create bigger and more complex versions, benchmarking the training time of the original is a good proxy for NLP deployments, because BERT is one of a class of Transformer models that are widely used for that purpose.

Finally, with Reinforcement Learning (RL) becoming increasingly important in areas such as robotics, the MiniGo benchmark has been upgraded to MiniGo Full (on a 19 x 19 board), which makes a great deal of sense.

MLPerf Training added three important new benchmarks to its suite with the new release

Results

For the most part, commercially available alternatives to Nvidia either skipped some categories entirely or couldn’t outperform even Nvidia’s last-generation V100 on a per-processor basis. One exception is Google’s TPU v3, which beat the V100 by 20 percent on ResNet-50 and trailed the A100 by only another 20 percent. It was also interesting to see Huawei compete with a respectable entry for ResNet-50, using its Ascend processor. While the company is still far behind Nvidia and Google in AI, it continues to make AI a major focus.

As you can see from the chart below, the A100 delivers 1.5x to 2.5x the performance of the V100, depending on the benchmark:

As usual, Nvidia was mostly competing against itself. This slide shows the per-processor speedup over the V100.

If you have the budget, Nvidia’s solution also scales well beyond anything else submitted. Running on the company’s SELENE SuperPOD, which includes 2,048 A100s, models that used to take days can now be trained in minutes:

As expected, Nvidia’s Ampere-based SuperPOD broke all the records for training times. Note that the Google submission only used 16 TPUs, while the SuperPOD used a thousand or more, so for head-to-head chip evaluation it’s better to use the prior chart with per-processor numbers.
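To make the per-processor point concrete, here is a minimal sketch of how one might normalize a time-to-train result by chip count. The `Submission` class, the helper names, and the numbers are all hypothetical placeholders chosen for illustration; they are not MLPerf results, and this is not how MLPerf itself reports scores.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    name: str
    chips: int               # number of accelerators used in the run
    minutes_to_train: float  # wall-clock time to reach the target quality


def chip_minutes(sub: Submission) -> float:
    """Total accelerator time consumed: chips multiplied by wall-clock minutes."""
    return sub.chips * sub.minutes_to_train


def per_chip_speedup(candidate: Submission, baseline: Submission) -> float:
    """How much less chip-time the candidate needs than the baseline (>1 is better)."""
    return chip_minutes(baseline) / chip_minutes(candidate)


# Hypothetical placeholder numbers, NOT actual MLPerf results.
small_run = Submission("hypothetical 16-chip system", chips=16, minutes_to_train=30.0)
large_run = Submission("hypothetical 2,048-chip system", chips=2048, minutes_to_train=1.0)

wall_clock_ratio = small_run.minutes_to_train / large_run.minutes_to_train
chip_ratio = per_chip_speedup(large_run, small_run)
print(f"wall-clock: large run finishes {wall_clock_ratio:.1f}x sooner")
print(f"per chip:   large run is {chip_ratio:.2f}x as efficient")
```

With these made-up inputs, the large system wins easily on wall-clock time while using more total chip-time, which is why raw time-to-train says little about the chips themselves. Keep in mind that this naive normalization assumes work divides evenly across chips; real training runs lose efficiency as they scale out, so even chip-minutes is only a rough guide when the system sizes differ this much.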

Nvidia’s Architecture Is Particularly Suited for Reinforcement Learning

While many types of specialized hardware have been designed specifically for machine learning, most of them excel at either training or inference. Reinforcement Learning (RL) requires an interleaving of both, and Nvidia’s GPGPU-based hardware is well suited to that task. Because data is generated and consumed during the training process, Nvidia’s high-speed interlinks are also helpful for RL. Finally, because training robots in the real world is expensive and potentially dangerous, Nvidia’s GPU-accelerated simulation tools are useful when doing RL training in the lab.
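For readers who haven’t worked with RL, the interleaving of inference and training can be seen even in a toy example. The sketch below is a simple epsilon-greedy bandit, not MiniGo and not anything Nvidia ships; the function name, win rates, and parameters are assumptions made up for illustration. Each loop iteration uses the current model to pick an action (inference), lets the environment produce a fresh sample, and then immediately updates the model with it (training).

```python
import random


def run_bandit(true_win_rates, steps=5000, epsilon=0.1, seed=0):
    """Toy epsilon-greedy bandit: each step does inference, then training."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_win_rates)  # the "model": one value per action
    pulls = [0] * len(true_win_rates)

    for _ in range(steps):
        # Inference: use the current model to choose an action.
        if rng.random() < epsilon:
            action = rng.randrange(len(estimates))  # explore
        else:
            action = max(range(len(estimates)), key=estimates.__getitem__)  # exploit

        # Environment: the chosen action generates a fresh training sample.
        reward = 1.0 if rng.random() < true_win_rates[action] else 0.0

        # Training: update the model right away with that sample (running mean).
        pulls[action] += 1
        estimates[action] += (reward - estimates[action]) / pulls[action]

    return estimates, pulls


# Hypothetical win rates for three actions, chosen only for illustration.
estimates, pulls = run_bandit([0.2, 0.5, 0.8])
print("estimated values:", [round(v, 2) for v in estimates])
print("pulls per action:", pulls)
```

In a full-scale system such as a Go-playing agent, both halves of that loop are large neural-network workloads, and the data produced by one half feeds the other continuously, which is where hardware that handles both well has an advantage.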

Google Tips Its Hand With Impressive TPU v4 Results

Google Research put in an impressive showing with its future TPU v4 chip

Perhaps the most surprising piece of news from the new benchmarks is how well Google’s TPU v4 did. While v4 of the TPU is in the Research category, meaning it won’t be commercially available for at least 6 months, its near-Ampere-level performance on many training tasks is quite impressive. It was also interesting to see Intel weigh in with a decent performer in reinforcement learning, using a soon-to-be-released CPU. That should help it deliver in future robotics applications that may not require a discrete GPU. Full results are available from MLPerf.
