Hacker News | markurtz's comments

Disclosure: I work for Neural Magic.

Hi ml_hardware, we report results for both throughput and latency in the blog. As you noted, GPUs do beat our implementations on throughput by a bit, but we improved CPU throughput by over 10x. Our goal is to enable better flexibility and performance for deployments through more commonly available CPU servers.

When throughput costs matter, this flexibility becomes essential. Users can scale down to even a single core if they want, with a far more significant improvement in cost-performance. We walk through these comparisons in more depth in our YOLOv3 blog: https://neuralmagic.com/blog/benchmark-yolov3-on-cpus-with-d...
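As a rough sketch of the cost math behind that trade-off (every price and throughput figure below is a placeholder assumption, not a measured result):

```python
# Illustrative cost-per-inference comparison. All prices and throughput
# numbers below are placeholder assumptions, not measured benchmarks.

def cost_per_million(images_per_sec: float, dollars_per_hour: float) -> float:
    """Dollars to run one million inferences at a given throughput and price."""
    seconds = 1_000_000 / images_per_sec
    return seconds / 3600 * dollars_per_hour

# Hypothetical figures: a many-core CPU instance vs. a GPU instance.
cpu = cost_per_million(images_per_sec=100.0, dollars_per_hour=1.50)
gpu = cost_per_million(images_per_sec=300.0, dollars_per_hour=3.00)

# Scaling the CPU job down to a slice of the machine cuts cost further,
# since the hourly price shrinks while throughput falls sub-linearly.
print(f"CPU: ${cpu:.2f} / 1M images, GPU: ${gpu:.2f} / 1M images")
```

The point is simply that cost-performance depends on both axes, so the ability to rent an arbitrarily small slice of a CPU machine changes the comparison.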

INT8 wasn't run on GPUs because we ran into operator-support issues when converting PyTorch graphs to TensorRT (PyTorch currently doesn't support INT8 on GPU). We are actively working on this, though, so stay tuned as we run those comparisons!
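For readers unfamiliar with what INT8 inference involves, here is a minimal sketch of affine integer quantization, the general scheme INT8 engines build on; this is illustrative only, not Neural Magic's or TensorRT's implementation:

```python
# Minimal sketch of affine INT8 quantization: floats are mapped to 8-bit
# integers via a scale and zero point. Illustrative only.

def quantize(values, scale, zero_point):
    """Map float values to int8 via round(v / scale) + zero_point, clamped."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Approximate recovery of the original floats."""
    return [(q - zero_point) * scale for q in q_values]

weights = [0.5, -1.2, 0.0, 2.3]
scale, zero_point = 2.3 / 127, 0   # symmetric: scale chosen from max |value|
q = quantize(weights, scale, zero_point)
recovered = dequantize(q, scale, zero_point)
```

The rounding step is where accuracy can be lost, which is why calibration and operator support in the conversion toolchain matter so much.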

The models we're shipping will see performance gains on A100s as well, thanks to their support for semi-structured sparsity. Note, though, that A100s are priced higher than the commonly available V100s and T4s, which needs to be factored in. We generally limit our benchmarks to what is available in the top cloud services, to represent what most people can actually deploy on servers. This usability focus is also why we don't consider MLPerf a great source for most users. MLPerf has done a great job of standardizing benchmarking and improving numbers across the industry; still, the submitted systems are hyper-engineered for MLPerf numbers, and most customers cannot realize them given the cost involved.
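The semi-structured sparsity that A100s accelerate is the 2:4 pattern: in every aligned group of four consecutive weights, at most two may be nonzero. A small sketch of that constraint check (illustrative, not NVIDIA's implementation):

```python
# Sketch of the 2:4 ("semi-structured") sparsity pattern: within every
# aligned group of 4 consecutive weights, at most 2 may be nonzero.

def is_two_four_sparse(weights):
    """True if every aligned group of 4 weights has <= 2 nonzeros."""
    assert len(weights) % 4 == 0
    return all(
        sum(1 for w in weights[i:i + 4] if w != 0) <= 2
        for i in range(0, len(weights), 4)
    )

print(is_two_four_sparse([0.3, 0.0, 0.0, -1.1, 0.0, 0.5, 0.0, 0.0]))  # True
print(is_two_four_sparse([0.3, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]))   # False
```

Hardware can exploit this fixed pattern directly, which is why models pruned to it see gains on A100s even when unstructured sparsity wouldn't.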

Finally, note that post-processing for these networks is currently limited to CPUs due to operator support. This limitation will become a bottleneck for most deployments (it already is for both the GPU and our YOLOv5s numbers). We are actively working on speeding up the post-processing by leveraging the cache hierarchy in CPUs through the DeepSparse engine, and are seeing promising early results. We'll be releasing those soon to show even better comparisons.


Disclosure: I work for Neural Magic.

Hi deepnotderp, as others noted, the speeds listed here compare Ultralytics' GPU throughput numbers against Neural Magic's GPU latency numbers. We did also include throughput measurements, where YOLOv5s came in around 3 ms per image on a V100 at FP16 in our testing. All benchmarks were run on AWS instances for repeatability and availability, which is likely where the 2 ms vs 3 ms discrepancy comes from (slower memory transfer on the AWS machine vs the one Ultralytics used). Note, though, that a slower overall machine will affect CPU results as well.
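The throughput-vs-latency distinction can be made concrete with a toy benchmark harness (the `model` here is a stand-in that just sleeps, not a real network):

```python
# Sketch of why latency and throughput numbers must not be mixed:
# latency is time per single request at batch size 1, throughput is
# images/sec at a large batch size. `model` is a dummy stand-in.

import time

def model(batch):
    time.sleep(0.001 * len(batch))      # pretend 1 ms of work per image
    return batch

def latency_ms(n=20):
    start = time.perf_counter()
    for _ in range(n):
        model([0])                       # batch size 1, one request at a time
    return (time.perf_counter() - start) / n * 1000

def throughput_ips(batch_size=64, n=5):
    start = time.perf_counter()
    for _ in range(n):
        model(list(range(batch_size)))   # large batches keep hardware busy
    return batch_size * n / (time.perf_counter() - start)
```

On real hardware the two measurements diverge sharply, since batching amortizes per-call overhead; that is exactly why quoting one engine's throughput against another's latency is apples to oranges.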

We benchmarked using the available PyTorch APIs, mimicking what was done for Ultralytics' benchmarking. The code is open sourced for viewing and use here: https://github.com/neuralmagic/deepsparse/blob/main/examples...


Disclosure: I work for Neural Magic.

Hi carbocation, we'd love to see what you think of the performance using the DeepSparse engine for CPU inference: https://github.com/neuralmagic/deepsparse

Take a look through our getting started pages that walk through performance benchmarking, training, and deployment for our featured models: https://sparsezoo.neuralmagic.com/getting-started


Disclosure: I work for Neural Magic.

Hi 37ef_ced3, AVX-512 has certainly helped close the gap between CPUs and GPUs. Even so, most CPU inference engines remain heavily compute bound on these networks. Unstructured sparsity lets us cut the compute requirements so that most layers become memory bound instead. That, in turn, lets us exploit the unique cache architectures of CPUs to remove the memory bottlenecks through proprietary techniques. This combination is what's truly unique about Neural Magic and enables the DeepSparse engine to compete with GPUs while keeping the flexibility and deployability of widely available CPU servers.
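To illustrate the first half of that argument, here is a toy sketch of how storing only nonzero weights lets a matvec skip the zero multiplies entirely; this is illustrative, not the DeepSparse engine's implementation:

```python
# Sketch of why unstructured sparsity cuts compute: storing only the
# nonzero weights as (column_index, value) pairs lets a matrix-vector
# product skip every zero multiply. Illustrative only.

def sparsify(row):
    """Dense row -> list of (column_index, nonzero_value) pairs."""
    return [(j, w) for j, w in enumerate(row) if w != 0.0]

def sparse_matvec(sparse_rows, x):
    """Multiply-adds only where weights are nonzero."""
    return [sum(w * x[j] for j, w in row) for row in sparse_rows]

dense = [[0.0, 2.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 3.0]]
sparse = [sparsify(row) for row in dense]
x = [1.0, 1.0, 1.0, 1.0]

result = sparse_matvec(sparse, x)        # [2.0, 4.0]
flops = sum(len(row) for row in sparse)  # 3 multiply-adds instead of 8
```

Once the multiply count drops like this, the layer's runtime is dominated by moving weights and activations through memory, which is where cache-aware execution takes over.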

Also, as boulos said, the numbers quoted here are a bit lopsided. There are more affordable GPUs, and GCP has enabled some very cost-effective T4 instances. Our YOLOv3 performance blog walks through a direct comparison against T4s and V100s on GCP for both performance and cost: https://neuralmagic.com/blog/benchmark-yolov3-on-cpus-with-d...

Additionally, I'd like to clarify that our open-source products do not currently communicate with any of our servers. They are standalone products that do not require Neural Magic to be in the loop for use.

