Doing more with less: How to design smarter AI at the edge

When designing any automotive SoC or ASIC, AI engineers must focus on building production platforms that can reliably execute their algorithms while achieving exceptional PPA (performance, power and area): lowest power, lowest cost, highest performance. They must also commit to hardware platform selection early in the design cycle, often before final algorithm development is complete.

Author: Tony King-Smith, Executive Consultant at aiMotive

In this article, I’ll try to distill what I’ve learned to help engineers make sound decisions when evaluating automotive AI hardware platforms.

Einstein once said, “We cannot solve problems with the same level of thinking that we used to create them.”

Why do we like big numbers?

As an engineer with over 40 years of experience in the semiconductor business, including roles as R&D director and CMO, I consider myself and my peers to be logical. However, how many of us can honestly say we are not seduced by claims like “my product is faster than yours”? I’m afraid it’s just human nature.

The question is always one of definition: how do we define “faster” or “lower power” or “cheaper”? That is the problem benchmarks try to solve: a consistent context and external standards to make sure you are comparing like with like. Anyone who uses benchmarks knows this all too well (indeed, aiMotive was born out of a leading GPU benchmarking company).

This need has never been more urgent than when comparing hardware platforms for automotive AI applications.

When is 10 TOPS not 10 TOPS?

With or without a dedicated NPU, most SoC vendors quote their ability to perform neural network workloads in TOPS, which stands for tera operations per second. This is simply the total number of arithmetic operations per second that the chip can perform, whether centralized in a dedicated NPU or distributed across multiple compute engines (such as GPUs, CPU vector coprocessors, or other accelerators).
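As a rough illustration of where the headline number comes from, here is a minimal sketch; the unit count and clock frequency below are hypothetical, not taken from any real product. A TOPS rating is typically just the number of multiply-accumulate (MAC) units times the clock frequency, with each MAC counted as two operations:

```python
# Sketch of how a headline TOPS figure is typically derived.
# All numbers are hypothetical examples, not specs of any real NPU.

num_mac_units = 2048   # parallel multiply-accumulate (MAC) units
clock_hz = 1.0e9       # 1 GHz clock
ops_per_mac = 2        # one MAC usually counts as 2 ops (multiply + add)

peak_ops_per_second = num_mac_units * clock_hz * ops_per_mac
print(f"Peak throughput: {peak_ops_per_second / 1e12:.1f} TOPS")  # ~4.1 TOPS
```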

However, no hardware execution engine can execute any workload with 100% efficiency. For neural network inference, certain layers (such as pooling or activation) are mathematically very different from convolutions. Before a convolution (or another layer such as pooling) can start, data often has to be rearranged or moved from one place to another. At other times, the NPU may need to wait for new instructions or data from the host CPU controlling it, per layer or even per block of data. All of these leave compute resources idle, so the delivered throughput falls short of the theoretical maximum.
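To see how those stalls add up, here is a minimal sketch of a per-frame time budget; the millisecond figures are invented for illustration, and real profiles vary per network and NPU:

```python
# Hypothetical time breakdown for one inference frame.
# The figures are invented for illustration, not measured on any NPU.

frame_time_ms = 10.0
data_move_ms = 2.5    # rearranging / moving activations and weights
host_wait_ms = 1.5    # waiting on the host CPU for commands or data
compute_ms = frame_time_ms - data_move_ms - host_wait_ms  # 6.0 ms of useful math

busy_fraction = compute_ms / frame_time_ms
print(f"MAC arrays are busy only {busy_fraction:.0%} of the frame, so an "
      f"8 TOPS engine behaves like {8.0 * busy_fraction:.1f} TOPS at best")
```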

Hardware utilization – not what it looks like

Many NPU vendors cite hardware utilization to indicate how well their NPU performs a given neural network workload. This basically says, “This is how much of my NPU’s theoretical capacity is being used to perform neural network workloads.” Surely, then, this tells me what I need to know?

Unfortunately not. The problem with hardware utilization is one of definition: the number depends entirely on how the NPU vendor chooses to define it. In fact, the problem with both hardware utilization and TOPS is that they only tell you what the hardware engine can theoretically achieve, not how well it actually performs.
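One way to see the problem, sketched below with invented numbers: utilization typically counts every operation the hardware executes, including any padding or redundant work the compiler inserts, whereas only the operations the network mathematically requires count as useful:

```python
# Hypothetical contrast between utilization and useful work.
# Numbers are invented for the example, not measurements of any real NPU.

claimed_tops = 8.0       # vendor's headline figure
window_s = 1.0           # measurement window (seconds)

ops_executed = 4.0e12    # everything the hardware did, incl. padded/redundant ops
ops_required = 1.5e12    # ops the network mathematically requires

utilization = ops_executed / (claimed_tops * 1e12 * window_s)
useful = ops_required / (claimed_tops * 1e12 * window_s)

print(f"Utilization: {utilization:.0%}")   # 50% - looks respectable
print(f"Useful work: {useful:.0%}")        # 19% - what actually got computed
```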

This can be highly misleading. Figure 1 below shows our comparison between the 4 TOPS aiWare3P NPU and another well-known NPU rated at 8 TOPS.

Figure 1: Comparison of utilization and efficiency of two automotive inference NPUs (Source: aiMotive using publicly available hardware and software tools)

Across two different well-known benchmark networks, this NPU claims a capacity of 8 TOPS compared to aiWare3P’s 4 TOPS, which should mean roughly 2x higher fps than aiWare3P. In reality, the opposite is true: aiWare3P delivers 2 to 5 times the performance, despite having only half the claimed TOPS!
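The arithmetic behind that result is worth spelling out. Effective throughput scales with fps, so if a part with half the TOPS delivers 2x the frame rate on the same network, it is extracting roughly 4x more useful work per claimed TOPS; at 5x the frame rate, the gap is roughly 10x. Using only the ratios stated above:

```python
# Ratio arithmetic from the comparison above: same network, two NPUs.
# Relative efficiency per claimed TOPS = (fps ratio) / (TOPS ratio).

tops_ratio = 4.0 / 8.0          # aiWare3P TOPS vs. the competitor's TOPS
for fps_ratio in (2.0, 5.0):    # the 2x-5x performance advantage cited above
    efficiency_ratio = fps_ratio / tops_ratio
    print(f"{fps_ratio:.0f}x fps at half the TOPS -> "
          f"{efficiency_ratio:.0f}x the useful work per claimed TOPS")
```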

The conclusion is that TOPS is a very poor measure of AI hardware capability; hardware utilization is almost as misleading as TOPS.

NPU efficiency and autonomy: the keys to optimizing PPA

That’s why I think you have to evaluate NPU capabilities in terms of efficiency when executing a representative set of workloads, not raw theoretical hardware capabilities. Efficiency is defined as the number of operations a particular CNN mathematically requires per frame, multiplied by the frame rate achieved, expressed as a percentage of the claimed TOPS. The operation count is derived entirely from the underlying math that defines any CNN, regardless of how the NPU actually evaluates it. Efficiency therefore compares actual to claimed performance, which is what really matters.
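A minimal sketch of that calculation; the network size, frame rate and TOPS figure below are hypothetical, chosen only to show the formula:

```python
# Efficiency as defined above: useful ops delivered per second,
# as a fraction of the claimed peak. Example numbers are hypothetical.

ops_per_frame = 50.0e9   # ops a given CNN mathematically requires per frame
fps_achieved = 40.0      # measured frames per second on the NPU
claimed_tops = 4.0       # vendor's headline figure

effective_tops = ops_per_frame * fps_achieved / 1e12
efficiency = effective_tops / claimed_tops

print(f"Effective throughput: {effective_tops:.1f} TOPS")  # 2.0 TOPS
print(f"Efficiency: {efficiency:.0%}")                     # 50%
```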

An NPU with high efficiency takes full advantage of every square millimeter of silicon used to implement it, which translates into lower chip cost and lower power consumption. Efficiency is what enables optimal PPA for an automotive SoC or ASIC.

The autonomy of the NPU is another important factor. How much load does the NPU place on the host CPU to sustain maximum performance? How much does it demand of the memory subsystem? The NPU must be considered as a major block in any SoC or ASIC, and its impact on the rest of the chip and its subsystems cannot be ignored.

In conclusion

When designing any automotive SoC or ASIC, AI engineers must focus on building production platforms that can reliably execute their algorithms while achieving exceptional PPA: lowest power, lowest cost, highest performance. They must also commit to hardware platform selection early in the design cycle, often before final algorithm development is complete.

Efficiency is the best way to achieve this; neither TOPS nor hardware utilization is a good measure. Evaluating the autonomy of the NPU is also critical if demanding production targets are to be met.

Tony King-Smith is an executive consultant for aiMotive. With over 40 years of experience in the semiconductor and electronics fields, he has managed R&D strategy and hardware and software engineering teams for several multinational companies including Panasonic, Renesas, British Aerospace and LSI Logic. Tony was previously CMO of Imagination Technologies, a leading semiconductor IP provider.
