
Designing the next generation of AI chips part 1

By Spencer Acain

I recently had the opportunity to speak with Russell Klein, program director at Siemens EDA and a member of the Catapult HLS team, about the ways they’re helping create the next generation of AI accelerator chips to power everything from smartphones to self-driving cars. Check out the podcast here or read the full transcript below.

Spencer Acain: Hello and welcome to the AI Spectrum podcast. I’m your host, Spencer Acain. In this series we talk to experts from all across Siemens about a wide range of AI topics and how they are applied to different technologies. Today I am joined by Russell Klein, program director at Siemens EDA and a member of the Catapult HLS team. Welcome, Russ.

Russell Klein: Thank you.

Spencer Acain: So before we jump right into this, can you give us a little bit of background about yourself and the kind of work you do that supports artificial intelligence?

Russell Klein: So I’ve been with Mentor for more than 20 years now, and most of my work has been on the boundary between the hardware and software worlds. Today I’m working in the Catapult group on the high-level synthesis tool, and I’m focused on algorithm acceleration. That is, when we look at algorithms that are running on embedded processors, sometimes those don’t run fast enough or they’re consuming too much power, and we can use high-level synthesis to move them into dedicated hardware accelerators, where they’re going to run faster and much more efficiently.

Spencer Acain: And how does that apply to artificial intelligence work and running AI algorithms? How does that connect?

Russell Klein: Yeah. So what we’re seeing out in our customer base is that the AI models are really growing in terms of complexity, which means they’re getting more layers, they’re getting more channels within each layer; they’re fundamentally just getting a lot bigger. And as they get bigger, the inferencing that you need to do afterwards becomes more computationally complex. What we’re seeing is that the inferencing is simply too big a computational load to run on an embedded processor. Processors tend to be rather slow and rather inefficient, and unlike 15 or 20 years ago, when we got bigger, faster processors every few years, that’s just not happening anymore. So people need to bring some kind of hardware acceleration into the mix to be able to run those inferences on embedded systems. And again, high-level synthesis is a really good way to move those functions off the CPU into an accelerator.
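
To put that computational load in perspective (an illustrative back-of-the-envelope figure, not one from the interview): a single 3×3 convolution layer with 64 input channels and 128 output channels on a 224×224 feature map requires about 224 × 224 × 3 × 3 × 64 × 128 ≈ 3.7 billion multiply-accumulate operations, and a modern vision network stacks dozens of such layers for every frame it processes.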

Spencer Acain: So would these accelerators be GPUs as well? Can you use those in place of a CPU? I know they’re used a lot for training; you get thousands of them up in a server farm for training big models. But what about the sort of applications you’re talking about?

Russell Klein: Yeah, so GPUs certainly are going to accelerate inferencing, and you can use them in embedded systems as well, for doing inferencing on systems that are out at the edge. But what you’re going to find is that GPUs are really quite general, and they also drag along a lot of hardware that is used for processing graphics, and that’s not necessarily a great thing when you’re doing inferencing. There are also neural processing units, or NPUs; sometimes they’re called TPUs, or tensor processing units. These are a lot like GPUs: they’re going to have a big array of processing elements, and they’re going to bring a lot of parallelism into the mix. They’re going to go a bit faster than a GPU, and they’re going to be a bit more efficient because they drop some of the unnecessary hardware. But when we want to get the absolute highest efficiency and the most performance out of our accelerator, we really don’t want something that’s generalized. We want something that’s really tailored to the specific inference that we’re doing, and that’s where high-level synthesis comes in: rather than creating a very general-purpose accelerator that would work for lots of inferences, we can create one that’s highly tailored to just this one type of inference. That’s going to give you the highest performance and the highest efficiency; it’s going to run faster and at a lower energy level than either GPUs or TPUs.

Spencer Acain: That sounds like it would be pretty valuable. So that’s what you’re using HLS for. Can you tell me more about Catapult and what HLS is in this context?

Russell Klein: Sure. So high-level synthesis fundamentally takes an algorithm that you describe in C or C++ (some of our customers are using SystemC), and it will generate what’s called an RTL description of the hardware. Now, in a traditional design flow, the designer sits down and describes every wire in the design, every register in the design, and every single operator in the design. So all the adders, all the multipliers, everything: he’s going to explicitly call out every single one of those, and that tends to be really tedious, and it also tends to be error-prone. It’s a highly manual process. With high-level synthesis, rather than describing every register and every wire, we describe the algorithm that we want performed, and then a compiler comes along, parses that description, and figures out how many wires we need and what registers we need. It can actually identify opportunities to share things. In other words, we could have one register that holds three different variables at different times, where a human writing the hardware would never notice that; the compiler can be a lot more rigorous about it.
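
As a concrete illustration of the kind of source HLS works from (a minimal sketch, not code from the interview; the fixed-point types come from the Algorithmic C datatypes that ship with Catapult, and the vector length is an assumption):

#include <ac_fixed.h>

// 16-bit signed fixed point: 8 integer bits, 8 fractional bits.
typedef ac_fixed<16, 8, true> coef_t;
// Wider accumulator so the sum of 64 products cannot overflow.
typedef ac_fixed<32, 16, true> acc_t;

// Dot product of one filter against one input window: the basic
// multiply-accumulate at the heart of a convolution layer. Note that
// nothing here names a wire, a register, or a clock; HLS infers the
// adders, multipliers, registers, and control logic from the loop.
void dot_product(const coef_t weights[64], const coef_t window[64],
                 acc_t &result) {
  acc_t acc = 0;
  for (int i = 0; i < 64; i++) {
    acc += weights[i] * window[i]; // one MAC per iteration
  }
  result = acc;
}

From a description like this, the compiler decides, for example, whether those 64 multiplies share one multiplier over 64 clock cycles or run on 64 parallel multipliers in a single cycle.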

Russell Klein: Now, high-level synthesis also knows about the target technology that we’re aiming at. If we want this algorithm to be implemented on an FPGA, it will know all about the lookup tables that are there, it will understand the characteristics of those lookup tables, and it will know how much memory is spread throughout the FPGA fabric, and so forth. If we’re targeting an ASIC, it will know all of the gate-level primitives that are available in that ASIC. It will know what clock frequency the designer is trying to run this at, and it can actually build the hardware very specifically for that technology target, which makes it very easy to ultimately deploy the hardware. What high-level synthesis creates is an RTL source file, a register transfer level description of the hardware to be implemented. So high-level synthesis produces the same thing that a person would; it just does it much faster and much more efficiently. That RTL file can be used in all of the downstream tools in exactly the same way that a manually created RTL file would be, so it preserves the investment in all the downstream back-end tools.

Spencer Acain: That sounds very valuable. I know these microprocessor designs are incredibly complex, so being able to describe the design at that level and have the computer figure out what you’re going to need must be a huge time savings.

Russell Klein: Indeed. One of the really big benefits of high-level synthesis is that you get to design a lot faster. And when you’re working in a technology space like AI, what we’re seeing is that the algorithms are changing very quickly. If we’ve got a very short design cycle, it means the design we put together is going to incorporate a lot of the most recent changes to the algorithms. Whereas if you have a manual design process that takes months or years, you’re going to have to lock in your algorithms and then deliver your silicon; it easily could be 18 to 24 months after you’ve decided on exactly what you’re going to build. So that shorter design cycle can be a really significant benefit if you’re working in a part of the industry where the algorithms are quickly changing. The other thing is that, for example, if we look at networking chips, the person working on a networking chip has a little bit of gray hair, has probably been building networking chips for the last 30 years, and knows exactly what a really good architecture for a networking chip is, and he’ll be able to say, for our next design we should change this and this and this.

Russell Klein: In the AI space, finding somebody who’s got more than five years of experience building AI accelerator chips is extremely rare, right? It just hasn’t been an industry for that long, and therefore the designers just don’t know exactly what the ideal design is; they don’t have the experience base. So being able to try out a number of different architectures very quickly is something you can do with high-level synthesis that’s really impractical with a traditional design cycle.
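
To make that architecture exploration concrete, here is a hedged sketch building on the dot-product example above. The pragma spelling is a Catapult-style directive but should be treated as illustrative; exact directive names and mechanisms vary by tool and version.

#include <ac_fixed.h>

typedef ac_fixed<16, 8, true> coef_t;
typedef ac_fixed<32, 16, true> acc_t;

void dot_product_parallel(const coef_t weights[64], const coef_t window[64],
                          acc_t &result) {
  acc_t acc = 0;
  // Illustrative Catapult-style directive: fully unrolling the loop asks
  // the tool to build 64 parallel multipliers and an adder tree, trading
  // silicon area for throughput. Removing the directive (or pipelining
  // the loop instead) yields a smaller, slower design from the identical
  // source; that is the fast architecture exploration described above.
  #pragma hls_unroll yes
  for (int i = 0; i < 64; i++) {
    acc += weights[i] * window[i];
  }
  result = acc;
}

Because only the directive changes between variants, a designer can compare several candidate architectures in hours rather than re-writing RTL for each one.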

Russell Klein: The other thing that high-level synthesis brings to the party is that it makes verification of the design a lot easier. If we look at the total cost of building an SoC, verification is the biggest single cost in that development cycle. With high-level synthesis, rather than verifying everything at the very low level, we can verify the algorithm at the beginning, and then, as we migrate the design down to the RTL, we can prove equivalency at each stage. This dramatically reduces the amount of manual verification that we need to do and the total compute time that we need to run those verifications. So we’ve got a really big savings there as well, and those are savings that are not immediately obvious.
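
Here is a hedged sketch of what verifying the algorithm at the beginning can look like in practice (the reference model, data, and tolerance are invented for illustration): an ordinary C++ testbench checks the synthesizable fixed-point code against a floating-point reference long before any RTL exists, and the same testbench anchors the later equivalence checks.

#include <ac_fixed.h>
#include <cmath>
#include <cstdio>
#include <cstdlib>

typedef ac_fixed<16, 8, true> coef_t;
typedef ac_fixed<32, 16, true> acc_t;

// Synthesizable design under test (same sketch as above).
void dot_product(const coef_t weights[64], const coef_t window[64],
                 acc_t &result) {
  acc_t acc = 0;
  for (int i = 0; i < 64; i++) acc += weights[i] * window[i];
  result = acc;
}

int main() {
  coef_t w[64], x[64];
  double ref = 0.0;
  for (int i = 0; i < 64; i++) {
    // Multiples of 1/256 are exactly representable in 8 fractional
    // bits, so the fixed-point result should match the float reference.
    double wf = (rand() % 512 - 256) / 256.0;
    double xf = (rand() % 512 - 256) / 256.0;
    w[i] = wf;
    x[i] = xf;
    ref += wf * xf; // floating-point reference model
  }
  acc_t out;
  dot_product(w, x, out); // fixed-point version under test
  double err = std::fabs(out.to_double() - ref);
  printf("reference=%f dut=%f |error|=%f\n", ref, out.to_double(), err);
  return err < 1e-9 ? 0 : 1; // tolerance chosen for illustration
}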

Spencer Acain: Yeah, I can imagine. My background is in math and applied mathematics, so it’s almost intuitive to me that this would be easier: if you can say that A is equal to B, and you can prove that A is correct, then you know B is correct all of the time. That’s much easier than verifying B directly, and that’s kind of what you’re doing.

Russell Klein: Yeah. And within Catapult, we’ve actually built in formal verification tools. Now, to a lot of people, formal verification tools are a little bit scary, because they have a reputation of needing a lot of expertise. Well, we’ve built it into the high-level synthesis tool in a way that you don’t have to bring all that expertise to it. We can do a lot of formal verification between the original C algorithm and the resulting RTL, so that the verification space we need to exercise with dynamic verification is much, much smaller. So we can bring in those formal techniques without requiring the user to have a lot of expertise.

Spencer Acain: That all sounds pretty incredible, actually. So we’ve talked a bit about how Catapult and HLS are really helpful in designing these accelerators for AI algorithms. But where are these chips being implemented? Where is this approach really being used today?

Russell Klein: One of the areas is the IoT space, where we’ve got limited power and a limited cost budget to build the systems with, especially if we’ve got battery-powered devices. This is where bringing in something like a GPU, which burns a lot of power, just isn’t practical; you need to be more efficient. So in the IoT space, especially battery-powered, we’re seeing a lot of activity. Interestingly, there’s a certain set of systems where privacy and security concerns are paramount, and we’re seeing some activity there as well, because the alternative to doing the inferencing down on the edge system is to send the data up to a data center, an AWS cloud or an Azure cloud, do the inferencing there, and bring the results back. If you don’t have hard real-time requirements and your latency can be fairly high, that works out pretty well; but if you’ve got data that you really just don’t want going over the public Internet, you might want to do that inferencing down on the embedded system. In the automotive space, as we look to self-driving cars and things like that, it’s the reaction times that are necessary between the sensors and the actuators: the cameras see what’s out in front of us, the inference needs to make a decision, and then we need to act on the steering or the brakes or the accelerator. The time limits we’ve got there are measured in milliseconds, and you really are going to have to have the highest level of performance. This is where you want a highly customized accelerator; using something off the shelf really isn’t going to be practical.

Russell Klein: And we’re also seeing it where people want to differentiate from competing solutions. For example, if you’re in a marketplace where all your competitors have grabbed, say, a Jetson Nano AI accelerator, and you’re using the same processor and the same off-the-shelf AI accelerator, you’re going to have roughly the same price point, roughly the same performance, and roughly the same costs as all your competitors. If you want to differentiate, you can use high-level synthesis to build something specific to your industry, your customers, and your application, and that can give you an edge in a competitive marketplace.

Spencer Acain: I see. So it sounds like Catapult and HLS are really a good way to zero in on and fine-tune exactly what you need for a very specific application; a great tool for the highly specific AI algorithms we see today. And with that, we are unfortunately out of time for this episode. I have been your host, Spencer Acain, and this has been the AI Spectrum podcast.


Siemens Digital Industries Software helps organizations of all sizes digitally transform using software, hardware and services from the Siemens Xcelerator business platform. Siemens’ software and the comprehensive digital twin enable companies to optimize their design, engineering and manufacturing processes to turn today’s ideas into the sustainable products of the future. From chips to entire systems, from product to process, across all industries. Siemens Digital Industries Software – Accelerating transformation.


This article first appeared on the Siemens Digital Industries Software blog at https://blogs.sw.siemens.com/thought-leadership/2023/03/07/designing-the-next-generation-of-ai-chips-part-1/