Special report: building AI chips that deliver seminar

By Thomas Dewey

For those who did not get a chance to register for the “Building AI Chips that Deliver” seminar at the AI Hardware Summit, I am doing a recap for you. You are welcome! This seminar was actually an interactive panel covering how to create AI hardware systems that are efficient, predictable, and low power. The panelists also took on what the ecosystem needs to provide to attain these goals. The conversation centered on modeling and achieving latency goals. Representatives on the panel were:

Dennis Abts, Chief Architect and Fellow at Groq
Ty Garibay, VP of Engineering at Mythic
Gajinder Panesar, Fellow at Siemens
Rangan Sukumar, Distinguished Technologist at Hewlett Packard Enterprise

The first discussion was how to approach AI hardware challenges from a design perspective. The panel question was, “How does designing for performance for an AI chip differ from ‘normal’ techniques for chip design?” You would think that the non-determinism of AI algorithms (models) and neural structures would cause differences. But the consensus was that in many ways, the design process is the same; you still need to meet timing goals and push data from point A to point B efficiently. The difference is the AI algorithm software that needs to run on the hardware. I have found that so far, no one has come up with a hardware solution that can efficiently run every AI algorithm type, so this makes sense to me. Another issue is the programming language that runs on the hardware. There is a whole ecosystem required for every language that needs to be supported by the hardware system. However, the panel feels that once the design hits the hardware accelerator implementation stage, as a generated compute graph, it is very deterministic as far as latency is concerned. This is because the panelists’ solutions (Mythic and Groq) do not have arbiters and other hardware aspects that impact timing. In addition, Siemens provides on-board profiling and monitoring IP that can be added to the hardware solution for non-intrusive debug that does not affect latency.
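To make the latency point concrete, here is a minimal sketch of why a fixed compute graph is predictable: if every operation has a known cycle count and nothing arbitrates for shared resources at run time, the end-to-end latency is simply the longest path through the graph. The `Op` structure, the `graph_latency` function, and the cycle counts are all hypothetical and chosen for illustration; this is not how Groq or Mythic actually schedule their hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical node in a statically scheduled compute graph.
struct Op {
    std::uint32_t cycles;             // fixed cost in cycles (made-up numbers)
    std::vector<std::size_t> inputs;  // indices of producer ops
};

// Returns the cycle at which the last op finishes.
// Assumes ops are listed in topological order and that each op starts
// as soon as all of its inputs are done (no arbitration, no stalls).
std::uint64_t graph_latency(const std::vector<Op>& ops) {
    std::vector<std::uint64_t> finish(ops.size(), 0);
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        std::uint64_t start = 0;
        for (std::size_t in : ops[i].inputs) {
            assert(in < i && "ops must be in topological order");
            start = std::max(start, finish[in]);
        }
        finish[i] = start + ops[i].cycles;
        total = std::max(total, finish[i]);
    }
    return total;
}

// Illustrative graph: load -> {matmul, bias fetch} -> add.
std::vector<Op> example_graph() {
    return {
        {4, {}},      // 0: load
        {10, {0}},    // 1: matmul
        {2, {0}},     // 2: bias fetch
        {1, {1, 2}},  // 3: add
    };
}
```

For the example graph, the longest path is load, matmul, add, so `graph_latency(example_graph())` comes out to 15 cycles every time. Add an arbiter or a shared cache and that number turns into a distribution instead of a constant.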

The next topic was, “What are the challenges for systems that deploy machine learning neural networks and AI accelerators in hardware?” The panelists believe the biggest issue here is the connection to the host processor. You want to make sure that the AI components get the data that they need from the host to work efficiently. The key is a lightweight, host-“agnostic” interface that just performs exception handling. Basically, the general host compute capabilities are used to compile algorithm models for special-purpose AI accelerators on chip. But can machine learning and AI be used to detect issues while the chip is running? The answer is yes, in that the UltraSoC solution that Siemens recently acquired “enables a unified data-driven infrastructure that can enhance product quality, safety and cybersecurity, and the creation of a comprehensive solution to help semiconductor industry customers overcome key pain points including manufacturing defects, software and hardware bugs, device early-failure and wear-out, functional safety, and malicious attacks,” to quote our press release.

The next topic addressed implementation challenges for the chip. AI systems are typically developed on a CPU or GPU system first. Once the system is verified, designers attempt to implement the design in custom hardware. The key is managing the differences between the original design and the custom chip. This is especially true for managing the software that runs on the chip and making sure that the latency specifications are met. In addition, the AI algorithms change so often during the course of development that the traditional IC flow no longer works if you want to meet schedules. While the panel did not get into it, I can personally say that this is where high-level synthesis (HLS) comes into play. Using a tool like Siemens Catapult HLS, designers can code their algorithms at the C++ level and automatically generate the RTL code as input into the traditional IC design flow. Any algorithm changes are easily made and verified at the C++ level, and the tool generates the RTL for any targeted device or implementation technology. The tool also allows designers to analyze and change the algorithms based on latency and power consumption.
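For readers who have not seen algorithm code written with hardware in mind, here is a rough sketch of the style: a small fully connected layer with a ReLU, written in plain C++ with fixed loop bounds, fixed-width integers, and no dynamic memory. The names (`dense_relu`, `kInputs`, `kOutputs`) and the sizes are my own assumptions for illustration. It uses only the standard library, so it is not Catapult-specific code; a real HLS flow would likely add bit-accurate data types and tool directives, which I leave out here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative layer sizes; a real design would pick these from the model.
constexpr std::size_t kInputs = 4;
constexpr std::size_t kOutputs = 2;

using Input = std::array<std::int8_t, kInputs>;
using Weights = std::array<std::array<std::int8_t, kInputs>, kOutputs>;
using Output = std::array<std::int32_t, kOutputs>;

// Fully connected layer followed by ReLU.
// Loop bounds are compile-time constants and there is no dynamic
// allocation or recursion, which keeps the structure easy to map
// onto fixed hardware.
Output dense_relu(const Input& x, const Weights& w, const Output& bias) {
    Output y{};
    for (std::size_t o = 0; o < kOutputs; ++o) {
        std::int32_t acc = bias[o];
        for (std::size_t i = 0; i < kInputs; ++i) {
            // Widen before multiplying so the product cannot overflow int8.
            acc += static_cast<std::int32_t>(w[o][i]) *
                   static_cast<std::int32_t>(x[i]);
        }
        y[o] = acc > 0 ? acc : 0;  // ReLU
    }
    return y;
}
```

Because this is ordinary C++, an algorithm change such as a new activation function or a different layer size can be edited and checked with a normal software test before any RTL is regenerated.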

This was an interesting panel with a nice mix of solution providers that seemed to complement each other well. Plus, the conversation naturally allowed me to insert not-so-subtle references to some of the Siemens solutions that we have in this space, which nostalgically brought back memories of my past role as a technical marketing engineer. Good times.
