
The million-dollar bug: Why hyperscale verification can’t afford delays


The Hyperscale Verification Challenge

In my early days as a verification engineer, the Vice-President of the division walked into my cube and asked for a status update on the bug that I had found. He politely informed me that every week of delay was costing the company about $1M, and that if we could go ahead and get this taped out that would be great. The problem wasn’t that I’d found the bug — no one really praised me for that — it was that now we had to find every corner case, update the RTL with the fix, and then verify the fix as soon as humanly possible.

That was hard enough when designs were hundreds of millions of gates and we had some margin built into the schedule. Today we’re scaling to tens of billions of gates in multi-chiplet, many-interface designs with CPUs and custom AI accelerators, while market windows shrink to a comical degree. The pressure I felt from that VP as a young engineer is dwarfed by the stakes these hyperscale teams are facing. Missing a tape-out at the foundry doesn’t cost hundreds of millions a week; it may cost an entire program.

Designs have exploded in complexity, and the software workload has grown by orders of magnitude in only a few years. Getting it right the first time now means getting it right across die-to-die boundaries, HBM, high-speed interconnect, and custom accelerators moving data in ways conventional computer architecture never imagined a decade ago. So you’re going to have bugs; that’s a given.

This is where conventional hardware-assisted verification has not kept up with the realities of finding and fixing bugs. Running a tool faster doesn’t help if the bug is not reproducible. And when you need to insert a trigger into the DUT to isolate the problem, your platform shouldn’t force you to recompile the entire design for that one trigger change — because that VP standing in your cube certainly won’t wait. You are now on the hook to explain why you still haven’t isolated, fixed, and verified it. Without 100% visibility and debuggability on a hardware platform that is purpose-built for exactly that, it becomes a very expensive and frustrating experience.
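To put rough numbers on that argument, here is a minimal back-of-the-envelope sketch in Python. It is an illustration only: the function name `debug_delay_cost` and every input other than the $1M-per-week figure from the anecdote above are hypothetical placeholders, not measurements from any real platform or project.

```python
# Hypothetical back-of-the-envelope model. Every input except the
# $1M/week figure from the anecdote is a placeholder assumption.

COST_PER_WEEK_USD = 1_000_000  # figure quoted by the VP in the anecdote
HOURS_PER_WEEK = 24 * 7        # assumes the platform runs around the clock


def debug_delay_cost(trigger_changes, recompile_hours, rerun_hours):
    """Return (weeks_of_delay, cost_usd) for a series of trigger changes.

    Each trigger change costs one recompile (zero hours if triggers can
    be modified at run time) plus one rerun of the failing workload.
    """
    total_hours = trigger_changes * (recompile_hours + rerun_hours)
    weeks = total_hours / HOURS_PER_WEEK
    return weeks, weeks * COST_PER_WEEK_USD


if __name__ == "__main__":
    # Placeholder scenario: 20 trigger changes, 12 h recompile, 2 h rerun.
    slow = debug_delay_cost(20, recompile_hours=12, rerun_hours=2)
    fast = debug_delay_cost(20, recompile_hours=0, rerun_hours=2)
    print(f"recompile per change: {slow[0]:.2f} weeks, ${slow[1]:,.0f}")
    print(f"run-time triggers:    {fast[0]:.2f} weeks, ${fast[1]:,.0f}")
```

Whatever values you plug in, the shape of the result is the same: when isolating a bug takes many trigger changes, the recompile term is multiplied by every one of them, so it tends to dominate the schedule long before raw run speed does.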

This is the architectural difference that matters: Veloce Strato CS is built on custom emulation silicon, not repurposed off-the-shelf FPGAs. That’s why visibility, dynamic trigger insertion, debug speed, and scalability are native capabilities, not features bolted on after the fact.

Arm used the Veloce Strato CS and proFPGA CS hardware platforms to verify their recently announced AGI CPU. They needed to scale to 10 billion+ gate monolithic models. They needed to run realistic AI workloads across multiple CSS sub-systems simultaneously, stacking dozens of high-speed interfaces. And they understood that being able to run the workload was only the first step. Finding and fixing bugs quickly was the purpose of running those workloads.

That means full visibility without the need to instrument the design after a failure. It means software solutions that drive stimulus at a proven 98% of theoretical maximum, running dozens of PCIe Gen 6 instances, NVMe, CXL, and memory controller endpoints while executing the Arm Compliance Suite on top of it, all to stress the design to its limit. And it requires channel bandwidth that does not slow the DUT clock down when things really heat up on those interfaces.

What Arm found most valuable maps directly to the KPIs that every hyperscale verification team is measured against: running a 10 billion+ gate monolithic model at scale across four emulation towers, channel bandwidth that saturates dozens of PCIe, memory, and interconnect interfaces without slowing the DUT clock, and full-visibility debug with dynamic trigger modification — no recompile required. These are just a few of the capabilities that come from Crystal X, the custom silicon architecture inside Veloce Strato CS, designed by Siemens from the ground up for emulation.

The same platforms and solutions that helped Arm verify their AGI CPU are available to Arm partners and to any team facing the same hyperscale verification challenge. I’ve seen my share of these semiconductor cycles that change everything. The platforms and tools that enable them are what separate the teams that ship on time from the ones explaining to their VP why they can’t.

To learn more, download the Veloce Strato CS factsheet, the Veloce proFPGA CS factsheet, or contact us at 1-800-547-3000.

In upcoming blogs, we’ll take a closer look at each of these aspects of verifying a hyperscale-size design: meeting the performance requirements, the different solutions that work together to stress the design, and the ability to quickly debug and fix issues when they pop up.


This article first appeared on the Siemens Digital Industries Software blog at https://blogs.sw.siemens.com/hardware-assisted-verification/2026/05/05/the-million-dollar-bug-why-hyperscale-verification-cant-afford-delays/