CPU Cache and CFD – a Core Friendship

By Stephan Gross

Why don’t we have a single number that tells us how fast the system will do the simulation?

Jean Doe, Head of Simulation, Engineering Ltd.

It is just not that easy! As we tried to point out in earlier blogs, there is not one single criterion to measure the computing speed of a processor. Nor is there a synthetic benchmark that can tell how processors will really perform in your daily tasks.

If your tasks are Computational Fluid Dynamics simulations, published results from hardware and software vendors can give you an idea of which hardware is suited for your application. But how do processors actually solve your CFD simulation? Let us just imagine your job was not a CFD simulation but renovating your house…

The easy part: the more, the merrier!

If you renovate your house, just invite all your friends* and share the work among them. The job will be done in no time if you find, say, 95 extra pairs of helping hands. That’s the same thing processors do. The latest 4th Gen AMD EPYC™ processors can share all tasks among up to 96 cores. This way you will not only finish earlier, but altogether you will also save effort – or energy!

However, not all friends are the same. We all have that one friend whom we ask for help – and in the end, doing the job alone would have been faster. But there are also those who put in mad hustle to get more work done. Processor cores are just the same. Some are faster; some are slower. That’s due largely to clock speed, measured in gigahertz. But beware! Some people just look extremely stressed and busy – and don’t really finish much in all that hustle. For processors, this is why it’s important to look at “instructions per cycle” – how much work a processor actually gets done in each clock cycle.
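As a rough sketch of that point, peak throughput can be estimated as cores × clock speed × instructions per cycle. The function name and the two example processors below are made up for illustration; they are not measurements of any real chip, and real CFD performance also depends on memory and cache.

```python
def peak_instructions_per_second(cores: int, clock_ghz: float, ipc: float) -> float:
    """Idealized upper bound: every core retires `ipc` instructions per cycle."""
    return cores * clock_ghz * 1e9 * ipc


# Two hypothetical processors (illustrative numbers only).
busy_looking = peak_instructions_per_second(cores=8, clock_ghz=4.0, ipc=1.0)
steady_worker = peak_instructions_per_second(cores=8, clock_ghz=3.0, ipc=2.0)

print(f"high clock, low IPC  : {busy_looking:.2e} instructions/s")
print(f"lower clock, high IPC: {steady_worker:.2e} instructions/s")
```

In this toy comparison the friend with the lower clock speed still finishes more work, which is why gigahertz alone says little.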

Fast friends finish faster?

The issue is: even if you invite the most productive bunch of friends to your construction site, there is no guarantee that you will finish fast! It is also a matter of tools and where you store them. Let’s say one of your friends has a workshop with all the tools humans have ever invented. In our computer, this workshop would equate to the hard drive, the main storage. For your CFD simulation, you should pick all the relevant tools and machines and load them into your van before you go to the construction site. This van represents our Random Access Memory, RAM for short, or just memory.

Once you arrive at your renovation site, you put a limited selection of the most important tools for this job into your toolbox, which you share with your 95 friends. In a processor, this toolbox is called Level 3 CPU cache, L3 for short. Everybody also has a toolbelt where they put a few favorite tools, but these aren’t shared; they are just for one worker. (Ask your friends, do they like others to reach into their pocket??) This represents our L2 CPU cache, where there’s only enough space for some tools.

While doing the actual renovation, everybody has a tool in hand and hopefully a set of instructions in the other. This is the processor’s closest cache, with Level 1 data (our tool) plus Level 1 instructions. As it was with L2, this is only accessible for one friend — don’t rip the screwdriver out of their hands! Put it into the shared toolbox. Likewise, processor cores are polite; they wait for the tool to be laid into the L3 CPU cache before they get their hands on them.

L1 instructions and data are stored in hand. If you need them, they are right there. But there is only space for one tool and a small instruction set. More can be stored in L2, but to get to it, our friends need to put away the stuff in hand and reach into their pocket.

Over the renovation project, there will be countless occasions where the tools you need are not in your hand or in your pocket, so you have to grab them from the toolbox, our L3 CPU cache in the processor world. There is far more space here, but you also take more time to reach it.

Things take even more time if you must go to the van because you forgot to put everything in your toolbox. Walking to the van takes a few minutes, but this is still business as usual, on any construction site or in any CFD simulation. Most of the data used for CFD is already present in memory.
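To make the hand, pocket, toolbox and van picture a little more concrete, here is a minimal sketch that computes an average access time as a weighted sum over the levels. All latencies and hit fractions are assumptions chosen only to illustrate the idea; they are not measured values for any AMD EPYC processor.

```python
# Assumed latencies per level in nanoseconds (illustrative, not measured).
LATENCY_NS = {"L1": 1.0, "L2": 4.0, "L3": 15.0, "RAM": 100.0}


def average_access_ns(fractions: dict) -> float:
    """Weighted average latency; `fractions` says where accesses are served."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(share * LATENCY_NS[level] for level, share in fractions.items())


# Hypothetical split of accesses: a bigger toolbox (L3) catches more requests
# that would otherwise mean a walk to the van (RAM).
small_toolbox = {"L1": 0.90, "L2": 0.05, "L3": 0.02, "RAM": 0.03}
big_toolbox = {"L1": 0.90, "L2": 0.05, "L3": 0.04, "RAM": 0.01}

print(f"small toolbox: {average_access_ns(small_toolbox):.2f} ns per access")
print(f"big toolbox  : {average_access_ns(big_toolbox):.2f} ns per access")
```

Even though only a few percent of accesses move from the van to the toolbox in this made-up example, the average drops noticeably because the walk to the van is so much slower than everything else.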

What really would ruin your day is if you forgot something at the workshop and had to drive back and forth… but that should never happen in the world of CFD simulation (“swapping”).

Making your friends more productive

It’s clear that the ideal scenario would be for your toolbox to have space for more tools. That way the renovation would finish faster, because your most productive friends would waste less time walking to the van. And this is exactly what AMD has done with their AMD 3D V-Cache™ technology. They added extra L3 CPU cache – which now exceeds one gigabyte! I can still recall the days when computers had whole hard drives of that size! Also, bigger toolboxes are more practical than fancy working trousers with 17 more pockets – or a third hand for each of your 95 friends!

Since AMD brought this huge leap forward to the market by introducing AMD EPYC™ processors with AMD 3D V-Cache technology in 2022, our faster friends have also upgraded their vans. The latest generation can use more memory than its predecessor, and faster DDR5 memory at that.

Furthermore, the introduction of the AMD EPYC™ 9654 processor set a new high-water mark for total core count in one x86 processor. But now, the AMD EPYC™ 9684X has 96 cores AND comes equipped with a total of 1,150 MB of L3 cache thanks to AMD 3D V-Cache. Depending on the job at hand, this means you can fit significantly more tools into your toolbox.

Our benchmark charts show that in bigger, industry-standard cases with large meshes, the additional CPU cache directly contributes to a speedup in Computational Fluid Dynamics simulations:

The diagram above shows the relative performance of all CFD benchmarks performed in Simcenter STAR-CCM+. Green shades represent the standard 4th Gen AMD EPYC™ processors, whereas the blue shaded bars show the performance of the 3D V-Cache enabled counterparts.

Let’s look at the performance of the previously fastest processor for CFD – the AMD EPYC™ 7773X. This 3rd Gen AMD EPYC processor has 64 cores and 768 MB L3 CPU cache. Thanks to AMD 3D V-Cache, it outperformed the AMD EPYC 7763 significantly for CFD applications – with the same number of cores. That speedup came from the tripled L3 CPU cache size.

Where previous-generation processors needed 64 cores to run that fast, the 4th Gen AMD EPYC processors are even faster with half the core count! The 9384X with 32 cores outperforms the 7773X with 64 cores by 27%.

If 32 cores are enough to beat former 64-core champions – what happens if we take the core count to the max? The recently introduced AMD EPYC 9654 features 96 cores and delivers a speedup of 93% compared to the 3D V-Cache enabled 3rd Gen leader, despite having only half the CPU cache. What happens if you triple that?

Then it is time to crown a new king, as the AMD EPYC 9684X squeezes a whopping 118% speedup out of the 1,150 MB V-Cache.
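The percentages quoted above can be put side by side with a little arithmetic. This sketch assumes that "a speedup of X%" means a relative performance of 1 + X/100 against the AMD EPYC 7773X, and that "half the CPU cache" means 384 MB for the 9654; both are readings of the text, not additional measurements.

```python
# Speedups over the AMD EPYC 7773X as quoted in the text (7773X = baseline).
SPEEDUP_VS_7773X = {"7773X": 0.00, "9384X": 0.27, "9654": 0.93, "9684X": 1.18}

# (cores, L3 cache in MB). The 384 MB for the 9654 is derived from
# "half the CPU cache" of the 7773X and is therefore an interpretation.
CORES_AND_L3_MB = {"7773X": (64, 768), "9654": (96, 384), "9684X": (96, 1150)}


def relative_gain(new: str, old: str) -> float:
    """Gain of `new` over `old`, both expressed relative to the 7773X."""
    new_perf = 1.0 + SPEEDUP_VS_7773X[new]
    old_perf = 1.0 + SPEEDUP_VS_7773X[old]
    return new_perf / old_perf - 1.0


def l3_mb_per_core(name: str) -> float:
    cores, l3_mb = CORES_AND_L3_MB[name]
    return l3_mb / cores


cache_gain = relative_gain("9684X", "9654")
print(f"9684X over 9654 at the same core count: {cache_gain:.1%}")
for name in CORES_AND_L3_MB:
    print(f"{name}: {l3_mb_per_core(name):.1f} MB of L3 per core")
```

Under these assumptions, the extra cache alone accounts for a gain of roughly 13% between the two 96-core processors, averaged over the benchmarks in the chart.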

Just like the necessary tools differ for each renovation project, some simulations in Simcenter STAR-CCM+ benefit more from AMD 3D V-Cache than others. Because of that, we looked at benchmarks covering several typical use cases, from a small 3 million cell mesh around a ship hull combined with multiphase physics to vehicle thermal management simulations with nearly 180 million cells – as shown in the first video above!

The use cases with bigger meshes benefit especially from the 4th Gen AMD EPYC processors. In all use cases, the processors with AMD 3D V-Cache are faster than any frequency-optimized counterpart.

Using family-scale effects

If you renovate your house, you should not only rely on your friends. What about your partner, your family, your kids? Everybody can bring their bunch of friends and share the work among them! This typically happens on CFD clusters, where several computational nodes with lots of cores work on one simulation.

If it were your renovation project, the big crowd of people might help… but the more crowded the workplace is, the more everyone gets in each other’s way. You might expect processors to behave this way too, right?

Similarly, if you distribute small simulations with a few million cells over several nodes, you will not see a linear speedup. And if you spread a 3 million cell mesh onto eight nodes with 2×96 cores, these 1,536 cores are only twice as fast as the initial 2×96. Not even 3D V-Cache can change this. But as every skilled CFD engineer knows, 3 million cells are no match for modern processors and for efficient solvers like Simcenter STAR-CCM+.
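One way to see why the small case stops scaling is to count how many mesh cells each core gets to work on. The sketch below uses the mesh sizes and the 2×96 cores per node from the text; the idea that very few cells per core leave too little work compared with the coordination overhead is a general rule of thumb, not a threshold stated in this article.

```python
def cells_per_core(cells: int, nodes: int, cores_per_node: int = 2 * 96) -> float:
    """Average number of mesh cells each core has to handle."""
    return cells / (nodes * cores_per_node)


for cells in (3_000_000, 100_000_000):
    for nodes in (1, 8):
        share = cells_per_core(cells, nodes)
        print(f"{cells:>11,} cells on {nodes} node(s): {share:>10,.0f} cells per core")
```

On eight nodes, the 3 million cell mesh leaves each of the 1,536 cores with fewer than 2,000 cells, while a 100 million cell mesh still gives every core tens of thousands of cells to chew on.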

But if you have a CFD simulation with 100 million cells or more, Simcenter STAR-CCM+ and AMD 3D V-Cache just roll up their sleeves and get the job done. Now the CPU cache size pays back even more than before!

Now, if you were renovating not your home but a castle, there would be plenty of room for everyone. In this case, short paths to tools pay back even more, because people working in different rooms would otherwise have to walk through the whole castle to reach the toolbox. Likewise, when you add nodes to the cluster, you see even better performance with 3D V-Cache technology.

With superlinear scaling from AMD EPYC processors, eight computational nodes are not just eight times faster but a whopping 11.12x faster! It is phenomenal how new generations of processors make efficient solvers run even faster than should logically be possible.
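Scaling figures like these are often summarized as parallel efficiency: the measured speedup divided by the factor by which the hardware was increased. A minimal sketch using the two numbers quoted in this article (the function and variable names are illustrative):

```python
def parallel_efficiency(speedup: float, resource_factor: float) -> float:
    """1.0 means linear scaling; above 1.0 is superlinear, below is sublinear."""
    return speedup / resource_factor


# 3 million cell mesh: eight nodes are only twice as fast as one node.
small_mesh = parallel_efficiency(speedup=2.0, resource_factor=8)

# Superlinear case: eight nodes are 11.12x faster than one node.
superlinear_case = parallel_efficiency(speedup=11.12, resource_factor=8)

print(f"small mesh on 8 nodes : {small_mesh:.0%} efficiency")
print(f"superlinear on 8 nodes: {superlinear_case:.0%} efficiency")
```

An efficiency above 100% is plausible here because adding nodes also adds cache: with eight toolboxes instead of one, a larger share of the simulation's data fits close to the cores.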

“AMD, the AMD arrow logo, EPYC, AMD 3D V-Cache and combinations thereof are trademarks of Advanced Micro Devices, Inc.”

* This analogy is meant for non-engineers. We engineers do know how CPU cache works, we don’t need silly metaphors. And also, we don’t have 95 friends.

This article first appeared on the Siemens Digital Industries Software blog at