The Use of Synthetic Data in AI Model Training

Artificial intelligence (AI) holds great promise for the factory, but training tops the list of challenges that potential AI implementations face. In a factory setting, the required data is often highly specific to a particular facility, assembly line, or product. That makes capturing and annotating appropriate training images a long and expensive process. Researchers and engineers have created a number of methods to address this issue, one of the most prominent being synthetic data. In a recent podcast, I sat down with Zachi Mann, an expert in the field of AI, robotics, and automation, to discuss how synthetic data is changing the face of AI training.

Synthetic data is any type of computer-generated data that takes the place of real data. This includes everything from simple Excel spreadsheets containing names and addresses to fully rendered 3D scenes. In this podcast, we focused on synthetic image data, which is of particularly high value in the manufacturing space, where most tasks rely on accurate visual sensing capabilities. Of course, synthetic image data generation presents its own challenges, namely preparing the required 3D models and digital environments. However, the transition to synthetic data is a natural progression for companies that are already embracing digitization and rely on CAD and simulation capabilities, such as those offered by the Siemens Xcelerator portfolio.

While synthetic data certainly presents challenges, it also offers substantial benefits. First, synthetic image data can reduce the number of real-world images needed for accurate training by up to 80%. This represents tangible savings in both time and money, because capturing real images in a factory is expensive: it ties up actual manufacturing equipment, and the resulting data must be painstakingly annotated. Second, synthetic data can be used to train for edge cases and “what if” scenarios, taking the place of real data that is difficult or impossible to capture. This allows more robust algorithms to be developed, which can be deployed under a wider range of operating conditions than algorithms trained using traditional methods.
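To make the arithmetic behind that first benefit concrete, here is a minimal sketch. Only the "up to 80%" figure comes from the discussion above; the baseline of 5,000 real images and the cost of 2.50 per image are hypothetical placeholders, and the function names are invented for illustration. The estimate also ignores the cost of producing the synthetic images themselves.

```python
import math


def real_images_needed(baseline_real: int, reduction_pct: int) -> int:
    """Real images still required after cutting the baseline by reduction_pct percent."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    # Round up so the estimate never understates the real images required.
    return math.ceil(baseline_real * (100 - reduction_pct) / 100)


def capture_cost_avoided(baseline_real: int, reduction_pct: int,
                         cost_per_image: float) -> float:
    """Capture and annotation cost avoided for the real images no longer needed."""
    avoided = baseline_real - real_images_needed(baseline_real, reduction_pct)
    return avoided * cost_per_image


if __name__ == "__main__":
    baseline = 5000   # hypothetical number of real images without synthetic data
    reduction = 80    # best case mentioned in the text ("up to 80%")
    cost = 2.50       # hypothetical capture + annotation cost per real image
    print(real_images_needed(baseline, reduction))
    print(capture_cost_avoided(baseline, reduction, cost))
```

Because 80% is an upper bound, it is worth running the same calculation with more conservative reductions before building a business case around it.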

In recent years, synthetic data has seen a marked increase in interest and adoption across the industry thanks to several driving factors. Creating the required digital representations of parts and equipment is easier than ever thanks to readily available LiDAR and high-resolution camera equipment. This also helps balance the cost of developing highly complex models against more traditional data capture methods. Another factor is that algorithmic maturity has greatly increased, so further gains from refining proven algorithms are difficult and expensive to achieve. The focus is now shifting to better data, an area that has been largely ignored until now, so that AI systems can more quickly provide meaningful and economical improvement. Synthetic data offers a path to creating smarter, more focused training data in accordance with the data-centric approach to AI proposed by AI experts such as Andrew Ng.

Beyond AI training, synthetic data can also support design and manufacturing across a product’s entire lifecycle. For example, the comprehensive digital twin of production can become an environment for training an AI model before the product is even prototyped. By introducing a virtual camera into the digital twin, it is possible to create synthetic data out of the existing 3D models, which can then facilitate virtual commissioning of the AI-driven robotics system. Engineers can also use synthetic data to run simulations of product iterations, proposed robotics station layouts, and any number of other tasks without having to construct the real systems and products to test those scenarios. Finally, this system can incorporate real data after production begins, allowing for further refinement of both the process and the synthetic data itself.
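The virtual camera idea can be illustrated with a small sketch. This is not the Siemens tooling or any particular simulator's API; it is a simplified pinhole camera model in plain Python, and every name in it (`VirtualCamera`, `box_corners`, `bounding_box`, `generate_annotations`) is hypothetical. It assumes the part is a simple box placed in front of the camera, and it shows why synthetic images come with annotations for free: the scene geometry is already known, so the bounding box can be computed rather than drawn by hand.

```python
import random
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class VirtualCamera:
    """Idealized pinhole camera looking down the +z axis (an assumption)."""
    focal_px: float  # focal length expressed in pixels
    width: int       # image width in pixels
    height: int      # image height in pixels

    def project(self, point):
        """Project a 3D point in camera coordinates to pixel coordinates."""
        x, y, z = point
        if z <= 0:
            raise ValueError("point must be in front of the camera (z > 0)")
        u = self.focal_px * x / z + self.width / 2
        v = self.focal_px * y / z + self.height / 2
        return u, v


def box_corners(center, size):
    """Eight corners of an axis-aligned box standing in for a CAD part."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [(cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
            for dx, dy, dz in product((-1, 1), repeat=3)]


def bounding_box(camera, points):
    """2D box (x_min, y_min, x_max, y_max) of the projected points, clamped to the image."""
    pixels = [camera.project(p) for p in points]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (max(0.0, min(us)), max(0.0, min(vs)),
            min(float(camera.width), max(us)), min(float(camera.height), max(vs)))


def generate_annotations(camera, part_size, count, seed=0):
    """Place the part at random positions and record its label and bounding box."""
    rng = random.Random(seed)
    annotations = []
    for image_id in range(count):
        center = (rng.uniform(-0.5, 0.5), rng.uniform(-0.3, 0.3), rng.uniform(2.0, 4.0))
        bbox = bounding_box(camera, box_corners(center, part_size))
        annotations.append({"image_id": image_id, "label": "part", "bbox": bbox})
    return annotations


if __name__ == "__main__":
    camera = VirtualCamera(focal_px=500.0, width=640, height=480)
    for annotation in generate_annotations(camera, (0.2, 0.2, 0.2), count=3):
        print(annotation)
```

A real pipeline would render full images with lighting, materials, and backgrounds from the digital twin. The sketch covers only the geometric step that turns known 3D positions into 2D labels.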

I found my time talking with Zachi illuminating, as we delved into a wide range of AI and data-related topics. If you are interested in learning how synthetic data is impacting factories, designs, AI, and beyond, please listen to the full podcast or read the transcript linked in this blog.




This article first appeared on the Siemens Digital Industries Software blog at https://blogs.sw.siemens.com/thought-leadership/2022/05/23/the-use-of-synthetic-data-in-ai-model-training/