The Use of Synthetic Data in AI Model Training – Transcript
I recently sat down with Zachi Mann to discuss how synthetic data is changing AI training, both in the factory and beyond. To check out the podcast, click here. For those who prefer to read, the full transcript is below.
Spencer Acain: Hello and welcome to the AI Spectrum podcast. In this series, we talk to experts all across Siemens about a wide range of artificial intelligence topics and how they are applied in various products. Today, we are joined by Zachi Mann to discuss AI training in the factory. Welcome, Zachi.
Zachi Mann: Hello, Spencer, it’s great to be here.
Spencer Acain: Thank you for joining us here today. Before we get started, can you tell us a little bit about your background, and a little bit about what got you interested in AI?
Zachi Mann: Sure. So, I have a background in software engineering, with a master’s degree in the field. I joined Siemens more than a decade ago and became part of the Tecnomatix simulation software group. And my focus over the years, maybe even since childhood, was a lot about robotics. This is something that always interested me: how robots are being utilized in various cases. In some way, joining Siemens in that position was already something very natural for me, and I was very happy about that move. About six years ago, I got an opportunity to pretty much land my dream job, so to speak. In this job, I lead a new initiative focused on advanced robotics simulation capabilities, meaning being able to simulate the various intricate environment conditions that robots inside factories typically have to deal with in order to perform very complex tasks, like detecting objects and picking them up, and so on. So, we’re talking about camera vision simulation and all types of other sensors, like force–torque feedback. Obviously, that involves a lot of integration with various AI and machine learning techniques, and also collaboration with startups in the domain of advanced robotics.
Spencer Acain: Sounds very interesting. I’ve been interested in robotics for a while myself, so I can very much understand that sentiment. It sounds like you do a lot of work with AI and advanced robotics in the factory. Can you give us a little background about what you’re currently working on?
Zachi Mann: One of the topics we are currently working on is enabling factory engineers to quickly ramp up their AI and machine learning algorithms, so they can deploy them in a much faster and easier way. Since we come from simulation, we’re always considering what kind of tools and capabilities we can bring with simulation to help in that area. And this is where we get into the domain of synthetic data, meaning being able to create or replicate a digital twin of the factory, of the parts being manufactured, of the objects, of the equipment, so that we are able to train various machine learning algorithms to perform tasks like detecting objects, detecting defects, and so on.
Spencer Acain: You just mentioned synthetic data. Can you tell us what that is?
Zachi Mann: Synthetic data, to put it simply, is any kind of data, typically used for machine learning applications, that is generated using some sort of computer simulation. It can be tabular, like information in tables about users: names, birthdates, and so on. And it can also be visual data, like videos and images, generated out of a simulation, which can be a 3D simulation, a 2D simulation, and so on. But synthetic data in the context of machine learning is also typically provided together with interesting or relevant annotations, meaning being able to detect interesting samples within the images, the tables, or the data, and to tag them in order to identify or classify different types of situations within that data. And the advantage of synthetic data, when it is generated out of a simulation, is that we have perfect knowledge of the environment being created, so we can also give a perfect description of the objects that exist inside the simulation.
Spencer Acain: Wow, that sounds very cool, very impressive to be able to do that on the computer now. But it also sounds like it’d be very challenging because I’ve never heard about anything like this before. It must be fairly new to be able to do this. Can you talk a little bit about some of the challenges with using synthetic data?
Zachi Mann: So, first of all, I don’t think that synthetic data is that far off from what Siemens has been doing for many years now. We are in the business of CAD software, Computer-Aided Design, so we have 3D information about the various objects of the factory being planned with Siemens tools like NX and Solid Edge. And we have the manufacturing processes being designed and simulated in tools like Process Simulate, NX MCD, and others. All of these simulators are essentially generating all sorts of “what if” scenarios for manufacturing engineers to check. But with all of these new trends in machine learning, we actually want the machines to make decisions on their own. So, this is where the machine needs to make a decision instead of the engineer who is planning the simulation or the line. And in that case, the machine requires input in a form it can interpret, together with some label, some annotation, that tells it how to interpret that input. If, let’s say, I have an image and want to detect certain objects, I might expect to get information like the bounding boxes of every object that needs to be detected, together with the type of each object in that image.
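To make the kind of label Zachi describes concrete, here is a minimal sketch of what one annotated training sample might look like. The field names are illustrative, loosely following the common COCO convention, and are not taken from any Siemens tool:

```python
# A minimal sketch of an annotated training sample, as a simulator might
# emit it. Field names are illustrative (loosely COCO-style), not a spec.

def make_annotation(image_id, objects):
    """Pair an image with one label entry per object in the scene."""
    return {
        "image_id": image_id,
        "annotations": [
            {
                "category": obj["type"],  # e.g. "bolt", "bracket"
                # bounding box as [x_min, y_min, width, height] in pixels
                "bbox": obj["bbox"],
            }
            for obj in objects
        ],
    }

sample = make_annotation(
    "frame_0001.png",
    [
        {"type": "bolt", "bbox": [40, 60, 32, 32]},
        {"type": "bracket", "bbox": [120, 30, 80, 50]},
    ],
)
print(len(sample["annotations"]))  # one label entry per object in the image
```

A detector trained on such pairs learns to emit the boxes and categories on its own when shown a new image.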
Spencer Acain: Well, I can see how that would be very useful for AI training, to allow the computer, or the algorithm, to make decisions on its own. But you mentioned that this dovetails with the CAD modeling and 3D modeling that we already do here at Siemens quite a bit. Doesn’t that present challenges in its own right? I know not every company will have all of the models that are required.
Zachi Mann: That’s correct, of course, and that sometimes poses a challenge. Some companies that need synthetic data for their machine learning use cases go to a service company that has expertise in generating 3D scenes of various scenarios to be simulated. These kinds of companies might use expensive scanning equipment to acquire real-world data in order to replicate it in a digital twin, and I think this is a capability we have in some of our Siemens tools as well. Of course, if you want to create these kinds of 3D scenes, CAD data is a must. A lot of customers have already embraced the digital twin and digitalization, so for them this data is typically already available, and the move into generating synthetic data is much easier than for others. For other companies, scanning is becoming much, much cheaper; even with an iPhone 12 or 13 you have a LiDAR and can scan an object and bring it into a simulation. Apart from that, I think there are other challenges if you want to generate synthetic data, because there is always a trade-off between how complex it is to create an accurate scene that will bring you results with only a few data samples, versus creating something very robust that is not specialized to one specific case. There is always some kind of trade-off based on the use case and what kind of performance you actually need from your machine learning algorithm.
Spencer Acain: It sounds like the process of creating these digital environments and setting all of this up might be quite challenging at times. But for those companies that have already invested in digitalization, it might not be as big of a hurdle. You’ve told us a bit about some of the challenges of synthetic data, but can you tell us a bit about the benefits too? I imagine there are quite a few people looking to adopt this over traditional types of data.
Zachi Mann: Generally, I think that if we are about to launch some new AI or machine learning model that will do some work inside the factory, like detecting objects or detecting defects, the first challenge we are faced with is how to actually collect the data we need for training. Typically, if we just collect any kind of data off the internet, scraping it for images and then annotating them, it will be very generic. A model like that will probably not perform well in a manufacturing scenario, where it sometimes has to deliver around 99.9% accuracy. So, we would want to fine-tune the model on a probably smaller set of data samples collected from the exact environment where the machine learning model will be deployed. But the problem in manufacturing is that it doesn’t stop; we manufacture all the time. And if you shut down a station just to take images of objects in the real environment, it means a loss of manufacturing capacity. Sometimes you have to do it after hours. It’s always a bit challenging to get that right. If you need to operate specialized equipment like robots, there is always an additional risk of safety issues or damage to the equipment if you need to drive it based on some AI algorithm that you didn’t first test or train using a digital twin or synthetic data. So, you always have to proceed at a very secure, safe, and slow pace, and that poses a lot of challenges, because it slows down the process of collecting this data, sometimes to the point where it’s not actually feasible. So, companies end up looking for a different solution, or sometimes replacing that with human labor, since humans are typically able to adapt to new situations. An additional challenge is that even when data is collected, the way it gets annotated is not always done correctly.
Sometimes annotation work is done by a third-party company or service, and it is not always done according to the right spec. In many cases, it’s very hard to detect when images, for instance, are not annotated correctly, and you discover it very late, only after the model doesn’t perform well. I think with synthetic data that problem is really solved, because there is perfect information about the scene and the objects, so a simulator can generate, for instance, bounding boxes as very precise, pixel-perfect annotations.
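As an illustration of how a simulator with perfect scene knowledge can emit pixel-perfect boxes, here is a small Python sketch. It derives bounding boxes from a rendered object-ID buffer, the kind of side channel a renderer can produce alongside the image; the tiny hand-written ID map is purely illustrative:

```python
def bounding_boxes(id_map):
    """Compute a pixel-perfect bounding box per object ID from a rendered
    ID buffer (0 = background). Returns a dict of
    {object_id: (x_min, y_min, x_max, y_max)} in pixel coordinates."""
    boxes = {}
    for y, row in enumerate(id_map):
        for x, obj in enumerate(row):
            if obj == 0:
                continue  # background pixel, no object
            if obj not in boxes:
                boxes[obj] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[obj]
                boxes[obj] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes

# A tiny 4x5 "render": object 1 at top left, object 2 at bottom right.
id_map = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 2, 2],
]
print(bounding_boxes(id_map))
# {1: (0, 0, 1, 1), 2: (3, 2, 4, 3)}
```

Because every pixel’s provenance is known exactly, there is no annotator judgment involved, which is the point Zachi makes about annotation errors disappearing.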
Spencer Acain: That does sound like it would be a big benefit over more traditional methods. So, if it’s such a big benefit, and, like you said, it’s not all that different from the CAD and simulation modeling that has been going on in the past, why are we only starting to see synthetic data adopted over the last few years?
Zachi Mann: I think, first of all, there is a surge in deep learning techniques that typically require a lot of data. There was the statement by Toyota about the number of hours, millions of hours of driving, before you can actually get some kind of AI to drive autonomous cars, and I think this gave a lot of momentum and hype to the use of synthetic data for autonomous driving. And especially for deep learning, we now have several well-established network architectures that are being reused over and over, so the algorithms themselves are no longer the challenge; they’re just tools to be used. Now the data is the new bottleneck: having clean, good data. Andrew Ng, the professor from Stanford, is promoting this movement around data-centric AI. He’s saying that this is now the new big thing: how do you clean the data? How do you make sure you use less data, not big data, but data with good quality that solves the problem at hand? Synthetic data is part of that; it really helps in getting clean data that is up to the spec of what is actually required. Again, the downside is that if you want very exact and accurate data that matches reality well, you sometimes have to invest more time in creating a more realistic simulation. You have to put experts on creating those 3D worlds, or get a service company that specializes in these topics. There are also other approaches, like domain randomization, where you don’t specify the environment exactly; instead, the tools randomize a lot of different conditions. Then you get synthetic data with which you can train a neural network to perform a task. It can probably get to some level of accuracy but still miss some cases in the real world. But then you can actually do two steps: you can also fine-tune that with a much smaller data set of real images.
So, you take a neural network trained on pre-annotated synthetic data, and then you fine-tune it with some real images, but many fewer, around 10% of the original amount. There are several studies showing that this is actually enough. So, while synthetic data does not completely eliminate the need to collect real data, it can significantly reduce it, even without requiring specialized expertise in creating very dedicated or very accurate simulations of the environment.
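The two-step recipe Zachi describes, pre-train on plentiful synthetic data and then fine-tune on a small real set, can be sketched with a deliberately tiny model. This is an illustrative toy (a two-parameter linear model fit in pure Python), not the actual deep learning setup:

```python
import random

def sgd_fit(points, w=0.0, b=0.0, lr=0.01, epochs=200):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error.
    Passing in a previous (w, b) continues training, i.e. fine-tunes."""
    for _ in range(epochs):
        for x, y in points:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

rng = random.Random(0)
# "Synthetic" data: cheap and plentiful, but slightly biased vs. reality.
synthetic = [(x, 2.0 * x + 0.3) for x in [rng.uniform(-1, 1) for _ in range(1000)]]
# "Real" data: the true relationship y = 2x, but only ~10% as many samples.
real = [(x, 2.0 * x) for x in [rng.uniform(-1, 1) for _ in range(100)]]

w, b = sgd_fit(synthetic)        # step 1: pre-train on synthetic data
w, b = sgd_fit(real, w=w, b=b)   # step 2: fine-tune on the small real set
print(round(w, 2), round(b, 2))  # ends up close to the true w=2.0, b=0.0
```

The pre-training gets the model most of the way; the small real set only has to correct the residual gap between simulation and reality, which is why far fewer real samples suffice.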
Spencer Acain: That’s very interesting. So, it sounds like you can’t completely remove the need for real data, but if you can cut it by 90%, that would be a huge asset to a lot of AI deployments. I’ve done some research in this area myself, and I’ve seen that training can take hundreds of thousands or millions of images. If you can cut that down by 90%, I’d imagine that would be quite the savings in time and money.
Zachi Mann: Definitely. I think that in some cases simulation can even completely replace real data, but it’s a trade-off, as always, in how much effort you put into the accuracy of the simulation: having all the objects lit correctly and textured correctly and all of that. It’s like all of those movies from Pixar; they really invested a lot in the simulation, and the result is amazing, but so is the investment. And we see that the real need now is to change or refine models all the time, because parts are changing and scenarios are changing. Sometimes conditions change a bit, like the difference in lighting between daytime and nighttime inside a factory, and so on.
Spencer Acain: I see. So, it sounds like there’s a balance between how much time and money you want to invest in the simulation versus how expensive it would be to just stop production and take some real pictures. And I imagine that’s part of the calculus of choosing when to use synthetic data and when to go back to the more traditional methods.
Zachi Mann: I think if you want to choose synthetic data and examine its use for your use case, you need to validate that it is actually possible to simulate in a feasible way, without requiring too many new adaptations to the simulation. Sometimes you can look around; there are many kinds of simulations for many types of use cases that might already solve your issue, so that’s always something to check. If you have to develop your own simulation, it’s always wise to check whether you have the skill set in-house to develop these 3D worlds, or whether you want to outsource that to a service company. But in some cases, especially when the part that needs to be detected changes maybe several times a month, the effort of doing that over and over again just might not be worth it. So, it might be worthwhile to develop some kind of pipeline in which you just fine-tune or retrain with a few images every time the part changes, and that will probably yield better results. But it really depends on the implications of the machine learning task at hand. If your neural network is supposed to detect defects, and the cost of missing a defect is very, very high and can span hundreds of thousands of different parts being manufactured, it might be worth building a very exact simulation to generate proper synthetic training data and get a much more accurate model. So, it’s really a question of return on investment, I guess.
Spencer Acain: That makes sense. We’ve discussed heavily the uses of synthetic data in the factory, and how it can be used to really bring AI to the next level. Are there some other potential uses of synthetic data as well? Is it limited to AI training and for factory use like this? Or can you do other things with it?
Zachi Mann: Synthetic data can, in essence, also be used for validating how well your algorithm performs within a simulation environment. Let’s say we have a robotic station where the robot needs to detect parts, pick them up, and place them inside some machine, and suppose you have a digital twin, a simulation of that station. If I can introduce a virtual camera into that digital twin that is fed with synthetic images generated out of the scene, and I already have a machine learning model that I trained for detecting, maybe, the pose of every object, then I can actually test that within the simulation environment: is the robot able to reach and pick these parts? What is the footprint of the station? And I can answer many other interesting questions with respect to the ability to manufacture and produce these parts and reach the required throughput.
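The kind of in-simulation check Zachi describes could be sketched like this. Both classes are hypothetical stand-ins invented for illustration; a real setup would use an actual digital-twin renderer and a trained detector:

```python
class StubSimulator:
    """Hypothetical digital-twin renderer: returns a frame together with
    the exact ground-truth x-positions (in pixels) of two parts."""
    def render(self):
        truth = [10.0, 42.0]
        return "frame", truth

class StubDetector:
    """Hypothetical trained model whose predictions are slightly noisy."""
    def detect(self, frame):
        return [11.0, 41.5]  # within a couple of pixels of ground truth

def validate_in_sim(model, simulator, n_frames=100, tol_px=5.0):
    """Score a detector against the simulator's perfect ground truth:
    the fraction of frames where every part is located within tol_px."""
    hits = 0
    for _ in range(n_frames):
        frame, truth = simulator.render()
        predicted = model.detect(frame)
        if all(abs(p - t) <= tol_px for p, t in zip(predicted, truth)):
            hits += 1
    return hits / n_frames

score = validate_in_sim(StubDetector(), StubSimulator())
print(score)  # 1.0 for this noise-free stub pair
```

The key property is that the simulator supplies exact ground truth for free, so the whole detect-and-pick loop can be scored before any physical hardware is involved.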
Spencer Acain: So, it sounds like synthetic data can also be used in conjunction with the digital twin to speed up validation for robotic systems.
Zachi Mann: Right. And eventually, if you do that together with digital twins, it saves you a lot of time, because you head off all of these issues that you might otherwise have found only when you commissioned the physical system. I think that’s where the real saving is. Before you actually commission the real system, you can test everything: you can test your algorithm, coupled with the types of sensors you are also using in the scene, together with the specific robot software, and so on. Everything can actually be executed within the simulation, in a digital twin.
Spencer Acain: Well, that sounds like it would be a huge asset for really any manufacturing company, or companies in general, to be able to do all of that on the computer, completely digitally like that. So, is there anything else of interest that you’ve been working on recently, or even outside the company, that is relevant to this and that you’d like to talk about?
Zachi Mann: There is one product that we’re excited about and are currently working on. It’s called SynthAI, short for Synthetic AI. It’s a product that allows anyone to just upload CAD files of their own part or object, and we run the complete pipeline, entirely in the cloud, that eventually generates synthetic data with the uploaded part. But not only that; it also trains various types of machine learning models that are able to detect these parts inside images. For instance, they can detect bounding boxes, or detect an instance segmentation of the part, meaning which pixels of the image the part actually occupies, so you can label every pixel by the type of part that is underneath it, in a sense. This is, I think, the first solution I have seen that really covers end to end, from CAD directly to a completely trained model that you can download and eventually deploy on an edge device in the factory to detect the objects. The system is already being used by some early evaluators, who are reporting amazing results in the ability of this tool to accelerate deployment for various use cases, like throughput analysis, completeness checks of assemblies, detecting parts for a robot to pick inside a bin, and so on.
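The per-pixel labeling Zachi mentions can be pictured with a tiny sketch: given the simulator’s object-ID buffer and a part catalogue, every pixel can be named by the part underneath it. The IDs and part names here are made up for illustration and are not SynthAI’s actual output format:

```python
def label_pixels(id_map, part_types):
    """Replace each object ID with the name of the part under that pixel
    (None for background), i.e. a per-pixel classification of the image."""
    return [[part_types.get(pix) for pix in row] for row in id_map]

part_types = {1: "bolt", 2: "bracket"}  # hypothetical part catalogue
id_map = [
    [1, 1, 0],
    [0, 2, 2],
]
print(label_pixels(id_map, part_types))
# [['bolt', 'bolt', None], [None, 'bracket', 'bracket']]
```

An instance-segmentation model trained on such targets learns to produce this pixel-level labeling from a plain camera image, with no ID buffer available.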
Spencer Acain: That sounds very interesting, but I think we’re about out of time here. Once again, I’ve been your host, Spencer Acain, and I would like to thank Zachi Mann for joining us here.
Zachi Mann: Thank you, Spencer.
Spencer Acain: This has been the AI Spectrum podcast.