Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/44328
Title: Training Computer Vision Models with Synthetic Data
Authors: VANHERLE, Bram 
Advisors: Van Reeth, Frank
Michiels, Nick
Issue Date: 2024
Abstract: Neural networks are a powerful machine learning technique that can solve many complex problems. Over the years, they have emerged as an exceptional tool for pattern recognition. One of the most important subfields of pattern recognition is computer vision, where neural networks have enabled significant progress on many tasks that were previously hard to solve with algorithmic or more primitive learning approaches, including classification, object detection, segmentation, and depth estimation. Neural network models learn to solve problems by training on many example solutions. The more parameters such a model has, the more powerful it is, but more parameters also mean that more data is needed to train it. To train a model for an advanced problem such as vision, a large amount of labeled data from the relevant domain is therefore required. Manually collecting and annotating this data is an arduous task. For example, a robotics engineer who wants to train a computer vision model to detect apples on a tree needs to take hundreds of images of apple trees and manually point out the location of each apple. Moreover, humans can hurt data quality by introducing errors and biases into the dataset: the engineer may miss some apples or mistake leaves for apples. These obstacles to data acquisition hinder the application of deep learning to arbitrary computer vision problems.

An alternative to manual dataset creation is synthetic data generation. Images are created from a scene description using image generation techniques, and a semantic label for the target problem is derived from the same scene description. This yields image and label pairs that can be used to train a deep learning model. Image generation often relies on a render engine, but learning-based approaches also exist. Synthetic data offers a way to obtain large quantities of labeled data with only a fraction of the effort of manually creating a whole dataset, and the annotations are more accurate than human-labeled data. In the apple detection case, the engineer can download 3D models of trees and apples, write a script that randomly places apples in the trees, render the resulting scenes, and export the apple locations, giving them a dataset to train the apple detection model.

However, synthetic data is still not a perfect solution for data acquisition. The main issue is that rendered images do not look exactly like real photographs, because light is difficult to simulate accurately and because some steps of the camera capture process are not modeled in renderers. This difference between the synthetic training images and the real target images is called the domain gap: a model trained on rendered images performs worse when applied to actual photographs. Another issue is that, to render an accurate image, a renderer needs as much information as possible about the scene. For example, when using a graphics engine, textured 3D models of all objects must be available and the world in which the objects are placed must be modeled. This modeling must be done as accurately as possible to reduce the domain gap, which costs significant effort. To create a realistic apple tree dataset, the robotics engineer must have highly detailed and properly textured 3D models of a tree and an apple.
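The label export step in the workflow sketched above is purely geometric: once the scene description fixes the object poses and the camera parameters, the 2D annotations follow automatically. Below is a minimal illustrative sketch of that idea using only NumPy and a simple pinhole camera; the apple radius, camera intrinsics, and random scene layout are assumptions made for the example and are not the toolkit described in this thesis.

    # Minimal sketch: deriving 2D bounding-box labels from a scene description.
    # Assumptions (not from the thesis): a pinhole camera, spherical apples of
    # known radius, and coordinates given in the camera frame. No rendering.
    import numpy as np

    def project(points_cam, fx, fy, cx, cy):
        """Project 3D points in camera coordinates to pixel coordinates."""
        x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
        u = fx * x / z + cx
        v = fy * y / z + cy
        return np.stack([u, v], axis=1)

    def apple_bbox(center_cam, radius, fx, fy, cx, cy):
        """Approximate axis-aligned 2D box for a sphere at center_cam."""
        # Sample the sphere's extreme points along the camera x and y axes.
        offsets = radius * np.array([[1, 0, 0], [-1, 0, 0],
                                     [0, 1, 0], [0, -1, 0]], dtype=float)
        pts = project(center_cam + offsets, fx, fy, cx, cy)
        (u_min, v_min), (u_max, v_max) = pts.min(axis=0), pts.max(axis=0)
        return u_min, v_min, u_max, v_max

    # Hypothetical scene description: random apple centers in front of the camera.
    rng = np.random.default_rng(0)
    apples = rng.uniform([-1.0, -1.0, 3.0], [1.0, 1.0, 6.0], size=(5, 3))

    labels = [apple_bbox(c, radius=0.04, fx=800, fy=800, cx=320, cy=240)
              for c in apples]
    for box in labels:
        print(["%.1f" % v for v in box])

In a real pipeline the same principle applies: the generator that places the objects also knows exactly where they are, so annotations such as boxes, masks, or depth come for free with every rendered image.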
Ideally, the engineer also needs multiple instances of each object to obtain a robust model. Additionally, the scene must be composed realistically: the apples need to be placed in plausible positions on the tree, similar to where they would be in real life, and the environment in which the tree stands, along with the other objects that can appear near it, has to be modeled as well.

This thesis presents an in-depth exploration of synthetic data within the context of computer vision, examining the challenges and limitations it faces. Across six chapters, we propose solutions that aim to make synthetic data a more powerful tool for applying AI to arbitrary computer vision problems. The research consists of two parts: the first focuses on techniques for creating synthetic data, and the second examines how best to train computer vision models on that data.

The first chapter introduces a toolkit built on a traditional game engine that enables fast synthetic data generation. The method leverages the availability of 3D models and knowledge of the target domain to generate high-quality training data. We showcase the capabilities of this approach through several example datasets; data generated by this toolkit is used throughout the thesis.

A traditional render engine needs textured 3D models and a representation of the environment to create synthetic data. The next chapter introduces a method that leverages novel view synthesis to overcome these requirements. Gaussian Splatting is used to learn a representation of the target objects, which are then rendered in plausible positions within RGB-D background images. We show that novel view synthesis allows for high-quality data generation and outperforms less informed cut-and-paste techniques.

Generative models are machine learning models that can learn to create images and, thus, synthetic data. The third chapter extends an existing method for styled handwritten text generation by improving its ability to generate rare characters and the overall quality of the generated images, an aspect largely ignored in existing research. We show that our contributions to input preparation and model regularization help the model outperform numerous competing approaches, yielding a valuable synthetic data generation method.

When generating data with a render engine, any imaginable scene configuration can be created: the creator of the simulation has complete control over the placement of objects and cameras, as well as the configuration of lights. The fourth chapter explores whether this capability should be used to generate data that closely resembles the target domain, or whether random synthetic data is sufficient for good generalization. We uncover several interesting insights, demonstrating how a model trained on 800 images can outperform one trained on 52,000 images.

Data augmentations can help overcome the domain gap by widening the training data distribution. A wide range of possible augmentations exists, and it is unclear why some work better than others for a specific sim-to-real case. The fifth chapter introduces two metrics that help predict the performance of an augmentation policy and leverages them to automatically find good policies using a genetic learning algorithm. We show that an object detection model trained with this strategy outperforms models trained with random augmentation strategies and is on par with active domain adaptation methods.
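The policy search described in the last paragraph can be read as a standard genetic algorithm in which each individual is an augmentation policy and each policy is scored by a fitness function. The sketch below illustrates that loop only; the policy encoding, the candidate augmentations, and the placeholder fitness function are assumptions for the example and are not the two metrics proposed in the thesis.

    # Minimal sketch of a genetic search over augmentation policies.
    # Assumptions (not from the thesis): a policy is a list of
    # (name, magnitude) pairs, and fitness() is a dummy placeholder.
    import random

    AUGMENTATIONS = ["blur", "noise", "color_jitter", "contrast", "cutout"]
    POLICY_LEN, POP_SIZE, GENERATIONS = 3, 20, 10

    def random_policy():
        return [(random.choice(AUGMENTATIONS), round(random.random(), 2))
                for _ in range(POLICY_LEN)]

    def fitness(policy):
        # Placeholder: the thesis scores policies with two metrics that
        # predict sim-to-real performance; here we return a dummy value.
        return sum(m for _, m in policy) + random.random()

    def crossover(a, b):
        cut = random.randint(1, POLICY_LEN - 1)
        return a[:cut] + b[cut:]

    def mutate(policy, rate=0.3):
        return [(random.choice(AUGMENTATIONS), round(random.random(), 2))
                if random.random() < rate else op for op in policy]

    population = [random_policy() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:POP_SIZE // 2]          # keep the fittest policies
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children

    best = max(population, key=fitness)
    print("best policy:", best)

The appeal of such a search is that evaluating a cheap proxy metric for each candidate is far less expensive than training a full detector per policy, which is what makes automatic policy discovery practical.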
In the final chapter, we use synthetic data to address the problem of landmark detection in images of tools. To this end, we create a dedicated synthetic dataset and introduce a specialized architecture for learning keypoint detection from synthetic data that incorporates intermediate supervision. Our results demonstrate that this method effectively detects keypoints in complex real-world images of tools.
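Intermediate supervision in heatmap-based keypoint detection typically means that predictions are supervised not only at the network output but also at one or more intermediate stages, so that later stages refine earlier ones. The toy two-stage PyTorch sketch below shows that pattern; the layer sizes, number of keypoints, and loss weighting are illustrative assumptions and do not reproduce the architecture proposed in this chapter.

    # Minimal sketch of intermediate supervision for heatmap-based keypoint
    # detection. Architecture and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn as nn

    NUM_KEYPOINTS = 8

    class TwoStageKeypointNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            )
            # Stage 1 predicts heatmaps that are supervised directly.
            self.stage1 = nn.Conv2d(32, NUM_KEYPOINTS, 1)
            # Stage 2 refines the prediction from features + stage-1 heatmaps.
            self.stage2 = nn.Sequential(
                nn.Conv2d(32 + NUM_KEYPOINTS, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, NUM_KEYPOINTS, 1),
            )

        def forward(self, x):
            feats = self.backbone(x)
            h1 = self.stage1(feats)
            h2 = self.stage2(torch.cat([feats, h1], dim=1))
            return h1, h2  # both stages are supervised during training

    model = TwoStageKeypointNet()
    criterion = nn.MSELoss()

    images = torch.randn(2, 3, 64, 64)             # synthetic training batch
    target = torch.rand(2, NUM_KEYPOINTS, 64, 64)  # ground-truth heatmaps
    h1, h2 = model(images)
    # Intermediate supervision: the loss combines both stages' predictions.
    loss = criterion(h1, target) + criterion(h2, target)
    loss.backward()
    print(float(loss))

Supervising the intermediate stage gives the network a direct learning signal early in the computation, which is a common remedy when the final prediction alone provides too weak a gradient for deep refinement stages.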
Document URI: http://hdl.handle.net/1942/44328
Category: T1
Type: Theses and Dissertations
Appears in Collections: Research publications

Files in This Item:
File: Portfolio1.pdf (embargoed until 2029-09-29)
Description: Published version
Size: 64.77 MB
Format: Adobe PDF