
LITTLE DATA FACTORY

CLIENT

Dstl

REQUIREMENT

Dstl asked us to demonstrate the generation of synthetic data that could be used in developing AI tools. Our response was ‘little data factory’, designed to work out the best approach and understand how synthetic multimodal data might be routinely generated as part of some future “Data Factory” capability.

CHALLENGE

We had to develop a way to go from a high-level story given by a user to detailed datasets of many types, including imagery, video, communications, social media and more. These all had to be consistent with one another, and specific events had to happen at the times the user wanted.

SOLUTION

We used the Unity3D games engine, supercharged with third-party libraries and our own, to develop a simulation that could be guided by the high-level scenario and model many entities in a military scenario. As these interacted and played off one another, our solution output data of many types, including some that we fed into ChatGPT to create realistic Tweets.

The need for synthetic data

The demand for synthetic data is huge: according to Gartner, synthetic data will completely overshadow real data in AI models by 2030.

Collecting data in the real world can be hard and very expensive, and sometimes even impossible because of safety or physical constraints. Synthetic generation is not limited in the same way, so plentiful data can be created across many different scenarios. One particular target use for synthetic data is training machine learning (ML) models. These require vast amounts of data, ideally labelled, and obtaining such data in sufficient quantities can be a limiting factor in ML development.

Defence is a sector that can benefit from synthetic data, and Dstl were interested in how synthetic multimodal data might be generated routinely, along the lines of a “Data Factory”. But first they wanted to prove the concept by developing a first implementation and overcoming some particular technical challenges. We carried out this project under Serapis Lot 6 “Understand”, which is managed by Frazer-Nash Consultancy for Dstl, and were supported by SVGC and its subject matter experts in military operations.

From story idea to detailed, coherent multi-modal datasets

The client posed three key challenges relating to synthetic data creation, and finding solutions to each of them was crucial to deciding whether a data factory could succeed. The first was being able to create diverse multi-modal datasets that were coherent and consistent with one another. This is important because ML agents may want to combine different sources of data when making their predictions. The second was being able to seed events, for example requiring that a particular event happens at a particular time. This sounds easy, but because events depend on one another and the data must stay consistent and realistic, it requires real care. The final one was being able to give input to the Data Factory as a high-level storyline created by a human, which it then interprets and converts to a detailed machine-readable input for the model to use.

Putting “cool toys” such as Unity and ChatGPT into action

From the start, we decided to use a computer games engine as the heart of our solution. We would use it to model numerous entities within an environment, their interplay and their reactions to what happens, and to capture data recording the events. Because every output comes from the same simulation, the resulting data would be inherently coherent and consistent. We chose the Unity3D engine, which is extensively used in commercial game development and has a vibrant community of developers whose work can be imported and reused.

We worked with SVGC to develop a realistic scenario in which two opposing forces operate in an area where there are also civilians. This included military vehicles, logistics support vehicles, command posts, drones, communications and more. Together we developed rules for the behaviours of different entities depending on events, which we then implemented in the model.

We created a method for reading in a high-level story specified by a human, using it to initialise the simulation, and then converting it into a series of phases and transitions that the simulation could follow. We also developed a way to seed events through this scripting approach: a model manager influences what happens so that the user's desired event occurs, without having to prescribe every event, which would be very inflexible.

The simulation provides many types of output, including images from fixed cameras, from aboard drones and from satellite, videos from fixed positions and from drones, and communication records, together with all the associated metadata, which can provide the labels for ML training. It also output data that we then used as prompts to ChatGPT to generate realistic social media data, such as Tweets, as if posted by civilians viewing events around them.
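To make the scripting idea concrete, here is a minimal Python sketch of a story expressed as phases with seeded events, plus a prompt built from one logged event. It is an illustration only and not the project's Unity3D implementation: the names `Phase`, `Event`, `run_script` and `tweet_prompt`, the one-second time step, the 5% chance of an unscripted event and the example scenario text are all assumptions made for this example.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Phase:
    """One phase of the story (hypothetical structure)."""
    name: str
    start: int  # simulation time in seconds at which the phase begins
    seeded: Dict[int, str] = field(default_factory=dict)  # time -> required event


@dataclass
class Event:
    time: int
    phase: str
    description: str
    seeded: bool  # True if the user required this event


def run_script(phases: List[Phase], end_time: int, rng: random.Random) -> List[Event]:
    """Step through time, follow phase transitions and force seeded events."""
    ordered = sorted(phases, key=lambda p: p.start)
    log: List[Event] = []
    for t in range(end_time + 1):
        active = [p for p in ordered if p.start <= t]
        if not active:
            continue  # no phase has started yet
        current = active[-1]  # latest phase whose start time has passed
        if t in current.seeded:
            # The user asked for this event at this time, so it always happens.
            log.append(Event(t, current.name, current.seeded[t], True))
        elif rng.random() < 0.05:
            # Stand-in for unscripted behaviour arising from entity rules.
            log.append(Event(t, current.name, "a civilian vehicle changes route", False))
    return log


def tweet_prompt(event: Event, observer: str) -> str:
    """Turn a logged event into a text prompt for a language model."""
    return (
        f"Write a short social media post by {observer}, a civilian who "
        f"saw this {event.time} seconds into the scenario: {event.description}."
    )


# Hypothetical two-phase story with one seeded event.
story = [
    Phase("patrol", start=0),
    Phase("contact", start=60, seeded={90: "a convoy crosses the bridge"}),
]
log = run_script(story, end_time=120, rng=random.Random(0))
seeded_events = [e for e in log if e.seeded]
print(tweet_prompt(seeded_events[0], "a local resident"))
```

Because every record in the log comes from the same run, any imagery metadata, communications record or social media prompt derived from it refers to the same events at the same times, which is the coherence property the approach relies on.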

A significant step towards powerful synthetic data generation

The project demonstrated a solution for generating synthetic datasets designed for training AI agents. It showed that these can be of numerous and wide-ranging data types but still all consistent with one another, and that the whole thing can be run from a high-level input that also includes particular events that should happen. We also tested out a lot of tools and methods and found out what works and, importantly, what doesn’t.

A bright future for synthetic data generation and Data Factory

We have created a new capability for synthetic data generation. An important part of the project was finding ways to overcome major challenges, as well as discovering new ones and addressing those or at least working out how they could be dealt with. Although originally focused on creating datasets for AI and ML model development, the capability also has the potential to be used in experimentation and training. The work has laid a solid foundation for the future. We’re excited to see how the Data Factory concept will develop and how its full potential will be realised.

Synthetically generated data environment
Synthetic data from the Little Data Factory
