The Joy of Faking It – Reducing Security Risks with Synthetic Data


PZ teams love greenfield development, but in some ways that is playing the software development game on “easy” mode – existing systems with years of features, large user bases and lots of accumulated data are often more challenging. For greenfield projects we typically bake in a set of fake APIs with minimal data (if we’re consuming an API built by a different team) or methods in our own code to seed data stores with fake data (if we’re also building the API). This set of fake data can grow and evolve with the product. Although this takes discipline, it can be managed alongside similar tasks like writing unit tests and keeping dependencies up to date.

The situation is different for a so-called “brownfield” or legacy system – such a system may have many years of real-world data accrued, and it can be very tempting for stakeholders to want to use this data for manual testing, especially if they are already in the habit of doing so.

Unfortunately, using real-world data for testing can be a recipe for disaster. There have been many instances where lax security on non-production systems, coupled with the use of real-world production data, has resulted in data breaches. In 2022 up to 9.7 million Optus customer details were scraped from an unsecured API – an insider claimed this was caused by a test API being inadvertently exposed to the public internet, a claim which Optus denies. A more recent example is API security testing firm APIsec, which in April 2025 left a database used for “testing purposes” – containing 12TB of customer data and API security information – unsecured on the internet (https://www.techradar.com/pro/security/top-api-testing-firm-apisec-exposed-customer-data-during-security-lapse).

Under Australian law, company directors can face personal liability for data breaches under the Notifiable Data Breaches (NDB) scheme. In the wake of the high-profile Medibank and aforementioned Optus breaches, the penalties for serious or repeated breaches were raised – up to $2.5 million for individuals, and up to $50 million for corporations. This has created further incentive to move away from using real-world data, but the task can seem so daunting that it fails to gain traction.

Below is a real-world example of how we achieved this on a project PZ was working on. Your project is unlikely to face exactly the same set of constraints, but we hope that sharing what we did gives you some ideas on how you might tackle it yourself.

Project Background

The project we inherited was almost 10 years old and consisted of a set of application-specific APIs backed by a relational database. These APIs served up the data for a front-end developed in a slew of different UI technologies – the modernisation of that will have to be the subject of a follow-up article. These application-specific APIs (from here on referred to as “the application”) also integrated with other internal systems via “shared” APIs, which aggregated and exposed a number of core domain objects for the application to read, but not modify.

Historically, developers spun up a new local instance of the application and its database, connected to the VPN, and pointed their local instance at dev copies of these internal APIs running inside the internal network, which used a backup of production as their data set. In the test/UAT/staging environments the application instance also had a backup of the production application database. In the test environment, for example, a user could log in and see and interact with a recent copy of the same data they could access in production. Much of this data was sensitive in nature and contained a lot of PII.



[Image: a vintage-style robot with a half teal, half yellow face, standing in front of a wall of filing cabinets and holding a theatrical mask.]

Improving the Local Dev Story First 

Our first step was to remove the dependency on these “shared” APIs and the VPN for local development. In the application, all access to the “shared” APIs was via proxy wrappers injected at run-time by a dependency injection (DI) container. This had been set up by the previous development team and allowed the application code to be unit-tested in isolation from the “shared” APIs. We created a number of “fake” implementations of these API proxies, and a configuration switch to inject either all the real proxies (the existing state of affairs) or all the “fake” ones (our synthetic data APIs). This allowed new developers to become productive more quickly (receiving and configuring VPN access was sometimes a slow process) and to keep working during outages and issues with the internal “shared” APIs.
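
As a rough illustration, here is a minimal sketch of that switch in TypeScript. The CustomerApi interface, its real and fake implementations, and the USE_FAKE_APIS flag are all hypothetical names – the real project had hundreds of proxy methods registered through its DI container – but the shape of the pattern is the same.

```typescript
// Sketch of the real/fake proxy switch. CustomerApi, HttpCustomerApiProxy,
// FakeCustomerApi and USE_FAKE_APIS are hypothetical names for illustration.

interface Customer {
  id: string;
  name: string;
}

// The proxy interface the application code depends on.
interface CustomerApi {
  getCustomer(id: string): Promise<Customer>;
}

// Real implementation: calls the internal "shared" API (requires the VPN).
class HttpCustomerApiProxy implements CustomerApi {
  constructor(private readonly baseUrl: string) {}

  async getCustomer(id: string): Promise<Customer> {
    const response = await fetch(`${this.baseUrl}/customers/${id}`);
    if (!response.ok) {
      throw new Error(`Shared API returned ${response.status}`);
    }
    return (await response.json()) as Customer;
  }
}

// Fake implementation: serves synthetic data (in the real project this read
// from the local synthetic data store rather than an in-memory map).
class FakeCustomerApi implements CustomerApi {
  private readonly customers = new Map<string, Customer>([
    ['c-001', { id: 'c-001', name: 'Alex Example' }],
  ]);

  async getCustomer(id: string): Promise<Customer> {
    const customer = this.customers.get(id);
    if (!customer) {
      throw new Error(`No synthetic customer with id ${id}`);
    }
    return customer;
  }
}

// The single configuration switch: all real proxies, or all fakes.
function createCustomerApi(useFakes: boolean): CustomerApi {
  return useFakes
    ? new FakeCustomerApi()
    : new HttpCustomerApiProxy('https://shared-apis.internal.example');
}

// e.g. USE_FAKE_APIS=true npm start
export const customerApi = createCustomerApi(process.env.USE_FAKE_APIS === 'true');
```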

Real-Looking Data

It soon became apparent that hard-coding static responses in our fake APIs would not scale to a real-world volume of synthetic data. Instead, we created a SQLite database containing a representation of the business domain sufficient to provide all the data our fake versions of the APIs needed, and wrote some code to populate that database with real-world-looking data.
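
A minimal sketch of what that could look like, assuming a Node/TypeScript stack with the better-sqlite3 and @faker-js/faker packages (the customers table and its columns are illustrative, not the project’s actual schema):

```typescript
// Sketch of seeding a small SQLite store that backs the fake APIs.
// The schema is illustrative, not the project's actual domain model.
import Database from 'better-sqlite3';
import { faker } from '@faker-js/faker';

const db = new Database('synthetic.db');

db.exec(`
  CREATE TABLE IF NOT EXISTS customers (
    id      TEXT PRIMARY KEY,
    name    TEXT NOT NULL,
    email   TEXT NOT NULL,
    address TEXT NOT NULL
  );
`);

const insert = db.prepare(
  'INSERT INTO customers (id, name, email, address) VALUES (?, ?, ?, ?)'
);

// Only seed when the store is empty, so the startup process stays idempotent.
const { n } = db.prepare('SELECT COUNT(*) AS n FROM customers').get() as { n: number };
if (n === 0) {
  const seedMany = db.transaction((total: number) => {
    for (let i = 0; i < total; i++) {
      insert.run(
        faker.string.uuid(),
        faker.person.fullName(),
        faker.internet.email(),
        faker.location.streetAddress()
      );
    }
  });
  seedMany(500); // enough rows to feel like real-world data volumes locally
}
```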

For this we used a library – most frameworks have one or more packages that can create real-looking people, addresses, orders and the like. In Ruby there is Faker, in .NET there is Bogus, and in JavaScript there is faker.js. We created a startup process that seeded this fake data if it wasn’t already present. The surface area of the “fake” APIs was large – maybe a few hundred methods – but many of the methods were only used in a few niche areas. Working incrementally over a few sprints (in parallel with our regular feature delivery), we followed the Pareto principle and implemented the small subset of fake APIs necessary for the major parts of the application to work. Going forward, as people hit roadblocks they would implement any required synthetic API methods. Returning explicit errors from not-yet-implemented synthetic methods prevented silent failures.
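
That last point is worth illustrating. Here is a sketch of the “fail loudly” approach, again in TypeScript with hypothetical names (InvoiceApi and its methods are invented for the example):

```typescript
// Sketch of failing loudly for fake API methods that haven't been implemented yet.
// InvoiceApi and its methods are hypothetical names.

class NotImplementedInFakeError extends Error {
  constructor(method: string) {
    super(
      `Synthetic API method '${method}' is not implemented yet. ` +
      `Add it to the fake implementation if your feature needs it.`
    );
    this.name = 'NotImplementedInFakeError';
  }
}

interface Invoice {
  id: string;
  total: number;
}

interface InvoiceApi {
  getInvoice(id: string): Promise<Invoice>;
  recalculateTotals(id: string): Promise<void>;
}

class FakeInvoiceApi implements InvoiceApi {
  async getInvoice(id: string): Promise<Invoice> {
    // Implemented because the main invoice screens need it.
    return { id, total: 1234.56 };
  }

  async recalculateTotals(_id: string): Promise<void> {
    // Niche method nobody has needed yet: fail loudly rather than silently no-op.
    throw new NotImplementedInFakeError('InvoiceApi.recalculateTotals');
  }
}
```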

Local Data Volume

Local development had always suffered from the application itself having no data – the only local application-specific data was what you entered for yourself over time. Now that we had local versions of the “shared” APIs serving up synthetic data, the next step was to extend the synthetic data generation process to populate the local “real” application database as well. The domain model for some of these processes was quite large and involved, and creating large volumes of this data slowed down local startup times, so we moved that generation to a background process. Areas of the application we weren’t actively working on were not seeded up front; instead we took an “on-demand” approach, and developers extended the seed process as new areas of the application were developed or existing areas were otherwise touched.
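
One way to structure that split, sketched in TypeScript (seedCoreReferenceData and seedLargeDomainAggregates stand in for the project’s own seed steps and are hypothetical):

```typescript
// Sketch of splitting the seed into a fast startup phase and a deferred bulk phase.
// seedCoreReferenceData and seedLargeDomainAggregates are hypothetical placeholders.

async function seedCoreReferenceData(): Promise<void> {
  // Small, essential data the application needs before it is usable
  // (lookup tables, a handful of users, and so on).
  console.log('Core reference data seeded.');
}

async function seedLargeDomainAggregates(): Promise<void> {
  // Thousands of records for the larger, more involved domain models.
  // Slow, so it must not block startup.
  console.log('Bulk domain data seeded.');
}

async function startApplication(): Promise<void> {
  await seedCoreReferenceData(); // blocks startup, but is quick

  // Fire and forget: the developer can start working while this runs.
  void seedLargeDomainAggregates().catch((err) =>
    console.error('Background seed failed:', err)
  );

  console.log('Application started; bulk synthetic data is seeding in the background.');
}

startApplication();
```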

Taking it to Other Environments

Once we had this working locally it was easy to run some experiments where we switched over other environments to use this fake data via a configuration switch. 
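
A sketch of how such a per-environment switch might be expressed, with hypothetical environment names and a hypothetical USE_FAKE_APIS override flag:

```typescript
// Sketch of driving the same fake/real switch from per-environment configuration.
// The environment names and the USE_FAKE_APIS override are hypothetical.

type DataSource = 'fake' | 'real';

const environmentDefaults: Record<string, DataSource> = {
  local: 'fake',
  test: 'fake',       // candidates for switching over to synthetic data
  staging: 'real',
  production: 'real', // production always uses real data
};

function resolveDataSource(): DataSource {
  // An explicit flag wins, so a single environment can be flipped for an experiment.
  if (process.env.USE_FAKE_APIS === 'true') return 'fake';
  if (process.env.USE_FAKE_APIS === 'false') return 'real';
  return environmentDefaults[process.env.APP_ENV ?? 'local'] ?? 'real';
}

console.log(`Data source for this environment: ${resolveDataSource()}`);
```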

[Image: a classroom of humanoid robots at wooden desks with a single human student, while a robot teacher gestures at a chalkboard reading “Fake Data, Safer Testing”.]

Additional Safety Benefits of Synthetic Data

In addition to the obvious information security benefits, synthetic data brought some other advantages that improved our development process.

Data volume – Synthetic data allowed us to create data for a new feature where no real data yet existed, and to test it with volumes we might not expect to see until many months after the feature went live. Once we had a process for programmatically generating a hundred records, it was an easy matter to change some parameters and generate tens of thousands.
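
For example, a generator like the hypothetical one below (using @faker-js/faker) can serve both everyday development and volume testing just by changing the count:

```typescript
// Sketch of a parameterised generator: the same code produces a hundred records
// for everyday development or tens of thousands for volume testing.
// The Order shape and generateOrders are hypothetical.
import { faker } from '@faker-js/faker';

interface Order {
  id: string;
  customerName: string;
  total: number;
  placedAt: Date;
}

function generateOrders(count: number): Order[] {
  return Array.from({ length: count }, () => ({
    id: faker.string.uuid(),
    customerName: faker.person.fullName(),
    total: faker.number.int({ min: 500, max: 500_000 }) / 100, // dollars and cents
    placedAt: faker.date.past({ years: 2 }),
  }));
}

const devOrders = generateOrders(100);           // everyday local development
const volumeTestOrders = generateOrders(50_000); // "a year from now" data volumes
console.log(devOrders.length, volumeTestOrders.length);
```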

Data stability – Using backups of production data for testing often meant a lot of effort hunting for, or modifying, real-world data in a particular state to test a scenario. The “shared” data provided by the APIs could not be edited from within the application, and modifying it required detailed developer knowledge of additional systems. With synthetic data we could create whatever data we needed to test a particular scenario, even if it was relatively rare in the real world. This was beneficial for manual testing, but also for end-to-end UI automation tests and integration testing.
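
As a rough sketch, a scenario-specific seed function can conjure up a state that is rare in production but needed by a test (the “overdue account with a disputed invoice” shape here is entirely hypothetical):

```typescript
// Sketch of seeding a rare, hard-to-find state on demand for a test scenario.
// The "overdue account with a disputed invoice" shape is entirely hypothetical.

interface Invoice {
  id: string;
  disputed: boolean;
}

interface ScenarioAccount {
  id: string;
  status: 'active' | 'overdue';
  invoices: Invoice[];
}

function seedOverdueAccountWithDisputedInvoice(): ScenarioAccount {
  // Rare in production data, trivial to create synthetically.
  return {
    id: 'acct-overdue-dispute', // stable id so manual testers and UI tests can find it
    status: 'overdue',
    invoices: [
      { id: 'inv-001', disputed: true },
      { id: 'inv-002', disputed: false },
    ],
  };
}

const scenario = seedOverdueAccountWithDisputedInvoice();
console.log(`Seeded scenario account ${scenario.id}`);
```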

Training environments and videos – Synthetic data allowed us to set up training environments where prospective users could be given access without any information privacy concerns. Training materials and repro information for bug reports could be created directly from the synthetic environment without any need to redact images or videos.

Principles of Synthetic Data

During our development a few principles emerged which guided our generation of synthetic data:

Synthetic data should look "real-world" – we spent a small amount of extra effort making our data look real enough that it could be included in a screenshot or training video without sticking out as obviously fake, and varied enough that many different real-world scenarios could be found in it. When new bugs were discovered that relied on a particular data setup, we could extend the seed process to always create this data so the scenario could be tested again in the future.

The synthetic data generation process should be deterministic – we wanted the seed process to produce the same data each time, so that if an issue was found in the test environment using synthetic data, a developer on their machine could work against exactly the same data when fixing it (see the sketch after these principles).

The seed process should be fast – we wanted developers to be comfortable frequently blowing away their local databases and re-seeding them without it becoming a productivity blocker, so we made the seed process fast. Anything we couldn’t make fast on startup we moved to a background process.
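
For the determinism principle, the key trick with most faker-style libraries is to fix the random seed. A minimal sketch with @faker-js/faker (the seed value itself is arbitrary):

```typescript
// Sketch of making the seed deterministic by fixing the faker seed.
// The seed value is arbitrary; what matters is that every run uses the same one.
import { faker } from '@faker-js/faker';

export function seedSyntheticData(): string[] {
  faker.seed(20240101); // same seed => same names, addresses and ids on every machine
  return Array.from({ length: 3 }, () => faker.person.fullName());
}

// Running this locally and in the test environment produces identical data,
// so a bug found against synthetic data can be reproduced exactly on a dev machine.
console.log(seedSyntheticData());
```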

Synthetic Data vs. Anonymised/Redacted Data

Another approach to addressing the information security concerns of production data is to anonymise, sanitise or redact the production data before it is used elsewhere. Depending on your project setup this may be a great choice (and certainly better than doing nothing). We chose not to do this on this project because the operational complexity for local development – setting up and running instances of the legacy “shared” APIs and restoring their respective databases from backups – was much higher, and when we began the journey towards synthetic data we lacked the organisational buy-in that this approach required. There is also always the concern that anonymised data might not be sufficiently anonymised, and that a clever attacker could find ways to re-identify particular domain entities (see https://en.wikipedia.org/wiki/Data_re-identification).

Your project could also consider a mixed approach where anonymised data is used in some environments and synthetic data in others, or where anonymised data is augmented with synthetic data to create data volume and/or thwart re-identification.
