We need to talk about test data

James Sheasby Thomas

May 17, 2017

Last month, I was hurriedly booking a vets’ appointment using my surgery’s online form. In the process, I accidentally used test data instead of my own!

#testerproblems Absent-mindedly filling in the vets’ contact form with fake contact details rather than my real details.

— James Sheasby Thomas (@RightSaidJames) April 20, 2017

Not sure Mr Fakename (email: test@example.com) cares all that much about guinea pig grooming services, but you never know.

— James Sheasby Thomas (@RightSaidJames) April 20, 2017

While this was a case of using test data when real data was required, it got me thinking about some of the patterns I use when entering fake, placeholder or test data into forms or web apps. These two tweets sparked a lighthearted discussion on commonly used test data:

HULK LIVE ON TEST ROAD, TESTINGTON, T3ST T3ST #IFEELYOU https://t.co/sXRjX4UQ4Y

— Hulk QA (@HULK_QA) April 20, 2017

@HULK_QA Twinned with Faketown, I believe

— James Sheasby Thomas (@RightSaidJames) April 20, 2017

@HULK_QA Lord D Vader,
C/O Imperial Palace Administration,
Block 702, Sigma63/9
Courouscant

— Dan Billing (@TheTestDoctor) April 20, 2017

@DanAshby04 @JokinAspiazu @HULK_QA @TheTestDoctor I normally use 10 Downing Street. Because, you know, why not 😀

— James Sheasby Thomas (@RightSaidJames) April 21, 2017

Sample debit card template courtesy of psdGraphics

If you test systems that ask for user data then you probably have your own variations of these. Now, the above examples have two key things in common:

They don’t look real. 10 Downing Street is a real address, but no one expects the Prime Minister or their staff to be filling in random webforms. Likewise user data with ‘fake’ or ‘test’ in it should stand out to most readers.
Despite being obviously fake, these examples are similar to real user data.

However, there are also some issues with these examples:

The ‘fakeness’ of this data may not be universal. If you’re not much of a Star Wars fan then ‘D Vader’ is your only real clue that Dan Billing’s example isn’t real. Likewise, people outside the UK may not know that 10 Downing Street is where the PM lives. Even then, many Brits won’t recognise the Downing Street postcode if used on its own.
It can mask bugs. Using clearly fake strings like ‘Test’ or ‘Fake’ introduces a risk that stray placeholder values* are missed during testing. In other words, if the developer’s placeholder values are similar to your test data then you may not notice that your user input is not being processed.
Using real details introduces a risk that a real person is contacted by mistake. Amusing though it might be, I don’t want to waste taxpayer money by making the PM’s office deal with accidental communications.

* In this context, placeholder values are hardcoded strings/values used during initial prototyping. As a developer completes a feature, they will replace these placeholder values with dynamic logic.

These are all minor problems with fairly simple resolutions. As a tester, I could co-ordinate with my developer colleagues to ensure we never use the same fake data. Before using the details of ‘famous’ people or places as test data, I could ensure that safeguards are in place to prevent my inputs leaking into production systems. But could I improve test data in order to make my testing more effective?

Test data considerations

When deciding what test data to use, I could ask myself the following questions:

How important is the test data? Is it stored once then forgotten about, or used throughout the system?
Is this test data displayed at the front-end? If so, could I vary its length and format to identify parts of the layout that are not robust enough to handle realistic user data?
Who might see this test data? e.g. developers, other testers, POs, clients, beta users
Is the system self-contained, or does it integrate with other systems/services? If the latter, how can I reduce the risk of my test data causing problems if it found its way into ‘the real world’?
Should I use the same test data every time I test (i.e. a control variable), or ensure that it is unique?
Can I auto-generate my test data, or would it be better to create it manually?
Can I design my test data to help me conduct boundary value analysis?
Are there any limitations on the format of the data? For example, UK postcodes must follow a strict format.
Is there any predetermined sample data we can make use of? For example, many countries reserve phone numbers that will pass validation but will never be assigned to a real person.
Is it important to be able to tell one group of test data apart from another?
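
The questions about format limits and predetermined sample data lend themselves to a small helper. The following Python sketch is only an illustration: the function names are hypothetical, the postcode pattern is a deliberately simplified check of the general shape (it does not confirm that a postcode exists), and the 07700 900xxx mobile range is, as far as I understand Ofcom's guidance, set aside for drama use, so check the current guidance before relying on it. The example.com domain is reserved for documentation by RFC 2606.

```python
import re

# Simplified shape of a UK postcode: outward code, optional space, inward code.
# Assumption: shape check only; it does not prove the postcode is in use.
UK_POSTCODE_SHAPE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}$")

# Values chosen so that test data should never reach a real person.
RESERVED_EMAIL_DOMAIN = "example.com"   # reserved by RFC 2606
UK_DRAMA_MOBILE_PREFIX = "07700 900"    # assumed Ofcom drama range, 000 to 999


def looks_like_uk_postcode(value: str) -> bool:
    """Return True if the value has the general shape of a UK postcode."""
    return bool(UK_POSTCODE_SHAPE.match(value.strip().upper()))


def sample_contact(n: int) -> dict:
    """Build contact details from reserved values, numbered 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("n must be between 0 and 999")
    return {
        "email": f"tester{n}@{RESERVED_EMAIL_DOMAIN}",
        "phone": f"{UK_DRAMA_MOBILE_PREFIX}{n:03d}",
    }
```

With this sketch, SW1A 2AA (the Downing Street postcode) passes the shape check, while the 'T3ST T3ST' postcode from the tweet above does not, which is a reminder that obviously fake data may be rejected by validation before it ever reaches the feature you want to test.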

Designing test data

By answering these questions, we can decide if our test data:

should be realistic or obviously fake;
should have any limitations on its format or structure;
is repetitive and therefore easy to spot at a glance…
… or varied in order to help us test different scenarios;
should be boundary-aware to help us test form validation…
… or fit within these boundaries so it doesn’t get in the way of our testing;
should act as a control variable…
… or be a dependent variable!
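
As a rough illustration of 'boundary-aware' test data, the Python sketch below builds strings whose lengths sit just below, on, and just above the limits of a field. The function names and the idea of a simple minimum/maximum length rule are assumptions made for the example; a real form may have other rules.

```python
def boundary_lengths(min_len: int, max_len: int) -> list[int]:
    """Return the lengths just below, on, and just above each limit."""
    candidates = {
        min_len - 1, min_len, min_len + 1,
        max_len - 1, max_len, max_len + 1,
    }
    # Negative lengths are meaningless for strings, so drop them.
    return sorted(n for n in candidates if n >= 0)


def boundary_strings(min_len: int, max_len: int, fill: str = "x") -> dict:
    """Map each boundary length to a string of exactly that length."""
    return {n: fill * n for n in boundary_lengths(min_len, max_len)}
```

For a field that accepts 1 to 50 characters, this gives strings of length 0, 1, 2, 49, 50 and 51. The same values work in the opposite direction too: if you want test data that stays out of the way, pick a length comfortably between the two limits.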

Types of test data

Whilst writing this post, I realised that there are two main types of test data:

Passive: it doesn’t really matter what this data is, it just needs to exist so that you can test a particular fix or feature that depends on this data being present.
Active: the data itself, and how it interacts with the system, is a primary focus of our testing.

Active test data obviously requires much more care and attention than passive test data does. It’s also much more likely to be unique (or at least more variable), and to challenge the constraints of the system under test. However, that doesn’t mean that passive test data is less important, just that it has different requirements.

Creating test data

One of the benefits of obvious test data like ‘1 Test Road, Testville’ is that it’s quite easy to remember. You don’t need to keep a database of fake names and addresses if you’re reusing the same details every time you test. However, as mentioned above, this approach can be problematic. Instead, why not create test data to suit the context of your project?

Approaches for creating your own test data include:

Start from a base set of ‘obviously fake’ data but vary it slightly each time by adding numbers or arbitrary letters. This will ensure all of your data is unique but still easy to spot at a glance. This approach is fairly low-effort, but it can be hard to keep track of what variations you’ve already used.
Download some pre-made data (probably in CSV or JSON format) that is suitable for the context you’re working in. One free service that does this is Random User Generator – many others are available. With Random User Generator, you can specify the output format, nationality and gender of your random users, among other things. This solution is quite simple and robust, but you may find that you have to adapt the output to match your needs.
Fetch customer data from your production environment, then anonymise it before re-uploading it to your test environment. You should probably ask permission before doing this, and ensure that you store the non-anonymised data securely and dispose of it when you no longer need it. This solution will give you extremely realistic test data, but it’s also quite risky from a legal or security perspective.
Write a script to generate test data on demand (or ask a colleague to write one for you). This solution is fairly robust, but probably requires the most effort.
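
To make the first and last approaches more concrete, here is a minimal Python sketch of a script that starts from 'obviously fake' details and varies them on each call. The class name, field names and default values are all hypothetical. Using a counter rather than arbitrary letters is one way around the problem of keeping track of which variations have already been used, at least within a single run.

```python
import itertools


class FakeContactFactory:
    """Produce 'obviously fake' contact details that differ on every call."""

    def __init__(self, street: str = "Test Road", town: str = "Testville"):
        self._street = street
        self._town = town
        # The counter guarantees that no variation repeats within one run.
        self._counter = itertools.count(1)

    def next_contact(self) -> dict:
        n = next(self._counter)
        return {
            "name": f"Mr Fakename{n}",
            "email": f"test{n}@example.com",
            "address": f"{n} {self._street}, {self._town}",
        }
```

Calling next_contact() twice on the same factory returns '1 Test Road, Testville' and then '2 Test Road, Testville', so each record is unique but still easy to spot at a glance.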

Final thoughts

It’s worth noting that this advice applies both to purely ‘manual’ testing and testing that involves some level of automation. It doesn’t really matter if your test data is entered into the system you’re testing by hand, by an automated script, or a combination of the two. However, you might decide that using a limited, predictable set of test data is the best approach for an automated test suite, so that your test results are reproducible.
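
One way to get varied but reproducible test data for an automated suite is to seed the random number generator, so that every run produces the same 'random' values. The sketch below uses Python's standard random module; the name lists, function name and default seed are placeholders chosen for the example.

```python
import random

# Placeholder name lists; swap in whatever suits your project.
FIRST_NAMES = ["Alex", "Sam", "Jo", "Chris"]
SURNAMES = ["Fakename", "Testerson", "Placeholder"]


def make_users(count: int, seed: int = 2017) -> list[str]:
    """Return the same list of fake names for the same count and seed."""
    # A private generator, so other code calling random.* cannot
    # disturb the sequence produced here.
    rng = random.Random(seed)
    return [
        f"{rng.choice(FIRST_NAMES)} {rng.choice(SURNAMES)}"
        for _ in range(count)
    ]
```

If a test fails, rerunning with the same seed recreates the same data, and changing the seed gives a fresh set when you want more variety.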

Test data is admittedly quite a boring topic, but hopefully this post gives you some things to think about. I also hope that I’ve convinced you that test data is worth thinking about! If nothing else, consider if the test data you work with most often is passive or active, and adjust your strategy accordingly.

Responses to “We need to talk about test data”

thegrowingtester

May 22, 2017

One tool I saw mentioned in the test Slack chat for random data entry was the Chrome extension Form Filler – https://chrome.google.com/webstore/detail/form-filler/bnjjngeaknajbdcgpfkgnonkmififhfo?hl=en
If you need passive data, this will populate specific fields or all fields on the screen, and will use data based on the type of field it is.

I grabbed it for when I need to test screens where I simply need data to progress through a business process, and don’t care what the actual data is.

Jeremy Wenisch

July 19, 2017

I really like your distinction between active and passive test data. One of those things that seems obvious after it’s named, but that I never consciously considered before. I think it’ll be useful to be more aware of it.

1. James Sheasby Thomas
  
  July 31, 2017
  
  Thanks Jeremy. I must admit that I’d never fully considered it before either, but when writing this post I realised that I was actually discussing two different use cases so I added that section in.
  
