Generative AI: pitfalls, myths and potential


This post is about Generative AI – also known as GAI, LLMs, GPT, AI chatbots or simply ‘ChatGPT’. Its output is impressive and fun to play around with. It seems like everyone and their guinea pigs have used ChatGPT at least once. But how much do we, even those who work in digital and tech, really understand GAI?

Generative AI in a nutshell

At first glance, the concept of generative AI is fairly simple. The user provides an input (text, image, document etc.). The LLM draws from its large and contextualised databank of content to generate a suitable response. LLMs generate their responses on the fly, so they might give a range of responses to the same user input. Most AI chatbot tools are also responsive to user feedback. For example you can tell them to make a response less formal or use fewer words. If you spot a mistake, you can often persuade your AI tool to correct this in future responses. Sounds great, right?

A longer explanation, AKA ‘fancy autocomplete’

The above paragraph is my version of how many people understand/talk about GAI and LLMs. But there’s a much more detailed explanation of how they really work, often cynically summarised as ‘fancy autocomplete’.

An AI chatbot like ChatGPT, or a media generating tool like Craiyon (previously known as DALL·E mini) is made up of multiple parts:

  • First there’s the UI – the thing you type/speak into (or upload a file) and which returns some sort of response.
  • Powering that is a backend system, or ‘app’. This might include an API call to submit your request, a backend process to generate your response from the LLM, an API request/response to send your response to the UI, and a database to store your request and response for later retrieval and reuse. (There’s a rough sketch of this layer just after this list.)
  • Underpinning the backend system/app is the actual LLM (aka Large Language Model). This is often GPT-3.5 or GPT-4, plus whatever customisations and tweaks the AI tool provider has made. For example, ChatGPT will often refuse to respond if you ask it about anything illegal, explicit or morally questionable. Other AI chatbots have fewer limitations, even if they’re based on the same LLM and/or dataset.
    • Tweaks to behaviour are usually achieved via a custom prompt/context that gives specific instructions for the LLM to follow. This is often described as ‘prompt engineering’, and is increasingly being viewed as a specialist skill.
  • An LLM would be nothing without its extensive training dataset. The GPT dataset includes extensive scraping of books, news sites, multimedia, Wikipedia and the internet at large. The current generation of LLMs can retrieve and summarise data from this corpus, but they have also ‘internalised’ the writing styles and quirks of this source data.
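
To make the backend ‘app’ layer a bit more concrete, here’s a minimal sketch of what it might look like in Python. It assumes an OpenAI-style chat completions endpoint; the model name, system prompt and SQLite storage are illustrative choices, not the real configuration of any particular product.

```python
# Minimal sketch of the 'app' layer that sits between the UI and the LLM.
# Assumes an OpenAI-style chat completions API; the model name, system
# prompt and storage choices are illustrative only.
import os
import sqlite3
import requests

# The 'prompt engineering' layer: instructions the user never sees
SYSTEM_PROMPT = (
    "You are a helpful assistant for Example Corp. "
    "Politely refuse to answer questions about anything illegal."
)

def generate_response(user_message: str) -> str:
    # API call to the underlying LLM
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_message},
            ],
        },
        timeout=30,
    )
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]

    # Store the request/response pair for later retrieval and reuse
    with sqlite3.connect("chat_history.db") as db:
        db.execute("CREATE TABLE IF NOT EXISTS history (request TEXT, response TEXT)")
        db.execute("INSERT INTO history VALUES (?, ?)", (user_message, answer))
    return answer
```

Swap the system prompt and you get a chatbot with a different ‘personality’ on top of exactly the same LLM – which is part of why two tools built on the same model can behave so differently.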

That’s too many bullet points, James!

To put it another way, generative AI creators have essentially ‘averaged out’ the collective output of humanity since records began. They have created tools whose interfaces respond to human input by generating realistic-sounding responses. Each LLM is ‘aware’ of which words are historically associated with each other. The same principle applies to pixels, code, sound waves, numbers, formulas etc. generated by AI tools.

This so-called ‘fancy autocomplete’ is why tools like ChatGPT seem so realistic and knowledgeable. If 99% of sources say Biden is the 46th US President, it will be very hard to persuade ChatGPT otherwise.

The current generation of LLMs/generative AI need to have their training data refreshed regularly to remain useful. Alternatively, their training data can be augmented with up-to-the-minute web content. As of February 2024, ChatGPT 3.5 is non-committal about whether Queen Elizabeth II is still alive and who succeeded her. To get an accurate response to this question, you need to pay for ChatGPT Plus to unlock ChatGPT 4. But if you ask the same questions of Perplexity, which is based on ChatGPT 3.5, it will accurately tell you the date of the Queen’s death and King Charles’s accession to the throne. What’s more, it will provide authoritative web sources to back it up.

The Queen Elizabeth example is quite a good illustration of how generative AI tools work under the hood. Two AI tools, based on the same LLM and the same dataset, give noticeably different responses to the same question. This is not due to randomness or the intelligence of the AI. Rather, a human has intervened (or not) to change how the AI responds in specific circumstances.

Why am I telling you all this?

Well, as usual, I’m writing a blog post to help refine my own understanding of a specific topic. Given the above explanations of generative AI, I wanted to share some specific points. Most of these have been bubbling away in my head ever since ChatGPT was unleashed upon the general public!

1. Most digital professionals don’t understand how Generative AI works

If you search any niche subreddit, and probably your own organisation’s Slack or Teams workspace, you’ll find people saying things like:

I’m not sure about [complex question], but here’s what ChatGPT says:
[massive wall of text that the poster has not checked for accuracy]

I think this is quite an obnoxious response. Most internet users have heard of ChatGPT by now, and version 3.5 is freely available – anyone who wanted an unverified AI answer could have asked for one themselves. This type of response is just not helpful.

In their current form, AI chatbots are helpful for getting an initial overview of a topic or question. Most don’t cite their sources, if they even ‘know’ where their content originated. As such, you can’t verify the accuracy of AI-generated content without doing some independent research. If you already understand a topic, an AI-generated wall of text and rough structure might be a helpful starting point. If you know little to nothing about a subject, turning a first pass into something helpful and accurate can be very challenging.

2. AI-generated content is often boring, overly-broad and predictable

Recently I was helping to score applications for a new job role using Applied. All applications are anonymous to the reviewers, with no CVs/resumes or even a list of past roles. Instead, applicants answer specific questions, which we score on a 1-5 scale. I strongly support this approach – it removes bias and puts every candidate on a level playing field. In particular, it ensures we don’t favour people based on who they’ve worked for, or the design of their CV.

Unfortunately, we all now have to deal with the risk that our text boxes designed for user-generated content have been populated by AI. Sadly, this includes job applications! Applied doesn’t have a feature to flag AI answers, but it does provide a ‘reference AI answer’ alongside candidate responses. Even without this example AI answer, it’s not too hard to spot AI-assisted responses once you’ve seen a few of them.

Beware the ‘comprehensive’ AI answer

One main ‘tell’ for AI answers is that they tend to over-explain and give answers that are ‘too good’. In response to a question, a ChatGPT-like bot will generate an answer that is broad, well-structured and (seemingly) comprehensive. Depending on the complexity of the topic, the answer may also lack depth and include too many broad-brush statements. Most humans don’t write like this – their personality, specific knowledge and unconscious biases will typically shine through in their content. Because LLM datasets are based on a massive corpus of knowledge, more than any one human could hold in their head, LLMs tend to give answers that are far broader and more detailed than any one human would be expected to give. Good Wikipedia articles are broad, detailed and unbiased, but real humans have tweaked and argued over their content to ensure conciseness, provide relatable examples and update the content in response to real-time events.

So, until LLMs start to be more selective about how much information they give out in response to basic prompts (or provide easy controls to help tweak their tone of voice), they have limited utility for copy-pasting in a factual context. Thankfully, this also means that they’re not particularly effective tools for plagiarism. Yet.

3. Generative AI creates decent prose, but Vogon poetry

Despite what I’ve written above about the questionable accuracy and writing style of LLMs, one thing they seem to excel at is creative writing, especially when you give them a specific style/format to work from. They’re generally awful at poetry though – like Douglas Adams’ Vogons they try way too hard and miss the mark by a thousand miles. Follow that second link at your own risk – I bear no responsibility for your untimely demise.

Zipmo’s been acquired, but don’t mention the product discontinuation

To generate a blog post in a specific style/format, here’s a short prompt that I gave to ChatGPT 3.5:

Write me a hype-laden blog post for a tech startup called ‘Zipmo’ that has just been acquired by Google. The post should gloss over the question of whether their main product (also called ‘Zipmo’) will be available after the acquisition process is complete, but still provide reassurance to Zipmo customers that their user needs will continue to be met for the near future.

ChatGPT 3.5 prompt, written by yours truly

And here’s a snippet from the full result:

[…] With this landmark acquisition, Zipmo is poised to reach new heights and explore uncharted territories, bringing the best of both worlds to tech aficionados everywhere. While some may be wondering what this acquisition means for Zipmo’s flagship product, aptly named ‘Zipmo,’ we’re here to assure you that our commitment to meeting your user needs remains steadfast and unwavering.

[…]

We know you’re eager to know what the future holds for Zipmo, and while we can’t reveal all the details at this moment, we can say this: innovation is in our DNA. Our dedicated team is working tirelessly to ensure that your Zipmo experience remains exceptional.

We’re actively exploring ways to integrate Zipmo seamlessly into Google’s ecosystem, creating a harmonious union that will bring you the best of both worlds. While we may undergo some changes along the way, our commitment to delivering technology that simplifies your life remains unchanged.

[…]

ChatGPT 3.5 response, 3rd January 2024

If I worked for Zipmo’s beleaguered PR department I would not publish the full response as-is, but it’s a good start. It has the right cadence, length and key points, and covers all of the things I mentioned in my brief. If the tone was a bit different to Zipmo’s usual writing style then I could give additional feedback to tweak the output to my liking, provide text examples from past blog posts, or start a new chat with a modified prompt.

Help us Squeaker, you’re our only hope!

ChatGPT is also quite good at writing zany book blurbs. Here’s a prompt for ChatGPT 4, featuring post-apocalyptic guinea pig factions:

Write a book blurb for a dystopian sci-fi novel about two competing factions of guinea pigs in a post-apocalypse scenario. The first faction wants to help rebuild the human population, while the second wants to become the new dominant species.

Another zany prompt written by me, this time aimed at ChatGPT 4

And the response:

In the charred remains of a world once dominated by humans, a new conflict rises, pitting guinea pig against guinea pig in an epic struggle for the future. “The Squeak of Dawn” is a riveting dystopian sci-fi novel that catapults readers into a post-apocalyptic landscape where two factions of guinea pigs emerge as the unexpected heirs to a devastated earth.

The first faction, The Guardians, led by the wise and compassionate Elder Whiskers, believes in restoring the human race, seeing in them the stewards of the old world and the keys to rebuilding civilization. They gather remnants of human knowledge, protect human survivors, and dream of a world where humans and guinea pigs live in harmony.

Contrasting sharply with this vision is The Claw, a militant group of guinea pigs who have tasted freedom from their cages and see the human downfall as their ascent. Led by the cunning and ruthless Fang, The Claw aims to establish guinea pigs as the new dominant species on the planet. They are skilled, fierce, and unyielding in their belief that the age of humans is over and the era of the guinea pig has begun.

As tensions between The Guardians and The Claw escalate, the fate of the planet hangs in the balance. Caught in the middle is a young guinea pig, Squeaker, torn between the ideals of both factions. Squeaker’s journey is one of discovery, courage, and hope, as he navigates a treacherous world and confronts the question: what is the true path to a reborn world?

“The Squeak of Dawn” is not just a tale of survival; it is a story about the power of unity, the importance of dreams, and the enduring nature of hope in the darkest of times. This novel will captivate readers, leaving them pondering the delicate balance between preservation and evolution, and the role each creature, no matter how small, plays in the tapestry of life.

ChatGPT 4 response, 3rd January 2024

Given the limited information provided in my brief, I’d give this response a 10/10, no notes. It’s clear that ChatGPT’s ‘understanding’ of the dystopian sci-fi genre and the general characteristics of the guinea pig is more than sufficient to generate a blurb for a book that I would genuinely read. I’m impressed that it’s thought up names for each faction, and even introduced a main character to tie it all together.

Just for fun (and so this post would have an image), I asked ChatGPT 4 to generate me a suitable book illustration:

An AI-generated black and white illustration. A large crowd of guinea pigs, separated by a small gap that splits them into two. In the foreground, two guinea pigs are facing off against each other as if they were the leader of each faction. Ruined skyscrapers can be seen in the background.
It took a surprisingly high number of attempts to generate an image that featured realistic guinea pigs and no humans.

Would I use ChatGPT to write me an entire book on this topic? No, as it would soon start to slip up, generating content that doesn’t fit with the theme or is too repetitive/predictable. But I might be able to use it as a prompt for each chapter, for generating character and subplot ideas, or as a compare-and-contrast for my own writing.

Another question I would ask, if I were a dystopian sci-fi author or a publishing company, is whether this book is something that people would actually read. I would also question whether starting with a blurb is a sensible way to write a book. But thankfully I am not an author or a publisher, so I don’t have to worry about any of that!

Overall, ChatGPT can be pretty great as a writing aid or for generating sample content in a specific format (e.g. blurbs, blog posts), especially where accuracy and precision aren’t such important factors.

4. We can’t ‘solve’ accessibility with generative AI

During 2023 there were a lot of questions in the Slack-sponsored a11y Slack community (DM me for an invite, it’s an amazing resource) about the potential of ChatGPT and other generative AI tools for making the web more accessible. Popular ideas include automatic code remediation, alt text writing and accessibility auditing.

I won’t waste my time and yours by writing a long essay on this. Smarter people than me have already written authoritative posts on the pitfalls of using generative AI to solve your accessibility problems, which I’ll link below.

But what I will say is that digital accessibility, by its very nature, cannot ever be fully ‘solved’ using AI. Accessibility is not purely a technical challenge, like optimising for mobile performance or orchestrating your development workflow using a CI/CD tool. Accessibility has some technical compliance aspects (semantic HTML, ARIA properties etc.), of course. But its main purpose is to ensure that content and code can be accessed by a wide range of humans with complex needs. Evaluating whether you’ve achieved that requires subjective human analysis, risk/benefit calculations and reaching a consensus with relevant stakeholders who are responsible for different parts of your product. No AI tool, no matter how well trained, can solve all of these problems for you.

AI-assisted accessibility compliance?

Generative AI can help you to solve some of these problems, particularly those involving content accessibility and technical compliance. But without the right training data behind it and a suitable level of context, an LLM can’t give you answers that will be correct for every situation. Should we spend our time and effort teaching AI tools about the context-heavy decision-making processes required for accessibility? Or would it be better to give accessibility training and guidance to as many humans as possible – digital professionals, non-technical stakeholders, users – so that they can make these decisions for themselves?

Generative AI + accessibility: further reading

As promised, here are a few articles about the potential use cases (or lack thereof) for generative AI in the field of digital accessibility:

5. Testing AI products is a bit of a minefield

Software testing, in its purest form, is like running a series of experiments:

  • First you have the input (e.g. some text, a mouse click or some existing data in a database).
  • Next you have the system component(s) that will take your input and process it in some way.
  • Finally, you have the output, which is the effect that your input has on the system under test.

To make their testing as accurate as possible, testers will often try to normalise the testing conditions. For example they might use one or more of these strategies:

  • a script that resets their test environment to a clean slate before/after every test cycle (see the sketch after this list)
  • explicit test cases that define the exact data and actions for each test
  • a testing environment that is isolated from all third-party integrations
  • a mock API that gives fixed/predictable responses to different types of request
  • a fresh set of user accounts for each regression testing cycle
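
To make a couple of these strategies concrete, here’s a small pytest-style sketch of the ‘clean slate’ approach. The schema and test data are invented for illustration; the point is that every test starts from a known, repeatable state.

```python
# Sketch of 'clean slate' test conditions with pytest: every test gets a
# fresh in-memory database and fresh, explicit test data. The schema and
# data are invented for illustration.
import sqlite3
import pytest

@pytest.fixture
def db():
    conn = sqlite3.connect(":memory:")  # brand-new database for each test
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    yield conn
    conn.close()                        # nothing carries over to the next test

@pytest.fixture
def fresh_user(db):
    # explicit, predictable test data rather than whatever is lying around
    db.execute("INSERT INTO users VALUES (?, ?)", ("test-user", "known-password"))
    return "test-user"

def test_user_exists_after_setup(db, fresh_user):
    row = db.execute("SELECT name FROM users WHERE name = ?", (fresh_user,)).fetchone()
    assert row == ("test-user",)
```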

Sometimes it’s not helpful to test in a controlled way – for instance, when trying to reproduce a bug that was reported in a live environment – but it’s a key part of a tester’s arsenal. By controlling the test conditions, we can learn more about specific aspects of the product under test and give more accurate feedback to our team members.

LLM-powered product testing strategies

When testing a product that is powered by a third-party LLM, we are testing not only the system we have created from scratch, but also how that system interfaces with the LLM and the quality of the generated output. We can still test each part in isolation, of course. For example, we could use an API testing tool to try out different user inputs with our custom prompt, or use a mock API service to stub out the LLM tool’s responses. But at some point we will have to deal with the fact that a core part of our product – the LLM – is at best a grey box that we cannot fully control or observe.
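
As a sketch of the ‘stub out the LLM’ idea, here’s a self-contained example using Python’s unittest.mock. The generate_response and answer_endpoint functions are invented stand-ins for whatever your product actually does; the point is that once the LLM call is replaced with a canned reply, exact assertions about the surrounding code become safe again.

```python
# Sketch: testing the deterministic code *around* the LLM by stubbing the
# LLM call itself. generate_response/answer_endpoint are invented stand-ins
# for a real product's backend.
from unittest.mock import patch

def generate_response(user_message: str) -> str:
    raise RuntimeError("would call the real LLM - not wanted in a unit test")

def answer_endpoint(question: str) -> dict:
    # the deterministic 'app' code we actually want to test
    return {"status": 200, "body": f"Answer: {generate_response(question)}"}

def test_answer_endpoint_wraps_llm_output():
    canned_reply = "Queen Elizabeth II died on 8 September 2022."
    with patch(f"{__name__}.generate_response", return_value=canned_reply):
        response = answer_endpoint("Is the Queen still alive?")
    # with the LLM stubbed, we can assert exact values for everything else
    assert response["status"] == 200
    assert canned_reply in response["body"]
```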

LLMs, by design, do not generate the same response every single time. This means that two tests with identical test data and identical conditions may not produce identical results. An automated test or explicit test case that expects an exact response will therefore fail frequently. This means that classical pass/fail testing strategies are not well suited to this context.

Techniques could include…

To deal with the fact that LLM output is variable, we have to approach our testing in a new way. Example techniques could include:

  • Snapshot testing – a Percy/Chromatic-style automated check that tests whether, given identical input and conditions, the output is within a certain percentage variance of the baseline (see the sketch after this list). As the wider team improves the underlying LLM and prompt, testers could update this baseline to reflect the current reality.
  • Comparative judgement – this is a human-led process where reviewers are presented with two responses side-by-side and asked to judge which is better. Products such as No More Marking use this approach to judge student work without a marking scheme. In theory, testers could use this same approach to peer evaluate AI responses and determine the best prompt for each product use case.
  • Prompt injection – this means trying to convince the tool to break out of its usual routine and do something that its creators did not intend. This is an ever-moving target, but it’s something that testers should make themselves aware of and learn how to do. As with other types of negative case testing, it’s better to discover these flaws in a controlled environment, rather than waiting for them to be discovered by real users.
  • Exploratory testing – yes, that old chestnut. But given that we can’t always predict what an LLM-powered tool will do, exploratory testing is a vital tool. By exploring an AI app and learning more about its behaviour, we can generate more scenarios for future testing. From there, we can feed back what we’ve learned to the rest of the team.
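
As a rough sketch of the snapshot-style check from the first bullet, here’s what a ‘within tolerance of the baseline’ test could look like. The 0.8 threshold and the simple string-similarity measure are arbitrary illustrative choices – a real implementation might compare embeddings or check for key facts instead – and the baseline would be updated as the prompt and LLM improve.

```python
# Sketch of a tolerance-based 'snapshot' check for LLM output. The
# threshold and similarity measure are illustrative choices only.
from difflib import SequenceMatcher

BASELINE = (
    "Queen Elizabeth II died on 8 September 2022 and was succeeded "
    "by her son, King Charles III."
)

def similarity(a: str, b: str) -> float:
    """Crude 0-to-1 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def check_against_baseline(llm_answer: str, threshold: float = 0.8) -> bool:
    # pass if the new answer stays within tolerance of the approved baseline,
    # rather than demanding an identical string on every run
    return similarity(llm_answer, BASELINE) >= threshold

# Usage: check_against_baseline(ask_chatbot("Who succeeded Queen Elizabeth II?"))
# where ask_chatbot() is whatever calls your LLM-powered product.
```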

Blurred lines in generative AI testing

Although testers can use familiar testing strategies when testing AI-powered tools, it’s impossible to escape the fact that we’re testing the LLM at the same time as the product. It’s difficult to separate testing an AI tool from discovering how to better interact with the underlying LLM. For those who care about such things, this strays beyond the bounds of ‘pure’ testing. As a tester you’re at risk of directly changing the product rather than simply providing useful information about it. Does this matter? Maybe, maybe not, but it’s something to be aware of.

Final thoughts

I would describe myself as an interested-but-sceptical generative AI user. LLMs, especially the bleeding-edge versions, clearly have a lot of potential but also a great deal of flaws. Some flaws are obvious (Queen Elizabeth II’s reign is definitely over), while others require specialist knowledge to confirm or refute.

So, the next time you’re evaluating a generative AI tool for work or personal use, ask yourself these questions:

  • Do I trust this tool’s output? If not, why not?
  • Could a human have completed this task more accurately? If so, how much more accurate would it be?
  • Is this tool’s output a finished product? If not, is it better to edit it manually or ask the tool for further edits?
  • Is there anything in this tool’s output that I can’t verify? If so, is that important?
  • If I provide this tool’s output to other people, will they find it useful/informative/entertaining?
  • Will my peers think I am dishonest or lazy if I share this AI output without disclosing its origin?
  • Do I have enough knowledge, or access to trustworthy sources, to be able to verify the accuracy of this AI output?

You get the idea. Overall, I would advise anyone, regardless of their technical knowledge, to be sceptical about every new AI tool they encounter.

With generative AI and LLMs hitting the mainstream in 2023 and beyond, you will have already encountered a lot of hype, showcases and experience reports. Just as it’s important to be sceptical about AI tools, you should also be careful about taking other people’s experiences at face value. Generative AI is often poorly understood, even by its enthusiasts, so it’s important to consider a range of views and, where possible, try these tools out for your own use cases.

