Tutorials on QA

Learn about QA from fellow newline community members!

  • React
  • Angular
  • Vue
  • Svelte
  • NextJS
  • Redux
  • Apollo
  • Storybook
  • D3
  • Testing Library
  • JavaScript
  • TypeScript
  • Node.js
  • Deno
  • Rust
  • Python
  • GraphQL

Synthetic Data Generation with Prompt Engineering

In our previous article, we talked about the role of synthetic data in QA testing and looked at two QA methodologies: Equivalence Class Partitioning and Boundary Value Analysis. Today, we’re going to talk about how you can use LLMs to generate test data for your applications. If you haven’t already, I recommend taking a look at our articles on prompt engineering for traditional and reasoning models, as we’re going to be using prompts to generate test data. As we’ve discussed before, there are many reasons to use synthetic data in your testing - one of the biggest being cost and scalability - but it may also be required as an alternative to production data when that data contains personally identifiable information, which is illegal to use for testing in much of the world.
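To give a flavour of the approach, here is a minimal sketch of the kind of prompt you might send to an LLM for this. The username constraint, the model name, and the OpenAI client usage are illustrative assumptions, not the exact prompts or setup from the article:

```python
# Rough sketch: prompting an LLM for Boundary Value Analysis test data.
# The username constraint, model name, and OpenAI client usage are
# illustrative assumptions, not the exact prompt or setup from the article.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = """You are generating QA test data.
Field: username, allowed length 3-20 characters (hypothetical constraint).
Using Boundary Value Analysis, return JSON test cases at and around the
boundaries (lengths 2, 3, 4, 19, 20 and 21), each with an expected result
of "accept" or "reject"."""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```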

Test Data and AI: What Makes Good Test Data?

In this series of articles we’re going to be talking about how to use LLMs to generate synthetic data for QA testing, starting with the basics of test data, then moving on to generation methods, and finally looking at examples of generating test data for the purpose of validating LLM products. But let’s start at the beginning: in this article we’ll talk about how to use synthetic test data more generally, what makes good or bad test data, and some traditional QA methodologies and how test data can inform them. Synthetic data refers to any machine-generated data that can be used to execute test cases or mock a production environment scenario - this includes data produced by LLMs, procedurally generated data, and human-curated or hand-created data produced outside of production. Of course, production data is incredibly valuable for testing, and when it’s possible to use it, it should be used - but often this is not possible, legal, or scalable. Gathering production data can also be an expensive process for a new feature or product, since you need to hire beta testers. Synthetic data has other advantages beyond cost, too.


Common Statistical LLM Evaluation Metrics and What They Mean

In one of our earlier articles, we touched on statistical metrics and how they can be used in evaluation - we also briefly discussed precision, recall, and F1-score in our article on benchmarking. Today, we’ll go into more detail on how to apply these metrics directly, as well as some more complex metrics derived from them that can be used to assess LLM performance. Precision, for example, is a standard measure in statistics and has long been used to measure the performance of ML systems. In simple terms, it measures how many samples are correctly categorised by a model (true positives) out of the total set of samples predicted to be positive (true positives + false positives). If we take the simple example of an ML tool that takes a photo as input and tells you whether there is a dog in the picture, this would be the number of photos correctly flagged as containing a dog, divided by the total number of photos the model flags as containing a dog.
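To make that concrete, here is a minimal sketch in Python. The counts are made-up numbers for the dog-detector example, not results from any real model; recall and F1-score, which the excerpt also mentions, are included for completeness:

```python
# Minimal sketch: precision, recall and F1-score from confusion-matrix counts.
# The counts below are made-up numbers for the dog-detector example above.

true_positives = 40   # photos with a dog that the model flagged as "dog"
false_positives = 10  # photos without a dog that the model flagged as "dog"
false_negatives = 5   # photos with a dog that the model missed

precision = true_positives / (true_positives + false_positives)   # 40 / 50 = 0.80
recall = true_positives / (true_positives + false_negatives)      # 40 / 45 ≈ 0.89
f1_score = 2 * precision * recall / (precision + recall)          # ≈ 0.84

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1_score:.2f}")
```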

How Good is Good Enough: Subjective Testing and Manual LLM Evaluation

In our previous article, we talked about the highest level of testing and evaluation for LLM models, and went into detail on some of the most commonly used benchmarks for validating LLM performance at a high level. Today, we’re going to look at some more fine-grained evaluation metrics that you can use while building an LLM-based tool. Here we make the distinction between statistical metrics - those computed using a statistical model - and more generalised metrics that attempt to measure the more ‘subjective’ elements of LLM performance (such as those used in manual testing) and that use AI to evaluate how useful a model is in its given context. In this article we’ll give an overview of the different classes of metrics and cover human evaluation and its importance, before moving on to common statistical metrics and LLM-as-Judge evaluations in the following articles.

How Good is Good Enough? - Introduction to LLM Testing and Benchmarks

The proliferation of Large Language Models (LLMs), and their subsequent embedding into workflows in every industry imaginable, has upended much of the conventional wisdom around quality assurance and software testing. QA engineers effectively have to deal with non-deterministic outputs, so traditional automated testing that involves assertions on the output is partially out. Moreover, the input set for LLM-based services has equally ballooned, with the potential input being the entirety of human language in the worst case, and a very flexible subset for more specialised LLMs. This is a vast test surface with many potential points of failure, one in which it is practically impossible to achieve 100% test coverage, and the edge cases are equally vast and difficult to enumerate. It’s unsurprising that we’ve seen bugs in top-tier customer-facing LLMs, even from the biggest companies - like Google’s AI recommending users eat one small rock a day after indexing an Onion article, or Grok accusing NBA star Klay Thompson of vandalism.

How Good is Good Enough: A Guide to Common LLM Benchmarks

In our last article, we talked about benchmarking as the highest-level method of assessing the performance of LLMs. Today, we’re going to look in more detail at some of the most popular benchmarks, what they measure, and how they measure it. Note that most of the benchmarks listed below have public-facing leaderboards and question sets available if you want to dive deeper, and I’ve also included links to papers where appropriate. Let’s dive in!