Tutorials on QA

Learn about QA from fellow newline community members!

  • React
  • Angular
  • Vue
  • Svelte
  • NextJS
  • Redux
  • Apollo
  • Storybook
  • D3
  • Testing Library
  • JavaScript
  • TypeScript
  • Node.js
  • Deno
  • Rust
  • Python
  • GraphQL

Synthetic Data Generation with Prompt Engineering

In our previous article, we talked about the role of synthetic data in QA testing and looked at two QA methodologies: Equivalence Class Partitioning and Boundary Value Analysis. Today, we’re going to talk about how you can use LLMs to generate test data for your applications. If you haven’t already, I recommend taking a look at our articles on prompt engineering for traditional and reasoning models, as we’re going to be using prompts to generate test data. As we’ve discussed before, there are many reasons to use synthetic data in your testing - one of the biggest being cost and scalability - but it may also be required as an alternative to production data when that data contains personally identifiable information, which is illegal to use for testing in much of the world.
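To give a flavour of the approach, here is a minimal sketch of the kind of prompt you might send to an LLM for this. The username constraint, the model name, and the OpenAI client usage are illustrative assumptions, not the exact prompts or setup from the article:

```python
# Rough sketch: prompting an LLM for Boundary Value Analysis test data.
# The username constraint, model name, and OpenAI client usage are
# illustrative assumptions, not the exact prompt or setup from the article.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = """You are generating QA test data.
Field: username, allowed length 3-20 characters (hypothetical constraint).
Using Boundary Value Analysis, return JSON test cases at and around the
boundaries (lengths 2, 3, 4, 19, 20 and 21), each with an expected result
of "accept" or "reject"."""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```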

Test Data and AI: What Makes Good Test Data?

In this series of articles we’re going to be talking about how to use LLMs to generate synthetic data for QA testing, starting with the basics of test data, then moving on to generation methods, and finally looking at examples of generating test data for the purpose of validating LLM products. But let’s start at the beginning: in this article we’ll talk about how to use synthetic test data more generally, what makes good or bad test data, and some traditional QA methodologies and how test data can inform them. Synthetic data refers to any machine-generated data that can be used to execute test cases or mock a production environment scenario - this includes data produced by LLMs, procedurally generated data, and human-curated or hand-created data produced outside of production. Of course, production data is incredibly valuable for testing, and when it’s possible to use it, it should be used - but often this is not possible, legal, or scalable. Gathering production data can also be an expensive process for a new feature or product, since you need to hire beta testers. Synthetic data has other advantages beyond cost, too.


Common Statistical LLM Evaluation Metrics and What They Mean

In one of our earlier articles, we touched on statistical metrics and how they can be used in evaluation - we also briefly discussed precision, recall, and F1-score in our article on benchmarking. Today, we’ll go into more detail on how to apply these metrics directly, as well as some more complex metrics derived from them that can be used to assess LLM performance. Precision, for example, is a standard measure in statistics and has long been used to measure the performance of ML systems. In simple terms, it measures how many samples are correctly categorised by a model (true positives) out of the total set of samples predicted to be positive (true positives + false positives). If we take the simple example of an ML tool that takes a photo as input and tells you whether there is a dog in the picture, this would be the number of photos correctly flagged as containing a dog, divided by the total number of photos the model flags as containing a dog.
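To make that concrete, here is a minimal sketch in Python. The counts are made-up numbers for the dog-detector example, not results from any real model; recall and F1-score, which the excerpt also mentions, are included for completeness:

```python
# Minimal sketch: precision, recall and F1-score from confusion-matrix counts.
# The counts below are made-up numbers for the dog-detector example above.

true_positives = 40   # photos with a dog that the model flagged as "dog"
false_positives = 10  # photos without a dog that the model flagged as "dog"
false_negatives = 5   # photos with a dog that the model missed

precision = true_positives / (true_positives + false_positives)   # 40 / 50 = 0.80
recall = true_positives / (true_positives + false_negatives)      # 40 / 45 ≈ 0.89
f1_score = 2 * precision * recall / (precision + recall)          # ≈ 0.84

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1_score:.2f}")
```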

How Good is Good Enough: Subjective Testing and Manual LLM Evaluation

In our previous article, we talked about the highest level of testing and evaluation for LLM models, and went into detail on some of the most commonly used benchmarks for validating LLM performance at a high level. Today, we’re going to look at some more fine-grained evaluation metrics that you can use while building an LLM-based tool. Here we make the distinction between statistical metrics - those computed using a statistical model - and more generalised metrics that attempt to measure the more ‘subjective’ elements of LLM performance (such as those used in manual testing) and that use AI to evaluate how useful a model is in its given context. In this article we’ll give an overview of the different classes of metrics and cover human evaluation and its importance, before moving on to common statistical metrics and LLM-as-Judge evaluations in the following articles.

How Good is Good Enough? - Introduction to LLM Testing and Benchmarks

The proliferation of Large Language Models (LLMs), and their subsequent embedding into workflows in every industry imaginable, has upended much of the conventional wisdom around quality assurance and software testing. QA engineers effectively have to deal with non-deterministic outputs, so traditional automated testing that involves assertions on the output is partially out. Moreover, the input set for LLM-based services has equally ballooned, with the potential input being the entirety of human language in the worst case, and a very flexible subset for more specialised LLMs. This is a vast test surface with many potential points of failure, one in which it is practically impossible to achieve 100% test coverage, and the edge cases are equally vast and difficult to enumerate. It’s unsurprising that we’ve seen bugs in top-tier customer-facing LLMs, even from the biggest companies - like Google’s AI recommending users eat one small rock a day after indexing an Onion article, or Grok accusing NBA star Klay Thompson of vandalism.

How Good is Good Enough: A Guide to Common LLM Benchmarks

In our last article, we talked about benchmarking as the highest-level method of assessing the performance of LLMs. Today, we’re going to look in more detail at some of the most popular benchmarks, what they measure, and how they measure it. Note that most of the benchmarks listed below have public-facing leaderboards and question sets available if you want to dive deeper, and I’ve also included links to papers where appropriate. Let’s dive in!