What is LLM as a Judge and Why Should You Use It?

In the last article we covered statistical metrics like Perplexity, BLEU, ROUGE and more, as well as some of the statistical concepts that underpin them, their strengths (accuracy, reliability) and their weaknesses (no subjective focus, reliance on reference texts). Between human evaluation (manual testing) and statistical measures we get a mix of high-value qualitative assessment on a small part of the test surface, and a rigorous but limited view of a wider area. That still leaves a lot of middle ground uncovered!

That’s why there has been a push over the last few years to cover the space in between - something that keeps a level of subjectivity and nuance but also scales up. This is where LLM-as-a-Judge comes in. In our manual testing for LLMs article I compared this to a kind of ouroboros where AI validates AI - and while the comparison holds, that isn’t necessarily a bad thing. LLMs are able to do some things better than humans and LLM-as-a-Judge plays to those strengths - but it does not replace the need for human oversight and statistical assessment.

There are also metrics that combine LLM-as-a-Judge with statistical metrics - but we’ll talk more about that later.

What is LLM as a Judge?

As we’ve discussed, LLM as a Judge refers to any evaluation method that uses an LLM to assess LLM performance against a set of established criteria. The concept was introduced in a 2023 paper (https://arxiv.org/abs/2306.05685) by researchers who used MT-Bench and Chatbot Arena to evaluate LLM outputs. Their findings indicated that strong LLM judges such as GPT-4 reached over 80% agreement with human evaluators, which is about the level of agreement the paper reports between humans themselves. An LLM judge is still prone to some severe mistakes, but it is a scalable approximation that gives a strong all-round evaluation.

The researchers pitted two LLMs against each other on questions from the MT-Bench benchmark, using Chatbot Arena, and then asked an LLM to evaluate which response was best.

Since then, the concept has evolved and many new metrics have been introduced using LLM as a Judge. These range from other pairwise comparisons (i.e. choosing the best output out of a pair, as in the example above) to methods that use scores and rubrics, which may be similar to those used by human evaluators.

Why Use LLM as a Judge?

Depending on how bullish you are on LLMs, the idea either sounds a bit bananas or like the most obvious solution to the problems of subjectivity and test surface. But the main reason to use it is quite simple - it’s more scalable than manual testing, and better at capturing subjective qualities than statistical methods.

It’s a middle ground that offers valuable insights that neither of the other methods can provide - but let me reiterate, it’s not a substitute for either!

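What Limitations does LLM as a Judge Have?

There are a few limitations of using LLM as a Judge:

1. Non-Deterministic Results: LLMs are non-deterministic, which means a judge might score the same output differently from one execution to the next. Individual test cases are therefore less repeatable and consistent.
2. Positional Bias: LLMs tend to prefer the first option when evaluating pairwise input - though the level of bias seems to vary by LLM. Humans do this too, to an extent, but less measurably than LLMs. This can be addressed by swapping the position in 50% of samples in a pairwise evaluation.
3. Verbosity Bias: Some LLMs also tend to prefer longer answers over short answers, meaning that verbose outputs score better without necessarily being materially better (or even while being worse).
4. Self-Enhancement Bias: While less widespread than some other biases, some LLMs will favour answers that they generated over those by other models. Remember what I said about an ouroboros? This is also sometimes called narcissism bias.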

There are a few methods that can be used to address these limitations, such as using Chain of Thought (CoT) prompting to ask the LLM judge to expand on its reasoning, using a few-shot approach and references in the prompt, breaking up evaluated outputs into sections, or fine-tuning the LLM as a Judge model.
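To make the positional bias mitigation concrete, here is a minimal sketch in Python. It uses a stricter variation of position swapping: every pair is judged in both orders, and a winner is only declared when both verdicts agree. The `Judge` signature and the two stand-in judges are assumptions made up for illustration; in practice `judge` would wrap a call to whichever LLM you use.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def judge_pair_debiased(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging the pair in both orders."""
    first_pass = judge(prompt, answer_a, answer_b)   # A is shown first
    second_pass = judge(prompt, answer_b, answer_a)  # B is shown first

    a_wins_first = first_pass == "first"
    a_wins_second = second_pass == "second"

    if a_wins_first and a_wins_second:
        return "A"
    if not a_wins_first and not a_wins_second:
        return "B"
    # The verdict flipped when the order flipped, so treat it as positional bias.
    return "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Stand-in for a judge that always picks whatever is shown first."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Stand-in for a judge with verbosity bias (equal lengths go to the first answer)."""
    return "first" if len(first) >= len(second) else "second"


print(judge_pair_debiased(always_first, "q", "x", "y"))
print(judge_pair_debiased(prefers_longer, "q", "short", "much longer answer"))
```

Judging both orders doubles the number of judge calls, which is the price of catching order-dependent verdicts. Note that the second stand-in still wins consistently for the longer answer: swapping positions does nothing for verbosity bias, which needs its own mitigation.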

Okay - now that we’ve covered what LLM as a Judge is, what it’s good for and what it isn’t, let’s take a look at some of the most common LLM as a Judge methodologies.

LLM as a Judge Metrics and what they Mean

There are quite a few options for LLM as a Judge evaluators - here we’re going to focus on a few of the most common ones.

G-Eval

G-Eval (https://arxiv.org/pdf/2303.16634) is an LLM-based scorer that uses GPT-4 to evaluate text output using CoT reasoning, based on user-defined criteria. By user-defined criteria I mean subjective criteria like those we discussed in our human evaluation article - coherence and relevancy are standard examples.

The use of CoT reasoning helps to avoid some of the limitations listed in the previous section of the article, and the customizable nature of user-defined metrics means that this is a great, flexible scorer for text outputs.

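To use G-Eval you should provide a prompt detailing:

1. An introduction to the task: what the model will be processing
2. An evaluation criterion: what it should grade the output on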

Given this prompt, G-Eval has the LLM auto-generate evaluation steps (called “auto-CoT”) to determine how to grade outputs. Each output is then given a bounded score, for example between 1 and 5.
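One detail from the G-Eval paper is worth a quick sketch: rather than taking the single integer the model prints, the paper weights each possible score by the probability the model assigns to it and sums the results, which gives a finer-grained score. The function name and the probabilities below are made up for illustration; in practice the probabilities would come from the judge model's token probabilities or from repeated sampling.

```python
from typing import Mapping


def weighted_score(score_probs: Mapping[int, float]) -> float:
    """Expected score: sum of score * probability, after normalizing the probabilities."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * prob for score, prob in score_probs.items()) / total


# Hypothetical probabilities for each score on a 1-5 scale.
probs = {1: 0.0, 2: 0.1, 3: 0.5, 4: 0.3, 5: 0.1}
print(round(weighted_score(probs), 2))  # 3.4
```

Two outputs that would both have been printed as a "3" can now be told apart if the judge leans towards 4 for one and towards 2 for the other.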

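Prometheus

Prometheus (https://arxiv.org/pdf/2310.08491) is an open-source LLM evaluator that operates on the same basic principles as G-Eval, but with a few differences (aside from being open-source):

- The rubric for scoring the evaluation criteria is provided within the prompt rather than generated with auto-CoT
- Prometheus uses a model fine-tuned specifically for evaluation rather than GPT
- Prometheus takes a reference answer alongside the response being evaluated - this helps the model adapt to custom evaluation criteria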

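Thus the inputs that need to be given to Prometheus (as per the paper) are:

1. Instruction: An instruction that a user would prompt to an arbitrary LLM.
2. Response to Evaluate: A response to the instruction that the evaluator LM has to evaluate.
3. Customized Score Rubric: A specification of novel criteria decided by the user. The evaluator should focus on this aspect during evaluation. The rubric consists of (1) a description of the criteria and (2) a description of each scoring decision (1 to 5).
4. Reference Answer: A reference answer that would receive a score of 5. Instead of requiring the evaluator LLM to solve the instruction, it enables the evaluator to use the mutual information between the reference answer and the response to make a scoring decision.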

Like G-Eval, Prometheus returns a bounded score, and in lieu of showing reasoning steps it offers feedback as part of its output.
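As a rough illustration of how the four Prometheus inputs (instruction, response to evaluate, score rubric, reference answer) fit together, here is a sketch that assembles them into a single prompt. The `PrometheusInput` class, the `build_prompt` function and the section layout are all hypothetical; this is not the official Prometheus template, so check the model card for the exact format the model was trained on.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PrometheusInput:
    instruction: str
    response: str
    criteria: str
    score_descriptions: Dict[int, str]  # maps each score (1 to 5) to its description
    reference_answer: str


def build_prompt(item: PrometheusInput) -> str:
    """Lay out the four inputs as labelled sections (illustrative layout only)."""
    rubric_lines = [
        f"Score {score}: {item.score_descriptions[score]}"
        for score in sorted(item.score_descriptions)
    ]
    sections = [
        ("Instruction", item.instruction),
        ("Response to Evaluate", item.response),
        ("Reference Answer (Score 5)", item.reference_answer),
        ("Score Rubric", item.criteria + "\n" + "\n".join(rubric_lines)),
    ]
    return "\n\n".join(f"### {title}\n{body}" for title, body in sections)


example = PrometheusInput(
    instruction="Explain what a hash map is to a beginner.",
    response="A hash map stores key-value pairs and looks them up quickly using a hash.",
    criteria="Is the explanation accurate and accessible to a beginner?",
    score_descriptions={
        1: "Inaccurate or incomprehensible.",
        2: "Mostly inaccurate or very hard to follow.",
        3: "Partly accurate but confusing for a beginner.",
        4: "Accurate with minor gaps in clarity.",
        5: "Accurate, clear and beginner-friendly.",
    },
    reference_answer="A hash map is like a labelled set of drawers: the key tells you which drawer holds the value.",
)
print(build_prompt(example))
```

Keeping the rubric as structured data rather than a hand-written string makes it easier to reuse the same criteria across a whole test set.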

Hugging Face provides access to Prometheus and all its documentation here: https://huggingface.co/prometheus-eval/prometheus-13b-v1.0

DAG

There is a big limitation of LLM as a Judge that even CoT doesn’t really address - the fact that results are non-deterministic. Moreover, a scoring rubric lends itself to certain values being overrepresented - for example, on a 1-5 scale you might get a bell curve of results with a peak at 3. To make results more reliable and repeatable we can configure (for example) G-Eval with a Directed Acyclic Graph (DAG).

A DAG uses a node structure in which multiple evaluation criteria are chained together as a decision tree.

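Let’s take a look at the workflow diagram below for summarization testing, taken from DeepEval (https://docs.confident-ai.com/docs/metrics-dag):

[Figure: DeepEval DAG workflow diagram for a summarization formatting metric]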

Here we can see a DAG for G-Eval that has multiple nodes of evaluation, the first of which assesses whether the summary contains all the expected elements, and the second of which verifies whether (and to what extent) the sections are ordered (or disordered). The output of each evaluation node directs the scorer down a decision tree where the leaf nodes (or verdict nodes) are scores.

This makes results much more deterministic and repeatable.
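The decision-tree idea can be sketched in a few lines of Python. This is not DeepEval's API: the `Decision` and `Verdict` classes, the question names and the scores are all invented for illustration, and `stub_judge` stands in for an LLM call by using plain string checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Verdict:
    """Leaf node: a fixed score."""
    score: int


@dataclass
class Decision:
    """Inner node: a narrow question whose answer selects the next node."""
    question: str
    branches: Dict[str, Union["Decision", Verdict]]


def run_dag(node: Union[Decision, Verdict], output: str, ask: Callable[[str, str], str]) -> int:
    """Walk from the root to a verdict, asking the judge one question per node."""
    while isinstance(node, Decision):
        answer = ask(node.question, output)
        node = node.branches[answer]
    return node.score


def stub_judge(question: str, output: str) -> str:
    """Stand-in for an LLM call; answers both questions with string checks."""
    headings = ["intro", "body", "conclusion"]
    positions = [output.lower().find(h) for h in headings]
    if question == "all_headings_present":
        return "yes" if all(p >= 0 for p in positions) else "no"
    if question == "heading_order":
        return "in_order" if positions == sorted(positions) else "out_of_order"
    raise ValueError(f"unknown question: {question}")


summary_dag = Decision("all_headings_present", {
    "no": Verdict(0),
    "yes": Decision("heading_order", {
        "in_order": Verdict(10),
        "out_of_order": Verdict(4),
    }),
})

print(run_dag(summary_dag, "Intro ... Body ... Conclusion", stub_judge))  # 10
```

The judge only ever answers narrow, closed questions, and the score itself comes from the tree rather than from the model, which is where the extra repeatability comes from.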

Combining Statistical Methods and AI Models

Naturally, our middle ground of using LLM as a Judge produces its own middle grounds between the other endpoints of subjectivity and reliability (i.e. manual testing and statistical methods). To further bridge the reliability gap, there are a number of methods that mix LLM or other AI model based evaluations with statistical scoring.

BERTScore

While the n-gram based statistical scores we discussed in the previous article have their uses, they fail to take into account the relationship and context of the words in a phrase beyond the order in which they appear (though METEOR does search for synonyms). Sentences that look similar on the surface do not necessarily share the same meaning, and n-gram based scorers don’t pick up on paraphrasing or distant relationships between words.

This has led to the development of Contextual Embedding Metrics that attempt to rectify this. BERTScore (https://huggingface.co/spaces/evaluate-metric/bertscore) is the most well-known of these. The difference here is that tokens are given contextual embeddings by a model (in this case BERT), which allows words to be compared by their meaning in context rather than by exact surface match.

BERTScore also takes into account other nearby words and compares the output and references using cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity), a popular method in text analysis that measures the “angle” between two vectors.

Note: Technically this doesn’t use an LLM as a judge, since BERT acts here as a contextual embedding model rather than a generative one, but it is still a model-based evaluation.
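To show how the cosine similarity comparison turns into a score, here is a small sketch of BERTScore-style greedy matching: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and the two are combined into an F1. The two-dimensional vectors are toy values standing in for real BERT embeddings, the function names are my own, and refinements such as importance weighting are left out.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(candidate: List[Vector], reference: List[Vector]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) using greedy best-match cosine similarity."""
    if not candidate or not reference:
        raise ValueError("both token lists must be non-empty")
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy embeddings: the candidate covers two of the three reference tokens exactly.
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
precision, recall, f1 = greedy_match_scores(candidate_tokens, reference_tokens)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here precision is perfect because every candidate token has an exact match, while recall drops because the third reference token is only partially covered.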

SelfCheckGPT

SelfCheckGPT (https://arxiv.org/abs/2303.08896) is a novel hallucination detection metric. As the name implies, it involves an LLM sampling and verifying its own outputs in order to detect hallucinations. The guiding principle here is that an LLM will be right on average for questions that are within its power to answer, and that hallucinations are exceptional cases.

SelfCheckGPT takes a test set of prompts and outputs and compares them at the sentence level. For each prompt in the test set it generates a number N of its own sample outputs, then evaluates how each of the N samples compares to the provided output from the LLM under test. In theory, the more samples match, the better, since the LLM should be right on average. A hallucination score is then calculated based on how often the test output matches its samples.

See the diagram below for a clearer picture:

[Figure: SelfCheckGPT workflow diagram]

SelfCheckGPT does not catch issues where the LLM is consistently wrong or lacks something in its knowledge base, i.e. where the general trend of the answers is towards being wrong, or where the model repeats the same hallucination on a given topic. It’s also worth noting that this method is quite computationally expensive, scaling linearly with the sample size N.
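Here is a rough sketch of the sentence-level consistency check behind SelfCheckGPT. For each sentence of the answer under test, it counts the fraction of the N samples that fail to support it, so a higher score means the sentence is more likely to be a hallucination. The function names are my own, and `naive_supports` is a deliberately crude word-overlap stand-in; the paper uses stronger comparisons (such as BERTScore, question answering, NLI or LLM prompting) in its place.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Very simple sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def naive_supports(sentence: str, sample: str) -> bool:
    """Crude stand-in check: every word of the sentence also appears in the sample."""
    words = set(re.findall(r"\w+", sentence.lower()))
    sample_words = set(re.findall(r"\w+", sample.lower()))
    return words <= sample_words


def hallucination_scores(
    answer: str,
    samples: List[str],
    supports: Callable[[str, str], bool] = naive_supports,
) -> List[float]:
    """Per-sentence score in [0, 1]: the fraction of samples that do not support it."""
    if not samples:
        raise ValueError("need at least one sample")
    return [
        sum(not supports(sentence, sample) for sample in samples) / len(samples)
        for sentence in split_sentences(answer)
    ]


answer = "Paris is the capital of France. It has 90 million residents."
samples = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "Paris is the capital of France and it has about 2 million residents.",
]
print(hallucination_scores(answer, samples))  # [0.0, 1.0]
```

The loop makes the cost visible: every sentence is checked against every sample, on top of the N extra generations needed to produce the samples in the first place.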

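Other LLM and Model-Based Evaluators

There are too many of these to cover in one article! But I’ve included a list of other evaluators to look into, including links to the relevant research papers:

- GPTScore (https://arxiv.org/pdf/2302.04166): Evaluates the conditional probability of a model producing the target text
- RAGAS (https://arxiv.org/abs/2309.15217): Suite of evaluation metrics for RAG systems
- BLEURT (https://arxiv.org/abs/2004.04696): Text evaluation with pre-trained models and embeddings
- MoverScore (https://arxiv.org/abs/1909.02622): Contextual embeddings and statistical distance-based evaluator
- QAG Score (https://arxiv.org/abs/2305.17002): Mixed method where LLMs extract claims in an output and close-ended questions (usually true/false checks) are made for each claim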

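Conclusion

Now we’ve concluded our series on the different types of LLM evaluations - to recap, we have many different levels of abstraction at which we can test LLMs and each has its own value:

- Benchmarks offer a broad view of LLM performance in a given domain on a given dataset
- Manual Testing provides the best level of subjective assessment and sensitivity, but is very expensive
- Statistical evaluations are great for assessing things at scale, but lack the ability to appraise subjective elements
- LLM as a Judge provides a middle ground between manual testing and statistical evaluation, and can be further combined with statistical metrics or deployed with a DAG to be more deterministic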

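If you missed the previous articles in this series, you can check them out here:

- Article 1: Intro to LLM Testing and Benchmarking (https://www.newline.co/@NickBadot/how-good-is-good-enough-introduction-to-llm-testing-and-benchmarks--460c6dd4)
- Article 2: Guide to Common LLM benchmarks (https://www.newline.co/@NickBadot/how-good-is-good-enough-a-guide-to-common-llm-benchmarks--cccbbaf9)
- Article 3: Subjective Testing and Manual Evaluation (https://www.newline.co/@NickBadot/how-good-is-good-enough-subjective-testing-and-manual-llm-evaluation--eaa8c1c9)
- Article 4: Guide to Statistical Evaluations for LLMs (https://www.newline.co/@NickBadot/common-statistical-llm-evaluation-metrics-and-what-they-mean--6da233be)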

I hope you’ve enjoyed this series of articles and that they help you build better LLM products.

That’s all for now - happy testing!


AUTHOR

Nick Badot
