Confidence Unlocked: A Method to Measure Certainty in LLM Outputs

4 months ago 16

vatvengers

Written by Ruth Miller

Understanding an AI model’s confidence in its outputs is crucial for making informed decisions. Large language models (LLMs) have revolutionized natural language processing, enabling tasks like generating human-like text, summarizing documents, and extracting structured data from unstructured sources. However, when it comes to deploying LLMs in production environments, having reliable confidence scores that accurately reflect the model’s certainty is essential. Extracting meaningful confidence scores from LLM outputs, though, is not a straightforward task.

Understanding LLM Confidence

LLM confidence refers to the model’s internal measure of how certain it is about its output. This confidence is often tied to the probabilities of the tokens (words, symbols, etc.) that the model generates. Higher probabilities indicate greater confidence that a particular token or phrase is correct, based on the surrounding context and prior tokens.

For example, when an LLM generates the tokens “2023–09–02” as an issue date in an invoice, it assigns probability scores to each token. These scores indicate the model’s certainty in the likelihood of each token in “2023–09–02” appearing, based on the context and preceding tokens.

I began by analyzing log probabilities from OpenAI outputs, drawing insights from resources like this Medium post on interpreting logprobs. Yet, I encountered a problem: the confidence scores were consistently high, regardless of prediction quality. This motivated me to develop a new method for interpreting logprobs, one that produces meaningful confidence scores that can be leveraged to ensure higher accuracy of LLM-based models in production.

Log Probabilities in OpenAI

Log probabilities of output tokens indicate the likelihood of each token occurring in the sequence given the context. To simplify, a log probability, or logprob, is calculated as log(p), where p is the probability of a token occurring at a specific position based on the previous tokens in the context.

Key Points about Logprobs:

  • Logarithmic Scale: Logprobs are on a logarithmic scale, meaning that small differences in logprob values can represent large differences in actual probabilities.
  • Additive Property: When dealing with sequences of tokens, logprobs can be summed to determine the overall likelihood of a sequence. This is particularly useful for aggregating confidence across multiple tokens in a key-value pair.
  • Negative Values: Logprobs are typically negative, with values closer to zero indicating higher probabilities (greater confidence).

Method Overview

This method relies on structuring the LLM output in a key-value format, where the key represents the problem or task and the value is the solution. By doing so, the model’s output gains additional context, allowing the probability of the solution to be assessed in relation to the problem itself.

In entity extraction tasks, this key-value structure is intuitive, but it can also be applied to many other types of tasks. The advantage of this approach is that it enhances the relationship between the problem (key) and the solution (value), improving the relevance of the probability calculations.

This method extracts confidence scores for each key in the output by leveraging the log probabilities of the tokens generated by the model. By aggregating these log probabilities, it produces a confidence score for each key-value pair, offering a more detailed view of the model’s certainty.

Applying the Method to Classification and Q&A Tasks

Instead of directly extracting the answer or class, you can prompt the LLM to generate a JSON object where the key represents the question or classification task in a short, summarized form, and the value is the answer or class.

For example, in a classification task to determine the expense type of an invoice, the output might look like this:

{
"Expense type": "Hotel"
}

In a Q&A task, the output could be structured as:

{
"What is the capital of France?": "Paris"
}

This approach is superior to simply extracting the answer or class, as the model calculates probabilities for each token, reflecting the likelihood of its occurrence within the given context. By including the problem or classification task in the output, the probabilities of the answer tokens are enriched and more closely tied to the context, leading to more accurate and reliable confidence scores.

Step-by-Step Explanation

Step 1: Parsing the Response

First, the JSON response from OpenAI is parsed into individual key-value pairs. In the context of extracting data from invoices, the JSON might look like this:

{
"total_amount": 3.02,
"issue_date": "2022-11-26",
"expense_type": 18,
"country": "DE",
"currency": "EUR",
"has_alcohol": false,
"vat_data": [
{
"vat_percent": 19,
"vat_amount": 0.48,
"exclude_vat_amount": 2.54
}
]
}

Step 2: Aggregating Log Probabilities

Each key-value pair is associated with a series of tokens, and each token has an associated log probability — a logarithmic measure of the likelihood that the token is correct. The method aggregates these log probabilities for all tokens related to a key-value pair by summing them up. Summing log probabilities is mathematically equivalent to multiplying the original probabilities, which allows us to capture the overall confidence for each key.

Step 3: Calculating Confidence Scores

Once the log probabilities are summed, they are converted back into probabilities, yielding a confidence score for each key. This score represents the joint probability of all tokens in a key-value pair, providing a clear measure of the overall probability for each key.

Below is a visual representation of this process, where each token’s probability is displayed above the token itself. The overall confidence score for each key-value pair is displayed at the top of the arc.

Confidence scores calculation process

Example Code Implementation

Below is an example code that demonstrates how to calculate confidence scores from log probabilities. To simplify the process, I’ve developed an open source PyPi package called llm_confidence that handles log probabilities and calculates confidence scores.

from llm_confidence.logprobs_handler import LogprobsHandler

# Initialize the LogprobsHandler
logprobs_handler = LogprobsHandler()

# Format the logprobs
logprobs_formatted = [{'token': ' "', 'logprob': -0.050051052}, {'token': 'total', 'logprob': 0.0}, {'token': '_amount', 'logprob': 0.0}, {'token': '":', 'logprob': -1.0280384e-06}, {'token': ' ', 'logprob': -1.0280384e-06}, {'token': '3', 'logprob': -3.0545007e-06}, {'token': '.', 'logprob': 0.0}, {'token': '02', 'logprob': -4.3202e-07}, {'token': ',', 'logprob': 0.0}]

# Process the log probabilities to get confidence scores
confidence = logprobs_handler.process_logprobs(
logprobs_formatted)

Below is an example of how this package can be used together with OpenAI’s API response:

from openai import OpenAI
import os
from llm_confidence.logprobs_handler import LogprobsHandler

def get_completion(
messages: list[dict[str, str]],
model: str = "gpt-4o",
max_tokens=500,
temperature=0,
stop=None,
seed=42,
response_format=None,
logprobs=None,
top_logprobs=None,
):
params = {
"model": model,
"messages": messages,
"max_tokens": max_tokens,
"temperature": temperature,
"stop": stop,
"seed": seed,
"logprobs": logprobs,
"top_logprobs": top_logprobs,
}
if response_format:
params["response_format"] = response_format

completion = client.chat.completions.create(**params)
return completion

# Initialize the LogprobsHandler
logprobs_handler = LogprobsHandler()

# Set up your OpenAI client with your API key
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

# Define a prompt for completion
response_raw = get_completion(
[{'role': 'user', 'content': '<Insert your prompt here>'}],
logprobs=True,
response_format={'type': 'json_object'}
)

# Print the output
print(response_raw.choices[0].message.content)

# Extract the log probabilities from the response
response_logprobs = response_raw.choices[0].logprobs.content if hasattr(response_raw.choices[0], 'logprobs') else []

# Format the logprobs from OpenAI
logprobs_formatted = logprobs_handler.format_logprobs(response_logprobs)

# Process the log probabilities to get confidence scores
confidence = logprobs_handler.process_logprobs(logprobs_formatted)

# Print the confidence scores
print(confidence)

Practical Use Cases

One of the key applications of this method is in extracting data from invoices. Imagine an AI model tasked with parsing hundreds of invoices and returning structured data in JSON format. Each field — such as “total_amount, “issue_date,” and “vat_data” — is critical for downstream processes like accounting, compliance, and reporting.

By applying the method described above, you can obtain a confidence score for each key in the JSON output. For instance:

{
"total_amount": 0.98,
"issue_date": 0.85,
"expense_type": 0.90,
"country": 0.95,
"currency": 0.92,
"has_alcohol": 0.80,
"vat_data": 0.88
}

In this example, the confidence score for “issue_date” is slightly lower than for “total_amount,” indicating that the model is less certain about the extracted date. Such insights can be invaluable for prioritizing manual reviews or triggering further validation steps.

The chart below demonstrates the improvement in the accuracy of the “total_amount” key by applying thresholding on the confidence scores. As the threshold increases, accuracy improves, though the coverage of this key decreases accordingly, illustrating the trade-off between accuracy and coverage.

Accuracy improves as threshold increases

To test the robustness of this method, I conducted several experiments to validate its performance.

The first experiment involved changing the order of key-value pairs in the JSON output. In the second experiment, I extracted only one key-value pair in the JSON output. The goal was to see if the confidence scores remained reliable even when the structure of the output was altered.

The results showed that this method is indeed robust. Regardless of changes to the order or separation of key-value pairs, the confidence scores remained indicative of the solution’s quality. This demonstrates that the method can handle variations in output structure without compromising the accuracy of its confidence assessment.

Reordering JSON Output Experiment

This plot demonstrates that as the confidence score threshold increases, accuracy improves even after reordering the key-value pairs in the JSON output.

Reordering JSON output experiment results

Single Key JSON Output Experiment

This plot illustrates the effect of increasing the confidence score threshold when extracting only one key-value pair. The accuracy improves with higher thresholds, while coverage decreases, highlighting the method’s reliability even when working with minimal JSON data.

Single key JSON output experiment results

Discussion

During the robustness experiments, it became clear that while the confidence score remains indicative even after altering the order or size of the JSON output, these manipulations can affect the distribution of confidence scores. This suggests that the decision-making threshold may need to be adjusted based on how the JSON structure is modified.

Additionally, the confidence for “total_amount” varied, while the confidence for “has_alcohol” remained stable. This suggests that some fields are more sensitive to their surrounding key-value pairs, while others are more flexible. Moreover, altering the JSON structure affects not only confidence distribution but also the overall quality of key-value pairs.

It’s also important to carefully consider the key-value structure for each task. In some cases, including additional key-value pairs in the output can enrich certain keys, improving the overall confidence. However, in other scenarios, it may be better to extract a single key-value pair without introducing any additional text, to maintain clarity and precision. For instance, the accuracy of the “total_amount” field improved by 0.5% when it was extracted as a standalone key-value pair, compared to when it was extracted alongside other fields.

When multiple key-value pairs are extracted in a single JSON output, designing a normalized confidence score could be beneficial. This score would account for the influence of other key-value pairs in the output, ensuring a more consistent evaluation of confidence across all fields.

Finally, this solution is empirical in nature and may not generalize in the same way across different LLMs. Tuning and experimentation might be necessary to adapt it effectively to other models.

Edge Cases

While the method is robust, there are still potential edge cases to be mindful of. Complex tokens, such as those found in nested JSON objects or strings with special characters, can sometimes impact the confidence score in unexpected ways. Future iterations of the method could focus on refining token handling to better address these complexities.

Recap

In summary, after struggling with leveraging logprobas into informative confidence scores, I developed a new method that provides a way to extract and quantify confidence scores for LLM models that provide logprobas (such as openAI). These confidence scores offer deeper insights into the model’s certainty, allowing for more informed decision-making — especially in high-stakes tasks like financial data extraction or classification problems.

Call to Action

I invite you to try this method and apply it to your own LLM outputs. Whether you’re working with invoices, contracts, or any other structured data, understanding LLM confidence can significantly improve the reliability and accuracy of your results.

I also invite you to try and improve this method and share your thoughts in the comments and contribute to llm-confidence package !

References

Acknowledgments

A special thanks to the OpenAI community and everyone who contributed valuable insights during the development of this method. I’d also like to extend my gratitude to my amazing data science team, especially David Grabois, for their continuous support.

Read Entire Article