Language Metrics

The Language Metrics module provides a suite of tools for evaluating the quality and consistency of language model outputs. These metrics are designed to assess various aspects of generated text, such as semantic similarity, factual accuracy, and coherence. By leveraging advanced models and techniques, the Language Metrics module helps ensure that language models produce high-quality, reliable, and contextually appropriate responses.

The module includes the following key metrics:

  • BLEURTScore: A learned metric that uses BERT to compute a similarity score between each token in the candidate sentence and each token in the reference sentence. This metric is particularly useful for evaluating the semantic content of generated text.

  • Q-Squared: A reference-free metric that evaluates the factual consistency of knowledge-grounded dialogue systems. This approach is based on automatic question generation and question answering, providing a robust measure of how well the generated responses align with the given knowledge.

These metrics are essential for developers and researchers working on natural language processing (NLP) tasks, as they provide valuable insights into the performance and reliability of language models.

class saga_llm_evaluation.helpers.language_metrics.BLEURTScore(checkpoint='BLEURT-tiny')[source]

Bases: object

BLEURT is a learned metric that uses BERT to compute a similarity score between each token in the candidate sentence and each token in the reference sentence.

Parameters:

checkpoint (str, optional) – Checkpoint to use. Defaults to BLEURT-tiny if not specified. Check https://huggingface.co/spaces/evaluate-metric/bleurt for more checkpoints.

compute(references, predictions, **kwargs)[source]

This function computes the BLEURT score for each candidate sentence in the list of predictions.

Parameters:
  • references (list) – List of reference sentences.

  • predictions (list) – List of candidate sentences.

Returns:

List of scores for each candidate sentence.

Return type:

list
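
A minimal usage sketch (the import path and signature follow the documentation above; the example sentences and the shape of the printed output are illustrative only):

    from saga_llm_evaluation.helpers.language_metrics import BLEURTScore

    # Default lightweight checkpoint; see the Hugging Face page above for alternatives.
    bleurt = BLEURTScore(checkpoint="BLEURT-tiny")

    # One score is computed per (reference, prediction) pair.
    scores = bleurt.compute(
        references=["The cat sat on the mat."],
        predictions=["A cat was sitting on the mat."],
    )
    print(scores)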

class saga_llm_evaluation.helpers.language_metrics.QSquared(qa_model: str = 'ktrapeznikov/albert-xlarge-v2-squad-v2', qg_model: str = 'mrm8488/t5-base-finetuned-question-generation-ap', lang='en')[source]

Bases: object

Q² is a reference-free metric that aims to evaluate the factual consistency of knowledge-grounded dialogue systems. The approach is based on automatic question generation and question answering. Source: https://github.com/orhonovich/q-squared

Parameters:
  • qa_model (str) – Huggingface question answering model to use

  • qg_model (str) – Huggingface question generation model to use

  • lang (str, optional) – Language to use. Defaults to “en”; “fr” is also supported.

compute(predictions: list, knowledges: list, single: bool = False, remove_personal: bool = True)[source]

Compute the Q² score for a given response and knowledge.

Parameters:
  • predictions (list or str) – (list of) candidate texts generated by the LLM

  • knowledges (list or str) – (list of) knowledge given as a context to the LLM for each candidate text

  • single (bool) – if True, only one question is generated for each candidate answer. Defaults to False.

  • remove_personal (bool) – if True, remove questions that contain personal pronouns. Defaults to True.

Returns:

dictionary containing the following keys:

  • avg_f1 – average F1-score for each candidate

Return type:

dict
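
A usage sketch for the end-to-end metric, assuming the default QA/QG models listed in the signature can be downloaded; the prediction and knowledge strings are illustrative:

    from saga_llm_evaluation.helpers.language_metrics import QSquared

    qsquared = QSquared(lang="en")  # uses the default QA and QG models

    result = qsquared.compute(
        predictions=["Einstein was born in Ulm, Germany, in 1879."],
        knowledges=["Albert Einstein was born in Ulm on 14 March 1879."],
        single=False,
        remove_personal=True,
    )
    # Higher avg_f1 means the candidate text is more consistent with the knowledge.
    print(result["avg_f1"])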

get_answer(question: str, text: str)[source]

Search for the answer in the text given the question.

Parameters:
  • question (str) – question to ask

  • text (str) – text to search in

Returns:

answer to the question

Return type:

str
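
For example, a single extractive QA call could look like this (hedged sketch reusing the default models above):

    from saga_llm_evaluation.helpers.language_metrics import QSquared

    qsquared = QSquared(lang="en")
    answer = qsquared.get_answer(
        question="Where was Einstein born?",
        text="Albert Einstein was born in Ulm, Germany.",
    )
    print(answer)  # expected to be a span such as "Ulm"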

get_answer_candidates(text: str)[source]

Look for candidate answers that can be extracted from the text.

Parameters:

text (str) – text to search in

Returns:

candidate answers

Return type:

str

get_questions_beam(answer: str, context: str, max_length: int = 128, beam_size: int = 5, num_return: int = 5)[source]

Generate the n best questions for a given answer and context, using beam search.

Parameters:
  • answer (str) – answer to the question

  • context (str) – context to search in

  • max_length (int, optional) – max length of the generated question. Defaults to 128.

  • beam_size (int, optional) – beam size. Defaults to 5.

  • num_return (int, optional) – number of questions to return. Defaults to 5.

Returns:

n best questions

Return type:

list
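
Together, get_answer_candidates and get_questions_beam cover the question-generation step of Q². A hedged sketch with illustrative strings and the default models:

    from saga_llm_evaluation.helpers.language_metrics import QSquared

    qsquared = QSquared(lang="en")
    response = "Albert Einstein was born in Ulm in 1879."

    # Candidate answer spans extracted from the generated response.
    candidates = qsquared.get_answer_candidates(response)
    print(candidates)

    # The n best questions for one candidate answer, generated via beam search.
    questions = qsquared.get_questions_beam(
        answer="Ulm",  # a span expected among the candidates above
        context=response,
        beam_size=5,
        num_return=5,
    )
    print(questions)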

single_question_score(question: str, answer: str, response: str, knowledge: str)[source]

Given a candidate question-answer pair (generated from the candidate text), compute the score of the answer obtained when the knowledge given to the LLM is used as context. The higher the F1-score, the more consistent the evaluated model is with the knowledge.

Parameters:
  • question (str) – candidate question (generated from the candidate text)

  • answer (str) – candidate answer (generated from the candidate text)

  • response (str) – text generated by the LLM

  • knowledge (str) – knowledge given as a context to the LLM

Returns:

BERTScore of the answer obtained from the knowledge (compared to the candidate answer), and the knowledge answer itself

Return type:

tuple
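
Putting it together, one generated question-answer pair can be scored against the knowledge as follows (illustrative values; the return values follow the description above):

    from saga_llm_evaluation.helpers.language_metrics import QSquared

    qsquared = QSquared(lang="en")
    score, knowledge_answer = qsquared.single_question_score(
        question="Where was Einstein born?",
        answer="Ulm",  # answer span taken from the candidate text
        response="Einstein was born in Ulm in 1879.",
        knowledge="Albert Einstein was born in Ulm, Germany, on 14 March 1879.",
    )
    print(score, knowledge_answer)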