Language Metrics
The Language Metrics module provides a suite of tools for evaluating the quality and consistency of language model outputs. These metrics are designed to assess various aspects of generated text, such as semantic similarity, factual accuracy, and coherence. By leveraging advanced models and techniques, the Language Metrics module helps ensure that language models produce high-quality, reliable, and contextually appropriate responses.
The module includes the following key metrics:
BLEURTScore: A learned metric that uses BERT to compute a similarity score between each candidate sentence and its reference sentence. This metric is particularly useful for evaluating the semantic content of generated text.
Q-Squared: A reference-free metric that evaluates the factual consistency of knowledge-grounded dialogue systems. This approach is based on automatic question generation and question answering, providing a robust measure of how well the generated responses align with the given knowledge.
These metrics are essential for developers and researchers working on natural language processing (NLP) tasks, as they provide valuable insights into the performance and reliability of language models.
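A minimal usage sketch of both metrics is shown below. It assumes the package is installed and that the default checkpoints and Hugging Face models can be downloaded; the example sentences are purely illustrative.

```python
from saga_llm_evaluation.helpers.language_metrics import BLEURTScore, QSquared

# Semantic similarity against reference sentences (BLEURT).
bleurt = BLEURTScore()  # defaults to the "BLEURT-tiny" checkpoint
bleurt_scores = bleurt.compute(
    references=["The cat sits on the mat."],
    predictions=["A cat is sitting on the mat."],
)

# Reference-free factual consistency against provided knowledge (Q²).
q_squared = QSquared(lang="en")
q2_scores = q_squared.compute(
    predictions=["Paris is the capital of France."],
    knowledges=["Paris is the capital and most populous city of France."],
)
```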
- class saga_llm_evaluation.helpers.language_metrics.BLEURTScore(checkpoint='BLEURT-tiny')[source]
Bases: object
BLEURT is a learned metric that uses BERT to compute a similarity score between the candidate sentence and the reference sentence.
- Parameters:
checkpoint (str, optional) – Checkpoint to use. Defaults to BLEURT-tiny if not specified. Check https://huggingface.co/spaces/evaluate-metric/bleurt for more checkpoints.
- compute(references, predictions, **kwargs)[source]
This function computes the BLEURT score for each candidate sentence in the list of predictions.
- Parameters:
references (list) – List of reference sentences.
predictions (list) – List of candidate sentences.
- Returns:
List of scores for each candidate sentence.
- Return type:
list
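A short sketch of the call described above, following the signature in this entry. The sentences are illustrative, and the result is printed as-is since its exact shape is whatever compute returns (a list of per-candidate scores, per the description above).

```python
from saga_llm_evaluation.helpers.language_metrics import BLEURTScore

bleurt = BLEURTScore(checkpoint="BLEURT-tiny")

references = ["The weather is nice today.", "He bought two apples."]
predictions = ["Today the weather is pleasant.", "He purchased a couple of apples."]

# One score per candidate sentence, as documented above.
scores = bleurt.compute(references=references, predictions=predictions)
print(scores)
```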
- class saga_llm_evaluation.helpers.language_metrics.QSquared(qa_model: str = 'ktrapeznikov/albert-xlarge-v2-squad-v2', qg_model: str = 'mrm8488/t5-base-finetuned-question-generation-ap', lang='en')[source]
Bases: object
Q² is a reference-free metric that aims to evaluate the factual consistency of knowledge-grounded dialogue systems. The approach is based on automatic question generation and question answering. Source: https://github.com/orhonovich/q-squared
- Parameters:
qa_model (str) – Hugging Face question answering model to use.
qg_model (str) – Hugging Face question generation model to use.
lang (str, optional) – Language to use. Defaults to “en”; “fr” is also supported.
- compute(predictions: list, knowledges: list, single: bool = False, remove_personal: bool = True)[source]
Compute the Q² score for a given response and knowledge.
- Parameters:
predictions (list or str) – (list of) candidate texts generated by the LLM
knowledges (list or str) – (list of) knowledge given as a context to the LLM for each candidate text
single (bool) – if True, only one question is generated for each candidate answer. Defaults to False.
remove_personal (bool) – if True, remove questions that contain personal pronouns. Defaults to True.
- Returns:
dictionary containing the following keys:
avg_f1 (list): average F1-score for each candidate text
- Return type:
dict
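A sketch of a compute call, based on the signature and return description above. The candidate text and knowledge snippet are illustrative, and the default QA/QG models are downloaded from Hugging Face on first use.

```python
from saga_llm_evaluation.helpers.language_metrics import QSquared

q_squared = QSquared(lang="en")

predictions = [
    "Einstein developed the theory of general relativity in 1915.",
]
knowledges = [
    "Albert Einstein published the theory of general relativity in 1915.",
]

result = q_squared.compute(
    predictions=predictions,
    knowledges=knowledges,
    single=False,          # generate several questions per candidate answer
    remove_personal=True,  # drop questions containing personal pronouns
)
print(result["avg_f1"])
```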
- get_answer(question: str, text: str)[source]
Search for the answer in the text given the question.
- Parameters:
question (str) – question to ask
text (str) – text to search in
- Returns:
answer to the question
- Return type:
str
- get_answer_candidates(text: str)[source]
Look for candidate answers that could be extracted from the text.
- Parameters:
text (str) – text to search in
- Returns:
candidate answers
- Return type:
str
- get_questions_beam(answer: str, context: str, max_length: int = 128, beam_size: int = 5, num_return: int = 5)[source]
Get the n best questions for a given answer within the given context, using beam search (hence the name).
- Parameters:
answer (str) – answer for which questions should be generated
context (str) – context to search in
max_length (int, optional) – max length of the generated question. Defaults to 128.
beam_size (int, optional) – beam size. Defaults to 5.
num_return (int, optional) – number of questions to return. Defaults to 5.
- Returns:
n best questions
- Return type:
list
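The three helpers above (get_answer_candidates, get_questions_beam, and get_answer) form the question generation and answering loop that compute relies on. The sketch below chains them manually; it assumes get_answer_candidates yields an iterable of candidate answer spans (as in the upstream Q² implementation), and the example sentences are illustrative.

```python
from saga_llm_evaluation.helpers.language_metrics import QSquared

q_squared = QSquared(lang="en")

candidate_text = "Marie Curie won the Nobel Prize in Physics in 1903."
knowledge = "Marie Curie was awarded the Nobel Prize in Physics in 1903."

# 1. Extract spans from the candidate text that could serve as answers.
candidates = q_squared.get_answer_candidates(candidate_text)

for answer in candidates:
    # 2. Generate questions for each candidate answer via beam search.
    questions = q_squared.get_questions_beam(
        answer, candidate_text, max_length=128, beam_size=5, num_return=5
    )
    for question in questions:
        # 3. Answer the same question using the knowledge given to the LLM.
        knowledge_answer = q_squared.get_answer(question, knowledge)
        print(question, "->", knowledge_answer)
```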
- single_question_score(question: str, answer: str, response: str, knowledge: str)[source]
Given a candidate question/answer pair (generated from the candidate text), get the score of the answer obtained when the knowledge given to the LLM is used as context. The higher the F1-score, the more consistent the evaluated model is with the knowledge.
- Parameters:
question (str) – candidate question (generated from the candidate text)
answer (str) – candidate answer (generated from the candidate text)
response (str) – text generated by the LLM
knowledge (str) – knowledge given as a context to the LLM
- Returns:
BERTScore of the answer extracted from the knowledge, and that knowledge answer itself
- Return type:
tuple
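A sketch of scoring one generated question/answer pair against the knowledge, following the signature above. The question/answer pair is illustrative, and the tuple unpacking follows the documented return value (score first, knowledge answer second).

```python
from saga_llm_evaluation.helpers.language_metrics import QSquared

q_squared = QSquared(lang="en")

response = "The Eiffel Tower was completed in 1889."
knowledge = "Construction of the Eiffel Tower finished in 1889."

# A question/answer pair assumed to have been generated from the response.
question = "When was the Eiffel Tower completed?"
answer = "1889"

score, knowledge_answer = q_squared.single_question_score(
    question, answer, response, knowledge
)
print(score, knowledge_answer)
```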