LLM Metrics

The idea is to use an LLM to score a candidate output based on its generation probability, without any reference target, under the assumption that LLMs learn to assign higher probabilities to high-quality, fluent text. The disadvantage of this approach is that LLMs are not perfect and can assign high probabilities to incorrect or nonsensical text. The advantage is that it is a fully automatic, reference-free evaluation method that can be used to assess the quality of LLM outputs at scale.
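As a rough illustration of this idea, such a score can be as simple as the average log-likelihood of the candidate tokens under the evaluation model. A minimal sketch (the token probabilities below are made-up values, not the output of any model in this library):

import math

# Hypothetical per-token probabilities assigned by an evaluation LLM
# to the tokens of a candidate sentence (illustrative values only).
token_probs = [0.91, 0.85, 0.78, 0.95]

# Average log-likelihood: always negative; closer to zero means the model
# considers the candidate more probable, i.e. more fluent and plausible.
score = sum(math.log(p) for p in token_probs) / len(token_probs)
print(round(score, 2))  # -0.14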

class saga_llm_evaluation.helpers.llm_metrics.Correctness(llm=None)[source]

Bases: object

This class implements the correctness evaluation metric for generative language models. The correctness metric evaluates if the submission is correct, accurate, and factual. This definition is based on LangChain’s labeled_criteria evaluator.

Parameters:

llm (LangChain BaseLanguageModel) – model used for evaluation. If None, the model is chosen as “gpt-4” by default.

compute(user_prompts: list, predictions: list, references: list)[source]

This method computes the correctness score for a candidate sentence given a source text and a reference.

Parameters:
  • user_prompts (list) – Source text generated by the user.

  • predictions (list) – Candidate sentences.

  • references (list) – Reference sentence.

Returns:

Correctness score for the candidate sentence. The dictionary contains the following keys:

  • score (int) : Correctness score. Binary integer value (0 or 1), where 1 indicates that the sentence is correct and 0 indicates that the sentence is incorrect.

  • value (str) : Correctness value. Y or N, where Y indicates that the sentence is correct and N indicates that the sentence is incorrect.

  • reasoning (str) : Reasoning for the correctness score.

Return type:

dict
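A minimal usage sketch based on the signature above; the example inputs and the exact shape of the returned dictionary are assumptions, not guaranteed outputs:

from saga_llm_evaluation.helpers.llm_metrics import Correctness

correctness = Correctness()  # defaults to "gpt-4" as the evaluation model
result = correctness.compute(
    user_prompts=["What is the capital of France?"],
    predictions=["The capital of France is Paris."],
    references=["Paris is the capital of France."],
)
# Assumed shape of the result: {"score": 1, "value": "Y", "reasoning": "..."}
print(result)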

class saga_llm_evaluation.helpers.llm_metrics.Faithfulness(llm=None)[source]

Bases: object

This class implements the faithfulness evaluation metric for generative language models. The faithfulness metric evaluates if the submission contains information not present in the input or reference.

Parameters:

llm (LangChain BaseLanguageModel) – model used for evaluation. If None, the model is chosen as “gpt-4” by default.

compute(user_prompts: list, predictions: list, references: list)[source]

This method computes the faithfulness score for a candidate sentence given a source text and a reference.

Parameters:
  • user_prompts (list) – Source text generated by the user.

  • predictions (list) – Candidate sentences.

  • references (list) – Reference sentence.

Returns:

Faithfulness score for the candidate sentence. The dictionary contains the following keys:

  • score (int) : Faithfulness score. Binary integer value (0 or 1), where 1 indicates that the sentence is faithful and 0 indicates that the sentence is not faithful (i.e. it contains hallucinations).

  • value (str) : Faithfulness value. Y or N, where Y indicates that the sentence is faithful and N indicates that the sentence is not faithful.

  • reasoning (str): Reasoning for the faithfulness score.

Return type:

dict
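A usage sketch analogous to Correctness above (inputs are illustrative assumptions). Here the candidate introduces a detail that appears in neither the source nor the reference, so an unfaithful verdict would be expected:

from saga_llm_evaluation.helpers.llm_metrics import Faithfulness

faithfulness = Faithfulness()
result = faithfulness.compute(
    user_prompts=["Summarise: The meeting was moved to Tuesday."],
    predictions=["The meeting was moved to Tuesday because of a strike."],
    references=["The meeting was moved to Tuesday."],
)
# The unsupported "because of a strike" would be expected to yield
# score 0 / value "N", with the reasoning explaining why.
print(result)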

class saga_llm_evaluation.helpers.llm_metrics.GEval(model=None)[source]

Bases: object

This class implements the GEval evaluation metric for generative language models. It is inspired by the GEval metric proposed in https://arxiv.org/pdf/2303.16634.pdf.

Parameters:

model (LangChain BaseChatModel) – model used for evaluation. If None, the model used is “gpt-3.5-turbo” by default.

add_aspect(code: str, name: str, prompt: str)[source]

This method adds an aspect to the list of pre-defined aspects. Please follow the example pattern below to ensure consistency.

Example:

"COH": {
    "name": "Coherence",
    "prompt": "Coherence (1-5) - the overall quality and logical flow of all sentences.\
        This dimension aligns with the DUC quality question of structure and coherence, which states that\
        the summary should be well-structured and well-organized. It should not just be\
        a collection of related information, but should build from sentence to sentence\
        to form a coherent body of information about a topic."
}
Parameters:
  • code (str) – Aspect code.

  • name (str) – Aspect name.

  • prompt (str) – Aspect prompt.

add_task(name: str, definition: str)[source]

This method adds a task to the list of pre-defined tasks. Please follow the example pattern below to ensure consistency.

Example:

"summ": (
    "You will be given one summary written for a news article.\n"
    "Your task is to rate the summary on one metric.\n"
    "Please make sure you read and understand these instructions carefully.\n"
    "Please keep this document open while reviewing, and refer to it as needed."
)
Parameters:
  • name (str) – Task name.

  • definition (str) – Task description.

compute(user_prompts: list, predictions: list, task=None, aspect=None, custom_prompt=None)[source]

This method computes the GEval score for a candidate sentence given a source text, a prompt template, an aspect to evaluate, and a task description.

Parameters:
  • user_prompts (list or str) – Source text generated by the user.

  • predictions (list or str) – Candidate sentence(s) to evaluate.

  • task (str, optional) – Definition of the task.

  • aspect (str or list of str, optional) – (List of) evaluation criterion code(s).

  • custom_prompt (dict, optional) – Custom prompt template. Defaults to None.

Returns:

Score for the candidate sentence.

Return type:

float
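A usage sketch based on the signature above. The inputs are illustrative, and the task code "summ" and aspect code "COH" are taken from the add_task and add_aspect examples; whether they ship as pre-defined defaults is an assumption here:

from saga_llm_evaluation.helpers.llm_metrics import GEval

geval = GEval()  # defaults to "gpt-3.5-turbo" as the evaluation model
score = geval.compute(
    user_prompts=["Summarise the following article: ..."],
    predictions=["The article reports that the city council approved the new budget."],
    task="summ",   # task code, see the add_task example above
    aspect="COH",  # aspect code, see the add_aspect example above
)
print(score)  # float score for the candidate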

get_cot(prompt: str)[source]

This method returns a chain of thoughts given a prompt template.

Parameters:

prompt (str) – Prompt template.

Returns:

Chain of thoughts.

Return type:

str

get_prediction(prompt: str)[source]

This method returns a prediction given a prompt template.

Parameters:

prompt (str) – Prompt template.

Returns:

Response from the model.

Return type:

dict

get_prompt(prompts: list, predictions: list, task: str, aspect: str, custom_prompt: dict | None = None)[source]

This method returns prompt templates given source texts, candidate sentences, an aspect to evaluate, and a task description.

Parameters:
  • prompts (list) – List of source texts.

  • predictions (list) – List of candidate sentences to evaluate.

  • task (str) – Definition of the task.

  • aspect (str) – Evaluation criterion code.

  • custom_prompt (dict) – Custom prompt template. Must contain the following keys: “task”, “aspect”, “name”.

Returns:

List of prompt templates

Return type:

list

get_score(prompts: list)[source]

This method returns the GEval score given a prompt template.

Parameters:

prompts (list) – List of prompt templates.

Returns:

List of scores for each candidate sentence.

Return type:

list

class saga_llm_evaluation.helpers.llm_metrics.GPTScore(model=None)[source]

Bases: object

This class implements the GPTScore evaluation metric for generative language models. It is inspired by the GPTScore metric proposed in https://arxiv.org/pdf/2302.04166.pdf. The GPTScore is computed as the average log-likelihood of the tokens in the candidate sentence. Since each token probability lies between 0 and 1, the average log-likelihood is always negative: the higher the GPTScore (i.e. the closer to zero), the better the sentence.

Parameters:

model (LangChain BaseChatModel) – model used for evaluation. If None, the model used is “gpt-3.5-turbo” by default.

add_criterion(task: str, code: str, desc: str)[source]

This method adds a criterion to the list of pre-defined criteria. Please follow the example pattern below to ensure consistency.

Example:

"diag": {
    "COH": (
        f"Answer the question based on the conversation between a human and AI.\n"
        "Question: Is the AI coherent and maintains a good conversation flow throughout the conversation? (a) Yes. (b) No.\n"
        "Conversation:\nUser: {{src}}\nAI: {{pred}}\nAnswer: Yes."
    ),
}
Parameters:
  • task (str) – Task name. (Example: “diag”)

  • code (str) – Aspect code. (Example: “COH”)

  • desc (str) – Aspect description.

compute(user_prompts: list, predictions: list, custom_prompt: dict | None = None, aspect=None, task: str | None = None)[source]

This method computes the GPTScore for a candidate sentence given a source text, an optional custom prompt template, an aspect to evaluate, and a task description.

Parameters:
  • user_prompts (list or str) – (list of) Source text generated by the user.

  • predictions (list or str) – (List of) candidate sentence(s).

  • custom_prompt (dict, optional) – Custom prompt template. Defaults to None. Must contain the following keys: “task”, “aspect”, “name”.

  • aspect (str or list, optional) – (List of) Aspect(s) to evaluate. Defaults to None.

  • task (str, optional) – Task description. Defaults to None.

Returns:

Score(s) for the candidate sentence(s), per aspect.

Return type:

dict
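A usage sketch based on the signature above; the inputs are illustrative, and the task/aspect codes are taken from the add_criterion example rather than being guaranteed defaults:

from saga_llm_evaluation.helpers.llm_metrics import GPTScore

gptscore = GPTScore()  # defaults to "gpt-3.5-turbo" as the evaluation model
scores = gptscore.compute(
    user_prompts=["Hello, can you help me plan a trip to Rome?"],
    predictions=["Of course! How many days will you be staying?"],
    aspect="COH",  # criterion code, see the add_criterion example above
    task="diag",   # task code, see the add_criterion example above
)
# Scores are returned per aspect; values are average log-likelihoods, so
# they are negative and closer to zero means a better candidate.
print(scores)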

get_prompt(aspect: str, task: str, prompts: list, predictions: list, custom_prompt: dict | None = None)[source]

This method returns prompt templates given source texts, candidate sentences, a task description, and an aspect to evaluate.

Parameters:
  • prompts (list) – List of source texts.

  • predictions (list) – List of candidate sentences.

  • aspect (str) – Aspect to evaluate.

  • task (str) – Task description.

  • custom_prompt (dict) – Custom prompt template. Defaults to None. Must contain the following keys: “task”, “aspect”.

Returns:

(list of) Prompt templates.

Return type:

list

get_score(prompts: list)[source]

This method returns the GPTScore given a list of prompt templates.

Parameters:

prompts (list) – List of prompt templates.

Returns:

GPTScore of the candidate sentence.

Return type:

float

class saga_llm_evaluation.helpers.llm_metrics.HallucinationScore[source]

Bases: object

compute(predictions: list, references: list)[source]

This method computes the hallucination scores for a candidate sentence given a reference sentence.

Parameters:
  • predictions (list) – Candidate sentences (e.g., model outputs).

  • references (list) – Reference sentences (e.g., ground truth).

Returns:

Hallucination detection score. The dictionary contains the following keys:

  • f1_score (float): F1 score, representing the overlap between the prediction and the reference.

  • exact_match (int): Binary integer value (0 or 1), where 1 indicates that the prediction exactly matches the reference and 0 indicates it does not.

Return type:

dict
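A usage sketch based on the signature above (the example sentences are illustrative):

from saga_llm_evaluation.helpers.llm_metrics import HallucinationScore

metric = HallucinationScore()
result = metric.compute(
    predictions=["The Eiffel Tower was completed in 1989."],
    references=["The Eiffel Tower was completed in 1889."],
)
# Assumed shape of the result: {"f1_score": <float>, "exact_match": 0}
print(result)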

class saga_llm_evaluation.helpers.llm_metrics.NegativeRejection(llm=None)[source]

Bases: object

This class implements the negative rejection evaluation metric for generative language models. The negative rejection metric evaluates if the submission refuses to answer when the answer is not present in the input or reference.

Parameters:

llm (LangChain BaseLanguageModel) – model used for evaluation. If None, the model is chosen as “gpt-4” by default.

compute(user_prompts: list, predictions: list, references: list)[source]

This method computes the ability of the system to refuse to answer in the absence of evidence.

Parameters:
  • user_prompts (list) – Source text generated by the user.

  • predictions (list) – Candidate sentences.

  • references (list) – Reference sentence.

Returns:

Negative rejection score for the candidate sentence. The dictionary contains the following keys:

  • score (int): Negative rejection score. Binary integer value (0 or 1), where 1 indicates that the sentence is a refusal to answer and 0 indicates that the sentence is not a refusal to answer.

  • value (str): Negative rejection value. Y or N, where Y indicates that the sentence is a refusal to answer and N indicates that the sentence is not a refusal to answer.

  • reasoning (str): Reasoning for the negative rejection score.

Return type:

dict
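A usage sketch based on the signature above (inputs are illustrative). Here the candidate explicitly declines to answer because the source contains no evidence, so a positive rejection verdict would be expected:

from saga_llm_evaluation.helpers.llm_metrics import NegativeRejection

rejection = NegativeRejection()
result = rejection.compute(
    user_prompts=["According to the passage, what is the CEO's salary?"],
    predictions=["The passage does not mention the CEO's salary, so I cannot answer."],
    references=["The passage contains no information about the CEO's salary."],
)
# A refusal like this would be expected to yield score 1 / value "Y".
print(result)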

class saga_llm_evaluation.helpers.llm_metrics.Relevance(llm=None)[source]

Bases: object

This class implements the relevance evaluation metric for generative language models. The relevance metric evaluates if the submission refers to or accurately conveys the information from the input text, even if it is not an exact quote.

Parameters:

llm (LangChain BaseLanguageModel) – model used for evaluation. If None, the model is chosen as “gpt-4” by default.

compute(user_prompts: list, predictions: list)[source]

This method computes the relevance score for a candidate sentence given a source text. In other words, it validates that the candidate sentence (response) is related to the query topic, and meets the query requirements.

Parameters:
  • user_prompts (list) – Source text generated by the user.

  • predictions (list) – Candidate sentences.

Returns:

Relevance score for the candidate sentence. The dictionary contains the following keys:

  • score (int): Relevance score. Binary integer value (0 or 1), where 1 indicates that the sentence is relevant and 0 indicates that the sentence is irrelevant.

  • value (str): Relevance value. Y or N, where Y indicates that the sentence is relevant and N indicates that the sentence is irrelevant.

  • reasoning (str): Reasoning for the relevance score.

Return type:

dict
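A usage sketch based on the signature above; note that, unlike the previous metrics, compute takes no references (inputs are illustrative):

from saga_llm_evaluation.helpers.llm_metrics import Relevance

relevance = Relevance()
result = relevance.compute(
    user_prompts=["Give me three tips for improving sleep quality."],
    predictions=[
        "Keep a regular schedule, avoid caffeine late in the day, "
        "and dim the lights an hour before bed."
    ],
)
# Assumed shape of the result: {"score": 1, "value": "Y", "reasoning": "..."}
print(result)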

class saga_llm_evaluation.helpers.llm_metrics.SelfCheckGPT(model, eval_model=None)[source]

Bases: object

This class implements the self-check GPT evaluation metric for generative language models. It is inspired by the self-check metric proposed in https://arxiv.org/pdf/2303.08896.pdf.

Parameters:
  • model (Langchain BaseChatModel) – LLM model to evaluate.

  • eval_model (Langchain BaseChatModel, optional) – Evaluation model. If None, the model used is “llama” by default.

compute(user_prompts: list, predictions: list, n_samples=5)[source]

This method computes the self-check GPT score for a candidate sentence given the question asked to the model and a number of samples to generate.

Parameters:
  • user_prompts (list) – Questions asked to the model for which it generated the candidate sentences.

  • predictions (list) – Candidate sentences.

  • n_samples (int) – Number of samples to generate.

Returns:

Score for the candidate sentence.

Return type:

float
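A usage sketch based on the signature above. Unlike the other metrics, the constructor requires the model under evaluation; my_chat_model below is a placeholder for a LangChain BaseChatModel instance and is not defined here:

from saga_llm_evaluation.helpers.llm_metrics import SelfCheckGPT

# my_chat_model is assumed to be a LangChain BaseChatModel wrapping the
# model under evaluation; it is a placeholder, not part of this library.
selfcheck = SelfCheckGPT(model=my_chat_model)
score = selfcheck.compute(
    user_prompts=["Who wrote the novel 'One Hundred Years of Solitude'?"],
    predictions=["It was written by Gabriel García Márquez."],
    n_samples=5,
)
print(score)  # float consistency score for the candidate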

get_prompt(pred: str, sample: str, question: str)[source]

This method returns a prompt template given a candidate sentence, a sample sentence, and a question.

Parameters:
  • pred (str) – Candidate sentence.

  • sample (str) – Sample sentence.

  • question (str) – Question asked to the model for which it generated the candidate sentence.

Returns:

Prompt template.

Return type:

str

get_prompts(pred: str, samples: str, question: str)[source]

This method returns a list of prompt templates given a candidate sentence, a list of sample sentences, and a question.

Parameters:
  • pred (str) – Candidate sentence.

  • samples (list of str) – List of sample sentences.

  • question (str) – Question asked to the model for which it generated the candidate sentence.

Returns:

List of prompt templates.

Return type:

list