API Reference

Scorer Module

class saga_llm_evaluation.score.LLMScorer(metrics=['bert_score', 'mauve', 'bleurt', 'q_squared', 'selfcheckgpt', 'geval', 'gptscore'], model=None, eval_model=None, config=None)[source]

Bases: object

Initialize the LLMScorer class. This class is used to evaluate the performance of a language model using a set of evaluation metrics. The metrics are defined in the config file or can be passed as an input parameter, and both the model to evaluate and the evaluation model can be passed as input parameters.

Parameters:
  • metrics (list, optional) – List of evaluation metrics to use. Defaults to [“bert_score”, “mauve”, “bleurt”, “q_squared”, “selfcheckgpt”, “geval”, “gptscore”].

  • model (object, optional) – Model to evaluate. Defaults to None.

  • eval_model (object, optional) – Evaluation model. Defaults to None.

  • config (dict, optional) – Config file. Defaults to None.
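
For example, a minimal instantiation restricted to a single metric might look as follows (a sketch only; the metric choice is illustrative and the remaining arguments keep their None defaults):

from saga_llm_evaluation.score import LLMScorer

# Evaluate with BERTScore only; model, eval_model and config fall back
# to their defaults.
scorer = LLMScorer(metrics=["bert_score"])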

add_geval_aspect(code: str, name: str, prompt: str)[source]

This function adds a new aspect to the GEval metric. Please follow the example pattern below to ensure consistency.

Example:

"COH": {
    "name": "Coherence",
    "prompt": "Coherence (1-5) - the overall quality and logical flow of all sentences.\
        This dimension aligns with the DUC quality question of structure and coherence, which states that\
        the summary should be well-structured and well-organized. It should not just be\
        a collection of related information, but should build from sentence to sentence\
        to form a coherent body of information about a topic."
}
Parameters:
  • code (str) – Code of the aspect.

  • name (str) – Name of the aspect.

  • prompt (str) – Prompt of the aspect.
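
For example, a hypothetical “Clarity” aspect could be registered as follows (the code, name, and prompt are invented for illustration and are not built into the library):

scorer.add_geval_aspect(
    code="CLA",
    name="Clarity",
    prompt="Clarity (1-5) - how easy the response is to read and understand.",
)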

add_geval_task(name: str, definition: str)[source]

This function adds a new task to the GEval metric. Please follow the example pattern below to ensure consistency.

Example:

"summ": "You will be given one summary written for a news article.\n"
        "Your task is to rate the summary on one metric.\n"
        "Please make sure you read and understand these instructions carefully.\n"
        "Please keep this document open while reviewing, and refer to it as needed."
Parameters:
  • name (str) – Name of the task.

  • definition (str) – Definition of the task.
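
For example, a hypothetical question-answering task could be added as follows (the task name and definition text are invented for illustration):

scorer.add_geval_task(
    name="qa",
    definition=(
        "You will be given a question and an answer produced by an AI assistant.\n"
        "Your task is to rate the answer on one metric.\n"
    ),
)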

add_gptscore_template(task: str, code: str, prompt: str)[source]

This function adds a template to the GPTScore metric. Please follow the example pattern below to ensure consistency.

Example:

"diag": {
    "COH": (
        f"Answer the question based on the conversation between a human and AI.\n"
        "Question: Is the AI coherent and maintains a good conversation flow throughout the conversation? (a) Yes. (b) No.\n"
        "Conversation:\nUser: {{src}}\nAI: {{pred}}\nAnswer: Yes."
    ),
}
Parameters:
  • task (str) – Task of the template.

  • code (str) – Code of the aspect.

  • prompt (str) – Prompt of the aspect.
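
For example, a hypothetical relevance template for the dialogue task could be added as follows (the aspect code and prompt wording are invented for illustration, and the {{src}} and {{pred}} placeholders simply mirror the pattern in the example above; check the built-in templates for the exact placeholder form expected):

scorer.add_gptscore_template(
    task="diag",
    code="REL",
    prompt=(
        "Answer the question based on the conversation between a human and AI.\n"
        "Question: Are the AI's replies relevant to the user's requests? (a) Yes. (b) No.\n"
        "Conversation:\nUser: {{src}}\nAI: {{pred}}\nAnswer: Yes."
    ),
)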

score(user_prompt: list, prediction: list, knowledge: list | None = None, reference: list | None = None, config: dict | None = None)[source]

This function computes the evaluation metrics for a given user prompt and prediction.

Parameters:
  • user_prompt (list) – User prompt(s) given to the model.

  • prediction (list) – Prediction(s) of the model.

  • knowledge (list, optional) – Source text(s) that the model used to generate the prediction. Defaults to None.

  • reference (list, optional) – Reference(s) for the prediction. Defaults to None.

  • config (dict, optional) – Config file. Defaults to None.

Returns:

Dictionary containing the metadata and evaluation metrics.

Return type:

dict
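
For example, continuing with the scorer created above, a single prompt/prediction pair could be scored as follows (the texts are illustrative; depending on the configured metrics, knowledge and reference may also be required, and the exact keys of the returned dictionary depend on which metrics are enabled):

results = scorer.score(
    user_prompt=["Explain what BERTScore measures."],
    prediction=["BERTScore compares a candidate and a reference text using contextual embeddings."],
    reference=["BERTScore measures token-level similarity between texts using contextual embeddings."],
)
print(results)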

saga_llm_evaluation.score.get_model(config: dict, model=None, eval_model=None, key: str = 'bert_score')[source]

Get the evaluation metric model.

Parameters:
  • config (dict) – Config file.

  • model (object, optional) – Model to evaluate. Defaults to None.

  • eval_model (object, optional) – Evaluation model. Defaults to None.

  • key (str, optional) – Evaluation metric to use. Defaults to “bert_score”.

Returns:

Evaluation metric model.

Return type:

object
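
For example, the model backing the BERTScore metric could be retrieved as follows (a sketch only; config stands for a metric configuration dictionary of the shape the library expects, whose exact structure is not shown here):

from saga_llm_evaluation.score import get_model

# `config` is assumed to be a metric configuration dict of the shape the
# library expects (e.g. the one used by LLMScorer); it is not defined here.
bert_score_model = get_model(config=config, key="bert_score")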