Evaluation Metrics

The Collie library supports evaluating both implicit and explicit models.

Collie ships with three common implicit recommendation evaluation metrics out of the box: Area Under the ROC Curve (AUC), Mean Reciprocal Rank (MRR), and Mean Average Precision at K (MAP@K). Each metric is optimized for efficiency, with all calculations done in batched tensor form on the GPU (if available). We provide a standard helper function, evaluate_in_batches, to evaluate a model on many metrics in a single pass.

Explicit evaluation of recommendation systems is, luckily, much more straightforward. This allows us to use the TorchMetrics library for flexible, optimized metric calculations on the GPU, accessed through a standard helper function, explicit_evaluate_in_batches, whose API closely mirrors its implicit counterpart.

Evaluate in Batches

Implicit Evaluate in Batches

collie.metrics.evaluate_in_batches(metric_list: Iterable[Callable], test_interactions: collie.interactions.datasets.Interactions, model: collie.model.base.base_pipeline.BasePipeline, k: int = 10, batch_size: int = 20, logger: Optional[pytorch_lightning.loggers.base.LightningLoggerBase] = None, verbose: bool = True) → List[float]

Evaluate a model with potentially several different metrics.

Memory constraints mean that most test sets need to be evaluated in batches. This function handles the looping and batching boilerplate needed to properly evaluate the model without running out of memory.

Parameters
  • metric_list (list of functions) –

    List of evaluation functions to apply. Each function must accept keyword arguments:

    • targets

    • user_ids

    • preds

    • k

  • test_interactions (collie.interactions.Interactions) – Interactions to use as labels

  • model (collie.model.BasePipeline) – Model that can take a (user_id, item_id) pair as input and return a recommendation score

  • k (int) – Number of recommendations to consider per user. This is ignored by some metrics

  • batch_size (int) – Number of users to score in a single batch. For best efficiency, this number should be as high as possible without running out of memory

  • logger (pytorch_lightning.loggers.base.LightningLoggerBase) – If provided, the output metrics dictionary will be logged via the logger's log_metrics method, with keys being the string representations of the metrics in metric_list and values being evaluation_results. Additionally, if model.hparams.num_epochs_completed exists, it will be logged as well, making it possible to track metric progress over the course of model training

  • verbose (bool) – Display progress bar and print statements during function execution

Returns

evaluation_results – List of floats, with each metric value corresponding to the respective function passed in metric_list

Return type

list

Examples

from collie.metrics import auc, evaluate_in_batches, mapk, mrr

map_10_score, mrr_score, auc_score = evaluate_in_batches(
    metric_list=[mapk, mrr, auc],
    test_interactions=test,
    model=model,
)

print(map_10_score, mrr_score, auc_score)
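The looping and batching boilerplate this helper takes care of can be sketched roughly as follows. This is a hypothetical simplification, not Collie's implementation: score_fn stands in for the model's batch scoring, and the real function operates on GPU tensors.

```python
def batched_metric_means(metric_fns, user_ids, score_fn, batch_size=20):
    """Score users in fixed-size chunks and average each metric,
    weighting every chunk by the number of users it contains."""
    totals = [0.0] * len(metric_fns)
    for start in range(0, len(user_ids), batch_size):
        batch = user_ids[start:start + batch_size]
        preds = score_fn(batch)  # one row of item scores per user in the batch
        for i, metric_fn in enumerate(metric_fns):
            totals[i] += metric_fn(batch, preds) * len(batch)
    return [total / len(user_ids) for total in totals]
```

Weighting each chunk by its size matters because the final batch is usually smaller than batch_size; an unweighted mean of per-batch metric values would over-count those users.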

Explicit Evaluate in Batches

collie.metrics.explicit_evaluate_in_batches(metric_list: Iterable[torchmetrics.metric.Metric], test_interactions: collie.interactions.datasets.ExplicitInteractions, model: collie.model.base.base_pipeline.BasePipeline, logger: Optional[pytorch_lightning.loggers.base.LightningLoggerBase] = None, verbose: bool = True, **kwargs) → List[float]

Evaluate a model with potentially several different metrics.

Memory constraints mean that most test sets need to be evaluated in batches. This function handles the looping and batching boilerplate needed to properly evaluate the model without running out of memory.

Parameters
  • metric_list (list of torchmetrics.Metric) – List of evaluation functions to apply. Each function must accept arguments for predictions and targets, in order

  • test_interactions (collie.interactions.ExplicitInteractions) – Interactions to use as labels

  • model (collie.model.BasePipeline) – Model that can take a (user_id, item_id) pair as input and return a recommendation score

  • batch_size (int) – Number of users to score in a single batch. For best efficiency, this number should be as high as possible without running out of memory

  • logger (pytorch_lightning.loggers.base.LightningLoggerBase) – If provided, the output metrics dictionary will be logged via the logger's log_metrics method, with keys being the string representations of the metrics in metric_list and values being evaluation_results. Additionally, if model.hparams.num_epochs_completed exists, it will be logged as well, making it possible to track metric progress over the course of model training

  • verbose (bool) – Display progress bar and print statements during function execution

  • kwargs (keyword arguments) – Additional arguments sent to the InteractionsDataLoader

Returns

evaluation_results – List of floats, with each metric value corresponding to the respective function passed in metric_list

Return type

list

Examples

import torchmetrics

from collie.metrics import explicit_evaluate_in_batches

mse_score, mae_score = explicit_evaluate_in_batches(
    metric_list=[torchmetrics.MeanSquaredError(), torchmetrics.MeanAbsoluteError()],
    test_interactions=test,
    model=model,
)

print(mse_score, mae_score)

Implicit Metrics

AUC

collie.metrics.auc(targets: scipy.sparse.csr.csr_matrix, user_ids: Union[np.array, torch.tensor], preds: Union[np.array, torch.tensor], k: Optional[Any] = None) → float

Calculate the area under the ROC curve (AUC) for each user and average the results.

Parameters
  • targets (scipy.sparse.csr_matrix) – Interaction matrix containing user and item IDs

  • user_ids (np.array or torch.tensor) – Users corresponding to the recommendations in the top k predictions

  • preds (torch.tensor) – Tensor of shape (n_users x n_items) with each user’s scores for each item

  • k (Any) – Ignored, included only for compatibility with mapk

Returns

auc_score

Return type

float
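As an illustration of what this metric computes (not Collie's implementation, which runs in batched tensor form on the GPU), a user's AUC can be sketched as the fraction of (positive item, negative item) pairs that the model ranks correctly:

```python
def user_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) score pairs ranked correctly.

    Ties count as half-correct; this is the probability that a random
    positive item outranks a random negative item for this user.
    """
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return correct / (len(pos_scores) * len(neg_scores))


def mean_auc(per_user_scores):
    """Average per-user AUC values over (pos_scores, neg_scores) pairs."""
    aucs = [user_auc(pos, neg) for pos, neg in per_user_scores]
    return sum(aucs) / len(aucs)
```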

MAP@K

collie.metrics.mapk(targets: scipy.sparse.csr.csr_matrix, user_ids: Union[np.array, torch.tensor], preds: Union[np.array, torch.tensor], k: int = 10) → float

Calculate the mean average precision at K (MAP@K) score for each user.

Parameters
  • targets (scipy.sparse.csr_matrix) – Interaction matrix containing user and item IDs

  • user_ids (np.array or torch.tensor) – Users corresponding to the recommendations in the top k predictions

  • preds (torch.tensor) – Tensor of shape (n_users x n_items) with each user’s scores for each item

  • k (int) – Number of recommendations to consider per user

Returns

mapk_score

Return type

float
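To illustrate the underlying computation (a hypothetical sketch, not Collie's batched GPU implementation), the average precision at K for a single user takes the precision at each rank where a relevant item appears and averages over hits; MAP@K is the mean of this value across users:

```python
def average_precision_at_k(relevant_items, ranked_items, k=10):
    """AP@K for one user.

    relevant_items is the set of held-out items the user interacted with;
    ranked_items is the model's item ranking for that user, best first.
    """
    if not relevant_items:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant_items), k)
```

For example, with relevant items {1, 3} and ranking [1, 2, 3], the hits at ranks 1 and 3 give precisions 1/1 and 2/3, so AP@K is (1 + 2/3) / 2 = 5/6.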

MRR

collie.metrics.mrr(targets: scipy.sparse.csr.csr_matrix, user_ids: Union[np.array, torch.tensor], preds: Union[np.array, torch.tensor], k: Optional[Any] = None) → float

Calculate the mean reciprocal rank (MRR) of the input predictions.

Parameters
  • targets (scipy.sparse.csr_matrix) – Interaction matrix containing user and item IDs

  • user_ids (np.array or torch.tensor) – Users corresponding to the recommendations in the top k predictions

  • preds (torch.tensor) – Tensor of shape (n_users x n_items) with each user’s scores for each item

  • k (Any) – Ignored, included only for compatibility with mapk

Returns

mrr_score

Return type

float
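As a sketch of the underlying idea (not Collie's batched tensor implementation), each user's reciprocal rank is one over the rank of the first relevant item in their predicted ranking, and MRR averages this over users:

```python
def reciprocal_rank(relevant_items, ranked_items):
    """1 / rank of the first relevant item, or 0.0 if none appears."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(users):
    """Average reciprocal rank over (relevant_items, ranked_items) pairs."""
    return sum(reciprocal_rank(r, p) for r, p in users) / len(users)
```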