Evaluation Metrics

The Collie library supports evaluating both implicit and explicit models.

Collie ships with three common implicit recommendation evaluation metrics out of the box: Area Under the ROC Curve (AUC), Mean Reciprocal Rank (MRR), and Mean Average Precision at K (MAP@K). Each metric is optimized for efficiency, with all calculations done in batched tensor form on the GPU (if available). We provide a standard helper function, evaluate_in_batches, to evaluate a model on many metrics in a single pass.

Explicit evaluation of recommendation systems is, luckily, much more straightforward. This allows us to use the TorchMetrics library for flexible, optimized metric calculations on the GPU, accessed through a standard helper function, explicit_evaluate_in_batches, whose API closely mirrors its implicit counterpart.

Evaluate in Batches

Implicit Evaluate in Batches

collie.metrics.evaluate_in_batches(metric_list: Iterable[Callable], test_interactions: collie.interactions.datasets.Interactions, model: collie.model.base.base_pipeline.BasePipeline, k: int = 10, batch_size: int = 20, logger: Optional[pytorch_lightning.loggers.base.LightningLoggerBase] = None, verbose: bool = True) → List[float]

Evaluate a model with potentially several different metrics.

Memory constraints mean that most test sets need to be evaluated in batches. This function handles the looping and batching boilerplate needed to properly evaluate the model without running out of memory.

Parameters
  • metric_list (list of functions) –

    List of evaluation functions to apply. Each function must accept keyword arguments:

    • targets

    • user_ids

    • preds

    • k

  • test_interactions (collie.interactions.Interactions) – Interactions to use as labels

  • model (collie.model.BasePipeline) – Model that can take a (user_id, item_id) pair as input and return a recommendation score

  • k (int) – Number of recommendations to consider per user. This is ignored by some metrics

  • batch_size (int) – Number of users to score in a single batch. For best efficiency, this number should be as high as possible without running out of memory

  • logger (pytorch_lightning.loggers.base.LightningLoggerBase) – If provided, the output metrics dictionary will be logged via the logger's log_metrics method, with keys being the string representations of the metrics in metric_list and values being evaluation_results. Additionally, if model.hparams.num_epochs_completed exists, it will be logged as well, making it possible to track metric progress over the course of model training

  • verbose (bool) – Display progress bar and print statements during function execution

Returns

evaluation_results – List of floats, with each metric value corresponding to the respective function passed in metric_list

Return type

list

Examples

from collie.metrics import auc, evaluate_in_batches, mapk, mrr

map_10_score, mrr_score, auc_score = evaluate_in_batches(
    metric_list=[mapk, mrr, auc],
    test_interactions=test,
    model=model,
)

print(map_10_score, mrr_score, auc_score)
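The looping and batching boilerplate this helper takes care of can be sketched roughly as follows. This is a hypothetical simplification, not Collie's implementation: score_fn stands in for the model's batch scoring, and the real function operates on GPU tensors.

```python
def batched_metric_means(metric_fns, user_ids, score_fn, batch_size=20):
    """Score users in fixed-size chunks and average each metric,
    weighting every chunk by the number of users it contains."""
    totals = [0.0] * len(metric_fns)
    for start in range(0, len(user_ids), batch_size):
        batch = user_ids[start:start + batch_size]
        preds = score_fn(batch)  # one row of item scores per user in the batch
        for i, metric_fn in enumerate(metric_fns):
            totals[i] += metric_fn(batch, preds) * len(batch)
    return [total / len(user_ids) for total in totals]
```

Weighting each chunk by its size matters because the final batch is usually smaller than batch_size; an unweighted mean of per-batch metric values would over-count those users.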

Explicit Evaluate in Batches

collie.metrics.explicit_evaluate_in_batches(metric_list: Iterable[torchmetrics.metric.Metric], test_interactions: collie.interactions.datasets.ExplicitInteractions, model: collie.model.base.base_pipeline.BasePipeline, logger: Optional[pytorch_lightning.loggers.base.LightningLoggerBase] = None, verbose: bool = True, **kwargs) → List[float]

Evaluate a model with potentially several different metrics.

Memory constraints mean that most test sets need to be evaluated in batches. This function handles the looping and batching boilerplate needed to properly evaluate the model without running out of memory.

Parameters
  • metric_list (list of torchmetrics.Metric) – List of evaluation functions to apply. Each function must accept arguments for predictions and targets, in order

  • test_interactions (collie.interactions.ExplicitInteractions) – Interactions to use as labels

  • model (collie.model.BasePipeline) – Model that can take a (user_id, item_id) pair as input and return a recommendation score

  • batch_size (int) – Number of users to score in a single batch. For best efficiency, this number should be as high as possible without running out of memory

  • logger (pytorch_lightning.loggers.base.LightningLoggerBase) – If provided, the output metrics dictionary will be logged via the logger's log_metrics method, with keys being the string representations of the metrics in metric_list and values being evaluation_results. Additionally, if model.hparams.num_epochs_completed exists, it will be logged as well, making it possible to track metric progress over the course of model training

  • verbose (bool) – Display progress bar and print statements during function execution

  • kwargs (keyword arguments) – Additional arguments sent to the InteractionsDataLoader

Returns

evaluation_results – List of floats, with each metric value corresponding to the respective function passed in metric_list

Return type

list

Examples

import torchmetrics

from collie.metrics import explicit_evaluate_in_batches

mse_score, mae_score = explicit_evaluate_in_batches(
    metric_list=[torchmetrics.MeanSquaredError(), torchmetrics.MeanAbsoluteError()],
    test_interactions=test,
    model=model,
)

print(mse_score, mae_score)

Implicit Metrics

AUC

collie.metrics.auc(targets: scipy.sparse.csr.csr_matrix, user_ids: Union[np.array, torch.tensor], preds: Union[np.array, torch.tensor], k: Optional[Any] = None) → float

Calculate the area under the ROC curve (AUC) for each user and average the results.

Parameters
  • targets (scipy.sparse.csr_matrix) – Interaction matrix containing user and item IDs

  • user_ids (np.array or torch.tensor) – Users corresponding to the recommendations in the top k predictions

  • preds (torch.tensor) – Tensor of shape (n_users x n_items) with each user’s scores for each item

  • k (Any) – Ignored, included only for compatibility with mapk

Returns

auc_score

Return type

float
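As an illustration of what this metric computes (not Collie's implementation, which runs in batched tensor form on the GPU), a user's AUC can be sketched as the fraction of (positive item, negative item) pairs that the model ranks correctly:

```python
def user_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) score pairs ranked correctly.

    Ties count as half-correct; this is the probability that a random
    positive item outranks a random negative item for this user.
    """
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return correct / (len(pos_scores) * len(neg_scores))


def mean_auc(per_user_scores):
    """Average per-user AUC values over (pos_scores, neg_scores) pairs."""
    aucs = [user_auc(pos, neg) for pos, neg in per_user_scores]
    return sum(aucs) / len(aucs)
```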

MAP@K

collie.metrics.mapk(targets: scipy.sparse.csr.csr_matrix, user_ids: Union[np.array, torch.tensor], preds: Union[np.array, torch.tensor], k: int = 10) → float

Calculate the mean average precision at K (MAP@K) score for each user.

Parameters
  • targets (scipy.sparse.csr_matrix) – Interaction matrix containing user and item IDs

  • user_ids (np.array or torch.tensor) – Users corresponding to the recommendations in the top k predictions

  • preds (torch.tensor) – Tensor of shape (n_users x n_items) with each user’s scores for each item

  • k (int) – Number of recommendations to consider per user

Returns

mapk_score

Return type

float
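To illustrate the underlying computation (a hypothetical sketch, not Collie's batched GPU implementation), the average precision at K for a single user takes the precision at each rank where a relevant item appears and averages over hits; MAP@K is the mean of this value across users:

```python
def average_precision_at_k(relevant_items, ranked_items, k=10):
    """AP@K for one user.

    relevant_items is the set of held-out items the user interacted with;
    ranked_items is the model's item ranking for that user, best first.
    """
    if not relevant_items:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant_items), k)
```

For example, with relevant items {1, 3} and ranking [1, 2, 3], the hits at ranks 1 and 3 give precisions 1/1 and 2/3, so AP@K is (1 + 2/3) / 2 = 5/6.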

MRR

collie.metrics.mrr(targets: scipy.sparse.csr.csr_matrix, user_ids: Union[np.array, torch.tensor], preds: Union[np.array, torch.tensor], k: Optional[Any] = None) → float

Calculate the mean reciprocal rank (MRR) of the input predictions.

Parameters
  • targets (scipy.sparse.csr_matrix) – Interaction matrix containing user and item IDs

  • user_ids (np.array or torch.tensor) – Users corresponding to the recommendations in the top k predictions

  • preds (torch.tensor) – Tensor of shape (n_users x n_items) with each user’s scores for each item

  • k (Any) – Ignored, included only for compatibility with mapk

Returns

mrr_score

Return type

float
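As a sketch of the underlying idea (not Collie's batched tensor implementation), each user's reciprocal rank is one over the rank of the first relevant item in their predicted ranking, and MRR averages this over users:

```python
def reciprocal_rank(relevant_items, ranked_items):
    """1 / rank of the first relevant item, or 0.0 if none appears."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(users):
    """Average reciprocal rank over (relevant_items, ranked_items) pairs."""
    return sum(reciprocal_rank(r, p) for r, p in users) / len(users)
```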