Collie Documentation

Welcome to the Collie docs! 🎉

These docs are meant to be both readable and comprehensive. For the best understanding of the library, we suggest reading the docs in order, starting with Interactions.

API Reference

collie


Collie is a library for preparing, training, and evaluating implicit deep learning hybrid recommender systems, named after the Border Collie dog breed.

Collie offers a collection of simple APIs for preparing and splitting datasets, incorporating item metadata directly into a model architecture or loss, and efficiently evaluating a model’s performance on the GPU. Above all, Collie is built with flexibility and customization in mind, allowing for faster prototyping and experimentation.

See the documentation for more details.

“We adopted 2 Border Collies a year ago and they are about 3 years old. They are completely obsessed with fetch and tennis balls and it’s getting out of hand. They live in the fenced back yard and when anyone goes out there they instantly run around frantically looking for a tennis ball. If there is no ball they will just keep looking and will not let you pet them. When you do have a ball, they are 100% focused on it and will not notice anything else going on around them, like it’s their whole world.”

- A Reddit thread on r/DogTraining

Installation

pip install collie

Through July 2021, this library was published under the name collie_recs. That package is still available on PyPI, but it is no longer supported or maintained. All users of the library should install collie to get the latest version of the code.
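
If you are unsure which of the two packages an environment has, a quick check along these lines may help. This is a minimal sketch using only the standard library; the helper names `installed_version` and `legacy_install_warning` are hypothetical and not part of Collie, and the distribution names are assumed to match the `pip install` names above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def legacy_install_warning(versions):
    """Return a hint if the legacy ``collie_recs`` package is present, else None.

    ``versions`` maps distribution names to version strings (or None if absent).
    """
    if versions.get('collie_recs') is None:
        return None
    if versions.get('collie') is None:
        return 'collie_recs is installed but collie is not: run `pip install collie`.'
    return 'Both collie_recs and collie are installed: consider `pip uninstall collie_recs`.'


# look up both names in the current environment and report what was found
found = {name: installed_version(name) for name in ('collie', 'collie_recs')}
print(found)
print(legacy_install_warning(found) or 'No legacy collie_recs install detected.')
```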

Quick Start

Implicit Data

Creating and evaluating a matrix factorization model with implicit MovieLens 100K data is simple with Collie:

from collie.cross_validation import stratified_split
from collie.interactions import Interactions
from collie.metrics import auc, evaluate_in_batches, mapk, mrr
from collie.model import MatrixFactorizationModel, CollieTrainer
from collie.movielens import read_movielens_df
from collie.utils import convert_to_implicit


# read in explicit MovieLens 100K data
df = read_movielens_df()

# convert the data to implicit
df_imp = convert_to_implicit(df)

# store data as ``Interactions``
interactions = Interactions(users=df_imp['user_id'],
                            items=df_imp['item_id'],
                            allow_missing_ids=True)

# perform a data split
train, val = stratified_split(interactions)

# train an implicit ``MatrixFactorizationModel``
model = MatrixFactorizationModel(train=train,
                                 val=val,
                                 embedding_dim=10,
                                 lr=1e-1,
                                 loss='adaptive',
                                 optimizer='adam')
trainer = CollieTrainer(model, max_epochs=10)
trainer.fit(model)
model.eval()

# evaluate the model
auc_score, mrr_score, mapk_score = evaluate_in_batches(metric_list=[auc, mrr, mapk],
                                                       test_interactions=val,
                                                       model=model)

print(f'AUC:          {auc_score}')
print(f'MRR:          {mrr_score}')
print(f'MAP@10:       {mapk_score}')
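
To make the "convert the data to implicit" step above more concrete, here is a plain-Python sketch of one common way to turn explicit ratings into implicit feedback: keep only interactions at or above a rating threshold and relabel them as 1. The function `to_implicit` and its `min_rating` default of 4 are assumptions for illustration, not Collie's implementation of `convert_to_implicit`.

```python
def to_implicit(rows, min_rating=4):
    """Keep interactions rated at or above ``min_rating`` and relabel them as 1.

    ``rows`` is a list of dicts with ``user_id``, ``item_id``, and ``rating`` keys.
    """
    return [
        {'user_id': row['user_id'], 'item_id': row['item_id'], 'rating': 1}
        for row in rows
        if row['rating'] >= min_rating
    ]


# a tiny, made-up explicit dataset of star ratings
explicit_rows = [
    {'user_id': 0, 'item_id': 10, 'rating': 5},
    {'user_id': 0, 'item_id': 11, 'rating': 2},
    {'user_id': 1, 'item_id': 10, 'rating': 4},
    {'user_id': 1, 'item_id': 12, 'rating': 3},
]

implicit_rows = to_implicit(explicit_rows)
print(implicit_rows)
```

Low ratings are dropped rather than kept as negatives, since implicit data only records that an interaction happened.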

More complicated examples of implicit pipelines can be viewed for MovieLens 100K data here, in notebooks here, and documentation here.
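
For intuition about the ranking metrics printed in the example above, the sketch below computes a reciprocal rank and an average precision at K for a single user in plain Python. It follows one common textbook definition (normalizing by the smaller of K and the number of relevant items); Collie's `mrr` and `mapk` run in batches on the GPU and may differ in details, so treat the function names and formulas here as assumptions.

```python
def reciprocal_rank(ranked_items, relevant_items):
    """Return 1 / rank of the first relevant item, or 0.0 if none is ranked."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked_items, relevant_items, k=10):
    """Average the precision at each hit in the top k, normalized by min(k, #relevant)."""
    if not relevant_items:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant_items))


# one user's recommendations, best first, and the items they actually interacted with
ranked = ['d', 'a', 'c', 'b', 'e']
relevant = {'a', 'b'}

print(reciprocal_rank(ranked, relevant))
print(average_precision_at_k(ranked, relevant, k=10))
```

MRR and MAP@K are then the means of these per-user values across all users in the evaluation set.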

Explicit Data

Collie also handles the situation when you instead have explicit data, such as star ratings. Note how similar the pipeline and APIs are compared to the implicit example above:

from collie.cross_validation import stratified_split
from collie.interactions import ExplicitInteractions
from collie.metrics import explicit_evaluate_in_batches
from collie.model import MatrixFactorizationModel, CollieTrainer
from collie.movielens import read_movielens_df

from torchmetrics import MeanAbsoluteError, MeanSquaredError


# read in explicit MovieLens 100K data
df = read_movielens_df()

# store data as ``ExplicitInteractions``
interactions = ExplicitInteractions(users=df['user_id'],
                                    items=df['item_id'],
                                    ratings=df['rating'])

# perform a data split
train, val = stratified_split(interactions)

# train an explicit ``MatrixFactorizationModel``
model = MatrixFactorizationModel(train=train,
                                 val=val,
                                 embedding_dim=10,
                                 lr=1e-2,
                                 loss='mse',
                                 optimizer='adam')
trainer = CollieTrainer(model, max_epochs=10)
trainer.fit(model)
model.eval()

# evaluate the model
mae_score, mse_score = explicit_evaluate_in_batches(metric_list=[MeanAbsoluteError(),
                                                                 MeanSquaredError()],
                                                    test_interactions=val,
                                                    model=model)

print(f'MAE: {mae_score}')
print(f'MSE: {mse_score}')
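
As a reference for the two explicit metrics reported above, here is a small worked example of mean absolute error and mean squared error in plain Python. The helper names `mae` and `mse` and the rating values are made up for illustration; the pipeline above uses the torchmetrics classes instead.

```python
def mae(predictions, targets):
    """Mean absolute error: the average of |prediction - target|."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def mse(predictions, targets):
    """Mean squared error: the average of (prediction - target) squared."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# made-up predicted and true star ratings for four user-item pairs
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4.0, 4.0, 1.0, 3.0]

print(f'MAE: {mae(predicted, actual)}')  # (0.5 + 0 + 1 + 2) / 4 = 0.875
print(f'MSE: {mse(predicted, actual)}')  # (0.25 + 0 + 1 + 4) / 4 = 1.3125
```

MSE penalizes the single two-star miss much more heavily than MAE does, which is why the two scores can rank models differently.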

Comparison With Other Open-Source Recommendation Libraries

On some smaller screens, you might have to scroll right to see the full table. ➡️

Aspect Included in Library

Surprise

LightFM

FastAI

Spotlight

RecBole

TensorFlow Recommenders

Collie

Implicit data support for when we only know when a user interacts with an item or not, not the explicit rating the user gave the item

✓

✓

✓

✓

✓

Explicit data support for when we know the explicit rating the user gave the item

✓

✓

✓

✓

✓

✓

✓

Support for side-data incorporated directly into the models

✓

✓

✓

✓

Support a flexible framework for new model architectures and experimentation

✓

✓

✓

✓

✓

Deep learning libraries utilizing speed-ups with a GPU and able to implement new, cutting-edge deep learning algorithms

✓

✓

✓

✓

✓

Automatic support for multi-GPU training

✓

Actively supported and maintained

✓

✓

✓

✓

✓

✓

Type annotations for classes, methods, and functions

✓

✓

Scalable for larger, out-of-memory datasets

✓

✓

Includes model zoo with two or more model architectures implemented

✓

✓

✓

Includes implicit loss functions for training and metric functions for model evaluation

✓

✓

✓

✓

Includes adaptive loss functions for multiple negative examples

✓

✓

✓

Includes loss functions with partial credit for side-data

✓

The following table shows the results of an experiment that trained and evaluated recommendation models from several popular implicit recommendation frameworks on a common MovieLens 10M dataset. The data was split via a 90/5/5 stratified data split. Each model was trained for a maximum of 40 epochs using an embedding dimension of 32. For each model, we used default hyperparameters (unless otherwise noted below).

Model                                   MAP@10 Score   Notes
Randomly initialized, untrained model   0.0001
Logistic MF                             0.0128         Using the CUDA implementation.
LightFM with BPR Loss                   0.0180
ALS                                     0.0189         Using the CUDA implementation.
BPR                                     0.0301         Using the CUDA implementation.
Spotlight                               0.0376         Using adaptive hinge loss.
LightFM with WARP Loss                  0.0412
Collie MatrixFactorizationModel         0.0425         Using a separate SGD bias optimizer.

At ShopRunner, we have found Collie models outperform comparable LightFM models with up to 64% improved MAP@10 scores.
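
The benchmark above relies on a 90/5/5 stratified data split. As a rough illustration of the idea, the sketch below divides each user's interactions across train, validation, and test sets in those proportions, so that every user with enough history appears in all three. The function `stratified_split_sketch` is hypothetical and written in plain Python; it is not Collie's `stratified_split`, which may handle rounding, sparse users, and seeding differently.

```python
import random
from collections import defaultdict


def stratified_split_sketch(pairs, fractions=(0.9, 0.05, 0.05), seed=42):
    """Split (user, item) pairs so each user's items are divided by ``fractions``."""
    rng = random.Random(seed)

    # group every user's items together so the split happens per user
    by_user = defaultdict(list)
    for user, item in pairs:
        by_user[user].append(item)

    splits = [[] for _ in fractions]
    for user, items in by_user.items():
        items = list(items)
        rng.shuffle(items)
        start = 0
        cumulative = 0.0
        for index, fraction in enumerate(fractions):
            cumulative += fraction
            # the last split takes whatever remains, so no interaction is lost
            is_last = index == len(fractions) - 1
            end = len(items) if is_last else round(cumulative * len(items))
            splits[index].extend((user, item) for item in items[start:end])
            start = end
    return splits


# three made-up users with 20 interactions each
pairs = [(user, item) for user in range(3) for item in range(20)]
train, val, test = stratified_split_sketch(pairs)
print(len(train), len(val), len(test))
```

Splitting per user rather than over the whole dataset avoids ending up with users who appear only in the validation or test set, where the model would have no training signal for them.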

Development

To run locally, begin by creating a data path environment variable:

# Define where on your local hard drive you want to store data. It is best if this
# location is not inside the repo itself. An example is below
export DATA_PATH=$HOME/data/collie
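
If your own scripts or notebooks need the same location, one option is to read the variable from Python. This is a minimal sketch: the helper `resolve_data_path` and its fallback directory name `data` are assumptions for illustration, and it does not describe how Collie itself reads `DATA_PATH`.

```python
import os
from pathlib import Path


def resolve_data_path(env=None, default='data'):
    """Return the data directory from ``DATA_PATH``, or ``default`` if it is unset."""
    env = os.environ if env is None else env
    return Path(env.get('DATA_PATH', default)).expanduser()


# uses the real environment; prints the fallback if DATA_PATH is not exported
print(resolve_data_path())
```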

Run development from within the Docker container:

docker build -t collie .

# run the container in interactive mode, leaving port ``8888`` open for Jupyter
docker run \
    -it \
    --rm \
    -v "${DATA_PATH}:/collie/data/" \
    -v "${PWD}:/collie" \
    -p 8888:8888 \
    collie /bin/bash

Run on a GPU:

docker build -t collie .

# run the container in interactive mode, leaving port ``8888`` open for Jupyter
docker run \
    -it \
    --rm \
    --gpus all \
    -v "${DATA_PATH}:/collie/data/" \
    -v "${PWD}:/collie" \
    -p 8888:8888 \
    collie /bin/bash

Start JupyterLab

To run JupyterLab, start the container and execute the following:

jupyter lab --ip 0.0.0.0 --no-browser --allow-root

Connect to JupyterLab here: http://localhost:8888/lab

Unit Tests

Library unit tests in this repo are to be run in the Docker container:

# execute unit tests
pytest --cov-report term --cov=collie

Note that a handful of tests require the MovieLens 100K dataset (~5MB), so an internet connection is needed either before or during the test run. The dataset only needs to be downloaded once for use in both unit tests and tutorials.

Docs

The Collie library supports Read the Docs documentation. To compile the docs locally:

cd docs
make html

# open local docs
open build/html/index.html