MovieLens Functions

The following functions under collie.movielens read and prepare MovieLens 100K data, train and evaluate a model on this data, and visualize recommendation results.

Get MovieLens 100K Data

Read MovieLens 100K Interactions Data

collie.movielens.read_movielens_df(decrement_ids: bool = True) → DataFrame

Read u.data from the MovieLens 100K dataset.

If there is not a directory at $DATA_PATH/ml-100k, this function creates that directory and downloads the entire dataset there.

See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Parameters

decrement_ids (bool) – Decrement user and item IDs by 1 before returning, which is required for Collie’s Interactions dataset

Returns

df –

MovieLens 100K u.data with the columns:

user_id

item_id

rating

timestamp

Return type

pd.DataFrame

Side Effects

Creates directory at $DATA_PATH/ml-100k and downloads data files if data does not exist.
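To make the effect of decrement_ids concrete, the sketch below parses a couple of rows in the tab-separated u.data layout and shifts the IDs so they start at 0. It uses only the standard library and is not Collie's implementation: parse_u_data is a hypothetical helper, and the sample rows are illustrative.

```python
import csv
import io
from typing import Dict, List

# Illustrative rows in the u.data layout: user id, item id, rating, timestamp (tab-separated).
SAMPLE_U_DATA = "1\t1\t5\t874965758\n2\t10\t3\t888550871\n"

COLUMNS = ["user_id", "item_id", "rating", "timestamp"]


def parse_u_data(text: str, decrement_ids: bool = True) -> List[Dict[str, int]]:
    """Hypothetical helper: parse u.data-style text, optionally shifting IDs to start at 0."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        row = dict(zip(COLUMNS, (int(value) for value in fields)))
        if decrement_ids:
            # MovieLens user and item IDs start at 1; subtract 1 to make them zero-based.
            row["user_id"] -= 1
            row["item_id"] -= 1
        rows.append(row)
    return rows


zero_based = parse_u_data(SAMPLE_U_DATA)
one_based = parse_u_data(SAMPLE_U_DATA, decrement_ids=False)
```

With decrement_ids=False the IDs keep their original 1-based values, which is the form the visualization function further down expects for its df_user argument.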

Read MovieLens 100K Item Data

collie.movielens.read_movielens_df_item() → DataFrame

Read u.item from the MovieLens 100K dataset.

If there is not a directory at $DATA_PATH/ml-100k, this function creates that directory and downloads the entire dataset there.

See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Returns

df_item –

MovieLens 100K u.item containing columns:

item_id

movie_title

release_date

video_release_date

IMDb_URL

unknown

Action

Adventure

Animation

Children

Comedy

Crime

Documentary

Drama

Fantasy

Film_Noir

Horror

Musical

Mystery

Romance

Sci_Fi

Thriller

War

Western

Return type

pd.DataFrame

Side Effects

Creates directory at $DATA_PATH/ml-100k and downloads data files if data does not exist.

Read MovieLens 100K Posters Data

collie.movielens.read_movielens_posters_df() → DataFrame

Read the item ID and poster URL for each MovieLens 100K item, for use in visualizations.

This function reads the file at data/movielens_posters.csv if it exists. Otherwise, it reads the CSV from the Collie GitHub repo at https://raw.githubusercontent.com/ShopRunner/collie/main/data/movielens_posters.csv.

Returns

posters_df –

DataFrame comprising columns:

item_id

url

Return type

pd.DataFrame
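The local-file-then-remote fallback described above can be sketched as follows. This is an illustration of the pattern rather than Collie's code: resolve_posters_source is a hypothetical helper, and the temporary directory only exists to exercise both branches.

```python
import tempfile
from pathlib import Path

LOCAL_POSTERS_PATH = Path("data/movielens_posters.csv")
REMOTE_POSTERS_URL = (
    "https://raw.githubusercontent.com/ShopRunner/collie/main/data/movielens_posters.csv"
)


def resolve_posters_source(local_path: Path = LOCAL_POSTERS_PATH,
                           remote_url: str = REMOTE_POSTERS_URL) -> str:
    """Hypothetical helper: prefer the local CSV, otherwise fall back to the remote URL."""
    if local_path.is_file():
        return str(local_path)
    return remote_url


# Exercise both branches against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    candidate = Path(tmp_dir) / "movielens_posters.csv"
    expected_local = str(candidate)
    source_when_missing = resolve_posters_source(candidate)
    candidate.write_text("item_id,url\n", encoding="utf-8")
    source_when_present = resolve_posters_source(candidate)
```

The returned string could then be handed to pandas.read_csv, which accepts both local paths and URLs.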

Format MovieLens 100K Item Metadata

collie.movielens.get_movielens_metadata(df_item: Optional[DataFrame] = None) → DataFrame

Return MovieLens 100K metadata as a DataFrame.

DataFrame returned has the following column order:

[
    'genre_action', 'genre_adventure', 'genre_animation', 'genre_children', 'genre_comedy',
    'genre_crime', 'genre_documentary', 'genre_drama', 'genre_fantasy', 'genre_film_noir',
    'genre_horror', 'genre_musical', 'genre_mystery', 'genre_romance', 'genre_sci_fi',
    'genre_thriller', 'genre_war', 'genre_western', 'genre_unknown', 'decade_unknown',
    'decade_20', 'decade_30', 'decade_40', 'decade_50', 'decade_60',
    'decade_70', 'decade_80', 'decade_90',
]

See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Parameters: df_item (pd.DataFrame) – DataFrame of MovieLens 100K u.item containing movie names and binary metadata columns. If None, the output of read_movielens_df_item() is used
Returns: metadata_df
Return type: pd.DataFrame
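As a rough illustration of how the decade_* columns relate to a release date, the sketch below maps a date string such as '01-Jan-1995' onto one of the decade columns listed above. How Collie actually derives these columns is not shown here, so treat decade_column and one_hot_decade as hypothetical helpers; in particular, sending years outside 1920 to 1999 to decade_unknown is an assumption.

```python
from typing import Dict, Optional

# Same order as the decade columns in the documented output.
DECADE_COLUMNS = ["decade_unknown"] + [f"decade_{d}" for d in range(20, 100, 10)]


def decade_column(release_date: Optional[str]) -> str:
    """Hypothetical helper: map a date like '01-Jan-1995' to a decade column name."""
    if not release_date:
        return "decade_unknown"
    # The year is the last dash-separated field of the date string.
    year_text = release_date.rsplit("-", 1)[-1]
    if not year_text.isdigit():
        return "decade_unknown"
    year = int(year_text)
    if not 1920 <= year <= 1999:
        # Assumption: years without a matching column count as unknown.
        return "decade_unknown"
    return f"decade_{year % 100 // 10 * 10}"


def one_hot_decade(release_date: Optional[str]) -> Dict[str, int]:
    """Hypothetical helper: one-hot encode the decade across all decade columns."""
    active = decade_column(release_date)
    return {column: int(column == active) for column in DECADE_COLUMNS}


row = one_hot_decade("01-Jan-1939")
```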

Format MovieLens 100K User Metadata

collie.movielens.get_user_metadata(df_user: Optional[DataFrame] = None) → DataFrame

Return MovieLens 100K user metadata as a DataFrame.

DataFrame returned has the following column order:

[
    'age', 'gender', 'occupation_administrator', 'occupation_artist',
    'occupation_doctor', 'occupation_educator', 'occupation_engineer',
    'occupation_entertainment', 'occupation_executive',
    'occupation_healthcare', 'occupation_homemaker',
    'occupation_lawyer', 'occupation_librarian', 'occupation_marketing',
    'occupation_none', 'occupation_other', 'occupation_programmer',
    'occupation_retired', 'occupation_salesman', 'occupation_scientist',
    'occupation_student', 'occupation_technician', 'occupation_writer',
]

See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Parameters: df_user (pd.DataFrame) – DataFrame of MovieLens 100K u.user containing columns of user metadata. If None, will automatically read the output of read_movielens_df_user()
Returns: metadata_df
Return type: pd.DataFrame

MovieLens Model Training Pipeline

collie.movielens.run_movielens_example(epochs: int = 20, gpus: int = 0) → None

Retrieve and split data, train and evaluate a model, and save it.

From the terminal, you can run this script with:

python collie/movielens/run.py --epochs 20

Parameters

epochs (int) – Number of epochs for model training
gpus (int) – Number of GPUs to train on
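If you want to launch the script from Python rather than a shell, one option is to assemble the documented command as an argument list. The sketch below only builds the list; build_run_command is a hypothetical helper, and it uses only the --epochs flag shown above.

```python
import sys
from typing import List


def build_run_command(epochs: int = 20) -> List[str]:
    """Hypothetical helper: assemble the documented terminal command as an argument list."""
    # sys.executable keeps the run inside the current Python environment.
    return [sys.executable, "collie/movielens/run.py", "--epochs", str(epochs)]


command = build_run_command(epochs=5)
```

From the root of a Collie checkout, passing this list to subprocess.run(command, check=True) would start the run. That step is left out here because it downloads data and trains a model.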

Visualize MovieLens Predictions

collie.movielens.get_recommendation_visualizations(model: BasePipeline, user_id: int, df_user: Optional[DataFrame] = None, df_item: Optional[DataFrame] = None, movielens_posters_df: Optional[DataFrame] = None, num_user_movies_to_display: int = 10, num_similar_movies: int = 10, filter_films: bool = True, shuffle: bool = True, detailed: bool = False, image_width: int = 500) → str

Visualize MovieLens 100K recommendations for a given user.

Parameters

model (collie.model.BasePipeline) –
user_id (int) – User ID to retrieve recommendations for
df_user (DataFrame) –
u.data from MovieLens data. This DataFrame must have columns:
- user_id (starting at 1)
- item_id (starting at 1)
- rating (explicit ratings)
If None, will set to the output of read_movielens_df(decrement_ids=False).
df_item (DataFrame) –
u.item from MovieLens data. This DataFrame must have columns:
- item_id (starting at 1)
- movie_title
If None, will set to the output of read_movielens_df_item()
movielens_posters_df (DataFrame) –
DataFrame containing item_ids from MovieLens data and the poster url. This DataFrame must have columns:
- item_id (starting at 1)
- url
If None, will set to the output of read_movielens_posters_df()
num_user_movies_to_display (int) – Number of movies rated 4 or 5 to display for the user
num_similar_movies (int) – Number of movie recommendations to display
filter_films (bool) – Filter films out of recommendations if the user has already interacted with them
shuffle (bool) – Shuffle order of num_user_movies_to_display films
detailed (bool) – Of the top N unfiltered recommendations, display how many movies the user gave a positive and negative rating to
image_width (int) – Image width for HTML images

Returns

html – HTML string showing the movies the user loved and the movies the model recommended for that user, ready to display

Return type

str
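Because the function returns a plain HTML string, one simple way to view it outside a notebook is to write it to a file and open that file in a browser. The sketch below assumes nothing about Collie itself: save_html is a hypothetical helper, and example_html is a placeholder standing in for the real return value. In a Jupyter notebook, IPython.display.HTML(html) would render the string inline instead.

```python
import tempfile
from pathlib import Path


def save_html(html: str, path: Path) -> Path:
    """Hypothetical helper: wrap an HTML fragment in a minimal page and write it to disk."""
    page = f"<!DOCTYPE html>\n<html><body>\n{html}\n</body></html>\n"
    path.write_text(page, encoding="utf-8")
    return path


# Placeholder standing in for the string returned by get_recommendation_visualizations.
example_html = "<h3>Recommendations</h3>"

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = save_html(example_html, Path(tmp_dir) / "recommendations.html")
    written = output_path.read_text(encoding="utf-8")
```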