MovieLens Functions


The following functions under collie.movielens read and prepare MovieLens 100K data, train and evaluate a model on this data, and visualize recommendation results.

Get MovieLens 100K Data

Read MovieLens 100K Interactions Data

collie.movielens.read_movielens_df(decrement_ids: bool = True) → DataFrame

Read u.data from the MovieLens 100K dataset.

If there is not a directory at $DATA_PATH/ml-100k, this function creates that directory and downloads the entire dataset there.

See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Parameters

decrement_ids (bool) – Decrement user and item IDs by 1 before returning, which is required for Collie’s Interactions dataset

Returns

df

MovieLens 100K u.data comprising columns:

  • user_id

  • item_id

  • rating

  • timestamp

Return type

pd.DataFrame

Side Effects

Creates directory at $DATA_PATH/ml-100k and downloads data files if data does not exist.
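
As a rough illustration of what decrement_ids changes, the sketch below parses a few tab-separated rows in the u.data layout and shifts the 1-based user and item IDs to 0-based ones. It uses only the standard library rather than pandas, and parse_u_data and SAMPLE_U_DATA are hypothetical names for this example, not part of Collie.

```python
import csv
import io

# Illustrative rows in the u.data layout: user id, item id, rating, timestamp
SAMPLE_U_DATA = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

COLUMNS = ["user_id", "item_id", "rating", "timestamp"]


def parse_u_data(text, decrement_ids=True):
    """Parse tab-separated u.data text into a list of row dicts (hypothetical helper)."""
    offset = 1 if decrement_ids else 0
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter="\t"):
        row = dict(zip(COLUMNS, (int(value) for value in record)))
        # MovieLens IDs start at 1; subtracting 1 makes them start at 0
        row["user_id"] -= offset
        row["item_id"] -= offset
        rows.append(row)
    return rows


rows = parse_u_data(SAMPLE_U_DATA)
raw_rows = parse_u_data(SAMPLE_U_DATA, decrement_ids=False)
```

With decrement_ids=True, the first row's user_id and item_id become 195 and 241; the rating and timestamp are left untouched.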

Read MovieLens 100K Item Data

collie.movielens.read_movielens_df_item() → DataFrame

Read u.item from the MovieLens 100K dataset.

If there is not a directory at $DATA_PATH/ml-100k, this function creates that directory and downloads the entire dataset there.

See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Returns

df_item

MovieLens 100K u.item containing columns:

  • item_id

  • movie_title

  • release_date

  • video_release_date

  • IMDb_URL

  • unknown

  • Action

  • Adventure

  • Animation

  • Children

  • Comedy

  • Crime

  • Documentary

  • Drama

  • Fantasy

  • Film_Noir

  • Horror

  • Musical

  • Mystery

  • Romance

  • Sci_Fi

  • Thriller

  • War

  • Western

Return type

pd.DataFrame

Side Effects

Creates directory at $DATA_PATH/ml-100k and downloads data files if data does not exist.
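
To make the column layout above concrete, here is a minimal sketch that maps one u.item record onto those column names. It assumes the raw file is pipe-delimited with one field per column; parse_item_line is a hypothetical helper, the sample record is illustrative (its URL is a placeholder), and none of this is Collie's own parsing code.

```python
ITEM_COLUMNS = [
    "item_id", "movie_title", "release_date", "video_release_date", "IMDb_URL",
    "unknown", "Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film_Noir", "Horror", "Musical", "Mystery",
    "Romance", "Sci_Fi", "Thriller", "War", "Western",
]
GENRE_COLUMNS = ITEM_COLUMNS[5:]  # the 19 binary genre flags

# Build an illustrative record: five descriptive fields followed by the genre flags
sample_genres = {"Animation", "Children", "Comedy"}
SAMPLE_LINE = "|".join(
    ["1", "Toy Story (1995)", "01-Jan-1995", "", "https://example.com/toy-story"]
    + ["1" if genre in sample_genres else "0" for genre in GENRE_COLUMNS]
)


def parse_item_line(line):
    """Split one pipe-delimited record into a dict keyed by ITEM_COLUMNS."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != len(ITEM_COLUMNS):
        raise ValueError(f"expected {len(ITEM_COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(ITEM_COLUMNS, fields))
    row["item_id"] = int(row["item_id"])
    for genre in GENRE_COLUMNS:
        row[genre] = int(row[genre])  # genre flags are 0 or 1
    return row


item = parse_item_line(SAMPLE_LINE)
```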

Read MovieLens 100K Posters Data

collie.movielens.read_movielens_posters_df() → DataFrame

Read the item IDs and poster URLs used to visualize MovieLens 100K data.

This function reads the file at data/movielens_posters.csv if it exists. Otherwise, it reads the CSV from the Collie GitHub repository at https://raw.githubusercontent.com/ShopRunner/collie/main/data/movielens_posters.csv.

Returns

posters_df

DataFrame comprising columns:

  • item_id

  • url

Return type

pd.DataFrame
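
The local-file-then-remote fallback described for the posters data can be sketched as follows. resolve_posters_source is a hypothetical helper written for this example, not Collie's implementation; it only decides which source to read and leaves the actual CSV loading out.

```python
import os
import tempfile

LOCAL_POSTERS_PATH = "data/movielens_posters.csv"
REMOTE_POSTERS_URL = (
    "https://raw.githubusercontent.com/ShopRunner/collie/main/data/movielens_posters.csv"
)


def resolve_posters_source(local_path=LOCAL_POSTERS_PATH, remote_url=REMOTE_POSTERS_URL):
    """Return the local path if that file exists, otherwise the remote URL."""
    return local_path if os.path.isfile(local_path) else remote_url


# Demonstrate both branches with a temporary directory standing in for data/
with tempfile.TemporaryDirectory() as tmp_dir:
    cached_path = os.path.join(tmp_dir, "movielens_posters.csv")
    before = resolve_posters_source(cached_path)  # no local copy yet
    open(cached_path, "w").close()                # create an empty local copy
    after = resolve_posters_source(cached_path)   # local copy now exists
```

Because pandas.read_csv accepts either a filesystem path or a URL, the value returned here could be passed to it directly.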

Format MovieLens 100K Item Metadata

collie.movielens.get_movielens_metadata(df_item: Optional[DataFrame] = None) → DataFrame

Return MovieLens 100K metadata as a DataFrame.

DataFrame returned has the following column order:

[
    'genre_action', 'genre_adventure', 'genre_animation', 'genre_children', 'genre_comedy',
    'genre_crime', 'genre_documentary', 'genre_drama', 'genre_fantasy', 'genre_film_noir',
    'genre_horror', 'genre_musical', 'genre_mystery', 'genre_romance', 'genre_sci_fi',
    'genre_thriller', 'genre_war', 'genre_western', 'genre_unknown', 'decade_unknown',
    'decade_20', 'decade_30', 'decade_40', 'decade_50', 'decade_60',
    'decade_70', 'decade_80', 'decade_90',
]

See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Parameters

df_item (pd.DataFrame) – DataFrame of MovieLens 100K u.item containing movie names and binary metadata columns. If None, the output of read_movielens_df_item() is used

Returns

metadata_df

Return type

pd.DataFrame
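
The decade_* columns listed above suggest that each movie's release date is bucketed into a decade and one-hot encoded. The sketch below shows one way that could work. It is an assumption about the approach, not Collie's code: decade_column and one_hot_decade are hypothetical helpers, release dates are assumed to look like 01-Jan-1995, and any date outside the listed decades is assumed to fall into decade_unknown.

```python
# Same order as the decade columns in the list above
DECADE_COLUMNS = ["decade_unknown"] + [f"decade_{d}" for d in range(20, 100, 10)]


def decade_column(release_date):
    """Map a 'DD-Mon-YYYY' release date to one of DECADE_COLUMNS."""
    if not release_date:
        return "decade_unknown"
    year = int(release_date.rsplit("-", 1)[-1])
    name = f"decade_{(year % 100) // 10 * 10}"
    # Assumption: decades without a dedicated column count as unknown
    return name if name in DECADE_COLUMNS else "decade_unknown"


def one_hot_decade(release_date):
    """Return a dict with exactly one decade column set to 1."""
    chosen = decade_column(release_date)
    return {column: int(column == chosen) for column in DECADE_COLUMNS}


encoded = one_hot_decade("01-Jan-1995")
```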

Format MovieLens 100K User Metadata

collie.movielens.get_user_metadata(df_user: Optional[DataFrame] = None) → DataFrame

Return MovieLens 100K user metadata as a DataFrame.

DataFrame returned has the following column order:

[
    'age', 'gender', 'occupation_administrator', 'occupation_artist',
    'occupation_doctor', 'occupation_educator', 'occupation_engineer',
    'occupation_entertainment', 'occupation_executive',
    'occupation_healthcare', 'occupation_homemaker',
    'occupation_lawyer', 'occupation_librarian', 'occupation_marketing',
    'occupation_none', 'occupation_other', 'occupation_programmer',
    'occupation_retired', 'occupation_salesman', 'occupation_scientist',
    'occupation_student', 'occupation_technician', 'occupation_writer',
]

See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt

Parameters

df_user (pd.DataFrame) – DataFrame of MovieLens 100K u.user containing columns of user metadata. If None, will automatically read the output of read_movielens_df_user()

Returns

metadata_df

Return type

pd.DataFrame
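
To show how a single u.user record could map onto the column order above, here is a minimal sketch. It assumes u.user is pipe-delimited with the fields user id, age, gender, occupation, and zip code. encode_user is a hypothetical helper, not part of Collie. The gender value is passed through unchanged because this page does not say how it is encoded.

```python
OCCUPATIONS = [
    "administrator", "artist", "doctor", "educator", "engineer", "entertainment",
    "executive", "healthcare", "homemaker", "lawyer", "librarian", "marketing",
    "none", "other", "programmer", "retired", "salesman", "scientist",
    "student", "technician", "writer",
]

# Illustrative record in the assumed u.user layout
SAMPLE_USER_LINE = "1|24|M|technician|85711"


def encode_user(line):
    """Turn one u.user record into age, gender, and one-hot occupation features."""
    _user_id, age, gender, occupation, _zip_code = line.strip().split("|")
    # Gender is kept as the raw value; its final encoding is not specified here
    features = {"age": int(age), "gender": gender}
    for name in OCCUPATIONS:
        features[f"occupation_{name}"] = int(name == occupation)
    return features


features = encode_user(SAMPLE_USER_LINE)
```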

MovieLens Model Training Pipeline

collie.movielens.run_movielens_example(epochs: int = 20, gpus: int = 0) → None

Retrieve and split data, train and evaluate a model, and save it.

From the terminal, you can run this script with:

python collie/movielens/run.py --epochs 20

Parameters
  • epochs (int) – Number of epochs for model training

  • gpus (int) – Number of GPUs to train on

Visualize MovieLens Predictions

collie.movielens.get_recommendation_visualizations(model: BasePipeline, user_id: int, df_user: Optional[DataFrame] = None, df_item: Optional[DataFrame] = None, movielens_posters_df: Optional[DataFrame] = None, num_user_movies_to_display: int = 10, num_similar_movies: int = 10, filter_films: bool = True, shuffle: bool = True, detailed: bool = False, image_width: int = 500) → str

Visualize MovieLens 100K recommendations for a given user.

Parameters
  • model (collie.model.BasePipeline) – Model used to generate the recommendations

  • user_id (int) – User ID to retrieve recommendations for

  • df_user (DataFrame) –

    u.data from MovieLens data. This DataFrame must have columns:

    • user_id (starting at 1)

    • item_id (starting at 1)

    • rating (explicit ratings)

    If None, will set to the output of read_movielens_df(decrement_ids=False).

  • df_item (DataFrame) –

    u.item from MovieLens data. This DataFrame must have columns:

    • item_id (starting at 1)

    • movie_title

    If None, will set to the output of read_movielens_df_item()

  • movielens_posters_df (DataFrame) –

    DataFrame containing item_ids from MovieLens data and the poster url. This DataFrame must have columns:

    • item_id (starting at 1)

    • url

    If None, will set to the output of read_movielens_posters_df()

  • num_user_movies_to_display (int) – Number of movies rated 4 or 5 to display for the user

  • num_similar_movies (int) – Number of movie recommendations to display

  • filter_films (bool) – Filter films out of recommendations if the user has already interacted with them

  • shuffle (bool) – Shuffle order of num_user_movies_to_display films

  • detailed (bool) – For the top N unfiltered recommendations, display how many the user rated positively and how many negatively

  • image_width (int) – Image width for HTML images

Returns

html – HTML string showing the movies the user loved and the movies the model recommended, ready for display

Return type

str
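
The selection logic behind num_user_movies_to_display, shuffle, num_similar_movies, and filter_films can be sketched in plain Python. This is a sketch under stated assumptions rather than Collie's implementation: loved_items and top_recommendations are hypothetical helpers, ratings are given as an item-to-rating dict, and model output is stood in for by an item-to-score dict.

```python
import random


def loved_items(ratings, num_to_display=10, shuffle=True, seed=None):
    """Pick up to num_to_display items that the user rated 4 or 5."""
    loved = [item_id for item_id, rating in ratings.items() if rating >= 4]
    if shuffle:
        random.Random(seed).shuffle(loved)
    return loved[:num_to_display]


def top_recommendations(scores, seen_items, num_similar=10, filter_films=True):
    """Rank items by score, optionally dropping items the user already rated."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if filter_films:
        ranked = [item_id for item_id in ranked if item_id not in seen_items]
    return ranked[:num_similar]


# Made-up ratings and scores, for illustration only
ratings = {10: 5, 11: 2, 12: 4}
scores = {10: 0.9, 11: 0.1, 12: 0.8, 13: 0.7, 14: 0.4}

loved = loved_items(ratings, shuffle=False)
filtered = top_recommendations(scores, seen_items=set(ratings), num_similar=2)
unfiltered = top_recommendations(
    scores, seen_items=set(ratings), num_similar=2, filter_films=False
)
```

In this example, filtering removes items 10, 11, and 12 because the user has already rated them, so the top two recommendations are items 13 and 14. Without filtering they are items 10 and 12.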