MovieLens Functions
The following functions under collie.movielens
read and prepare MovieLens 100K data, train and evaluate a model on this data, and visualize recommendation results.
Get MovieLens 100K Data
Read MovieLens 100K Interactions Data
- collie.movielens.read_movielens_df(decrement_ids: bool = True) -> DataFrame
Read u.data from the MovieLens 100K dataset.
If there is no directory at $DATA_PATH/ml-100k, this function creates that directory and downloads the entire dataset there.
See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt
- Parameters
decrement_ids (bool) – Decrement user and item IDs by 1 before returning, which is required for Collie’s Interactions dataset
- Returns
df – MovieLens 100K u.data comprising columns:
user_id
item_id
rating
timestamp
- Return type
pd.DataFrame
Side Effects
Creates a directory at $DATA_PATH/ml-100k and downloads the data files if they do not already exist.
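To make the effect of decrement_ids concrete, here is a minimal sketch that does not call Collie at all. It assumes the tab-separated layout of u.data described in the README and uses three made-up rows. The shift by 1 mirrors what the parameter description says, not necessarily how the library implements it, and read_sample is a hypothetical helper.

```python
import io

import pandas as pd

# Three made-up rows in the tab-separated u.data layout:
# user id, item id, rating, timestamp (IDs start at 1 in the raw file).
SAMPLE_U_DATA = "1\t1\t5\t874965758\n1\t2\t3\t876893171\n2\t1\t4\t888550871\n"

COLUMNS = ["user_id", "item_id", "rating", "timestamp"]


def read_sample(decrement_ids: bool = True) -> pd.DataFrame:
    """Parse the sample rows and optionally shift IDs to start at 0."""
    df = pd.read_csv(io.StringIO(SAMPLE_U_DATA), sep="\t", names=COLUMNS)
    if decrement_ids:
        # Illustrative only: subtract 1 so user and item IDs are zero-based.
        df[["user_id", "item_id"]] -= 1
    return df


raw = read_sample(decrement_ids=False)
zero_based = read_sample(decrement_ids=True)
print(raw["user_id"].min(), zero_based["user_id"].min())
```

Ratings and timestamps are left untouched; only the two ID columns change.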
Read MovieLens 100K Item Data
- collie.movielens.read_movielens_df_item() -> DataFrame
Read u.item from the MovieLens 100K dataset.
If there is no directory at $DATA_PATH/ml-100k, this function creates that directory and downloads the entire dataset there.
See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt
- Returns
df_item – MovieLens 100K u.item containing columns:
item_id
movie_title
release_date
video_release_date
IMDb_URL
unknown
Action
Adventure
Animation
Children
Comedy
Crime
Documentary
Drama
Fantasy
Film_Noir
Horror
Musical
Mystery
Romance
Sci_Fi
Thriller
War
Western
- Return type
pd.DataFrame
Side Effects
Creates a directory at $DATA_PATH/ml-100k and downloads the data files if they do not already exist.
Read MovieLens 100K Posters Data
- collie.movielens.read_movielens_posters_df() -> DataFrame
Read in data containing the item ID and poster URL, used for visualizing MovieLens 100K data.
This function will attempt to read the file at data/movielens_posters.csv if it exists and, if not, will read the CSV from the origin GitHub repo at https://raw.githubusercontent.com/ShopRunner/collie/main/data/movielens_posters.csv.
- Returns
posters_df – DataFrame comprising columns:
item_id
url
- Return type
pd.DataFrame
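The local-then-remote fallback described above can be sketched as follows. This is an illustration of the pattern only, not Collie's code: choose_posters_source is a hypothetical helper, and the path and URL are the ones quoted in the description.

```python
import os

LOCAL_POSTERS_PATH = "data/movielens_posters.csv"
REMOTE_POSTERS_URL = (
    "https://raw.githubusercontent.com/ShopRunner/collie/main/data/movielens_posters.csv"
)


def choose_posters_source(local_path: str = LOCAL_POSTERS_PATH,
                          remote_url: str = REMOTE_POSTERS_URL) -> str:
    """Return the local path if the file exists, otherwise the remote URL."""
    # Hypothetical helper: prefer the local copy to avoid a network request.
    if os.path.isfile(local_path):
        return local_path
    return remote_url


source = choose_posters_source()
print(source)
```

The returned string can be handed to pandas.read_csv, which accepts both local paths and URLs.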
Format MovieLens 100K Item Metadata Data
- collie.movielens.get_movielens_metadata(df_item: Optional[DataFrame] = None) -> DataFrame
Return MovieLens 100K metadata as a DataFrame.
DataFrame returned has the following column order:
[ 'genre_action', 'genre_adventure', 'genre_animation', 'genre_children', 'genre_comedy', 'genre_crime', 'genre_documentary', 'genre_drama', 'genre_fantasy', 'genre_film_noir', 'genre_horror', 'genre_musical', 'genre_mystery', 'genre_romance', 'genre_sci_fi', 'genre_thriller', 'genre_war', 'genre_western', 'genre_unknown', 'decade_unknown', 'decade_20', 'decade_30', 'decade_40', 'decade_50', 'decade_60', 'decade_70', 'decade_80', 'decade_90', ]
See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt
- Parameters
df_item (pd.DataFrame) – DataFrame of MovieLens 100K u.item containing binary columns of movie names and metadata. If None, will automatically read the output of read_movielens_df_item()
- Returns
metadata_df
- Return type
pd.DataFrame
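The column names above suggest how the u.item fields relate to the metadata columns. The sketch below is a guess at that mapping and not the library's implementation: decade_column and genre_column are hypothetical helpers, the 'DD-Mon-YYYY' date format is assumed from the raw u.item file, and the decade rule is only meaningful for years that fall into the listed decade_20 through decade_90 columns.

```python
def decade_column(release_date: str) -> str:
    """Map a 'DD-Mon-YYYY' release date to a decade_* column name."""
    try:
        year = int(release_date.strip().split("-")[-1])
    except (AttributeError, ValueError):
        # Missing or malformed dates fall into the catch-all bucket.
        return "decade_unknown"
    # Assumed rule: 1995 maps to 'decade_90', 1939 maps to 'decade_30'.
    return f"decade_{(year % 100) // 10 * 10}"


def genre_column(raw_name: str) -> str:
    """Map a u.item genre column such as 'Sci_Fi' to 'genre_sci_fi'."""
    return f"genre_{raw_name.lower()}"


print(decade_column("01-Jan-1995"), genre_column("Film_Noir"))
```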
Format MovieLens 100K User Metadata Data
- collie.movielens.get_user_metadata(df_user: Optional[DataFrame] = None) -> DataFrame
Return MovieLens 100K user metadata as a DataFrame.
DataFrame returned has the following column order:
[ 'age', 'gender', 'occupation_administrator', 'occupation_artist', 'occupation_doctor', 'occupation_educator', 'occupation_engineer', 'occupation_entertainment', 'occupation_executive', 'occupation_healthcare', 'occupation_homemaker', 'occupation_lawyer', 'occupation_librarian', 'occupation_marketing', 'occupation_none', 'occupation_other', 'occupation_programmer', 'occupation_retired', 'occupation_salesman', 'occupation_scientist', 'occupation_student', 'occupation_technician', 'occupation_writer', ]
See the MovieLens 100K README for additional information on the dataset: https://files.grouplens.org/datasets/movielens/ml-100k-README.txt
- Parameters
df_user (pd.DataFrame) – DataFrame of MovieLens 100K u.user containing columns of user metadata. If None, will automatically read the output of read_movielens_df_user()
- Returns
metadata_df
- Return type
pd.DataFrame
MovieLens Model Training Pipeline
- collie.movielens.run_movielens_example(epochs: int = 20, gpus: int = 0) -> None
Retrieve and split data, train and evaluate a model, and save it.
From the terminal, you can run this script with:
python collie/movielens/run.py --epochs 20
- Parameters
epochs (int) – Number of epochs for model training
gpus (int) – Number of GPUs to train on
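If you would rather drive the pipeline from your own script, a thin wrapper might look like the sketch below. It is a hypothetical stand-alone script, not the contents of collie/movielens/run.py, and exposing a --gpus flag is an assumption based on the function signature. Only parse_args runs here; main needs Collie installed and starts a real training run.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse --epochs and --gpus with the documented defaults."""
    parser = argparse.ArgumentParser(description="Run the MovieLens example.")
    parser.add_argument("--epochs", type=int, default=20,
                        help="Number of epochs for model training")
    parser.add_argument("--gpus", type=int, default=0,
                        help="Number of GPUs to train on")
    return parser.parse_args(argv)


def main(argv=None) -> None:
    """Parse the arguments and hand them to Collie (requires Collie)."""
    args = parse_args(argv)
    # Imported lazily so the argument parsing can be tried without Collie.
    from collie.movielens import run_movielens_example

    run_movielens_example(epochs=args.epochs, gpus=args.gpus)


# Parsing only; call main() from your own script to actually train.
args = parse_args(["--epochs", "5"])
print(args.epochs, args.gpus)
```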
Visualize MovieLens Predictions
- collie.movielens.get_recommendation_visualizations(model: BasePipeline, user_id: int, df_user: Optional[DataFrame] = None, df_item: Optional[DataFrame] = None, movielens_posters_df: Optional[DataFrame] = None, num_user_movies_to_display: int = 10, num_similar_movies: int = 10, filter_films: bool = True, shuffle: bool = True, detailed: bool = False, image_width: int = 500) -> str
Visualize MovieLens 100K recommendations for a given user.
- Parameters
model (collie.model.BasePipeline) –
user_id (int) – User ID to retrieve recommendations for
df_user (DataFrame) – u.data from MovieLens data. This DataFrame must have columns:
user_id (starting at 1)
item_id (starting at 1)
rating (explicit ratings)
If None, will set to the output of read_movielens_df(decrement_ids=False)
df_item (DataFrame) – u.item from MovieLens data. This DataFrame must have columns:
item_id (starting at 1)
movie_title
If None, will set to the output of read_movielens_df_item()
movielens_posters_df (DataFrame) – DataFrame containing item_ids from MovieLens data and the poster url. This DataFrame must have columns:
item_id (starting at 1)
url
If None, will set to the output of read_movielens_posters_df()
num_user_movies_to_display (int) – Number of movies rated 4 or 5 to display for the user
num_similar_movies (int) – Number of movie recommendations to display
filter_films (bool) – Filter films out of recommendations if the user has already interacted with them
shuffle (bool) – Shuffle order of num_user_movies_to_display films
detailed (bool) – Of the top N unfiltered recommendations, display how many movies the user gave a positive and negative rating to
image_width (int) – Image width for HTML images
- Returns
html – HTML string of movies a user loved and the model recommended for a given user, ready for displaying
- Return type
str
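Since the function returns a plain HTML string, one option is to write it to a file and open it in a browser. The sketch below assumes you already have a trained model. save_html and visualize_user are hypothetical helpers, and only the saving step is exercised here, with stand-in HTML.

```python
from pathlib import Path


def save_html(html: str, path: str = "recommendations.html") -> Path:
    """Write an HTML string to disk so it can be opened in a browser."""
    out = Path(path)
    out.write_text(
        f"<!DOCTYPE html>\n<html><body>\n{html}\n</body></html>\n",
        encoding="utf-8",
    )
    return out


def visualize_user(model, user_id: int, path: str = "recommendations.html") -> Path:
    """Render recommendations for one user and save them (requires Collie)."""
    # Imported lazily so save_html can be used without Collie installed.
    from collie.movielens import get_recommendation_visualizations

    html = get_recommendation_visualizations(model=model, user_id=user_id, detailed=True)
    return save_html(html, path)


# Stand-in HTML, used here only to show the saving step.
saved = save_html("<p>placeholder</p>", "recommendations_demo.html")
print(saved.read_text(encoding="utf-8"))
```

In a Jupyter notebook, passing the returned string to IPython.display.HTML renders it inline instead.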