Cross Validation¶
Once data is set up in an ``Interactions`` dataset, we can split it into training, validation, and testing sets to evaluate a trained model later. Collie supports the two data splits below, which share a common API but differ in split strategy and performance.
from collie.cross_validation import random_split, stratified_split
from collie.interactions import Interactions
from collie.movielens import read_movielens_df
from collie.utils import convert_to_implicit, Timer
# EXPERIMENT SETUP
# read in MovieLens 100K data
df = read_movielens_df()
# convert the data to implicit
df_imp = convert_to_implicit(df)
# store data as ``Interactions``
interactions = Interactions(users=df_imp['user_id'],
                            items=df_imp['item_id'],
                            allow_missing_ids=True)
t = Timer()
# EXPERIMENT BEGIN
train, test = random_split(interactions)
t.timecheck(message='Random split timecheck')
train, test = stratified_split(interactions)
t.timecheck(message='Stratified split timecheck')
# as expected, a random split is much faster than a stratified split
Random split timecheck (0.00 min)
Stratified split timecheck (0.04 min)
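The timing gap comes from the extra per-user bookkeeping a stratified split must do. As a rough, library-free sketch of the two strategies (toy data and helper names are ours, not Collie's API):

```python
import random
from collections import defaultdict

# toy (user, item) interactions; user 2 has only two interactions
interactions = [(0, 10), (0, 11), (0, 12), (1, 10), (1, 13), (1, 14), (2, 12), (2, 15)]

def toy_random_split(rows, test_p=0.25, seed=42):
    """Shuffle all rows, then slice -- O(n), no per-user bookkeeping."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_p)
    return rows[n_test:], rows[:n_test]

def toy_stratified_split(rows, test_p=0.25, seed=42):
    """Group rows by user and hold out per user, so every user lands in both sets."""
    by_user = defaultdict(list)
    for user, item in rows:
        by_user[user].append((user, item))
    rng = random.Random(seed)
    train, test = [], []
    for user_rows in by_user.values():
        rng.shuffle(user_rows)
        n_test = max(1, int(len(user_rows) * test_p))  # at least one test row per user
        test.extend(user_rows[:n_test])
        train.extend(user_rows[n_test:])
    return train, test

train, test = toy_stratified_split(interactions)
# every user appears in both splits
assert {u for u, _ in train} == {u for u, _ in test} == {0, 1, 2}
```

The random variant touches each row once; the stratified variant must group and shuffle per user, which is why Collie's ``stratified_split`` is slower but parallelizable across users.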
Random Split¶
- collie.cross_validation.random_split(interactions: BaseInteractions, val_p: float = 0.0, test_p: float = 0.2, processes: Optional[Any] = None, seed: Optional[int] = None) → Tuple[BaseInteractions, ...] [source]¶
Randomly split interactions into training, validation, and testing sets.
This split does NOT guarantee that every user will be represented in both the training and testing datasets. While much faster than ``stratified_split``, it is less representative of the full dataset because of this.

Note that this function is not supported for ``HDF5Interactions`` objects, since this data split implementation requires all data to fit in memory. A data split for large datasets should be done using a big data processing technology, like Spark.

- Parameters
interactions (collie.interactions.BaseInteractions) – ``Interactions`` instance containing the data to split
val_p (float) – Proportion of data used for validation
test_p (float) – Proportion of data used for testing
processes (Any) – Ignored, included only for compatibility with the ``stratified_split`` API

seed (int) – Random seed for splits
- Returns
train_interactions (collie.interactions.BaseInteractions) – Training data of size proportional to ``1 - val_p - test_p``

validate_interactions (collie.interactions.BaseInteractions) – Validation data of size proportional to ``val_p``, returned only if ``val_p > 0``

test_interactions (collie.interactions.BaseInteractions) – Testing data of size proportional to ``test_p``
Examples
>>> interactions = Interactions(...)
>>> len(interactions)
100000
>>> train, test = random_split(interactions)
>>> len(train), len(test)
(80000, 20000)
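The proportions in the return values can be sketched without Collie. This toy index split (function name and details are ours, not Collie's implementation) mirrors the documented contract: training gets ``1 - val_p - test_p`` of the rows, and a validation set is returned only when ``val_p > 0``:

```python
import random

def toy_index_split(n_rows, val_p=0.0, test_p=0.2, seed=42):
    """Shuffle row indices, then slice into train / (optional) validate / test."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = int(n_rows * test_p)
    n_val = int(n_rows * val_p)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    # mirror the documented API: validation set is returned only if val_p > 0
    return (train, val, test) if val_p > 0 else (train, test)

train, test = toy_index_split(100_000)                   # sizes 80000 / 20000
train, val, test = toy_index_split(100_000, val_p=0.1)   # sizes 70000 / 10000 / 20000
```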
Stratified Split¶
- collie.cross_validation.stratified_split(interactions: BaseInteractions, val_p: float = 0.0, test_p: float = 0.2, processes: int = -1, seed: Optional[int] = None, force_split: bool = False) → Tuple[BaseInteractions, ...] [source]¶
Split an ``Interactions`` instance into train, validate, and test datasets in a stratified manner such that each user appears at least once in each of the datasets.

This split guarantees that every user will be represented in the training, validation, and testing datasets given they appear in ``interactions`` at least three times. If ``val_p == 0``, they will appear in the training and testing datasets given they appear at least two times. If a user appears fewer than this number of times, a ``ValueError`` will be raised. To filter out users with fewer than ``n`` interactions, use ``collie.utils.remove_users_with_fewer_than_n_interactions``.

This is computationally more complex than ``random_split``, but produces a more representative data split. Note that when ``val_p > 0``, the algorithm will perform the data split twice, once to create the test set and again to create the validation set, essentially doubling the computational time.

Note that this function is not supported for ``HDF5Interactions`` objects, since this data split implementation requires all data to fit in memory. A data split for large datasets should be done using a big data processing technology, like Spark.

- Parameters
interactions (collie.interactions.BaseInteractions) – ``Interactions`` instance containing the data to split

val_p (float) – Proportion of data used for validation

test_p (float) – Proportion of data used for testing

processes (int) – Number of CPUs to use for parallelization. If ``processes == 0``, this will be run sequentially in a single list comprehension; otherwise this function uses ``joblib.delayed`` and ``joblib.Parallel`` for parallelization. A value of ``-1`` means that all available cores will be used

seed (int) – Random seed for splits

force_split (bool) – Ignore the error raised when a user in the dataset has only a single interaction. Normally, a ``ValueError`` is raised when this occurs. When ``force_split=True``, however, users with a single interaction will be placed in the training set and an error will NOT be raised
- Returns
train_interactions (collie.interactions.BaseInteractions) – Training data of size proportional to ``1 - val_p - test_p``

validate_interactions (collie.interactions.BaseInteractions) – Validation data of size proportional to ``val_p``, returned only if ``val_p > 0``

test_interactions (collie.interactions.BaseInteractions) – Testing data of size proportional to ``test_p``
Examples
>>> interactions = Interactions(...)
>>> len(interactions)
100000
>>> train, test = stratified_split(interactions)
>>> len(train), len(test)
(80000, 20000)
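The single-interaction rule described above can be illustrated with a toy per-user holdout (a sketch with names of our own, not Collie's implementation): a user with one interaction either triggers a ``ValueError`` or, under a ``force_split``-style flag, is placed entirely in the training set.

```python
import random
from collections import defaultdict

def toy_per_user_holdout(rows, seed=42, force_split=False):
    """Hold out one interaction per user for test. Users with a single
    interaction raise ValueError unless force_split=True, in which case
    their lone interaction goes to the training set."""
    by_user = defaultdict(list)
    for user, item in rows:
        by_user[user].append((user, item))
    rng = random.Random(seed)
    train, test = [], []
    for user, user_rows in by_user.items():
        if len(user_rows) < 2:
            if not force_split:
                raise ValueError(f'user {user} has only a single interaction')
            train.extend(user_rows)  # force_split: keep the lone interaction in train
            continue
        rng.shuffle(user_rows)
        test.append(user_rows[0])
        train.extend(user_rows[1:])
    return train, test

rows = [(0, 10), (0, 11), (1, 12)]  # user 1 has a single interaction
train, test = toy_per_user_holdout(rows, force_split=True)
# user 0 appears in both splits; user 1 appears only in train
```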