Cross Validation

Once data is set up in an ``Interactions`` dataset, we can perform a data split so a trained model can be evaluated later. Collie supports the two data splits below, which share a common API but differ in split strategy and performance.

from collie.cross_validation import random_split, stratified_split
from collie.interactions import Interactions
from collie.movielens import read_movielens_df
from collie.utils import convert_to_implicit, Timer


# EXPERIMENT SETUP
# read in MovieLens 100K data
df = read_movielens_df()

# convert the data to implicit
df_imp = convert_to_implicit(df)

# store data as ``Interactions``
interactions = Interactions(users=df_imp['user_id'],
                            items=df_imp['item_id'],
                            allow_missing_ids=True)

t = Timer()

# EXPERIMENT BEGIN
train, test = random_split(interactions)
t.timecheck(message='Random split timecheck')

train, test = stratified_split(interactions)
t.timecheck(message='Stratified split timecheck')
# as expected, a random split is much faster than a stratified split

Random split timecheck (0.00 min)
Stratified split timecheck (0.04 min)
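
To see the practical difference between the two strategies, we can check which users appear on each side of the split. The sketch below assumes ``Interactions`` exposes its underlying ``scipy.sparse.coo_matrix`` through a ``mat`` attribute; if your version of Collie stores the matrix differently, adjust accordingly.

# a sketch of verifying the stratified guarantee (the ``mat`` attribute
# holding a COO matrix is an assumption about the ``Interactions`` API)
train, test = stratified_split(interactions)

train_users = set(train.mat.row)
test_users = set(test.mat.row)

# a stratified split places every user on both sides of the split;
# a random split makes no such guarantee
assert train_users == test_users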

Random Split

collie.cross_validation.random_split(interactions: BaseInteractions, val_p: float = 0.0, test_p: float = 0.2, processes: Optional[Any] = None, seed: Optional[int] = None) → Tuple[BaseInteractions, ...]

Randomly split interactions into training, validation, and testing sets.

This split does NOT guarantee that every user will be represented in both the training and testing datasets. It is much faster than ``stratified_split``, but for that same reason it is not the most representative data split.

Note that this function is not supported for HDF5Interactions objects, since this data split implementation requires all data to fit in memory. A data split for large datasets should be done using a big data processing technology, like Spark.

Parameters
  • interactions (collie.interactions.BaseInteractions) – Interactions instance containing the data to split

  • val_p (float) – Proportion of data used for validation

  • test_p (float) – Proportion of data used for testing

  • processes (Any) – Ignored; included only for compatibility with the stratified_split API

  • seed (int) – Random seed for splits

Returns

  • train_interactions (collie.interactions.BaseInteractions) – Training data of size proportional to 1 - val_p - test_p

  • validate_interactions (collie.interactions.BaseInteractions) – Validation data of size proportional to val_p, returned only if val_p > 0

  • test_interactions (collie.interactions.BaseInteractions) – Testing data of size proportional to test_p

Examples

>>> interactions = Interactions(...)
>>> len(interactions)
100000
>>> train, test = random_split(interactions)
>>> len(train), len(test)
(80000, 20000)
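
When ``val_p > 0``, a validation set is returned as well. Continuing the example above, a sketch (the exact counts depend on the implementation's rounding and are illustrative):

>>> train, validate, test = random_split(interactions, val_p=0.1, seed=42)
>>> len(train), len(validate), len(test)
(70000, 10000, 20000)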

Stratified Split

collie.cross_validation.stratified_split(interactions: BaseInteractions, val_p: float = 0.0, test_p: float = 0.2, processes: int = -1, seed: Optional[int] = None, force_split: bool = False) → Tuple[BaseInteractions, ...]

Split an Interactions instance into train, validate, and test datasets in a stratified manner such that each user appears at least once in each of the datasets.

This split guarantees that every user will be represented in the training, validation, and testing datasets, provided they appear in ``interactions`` at least three times. If ``val_p == 0``, users will appear in both the training and testing datasets provided they appear at least twice. If a user appears fewer times than this, a ValueError will be raised. To filter out users with fewer than ``n`` interactions, use ``collie.utils.remove_users_with_fewer_than_n_interactions``, as sketched below.
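
For example, a minimal sketch of that filtering step, applied to the implicit DataFrame before building the ``Interactions`` object (the ``min_num_of_interactions`` keyword name is an assumption; check the ``collie.utils`` documentation for the exact signature):

from collie.utils import remove_users_with_fewer_than_n_interactions

# keep only users with at least 3 interactions so a three-way
# stratified split cannot raise a ValueError (keyword name assumed)
df_imp = remove_users_with_fewer_than_n_interactions(
    df_imp, min_num_of_interactions=3
)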

This is computationally more complex than ``random_split``, but produces a more representative data split. Note that when ``val_p > 0``, the algorithm will perform the data split twice, once to create the test set and again to create the validation set, essentially doubling the computational time.

Note that this function is not supported for HDF5Interactions objects, since this data split implementation requires all data to fit in memory. A data split for large datasets should be done using a big data processing technology, like Spark.

Parameters
  • interactions (collie.interactions.BaseInteractions) – Interactions instance containing the data to split

  • val_p (float) – Proportion of data used for validation

  • test_p (float) – Proportion of data used for testing

  • processes (int) – Number of CPUs to use for parallelization. If ``processes == 0``, the split runs sequentially in a single list comprehension; otherwise, this function uses ``joblib.delayed`` and ``joblib.Parallel`` for parallelization. A value of -1 means that all available cores will be used

  • seed (int) – Random seed for splits

  • force_split (bool) – Ignore the ValueError normally raised when a user in the dataset has only a single interaction. When ``force_split=True``, users with a single interaction will be placed in the training set and no error will be raised (see the sketch after the Examples below)

Returns

  • train_interactions (collie.interactions.BaseInteractions) – Training data of size proportional to 1 - val_p - test_p

  • validate_interactions (collie.interactions.BaseInteractions) – Validation data of size proportional to val_p, returned only if val_p > 0

  • test_interactions (collie.interactions.BaseInteractions) – Testing data of size proportional to test_p

Examples

>>> interactions = Interactions(...)
>>> len(interactions)
100000
>>> train, test = stratified_split(interactions)
>>> len(train), len(test)
(80000, 20000)
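
As with ``random_split``, setting ``val_p > 0`` returns a validation set as well. A sketch; since stratification rounds per user, the counts shown are illustrative:

>>> train, validate, test = stratified_split(interactions, val_p=0.1, seed=42)
>>> len(train), len(validate), len(test)
(70000, 10000, 20000)

And a sketch of ``force_split`` on hypothetical toy data in which user 2 has only a single interaction:

>>> toy = Interactions(users=[0, 0, 0, 1, 1, 1, 2],
...                    items=[0, 1, 2, 0, 1, 2, 0],
...                    allow_missing_ids=True)
>>> stratified_split(toy)  # raises ValueError: user 2 has a single interaction
>>> train, test = stratified_split(toy, force_split=True)  # user 2 is placed in train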