Interactions

What are Interactions?

The Interactions object is at the core of how data loading and retrieval works in Collie models.

An Interactions object is, in its simplest form, a torch.utils.data.Dataset wrapper around a scipy.sparse.coo_matrix that supports iterating and batching data during model training. We supplement this with data consistency checks during initialization to catch potential errors sooner, a high-throughput and memory-efficient form of negative sampling, and a simple API. Indexing an Interactions object returns a user ID and an item ID that the user has interacted with, as well as an O(1) negative sample of item ID(s) the user has not interacted with, supporting the implicit loss functions built into Collie.

import pandas as pd

from collie.interactions import Interactions


df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                        'item_id': [0, 1, 2, 3, 4, 5]})
interactions = Interactions(users=df['user_id'], items=df['item_id'], num_negative_samples=2)

for _ in range(3):
    print(interactions[0])
# output structure: ((user ID, positive item ID), negative item IDs)
# notice all negative item IDs will be true negatives for user ``0``, e.g.
((0, 0), array([5., 3.]))
((0, 0), array([5., 4.]))
((0, 0), array([3., 5.]))
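
To make the "true negatives" behavior above concrete, here is a minimal pure-Python sketch of exact negative sampling by rejection. It is not Collie's implementation: build_positive_lookup and sample_true_negatives are hypothetical names, and the sketch assumes every user has at least one item they have not interacted with.

```python
import random
from collections import defaultdict


def build_positive_lookup(users, items):
    """Map each user ID to the set of item IDs that user has interacted with."""
    lookup = defaultdict(set)
    for user, item in zip(users, items):
        lookup[user].add(item)
    return lookup


def sample_true_negatives(user, lookup, num_items, num_negative_samples, rng):
    """Draw item IDs at random, rejecting any the user has interacted with."""
    if len(lookup[user]) >= num_items:
        raise ValueError('user has interacted with every item')
    negatives = []
    while len(negatives) < num_negative_samples:
        candidate = rng.randrange(num_items)
        if candidate not in lookup[user]:  # set membership check
            negatives.append(candidate)
    return negatives


users = [0, 0, 0, 1, 1, 2]
items = [0, 1, 2, 3, 4, 5]
lookup = build_positive_lookup(users, items)
rng = random.Random(42)

negatives = sample_true_negatives(
    user=0, lookup=lookup, num_items=6, num_negative_samples=2, rng=rng
)
print(negatives)  # every ID comes from {3, 4, 5}, the items user 0 never touched
```

The lookup of positive items is what makes each rejection check cheap; the cost is the memory needed to hold that lookup.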

We can see this same idea holds when we instead create an InteractionsDataLoader, as such:

import pandas as pd

from collie.interactions import InteractionsDataLoader


df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                        'item_id': [0, 1, 2, 3, 4, 5]})
interactions_loader = InteractionsDataLoader(
    users=df['user_id'], items=df['item_id'], num_negative_samples=2
)

for batch in interactions_loader:
    print(batch)
# output structure: [[user IDs, positive item IDs], negative item IDs]
# user IDs and positive item IDs are now tensors of shape ``batch_size`` and
# negative item IDs is now a tensor of shape ``batch_size x num_negative_samples``
# notice all negative item IDs will still be true negatives, e.g.
[[tensor([0, 0, 0, 1, 1, 2], dtype=torch.int32),
  tensor([0, 1, 2, 3, 4, 5], dtype=torch.int32)],
 tensor([[4., 5.],
         [3., 5.],
         [4., 5.],
         [0., 1.],
         [5., 0.],
         [3., 4.]])]

Once data is in an Interactions form, you can easily perform data splits, train and evaluate a model, and much more. See Cross Validation and Models documentation for more information on this.

How can I speed up Interactions data loading?

While an Interactions object works out-of-the-box with a torch.utils.data.DataLoader, such as the included InteractionsDataLoader, sampling true negatives for each Interactions element can become costly as the number of items grows. In this situation, it might be desirable to trade exact negative sampling for a faster, approximate sampler. For these scenarios, we use the ApproximateNegativeSamplingInteractionsDataLoader, an extension of the more traditional InteractionsDataLoader that samples data in batches, forgoing the expensive concatenation of individual data points an InteractionsDataLoader must do for each batch. Here, negative samples are simply returned as a collection of randomly sampled item IDs, meaning a negative item ID returned for a user can actually be an item the user has positively interacted with. When the number of items is large, though, this scenario is increasingly rare, and the speedup is typically worth the slight hit to model performance.

import pandas as pd

from collie.interactions import ApproximateNegativeSamplingInteractionsDataLoader


df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                        'item_id': [0, 1, 2, 3, 4, 5]})
interactions_loader = ApproximateNegativeSamplingInteractionsDataLoader(
    users=df['user_id'], items=df['item_id'], num_negative_samples=2
)

for batch in interactions_loader:
    print(batch)
# output structure: [[user IDs, positive item IDs], "negative" item IDs]
# user IDs and positive item IDs are now tensors of shape ``batch_size`` and
# negative item IDs is now a tensor of shape ``batch_size x num_negative_samples``
# notice negative item IDs will *not* always be true negatives now, e.g.
[[tensor([0, 0, 0, 1, 1, 2], dtype=torch.int32),
  tensor([0, 1, 2, 3, 4, 5], dtype=torch.int32)],
 tensor([[4, 5],
         [1, 2],
         [4, 2],
         [3, 5],
         [4, 0],
         [4, 3]])]

interactions = interactions_loader.interactions
# use this for cross validation, evaluation, etc.
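
As a rough back-of-the-envelope check on why approximate sampling becomes safer as the catalog grows: if "negative" item IDs are drawn uniformly from all items (an assumption; the exact sampling distribution is not stated here), the chance that a single draw is actually one of the user's positives is the user's number of positive items divided by the number of items. The helper name below is hypothetical.

```python
def false_negative_rate(num_positive_items, num_items):
    """Chance that one uniformly drawn item ID is actually a positive for the user."""
    return num_positive_items / num_items


# user 0 in the toy data above has 3 positives out of 6 items
toy_rate = false_negative_rate(3, 6)
print(toy_rate)  # 0.5

# a user with 100 positives in a hypothetical catalog of 1,000,000 items
large_rate = false_negative_rate(100, 1_000_000)
print(large_rate)  # 0.0001
```

In the six-item toy example, half of all draws for user 0 are mislabeled, which is why the printed batch above contains positives; in a large catalog the same effect is far smaller.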

What if my data cannot fit in memory?

For datasets that are too large to fit in memory, Collie includes the HDF5InteractionsDataLoader (which uses an HDF5Interactions dataset at its base, sharing many of the same features and methods as an Interactions object). An HDF5InteractionsDataLoader applies the same principles behind the ApproximateNegativeSamplingInteractionsDataLoader, but for data stored on disk in HDF5 format. The main drawback to this approach is that when shuffle=True, data will only be shuffled within batches (as opposed to the true shuffle in ApproximateNegativeSamplingInteractionsDataLoader). For sufficiently large data, this effect on model performance should be negligible.

import pandas as pd

from collie.interactions import HDF5InteractionsDataLoader
from collie.utils import pandas_df_to_hdf5


# we'll write out a sample DataFrame to HDF5 format for this example
df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                        'item_id': [0, 1, 2, 3, 4, 5]})
pandas_df_to_hdf5(df=df, out_path='sample_hdf5.h5')

interactions_loader = HDF5InteractionsDataLoader(
    hdf5_path='sample_hdf5.h5',
    user_col='user_id',
    item_col='item_id',
    num_negative_samples=2,
)

for batch in interactions_loader:
    print(batch)
# output structure: [[user IDs, positive item IDs], "negative" item IDs]
# user IDs and positive item IDs are now tensors of shape ``batch_size`` and
# negative item IDs is now a tensor of shape ``batch_size x num_negative_samples``
# notice negative item IDs will *not* always be true negatives now, e.g.
[[tensor([0, 0, 0, 1, 1, 2], dtype=torch.int32),
  tensor([0, 1, 2, 3, 4, 5], dtype=torch.int32)],
 tensor([[5, 4],
         [4, 5],
         [5, 2],
         [4, 3],
         [4, 2],
         [1, 3]])]

The table below shows the time taken to train a MatrixFactorizationModel for a single epoch on MovieLens 10M data, using default parameters on the GPU of a p3.2xlarge EC2 instance.

DataLoader Type                                      Time to Train a Single Epoch
-------------------------------------------------    ----------------------------
InteractionsDataLoader                               1min 25s
ApproximateNegativeSamplingInteractionsDataLoader    1min 8s
HDF5InteractionsDataLoader                           1min 10s
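
The relative speedups implied by these timings can be worked out with a few lines of arithmetic. This is only a calculation over the three numbers reported above; the helper name is hypothetical.

```python
def to_seconds(minutes, seconds):
    """Convert a ``Xmin Ys`` timing to seconds."""
    return 60 * minutes + seconds


epoch_times = {
    'InteractionsDataLoader': to_seconds(1, 25),
    'ApproximateNegativeSamplingInteractionsDataLoader': to_seconds(1, 8),
    'HDF5InteractionsDataLoader': to_seconds(1, 10),
}

baseline = epoch_times['InteractionsDataLoader']
speedups = {name: round(baseline / t, 2) for name, t in epoch_times.items()}

for name, speedup in speedups.items():
    print(f'{name}: {speedup}x')
# InteractionsDataLoader: 1.0x
# ApproximateNegativeSamplingInteractionsDataLoader: 1.25x
# HDF5InteractionsDataLoader: 1.21x
```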

What if my data has explicit ratings in it?

Thus far, we’ve only discussed the scenario in which you have data without an explicit indicator of the degree to which a user liked an item. When you do have that data (e.g. star ratings for product reviews, the number of times a user has interacted with an item, etc.), you have explicit data. As of version 0.6.0 of Collie, this is fully supported within the library, with the only differences between an explicit and implicit pipeline being 1) the dataset definition (detailed below) and 2) evaluation (detailed in Evaluation Metrics).

Note the similarities in the explicit example below with the examples shown thus far:

import pandas as pd

from collie.interactions import ExplicitInteractions


explicit_df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                                 'item_id': [0, 1, 2, 3, 4, 5],
                                 'ratings': [1, 2, 3, 4, 5, 3.5]})
explicit_interactions = ExplicitInteractions(users=explicit_df['user_id'],
                                             items=explicit_df['item_id'],
                                             ratings=explicit_df['ratings'])

for _ in range(3):
    print(explicit_interactions[0])

print('\n-----\n')

for idx in range(len(explicit_interactions)):
    print(explicit_interactions[idx])
# output structure: (user IDs, positive item IDs, ratings)
# notice that unlike implicit interactions, there is no negative sampling going
# on under the hood, meaning this printout will always be deterministic
(0, 0, 1.0)
(0, 0, 1.0)
(0, 0, 1.0)

-----

(0, 0, 1.0)
(0, 1, 2.0)
(0, 2, 3.0)
(1, 3, 4.0)
(1, 4, 5.0)
(2, 5, 3.5)

Once the ExplicitInteractions dataset is defined, you can use the built-in InteractionsDataLoader to batch and iterate through the data!

import pandas as pd

from collie.interactions import ExplicitInteractions, InteractionsDataLoader


# the same setup code from the code snippet above
explicit_df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                                 'item_id': [0, 1, 2, 3, 4, 5],
                                 'ratings': [1, 2, 3, 4, 5, 3.5]})
explicit_interactions = ExplicitInteractions(users=explicit_df['user_id'],
                                             items=explicit_df['item_id'],
                                             ratings=explicit_df['ratings'])

explicit_interactions_loader = InteractionsDataLoader(interactions=explicit_interactions)

for batch in explicit_interactions_loader:
    print(batch)
# output structure: [user IDs, positive item IDs, ratings]
[tensor([0, 0, 0, 1, 1, 2], dtype=torch.int32),
 tensor([0, 1, 2, 3, 4, 5], dtype=torch.int32),
 tensor([1.0000, 2.0000, 3.0000, 4.0000, 5.0000, 3.5000], dtype=torch.float64)]

All Collie models support both implicit and explicit data, and can be instantiated by passing in either the Interactions/ExplicitInteractions data or the dataset wrapped in a DataLoader. See Models for more details on this.

Datasets

Implicit Interactions Dataset

class collie.interactions.Interactions(mat: Optional[Union[coo_matrix, array]] = None, users: Optional[Iterable[int]] = None, items: Optional[Iterable[int]] = None, ratings: Optional[Iterable[int]] = None, num_negative_samples: int = 10, allow_missing_ids: bool = False, remove_duplicate_user_item_pairs: bool = True, num_users: int = 'infer', num_items: int = 'infer', check_num_negative_samples_is_valid: bool = True, max_number_of_samples_to_consider: int = 200, seed: Optional[int] = None)[source]

Bases: BaseInteractions

PyTorch Dataset for implicit user-item interactions data.

If mat is provided, the Interactions instance will act as a wrapper for a sparse matrix in COOrdinate format, typically looking like:

  • Users comprising the rows

  • Items comprising the columns

  • Ratings given by that user for that item comprising the elements of the matrix

Interactions can instead be instantiated by passing in 1-dimensional arrays of corresponding users, items, and ratings (by default, set to 1 for implicit recommenders) values, with the same functionality as a matrix. Note that with this approach, the number of users and items will be inferred as the maximum value in each of those two arrays plus one, and it is expected that every integer between 0 and the maximum ID appears somewhere in the data.
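
The ID conventions described above (IDs start at 0, counts are inferred as the maximum ID plus one, and no ID in between may be skipped unless allow_missing_ids is set) can be sketched in plain Python. This is an illustration only, not Collie's code: the helper name, error type, and message are assumptions.

```python
def infer_count_and_check_ids(ids, allow_missing_ids=False):
    """Infer the ID count as ``max(ids) + 1`` and optionally verify no ID is skipped."""
    ids = list(ids)
    count = max(ids) + 1
    if not allow_missing_ids:
        missing = sorted(set(range(count)) - set(ids))
        if missing:
            raise ValueError(f'IDs missing from data: {missing}')
    return count


num_users = infer_count_and_check_ids([0, 0, 0, 1, 1, 2])
print(num_users)  # 3

try:
    infer_count_and_check_ids([0, 1, 3])
except ValueError as error:
    print(error)  # IDs missing from data: [2]

num_items = infer_count_and_check_ids([0, 1, 3], allow_missing_ids=True)
print(num_items)  # 4
```

Catching a skipped ID at initialization is usually cheaper than discovering later that an embedding row was never trained.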

By default, exact negative sampling will be used during each __getitem__ call. To use approximate negative sampling, set max_number_of_samples_to_consider = 0. This will avoid building a positive item lookup dictionary during initialization.

Unlike in ExplicitInteractions, we rely on negative sampling for implicit data. Each __getitem__ call will thus return a nested tuple containing user IDs, item IDs, and sampled negative item IDs. This nested vs. non-nested structure is key for the model to determine whether it should be implicit or explicit. Use the table below for reference:

__getitem__ Format    Expected Meaning                             Model Type
------------------    -----------------------------------------    ----------
((X, Y), Z)           ((user IDs, item IDs), negative item IDs)    Implicit
(X, Y, Z)             (user IDs, item IDs, ratings)                Explicit
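
One way to picture how the nested vs. non-nested structure can be told apart is the short sketch below. How Collie models actually perform this check is not shown on this page, so infer_data_type is a hypothetical helper written only to illustrate the table.

```python
def infer_data_type(sample):
    """Classify a ``__getitem__``-style output by its tuple structure."""
    if len(sample) == 2 and isinstance(sample[0], (tuple, list)):
        return 'implicit'  # ((user IDs, item IDs), negative item IDs)
    if len(sample) == 3:
        return 'explicit'  # (user IDs, item IDs, ratings)
    raise ValueError('unrecognized sample structure')


implicit_type = infer_data_type(((0, 0), [5.0, 3.0]))
explicit_type = infer_data_type((0, 0, 1.0))
print(implicit_type)  # implicit
print(explicit_type)  # explicit
```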

Parameters
  • mat (scipy.sparse.coo_matrix or numpy.array, 2-dimensional) – Interactions matrix, which, if provided, will be used instead of users, items, and ratings arguments

  • users (Iterable[int], 1-d) – Array of user IDs, starting at 0

  • items (Iterable[int], 1-d) – Array of corresponding item IDs to users, starting at 0

  • ratings (Iterable[int], 1-d) – Array of corresponding ratings to both users and items. If None, will default to each user in users interacting with an item with a rating value of 1

  • num_negative_samples (int) – Number of negative samples to return with each __getitem__ call

  • allow_missing_ids (bool) – If False, will check that both users and items contain each integer from 0 to the maximum value in the array. This check only applies when initializing an Interactions instance using 1-dimensional arrays users and items

  • remove_duplicate_user_item_pairs (bool) – Will check for and remove any duplicate user, item ID pairs from the Interactions matrix during initialization. Note that this will create a second sparse matrix held in memory to efficiently check, which could cause memory concerns for larger data. If you are sure that there are no duplicated user, item ID pairs, set to False

  • num_users (int) – Number of users in the dataset. If num_users == 'infer', this will be set to the mat.shape[0] or max(users) + 1, depending on the input

  • num_items (int) – Number of items in the dataset. If num_items == 'infer', this will be set to the mat.shape[1] or max(items) + 1, depending on the input

  • check_num_negative_samples_is_valid (bool) – Check that num_negative_samples is less than the maximum number of items a user has interacted with. If it is not, then for all users who have fewer than num_negative_samples items not interacted with, a random sample including positive items will be returned as negative

  • max_number_of_samples_to_consider (int) – Number of samples to try for a given user before returning an approximate negative sample. This should be greater than num_negative_samples. If set to 0, approximate negative sampling will be used by default in __getitem__ and a positive item lookup dictionary will NOT be built

  • seed (int) – Seed for random sampling
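
The interplay between num_negative_samples and max_number_of_samples_to_consider can be pictured as bounded rejection sampling with an approximate fallback. The sketch below is an assumption about the general shape of such a sampler, not Collie's implementation; in particular, filling leftover slots with unchecked draws is one plausible reading of "returning an approximate negative sample".

```python
import random


def sample_negatives(positives, num_items, num_negative_samples,
                     max_number_of_samples_to_consider, rng):
    """Try a bounded number of candidates, then fall back to unchecked draws."""
    negatives = []
    for _ in range(max_number_of_samples_to_consider):
        if len(negatives) == num_negative_samples:
            break
        candidate = rng.randrange(num_items)
        if candidate not in positives:
            negatives.append(candidate)
    # fallback: fill any remaining slots without checking against positives
    while len(negatives) < num_negative_samples:
        negatives.append(rng.randrange(num_items))
    return negatives


rng = random.Random(0)
exact = sample_negatives({0, 1, 2}, num_items=6, num_negative_samples=2,
                         max_number_of_samples_to_consider=200, rng=rng)
approximate = sample_negatives({0, 1, 2}, num_items=6, num_negative_samples=2,
                               max_number_of_samples_to_consider=0, rng=rng)
print(exact)        # with 200 tries, both IDs come from {3, 4, 5}
print(approximate)  # unchecked draws from 0..5, positives may appear
```

With a budget of 0 the checked loop never runs, which mirrors the documented behavior that no positive item lookup is needed in that mode.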

head(n: int = 5) array

Return the first n rows of the dense matrix as a np.array, 2-d.

tail(n: int = 5) array

Return the last n rows of the dense matrix as a np.array, 2-d.

toarray() array

Transforms BaseInteractions instance sparse matrix to np.array, 2-d.

todense() matrix

Transforms BaseInteractions instance sparse matrix to np.matrix, 2-d.

Explicit Interactions Dataset

class collie.interactions.ExplicitInteractions(mat: Optional[Union[coo_matrix, array]] = None, users: Optional[Iterable[int]] = None, items: Optional[Iterable[int]] = None, ratings: Optional[Iterable[int]] = None, allow_missing_ids: bool = False, remove_duplicate_user_item_pairs: bool = True, num_users: int = 'infer', num_items: int = 'infer')[source]

Bases: BaseInteractions

PyTorch Dataset for explicit user-item interactions data.

If mat is provided, the ExplicitInteractions instance will act as a wrapper for a sparse matrix in COOrdinate format, typically looking like:

  • Users comprising the rows

  • Items comprising the columns

  • Ratings given by that user for that item comprising the elements of the matrix

ExplicitInteractions can instead be instantiated by passing in 1-dimensional arrays of corresponding users, items, and ratings values, with the same functionality as a matrix. Note that with this approach, the number of users and items will be inferred as the maximum value in each of those two arrays plus one, and it is expected that every integer between 0 and the maximum ID appears somewhere in the user or item ID data.

Unlike in Interactions, there is no need for negative sampling for explicit data. Each __getitem__ call will thus return a single, non-nested tuple containing user IDs, item IDs, and ratings. This nested vs. non-nested structure is key for the model to determine whether it should be implicit or explicit. Use the table below for reference:

__getitem__ Format    Expected Meaning                             Model Type
------------------    -----------------------------------------    ----------
((X, Y), Z)           ((user IDs, item IDs), negative item IDs)    Implicit
(X, Y, Z)             (user IDs, item IDs, ratings)                Explicit

Parameters
  • mat (scipy.sparse.coo_matrix or numpy.array, 2-dimensional) – Interactions matrix, which, if provided, will be used instead of users, items, and ratings arguments

  • users (Iterable[int], 1-d) – Array of user IDs, starting at 0

  • items (Iterable[int], 1-d) – Array of corresponding item IDs to users, starting at 0

  • ratings (Iterable[int], 1-d) – Array of corresponding ratings to both users and items. If None, will default to each user in users interacting with an item with a rating value of 1

  • allow_missing_ids (bool) – If False, will check that both users and items contain each integer from 0 to the maximum value in the array. This check only applies when initializing an ExplicitInteractions instance using 1-dimensional arrays users and items

  • remove_duplicate_user_item_pairs (bool) – Will check for and remove any duplicate user, item ID pairs from the ExplicitInteractions matrix during initialization. Note that this will create a second sparse matrix held in memory to efficiently check, which could cause memory concerns for larger data. If you are sure that there are no duplicated user, item ID pairs, set to False

  • num_users (int) – Number of users in the dataset. If num_users == 'infer', this will be set to the mat.shape[0] or max(users) + 1, depending on the input

  • num_items (int) – Number of items in the dataset. If num_items == 'infer', this will be set to the mat.shape[1] or max(items) + 1, depending on the input
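
To illustrate what removing duplicate user, item ID pairs means for explicit data, here is a small pure-Python sketch. It uses a set rather than the second sparse matrix described above, and it assumes the first rating seen for a pair is the one kept; which rating Collie retains is not stated on this page. The helper name is hypothetical.

```python
def remove_duplicate_pairs(users, items, ratings):
    """Keep the first occurrence of each (user, item) pair, preserving order."""
    seen = set()
    kept = []
    for user, item, rating in zip(users, items, ratings):
        if (user, item) not in seen:
            seen.add((user, item))
            kept.append((user, item, rating))
    return kept


deduplicated = remove_duplicate_pairs(
    users=[0, 0, 1, 0],
    items=[1, 2, 1, 1],
    ratings=[5, 3, 4, 2],
)
print(deduplicated)
# [(0, 1, 5), (0, 2, 3), (1, 1, 4)]
```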

head(n: int = 5) array

Return the first n rows of the dense matrix as a np.array, 2-d.

property num_negative_samples: int

Does not exist for explicit data.

tail(n: int = 5) array

Return the last n rows of the dense matrix as a np.array, 2-d.

toarray() array

Transforms BaseInteractions instance sparse matrix to np.array, 2-d.

todense() matrix

Transforms BaseInteractions instance sparse matrix to np.matrix, 2-d.

HDF5 Interactions Dataset

class collie.interactions.HDF5Interactions(hdf5_path: str, user_col: str = 'users', item_col: str = 'items', num_negative_samples: int = 10, num_users: int = 'infer', num_items: int = 'infer', seed: Optional[int] = None, shuffle: bool = False)[source]

Bases: Dataset

Create an Interactions-like object for data in the HDF5 format that might be too large to fit in memory.

Many of the same features of Interactions are implemented here, with the exception that approximate negative sampling will always be used.

Parameters
  • hdf5_path (str) – Path to the HDF5 file containing the interactions data

  • user_col (str) – Column in HDF5 file with user IDs. IDs must begin at 0

  • item_col (str) – Column in HDF5 file with item IDs. IDs must begin at 0

  • num_negative_samples (int) – Number of negative samples to return with each __getitem__ call

  • num_users (int) – Number of users in the dataset. If num_users == 'infer' and there is not a meta key in hdf5_path’s HDF5 dataset, this will be set to the maximum value in user_col + 1, found by iterating through the entire dataset

  • num_items (int) – Number of items in the dataset. If num_items == 'infer' and there is not a meta key in hdf5_path’s HDF5 dataset, this will be set to the maximum value in item_col + 1, found by iterating through the entire dataset

  • seed (int) – Seed for random sampling and shuffling if shuffle is True

  • shuffle (bool) – Shuffle data in a batch. For example, if one calls __getitem__ with start_idx_and_batch_size = (0, 4) and shuffle is False, this will always return the data at indices 0, 1, 2, 3 in order. However, the same call with shuffle = True will return a random shuffle of 0, 1, 2, 3 each call. This is recommended for use in an HDF5InteractionsDataLoader for training data in lieu of true data shuffling
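
The within-batch shuffle described for the shuffle parameter can be sketched as follows. This is a plain-Python illustration of the indexing behavior only; get_batch_indices is a hypothetical helper, and reading the rows from disk is left out.

```python
import random


def get_batch_indices(start_idx_and_batch_size, shuffle, rng):
    """Return the contiguous indices for a batch, optionally shuffled within the batch."""
    start_idx, batch_size = start_idx_and_batch_size
    indices = list(range(start_idx, start_idx + batch_size))
    if shuffle:
        rng.shuffle(indices)
    return indices


rng = random.Random(0)

ordered = get_batch_indices((0, 4), shuffle=False, rng=rng)
print(ordered)  # [0, 1, 2, 3]

shuffled = get_batch_indices((0, 4), shuffle=True, rng=rng)
print(sorted(shuffled))  # [0, 1, 2, 3]: same rows, possibly in a different order
```

Because the rows in a batch are always contiguous on disk, only their order inside the batch changes, which is why this is weaker than a true shuffle across the whole dataset.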

head(n: int = 5) DataFrame[source]

Return the first n rows of the underlying pd.DataFrame.

tail(n: int = 5) DataFrame[source]

Return the last n rows of the underlying pd.DataFrame.

DataLoaders

Interactions DataLoader

class collie.interactions.InteractionsDataLoader(interactions: Optional[BaseInteractions] = None, mat: Optional[Union[coo_matrix, array]] = None, users: Optional[Iterable[int]] = None, items: Optional[Iterable[int]] = None, ratings: Optional[Iterable[int]] = None, batch_size: int = 1024, shuffle: bool = False, **kwargs)[source]

Bases: BaseInteractionsDataLoader

A light wrapper around a torch.utils.data.DataLoader for Interactions-type datasets.

For implicit data, batches will be created one-point-at-a-time using exact negative sampling (unless configured not to in interactions), which is optimal when datasets are smaller (fewer than roughly 1M interactions) and model training speed is not a concern. This is the default DataLoader for Interactions datasets.

For explicit data, negative sampling is not used, but batches will still be created one-point-at-a-time.

Parameters
  • interactions (BaseInteractions) – If not provided, an Interactions object will be created with mat or all of users, items, and ratings

  • mat (scipy.sparse.coo_matrix or numpy.array, 2-dimensional) – If interactions is None, will be used instead of users, items, and ratings arguments to create an Interactions object

  • users (Iterable[int], 1-d) – If interactions is None and mat is None, array of user IDs, starting at 0

  • items (Iterable[int], 1-d) – If interactions is None and mat is None, array of corresponding item IDs to users, starting at 0

  • ratings (Iterable[int], 1-d) – If interactions is None and mat is None, array of corresponding ratings to both users and items. If None, will default to each user in users interacting with an item with a rating value of 1

  • batch_size (int) – Number of samples per batch to load

  • shuffle (bool) – Whether to shuffle the order of data returned or not. This is especially useful for training data to ensure the model does not overfit to a specific order of data

  • **kwargs (keyword arguments) – Relevant keyword arguments will be passed into Interactions object creation, if interactions is None and the keyword argument matches one of Interactions.__init__.__code__.co_varnames. All other keyword arguments will be passed into torch.utils.data.DataLoader: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
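
The keyword-argument routing described for **kwargs can be sketched with a stand-in class. ToyInteractions and split_kwargs are hypothetical names used only for illustration; the one real mechanism relied on is Python's __init__.__code__.co_varnames, which lists a function's argument names followed by its local variable names.

```python
class ToyInteractions:
    """Hypothetical stand-in for a dataset class; not Collie's ``Interactions``."""

    def __init__(self, users, items, num_negative_samples=10, seed=None):
        self.users = users
        self.items = items
        self.num_negative_samples = num_negative_samples
        self.seed = seed


def split_kwargs(kwargs, dataset_cls):
    """Route kwargs naming ``dataset_cls.__init__`` variables to the dataset, the rest elsewhere."""
    dataset_names = dataset_cls.__init__.__code__.co_varnames
    dataset_kwargs = {k: v for k, v in kwargs.items() if k in dataset_names}
    loader_kwargs = {k: v for k, v in kwargs.items() if k not in dataset_names}
    return dataset_kwargs, loader_kwargs


dataset_kwargs, loader_kwargs = split_kwargs(
    {'num_negative_samples': 2, 'seed': 42, 'num_workers': 4, 'drop_last': True},
    ToyInteractions,
)
print(dataset_kwargs)  # {'num_negative_samples': 2, 'seed': 42}
print(loader_kwargs)   # {'num_workers': 4, 'drop_last': True}
```

Since co_varnames also includes local variables defined inside __init__, a keyword that happens to share a name with such a local would be routed to the dataset as well.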

interactions
Type

Interactions (default) or ExplicitInteractions

The original torch.utils.data.DataLoader docstring follows.

Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset.

The torch.utils.data.DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

See the torch.utils.data documentation page for more details.

Args
  • dataset (Dataset) – Dataset from which to load the data

  • batch_size (int, optional) – How many samples per batch to load (default: 1)

  • shuffle (bool, optional) – Set to True to have the data reshuffled at every epoch (default: False)

  • sampler (Sampler or Iterable, optional) – Defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified

  • batch_sampler (Sampler or Iterable, optional) – Like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last

  • num_workers (int, optional) – How many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 0)

  • collate_fn (Callable, optional) – Merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset

  • pin_memory (bool, optional) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the memory pinning example in the torch.utils.data documentation

  • drop_last (bool, optional) – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller (default: False)

  • timeout (numeric, optional) – If positive, the timeout value for collecting a batch from workers. Should always be non-negative (default: 0)

  • worker_init_fn (Callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading (default: None)

  • generator (torch.Generator, optional) – If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers (default: None)

  • prefetch_factor (int, optional, keyword-only arg) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers (default: 2)

  • persistent_workers (bool, optional) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive (default: False)

  • pin_memory_device (str, optional) – The data loader will copy Tensors into device pinned memory before returning them if pin_memory is set to True

Warning

If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function. See multiprocessing-best-practices for more details related to multiprocessing in PyTorch.

Warning

The len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on len(dataset) / batch_size, with proper rounding depending on drop_last, regardless of multi-process loading configurations. This represents the best guess PyTorch can make because PyTorch trusts user dataset code in correctly handling multi-process loading to avoid duplicate data.

However, if sharding results in multiple workers having incomplete last batches, this estimate can still be inaccurate, because (1) an otherwise complete batch can be broken into multiple ones and (2) more than one batch worth of samples can be dropped when drop_last is set. Unfortunately, PyTorch cannot detect such cases in general.

See Dataset Types for more details on these two types of datasets and how IterableDataset interacts with Multi-process data loading.

Warning

See the reproducibility, dataloader-workers-random-seed, and data-loading-randomness notes for random seed related questions.

property mat: coo_matrix

Sparse COO matrix of interactions.

property num_interactions: int

Number of interactions in interactions.

property num_items: int

Number of items in interactions.

property num_negative_samples: int

Number of negative samples in interactions.

property num_users: int

Number of users in interactions.

Approximate Negative Sampling Interactions DataLoader

class collie.interactions.ApproximateNegativeSamplingInteractionsDataLoader(interactions: Optional[Interactions] = None, mat: Optional[Union[coo_matrix, array]] = None, users: Optional[Iterable[int]] = None, items: Optional[Iterable[int]] = None, ratings: Optional[Iterable[int]] = None, batch_size: int = 1024, shuffle: bool = False, **kwargs)[source]

Bases: BaseInteractionsDataLoader

A computationally more efficient DataLoader for Interactions data using approximate negative sampling for negative items.

This DataLoader groups __getitem__ calls together into a single operation, which dramatically speeds up a traditional DataLoader’s process of calling __getitem__ one index at a time, then concatenating them together before returning. In an effort to batch operations together, all negative samples returned will be approximate, meaning this does not check if a user has previously interacted with the item. With a sufficient number of interactions (1M+), we have found a speed increase of 2x at the cost of a 1% reduction in MAP @ 10 performance compared to InteractionsDataLoader.

For greater efficiency, we disable automated batching by setting the DataLoader’s batch_size attribute to None. Thus, to access the “true” batch size that the sampler uses, access ApproximateNegativeSamplingInteractionsDataLoader.approximate_negative_sampler.batch_size.
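
The idea of grouping __getitem__ calls can be sketched without PyTorch: a sampler yields a whole list of indices at once, and the dataset gathers the full batch in a single call, so there is nothing left to collate afterwards. Both names below are hypothetical and the sketch omits negative sampling entirely; in PyTorch itself, passing batch_size=None to a DataLoader is what disables automatic batching.

```python
def batched_indices(num_interactions, batch_size):
    """Yield one list of indices per batch instead of one index at a time."""
    for start in range(0, num_interactions, batch_size):
        yield list(range(start, min(start + batch_size, num_interactions)))


class ToyBatchedDataset:
    """Hypothetical dataset whose ``__getitem__`` accepts a whole batch of indices."""

    def __init__(self, users, items):
        self.users = users
        self.items = items

    def __getitem__(self, indices):
        # one call gathers the whole batch, so no per-sample collation is needed
        return [self.users[i] for i in indices], [self.items[i] for i in indices]


dataset = ToyBatchedDataset(users=[0, 0, 0, 1, 1, 2], items=[0, 1, 2, 3, 4, 5])
batches = [dataset[indices] for indices in batched_indices(6, batch_size=4)]
print(batches)
# [([0, 0, 0, 1], [0, 1, 2, 3]), ([1, 2], [4, 5])]
```

In this arrangement the batch size belongs to the sampler rather than the loader, which matches the note above about reading it from approximate_negative_sampler.batch_size.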

Parameters
  • interactions (Interactions) – If not provided, an Interactions object will be created with mat or all of users, items, and ratings with max_number_of_samples_to_consider=0

  • mat (scipy.sparse.coo_matrix or numpy.array, 2-dimensional) – If interactions is None, will be used instead of users, items, and ratings arguments to create an Interactions object

  • users (Iterable[int], 1-d) – If interactions is None and mat is None, array of user IDs, starting at 0

  • items (Iterable[int], 1-d) – If interactions is None and mat is None, array of corresponding item IDs to users, starting at 0

  • ratings (Iterable[int], 1-d) – If interactions is None and mat is None, array of corresponding ratings to both users and items. If None, will default to each user in users interacting with an item with a rating value of 1

  • batch_size (int) – Number of samples per batch to load

  • shuffle (bool) – Whether to shuffle the order of data returned or not. This is especially useful for training data to ensure the model does not overfit to a specific order of data

  • **kwargs (keyword arguments) – Relevant keyword arguments will be passed into Interactions object creation, if interactions is None and the keyword argument matches one of Interactions.__init__.__code__.co_varnames. All other keyword arguments will be passed into torch.utils.data.DataLoader: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader

interactions
Type

Interactions

Original ``torch.utils.data.DataLoader`` docstring as follows
########
Data loader. Combines a dataset and a sampler, and provides an iterable over
the given dataset.
The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

See the torch.utils.data documentation page for more details.

Args

  • dataset (Dataset) – dataset from which to load the data.

  • batch_size (int, optional) – how many samples per batch to load (default: 1).

  • shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).

  • sampler (Sampler or Iterable, optional) – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.

  • batch_sampler (Sampler or Iterable, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.

  • num_workers (int, optional) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

  • collate_fn (Callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • pin_memory (bool, optional) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the memory pinning example in the torch.utils.data documentation.

  • drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)

  • timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

  • worker_init_fn (Callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

  • generator (torch.Generator, optional) – If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers. (default: None)

  • prefetch_factor (int, optional, keyword-only arg) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default: 2)

  • persistent_workers (bool, optional) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive. (default: False)

  • pin_memory_device (str, optional) – the data loader will copy Tensors into device pinned memory before returning them if pin_memory is set to True.

Warning

If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function. See the multiprocessing best practices notes in the PyTorch documentation for more details related to multiprocessing in PyTorch.
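The picklability constraint in this warning can be checked before a function is handed to the DataLoader. The following is a minimal sketch using only the standard library; is_picklable and seed_worker are hypothetical helper names for illustration, not part of PyTorch or Collie.

```python
import pickle
import random


def is_picklable(obj) -> bool:
    """Return True if ``obj`` survives ``pickle.dumps``, which ``spawn`` relies on."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def seed_worker(worker_id: int) -> None:
    """Hypothetical ``worker_init_fn`` written as a named, module-level function."""
    random.seed(worker_id)


# a lambda has no importable name, so pickle cannot serialize it by reference
lambda_init_fn = lambda worker_id: random.seed(worker_id)  # noqa: E731

lambda_ok = is_picklable(lambda_init_fn)
# a named function is typically picklable when it lives at module level
named_ok = is_picklable(seed_worker)
```

Under these assumptions, passing a named function such as seed_worker as worker_init_fn is the safer choice when workers are started with spawn.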

Warning

len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on len(dataset) / batch_size, with proper rounding depending on drop_last, regardless of multi-process loading configurations. This represents the best guess PyTorch can make because PyTorch trusts user dataset code in correctly handling multi-process loading to avoid duplicate data.

However, if sharding results in multiple workers having incomplete last batches, this estimate can still be inaccurate, because (1) an otherwise complete batch can be broken into multiple ones and (2) more than one batch worth of samples can be dropped when drop_last is set. Unfortunately, PyTorch cannot detect such cases in general.

See Dataset Types in the torch.utils.data documentation for more details on these two types of datasets and how IterableDataset interacts with Multi-process data loading.
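The arithmetic behind this warning can be illustrated with a small, PyTorch-free sketch. The function names are hypothetical, and the even split of samples across workers is an assumption made for illustration: in practice, user dataset code decides how an IterableDataset is sharded.

```python
import math


def estimated_len(dataset_len: int, batch_size: int, drop_last: bool) -> int:
    """Estimate based on len(dataset) / batch_size, rounded according to drop_last."""
    if drop_last:
        return dataset_len // batch_size
    return math.ceil(dataset_len / batch_size)


def sharded_batch_count(dataset_len: int, batch_size: int, drop_last: bool,
                        num_workers: int) -> int:
    """Count batches when samples are split evenly across workers (assumed sharding)."""
    shard_sizes = [
        dataset_len // num_workers + (1 if worker < dataset_len % num_workers else 0)
        for worker in range(num_workers)
    ]
    total = 0
    for size in shard_sizes:
        full_batches, remainder = divmod(size, batch_size)
        # each worker batches its own shard, so each can end on an incomplete batch
        total += full_batches + (1 if remainder and not drop_last else 0)
    return total


# case (1): a complete batch is broken up, so more batches appear than estimated
print(estimated_len(10, 4, drop_last=False), sharded_batch_count(10, 4, False, 2))  # 3 4

# case (2): each worker drops its own remainder, so fewer batches appear than estimated
print(estimated_len(14, 4, drop_last=True), sharded_batch_count(14, 4, True, 2))  # 3 2
```

In the second case, each of the two workers drops 3 samples, so 6 samples (more than one batch of 4) are lost.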

Warning

See the reproducibility, DataLoader worker random seed, and data loading randomness notes in the PyTorch documentation for questions related to random seeds.

property mat: coo_matrix

Sparse COO matrix of interactions.

property num_interactions: int

Number of interactions in interactions.

property num_items: int

Number of items in interactions.

property num_negative_samples: int

Number of negative samples in interactions.

property num_users: int

Number of users in interactions.

HDF5 Approximate Negative Sampling Interactions DataLoader

class collie.interactions.HDF5InteractionsDataLoader(hdf5_interactions: Optional[HDF5Interactions] = None, hdf5_path: Optional[str] = None, batch_size: int = 1024, shuffle: bool = False, **kwargs)[source]

Bases: BaseInteractionsDataLoader

A light wrapper around a torch.utils.data.DataLoader for HDF5 data, with behavior very similar to ApproximateNegativeSamplingInteractionsDataLoader.

If hdf5_interactions is not provided, an HDF5Interactions dataset will be created to serve as the data for the DataLoader. A custom sampler, HDF5Sampler, will also be instantiated for the DataLoader to use; it samples in batches, which makes for faster HDF5 data reads from disk.

While similar to a standard DataLoader, note that when shuffle is True, this will only shuffle the order of batches and the data within batches to still make for efficient reading of HDF5 data from disk, rather than shuffling across the entire dataset.

For greater efficiency, we disable automated batching by setting the DataLoader’s batch_size attribute to None. Thus, to access the “true” batch size that the sampler uses, access HDF5InteractionsDataLoader.hdf5_sampler.batch_size.
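The batch-level shuffling described above can be sketched in a few lines. This is an illustration of the idea only, not the HDF5Sampler implementation; the function name batch_shuffled_indices and its arguments are hypothetical.

```python
import random
from typing import List, Optional


def batch_shuffled_indices(num_rows: int, batch_size: int, shuffle: bool,
                           seed: Optional[int] = None) -> List[List[int]]:
    """Group row indices into contiguous batches, then shuffle only at the batch level."""
    rng = random.Random(seed)
    # every batch covers a contiguous range of rows, which keeps disk reads sequential
    batches = [
        list(range(start, min(start + batch_size, num_rows)))
        for start in range(0, num_rows, batch_size)
    ]
    if shuffle:
        rng.shuffle(batches)      # shuffle the order of the batches
        for batch in batches:
            rng.shuffle(batch)    # shuffle the rows within each batch
    return batches


batches = batch_shuffled_indices(num_rows=10, batch_size=4, shuffle=True, seed=0)
for batch in batches:
    print(sorted(batch))  # each batch still spans one contiguous block of rows
```

Because rows never move between batches, two rows that are far apart on disk never land in the same batch, which is the trade-off against a full shuffle across the entire dataset.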

Parameters
  • hdf5_interactions (HDF5Interactions) – If provided, will override input argument for hdf5_path

  • hdf5_path (str) – If hdf5_interactions is None, the path to the HDF5 dataset

  • batch_size (int) – Number of samples per batch to load

  • shuffle (bool) – Whether to shuffle the order of batches returned or not. This is especially useful for training data to ensure the model does not overfit to a specific order of data. Note that this will not perform a true shuffle of the data, but will shuffle the order of batches. While this only approximates a true shuffle, it allows a greater speedup during model training with a negligible effect on model performance

  • **kwargs (keyword arguments) – Relevant keyword arguments will be passed into HDF5Interactions object creation, if hdf5_interactions is None and the keyword argument matches one of HDF5Interactions.__init__.__code__.co_varnames. All other keyword arguments will be passed into torch.utils.data.DataLoader: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader

Original torch.utils.data.DataLoader docstring as follows:

Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset.

The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

See the torch.utils.data documentation page for more details.

Args

  • dataset (Dataset) – dataset from which to load the data.

  • batch_size (int, optional) – how many samples per batch to load (default: 1).

  • shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).

  • sampler (Sampler or Iterable, optional) – defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.

  • batch_sampler (Sampler or Iterable, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.

  • num_workers (int, optional) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

  • collate_fn (Callable, optional) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • pin_memory (bool, optional) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the memory pinning example in the torch.utils.data documentation.

  • drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)

  • timeout (numeric, optional) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

  • worker_init_fn (Callable, optional) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

  • generator (torch.Generator, optional) – If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers. (default: None)

  • prefetch_factor (int, optional, keyword-only arg) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default: 2)

  • persistent_workers (bool, optional) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive. (default: False)

  • pin_memory_device (str, optional) – the data loader will copy Tensors into device pinned memory before returning them if pin_memory is set to True.

Warning

If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function. See the multiprocessing best practices notes in the PyTorch documentation for more details related to multiprocessing in PyTorch.

Warning

len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on len(dataset) / batch_size, with proper rounding depending on drop_last, regardless of multi-process loading configurations. This represents the best guess PyTorch can make because PyTorch trusts user dataset code in correctly handling multi-process loading to avoid duplicate data.

However, if sharding results in multiple workers having incomplete last batches, this estimate can still be inaccurate, because (1) an otherwise complete batch can be broken into multiple ones and (2) more than one batch worth of samples can be dropped when drop_last is set. Unfortunately, PyTorch cannot detect such cases in general.

See Dataset Types in the torch.utils.data documentation for more details on these two types of datasets and how IterableDataset interacts with Multi-process data loading.

Warning

See the reproducibility, DataLoader worker random seed, and data loading randomness notes in the PyTorch documentation for questions related to random seeds.
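The routing of **kwargs described in the Parameters list for this class, where names found in HDF5Interactions.__init__.__code__.co_varnames go to the dataset and everything else goes to the DataLoader, can be sketched as follows. ExampleDataset and split_kwargs are hypothetical stand-ins written for illustration, not Collie code.

```python
class ExampleDataset:
    """Hypothetical stand-in for a dataset class such as HDF5Interactions."""

    def __init__(self, path, num_negative_samples=10, seed=None):
        self.path = path
        self.num_negative_samples = num_negative_samples
        self.seed = seed


def split_kwargs(init_fn, kwargs):
    """Split kwargs into those named in ``init_fn``'s code object and the rest."""
    # co_varnames lists the argument names first, followed by any local variable names
    dataset_names = set(init_fn.__code__.co_varnames)
    dataset_kwargs = {k: v for k, v in kwargs.items() if k in dataset_names}
    loader_kwargs = {k: v for k, v in kwargs.items() if k not in dataset_names}
    return dataset_kwargs, loader_kwargs


dataset_kwargs, loader_kwargs = split_kwargs(
    ExampleDataset.__init__,
    {'num_negative_samples': 5, 'seed': 42, 'num_workers': 2, 'pin_memory': True},
)
print(dataset_kwargs)  # {'num_negative_samples': 5, 'seed': 42}
print(loader_kwargs)   # {'num_workers': 2, 'pin_memory': True}
```

One consequence of matching on co_varnames is that a keyword argument sharing a name with a local variable inside __init__ would also be routed to the dataset, so distinct names are the safer choice.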

property mat: None

mat attribute is not possible to access in HDF5InteractionsDataLoader.

property num_interactions: int

Number of interactions in interactions.

property num_items: int

Number of items in interactions.

property num_negative_samples: int

Number of negative samples in interactions.

property num_users: int

Number of users in interactions.

Footnotes

1

Welcome to a detailed footnote about this experiment!

  • The MovieLens 10M data was preprocessed using Collie utility functions (see Utility Functions) that keep all ratings above a 4 and remove users with fewer than 3 interactions. This left us with 5,005,398 total interactions.

  • For a much faster training time, we recommend setting sparse=True (see the point below this) in the model definition and using a larger batch size with pin_memory=True in the DataLoader.

  • Since we used default parameters, the embeddings of the MatrixFactorizationModel were not sparse. Had we used sparse embeddings and a Sparse Adam optimizer, the table would show:

DataLoader Type                                      Time to Train a Single Epoch
InteractionsDataLoader                               1min 21s
ApproximateNegativeSamplingInteractionsDataLoader    1min 4s
HDF5InteractionsDataLoader                           1min 7s

These differences in training time become more dramatic with larger datasets (1M+ items). While the sparse options are certainly faster, making sparse settings the default would limit the optimizer options and the general flexibility of customizing an architecture, since not all PyTorch operations support sparse layers. For that reason, we made the default parameters non-sparse, which works best for small datasets.

  • We have also noticed drastic changes in training time depending on the version of PyTorch used. While we used torch@1.8.0 here, we have noticed the fastest training times using torch@1.6.0. If you understand why this is, make a PR updating these docs with that information!
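The sparse setup mentioned in the footnote above (sparse embeddings paired with a Sparse Adam optimizer) can be illustrated in plain PyTorch. This is a minimal sketch rather than Collie's model code: the embedding size, item IDs, and loss are made up for illustration.

```python
import torch

torch.manual_seed(0)

# an embedding table with sparse gradients: only looked-up rows receive gradient entries
embeddings = torch.nn.Embedding(num_embeddings=1000, embedding_dim=8, sparse=True)
# SparseAdam is an Adam variant that accepts sparse gradients
optimizer = torch.optim.SparseAdam(embeddings.parameters(), lr=1e-2)

item_ids = torch.tensor([3, 17, 3, 256])
untouched_before = embeddings.weight[0].detach().clone()

optimizer.zero_grad()
loss = embeddings(item_ids).pow(2).sum()  # toy loss, for illustration only
loss.backward()

grad_is_sparse = embeddings.weight.grad.is_sparse
touched_rows = embeddings.weight.grad.coalesce().indices()[0].tolist()

optimizer.step()
untouched_unchanged = torch.equal(untouched_before, embeddings.weight[0].detach())

print(grad_is_sparse, touched_rows, untouched_unchanged)
```

Only the rows that appear in a batch are updated, which is where the time savings for large item catalogs come from. As the footnote notes, not every optimizer or operation accepts sparse gradients.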