Interactions
What are Interactions?
The Interactions object is at the core of how data loading and retrieval works in Collie models.
An Interactions object is, in its simplest form, a torch.utils.data.Dataset wrapper around a scipy.sparse.coo_matrix that supports iterating and batching data during model training. We supplement this with data consistency checks during initialization to catch potential errors sooner, a high-throughput and memory-efficient form of negative sampling, and a simple API. Indexing an Interactions object returns a user ID and an item ID that the user has interacted with, as well as an O(1) negative sample of item ID(s) the user has not interacted with, supporting the implicit loss functions built into Collie.
import pandas as pd
from collie.interactions import Interactions

df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                        'item_id': [0, 1, 2, 3, 4, 5]})
interactions = Interactions(users=df['user_id'], items=df['item_id'], num_negative_samples=2)

for _ in range(3):
    print(interactions[0])

# output structure: ((user IDs, positive item IDs), negative items IDs)
# notice all negative item IDs will be true negatives for user ``0``, e.g.
((0, 0), array([5., 3.]))
((0, 0), array([5., 4.]))
((0, 0), array([3., 5.]))
We can see this same idea holds when we instead create an InteractionsDataLoader, as such:
import pandas as pd
from collie.interactions import InteractionsDataLoader

df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                        'item_id': [0, 1, 2, 3, 4, 5]})
interactions_loader = InteractionsDataLoader(
    users=df['user_id'], items=df['item_id'], num_negative_samples=2
)

for batch in interactions_loader:
    print(batch)

# output structure: [[user IDs, positive item IDs], negative items IDs]
# users and positive items IDs is now a tensor of shape ``batch_size`` and
# negative items IDs is now a tensor of shape ``batch_size x num_negative_samples``
# notice all negative item IDs will still be true negatives, e.g.
[[tensor([0, 0, 0, 1, 1, 2], dtype=torch.int32),
  tensor([0, 1, 2, 3, 4, 5], dtype=torch.int32)],
 tensor([[4., 5.],
         [3., 5.],
         [4., 5.],
         [0., 1.],
         [5., 0.],
         [3., 4.]])]
Once data is in Interactions form, you can easily perform data splits, train and evaluate a model, and much more. See the Cross Validation and Models documentation for more information.
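To make the idea of exact negative sampling more concrete, here is a minimal pure-Python sketch of rejection sampling against a per-user lookup of positive items. It is an illustration under stated assumptions, not Collie's actual implementation: the helper names `build_positive_lookup` and `sample_negatives` and the `max_tries` argument are hypothetical, loosely mirroring the documented `max_number_of_samples_to_consider` behavior of falling back to unchecked samples.

```python
import random
from collections import defaultdict


def build_positive_lookup(users, items):
    """Map each user ID to the set of item IDs that user has interacted with."""
    positives = defaultdict(set)
    for user, item in zip(users, items):
        positives[user].add(item)
    return positives


def sample_negatives(user, positives, num_items, num_negative_samples,
                     max_tries=200, rng=random):
    """Draw random item IDs, rejecting any item the user has interacted with.

    Each membership check against the set is O(1) on average. After
    ``max_tries`` draws, remaining slots are filled with unchecked
    (approximate) samples so the call always terminates.
    """
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == num_negative_samples:
            break
        candidate = rng.randrange(num_items)
        if candidate not in positives[user]:
            negatives.append(candidate)
    while len(negatives) < num_negative_samples:
        negatives.append(rng.randrange(num_items))
    return negatives


users = [0, 0, 0, 1, 1, 2]
items = [0, 1, 2, 3, 4, 5]
positives = build_positive_lookup(users, items)
negatives = sample_negatives(0, positives, num_items=6,
                             num_negative_samples=2, rng=random.Random(42))
print(negatives)  # two item IDs drawn from {3, 4, 5}
```

The fallback loop is the design choice worth noting: bounding the number of tries keeps sampling time predictable even for users who have interacted with nearly every item.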
How can I speed up Interactions data loading?
While an Interactions object works out-of-the-box with a torch.utils.data.DataLoader, such as the included InteractionsDataLoader, sampling true negatives for each Interactions element can become costly as the number of items grows. In this situation, it might be desirable to trade exact negative sampling for a faster, approximate sampler. For these scenarios, we use the ApproximateNegativeSamplingInteractionsDataLoader, an extension of the more traditional InteractionsDataLoader that samples data in batches, forgoing the expensive concatenation of individual data points an InteractionsDataLoader must do for each batch. Here, negative samples are simply returned as a collection of randomly sampled item IDs, meaning it is possible that a negative item ID returned for a user is actually an item that user has positively interacted with. When the number of items is large, though, this scenario is increasingly rare, and the speedup benefit is worth the slight performance hit.
import pandas as pd
from collie.interactions import ApproximateNegativeSamplingInteractionsDataLoader

df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                        'item_id': [0, 1, 2, 3, 4, 5]})
interactions_loader = ApproximateNegativeSamplingInteractionsDataLoader(
    users=df['user_id'], items=df['item_id'], num_negative_samples=2
)

for batch in interactions_loader:
    print(batch)

# output structure: [[user IDs, positive item IDs], "negative" items IDs]
# users and positive items IDs is now a tensor of shape ``batch_size`` and
# negative items IDs is now a tensor of shape ``batch_size x num_negative_samples``
# notice negative item IDs will *not* always be true negatives now, e.g.
[[tensor([0, 0, 0, 1, 1, 2], dtype=torch.int32),
  tensor([0, 1, 2, 3, 4, 5], dtype=torch.int32)],
 tensor([[4, 5],
         [1, 2],
         [4, 2],
         [3, 5],
         [4, 0],
         [4, 3]])]

interactions = interactions_loader.interactions
# use this for cross validation, evaluation, etc.
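The claim that false negatives become rare as the catalog grows can be checked with a line of arithmetic: if item IDs are sampled uniformly, the chance that one sampled "negative" is really a positive for a user is that user's number of positives divided by the number of items. The sketch below assumes uniform sampling, and the second set of numbers (100 positives out of 10,000 items) is purely illustrative rather than taken from any benchmark.

```python
def false_negative_rate(num_positives, num_items):
    """Chance that one uniformly sampled item ID is actually a positive."""
    return num_positives / num_items


# The toy data above: user 0 has 3 positives out of 6 items.
toy = false_negative_rate(num_positives=3, num_items=6)

# A hypothetical larger catalog: 100 positives out of 10,000 items.
large = false_negative_rate(num_positives=100, num_items=10_000)

print(f"{toy:.2%}, {large:.2%}")  # 50.00%, 1.00%
```

This is why the toy example shows so many false negatives while a realistic catalog would show very few.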
What if my data cannot fit in memory?
For datasets that are too large to fit in memory, Collie includes the HDF5InteractionsDataLoader (which uses an HDF5Interactions dataset at its base, sharing many of the same features and methods as an Interactions object). An HDF5InteractionsDataLoader applies the same principles behind the ApproximateNegativeSamplingInteractionsDataLoader, but for data stored on disk in HDF5 format. The main drawback to this approach is that when shuffle=True, only the order of batches and the data within each batch are shuffled (as opposed to the true shuffle in ApproximateNegativeSamplingInteractionsDataLoader). For sufficiently large data, the effect on model performance should be negligible.
import pandas as pd
from collie.interactions import HDF5InteractionsDataLoader
from collie.utils import pandas_df_to_hdf5

# we'll write out a sample DataFrame to HDF5 format for this example
df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                        'item_id': [0, 1, 2, 3, 4, 5]})
pandas_df_to_hdf5(df=df, out_path='sample_hdf5.h5')

interactions_loader = HDF5InteractionsDataLoader(
    hdf5_path='sample_hdf5.h5',
    user_col='user_id',
    item_col='item_id',
    num_negative_samples=2,
)

for batch in interactions_loader:
    print(batch)

# output structure: [[user IDs, positive item IDs], "negative" items IDs]
# users and positive items IDs is now a tensor of shape ``batch_size`` and
# negative items IDs is now a tensor of shape ``batch_size x num_negative_samples``
# notice negative item IDs will *not* always be true negatives now, e.g.
[[tensor([0, 0, 0, 1, 1, 2], dtype=torch.int32),
  tensor([0, 1, 2, 3, 4, 5], dtype=torch.int32)],
 tensor([[5, 4],
         [4, 5],
         [5, 2],
         [4, 3],
         [4, 2],
         [1, 3]])]
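The difference between batch-level shuffling and a true shuffle can be sketched in a few lines of pure Python. This is a simplified illustration, not the library's sampler: `batch_slices` and `read_batch` are hypothetical helpers, and plain row indices stand in for rows read from an HDF5 file.

```python
import random


def batch_slices(num_rows, batch_size, shuffle, rng):
    """Return (start_idx, size) pairs, optionally in shuffled batch order."""
    slices = [(start, min(batch_size, num_rows - start))
              for start in range(0, num_rows, batch_size)]
    if shuffle:
        rng.shuffle(slices)
    return slices


def read_batch(start, size, shuffle, rng):
    """Read one contiguous block of row indices, then shuffle inside it."""
    rows = list(range(start, start + size))
    if shuffle:
        rng.shuffle(rows)
    return rows


rng = random.Random(0)
epoch = [read_batch(start, size, True, rng)
         for start, size in batch_slices(num_rows=10, batch_size=4,
                                         shuffle=True, rng=rng)]

# Every batch still covers one contiguous range of rows, so rows 0 and 9
# can never share a batch the way they could under a global shuffle.
for rows in epoch:
    assert sorted(rows) == list(range(min(rows), max(rows) + 1))
```

Keeping each batch contiguous is what allows a single sequential read from disk per batch, which is the trade-off the paragraph above describes.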
The table below shows the time differences to train a MatrixFactorizationModel for a single epoch on MovieLens 10M data using default parameters on the GPU on a p3.2xlarge EC2 instance [1].
DataLoader Type | Time to Train a Single Epoch |
---|---|
InteractionsDataLoader | 1min 25s |
ApproximateNegativeSamplingInteractionsDataLoader | 1min 8s |
HDF5InteractionsDataLoader | 1min 10s |
What if my data has explicit ratings in it?
Thus far, we’ve only discussed the scenario in which you have data without an explicit indicator of how much a user liked an item. When you do have that data (e.g. star ratings for product reviews, the number of times a user has interacted with an item, etc.), you have explicit data. As of version 0.6.0 of Collie, this is fully supported within the library, with the only differences between an explicit and implicit pipeline being 1) the dataset definition (detailed below) and 2) evaluation (detailed in Evaluation Metrics).
Note the similarities between the explicit example below and the examples shown thus far:
import pandas as pd
from collie.interactions import ExplicitInteractions

explicit_df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                                 'item_id': [0, 1, 2, 3, 4, 5],
                                 'ratings': [1, 2, 3, 4, 5, 3.5]})
explicit_interactions = ExplicitInteractions(users=explicit_df['user_id'],
                                             items=explicit_df['item_id'],
                                             ratings=explicit_df['ratings'])

for _ in range(3):
    print(explicit_interactions[0])

print('\n-----\n')

for idx in range(len(explicit_interactions)):
    print(explicit_interactions[idx])

# output structure: (user IDs, positive item IDs, ratings)
# notice that unlike implicit interactions, there is no negative sampling going
# on under the hood, meaning this printout will always be deterministic
(0, 0, 1.0)
(0, 0, 1.0)
(0, 0, 1.0)

-----

(0, 0, 1.0)
(0, 1, 2.0)
(0, 2, 3.0)
(1, 3, 4.0)
(1, 4, 5.0)
(2, 5, 3.5)
Once the ExplicitInteractions dataset is defined, you can use the built-in InteractionsDataLoader to batch and iterate through the data!
import pandas as pd
from collie.interactions import ExplicitInteractions, InteractionsDataLoader

# the same setup code from the code snippet above
explicit_df = pd.DataFrame(data={'user_id': [0, 0, 0, 1, 1, 2],
                                 'item_id': [0, 1, 2, 3, 4, 5],
                                 'ratings': [1, 2, 3, 4, 5, 3.5]})
explicit_interactions = ExplicitInteractions(users=explicit_df['user_id'],
                                             items=explicit_df['item_id'],
                                             ratings=explicit_df['ratings'])

explicit_interactions_loader = InteractionsDataLoader(interactions=explicit_interactions)

for batch in explicit_interactions_loader:
    print(batch)

# output structure: [user IDs, positive item IDs, ratings]
[tensor([0, 0, 0, 1, 1, 2], dtype=torch.int32),
 tensor([0, 1, 2, 3, 4, 5], dtype=torch.int32),
 tensor([1.0000, 2.0000, 3.0000, 4.0000, 5.0000, 3.5000], dtype=torch.float64)]
All Collie models support both implicit and explicit data, and can be instantiated by passing in either the Interactions/ExplicitInteractions data or the dataset wrapped in a DataLoader. See Models for more details on this.
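The printed outputs above show the structural difference between the two data types: implicit elements are nested, ((user, item), negatives), while explicit elements are flat, (user, item, rating). A hedged sketch of how code could tell them apart from structure alone follows; `batch_type` is a hypothetical helper written for this illustration, operating on plain Python tuples and lists rather than tensors, and is not a function Collie exposes.

```python
def batch_type(batch):
    """Classify an element or batch as implicit or explicit by its nesting."""
    if len(batch) == 2 and isinstance(batch[0], (tuple, list)) and len(batch[0]) == 2:
        # ((user IDs, item IDs), negative item IDs)
        return 'implicit'
    if len(batch) == 3:
        # (user IDs, item IDs, ratings)
        return 'explicit'
    raise ValueError('Unrecognized batch structure.')


print(batch_type(((0, 0), [5, 3])))  # implicit
print(batch_type((0, 0, 1.0)))       # explicit
```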
Datasets
Implicit Interactions Dataset
- class collie.interactions.Interactions(mat: Optional[Union[coo_matrix, array]] = None, users: Optional[Iterable[int]] = None, items: Optional[Iterable[int]] = None, ratings: Optional[Iterable[int]] = None, num_negative_samples: int = 10, allow_missing_ids: bool = False, remove_duplicate_user_item_pairs: bool = True, num_users: int = 'infer', num_items: int = 'infer', check_num_negative_samples_is_valid: bool = True, max_number_of_samples_to_consider: int = 200, seed: Optional[int] = None)
Bases: BaseInteractions
PyTorch Dataset for implicit user-item interactions data.
If mat is provided, the Interactions instance will act as a wrapper for a sparse matrix in COOrdinate format, typically looking like:
- Users comprising the rows
- Items comprising the columns
- Ratings given by that user for that item comprising the elements of the matrix
Interactions can be instantiated instead by passing in single arrays with corresponding user_ids, item_ids, and ratings (by default, set to 1 for implicit recommenders) values with the same functionality as a matrix. Note that with this approach, the number of users and items will be one more than the maximum values in those two columns, respectively (IDs start at 0), and it is expected that all integers between 0 and the maximum ID appear somewhere in the data.
By default, exact negative sampling will be used during each __getitem__ call. To use approximate negative sampling, set max_number_of_samples_to_consider = 0. This will avoid building a positive item lookup dictionary during initialization.
Unlike in ExplicitInteractions, we rely on negative sampling for implicit data. Each __getitem__ call will thus return a nested tuple containing user IDs, item IDs, and sampled negative item IDs. This nested vs. non-nested structure is key for the model to determine whether it should be implicit or explicit. Use the table below for reference:
__getitem__ Format | Expected Meaning | Model Type |
---|---|---|
((X, Y), Z) | ((user IDs, item IDs), negative item IDs) | Implicit |
(X, Y, Z) | (user IDs, item IDs, ratings) | Explicit |
- Parameters
mat (scipy.sparse.coo_matrix or numpy.array, 2-dimensional) – Interactions matrix, which, if provided, will be used instead of users, items, and ratings arguments
users (Iterable[int], 1-d) – Array of user IDs, starting at 0
items (Iterable[int], 1-d) – Array of corresponding item IDs to users, starting at 0
ratings (Iterable[int], 1-d) – Array of corresponding ratings to both users and items. If None, will default to each user in users interacting with an item with a rating value of 1
num_negative_samples (int) – Number of negative samples to return with each __getitem__ call
allow_missing_ids (bool) – If False, will check that both users and items contain each integer from 0 to the maximum value in the array. This check only applies when initializing an Interactions instance using 1-dimensional arrays users and items
remove_duplicate_user_item_pairs (bool) – Will check for and remove any duplicate user, item ID pairs from the Interactions matrix during initialization. Note that this will create a second sparse matrix held in memory to efficiently check, which could cause memory concerns for larger data. If you are sure that there are no duplicated user, item ID pairs, set to False
num_users (int) – Number of users in the dataset. If num_users == 'infer', this will be set to mat.shape[0] or max(users) + 1, depending on the input
num_items (int) – Number of items in the dataset. If num_items == 'infer', this will be set to mat.shape[1] or max(items) + 1, depending on the input
check_num_negative_samples_is_valid (bool) – Check that num_negative_samples is less than the maximum number of items a user has interacted with. If it is not, then for all users who have fewer than num_negative_samples items not interacted with, a random sample including positive items will be returned as negative
max_number_of_samples_to_consider (int) – Number of samples to try for a given user before returning an approximate negative sample. This should be greater than num_negative_samples. If set to 0, approximate negative sampling will be used by default in __getitem__ and a positive item lookup dictionary will NOT be built
seed (int) – Seed for random sampling
- head(n: int = 5) → array
Return the first n rows of the dense matrix as a np.array, 2-d.
- tail(n: int = 5) → array
Return the last n rows of the dense matrix as a np.array, 2-d.
- toarray() → array
Transforms BaseInteractions instance sparse matrix to np.array, 2-d.
- todense() → matrix
Transforms BaseInteractions instance sparse matrix to np.matrix, 2-d.
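The consistency checks described in the Interactions parameters (`allow_missing_ids`, `remove_duplicate_user_item_pairs`, and the `'infer'` behavior of `num_users` and `num_items`) can be approximated in pure Python as below. This is a sketch for intuition only: the helper names are hypothetical, and the library itself is documented as using a sparse matrix for the duplicate check rather than a Python set.

```python
def find_missing_ids(ids):
    """Return every integer in [0, max(ids)] that never appears in ``ids``."""
    present = set(ids)
    return sorted(set(range(max(present) + 1)) - present)


def infer_count(ids):
    """With IDs starting at 0, the inferred count is the maximum ID plus one."""
    return max(ids) + 1


def drop_duplicate_pairs(users, items):
    """Keep only the first occurrence of each (user, item) pair."""
    seen = set()
    kept_users, kept_items = [], []
    for pair in zip(users, items):
        if pair not in seen:
            seen.add(pair)
            kept_users.append(pair[0])
            kept_items.append(pair[1])
    return kept_users, kept_items


print(find_missing_ids([0, 0, 0, 1, 1, 2]))        # []
print(find_missing_ids([0, 2, 5]))                 # [1, 3, 4]
print(infer_count([0, 0, 0, 1, 1, 2]))             # 3
print(drop_duplicate_pairs([0, 0, 1], [1, 1, 2]))  # ([0, 1], [1, 2])
```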
Explicit Interactions Dataset
- class collie.interactions.ExplicitInteractions(mat: Optional[Union[coo_matrix, array]] = None, users: Optional[Iterable[int]] = None, items: Optional[Iterable[int]] = None, ratings: Optional[Iterable[int]] = None, allow_missing_ids: bool = False, remove_duplicate_user_item_pairs: bool = True, num_users: int = 'infer', num_items: int = 'infer')
Bases: BaseInteractions
PyTorch Dataset for explicit user-item interactions data.
If mat is provided, the ExplicitInteractions instance will act as a wrapper for a sparse matrix in COOrdinate format, typically looking like:
- Users comprising the rows
- Items comprising the columns
- Ratings given by that user for that item comprising the elements of the matrix
ExplicitInteractions can be instantiated instead by passing in single arrays with corresponding user_ids, item_ids, and ratings values with the same functionality as a matrix. Note that with this approach, the number of users and items will be one more than the maximum values in those two columns, respectively (IDs start at 0), and it is expected that all integers between 0 and the maximum ID appear somewhere in the user or item ID data.
Unlike in Interactions, there is no need for negative sampling for explicit data. Each __getitem__ call will thus return a single, non-nested tuple containing user IDs, item IDs, and ratings. This nested vs. non-nested structure is key for the model to determine whether it should be implicit or explicit. Use the table below for reference:
__getitem__ Format | Expected Meaning | Model Type |
---|---|---|
((X, Y), Z) | ((user IDs, item IDs), negative item IDs) | Implicit |
(X, Y, Z) | (user IDs, item IDs, ratings) | Explicit |
- Parameters
mat (scipy.sparse.coo_matrix or numpy.array, 2-dimensional) – Interactions matrix, which, if provided, will be used instead of users, items, and ratings arguments
users (Iterable[int], 1-d) – Array of user IDs, starting at 0
items (Iterable[int], 1-d) – Array of corresponding item IDs to users, starting at 0
ratings (Iterable[int], 1-d) – Array of corresponding ratings to both users and items. If None, will default to each user in users interacting with an item with a rating value of 1
allow_missing_ids (bool) – If False, will check that both users and items contain each integer from 0 to the maximum value in the array. This check only applies when initializing an ExplicitInteractions instance using 1-dimensional arrays users and items
remove_duplicate_user_item_pairs (bool) – Will check for and remove any duplicate user, item ID pairs from the ExplicitInteractions matrix during initialization. Note that this will create a second sparse matrix held in memory to efficiently check, which could cause memory concerns for larger data. If you are sure that there are no duplicated user, item ID pairs, set to False
num_users (int) – Number of users in the dataset. If num_users == 'infer', this will be set to mat.shape[0] or max(users) + 1, depending on the input
num_items (int) – Number of items in the dataset. If num_items == 'infer', this will be set to mat.shape[1] or max(items) + 1, depending on the input
- head(n: int = 5) → array
Return the first n rows of the dense matrix as a np.array, 2-d.
- property num_negative_samples: int
Does not exist for explicit data.
- tail(n: int = 5) → array
Return the last n rows of the dense matrix as a np.array, 2-d.
- toarray() → array
Transforms BaseInteractions instance sparse matrix to np.array, 2-d.
- todense() → matrix
Transforms BaseInteractions instance sparse matrix to np.matrix, 2-d.
HDF5 Interactions Dataset
- class collie.interactions.HDF5Interactions(hdf5_path: str, user_col: str = 'users', item_col: str = 'items', num_negative_samples: int = 10, num_users: int = 'infer', num_items: int = 'infer', seed: Optional[int] = None, shuffle: bool = False)
Bases: Dataset
Create an Interactions-like object for data in the HDF5 format that might be too large to fit in memory.
Many of the same features of Interactions are implemented here, with the exception that approximate negative sampling will always be used.
- Parameters
hdf5_path (str) –
user_col (str) – Column in HDF5 file with user IDs. IDs must begin at 0
item_col (str) – Column in HDF5 file with item IDs. IDs must begin at 0
num_negative_samples (int) – Number of negative samples to return with each __getitem__ call
num_users (int) – Number of users in the dataset. If num_users == 'infer' and there is not a meta key in hdf5_path’s HDF5 dataset, this will be set to the maximum value in user_col + 1, found by iterating through the entire dataset
num_items (int) – Number of items in the dataset. If num_items == 'infer' and there is not a meta key in hdf5_path’s HDF5 dataset, this will be set to the maximum value in item_col + 1, found by iterating through the entire dataset
seed (int) – Seed for random sampling and shuffling if shuffle is True
shuffle (bool) – Shuffle data in a batch. For example, if one calls __getitem__ with start_idx_and_batch_size = (0, 4) and shuffle is False, this will always return the data at indices 0, 1, 2, 3 in order. However, the same call with shuffle = True will return a random shuffle of 0, 1, 2, 3 each call. This is recommended for use in a HDF5InteractionsDataLoader for training data in lieu of true data shuffling
DataLoaders
Interactions DataLoader
- class collie.interactions.InteractionsDataLoader(interactions: Optional[BaseInteractions] = None, mat: Optional[Union[coo_matrix, array]] = None, users: Optional[Iterable[int]] = None, items: Optional[Iterable[int]] = None, ratings: Optional[Iterable[int]] = None, batch_size: int = 1024, shuffle: bool = False, **kwargs)
Bases: BaseInteractionsDataLoader
A light wrapper around a torch.utils.data.DataLoader for Interactions-type datasets.
For implicit data, batches will be created one-point-at-a-time using exact negative sampling (unless configured not to in interactions), which is optimal when datasets are smaller (fewer than roughly 1M interactions) and model training speed is not a concern. This is the default DataLoader for Interactions datasets.
For explicit data, negative sampling is not used, but batches will still be created one-point-at-a-time.
- Parameters
interactions (BaseInteractions) – If not provided, an Interactions object will be created with mat or all of users, items, and ratings
mat (scipy.sparse.coo_matrix or numpy.array, 2-dimensional) – If interactions is None, will be used instead of users, items, and ratings arguments to create an Interactions object
users (Iterable[int], 1-d) – If interactions is None and mat is None, array of user IDs, starting at 0
items (Iterable[int], 1-d) – If interactions is None and mat is None, array of corresponding item IDs to users, starting at 0
ratings (Iterable[int], 1-d) – If interactions is None and mat is None, array of corresponding ratings to both users and items. If None, will default to each user in users interacting with an item with a rating value of 1
batch_size (int) – Number of samples per batch to load
shuffle (bool) – Whether to shuffle the order of data returned or not. This is especially useful for training data to ensure the model does not overfit to a specific order of data
**kwargs (keyword arguments) – Relevant keyword arguments will be passed into Interactions object creation, if interactions is None and the keyword argument matches one of Interactions.__init__.__code__.co_varnames. All other keyword arguments will be passed into torch.utils.data.DataLoader: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
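The `**kwargs` routing described above, where a keyword goes to the dataset if its name appears in `Interactions.__init__.__code__.co_varnames` and to the DataLoader otherwise, can be sketched with a stand-in class. `ToyInteractions` and `split_kwargs` are hypothetical names used only for this illustration; the real routing code may differ in detail.

```python
class ToyInteractions:
    """Hypothetical stand-in for a dataset class with keyword options."""

    def __init__(self, users, items, num_negative_samples=10, seed=None):
        self.users = users
        self.items = items
        self.num_negative_samples = num_negative_samples
        self.seed = seed


def split_kwargs(dataset_cls, kwargs):
    """Route each keyword to the dataset if ``__init__`` names it, else to the loader."""
    # ``co_varnames`` lists the arguments of ``__init__`` (including ``self``)
    # followed by any local variables defined in its body.
    init_names = set(dataset_cls.__init__.__code__.co_varnames)
    dataset_kwargs = {k: v for k, v in kwargs.items() if k in init_names}
    loader_kwargs = {k: v for k, v in kwargs.items() if k not in init_names}
    return dataset_kwargs, loader_kwargs


dataset_kwargs, loader_kwargs = split_kwargs(
    ToyInteractions, {'num_negative_samples': 2, 'num_workers': 4}
)
print(dataset_kwargs)  # {'num_negative_samples': 2}
print(loader_kwargs)   # {'num_workers': 4}
```

One consequence of matching on names this way is that a misspelled dataset option is silently forwarded to the DataLoader, where it would typically raise an unexpected-keyword error.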
- interactions
- Type: Interactions (default) or ExplicitInteractions
Original torch.utils.data.DataLoader docstring as follows:
Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset.
The torch.utils.data.DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.
See the torch.utils.data documentation page for more details.
- Args
dataset (Dataset): dataset from which to load the data.
batch_size (int, optional): how many samples per batch to load (default: 1).
shuffle (bool, optional): set to True to have the data reshuffled at every epoch (default: False).
sampler (Sampler or Iterable, optional): defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.
batch_sampler (Sampler or Iterable, optional): like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
num_workers (int, optional): how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
collate_fn (Callable, optional): merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
pin_memory (bool, optional): If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below.
drop_last (bool, optional): set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
timeout (numeric, optional): if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)
worker_init_fn (Callable, optional): If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
generator (torch.Generator, optional): If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers. (default: None)
prefetch_factor (int, optional, keyword-only arg): Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default: 2)
persistent_workers (bool, optional): If True, the data loader will not shutdown the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default: False)
pin_memory_device (str, optional): the data loader will copy Tensors into device pinned memory before returning them if pin_memory is set to true.
Warning
If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function. See multiprocessing-best-practices on more details related to multiprocessing in PyTorch.
Warning
len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on len(dataset) / batch_size, with proper rounding depending on drop_last, regardless of multi-process loading configurations. This represents the best guess PyTorch can make because PyTorch trusts user dataset code in correctly handling multi-process loading to avoid duplicate data.
However, if sharding results in multiple workers having incomplete last batches, this estimate can still be inaccurate, because (1) an otherwise complete batch can be broken into multiple ones and (2) more than one batch worth of samples can be dropped when drop_last is set. Unfortunately, PyTorch can not detect such cases in general.
See Dataset Types for more details on these two types of datasets and how IterableDataset interacts with Multi-process data loading.
Warning
See reproducibility, and dataloader-workers-random-seed, and data-loading-randomness notes for random seed related questions.
- property mat: coo_matrix
Sparse COO matrix of interactions.
- property num_interactions: int
Number of interactions in interactions.
- property num_items: int
Number of items in interactions.
- property num_negative_samples: int
Number of negative samples in interactions.
- property num_users: int
Number of users in interactions.
Approximate Negative Sampling Interactions DataLoader
- class collie.interactions.ApproximateNegativeSamplingInteractionsDataLoader(interactions: Optional[Interactions] = None, mat: Optional[Union[coo_matrix, array]] = None, users: Optional[Iterable[int]] = None, items: Optional[Iterable[int]] = None, ratings: Optional[Iterable[int]] = None, batch_size: int = 1024, shuffle: bool = False, **kwargs)
Bases: BaseInteractionsDataLoader
A computationally more efficient DataLoader for Interactions data using approximate negative sampling for negative items.
This DataLoader groups __getitem__ calls together into a single operation, which dramatically speeds up a traditional DataLoader’s process of calling __getitem__ one index at a time, then concatenating them together before returning. In an effort to batch operations together, all negative samples returned will be approximate, meaning this does not check if a user has previously interacted with the item. With a sufficient number of interactions (1M+), we have found a speed increase of 2x at the cost of a 1% reduction in MAP @ 10 performance compared to InteractionsDataLoader.
For greater efficiency, we disable automated batching by setting the DataLoader’s batch_size attribute to None. Thus, to access the “true” batch size that the sampler uses, access ApproximateNegativeSamplingInteractionsDataLoader.approximate_negative_sampler.batch_size.
- Parameters
interactions (Interactions) – If not provided, an Interactions object will be created with mat or all of users, items, and ratings with max_number_of_samples_to_consider=0
mat (scipy.sparse.coo_matrix or numpy.array, 2-dimensional) – If interactions is None, will be used instead of users, items, and ratings arguments to create an Interactions object
users (Iterable[int], 1-d) – If interactions is None and mat is None, array of user IDs, starting at 0
items (Iterable[int], 1-d) – If interactions is None and mat is None, array of corresponding item IDs to users, starting at 0
ratings (Iterable[int], 1-d) – If interactions is None and mat is None, array of corresponding ratings to both users and items. If None, will default to each user in users interacting with an item with a rating value of 1
batch_size (int) – Number of samples per batch to load
shuffle (bool) – Whether to shuffle the order of data returned or not. This is especially useful for training data to ensure the model does not overfit to a specific order of data
**kwargs (keyword arguments) – Relevant keyword arguments will be passed into Interactions object creation, if interactions is None and the keyword argument matches one of Interactions.__init__.__code__.co_varnames. All other keyword arguments will be passed into torch.utils.data.DataLoader: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
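The batching idea described for this DataLoader, building a whole batch from one index range instead of collating many single-element lookups, can be illustrated without any dependencies. The function below is a hypothetical sketch using Python lists; the library works with tensors and its own sampler, so treat the names and signature here as assumptions.

```python
import random


def approximate_batch(users, items, start, batch_size, num_items,
                      num_negative_samples, rng):
    """Build a whole batch with two slices and one round of unchecked sampling."""
    batch_users = users[start:start + batch_size]
    batch_items = items[start:start + batch_size]
    # Negatives are drawn without consulting any per-user positive lookup,
    # so some of them may be items the user has actually interacted with.
    negatives = [[rng.randrange(num_items) for _ in range(num_negative_samples)]
                 for _ in batch_users]
    return (batch_users, batch_items), negatives


users = [0, 0, 0, 1, 1, 2]
items = [0, 1, 2, 3, 4, 5]
(batch_users, batch_items), negatives = approximate_batch(
    users, items, start=0, batch_size=4, num_items=6,
    num_negative_samples=2, rng=random.Random(0),
)
print(batch_users, batch_items)  # [0, 0, 0, 1] [0, 1, 2, 3]
```

Because the batch is already assembled when it leaves the dataset, the surrounding DataLoader has nothing left to collate, which is consistent with automatic batching being disabled as described above.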
- interactions
- Type: Interactions
- Original ``torch.utils.data.DataLoader`` docstring as follows
- ########
- Data loader. Combines a dataset and a sampler, and provides an iterable over
- the given dataset.
- The :class:`~torch.utils.data.DataLoader` supports both map-style and
- iterable-style datasets with single- or multi-process loading, customizing
- loading order and optional automatic batching (collation) and memory pinning.
- See :py:mod:`torch.utils.data` documentation page for more details.
- Args¶
dataset (Dataset): dataset from which to load the data. batch_size (int, optional): how many samples per batch to load
(default:
1
).- shuffle (bool, optional): set to
True
to have the data reshuffled at every epoch (default:
False
).- sampler (Sampler or Iterable, optional): defines the strategy to draw
samples from the dataset. Can be any
Iterable
with__len__
implemented. If specified,shuffle
must not be specified.- batch_sampler (Sampler or Iterable, optional): like
sampler
, but returns a batch of indices at a time. Mutually exclusive with
batch_size
,shuffle
,sampler
, anddrop_last
.- num_workers (int, optional): how many subprocesses to use for data
loading.
0
means that the data will be loaded in the main process. (default:0
)- collate_fn (Callable, optional): merges a list of samples to form a
mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
- pin_memory (bool, optional): If
True
, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your
collate_fn
returns a batch that is a custom type, see the example below.- drop_last (bool, optional): set to
True
to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If
False
and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default:False
)- timeout (numeric, optional): if positive, the timeout value for collecting a batch
from workers. Should always be non-negative. (default:
0
)- worker_init_fn (Callable, optional): If not
None
, this will be called on each worker subprocess with the worker id (an int in
[0, num_workers - 1]
) as input, after seeding and before data loading. (default:None
)- generator (torch.Generator, optional): If not
None
, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers. (default:
None
)- prefetch_factor (int, optional, keyword-only arg): Number of batches loaded
in advance by each worker.
2
means there will be a total of 2 * num_workers batches prefetched across all workers. (default:2
)- persistent_workers (bool, optional): If
True
, the data loader will not shutdown the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default:
False
)- pin_memory_device (str, optional): the data loader will copy Tensors
into device pinned memory before returning them if pin_memory is set to true.
- shuffle (bool, optional): set to
Warning
If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function. See multiprocessing-best-practices on more details related to multiprocessing in PyTorch.
Warning
len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on len(dataset) / batch_size, with proper rounding depending on drop_last, regardless of multi-process loading configurations. This represents the best guess PyTorch can make because PyTorch trusts user dataset code in correctly handling multi-process loading to avoid duplicate data.
However, if sharding results in multiple workers having incomplete last batches, this estimate can still be inaccurate, because (1) an otherwise complete batch can be broken into multiple ones and (2) more than one batch worth of samples can be dropped when drop_last is set. Unfortunately, PyTorch can not detect such cases in general.
See Dataset Types for more details on these two types of datasets and how IterableDataset interacts with Multi-process data loading.
Warning
See reproducibility, and dataloader-workers-random-seed, and data-loading-randomness notes for random seed related questions.
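The rounding rule and the two failure cases described in the len(dataloader) warning can be checked with plain arithmetic. The sketch below is illustrative only: num_batches and batches_across_workers are hypothetical helper names, not PyTorch or Collie functions, and the sharding (each worker batching an equal slice of the data on its own) is an assumption made for the example.

```python
import math


def num_batches(num_samples, batch_size, drop_last):
    """Batches produced from num_samples items, rounding according to drop_last."""
    if drop_last:
        return num_samples // batch_size  # the incomplete last batch is dropped
    return math.ceil(num_samples / batch_size)  # the incomplete last batch is kept


def batches_across_workers(shard_sizes, batch_size, drop_last):
    """Total batches when each worker batches only its own shard (assumed sharding)."""
    return sum(num_batches(size, batch_size, drop_last) for size in shard_sizes)


# (1) 10 samples, batch size 4, split over two workers, drop_last=False
print(num_batches(10, 4, drop_last=False))                 # estimate: 3
print(batches_across_workers([5, 5], 4, drop_last=False))  # produced: 4

# (2) 14 samples, batch size 4, split over two workers, drop_last=True
print(num_batches(14, 4, drop_last=True))                  # estimate: 3
print(batches_across_workers([7, 7], 4, drop_last=True))   # produced: 2
```

In the first case a batch that would have been complete is split across workers, so more batches are produced than estimated. In the second case each worker drops 3 samples, 6 in total, which is more than one batch worth.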
- property mat: coo_matrix
Sparse COO matrix of interactions.
- property num_interactions: int
Number of interactions in interactions.
- property num_items: int
Number of items in interactions.
- property num_negative_samples: int
Number of negative samples in interactions.
- property num_users: int
Number of users in interactions.
HDF5 Approximate Negative Sampling Interactions DataLoader¶
- class collie.interactions.HDF5InteractionsDataLoader(hdf5_interactions: Optional[HDF5Interactions] = None, hdf5_path: Optional[str] = None, batch_size: int = 1024, shuffle: bool = False, **kwargs)
Bases: BaseInteractionsDataLoader
A light wrapper around a torch.utils.data.DataLoader for HDF5 data, with behavior very similar to ApproximateNegativeSamplingInteractionsDataLoader.
If not provided, an HDF5Interactions dataset will be created as the data for the DataLoader. A custom sampler, HDF5Sampler, will also be instantiated for the DataLoader to use; it samples in batches, which makes for faster HDF5 data reads from disk.
While similar to a standard DataLoader, note that when shuffle is True, this will only shuffle the order of batches and the data within batches, rather than shuffling across the entire dataset, so that HDF5 data can still be read efficiently from disk.
For greater efficiency, we disable automated batching by setting the DataLoader’s batch_size attribute to None. Thus, to access the “true” batch size that the sampler uses, access HDF5InteractionsDataLoader.hdf5_sampler.batch_size.
- Parameters
hdf5_interactions (HDF5Interactions) – If provided, will override input argument for hdf5_path
hdf5_path (str) – If hdf5_interactions is None, the path to the HDF5 dataset
batch_size (int) – Number of samples per batch to load
shuffle (bool) – Whether to shuffle the order of batches returned or not. This is especially useful for training data to ensure the model does not overfit to a specific order of data. Note that this will not perform a true shuffle of the data, but shuffle the order of batches. While this is only an approximation of a true shuffle, it allows a greater speed-up during model training with a negligible effect on model performance
**kwargs (keyword arguments) – Relevant keyword arguments will be passed into HDF5Interactions object creation, if hdf5_interactions is None and the keyword argument matches one of HDF5Interactions.__init__.__code__.co_varnames. All other keyword arguments will be passed into torch.utils.data.DataLoader
: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
Original torch.utils.data.DataLoader docstring as follows:
Data loader. Combines a dataset and a sampler, and provides an iterable over the given dataset.
The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.
See torch.utils.data documentation page for more details.
Args:
- dataset (Dataset): dataset from which to load the data.
- batch_size (int, optional): how many samples per batch to load (default: 1).
- shuffle (bool, optional): set to True to have the data reshuffled at every epoch (default: False).
- sampler (Sampler or Iterable, optional): defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.
- batch_sampler (Sampler or Iterable, optional): like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
- num_workers (int, optional): how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
- collate_fn (Callable, optional): merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
- pin_memory (bool, optional): If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below.
- drop_last (bool, optional): set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
- timeout (numeric, optional): if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)
- worker_init_fn (Callable, optional): If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
- generator (torch.Generator, optional): If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers. (default: None)
- prefetch_factor (int, optional, keyword-only arg): Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default: 2)
- persistent_workers (bool, optional): If True, the data loader will not shutdown the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default: False)
- pin_memory_device (str, optional): the data loader will copy Tensors into device pinned memory before returning them if pin_memory is set to true.
Warning
If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function. See multiprocessing-best-practices on more details related to multiprocessing in PyTorch.
Warning
len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, it instead returns an estimate based on len(dataset) / batch_size, with proper rounding depending on drop_last, regardless of multi-process loading configurations. This represents the best guess PyTorch can make because PyTorch trusts user dataset code in correctly handling multi-process loading to avoid duplicate data.
However, if sharding results in multiple workers having incomplete last batches, this estimate can still be inaccurate, because (1) an otherwise complete batch can be broken into multiple ones and (2) more than one batch worth of samples can be dropped when drop_last is set. Unfortunately, PyTorch can not detect such cases in general.
See Dataset Types for more details on these two types of datasets and how IterableDataset interacts with Multi-process data loading.
Warning
See reproducibility, and dataloader-workers-random-seed, and data-loading-randomness notes for random seed related questions.
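The **kwargs routing described for HDF5InteractionsDataLoader (keywords matching HDF5Interactions.__init__.__code__.co_varnames go to the dataset, everything else goes to the DataLoader) can be sketched roughly as follows. This is not Collie's source code: ExampleDataset is a hypothetical stand-in for the dataset class, its arguments are assumed for illustration, and split_kwargs is a made-up helper name.

```python
class ExampleDataset:
    """Hypothetical stand-in for a dataset class such as HDF5Interactions."""

    def __init__(self, path=None, num_negative_samples=10):
        self.path = path
        self.num_negative_samples = num_negative_samples


def split_kwargs(dataset_cls, kwargs):
    """Split kwargs into those named in dataset_cls.__init__ and the remainder."""
    # co_varnames holds the argument names of __init__ (including self), followed
    # by any local variable names defined inside the function body.
    init_names = dataset_cls.__init__.__code__.co_varnames
    dataset_kwargs = {k: v for k, v in kwargs.items() if k in init_names}
    loader_kwargs = {k: v for k, v in kwargs.items() if k not in init_names}
    return dataset_kwargs, loader_kwargs


dataset_kwargs, loader_kwargs = split_kwargs(
    ExampleDataset,
    {'num_negative_samples': 5, 'num_workers': 2, 'pin_memory': True},
)
print(dataset_kwargs)  # {'num_negative_samples': 5}
print(loader_kwargs)   # {'num_workers': 2, 'pin_memory': True}
```

One consequence of matching against co_varnames is that a keyword sharing its name with a local variable inside __init__ would also be routed to the dataset, so keyword names are worth double-checking when a DataLoader option appears to be ignored.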
- property mat: None
The mat attribute cannot be accessed in HDF5InteractionsDataLoader.
- property num_interactions: int
Number of interactions in interactions.
- property num_items: int
Number of items in interactions.
- property num_negative_samples: int
Number of negative samples in interactions.
- property num_users: int
Number of users in interactions.
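The approximate shuffle described for HDF5InteractionsDataLoader, where the order of batches and the rows within each batch are shuffled but rows never move between batches, can be pictured with a small standalone sketch. This is an illustration of the idea under stated assumptions, not the HDF5Sampler implementation: approximate_shuffle_batches is a hypothetical function, and it assumes batches are contiguous ranges of row indices.

```python
import random


def approximate_shuffle_batches(num_rows, batch_size, seed=None):
    """Yield batches of contiguous row indices in random order, shuffled within each batch."""
    rng = random.Random(seed)
    starts = list(range(0, num_rows, batch_size))
    rng.shuffle(starts)  # shuffle the order in which batches are visited
    for start in starts:
        batch = list(range(start, min(start + batch_size, num_rows)))
        rng.shuffle(batch)  # shuffle the rows inside this batch only
        yield batch


batches = list(approximate_shuffle_batches(num_rows=10, batch_size=4, seed=0))
for batch in batches:
    # Each batch still covers one contiguous block of rows on disk.
    print(min(batch), max(batch), batch)
```

Because every batch maps to a single contiguous block, each one can be served by one sequential read, which is the efficiency argument the class description makes. The trade-off is that two rows from different blocks never appear in the same batch.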
Footnotes
- 1
Welcome to a detailed footnote about this experiment!
The MovieLens 10M data was preprocessed using the Collie utility functions in Utility Functions, keeping all ratings above a 4 and removing users with fewer than 3 interactions. This left us with 5,005,398 total interactions.
For a much faster training time, we recommend setting sparse=True (see the point below this) in the model definition and using a larger batch size with pin_memory=True in the DataLoader.
Since we used default parameters, the embeddings of the MatrixFactorizationModel were not sparse. Had we used sparse embeddings and a Sparse Adam optimizer, the table would show:
DataLoader Type                                      Time to Train a Single Epoch
InteractionsDataLoader                               1min 21s
ApproximateNegativeSamplingInteractionsDataLoader    1min 4s
HDF5InteractionsDataLoader                           1min 7s
The differences between these times are more dramatic with larger datasets (1M+ items). While the sparse options are certainly faster, making them the default would limit the optimizer options and the general flexibility of customizing an architecture, since not all PyTorch operations support sparse layers. For that reason, we made the default parameters non-sparse, which works best for small datasets.
We have also noticed drastic changes in training time depending on the version of PyTorch used. While we used torch@1.8.0 here, we have noticed the fastest training times using torch@1.6.0. If you understand why this is, make a PR updating these docs with that information!