Utility Functions

Create Ratings Matrix

collie.utils.create_ratings_matrix(df: pandas.core.frame.DataFrame, user_col: str = 'user_id', item_col: str = 'item_id', ratings_col: str = 'rating', sparse: bool = False)

Helper function to convert a Pandas DataFrame to a 2-dimensional matrix.

Parameters
  • df (pd.DataFrame) – Dataframe with columns for user IDs, item IDs, and ratings

  • user_col (str) – Column name for the user IDs

  • item_col (str) – Column name for the item IDs

  • ratings_col (str) – Column name for the ratings column

  • sparse (bool) – Whether to return data as a sparse coo_matrix (True) or np.array (False)

Returns

ratings_matrix – Data with users as rows, items as columns, and ratings as values

Return type

np.array or scipy.sparse.coo_matrix, 2-d
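
The conversion described above can be sketched with pandas and SciPy. This is an illustrative re-implementation, not Collie's source: the name ratings_matrix_sketch is hypothetical, and it assumes that user and item IDs are already 0-based integers, so they can be used directly as row and column indices.

```python
import pandas as pd
from scipy.sparse import coo_matrix


def ratings_matrix_sketch(df, user_col='user_id', item_col='item_id',
                          ratings_col='rating', sparse=False):
    """Hypothetical stand-in: users as rows, items as columns, ratings as values."""
    users = df[user_col].to_numpy()
    items = df[item_col].to_numpy()
    ratings = df[ratings_col].to_numpy()

    # Assumption: IDs are 0-based integers, so the largest ID fixes each dimension.
    shape = (int(users.max()) + 1, int(items.max()) + 1)
    matrix = coo_matrix((ratings, (users, items)), shape=shape)

    # Return the sparse matrix as-is, or densify it into a NumPy array.
    return matrix if sparse else matrix.toarray()


df = pd.DataFrame({
    'user_id': [0, 0, 1, 2],
    'item_id': [0, 2, 1, 2],
    'rating': [5, 3, 4, 1],
})
dense = ratings_matrix_sketch(df)
sparse_matrix = ratings_matrix_sketch(df, sparse=True)
```

Unobserved user and item pairs appear as 0 in the dense array, which is why the sparse form is usually the better choice for large datasets.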

DataFrame to Interactions

collie.utils.df_to_interactions(df: pandas.core.frame.DataFrame, user_col: str = 'user_id', item_col: str = 'item_id', ratings_col: Optional[str] = 'rating', **kwargs)

Helper function to convert a DataFrame to an Interactions object.

Parameters
  • df (pd.DataFrame) – Dataframe with columns for user IDs, item IDs, and (optionally) ratings

  • user_col (str) – Column name for the user IDs

  • item_col (str) – Column name for the item IDs

  • ratings_col (str) – Column name for the ratings column. If None, will default to ratings of all 1s

  • **kwargs – Keyword arguments to pass to Interactions

Returns

interactions

Return type

collie.interactions.Interactions

Convert to Implicit Ratings

collie.utils.convert_to_implicit(explicit_df: pandas.core.frame.DataFrame, min_rating_to_keep: Optional[float] = 4, user_col: str = 'user_id', item_col: str = 'item_id', ratings_col: str = 'rating')

Convert explicit interactions data to implicit data.

Duplicate user ID and item ID pairs are dropped, as are all interactions with a rating below min_rating_to_keep. All remaining interactions are given a rating of 1.

Parameters
  • explicit_df (pd.DataFrame) – Dataframe with explicit ratings in the rating column

  • min_rating_to_keep (float) – Minimum rating to be considered a valid interaction

  • user_col (str) – Column name for the user IDs

  • item_col (str) – Column name for the item IDs

  • ratings_col (str) – Column name for the ratings column

Returns

implicit_df – Dataframe in which every rating >= min_rating_to_keep is set to 1 and all other rows are dropped, with a reset index. Note that the row order of implicit_df will not match that of explicit_df

Return type

pd.DataFrame
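
The behavior described above can be approximated in plain pandas. The sketch below is not Collie's implementation: to_implicit_sketch is a hypothetical name, and both the order of the two filtering steps and the choice to keep the first of each duplicated pair are assumptions made for illustration.

```python
import pandas as pd


def to_implicit_sketch(explicit_df, min_rating_to_keep=4, user_col='user_id',
                       item_col='item_id', ratings_col='rating'):
    """Hypothetical stand-in for an explicit-to-implicit conversion."""
    # Drop every interaction rated below the threshold.
    kept = explicit_df[explicit_df[ratings_col] >= min_rating_to_keep]

    # Assumption: keep the first occurrence of each (user, item) pair.
    kept = kept.drop_duplicates(subset=[user_col, item_col]).copy()

    # Every surviving interaction becomes a positive signal of 1.
    kept[ratings_col] = 1
    return kept.reset_index(drop=True)


explicit_df = pd.DataFrame({
    'user_id': [0, 0, 1, 1, 2],
    'item_id': [10, 10, 11, 12, 10],
    'rating': [5, 4, 2, 4, 3],
})
implicit_df = to_implicit_sketch(explicit_df)
```

Here user 2 disappears entirely, because that user's only rating falls below the threshold.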

Remove Users With Fewer Than n Interactions

collie.utils.remove_users_with_fewer_than_n_interactions(df: pandas.core.frame.DataFrame, min_num_of_interactions: int = 3, user_col: str = 'user_id')

Remove DataFrame rows with users who appear fewer than min_num_of_interactions times.

Parameters
  • df (pd.DataFrame) – Dataframe with a column for user IDs

  • min_num_of_interactions (int) – Minimum number of interactions a user can have while remaining in filtered_df

  • user_col (str) – Column name for the user IDs

Returns

filtered_df

Return type

pd.DataFrame
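
One way to express this filter in pandas is to count each user's rows and keep only the rows whose user meets the threshold. The function name drop_infrequent_users_sketch is hypothetical, and the sketch leaves the original index untouched, which may differ from what Collie returns.

```python
import pandas as pd


def drop_infrequent_users_sketch(df, min_num_of_interactions=3, user_col='user_id'):
    """Hypothetical stand-in: keep rows whose user appears often enough."""
    # For every row, look up how many rows its user has in total.
    interactions_per_row = df.groupby(user_col)[user_col].transform('size')
    return df[interactions_per_row >= min_num_of_interactions]


df = pd.DataFrame({
    'user_id': [0, 0, 0, 1, 1, 2],
    'item_id': [5, 6, 7, 5, 6, 5],
})
filtered_df = drop_infrequent_users_sketch(df, min_num_of_interactions=3)
```

Only user 0, with three interactions, survives the filter in this example.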

Pandas DataFrame to HDF5 Format

collie.utils.pandas_df_to_hdf5(df: pandas.core.frame.DataFrame, out_path: Union[str, pathlib.Path], key: str = 'interactions')

Append a Pandas DataFrame to HDF5 using a table format and blosc compression.

DataFrame to HTML

collie.utils.df_to_html(df: pandas.core.frame.DataFrame, image_cols: List[str] = [], hyperlink_cols: List[str] = [], html_tags: Dict[str, Union[str, List[str]]] = {}, transpose: bool = False, image_width: Optional[int] = None, max_num_rows: int = 200, **kwargs)

Convert a Pandas DataFrame to HTML.

Parameters
  • df (DataFrame) – DataFrame to convert to HTML

  • image_cols (str or list) – Column names that contain image URLs or file paths. Any other transformation requested for a column specified as an image column is ignored. Local files display correctly in Jupyter if specified using relative paths, but not if specified using absolute paths (see https://github.com/jupyter/notebook/issues/3810).

  • hyperlink_cols (str or list) – Column names that contain hyperlinks to open in a new tab

  • html_tags (dictionary) –

    A transformation to be inserted directly into the HTML tag.

    Ex: {'col_name_1': 'strong'} becomes <strong>col_name_1</strong>

    Ex: {'col_name_2': 'mark'} becomes <mark>col_name_2</mark>

    Ex: {'col_name_3': 'h2'} becomes <h2>col_name_3</h2>

    Ex: {'col_name_4': ['em', 'strong']} becomes <em><strong>col_name_4</strong></em>

  • transpose (bool) – Transpose the DataFrame before converting to HTML

  • image_width (int) – Set image width for each image generated

  • max_num_rows (int) – Maximum number of rows to display

  • **kwargs (keyword arguments) – Additional arguments sent to pandas.DataFrame.to_html, as listed in: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_html.html

Returns

df_html – DataFrame converted to an HTML string, ready for displaying

Return type

HTML

Examples

In a Jupyter notebook:

from IPython.display import display, HTML
import pandas as pd

from collie.utils import df_to_html

df = pd.DataFrame({
    'item': ['Beefy Fritos® Burrito'],
    'price': ['1.00'],
    'image_url': ['https://www.tacobell.com/images/22480_beefy_fritos_burrito_269x269.jpg'],
})
display(
    HTML(
        df_to_html(
            df,
            image_cols='image_url',
            html_tags={'item': 'strong', 'price': 'em'},
            image_width=200,
        )
    )
)

Note

The converted table will have the CSS class 'dataframe' unless otherwise specified.
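
The nesting order used by html_tags, where a list such as ['em', 'strong'] produces <em><strong>...</strong></em>, can be illustrated with a small standalone helper. The helper wrap_in_tags is hypothetical and is not part of Collie; it only mirrors the examples given in the parameter description.

```python
from typing import List, Union


def wrap_in_tags(value: str, tags: Union[str, List[str]]) -> str:
    """Hypothetical helper: wrap a cell value in one or more HTML tags."""
    if isinstance(tags, str):
        tags = [tags]

    # Apply tags from last to first so the first tag ends up outermost.
    for tag in reversed(tags):
        value = f'<{tag}>{value}</{tag}>'
    return value


single = wrap_in_tags('Beefy Fritos® Burrito', 'strong')
nested = wrap_in_tags('1.00', ['em', 'strong'])
```

Applying the tags in reverse keeps the output order identical to the order in which the tags are listed.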

Timer Class

class collie.utils.Timer

Class to manage timing different sections of a job.

time_since_start(message: str = 'Total time') → float

Get time since timer was instantiated.

timecheck(message: str = 'Finished') → float

Get time since last timecheck.
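
The difference between the two methods is easiest to see in a minimal sketch. TimerSketch below is a hypothetical class written for illustration, not Collie's Timer; in particular, the printed message format and the use of time.perf_counter are assumptions.

```python
import time


class TimerSketch:
    """Hypothetical timer tracking both total and per-section elapsed time."""

    def __init__(self):
        self.start_time = time.perf_counter()
        self.last_check = self.start_time

    def time_since_start(self, message='Total time'):
        # Measured from instantiation and unaffected by earlier timechecks.
        elapsed = time.perf_counter() - self.start_time
        print(f'{message}: {elapsed:.4f} seconds')
        return elapsed

    def timecheck(self, message='Finished'):
        # Measured from the previous timecheck (or instantiation), then reset.
        now = time.perf_counter()
        elapsed = now - self.last_check
        self.last_check = now
        print(f'{message}: {elapsed:.4f} seconds')
        return elapsed


timer = TimerSketch()
step_seconds = timer.timecheck('Loaded data')
total_seconds = timer.time_since_start()
```

Calling timecheck after each stage of a job gives per-stage durations, while time_since_start always reports the running total.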

Truncated Normal Initialization

collie.utils.trunc_normal(embedding_weight: torch.tensor, mean: float = 0.0, std: float = 1.0)

Truncated normal initialization (approximation).

Taken from FastAI: https://github.com/fastai/fastai/blob/master/fastai/layers.py
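
The approximation used in FastAI draws standard normal samples, folds them into the open interval (-2, 2) with a floating-point remainder, then scales by std and shifts by mean. The sketch below reproduces that idea in NumPy rather than PyTorch so that it runs without a deep learning dependency. The function name trunc_normal_sketch and its shape and seed parameters are hypothetical, and it returns a new array instead of initializing an embedding weight in place.

```python
import numpy as np


def trunc_normal_sketch(shape, mean=0.0, std=1.0, seed=None):
    """Hypothetical NumPy analogue of an approximate truncated normal init."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal(shape)

    # fmod keeps the sign of each sample and bounds its magnitude below 2,
    # so the result lies strictly within two standard deviations of the mean.
    return np.fmod(samples, 2.0) * std + mean


weights = trunc_normal_sketch((1000, 8), mean=0.0, std=0.5, seed=0)
```

Because samples beyond two standard deviations are wrapped rather than redrawn, the result only approximates a true truncated normal distribution.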