What is an embedding layer and how does it work?

In this example I’ll make a collaborative filtering model (a recommender system) which uses an entity embedding as part of a system for recommending books to users.

Embeddings are a neat way to take a large number of individual items (users, products, or locations, for example) and represent each item with an n-dimensional vector instead of its unique ID. At first this might sound like it would increase the size and complexity of the model - since each item now needs an additional vector representation - but in fact this process reduces the number of individual inputs the model needs to see in order to make predictions.

For example, suppose we have 1000 book titles: without an embedding layer the model would need to see each unique ID and learn the differences between them. An embedding vector for each of these book titles might be two dimensions long, and might encode each book’s sci-fi-ness and its length. This means we could feed this two-dimensional embedding vector to the model rather than the 1000 individual titles. Since those inputs represent something real about the book, that might be enough information to make sensible predictions with. In a sense the embedding compresses information about each of the N inputs into an n-dimensional vector.
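
As a concrete sketch (the numbers here are hypothetical), PyTorch’s nn.Embedding provides exactly this kind of lookup table: 1000 books, each mapped to a learnable 2-dimensional vector.

import torch
from torch import nn

# A lookup table with one learnable 2-dimensional vector per book (1000 hypothetical books)
book_embedding = nn.Embedding(num_embeddings=1000, embedding_dim=2)

some_book_ids = torch.tensor([3, 42, 999])    # hypothetical book IDs
print(book_embedding(some_book_ids).shape)    # torch.Size([3, 2])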

In this blog post I’ll follow a similar process to the one outlined in the fast.ai course which used the movielens dataset. I’ll aim to explain some nuances about embedding layers, since I found this concept pretty confusing at first. Now that I’ve got my head around them I’m pretty amazed at how elegant, powerful and useful embeddings can be, and I’m excited to start trying out creative uses for embeddings.

Read more on embeddings in this paper: Guo, Cheng et al. “Entity Embeddings of Categorical Variables”

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from fastai.collab import *
from fastai.tabular.all import *
def display_all(df):
    # Temporarily lift pandas' display limits so the whole dataframe is printed
    with pd.option_context('display.max_columns', None, 'display.max_rows', None):
        print(df)
path = Path('/kaggle/input/book-recommendation-dataset/')

Loading the data into Pandas

It doesn’t look like much, but the Ratings.csv file contains all the data we need to train a collaborative filtering model: a user column, the ISBN of a book, and the rating a user gave for that book.

It will be easier for us to understand if we can replace the book’s ISBN with its title, so the Books.csv file is used to find the titles.

ratings = pd.read_csv(path/'Ratings.csv')
books = pd.read_csv(path/'Books.csv', low_memory=False)
ratings.head()
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6

Scaling

Here I’m dividing all the ratings by 10 so they lie between 0 and 1 instead of 0 and 10. I wanted to see what effect this had on the loss during training: it reduced the loss by an order of magnitude. This isn’t a meaningful increase in accuracy - it just means the size of the errors is correspondingly smaller, since we’re operating within a smaller target range. Regardless, I decided to keep the ratings scaled between 0 and 1, since this scale is just as easy to understand, and scaling may bring some benefit for models with more features.

ratings['Book-Rating']=ratings['Book-Rating'].divide(10)
ratings = ratings.merge(books)
ratings.head()
User-ID ISBN Book-Rating Book-Title Book-Author Year-Of-Publication Publisher Image-URL-S Image-URL-M Image-URL-L
0 276725 034545104X 0.0 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg
1 2313 034545104X 0.5 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg
2 6543 034545104X 0.0 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg
3 8680 034545104X 0.5 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg
4 10314 034545104X 0.9 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books http://images.amazon.com/images/P/034545104X.01.THUMBZZZ.jpg http://images.amazon.com/images/P/034545104X.01.MZZZZZZZ.jpg http://images.amazon.com/images/P/034545104X.01.LZZZZZZZ.jpg
keep_list = ['User-ID', 'Book-Title', 'Book-Rating',]
del_list = ratings.columns.drop(keep_list)
del_list
ratings = ratings.drop(del_list, axis = 1)
ratings = ratings[keep_list] # changes the order
ratings = ratings.rename(columns={'User-ID': 'user', 'Book-Title': 'title', 'Book-Rating': 'rating'})
ratings.head()
user title rating
0 276725 Flesh Tones: A Novel 0.0
1 2313 Flesh Tones: A Novel 0.5
2 6543 Flesh Tones: A Novel 0.0
3 8680 Flesh Tones: A Novel 0.5
4 10314 Flesh Tones: A Novel 0.9

Now we’ve got a table of book titles, ratings and user IDs. Let’s make a fastai DataLoaders object.

The dataloaders object specifies a way of getting a series of mini-batches (training and validation) from a dataset. Here our model will be a collaborative filtering model, which is a little different from what we’ve seen before with image recognition problems. In this case we’ll use the book rating as the label, and the book title and user ID as the input features.

Embeddings

Since there are hundreds of thousands of individual user IDs, and many more book titles, it will be useful to compress this data in some way - one which keeps the relevant information about each user and book, but doesn’t require the model to learn each individual user ID or book title. This is where embeddings come in handy.

dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=16)

Looking at a batch from the dataloaders object

The dataloaders object gives us a quick way of getting a batch of features and labels from separate training and validation datasets. Below we can see pairings of 16 input features - users with book titles - and the corresponding label for each pairing, which is the rating the user gave for the book. 16 is the batch size, which I’ve chosen to be small for display purposes - I’ll change it to 64 later and experiment with different batch sizes.

dls.one_batch()
(tensor([[  4648,  17499],
         [ 47035,  71643],
         [ 80274, 172179],
         [ 70742,  51917],
         [ 80254,  91689],
         [ 29527, 202962],
         [ 46094,  78241],
         [ 71592,  69112],
         [ 45123, 152087],
         [   167,  11511],
         [  4649, 153008],
         [ 58572, 153828],
         [ 73899, 172135],
         [  7868, 119495],
         [ 17158,  55868],
         [ 65724, 126796]]),
 tensor([[0.0000],
         [0.0000],
         [0.0000],
         [0.5000],
         [0.0000],
         [0.9000],
         [0.0000],
         [0.8000],
         [0.0000],
         [0.9000],
         [0.0000],
         [0.9000],
         [0.0000],
         [0.0000],
         [0.8000],
         [0.0000]]))
dls.valid.show_batch()
user title rating
0 10819 Second Foundation (Foundation Novels (Paperback)) 0.8
1 94159 Too Good to Be True: The Colossal Book of Urban Legends 0.0
2 #na# The Heart of a Goof (Penguin Books) 0.9
3 214272 Skyline 0.0
4 238781 Accidental Tourist 0.0
5 143253 Mitzi and the Terrible Tyrannosaurus Rex 0.0
6 93047 Strangers on a Train 0.8
7 234281 Apocalypse sur commande 0.8
8 13552 Four To Score (A Stephanie Plum Novel) 0.0
9 205980 The Joy Luck Club 0.0

Take a sample

To speed up development and testing, we’ll work with a random sample of 300,000 ratings from the dataset.

number_of_samples = 300000
df=ratings.sample(number_of_samples)
dls = CollabDataLoaders.from_df(df, item_name='title', bs=64)

Crosstab

Here’s a crosstab representation of the data. This is how we’ll think of the data, though in reality all the model will see is one batch from the dataloaders object at a time. Note that the table is very sparsely populated - this is because most users haven’t read many of the books in the table.

sdf = df.sample(50)
pd.crosstab(sdf.user, sdf.title, values=sdf.rating, aggfunc='max').head()
title A Destiny of Love A Farewell to Arms (A Scribner Classic) A Fine Balance Angelica's Grotto Beauty: A Retelling of the Story of Beauty and the Beast Black Lightning Breath of Magic Carlucci's Edge Cat's Eye Clan Novel: Assamite ... The Handmaid's Tale The MacKade Brothers: Devin and Shane (Silhouette Single Title) The Nectar of Instruction The Perfect Neighbor (The Macgregors) (The Macgregors : Special Edition Series, No 1232) The Twits The Winter of Our Discontent (Penguin Twentieth-Century Classics) Thurston House Turbulence Vida de Una Geisha Violets Are Blue
user
1485 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5769 NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.7
14362 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21576 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 50 columns

Creating an embedding matrix

Since we have a very large number of categorical input features, we need some way of compressing this information. We’ll create two matrices of latent factors - one for the users and one for the books. Each row of these matrices is a vector of factors, where each factor represents something about a book, or something about a user.

For example, we’ll begin by creating a matrix with 5 factors for each user, and another with 5 factors for each book (as the output below shows, that works out to a 40736 x 5 matrix for the users and a 116040 x 5 matrix for the books). Conceptually you can imagine these slotting in to the right of the user column, and below the title column, in such a way that each book, and each user, has its own unique set of 5 factors. These factors will initially be random numbers, but as the model trains they will start to encode something meaningful about users’ preferences and about books’ qualities. We won’t decide what these factors mean; that will be learned by the model during training.

Let’s go ahead and make these matrices.

n_users = len(dls.classes['user'])
n_titles = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
title_factors = torch.randn(n_titles, n_factors)
user_factors.size(), title_factors.size()
(torch.Size([40736, 5]), torch.Size([116040, 5]))

Looking at the features

To make a forward pass through the model, we’ll take the dot product of some user factors with some title factors. If the vectors are similar, then it means that the user’s tastes are matched to the book’s qualities. Let’s take a look at this more closely:

Suppose user A has the factors (0, 1, 0.5, 0, -1) ,

and book N has the factors (0, 1, 0.6, 1, -1)

Since most of these factors are similar, except at index 3, we’ll get an output which is more positive, indicating that the book is a good match for the user. If the factors were all opposite to one another, we’d get a more negative output; perhaps not such a good match. The factors in this case might encode for something like this:

  • ‘written in english’

  • ‘short book’,

  • ‘written in the past 20 years’,

  • ‘written by terry pratchett’,

  • ‘contains dragons’

But the factors are learned automatically as the model trains.
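
Working through the numbers in the example above (a quick sanity check in PyTorch):

import torch

# User A's and book N's hypothetical factors from the example above
user_a = torch.tensor([0.0, 1.0, 0.5, 0.0, -1.0])
book_n = torch.tensor([0.0, 1.0, 0.6, 1.0, -1.0])

# Elementwise multiply then sum = the dot product
print((user_a * book_n).sum())    # tensor(2.3000) - a fairly positive score, i.e. a decent match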

Dot Product, Vectors and Scalars

On this site you can get a quick refresher on vectors, scalars, and dot products: https://www.mathsisfun.com/algebra/vectors-dot-product.html

In short, if you imagine two vectors on a plane, the dot product returns a scalar describing how much the vectors overlap - more precisely, the magnitude of one vector’s component along the other, multiplied by the other vector’s length.

Let’s try this out.

vector_a = torch.tensor([1, 0, 0.5])
vector_b = torch.tensor([1, 1, 0.1])

The dot product is just the sum of the products of all the features like so:

a1b1 + a2b2 + a3b3.

So the dot product of these vectors would be

1 + 0 + 0.05 = 1.05

This is just the sum of an elementwise multiplication in Python, which gives the same result as a matrix multiplication (the @ operator) of the two vectors.

# Sum of elementwise multiplication
(vector_a*vector_b).sum()
tensor(1.0500)
# Matrix multiply
vector_a@vector_b
tensor(1.0500)

What’s an embedding layer?

The Embedding class here creates an embedding matrix, just like we did above. It also provides a way of indexing into the matrix to get the vector at a specific index.

The input x in this case is one batch of user and title indices, with shape bs x 2. When we pass the input to the embedding layer, we get back the vectors containing the factors for that batch of inputs.

The matrix-multiply way of doing this would be to one-hot encode the input indices (a matrix with one row per input, containing a single 1 in the column of that input’s index), then matrix multiply this one-hot encoded matrix with the embedding matrix. The result would be a 16x5 matrix of factor vectors - one for each input in the minibatch. An embedding layer gets the same vectors out of the embedding matrix by indexing into it directly, which behaves just like that matrix multiplication but without the need to build the one-hot encoded matrix with all those redundant zeros.
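
Here’s a minimal sketch (with made-up sizes) showing that the two routes give the same result:

import torch

# A made-up embedding matrix: 10 items, 5 factors each
n_items, n_factors, bs = 10, 5, 16
embedding_matrix = torch.randn(n_items, n_factors)
idxs = torch.randint(0, n_items, (bs,))

# One-hot route: a bs x n_items matrix of zeros with a single 1 per row
one_hot = torch.zeros(bs, n_items)
one_hot[torch.arange(bs), idxs] = 1.0
via_matmul = one_hot @ embedding_matrix    # shape: bs x n_factors

# Indexing route: just look up the rows directly
via_indexing = embedding_matrix[idxs]      # shape: bs x n_factors

print(torch.allclose(via_matmul, via_indexing))    # True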

There may be some relation between an embedding for a particular book, and an embedding for a particular user, which correlates with the rating that user gave to that book. When we train this model, we’re trying to learn the set of parameters for the embeddings for each book and user, such that the dot product of the book embeddings and the user embeddings is close to the actual ratings a user gave for a particular book.

Our model’s forward method needs to make rating predictions by doing an elementwise multiplication of the user embedding and the book embedding, then sum over this to predict an overall rating. This predicted rating will be compared with the actual rating the user gave the book, then the initially random weights in the embedding matrix will be updated using stochastic gradient descent to create a better embedding.

Through this process the embedding will come to represent some real world features about the data, which relate to the ratings which people gave to books. These features might not be named or explicitly stated by the user, but rather they’ll be discovered by the network as its parameters automatically adjust to minimise the output of the loss function.

Making a PyTorch dot product model

class DotProduct(Module):
    def __init__(self, n_users, n_titles, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.title_factors = Embedding(n_titles, n_factors)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        titles = self.title_factors(x[:,1])
        return (users*titles).sum(dim=1)
x,y = dls.one_batch()
x.shape, y.shape
(torch.Size([64, 2]), torch.Size([64, 1]))
model = DotProduct(n_users, n_titles, n_factors=50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
epoch train_loss valid_loss time
0 0.238515 0.227373 00:27
1 0.183773 0.226952 00:27
2 0.089674 0.227779 00:27
3 0.047348 0.224438 00:27
4 0.023283 0.224684 00:27

Making the training process more efficient

Training on the entire dataset took 3 mins per epoch.

When I first ran this model it took 15 mins for 5 epochs. The model was still converging after 5 epochs, but this is too slow for experimentation - we should find a sample size which allows some convergence, but which doesn’t take forever to train.

For the next run, I took a random sample of 300,000 ratings from the dataset. This reduced the training time but hurt convergence - the loss measured on the validation set remained high. We need a way of reducing the size of the dataset while retaining most of the useful information.
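
The code that builds dense_df isn’t shown in this post: the idea is to keep only the ratings from the roughly 1,000 most active users (referenced later on), so the user/title matrix is much denser. A rough sketch of that kind of filtering - treat this as illustrative rather than the exact code used - might look like this:

# Illustrative sketch of building dense_df (the exact filtering isn't shown in this notebook):
# keep only the ratings from the most active users so the user/title matrix is denser.
top_users = ratings['user'].value_counts()[:1000].index
dense_df = ratings[ratings['user'].isin(top_users)].copy()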

Make a new dataloaders object to draw training and validation samples from this new dataframe.

dense_dls = CollabDataLoaders.from_df(dense_df, item_name='title', bs=64)
n_users = len(dense_dls.classes['user'])
n_titles = len(dense_dls.classes['title'])
model = DotProduct(n_users, n_titles, n_factors=50)

Let’s see how the model trains now

learn = Learner(dense_dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 1e-3)
epoch train_loss valid_loss time
0 0.169560 0.172441 00:06
1 0.119379 0.122923 00:06
2 0.103940 0.114699 00:06
3 0.098195 0.113169 00:06
4 0.099816 0.113051 00:06

Great - training now takes only around 6 seconds per epoch, and we’re still seeing convergence after 5 epochs. Let’s try to improve from here.

Adding intentional bias

So far our model just takes the dot product of the user and title embedding vectors. To improve it we should add bias terms, which let us represent the overall bias of a particular book or user. For example, a book might be extremely short and extremely sci-fi, but also generally terrible. Even a reader who loves short sci-fi books probably won’t enjoy a generally terrible one. Conversely, there might be a book which is very sci-fi but so good that even non-sci-fi fans enjoy it. We can represent this overall bias of the book by adding a learned scalar to the dot product.

The bias term for each user similarly lets us represent users who, on average, give higher or lower ratings than other users across the board.

Let’s give this a go below

class DotProductBias(Module):
    def __init__(self, n_users, n_titles, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.title_factors = Embedding(n_titles, n_factors)
        self.title_bias = Embedding(n_titles, 1)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        titles = self.title_factors(x[:,1])
        result = (users*titles).sum(dim=1, keepdim=True)
        result += self.user_bias(x[:,0]) + self.title_bias(x[:,1])
        return result

Here we’re just adding another embedding to represent the bias for each user and each book. This scalar value is added to the prediction for a user and book combination.

Initially I added a sigmoid to the output to keep the predictions between 0 and 1.1. Using an upper limit of 1.1 allows the model to predict a rating of exactly 1, which would otherwise be impossible, since the sigmoid function squashes all inputs (from -inf to inf) into the range 0 to 1 without ever reaching 1.

In practice what happened was that all the predicted ratings were between ~0.4 and ~0.5. Removing the sigmoid on the outputs fixed this and all predicted ratings now fall between 0 and 1, perhaps because I’ve pre-scaled the ratings to lie within this range.
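
For reference, here’s roughly what the sigmoid-clamped variant looked like before I removed it - a sketch, assuming a y_range of (0, 1.1); sigmoid_range comes from fastai:

class DotProductBiasSigmoid(Module):
    "Variant of DotProductBias which squashes predictions into y_range - the version I later dropped"
    def __init__(self, n_users, n_titles, n_factors, y_range=(0, 1.1)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.title_factors = Embedding(n_titles, n_factors)
        self.title_bias = Embedding(n_titles, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        titles = self.title_factors(x[:,1])
        result = (users*titles).sum(dim=1, keepdim=True)
        result += self.user_bias(x[:,0]) + self.title_bias(x[:,1])
        return sigmoid_range(result, *self.y_range)    # squash the output into (0, 1.1)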

Weight decay

L2 regularization, also called weight decay, is also used here. It penalizes large weights by adding the sum of the squared weights to the loss function. This helps reduce overfitting by reducing the chance of any individual weight becoming very large. It will slow down training, but it also produces a model which generalizes better - the model finds general patterns rather than an overly complex, overfit function which only fits the items in the training set.
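
To make the idea concrete, here’s a tiny illustration of the penalty term (conceptual only - fastai handles this for you when you pass wd to fit_one_cycle):

import torch

# The penalty added to the loss is wd * sum(w**2); its gradient, 2 * wd * w,
# nudges every weight towards zero on each update step.
wd = 0.1
params = torch.randn(20, requires_grad=True)

base_loss = torch.tensor(0.5)                  # stand-in for the model's MSE loss
loss = base_loss + wd * (params ** 2).sum()    # loss with the L2 penalty added
loss.backward()

print(torch.allclose(params.grad, 2 * wd * params.detach()))    # True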

I’ve set the weight decay (wd) to 0.1 in the call to fit_one_cycle below.

model = DotProductBias(n_users, n_titles, n_factors=50)
learn = Learner(dense_dls, model, loss_func=MSELossFlat()).to_fp16()
learn.fit_one_cycle(5, 1e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.141313 0.143934 00:07
1 0.115564 0.117711 00:07
2 0.099428 0.113122 00:07
3 0.087854 0.112196 00:07
4 0.087845 0.112063 00:07

Now that the model is trained, we can get predictions for any pairing of a user and a book. The model output is the rating it predicts for that user/book combination. Here’s a demonstration which uses one batch of data, so it’s just a random pairing of users with books.

batch = dense_dls.one_batch()[0].to('cuda')

We can call get_device() to check that the tensor is on the GPU (-1 = cpu, 0 = cuda:0):

batch.get_device()
0

Passing a batch of inputs (user/title index pairs) to the model gives us a prediction for each pairing.

model(batch)[:10]
tensor([[ 0.2103],
        [ 0.1679],
        [ 0.0168],
        [ 0.0581],
        [ 0.3176],
        [ 0.2288],
        [ 0.2315],
        [ 0.2790],
        [ 0.0960],
        [-0.0118]], device='cuda:0', grad_fn=<SliceBackward0>)

Looking at the factors for a batch of users

Here we can see the indices of a batch of users. Each one of these users has a corresponding set of factors which are accessed by passing these indices to the Embedding instance called user_factors

batch[:,0]
tensor([150, 174, 842, 449, 281, 692, 974, 600, 582, 494,  91, 194, 332, 361,
        485, 657, 299, 802, 675, 378, 680, 465, 858, 541, 567, 879, 506, 817,
        288, 950, 968, 581, 827, 174, 144, 381,  28, 566, 202, 897, 956, 838,
        380, 672, 981, 257, 714, 953, 343, 846, 447, 842, 701, 643, 343, 384,
        305, 632, 975,  48, 436, 630, 836, 583], device='cuda:0')

Thinking about latent factors as components of a vector in an n-dimensional feature space

Here are the factors for each of the users in the batch:

model.user_factors(batch[:,0])
tensor([[ 0.0206,  0.0324, -0.0256,  ..., -0.0972,  0.0861, -0.0837],
        [ 0.0978,  0.1441,  0.0092,  ..., -0.0022, -0.0379,  0.0745],
        [-0.0334,  0.0943,  0.0024,  ...,  0.0780, -0.0867, -0.0408],
        ...,
        [ 0.0150,  0.0170,  0.0297,  ...,  0.0246, -0.0477, -0.0407],
        [-0.0341, -0.0660, -0.0711,  ...,  0.0390,  0.0392,  0.0218],
        [ 0.0929,  0.0700,  0.0584,  ...,  0.0826, -0.0621,  0.0504]],
       device='cuda:0', grad_fn=<EmbeddingBackward0>)

Each of these numbers is a learned latent factor for that user. The latent factors can be thought of as the components of a vector in n-dimensional space, where each number is a different axis’s contribution. The axes are all orthogonal to one another. The factors can represent things like taste, genre, age and so on.

For example: if user A has 3 latent factors x, y, z, and these have values 1, 0.2, -0.9, then we can imagine a vector in 3D space which extends along the x dimension by 1, along y by 0.2, and negatively along the z dimension by 0.9.

Another user, or book title, might point in a very similar direction. This would mean that their factors overlap a lot and tend not to cancel out.

Each of these dimensions could code for something like ‘enjoys horror books’, ‘enjoys shorter books’, or ‘younger reader’.

If there was another user whose factors were -1, 0.2, 1, we might say that they had the opposite taste for horror stories, that they have the same liking for shorter books, and that they are older.
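
Here’s a quick check of that intuition using the two hypothetical users above:

import torch
import torch.nn.functional as F

# The two hypothetical users from the example above (3 latent factors each)
user_a = torch.tensor([1.0, 0.2, -0.9])
user_b = torch.tensor([-1.0, 0.2, 1.0])

print(user_a @ user_b)                               # tensor(-1.8600): the factors mostly cancel out
print(F.cosine_similarity(user_a, user_b, dim=0))    # close to -1: the vectors point in nearly opposite directions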

The latent factors encode for real world meaning, but the factors themselves aren’t chosen by the engineer when setting up the neural network - rather they emerge from the relationships between books, users and ratings as the model trains.

Using the trained model

Finding the books with the highest bias

Here’s a list of books with a high bias: they end up with a higher rating across the board, regardless of the specific factors which were learned to describe them. Intuitively this means they’re high quality, since they get consistently high ratings whatever their genre and whatever the users’ tastes.

books_bias = learn.model.title_bias.weight.squeeze()
idxs = books_bias.argsort(descending=True)[:20]
[dense_dls.classes['title'][i] for i in idxs]
['Harry Potter and the Order of the Phoenix (Book 5)',
 'Harry Potter and the Prisoner of Azkaban (Book 3)',
 "Harry Potter and the Sorcerer's Stone (Book 1)",
 'Harry Potter and the Chamber of Secrets (Book 2)',
 'To Kill a Mockingbird',
 '84 Charing Cross Road',
 "Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))",
 'The Lovely Bones: A Novel',
 'A Wrinkle in Time',
 'Harry Potter and the Goblet of Fire (Book 4)',
 'The Little Prince',
 'The Secret Garden',
 'The Da Vinci Code',
 'Girl in Hyacinth Blue',
 'Stupid White Men ...and Other Sorry Excuses for the State of the Nation!',
 'Lord of the Flies',
 'Dragonfly in Amber',
 "Dude, Where's My Country?",
 'Fahrenheit 451',
 'Carrie']

Making recommendations for a single user

We know how to get rating predictions for a single batch: take the dot product of the user factors and title factors for each user/title pairing in the batch. To get predictions for a single user, we just need to replace all the user IDs with the ID of that single user. Let’s try this:

Now that we have a trained model, to make a recommendation we need to do 2 things:

  1. Find out which books the user has read already. This is just so that we’re not recommending books they’ve already read.

  2. Create a tensor of pairs of user and book indices. These will be passed to the DotProductBias forward() method, which takes the dot product of each user/book combination. We make the user index the same in every pair (the index for user 11676), and calculate these dot products for every book the user hasn’t yet read. Once this calculation is performed, we’ll have a prediction of the rating this user might give each of these books if they were to read them. Based on these predictions we can recommend the books with the highest predicted rating.

Let’s take a look at the user who has read the most books:

dense_df.user.value_counts()[:1000].keys()
Int64Index([ 11676,  35859,  76352,  16795, 153662, 102967, 238120,  23768,
            230522,  55492,
            ...
             69808,   4385, 168464, 164465, 227250,  35433, 241198, 173632,
            133868,  72352],
           dtype='int64', length=997)

user 11676

This is the ID of the user we’re trying to recommend books for.

We made an embedding using a subset of the 1000 top users - so we need a way to find which index this ID is at:

def get_index(cat, classes):
    'Get the index of a category (e.g. a user ID or book title) in a dataloader class list'
    for i, j in enumerate(classes):
        if j == cat:
            return i
        
get_index(11676, dense_dls.classes['user'])
32

Let’s confirm that this works by testing it on a book title:

get_index('The Little Prince', dense_dls.classes['title'])
793

We’re going to check for book recommendations for user 11676, who is at index 32 in our dense dataloaders object.

user_index = 32
n_books = len(dense_dls.cats['title'].unique())
user_idxs = torch.full((n_books, 1), user_index, dtype=int).cuda()
book_idxs = torch.linspace(1, n_books, n_books, dtype=int).unsqueeze(1).cuda()
user_books_tensor = torch.cat((user_idxs, book_idxs), -1)
user_books_tensor
tensor([[  32,    1],
        [  32,    2],
        [  32,    3],
        ...,
        [  32,  998],
        [  32,  999],
        [  32, 1000]], device='cuda:0')

Now we have a tensor pairing the user at index 32 with each of the book indices from 1 to 1000. Passing this into our model’s forward() method will calculate the dot product of this user’s latent factors vector with the latent factors for each book in the dataset. This dot product is the rating prediction.

recommendations = model(user_books_tensor)
top_10 = recommendations.argsort(0, descending=True)[:10]
top_10
tensor([[307],
        [861],
        [872],
        [515],
        [293],
        [941],
        [597],
        [842],
        [ 64],
        [308]], device='cuda:0')

Now we have the indices of the top 10 recommended books for this user. Finally we can look up these top indices in the dataloaders classes to get the titles.

dense_dls.classes['title'][top_10]
(#10) ['Harry Potter and the Chamber of Secrets (Book 2)','The Secret','The Sparrow','Possession : A Romance','Girl With a Pearl Earring','Under the Tuscan Sun','Sisterhood of the Traveling Pants','The Reader',"Angela's Ashes (MMP) : A Memoir",'Harry Potter and the Goblet of Fire (Book 4)']

Let’s take a look at all the books this user has read, ordered by rating:

dense_df.loc[dense_df.user==11676].loc[dense_df.rating==1][:10]
user title rating
69 11676 The Notebook 1.0
189 11676 A Painted House 1.0
12730 11676 Harry Potter and the Chamber of Secrets (Book 2) 1.0
13401 11676 Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback)) 1.0
19165 11676 The Sweet Potato Queens' Book of Love 1.0
19881 11676 Dreamcatcher 1.0
20480 11676 Fight Club 1.0
26049 11676 1st to Die: A Novel 1.0
26691 11676 The Hot Zone 1.0
28195 11676 The Girl Who Loved Tom Gordon 1.0
recommendations.min(), recommendations.max()
(tensor(0.1127, device='cuda:0', grad_fn=<MinBackward1>),
 tensor(0.9781, device='cuda:0', grad_fn=<MaxBackward1>))

Finding ‘book buddies’

We can use the same approach to pair users with the people they’re most similar to. If two readers in the model have the same set of latent factors as one another, then they have very similar tastes in books. I remember the Last.FM music recommendation software had a feature where you could see your ‘musical neighbours’ and what music they’d been listening to - it likely used a similar collaborative filtering system.

To find two similar readers, we could use the following approach:

  1. pick a user
  2. apply the same process as above but instead of calculating the dot product of this user with every book, calculate the dot product of the user with every other user. If their latent factors are similar they’re likely to have similar tastes in books.

Let’s give this a go!

n_users = len(dense_dls.cats['user'].unique())
user_idxs = torch.full((n_users, 1), user_index, dtype=int).cuda()
all_users = torch.tensor(dense_dls.cats['user'].values).unique().unsqueeze(1).cuda()
pairs = torch.cat((user_idxs, all_users), -1)
top_10_indices = model(pairs).argsort(0)[:10]

top_10_buddies = dense_dls.classes['user'][top_10_indices]
top_10_buddies
(#10) [104429,175886,113983,258185,43619,48046,21364,225087,192093,184299]

Conclusion

In this post I’ve covered:

  • how to load a dataset into fastai
  • how to take the dot product of two vectors
  • building a custom PyTorch model which inherits from fastai’s Module class and contains a forward() method and a couple of embedding layers for the input features
  • training a model on a denser subset of the data to enable faster training
  • the role of bias and factors in the embedding matrices
  • how to get a book recommendation for a given user
  • how to find users with similar tastes in the dataset

I was able to train a dot-product-based model with embedding layers for both input features, and get book recommendations for a given user.

References

http://fast.ai

Guo, Cheng et al. “Entity Embeddings of Categorical Variables”