Real-time Personalization using Embeddings

Introduction

Since I will be working on recommendation systems in my new job, I would like to start reading papers in this area. This time I'm going to introduce a KDD'18 paper, Real-time Personalization using Embeddings for Search Ranking at Airbnb (paper link), which introduces an embedding method for search ranking.

Motivation

In many situations, a company needs to optimize the search results between users and items: given an input query with different features, the model should output a ranking of the items. This can be formulated as a pairwise regression problem with positive utilities for bookings and negative utilities for rejections. Besides, a user's historical data, including both positive and negative signals, can be fed into the model to give the user better recommendations. To rank the search results, this paper proposes listing embeddings, low-dimensional vector representations learned from search sessions. Besides these embeddings for a user's short-term interests, the paper also introduces another type of embeddings trained on bookings in order to capture a user's long-term behavior. The listing type embeddings live in the same space as the user type embeddings, so the similarity between users and listings can be computed directly when ranking search results.

The embedding training here is quite similar to the way neural networks are used to solve NLP problems, such as CBOW and skip-gram (SG), which learn from the co-occurrence of words within a context window.

Methodology

As mentioned before, embeddings play an important role in NLP. Recommendation systems have borrowed the idea of embeddings from the NLP field; more importantly, user embeddings and item embeddings can be mapped into the same space, so the similarity between users and items can be calculated directly.

We are given a set $\mathcal{S}$ of $S$ click sessions from $N$ users, where each session $s = (l_1, l_2, \dots, l_M) \in \mathcal{S}$ is a sequence of $M$ listing ids clicked by a user. A new session is started whenever more than 30 minutes pass between two consecutive clicks of the same user. Given the dataset, the problem can be described as learning a $d$-dimensional real-valued representation $\mathbf{v}_{l_i} \in \mathbb{R}^d$ of each listing $l_i$, i.e. the embedding of the $i^{th}$ item. The loss function is the sum over sessions of the negative log probability of each listing within a sliding window of size $m$, conditioned on the central listing:

$$\mathcal{L} = -\sum_{s\in \mathcal{S}} \sum_{l_i \in s}\sum_{-m\le j\le m, j\ne 0}\log P(l_{i+j}\mid l_i)$$
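Before this objective can be trained, the raw click logs must be cut into sessions. As a minimal sketch of the 30-minute rule described above (the function name and input format are my own, not the paper's):

```python
from datetime import datetime, timedelta

def split_into_sessions(clicks, gap_minutes=30):
    """Split a user's chronological click stream into sessions.

    `clicks` is a list of (timestamp, listing_id) pairs sorted by time.
    A new session starts whenever two consecutive clicks are more than
    `gap_minutes` apart (30 minutes in the paper).
    """
    sessions = []
    current = []
    prev_time = None
    gap = timedelta(minutes=gap_minutes)
    for ts, listing_id in clicks:
        if prev_time is not None and ts - prev_time > gap:
            sessions.append(current)  # close the old session
            current = []
        current.append(listing_id)
        prev_time = ts
    if current:
        sessions.append(current)
    return sessions
```

Each resulting session then plays the role of a "sentence" in the skip-gram objective above.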

The probability is modeled as a softmax over the dot products between the input and output embeddings:

$$ P(l_{i+j}\mid l_i) = \frac{\exp(\mathbf{v}_{l_i}^\top\mathbf{v}'_{l_{i+j}})}{\sum_{l=1}^{|\mathcal{V}|} \exp(\mathbf{v}_{l_i}^\top\mathbf{v}'_{l})} $$
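A minimal NumPy sketch of this softmax (variable names are my own; the paper does not prescribe an implementation):

```python
import numpy as np

def context_probability(v_in, v_out_all, target_idx):
    """P(l_target | l_center): softmax over dot products between the
    center listing's input embedding and every output embedding.

    v_in       : (d,)   input embedding of the center listing l_i
    v_out_all  : (V, d) output embeddings for the whole vocabulary
    target_idx : index of the context listing l_{i+j}
    """
    scores = v_out_all @ v_in       # (V,) dot products with every listing
    scores -= scores.max()          # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[target_idx] / exp_scores.sum()
```

The denominator sums over the whole vocabulary $\mathcal{V}$, which is exactly why this form is too expensive to train directly.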

This softmax is quite similar to the one in word2vec. In fact the underlying ideas of the two methods are identical, as mentioned before, so techniques from word2vec can also be applied to this problem, including negative sampling. Sampling positive pairs $(l, c)$ from the dataset, denoted $\mathcal{D}_p$, and random negative pairs $(l, c)$ from the vocabulary, denoted $\mathcal{D}_n$, the loss function becomes:

$$ \mathcal{L} = -\left( \sum_{(l, c) \in \mathcal{D}_p}\log\frac{1}{1+e^{-\mathbf{v}_c^\top\mathbf{v}'_l}} + \sum_{(l, c) \in \mathcal{D}_n}\log\frac{1}{1+e^{\mathbf{v}_c^\top\mathbf{v}'_l}}\right)$$
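The negative-sampling loss for a single center listing can be sketched as follows (a simplified illustration, assuming precomputed embedding matrices; not the paper's actual training code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_center, pos_out, neg_out):
    """Negative-sampling loss for one center listing l.

    v_center : (d,)   input embedding of the center listing
    pos_out  : (P, d) output embeddings of clicked context listings (D_p)
    neg_out  : (N, d) output embeddings of randomly sampled listings (D_n)
    Positive pairs are pushed toward sigmoid(v_c . v_l) -> 1,
    negatives toward sigmoid(-v_c . v_l) -> 1.
    """
    pos_term = np.log(sigmoid(pos_out @ v_center)).sum()
    neg_term = np.log(sigmoid(-(neg_out @ v_center))).sum()
    return -(pos_term + neg_term)
```

Only the sampled negatives appear in the sum, replacing the full-vocabulary softmax denominator.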

An illustration of the model can be found in the following figure. The click session set $\mathcal{S}$ can be divided into two parts: 1) booked sessions and 2) exploratory sessions. Booked sessions are important in the model, since they end with a booking, which helps predict a user's booking behavior more precisely. Treating the booked listing $l_b$ as a global context that is always predicted, the loss function becomes:

$$\mathcal{L} = -\left( \sum_{(l, c) \in \mathcal{D}_p}\log\frac{1}{1+e^{-\mathbf{v}_c^\top\mathbf{v}'_l}} + \sum_{(l, c) \in \mathcal{D}_n}\log\frac{1}{1+e^{\mathbf{v}_c^\top\mathbf{v}'_l}} + \log\frac{1}{1+e^{-\mathbf{v}_{l_b}^\top\mathbf{v}'_l}} \right)$$

Users of online search tend to browse within a group of very similar items. As the example in the paper shows, a user usually searches within a single market, the area near the place they want to visit. This suggests another technique for negative sampling: instead of sampling negatives only at random from the whole vocabulary, also sample negatives from the same market as the positive samples. Denoting this negative set as $\mathcal{D}_{m_n}$, the loss function becomes:

$$\mathcal{L} = -\left( \sum_{(l, c) \in \mathcal{D}_p}\log\frac{1}{1+e^{-\mathbf{v}_c^\top\mathbf{v}'_l}} + \sum_{(l, c) \in \mathcal{D}_n}\log\frac{1}{1+e^{\mathbf{v}_c^\top\mathbf{v}'_l}} + \log\frac{1}{1+e^{-\mathbf{v}_{l_b}^\top\mathbf{v}'_l}} + \sum_{(l, m_n) \in \mathcal{D}_{m_n}}\log\frac{1}{1+e^{\mathbf{v}_{m_n}^\top\mathbf{v}'_l}} \right) $$
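Putting the four terms together, the full objective for one center listing in a booked session can be sketched like this (again an illustrative simplification with my own variable names, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def full_loss(v_center, pos_out, neg_out, booked_out, market_neg_out):
    """Full objective for one center listing l in a booked session:
    positives D_p, random negatives D_n, the booked listing l_b as a
    global context (always a positive), and same-market negatives D_{m_n}.
    """
    loss = -np.log(sigmoid(pos_out @ v_center)).sum()          # D_p term
    loss -= np.log(sigmoid(-(neg_out @ v_center))).sum()       # D_n term
    loss -= np.log(sigmoid(booked_out @ v_center))             # booked listing l_b
    loss -= np.log(sigmoid(-(market_neg_out @ v_center))).sum()  # D_{m_n} term
    return float(loss)
```

The booked-listing term is not windowed: it contributes for every center listing in the session.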

Another problem is the cold start of the whole recommendation system. Obviously, a newly created listing has no embedding, because no click data is available for it yet. To alleviate this issue, the initial embedding can be built from nearby embeddings: averaging the embeddings of the 3 geographically closest similar listings gives a vector that can be used as the new listing's initial embedding.
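A minimal sketch of this cold-start initialization (the paper additionally restricts the candidates to listings of the same type and price bucket, which I elide here; the flat squared-distance metric is a simplification of real geographic distance):

```python
import numpy as np

def cold_start_embedding(new_listing_coords, known_listings, k=3):
    """Initialize a new listing's embedding as the mean of the embeddings
    of its k geographically nearest known listings (k=3 in the paper).

    known_listings: list of ((lat, lon), embedding) pairs.
    """
    def sq_dist(a, b):
        # simple planar squared distance, good enough for nearby listings
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    nearest = sorted(known_listings,
                     key=lambda item: sq_dist(item[0], new_listing_coords))[:k]
    return np.mean([emb for _, emb in nearest], axis=0)
```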

Finally, the embeddings are clustered, and the cosine similarities of intra- and inter-cluster points are compared. It can be shown that listings in the same cluster are much closer in type and price, as well as in geographical location. The results suggest that the embedding learning is effective with user click behavior as input. From the figure below, we can find that embeddings with similar patterns are clustered together.
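This kind of evaluation can be sketched as a comparison of average cosine similarity within clusters versus across them (my own sketch of the idea, not the paper's evaluation code):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_intra_inter(embeddings, labels):
    """Average cosine similarity within a cluster vs. across clusters.
    A well-trained embedding space should show intra > inter."""
    intra, inter = [], []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(embeddings[i], embeddings[j])
            (intra if labels[i] == labels[j] else inter).append(sim)
    return np.mean(intra), np.mean(inter)
```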

Another challenging task is to train item embeddings from user bookings rather than user clicks. Given a list of booking sessions obtained from different users, it should be possible to learn embeddings from booked items. However, there are still some difficulties. First is the data size problem, since booking behavior is much less frequent than clicking behavior. Another problem is that many users only book once, without any context, and the same phenomenon occurs for items: many items are only booked once. Finally, booking behavior is sparse, so there may be a long time gap between two bookings, during which a user's interests may change substantially. These problems can be solved by learning embeddings at the level of listing_type instead of listing_id. Here the listing_type is determined by a rule table, which maps various attributes including location, price, listing type, capacity, number of beds, etc. to different buckets.
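A toy sketch of such a rule table (the bucket boundaries and attribute set here are illustrative placeholders, not the paper's actual rules):

```python
def listing_type(location, price, room_type, capacity, num_beds):
    """Map a listing's attributes to a coarse listing_type id.

    Many listings share one listing_type, so booking data that is far
    too sparse at the listing_id level becomes dense at this level.
    """
    if price < 50:
        price_bucket = "p0"
    elif price < 150:
        price_bucket = "p1"
    else:
        price_bucket = "p2"
    cap_bucket = "c0" if capacity <= 2 else "c1"
    bed_bucket = "b0" if num_beds <= 1 else "b1"
    return f"{location}_{room_type}_{price_bucket}_{cap_bucket}_{bed_bucket}"
```

User types are bucketed the same way, so user_type and listing_type embeddings can be trained jointly in one vocabulary and compared directly at ranking time.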