Using Cosine Similarity to Compare Users in a Recommendation System
By Xiaotong Ding (Claire), With Greg Page
In the digital world around us, we are surrounded by recommendation systems. The entertainment recommendations that we see from streaming movie and music services, the item suggestions presented to us when we shop online, and the targeted ads that appear after our web searches are all specifically geared towards us. Based on our browsing history, purchase patterns, “Likes” on social media, product reviews, demographics, and myriad other quantifiable data points, marketers continually refine their algorithms to make these targeted recommendations increasingly relevant for us — and lucrative for them.
In order to make such systems work, some sort of similarity metric is needed. For any automated process to “know” how similar Item A is to Item B, or how similar User A is to User B, there must be an objective standard that can be used as a measuring stick. In this article, we will discuss one of the most popular such metrics, cosine similarity.
Cosine similarity tells us about the angle that forms between two vectors in n-dimensional space. In general it ranges from -1 to 1; when every component is non-negative, as with the ratings used in this article, it ranges from 0 to 1. The closer the cosine value is to 1, the more similar the two vectors are.
To calculate the cosine similarity between two vectors, the first step is to take the dot product between the vectors — this will form the numerator of the cosine similarity measure. The dot product of two vectors is a scalar value, obtained by multiplying the vectors’ pairwise components, and adding the resulting products.
For example, if we have a vector u (3, 7, 10) and a vector v (5, 2, 9), then their dot product is: (3 × 5) + (7 × 2) + (10 × 9) = 15 + 14 + 90 = 119.
Next, we must find the norm, or length, of each of the vectors. For each vector, the norm is found by taking the square root of the sum of the squares of its elements. The product of the two vectors’ norms will be the denominator in the cosine similarity calculation.
For u, the norm is √(3² + 7² + 10²) = √158, which is approximately 12.57.
For v, the norm is √(5² + 2² + 9²) = √110, which is approximately 10.49.
The product of these two norms is √158 × √110, or approximately 131.83 when full precision is carried through, and the cosine similarity between vectors u and v is: 119 / 131.83 ≈ 0.903.
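The largest possible cosine similarity between any two vectors is 1. You can see that this value is reached by taking two identical vectors and putting them through the formula: you will end up with an identical numerator and denominator. In general, cosine similarity can be as low as -1, for vectors that point in opposite directions. When every component is non-negative, as with ratings, it can never be negative, and its lowest possible value is 0.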
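The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation; the function name `cosine_similarity` is our own choice, and only the standard library is used. Because the code carries full precision, it reports about 0.903 for u and v, whereas multiplying the two rounded norms by hand gives about 0.902.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    # Numerator: the dot product (sum of pairwise products).
    dot = sum(x * y for x, y in zip(a, b))
    # Denominator: the product of the two norms (square root of the sum of squares).
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

u = (3, 7, 10)
v = (5, 2, 9)

print(round(cosine_similarity(u, v), 3))  # 0.903
print(round(cosine_similarity(u, u), 3))  # 1.0, identical vectors
```

Passing two vectors that point in opposite directions, such as (1, 0) and (-1, 0), returns -1, which is why the 0-to-1 range only holds for non-negative data such as ratings.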
To see how cosine similarity can be used to compare users, let’s first imagine that we wish to characterize the similarity in book reviews among a group of eight friends.
Each of the eight friends fills out a survey, in which he or she is asked to rate each of six books on a scale from 1 to 5, with 5 being the best and 1 being the worst. The friends are instructed to mark their survey with an ‘x’ whenever they come across a book title that they have not yet read.
Their survey answers are shown below:
To calculate the cosine similarity between any two of the friends, we will first store the values for the items that they co-rated as vectors.
For Ana and Bo, the co-rated items are: Grapes of Wrath, Invisible Man, Catcher in the Rye, and the Great Gatsby. Ana’s ratings for these titles are: (5, 3, 4, 1), while Bo’s ratings are: (4, 5, 3, 2).
The numerator in the cosine similarity metric for Ana and Bo will be the dot product of these vectors: (5 × 4) + (3 × 5) + (4 × 3) + (1 × 2) = 20 + 15 + 12 + 2 = 49. The denominator will be √(5² + 3² + 4² + 1²) × √(4² + 5² + 3² + 2²), or √51 × √54, which is approximately 52.48.
Ana and Bo’s cosine similarity, therefore, is 49/52.48, or approximately 0.93.
Because cosine similarity measures the angle between vectors, and divides the dot product by the vectors’ lengths, it enables us to make meaningful comparisons between the “Ana & Bo” vector pair and other vector pairs, even when those other pairs have a different number of dimensions. Such a comparison would not be possible if we used a metric that relied on absolute distance, without some sort of length normalization. Imagine, for instance, comparing Euclidean distances in four-dimensional space to those in three-dimensional space: an “apples-to-apples” comparison would be impossible to make.
With cosine similarity, however, we can compare Ana and Freddie, who only have three co-rated items, to Ana and Bo.
Ana and Freddie’s three co-rated items are: Invisible Man, Catcher in the Rye, and the Great Gatsby.
For this comparison, we’ll use the dot product of their pairwise ratings to form our numerator: (3 × 3) + (4 × 5) + (1 × 5) = 9 + 20 + 5 = 34. The denominator will be √(3² + 4² + 1²) × √(3² + 5² + 5²), or √26 × √59, which is approximately 39.17, so their cosine similarity is 34/39.17, or approximately 0.87.
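The co-rated comparison can be sketched in Python as follows. This is only an illustration under a few assumptions of ours: each friend's ratings are stored in a dictionary keyed by title, an unread book (an ‘x’ on the survey) is simply left out of the dictionary, and the function name `co_rated_similarity` is hypothetical. The dictionaries hold only the ratings quoted in the text for these comparisons, not the full survey.

```python
import math

# Only the ratings quoted in the text; unread books are omitted.
ratings = {
    "Ana": {"Grapes of Wrath": 5, "Invisible Man": 3,
            "Catcher in the Rye": 4, "The Great Gatsby": 1},
    "Bo": {"Grapes of Wrath": 4, "Invisible Man": 5,
           "Catcher in the Rye": 3, "The Great Gatsby": 2},
    "Freddie": {"Invisible Man": 3, "Catcher in the Rye": 5,
                "The Great Gatsby": 5},
}

def co_rated_similarity(user_a, user_b):
    """Cosine similarity over the titles both users rated, or None if there are none."""
    shared = sorted(set(user_a) & set(user_b))
    if not shared:
        return None
    a = [user_a[title] for title in shared]
    b = [user_b[title] for title in shared]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(co_rated_similarity(ratings["Ana"], ratings["Bo"]), 2))       # 0.93
print(round(co_rated_similarity(ratings["Ana"], ratings["Freddie"]), 2))  # 0.87
```

Taking the intersection of the two sets of titles is what lets the same function handle the four-item comparison and the three-item comparison without any special cases.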
User-Item Analysis: Applications
A recommendation system built on this approach would at first suffer from a “cold start” problem: until each user has rated at least two items, and until there are at least two users in the system, there is no way to begin making comparisons. However, as user data enters the system, the distinctions between items, or between users, can become increasingly refined.
A streaming audiobook service could use the cosine similarities of the eight friends shown above to recommend items to certain users, based on the actions of other ones. If Dawei reads a newly released title and gives it a “5” rating, for example, the service is likely to recommend it to a similar user, such as Carla. Dawei’s ratings will not be as useful for recommendations made to Bo, however. To recommend items for Bo, the service can place more weight on the reactions of Ana or Hao.
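One way a service might decide whose reactions to weight most heavily is to rank the other users by their cosine similarity to the target user. The sketch below is a simplified assumption of how that could look, not a description of any real service. The helper names `co_rated_similarity` and `rank_neighbors` are hypothetical, and the data is limited to the three friends whose ratings are quoted in this article, so it ranks Ana's neighbors rather than Bo's or Carla's.

```python
import math

# Only the ratings quoted in this article; unread books are omitted.
ratings = {
    "Ana": {"Grapes of Wrath": 5, "Invisible Man": 3,
            "Catcher in the Rye": 4, "The Great Gatsby": 1},
    "Bo": {"Grapes of Wrath": 4, "Invisible Man": 5,
           "Catcher in the Rye": 3, "The Great Gatsby": 2},
    "Freddie": {"Invisible Man": 3, "Catcher in the Rye": 5,
                "The Great Gatsby": 5},
}

def co_rated_similarity(user_a, user_b):
    """Cosine similarity over co-rated titles, or None if it cannot be computed."""
    shared = set(user_a) & set(user_b)
    dot = sum(user_a[t] * user_b[t] for t in shared)
    norm_a = math.sqrt(sum(user_a[t] ** 2 for t in shared))
    norm_b = math.sqrt(sum(user_b[t] ** 2 for t in shared))
    if norm_a == 0 or norm_b == 0:
        return None  # no co-rated titles, or an all-zero vector
    return dot / (norm_a * norm_b)

def rank_neighbors(target, all_ratings):
    """Return (name, similarity) pairs for the other users, most similar first."""
    scores = []
    for name, other in all_ratings.items():
        if name == target:
            continue
        sim = co_rated_similarity(all_ratings[target], other)
        if sim is not None:
            scores.append((name, sim))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for name, sim in rank_neighbors("Ana", ratings):
    print(name, round(sim, 2))
# Bo 0.93
# Freddie 0.87
```

With the full survey loaded into `ratings`, the same ranking could be produced for any of the eight friends, and the top-ranked neighbors' new ratings would be the ones to weight most heavily.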