Cosine Similarity

Summary

Cosine similarity measures how similar two non-zero vectors are by computing the cosine of the angle between them. It depends only on the directions of the vectors, not on their lengths.

Understanding Cosine Similarity

Document Example

Let's say you have two documents with the following word counts (vectors):

  • Document A: [3 apples, 0 oranges, 2 bananas]
  • Document B: [1 apple, 1 orange, 1 banana]

We can think of each document as a point in 3D space (the dimensions are apples, oranges, and bananas).

  • Document A is at point (3, 0, 2) because it has 3 apples, 0 oranges, and 2 bananas.
  • Document B is at point (1, 1, 1) because it has 1 apple, 1 orange, and 1 banana.

Now, to find how similar these documents are, cosine similarity computes the cosine of the angle between the two vectors, ignoring their lengths. If the vectors point in the same direction, the documents are similar. The more the directions diverge, the less similar the documents are.

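For word-count vectors, whose components are never negative, the cosine similarity will be a number between 0 and 1 (for vectors with negative components it can range from -1 to 1):

  • 1 means the vectors point in exactly the same direction (the documents use the same words in the same proportions).
  • 0 means the vectors are perpendicular (the documents have no words in common).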
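
The two boundary cases can be checked with a minimal sketch. The helper name `cosine_similarity` and the test vectors below are illustrative assumptions, not part of the original example:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Second vector is the first scaled by 2: same direction, different length.
same_direction = cosine_similarity([3, 0, 2], [6, 0, 4])

# The vectors share no non-zero dimension, so the dot product is 0.
no_shared_words = cosine_similarity([3, 0, 0], [0, 1, 1])

print(same_direction, no_shared_words)
```

Doubling every count leaves the score at 1 (up to floating-point rounding), which is why a long document and a short one with the same word proportions are treated as equally similar.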

Calculation Example

Step 1: Represent the Documents as Vectors

  • Document A (Vector A): [3, 0, 2]
  • Document B (Vector B): [1, 1, 1]
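
In practice the vectors usually come from counting words. The sketch below assumes a fixed vocabulary order and hypothetical token lists chosen to reproduce the counts above; the name `to_count_vector` is likewise an illustration:

```python
from collections import Counter

# Hypothetical vocabulary and token lists, chosen to match the example counts.
vocabulary = ["apple", "orange", "banana"]
doc_a_tokens = ["apple", "banana", "apple", "apple", "banana"]
doc_b_tokens = ["apple", "orange", "banana"]

def to_count_vector(tokens, vocabulary):
    """Return the count of each vocabulary word, in vocabulary order."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocabulary]

vector_a = to_count_vector(doc_a_tokens, vocabulary)  # [3, 0, 2]
vector_b = to_count_vector(doc_b_tokens, vocabulary)  # [1, 1, 1]
```

Both documents must use the same vocabulary order, so that each position in the two vectors refers to the same word.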

Step 2: The Cosine Similarity Formula

$$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Step 3: Calculate the Dot Product A·B

$$A \cdot B = (3 \times 1) + (0 \times 1) + (2 \times 1)$$
$$A \cdot B = 3 + 0 + 2 = 5$$
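
The same dot product can be sketched in a few lines of Python. The variable names are illustrative:

```python
vector_a = [3, 0, 2]
vector_b = [1, 1, 1]

# Multiply matching components, then add the products.
products = [a * b for a, b in zip(vector_a, vector_b)]  # [3, 0, 2]
dot_product = sum(products)                             # 5
```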

Step 4: Calculate Vector Magnitudes

For Vector A [3, 0, 2]:

$$\|A\| = \sqrt{3^2 + 0^2 + 2^2} = \sqrt{9 + 0 + 4} = \sqrt{13} \approx 3.606$$

For Vector B [1, 1, 1]:

$$\|B\| = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{1 + 1 + 1} = \sqrt{3} \approx 1.732$$
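
A minimal sketch of the magnitude calculation, assuming the usual Euclidean length. The helper name `magnitude` is an illustration:

```python
import math

vector_a = [3, 0, 2]
vector_b = [1, 1, 1]

def magnitude(vector):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(component ** 2 for component in vector))

norm_a = magnitude(vector_a)  # sqrt(13), about 3.606
norm_b = magnitude(vector_b)  # sqrt(3), about 1.732
```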

Step 5: Final Calculation

$$\text{Cosine Similarity} = \frac{5}{\sqrt{13} \times \sqrt{3}} = \frac{5}{\sqrt{39}}$$
$$\text{Cosine Similarity} \approx \frac{5}{6.245} \approx 0.801$$

A result of about 0.801 indicates that these documents are fairly similar. This makes sense: both mention apples and bananas, but Document B also includes oranges and counts all three fruits equally, so the two vectors do not point in quite the same direction.
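
As a check on the arithmetic, the whole calculation can be run end to end. This is a sketch using only the Python standard library. The conversion to an angle in degrees is an extra step added here for intuition:

```python
import math

vector_a = [3, 0, 2]
vector_b = [1, 1, 1]

# Dot product and magnitudes, as in Steps 3 and 4.
dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
norm_a = math.sqrt(sum(a * a for a in vector_a))
norm_b = math.sqrt(sum(b * b for b in vector_b))

# Step 5: divide the dot product by the product of the magnitudes.
similarity = dot_product / (norm_a * norm_b)

# Convert the cosine back to an angle, purely for intuition.
angle_degrees = math.degrees(math.acos(similarity))

print(f"cosine similarity = {similarity:.3f}")   # 0.801
print(f"angle = {angle_degrees:.1f} degrees")    # roughly 36.8
```

An angle of roughly 37 degrees sits much closer to 0 degrees (identical direction) than to 90 degrees (nothing in common), which matches the "fairly similar" reading.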