Cosine Similarity
Summary
Cosine similarity measures how similar two non-zero vectors are by taking the cosine of the angle between them. It depends on the directions of the vectors, not on their magnitudes.
Understanding Cosine Similarity
Document Example
Let's say you have two documents with the following word counts (vectors):
- Document A: [3 apples, 0 oranges, 2 bananas]
- Document B: [1 apple, 1 orange, 1 banana]
We can think of each document as a point in 3D space (the dimensions are apples, oranges, and bananas).
- Document A is at point (3, 0, 2) because it has 3 apples, 0 oranges, and 2 bananas.
- Document B is at point (1, 1, 1) because it has 1 apple, 1 orange, and 1 banana.
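As a minimal sketch of how such vectors could be built from raw words, the snippet below counts each word and lays the counts out in a fixed order. The names `VOCABULARY` and `to_vector`, and the apple/orange/banana ordering, are assumptions made for illustration only.

```python
from collections import Counter

# Assumed fixed dimension order: apples, oranges, bananas.
VOCABULARY = ["apple", "orange", "banana"]

def to_vector(words, vocabulary=VOCABULARY):
    """Return the count of each vocabulary word, in vocabulary order."""
    counts = Counter(words)
    # Counter returns 0 for words that never appear.
    return [counts[term] for term in vocabulary]

# Hypothetical documents matching the word counts in the example.
doc_a = ["apple"] * 3 + ["banana"] * 2
doc_b = ["apple", "orange", "banana"]

vector_a = to_vector(doc_a)  # [3, 0, 2]
vector_b = to_vector(doc_b)  # [1, 1, 1]
```

Keeping the vocabulary order fixed matters: both documents must use the same dimension for the same word, or the comparison is meaningless.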
Now, to find how similar these documents are, cosine similarity uses the cosine of the angle between the two vectors. If they point in the same direction, the documents are similar. The further apart their directions, the less similar the documents are.
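For word-count vectors, whose entries are never negative, the cosine similarity is a number between 0 and 1 (for vectors that can contain negative entries, the range is -1 to 1):
- 1 means they're pointing in exactly the same direction (the documents use the words in the same proportions).
- 0 means the vectors are perpendicular (the documents share no words).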
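The two endpoints can be checked with a short sketch. The helper `cosine_similarity` and the test vectors here are illustrative assumptions, not part of the example documents. The sketch also assumes neither vector is all zeros, since that would cause a division by zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Doubling every count changes the length but not the direction.
same_direction = cosine_similarity([3, 0, 2], [6, 0, 4])

# No shared words: the dot product is zero, so the similarity is zero.
no_shared_words = cosine_similarity([3, 0, 0], [0, 1, 0])
```

Here `same_direction` comes out at 1 (up to floating-point rounding) and `no_shared_words` is 0, which shows that document length alone does not affect the score.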
Calculation Example
Step 1: Represent the Documents as Vectors
- Document A (Vector A): [3, 0, 2]
- Document B (Vector B): [1, 1, 1]
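Step 2: The Cosine Similarity Formula
$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
Step 3: Calculate the Dot Product A·B
$$A \cdot B = (3 \times 1) + (0 \times 1) + (2 \times 1) = 5$$
Step 4: Calculate Vector Magnitudes
For Vector A [3, 0, 2]:
$$\|A\| = \sqrt{3^2 + 0^2 + 2^2} = \sqrt{13} \approx 3.606$$
For Vector B [1, 1, 1]:
$$\|B\| = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{3} \approx 1.732$$
Step 5: Final Calculation
$$\cos(\theta) = \frac{5}{\sqrt{13} \times \sqrt{3}} = \frac{5}{\sqrt{39}} \approx 0.801$$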
The result of 0.801 indicates that these documents are fairly similar, which makes sense as they both mention apples and bananas, though Document B also includes oranges.
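The five steps can be reproduced in a few lines of plain Python. This is a sketch using only the standard library, and the variable names are chosen for illustration.

```python
import math

# Step 1: the documents as vectors (apples, oranges, bananas).
vector_a = [3, 0, 2]
vector_b = [1, 1, 1]

# Step 3: dot product, 3*1 + 0*1 + 2*1 = 5.
dot_product = sum(a * b for a, b in zip(vector_a, vector_b))

# Step 4: magnitudes, sqrt(13) and sqrt(3).
magnitude_a = math.sqrt(sum(a * a for a in vector_a))
magnitude_b = math.sqrt(sum(b * b for b in vector_b))

# Steps 2 and 5: divide the dot product by the product of the magnitudes.
similarity = dot_product / (magnitude_a * magnitude_b)

print(round(similarity, 3))  # 0.801
```

For larger vectors, a numerical library would usually be a better fit than pure Python loops, but the arithmetic is the same.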