I have this problem right now in Postgres. I have 600,000 feed_items with the schema (feed_item_id uuid, author varchar, content text, guid varchar, link varchar, title varchar, summary text, feed_id integer). The content and summary columns, especially for some of the news items, are highly similar but not equal. For any given two such news items, I am trying to cut them down to one. Any ideas?
I've implemented a MinHash-like system in BigQuery, and was able to calculate cosine similarities (within a certain similarity threshold) across the entire Stack Overflow dataset in reasonable time. To give you a broad idea of the procedure:
1. Concatenate and split all text fields into an array of ngrams (2...n chars)
2. Declare two global arrays A & B of length-n and fill them with random 32-64 bit integers
3. Hash each ngram to a 32-64 bit integer, multiply the resulting hash by each random value in array A, and take the result modulo each corresponding random value in array B. Keep the minimum value per position. Per row, the aim is to end up with an array of 'minhashed' integers of the same length as the arrays from step 2: if you declare the global arrays to be of length 64, the minhashed array for each row will also be of length 64.
4. Bucketise the resulting hash arrays by summing N consecutive minhash values using a window function (e.g. sum each 4 consecutive rows)
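To make the mechanics concrete, here's a minimal Python sketch of steps 1-4 (the real thing was BigQuery SQL; the hash function, n-gram size, and the 64/4 parameters below are arbitrary picks for illustration, not necessarily what you'd want in production):

```python
import hashlib
import random

NUM_HASHES = 64   # length of the global arrays A and B from step 2
BAND_SIZE = 4     # how many consecutive minhash values to sum per bucket (step 4)
NGRAM_SIZE = 3    # character n-gram size for step 1 (arbitrary pick)

rng = random.Random(42)
A = [rng.getrandbits(32) | 1 for _ in range(NUM_HASHES)]  # random multipliers
B = [rng.getrandbits(32) | 1 for _ in range(NUM_HASHES)]  # random (nonzero) moduli

def ngrams(text, n=NGRAM_SIZE):
    """Step 1: split the concatenated text fields into character n-grams."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def hash_ngram(gram):
    """Step 3, first half: hash each n-gram to a 32-bit integer."""
    return int.from_bytes(hashlib.md5(gram.encode()).digest()[:4], "little")

def minhash_signature(text):
    """Step 3: for each (a, b) pair, keep the minimum of (hash * a) mod b over all n-grams."""
    hashes = [hash_ngram(g) for g in ngrams(text)]
    if not hashes:
        return [0] * NUM_HASHES
    return [min((h * a) % b for h in hashes) for a, b in zip(A, B)]

def bucketise(signature, band_size=BAND_SIZE):
    """Step 4: sum each run of `band_size` consecutive minhash values."""
    return [sum(signature[i:i + band_size])
            for i in range(0, len(signature), band_size)]
```

Something like `bucketise(minhash_signature(title + " " + summary))` then gives you the per-row bucket values to join on.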
If all went well, you can now unnest these arrays (call them 'source rows') and self-join the dataset on each bucketed minhash value (resulting in an additional column of 'target rows'). Group by the source and target columns and count the occurrences to get an indication of how likely it is that two rows are similar.
In essence, the more often two items hash to the same bucket, the more similar they are, and it's up to you to define the cut-off above which to calculate actual pairwise Jaccard (or cosine) similarities between items.
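Continuing the Python sketch above, the self-join plus group-by plus cut-off can be mimicked with a dict keyed on (bucket position, bucket value); `min_shared_buckets` here is just a stand-in for that cut-off:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(docs, min_shared_buckets=2):
    """docs: {row_id: concatenated text}. Returns {(source, target): jaccard}
    for pairs that collided in at least `min_shared_buckets` buckets."""
    buckets = defaultdict(list)                        # (bucket position, bucket value) -> row ids
    for row_id, text in docs.items():
        for pos, value in enumerate(bucketise(minhash_signature(text))):
            buckets[(pos, value)].append(row_id)

    pair_counts = defaultdict(int)                     # 'group by source, target and count'
    for row_ids in buckets.values():
        for a, b in combinations(sorted(row_ids), 2):
            pair_counts[(a, b)] += 1

    survivors = {}
    for (a, b), n in pair_counts.items():
        if n >= min_shared_buckets:                    # the cut-off mentioned above
            ga, gb = ngrams(docs[a]), ngrams(docs[b])
            survivors[(a, b)] = len(ga & gb) / (len(ga | gb) or 1)
    return survivors
```

For the original question, `candidate_pairs({feed_item_id: title + " " + content, ...})` would give you the near-duplicate pairs to collapse.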
Love this, and I have been using tf/idf embeddings and various measures of similarity for some personal pet projects. One thing I came across in my research is that cosine similarity was more useful for vectors of different lengths, while Euclidean distance was useful for vectors of similar length, but Simon alludes to a same-length requirement. I'm not formally trained in this area, so I was hoping someone could shed some light on this for me.
You can use cosine similarity with embedding vectors of 'different lengths' (or, more precisely, the vectors all have the same length, but they are sparse, with most components being 0).
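For example (scikit-learn here purely for illustration), two texts of very different lengths still end up as vectors of the same dimensionality, just mostly zeros:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["short headline about a merger",
        "a much longer article body that discusses the same merger in detail"]

# Both rows live in the same TF-IDF space (same number of columns),
# but each row is sparse, with most components being 0.
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)                         # (2, vocabulary_size): equal-length vectors
print(cosine_similarity(X[0], X[1]))   # cosine works fine despite the different text lengths
```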
You can use MinHash to avoid the full O(N^2) distance matrix, but with just 600,000 items you might just brute-force the full matrix for simplicity. What's your time budget?
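If you do go the brute-force route, something like this rough sketch (scikit-learn TF-IDF, arbitrary threshold and block size) avoids ever materialising the full 600,000 x 600,000 matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def brute_force_pairs(texts, threshold=0.9, block=500):
    """Compute pairwise cosine similarities block by block, keeping only
    pairs above `threshold` so the full N x N matrix is never stored."""
    X = TfidfVectorizer().fit_transform(texts)                # sparse N x vocab matrix
    hits = []
    for start in range(0, X.shape[0], block):
        sims = cosine_similarity(X[start:start + block], X)   # dense (block x N) slab
        rows, cols = np.where(sims >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + r, c
            if i < j:                                         # each pair once, no self-matches
                hits.append((i, j, float(sims[r, c])))
    return hits
```

Smaller blocks use less memory per slab, larger ones are faster; it's still O(N^2) work overall, which is why the time budget matters.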