I have this problem right now in Postgres. I have 600,000 feed_items with the schema (feed_item_id uuid, author varchar, content text, guid varchar, link varchar, title varchar, summary text, feed_id integer). The content and summary columns, especially for some of the news items, are highly similar but not equal. For any given two such news items, I am trying to cut them down to one. Any ideas?
I've implemented a MinHash-like system in BigQuery, and was able to calculate cosine similarities (within a certain similarity threshold) across the entire Stack Overflow dataset in reasonable time. To give you a broad idea of the procedure:
1. Concatenate and split all text fields into an array of ngrams (2...n chars)
2. Declare two global arrays A & B of length-n and fill them with random 32-64 bit integers
3. Hash each ngram to a 32-64 bit integer, multiply the resulting hash by each random value in array A, and take the result modulo each corresponding random value in array B. Keep the minimum value per position. Per row, the aim is to end up with an array of 'minhashed' integers of the same length as the arrays from step 2: if you declare the global arrays to be of length 64, the minhashed array for each row will also be of length 64.
4. Bucketise the resulting hash arrays by summing N consecutive minhash values using a window function (e.g. sum each 4 consecutive rows)
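To make the mechanics concrete, here's a minimal Python sketch of steps 1-4 (the real thing was BigQuery SQL; the hash function, n-gram size, and the 64/4 parameters below are arbitrary picks for illustration, not necessarily what you'd want in production):

```python
import hashlib
import random

NUM_HASHES = 64   # length of the global arrays A and B from step 2
BAND_SIZE = 4     # how many consecutive minhash values to sum per bucket (step 4)
NGRAM_SIZE = 3    # character n-gram size for step 1 (arbitrary pick)

rng = random.Random(42)
A = [rng.getrandbits(32) | 1 for _ in range(NUM_HASHES)]  # random multipliers
B = [rng.getrandbits(32) | 1 for _ in range(NUM_HASHES)]  # random (nonzero) moduli

def ngrams(text, n=NGRAM_SIZE):
    """Step 1: split the concatenated text fields into character n-grams."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def hash_ngram(gram):
    """Step 3, first half: hash each n-gram to a 32-bit integer."""
    return int.from_bytes(hashlib.md5(gram.encode()).digest()[:4], "little")

def minhash_signature(text):
    """Step 3: for each (a, b) pair, keep the minimum of (hash * a) mod b over all n-grams."""
    hashes = [hash_ngram(g) for g in ngrams(text)]
    if not hashes:
        return [0] * NUM_HASHES
    return [min((h * a) % b for h in hashes) for a, b in zip(A, B)]

def bucketise(signature, band_size=BAND_SIZE):
    """Step 4: sum each run of `band_size` consecutive minhash values."""
    return [sum(signature[i:i + band_size])
            for i in range(0, len(signature), band_size)]
```

Something like `bucketise(minhash_signature(title + " " + summary))` then gives you the per-row bucket values to join on.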
If all went well, you can now unnest these arrays (call them 'source rows') and self-join the dataset on each bucketed minhash value (resulting in an additional column of 'target rows'). Group by the source and target columns and count the occurrences to get an indication of how likely it is that two rows are similar.
In essence, the more often two items hash to the same bucket, the more similar they are, and it's up to you to define the cut-off above which to calculate actual pairwise Jaccard (or cosine) similarities between items.
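Continuing the Python sketch above, the self-join plus group-by plus cut-off can be mimicked with a dict keyed on (bucket position, bucket value); `min_shared_buckets` here is just a stand-in for that cut-off:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(docs, min_shared_buckets=2):
    """docs: {row_id: concatenated text}. Returns {(source, target): jaccard}
    for pairs that collided in at least `min_shared_buckets` buckets."""
    buckets = defaultdict(list)                        # (bucket position, bucket value) -> row ids
    for row_id, text in docs.items():
        for pos, value in enumerate(bucketise(minhash_signature(text))):
            buckets[(pos, value)].append(row_id)

    pair_counts = defaultdict(int)                     # 'group by source, target and count'
    for row_ids in buckets.values():
        for a, b in combinations(sorted(row_ids), 2):
            pair_counts[(a, b)] += 1

    survivors = {}
    for (a, b), n in pair_counts.items():
        if n >= min_shared_buckets:                    # the cut-off mentioned above
            ga, gb = ngrams(docs[a]), ngrams(docs[b])
            survivors[(a, b)] = len(ga & gb) / (len(ga | gb) or 1)
    return survivors
```

For the original question, `candidate_pairs({feed_item_id: title + " " + content, ...})` would give you the near-duplicate pairs to collapse.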
Love this, and I have been using tf/idf embeddings and various measures of similarity for some personal pet projects. One thing I came across in my research is that cosine similarity was more useful for vectors of different lengths, while Euclidean distance was useful for vectors of similar length, but Simon alludes to a same-length requirement. I'm not formally trained in this area, so I was hoping someone could shed some light on this for me.
You can use cosine similarity with embedding vectors of 'different lengths' (or, more precisely, the vectors all have the same length, but they are sparse, with most components being 0).
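For example (scikit-learn here purely for illustration), two texts of very different lengths still end up as vectors of the same dimensionality, just mostly zeros:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["short headline about a merger",
        "a much longer article body that discusses the same merger in detail"]

# Both rows live in the same TF-IDF space (same number of columns),
# but each row is sparse, with most components being 0.
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)                         # (2, vocabulary_size): equal-length vectors
print(cosine_similarity(X[0], X[1]))   # cosine works fine despite the different text lengths
```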
You can use MinHash to avoid the full O(N^2) distance matrix, but with just 600,000 items you might just brute-force the full matrix for simplicity. What's your time budget?
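If you do go the brute-force route, something like this rough sketch (scikit-learn TF-IDF, arbitrary threshold and block size) avoids ever materialising the full 600,000 x 600,000 matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def brute_force_pairs(texts, threshold=0.9, block=500):
    """Compute pairwise cosine similarities block by block, keeping only
    pairs above `threshold` so the full N x N matrix is never stored."""
    X = TfidfVectorizer().fit_transform(texts)                # sparse N x vocab matrix
    hits = []
    for start in range(0, X.shape[0], block):
        sims = cosine_similarity(X[start:start + block], X)   # dense (block x N) slab
        rows, cols = np.where(sims >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + r, c
            if i < j:                                         # each pair once, no self-matches
                hits.append((i, j, float(sims[r, c])))
    return hits
```

Smaller blocks use less memory per slab, larger ones are faster; it's still O(N^2) work overall, which is why the time budget matters.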