MinHash: Document-level Deduplication

Introduce

Document-level de-duplication: Global MinHash de-duplication across the entire the dataset to remove near duplicate documents.

MinHash estimates the Jaccard similarity (resemblance) between sets of arbitrary sizes in linear time using a small and fixed memory space. It can also be used to compute Jaccard similarity between data streams.

Implement

from datasketch import MinHash

data1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'datasets']
data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents']

m1, m2 = MinHash(), MinHash()
for d in data1:
    m1.update(d.encode('utf8'))
for d in data2:
    m2.update(d.encode('utf8'))
print("Estimated Jaccard for data1 and data2 is", m1.jaccard(m2))

s1 = set(data1)
s2 = set(data2)
actual_jaccard = float(len(s1.intersection(s2)))/float(len(s1.union(s2)))
print("Actual Jaccard for data1 and data2 is", actual_jaccard)

Refs.

datasketch.MinHash

Andrei Z. Broder. On the resemblance and containment of documents

Data Preprocessing — Deduplication with MinHash and LSH

Comments

MinHash: Document-level Deduplication

Introduce

Implement

Comments

Related Posts

Published

Category

Tags

Contact