lucene - The best way to search millions of fuzzy hashes -
i have spamsum composite hashes ten million files in database table , find files reasonably similar each other. spamsum hashes composed of 2 ctph hashes of maximum 64 bytes , this:
384:w2mhnfnjf47jdnunek3slbjj+sgfoypayjwsn3gdqymefd4kkagxqcfotpi0nd:wemfogxqcfotpi0nd
they can broken down 3 sections (split string on colons):
- block size:
384
in hash above - first signature:
w2mhnfnjf47jdnunek3slbjj+sgfoypayjwsn3gdqymefd4kkagxqcfotpi0nd
- second signature:
wemfogxqcfotpi0nd
block size refers block size first signature, , block size second signature twice of first signature (here: 384 x 2 = 768). each file has 1 of these composite hashes, means each file has 2 signatures different block sizes.
the spamsum signatures can compared if block sizes correspond. composite hash above can compared other composite hash contains signature block size of 384 or 768. similarity of signature strings hashes similar block size can takes measure of similarity between files represented hashes.
so if have:
file1.blk2 = 768
file1.sig2 = wemfogxqcfotpi0nd
file2.blk1 = 768
file2.sig1 = lsmfogxqcfotpi0nd
we can sense of degree of similarity of 2 files calculating weighted edit distance (like levenshtein distance) 2 signatures. here 2 files seem pretty similar.
leven_dist(file1.sig2, file2.sig1) = 2
one can calculate normalized similarity score between 2 hashes (see details here).
i find 2 files more 70% similar based on these hashes, , have strong preference using available software packages (or apis/sdks), although not afraid of coding way through problem.
i have tried breaking hashes down , indexing them using lucene (4.7.0), search seems slow , tedious. here example of lucene queries have tried (for each single signature -- twice per hash , using case-sensitive keywordanalyzer):
(blk1:768 , sig1:wemfogxqcfotpi0nd~0.7) or (blk2:768 , sig2:wemfogxqcfotpi0nd~0.7)
it seems lucene's incredibly fast levenshtein automata not accept edit distance limits above 2 (i need support 0.7 x 64 ≃ 19) , normal editing distance algorithm not optimized long search terms (the brute force method used not cut off calculation once distance limit reached.) said, may query not optimized want do, don't hesitate correct me on that.
i wondering whether can accomplish need using of algorithms offered lucene, instead of directly calculating editing distance. have heard bk-trees best way index such searches, don't know of available implementations of algorithm (does lucene use @ all?). have heard probable solution narrow down search list using n-gram methods not sure how compares editing distance calculation in terms of inclusiveness , speed (i pretty sure lucene supports one). , way, there way have lucene run term search in parallel mode?
given using lucene pre-match hashes , calculate real similarity score using appropriate algorithm later, need method @ least inclusive levenshtein distance used in similarity score calculation -- is, don't want pre-matching method exclude hashes flagged matches scoring algorithm.
any help/theory/reference/code or clue start appreciated.
this not definitive answer question, have tried number of methods ever since. assuming hashes saved in database, suggestions remain valid in-memory data structures well.
- save signatures (2 per hash) along corresponding block sizes in separate child table. since signatures of same size can compared each other, can filter table block size before starting compare signatures.
- reduce repetitive sequences of more 3 characters 3 characters ('bbbbb' -> 'bbb'). spamsum's comparison algorithm automatically.
- spamsum uses rolling window of 7 compare signatures, , won't compare 2 signatures not have 7-character overlap after eliminating excessive repetitions. if using database support lists/arrays fields, create field list of possible 7-character sequences extracted each signature. create fastest exact match index have access on field. before trying find distance of 2 signatures, first try exact matches on field (any seven-gram in common?).
- the last step experimenting save signatures , seven-grams 2 modes of bipartite graph, projecting graph single mode (composed of hashes only), , calculating levenshtein distance on adjacent nodes similar block sizes.
the above steps pre-matching , substantially reduce number of signatures each signature has compared with. after these the modified levenshtein/damreau distance has calculated.
Comments
Post a Comment