ios - How to effectively use the Levenshtein algorithm for text auto-completion -
i'm using levenshtein distance algorithm filter through text in order determine best matching result purpose of text field auto-completion (and top 5 best results).
currently, have array of strings, , apply algorithm each 1 in attempt determine how close of match text typed user. problem i'm not sure how interpret values outputted algorithm rank results expected.
for example: (text typed = "nvmb")
- result: "game" ; levenshtein distance = 3 (best match)
- result: "number stars" ; levenshtein distance = 13 (second best match)
this technically makes sense; second result needs many more 'edits', because of it's length. problem second result logically , visually closer match first one. it's if should ignore characters longer length of typed text.
any ideas on how achieve this?
the levenshtein distance corresponds number of single-character insertions, deletions , substitutions in optimal global pairwise alignment of 2 sequences if gap , mismatch costs 1.
the needleman-wunsch dp algorithm find such alignment, in addition score (it's same dp algorithm 1 used calculate levenshtein distance, option weight gaps, , mismatches between given pair of characters, arbitrarily). there more general models of alignment allow reduced penalties gaps @ start or end (and reduced penalties contiguous blocks of gaps, may useful here, although doesn't directly answer question). @ 1 extreme, have local alignment, pay no penalty @ gaps @ ends -- computed smith-waterman dp algorithm. think want here in-between: want penalise gaps @ start of both query , test strings, , gaps in test string @ end, not gaps in query string @ end. way, trailing mismatches cost nothing, , costs like:
query: nvmb costs: 0100000000000000 = 1 in total against: number stars query: nvmb costs: 1101 = 3 in total against: game query: number stars costs: 0100111111111111 = 13 in total against: nvmb query: ber star costs: 1110001111100000 = 8 in total against: number stars query: numbor costs: 111110000100000000000 = 6 in total against: number stars
(in fact might want give trailing mismatches small nonzero penalty, exact match preferred prefix-only match.)
the algorithm
suppose query string has length n, , string b testing against has length m. let d[i][j] dp table value @ (i, j) -- is, cost of optimal alignment of length-i prefix of length-j prefix of b. if go 0 penalty trailing mismatches, need modify nw algorithm in simple way: instead of calculating , returning dp table value d[n][m], need calculate table before, , find minimum of d[n][j], 0 <= j <= m. corresponds best match of query string against any prefix of test string.
Comments
Post a Comment