Summary of the Levenshtein-based fuzzy search algorithm
The algorithm efficiently finds all strings within a specified Levenshtein distance \( k \) of a query string in a dataset. It combines optimized Levenshtein distance calculations with a data structure called a BK-tree to perform fast fuzzy searches.
Steps taken to create the algorithm
- Identifying the challenge
  - Needed a fast method to find approximate string matches in large datasets.
  - Standard Levenshtein distance calculations were too slow for this purpose.
- Selecting an appropriate data structure
  - Chose the BK-tree for its suitability in metric spaces and its efficient search capabilities.
  - BK-trees work well with Levenshtein distance since it satisfies the triangle inequality.
- Optimizing the Levenshtein distance calculation
  - Implemented Ukkonen's algorithm to reduce computation time.
  - This optimization limits calculations to a narrow band around the main diagonal of the dynamic-programming matrix.
  - Added early termination to stop a calculation as soon as the distance is guaranteed to exceed \( k \) (see the sketch below).
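As a concrete illustration, here is a minimal sketch of what such a banded, early-exit distance routine could look like in C. The function name `levenshtein_banded` and its convention of returning `max_dist + 1` whenever the distance exceeds the cutoff are assumptions made for this sketch, not the project's actual code.

```c
/*
 * Sketch of a banded Levenshtein distance with an early-exit cutoff, in the
 * spirit of the optimization described above. Returns the exact distance if
 * it is <= max_dist, and max_dist + 1 otherwise.
 */
#include <stdlib.h>
#include <string.h>

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

int levenshtein_banded(const char *s, const char *t, int max_dist) {
    int n = (int)strlen(s);
    int m = (int)strlen(t);

    /* The length difference alone is a lower bound on the distance. */
    if (abs(n - m) > max_dist)
        return max_dist + 1;

    int *prev = malloc((size_t)(m + 1) * sizeof *prev);
    int *curr = malloc((size_t)(m + 1) * sizeof *curr);

    for (int j = 0; j <= m; j++)
        prev[j] = j;

    for (int i = 1; i <= n; i++) {
        /* Only cells within max_dist of the main diagonal can stay <= max_dist. */
        int lo = i - max_dist < 1 ? 1 : i - max_dist;
        int hi = i + max_dist > m ? m : i + max_dist;

        for (int j = 0; j <= m; j++)
            curr[j] = max_dist + 1;    /* cells outside the band act as "too far" */
        curr[0] = i;

        int row_min = curr[0];
        for (int j = lo; j <= hi; j++) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = min3(prev[j] + 1,          /* deletion     */
                           curr[j - 1] + 1,      /* insertion    */
                           prev[j - 1] + cost);  /* substitution */
            if (curr[j] < row_min)
                row_min = curr[j];
        }

        /* Early termination: the whole row already exceeds the cutoff. */
        if (row_min > max_dist) {
            free(prev);
            free(curr);
            return max_dist + 1;
        }

        int *tmp = prev; prev = curr; curr = tmp;
    }

    int result = prev[m] > max_dist ? max_dist + 1 : prev[m];
    free(prev);
    free(curr);
    return result;
}
```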
- Constructing the BK-tree
  - Started with an empty tree.
  - For each word in the dataset:
    - If the tree is empty, set the word as the root.
    - Otherwise, insert the word by:
      - Computing the Levenshtein distance \( d \) to the current node's word.
      - If a child at distance \( d \) exists, moving to that child and repeating.
      - If not, creating a new child node at distance \( d \).
  - This organizes the dataset so that similar words are grouped together (an insert routine is sketched below).
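A minimal sketch of the node layout and insertion loop just described. The names `bk_node`, `bk_insert`, and `MAX_CHILD_DIST` are illustrative, the `levenshtein_banded()` helper is the hypothetical routine from the previous sketch (called with a loose cutoff so insertion sees exact distances), and the sketch assumes no pair of words is farther apart than `MAX_CHILD_DIST`.

```c
/*
 * Sketch of a BK-tree node and the insertion steps listed above. Names are
 * illustrative; levenshtein_banded() is the hypothetical routine from the
 * previous sketch.
 */
#include <stdlib.h>
#include <string.h>

#define MAX_CHILD_DIST 64   /* assumes no word pair is farther apart than this */

int levenshtein_banded(const char *s, const char *t, int max_dist);  /* earlier sketch */

typedef struct bk_node {
    char *word;
    struct bk_node *children[MAX_CHILD_DIST + 1];  /* children[d] = child at edge distance d */
} bk_node;

static bk_node *bk_node_new(const char *word) {
    bk_node *node = calloc(1, sizeof *node);
    node->word = strdup(word);
    return node;
}

/* Insert one word and return the (possibly new) root. */
bk_node *bk_insert(bk_node *root, const char *word) {
    if (root == NULL)                       /* empty tree: the word becomes the root */
        return bk_node_new(word);

    bk_node *node = root;
    for (;;) {
        int d = levenshtein_banded(node->word, word, MAX_CHILD_DIST);
        if (d == 0 || d > MAX_CHILD_DIST)   /* duplicate word, or outside the assumed bound */
            return root;
        if (node->children[d] == NULL) {    /* no child at distance d yet: attach here */
            node->children[d] = bk_node_new(word);
            return root;
        }
        node = node->children[d];           /* otherwise descend to that child and repeat */
    }
}
```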
- Implementing the fuzzy search
  - To search for a query string:
    - Begin at the root node.
    - Compute the Levenshtein distance \( d \) between the query and the node's word using the optimized routine.
    - If \( d \leq k \), add the word to the results.
    - Recurse into child nodes whose edge distances \( e \) satisfy \( d - k \leq e \leq d + k \).
  - This approach prunes branches that cannot contain matches, improving search efficiency (see the sketch below).
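A sketch of the recursive lookup with the pruning rule above, again assuming the hypothetical `bk_node` type, `MAX_CHILD_DIST` bound, and `levenshtein_banded()` helper from the earlier sketches; reporting matches through a callback is an illustrative choice. The exact distance to each visited node is computed with a loose cutoff so that the pruning window can be formed even when \( d > k \).

```c
/*
 * Sketch of the recursive BK-tree lookup with triangle-inequality pruning.
 * Assumes the bk_node type, MAX_CHILD_DIST bound, and levenshtein_banded()
 * helper from the earlier sketches.
 */
void bk_search(const bk_node *node, const char *query, int k,
               void (*report)(const char *word, int dist)) {
    if (node == NULL)
        return;

    int d = levenshtein_banded(node->word, query, MAX_CHILD_DIST);

    if (d <= k)
        report(node->word, d);   /* this node's word is within the allowed distance */

    /* Triangle inequality: a match can only live under a child whose edge
       distance e satisfies |e - d| <= k, i.e. d - k <= e <= d + k. */
    int lo = d - k < 1 ? 1 : d - k;
    int hi = d + k > MAX_CHILD_DIST ? MAX_CHILD_DIST : d + k;
    for (int e = lo; e <= hi; e++)
        bk_search(node->children[e], query, k, report);
}
```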
- Writing the code
  - Developed functions for:
    - Optimized Levenshtein distance calculation, implementing Ukkonen's algorithm for efficiency.
    - BK-tree insertion and traversal, ensuring the tree correctly applies the triangle inequality during search.
  - Identified and fixed issues:
    - During testing, noticed that the search was returning incorrect results and was slower than expected.
    - Realized that the search function was not correctly applying the triangle inequality, leading to excessive node exploration.
    - Updated the search function to use the correct condition \( |e - d| \leq k \), where \( e \) is the edge distance and \( d \) is the distance between the query and the current node's word.
    - This correction significantly improved accuracy and performance.
  - Code availability:
    - The implementation is in C, reading words from `words.txt`. It includes:
      - `bk_tree.c` and `bk_tree.h` for the BK-tree implementation.
      - `levenshtein.c` and `levenshtein.h` for the optimized Levenshtein distance function.
      - `main.c` as the entry point, handling wordlist loading and search execution.
      - A `Makefile` for easy compilation.
Runtime analysis
Let:
- \( n \): total number of words in the dataset.
- \( l \): average length of the words.
- \( k \): maximum allowed Levenshtein distance.
- \( m \): number of nodes visited during the search (usually \( m \ll n \)).
Per distance calculation
- Time complexity: \( O(k \cdot l) \)
- Thanks to Ukkonen's algorithm reducing unnecessary computations.
Total runtime
- Best-case runtime
  - Time complexity: \( O(k \cdot l) \)
  - Occurs when:
    - The query word is identical or very similar to the root node's word.
    - Minimal traversal is needed.
    - Early termination frequently stops calculations.
- Average-case runtime
  - Time complexity: \( O(m \cdot k \cdot l) \)
  - Occurs when:
    - The BK-tree is well balanced.
    - Only a small subset of nodes (\( m \)) needs to be explored.
    - Effective pruning reduces the search space.
- Worst-case runtime
  - Time complexity: \( O(n \cdot k \cdot l) \)
  - Occurs when:
    - The BK-tree is unbalanced.
    - The search has to visit all nodes.
    - Pruning is ineffective because \( k \) is large or the dataset is unfavourable.
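Putting the symbols together, a rough way to see where these bounds come from (this only restates the cases above):

\[
T_{\text{search}} \;\approx\; \underbrace{m}_{\text{nodes visited}} \times \underbrace{O(k \cdot l)}_{\text{cost per banded distance}} \;=\; O(m \cdot k \cdot l),
\]

with \( m \approx 1 \) in the best case (giving \( O(k \cdot l) \)) and \( m = n \) in the worst case (giving \( O(n \cdot k \cdot l) \)).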
Additional considerations
- Applications
  - Spell checkers: suggest corrections for misspelled words.
  - Auto-completion: provide word suggestions as users type.
  - Search engines: find relevant results despite typos or slight variations.
- Advantages
  - Efficient even with large datasets.
  - Adjustable fuzziness by changing \( k \).
  - Precise control over the performance and accuracy trade-off.
- Limitations
  - Performance depends on keeping \( k \) small.
  - The BK-tree can become unbalanced without careful insertion.
  - Less effective for very long strings or languages with large alphabets.
Practical tips
- Choosing \( k \)
  - Smaller \( k \) values improve performance.
  - Typical values range from 1 to 3, balancing speed and match tolerance.
- Maintaining the BK-tree
  - Insert words in random order to promote balance (see the sketch below).
  - Periodically rebuild or rebalance the tree if necessary.
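One simple way to follow the random-order tip, assuming the hypothetical `bk_node` type and `bk_insert()` routine from the earlier sketches; the Fisher-Yates shuffle itself is standard.

```c
/*
 * Sketch of randomized insertion order: shuffle the word list before building
 * the tree, so insertion order is independent of the (often sorted) file order.
 */
#include <stdlib.h>
#include <time.h>

bk_node *bk_build_shuffled(char **words, size_t count) {
    srand((unsigned)time(NULL));

    /* Fisher-Yates shuffle. */
    for (size_t i = count; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        char *tmp = words[i - 1];
        words[i - 1] = words[j];
        words[j] = tmp;
    }

    bk_node *root = NULL;
    for (size_t i = 0; i < count; i++)
        root = bk_insert(root, words[i]);
    return root;
}
```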
- Optimizing distance calculations
  - Cache intermediate results when possible.
  - Use efficient data structures for storing partial computations.
- Handling large datasets
  - Split the dataset and build multiple BK-trees.
  - Use parallel processing to search across the trees (a threaded sketch follows below).
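A possible threaded layout for searching several per-shard trees at once, using POSIX threads (compile with `-pthread`). The `bk_node` type and `bk_search()` routine are the hypothetical pieces from the earlier sketches, and one thread per shard is simply the easiest arrangement to show.

```c
/*
 * Sketch of searching several per-shard BK-trees in parallel with pthreads.
 * Assumes the bk_node type and bk_search() routine from the earlier sketches.
 */
#include <pthread.h>
#include <stdio.h>

struct search_task {
    const bk_node *root;   /* one shard of the dataset */
    const char *query;
    int k;
};

static void print_match(const char *word, int dist) {
    printf("%s (distance %d)\n", word, dist);
}

static void *search_worker(void *arg) {
    struct search_task *task = arg;
    bk_search(task->root, task->query, task->k, print_match);
    return NULL;
}

/* Launch one search per shard and wait for all of them to finish. */
void parallel_search(const bk_node **shards, int num_shards, const char *query, int k) {
    pthread_t threads[num_shards];
    struct search_task tasks[num_shards];

    for (int i = 0; i < num_shards; i++) {
        tasks[i] = (struct search_task){ shards[i], query, k };
        pthread_create(&threads[i], NULL, search_worker, &tasks[i]);
    }
    for (int i = 0; i < num_shards; i++)
        pthread_join(threads[i], NULL);
}
```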
Examples in practice
- Spell checker implementation
  - Build a BK-tree from a dictionary.
  - When a user inputs a word, search with an appropriate \( k \).
  - Present suggestions from the search results (a minimal driver is sketched below).
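A minimal driver in the spirit of this example, reusing the hypothetical `bk_node`, `bk_insert()`, and `bk_search()` pieces from the earlier sketches. It assumes `words.txt` holds one dictionary word per line and hard-codes \( k = 2 \) purely for illustration.

```c
/*
 * Minimal spell-checker driver: load the dictionary, build a BK-tree, and
 * print suggestions for the word given on the command line.
 */
#include <stdio.h>
#include <string.h>

static void print_suggestion(const char *word, int dist) {
    printf("  %s (distance %d)\n", word, dist);
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <word>\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen("words.txt", "r");
    if (fp == NULL) {
        perror("words.txt");
        return 1;
    }

    /* Build the BK-tree from the dictionary, one word per line. */
    bk_node *root = NULL;
    char line[256];
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the trailing newline */
        if (line[0] != '\0')
            root = bk_insert(root, line);
    }
    fclose(fp);

    /* Suggest dictionary words within distance 2 of the user's input. */
    printf("suggestions for \"%s\":\n", argv[1]);
    bk_search(root, argv[1], 2, print_suggestion);
    return 0;
}
```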
- Auto-completion system
  - As the user types, incrementally search the BK-tree.
  - Adjust \( k \) based on input length to maintain responsiveness (one possible heuristic is sketched below).
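One possible heuristic for picking \( k \) from the length of the partial input; the thresholds here are illustrative, not taken from the project.

```c
/* Pick a tolerance based on how much the user has typed so far. */
#include <string.h>

int choose_k(const char *prefix) {
    size_t len = strlen(prefix);
    if (len <= 3)
        return 1;   /* short inputs: stay strict so the search stays cheap */
    if (len <= 7)
        return 2;
    return 3;       /* longer inputs can tolerate a bit more fuzziness */
}
```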
- Search engine fuzzy matching
  - Index terms using a BK-tree.
  - During a search, find terms within a small Levenshtein distance.
  - Retrieve and rank documents based on these terms.
Conclusion
By carefully selecting the data structure and optimizing computations, I developed an efficient Levenshtein-based fuzzy search algorithm. The key steps involved recognizing the need for optimization, choosing the BK-tree, enhancing the Levenshtein calculation, constructing the tree, and implementing the search mechanism. This approach provides a practical solution for applications requiring quick and accurate approximate string matching.
Future enhancements
- Alternative distance metrics
  - Use Damerau-Levenshtein distance if transpositions are common errors (a plain reference version is sketched below).
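A plain (unbanded) reference sketch of the optimal-string-alignment variant of Damerau-Levenshtein distance, which counts an adjacent transposition as a single edit; the banding and early-exit ideas from the earlier sketch would apply to it in the same way. Names are illustrative.

```c
/*
 * Optimal-string-alignment Damerau-Levenshtein distance: like Levenshtein,
 * but an adjacent transposition ("ab" <-> "ba") costs one edit.
 */
#include <stdlib.h>
#include <string.h>

int damerau_levenshtein(const char *s, const char *t) {
    int n = (int)strlen(s);
    int m = (int)strlen(t);
    int *d = malloc((size_t)(n + 1) * (size_t)(m + 1) * sizeof *d);
#define D(i, j) d[(i) * (m + 1) + (j)]

    for (int i = 0; i <= n; i++) D(i, 0) = i;
    for (int j = 0; j <= m; j++) D(0, j) = j;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            int best = D(i - 1, j) + 1;                                       /* deletion     */
            if (D(i, j - 1) + 1 < best)        best = D(i, j - 1) + 1;        /* insertion    */
            if (D(i - 1, j - 1) + cost < best) best = D(i - 1, j - 1) + cost; /* substitution */
            /* Adjacent transposition counted as one edit. */
            if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]
                && D(i - 2, j - 2) + 1 < best)
                best = D(i - 2, j - 2) + 1;
            D(i, j) = best;
        }
    }

    int result = D(n, m);
#undef D
    free(d);
    return result;
}
```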
- Caching mechanisms
  - Cache recent queries to speed up repeated searches.
- Hybrid approaches
  - Combine BK-trees with other techniques like trigrams or n-grams for improved performance.
Final thoughts
Understanding the steps taken to create the algorithm helps in tailoring it to specific needs. By adjusting parameters like \( k \) and ensuring the BK-tree remains balanced, you can optimize performance for your particular application. The combination of an efficient data structure and optimized calculations makes this algorithm a powerful tool for fuzzy searching in various contexts.