Implement a fast algorithm which can solve the following type of matching problem.
Full is the type of matcher that the input string must exactly equal to the pattern.
Substr is the type of matcher that the input string must contain the pattern as a sub-string.
Domain is the type of matcher that the input string must be a sub-domain or itself of the pattern.
The DomainMatcher is divided into two parts:
fullanddomainpatterns are matched by Rabin-Karp algorithm & minimal perfect hash table;substrpatterns are matched by ac automaton;
Matching problem definition:
- a
domainrulebaidu.comcan be seen asexactmatchmoc.udiabandmoc.udiab.when traversing the domain names in reverse order. Andmoc.udiabandmoc.udiab.should not appear in the middle of the string. - a
fullrulebaidu.comcan be seen asexactmatchmoc.udiabwhen traversing the domain names in reverse order. Andmoc.udiabshould not appear in the middle of the string. - a
substrrulebaidu.comis a matching problem that check ifbaidu.comis substring of the given domain names.substrrules can be matched byACAutomaton.
Through the above definition, we can merge the full and domain rules together to match. The simplest way is to store these rules in the HashMap. However, when we query, we need to calculate the hash value of the same string and its substrings. This additional overhead can be reduced by rolling hash.
We choose 32bit FNV-prime 16777619 to calculate our rolling hash.
Inspired by "Hash, displace, and compress" algorithm, we can design a minimal perfect hash table through two rounds hashes. The first round of hash is rolling hash, which we get directly from the process of traversing the string. The second round of hash is memhash.
In this way, when checking whether the rule is hit, we only need to calculate the hash and compare it once.
#[inline(always)]
fn lookup(&self, h: RollingHashType, query_string: &str) -> bool {
let level0_idx = h & self.level0_mask;
let seed = self.level0[level0_idx as usize] as Level1HashType;
let level1_idx = seed.mem_hash(query_string) & self.level1_mask;
return self.rules[self.level1[level1_idx as usize] as usize] == query_string;
}