This is a Proof of Concept for storing a large volume of audit facts in Riak by bundling them together into large objects, and then searching for facts using a map function run across hints files that are generated by a pre-commit hook.
The scenario to be covered in the Proof of Concept is as follows:
- A large number of overall facts, of the order of 100 billion;
- Facts grouped into objects, with many more than 1,000 but fewer than 100,000 facts per object;
- Most individual facts appear in fewer than 1% of objects, so there is value in finding facts without trawling through every object;
- All object keys start with a natural filter that can be used to split up the Map job, so that many small map jobs are orchestrated outside of Riak (rather than relying on Riak to manage a single large, long-running job), and each individual map job runs against keys stored contiguously (see the sketch after this list).
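
As a minimal sketch of that orchestration, assuming the riak-erlang-client (riakc), a hypothetical `<<"hints">>` bucket, and the `map_checkhints/3` function described below: each prefix becomes its own small, bounded map job over a contiguous `$key` range. The module names here are illustrative, not part of the PoC.

```erlang
%% Sketch: orchestrate one small map job per key prefix, outside Riak.
%% Bucket and module names are hypothetical assumptions.
-module(poc_orchestrate).
-export([run/3]).

run(Pid, Prefixes, Facts) ->
    lists:flatmap(
      fun(Prefix) ->
              %% A 2i range query on the special $key index selects only
              %% the (contiguously stored) keys sharing this prefix.
              Inputs = {index, <<"hints">>, <<"$key">>,
                        Prefix, <<Prefix/binary, 16#ff>>},
              Query = [{map, {modfun, poc_hints, map_checkhints},
                        Facts, true}],
              {ok, [{0, Results}]} =
                  riakc_pb_socket:mapred(Pid, Inputs, Query),
              Results
      end,
      Prefixes).
```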
When an object is loaded, a project-specific extract_data_forprocess/2 function is called, which should extract the facts to be indexed in the hints file from the original object (as well as any new 2i fields/terms to be added to both the original object and the hints object).
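
The extractor is project-specific, so the following is only a sketch of its shape. The `{Facts, IndexSpecs}` return value, the module name, and the assumption that the object value is a term_to_binary'd list of `{FactId, Detail}` tuples are all illustrative:

```erlang
%% Minimal sketch of a project-specific extractor. The return shape and
%% the value encoding are assumptions for illustration only.
-module(poc_extract).
-export([extract_data_forprocess/2]).

extract_data_forprocess(_Key, ObjValue) ->
    %% Assume the value is term_to_binary([{FactId, Detail}, ...]).
    FactTuples = binary_to_term(ObjValue),
    Facts = [FactId || {FactId, _Detail} <- FactTuples],
    %% A hypothetical extra 2i term to be added to both the original
    %% object and the hints object.
    IndexSpecs = [{<<"factcount_int">>, length(Facts)}],
    {Facts, IndexSpecs}.
```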
The hints file is a binary representation of the facts, using a bloom filter with a low false-positive rate but only a single hash. This makes the bloom large, but it is compressed for storage using rice encoding. This provides a good ratio between disk footprint and false-positive rate, whilst also providing a natural checksum when processing, to protect against any random bit-flipping.
To improve processing time, the bloom is first partitioned, so that only part of the bloom needs to be checked for any given fact.
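
A minimal sketch of such a partitioned, single-hash, rice-encoded bloom follows. All parameters (16 partitions, a 2^16 hash range per partition, a 10-bit rice remainder) and the module name are illustrative assumptions, not the PoC's actual settings:

```erlang
-module(poc_bloom).
-export([create/1, check/2]).

-define(PARTITIONS, 16).
-define(SLOT_SIZE, 65536).   % single-hash range within each partition
-define(RICE_BITS, 10).      % remainder width for rice encoding

%% Build the bloom: group the single hashes by partition, sort them,
%% and rice-encode the gaps between successive hash values.
create(Facts) ->
    Grouped =
        lists:foldl(
          fun(Fact, Acc) ->
                  {P, H} = hash(Fact),
                  maps:update_with(P, fun(Hs) -> [H | Hs] end, [H], Acc)
          end,
          #{}, Facts),
    maps:map(fun(_P, Hs) -> encode_gaps(lists:usort(Hs), 0) end, Grouped).

%% Check a single fact: only the one relevant partition is decoded.
check(Fact, Bloom) ->
    {P, H} = hash(Fact),
    case maps:find(P, Bloom) of
        error -> false;
        {ok, Bits} -> scan(Bits, 0, H)
    end.

hash(Fact) ->
    HashVal = erlang:phash2(Fact, ?PARTITIONS * ?SLOT_SIZE),
    {HashVal div ?SLOT_SIZE, HashVal rem ?SLOT_SIZE}.

encode_gaps([], _Prev) ->
    <<>>;
encode_gaps([H | Rest], Prev) ->
    Gap = H - Prev,
    Q = Gap bsr ?RICE_BITS,                    % unary-coded quotient
    R = Gap band ((1 bsl ?RICE_BITS) - 1),     % fixed-width remainder
    <<(unary(Q))/bitstring, 0:1, R:?RICE_BITS,
      (encode_gaps(Rest, H))/bitstring>>.

unary(0) -> <<>>;
unary(Q) -> <<1:1, (unary(Q - 1))/bitstring>>.

scan(<<>>, _Acc, _Target) ->
    false;
scan(Bits, Acc, Target) ->
    {Q, <<R:?RICE_BITS, Rest/bitstring>>} = take_unary(Bits, 0),
    Value = Acc + (Q bsl ?RICE_BITS) + R,
    if Value =:= Target -> true;
       Value > Target -> false;   % hashes are sorted, so stop early
       true -> scan(Rest, Value, Target)
    end.

take_unary(<<1:1, Rest/bitstring>>, Q) -> take_unary(Rest, Q + 1);
take_unary(<<0:1, Rest/bitstring>>, Q) -> {Q, Rest}.
```

Because the hashes within a partition are sorted before their gaps are rice-encoded, a lookup can stop as soon as a decoded value passes the target, and only one partition is ever decoded per fact.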
The map job (using the function map_checkhints/3) can then be run across a subset (or all) of the hints files, looking for the presence of one or more facts, and outputting {Fact, Key} pairs indicating in which keys each fact can be found.
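
As a sketch of the map phase, following the riak_kv Erlang map function contract of (Object, KeyData, Arg) -> [Result], and assuming the hints value is the term_to_binary'd bloom from the sketch above (and has no siblings):

```erlang
%% Sketch of the map function; the serialisation of the bloom is an
%% assumption carried over from the poc_bloom sketch above.
-module(poc_hints).
-export([map_checkhints/3]).

map_checkhints({error, notfound}, _KeyData, _Facts) ->
    %% Not-found results can reach map functions; emit nothing.
    [];
map_checkhints(Object, _KeyData, Facts) ->
    Key = riak_object:key(Object),
    Bloom = binary_to_term(riak_object:get_value(Object)),
    %% Emit {Fact, Key} for every queried fact the bloom may contain.
    [{Fact, Key} || Fact <- Facts, poc_bloom:check(Fact, Bloom)].
```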
- Note the standard Basho guidance to use Map sparingly.
- It is important to use AAE (active anti-entropy) to mitigate the risk of running queries at r=1.
- There are lots of TODOs.
- There is currently no proof at volume.