This repository implements FracMinHash sketching and Affirmative Sampling. See basicTest.cpp
for use case.
After making, the following programs are generated.
Purpose: basic testing of implementation
This program will do the following:
- Generate a random set A with x elements
- Generate the sketch of A using FMH and AS
- Record the sketch sizes
- Vary x from 1K to 10M
Results:
10000,10.2,2.6,573,8.82043
100000,96.5,7.32462,790.3,31.8624
1000000,1004.2,26.5548,1021,30.4335
10000000,9968.3,73.369,1247,42.1829
100000000,97035.5,302.213,1486.3,24.0626
By using the un-filtered sketches computed using AS, we get the following accuracy.
From the looks of it, it is clear that there may be some kind of bias. Let us also investigate for other
Now, I am pretty certain that there is some bias. The theory suggests that the bias can be corrected by filtering the sketches.
Let the sketches be
and
After that, if we estimate Jaccard/containment, we should get unbiased estimates. Let us implement these filterings and re-generate the images.
The accuracy test for Jaccard now looks like this:
Better! The same result using many