One of my project in c datastructures is to implement a bloom filter. Bloom filter is a space efficient probabilistic datastructure in c. like a set datastructure , the main function of bloom filter is to test whether a element is the member of a set. Probabilistic data structure means it tell us an element definitely not in the set or may be in the set. Bloom filter is an application of bitvector . When using bloom filter false positives are possible and false negative does’nt.
The main datastructure of bloom filter is a bit array. Implementation of bloom filter contain hash functions to get the array positions. An empty bloom filter contain all bits in the bit array zero .There are many hash functions in c which can be used to implement bloom filter. Each hash function maps some elements in the array. To add an element in the array ,element must pass through each hash functions and each function returns a integer corresponding to the element. Then using this integer as the index in the array we set the corresponding bit as do in the bit vector implementation .
Let we can take a simple example of idea behind bloom filter.
We insert and search on a bloom filter of size m=10 and the number of hash functions will be k=3.
Let H(x) denote the result of three hash functions and we can write it as a set { h1(x) ,h2(x) ,h3(x)}
This is our empty array
Then we inserting an element x0. The element must pass through 3 hash function and we obtain values as follows,
H(x0)= { 1, 4, 9}
Then array like this
Insert another element x1 , it also pass through hash function and we got a set as follows and we set the bit in bitarray
H(x1)={ 4, 5, 8}
Next we query for an element y0 in bloom filter it again pass through same hash function and we get
H(y0)= { 0 ,4, 8 } here the element is not in the set because of one of its bit is zero .
then H(y1)= { 1, 5, 8} here the element is present and it is a false positive
In my program i select two hash functions FNV hash function and DJB hash functions. These are two of the most efficient general purpose hash functions in c.
We can start from a dynamic array named bloom and allocate memory using malloc() function in c. To add an element to this array we just pass the element through the two hash functions and set the bit in the bit vector at the index of those hashes to 1. At first we pass the string to FNV hash function ,it return a specific hash value using this we set the bit as 1 in bit vector ,second we pass through DJB hash function. This hash function seems to have a good over-all distribution for many different data sets. Again using the hash value we set the another bit in the bloom array . Here is a simple code of my program
To test the membership of the element then hash the element with same hash function and check those values in the bitvector is set to 1. Element is pass through the same hash function used for insertion and we check those values in the bloom is 1 if it is true then element may be present otherwise definitely not in the set . In the below code if both r1 1nd r2 true then element may be present.
Source code is available here https://bitbucket.org/neethuedappal/c-data-structures/changeset/21b72694a7f0











