Monthly Archives: November 2011

Bloom filter – A probabilistic datastructure in c

Standard

One of my project in c datastructures is to implement a bloom filter. Bloom filter is a space  efficient probabilistic datastructure in c.  like a set datastructure , the main function of bloom filter is to test whether a element is the member of a set. Probabilistic data structure means it tell us an element definitely not in the set or may be in the set. Bloom filter is an application of bitvector . When using bloom filter false positives are possible and  false negative does’nt.

The main datastructure of bloom filter is a bit array. Implementation of bloom filter contain hash functions to get the array positions.  An empty bloom filter contain all bits in the bit array zero .There are many hash functions in c which can be used to  implement bloom filter. Each hash function maps some elements in the array. To add an element in the array ,element must  pass through each hash functions and each  function returns a integer corresponding to the element. Then using this integer as the index in the array we set the corresponding bit as do in the bit vector implementation .

Let we can take a simple example of idea behind bloom filter.

We insert and search on a bloom filter of size  m=10 and the number of hash functions will be k=3.

Let H(x) denote the result of three hash functions and we can write it as a set { h1(x) ,h2(x) ,h3(x)}

This is our empty array

Then we inserting an element  x0. The element  must pass through 3 hash function and we obtain values as follows,

H(x0)= { 1, 4, 9}

Then array like this

Insert another element x1 ,  it also pass through hash function and we got a set as follows and we set the bit in bitarray

H(x1)={ 4, 5, 8}

Next we query for an element  y0 in bloom filter  it again pass through same hash function and we get

H(y0)= { 0 ,4, 8 } here the element is not in the set because of one of its bit is zero .

then H(y1)= { 1, 5, 8} here the element is present and it is a false positive

In my program i select two hash functions  FNV hash function and DJB hash functions. These are two of the most efficient  general purpose hash functions in c.

We can start from a dynamic array  named bloom and  allocate memory  using malloc() function in c. To add an element to this array we just pass the element through the two hash functions and set the bit in the bit vector at the index of  those hashes to 1. At first we pass the string to FNV   hash function ,it return a specific hash value using this we set the bit as 1 in bit vector ,second we pass through DJB hash function. This  hash function seems to have a good over-all distribution for many different data sets.  Again using the hash value we set the another bit in the bloom array . Here is a simple code of my program

To  test the membership of the element then hash the element with same hash function and check those values in the bitvector is set to 1. Element is pass through the same hash function used for insertion and we check those values in the bloom is 1 if it is true then element  may be present otherwise definitely not in the set . In the below code if both r1 1nd r2 true then element may be present.

 

Source code is available here  https://bitbucket.org/neethuedappal/c-data-structures/changeset/21b72694a7f0

Bit vector implementation in C

Standard

Bit vector implementation  provide space optimization  by using a re-sizable character array . This article will help you to understand implementation of bitvector in c programming.

 What is bit vector ?

Bit vector is composed of two words bit and vector. We already know bit and vector , bit is smallest unit of data in computer and it has single value either 0 or 1. Here vector defining a structure which is basically a re-sizable array . So bit vector is mean to pack values into an array so that no space is wasted. Bit vectors are array of bits. When we defining an integer array which can hold 10 elements  then each block in the array contain 32 bit and total size will be  320 bit but we does’nt use  entire  320 bit . So remaining  space will be waste. Bit vector implementation is a solution to this problem . Here we declare a character array  of some size and initialize it as zero  .

In my program i declare an array named bit vector and initilize as zero . To insert an element into this dynamic array some steps are there Insertion means only setting a bit as true not actual insertion .

Let we take element  ‘n ‘ as  ‘6’

  1. To insert  6   we need a block in the array  and a bit position .
  2.  To find the block were 6  belongs ,  dividing   6  by the  block size.  Here array  bitvector  contain array of character so the block size will be 8 bits.  6 /8  is  0  and 6 must be belongs to 0th block.
  3. To find bit position of  6 in the  0’th block take reminder of  6 %8 . So we get the bit position as 6
  4. Here no insertion of  element ‘6’ only setting the bit in the corresponding  bit position as 1.
  5. To set a bit  ,left shifted  1 p times, where p is bitposition, and OR it with the value in the block

bitvector[bitpostion] |= 1 << bit position

Now we set 6’th bit in 0’th block as 1.

To check whether a number is present or  not in bit array we need the  block which number belongs and also bitpostion.

  1. Find the block as do in the insertion
  2. Find the bitposition also.
  3. If the bit in the  bitposition is zero no element is there
  4. If the value returned by the bitwise AND with the value in the block and left shifing of 1 p times ,were p is bitposition will true then number is present in bit array.

To clear a bit  without affecting the all other bit in bit array the bit must be change to 0.

  1. So again shift a 1 left with bitposition
  2. Apply unary operator ~ to reverse all bits   , ~(1 <<bitposition)
  3. Then binary AND operator with value in the block size

These are the main steps to implement a bit vector and bit vector implement a set data structre  to store individual bits . Bit vector implementation in many languages provide space optimization.

Avl Tree- a self balanced binary search tree

Standard

We are familiar with binary search tree , an efficient  datastructure contain data as nodes. Avl tree is also  binary search tree. What is difference between binary search tree and avl tree ? The answer is  avl tree is self balanced , and it was first data structure which is self balanced. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where n is the number of nodes in the tree prior to the operation.  In avl tree height of two child subtrees of any node differ by atmost one .The balance factor of a node is the height of left subtree minus height of right subtree and a node with balance factor 1,0,-1 considered as balanced. Below an avl tree with 9 nodes and each node have a balance factor of 1 ,-1 or 0.

When inserting a new node or deleting en existing node , the resulting tree will disturb the balance of tree. An unbalanced state is defind as a state in which any subtree have a balance factor greater than 1 or less than one . Tree rotations are necessary  to retain the balanced state. A tree rotation move one node up and one node down in the tree.

Insertion 

Every time inserting a node in tree ,it is necessary to check  each of the nodes ancestors for consistency with the rules of avl. If the balance factor of each node remains  -1,0 ,+1 then no rotations are necessary  . In my program a function named  balance(p) return the balance factor  of the node p in the tree . The following code will check the balance condition.If a node in tree violate avl rule  ,then  rotation  will necessary then we get a balanced tree.

There are four cases to be considered in which two are symmetric to the other two. Let p be the root and we can observe different cases.

Right right case and right left case  

  • If the balance factor of p is less than or equal to -2 then right subtree of p overweight the left subtree ,then balance factor of right child of p  will be checked.
  • If the balance factor of right child  is -1 then single left rotation needed.

Left rotation can be done in following steps. Select a new node called pivot it will be the new root. Set right child of p as root and p takes the ownership  of  pivot’s left child , then p will be the left child of new root pivot.

Here a tree with 3 as root node and it is right heavy. Its balance factor is -2 so and balance factor of its right child is -1 then a single left rotation is takes place and the  resulting tree must be balanced.

  •  If the balance factor of  right child is +1 then two rotations are needed .Sometimes single rotation not sufficient to balance a unbalanced tree .Such cases two rotations necessory. First is right rotation with p’s right child as root and second is left rotation with p as root. After right rotations we get a unbalanced tree with right subtree left heavy .Then a left rotation is necessory to balance the tree. The tree below need two rotations ( double right rotations ) because balance factor of root’s right child is +1 ,after two rotation resulting tree will balanced.
Left-Left case and Left-Right case
If the balance factor of p is greater than or equal to +2 then left subtree of p overweight the right subtree. Then balance factor of left child will be checked .In these cases we use right rotations
  • If the balance factor of left child is +1 ,a single right rotations is needed.

Right rotation is just opposite of left rotation. Here set  p’s left child to pivot and it will be the new root . p takes ownership of the pivot’s right child as its left child . Pivot take ownership of p as its right child  .The code as follows

Then below tree is left heavy and it need a single right rotation and resulting tree obey rules of avl .
  • If the balance factor of  p’s left child is -1 ,then two rotations are necessory , First is left rotation with p’s left child as root and second is right rotation with p as root . The following tree  is left heavy with root as 5 and its left child is right heavy. So two rotations are needed to balance the tree.
Following above rules you shold be able to balance an avl tree following an insert or delete every time. After all rotation the balance factor must be recalculated and check again the rules of avl .