Extendible-Hashing impl

Implements Extendible-Hashing.

Summary

Copyright

This program is free software.  You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

Author

© 2012 Jörg Seebohn

Files

C-kern/api/ds/inmem/exthash.h

Header file Extendible-Hashing.

C-kern/ds/inmem/exthash.c

Implementation file Extendible-Hashing impl.

exthash_t

Summary
description
AlgorithmThe container of exthash_t holds all entries in an array of size of a power of two.
lifetime
search
change
unsharebucket_exthashMakes a shared bucket unshared.
test

description

Algorithm

Hash Table

The container of exthash_t holds all entries in an array of size of a power of two.  If too many entries are hashed to the same array index the table is doubled in size.  The current size of the table is stored as exponent in exthash_t.level.  The maximum value for exthash_t.level is stored in exthash_t.maxlevel.  Therefore the the table can grow to the maximum size pow(2, exthash_t.maxlevel).

Hashing

Before inserting, removing, or finding an element in the hash tbale its hash value (a 32 or 64 bit value of type size_t) is calculated.  The computed value is taken modulo pow(2, exthash_t.level) (the size of the hash table).  The hash table contains only a pointer to the bucket where elements are stored.  The reason is that extendible hashing can be used to hash on external storage.  In this implementation the table contains a pointer to the root node of a tree.  The tree manages all elements stores all elements hashed to the same index.

Growing the Table

Consider a hash table of size 2 and inserting an element with hash value 0x83290a13.

---------------
| index[0b00] | --> (tree with 1 entries)
---------------
| index[0b01] | --> (tree with 3 entry)
---------------

The new element is hashed to index 1 (== 0x83290a13 % 2).  The tree at index 1 contains at least 3 elements therefore the implementation choses to double the table size.

---------------
| index[0b00] | --> (tree)    <---
---------------                  |
| index[0b01] | --> (tree)    <--|---   (level 1 / 2 == pow(2, 1))
---------------                  |  |
| index[0b10] | --> (shared)  ---|  |
---------------                     |
| index[0b11] | --> (shared)  -------   (level 2 / 4 == pow(2, 2))
---------------

The new entries in the table are marked with the special value (void*)(uintptr_t)-1.  This value indicates that the table entry does not point to an allocated bucket (tree) but shares its content with an entry at a lower level.

Now all entries in index 1 have to be rehashed.  The index is calculated as (hash value % 4) instead of (hash value % 2).  A modulo operation of a power of 2 is the same as masking the hash value with (size-1).  Therefore instead of masking the hash value with 0b0...001 it is masked with 0xb0...011.  So the bit pow(2, exthash_t.level-1) decides between keeping the entry at index 1 or moving it to the newly allocated bucket at index 3.  The new element with hash value 0x83290a13 hashes then to index 3 (== 0x83290a13 % 4) and is stored in this bucket.

---------------
| index[0b00] | --> (tree      <-----------
---------------                           |
| index[0b01] | --> (tree with 2 entries) |
---------------                           |
| index[0b10] | --> (shared)  ------------|
---------------
| index[0b11] | --> (tree with 1 rehashed old entry + new entry)
---------------

Doubling th table again would produce the following

----------------
| index[0b000] | --> (tree)    <--------- <-
----------------                        |  |
| index[0b001] | --> (tree 2 entries) <-|--|--        (level 1 / 2 == pow(2, 1))
----------------                        |  | |
| index[0b010] | --> (shared)  ---------|  | | <-
----------------                           | |  |
| index[0b011] | --> (tree 2 entries)      | |  | <-  (level 2 / 4 == pow(2, 2))
----------------                           | |  |  |
| index[0b100] | --> (shared)  ------------| |  |  |
----------------                             |  |  |
| index[0b101] | --> (shared)  ---------------  |  |
----------------                                |  |
| index[0b110] | --> (shared)  ------------------  |
----------------                                   |
| index[0b111] | --> (shared)  ---------------------  (level 3 / 8 == pow(2, 3))
----------------

Inserting a new element with hash value 0x11223342 hash to index 2 (modulo 8).  The bucket at index 2 is shares its content with the bucket at index 0.  Therefore the algorithm chooses to rehash the nodes stored at index 0 to make index 2 unshared and then inserts the new element at index 2.

----------------
| index[0b000] | --> (tree)    <--------- <-
----------------                        |  |
| index[0b001] | --> (tree 2 entries) <-|--|--        (level 1 / 2 == pow(2, 1))
----------------                        |  | |
| index[0b010] | --> (tree with new entry) | | <-
----------------                           | |  |
| index[0b011] | --> (tree 2 entries)      | |  | <-  (level 2 / 4 == pow(2, 2))
----------------                           | |  |  |
| index[0b100] | --> (shared)  ------------| |  |  |
----------------                             |  |  |
| index[0b101] | --> (shared)  ---------------  |  |
----------------                                |  |
| index[0b110] | --> (shared)  ------------------  |
----------------                                   |
| index[0b111] | --> (shared)  ---------------------  (level 3 / 8 == pow(2, 3))
----------------

Nothing else has to be changed.  Cause index 6 at the next level shares its content already with index 2.  This means that a find operation has to follow the chain of shared entries in the table which makes it an operation having a time complexity of O(level) in the worst case.

lifetime

search

change

unsharebucket_exthash

static int unsharebucket_exthash(exthash_t *htable,
size_t tabidx)

Makes a shared bucket unshared.  Splits the content of the bucket <exthash_t.hashtable>[tabidx] into two buckets.  The second bucket index is the next higher index which points to a bucket which shares its content with bucket tabidx.

If the function succeeds tabidx is no more shared on the next higher level.  It is possible that another table entry shares its content with tabidx.  Another call to this function would then distribute the keys to these two buckets.

(unchecked) Preconditions

1.<tableindexfromkeyAndFollow_exthash> must have indicated that a bucket with higher index exists which shares its content.

test

exthash_iterator_t

lifetime

iterate

exthash_t

test

Offers a container which organizes stored nodes as a hash table.
Implements Extendible-Hashing.
static int unsharebucket_exthash(exthash_t *htable,
size_t tabidx)
Makes a shared bucket unshared.
uint8_t level
Determines the hash table size as pow(2,level).
uint8_t maxlevel
Determines the max size of the hash table.
Close