Libdict is a compact, ANSI C library which provides access to a set of generic
and flexible ``dictionary'' data structures. All algorithms used in libdict
have been optimized, and, with one very small exception, are not recursive but
iterative. It was written entirely by me,
and is released under a BSD style licence. Libdict can be downloaded
here. Descriptions of several of the algorithms employed
in Libdict can be downloaded from here.
Libdict implements the following data structures, which are listed with their
asymptotic expected and worst-case complexity for insert, search, and delete
operations:
Data Structure
Expected Time
Worst-Case Time
AVL Tree
O(lg N)
O(lg N)
Red-Black Tree
O(lg N)
O(lg N)
Splay Tree
O(lg N) [amortized]
O(N)
Treap
O(lg N)
O(N)
Weight-balanced tree
O(lg N)
O(lg N)
insert and search, O(N) delete
Path-reduction tree
O(lg N)
O(lg N)
insert and search, O(N) delete
Hashtable (Chained)
O(1)
O(1) insert, O(N)
search and delete
These structures can be used to efficiently store and retrieve key-data pairs.
Each of these structures can be accessed using its direct API, or it can be
accessed using a dictionary abstraction. Despite it's name, libdict can be
used to store any kind of data and any kind of key (provided it fits into a
'void' pointer on your system). All data structures have the following
operations defined:
Insert insert a key-data pair into the dictionary.
Probe search the data structure for the given key. If found, return
it's associated data, otherwise, insert the given key-data pair.
Search search the data structure for the given key, and return the
associated data, if found.
Const Search same as search but accepts a const pointer to the data
structure instead and returns a const pointer to the associated data for the
key.
Remove If found, remove the key-data pair from the data structure.
Count return the number of key-data pairs stored in the data
structure.
Empty remove all key-data pairs from the data structure.
Walk pass each key-data pair to a callback function, stopping when
the callback function returns 0.
Destroy destroy the data structure and release its storaage.
Make Iterator allocate and return an iterator, which allows one to
iterate over each key-data pair in the structure efficiently. Iterators are
both random and sequential (bi-directional).
In addition, various structures can provide statistical information. For
example, each tree structure can provide the minimum and maximum path lengths
from the root to a leaf, and the hash table can provide the number of buckets
used, etc. These statistical operations are not provided in the generic
dictionary interface because they are not applicable to each data structure.
The great advantage to using the generic interface is, by changing only the
line of code which creates the dictionary, the underlying data structure and
therefore its performance characteristics can be changed.
Using the Library
The library allows the client to store keys and ``satellite'' data associated
with each key in a dict structure. The client must provide the library with a
comparison function which will then be used to determine the relationship
between two keys. This is passed to the library as a ``dict_cmp_func'', which
is defined as:
typedef int (*dict_cmp_func)(const void *key1, const void *key2);
This function should return 0 if the two keys are determined to be equal, less
then zero if key1 < key2, and greater than zero if key1 > key2. This
function must be deterministic and must always return the same result for the
same keys. In addition, the comparison relation must be reflexive (A = A),
symmetric (A = B implies B = A), and transitive (A = B and B = C implies A = C,
A < B and B < C implies A < C, etc).
In addition, the client may optionally provide functions which can be used for
automatically freeing the key and / or the data which is stored in the tree.
These are passed to the library as ``dict_del_func'', which is defined as:
typedef void (*dict_del_func)(void *);
For example, if a tree was being used to store blocks of memory, indexed by
allocated strings, dict_cmp_func could be the ``strcmp'', and the routines
passed in for freeing the key and data could both be ``free''. The deallocation
function is not required to do anything, and it should not be used as a
``notification'' callback in the client application that the data item is no
longer in the tree. This is because these routines are invoked when
``overwriting'' the data of an item already in the tree to free the old key and
data. If the deallocation mechanism is abused and is used as a notification
that the item is no longer in the key, the application will think the item is
no longer in the tree when in fact it is still present but only had its data
changed.
However, say we are storing a structure such as the one that follows as data in
a dict structure:
struct fooey {
char *key;
int bar;
void *ptr;
};
If the struct members key and ptr are allocated and we wish to free them when
the structure is removed from the dict structure, we could write the following
free routine:
A pointer to this function can now be passed as the 'dat_del' argument to
a function which creates a dict structure (this will be explained below).
All data structures provide identical interfaces (apart from function name
prefixes) and identical semantics for all operations defined. Thus, by
presenting the interface for a particular structure, the reader will also learn
how to use the rest. In the following section, the interface to the red-black
tree will be explained. Every data structure is opaque to the user; only a
pointer to it may be stored by the client program.
Red-Black Trees
A red-black tree is a balanced tree structure. This structure guarantees that
insertion, deletion, and search for an item has a time complexity O(lg N). A
red-black tree with N data items has a height of at most 2lg(N + 1). More
specifically, the insertion and deletion algorithms guarantee that no leaf has
a distance from the root that is more than twice the distance of another leaf.
Thus the tree remains roughly balanced and manipulation of the tree has
logarithmic time complexity. This is unlike ``plain'' unbalanced search trees,
where performance may, and often does, degenerate into linear time.
Although the maximum number of comparisons for an insert, search, or delete for
a given item in a red-black tree is bounded by 2lg N + 1, empirical studies on
red-black trees which are filled with random data show that the average number
of comparisons for these operations is about 1.002lg N.
The following function will allocate and return a new red-black tree:
If the ``key_cmp'' argument is NULL, then the keys which are later inserted
into the tree will be cast to void pointers and their value will be compared.
One will rarely want this behavior, but it is there should you require
it. Either ``key_del'' or ``dat_del'' may be NULL, in which case they will not
be called :-).
Once the tree is created, it may be used until it is no longer needed, in which
case it must be destroyed with
void rb_tree_destroy(rb_tree *tree, int del);
This function will release all storage used by the tree and, if del is set, it
will invoke the user-defined deallocation functions to free the key and / or
the data, if applicable.
To insert key and data pairs into the tree, use:
int rb_tree_insert(rb_tree *tree, void *key, void *dat, int overwrite);
This will attempt to insert the given key and data pair into ``tree''. If
the item is already in the tree and ``overwrite'' is non-zero, the new data
value will overwrite the old one, and the function will return a positive
value. If the node is not present in the tree, it is inserted and the function
will return 0. If, however, the function cannot allocate memory to perform the
insertion, it will return -1.
This is a more general function which also doubles as a tree search function:
int rb_tree_probe(rb_tree *tree, void *key, void **dat);
This function will search the given tree for a match for ``key''; if it is
found, the data associated with the given key is stored at the location
``dat'', and the function will return 0. Otherwise, the given key and data pair
is inserted into the tree and the function will return 1. However, if
it the function was unable to allocate memory it will return -1. Probe
operations provides an interface similar to, for example, the programming
language awk's associative arrays, in which the first reference to a given key
will bring that key and data pair into the array, and subsequent accesses will
return the item originally placed into the array.
Both will return the data associated with the given key if it present in the
tree, or NULL. ``rb_tree_csearch'' will return a const pointer to the data. It
is meant for sections of code which have a const pointer to a rb_tree, and so
are not allowed to modify the tree or any data held in the tree.
Removing an item from the tree is done using:
int rb_tree_remove(rb_tree *tree, const void *key, int del);
This function will return 0 if the item was found and successfully removed. If
the item was not found it will return -1. Any user-defined deallocators will
only be invoked if ``del'' is non-zero.
To remove all key-data pairs from a tree, use:
void rb_tree_empty(rb_tree *tree, int del);
The tree will be completely emptied. Any user-defined deallocators will only be
invoked if ``del'' is non-zero.
It is possible to have the tree call a given function for each item in the
tree. The callback is defined as:
int (*dict_vis_func)(const void *key, void *dat);
This function will be called for each key-data pair in the tree. The tree will
call this function for each item in order, until the function returns zero.
This is accomplished with:
Finally, to find out how many key-data pairs are held in the tree, use:
unsigned rb_tree_count(rb_tree *tree);
This sums up the all the operations which can be done on the dict data
structures, using the red-black tree as an example. There are, however, more
operations which can be performed on trees.
``rb_tree_height'' will return the length of the longest path from the root to
a leaf. ``rb_tree_mheight'' will return the length of the shortest path from
the root to a leaf.
These functions will return the minimum and maximum keys in the tree,
respectively. Be careful, as these may will NULL if the tree is empty. These
functions also allow the tree to be used as a double-ended priority queue
(although there are better algorithms for that).
Suppose you decide you would like to use the red-black tree structure provided
by this library in your application. It would be perfectly OK to use the
``rb_tree_*'' functions described here. However, suppose at a later time you
decide you'd rather use the splay tree. This would require you to go through
your source code and change every instance of ``rb_tree'' to ``sp_tree.'' There
is a Better Way (tm) to access this library which relieves the programmer from
having to know about the underlying data in use. There is where the generic
interface of libdict comes into play.
The generic interface is defined in the dict.h file. It includes all operations
such as dict_insert(), dict_probe(), dict_remove(), dict_empty(), etc. Each of
these routines [they are actually macros] requires a pointer to a ``dict''
structure. However, there is no function defined which creates a new dict
structure. The function which will create the generic dict structure is defined
in the header file for that particular structure. For example, earlier we used
the rb_tree_new() function to create a new red-black tree. Instead, we now use
the rb_dict_new() function, which will create a red-black tree, but instead of
returning the tree itself, it will return a dict structure which encapsulates
all of the operations on the structure. We can now use the dict structure
returned by rb_dict_new() with all of the functions such as dict_insert(),
dict_remove(), etc, from dict.h. Thus, the programmer may change the underlying
data structure used, and hence the performance characteristics of the
application, by simply changing a single line of code - the line which creates
the dict structure. Throughout the rest of the application, the programmer no
longer has to worry about which actual structure is being used.
As mentioned earlier, iterators are also available which allow both random and
sequential, bi-directional access. A routine which creates a new iterator would
be something like:
rb_itor *rb_itor_new(rb_tree *tree);
All iterators have the following operations defined:
Valid is the iterator positioned at a meaningful location?
Invalidate invalidate the iterator, or position it at a sentinel
location which signifies that it cannot be ``dereferenced.''
Next move the iterator to the next location. If it was invalid, move
it to the first location in its associated dictionary.
Next N move the iterator forward N locations. If it was invalid,
first move it the first location in its associated dictionary, then move it
forward from there N - 1 times.
Prev move the iterator to the previous location. If it was invalid,
move it to the last location in its associated dictionary.
Prev N move the iterator backwards N locations. If it was invalid,
first move it the last location in its associated dictionary, then move it
backwards from there N - 1 times.
First move the iterator to the first location in its associated
dictionary.
Last move the iterator to the last location in its associated
dictionary.
Key return the key stored at the current location of the iterator.
Data return the data stored at the current location of the iterator.
Set Data change the data stored the current location of the iterator.
The following program illustrates the use of a red-black tree and an iterator.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dict.h>
void *xmalloc(size_t size);
char *xstrdup(const char *s);
int
main(void)
{
rb_tree *tree;
rb_itor *itor;
FILE *fp;
char *p;
char buf[512];
int rv;
if ((fp = fopen("data.txt", "r")) == NULL) {
fprintf(stderr, "can't open data.txt\n");
exit(EXIT_FAILURE);
}
/*
* Set the allocation function for the library. This may not be a good idea
* except in trivial programs like this one.
*/
dict_set_malloc(xmalloc);
tree = rb_tree_new((dict_cmp_func)strcmp, free, free);
/*
* We expect the text file to contain lines like "key=value".
*/
while ((fgets(buf, sizeof(buf), fp)) != NULL) {
p = strchr(buf, '=');
if (!p) {
fprintf(stderr, "bad file syntax\n");
exit(EXIT_FAILURE);
}
*p++ = '\0';
if (!*p) {
fprintf(stderr, "bad file syntax\n");
exit(EXIT_FAILURE);
}
if ((rv = rb_tree_insert(tree, xstrdup(buf), xstrdup(p), 0)) != 0) {
fprintf(stderr, "duplicate key in file: '%s'\n", buf);
exit(EXIT_FAILURE);
}
}
rewind(fp);
while ((fgets(buf, sizeof(buf), fp)) != NULL) {
strtok(buf, '=');
if ((p = rb_tree_search(tree, buf)) == NULL) {
fprintf(stderr, "key '%s' not found in tree!\n", buf);
exit(EXIT_FAILURE);
}
printf("%s = %s\n", buf, p);
}
printf("\n");
fclose(fp);
/* Print every key and data pair that is in the tree, in order. */
itor = rb_itor_new(tree);
for (; rb_itor_valid(itor); rb_itor_next(itor))
printf("%s = %s\n", rb_itor_key(itor), rb_itor_data(itor));
rb_itor_destroy(itor);
/* Destroy the tree and free all the data in it. */
rb_tree_destroy(tree, TRUE);
exit(EXIT_SUCCESS);
}
void *
xmalloc(size_t size)
{
void *p;
if ((p = malloc(size)) == NULL) {
fprintf(stderr, "out of memory\n");
exit(EXIT_FAILURE);
}
return p;
}
char *
xstrdup(const char *s)
{
unsigned len;
len = strlen(s) + 1;
return memcpy(xmalloc(len), s, len);
}
At the risk of being repetitive, here is the same program, but this time it
uses the generic interface. (Some parts of the first listing are omitted for
brevity).
int
main(void)
{
dict *dct;
dict_itor *itor;
FILE *fp;
char *p;
char buf[512];
int rv;
if ((fp = fopen("data.txt", "r")) == NULL) {
fprintf(stderr, "can't open data.txt\n");
exit(EXIT_FAILURE);
}
/*
* Set the allocation function for the library so we dont have to worry
* about it running out of memory. This may not be a good idea except in
* trivial programs like this one.
*/
dict_set_malloc(xmalloc);
dct = rb_dict_new((dict_cmp_func)strcmp, free, free);
/*
* We expect the text file to contain lines like "key=value".
*/
while ((fgets(buf, sizeof(buf), fp)) != NULL) {
strtok(buf, "\n");
p = strchr(buf, '=');
if (!p) {
fprintf(stderr, "bad file syntax\n");
exit(EXIT_FAILURE);
}
*p++ = '\0';
if (!*p) {
fprintf(stderr, "bad file syntax\n");
exit(EXIT_FAILURE);
}
if ((rv = dict_insert(dct, xstrdup(buf), xstrdup(p), 0)) != 0) {
fprintf(stderr, "duplicate key in file: '%s'\n", buf);
exit(EXIT_FAILURE);
}
}
rewind(fp);
while ((fgets(buf, sizeof(buf), fp)) != NULL) {
strtok(buf, "=");
if ((p = dict_search(dct, buf)) == NULL) {
fprintf(stderr, "key '%s' not found in dict!\n", buf);
exit(EXIT_FAILURE);
}
printf("%s = %s\n", buf, p);
}
printf("\n");
fclose(fp);
/* Print every key and data pair that is in the dict, in order. */
itor = dict_itor_new(dct);
for (; dict_itor_valid(itor); dict_itor_next(itor))
printf("%s = %s\n", dict_itor_key(itor), dict_itor_data(itor));
dict_itor_destroy(itor);
/* Destroy the dict and free all the data in it. */
dict_destroy(dct, TRUE);
exit(EXIT_SUCCESS);
}
Now, if we wanted to use some other data structure than a red-black tree, such
as a splay tree, we could just change the line dct = rb_dict_new(...) to dct = sp_dict_new(...)'.