Suffix-Tree

A suffix tree is a tree which stores all suffixes of a given string.  For “ABABC” all suffixes are: ABABC,BABC,ABC,BC,C and the empty string.  This implementation considers the empty string as no suffix in contrast to the standard definition.

Time in O(n)

The construction of a suffix tree runs in time O(n) for a given input string of length n.

If you want to know if a certain substring of length s is contained in a suffix tree you can do so in time O(s).

Reference

The algorithm for constructing a suffix tree is from E. Ukkonen (1995).  See paper for a thorough description.

Summary
Suffix-TreeA suffix tree is a tree which stores all suffixes of a given string.
CopyrightThis program is free software.
Files
C-kern/api/ds/inmem/suffixtree.hHeader file of Suffix-Tree.
C-kern/ds/inmem/suffixtree.cImplementation file of Suffix-Tree impl.
Types
struct suffixtree_tExports suffixtree_t.
Functions
test
unittest_ds_inmem_suffixtreeTest suffix tree with content if a C-source file.
suffixtree_tA suffix tree contains a set of connected nodes and a reference to the input text.
childsPoints to root node with all childs of root node.
maxlengthMax length (in bytes) of all added strings.
lifetime
suffixtree_FREEStatic initializer.
free_suffixtreeFrees all memory of the allocated nodes.
query
dump_suffixtreeWrites a simple ascii representation of all nodes into a cstring_t.
isstring_suffixtreeReturns true if at least one suffix begins with searchstr.
matchall_suffixtreeReturns number of times and start addresses of suffixes which contain searchstr.
build
build_suffixtreeThis function constructs a suffix tree from the given input string.
clear_suffixtreeAll contained suffixes are removed and all internal memory is freed.

Copyright

This program is free software.  You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

Author

© 2012 Jörg Seebohn

Files

C-kern/api/ds/inmem/suffixtree.h

Header file of Suffix-Tree.

C-kern/ds/inmem/suffixtree.c

Implementation file of Suffix-Tree impl.

Types

struct suffixtree_t

typedef struct suffixtree_t suffixtree_t

Exports suffixtree_t.  Implementation of a suffix tree.

Functions

Summary
test
unittest_ds_inmem_suffixtreeTest suffix tree with content if a C-source file.

test

unittest_ds_inmem_suffixtree

int unittest_ds_inmem_suffixtree(void)

Test suffix tree with content if a C-source file.

suffixtree_t

struct suffixtree_t

A suffix tree contains a set of connected nodes and a reference to the input text.  The root node is represented with the special value NULL.  Child nodes are managed by a single linked list.  The first character of the substring of every child is unique between the childs.

This implementation considers the empty string as an invalid suffix in contrast to the standard definition.  To query for an empty string with isstring_suffixtree always returns false.

Summary
childsPoints to root node with all childs of root node.
maxlengthMax length (in bytes) of all added strings.
lifetime
suffixtree_FREEStatic initializer.
free_suffixtreeFrees all memory of the allocated nodes.
query
dump_suffixtreeWrites a simple ascii representation of all nodes into a cstring_t.
isstring_suffixtreeReturns true if at least one suffix begins with searchstr.
matchall_suffixtreeReturns number of times and start addresses of suffixes which contain searchstr.
build
build_suffixtreeThis function constructs a suffix tree from the given input string.
clear_suffixtreeAll contained suffixes are removed and all internal memory is freed.

childs

struct suffixtree_node_t * childs

Points to root node with all childs of root node.

maxlength

size_t maxlength

Max length (in bytes) of all added strings.

lifetime

suffixtree_FREE

#define suffixtree_FREE { (void*)0, 0 }

Static initializer.  Sets all fields to 0.

free_suffixtree

int free_suffixtree(suffixtree_t *tree)

Frees all memory of the allocated nodes.  The input string can also be freed if it is no longer needed.  TODO: Implement free memory group - allow to mark the nodes as part of a group !

query

dump_suffixtree

int dump_suffixtree(suffixtree_t *tree,
struct cstring_t *cstr)

Writes a simple ascii representation of all nodes into a cstring_t.  The first dumped node is root

node(0):
 childs:
 A -> node(bffff17c)
 B -> node(bffff168)
node(bffff168): 'B'
suffix->node(0), parent->node(0), childs:
 ::-> leaf: ''
 A -> node(bffff17c)"
 C -> leaf: 'CXX'
...

A node has a number (its internal memory address).  Root has number 0.  It matches also a string except root.  It has a list of childs indexed by the first matching character of the child.  Other nodes have also a suffix pointer and a parent pointer.  The suffix pointer points to the node which matches the suffix of this node.  The string a node matches are all string concatenated on path from root to this node (including the node).

A leaf is only listed as child of a node.  It matches only a string.  A leaf marked with ‘::’ is an end marker to store only positional information.

isstring_suffixtree

bool isstring_suffixtree(suffixtree_t *tree,
size_t length,
const uint8_t searchstr[length])

Returns true if at least one suffix begins with searchstr.  Use this function to check if a certain substring is contained in the string for which this suffix tree was built.

Returns

truesearched string is contained in input text
falsesearched string is not contained in input text

matchall_suffixtree

int matchall_suffixtree(suffixtree_t *tree,
size_t length,
const uint8_t searchstr[length],
size_t skip_count,
/*out*/size_t *matched_count,
size_t maxmatchcount,
/*out*/const uint8_t *matchedpos[maxmatchcount])

Returns number of times and start addresses of suffixes which contain searchstr.  Use this function to search for all occurrences of a certain substring in the string for which this suffix tree was built.

The number of valid positions in array matchedpos can be computed with

min((matched_count > skip_count ? matched_count-skip_count : 0), maxmatchcount)

Parameter

treeSuffix tree of the string which is scanned for searchstr.
lengthThe length of searchstr.
searchstrThe content of the search string.
skip_countThe first skip_count number of found occurrences are not returned in matchedpos array.
matched_countReturns the number of all found occurrences.  This number is independent of skip_count and maxmatchcount.  Set maxmatchcount to 0 to count all occurences without returning them.
maxmatchcountContains the size of the array matchedpos.
matchedposReturns the positions searchstr is found in the string suffix tree was built for.  The number of positions returned is less or equal than maxmatchcount.  The first skip_count positions are not contained.

TODO: rewrite interface to use danymic array as result (implement it part static allocation + dynamic allocation; same as wbuffer, but possible mixed at the same time) !

build

build_suffixtree

int build_suffixtree(suffixtree_t *tree,
size_t length,
const uint8_t input_string[length])

This function constructs a suffix tree from the given input string.  After return the caller is not allowed to free the text.  The input text must live at least as long as the constructed tree cause it stores references to input_string to describe string contents instead of making copies.  Memory of any previously build tree is freed before a new one is built.

clear_suffixtree

int clear_suffixtree(suffixtree_t *tree)

All contained suffixes are removed and all internal memory is freed.  After return no more references to any added input strings are held so they can be safely freed.

A suffix tree is a tree which stores all suffixes of a given string.
Implements Suffix-Tree.
typedef struct suffixtree_t suffixtree_t
Exports suffixtree_t.
struct suffixtree_t
A suffix tree contains a set of connected nodes and a reference to the input text.
int unittest_ds_inmem_suffixtree(void)
Test suffix tree with content if a C-source file.
struct suffixtree_node_t * childs
Points to root node with all childs of root node.
size_t maxlength
Max length (in bytes) of all added strings.
#define suffixtree_FREE { (void*)0, 0 }
Static initializer.
int free_suffixtree(suffixtree_t *tree)
Frees all memory of the allocated nodes.
int dump_suffixtree(suffixtree_t *tree,
struct cstring_t *cstr)
Writes a simple ascii representation of all nodes into a cstring_t.
struct cstring_t
Dynamically growing C string with trailing ‘\0’ byte.
bool isstring_suffixtree(suffixtree_t *tree,
size_t length,
const uint8_t searchstr[length])
Returns true if at least one suffix begins with searchstr.
int matchall_suffixtree(suffixtree_t *tree,
size_t length,
const uint8_t searchstr[length],
size_t skip_count,
/*out*/size_t *matched_count,
size_t maxmatchcount,
/*out*/const uint8_t *matchedpos[maxmatchcount])
Returns number of times and start addresses of suffixes which contain searchstr.
int build_suffixtree(suffixtree_t *tree,
size_t length,
const uint8_t input_string[length])
This function constructs a suffix tree from the given input string.
int clear_suffixtree(suffixtree_t *tree)
All contained suffixes are removed and all internal memory is freed.
Close