Suffix-Tree impl

Implements Suffix-Tree.

Summary
Suffix-Tree implImplements Suffix-Tree.
CopyrightThis program is free software.
Files
C-kern/api/ds/inmem/suffixtree.hHeader file of Suffix-Tree.
C-kern/ds/inmem/suffixtree.cImplementation file of Suffix-Tree impl.
Types
struct suffixtree_iterator_tShortcut for suffixtree_iterator_t.
struct suffixtree_addstate_tShortcut for suffixtree_addstate_t.
struct suffixtree_leaf_tShortcut for suffixtree_leaf_t.
struct suffixtree_node_tShortcut for suffixtree_node_t.
struct suffixtree_pos_tShortcut for suffixtree_pos_t.
suffixtree_iterator_tStores position in suffixtree_t which has to be visited next.
nextUsed in <suffixtreeiterator_list_t> to connect all contained suffixtree_iterator_t.
prefixlenSum of length of strings on the path from root up to next_child.
next_childPointer to suffixtree_node_t which has to be visited next on this level of hierachy.
lifetime
new_suffixtreeiteratorAllocates a new suffixtree_iterator_t object.
delete_suffixtreeiteratorFrees all memory allocated for suffixtree_iterator_t object.
suffixtreeiterator_adapter_itDefines typeadapt_t to manage memory of suffixtree_iterator_t.
Functions
freenode_suffixtreeiteratoradapterFrees memory of object type suffixtree_iterator_t.
Macros
slist_IMPLEMENT_iterlistGenerates list managing suffixtree_iterator_t -- see slist_IMPLEMENT.
Functions
pushnew_iterlistCreates new suffixtree_iterator_t and pushes onto <suffixtreeiterator_list_t>.
suffixtree_leaf_tA node in suffixtree_t which has no childs.
next_childUsed to manage childs in a single linked list.
str_startStart address of matched string.
str_sizeLength in bytes of str_start.
lifetime
new_suffixtreeleafAllocates new leaf of suffix tree.
delete_suffixtreeleafFrees resources associated with suffixtree_leaf_t.
generic
suffixtree_leaf_EMBEDAllows to embed all data fields of suffixtree_leaf_t into another structure.
query
ISLEAFReturns true if parameter points to node of type suffixtree_leaf_t.
STR_SIZEReturns the size of the string this node or leaf matches.
STR_STARTReturns the start address of the string this node or leaf matches.
change
INIT_AS_NODEInits the string description fields and sets the type to suffixtree_node_t.
INIT_AS_LEAFInits the string description fields and sets the type to suffixtree_leaf_t.
SKIP_STR_BYTESIncrements suffixtree_leaf_t.str_start and decrements suffixtree_leaf_t.str_size with _add.
suffixtree_node_tA node in suffixtree_t which has childs.
suffixtree_leaf_EMBEDInherit all data fields from suffixtree_leaf_t.
childsList of all children.
suffix_linkPoints to node which matches same string with this node except without the first character.
lifetime
suffixtree_pos_tDescribes a position in suffixtree_t.
matched_lenThe number of characters matched in node.
nodeThe current node indicates the position in the tree.
parentThe parent of node.
lifetime
suffixtree_pos_INITStatic initializer.
suffixtree_addstate_tHolds the position in suffixtree_t and the next character to be added.
posPosition in suffixtree_t where the next character is added.
suffixThe string which describes the current suffix.
lifetime
suffixtree_addstate_INITStatic initializer.
suffixtree_t
description
Build a Suffix TreeA suffix tree is built from an input string and it stores all suffixes.
Optimize Tree ConstructionTo be able to optimize the building process we introduce nodes which can store more than one character.
helper
compiletime_testsTests that suffixtree_node_t inherits from suffixtree_leaf_t.
lifetime
build
replacechild_suffixtreeReplaces old_child of parent with new_child.
splitnode_suffixtreeThe node in pos (see suffixtree_pos_t.node) is splitted into two nodes.
addchar_suffixtreeThis function process the next input character in state->suffix.
query
findstring_suffixtreeSearches the node matching searchstring beginning from root.
test

Copyright

This program is free software.  You can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

Author

© 2012 Jörg Seebohn

Files

C-kern/api/ds/inmem/suffixtree.h

Header file of Suffix-Tree.

C-kern/ds/inmem/suffixtree.c

Implementation file of Suffix-Tree impl.

Types

struct suffixtree_iterator_t

typedef struct suffixtree_iterator_t suffixtree_iterator_t

Shortcut for suffixtree_iterator_t.  Holds position in suffixtree_t used in iteration.

struct suffixtree_addstate_t

typedef struct suffixtree_addstate_t suffixtree_addstate_t

Shortcut for suffixtree_addstate_t.  Holds state information used in the operation which adds the next character of the input string.

struct suffixtree_leaf_t

typedef struct suffixtree_leaf_t suffixtree_leaf_t

Shortcut for suffixtree_leaf_t.  Holds node of suffixtree_t without any children and no suffix link.

struct suffixtree_node_t

typedef struct suffixtree_node_t suffixtree_node_t

Shortcut for suffixtree_node_t.  Holds node of suffixtree_t with children and a suffix link.

struct suffixtree_pos_t

typedef struct suffixtree_pos_t suffixtree_pos_t

Shortcut for suffixtree_pos_t.  Stores position in suffix tree.

suffixtree_iterator_t

struct suffixtree_iterator_t

Stores position in suffixtree_t which has to be visited next.  A position consists of the next to be visited child of type suffixtree_node_t.  The length of string which is matched on the path leading up to next_child is stored in prefixlen.

Summary
nextUsed in <suffixtreeiterator_list_t> to connect all contained suffixtree_iterator_t.
prefixlenSum of length of strings on the path from root up to next_child.
next_childPointer to suffixtree_node_t which has to be visited next on this level of hierachy.
lifetime
new_suffixtreeiteratorAllocates a new suffixtree_iterator_t object.
delete_suffixtreeiteratorFrees all memory allocated for suffixtree_iterator_t object.

next

slist_node_t * next

Used in <suffixtreeiterator_list_t> to connect all contained suffixtree_iterator_t.

prefixlen

size_t prefixlen

Sum of length of strings on the path from root up to next_child.

next_child

suffixtree_node_t * next_child

Pointer to suffixtree_node_t which has to be visited next on this level of hierachy.

lifetime

new_suffixtreeiterator

static inline int new_suffixtreeiterator(/*out*/suffixtree_iterator_t **iter)

Allocates a new suffixtree_iterator_t object.

delete_suffixtreeiterator

static int delete_suffixtreeiterator(suffixtree_iterator_t **iter)

Frees all memory allocated for suffixtree_iterator_t object.

suffixtreeiterator_adapter_it

Defines typeadapt_t to manage memory of suffixtree_iterator_t.  Used in slist_t (see slist_IMPLEMENT_iterlist) storing suffixtree_iterator_t.

Summary
Functions
freenode_suffixtreeiteratoradapterFrees memory of object type suffixtree_iterator_t.
Macros
slist_IMPLEMENT_iterlistGenerates list managing suffixtree_iterator_t -- see slist_IMPLEMENT.
Functions
pushnew_iterlistCreates new suffixtree_iterator_t and pushes onto <suffixtreeiterator_list_t>.

Functions

freenode_suffixtreeiteratoradapter

static int freenode_suffixtreeiteratoradapter(
   suffixtreeiterator_adapter_t *typeadp,
   suffixtree_iterator_t **iter
)

Frees memory of object type suffixtree_iterator_t.  Used to adapt <suffixtreeiterator_list_t> to object type suffixtree_iterator_t.

Macros

slist_IMPLEMENT_iterlist

Generates list managing suffixtree_iterator_t -- see slist_IMPLEMENT.

Functions

pushnew_iterlist

static suffixtree_iterator_t * pushnew_iterlist(slist_t *stack)

Creates new suffixtree_iterator_t and pushes onto <suffixtreeiterator_list_t>.

suffixtree_leaf_t

struct suffixtree_leaf_t

A node in suffixtree_t which has no childs.  A leaf indicates a position in the tree and contains a matching string.  The matching string extends alwayas until the end of the string for which the suffix tree was built.  A leaf is created for every suffix of the input string for which the tree is built.  Therefore the start address of a leaf subtracted with the length of all strings of all nodes on the path from root to leaf define the start address of a suffix.

To distinguish between suffixtree_node_t and suffixtree_leaf_t bit 0x01 of str_start is used.

Attention

Do no forget to adapt suffixtree_leaf_EMBED after a change of sthis structure.

Summary
next_childUsed to manage childs in a single linked list.
str_startStart address of matched string.
str_sizeLength in bytes of str_start.
lifetime
new_suffixtreeleafAllocates new leaf of suffix tree.
delete_suffixtreeleafFrees resources associated with suffixtree_leaf_t.
generic
suffixtree_leaf_EMBEDAllows to embed all data fields of suffixtree_leaf_t into another structure.
query
ISLEAFReturns true if parameter points to node of type suffixtree_leaf_t.
STR_SIZEReturns the size of the string this node or leaf matches.
STR_STARTReturns the start address of the string this node or leaf matches.
change
INIT_AS_NODEInits the string description fields and sets the type to suffixtree_node_t.
INIT_AS_LEAFInits the string description fields and sets the type to suffixtree_leaf_t.
SKIP_STR_BYTESIncrements suffixtree_leaf_t.str_start and decrements suffixtree_leaf_t.str_size with _add.

next_child

suffixtree_node_t * next_child

Used to manage childs in a single linked list.

str_start

const uint8_t * str_start

Start address of matched string.

str_size

size_t str_size

Length in bytes of str_start.  This value can be 0 in case of an (end of input) marker.

lifetime

new_suffixtreeleaf

static inline int new_suffixtreeleaf(/*out*/suffixtree_leaf_t **leaf)

Allocates new leaf of suffix tree.

delete_suffixtreeleaf

static int delete_suffixtreeleaf(suffixtree_leaf_t **leaf)

Frees resources associated with suffixtree_leaf_t.

generic

suffixtree_leaf_EMBED

#define suffixtree_leaf_EMBED() suffixtree_node_t * next_child ; const uint8_t * str_start ; size_t str_size

Allows to embed all data fields of suffixtree_leaf_t into another structure.

query

ISLEAF

#define ISLEAF(leaf) ((leaf)->str_size & 0x01)

Returns true if parameter points to node of type suffixtree_leaf_t.

STR_SIZE

#define STR_SIZE(leaf) ((leaf)->str_size >> 1)

Returns the size of the string this node or leaf matches.

STR_START

#define STR_START(leaf) ((leaf)->str_start)

Returns the start address of the string this node or leaf matches.

change

INIT_AS_NODE

Inits the string description fields and sets the type to suffixtree_node_t.

INIT_AS_LEAF

Inits the string description fields and sets the type to suffixtree_leaf_t.

SKIP_STR_BYTES

Increments suffixtree_leaf_t.str_start and decrements suffixtree_leaf_t.str_size with _add.

suffixtree_node_t

struct suffixtree_node_t

A node in suffixtree_t which has childs.  A node is also linked to a another node with its suffix_link.  A node indicates a position in the tree and contains a matching string.  If a node matches then its contained string is matched and also all strings encountered along the path from root to this node.  Every node has at least two childs.  This ensures that suffixtree_t contains no more than (n-1) suffixtree_node_t where n is the length of the input string for which the tree was built.

Summary
suffixtree_leaf_EMBEDInherit all data fields from suffixtree_leaf_t.
childsList of all children.
suffix_linkPoints to node which matches same string with this node except without the first character.
lifetime

suffixtree_leaf_EMBED

suffixtree_leaf_EMBED()

Inherit all data fields from suffixtree_leaf_t.

childs

suffixtree_node_t * childs

List of all children.

suffix_link

suffixtree_node_t * suffix_link

Points to node which matches same string with this node except without the first character.  If this node matches (path from root to this node + contained string) string “c₀c₁...cₘ” then the node suffix_lkinks points to matches “c₁...cₘ”.  If this node is a child of root and matches only one character then suffix_link points to root and has value NULL.

lifetime

suffixtree_pos_t

struct suffixtree_pos_t

Describes a position in suffixtree_t.

 root
  ↓
 ...
  ↓
parent
  ↓
 node    // matches matched_len characters
Summary
matched_lenThe number of characters matched in node.
nodeThe current node indicates the position in the tree.
parentThe parent of node.
lifetime
suffixtree_pos_INITStatic initializer.

matched_len

size_t matched_len

The number of characters matched in node.  The following condition is always true:

matched_len < (node ? STR_SIZE(node) : 1).

If (matched_len == 0) then node has fully matched and the processing step assigns node to parent and a child of node to node.

node

suffixtree_node_t * node

The current node indicates the position in the tree.  All parent nodes encountered has been matched fully and this node accounts for matched_len characters.  If this value is NULL it points to the root and matched_len is also 0.

parent

suffixtree_node_t * parent

The parent of node.  If node is NULL this value is also NULL and has no meaning.

lifetime

suffixtree_pos_INIT

#define suffixtree_pos_INIT { 0, NULL, NULL }

Static initializer.

suffixtree_addstate_t

struct suffixtree_addstate_t

Holds the position in suffixtree_t and the next character to be added.

Summary
posPosition in suffixtree_t where the next character is added.
suffixThe string which describes the current suffix.
lifetime
suffixtree_addstate_INITStatic initializer.

pos

suffixtree_pos_t pos

Position in suffixtree_t where the next character is added.

suffix

string_t suffix

The string which describes the current suffix.  The first character of the suffix is read suffix.addr[0] if (suffix.size > 0).  A special end-marker is also allowed with a suffix.size of 0.  This is translated internal into the character code 256 which does not exist in any string.

lifetime

suffixtree_addstate_INIT

#define suffixtree_addstate_INIT(
   strsize,
   straddr
) { suffixtree_pos_INIT, string_INIT(strsize,straddr) }

Static initializer.

suffixtree_t

Summary
description
Build a Suffix TreeA suffix tree is built from an input string and it stores all suffixes.
Optimize Tree ConstructionTo be able to optimize the building process we introduce nodes which can store more than one character.
helper
compiletime_testsTests that suffixtree_node_t inherits from suffixtree_leaf_t.
lifetime
build
replacechild_suffixtreeReplaces old_child of parent with new_child.
splitnode_suffixtreeThe node in pos (see suffixtree_pos_t.node) is splitted into two nodes.
addchar_suffixtreeThis function process the next input character in state->suffix.
query
findstring_suffixtreeSearches the node matching searchstring beginning from root.
test

description

Build a Suffix Tree

A suffix tree is built from an input string and it stores all suffixes.

The tree is initialized as empty where the root is denoted by “*”.

[0] empty tree
    *

Now we build a suffix from the input string “ccxccxccc”.  It is built character by character starting from left until the end of string is reached.  In following diagram an arrow top to bottom is a pointer from parent to child.  A node matches its contained character.  No two childs of the same parent can match the same character.

An end node matches a suffix.  It is marked with a subscript ₀,₁,...  An input string of length n has n matching end nodes.  The value of the subscript is the start offset of a matched suffix.

An arrow from child to parent is called the suffix link.  The suffix link of a node points to the node which matches the next substring without the first character.  For example the child

A possible suffix tree may look like this

[1] add "ccxccxccc" │  [2] add "ccxccxccc" │  [3] add "ccxccxccc"
         ^          │            ^         │             ^
    *               │      *               │      * ──────╮
    ↓               │      ↓               │      ↓       ↓
    c₀              │   ╭> c₁              │   ╭> c ──╮ ╭>x₂
                    │   │  ↓               │   │  ↓   ↓ │
                    │   ╰─ c₀              │   ╰─ c ╭>x₁╯
                    │                      │      ↓ │
                    │                      │      x₀╯

For example the child node “x₀” matches the string “ccx”.  Its suffix link points to “x₁” which matches “cx” the next smaller suffix of “ccx” (without the first character).  The suffix links which point to the root (root matches empty string) are omitted.

[4] add "ccxccxccc" │  [5] add "ccxccxccc" │  [6] add "ccxccxccc"
            ^       │               ^      │                ^
     * ─────╮       │       * ──────╮      │       * ────────╮
     ↓      ↓       │   ╭───↓───────↓──╮   │   ╭───↓─────────↓──╮
 ╭>╭>c₃ ─╮╭>x       │   ╰>╭>c₃ ─╮╭─>x  │   │   ╰>╭>c ──╮ ╭─> x₅ │
 │ │ ↓   ↓│ ↓       │     │ ↓   ↓│  ↓  │   │     │ ↓   ↓ │   ↓  │
 │ ╰ c╭─>x╯ c₂<╮    │   ╭>╰ c₄╭>x╯  c<╮╯   │   ╭>╰ c ╭>x₃╯╭─>c ─╯
 ╰───↓│──↓──╯  │    │   ╰───↓╭╯─↓───↓─│─╮  │   ╰───↓╭╯──↓─┼──↓──╮
     x╯╭>c₁────╯    │       x╯╭>c╮╭>c₂┼─╯  │       x₄╭> c ╯╭>c──╯
     ↓ │            │       ↓╭╯ ↓╰┼───╯    │       ↓╭╯  ↓ ╭╯ ↓
     c₀╯            │       c╯╭>c₁╯        │       c╯╭> c ╯╭>x₂
                    │       ↓ │            │       ↓ │  ↓ ╭╯
                    │       c₀╯            │       c ╯╭>x₁╯
                    │                      │       ↓  │
                    │                      │       x₀ ╯

As you can see the construction takes all end nodes with a subscript and append a new node matching the next character.  The subscript value is moved to the appended node.  If a child already exists with the same matching character this child becomes the new end node with the subscript value.

All nodes which matches a suffix (end nodes) lie on a (the outermost) suffix link chain as you can see in the above diagram.  So appending a new character becomes following the suffix link chain which is O(i) for a suffix tree which matches i suffixes.  To construct the whole tree for input length n we need the time O(1+2+3+...+n) which is O(n*n).

Optimize Tree Construction

1.  Node matches x characters

To be able to optimize the building process we introduce nodes which can store more than one character.  This allows us to forbid “characters chains” c->c->x where a node has only one child.  Every node has therefore at least two childs or it is a leaf (with no childs).

The characters are not stored in the nodes only start and end pointers into to the input string.  Therefore creating a node is possible in constant time and not dependant on the length of contained characters.

2.  End nodes become leaves and always match a whole suffix

A leaf matches all characters from its first matching character (offset i) until the end of string This allows us to ignore end nodes after creation except if we need to split them.  Also a leaf node does not have a suffix pointer cause you never follow it.

3.  To understand the optimized algorithm from Ukkonen we need to understand that only nodes (not leaves) of the unoptimized version needs to be updated.  The optimized version does no create a node for every single character.  Therefore a node can implicitly be represented as a prefix of length p from a node or leave which matches more than p characters.

Such nodes (or leaves) must be split so that only the prefix becomes represented in a newly created node and the remainder of the string is matched in the old (splitted) node.

4.  Every added character now checks a node if either the next character of the matching string in the node equals the new added character or if it has a child matching this character.  If not the node is split if necessary and a leave is added.  Noew the next suffix is checked by following the suffix link and the whole step is repeated until root is reached or a node found which already has a child matching the added character.

5.  Our tree only grows by a possible splitting of a node and creating a new leaf.  So having a leaf we never have to update it except if it has to be splitted and a prefix of becomes a node.  It suffices for the algorithm the remember only the last position (possibly root) where it stopped and begin from this position again adding the following character.

6.  For every node created through splitting its suffix pointer is computed.  Which is done in the next iteration of the loop in step 4.

In the following diagram the sequence ‘--^’ marks the position in the tree where the next step will insert the next character.

[1] add "ccxccxccc" │  [2] add "ccxccxccc" │  [3] add "ccxccxccc"
         ^          │            ^         │             ^
    *╮              │      *               │      *╮──────────╮
  --^↓              │      ↓               │    --^↓          ↓
     ccxccxccc₀     │      ccxccxccc₀      │    ╭─>c─────────╮xccxccc₂<╮
                    │    --^               │    │  ↓         ↓         │
                    │                      │    ╰  cxccxccc₀ xccxccc₁──╯

The first step adds the whole string which is a suffix of itself as leaf to the tree.  Cause we have added it to the root we are ready.  In the next step we add “c” which is is the first character of the already added leaf and therefore we need to do nothing except to move our position ‘--^’ to this first character.  In the third step we add x.  The following character of our current position is ‘c’ so we have to split the leaf into node ‘c’ and leaf ‘cxccxccc₀’.  Then we add a new leaf ‘xccxccc₁’ and follow the suffix link of the parent of node ‘c’ (node ‘c’ has an empty suffix link after the split operation).  The parent of ‘c’ is the root node so it has no suffix link and therefore we skip the first character from our current position ‘c’ and follow root with ‘’ which is root.  Then we check if root contains a child beginning with ‘x’.  It does not so we add a new leaf to root (the same as before except it matches at position 2).  The position is root so we are ready.

[4] add "ccxccxccc"      │  [5] add "ccxccxccc"     │  [6] add "ccxccxccc"
            ^            │               ^          │                ^
    *───────────╮        │      *──────────╮        │      *──────────╮
    ↓           ↓        │      ↓          ↓        │      ↓          ↓
    c╮─────────╮xccxccc₂ │      c─────────╮xccxccc₂ │      c─────────╮xccxccc₂
  --^↓         ↓         │      ↓         ↓         │      ↓         ↓
     cxccxccc₀ xccxccc₁  │      cxccxccc₀ xccxccc₁  │      cxccxccc₀ xccxccc₁
                         │    --^                   │     --^

The steps 4 up to 6 are simple they move only the position in the tree.

[7] add "ccxccxccc"      │  [8] add "ccxccxccc"     │  [9] add "ccxccxccc"
               ^         │                  ^       │                   ^
    *───────────╮        │      *──────────╮        │    *───────────────╮
    ↓           ↓        │      ↓          ↓        │    ↓               ↓
    c─────────╮ xccxccc₂ │      c─────────╮xccxccc₂ │    c───────╮  ╭──> xcc───╮
    ↓         ↓          │      ↓         ↓         │    ↓       ↓ ╭╯    ↓     ↓
    cxccxccc₀ xccxccc₁   │      cxccxccc₀ xccxccc₁  │ ╭──c───╮╭─>xcc───╮ xccc₂ c₅
    --^                  │       --^                │ ↓--^╭──↓╯  ↓     ↓
                         │                          │ xcc─╯─╮c₆  xccc₁ c₄
                         │                          │ ↓     ↓
                         │                          │ xccc₀ c₃

Steps 7 and 8 only move the position in the tree.  Step 9 is more interesting.  It is shown in the next diagrams splitted into 5 substeps.

[9.1] add "ccxccxccc"    │  [9.2] add "ccxccxccc"   │  [9.3] add "ccxccxccc"
                   ^     │                     ^    │                     ^
    *──────────╮         │      *──────────╮        │      *──────────────╮
    ↓          ↓         │      ↓          ↓        │      ↓              ↓
    c─────────╮xccxccc₂  │      c──────╮   xccxccc₂ │      c──────╮  ╭──> xcc───╮
    ↓         ↓          │      ↓      ↓   --^      │      ↓  ╭─╮ ↓ ╭╯    ↓     ↓
    cxcc──╮   xccxccc₁   │   ╭──cxcc──>xcc───╮      │   ╭─ cxcc ╰>xcc───╮ xccc₂ c₅
    ↓     ↓   --^        │   ↓     ↓   ↓     ↓      │   ↓--^  ↓   ↓     ↓
    xccc₀ c₃             │   xccc₀ c₃  xccc₁ c₄     │   xccc₀ c₃  xccc₁ c₄

The first substep splits the node where the currrent position is set in step 8.  A new leaf is added.  The suffix “cxcc” is followed which defines the new position in the tree.  Substep 2 splits the node the current position points to into node “xcc” and leave “xccc”.  A new leaf is added to “xcc”.  The suffix “xcc” is followed which defines the new position in the tree.  Substep 3 splits the node the current position points to into node “xcc” and leave “xccc”.  A new leaf is added to “xcc”.  Then the suffix pointer from the created in node in step 2 is set to the new created node.  The suffix “cc” is followed which defines the new position in the tree.

[9.4] add "ccxccxccc"       │  [9.5] add "ccxccxccc"
                   ^        │                     ^
   *───────────────╮        │    *───────────────╮
   ↓╭─────────────╮↓        │    ↓               ↓
   c┼──────╮  ╭──>xcc ───╮  │  ╭>c───────╮  ╭──> xcc───╮
--^↓↓      ↓ ╭╯    ↓     ↓  │  ╰╮↓       ↓ ╭╯    ↓     ↓
╭─ c───╮╭─>xcc───╮ xccc₂ c₅ │ ╭─╰c───╮╭─>xcc───╮ xccc₂ c₅
↓ ╭────↓╯  ↓     ↓          │ ↓--^╭──↓╯  ↓     ↓
xcc───╮c₆  xccc₁ c₄         │ xcc─╯─╮c₆  xccc₁ c₄
↓     ↓                     │ ↓     ↓
xccc₀ c₃                    │ xccc₀ c₃

Substep 4 split node “cxcc” into nodes “c” and “xcc”.  A new leaf is added to “c” and the suffix pointer from the node created in step 3 is set to point to node “c”.  The suffix “c” is followed which defines the new position in the tree.  Substep 5 need not to split the node and checks its childs.  A child with “c” exists so the alogithm ends here.  The suffix pointer from the node created in step 4 is set to point to “c”.

The algorithm has created only leaves 0-6.  The leaves matching position 7 and 8 were not created.  To create them an explicit end-marker (empty string ‘’) is added in step 10.

[10] add "ccxccxccc"
                   ^
          *───────────────╮
       --^↓               ↓
╭───────╭>c───────╮  ╭──> xcc───╮
↓       ╰╮↓       ↓ ╭╯    ↓     ↓
''₈╭───╭─╰c───╮╭─>xcc───╮ xccc₂ c₅
   ↓   ↓   ╭──↓╯  ↓     ↓
   ''₇ xcc─╯─╮c₆  xccc₁ c₄
       ↓     ↓
      xccc₀ c₃

The end-marker which never occurs in the string adds a new leave to the node pointed to by the current position (see step 9.5).  Then it follow the suffix “c” and adds another leave.  The suffix “” is the root but no end-marker is added to the root.  Having reached the root ends the algorithm.

helper

compiletime_tests

static inline void compiletime_tests(void)

Tests that suffixtree_node_t inherits from suffixtree_leaf_t.

lifetime

build

replacechild_suffixtree

static int replacechild_suffixtree(suffixtree_t *tree,
suffixtree_node_t *parent,
suffixtree_node_t * const old_child,
suffixtree_node_t * const new_child)

Replaces old_child of parent with new_child.

Before return <suffixtree_node_t.next_child> of old_child is cleared.

splitnode_suffixtree

static int splitnode_suffixtree(suffixtree_t *tree,
suffixtree_pos_t *pos)

The node in pos (see suffixtree_pos_t.node) is splitted into two nodes.  For example consider a node which matches the string “ab” and pos->matched_len == 1.  After split a newly created node which replaces the position of the old matches “a” and its child which is the old node now matches “b”.  Therfore suffix pointers to node are valid furthermore.

(unchecked) Preconditions

1.pos->matched_len >= 1
2.STR_LEN(pos->node) > pos->matched_len

The preconditions ensure that after the split both nodes match a string of length >= 1.

addchar_suffixtree

static int addchar_suffixtree(suffixtree_t *tree,
suffixtree_addstate_t *state)

This function process the next input character in state->suffix.  See <suffixtree_t.Optimize Tree Construction> for a description of the algorithm and <build_suffixtree> for how this function is driven.

INFO: extension needed for adding multiple strings

node and leaf are used interchangeable.  This makes no problem cause a suffix will never match fully a leaf.  But if more than one string is added to suffix tree then this could become a problem ! The reason is that a string previously added could be a prefix of any later added string.  So leaves could become parents which is an error.  Before this could happen they must be splitted and converted to a node and an empty leaf (end marker) !!  This INFO is only useful if adding more than one string will be supported by some future changes.

query

findstring_suffixtree

Searches the node matching searchstring beginning from root.  The out parameter pos contains after successul return the found node, its parent, and the number of bytes which are matched from the beginning of the string in node.

Nodes can match a string of characters.  Therefore the returned number in pos->matched_len (see suffixtree_pos_t.matched_len) gives the number of matched characters (number of bytes) matched from the returned node - its value is always > 0.

test

A suffix tree is a tree which stores all suffixes of a given string.
Implements Suffix-Tree.
typedef struct suffixtree_iterator_t suffixtree_iterator_t
Shortcut for suffixtree_iterator_t.
struct suffixtree_iterator_t
Stores position in suffixtree_t which has to be visited next.
typedef struct suffixtree_addstate_t suffixtree_addstate_t
Shortcut for suffixtree_addstate_t.
struct suffixtree_addstate_t
Holds the position in suffixtree_t and the next character to be added.
typedef struct suffixtree_leaf_t suffixtree_leaf_t
Shortcut for suffixtree_leaf_t.
struct suffixtree_leaf_t
A node in suffixtree_t which has no childs.
typedef struct suffixtree_node_t suffixtree_node_t
Shortcut for suffixtree_node_t.
struct suffixtree_node_t
A node in suffixtree_t which has childs.
typedef struct suffixtree_pos_t suffixtree_pos_t
Shortcut for suffixtree_pos_t.
struct suffixtree_pos_t
Describes a position in suffixtree_t.
slist_node_t * next
Used in suffixtreeiterator_list_t to connect all contained suffixtree_iterator_t.
size_t prefixlen
Sum of length of strings on the path from root up to next_child.
suffixtree_node_t * next_child
Pointer to suffixtree_node_t which has to be visited next on this level of hierachy.
static inline int new_suffixtreeiterator(/*out*/suffixtree_iterator_t **iter)
Allocates a new suffixtree_iterator_t object.
static int delete_suffixtreeiterator(suffixtree_iterator_t **iter)
Frees all memory allocated for suffixtree_iterator_t object.
static int freenode_suffixtreeiteratoradapter(
   suffixtreeiterator_adapter_t *typeadp,
   suffixtree_iterator_t **iter
)
Frees memory of object type suffixtree_iterator_t.
#define slist_IMPLEMENT(
   _fsuffix,
   object_t,
   name_nextptr
) typedef slist_iterator_t iteratortype##_fsuffix ; typedef object_t * iteratedtype##_fsuffix ; static inline int initfirst##_fsuffix##iterator(slist_iterator_t * iter, slist_t * list) __attribute__ ((always_inline)) ; static inline int free##_fsuffix##iterator(slist_iterator_t * iter) __attribute__ ((always_inline)) ; static inline bool next##_fsuffix##iterator(slist_iterator_t * iter, object_t ** node) __attribute__ ((always_inline)) ; static inline void init##_fsuffix(slist_t * list) __attribute__ ((always_inline)) ; static inline int free##_fsuffix(slist_t * list, struct typeadapt_t * typeadp) __attribute__ ((always_inline)) ; static inline int isempty##_fsuffix(const slist_t * list) __attribute__ ((always_inline)) ; static inline object_t * first##_fsuffix(const slist_t * list) __attribute__ ((always_inline)) ; static inline object_t * last##_fsuffix(const slist_t * list) __attribute__ ((always_inline)) ; static inline object_t * next##_fsuffix(object_t * node) __attribute__ ((always_inline)) ; static inline bool isinlist##_fsuffix(object_t * node) __attribute__ ((always_inline)) ; static inline void insertfirst##_fsuffix(slist_t * list, object_t * new_node) __attribute__ ((always_inline)) ; static inline void insertlast##_fsuffix(slist_t * list, object_t * new_node) __attribute__ ((always_inline)) ; static inline void insertafter##_fsuffix(slist_t * list, object_t * prev_node, object_t * new_node) __attribute__ ((always_inline)) ; static inline int removefirst##_fsuffix(slist_t * list, object_t ** removed_node) __attribute__ ((always_inline)) ; static inline int removeafter##_fsuffix(slist_t * list, object_t * prev_node, object_t ** removed_node) __attribute__ ((always_inline)) ; static inline int removeall##_fsuffix(slist_t * list, struct typeadapt_t * typeadp) __attribute__ ((always_inline)) ; static inline slist_node_t * asnode##_fsuffix(object_t * object) { static_assert(&((object_t*)0)->name_nextptr == (slist_node_t**)offsetof(object_t, name_nextptr), "correct type") ; return (slist_node_t *) ((uintptr_t)object + offsetof(object_t, name_nextptr)) ; } static inline object_t * asobject##_fsuffix(slist_node_t * node) { return (object_t *) ((uintptr_t)node - offsetof(object_t, name_nextptr)) ; } static inline object_t * asobjectnull##_fsuffix(slist_node_t * node) { return node ? (object_t *) ((uintptr_t)node - offsetof(object_t, name_nextptr)) : 0 ; } static inline void init##_fsuffix(slist_t * list) { init_slist(list) ; } static inline void initsingle##_fsuffix(slist_t * list, object_t * node) { initsingle_slist(list, asnode##_fsuffix(node)) ; } static inline int free##_fsuffix(slist_t * list, struct typeadapt_t * typeadp) { return free_slist(list, offsetof(object_t, name_nextptr), typeadp) ; } static inline int isempty##_fsuffix(const slist_t * list) { return isempty_slist(list) ; } static inline object_t * first##_fsuffix(const slist_t * list) { return asobjectnull##_fsuffix(first_slist(list)) ; } static inline object_t * last##_fsuffix(const slist_t * list) { return asobjectnull##_fsuffix(last_slist(list)) ; } static inline object_t * next##_fsuffix(object_t * node) { return asobject##_fsuffix(next_slist(asnode##_fsuffix(node))) ; } static inline bool isinlist##_fsuffix(object_t * node) { return isinlist_slist(asnode##_fsuffix(node)) ; } static inline void insertfirst##_fsuffix(slist_t * list, object_t * new_node) { insertfirst_slist(list, asnode##_fsuffix(new_node)) ; } static inline void insertlast##_fsuffix(slist_t * list, object_t * new_node) { insertlast_slist(list, asnode##_fsuffix(new_node)) ; } static inline void insertafter##_fsuffix(slist_t * list, object_t * prev_node, object_t * new_node) { insertafter_slist(list, asnode##_fsuffix(prev_node), asnode##_fsuffix(new_node)) ; } static inline int removefirst##_fsuffix(slist_t * list, object_t ** removed_node) { int err = removefirst_slist(list, (slist_node_t**)removed_node) ; if (!err) *removed_node = asobject##_fsuffix(*(slist_node_t**)removed_node) ; return err ; } static inline int removeafter##_fsuffix(slist_t * list, object_t * prev_node, object_t ** removed_node) { int err = removeafter_slist(list, asnode##_fsuffix(prev_node), (slist_node_t**)removed_node) ; if (!err) *removed_node = asobject##_fsuffix(*(slist_node_t**)removed_node) ; return err ; } static inline int removeall##_fsuffix(slist_t * list, struct typeadapt_t * typeadp) { return removeall_slist(list, offsetof(object_t, name_nextptr), typeadp) ; } static inline int initfirst##_fsuffix##iterator(slist_iterator_t * iter, slist_t * list) { return initfirst_slistiterator(iter, list) ; } static inline int free##_fsuffix##iterator(slist_iterator_t * iter) { return free_slistiterator(iter) ; } static inline bool next##_fsuffix##iterator(slist_iterator_t * iter, object_t ** node) { bool isNext = next_slistiterator(iter, (slist_node_t**)node) ; if (isNext) *node = asobject##_fsuffix(*(slist_node_t**)node) ; return isNext ; }
Implements slist_t.slist_IMPLEMENT.
static suffixtree_iterator_t * pushnew_iterlist(slist_t *stack)
Creates new suffixtree_iterator_t and pushes onto suffixtreeiterator_list_t.
suffixtree_node_t * next_child
Used to manage childs in a single linked list.
const uint8_t * str_start
Start address of matched string.
size_t str_size
Length in bytes of str_start.
static inline int new_suffixtreeleaf(/*out*/suffixtree_leaf_t **leaf)
Allocates new leaf of suffix tree.
static int delete_suffixtreeleaf(suffixtree_leaf_t **leaf)
Frees resources associated with suffixtree_leaf_t.
#define suffixtree_leaf_EMBED() suffixtree_node_t * next_child ; const uint8_t * str_start ; size_t str_size
Allows to embed all data fields of suffixtree_leaf_t into another structure.
#define ISLEAF(leaf) ((leaf)->str_size & 0x01)
Returns true if parameter points to node of type suffixtree_leaf_t.
#define STR_SIZE(leaf) ((leaf)->str_size >> 1)
Returns the size of the string this node or leaf matches.
#define STR_START(leaf) ((leaf)->str_start)
Returns the start address of the string this node or leaf matches.
suffixtree_leaf_EMBED()
Inherit all data fields from suffixtree_leaf_t.
suffixtree_node_t * childs
List of all children.
suffixtree_node_t * suffix_link
Points to node which matches same string with this node except without the first character.
size_t matched_len
The number of characters matched in node.
suffixtree_node_t * node
The current node indicates the position in the tree.
suffixtree_node_t * parent
The parent of node.
#define suffixtree_pos_INIT { 0, NULL, NULL }
Static initializer.
suffixtree_pos_t pos
Position in suffixtree_t where the next character is added.
string_t suffix
The string which describes the current suffix.
#define suffixtree_addstate_INIT(
   strsize,
   straddr
) { suffixtree_pos_INIT, string_INIT(strsize,straddr) }
Static initializer.
static inline void compiletime_tests(void)
Tests that suffixtree_node_t inherits from suffixtree_leaf_t.
static int replacechild_suffixtree(suffixtree_t *tree,
suffixtree_node_t *parent,
suffixtree_node_t * const old_child,
suffixtree_node_t * const new_child)
Replaces old_child of parent with new_child.
static int splitnode_suffixtree(suffixtree_t *tree,
suffixtree_pos_t *pos)
The node in pos (see suffixtree_pos_t.node) is splitted into two nodes.
static int addchar_suffixtree(suffixtree_t *tree,
suffixtree_addstate_t *state)
This function process the next input character in state->suffix.
Generates list managing suffixtree_iterator_t -- see slist_IMPLEMENT.
Close