31 Tries

T.V. Geetha

epgp books

 

 

 

Welcome to the e-PG Pathshala Lecture Series on Data Structures. We will now look at another data structure based on trees called the Tries. Tries are appropriate when we want to search with keys where many words begin with the same sequence of letters.

 

Learning Objectives

 

The learning objectives of the module are as follows:

• To discuss the properties of Tries

• To explain the different operations with Tries

• To explain the different types of Tries

 

31.1 Introduction – Tries

 

When searching for the name “Smith” in a phone book, we first locate the group of names starting with “S”, then within those we search for “m”, etc.

 

Idea: Perform a search based on a prefix of the key, rather than a comparison of the whole key. The branching is now determined by a partial key value.Each node branches out to as many nodes as there are characters.

 

All of the searching methods we have seen so far compare entire keys during the search process. The question is why not to consider a key to be a sequence of characters (letters or digits, for example) and use these characters to determine a multi-way branch at each step down the tree. A trie is a compact data structure for representing a set of strings, such as all the words in a text . Tries supports pattern matching queries in time proportional to the pattern size . If the keys are alphabetic names (names), we make a 26-way branch at each step. On the other hand if the keys are natural numbers in base 10, we make a 10 way branch at each step. This can be thought of as similar to a thumb index in a dictionary. The name “trie” is derived from the word retrieval since this data structure has been specifically designed with retrieval in mind as we will see later.

 

Tries are appropriate when many words begin with the same sequence of letters and when the number of distinct prefixes among all words in the set is much less than the total length of all the words.Each path from the root to the leaf corresponds to one word in the represented set.Nodes of the trie correspond to the prefixes of the words in the set.

 

31.1.1 Trie – Definition

 

The standard trie for a set of strings S is an ordered tree such that:

  • each node but the root is labeled with a character
  • the children of a node are alphabetically ordered
  • the paths from the root to the external nodes yield the strings of S

A standard trie uses O(n) space. Typical operations such as find, insert, remove, etc., take time of the order of O(dm) each, wheren is the total size of the strings in S, m is the size of the string parameter of the operation and d is the alphabet size of the set of strings we are representing.

 

31.1.2 Trie – Example

 

Figure 31.1 shows the set of strings {“mat”, “max, “am”, “bad”} as stored in a BST and as stored when using trie.

 

 

As you can see from the root of the tree we have the one letter prefixes of the words in the set – here “a”,”b”,”m” at the next level, the next letter at the next level and so on until at the leaves we have the actual string. Note that for the words mat and max where the prefix “ma” is the same, the same path is traversed and then the path is divided when the letters are different.

 

31.1.2 Properties of a Trie

 

Trie is a multi-way treefor storing strings in which there is there is one node for every common prefix, the actual strings are stored in the leaves. Each node has from 1 to d children where d is the alphabet size. Each edge of the tree is labeled with a character and each leaf node corresponds to the stored string, which is a concatenation of characters on a path from the root to this leaf. Let us see another example of Tries shown in Figure 31.2, Here we assume that all strings end with “$, a character not in S where S is the set of strings we wish to store. Let the set of strings be{bear, bid, bulk, bull, sun, sunday}. The Trie for this set of strings is shown in Figure 31.2. Here

 

 

the first letters of the set of strings are “b” and “s” and hence from the root we have only these two branches. Starting with b we have four strings namely bear, bid, bulk, bullwhere the second letters are “e”, “I” and “u”. This will be the branch from the second level. Now from the node ending “u” we have another common prefix for the strings bulk and bull so there will be only one branch from “u” for both these strings. For all other cases there are no common prefixes and so will have only one branch each. All leaves are reached from the path labelled “$”, since leaves end with “$”. This helps when a prefix is itself a complete string as is the case with the sun and sunday.

 

Another example of the Trie is given in Figure 31.3. Here again $ indicates the end of each word. The Trie corresponds to the set {THE,THEN THIN, TIN, SIN, SING}. In any Trie each node can have at most 27 children (27 for the 26 alphabets +1 for $) but most nodes have fewer children. As explained before a leaf reached by an edge labeled $ will have no children.

 

31.2 Trie Nodes as ADT

 

A node in a trie can be viewed as a mapping whose domain is {A,B,…Z, $} and whose value set is the type “Pointer to trie node”.A trie can be identified with its root.ADT’s TRIE and TRIENODE have the same data type.However, their operations are different.

 

31.2.1 Operations on Trie Nodes

  • ASSIGN(node,c,p): Assign value p (a pointer to a node) to character c in node node.
  • VALUEOF(node, c): Produce the value associated with character c in node.
  • GETNEW(node, c): Make the value of node for character c be a pointer to a new node.
  • MAKENULL(node): Makes node to be null mapping.

 

31.2.2 Tries – Example

 

Assume that we have the following keys in a trie:0, 10, 11, 100, 101, 110, 2, 21 (Figure 31.4). Notice that the root does not store actual data. The searching starts at the root and appropriately follows branches until the leaf is reached (Figure 31.5).

 

 

 

Here we insert the words of the text into trie. Each leaf is associated with one particular word. The leaf stores indices where associated word begins . For example “see” starts at index 0 and 24, and hence leaf for “see” stores those indices (Figure 31.6). Similarly “stock” occurs four times in the text and starts at indices 17,40, 51 and 62 which are stored at the leaves of the trie.

 

Now let us discuss the insertion of 2195 into a trie. When we search from the root we reach a leaf node 21. Now we add two nodes for the characters 9 and 5 as shown in Figure 31.7

 

 

 

 

Figure 31.8 shows the deletion of 10 from the trie. In general we perform search and then delete all nodes on the path from ‘\0’ to the root of thetree unless we reach a node with more than 1 child.

 

 

31.3 Analysis of Tries and Applications

 

The number of steps requires to search a trie (or to insert into it) is proportional to the number of characters making up a key. For example, if we have numbers in the range 0 – 999999, then we have at most 6 steps (comparisons) to reach a key. On the other hand if the number of characters is small relative to the logarithm 2 of the number of keys, a trie may be superior to a binary search tree. For example if keys consist of all 999999 numbers in the range 0-999999, we have at most 6 steps to reach a key, whereas a binary search tree would take log2(999999) ~ 20 steps (key comparisons). A standard trie uses O(n) space where n is the total size of the strings. Operations such as find, insert, removetake time O(dm) each, where m is the length of the string and d is the alphabet size. The best solution may be to combine the methods where a trie can be used for the first few characters of the key, and then another method can be used for the remainder of the key.

 

The main advantage of Tries is that the height of the tree depends on the length of the keys only. A trie can be used to store very large sets but the height (and therefore the search time) will be very short and does not depend on the number of keys stored.

 

A standard trie supports the following operations on pre-processed text in time O(m), where m length of the string X. They are word matching where we want to find the first occurrence of a word X in the text or prefix matching where we want to find the first occurrence of the longest prefix of word X in the text. Each operation is performed by tracing a path in the trie starting at the root. Some of the applications of tries include spell checkers, the search index of a web search engine and the network router where search is carried out using IP address.

 

31.3 Compressed Tries

 

There are some variants of Tries. The problem with standard tries is the dependence on the size of the alphabet which determines the size of the nodes. There are several ways to reduce or avoid the problem of the alphabet size. A simple method, is to replace the big nodes by linked lists of all the entries that are really used.

 

31.3.1 Compressed Tries

 

Now as we have already observed in the case of tries many nodes have few non-null pointers. This space can be saved. For this purpose we use Compressed Tries. Compressed Tries are obtained from standard tries by compressing chains of redundant nodes. The minor disadvantage in this case is that an unsuccessful search may take more steps to end. Here we replace achain of one-child nodes with an edge labeled with a string. Each non-leaf node has at least two children as shown in Figure31.5.

 

Figure 31.5also shows the implementation of the compressed trie. Here the strings are stored external to the structure in one array, edges are labeled with indices in the array (from, to).

 

 

Another special type of compact tree is the Patricia Tree. Patricia stands for Practical Algorithm To Retrieve Information Coded In Alphanumeric. We generally observe that the actual set of keys is a small subset of the potential set of keys. This may result in a large number of nodes that only have one descendant. We would like to save some space. Therefore the idea is to make the tree more compact by collapsing long chains.The resulting tree is called a PATRICIA tree. For chains of nodes that have only one childwe need to indicate how many characters should be skipped or in other words indicate the length of the collapsed chain. This structure is especially good for text searching. We replace sub-words by a reference position in text together with its length. Internal nodes keep only the length and the first letter of the phrase and external nodes keep the position of the phrase in the text. Every internal node has two children and has an indication of branching, that is bit position to branch, `zero’ bit to left subtree and `one’ bit to right subtree (Figure 31.6). A Patricia tree has a node structure indicating whether to travel left or right and an index field indicating which character is to be checked at this node and this value increases as we descend the tree.

 

Suffix Trees

 

A suffix tree for a string T of length m is a rooted tree such that it has exactly m leafs, numbered from 1 to m; every edge has a label, which is a substring of T and every internal node has at least two children. The labels of two edges starting at an internal node do not start with the same character and the label of the path from the root to a leaf numbered I is the suffix of T starting at position i, i.e. T[i..m]. For any leave i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i, that is, spells out S[i,…,m]. An example suffix tree for the string “harare” is shown in Figure 31.7.

 

 

 

The set of keywords is the set of suffixes of S that is {Harare,arare,rare,are,re,e}. Suffix tree is essentially the keyword tree for this set of suffixes of S. We assume that no suffix is a prefix of another suffix (can be a substring, but not a prefix). We ensure thia by adding a character $ to end of S. The internal nodes except root must have at least 2 children and the edges can be labeled with strings, Label of a path from root r to a node v is the concatenation of labels on edges from r to v. In constructing suffix trees, we will need to be able to “split” edges “in the middle” and then build suffix tree for text T. We match pattern P against tree starting at root until P is completely matched and every leaf below this match point is the starting location of P in T or o match is possible which indicates that P does not occur in T.

 

Summary

  • Explained the concept of Tries
  • Outlined important properties of Tries
  • Explained the different operations with Tries
  • Discussed the different type of Tries
you can view video on Tries