Trie

Requirements:

In this sub-project, you will modify the class:

  1. 26-ary Trie: Trie.
  2. 26-ary Trie node: Trie_node.

This page describes the first class.

Consider wanting to store 10000 words for spell checking. If we were to use an AVL tree, each search would be approximately 20 iterations. We could use an array together with an interpolation search (not covered in this class, but notes exist for it), which would reduce the number of searches to approximately four search, but this would not allow us to easily add or remove words.

An alternate data structure is the trie, from the word retrieve. It is normally pronounced as a homophone of the word try, although the original authors said that it should be a homophone of the word tree.

A trie is a 26-ary tree where the root node represents an empty string "" and if the kth (k going from 0 to 25) sub-tree is not a null sub-tree, it represents a string that is the concatenation of string represented by the parent and the kth letter of the alphabet (where a is the 0th letter, b is the 1st letter, and so on). Each node may or may not indicate that it is a terminal entry for a word.

While a trie could be used to store hyphenated and capitalized words together with those with apostrophes, we will restrict ourselves to words made up of the twenty six letters of the English alphabet.

For example, consider the sentence "The fable then faded from my thoughts and memory." If we wanted to put these seven words into a trie, it would look like Figure 1.

See the following text.
Figure 1. A trie containing the words in the sentence "The fable then faded from my thoughts and memory."

Only the root node explicitly shows its 26 sub-trees, each sub-tree being associated with a letter of the alphabet. In this example, 22 of the sub-trees of the root node are null, as there are no words that begin with, for example, 'b'. Four of the children of the root are not null: those representing the letters 'a', 'f', 'm' and 't'. For the other nodes, only the non-null sub-trees are shown. These sub-trees represent words starting with these letters, respectively. Terminal nodes are represented as red circles. We will look at each of these sub-trees here:

The sub-tree of the word "and" has itself a child representing 'n', and that node has a terminal node following the sub-tree representing 'd'."

There are three words that start with 'f', but two of those have 'a' as a second letter while the other has 'r' as a second letter. The two children of the 'a' contain the next letters of 'fable' and 'faded', while the child of 'r' has 'o' as its next letter. After this, all three words are unique, so these trees degenerate into linked lists with →'b'→'l'→'e', →'d'→'e'→'d' and →'r'→'o'→'m'. The nodes following the 'e', 'd' and 'm' at the end of each of these linked lists is a terminal node.

Next, there are two words that start with 'm'. The one sub-tree represents 'e' which then degenerates into the linked list →'m'→'o'→'r'→'y' and the other simply terminates at 'y'. Both letters 'y' here are terminal nodes.

Finally, the words 'the', 'then' and 'thoughts' all start with 'th', so there is only one child of the first node, and that child represents 'h'. Next, there are two children: one sub-tree represents 'e' and the other 'o'. The child representing 'e' is a terminal node (ending the word "the"), but it also has a child representing 'n', and that node is itself also a terminal node representing "then". Under "o", there is a linked list →'u'→'g'→'h'→'t'→'s'. In this case, the node of the sub-tree representing 's' is a terminal node.

Characters in C++

As for characters, each letter is represented by an 8-bit character (char). Unfortunately, char restricts its self to the letters of the English alphabet as well as a few letters from other European languages. Unicode allows for the encoding of most of the world's alphabets, although Tengwar and Klingon have, as of yet, not yet been included.

The character 'A' is represented by the 8-bit number 010000012 and can be thought of as being exactly that: 65. You can actually add numbers to characters:

	char letter = 'A';
	std::cout << letter << std::endl;   // prints A
	++letter;
	std::cout << letter << std::endl;   // prints B

	letter = 'C';
	int offset = static_cast<int>( letter - 'A' );
	std::cout << offset << std::endl;   // prints 2

	letter = 'M';
	int offset = static_cast<int>( letter - 'A' );
	std::cout << offset << std::endl;   // prints 12

Thus, letter - 'A' can be used to produce an index between 0 and 25. You can even do logical tests: letter >= 'A' && letter <= 'Z' may be used to test if letter is an upper case letter. Tools for working with characters can be found in the cctype library: just be sure to include

#include <cctype>

at the top of your file.

To convert between lower case and upper case, it is only necessary to switch the value of the 3rd most significant bit. As the character 'A' is represented by the 8-bit number 010000012, 'a' is represented by 011000012, 'a'. Thus, if letter >= 'a' && letter <= 'z', then letter - 32 is the corresponding upper-case letter.

Strings in C++

A string is a list of characters. In C, a string of n characters is an array of type char of size n + 1 or greater where the (n + 1)st character is the null character '\0'. Fortunately, in C++, there is a string class in the standard library. You include this library with

#include <string>

And, as it is in the standard template library, the class is defined within the std namespace: std::string. To access the kth character, use an index. For example, the code snippet

	std::string str = "Hello world!";

	for ( int i = 0; i < 15; ++i ) {
		char c = str[i];

		if ( c == '\0' ) {
			// If the ith character is the null character, indicate this...
			std::cout << "a null character" << std::endl;
		} else {
			std::cout << c << std::endl;
		}
	}

produces the output

H
e
l
l
o

w
o
r
l
d
!
a null character
a null character
a null character

You can view other member functions of this class at cplusplus.com.

Class Specifications:


Trie

Description

A class which implements a trie. In this case, the variable n represents length of a string; that is, the number of characters making up the string. This trie will be case insensitive; that is, the words "Hello", "hello" and "HELLO" all represent the same word. In the examples above, we converted all words into lower-case; however, you may chose to convert all letters into upper-case—it will work either way.

Member Variables

This class one member variable:

  • A pointer to the root node, and
  • An integer variable storing the number of words in the tree (that is, the size).

Accessors

This class has four accessors:

bool empty() const
Return true if the trie is empty (the size is 0). (O(1))
int size() const
Returns the number of words in the trie. (O(1))
Trie_node *root() const
Returns a pointer to the root node. (O(1))
bool member( std::String str ) const
Return true if the word represented by the string is in the trie and false otherwise. If the string contains any characters other than those of the English alphabet ('A' through 'Z' or 'a' through 'z'), throw an illegal argument exception1. (O(n))

Mutators

This class has three mutators:

bool insert( std::string str )
Insert the word represented by str into the trie. If the string contains any characters other than those of the English alphabet ('A' through 'Z' or 'a' through 'z'), throw an illegal argument exception1; otherwise if the string is already in the trie, return false; otherwise, return true (the insertion was successful). This is done by calling insert on the root, and if the root node is null, it will be necessary create an instance of the Trie_node class and assign it to the root first. (Θ(n))
bool erase( std::string str )
Erase the word represented by str from the trie. If the string contains any characters other than those of the English alphabet ('A' through 'Z' or 'a' through 'z'), throw an illegal argument exception1; otherwise if the string is not in the trie, return false; otherwise, return true (the erase was successful). If the trie is empty, return false, otherwise this function calls erase on the root. If the word erased is the last one in the trie, delete the root node. (O(n))
void clear()
Delete all the nodes in the tree. Again, if the tree is not empty, it should just call clear on the root and set the appropriate member variables. (O(N) where N is the number of words in the trie)

Please note, you are are allowed to any additional member functions you wish; however, they should be private to the class as they are not specified in this interface.