Abstract String / String ADT

Relation: explicitly defined linear ordering

The Abstract String is a programmer-defined linear ordering of objects from a finite set of objects termed an alphabet. The alphabet may be restricted to alphanumeric and punctuation characters but may also include additional mark-up information such as a new line character.

Strings, unlike a more general list, are generally assumed to have information coded not only in the individual characters but also in the ordering of the characters. (The significance of an object in a general container does not usually change based on the placement of other objects within the container.)

Critical Operations for Abstract Strings

  • The concatenation of two strings,
  • The extraction of substrings,
  • Searching a string for a matching substring,
  • Matching regular expressions,
  • Parsing,

and others, many of which may be found on Wikipedia.

If characters can be inserted, modified, or removed from a string, the string is said to be mutable. Otherwise, the string is immutable.

A regular expression is a means of describing a pattern of characters within a given alphabet. Determining which sequence of characters matches a particular pattern is often useful for editing or even describing a programming language.

More recently, DNA has been recognized to be a string containing an alphabet of four base pairs usually labeled A-T, G-C, C-G, and T-A. Thus, many operations designed for alphanumeric strings are now used on DNA strings; however, in the former, the lengths of strings are often significantly shorter while the latter has strings of length 220 million. This has led to significant research in optimizing string operations.

Example Data Structures

The most common and basic implementation of strings is an array of characters. In C, a string was defined as an array of characters where the last character in the string was followed by a zero character ('\0' which has the bit representation 00000000).

A rope is either an ordered tree where all leaves are character arrays or a partial ordering where all nodes with no successors are character arrays and the successors of all other nodes are ordered. The later may be used to reduce redundancy in a string which has many repeated substrings.

A text file may be interpreted as a single string; however, most text editors will represent the text file as a linearly ordered sequence of lines and each line is represented by a string. Modifications made by the user affect only the line the user is currently modifying. In the case where the length of a line grows too long, the text editor may reinterpret the line in some other form to continue to allow fast insertions, accesses, and removals.

XML is also a tree representation of an object where strings are leaves.