String Distances

A distance, or metric between two strings s and t is a bivariate function distance(st) which satisfies the following conditions:

  • distance(st) ≥ 0,
  • distance(ss) = 0 if and only if s = t, and
  • distance(st) = distance(ts),
  • distance(su) ≤ distance(st) + distance(tu).

The last condition is called the triangle inequality which can be easily understood through analogy: the distance from Waterloo to Toronto is less than the distance from Waterloo to Hamilton plus the distance from Hamilton to Toronto.

Four distances, in order of complexity, are given:

  • The Hamming distance (1950),
  • The Levenshtein distance (1964), and
  • The Damerau-Levenshtein distance (1964).