CSC 421: Algorithm Design & Analysis
Spring 2017

HW6: D&C vs. Dynamic

Note: late submissions will NOT be accepted for this final assignment.

Many computer applications rely on finding close matches to partial or misspelled words. For example, most spellcheckers not only tell you when a word is misspelled, but also recommend possible words. Text messaging applications will recommend words based on partial matches with the characters typed in so far. One way to identify close matches is to utilize edit distance.

A word or phrase can be transformed using one of three operations:

If we associate a cost with each of these operations, it becomes possible to calculate the minimal cost by which one phrase can be transformed into another. For example, consider transforming "abcd" into "acxy". If all three operations have the same cost of 1, then the minimal cost is 3 (delete 'b', substitute 'x' for 'd', and insert 'y').

The following pseudocode describes a divide & conquer solution to calculating the Minimal Edit Distance (MED) between text1 and text2: MED(text1, text2): if text1 is the empty string, MED = INSERT_COST * text2.length() if text2 is the empty string, MED = DELETE_COST * text1.length() otherwise, MED = minimum of distSubstFirst, distInsertFirst & distDeleteFirst where: if text1 and text have the same first letter, distSubstFirst = MED(text1.substring(1), text2.substring(1)) otherwise, distSubstFirst = SUBST_COST + MED(text1.substring(1), text2.substring(1)) distInsertFirst = INSERT_COST + MED(text1, text2.substring(1)) distDeleteFirst = DELETE_COST + MED(text1.substring(1), text2)

Part 1: Divide & Conquer Solution

Define an EditDistance class that contains a static method named MED. This method should implement the above divide & conquer solution using top-down recursion. Your class should have constants for SUBST_COST, INSERT_COST, and DELETE_COST, which are all assigned 1 (but can be easily changed). You should also have a driver class which allows you to repeatedly enter lines of text and which displays the Minimal Edit Distance between those lines. For example:

Enter the starting text (empty line to terminate): abcd Enter the goal text: acxy Minimal Edit Distance = 3 Enter the starting text (empty line to terminate): qwert abc Enter the goal text: abqwert c Minimal Edit Distance = 4 Enter the starting text (empty line to terminate): DONE

Test your code on a variety of phrases and answer the following questions:

  1. What kinds of phrases tend to take the longest time to calculate? That is, for a String of length N, is it slower if the goal phrase is roughly the same size or significantly shorter/longer? Is it slower if there is no overlap of characters between the two phrases, or when the phrases are similar in content?
  2. Are the recursive cases in this divide and conquer solution largely independent, or is there significant overlap? If there is overlap, how big can the phrases be before the overlap slows the calculation unreasonably?

Part 2: Dynamic Programming Solution

The potential drawback of a top-down implementation of divide & conquer is redundancy. If there is significant overlap between the recursive cases, the redundancy can grow exponentially, leading to drastic decreases in performance as the problem size grows.

  1. Add a method named MEDdynamic which utilizes the same recursive relationships but in a bottom-up fashion. Note that each recursive call deals with the phrases (text1 and text2) or successive suffixes of those phrases. Thus, you should be able to create a table where the rows correspond to suffixes of text1 and the columns correspond to suffixes of text2. That is, table[r][c] would store the Minimal Edit Distance between the last r characters of text1 and the last c characters of text2. When the table is filled, the Minimal Edit Distance between text1 and text2 should be at table[text1.length()][text2.length()].
  2. Does this bottom-up implementation perform significantly better than the top-down version as the length of the text samples increases? Justify your answer with some observational timings.