CSC 421: Algorithm Design & Analysis
Spring 2019

HW1: Brute Force Approximations


In his classic paper, A Mathematical Theory of Communication, Claude E. Shannon described a method for the stochastic generation of text. That is, he showed how it was possible to generate "random" sequences of characters with specific probabilistic characteristics. This included generating "random" text whose probability distribution matched a given source text. Given a source text, he defined the kth-order approximation of that text as:

0th-order approximation:
Characters are generated completely at random (chosen from the same alphabet as the source text).
1st-order approximation:
Characters are generated at random following the same character frequency distribution as in the source text. That is, if the character 'e' appeared twice as often as 'w' in the source text, then the probability of generating an 'e' should be twice the probability of generating a 'w'.
2nd-order approximation:
Characters are generated at random following the same character-pair frequency distribution as in the source text. That is, if the character 't' was the last one generated, then the next character should be generated based on the frequencies of characters that follow 't' in the source text. For example, if 't' appeared 3 times in the source text: in "three" and "teeth", then there would be a 2/3 probability that 'h' would be generated next and a 1/3 probability that 'e' would be generated.
  .
  .
  .
kth-order approximation:
Characters are generated at random following the same (k-character)-combination frequency distribution as in the source text. That is, if "c1...ck-1" were the last k-1 characters generated, then the next character should be generated based on the frequencies of characters that follow "c1...ck-1" in the source text. For example, if k = 4 and "the" appeared 5 times in the source text: in "then", "the ", "weather", "heathen", and "lathe ", then there would be a 2/5 probability that 'n' would be generated next, a 2/5 probability that ' ' would be generated, and a 1/5 probability that 'r' would be generated.

For this assignment, you are to implement a program that generates kth-order approximations of a source text. For simplicity, ignore all characters other than letters and spaces, and treat all letters as uppercase. Your program should prompt the user for the name of the source file, then repeatedly prompt for the approximation order and length for sequences to generate. For example:

Enter the file name: lincoln.txt

Enter the approximation order and string length (negatives to quit): 0 40
YTMLOLGPEGEUDHPRIO WHELAUGE PUVYGNVNHELY

Enter the approximation order and string length (negatives to quit): 1 40
SENATHTMT W  NOTWSWEHC EVE IPWRTOAOTH SH

Enter the approximation order and string length (negatives to quit): 2 60
 IT GONGHETHOMED T THEDERNOUR HAICONMATIECOMIGOTIT CA ST LY 

Enter the approximation order and string length (negatives to quit): 3 60
E POSIT COR THE MAY DEAREMAING WILL NAT HAT PRIED TO ALL PER

Enter the approximation order and string length (negatives to quit): 4 60
FOR US THAT CAN NEW BIRTH ON COME TO DEAD WHO DEDICATE NATIO

Enter the approximation order and string length (negatives to quit): 5 60
AR ABOVE OUR FATHER TO THESE DEDICATE WE HERE GAVE COME TO D

Enter the approximation order and string length (negatives to quit): -1 -1
**DONE**

Note that a 0-th order approximation is completely random, but using only those characters that appear in the source text. A 1st-order approximation is similar only in the relative frequencies of the characters. As the order increases, the generated text pulls larger and larger sequences of characters from the source text. By the time the order reaches 8 or 10, it is likely that entire sections of the source text will be generated.

To simplify your task, you are given two starter files: Approximation.java and ApproxDriver.java. The Approximation class reads the contents of a text file (ignoring any characters other than letters or spaces) and provides a generate method for generating sequences whose probability distribution matches the file. ApproxDriver is a simple driver program that prompts the user for the input file and repeatedly generates and displays 1st-order approximations of that file. You are to build on these classes to enable the generation of approximations of any order and length (as shown in the sample output).

Several points to consider: