CSC 421: Algorithm Design & Analysis
Fall 2024

HW1: Brute Force Approximations


In his classic paper, A Mathematical Theory of Communication, Claude E. Shannon described a method for the stochastic generation of text. That is, he showed how it was possible to generate "random" sequences of characters with specific probabilistic characteristics. This included generating "random" text whose probability distribution matched a given source text. Given a source text, he defined the kth-order approximation of that text as:

1st-order approximation:
Characters are generated at random following the same character frequency distribution as in the source text. That is, if the character 'e' appeared twice as often as 'w' in the source text, then the probability of generating an 'e' should be twice the probability of generating a 'w'.
2nd-order approximation:
Characters are generated at random following the same character-pair frequency distribution as in the source text. That is, if the character 't' was the last one generated, then the next character should be generated based on the frequencies of characters that follow 't' in the source text. For example, if 't' appeared 3 times in the source text: in "three" and "teeth", then there would be a 2/3 probability that 'h' would be generated next and a 1/3 probability that 'e' would be generated.
  .
  .
  .
kth-order approximation:
Characters are generated at random following the same (k-character)-combination frequency distribution as in the source text. That is, if "c1...ck-1" were the last k-1 characters generated, then the next character should be generated based on the frequencies of characters that follow "c1...ck-1" in the source text. For example, if k = 4 and "the" appeared 5 times in the source text: in "then", "the ", "weather", "heathen", and "lathe ", then there would be a 2/5 probability that 'n' would be generated next, a 2/5 probability that ' ' would be generated, and a 1/5 probability that 'r' would be generated.

For this assignment, you are to implement a program that generates kth-order approximations of a source text. For simplicity, ignore all characters other than letters and spaces, and treat all letters as uppercase. Your program should prompt the user for the name of the source file and the desired order for the approximation, then repeatedly prompt for the lengths of sequences to generate. For example:

Enter the file name: lincoln.txt
Enter the desired order (>= 1): 4

Enter the desired string length (negative to quit): 40
E OUR FOR LONG PLACE FOR THAT CANNOT CAN

Enter the desired string length (negative to quit): 20
SITION THEY DIED IN 

Enter the desired string length (negative to quit): 40
OF THIS GREAT THIS NATION THAT TASK REMA

Enter the desired string length (negative to quit): -1
**DONE**

Note that a 1st-order approximation is similar to the source text only in the relative frequencies of the characters. As the order increases, the generated text pulls larger and larger sequences of characters from the source text. By the time the order reaches 8 or 10, it is likely that entire sections of the source text will be generated.

To simplify your task, you are given two starter files: ApproxGenerator.java and ApproxDriver.java. The ApproxGenerator class reads the contents of a text file (ignoring any characters other than letters or spaces) and provides a generate method for generating sequences whose probability distribution matches the file. ApproxDriver is a simple driver program that prompts the user for the input file and repeatedly generates and displays 1st-order approximations of that file. You are to build on these classes to enable the generation of approximations of any order and length (as shown in the sample output).

Several points to consider: