CSC 321: HW6

CSC 321: Data Structures
Fall 2015

HW6: Text & Maps

No late submissions will be accepted for this assignment.

In his classic paper, A Mathematical Theory of Communication, Claude E. Shannon described a method for the stochastic generation of text. That is, he showed how it was possible to generate "random" sequences of characters with specific probabilistic characteristics. This included generating "random" text whose probability distribution matched a given source text. Given a source text, he defined the kth-order approximation of that text as:

0th-order approximation:
Characters are generated completely at random (chosen from the same alphabet as the source text).
1st-order approximation:
Characters are generated at random following the same character frequency distribution as in the source text. That is, if the character 'e' appeared twice as often as 'w' in the source text, then the probability of generating an 'e' should be twice the probability of generating a 'w'.
2nd-order approximation:
Characters are generated at random following the same character-pair frequency distribution as in the source text. That is, if the character 't' was the last one generated, then the next character should be generated based on the frequencies of characters that follow 't' in the source text. For example, if 't' appeared 3 times in the source text: "th...", "tr...", and "th...", then there would be a 2/3 probability that 'h' would be generated next and a 1/3 probability that 'r' would be generated.
  .
  .
  .

kth-order approximation:
Characters are generated at random following the same (k-character)-combination frequency distribution as in the source text. That is, if "c₁...c_k-1" were the last k-1 characters generated, then the next character should be generated based on the frequencies of characters that follow "c₁...c_k-1" in the source text. For example, if k = 4 and "thr" appeared 5 times in the source text: "thro...", "thre...", "thro...", "thri", and "thro...", then there would be a 3/5 probability that 'o' would be generated next, a 1/5 probability that 'e' would be generated, and a 1/5 probability that 'i' would be generated.

For this assignment, you are to implement a program that generates kth-order approximations of a source text. The file name of the source text will be entered by the user, followed by repeated requests for approximations of specific lengths. You may choose to create a simple GUI to allow for this interaction, but a text-based interface (such as the one shown below, with user inputs in bold) is sufficient:

Enter the file name: lincoln.txt

Enter the order and the desired text length (-1 -1 to quit): 0 40
xhclpirjysqfdbdtrzcblciuysbvahznegxycsne

Enter the order and the desired text length (-1 -1 to quit): 1 40
autetaedo  haossraeoegrfr aawahyscrtpeci

Enter the order and the desired text length (-1 -1 to quit): 4 40
ich the larget war above dedicated it ca

Enter the order and the desired text length (-1 -1 to quit): 8 40
who died here it is rather for us the li

Enter the order and the desired text length (-1 -1 to quit): -1 -1
BYE

Note that a 0-th order approximation is completely random and so has no commonality with the source text. A 1st-order approximation is similar only in the relative frequencies of the characters. As the order increases, the generated text pulls larger and larger sequences of characters from the source text. Here, the 8th-order approximation is an entire sequence from the source text (the Gettysburg Address).

To simplify your task, the TextProcessor class has been provided for you. This class can be used to read in the contents of a file and store the contents in a single, processed string. For this assignment, we will only concern ourselves with letters and spaces, so all other characters are removed from the file contents. Letters are also made lowercase to simplify processing. You are to build on the TextProcessor class to construct a program that produces the above behavior.

Several points to consider:

When generating a kth-order approximation, you will need to "seed" the process with the initial k-1 characters. You should randomly choose some (k-1)-character sequence from the source text as your initial seed. Note that this choice should be probabilistically correct, i.e., if "thr" appears twice as often as "que", then it should be twice as likely to be chosen as seed. Following the initial seed, each successive letter should be generated based on the frequencies of k-character sequences in the source text (as described above).
There is one special case that you must consider when generating text. Suppose you are generating a kth-order approximation and the last k-1 characters in the source text are unique. That is, the final (k-1)-character sequence in the source text appears only once in the file (at the very end). If you ever generated this sequence, you will have no basis for generating the next character (since no character follows that sequence in the source text). In the rare event that this happens, your code should simply choose a new seed sequence (i.e., a (k-1)-character sequence from somewhere in the source text) and continue generating additional characters from that seed.
A kth-order approximation can only be generated if 0 <= k < the length of the source text. If the user specifies an order outside of this legal range, your program should generate an appropriate error message. You may also assume that it is an error to request a kth-order approximation whose length is less than k.

CSC 321: Data Structures Fall 2015 HW6: Text & Maps

CSC 321: Data Structures
Fall 2015

HW6: Text & Maps