CSC 321: HW5

CSC 321: Data Structures
Fall 2016

HW5: Text & Maps

In his classic paper, A Mathematical Theory of Communication, Claude E. Shannon described a method for the stochastic generation of text. That is, he showed how it was possible to generate "random" sequences of characters with specific probabilistic characteristics. This included generating "random" text whose probability distribution matched a given source text. Given a source text, he defined the kth-order approximation of that text as:

0th-order approximation:
Characters are generated completely at random (chosen from the same alphabet as the source text).
1st-order approximation:
Characters are generated at random following the same character frequency distribution as in the source text. That is, if the character 'e' appeared twice as often as 'w' in the source text, then the probability of generating an 'e' should be twice the probability of generating a 'w'.
2nd-order approximation:
Characters are generated at random following the same character-pair frequency distribution as in the source text. That is, if the character 't' was the last one generated, then the next character should be generated based on the frequencies of characters that follow 't' in the source text. For example, if 't' appeared 3 times in the source text: in "three" and "teeth", then there would be a 2/3 probability that 'h' would be generated next and a 1/3 probability that 'e' would be generated.
  .
  .
  .

kth-order approximation:
Characters are generated at random following the same (k-character)-combination frequency distribution as in the source text. That is, if "c₁...c_k-1" were the last k-1 characters generated, then the next character should be generated based on the frequencies of characters that follow "c₁...c_k-1" in the source text. For example, if k = 4 and "the" appeared 5 times in the source text: in "then", "the ", "weather", "heathen", and "lathe ", then there would be a 2/5 probability that 'n' would be generated next, a 2/5 probability that ' ' would be generated, and a 1/5 probability that 'r' would be generated.

For this assignment, you are to implement a program that generates kth-order approximations of a source text. Initially, the file name of the source text and the approximation order will be entered by the user. Then, repeated requests for approximations of specific lengths will follow, terminated by -1. For example:

Enter the file name and approximation order: lincoln.txt 4

Enter the desired text length (-1 to quit): 20
hat gove to that a l

Enter the desired text length (-1 to quit): 40
ng and dead will proposition this nation

Enter the desired text length (-1 to quit): 40
from the people nation that cannot peopl

Enter the desired text length (-1 to quit): -1
BYE

Note that a 0-th order approximation is completely random, but using only those characters that appear in the source text. A 1st-order approximation is similar only in the relative frequencies of the characters. As the order increases, the generated text pulls larger and larger sequences of characters from the source text. By the time the order reaches 8 or 10, it is likely that entire sections of the source text will be generated.

To simplify your task, the TextProcessor class has been provided for you. This class demonstrates how to read in the contents of a file and store the contents in a single, processed string. For this assignment, we will only concern ourselves with letters and spaces, so all other characters are removed from the file contents. Letters are also made lowercase to simplify processing. You are to build on the TextProcessor class to construct a program that produces the above behavior.

Several points to consider:

When generating a kth-order approximation, you will need to "seed" the process with the initial k-1 characters. You should randomly choose some (k-1)-character sequence from the source text as your initial seed. Note that this choice should be probabilistically correct, i.e., if "thr" appears twice as often as "que", then it should be twice as likely to be chosen as seed. Following the initial seed, each successive letter should be generated based on the frequencies of k-character sequences in the source text (as described above).
There is one special case that you must consider when generating text. Suppose you are generating a kth-order approximation and the last k-1 characters in the source text are unique. That is, the final (k-1)-character sequence in the source text appears only once in the file (at the very end). If you ever generated this sequence, you will have no basis for generating the next character (since no character follows that sequence in the source text). In the rare event that this happens, your code should simply choose a new seed sequence (i.e., a (k-1)-character sequence from somewhere in the source text) to add to the sequence and continue from there.
In order to generate sequences in a kth-order approximation, you will need to create a data structure that allows for selecting the next character given the (k-1) characters that have come before. The obvious choice is a map, where the keys are the (k-1)-character sequences that appear in the source file and the values represent the distribution of characters that follow. For full credit, the number of keys should depend only on the number of unique (k-1)-character sequences and the values should depend only on the number of unique characters in the source file.
Your program should perform reasonable error checking. Note that the order must not be negative and must be less than the number of characters in the file. If the user asks for a sequence whose length is less than the order, then your program should simply select a random sequence from the file (i.e., a seed of the specified length).

CSC 321: Data Structures Fall 2016 HW5: Text & Maps

CSC 321: Data Structures
Fall 2016

HW5: Text & Maps