Name: _________________________________________



CSC 221: Computer Programming I         Spring 2010

HW 2: Experimentation with Word Frequencies

This assignment will demonstrate the use of computer programs as experimental tools for solving problems. Utilizing the SequenceGenerator class, you will perform experiments with word distributions and attempt to estimate the number of words with certain properties.


Generating Random Sequences

The SequenceGenerator class contains a method named randomSequence that generates a random sequence of letters. The method takes one parameter value, the length of the sequence to be generated, and returns a random letter sequence.


EXERCISE 1:    Load this class into the BlueJ IDE and create a SequenceGenerator object (using the first constructor with no parameter). Generate 10 random sequences of 3 letters and list them below. Were any of the 10 sequences real English words? Would you expect them to be?










EXERCISE 2:    Similarly, generate 10 random sequences of 4 letters and list them below. Were any of the 10 sequences real English words? Would you expect them to be? Would you expect a 4-letter word be more or less likely than a 3-letter word? Explain.











EXERCISE 3:    Generating one sequence at a time can make experimentation with a SequenceGenerator tedious. It would make things easier to have another method that automated the process of generating and displaying multiple sequences. The following method has two parameters, the number of sequences to generate and their length. Add this method to the source code for the SequenceGenerator class, placed below the existing methods (but above the closing curly-brace). /** * Displays a set number of random letter sequences of the specified length * @param numSequences the number of sequences to generate & display * @param seqLength the number of letters in the random sequences */ public void displaySequences(int numSequences, int seqLength) { int wordsPerLine = 40 / seqLength; int sequencesSoFar = 0; while (sequencesSoFar < numSequences) { System.out.print(this.randomSequence(seqLength) + " "); sequencesSoFar = sequencesSoFar + 1; if (sequencesSoFar % wordsPerLine == 0 || sequencesSoFar == numSequences) { System.out.println(); } } }

Compile the modified class, create a SequenceGenerator object, and call your new method. Verify that this method behaves as described.



EXERCISE 4:    What would you expect to happen if executed the displaySequences method and entered 0 or a negative number for the number of sequences to be generated? How about 0 or a negative number for the length of each sequence? Verify your answers.


















Experimentation & Problem Solving

Using a SequenceGenerator object, you can perform some interesting experiments. In particular, you can use the object as a tool for verifying or disproving hypotheses about word distributions, and to generate further data for analysis. First, consider the total number of unique 3-letter sequences that can be generated. Since each of the three positions in a sequence can be any of the 26 letters, there are 263 = 17,576 different sequences. Clearly, not all of these sequences form real words. The question arises: how many random 3-letter sequences would you expect to have to generate before you obtain a word?

It so happens that there are approximately 550 3-letter words in the English language (according to a popular online dictionary). Thus, if you generated a random 3-letter sequence, there is a 550 out of 17,576 chance that it will be a word. Since 550/17,576 is approximately 1/32, you might expect 1 out of every 32 sequences to be a word. More accurately, you might expect 3 out of every 100 random 3-letter sequences to be words, since 1/32 is approximately 3%. This number can be verified experimentally using a SequenceGenerator object.


EXERCISE 5:    Use a SequenceGenerator object to generate 100 random 3-letter sequences and count how many English words you obtain. List that number below.





Is the number you obtained close (relatively speaking) to the expected value of 3? Calculate the relative error using the formula:

relative error = |(experimental result)-(expected value)| / (expected value)
For example, if you obtained only 1 word, the relative error would be |1-3|/3 = 2/3 = 0.67 = 67%. Show your data and calculations.












When the number of generated sequences is small, it is possible for a lucky or unlucky run of sequences to skew the results. For example, it is possible that you might not obtain any words out of your 100 random sequences. From that limited data, you would hardly want to conclude that there are no 3-letter words. Instead, you would recognize that 100 sequences is too small of a number to be conclusive. To balance out streaks of good or bad luck, you would need to generate a large number of sequences.


EXERCISE 6:    Generate another 400 sequences and add the number of words you obtain to the number from the first 100 (from EXERCISE 5). Then, calculate the relative error using your combined count. For example, if your 500 sequences produced 18 words, your relative error would be |18-15|/15 = 3/15 = 0.2 = 20%. Is the relative error decreased by the additional repetitions? Show your data and calculations.














Part of the blame for the scarcity of words among randomly generated sequences falls on letters such as 'q' and 'z'. Since these letters are used so infrequently in English, their inclusion in a random sequence of letters makes a real word extremely unlikely. If we exclude letters such as these, however, we can improve the chances of generating words considerably. For example, the 10 letters that appear most frequently in English text are "etaoinshrd". Random sequences of these letters would appear more likely to produce words.


EXERCISE 7:    Using the alternate constructor (with a String parameter), create a SequenceGenerator that is limited to the letters "etaoinshrd". Generate 500 random 3-letter sequences of these letters and count how many English words you obtain. List that number below.







Using your experimental results from the previous exercise, you should be able to estimate the number of 3-letter words in the English language that use only the letters in "etaoinshrd". The following general formula applies:

# of words = (# of possible sequences) * (chances that a random sequence is a word)
The total number of 3-letter sequences that use only 10 letters is 103 = 1,000. As an estimate of the chances that a random sequence of these letters is a word, divide the number of words you obtained by the total number of sequences you generated. For example, suppose your 500 random sequences produced 30 actual words. You could then estimate the chances of randomly generating a word to be 30/500 = 3/50. Plugging these numbers into the above formula, the total number of 3-letter words using only "etaoinshrd" is estimated to be (1,000 * 3/50) = 60.


EXERCISE 8:    Using the numbers you obtained in EXERCISE 7, estimate the number of 3-letter words in the English language that use only the letters in "etaoinshrd". Show your work in obtaining your estimate.

























Hand in a printout of your modified SequenceGenerator class along with these sheets.