Name: _________________________________________
This lab will demonstrate the use of computer programs as experimental tools for solving problems. Utilizing a Web page with embedded JavaScript code, you will perform experiments with word distributions and attempt to estimate the number of words with certain properties. Many of the labs in this course will similarly emphasize the use of computers for modeling complex phenomena or for analyzing data, and so will help to develop empirical skills such as formulating reasonable hypotheses, designing experiments, executing them, and critiquing the results.
The Web page www.dickinson.edu/~cs131/Labs/randSeq.html contains JavaScript code for generating random sequences of letters. By default, clicking on the button labeled "Click to generate letter sequences" within the page will generate and display a single sequence of 3 random letters, e.g., "syk". By changing a value in the appropriate box, however, the number of random sequences, the length of each sequence, and even the possible letters in each sequence can be changed.
EXERCISE 1: Load this page into a separate browser window. Generate 10 random sequences of 3 letters and list them below. Were any of the 10 sequences real English words? Would you expect them to be?
EXERCISE 2: Modify the appropriate value in the page to generate 10 random sequences of 4 letters and list them below. Were any of the 10 sequences real English words? Would you expect them to be? Would you expect a 4-letter word to be more or less likely than a 3-letter word? Explain.
EXERCISE 3: What would you expect to happen if you changed the contents of the box labeled "Letters to choose from" to only two letters, say "DC"? What if you changed the contents to a single letter? Verify your predictions in the page.
EXERCISE 4: What would you expect to happen if you changed the number of sequences to be a negative number? What would you expect if the length of the sequences was specified as negative? Verify your predictions in the page.
You are not expected to understand JavaScript code yet, but you may find it interesting to look at the actual code for generating random letter sequences. View the HTML source for the randSeq.html page (by selecting Page Source under the Netscape View menu). There are two sections of code in the HEAD of this page, delineated by <SCRIPT LANGUAGE="JavaScript"> and </SCRIPT> tags. The first section of code loads a library of JavaScript routines called random.js, including code for generating a random character. The second section of code contains the definition of a function which generates the random sequences that appear in the page. The text in the body of the page sets up the layout of the page and executes the JavaScript code when the user clicks on the button.
EXERCISE 5: What is the name of the JavaScript function that generates the random letter sequences? The function name follows the keyword function in the code section found in the HEAD of the page.
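For readers curious about how such a generator might be structured, the following is a minimal sketch only. It is not the code from randSeq.html or random.js; the function names pickRandomChar, makeSequence, and makeSequences and the optional rng parameter are hypothetical and chosen purely for illustration.

```javascript
// Hypothetical sketch: these names are illustrative and are NOT taken
// from randSeq.html or random.js.

// Return one character chosen uniformly at random from the string `letters`.
// `rng` must return a number in [0, 1); it defaults to Math.random.
function pickRandomChar(letters, rng = Math.random) {
  const index = Math.floor(rng() * letters.length);
  return letters.charAt(index);
}

// Build one sequence of `length` random characters drawn from `letters`.
function makeSequence(length, letters, rng = Math.random) {
  let sequence = "";
  for (let i = 0; i < length; i++) {
    sequence += pickRandomChar(letters, rng);
  }
  return sequence;
}

// Build `count` sequences and return them as an array of strings.
function makeSequences(count, length, letters, rng = Math.random) {
  const sequences = [];
  for (let i = 0; i < count; i++) {
    sequences.push(makeSequence(length, letters, rng));
  }
  return sequences;
}

// Example: ten random 3-letter sequences over the full alphabet.
console.log(makeSequences(10, 3, "abcdefghijklmnopqrstuvwxyz").join(" "));
```

The three parameters correspond to the three boxes in the page: the number of sequences, the length of each sequence, and the letters to choose from.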
Using the random sequence generator page, you can perform some interesting experiments. In particular, you can use this code as a tool for verifying or disproving hypotheses about word distributions, and to generate further data for analysis. First, consider the total number of unique 3-letter sequences that can be generated. Since each of the three positions in a sequence can be any of the 26 letters, there are 26^3 = 17,576 different sequences. Clearly, not all of these sequences form real words. The question arises: how many random 3-letter sequences would you expect to have to generate before you obtain a word?
It so happens that there are approximately 500 3-letter words in the English language (at least according to the UNIX online dictionary). Thus, if you generate a random 3-letter sequence, there is a 500 out of 17,576 chance that it will be a word. Since 500/17,576 is approximately 1/35, you might expect 1 out of every 35 sequences to be a word. Put another way, you might expect about 28 out of every 1000 random 3-letter sequences to be words, since 500/17,576 is approximately 2.8%. This number can be verified experimentally using the JavaScript code in the page.
EXERCISE 6: Use the Web page to generate 1000 random 3-letter sequences and count how many English words you obtain. List that number below. Hint: Since 1000 sequences will not fit on the screen at one time, generate the sequences in 10 groups of 100. Scanning 100 sequences for words can be done in just a few seconds.
Is the number you obtained close to the expected value of 28? If not, try generating another 500 or 1000 sequences and see if the approximation (number of words / number of sequences generated) improves.
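To get a feel for how much the count can vary from one experiment to the next, the following sketch simulates the experiment instead of checking sequences against a dictionary. It assumes each random sequence is independently a word with probability 500/17,576, the figure quoted above; the function name simulateWordCount and its rng parameter are hypothetical.

```javascript
// Probability that one random 3-letter sequence is a word,
// using the approximate figures from the text (500 words, 26^3 sequences).
const P_WORD = 500 / Math.pow(26, 3);

// Simulate generating `trials` sequences, counting each one as a word
// with probability `p`. This is a model of the experiment, not a dictionary check.
// `rng` defaults to Math.random and can be replaced for repeatable runs.
function simulateWordCount(trials, p, rng = Math.random) {
  let words = 0;
  for (let i = 0; i < trials; i++) {
    if (rng() < p) {
      words++;
    }
  }
  return words;
}

// Five independent simulated experiments of 1000 sequences each.
for (let run = 1; run <= 5; run++) {
  const words = simulateWordCount(1000, P_WORD);
  console.log("run " + run + ": " + words + " words (" + (words / 10).toFixed(1) + "%)");
}
```

Because the runs are random, the printed counts differ each time the sketch is executed, which mirrors what you should see when repeating the experiment in the page.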
As the length of the letter sequences increases, the chance of generating a word at random decreases dramatically. For example, there are 1,777 4-letter words in the online dictionary. Thus, the chance of generating a 4-letter word at random is about 1 in 257 (1,777/26^4 = 1,777/456,976 ≈ 1/257). For 5-letter sequences, the chance of generating a word at random is about 1 in 4,920 (2,415/26^5 = 2,415/11,881,376 ≈ 1/4,920).
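The arithmetic behind these odds can be checked with a short sketch. The word counts are the ones quoted in the text; the function name chanceOfWord and the WORD_COUNTS table are hypothetical names introduced for this illustration.

```javascript
// Word counts quoted in the text (UNIX online dictionary figures).
const WORD_COUNTS = { 3: 500, 4: 1777, 5: 2415 };

// Compute the chance that a random sequence of `length` letters is a word,
// given how many real words of that length exist.
function chanceOfWord(length, wordCount, alphabetSize = 26) {
  const possibleSequences = Math.pow(alphabetSize, length);
  return {
    possibleSequences: possibleSequences,
    probability: wordCount / possibleSequences,
    oneIn: Math.round(possibleSequences / wordCount),
  };
}

for (const length of [3, 4, 5]) {
  const result = chanceOfWord(length, WORD_COUNTS[length]);
  console.log(
    length + " letters: 1 in " + result.oneIn +
    " (" + (result.probability * 100).toFixed(2) + "%)"
  );
}
// 3 letters: 1 in 35 (2.84%)
// 4 letters: 1 in 257 (0.39%)
// 5 letters: 1 in 4920 (0.02%)
```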
EXERCISE 7: Use the Web page to generate 1000 random 4-letter sequences and count how many English words you obtain. List that number below.
Since there is only about a 0.39% chance (1 in 257) of generating a 4-letter word at random, you would expect to obtain around 4 words out of 1000 random 4-letter sequences. Is the number you obtained close to 4? Compared to the case for 3-letter words, would you expect it to take more or fewer sequences to obtain a number close to the expected value? Explain.
Part of the blame for the scarcity of words among randomly generated sequences falls on letters such as 'q' and 'z'. Since these letters are used so infrequently in English, their inclusion in a random sequence of letters makes a real word extremely unlikely. If we exclude letters such as these, however, we can improve the chances of generating words considerably. For example, the 10 letters that appear most frequently in English text are "etaoinshrd". Random sequences of these letters would appear more likely to produce words.
EXERCISE 8: Modify the appropriate field in the Web page so that it generates random sequences using only the letters "etaoinshrd". Generate 1000 random 3-letter sequences of these letters and count how many English words you obtain. List that number below. Is the number you obtained relatively consistent with the number obtained by the person sitting next to you (e.g., within 10% of each other)? If not, generate more sequences and add the word counts until your totals are closer.
Using your experimental results from the previous exercise, you should now be able to estimate the number of 3-letter words in the English language that use only the letters in "etaoinshrd". The following general formula applies:
(number of words found) / (number of sequences generated) ≈ (number of actual words) / (number of possible sequences)
EXERCISE 9: Using the numbers you obtained in EXERCISE 8, estimate the number of 3-letter words in the English language that use only the letters in "etaoinshrd". Show your work in obtaining your estimate.
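The estimate can be expressed as a small calculation: the fraction of generated sequences that turned out to be words, scaled up by the number of possible sequences. The sketch below is one way to write it, not part of the lab page; the function name estimateWordCount is hypothetical, and the sample counts are placeholders rather than real results.

```javascript
// Estimate how many real words exist among all sequences of `length` letters
// drawn from `letters`, based on a random sample.
// Assumes `letters` contains no repeated characters.
function estimateWordCount(wordsFound, sequencesGenerated, letters, length) {
  const possibleSequences = Math.pow(letters.length, length);
  return (wordsFound * possibleSequences) / sequencesGenerated;
}

// Placeholder numbers for illustration only; substitute your own counts
// from EXERCISE 8.
const exampleWordsFound = 40;
const exampleSequences = 2000;
console.log(estimateWordCount(exampleWordsFound, exampleSequences, "etaoinshrd", 3)); // 20
```

The same function covers the 4-letter case by changing the last argument, provided the counts come from 4-letter sequences.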
EXERCISE 10: Using the same approach as above, estimate the number of 4-letter words using only the letters "etaoinshrd". Show your work in obtaining your estimate. Note: since there are many more 4-letter sequences than there are 3-letter sequences, it may take many more random generations in order to obtain a reasonable estimate. Justify your data.