Lab 1: Monte Carlo Methods

Name: _________________________________________

CSC 121: Computers and Scientific Thinking
Spring 2011

Lab 1: Monte Carlo Methods

It may surprise you to learn that random events such as coin flips, dice rolls, and roulette wheel spins are central to many computer science applications, such as cryptography and database management. While events such as these are random in nature, their behavior over the long run is predictable and so can be used to construct deterministic solutions to complex problems. Problem solutions that use such random events are known as Monte Carlo methods, named after the city famous for its casinos.

In this lab, you will use a Monte Carlo method to estimate the number of English words with certain properties. Utilizing a Web page with embedded JavaScript code, you will perform repeated experiments with random letter sequences and attempt to draw conclusions from your results. Many of the labs in this course will similarly emphasize the use of computers for modeling complex phenomena or for analyzing data, and so will help to develop empirical skills such as formulating reasonable hypotheses, designing experiments, executing them, and critiquing the results.

JavaScript in a Web Page

The Web page randSeq.html contains JavaScript code for generating random sequences of letters. By default, clicking on the button labeled "Click to generate letter sequences" within the page will generate and display a single sequence of 3 random letters, e.g., "syk". By changing a value in the appropriate box, however, the number of random sequences, the length of each sequence, and even the possible letters in each sequence can be changed.

EXERCISE 1: Load this page into a separate browser window. Generate 10 random sequences of 3 letters and list them below. Were any of the 10 sequences real English words? Would you expect them to be?

EXERCISE 2: Modify the appropriate value in the page to generate 10 random sequences of 4 letters and list them below. Were any of the 10 sequences real English words? Would you expect them to be? Would you expect a 4-letter word be more or less likely than a 3-letter word? Explain.

EXERCISE 3: What would you expect to happen if you changed the contents of the box labeled "Letters to choose from" to only two letters, say "bo". What if you changed the contents to a single letter? Verify your predictions in the page.

EXERCISE 4: What would you expect to happen if you changed the number of sequences to be a negative number? What would you expect if the length of the sequences was specified as negative? Verify your predictions in the page.

You are not expected to understand JavaScript code yet, but you may find it interesting to look at the actual code for generating random letter sequences. View the HTML source for the randSeq.html page (by selecting View --> Source in Internet Explorer or View --> Page Source in Firefox). There are two sections of code in the HEAD of this page, delineated by <script type="text/javascript"> and </script> tags. The first section of code loads a library of JavaScript routines called random.js, including code for generating a random character. The second section of code contains the definitions of two functions that generate and display the random sequences that appear in the page. The text in the body of the page sets up the layout of the page and executes the JavaScript code when the user clicks on the button.

EXERCISE 5: What are the names of the two JavaScript functions that generate and display the random letter sequences? Note that each function name follows the keyword function in the code section found in the HEAD of the page.

Simple Experiments

Using the random sequence generator page, you can perform some interesting experiments. In particular, you can use this code as a tool for verifying or disproving hypotheses about word distributions, and to generate further data for analysis. First, consider the total number of unique 3-letter sequences that can be generated. Since each of the three positions in a sequence can be any of the 26 letters, there are 26³ = 17,576 different sequences. Clearly, not all of these sequences form real words. The question arises: how many random 3-letter sequences would you expect to have to generate before you obtain a word?

It so happens that there are approximately 500 3-letter words in the English language (according to a popular online dictionary). Thus, if you generated a random 3-letter sequence, there is a 500 out of 17,576 chance that it will be a word. Since 500/17,567 is approximately 1/35, you might expect 1 out of every 35 sequences to be a word. More accurately, you might expect 28 out of every 1000 random 3-letter sequences to be words, since 1/35 is approximately 2.8%. This number can be verified experimentally using the randSeq.html Web page.

EXERCISE 6: Use the Web page to generate 1000 random 3-letter sequences and count how many English words you obtain. List that number below. Hint: Since 1000 sequences may not fit on the screen at one time, you can break them into smaller groups, say 10 groups of 100. Scanning 100 sequences for words can be done in just a few seconds.

Is the number you obtained close (relatively speaking) to the expected value of 28? Calculate the relative error using the formula:
relative error = |(experimental result)-(expected value)| / (expected value)
For example, if you obtained only 24 words, the relative error would be |24-28|/28 = 4/28 = 0.143 = 14.3%. Show your data and calculations.

Generate another 1000 sequences and recalculate the relative error with your updated numbers. For example, if your second set of 1000 sequences produced 34 words, you would have a total of 58 words with an expected number of 56. This yields a relative error of |58-56|/56 = 2/56 = 0.036 = 3.6%. Show your data and calculations.

Did your relative error decrease with the additional 1000 sequences? Would you expect it to? If the relative error is still greater than 10%, continue generating sequences in sets of 1000 until the relative error is less than 10%. Show your data and calculations.

As the length of the letter sequences increases, the chances of generating a word at random decreases dramatically. For example, there are 1,777 4-letter words in the online dictionary. Thus, the chances of generating a 4-letter word at random are 1 in 257 (1,777/26⁴ = 1,777/456,976 = 1/257). For 5-letter sequences, the chances of generating a word at random is 1 in 4,920 (2,415/26⁵ = 2,415/11,881,376 = 1/4,920) .

EXERCISE 7: Use the Web page to generate 1000 random 4-letter sequences and count how many English words you obtain. List that number below.

Since there is only a 0.38% chance of generating a 4-letter word at random, you would expect to obtain around 4 words out of 1000 random 4-letter sequences. Is the number you obtained close (relatively speaking) to 4? Compared to the case for 3-letter words, would you expect it to take more or fewer sequences to obtain a number close (relatively speaking) to the expected value? Explain.

Similar to the calculations you performed in EXERCISE 6, generate 4 letter sequences until the relative error from the expected number of words is less than 10%. You may stop if you have generated 4000 sequences without reaching 10%. Show your data and calculations.

Problem Solving via Experimentation

Part of the blame for the scarcity of words among randomly generated sequences falls on letters such as 'q' and 'z'. Since these letters are used so infrequently in English, their inclusion in a random sequence of letters makes a real word extremely unlikely. If we exclude letters such as these, however, we can improve the chances of generating words considerably. For example, the 10 letters that appear most frequently in English text are "etaoinshrd". Random sequences of these letters would appear more likely to produce words.

EXERCISE 8: Modify the appropriate box in the Web page so that it generates random sequences using only the letters "etaoinshrd". Generate 1000 random 3-letter sequences of these letters and count how many English words you obtain. List that number below.

Is the number you obtained relatively consistent with the number obtained by the person sitting next to you (e.g., within 10% of each other)? If not, generate more sequences and add the word counts until your totals are closer.

Using your experimental results from the previous exercise, you should now be able to estimate the number of 3-letter words in the English language that use only the letters in "etaoinshrd". The following general formula applies:

# of words = (# of possible sequences) * (chances that a random sequence is a word)

The total number of 3-letter sequences that use only 10 letters is 10³ = 1,000. As an estimate of the chances that a random sequence of these letters is a word, divide the number of words you obtained by the total number of sequences you generated. For example, suppose you generated 1500 random sequences, 150 of which were actual words. You could then estimate the chances of randomly generating a word to be 150/1500 = 1/10. Plugging these numbers into the above formula, the total number of 3-letter words using only "etaoinshrd" is estimated to be (1,000 * 1/10) = 100.

EXERCISE 9: Using the numbers you obtained in EXERCISE 8, estimate the number of 3-letter words in the English language that use only the letters in "etaoinshrd". Show your work in obtaining your estimate.

EXERCISE 10: Using the same approach as above, estimate the number of 4-letter words using only the letters "etaoinshrd". Show your work in obtaining your estimate.
Note: since there are many more 4-letter sequences than there are 3-letter sequences, it may take many more random generations in order to obtain a reasonable estimate. Justify your data.

EXERCISE 11: Using the Internet or other references, answer the following questions about Monte Carlo methods. Be sure to identify the source for your answer. If the source is on the Internet, list an additional source to corroborate your answer.

What is meant by the term pseudorandom number? Why is it that computer-generated numbers (such as those used to pick letters in our simulation) are not truly random?

Who is credited with the invention of Monte Carlo methods?

Monte Carlo methods were instrumental in the research that led to the design of the hydrogen bomb. In addition to this and the simple examples we have discussed in class, describe another scientific application of a Monte Carlo method. That is, describe a problem from some field of science in which randomness was used to help solve that problem.

Hand in these sheets with your written responses.

CSC 121: Computers and Scientific Thinking Spring 2011