CSC 221: HW 6

CSC 221: Introduction to Programming
Fall 2012

HW 6: Files and Lists

You may choose to work with one other person in the class for this assignment.
If you choose to do so, your team will submit one assignment and each student will receive the same grade.
You may not consult with anyone other than your partner (and the instructor, of course).

Did Charles Dickens have a penchant for using words with the letter 'W'?
Did Louisa May Alcott find long words simply irresistible?
Would it surprise you if Dr. Seuss tended to use short words and the letter 'Z' more than other authors?

These questions may seem a bit fanciful, but considerable research has gone into studying the collected works of authors and discerning patterns. The theory is that each author has a unique way of using words in his or her writing, which identifies that author's works the same way fingerprints identify people. These literary fingerprints not only provide insight into an author's methods, but can also be (and have been) used to identify the author of anonymous or disputed works of literature.

For this assignment, you will write a Python program that analyzes a text file and produces a report describing its literary fingerprint. The output of your code must include:

the file name
the number of sentences in the file
the number of words in the file
the number of letters in the file
the average number of words in each sentence
the average number of letters in each word
the number of short words (3 or fewer letters) in the file, and the percentage of short words (with respect to the total number of words)
the number of long words (8 or more letters) in the file, and the percentage of long words (with respect to the total number of words)
the number of occurrences of each letter, ignoring cases, and the percentage of each letter (with respect to the total number of letters)
a histogram that shows the (rounded) percentages of each letter as a row of asterisks
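
The last two items (letter counts and the histogram) can be sketched as follows. This is only an illustration under stated assumptions, not a required design: the function names letter_counts and histogram_row are hypothetical, only the 26 English letters are counted, and Python 3 is assumed.

```python
import string

def letter_counts(text):
    """Count occurrences of each letter a-z, ignoring case."""
    counts = {letter: 0 for letter in string.ascii_lowercase}
    for ch in text.lower():
        if ch in counts:          # skips spaces, digits, punctuation
            counts[ch] += 1
    return counts

def histogram_row(count, total):
    """Return one asterisk per (rounded) percentage point."""
    if total == 0:                # empty file: avoid dividing by zero
        return ""
    return "*" * round(100 * count / total)

sample = "Aardvark Zoo"
counts = letter_counts(sample)
total = sum(counts.values())
for letter in "aoz":
    percent = round(100 * counts[letter] / total, 1)
    print(letter, counts[letter], percent, histogram_row(counts[letter], total))
```

Note that Python 3's built-in round rounds exact halves to the nearest even number, so a percentage of exactly 2.5 would produce two asterisks rather than three.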

To make things easier, we will make some simplifying assumptions. We will assume that any word that ends in a punctuation mark, either ".", "?", or "!", marks the end of a sentence. This may lead to some counting errors, such as "I climbed Mt. Shasta." counting as two sentences. Likewise, we will assume that any sequence of characters delimited by whitespace is a word. Again, this might lead to some inaccuracies, such as "He paused - then spoke." counting as a 5-word sentence. However, these counting errors may also balance out, since sentences that end with a quote (e.g., "He said 'stop.'") will not be counted as sentences, and hyphen-connected words (e.g., "paused--then") will be counted as a single word.
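
A minimal sketch of these two counting rules, assuming the whole file has already been read into a string. The function name count_words_and_sentences is hypothetical.

```python
def count_words_and_sentences(text):
    """Return (number of words, number of sentences) under the simplified rules."""
    words = text.split()              # any whitespace-delimited sequence is a word
    sentences = 0
    for word in words:
        # a word ending in ".", "?", or "!" marks the end of a sentence
        if word.endswith((".", "?", "!")):
            sentences += 1
    return len(words), sentences

print(count_words_and_sentences("I climbed Mt. Shasta."))    # "Mt." ends a sentence too
print(count_words_and_sentences("He paused - then spoke."))  # "-" counts as a word
print(count_words_and_sentences("He said 'stop.'"))          # ends with a quote
```

The three calls reproduce the counting errors described above: the first reports two sentences, the second reports five words, and the third reports no sentences at all.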

Finally, note that you are asked to report the number of letters in the file, not the number of characters. You should ignore non-letters (e.g., whitespace, punctuation marks, digits) when calculating this total. Likewise, the average word length and the definitions of short and long words depend on the number of letters, not the number of characters. This ensures that punctuation marks do not count in word lengths (e.g., "end!" should count as a 3-letter word), but also leads to some surprising results (e.g., "1234" is a 0-letter word).
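
The letter-based word length can be sketched as below. The names letters_in and classify, and the label "medium", are hypothetical. The sketch assumes that str.isalpha is an acceptable test for "letter" (it also accepts accented and other non-English letters), and it reads the rules literally, so a 0-letter word such as "1234" lands in the short category.

```python
def letters_in(word):
    """Number of letters in a word, ignoring punctuation and digits."""
    return sum(1 for ch in word if ch.isalpha())

def classify(word):
    """Label a word by its letter count: short (<= 3) or long (>= 8)."""
    n = letters_in(word)
    if n <= 3:
        return "short"
    if n >= 8:
        return "long"
    return "medium"               # hypothetical label for everything in between

for word in ["end!", "1234", "irresistible,", "word"]:
    print(word, letters_in(word), classify(word))
```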

Your program should use a file dialog window to enable the user to select the text file to be processed. All averages and percentages should be rounded to one decimal place, and all letter frequency stats should be aligned in columns as shown in the sample below:
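
One possible way to meet the file dialog, rounding, and alignment requirements is sketched here. It is not the assignment's official sample output: the column widths, the row layout, and the names choose_file and format_letter_row are assumptions made for illustration, and Python 3 with the standard tkinter module is assumed.

```python
def choose_file():
    """Open a file dialog and return the selected path ('' if cancelled)."""
    # Imported here so the formatting code below also runs without a display.
    import tkinter
    from tkinter import filedialog
    root = tkinter.Tk()
    root.withdraw()               # hide the empty main window
    path = filedialog.askopenfilename(title="Select a text file")
    root.destroy()
    return path

def format_letter_row(letter, count, total):
    """Format one line of the letter table with right-aligned columns."""
    percent = 100 * count / total if total else 0.0
    stars = "*" * round(percent)
    # {:>7} right-aligns the count; {:>5.1f} shows one decimal place
    return "{}: {:>7} {:>5.1f}% {}".format(letter, count, percent, stars)

print(format_letter_row("e", 1234, 10000))
print(format_letter_row("z", 7, 10000))
```

Using fixed field widths in the format string keeps the counts and percentages lined up regardless of how many digits each value has.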

You should test your code on small files for which you can hand-calculate stats. Once you are confident it works as desired, you can test your code on the following public-domain texts:

Alice's Adventures in Wonderland, by Lewis Carroll
Bartleby, The Scrivener, by Herman Melville
The Cask of Amontillado, by Edgar Allan Poe
The Tragedy of Hamlet, Prince of Denmark, by William Shakespeare
The Notorious Jumping Frog of Calaveras County, by Mark Twain
