CSC 221: HW 5

CSC 221: Introduction to Programming
Fall 2018

HW 5: Files and Lists

Did Charles Dickens have a penchant for using words with the letter 'W'?
Did Louisa May Alcott find long words simply irresistible?
Would it surprise you if Dr. Seuss tended to use short words and the letter 'Z' more than other authors?

These question may seem a bit fanciful, but considerable research has gone into studying the collected works of authors and discerning patterns. The theory is that each author has a unique way of using words and letters in his or her writing, which identifies that author's works the same way fingerprints identify people. These literary fingerprints not only provide insight into an author's methods, but also can be (and have been) used to identify the author of anonymous or disputed works of literature.

For this assignment, you will write a Python function (or collection of functions) that analyzes a text file and produces a report describing its literary fingerprint. The output of your code must include:

the file name
the number of sentences in the file
the number of words in the file
the number of letters in the file
the average number of words in each sentence
the average number of letters in each word
the number of short words (3 or fewer letters) in the file, and the percentage of short words (with respect to the total number of words)
the number of long words (8 or more letters) in the file, and the percentage of long words (with respect to the total number of words)
the number of occurrences of each letter, ignoring cases, and the percentage of each letter (with respect to the total number of letters)
a histogram that shows the (rounded) percentages of each letter as a row of asterisks

To make things easier, we will make some simplifying assumptions. We will assume that any word that ends in a terminal punctuation mark, either ".", "?", or "!", designates a sentence. This may lead to some counting errors, such as I climbed Mt. Shasta. counting as two sentences. Likewise, we will assume that any sequence of characters delineated by whitespace is a word. Again, this might lead to some inaccuracies, such as He paused - then spoke. counting as a 5-word sentence. However, these counting errors may also balance out, as sentences that end with quotes (e.g., He said 'stop.') and hyphen-connected words (e.g., paused--then) will not be counted.

Due to these simplifying assumptions, it is possible for a file to have 0 sentences (e.g., if all punctuation marks are enclosed in quotes) or even 0 letters (e.g., if the file contains only numbers). Your code should not crash in extreme cases such as these, but should either print a warning message or simply omit displaying any ill-defined statistics.

Finally, note that you are asked to report the number of letters in the file, not the number of characters. You should ignore non-letters (e.g., whitespaces, punctuation marks, digits) when calculating this total. Likewise, the average word length and the definitions of short and long words depend on the number of letters, not the number of characters. This ensures that punctuation marks do not count in word lengths (e.g., end! should count as a 3-letter word), but also leads to some surprising results (e.g., 1234 is a 0-letter word).

Coding (90%)

Your main function, call it fingerprint, should take the file name as input and display all of the statistics described above. Since this function may become long, you may want to write some helper functions that are called by fingerprint to carry out supporting tasks (e.g., printing the letter frequency table). All averages and percentages should be rounded to one decimal place, and all letter frequency stats should be aligned in columns. For example, the execution on the left shows the fingerprint statistics for Lewis Carroll's Alice's Adventures in Wonderland. The execution on the right shows the fingerprint statistics for a short story by an unidentified author.

You should test your code on small files for which you can hand-calculate stats. Once you are confident it works as desired, you can test your code on the following public-domain texts:

Alice's Adventures in Wonderland, by Lewis Carroll
Bartleby, The Scrivener, by Herman Melville
The Pit and the Pendulum, by Edgar Allan Poe
The Tragedy of Hamlet, Prince of Denmark, by William Shakespeare
The Notorious Jumping Frog of Calaveras County, by Mark Twain

Analysis (10%)

It so happens that mystery.txt was written by one of the five authors listed above: Carroll, Melville, Poe, Shakespeare, or Twain. Compare the fingerprint statistics for this short story with the five known works and try to predict the unidentified author. In a separate document, include the fingerprint statistics for the author you have chosen, and provide a brief rationale for why you think this fingerprint best matches the mystery fingerprint.

CSC 221: Introduction to Programming Fall 2018 HW 5: Files and Lists

Coding (90%)

Analysis (10%)

CSC 221: Introduction to Programming
Fall 2018

HW 5: Files and Lists