CSC 222: Object-Oriented Programming
Spring 2017

HW 3: Files, Strings & Lists

Every day, literary scholars debate the stylistic choices and writing patterns of famous authors. In fact, it has been shown that certain authors have consistent patterns in the way they choose words and construct sentences, and these patterns can be studied and used to identify the authors of unknown works. For this assignment, you are given a simple class, FileStats.java, that reads words from a text file (whose name is specified by the user) and stores those words in an ArrayList. Currently, the class has three public methods: numWords (which returns the number of words in the file), maxWordLength (which returns the length of the longest word in the file), and showStats (which calls these other public methods and displays statistics about the file). You will add more methods to this class and update the showStats method so that it can be used to study the patterns contained in literary works.

Note: For all of these new methods, you should assume that a default answer of 0 or 0.0 should be returned if the method is called on an empty list of words. You may add fields or local variables as necessary, but your methods should not need to reopen the file.
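As a minimal sketch of the empty-list convention in the note above, the stand-alone class below checks for an empty list before dividing. The class name EmptyListSketch and the static method that takes a List<String> are assumptions made for illustration only; in FileStats the method would presumably be an instance method working on the class's own list of words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone class; it is not part of the provided FileStats.java.
class EmptyListSketch {
    // Returns the average word length, or 0.0 if the list is empty.
    static double averageCharactersPerWord(List<String> words) {
        if (words.isEmpty()) {
            return 0.0;  // default answer for an empty list of words
        }
        int totalChars = 0;
        for (String word : words) {
            totalChars += word.length();
        }
        // Cast to double so the division is not truncated to an int.
        return (double) totalChars / words.size();
    }

    public static void main(String[] args) {
        // Empty list: the guard returns the default instead of dividing by zero.
        System.out.println(averageCharactersPerWord(new ArrayList<>()));
        // One 3-character word and two 4-character words: 11 / 3, roughly 3.667.
        System.out.println(averageCharactersPerWord(Arrays.asList("cat", "door", "bird")));
    }
}
```

Without the guard, the double division would yield NaN for an empty list (0.0 / 0), and an int division would throw an ArithmeticException, so the check comes first in either case.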

When testing your modifications/additions, you should use small files for which you can hand-calculate the statistics. Once you are confident that your code works as desired, you can test it on the following public-domain texts:


  1. Add a method named numWordsOfLength that has one parameter, an integer specifying a word length. The method should calculate and return the number of words in the list that are the specified word length. For example, if the method is called with parameter 5, it should calculate and return the number of 5-character words currently stored.

  2. Add a method named mostCommonLength that determines and returns the most common length of words in the file. For example, if the file contained 100 words, 15 of which were 3-letter words, 40 were 4-letter words, 15 were 5-letter words, 20 were 6-letter words, and 10 were 7-letter words, then the method should return 4 (since 4-letter words are most common). If there is a tie, i.e., two or more word lengths share the highest count, then the method should return the smallest of those lengths. Modify the showStats method to display this statistic for the file.

  3. Add a method named averageCharactersPerWord that calculates and returns the average number of characters per word. For example, given a tiny file containing one 3-character word and two 4-character words, a call to averageCharactersPerWord should return 3.666... Modify the showStats method to display this statistic for the file.

  4. Add a method named averageSyllablesPerWord that calculates and returns the average number of syllables per word. To keep things simple, we will assume that any sequence of consecutive vowels (including 'y') corresponds to a syllable. For example, "heavy" has two syllables while "Italian" has three syllables. Hint: you might consider defining a private helper method, similar to strip, which takes a word as parameter and returns the number of syllables in that word. Modify the showStats method to display this statistic for the file.

  5. Add a method named typeTokenRatio that calculates and returns the Type-Token Ratio for the stored words, which is a measure of how repetitive the vocabulary is. The Type-Token Ratio is defined to be the number of different words divided by the total number of words. For example, text with no repeated words will have a Type-Token Ratio of 1.0, while text in which every word appears twice would have a Type-Token Ratio of 0.5. Modify the showStats method to display this statistic for the file.

  6. Add a method named hapaxLegomanaRatio that calculates and returns the Hapax Legomana Ratio for the stored words, which is closely related to the Type-Token Ratio. The Hapax Legomana Ratio is defined to be the number of singleton words divided by the total number of words. A singleton word is a word that appears exactly once in the list. For example, text with no repeated words will have a Hapax Legomana Ratio of 1.0, while text in which every word appears twice would have a Hapax Legomana Ratio of 0.0. Modify the showStats method to display this statistic for the file.

  7. Suppose a new work of literature has been discovered that has the following characteristics:

    Stats for mystery.txt:
      number of words = 7040
      longest word = 20
      most common length = 3
      average characters per word = 4.6707386363636365
      average syllables per word = 1.6095170454545455
      Type-Token Ratio = 0.29786931818181817
      Hapax Legomana Ratio = 0.19829545454545455

    Compare the statistics of this mystery author with the statistics of the above works. Based on your comparison, which person (Carroll, Melville, Poe, Shakespeare, or Twain) is most likely to be the author of this mystery text? Provide data and justify your conclusion.
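The counting rules described in items 2, 4, 5, and 6 can be sketched as follows. This is one possible approach under stated assumptions, not the expected solution: the class name LiteraryStatsSketch, the helpers countSyllables and wordCounts, and the use of static methods that take a List<String> are all hypothetical, and the sketch compares words as exact strings (it assumes any case or punctuation cleanup has already happened). It also uses a HashMap for tallying; nested loops over the ArrayList would work as well.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone helpers; they are not part of the provided FileStats.java.
class LiteraryStatsSketch {
    // Counts maximal runs of consecutive vowels (a, e, i, o, u, y) in a word.
    static int countSyllables(String word) {
        int count = 0;
        boolean inVowelRun = false;
        for (char ch : word.toLowerCase().toCharArray()) {
            boolean isVowel = "aeiouy".indexOf(ch) >= 0;
            if (isVowel && !inVowelRun) {
                count++;  // a new vowel run starts here
            }
            inVowelRun = isVowel;
        }
        return count;
    }

    // Returns the most common word length; ties go to the smallest length.
    static int mostCommonLength(List<String> words) {
        Map<Integer, Integer> lengthCounts = new HashMap<>();
        for (String word : words) {
            lengthCounts.merge(word.length(), 1, Integer::sum);
        }
        int best = 0;       // stays 0 for an empty list
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> entry : lengthCounts.entrySet()) {
            int length = entry.getKey();
            int count = entry.getValue();
            if (count > bestCount || (count == bestCount && length < best)) {
                best = length;
                bestCount = count;
            }
        }
        return best;
    }

    // Tallies how many times each distinct word occurs (exact string match).
    static Map<String, Integer> wordCounts(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Number of different words divided by the total number of words.
    static double typeTokenRatio(List<String> words) {
        if (words.isEmpty()) {
            return 0.0;
        }
        return (double) wordCounts(words).size() / words.size();
    }

    // Number of words occurring exactly once divided by the total number of words.
    static double hapaxLegomanaRatio(List<String> words) {
        if (words.isEmpty()) {
            return 0.0;
        }
        int singletons = 0;
        for (int count : wordCounts(words).values()) {
            if (count == 1) {
                singletons++;
            }
        }
        return (double) singletons / words.size();
    }
}
```

With these definitions, countSyllables("heavy") gives 2 and countSyllables("Italian") gives 3, matching the examples in item 4, and a list in which every word appears exactly twice gives a Type-Token Ratio of 0.5 and a Hapax Legomana Ratio of 0.0.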

Submit your modified FileStats.java and your data/answer to question 7 (in a single ZIP file) via BlueLine.