A KWIC (Key Word in Context) index is a list of titles or other phrases, organized so that it can be easily searched by keyword. Before computers, KWIC indexes were commonly used in technical manuals and book/article compilations to enable searches by topic or keyword. For example, the following is the first few lines from a KWIC index of book titles.
ABSALOM, ABSALOM! Berlin ALEXANDERPLATZ Things Fall APART The Devil to Pay in the BACKLANDS Gypsy BALLADS BERLIN Alexanderplatz BOOK of Job Madame BOVARY The BROTHERS Karamazov . . .
Note that every title will appear once for each keyword in that title, excluding junk words like "the" and "of". The titles in the KWIC index are listed in alphabetical order by keyword, with the keyword displayed in all caps. If the same keyword appears more than once in a title, the title is only listed once (for that keyword) with all occurrences of the keyword capitalized. When identifying keywords, any characters other than letters or digits at the beginning or end of the word should be ignored. For example, in the first title, "Absalom, Absalom!", the keyword "Absalom" appears twice (ignoring the punctuation that appears at the ends of the words). Thus, the title appears only once in the KWIC index, with both occurrences of the keyword capitalized. Punctuation that appears internally in keywords, e.g., "isn't", should be maintained.
For this assignment, you will write a program that reads in a file containing junk words to be ignored (the junkWords.txt file from HW1 suffices) and a file of titles (the Wikipedia top 100 books list is provided). Your program should construct a KWIC index and allow the user to search that index. A showKeyword
method should be provided, which takes a keyword as input and displays all of the titles that contain that keyword (in alphabetical order, with the keyword in all caps). It should also provide a showAll
method, which has no inputs and displays the entire KWIC index.
To simplify this task, you do no need to align the titles so that keywords begin in the same columns. Also, you may display non-keywords in all lowercase. That is, your showAll
method might display the above KWIC index as:
ABSALOM, ABSALOM! berlin ALEXANDERPLATZ things fall APART the devil to pay in the BACKLANDS gypsy BALLADS BERLIN alexanderplatz BOOK of job madame BOVARY the BROTHERS karamazov . . .
FOR EXTRA CREDIT, modify your code so that the capitalization of non-keywords is maintained and titles are aligned by keyword.