posted May 4, 2017, 6:52 AM by Samuel Konstantinovich   [ updated May 8, 2017, 7:33 AM ]

May the 4th be with you!

Goal: an exploration of book data.

In honor of the holiday, you will process The War of the Worlds.

You may collaborate with neighboring groups to work on the planning.

Go get the Project Gutenberg plain-text (UTF-8) file for The War of the Worlds, and trim it as follows:
(The goal is to paste the book into a text file, but remove the non-book header and footer, so that the file starts and ends with the lines shown here.)

The War of the Worlds

by H. G. Wells [1898]

     But who shall dwell in these worlds if they be
     inhabited? .  .  .  Are we or they Lords of the
     World? .  .  .  And how are all things made for man?--
          KEPLER (quoted in The Anatomy of Melancholy)




And strangest of all is it to hold my wife's hand again, and to think
that I have counted her, and that she has counted me, among the dead.

To help make things easier, we can use a subset of the book when testing code! We should all use the same subset of the book to help compare answers.

text = text[:4112]
print(text[3980:])  # this prints the last few lines

If you trimmed the book as instructed, this will stop the book at:
'''And we men, the creatures who inhabit this earth, must be to them
at least as alien and lowly as are the monkeys and lemurs to us.'''

This will be our class standard subset to help us test.
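
If it helps, here is a minimal sketch of reading the trimmed book into a string and cutting it down to the class subset. The function name load_subset and the file name in the usage note are placeholders (not part of the assignment), so use whatever you called your own file.

```python
import io

def load_subset(path, limit=4112):
    """Read the trimmed book as UTF-8 and keep only the first `limit` characters."""
    # io.open decodes the file as UTF-8 text instead of raw bytes.
    with io.open(path, encoding="utf-8") as book:
        text = book.read()
    return text[:limit]
```

For example, text = load_subset("war_of_the_worlds.txt") would give the subset, assuming that is your file name. If your subset does not end at the monkeys-and-lemurs sentence, check your trimming first; different line endings in the saved file can also shift the character positions.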

Pair programming + Written brainstorming.
You must answer all questions 1-4 in YOUR notes before you touch the computer. 
DO NOT be lazy about this. Make sure you can refer back to these notes and understand what you are answering.
After you open the book and read it into a string:
1. First, you need to make a list of all of the words in the book.
  a) Describe how you would perform this task.
  b) What problems might arise?
     (What things might happen to your list of words that may not be desirable?)
2. We would want to strip away punctuation from the words in the list.
  a) Describe how you would do this. Which things need to be stripped?
  b) Is there any punctuation that is needed inside a word?
     Will your method break those cases?
3. Sometimes books use '--' in between words. You need to remove '--' separators between words before you split the book into words.
  a) How should you remove them? When should you do this?
  b) What should it look like afterward? Give an example.
4. Convert your list into a tally dictionary.
  a) How would you do this?
  b) Guess how long this dictionary will be.
Now go do all the things you described in 1-4 with your partner. When done, move on to 5, and answer the question in your notes first.
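
Once your own notes are written, you can compare them against this sketch of steps 1-4. It is one possible approach, not the required one: the names make_words and tally are made up for illustration, and the choice to strip punctuation only from the ends of each word (so an apostrophe inside a word survives) is an assumption that may not match the method used to produce the class numbers.

```python
import string

def make_words(text):
    """Return a list of lower-case words with surrounding punctuation removed."""
    # Step 3: replace '--' with a space BEFORE splitting,
    # so words joined by '--' do not fuse into one string.
    text = text.replace("--", " ")
    words = []
    for raw in text.split():
        # Step 2: strip punctuation from both ends only,
        # which leaves the apostrophe in a word like "wife's" alone.
        word = raw.strip(string.punctuation).lower()
        if word:  # drop strings that were nothing but punctuation
            words.append(word)
    return words

def tally(words):
    """Step 4: map each word to the number of times it appears."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts
```

For example, make_words("my wife's hand--again.") returns ['my', "wife's", 'hand', 'again'].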

5. Calculate the following about the entire book:
   a) Answer this first: How would you calculate these? Describe the method for each one (some will be similar).
   ? characters in the book.
   ? words in the book. (total, not unique)
   ? unique words in the book. (all words converted to lower case)
   ? words that are used over 250 times
   ? words that are used exactly once
   ? words that are over 15 letters long
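
After you have described your own method, here is a rough sketch of how the six numbers could come out of the text and a tally dictionary. The function name book_stats, its parameter names, and the dictionary keys are all made up for illustration. It also assumes that "words that are over 15 letters long" means unique words; if you read it as total occurrences, the last line would change.

```python
def book_stats(text, counts, min_uses=250, min_length=15):
    """Compute the six statistics from the text and a {word: count} dictionary."""
    return {
        "characters": len(text),
        # The total word count is the sum of every tally.
        "total_words": sum(counts.values()),
        # Each key in the tally is one unique word.
        "unique_words": len(counts),
        "used_over": sum(1 for n in counts.values() if n > min_uses),
        "used_once": sum(1 for n in counts.values() if n == 1),
        # Assumption: count each long word once, not once per use.
        "long_words": sum(1 for word in counts if len(word) > min_length),
    }
```

The thresholds are parameters so the same function can be reused with different cutoffs when testing on a smaller piece of the book.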

#Test your program on the SUBSET that was outlined above!
#Here is preliminary data to help verify that you are correct:
#4112 characters in the book SUBSET.
#707 words in the book SUBSET. (total, not unique, non-empty strings only)
#342 unique words in the book SUBSET. (all words converted to lower case)
#3 words that are used over 20 times in the book SUBSET
#252 words that are used once in the book SUBSET
#3 words that are over 12 letters long in the book SUBSET

(The thresholds were adjusted for the SUBSET to make it easier to test.)
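
One way to check your program against the preliminary data is to store the expected SUBSET values and list whatever does not match. This is only a sketch: the key names and the function report_mismatches are hypothetical, so rename them to fit whatever your own program produces.

```python
# Expected values copied from the preliminary SUBSET data above.
EXPECTED_SUBSET = {
    "characters": 4112,
    "total_words": 707,
    "unique_words": 342,
    "used_over_20": 3,
    "used_once": 252,
    "over_12_letters": 3,
}

def report_mismatches(actual, expected=EXPECTED_SUBSET):
    """Return (name, expected value, actual value) for every number that differs."""
    mismatches = []
    for name in sorted(expected):
        if actual.get(name) != expected[name]:
            mismatches.append((name, expected[name], actual.get(name)))
    return mismatches
```

An empty list means every number agrees. If only the word counts are off, look again at how you handle punctuation and '--' before assuming that the tally itself is wrong.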