2018-05-16

posted May 16, 2018, 6:22 AM by Konstantinovich Samuel   [ updated May 16, 2018, 7:19 AM ]
Goal: explore the text of a book as data. (The War of the Worlds)

READ THE WHOLE POST WITH YOUR PARTNER FIRST. 
You may collaborate with neighboring groups to work on the planning.



STANDARD SUBSET
To make things easier, start with a smaller file (warsmall.txt).


Pair programming + Written brainstorming.
You must answer all questions 1-4 in YOUR notes before you touch the computer. 
DO NOT be lazy about this. Make sure you can refer back to these notes and understand what you are answering.
After you open the book and read it into a string:
1. First, you need to make a list of all of the words in the book.
  a) Describe how you would perform this task.
  b) What problems might arise?
     (What things might happen to your list of words that may not be desirable?)

2. We want to strip punctuation from the words in the list.
  a) Describe how you would do this. Which characters need to be stripped?
  b) Is there any punctuation that is needed inside a word?
     Will your method break those cases?

3. This book, like some other books, uses '--' between words. You need to remove the '--' separators before you split the book into words.
  a) How should you remove them? When should you do this?
  b) What should the text look like afterward? Give an example.

4. Convert your list into a tally dictionary.
  a) How would you do this?
  b) Guess how long this dictionary will be.
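
One possible way to build the tally dictionary from question 4 is sketched below. It is only an illustration: the list `words` is a made-up sample, not text from the book, and your own method may differ.

```python
# Hypothetical sample list; in the assignment this list comes from the book.
words = ["the", "war", "of", "the", "worlds", "the"]

tally = {}
for word in words:
    # get() returns 0 the first time a word is seen, so no key check is needed
    tally[word] = tally.get(word, 0) + 1

print(tally)
print(len(tally))  # the number of different words
```

The length of the dictionary is the number of different words, which is why question 4b asks you to guess it first.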

Now go do all the things you described in 1-4 with your partner. When done, move on to 5, and answer the question in your notes first.
Notes:
a. This file has Windows line endings ("\r\n").
   You need to convert them to just "\n".
b. Splitting text that contains spaces, newlines, and tabs can be done in one step if you remember how split works.
c. The string module has a constant that gives you all punctuation characters.
d. Do not remove mid-word punctuation, such as the hyphens in hyphenated words or the apostrophes in contractions.
e. Lowercase your words so they tally properly. ("The", "the", and "THE" should all count as the same word.)
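
The notes above can be combined into one cleaning step, sketched here under some assumptions: the function name `clean_words` and the sample sentence are made up for illustration, and replacing '--' with a space is just one reasonable answer to question 3.

```python
import string

def clean_words(text):
    """Return a list of lowercased words with punctuation stripped from the ends."""
    text = text.replace("\r\n", "\n")  # note a: Windows line endings
    text = text.replace("--", " ")     # question 3: keep the joined words apart
    words = []
    for raw in text.split():           # note b: split() with no argument splits on any whitespace
        # notes c, d, e: strip() only removes characters from the two ends,
        # so hyphens and apostrophes inside a word are kept
        word = raw.strip(string.punctuation).lower()
        if word:                       # skip pieces that were only punctuation
            words.append(word)
    return words

# Made-up sample text, not a quotation from the book.
sample = "The well-known Martians--they say--DON'T stop!\r\nThe\tend."
print(clean_words(sample))
```

Because strip() works on both ends of a word, an apostrophe at the very end of a word (a plural possessive, for example) is removed too. Decide with your partner whether that matters for your counts.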

5. Calculate the following about the entire book:
   a) Answer this first: How would you calculate these? Describe the method for each one (some will be similar)
   ? characters in the book.
   ? total words in the book. (total, not unique)
   ? different words in the book. (unique words; ignore duplicates)
   ? words that are used over 250 times
   ? words that are used exactly once
   ? words that are over 15 letters long
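
A minimal sketch of how these counts might be computed once you have the book string, the word list, and the tally dictionary. The function name `book_stats` and its parameters are hypothetical. It also assumes that the last three questions count each different word once; check that this matches what you wrote in your notes.

```python
def book_stats(text, words, tally, common=250, long_len=15):
    """Return the counts from question 5 as a dictionary."""
    return {
        "characters": len(text),    # every character, including spaces and newlines
        "total_words": len(words),  # duplicates included
        "unique_words": len(tally), # one key per different word
        "used_over_common": sum(1 for n in tally.values() if n > common),
        "used_once": sum(1 for n in tally.values() if n == 1),
        "over_long_len": sum(1 for w in tally if len(w) > long_len),
    }

# Made-up sample with small thresholds so the results are easy to check by hand.
text = "the cat and the hat and the bat"
words = text.split()
tally = {}
for word in words:
    tally[word] = tally.get(word, 0) + 1

stats = book_stats(text, words, tally, common=2, long_len=2)
for name, value in stats.items():
    print(name, value)
```

Checking the numbers by hand on a tiny string like this one is the same idea as testing on warsmall.txt before running the whole book.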

#Test your program on the SUBSET that was outlined above! You can check words easily there.

war.txt
(338k)
Konstantinovich Samuel,
May 16, 2018, 6:31 AM
warsmall.txt
(2k)
Konstantinovich Samuel,
May 16, 2018, 6:31 AM