Python: Most Used Words

I recently saw a program that gathered the 10 most common words on a webpage and displayed them in a window along with their word count. I decided to build my own using Python and some code I had written before to scrape data from webpages.

# Gives a list of the most common words
# Hunter Thornsberry - hunter@hunterthornsberry.com
from BeautifulSoup import BeautifulSoup  
import urllib2  
import random  
import time

#limit on the number of top words we want to know the count of
limit = 10

#random integer to select user agent
randomint = random.randint(0,7)

#random integer to select sleep time
randomtime = random.randint(1, 30)

#urls to be scraped
urls = ["http://raw.adventuresintechland.com/freedom.html"]

#user agents
user_agents = [  
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19'
]

words = []

for url in urls:
    #Request the page using the randomly selected user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', user_agents[randomint])]
    response = opener.open(url)
    the_page = response.read()
    soup = BeautifulSoup(the_page)

    #Search criteria (is an html tag). Example <p>, <body>, <h1>, etc.
    text = soup.findAll("body")

    #Collect the text of every matching element
    for element in text:
        #print element.text
        words.append(element.text)
    print "--End--"

words = words[0].split(" ")  
words = [element.lower() for element in words]  
sort = []  
for word in set(words):  
    sort.append(str(words.count(word)) + " " + word)

x = 0  
for item in sorted(sort, reverse=True):  
    print item
    if x == limit:
        break
    x = x + 1

This code comes in two parts. The first part gets the data from the webpage; I've got a whole blog post dedicated just to that.
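
The listing above targets Python 2 (urllib2, the old BeautifulSoup import, print statements). As a rough sketch of the same first step on Python 3, the standard library's html.parser can collect the text inside <body>. The class BodyTextParser, the helper body_text, and the sample HTML are hypothetical names made up for illustration, not part of the original script. Fetching the page is left out so the example runs offline; on Python 3 that would go through urllib.request.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect text found between <body> and </body> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is inside <body> and outside script/style
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def body_text(html):
    """Return the body text of an HTML string, joined with single spaces."""
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


sample = (
    "<html><head><title>t</title></head>"
    "<body><h1>Free</h1><p>as in freedom</p></body></html>"
)
print(body_text(sample))
```

The string returned by body_text could then be passed to the word-counting step in place of words[0].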

This is the second part of the code:

words = words[0].split(" ")  
words = [element.lower() for element in words]  
sort = []  
for word in set(words):  
    sort.append(str(words.count(word)) + " " + word)

x = 0  
for item in sorted(sort, reverse=True):  
    print item
    if x == limit:
        break
    x = x + 1

Here I am using .split(" ") to break the text into words. Then I make every word lowercase to get a true count, since "The" and "the" would otherwise be counted as two different words. Next, the first for loop uses set(words) to get the unique words and, for each one, appends a string made up of the number of times that word appears in the words list followed by the word itself.
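
The counting step can also be sketched with collections.Counter from the standard library, which tallies every word in one pass instead of calling words.count() once per unique word. This is a Python 3 sketch, not the original code. The function name count_words and the sample sentence are made up for illustration, and it assumes split() with no argument is acceptable, so that newlines and tabs separate words as well as spaces.

```python
from collections import Counter


def count_words(text):
    """Lowercase the text, split on any whitespace, and tally each word."""
    words = text.lower().split()
    return Counter(words)


sample = "The cat saw the dog and the bird"
counts = count_words(sample)

# "The" and "the" are tallied together because of lower()
print(counts["the"])
print(counts["cat"])
```

A Counter behaves like a dictionary, so counts[word] gives the number of times that word appeared.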

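The second for loop sorts the list and prints the results. Notice that sorted() is not a function we defined; it is built into Python. We also pass "reverse=True" so the largest entries come first. Two caveats: the entries are strings, so they are compared as text and "10 the" would sort below "9 code"; and because x == limit is checked only after printing, the loop prints limit + 1 lines (11 here).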
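
To see how sorted() treats these entries, here is a small Python 3 sketch. Because each entry is a string, the comparison is character by character, so an entry starting with "9" ranks above one starting with "10". One possible way around that, assuming the counts are kept as integers, is to sort (count, word) pairs instead. The function name top_words and the sample counts are hypothetical, chosen only for illustration.

```python
def top_words(counts, limit=10):
    """Return up to `limit` (count, word) pairs, highest count first.

    `counts` maps each word to an integer count. Ties are broken
    alphabetically so the result is deterministic.
    """
    pairs = [(count, word) for word, count in counts.items()]
    pairs.sort(key=lambda pair: (-pair[0], pair[1]))
    return pairs[:limit]


sample_counts = {"the": 10, "code": 9, "new": 9, "print": 8}

# Sorting the string form compares text, not numbers
as_strings = sorted(
    [str(c) + " " + w for w, c in sample_counts.items()], reverse=True
)
print(as_strings)

# Sorting on the integer count puts the largest count first
for count, word in top_words(sample_counts, limit=3):
    print(count, word)
```

Slicing with [:limit] returns at most limit entries, which avoids having to track a separate counter inside the loop.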

Output

--End--
9 programmers  
9 other  
9 one  
9 new  
9 few  
9 code  
8 when  
8 says  
8 print  
8 first  
8 didn't