free hit counter

A Web Scraper That Sucks Even Less!

I recently wrote a post about using BeautifulSoup and urllib2 to scrape html off webpages and parse it into useful text. The only issue was it was easy to get banned with.

This modification to the code does not make you ban proof, and the same warning applies.

from bs4 import BeautifulSoup  
import urllib2  
import random  
import time

#random integer to select user agent
randomint = random.randint(0,7)

#random interger to select sleep time
randomtime = random.randint(1, 30)

#urls to be scraped
urls = ["http://www.hunterthornsberry.com", "http://huntert.me/"]

#user agents
user_agents = [  
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19'
]


index = 0  
while len(urls) > index:  
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', user_agents[randomint])]
    response = opener.open(urls[index])
    the_page = response.read()
    soup = BeautifulSoup(the_page)  

    #Search criteria (is an html tag). Example <p>, <body>, <h1>, etc.
    text = soup.findAll("body")

    #Runs until it has an index out of range error and breaks, this will return every response
    while True:
        try:
            i = 0
            while True:
                print text[i].text
                i = i + 1
        except IndexError:
            print "--End--"
            break
    index = index + 1
    time.sleep(randomtime)

What I've done here is taken a list of common user-agents and randomly selected one to be passed with our HTTP request, this makes our request look as if they are coming from different browsers. On top of that I've added a random wait period (1-30 seconds) after each request.