A Web Scraper That Doesn't (Completely) Suck

Python is great for a lot of things, and web scraping is another one. Web scraping is the act of programmatically grabbing information from webpages, typically from the HTML returned by a website.

NOTE: Web scraping can be abused and in many cases will get you banned from websites (sorry, pastebin!). Only use it when you are 100% positive it is allowed.

We're going to use urllib2 (part of the Python 2 standard library) and Beautiful Soup, although you can use your choice of HTTP library (requests is another big one) in place of urllib2.

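from bs4 import BeautifulSoup
import urllib2

# Fetch the raw HTML of the page.
req = urllib2.Request("http://www.crummy.com/software/BeautifulSoup/")
response = urllib2.urlopen(req)
the_page = response.read()

# Parse it, naming the parser explicitly so bs4 does not have to guess.
soup = BeautifulSoup(the_page, "html.parser")
text = soup.findAll("p")

# Print the text of the first paragraph.
print text[0].text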

In this example we are finding all of the <p> tags (meaning the result comes back as a list) and grabbing the text of the first one with text[0].text.
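Since the result is a list, you are not limited to the first paragraph. Here is a minimal sketch of looping over all of them. The sample_html string is a made-up stand-in for the_page so the sketch runs without hitting a real site, and find_all is bs4's current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the_page, so no network call is needed.
sample_html = """
<html><body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like collection of every matching tag.
paragraphs = soup.find_all("p")

# Pull the text out of each paragraph instead of only the first one.
paragraph_texts = [p.text for p in paragraphs]

for line in paragraph_texts:
    print(line)
```

Swapping sample_html for the_page from the script above should give you every paragraph on the real page.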

If we wanted the complete text of the webpage, we would simply change the call to soup.findAll("html") and read text[0].text as before, which grabs all of the text between the two html tags.
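A small sketch of the whole-page idea, again on a made-up sample_html string rather than a live page. It also shows get_text(), which is bs4's own way of asking for all of the text and takes an optional separator so the pieces don't run together.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document standing in for a downloaded page.
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello</p><p>world</p></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# There is only one <html> tag, so the list has a single element.
html_tags = soup.find_all("html")
page_text = html_tags[0].text

# Asking the soup itself for its text gives the same result here.
same_text = soup.get_text()

# A separator plus strip=True keeps the individual pieces apart.
spaced_text = soup.get_text(" ", strip=True)

print(page_text)
print(spaced_text)
```

The unseparated version glues "Demo", "Hello" and "world" together, so the separator form is usually the one you want when the text is meant for reading.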