A Web Scraper That Doesn't (Completely) Suck

Python is great for a lot of things, and web scraping is another one. Web scraping is the act of programmatically grabbing information from webpages, typically from the HTML a website returns.

NOTE: Web scraping can be abused, and in many cases it will get you banned from websites (sorry, pastebin!). Only use it when you are 100% positive it is allowed.

We're going to use urllib2 and Beautiful Soup, although you can swap in your HTTP library of choice (requests is another big one) in place of urllib2.
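One thing to keep in mind: urllib2 only exists in Python 2. In Python 3 its functionality lives in urllib.request. As a rough sketch of the same fetch step on Python 3 (the helper name fetch_html, the timeout value, and the UTF-8 fallback are our own choices for illustration, not part of any library):

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body of `url` decoded to text (hypothetical helper)."""
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req, timeout=timeout) as response:
        raw = response.read()
        # Use the charset the server declares; fall back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://www.crummy.com/software/BeautifulSoup/") should hand back the page's HTML as a string, ready to pass to Beautiful Soup.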

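from bs4 import BeautifulSoup
import urllib2

# Fetch the raw HTML of the page (urllib2 is Python 2 only).
req = urllib2.Request("http://www.crummy.com/software/BeautifulSoup/")
response = urllib2.urlopen(req)
the_page = response.read()

# Parse the HTML, naming the parser explicitly to avoid a bs4 warning.
soup = BeautifulSoup(the_page, "html.parser")

# findAll returns a list of every matching <p> tag.
text = soup.findAll("p")

# Print the text of the first paragraph.
print text[0].text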

In this example we are finding all of the <p> tags (findAll returns them as a list) and grabbing the text of the first one with text[0].text.
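To see the list behaviour without touching the network, here is a small sketch that parses a made-up HTML string (the sample markup and variable names are invented for illustration). It is written for Python 3 and uses find_all, which is the current bs4 spelling of findAll:

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for a downloaded page.
sample_html = """
<html><body>
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like ResultSet with one entry per matching tag.
paragraphs = soup.find_all("p")

# .text drops the markup and keeps only the text inside each tag.
paragraph_texts = [p.text for p in paragraphs]
print(paragraph_texts)
```

Because the result behaves like a list, you can index it, loop over it, or check its length before reaching for paragraphs[0], which is worth doing on pages that may not contain the tag at all.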

If we wanted the complete text of the webpage, we would change the call to soup.findAll("html") and read text[0].text as before, which returns all of the text between the opening and closing html tags.
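A quick sketch of that, again on an invented HTML string rather than a live page and written for Python 3. It also shows get_text(), which bs4 provides for pulling the text out of a whole document; the separator and strip arguments are optional:

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for a downloaded page.
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><p>World</p></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# Grab the single <html> tag and read all of the text inside it.
html_tags = soup.find_all("html")
full_text = html_tags[0].text

# get_text() on the soup does the same job; a separator keeps words apart.
spaced_text = soup.get_text(" ", strip=True)

print(full_text)
print(spaced_text)
```

Note that the plain .text version runs neighbouring pieces of text together, so passing a separator is usually the more readable option.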
