
How To Detect When A Webpage Changes With Python

If you've ever waited for a webpage to update, you know how time-consuming it is to keep refreshing the page.

I had a similar problem right before the Eclipse of 2017. I was trying to get eclipse glasses from a local observatory, but they were sold out. Fortunately, the observatory's website said a new shipment of glasses was arriving in a few days and that the site would be updated when they went on sale. These eclipse glasses were selling like hotcakes, so I knew I had to be on the ball as soon as the website updated. Obviously I wasn't going to spend all day checking their website, so I wrote this bit of Python to help.

This code detects changes in a webpage's content by fetching the page's HTML, computing a checksum of it, and then re-fetching the page at a fixed interval (60 seconds by default) and comparing the new checksum against the original. If the page has not changed, it prints "Not Changed" and keeps checking; once the page has changed, it prints "Changed" and exits. You can replace these print statements with code that emails you, or with whatever other Python code you like.
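
The comparison at the heart of this approach can be tried offline, without touching the network. Here is a minimal sketch for Python 3; the page_hash helper and the two sample HTML strings are made up for illustration and are not part of the script itself.

```python
import hashlib


def page_hash(html_bytes):
    """Return the SHA-224 hex digest of a page's raw bytes."""
    return hashlib.sha224(html_bytes).hexdigest()


# Hypothetical page contents before and after an update.
before = b"<html><body>Eclipse glasses: sold out</body></html>"
after = b"<html><body>Eclipse glasses: on sale now!</body></html>"

# Checksum of the page as first seen.
initial = page_hash(before)

# Identical bytes always produce the same digest...
print(page_hash(before) == initial)
# ...while changed bytes produce a different one.
print(page_hash(after) == initial)
```

Comparing digests instead of whole pages means only a 56-character string has to be remembered between checks, no matter how large the page is.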


# Hunter Thornsberry
# http://www.adventuresintechland.com

# WebChange.py
# Alerts you when a webpage has changed its content by comparing checksums of the html.

import hashlib
import random
import time
import urllib.request

# url to be scraped
url = "http://raw.adventuresintechland.com/freedom.html"

# time between checks in seconds
sleeptime = 60

def getHash():
    # User_Agents
    # This helps skirt a bit around servers that detect repeated requests from the same machine.
    # This will not prevent your IP from getting banned but will help a bit by pretending to be different browsers
    # and operating systems.
    user_agents = [
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
        'Opera/9.25 (Windows NT 5.1; U; en)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19'
    ]

    # Fetch the page, presenting a randomly chosen user agent
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', random.choice(user_agents))]
    response = opener.open(url)
    the_page = response.read()

    # Checksum of the raw bytes of the page
    return hashlib.sha224(the_page).hexdigest()

current_hash = getHash() # Get the current hash, which is what the website is now

while True: # Run forever
    if getHash() == current_hash: # If nothing has changed
        print("Not Changed")
    else: # If something has changed
        print("Changed")
        break
    time.sleep(sleeptime)
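
The print statements are the natural place to hook in a notification. One possible way to structure that, sketched for Python 3, is to pass the fetching function and a callback into the loop so the callback can send an email or do anything else. The names watch, fetch, on_change, max_checks, and build_alert are hypothetical and not part of the script above, and the addresses you pass in are placeholders.

```python
import hashlib
import time
from email.message import EmailMessage


def watch(fetch, on_change, interval=60, max_checks=None):
    """Poll fetch() until its checksum changes, then call on_change().

    fetch      -- callable returning the page as bytes (hypothetical hook)
    on_change  -- callable run once when a change is detected
    max_checks -- optional cap on the number of re-checks (None = no cap)
    Returns True if a change was seen, False if the cap was reached.
    """
    baseline = hashlib.sha224(fetch()).hexdigest()
    checks = 0
    while max_checks is None or checks < max_checks:
        # The baseline was just fetched, so wait before re-checking.
        time.sleep(interval)
        checks += 1
        if hashlib.sha224(fetch()).hexdigest() != baseline:
            on_change()
            return True
    return False


def build_alert(url, sender, recipient):
    """Build (but do not send) an email announcing the change."""
    msg = EmailMessage()
    msg["Subject"] = "Page changed: " + url
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The page at " + url + " no longer matches its saved checksum.")
    return msg
```

To actually send the message, the standard library's smtplib can be used, for example by opening smtplib.SMTP with your mail server's host name and calling send_message on it inside on_change. Which host and credentials to use depends entirely on your mail setup, so that part is left out here.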