Checking if web sites are online with Python

Saturday, March 31, 2018

Checking if web sites are online with Python

By Vasudev Ram

Hi readers,

Recently, I thought of writing a small program to check if one or more web sites are online or not. I used the requests Python library with the HTTP HEAD method. I also checked out PycURL for this. It is a thin wrapper over libcurl, the library that powers the well-known and widely used curl command line tool. While PycURL looks powerful and fast (since it is a thin wrapper that exposes most or all of the functionality of libcurl), I decided to use requests for this version of the program. The code for the program is straightforward, but I found a few interesting things while running it with a few different sites as arguments. I mention those points below.

Here is the tool: I named it is_site_online.py:

"""
is_site_online.py
Purpose: A Python program to check if a site is online or not.
Uses the requests library and the HTTP HEAD method.
Tries both with and without HTTP redirects.
Author: Vasudev Ram
Copyright 2018 Vasudev Ram
Web site: https://vasudevram.github.io
Blog: https://jugad2.blogspot.com
Product store: https://gumroad.com/vasudevram
"""

from __future__ import print_function
import sys
import requests
import time

if len(sys.argv) < 2:
    sys.stderr.write("Usage: {} site ...".format(sys.argv[0]))
    sys.stderr.write("Checks if the given site(s) are online or not.")
    sys.exit(0)

print("Checking if these sites are online or not:")
print("   ".join(sys.argv[1:]))

print("-" * 60)
try:
    for site in sys.argv[1:]:
        for allow_redirects in (False, True):
            tc1 = time.clock()
            r = requests.head(site, allow_redirects=allow_redirects)
            tc2 = time.clock()
            print("Site:", site)
            print("Check with allow_redirects =", allow_redirects)
            print("Results:")
            print("r.ok:", r.ok)
            print("r.status_code:", r.status_code)
            print("request time:", round(tc2 - tc1, 3), "secs")
            print("-" * 60)
except requests.ConnectionError as ce:
    print("Error: ConnectionError: {}".format(ce))
    sys.exit(1)
except requests.exceptions.MissingSchema as ms:
    print("Error: MissingSchema: {}".format(ms))
    sys.exit(1)
except Exception as e:
    print("Error: Exception: {}".format(e))
    sys.exit(1)

The results of some runs of the program:

Check for Google and Yahoo!:

$ python is_site_online.py http://google.com http://yahoo.com
Checking if these sites are online or not:
http://google.com   http://yahoo.com
-----------------------------------------------------------
Site: http://google.com
Check with allow_redirects = False
Results:
r.ok: True
r.status_code: 302
request time: 0.217 secs
------------------------------------------------------------
Site: http://google.com
Check with allow_redirects = True
Results:
r.ok: True
r.status_code: 200
request time: 0.36 secs
------------------------------------------------------------
Site: http://yahoo.com
Check with allow_redirects = False
Results:
r.ok: True
r.status_code: 301
request time: 2.837 secs
------------------------------------------------------------
Site: http://yahoo.com
Check with allow_redirects = True
Results:
r.ok: True
r.status_code: 200
request time: 1.852 secs
------------------------------------------------------------

In the cases where allow_redirects is False, google.com gives a status code of 302 and yahoo.com gives a status code of 301. The 3xx series of codes are related to HTTP redirection.

After seeing this, I looked up HTTP status code information in a few sites such as Wikipedia and the official site www.w3.org (the World Wide Web Consortium), and found a point worth noting. See the part in the Related links section at the end of this post about "302 Found", where it says: "This is an example of industry practice contradicting the standard.".

Now let's check for some error cases:

One error case: we do not give an http:// prefix (assume some novice user who is mixed up about schemes and paths), so they type a garbled site name, say http.om:

$ python is_site_online.py http.om
Checking if these sites are online or not:
http.om
------------------------------------------------------------
Traceback (most recent call last):
  File "is_site_online.py", line 32, in 
    r = requests.head(site, allow_redirects=allow_redirects)
[snip long traceback]
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'http.om':
No schema supplied. Perhaps you meant http://http.om?

This traceback tells us that when no HTTP 'scheme' [1][2] is given, requests raises a MissingSchema exception. So we now know that we need to catch that exception in our code, by adding another except clause to the try statement, which I later did, in the program you see in this post. In general, this technique can be useful when using a new Python library for the first time: just don't handle any exceptions in the beginning, use it a few times with variations in input or modes of use, and see what sorts of exceptions it throws. Then add code to handle them.

[1] The components of a URL

[2] Parts of URL

Another error case - a made-up site name that does not exist:

$ python is_site_online.py http://abcd.efg
Checking if these sites are online or not:
http://abcd.efg
------------------------------------------------------------
Caught ConnectionError: HTTPConnectionPool(host='abcd.efg',
port=80): Max retries exceeded with url: / (Caused
by NewConnectionError(': Failed
to establish a new connection: [Errno 11004] getaddrinfo
failed',))

From the above error we can see or figure out a few things:

- the requests library defines a ConnectionError exception. I first ran the above command without catching ConnectionError in the program; it gave that error, then I added the handler for it.

- requests uses an HTTP connection pool

- requests does some retries when you try to get() or head() a URL (a site name)

- requests uses urllib3 (from the Python standard library) under the hood

I had discovered that last point earlier too; see this post:

urllib3, the library used by the Python requests library

And as I mentioned in that post, urllib3 itself uses httplib.

Now let's check for some sites that are misspellings of the site google.com:

$ python is_site_online.py http://gogle.com Checking ... ------------------------------------------------------------ Site: http://gogle.com With allow_redirects: False Results: r.ok: True r.status_code: 301 request time: 3.377 ------------------------------------------------------------ Site: http://gogle.com With allow_redirects: True Results: r.ok: True r.status_code: 200 request time: 1.982 ------------------------------------------------------------

$ python is_site_online.py http://gooogle.com Checking ... ------------------------------------------------------------ Site: http://gooogle.com With allow_redirects: False Results: r.ok: True r.status_code: 301 request time: 0.425 ------------------------------------------------------------ Site: http://gooogle.com With allow_redirects: True Results: r.ok: True r.status_code: 200 request time: 1.216 ------------------------------------------------------------

Interestingly, the results show that that both those misspellings of google.com exist as sites.

It is known that some people register domains that are similar in spelling to well-known / popular / famous domain names, maybe hoping to capture some of the traffic resulting from users mistyping the famous ones. Although I did not plan it that way, I realized, from the above two results for gogle.com and gooogle.com, that this tool can be used to detect the existence of such sites (if they are online when you check, of course).

jugad2 - Vasudev Ram on software innovation

Pages

Saturday, March 31, 2018

Checking if web sites are online with Python

No comments:

Blog Archive

Labels