Screen Scraping

2: Writing the screen scraper

We will follow the generally good practice of breaking the problem into small, testable chunks and evolving the code gradually.

(i) A dummy screen scraper

First we will implement a short piece of code that creates a variable containing a sequence and calls a (dummy) PredictSS() function that simply returns a placeholder string of the correct length. Later we will replace this function with our proper screen scraper.

Your code needs to:

1. Define a CleanSequence() function which removes return characters and whitespace.

def CleanSequence(seq):
    seq = seq.replace('\n', '')
    seq = seq.replace(' ', '')
    return(seq)
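
As written, CleanSequence() strips only newlines and spaces. If the pasted sequence might also contain carriage returns or tabs, one option (a sketch only, not required for the exercise) is to remove all whitespace in a single step:

def CleanSequence(seq):
    # split() breaks the string on any whitespace (spaces, tabs, newlines),
    # so rejoining the pieces removes all of it
    return(''.join(seq.split()))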

2. Define a PredictSS() function which calls the CleanSequence() function and then generates a string of question marks of the same length as the sequence using a for loop and the range function.

def PredictSS(seq):
    seq = CleanSequence(seq)
    retval = ''
    for i in range(0, len(seq)):
        retval += '?'
    return(retval)
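
(Python's string repetition, retval = '?' * len(seq), would do the same job in one line, but the exercise asks you to practise the for loop and the range function.)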

3. Create a main program that sets a variable containing the sequence, calls the PredictSS() function and then prints the cleaned sequence and the secondary structure.

seq = """KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDY
GILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNR
CKGTDVQAWIRGCRL"""

ss = PredictSS(seq)
if(ss != ""):
    seq = CleanSequence(seq)
    print (seq)
    print (ss)

(ii) A test screen scraper

Your next job is to replace the dummy PredictSS() code with a proper routine that will get the results back from the server, but won't yet parse out the data we are interested in.

You need to:

1. Work out the full URL that you need to use to access the CGI script, including any parameters that have to be passed (a sketch of one way to assemble the query string follows the notes below). This will be of the form:

http://server.com/path/to/script.cgi?name1=value1&name2=value2&name3=value3

Note:

  1. You should already have obtained the full URL for the CGI script (i.e. the part before the question mark), as well as the name and possible values of the radio button used to specify the tertiary structure class, the name of the input box for the sequence name, and the name of the textarea used to enter the sequence.
  2. We will assume that you do not know the tertiary structure class of the protein and therefore can set this value to whatever is used in the web page to indicate no known tertiary structure class.
  3. We do not need a name for the protein sequence, so we will set the protein name parameter to a blank string (i.e. include the parameter name and an equals sign with nothing after it).
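
As an illustration only (the script path and the parameter names option, name and text below are invented; substitute the ones you noted from the NNPREDICT form), the query string can be typed by hand in the form shown above or assembled with urllib.parse.urlencode, which also escapes any awkward characters:

from urllib import parse

# Hypothetical parameter names -- replace them with those from the real form
params  = parse.urlencode({'option': 'none', 'name': '', 'text': seq})
fullurl = 'http://www.example.com/cgi-bin/nnpredict.pl' + '?' + params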

2. Import the request module from the urllib package. (You might also want to import the sys module if you want to send errors to standard error instead of standard output.)

from urllib import request
import sys

3. Provide a dummy ParseSS() routine that will eventually parse the secondary structure information out of the resulting HTML (for now it does nothing other than return the HTML it was given).

def ParseSS(result):
    return(result)

4. Replace the PredictSS() routine with one which cleans the sequence, builds the full URL (including the parameters), fetches the resulting page from the server with request.urlopen(), and passes the returned HTML to ParseSS(), as sketched below.

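Here is a minimal sketch of what the new PredictSS() might look like. The URL and the parameter names in the query string are placeholders (the real NNPREDICT server no longer exists), so substitute the values you obtained from the submission form:

def PredictSS(seq):
    seq = CleanSequence(seq)

    # Hypothetical script address and parameter names -- replace with the real ones
    url     = 'http://www.example.com/cgi-bin/nnpredict.pl'
    params  = 'option=none&name=&text=' + seq
    fullurl = url + '?' + params

    try:
        response = request.urlopen(fullurl)         # submit the request to the server
        html     = response.read().decode('utf-8')  # the returned page as a string
    except Exception as e:
        print("Error accessing the server: " + str(e), file=sys.stderr)
        return("")

    return(ParseSS(html))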

(iii) The final screen scraper

Finally you need to rewrite the ParseSS() routine to parse the secondary structure information out of the HTML.

You need to:

1. Look at the results from running the previous script and see where the secondary structure information is found:

<HTML><HEAD>
<TITLE>NNPREDICT RESULTS</TITLE>
</HEAD>
<BODY fgcolor="F0F0F0">
<h1 align=center>Results of nnpredict query</h1>
<p><b>Tertiary structure class:</b> alpha/beta
<p><b>Sequence</b>:<br>
<tt>
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
</tt>
<p><b>Secondary structure prediction
<i>(H = helix, E = strand, - = no prediction)</i>:<br></b>
<tt>
EEEEHE-EHEH-H-H-HEHHE-E-EEEEE-EH--E-EEHHHHHH-HEH-HEHE-HH--H-EHHEHEHH-E--HHH-EHEEEHEEE---EEE-EHH-E----HH-HHHH--H-HE-HEHHHHEEHH-HEH
</tt>
</body></html>


You should observe that the secondary structure information is in the second <tt> block.

2. Import the regular expression (re) module

import re

3. Strip all return characters from the HTML so that the whole page becomes a single line (by default the . in a regular expression does not match newline characters):

result = result.replace('\n', '')

4. Create and execute a regular expression that captures the contents of the second <tt>...</tt> block:

Note that .* is a greedy match, so it will match as many characters as possible, while .*? will match as few characters as possible. Since regular expressions are matched from left to right, the initial greedy .* swallows everything up to and including the first <tt>, so the captured group is the text between the last (in this case the second) <tt> </tt> pair. As a simple example, if the string were abcXdefXghi, then .*X would match abcXdefX while .*?X would match only abcX.

pattern = re.compile('.*<tt>(.*?)</tt>.*')
match   = pattern.match(result)
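
If you want to see the greedy/non-greedy difference for yourself, the example from the note can be tried directly:

import re

s = 'abcXdefXghi'
print(re.match('.*X', s).group(0))    # abcXdefX  (greedy: as much as possible)
print(re.match('.*?X', s).group(0))   # abcX      (non-greedy: as little as possible)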

5. Extract and return the matched characters.

result = match.group(1)   # group(1) is the text captured between <tt> and </tt>
return(result)

You might want to improve the code to deal with a failed match.
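
One possible version of the complete ParseSS(), returning an empty string on failure (which the main program already checks for), is sketched below:

def ParseSS(result):
    result  = result.replace('\n', '')
    pattern = re.compile('.*<tt>(.*?)</tt>.*')
    match   = pattern.match(result)

    if match is None:
        # The expected <tt> block was not found in the returned HTML
        print("Error: no secondary structure found in the returned page", file=sys.stderr)
        return("")

    return(match.group(1))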


Congratulations! You should now have a fully working screen scraper. Sadly you won't have any real secondary structure predictions since the real version of NNPREDICT doesn't exist any more.

Note how the screen parsing depends on the layout of the HTML. If the server returned results in a different format, the screen scraper would stop working.
