Screen Scraping

2: Writing the screen scraper

We will follow the generally good practice of breaking the problem into small, testable chunks and evolving the code gradually.

(i) A dummy screen scraper

First we will implement a short piece of code that creates a variable containing a sequence and calls a (dummy) PredictSS() function that simply returns a placeholder string of the correct length. Later we will replace this function with our proper screen scraper.

Your code needs to:

1. Define a CleanSequence() function which removes return characters and whitespace.

def CleanSequence(seq):
    seq = seq.replace('\n', '')
    seq = seq.replace(' ', '')
    return(seq)
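
As written, CleanSequence() strips only newlines and spaces. If the pasted sequence might also contain carriage returns or tabs, one option (a sketch only, not required for the exercise) is to remove all whitespace in a single step:

def CleanSequence(seq):
    # split() breaks the string on any whitespace (spaces, tabs, newlines),
    # so rejoining the pieces removes all of it
    return(''.join(seq.split()))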

2. Define a PredictSS() function which calls the CleanSequence() function and then generates a string of question marks of the same length as the sequence using a for loop and the range function.

def PredictSS(seq):
    seq = CleanSequence(seq)
    retval = ''
    for i in range(0, len(seq)):
        retval += '?'
    return(retval)
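
(Python's string repetition, retval = '?' * len(seq), would do the same job in one line, but the exercise asks you to practise the for loop and the range function.)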

3. Create a main program that sets a variable containing the sequence, calls the PredictSS() function and then prints the cleaned sequence and the secondary structure.

seq = """KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDY
GILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNR
CKGTDVQAWIRGCRL"""

ss = PredictSS(seq)
if(ss != ""):
    seq = CleanSequence(seq)
    print (seq)
    print (ss)

(ii) A test screen scraper

Your next job is to replace the dummy PredictSS() code with a proper routine that will get the results back from the server, but won't yet parse out the data we are interested in.

You need to:

1. Work out the full URL that you need to use to access the CGI script, including any parameters that have to be passed (a sketch of one way to assemble the query string follows the notes below). This will be of the form:

http://server.com/path/to/script.cgi?name1=value1&name2=value2&name3=value3

Note:

  1. You should already have obtained the full URL for the CGI script (i.e. the part before the question mark), as well as the name and possible values of the radio button used to specify the tertiary structure class, the name of the input box for the sequence name, and the name of the textarea used to enter the sequence.
  2. We will assume that you do not know the tertiary structure class of the protein and therefore can set this value to whatever is used in the web page to indicate no known tertiary structure class.
  3. We do not need a name for the protein sequence, so we will set the protein name parameter to a blank string (i.e. include the parameter name and an equals sign with nothing after it).
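
As an illustration only (the script path and the parameter names option, name and text below are invented; substitute the ones you noted from the NNPREDICT form), the query string can be typed by hand in the form shown above or assembled with urllib.parse.urlencode, which also escapes any awkward characters:

from urllib import parse

# Hypothetical parameter names -- replace them with those from the real form
params  = parse.urlencode({'option': 'none', 'name': '', 'text': seq})
fullurl = 'http://www.example.com/cgi-bin/nnpredict.pl' + '?' + params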

2. Import the request module from the urllib package. (You might also want to import the sys module if you want to send errors to standard error instead of standard output.)

from urllib import request
import sys

3. Provide a dummy ParseSS() routine that will eventually parse the secondary structure information out of the resulting HTML (for now it does nothing other than return the HTML it was given).

def ParseSS(result):
    return(result)

4. Replace the PredictSS() routine with one which cleans the sequence, builds the full URL (including the parameters), fetches the resulting page from the server with request.urlopen(), and passes the returned HTML to ParseSS(), as sketched below.

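Here is a minimal sketch of what the new PredictSS() might look like. The URL and the parameter names in the query string are placeholders (the real NNPREDICT server no longer exists), so substitute the values you obtained from the submission form:

def PredictSS(seq):
    seq = CleanSequence(seq)

    # Hypothetical script address and parameter names -- replace with the real ones
    url     = 'http://www.example.com/cgi-bin/nnpredict.pl'
    params  = 'option=none&name=&text=' + seq
    fullurl = url + '?' + params

    try:
        response = request.urlopen(fullurl)         # submit the request to the server
        html     = response.read().decode('utf-8')  # the returned page as a string
    except Exception as e:
        print("Error accessing the server: " + str(e), file=sys.stderr)
        return("")

    return(ParseSS(html))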

(iii) The final screen scraper

Finally you need to rewrite the ParseSS() routine to parse the secondary structure information out of the HTML.

You need to:

1. Look at the results from running the previous script and see where the secondary structure information is found:

<HTML><HEAD>
<TITLE>NNPREDICT RESULTS</TITLE>
</HEAD>
<BODY fgcolor="F0F0F0">
<h1 align=center>Results of nnpredict query</h1>
<p><b>Tertiary structure class:</b> alpha/beta
<p><b>Sequence</b>:<br>
<tt>
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
</tt>
<p><b>Secondary structure prediction
<i>(H = helix, E = strand, - = no prediction)</i>:<br></b>
<tt>
EEEEHE-EHEH-H-H-HEHHE-E-EEEEE-EH--E-EEHHHHHH-HEH-HEHE-HH--H-EHHEHEHH-E--HHH-EHEEEHEEE---EEE-EHH-E----HH-HHHH--H-HE-HEHHHHEEHH-HEH
</tt>
</body></html>


You should observe that the secondary structure information is in the second <tt> block.

2. Import the regular expression (re) module

import re

3. Strip all return characters from the HTML so that the whole page becomes a single line (by default the . in a regular expression does not match newline characters):

result = result.replace('\n', '')

4. Create and execute a regular expression that captures the contents of the second <tt>...</tt> block:

Note that .* is a greedy match, so it will match as many characters as possible, while .*? will match as few characters as possible. Since regular expressions are matched from left to right, the initial greedy .* swallows everything up to and including the first <tt>, so the captured group is the text between the last (in this case the second) <tt> </tt> pair. As a simple example, if the string were abcXdefXghi, then .*X would match abcXdefX while .*?X would match only abcX.

pattern = re.compile('.*<tt>(.*?)</tt>.*')
match   = pattern.match(result)
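
If you want to see the greedy/non-greedy difference for yourself, the example from the note can be tried directly:

import re

s = 'abcXdefXghi'
print(re.match('.*X', s).group(0))    # abcXdefX  (greedy: as much as possible)
print(re.match('.*?X', s).group(0))   # abcX      (non-greedy: as little as possible)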

5. Extract and return the matched characters.

result = match.group(1)   # group(1) is the text captured between <tt> and </tt>
return(result)

You might want to improve the code to deal with a failed match.
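
One possible version of the complete ParseSS(), returning an empty string on failure (which the main program already checks for), is sketched below:

def ParseSS(result):
    result  = result.replace('\n', '')
    pattern = re.compile('.*<tt>(.*?)</tt>.*')
    match   = pattern.match(result)

    if match is None:
        # The expected <tt> block was not found in the returned HTML
        print("Error: no secondary structure found in the returned page", file=sys.stderr)
        return("")

    return(match.group(1))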


Congratulations! You should now have a fully working screen scraper. Sadly you won't have any real secondary structure predictions since the real version of NNPREDICT doesn't exist any more.

Note how the screen parsing depends on the layout of the HTML. If the server returned results in a different format, the screen scraper would stop working.
