We will follow the generally good practice of breaking the problem up into small testable chunks and evolving the code gradually.
First we will implement a short piece of code that creates a variable containing a sequence and calls a (dummy) PredictSS() function that simply returns a placeholder string of question marks of the correct length. Later we will replace this function with our proper screen scraper.
Your code needs to:
1. Define a CleanSequence() function which removes return characters and whitespace.
    def CleanSequence(seq):
        seq = seq.replace('\n', '')
        seq = seq.replace(' ', '')
        return(seq)
2. Define a PredictSS() function which calls the CleanSequence() function and then generates a string of question marks of the same length as the sequence using a for loop and the range function.
    def PredictSS(seq):
        seq = CleanSequence(seq)
        retval = ''
        for i in range(0, len(seq)):
            retval += '?'
        return(retval)
3. Create a main program that sets a variable containing the sequence, calls the PredictSS() function and then prints the cleaned sequence and the secondary structure.
    seq = """KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDY
    GILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNR
    CKGTDVQAWIRGCRL"""

    ss = PredictSS(seq)
    if(ss != ""):
        seq = CleanSequence(seq)
        print(seq)
        print(ss)
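Because each chunk is small, it can be checked on its own before moving on. The following is a minimal sketch of such a check, assuming the CleanSequence() and dummy PredictSS() functions described above (repeated here so the block runs by itself). The short test sequence is made up for illustration.

```python
def CleanSequence(seq):
    """Remove return characters and spaces from a sequence."""
    seq = seq.replace('\n', '')
    seq = seq.replace(' ', '')
    return seq


def PredictSS(seq):
    """Dummy predictor: one '?' per residue in the cleaned sequence."""
    seq = CleanSequence(seq)
    retval = ''
    for i in range(0, len(seq)):
        retval += '?'
    return retval


# Hypothetical test input containing a space and a return character
test_seq = "KVFG RCEL\nAAAM"

cleaned = CleanSequence(test_seq)
ss = PredictSS(test_seq)

# The cleaned sequence and the dummy prediction must have the same length
assert cleaned == "KVFGRCELAAAM"
assert len(ss) == len(cleaned)
print(cleaned)
print(ss)
```

If either assertion fails, the problem lies in these two small functions rather than in the screen-scraping code added later.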
Your next job is to replace the dummy PredictSS() code with a proper routine that will get the results back from the server, but won't yet parse out the data we are interested in.
You need to:
1. Work out the full URL that you need to use to access the CGI script including passing any parameters. This will be of the form:
http://server.com/path/to/script.cgi?name1=value1&name2=value2&name3=value3
Note:
2. Import the request module from the urllib package. (You might also want to import the sys module if you want to send errors to standard error instead of standard output.)
    from urllib import request
    import sys
3. Provide a dummy routine (ParseSS()) that parses the secondary structure information out of the resulting HTML (currently this will just do nothing other than return the HTML that went into the routine).
    def ParseSS(result):
        return(result)
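4. Replace the PredictSS() routine with one which cleans the sequence, builds the full URL, retrieves the results from the server, converts them from bytes to a string and passes them to ParseSS().
    def PredictSS(seq):
        # Remove return characters and whitespace
        seq = CleanSequence(seq)

        # Build the full URL, including the parameters
        url = "http://server.com/path/to/script.cgi"
        params = "option=none&name=&text=" + seq
        fullurl = url + "?" + params

        # Retrieve the results from the server (returned as bytes)
        result = request.urlopen(fullurl).read()

        # Convert the bytes to a string
        result = str(result, encoding='utf-8')

        # Parse the results if anything was returned
        if(result != ''):
            ss = ParseSS(result)
            return(ss)
        else:
            sys.stderr.write("Nothing was returned\n")
            return("")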
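Joining the parameters by hand works for a plain amino-acid sequence, but values containing spaces or characters such as & would need escaping. The following is a minimal sketch, assuming the same placeholder URL and parameter names as above, that builds the query string with urllib.parse.urlencode() and reports a failed request on standard error. BuildURL() and FetchResult() are hypothetical helper names, not part of the exercise.

```python
import sys
from urllib import error, parse, request


def BuildURL(seq):
    """Build the full URL for the (placeholder) CGI script."""
    url = "http://server.com/path/to/script.cgi"
    # urlencode() escapes characters that are not safe in a URL
    params = parse.urlencode({'option': 'none', 'name': '', 'text': seq})
    return url + "?" + params


def FetchResult(fullurl):
    """Return the page as a string, or '' if the request fails."""
    try:
        with request.urlopen(fullurl) as response:
            return str(response.read(), encoding='utf-8')
    except error.URLError as e:
        # Covers unreachable servers and HTTP error responses
        sys.stderr.write("Request failed: %s\n" % e)
        return ""


# Only the URL is built here; no request is sent
print(BuildURL("KVFGRCEL"))
```

PredictSS() could then call FetchResult(BuildURL(seq)) in place of the two lines that build and open the URL.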
Finally you need to rewrite the ParseSS() routine to parse the secondary structure information out of the HTML.
You need to:
1. Look at the results from running the previous script and see where the secondary structure information is found:
    <HTML><HEAD>
    <TITLE>NNPREDICT RESULTS</TITLE>
    </HEAD>
    <BODY fgcolor="F0F0F0">
    <h1 align=center>Results of nnpredict query</h1>
    <p><b>Tertiary structure class:</b> alpha/beta
    <p><b>Sequence</b>:<br>
    <tt>
    KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
    </tt>
    <p><b>Secondary structure prediction <i>(H = helix, E = strand, - = no prediction)</i>:<br></b>
    <tt>
    EEEEHE-EHEH-H-H-HEHHE-E-EEEEE-EH--E-EEHHHHHH-HEH-HEHE-HH--H-EHHEHEHH-E--HHH-EHEEEHEEE---EEE-EHH-E----HH-HHHH--H-HE-HEHHHHEEHH-HEH
    </tt>
    </body></html>
You should observe that the secondary structure information is in the second <tt> block.
2. Import the regular expression (re) module
import re
3. Strip all return characters from the HTML
result = result.replace('\n', '')
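4. Create and execute a regular expression to match any leading text, a <tt> tag, the text up to the next </tt> tag (captured as a group) and any trailing text.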
Note that .* is a greedy match so will match as many characters as possible, while .*? will match as few characters as possible. Since regular expressions are matched from left to right, the first greedy match will encompass the first <tt> so the captured pattern will be the text between the last (in this case the second) <tt> </tt> pair. Looking at a simple example, if the string were abcXdefXghi, then .*X would match abcXdefX while .*?X would match abcX.
    pattern = re.compile('.*<tt>(.*?)</tt>.*')
    match = pattern.match(result)
5. Extract and return the matched characters.
    result = match.group(1)  # Returns the (group) match
    return(result)
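The difference between greedy and non-greedy matching described in step 4 can be checked directly. This short sketch uses the abcXdefXghi example from the note, followed by a cut-down, made-up HTML fragment containing two <tt> blocks.

```python
import re

text = "abcXdefXghi"

# Greedy: .* takes as many characters as possible before an X
greedy = re.match('.*X', text).group(0)

# Non-greedy: .*? takes as few characters as possible before an X
lazy = re.match('.*?X', text).group(0)

print(greedy)   # abcXdefX
print(lazy)     # abcX

# Hypothetical cut-down HTML with two <tt> blocks
html = "<tt>SEQUENCE</tt><p><tt>HHEE--</tt>"
match = re.match('.*<tt>(.*?)</tt>.*', html)

# The leading greedy .* swallows the first <tt> block,
# so the group captures the contents of the last one
print(match.group(1))   # HHEE--
```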
You might want to improve the code to deal with a failed match.
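One possible way to deal with a failed match is sketched below. pattern.match() returns None when the pattern is not found, so ParseSS() can test for that before calling group(). Returning an empty string and writing a message to standard error are choices made for this illustration, mirroring the PredictSS() routine. The strip() call, which removes spaces around the prediction, is an extra assumption.

```python
import re
import sys


def ParseSS(result):
    """Extract the text of the last <tt> block, or '' if none is found."""
    result = result.replace('\n', '')
    pattern = re.compile('.*<tt>(.*?)</tt>.*')
    match = pattern.match(result)
    if match is None:
        # No <tt>...</tt> pair in the page, so there is nothing to extract
        sys.stderr.write("No secondary structure found in the results\n")
        return ""
    # strip() removes any spaces left around the prediction
    return match.group(1).strip()


# Made-up inputs: one with two <tt> blocks, one with none
print(ParseSS("<tt>\nKVFG\n</tt>\n<tt>\nEEHH\n</tt>\n"))
print(repr(ParseSS("<html>no prediction here</html>")))
```

The main program already tests for an empty string before printing, so a failed parse is then reported on standard error rather than causing an AttributeError.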
Congratulations! You should now have a fully working screen scraper. Sadly, you won't get any real secondary structure predictions, since the real version of NNPREDICT no longer exists.
Note how the screen parsing depends on the layout of the HTML. If the server returned results in a different format, the screen scraper would stop working.
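As an illustration of this dependence, the sketch below swaps the regular expression for Python's html.parser module, collecting the text of every <tt> element and returning the last one. This is an alternative technique, not the method used in this exercise. It is less sensitive to line breaks and spacing, but it still assumes that the prediction is in the last <tt> block. The names TTCollector and LastTTBlock() are made up for this example.

```python
from html.parser import HTMLParser


class TTCollector(HTMLParser):
    """Collect the text found inside each <tt>...</tt> element."""

    def __init__(self):
        super().__init__()
        self.in_tt = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lower case by HTMLParser
        if tag == 'tt':
            self.in_tt = True
            self.blocks.append('')

    def handle_endtag(self, tag):
        if tag == 'tt':
            self.in_tt = False

    def handle_data(self, data):
        if self.in_tt:
            self.blocks[-1] += data


def LastTTBlock(html):
    """Return the text of the last <tt> block with whitespace removed."""
    parser = TTCollector()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    return "".join(parser.blocks[-1].split())


# Made-up, cut-down version of the results page
html = "<p><b>Sequence</b>:<br>\n<tt>\nKVFG\n</tt>\n<p><tt>\nEEHH\n</tt>\n</body></html>"
print(LastTTBlock(html))
```

If the server moved the prediction into a different element, this version would fail just as the regular expression does.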