Your task now is to modify the code you wrote for screen scraping such that it accesses the URL:
http://www.bioinf.org.uk/servers/pdbsws/query.cgi?qtype=pdb&id=1bwi&res=35&plain=1
Again, start with dummy code:
#!/usr/bin/env python3
from urllib import request
import sys
import re

def ReadPDBSWS(pdbcode, resnum):
    ac = 'P12345'
    upresnum = 666
    return(ac, upresnum)

""" Main program """
pdbcode = '1bwi'
resnum = 35
(ac, upresnum) = ReadPDBSWS(pdbcode, resnum)
print("Accession: " + ac)
print("UniProt Resnum: %d" % upresnum)
Now modify the ReadPDBSWS() function so that it creates the URL:
url = 'http://www.bioinf.org.uk/servers/pdbsws/query.cgi?plain=1&qtype=pdb'
url += '&id=' + pdbcode
url += '&res=' + str(resnum)
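As an alternative to joining the query string by hand, the standard library's urllib.parse.urlencode() can assemble it and will also escape any awkward characters in the values. The sketch below is only an illustration: the helper name build_pdbsws_url and the constant BASE_URL are made up for this example and are not part of the exercise.

```python
from urllib.parse import urlencode

# Base address of the PDBSWS query script (taken from the URL above).
BASE_URL = 'http://www.bioinf.org.uk/servers/pdbsws/query.cgi'

def build_pdbsws_url(pdbcode, resnum):
    """Hypothetical helper: build the query URL for one PDB residue."""
    # Dictionaries keep insertion order, so the parameters appear
    # in the same order as in the hand-built URL.
    params = {'plain': 1, 'qtype': 'pdb', 'id': pdbcode, 'res': resnum}
    return BASE_URL + '?' + urlencode(params)

print(build_pdbsws_url('1bwi', 35))
# http://www.bioinf.org.uk/servers/pdbsws/query.cgi?plain=1&qtype=pdb&id=1bwi&res=35
```

Either approach gives the same URL for simple values such as '1bwi' and 35; urlencode() mainly helps when a value might contain spaces or other reserved characters.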
Read the URL and decode the resulting information.
result = request.urlopen(url).read()
result = str(result, encoding='utf-8')
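Replace all the newline characters with a # sign to make pattern matching easier (by default, the . in a regular expression does not match a newline, so a pattern cannot otherwise span several lines).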
result = result.replace('\n', '#')
Match a pattern consisting of AC: followed by one or more whitespace characters, then capture the shortest possible run of characters before the next # sign.
pattern = re.compile(r'.*AC:\s+(.*?)#')
match = pattern.match(result)
ac = match.group(1)
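To see why the newline replacement matters, the pattern can be tried on a small piece of text without contacting the server. The sample string below is made up for illustration (it reuses the dummy values from the starting code) and is only assumed to resemble the real reply; extract_field is a hypothetical helper, not part of the exercise.

```python
import re

# Made-up reply text for illustration only; the real server output
# may contain other fields and different values.
sample = 'PDB: 1bwi\nAC: P12345\nUPCOUNT: 666\n'

pattern = re.compile(r'.*AC:\s+(.*?)#')

# With the newlines left in, '.' cannot cross a line break and there
# is no '#' to end the match, so match() finds nothing.
print(pattern.match(sample))        # None

# After the replacement the whole reply is one line joined by '#'.
flattened = sample.replace('\n', '#')
print(pattern.match(flattened).group(1))    # P12345

def extract_field(flat_text, label):
    """Hypothetical helper: value after 'label:' or None if absent."""
    m = re.match(r'.*' + re.escape(label) + r':\s+(.*?)#', flat_text)
    return m.group(1) if m else None

print(extract_field(flattened, 'UPCOUNT'))  # 666
```

Returning None when a field is missing avoids the AttributeError that match.group(1) would raise if the server reply did not contain the expected label.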
Repeat, but look for UPCOUNT: instead of AC: and return the results.
pattern = re.compile(r'.*UPCOUNT:\s+(.*?)#')
match = pattern.match(result)
upresnum = int(match.group(1))
return(ac, upresnum)
You should now have a working program that obtains the information from PDBSWS.
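Putting the steps above together, one possible arrangement is sketched below. It is not the model answer for the exercise: the function names parse_pdbsws and read_pdbsws, the timeout value, and the choice to return None on failure are all assumptions made for this example. The UPCOUNT pattern also uses \d+ rather than .*? so that only digits are passed to int(). Separating the parsing from the download means the parsing can be checked with made-up text, as the last line does, without touching the network.

```python
#!/usr/bin/env python3
import re
import sys
from urllib import request

def parse_pdbsws(text):
    """Return (ac, upresnum) from plain PDBSWS text, or None if a field is missing."""
    flat = text.replace('\n', '#')
    ac = re.match(r'.*AC:\s+(.*?)#', flat)
    up = re.match(r'.*UPCOUNT:\s+(\d+)', flat)
    if ac is None or up is None:
        return None
    return (ac.group(1), int(up.group(1)))

def read_pdbsws(pdbcode, resnum, timeout=10):
    """Query PDBSWS for one residue; returns None if the request fails."""
    url = 'http://www.bioinf.org.uk/servers/pdbsws/query.cgi?plain=1&qtype=pdb'
    url += '&id=' + pdbcode
    url += '&res=' + str(resnum)
    try:
        with request.urlopen(url, timeout=timeout) as response:
            text = response.read().decode('utf-8')
    except OSError as err:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print('Could not read from PDBSWS: %s' % err, file=sys.stderr)
        return None
    return parse_pdbsws(text)

# Offline check with made-up text (dummy values, not real server output).
print(parse_pdbsws('AC: P12345\nUPCOUNT: 666\n'))   # ('P12345', 666)
```

Calling read_pdbsws('1bwi', 35) would query the live server; the caller should check for None before unpacking the result.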
You might now want to modify the program: