Secondary databanks

Secondary databanks store information derived from the raw data in primary databanks. Typically they contain characteristic information extracted from families of related protein sequences. They are also often very highly annotated. Examples are:

PROSITE
PRINTS, BLOCKS
Pfam
InterPro
CATH
SCOP

Each secondary databank generally provides a specialized search tool that exploits the characteristic information stored for each protein family. You can scan a protein sequence with such a tool to predict whether the protein matches the characteristics of a particular family.
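To make the idea of a family "characteristic" concrete, the sketch below scans a sequence for a PROSITE-style pattern by translating the pattern into a regular expression. The function names prosite_to_regex and scan are hypothetical, and the example pattern is an assumption written in PROSITE-style syntax for illustration, not copied from a curated PROSITE entry. Other secondary databanks use richer descriptors (fingerprints, profiles, hidden Markov models) that this sketch does not cover.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' anchors the pattern at the N-terminus, '>' at the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")

        # Optional repeat count, e.g. x(4) or x(2,5).
        repeat = ""
        counted = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if counted:
            element = counted.group(1)
            upper = "," + counted.group(3) if counted.group(3) else ""
            repeat = "{" + counted.group(2) + upper + "}"

        if element == "x":                 # any residue
            core = "."
        elif element.startswith("{"):      # {AB} = any residue except A or B
            core = "[^" + element[1:-1] + "]"
        else:                              # single residue 'G' or a set '[RK]'
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(sequence: str, pattern: str):
    """Return (start, end, matched_text) for each non-overlapping match.

    Positions are 1-based and inclusive, as is usual for protein sequences.
    """
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


# Illustrative pattern (an assumption for this sketch, not a curated entry).
example_pattern = "[RK]-x(4)-G-H-x-Q-[QR]-G-G-x(5)-D-R"
# Short fragment of the example sequence used later in this practical.
fragment = "TGRETRATVLGHIQRGGSPVPYDRILAS"

print(prosite_to_regex(example_pattern))
print(scan(fragment, example_pattern))
```

A match only says that the sequence contains the motif; it is a prediction of family membership, not a proof of it.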

Secondary databanks with information on functional sites and domains, such as PROSITE, PRINTS, SMART, Pfam, and ProDom, are vital resources for identifying distant relationships in novel sequences, and hence for predicting protein function and structure.

InterPro is a recent effort to amalgamate the annotations from the different databases and provide a unified search interface to the different sets of characteristics.

Practical work

You will now try a search using InterPro.

Each of the various secondary databanks has its own format and nomenclature, and each has its own strengths and weaknesses. InterPro (Integrated resource of Protein Families, Domains and Sites) can be considered a meta-secondary database: it combines a number of secondary databases, amalgamating their annotations and providing a unified search interface.

It is a result of a collaboration between EBI, SIB, University of Manchester, Sanger Institute, GENE-IT, CNRS/INRA, LION bioscience AG and University of Bergen. It unifies data from PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIR-SuperFamily and the structure-based SUPERFAMILY. It also provides cross-links to BLOCKS and other specialized protein family databases.

InterPro allows a simple text search and an SRS-based search. In addition, it provides a sequence-based search which applies one or more of the search methods of the underlying databases.

Text searches

Visit the InterPro web site: http://www.ebi.ac.uk/interpro/

Type pfk into the Text search box near the top of the page and press the Search button to the right of the box.

You will see a set of hits including one for the major isozyme pfkA, one for eukaryotic Pfk, and one for PfkB.

Click the first accession link for IPR000023

In the resulting page, you will see a description of the Phosphofructokinase domain including links to papers and Gene Ontology (GO) terms. On the right of the page are links to the 'contributing signatures', the secondary databank entries used to construct this combined InterPro entry. These links, which have names like PF00365, take you to the underlying secondary databank entry. Remember that InterPro integrates data and search tools from a number of secondary databanks.

On the left of the page, there is a menu where you can click for more information, for example on Structures where you will find links to entries in the Protein Databank (PDB).

Click the link to Proteins Matched

You will see a list of the UniProt entries (sequences) that contain the Pfk domain sorted by UniProt accession code. You can follow these links to obtain the full information on these entries including the sequence data.

Click the link to Domain architectures (ensuring you are back on the InterPro page first!)

In the resulting page you will see a summary of where this domain is found in combination with other domains. The key at the bottom tells you which colour lozenge represents Pfk and what domain types the other colours represent. You can see that the Pfk domain occurs alone, in chains containing multiple copies, and in combination with a number of other domains.

Think about the importance of global vs. local sequence alignment when looking at domains like this. Ask a demonstrator if you are unclear.
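As a rough illustration of why this matters, the sketch below scores the same pair of sequences with a global (Needleman-Wunsch) and a local (Smith-Waterman) alignment. The function name align_score, the toy sequences and the simple match/mismatch/gap scores are all assumptions made for this example; real tools use substitution matrices and more elaborate gap penalties.

```python
def align_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Return the best alignment score found by dynamic programming.

    local=False: Needleman-Wunsch, both sequences aligned end to end.
    local=True:  Smith-Waterman, best-scoring pair of sub-regions only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # A global alignment must pay for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so never go below zero.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]


# Toy data (made up): a short "domain" embedded in a longer "protein".
domain = "GGDGS"
protein = "MKTAYGGDGSLLVPR"

print("local :", align_score(domain, protein, local=True))   # 10
print("global:", align_score(domain, protein, local=False))  # -10
```

The local score rewards the five matching residues, while the global score is dragged down by the ten unmatched residues that must be paired with gaps. This is why a domain embedded in a longer, multi-domain chain is better found with a local method.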

Sequence searches

Here is our protein sequence used in the previous searches:

>myseq
MIKKIGVLTSGGDAPGMNAAIRGVVRSALTEGLEVMGIYDGYLGLYEDRMVQLDRYSVSD
MINRGGTFLGSARFPEFRDENIRAVAIENLKKRGIDALVVIGGDGSYMGAMRLTEMGFPC
IGLPGTIDNDIKGTDYTIGFFTALSTVVEAIDRLRDTSSSHQRISVVEVMGRYCGDLTLA
AAIAGGCEFVVVPEVEFSREDLVNEIKAGIAKGKKHAIVAITEHMCDVDELAHFIEKETG
RETRATVLGHIQRGGSPVPYDRILASRMGAYAIDLLLAGYGGRCVGIQNEQLVHHDIIDA
IENMKRPFKGDWLDCAKKLY

Return to the InterPro site: http://www.ebi.ac.uk/interpro/

Leave all options at their default values.

Paste the sequence into the sequence box and click the Submit Job button.
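If you prepare sequences for submission in a script rather than by copy and paste, it can help to check the FASTA text first. The sketch below joins the wrapped lines of each record and reports any characters outside the 20 standard amino acid codes. The function names read_fasta and unexpected_residues are hypothetical, and the example uses only the first two lines of myseq to keep it short.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def read_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            records[name] += line.upper()
    return records


def unexpected_residues(sequence):
    """Return a sorted list of characters that are not standard residues."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Truncated copy of the example sequence (first two lines only).
fasta_text = """>myseq
MIKKIGVLTSGGDAPGMNAAIRGVVRSALTEGLEVMGIYDGYLGLYEDRMVQLDRYSVSD
MINRGGTFLGSARFPEFRDENIRAVAIENLKKRGIDALVVIGGDGSYMGAMRLTEMGFPC
"""

for header, sequence in read_fasta(fasta_text).items():
    print(header, len(sequence), unexpected_residues(sequence))
```

An empty list from unexpected_residues means the sequence contains only standard residue codes; ambiguity codes such as X or B would be listed so that you can decide whether they are intended.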

The results should be returned within a couple of minutes. Your sequence has been scanned using the tools provided by each of the secondary databases, and the matched regions are shown graphically.

You can click the links under the Detailed Signature Matches to obtain the detailed annotation for this match. Simply hover over the graphical match bars to see which of the contributing secondary databanks was matched.

By performing the search at http://www.ebi.ac.uk/InterProScan/ you will obtain an alternative view of the match results where the match over each region is shown in a text version rather than graphically. Where available, e-values for the matches are also indicated.
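If you save text results to a file, the e-values make it easy to pick out the convincing matches automatically. The sketch below assumes tab-separated output with the column order used by the standalone InterProScan TSV format (method in column 4, signature accession in column 5, start and stop in columns 7 and 8, score or e-value in column 9); check this against the output you actually receive. The function name significant_matches is hypothetical, and every row in the sample is an invented placeholder, not real output for myseq.

```python
def significant_matches(tsv_text, max_evalue=1e-5):
    """Return (method, signature, start, stop, evalue) for matches at or below max_evalue."""
    hits = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # blank or incomplete line
        method, signature = fields[3], fields[4]
        try:
            evalue = float(fields[8])
        except ValueError:
            continue  # some methods report '-' instead of an e-value
        if evalue <= max_evalue:
            hits.append((method, signature, int(fields[6]), int(fields[7]), evalue))
    return hits


# Invented placeholder rows (NOT real results); names and numbers are made up.
sample_rows = [
    ["toyseq", "-", "320", "MethodA", "SIG_A", "strong hit", "3", "276", "1.0E-90", "T", "-", "-", "-"],
    ["toyseq", "-", "320", "MethodB", "SIG_B", "pattern hit", "243", "261", "-", "T", "-", "-", "-"],
    ["toyseq", "-", "320", "MethodC", "SIG_C", "weak hit", "10", "40", "0.5", "T", "-", "-", "-"],
]
sample = "\n".join("\t".join(row) for row in sample_rows)

for hit in significant_matches(sample):
    print(hit)
```

Matches without an e-value are skipped here rather than rejected as insignificant; pattern-based methods often give a simple yes/no answer, so you should inspect those by eye.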

InterPro is a complex but powerful system that integrates data from various secondary databanks. It provides comprehensive annotations and search tools.

There is a link to a detailed tutorial in the Further Reading section.