BIOC0030 (BIOC3014) - Bioinformatics Workshop

Promoter prediction is based on finding good matches to the classic TATA-box (TATAAAA) motif. Further support may come from other features such as the CCAAT-box, CpG islands and other transcription factor binding sites.

Computationally, this is done using position weight matrices (PWMs), which are then compared against a DNA sequence using a program such as MatInspector (currently not available!). The program slides each PWM along the DNA sequence to find significant matches. PWMs may be obtained from databases such as TRANSFAC or JASPAR.
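To make the idea of sliding a PWM along a sequence concrete, here is a minimal Python sketch. It is not how MatInspector works internally. The count matrix, the pseudocount, the uniform background frequency and the threshold are all invented for illustration and do not come from TRANSFAC or JASPAR.

```python
import math

# Hypothetical counts for a 4-column motif (rows: A, C, G, T).
# Invented for illustration; not taken from any database.
COUNTS = {
    "A": [0, 10, 0, 10],
    "C": [0, 0, 0, 0],
    "G": [0, 0, 0, 0],
    "T": [10, 0, 10, 0],
}

def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm

def scan(sequence, pwm, threshold):
    """Slide the PWM along the sequence.

    Returns (0-based start, score) for every window scoring >= threshold.
    Windows containing characters other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

if __name__ == "__main__":
    pwm = counts_to_pwm(COUNTS)
    print(scan("GGGTATAGGG", pwm, threshold=5.0))
```

Lowering the threshold in this sketch quickly increases the number of reported windows, which is the same effect that produces very many hits when a whole matrix database is searched.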

However, if you run a search program such as MatInspector to find matches against the TRANSFAC database, you will find a huge number of hits.

Instead, we will run two promoter prediction programs and work on the assumption that the predicted promoter which overlaps best between the two predictions is the correct one. In other words, we will do a logical 'AND' between the two predictions (if it is predicted well in software A and in software B, then we will believe it).
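The logical 'AND' can be sketched in Python as follows. This is only an illustration: it assumes each prediction has been written down by hand as an inclusive (start, end) interval, and the function names and example coordinates are hypothetical. Neither program produces output in this form directly.

```python
def overlap(a, b):
    """Number of bases shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def best_agreement(preds_a, preds_b):
    """Return the (a, b, shared_bases) triple with the largest overlap.

    Returns None when no prediction from A overlaps any prediction from B.
    """
    best = None
    for a in preds_a:
        for b in preds_b:
            shared = overlap(a, b)
            if shared > 0 and (best is None or shared > best[2]):
                best = (a, b, shared)
    return best

if __name__ == "__main__":
    # Made-up coordinates standing in for the output of two predictors.
    software_a = [(100, 150), (600, 650)]
    software_b = [(620, 670), (900, 950)]
    print(best_agreement(software_a, software_b))
```

If a program reports only a single position rather than a region, one option is to pad it into an interval, for example (position - 25, position + 25). The padding width is a judgment call.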

Promoter 2.0

First we will use a program called Promoter 2.0 from the Center for Biological Sequence Analysis (CBS) at the Technical University of Denmark (DTU).

Open the Promoter 2.0 page at https://services.healthtech.dtu.dk/service.php?Promoter-2.0
Cut and paste the query sequence into the sequence box and press the Submit button

In a few seconds the results will appear. Look at the set of predicted promoters listed at the top of the page. Note how many sites there are, where they occur and the confidence in the predictions.

Record your results by cutting and pasting them into a text editor or word processor.

LBL Promoter

Next we will use the LBL Promoter predictor from the Drosophila Genome community. The program works with human sequences as well.

Click http://www.fruitfly.org/seq_tools/promoter.html to access the software.
Cut and paste the query sequence into the sequence box
Leave all other settings at their default values
Click Submit to submit the job.

The results should be returned within a few seconds. You should see matches of varying quality to the TATA-box consensus (tATA[A/T]A[A/T], where [A/T] means either A or T) approximately ¼ to ½ of the way along each of the resulting sequence matches.

For example, in the first match, count 16 bases along and you will see tataaat.
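Rather than counting bases by eye, the consensus can be located with a regular expression. The Python sketch below assumes the consensus is read as TATA, then A or T, then A, then A or T. The test sequence is made up so that tataaat starts at base 16, and it is not the actual workshop output.

```python
import re

# TATA-box consensus: TATA, then A or T, then A, then A or T.
TATA_BOX = re.compile(r"TATA[AT]A[AT]", re.IGNORECASE)

def find_tata_boxes(sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in TATA_BOX.finditer(sequence)]

if __name__ == "__main__":
    # Hypothetical sequence: 15 bases of filler, then tataaat, then more filler.
    example = "gc" * 7 + "g" + "tataaat" + "gcgc"
    print(find_tata_boxes(example))  # [(16, 'tataaat')]
```

An exact pattern like this misses the weaker, imperfect matches that the predictor itself can report, so treat it as a quick check rather than a replacement for the program.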

The predicted transcription start site is shown in a larger font.

Record your results by cutting and pasting them into a text editor or word processor.

TSSG Promoter prediction

Next we will use a program called TSSG from a company called Softberry.

Open the TSSG page at http://www.softberry.com/berry.phtml?topic=tssg&group=programs&subgroup=promoter
Cut and paste the query sequence into the sequence box and press the PROCESS button

In a few seconds the results will appear. Look at the set of predicted promoters listed at the top of the page. How many sites are there compared with the previous predictor? In particular, how many TATA boxes are predicted?

NOTE! As of 2018, Softberry are limiting the number of queries that may be run in one day. If you do not get a result back, then you can access the results here.

Note that the detailed results that follow the title "Transcription factor binding sites:" are very terse and difficult to understand! This long list of results shows matches to functional motifs from the Ghosh database. The sequence fragments use the IUPAC ambiguity codes. The authors provide detailed help.
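If the IUPAC ambiguity codes are unfamiliar, the Python sketch below expands a motif written with them into a regular expression. The code table is the standard IUPAC nucleotide alphabet. The function names and the example motif are illustrative and are not taken from the TSSG output.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def iupac_to_regex(motif):
    """Expand an IUPAC motif into a regular-expression string."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # raises KeyError for characters outside the alphabet
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def motif_matches(motif, sequence):
    """Return the 1-based start positions of non-overlapping motif matches."""
    pattern = re.compile(iupac_to_regex(motif), re.IGNORECASE)
    return [m.start() + 1 for m in pattern.finditer(sequence)]

if __name__ == "__main__":
    print(iupac_to_regex("TATAWAW"))
    print(motif_matches("TATAWAW", "ggcctataaatgg"))
```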

Record your results by cutting and pasting them into a text editor or word processor.

The main take-home message is that promoter prediction is far from trivial: different predictors can give quite different results. Comparing Promoter 2.0 (P2) and TSSG, my view is that a prediction made in roughly the same region by more than one predictor is more likely to be correct.
