To use it you must be a registered user of the HGMP Resource Centre (registration is free for academics).
The analysis programs run include:
GRAIL, Fex, Hexon, MZEF, Genemark, Genefinder, FGene, BLAST (against many databases), Polyah, RepeatMasker, tRNAscan.
There are many useful computer tools to do this, but none of them is 100% accurate. It is therefore useful to be able to compare the results of many programs that use different methods and so fail in different ways.
Viewing the results of many such programs side by side makes it easy to see when many programs have a consensus about a feature.
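The idea of looking for consensus between programs can also be sketched in code. The example below is only an illustration and is not how NIX itself works: the program names, the coordinates and the function `consensus_regions` are all hypothetical. It counts, for every base, how many programs predicted a feature there and reports the stretches supported by at least a chosen number of programs.

```python
from collections import Counter

def consensus_regions(predictions, min_programs=2):
    """Return (start, end) regions supported by at least min_programs programs.

    predictions maps a program name to a list of (start, end) intervals,
    1-based and inclusive. All names and coordinates are hypothetical.
    """
    support = Counter()
    for program, intervals in predictions.items():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end + 1))
        # A program counts at most once per base, even if its own
        # predictions overlap each other.
        support.update(covered)

    positions = sorted(p for p, n in support.items() if n >= min_programs)

    # Merge runs of adjacent supported positions into regions.
    regions = []
    for pos in positions:
        if regions and pos == regions[-1][1] + 1:
            regions[-1][1] = pos
        else:
            regions.append([pos, pos])
    return [tuple(r) for r in regions]

# Made-up predictions from three imaginary programs.
example = {
    "program_a": [(100, 200), (400, 450)],
    "program_b": [(150, 220)],
    "program_c": [(180, 210), (430, 500)],
}
print(consensus_regions(example))  # [(150, 210), (430, 450)]
```

Raising `min_programs` gives a stricter consensus; with the made-up data above, a value of 3 leaves only the region 180 to 200.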
We decided that rather than give the user total control over how the programs are run by providing them with innumerable poorly-understood choices of arguments for all of the programs, the NIX system should select reasonable defaults for the programs based on whether the sequence was genomic or transcribed, its species of origin and its size.
This makes a vastly simplified interface for running a dozen or so programs on the specified sequence.
Of course, if you are dissatisfied with this approach, you are welcome to run the analysis programs yourself from their WWW sites or from the options in the HGMP menu, supplying the exact parameters you require.
Many exon-finding programs take a species as a parameter. The list of species they can deal with is usually very small (though sometimes very large). The list of species in the NIX form is a compromise that covers the taxonomic groups most frequently available as parameters to the programs. The mapping from the list on the form to the species supplied to each program is done as carefully as possible, but there are often large mismatches. See the documentation on individual programs in the results display for details on this mapping.
For Genomic sequences:
For transcribed sequences:
Many of the programs seem to lose sensitivity when analysing long sequences.
The programs start to break when given sequences of 150 Kb or longer. Do not go above this level.
Some of the gene model prediction programs try to predict just one gene in the sequence, so a long sequence with potentially many genes in it will result in incorrect models.
Many of the programs expect a sequence that is longer than 100 bases.
So, the best length is probably 20 to 50 Kb or less. If your sequence is longer than this, we advise you to split it up with an overlap of maybe a few Kb.
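If you want to split a long sequence before submitting it, something along the following lines would do. This is only a sketch: the function name `split_with_overlap` and the default values (50 Kb pieces with a 2 Kb overlap) are our assumptions, chosen to fit the advice above, and you should adjust them to suit your sequence.

```python
def split_with_overlap(sequence, chunk_size=50_000, overlap=2_000):
    """Yield (offset, piece) pairs covering the whole sequence.

    offset is the 0-based start of the piece in the original sequence,
    so that features found in a piece can be mapped back afterwards.
    Consecutive pieces share `overlap` bases.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        yield start, sequence[start:start + chunk_size]
        if start + chunk_size >= len(sequence):
            break  # this piece reached the end of the sequence
        start += step

# A 120 Kb dummy sequence is cut into three overlapping pieces.
for offset, piece in split_with_overlap("A" * 120_000):
    print(offset, len(piece))
```

Keep a note of each offset: a feature reported at position p (1-based) in a piece lies at position offset + p in the original sequence.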
This form allows you to click to see the graphical display of the results. It also allows you to delete the results files for sequences to save disk space. This is important because some of these results take up a lot of disk space.
The results of the analyses are used to make a WWW-browsable image of the regions of the sequence which show features found by the programs. The features can be clicked on to show details of the feature and the original output file produced by the program. BLAST results are displayed using a WWW-based BLAST results viewer.
Initially it looks very pretty, but very confusing, doesn't it?
Allow us to talk you through it.
Each colour has three shades to indicate the quality or confidence of the prediction. The stronger (more intense) the shade, the better the prediction. Each result is assigned one of three strengths, 'excellent', 'good' or 'marginal', based on a very subjective interpretation of what the scores of the various programs mean.
The colour shading may not come out very well on some browsers - sorry.
Everything above this line is a feature found on the forward sense.
Everything below it is a feature found on the reverse sense.
Some features are direction independent and so are duplicated on both strands.
Basically any one type of feature is displayed by the line or block or triangle etc. that we think looks best.
The upward- and downward-pointing triangles are generally used to indicate a feature that is very small, for example a poly-A site.
The segmented blocks linked with lines are an attempt to link up distant regions that share some common property. This is most commonly used with BLAST matches and gene model predictions. A set of exons may all have BLAST matches to one database entry, and in this case the matches will be linked by a line. Similarly, if an exon-finding program predicts that a set of exons comprise a single gene, they will be linked.
The zooming might be a bit offset from where you actually clicked, so you can correct this by clicking to the left or right of the Sequence line to shift your position left or right along the sequence.
About half of all mammalian genes have a CG-rich region around their 5' end.
It is said that all mammalian house-keeping genes have a CpG island!
Non-mammalian vertebrates have some CpG islands that are associated with genes, but the association becomes equivocal in more distant taxonomic groups.
Finding a CpG island upstream of predicted exons or genes (see below) looks like good news.
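To give a feel for what "CpG island" means in numbers, here is a small sketch. It uses the widely quoted rule of thumb (at least 200 bases long, more than 50% G+C, and an observed/expected CpG ratio above 0.6). These thresholds are our assumption for the sake of illustration and are not necessarily the ones used by the programs that NIX runs; the function names are hypothetical too.

```python
def cpg_stats(window):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence window."""
    seq = window.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # number of CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: c * g / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(window, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the rule-of-thumb thresholds (assumed, see the text above)."""
    gc_fraction, obs_exp = cpg_stats(window)
    return len(window) >= min_length and gc_fraction > min_gc and obs_exp > min_obs_exp

print(looks_like_cpg_island("CG" * 100))  # artificial CpG-rich window
print(looks_like_cpg_island("AT" * 100))  # artificial window with no C or G
```

In practice you would slide such a window along the sequence; a single window is shown here to keep the example short.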
Most, if not all, current exon detection algorithms are notorious for missing very short exons, hence predictions tend to be less reliable for the commonly shorter 5' and 3' exons of intron-containing genes. These delimiting exons also tend to be separated from internal exons by longer than average introns, thus throwing a further spanner in the works.
Beware! There is a lot of junk in the databases that has been mis-described, badly sequenced and poorly annotated.
Swissprot is the best of the sequence databases in this respect (i.e. the most reliable).
It is estimated that 30% of all proteins in the databases have missing exons, or introns that have been erroneously translated as part of the protein entry!
Beware of finding matches with pseudogenes or matches to collagen in the database if your sequence contains proline-rich regions.
There will be gaps in your BLAST matches because repetitive regions have been masked out and SEG is used to mask out biased-complexity regions.
A match to the EST sequences is a good indication that you may have a transcribed region.
EST sequences should not include intronic regions, but this is not a perfect world so don't be surprised if you find matches between predicted introns and ESTs.
EST sequences are commonly produced by pulling out sequences by virtue of their poly-A regions, on the assumption that each is the poly-A tail of an mRNA. This procedure could also pull out genomic poly-A regions and surrounding sequences if the sample wasn't digested (properly) with DNase...
The standard of sequencing of EST sequences is not high and it is probable that in the rush to produce hundreds of thousands of sequences many mistakes have been made. There is a fair amount of junk in the EST database.
Repeats, such as direct tandem repeats, are also frequent in genomic DNA. Because some have putative regulatory functions, they can be found upstream of certain genes.
When you submit your sequence to EMBL, always note in the annotations that you have found repeat/repetitive regions. It shows that you are on the ball.
NEVER TRUST A COMPUTER!
Before publishing your results, you should do some bench work to confirm the computer results.
For example, you might consider doing an RT-PCR to confirm that predicted exons are expressed.
You should then, and only then, submit your sequence to EMBL.
If you have ideas for things this system could do, please let us know by clicking on the Support mail address below.