Introduction to NIX

What is NIX

NIX is a WWW tool to view the results of running many DNA analysis programs on your DNA sequence.

To use it you must be a registered user of the HGMP Resource Centre (registration is free for academics).

The analysis programs run include:

GRAIL, Fex, Hexon, MZEF, Genemark, Genefinder, FGene, BLAST (against many databases), Polyah, RepeatMasker, tRNAscan.

Philosophy behind NIX

NIX is intended as a tool to aid the identification of interesting regions in Genomic or transcribed nucleic acid sequences.

There are many useful computer tools to do this, but none of them is 100% accurate. It is therefore useful to be able to compare the results of many programs which use different methods and so will fail in different ways.

Viewing the results of many such programs side by side makes it easy to see when many programs have a consensus about a feature.

We decided that rather than give the user total control over how the programs are run by providing them with innumerable poorly-understood choices of arguments for all of the programs, the NIX system should select reasonable defaults for the programs based on whether the sequence was genomic or transcribed, its species of origin and its size.

This makes a vastly simplified interface for running a dozen or so programs on the specified sequence.

Problems with this approach

There is no way for the users to play around with different parameters to programs to adjust them to give results under more or less stringent conditions.

Of course, if you are dissatisfied with this approach, you are welcome to run the analysis programs yourself from their WWW sites or from the options in the HGMP menu, supplying the exact parameters you require.

Many exon-finding programs take a species as a parameter. The list of species they can deal with is usually very small (though sometimes very large). The list of species in the NIX form is a compromise list that covers the taxonomic groups that are most frequently available as parameters to the programs. The mapping from the list on the form to the species supplied to the program is done as carefully as possible, but there are often large mismatches. See the documentation on individual programs in the results display for details on this mapping.

What it does

  1. A directory is created in the user's home directory called NIXResultsDirectory
  2. A subdirectory of this is created named after the sequence file. All results are stored here
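
The directory layout described in the two steps above can be sketched as follows. This is only an illustration: the function name is made up, and it assumes that the subdirectory takes the sequence file's name exactly as given, without any leading path.

```python
from pathlib import PurePosixPath

RESULTS_ROOT = "NIXResultsDirectory"  # directory name given in the text above


def results_dir(home_dir, sequence_file):
    """Return the directory that would hold all results for one sequence file."""
    # Assumption: the subdirectory is named after the file itself,
    # ignoring any directory part of the path the user supplied.
    return PurePosixPath(home_dir) / RESULTS_ROOT / PurePosixPath(sequence_file).name


# Hypothetical user and file name, for illustration only.
print(results_dir("/home/jsmith", "cosmid42.seq"))
```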

For Genomic sequences:

  1. The sequence is masked for repeat regions using the University of Washington's RepeatMasker program.
  2. BLAST searches are started using the masked sequence against the databases: ecoli, est, sts, embl (minus the sts, est and gss sections), trembl, swissprot. The Expect value cutoff is set to 0.1. Up to 1,000,000 alignments can be output. The BLAST results files are compressed using gzip to save disk space.
  3. The following exon-finding programs are run using the masked sequence: Grail, Genefinder, Genemark, Fex, Hexon, Fgene.
  4. The tRNAscan program is run on the sequence.

For transcribed sequences:

  1. The sequence is masked for repeat regions using the University of Washington's RepeatMasker program.
  2. BLAST searches are started using the masked sequence against the databases: ecoli, est, embl (minus the sts, est and gss sections) using TBLASTX, trembl, swissprot. The Expect value cutoff is set to 0.1. Up to 1,000,000 alignments can be output. The BLAST results files are compressed using gzip to save disk space.
  3. The following exon-finding programs are run using the masked sequence: Grail.
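
The two step lists above can be written down as a small table of data, which makes the differences between genomic and transcribed runs easy to see. This is a sketch only: the dictionary layout, key names and function name are illustrative and are not how NIX itself is implemented. The database names, program names and BLAST settings are copied from the lists above.

```python
# Values taken from the step lists above.
BLAST_EXPECT_CUTOFF = 0.1
BLAST_MAX_ALIGNMENTS = 1_000_000

# Illustrative layout; the key names are assumptions, not NIX's own.
ANALYSIS_PLAN = {
    "genomic": {
        "blast_databases": ["ecoli", "est", "sts", "embl", "trembl", "swissprot"],
        "exon_finders": ["Grail", "Genefinder", "Genemark", "Fex", "Hexon", "Fgene"],
        "trnascan": True,
    },
    "transcribed": {
        # The text mentions TBLASTX next to embl for transcribed sequences.
        "blast_databases": ["ecoli", "est", "embl", "trembl", "swissprot"],
        "exon_finders": ["Grail"],
        "trnascan": False,
    },
}


def describe_plan(sequence_type):
    """Return the analysis steps for 'genomic' or 'transcribed' as plain text."""
    plan = ANALYSIS_PLAN[sequence_type]
    steps = ["mask repeats with RepeatMasker"]
    for database in plan["blast_databases"]:
        steps.append(
            f"BLAST the masked sequence against {database} "
            f"(Expect cutoff {BLAST_EXPECT_CUTOFF}, "
            f"up to {BLAST_MAX_ALIGNMENTS} alignments)"
        )
    for program in plan["exon_finders"]:
        steps.append(f"run {program} on the masked sequence")
    if plan["trnascan"]:
        steps.append("run tRNAscan on the sequence")
    return steps


for step in describe_plan("transcribed"):
    print(step)
```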

Best size of sequence

The cut and paste field on the NIX input form can only transfer sequences of up to about 20Kb (this is a limitation of HTML browsers).

Many of the programs seem to lose sensitivity when analysing long sequences.

The programs start to break when given sequences of 150 Kb or longer. Do not go above this level.

Some of the gene model prediction programs try to predict just one gene in the sequence, so a long sequence with potentially many genes in it will result in incorrect models.

Many of the programs expect a sequence that is longer than 100 bases.

So, the best length is probably 20 to 50 Kb or less. If your sequence is longer than this, we advise you to split it up with an overlap of maybe a few Kb.
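
If you need to split a long sequence yourself, something along these lines would do it. This is a minimal sketch, not part of NIX: the function name is made up, and the defaults of 50,000 bases per piece with a 2,000 base overlap are just one reading of "20 to 50 Kb" and "a few Kb".

```python
def split_with_overlap(sequence, chunk_size=50_000, overlap=2_000):
    """Split a sequence into pieces that overlap their neighbours.

    Returns a list of (start_offset, piece) tuples, where start_offset is
    the 0-based position of the piece in the original sequence.
    """
    if chunk_size <= 0 or overlap < 0 or overlap >= chunk_size:
        raise ValueError("need chunk_size > overlap >= 0")

    step = chunk_size - overlap
    pieces = []
    start = 0
    while True:
        pieces.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break  # this piece reached the end of the sequence
        start += step
    return pieces


# A made-up 120 Kb sequence, for illustration only.
example = "ACGT" * 30_000
for offset, piece in split_with_overlap(example):
    print(offset, len(piece))
```

Keeping the start offset with each piece lets you translate feature positions reported for a piece back to positions in the full sequence.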

Displaying the results

A table of sequences that have been analysed can be seen by clicking on the link "View the results of previous NIX analyses here" above the NIX form.

This form allows you to click to see the graphical display of the results. It also allows you to delete the results files for sequences to save disk space. This is important because some of these results take up a lot of disk space.

The results of the analyses are used to make a WWW-browsable image of the regions of the sequence which show features found by the programs. The features can be clicked on to show details of the feature and the original output file produced by the program. Blast results are displayed using a WWW-based blast results viewer.

NIX Results Viewer

What you see

An example of the results of a typical analysis, together with some added user annotation, is displayed below. When viewing your own analysis, you get a description of each blob as your mouse pointer passes over it. You can also select which features to display and change the background colour.

Initially it looks very pretty, but very confusing, doesn't it?

Allow us to talk you through it.

Colours

The colours do not have any specific meaning except that programs with similar purposes have generally been grouped together and given the same colour.

The colours each have three shades to indicate the quality or confidence of the prediction. The stronger (more intense) the shade the better the prediction. The allocation of the quality of the results is made to one of three strengths: 'excellent', 'good' or 'marginal' and is based on a very subjective interpretation of what the scores of the various programs mean.
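
The idea of sorting a program's score into one of the three strengths can be sketched like this. The function name and the threshold values in the example are entirely hypothetical, since the real cut-offs are chosen subjectively for each program and are not given here. The sketch also assumes that a higher score means a better prediction, which is not true of every program (BLAST Expect values, for instance, work the other way round).

```python
def strength_for(score, good_at, excellent_at):
    """Classify a score as 'excellent', 'good' or 'marginal'.

    good_at and excellent_at are per-program thresholds (hypothetical here).
    Assumes higher scores are better.
    """
    if score >= excellent_at:
        return "excellent"
    if score >= good_at:
        return "good"
    return "marginal"


# Made-up thresholds for an imaginary exon finder scoring from 0 to 1.
for score in (0.95, 0.70, 0.20):
    print(score, strength_for(score, good_at=0.5, excellent_at=0.9))
```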

The colour shading may not come out very well on some browsers - sorry.

Sequence line

The sequence line is the central green line with "sequence" printed next to it.

Everything above this line is a feature found on the forward sense.

Everything below it is a feature found on the reverse sense.

Some features are direction independent and so are duplicated on both strands.

The pretty blobs

There are blocks and triangles and lines and arrows and blocks linked by lines and maybe a few other things. We may change what we use to display some features over time.

Basically any one type of feature is displayed by the line or block or triangle etc. that we think looks best.

The up and down-ward pointing triangles are generally used to indicate a feature that is very small, for example a poly-A site.

The segmented blocks linked with lines are an attempt to link up distant regions that share some common property. This is most commonly used with blast matches and gene model predictions. A set of exons may all have blast matches to one database entry and in this case the matches will be linked by a line. Similarly, if an exon-finding program predicts that a set of exons comprise a single gene, they will be linked.

What you get

Clicking on the text

Clicking on the text down the left hand side gives you documentation on that program.

Clicking on the blobs

Clicking on a feature will give details of that feature and will display the raw output of the program that found the feature.

Clicking on the linking lines

Clicking on the lines linking segmented blocks will give information on the start and end positions of the whole set of regions. The description will be the description given to the first-found region of that feature.
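
What is reported for a linked set of regions can be sketched as follows. The class and function names are made up for illustration, and the coordinates and descriptions in the example are invented. The list is assumed to be in the order in which the regions were found.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int
    end: int
    description: str


def summarise_linked(regions):
    """Return (start, end, description) for a whole set of linked regions.

    The span covers every region in the set; the description is taken
    from the first-found region, i.e. the first item in the list.
    """
    if not regions:
        raise ValueError("a linked set needs at least one region")
    start = min(region.start for region in regions)
    end = max(region.end for region in regions)
    return start, end, regions[0].description


# Invented example: three matches linked to one database entry.
linked = [
    Region(5200, 5350, "match to entry X, part 2"),
    Region(1200, 1400, "match to entry X, part 1"),
    Region(8100, 8300, "match to entry X, part 3"),
]
print(summarise_linked(linked))
```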

The Blast viewer

If you click on a feature found by Blast, you will get the chance to either look at the raw blast results, or a graphical view of the blast matches. Further information on this is available on the graphical blast viewer pages.

Clicking on the central sequence line

Clicking on the central Sequence line zooms in on the area of the sequence that you clicked on.

The zooming might be a bit offset from where you actually clicked, so you can correct this by clicking to the left or right of the Sequence line to shift your position left or right along the sequence.

Making sense of it

Basically you should start reading down through the features in the display.

Is there a CpG island?

About half of all mammalian genes have a CG-rich region around their 5' end.

It is said that all mammalian house-keeping genes have a CpG island!

Non-mammalian vertebrates have some CpG islands that are associated with genes, but the association becomes equivocal in more distant taxonomic groups.

Finding a CpG island upstream of predicted exons or genes (see below) looks like good news.
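
If you want to check a candidate region by hand, the two numbers usually quoted for CpG islands are the G+C fraction and the ratio of observed to expected CpG dinucleotides. The sketch below computes both. It is not the method used by the program in NIX, which may differ; the function name is made up. Commonly quoted rules of thumb are a region of at least 200 bases with G+C above 50% and an observed/expected ratio above 0.6.

```python
def cpg_stats(sequence):
    """Return (gc_fraction, observed_over_expected_cpg) for a DNA sequence.

    observed/expected = (number of CG dinucleotides * length)
                        / (number of C * number of G)
    """
    seq = sequence.upper()
    length = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact

    gc_fraction = (c_count + g_count) / length if length else 0.0
    if c_count and g_count:
        obs_exp = cpg_count * length / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, obs_exp


# Tiny made-up sequences, far shorter than a real island, for illustration.
print(cpg_stats("CGCGCGCG"))
print(cpg_stats("ATATATAT"))
```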

Are there Pol II promoter sites?

Prediction of these is not very good, i.e. expect significant false positives and - more importantly - false negatives. However, a predicted promoter site upstream of a predicted gene is pleasing.

Are exons predicted?

No exon predicting program is 100% accurate, far from it! However, if there is a consensus between various programs (which use different detection methods), then you should certainly take the consensus seriously!
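
Looking for a consensus amounts to asking how many different programs predict an exon overlapping the same region. A minimal sketch of that check is given below. The function names are made up, the coordinates are invented, and NIX itself leaves this comparison to your eye rather than computing it.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def supporting_programs(region, predictions):
    """Return the sorted names of programs with an exon overlapping region.

    predictions maps a program name to a list of (start, end) exons.
    """
    return sorted(
        name
        for name, exons in predictions.items()
        if any(overlaps(region, exon) for exon in exons)
    )


# Invented predictions, for illustration only.
predictions = {
    "Grail": [(1200, 1350), (2100, 2290)],
    "Genefinder": [(1190, 1350)],
    "Fex": [(5000, 5100)],
}
print(supporting_programs((1200, 1350), predictions))
print(supporting_programs((5000, 5100), predictions))
```

A region supported by several programs deserves more attention than one predicted by a single program.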

Are Gene models predicted?

Some gene model predicting programs expect only one gene in the sequence. This means they can get things laughably wrong when asked to inspect large chunks of genomic DNA containing multiple independent genes. Remember that you can use matches to the blast results (discussed below) as support for (or against) the putative gene models.

Most, if not all, current exon detection algorithms are notorious for missing very short exons, hence predictions tend to be less reliable for the commonly shorter 5' and 3' exons of intron-containing genes. These delimiting exons also tend to be separated from internal exons by longer than average introns, thus throwing a further spanner in the works.

Are there blast matches to Swissprot, Trembl, EMBL?

Yes? Then you are laughing - you have a good chance of identifying the gene, exon start and end positions as well as the possible function of the gene. Unless of course the blast hit is too good, in which case you've been scooped (sequence already deposited in data bank)!

Beware! There is a lot of junk in the databases that has been mis-described, badly sequenced and poorly annotated.

Swissprot is the best of the sequence databases in this respect (i.e. the most reliable).

It is estimated that 30% of all proteins in the databases have missing exons, or introns that have been erroneously translated as part of the protein entry!

Beware of finding matches with pseudogenes or matches to collagen in the database if your sequence contains proline-rich regions.

There will be gaps in your blast matches because repetitive regions have been masked out and SEG is used to mask out biased-complexity regions.

Are there blast matches to ESTs?

ESTs (partial cDNA sequences) are transcribed sequences that might include some 5' and 3' untranslated regions.

A match to the EST sequences is a good indication that you may have a transcribed region.

EST sequences should not include intronic regions, but this is not a perfect world so don't be surprised if you find matches between predicted introns and ESTs.

EST sequences are commonly produced by pulling out sequences by virtue of their poly-A regions on the assumption that this is the poly-A tail of an mRNA. This procedure could also pull out genomic poly-A regions and surrounding sequences if the sample wasn't digested (properly) with DNase...

The standard of sequencing of EST sequences is not high and it is probable that in the rush to produce hundreds of thousands of sequences many mistakes have been made. There is a fair amount of junk in the EST database.

Are poly A sites predicted?

A poly A site at the end of an exon-rich region is good supportive evidence that there is a gene and that it may end at the predicted poly A region. Again, expect false positives and false negatives!

Are Frame Shift errors predicted?

Then don't worry too much - there is a high false positive rate for this prediction, but you might like to just go back and check on your contigs if protein blast hits have a sudden end or discontinuity at the predicted frame shift error position.
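
The check suggested above, whether any protein blast hit starts or stops close to the predicted frame shift position, can be sketched as follows. The function name, the 30 base window and the example coordinates are all assumptions made for illustration.

```python
def hits_breaking_near(position, hits, window=30):
    """Return the blast hits that start or end close to a given position.

    hits is a list of (start, end) positions on your sequence.
    window is how many bases either side of position still count as
    'close' (an arbitrary choice here).
    """
    return [
        (start, end)
        for start, end in hits
        if abs(start - position) <= window or abs(end - position) <= window
    ]


# Invented hits and an invented predicted frame shift at position 500.
hits = [(100, 480), (510, 900), (1200, 1500)]
print(hits_breaking_near(500, hits))
```

One hit ending just before the position and another starting just after it, as in this invented example, is the kind of discontinuity that makes it worth going back to check the contigs.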

Are there BLAST matches to STS?

Were you expecting to find a mapping marker in or close to this sequence? Yes? Then you can pin down your sequence in the genome physical map data!

Are there blast matches to ecoli?

Then don't despair! Unless your blast hit is 100% (congratulations! you've sequenced your vector's host), there are many house-keeping genes that are well enough conserved in both bacteria and vertebrates to be picked up by blast. Check in the blast search to EMBL to see if this region has hits to such a house-keeping gene.

Are there repeats or repetitive elements?

Repetitive elements such as Alu's are littered around the genome. It's good to know they are there if you intend to do things like targeted gene inactivation by homologous double recombination, since too many such elements will dramatically reduce your proportion of targeted versus random integration events!

Repeats, such as direct tandem repeats, are also frequent in genomic DNA. Because some have putative regulatory functions, they can be found upstream of certain genes.

It is always worth noting in the annotations, when you submit your sequence to EMBL, that you have found repeats or repetitive regions. It shows that you are on the ball.

Are tRNAs predicted?

Ditto.

What to do next

The prediction by these programs is just that: a prediction.

NEVER TRUST A COMPUTER!

Before publishing your results, you should do some bench work to confirm the computer results.

For example, you might consider doing an RT-PCR to confirm that predicted exons are expressed.

You should then, and only then, submit your sequence to EMBL.

Further ideas

We haven't thought of everything.

If you have ideas for things this system could do, please let us know by clicking on the Support mail address below.

Problems

The heavy load of searches on our machine may cause us to scale down the number of searches we include in NIX. If you find NIX useful, then let us know! We may get funding to continue this service!


Support@hgmp.mrc.ac.uk
This help page was created by Gary Williams. Last modified: 18 Nov 1999