Bioinformatics Tools
Expert BioSystems (EBS) scientists and engineers have detailed knowledge of bioinformatics tools. The EBS training department provides hands-on training for most bioinformatics tools.
The following is a partial list of bioinformatics tools, in alphabetical order:
This tool allows
users to explore the characteristics of amino acids by
comparing their structural and chemical properties,
predicting protein sequence changes caused by
mutations, viewing common substitutions, and browsing
the functions of given residues in conserved domains.
Links the raw
sequence information found in the Trace Archive with
assembly information found in publicly available
sequence repositories (GenBank/EMBL/DDBJ). The Assembly
Viewer allows a user to see the multiple sequence
alignments as well as the actual sequence chromatogram.
BLAST (Basic Local Alignment Search
Tool)
Finds regions of
local similarity between biological sequences. The
program compares nucleotide or protein sequences to
sequence databases and calculates the statistical
significance of matches. BLAST can be used to infer
functional and evolutionary relationships between
sequences as well as to help identify members of gene
families.
A link option on
protein records that displays the results of a
pre-computed BLAST search of that protein against all
other protein sequences at NCBI.
Performs a BLAST
search for similar sequences from selected complete
eukaryotic and prokaryotic genomes.
This page links to
a number of BLAST-related tutorials and guides,
including a selection guide for BLAST algorithms,
descriptions of BLAST output formats, explanations of
the parameters for stand-alone BLAST, directions for
setting up stand-alone BLAST on local machines and
using the BLAST URL API.
Allows you to
retrieve records from many Entrez databases by
uploading a file of GI or accession numbers from the
Nucleotide or Protein databases, or a file of unique
identifiers from other Entrez databases. Search results
can be saved in various formats directly to a local
file on your computer.
Tools that
summarize the biological test results in the PubChem
database and provide alternative ways to view bioassay
results and structure-activity relationships. Users
also can download their analyses and data tables.
Bowtie is an ultrafast, memory-efficient
short read aligner. It aligns short DNA sequences
(reads) to the human genome at a rate of over 25
million 35-bp reads per hour. Bowtie indexes the genome
with a Burrows-Wheeler index to keep its memory
footprint small: typically about 2.2 GB for the human
genome (2.9 GB for paired-end).
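To illustrate the transform behind a Burrows-Wheeler index, here is a minimal sketch of the Burrows-Wheeler transform and its inverse. It uses a naive sort of all rotations, which is far too slow and memory-hungry for a genome; it is not Bowtie's implementation, and the function names and the `$` sentinel are assumptions made for this example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler transform of text (naive rotation sort).

    The sentinel must sort before every character in text; "$" does so
    for upper-case DNA letters.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

For example, `bwt("ACAACG")` groups identical letters into runs, and `inverse_bwt` applied to that result returns the original string.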
BWA
- Burrows-Wheeler Aligner
BWA is a fast, lightweight tool
that aligns relatively short sequences (queries) to a
sequence database (target), such as the human reference
genome. It implements two different algorithms, both
based on Burrows-Wheeler Transform (BWT). The first
algorithm is designed for short queries up to ~200bp
with low error rate (<3%). It does gapped global
alignment w.r.t. queries, supports paired-end reads,
and is one of the fastest short read alignment
algorithms to date while also visiting suboptimal hits.
The second algorithm, BWA-SW, is designed for long
reads with more errors. It performs heuristic
Smith-Waterman-like alignment to find high-scoring
local hits (and thus chimera). On low-error short
queries, BWA-SW is slower and less accurate than the
first algorithm, but on long queries, it is better.
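The way a BWT-based index locates a query can be sketched with exact-match backward search. This is a simplified illustration only: it builds the index naively, finds exact matches, and does not handle the mismatches, gaps, or paired-end reads that BWA supports. The names `build_fm_index` and `find_exact` are hypothetical.

```python
from collections import Counter


def build_fm_index(text, sentinel="$"):
    """Build a small FM-index (suffix array, first-column offsets, occurrence table)."""
    s = text + sentinel
    # Naive suffix array: start positions of suffixes in sorted order.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character that precedes each sorted suffix.
    last = "".join(s[i - 1] for i in order)
    counts = Counter(s)
    # first_occ[c] = number of characters in s that sort before c.
    first_occ, total = {}, 0
    for c in sorted(counts):
        first_occ[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in last[:i].
    occ = {c: [0] * (len(s) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return order, first_occ, occ


def find_exact(pattern, index):
    """Return sorted 0-based start positions of exact matches of pattern."""
    order, first_occ, occ = index
    lo, hi = 0, len(order)
    # Process the pattern from its last character to its first.
    for ch in reversed(pattern):
        if ch not in first_occ:
            return []
        lo = first_occ[ch] + occ[ch][lo]
        hi = first_occ[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(order[lo:hi])
```

Each step of the loop narrows a range of suffix-array rows, so the search cost depends on the query length rather than on the size of the indexed text.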
A stand-alone
application for classifying protein sequences and
investigating their evolutionary relationships. CDTree
can import, analyze and update existing Conserved
Domain (CDD) records and hierarchies, and also allows
users to create their own. CDTree is tightly integrated
with Entrez CDD and Cn3D, and allows users to create
and update protein domain alignments.
COBALT is a
protein multiple sequence alignment tool that finds a
collection of pairwise constraints derived from
conserved domain database, protein motif database, and
sequence similarity, using RPS-BLAST, BLASTP, and
PHI-BLAST.
A stand-alone
application for viewing 3-dimensional structures from
NCBI's Entrez retrieval service. Cn3D runs on Windows,
Macintosh, and UNIX and can be configured to receive
data from most popular web browsers. Cn3D
simultaneously displays structure, sequence, and
alignment, and has powerful annotation and alignment
editing features.
Part of the NCBI
Bookshelf, Coffee Break combines reports on recent
biomedical discoveries with use of NCBI tools. Each
report incorporates interactive tutorials that show how
NCBI bioinformatics tools are used as a part of the
research process.
Concise Microbial Protein BLAST
A specialized
BLAST service in which the queried database consists of
all proteins from complete microbial (prokaryotic)
genomes. NCBI has precalculated clusters of similar
proteins at the genus level, and one representative is
chosen from each cluster in order to reduce the
dataset, thereby reducing search time and providing a
broader taxonomic view.
Conserved Domain Architecture Retrieval
Tool (CDART)
Displays the
functional domains that make up a given protein
sequence. It lists proteins with similar domain
architectures and can retrieve proteins that contain
particular combinations of domains.
Conserved Domain Search Service (CD
Search)
Identifies the
conserved domains present in a protein sequence.
CD-Search uses RPS-BLAST (Reverse Position-Specific
BLAST) to compare a query sequence against
position-specific score matrices that have been
prepared from conserved domain alignments present in
the Conserved Domain Database (CDD).
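As a rough illustration of what a position-specific score matrix is, the sketch below builds log-odds scores from a tiny ungapped alignment and slides the matrix along a query. It assumes a uniform background and a fixed pseudocount, which is not how CDD matrices are actually prepared, and it is not RPS-BLAST; the function names are hypothetical.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return pssm


def best_window(sequence, pssm):
    """Return (score, offset) of the best ungapped match of the PSSM in sequence.

    The sequence must be at least as long as the PSSM.
    """
    width = len(pssm)
    scored = [
        (sum(pssm[k][sequence[i + k]] for k in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)
```

With the alignment `["ACG", "ACG", "ACT"]` over the alphabet `"ACGT"`, the best window in `"TTACGTT"` starts at offset 2, where the conserved pattern occurs.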
Digital Differential Display (DDD)
A tool for
comparing EST profiles in order to identify genes with
significantly different expression levels.
This interactive
tool allows users to build E-utility URLs, either from
a form or by hand, and then view their raw output. The
tool provides a simple environment for testing
E-utility URLs before including them in applications.
Tools that provide
access to data within NCBI's Entrez system outside of
the regular web query interface. They provide a method
of automating Entrez tasks within software
applications. Each utility performs a specialized
retrieval task, and can be used simply by writing a
specially formatted URL.
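A minimal sketch of building such a URL is shown below. It assumes the E-utilities base address `https://eutils.ncbi.nlm.nih.gov/entrez/eutils` and the `esearch.fcgi` utility with its `db`, `term`, and `retmax` parameters; none of these appear in the text above, so check them against the current NCBI documentation before relying on them. The code only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed base address of the E-utilities; verify against NCBI documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for the given Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and brackets so the query is URL-safe.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"
```

For example, `build_esearch_url("pubmed", "insulin")` yields a URL that can be pasted into a browser or fetched from a script.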
A tool that allows
users to construct an E-utility analysis pipeline using
an online form, and then generates a Perl script to
execute the pipeline.
A computational
procedure that is used to identify sequence tagged
sites (STSs) within DNA sequences. e-PCR looks for
potential STSs in DNA sequences by searching for
subsequences that closely match the PCR primers and
have the correct order, orientation, and spacing that
could represent the PCR primers used to generate known
STSs.
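The order, orientation, and spacing checks can be sketched as follows. This is a simplified illustration that only accepts exact primer matches and only searches the given strand, whereas e-PCR also accepts close matches; the function names and the `margin` parameter are assumptions made for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(seq, sub):
    """Return every start position of sub in seq, including overlapping ones."""
    positions = []
    start = seq.find(sub)
    while start != -1:
        positions.append(start)
        start = seq.find(sub, start + 1)
    return positions


def epcr(template, forward, reverse, expected_size, margin=50):
    """Return (start, end, size) for each candidate amplicon on the template.

    The forward primer must match the template as given, and the reverse
    primer must match as its reverse complement downstream of the forward
    primer, with a product size within margin of expected_size.
    """
    reverse_site = reverse_complement(reverse)
    hits = []
    for f in find_all(template, forward):
        for r in find_all(template, reverse_site):
            end = r + len(reverse_site)
            size = end - f
            if r >= f + len(forward) and abs(size - expected_size) <= margin:
                hits.append((f, end, size))
    return hits
```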
Frequency-weighted Link (FLink)
FLink is a tool
that enables you to link from a group of records in a
source database to a ranked list of associated records
in a destination database based on frequency-weighted
statistics.
Gene Expression Omnibus (GEO) BLAST
Tool for aligning
a query sequence (nucleotide or protein) to GenBank
sequences included on microarray or SAGE platforms in
the GEO database.
A tool for
pairwise comparison of two prokaryotic genomes that
displays pairs of protein homologs that are symmetrical
best hits between the two genomes.
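The idea of symmetrical best hits can be sketched as below, assuming similarity scores between proteins of the two genomes have already been computed in both directions. The dictionary layout and function names are hypothetical and are not taken from the tool itself.

```python
def best_hits(scores):
    """Map each query protein to its single highest-scoring subject.

    scores maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def symmetrical_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```

A protein whose best hit prefers a different partner in the reverse direction is dropped, which is what makes the remaining pairs symmetrical.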
Displays the
genetic codes for organisms in the Taxonomy database in
tables and on a taxonomic tree.
Genome ProtMap
maps each protein from a COG, or in the case of viruses
a VOG, back to its genome, and displays all the genomic
segments coding for members of this particular group of
related proteins. The view can be shifted to focus on
an adjacent COG/VOG, and clusters can be searched by
name, protein gi, or gene locus tag.
NCBI's Remap tool
allows users to project annotation data from one
assembly to another through a base-by-base analysis.
Options are provided to adjust the stringency of
remapping, and summary results are displayed on the web
page. Full results can be downloaded for viewing in
NCBI's Genome Workbench graphical viewer, and
annotation data for the remapped features, as well as
summary data, is also available for download.
An integrated
application for viewing and analyzing sequence data.
With Genome Workbench, you can view data in publicly
available sequence databases at NCBI, and mix these
data with your own data.
Genomic Basic Local Alignment Search
Tool (BLAST)
This tool compares
nucleotide or protein sequences to genomic sequence
databases and calculates the statistical significance
of matches.
IGV
- Integrative Genomics Viewer
The Integrative Genomics Viewer
(IGV) is a high-performance visualization tool for
interactive exploration of large, integrated datasets.
It supports a wide variety of data types including
sequence alignments, microarrays, and genomic
annotations.
A service that
allows third parties to link directly from PubMed and
other Entrez database records to relevant
web-accessible resources beyond the Entrez system.
Examples of LinkOut resources include full-text
publications, biological databases, consumer health
information and research tools.
A software
component of the Genome database that provides special
browsing capabilities for a subset of organisms. You
can view and search an organism's complete genome,
display chromosome maps, and zoom into progressively
greater levels of detail, down to the sequence data for
a region of interest.
The MEME Suite is a set of
motif-based sequence analysis tools. It
allows you to:
- discover motifs using MEME or GLAM2 on groups of
related DNA or protein sequences,
- search sequence databases using motifs,
- compare a motif to all motifs in a database of
motifs, and
- associate motifs with Gene Ontology terms via
their putative target genes.
NCBI's monthly
newsletter that provides information on new and updated
databases, and software services. The News often has
feature articles that highlight and demonstrate
services, features, tools, and interesting data with
practical examples of their use.
A set of software
and data exchange specifications used by NCBI to
produce portable, modular software for molecular
biology. The software in the Toolbox is primarily
designed to read records in Abstract Syntax Notation 1
(ASN.1) format, an International Organization for
Standardization (ISO) data representation format.
A public domain
quality assurance software package that facilitates the
assessment of multiplex short tandem repeat (STR) DNA
profiles based on laboratory-specific protocols. OSIRIS
evaluates the raw electrophoresis data using an
independently derived mathematically-based sizing
algorithm. It offers two new peak quality measures -
fit level and sizing residual. It can be customized to
accommodate laboratory-specific signatures such as
background noise settings, customized naming
conventions and additional internal laboratory
controls.
Open Mass Spectrometry Search Algorithm
(OMSSA) Search
An efficient
search engine for identifying MS/MS peptide spectra by
searching libraries of known protein sequences. OMSSA
scores significant hits with a probability score
developed using classical hypothesis testing, the same
statistical method used in BLAST.
Open Reading Frame Finder (ORF Finder)
A graphical
analysis tool that finds all open reading frames in a
user's sequence or in a sequence already in the
database. Sixteen different genetic codes can be used.
The deduced amino acid sequence can be saved in various
formats and searched against protein databases using
BLAST.
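A minimal sketch of six-frame open reading frame detection follows. It assumes the standard genetic code only (one of the several codes such a tool can offer), input restricted to the letters A, C, G, and T, and ORFs defined as running from the first ATG to a stop codon; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, protein) for ATG-to-stop ORFs in all six frames."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = translate(s[frame:])
            # Every piece except the last one was terminated by a stop codon.
            for piece in protein.split("*")[:-1]:
                start = piece.find("M")
                if start != -1 and len(piece) - start >= min_codons:
                    results.append((strand, frame, piece[start:]))
    return results
```

The returned protein strings could then be written out in a chosen format or used as queries for a protein database search.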
Allows users to
display, sort, subset and download position-specific
score matrices (PSSMs) either from CDD records or from
Position Specific Iterated (PSI)-BLAST protein
searches. The tool also can align a query protein to
the PSSM and highlight positions of high conservation.
The Primer-BLAST
tool uses Primer3 to design PCR primers for a sequence
template. The potential products are then automatically
analyzed with a BLAST search against user-specified
databases to check their specificity for the intended
target.
A utility for
computing alignment of proteins to genomic nucleotide
sequence. It is based on a variation of the
Needleman-Wunsch global alignment algorithm and specifically
accounts for introns and splice signals. Due to this
algorithm, ProSplign is accurate in determining splice
sites and tolerant to sequencing errors.
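For orientation, here is a minimal sketch of the plain Needleman-Wunsch global alignment that such tools build on. It uses a simple linear gap penalty and has no notion of introns or splice signals, so it is not ProSplign's variation of the algorithm; the scoring values are arbitrary assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because the alignment is global, every character of both sequences appears in the output, padded with `-` where a gap was introduced.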
PubChem Power User Gateway (PUG)
PUG provides
access to PubChem services via a programmatic
interface. PUG allows users to download data, initiate
chemical structure searches, standardize chemical
structures and interact with the E-utilities. PUG can
be accessed using either standard URLs or via SOAP.
PubChem Standardization Service
Standardization,
in PubChem terminology, is the processing of chemical
structures in the same way used to create PubChem
Compound records from contributors' original
structures. This service lets users see how PubChem
would handle any structure they would like to submit.
PubChem Structure
Search allows the PubChem Compound Database to be
queried by chemical structure or chemical structure
pattern. The PubChem Sketcher allows a query to be
drawn manually. Users may also specify the structural
query input by PubChem Compound Identifier (CID),
SMILES, SMARTS, InChI, Molecular Formula, or by upload
of a supported structure file format.
A specialized
PubMed search form targeted to clinicians and health
services researchers. The page simplifies searching by
clinical study category, finding systematic reviews and
searching the medical genetics literature.
A collection of
web and Flash tutorials on PubMed searching and
linking, saving searches in MyNCBI, and using MeSH and
other PubMed services.
The Related
Structures tool allows users to find 3D structures from
the Molecular Modeling Database (MMDB) that are similar
in sequence to a query protein. Although the query
protein may not yet have a resolved structure, the 3D
shape of a similar protein sequence can shed light on
the putative shape and biological function of the query
protein.
Ruffus
computational pipelines
Ruffus is designed to allow
scientific and other analyses to be automated with the
minimum of fuss and the least effort.
These are Ruffus's
strengths:
- Lightweight: Suitable for the
simplest of tasks
- Handles even fiendishly
complicated pipelines which would cause make
or scons to go cross-eyed and recursive.
- Standard python syntax. No
"clever magic" to code around.
- Unintrusive and unambitiously
lightweight syntax which tries to do this one small
thing well.
SAM (Sequence Alignment/Map)
format is a generic format for storing large nucleotide
sequence alignments. SAM aims to be a format that:
- Is flexible enough to store
all the alignment information generated by various
alignment programs;
- Is simple enough to be easily
generated by alignment programs or converted from
existing alignment formats;
- Is compact in file size;
- Allows most operations on
the alignment to work on a stream without loading
the whole alignment into memory;
- Allows the file to be indexed
by genomic position to efficiently retrieve all
reads aligning to a locus.
SAM Tools provide various
utilities for manipulating alignments in the SAM
format, including sorting, merging, indexing and
generating alignments in a per-position format.
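To show what working with the format looks like, the sketch below parses one alignment line and derives a few values from it. The eleven mandatory tab-separated columns, the CIGAR operators, and the flag bit 0x10 follow the SAM specification rather than the text above, and the helper names are hypothetical. Real pipelines would normally use SAM Tools or an existing parsing library instead.

```python
import re
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # operators that advance along the reference


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory columns of one alignment line (optional tags ignored)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]), fields[9], fields[10],
    )


def reference_end(record: SamRecord) -> Optional[int]:
    """Return the 1-based inclusive end position, or None if CIGAR is '*'."""
    if record.cigar == "*":
        return None
    span = sum(
        int(length)
        for length, op in CIGAR_OP.findall(record.cigar)
        if op in REFERENCE_CONSUMING
    )
    return record.pos + span - 1


def is_reverse_strand(record: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)
```

Because each record is a single line, a file can be processed as a stream, one `parse_sam_line` call at a time, without loading the whole alignment into memory.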
SNP Database Specialized Search Tools
A variety of tools
are available for searching the SNP database, allowing
search by genotype, method, population, submitter,
markers and sequence similarity using BLAST. These are
linked under "Search" on the left side bar of the
dbSNP main page.
A basic
introduction to the science and technology that
underlies many of the NCBI resources. A great starting
place for students and the general public, the Science
Primer provides a basis for understanding the NCBI web
site and mission, and provides direct links to many
NCBI databases and tools. Topics include genome
mapping, molecular modeling, mutations, microarrays
(gene expression), genetics, pharmacogenomics
(personalized medicine) and phylogenetics (evolutionary
relationships).
A utility for
computing cDNA-to-Genomic sequence alignments. It is
based on a variation of the Needleman-Wunsch global
alignment algorithm and specifically accounts for
introns and splice signals. Due to this algorithm,
Splign is accurate in determining splice sites and
tolerant to sequencing errors.
The scientific community now
understands disease biology better than ever before and
has the tools to make a difference in how novel
medicines are developed that can save lives. Yet, every
new drug launch is a heroic effort. With new emphasis
on reforming the global healthcare system, this
situation will not be allowed to continue much longer.
A tool for
comparing genomes on the basis of the protein sequences
they encode. To use TaxPlot, one selects a reference
genome and two species for comparison. Pre-computed
BLAST results are then used to plot a point for each
predicted protein in the reference genome, based on the
best alignment with proteins in each of the two genomes
being compared.
Supports searching
the taxonomy tree using partial taxonomic names, common
names, wild cards and phonetically similar names. For
each taxonomic node, the tool provides links to all
data in Entrez for that node, displays the lineage, and
provides links to external sites related to the node.
Generates a
taxonomic tree for a selected group of organisms. Users
can upload a file of taxonomy IDs or names, or they can
enter names or IDs directly.
Displays the
number of taxonomic nodes in the database for a given
rank and date of inclusion.
Displays the
current status of a set of taxonomic nodes or IDs.
A system for
quickly identifying segments of a nucleic acid sequence
that may be of vector origin. VecScreen searches a
query sequence for segments that match any sequence in
a specialized non-redundant vector database (UniVec).
Vector Alignment Search Tool (VAST)
A computer
algorithm that identifies similar protein 3-dimensional
structures. Structure neighbors for every structure in
MMDB are pre-computed and accessible via links on the
MMDB Structure Summary pages. These neighbors can be
used to identify distant homologs that cannot be
recognized by sequence comparison alone.
This tool helps
identify the genotype of a viral sequence. A window is
slid along the query sequence and each window is
compared by BLAST to each of the reference sequences
for a particular virus.
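The sliding-window idea can be sketched as follows. To keep the example self-contained, it swaps BLAST for a plain fraction-of-identical-positions comparison and assumes the reference sequences are already aligned to the query and of the same length; both are simplifying assumptions, and the function names and reference labels are hypothetical.

```python
def identity(a, b):
    """Return the fraction of identical positions between equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def genotype_windows(query, references, window=6, step=3):
    """Return (start, best_reference_name, identity) for each window.

    references maps a genotype name to a sequence pre-aligned to the query.
    """
    calls = []
    for start in range(0, len(query) - window + 1, step):
        segment = query[start:start + window]
        best = max(
            references,
            key=lambda name: identity(segment, references[name][start:start + window]),
        )
        best_identity = identity(segment, references[best][start:start + window])
        calls.append((start, best, best_identity))
    return calls
```

A query whose windows switch from one reference to another along its length is the kind of pattern this per-window view is meant to expose.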
Smith-Waterman Alignment
The Smith–Waterman algorithm is a well-known algorithm
for performing local sequence alignment; that is, for
determining similar regions between two nucleotide or
protein sequences. Instead of looking at the total
sequence, the Smith–Waterman algorithm compares segments
of all possible lengths and optimizes the similarity
measure.
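A minimal sketch of the algorithm with a linear gap penalty is given below. The scoring values are arbitrary assumptions, and production tools use substitution matrices and affine gap penalties instead.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets an alignment start anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Unlike a global alignment, only the best-matching segment of each sequence is returned; the flanking regions are ignored.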
Sequence
Similarity Search Engines:
http://www.ebi.ac.uk/Tools/sss/
A collection of
various algorithms for performing sequence similarity
searches from EMBL
Hydrophobicity plots?
TMHMM:
http://www.cbs.dtu.dk/services/TMHMM/
The TMpred program makes a
prediction of membrane-spanning regions and their
orientation. The algorithm is based on the statistical
analysis of TMbase, a database of naturally occurring
transmembrane proteins. The prediction is made using a
combination of several weight-matrices for scoring.
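For a simple hydrophobicity plot, a sliding-window average over the Kyte-Doolittle hydropathy scale is a common starting point. The sketch below is that simpler, swapped-in technique and not the weight-matrix method described above; the default window of 19 residues is an assumption often used when looking for membrane-spanning stretches.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return the mean hydropathy of every window along the sequence."""
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Plotting the returned list against the window start position gives the hydropathy plot; sustained high values suggest a hydrophobic stretch.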
SignalP:
http://www.cbs.dtu.dk/services/SignalP/
SignalP 3.0 server predicts the
presence and location of signal peptide cleavage sites
in amino acid sequences from different organisms:
Gram-positive prokaryotes, Gram-negative prokaryotes,
and eukaryotes. The method incorporates a prediction of
cleavage sites and a signal peptide/non-signal peptide
prediction based on a combination of several artificial
neural networks and hidden Markov models.
Pfam:
http://pfam.sanger.ac.uk/
The Pfam database contains information about protein
domains and families. Pfam-A is the manually curated
portion of the database and contains over 10,000
entries. For each entry, a protein sequence alignment
and a hidden Markov model are stored. These hidden
Markov models can be used to search sequence databases
with the HMMER package written by Sean Eddy. Because
the entries in Pfam-A do not cover all known proteins,
an automatically generated supplement called Pfam-B is
provided. Pfam-B contains a large number of small
families derived from clusters produced by an algorithm
called ADDA. Although of lower quality, Pfam-B families
can be useful when no Pfam-A families are found.
GSEA (Gene Set Enrichment
Analysis):
http://www.broadinstitute.org/gsea/index.jsp
Gene Set Enrichment Analysis (GSEA) is a computational
method that determines whether an a priori defined set
of genes shows statistically significant, concordant
differences between two biological states (e.g.
phenotypes).
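The core of the method is a running-sum statistic over a ranked gene list. The sketch below computes an unweighted, Kolmogorov-Smirnov-style enrichment score; it omits the correlation weighting and the permutation-based significance estimate that the full method uses, and the function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Return the signed maximum deviation of the running sum from zero.

    ranked_genes is ordered from most up-regulated to most down-regulated
    between the two biological states.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("the ranked list must contain both set and non-set genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up at genes in the set, step down at all other genes.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score near +1 means the set is concentrated at the top of the ranking, and a score near -1 means it is concentrated at the bottom.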
Cytoscape:
http://www.cytoscape.org/
Cytoscape is an open source software platform for
visualizing complex networks and integrating these with
any type of attribute data. Many plugins are available
for various kinds of problem domains, including
bioinformatics, social network analysis, and semantic
web.
Resources for Biological Network
Analysis:
http://www.cs.rice.edu/~nakhleh/COMP572/NetworkResources.html
(a nice starting list of various biological
analysis packages)