Bioinformatics Tools

Scientists and engineers at Expert BioSystems (EBS) have detailed knowledge of bioinformatics tools. The EBS training department provides hands-on training for most bioinformatics tools.

The following is a partial list of bioinformatics tools, in alphabetical order:

Amino Acid Explorer

This tool allows users to explore the characteristics of amino acids by comparing their structural and chemical properties, predicting protein sequence changes caused by mutations, viewing common substitutions, and browsing the functions of given residues in conserved domains.

Assembly Archive

Links the raw sequence information found in the Trace Archive with assembly information found in publicly available sequence repositories (GenBank/EMBL/DDBJ). The Assembly Viewer allows a user to see the multiple sequence alignments as well as the actual sequence chromatogram.

BLAST (Basic Local Alignment Search Tool)

Finds regions of local similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

BLAST Link (BLink)

A link option on protein records that displays the results of a pre-computed BLAST search of that protein against all other protein sequences at NCBI.

BLAST Microbial Genomes

Performs a BLAST search for similar sequences from selected complete eukaryotic and prokaryotic genomes.

BLAST Tutorials and Guides

This page links to a number of BLAST-related tutorials and guides, including a selection guide for BLAST algorithms, descriptions of BLAST output formats, explanations of the parameters for stand-alone BLAST, directions for setting up stand-alone BLAST on local machines and using the BLAST URL API.

Batch Entrez

Allows you to retrieve records from many Entrez databases by uploading a file of GI or accession numbers from the Nucleotide or Protein databases, or a file of unique identifiers from other Entrez databases. Search results can be saved in various formats directly to a local file on your computer.

BioAssay Services

Tools that summarize the biological test results in the PubChem database and provide alternative ways to view bioassay results and structure-activity relationships. Users also can download their analyses and data tables.

Bowtie

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).

BWA - Burrows-Wheeler Alignment

BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a sequence database (target), such as the human reference genome. It implements two different algorithms, both based on the Burrows-Wheeler Transform (BWT). The first algorithm is designed for short queries up to ~200 bp with a low error rate (<3%). It performs gapped global alignment with respect to the queries, supports paired-end reads, and is one of the fastest short read alignment algorithms to date while also visiting suboptimal hits. The second algorithm, BWA-SW, is designed for long reads with more errors. It performs heuristic Smith-Waterman-like alignment to find high-scoring local hits (and thus chimeras). On low-error short queries, BWA-SW is slower and less accurate than the first algorithm, but on long queries it is better.
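As a rough illustration of the Burrows-Wheeler Transform that Bowtie and BWA build on, the sketch below computes the transform by sorting all rotations of a string and then inverts it. This naive approach is only meant to show the idea; real aligners use compressed index structures instead, and the function names and the "$" terminator here are assumptions made for the example.

```python
def bwt(text, terminator="$"):
    """Return the Burrows-Wheeler transform of text (naive rotation sort)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    # Sort every rotation of the string, then read off the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row that ends with the terminator is the original string.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded)
    print(inverse_bwt(encoded))
```

The transform tends to group identical characters together, which is what makes a compact, searchable index of a genome possible.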

CDTree

A stand-alone application for classifying protein sequences and investigating their evolutionary relationships. CDTree can import, analyze and update existing Conserved Domain (CDD) records and hierarchies, and also allows users to create their own. CDTree is tightly integrated with Entrez CDD and Cn3D, and allows users to create and update protein domain alignments.

COBALT

COBALT is a protein multiple sequence alignment tool that finds a collection of pairwise constraints derived from the conserved domain database, the protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.

Cn3D

A stand-alone application for viewing 3-dimensional structures from NCBI's Entrez retrieval service. Cn3D runs on Windows, Macintosh, and UNIX and can be configured to receive data from most popular web browsers. Cn3D simultaneously displays structure, sequence, and alignment, and has powerful annotation and alignment editing features.

Coffee Break

Part of the NCBI Bookshelf, Coffee Break combines reports on recent biomedical discoveries with use of NCBI tools. Each report incorporates interactive tutorials that show how NCBI bioinformatics tools are used as a part of the research process.

Concise Microbial Protein BLAST

A specialized BLAST service in which the queried database consists of all proteins from complete microbial (prokaryotic) genomes. NCBI has precalculated clusters of similar proteins at the genus-level and one representative is chosen from each cluster in order to reduce the dataset, thereby reducing search time and providing a broader taxonomic view.

Conserved Domain Architecture Retrieval Tool (CDART)

Displays the functional domains that make up a given protein sequence. It lists proteins with similar domain architectures and can retrieve proteins that contain particular combinations of domains.

Conserved Domain Search Service (CD Search)

Identifies the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (Reverse Position-Specific BLAST) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD).

Digital Differential Display (DDD)

A tool for comparing EST profiles in order to identify genes with significantly different expression levels.

E-Bench

This interactive tool allows users to build E-utility URLs, either from a form or by hand, and then view their raw output. The tool provides a simple environment for testing E-utility URLs before including them in applications.

E-Utilities

Tools that provide access to data within NCBI's Entrez system outside of the regular web query interface. They provide a method of automating Entrez tasks within software applications. Each utility performs a specialized retrieval task, and can be used simply by writing a specially formatted URL.
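To give a feel for what a "specially formatted URL" looks like, here is a minimal sketch that builds an ESearch request URL without sending it. The base address and the db, term and retmax parameters follow the public E-utilities conventions as best understood here; the helper name build_esearch_url and the example query are assumptions for illustration only.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the E-utilities (assumed current at the time of writing).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL for the given Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}esearch.fcgi?{query}"


if __name__ == "__main__":
    url = build_esearch_url("protein", "BRCA1[gene] AND human[orgn]", retmax=5)
    print(url)
    # Decode the query string again to check what would be sent.
    print(parse_qs(urlparse(url).query))
```

Building the URL separately from sending it makes it easy to inspect a request in a browser or in E-Bench before wiring it into an application.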

Ebot

A tool that allows users to construct an E-utility analysis pipeline using an online form, and then generates a Perl script to execute the pipeline.

Electronic PCR (e-PCR)

A computational procedure that is used to identify sequence tagged sites (STSs) within DNA sequences. e-PCR looks for potential STSs in DNA sequences by searching for subsequences that closely match the PCR primers and have the correct order, orientation, and spacing that could represent the PCR primers used to generate known STSs.
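The primer search described above can be sketched in a few lines. This simplified version only accepts exact primer matches on the forward strand, whereas e-PCR also tolerates close matches; the function names and the size limits are hypothetical and chosen for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts_hits(template, forward, reverse, min_size, max_size):
    """Return (start, end, size) for each candidate product on the template.

    The forward primer must appear first, followed by the reverse
    complement of the reverse primer, with a product size in range.
    """
    template = template.upper()
    fwd = forward.upper()
    rev_site = reverse_complement(reverse.upper())
    hits = []
    start = template.find(fwd)
    while start != -1:
        site = template.find(rev_site, start + len(fwd))
        while site != -1:
            size = site + len(rev_site) - start
            if size > max_size:
                break  # later sites only give larger products
            if size >= min_size:
                hits.append((start, site + len(rev_site), size))
            site = template.find(rev_site, site + 1)
        start = template.find(fwd, start + 1)
    return hits


if __name__ == "__main__":
    dna = "TTTTACGTACGGGGGGGGTTCAGGAAAA"
    print(find_sts_hits(dna, "ACGTAC", "CCTGAA", min_size=10, max_size=50))
```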

Frequency-weighted Link (FLink)

FLink is a tool that enables you to link from a group of records in a source database to a ranked list of associated records in a destination database based on frequency-weighted statistics.

Gene Expression Omnibus (GEO) BLAST

Tool for aligning a query sequence (nucleotide or protein) to GenBank sequences included on microarray or SAGE platforms in the GEO database.

Gene Plot

A tool for pairwise comparison of two prokaryotic genomes that displays pairs of protein homologs that are symmetrical best hits between the two genomes.
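The notion of symmetrical best hits can be illustrated with a small sketch. It assumes that similarity scores between proteins of the two genomes are already available as dictionaries keyed by (query, subject); the function names and the toy scores are hypothetical and are not taken from Gene Plot itself.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def symmetrical_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
              ("a2", "b1"): 120, ("a2", "b2"): 90}
    b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 120,
              ("b2", "a2"): 90, ("b2", "a1"): 50}
    print(symmetrical_best_hits(a_vs_b, b_vs_a))
```

In the toy data, a2 prefers b1 but b1 prefers a1, so only the pair (a1, b1) is reported.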

Genetic Codes

Displays the genetic codes for organisms in the Taxonomy database in tables and on a taxonomic tree.

Genome ProtMap

Genome ProtMap maps each protein from a COG, or in the case of viruses a VOG, back to its genome, and displays all the genomic segments coding for members of this particular group of related proteins. The view can be shifted to focus on an adjacent COG/VOG, and clusters can be searched by name, protein gi, or gene locus tag.

Genome Remapping Service

NCBI's Remap tool allows users to project annotation data from one assembly to another through a base by base analysis. Options are provided to adjust the stringency of remapping, and summary results are displayed on the web page. Full results can be downloaded for viewing in NCBI's Genome Workbench graphical viewer, and annotation data for the remapped features, as well as summary data, is also available for download.

Genome Workbench

An integrated application for viewing and analyzing sequence data. With Genome Workbench, you can view data in publicly available sequence databases at NCBI, and mix these data with your own data.

Genomic Basic Local Alignment Search Tool (BLAST)

This tool compares nucleotide or protein sequences to genomic sequence databases and calculates the statistical significance of matches.

IGV - Integrative Genomics Viewer

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types including sequence alignments, microarrays, and genomic annotations.

LinkOut

A service that allows third parties to link directly from PubMed and other Entrez database records to relevant web-accessible resources beyond the Entrez system. Examples of LinkOut resources include full-text publications, biological databases, consumer health information and research tools.

Map Viewer

A software component of the Genome database that provides special browsing capabilities for a subset of organisms. You can view and search an organism's complete genome, display chromosome maps, and zoom into progressively greater levels of detail, down to the sequence data for a region of interest.

MEME Suite

A suite of motif-based sequence analysis tools. The MEME Suite allows you to:

  • discover motifs using MEME or GLAM2 on groups of related DNA or protein sequences,
  • search sequence databases using motifs,
  • compare a motif to all motifs in a database of motifs, and
  • associate motifs with Gene Ontology terms via their putative target genes.

 

NCBI News

NCBI's monthly newsletter that provides information on new and updated databases, and software services. The News often has feature articles that highlight and demonstrate services, features, tools, and interesting data with practical examples of their use.

NCBI Toolbox

A set of software and data exchange specifications used by NCBI to produce portable, modular software for molecular biology. The software in the Toolbox is primarily designed to read records in Abstract Syntax Notation 1 (ASN.1) format, an International Standards Organization (ISO) data representation format.

OSIRIS

A public domain quality assurance software package that facilitates the assessment of multiplex short tandem repeat (STR) DNA profiles based on laboratory-specific protocols. OSIRIS evaluates the raw electrophoresis data using an independently derived mathematically-based sizing algorithm. It offers two new peak quality measures - fit level and sizing residual. It can be customized to accommodate laboratory-specific signatures such as background noise settings, customized naming conventions and additional internal laboratory controls.

Open Mass Spectrometry Search Algorithm (OMSSA) Search

An efficient search engine for identifying MS/MS peptide spectra by searching libraries of known protein sequences. OMSSA scores significant hits with a probability score developed using classical hypothesis testing, the same statistical method used in BLAST.

Open Reading Frame Finder (ORF Finder)

A graphical analysis tool that finds all open reading frames in a user's sequence or in a sequence already in the database. Sixteen different genetic codes can be used. The deduced amino acid sequence can be saved in various formats and searched against protein databases using BLAST.
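A minimal sketch of an open reading frame search is shown below. It scans all six reading frames but, unlike ORF Finder, assumes only the standard genetic code (ATG as the start codon and TAA, TAG, TGA as stop codons); the function names and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    being scanned. ORFs that run off the end without a stop are skipped.
    """
    seq = seq.upper()
    results = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for pos in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[pos:pos + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = pos
                elif orf_start is not None and codon in STOP_CODONS:
                    end = pos + 3
                    if (end - orf_start) // 3 >= min_codons:
                        results.append((strand, frame, orf_start, end,
                                        strand_seq[orf_start:end]))
                    orf_start = None
    return results


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGCC"):
        print(orf)
```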

PSSM Viewer

Allows users to display, sort, subset and download position-specific score matrices (PSSMs) either from CDD records or from Position Specific Iterated (PSI)-BLAST protein searches. The tool also can align a query protein to the PSSM and highlight positions of high conservation.

Primer-BLAST

The Primer-BLAST tool uses Primer3 to design PCR primers to a sequence template. The potential products are then automatically analyzed with a BLAST search against user specified databases, to check the specificity to the target intended.

ProSplign

A utility for computing alignments of proteins to genomic nucleotide sequence. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, ProSplign is accurate in determining splice sites and tolerant to sequencing errors.

PubChem Power User Gateway (PUG)

PUG provides access to PubChem services via a programmatic interface. PUG allows users to download data, initiate chemical structure searches, standardize chemical structures and interact with the E-utilities. PUG can be accessed using either standard URLs or via SOAP.

PubChem Standardization Service

Standardization, in PubChem terminology, is the processing of chemical structures in the same way used to create PubChem Compound records from contributors' original structures. This service lets users see how PubChem would handle any structure they would like to submit.

PubChem Structure Search

PubChem Structure Search allows the PubChem Compound Database to be queried by chemical structure or chemical structure pattern. The PubChem Sketcher allows a query to be drawn manually. Users may also specify the structural query input by PubChem Compound Identifier (CID), SMILES, SMARTS, InChI, Molecular Formula, or by upload of a supported structure file format.

PubMed Clinical Queries

A specialized PubMed search form targeted to clinicians and health services researchers. The page simplifies searching by clinical study category, finding systematic reviews and searching the medical genetics literature.

PubMed Tutorials

A collection of web and flash tutorials on PubMed searching and linking, saving searches in MyNCBI, using MeSH and other PubMed services.

Related Structures

The Related Structures tool allows users to find 3D structures from the Molecular Modeling Database (MMDB) that are similar in sequence to a query protein. Although the query protein may not yet have a resolved structure, the 3D shape of a similar protein sequence can shed light on the putative shape and biological function of the query protein.

Ruffus computational pipelines

Ruffus is designed to allow scientific and other analyses to be automated with the minimum of fuss and the least effort.

These are Ruffus's strengths:

  • Lightweight: Suitable for the simplest of tasks
  • Handles even fiendishly complicated pipelines which would cause make or scons to go cross-eyed and recursive.
  • Standard python syntax. No "clever magic" to code around.
  • Unintrusive and unambitiously lightweight syntax which tries to do this one small thing well.

 

SAMtools

SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM aims to be a format that:

  • Is flexible enough to store all the alignment information generated by various alignment programs;
  • Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
  • Is compact in file size;
  • Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
  • Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.

SAMtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
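To make the format concrete, the following sketch reads SAM text one line at a time, in the streaming spirit described above. It relies on the eleven mandatory tab-separated columns of an alignment line and on header lines starting with "@"; the class and function names are hypothetical, optional tag fields are ignored, and this is not how SAMtools itself is implemented.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of a SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse one alignment line; optional tag fields are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10])


def iter_sam(lines):
    """Yield records one at a time, skipping header and blank lines."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)


def is_reverse_strand(record):
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6\tSO:coordinate\n",
        "r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    ]
    for rec in iter_sam(example):
        print(rec.qname, rec.rname, rec.pos, is_reverse_strand(rec))
```

Because iter_sam is a generator, it can be fed an open file handle and never holds more than one record in memory.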

 

SNP Database Specialized Search Tools

A variety of tools are available for searching the SNP database, allowing search by genotype, method, population, submitter, markers and sequence similarity using BLAST. These are linked under "Search" on the left side bar of the dbSNP main page.

Science Primer

A basic introduction to the science and technology that underlies many of the NCBI resources. A great starting place for students and the general public, the Science Primer provides a basis for understanding the NCBI web site and mission, and provides direct links to many NCBI databases and tools. Topics include genome mapping, molecular modeling, mutations, microarrays (gene expression), genetics, pharmacogenomics (personalized medicine) and phylogenetics (evolutionary relationships).

Splign

A utility for computing cDNA-to-Genomic sequence alignments. It is based on a variation of the Needleman-Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, Splign is accurate in determining splice sites and tolerant to sequencing errors.

Spotfire for Life Sciences

The scientific community now understands disease biology better than ever before and has the tools to make a difference in how novel medicines are developed that can save lives. Yet, every new drug launch is a heroic effort. With new emphasis on reforming the global healthcare system, this situation will not be allowed to continue much longer.

 

TaxPlot

A tool for comparing genomes on the basis of the protein sequences they encode. To use TaxPlot, one selects a reference genome and two species for comparison. Pre-computed BLAST results are then used to plot a point for each predicted protein in the reference genome, based on the best alignment with proteins in each of the two genomes being compared.

Taxonomy Browser

Supports searching the taxonomy tree using partial taxonomic names, common names, wild cards and phonetically similar names. For each taxonomic node, the tool provides links to all data in Entrez for that node, displays the lineage, and provides links to external sites related to the node.

Taxonomy Common Tree

Generates a taxonomic tree for a selected group of organisms. Users can upload a file of taxonomy IDs or names, or they can enter names or IDs directly.

Taxonomy Statistics

Displays the number of taxonomic nodes in the database for a given rank and date of inclusion.

Taxonomy Status Reports

Displays the current status of a set of taxonomic nodes or IDs.

VecScreen

A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. VecScreen searches a query sequence for segments that match any sequence in a specialized non-redundant vector database (UniVec).

Vector Alignment Search Tool (VAST)

A computer algorithm that identifies similar protein 3-dimensional structures. Structure neighbors for every structure in MMDB are pre-computed and accessible via links on the MMDB Structure Summary pages. These neighbors can be used to identify distant homologs that cannot be recognized by sequence comparison alone.

Viral Genotyping Tool

This tool helps identify the genotype of a viral sequence. A window is slid along the query sequence and each window is compared by BLAST to each of the reference sequences for a particular virus.

Smith-Waterman Alignment

The Smith–Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.
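A compact sketch of the algorithm is given below. It uses a simple linear gap penalty and illustrative scoring values (match 2, mismatch -1, gap -2), all of which are assumptions; production implementations typically use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align the two characters
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("AAAGATTACACCC", "TTGATTACATT"))
```

Clamping every cell at zero is what makes the alignment local: a poorly matching prefix is discarded rather than dragged along.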

Sequence Similarity Search Engines: http://www.ebi.ac.uk/Tools/sss/

A collection of algorithms for performing sequence similarity searches, provided by EMBL.

Hydrophobicity plots?

 

TMHMM: http://www.cbs.dtu.dk/services/TMHMM/

TMHMM predicts transmembrane helices in proteins using a hidden Markov model. The related TMpred program makes a prediction of membrane-spanning regions and their orientation. The TMpred algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins. The prediction is made using a combination of several weight-matrices for scoring.

SignalP: http://www.cbs.dtu.dk/services/SignalP/

The SignalP 3.0 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks and hidden Markov models.

Pfam: http://pfam.sanger.ac.uk/

The Pfam database contains information about protein domains and families. Pfam-A is the manually curated portion of the database and contains over 10,000 entries. For each entry, a protein sequence alignment and a hidden Markov model are stored. These hidden Markov models can be used to search sequence databases with the HMMER package written by Sean Eddy. Because the entries in Pfam-A do not cover all known proteins, an automatically generated supplement called Pfam-B is provided. Pfam-B contains a large number of small families derived from clusters produced by an algorithm called ADDA. Although of lower quality, Pfam-B families can be useful when no Pfam-A families are found.

 

GSEA (Gene Set Enrichment Analysis): http://www.broadinstitute.org/gsea/index.jsp

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
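The core idea can be sketched as a running-sum statistic over a ranked gene list. The version below is a deliberately simplified, unweighted variant: it ignores the correlation weighting and the permutation-based significance testing that the actual GSEA software performs, and the function name and toy gene labels are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Return the maximum deviation from zero of an unweighted running sum.

    Walking down the ranked list, the sum goes up for genes in the set
    and down for genes outside it, so that it returns to zero at the end.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and outside the set")
    running, extreme = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f"]   # most up-regulated first
    print(enrichment_score(ranked, {"a", "b"}))  # set at the top of the list
    print(enrichment_score(ranked, {"e", "f"}))  # set at the bottom
```

A strongly positive score means the set is concentrated at the top of the ranking, and a strongly negative score means it is concentrated at the bottom.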

Cytoscape: http://www.cytoscape.org/

Cytoscape is an open source software platform for visualizing complex networks and integrating these with any type of attribute data. Many plugins are available for various kinds of problem domains, including bioinformatics, social network analysis, and semantic web.

Resources for Biological Network Analysis: http://www.cs.rice.edu/~nakhleh/COMP572/NetworkResources.html (a nice starting list of various biological analysis packages)