The BioJulia organization hosts many of the Julia packages for bioinformatics.
An important task in bioinformatics is parsing files in various standard formats. Here we list some file formats and packages with parsers:
Format | Extensions | Description | Packages |
---|---|---|---|
FASTA | .fas, .fasta, .fa | DNA or protein sequences without annotations | FASTX.jl |
FASTQ | .fq, .fastq | DNA sequences with quality information | FASTX.jl |
GenBank | .gb, .gbk | DNA or protein sequences with annotations | GenomicAnnotations.jl |
EMBL | .embl | DNA or protein sequences with annotations | GenomicAnnotations.jl |
SAM | .sam | Aligned DNA sequences (typically from read mapping). Text based. | XAM.jl |
BAM | .bam | Aligned DNA sequences (typically from read mapping). Binary. | XAM.jl |
PDB | .pdb | Protein 3D structure. | BioStructures.jl, MIToS.jl |
mmCIF | .cif | Macromolecular Crystallographic Information File, also known as PDBx/mmCIF: a standard text file format for representing macromolecular structure data. | BioStructures.jl, MIToS.jl |
MMTF | .mmtf | MacroMolecular Transmission Format: a binary encoding of biological structures. | BioStructures.jl |
DSSP | | Protein secondary structure | ProteinSecondaryStructures.jl |
STRIDE | | Protein secondary structure | ProteinSecondaryStructures.jl |
PAF | .paf | Pairwise mApping Format. | PairwiseMappingFormat.jl |
Stockholm | .sto, .stk, .stockholm | Stockholm format is a multiple sequence alignment format used by Pfam, Rfam and Dfam | MIToS.jl |
A3M | .fas | A2M/A3M are a family of FASTA-derived formats used for sequence alignments | MIToS.jl |
PIR | .pir | Multiple sequence alignment format | MIToS.jl |
The basic data structures for representing DNA, RNA and protein sequences are in BioSequences.jl.
A core task in bioinformatics is aligning sequences. This can be done with BioAlignments.jl, which includes algorithms for the following pairwise alignment types:
- GlobalAlignment: global-to-global alignment
- SemiGlobalAlignment: local-to-global alignment
- LocalAlignment: local-to-local alignment
- OverlapAlignment: end-free alignment
I'm not aware of tools in Julia that compute multiple sequence alignments (MSAs), but MIToS.jl can read the most common MSA formats: Stockholm, FASTA, A3M, A2M, PIR and Raw.
Biological sequences for the Julia language
BioSequences.jl provides data types and methods for common operations with biological sequences, including DNA, RNA, and amino acid sequences.
It can do sequence search and pattern matching in sequences, and compute simple sequence statistics.
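The operations mentioned above can be sketched as follows. This is a minimal example that assumes BioSequences.jl v3, where exact search goes through `ExactSearchQuery`; older versions use a different search API. The sequences are made up for illustration.

```julia
using BioSequences

# Build a DNA sequence with the dna"..." string macro.
seq = dna"ATGGCCTTAGCTAA"

# Reverse complement of the sequence.
rc = reverse_complement(seq)

# Translate an RNA sequence with the standard genetic code.
protein = translate(rna"AUGGCCUUAGCU")

# Exact pattern search: returns the range of the first match, or nothing.
hit = findfirst(ExactSearchQuery(dna"TTAG"), seq)

# A simple statistic: the number of G and C nucleotides.
gc = count(nt -> nt == DNA_G || nt == DNA_C, seq)

println(rc)
println(protein)
println(hit)
println(gc)
```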
Sequence alignment tools
BioAlignments.jl provides sequence alignment algorithms and data structures. It includes algorithms for the following pairwise alignment types:
- GlobalAlignment: global-to-global alignment
- SemiGlobalAlignment: local-to-global alignment
- LocalAlignment: local-to-local alignment
- OverlapAlignment: end-free alignment
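As a rough sketch of how these alignment types are used, the example below runs a global alignment of two short DNA sequences with `pairalign`. The sequences and gap penalties are arbitrary choices for illustration, not recommended settings.

```julia
using BioSequences, BioAlignments

# Affine gap scoring on top of the EDNAFULL substitution matrix.
# The gap penalties here are illustrative values only.
scoremodel = AffineGapScoreModel(EDNAFULL, gap_open=-5, gap_extend=-1)

# The first argument selects the alignment type.
res = pairalign(GlobalAlignment(), dna"AGGTAG", dna"ATTG", scoremodel)

println(score(res))      # alignment score
println(alignment(res))  # the aligned sequences
```

Replacing `GlobalAlignment()` with `SemiGlobalAlignment()`, `LocalAlignment()` or `OverlapAlignment()` selects one of the other alignment types with the same call.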
Tools for genomic features in Julia.
GenomicAnnotations is a package for reading, modifying, and writing genomic annotations in the GenBank, GFF3, GFF2/GTF, and EMBL file formats.
BioStructures.jl: a Julia package to read, write and manipulate macromolecular structures
From the package README:
BioStructures provides functionality to read, write and manipulate macromolecular structures, in particular proteins. Protein Data Bank (PDB), mmCIF and MMTF format files can be read in to a hierarchical data structure. Spatial calculations and functions to access the PDB are also provided. It compares favourably in terms of performance to other PDB parsers - see some benchmarks online - and should be lightweight enough to build other packages on top of.
Data structures and algorithms for working with genetic variation
From the package README:
GeneticVariation provides types and methods for working with datasets of genetic variation. It provides a VCF and BCF parser, as well as methods for working with variation in sequences such as evolutionary distance computation, and counting different mutation types.
The BioJulia package for working with phylogenetic trees and genealogies.
This looks stale.
From the package README:
A julia package providing an abstract type and interface for phylogenies, a concrete phylogeny type implementation, and higher-level methods for working with phylogenies.
In development.
A modern genomics framework for Julia
From the package README:
GenomeGraphs provides a representation of sequence graphs. Such graphs represent genome assemblies and population graphs of genotypes/haplotypes and variation.
Julia interface to APIs for various bio-related web services
Thin wrapper around NCBI's BLAST+ CLI https://www.ncbi.nlm.nih.gov/books/NBK569856/
From the package README:
This package is a thin wrapper around the Basic Local Alignment Search Tool CLI, better known as BLAST, developed by the National Center for Biotechnology Information (NCBI).
For now, this uses CondaPkg.jl to install BLAST+.
Parse and process FASTA and FASTQ formatted files of biological sequences.
FASTX provides I/O and utilities for manipulating FASTA- and FASTQ-formatted sequence data files.
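A minimal sketch of reading FASTA records, assuming FASTX.jl v2 (which exports `FASTAReader`, `identifier` and `sequence`). The records are read from an in-memory buffer so the example is self-contained; a file handle from `open(path)` can be passed to the reader in the same way.

```julia
using FASTX

# Two made-up FASTA records held in memory.
data = ">seq1 first example\nACGTACGT\n>seq2\nGGCC\n"

reader = FASTAReader(IOBuffer(data))

# Collect (identifier, sequence) pairs as plain strings.
records = Tuple{String,String}[]
for record in reader
    push!(records, (String(identifier(record)), sequence(String, record)))
end
close(reader)

println(records)
```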
Parser for the PAF format in bioinformatics
PairwiseMappingFormat.jl provides a parser for Pairwise Mapping Format (PAF) files. PAF is a simple, tab-delimited format created by programs such as minimap2.
Wrapper to protein secondary structure calculation packages
From the package README:
This package parses STRIDE and DSSP secondary structure prediction outputs, to make them convenient to use from Julia, particularly for the analysis of MD simulations.
Plotting and interface tools for biology.
BioMakie.jl has functions to visualize:
- Protein 3D structures
- Multiple Sequence Alignments
A Julia package to analyze protein sequences, structures, and evolutionary information
From the package README:
MIToS provides a comprehensive suite of tools for the analysis of protein sequences and structures. It allows working with Multiple Sequence Alignments (MSAs) to obtain evolutionary information in the Julia language [1]. In particular, it eases the analysis of coevolving positions in an MSA using Mutual Information (MI), a measure of covariation. MI-derived scores are good predictors of inter-residue contacts in a protein structure and functional sites in proteins [2,3]. To allow such analysis, MIToS also implements several useful tools for working with protein structures, such as those available in the Protein Data Bank (PDB) or predicted by AlphaFold 2.
Simulate sequence data and complicated pedigree structures
From the package README:
XSim is a fast and user-friendly tool to simulate sequence data and complicated pedigree structures.
Features:
- An efficient CPOS algorithm
- Using founders that are characterized by real genome sequence data
- Complicated pedigree structures among descendants
A Gene Finder framework for Julia.
GeneFinder.jl is a species-agnostic, algorithm extensible, sequence-anonymous (genome, metagenomes) gene finder library framework for the Julia Language.
From the package README:
The GeneFinder package aims to be a versatile module that enables the application of different gene finding algorithms to the BioSequence type, by providing a common interface and a flexible data structure to store the predicted ORFI or genes. The package is designed to be easily extensible, allowing users to implement their own algorithms and integrate them into the framework.
This package is currently under development and is not yet ready for production use. The API is subject to change.
Note that the Bio.jl package is deprecated. In this blog post, the main developer of Bio.jl describes where its functionality has gone:
- Bio.Seq became BioSequences.jl
- Bio.Align became BioAlignments.jl
- Bio.Intervals became GenomicFeatures.jl
- Bio.Structure became BioStructures.jl
- Bio.Var became GeneticVariation.jl
- Bio.Phylo became Phylogenies.jl
- Bio.Services became BioServices.jl
- Bio.Tools became BioTools.jl (now archived)