BioMaj @ Pasteur

BioMaj is the tool used to manage biological data banks and keep them up to date on our cluster. It supersedes the home-made tool dbmaint.
The installed version of BioMaj is 3.1.1 (Feb 2017).
More information is available at https://biomaj.pasteur.fr/

Managed data banks @ Pasteur

Banks and formats

The following banks are maintained and available @ Pasteur for all users who want to run tools against them (blastall, bowtie, etc.).
Supported and generated formats are listed on the same line as the bank name.

  • BioMaj manages 82 banks in 11 different formats (Nov 17th 2016).

For the list of supported banks and their respective formats, check the BioMaj interface at https://biomaj.pasteur.fr/app/#/formats

How to use handled banks

Provided banks (index and flat formats) are available to users on our cluster infrastructure in /local/databases:
$ ls /local/databases
blast2  doc  fasta  ftp  index  rel  release
  • ftp gives access to the compressed raw data
  • release contains the uncompressed raw data
  • doc/versions gives information about bank versions between switches
  • rel contains all the banks
  • blast2 gives a list of the available blast2-indexed banks
  • fasta contains blast2 indexes as well as fasta files
  • index contains indexes for several tools (blast2, bwa, bowtie ...)
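
To see which banks are present, one can list the entries of the rel directory. The sketch below assumes that each bank appears as one entry directly under rel, which this page does not state explicitly. The function name list_banks and its optional root argument are illustrative and not part of BioMaj.

```shell
# List bank names found under <root>/rel, one per line.
# Assumption: each bank is a direct entry of the rel directory.
list_banks() {
    root="${1:-/local/databases}"
    for entry in "$root"/rel/*; do
        # Skip the unexpanded pattern when rel is empty or missing.
        [ -e "$entry" ] || continue
        basename "$entry"
    done
}
```

Called without an argument, list_banks reads /local/databases/rel. Its output can be filtered as usual, for example with list_banks | grep '^hg'.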

Update frequency

  • Each bank is prepared and updated individually in the background. The newly updated versions are automatically rotated every other Sunday at midnight.
    New bank releases and their rotation dates are available here.

New supported data banks

The data banks listed below have been added using the new management tool BioMaj (Jan 2014).

  • astephensi (Anopheles stephensi) (April 2017)
  • calbicans5314 (Candida albicans strain 5314)
  • calbicansWO1 (Candida albicans strain WO1)
  • cfamiliaris (Canis familiaris dog)
  • chiroptera (bat)
  • cneoformansH99 (Cryptococcus neoformans strain H99) (July 2014)
  • cneoformansJEC21 (Cryptococcus neoformans strain JEC21) (July 2014)
  • csabaeus (Chlorocebus sabaeus green monkey)
  • dmelanogaster (Drosophila melanogaster fly)
  • ecaballus (Equus caballus horse)
  • embl_wgs (EMBL Whole Genome Shotgun) (June 2014)
  • epo (European Patent Office)
  • fcatus (Felis catus cat)
  • frogs (FROGS: Find Rapidly OTU with Galaxy Solution) (Jan 2016)
  • gembase_genomes (Bacterial and plasmid sequences (NCBI)) (May 2015)
  • gembase_phages (Phage sequences (PhAnToMe)) (May 2015)
  • genbank_wgs (Genbank Whole Genome Shotgun) (June 2014)
  • hg18 (Homo sapiens Human v18) (Sept 2016)
  • hg19 (Homo sapiens Human v19)
  • hg38 (Homo sapiens Human v38) (Sept 2016)
  • itsdb (Internal Transcribed Spacer database)
  • mlucifugus (Myotis lucifugus Little brown bat)
  • mm8 (Mus musculus Mouse v8) (Sept 2016)
  • mm9 (Mus musculus Mouse v9)
  • mm10 (Mus musculus Mouse v10)
  • mmusculus_ensembl (Mus musculus Ensembl source (used for a particular tool))
  • vFam (HMMER3 database of profile hidden Markov models (HMMs) built from all the viral proteins present in RefSeq) (Nov 2016)
  • pdb (Protein Data Bank: new Fasta section, fasta and blast2 are available)
  • phiX (Phage PhiX 174 complete genome) (Oct 2015)
  • pvampyrus (Pteropus vampyrus Large flying fox (bat))
  • pvivax (Plasmodium vivax strain Sal-I)
  • spombe (Schizosaccharomyces pombe)
  • sscrofa (Sus scrofa pig)
  • TigrFAMs (HMM models) (March 16th 2015)

Banks not supported anymore

  • camlst
  • celegans
  • genpept
  • hssp (Jan 2016)
  • pp1

Info about some formats and new features

Please check the list of available formats and indexes here

Wu-Blast

The wu-blast format is no longer supported, as the tool itself is dead.

Sub banks a.k.a. Sections

These sections are available as FASTA files and Blast2 indexes.

Removed sections

NOTE: nrprot_new (nrprot monthly updates) has been removed because it is no longer available from the remote FTP site (NCBI).

From now on, all generated FASTA files end with the .fa file extension, for consistency.
Some examples:

  • Alu
    alunuc (DNA)
    alupro (Prot)
    
  • EMBL (DNA)
    • Release/Updates
      embl (embl_release + embl_update)
      embl_release release
      embl_update
      
    • CON (Contigs)
      embl_release_con_env
      embl_release_con_fun
      embl_release_con_hum
      embl_release_con_inv
      embl_release_con_mam
      embl_release_con_mus
      embl_release_con_pln
      embl_release_con_pro
      embl_release_con_rod
      embl_release_con_vrl
      embl_release_con_vrt
      
    • EST (Expressed Sequence Tag)
      embl_release_est_env
      embl_release_est_fun
      embl_release_est_hum
      embl_release_est_inv
      embl_release_est_mam
      embl_release_est_mus
      embl_release_est_pln
      embl_release_est_pro
      embl_release_est_rod
      embl_release_est_unc
      embl_release_est_vrl
      embl_release_est_vrt
      
    • GSS (Genome survey sequence)
      embl_release_gss_env
      embl_release_gss_fun
      embl_release_gss_hum
      embl_release_gss_inv
      embl_release_gss_mam
      embl_release_gss_mus
      embl_release_gss_phg
      embl_release_gss_pln
      embl_release_gss_pro
      embl_release_gss_rod
      embl_release_gss_tgn
      embl_release_gss_vrl
      embl_release_gss_vrt
      
    • HTC (high throughput cDNA sequencing)
      embl_release_htc_env
      embl_release_htc_fun
      embl_release_htc_hum
      embl_release_htc_inv
      embl_release_htc_mam
      embl_release_htc_mus
      embl_release_htc_pln
      embl_release_htc_pro
      embl_release_htc_rod
      embl_release_htc_vrt
      
    • HTG (high throughput genomic sequencing)
      embl_release_htg_env
      embl_release_htg_fun
      embl_release_htg_hum
      embl_release_htg_inv
      embl_release_htg_mam
      embl_release_htg_mus
      embl_release_htg_phg
      embl_release_htg_pln
      embl_release_htg_pro
      embl_release_htg_rod
      embl_release_htg_vrl
      embl_release_htg_vrt
      
    • PAT (Patent)
      embl_release_pat_env
      embl_release_pat_fun
      embl_release_pat_hum
      embl_release_pat_inv
      embl_release_pat_mam
      embl_release_pat_mus
      embl_release_pat_phg
      embl_release_pat_pln
      embl_release_pat_pro
      embl_release_pat_rod
      embl_release_pat_syn
      embl_release_pat_unc
      embl_release_pat_vrl
      embl_release_pat_vrt
      
    • STD (Standard)
      embl_release_std_env
      embl_release_std_fun
      embl_release_std_hum
      embl_release_std_inv
      embl_release_std_mam
      embl_release_std_mus
      embl_release_std_phg
      embl_release_std_pln
      embl_release_std_pro
      embl_release_std_rod
      embl_release_std_syn
      embl_release_std_tgn
      embl_release_std_unc
      embl_release_std_vrl
      embl_release_std_vrt
      
    • STS (Sequence Tagged Site)
      embl_release_sts_fun
      embl_release_sts_hum
      embl_release_sts_inv
      embl_release_sts_mam
      embl_release_sts_mus
      embl_release_sts_pln
      embl_release_sts_pro
      embl_release_sts_rod
      embl_release_sts_vrt
      
    • TSA (Transcriptome Shotgun Assembly)
      embl_release_tsa_env
      embl_release_tsa_fun
      embl_release_tsa_inv
      embl_release_tsa_mam
      embl_release_tsa_pln
      embl_release_tsa_pro
      embl_release_tsa_rod
      embl_release_tsa_vrl
      embl_release_tsa_vrt
      
  • EMBL WGS (DNA)
    • Whole wgs
      embl_wgs
    • Environmental Sample
      embl_wgs_env
    • Fungal
      embl_wgs_fun
    • Human
      embl_wgs_hum
    • Invertebrate
      embl_wgs_inv
    • Mammalian
      embl_wgs_mam.fa
    • Mix ?? (Since 08/2014)
      embl_wgs_mix.fa
    • Mus musculus
      embl_wgs_mus.fa
    • Plant
      embl_wgs_pln.fa
    • Prokaryote
      embl_wgs_pro.fa
    • Other Rodent
      embl_wgs_rod.fa
    • Viral
      embl_wgs_vrl.fa
    • Other Vertebrate
      embl_wgs_vrt.fa
  • FROGS (DNA) blast+/2.2.31
    frogs_18Sv119 (18S, silva v119)
    frogs_18Sv123 (18S, silva v123)
    frogs_23S
    frogs_16S
    
  • Greengenes (DNA) blast+/2.2.31
    greengenes
  • GemBase
    gembase: gembase_genomes + gembase_phages
    gembase_genomes_pro (Prot)
    gembase_phages_pro  (Prot)
    gembase_pro         (Prot) (gembase_genomes_pro + gembase_phages_pro)
    gembase_genomes_dna (DNA)
    

    gembase_phages
    gembase_phages_dna (DNA)
    gembase_dna        (DNA) (gembase_genomes_dna + gembase_phages_dna)
    
  • Genbank (DNA)
    • Release/Updates
      genbank (genbank_release + genbank_update)
      genbank_update updates
      genbank_release release
      
    • Sections
      genbank_release_bct    Bacterial
      genbank_release_con    Constructed Sequences
      genbank_release_env    Environmental sampling
      genbank_release_est    EST (expressed sequence tag)
      genbank_release_gss    GSS (genome survey sequence)
      genbank_release_htc    HTC (high throughput cDNA sequencing)
      genbank_release_htg    HTG (high throughput genomic sequencing)
      genbank_release_inv    Invertebrate
      genbank_release_mam    Other mammalian
      ...
      
  • Genbank WGS (new 23rd June 2014, was wgs)
    genbank_wgsnuc -> genbank_wgsnuc.fa (DNA)
    genbank_wgspro -> genbank_wgspro.fa (Prot)
    
  • Greengenes (DNA) blast+/2.2.28
    greengenes
    
  • IMGT (DNA)
    imgt (full sequences)
    imgtrefseq (IMGT sequences included into RefSeq)
    
    • Germline receptor sequences subbank (DNA) (new 22nd Oct. 2014)
      File names follow this rule: IG (Immunoglobulin) or TR (T-cell Receptor), followed by the chain: D (Diversity), J (Joining) or V (Variable)
    • Homo sapiens (Human)
      human_IG_D
      human_IG_J
      human_IG_V
      human_TR_D
      human_TR_J
      human_TR_V
      
    • Macaca mulatta (Rhesus monkey)
      rhesus_monkey_IG_D
      rhesus_monkey_IG_J
      rhesus_monkey_IG_V
      rhesus_monkey_TR_D
      rhesus_monkey_TR_J
      rhesus_monkey_TR_V
      
    • Mus musculus (Mouse)
      mouse_IG_D
      mouse_IG_J
      mouse_IG_V
      mouse_TR_D
      mouse_TR_J
      mouse_TR_V
      
    • Oryctolagus cuniculus (European rabbit)
      rabbit_IG_D
      rabbit_IG_J
      rabbit_IG_V
      
    • Rattus norvegicus (Rat)
      rat_IG_D
      rat_IG_J
      rat_IG_V
      
  • phiX
    phiX
  • RDPII (DNA)
    rdpii
    rdpii_cultured
    
  • RefSeq
    • Release/Updates
      refseqn         (release + updates)
      refseqp         (release + updates)
      refseqn_release (DNA)
      refseqn_update  (DNA)
      refseqp_release (Prot)
      refseqp_update  (Prot)
      
    • Sections
      refseqn_release_dna (All genomic release sequences) (new)
      refseqn_release_rna (All RNA release sequences) (new)
      refseqn_release_wgs (All WGS release sequences) (new)
      
  • Silva (DNA)
    silva (silva_lsu + silva_ssu)
    silva_lsu (Silva Large SubUnit, 23S)
    silva_ssu (Silva Small SubUnit, 16S)
    
  • UniProt (Prot)
    uniprot        (uniprot_sprot + uniprot_trembl)
    uniprot_sprot  (SwissProt)
    uniprot_trembl (Trembl)
    
  • vFAM (Prot)
    vfam
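
The naming rules used in this list (the .fa extension for generated FASTA files, and the species, receptor and chain fields of the germline subbank names) can be captured in two small shell helpers. This is only a sketch: the function names section_fasta_name and germline_fields are illustrative and not part of BioMaj, and the germline pattern is inferred from the file names listed above.

```shell
# Print the FASTA file name for a section name.
# A trailing ".fa" is stripped first, so "embl_wgs_mam" and
# "embl_wgs_mam.fa" both give "embl_wgs_mam.fa".
section_fasta_name() {
    printf '%s.fa\n' "${1%.fa}"
}

# Split a germline subbank name into "species receptor chain".
# Assumption: names look like <species>_<IG|TR>_<D|J|V>, where the
# species part may itself contain underscores (rhesus_monkey).
germline_fields() {
    name="$1"
    chain="${name##*_}"       # last field: D, J or V
    rest="${name%_*}"         # name without the chain
    receptor="${rest##*_}"    # second to last field: IG or TR
    species="${rest%_*}"      # everything before the receptor
    printf '%s %s %s\n' "$species" "$receptor" "$chain"
}
```

For example, section_fasta_name genbank_wgsnuc gives the genbank_wgsnuc.fa name shown in the Genbank WGS entry, and germline_fields splits rhesus_monkey_TR_V into its three fields.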
    

NGS Index

Most of the complete genomes managed with BioMaj are indexed with the NGS tools available @ Pasteur.
For each index, the latest default version of the package, as provided by the module load command, is used.
Supported NGS tools are currently:

  • bowtie (v1 and v2)
  • bwa
  • gatk
  • picard
  • samtools
  • soap

Indexes can be found at the following location:

/local/databases/index/<tool>/<Genome>/<tool_version>
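
The path pattern above can be turned into two small shell helpers. This is a sketch, not a tool provided on the cluster: index_dir, index_versions and the BANK_ROOT override variable are hypothetical names, and the second function assumes that every subdirectory of <tool>/<Genome> is a tool version.

```shell
# Print the index directory for a tool, a genome and a tool version,
# following /local/databases/index/<tool>/<Genome>/<tool_version>.
# BANK_ROOT (hypothetical) overrides the default root directory.
index_dir() {
    printf '%s/index/%s/%s/%s\n' "${BANK_ROOT:-/local/databases}" "$1" "$2" "$3"
}

# List the tool versions that have an index for a genome, one per line.
# Assumption: each subdirectory of <tool>/<Genome> is a tool version.
index_versions() {
    dir="${BANK_ROOT:-/local/databases}/index/$1/$2"
    for v in "$dir"/*; do
        # Skip plain files and the unexpanded pattern.
        [ -d "$v" ] || continue
        basename "$v"
    done
}
```

A typical use would be to run index_versions bwa hg19 to see which versions are present, then pass the chosen one to index_dir together with the tool and genome names.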

LiftOver files

We are now providing some LiftOver files for certain genomes.
At the moment, we are providing such files for 2 genomes (hg18 and mm8).

  • hg18Tohg19.over.chain
  • mm8Tomm9.over.chain

These files can be found at the following location:

/local/databases/index/liftover/<Genome>/
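
As an illustration, the chain file path can be built from the two genome names. The sketch below assumes that <Genome> in the path is the source genome (hg18 or mm8), which this page does not state. The function name liftover_chain and the BANK_ROOT override variable are hypothetical.

```shell
# Print the expected path of a LiftOver chain file.
# Usage: liftover_chain <source_genome> <target_genome>
# Assumption: <Genome> in the path is the source genome.
# BANK_ROOT (hypothetical) overrides the default root directory.
liftover_chain() {
    printf '%s/index/liftover/%s/%sTo%s.over.chain\n' \
        "${BANK_ROOT:-/local/databases}" "$1" "$1" "$2"
}
```

If the UCSC liftOver program is available in your environment, the result can be passed as its chain argument, for example liftOver input.bed "$(liftover_chain hg18 hg19)" output.bed unmapped.bed, where the three .bed file names are placeholders.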

Contact

For any question or suggestion, please contact