knowledgebase

recipes that save time

Homo Sapiens + Covid19

GRCh38_SARSCov2 - built from Ensembl

Mus musculus

GRCm38_98 - built from Ensembl

Caenorhabditis elegans

WBcel235_WS272 - built from WormBase

Drosophila melanogaster

BDGP6 - built from FlyBase

BDGP6.92 - built from Ensembl info

Updating supported transcriptomes

  1. clone cloudbiolinux
  2. update transcriptome
    bcbio_python cloudbiolinux/utils/prepare_tx_gff.py --cores 8 --gtf Macaca_mulatta.Mmul_8.0.1.95.chr.gtf.gz --fasta /n/app/bcbio/biodata/genomes/Mmulatta/mmul8noscaffold/seq/mmul8noscaffold.fa Mmulatta mmul8noscaffold
    
  3. upload the xz file to the bucket
    aws s3 cp hg19-rnaseq-2019-02-28_75.tar.xz s3://biodata/annotation/ --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers full=emailaddress=chapmanb@50mail.com
    
  4. edit cloudbiolinux ggd transcripts.yaml recipe to point to the new file uploaded on the bucket
  5. edit the cloudbiolinux ggd gtf.yaml to show where you got the GTF from and what you did to it
  6. test before pushing
    mkdir tmpbcbio-install
    ln -s `pwd`/cloudbiolinux tmpbcbio-install/cloudbiolinux
    # switch to the bcbio user before running the upgrade
    sudo -su bcbio /bin/bash
    bcbio_nextgen.py upgrade --data
    
  7. push changes back to cloudbiolinux
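
The command in step 2 is written for one specific genome. As a minimal sketch, assuming the same five inputs apply to other builds, a small wrapper can print the equivalent call for review before running it. The function name `prepare_tx_cmd` is hypothetical; only the flags already shown in step 2 are used.

```shell
# Hypothetical helper: print (do not run) the prepare_tx_gff.py call for a build.
# Arguments: GTF file, genome FASTA, organism directory, build name, cores (default 8).
prepare_tx_cmd() {
    printf 'bcbio_python cloudbiolinux/utils/prepare_tx_gff.py --cores %s --gtf %s --fasta %s %s %s\n' \
        "${5:-8}" "$1" "$2" "$3" "$4"
}
```

For example, `prepare_tx_cmd Macaca_mulatta.Mmul_8.0.1.95.chr.gtf.gz /n/app/bcbio/biodata/genomes/Mmulatta/mmul8noscaffold/seq/mmul8noscaffold.fa Mmulatta mmul8noscaffold` prints the command from step 2, which can then be pasted into the shell.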

Factual list of genomes in O2:/n/shared_db/bcbio/biodata/genomes as of 2020-03-13

.
├── Ad37
│   ├── GW7619026
│   └── GW76-19026
├── Adenovirus
│   └── Ad37
├── Amexicanus
│   └── Amexicanus2
├── Amis
│   ├── ASM28112v4
│   └── ASM28112v4.a
├── Anidulans
│   └── FGSC_A4
├── Atta_cephalotes
│   └── Attacep1.0
├── bcbiotx
├── Btaurus
│   └── UMD3.1
├── Celegans
│   ├── WBcel235
│   ├── WBcel235_90
│   ├── WBcel235_raw
│   └── WBcel235_WS272
├── Dmelanogaster
│   ├── BDGP6
│   ├── BDGP6.15
│   ├── BDGP6.19
│   ├── BDGP6.92
│   ├── flybase
│   └── flybase_dmel_r6.28
├── Drerio
│   ├── Zv10
│   ├── Zv11
│   └── Zv9
├── Ecoli
│   ├── EDL933
│   ├── k12
│   ├── MB0009
│   ├── MB2409
│   ├── MB2455
│   ├── MG1655
│   ├── MG1655_v2
│   ├── MG1655_virus
│   ├── MG1655_wrong_name
│   └── NC_000913.3
├── Gallus_gallus
│   └── galgal5
├── gdc-virus
│   └── gdc-virus-hsv
├── haD37
│   └── DQ900900.1
├── Hsapiens
│   ├── GRCh37
│   ├── hg19
│   ├── hg19-ercc
│   ├── hg19-mt
│   ├── hg19-subset
│   ├── hg19-test
│   └── hg38
├── humanAd37
│   └── Ad37.hg19
├── kraken
│   ├── bcbio
│   ├── micro
│   ├── minikraken_20141208
│   ├── minimal
│   └── old_20141302
├── Lafricana
│   └── loxAfr3
├── Macaca
│   ├── Mfascicularis
│   ├── Mmul8
│   └── mmul8noscaffold
├── Mmulatta
│   ├── mmul8
│   └── mmul8noscaffold
├── Mmusculus
│   ├── cloudbiolinux
│   ├── GRCm38_90
│   ├── GRCm38_98
│   ├── greenberg-mm9
│   ├── mm10
│   └── mm9
├── Oaires
│   └── Oar_v31
├── phiX174
│   └── phix
├── Pintermedia
│   └── ASM195395v1
├── Rnorvegicus
│   └── rn6
├── Scerevisiae
│   └── sacCer3
├── Spombe
│   ├── ASM284v2.25
│   └── ASM284v2.30
├── spombe
│   └── ASM294v2
└── Sscrofa
    ├── ss11.1
    └── Sscrofa10.2
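
Since the listing above is a snapshot, it may help to have a way to refresh it. The following is a rough sketch, not necessarily how the listing was produced: the function name `list_builds` is hypothetical, and it relies on the `-mindepth`/`-maxdepth` options of GNU or BSD `find`. It prints one `organism/build` pair per line and skips symlinked builds, because `-type d` does not follow symlinks.

```shell
# Hypothetical helper: list organism/build directories two levels below a genomes root.
# Argument: path to the genomes directory.
list_builds() {
    ( cd "$1" && find . -mindepth 2 -maxdepth 2 -type d | sed 's|^\./||' | sort )
}
```

On O2 this would be called as `list_builds /n/shared_db/bcbio/biodata/genomes`.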

How to install a custom genome in O2

Workflow4: Whole genome trio (50x) - hg38

Inputs (FASTQ files) and results (BAM files, etc) of the whole genome BWA alignment and GATK variant calling workflow are stored in /n/data1/cores/bcbio/shared/NA12878-trio-eval

Use an updated hg38 transcriptome

# Download the chromosome-only GTF for Ensembl release 101; the downloaded file must match $gtf
gtf=Homo_sapiens.GRCh38.101.chr.gtf.gz
wget ftp://ftp.ensembl.org/pub/release-101/gtf/homo_sapiens/${gtf}
remap_url=http://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh38_ensembl2UCSC.txt
wget --no-check-certificate -qO- $remap_url | awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed
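# Rename contigs to UCSC style, then drop records whose contig name (column 1) ends in _alt
gzip -cd ${gtf} | sed -f remap.sed | grep -v -E '^[^[:space:]]+_alt[[:space:]]' > hg38-remapped.gtf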

Then pass hg38-remapped.gtf as the transcriptome_gtf option.
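
Before using the remapped GTF, it may be worth confirming that every contig name in it also exists in the genome FASTA. A minimal sketch, assuming a samtools-style `.fai` index is available for the hg38 FASTA; the function name `missing_contigs` is hypothetical. It prints each GTF contig that is absent from the index, and prints nothing when all names match.

```shell
# Hypothetical helper: report GTF contigs that are not present in a FASTA index.
# Arguments: GTF file, FASTA index (.fai, contig name in column 1).
missing_contigs() {
    awk -F'\t' 'NR == FNR { known[$1] = 1; next }   # first file: remember index contigs
                /^#/ { next }                        # skip GTF header/comment lines
                !($1 in known) && !seen[$1]++ { print $1 }' "$2" "$1"
}
```

Usage would look like `missing_contigs hg38-remapped.gtf hg38.fa.fai`, where the `.fai` path stands in for wherever the hg38 index lives on your system. Any name it prints, such as a leftover `MT`, points to a contig that the remapping did not rename.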