knowledgebase

recipes that save time


Guidelines for setting up an analysis

Initial folder and git setup on O2

(Also read the new checklist on Dropbox: HBC Team Folder (1)/Consults/_checklists/Data Management Checklist of Bulk RNA.docx)

  1. Set up a repo on GitHub under the HBC org (default to a private repo unless the PI wants it to be public).

Adapt the Trello project name to work on the server (i.e. replace spaces with underscores, remove special characters, and make it lowercase) and use that as the GitHub repo name.

If the project is not on Trello, use a specific name so we can tell which project it is. Make sure to include the hbc_ prefix and, if you can find it, the HBC code: hbc_$technology_of_$pilastname_$intervention_on_$tissue_in_$organism_$hbccode
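The renaming rule above (spaces to underscores, special characters removed, lowercase) can be sketched in Python. This is only an illustration: the function name `trello_to_repo_name` is hypothetical, and the choice to drop hyphens along with other special characters is an assumption.

```python
import re


def trello_to_repo_name(project_name: str) -> str:
    """Turn a Trello project name into a server-friendly repo name.

    Hypothetical helper: lowercases the name, replaces whitespace with
    underscores, and drops every character that is not a-z, 0-9 or "_".
    """
    name = project_name.strip().lower()
    name = re.sub(r"\s+", "_", name)        # spaces -> underscores
    name = re.sub(r"[^a-z0-9_]", "", name)  # remove special characters
    name = re.sub(r"_+", "_", name)         # collapse repeated underscores
    return name.strip("_")


if __name__ == "__main__":
    print(trello_to_repo_name("RNAseq of different genotypes in Drosophila brain"))
```

The result can be used directly as the GitHub repo name and as the folder name on the server.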

  1. If not present already, make a folder on the server in the PIs directory using this format: $pifirstname_$pilastname
  2. Clone the repo inside this directory.
  3. Go inside the repo directory and set up subfolders called:

    • data (for raw data)
    • meta (for extra, unformatted, sample metadata)
    • templates (bcbio config files)
    • docs (other information that you might want to keep near the data)

    ## Notes on folders

    ### data

    • These are the FASTQ files, or similar file types for other technologies.
    • In the data folder, you can have the actual downloaded files, or symlinks to someone else’s downloaded files.

    ### meta

    • Munge the metadata into the format bcbio expects and give the file a name that tells you which bcbio run it belongs to.
    • Here is a simple example of a metadata file with only replicate and genotype as covariates: https://docs.google.com/spreadsheets/d/18h6qPc7_rGzyg2gTbgyg5Nmo00zBikXBJGzDl9QRmXY/edit?usp=sharing
    • The stem of this file (the filename without its extension) will be used to name the folder with your final bcbio results.
    • Typically, I just call it “bcbio.csv”.
    • If I do another bcbio run with a different genome than before (Flybase, for example), I give the file a new, descriptive name (e.g. “bcbio_flybase.csv”).

    ### templates

    • These are the bcbio templates.
    • You can grab one from a previous project or download one from the bcbio repo.
    • You can also just run bcbio_download_template rnaseq (for example) to get the template for your particular technology.
    • The final term in the command is matched against the following template names, and every template whose name contains it is downloaded:
        freebayes-variant.yaml   
        illumina-chipseq.yaml   
        illumina-rnaseq.yaml     
        indrop-singlecell.yaml
        tumor-paired.yaml
        gatk-variant.yaml       
        illumina-fastrnaseq.yaml
        illumina-srnaseq.yaml    
        noalign-variant.yaml
      

      So, for example, bcbio_download_template gatk will only pull down the gatk-variant.yaml template, but bcbio_download_template ill will pull down every template that has illumina in its name.

    • Modify the template file as needed to fit the needs of your analysis. If running bcbio_o2 below, there is no need to modify the upload:dir variable
  4. Set up a .gitignore file to ignore all files you don’t want to sync. In theory, we only use git to store code and very small files, so ignore the bcbio final folder and the data folder. [FUTURE: NEED EXAMPLES]
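The subfolder and .gitignore steps above could be scripted roughly as follows. This is a minimal sketch, not an agreed team convention: the helper name `init_project` is hypothetical, and the ignore patterns `data/` and `final/` are assumptions based on the note to ignore the data folder and the bcbio final folder.

```python
import tempfile
from pathlib import Path

# Subfolders named in the setup steps above.
SUBFOLDERS = ["data", "meta", "templates", "docs"]

# Assumed ignore patterns: the raw data folder and any bcbio "final" folder.
GITIGNORE_LINES = ["data/", "final/"]


def init_project(repo_dir):
    """Create the standard subfolders and make sure .gitignore has our patterns."""
    repo = Path(repo_dir)
    for name in SUBFOLDERS:
        (repo / name).mkdir(parents=True, exist_ok=True)

    gitignore = repo / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    # Only add patterns that are not already present, so reruns are harmless.
    missing = [line for line in GITIGNORE_LINES if line not in existing]
    gitignore.write_text("\n".join(existing + missing) + "\n")
    return repo


if __name__ == "__main__":
    # Demo in a throwaway directory; in practice, pass the cloned repo path.
    with tempfile.TemporaryDirectory() as tmp:
        repo = init_project(tmp)
        print(sorted(p.name for p in repo.iterdir()))
```

Running it a second time on the same repo leaves the folders and the .gitignore unchanged.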

Running bcbio

At the end of the run, you will have a directory structure that looks something like this:

├── Homo_sapiens.GRCh38.92.gtf     
├── Homo_sapiens.GRCh38.92-tx2gene.tsv     
├── Homo_sapiens.GRCh38.cdna.all.fa    
├── indrop-rnaseq.yaml   
├── metadata   
│   ├── lane1_NoIndex_L001   
│   ├── lane2_NoIndex_L002      
├── sc-human   
│   ├── config   
│   │   ├── sscc-human.csv   
│   │   ├── sc-human-template.yaml   
│   │   ├── sc-human.yaml   
│   │   └── sc-human.yaml.bak2018-07-12-14-38-13   
│   └── final   
│       ├── 2018-07-19_sc-human   
│       │   ├── bcbio-nextgen-commands.log   
│       │   ├── bcbio-nextgen.log   
│       │   ├── bcb.rds   
│       │   ├── data_versions.csv   
│       │   ├── metadata.csv   
│       │   ├── programs.txt   
│       │   ├── project-summary.yaml   
│       │   ├── tagcounts-dupes.mtx   
│       │   ├── tagcounts-dupes.mtx.colnames   
│       │   ├── tagcounts-dupes.mtx.rownames   
│       │   ├── tagcounts.mtx   
│       │   ├── tagcounts.mtx.colnames   
│       │   └── tagcounts.mtx.rownames   
│       ├── lane1-AGCTTTCT   
│       │   ├── lane1-AGCTTTCT-barcodes-filtered.tsv   
│       │   ├── lane1-AGCTTTCT-barcodes.tsv   
│       │   ├── lane1-AGCTTTCT.mtx   
│       │   ├── lane1-AGCTTTCT.mtx.colnames   
│       │   ├── lane1-AGCTTTCT.mtx.rownames   
│       │   └── lane1-AGCTTTCT-transcriptome.bam   
│       ├── lane2-AAGAGCGT   
│       │   ├── lane2-AAGAGCGT-barcodes-filtered.tsv   
│       │   ├── lane2-AAGAGCGT-barcodes.tsv   
│       │   ├── lane2-AAGAGCGT.mtx   
│       │   ├── lane2-AAGAGCGT.mtx.colnames   
│       │   ├── lane2-AAGAGCGT.mtx.rownames   
│       │   └── lane2-AAGAGCGT-transcriptome.bam      
│       └── mtx.tar.gz   
└── sc-human.csv  

Template for Drosophila RNA-seq using Illumina prepared samples

details:
  - analysis: RNA-seq
    genome_build: BDGP6
    algorithm:
      aligner: star
      quality_format: Standard
      trim_reads: False
      adapters: [truseq, polya]
      strandedness: unstranded
upload:
  dir: /n/data1/cores/bcbio/PIs/mel_feany/RNAseq_of_different_genotypes_in_Drosophila_brain/bcbio/final

Setting up on scratch

It’s a good idea to get bcbio to use a work directory that is on scratch. One way to do this is to run bcbio’s templating script and then replace the work folder with a symlink to a folder on scratch. This keeps the working directory on the scratch drive, so that our storage space doesn’t blow up.
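The symlink approach can be sketched in Python as below. The helper name `link_work_to_scratch` and all paths are hypothetical; the sketch assumes the templating script has created a folder called `work` inside the project directory, and that you pass in whatever scratch location you actually have access to.

```python
import shutil
import tempfile
from pathlib import Path


def link_work_to_scratch(project_dir, scratch_dir):
    """Replace <project_dir>/work with a symlink to a folder on scratch."""
    work = Path(project_dir) / "work"
    target = Path(scratch_dir)
    target.mkdir(parents=True, exist_ok=True)

    if work.is_symlink():
        # An old link (possibly broken) is simply replaced.
        work.unlink()
    elif work.exists():
        # Move anything already in the real work folder over to scratch.
        for item in work.iterdir():
            shutil.move(str(item), str(target / item.name))
        work.rmdir()

    work.symlink_to(target.resolve(), target_is_directory=True)
    return work


if __name__ == "__main__":
    # Demo with throwaway folders standing in for the project and scratch.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"
        (project / "work").mkdir(parents=True)
        link = link_work_to_scratch(project, Path(tmp) / "scratch" / "project_work")
        print(link, "->", link.resolve())
```

bcbio then writes its intermediate files through the `work` link, so they land on scratch while the config and final folders stay in the project directory.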

Another, more convoluted route is to run everything on scratch and then copy over the final files.

Running an Analysis

Create the bcbioRNASeq object

   # Template for mouse RNA-seq using Illumina prepared samples
   ---
   details:
     - analysis: RNA-seq
       genome_build: mm10
       algorithm:
         aligner: star
         quality_format: standard
         strandedness: unstranded
         tools_on: bcbiornaseq
         bcbiornaseq:
           organism: mus musculus
           interesting_groups: [day, genotype]
   upload:
     dir: ../bcbio_final

Sync git repo to local computer

Sharing results

You can share via Dropbox by either:

1) Using the r2dropSmart::sync function (install from GitHub: lpantano/r2dropSmart).

2) By hand, by zipping up the results and copying them to the appropriate folder on Dropbox. It’s a good idea to always include the code you used for the results, as well as any linked results. [FUTURE: NEED DISCUSSION]

We don’t use Dropbox to share bcbio results, BAM files, FASTQs, or bcbio objects. If people need those, we point them to the server or have them come with a hard drive.
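For the by-hand route, a zip of a results folder that leaves out the large files could be built like this. It is a rough sketch: the helper name `zip_results` is hypothetical, and the list of excluded suffixes is an assumption about what counts as BAM, FASTQ, or bcbio object files in a given project.

```python
import tempfile
import zipfile
from pathlib import Path

# Assumed suffixes for files we do not share through Dropbox.
EXCLUDED_SUFFIXES = (".bam", ".bai", ".fastq", ".fastq.gz", ".fq", ".fq.gz", ".rds")


def zip_results(results_dir, zip_path):
    """Zip results_dir into zip_path, skipping excluded file types.

    Write the zip outside results_dir so it does not try to include itself.
    Returns the relative paths that were added.
    """
    results = Path(results_dir)
    included = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(results.rglob("*")):
            if path.is_file() and not path.name.lower().endswith(EXCLUDED_SUFFIXES):
                relative = path.relative_to(results)
                archive.write(path, relative.as_posix())
                included.append(relative.as_posix())
    return included


if __name__ == "__main__":
    # Demo with a throwaway results folder.
    with tempfile.TemporaryDirectory() as tmp:
        results = Path(tmp) / "results"
        results.mkdir()
        (results / "report.html").write_text("<html></html>")
        (results / "sample.bam").write_text("not really a BAM")
        print(zip_results(results, Path(tmp) / "results.zip"))
```

The resulting zip can then be copied to the appropriate Dropbox folder together with the code used to produce the results.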

Suggested structure for the project folder that is in the git repository or on Dropbox

Analysis
config
metadata
docs
templates
README
reports (code goes to GIT REPO, DROPBOX IF YOU WANT)
RMD (go to DROPBOX)
HTML (go to DROPBOX)
Data (R objects that you don’t want to sync to any place)
Results (go to DROPBOX) [FUTURE: NEED DISCUSSION]