knowledgebase

recipes that save time

View the Project on GitHub hbc/knowledgebase

Google Drive - big files

Use rclone to copy or sync files to a Google Drive directory. Instructions are at https://rclone.org/drive/.

To copy, use: rclone copy source:directory destination:directory
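A minimal end-to-end sketch, assuming a remote named gdrive has already been set up with rclone config (the remote name and both paths below are hypothetical):

```shell
# Hypothetical remote ("gdrive") and paths; adjust to your own setup.
remote="gdrive:Projects/run_01"
dest="/tmp/run_01"

# Build the command first so it can be inspected before running it.
cmd="rclone copy $remote $dest --progress"
echo "$cmd"
```

Drop the echo (or run the printed command) to start the transfer; rclone sync works the same way but also deletes destination files that are missing from the source.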

Alternate instructions

https://www.quora.com/How-do-I-download-a-very-large-file-from-Google-Drive

Center for Cancer Genomics (CCG) Google Drive

Data generated by the CCG can be downloaded from GCP to the transfer node on O2. However, downloads from CCG are paid and need to be linked to a billing account and a project identifier; contact Shannan to set this up. The link to the data in GCP will ask you to set the billing account.
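Because CCG downloads bill the requester, gsutil needs to be told which project to charge; its -u flag does this for requester-pays buckets. A hedged sketch; the project ID and bucket path below are placeholders, not real CCG names:

```shell
# Placeholder billing project and bucket path; substitute your own values.
billing_project="my-billing-project"
src="gs://ccg-example-bucket/sample.bam"

# -u names the GCP project to bill for a requester-pays download.
cmd="gsutil -u $billing_project cp $src ."
echo "$cmd"
```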

ActiveMotif

Biopolymers

A typical command might look like this:

scp -r jmubel2@bpfngs.med.harvard.edu:./FC_03443 .

Dana Farber MBCF

MBCF Google Bucket

Zach set up a Google bucket because the FTP server was painfully slow for data downloads over 1 TB (roughly 3 days via FTP vs. 6 hours from the bucket).

How do I access Google bucket from browser? (for viewing only)

Use https://console.cloud.google.com/storage/browser/BUCKET_NAME

(https://cloud.google.com/storage/docs/cloud-console)

For example, the bucket Zach created and granted permissions for is mbcf-hsph: https://console.cloud.google.com/storage/browser/mbcf-hsph

(gsutil credentials need to be set up; it is presumably best to use a Gmail address. Once you do this, you may not need to do it again. Check elsewhere (Broad, Terra) for more info on this.)

Log in to the HMS transfer node (ssh username@transfer.rc.hms.harvard.edu)

$ gcloud init (answer the questions; the defaults are typically fine)

$ gsutil ls -l gs://mbcf-hsph

$ mkdir test
$ nohup gsutil cp -R gs://mbcf-hsph/ test/

I only got it to work by specifying a folder at the destination, so that is how I do it: create the folder test and cp into it. You could go up one folder level instead, but I didn't for various reasons. I always use nohup; it keeps the transfer going even if there's a network or power issue. Without an existing destination folder, gsutil fails with "CommandException: Destination URL must name a directory, bucket, or bucket subdirectory for the multiple source form of the cp command", which is why I specify a folder.

misc

$tail nohup.out

Copying gs://mbcf-hsph/231011_KT10562_fastq/multiqc_data/multiqc_sources.txt… Copying gs://mbcf-hsph/231011_KT10562_fastq/multiqc_report.html…
/ [919 files][ 1.2 TiB/ 1.2 TiB] 3.6 MiB/s

HMS RC also suggested: Rclone - here is a good link to get you started. https://rclone.org/googlecloudstorage/

OLD DFCI Notes - still helpful for wget!

Getting the data

nohup wget -m ftp://userid:password@34.198.31.178/*

-m (mirror) copies a mirror image of the directory, including all files and subfolders.

Use this if nohup isn't working; double-check the username, password, and IP address, as they change: wget -m ftp://HSPH_bfx:MBCFHSPH_bfx\!@18.205.134.163

Note the escaped exclamation point (\!) in the password; they like to put characters like that in their passwords. (Old: wget -m ftp://jhutchinson:MBCFjhutchinson\!@34.198.31.178)
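The backslash is only needed because an interactive bash session treats a bare ! as history expansion; single-quoting the URL avoids escaping altogether. A small sketch using the password from above:

```shell
# Inside single quotes, "!" needs no escaping, even in an interactive shell.
url='ftp://HSPH_bfx:MBCFHSPH_bfx!@18.205.134.163'
echo "$url"
# The mirror command would then be: nohup wget -m "$url"
```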

CosMx Data from BWH

Broad Institute

Aspera

For detailed instructions, examples and FAQs, see:
http://www.broadinstitute.org/aspera/doc/aspera_shares_transfers.txt

Install the Aspera CLI client for Linux on o2. (download the Aspera CLI client from https://data.broadinstitute.org/aspera_doc/ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh )

sh ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh
export PATH=/home/ms561/.aspera/cli/bin:$PATH
export MANPATH=/home/ms561/.aspera/cli/share/man:$MANPATH

you will get credentials that look like this (and typically must be used within a few days): USERNAME: SN0020420
 PASSWORD: Y5pkItiMlDay

Change to the directory where you want the data (e.g., /n/data1/cores/bcbio/PIs/). The Aspera download requires you to specify a destination directory, so in this example make a directory named data, then use the following commands:

mkdir data (if you haven’t already)

Run this command, using the correct username (twice) and password:

aspera shares download --username=SN0020420 --password=Y5pkItiMlDay --host=shares.broadinstitute.org --destination=data/ --source=SN0020420/

A directory is created, named something like SN0020420.

cd SN0020420
tar -xzvf $name.tar.gz (where $name is the name of your file ending in .tar.gz)

md5sum -c $hashName.md5 (Replace $hashName with the name of the included file ending in .md5)
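If a delivery includes several .md5 files, they can all be verified in one pass. A self-contained sketch (the file below is a made-up stand-in for a real archive):

```shell
# Work in a scratch directory; create a stand-in file and its checksum,
# then verify every .md5 present, just as you would after unpacking a delivery.
mkdir -p md5_demo && cd md5_demo
echo "example data" > sample.tar.gz
md5sum sample.tar.gz > sample.tar.gz.md5

for h in *.md5; do
    md5sum -c "$h"    # prints "<file>: OK" when the checksum matches
done
```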

Getting the Aspera client installed to facilitate transfers directly to O2 from the command line is initially a bit of a chore, but once installed it makes data transfers simple. (I used Globus to copy the installer from my Mac to O2.)

Globus

Notes from James

Downloading data (more notes about sharing data with clients are under Globus)

To have someone send you data via Globus, create a Globus ID; the sender will use this to establish a transfer.

https://www.globusid.org/

When the transfer is ready, you will receive an email message from Globus with a link.

Create a download destination directory, for example, on o2 in /n/data1/cores/bcbio/PIs/

Click on globus link provided

highlight the files/directories in the left hand pane you want to transfer from the client

click on “Transfer or Sync to”

this will open two panes

in the right hand destination panel, search for “HMS-RC” in the top Collection slot.

When it finds it, paste in the path to the destination directory you created. Click on the window to enter it.

You may instead have to click on the square to the left of the base directory, like “/n” and click the arrow to open the available subdirectories.

keep doing this to add to the path name to your destination directory.

you may have to select the provided data folders again in the left pane.

you may get a message asking you to confirm your globus id, and enter your Harvard passkey information.

Hit Start. You'll see a window drop down and, in a bit, a green flag on the left Activity button, as well as a green "Transfer request submitted successfully" banner, indicating things are working. If you get a red flag, something isn't right; stop the transfer and try again.

you can click on the green banner to monitor the transfer.

BaseSpace to O2 by Radhika (July 2022)

https://help.basespace.illumina.com/cmd-line-interfaces/basespace-cli/introduction-to-basemount#Overview

  1. Log into the transfer node on O2: ssh username@transfer.rc.hms.harvard.edu
  2. Use the command basemount BaseSpace/ from wherever you want to mount (I mounted it in my home directory). If using it for the first time, you will have to authenticate using the link, see example below.
     rsk27@transfer06:~$ basemount BaseSpace/
     ,-----.                        ,--.   ,--.                         ,--.   
     |  |) /_  ,--,--. ,---.  ,---. |   `.'   | ,---. ,--.,--.,--,--, ,-'  '-. 
     |  .-.  \' ,-.  |(  .-' | .-. :|  |'.'|  || .-. ||  ||  ||      \'-.  .-'
     |  '--' /\ '-'  |.-'  `)\   --.|  |   |  |' '-' ''  ''  '|  ||  |  |  |  
     `------'  `--`--'`----'  `----'`--'   `--' `---'  `----' `--''--'  `--' 
     Illumina BaseMount v0.25.2.3271 public develop 2021-07-12 15:33
    	
     Command called:
         basemount BaseSpace/
     From:
         /home/rsk27
    	
     Mount point "BaseSpace/" doesn't exist
     Create this mount point directory? (Y/n) Y
     Creating directory "BaseSpace/"
     Starting authentication.
    	
     You need to authenticate by opening this URL in a browser:
       https://basespace.illumina.com/oauth/device?code=U7my2
     ...
     It worked!
     Your identification has been saved.
    	
     Mounting BaseSpace account.
     To unmount, run: basemount --unmount /home/rsk27/BaseSpace
    
  3. ls BaseSpace/ will show you what is available to you
     rsk27@transfer06:~$ ls BaseSpace/
     IAP  Projects  README  Runs  Trash
    
  4. Since it is mounted now, you can simply use cp or rsync, if you prefer, to copy over the necessary files/directories into the appropriate location. cp ~/BaseSpace/Projects/BS_46-RNA_S-21-1766_GAP375/Samples/[A-Z]_*/Files/*gz .
  5. To unmount, run basemount --unmount ~/BaseSpace

Basespace by Rory

wget https://da1s119xsxmu0.cloudfront.net/sites/knowledgebase/API/08052014/Script/BaseSpaceRunDownloader_v2.zip
unzip BaseSpaceRunDownloader_v2.zip
rm run_BaseSpaceRunDownloader.bat BaseSpaceRunDownloader_v2.zip
python BaseSpaceRunDownloader_v2.py -r <RunID> -a <AccessToken>

If a project is specified instead of a run, use: wget https://gist.githubusercontent.com/rlesca01/7ce2ca0c35c7ff97a215/raw/0eeaa8cc1b3eff00babf398a82a31f4b0946f5bb/BaseSpaceRunDownloader_v2a.py

Basespace by Victor

Use Illumina’s native GUI client or run BaseMount on Ubuntu.

The Python downloader is deprecated and no longer supported by Illumina.

Update:

If you need to download processed data (i.e., not the bcl files but the fastq files or a whole project), or you don't have a run number, you can use the BaseSpace Sequence Hub CLI. After installing it, you can download specific datasets or projects with bs download.

Example: Downloading specific fastq files from a project:

1st. Identify datasets with bs list datasets

$bs list datasets > avail_datasets
$head avail_datasets

+-----------------------------+-------------------------------------+----------------------+---------------------+
|            Name             |                 Id                  |     Project.Name     |   DataSetType.Id    |
+-----------------------------+-------------------------------------+----------------------+---------------------+
| 13_Randa_01_25_19           | ds.9a309d19c84f44c191fc86919b9a562e | Randa_01_25_19_Run1  | common.files        |
| 11_Randa_01_25_19           | ds.c29329f59b2d42beaa0e617e50829e06 | Randa_01_25_19_Run1  | common.files        |
| 10_Randa_01_25_19           | ds.d3946986f8da4d8ba9a8c5db4037ff7c | Randa_01_25_19_Run1  | common.files        |
| 12_Randa_01_25_19           | ds.262b749568fb4733ad958f9dfb4df0e3 | Randa_01_25_19_Run1  | common.files        |
| 14_Randa_01_25_19           | ds.2645866a5f514e0fb0bb26b191eff138 | Randa_01_25_19_Run1  | common.files        |
| 23_Randa_01_25_19           | ds.e84152ac7b754e4ca4f0e859b5864344 | Randa_01_25_19_Run1  | common.files        |
| 5_Randa_01_25_19            | ds.3170cc50b2774650bdbcfedb5ce48830 | Randa_01_25_19_Run1  | common.files        |

(I remove the header and the lines of dashes to turn avail_datasets into dataset_list.)

2nd. Clean up the file a bit and select the datasets you're interested in. In this case I'm using gawk to select the ones processed by the "illumina.fastq.v1.8" analysis:

gawk 'BEGIN{FS="|"}{print $2,$4,$5}' dataset_list | gawk 'BEGIN{OFS=","}$3=="illumina.fastq.v1.8"{print $1,$2}' > fastqIDs.txt
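The two-stage filter can be checked on a couple of fake table rows (the names and IDs below are invented; plain awk behaves the same as gawk here, so the demo uses awk):

```shell
# Two invented rows in the same "|"-separated shape as `bs list datasets` output.
cat > dataset_list <<'EOF'
| sampleA | ds.aaa | Proj1 | illumina.fastq.v1.8 |
| sampleB | ds.bbb | Proj1 | common.files        |
EOF

# Stage 1 keeps Name, Project.Name, DataSetType.Id; stage 2 keeps only the
# fastq datasets and emits "Name,Project.Name" pairs.
awk 'BEGIN{FS="|"}{print $2,$4,$5}' dataset_list \
  | awk 'BEGIN{OFS=","}$3=="illumina.fastq.v1.8"{print $1,$2}' > fastqIDs.txt

cat fastqIDs.txt    # -> sampleA,Proj1
```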

Then I use the following code to download each dataset and store it in a Project folder:

<obtain_ds.sh>

#!/bin/bash

fastqID_file="$1"

while read -r line; do
    dsID="$(cut -d',' -f1 <<<"$line")"
    projectID="$(cut -d',' -f2 <<<"$line")"
    # each line is "datasetName,projectName"
    bs download dataset -n "${dsID}" -o "${projectID}/${dsID}"
done < "${fastqID_file}"
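To sanity-check the loop's CSV parsing without contacting BaseSpace, bs can be stubbed with a shell function that just echoes the arguments it would have received (sample names below are invented):

```shell
# Stub: records the command line instead of performing a real download.
bs() { echo "bs $*"; }

printf '%s\n' "sampleA,Proj1" "sampleB,Proj2" > fastqIDs.txt

# Same parsing as obtain_ds.sh, using parameter expansion instead of cut.
while read -r line; do
    dsID="${line%%,*}"        # text before the first comma
    projectID="${line#*,}"    # text after the first comma
    bs download dataset -n "${dsID}" -o "${projectID}/${dsID}"
done < fastqIDs.txt > downloads.log

cat downloads.log
```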

<obtain_ds_job.sh>

#!/bin/bash

#SBATCH --job-name=mittendorf          # Job name
#SBATCH --partition=short             # Partition name
#SBATCH --time=0-11:59                 # Runtime in D-HH:MM format
#SBATCH --nodes=1                      # Number of nodes (keep at 1)
#SBATCH --ntasks=1                     # Number of tasks per node (keep at 1)
#SBATCH --cpus-per-task=1             # CPU cores requested per task (change for threaded jobs)
#SBATCH --mem=4G                     # Memory needed per node (total)
#SBATCH --error=jobid_%j.err           # File to which STDERR will be written, including job ID
#SBATCH --output=jobid_%j.out          # File to which STDOUT will be written, including job ID
#SBATCH --mail-type=ALL                # Type of email notification (BEGIN, END, FAIL, ALL)


bash obtain_ds.sh fastqIDs.txt

Basespace by Sergey

basespace-cli: new Illumina sequencers upload data to the BaseSpace cloud, and the bs utility copies data from the cloud to the HPC. To copy bcl files: bs cp //./Runs//Data .