Recipes that save time
Use rclone to copy or sync files to a Google Drive directory. Instructions are at https://rclone.org/drive/. A few notes:
- Log in to the transfer node (ssh username@transfer.rc.hms.harvard.edu)
- Run rclone config and select n to create a new remote. Give it a name.
- Select 12 for Google Drive.
- Leave client_id and client_secret blank.
- For scope, select either 1 if you want to modify files on Google Drive or 2 if you just want to read files and copy them.
- Leave service_account_file blank.
- Select n for auto config, since we are on a remote machine.
Additional notes:
- rclone lsd remote: to see the files on your Google Drive directory
- rclone --drive-shared-with-me lsd remote: to list directories shared with you
- To copy, use rclone copy source:directory destination:directory
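A minimal sketch of a copy, assuming you named the remote gdrive and want to push a results folder from O2 (both paths here are hypothetical):
# preview first with --dry-run, then run the real copy with progress output
rclone copy /n/data1/cores/bcbio/PIs/pi_name/results gdrive:results --dry-run
rclone copy /n/data1/cores/bcbio/PIs/pi_name/results gdrive:results --progress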
More on downloading very large files from Google Drive: https://www.quora.com/How-do-I-download-a-very-large-file-from-Google-Drive
Data generated by the CCG can be downloaded from GCP to the transfer node on O2. However, downloads from the CCG are paid and need to be linked to a billing account and a project identifier. Contact Shannan to set this up. The link to the data in GCP will ask you to set the billing account.
- Log in to the transfer node: ssh user_id@transfer.rc.hms.harvard.edu
- cd to the destination directory, e.g. /n/data1/cores/bcbio/PIs/pi_firstname_pi_lastname/data/
- Check that gsutil is available: gsutil version -l
- Run gcloud init to initialize, authorize, and configure the gcloud CLI. It will walk you through the process of configuring gcloud, including creating a new configuration and choosing the gmail account you will use to perform the configuration.
- Run gcloud auth login. This will create a link for you to copy and paste into your browser to log in to your gmail account and authenticate using an authorization code.
- Download the data: gsutil -u <PROJECT_IDENTIFIER> -m cp -r gs://link_to_gcp_project/CCxxxxx/web_summaries .
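Before committing to the full download, it can help to list what is in the bucket first (a sketch reusing the placeholders above; -u names the project to bill, since these downloads are paid):
gsutil -u <PROJECT_IDENTIFIER> ls -l gs://link_to_gcp_project/CCxxxxx/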
Active Motif shares data over FTP. They will give us a long link like this: ftp://ftp.activemotif.com
- We just need to get into ftp://ftp.activemotif.com (some browsers might not open this).
- Log in with the username and password.
- Click on the folders and get the links.
- Get on the terminal and log in to the transfer node.
- wget --user=user --password=password {link_for_necessary_folders}
- # use 'wget -m' for a folder
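For big folders it is worth backgrounding the mirror so a dropped connection does not kill it (same placeholders as above):
nohup wget -m --user=user --password=password {link_for_necessary_folders} &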
Biopolymers (BPF) makes their data available through an SFTP site.
Feel free to delete one of these; we really don't need both, and they take up a lot of space.
Use scp or rsync to pull down the files. A typical command might look like this:
scp -r jmubel2@bpfngs.med.harvard.edu:./FC_03443 .
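An rsync equivalent can be handy for large runs, since an interrupted transfer can resume (a sketch using the same run folder):
# -a preserves file attributes, --partial keeps partly transferred files so a rerun resumes
rsync -av --partial jmubel2@bpfngs.med.harvard.edu:./FC_03443 .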
Zach set up a Google bucket since the FTP server was painfully slow for data downloads over 1 TB (3 days vs. 6 hours).
How do I access the Google bucket from a browser? (for viewing only)
Use https://console.cloud.google.com/storage/browser/BUCKET_NAME
(https://cloud.google.com/storage/docs/cloud-console)
Example BUCKET_NAME Zach created and gave permissions for: mbcf-hsph, i.e. https://console.cloud.google.com/storage/browser/mbcf-hsph
(gsutil credentials need to be set up first, presumably best with a gmail address. Once you do this, you may not need to do it again. Check elsewhere for info on this (Broad, Terra).)
Log in to the HMS transfer node (ssh username@transfer.rc.hms.harvard.edu)
$gcloud init (answer the questions; the defaults are typically fine)
$gsutil ls -l gs://mbcf-hsph
$gsutil cp -R gs://mbcf-hsph/ test/
*I only got it to work by specifying a folder at the destination, so this is how I do it: create a folder test and cp into it. You could go up one folder level, but I didn't for various reasons.
$nohup gsutil cp -R gs://mbcf-hsph/ test/
*I always use nohup; it keeps the transfer going even if there's a network/power issue, etc.
(CommandException: Destination URL must name a directory, bucket, or bucket subdirectory for the multiple source form of the cp command. Note: this error, due to an unrelated syntax issue, is why I specify a folder. Makes no sense, but works.)
misc
$tail nohup.out
Copying gs://mbcf-hsph/231011_KT10562_fastq/multiqc_data/multiqc_sources.txt…
Copying gs://mbcf-hsph/231011_KT10562_fastq/multiqc_report.html…
/ [919 files][ 1.2 TiB/ 1.2 TiB] 3.6 MiB/s
HMS RC also suggested rclone; here is a good link to get you started: https://rclone.org/googlecloudstorage/
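A sketch of the rclone route, assuming you have created a remote named gcs of type Google Cloud Storage with rclone config:
# list the bucket, then copy it down in the background
rclone lsd gcs:mbcf-hsph
nohup rclone copy gcs:mbcf-hsph test/ --progress &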
Zach and co. always share raw data.
If you email Zach (zherbert@mail.dfci.harvard.edu) and tell him whose data you need (cc: the researcher), he will set up an FTP site for you to use.
Make sure to let them know you've pulled down the data so they can turn off the site when you're done (it costs money to run this).
Their data is typically in tar.gz files; it can pay off to decompress them right away so you know whether you have the whole file.
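One way to check an archive before unpacking (run_data.tar.gz is a hypothetical name): tar exits non-zero on a truncated archive, so list the contents first and extract only if that succeeds:
tar -tzf run_data.tar.gz > /dev/null && tar -xzf run_data.tar.gz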
Download with the wget command. Use nohup so your job keeps running even if your connection drops:
nohup wget -m ftp://userid:password@34.198.31.178/*
-m (mirror) copies a mirror image of the directory/data, including all files and subfolders.
Use this if nohup isn't working. Double check the username, password, and IP address, as they change:
wget -m ftp://HSPH_bfx:MBCFHSPH_bfx\!@18.205.134.163
*Note the escaped exclamation point (\!) in the password; they like to put characters like that in their passwords. (old: wget -m ftp://jhutchinson:MBCFjhutchinson\!@34.198.31.178)
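Single-quoting the URL is an alternative to backslash-escaping, since bash does not do history expansion inside single quotes:
wget -m 'ftp://HSPH_bfx:MBCFHSPH_bfx!@18.205.134.163'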
(As above, use nohup so your job keeps running even if your connection drops.)
The Broad shares data via Aspera. For detailed instructions, examples, and FAQs, see: http://www.broadinstitute.org/aspera/doc/aspera_shares_transfers.txt
Install the Aspera CLI client for Linux on O2 (download it from https://data.broadinstitute.org/aspera_doc/ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh).
sh ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh
Then add the client to your PATH:
export PATH=/home/ms561/.aspera/cli/bin:$PATH
export MANPATH=/home/ms561/.aspera/cli/share/man:$MANPATH
You will get credentials that look like this (they typically must be used within a few days): USERNAME: SN0020420 PASSWORD: Y5pkItiMlDay
Change to the directory where you want the data (e.g. /n/data1/cores/bcbio/PIs/). The Aspera download requires you to specify a destination directory, so in this example make a directory named data, then use the following commands:
mkdir data (if you haven’t already)
Run this command, using the correct username (twice) and password:
aspera shares download --username=SN0020420 --password=Y5pkItiMlDay --host=shares.broadinstitute.org --destination=data/ --source=SN0020420/
A directory is created, named something like SN0020420.
cd SN0020420
tar -xzvf $name.tar.gz (where $name.tar.gz is the name of your file ending in .tar.gz)
md5sum -c $hashName.md5 (where $hashName.md5 is the name of the included checksum file ending in .md5)
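If the delivery includes several checksum files, a quick loop covers them all (a sketch, assuming you are in the download directory):
for f in *.md5; do md5sum -c "$f"; done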
Getting the Aspera client installed to facilitate transfers directly to O2 using the command line is initially a bit of a chore, but once installed it makes data transfers simple. (I used Globus to copy the installer from my mac to O2.)
Notes from James
Downloading data (more notes about sharing data with clients on the Globus page)
To have someone send you data via Globus, create a Globus ID; the sender will use this to establish a transfer.
https://www.globusid.org/
When the transfer is ready, you will receive an email message from Globus with a link.
Create a download destination directory, for example, on o2 in /n/data1/cores/bcbio/PIs/
- Click on the Globus link provided.
- Highlight the files/directories in the left-hand pane that you want to transfer from the client.
- Click on "Transfer or Sync to"; this will open two panes.
- In the right-hand destination pane, search for "HMS-RC" in the top Collection slot.
- When it finds it, paste the path to the destination directory you've established, and click on the window to enter it.
- You may instead have to click on the square to the left of the base directory, like "/n", and click the arrow to open the available subdirectories; keep doing this to build the path to your destination directory.
- You may have to select the provided data folders again in the left pane.
- You may get a message asking you to confirm your Globus ID and enter your Harvard passkey information.
- Hit Start. You'll see a window drop down and, in a bit, a green flag on the left Activity button, as well as a green "Transfer request submitted successfully" banner, saying things are working. If you get a red flag, something isn't right; stop the transfer and try again.
- You can click on the green banner to monitor the transfer.
Illumina BaseSpace can be mounted on O2 with basemount: https://help.basespace.illumina.com/cmd-line-interfaces/basespace-cli/introduction-to-basemount#Overview
- ssh username@transfer.rc.hms.harvard.edu
- Run basemount BaseSpace/ from wherever you want to mount (I mounted it in my home directory). If using it for the first time, you will have to authenticate using the link; see the example below.
rsk27@transfer06:~$ basemount BaseSpace/
,-----. ,--. ,--. ,--.
| |) /_ ,--,--. ,---. ,---. | `.' | ,---. ,--.,--.,--,--, ,-' '-.
| .-. \' ,-. |( .-' | .-. :| |'.'| || .-. || || || \'-. .-'
| '--' /\ '-' |.-' `)\ --.| | | |' '-' '' '' '| || | | |
`------' `--`--'`----' `----'`--' `--' `---' `----' `--''--' `--'
Illumina BaseMount v0.25.2.3271 public develop 2021-07-12 15:33
Command called:
basemount BaseSpace/
From:
/home/rsk27
Mount point "BaseSpace/" doesn't exist
Create this mount point directory? (Y/n) Y
Creating directory "BaseSpace/"
Starting authentication.
You need to authenticate by opening this URL in a browser:
https://basespace.illumina.com/oauth/device?code=U7my2
...
It worked!
Your identification has been saved.
Mounting BaseSpace account.
To unmount, run: basemount --unmount /home/rsk27/BaseSpace
ls BaseSpace/ will show you what is available to you:
rsk27@transfer06:~$ ls BaseSpace/
IAP Projects README Runs Trash
- Copy the files you need, e.g. cp ~/BaseSpace/Projects/BS_46-RNA_S-21-1766_GAP375/Samples/[A-Z]_*/Files/*gz .
- When done, unmount: basemount --unmount ~/BaseSpace
Alternatively, there was a Python run downloader (now deprecated, see below):
wget https://da1s119xsxmu0.cloudfront.net/sites/knowledgebase/API/08052014/Script/BaseSpaceRunDownloader_v2.zip
unzip BaseSpaceRunDownloader_v2.zip
rm run_BaseSpaceRunDownloader.bat BaseSpaceRunDownloader_v2.zip
python BaseSpaceRunDownloader_v2.py -r <run_id>
If a project is specified instead of a run: wget https://gist.githubusercontent.com/rlesca01/7ce2ca0c35c7ff97a215/raw/0eeaa8cc1b3eff00babf398a82a31f4b0946f5bb/BaseSpaceRunDownloader_v2a.py
Use Illumina’s native GUI client or run BaseMount on Ubuntu.
The Python downloader is deprecated and no longer supported by Illumina.
If you need to download processed data (i.e., not the bcl files but the fastq or a whole project), or you don't have a run number, you can use the BaseSpace Sequence Hub CLI. After installing it you can download specific datasets or projects with bs download.
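For a whole project, one command may be enough (a sketch; <project_name> comes from bs list projects, and the output folder is hypothetical):
bs list projects
bs download project -n <project_name> -o <project_name>/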
Example: Downloading specific fastq files from a project:
1st. Identify datasets with bs list datasets
$bs list datasets > avail_datasets
$head avail_datasets
+-----------------------------+-------------------------------------+----------------------+---------------------+
| Name | Id | Project.Name | DataSetType.Id |
+-----------------------------+-------------------------------------+----------------------+---------------------+
| 13_Randa_01_25_19 | ds.9a309d19c84f44c191fc86919b9a562e | Randa_01_25_19_Run1 | common.files |
| 11_Randa_01_25_19 | ds.c29329f59b2d42beaa0e617e50829e06 | Randa_01_25_19_Run1 | common.files |
| 10_Randa_01_25_19 | ds.d3946986f8da4d8ba9a8c5db4037ff7c | Randa_01_25_19_Run1 | common.files |
| 12_Randa_01_25_19 | ds.262b749568fb4733ad958f9dfb4df0e3 | Randa_01_25_19_Run1 | common.files |
| 14_Randa_01_25_19 | ds.2645866a5f514e0fb0bb26b191eff138 | Randa_01_25_19_Run1 | common.files |
| 23_Randa_01_25_19 | ds.e84152ac7b754e4ca4f0e859b5864344 | Randa_01_25_19_Run1 | common.files |
| 5_Randa_01_25_19 | ds.3170cc50b2774650bdbcfedb5ce48830 | Randa_01_25_19_Run1 | common.files |
(I clean the header and remove the lines of dashes to turn avail_datasets into dataset_list.)
2nd. Clean up the file a bit and select the ones you're interested in (in this case I'm using gawk to select the ones processed by the "illumina.fastq.v1.8" analysis):
gawk 'BEGIN{FS="|"}{print $2,$4,$5}' dataset_list | gawk 'BEGIN{OFS=","}$3=="illumina.fastq.v1.8"{print $1,$2}' > fastqIDs.txt
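fastqIDs.txt should then contain comma-separated dataset and project names, something like (hypothetical values):
13_Randa_01_25_19,Randa_01_25_19_Run1
11_Randa_01_25_19,Randa_01_25_19_Run1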
Then I use the following code to download each dataset and store it in a Project folder:
<obtain_ds.sh>
#!/bin/bash
# download each dataset listed in a CSV of "dsID,projectID" lines
fastqID_file="$1"
while read -r line; do
  dsID="$(cut -d',' -f1 <<<"$line")"
  projectID="$(cut -d',' -f2 <<<"$line")"
  # store each dataset in a per-project folder
  bs download dataset -n "${dsID}" -o "${projectID}/${dsID}"
done < "${fastqID_file}"
<obtain_ds_job.sh>
#!/bin/bash
#SBATCH --job-name=mittendorf # Job name
#SBATCH --partition=short # Partition name
#SBATCH --time=0-11:59 # Runtime in D-HH:MM format
#SBATCH --nodes=1 # Number of nodes (keep at 1)
#SBATCH --ntasks=1 # Number of tasks per node (keep at 1)
#SBATCH --cpus-per-task=1 # CPU cores requested per task (change for threaded jobs)
#SBATCH --mem=4G # Memory needed per node (total)
#SBATCH --error=jobid_%j.err # File to which STDERR will be written, including job ID
#SBATCH --output=jobid_%j.out # File to which STDOUT will be written, including job ID
#SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL)
bash obtain_ds.sh fastqIDs.txt
basespace-cli: new Illumina sequencers upload data to the BaseSpace cloud, and the bs utility copies data from the cloud to the HPC. To copy bcl files: bs cp //./Runs/<run_name> <destination_dir>
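If bs cp gives you trouble, bs download run is an alternative for pulling a whole run (a sketch; the run name comes from bs list runs, and the destination path is hypothetical):
bs list runs
bs download run -n <run_name> -o /n/data1/cores/bcbio/PIs/<pi_name>/runs/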