Recipes that save time
Use rclone to copy or sync files to a Google Drive directory. Instructions are at https://rclone.org/drive/. A few notes:
- Log in to the transfer node (ssh username@transfer.rc.hms.harvard.edu)
- Run rclone config and select n to create a new remote. Give it a name.
- Select 12 for Google Drive.
- Leave client_id and client_secret blank.
- For scope, select either 1 if you want to modify files on Google Drive or 2 if you just want to read files and copy them.
- Leave service_account_file blank.
- Select n for auto config, since we are on a remote machine.
Additional notes:
- rclone lsd remote: to see the files on your Google Drive directory
- rclone --drive-shared-with-me lsd remote: to list directories shared with you
- To copy, use rclone copy source:directory destination:directory
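A minimal sketch of a copy, assuming you named the remote gdrive and want to push a results folder from O2 (both paths here are hypothetical):
# preview first with --dry-run, then run the real copy with progress output
rclone copy /n/data1/cores/bcbio/PIs/pi_name/results gdrive:results --dry-run
rclone copy /n/data1/cores/bcbio/PIs/pi_name/results gdrive:results --progress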
More on downloading very large files from Google Drive: https://www.quora.com/How-do-I-download-a-very-large-file-from-Google-Drive
Data generated by the CCG can be downloaded from GCP to the transfer node on O2. However, downloads from the CCG are paid and need to be linked to a billing account and a project identifier. Contact Shannan to set this up. The link to the data in GCP will ask you to set the billing account.
- Log in to the transfer node: ssh user_id@transfer.rc.hms.harvard.edu
- cd to the destination directory, e.g. /n/data1/cores/bcbio/PIs/pi_firstname_pi_lastname/data/
- Check that gsutil is available: gsutil version -l
- Run gcloud init to initialize, authorize, and configure the gcloud CLI. It will walk you through the process of configuring gcloud, including creating a new configuration and choosing the gmail account you will use to perform the configuration.
- Run gcloud auth login. This will create a link for you to copy and paste into your browser to log in to your gmail account and authenticate using an authorization code.
- Download the data: gsutil -u <PROJECT_IDENTIFIER> -m cp -r gs://link_to_gcp_project/CCxxxxx/web_summaries .
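Before committing to the full download, it can help to list what is in the bucket first (a sketch reusing the placeholders above; -u names the project to bill, since these downloads are paid):
gsutil -u <PROJECT_IDENTIFIER> ls -l gs://link_to_gcp_project/CCxxxxx/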
Active Motif shares data over FTP. They will give us a long link like this: ftp://ftp.activemotif.com
- We just need to get into ftp://ftp.activemotif.com (some browsers might not open this).
- Log in with the username and password.
- Click on the folders and get the links.
- Get on the terminal and log in to the transfer node.
- wget --user=user --password=password {link_for_necessary_folders}
- # use 'wget -m' for a folder
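For big folders it is worth backgrounding the mirror so a dropped connection does not kill it (same placeholders as above):
nohup wget -m --user=user --password=password {link_for_necessary_folders} &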
Biopolymers (BPF) makes their data available through an SFTP site.
Feel free to delete one of these; we really don't need both, and they take up a lot of space.
Use scp or rsync to pull down the files. A typical command might look like this:
scp -r jmubel2@bpfngs.med.harvard.edu:./FC_03443 .
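An rsync equivalent can be handy for large runs, since an interrupted transfer can resume (a sketch using the same run folder):
# -a preserves file attributes, --partial keeps partly transferred files so a rerun resumes
rsync -av --partial jmubel2@bpfngs.med.harvard.edu:./FC_03443 .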
Zach set up a Google bucket since the FTP server was painfully slow for data downloads over 1 TB (3 days vs. 6 hours).
How do I access the Google bucket from a browser? (for viewing only)
Use https://console.cloud.google.com/storage/browser/BUCKET_NAME
(https://cloud.google.com/storage/docs/cloud-console)
Example BUCKET_NAME Zach created and gave permissions for: mbcf-hsph, i.e. https://console.cloud.google.com/storage/browser/mbcf-hsph
(gsutil credentials need to be set up first, presumably best with a gmail address. Once you do this, you may not need to do it again. Check elsewhere for info on this (Broad, Terra).)
Log in to the HMS transfer node (ssh username@transfer.rc.hms.harvard.edu)
$gcloud init (answer the questions; the defaults are typically fine)
$gsutil ls -l gs://mbcf-hsph
$gsutil cp -R gs://mbcf-hsph/ test/
*I only got it to work by specifying a folder at the destination, so this is how I do it: create a folder test and cp into it. You could go up one folder level, but I didn't for various reasons.
$nohup gsutil cp -R gs://mbcf-hsph/ test/
*I always use nohup; it keeps the transfer going even if there's a network/power issue, etc.
(CommandException: Destination URL must name a directory, bucket, or bucket subdirectory for the multiple source form of the cp command. Note: this error, due to an unrelated syntax issue, is why I specify a folder. Makes no sense, but works.)
misc
$tail nohup.out
Copying gs://mbcf-hsph/231011_KT10562_fastq/multiqc_data/multiqc_sources.txt…
Copying gs://mbcf-hsph/231011_KT10562_fastq/multiqc_report.html…
/ [919 files][ 1.2 TiB/ 1.2 TiB] 3.6 MiB/s
HMS RC also suggested rclone; here is a good link to get you started: https://rclone.org/googlecloudstorage/
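A sketch of the rclone route, assuming you have created a remote named gcs of type Google Cloud Storage with rclone config:
# list the bucket, then copy it down in the background
rclone lsd gcs:mbcf-hsph
nohup rclone copy gcs:mbcf-hsph test/ --progress &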
Zach and co. always share raw data.
If you email Zach (zherbert@mail.dfci.harvard.edu) and tell him whose data you need (cc: the researcher), he will set up an FTP site for you to use.
Make sure to let them know you've pulled down the data so they can turn off the site when you're done (it costs money to run this).
Their data is typically in tar.gz files; it can pay off to decompress them right away so you know whether you have the whole file.
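One way to check an archive before unpacking (run_data.tar.gz is a hypothetical name): tar exits non-zero on a truncated archive, so list the contents first and extract only if that succeeds:
tar -tzf run_data.tar.gz > /dev/null && tar -xzf run_data.tar.gz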
Download with the wget command. Use nohup so your job keeps running even if your connection drops:
nohup wget -m ftp://userid:password@34.198.31.178/*
-m (mirror) copies a mirror image of the directory/data, including all files and subfolders.
Use this if nohup isn't working. Double check the username, password, and IP address, as they change:
wget -m ftp://HSPH_bfx:MBCFHSPH_bfx\!@18.205.134.163
*Note the escaped exclamation point (\!) in the password; they like to put characters like that in their passwords. (old: wget -m ftp://jhutchinson:MBCFjhutchinson\!@34.198.31.178)
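Single-quoting the URL is an alternative to backslash-escaping, since bash does not do history expansion inside single quotes:
wget -m 'ftp://HSPH_bfx:MBCFHSPH_bfx!@18.205.134.163'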
(As above, use nohup so your job keeps running even if your connection drops.)
The Broad shares data via Aspera. For detailed instructions, examples, and FAQs, see: http://www.broadinstitute.org/aspera/doc/aspera_shares_transfers.txt
Install the Aspera CLI client for Linux on O2 (download it from https://data.broadinstitute.org/aspera_doc/ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh).
sh ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh
Then add the client to your PATH:
export PATH=/home/ms561/.aspera/cli/bin:$PATH
export MANPATH=/home/ms561/.aspera/cli/share/man:$MANPATH
You will get credentials that look like this (they typically must be used within a few days): USERNAME: SN0020420 PASSWORD: Y5pkItiMlDay
Change to the directory where you want the data (e.g. /n/data1/cores/bcbio/PIs/). The Aspera download requires you to specify a destination directory, so in this example make a directory named data, then use the following commands:
mkdir data (if you haven’t already)
Run this command, using the correct username (twice) and password:
aspera shares download --username=SN0020420 --password=Y5pkItiMlDay --host=shares.broadinstitute.org --destination=data/ --source=SN0020420/
A directory is created, named something like SN0020420.
cd SN0020420
tar -xzvf $name.tar.gz (where $name.tar.gz is the name of your file ending in .tar.gz)
md5sum -c $hashName.md5 (where $hashName.md5 is the name of the included checksum file ending in .md5)
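If the delivery includes several checksum files, a quick loop covers them all (a sketch, assuming you are in the download directory):
for f in *.md5; do md5sum -c "$f"; done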
Getting the Aspera client installed to facilitate transfers directly to O2 using the command line is initially a bit of a chore, but once installed it makes data transfers simple. (I used Globus to copy the installer from my mac to O2.)
Notes from James
Downloading data (more notes about sharing data with clients on the Globus page)
To have someone send you data via Globus, create a Globus ID; the sender will use this to establish a transfer.
https://www.globusid.org/
When the transfer is ready, you will receive an email message from Globus with a link.
Create a download destination directory, for example, on o2 in /n/data1/cores/bcbio/PIs/
- Click on the Globus link provided.
- Highlight the files/directories in the left-hand pane that you want to transfer from the client.
- Click on "Transfer or Sync to"; this will open two panes.
- In the right-hand destination pane, search for "HMS-RC" in the top Collection slot.
- When it finds it, paste the path to the destination directory you've established, and click on the window to enter it.
- You may instead have to click on the square to the left of the base directory, like "/n", and click the arrow to open the available subdirectories; keep doing this to build the path to your destination directory.
- You may have to select the provided data folders again in the left pane.
- You may get a message asking you to confirm your Globus ID and enter your Harvard passkey information.
- Hit Start. You'll see a window drop down and, in a bit, a green flag on the left Activity button, as well as a green "Transfer request submitted successfully" banner, saying things are working. If you get a red flag, something isn't right; stop the transfer and try again.
- You can click on the green banner to monitor the transfer.
Illumina BaseSpace can be mounted on O2 with basemount: https://help.basespace.illumina.com/cmd-line-interfaces/basespace-cli/introduction-to-basemount#Overview
- ssh username@transfer.rc.hms.harvard.edu
- Run basemount BaseSpace/ from wherever you want to mount (I mounted it in my home directory). If using it for the first time, you will have to authenticate using the link; see the example below.
rsk27@transfer06:~$ basemount BaseSpace/
,-----. ,--. ,--. ,--.
| |) /_ ,--,--. ,---. ,---. | `.' | ,---. ,--.,--.,--,--, ,-' '-.
| .-. \' ,-. |( .-' | .-. :| |'.'| || .-. || || || \'-. .-'
| '--' /\ '-' |.-' `)\ --.| | | |' '-' '' '' '| || | | |
`------' `--`--'`----' `----'`--' `--' `---' `----' `--''--' `--'
Illumina BaseMount v0.25.2.3271 public develop 2021-07-12 15:33
Command called:
basemount BaseSpace/
From:
/home/rsk27
Mount point "BaseSpace/" doesn't exist
Create this mount point directory? (Y/n) Y
Creating directory "BaseSpace/"
Starting authentication.
You need to authenticate by opening this URL in a browser:
https://basespace.illumina.com/oauth/device?code=U7my2
...
It worked!
Your identification has been saved.
Mounting BaseSpace account.
To unmount, run: basemount --unmount /home/rsk27/BaseSpace
ls BaseSpace/ will show you what is available to you:
rsk27@transfer06:~$ ls BaseSpace/
IAP Projects README Runs Trash
- Copy the files you need, e.g. cp ~/BaseSpace/Projects/BS_46-RNA_S-21-1766_GAP375/Samples/[A-Z]_*/Files/*gz .
- When done, unmount: basemount --unmount ~/BaseSpace
Alternatively, there was a Python run downloader (now deprecated, see below):
wget https://da1s119xsxmu0.cloudfront.net/sites/knowledgebase/API/08052014/Script/BaseSpaceRunDownloader_v2.zip
unzip BaseSpaceRunDownloader_v2.zip
rm run_BaseSpaceRunDownloader.bat BaseSpaceRunDownloader_v2.zip
python BaseSpaceRunDownloader_v2.py -r <run_id>
If a project is specified instead of a run: wget https://gist.githubusercontent.com/rlesca01/7ce2ca0c35c7ff97a215/raw/0eeaa8cc1b3eff00babf398a82a31f4b0946f5bb/BaseSpaceRunDownloader_v2a.py
Use Illumina’s native GUI client or run BaseMount on Ubuntu.
The Python downloader is deprecated and no longer supported by Illumina.
If you need to download processed data (i.e., not the bcl files but the fastq or a whole project), or you don't have a run number, you can use the BaseSpace Sequence Hub CLI. After installing it you can download specific datasets or projects with bs download.
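For a whole project, one command may be enough (a sketch; <project_name> comes from bs list projects, and the output folder is hypothetical):
bs list projects
bs download project -n <project_name> -o <project_name>/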
Example: Downloading specific fastq files from a project:
1st. Identify datasets with bs list datasets
$bs list datasets > avail_datasets
$head avail_datasets
+-----------------------------+-------------------------------------+----------------------+---------------------+
| Name | Id | Project.Name | DataSetType.Id |
+-----------------------------+-------------------------------------+----------------------+---------------------+
| 13_Randa_01_25_19 | ds.9a309d19c84f44c191fc86919b9a562e | Randa_01_25_19_Run1 | common.files |
| 11_Randa_01_25_19 | ds.c29329f59b2d42beaa0e617e50829e06 | Randa_01_25_19_Run1 | common.files |
| 10_Randa_01_25_19 | ds.d3946986f8da4d8ba9a8c5db4037ff7c | Randa_01_25_19_Run1 | common.files |
| 12_Randa_01_25_19 | ds.262b749568fb4733ad958f9dfb4df0e3 | Randa_01_25_19_Run1 | common.files |
| 14_Randa_01_25_19 | ds.2645866a5f514e0fb0bb26b191eff138 | Randa_01_25_19_Run1 | common.files |
| 23_Randa_01_25_19 | ds.e84152ac7b754e4ca4f0e859b5864344 | Randa_01_25_19_Run1 | common.files |
| 5_Randa_01_25_19 | ds.3170cc50b2774650bdbcfedb5ce48830 | Randa_01_25_19_Run1 | common.files |
(I clean the header and remove the lines of dashes to turn avail_datasets into dataset_list.)
2nd. Clean up the file a bit and select the ones you're interested in (in this case I'm using gawk to select the ones processed by the "illumina.fastq.v1.8" analysis):
gawk 'BEGIN{FS="|"}{print $2,$4,$5}' dataset_list | gawk 'BEGIN{OFS=","}$3=="illumina.fastq.v1.8"{print $1,$2}' > fastqIDs.txt
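fastqIDs.txt should then contain comma-separated dataset and project names, something like (hypothetical values):
13_Randa_01_25_19,Randa_01_25_19_Run1
11_Randa_01_25_19,Randa_01_25_19_Run1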
Then I use the following code to download each dataset and store it in a Project folder:
<obtain_ds.sh>
#!/bin/bash
# download each dataset listed in a CSV of "dsID,projectID" lines
fastqID_file="$1"
while read -r line; do
  dsID="$(cut -d',' -f1 <<<"$line")"
  projectID="$(cut -d',' -f2 <<<"$line")"
  # store each dataset in a per-project folder
  bs download dataset -n "${dsID}" -o "${projectID}/${dsID}"
done < "${fastqID_file}"
<obtain_ds_job.sh>
#!/bin/bash
#SBATCH --job-name=mittendorf # Job name
#SBATCH --partition=short # Partition name
#SBATCH --time=0-11:59 # Runtime in D-HH:MM format
#SBATCH --nodes=1 # Number of nodes (keep at 1)
#SBATCH --ntasks=1 # Number of tasks per node (keep at 1)
#SBATCH --cpus-per-task=1 # CPU cores requested per task (change for threaded jobs)
#SBATCH --mem=4G # Memory needed per node (total)
#SBATCH --error=jobid_%j.err # File to which STDERR will be written, including job ID
#SBATCH --output=jobid_%j.out # File to which STDOUT will be written, including job ID
#SBATCH --mail-type=ALL # Type of email notification (BEGIN, END, FAIL, ALL)
bash obtain_ds.sh fastqIDs.txt
basespace-cli: new Illumina sequencers upload data to the BaseSpace cloud, and the bs utility copies data from the cloud to the HPC. To copy bcl files: bs cp //./Runs/<run_name> <destination_dir>
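If bs cp gives you trouble, bs download run is an alternative for pulling a whole run (a sketch; the run name comes from bs list runs, and the destination path is hypothetical):
bs list runs
bs download run -n <run_name> -o /n/data1/cores/bcbio/PIs/<pi_name>/runs/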