BigCompute – bbmri

Version 63 (modified by Barbera van Schaik, 15 years ago)

Port applications to Dutch Life Science Grid

People

AMC: Antoine van Kampen, Barbera van Schaik, Silvia D Olabarriaga, Mark Santcroos
Sara/BiGGrid: Tom Visser
UMCG: Morris Swertz, Freerk van Dijk

Description

The software will be implemented as workflow components. The workflows will run on the Dutch Life Science Grid.

Information about the infrastructure: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/
Getting started: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/GettingStarted
Publication: Angela CM Luyf, Barbera DC van Schaik, Michel de Vries, Frank Baas, Antoine HC van Kampen, Silvia D Olabarriaga (2010) Initial steps towards a production platform for DNA sequence analysis on the grid. BMC Bioinformatics 11, Dec 2010 (PubMed)

Implemented workflow components at AMC

The following workflow components are already available. The list can be expanded with Pindel and (parts of) the GATK pipeline.

Splitting of fastq files
Building a BWA index on the genome sequence (base space and color space)
BWA for shotgun reads (base space and color space). Parameter sweeps are possible. Output is in bam format
Merge bam results
Samtools pileup
Varscan (pileup to snp, indel and cns)
Bam2coverage creates a UCSC wiggle file to display the genome coverage (per 50kbp)
Coverage-per-base determines the coverage for every base in the genome and it summarizes the results (coverage versus frequency)
Annovar (works for hg18; support for other assemblies is in progress). This is a pipeline to annotate variants (gene, dbsnp, hapmap, 1000g, conservation, etc.)
FastQC
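
As an illustration of the first component in the list, the following is a minimal Python sketch of splitting a fastq file into chunks of a fixed number of lines. It is not the AMC component itself. It assumes uncompressed input with exactly four lines per read (no wrapped sequence lines), and the function name `split_fastq` and the chunk naming scheme are hypothetical.

```python
from pathlib import Path


def split_fastq(src, out_dir, lines_per_file):
    """Split a FASTQ file into chunks of at most `lines_per_file` lines.

    Assumes four lines per record, so the chunk size must be a multiple of 4;
    otherwise a record would be cut in two.
    """
    if lines_per_file <= 0 or lines_per_file % 4 != 0:
        raise ValueError("lines_per_file must be a positive multiple of 4")
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    out = None
    try:
        with src.open("r") as handle:
            for line_number, line in enumerate(handle):
                # Start a new chunk file every `lines_per_file` lines.
                if line_number % lines_per_file == 0:
                    if out is not None:
                        out.close()
                    # Hypothetical naming scheme: <name>.part0000.fastq, ...
                    chunk = out_dir / f"{src.stem}.part{len(chunks):04d}.fastq"
                    out = chunk.open("w")
                    chunks.append(chunk)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunks
```

A call such as `split_fastq("lane.fastq", "chunks", 8_000_000)` would produce files of two million reads each. The input is streamed line by line so that large lanes do not have to fit in memory.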

Implemented components of the Groningen pipeline

A more detailed description will follow later.

BwaIllumina (done) - pe00-bwa-align-pair1.ftl, pe01-bwa-align-pair2.ftl, pe02-bwa-sampe.ftl, pe03-sam-to-bam.ftl, pe04-sam-sort.ftl
MarkDuplicates (done) - pe05-mark-duplicates.ftl
PicardQC (partly done) - pe04b-picardQC.ftl. The R environment is not up and running yet, so the .pdf, .hist and .bamindexstats files cannot be produced yet. We will continue with the other components and fix this later. The attachment r-environment.txt contains info about the required R packages.
GatkGenerateIntervalFile (done) - see e-mail from Freerk of Dec 13, 2010
ReAlign (in progress) - pe06-realign.ftl. Currently fails; needs to be discussed with Freerk
FixMates (done) - pe07-fixmates.ftl
GatkCovariates (done) - pe08-covariates-before.ftl, pe11-covariates-after.ftl
GatkRecalibrate (done) - pe09-recalibrate.ftl
SortSam (done) - pe10-sam-sort.ftl
AnalyzeCovariates (done) - pe12-analyze-covariates.ftl
Update 31-1-2011: integrating all the components in two workflows; one for alignment, one for the subsequent steps

To be implemented

The components of the Groningen pipeline that are not yet implemented as workflow components
Pindel

Data access rights

To restrict data access to the smallest possible group of people, we have created a subgroup "gvnl" within the "vlemed" Virtual Organisation (VO). To join this subgroup you need a Grid certificate and membership of the "vlemed" VO. The following page explains how to obtain a certificate and how to join the "vlemed" VO: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/EBioInfra#Access

For more information about data access see http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/DataManagement

Things to address

Available disk space on the grid storage elements / worker nodes

Data location on grid

Data

The data is located on the storage element at Sara and is readable and writable only by the vlemed/gvnl group. This screencast demonstrates how to access the data from the Vbrowser: http://www.youtube.com/watch?v=FicwWGAbubQ

Storage location (resource): srm.grid.sara.nl

Path: /pnfs/grid.sara.nl/data/vlemed/gvnl

Workflows and databases

These directories are open to all members of the vlemed VO

Workflows: lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF
Databases: lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_DB

The directories that contain the workflows have the following structure:

Directory	Description
bin	dependent binaries like bwa, samtools
GasW	component description; describes which executable has to run on the grid and specifies the input and output files/parameters
Java	dependent jar files, e.g. the GATK jar
parameterFiles	text files that contain exactly one line with the parameters that you would like to provide to bwa or another component
Scufl	old workflow description files, can be ignored
shFiles	these files are executed on the grid and described by the GASW descriptor, the gvnl shFiles are based on the Groningen templates
Workflows	Workflow descriptions, clicking on them will start the Moteur plugin. Input files/parameters can be specified in the fields. If you click on "Run" the jobs are submitted to the grid
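
The one-line convention for the files in parameterFiles can be checked before a job is submitted. The sketch below is only an illustration, not part of the workflows: the function name `read_parameter_file` is hypothetical, and treating blank lines as ignorable is an assumption.

```python
import shlex
from pathlib import Path


def read_parameter_file(path):
    """Return the parameters from a one-line parameter file as a list of tokens."""
    # Assumption: blank lines (e.g. a trailing newline) do not count as content.
    lines = [line for line in Path(path).read_text().splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError(
            f"{path}: expected exactly one line of parameters, found {len(lines)}"
        )
    # shlex.split honours shell-style quoting, so quoted values stay together.
    return shlex.split(lines[0])
```

For a file containing the illustrative line `-n 2 -t 4`, the function returns `['-n', '2', '-t', '4']`, which can be appended to the command line of the component.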

The shell files can (in most cases) run on any Linux cluster. In that case you need to place the shell file and the dependent executable(s) in one directory. Each shell file begins with an example of how to run it.

Workflow execution

On mini pilot. Split lines-per-file: 8,000,000. Start: 18-12-2010 16:40

Lane	split BWA merge	Comments	Elapsed time (s)
A4a_L4_HUModqRBUDIBAPE	done		22740
A4a_L6_HUModqRADDIAAPE	done		37750
A4a_L7_HUModqRBVDIBAPE	failed
A4b_L3_HUModqRAFDIBAPE	done		32256
A4b_L6_HUModqRBTDIBAPE	failed
R2A _L1_HUModqRADDIBAPE	failed
R2A _L1_HUModqRAFDIBAPE	done		35819
R2A _L5_HUModqRAEDIAAPE	failed
R2B _L3_HUModqRBTDIAAPE	done		23754
R2B _L4_HUModqRBUDIAAPE	failed	pair 1 not in correct gzip format?
R2B _L6_HUModqRBTDIBAPE	failed
R2C _L2_HUModqRBUDIBAPE	done		40763
R2C _L2_HUModqRBVDIBAPE	failed
R2C _L7_HUModqRBVDIAAPE	done		39374
Unknown_L6_HUModqRBUDIAAPE	done		30307

Note: The Merge component is red (marked as failed) in the finished workflows. The bam files were produced successfully.
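
To make the elapsed times of the finished lanes easier to compare, they can be converted from seconds to hours, minutes and seconds. The following sketch uses the values of the lanes marked "done" in the table above; the function name `summarize` is hypothetical.

```python
from datetime import timedelta

# Elapsed times in seconds of the lanes marked "done" in the table above.
DONE_SECONDS = {
    "A4a_L4_HUModqRBUDIBAPE": 22740,
    "A4a_L6_HUModqRADDIAAPE": 37750,
    "A4b_L3_HUModqRAFDIBAPE": 32256,
    "R2A _L1_HUModqRAFDIBAPE": 35819,
    "R2B _L3_HUModqRBTDIAAPE": 23754,
    "R2C _L2_HUModqRBUDIBAPE": 40763,
    "R2C _L7_HUModqRBVDIAAPE": 39374,
    "Unknown_L6_HUModqRBUDIAAPE": 30307,
}


def summarize(done_seconds):
    """Return lane count plus shortest, longest and mean elapsed time (h:mm:ss)."""
    values = sorted(done_seconds.values())
    mean = round(sum(values) / len(values))
    return {
        "lanes": len(values),
        "shortest": str(timedelta(seconds=values[0])),
        "longest": str(timedelta(seconds=values[-1])),
        "mean": str(timedelta(seconds=mean)),
    }


if __name__ == "__main__":
    print(summarize(DONE_SECONDS))
```

For the eight finished lanes this gives a shortest run of 6:19:00, a longest run of 11:19:23 and a mean of 9:07:25. The seven failed lanes have no elapsed time and are not included.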

BWA alignment on mini pilot without splitting the data

19-12-2010 14:00 done - elapsed time 325886 s

PicardQC on finished bam files

20-12-2010 11:30 done - elapsed time (10 bam files) 8730 s
Info runtime and used disk space (ods)

Coverage-per-base on finished bam files

20-12-2010 12:40 done - elapsed time 89634 s

Mark-duplicates on finished bam files

20-12-2010 12:47 done - elapsed time 10614 s

Re-align on finished bam files

20-12-2010 17:48 done - elapsed time 9165 s

Gvnl pipeline on remaining samples (split, bwa, merge, coverage-per-base, picard-qc, mark-duplicates, re-align)

Start: 20-12-2010 18:50

Lane	Progress	Elapsed time (s)
A4a_L7_HUModqRBVDIBAPE	done	208542
A4b_L6_HUModqRBTDIBAPE	done	74476
R2A _L1_HUModqRADDIBAPE	done	72636
R2A _L5_HUModqRAEDIAAPE	done	55242
R2B _L4_HUModqRBUDIAAPE	failed
R2B _L6_HUModqRBTDIBAPE	done	227859
R2C _L2_HUModqRBVDIBAPE	done	44730

Note1: Workflow stops after merge-bam. Working on it.
Note2: The Dutch grid seems very busy at the moment. That could explain the longer execution times.

Gvnl pipeline on theoretical reads

22-12-2010 19:14 done - elapsed time 38587 s
The workflow now continues after merge-bam
To do: implement recalibration step, and a check on all bam files of the split input before the merge step
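
One possible form of the planned check before the merge step is sketched below. It is an assumption about how such a check could look, not the implemented solution: it tests that every expected chunk exists, starts with the gzip magic bytes and ends with the 28-byte empty BGZF block that terminates a complete bam file according to the SAM/BAM specification. The function names are hypothetical, and the check does not validate the alignments themselves.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"
# Empty BGZF block that marks the end of a complete BAM file (28 bytes).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200"
    "1b000300" "00000000" "00000000"
)


def bam_looks_complete(path):
    """Return True if the file exists, is gzip-framed and ends with the BGZF EOF block."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size < len(BGZF_EOF):
        return False
    with path.open("rb") as handle:
        if handle.read(2) != GZIP_MAGIC:
            return False
        # Seek to the last 28 bytes (whence=2 means relative to the file end).
        handle.seek(-len(BGZF_EOF), 2)
        return handle.read() == BGZF_EOF


def chunks_not_ready(expected_paths):
    """Return the chunk files that are missing or look truncated."""
    return [str(p) for p in expected_paths if not bam_looks_complete(p)]
```

The merge would only be started when `chunks_not_ready(...)` returns an empty list. A truncated transfer is the typical case this catches, since the end-of-file block is then missing.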

Alternatives

Clusters

Groningen
- Description here about code template and automatic PBS script generation. Job submission/monitoring
Leiden
Huygens
Lisa
Philips
DAS

Grid

EBioInfra http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/
BiGGrid Cloud http://www.cloud.sara.nl/
Topos https://grid.sara.nl/wiki/index.php/Using_the_Grid/ToPoS

Attachments (3)

r-environment.txt (134 bytes) - added by Barbera van Schaik 16 years ago. info about R packages
log-picardqc-20101220.ods (17.1 KB) - added by Barbera van Schaik 16 years ago.
log-fastqc20110423.xls (118.0 KB) - added by Barbera van Schaik 15 years ago.
