Version 1 (modified by jamverlouw, 8 years ago)

Pipeline todos

This page is reserved to track planned modifications to the pipeline for the full run.

Timeline

  • October 1st is the target date to start implementing these features in the final pipeline. Issues should be filed before this date.

The full run will start when:

  • The issues on this page are implemented
  • Metadatabase issues are resolved
  • All FQ files are merged and available
  • Final plans for the second paper are clear (the aim is to start running only after this)

Full run implementation list

  • Two alignments, so that each downstream analysis (QTL, ASE) can be used to its full potential:
    1. Unmasked for QTL (and expression quantification)
    2. Masked for ASE (check with Dasha for the masked index)
    • Mask with GoNL, 1KG and UMCG ASE study SNPs.
    • Keep separate mapping statistics for the two alignments in the analysis database.
  • Change the STAR settings to the ENCODE settings (below)
  • Variant calling on unmasked BAM/mpileup (ASE)
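
The masking item in the list above means replacing the reference base at each known SNP position with N, so that reads carrying either allele align equally well. The page does not say how the masked index is built, so the following is only a toy sketch of the idea on a single sequence string. The function name mask_positions and the example sequence are hypothetical; on a real FASTA file a tool such as bedtools maskfasta would be a more likely choice.

```shell
# Toy illustration of SNP masking (hypothetical helper, not part of the pipeline).
# Usage: mask_positions SEQUENCE POS [POS ...]
# Positions are 1-based; each listed base is replaced with N.
mask_positions() {
  seq=$1
  shift
  printf '%s\n' "$seq" | awk -v positions="$*" '{
    s = $0
    n = split(positions, p, " ")
    for (i = 1; i <= n; i++) {
      # keep everything before the position, write N, keep everything after
      s = substr(s, 1, p[i] - 1) "N" substr(s, p[i] + 1)
    }
    print s
  }'
}

# Mask positions 2 and 5 of an 8-base example sequence.
mask_positions ACGTACGT 2 5   # prints ANGTNCGT
```

The masked reference would then be indexed separately from the unmasked one, giving the two alignments listed above.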

Discussion points

  • STAR 2-pass?

Suggested STAR ENCODE settings

ENCODE settings (sent to me by Alexander Dobin, who did the alignment for some of the ENCODE samples):
# The arguments are collected in a bash array so that each one can carry a comment.
STAR_ARGS=(
  --runThreadN 8
  --genomeDir /home/dzhernakova/resources/STARindex_GoNL/
  --genomeLoad NoSharedMemory
  --readFilesIn /home/dzhernakova/data/rawData/LL-557-130804_R1.fq.gz ~/data/rawData/LL-557-130804_R2.fq.gz
  --readFilesCommand zcat
  --outFileNamePrefix ~/data/mappedData/LL-557-130804.encode/LL-557-130804.encode.
  --outSAMstrandField intronMotif
  --outSAMunmapped Within
  --outFilterType BySJout               # reduces the number of "spurious" junctions
  --outFilterMultimapNmax 20            # max multiple alignments per read: if exceeded, read is considered unmapped
  --outFilterMismatchNmax 999           # max number of mismatches per pair (absolute)
  --outFilterMismatchNoverLmax 0.04     # max mismatches per pair relative to length (0.04*(2*50)=4)
  --alignIntronMin 20                   # min intron size (default: 21)
  --alignIntronMax 1000000              # max intron (default: specified by the size of bins)
  --alignMatesGapMax 1000000            # max genomic distance between mates (default: specified by the size of bins)
  --alignSJoverhangMin 8                # min overhang for unannotated junctions (default: 5)
  --alignSJDBoverhangMin 1              # min overhang for annotated junctions (default: 3)
)
/home/dzhernakova/tools/STAR_2.3.0e.Linux_x86_64/STAR "${STAR_ARGS[@]}"
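
The comment on --outFilterMismatchNoverLmax works out the relative limit for 2 x 50 bp reads. As a rough sketch of how the two mismatch filters combine, the helper below returns the stricter of the absolute cap and the relative cap for a given pair length. The name max_mismatches is hypothetical and not part of STAR. The sketch also assumes the full combined mate length, as the comment does, whereas STAR itself applies the ratio to the mapped length of each alignment.

```shell
# Hypothetical helper: effective mismatch ceiling for one read pair.
# Usage: max_mismatches ABSOLUTE_CAP RATIO COMBINED_LENGTH
max_mismatches() {
  awk -v n="$1" -v r="$2" -v len="$3" 'BEGIN {
    # relative cap, rounded down; the small epsilon guards against
    # floating-point results that land just below a whole number
    rel = int(r * len + 1e-9)
    # the stricter (smaller) of the two caps applies
    print (rel < n ? rel : n)
  }'
}

# Settings from the command above, with 2 x 50 bp reads.
max_mismatches 999 0.04 100   # prints 4
```

With --outFilterMismatchNmax set to 999 the absolute cap is effectively switched off, so the relative cap is the one that decides. For 2 x 100 bp reads the same settings would allow 8 mismatches per pair.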