-
Notifications
You must be signed in to change notification settings - Fork 9
Graph based ChIP seq tutorial
This guide describes how to use Graph Peak Caller and vg to run a whole graph-based ChIP-seq pipeline by doing Peak Calling on each chromosome and merging the results. This guide is aimed at advanced users, and assumes the users to be comfortable using the command line. Note that Graph Peak Caller and vg more easily can be used on a set of pre-built graphs through a Galaxy installation at https://hyperbrowser.uio.no/graph-peak-caller.
- You need to have your own vg graphs or use one of the graphs that we provide (see below).
- Note that if you want to work on the whole-genome human reference graphs, you should have at least 70 GB of RAM available (this is the size of the vg indices when loaded into memory). Drosophila Melanogaster and Arabidopsis Thaliana requires much less RAM, and a typical laptop will suffice to follow this tutorial for one those species.
- You need to have vg installed on your system.
- This guide assumes you have single-ended reads.
Since we are doing peak calling against the whole genome of a species, we want to use one graph for every chromosome. This enables us to do peak calling on every chromosome independently (and in parallel) and merge the results in the end.
- If you want to use our prebuilt and preprocessed graphs for either Drosophila Melanogaster, Human or Arabidopsis Thaliana, these can be downloaded by clicking on one of the links below.
- Human whole-genome Graph. Warning: Requires at least 70 GB of RAM to use. Also, a typical ChIP-seq experiment can take hours.
- Drosophila Melanogaster
- Arabidopsis Thaliana
- If you want to use your own vg graph files or build your own graphs from scratch, e.g. using a vcf, follow this tutorial
After downloading, make sure you extract the files.
The rest of this tutorial assumes you are using the Drosophila Melanogaster graphs. If you are using other graphs, all the following steps will be similar, except that you will need to change the list of chromosomes we use with the relevant chromosomes for your set of graphs.
We will init some required parameters by assigning them to variables. This way, we don't need to write them many times in the rest of the pipeline. Run these lines on your command line:
read_length=50 # Change this number to the read length of your raw reads
fragment_length=230 # Change this number to the fragment length of your raw reads
graph_dir=path/to/your/graphs/ # This is where you extracted the graphs
Use read length 50 and fragment 240 if you want to use our Drosophila Melanogaster example data. If you don't know the fragment length (also called shift) of your raw reads, Graph Peak Caller can estimate it. See description later in this guide).
graph_dir should be the directory where you have all your graph files (.nobg files, vg json files etc).
You might want to filter/trim your raw ChIP-seq reads in order to remove adapters and parts of reads with low quality. You can e.g. use Trim Galore to do this.
Download example reads for the Drosophilia Melanogaster SQZ transcription factor by clicking this link.
Assuming you have your reads in ENCFF156YTQ.fastq.gz , we simply call vg map to map our reads to the graph:
vg map -f ENCFF156YTQ.fastq.gz -x $graph_dir/wg.xg -g $graph_dir/wg.gcsa > mapped.gam
Change wg.xg and wg.gcsa to the file names of your indices (you don't need to change if you are using our data). Mapping typically takes a few hours if run on a laptop. Note that vg map has a lot of parameters that can be tweaked, some of which might improve mapping (especially if mapping to a huge graph).
To simplify this tutorial, we are not mapping control reads, but if you also have a set of control reads, this is the time to map these.
mapped.gam now contains all your reads, also those that did not map well. We use vg filter to filter out bad mappings:
vg filter -r 0.95 -s 2.0 -q 60 -fu mapped.gam > filtered.gam
You can play around with the different parameters of vg filter. We don't go into details on what these mean here. If you have a control track, make sure to also filter that one. Filtering can take typically take 10-30 minutes.
Graph Peak Caller only accepts mapped reads in JSON format, since Python Proto is very slow compared to JSON. Converting to json using vg is simple:
vg view -aj filtered.gam > filtered.json
This cantake 10-20 minutes.
To be able to distribute the peak calling to one process for each chromosome, we split our filtered alignments into one file for each chromosome:
graph_peak_caller split_vg_json_reads_into_chromosomes $chromosomes filtered.json $graph_dir
We are finally ready to start peak calling. Since we have multiple chromosomes, we will do this in two steps. We will first predict p-values on each chromosome, and then run peak calling using these p-values (in order to do correct multiple testing adjustment). You can choose to only run peak calling on a few chromosomes if you want to speed things up.
We will store the list of chromosomes that we want to do peak calling on in a variable, in order to write less code later. Note that this list needs to match your graph names: If your graphs are called 1.nobg, 2.nobg,... etc, your list will be "1,2,... etc". If your graphs are called chr1.nobg, chr2.nobg,... etch, your list will be "chr1,chr2,... etc". If you created vg graphs yourself, the names will depend on your input vcf and fasta file.
If you are using our Drosophila Melanogaster graphs, this will give you the complete set of chromosomes:
chromosomes="chr3R,chr3L,chr2R,chr2L,chrX,chr4"
If you are using our 1000 genomes human graphs, you can use this:
chromosomes="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X"
For Arabidopsis Thaliana, use:
chromosomes="1,2,3,4,5"
If you don't have a fragment length now, you can let Graph Peak Caller estimate it by calling graph_peak_caller estimate_shift $chromosomes $graph_dir/ filtered 5 100 where filtered is the base name of your sample alignments. Set fragment_length= the number outputed from that command (shift).
Since we are splitting the peak calling into a separate process for each chromosome, we ned to send in the total number of unique reads (for all chromosomes) and total genome size in order for Graph Peak Caller to correctly predict the background signal. Run this to get the number of unique reads:
unique_reads=$(pcregrep -o1 '"sequence": "([ACGTNacgtn]{20,})"' filtered.json | sort | uniq | wc -l)
Also, set genome_size to the "linear genome size". You can use 3100000000 for the human graph, 135000000 for Arabidopsis Thaliana and 98000000 for Drosohilia Melanogaster
genome_size=98000000 # Change to correct number for the species you are using
Make sure that the $fragment_length and $read_length that we created at the beginning of this tutorial are still set:
echo $fragment_length
echo $read_length
You can now start peak calling for all your chromosomes by running the following script. Note the & at the end of the line, which means that peak calling for each chromosome will start as a separate background process. Omit this if you do not want to run each chromosome in parallel. If you have a control track, make sure to include -c control_ (without file ending). You can always type graph_peak_caller callpeaks for an overview of the different parameters.
for chromosome in $(echo $chromosomes | tr "," "\n")
do
graph_peak_caller callpeaks -g $graph_dir/$chromosome.nobg -s filtered_$chromosome.json -n "" \
-f $fragment_length -r $read_length -p True -u $unique_reads -G $genome_size &
done
On the human reference graph, this will take several hours. On Drosophila Melanogaster, this should take about half an hour for the biggest chromosome and only a few minutes for small ones.
Wait until all chromosomes have finished before continuing.
You now have a set of files containing the p-values. In the final step, we are simply computing q-values (adjusted p-values) for each chromosome, using all the p-values:
for chromosome in $(echo $chromosomes | tr "," "\n")
do
graph_peak_caller callpeaks_whole_genome_from_p_values \
-d $graph_dir/ -n "" -f $fragment_length -r $read_length $chromosome &
done
You now have a set of peaks for each chromosome. We can merge these into a single sorted file of all peaks:
graph_peak_caller concatenate_sequence_files $chromosomes all_peaks.fasta
You might be interested in validating your results by comparing the peaks to peaks found by Macs2 on a linear reference genome. This is possible as long as you have Macs2 peaks that have been called using the same linear reference genome that goes through your graph.
You will need a linear genome fasta reference file (indexed):
linear_genome_fasta_file="$graph_dir/dm3.fasta"
We first split the linear peaks into one file for each chromosome:
for chromosome in $(echo $chromosomes | tr "," "\n")
do
grep "^${chromosome}\s" peaks.narrowPeak > macs_peaks_chr${chromosome}.bed
graph_peak_caller linear_peaks_to_fasta peaks_chr${chromosome}.bed $linear_genome_fasta_file macs_sequences_chr${chromosome}.fasta
done
Assuming you are still in the same directory where your peak calling results and Macs2 peak result files are, you can now run the analysis:
graph_peak_caller analyse_peaks_whole_genome $chromosomes ./ . out_file_name