
Running the project

mcmaniou edited this page Jan 19, 2021 · 6 revisions

The framework consists of three scripts:

  • UMIsProject.R: main script.
  • casesWorkflows.R: includes functions called by the main script.
  • functions.R: includes functions called by the main script.

Inputs

Before running the project, the user must set the appropriate input parameters in the main script UMIsProject.R.

The main script has the following inputs:

  • pairedData: boolean variable that indicates whether the data are paired-end (T) or single-end (F).
  • UMIlocation: variable that indicates whether the UMI is located only in Read1 (R1) or in both Read1 and Read2 (R1 & R2).
  • UMIlength: the length of the UMI sequence.
  • sequenceLength: the length of the read sequence.
  • countsCutoff: minimum read count per UMI, used for initial data cleaning.
  • UMIdistance: maximum UMI distance allowed for UMI merging.
  • sequenceDistance: maximum sequence distance allowed for UMI merging.
  • inputsFolder: name or filepath of the inputs folder.
  • outputsFolder: name or filepath of the outputs folder, default value is UMIc_output.
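
As a rough illustration, the parameter block in UMIsProject.R might look like the sketch below. The variable names come from the list above; every value is a placeholder chosen for this example (the UMI length of 12 matches the example reads on this page), and the exact strings accepted by UMIlocation are assumed here to be "R1" and "R1 & R2".

```r
# Hypothetical parameter values; adjust them to your own library.
pairedData       <- TRUE              # paired-end data (T) or single-end (F)
UMIlocation      <- "R1"              # assumed spelling; "R1 & R2" if both reads carry a UMI
UMIlength        <- 12                # length of the UMI sequence
sequenceLength   <- 150               # placeholder read length
countsCutoff     <- 5                 # placeholder minimum read count per UMI
UMIdistance      <- 1                 # placeholder maximum UMI distance
sequenceDistance <- 3                 # placeholder maximum sequence distance
inputsFolder     <- "path/to/inputs"  # placeholder path
outputsFolder    <- "UMIc_output"     # default value named above
```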

The input data must be provided in fastq files, and the UMI is assumed to be located at the beginning of each sequence, e.g.:

@M03403:12:000000000-CNPJD:1:1101:17452:1456 1:N:0:CCCTCATC+CTGTCGCT
GTAAAACGACGGCCAGTCTCTCACTTCAATCCTTACCATCAAGTCCGTAGAGAAAGAAGACATGGCCGTTTACTACTGTGCTGCGTGGGACGGAGAAACTCTTTGGCAGTGGAACAACACTCCCTATAGTGAGTCGTATTAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCCTCATCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAATGAGTTCTGTAAAAATGAATATTAGTGACGTTT
+
AB33ADFBBDBBGAEGGGGFGGHHHHF55FEEGHHHHGHHB3AGHGGEGGFHGAGHHHGFGHHHGAFEEFCGHHHHHHGHHEHF?EGGGHHGGE?/0FFGHHHHHEHGGHHHHHHHHHGGHHHHHHHFGHFBFHHCGHHHBDGHHBFD?FGHHHFFHHGHEHFBGHHHGHHDDHGDHGEDCGF0C0:GF::CF00;.9F.FG0FFFF.0;FF@-;=9-9//:9//:/:/9///.9////////////.../
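
To make the layout concrete, here is a minimal sketch of how a record could be split into UMI and remaining read when the UMI sits at the start of the sequence. The helper name split_umi is hypothetical and is not part of the framework; the strings are the first 32 bases and quality characters of the example record above.

```r
# Hypothetical helper: split one fastq record into the UMI and the rest of the read,
# assuming the UMI occupies the first `umi_length` bases.
split_umi <- function(sequence, quality, umi_length) {
  stopifnot(nchar(sequence) == nchar(quality), nchar(sequence) > umi_length)
  list(
    umi     = substr(sequence, 1, umi_length),
    read    = substr(sequence, umi_length + 1, nchar(sequence)),
    quality = substr(quality, umi_length + 1, nchar(quality))
  )
}

record <- split_umi(
  sequence   = "GTAAAACGACGGCCAGTCTCTCACTTCAATCC",
  quality    = "AB33ADFBBDBBGAEGGGGFGGHHHHF55FEE",
  umi_length = 12
)
record$umi   # "GTAAAACGACGG"
record$read  # "CCAGTCTCTCACTTCAATCC"
```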

All input files must have been generated with the same library preparation protocol and must satisfy the same input parameters described above.

Running the project

In order to run the project, set the R folder as your working directory, set the input parameters in the main script UMIsProject.R and then use the following command:

source("UMIsProject.R")
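
The two steps (change to the R folder, then source the main script) can also be wrapped in a small function. This is only a sketch: run_project is a hypothetical name, and the path in the example call is a placeholder.

```r
# Hypothetical wrapper: run the main script from inside the R folder.
run_project <- function(r_dir) {
  script <- file.path(r_dir, "UMIsProject.R")
  if (!file.exists(script)) {
    stop("UMIsProject.R not found in: ", r_dir)
  }
  old_wd <- setwd(r_dir)     # setwd() returns the previous working directory
  on.exit(setwd(old_wd))     # restore it when the function exits
  source("UMIsProject.R")    # same command as above
  invisible(NULL)
}

# Example call (placeholder path):
# run_project("path/to/R")
```

Restoring the previous working directory with on.exit keeps the rest of an interactive session unaffected.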

The project provides example input datasets and their outputs for testing purposes. The folder data includes example datasets for all three scenarios in their corresponding subfolders. Each subfolder includes the fastq files and a Readme.md file with the parameter values used to generate the files in the folder outputs. The user must provide the filepaths of the input and output folders.

Outputs

The output data are also stored in fastq files, named after the input files with an added _corrected suffix. The name of the output folder can be set by the user. The files contain the corrected sequences (without the UMI) and their quality. The new sequence ID is constructed by combining the ID of one of the input sequences that carry that UMI with the UMI itself, e.g.:

@M03403:12:000000000-CNPJD:1:1101:17452:1456 1:N:0:CCCTCATC+CTGTCGCT GTAAAACGACGG
CCAGTCAGTCTAAACCCTCCCTCCCGTGATACCCGTCGATGAACAGGGATTGGGCGAGGGGCACTAGCCTGTGTGAGACGTAATTGTTTTATGGTGGAACATCATGACTCCTAGTGCTTAGTAGTAGTAAGGGCGGAAGAGCACCCGGCACTCCAGCAAACACTCCTCATCCCCTCTGTAGTCAGCTTCTTGAAGAAAAAAAAAGAACACTAATCACCACTCATCACCCCTCATCTCGT
+
FFFGGGFFFFGGFGGGGGGFGGGFGFGFFFFFGGGGGFFEFFFGGGGFGEFEEEFEFFEEEGFFGFGFFGFGFFEEFFGEGEGGGEGGGEFGGFFFFFGFGGGGFFFFGFFGFFFFFFFEFFFEFFFFFFFFEFEEEEEFEEFEEFEEFEFFFFFFEEEEFEFEFEFFFFEEECEEEDDDDEDDDEEDDEEDDDDCDDDCC@;?C<;@:A=9;?@=?A;@>@=>A=@?A??A@?@;@@;
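
Judging from the example above, the new ID is the input ID followed by a space and the UMI. A minimal sketch of that construction, with variable names chosen for illustration only:

```r
# ID of one input sequence carrying the UMI, and the UMI itself
# (both taken from the examples on this page).
input_id <- "@M03403:12:000000000-CNPJD:1:1101:17452:1456 1:N:0:CCCTCATC+CTGTCGCT"
umi      <- "GTAAAACGACGG"

# paste() joins its arguments with a single space by default.
new_id <- paste(input_id, umi)
new_id
```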

The framework also produces a csv file that contains all the information of the output fastq files, plus extra information that helps trace each output sequence back to its corresponding input sequences. The file is named after the Read1 fastq file with an added _summary_table suffix. Each row of the table is an output sequence, with the following columns:

  • UMI: the new UMI of the sequence.
  • UMIs: the merged UMIs that produced the sequence, multiple values are separated with "|".
  • counts: the read counts of the merged UMIs, multiple values are separated with "|".
  • read1: the read1 of the corrected sequence.
  • quality1: the corresponding quality of the read1 of the corrected sequence.
  • read2: the read2 of the corrected sequence.*
  • quality2: the corresponding quality of the read2 of the corrected sequence.*
  • ID1: the new ID of the read1 of the corrected sequence.
  • ID2: the new ID of the read2 of the corrected sequence.*

* These columns exist only for paired-end libraries.
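
The UMIs column lists the UMIs that were merged into one output sequence. This page does not name the distance metric behind UMIdistance; assuming a Hamming distance (number of mismatching positions between equal-length UMIs), the relationship between merged UMIs can be checked as in the sketch below. The function name hamming_distance is hypothetical, and the UMIs are taken from the example table on this page.

```r
# Hypothetical helper: Hamming distance between two equal-length strings.
# Assumption: the framework's UMI distance is not confirmed to be Hamming.
hamming_distance <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

representative <- "CTGGTACTTCTA"
merged <- c("CTGGTACTTCTA", "CTGGTACTTCTC", "CTTGTACTTCTA", "CTGTTACTTCTA")

# Distance of every merged UMI to the representative UMI.
distances <- vapply(merged, hamming_distance, integer(1), b = representative)
unname(distances)  # 0 1 1 1
```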

An example of the table is provided below:

| UMI | UMIs | counts | read1 | quality1 | read2 | quality2 | ID1 | ID2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CCTTAATCAAGT | CCTTAATCAAGT\|CCTTAATCACGT | 1514\|15 | ATGGGAAAGAGTGTCCCTGGGGGGTCCCTGAG... | EFDFDFGFGGGHGHHHHHGGFFFFFFGHHGGGGH... | CTTACCTGAGGAGACGGTGACCAGGGTTCCC... | @ABAAEFECDAEEBEECEFDFFGGFGGGGGGHHGGG... | M03403:12:000000000-CNPJD:1:1101:17294:1642 1:N:0:CCCTCATC+CTGTCGCT CCTTAATCAAGT | M03403:12:000000000-CNPJD:1:1101:17294:1642 2:N:0:CCCTCATC+CTGTCGCT CCTTAATCAAGT |
| GTGAAACCACCT | GTGAAACCACCT | 1432 | AGATCCGATCCGTATTCACAGTCCGATTATCTGG... | CDFGFFGEFFFGFGGGGGGGGGGGFEFFFGGHGGGGFFEF... | CCGACCAGATAATCGGACTGTGAATACGGATCGGAT... | @A@@ABAEBEEFFFEDECEGGFGGGGGHFFFFFFDFF... | M03403:12:000000000-CNPJD:1:1101:22295:3280 1:N:0:CCCTCATC+CTGTCGCT GTGAAACCACCT | M03403:12:000000000-CNPJD:1:1101:22295:3280 2:N:0:CCCTCATC+CTGTCGCT GTGAAACCACCT |
| CTGGTACTTCTA | CTGGTACTTCTA\|CTGGTACTTCTC\|CTTGTACTTCTA\|CTGTTACTTCTA | 1375\|58\|10\|5 | ATGGGAAAGAGTGTCCAGGTGCAGCTGGTGGA... | CEDFDFFFGFGGGGGGGGGFGGGGGGHGFGFGGGG... | CTTACCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCC... | ?AAAAEEEBDAEEADD@EFCFFGFEFGGGFGGG... | M03403:12:000000000-CNPJD:1:1101:13898:3987 1:N:0:CCCTCATC+CTGTCGCT CTGGTACTTCTA | M03403:12:000000000-CNPJD:1:1101:13898:3987 2:N:0:CCCTCATC+CTGTCGCT CTGGTACTTCTA |
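
Because the UMIs and counts columns pack several values into one field separated by "|", they need to be split before further analysis. The sketch below assumes the summary file is comma-separated and uses a reduced inline table (three of the columns, values copied from the example above) in place of a real file. The derived columns n_merged and total_reads are illustrative names, not part of the framework's output.

```r
# Reduced, inline stand-in for a *_summary_table csv file (assumed comma-separated).
csv_text <- "UMI,UMIs,counts
CCTTAATCAAGT,CCTTAATCAAGT|CCTTAATCACGT,1514|15
GTGAAACCACCT,GTGAAACCACCT,1432
CTGGTACTTCTA,CTGGTACTTCTA|CTGGTACTTCTC|CTTGTACTTCTA|CTGTTACTTCTA,1375|58|10|5"

# Read every column as character so that single-count rows are not
# converted to numbers while multi-count rows stay as text.
summary_table <- read.csv(text = csv_text, colClasses = "character")

# Split on the literal "|" (fixed = TRUE avoids regex interpretation).
merged_umis   <- strsplit(summary_table$UMIs, "|", fixed = TRUE)
merged_counts <- lapply(strsplit(summary_table$counts, "|", fixed = TRUE), as.integer)

# Illustrative derived columns: number of merged UMIs and summed read counts.
summary_table$n_merged    <- lengths(merged_umis)
summary_table$total_reads <- vapply(merged_counts, sum, integer(1))

summary_table[, c("UMI", "n_merged", "total_reads")]
```

For a real run, the inline text would be replaced by the path of the summary file, e.g. read.csv("path/to/summary.csv", colClasses = "character"), where the path is a placeholder.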
