Add molecular position stratification to CollectSamErrorMetrics#2035
Open
yfarjoun wants to merge 9 commits intobroadinstitute:masterfrom
Open
Add molecular position stratification to CollectSamErrorMetrics#2035yfarjoun wants to merge 9 commits intobroadinstitute:masterfrom
yfarjoun wants to merge 9 commits intobroadinstitute:masterfrom
Conversation
…from cycle as it tries to give the same answer for overlapping bases)
…one errors and edge cases (like positions that are after the end of the molecule, according to the other read...)
tfenne
requested changes
Feb 3, 2026
| READ_GROUP(() -> readgroupStratifier, "The read-group id of the read."), | ||
| CYCLE(() -> baseCycleStratifier, "The machine cycle during which the base was read."), | ||
| BINNED_CYCLE(() -> binnedReadCycleStratifier, "The binned machine cycle. Similar to CYCLE, but binned into 5 evenly spaced ranges across the size of the read. This stratifier may produce confusing results when used on datasets with variable sized reads."), | ||
| MOLECULAR_POS(() -> baseMolecularPosStratifier, "The molecular position of the read-base (distance from the end of the molecule, negative if closer to read 2's 3' end)."), |
Collaborator
There was a problem hiding this comment.
I think this could use some more explanation - perhaps you could give an example? E.g. if I have a 200bp insert with 2x75bp reads, what do I get for the 1st base R1 and the first base in R2?
Collaborator
There was a problem hiding this comment.
Now having read the javadoc on the implementation I have three thoughts:
- I think the name should imply more what's being done - my nomination would be some version of {molecule|insert|template} x {endproximity|enddistance|etc.} to make it clear that it's the distance to the nearest end of the molecule.
- I think given that the values produced are not immediately obvious, I think I would increase the docs to include "Values are 1..n for bases closer to the R1 end of the molecule and -1..-n for bases closer to the R2 end of the molecule."
- I think the docs here should also state clearly that the stratifier only works on PE data with both reads mapped in FR orientation and a reasonable insert size.
| samFileHeader.addSequence(samSequenceRecord); | ||
| final SAMReadGroupRecord readGroupRecord = new SAMReadGroupRecord("rgID"); | ||
| samFileHeader.addReadGroup(readGroupRecord); | ||
|
|
Contributor
Author
|
Thanks @tfenne. Back to you with changes. |
- add test.
tfenne
approved these changes
Feb 11, 2026
| CYCLE(() -> baseCycleStratifier, "The machine cycle during which the base was read."), | ||
| BINNED_CYCLE(() -> binnedReadCycleStratifier, "The binned machine cycle. Similar to CYCLE, but binned into 5 evenly spaced ranges across the size of the read. This stratifier may produce confusing results when used on datasets with variable sized reads."), | ||
| INSERT_END_DISTANCE(() -> baseInsertEndDistanceStratifier, "Distance from the nearest insert end for FR read pairs. " + | ||
| "Negative if closer to read 2's 3' end, e.g., 150x2 reads with 100bp overlap will have values (R1) 51, .., 100, -100, .., -51 for the overlapping region. " + |
Collaborator
There was a problem hiding this comment.
Two things:
- I think it's clearer to say "closer to the start of read 2" because I interpret the "3' end of read 2" to mean the last base in read 2, not the first
- Does this really only output values in the overlapped region? I thought it output values for all positions so long as you had an FR insert?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests