Skip to content

Add molecular position stratification to CollectSamErrorMetrics#2035

Open
yfarjoun wants to merge 9 commits intobroadinstitute:masterfrom
yfarjoun:yf_add_end_of_insert_stratifier
Open

Add molecular position stratification to CollectSamErrorMetrics#2035
yfarjoun wants to merge 9 commits intobroadinstitute:masterfrom
yfarjoun:yf_add_end_of_insert_stratifier

Conversation

@yfarjoun
Copy link
Contributor

@yfarjoun yfarjoun commented Feb 3, 2026

Description

  • Adds a new "molecular position" stratification for ReadBaseStratification
  • Unlike cycle position, this stratification provides consistent answers for overlapping bases from paired reads
  • Includes tests for the new stratification

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on github actions

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

…from cycle as it tries to give the same answer for overlapping bases)
…one errors and edge cases (like positions that are after the end of the molecule, according to the other read...)
READ_GROUP(() -> readgroupStratifier, "The read-group id of the read."),
CYCLE(() -> baseCycleStratifier, "The machine cycle during which the base was read."),
BINNED_CYCLE(() -> binnedReadCycleStratifier, "The binned machine cycle. Similar to CYCLE, but binned into 5 evenly spaced ranges across the size of the read. This stratifier may produce confusing results when used on datasets with variable sized reads."),
MOLECULAR_POS(() -> baseMolecularPosStratifier, "The molecular position of the read-base (distance from the end of the molecule, negative if closer to read 2's 3' end)."),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could use some more explanation - perhaps you could give an example? E.g. if I have a 200bp insert with 2x75bp reads, what do I get for the 1st base R1 and the first base in R2?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now having read the javadoc on the implementation I have three thoughts:

  1. I think the name should imply more what's being done - my nomination would be some version of {molecule|insert|template} x {endproximity|enddistance|etc.} to make it clear that it's the distance to the nearest end of the molecule.
  2. I think given that the values produced are not immediately obvious, I think I would increase the docs to include "Values are 1..n for bases closer to the R1 end of the molecule and -1..-n for bases closer to the R2 end of the molecule."
  3. I think the docs here should also state clearly that the stratifier only works on PE data with both reads mapped in FR orientation and a reasonable insert size.

samFileHeader.addSequence(samSequenceRecord);
final SAMReadGroupRecord readGroupRecord = new SAMReadGroupRecord("rgID");
samFileHeader.addReadGroup(readGroupRecord);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Feb 3, 2026

Thanks @tfenne. Back to you with changes.

CYCLE(() -> baseCycleStratifier, "The machine cycle during which the base was read."),
BINNED_CYCLE(() -> binnedReadCycleStratifier, "The binned machine cycle. Similar to CYCLE, but binned into 5 evenly spaced ranges across the size of the read. This stratifier may produce confusing results when used on datasets with variable sized reads."),
INSERT_END_DISTANCE(() -> baseInsertEndDistanceStratifier, "Distance from the nearest insert end for FR read pairs. " +
"Negative if closer to read 2's 3' end, e.g., 150x2 reads with 100bp overlap will have values (R1) 51, .., 100, -100, .., -51 for the overlapping region. " +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  • I think it's clearer to say "closer to the start of read 2" because I interpret the "3' end of read 2" to mean the last base in read 2, not the first
  • Does this really only output values in the overlapped region? I thought it output values for all positions so long as you had an FR insert?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants