Skip to content

Filter reads for gc bias (mapq and interval)#2030

Open
bwlang wants to merge 9 commits intobroadinstitute:masterfrom
bwlang:filter_gc_bias
Open

Filter reads for gc bias (mapq and interval)#2030
bwlang wants to merge 9 commits intobroadinstitute:masterfrom
bwlang:filter_gc_bias

Conversation

@bwlang
Copy link
Contributor

@bwlang bwlang commented Jan 9, 2026

Description

I added a filtration option to the GC bias tool to allow users to exclude known problematic regions (e.g. from boyle et. al.) and clean up "spikes" in the GC bias due to poly-G stretches mapping or adapter constructs, etc. I also added a min mapq feature for similar reasons.


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on github actions

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@bwlang
Copy link
Contributor Author

bwlang commented Jan 11, 2026

I've also tested with larger bam files and small intervals. The interval checking overhead is less than 1% and only when an interval argument is provided.

@bwlang
Copy link
Contributor Author

bwlang commented Feb 3, 2026

@yfarjoun : it was before your #2031 PR, but i think I followed the recommendations it outlines.

@bwlang
Copy link
Contributor Author

bwlang commented Feb 19, 2026

@yfarjoun do you have any sense of whether this is likely to be merged or needs attention?

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @bwlang and sorry for the slow review process.

I have some questions and suggestions, but overall, I think that this would be a welcome addition.

Comment on lines 712 to 713
writer.write("track name=\"Test Track\" description=\"Test intervals for GC bias\"\n");
writer.write("browser position chrM:1-16571\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with this as a valid BED format. the formalized version is here: https://samtools.github.io/hts-specs/BEDv1.pdf

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why generate the BED file programmatically? why not add it to an appropriate subdirectory in testdata/picard ? (same question for the interval list)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen these in the wild... and it's documented here: https://grch37.ensembl.org/info/website/upload/bed.html
I'll add some comments to that effect unless you'd rather just reject such files.

I thought i got feedback in an earlier PR (from Nils i think) that a generated version was preferred - maybe that was just for that case. I'm fine either way.

I'll add the bed parser and test it with a file if you think that's clearest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the page you are referenceing that example seems to be under the "bedGraph" (not bed) format. I'd rather go by the (now) official format from hts-specs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding the generated vs. stored file. I don't have strong feelings. I find the stored file cleaner (no need to delete and clutter the test code with generation lines), and one can simply look at the file to understand what's in it. The downside is that it can result in a collection of odd files in the testdata directory that no-one knows exactly what test they are for.... perhaps this needs to be decided at an "official" level @jonn-smith?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the page you are referenceing that example seems to be under the "bedGraph" (not bed) format. I'd rather go by the (now) official format from hts-specs.

I think the portion that describes the track lines is in the bed section, it's just that they indicate them to be "compulsory" in the bedgraph section. I also think it's strange to have these reserved words (why not just put them behind a comment character?), but on the other hand it's annoying for users to have to munge their input files...

I'll remove track and accept only # or blank lines unless you change your mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more "bed specification" : https://genome.ucsc.edu/FAQ/FAQformat.html#format1 (allows "track" and "browser" ) headers.

Copy link
Contributor Author

@bwlang bwlang Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the latest push removes the track support but keeps the "generator" pattern for bed test... let me know how you want to go there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather stick to the GA4GH specs and if someone complains, we can point them there... 🤷

since @nh13 asked for the generator pattern, I'm not going to ask you to change back.. :-)

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that there's a rebase problem, or perhaps your version needs rebaseing, but I see lots of irrelevant (and new) changes...

also, I did not mean that your entire code (with the bed/interval-list sniffing) should be plopped into BedToIntervallist. what I meant is that as you can see in that class there's code for reading a bed file and generating features. I suggest that you take that (small) part, extract it from BedToIntervalList (into a new utility class) and call the appropriate method from both BedToIntervalList and CollectCGBias (once you know you need to read a bed file).

Sorry for not being clear about this the first time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants