-
-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Full fragment coverage mode for proper pairs #246
Conversation
i haven't looked at code yet, just your comment, but for |
Good point. I've simplified it to only count |
I've added a functional test and a small bam file (8 simulated reads) that has 2 fragments with isize > 2x read size and 2 fragments with isize < 2x read size. It seems to work as expected, unless I've mixed up some 0/1-indexing somewhere. Locally, it fails at the d4 test step but I think that's an installation thing. And the CI checks pass here. So, I think it's ready for you to have a look at. No rush though! Here's an overview of the bam file: And the mosdepth output: Edit: There's an issue with how I generated the bam file (the unpaired reads have proper-pair flags even though they're not paired, so it counts them). I will fix the bam file tomorrow. |
Is it too strict to only include the proper pairs? We obviously need a filter like that to ensure the combined reads represent an actual fragment. Wondering whether there is a less strict set of conditions for that though? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great! just minor changes.
thank you!
hah, I just had the same thought while reviewing. open to brainstorming, but seems reasonable. |
thanks! i made a new release here: https://github.com/brentp/mosdepth/releases/tag/v0.3.11 |
Perfect! |
Whereas the current version of mosdepth only calculates the coverage in the positions of the observed reads, the actual biological coverage in a paired-end sequencing would surely include positions in-between the reads (when 2x read size < fragment length). This PR adds a fragment-based coverage mode (--fragment-mode) where the entire fragment between paired-end reads is counted. It is a separate mode to fast-mode and the normal mode (I don't really understand the cigar operations part of the code, so correct me if these need to be supported here as well).
rec.start < rec.matepos
.rec.start == rec.matepos
, we use a hash set (seen_ids
) to ensure each read name (qname
) is counted only once.rec.start
and decrements atrec.start + abs(rec.isize)
, preserving a start/end approach for efficient calculation with cumsum() later.This was the cleanest initial approach, I could come up with. I haven't tested it yet, but wanted to start the conversation.