Decoding haplotypes without trees#3298
Draft
benjeffery wants to merge 11 commits intotskit-dev:mainfrom
Draft
Conversation
75e1002 to
af7f033
Compare
Member
|
It does looks like a neat idea, but I think performance benefits are not compelling enough (on use-cases that we currently have) to put it on the critical path? |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3298 +/- ##
===========================================
- Coverage 89.79% 48.70% -41.09%
===========================================
Files 29 18 -11
Lines 30962 8479 -22483
Branches 5664 1422 -4242
===========================================
- Hits 27803 4130 -23673
- Misses 1775 4176 +2401
+ Partials 1384 173 -1211
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Member
Author
|
Yep, not something I want to hold up a release for or put any serious work into. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
I started to have a look at #1896 where one of the possible solutions is decoding haplotypes in C. (The current methods all do this by iterating over sites then returning the full samples*sites matrix).
Recent work on
parent_index_arrayin the JIT code gave me an idea, I then couldn't sleep until I had tried it. The code in this PR is NSFJK (Not Safe For Jerome) but I thought I would share it now as I think it is interesting and I have to draw a line! Iteresting bits are hereThe outline is:
resolvedover sites and aresultstring of ancestral site states.resultof the mutations site to the derived allele, marking these sites asresolvedresultandresolved. Push any parent edges that have unresolved sites on the stackresultshould now be your node's haplotypeHere is the perf on a tree sequence which is a simplified subset of 100k samples of chr 21 Quebecois sim:

The summary being "if you are extracting less than 1000 genotypes the haplotype way is faster" and "if you're grabbing one node it can be 100 times faster"
I have done a little perf work, but I think there are probably big wins to be had by clever caching of haplotypes of nodes higher up the tree or something like that, and not calculating the full parent edge index when you're grabbing a slice of the genome.