Refactor low-level code#830
Merged
jeromekelleher merged 34 commits intotskit-dev:mainfrom Mar 27, 2026
Merged
Conversation
Member
|
Looking good on a quick look through! Will have a proper look soon. |
Member
Author
|
No rush - I'm just pushing here to be able to refer to it and for discussion when we're interested. |
f907061 to
2a7a848
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #830 +/- ##
===========================================
- Coverage 86.44% 74.29% -12.15%
===========================================
Files 18 18
Lines 5651 6653 +1002
Branches 896 1074 +178
===========================================
+ Hits 4885 4943 +58
- Misses 593 1609 +1016
+ Partials 173 101 -72
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Member
Author
|
I'm bringing this branch up to date now with the intention of merging in as soon as it's usable. |
a774ed3 to
679cd8c
Compare
from tree_sequence_builder and workign directly on the tree sequence instead. Initial dump of required HMM code First steps - isolated code works with shim Builder class Some more tests Remove TreeSequenceBuilder g Fake the sortedcontainers API for now Fully move to sorted lists of edges. Stuff Partial Abstract out the MatcherIndexes class Pull out stuff necessary for static match Compiling version without tree_sequence_builder Partway through getting C tests running add test dataq C code working and some tests Factor out the "output" struct in matcher Some basics for new Matcher infrastructure compiling with lwt interface partial update to use table collection Convert C code to use tables Roughly working Python-C infrastructure Add basic debug support to the MatcherIndexes Basic Python-C linkage works :hooray: Basic high-level infrastructure for matching Rename file Refactor the Matcher infrastructure Improve class infrastructure Add vestigial root automatically Fix up tests to remove hard-coded virtual root Work on making matcher work with edges not on site values Roughtly working version with edges on genome coords Python version looks like it's working Add sites_position storage and coordinate_t type Matching working in C (looks like) Infer start and end from haplotype Rough implementation of flank skipping Change python code to use coords in path Implement coordinate paths in C Implement some cludges to support initial zero site-paht Minor updates Sort-of working driver script Fiddle with some tricky issues Rename the test file to avoid collisions
16 tests across 4 hand-crafted tree topologies (star, binary, two-tree, deep chain) with exact path and mutation assertions from the current AncestorMatcher. These serve as regression tests when swapping to MatcherIndexes / AncestorMatcher2.
Update the AncestorMatcher2 Python wrapper so find_path accepts the same (h, start, end, match_out) arguments and returns the same (left, right, parent) tuple as the current AncestorMatcher. Remove Path, Match, find_match and zero_sites_path from matching.py. Move Path and Match to test_lshmm.py where they are used by the Python reference implementation.
679cd8c to
3cbd349
Compare
Try fixing symlinks on windows Try safe-directory Do simple solution Add python versions fixup Add coverage upload Use new workflow fix use hte tag
…estorMatcher2 Set TSINFER_MATCHER=matcher2 to use the new AncestorMatcher2/MatcherIndexes implementation behind the Matcher interface. Default behaviour unchanged.
Run the Python reference matcher on star, binary, two-tree, and deep-chain topologies built via make_root_ts + extend_ts. Adds vestigial_root flag to MatcherIndexes and run_match_python so tsinfer-produced topologies can skip add_vestigial_root.
run_match_tsinfer now accepts c_too flag to also run C AncestorMatcher2 (currently segfaults on tsinfer topologies, so off by default). Python reference passes all fixture tests.
Build star, binary, two-tree, and deep-chain topologies in C matching the Python test_matcher_fixtures. All four segfault in ancestor_matcher2_find_path, reproducing the known bug.
matcher_indexes_copy_mutation_data only set sites.mutations and sites.num_alleles for sites with mutations, leaving sites without mutations uninitialised. Initialise all entries in the position-copy loop so every site has a valid NULL mutations pointer and num_alleles.
Assert path_length, left, right, and parent for each query haplotype against the known results from test_matcher_fixtures.py.
Wrap legacy AncestorMatcher to convert site-index edges to genomic positions, mapping left=positions[0] to 0 for consistency. Update Matcher._match_one to use genomic positions directly. Both matchers now pass all 16 fixture tests identically.
Log RSS after Matcher construction completes, include RSS in the per-group completion message, and log the cache budget at startup. Helps trace OOM kills during matching.
Add num_alleles parameter to matcher_indexes_alloc in C; if NULL, defaults to biallelic (2). Thread through Python/C binding and MatcherIndexes wrapper. Fixes TSI_ERR_BAD_HAPLOTYPE_ALLELE on multi-allelic sites. Add triallelic C test.
… budget register_loader had chunk_bytes default to 0. The lazy reader path omitted it, so the 7.4 GB sample chunk bypassed the budget check and was loaded unchecked, spiking memory and triggering the OOM killer. Make chunk_bytes a required argument so this class of bug is caught at call time.
Add periodic memory logging
During large groups the only RSS logging was after group completion, giving no visibility when the process is OOM-killed mid-group. Now every 30s we log cache state, force a GC (reporting freed memory), and take a tracemalloc snapshot diff showing the top 5 Python allocation growth sites plus untraced (C extension) memory.
The futures dict held all 71K futures alive for the entire group, retaining per-haplotype numpy arrays from find_path (~2-4 MB each). Deleting each future after yielding its result frees these immediately.
Replace TSI_ERR_GENERIC with a dedicated TSI_ERR_MULTIPLE_MUTATIONS_AT_SITE error code so users get a clear message instead of "please file a bug report". Add a C test for this error condition.
…ownership Py_BuildValue with O format increments refcount, but the code then set local pointers to NULL preventing cleanup Py_XDECREF from running, leaving each returned numpy array with a permanent refcount of 1. Switched to N format which transfers ownership without incrementing. Also added missing Py_XDECREF for haplotype_array in AncestorMatcher2.
01481d0 to
b032846
Compare
Replace precision-based rounding with the legacy matcher's likelihood threshold approach: L[u] = TSK_MAX(L[u] / max_L, threshold). Default threshold is DBL_MIN, matching the legacy behaviour. Add TSI_DISABLE_WEIGHT_BY_N flag support so N-weighting can be toggled. Both parameters are exposed in the Python wrapper.
Remove unused run_match_python() and update run_match_tsinfer() to always compare Python and C matcher results, now that AncestorMatcher2 is reliable on tsinfer topologies.
Switch traceback_allocator from tsk_blkalloc_t to tsi_blkalloc_t so that total_memory reads are thread-safe via C11 atomics. Add the TSI_NO_ATOMICS guard to the Python getter matching the legacy pattern. Add tests for total_memory and mean_traceback_size properties.
Switch from tskit's tsk_blkalloc to tsinfer's tsi_blkalloc, which has an atomic total_size field for thread-safe memory observation. This makes matcher_indexes_t consistent with all other tsinfer structs.
…ision value The ancestor_matcher2_alloc signature changed from precision (int) to likelihood_threshold (double) in e272b1e, but the test helper still passed 14 (the old precision). This caused 8 matcher tests to fail.
This was referenced Mar 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Just posting here for discussion, definitely not ready for merging any time soon.