Refactor low-level code by jeromekelleher · Pull Request #830 · tskit-dev/tsinfer

jeromekelleher · 2023-05-25T09:09:31Z

Just posting here for discussion, definitely not ready for merging any time soon.

benjeffery · 2023-05-25T10:22:17Z

Looking good on a quick look through! Will have a proper look soon.

jeromekelleher · 2023-05-25T10:27:55Z

No rush - I'm just pushing here to be able to refer to it and for discussion when we're interested.

codecov · 2026-03-26T15:09:47Z

Codecov Report

❌ Patch coverage is 74.01186% with 263 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.29%. Comparing base (f087068) to head (92c35ac).
⚠️ Report is 34 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #830       +/-   ##
===========================================
- Coverage   86.44%   74.29%   -12.15%     
===========================================
  Files          18       18               
  Lines        5651     6653     +1002     
  Branches      896     1074      +178     
===========================================
+ Hits         4885     4943       +58     
- Misses        593     1609     +1016     
+ Partials      173      101       -72

Flag	Coverage Δ
C	`0.00% <0.00%> (-81.58%)`	⬇️
c-python	`62.23% <73.90%> (+3.91%)`	⬆️
python-tests	`90.60% <80.00%> (-0.34%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
Python API	`90.71% <80.00%> (-0.34%)`	⬇️
Python C interface	`68.41% <74.69%> (+1.74%)`	⬆️
C library	`58.87% <72.95%> (-29.81%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jeromekelleher · 2026-03-26T15:15:42Z

I'm bringing this branch up to date now with the intention of merging in as soon as it's usable.

from tree_sequence_builder and workign directly on the tree sequence instead. Initial dump of required HMM code First steps - isolated code works with shim Builder class Some more tests Remove TreeSequenceBuilder g Fake the sortedcontainers API for now Fully move to sorted lists of edges. Stuff Partial Abstract out the MatcherIndexes class Pull out stuff necessary for static match Compiling version without tree_sequence_builder Partway through getting C tests running add test dataq C code working and some tests Factor out the "output" struct in matcher Some basics for new Matcher infrastructure compiling with lwt interface partial update to use table collection Convert C code to use tables Roughly working Python-C infrastructure Add basic debug support to the MatcherIndexes Basic Python-C linkage works :hooray: Basic high-level infrastructure for matching Rename file Refactor the Matcher infrastructure Improve class infrastructure Add vestigial root automatically Fix up tests to remove hard-coded virtual root Work on making matcher work with edges not on site values Roughtly working version with edges on genome coords Python version looks like it's working Add sites_position storage and coordinate_t type Matching working in C (looks like) Infer start and end from haplotype Rough implementation of flank skipping Change python code to use coords in path Implement coordinate paths in C Implement some cludges to support initial zero site-paht Minor updates Sort-of working driver script Fiddle with some tricky issues Rename the test file to avoid collisions

16 tests across 4 hand-crafted tree topologies (star, binary, two-tree, deep chain) with exact path and mutation assertions from the current AncestorMatcher. These serve as regression tests when swapping to MatcherIndexes / AncestorMatcher2.

Update the AncestorMatcher2 Python wrapper so find_path accepts the same (h, start, end, match_out) arguments and returns the same (left, right, parent) tuple as the current AncestorMatcher. Remove Path, Match, find_match and zero_sites_path from matching.py. Move Path and Match to test_lshmm.py where they are used by the Python reference implementation.

Try fixing symlinks on windows Try safe-directory Do simple solution Add python versions fixup Add coverage upload Use new workflow fix use hte tag

…estorMatcher2 Set TSINFER_MATCHER=matcher2 to use the new AncestorMatcher2/MatcherIndexes implementation behind the Matcher interface. Default behaviour unchanged.

Run the Python reference matcher on star, binary, two-tree, and deep-chain topologies built via make_root_ts + extend_ts. Adds vestigial_root flag to MatcherIndexes and run_match_python so tsinfer-produced topologies can skip add_vestigial_root.

run_match_tsinfer now accepts c_too flag to also run C AncestorMatcher2 (currently segfaults on tsinfer topologies, so off by default). Python reference passes all fixture tests.

Build star, binary, two-tree, and deep-chain topologies in C matching the Python test_matcher_fixtures. All four segfault in ancestor_matcher2_find_path, reproducing the known bug.

matcher_indexes_copy_mutation_data only set sites.mutations and sites.num_alleles for sites with mutations, leaving sites without mutations uninitialised. Initialise all entries in the position-copy loop so every site has a valid NULL mutations pointer and num_alleles.

Assert path_length, left, right, and parent for each query haplotype against the known results from test_matcher_fixtures.py.

Wrap legacy AncestorMatcher to convert site-index edges to genomic positions, mapping left=positions[0] to 0 for consistency. Update Matcher._match_one to use genomic positions directly. Both matchers now pass all 16 fixture tests identically.

Log RSS after Matcher construction completes, include RSS in the per-group completion message, and log the cache budget at startup. Helps trace OOM kills during matching.

Add num_alleles parameter to matcher_indexes_alloc in C; if NULL, defaults to biallelic (2). Thread through Python/C binding and MatcherIndexes wrapper. Fixes TSI_ERR_BAD_HAPLOTYPE_ALLELE on multi-allelic sites. Add triallelic C test.

… budget register_loader had chunk_bytes default to 0. The lazy reader path omitted it, so the 7.4 GB sample chunk bypassed the budget check and was loaded unchecked, spiking memory and triggering the OOM killer. Make chunk_bytes a required argument so this class of bug is caught at call time.

Add periodic memory logging

During large groups the only RSS logging was after group completion, giving no visibility when the process is OOM-killed mid-group. Now every 30s we log cache state, force a GC (reporting freed memory), and take a tracemalloc snapshot diff showing the top 5 Python allocation growth sites plus untraced (C extension) memory.

The futures dict held all 71K futures alive for the entire group, retaining per-haplotype numpy arrays from find_path (~2-4 MB each). Deleting each future after yielding its result frees these immediately.

Replace TSI_ERR_GENERIC with a dedicated TSI_ERR_MULTIPLE_MUTATIONS_AT_SITE error code so users get a clear message instead of "please file a bug report". Add a C test for this error condition.

…ownership Py_BuildValue with O format increments refcount, but the code then set local pointers to NULL preventing cleanup Py_XDECREF from running, leaving each returned numpy array with a permanent refcount of 1. Switched to N format which transfers ownership without incrementing. Also added missing Py_XDECREF for haplotype_array in AncestorMatcher2.

Replace precision-based rounding with the legacy matcher's likelihood threshold approach: L[u] = TSK_MAX(L[u] / max_L, threshold). Default threshold is DBL_MIN, matching the legacy behaviour. Add TSI_DISABLE_WEIGHT_BY_N flag support so N-weighting can be toggled. Both parameters are exposed in the Python wrapper.

Remove unused run_match_python() and update run_match_tsinfer() to always compare Python and C matcher results, now that AncestorMatcher2 is reliable on tsinfer topologies.

Switch traceback_allocator from tsk_blkalloc_t to tsi_blkalloc_t so that total_memory reads are thread-safe via C11 atomics. Add the TSI_NO_ATOMICS guard to the Python getter matching the legacy pattern. Add tests for total_memory and mean_traceback_size properties.

Switch from tskit's tsk_blkalloc to tsinfer's tsi_blkalloc, which has an atomic total_size field for thread-safe memory observation. This makes matcher_indexes_t consistent with all other tsinfer structs.

…ision value The ancestor_matcher2_alloc signature changed from precision (int) to likelihood_threshold (double) in e272b1e, but the test helper still passed 14 (the old precision). This caused 8 matcher tests to fail.

jeromekelleher force-pushed the lshmm-directly-on-ts branch from f907061 to 2a7a848 Compare March 26, 2026 15:06

jeromekelleher force-pushed the lshmm-directly-on-ts branch 3 times, most recently from a774ed3 to 679cd8c Compare March 26, 2026 17:06

jeromekelleher added 6 commits March 26, 2026 17:07

Port tests to new layout

df2b3db

Fixup lint

74d8f9f

Fix build issue with packaging

48b0267

jeromekelleher force-pushed the lshmm-directly-on-ts branch from 679cd8c to 3cbd349 Compare March 26, 2026 17:07

jeromekelleher mentioned this pull request Mar 27, 2026

Review and improve test coverage #1165

Closed

jeromekelleher added 14 commits March 27, 2026 15:41

Tests on windows

0bdeeef

Try fixing symlinks on windows Try safe-directory Do simple solution Add python versions fixup Add coverage upload Use new workflow fix use hte tag

Remove dead file

a8e0012

Add TSINFER_MATCHER env var to switch between AncestorMatcher and Anc…

0e165be

…estorMatcher2 Set TSINFER_MATCHER=matcher2 to use the new AncestorMatcher2/MatcherIndexes implementation behind the Matcher interface. Default behaviour unchanged.

Fix missing fstrings

8e98012

Add C Matcher2 comparison path for tsinfer-topology tests

97ac289

run_match_tsinfer now accepts c_too flag to also run C AncestorMatcher2 (currently segfaults on tsinfer topologies, so off by default). Python reference passes all fixture tests.

Add tsinfer-topology fixtures to C test suite

ab9e466

Build star, binary, two-tree, and deep-chain topologies in C matching the Python test_matcher_fixtures. All four segfault in ancestor_matcher2_find_path, reproducing the known bug.

Add a plan with some analysis on matching implementation

481145f

Add exact expected match paths to C fixture tests

b1ac387

Assert path_length, left, right, and parent for each query haplotype against the known results from test_matcher_fixtures.py.

Add memory diagnostics: RSS after Matcher, per-group, cache budget

49f7bfe

Log RSS after Matcher construction completes, include RSS in the per-group completion message, and log the cache budget at startup. Helps trace OOM kills during matching.

jeromekelleher added 6 commits March 27, 2026 15:41

Require functions at the top.

b99beac

Add periodic memory logging

Fix memory leak: delete completed futures in Matcher.match()

e599fda

The futures dict held all 71K futures alive for the entire group, retaining per-haplotype numpy arrays from find_path (~2-4 MB each). Deleting each future after yielding its result frees these immediately.

Turn off tracemalloc

69555e8

Add informative error for multiple mutations per site in matcher_indexes

6a93e1b

Replace TSI_ERR_GENERIC with a dedicated TSI_ERR_MULTIPLE_MUTATIONS_AT_SITE error code so users get a clear message instead of "please file a bug report". Add a C test for this error condition.

jeromekelleher force-pushed the lshmm-directly-on-ts branch from 01481d0 to b032846 Compare March 27, 2026 15:42

jeromekelleher marked this pull request as ready for review March 27, 2026 15:42

jeromekelleher added 8 commits March 27, 2026 15:47

Add a plan to remove the legacyh matcher code.

2228934

Tidy up tests

a6da166

Enable C engine comparison in all test_lshmm fixture tests

e2aee50

Remove unused run_match_python() and update run_match_tsinfer() to always compare Python and C matcher results, now that AncestorMatcher2 is reliable on tsinfer topologies.

Use tsi_blkalloc in matcher_indexes_t

6d423f5

Switch from tskit's tsk_blkalloc to tsinfer's tsi_blkalloc, which has an atomic total_size field for thread-safe memory observation. This makes matcher_indexes_t consistent with all other tsinfer structs.

Fix windows tests

92c35ac

jeromekelleher merged commit f2a5b6b into tskit-dev:main Mar 27, 2026
11 of 13 checks passed

jeromekelleher deleted the lshmm-directly-on-ts branch March 27, 2026 16:47

This was referenced Mar 27, 2026

Remove legacy TreeSequenceBuilder and AncestorMatcher code #1168

Merged

Implement path_compression #1169

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor low-level code#830

Refactor low-level code#830
jeromekelleher merged 34 commits intotskit-dev:mainfrom
jeromekelleher:lshmm-directly-on-ts

jeromekelleher commented May 25, 2023

Uh oh!

benjeffery commented May 25, 2023

Uh oh!

jeromekelleher commented May 25, 2023

Uh oh!

codecov bot commented Mar 26, 2026 •

edited

Loading

Uh oh!

jeromekelleher commented Mar 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jeromekelleher commented May 25, 2023

Uh oh!

benjeffery commented May 25, 2023

Uh oh!

jeromekelleher commented May 25, 2023

Uh oh!

codecov bot commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented Mar 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Mar 26, 2026 •

edited

Loading