Skip to content

Refactor low-level code#830

Merged
jeromekelleher merged 34 commits intotskit-dev:mainfrom
jeromekelleher:lshmm-directly-on-ts
Mar 27, 2026
Merged

Refactor low-level code#830
jeromekelleher merged 34 commits intotskit-dev:mainfrom
jeromekelleher:lshmm-directly-on-ts

Conversation

@jeromekelleher
Copy link
Copy Markdown
Member

Just posting here for discussion, definitely not ready for merging any time soon.

@benjeffery
Copy link
Copy Markdown
Member

Looking good on a quick look through! Will have a proper look soon.

@jeromekelleher
Copy link
Copy Markdown
Member Author

No rush - I'm just pushing here to be able to refer to it and for discussion when we're interested.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 74.01186% with 263 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.29%. Comparing base (f087068) to head (92c35ac).
⚠️ Report is 34 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #830       +/-   ##
===========================================
- Coverage   86.44%   74.29%   -12.15%     
===========================================
  Files          18       18               
  Lines        5651     6653     +1002     
  Branches      896     1074      +178     
===========================================
+ Hits         4885     4943       +58     
- Misses        593     1609     +1016     
+ Partials      173      101       -72     
Flag Coverage Δ
C 0.00% <0.00%> (-81.58%) ⬇️
c-python 62.23% <73.90%> (+3.91%) ⬆️
python-tests 90.60% <80.00%> (-0.34%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Python API 90.71% <80.00%> (-0.34%) ⬇️
Python C interface 68.41% <74.69%> (+1.74%) ⬆️
C library 58.87% <72.95%> (-29.81%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jeromekelleher
Copy link
Copy Markdown
Member Author

I'm bringing this branch up to date now with the intention of merging in as soon as it's usable.

@jeromekelleher jeromekelleher force-pushed the lshmm-directly-on-ts branch 3 times, most recently from a774ed3 to 679cd8c Compare March 26, 2026 17:06
from tree_sequence_builder and workign directly on the tree sequence
instead.

Initial dump of required HMM code

First steps - isolated code works with shim Builder class

Some more tests

Remove TreeSequenceBuilder
g

Fake the sortedcontainers API for now

Fully move to sorted lists of edges.

Stuff

Partial

Abstract out the MatcherIndexes class

Pull out stuff necessary for static match

Compiling version without tree_sequence_builder

Partway through getting C tests running

add test dataq

C code working and some tests

Factor out the "output" struct in matcher

Some basics for new Matcher infrastructure

compiling  with lwt interface

partial update to use table collection

Convert C code to use tables

Roughly working Python-C infrastructure

Add basic debug support to the MatcherIndexes

Basic Python-C linkage works :hooray:

Basic high-level infrastructure for matching

Rename file

Refactor the Matcher infrastructure

Improve class infrastructure

Add vestigial root automatically

Fix up tests to remove hard-coded virtual root

Work on making matcher work with edges not on site values

Roughtly working version with edges on genome coords

Python version looks like it's working

Add sites_position storage and coordinate_t type

Matching working in C (looks like)

Infer start and end from haplotype

Rough implementation of flank skipping

Change python code to use coords in path

Implement coordinate paths in C

Implement some cludges to support initial zero site-paht

Minor updates

Sort-of working driver script

Fiddle with some tricky issues

Rename the test file to avoid collisions
16 tests across 4 hand-crafted tree topologies (star, binary, two-tree,
deep chain) with exact path and mutation assertions from the current
AncestorMatcher. These serve as regression tests when swapping to
MatcherIndexes / AncestorMatcher2.
Update the AncestorMatcher2 Python wrapper so find_path accepts the
same (h, start, end, match_out) arguments and returns the same
(left, right, parent) tuple as the current AncestorMatcher.

Remove Path, Match, find_match and zero_sites_path from matching.py.
Move Path and Match to test_lshmm.py where they are used by the
Python reference implementation.
Try fixing symlinks on windows

Try safe-directory

Do simple solution

Add python versions

fixup

Add coverage upload

Use new workflow

fix

use hte tag
…estorMatcher2

Set TSINFER_MATCHER=matcher2 to use the new AncestorMatcher2/MatcherIndexes
implementation behind the Matcher interface. Default behaviour unchanged.
Run the Python reference matcher on star, binary, two-tree, and
deep-chain topologies built via make_root_ts + extend_ts.  Adds
vestigial_root flag to MatcherIndexes and run_match_python so
tsinfer-produced topologies can skip add_vestigial_root.
run_match_tsinfer now accepts c_too flag to also run C AncestorMatcher2
(currently segfaults on tsinfer topologies, so off by default).
Python reference passes all fixture tests.
Build star, binary, two-tree, and deep-chain topologies in C
matching the Python test_matcher_fixtures.  All four segfault in
ancestor_matcher2_find_path, reproducing the known bug.
matcher_indexes_copy_mutation_data only set sites.mutations and
sites.num_alleles for sites with mutations, leaving sites without
mutations uninitialised.  Initialise all entries in the position-copy
loop so every site has a valid NULL mutations pointer and num_alleles.
Assert path_length, left, right, and parent for each query
haplotype against the known results from test_matcher_fixtures.py.
Wrap legacy AncestorMatcher to convert site-index edges to genomic
positions, mapping left=positions[0] to 0 for consistency.  Update
Matcher._match_one to use genomic positions directly.  Both matchers
now pass all 16 fixture tests identically.
Log RSS after Matcher construction completes, include RSS in the
per-group completion message, and log the cache budget at startup.
Helps trace OOM kills during matching.
Add num_alleles parameter to matcher_indexes_alloc in C; if NULL,
defaults to biallelic (2).  Thread through Python/C binding and
MatcherIndexes wrapper.  Fixes TSI_ERR_BAD_HAPLOTYPE_ALLELE on
multi-allelic sites.  Add triallelic C test.
… budget

register_loader had chunk_bytes default to 0. The lazy reader path
omitted it, so the 7.4 GB sample chunk bypassed the budget check and
was loaded unchecked, spiking memory and triggering the OOM killer.

Make chunk_bytes a required argument so this class of bug is caught
at call time.
Add periodic memory logging
During large groups the only RSS logging was after group completion,
giving no visibility when the process is OOM-killed mid-group. Now
every 30s we log cache state, force a GC (reporting freed memory),
and take a tracemalloc snapshot diff showing the top 5 Python
allocation growth sites plus untraced (C extension) memory.
The futures dict held all 71K futures alive for the entire group,
retaining per-haplotype numpy arrays from find_path (~2-4 MB each).
Deleting each future after yielding its result frees these immediately.
Replace TSI_ERR_GENERIC with a dedicated TSI_ERR_MULTIPLE_MUTATIONS_AT_SITE
error code so users get a clear message instead of "please file a bug report".
Add a C test for this error condition.
…ownership

Py_BuildValue with O format increments refcount, but the code then set
local pointers to NULL preventing cleanup Py_XDECREF from running,
leaving each returned numpy array with a permanent refcount of 1.

Switched to N format which transfers ownership without incrementing.
Also added missing Py_XDECREF for haplotype_array in AncestorMatcher2.
@jeromekelleher jeromekelleher marked this pull request as ready for review March 27, 2026 15:42
Replace precision-based rounding with the legacy matcher's likelihood
threshold approach: L[u] = TSK_MAX(L[u] / max_L, threshold). Default
threshold is DBL_MIN, matching the legacy behaviour.

Add TSI_DISABLE_WEIGHT_BY_N flag support so N-weighting can be toggled.
Both parameters are exposed in the Python wrapper.
Remove unused run_match_python() and update run_match_tsinfer() to
always compare Python and C matcher results, now that AncestorMatcher2
is reliable on tsinfer topologies.
Switch traceback_allocator from tsk_blkalloc_t to tsi_blkalloc_t so
that total_memory reads are thread-safe via C11 atomics. Add the
TSI_NO_ATOMICS guard to the Python getter matching the legacy pattern.
Add tests for total_memory and mean_traceback_size properties.
Switch from tskit's tsk_blkalloc to tsinfer's tsi_blkalloc, which has
an atomic total_size field for thread-safe memory observation. This
makes matcher_indexes_t consistent with all other tsinfer structs.
…ision value

The ancestor_matcher2_alloc signature changed from precision (int) to
likelihood_threshold (double) in e272b1e, but the test helper still
passed 14 (the old precision). This caused 8 matcher tests to fail.
@jeromekelleher jeromekelleher merged commit f2a5b6b into tskit-dev:main Mar 27, 2026
11 of 13 checks passed
@jeromekelleher jeromekelleher deleted the lshmm-directly-on-ts branch March 27, 2026 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants