Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3353 +/- ##
=======================================
Coverage 89.76% 89.76%
=======================================
Files 29 29
Lines 31292 31292
Branches 5738 5738
=======================================
Hits 28089 28089
Misses 1794 1794
Partials 1409 1409
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| subset of sites, either by specifying a single vector for both rows and columns | ||
| or a pair of vectors for the row sites and column sites separately. | ||
|
|
||
| The following example produces a matrix containing {math}`r^2` computed pairwise |
There was a problem hiding this comment.
| The following example produces a matrix containing {math}`r^2` computed pairwise | |
| The following computes a matrix of {math}`r^2` measure of linkage-disequilibrium (LD) pairwise |
| The two-locus API provides a mechanism by which to subset the samples under | ||
| consideration, providing the ability to compute a separate LD matrix for each | ||
| sample set or an LD matrix between sample sets. Output dimensions are handled in | ||
| the same manner as the rest of the stats api (see |
There was a problem hiding this comment.
| the same manner as the rest of the stats api (see | |
| the same manner as the rest of the stats API (see |
| : {math}`f(w_{Ab}, w_{aB}, w_{AB}, n) = p_{ab} - p_{a}p_{b}` | ||
|
|
||
| This statistic is inherently polarised, as the unpolarised result of this | ||
| statistic is expected to be zero. Uses the `total` normalisation method. |
There was a problem hiding this comment.
Expectation is zero when unpolarised, but actual values can differ from zero, particularly in small populations or under intense selection, no?
|
|
||
| :param list sample_sets: A list of lists of Node IDs, specifying the | ||
| groups of nodes to compute the statistic with. | ||
| Defaults to all samples grouped by population. TODO: does it? |
|
I spotted minor bits for you to check/consider. Someone more versed in the tskit code should also have a look;) |
|
Thank you @lkirk! I’ll give detailed comments on it later this evening. Great to see your methods getting documentation |
apragsdale
left a comment
There was a problem hiding this comment.
Hi Lloyd - many small, minor comments here that will hopefully help with documentation clarity in places. Thanks again!
| The {meth}`~TreeSequence.ld_matrix` method provides an interface to a collection | ||
| of two-locus statistics with predefined summary functions (see | ||
| {ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch` {ref}`modes | ||
| <sec_stats_mode>`. This method differs from the other stats methods because it |
There was a problem hiding this comment.
"The API for this method differs..."
| of two-locus statistics with predefined summary functions (see | ||
| {ref}`sec_stats_two_locus_summary_functions`) and `site` and `branch` {ref}`modes | ||
| <sec_stats_mode>`. This method differs from the other stats methods because it | ||
| provides a collection of statistics, instead of the usual one method per |
There was a problem hiding this comment.
Maybe, "The LD matrix method differs from other statistics methods in that it provides a unified API with an argument to specify different two-locus summaries of the data." Or something along those lines.
| <sec_stats_mode>`. This method differs from the other stats methods because it | ||
| provides a collection of statistics, instead of the usual one method per | ||
| stat. Otherwise, it behaves similarly to most other functions with respect to | ||
| `sample_sets` and `indexes`. Site statistics can be computed from multi-allelic |
There was a problem hiding this comment.
Maybe, "Two-locus statistics can be computed using two modes, either site or branch, and these should be interpreted in the same was as single-site statistics. Site statistics allow for multi-allelic data, while branch statistics assume an infinite sites model."
|
|
||
| #### Site | ||
|
|
||
| The `site` mode computes two-locus statistics from pairs of alleles on sites. By |
There was a problem hiding this comment.
"from pairs of alleles on sites" - it's not entirely clear what this means. Maybe, "summarized over all pairs of alleles between specified sites."?
|
|
||
| The `site` mode computes two-locus statistics from pairs of alleles on sites. By | ||
| default, this method will compute a matrix for all pairs of sites, with rows and | ||
| columns representing each site in the tree sequence (ie an n×n matrix where n is |
There was a problem hiding this comment.
"i.e.,".
Usually use "n" for sample size, maybe "m" for number of sites.
|
|
||
| In the site mode, the sites under consideration can be restricted using the | ||
| ``sites`` argument. Sites can be passed as a list of lists, specifying the | ||
| ``[[row_sites], [col_sites]]`` or by specifying ``[all_sites]``, where a square |
There was a problem hiding this comment.
Though note that it doesn't have to be "all sites". It can be a single subset of sites on the tree, in which case we get a square matrix.
|
|
||
| We can also compute two-way LD statistics between two sample sets. If the | ||
| ``indexes`` argument is specified, at least two sample sets must also be | ||
| specified. ``indexes`` specifies the sample sets indexes between which to |
There was a problem hiding this comment.
"sample set indexes" ... or "sample sets' indexes"?
| :math:`\widehat{\pi_2}` n n "pi2_unbiased" | ||
| ========================= ============ ================= ================= | ||
|
|
||
| :param list sample_sets: A list of lists of Node IDs, specifying the |
| mode. [[row_sites], [col_sites]] or [all_sites]. | ||
| :param list positions: A list of genomic positions to restrict. Can be | ||
| specified as a list of lists to control the row and column sites. | ||
| [[row_sites], [col_sites]] or [all_sites]. |
There was a problem hiding this comment.
Is this only applicable in branch mode?
| specified as a list of lists to control the row and column sites. | ||
| [[row_sites], [col_sites]] or [all_sites]. | ||
| :param list indexes: A list of 2-tuples or a single 2-tuple, specifying | ||
| the indexes of two populations on which to compute two-way LD |
This PR provides some documentation for the ld_matrix method so that we can consider it a finalized, public API method.
I've tried to provide a sufficient level of detail without going too deep into the method. I also provide some simple demos with the provided example tree sequences. I attempt to provide everything a user would need in the docstring for the
ld_matrixmethod, I'm not sure if it's too much detail.I didn't touch the existing
Todonote in theMulti site statisticssection of the documentation, I wasn't exactly sure how we wanted to handle that. To my knowledge, theLdCalculatorstill exists and it would be useful to frame this method in context with that one.cc: @petrelharp @apragsdale