Red-team follow-up: dedup config, PK validation, coverage warning by nick-gorman · Pull Request #99 · UNSW-CEEM/NEMOSIS

nick-gorman · 2026-05-27T02:50:46Z

Summary

Stacked on top of #98. Four small commits, each addressing one finding from a red-team pass against master (full notes in the companion nemosis-smoke repo's redteam/FINDINGS.md). All within the spirit of #98 — convert previously-silent footguns into clear errors or warnings, with regression tests — but slightly broader in scope, so kept on a separate branch.

What's in here

`f5b947c` — F5: install standard finalise pipeline for MARKET_PRICE_THRESHOLDS

processing_info_maps.finalise["MARKET_PRICE_THRESHOLDS"] was None, unique among search_type="all" tables. Every other one ships [most_recent_records_before_start_time, drop_duplicates_by_primary_key] — MARKET_PRICE_THRESHOLDS got neither. Effect:

A 5-month live query returned 1269 rows for 16 distinct (EFFECTIVEDATE, VERSIONNO) pairs — the same byte-identical row repeating up to 114 times.
Joining MARKET_PRICE_THRESHOLDS to DISPATCHPRICE on EFFECTIVEDATE produces cartesian-explosion bugs.
Aggregates (mean VOLL etc.) are heavily biased toward older effective dates because more monthly archives ship them.

effective_date_group_col for this table is already [], which sends most_recent_records_before_start_time down the .tail(1) branch. PKs and primary_date_columns are also already set up. The fix is just to install the standard pipeline.

Strengthens the existing fixture test to assert the PK invariant — verified the new assertion FAILS without the fix (the 2-month fixture exhibits 2× duplication) and PASSES with it.

`02b4ee2` — F3: validate PK columns in user select_columns up-front

For tables whose pipeline dedupes on PK, omitting a PK column from user-typed select_columns previously caused a bare pandas KeyError: Index(['VERSIONNO'], dtype='str') from inside the finalise step. Cryptic — the user has no clue the fix is "add VERSIONNO back" (or just use defaults).

New _validate_user_select_columns_includes_pk covers both:

Dynamic: finalise pipeline contains query_wrappers.drop_duplicates_by_primary_key.
Static: static_data_finaliser_map[table_name] contains _finalise_excel_data (which calls drop_duplicates(primary_keys) when the table is in defaults.table_primary_keys).

Raises a clear UserInputError naming the missing PK column(s) and suggesting either adding them back or passing select_columns=None. Two regression tests in test_errors.py.

`46db129` — F1 (static): wire `_check_loaded_select_columns` into static_table

_check_loaded_select_columns was added in #98 for dynamic_data_compiler but not static_table. So a typo in user-typed select_columns (['Participant', 'duid'] instead of ['Participant', 'DUID']) silently shipped a stub DataFrame from static_table with just Participant.

Apply the same guard to static_table. Also fix the missing-raise in the post-load filter_cols check at the static_table call site, and in the matching spot in dynamic_data_compiler (was constructed-but-not-raised — same shape as the bugs fixed in #98 but two call sites further down). With _check_loaded_select_columns in place this path is mostly unreachable, but defence-in-depth matters for the defaults-inherited-but-vintage-missing edge case. Two regression tests.

`896a56f` — F15: partial-coverage WARNING for gaps in the user's window

A multi-period query that partly succeeds and partly 404s previously had no aggregate signal — only N separate per-file warnings that aggregating users tend not to read. So asking for 16 months of INTERMITTENT_GEN_SCADA against AEMO's ~6-7-week Reports/Current/ retention silently returned ~10% of the requested range, with all downstream means/aggregates correspondingly wrong.

Track per-period success through the fetch loop, but only count iterations whose data period actually overlaps the user's [start_time, end_time) window. After the loop, if some — but not all — of those in-window periods returned data, emit a single summary WARNING naming the table, missing count, coverage percentage, and the three common causes (retention window, not-yet-published month, pre-AEMO data).

The in-window check (_iteration_overlaps_window) translates each date_gen iteration into its [period_start, period_end) extent — monthly for MMS, daily for current/scraped, 5-min for FCAS — and runs the standard strict half-open overlap test:

period_start < user_end AND period_end > user_start

This naturally excludes the two kinds of "behind-the-scenes" iterations the fetch loop makes that the user didn't directly ask for:

The 1-day buffer-back month/day before start_time — sits flush against user_start from the left, fails the strict period_end > user_start test.
The ~120 historical months a search_type='all' table scans to compute "snapshot as of start_time" — entirely before user_start.

So a PARTICIPANT query for [2024-01-01, 2024-06-01) counts 5 in-window months (Jan–May) regardless of the ~115 historical-scan + buffer-back months the loop actually iterated. If all 5 return data → no warning. If one is missing → 4/5, warning fires for a real gap.

Three regression tests:

partial coverage (DISPATCHPRICE 2018-04 → 2018-12 against a fixture with only 2018-04 and 2018-05; 2/8 in-window) emits the WARNING.
full coverage (single fixtured month) emits NO warning — prevents the signal from being drowned in noise on every healthy query.
out-of-window 404s for a search_type='all' table emit NO warning, even when scan months 404 — proves the in-window check correctly suppresses the historical-scan noise that a flat success-rate ratio would have false-positived on.

Test plan

uv run pytest tests/ — 435 passed, 1 skipped (was 432 on PR Polish from DISPATCHPRICE user-testing: input validation, log clarity, index reset #98).
Re-installed locally into the nemosis-smoke env and ran the three red-team smoke gates — all PASS:
- 11_market_price_thresholds_dedup.py — confirms PK invariant on a 5-month live query (was FAIL before this branch)
- 12_filter_cols_silent_skip.py — confirms filter_cols with a missing column raises (was FAIL)
- 13_current_table_coverage.py — confirms the "Partial coverage" WARNING surfaces for a multi-month current-table query (was FAIL)
Re-ran the existing smoke set (01, 03, 04, 06, 07, 09, 10) — all still PASS. (08 has a pre-existing smoke-script bug unrelated to NEMOSIS; 02 and 05 skipped only because they take ~3-5 min each.)
Sanity-tested the partial-coverage warning verbiage live: INTERMITTENT_GEN_SCADA 4-month query shows 47/120 periods in the requested window returned data (39%) plus the cause hints.

Notes for review

F5 is the only commit that changes data semantics for a previously-supported table. Anyone currently relying on "MARKET_PRICE_THRESHOLDS returns N copies of each row" (presumably nobody) would see a row-count change. The actual data values are unaffected.
F3's static-table guard catches static_table("Generators and Scheduled Loads", select_columns=['Participant', 'Region']) (missing the DUID PK). If this is a known idiom anywhere, the error message points the user at select_columns=None.
The F15 coverage check has no threshold/heuristic — it warns iff at least one in-window period is missing (and at least one succeeded; zero-data is NoDataToReturn's job). The strict half-open overlap test handles both the buffer-back month and search_type='all''s historical scan automatically.

🤖 Generated with Claude Code

`processing_info_maps.finalise["MARKET_PRICE_THRESHOLDS"]` was `None`, unique among the search_type="all" tables. Every other one ships `[most_recent_records_before_start_time, drop_duplicates_by_primary_key]` so a "snapshot as of start_time" query returns one row per PK pair. With `finalise=None`, every monthly archive's copy of every effective date passed straight through to the user. A multi-month query returned ~N copies of each row where N is the number of monthly files scanned — a 5-month live query returned 1269 rows for 16 distinct (EFFECTIVEDATE, VERSIONNO) pairs, the same byte-identical row repeating up to 114 times. Joining MARKET_PRICE_THRESHOLDS to DISPATCHPRICE on EFFECTIVEDATE produces cartesian-explosion bugs; `df.set_index('EFFECTIVEDATE')` blows up on the duplicate index; mean/aggregate calculations are heavily biased toward older effective dates because more monthly files have already shipped by then. `defaults.effective_date_group_col["MARKET_PRICE_THRESHOLDS"]` is already `[]`, which sends `most_recent_records_before_start_time` down the `.tail(1)` branch — exactly one "as-of" row from before start_time. `defaults.table_primary_keys["MARKET_PRICE_THRESHOLDS"]` is already `["EFFECTIVEDATE", "VERSIONNO"]`, so the dedup just works. The fix is to install the standard pipeline. Strengthens the existing test to assert the PK invariant — verified the strengthened test FAILS without the fix (every PK pair appears twice in the 2-month fixture) and PASSES with it. Full suite green: 428 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For tables whose processing pipeline dedupes on PK, omitting a PK column from user-typed select_columns previously caused a bare pandas KeyError to escape from inside the finalise step: KeyError: Index(['VERSIONNO'], dtype='str') The traceback pointed at pandas internals and gave the user no clue that the actual fix is "add VERSIONNO back to your select_columns" — or just use the defaults, which always include the PK. Validate up-front in dynamic_data_compiler, cache_compiler, and static_table: * Dynamic-table dedup is detected by the presence of `query_wrappers.drop_duplicates_by_primary_key` in the table's finalise pipeline. * Static-table dedup is detected by the presence of `_finalise_excel_data` in the table's static_data_finaliser_map entry (it calls `drop_duplicates(primary_keys)` when the table is in `defaults.table_primary_keys`). In both cases, raise a clear UserInputError naming the missing PK column(s) and the suggested fix (add to select_columns, or pass select_columns=None to use defaults). Two regression tests in test_errors.py — one each for the dynamic (LOSSFACTORMODEL, PK = [EFFECTIVEDATE, INTERCONNECTORID, REGIONID, VERSIONNO]) and static (Generators and Scheduled Loads, PK = [DUID]) crash paths. 430 passed, 1 skipped (was 428). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tic_table `dynamic_data_compiler` had two related guards added in 63edfee: * `_check_loaded_select_columns`: any user-typed select_columns column that didn't survive the file load is a hard error (catches typos like 'duid' for 'DUID' that previously returned a stub DataFrame with a WARNING). * The post-load filter_cols check at line 162 still constructed UserInputError without `raise` — a silent filter no-op that returned unfiltered data when the filter column was inherited from defaults but happened to be missing from this AEMO file vintage. `static_table` shared both shapes. The first wasn't covered at all — a typo in select_columns silently produced a stub DataFrame. The second had the same missing-`raise` bug at the static_table call site. Apply both fixes to static_table: * Capture `user_select_columns` before defaults reassignment. * Call `_check_loaded_select_columns` after column-selection + string-strip, before filtering. * Add `raise` in front of the previously-bare `UserInputError(...)` in the post-load filter_cols check. Fix the same missing-`raise` at the corresponding spot in `dynamic_data_compiler` while we're here — same bug pattern. With `_check_loaded_select_columns` in place, the missing-`raise` path is mostly unreachable, but defence-in-depth matters for the defaults-inherited-but-vintage-missing edge case. Tests: * `test_static_select_columns_typo_raises` — bogus column in select_columns raises UserInputError naming the missing column. * `test_static_filter_col_not_in_loaded_data_raises` — combined select-and-filter typo raises (caught by the earlier _check_loaded_select_columns). 432 passed, 1 skipped (was 430). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…window A multi-period query that partly succeeds and partly 404s previously had no aggregate signal — only N separate per-file warnings that aggregating users tend not to read. So asking for 16 months of INTERMITTENT_GEN_SCADA against AEMO's ~6-7-week Reports/Current/ retention silently returned ~10% of the requested range, with all downstream means/aggregates correspondingly wrong. Track per-period success in `_dynamic_data_fetch_loop`, but only count iterations whose data period actually overlaps the user's [start_time, end_time) window. After the loop, if some — but not all — of those in-window periods returned data, emit a single summary WARNING naming the table, the missing count, the coverage percentage, and the three common causes (retention window, not-yet-published month, pre-AEMO data). The in-window check (`_iteration_overlaps_window`) translates each date_gen iteration into its [period_start, period_end) extent — monthly for MMS, daily for current/scraped, 5-min for FCAS — and runs the standard strict half-open overlap test: period_start < user_end AND period_end > user_start This naturally excludes the two kinds of "behind-the-scenes" iterations the fetch loop makes that the user didn't directly ask for: * The 1-day buffer-back month / day before `start_time`. Its period sits flush against `user_start` from the left, so the strict `period_end > user_start` test fails and it's excluded. * The ~120 historical months a search_type='all' table scans to compute "snapshot as of start_time". Those periods are entirely before `user_start` and excluded. So a PARTICIPANT query for [2024-01-01, 2024-06-01) counts 5 in-window months (Jan–May) regardless of the ~115 historical-scan + buffer-back months the loop actually iterated. If all 5 return data → no warning. If one is missing → 4/5, warning fires for a real gap in the user's request. caching_mode skips the check: cache_compiler doesn't return a frame the caller could aggregate over, and the per-period warnings already appear in its log stream. The zero-data case is also skipped — the caller already raises NoDataToReturn upstream, and a duplicate WARNING would just be noise. Three regression tests in test_errors.py: * partial coverage (DISPATCHPRICE 2018-04 -> 2018-12 against a fixture that only has 2018-04 and 2018-05; 2/8 in-window months) emits exactly one Partial-coverage WARNING. * full coverage (single fixtured month) emits NO warning — prevents the signal from being lost to noise on every healthy query. * out-of-window 404s for a search_type='all' table emit NO warning, even when several scan months 404 — proves the in-window check correctly suppresses the historical-scan noise that a flat success-rate ratio would have false-positived on. 435 passed, 1 skipped (was 432). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nick-gorman and others added 4 commits May 27, 2026 11:36

nick-gorman force-pushed the polish-from-red-team branch from 76e56c4 to 896a56f Compare May 27, 2026 03:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Red-team follow-up: dedup config, PK validation, coverage warning#99

Red-team follow-up: dedup config, PK validation, coverage warning#99
nick-gorman wants to merge 4 commits into
docs-start-end-timefrom
polish-from-red-team

nick-gorman commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nick-gorman commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in here

f5b947c — F5: install standard finalise pipeline for MARKET_PRICE_THRESHOLDS

02b4ee2 — F3: validate PK columns in user select_columns up-front

46db129 — F1 (static): wire _check_loaded_select_columns into static_table

896a56f — F15: partial-coverage WARNING for gaps in the user's window

Test plan

Notes for review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nick-gorman commented May 27, 2026 •

edited

Loading

`f5b947c` — F5: install standard finalise pipeline for MARKET_PRICE_THRESHOLDS

`02b4ee2` — F3: validate PK columns in user select_columns up-front

`46db129` — F1 (static): wire `_check_loaded_select_columns` into static_table

`896a56f` — F15: partial-coverage WARNING for gaps in the user's window