Skip to content

Red-team follow-up: dedup config, PK validation, coverage warning#99

Open
nick-gorman wants to merge 4 commits into
docs-start-end-timefrom
polish-from-red-team
Open

Red-team follow-up: dedup config, PK validation, coverage warning#99
nick-gorman wants to merge 4 commits into
docs-start-end-timefrom
polish-from-red-team

Conversation

@nick-gorman
Copy link
Copy Markdown
Member

@nick-gorman nick-gorman commented May 27, 2026

Summary

Stacked on top of #98. Four small commits, each addressing one finding from a red-team pass against master (full notes in the companion nemosis-smoke repo's redteam/FINDINGS.md). All within the spirit of #98 — convert previously-silent footguns into clear errors or warnings, with regression tests — but slightly broader in scope, so kept on a separate branch.

What's in here

f5b947c — F5: install standard finalise pipeline for MARKET_PRICE_THRESHOLDS

processing_info_maps.finalise["MARKET_PRICE_THRESHOLDS"] was None, unique among search_type="all" tables. Every other one ships [most_recent_records_before_start_time, drop_duplicates_by_primary_key] — MARKET_PRICE_THRESHOLDS got neither. Effect:

  • A 5-month live query returned 1269 rows for 16 distinct (EFFECTIVEDATE, VERSIONNO) pairs — the same byte-identical row repeating up to 114 times.
  • Joining MARKET_PRICE_THRESHOLDS to DISPATCHPRICE on EFFECTIVEDATE produces cartesian-explosion bugs.
  • Aggregates (mean VOLL etc.) are heavily biased toward older effective dates because more monthly archives ship them.

effective_date_group_col for this table is already [], which sends most_recent_records_before_start_time down the .tail(1) branch. PKs and primary_date_columns are also already set up. The fix is just to install the standard pipeline.

Strengthens the existing fixture test to assert the PK invariant — verified the new assertion FAILS without the fix (the 2-month fixture exhibits 2× duplication) and PASSES with it.

02b4ee2 — F3: validate PK columns in user select_columns up-front

For tables whose pipeline dedupes on PK, omitting a PK column from user-typed select_columns previously caused a bare pandas KeyError: Index(['VERSIONNO'], dtype='str') from inside the finalise step. Cryptic — the user has no clue the fix is "add VERSIONNO back" (or just use defaults).

New _validate_user_select_columns_includes_pk covers both:

  • Dynamic: finalise pipeline contains query_wrappers.drop_duplicates_by_primary_key.
  • Static: static_data_finaliser_map[table_name] contains _finalise_excel_data (which calls drop_duplicates(primary_keys) when the table is in defaults.table_primary_keys).

Raises a clear UserInputError naming the missing PK column(s) and suggesting either adding them back or passing select_columns=None. Two regression tests in test_errors.py.

46db129 — F1 (static): wire _check_loaded_select_columns into static_table

_check_loaded_select_columns was added in #98 for dynamic_data_compiler but not static_table. So a typo in user-typed select_columns (['Participant', 'duid'] instead of ['Participant', 'DUID']) silently shipped a stub DataFrame from static_table with just Participant.

Apply the same guard to static_table. Also fix the missing-raise in the post-load filter_cols check at the static_table call site, and in the matching spot in dynamic_data_compiler (was constructed-but-not-raised — same shape as the bugs fixed in #98 but two call sites further down). With _check_loaded_select_columns in place this path is mostly unreachable, but defence-in-depth matters for the defaults-inherited-but-vintage-missing edge case. Two regression tests.

896a56f — F15: partial-coverage WARNING for gaps in the user's window

A multi-period query that partly succeeds and partly 404s previously had no aggregate signal — only N separate per-file warnings that aggregating users tend not to read. So asking for 16 months of INTERMITTENT_GEN_SCADA against AEMO's ~6-7-week Reports/Current/ retention silently returned ~10% of the requested range, with all downstream means/aggregates correspondingly wrong.

Track per-period success through the fetch loop, but only count iterations whose data period actually overlaps the user's [start_time, end_time) window. After the loop, if some — but not all — of those in-window periods returned data, emit a single summary WARNING naming the table, missing count, coverage percentage, and the three common causes (retention window, not-yet-published month, pre-AEMO data).

The in-window check (_iteration_overlaps_window) translates each date_gen iteration into its [period_start, period_end) extent — monthly for MMS, daily for current/scraped, 5-min for FCAS — and runs the standard strict half-open overlap test:

period_start < user_end AND period_end > user_start

This naturally excludes the two kinds of "behind-the-scenes" iterations the fetch loop makes that the user didn't directly ask for:

  • The 1-day buffer-back month/day before start_time — sits flush against user_start from the left, fails the strict period_end > user_start test.
  • The ~120 historical months a search_type='all' table scans to compute "snapshot as of start_time" — entirely before user_start.

So a PARTICIPANT query for [2024-01-01, 2024-06-01) counts 5 in-window months (Jan–May) regardless of the ~115 historical-scan + buffer-back months the loop actually iterated. If all 5 return data → no warning. If one is missing → 4/5, warning fires for a real gap.

Three regression tests:

  • partial coverage (DISPATCHPRICE 2018-04 → 2018-12 against a fixture with only 2018-04 and 2018-05; 2/8 in-window) emits the WARNING.
  • full coverage (single fixtured month) emits NO warning — prevents the signal from being drowned in noise on every healthy query.
  • out-of-window 404s for a search_type='all' table emit NO warning, even when scan months 404 — proves the in-window check correctly suppresses the historical-scan noise that a flat success-rate ratio would have false-positived on.

Test plan

  • uv run pytest tests/435 passed, 1 skipped (was 432 on PR Polish from DISPATCHPRICE user-testing: input validation, log clarity, index reset #98).
  • Re-installed locally into the nemosis-smoke env and ran the three red-team smoke gates — all PASS:
    • 11_market_price_thresholds_dedup.py — confirms PK invariant on a 5-month live query (was FAIL before this branch)
    • 12_filter_cols_silent_skip.py — confirms filter_cols with a missing column raises (was FAIL)
    • 13_current_table_coverage.py — confirms the "Partial coverage" WARNING surfaces for a multi-month current-table query (was FAIL)
  • Re-ran the existing smoke set (01, 03, 04, 06, 07, 09, 10) — all still PASS. (08 has a pre-existing smoke-script bug unrelated to NEMOSIS; 02 and 05 skipped only because they take ~3-5 min each.)
  • Sanity-tested the partial-coverage warning verbiage live: INTERMITTENT_GEN_SCADA 4-month query shows 47/120 periods in the requested window returned data (39%) plus the cause hints.

Notes for review

  • F5 is the only commit that changes data semantics for a previously-supported table. Anyone currently relying on "MARKET_PRICE_THRESHOLDS returns N copies of each row" (presumably nobody) would see a row-count change. The actual data values are unaffected.
  • F3's static-table guard catches static_table("Generators and Scheduled Loads", select_columns=['Participant', 'Region']) (missing the DUID PK). If this is a known idiom anywhere, the error message points the user at select_columns=None.
  • The F15 coverage check has no threshold/heuristic — it warns iff at least one in-window period is missing (and at least one succeeded; zero-data is NoDataToReturn's job). The strict half-open overlap test handles both the buffer-back month and search_type='all''s historical scan automatically.

🤖 Generated with Claude Code

nick-gorman and others added 4 commits May 27, 2026 11:36
`processing_info_maps.finalise["MARKET_PRICE_THRESHOLDS"]` was `None`,
unique among the search_type="all" tables. Every other one ships
`[most_recent_records_before_start_time, drop_duplicates_by_primary_key]`
so a "snapshot as of start_time" query returns one row per PK pair.

With `finalise=None`, every monthly archive's copy of every effective
date passed straight through to the user. A multi-month query
returned ~N copies of each row where N is the number of monthly files
scanned — a 5-month live query returned 1269 rows for 16 distinct
(EFFECTIVEDATE, VERSIONNO) pairs, the same byte-identical row
repeating up to 114 times. Joining MARKET_PRICE_THRESHOLDS to
DISPATCHPRICE on EFFECTIVEDATE produces cartesian-explosion bugs;
`df.set_index('EFFECTIVEDATE')` blows up on the duplicate index;
mean/aggregate calculations are heavily biased toward older effective
dates because more monthly files have already shipped by then.

`defaults.effective_date_group_col["MARKET_PRICE_THRESHOLDS"]` is
already `[]`, which sends `most_recent_records_before_start_time` down
the `.tail(1)` branch — exactly one "as-of" row from before
start_time. `defaults.table_primary_keys["MARKET_PRICE_THRESHOLDS"]`
is already `["EFFECTIVEDATE", "VERSIONNO"]`, so the dedup just works.
The fix is to install the standard pipeline.

Strengthens the existing test to assert the PK invariant — verified
the strengthened test FAILS without the fix (every PK pair appears
twice in the 2-month fixture) and PASSES with it. Full suite green:
428 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For tables whose processing pipeline dedupes on PK, omitting a PK
column from user-typed select_columns previously caused a bare pandas
KeyError to escape from inside the finalise step:

    KeyError: Index(['VERSIONNO'], dtype='str')

The traceback pointed at pandas internals and gave the user no clue
that the actual fix is "add VERSIONNO back to your select_columns" —
or just use the defaults, which always include the PK.

Validate up-front in dynamic_data_compiler, cache_compiler, and
static_table:

  * Dynamic-table dedup is detected by the presence of
    `query_wrappers.drop_duplicates_by_primary_key` in the table's
    finalise pipeline.
  * Static-table dedup is detected by the presence of
    `_finalise_excel_data` in the table's static_data_finaliser_map
    entry (it calls `drop_duplicates(primary_keys)` when the table is
    in `defaults.table_primary_keys`).

In both cases, raise a clear UserInputError naming the missing PK
column(s) and the suggested fix (add to select_columns, or pass
select_columns=None to use defaults).

Two regression tests in test_errors.py — one each for the dynamic
(LOSSFACTORMODEL, PK = [EFFECTIVEDATE, INTERCONNECTORID, REGIONID,
VERSIONNO]) and static (Generators and Scheduled Loads, PK = [DUID])
crash paths.

430 passed, 1 skipped (was 428).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tic_table

`dynamic_data_compiler` had two related guards added in 63edfee:

  * `_check_loaded_select_columns`: any user-typed select_columns
    column that didn't survive the file load is a hard error (catches
    typos like 'duid' for 'DUID' that previously returned a stub
    DataFrame with a WARNING).
  * The post-load filter_cols check at line 162 still constructed
    UserInputError without `raise` — a silent filter no-op that
    returned unfiltered data when the filter column was inherited
    from defaults but happened to be missing from this AEMO file
    vintage.

`static_table` shared both shapes. The first wasn't covered at all —
a typo in select_columns silently produced a stub DataFrame. The
second had the same missing-`raise` bug at the static_table call
site.

Apply both fixes to static_table:
  * Capture `user_select_columns` before defaults reassignment.
  * Call `_check_loaded_select_columns` after column-selection +
    string-strip, before filtering.
  * Add `raise` in front of the previously-bare `UserInputError(...)`
    in the post-load filter_cols check.

Fix the same missing-`raise` at the corresponding spot in
`dynamic_data_compiler` while we're here — same bug pattern. With
`_check_loaded_select_columns` in place, the missing-`raise` path is
mostly unreachable, but defence-in-depth matters for the
defaults-inherited-but-vintage-missing edge case.

Tests:
  * `test_static_select_columns_typo_raises` — bogus column in
    select_columns raises UserInputError naming the missing column.
  * `test_static_filter_col_not_in_loaded_data_raises` — combined
    select-and-filter typo raises (caught by the earlier
    _check_loaded_select_columns).

432 passed, 1 skipped (was 430).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…window

A multi-period query that partly succeeds and partly 404s previously
had no aggregate signal — only N separate per-file warnings that
aggregating users tend not to read. So asking for 16 months of
INTERMITTENT_GEN_SCADA against AEMO's ~6-7-week Reports/Current/
retention silently returned ~10% of the requested range, with all
downstream means/aggregates correspondingly wrong.

Track per-period success in `_dynamic_data_fetch_loop`, but only count
iterations whose data period actually overlaps the user's
[start_time, end_time) window. After the loop, if some — but not all
— of those in-window periods returned data, emit a single summary
WARNING naming the table, the missing count, the coverage
percentage, and the three common causes (retention window,
not-yet-published month, pre-AEMO data).

The in-window check (`_iteration_overlaps_window`) translates each
date_gen iteration into its [period_start, period_end) extent —
monthly for MMS, daily for current/scraped, 5-min for FCAS — and
runs the standard strict half-open overlap test:

    period_start < user_end AND period_end > user_start

This naturally excludes the two kinds of "behind-the-scenes" iterations
the fetch loop makes that the user didn't directly ask for:

  * The 1-day buffer-back month / day before `start_time`. Its period
    sits flush against `user_start` from the left, so the strict
    `period_end > user_start` test fails and it's excluded.
  * The ~120 historical months a search_type='all' table scans to
    compute "snapshot as of start_time". Those periods are entirely
    before `user_start` and excluded.

So a PARTICIPANT query for [2024-01-01, 2024-06-01) counts 5 in-window
months (Jan–May) regardless of the ~115 historical-scan + buffer-back
months the loop actually iterated. If all 5 return data → no warning.
If one is missing → 4/5, warning fires for a real gap in the user's
request.

caching_mode skips the check: cache_compiler doesn't return a frame
the caller could aggregate over, and the per-period warnings already
appear in its log stream. The zero-data case is also skipped — the
caller already raises NoDataToReturn upstream, and a duplicate
WARNING would just be noise.

Three regression tests in test_errors.py:
  * partial coverage (DISPATCHPRICE 2018-04 -> 2018-12 against a
    fixture that only has 2018-04 and 2018-05; 2/8 in-window months)
    emits exactly one Partial-coverage WARNING.
  * full coverage (single fixtured month) emits NO warning — prevents
    the signal from being lost to noise on every healthy query.
  * out-of-window 404s for a search_type='all' table emit NO warning,
    even when several scan months 404 — proves the in-window check
    correctly suppresses the historical-scan noise that a flat
    success-rate ratio would have false-positived on.

435 passed, 1 skipped (was 432).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nick-gorman nick-gorman force-pushed the polish-from-red-team branch from 76e56c4 to 896a56f Compare May 27, 2026 03:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant