Red-team follow-up: dedup config, PK validation, coverage warning#99
Open
nick-gorman wants to merge 4 commits into
Open
Red-team follow-up: dedup config, PK validation, coverage warning#99nick-gorman wants to merge 4 commits into
nick-gorman wants to merge 4 commits into
Conversation
`processing_info_maps.finalise["MARKET_PRICE_THRESHOLDS"]` was `None`,
unique among the search_type="all" tables. Every other one ships
`[most_recent_records_before_start_time, drop_duplicates_by_primary_key]`
so a "snapshot as of start_time" query returns one row per PK pair.
With `finalise=None`, every monthly archive's copy of every effective
date passed straight through to the user. A multi-month query
returned ~N copies of each row where N is the number of monthly files
scanned — a 5-month live query returned 1269 rows for 16 distinct
(EFFECTIVEDATE, VERSIONNO) pairs, the same byte-identical row
repeating up to 114 times. Joining MARKET_PRICE_THRESHOLDS to
DISPATCHPRICE on EFFECTIVEDATE produces cartesian-explosion bugs;
`df.set_index('EFFECTIVEDATE')` blows up on the duplicate index;
mean/aggregate calculations are heavily biased toward older effective
dates because more monthly files have already shipped by then.
`defaults.effective_date_group_col["MARKET_PRICE_THRESHOLDS"]` is
already `[]`, which sends `most_recent_records_before_start_time` down
the `.tail(1)` branch — exactly one "as-of" row from before
start_time. `defaults.table_primary_keys["MARKET_PRICE_THRESHOLDS"]`
is already `["EFFECTIVEDATE", "VERSIONNO"]`, so the dedup just works.
The fix is to install the standard pipeline.
Strengthens the existing test to assert the PK invariant — verified
the strengthened test FAILS without the fix (every PK pair appears
twice in the 2-month fixture) and PASSES with it. Full suite green:
428 passed, 1 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For tables whose processing pipeline dedupes on PK, omitting a PK
column from user-typed select_columns previously caused a bare pandas
KeyError to escape from inside the finalise step:
KeyError: Index(['VERSIONNO'], dtype='str')
The traceback pointed at pandas internals and gave the user no clue
that the actual fix is "add VERSIONNO back to your select_columns" —
or just use the defaults, which always include the PK.
Validate up-front in dynamic_data_compiler, cache_compiler, and
static_table:
* Dynamic-table dedup is detected by the presence of
`query_wrappers.drop_duplicates_by_primary_key` in the table's
finalise pipeline.
* Static-table dedup is detected by the presence of
`_finalise_excel_data` in the table's static_data_finaliser_map
entry (it calls `drop_duplicates(primary_keys)` when the table is
in `defaults.table_primary_keys`).
In both cases, raise a clear UserInputError naming the missing PK
column(s) and the suggested fix (add to select_columns, or pass
select_columns=None to use defaults).
Two regression tests in test_errors.py — one each for the dynamic
(LOSSFACTORMODEL, PK = [EFFECTIVEDATE, INTERCONNECTORID, REGIONID,
VERSIONNO]) and static (Generators and Scheduled Loads, PK = [DUID])
crash paths.
430 passed, 1 skipped (was 428).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tic_table `dynamic_data_compiler` had two related guards added in 63edfee: * `_check_loaded_select_columns`: any user-typed select_columns column that didn't survive the file load is a hard error (catches typos like 'duid' for 'DUID' that previously returned a stub DataFrame with a WARNING). * The post-load filter_cols check at line 162 still constructed UserInputError without `raise` — a silent filter no-op that returned unfiltered data when the filter column was inherited from defaults but happened to be missing from this AEMO file vintage. `static_table` shared both shapes. The first wasn't covered at all — a typo in select_columns silently produced a stub DataFrame. The second had the same missing-`raise` bug at the static_table call site. Apply both fixes to static_table: * Capture `user_select_columns` before defaults reassignment. * Call `_check_loaded_select_columns` after column-selection + string-strip, before filtering. * Add `raise` in front of the previously-bare `UserInputError(...)` in the post-load filter_cols check. Fix the same missing-`raise` at the corresponding spot in `dynamic_data_compiler` while we're here — same bug pattern. With `_check_loaded_select_columns` in place, the missing-`raise` path is mostly unreachable, but defence-in-depth matters for the defaults-inherited-but-vintage-missing edge case. Tests: * `test_static_select_columns_typo_raises` — bogus column in select_columns raises UserInputError naming the missing column. * `test_static_filter_col_not_in_loaded_data_raises` — combined select-and-filter typo raises (caught by the earlier _check_loaded_select_columns). 432 passed, 1 skipped (was 430). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…window
A multi-period query that partly succeeds and partly 404s previously
had no aggregate signal — only N separate per-file warnings that
aggregating users tend not to read. So asking for 16 months of
INTERMITTENT_GEN_SCADA against AEMO's ~6-7-week Reports/Current/
retention silently returned ~10% of the requested range, with all
downstream means/aggregates correspondingly wrong.
Track per-period success in `_dynamic_data_fetch_loop`, but only count
iterations whose data period actually overlaps the user's
[start_time, end_time) window. After the loop, if some — but not all
— of those in-window periods returned data, emit a single summary
WARNING naming the table, the missing count, the coverage
percentage, and the three common causes (retention window,
not-yet-published month, pre-AEMO data).
The in-window check (`_iteration_overlaps_window`) translates each
date_gen iteration into its [period_start, period_end) extent —
monthly for MMS, daily for current/scraped, 5-min for FCAS — and
runs the standard strict half-open overlap test:
period_start < user_end AND period_end > user_start
This naturally excludes the two kinds of "behind-the-scenes" iterations
the fetch loop makes that the user didn't directly ask for:
* The 1-day buffer-back month / day before `start_time`. Its period
sits flush against `user_start` from the left, so the strict
`period_end > user_start` test fails and it's excluded.
* The ~120 historical months a search_type='all' table scans to
compute "snapshot as of start_time". Those periods are entirely
before `user_start` and excluded.
So a PARTICIPANT query for [2024-01-01, 2024-06-01) counts 5 in-window
months (Jan–May) regardless of the ~115 historical-scan + buffer-back
months the loop actually iterated. If all 5 return data → no warning.
If one is missing → 4/5, warning fires for a real gap in the user's
request.
caching_mode skips the check: cache_compiler doesn't return a frame
the caller could aggregate over, and the per-period warnings already
appear in its log stream. The zero-data case is also skipped — the
caller already raises NoDataToReturn upstream, and a duplicate
WARNING would just be noise.
Three regression tests in test_errors.py:
* partial coverage (DISPATCHPRICE 2018-04 -> 2018-12 against a
fixture that only has 2018-04 and 2018-05; 2/8 in-window months)
emits exactly one Partial-coverage WARNING.
* full coverage (single fixtured month) emits NO warning — prevents
the signal from being lost to noise on every healthy query.
* out-of-window 404s for a search_type='all' table emit NO warning,
even when several scan months 404 — proves the in-window check
correctly suppresses the historical-scan noise that a flat
success-rate ratio would have false-positived on.
435 passed, 1 skipped (was 432).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
76e56c4 to
896a56f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stacked on top of #98. Four small commits, each addressing one finding from a red-team pass against
master(full notes in the companionnemosis-smokerepo'sredteam/FINDINGS.md). All within the spirit of #98 — convert previously-silent footguns into clear errors or warnings, with regression tests — but slightly broader in scope, so kept on a separate branch.What's in here
f5b947c— F5: install standard finalise pipeline for MARKET_PRICE_THRESHOLDSprocessing_info_maps.finalise["MARKET_PRICE_THRESHOLDS"]wasNone, unique amongsearch_type="all"tables. Every other one ships[most_recent_records_before_start_time, drop_duplicates_by_primary_key]— MARKET_PRICE_THRESHOLDS got neither. Effect:effective_date_group_colfor this table is already[], which sendsmost_recent_records_before_start_timedown the.tail(1)branch. PKs and primary_date_columns are also already set up. The fix is just to install the standard pipeline.Strengthens the existing fixture test to assert the PK invariant — verified the new assertion FAILS without the fix (the 2-month fixture exhibits 2× duplication) and PASSES with it.
02b4ee2— F3: validate PK columns in user select_columns up-frontFor tables whose pipeline dedupes on PK, omitting a PK column from user-typed
select_columnspreviously caused a bare pandasKeyError: Index(['VERSIONNO'], dtype='str')from inside the finalise step. Cryptic — the user has no clue the fix is "add VERSIONNO back" (or just use defaults).New
_validate_user_select_columns_includes_pkcovers both:query_wrappers.drop_duplicates_by_primary_key.static_data_finaliser_map[table_name]contains_finalise_excel_data(which callsdrop_duplicates(primary_keys)when the table is indefaults.table_primary_keys).Raises a clear
UserInputErrornaming the missing PK column(s) and suggesting either adding them back or passingselect_columns=None. Two regression tests intest_errors.py.46db129— F1 (static): wire_check_loaded_select_columnsinto static_table_check_loaded_select_columnswas added in #98 fordynamic_data_compilerbut notstatic_table. So a typo in user-typedselect_columns(['Participant', 'duid']instead of['Participant', 'DUID']) silently shipped a stub DataFrame fromstatic_tablewith justParticipant.Apply the same guard to
static_table. Also fix the missing-raisein the post-loadfilter_colscheck at the static_table call site, and in the matching spot indynamic_data_compiler(was constructed-but-not-raised — same shape as the bugs fixed in #98 but two call sites further down). With_check_loaded_select_columnsin place this path is mostly unreachable, but defence-in-depth matters for the defaults-inherited-but-vintage-missing edge case. Two regression tests.896a56f— F15: partial-coverage WARNING for gaps in the user's windowA multi-period query that partly succeeds and partly 404s previously had no aggregate signal — only N separate per-file warnings that aggregating users tend not to read. So asking for 16 months of
INTERMITTENT_GEN_SCADAagainst AEMO's ~6-7-weekReports/Current/retention silently returned ~10% of the requested range, with all downstream means/aggregates correspondingly wrong.Track per-period success through the fetch loop, but only count iterations whose data period actually overlaps the user's
[start_time, end_time)window. After the loop, if some — but not all — of those in-window periods returned data, emit a single summary WARNING naming the table, missing count, coverage percentage, and the three common causes (retention window, not-yet-published month, pre-AEMO data).The in-window check (
_iteration_overlaps_window) translates eachdate_geniteration into its[period_start, period_end)extent — monthly for MMS, daily for current/scraped, 5-min for FCAS — and runs the standard strict half-open overlap test:This naturally excludes the two kinds of "behind-the-scenes" iterations the fetch loop makes that the user didn't directly ask for:
start_time— sits flush againstuser_startfrom the left, fails the strictperiod_end > user_starttest.search_type='all'table scans to compute "snapshot as of start_time" — entirely beforeuser_start.So a PARTICIPANT query for
[2024-01-01, 2024-06-01)counts 5 in-window months (Jan–May) regardless of the ~115 historical-scan + buffer-back months the loop actually iterated. If all 5 return data → no warning. If one is missing →4/5, warning fires for a real gap.Three regression tests:
Test plan
uv run pytest tests/— 435 passed, 1 skipped (was 432 on PR Polish from DISPATCHPRICE user-testing: input validation, log clarity, index reset #98).nemosis-smokeenv and ran the three red-team smoke gates — all PASS:11_market_price_thresholds_dedup.py— confirms PK invariant on a 5-month live query (was FAIL before this branch)12_filter_cols_silent_skip.py— confirmsfilter_colswith a missing column raises (was FAIL)13_current_table_coverage.py— confirms the "Partial coverage" WARNING surfaces for a multi-month current-table query (was FAIL)INTERMITTENT_GEN_SCADA4-month query shows47/120 periods in the requested window returned data (39%)plus the cause hints.Notes for review
static_table("Generators and Scheduled Loads", select_columns=['Participant', 'Region'])(missing theDUIDPK). If this is a known idiom anywhere, the error message points the user atselect_columns=None.NoDataToReturn's job). The strict half-open overlap test handles both the buffer-back month andsearch_type='all''s historical scan automatically.🤖 Generated with Claude Code