feat: Add StatVar Series Aggregations#571
Conversation
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| Documentation | 11 minor |
| Security | 16 medium |
| CodeStyle | 14 minor |
| Complexity | 7 medium |
🟢 Metrics 76 complexity
Metric Results Complexity 76
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request introduces the StatVarSeriesAggregator class to orchestrate multi-round statistical variable series aggregations (such as anomalies and ensembles) using BigQuery Federation over Spanner, along with corresponding integration tests. The review feedback identifies critical issues in the generated SQL queries: first, the TimeSeries table does not contain a direct entity1 column and instead requires extracting it from the entities JSON column; second, the join between Observation and TimeSeries lacks extra_entities_id, which can lead to incorrect many-to-many mappings.
| FROM EXTERNAL_QUERY("{connection_id}", | ||
| '''SELECT variable_measured, entity1, extra_entities_id, facet | ||
| FROM TimeSeries | ||
| WHERE provenance IN ({input_provenance_str}) {ts_exclude_filter} ''') |
There was a problem hiding this comment.
The TimeSeries table in Spanner does not have a direct entity1 column; instead, it stores entity information in the entities JSON column (e.g., {"entity1": "..."}). Querying entity1 directly in the EXTERNAL_QUERY will result in a Spanner SQL error.
Use JSON_VALUE(entities, '$.entity1') AS entity1 to correctly extract the entity ID.
| FROM EXTERNAL_QUERY("{connection_id}", | |
| '''SELECT variable_measured, entity1, extra_entities_id, facet | |
| FROM TimeSeries | |
| WHERE provenance IN ({input_provenance_str}) {ts_exclude_filter} ''') | |
| FROM EXTERNAL_QUERY("{connection_id}", | |
| '''SELECT variable_measured, JSON_VALUE(entities, '$.entity1') AS entity1, extra_entities_id, facet | |
| FROM TimeSeries | |
| WHERE provenance IN ({input_provenance_str}) {ts_exclude_filter} ''') |
| JOIN EXTERNAL_QUERY("{connection_id}", | ||
| '''SELECT variable_measured, entity1, extra_entities_id, facet_id, facet | ||
| FROM TimeSeries | ||
| WHERE provenance IN ({input_provenance_str})''') ts | ||
| ON o.variable_measured = ts.variable_measured AND o.entity1 = ts.entity1 AND o.facet_id = ts.facet_id |
There was a problem hiding this comment.
There are two issues in this join block:
- The
TimeSeriestable does not have anentity1column. It must be extracted from theentitiesJSON column usingJSON_VALUE(entities, '$.entity1'). - The join condition does not include
extra_entities_id. For statistical variables that represent relationships or have multiple entities (whereextra_entities_idis populated), omitting this column from the join can cause incorrect many-to-many mappings and duplicate observation rows.
Update the query to extract entity1 and include a NULL-safe comparison for extra_entities_id in the join condition.
| JOIN EXTERNAL_QUERY("{connection_id}", | |
| '''SELECT variable_measured, entity1, extra_entities_id, facet_id, facet | |
| FROM TimeSeries | |
| WHERE provenance IN ({input_provenance_str})''') ts | |
| ON o.variable_measured = ts.variable_measured AND o.entity1 = ts.entity1 AND o.facet_id = ts.facet_id | |
| JOIN EXTERNAL_QUERY("{connection_id}", | |
| '''SELECT variable_measured, JSON_VALUE(entities, '$.entity1') AS entity1, extra_entities_id, facet_id, facet | |
| FROM TimeSeries | |
| WHERE provenance IN ({input_provenance_str})''') ts | |
| ON o.variable_measured = ts.variable_measured AND o.entity1 = ts.entity1 AND o.facet_id = ts.facet_id AND COALESCE(o.extra_entities_id, '') = COALESCE(ts.extra_entities_id, '') |
WIP. Don't review yet.