feat: Add StatVar Series Aggregations by SandeepTuniki · Pull Request #571 · datacommonsorg/import

SandeepTuniki · 2026-06-22T09:28:59Z

WIP. Don't review yet.

… Diff)

codacy-production · 2026-06-22T09:30:18Z

Not up to standards ⛔

🔴 Issues 23 medium · 25 minor

Alerts:
⚠ 48 issues (≤ 0 issues of at least minor severity)

Results:
48 new issues

Category Results

Documentation 11 minor

Security 16 medium

CodeStyle 14 minor

Complexity 7 medium

View in Codacy

🟢 Metrics 76 complexity

Metric Results

Complexity 76

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

gemini-code-assist

Code Review

This pull request introduces the StatVarSeriesAggregator class to orchestrate multi-round statistical variable series aggregations (such as anomalies and ensembles) using BigQuery Federation over Spanner, along with corresponding integration tests. The review feedback identifies critical issues in the generated SQL queries: first, the TimeSeries table does not contain a direct entity1 column and instead requires extracting it from the entities JSON column; second, the join between Observation and TimeSeries lacks extra_entities_id, which can lead to incorrect many-to-many mappings.

gemini-code-assist · 2026-06-22T09:31:05Z

+          FROM EXTERNAL_QUERY("{connection_id}",
+            '''SELECT variable_measured, entity1, extra_entities_id, facet 
+               FROM TimeSeries 
+               WHERE provenance IN ({input_provenance_str}) {ts_exclude_filter} ''')


The TimeSeries table in Spanner does not have a direct entity1 column; instead, it stores entity information in the entities JSON column (e.g., {"entity1": "..."}). Querying entity1 directly in the EXTERNAL_QUERY will result in a Spanner SQL error.

Use JSON_VALUE(entities, '$.entity1') AS entity1 to correctly extract the entity ID.

Suggested change

FROM EXTERNAL_QUERY("{connection_id}",

'''SELECT variable_measured, entity1, extra_entities_id, facet

FROM TimeSeries

WHERE provenance IN ({input_provenance_str}) {ts_exclude_filter} ''')

FROM EXTERNAL_QUERY("{connection_id}",

'''SELECT variable_measured, JSON_VALUE(entities, '$.entity1') AS entity1, extra_entities_id, facet

FROM TimeSeries

WHERE provenance IN ({input_provenance_str}) {ts_exclude_filter} ''')

gemini-code-assist · 2026-06-22T09:31:06Z

+          JOIN EXTERNAL_QUERY("{connection_id}",
+            '''SELECT variable_measured, entity1, extra_entities_id, facet_id, facet 
+               FROM TimeSeries 
+               WHERE provenance IN ({input_provenance_str})''') ts
+            ON o.variable_measured = ts.variable_measured AND o.entity1 = ts.entity1 AND o.facet_id = ts.facet_id


There are two issues in this join block:

The TimeSeries table does not have an entity1 column. It must be extracted from the entities JSON column using JSON_VALUE(entities, '$.entity1').

The join condition does not include extra_entities_id. For statistical variables that represent relationships or have multiple entities (where extra_entities_id is populated), omitting this column from the join can cause incorrect many-to-many mappings and duplicate observation rows.

Update the query to extract entity1 and include a NULL-safe comparison for extra_entities_id in the join condition.

Suggested change

JOIN EXTERNAL_QUERY("{connection_id}",

'''SELECT variable_measured, entity1, extra_entities_id, facet_id, facet

FROM TimeSeries

WHERE provenance IN ({input_provenance_str})''') ts

ON o.variable_measured = ts.variable_measured AND o.entity1 = ts.entity1 AND o.facet_id = ts.facet_id

JOIN EXTERNAL_QUERY("{connection_id}",

'''SELECT variable_measured, JSON_VALUE(entities, '$.entity1') AS entity1, extra_entities_id, facet_id, facet

FROM TimeSeries

WHERE provenance IN ({input_provenance_str})''') ts

ON o.variable_measured = ts.variable_measured AND o.entity1 = ts.entity1 AND o.facet_id = ts.facet_id AND COALESCE(o.extra_entities_id, '') = COALESCE(ts.extra_entities_id, '')

Implement core StatVar Series Aggregations (Anomalies, Ensembles, Max…

6742b74

… Diff)

gemini-code-assist Bot reviewed Jun 22, 2026

View reviewed changes

Implement aggr_over_time and count_threshold with E2E tests

4a509ea

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add StatVar Series Aggregations#571

feat: Add StatVar Series Aggregations#571
SandeepTuniki wants to merge 2 commits into
masterfrom
add-statvar-series-aggrs

SandeepTuniki commented Jun 22, 2026

Uh oh!

codacy-production Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SandeepTuniki commented Jun 22, 2026

Uh oh!

codacy-production Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Not up to standards ⛔

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codacy-production Bot commented Jun 22, 2026 •

edited

Loading