Skip to content

[DCP Ingestion] Add scaling factor#580

Open
gmechali wants to merge 3 commits into
datacommonsorg:masterfrom
gmechali:scalingf
Open

[DCP Ingestion] Add scaling factor#580
gmechali wants to merge 3 commits into
datacommonsorg:masterfrom
gmechali:scalingf

Conversation

@gmechali

Copy link
Copy Markdown
Contributor

No description provided.

@gmechali gmechali requested a review from dwnoble June 23, 2026 20:50
@codacy-production

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds the mapping for dcid:scalingFactor to constants.COLUMN_SCALING_FACTOR in variable_per_row_importer.py. However, a critical bug was identified in _apply_property_defaults where it checks for the original property name instead of the renamed column name, which prevents this and other per-row mappings from working correctly. A fix has been suggested to check for the internal column name instead.

Comment on lines +53 to +54
"dcid:scalingFactor":
constants.COLUMN_SCALING_FACTOR, # "scaling_factor"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is an existing bug in _apply_property_defaults that will prevent this new dcid:scalingFactor mapping (as well as the existing dcid:measurementMethod and dcid:observationPeriod mappings) from working when specified per-row in the CSV.\n\n### The Issue\nIn _write_observations, _apply_column_mappings is called before _apply_property_defaults:\npython\nobservations_df = (self._apply_column_mappings(self.df).pipe(\n self._track_entity_dcids).pipe(\n _apply_property_defaults, obs_props)...\n\n_apply_column_mappings renames the physical columns to their internal names (e.g., constants.COLUMN_SCALING_FACTOR which is "scaling_factor").\n\nHowever, _apply_property_defaults checks for the predicate name (prop, e.g., "scalingFactor") instead of the internal column name (col_name, e.g., "scaling_factor"):\npython\nfor prop, col_name in property_mapping.items():\n default_value = getattr(obs_props, col_name, \"\")\n if prop in df.columns: # <--- This will be False because the column was renamed to col_name\n ...\n else:\n # If the column doesn't exist, use default for all rows\n df[col_name] = default_value # <--- This overwrites the per-row values with the default!\n\n\n### Recommended Fix\nIn _apply_property_defaults (lines 85-119), update the loop to check for col_name instead of prop:\npython\ndef _apply_property_defaults(df: pd.DataFrame,\n obs_props: ObservationProperties) -> pd.DataFrame:\n \"\"\"Apply property defaults, using per-row values where available.\"\"\"\n property_mapping = {\n sc.PREDICATE_UNIT: constants.COLUMN_UNIT,\n sc.PREDICATE_SCALING_FACTOR: constants.COLUMN_SCALING_FACTOR,\n sc.PREDICATE_MEASUREMENT_METHOD: constants.COLUMN_MEASUREMENT_METHOD,\n sc.PREDICATE_OBSERVATION_PERIOD: constants.COLUMN_OBSERVATION_PERIOD,\n }\n\n for prop, col_name in property_mapping.items():\n default_value = getattr(obs_props, col_name, \"\")\n if col_name in df.columns:\n # Replace empty strings with NaN for consistent handling\n source_col = df[col_name].replace(\"\", pd.NA)\n\n # Check if source is numeric before filling (to preserve int format for numeric columns)\n is_source_numeric = pd.api.types.is_numeric_dtype(source_col)\n\n if is_source_numeric:\n df[col_name] = _convert_numeric_to_string(source_col,\n default_for_na=default_value)\n else:\n df[col_name] = source_col.fillna(default_value).astype(str)\n else:\n # If the column doesn't exist, use default for all rows\n df[col_name] = default_value\n\n # Custom properties column (always empty for variable_per_row_importer)\n df[constants.COLUMN_PROPERTIES] = \"\"\n return df\n

References
  1. Ensure JSON fields such as scalingFactor are not cast to numeric types (like FLOAT64) if they are defined as strings in the schema.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant