[DCP Ingestion] Add scaling factor#580
Conversation
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request adds the mapping for dcid:scalingFactor to constants.COLUMN_SCALING_FACTOR in variable_per_row_importer.py. However, a critical bug was identified in _apply_property_defaults where it checks for the original property name instead of the renamed column name, which prevents this and other per-row mappings from working correctly. A fix has been suggested to check for the internal column name instead.
| "dcid:scalingFactor": | ||
| constants.COLUMN_SCALING_FACTOR, # "scaling_factor" |
There was a problem hiding this comment.
There is an existing bug in _apply_property_defaults that will prevent this new dcid:scalingFactor mapping (as well as the existing dcid:measurementMethod and dcid:observationPeriod mappings) from working when specified per-row in the CSV.\n\n### The Issue\nIn _write_observations, _apply_column_mappings is called before _apply_property_defaults:\npython\nobservations_df = (self._apply_column_mappings(self.df).pipe(\n self._track_entity_dcids).pipe(\n _apply_property_defaults, obs_props)...\n\n_apply_column_mappings renames the physical columns to their internal names (e.g., constants.COLUMN_SCALING_FACTOR which is "scaling_factor").\n\nHowever, _apply_property_defaults checks for the predicate name (prop, e.g., "scalingFactor") instead of the internal column name (col_name, e.g., "scaling_factor"):\npython\nfor prop, col_name in property_mapping.items():\n default_value = getattr(obs_props, col_name, \"\")\n if prop in df.columns: # <--- This will be False because the column was renamed to col_name\n ...\n else:\n # If the column doesn't exist, use default for all rows\n df[col_name] = default_value # <--- This overwrites the per-row values with the default!\n\n\n### Recommended Fix\nIn _apply_property_defaults (lines 85-119), update the loop to check for col_name instead of prop:\npython\ndef _apply_property_defaults(df: pd.DataFrame,\n obs_props: ObservationProperties) -> pd.DataFrame:\n \"\"\"Apply property defaults, using per-row values where available.\"\"\"\n property_mapping = {\n sc.PREDICATE_UNIT: constants.COLUMN_UNIT,\n sc.PREDICATE_SCALING_FACTOR: constants.COLUMN_SCALING_FACTOR,\n sc.PREDICATE_MEASUREMENT_METHOD: constants.COLUMN_MEASUREMENT_METHOD,\n sc.PREDICATE_OBSERVATION_PERIOD: constants.COLUMN_OBSERVATION_PERIOD,\n }\n\n for prop, col_name in property_mapping.items():\n default_value = getattr(obs_props, col_name, \"\")\n if col_name in df.columns:\n # Replace empty strings with NaN for consistent handling\n source_col = df[col_name].replace(\"\", pd.NA)\n\n # Check if source is numeric before filling (to preserve int format for numeric columns)\n is_source_numeric = pd.api.types.is_numeric_dtype(source_col)\n\n if is_source_numeric:\n df[col_name] = _convert_numeric_to_string(source_col,\n default_for_na=default_value)\n else:\n df[col_name] = source_col.fillna(default_value).astype(str)\n else:\n # If the column doesn't exist, use default for all rows\n df[col_name] = default_value\n\n # Custom properties column (always empty for variable_per_row_importer)\n df[constants.COLUMN_PROPERTIES] = \"\"\n return df\n
References
- Ensure JSON fields such as scalingFactor are not cast to numeric types (like FLOAT64) if they are defined as strings in the schema.
No description provided.