Skip to content

Align observationProperties to JSON-LD reference objects and fix custom namespace lookups#576

Open
clincoln8 wants to merge 1 commit into
datacommonsorg:masterfrom
clincoln8:feature/verify-nested-observation-properties
Open

Align observationProperties to JSON-LD reference objects and fix custom namespace lookups#576
clincoln8 wants to merge 1 commit into
datacommonsorg:masterfrom
clincoln8:feature/verify-nested-observation-properties

Conversation

@clincoln8

Copy link
Copy Markdown
Contributor

Summary

This PR aligns the preprocessor's JSON-LD output format for multi-entity observations with the Java ingestion pipeline's expectations, and fixes a critical namespace resolution bug in the Java pipeline.

Changes

1. Preprocessor Alignment (Python)

  • jsonld_stream_db.py: Updated _write_observation_shard to generate proper JSON-LD reference objects ({"@id": "..."}) for observationProperties keys rather than plain strings. This ensures semantic correctness (properly representing graph node references as IRIs) and matches the Java pipeline's test expectations.
  • jsonld_stream_db_test.py: Added assertions to verify dcid:variableMeasured contains the nested observationProperties reference objects, preventing future regressions.

2. Robust Namespace Resolution (Java)

  • McfUtil.stripNamespace: Discovered and fixed a critical bug where property lookups would fail for custom/third-party namespaces (e.g., custom:destinationCountry) because the prefix-stripping logic was hardcoded to only recognize dcid:, schema:, and dcs:.
  • Now, it dynamically strips any alphanumeric namespace prefix before a colon, and also handles full expanded URL prefixes, ensuring DCP's open-source flexibility for custom schemas.

3. Testing Coverage (Java)

  • JsonLdParserTest.java: Added testNestedObservationProperties_StringList to verify parsing of string-list fallback formats (matching the legacy preprocessor output).
  • GraphReaderTest.java: Added testExtractTimeSeries_JsonLd_MultiEntity_Integration to verify end-to-end parsing and reading of JSON-LD with custom namespaces, successfully validating the fix.

TAG=agy
CONV=09581de6-99cc-4213-8266-f3dbf7e70a52

…om namespace lookups

1. Preprocessor Alignment:
   - Updated 'jsonld_stream_db.py' to generate proper JSON-LD reference objects ('{"@id": "..."}') for 'observationProperties' keys rather than plain strings, ensuring semantic correctness and full compliance with JSON-LD specs.

2. Robust Namespace Resolution in Java Pipeline:
   - Discovered and fixed a critical bug in 'McfUtil.stripNamespace' where property lookups would fail for custom/third-party namespaces (e.g. 'custom:destinationCountry') because it was hardcoded to only strip 'dcid:', 'schema:', and 'dcs:'.
   - Updated the implementation to dynamically strip any alphanumeric namespace prefix before a colon, and also handle full expanded URL prefixes, ensuring DCP's open-source flexibility.

3. Testing Coverage:
   - Added 'testNestedObservationProperties_StringList' in 'JsonLdParserTest.java' to verify parsing of string-list fallback formats.
   - Added 'testExtractTimeSeries_JsonLd_MultiEntity_Integration' in 'GraphReaderTest.java' to verify end-to-end parsing and reading of JSON-LD with custom namespaces, successfully validating the fix.
   - Added corresponding assertions in 'jsonld_stream_db_test.py' to prevent regressions.

TAG=agy
CONV=09581de6-99cc-4213-8266-f3dbf7e70a52
@codacy-production

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 2 minor

Alerts:
⚠ 2 issues (≤ 0 issues of at least minor severity)

Results:
2 new issues

Category Results
CodeStyle 2 minor

View in Codacy

🟢 Metrics 6 complexity

Metric Results
Complexity 6

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates namespace stripping logic in McfUtil.java to handle full URIs and null values, refactors property key processing in jsonld_stream_db.py using _uri_ref, and adds corresponding unit and integration tests. Feedback highlights a correctness issue in McfUtil.stripNamespace where using lastIndexOf('/') can corrupt hierarchical IDs like country/FRA into FRA, suggesting that custom prefixes should instead be stripped during parsing. Additionally, it is recommended to filter out potential None values when generating prop_keys in jsonld_stream_db.py to prevent invalid JSON-LD.

Comment on lines +168 to +176
if (val.startsWith("http://") || val.startsWith("https://")) {
int lastSlash = val.lastIndexOf('/');
int lastHash = val.lastIndexOf('#');
int idx = Math.max(lastSlash, lastHash);
if (idx > 0 && idx < val.length() - 1) {
return val.substring(idx + 1);
}
return val;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Correctness Issue: Hierarchical ID Corruption with Custom Namespaces

Using val.lastIndexOf('/') to strip the namespace from a full URI will incorrectly strip hierarchical entity IDs (such as country/FRA or geoId/06) down to their last segment (e.g., FRA or 06).

For example, if a custom namespace is https://customsource.org/schema/ and the entity reference is https://customsource.org/schema/country/FRA, this method will return FRA instead of country/FRA. This will break entity resolution, node ID generation, and lookups in the graph database.

Recommended Solution

The root cause is that JsonLdParser does not strip custom namespace prefixes during parsing, leaving full URIs in the McfGraph. Trying to guess the namespace boundary in McfUtil.stripNamespace using lastIndexOf('/') is inherently fragile.

Instead:

  1. Strip Custom Prefixes in JsonLdParser: Extract the namespaces defined in the @context of the JSON-LD document during parsing, and use them to strip prefixes (e.g., converting https://customsource.org/schema/country/FRA to custom:country/FRA or country/FRA) before storing them in the McfGraph.
  2. Fallback in McfUtil: If you must keep a fallback here, avoid stripping past known hierarchical patterns, or ensure that the namespace prefix is explicitly known.

Comment on lines 93 to 95
prop_keys = [
f"dcid:{k}" if not k.startswith(
("dcid:", "http://", "https://")) else k
for k in props_dict.keys()
_uri_ref(k) for k in props_dict.keys()
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Potential None Values in observationProperties

If props_dict contains empty keys or keys that evaluate to None under _uri_ref, the resulting prop_keys list will contain None elements, which produces invalid JSON-LD / RDF.

Filtering out None values using an assignment expression ensures that only valid reference objects are included in dcid:observationProperties.

Suggested change
prop_keys = [
f"dcid:{k}" if not k.startswith(
("dcid:", "http://", "https://")) else k
for k in props_dict.keys()
_uri_ref(k) for k in props_dict.keys()
]
prop_keys = [
ref for k in props_dict.keys() if (ref := _uri_ref(k)) is not None
]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant