Align observationProperties to JSON-LD reference objects and fix custom namespace lookups#576
Conversation
…om namespace lookups
1. Preprocessor Alignment:
- Updated 'jsonld_stream_db.py' to generate proper JSON-LD reference objects ('{"@id": "..."}') for 'observationProperties' keys rather than plain strings, ensuring semantic correctness and full compliance with JSON-LD specs.
2. Robust Namespace Resolution in Java Pipeline:
- Discovered and fixed a critical bug in 'McfUtil.stripNamespace' where property lookups would fail for custom/third-party namespaces (e.g. 'custom:destinationCountry') because it was hardcoded to only strip 'dcid:', 'schema:', and 'dcs:'.
- Updated the implementation to dynamically strip any alphanumeric namespace prefix before a colon, and also handle full expanded URL prefixes, ensuring DCP's open-source flexibility.
3. Testing Coverage:
- Added 'testNestedObservationProperties_StringList' in 'JsonLdParserTest.java' to verify parsing of string-list fallback formats.
- Added 'testExtractTimeSeries_JsonLd_MultiEntity_Integration' in 'GraphReaderTest.java' to verify end-to-end parsing and reading of JSON-LD with custom namespaces, successfully validating the fix.
- Added corresponding assertions in 'jsonld_stream_db_test.py' to prevent regressions.
TAG=agy
CONV=09581de6-99cc-4213-8266-f3dbf7e70a52
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| CodeStyle | 2 minor |
🟢 Metrics 6 complexity
Metric Results Complexity 6
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request updates namespace stripping logic in McfUtil.java to handle full URIs and null values, refactors property key processing in jsonld_stream_db.py using _uri_ref, and adds corresponding unit and integration tests. Feedback highlights a correctness issue in McfUtil.stripNamespace where using lastIndexOf('/') can corrupt hierarchical IDs like country/FRA into FRA, suggesting that custom prefixes should instead be stripped during parsing. Additionally, it is recommended to filter out potential None values when generating prop_keys in jsonld_stream_db.py to prevent invalid JSON-LD.
| if (val.startsWith("http://") || val.startsWith("https://")) { | ||
| int lastSlash = val.lastIndexOf('/'); | ||
| int lastHash = val.lastIndexOf('#'); | ||
| int idx = Math.max(lastSlash, lastHash); | ||
| if (idx > 0 && idx < val.length() - 1) { | ||
| return val.substring(idx + 1); | ||
| } | ||
| return val; | ||
| } |
There was a problem hiding this comment.
Correctness Issue: Hierarchical ID Corruption with Custom Namespaces
Using val.lastIndexOf('/') to strip the namespace from a full URI will incorrectly strip hierarchical entity IDs (such as country/FRA or geoId/06) down to their last segment (e.g., FRA or 06).
For example, if a custom namespace is https://customsource.org/schema/ and the entity reference is https://customsource.org/schema/country/FRA, this method will return FRA instead of country/FRA. This will break entity resolution, node ID generation, and lookups in the graph database.
Recommended Solution
The root cause is that JsonLdParser does not strip custom namespace prefixes during parsing, leaving full URIs in the McfGraph. Trying to guess the namespace boundary in McfUtil.stripNamespace using lastIndexOf('/') is inherently fragile.
Instead:
- Strip Custom Prefixes in
JsonLdParser: Extract the namespaces defined in the@contextof the JSON-LD document during parsing, and use them to strip prefixes (e.g., convertinghttps://customsource.org/schema/country/FRAtocustom:country/FRAorcountry/FRA) before storing them in theMcfGraph. - Fallback in
McfUtil: If you must keep a fallback here, avoid stripping past known hierarchical patterns, or ensure that the namespace prefix is explicitly known.
| prop_keys = [ | ||
| f"dcid:{k}" if not k.startswith( | ||
| ("dcid:", "http://", "https://")) else k | ||
| for k in props_dict.keys() | ||
| _uri_ref(k) for k in props_dict.keys() | ||
| ] |
There was a problem hiding this comment.
Potential None Values in observationProperties
If props_dict contains empty keys or keys that evaluate to None under _uri_ref, the resulting prop_keys list will contain None elements, which produces invalid JSON-LD / RDF.
Filtering out None values using an assignment expression ensures that only valid reference objects are included in dcid:observationProperties.
| prop_keys = [ | |
| f"dcid:{k}" if not k.startswith( | |
| ("dcid:", "http://", "https://")) else k | |
| for k in props_dict.keys() | |
| _uri_ref(k) for k in props_dict.keys() | |
| ] | |
| prop_keys = [ | |
| ref for k in props_dict.keys() if (ref := _uri_ref(k)) is not None | |
| ] |
Summary
This PR aligns the preprocessor's JSON-LD output format for multi-entity observations with the Java ingestion pipeline's expectations, and fixes a critical namespace resolution bug in the Java pipeline.
Changes
1. Preprocessor Alignment (Python)
jsonld_stream_db.py: Updated_write_observation_shardto generate proper JSON-LD reference objects ({"@id": "..."}) forobservationPropertieskeys rather than plain strings. This ensures semantic correctness (properly representing graph node references as IRIs) and matches the Java pipeline's test expectations.jsonld_stream_db_test.py: Added assertions to verifydcid:variableMeasuredcontains the nestedobservationPropertiesreference objects, preventing future regressions.2. Robust Namespace Resolution (Java)
McfUtil.stripNamespace: Discovered and fixed a critical bug where property lookups would fail for custom/third-party namespaces (e.g.,custom:destinationCountry) because the prefix-stripping logic was hardcoded to only recognizedcid:,schema:, anddcs:.3. Testing Coverage (Java)
JsonLdParserTest.java: AddedtestNestedObservationProperties_StringListto verify parsing of string-list fallback formats (matching the legacy preprocessor output).GraphReaderTest.java: AddedtestExtractTimeSeries_JsonLd_MultiEntity_Integrationto verify end-to-end parsing and reading of JSON-LD with custom namespaces, successfully validating the fix.TAG=agy
CONV=09581de6-99cc-4213-8266-f3dbf7e70a52