Improve validation performance by caching compiled schematron Templates to avoid repeated XSLT compilation#1572
Conversation
Add a HashMap<String, Templates> cache to SchematronTransformer so that the expensive ISO schematron XSLT compilation is performed only once per unique schematron source string. Subsequent calls to transform(String) return a new Transformer from the cached Templates object. - Extract compilation logic into private compileSchematron() returning Templates - Add cache lookup in transform(String, ProblemHandler) - Add clearCache() method (naturally reset when LabelValidator.clear() creates a new instance) - Add debug logging for cache hits/misses Fixes: NASA-PDS#1565 Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
…n-templates Cache compiled schematron Templates to avoid repeated XSLT compilation
jordanpadams
left a comment
There was a problem hiding this comment.
PR Review
Overall: Approve with minor comments. The implementation is correct and well-targeted. Using javax.xml.transform.Templates is exactly the right JAXP API for this — it represents a compiled, thread-safe stylesheet that produces new Transformer instances cheaply. The core fix is sound.
What's good
- Correct API choice.
Templatesis JAXP's purpose-built mechanism for compiled stylesheet reuse.newTransformer()is lightweight; the expensive compilation (newTemplates()) now happens once. - Clean extraction.
compileSchematron()as a private method reads well and avoids duplication between the cached and uncached code paths. Transformernot cached. Correctly callstemplates.newTransformer()on each use —Transformeris not thread-safe and must not be shared.- Debug logging. Cache hit/miss logging will make performance profiling straightforward.
Comments
1. HashMap is not thread-safe — use ConcurrentHashMap as a forward-looking safety measure
SchematronTransformer is currently safe because LabelValidator.parseAndValidate() is synchronized, so the cache is only accessed by one thread at a time. But the parallel-validation work (#1566, #1567) will remove that synchronized constraint. Switching to ConcurrentHashMap now costs nothing and avoids a subtle data race regression later:
// current
private final Map<String, Templates> cachedTemplates = new HashMap<>();
// safer
private final Map<String, Templates> cachedTemplates = new ConcurrentHashMap<>();2. Cache key is the full schematron string — memory overhead for large schematrons
Using the source string itself as the key means the map holds a reference to the entire schematron text in addition to the compiled Templates. For large schematrons this doubles memory use for cached entries. A minor improvement would be keying on a hash of the source (e.g., SHA-256 hex) rather than the full string. Low priority but worth knowing.
3. clearCache() is public but not wired up anywhere — consider package-private or remove
The commit message says the cache "naturally resets when LabelValidator.clear() creates a new instance." If that's the lifecycle, clearCache() appears to be dead code. Either wire it to LabelValidator.clear() or make it package-private to limit surface area. If it's intended for future use, a // for testing comment would clarify intent.
4. transform(Source, ProblemHandler) bypass — cache is not used for all call paths
The Source-based transform(Source, ProblemHandler) path bypasses the cache entirely — it calls compileSchematron() directly every time. If any caller goes through the Source overload rather than the String overload, they get no benefit. Worth confirming that LabelValidator call sites only go through the String path in practice.
5. No test for cache behavior
There's no test asserting that compileSchematron is called only once for repeated identical inputs. A simple unit test on SchematronTransformer using a spy or call counter would protect this from regression.
Summary
| Comment | Severity |
|---|---|
Use ConcurrentHashMap for future thread-safety |
Recommended |
| Cache key memory overhead | Low / informational |
clearCache() wiring / visibility |
Minor |
Source-path cache bypass |
Minor — verify not hit in practice |
| No cache-hit regression test | Recommended |
Intercept non-existent file targets in doValidation() before running the validator. Records a MISSING_REFERENCED_FILE error (PRODUCT category) directly to the report so the product shows FAIL, the error is counted in the summary, and the exit code is non-zero. Previously, LocationValidator recorded the error as NO_PRODUCTS_FOUND (ProblemCategory.EXECUTION), which Report.record() explicitly excluded from error counts, causing the product to show PASS with 0 errors in the summary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a HashMap<String, Templates> cache to SchematronTransformer so that the expensive ISO schematron XSLT compilation is performed only once per unique schematron source string. Subsequent calls to transform(String) return a new Transformer from the cached Templates object. - Extract compilation logic into private compileSchematron() returning Templates - Add cache lookup in transform(String, ProblemHandler) - Add clearCache() method (naturally reset when LabelValidator.clear() creates a new instance) - Add debug logging for cache hits/misses Fixes: NASA-PDS#1565 Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
1. Use ConcurrentHashMap instead of HashMap for thread-safety (NASA-PDS#1566, NASA-PDS#1567) 2. Key cache on SHA-256 hash of source string to reduce memory overhead 3. Make clearCache() package-private; add cacheSize() for testing 4. Document Source-path cache bypass design decision 5. Add SchematronTransformerTest with cache behavior assertions Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
1. ThreadLocal<MessageDigest> instead of per-call getInstance (avoids provider lookup) 2. HexFormat.of().formatHex() instead of String.format loop (Java 17+) 3. Document intentional check-then-act race on ConcurrentHashMap 4. @tag("integration") and class-level Javadoc noting Saxon-HE dependency on tests Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
Co-Authored-By: jordan.h.padams <jordan.h.padams@jpl.nasa.gov>
…ew-comments Address PR review comments on schematron Templates cache
|
Becareful with this one. I found that the schematron libraries that we use have hidden globals that cause things to be remembered. Of course that was 20 versions ago... |
@al-niessner copy that. I am going to squash these updates to ensure we can easily revert if we determine it is introducing bugs we can't fix. |
There was a problem hiding this comment.
Pull request overview
This PR improves schematron-validation performance by caching compiled XSLT Templates so schematron compilation happens once per unique schematron source instead of once per label.
Changes:
- Added a
ConcurrentHashMapcache of compiledTemplatesinSchematronTransformerand refactored compilation into acompileSchematron(...)helper. - Updated
transform(String, ProblemHandler)to reuse cached templates and return a freshTransformerper call. - Added a JUnit test suite to validate cache population, reuse, and cache clearing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/main/java/gov/nasa/pds/tools/label/SchematronTransformer.java | Introduces Templates caching to avoid repeated XSLT compilation and adds test-oriented cache helpers. |
| src/test/java/gov/nasa/pds/tools/label/SchematronTransformerTest.java | Adds integration-style tests validating the new cache behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // In practice LabelValidator only calls the String-based overload (line 622), | ||
| // so this uncached path does not affect bundle-validation performance. |
There was a problem hiding this comment.
The new inline comment references specific line numbers in LabelValidator (e.g., “line 622”). These line numbers will drift over time and make the comment misleading. Prefer referencing the method name/call site (e.g., LabelValidator.parseAndValidate() uses the String overload) without hard-coded line numbers.
| // In practice LabelValidator only calls the String-based overload (line 622), | |
| // so this uncached path does not affect bundle-validation performance. | |
| // In practice, LabelValidator.parseAndValidate() uses the String-based | |
| // overload, so this uncached path does not affect bundle-validation | |
| // performance. |
| // Package-private: cache lifetime is tied to the SchematronTransformer instance. | ||
| // LabelValidator.clear() creates a new instance (line 219), which naturally | ||
| // starts with an empty cache. Exposed for testing. | ||
| void clearCache() { | ||
| cachedTemplates.clear(); |
There was a problem hiding this comment.
PR description states clearCache() was added “for explicit lifecycle management”, but the implementation is package-private and the comment says it’s “Exposed for testing”. Either make the method part of the public API (if callers should manage cache lifetime) or update the PR description / comment to reflect that it’s test-only and not intended for external use.
| String key = sha256(source); | ||
| // Intentional check-then-act: a concurrent miss may compile twice, but | ||
| // both results are equivalent and the race is self-healing after warmup. | ||
| // computeIfAbsent is avoided because it holds a lock during compilation. | ||
| // handler is intentionally unused on cache hit — compilation errors | ||
| // would only occur during the first call (cache miss), not on reuse. | ||
| Templates templates = cachedTemplates.get(key); | ||
| if (templates == null) { | ||
| LOG.debug("transform: cache miss, compiling schematron (length={})", source.length()); | ||
| templates = compileSchematron(new StreamSource(new StringReader(source)), handler); | ||
| cachedTemplates.put(key, templates); |
There was a problem hiding this comment.
transform(String, ProblemHandler) recomputes a SHA-256 hash of the entire schematron text on every call. In LabelValidator.parseAndValidate() the same schematron String instance is reused for many labels, so this adds avoidable O(n) work per label and reduces the benefit of caching. Consider using the source String itself as the cache key (String.hashCode is memoized per instance) or caching a precomputed key alongside the schematron text so you don’t re-hash for every validation run.
| return TransformerFactory.newInstance() | ||
| .newTransformer(new StreamSource(new StringReader(schematronStyleSheet.toString()))); | ||
| .newTemplates(new StreamSource(new StringReader(schematronStyleSheet.toString()))); | ||
| } |
There was a problem hiding this comment.
compileSchematron() creates a new TransformerFactory for compiling the generated XSLT into Templates but does not apply the same secure-processing / external access restrictions used in buildIsoTransformer(). This can allow external stylesheet resolution during compilation if the generated XSLT contains imports/includes. Suggest configuring this factory similarly (secure-processing + ACCESS_EXTERNAL_* attributes and, if needed, a URIResolver) before calling newTemplates().
🗒️ Summary
Adds a
HashMap<String, Templates>cache toSchematronTransformerso the expensive ISO schematron XSLT compilation runs once per unique schematron source string rather than once per label. Subsequent calls totransform(String)return a newTransformerfrom the pre-compiledTemplatesobject — eliminating the dominant CPU cost during bundle validation for large collections with many products.Changes in
SchematronTransformer.java:cachedTemplatesinstance field (Map<String, Templates>)compileSchematron()returningTemplatestransform(String, ProblemHandler)clearCache()for explicit lifecycle management🤖 AI Assistance Disclosure
Estimated % of code influenced by AI: 90%
⚙️ Test Data and/or Report
Existing Cucumber integration tests in
src/test/resources/features/cover schematron-based validation. No new test data required — this change is transparent to callers. Performance improvement observable via timing on any bundle validation run.♻️ Related Issues
Fixes #1565
🤓 Reviewer Checklist
Reviewers: Please verify the following before approving this pull request.
Documentation and PR Content
Security & Quality
Testing & Validation
Maintenance