Support hot reload for cluster runtime configs #17975

Open
CRZbulabula wants to merge 4 commits into master from analysis/confignode-hot-reload-configs

@CRZbulabula

Contributor

Description

This PR adds hot-reload support for several cluster runtime configuration items managed by ConfigNode:

  • heartbeat_interval_in_ms
  • continuous_query_min_every_interval_in_ms
  • schema_region_group_extension_policy and data_region_group_extension_policy
  • default_schema_region_group_num_per_database and default_data_region_group_num_per_database
  • schema_region_per_data_node and data_region_per_data_node
  • read_consistency_level

The implementation updates ConfigNode and DataNode hot-reload paths, removes stale cached configuration snapshots from region extension and read-consistency paths, and reschedules heartbeat-related ConfigNode services when the heartbeat interval changes. Region-per-DataNode changes trigger max region group recalculation only on the current ConfigNode leader.

Load-balancing-related parameters were also reviewed. Region extension policies and region-per-DataNode limits now support hot reload, while region_group_allocate_policy and the Ratis auto leader-balance settings remain restart-only because the corresponding balancer/allocator instances are selected during service initialization.
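
To illustrate why the stale cached snapshots had to go, here is a minimal sketch contrasting a value captured once at class initialization with a live read. `DemoConfig` and `SnapshotVsLiveRead` are hypothetical stand-ins written for this illustration, not the actual ConfigNodeConfig, PartitionManager, or ClusterSchemaManager code.

```java
/** Hypothetical config holder; not the real ConfigNodeConfig. */
final class DemoConfig {
  // volatile so a value written by the hot-reload thread is visible to readers
  private volatile int regionPerDataNode = 5;

  int getRegionPerDataNode() {
    return regionPerDataNode;
  }

  void setRegionPerDataNode(int value) {
    regionPerDataNode = value;
  }
}

final class SnapshotVsLiveRead {
  static final DemoConfig CONF = new DemoConfig();

  // Stale pattern: captured once when the class is initialized,
  // so later hot reloads never reach this constant.
  static final int CACHED_REGION_PER_DATA_NODE = CONF.getRegionPerDataNode();

  // Hot-reload-friendly pattern: read the config at every use.
  static int liveRegionPerDataNode() {
    return CONF.getRegionPerDataNode();
  }
}
```

After `CONF.setRegionPerDataNode(8)`, the cached constant still holds the value seen at class initialization, while `liveRegionPerDataNode()` returns the new one.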

Testing

  • mvn spotless:apply -pl iotdb-core/confignode
  • mvn spotless:apply -pl iotdb-core/datanode
  • mvn spotless:apply -pl integration-test -P with-integration-tests
  • mvn compile -pl iotdb-core/confignode,iotdb-core/datanode -am -DskipTests -DskipUTs
  • mvn verify -DskipUTs -Dit.test=IoTDBSetConfigurationIT -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -Drat.skip=true -pl integration-test -am -P with-integration-tests
  • git diff --check

Copilot AI left a comment

Contributor

Pull request overview

This PR adds hot-reload support for multiple cluster runtime configuration items that are managed by ConfigNode and need to take effect without restarting nodes (e.g., heartbeat interval, CQ min interval, region extension policies/limits, and read consistency). It updates both ConfigNode and DataNode reload paths, removes stale cached config snapshots from execution paths, and reschedules heartbeat-driven services when the heartbeat interval changes.

Changes:

  • Mark several cluster config items as hot_reload in the system properties template and implement their hot-reload parsing/validation in ConfigNode/DataNode descriptors.
  • Remove static cached snapshots of runtime config values (region extension policies, region-per-DataNode, read consistency) so new values take effect immediately.
  • Add rescheduling/rebuild hooks for heartbeat-related services and failure detectors when the heartbeat interval changes, plus integration tests covering the hot-reload behavior.
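
The rescheduling hook mentioned in the last bullet can be sketched roughly as follows: cancel the current periodic task and submit it again with the new interval. `ReschedulableTask` and its method names are assumptions made for this sketch; the PR's actual service classes are not reproduced here.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical periodic service that can be rescheduled at runtime. */
final class ReschedulableTask {
  // Daemon thread so the sketch never blocks JVM exit.
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor(
          r -> {
            Thread t = new Thread(r, "demo-heartbeat");
            t.setDaemon(true);
            return t;
          });
  private final Runnable body;
  private ScheduledFuture<?> future;
  private long intervalMs;

  ReschedulableTask(Runnable body) {
    this.body = body;
  }

  /** Starts the periodic task if it is not already running. */
  synchronized void start(long initialIntervalMs) {
    if (future == null) {
      intervalMs = initialIntervalMs;
      future = executor.scheduleWithFixedDelay(body, 0, intervalMs, TimeUnit.MILLISECONDS);
    }
  }

  /** Reload hook: cancels the running schedule and restarts it with the new interval. */
  synchronized void reschedule(long newIntervalMs) {
    if (future == null || newIntervalMs == intervalMs) {
      return; // not started, or nothing changed
    }
    future.cancel(false); // let an in-flight run finish
    intervalMs = newIntervalMs;
    future =
        executor.scheduleWithFixedDelay(body, newIntervalMs, newIntervalMs, TimeUnit.MILLISECONDS);
  }

  synchronized long currentIntervalMs() {
    return intervalMs;
  }

  synchronized void shutdown() {
    executor.shutdownNow();
  }
}
```

Guarding `start` and `reschedule` with the same lock keeps a reload from racing with service startup; whether the real services use the same approach is not stated in this summary.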

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| iotdb-core/node-commons/src/assembly/resources/conf/iotdb-system.properties.template | Marks selected cluster configs as hot_reload in the template. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/queryengine/plan/planner/plan/AbstractFragmentParallelPlanner.java | Stops caching read consistency at construction; reads live config for routing decisions. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/conf/IoTDBDescriptor.java | Adds hot-reload parsing for CQ min interval and read consistency level on DataNode. |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/conf/IoTDBConfig.java | Makes hot-reloaded fields volatile and keeps CQ min interval fields in sync. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/schema/ClusterSchemaManager.java | Removes static cached region-per-node limits; reads live ConfigNode config. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/RetryFailedTasksThread.java | Schedules retry loop using live heartbeat interval; adds reload hook. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/partition/PartitionManager.java | Removes static cached region extension policies; reads live ConfigNode config. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/service/TopologyService.java | Uses shared failure-detector builder and adds reload hook. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/service/StatisticsService.java | Schedules with live heartbeat interval; adds reload hook. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/service/HeartbeatService.java | Schedules with live heartbeat interval; adds reload hook. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/service/EventService.java | Schedules with live heartbeat interval; adds reload hook. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/LoadManager.java | Adds a centralized heartbeat-interval reload method to reschedule services and rebuild detectors. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/cache/region/RegionGroupCache.java | Adds failure-detector reload fan-out to region caches. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/cache/LoadCache.java | Adds failure-detector reload fan-out to all cache types. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/cache/AbstractLoadCache.java | Makes detector volatile, centralizes detector creation, and adds reload method. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/ConfigManager.java | Triggers service rescheduling and leader-only max-region recalculation on hot-reload. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/conf/ConfigNodeDescriptor.java | Refactors config parsing into helpers; adds hot-reload validation/apply for new hot-reload items. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/conf/ConfigNodeConfig.java | Makes newly hot-reloaded config fields volatile. |
| integration-test/src/test/java/org/apache/iotdb/db/it/IoTDBSetConfigurationIT.java | Adds IT coverage for hot-reload of heartbeat interval, CQ min interval, region extension config, and read consistency. |

@codecov

codecov Bot commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 17.39130% with 171 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.20%. Comparing base (b443006) to head (2e67312).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...he/iotdb/confignode/conf/ConfigNodeDescriptor.java | 0.00% | 73 Missing ⚠️ |
| ...apache/iotdb/confignode/manager/ConfigManager.java | 0.00% | 37 Missing ⚠️ |
| ...tdb/confignode/manager/RetryFailedTasksThread.java | 0.00% | 11 Missing ⚠️ |
| .../confignode/manager/load/service/EventService.java | 0.00% | 11 Missing ⚠️ |
| ...fignode/manager/load/service/HeartbeatService.java | 0.00% | 11 Missing ⚠️ |
| ...ignode/manager/load/service/StatisticsService.java | 0.00% | 11 Missing ⚠️ |
| ...che/iotdb/confignode/manager/load/LoadManager.java | 0.00% | 4 Missing ⚠️ |
| ...onfignode/manager/schema/ClusterSchemaManager.java | 0.00% | 4 Missing ⚠️ |
| ...java/org/apache/iotdb/db/conf/IoTDBDescriptor.java | 83.33% | 3 Missing ⚠️ |
| ...apache/iotdb/confignode/conf/ConfigNodeConfig.java | 84.61% | 2 Missing ⚠️ |

... and 3 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17975      +/-   ##
============================================
- Coverage     41.21%   41.20%   -0.01%     
  Complexity      318      318              
============================================
  Files          5258     5258              
  Lines        365944   366120     +176     
  Branches      47330    47359      +29     
============================================
+ Hits         150825   150865      +40     
- Misses       215119   215255     +136     

@shuwenwei

Member

I have one concern about the new hot-reload behavior for heartbeat_interval_in_ms.

The PR correctly reschedules HeartbeatService, StatisticsService, EventService, and RetryFailedTasksThread when the interval changes. However, the Phi failure detector in AbstractLoadCache is constructed with CONF.getHeartbeatIntervalInMs() * 200_000L, so existing load caches may still keep the detector parameters derived from the old heartbeat interval after hot reload.

Could you confirm whether the failure detector threshold is expected to follow heartbeat_interval_in_ms hot reload as well? If yes, we probably need to rebuild/update the existing load-cache failure detectors when the heartbeat interval changes, or add a test/clarification showing that only the scheduling interval is intended to be hot-reloaded.
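
A minimal sketch of the concern, assuming a detector whose parameter is fixed at construction: unless the cache replaces the detector instance, a later interval change cannot reach it. `DemoFailureDetector` and `DemoLoadCache` are hypothetical stand-ins, not the real Phi failure detector or AbstractLoadCache, and which detector parameter the quoted expression feeds is an assumption of this sketch.

```java
/** Hypothetical detector; not IoTDB's Phi failure detector. */
final class DemoFailureDetector {
  // Fixed at construction, so later config changes cannot reach it.
  private final long derivedParamNs;

  DemoFailureDetector(long heartbeatIntervalMs) {
    // Same shape as the expression quoted above: interval * 200_000L.
    this.derivedParamNs = heartbeatIntervalMs * 200_000L;
  }

  long getDerivedParamNs() {
    return derivedParamNs;
  }
}

/** Hypothetical cache that owns one detector. */
final class DemoLoadCache {
  // volatile so a rebuilt detector is visible to other threads
  private volatile DemoFailureDetector detector;

  DemoLoadCache(long heartbeatIntervalMs) {
    detector = new DemoFailureDetector(heartbeatIntervalMs);
  }

  /** One possible reload hook: replace the detector with a freshly built one. */
  void rebuildDetector(long newHeartbeatIntervalMs) {
    detector = new DemoFailureDetector(newHeartbeatIntervalMs);
  }

  long detectorParam() {
    return detector.getDerivedParamNs();
  }
}
```

Rebuilding discards whatever heartbeat history the old detector had accumulated, which is one reason a project might prefer to keep the threshold restart-only instead.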

@Caideyipi

Collaborator

I found a few issues that look worth fixing before merge:

  1. heartbeat_interval_in_ms hot reload is incomplete. The reload path reschedules HeartbeatService, StatisticsService, EventService, and retry tasks, but existing LoadCache node/region caches and TopologyService keep PhiAccrualDetector instances that were constructed with the old heartbeat interval. PhiAccrualDetector.minHeartbeatStdNs is final, so after changing the heartbeat interval the node/region/topology failure detection thresholds still use the previous interval. Relevant locations: ConfigManager.java, LoadManager.java, AbstractLoadCache.java, and TopologyService.java.

  2. Rejected hot-reload values can still appear in show configuration as applied. Both ConfigNode and DataNode call ConfigurationFileUtils.updateAppliedProperties(...) before validating and applying the merged hot-reload properties. If a reload later fails, for example set configuration heartbeat_interval_in_ms=-1, the config file is not written, but lastAppliedProperties has already been updated, so show configuration can report a value that was not actually applied. Relevant locations: ConfigNodeDescriptor.loadHotModifiedProps, IoTDBDescriptor.loadHotModifiedProps, and ConfigurationFileUtils.updateAppliedProperties.

  3. This PR changes schema_region_per_data_node and data_region_per_data_node from accepting numeric strings through Double.parseDouble(...).intValue() to strict Integer.parseInt(...). Existing configs like data_region_per_data_node=5.0, which were previously accepted and match the old IT API type, will now fail startup or any ConfigNode hot reload because the merged config file is revalidated. If this behavior change is intentional, it should probably be called out; otherwise consider preserving backward compatibility.

@CRZbulabula

Contributor Author

Thanks @shuwenwei, confirmed that the failure detector threshold is not intended to follow heartbeat_interval_in_ms hot reload in this PR. The hot reload path only reschedules heartbeat-related services.

I pushed a follow-up commit (2e673122cc) to make this explicit and avoid partial behavior: ConfigNode now keeps a startup heartbeat-interval snapshot for failure detector initialization, and AbstractLoadCache / TopologyService use that snapshot instead of the hot-reloaded heartbeat interval. Existing detectors are not rebuilt, and newly-created detectors after hot reload still use the startup value.
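
The startup-snapshot idea can be sketched as below: the live interval stays hot-reloadable for scheduling, while a second value is frozen when the node starts and is the only one handed to failure detectors. `HeartbeatSettings` and its method names are assumptions for illustration, not the code in 2e673122cc.

```java
/** Hypothetical holder separating the live interval from the startup snapshot. */
final class HeartbeatSettings {
  // Hot-reloadable: used when (re)scheduling heartbeat-related services.
  private volatile long heartbeatIntervalMs;
  // Frozen at startup: used for failure detector initialization only.
  private final long startupHeartbeatIntervalMs;

  HeartbeatSettings(long initialIntervalMs) {
    this.heartbeatIntervalMs = initialIntervalMs;
    this.startupHeartbeatIntervalMs = initialIntervalMs;
  }

  void hotReload(long newIntervalMs) {
    heartbeatIntervalMs = newIntervalMs; // the snapshot is deliberately untouched
  }

  long schedulingIntervalMs() {
    return heartbeatIntervalMs;
  }

  long detectorIntervalMs() {
    return startupHeartbeatIntervalMs;
  }
}
```

With this split, detectors created before and after a hot reload see the same interval, so the cluster never runs a mix of old and new thresholds.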

Verified with:

  • mvn spotless:apply -pl iotdb-core/confignode
  • mvn compile -pl iotdb-core/confignode -DskipTests -DskipUTs
  • git diff --check

@CRZbulabula

Contributor Author

Thanks for the detailed review. I pushed fecde7f229 to address the remaining issues.

What changed:

  • Failure detector behavior: no further code change in this commit. As clarified in 2e673122cc, failure detector thresholds are intentionally restart-only for this PR and now use the startup heartbeat interval snapshot. Hot reload only reschedules the heartbeat-related services.
  • Rejected hot reload values: fixed the applied-property update ordering in both ConfigNode and DataNode so show configuration is only updated after a successful hot reload. I also fixed ConfigManager.setConfiguration to stop immediately when the local ConfigNode rejects a cluster-wide hot reload, so invalid values are not propagated to DataNodes or other ConfigNodes.
  • Region-per-node parsing: preserved compatibility for integer-compatible decimal strings such as 2.0, while rejecting non-integer decimals such as 1.9 instead of truncating them.
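
The ordering fix for rejected values can be illustrated with a small sketch: validate and apply first, and only then record the value as applied. `AppliedPropertiesDemo` is a hypothetical class, and the rule that the heartbeat interval must be positive is an assumption based on the rejected `-1` example; the real ConfigurationFileUtils logic is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical hot-reload path that records a value only after it was applied. */
final class AppliedPropertiesDemo {
  private static final String KEY = "heartbeat_interval_in_ms";

  // What a "show configuration" style query would report in this sketch.
  private final Map<String, String> lastApplied = new HashMap<>();
  private volatile long heartbeatIntervalMs = 1000;

  /** Returns true if the value was accepted; rejected values leave all state untouched. */
  synchronized boolean hotReload(String key, String value) {
    if (!KEY.equals(key)) {
      return false; // this sketch handles a single key
    }
    final long parsed;
    try {
      parsed = Long.parseLong(value.trim());
    } catch (NumberFormatException e) {
      return false;
    }
    if (parsed <= 0) {
      return false; // assumed validation rule
    }
    heartbeatIntervalMs = parsed; // 1. apply
    lastApplied.put(key, value); // 2. record as applied only after success
    return true;
  }

  synchronized String shown(String key) {
    return lastApplied.get(key);
  }

  long currentIntervalMs() {
    return heartbeatIntervalMs;
  }
}
```

Returning a boolean also gives the caller a natural place to stop before forwarding the request to other nodes, in the spirit of the ConfigManager.setConfiguration change described above.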

Added/updated IT coverage in IoTDBSetConfigurationIT for:

  • rejected heartbeat_interval_in_ms=-1 not polluting ConfigNode/DataNode show configuration;
  • schema_region_per_data_node / data_region_per_data_node accepting 2.0 and rejecting 1.9.
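
One way to get the accept-2.0, reject-1.9 behavior is to parse through `BigDecimal` and require an exact integer. This is a sketch under that assumption; `RegionPerDataNodeParser` is a hypothetical name and the PR's actual parsing helper may be written differently.

```java
import java.math.BigDecimal;

/** Hypothetical parser: accepts integer-valued decimals, rejects fractional ones. */
final class RegionPerDataNodeParser {
  private RegionPerDataNodeParser() {}

  static int parse(String raw) {
    try {
      // "2" and "2.0" both convert exactly; "1.9" has a nonzero fractional
      // part, so intValueExact() throws ArithmeticException instead of truncating.
      return new BigDecimal(raw.trim()).intValueExact();
    } catch (NumberFormatException | ArithmeticException e) {
      throw new IllegalArgumentException("Not an integer-compatible value: " + raw, e);
    }
  }
}
```

For comparison, `Double.parseDouble("1.9")` followed by `intValue()` silently truncates to 1, and `Integer.parseInt("2.0")` throws `NumberFormatException`; the sketch sits between those two behaviors.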

Verified with:

  • mvn spotless:apply -pl iotdb-core/confignode,iotdb-core/datanode
  • mvn spotless:apply -pl integration-test -P with-integration-tests
  • mvn compile -pl iotdb-core/confignode,iotdb-core/datanode -DskipTests -DskipUTs
  • mvn verify -DskipUTs -Dit.test=IoTDBSetConfigurationIT#testRejectedHotReloadDoesNotUpdateAppliedConfiguration+testHotReloadRegionGroupExtensionConfiguration -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -Drat.skip=true -pl integration-test -am -P with-integration-tests
  • git diff --check

Note: the same focused IT without -Drat.skip=true was blocked locally by an unrelated .claude/scheduled_tasks.lock RAT violation before tests started.

@Caideyipi Caideyipi left a comment

Collaborator

Approved. I think the remaining ConfigNode config-file-missing edge case is non-blocking for this PR: when iotdb-system.properties is absent, ConfigManager.setConfiguration can apply the value in memory, set EXECUTE_STATEMENT_ERROR for the missing file warning, and then return before running the heartbeat/region/topology hot-reload hooks. That can leave the in-memory value changed without the corresponding reschedule/recalculation hook in this degraded path. Normal deployments with the system config file present are unaffected, and the rejected-value propagation/applied-configuration issues are fixed. It would still be cleaner to separate the missing-file warning from actual reload failure in a follow-up.
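
The suggested follow-up could look roughly like this: model the missing-file case as a warning that still counts as applied, so the reload hooks run, and reserve a separate outcome for real rejections. `ReloadOutcome` and `ReloadDemo` are hypothetical names for this sketch and do not correspond to existing IoTDB status codes.

```java
/** Hypothetical outcome type that separates a warning from a real reload failure. */
enum ReloadOutcome {
  APPLIED,
  APPLIED_WITH_WARNING, // e.g. applied in memory but the config file was missing
  REJECTED;

  /** Hooks (reschedule, recalculation) should run whenever the value was applied. */
  boolean shouldRunHooks() {
    return this != REJECTED;
  }
}

final class ReloadDemo {
  private ReloadDemo() {}

  static ReloadOutcome reload(boolean valueValid, boolean configFilePresent) {
    if (!valueValid) {
      return ReloadOutcome.REJECTED;
    }
    return configFilePresent ? ReloadOutcome.APPLIED : ReloadOutcome.APPLIED_WITH_WARNING;
  }
}
```

With a three-way outcome, the caller can still surface the missing-file warning to the user without skipping the hooks that keep in-memory state and scheduled services consistent.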


4 participants