docs(DOC-732): document partition_autobalancing_node_autodecommission_time #1607

Open
mfernest wants to merge 14 commits into main from feat/doc-732-ghost-node-autodecommission

Conversation

@mfernest mfernest commented Mar 16, 2026

Summary

Documents the new partition_autobalancing_node_autodecommission_time cluster property introduced in PR #28946 (CORE-7111).

  • Cluster properties reference: adds the new property entry after partition_autobalancing_node_availability_timeout_sec
  • Continuous Data Balancing guide: adds the property to the configuration table with a clear explanation of how it differs from the availability timeout (permanent decommission vs. partition moves)

Key points documented:

  • Opt-in (null/disabled by default)
  • Only applies when partition_autobalancing_mode is continuous
  • Permanently removes the node — unlike availability_timeout_sec, the node cannot rejoin
  • One decommission at a time; stalled decommissions require manual intervention
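The key points above can be read as a small decision rule. The following is a minimal Python sketch of that rule, not Redpanda's implementation: the `Node` type, the `pick_node_to_autodecommission` function, and the choice to pick the longest-unavailable node first are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: int
    unavailable_for_sec: float  # how long the node has been unreachable


def pick_node_to_autodecommission(
    nodes: List[Node],
    mode: str,
    autodecommission_timeout_sec: Optional[int],
    decommission_in_progress: bool,
) -> Optional[int]:
    """Return the ID of the node to decommission, or None if nothing should happen."""
    # Opt-in: a null (None) timeout means the feature is disabled.
    if autodecommission_timeout_sec is None:
        return None
    # Only applies when partition_autobalancing_mode is continuous.
    if mode != "continuous":
        return None
    # One decommission at a time: wait for any running one to finish.
    if decommission_in_progress:
        return None
    candidates = [
        n for n in nodes if n.unavailable_for_sec >= autodecommission_timeout_sec
    ]
    if not candidates:
        return None
    # Assumption for this sketch: prefer the node that has been down longest.
    return max(candidates, key=lambda n: n.unavailable_for_sec).node_id


nodes = [Node(1, 100), Node(2, 5000), Node(3, 7000)]
print(pick_node_to_autodecommission(nodes, "continuous", 3600, False))
```

In this sketch a stalled decommission simply keeps `decommission_in_progress` true, so no further node is picked until an operator intervenes.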

SME: Joe Miller

Preview

Test plan

  • Netlify deploy preview passes
  • Property entry renders correctly in cluster properties reference
  • Continuous data balancing table renders correctly

🤖 Generated with Claude Code

…_time

Add new cluster property that enables automatic decommission of
unavailable nodes after a configurable timeout. Updates both the
cluster properties reference and the continuous data balancing guide.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mfernest mfernest requested a review from a team as a code owner March 16, 2026 18:20

netlify bot commented Mar 16, 2026

Deploy Preview for redpanda-docs-preview failed.

🔨 Latest commit: 4221053
🔍 Latest deploy log: https://app.netlify.com/projects/redpanda-docs-preview/deploys/69c5656c0879a00008662f8d


coderabbitai bot commented Mar 16, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d7cd354f-9986-4d2a-9d87-8789d44509a2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Two documentation files were updated to introduce a new cluster property partition_autobalancing_node_autodecommission_time. The property specifies a timeout in seconds for automatic node decommission when using continuous data balancing mode. Documentation includes the property's type, default state (disabled), behavior notes, and distinction from related properties. Updates maintain consistency across continuous data balancing and cluster properties reference documentation.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested reviewers

  • mattschumpert
  • wdberkeley
  • micheleRP
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Check name: Description check
    Status: ⚠️ Warning
    Explanation: The description covers the key information but is missing required template sections like Jira ticket link, review deadline, and checkbox selections.
    Resolution: Add the missing template sections: include the Jira ticket URL in the Description header, specify a review deadline, and check the appropriate category box (likely 'Content gap' or 'New feature').

✅ Passed checks (2 passed)

  • Check name: Title check
    Status: ✅ Passed
    Explanation: The title clearly and concisely summarizes the main change: documenting a new cluster property with specific reference to the JIRA ticket.
  • Check name: Docstring Coverage
    Status: ✅ Passed
    Explanation: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
modules/manage/pages/cluster-maintenance/continuous-data-balancing.adoc (1)

30-30: Minor phrasing improvement (optional).

The phrase "at least this timeout duration" is slightly awkward. Consider simplifying to "for this timeout duration" since the "at least" is implied by a timeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modules/manage/pages/cluster-maintenance/continuous-data-balancing.adoc` at
line 30, Update the sentence describing the decommission timeout in
continuous-data-balancing.adoc: replace the phrase "at least this timeout
duration" with "for this timeout duration" to simplify phrasing; keep the rest
of the sentence and references to the property
partition_autobalancing_node_availability_timeout_sec unchanged so the meaning
remains that a node unavailable for this timeout is permanently decommissioned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modules/manage/pages/cluster-maintenance/continuous-data-balancing.adoc`:
- Around line 29-34: The description for
partition_autobalancing_node_autodecommission_time is missing the unit and
prerequisite context: update the prose to state the unit is seconds (e.g.,
"measured in seconds") and add a sentence clarifying this property only applies
when partition_autobalancing_mode is set to continuous; also keep the existing
notes about default null/disabled, one-node-at-a-time behavior, and manual
intervention if decommission stalls so the table matches other properties'
phrasing and the PR objectives.

---

Nitpick comments:
In `@modules/manage/pages/cluster-maintenance/continuous-data-balancing.adoc`:
- Line 30: Update the sentence describing the decommission timeout in
continuous-data-balancing.adoc: replace the phrase "at least this timeout
duration" with "for this timeout duration" to simplify phrasing; keep the rest
of the sentence and references to the property
partition_autobalancing_node_availability_timeout_sec unchanged so the meaning
remains that a node unavailable for this timeout is permanently decommissioned.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2438c28d-c101-4515-b05e-1eacf4a7de5a

📥 Commits

Reviewing files that changed from the base of the PR and between fe8a357 and 8129884.

📒 Files selected for processing (2)
  • modules/manage/pages/cluster-maintenance/continuous-data-balancing.adoc
  • modules/reference/partials/properties/cluster-properties.adoc

@mfernest mfernest requested a review from joe-redpanda March 16, 2026 18:42
…roperty

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
+
Default is 900 seconds (15 minutes).

| `partition_autobalancing_node_autodecommission_time`

partition_autobalancing_node_autodecommission_timeout_sec

This got updated down the line to reflect units

Contributor Author

Good catch — removed the '(in seconds)' parenthetical since the _sec suffix already conveys the unit. Fixed in the latest commit.

| `partition_autobalancing_node_autodecommission_time`
| When a node is unavailable for this timeout duration (in seconds), Redpanda automatically and permanently decommissions the node. This property only applies when `partition_autobalancing_mode` is set to `continuous`. Unlike `partition_autobalancing_node_availability_timeout_sec`, which moves partitions while keeping the node in the cluster, this property removes the node from the cluster entirely. A decommissioned node cannot rejoin the cluster. +
+
Only one node is decommissioned at a time. If a decommission is already in progress, automatic decommission does not trigger until it completes. If the decommission stalls (for example, because the node holds the only replica of a partition), manual intervention is required. +

Might be worth cross-linking to node-wise recovery here

Contributor Author

The xref to nodewise-partition-recovery.adoc is already in the description ('...manual intervention is required. See xref:manage:cluster-maintenance/nodewise-partition-recovery.adoc[].'). Let me know if you had a different location in mind.

mfernest and others added 2 commits March 17, 2026 09:30
…ommission_timeout_sec

Property was renamed upstream to reflect units. Update continuous-data-balancing.adoc
and regenerate cluster-properties partial from dev branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ommissions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mfernest
Contributor Author

@coderabbitai The phrase "at least this timeout duration" no longer appears in the file. The current text reads "When a node is unavailable for this timeout duration (in seconds)..." — already using the simpler phrasing. This comment appears to be based on a previous version of the file.

coderabbitai bot commented Mar 17, 2026

@mfernest You're right, I apologize for the confusion. My comment was based on a stale version of the file. Since the current text already uses the simpler phrasing, this concern does not apply. Sorry for the noise!

(^_^)


🧠 Learnings used
Learnt from: mfernest
Repo: redpanda-data/docs PR: 0
File: :0-0
Timestamp: 2026-03-17T18:36:27.877Z
Learning: In the redpanda-data/docs repository, the `delete_topic_enable` property is a cluster-scoped boolean property (default: true) documented only in `modules/reference/partials/properties/cluster-properties.adoc` (line 4137). It does NOT appear in `modules/reference/partials/properties/topic-properties.adoc`. Its description override is in `docs-data/property-overrides.json` at line 796.

mfernest and others added 3 commits March 18, 2026 16:20
…timeout property

The _sec suffix in partition_autobalancing_node_autodecommission_timeout_sec
already conveys the unit; the "(in seconds)" parenthetical is redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
"description": "Configure the addressing style that controls how Redpanda formats bucket URLs for S3-compatible object storage.\n\nLeave this property unset (`null`) to use automatic configuration:\n\n* For AWS S3: Redpanda attempts `virtual_host` addressing first, then falls back to `path` style if needed\n* For MinIO: Redpanda automatically uses `path` style regardless of `MINIO_DOMAIN` configuration\n\nSet this property explicitly to override automatic configuration, ensure consistent behavior across deployments, or when using S3-compatible storage that requires a specific URL format.\n\nCAUTION: AWS requires virtual-hosted addressing for buckets created after September 30, 2020. If you use AWS S3 with buckets created after this date, use `virtual_host` addressing.\n\nNOTE: For MinIO deployments, Redpanda defaults to `path` style when this property is unset. To use `virtual_host` addressing with a configured `MINIO_DOMAIN`, set this property explicitly to `virtual_host`. For other S3-compatible storage backends, consult your provider's documentation to determine the required URL style.",
"config_scope": "cluster"
},
"cloud_topics_allow_materialization_failure": {

I can't speak to these. Intentional?

| `partition_autobalancing_node_autodecommission_timeout_sec`
| When a node is unavailable for this timeout duration, Redpanda automatically and permanently decommissions the node. This property only applies when `partition_autobalancing_mode` is set to `continuous`. Unlike `partition_autobalancing_node_availability_timeout_sec`, which moves partitions while keeping the node in the cluster, this property removes the node from the cluster entirely. A decommissioned node cannot rejoin the cluster. +
+
Only one node is decommissioned at a time. If a decommission is already in progress, automatic decommission does not trigger until it completes. If the decommission stalls (for example, because the node holds the only replica of a partition), manual intervention is required. See xref:manage:cluster-maintenance/nodewise-partition-recovery.adoc[]. +

lgtm

|===


=== alter_topic_cfg_timeout_ms

Not sure what's going on here

@joe-redpanda joe-redpanda left a comment

The autodecom stuff looks good; I can't speak to the other unrelated changes. Approved on the autodecom description.

…adoc

- Add page-topic-type, personas, and learning objectives metadata
- Rewrite intro to remove weak "enables" verb
- Fix "But after" sentence opener; clarify rack replacement phrasing
- "You can then customize" → "Customize the following properties"
- Remove *Note:* inline label from table cell; fold into prose
- Fix relative xref ./cluster-balancing.adoc → fully qualified path
- Promote inline rpk commands to bash code blocks
- "Use Data Balancing commands" → sentence case
- Add intro sentence to commands section (no empty sections)
- "check the following" → "verify:"
- "in a specific node" → "on a specific node"
- Fix NOTE ambiguous subject; remove passive "hasn't been turned off"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@micheleRP micheleRP left a comment

Review Summary

Property automation analysis

Files that should be removed from this PR (auto-generated, will be overwritten, and add no reviewable value):

  1. modules/reference/partials/properties/cluster-properties.adoc (1455+/2076- lines) — Fully auto-generated partial. Including it makes the PR unreviewable and the output will be overwritten on the next automation run. This should be generated by running the automation after the property ships in a GA release.

  2. modules/reference/attachments/redpanda-properties-v25.3.10.json — Generated property snapshot. Will be overwritten.

  3. docs-data/redpanda-property-changes-v25.3.10-to-dev.json — Generated diff report. Will be overwritten.

  4. docs-data/redpanda-property-changes-v25.3.9-to-v25.3.10.json (deleted) — Removing this generated artifact adds noise.

Files that are safe and should stay:

  1. docs-data/property-overrides.json — This is the correct place to document new properties. Overrides are preserved through regeneration.

  2. modules/manage/pages/cluster-maintenance/continuous-data-balancing.adoc — Fully manual page. Safe from automation.

The "version": "dev" problem

The override sets "version": "dev" for the new property. The Handlebars template renders this as:

Introduced in dev

This will appear in the published docs and is confusing to users. The version field should either:

  • Be omitted until the property ships in a GA release, or
  • Be set to the actual release version (e.g., v25.4.1) once it's known

The automation will auto-update "version": "dev" → the correct GA version tag when it runs against a release that includes this property (see updatePropertyOverridesWithVersion in diff-utils.js). But until then, the published docs would show "Introduced in dev."

Additional concern: incorrect metadata in generated JSON

Because this property only exists in the dev branch (not v25.3.10), the generated redpanda-properties-v25.3.10.json shows it with "defined_in": "override" and incorrect metadata:

  • "config_scope": "topic" — should be cluster
  • "type": "string" — should be integer
  • "is_topic_property": true — incorrect

These will self-correct when the automation runs against a release that has the property in the C++ source. This is another reason to remove the generated files from this PR and let the automation handle them when the property ships.

What should stay in this PR

  1. property-overrides.json: The override entry is correct, but consider:

    • Remove or leave "version" empty until GA, to avoid "Introduced in dev" in published docs
    • Add "config_scope": "cluster" to prevent misclassification
    • Add "related_topics": ["xref:manage:cluster-maintenance/continuous-data-balancing.adoc[Configure Continuous Data Balancing]"]
    • Optionally add a "description" override if the auto-extracted C++ description is too terse
  2. continuous-data-balancing.adoc: The manual content is well-written and adds genuine value. The property table entry clearly differentiates auto-decommission from availability timeout. The learning objectives and page metadata additions are good.

Style notes on continuous-data-balancing.adoc

  • Good: Clear differentiation between availability_timeout_sec (moves partitions, node stays) vs autodecommission_timeout_sec (removes node permanently)
  • Good: Warning about stalled decommissions and xref to recovery docs
  • Minor: Verify the checkbox learning objectives (* [ ]) render correctly in the Antora build

@micheleRP micheleRP left a comment

@mfernest please see Claude's review

mfernest and others added 2 commits March 24, 2026 16:57
Remove property partials, JSON attachments, and diff files included by
the auto-docs regen commit. Only property-overrides.json and the
continuous-data-balancing content page belong in this PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mfernest mfernest requested a review from micheleRP March 25, 2026 00:02
@micheleRP
Contributor

@mfernest A few cleanup items before this can be merged:

1. Remove redpanda-property-changes-v25.3.10-to-dev.json from this PR
This auto-generated diff file (536 lines) is still in the PR. It will be overwritten on the next automation run and adds no reviewable value. Please revert it.

2. Remove unrelated "version": "dev" entries from property-overrides.json
The diff includes 30+ entries unrelated to DOC-732 (e.g., cloud_topics_*, delete_topic_enable, default_redpanda_storage_mode, nested_group_behavior, oidc_group_claim_path, redpanda.storage.mode, schema_registry_enable_qualified_subjects, etc.). These appear to have come in from a merge or regen run. Please revert them so the PR diff only contains the autodecommission change.

3. Remove the partition_autobalancing_node_autodecommission_timeout_sec entry from property-overrides.json
The override currently sets "version": "dev", which will render as "Introduced in dev" in published docs. Removing "version" alone would leave an empty {} entry, which serves no purpose. The cleanest approach is to remove the entry entirely and let the automation create it with the correct version and metadata when the property ships in a GA release.

Once it does ship, we can add config_scope and related_topics to match the pattern used by the sibling property partition_autobalancing_node_availability_timeout_sec.
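
One way to spot the entries described in items 2 and 3 is a short script. This is a hedged sketch only: it assumes the overrides are available as a flat mapping from property name to override object, which may not match the real layout of property-overrides.json, and `find_dev_only_overrides` and `drop_entries` are hypothetical helper names.

```python
from typing import Dict, List


def find_dev_only_overrides(overrides: Dict[str, dict]) -> List[str]:
    """Return the names of entries whose only content is "version": "dev"."""
    return sorted(
        name for name, entry in overrides.items() if entry == {"version": "dev"}
    )


def drop_entries(overrides: Dict[str, dict], names: List[str]) -> Dict[str, dict]:
    """Return a copy of the mapping without the named entries."""
    doomed = set(names)
    return {name: entry for name, entry in overrides.items() if name not in doomed}


# Hypothetical sample data, not the real contents of the file.
sample = {
    "example_property_a": {"version": "dev"},
    "example_property_b": {"version": "dev", "description": "Kept: has real content."},
    "example_property_c": {"config_scope": "cluster"},
}

dev_only = find_dev_only_overrides(sample)
cleaned = drop_entries(sample, dev_only)
print(dev_only)
print(sorted(cleaned))
```

Entries that carry anything besides the version tag are left alone in this sketch, so a reviewer would still need to decide by hand whether their "version": "dev" field should go.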

mfernest and others added 3 commits March 26, 2026 09:42
….json

Strip 30 bulk automation entries that each only carried "version": "dev",
which would render as "Introduced in dev" in published docs. The
partition_autobalancing_node_autodecommission_timeout_sec entry had no
content beyond the version tag so it is removed entirely; the automation
will add it with the correct version when the property reaches GA.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
redpanda-property-changes-v25.3.10-to-dev.json is auto-generated and
will be overwritten on the next automation run. Remove it from the PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>