docs: ensemble component predict pattern + regression test (#1136) #1558

Merged
thinkall merged 2 commits into microsoft:main from immu4989:flaml-fix-1136-ensemble-component-preprocess
Jun 12, 2026

immu4989 commented Jun 9, 2026

Why are these changes needed?

#1136 reported a long-standing footgun: when automl.fit(..., ensemble=True) is used on a DataFrame with categorical features, calling automl.model.estimators_[i].predict(X_raw) on a single ensemble component raises cryptic errors that vary by estimator (LightGBM: "train and valid dataset categorical_feature do not match"; XGBoost: "DataFrame.dtypes for data must be int, float, bool or category"; sklearn estimators: "feature_names should match those that were passed during fit"). The reporter found a workaround that reached into private state: automl._state.task.preprocess(X, automl._transformer).

PR #1497 (merged 2026-01-21) already added the public automl.preprocess(X) method that performs exactly this transformation without touching private state. The underlying functional gap is therefore already closed — but the docstring example didn't show the ensemble-component case, and there was no regression test pinning the original failure surface. This PR closes both gaps:

  • flaml/automl/automl.py — extend the preprocess() docstring example to show the per-component prediction pattern, referencing #1136.
  • test/automl/test_regression.py — add test_ensemble_component_predict_via_public_preprocess, which builds an ensemble on a DataFrame with categorical features (gender, education — the exact failure surface from #1136), asserts that at least one component fails on raw input, and verifies all components succeed once data is run through automl.preprocess(X).
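
The per-component pattern described above can be sketched roughly as follows. This is a minimal illustration, not the docstring text added by this PR: the helper name predict_with_components is hypothetical, and it assumes automl is a fitted flaml.AutoML instance trained with ensemble=True, so that automl.model exposes estimators_. Only automl.preprocess(X) and automl.model.estimators_[i].predict(...) come from the description above.

```python
def predict_with_components(automl, X_raw):
    """Return one prediction result per ensemble component.

    Assumption: `automl` is a fitted flaml.AutoML object trained with
    ensemble=True. `predict_with_components` is a hypothetical helper
    name used only for this sketch.
    """
    # Apply the public, task-level preprocessing to the raw input
    # instead of reaching into automl._state or automl._transformer.
    X_pre = automl.preprocess(X_raw)

    # Each component was fit on preprocessed data, so it must be given
    # X_pre rather than X_raw when called directly.
    return [est.predict(X_pre) for est in automl.model.estimators_]


# Hypothetical usage, assuming X_train, y_train and X_test already exist:
#   from flaml import AutoML
#   automl = AutoML()
#   automl.fit(X_train, y_train, task="regression", time_budget=30, ensemble=True)
#   per_component_preds = predict_with_components(automl, X_test)
```

Calling automl.predict(X_raw) on the full ensemble is unaffected; the extra preprocess step only matters when a component is pulled out of estimators_ and used on its own.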

After this lands, #1136 can be closed.

Verified locally

  • New test: pytest test/automl/test_regression.py::test_ensemble_component_predict_via_public_preprocess — passes.
  • Adjacent test_multioutput test continues to pass.
  • pre-commit run --files flaml/automl/automl.py test/automl/test_regression.py — all hooks pass.

Related issue number

Closes #1136 (functional resolution shipped in #1497; this PR adds the missing example + regression test).

Checks

Copilot AI left a comment

Pull request overview

This PR updates FLAML’s public AutoML.preprocess(X) documentation and adds a regression test to capture the ensemble-component prediction “footgun” from #1136, where individual ensemble components (from automl.model.estimators_) are fit on task-preprocessed data and therefore require callers to preprocess inputs before calling .predict() directly.

Changes:

  • Extend AutoML.preprocess() docstring example to include the ensemble-component prediction pattern referencing #1136.
  • Add a regression test that builds an ensemble on a DataFrame with categorical columns and verifies component prediction succeeds after using automl.preprocess(X).
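
The shape of such a regression check might look like the sketch below. It is not the test added in test/automl/test_regression.py: the helper name check_component_predict and its return value are assumptions made for illustration, and it presumes automl was already fit with ensemble=True on a DataFrame that has categorical columns.

```python
def check_component_predict(automl, X_raw):
    """Count raw-input failures, then require success on preprocessed input.

    Hypothetical helper for illustration; `automl` is assumed to be a
    fitted flaml.AutoML instance trained with ensemble=True.
    Returns (number of components that failed on raw input, total components).
    """
    components = list(automl.model.estimators_)

    # Step 1: record which components reject the raw input.
    raw_failures = 0
    for est in components:
        try:
            est.predict(X_raw)
        except Exception:
            raw_failures += 1

    # Step 2: every component must succeed on preprocessed input and
    # return one prediction per input row.
    X_pre = automl.preprocess(X_raw)
    for est in components:
        preds = est.predict(X_pre)
        assert len(preds) == len(X_raw)

    return raw_failures, len(components)
```

A test built on this helper would then assert that the first returned value is at least 1, which pins the original failure surface while confirming that the public preprocess() path resolves it.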

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| flaml/automl/automl.py | Docstring example updated to demonstrate preprocessing before predicting with a single ensemble component. |
| test/automl/test_regression.py | Adds a regression test covering the ensemble-component prediction workflow with categorical features using the public preprocess() API. |

Comment thread test/automl/test_regression.py
Comment thread test/automl/test_regression.py
Comment thread test/automl/test_regression.py
Comment thread flaml/automl/automl.py
thinkall merged commit 7d25e03 into microsoft:main on Jun 12, 2026
13 checks passed
Development

Successfully merging this pull request may close these issues.

Prediction problem for component models while using ensembling

3 participants