diff --git a/README/WHATS_NEW_zh-CN.md b/README/WHATS_NEW_zh-CN.md index 51f42407..04613662 100644 --- a/README/WHATS_NEW_zh-CN.md +++ b/README/WHATS_NEW_zh-CN.md @@ -1,5 +1,23 @@ # 本次更新 — AutoControl +## 本次更新 (2026-06-24) — 逐步评审特征 + 规则式步骤评分 + +把为代理步骤评分所需的证据打包,并内建规则式评分器。完整参考:[`docs/source/Zh/doc/new_features/v177_features_doc.rst`](../docs/source/Zh/doc/new_features/v177_features_doc.rst)。 + +- **`build_critic_record` / `score_step_rule_based` / `to_judge_prompt`**(`AC_build_critic_record`、`AC_score_step`):`trajectory_eval` 对整条轨迹评分而无逐步证据;`agent_trace` 发出 span 而非质量;`agent_replay` 保存步骤却不评分。本功能把 `action_effect` + `observation_delta` + `postcondition` 组合成单一逐步记录,接着 `score_step_rule_based` 给出确定性的 `{outcome, process_score, reasons}`(不需模型),`to_judge_prompt` 把它渲染给可选的 LLM-as-judge。纯标准库聚合器;不导入 `PySide6`。 + +## 本次更新 (2026-06-24) — 标题与正文分类 + 文档大纲 + +以高度区分标题与正文,并建立文档大纲。完整参考:[`docs/source/Zh/doc/new_features/v176_features_doc.rst`](../docs/source/Zh/doc/new_features/v176_features_doc.rst)。 + +- **`classify_lines` / `outline`**(`AC_classify_lines`、`AC_outline`):框架中没有功能把行高对应到标题层级或建立章节大纲——`ocr/structure` / `element_parse` 纯属位置性,`text_blocks` 不排序。本功能套用标准启发法:行高超过 `heading_ratio` × 中位行高者为标题,不同标题高度成为层级(最高 = 1)。`classify_lines` 为每行标记 `{box, text, role, level}`;`outline` 依序返回标题作为目录。纯标准库,作用于行字典;不导入 `PySide6`。 + +## 本次更新 (2026-06-24) — 变化量序列的稳定检测 + +判断 UI 何时安定下来——以纯粹、可测试的函数作用于变化序列。完整参考:[`docs/source/Zh/doc/new_features/v175_features_doc.rst`](../docs/source/Zh/doc/new_features/v175_features_doc.rst)。 + +- **`settle_point` / `is_settled` / `SettleTracker`**(`AC_settle_point`):`smart_waits.wait_until_screen_stable` 把稳定逻辑包在 `time.sleep` 循环内、作用于实时帧——你无法喂记录好的序列,也无法单元测试该决策。本功能把它抽离:给定一串*变化量*(像素差 / 元素数差 / 0-1 digest 是否变),在变化量连续 `quiet_samples` 次维持 ≤ `max_churn` 时报告稳定(尖峰重置 run)。`settle_point` 返回稳定索引,`SettleTracker` 为供实时循环的增量形式。纯标准库,不需时钟、不需捕获;不导入 `PySide6`。 + ## 本次更新 (2026-06-24) — OCR 行的段落与列表分组 把 OCR 行分组成段落,并检测项目符号 / 
编号列表。完整参考:[`docs/source/Zh/doc/new_features/v174_features_doc.rst`](../docs/source/Zh/doc/new_features/v174_features_doc.rst)。 diff --git a/README/WHATS_NEW_zh-TW.md b/README/WHATS_NEW_zh-TW.md index 65ca68c9..a2cf33fd 100644 --- a/README/WHATS_NEW_zh-TW.md +++ b/README/WHATS_NEW_zh-TW.md @@ -1,5 +1,23 @@ # 本次更新 — AutoControl +## 本次更新 (2026-06-24) — 逐步評審特徵 + 規則式步驟評分 + +把為代理步驟評分所需的證據打包,並內建規則式評分器。完整參考:[`docs/source/Zh/doc/new_features/v177_features_doc.rst`](../docs/source/Zh/doc/new_features/v177_features_doc.rst)。 + +- **`build_critic_record` / `score_step_rule_based` / `to_judge_prompt`**(`AC_build_critic_record`、`AC_score_step`):`trajectory_eval` 對整條軌跡評分而無逐步證據;`agent_trace` 發出 span 而非品質;`agent_replay` 保存步驟卻不評分。本功能把 `action_effect` + `observation_delta` + `postcondition` 組合成單一逐步記錄,接著 `score_step_rule_based` 給出確定性的 `{outcome, process_score, reasons}`(不需模型),`to_judge_prompt` 把它渲染給可選的 LLM-as-judge。純標準函式庫聚合器;不匯入 `PySide6`。 + +## 本次更新 (2026-06-24) — 標題與內文分類 + 文件大綱 + +以高度區分標題與內文,並建立文件大綱。完整參考:[`docs/source/Zh/doc/new_features/v176_features_doc.rst`](../docs/source/Zh/doc/new_features/v176_features_doc.rst)。 + +- **`classify_lines` / `outline`**(`AC_classify_lines`、`AC_outline`):框架中沒有功能把行高對應到標題層級或建立章節大綱——`ocr/structure` / `element_parse` 純屬位置性,`text_blocks` 不排序。本功能套用標準啟發法:行高超過 `heading_ratio` × 中位行高者為標題,不同標題高度成為層級(最高 = 1)。`classify_lines` 為每行標記 `{box, text, role, level}`;`outline` 依序回傳標題作為目錄。純標準函式庫,作用於行字典;不匯入 `PySide6`。 + +## 本次更新 (2026-06-24) — 變化量序列的穩定偵測 + +判斷 UI 何時安定下來——以純粹、可測試的函式作用於變化序列。完整參考:[`docs/source/Zh/doc/new_features/v175_features_doc.rst`](../docs/source/Zh/doc/new_features/v175_features_doc.rst)。 + +- **`settle_point` / `is_settled` / `SettleTracker`**(`AC_settle_point`):`smart_waits.wait_until_screen_stable` 把穩定邏輯包在 `time.sleep` 迴圈內、作用於即時幀——你無法餵記錄好的序列,也無法單元測試該決策。本功能把它抽離:給定一串*變化量*(像素差 / 元素數差 / 0-1 digest 是否變),在變化量連續 `quiet_samples` 次維持 ≤ `max_churn` 時回報穩定(尖峰重置 run)。`settle_point` 回傳穩定索引,`SettleTracker` 為供即時迴圈的增量形式。純標準函式庫,不需時鐘、不需擷取;不匯入 `PySide6`。 + ## 本次更新 
(2026-06-24) — OCR 行的段落與清單分組 把 OCR 行分組成段落,並偵測項目符號 / 編號清單。完整參考:[`docs/source/Zh/doc/new_features/v174_features_doc.rst`](../docs/source/Zh/doc/new_features/v174_features_doc.rst)。 diff --git a/WHATS_NEW.md b/WHATS_NEW.md index 144438db..ed2d61ce 100644 --- a/WHATS_NEW.md +++ b/WHATS_NEW.md @@ -1,5 +1,23 @@ # What's New — AutoControl +## What's new (2026-06-24) — Per-Step Critic Features + Rule-Based Step Scorer + +Bundle the evidence to score an agent step, with a built-in rule-based scorer. Full reference: [`docs/source/Eng/doc/new_features/v177_features_doc.rst`](docs/source/Eng/doc/new_features/v177_features_doc.rst). + +- **`build_critic_record` / `score_step_rule_based` / `to_judge_prompt`** (`AC_build_critic_record`, `AC_score_step`): `trajectory_eval` scores a whole trajectory with no per-step evidence; `agent_trace` emits spans not quality; `agent_replay` stores steps but doesn't score. This composes `action_effect` + `observation_delta` + `postcondition` into one per-step record, then `score_step_rule_based` gives a deterministic `{outcome, process_score, reasons}` (no model needed) and `to_judge_prompt` renders it for an optional LLM-as-judge. Pure-stdlib aggregator; no `PySide6`. + +## What's new (2026-06-24) — Heading vs Body Classification + Document Outline + +Tell headings from body text by height and build a document outline. Full reference: [`docs/source/Eng/doc/new_features/v176_features_doc.rst`](docs/source/Eng/doc/new_features/v176_features_doc.rst). + +- **`classify_lines` / `outline`** (`AC_classify_lines`, `AC_outline`): nothing mapped line height to heading levels or built a section outline — `ocr/structure` / `element_parse` are positional and `text_blocks` doesn't rank. This applies the standard heuristic: a line taller than `heading_ratio` × the median line height is a heading, and distinct heading heights become levels (tallest = 1). 
`classify_lines` tags each line `{box, text, role, level}`; `outline` returns the headings in order as a table of contents. Pure-stdlib over line dicts; no `PySide6`. + +## What's new (2026-06-24) — Settle Detection Over a Churn Series + +Decide when the UI has gone quiet — as a pure, testable function over a change series. Full reference: [`docs/source/Eng/doc/new_features/v175_features_doc.rst`](docs/source/Eng/doc/new_features/v175_features_doc.rst). + +- **`settle_point` / `is_settled` / `SettleTracker`** (`AC_settle_point`): `smart_waits.wait_until_screen_stable` bakes the settle logic inside a `time.sleep` loop over live frames — you can't feed it a recorded series or unit-test the decision. This extracts it: given a stream of *churn* values (pixel delta / element-count delta / 0-1 digest-changed), it reports when churn stayed ≤ `max_churn` for `quiet_samples` in a row (a spike resets the run). `settle_point` returns the settle index, `SettleTracker` is the incremental form for a live loop. Pure-stdlib, no clock, no capture; no `PySide6`. + ## What's new (2026-06-24) — Paragraph & List Grouping of OCR Lines Group OCR lines into paragraphs and detect bulleted / numbered lists. Full reference: [`docs/source/Eng/doc/new_features/v174_features_doc.rst`](docs/source/Eng/doc/new_features/v174_features_doc.rst). diff --git a/docs/source/Eng/doc/new_features/v175_features_doc.rst b/docs/source/Eng/doc/new_features/v175_features_doc.rst new file mode 100644 index 00000000..ba3faf3a --- /dev/null +++ b/docs/source/Eng/doc/new_features/v175_features_doc.rst @@ -0,0 +1,43 @@ +Settle Detection Over a Churn Series +==================================== + +``smart_waits.wait_until_screen_stable`` and ``actionability``'s stability check bake the +settle logic *inside* a ``time.sleep`` polling loop over live pixel frames — you cannot feed +them a recorded series of a11y-element counts or screen-diff metrics, and you cannot unit-test +the *decision* independently of capture. 
``settle_detector`` extracts that decision: it takes a +stream of *churn* values (how much changed each sample — a pixel delta, an element-count delta, +a digest-changed 0/1, anything) and reports when the churn has stayed at or below ``max_churn`` +for ``quiet_samples`` in a row. A spike resets the quiet run, so "settled then changed again" +is handled. + +Pure-stdlib; deterministic and unit-testable on an injected series with no capture and no +clock. Imports no ``PySide6``. + +Headless API +------------ + +.. code-block:: python + + from je_auto_control import settle_point, is_settled, SettleTracker + + churns = [5, 4, 0.5, 0.3, 0.2] # per-frame change metric + settle_point(churns, quiet_samples=3, max_churn=1.0) # -> 4 + is_settled(churns, quiet_samples=3, max_churn=1.0) # -> True + + # incremental, for a live loop (you supply the churn each tick) + tracker = SettleTracker(quiet_samples=3, max_churn=1.0) + state = tracker.update(current_churn) + if state.settled: + observe_now() + +``settle_point`` returns the index at which the series first settles (or ``None``); +``is_settled`` is the boolean. ``SettleTracker`` is the incremental form: ``update(churn)`` +returns a ``SettleState`` (``settled`` / ``quiet_run`` / ``churn``); ``reset`` clears the run +(e.g. right after acting again). + +Executor command +---------------- + +``AC_settle_point`` (``churns`` / ``quiet_samples`` / ``max_churn`` → ``{settled, index}``) is +exposed as the MCP tool ``ac_settle_point`` (read-only) and as the Script Builder command +**Settle Point (churn series)** under **Flow**. 
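Reviewer note on the settle rule described in the v175 doc above: the decision can be sketched in a few lines of pure Python. This is a minimal sketch under stated assumptions, not the shipped ``settle_detector`` source, which this diff does not show. The name ``settle_point_sketch`` is hypothetical, and the returned index is assumed to be the last sample of the first quiet run, which is what the documented example (``-> 4``) implies.

```python
from typing import Optional, Sequence


def settle_point_sketch(churns: Sequence[float], quiet_samples: int = 3,
                        max_churn: float = 1.0) -> Optional[int]:
    """Hypothetical illustration of the documented rule, not the library code.

    Returns the index of the sample that completes the first run of
    ``quiet_samples`` consecutive values at or below ``max_churn``,
    or ``None`` if the series never settles.
    """
    quiet_run = 0
    for index, churn in enumerate(churns):
        if churn <= max_churn:
            quiet_run += 1
            if quiet_run >= quiet_samples:
                return index
        else:
            # A spike resets the quiet run ("settled then changed again").
            quiet_run = 0
    return None


# Same series as the doc example: indices 2, 3, 4 form the first quiet run.
print(settle_point_sketch([5, 4, 0.5, 0.3, 0.2], quiet_samples=3, max_churn=1.0))  # -> 4
```

Because the function takes only a recorded series, with no clock and no capture, a unit test can pin both the settle index and the spike-reset behaviour using plain lists.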
diff --git a/docs/source/Eng/doc/new_features/v176_features_doc.rst b/docs/source/Eng/doc/new_features/v176_features_doc.rst new file mode 100644 index 00000000..bf58d1ce --- /dev/null +++ b/docs/source/Eng/doc/new_features/v176_features_doc.rst @@ -0,0 +1,37 @@ +Heading vs Body Classification + Document Outline +================================================= + +Nothing in the framework maps line height to heading levels or builds a section outline — +``ocr/structure`` and ``element_parse`` are purely positional, and ``text_blocks`` groups +paragraphs / lists but does not rank them. ``heading_segment`` adds the standard heuristic: +a line whose height exceeds ``heading_ratio`` times the median line height is a heading, and +distinct heading heights become heading *levels* (the tallest is level 1). From that it emits +a flat document outline. + +Pure-stdlib over plain line dicts (text + bbox); fully unit-testable with no image and no OCR +engine. Reuses ``table_grid_fill``'s box-bounds reader. Imports no ``PySide6``. + +Headless API +------------ + +.. code-block:: python + + from je_auto_control import classify_lines, outline + + for item in classify_lines(ocr_lines, heading_ratio=1.2): + print(item["role"], item["level"], item["text"]) + + for heading in outline(ocr_lines): + print(" " * (heading["level"] - 1) + heading["text"]) + +``classify_lines`` tags each line ``{box, text, role, level}`` — ``role`` is ``"heading"`` or +``"body"``, ``level`` is the heading level (1 = tallest, 0 for body). ``outline`` returns just +the headings in top-to-bottom order as ``{level, text, top}`` — a document table of contents. + +Executor commands +----------------- + +``AC_classify_lines`` (``lines`` / ``heading_ratio`` → ``{count, lines}``) and ``AC_outline`` +(``lines`` / ``heading_ratio`` → ``{count, headings}``). 
They are exposed as the MCP tools +``ac_classify_lines`` / ``ac_outline`` (read-only) and as the Script Builder commands +**Classify Headings vs Body** / **Document Outline** under **OCR**. diff --git a/docs/source/Eng/doc/new_features/v177_features_doc.rst b/docs/source/Eng/doc/new_features/v177_features_doc.rst new file mode 100644 index 00000000..f1c074c0 --- /dev/null +++ b/docs/source/Eng/doc/new_features/v177_features_doc.rst @@ -0,0 +1,46 @@ +Per-Step Critic Features + Rule-Based Step Scorer +================================================= + +Scoring an agent's step needs the evidence in one place — what the action was, what changed, +whether it landed on target, whether the declared postcondition held. ``trajectory_eval`` +scores a *finished whole trajectory* against a static rubric and has no per-step evidence; +``agent_trace`` emits OTel spans (tokens / latency), not decision quality; ``agent_replay`` +persists ``{obs, action, result}`` but does no scoring. ``critic_features`` is the missing +per-step layer: it composes ``action_effect`` (did it do anything, where), +``observation_delta`` (how much changed) and ``postcondition`` (did the expected outcome hold) +into one compact record, and ships a deterministic rule-based scorer so the feature works fully +headless — leaving the optional LLM-as-judge to the integrator. + +Pure-stdlib; composes existing pure modules; deterministic and unit-testable with no device. +Imports no ``PySide6``. + +Headless API +------------ + +.. 
code-block:: python + + from je_auto_control import (build_critic_record, score_step_rule_based, + to_judge_prompt) + + record = build_critic_record({"type": "click", "x": 480, "y": 260}, + before_elements, after_elements, + postcondition={"appears": {"role": "dialog"}}) + score = score_step_rule_based(record) + # {"outcome": True, "process_score": 1.0, "reasons": [...]} + + prompt = to_judge_prompt(record) # compact text for an LLM-as-judge + +``build_critic_record`` returns ``{action, effect, delta_counts}`` plus a ``postcondition`` +report when a spec is given. ``score_step_rule_based`` returns ``{outcome, process_score, +reasons}`` — ``outcome`` is a binary success (the action did something *and* any postcondition +held), ``process_score`` is a 0..1 quality from the effect class (halved if the postcondition +failed). ``to_judge_prompt`` renders the record for an external judge. + +Executor commands +----------------- + +``AC_build_critic_record`` (``action`` / ``before`` / ``after`` / ``postcondition`` / +``radius`` → the record) and ``AC_score_step`` (``record`` → ``{outcome, process_score, +reasons}``). They are exposed as the MCP tools ``ac_build_critic_record`` / ``ac_score_step`` +(read-only) and as the Script Builder commands **Build Critic Record** / **Score Step +(rule-based)** under **Native UI**. diff --git a/docs/source/Eng/eng_index.rst b/docs/source/Eng/eng_index.rst index 7d1ed13f..a1cf10d5 100644 --- a/docs/source/Eng/eng_index.rst +++ b/docs/source/Eng/eng_index.rst @@ -197,6 +197,9 @@ Comprehensive guides for all AutoControl features. 
doc/new_features/v172_features_doc doc/new_features/v173_features_doc doc/new_features/v174_features_doc + doc/new_features/v175_features_doc + doc/new_features/v176_features_doc + doc/new_features/v177_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/docs/source/Zh/doc/new_features/v175_features_doc.rst b/docs/source/Zh/doc/new_features/v175_features_doc.rst new file mode 100644 index 00000000..346ef90b --- /dev/null +++ b/docs/source/Zh/doc/new_features/v175_features_doc.rst @@ -0,0 +1,39 @@ +變化量序列的穩定偵測 +==================== + +``smart_waits.wait_until_screen_stable`` 與 ``actionability`` 的穩定檢查把穩定邏輯包在 +``time.sleep`` 輪詢迴圈內、作用於即時像素幀——你無法餵給它一段記錄好的 a11y 元素數或畫面 +差異指標序列,也無法獨立於擷取去單元測試那個*決策*。``settle_detector`` 把該決策抽離:它接收 +一串*變化量*(churn,每個樣本變了多少——像素差、元素數差、digest 是否變的 0/1,皆可),並在 +變化量連續 ``quiet_samples`` 次維持在 ``max_churn`` 以下時回報穩定。尖峰會重置 quiet run,因此 +「穩定後又變動」也能處理。 + +純標準函式庫;確定性、可在注入序列上單元測試,不需擷取、不需時鐘。不匯入 ``PySide6``。 + +無頭 API +-------- + +.. 
code-block:: python + + from je_auto_control import settle_point, is_settled, SettleTracker + + churns = [5, 4, 0.5, 0.3, 0.2] # 每幀變化量指標 + settle_point(churns, quiet_samples=3, max_churn=1.0) # -> 4 + is_settled(churns, quiet_samples=3, max_churn=1.0) # -> True + + # 增量版,供即時迴圈(你每 tick 提供 churn) + tracker = SettleTracker(quiet_samples=3, max_churn=1.0) + state = tracker.update(current_churn) + if state.settled: + observe_now() + +``settle_point`` 回傳序列首次穩定的索引(或 ``None``);``is_settled`` 為布林。``SettleTracker`` +為增量形式:``update(churn)`` 回傳 ``SettleState``(``settled`` / ``quiet_run`` / ``churn``); +``reset`` 清除 run(例如在再次動作後)。 + +執行器指令 +---------- + +``AC_settle_point``(``churns`` / ``quiet_samples`` / ``max_churn`` → ``{settled, index}``) +以 MCP 工具 ``ac_settle_point``(唯讀)及 Script Builder 指令 **Settle Point (churn series)** +(位於 **Flow** 分類下)形式提供。 diff --git a/docs/source/Zh/doc/new_features/v176_features_doc.rst b/docs/source/Zh/doc/new_features/v176_features_doc.rst new file mode 100644 index 00000000..c37aecd5 --- /dev/null +++ b/docs/source/Zh/doc/new_features/v176_features_doc.rst @@ -0,0 +1,35 @@ +標題與內文分類 + 文件大綱 +========================== + +框架中沒有任何功能把行高對應到標題層級或建立章節大綱——``ocr/structure`` 與 ``element_parse`` +純屬位置性,``text_blocks`` 把段落 / 清單分組但不對其排序。``heading_segment`` 補上標準啟發法: +行高超過 ``heading_ratio`` 乘以中位行高者為標題,且不同的標題高度成為標題*層級*(最高為第 1 級)。 +由此輸出扁平的文件大綱。 + +純標準函式庫,作用於純行字典(text + bbox);可在無影像、無 OCR 引擎下完整單元測試。重用 +``table_grid_fill`` 的框邊界讀取器。不匯入 ``PySide6``。 + +無頭 API +-------- + +.. 
code-block:: python + + from je_auto_control import classify_lines, outline + + for item in classify_lines(ocr_lines, heading_ratio=1.2): + print(item["role"], item["level"], item["text"]) + + for heading in outline(ocr_lines): + print(" " * (heading["level"] - 1) + heading["text"]) + +``classify_lines`` 為每行標記 ``{box, text, role, level}``——``role`` 為 ``"heading"`` 或 +``"body"``,``level`` 為標題層級(1 = 最高,內文為 0)。``outline`` 只回傳依上到下順序的標題, +為 ``{level, text, top}``——即文件目錄。 + +執行器指令 +---------- + +``AC_classify_lines``(``lines`` / ``heading_ratio`` → ``{count, lines}``)與 ``AC_outline`` +(``lines`` / ``heading_ratio`` → ``{count, headings}``)。兩者以 MCP 工具 ``ac_classify_lines`` / +``ac_outline``(唯讀)及 Script Builder 指令 **Classify Headings vs Body** / **Document Outline** +(位於 **OCR** 分類下)形式提供。 diff --git a/docs/source/Zh/doc/new_features/v177_features_doc.rst b/docs/source/Zh/doc/new_features/v177_features_doc.rst new file mode 100644 index 00000000..7839708f --- /dev/null +++ b/docs/source/Zh/doc/new_features/v177_features_doc.rst @@ -0,0 +1,41 @@ +逐步評審特徵 + 規則式步驟評分 +============================== + +為代理的步驟評分需要把證據集中一處——動作是什麼、變了什麼、是否落在目標、宣告的後置條件 +是否成立。``trajectory_eval`` 對*已完成的整條軌跡*依靜態準則評分,沒有逐步證據; +``agent_trace`` 發出 OTel span(權杖 / 延遲),而非決策品質;``agent_replay`` 保存 +``{obs, action, result}`` 卻不評分。``critic_features`` 正是缺少的逐步層:它把 ``action_effect`` +(有無效果、落在何處)、``observation_delta``(變了多少)與 ``postcondition``(預期結果是否成立) +組合成單一精簡記錄,並附上確定性的規則式評分器,使此功能可完整無頭運作——把可選的 +LLM-as-judge 留給整合者。 + +純標準函式庫;組合既有純模組;確定性、可在無裝置下單元測試。不匯入 ``PySide6``。 + +無頭 API +-------- + +.. 
code-block:: python + + from je_auto_control import (build_critic_record, score_step_rule_based, + to_judge_prompt) + + record = build_critic_record({"type": "click", "x": 480, "y": 260}, + before_elements, after_elements, + postcondition={"appears": {"role": "dialog"}}) + score = score_step_rule_based(record) + # {"outcome": True, "process_score": 1.0, "reasons": [...]} + + prompt = to_judge_prompt(record) # 給 LLM-as-judge 的精簡文字 + +``build_critic_record`` 回傳 ``{action, effect, delta_counts}``,並在給定規格時附上 +``postcondition`` 報告。``score_step_rule_based`` 回傳 ``{outcome, process_score, reasons}`` +——``outcome`` 為二元成功(動作有效果*且*任何後置條件成立),``process_score`` 為依效果類別的 +0..1 品質(後置條件失敗時減半)。``to_judge_prompt`` 把記錄渲染給外部評審。 + +執行器指令 +---------- + +``AC_build_critic_record``(``action`` / ``before`` / ``after`` / ``postcondition`` / +``radius`` → 該記錄)與 ``AC_score_step``(``record`` → ``{outcome, process_score, reasons}``)。 +兩者以 MCP 工具 ``ac_build_critic_record`` / ``ac_score_step``(唯讀)及 Script Builder 指令 +**Build Critic Record** / **Score Step (rule-based)**(位於 **Native UI** 分類下)形式提供。 diff --git a/docs/source/Zh/zh_index.rst b/docs/source/Zh/zh_index.rst index 198b3e6a..65497aa5 100644 --- a/docs/source/Zh/zh_index.rst +++ b/docs/source/Zh/zh_index.rst @@ -197,6 +197,9 @@ AutoControl 所有功能的完整使用指南。 doc/new_features/v172_features_doc doc/new_features/v173_features_doc doc/new_features/v174_features_doc + doc/new_features/v175_features_doc + doc/new_features/v176_features_doc + doc/new_features/v177_features_doc doc/ocr_backends/ocr_backends_doc doc/observability/observability_doc doc/operations_layer/operations_layer_doc diff --git a/je_auto_control/__init__.py b/je_auto_control/__init__.py index 0be599b3..7f645d83 100644 --- a/je_auto_control/__init__.py +++ b/je_auto_control/__init__.py @@ -323,6 +323,10 @@ from je_auto_control.utils.text_blocks import ( detect_lists, group_paragraphs, ) +# Classify OCR lines as headings vs body and build a document outline +from 
je_auto_control.utils.heading_segment import ( + classify_lines, outline, +) # Associate form labels with values (multi-direction) + checkbox state from je_auto_control.utils.form_fields import ( associate_fields, checkbox_state, match_labels_to_widgets, @@ -343,6 +347,14 @@ from je_auto_control.utils.grounding_consensus import ( ConsensusResult, consensus_element, consensus_point, is_confident, ) +# Decide when a UI has settled, as a pure seam over a churn series +from je_auto_control.utils.settle_detector import ( + SettleState, SettleTracker, is_settled, settle_point, +) +# Per-step critic feature bundle + rule-based step scorer +from je_auto_control.utils.critic_features import ( + build_critic_record, score_step_rule_based, to_judge_prompt, +) # Locate on-screen regions by colour (mask + connected components) from je_auto_control.utils.color_region import ( find_color_region, find_color_regions, @@ -1285,6 +1297,8 @@ def start_autocontrol_gui(*args, **kwargs): "to_blocks", "group_paragraphs", "detect_lists", + "classify_lines", + "outline", "associate_fields", "match_labels_to_widgets", "checkbox_state", @@ -1304,6 +1318,13 @@ def start_autocontrol_gui(*args, **kwargs): "consensus_point", "consensus_element", "is_confident", + "SettleState", + "SettleTracker", + "settle_point", + "is_settled", + "build_critic_record", + "score_step_rule_based", + "to_judge_prompt", "find_color_region", "find_color_regions", "ssim_compare", diff --git a/je_auto_control/gui/script_builder/command_schema.py b/je_auto_control/gui/script_builder/command_schema.py index a022c6d5..9f0ab51b 100644 --- a/je_auto_control/gui/script_builder/command_schema.py +++ b/je_auto_control/gui/script_builder/command_schema.py @@ -827,6 +827,26 @@ def _add_ocr_specs(specs: List[CommandSpec]) -> None: ), description="Detect bulleted / numbered list items among OCR lines.", )) + specs.append(CommandSpec( + "AC_classify_lines", "OCR", "Classify Headings vs Body", + fields=( + FieldSpec("lines", 
FieldType.STRING, + placeholder='[{"x":0,"y":0,"width":200,"height":40,' + '"text":"Title"}]'), + FieldSpec("heading_ratio", FieldType.FLOAT, optional=True, default=1.2), + ), + description="Tag OCR lines as heading/body by height; assign heading levels.", + )) + specs.append(CommandSpec( + "AC_outline", "OCR", "Document Outline", + fields=( + FieldSpec("lines", FieldType.STRING, + placeholder='[{"x":0,"y":0,"width":200,"height":40,' + '"text":"Title"}]'), + FieldSpec("heading_ratio", FieldType.FLOAT, optional=True, default=1.2), + ), + description="Headings in order with levels (document outline) from OCR lines.", + )) specs.append(CommandSpec( "AC_scroll_to_find", "OCR", "Scroll Until Visible", fields=( @@ -3273,6 +3293,39 @@ def _add_set_of_marks_specs(specs: List[CommandSpec]) -> None: ), description="Agreed target point from clustered grounding proposals.", )) + specs.append(CommandSpec( + "AC_settle_point", "Flow", "Settle Point (churn series)", + fields=( + FieldSpec("churns", FieldType.STRING, + placeholder="[5, 4, 0.5, 0.3, 0.2]"), + FieldSpec("quiet_samples", FieldType.INT, optional=True, default=3), + FieldSpec("max_churn", FieldType.FLOAT, optional=True, default=1.0), + ), + description="Index where a churn series first settles (offline settle check).", + )) + specs.append(CommandSpec( + "AC_build_critic_record", "Native UI", "Build Critic Record", + fields=( + FieldSpec("action", FieldType.STRING, + placeholder='{"type":"click","x":50,"y":50}'), + FieldSpec("before", FieldType.STRING, + placeholder='[{"role":"button","x":0,"y":0}]'), + FieldSpec("after", FieldType.STRING, + placeholder='[{"role":"dialog","x":40,"y":40}]'), + FieldSpec("postcondition", FieldType.STRING, optional=True, + placeholder='{"appears":{"role":"dialog"}}'), + FieldSpec("radius", FieldType.INT, optional=True, default=64), + ), + description="Per-step critic evidence (effect + delta + postcondition).", + )) + specs.append(CommandSpec( + "AC_score_step", "Native UI", "Score Step 
(rule-based)", + fields=( + FieldSpec("record", FieldType.STRING, + placeholder='{"effect":{"effect":"changed_near_target"}}'), + ), + description="Rule-based outcome + process score of a critic record.", + )) specs.append(CommandSpec( "AC_consensus_element", "Native UI", "Grounding Consensus Element", fields=( diff --git a/je_auto_control/utils/critic_features/__init__.py b/je_auto_control/utils/critic_features/__init__.py new file mode 100644 index 00000000..a3651049 --- /dev/null +++ b/je_auto_control/utils/critic_features/__init__.py @@ -0,0 +1,6 @@ +"""Per-step critic feature bundle + a rule-based step scorer.""" +from je_auto_control.utils.critic_features.critic_features import ( + build_critic_record, score_step_rule_based, to_judge_prompt, +) + +__all__ = ["build_critic_record", "score_step_rule_based", "to_judge_prompt"] diff --git a/je_auto_control/utils/critic_features/critic_features.py b/je_auto_control/utils/critic_features/critic_features.py new file mode 100644 index 00000000..4c299480 --- /dev/null +++ b/je_auto_control/utils/critic_features/critic_features.py @@ -0,0 +1,79 @@ +"""Per-step critic feature bundle + a rule-based step scorer. + +Scoring an agent's step needs the evidence in one place — what the action was, what changed, +whether it landed on target, whether the declared postcondition held. ``trajectory_eval`` +scores a *finished whole trajectory* against a static rubric and has no per-step evidence; +``agent_trace`` emits OTel spans (tokens / latency), not decision quality; ``agent_replay`` +persists ``{obs, action, result}`` but does no scoring. ``critic_features`` is the missing +per-step layer: it composes ``action_effect`` (did it do anything, where), ``observation_delta`` +(how much changed) and ``postcondition`` (did the expected outcome hold) into one compact +record, and ships a deterministic rule-based scorer so the feature works fully headless — +leaving the optional LLM-as-judge to the integrator (``to_judge_prompt``). 
+ +Pure-stdlib; composes existing pure modules; deterministic and unit-testable with no device. +Imports no ``PySide6``. +""" +from typing import Any, Dict, Optional, Sequence + +Element = Dict[str, Any] + +_EFFECT_SCORE = {"changed_near_target": 1.0, "changed": 0.6, + "changed_elsewhere": 0.3, "no_op": 0.0} + + +def build_critic_record(action: Any, before: Sequence[Element], + after: Sequence[Element], *, + postcondition: Optional[Dict[str, Any]] = None, + radius: int = 64) -> Dict[str, Any]: + """Compose a per-step critic record from the before/after observation + action. + + Returns ``{action, effect, delta_counts}`` and, when a ``postcondition`` spec is + given, the ``postcondition`` report — the evidence bundle a step critic scores. + """ + from je_auto_control.utils.action_effect import classify_effect + from je_auto_control.utils.observation_delta import delta_index + verdict = classify_effect(before, after, action, radius=int(radius)).to_dict() + delta = delta_index(before, after) + record: Dict[str, Any] = { + "action": action, "effect": verdict, + "delta_counts": {"added": len(delta["added"]), + "removed": len(delta["removed"]), + "changed": len(delta["changed"]), + "stable": len(delta["stable"])}} + if postcondition is not None: + from je_auto_control.utils.postcondition import check_postcondition + record["postcondition"] = check_postcondition( + after, postcondition, before=before).to_dict() + return record + + +def score_step_rule_based(record: Dict[str, Any]) -> Dict[str, Any]: + """Score a critic record deterministically → ``{outcome, process_score, reasons}``. + + ``outcome`` is a binary success (the action did something *and* any postcondition held); + ``process_score`` is a 0..1 quality from the effect class, halved if the postcondition + failed. 
+    """
+    effect = record["effect"]["effect"]
+    process = _EFFECT_SCORE.get(effect, 0.0)
+    report = record.get("postcondition")
+    postcondition_ok = report["ok"] if report else True
+    reasons = [f"effect={effect}"]
+    if report is not None:
+        reasons.append(f"postcondition={'ok' if postcondition_ok else 'failed'}")
+    return {"outcome": effect != "no_op" and postcondition_ok,
+            "process_score": round(process * (1.0 if postcondition_ok else 0.5), 4),
+            "reasons": reasons}
+
+
+def to_judge_prompt(record: Dict[str, Any]) -> str:
+    """Render a critic record as a compact text block for an LLM-as-judge."""
+    counts = record["delta_counts"]
+    lines = [f"Action: {record['action']}",
+             f"Effect: {record['effect']['effect']} ({record['effect']['reason']})",
+             f"Changed: +{counts['added']} -{counts['removed']} ~{counts['changed']}"]
+    report = record.get("postcondition")
+    if report is not None:
+        lines.append(f"Postcondition ok: {report['ok']} "
+                     f"(failed: {report['failed']})")
+    return "\n".join(lines)
diff --git a/je_auto_control/utils/executor/action_executor.py b/je_auto_control/utils/executor/action_executor.py
index cc4bd1b7..8b66803b 100644
--- a/je_auto_control/utils/executor/action_executor.py
+++ b/je_auto_control/utils/executor/action_executor.py
@@ -3537,6 +3537,26 @@ def _detect_lists(lines: Any) -> Dict[str, Any]:
     return {"count": len(items), "items": items}
 
 
+def _classify_lines(lines: Any, heading_ratio: Any = 1.2) -> Dict[str, Any]:
+    """Adapter: classify OCR lines as headings vs body with levels."""
+    import json
+    from je_auto_control.utils.heading_segment import classify_lines
+    if isinstance(lines, str):
+        lines = json.loads(lines)
+    classified = classify_lines(lines, heading_ratio=float(heading_ratio))
+    return {"count": len(classified), "lines": classified}
+
+
+def _outline(lines: Any, heading_ratio: Any = 1.2) -> Dict[str, Any]:
+    """Adapter: the document outline (headings in order) from OCR lines."""
+    import json
+    from je_auto_control.utils.heading_segment import outline
+    if isinstance(lines, str):
+        lines = json.loads(lines)
+    headings = outline(lines, heading_ratio=float(heading_ratio))
+    return {"count": len(headings), "headings": headings}
+
+
 def _find_color_region(rgb: Any, tolerance: Any = 20, min_area: Any = 50,
                        region: Any = None) -> Dict[str, Any]:
     """Adapter: locate coloured regions on the screen, largest first."""
@@ -4232,6 +4252,45 @@ def _consensus_element(candidates: Any, elements: Any) -> Dict[str, Any]:
             "agreement": winner[1] if winner else 0.0}
 
 
+def _settle_point(churns: Any, quiet_samples: Any = 3,
+                  max_churn: Any = 1.0) -> Dict[str, Any]:
+    """Adapter: index at which a churn series first settles (or settled=False)."""
+    import json
+    from je_auto_control.utils.settle_detector import settle_point
+    if isinstance(churns, str):
+        churns = json.loads(churns)
+    index = settle_point([float(c) for c in churns],
+                         quiet_samples=int(quiet_samples),
+                         max_churn=float(max_churn))
+    return {"settled": index is not None, "index": index}
+
+
+def _build_critic_record(action: Any, before: Any, after: Any,
+                         postcondition: Any = None, radius: Any = 64) -> Dict[str, Any]:
+    """Adapter: per-step critic feature bundle (effect + delta + postcondition)."""
+    import json
+    from je_auto_control.utils.critic_features import build_critic_record
+    if isinstance(action, str):
+        action = json.loads(action)
+    if isinstance(before, str):
+        before = json.loads(before)
+    if isinstance(after, str):
+        after = json.loads(after)
+    if isinstance(postcondition, str):
+        postcondition = json.loads(postcondition) if postcondition.strip() else None
+    return build_critic_record(action, before, after, postcondition=postcondition,
+                               radius=int(radius))
+
+
+def _score_step(record: Any) -> Dict[str, Any]:
+    """Adapter: rule-based score of a critic record."""
+    import json
+    from je_auto_control.utils.critic_features import score_step_rule_based
+    if isinstance(record, str):
+        record = json.loads(record)
+    return score_step_rule_based(record)
+
+
 def _validate_action(action: Any, screen: Any = None,
                      targets: Any = None) -> Dict[str, Any]:
     """Adapter: validate a coordinate action (bounds + optional snap-to-target)."""
@@ -6076,6 +6135,8 @@ def __init__(self):
             "AC_xy_cut": _xy_cut,
             "AC_group_paragraphs": _group_paragraphs,
             "AC_detect_lists": _detect_lists,
+            "AC_classify_lines": _classify_lines,
+            "AC_outline": _outline,
             "AC_ssim_compare": _ssim_compare,
             "AC_ssim_changed_regions": _ssim_changed_regions,
             "AC_feature_match": _feature_match,
@@ -6121,6 +6182,9 @@ def __init__(self):
             "AC_plan_repair": _plan_repair,
             "AC_consensus_point": _consensus_point,
             "AC_consensus_element": _consensus_element,
+            "AC_settle_point": _settle_point,
+            "AC_build_critic_record": _build_critic_record,
+            "AC_score_step": _score_step,
             "AC_validate_action": _validate_action,
             "AC_replay_trace": _replay_trace,
             "AC_match_elements": _match_elements,
diff --git a/je_auto_control/utils/heading_segment/__init__.py b/je_auto_control/utils/heading_segment/__init__.py
new file mode 100644
index 00000000..638ff8e5
--- /dev/null
+++ b/je_auto_control/utils/heading_segment/__init__.py
@@ -0,0 +1,6 @@
+"""Classify OCR lines as headings vs body and build a document outline."""
+from je_auto_control.utils.heading_segment.heading_segment import (
+    classify_lines, outline,
+)
+
+__all__ = ["classify_lines", "outline"]
diff --git a/je_auto_control/utils/heading_segment/heading_segment.py b/je_auto_control/utils/heading_segment/heading_segment.py
new file mode 100644
index 00000000..52068c8d
--- /dev/null
+++ b/je_auto_control/utils/heading_segment/heading_segment.py
@@ -0,0 +1,63 @@
+"""Classify OCR lines as headings vs body and build a document outline.
+
+Nothing in the framework maps line height to heading levels or builds a section outline —
+``ocr/structure`` and ``element_parse`` are purely positional, and ``text_blocks`` groups
+paragraphs / lists but does not rank them. ``heading_segment`` adds the standard heuristic:
+a line whose height exceeds ``heading_ratio`` times the median line height is a heading, and
+distinct heading heights become heading *levels* (the tallest is level 1). From that it emits
+a flat document outline.
+
+Pure-stdlib over plain line dicts (text + bbox); fully unit-testable with no image and no OCR
+engine. Reuses ``table_grid_fill``'s box-bounds reader. Imports no ``PySide6``.
+"""
+from typing import Any, Dict, List, Sequence
+
+from je_auto_control.utils.table_grid_fill.table_grid_fill import _box_bounds
+
+Line = Dict[str, Any]
+
+
+def _height(line: Line) -> int:
+    _, top, _, bottom = _box_bounds(line)
+    return bottom - top
+
+
+def _box(line: Line) -> Dict[str, int]:
+    left, top, right, bottom = _box_bounds(line)
+    return {"left": left, "top": top, "right": right, "bottom": bottom}
+
+
+def classify_lines(lines: Sequence[Line], *,
+                   heading_ratio: float = 1.2) -> List[Dict[str, Any]]:
+    """Tag each line as a heading or body line with a heading ``level``.
+
+    A line taller than ``heading_ratio`` x the median line height is a heading; distinct
+    heading heights map to levels (tallest = 1). Body lines get ``level`` 0. Returns
+    ``{box, text, role, level}`` per line, in input order.
+    """
+    if not lines:
+        return []
+    heights = sorted(_height(line) for line in lines)
+    threshold = heights[len(heights) // 2] * float(heading_ratio)
+    heading_heights = sorted({_height(line) for line in lines
+                              if _height(line) > threshold}, reverse=True)
+    level_of = {height: index + 1 for index, height in enumerate(heading_heights)}
+    classified: List[Dict[str, Any]] = []
+    for line in lines:
+        height = _height(line)
+        is_heading = height > threshold
+        classified.append({"box": _box(line),
+                           "text": str(line.get("text", "")),
+                           "role": "heading" if is_heading else "body",
+                           "level": level_of.get(height, 0) if is_heading else 0})
+    return classified
+
+
+def outline(lines: Sequence[Line], *,
+            heading_ratio: float = 1.2) -> List[Dict[str, Any]]:
+    """Return the document outline: the headings in top-to-bottom order with levels."""
+    headings = [item for item in classify_lines(lines, heading_ratio=heading_ratio)
+                if item["role"] == "heading"]
+    headings.sort(key=lambda item: item["box"]["top"])
+    return [{"level": item["level"], "text": item["text"],
+             "top": item["box"]["top"]} for item in headings]
diff --git a/je_auto_control/utils/mcp_server/tools/_factories.py b/je_auto_control/utils/mcp_server/tools/_factories.py
index 2793876e..a2c6829c 100644
--- a/je_auto_control/utils/mcp_server/tools/_factories.py
+++ b/je_auto_control/utils/mcp_server/tools/_factories.py
@@ -3439,6 +3439,48 @@ def observation_tools() -> List[MCPTool]:
             handler=h.consensus_element,
             annotations=READ_ONLY,
         ),
+        MCPTool(
+            name="ac_settle_point",
+            description=("Decide when a UI settled from a 'churns' series (how much "
+                         "changed each sample). Returns {settled, index} — the index "
+                         "where churn first stayed <= 'max_churn' for 'quiet_samples' "
+                         "in a row (a spike resets the run). Feed pixel deltas / "
+                         "element-count deltas / 0-1 digest-changed flags."),
+            input_schema=schema({
+                "churns": {"type": "array", "items": {"type": "number"}},
+                "quiet_samples": {"type": "integer"},
+                "max_churn": {"type": "number"}},
+                required=["churns"]),
+            handler=h.settle_point,
+            annotations=READ_ONLY,
+        ),
+        MCPTool(
+            name="ac_build_critic_record",
+            description=("Build a per-step critic record from 'action' + 'before' / "
+                         "'after' element lists (+ optional 'postcondition' spec): "
+                         "composes effect / delta-counts / postcondition into "
+                         "{action, effect, delta_counts, postcondition?} — the "
+                         "evidence a step critic scores."),
+            input_schema=schema({
+                "action": {"type": "object"},
+                "before": {"type": "array", "items": {"type": "object"}},
+                "after": {"type": "array", "items": {"type": "object"}},
+                "postcondition": {"type": "object"},
+                "radius": {"type": "integer"}},
+                required=["action", "before", "after"]),
+            handler=h.build_critic_record,
+            annotations=READ_ONLY,
+        ),
+        MCPTool(
+            name="ac_score_step",
+            description=("Rule-based score of a critic 'record' (from "
+                         "ac_build_critic_record): {outcome (binary success), "
+                         "process_score (0..1), reasons}. Deterministic, no model."),
+            input_schema=schema({"record": {"type": "object"}},
+                                required=["record"]),
+            handler=h.score_step,
+            annotations=READ_ONLY,
+        ),
     ]
 
 
@@ -3975,6 +4017,31 @@ def screen_grid_tools() -> List[MCPTool]:
             handler=h.detect_lists,
             annotations=READ_ONLY,
         ),
+        MCPTool(
+            name="ac_classify_lines",
+            description=("Classify OCR 'lines' as headings vs body by height: a line "
+                         "taller than 'heading_ratio' x the median line height is a "
+                         "heading, and distinct heading heights become levels (tallest "
+                         "= 1). Returns {count, lines:[{box,text,role,level}]}."),
+            input_schema=schema({
+                "lines": {"type": "array", "items": {"type": "object"}},
+                "heading_ratio": {"type": "number"}},
+                required=["lines"]),
+            handler=h.classify_lines,
+            annotations=READ_ONLY,
+        ),
+        MCPTool(
+            name="ac_outline",
+            description=("Return the document outline from OCR 'lines' — the headings "
+                         "in top-to-bottom order with levels. Returns {count, "
+                         "headings:[{level,text,top}]}."),
+            input_schema=schema({
+                "lines": {"type": "array", "items": {"type": "object"}},
+                "heading_ratio": {"type": "number"}},
+                required=["lines"]),
+            handler=h.outline,
+            annotations=READ_ONLY,
+        ),
     ]
 
 
diff --git a/je_auto_control/utils/mcp_server/tools/_handlers.py b/je_auto_control/utils/mcp_server/tools/_handlers.py
index 8a6fc2ef..38ec81cd 100644
--- a/je_auto_control/utils/mcp_server/tools/_handlers.py
+++ b/je_auto_control/utils/mcp_server/tools/_handlers.py
@@ -2205,6 +2205,16 @@ def detect_lists(lines):
     return _detect_lists(lines)
 
 
+def classify_lines(lines, heading_ratio=1.2):
+    from je_auto_control.utils.executor.action_executor import _classify_lines
+    return _classify_lines(lines, heading_ratio)
+
+
+def outline(lines, heading_ratio=1.2):
+    from je_auto_control.utils.executor.action_executor import _outline
+    return _outline(lines, heading_ratio)
+
+
 def find_color_region(rgb, tolerance=20, min_area=50, region=None):
     from je_auto_control.utils.executor.action_executor import (
         _find_color_region)
@@ -2478,6 +2488,21 @@ def consensus_element(candidates, elements):
     return _consensus_element(candidates, elements)
 
 
+def settle_point(churns, quiet_samples=3, max_churn=1.0):
+    from je_auto_control.utils.executor.action_executor import _settle_point
+    return _settle_point(churns, quiet_samples, max_churn)
+
+
+def build_critic_record(action, before, after, postcondition=None, radius=64):
+    from je_auto_control.utils.executor.action_executor import _build_critic_record
+    return _build_critic_record(action, before, after, postcondition, radius)
+
+
+def score_step(record):
+    from je_auto_control.utils.executor.action_executor import _score_step
+    return _score_step(record)
+
+
 def validate_action(action, screen=None, targets=None):
     from je_auto_control.utils.executor.action_executor import _validate_action
     return _validate_action(action, screen, targets)
diff --git a/je_auto_control/utils/settle_detector/__init__.py b/je_auto_control/utils/settle_detector/__init__.py
new file mode 100644
index 00000000..d41630a0
--- /dev/null
+++ b/je_auto_control/utils/settle_detector/__init__.py
@@ -0,0 +1,6 @@
+"""Decide when a UI has settled, as a pure seam over a churn series."""
+from je_auto_control.utils.settle_detector.settle_detector import (
+    SettleState, SettleTracker, is_settled, settle_point,
+)
+
+__all__ = ["SettleState", "SettleTracker", "settle_point", "is_settled"]
diff --git a/je_auto_control/utils/settle_detector/settle_detector.py b/je_auto_control/utils/settle_detector/settle_detector.py
new file mode 100644
index 00000000..d729480b
--- /dev/null
+++ b/je_auto_control/utils/settle_detector/settle_detector.py
@@ -0,0 +1,70 @@
+"""Decide when a UI has settled, as a pure seam over a churn series.
+
+``smart_waits.wait_until_screen_stable`` and ``actionability``'s stability check bake the
+settle logic *inside* a ``time.sleep`` polling loop over live pixel frames — you cannot feed
+them a recorded series of a11y-element counts or screen-diff metrics, and you cannot unit-test
+the *decision* independently of capture. ``settle_detector`` extracts that decision: it takes a
+stream of *churn* values (how much changed each sample — pixel delta, element-count delta, a
+digest-changed 0/1, anything) and reports when the churn has stayed at or below ``max_churn``
+for ``quiet_samples`` in a row. A spike resets the quiet run, so "settled then changed again"
+is handled.
+
+Pure-stdlib; deterministic and unit-testable on an injected series with no capture, no clock.
+Imports no ``PySide6``.
+"""
+from dataclasses import asdict, dataclass
+from typing import Any, Dict, Optional, Sequence
+
+
+@dataclass(frozen=True)
+class SettleState:
+    """One settle observation: whether settled, the quiet run length, latest churn."""
+
+    settled: bool
+    quiet_run: int
+    churn: float
+
+    def to_dict(self) -> Dict[str, Any]:
+        """Return the state as a plain dict."""
+        return asdict(self)
+
+
+class SettleTracker:
+    """Incremental settle detector: feed churn values, ask if it has gone quiet."""
+
+    def __init__(self, quiet_samples: int = 3, max_churn: float = 1.0) -> None:
+        """Settle after ``quiet_samples`` consecutive churns <= ``max_churn``."""
+        self.quiet_samples = int(quiet_samples)
+        self.max_churn = float(max_churn)
+        self.quiet_run = 0
+
+    def update(self, churn: float) -> SettleState:
+        """Feed the next churn value and return the current settle state."""
+        churn = float(churn)
+        if churn <= self.max_churn:
+            self.quiet_run += 1
+        else:
+            self.quiet_run = 0
+        return SettleState(self.quiet_run >= self.quiet_samples, self.quiet_run,
+                           churn)
+
+    def reset(self) -> None:
+        """Clear the quiet run (e.g. after acting again)."""
+        self.quiet_run = 0
+
+
+def settle_point(churns: Sequence[float], *, quiet_samples: int = 3,
+                 max_churn: float = 1.0) -> Optional[int]:
+    """Return the index at which the churn series first becomes settled, or ``None``."""
+    tracker = SettleTracker(quiet_samples, max_churn)
+    for index, churn in enumerate(churns):
+        if tracker.update(churn).settled:
+            return index
+    return None
+
+
+def is_settled(churns: Sequence[float], *, quiet_samples: int = 3,
+               max_churn: float = 1.0) -> bool:
+    """Return whether the churn series settles at any point."""
+    return settle_point(churns, quiet_samples=quiet_samples,
+                        max_churn=max_churn) is not None
diff --git a/test/unit_test/headless/test_critic_features_batch.py b/test/unit_test/headless/test_critic_features_batch.py
new file mode 100644
index 00000000..7e0fbd12
--- /dev/null
+++ b/test/unit_test/headless/test_critic_features_batch.py
@@ -0,0 +1,70 @@
+"""Headless tests for per-step critic features + rule-based scorer (pure stdlib)."""
+import je_auto_control as ac
+from je_auto_control.utils.critic_features import (
+    build_critic_record, score_step_rule_based, to_judge_prompt,
+)
+
+
+def _el(x, y, name="", role="button"):
+    return dict(x=x, y=y, width=40, height=20, role=role, name=name)
+
+
+def test_record_captures_effect_and_delta():
+    before = [_el(0, 0, "A")]
+    after = [_el(0, 0, "A"), _el(40, 40, "Popup", role="dialog")]
+    record = build_critic_record({"x": 50, "y": 50}, before, after)
+    assert record["effect"]["effect"] == "changed_near_target"
+    assert record["delta_counts"]["added"] == 1
+
+
+def test_score_good_step():
+    before = [_el(0, 0, "A")]
+    after = [_el(0, 0, "A"), _el(40, 40, "Popup", role="dialog")]
+    score = score_step_rule_based(build_critic_record({"x": 50, "y": 50},
+                                                      before, after))
+    assert score["outcome"] is True
+    assert abs(score["process_score"] - 1.0) < 1e-9
+
+
+def test_score_no_op_fails():
+    frame = [_el(0, 0, "A")]
+    score = score_step_rule_based(build_critic_record({"x": 9, "y": 9},
+                                                      frame, list(frame)))
+    assert score["outcome"] is False
+    assert abs(score["process_score"]) < 1e-9
+
+
+def test_postcondition_failure_lowers_outcome():
+    before = [_el(0, 0, "A")]
+    after = [_el(0, 0, "A"), _el(40, 40, "Popup", role="dialog")]
+    spec = {"appears": {"role": "menu"}}  # a menu that never appears
+    record = build_critic_record({"x": 50, "y": 50}, before, after,
+                                 postcondition=spec)
+    score = score_step_rule_based(record)
+    assert score["outcome"] is False  # effect ok but postcondition failed
+    assert record["postcondition"]["ok"] is False
+
+
+def test_to_judge_prompt_mentions_effect():
+    before = [_el(0, 0, "A")]
+    after = [_el(0, 0, "A"), _el(40, 40, "P", role="dialog")]
+    text = to_judge_prompt(build_critic_record({"x": 50, "y": 50}, before, after))
+    assert "Effect:" in text and "changed_near_target" in text
+
+
+# --- wiring ---------------------------------------------------------------
+
+def test_wiring():
+    known = set(ac.executor.known_commands())
+    assert {"AC_build_critic_record", "AC_score_step"} <= known
+    from je_auto_control.utils.mcp_server.tools import build_default_tool_registry
+    names = {t.name for t in build_default_tool_registry()}
+    assert {"ac_build_critic_record", "ac_score_step"} <= names
+    from je_auto_control.gui.script_builder.command_schema import _build_specs
+    specs = {s.command for s in _build_specs()}
+    assert {"AC_build_critic_record", "AC_score_step"} <= specs
+
+
+def test_facade_exports():
+    for name in ("build_critic_record", "score_step_rule_based", "to_judge_prompt"):
+        assert hasattr(ac, name) and name in ac.__all__
diff --git a/test/unit_test/headless/test_heading_segment_batch.py b/test/unit_test/headless/test_heading_segment_batch.py
new file mode 100644
index 00000000..67add083
--- /dev/null
+++ b/test/unit_test/headless/test_heading_segment_batch.py
@@ -0,0 +1,58 @@
+"""Headless tests for heading vs body classification + outline (pure stdlib)."""
+import je_auto_control as ac
+from je_auto_control.utils.heading_segment import classify_lines, outline
+
+
+def _line(y, text, h=20, x=0, w=200):
+    return {"x": x, "y": y, "width": w, "height": h, "text": text}
+
+
+def _doc():
+    # one big title (h=40), some body (h=20), a smaller heading (h=30)
+    return [_line(0, "Title", h=40), _line(50, "body one"),
+            _line(75, "body two"), _line(110, "Subsection", h=30),
+            _line(145, "more body")]
+
+
+def test_classify_marks_headings_and_levels():
+    by_text = {c["text"]: c for c in classify_lines(_doc(), heading_ratio=1.2)}
+    assert by_text["Title"]["role"] == "heading"
+    assert by_text["Subsection"]["role"] == "heading"
+    assert by_text["body one"]["role"] == "body"
+    # tallest heading is level 1, the next distinct height is level 2
+    assert by_text["Title"]["level"] == 1
+    assert by_text["Subsection"]["level"] == 2
+
+
+def test_body_only_has_no_headings():
+    lines = [_line(0, "a"), _line(25, "b"), _line(50, "c")]
+    assert all(c["role"] == "body" for c in classify_lines(lines))
+
+
+def test_outline_lists_headings_in_order():
+    result = outline(_doc(), heading_ratio=1.2)
+    assert [h["text"] for h in result] == ["Title", "Subsection"]
+    assert [h["level"] for h in result] == [1, 2]
+
+
+def test_empty():
+    assert classify_lines([]) == []
+    assert outline([]) == []
+
+
+# --- wiring ---------------------------------------------------------------
+
+def test_wiring():
+    known = set(ac.executor.known_commands())
+    assert {"AC_classify_lines", "AC_outline"} <= known
+    from je_auto_control.utils.mcp_server.tools import build_default_tool_registry
+    names = {t.name for t in build_default_tool_registry()}
+    assert {"ac_classify_lines", "ac_outline"} <= names
+    from je_auto_control.gui.script_builder.command_schema import _build_specs
+    specs = {s.command for s in _build_specs()}
+    assert {"AC_classify_lines", "AC_outline"} <= specs
+
+
+def test_facade_exports():
+    for name in ("classify_lines", "outline"):
+        assert hasattr(ac, name) and name in ac.__all__
diff --git a/test/unit_test/headless/test_settle_detector_batch.py b/test/unit_test/headless/test_settle_detector_batch.py
new file mode 100644
index 00000000..dbb89ff4
--- /dev/null
+++ b/test/unit_test/headless/test_settle_detector_batch.py
@@ -0,0 +1,52 @@
+"""Headless tests for the settle decision over a churn series (pure stdlib)."""
+import je_auto_control as ac
+from je_auto_control.utils.settle_detector import (
+    SettleTracker, is_settled, settle_point,
+)
+
+
+def test_settle_point_after_quiet_run():
+    # 5, 4 are noisy; then three values <= 1.0 → settled at index 4
+    assert settle_point([5, 4, 0.5, 0.3, 0.2], quiet_samples=3,
+                        max_churn=1.0) == 4
+
+
+def test_spike_resets_quiet_run():
+    # quiet, quiet, SPIKE, quiet x3 → settles only at the final index
+    assert settle_point([0.2, 0.2, 5, 0.1, 0.1, 0.1], quiet_samples=3,
+                        max_churn=1.0) == 5
+
+
+def test_never_settles_is_none():
+    assert settle_point([5, 4, 3], quiet_samples=2, max_churn=1.0) is None
+
+
+def test_is_settled_bool():
+    assert is_settled([0.1, 0.1], quiet_samples=2, max_churn=1.0) is True
+    assert is_settled([9, 8], quiet_samples=2, max_churn=1.0) is False
+
+
+def test_tracker_incremental_and_reset():
+    tracker = SettleTracker(quiet_samples=2, max_churn=1.0)
+    assert tracker.update(0.5).settled is False
+    state = tracker.update(0.4)
+    assert state.settled is True and state.quiet_run == 2
+    tracker.reset()
+    assert tracker.update(0.3).settled is False  # run cleared
+
+
+# --- wiring ---------------------------------------------------------------
+
+def test_wiring():
+    assert "AC_settle_point" in set(ac.executor.known_commands())
+    from je_auto_control.utils.mcp_server.tools import build_default_tool_registry
+    names = {t.name for t in build_default_tool_registry()}
+    assert "ac_settle_point" in names
+    from je_auto_control.gui.script_builder.command_schema import _build_specs
+    specs = {s.command for s in _build_specs()}
+    assert "AC_settle_point" in specs
+
+
+def test_facade_exports():
+    for name in ("settle_point", "is_settled", "SettleTracker", "SettleState"):
+        assert hasattr(ac, name) and name in ac.__all__
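
Reviewer sketch: the settle rule added in `settle_detector.py` can be tried without installing `je_auto_control`. The snippet below is a minimal standalone mirror of the quiet-run logic in `SettleTracker.update` and `settle_point`. The helper name `first_settled_index` and the sample series are illustrative assumptions, not part of the patch.

```python
from typing import Optional, Sequence


def first_settled_index(churns: Sequence[float], quiet_samples: int = 3,
                        max_churn: float = 1.0) -> Optional[int]:
    """Hypothetical stand-in that mirrors the patch's settle_point logic."""
    quiet_run = 0
    for index, churn in enumerate(churns):
        # A sample at or below max_churn extends the quiet run; a spike resets it.
        quiet_run = quiet_run + 1 if float(churn) <= max_churn else 0
        if quiet_run >= quiet_samples:
            return index
    return None


# Illustrative churn series (the same shapes the patch's unit tests use).
steady = first_settled_index([5, 4, 0.5, 0.3, 0.2])        # noisy, then quiet
spiky = first_settled_index([0.2, 0.2, 5, 0.1, 0.1, 0.1])  # spike resets the run
never = first_settled_index([5, 4, 3], quiet_samples=2)    # never goes quiet
print(steady, spiky, never)  # prints: 4 5 None
```

With the package installed, the equivalent call would be `settle_point(churns, quiet_samples=3, max_churn=1.0)`, as exercised in `test_settle_detector_batch.py`.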