Add column_layout: infer columns from whitespace (borderless tables) #377
Merged
ocr/structure detects a table only when every row's cell-left-x matches, so it fails on ragged, borderless, or right-aligned columns. edge_lines.find_grid needs ruling lines, which a whitespace-only table does not have. This PR finds columns from the gaps instead: project the OCR boxes onto the x-axis, treat the persistently empty vertical bands as gutters, assign each box a column index, bucket rows by vertical spacing, and emit the table. The projection is a pure difference-array computation with no numpy dependency.
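To make the projection step concrete, here is a minimal sketch of a difference-array coverage profile and a gutter scan. It is not the PR's code: the box layout `(left, top, right, bottom)`, the names `projection_sketch` and `gutters_sketch`, and the `min_gap` threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def projection_sketch(boxes: List[Box], width: int) -> List[int]:
    """Count how many boxes cover each x position, via a difference array."""
    diff = [0] * (width + 1)
    for left, _top, right, _bottom in boxes:
        lo = max(0, left)
        hi = min(width, right)
        if hi <= lo:
            continue  # box lies outside the page or has no width
        diff[lo] += 1   # coverage starts at lo
        diff[hi] -= 1   # and stops just before hi
    profile: List[int] = []
    running = 0
    for delta in diff[:width]:
        running += delta  # prefix sum turns the deltas into per-pixel counts
        profile.append(running)
    return profile


def gutters_sketch(profile: List[int], min_gap: int = 3) -> List[Tuple[int, int]]:
    """Return interior zero-coverage runs at least min_gap wide, as [start, end)."""
    gutters: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(profile):
        if count == 0 and start is None:
            start = x
        elif count != 0 and start is not None:
            # A run that begins at x == 0 is the left margin, not a gutter.
            if start > 0 and x - start >= min_gap:
                gutters.append((start, x))
            start = None
    # A run still open at the end of the profile is the right margin; drop it.
    return gutters
```

Each box costs two array updates regardless of its width, and a single prefix-sum pass recovers the profile, which is why the approach stays cheap without numpy.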
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 58 |
| Duplication | 0 |

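Summary
Adds detect_borderless_table / column_gutters / assign_columns / vertical_projection: detect the columns of a borderless table by projecting vertical whitespace. ocr/structure detects a table only when every row's cell-left x matches within tolerance, so it fails on ragged, borderless, or right-aligned numeric columns, and on rows with missing cells. edge_lines.find_grid requires ruling lines, and a whitespace-only table has no grid. This feature uses the robust approach common in the layout-analysis literature and works from the gaps: project the OCR boxes onto the x-axis (an ink-density profile), read the persistently empty vertical bands as column gutters, assign each box a column index, cluster boxes into rows by vertical spacing, and output the borderless table. The projection is a pure standard-library difference array (no numpy) and reuses the box-bounds reader from table_grid_fill. Qt-free.

Five layers
utils/column_layout/: vertical_projection, column_gutters, assign_columns, detect_borderless_table.
je_auto_control exports + __all__.
AC_detect_borderless_table ({found, table}) / AC_column_gutters ({count, gutters}).
ac_detect_borderless_table / ac_column_gutters (read-only).

Tests
test/unit_test/headless/test_column_layout_batch.py: projection with zero-valued gutters, gutter detection, column assignment, an end-to-end 2-column, 3-row table, a single-column non-table returning None, empty input returning None, and wiring + facade. 8 passed. ruff / bandit / radon / float-scan / Qt-free all clean.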
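
The column-assignment and row-bucketing steps could be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: the box layout `(left, top, right, bottom, text)`, the names `column_of` and `table_sketch`, and the `row_gap` threshold are hypothetical. It mirrors the behaviours the tests describe, including returning None for empty input and for a single-column layout.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical box layout for this sketch: (left, top, right, bottom, text).
Box = Tuple[int, int, int, int, str]
Gutter = Tuple[int, int]  # [start, end) on the x-axis, sorted left to right


def column_of(box: Box, gutters: List[Gutter]) -> int:
    """Column index = number of gutters whose centre is at or left of the box centre."""
    # Compare doubled centres (left + right) to stay in integer arithmetic.
    doubled_centres = [start + end for start, end in gutters]
    return bisect_right(doubled_centres, box[0] + box[2])


def table_sketch(
    boxes: List[Box], gutters: List[Gutter], row_gap: int = 2
) -> Optional[List[List[str]]]:
    """Bucket boxes into rows by vertical spacing, then place them in columns."""
    if not boxes or not gutters:
        return None  # empty input, or a single column, is not treated as a table
    n_cols = len(gutters) + 1

    rows: List[List[Box]] = []
    row_bottom = 0
    for box in sorted(boxes, key=lambda b: b[1]):  # top to bottom
        if rows and box[1] - row_bottom < row_gap:
            rows[-1].append(box)  # still within the current row's vertical band
            row_bottom = max(row_bottom, box[3])
        else:
            rows.append([box])    # vertical gap is large enough: start a new row
            row_bottom = box[3]

    table: List[List[str]] = []
    for row in rows:
        cells = [""] * n_cols     # missing cells stay as empty strings
        for box in sorted(row, key=lambda b: b[0]):  # left to right within a row
            col = column_of(box, gutters)
            cells[col] = f"{cells[col]} {box[4]}".strip()
        table.append(cells)
    return table
```

Because the column is chosen from the box centre relative to the gutters rather than from the left edge, a right-aligned number and a left-aligned header land in the same column, and a row with a missing cell still lines up.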