
Rename CustomSampler to PresetSampler and add preset dataloader test #1637

Open
jayhenry wants to merge 47 commits into InternLM:main from jayhenry:preset_sampler

Conversation

@jayhenry
Collaborator

No description provided.

jayhenry added 30 commits March 23, 2026 12:19
- _load_pack_config_jsonl: parse [dataset_path(str), sample_idx, char_start, char_end, token_start_offset]
- Replace _load_pack_config_npy with _load_pack_config_parquet using load_mixed_dict_from_parquet
- Update _load_pack_config dispatch: .jsonl -> JSONL loader, .parquet -> Parquet loader
- Rewrite test helpers (_write_jsonl_pack, _write_parquet_pack) for new format
- Add loader-level unit tests (5 tests, all passing)
- Mark Feature 1 as passes: true in feature_list.json

Made-with: Cursor
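The 5-element row format above could be parsed roughly as follows. This is an illustrative sketch, not the repository's `_load_pack_config_jsonl`; the function name, signature, and sanity checks are assumptions based only on the commit message:

```python
import json
from pathlib import Path


def load_pack_config_jsonl(path):
    """Sketch: one pack per JSONL line; each slice is
    [dataset_path(str), sample_idx, char_start, char_end, token_start_offset]."""
    packs = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        slices = json.loads(line)
        for ds_path, sample_idx, char_start, char_end, tok_off in slices:
            # Minimal sanity checks implied by the format description.
            assert isinstance(ds_path, str)
            assert sample_idx >= 0 and tok_off >= 0
        packs.append([tuple(s) for s in slices])
    return packs
```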
- Build _path_to_ds_idx mapping from ds.path; replace dataset_id int with dataset_path str lookup
- Remove 'skip' strategy from short_pack_strategy and long_pack_strategy
- Fix token count: use ds.num_tokens[s_idx] directly (remove double-indexing via ds.sampled)
- New char-range validation: both -1 OK (plain DataItem); else char_start>=0 and char_end>char_start
- pack_infos stores 6-tuples (ds_idx, s_idx, char_start, char_end, token_start_offset, max_tokens)
- Add _FakeDataset with .path attribute; 8 new validation unit tests (13 total, all passing)
- Mark Feature 2 as passes: true in feature_list.json

Made-with: Cursor
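The new char-range rule from this commit can be sketched as a small predicate. The function below is hypothetical (the actual validation lives inside `_validate_pack`); only the rule itself is taken from the commit message:

```python
def validate_char_range(char_start, char_end):
    """Sketch of the rule: (-1, -1) marks a plain DataItem (whole sample);
    otherwise the range must satisfy 0 <= char_start < char_end."""
    if char_start == -1 and char_end == -1:
        return "plain"        # no character slicing
    if char_start >= 0 and char_end > char_start:
        return "long_text"    # valid sub-range of a long sample
    raise ValueError(f"invalid char range: ({char_start}, {char_end})")
```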
- Replace old token-slicing logic with DataItem/LongTextDataItem consistency check
- For char_start==-1: verify item is plain DataItem (no 'char_start' key)
- For char_start!=-1: verify LongTextDataItem fields match pack config exactly
- Retain truncation via max_tokens for long_pack_strategy='truncate'
- Update _FakeDataset with long_text_meta support for LongTextDataItem testing
- Add 6 TestGetitem unit tests (DataItem JSONL/Parquet, LongTextDataItem, mixed, error cases)
- Updated feature_list.json: marked feature InternLM#3 as passing

Made-with: Cursor
…tom pack integration

- Add disable_filter: bool = False to JsonlDataset.__init__; skips num_tokens==0 and max_length filters when True
- Add disable_filter: bool = False field to DatasetConfig; forwarded to JsonlDataset.build()
- Add DataloaderConfig._force_custom_pack_settings: forces sample_ratio=1.0, enable_sequential_sampler=True, disable_filter=True for all datasets when pack_level='custom', with warnings on overrides
- Call _force_custom_pack_settings in DataloaderConfig.build before build_datasets
- Add TestDisableFilter and TestDataloaderConfigCustomMode
- Updated feature_list.json: marked feature InternLM#4 as passing (4/4 features complete)

Made-with: Cursor
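The forced-settings behaviour described above might look like the sketch below, with dataset configs modeled as plain dicts for illustration (the real `_force_custom_pack_settings` operates on `DatasetConfig` objects; field names are taken from the commit message):

```python
import warnings


def force_custom_pack_settings(dataset_cfgs):
    """Sketch: when pack_level='custom', force sample_ratio=1.0,
    enable_sequential_sampler=True, disable_filter=True on every dataset,
    warning whenever an explicit user setting is overridden."""
    forced = {
        "sample_ratio": 1.0,
        "enable_sequential_sampler": True,
        "disable_filter": True,
    }
    for cfg in dataset_cfgs:
        for key, value in forced.items():
            if cfg.get(key) not in (None, value):
                warnings.warn(
                    f"pack_level='custom' overrides {key}={cfg[key]} -> {value}"
                )
            cfg[key] = value
    return dataset_cfgs
```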
- Add token_end_offset as 6th element to _PackSlice and _ValidatedSlice
- Update _load_pack_config_jsonl: require 6 elements, parse token_end_offset
- Update _load_pack_config_parquet: unpack 6th element token_end_offset
- Update _validate_pack: validate token_end_offset > token_start_offset >= 0;
  compute n_tokens = token_end_offset - token_start_offset (no longer reads ds.num_tokens);
  truncate adjusts token_end_offset of last slice
- Update __getitem__: compute max_tokens = tok_end - tok_off from validated slice
- Update all test fixtures to 6-element format; add 2 new validation tests
- All 25 tests pass; marked feature InternLM#5 as passing

Made-with: Cursor
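The offset validation and truncation rule above can be sketched as follows, with each slice reduced to its `(token_start_offset, token_end_offset)` pair for illustration (the actual code works on the full 6-tuples):

```python
def pack_n_tokens(slices, max_tokens=None):
    """Sketch: n_tokens per slice is token_end_offset - token_start_offset
    (no dataset lookup needed); with long_pack_strategy='truncate', the
    last slice's token_end_offset is shrunk so the pack fits max_tokens."""
    total = 0
    out = []
    for tok_start, tok_end in slices:
        # Rule from the commit: token_end_offset > token_start_offset >= 0
        assert 0 <= tok_start < tok_end
        out.append((tok_start, tok_end))
        total += tok_end - tok_start
    if max_tokens is not None and total > max_tokens:
        overflow = total - max_tokens
        s, e = out[-1]
        out[-1] = (s, e - overflow)
        total = max_tokens
    return out, total
```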
…nizeFn

- For char_start==-1 (plain TokenizeFn): slice input_ids/labels[tok_off:tok_end]
  so token_start_offset is correctly applied (not just tok_end truncation)
- For char_start!=-1 (LongTextTokenizeFn): return item as-is after field-match
  validation since it is pre-truncated at tokenize time
- Add test_plain_tokenizefn_token_start_offset_applied: verifies non-zero
  token_start_offset slicing on plain DataItem
- Add test_longtextdataitem_no_extra_truncation: verifies no re-slicing occurs
- All 27 tests pass; marked feature InternLM#6 as passing

Made-with: Cursor
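The two `__getitem__` branches described above could be sketched like this. The helper name and dict layout are illustrative assumptions; the branching rule itself comes from the commit message:

```python
def apply_token_slice(item, char_start, tok_start, tok_end):
    """Sketch: a plain TokenizeFn item (char_start == -1) is sliced to
    [tok_start:tok_end]; a LongTextTokenizeFn item (char_start != -1) is
    already truncated at tokenize time and is returned unchanged."""
    if char_start == -1:
        return {
            "input_ids": item["input_ids"][tok_start:tok_end],
            "labels": item["labels"][tok_start:tok_end],
        }
    return item
```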
…rized validation

- Add load_config(path, mmap=True) loading boundaries.npy, samples.npy, paths.npy
- Rewrite __init__ to store mmap'd arrays; remove old JSONL/Parquet machinery
- Replace _validate_pack Python loop with _validate_arrays vectorized numpy checks
- Move long_pack_strategy='truncate' handling from __init__ to __getitem__
- Update test helpers to write NPY directory format; update all fixtures
- Updated feature_list.json: marked feature InternLM#7 as passing

Made-with: Cursor
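The mmap'd loading path can be sketched with numpy's `mmap_mode`, which pages array data in lazily instead of reading it into RAM up front. This is a simplified stand-in for the commit's `load_config` (only the two numeric arrays are shown):

```python
from pathlib import Path

import numpy as np


def load_config(path, mmap=True):
    """Sketch: open boundaries.npy and samples.npy, memory-mapped
    read-only when mmap=True, fully loaded otherwise."""
    path = Path(path)
    mode = "r" if mmap else None
    return {
        "boundaries": np.load(path / "boundaries.npy", mmap_mode=mode),
        "samples": np.load(path / "samples.npy", mmap_mode=mode),
    }
```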
…mparison

- Add generate_stress_pack_config: greedy packing with uniform [200,16000] token lengths
- Add _MockDataset: satisfies JsonlDataset interface without file I/O
- Add TestStress with 3 tests:
  - test_generate_stress_pack_config: validates NPY directory output
  - test_multiprocess_getitem: 8 fork'd processes with random index sampling,
    reports init time, RSS/PSS deltas, and __getitem__ latency per rank
  - test_mmap_memory_saving: two subprocesses compare load_config RSS/PSS/elapsed
    for mmap=True (0.2MB, 0.7ms) vs mmap=False (24MB, 7.5ms)
- Updated feature_list.json: marked feature InternLM#8 as passing (8/8 complete)

Made-with: Cursor
…son and PackConfig TypedDict

- Rename custom_pack.py -> preset_pack.py, CustomPackDataset -> PresetPackDataset
- Replace paths.npy (allow_pickle) with paths.json for security
- Store paths as list[str] instead of object ndarray (equivalent memory, cleaner types)
- Introduce PackConfig TypedDict for precise return type of load_config()
- Update __init__.py, config.py, custom_sampler.py, run_test.sh accordingly
- Rename test file to test_preset_pack_dataset.py and update all references

Made-with: Cursor
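The security motivation for paths.json is that reading an object ndarray requires `np.load(..., allow_pickle=True)`, and unpickling untrusted files can execute arbitrary code. A minimal sketch of the JSON round-trip (helper names are illustrative):

```python
import json
from pathlib import Path


def save_paths(out_dir, paths):
    """Sketch: store dataset paths as a plain JSON list of strings,
    avoiding the pickle-backed object ndarray in paths.npy."""
    (Path(out_dir) / "paths.json").write_text(json.dumps(paths))


def load_paths(out_dir):
    paths = json.loads((Path(out_dir) / "paths.json").read_text())
    assert all(isinstance(p, str) for p in paths)
    return paths
```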
- Add a test comparing the mmap fast path vs. slow path under enable_mmap_shared with multiple local processes
- Verify the fast path writes no large metadata to temporary storage while the slow path does
- Track bytes saved during loading to measure the difference between the two paths
- Modified the `_load_sampler_config` function to enforce the use of `.npy` files and validate the loaded array's properties.
- Changed the behavior of the sampler to truncate the global order length to the nearest multiple of `global_batch_size * world_size` instead of rounding up.
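The truncation change above is simple integer arithmetic; a sketch (function name is illustrative, the rounding rule comes from the commit message):

```python
def truncate_global_order(n_samples, global_batch_size, world_size):
    """Sketch: truncate the global order length DOWN to the nearest
    multiple of global_batch_size * world_size (previously rounded up)."""
    step = global_batch_size * world_size
    return (n_samples // step) * step
```

So with `global_batch_size=4` and `world_size=8`, an order of 103 samples is cut to 96 rather than padded to 128.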


2 participants