
Rename CustomSampler to PresetSampler and add preset dataloader test #1637

Open
jayhenry wants to merge 47 commits into InternLM:main from jayhenry:preset_sampler

Conversation

@jayhenry
Collaborator

No description provided.

jayhenry added 30 commits March 23, 2026 12:19
- _load_pack_config_jsonl: parse [dataset_path(str), sample_idx, char_start, char_end, token_start_offset]
- Replace _load_pack_config_npy with _load_pack_config_parquet using load_mixed_dict_from_parquet
- Update _load_pack_config dispatch: .jsonl -> JSONL loader, .parquet -> Parquet loader
- Rewrite test helpers (_write_jsonl_pack, _write_parquet_pack) for new format
- Add loader-level unit tests (5 tests, all passing)
- Mark Feature 1 as passes: true in feature_list.json

Made-with: Cursor
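The 5-element row format above could be parsed roughly as follows. This is an illustrative sketch, not the repository's `_load_pack_config_jsonl`; the function name, signature, and sanity checks are assumptions based only on the commit message:

```python
import json
from pathlib import Path


def load_pack_config_jsonl(path):
    """Sketch: one pack per JSONL line; each slice is
    [dataset_path(str), sample_idx, char_start, char_end, token_start_offset]."""
    packs = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        slices = json.loads(line)
        for ds_path, sample_idx, char_start, char_end, tok_off in slices:
            # Minimal sanity checks implied by the format description.
            assert isinstance(ds_path, str)
            assert sample_idx >= 0 and tok_off >= 0
        packs.append([tuple(s) for s in slices])
    return packs
```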
- Build _path_to_ds_idx mapping from ds.path; replace dataset_id int with dataset_path str lookup
- Remove 'skip' strategy from short_pack_strategy and long_pack_strategy
- Fix token count: use ds.num_tokens[s_idx] directly (remove double-indexing via ds.sampled)
- New char-range validation: both -1 OK (plain DataItem); else char_start>=0 and char_end>char_start
- pack_infos stores 6-tuples (ds_idx, s_idx, char_start, char_end, token_start_offset, max_tokens)
- Add _FakeDataset with .path attribute; 8 new validation unit tests (13 total, all passing)
- Mark Feature 2 as passes: true in feature_list.json

Made-with: Cursor
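The new char-range rule from this commit can be sketched as a small predicate. The function below is hypothetical (the actual validation lives inside `_validate_pack`); only the rule itself is taken from the commit message:

```python
def validate_char_range(char_start, char_end):
    """Sketch of the rule: (-1, -1) marks a plain DataItem (whole sample);
    otherwise the range must satisfy 0 <= char_start < char_end."""
    if char_start == -1 and char_end == -1:
        return "plain"        # no character slicing
    if char_start >= 0 and char_end > char_start:
        return "long_text"    # valid sub-range of a long sample
    raise ValueError(f"invalid char range: ({char_start}, {char_end})")
```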
- Replace old token-slicing logic with DataItem/LongTextDataItem consistency check
- For char_start==-1: verify item is plain DataItem (no 'char_start' key)
- For char_start!=-1: verify LongTextDataItem fields match pack config exactly
- Retain truncation via max_tokens for long_pack_strategy='truncate'
- Update _FakeDataset with long_text_meta support for LongTextDataItem testing
- Add 6 TestGetitem unit tests (DataItem JSONL/Parquet, LongTextDataItem, mixed, error cases)
- Updated feature_list.json: marked feature InternLM#3 as passing

Made-with: Cursor
…tom pack integration

- Add disable_filter: bool = False to JsonlDataset.__init__; skips num_tokens==0 and max_length filters when True
- Add disable_filter: bool = False field to DatasetConfig; forwarded to JsonlDataset.build()
- Add DataloaderConfig._force_custom_pack_settings: forces sample_ratio=1.0, enable_sequential_sampler=True, disable_filter=True for all datasets when pack_level='custom', with warnings on overrides
- Call _force_custom_pack_settings in DataloaderConfig.build before build_datasets
- Add TestDisableFilter and TestDataloaderConfigCustomMode
- Updated feature_list.json: marked feature InternLM#4 as passing (4/4 features complete)

Made-with: Cursor
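The forced-settings behaviour described above might look like the sketch below, with dataset configs modeled as plain dicts for illustration (the real `_force_custom_pack_settings` operates on `DatasetConfig` objects; field names are taken from the commit message):

```python
import warnings


def force_custom_pack_settings(dataset_cfgs):
    """Sketch: when pack_level='custom', force sample_ratio=1.0,
    enable_sequential_sampler=True, disable_filter=True on every dataset,
    warning whenever an explicit user setting is overridden."""
    forced = {
        "sample_ratio": 1.0,
        "enable_sequential_sampler": True,
        "disable_filter": True,
    }
    for cfg in dataset_cfgs:
        for key, value in forced.items():
            if cfg.get(key) not in (None, value):
                warnings.warn(
                    f"pack_level='custom' overrides {key}={cfg[key]} -> {value}"
                )
            cfg[key] = value
    return dataset_cfgs
```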
- Add token_end_offset as 6th element to _PackSlice and _ValidatedSlice
- Update _load_pack_config_jsonl: require 6 elements, parse token_end_offset
- Update _load_pack_config_parquet: unpack 6th element token_end_offset
- Update _validate_pack: validate token_end_offset > token_start_offset >= 0;
  compute n_tokens = token_end_offset - token_start_offset (no longer reads ds.num_tokens);
  truncate adjusts token_end_offset of last slice
- Update __getitem__: compute max_tokens = tok_end - tok_off from validated slice
- Update all test fixtures to 6-element format; add 2 new validation tests
- All 25 tests pass; marked feature InternLM#5 as passing

Made-with: Cursor
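The offset validation and truncation rule above can be sketched as follows, with each slice reduced to its `(token_start_offset, token_end_offset)` pair for illustration (the actual code works on the full 6-tuples):

```python
def pack_n_tokens(slices, max_tokens=None):
    """Sketch: n_tokens per slice is token_end_offset - token_start_offset
    (no dataset lookup needed); with long_pack_strategy='truncate', the
    last slice's token_end_offset is shrunk so the pack fits max_tokens."""
    total = 0
    out = []
    for tok_start, tok_end in slices:
        # Rule from the commit: token_end_offset > token_start_offset >= 0
        assert 0 <= tok_start < tok_end
        out.append((tok_start, tok_end))
        total += tok_end - tok_start
    if max_tokens is not None and total > max_tokens:
        overflow = total - max_tokens
        s, e = out[-1]
        out[-1] = (s, e - overflow)
        total = max_tokens
    return out, total
```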
…nizeFn

- For char_start==-1 (plain TokenizeFn): slice input_ids/labels[tok_off:tok_end]
  so token_start_offset is correctly applied (not just tok_end truncation)
- For char_start!=-1 (LongTextTokenizeFn): return item as-is after field-match
  validation since it is pre-truncated at tokenize time
- Add test_plain_tokenizefn_token_start_offset_applied: verifies non-zero
  token_start_offset slicing on plain DataItem
- Add test_longtextdataitem_no_extra_truncation: verifies no re-slicing occurs
- All 27 tests pass; marked feature InternLM#6 as passing

Made-with: Cursor
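The two `__getitem__` branches described above could be sketched like this. The helper name and dict layout are illustrative assumptions; the branching rule itself comes from the commit message:

```python
def apply_token_slice(item, char_start, tok_start, tok_end):
    """Sketch: a plain TokenizeFn item (char_start == -1) is sliced to
    [tok_start:tok_end]; a LongTextTokenizeFn item (char_start != -1) is
    already truncated at tokenize time and is returned unchanged."""
    if char_start == -1:
        return {
            "input_ids": item["input_ids"][tok_start:tok_end],
            "labels": item["labels"][tok_start:tok_end],
        }
    return item
```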
…rized validation

- Add load_config(path, mmap=True) loading boundaries.npy, samples.npy, paths.npy
- Rewrite __init__ to store mmap'd arrays; remove old JSONL/Parquet machinery
- Replace _validate_pack Python loop with _validate_arrays vectorized numpy checks
- Move long_pack_strategy='truncate' handling from __init__ to __getitem__
- Update test helpers to write NPY directory format; update all fixtures
- Updated feature_list.json: marked feature InternLM#7 as passing

Made-with: Cursor
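The mmap'd loading path can be sketched with numpy's `mmap_mode`, which pages array data in lazily instead of reading it into RAM up front. This is a simplified stand-in for the commit's `load_config` (only the two numeric arrays are shown):

```python
from pathlib import Path

import numpy as np


def load_config(path, mmap=True):
    """Sketch: open boundaries.npy and samples.npy, memory-mapped
    read-only when mmap=True, fully loaded otherwise."""
    path = Path(path)
    mode = "r" if mmap else None
    return {
        "boundaries": np.load(path / "boundaries.npy", mmap_mode=mode),
        "samples": np.load(path / "samples.npy", mmap_mode=mode),
    }
```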
…mparison

- Add generate_stress_pack_config: greedy packing with uniform [200,16000] token lengths
- Add _MockDataset: satisfies JsonlDataset interface without file I/O
- Add TestStress with 3 tests:
  - test_generate_stress_pack_config: validates NPY directory output
  - test_multiprocess_getitem: 8 fork'd processes with random index sampling,
    reports init time, RSS/PSS deltas, and __getitem__ latency per rank
  - test_mmap_memory_saving: two subprocesses compare load_config RSS/PSS/elapsed
    for mmap=True (0.2MB, 0.7ms) vs mmap=False (24MB, 7.5ms)
- Updated feature_list.json: marked feature InternLM#8 as passing (8/8 complete)

Made-with: Cursor
…son and PackConfig TypedDict

- Rename custom_pack.py -> preset_pack.py, CustomPackDataset -> PresetPackDataset
- Replace paths.npy (allow_pickle) with paths.json for security
- Store paths as list[str] instead of object ndarray (equivalent memory, cleaner types)
- Introduce PackConfig TypedDict for precise return type of load_config()
- Update __init__.py, config.py, custom_sampler.py, run_test.sh accordingly
- Rename test file to test_preset_pack_dataset.py and update all references

Made-with: Cursor
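The security motivation for paths.json is that reading an object ndarray requires `np.load(..., allow_pickle=True)`, and unpickling untrusted files can execute arbitrary code. A minimal sketch of the JSON round-trip (helper names are illustrative):

```python
import json
from pathlib import Path


def save_paths(out_dir, paths):
    """Sketch: store dataset paths as a plain JSON list of strings,
    avoiding the pickle-backed object ndarray in paths.npy."""
    (Path(out_dir) / "paths.json").write_text(json.dumps(paths))


def load_paths(out_dir):
    paths = json.loads((Path(out_dir) / "paths.json").read_text())
    assert all(isinstance(p, str) for p in paths)
    return paths
```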
- Add a test comparing the mmap fast path vs. slow path under enable_mmap_shared with multiple local processes
- Verify the fast path writes no large metadata to temporary storage while the slow path does
- Track bytes saved during loading to measure the difference between the two paths
- Modified the `_load_sampler_config` function to enforce the use of `.npy` files and validate the loaded array's properties.
- Changed the behavior of the sampler to truncate the global order length to the nearest multiple of `global_batch_size * world_size` instead of rounding up.
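The truncation change above is simple integer arithmetic; a sketch (function name is illustrative, the rounding rule comes from the commit message):

```python
def truncate_global_order(n_samples, global_batch_size, world_size):
    """Sketch: truncate the global order length DOWN to the nearest
    multiple of global_batch_size * world_size (previously rounded up)."""
    step = global_batch_size * world_size
    return (n_samples // step) * step
```

So with `global_batch_size=4` and `world_size=8`, an order of 103 samples is cut to 96 rather than padded to 128.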


2 participants