
[QDP] Numpy Input Speed & Memory Improvements#877

Closed
400Ping wants to merge 2 commits into apache:main from 400Ping:qdp/numpy-speed-mem-improv

Conversation

@400Ping
Member

@400Ping 400Ping commented Jan 19, 2026

Purpose of PR

Implemented streaming and mmap alternatives for .npy ingestion to avoid the read_npy -> Array2 -> flatten Vec memory spike. Added direct header parsing, chunked readers (row-major + Fortran-order handling), and new IO helpers, plus memmap dependency and tests for streaming/mmap paths.
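Parsing the header directly, as described, avoids the read_npy -> Array2 round-trip just to learn the dtype, order, and shape. A minimal sketch for .npy format version 1.0 only; the function name and error handling are illustrative, not the PR's actual code:

```rust
/// Parse a v1.0 .npy header: magic, version, little-endian u16 header length,
/// then a Python-dict literal such as
/// {'descr': '<f8', 'fortran_order': False, 'shape': (10000, 1024), }
fn parse_npy_header(bytes: &[u8]) -> Result<(String, bool, Vec<usize>), String> {
    if bytes.len() < 10 || &bytes[..6] != b"\x93NUMPY" {
        return Err("not a .npy file".into());
    }
    if bytes[6] != 1 {
        return Err(format!("unsupported .npy major version {}", bytes[6]));
    }
    // v1.0 stores the header length as a little-endian u16 at offset 8.
    let header_len = u16::from_le_bytes([bytes[8], bytes[9]]) as usize;
    let header = std::str::from_utf8(&bytes[10..10 + header_len])
        .map_err(|_| "header is not UTF-8".to_string())?;

    // Crude field extraction from the dict literal.
    let descr = header
        .split("'descr':")
        .nth(1)
        .and_then(|s| s.split('\'').nth(1))
        .ok_or("missing descr")?
        .to_string();
    let fortran_order = header
        .split("'fortran_order':")
        .nth(1)
        .map(|s| s.trim_start().starts_with("True"))
        .ok_or("missing fortran_order")?;
    let shape: Vec<usize> = header
        .split("'shape':")
        .nth(1)
        .and_then(|s| s.split('(').nth(1))
        .and_then(|s| s.split(')').next())
        .ok_or("missing shape")?
        .split(',')
        .filter_map(|t| t.trim().parse::<usize>().ok())
        .collect();
    Ok((descr, fortran_order, shape))
}
```

The `fortran_order` flag is what lets a chunked reader decide between straight row-major reads and the column-major reshuffle path.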

Related Issues or PRs

Closes #789

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

Signed-off-by: 400Ping <fourhundredping@gmail.com>
@400Ping 400Ping force-pushed the qdp/numpy-speed-mem-improv branch from 44bc250 to 549e254 Compare January 19, 2026 17:31
@400Ping 400Ping changed the title [QDP] Numpy Input Speed & Memory Improvements Jan 19, 2026
Signed-off-by: 400Ping <fourhundredping@gmail.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR implements streaming and memory-mapped alternatives for NumPy .npy file ingestion to avoid memory spikes from the original approach that loaded the entire file through Array2 before flattening. The changes add direct header parsing, chunked readers with support for both row-major (C) and Fortran-order arrays, new IO helper functions, the memmap2 dependency, and comprehensive unit tests.

Changes:

  • Added NumpyStreamingReader for chunk-by-chunk reading without loading entire files into memory
  • Added NumpyMmapReader for memory-mapped IO with OS-managed paging
  • Implemented custom header parsing to avoid intermediate Array2 allocations
  • Added helper functions for safe f64 reading from bytes and files
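A chunk-by-chunk read loop of the kind the streaming reader describes can be sketched with plain std::fs primitives. The type and method names here are hypothetical, and `data_offset` is assumed to come from a prior header parse:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Hypothetical chunked reader: after skipping the .npy header, pull a bounded
/// number of rows at a time instead of materializing the whole array.
struct StreamingReader {
    file: File,
    sample_size: usize, // elements per row
    rows_remaining: usize,
}

impl StreamingReader {
    fn open(path: &str, data_offset: u64, sample_size: usize, rows: usize)
        -> std::io::Result<Self>
    {
        let mut file = File::open(path)?;
        file.seek(SeekFrom::Start(data_offset))?; // jump past the header
        Ok(Self { file, sample_size, rows_remaining: rows })
    }

    /// Read up to `max_rows` rows of little-endian f64 into a fresh Vec.
    fn next_chunk(&mut self, max_rows: usize) -> std::io::Result<Vec<f64>> {
        let rows = max_rows.min(self.rows_remaining);
        let mut raw = vec![0u8; rows * self.sample_size * 8];
        self.file.read_exact(&mut raw)?;
        self.rows_remaining -= rows;
        Ok(raw
            .chunks_exact(8)
            .map(|b| f64::from_le_bytes(b.try_into().unwrap()))
            .collect())
    }
}
```

Peak memory is then bounded by the chunk size rather than the file size, which is the spike the PR is targeting; Fortran-order files would need an extra transpose step per chunk.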

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 20 comments.

Summary per file:
  • qdp/qdp-core/src/readers/numpy.rs: Core implementation of streaming and mmap readers with custom header parsing, data reading utilities, and unit tests
  • qdp/qdp-core/src/readers/mod.rs: Exported new reader types for public API
  • qdp/qdp-core/src/io.rs: Added convenience functions for streaming and mmap batch reading
  • qdp/qdp-core/Cargo.toml: Added memmap2 dependency for memory-mapped IO
  • qdp/Cargo.toml: Specified memmap2 version in workspace
  • qdp/Cargo.lock: Locked memmap2 dependency


viiccwen

This comment was marked as duplicate.

let start = data_base + self.row_cursor * sample_size * std::mem::size_of::<f64>();
let end = start + elem_count * std::mem::size_of::<f64>();
let bytes = &self.mmap[start..end];
copy_f64s_from_bytes(bytes, &mut buffer[..elem_count])?;
Contributor

@viiccwen viiccwen Jan 20, 2026


Since copy_f64s_from_bytes is used in the mmap reader, which needs high throughput, and it currently does a bytes -> f64 transformation plus a copy, maybe we could take a follow-up to turn it into a pure memcpy (zero-copy) path. WDYT?
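For reference, the two strategies under discussion might look like this: the current decode-and-copy path, and an alignment-checked zero-copy view that a follow-up could add. Names are illustrative; the PR's actual copy_f64s_from_bytes returns a Result and differs in detail:

```rust
/// Safe fallback: decode each 8-byte group as a little-endian f64.
fn decode_f64s_le(bytes: &[u8], out: &mut [f64]) {
    for (chunk, dst) in bytes.chunks_exact(8).zip(out.iter_mut()) {
        *dst = f64::from_le_bytes(chunk.try_into().unwrap());
    }
}

/// Zero-copy view, usable only when the mapped region happens to be 8-byte
/// aligned and the platform is little-endian (true on x86_64 and aarch64).
fn f64_view(bytes: &[u8]) -> Option<&[f64]> {
    if bytes.as_ptr() as usize % std::mem::align_of::<f64>() != 0
        || bytes.len() % 8 != 0
        || !cfg!(target_endian = "little")
    {
        return None; // caller falls back to the copying path
    }
    // SAFETY: alignment, length, and endianness were just checked, and any
    // 8-byte bit pattern is a valid f64.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const f64, bytes.len() / 8)
    })
}
```

On most targets the mmap base address is page-aligned, so the view path would cover reads that start at an 8-byte-aligned offset, with the decode loop as the portable fallback.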

Member Author


I think there is actually an open issue for it; I will probably do this later tonight.

Contributor


Nice try diddy. Thanks for the contribution!

Member Author

@400Ping 400Ping Jan 20, 2026


Wait, I double-checked and found that rich didn't put zero-copy into the issue list (#718). Weird.

Member Author


If you want, you can open a follow-up for this.

Contributor


Got it

@ryankert01
Member

ryankert01 commented Jan 20, 2026

Could you provide before-and-after benchmark results, if any, so reviewers don't have to run it themselves? Thanks!

@400Ping
Member Author

400Ping commented Jan 20, 2026

Before:

======================================================================
NUMPY I/O + ENCODING BENCHMARK
======================================================================
Qubits: 10
Sample size: 1024 elements
Number of samples: 10000
Total data: 78.12 MB
Frameworks: mahout, pennylane

Generating test data...
Saving to /tmp/tmp0rf7j7p6.npy...
File size: 78.13 MB

[Mahout + NumPy] Loading and encoding...
  Total Time (I/O + Encode): 0.0581 s
  Throughput: 172165.2 samples/sec
  Average per sample: 0.01 ms

[PennyLane + NumPy] Loading and encoding...
  Total Time (I/O + Encode): 2.8481 s
  Throughput: 3511.2 samples/sec
  Average per sample: 0.28 ms

======================================================================
SUMMARY
======================================================================
Framework       Time (s)     Throughput           Avg/Sample     
----------------------------------------------------------------------
Mahout          0.0581       172165.2             0.01           
PennyLane       2.8481       3511.2               0.28           

----------------------------------------------------------------------
SPEEDUP COMPARISON
----------------------------------------------------------------------
Mahout vs PennyLane: 49.03x
Time reduction: 49.03x faster

Cleaned up temporary file: /tmp/tmp0rf7j7p6.npy

======================================================================
BENCHMARK COMPLETE
======================================================================

After:

======================================================================
NUMPY I/O + ENCODING BENCHMARK
======================================================================
Qubits: 10
Sample size: 1024 elements
Number of samples: 10000
Total data: 78.12 MB
Frameworks: mahout, pennylane

Generating test data...
Saving to /tmp/tmpq898jgel.npy...
File size: 78.13 MB

[Mahout + NumPy] Loading and encoding...
  Total Time (I/O + Encode): 0.0574 s
  Throughput: 174195.3 samples/sec
  Average per sample: 0.01 ms

[PennyLane + NumPy] Loading and encoding...
  Total Time (I/O + Encode): 2.8587 s
  Throughput: 3498.1 samples/sec
  Average per sample: 0.29 ms

======================================================================
SUMMARY
======================================================================
Framework       Time (s)     Throughput           Avg/Sample     
----------------------------------------------------------------------
Mahout          0.0574       174195.3             0.01           
PennyLane       2.8587       3498.1               0.29           

----------------------------------------------------------------------
SPEEDUP COMPARISON
----------------------------------------------------------------------
Mahout vs PennyLane: 49.80x
Time reduction: 49.80x faster

Cleaned up temporary file: /tmp/tmpq898jgel.npy

======================================================================
BENCHMARK COMPLETE
======================================================================

@400Ping
Member Author

400Ping commented Jan 20, 2026

The improvement isn't significant; should I close this?

@ryankert01
Member

ryankert01 commented Jan 20, 2026

You might also want to monitor the memory spike, since that's your primary goal, and check whether it really matters, given this runs on CPU and RAM can be offloaded to SSD. (I'm not sure if that's possible, though; I'm not familiar with offloading strategies. Offloading would decrease speed because it adds another I/O bound.)

@400Ping
Member Author

400Ping commented Jan 20, 2026

You might also want to monitor the memory spike, since that's your primary goal, and check whether it really matters, given this runs on CPU and RAM can be offloaded to SSD. (I'm not sure if that's possible, though; I'm not familiar with offloading strategies. Offloading would decrease speed because it adds another I/O bound.)

Just checked the memory spike; it isn't better:

main: Maximum resident set size 838,872 kB (~819.3 MiB)
PR:   Maximum resident set size 838,716 kB (~819.1 MiB)

Closing this PR

@400Ping 400Ping closed this Jan 20, 2026
@guan404ming guan404ming modified the milestone: Qumat 0.5.0 Jan 20, 2026


Development

Successfully merging this pull request may close these issues.

[QDP] Numpy input potential speed & mem improv

5 participants