
[QDP] Numpy Input Speed & Memory Improvements#877

Closed
400Ping wants to merge 2 commits into apache:main from 400Ping:qdp/numpy-speed-mem-improv

Conversation

@400Ping
Member

@400Ping 400Ping commented Jan 19, 2026

Purpose of PR

Implemented streaming and mmap alternatives for .npy ingestion to avoid the read_npy -> Array2 -> flatten Vec memory spike. Added direct header parsing, chunked readers (row-major + Fortran-order handling), and new IO helpers, plus memmap dependency and tests for streaming/mmap paths.
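Parsing the header directly, as described, avoids the read_npy -> Array2 round-trip just to learn the dtype, order, and shape. A minimal sketch for .npy format version 1.0 only; the function name and error handling are illustrative, not the PR's actual code:

```rust
/// Parse a v1.0 .npy header: magic, version, little-endian u16 header length,
/// then a Python-dict literal such as
/// {'descr': '<f8', 'fortran_order': False, 'shape': (10000, 1024), }
fn parse_npy_header(bytes: &[u8]) -> Result<(String, bool, Vec<usize>), String> {
    if bytes.len() < 10 || &bytes[..6] != b"\x93NUMPY" {
        return Err("not a .npy file".into());
    }
    if bytes[6] != 1 {
        return Err(format!("unsupported .npy major version {}", bytes[6]));
    }
    // v1.0 stores the header length as a little-endian u16 at offset 8.
    let header_len = u16::from_le_bytes([bytes[8], bytes[9]]) as usize;
    let header = std::str::from_utf8(&bytes[10..10 + header_len])
        .map_err(|_| "header is not UTF-8".to_string())?;

    // Crude field extraction from the dict literal.
    let descr = header
        .split("'descr':")
        .nth(1)
        .and_then(|s| s.split('\'').nth(1))
        .ok_or("missing descr")?
        .to_string();
    let fortran_order = header
        .split("'fortran_order':")
        .nth(1)
        .map(|s| s.trim_start().starts_with("True"))
        .ok_or("missing fortran_order")?;
    let shape: Vec<usize> = header
        .split("'shape':")
        .nth(1)
        .and_then(|s| s.split('(').nth(1))
        .and_then(|s| s.split(')').next())
        .ok_or("missing shape")?
        .split(',')
        .filter_map(|t| t.trim().parse::<usize>().ok())
        .collect();
    Ok((descr, fortran_order, shape))
}
```

The `fortran_order` flag is what lets a chunked reader decide between straight row-major reads and the column-major reshuffle path.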

Related Issues or PRs

Closes #789

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

Signed-off-by: 400Ping <fourhundredping@gmail.com>
@400Ping 400Ping force-pushed the qdp/numpy-speed-mem-improv branch from 44bc250 to 549e254 Compare January 19, 2026 17:31
@400Ping 400Ping changed the title [QDP] Numpy Input Speed & Memory Improvements Jan 19, 2026
Signed-off-by: 400Ping <fourhundredping@gmail.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR implements streaming and memory-mapped alternatives for NumPy .npy file ingestion to avoid memory spikes from the original approach that loaded the entire file through Array2 before flattening. The changes add direct header parsing, chunked readers with support for both row-major (C) and Fortran-order arrays, new IO helper functions, the memmap2 dependency, and comprehensive unit tests.

Changes:

  • Added NumpyStreamingReader for chunk-by-chunk reading without loading entire files into memory
  • Added NumpyMmapReader for memory-mapped IO with OS-managed paging
  • Implemented custom header parsing to avoid intermediate Array2 allocations
  • Added helper functions for safe f64 reading from bytes and files
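A chunk-by-chunk read loop of the kind the streaming reader describes can be sketched with plain std::fs primitives. The type and method names here are hypothetical, and `data_offset` is assumed to come from a prior header parse:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Hypothetical chunked reader: after skipping the .npy header, pull a bounded
/// number of rows at a time instead of materializing the whole array.
struct StreamingReader {
    file: File,
    sample_size: usize, // elements per row
    rows_remaining: usize,
}

impl StreamingReader {
    fn open(path: &str, data_offset: u64, sample_size: usize, rows: usize)
        -> std::io::Result<Self>
    {
        let mut file = File::open(path)?;
        file.seek(SeekFrom::Start(data_offset))?; // jump past the header
        Ok(Self { file, sample_size, rows_remaining: rows })
    }

    /// Read up to `max_rows` rows of little-endian f64 into a fresh Vec.
    fn next_chunk(&mut self, max_rows: usize) -> std::io::Result<Vec<f64>> {
        let rows = max_rows.min(self.rows_remaining);
        let mut raw = vec![0u8; rows * self.sample_size * 8];
        self.file.read_exact(&mut raw)?;
        self.rows_remaining -= rows;
        Ok(raw
            .chunks_exact(8)
            .map(|b| f64::from_le_bytes(b.try_into().unwrap()))
            .collect())
    }
}
```

Peak memory is then bounded by the chunk size rather than the file size, which is the spike the PR is targeting; Fortran-order files would need an extra transpose step per chunk.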

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 20 comments.

Summary per file:
  • qdp/qdp-core/src/readers/numpy.rs: Core implementation of streaming and mmap readers with custom header parsing, data reading utilities, and unit tests
  • qdp/qdp-core/src/readers/mod.rs: Exported new reader types for public API
  • qdp/qdp-core/src/io.rs: Added convenience functions for streaming and mmap batch reading
  • qdp/qdp-core/Cargo.toml: Added memmap2 dependency for memory-mapped IO
  • qdp/Cargo.toml: Specified memmap2 version in workspace
  • qdp/Cargo.lock: Locked memmap2 dependency


viiccwen

This comment was marked as duplicate.

let start = data_base + self.row_cursor * sample_size * std::mem::size_of::<f64>();
let end = start + elem_count * std::mem::size_of::<f64>();
let bytes = &self.mmap[start..end];
copy_f64s_from_bytes(bytes, &mut buffer[..elem_count])?;
Contributor

@viiccwen viiccwen Jan 20, 2026


Since copy_f64s_from_bytes is used in the mmap reader, which needs high throughput, and it currently does a bytes -> f64 transformation plus a copy, maybe we could take a follow-up to turn it into a pure memcpy (zero-copy) path. WDYT?
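For reference, the two strategies under discussion might look like this: the current decode-and-copy path, and an alignment-checked zero-copy view that a follow-up could add. Names are illustrative; the PR's actual copy_f64s_from_bytes returns a Result and differs in detail:

```rust
/// Safe fallback: decode each 8-byte group as a little-endian f64.
fn decode_f64s_le(bytes: &[u8], out: &mut [f64]) {
    for (chunk, dst) in bytes.chunks_exact(8).zip(out.iter_mut()) {
        *dst = f64::from_le_bytes(chunk.try_into().unwrap());
    }
}

/// Zero-copy view, usable only when the mapped region happens to be 8-byte
/// aligned and the platform is little-endian (true on x86_64 and aarch64).
fn f64_view(bytes: &[u8]) -> Option<&[f64]> {
    if bytes.as_ptr() as usize % std::mem::align_of::<f64>() != 0
        || bytes.len() % 8 != 0
        || !cfg!(target_endian = "little")
    {
        return None; // caller falls back to the copying path
    }
    // SAFETY: alignment, length, and endianness were just checked, and any
    // 8-byte bit pattern is a valid f64.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const f64, bytes.len() / 8)
    })
}
```

On most targets the mmap base address is page-aligned, so the view path would cover reads that start at an 8-byte-aligned offset, with the decode loop as the portable fallback.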

Member Author


I think there is actually an open issue for it; I will probably do this later tonight.

Contributor


Nice try diddy. Thanks for the contribution!

Member Author

@400Ping 400Ping Jan 20, 2026


Wait, I double-checked and found that rich didn't put zero-copy into the issue list (#718). Weird.

Member Author


If you want, you can open a follow-up for this.

Contributor


Got it

@ryankert01
Member

ryankert01 commented Jan 20, 2026

Could you provide before-and-after benchmark results, if any, so reviewers don't have to run it themselves? Thanks!

@400Ping
Member Author

400Ping commented Jan 20, 2026

Before:

======================================================================
NUMPY I/O + ENCODING BENCHMARK
======================================================================
Qubits: 10
Sample size: 1024 elements
Number of samples: 10000
Total data: 78.12 MB
Frameworks: mahout, pennylane

Generating test data...
Saving to /tmp/tmp0rf7j7p6.npy...
File size: 78.13 MB

[Mahout + NumPy] Loading and encoding...
  Total Time (I/O + Encode): 0.0581 s
  Throughput: 172165.2 samples/sec
  Average per sample: 0.01 ms

[PennyLane + NumPy] Loading and encoding...
  Total Time (I/O + Encode): 2.8481 s
  Throughput: 3511.2 samples/sec
  Average per sample: 0.28 ms

======================================================================
SUMMARY
======================================================================
Framework       Time (s)     Throughput           Avg/Sample     
----------------------------------------------------------------------
Mahout          0.0581       172165.2             0.01           
PennyLane       2.8481       3511.2               0.28           

----------------------------------------------------------------------
SPEEDUP COMPARISON
----------------------------------------------------------------------
Mahout vs PennyLane: 49.03x
Time reduction: 49.03x faster

Cleaned up temporary file: /tmp/tmp0rf7j7p6.npy

======================================================================
BENCHMARK COMPLETE
======================================================================

After:

======================================================================
NUMPY I/O + ENCODING BENCHMARK
======================================================================
Qubits: 10
Sample size: 1024 elements
Number of samples: 10000
Total data: 78.12 MB
Frameworks: mahout, pennylane

Generating test data...
Saving to /tmp/tmpq898jgel.npy...
File size: 78.13 MB

[Mahout + NumPy] Loading and encoding...
  Total Time (I/O + Encode): 0.0574 s
  Throughput: 174195.3 samples/sec
  Average per sample: 0.01 ms

[PennyLane + NumPy] Loading and encoding...
  Total Time (I/O + Encode): 2.8587 s
  Throughput: 3498.1 samples/sec
  Average per sample: 0.29 ms

======================================================================
SUMMARY
======================================================================
Framework       Time (s)     Throughput           Avg/Sample     
----------------------------------------------------------------------
Mahout          0.0574       174195.3             0.01           
PennyLane       2.8587       3498.1               0.29           

----------------------------------------------------------------------
SPEEDUP COMPARISON
----------------------------------------------------------------------
Mahout vs PennyLane: 49.80x
Time reduction: 49.80x faster

Cleaned up temporary file: /tmp/tmpq898jgel.npy

======================================================================
BENCHMARK COMPLETE
======================================================================

@400Ping
Member Author

400Ping commented Jan 20, 2026

The improvement isn't significant; should I close this?

@ryankert01
Member

ryankert01 commented Jan 20, 2026

You might also want to monitor the memory spike, since that's your primary goal, and check whether it really matters, given this runs on CPU and RAM can be offloaded to SSD. (I'm not sure if that's possible, though; I'm not familiar with offloading strategies. Offloading would decrease speed because it adds another I/O bound.)

@400Ping
Member Author

400Ping commented Jan 20, 2026

You might also want to monitor the memory spike, since that's your primary goal, and check whether it really matters, given this runs on CPU and RAM can be offloaded to SSD. (I'm not sure if that's possible, though; I'm not familiar with offloading strategies. Offloading would decrease speed because it adds another I/O bound.)

Just checked the memory spike; it isn't better:

main: Maximum resident set size 838,872 kB (~819.3 MiB)
PR:   Maximum resident set size 838,716 kB (~819.1 MiB)

Closing this PR

@400Ping 400Ping closed this Jan 20, 2026
@guan404ming guan404ming modified the milestone: Qumat 0.5.0 Jan 20, 2026


Development

Successfully merging this pull request may close these issues.

[QDP] Numpy input potential speed & mem improv

5 participants