Fuzz testing for parquet errors vs panic #9742

@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In general, this crate should return an error on invalid data rather than panic:

https://github.com/apache/arrow-rs?tab=readme-ov-file#guidelines-for-panic-vs-result

For those caused by invalid user input, however, we prefer to report that invalidity gracefully as an error result instead of panicking. In general, invalid input should result in an Error as soon as possible.
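To make the guideline concrete, here is a minimal sketch of footer validation that returns an error instead of panicking. It relies only on the Parquet file layout (the file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`). The names `footer_metadata_len` and `FooterError` are hypothetical and are not the parquet crate's API, and the sketch ignores the leading magic bytes and encrypted footers.

```rust
/// Magic bytes that end an unencrypted Parquet file.
const MAGIC: &[u8] = b"PAR1";
/// Footer size: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_LEN: usize = 8;

/// Hypothetical error type for this sketch (not the parquet crate's error type).
#[derive(Debug, PartialEq)]
enum FooterError {
    TooShort { len: usize },
    BadMagic,
    MetadataOutOfBounds { metadata_len: usize, available: usize },
}

/// Returns the metadata length recorded in the footer, or an error if the
/// input cannot be a valid file. Every slice below is bounds-checked up
/// front, so truncated or corrupted input cannot cause an indexing panic.
fn footer_metadata_len(file: &[u8]) -> Result<usize, FooterError> {
    if file.len() < FOOTER_LEN {
        return Err(FooterError::TooShort { len: file.len() });
    }
    let footer = &file[file.len() - FOOTER_LEN..];
    if &footer[4..] != MAGIC {
        return Err(FooterError::BadMagic);
    }
    let metadata_len =
        u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as usize;
    // Simplification: only the footer is subtracted from the available bytes.
    let available = file.len() - FOOTER_LEN;
    if metadata_len > available {
        return Err(FooterError::MetadataOutOfBounds { metadata_len, available });
    }
    Ok(metadata_len)
}
```

The point is the shape rather than the details: each assumption about the input is checked and reported as an `Err` at the earliest place it can be detected.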

However, we keep hitting various code paths in the parquet reader that panic.

Because these paths are only reachable with a corrupt or invalid data source, it is hard to write tests for them.

For example, here is a test that @xuzifu666 added for one such error: 0bb9942

However, I think tests like this will be hard to maintain over the long run, because programmatic generation of bad data is brittle (if we change how the thrift is written, for example, the truncation may go down a different path).

Describe the solution you'd like

I think we should consider some sort of parquet fuzzer that generates randomly corrupted data and ensures that the reader returns an error (rather than panicking). It would be nice if it wrote some valid parquet files and then applied common kinds of data corruption:

  1. Truncate the data (remove bytes from the end of the file)
  2. Truncate the data (remove bytes from the start of the file)
  3. Flip a random bit
  4. Set a random range of the file to all zeros

There are probably other useful corruptions we could add.
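As a rough sketch of what such a fuzzer could look like, the code below implements the four corruptions listed above over an in-memory buffer and uses `std::panic::catch_unwind` to detect panics. Everything here is an assumption for illustration: `XorShift`, `Corruption`, `corrupt`, `returns_without_panic`, and `find_panic` are hypothetical names, the tiny xorshift generator only exists to avoid an external dependency, and the `read` closure stands in for whatever parquet reader entry point is under test.

```rust
use std::panic;

/// Tiny xorshift64 generator so the sketch needs no external crates.
/// The state must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
    /// Value in `0..n` (`n` must be non-zero); modulo bias is fine here.
    fn below(&mut self, n: usize) -> usize {
        (self.next() % n as u64) as usize
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Corruption {
    TruncateEnd,
    TruncateStart,
    FlipBit,
    ZeroRange,
}

/// Returns a corrupted copy of `data`; empty input is returned unchanged.
fn corrupt(data: &[u8], kind: Corruption, rng: &mut XorShift) -> Vec<u8> {
    let mut out = data.to_vec();
    if out.is_empty() {
        return out;
    }
    match kind {
        Corruption::TruncateEnd => {
            let remove = 1 + rng.below(out.len());
            out.truncate(out.len() - remove);
        }
        Corruption::TruncateStart => {
            let remove = 1 + rng.below(out.len());
            out.drain(..remove);
        }
        Corruption::FlipBit => {
            let idx = rng.below(out.len());
            out[idx] ^= 1u8 << rng.below(8);
        }
        Corruption::ZeroRange => {
            let start = rng.below(out.len());
            let len = 1 + rng.below(out.len() - start);
            out[start..start + len].fill(0);
        }
    }
    out
}

/// True if `read` returned (with `Ok` or `Err`) instead of panicking.
fn returns_without_panic<T, E>(input: &[u8], read: impl Fn(&[u8]) -> Result<T, E>) -> bool {
    panic::catch_unwind(panic::AssertUnwindSafe(|| read(input).is_ok())).is_ok()
}

/// Applies `iterations` random corruptions to `valid` and returns the first
/// corrupted input (and its corruption kind) that makes `read` panic.
fn find_panic<T, E>(
    valid: &[u8],
    seed: u64,
    iterations: usize,
    read: impl Fn(&[u8]) -> Result<T, E>,
) -> Option<(Corruption, Vec<u8>)> {
    const KINDS: [Corruption; 4] = [
        Corruption::TruncateEnd,
        Corruption::TruncateStart,
        Corruption::FlipBit,
        Corruption::ZeroRange,
    ];
    let mut rng = XorShift(seed.max(1));
    for _ in 0..iterations {
        let kind = KINDS[rng.below(KINDS.len())];
        let bad = corrupt(valid, kind, &mut rng);
        if !returns_without_panic(&bad, &read) {
            return Some((kind, bad));
        }
    }
    None
}
```

A test could write a small valid parquet file into a `Vec<u8>`, call `find_panic` with a closure that runs the reader over the bytes, and assert that the result is `None`. Seeding the generator keeps any failure reproducible, and the returned bytes can be saved as a regression input. Note that `catch_unwind` only works when panics unwind, which is the default for test builds.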

Describe alternatives you've considered

Additional context
Related to

Labels: enhancement
