
Multi-threaded scan #47

Description

@freddie-freeloader

First findings

  • It seems that DuckDB might be able to read multiple Parquet files concurrently -- but not a single file concurrently

Thoughts

  • In theory, we could do this by copying with exactly the same number of threads as sheetreader-core used, with each copy thread using the location info of the corresponding sheetreader thread.
  • Would it be possible to partition the Excel sheet into chunks of 2048 / (number of threads) rows and make the buffers that size? Probably tricky, because we would have to know the number of columns beforehand (buffer size / number of columns is the number of rows that fit into one buffer).
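To make the arithmetic in the second thought concrete, here is a minimal sketch. The function names are hypothetical and the buffer size is assumed to be measured in cells; the value 2048 is taken from the text above (it matches DuckDB's default vector size). It is only meant to show why the column count has to be known up front, not how sheetreader-core sizes its buffers.

```cpp
#include <cstddef>

// 2048 is the chunk size mentioned above (DuckDB's default vector size).
constexpr std::size_t kVectorSize = 2048;

// Rows each thread would handle if one 2048-row chunk is split evenly.
// Integer division drops the remainder, so with e.g. 3 threads
// two rows per chunk would have to be assigned separately.
constexpr std::size_t RowsPerThread(std::size_t num_threads) {
    return num_threads == 0 ? 0 : kVectorSize / num_threads;
}

// Rows that fit into one buffer, assuming (hypothetically) that the
// buffer size is counted in cells: buffer size / number of columns.
constexpr std::size_t RowsPerBuffer(std::size_t buffer_cells,
                                    std::size_t num_columns) {
    return num_columns == 0 ? 0 : buffer_cells / num_columns;
}

// Buffer size (in cells) needed so that one buffer holds exactly one
// thread's share of rows. This depends on num_columns, which is why the
// column count must be known before the buffers are allocated.
constexpr std::size_t BufferCellsFor(std::size_t num_threads,
                                     std::size_t num_columns) {
    return RowsPerThread(num_threads) * num_columns;
}
```

With 4 threads and 10 columns this gives 512 rows per thread and buffers of 5120 cells. If the column count is only discovered while parsing, the buffer size cannot be fixed in advance this way.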

TODO

A multi-threaded scan would be interesting, since our copy/scan function takes some time.

Have a look at:

https://github.com/duckdb/duckdb_delta/blob/main/src/functions/delta_scan.cpp

According to the README, it supports a multi-threaded scan. I suspect that this didn't require any new implementation on their side, since they are reading Parquet files.

  • Find out whether this is due to the Parquet files
  • Find out whether DuckDB also supports a multi-threaded scan of the Apache Arrow format
  • Have a look at how the multi-threaded scan is implemented
  • Find out whether we could copy concurrently -- this might not be possible, because sheetreader-core stores the data in a special way (per thread, with some rows split across multiple threads, and with only an implicit order)
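As a starting point for the third item, the sketch below shows one common, generic way to hand out work in a multi-threaded scan: a shared state from which each thread atomically claims the next row range. This is not DuckDB's or delta_scan's actual implementation; `RowRange`, `SharedScanState`, and `ClaimNext` are hypothetical names. It also assumes rows can be addressed by a global row index, which may not hold for sheetreader-core given the implicit ordering noted above.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

// Hypothetical half-open row range [begin, end) handed to one scan thread.
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical shared state for a parallel scan. Each worker thread calls
// ClaimNext() in a loop until it returns false.
// Precondition: rows_per_chunk > 0.
class SharedScanState {
public:
    SharedScanState(std::size_t total_rows, std::size_t rows_per_chunk)
        : total_rows_(total_rows),
          rows_per_chunk_(rows_per_chunk),
          next_row_(0) {}

    // Claims the next unprocessed range and writes it to `out`.
    // Returns false once all rows have been handed out.
    bool ClaimNext(RowRange &out) {
        // fetch_add is atomic, so no two threads get the same start row.
        const std::size_t begin = next_row_.fetch_add(rows_per_chunk_);
        if (begin >= total_rows_) {
            return false;
        }
        out.begin = begin;
        // The last range is clamped to the end of the sheet.
        out.end = std::min(begin + rows_per_chunk_, total_rows_);
        return true;
    }

private:
    const std::size_t total_rows_;
    const std::size_t rows_per_chunk_;
    std::atomic<std::size_t> next_row_;
};
```

For 5000 rows and chunks of 2048 rows, successive calls yield [0, 2048), [2048, 4096), [4096, 5000), and then false. The open question from the last item remains: ranges claimed this way would still have to be mapped onto the per-thread buffers that sheetreader-core produces.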
