named tuple iterator, fixes for nested structures and column name handling #71

Merged
tanmaykm merged 9 commits into master from tan/namedtuples on May 18, 2020

@tanmaykm
Member

  • Removed iterators that return Julia structs
  • Removed schema creators for Julia structs, and those for Protobuf and Thrift
  • Added an iterator RecordCursor that gives out named tuples for records. Did not reuse the old name RecCursor to avoid confusion.
  • Fixed handling of nested structures in schema. Nested structures will appear as nested named tuples.
  • Used string vectors instead of a delimiter character to represent fully qualified column names. Since we now support arbitrary names, we cannot settle on any single delimiter. The column path matters in Parquet because a schema can contain nested structures, and the same name can appear at different paths. The safest way is to represent the fully qualified path as a vector of path elements. Operations using the iterator remain oblivious to this, though.
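
To illustrate the last two points, here is a minimal sketch in plain Julia (it makes no Parquet.jl calls). The column names and the `.` delimiter are made up for the illustration; they are not taken from any real file.

```julia
# Two distinct columns (names are hypothetical):
#   a top-level column literally named "a.b"
#   a column "b" nested inside a group "a"
path_flat   = ["a.b"]
path_nested = ["a", "b"]

# Joined with a delimiter, the two become indistinguishable.
@assert join(path_flat, ".") == join(path_nested, ".")

# Kept as vectors of path elements, they remain distinct.
@assert path_flat != path_nested

# A nested structure surfaces as a nested named tuple, so a value can be
# reached by following the path one element at a time.
rec = (id = 1, a = (b = "nested value",))
value = foldl(getproperty, Symbol.(path_nested); init = rec)
println(value)
```

The same ambiguity arises with any other delimiter as soon as column names may contain it, which is why the path is kept as a vector.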

Ref: #51. This should now pass all the failure cases listed there.

cc: @davidanthoff does this look fine?

Some example interactions:

julia> using Parquet

julia> par = ParFile("booltest/alltypes_plain.snappy.parquet")
Parquet file: booltest/alltypes_plain.snappy.parquet
    version: 1
    nrows: 2
    created by: impala version 1.3.0-INTERNAL (build 8a48ddb1eff84592b3fc06bc6f51ec120e1fffc9)
    cached: 0 column chunks

julia> for rec in RecordCursor(par)
           println(rec)
       end
NamedTuple{(:id, :bool_col, :tinyint_col, :smallint_col, :int_col, :bigint_col, :float_col, :double_col, :date_string_col, :string_col, :timestamp_col),Tuple{Union{Missing, Int32},Union{Missing, Bool},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int64},Union{Missing, Float32},Union{Missing, Float64},Union{Missing, Array{UInt8,1}},Union{Missing, Array{UInt8,1}},Union{Missing, Int128}}}((6, true, 0, 0, 0, 0, 0.0f0, 0.0, UInt8[0x30, 0x34, 0x2f, 0x30, 0x31, 0x2f, 0x30, 0x39], UInt8[0x30], 45285336301663273581805568))
NamedTuple{(:id, :bool_col, :tinyint_col, :smallint_col, :int_col, :bigint_col, :float_col, :double_col, :date_string_col, :string_col, :timestamp_col),Tuple{Union{Missing, Int32},Union{Missing, Bool},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int64},Union{Missing, Float32},Union{Missing, Float64},Union{Missing, Array{UInt8,1}},Union{Missing, Array{UInt8,1}},Union{Missing, Int128}}}((7, false, 1, 1, 1, 10, 1.1f0, 10.1, UInt8[0x30, 0x34, 0x2f, 0x30, 0x31, 0x2f, 0x30, 0x39], UInt8[0x31], 45285336301663333581805568))

julia> rc = RecordCursor(par)
Record Cursor on booltest/alltypes_plain.snappy.parquet
    rows: 1:2
    cols: id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col

julia> rec, state = iterate(rc);

julia> rec
NamedTuple{(:id, :bool_col, :tinyint_col, :smallint_col, :int_col, :bigint_col, :float_col, :double_col, :date_string_col, :string_col, :timestamp_col),Tuple{Union{Missing, Int32},Union{Missing, Bool},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int64},Union{Missing, Float32},Union{Missing, Float64},Union{Missing, Array{UInt8,1}},Union{Missing, Array{UInt8,1}},Union{Missing, Int128}}}((6, true, 0, 0, 0, 0, 0.0f0, 0.0, UInt8[0x30, 0x34, 0x2f, 0x30, 0x31, 0x2f, 0x30, 0x39], UInt8[0x30], 45285336301663273581805568))

julia> rec.id
6

julia> rec.bool_col
true

julia> rec.tinyint_col
0

julia> colnames(par)
11-element Array{Array{String,1},1}:
 ["id"]
 ["bool_col"]
 ["tinyint_col"]
 ["smallint_col"]
 ["int_col"]
 ["bigint_col"]
 ["float_col"]
 ["double_col"]
 ["date_string_col"]
 ["string_col"]
 ["timestamp_col"]
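
String columns come back as raw byte vectors in the session above. As a small plain-Julia sketch (not an API added by this PR), such a value can be decoded by hand; the bytes below are copied from the `date_string_col` field of the printed records.

```julia
# Bytes copied from the `date_string_col` field printed above.
raw = UInt8[0x30, 0x34, 0x2f, 0x30, 0x31, 0x2f, 0x30, 0x39]

# `String(::Vector{UInt8})` takes ownership of its argument and empties it,
# so decode a copy to keep `raw` intact.
decoded = String(copy(raw))
println(decoded)  # 04/01/09
```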

I will update the README and examples, run some more tests, and possibly do some refactoring and cleanup.

This was referenced May 16, 2020
tanmaykm added 3 commits May 17, 2020 06:30
The `ParFile` reader now accepts an optional `map_logical_types`.

ParFile(path; map_logical_types) => ParFile

`map_logical_types` can be one of:

- `false`: no mapping is done (default)
- `true`: default mappings are attempted on all columns (bytearray => String, int96 => DateTime)
- a user-supplied dict mapping column names to a tuple of a type and a converter function
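
As a rough sketch of the dict shape described in the last option, the following plain-Julia snippet builds such a mapping and applies it to a single value. It assumes column names are given as string vectors (as `colnames` returns them); the `apply_mapping` helper is hypothetical and only mimics the idea, it is not how Parquet.jl applies the mapping internally.

```julia
# Shape described above: column path => (target type, converter function).
# The column name comes from the example file; the converter is an ordinary
# Julia closure chosen for illustration.
mapping = Dict(
    ["date_string_col"] => (String, v -> String(copy(v))),
)

# Hypothetical helper: convert one column value if a mapping exists for its path.
function apply_mapping(mapping, path::Vector{String}, value)
    haskey(mapping, path) || return value
    T, convert_fn = mapping[path]
    return convert_fn(value)::T   # check the converter produced the declared type
end

raw = UInt8[0x30, 0x34, 0x2f, 0x30, 0x31, 0x2f, 0x30, 0x39]
println(apply_mapping(mapping, ["date_string_col"], raw))  # 04/01/09
println(apply_mapping(mapping, ["id"], Int32(6)))          # 6
```

Under the same assumption, a dict like this would be passed as `ParFile(path; map_logical_types=mapping)`.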
@tanmaykm tanmaykm changed the title WIP: named tuple iterator, fixes for nested structures and column name handling named tuple iterator, fixes for nested structures and column name handling May 17, 2020
tanmaykm added a commit to tanmaykm/ParquetFiles.jl that referenced this pull request May 17, 2020
This [Parquet.jl update](JuliaIO/Parquet.jl#71) will add a named tuple iterator `RecordCursor`. We will be able to use that here directly, instead of wrapping over the older `RecCursor`.

The new `map_logical_types` option to `ParFile` automatically converts byte arrays to strings, so we do not need to handle that here now.
@tanmaykm
Member Author

I have also put up a PR for the corresponding changes needed to ParquetFiles.jl: queryverse/ParquetFiles.jl#25

@tanmaykm tanmaykm merged commit 62d3219 into master May 18, 2020
@tanmaykm tanmaykm mentioned this pull request May 21, 2020