named tuple iterator, fixes for nested structures and column name handling by tanmaykm · Pull Request #71 · JuliaIO/Parquet.jl

tanmaykm · 2020-05-15T14:48:08Z

Removed iterators that return Julia structs
Removed schema creators for Julia structs, and those for Protobuf and Thrift
Added an iterator RecordCursor that gives out named tuples for records. Did not reuse the old name RecCursor to avoid confusion.
Fixed handling of nested structures in schema. Nested structures will appear as nested named tuples.
Used string vectors instead of a delimiter character to represent fully qualified column names. Since arbitrary names are now supported, no single delimiter character can be reserved. The column path matters in Parquet because a schema can contain nested structures, and the same name can appear at different paths. The safest way is to represent the fully qualified path as a vector of path elements. Operations that use the iterator are unaffected by this change.
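To illustrate the last two points, here is a minimal sketch in plain Julia (it does not use Parquet.jl). It shows how a delimiter-joined name can become ambiguous while a vector of path elements does not, and how such a path can be followed through nested named tuples. The column names, the record, and the `lookup` helper are all hypothetical and exist only for this example.

```julia
# Two distinct column paths; the names are made up for illustration.
path_one = ["a", "b.c"]     # field "b.c" inside group "a"
path_two = ["a.b", "c"]     # field "c" inside group "a.b"

# Joined with a '.' delimiter, both collapse to the same string.
println(join(path_one, '.') == join(path_two, '.'))   # true

# Kept as vectors of path elements, they remain distinct.
println(path_one == path_two)                         # false

# A nested record modelled as nested named tuples (hypothetical fields).
rec = (id = 1, location = (city = "Pune", geo = (lat = 18.5, lon = 73.9)))

# Hypothetical helper: follow a column path down through nested named tuples.
lookup(rec, path) = foldl((nt, name) -> getproperty(nt, Symbol(name)), path; init = rec)

println(lookup(rec, ["location", "geo", "lat"]))      # 18.5
```

Because vectors compare and hash by content, a path such as `["location", "geo", "lat"]` can also serve directly as a dictionary key.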

Ref: #51, should now pass all the failure cases listed there.

cc: @davidanthoff does this look fine?

Some example interactions:

julia> using Parquet

julia> par = ParFile("booltest/alltypes_plain.snappy.parquet")
Parquet file: booltest/alltypes_plain.snappy.parquet
    version: 1
    nrows: 2
    created by: impala version 1.3.0-INTERNAL (build 8a48ddb1eff84592b3fc06bc6f51ec120e1fffc9)
    cached: 0 column chunks

julia> for rec in RecordCursor(par)
           println(rec)
       end
NamedTuple{(:id, :bool_col, :tinyint_col, :smallint_col, :int_col, :bigint_col, :float_col, :double_col, :date_string_col, :string_col, :timestamp_col),Tuple{Union{Missing, Int32},Union{Missing, Bool},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int64},Union{Missing, Float32},Union{Missing, Float64},Union{Missing, Array{UInt8,1}},Union{Missing, Array{UInt8,1}},Union{Missing, Int128}}}((6, true, 0, 0, 0, 0, 0.0f0, 0.0, UInt8[0x30, 0x34, 0x2f, 0x30, 0x31, 0x2f, 0x30, 0x39], UInt8[0x30], 45285336301663273581805568))
NamedTuple{(:id, :bool_col, :tinyint_col, :smallint_col, :int_col, :bigint_col, :float_col, :double_col, :date_string_col, :string_col, :timestamp_col),Tuple{Union{Missing, Int32},Union{Missing, Bool},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int64},Union{Missing, Float32},Union{Missing, Float64},Union{Missing, Array{UInt8,1}},Union{Missing, Array{UInt8,1}},Union{Missing, Int128}}}((7, false, 1, 1, 1, 10, 1.1f0, 10.1, UInt8[0x30, 0x34, 0x2f, 0x30, 0x31, 0x2f, 0x30, 0x39], UInt8[0x31], 45285336301663333581805568))

julia> rc = RecordCursor(par)
Record Cursor on booltest/alltypes_plain.snappy.parquet
    rows: 1:2
    cols: id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col

julia> rec, state = iterate(rc);

julia> rec
NamedTuple{(:id, :bool_col, :tinyint_col, :smallint_col, :int_col, :bigint_col, :float_col, :double_col, :date_string_col, :string_col, :timestamp_col),Tuple{Union{Missing, Int32},Union{Missing, Bool},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int32},Union{Missing, Int64},Union{Missing, Float32},Union{Missing, Float64},Union{Missing, Array{UInt8,1}},Union{Missing, Array{UInt8,1}},Union{Missing, Int128}}}((6, true, 0, 0, 0, 0, 0.0f0, 0.0, UInt8[0x30, 0x34, 0x2f, 0x30, 0x31, 0x2f, 0x30, 0x39], UInt8[0x30], 45285336301663273581805568))

julia> rec.id
6

julia> rec.bool_col
true

julia> rec.tinyint_col
0

julia> colnames(par)
11-element Array{Array{String,1},1}:
 ["id"]
 ["bool_col"]
 ["tinyint_col"]
 ["smallint_col"]
 ["int_col"]
 ["bigint_col"]
 ["float_col"]
 ["double_col"]
 ["date_string_col"]
 ["string_col"]
 ["timestamp_col"]
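The `iterate` call in the session above follows Julia's standard iteration protocol. As a rough sketch of how a cursor can hand out named tuples and report `eltype` and `length`, here is a toy iterator over in-memory columns. `ToyCursor` and its `columns` field are hypothetical names used for illustration; this is not how `RecordCursor` is implemented.

```julia
# Hypothetical toy cursor over in-memory columns; not Parquet.jl's RecordCursor.
struct ToyCursor{T<:NamedTuple}
    columns::T    # named tuple of equal-length column vectors
end

# length and eltype let generic code such as `collect` build a typed result.
Base.length(c::ToyCursor) = length(first(c.columns))
Base.eltype(::Type{ToyCursor{T}}) where {T} =
    NamedTuple{fieldnames(T), Tuple{map(eltype, fieldtypes(T))...}}

# Each step returns (record, next_state), or `nothing` when the rows run out.
function Base.iterate(c::ToyCursor, row::Int = 1)
    row > length(c) && return nothing
    rec = map(col -> col[row], c.columns)   # map over a named tuple keeps the names
    return rec, row + 1
end

cursor = ToyCursor((id = Int32[6, 7], bool_col = [true, false]))

rec, state = iterate(cursor)
println(rec.id)                        # 6
println(length(collect(cursor)))       # 2
println(collect(cursor)[2].bool_col)   # false
```

Defining `eltype` and `length` is what allows `collect` and similar generic functions to work on the cursor without further glue code.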

I will update the README and examples. I will also run some more tests and possibly do some refactoring and cleanup.

The `ParFile` reader now accepts an optional `map_logical_types`.

`ParFile(path; map_logical_types) => ParFile`

`map_logical_types` can be one of:
- `false`: no mapping is done (default)
- `true`: default mappings are attempted on all columns (bytearray => String, int96 => DateTime)
- A user supplied dict mapping column names to a tuple of type and a converter function
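The shape of a user supplied mapping can be sketched in plain Julia, without Parquet.jl. The example below assumes that keys are column paths (vectors of strings) and that values are `(type, converter)` tuples. The `apply_mapping` helper is hypothetical and only mimics the idea; it is not the library's implementation.

```julia
# Hypothetical mapping in the shape described above:
# column path => (target type, converter function).
mapping = Dict(["date_string_col"] => (String, bytes -> String(copy(bytes))))

# Hypothetical helper: convert a single column value if its path has a mapping.
function apply_mapping(mapping, colpath::Vector{String}, value)
    ismissing(value) && return missing          # missing values pass through
    haskey(mapping, colpath) || return value    # unmapped columns are unchanged
    T, converter = mapping[colpath]
    return convert(T, converter(value))
end

raw = UInt8[0x30, 0x34, 0x2f, 0x30, 0x31, 0x2f, 0x30, 0x39]
println(apply_mapping(mapping, ["date_string_col"], raw))   # 04/01/09
println(apply_mapping(mapping, ["id"], Int32(6)))           # 6
```

The converter copies the bytes before building the string because `String(::Vector{UInt8})` takes ownership of its argument and leaves the original vector empty.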

This [Parquet.jl update](JuliaIO/Parquet.jl#71) will add a named tuple iterator `RecordCursor`. We will be able to use that here directly, instead of wrapping the older `RecCursor`. The new `map_logical_types` option to `ParFile` automatically converts byte arrays to strings, so we no longer need to handle that here.

tanmaykm · 2020-05-17T05:21:36Z

I have also put up a PR for the corresponding changes needed to ParquetFiles.jl: queryverse/ParquetFiles.jl#25

tanmaykm added 6 commits May 15, 2020 19:29

purge protobuf and thrift schema

d39d88d

purge protobuf and thrift conversion of parquet schemas in preparation of moving to named tuples representation.

fix julia schema and RecCursor for nested data

eef5e74

add tests for nested data

6964d5e

purge JuliaSchema, give out NamedTuple records

e5027e2

use string vectors instead of delim for col paths

35152a1

parameterize cursor methods

42723d7

This was referenced May 16, 2020

Fix Reader #61

Closed

Better colnames #69

Closed

tanmaykm added 3 commits May 17, 2020 06:30

provide eltype and length for RecordCursor

2c241f5

update README

267ff39

tanmaykm changed the title ~~WIP: named tuple iterator, fixes for nested structures and column name handling~~ named tuple iterator, fixes for nested structures and column name handling May 17, 2020

tanmaykm mentioned this pull request May 17, 2020

use Parquet.jl named tuples iterator: RecordCursor queryverse/ParquetFiles.jl#25

Open

tanmaykm merged commit 62d3219 into master May 18, 2020

tanmaykm mentioned this pull request May 21, 2020

Fix fieldnames #51

Closed

tanmaykm merged 9 commits into master from tan/namedtuples
