go/writer: improve parquet write performance by jacobmarble · Pull Request #4471 · estuary/connectors

jacobmarble · 2026-05-14T15:49:56Z

Description:

Improves parquet write performance, as measured by a new benchmark (geomean, baseline vs. this branch):

−15% CPU time (sec/op)
−41% memory allocated (B/op)
−75% allocations (allocs/op)

Two-part refactor of ParquetWriter's buffer-to-sink pipeline:

Column-oriented buffer (b55c7158d): changes the write buffer from [][]any to []any of typed slice pointers (*[]int64, *[]parquet.ByteArray, etc.) with a parallel []int16 definition-level slice. Values are converted to their Parquet physical types at Write time instead of during the flush loop, eliminating the per-flush transposition pass and reducing GC scanning pressure.
Page-copy (cf5e9e76d): transferColumnValues previously decoded every column value from the scratch file and re-encoded it into the sink. Now the scratch writer uses the same codec as the sink and disables dictionary encoding, so compressed data pages are copied byte-for-byte through a new mergeWriter type that accepts pre-encoded pages and tracks row counts explicitly.
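
To make the column-oriented layout concrete, here is a minimal sketch. It is not the PR's code: the names columnBuffer and appendInt64 are hypothetical, and it assumes a flat optional column where definition level 1 means "present" and 0 means "null". Only the idea of a typed value slice paired with a parallel []int16 definition-level slice comes from the description above.

```go
package main

import "fmt"

// columnBuffer is a hypothetical stand-in for one buffered column:
// a typed value slice plus a parallel definition-level slice.
type columnBuffer struct {
	values    []int64 // non-null values only, already in the physical type
	defLevels []int16 // one entry per row: 1 = present, 0 = null
}

// appendInt64 records one row at write time; a nil pointer marks a null row.
func (c *columnBuffer) appendInt64(v *int64) {
	if v == nil {
		c.defLevels = append(c.defLevels, 0)
		return
	}
	c.values = append(c.values, *v)
	c.defLevels = append(c.defLevels, 1)
}

// rows reports how many rows have been buffered, nulls included.
func (c *columnBuffer) rows() int { return len(c.defLevels) }

func main() {
	var col columnBuffer
	a, b := int64(7), int64(9)
	col.appendInt64(&a)
	col.appendInt64(nil)
	col.appendInt64(&b)
	fmt.Println(col.rows(), col.values, col.defLevels) // 3 [7 9] [1 0 1]
}
```

Because each column is already a contiguous typed slice, a flush can hand it to a column writer directly instead of first transposing rows of any values into columns.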

Benchmark results (A = baseline, B = column-oriented buffer, C = page-copy):

Step 1 (A→B): CPU +7%, memory −25%, allocations roughly unchanged.
Step 2 (B→C): CPU −9% to −25%, memory −6% to −33%, allocations −75%.

                                   │      A (sec/op)      │          C (sec/op)          │
                                   │        sec/op        │   sec/op     vs base         │
ParquetWriter/small/uncompressed-8            699.4m ± 2%   559.9m ± 2%  -19.95% (p=0.000 n=8+7)
ParquetWriter/small/snappy-8                  718.6m ± 3%   636.6m ± 6%  -11.41% (p=0.000 n=8+7)
ParquetWriter/large/uncompressed-8             3.271 ± 2%    2.653 ± 2%  -18.92% (p=0.000 n=8+7)
ParquetWriter/large/snappy-8                   3.318 ± 4%    3.036 ± 2%   -8.48% (p=0.000 n=8+7)
geomean                                        1.528         1.302       -14.83%

                                   │     A (B/op)     │          C (B/op)           │
                                   │      B/op        │    B/op      vs base        │
ParquetWriter/small/uncompressed-8          638.3Mi ± 0%   333.0Mi ± 0%  -47.82% (p=0.000 n=8+7)
ParquetWriter/small/snappy-8               651.8Mi ± 0%   444.2Mi ± 0%  -31.85% (p=0.000 n=8+7)
ParquetWriter/large/uncompressed-8         2.945Gi ± 0%   1.464Gi ± 0%  -50.29% (p=0.000 n=8+7)
ParquetWriter/large/snappy-8               2.967Gi ± 0%   1.990Gi ± 0%  -32.94% (p=0.000 n=8+7)
geomean                                    1.364Gi        819.8Mi       -41.32%

                                   │  A (allocs/op)  │       C (allocs/op)       │
                                   │   allocs/op     │ allocs/op    vs base      │
ParquetWriter/small/uncompressed-8       4.013M ± 0%   1.020M ± 0%  -74.58% (p=0.000 n=8+7)
ParquetWriter/small/snappy-8             4.013M ± 0%   1.021M ± 0%  -74.56% (p=0.000 n=8+7)
ParquetWriter/large/uncompressed-8      19.997M ± 0%   5.046M ± 0%  -74.77% (n=8+7)
ParquetWriter/large/snappy-8            19.997M ± 0%   5.048M ± 0%  -74.75% (p=0.000 n=8+7)
geomean                                  8.958M        2.269M       -74.67%

Net: −15% CPU, −41% memory, −75% allocations (all p=0.000).
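
For readers who want to check the geomean rows, here is a small sketch that recomputes the sec/op geomean from the four per-benchmark values in the table above. The geomean helper is written for this illustration and is not part of the PR or of benchstat; since the inputs are the rounded table values, the result only matches the reported figures approximately.

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of xs; all inputs must be positive.
func geomean(xs ...float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += math.Log(x)
	}
	return math.Exp(sum / float64(len(xs)))
}

func main() {
	a := geomean(0.6994, 0.7186, 3.271, 3.318) // A, sec/op
	c := geomean(0.5599, 0.6366, 2.653, 3.036) // C, sec/op
	fmt.Printf("A=%.3f C=%.3f change=%.1f%%\n", a, c, (c/a-1)*100)
}
```

The computed values come out near 1.528 and 1.302, a change of roughly −14.8%, consistent with the geomean row of the sec/op table.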

Workflow steps:

No user-visible change.

Documentation links affected:

None.

Notes for reviewers:

makeColumnBuffer stores a pointer-to-slice in []any to avoid boxing a 3-word slice header on every append.
appendVal[T parquetValue] is a 5-line generic the compiler inlines at each call site, allowing devirtualization of the getValFn[T] parameter.
mergeWriter (merge_writer.go) is the new sink type; it accepts WriteDataPage calls and tracks row counts via SetNumRows.
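
The first two notes can be illustrated with a minimal sketch. This is not the PR's implementation: the parquetValue constraint shown here, the get parameter, and the shape of appendVal are assumptions made for illustration. The point is that an interface holding a *[]T stays valid across appends, whereas an interface holding a []T would need the grown slice header stored back into it each time.

```go
package main

import "fmt"

// parquetValue is a hypothetical constraint; the PR's real one may differ.
type parquetValue interface {
	~int32 | ~int64 | ~float32 | ~float64 | ~bool
}

// appendVal appends one converted value to the typed slice behind col.
// col holds a *[]T, so the append mutates the slice in place and nothing
// has to be re-boxed into the interface afterwards.
func appendVal[T parquetValue](col any, raw any, get func(any) T) {
	p := col.(*[]T)
	*p = append(*p, get(raw))
}

func main() {
	// One pointer-to-slice per column, stored in a []any.
	cols := []any{new([]int64), new([]float64)}

	appendVal(cols[0], 42, func(v any) int64 { return int64(v.(int)) })
	appendVal(cols[1], 1.5, func(v any) float64 { return v.(float64) })

	fmt.Println(*cols[0].(*[]int64), *cols[1].(*[]float64)) // [42] [1.5]
}
```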

mdibaiee

Amazing results!!

A few small comments, otherwise LGTM

One note: I think we can now remove WithDisableDictionaryEncoding here:

connectors/go/writer/parquet.go

Lines 171 to 175 in 8da94b3

    func WithDisableDictionaryEncoding() ParquetOption {
    	return func(cfg *parquetConfig) {
    		cfg.disableDictionaryEncoding = true
    	}
    }

mdibaiee · 2026-05-15T07:53:23Z

+}
+
+// Close finalizes any open row group, writes the parquet footer (file metadata, footer length,
+// and trailing magic), and marks the writer closed. The underlying sink is not itself closed.


Why is the underlying sink not closed?

mdibaiee · 2026-05-15T07:57:19Z

+}
+
+// SetNumRows records the row count for this row group; required before Close. The same value
+// must also describe every column written into the row group.


Not sure what this means: "The same value must also describe every column written into the row group".

mdibaiee · 2026-05-15T08:04:03Z

+	parent     *mergeWriter
+	metadata   *metadata.RowGroupMetaDataBuilder
+	ordinal    int16
+	nextCol    int


nit: The name of this property threw me off a bit... if I understand correctly this is essentially colOrdinal of the column being written by NextColumn

also maybe we can store it as int16 since it seems to be used as such

mdibaiee · 2026-05-15T08:12:27Z

+	// at construction time will fail again immediately on the first real write.
+	sinkWriter, err := newMergeWriter(cwc, schemaRoot, props, kvmeta)
+	if err != nil {
+		panic(fmt.Sprintf("creating sink writer: %s", err))


Would users get better error presentation if we bubbled up the error, or do we need the panic stack trace?

mdibaiee · 2026-05-15T08:26:44Z

+		compressed := cw.pageWriter.Compress(&buf, page.Data())
+		compressedData := make([]byte, len(compressed))
+		copy(compressedData, compressed)
+		newBuf := memory.NewBufferBytes(compressedData)


Do you mind adding a comment as to why the copy is necessary?
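
One plausible reason for such a copy, stated here as an assumption rather than as the author's answer: if Compress returns a slice that aliases the reused buf, the next call would overwrite data that is still referenced. The sketch below demonstrates that hazard with a hypothetical fakeCompress helper; it does not use the real page writer API.

```go
package main

import "fmt"

// fakeCompress is a hypothetical stand-in for a compressor that writes into a
// caller-supplied scratch buffer and returns a slice aliasing that buffer.
func fakeCompress(scratch *[]byte, src []byte) []byte {
	*scratch = append((*scratch)[:0], src...)
	return *scratch
}

func main() {
	var scratch []byte

	first := fakeCompress(&scratch, []byte("page-1"))
	kept := make([]byte, len(first))
	copy(kept, first) // detach from scratch before it is reused

	// The second call reuses the same backing array and overwrites it.
	fakeCompress(&scratch, []byte("PAGE-2"))

	fmt.Printf("aliased=%s copied=%s\n", first, kept) // aliased=PAGE-2 copied=page-1
}
```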

mdibaiee · 2026-05-15T10:30:08Z

@jacobmarble it seems like tests are timing out on this branch while they work on main... perhaps something is hanging with the new implementation?

dyaffe · 2026-05-15T14:17:05Z

Just commenting directly here as per the Slack thread. Let's hold off on this change until after we switch to the Snowpipe streaming SDK.

jacobmarble added 2 commits May 14, 2026 15:47

go/writer.ParquetWriter: add benchmark

d147c38

go/writer: column-oriented parquet write buffer

b55c715

jacobmarble force-pushed the jgm-parquet-columns branch from 1050885 to cf5e9e7 Compare May 14, 2026 17:53

jacobmarble marked this pull request as ready for review May 14, 2026 18:24

jacobmarble marked this pull request as draft May 14, 2026 18:26

go/writer: copy compressed pages verbatim from scratch to sink

3e7d697

jacobmarble force-pushed the jgm-parquet-columns branch from d76c2d0 to 3e7d697 Compare May 14, 2026 19:03

jacobmarble changed the title ~~go/writer: column-oriented parquet write buffer~~ go/writer: improve parquet write performance May 14, 2026

jacobmarble marked this pull request as ready for review May 14, 2026 19:17

jacobmarble requested a review from a team May 14, 2026 19:17

jacobmarble added 3 commits May 14, 2026 19:28

Merge remote-tracking branch 'origin/main' into jgm-parquet-columns

4479e0c

resolve merge conflicts

31ddde6

small readability improvement

ceea1b9

mdibaiee reviewed May 15, 2026

jacobmarble mentioned this pull request May 15, 2026

go/writer: column-oriented parquet write buffer #4485

Draft

jacobmarble marked this pull request as draft May 15, 2026 18:47

go/writer: improve parquet write performance #4471
jacobmarble wants to merge 6 commits into main from jgm-parquet-columns