feat(import): bulk file imports for csv / lp / parquet / tle (PR4)#3
Conversation
PR4 of the phased arcctl build-out. Adds `arcctl import` with four
subcommands targeting Arc's admin-only /api/v1/import/* endpoints.
Wire surface (verified against arc/internal/api/import.go)
- POST /api/v1/import/csv (adminAuth) multipart file + query opts
- POST /api/v1/import/parquet (adminAuth) multipart file + query opts
- POST /api/v1/import/lp (adminAuth) multipart file + query opts;
server auto-detects gzip via magic bytes,
caps at 500 MB decompressed
- POST /api/v1/import/tle (adminAuth) multipart file; measurement
override via x-arc-measurement HEADER
(not query param — quirky server design)
Highlights
- io.Pipe + goroutine streaming so multi-GB files never buffer in
memory; tested with a 5 MiB synthetic payload that round-tripped
byte-for-byte through httptest
- TLE's x-arc-measurement header is suppressed when --measurement
is omitted, so the server's "satellite_tle" default applies
- LP --measurement acts as a filter (not destination) — LP lines
self-declare measurement names in syntax
- LP --precision validated client-side; empty value means
"let server apply default = ns"
- All four render to table (default) or json; csv/arrow rejected
because the result is a one-shot summary, not tabular data
- JSON output normalises nil slices to `[]` (ImportResult.Columns,
LPImportResult.Measurements) so consumers see `[]` instead of
`null` — same fix shape as Gemini's PR2 finding (PR #2 round 2)
- buildClient runs BEFORE any destructive prompt-equivalent
(matrix lesson from PR3 Gemini round 1 — apply consistently)
msgpack scope deviation
- The plan said PR4 = csv+lp+parquet+msgpack. The Arc server has no
/api/v1/import/msgpack endpoint — msgpack is a write path
(/api/v1/write/msgpack) with a specific columnar binary schema,
not a bulk-import format. msgpack-write extension to `arcctl write`
is now a separate follow-up (PR7 in the roadmap)
- TLE replaces it as the 4th import format
Verification
- 17 client tests (httptest): multipart shape, headers, query params,
empty-optional-omission, file-missing, db-missing, server-error
surfacing, TLE header-vs-query, gzip pass-through, 5 MiB streaming
- 8 command tests: render dispatch (table/json), format validation,
nil-slice normalisation regression tests (JSON empty arrays not null),
TLE parse_warnings rendering, empty-warnings suppression
- End-to-end smoke against arc serve on :18003 (delete.enabled=true):
- LP happy + filter + gzip auto-detect + JSON output all verified
- Round-trip: imported LP queries back correctly via `arcctl query`
- 0-byte LP + TLE files surface server "file is empty" verbatim
- Client-side error paths: missing --file, missing --measurement,
invalid --precision, nonexistent file, no database
- Two arc server bugs surfaced incidentally but NOT arcctl issues:
- CSV import: server's getColumns() issues invalid DuckDB SQL
(`SELECT column_name FROM (DESCRIBE read_csv(...))`); arcctl
correctly surfaces the verbatim 422
- TLE checksum validation rejects even Celestrak production TLE
lines; arcctl correctly surfaces the verbatim 400 with warnings
Internal review addressed (matrix + single deep reviewer)
- M2: Content-Type overwrite guard — extraHeaders loop now skips
Content-Type so a future caller can't accidentally lose the
multipart boundary; Content-Type set LAST so neither setCommonHeaders
nor extraHeaders can clobber it
- Style: ValidPrecision guard tightened (ValidPrecision("") already
returns true; outer != "" check was redundant)
- Style: shallow-copy semantics in renderImportResult / renderLP-
ImportResult now documented in comments so future field additions
trigger reconsideration
- Style: error from os.Open no longer double-wrapped ("open /x: open
/x: ..." → just "open /x: ...")
Companion docs PR lands in docs.basekick.net (docs/cli/import.md
+ index.md status table + Next list update).
|
@gemini-code-assist review please |
There was a problem hiding this comment.
Code Review
This pull request implements the import subcommand tree for arcctl, allowing bulk-loading of CSV, Line Protocol (LP), Apache Parquet, and Two-Line Element (TLE) files into an Arc cluster. The client streams files via io.Pipe to ensure low memory usage even with multi-gigabyte files. The changes are well-tested with comprehensive unit tests. The only feedback is to add validation for the --skip-rows flag to ensure it is non-negative, preventing negative values from being silently ignored.
There was a problem hiding this comment.
Code Review
This pull request implements the import subcommand tree (csv, lp, parquet, and tle) to support bulk-loading files into an Arc cluster. The implementation streams files end-to-end using io.Pipe to avoid buffering large files in memory. Feedback on the changes suggests an improvement in the multipart upload streaming logic within internal/client/import.go: on error paths, calling mw.Close() is unnecessary and potentially problematic, so it should be skipped in favor of directly closing the pipe with the error.
) Two findings from Gemini's review on #3, both Medium. #1 — `--skip-rows` accepts negative values silently A negative value passes through to `ImportCSV`, then gets dropped by the `opts.SkipRows > 0` gate in the client. User typed `--skip-rows -5`, nothing happens, no error. Now validated in the command's RunE before any client call. Regression test driven through cobra so the wired flag path is covered end-to-end; mutation-verified by disabling the guard (downstream "no database" check fires instead). #2 — `mw.Close()` on error paths writes trailing boundary into broken pipe In the multipart goroutine, the error paths used to call `mw.Close()` *before* `pw.CloseWithError(err)`. `multipart.Writer.Close()` writes the trailing boundary `--<boundary>--\r\n` into the pipe — but the preceding content is already known-bad (CreateFormFile failed, or io.Copy errored mid-file). Result: a multipart payload that's syntactically valid but semantically truncated, confusing for a reader attempting to make sense of partial data. Now: error paths skip `mw.Close()` entirely and signal the error on the pipe; happy path still closes the writer normally to emit the trailing boundary.
|
@gemini-code-assist review please |
There was a problem hiding this comment.
Code Review
This pull request implements the arcctl import subcommand tree, enabling bulk-importing of CSV, Line Protocol, Parquet, and TLE files via streamed multipart uploads using io.Pipe. It includes comprehensive client-side and CLI-level tests, along with updated documentation in the README. Feedback is provided regarding a potential goroutine and file descriptor leak in uploadMultipart where the pipe reader pr is not deferred closed, which could cause the background writer goroutine to block indefinitely on early exits or HTTP request failures.
… PR #3 r3) Add `defer pr.Close()` immediately after `io.Pipe()` creation so the pipe reader is closed on every return path from `uploadMultipart`, not only the success path. Without this, an early failure in `http.NewRequestWithContext` or `c.http.Do` (DNS failure, TLS handshake failure, ctx cancelled before send, etc.) would leave `pr` open — the writer goroutine then blocks on `pw.Write` forever, leaking the goroutine and the underlying file handle. The redundant explicit `_ = pr.Close()` on the `NewRequest` error path becomes unnecessary (the defer covers it) and is removed. Empirical note: Go 1.25's net/http transport does close `req.Body` on the connection-refused early-error path, so the leak isn't reproducible today by simply refusing TCP — `delta=0` either way in the test below. The fix is defense-in-depth against future Go changes and against other early-failure paths I haven't enumerated. The regression test stays as a guard: if any future early-error path starts leaking, the linear growth across iterations would surface here. Gemini's review on #3 (commit b2ad849) raised this as High priority. Verified by tracing io.Pipe + http.Client interaction; confirmed via a standalone reproducer that today's behavior happens to be safe but the contract isn't guaranteed.
|
@gemini-code-assist review please |
There was a problem hiding this comment.
Code Review
This pull request implements the import subcommand tree for arcctl, enabling bulk-loading of CSV, Line Protocol (LP), Parquet, and TLE files into an Arc cluster. The implementation includes streaming multipart uploads via io.Pipe to prevent high memory usage, comprehensive unit tests, and updated documentation. Feedback on the changes suggests adding client-side validation for required parameters (such as measurement and filePath) to fail fast, as well as applying defensive programming by checking for nil pointers before dereferencing result structs or uninitialized clients.
Reject negative `CSVImportOptions.SkipRows` at the client boundary. Previously the client silently dropped negative values via the `opts.SkipRows > 0` gate — the cobra layer rejects them at RunE today (b2ad849), but any future caller that bypasses the cobra path (a programmatic test helper, a follow-up command) would re-introduce the silent-drop bug. Making the contract explicit at the client boundary closes that latent gap. Of the 7 Medium findings Gemini raised on commit e377a0d, this is the one with a real defense-in-depth case. The other 6 (duplicate `measurement == ""` / `filePath == ""` checks already enforced at the cobra layer; `cli == nil` and result-pointer nil-checks where the caller has already verified `err == nil`) are declined as per CLAUDE.md guidance to trust internal-package guarantees and not add validation for scenarios that can't happen — see the PR thread reply for the detailed reasoning.
|
Thanks for the thorough re-review. Round 4 raised 7 Medium findings; addressing them in two groups. Accepted (1)Client-layer Declined (6) — with reasoningThe remaining six findings are defensive-programming suggestions that I don't think pull their weight. Per
Each declined finding falls into one of two patterns: Pattern A: duplicate validation that's already enforced at the cobra layer
These would be load-bearing if Pattern B: nil-checks on pointers that the caller has already validated
Adding these checks would make every line into "what if someone calls me wrong?" boilerplate, which obscures the real logic without catching any bug that can happen in this codebase. If a future refactor changes the contract (e.g. one of the client methods starts returning NetRound 4 was useful for surfacing the latent silent-drop in |
|
@gemini-code-assist review please |
There was a problem hiding this comment.
Code Review
This pull request introduces bulk-import capabilities (arcctl import) for CSV, Line Protocol, Parquet, and TLE formats, streaming files efficiently via io.Pipe. Feedback focuses on improving parameter validation: adding client-side checks in the client library to fail fast, and leveraging Cobra's built-in MarkFlagRequired feature in the CLI layer to replace manual flag validation with more idiomatic code.
…i PR #3 r5) Round 5 brought 8 truly new findings on commit d1c50a7. Accepting all. Cobra MarkFlagRequired refactor - addImportCommonFlags now marks --file as required via c.MarkFlagRequired("file") — covers all 4 subcommands - newImportCSVCmd + newImportParquetCmd mark --measurement required - newImportLPCmd does NOT mark --measurement (it's an optional filter for LP since lines self-declare measurement names) - newImportTLECmd does NOT mark --measurement (server defaults to "satellite_tle" when omitted) - Manual `if common.filePath == ""` and `if measurement == ""` checks removed from all 4 RunE blocks — cobra now enforces before RunE with the canonical `Error: required flag(s) "X" not set` format Client-boundary validations - ImportCSV now rejects measurement=="", non-single-rune Delimiter, and TimeFormat values outside the {epoch_s,epoch_ms,epoch_us, epoch_ns,""} enum - ImportParquet now rejects measurement=="" - Each validation fires before any file open or network call Why these are different from round-4 declines - Round-4 declined nil-pointer checks on internal-only callers (measurement/filePath/cli/result-pointer) where the caller had already validated. Those declines stand. - Round-5 client-layer validations are paired with REMOVAL of the cobra-layer duplicates — the validation moves from cobra to the client, it isn't added on both sides. Net diff is -10 LOC of defensive boilerplate. - Round-5 also added genuinely-new constraints (Delimiter must be one rune, TimeFormat must be in the enum). These catch typos like `--time-format epoch_milliseconds` client-side instead of leaving the operator to decipher a server-side parse error. Tests added - TestImportCSV_RejectsEmptyMeasurement - TestImportCSV_RejectsMultiCharDelimiter - TestImportCSV_AcceptsSingleRuneDelimiter (covers multi-byte UTF-8) - TestImportCSV_RejectsInvalidTimeFormat - TestImportCSV_AcceptsEmptyTimeFormat - TestImportParquet_RejectsEmptyMeasurement - TestImportCSV_MarkFlagRequired_File - TestImportCSV_MarkFlagRequired_Measurement - TestImportLP_MarkFlagRequired_File (also asserts --measurement is NOT required for LP) End-to-end smoke - arcctl import csv (no flags) → 'Error: required flag(s) "file", "measurement" not set' (clean cobra format) - arcctl import csv --file x.csv (no measurement) → 'Error: required flag(s) "measurement" not set' - arcctl import lp (no flags) → 'Error: required flag(s) "file" not set' (no --measurement complaint, correct for LP) - arcctl import tle (no flags) → same single-flag error
Summary
msgpack scope deviation
The plan listed `msgpack` as the 4th import format. The Arc server has no `/api/v1/import/msgpack` endpoint — msgpack is a write path (`/api/v1/write/msgpack`) with a typed columnar binary schema, not a bulk-file format. TLE replaces it here. The msgpack-write extension to `arcctl write` is now a separate follow-up (added to README roadmap as PR7).
Test plan
Two Arc server bugs surfaced (not in arcctl)
Both surfaced cleanly through arcctl's error-passthrough; not blocking for this PR.
Internal review
Matrix + single deep reviewer per `.claude/CLAUDE.md`. Findings addressed pre-commit:
Companion docs PR
Per the per-PR docs rule, docs land in docs.basekick.net at `docs/cli/import.md` + `cli/index.md` updates.
🤖 Generated with Claude Code