Skip to content

feat(import): bulk file imports for csv / lp / parquet / tle (PR4)#3

Merged
xe-nvdk merged 5 commits into
mainfrom
feat/pr4-import
Jun 1, 2026
Merged

feat(import): bulk file imports for csv / lp / parquet / tle (PR4)#3
xe-nvdk merged 5 commits into
mainfrom
feat/pr4-import

Conversation

@xe-nvdk

@xe-nvdk xe-nvdk commented Jun 1, 2026

Copy link
Copy Markdown
Member

Summary

  • Adds `arcctl import {csv,lp,parquet,tle}` against Arc's admin-only `/api/v1/import/*` endpoints.
  • New `internal/client/import.go` with `uploadMultipart` streaming primitive (io.Pipe + goroutine; no buffering, even for multi-GB files).
  • Each subcommand exposes the relevant query-param options:
    • csv: `--time-column`, `--time-format` (epoch_{s,ms,us,ns}), `--delimiter`, `--skip-rows`
    • lp: `--precision` (ns|us|ms|s), `--measurement` (server-side filter), gzip auto-detected on the server via magic bytes, 500 MB cap
    • parquet: `--time-column`
    • tle: `--measurement` (server default: `satellite_tle`); sent via the quirky `x-arc-measurement` HEADER rather than query param
  • Output formats: `table` (default) and `json` only; `csv`/`arrow` rejected because the result is a one-shot summary record.
  • JSON output normalises nil slices to `[]` (PR3 Gemini-finding pattern; new regression tests pin it).
  • `buildClient` runs before any client-side validation that could waste user input (lesson from PR3 round 1).

msgpack scope deviation

The plan listed `msgpack` as the 4th import format. The Arc server has no `/api/v1/import/msgpack` endpoint — msgpack is a write path (`/api/v1/write/msgpack`) with a typed columnar binary schema, not a bulk-file format. TLE replaces it here. The msgpack-write extension to `arcctl write` is now a separate follow-up (added to README roadmap as PR7).

Test plan

  • `go test -race -count=1 ./...` — 25 new tests across `internal/client/` and `internal/commands/`
  • `gofmt -l . && go vet ./...` clean
  • End-to-end smoke against `arc serve` (port 18003, isolated SQLite, delete enabled):
    • LP happy path (multi-measurement file → `cpu, mem` both ingested)
    • LP filter (`--measurement cpu` drops `mem`)
    • LP gzip auto-detection (`.lp.gz` → server detects magic bytes, server returns OK)
    • LP `-o json` round-trips through stdlib decoder
    • Imported LP data queries back via `arcctl query` (8 rows across 4 imports)
    • 0-byte LP + TLE: server's `file is empty` 400 surfaces verbatim
    • Error paths: missing `--file`, missing `--measurement` for CSV, invalid `--precision`, nonexistent file, no database (no network call in any client-side case)
  • Multipart streaming verified at 5 MiB (round-trips byte-for-byte through httptest)
  • Content-Type overwrite guard pinned (extraHeaders cannot clobber the multipart boundary)

Two Arc server bugs surfaced (not in arcctl)

  • CSV import: server's `getColumns()` issues invalid DuckDB SQL (`SELECT column_name FROM (DESCRIBE read_csv(...))`). arcctl correctly surfaces the verbatim 422; the bug is in `internal/api/import.go#getColumns` (one-time introspection during ingest; doesn't affect query-path).
  • TLE checksums: server's checksum validator rejects even Celestrak production ISS TLEs. Probably a whitespace-strictness bug.

Both surfaced cleanly through arcctl's error-passthrough; not blocking for this PR.

Internal review

Matrix + single deep reviewer per `.claude/CLAUDE.md`. Findings addressed pre-commit:

  • M2 — Content-Type overwrite guard: `extraHeaders` loop now skips `Content-Type`; Content-Type set LAST in the request-building sequence so neither `setCommonHeaders` nor extraHeaders can clobber the multipart boundary
  • Style — `ValidPrecision` guard tightened (`ValidPrecision("")` already returns true; outer `!= ""` check was redundant)
  • Style — Shallow-copy semantics in `renderImportResult` / `renderLPImportResult` documented; `os.Open` error no longer double-wrapped
  • Smoke gap — added 0-byte file smoke on LP + TLE; verified server's `file is empty` surfaces verbatim
  • No Blockers, no High findings

Companion docs PR

Per the per-PR docs rule, docs land in docs.basekick.net at `docs/cli/import.md` + `cli/index.md` updates.

🤖 Generated with Claude Code

PR4 of the phased arcctl build-out. Adds `arcctl import` with four
subcommands targeting Arc's admin-only /api/v1/import/* endpoints.

Wire surface (verified against arc/internal/api/import.go)
- POST /api/v1/import/csv     (adminAuth) multipart file + query opts
- POST /api/v1/import/parquet (adminAuth) multipart file + query opts
- POST /api/v1/import/lp      (adminAuth) multipart file + query opts;
                              server auto-detects gzip via magic bytes,
                              caps at 500 MB decompressed
- POST /api/v1/import/tle     (adminAuth) multipart file; measurement
                              override via x-arc-measurement HEADER
                              (not query param — quirky server design)

Highlights
- io.Pipe + goroutine streaming so multi-GB files never buffer in
  memory; tested with a 5 MiB synthetic payload that round-tripped
  byte-for-byte through httptest
- TLE's x-arc-measurement header is suppressed when --measurement
  is omitted, so the server's "satellite_tle" default applies
- LP --measurement acts as a filter (not destination) — LP lines
  self-declare measurement names in syntax
- LP --precision validated client-side; empty value means
  "let server apply default = ns"
- All four render to table (default) or json; csv/arrow rejected
  because the result is a one-shot summary, not tabular data
- JSON output normalises nil slices to `[]` (ImportResult.Columns,
  LPImportResult.Measurements) so consumers see `[]` instead of
  `null` — same fix shape as Gemini's PR2 finding (PR #2 round 2)
- buildClient runs BEFORE any destructive prompt-equivalent
  (matrix lesson from PR3 Gemini round 1 — apply consistently)

msgpack scope deviation
- The plan said PR4 = csv+lp+parquet+msgpack. The Arc server has no
  /api/v1/import/msgpack endpoint — msgpack is a write path
  (/api/v1/write/msgpack) with a specific columnar binary schema,
  not a bulk-import format. msgpack-write extension to `arcctl write`
  is now a separate follow-up (PR7 in the roadmap)
- TLE replaces it as the 4th import format

Verification
- 17 client tests (httptest): multipart shape, headers, query params,
  empty-optional-omission, file-missing, db-missing, server-error
  surfacing, TLE header-vs-query, gzip pass-through, 5 MiB streaming
- 8 command tests: render dispatch (table/json), format validation,
  nil-slice normalisation regression tests (JSON empty arrays not null),
  TLE parse_warnings rendering, empty-warnings suppression
- End-to-end smoke against arc serve on :18003 (delete.enabled=true):
  - LP happy + filter + gzip auto-detect + JSON output all verified
  - Round-trip: imported LP queries back correctly via `arcctl query`
  - 0-byte LP + TLE files surface server "file is empty" verbatim
  - Client-side error paths: missing --file, missing --measurement,
    invalid --precision, nonexistent file, no database
- Two arc server bugs surfaced incidentally but NOT arcctl issues:
  - CSV import: server's getColumns() issues invalid DuckDB SQL
    (`SELECT column_name FROM (DESCRIBE read_csv(...))`); arcctl
    correctly surfaces the verbatim 422
  - TLE checksum validation rejects even Celestrak production TLE
    lines; arcctl correctly surfaces the verbatim 400 with warnings

Internal review addressed (matrix + single deep reviewer)
- M2: Content-Type overwrite guard — extraHeaders loop now skips
  Content-Type so a future caller can't accidentally lose the
  multipart boundary; Content-Type set LAST so neither setCommonHeaders
  nor extraHeaders can clobber it
- Style: ValidPrecision guard tightened (ValidPrecision("") already
  returns true; outer != "" check was redundant)
- Style: shallow-copy semantics in renderImportResult / renderLP-
  ImportResult now documented in comments so future field additions
  trigger reconsideration
- Style: error from os.Open no longer double-wrapped ("open /x: open
  /x: ..." → just "open /x: ...")

Companion docs PR lands in docs.basekick.net (docs/cli/import.md
+ index.md status table + Next list update).
@xe-nvdk

xe-nvdk commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

@gemini-code-assist review please

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the import subcommand tree for arcctl, allowing bulk-loading of CSV, Line Protocol (LP), Apache Parquet, and Two-Line Element (TLE) files into an Arc cluster. The client streams files via io.Pipe to ensure low memory usage even with multi-gigabyte files. The changes are well-tested with comprehensive unit tests. The only feedback is to add validation for the --skip-rows flag to ensure it is non-negative, preventing negative values from being silently ignored.

Comment thread internal/commands/import.go Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the import subcommand tree (csv, lp, parquet, and tle) to support bulk-loading files into an Arc cluster. The implementation streams files end-to-end using io.Pipe to avoid buffering large files in memory. Feedback on the changes suggests an improvement in the multipart upload streaming logic within internal/client/import.go: on error paths, calling mw.Close() is unnecessary and potentially problematic, so it should be skipped in favor of directly closing the pipe with the error.

Comment thread internal/client/import.go
)

Two findings from Gemini's review on #3,
both Medium.

#1 — `--skip-rows` accepts negative values silently
A negative value passes through to `ImportCSV`, then gets dropped by
the `opts.SkipRows > 0` gate in the client. User typed `--skip-rows -5`,
nothing happens, no error. Now validated in the command's RunE before
any client call. Regression test driven through cobra so the wired flag
path is covered end-to-end; mutation-verified by disabling the guard
(downstream "no database" check fires instead).

#2 — `mw.Close()` on error paths writes trailing boundary into broken pipe
In the multipart goroutine, the error paths used to call `mw.Close()`
*before* `pw.CloseWithError(err)`. `multipart.Writer.Close()` writes
the trailing boundary `--<boundary>--\r\n` into the pipe — but the
preceding content is already known-bad (CreateFormFile failed, or
io.Copy errored mid-file). Result: a multipart payload that's
syntactically valid but semantically truncated, confusing for a reader
attempting to make sense of partial data. Now: error paths skip
`mw.Close()` entirely and signal the error on the pipe; happy path
still closes the writer normally to emit the trailing boundary.
@xe-nvdk

xe-nvdk commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

@gemini-code-assist review please

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the arcctl import subcommand tree, enabling bulk-importing of CSV, Line Protocol, Parquet, and TLE files via streamed multipart uploads using io.Pipe. It includes comprehensive client-side and CLI-level tests, along with updated documentation in the README. Feedback is provided regarding a potential goroutine and file descriptor leak in uploadMultipart where the pipe reader pr is not deferred closed, which could cause the background writer goroutine to block indefinitely on early exits or HTTP request failures.

Comment thread internal/client/import.go
… PR #3 r3)

Add `defer pr.Close()` immediately after `io.Pipe()` creation so the
pipe reader is closed on every return path from `uploadMultipart`, not
only the success path. Without this, an early failure in
`http.NewRequestWithContext` or `c.http.Do` (DNS failure, TLS
handshake failure, ctx cancelled before send, etc.) would leave `pr`
open — the writer goroutine then blocks on `pw.Write` forever, leaking
the goroutine and the underlying file handle.

The redundant explicit `_ = pr.Close()` on the `NewRequest` error
path becomes unnecessary (the defer covers it) and is removed.

Empirical note: Go 1.25's net/http transport does close `req.Body` on
the connection-refused early-error path, so the leak isn't reproducible
today by simply refusing TCP — `delta=0` either way in the test below.
The fix is defense-in-depth against future Go changes and against other
early-failure paths I haven't enumerated. The regression test stays as
a guard: if any future early-error path starts leaking, the linear
growth across iterations would surface here.

Gemini's review on #3
(commit b2ad849) raised this as High priority. Verified by tracing
io.Pipe + http.Client interaction; confirmed via a standalone
reproducer that today's behavior happens to be safe but the contract
isn't guaranteed.
@xe-nvdk

xe-nvdk commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

@gemini-code-assist review please

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the import subcommand tree for arcctl, enabling bulk-loading of CSV, Line Protocol (LP), Parquet, and TLE files into an Arc cluster. The implementation includes streaming multipart uploads via io.Pipe to prevent high memory usage, comprehensive unit tests, and updated documentation. Feedback on the changes suggests adding client-side validation for required parameters (such as measurement and filePath) to fail fast, as well as applying defensive programming by checking for nil pointers before dereferencing result structs or uninitialized clients.

Comment thread internal/client/import.go
Comment thread internal/client/import.go
Comment thread internal/client/import.go
Comment thread internal/commands/import.go
Comment thread internal/commands/import.go
Comment thread internal/commands/import.go
Comment thread internal/commands/import.go
Reject negative `CSVImportOptions.SkipRows` at the client boundary.
Previously the client silently dropped negative values via the
`opts.SkipRows > 0` gate — the cobra layer rejects them at RunE today
(b2ad849), but any future caller that bypasses the cobra path (a
programmatic test helper, a follow-up command) would re-introduce the
silent-drop bug. Making the contract explicit at the client boundary
closes that latent gap.

Of the 7 Medium findings Gemini raised on commit e377a0d, this is the
one with a real defense-in-depth case. The other 6 (duplicate
`measurement == ""` / `filePath == ""` checks already enforced at the
cobra layer; `cli == nil` and result-pointer nil-checks where the
caller has already verified `err == nil`) are declined as per CLAUDE.md
guidance to trust internal-package guarantees and not add validation
for scenarios that can't happen — see the PR thread reply for the
detailed reasoning.
@xe-nvdk

xe-nvdk commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

Thanks for the thorough re-review. Round 4 raised 7 Medium findings; addressing them in two groups.

Accepted (1)

Client-layer SkipRows >= 0 validation in ImportCSV (d1c50a7). The cobra layer already rejects negative --skip-rows at RunE since b2ad849, but the client previously silently dropped negative values via the opts.SkipRows > 0 gate. Making the contract explicit at the client boundary closes a latent gap — any future caller that bypasses cobra (a programmatic test helper, a follow-up subcommand) is now guaranteed to surface the error instead of a silent no-op. Added a regression test that asserts the server is never called when SkipRows is negative.

Declined (6) — with reasoning

The remaining six findings are defensive-programming suggestions that I don't think pull their weight. Per .claude/CLAUDE.md:

Don't add error handling, fallbacks, or validation for scenarios that can't happen. Trust internal code and framework guarantees. Only validate at system boundaries (user input, external APIs).

Each declined finding falls into one of two patterns:

Pattern A: duplicate validation that's already enforced at the cobra layer

  • internal/client/import.go:112measurement == "" check in ImportCSV. Already rejected at internal/commands/import.go:135 in newImportCSVCmd's RunE. Adding it to the client too is pure duplication.
  • internal/client/import.go:141 — same shape, ImportParquet / measurement. Already rejected at internal/commands/import.go in newImportParquetCmd's RunE.
  • internal/client/import.go:229filePath == "" check in uploadMultipart. Already rejected at the command layer via common.filePath == "" checks in every subcommand. (And os.Open("") returns a clear error anyway: open : no such file or directory.)

These would be load-bearing if internal/client were a public, externally-importable package — but it's internal/, so the only callers ever can be inside arcctl. Same reasoning that lets us trust setCommonHeaders not to panic on a nil receiver (which the cobra layer guarantees via the err-then-return pattern from buildClient).

Pattern B: nil-checks on pointers that the caller has already validated

  • internal/commands/import.go:87cli == nil in resolveImportDatabase. The function is called only from cobra RunE blocks, after buildClient returned (cli, _, err) and the RunE checked err != nil. buildClient either returns (non-nil, nil) or (nil, non-nil-err) — there's no third state. The nil-check is dead code.
  • internal/commands/import.go:362, 398, 423 — nil-checks on *ImportResult, *LPImportResult, *TLEImportResult in the three render functions. Same pattern: called immediately after cli.ImportXxx(...) returns (*Result, error) and the caller checks err != nil first. The client's success path always returns &out (a non-nil pointer); the failure path returns (nil, err). The render function is never reached with a nil pointer.

Adding these checks would make every line into "what if someone calls me wrong?" boilerplate, which obscures the real logic without catching any bug that can happen in this codebase. If a future refactor changes the contract (e.g. one of the client methods starts returning (nil, nil) somehow), that's the moment to add the check — driven by an actual failure mode, not a hypothetical.

Net

Round 4 was useful for surfacing the latent silent-drop in ImportCSV — that fix lands. The remaining 6 are noise of the kind that accumulates when a reviewer runs out of substantive findings on a small diff. Happy to revisit any of the declined items if you can sketch a concrete code path where one would catch a real bug.

@xe-nvdk

xe-nvdk commented Jun 1, 2026

Copy link
Copy Markdown
Member Author

@gemini-code-assist review please

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces bulk-import capabilities (arcctl import) for CSV, Line Protocol, Parquet, and TLE formats, streaming files efficiently via io.Pipe. Feedback focuses on improving parameter validation: adding client-side checks in the client library to fail fast, and leveraging Cobra's built-in MarkFlagRequired feature in the CLI layer to replace manual flag validation with more idiomatic code.

Comment thread internal/client/import.go
Comment thread internal/client/import.go
Comment thread internal/commands/import.go
Comment thread internal/commands/import.go Outdated
Comment thread internal/commands/import.go
Comment thread internal/commands/import.go Outdated
Comment thread internal/commands/import.go Outdated
Comment thread internal/commands/import.go
Comment thread internal/commands/import.go Outdated
…i PR #3 r5)

Round 5 brought 8 truly new findings on commit d1c50a7. Accepting all.

Cobra MarkFlagRequired refactor
- addImportCommonFlags now marks --file as required via
  c.MarkFlagRequired("file") — covers all 4 subcommands
- newImportCSVCmd + newImportParquetCmd mark --measurement required
- newImportLPCmd does NOT mark --measurement (it's an optional filter
  for LP since lines self-declare measurement names)
- newImportTLECmd does NOT mark --measurement (server defaults to
  "satellite_tle" when omitted)
- Manual `if common.filePath == ""` and `if measurement == ""` checks
  removed from all 4 RunE blocks — cobra now enforces before RunE
  with the canonical `Error: required flag(s) "X" not set` format

Client-boundary validations
- ImportCSV now rejects measurement=="", non-single-rune Delimiter,
  and TimeFormat values outside the {epoch_s,epoch_ms,epoch_us,
  epoch_ns,""} enum
- ImportParquet now rejects measurement==""
- Each validation fires before any file open or network call

Why these are different from round-4 declines
- Round-4 declined nil-pointer checks on internal-only callers
  (measurement/filePath/cli/result-pointer) where the caller had
  already validated. Those declines stand.
- Round-5 client-layer validations are paired with REMOVAL of the
  cobra-layer duplicates — the validation moves from cobra to the
  client, it isn't added on both sides. Net diff is -10 LOC of
  defensive boilerplate.
- Round-5 also added genuinely-new constraints (Delimiter must be
  one rune, TimeFormat must be in the enum). These catch typos like
  `--time-format epoch_milliseconds` client-side instead of leaving
  the operator to decipher a server-side parse error.

Tests added
- TestImportCSV_RejectsEmptyMeasurement
- TestImportCSV_RejectsMultiCharDelimiter
- TestImportCSV_AcceptsSingleRuneDelimiter (covers multi-byte UTF-8)
- TestImportCSV_RejectsInvalidTimeFormat
- TestImportCSV_AcceptsEmptyTimeFormat
- TestImportParquet_RejectsEmptyMeasurement
- TestImportCSV_MarkFlagRequired_File
- TestImportCSV_MarkFlagRequired_Measurement
- TestImportLP_MarkFlagRequired_File (also asserts --measurement is
  NOT required for LP)

End-to-end smoke
- arcctl import csv (no flags) → 'Error: required flag(s) "file",
  "measurement" not set' (clean cobra format)
- arcctl import csv --file x.csv (no measurement) → 'Error: required
  flag(s) "measurement" not set'
- arcctl import lp (no flags) → 'Error: required flag(s) "file" not
  set' (no --measurement complaint, correct for LP)
- arcctl import tle (no flags) → same single-flag error
@xe-nvdk xe-nvdk merged commit e26b413 into main Jun 1, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant