Skip to content

fix: make attestation verify atomic and concurrency-safe (closes #98)#119

Merged
rsharath merged 9 commits intomainfrom
fix-attestation-double-credential-on-retry
May 6, 2026
Merged

fix: make attestation verify atomic and concurrency-safe (closes #98)#119
rsharath merged 9 commits intomainfrom
fix-attestation-double-credential-on-retry

Conversation

@rsharath
Copy link
Copy Markdown
Contributor

@rsharath rsharath commented May 5, 2026

Summary

Closes #98. AttestationService.VerifyAttestation now performs its three side-effecting writes (IssueCredential, UpdateIdentity, repo.Update) inside a single transaction with a row-level lock, fixing both partial-failure consistency and the concurrent-verify race that could mint two credentials from one proof.

Problem

VerifyAttestation ran three writes serially, each in its own auto-commit transaction:

  1. credentialSvc.IssueCredential — inserts issued_credentials row.
  2. identitySvc.UpdateIdentity — flips identities.trust_level.
  3. repo.Update — sets attestation_records.is_verified and links credential_id.

Two failure modes:

  • Partial state on failure between steps. The original if record.IsVerified guard only catches retries after a fully-successful run. If Step 1 succeeded but Step 2 or 3 failed, credential_id was set but is_verified was still false, and a retry would mint a second credential.
  • Concurrent /verify race. Two simultaneous calls on the same record could each pass the pre-tx guard, each enter the writes, and each IssueCredential — leaving two credentials in the DB and one record pointing at one of them.

Fix

Both fixes ship in this PR.

Atomicity — bun.RunInTx around the three writes

AttestationService gains a db *bun.DB field. VerifyAttestation's side-effecting writes now run inside a single closure passed to s.db.RunInTx. Any failure rolls all three back; retries always see a clean slate.

Serialization — SELECT ... FOR UPDATE on the attestation row

The transaction's first statement is s.repo.GetByIDForUpdate(...), which acquires a Postgres row-level write lock on attestation_records. Concurrent verifies on the same record queue on the lock; the second one re-reads after the first commits, sees CredentialID set, and bails with ErrAttestationAlreadyVerified.

Reusable transaction-propagation pattern

New internal/store/postgres/tx.go:

  • WithTx(ctx, tx) (exported) attaches a bun.Tx to ctx.
  • dbOrTx(ctx, fallback) (unexported) returns the in-flight tx if one is set, falling back to the repo's *bun.DB.
  • hasTx(ctx) (unexported) for repo methods that fail-fast on missing tx (e.g., GetByIDForUpdate).

Every method in internal/store/postgres/{credential,identity,attestation}.go now routes through dbOrTx. Callers that don't open a transaction get the repo's auto-commit *bun.DB exactly as before — every existing call site is behaviorally unchanged. Other repos (apikey, refresh_token, oauth_client, …) intentionally not touched; the pattern can be extended on demand.

Defense in depth

The record.IsVerified || record.CredentialID != "" clause on the ErrAttestationAlreadyVerified guard stays. With the transaction in place it's unreachable through the normal flow (partial state can't exist by construction), but it still catches direct DB manipulation, future code paths that bypass VerifyAttestation, or commit-then-error-return.

Lock-order note

This flow takes the attestation_records row lock first (SELECT FOR UPDATE), then implicitly the identities row lock (via UpdateIdentity's UPDATE). Future code paths that hold both should acquire them in the same order. Postgres detects deadlocks (40P01) and aborts one tx, so the worst case from inversion is a transient retry, not silent corruption.

Files

File Change
internal/store/postgres/tx.go NEWWithTx / dbOrTx / hasTx helpers, sharing txFromContext.
internal/store/postgres/credential.go All methods route through dbOrTx.
internal/store/postgres/identity.go All methods route through dbOrTx.
internal/store/postgres/attestation.go All methods route through dbOrTx; new GetByIDForUpdate enforces hasTx.
internal/service/attestation.go db *bun.DB field; three writes wrapped in RunInTx; identity re-loaded inside the tx; outer return values assigned exactly once on commit.
server.go One-line wiring: pass db into NewAttestationService.
tests/integration/attestation_oidc_test.go New TestAttestationConcurrentVerifyMintsExactlyOneCredential + TestAttestationVerifyGuardCatchesPartialFailureRetry; updated docstring on existing tests.
tests/integration/postgres_tx_test.go NEW — three tests pinning rollback, auto-commit fallback, and the WithTx-required contract.

Test plan

Six regression tests pin the guarantees:

  • TestPostgresWithTxRollbackPersistsNothing — tx rollback is honored end-to-end.
  • TestPostgresWithoutTxFallsBackToAutoCommit — no-tx callers still auto-commit.
  • TestGetByIDForUpdateRequiresTxWithTx contract is enforced.
  • TestAttestationVerifyGuardCatchesPartialFailureRetry — defense-in-depth guard rejects manually-planted partial state.
  • TestAttestationDoubleVerifyIsRejected — post-success idempotency unchanged.
  • TestAttestationConcurrentVerifyMintsExactlyOneCredential — fires N=8 concurrent /verify calls, asserts exactly one credential lands in the DB and the other seven return 409.

Local: go vet ./... clean, go build ./... clean, go test -race clean, full integration suite green in ~10s. All 10 CI checks green at every commit on the branch. Conflict-goroutine durations of 23-27ms vs winner ~13ms are independent evidence the row lock actually serializes.

What's NOT in this PR (deliberate)

  • Refactor IssueCredential to sign-then-persist to shrink lock duration. Surface change disproportionate to ~10ms savings on a per-attestation flow that isn't hot path.
  • Extend dbOrTx to apikey / refresh_token / oauth_client / etc. None of those participate in the attestation flow today; YAGNI until something needs them. Pattern is in place when it's wanted.

Closes

#98

…ries

`AttestationService.VerifyAttestation` runs three writes serially without
a wrapping DB transaction:

  1. credentialSvc.IssueCredential — inserts issued_credentials row.
  2. identitySvc.UpdateIdentity   — flips identities.trust_level.
  3. repo.Update                  — sets attestation_records.is_verified.

Today's `if record.IsVerified` guard at the top of the function only
catches retries after a fully-successful run. It does NOT catch the
partial-failure window where Step 1 succeeded but Step 2 or Step 3
failed: in that state, IsVerified is still false, but
AttestationRecord.CredentialID is already populated. A retry from the
caller would therefore re-run Step 1 and mint a SECOND credential from
the same proof.

The proper fix is the full DB-transaction refactor proposed in #98 —
threading bun.IDB through CredentialService.IssueCredential,
IdentityService.UpdateIdentity, and AttestationRepository.Update. That
is a real cross-service surface change (~7 IssueCredential call sites
plus repo signature changes) and is left as the follow-up.

This PR does the smaller mitigation the issue itself proposes as a
fallback: extend the guard to ALSO trip when CredentialID != "".
That closes the double-credential window without touching service-layer
signatures. The remaining gap — Step 2 succeeding but Step 3 failing
leaves trust_level promoted on a record that's not marked verified — is
unchanged, but it's a strictly weaker problem (no double issuance, just
an audit-trail inconsistency) and disappears entirely once the
transaction work lands.

## Test plan

New regression test `TestAttestationVerifyGuardCatchesPartialFailureRetry`:
- runs a clean verify (ends with IsVerified=true, CredentialID set)
- surgically rewinds is_verified to false in the DB to simulate the
  Step 2/3 failure window
- asserts the retry returns 409 Conflict instead of minting a second
  credential

Without the new condition the test fails at the second verify (returns
200 + new access token). With it, second verify is rejected as expected.

Existing TestAttestationDoubleVerifyIsRejected continues to pass — the
new condition is OR'd with the old one; post-success guard behavior is
unchanged.

Issue #98 stays open to track the full transaction refactor.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a guard in the attestation service to prevent duplicate credential issuance during retries of partially failed requests, along with a corresponding integration test. Feedback indicates that the guard is currently ineffective because the CredentialID is only persisted in the final step of the process; a failure in earlier steps would leave the database in a state that bypasses this check. To resolve this, the reviewer suggests wrapping the database operations in a transaction to ensure atomicity. Furthermore, the integration test is noted to simulate a state that does not occur during actual application failure modes.

Comment thread internal/service/attestation.go Outdated
Comment thread tests/integration/attestation_oidc_test.go
rsharath added 6 commits May 5, 2026 13:54
…oses #98)

VerifyAttestation does three independent writes:

  1. credentialSvc.IssueCredential — inserts issued_credentials row.
  2. identitySvc.UpdateIdentity   — flips identities.trust_level.
  3. repo.Update                  — sets attestation_records.is_verified
                                    and links credential_id.

Before this change each ran in its own auto-commit transaction. A failure
between steps left partial state — most painfully, Step 1 succeeding then
Step 2 or Step 3 failing meant a retry would mint a second credential
from the same proof, since the IsVerified guard hadn't been set yet.

## Approach

Context-based tx propagation rather than threading bun.IDB through every
service signature. New helper `internal/store/postgres/tx.go`:

  - `WithTx(ctx, tx)` attaches an in-flight tx to ctx.
  - `dbOrTx(ctx, fallback)` (unexported) returns the tx if ctx has one,
    falls back to the repo's *bun.DB handle.

Three repo methods now route through dbOrTx:

  - postgres.CredentialRepository.Create
  - postgres.IdentityRepository.Update
  - postgres.AttestationRepository.Update

Repos that don't see WithTx behave exactly as before (auto-commit per
statement), so this is a strict superset of the prior behavior — every
existing caller is untouched, no signatures change.

VerifyAttestation now wraps the three writes:

  s.db.RunInTx(ctx, nil, func(ctx context.Context, tx bun.Tx) error {
      ctx = postgres.WithTx(ctx, tx)
      // IssueCredential / UpdateIdentity / repo.Update
  })

AttestationService gains a `db *bun.DB` field so it can open the tx;
NewAttestationService picks up one new parameter, wired in server.go.

## Why context-based and not signature-based

Threading `bun.IDB` through CredentialService.IssueCredential and
IdentityService.UpdateIdentity would have changed ~15 call sites across
oauth.go, agent.go, handler/identity.go, handler/credential.go. The
context-based pattern keeps the change local: only the repo methods that
actually need to participate in a tx pick up the helper, and every other
caller of those services keeps the existing signature.

## Defense in depth

The CredentialID-based check in the ErrAttestationAlreadyVerified guard
stays in place as belt-and-suspenders. With the transaction it's
unreachable through the normal flow — partial state cannot exist. But
it still catches direct DB manipulation, a future code path that
bypasses the verify flow, or a transaction-commits-but-error-returned
bug. Test pinning the guard remains.

## Test plan

- [x] `go vet ./...` clean (root + pkg/authjwt)
- [x] Full integration suite green ~10s, including:
      - TestAttestationOIDCVerifierHappyPath (proves the tx commits and all
        three writes land for a normal success path)
      - TestAttestationDoubleVerifyIsRejected (post-success guard)
      - TestAttestationVerifyGuardCatchesPartialFailureRetry (manually
        plants partial state to exercise the defense-in-depth guard)

Closes #98.
Three follow-ups on the transaction-wrap PR:

R1. Update the stale ErrAttestationAlreadyVerified comment.

The original guard-tighten commit said "Closes one of the two failure
windows tracked by #98; the full DB-transaction refactor closes the rest."
After the tx wrap landed in the previous commit on this branch, the full
refactor IS this PR — the comment misled a future reader about what was
load-bearing. Rewritten to make clear the guard is now strictly defense-
in-depth: the transaction wrap closes both failure windows; the guard
only fires for direct DB manipulation, future code paths that bypass
VerifyAttestation, or commit-then-error-return.

R2. Route every read in the three touched repos through dbOrTx.

Previously only Create / Update participated in transactions. Reads
(GetByID, GetByJTI, ListByIdentity, GetHighestVerifiedLevel,
RevokeAllActiveForIdentity, Revoke, GetByExternalID, GetByWIMSEURI,
List, Delete) used r.db directly. In today's flow that's harmless —
IssueCredential's reads don't touch tables we write to in the same
tx — but it's a latent stale-read footgun: a future refactor that adds
a read-after-write inside the closure would silently see pre-tx state.

Now every method in postgres/{credential,identity,attestation}.go
opens through dbOrTx, so the contract is uniform: if a tx is attached
to ctx, the statement participates; otherwise the repo's *bun.DB falls
through. No call site sees a behavior change.

Other repos (apikey.go, refresh_token.go, etc.) intentionally not
touched — they're not in the attestation flow and the consistent
pattern can be extended on demand.

R3. Add direct regression tests for the WithTx + dbOrTx mechanism.

tests/integration/postgres_tx_test.go:

  - TestPostgresWithTxRollbackPersistsNothing: opens a tx via testDB,
    inserts an Identity + an IssuedCredential through the repos with
    WithTx attached, returns a non-nil error from the closure, and
    asserts neither row survives. Pins the foundational invariant
    behind the attestation atomicity claim — without it, every existing
    test in the suite could pass while the rollback semantics are
    silently broken.

  - TestPostgresWithoutTxFallsBackToAutoCommit: confirms the no-tx-in-
    ctx path still auto-commits per statement, so every existing call
    site of these repos (which doesn't open a tx) is behaviorally
    unchanged.

## Test plan

- [x] go vet ./... clean
- [x] Full integration suite green ~10s, including the two new tests
- [x] Existing TestAttestationVerifyGuardCatchesPartialFailureRetry
      still pins the defense-in-depth guard behavior

The defense-in-depth claim and the rollback claim are now each backed
by their own test.
Addresses the remaining items from the self-review on the #98 tx wrap:

W1. Concurrent-verify race.

Two simultaneous /verify calls on the same record could each pass the
pre-tx IsVerified guard, each enter RunInTx, each call IssueCredential,
and leave the DB with two credentials minted from one proof. The tx
wrap by itself doesn't fix this — both txs would commit successfully,
just at different times.

Fix: SELECT ... FOR UPDATE on the attestation record as the FIRST
statement inside the tx (new repo method GetByIDForUpdate). The second
verify queues on the row lock; when it acquires, the first has already
written CredentialID and the in-tx guard re-check rejects it with
ErrAttestationAlreadyVerified.

Pinned by TestAttestationConcurrentVerifyMintsExactlyOneCredential:
fires N=8 concurrent /verify calls on the same record and asserts
exactly one returns 200, the other seven return 409, and the DB ends
with exactly one credential row for the identity.

W2. Identity read moved inside the tx.

GetIdentity used to run before RunInTx, so a concurrent UpdateIdentity
that deactivated the identity between the read and the tx could mint
a credential against a stale view. Now reads happen inside the closure,
under the same READ COMMITTED snapshot as the writes.

W4. Isolation level documented.

Comment on the RunInTx call notes that nil TxOptions = READ COMMITTED
and that the row lock provides serialization; we don't need REPEATABLE
READ across the whole tx.

N1. Outer return values assigned only on success.

The previous version captured `record`/`accessToken`/`cred` as outer
vars and wrote into `record` mid-closure. On rollback the outer
`record` was left holding "verified-looking" fields (no observable bug
because of the early txErr return, but a footgun for future code that
reads after the error check).

Now: closure-scoped locals (`locked`, `issued`, `issuedCred`) hold
in-progress state, and the three outer vars are assigned exactly once
at the end of the closure. A rollback path never assigns them, so the
caller can't observe partial state by accident.

N3. WithTx nesting documented.

Added a paragraph to WithTx's doc comment noting that re-attaching on a
ctx that already carries a tx replaces it (no savepoints, no merge),
and that the right pattern is "open one tx at the outermost service
call, attach once."

## Test plan

- [x] go vet ./... clean
- [x] go build ./... clean
- [x] Full integration suite green ~10s, including:
      - TestAttestationConcurrentVerifyMintsExactlyOneCredential (NEW —
        pins the row lock against an N=8 concurrent attack)
      - TestAttestationOIDCVerifierHappyPath (regression — happy path
        still commits all three writes)
      - TestAttestationDoubleVerifyIsRejected (post-success guard)
      - TestAttestationVerifyGuardCatchesPartialFailureRetry
        (defense-in-depth guard against direct DB manipulation)
      - TestPostgresWithTxRollbackPersistsNothing /
        TestPostgresWithoutTxFallsBackToAutoCommit (foundational
        WithTx + dbOrTx mechanism)

W3 (cryptographic signing inside the tx) is intentionally left as-is.
It's a per-attestation cost, not a hot-path concern, and refactoring
would mean either pre-computing the signature outside the tx (loses
ability to use ctx-derived state) or splitting IssueCredential into
sign + persist phases (broader surface change with no measurable gain).
R1. Fix misleading "no-op" claim on GetByIDForUpdate.

The doc previously said "outside one [tx], FOR UPDATE on Postgres is a
no-op." Technically wrong — Postgres wraps every statement in an
implicit transaction, so the SELECT does acquire the lock; it just
releases it on the implicit commit when the statement returns. The
practical effect (no useful serialization) is the same, but the
wording would mislead a reader trying to reason about lock lifetime.
Rewrote to describe the real behavior.

R2. Enforce that GetByIDForUpdate is called inside a tx.

The doc said "MUST be called inside a transaction" but the code
didn't check. A future caller that forgets postgres.WithTx would
silently downgrade to a per-statement lock with no useful
serialization — exactly the failure mode the lock is meant to prevent,
made invisible.

Added a `hasTx(ctx)` helper alongside dbOrTx and a one-line guard:
"GetByIDForUpdate must be called inside a postgres.WithTx context."
Misuse now fails fast instead of producing a race that only shows up
under concurrent load.

R3. Stop calling require.* from goroutines in the race test.

The Go testing contract says t.FailNow / require.* MUST be called
from the goroutine running the test, not from workers. The previous
race test fired N goroutines that each called the verifyAttestation
helper, which routes through doRequest → require.NoError. In practice
this often works, but it's officially undefined behaviour: a worker
that fails won't actually halt the test, and on some Go versions the
failure is silently lost.

Refactored the goroutines to do raw http.NewRequest +
http.DefaultClient.Do, write (status, err) to a per-slot result
struct, and have the main test goroutine assert post-Wait. Same
coverage, no testing-API misuse.

`go test -race` confirms no data races; conflict goroutines record
durations of ~23-27ms (vs ~13ms for the lock holder), independent
evidence the row lock is actually serializing the requests.

## Test plan

- [x] go vet ./... clean
- [x] go test -race targeted suite — clean, no race warnings
- [x] Full integration suite green ~10s
- [x] Race test still observes 1 OK / 7 conflict / 1 credential
Three small follow-ups from the latest self-review pass:

W1. Document lock ordering.

The tx now takes the attestation_records row lock (FOR UPDATE) first,
then implicitly takes the identities row lock when UpdateIdentity runs.
A future code path that does the inverse — identity lock first, then
attestation lock — could deadlock with this one. Postgres detects
deadlocks (40P01) and aborts one tx, so it's not a correctness bug,
just a transient retry concern. Added a comment on the RunInTx site
documenting the ordering so a future contributor doesn't introduce
the inverse path silently.

N1. Rename `updatedRecord` to `verifiedRecord`.

The variable holds the post-success attestation record. "Updated" is
generic; "verified" is the specific thing that makes this record
worth holding onto. One-name swap, no behavior change.

N2. Tighten the pre-tx fast-fail comment.

The previous wording said the pre-tx guard "shaves work off the
obvious-already-done case" — accurate but understated. The real saving
on retry storms is avoiding the verifier run, the second DB connection
for the tx, and the BEGIN/SELECT FOR UPDATE/ROLLBACK round-trip that
the in-tx guard would otherwise require. Reworded.

Skipped: W2 (signing inside the lock) — a real refactor of
CredentialService into "build+sign" + "persist" phases for ~10ms of
lock contention savings per attestation verify; not worth the surface
change. W3 (sanity-check that conflict goroutines actually waited) —
timing-based assertion would be flaky on heterogeneous CI hardware.

## Test plan

- [x] go vet ./... clean
- [x] Full integration suite green ~10s, no regressions
R1. Add regression test for the WithTx contract on GetByIDForUpdate.

The hasTx guard added in cc67d52 protects against a future caller
forgetting postgres.WithTx around GetByIDForUpdate — silent downgrade
to a per-statement implicit tx would acquire the row lock and release
it immediately, providing no serialization. Without a test, the guard
itself could be removed by a future contributor thinking it's
redundant, and the regression would only surface under concurrent
load.

New TestGetByIDForUpdateRequiresTx in postgres_tx_test.go calls
GetByIDForUpdate with context.Background() and asserts the contract
error fires.

W1. Drop the misleading "failed to lock attestation record" wrap.

When GetByIDForUpdate's hasTx guard fires, the previous wrap chain
read "failed to lock attestation record: GetByIDForUpdate must be
called inside a postgres.WithTx context — ..." Misleading: an
operator chasing the message would assume a runtime DB problem when
the cause is actually a programming mistake. Removed the outer
wrap; the repo method's error already names what failed (whether
runtime or contract violation), and the inline closure comment
explains why.

W2. Trim the docstring redundancy.

The function-level docstring previously described the atomicity +
serialization invariant in detail, then the inline comments inside
the closure repeated the same content at finer grain. Trimmed the
docstring to a one-paragraph summary that points at the inline
commentary; the inline comments remain authoritative.

N1. Share the txKey type assertion via txFromContext helper.

dbOrTx and hasTx both did the same `ctx.Value(txKey{}).(bun.Tx)`
assertion. Extracted into a single txFromContext returning (tx, ok)
so the key + type live in exactly one place.

N2. Move the lock-order comment inside the closure.

The "Lock-order note" comment previously sat above RunInTx but
described what happens INSIDE the closure. Moved next to the
GetByIDForUpdate call where the read-the-comment-then-look-at-the-
code distance is zero.

## Test plan

- [x] go vet ./... clean
- [x] go test ./... full suite green ~10s
- [x] TestGetByIDForUpdateRequiresTx pins the contract
@rsharath rsharath changed the title fix: tighten attestation re-verify guard to catch partial-failure retries (mitigates #98) fix: make attestation verify atomic and concurrency-safe (closes #98) May 5, 2026
@rsharath
Copy link
Copy Markdown
Contributor Author

rsharath commented May 5, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements transaction atomicity and concurrency control in the VerifyAttestation flow. It introduces a context-based transaction mechanism across the Attestation, Identity, and Credential repositories, utilizing SELECT FOR UPDATE to prevent duplicate credential issuance during concurrent requests. A review comment identified a risk of transaction poisoning in the IdentityRepository.Delete method due to an ignored error during a pre-deletion update.

Comment thread internal/store/postgres/identity.go Outdated
The pre-DELETE UPDATE that stamps modified_by previously discarded its
error via `_, _ = db.NewUpdate()...Exec(ctx)`. Outside a transaction
this was harmless: the subsequent DELETE either succeeded (best case)
or surfaced its own error (e.g., row not found).

Inside a transaction this was a real bug. Postgres aborts a transaction
the moment any statement fails, so a failed pre-stamp would mean the
DELETE that follows fails with the generic "current transaction is
aborted, commands ignored until end of transaction block" — losing the
original cause and making the failure mode hard to diagnose.

The Delete repo method was changed to participate in WithTx alongside
Create / Update in commit aa07997. Now that callers can route Delete
through a tx via dbOrTx, the swallow pattern is no longer safe.

Fix is the obvious one: propagate the UPDATE error. Behavior change
outside a tx is "loud failure on a benign UPDATE problem" instead of
"silent audit gap" — strictly safer, and aligns the audit trail with
the same atomicity guarantees the rest of the repo provides under tx.

Flagged by Gemini on PR #119.
@rsharath rsharath merged commit 66e4167 into main May 6, 2026
10 checks passed
@rsharath rsharath deleted the fix-attestation-double-credential-on-retry branch May 6, 2026 03:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

chore: wrap attestation verification writes in a DB transaction

2 participants