Server thread safety #275

bigbrett · 2026-01-22T22:34:57Z

Server thread safety

TL;DR: Makes wolfHSM server safe to use in multithreaded scenarios.

Overview

This pull request implements thread-safe access to shared server resources in wolfHSM, specifically targeting the NVM (non-volatile memory) subsystem, whose lock also protects the global key cache. Crypto is left to a subsequent PR but is the likely next candidate.

Note that a server context itself still cannot be shared across threads without proper serialization by the caller. This PR adds the mechanisms such that, when multiple server contexts share an NVM instance (which includes the global keystore), access to those shared resources is properly serialized, allowing requests from multiple clients to be processed concurrently in separate threads.

Changes

Introduces lock abstraction layer (wh_lock.{c,h}) with callback-based design for platform independence
Example POSIX lock implementation using pthread_mutex
Adds server-level NVM locking API (wh_Server_NvmLock()/wh_Server_NvmUnlock()) with convenience macros WH_SERVER_NVM_LOCK()/WH_SERVER_NVM_UNLOCK()
All request handlers that access NVM or global keystore resources acquire the lock at the handler level before performing operations
Lower-level modules (NVM, keystore, counter, cert, etc.) remain lock-free; synchronization is the responsibility of the request handler layer
Thread-safe functionality is enabled with the WOLFHSM_CFG_THREADSAFE build option. When this option is NOT defined, all lock macros compile to no-ops with zero overhead
Adds a "thread safe stress test" to the test suite that attempts to flush out data races via a large number of contention cases; it is meant to be run under ThreadSanitizer
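
For readers skimming the list above, here is a minimal sketch of what a callback-based lock layer with a POSIX backend and compile-time no-op macros could look like. Every `ex*`/`EX_*` name is a hypothetical stand-in chosen for illustration; the actual wh_lock.h types, return codes, and macro names may differ.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* All names here are hypothetical stand-ins, not the real wh_lock.h API. */
#define EX_CFG_THREADSAFE /* remove to compile the lock macros away */

enum { EX_OK = 0, EX_BADARGS = -1, EX_ABORTED = -2 };

/* Callbacks supplied by the platform port; the core never sees pthreads. */
typedef struct {
    int (*init)(void* ctx);
    int (*cleanup)(void* ctx);
    int (*acquire)(void* ctx);
    int (*release)(void* ctx);
} exLockCb;

typedef struct {
    const exLockCb* cb;  /* platform callbacks */
    void*           ctx; /* platform state, e.g. a pthread_mutex_t */
} exLock;

int exLock_Acquire(exLock* lock)
{
    if (lock == NULL || lock->cb == NULL || lock->cb->acquire == NULL) {
        return EX_BADARGS;
    }
    return lock->cb->acquire(lock->ctx);
}

int exLock_Release(exLock* lock)
{
    if (lock == NULL || lock->cb == NULL || lock->cb->release == NULL) {
        return EX_BADARGS;
    }
    return lock->cb->release(lock->ctx);
}

/* Example POSIX port built on pthread_mutex. */
static int posixInit(void* ctx)
{
    return (pthread_mutex_init((pthread_mutex_t*)ctx, NULL) == 0) ? EX_OK
                                                                  : EX_ABORTED;
}
static int posixCleanup(void* ctx)
{
    return (pthread_mutex_destroy((pthread_mutex_t*)ctx) == 0) ? EX_OK
                                                               : EX_ABORTED;
}
static int posixAcquire(void* ctx)
{
    return (pthread_mutex_lock((pthread_mutex_t*)ctx) == 0) ? EX_OK
                                                            : EX_ABORTED;
}
static int posixRelease(void* ctx)
{
    return (pthread_mutex_unlock((pthread_mutex_t*)ctx) == 0) ? EX_OK
                                                              : EX_ABORTED;
}

static const exLockCb posixLockCb = {posixInit, posixCleanup, posixAcquire,
                                     posixRelease};

/* With the option undefined, call sites cost nothing. */
#ifdef EX_CFG_THREADSAFE
#define EX_NVM_LOCK(lock) exLock_Acquire(lock)
#define EX_NVM_UNLOCK(lock) exLock_Release(lock)
#else
#define EX_NVM_LOCK(lock) (EX_OK)
#define EX_NVM_UNLOCK(lock) (EX_OK)
#endif

/* Initialize, take and drop the lock once, then clean up. */
int exLockDemo(void)
{
    pthread_mutex_t mutex;
    exLock          lock = {&posixLockCb, &mutex};
    int             rc   = lock.cb->init(lock.ctx);

    if (rc != EX_OK) {
        return rc;
    }
    rc = EX_NVM_LOCK(&lock);
    if (rc == EX_OK) {
        rc = EX_NVM_UNLOCK(&lock);
    }
    (void)lock.cb->cleanup(lock.ctx);
    return rc;
}
```

The point of the indirection is that only the four callbacks know about pthreads, so a bare-metal or RTOS port would only need to supply its own callback table.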

Design Rationale

The locking strategy is intentionally simple: acquire the NVM lock at the start of a request handler, perform all operations (including any compound operations involving multiple NVM/cache accesses), then release the lock. This approach:

Avoids TOCTOU issues - No risk of metadata becoming stale or objects being destroyed/replaced between checks
Makes lock scope visible - Locking is explicit at the handler level rather than hidden in lower layers
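
A minimal sketch of that handler-level pattern, using a toy in-memory object and a plain pthread mutex in place of the real NVM and lock types. All `ex*` names are hypothetical; the point is only that the length check and the read sit inside one critical section while the lower-level helpers stay lock-free.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

enum { EX_OK = 0, EX_BADARGS = -1, EX_NOTFOUND = -2, EX_ABORTED = -3 };

/* Toy stand-in for a shared NVM instance holding a single object. */
typedef struct {
    pthread_mutex_t lock;     /* plays the role of the shared NVM lock */
    uint8_t         data[16];
    uint16_t        len;      /* 0 means "no object stored" */
} exNvm;

/* Lower layer: no locking, the caller is responsible for serialization. */
static int exNvm_GetLen(const exNvm* nvm, uint16_t* len)
{
    if (nvm->len == 0) {
        return EX_NOTFOUND;
    }
    *len = nvm->len;
    return EX_OK;
}

static int exNvm_Read(const exNvm* nvm, uint16_t off, uint16_t len,
                      uint8_t* out)
{
    if ((uint32_t)off + len > nvm->len) {
        return EX_BADARGS;
    }
    memcpy(out, nvm->data + off, len);
    return EX_OK;
}

/* Handler layer: check and use happen under the same lock acquisition.
 * The caller must provide room for `len` bytes in `out`. */
int exHandleRead(exNvm* nvm, uint16_t off, uint16_t len, uint8_t* out,
                 uint16_t* outLen)
{
    uint16_t objLen = 0;
    int      rc;

    if (pthread_mutex_lock(&nvm->lock) != 0) {
        return EX_ABORTED;
    }
    rc = exNvm_GetLen(nvm, &objLen);
    if (rc == EX_OK && off >= objLen) {
        rc = EX_BADARGS;
    }
    if (rc == EX_OK) {
        /* Clamp to the object size; objLen cannot change while locked. */
        if ((uint32_t)off + len > objLen) {
            len = (uint16_t)(objLen - off);
        }
        rc = exNvm_Read(nvm, off, len, out);
    }
    (void)pthread_mutex_unlock(&nvm->lock);

    if (rc == EX_OK) {
        *outLen = len;
    }
    return rc;
}
```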

Gaps/Future Work

Serializing access to global crypto state, specifically hardware crypto for ports. A bit of a tricky problem since offload is provided at the port level, and there isn't a good way for wolfHSM to know which algos will be accelerated and which won't. A naive implementation might consider simply locking the server crypto context, but this contains a mixture of local (CMAC) and quasi-global (RNG) elements and no abstraction for hardware. Locks also need to be synchronized with the wolfCrypt port mutex. We should refactor the server crypto context and perhaps split it into local and global structures, with the global supporting hardware state. Future work...

billphipps

Truly excellent! You solved this just the way I had hoped for!
My requested changes are very limited and not really functional. More just fleshing out the exact requirements for a real implementation and a few minor typos and renaming opportunities.

The stress testing framework is outstanding!

wolfhsm/wh_lock.h

billphipps · 2026-01-25T16:28:39Z

src/wh_lock.c

+#include "wolfhsm/wh_lock.h"
+#include "wolfhsm/wh_error.h"
+
+#ifdef WOLFHSM_CFG_THREADSAFE


Is this the best name? Consider the more mundane WOLFHSM_CFG_LOCKS. Threadsafe may imply more than just locks, like cancelability.

yeah was kind of wishy washy on this. good point. Let me think on it.

test/wh_test_lock.c

test/wh_test_posix_threadsafe_stress.c

billphipps · 2026-01-25T16:47:35Z

test/wh_test_posix_threadsafe_stress.c

Consider adding posix into the name of this file since it relies heavily on POSIX to provide any real functionality.

Yeah it might be nice to organize our posix tests in one spot. maybe test/posix or port/posix/test/ so we can leave our wh_test_*.c stuff generic for all platforms

I really like that solution. +1

That is a good idea. Unfortunately a lot of our generic test modules (e.g. wh_test_clientserver.c) contain both generic drivers and a POSIX harness (e.g. one that spins up the client + server threads). I think it might be best to push this out of scope of this PR and refactor the tests to better split generic test drivers (e.g. whTest_XXXClientCfg(whClientConfig*) and whTest_XXXClientCtx(whClientCtx*)) from the actual underlying test harness. I'd wager we could remove a lot of code that way, with one or two unified harnesses that the drivers just run on top of.

Yeah agreed! Definitely outside the scope of this PR

rizlik

I didn't look into tests yet.
Great work.
Is this lock enough to properly synchronize client requests?
For example, in _HandleNvmRead:

    rc = wh_Nvm_GetMetadata(server->nvm, id, &meta);
    if (rc != WH_ERROR_OK) {
        return rc;
    }

    if (offset >= meta.len)
        return WH_ERROR_BADARGS;

    /* Clamp length to object size */
    if ((offset + len) > meta.len) {
        len = meta.len - offset;
    }

    rc = wh_Nvm_ReadChecked(server->nvm, id, offset, len, out_data);
    if (rc != WH_ERROR_OK)

The metadata can change between GetMetadata and ReadChecked.
Also, when handling a key request:

            /* get a new id if one wasn't provided */
            if (WH_KEYID_ISERASED(meta->id)) {
                ret     = wh_Server_KeystoreGetUniqueId(server, &meta->id);
                resp.rc = ret;
            }
            /* write the key */
            if (ret == WH_ERROR_OK) {
                ret     = wh_Server_KeystoreCacheKeyChecked(server, meta, in);
                resp.rc = ret;
            }

The id might no longer be unique by the time wh_Server_KeystoreCacheKeyChecked() is called.

Would coarser-grained locking at the request level simplify the design?
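
To make the second case concrete, here is a hedged sketch of ID allocation and caching done as one compound operation under a single lock. The cache, its helpers, and the error codes are all hypothetical toy stand-ins; as in the excerpt above, an ID of 0 is treated as "not provided".

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define EX_SLOTS 4

enum { EX_OK = 0, EX_NOSPACE = -1, EX_ABORTED = -2, EX_BADARGS = -3 };

/* Toy shared key cache: ids 1..EX_SLOTS, 0 means "no id provided". */
typedef struct {
    pthread_mutex_t lock;
    uint8_t         used[EX_SLOTS + 1];
} exCache;

/* Lock-free helper: find an id that is currently free. */
static int exGetUniqueId(const exCache* c, uint16_t* id)
{
    for (uint16_t i = 1; i <= EX_SLOTS; i++) {
        if (!c->used[i]) {
            *id = i;
            return EX_OK;
        }
    }
    return EX_NOSPACE;
}

/* Lock-free helper: store (or overwrite) the key under the given id. */
static int exCacheKey(exCache* c, uint16_t id)
{
    if (id == 0 || id > EX_SLOTS) {
        return EX_BADARGS;
    }
    c->used[id] = 1;
    return EX_OK;
}

/* Handler: the id stays unique because nobody else can cache a key
 * between the lookup and the store. */
int exHandleCache(exCache* c, uint16_t* id)
{
    int rc = EX_OK;

    if (pthread_mutex_lock(&c->lock) != 0) {
        return EX_ABORTED;
    }
    if (*id == 0) {
        rc = exGetUniqueId(c, id);
    }
    if (rc == EX_OK) {
        rc = exCacheKey(c, *id);
    }
    (void)pthread_mutex_unlock(&c->lock);
    return rc;
}
```

If the lock were instead taken separately inside each helper, two handlers could both receive the same free id before either one stored its key.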

src/wh_server_keystore.c

API/Error handling:
- Add initialized flag to whLock structure to distinguish init states
- Enhance error handling: acquire/release check initialized flag
- Make wh_Lock_Cleanup zero structure for clear post-cleanup state
- Document init/cleanup must be single-threaded (no atomics)
- Document cleanup preconditions (no active contention required)
- Update all API docs with precise return codes and error conditions
- Change blocking acquire failure from ERROR_LOCKED to ERROR_ABORTED
- Add comment explaining why non-blocking acquire is not provided

POSIX port improvements:
- Enhanced errno mapping in posix_lock.c (EINVAL→BADARGS, etc)
- Trap PTHREAD_MUTEX_ERRORCHECK errors (EDEADLK, EPERM)

Test coverage:
- Add testUninitializedLock to validate error handling
- Enhance testLockLifecycle with post-cleanup validation tests

Misc:
- Apply consistent critical section style pattern in wh_nvm.c
- Update copyright years to 2026
- Rename stress test files to wh_test_posix_threadsafe_stress.*
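
As a rough illustration of the initialized flag and errno mapping described in this commit message, the sketch below wraps a PTHREAD_MUTEX_ERRORCHECK mutex. The `ex*` names and numeric codes are hypothetical, and mapping EDEADLK/EPERM to an "aborted" code is an assumption; the commit only states that EINVAL maps to BADARGS and that those two errors are trapped.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>

enum { EX_OK = 0, EX_BADARGS = -1, EX_ABORTED = -2 };

typedef struct {
    pthread_mutex_t mutex;
    int             initialized; /* nonzero only between Init and Cleanup */
} exLock;

/* Translate pthread error numbers into library-style codes. */
static int exMapErrno(int e)
{
    switch (e) {
        case 0:
            return EX_OK;
        case EINVAL:
            return EX_BADARGS;
        case EDEADLK: /* owner tried to lock again (ERRORCHECK mutex) */
        case EPERM:   /* unlock by a thread that does not hold the lock */
        default:
            return EX_ABORTED;
    }
}

/* Must be called from a single thread before any contention exists. */
int exLock_Init(exLock* lock)
{
    pthread_mutexattr_t attr;
    int                 e;

    if (lock == NULL) {
        return EX_BADARGS;
    }
    e = pthread_mutexattr_init(&attr);
    if (e != 0) {
        return exMapErrno(e);
    }
    e = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (e == 0) {
        e = pthread_mutex_init(&lock->mutex, &attr);
    }
    (void)pthread_mutexattr_destroy(&attr);
    lock->initialized = (e == 0);
    return exMapErrno(e);
}

int exLock_Acquire(exLock* lock)
{
    if (lock == NULL || !lock->initialized) {
        return EX_BADARGS;
    }
    return exMapErrno(pthread_mutex_lock(&lock->mutex));
}

int exLock_Release(exLock* lock)
{
    if (lock == NULL || !lock->initialized) {
        return EX_BADARGS;
    }
    return exMapErrno(pthread_mutex_unlock(&lock->mutex));
}

/* Zero the structure so later calls fail cleanly with EX_BADARGS. */
int exLock_Cleanup(exLock* lock)
{
    int e;

    if (lock == NULL || !lock->initialized) {
        return EX_BADARGS;
    }
    e = pthread_mutex_destroy(&lock->mutex);
    memset(lock, 0, sizeof(*lock));
    return exMapErrno(e);
}
```

With an error-checking mutex, a second acquire by the owning thread fails instead of deadlocking, which is what lets a lifecycle test exercise misuse paths deterministically.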

bigbrett · 2026-01-27T18:04:23Z

@rizlik great catch, thanks. I thought I fixed all of those but clearly there are some non-atomic compound operations still lurking. I will make another pass to ensure I make them all atomic.

rizlik · 2026-01-27T18:24:08Z

I wonder, if we are going to use a single lock, can't we just acquire the lock at wh_Server_HandleKeyRequest start and release the lock at the end (same for wh_Server_HandleNvmRequest)?

It's probably a tradeoff: we gain simplicity since we don't need locked vs. unlocked APIs, but there is a risk that other parts of the code misuse the NVM API and introduce races in the future.

bigbrett · 2026-01-27T19:37:47Z

@rizlik yep that is what I was worried about and why I didn't initially try it that way ¯\_(ツ)_/¯

I'm not 100% sold on which is better

wolfhsm/wh_lock.h

src/wh_nvm.c

src/wh_server_keystore.c

…nter, img_mgr, and nvm modules

Adds proper thread-safety locking discipline to additional server modules that perform compound NVM operations. This prevents TOCTOU (Time-Of-Check-Time-Of-Use) issues where metadata could become stale between check and use/writeback.

Changes:
- wh_server_cert.c: Add NVM locking for atomic GetMetadata + Read operations in certificate read and export paths
- wh_server_counter.c: Add NVM locking for atomic read-modify-write counter increment operations
- wh_server_img_mgr.c: Add NVM locking for atomic signature load operations
- wh_server_keystore.c: Refactor to use unlocked internal variants for compound operations (GetUniqueId + CacheKey, policy check + erase, freshen + export). Add locking discipline documentation.
- wh_server_nvm.c: Add NVM locking for DMA read operations to ensure metadata remains valid throughout transfer. Add locking discipline documentation.
- wh_test_posix_threadsafe_stress.c: Add new stress test phases for counter concurrent increment, counter increment vs read, NVM read vs resize, NVM concurrent resize, and NVM read DMA vs resize. Add counter atomicity validation.

All compound operations now follow the pattern:
1. Acquire server->nvm->lock
2. Use only *Unlocked() variants internally
3. Keep lock held for entire operation including DMA
4. Release lock after all metadata-dependent operations complete
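
The read-modify-write counter case can be sketched as follows, with a small concurrent-increment check in the spirit of the stress test. This is a toy model only: `exNvm`, the worker, and the thread and iteration counts are hypothetical and not taken from the actual test file.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define EX_THREADS 4
#define EX_ITERS 10000

/* Toy shared NVM holding a single counter object. */
typedef struct {
    pthread_mutex_t lock;
    uint32_t        counter;
} exNvm;

/* Each increment is read, modify, write under one lock acquisition. */
static void* exWorker(void* arg)
{
    exNvm* nvm = (exNvm*)arg;

    for (int i = 0; i < EX_ITERS; i++) {
        if (pthread_mutex_lock(&nvm->lock) != 0) {
            break;
        }
        uint32_t value = nvm->counter; /* read */
        value++;                       /* modify */
        nvm->counter = value;          /* write back */
        (void)pthread_mutex_unlock(&nvm->lock);
    }
    return NULL;
}

/* Run the workers concurrently and return the final counter value. */
uint32_t exRunCounterStress(void)
{
    exNvm     nvm;
    pthread_t threads[EX_THREADS];
    int       started = 0;
    uint32_t  result;

    nvm.counter = 0;
    if (pthread_mutex_init(&nvm.lock, NULL) != 0) {
        return 0;
    }
    for (int i = 0; i < EX_THREADS; i++) {
        if (pthread_create(&threads[started], NULL, exWorker, &nvm) != 0) {
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        (void)pthread_join(threads[i], NULL);
    }
    result = nvm.counter; /* all workers joined, no lock needed */
    (void)pthread_mutex_destroy(&nvm.lock);
    return result;
}
```

Without the lock around the three steps, increments from different threads can overwrite each other and the final value falls short of EX_THREADS * EX_ITERS, which is the kind of atomicity violation a counter validation phase can detect.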

AlexLanzano

Looks really good so far!

My main concern is the addition of the *Unlocked functions. I feel like there has to be a way to remove those and still use the top-level API functions, either by checking whether the current thread has already acquired the NVM lock or by creating a lock for both the keystore and the NVM.
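
For reference, the first option (letting the owning thread re-enter) could be approximated on POSIX with a recursive mutex. This is a sketch under that assumption, not what the PR implements; the `ex*` names are hypothetical and the `int` store is a toy stand-in for NVM state.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t m; /* recursive: the owner may lock it again */
} exReLock;

int exReLock_Init(exReLock* lock)
{
    pthread_mutexattr_t attr;
    int                 rc = pthread_mutexattr_init(&attr);

    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0) {
        rc = pthread_mutex_init(&lock->m, &attr);
    }
    (void)pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Top-level API: always takes the lock itself. */
static int exNvmAdd(exReLock* lock, int* store, int value)
{
    int rc = pthread_mutex_lock(&lock->m);

    if (rc != 0) {
        return rc;
    }
    *store += value;
    return pthread_mutex_unlock(&lock->m);
}

/* Compound operation: holds the lock and still calls the top-level API,
 * so no separate *Unlocked variant is required. */
int exCompound(exReLock* lock, int* store)
{
    int rc = pthread_mutex_lock(&lock->m);

    if (rc != 0) {
        return rc;
    }
    rc = exNvmAdd(lock, store, 1); /* re-entrant acquire succeeds */
    if (rc == 0) {
        rc = exNvmAdd(lock, store, 2);
    }
    (void)pthread_mutex_unlock(&lock->m);
    return rc;
}
```

The cost of this approach is that it relies on the platform providing recursive locks and it makes the true extent of each critical section harder to see from the call site.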

test/Makefile

test/wh_test_lock.c

AlexLanzano · 2026-01-28T15:54:51Z

test/wh_test_posix_threadsafe_stress.c

Yeah it might be nice to organize our posix tests in one spot. maybe test/posix or port/posix/test/ so we can leave our wh_test_*.c stuff generic for all platforms

wolfhsm/wh_nvm_internal.h

bigbrett · 2026-01-28T23:58:51Z

OK @billphipps @rizlik @AlexLanzano I have updated this to dramatically simplify based on our meeting discussion. I recommend reviewing the "fresh" diff against main, and not looking at the diff since your last review, as it will be VERY noisy given how much I ripped out. I will probably want to squash commits before we merge given that I redid it.

Notable changes:

Centralized lock acquisition: BIG refactor moving the locking from lower-level server module APIs (NVM, keystore, counter, etc.) up to the request handling layer (wh_Server_HandleXXXRequest() functions)
Removed wh_nvm_internal.h: Eliminated the separate internal header containing "unlocked" NVM variants; these are no longer needed with the new locking architecture
Added SHE support: Realized I missed the SHE module before this, so went ahead and added it
Updated wh_nvm.h documentation: Added missing Doxygen documentation for NVM APIs
Test cleanup: Fixed macro protection issues and test housekeeping in lock and stress tests

One thing to note: while the lock acquisition/release has been fully removed from the lower layer APIs and relocated to the handlers, I did keep the lock Init/Cleanup inside wh_Nvm_Init()/wh_Nvm_Cleanup(), since this should happen before any threads are spawned and before any server contexts are initialized. I can remove this and put the burden on the caller to init NVM then immediately initialize the lock, but figured this was simpler. It is commented accordingly in the NVM API. Let me know if we think this should instead be left to the caller.

rizlik

Very good, I have just minor comments.
I still can't properly understand the TSAN test; I'll try to give it a look soon.

rizlik · 2026-01-29T15:23:44Z

src/wh_nvm.c

+        context->cb      = NULL;
+        context->context = NULL;


NIT: aren't these = NULL redundant?

rizlik · 2026-01-29T15:56:25Z

src/wh_server_keystore.c

+                if (ret == WH_ERROR_OK) {
+                    /* Translate server keyId back to client format with flags
+                     */
+                    resp.id = wh_KeyId_TranslateToClient(meta->id);
+                }


NIT: translateToClient operation can be done outside of the critical section

rizlik · 2026-01-29T15:57:42Z

src/wh_server_keystore.c

+                if (ret == WH_ERROR_OK) {
+                    /* Translate server keyId back to client format with flags
+                     */
+                    resp.id = wh_KeyId_TranslateToClient(meta->id);


rizlik · 2026-01-29T16:00:59Z

src/wh_server_nvm.c

+            rc = WH_SERVER_NVM_LOCK(server);
+            if (rc == WH_ERROR_OK) {
+                /* Process the list action */
+                rc = wh_Nvm_List(server->nvm, req.access, req.flags,
+                                 req.startId, &resp.count, &resp.id);
+
+                (void)WH_SERVER_NVM_UNLOCK(server);


The List API is probably problematic from the point of view of the client, as it is supposed to work in multiple rounds. Consider adding a comment in the List documentation. For the future we might want to provide an alternative API as well.
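
To illustrate the concern, here is a single-threaded simulation of a client walking a list in rounds while another client erases an object in between. It assumes a List call that reports how many objects follow startId plus the first such ID; those semantics and all `ex*` names are hypothetical simplifications.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MAX 8

/* Toy store: present[id] != 0 means an object with that id exists.
 * Id 0 is reserved as the "start of list" marker. */
typedef struct {
    uint8_t present[EX_MAX];
} exStore;

/* One List round: count objects after startId and return the first one.
 * In a locked server, only this single call would be atomic. */
static void exList(const exStore* s, uint16_t startId, uint16_t* count,
                   uint16_t* id)
{
    *count = 0;
    *id    = 0;
    for (uint16_t i = (uint16_t)(startId + 1); i < EX_MAX; i++) {
        if (s->present[i]) {
            if (*count == 0) {
                *id = i;
            }
            (*count)++;
        }
    }
}

/* Walk the whole list round by round. After the first round, simulate
 * another client erasing `victim`. Returns the number of ids visited and
 * stores the count reported by the first round in *firstCount. */
int exWalk(exStore* s, uint16_t victim, uint16_t* firstCount)
{
    uint16_t count   = 0;
    uint16_t id      = 0;
    uint16_t start   = 0;
    int      visited = 0;
    int      round   = 0;

    for (;;) {
        exList(s, start, &count, &id);
        if (round == 0) {
            *firstCount = count;
        }
        if (count == 0) {
            break;
        }
        visited++;
        start = id;
        if (round == 0 && victim < EX_MAX) {
            s->present[victim] = 0; /* concurrent erase between rounds */
        }
        round++;
    }
    return visited;
}
```

Each round is consistent on its own, yet the count from the first round no longer matches what the client ends up seeing, which is the behavior worth documenting for multi-round callers.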

rizlik · 2026-01-29T16:01:42Z

src/wh_server_nvm.c

+
+                if (rc == 0) {
+                    resp.id     = meta.id;
+                    resp.access = meta.access;
+                    resp.flags  = meta.flags;
+                    resp.len    = meta.len;
+                    memcpy(resp.label, meta.label, sizeof(resp.label));
+                }


NIT: This can be moved out of the critical section
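
One way to act on this nit, sketched with toy types: copy the metadata into a local snapshot while holding the lock, then fill the response after releasing it. The `ex*` structures are hypothetical and much smaller than the real metadata and response messages.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t id;
    uint16_t len;
    char     label[8];
} exMeta;

typedef struct {
    int      rc;
    uint16_t id;
    uint16_t len;
    char     label[8];
} exResp;

/* Toy shared NVM holding metadata for a single object. */
typedef struct {
    pthread_mutex_t lock;
    exMeta          meta;
} exNvm;

int exHandleGetMetadata(exNvm* nvm, exResp* resp)
{
    exMeta snapshot;

    if (pthread_mutex_lock(&nvm->lock) != 0) {
        return -1;
    }
    /* Only the access to shared state needs to be serialized. */
    snapshot = nvm->meta;
    (void)pthread_mutex_unlock(&nvm->lock);

    /* The snapshot is private to this thread, so building the response
     * does not need the lock. */
    resp->rc  = 0;
    resp->id  = snapshot.id;
    resp->len = snapshot.len;
    memcpy(resp->label, snapshot.label, sizeof(resp->label));
    return 0;
}
```

The saving is small for a handful of field copies, so this is mostly about keeping critical sections limited to the shared accesses.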

rizlik · 2026-01-29T16:07:42Z

test/tsan.supp

+race:wolfCrypt_Init
+race:wolfCrypt_Cleanup
+
+# Races on gCryptoDev array in crypto callback registration
+race:wc_CryptoCb_RegisterDevice
+race:wc_CryptoCb_UnRegisterDevice
+race:wc_CryptoCb_GetDevice


I never used TSAN. Can proper locking in tests (or initialization in a single thread) avoid adding these exceptions?

rizlik · 2026-01-29T16:16:26Z

test/wh_test_posix_threadsafe_stress.c

+ * NOTE: Use client-facing keyId format (simple ID + flags), NOT server-internal
+ * format (WH_MAKE_KEYID). The server's wh_KeyId_TranslateFromClient() extracts
+ * only the lower 8 bits as ID and checks WH_KEYID_CLIENT_GLOBAL_FLAG for
+ * global. Using WH_MAKE_KEYID with user=1 sets bit 8, which is
+ * WH_KEYID_CLIENT_GLOBAL_FLAG!


I feel that this comment is misplaced.

rizlik · 2026-01-29T16:17:26Z

test/wh_test_posix_threadsafe_stress.c

+#define HOT_NVM_ID ((whNvmId)100)
+#define HOT_NVM_ID_2 ((whNvmId)101)
+#define HOT_NVM_ID_3 ((whNvmId)102)
+#define HOT_COUNTER_ID ((whNvmId)200)


can't we use HOT_KEY_ID_GLOBAL?

WOLFHSM_CFG_THREADSAFE: Adds framework for internal server thread saf…

2cfc0e4

…ety, serializing access to shared global resources like NVM and global keycache

bigbrett requested review from AlexLanzano and billphipps January 22, 2026 23:26

bigbrett assigned billphipps and AlexLanzano and unassigned billphipps Jan 22, 2026

bigbrett requested review from JacobBarthelmeh and rizlik January 22, 2026 23:27

bigbrett mentioned this pull request Jan 23, 2026

authentication manager feature addition #270

Open

billphipps requested changes Jan 25, 2026

rizlik requested changes Jan 26, 2026

AlexLanzano reviewed Jan 27, 2026

AlexLanzano reviewed Jan 27, 2026

AlexLanzano reviewed Jan 27, 2026

AlexLanzano requested changes Jan 28, 2026

Massive refactor to locking integration. Pull locking out of lower le…

a58ca2b

…vel server module APIs (keystore, NVM, counter, etc.) and acquire lock in request handling functions (e.g. wh_Server_HandleXXXRequest())

bigbrett assigned bigbrett and unassigned AlexLanzano and billphipps Jan 28, 2026

bigbrett force-pushed the server-thread-safe branch 3 times, most recently from 2a07204 to 4de1c8e Compare January 28, 2026 23:23

- cleanups, formatting, test housekeeping fixes surrounding macro

de16a6a

protection - TSAN options to fail-fast in CI on error

bigbrett force-pushed the server-thread-safe branch from 4de1c8e to de16a6a Compare January 28, 2026 23:24

update wh_nvm.h doxygen

d7a4566

bigbrett requested review from AlexLanzano, billphipps and rizlik January 28, 2026 23:58

bigbrett assigned rizlik, AlexLanzano and billphipps and unassigned bigbrett Jan 28, 2026

rizlik reviewed Jan 29, 2026
