1 change: 1 addition & 0 deletions Makefile
@@ -92,6 +92,7 @@ kernel_files-y += kernel/eventlog.c
# User-space process support
kernel_files-y += kernel/proc/uaccess.c kernel/proc/proc.c \
kernel/proc/syscall.c kernel/proc/loader.c kernel/proc/spawn.c \
+kernel/proc/cap.c \
kernel/proc/pipe.c kernel/proc/signal.c

# TCP/IP stack
18 changes: 8 additions & 10 deletions docs/pse51-matrix.md
@@ -100,7 +100,7 @@ The following PSE51 services are present and exercised by selftests
| `clock_gettime` | `SYS_CLOCK_GETTIME` | implemented | `CLOCK_MONOTONIC`, `CLOCK_REALTIME`, `CLOCK_THREAD_CPUTIME_ID`, `CLOCK_PROCESS_CPUTIME_ID`. |
| `clock_getres` | `SYS_CLOCK_GETRES` | implemented | Resolution derives from the timebase frequency; sub-millisecond on QEMU `virt`. |
| `clock_settime` | (none) | not-applicable | Realtime clock is anchored to boot ticks; no settable wall clock yet. |
-| `clock_nanosleep` | (none) | stubbed | The relative form is covered by `nanosleep`; the absolute form is tracked under PSE51 ABI alignment work in `TODO.md`. |
+| `clock_nanosleep` | (none) | stubbed | The relative form is covered by `nanosleep`; the absolute form is tracked under PSE51 ABI alignment work. |
| `nanosleep` | `SYS_NANOSLEEP` | implemented-with-mazu-abi | Accepts `struct timespec`. On `EINTR` the kernel writes the unexpired remainder to `*rem` when `rem` is non-NULL (best-effort: a bad `rem` pointer does not mask the `EINTR` return). On normal completion `*rem` is unmodified. `tv_sec` is bounded against u64 overflow to keep the kernel-side ns/ms conversion safe. |
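
The `tv_sec` bound mentioned in the `nanosleep` row can be illustrated with an overflow-checked conversion. This is a minimal sketch, not Mazu's kernel code: the helper name `timespec_to_ns_checked` and its boolean return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper (not from the Mazu tree): convert a timespec-style
 * (tv_sec, tv_nsec) pair to nanoseconds, rejecting any input whose result
 * would not fit in a u64. Returns true and stores the result on success.
 */
bool timespec_to_ns_checked(int64_t tv_sec, int64_t tv_nsec, uint64_t *out_ns)
{
    assert(out_ns != NULL);

    /* POSIX requires 0 <= tv_nsec < 1e9; negative seconds are rejected too. */
    if (tv_sec < 0 || tv_nsec < 0 || (uint64_t)tv_nsec >= NSEC_PER_SEC)
        return false;

    /* Bound tv_sec before multiplying so the product cannot wrap. */
    if ((uint64_t)tv_sec > (UINT64_MAX - (uint64_t)tv_nsec) / NSEC_PER_SEC)
        return false;

    *out_ns = (uint64_t)tv_sec * NSEC_PER_SEC + (uint64_t)tv_nsec;
    return true;
}
```

Checking the bound before the multiplication matters because a signed 64-bit `tv_sec` can hold values far larger than `UINT64_MAX / NSEC_PER_SEC`.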

## Synchronization (kernel handles)
@@ -124,14 +124,13 @@ sync handle table (`kernel/sync/sync_handle.c`).

| Interface (POSIX) | Mazu syscall | Status | Notes |
|---|---|---|---|
-| `pthread_self` | `SYS_THREAD_SELF` | implemented | Returns `td->id`. |
-| `pthread_create` | `SYS_THREAD_CREATE` | implemented | PROC_THREAD_MAX = 4. Slot reservation under `proc_table_lock`, per-thread stack VA inside the proc slot. Priority inherits from creator; an explicit priority arg ABI is a future extension. |
-| `pthread_join` | `SYS_THREAD_JOIN` | implemented | Blocks on `target->td_join_wq`; atomically claims `EXITED -> REAPED` via cmpxchg before reaping. EDEADLK on self-join, ESRCH on unknown TID, EINVAL on detached/already-reaped, EINTR on cancellation. |
+| `pthread_self` | `SYS_THREAD_SELF` | implemented | Returns the caller's `CAP_TYPE_THREAD` small-int handle. |
+| `pthread_create` | `SYS_THREAD_CREATE` | implemented | PROC_THREAD_MAX = 4. Slot reservation under `proc_table_lock`, per-thread stack VA inside the proc slot. Returns a fresh `CAP_TYPE_THREAD` handle. Priority inherits from creator; an explicit priority arg ABI is a future extension. |
+| `pthread_join` | `SYS_THREAD_JOIN` | implemented | Blocks on `target->td_join_wq`; atomically claims `EXITED -> REAPED` via cmpxchg before reaping. EDEADLK on self-join, ESRCH on unknown thread handle, EINVAL on detached/already-reaped, EINTR on cancellation. |
| `pthread_detach` | `SYS_THREAD_DETACH` | implemented | Tries `JOINABLE -> DETACHED` first; if the target already exited, claims `EXITED -> REAPED` and reaps inline. Either claim wakes pending joiners. |
| `pthread_exit` | `SYS_THREAD_EXIT` | implemented | Last-thread exit collapses into `proc_exit`; non-last exit unwinds the thread's robust futex list. A user thread that returns from its entry function lands on the per-process unmapped trampoline at `signal_trampoline_pc(p)+4`; the trap handler synthesizes `SYS_THREAD_EXIT(0)`, so an implicit return is equivalent to an explicit pthread_exit. |
-| `pthread_setschedparam` / `_getschedparam` | `SYS_THREAD_SETSCHEDPARAM` / `_GETSCHEDPARAM` | implemented-with-mazu-abi | Take a kernel TID (0 = self) and a scalar priority. Privilege bound: cannot raise above caller's own base priority. |
+| `pthread_setschedparam` / `_getschedparam` | `SYS_THREAD_SETSCHEDPARAM` / `_GETSCHEDPARAM` | implemented-with-mazu-abi | Take a `CAP_TYPE_THREAD` handle (0 = self) and a scalar priority. Privilege bound: cannot raise above caller's own base priority. |
| `pthread_attr_*` | (libc) | stubbed | Attribute objects (`setstack`, `setdetachstate`, `setschedpolicy`, `setschedparam`, `setinheritsched`) are user-space libc concerns, but a "PSE51 complete" claim requires them to exist somewhere in the toolchain image. Mazu does not ship a libc with these wrappers today. The kernel ABI accepts the resolved (entry, arg, stack, prio) tuple; once a libc lands, this row flips to `not-applicable`. |
-| `pthread_setschedparam` / `_getschedparam` | (none) | stubbed | Today scalar priority is set via `SYS_SCHED_SETPARAM` / `_GETPARAM` on the calling thread only. Per-thread policy/priority change blocked on per-thread sched-parameter state migration. |
| `pthread_spin_init` / `_lock` / `_trylock` / `_unlock` / `_destroy` | (none) | stubbed | Mazu has kernel-internal spinlocks, but no userspace-visible busy-wait primitive. The `_POSIX_SPIN_LOCKS` macro is therefore intentionally *not* defined and `_SC_SPIN_LOCKS` returns -1 — advertising it would let an app gate on the macro and call absent APIs. Expect a libc-side implementation backed by a futex once threads land, not a kernel syscall. |
| `pthread_cancel` / `pthread_setcancelstate` / `pthread_testcancel` | `SYS_THREAD_CANCEL` / `SYS_THREAD_SETCANCELSTATE` / `SYS_THREAD_TESTCANCEL` | implemented | Deferred cancellation: pthread_cancel sets `td_cancel_pending`; the target observes the bit at the next cancellation point and exits with code -ECANCELED. ASYNC type is treated as DEFERRED because Mazu has no in-kernel cancellation points other than blocking syscalls. |
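
The `pthread_join` and `pthread_detach` rows both rely on compare-and-swap claims so that exactly one caller reaps an exited thread. The following is a minimal sketch of that idea using C11 atomics. The state names, the `td_sketch` struct, and both function names are hypothetical; Mazu's real thread descriptor and its wakeup of pending joiners are not shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lifecycle states; the real Mazu names and values may differ. */
enum td_life { TD_JOINABLE, TD_DETACHED, TD_EXITED, TD_REAPED };

struct td_sketch {
    _Atomic int life;
};

/* Claim EXITED -> REAPED. Only one caller can win this transition,
 * so only one caller goes on to free the thread's resources.
 */
bool td_claim_reap(struct td_sketch *td)
{
    int expected = TD_EXITED;

    assert(td != NULL);
    return atomic_compare_exchange_strong(&td->life, &expected, TD_REAPED);
}

/* Detach: try JOINABLE -> DETACHED first; if the thread has already
 * exited, fall back to the reap claim. *reap_inline tells the caller
 * whether it now owns the reap.
 */
int td_detach(struct td_sketch *td, bool *reap_inline)
{
    int expected = TD_JOINABLE;

    *reap_inline = false;
    if (atomic_compare_exchange_strong(&td->life, &expected, TD_DETACHED))
        return 0;
    /* On failure, expected holds the state that was actually observed. */
    if (expected == TD_EXITED && td_claim_reap(td)) {
        *reap_inline = true;
        return 0;
    }
    return -EINVAL; /* already detached, already reaped, or lost the race */
}
```

A joiner would use `td_claim_reap` the same way after waking, which is why a concurrent join and detach cannot both free the same thread.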

@@ -143,7 +142,7 @@ sync handle table (`kernel/sync/sync_handle.c`).
| `sigaction` | `SYS_SIGACTION` | implemented | Per-process disposition. `sa_mask` is a `u32` bitmask, not `sigset_t`. |
| `sigreturn` | `SYS_SIGRETURN` | implemented | Cookie-validated frame teardown. |
| `pthread_sigmask` | `SYS_PTHREAD_SIGMASK` | implemented | Same wire shape as `SYS_SIGPROCMASK`; both operate on the calling thread's `td_sig.blocked`. Distinct syscall numbers so libc can keep `pthread_sigmask` and `sigprocmask` as separate ABI surfaces. |
-| `pthread_kill` | `SYS_PTHREAD_KILL` | implemented | Thread-directed signal: bit lands on the named thread's `td_sig.pending` rather than the per-proc `proc_pending` mask. SIGKILL rejected with EINVAL (must be process-wide). |
+| `pthread_kill` | `SYS_PTHREAD_KILL` | implemented | Thread-directed signal: bit lands on the named thread's `td_sig.pending` rather than the per-proc `proc_pending` mask. Takes a `CAP_TYPE_THREAD` handle. SIGKILL rejected with EINVAL (must be process-wide). |
| `sigsuspend` | `SYS_SIGSUSPEND` | implemented | Replace blocked mask with the supplied set, yield-loop until a deliverable signal arrives, restore prior mask, return EINTR. |
| `sigtimedwait` / `sigwait` / `sigwaitinfo` | `SYS_SIGTIMEDWAIT` | implemented-with-mazu-abi | Block until any signal in the supplied set is pending; dequeue without invoking the handler; return signo. Honors `struct timespec *` timeout (NULL = wait forever; expired = EAGAIN). |
| `sigqueue` value delivery | (none) | stubbed | Mazu signals are level-style: a single bit per signal in `pending`, no per-signal value queue. The wait API set above advertises `_POSIX_REALTIME_SIGNALS = 1` (subset) but `sigqueue` with a payload value requires an additional bounded queue subsystem. |
@@ -154,7 +153,7 @@ sync handle table (`kernel/sync/sync_handle.c`).

| Interface (POSIX) | Mazu syscall | Status | Notes |
|---|---|---|---|
-| `timer_create` | `SYS_TIMER_CREATE` | implemented-with-mazu-abi | Pool-allocated (8 timers per process). Signal number is fixed to `SIGALRM`; the per-call target thread (`SIGEV_THREAD_ID`) is supplied via `posix_timer_settime`'s new `target_tid` parameter (a3 of `SYS_TIMER_SETTIME`); pass 0 for process-directed delivery. If the targeted thread has already exited at expiry, the signal is silently dropped (POSIX strict). |
+| `timer_create` | `SYS_TIMER_CREATE` | implemented-with-mazu-abi | Pool-allocated (8 timers per process). Signal number is fixed to `SIGALRM`; the per-call target thread (`SIGEV_THREAD_ID`) is supplied via `posix_timer_settime`'s new thread-handle parameter (a3 of `SYS_TIMER_SETTIME`); pass 0 for process-directed delivery. If the targeted thread has already exited at expiry, the signal is silently dropped (POSIX strict). |
| `timer_settime` | `SYS_TIMER_SETTIME` | implemented-with-mazu-abi | ABI takes `u64 value_ms, u64 interval_ms` instead of `struct itimerspec`. `value_ms == 0` disarms (POSIX semantics). |
| `timer_gettime` | `SYS_TIMER_GETTIME` | implemented-with-mazu-abi | Returns remaining milliseconds as a scalar. |
| `timer_getoverrun` | `SYS_TIMER_GETOVERRUN` | implemented | Increments only while the previous `SIGALRM` is still pending (POSIX overrun semantics). |
@@ -278,8 +277,7 @@ The bounded multi-threaded process model is in place: per-thread
state migration (signal pending/blocked, signal-frame chain, robust
futex list, errno TLS) and the user-visible pthread surface
(`SYS_THREAD_CREATE` and friends) have both landed, with
-`PROC_THREAD_MAX = 4`. The two remaining gaps, tracked as
-non-blocking follow-ups in `TODO.md`, are the `sigqueue` payload
+`PROC_THREAD_MAX = 4`. The two remaining gaps are the `sigqueue` payload
queue (requires a bounded per-signal queue subsystem) and the
`pthread_attr_*` libc family (strictly a libc-side concern; the
kernel ABI already accepts the resolved (entry, arg, prio) tuple).
251 changes: 251 additions & 0 deletions include/mazu/cap.h
@@ -0,0 +1,251 @@
/* SPDX-License-Identifier: MIT */
/* Capability-based security: public interface.
*
* Every process owns a fixed cap_space (CAP_SPACE_SLOTS entries). Each
* slot is an unforgeable handle to a typed kernel object. The slot word
* packs object_index, type, rights, type_meta, and a 32-bit generation;
* the cap layer enforces unforgeability, single-hop delegation, lazy
* revocation via generation bump, and an active-use pin that keeps the
* underlying object alive across blocking syscalls.
*
* Userspace sees two handle shapes:
* - posix_fd: the small-int slot_index. This is what sys_open returns
* and what sys_read/sys_write/sys_close consume. The cap layer does
* the slot-bound permission check on every dereference.
* - cap_handle: a 64-bit token (cap_make_handle / cap_get_token) that
* carries the generation/type/rights snapshot taken at mint time.
* Required for cap-management syscalls that need stale-handle
* detection across threads or across spaces.
*
* Rights are a 4-bit lattice (READ, WRITE, EXEC, GRANT). Rights are
* monotonically non-increasing on delegation: cap_transfer strips GRANT,
* while spawn-style cap_inherit_fd preserves the source rights snapshot.
* Plain sys_open does not mint GRANT.
*
* Implementation lives in kernel/proc/cap.c; the threat model, slot bit
* layout, lock ordering, and refcount lifecycle are documented in that
* file's header.
*/

#ifndef MAZU_CAP_H
#define MAZU_CAP_H

#include <mazu/base.h>
#include <mazu/vfs.h>

struct pipe;
struct posix_timer;
struct proc;

/* Per-process slot count. Sized from the actual object budget
* (PROC_FD_MAX + thread/timer/IPC + reserve), not a round number.
* No heap growth, no dynamic resize.
*/
#define CAP_SPACE_SLOTS 128

/* System-wide pool of delegate_record entries. Each cap_transfer
* allocates one; cap_revoke_delegate consumes it. Sized to cover the
* worst-case outstanding-delegation count across all processes.
*/
#define CAP_DELEGATE_RECORD_MAX 1024

/* Typed handle kinds. The slot word stores the type in 4 bits, so up to
 * 16 values fit; the enum below uses 0..14, leaving one value (15)
 * reserved for a future kernel-object surface.
 * Adding a transferable type also requires extending the
 * cap_release_object and cap_object_inc_ref dispatches in
 * kernel/proc/cap.c.
 */
enum cap_type {
    CAP_TYPE_NONE = 0, /* empty / dropped slot */
    CAP_TYPE_FD = 1, /* POSIX file descriptor (VFS, pipe, console) */
    CAP_TYPE_TIMER = 2, /* POSIX interval timer */
    CAP_TYPE_THREAD = 3, /* pthread handle; reserved-slot range */
    CAP_TYPE_IRQ = 4, /* IRQ control (reserved) */
    CAP_TYPE_ENDPOINT = 5, /* IPC endpoint (reserved) */
    CAP_TYPE_DELEGATE = 6, /* supervisor-side handle on an outstanding grant */
    CAP_TYPE_CAPSPACE = 7, /* meta cap on the cap_space itself (reserved) */
    CAP_TYPE_SCHED = 8, /* scheduling control (reserved) */
    CAP_TYPE_MUTEX = 9, /* pi_mutex pool entry */
    CAP_TYPE_CONDVAR = 10, /* condvar pool entry */
    CAP_TYPE_SEMAPHORE = 11, /* semaphore pool entry */
    CAP_TYPE_BARRIER = 12, /* barrier pool entry */
    CAP_TYPE_RWLOCK = 13, /* rwlock pool entry */
    CAP_TYPE_MQUEUE = 14, /* POSIX message queue */
};

/* 4-bit rights lattice. Rights cannot be amplified after mint:
* - cap_transfer requires GRANT on the source and produces a
* destination without GRANT (single-hop attenuation).
* - cap_inherit_fd clones the source slot into another process for
* spawn-style FD inheritance, preserving the source rights snapshot.
* - Cap lookups verify (slot.rights & required_rights) == required_rights;
* a partial-rights cap is rejected for the operation that exceeds it.
*/
#define CAP_RIGHT_READ BIT(0) /* read-side ops (read, recv, get, query) */
#define CAP_RIGHT_WRITE BIT(1) /* write-side ops (write, send, post, mutate) */
#define CAP_RIGHT_EXEC BIT(2) /* reserved for future memory caps */
#define CAP_RIGHT_GRANT BIT(3) /* may be cap_transfer'd or inherited */
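
The rights rules in the comment above reduce to a subset test on lookup and a mask on transfer. The sketch below illustrates both. It is not the code in `kernel/proc/cap.c`: the function names are hypothetical, `BIT()` is redefined locally as a stand-in for the macro from `mazu/base.h`, and intersecting the requested rights with the source rights (rather than rejecting an over-broad request) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BIT(n) (1u << (n)) /* local stand-in for the mazu/base.h macro */

#define CAP_RIGHT_READ BIT(0)
#define CAP_RIGHT_WRITE BIT(1)
#define CAP_RIGHT_EXEC BIT(2)
#define CAP_RIGHT_GRANT BIT(3)

/* Lookup-side check: every required bit must be present in the held set. */
bool rights_satisfy(uint8_t held, uint8_t required)
{
    return (held & required) == required;
}

/* Transfer-side attenuation (hypothetical helper): the source must hold
 * GRANT, the destination never receives GRANT, and the destination can
 * never hold a right the source lacks. Returns false if the transfer is
 * refused.
 */
bool attenuate_for_transfer(uint8_t src, uint8_t requested, uint8_t *dst)
{
    assert(dst != NULL);

    if (!(src & CAP_RIGHT_GRANT))
        return false;
    *dst = (uint8_t)(src & requested & ~CAP_RIGHT_GRANT);
    return true;
}
```

Because the result is always a subset of `src` with GRANT cleared, a transferred cap cannot be transferred again, which is the single-hop property the header describes.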

/* type_meta bit assignments for CAP_TYPE_FD. Other types reserve their
* own bits in the same 11-bit field but do not use them today.
*/
#define CAP_FD_META_CLOEXEC BIT(0) /* close-on-exec; dup() clears it */

/* Backend tag for CAP_TYPE_FD entries. The kind selects the dispose
* hook (vfs_close vs pipe_close vs noop for console) when the last cap
* to the underlying object drops.
*/
enum cap_fd_kind {
    CAP_FD_KIND_CONSOLE = 0,
    CAP_FD_KIND_VFS = 1,
    CAP_FD_KIND_PIPE = 2,
};

struct cap_space {
    /* Per-slot capability word. The bit layout is documented in
     * kernel/proc/cap.c; here it is opaque -- callers go through the
     * cap_lookup_* / cap_open_* / cap_drop_* helpers.
     */
    u64 slots[CAP_SPACE_SLOTS];
    /* Per-slot grant_epoch snapshot. For slots minted by cap_transfer
     * (or by spawn-time inheritance from such a slot), this records the
     * originating delegate_record's 64-bit monotonic epoch.
     * cap_revoke_delegate's scan matches on (type, object_index,
     * delegate_epoch) so that 32-bit slot generation wrapping or two
     * unrelated grants of the same object cannot be confused. Zero for
     * slots that are not part of any outstanding delegation.
     */
    u64 delegate_epoch[CAP_SPACE_SLOTS];
};

/* Object-constructor return shape. Used by cap-system internal mint
* paths that take a fully-resolved object pointer and assign it to a
* slot under fd_lock.
*/
struct cap_ctor_result {
    u16 object_index;
    u8 rights;
    u16 type_meta;
};

/* Active-use pin on a kernel object. Returned by cap_lookup_fd /
 * cap_lookup_timer / cap_lookup_object after the cap is validated and
 * the underlying pool entry's refcount has been bumped. The pin keeps
 * the pool entry alive across concurrent revocation and blocking
 * syscalls; the caller MUST release every non-zeroed return with
 * cap_put_ref.
 *
 * The empty / dropped state is type == CAP_TYPE_NONE; cap_put_ref
 * tests the type field (not ptr) for liveness, since lookup variants
 * for sync primitives and mqueue return ptr == NULL and fetch the
 * typed pointer via a separate _get helper.
 */
struct cap_ref {
    void *ptr;
    u16 object_index;
    u8 type;
};

/* Read-only snapshot of a cap_space slot, returned by cap_slot_read /
* cap_lookup_slot / cap_lookup_token. The slot_index is the array
* position; the other fields mirror the slot word.
*/
struct cap_slot_view {
    bool valid;
    u8 slot_index;
    u16 object_index;
    u8 type;
    u8 rights;
    u16 type_meta;
    u32 generation;
};
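
To make the slot-word fields and generation-based revocation concrete, here is a minimal sketch of packing the five fields into 64 bits and detecting a stale token after a generation bump. The field order and bit offsets are assumptions chosen for illustration (the actual layout is documented in `kernel/proc/cap.c` and is not shown in this diff); only the widths of `type` (4), `rights` (4), `type_meta` (11), and `generation` (32) come from the header, which leaves 13 bits for `object_index`. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct slot_fields {
    uint16_t object_index; /* 13 bits in this sketch */
    uint8_t type;          /* 4 bits */
    uint8_t rights;        /* 4 bits */
    uint16_t type_meta;    /* 11 bits */
    uint32_t generation;   /* 32 bits */
};

/* Assumed layout: [63:32] generation, [31:21] type_meta,
 * [20:17] rights, [16:13] type, [12:0] object_index.
 */
uint64_t slot_pack(const struct slot_fields *f)
{
    assert(f != NULL);
    return (uint64_t)(f->object_index & 0x1FFFu)
           | ((uint64_t)(f->type & 0xFu) << 13)
           | ((uint64_t)(f->rights & 0xFu) << 17)
           | ((uint64_t)(f->type_meta & 0x7FFu) << 21)
           | ((uint64_t)f->generation << 32);
}

struct slot_fields slot_unpack(uint64_t w)
{
    struct slot_fields f;

    f.object_index = (uint16_t)(w & 0x1FFFu);
    f.type = (uint8_t)((w >> 13) & 0xFu);
    f.rights = (uint8_t)((w >> 17) & 0xFu);
    f.type_meta = (uint16_t)((w >> 21) & 0x7FFu);
    f.generation = (uint32_t)(w >> 32);
    return f;
}

/* Lazy revocation: bump only the generation; the low 32 bits are untouched
 * and the generation wraps modulo 2^32.
 */
uint64_t slot_bump_generation(uint64_t w)
{
    return w + (1ULL << 32);
}

/* A token minted earlier is current only if type and generation still match. */
bool token_is_current(uint64_t token, uint64_t slot_word)
{
    struct slot_fields t = slot_unpack(token);
    struct slot_fields s = slot_unpack(slot_word);

    return t.type == s.type && t.generation == s.generation;
}
```

Revocation here never has to find outstanding tokens: holders discover staleness on their next lookup, which is what makes the scheme lazy.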

/* Per-FD pool entry. One per open file description; multiple cap_space
* slots may reference the same entry (dup, transfer, inheritance) and
* refcount tracks how many.
*/
struct fd_pool_entry {
    bool in_use;
    u8 kind; /* enum cap_fd_kind */
    bool pipe_read_end;
    bool is_seekable;
    u8 console_id;
    sz offset; /* POSIX dup'd FDs share this offset */
    u32 refcount; /* cap_space slots + active-use pins */
    struct vfs_file file;
    struct pipe *pipe;
};

void cap_init(void);
void cap_space_init(struct proc *p);
void cap_space_teardown(struct proc *p);

u64 cap_make_handle(const struct cap_slot_view *slot);
i64 cap_get_token(struct proc *p, i32 slot_idx, u8 expected_type);
i64 cap_drop_token(struct proc *p, u64 token);
i64 cap_transfer(struct proc *src, u16 dst_pid, u64 token, u8 new_rights);
i64 cap_revoke_delegate(struct proc *src, u64 delegate_token);

i64 cap_close_fd(struct proc *p, i32 fd);
struct cap_ref cap_lookup_fd(struct proc *p, i32 fd, u8 required_rights);
void cap_put_ref(struct cap_ref *ref);
i32 cap_dup_fd(struct proc *p, i32 oldfd, i32 newfd_hint, bool exact_target);
i32 cap_inherit_fd(struct proc *src, struct proc *dst, i32 src_fd, i32 dst_fd);
i32 cap_open_vfs(struct proc *p,
                 struct vfs_file file,
                 u8 rights,
                 bool is_seekable,
                 i32 slot_hint,
                 bool exact_target);
i32 cap_open_pipe(struct proc *p,
                  struct pipe *pipe,
                  bool read_end,
                  u8 rights,
                  i32 slot_hint,
                  bool exact_target);
i32 cap_open_console(struct proc *p,
                     u8 console_id,
                     u8 rights,
                     i32 slot_hint,
                     bool exact_target);
i32 cap_open_handle(struct proc *p,
                    u16 object_index,
                    u8 type,
                    u8 rights,
                    i32 slot_hint,
                    bool exact_target);
i32 cap_open_timer(struct proc *p,
                   u16 object_index,
                   u8 rights,
                   i32 slot_hint,
                   bool exact_target);
bool cap_fd_is_valid(struct proc *p, i32 fd);
bool cap_fd_has_rights(struct proc *p, i32 fd, u8 rights);
bool cap_fd_is_seekable(struct proc *p, i32 fd);
bool cap_fd_is_pipe(struct proc *p, i32 fd);
bool cap_fd_pipe_read_end(struct proc *p, i32 fd);
struct cap_slot_view cap_slot_read(struct proc *p, i32 slot_idx);
i32 cap_find_free_fd(struct proc *p);
bool cap_lookup_slot(struct proc *p,
                     i32 handle,
                     u8 required_rights,
                     u8 expected_type,
                     struct cap_slot_view *out);
bool cap_lookup_token(struct proc *p,
                      u64 token,
                      u8 required_rights,
                      u8 expected_type,
                      struct cap_slot_view *out);
/* cap_lookup_object: validates a cap slot AND takes an active-use ref on the
 * underlying object. The returned cap_ref carries the type and object_index.
 * The ref keeps the object alive across concurrent revocation/destroy and
 * blocking syscalls; the caller MUST release every non-zeroed return with
 * cap_put_ref.
 * Returns a zeroed cap_ref on EBADF/EACCES/EINVAL.
 */
struct cap_ref cap_lookup_object(struct proc *p,
                                 i32 handle,
                                 u8 required_rights,
                                 u8 expected_type);
struct cap_ref cap_lookup_timer(struct proc *p, i32 handle, u8 required_rights);

#endif /* MAZU_CAP_H */
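
The lookup/put pairing that the header requires of `cap_lookup_*` callers can be sketched with a toy pool. Everything here is hypothetical (`sk_pool`, `sk_lookup`, `sk_put`, and the `SK_TYPE_*` constants are not Mazu names), and the sketch is single-threaded: the real implementation would need the locking and refcount lifecycle described in `kernel/proc/cap.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SK_TYPE_NONE = 0, SK_TYPE_FD = 1 };
enum { SK_POOL_SIZE = 4 };

struct sk_entry {
    int in_use;
    uint32_t refcount; /* slot references plus active-use pins */
};

struct sk_ref {
    void *ptr;
    uint16_t object_index;
    uint8_t type;
};

struct sk_entry sk_pool[SK_POOL_SIZE];

/* Validate the index and take an active-use pin. A zeroed return means the
 * lookup failed and there is nothing to release.
 */
struct sk_ref sk_lookup(int idx)
{
    struct sk_ref ref = {0};

    if (idx < 0 || idx >= SK_POOL_SIZE || !sk_pool[idx].in_use)
        return ref;
    sk_pool[idx].refcount++;
    ref.ptr = &sk_pool[idx];
    ref.object_index = (uint16_t)idx;
    ref.type = SK_TYPE_FD;
    return ref;
}

/* Release the pin. Liveness is decided by the type field, not ptr, so a
 * zeroed or already-released ref is a harmless no-op.
 */
void sk_put(struct sk_ref *ref)
{
    assert(ref != NULL);
    if (ref->type == SK_TYPE_NONE)
        return;
    sk_pool[ref->object_index].refcount--;
    *ref = (struct sk_ref){0};
}
```

Zeroing the ref inside `sk_put` makes a double release idempotent, which keeps error paths simple: a caller can put the ref unconditionally on every exit.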