liblasx2lsx intercepts LASX (Loongson Advanced SIMD Extension, 256-bit) instructions at runtime on systems without native LASX support. It relies on three mechanisms:
- LD_PRELOAD — a shared library loaded before all others
- SIGILL capture — LASX instructions not supported by hardware raise SIGILL
- Dual translation paths — pure C emulation (fallback) or JIT compilation to LSX (128-bit)
Application (uses LASX instructions)
│ LD_PRELOAD=./liblasx2lsx.so
▼
┌─────────────────────────────────────┐
│ sigill_hook.c │
│ register_sigill_handler() │
│ → sigaction(SIGILL) │
└─────────┬───────────────────────────┘
│ LASX instr → SIGILL
▼
┌─────────────────────────────────────┐
│ sigill_handler() fallback chain: │
│ │
│ 1. lasx_emu_create_interpret_block │ JIT batch (block/loop/usedef)
│ 2. lasx_emu_create_interpret │ JIT single instruction
│ 3. do_lasx_emu() │ Pure C emulation
└─────────────────────────────────────┘
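The registration step in the diagram can be illustrated with plain POSIX calls. The sketch below is not the library's code: the demo_* names are hypothetical, and the handler only counts deliveries instead of emulating anything.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Number of SIGILL deliveries seen by the demo handler. */
static volatile sig_atomic_t demo_hits = 0;

/* SA_SIGINFO-style handler: the third argument is the ucontext pointer
   that a real emulator would inspect and modify. Here it is ignored. */
static void demo_sigill_handler(int sig, siginfo_t *info, void *context) {
    (void)sig;
    (void)info;
    (void)context;
    demo_hits++;
}

/* Install the handler; returns 0 on success, -1 on failure (as sigaction does). */
int demo_install_sigill_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = demo_sigill_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGILL, &sa, NULL);
}

int demo_sigill_hits(void) {
    return (int)demo_hits;
}
```

Calling demo_install_sigill_handler() and then raise(SIGILL) runs the handler once and returns to the caller. A fault from a real unsupported instruction differs: the handler has to advance or redirect the saved PC, otherwise the same instruction faults again.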
A LASX register (XR) is 256 bits = 4 × uint64_t slots (little-endian):
D[0] = B[0..7] → W[0] = B[0..3], W[1] = B[4..7]
D[1] = B[8..15] → W[2] = B[8..11], W[3] = B[12..15]
D[2] = B[16..23] → W[4] = B[16..19], W[5] = B[20..23]
D[3] = B[24..31] → W[6] = B[24..27], W[7] = B[28..31]
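The slot arithmetic behind this layout can be written out explicitly. This is a minimal sketch, assuming a register image of four uint64_t slots as described above. demo_xr_t and the demo_* helpers are hypothetical names, not the library's accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one 256-bit XR: four 64-bit slots, D[0] lowest. */
typedef struct {
    uint64_t d[4];
} demo_xr_t;

/* Word element W[i], i in 0..7: lives in D[i / 2], low word first. */
uint32_t demo_xr_word(const demo_xr_t *xr, unsigned i) {
    assert(i < 8);
    return (uint32_t)(xr->d[i / 2] >> (32 * (i % 2)));
}

/* Byte element B[i], i in 0..31: lives in D[i / 8], lowest byte first. */
uint8_t demo_xr_byte(const demo_xr_t *xr, unsigned i) {
    assert(i < 32);
    return (uint8_t)(xr->d[i / 8] >> (8 * (i % 8)));
}
```

Using shifts rather than pointer casts keeps the sketch independent of host byte order while still matching the little-endian element numbering in the table.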
Each thread has a thread_data_t:
typedef struct {
uint64_t gpr[32]; // GPR snapshot
uint64_t data[32][4]; // XR[0..31], 4 uint64_t each
uint64_t data_vr[32][2]; // VR high-half storage (for JIT)
FILE *log_file; // JIT log
uint32_t instr_count; // instruction counter
uint64_t vregs_fcc; // floating-point condition codes
} thread_data_t;

Access via thread_data_get(), which lazily initializes the pthread TLS slot.
The vreg_read_* / vreg_write_* accessors are defined in lasx_emu_private.h. Each operates on a specific element width:
| Function | Width | Granularity |
|---|---|---|
| vreg_read_64 / vreg_write_64 | 64-bit | Per dword slot (0..3) |
| vreg_read_32 / vreg_write_32 | 32-bit | Per word slot (0..7) |
| vreg_read_16 / vreg_write_16 | 16-bit | Per halfword slot (0..15) |
| vreg_read_8 / vreg_write_8 | 8-bit | Per byte slot (0..31) |
| vreg_read_128 / vreg_write_128 | 128-bit | Per 128-bit half (0..1) |
Critical: Write granularity must match the element width. Writing 64-bit when the output is 16-bit elements will clobber adjacent values.
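To make the clobbering concrete, here is a small sketch that contrasts a 16-bit write with a 64-bit write on the same register image. The demo_write_* functions are hypothetical and take a plain uint64_t[4]. The real vreg_write_* accessors take additional context arguments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit write: replace halfword idx (0..15), keep its neighbours. */
void demo_write_16(uint64_t xr[4], unsigned idx, uint16_t value) {
    assert(idx < 16);
    unsigned shift = 16 * (idx % 4);
    uint64_t mask = (uint64_t)0xffff << shift;
    xr[idx / 4] = (xr[idx / 4] & ~mask) | ((uint64_t)value << shift);
}

/* Hypothetical 64-bit write: replaces the whole dword slot (0..3),
   including the three halfwords the caller did not mean to touch. */
void demo_write_64(uint64_t xr[4], unsigned slot, uint64_t value) {
    assert(slot < 4);
    xr[slot] = value;
}
```

Starting from a slot holding 0x1111222233334444, demo_write_16 on halfword 1 changes only the 0x3333 field, while demo_write_64 with the same value wipes the other three halfwords.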
register_sigill_handler() is a __attribute__((constructor)) that runs when the library is loaded via LD_PRELOAD:
- Skips if not loaded via LD_PRELOAD
- Reads DISABLE_LSX_INTRINSICS env var
- Reads LASX_PERF_STATS=1 for performance counters
- Reads LIBLASX2LSX_INTERPRET_MODE for JIT mode selection:
  - block: batch consecutive XV instructions
  - loop: detect and JIT whole loops
  - usedef: use-def analysis for VR remapping
  - Default: usedef | loop | frag combined
- Calls lasx_init_interpret() to mmap JIT code buffer at 0x60000000
- Registers sigill_handler via sigaction(SIGILL, SA_SIGINFO)
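The mode selection step might look roughly like the following. This is a sketch under assumptions: the flag constants and demo_parse_mode() are hypothetical, each keyword is mapped to a single flag, and falling back to the default for an unrecognized string is a guess at the behaviour rather than something the source states.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values; the library's real constants are not shown here. */
enum {
    DEMO_MODE_BLOCK  = 1 << 0,
    DEMO_MODE_LOOP   = 1 << 1,
    DEMO_MODE_USEDEF = 1 << 2,
    DEMO_MODE_FRAG   = 1 << 3
};

/* Map the value of LIBLASX2LSX_INTERPRET_MODE to a flag set.
   NULL (variable unset) or an unrecognized string selects the default. */
int demo_parse_mode(const char *value) {
    if (value != NULL) {
        if (strcmp(value, "block") == 0)  return DEMO_MODE_BLOCK;
        if (strcmp(value, "loop") == 0)   return DEMO_MODE_LOOP;
        if (strcmp(value, "usedef") == 0) return DEMO_MODE_USEDEF;
    }
    return DEMO_MODE_USEDEF | DEMO_MODE_LOOP | DEMO_MODE_FRAG;
}
```

A constructor would typically call it as demo_parse_mode(getenv("LIBLASX2LSX_INTERPRET_MODE")) once at load time and cache the result.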
void sigill_handler(int sig, siginfo_t *info, void *context) {
    ucontext_t *uc = (ucontext_t *)context;
    uint32_t instr = *(uint32_t *)uc->uc_mcontext.__pc; // faulting instruction word
    // 1. Try fragment JIT
    if (lasx_emu_create_interpret_fragment(uc)) { return; }
    // 2. Try block JIT (loop/usedef/plain block)
    if (lasx_emu_create_interpret_block(uc)) { return; }
    // 3. Try single-instruction JIT
    if (lasx_emu_create_interpret(uc, instr)) { return; }
    // 4. Fallback: pure C emulation (returns nonzero on success)
    int ret = do_lasx_emu(uc, instr);
    if (ret) {
        uc->uc_mcontext.__pc += 4; // advance past the instruction
    } else {
        // Unknown instruction: restore default SIGILL so the retried
        // instruction terminates the process
        signal(SIGILL, SIG_DFL);
    }
}

Each JIT step modifies the original code in-place by replacing the target instruction with jiscr1 (jump to interpreter scratch register). Subsequent executions hit the JIT code directly, not the SIGILL handler.
| Variable | Effect |
|---|---|
| LD_PRELOAD=./liblasx2lsx.so | Load the library |
| LIBLASX2LSX_INTERPRET_MODE=block | Block JIT only |
| LIBLASX2LSX_INTERPRET_MODE=usedef | Use-def JIT |
| LIBLASX2LSX_DEBUG=1 | Enable debug logging (tdlog) |
| LASX_PERF_STATS=1 | Per-instruction performance counters |
| LASX_PROFILE=1 | JIT profiling (requires profile build) |
The main dispatcher do_lasx_emu() uses 13 sequential switch statements organized by opcode bit fields. The rule:
shift = 32 - OP_number
| Switch shift | OP prefix | Example cases |
|---|---|---|
| instr >> 18 | OP14 | xvpermi, xvldi, xvandi_b, xvbitseli_b, xvshuf4i_* |
| instr >> 20 | OP12 | xvfmadd, xvfnmsub, xvfcmp_cond_s/d |
| instr >> 15 | OP17 | Bulk: arith, logic, cmp, float, mul/div, shift, extadd, bitmanip |
| instr >> 8 | OP24 | xvseteqz, xvsetnez, xvsetanyeqz, xvsetallnez |
| instr >> 10 | OP22 | xvfsqrt, xvfrecip, xvfrecipe, xvfclass, xvfcvt, xvffint, xvftint |
| Various | OP21 to OP9 | xvrepl128vei, xvinsgr2vr, xvbitseti, xvslli, xvsrani, xvld/xvst |
Wrong shift = silent dispatch failure. An instruction with the wrong shift will fall through to the next switch and be misinterpreted or unrecognized.
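The shift rule can be checked against the two instruction words that appear in the JIT log sample later in this document (77ec0441 for xvpermi.q, 75308442 for xvfadd.s). The helper name demo_op_field() is hypothetical. The sketch only extracts the opcode field and does not dispatch.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the top op_bits bits of a 32-bit instruction word,
   i.e. apply shift = 32 - op_bits. */
uint32_t demo_op_field(uint32_t instr, unsigned op_bits) {
    assert(op_bits >= 1 && op_bits <= 32);
    return instr >> (32 - op_bits);
}
```

With these inputs, demo_op_field(0x75308442, 17) yields 0xEA61 and demo_op_field(0x77ec0441, 14) yields 0x1DFB. Decoding the xvfadd.s word with the OP14 shift instead gives 0x1D4C, which matches no xvfadd case, so the instruction would simply fall through that switch.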
Every instruction follows the same pattern:
case OP17_XVADD_B:
perf_inc(P_XVADD_B);
do_lasx_emu_xvadd_b(uc, instr);
return 1;

Each do_lasx_emu_* function:
- Decodes register operands from instr
- Reads source VREGs via vreg_read_*
- Computes result per element
- Writes result via vreg_write_*
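The four steps can be sketched for a byte-wise add. This is an illustration only: it assumes the standard three-register field positions (xd in bits 4..0, xj in bits 9..5, xk in bits 14..10), and it uses a plain 32×32 byte array in place of the library's thread_data_t and vreg_* accessors. All demo_* names are hypothetical.

```c
#include <stdint.h>

/* Register fields of a 3-operand vector instruction. */
unsigned demo_xd(uint32_t instr) { return instr & 0x1f; }
unsigned demo_xj(uint32_t instr) { return (instr >> 5) & 0x1f; }
unsigned demo_xk(uint32_t instr) { return (instr >> 10) & 0x1f; }

/* Byte-wise add over 32-byte register images; each sum wraps modulo 256.
   Every element reads and writes the same index, so xd may alias xj or xk. */
void demo_xvadd_b(uint8_t regs[32][32], uint32_t instr) {
    unsigned xd = demo_xd(instr);   /* decode operands */
    unsigned xj = demo_xj(instr);
    unsigned xk = demo_xk(instr);
    for (unsigned i = 0; i < 32; i++) {
        uint8_t a = regs[xj][i];    /* read sources */
        uint8_t b = regs[xk][i];
        regs[xd][i] = (uint8_t)(a + b); /* compute and write one element */
    }
}
```

Applied to the word 75308442 from the log sample, the field helpers return xd = 2, xj = 2, xk = 1, which agrees with the disassembly xvfadd.s $xr2, $xr2, $xr1.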
When xd == xj or xd == xk, a loop that writes xd while it still has source elements left to read can overwrite those elements, so later iterations compute from already-modified data.
Wrong:
for (int i = 0; i < 4; i++) {
uint64_t src = vreg_read_64(uc, td, xj, i);
uint64_t dst = vreg_read_64(uc, td, xd, i); // xd == xj → reads modified value!
dst = compute(dst, src);
vreg_write_64(uc, td, xd, i, dst);
}

Correct (per-slot atomic):
for (int slot = 0; slot < 4; slot++) {
uint64_t src_j = vreg_read_64(uc, td, xj, slot);
uint64_t src_k = vreg_read_64(uc, td, xk, slot);
uint64_t dst = 0;
// process all elements in this slot
dst = compute(...);
vreg_write_64(uc, td, xd, slot, dst);
}

The JIT translates consecutive LASX instructions into LSX (128-bit) equivalents. Each LASX instruction becomes 1-2 LSX instructions operating on the low and high 128-bit halves.

Block mode:
- Scans forward for consecutive XV instructions (detect_xv_batch())
- Translates each as a simple 2×LSX sequence
- Replaces first instruction with jiscr1 → JIT code
- Returns via jiscr0
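The idea of a 2×LSX sequence can be modelled in plain C: one 256-bit operation becomes the same 128-bit operation applied to each half. The functions below are hypothetical stand-ins that work on byte arrays. They do not emit or execute LSX code.

```c
#include <stdint.h>

/* Hypothetical model of one LSX-width operation: byte-wise add on 16 bytes. */
static void demo_vadd_b(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16]) {
    for (int i = 0; i < 16; i++) {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}

/* A 256-bit byte-wise add expressed as the 128-bit operation on each half. */
void demo_xvadd_b_split(uint8_t dst[32], const uint8_t a[32], const uint8_t b[32]) {
    demo_vadd_b(dst, a, b);                /* low half: bytes 0..15 */
    demo_vadd_b(dst + 16, a + 16, b + 16); /* high half: bytes 16..31 */
}
```

This direct split only models instructions whose result in each half depends on the same half of the sources. Instructions that move data between halves, such as xvpermi.q, are the kind handled by the special-case translators listed in the file table.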
Usedef mode builds a use-def chain across the batch to eliminate redundant high-half loads/stores:
- build_block_usedef(): linear scan of all XV instructions in batch
  - Tracks first_use, first_write, last_use, last_write per XR
  - Assigns temp VR to each XR that needs its high half live
  - Two passes: first allocates all active XRs, second allocates only written XRs
- Emits prologue: saves original temp VR values, loads high halves
- Translates each instruction with temp_vr mapping for high-half VR
- Emits epilogue: stores modified high halves, restores temp VRs
Common bug: prologue re-saves the same physical VR when two XR intervals are assigned the same temp VR by the allocator. Fix: use a saved_mask bitmap.
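A saved_mask bitmap of this kind could work as sketched below. The function and its parameters are hypothetical. It only plans which physical VRs need a save, whereas a real prologue would emit a store at the marked point.

```c
#include <assert.h>
#include <stdint.h>

/* temp_vr[i] is the physical VR (0..31) assigned to the i-th XR interval.
   Returns how many saves the prologue needs; *saved_mask_out gets one bit
   per saved VR so the epilogue can restore exactly the same set. */
int demo_plan_saves(const int *temp_vr, int count, uint32_t *saved_mask_out) {
    uint32_t saved_mask = 0;
    int saves = 0;
    for (int i = 0; i < count; i++) {
        assert(temp_vr[i] >= 0 && temp_vr[i] < 32);
        uint32_t bit = (uint32_t)1 << temp_vr[i];
        if (saved_mask & bit) {
            continue;       /* this VR was already saved: skip the duplicate */
        }
        saved_mask |= bit;
        saves++;            /* a real prologue would emit the save here */
    }
    *saved_mask_out = saved_mask;
    return saves;
}
```

For the assignment {8, 9, 8} the function reports two saves and a mask with bits 8 and 9 set, so the original value of VR 8 is stored once and is not overwritten by a second save.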
Detects backward branches and JITs the entire loop body. See Section 6.
| File | Role |
|---|---|
| lasx_interpret.c | Entry point: lasx_emu_create_interpret_block(), lasx_emu_create_interpret(), batch detection, jiscr1 patching |
| lasx_interpret_opt_gen.c | Macro-generated translators: GEN_OPT_3OP, GEN_OPT_2OP produce __gen_lasx_interpret_XXX_opt() |
| lasx_interpret_opt_gen_xvmap.c | Special-case translators for complex instructions (xvld, xvst, vext2xv, xvpermi_q, xvfcmp_cond_s) |
| lasx_interpret_opt_gen_usedef.c | Usedef-aware translators with int *temp_vr parameter |
| lasx_patterns_xvmap.c | Loop pattern detection and xvmap-based loop JIT |
jiscr1 (Jump to Interpreter Scratch Register 1) replaces the original instruction in the application's code. When execution reaches it:
- Jumps to JIT code (address stored in scratch register)
- JIT code executes the translated instruction sequence
- jiscr0 returns to the instruction after the original
The translation is permanent for the process lifetime. No further SIGILL for that instruction.
When a SIGILL occurs in the middle of a loop, detect_loop_range() scans backward from the faulting instruction to find the loop start:
- Scan backward to find the first instruction of the pattern
- Scan forward for a backward conditional branch (loop back edge)
- Returns {loop_start, loop_end, loop_len}
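The forward scan for a back edge can be sketched using the LoongArch conditional-branch encoding (major opcodes 0x16 to 0x1b in bits 31..26, signed 16-bit instruction offset in bits 25..10). This is not detect_loop_range() itself. The demo_* functions are hypothetical, and the rule that the branch target must lie at or before the faulting instruction is an assumption about what counts as an enclosing loop.

```c
#include <stddef.h>
#include <stdint.h>

/* beq, bne, blt, bge, bltu, bgeu occupy major opcodes 0x16..0x1b. */
static int demo_is_cond_branch(uint32_t instr) {
    uint32_t op = instr >> 26;
    return op >= 0x16 && op <= 0x1b;
}

/* Signed 16-bit offset field (bits 25..10), counted in instructions. */
static int32_t demo_branch_offset(uint32_t instr) {
    return (int32_t)(int16_t)(uint16_t)((instr >> 10) & 0xffff);
}

/* Scan forward from index start; return the index of the first conditional
   branch whose target lies at or before start, or -1 if none is found. */
long demo_find_back_edge(const uint32_t *code, size_t count, size_t start) {
    for (size_t i = start; i < count; i++) {
        if (!demo_is_cond_branch(code[i])) {
            continue;
        }
        long target = (long)i + demo_branch_offset(code[i]);
        if (target >= 0 && target <= (long)start) {
            return (long)i;
        }
    }
    return -1;
}
```

The branch target then gives a candidate loop start, and the returned index gives the loop end.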
The xvmap mechanism assigns each used XR a pair of physical VRs (low + high) and generates LSX code for the entire loop body. Key steps:
- check_xvmap_feasibility(): validates all loop instructions are translatable
- generate_xvmap_loop():
  - Builds xvmap (assigns VR pairs)
  - Emits prologue: save callee-saved VRs, load high halves
  - Translates each LASX instruction to paired LSX
  - Emits epilogue: store high halves, restore VRs
  - Replaces loop entry with jiscr1 → JIT code
static unsigned int *detected_loop_start = NULL;
static unsigned int *detected_loop_end = NULL;
void *detected_loop_jit_entry = NULL;

After optimization, is_loop_already_optimized() checks if the loop start is already a jiscr1 instruction to avoid re-optimization.
Lagoon is a complete LoongArch assembler implemented as a C library (lagoon.c + lagoon.h). It provides:
- Instruction encoding for the entire LA64 ISA (GPR, FPR, LSX, LASX)
- Label management with backpatching
- JIT buffer management with bounds checking
// Assembler state
lagoon_assembler_t as;
la_init_assembler(&as, buffer, capacity);
// Instruction emission (700+ functions)
la_add_d(&as, rd, rj, rk); // GPR add
la_vadd_b(&as, vd, vj, vk); // LSX vadd.b
la_xvfmadd_s(&as, xd, xj, xk, xa); // LASX xvfmadd.s
// Label management
la_label(&as, &label);
la_b(&as, &label); // forward branch
la_bind(&as, &label); // backpatch target
// Buffer management
int n = la_get_inst_count(&as);
void *cursor = la_get_cursor(&as);

emit32() is the core function: it writes 4 bytes to the JIT buffer and advances the cursor. Every la_*() function ultimately calls emit32().
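A bounds-checked emit step of this kind could look like the sketch below. The struct layout, the demo_* names, and the return convention are assumptions made for illustration. Lagoon's actual lagoon_assembler_t and emit32() are not shown in this document.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical assembler state. */
typedef struct {
    uint32_t *buffer; /* start of the code buffer */
    size_t capacity;  /* capacity in 32-bit instruction words */
    size_t count;     /* words emitted so far */
} demo_asm_t;

/* Append one instruction word and advance the cursor.
   Returns 0 on success, -1 if the buffer is already full. */
int demo_emit32(demo_asm_t *as, uint32_t word) {
    if (as->count >= as->capacity) {
        return -1;
    }
    as->buffer[as->count++] = word;
    return 0;
}
```

Routing every encoder through one such function means the bounds check and the cursor update live in a single place.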
tests/ — Hand-written unit tests (21 files)
test_xvadd.c Basic arithmetic
test_xvfbasic.c Basic float
test_xvfma.c Float multiply-add
test_xvblock42.c 42-instruction block JIT test
test_xvblock38.c 38-instruction block JIT test
test_xvmap_*.c Loop JIT tests
TESTS — Passes reliably
TESTS_DEBUG — Known issues (QEMU LASX differences)
random_test/ — Systematic random testing (363 instructions)
src/ 707 .c test files (one per instruction + variants)
result/ Expected output files
test/ Compiled test binaries
random_data Default 64-byte test input
make test # 703 tests via emulator (block mode)
make test-native # 703 tests via native QEMU LASX
make test-loop # block + loop mode
make test-usedef # usedef mode
make test-single TEST=xvadd.b # Single instruction test
make test-single TEST=xvadd.b SEED=12345 # With random seed

Each test file:
- Loads 64 bytes of random data into two 256-bit vectors
- Executes the LASX instruction via inline assembly (.word directive)
- Prints the 256-bit result as 4 hex uint64_t values
Results are compared against QEMU native output (expected) or between emulator modes.
python3 tools/verify_coverage.py # Check switch structure coverage
python3 tools/check_lasx_opcodes.py # Validate opcode definitions

Set LIBLASX2LSX_DEBUG=1 to enable thread-safe debug logging:
tdlog("xr%d = 0x%016lx\n", reg, val); // prints to per-thread log

With liblasx2lsx_profile.so (profile build), each JIT translation is logged to /tmp/lasx-jit-PID-TID.log:
make profile # Build liblasx2lsx_profile.so
LASX_PROFILE=1 LD_PRELOAD=./liblasx2lsx_profile.so ./program

Log format:
=== batch 6 instrs mode=usedef PC=0000007fdd0e219c ===
0000007fdd0e219c: 77ec0441 xvpermi.q $xr1, $xr2, 1
0000007fdd0e21a0: 75308442 xvfadd.s $xr2, $xr2, $xr1
These logs feed into gen_usedef_tests.py for regression testing.
LASX_PERF_STATS=1 LD_PRELOAD=./liblasx2lsx.so ./program

Prints the top 20 most-executed instructions every ~1 second:
=== Performance Stats (top 20) ===
xvadd.w : 1234567 (23.4%)
xvfmadd.s : 987654 (18.7%)
...
Since the library replaces instructions with jiscr1, standard GDB breakpoints on LASX instructions won't fire (the instruction has been overwritten). Strategies:
- Break before library load: Set breakpoints before LD_PRELOAD takes effect
- HITRACE: Use the HITRACE macro to annotate specific emulator functions
- Force pure emulation: Set LIBLASX2LSX_INTERPRET_MODE=one to disable JIT entirely
- Per-instruction emit: la_insn_to_str() prints the disassembly of emitted LSX instructions
A standalone tool that compares block-mode and usedef-mode LSX code generation for a given instruction sequence:
make jit_compare
./jit_compare <hex_instr> [hex_instr...]

Outputs the LSX instructions emitted by each mode. Useful for debugging allocator decisions.
Symptom: Wrong results when xd == xj or xd == xk.
Root cause: Iterating while reading from and writing to the same register.
Fix: Read all source operands per slot, compute, then write once per slot.
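The hazard is easiest to see with an operation that reads across slots. The sketch below uses a hypothetical slot-reversal operation on a plain uint64_t[4], not a real LASX instruction or the library's accessors, and contrasts a snapshot-first version with an in-place version.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the four dword slots of src into dst. Safe even when dst == src,
   because every source slot is copied before the first write. */
void demo_reverse_slots(uint64_t dst[4], const uint64_t src[4]) {
    uint64_t snapshot[4];
    memcpy(snapshot, src, sizeof snapshot);
    for (int i = 0; i < 4; i++) {
        dst[i] = snapshot[3 - i];
    }
}

/* Same operation without the snapshot. Wrong when dst == src: slots 2 and 3
   are computed from slots 1 and 0 after those were already overwritten. */
void demo_reverse_slots_unsafe(uint64_t *dst, const uint64_t *src) {
    for (int i = 0; i < 4; i++) {
        dst[i] = src[3 - i];
    }
}
```

Run in place on {1, 2, 3, 4}, the snapshot version produces {4, 3, 2, 1}, while the unsafe version produces {4, 3, 3, 4}.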
Symptom: Adjacent elements get incorrect values.
Root cause: Using vreg_write_64 when the output has 16-bit elements.
Fix: Match the write function to the element width:
- _h_b → vreg_write_16
- _w_h → vreg_write_32
- _d_w → vreg_write_64
Symptom: Large negative saturated values become large positive values.
Root cause: Using an unsigned intermediate:
// WRONG:
uint32_t c = (int32_t)saturated; // sign lost
// RIGHT:
dst |= ((uint64_t)(int32_t)saturated << shift);

Some instructions have non-obvious operand order:
| Instruction | Semantics | Common Error |
|---|---|---|
| xvandn.v | rd = rk & ~rj | rd = rj & ~rk |
| xvorn.v | rd = rj \| ~rk | rd = rk \| ~rj |
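The xvandn.v row can be pinned down with a one-line helper per 64-bit slot. demo_andn_slot() is a hypothetical name, and the argument order mirrors the instruction's rj, rk operand order.

```c
#include <stdint.h>

/* xvandn.v on one 64-bit slot: the first source (rj) is the one inverted. */
uint64_t demo_andn_slot(uint64_t rj, uint64_t rk) {
    return rk & ~rj;
}
```

A quick asymmetric check is enough to catch swapped operands: with rj = 0xF0 and rk = 0xFF the result is 0x0F, whereas the swapped form would give 0.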
vext2xv reads from a vector register (XR), not from GPR. All 12 variants:
| Instruction | Source → Dest | Extension |
|---|---|---|
| vext2xv.h.b | XR byte → XR half | Signed |
| vext2xv.hu.bu | XR byte → XR half | Unsigned |
| vext2xv.w.h | XR half → XR word | Signed |
| ... |
For xvsrln.b.h, output bytes are stored non-contiguously:
- B[0..7] → D0
- B[8..15] → D1 (zeroed)
- B[16..23] → D2
- B[24..31] → D3 (zeroed)
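The layout above can be modelled as follows. This is a sketch based on the layout described here: it treats the instruction as a per-lane logical right shift followed by truncation to a byte, and uses plain arrays with hypothetical names instead of the library's register accessors.

```c
#include <stdint.h>
#include <string.h>

/* src and shift hold 16 halfwords (two 128-bit lanes of 8), dst holds 32 bytes.
   Each lane narrows its 8 halfwords into the lane's low 8 bytes (D0 or D2);
   the lane's high 8 bytes (D1 or D3) are zeroed. */
void demo_xvsrln_b_h(uint8_t dst[32], const uint16_t src[16], const uint16_t shift[16]) {
    uint8_t out[32];
    memset(out, 0, sizeof out);
    for (int lane = 0; lane < 2; lane++) {
        for (int i = 0; i < 8; i++) {
            int h = lane * 8 + i;
            /* shift amount uses the low 4 bits; the result is truncated to 8 bits */
            out[lane * 16 + i] = (uint8_t)(src[h] >> (shift[h] & 15));
        }
    }
    memcpy(dst, out, sizeof out);
}
```

Building the result in a local buffer first keeps the sketch correct even if a caller's destination overlaps a source. A naive contiguous store of all 16 results into bytes 0..15 would corrupt D1 and leave D2 empty.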
Symptom: Usedef mode produces wrong results, block mode correct.
Root cause: Allocator assigns the same temp VR to two XRs. Prologue saves each separately, but the second save overwrites the first. Epilogue restores the wrong value.
Fix: Use saved_mask bitmap to deduplicate prologue saves and epilogue restores.
For xvaddwev.h.b (even elements):
- Output H[i] reads from B[2i] (byte at even position)
- Index calculation: byte_idx = i * 2

For xvaddwod.h.b (odd elements):
- Output H[i] reads from B[2i+1] (byte at odd position)
- Index calculation: byte_idx = i * 2 + 1
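Both index rules can be sketched side by side. The functions below are hypothetical models on plain arrays. They assume signed byte inputs (the .h.b forms) and a destination that does not overlap the sources.

```c
#include <stdint.h>

/* Even variant: H[i] = B_j[2i] + B_k[2i], widened to 16 bits, i in 0..15. */
void demo_xvaddwev_h_b(int16_t dst[16], const int8_t xj[32], const int8_t xk[32]) {
    for (int i = 0; i < 16; i++) {
        int byte_idx = i * 2;
        dst[i] = (int16_t)(xj[byte_idx] + xk[byte_idx]);
    }
}

/* Odd variant: H[i] = B_j[2i+1] + B_k[2i+1], widened to 16 bits, i in 0..15. */
void demo_xvaddwod_h_b(int16_t dst[16], const int8_t xj[32], const int8_t xk[32]) {
    for (int i = 0; i < 16; i++) {
        int byte_idx = i * 2 + 1;
        dst[i] = (int16_t)(xj[byte_idx] + xk[byte_idx]);
    }
}
```

Because the sum is computed after widening, -128 + -128 yields -256 instead of wrapping, which makes extreme inputs a convenient check that the widening and the even/odd selection are both right.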
- Usedef sub-interval allocation: Currently the allocator either gives a temp VR for the full block or falls back to per-instruction. Sub-interval allocation (reusing VRs across non-overlapping XR lifetimes) would reduce register pressure.
- Loop-aware usedef: Combine loop detection with usedef analysis for more efficient loop JIT.
- Fragment JIT: The fragment mode exists but is incomplete (lasx_emu_create_interpret_fragment always returns false).
- Remaining instructions: A small number of LASX instructions (primarily xvshuf4i_d variants) still lack emulator implementations. Run tools/find_missing_instructions.py to check.
- Float precision: xvfrecipe, xvfrsqrte, xvflogb are excluded from random tests due to precision differences between QEMU and hardware. A hardware-based reference test suite is needed.
- Parallel random tests: The run_parallel_tests.sh script works but lacks deterministic output ordering for CI integration.
- Hardware CI: Add 2K3000 or 3A6000 as a remote test target for native LASX verification.
- Coverage-guided fuzzing: AFL/libFuzzer integration to find edge cases in instruction emulation.
- Out-of-tree builds: Currently all build artifacts pollute the source tree.
- Meson/CMake migration: The hand-written Makefile (671 lines) is growing unwieldy.
- Windows cross-compile: No support for cross-compiling to LA64 from Windows hosts.
- Reduce lagoon.c size: At 32,474 lines, lagoon.c is one file. Split by instruction category.
- Automated RMW checking: A static analyzer pass to detect the RMW bug pattern would prevent common regressions.
- Consistent naming: Some functions use lasx_emu_*, others do_lasx_emu_*. Unify.