Implements ARM64 support using ARM Generic Timer (CNTVCT_EL0) with a hybrid calibration approach and multiple optimized variants.

New files:
- tsc_arm64.go: Platform-specific initialization and calibration
- tsc_arm64.s: Assembly implementations for counter reading and timestamp calculation
- tsc_arm64_test.go: ARM64-specific tests and benchmarks

Core features:
- Uses CNTVCT_EL0 virtual counter for high-precision timestamps
- Hybrid calibration: reads CNTFRQ_EL0 for frequency, refines with linear regression
- Three assembly variants:
  * unixNanoARM16B: fast unordered path
  * unixNanoARMFMADD: fused multiply-add optimization
  * unixNanoARM16Bfence: ISB barriers for strict ordering
- Conservative feature detection with Linux clock source fallback

Shared infrastructure changes:
- Moved simpleLinearRegression and getClosestTSCSys to tsc.go for cross-platform sharing
- Replaced cpu.X86FalseSharingRange with platform-agnostic CacheLineSize constant (64 bytes)
- Updated tsc_generic.go build tags to exclude arm64

Testing:
- Builds successfully on Linux ARM64, Darwin ARM64, and AMD64
- Maintains backward compatibility with existing AMD64 code
- Test suite covers store/load operations, ordered execution, frequency detection, and benchmarks

Documentation:
- Added Architecture Support section to README
- Updated Key Features and Limitations
- Documents ARM64-specific implementation details
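The hybrid calibration above can be illustrated with a small sketch: seed the nanoseconds-per-tick coefficient from a frequency register, then refine it by least-squares fitting over (counter, wall-clock) sample pairs. The function below is illustrative only; its name mirrors the package's `simpleLinearRegression`, but the real signature and sampling strategy may differ.

```go
package main

import "fmt"

// simpleLinearRegressionSketch fits wall = coeff*tsc + offset by ordinary
// least squares over paired samples. Illustrative of the shared calibration
// helper the PR moves into tsc.go; not the package's actual implementation.
func simpleLinearRegressionSketch(tscs, walls []float64) (coeff, offset float64) {
	n := float64(len(tscs))
	var sumX, sumY float64
	for i := range tscs {
		sumX += tscs[i]
		sumY += walls[i]
	}
	meanX, meanY := sumX/n, sumY/n
	var num, den float64
	for i := range tscs {
		num += (tscs[i] - meanX) * (walls[i] - meanY)
		den += (tscs[i] - meanX) * (tscs[i] - meanX)
	}
	coeff = num / den
	offset = meanY - coeff*meanX
	return coeff, offset
}

func main() {
	// Synthetic samples: a 1 GHz counter (1.0 ns per tick) with a fixed
	// wall-clock offset, standing in for real CNTVCT_EL0 / clock readings.
	var tscs, walls []float64
	for i := 0; i < 10; i++ {
		tick := float64(i * 1000)
		tscs = append(tscs, tick)
		walls = append(walls, tick*1.0+5e9) // wall = 1.0*tsc + 5e9
	}
	coeff, offset := simpleLinearRegressionSketch(tscs, walls)
	fmt.Printf("coeff=%.3f offset=%.0f\n", coeff, offset)
	// → coeff=1.000 offset=5000000000
}
```

In the hybrid scheme, CNTFRQ_EL0 gives a cheap initial `coeff` (1e9 / frequency), and the regression corrects it against the system clock over a measurement window.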
Pull request overview
This pull request adds comprehensive ARM64 architecture support to the TSC library, enabling high-performance timestamp generation using ARM Generic Timer (CNTVCT_EL0). The implementation includes a hybrid calibration approach, multiple optimized assembly variants for different performance/ordering trade-offs, and extensive refactoring to share calibration logic between AMD64 and ARM64.
Changes:
- Implements ARM64 support with three optimized assembly variants (unordered fast path, FMADD optimization, and ISB-fenced strict ordering)
- Refactors calibration infrastructure (simpleLinearRegression, getClosestTSCSys) into shared tsc.go for cross-platform use
- Introduces platform-agnostic CacheLineSize constant replacing architecture-specific constants
- Adds comprehensive CI/CD workflows for multi-architecture testing including native ARM64 macOS and QEMU-based Linux ARM64
- Updates tooling (longdrift, calibrate) and adds development infrastructure (justfile, treefmt, golangci-lint config)
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| tsc_arm64.go | Core ARM64 implementation with hardware detection and calibration |
| tsc_arm64.s | Assembly implementations of counter reading and computation variants |
| tsc_arm64_test.go | ARM64-specific unit tests and benchmarks |
| tsc.go | Refactored shared calibration functions and platform-agnostic constants |
| tsc_generic.go | Updated build tags to exclude arm64 |
| tsc_amd64.go | Refactored to use shared calibration functions |
| tsc_amd64_test.go | Code cleanup and style updates |
| tsc_test.go | Enhanced drift testing with race detector awareness |
| tools/longdrift/main.go | Refactored flag handling and improved code organization |
| tools/calibrate/main.go | Refactored into smaller functions for better maintainability |
| race_enabled.go / race_disabled.go | Build tag detection for race detector |
| treefmt.toml | Code formatting configuration |
| .golangci.toml | Linter configuration |
| justfile | Build and test automation recipes |
| .github/workflows/*.yaml | Multi-architecture CI/CD pipelines |
| README.md | Updated documentation with ARM64 support details |
| examples/with-calibration.go | Code cleanup and modernization |
```go
}

// Out-of-Order test, GetInOrder should be in order as we assume.
func TestGetInOrder(t *testing.T) {
```
Missing t.Parallel() call. This test should include t.Parallel() for consistency with other test files and to enable parallel test execution.
```go
	}
}

func TestReadCounterFrequency(t *testing.T) {
```
Missing t.Parallel() call. This test should include t.Parallel() for consistency with other test files and to enable parallel test execution.
```go
}

func printInitialResults(cost time.Duration, avgCoeff, avgFreq float64) {
	cpuFlag := fmt.Sprintf("%s_%d", cpu.X86.Signature, cpu.X86.SteppingID)
```
The code references cpu.X86.Signature and cpu.X86.SteppingID, which will not work correctly on ARM64 platforms. Since this tool should now support ARM64 (as indicated by the PR), this should be updated to handle both architectures appropriately, or use a platform-agnostic way to identify the CPU.
Suggested change:

```diff
-	cpuFlag := fmt.Sprintf("%s_%d", cpu.X86.Signature, cpu.X86.SteppingID)
+	var cpuFlag string
+	switch runtime.GOARCH {
+	case "amd64", "386":
+		cpuFlag = fmt.Sprintf("%s_%d", cpu.X86.Signature, cpu.X86.SteppingID)
+	default:
+		cpuFlag = runtime.GOARCH
+	}
```
```text
timestamps using the processor's Time Stamp Counter register. It's 6-10x faster
than and can significantly improve performance for time-sensitive
applications. `time.Now().UnixNano()`
```
Grammar issue: The sentence structure is awkward. The phrase "It's 6-10x faster than and can significantly improve performance" is incomplete. It should read "It's 6-10x faster than time.Now().UnixNano() and can significantly improve performance" with the backticked text moved before "and".
Suggested change:

```diff
-timestamps using the processor's Time Stamp Counter register. It's 6-10x faster
-than and can significantly improve performance for time-sensitive
-applications. `time.Now().UnixNano()`
+timestamps using the processor's Time Stamp Counter register. It's 6-10x faster than
+`time.Now().UnixNano()` and can significantly improve performance for time-sensitive
+applications.
```
```diff
 if IsOutOfOrder() {
 	if cpu.X86.HasFMA {
 		start := GetInOrder()
-		for i := 0; i < 1000; i++ {
+		for range 1000 {
 			_ = unixNanoTSCFMA()
 		}
 		fmaCost := GetInOrder() - start
 		start = GetInOrder()
-		for i := 0; i < 1000; i++ {
+		for range 1000 {
 			_ = unixNanoTSC16B()
 		}
 		tscCost := GetInOrder() - start
 		if fmaCost < tscCost {
 			UnixNano = unixNanoTSCFMA
 		}
 	}
 	UnixNano = unixNanoTSC16B

 	return true
```
Logic error: when FMA is available and wins the benchmark, UnixNano is set to unixNanoTSCFMA, but it is then unconditionally overwritten with unixNanoTSC16B on line 49. The code is missing a return statement or else clause after line 47, so UnixNano always ends up as unixNanoTSC16B when IsOutOfOrder() is true, even when FMA is faster. The selection should return early (or branch) so the fall-through assignment on line 49 cannot clobber the FMA choice.
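The corrected selection can be sketched with stub variants; the function bodies below are placeholders standing in for the package's real assembly-backed implementations, and `pickVariant` is a hypothetical name for the init path. The key point is the early return after choosing the FMA variant:

```go
package main

import (
	"fmt"
	"time"
)

// Stand-ins for the package's timestamp variants. The real ones are
// implemented in assembly; these stubs just exercise the control flow.
func unixNanoTSCFMA() int64 { return time.Now().UnixNano() }
func unixNanoTSC16B() int64 { return time.Now().UnixNano() }

var UnixNano func() int64

// pickVariant benchmarks both variants when FMA is available, keeps the
// faster one, and returns early so the default assignment below cannot
// overwrite the FMA choice -- the fix the review comment asks for.
func pickVariant(hasFMA bool) {
	if hasFMA {
		start := time.Now()
		for i := 0; i < 1000; i++ { // benchmark loop
			_ = unixNanoTSCFMA()
		}
		fmaCost := time.Since(start)

		start = time.Now()
		for i := 0; i < 1000; i++ {
			_ = unixNanoTSC16B()
		}
		tscCost := time.Since(start)

		if fmaCost < tscCost {
			UnixNano = unixNanoTSCFMA
			return // early return: skip the fall-through default
		}
	}
	UnixNano = unixNanoTSC16B
}

func main() {
	pickVariant(true)
	fmt.Println(UnixNano() > 0)
}
```

An `else` branch around the default assignment would work equally well; the early return just keeps the happy path flat.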
```go
		}
	}
	if cnt > 0 {
		t.Fatal(fmt.Sprintf("GetInOrder is not in order: %d aren't in order", cnt))
```
The test uses fmt.Sprintf unnecessarily when t.Fatalf can handle formatting directly. Replace 't.Fatal(fmt.Sprintf(...))' with 't.Fatalf(...)' for cleaner code.
```go
	"github.com/templexxx/tsc/internal/xbytes"
)

func TestStoreOffsetCoeff(t *testing.T) {
```
Missing t.Parallel() call. The tests in tsc_test.go and tsc_amd64_test.go have been updated to include t.Parallel() for tests that can run in parallel. This test should also include t.Parallel() for consistency and to enable parallel test execution.
```diff
-// Using false sharing range as aligned size & total size for avoiding cache pollution.
-OffsetCoeff = xbytes.MakeAlignedBlock(cpu.X86FalseSharingRange, cpu.X86FalseSharingRange)
+// Using cache line size for alignment to avoid cache pollution.
+OffsetCoeff = xbytes.MakeAlignedBlock(CacheLineSize, CacheLineSize)
```
cpu.X86FalseSharingRange is 128 bytes, for reducing L2 cache pollution.
Could you add the benchmark results on your ARM machine? That would be helpful, thanks.