
Added ARM64 support #10

Open
MeKo-Christian wants to merge 2 commits into templexxx:master from MeKo-Christian:master

Conversation

@MeKo-Christian

Implements ARM64 support using the ARM Generic Timer (CNTVCT_EL0) with a hybrid calibration approach and multiple optimized variants.

Core features:

  • Uses CNTVCT_EL0 virtual counter for high-precision timestamps
  • Hybrid calibration: reads CNTFRQ_EL0 for frequency, refines with linear regression
  • Three assembly variants:
    • unixNanoARM16B: fast unordered path
    • unixNanoARMFMADD: fused multiply-add optimization
    • unixNanoARM16Bfence: ISB barriers for strict ordering
  • Conservative feature detection with Linux clock source fallback

Shared infrastructure changes:

  • Moved simpleLinearRegression and getClosestTSCSys to tsc.go for cross-platform sharing
  • Replaced cpu.X86FalseSharingRange with platform-agnostic CacheLineSize constant (64 bytes)
  • Updated tsc_generic.go build tags to exclude arm64


New files:
- tsc_arm64.go: Platform-specific initialization and calibration
- tsc_arm64.s: Assembly implementations for counter reading and
  timestamp calculation
- tsc_arm64_test.go: ARM64-specific tests and benchmarks


Testing:
- Builds successfully on Linux ARM64, Darwin ARM64, and AMD64
- Maintains backward compatibility with existing AMD64 code
- Test suite covers store/load operations, ordered execution,
  frequency detection, and benchmarks

Documentation:
- Added Architecture Support section to README
- Updated Key Features and Limitations
- Documents ARM64-specific implementation details
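The hybrid calibration step described above (refining the hardware-reported frequency with a linear regression over counter/wall-clock sample pairs) can be sketched as follows. This is an illustrative stand-alone version with synthetic data, not the library's actual simpleLinearRegression; the function name and sample values here are assumptions.

```go
package main

import "fmt"

// linearRegression fits nanos = coeff*tsc + offset by ordinary least
// squares. Illustrative only; the library's simpleLinearRegression may
// differ in detail.
func linearRegression(tscs, nanos []float64) (coeff, offset float64) {
	n := float64(len(tscs))
	var sumX, sumY, sumXY, sumXX float64
	for i := range tscs {
		sumX += tscs[i]
		sumY += nanos[i]
		sumXY += tscs[i] * nanos[i]
		sumXX += tscs[i] * tscs[i]
	}
	coeff = (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	offset = (sumY - coeff*sumX) / n
	return coeff, offset
}

func main() {
	// Synthetic samples: a 1 GHz counter (1 ns per tick) whose epoch
	// offset is 1000 ns.
	tscs := []float64{0, 100, 200, 300}
	nanos := []float64{1000, 1100, 1200, 1300}
	coeff, offset := linearRegression(tscs, nanos)
	fmt.Printf("coeff=%.3f offset=%.1f\n", coeff, offset) // prints: coeff=1.000 offset=1000.0
}
```

The regressed coefficient converts raw counter ticks to nanoseconds; starting from CNTFRQ_EL0 gives the regression a good initial frequency estimate to refine.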

Copilot AI left a comment


Pull request overview

This pull request adds comprehensive ARM64 architecture support to the TSC library, enabling high-performance timestamp generation using ARM Generic Timer (CNTVCT_EL0). The implementation includes a hybrid calibration approach, multiple optimized assembly variants for different performance/ordering trade-offs, and extensive refactoring to share calibration logic between AMD64 and ARM64.

Changes:

  • Implements ARM64 support with three optimized assembly variants (unordered fast path, FMADD optimization, and ISB-fenced strict ordering)
  • Refactors calibration infrastructure (simpleLinearRegression, getClosestTSCSys) into shared tsc.go for cross-platform use
  • Introduces platform-agnostic CacheLineSize constant replacing architecture-specific constants
  • Adds comprehensive CI/CD workflows for multi-architecture testing including native ARM64 macOS and QEMU-based Linux ARM64
  • Updates tooling (longdrift, calibrate) and adds development infrastructure (justfile, treefmt, golangci-lint config)

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 11 comments.

Show a summary per file
- tsc_arm64.go: Core ARM64 implementation with hardware detection and calibration
- tsc_arm64.s: Assembly implementations of counter reading and computation variants
- tsc_arm64_test.go: ARM64-specific unit tests and benchmarks
- tsc.go: Refactored shared calibration functions and platform-agnostic constants
- tsc_generic.go: Updated build tags to exclude arm64
- tsc_amd64.go: Refactored to use shared calibration functions
- tsc_amd64_test.go: Code cleanup and style updates
- tsc_test.go: Enhanced drift testing with race detector awareness
- tools/longdrift/main.go: Refactored flag handling and improved code organization
- tools/calibrate/main.go: Refactored into smaller functions for better maintainability
- race_enabled.go / race_disabled.go: Build tag detection for race detector
- treefmt.toml: Code formatting configuration
- .golangci.toml: Linter configuration
- justfile: Build and test automation recipes
- .github/workflows/*.yaml: Multi-architecture CI/CD pipelines
- README.md: Updated documentation with ARM64 support details
- examples/with-calibration.go: Code cleanup and modernization


Comment thread tsc_arm64_test.go
}

// Out-of-Order test, GetInOrder should be in order as we assume.
func TestGetInOrder(t *testing.T) {

Copilot AI Jan 20, 2026


Missing t.Parallel() call. This test should include t.Parallel() for consistency with other test files and to enable parallel test execution.

Comment thread tsc_arm64_test.go
}
}

func TestReadCounterFrequency(t *testing.T) {

Copilot AI Jan 20, 2026


Missing t.Parallel() call. This test should include t.Parallel() for consistency with other test files and to enable parallel test execution.

Comment thread tools/calibrate/main.go
}

func printInitialResults(cost time.Duration, avgCoeff, avgFreq float64) {
cpuFlag := fmt.Sprintf("%s_%d", cpu.X86.Signature, cpu.X86.SteppingID)

Copilot AI Jan 20, 2026


The code references cpu.X86.Signature and cpu.X86.SteppingID, which will not work correctly on ARM64 platforms. Since this tool should now support ARM64 (as indicated by the PR), this should be updated to handle both architectures appropriately, or use a platform-agnostic way to identify the CPU.

Suggested change:

    cpuFlag := fmt.Sprintf("%s_%d", cpu.X86.Signature, cpu.X86.SteppingID)

becomes

    var cpuFlag string
    switch runtime.GOARCH {
    case "amd64", "386":
    	cpuFlag = fmt.Sprintf("%s_%d", cpu.X86.Signature, cpu.X86.SteppingID)
    default:
    	cpuFlag = runtime.GOARCH
    }

Comment thread README.md
Comment on lines +4 to +6
timestamps using the processor's Time Stamp Counter register. It's 6-10x faster
than and can significantly improve performance for time-sensitive
applications. `time.Now().UnixNano()`

Copilot AI Jan 20, 2026


Grammar issue: The sentence structure is awkward. The phrase "It's 6-10x faster than and can significantly improve performance" is incomplete. It should read "It's 6-10x faster than time.Now().UnixNano() and can significantly improve performance" with the backticked text moved before "and".

Suggested change:

    timestamps using the processor's Time Stamp Counter register. It's 6-10x faster
    than and can significantly improve performance for time-sensitive
    applications. `time.Now().UnixNano()`

becomes

    timestamps using the processor's Time Stamp Counter register. It's 6-10x faster than
    `time.Now().UnixNano()` and can significantly improve performance for time-sensitive
    applications.

Comment thread tsc_amd64.go
Comment on lines 28 to 51
    if IsOutOfOrder() {
    	if cpu.X86.HasFMA {
    		start := GetInOrder()
    		for range 1000 {
    			_ = unixNanoTSCFMA()
    		}
    		fmaCost := GetInOrder() - start
    		start = GetInOrder()
    		for range 1000 {
    			_ = unixNanoTSC16B()
    		}
    		tscCost := GetInOrder() - start
    		if fmaCost < tscCost {
    			UnixNano = unixNanoTSCFMA
    		}
    	}

    	UnixNano = unixNanoTSC16B

    	return true

Copilot AI Jan 20, 2026


Logic error: when FMA is faster, UnixNano is set to unixNanoTSCFMA, but execution then falls through and unconditionally overwrites it with unixNanoTSC16B on line 49. A return statement (or an else clause) is missing after the FMA selection on line 47, so when IsOutOfOrder() is true, UnixNano always ends up as unixNanoTSC16B, even when FMA is faster.
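One way to structure the fix the review is asking for is to return the chosen variant directly, so the fallback assignment can never clobber the benchmark winner. The sketch below uses stub functions and made-up costs, not the library's actual code; pickVariant is a hypothetical helper for illustration.

```go
package main

import "fmt"

type clockFunc func() int64

// Stubs standing in for the real assembly-backed variants.
func unixNanoTSCFMA() int64 { return 1 }
func unixNanoTSC16B() int64 { return 2 }

// pickVariant mirrors the intended selection logic: keep the FMA path
// only when it is available and measurably cheaper, otherwise fall back.
// Returning immediately avoids the fall-through bug flagged above.
func pickVariant(hasFMA bool, fmaCost, tscCost int64) clockFunc {
	if hasFMA && fmaCost < tscCost {
		return unixNanoTSCFMA
	}
	return unixNanoTSC16B
}

func main() {
	f := pickVariant(true, 80, 100) // FMA benchmarked as cheaper
	fmt.Println(f() == unixNanoTSCFMA())
}
```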

Comment thread tsc_arm64_test.go
}
}
if cnt > 0 {
t.Fatal(fmt.Sprintf("GetInOrder is not in order: %d aren't in order", cnt))

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses fmt.Sprintf unnecessarily when t.Fatalf can handle formatting directly. Replace 't.Fatal(fmt.Sprintf(...))' with 't.Fatalf(...)' for cleaner code.

Comment thread tsc_arm64_test.go
"github.com/templexxx/tsc/internal/xbytes"
)

func TestStoreOffsetCoeff(t *testing.T) {

Copilot AI Jan 20, 2026


Missing t.Parallel() call. The tests in tsc_test.go and tsc_amd64_test.go have been updated to include t.Parallel() for tests that can run in parallel. This test should also include t.Parallel() for consistency and to enable parallel test execution.

Comment thread tsc.go
    // Using false sharing range as aligned size & total size for avoiding cache pollution.
    OffsetCoeff = xbytes.MakeAlignedBlock(cpu.X86FalseSharingRange, cpu.X86FalseSharingRange)

becomes

    // Using cache line size for alignment to avoid cache pollution.
    OffsetCoeff = xbytes.MakeAlignedBlock(CacheLineSize, CacheLineSize)
Owner


cpu.X86FalseSharingRange is 128 bytes, for reducing L2 cache pollution.

Comment thread README.md
Owner


Could you add the benchmark results from your ARM machine? That would be helpful, thanks.
