
[WIP] [Deepin-Kernel-SIG] [linux 6.6.y] [Arm] [Fromlist] [Security] arm64: Support for Arm CCA in KVM #1520

Open

Avenger-285714 wants to merge 71 commits into deepin-community:linux-6.6.y from Avenger-285714:cca-backport-wip

Conversation

@Avenger-285714

The following description was generated with AI assistance:

arm64: Support for Arm CCA (Confidential Compute Architecture) in KVM

Overview

This PR backports Arm CCA (Confidential Compute Architecture) support to the linux-6.6.y (v6.6.127) kernel. CCA is Arm's confidential computing architecture: through the Realm Management Extension (RME) and the Realm Management Monitor (RMM), it provides hardware-level memory isolation and protection for virtual machines, making guest memory invisible to the hypervisor.

The backport is based on the upstream v12 patch series "[PATCH v12 00/46] arm64: Support for Arm CCA in KVM" (original baseline v6.19-rc1+) and comprises 71 commits.

Commit breakdown

| Type | Count | Marking | Description |
| --- | --- | --- | --- |
| Upstream backport | 25 | `[ Upstream commit <sha1> ]` in the commit body | Prerequisite infrastructure commits cherry-picked from upstream mainline, preserving original authorship |
| Fromlist patch | 46 | Commit title prefixed with `Fromlist:` | The 46 patches of the CCA v12 series, not yet merged into mainline |

Part 1: Upstream backport commits in detail (25)

Several pieces of infrastructure that the CCA series depends on do not exist at all in 6.6.y and had to be backported from upstream first. These commits fall into three sub-phases:

Phase 0.1: guest_memfd core framework (10 commits)

The entire `virt/kvm/guest_memfd.c` subsystem (the guest private-memory backend) and Kconfig options such as `CONFIG_KVM_GUEST_MEMFD` and `CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES` are absent from 6.6.y. The following 10 commits build the minimal guest_memfd framework that CCA needs:

| # | Branch commit | Original commit | Title | Purpose |
| --- | --- | --- | --- | --- |
| 1 | 968978aaba80 | 5a475554db1e | KVM: Introduce per-page memory attributes | Adds the KVM_SET_MEMORY_ATTRIBUTES ioctl and the xarray-backed per-page memory-attribute framework |
| 2 | 303b8bf0c6ad | a7800aa80ea4 | KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory | Creates `virt/kvm/guest_memfd.c`, the guest private-memory backend |
| 3 | 9d80bea67d28 | 80583d0cfd8f | KVM: guest-memfd: fix unused-function warning | Fixes a build warning introduced by #2 |
| 4 | 78aaab8f673e | 1d23040caa8b | KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode | Security: marks the guest_memfd inode as not directly accessible |
| 5 | 04540c790841 | 70623723778a | KVM: guest_memfd: pass error up from filemap_grab_folio | Error-handling improvement |
| 6 | c9b28b8e40c8 | fa30b0dc91c8 | KVM: guest_memfd: limit overzealous WARN | Stability fix |
| 7 | adb2ddb3b378 | 3bb2531e20bf | KVM: guest_memfd: Add hook for initializing memory | Adds the kvm_arch_gmem_prepare() callback, used by CCA for Realm memory initialization |
| 8 | fd27a620480e | 17573fd971f9 | KVM: guest_memfd: extract __kvm_gmem_get_pfn() | Extracts an internal helper in preparation for later API evolution |
| 9 | b0fe205e0697 | 1f6c06b17751 | KVM: guest_memfd: Add interface for populating gmem pages with user data | Adds kvm_gmem_populate(), used by CCA patch 20 to populate initial Realm memory |
| 10 | 7b47c551d87b | a90764f0e4ed | KVM: guest_memfd: Add hook for invalidating memory | Adds the kvm_arch_gmem_invalidate() callback, used by CCA for Realm memory reclaim |

Phase 0.2: API fixes and naming unification (12 commits)

On top of the core framework, a series of fixes, renames and API-evolution commits is needed so that the interfaces match what the CCA patches expect:

| # | Branch commit | Original commit | Title | Purpose |
| --- | --- | --- | --- | --- |
| 11 | df348ae0e862 | d81473840ce1 | KVM: interrupt kvm_gmem_populate() on signals | Robustness: allows signals to interrupt long-running populate operations |
| 12 | ad1f35ab6858 | e300614f10bd | KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes() | Code simplification |
| 13 | f74c8d9744e4 | 47bb584237cc | KVM: Allow CPU to reschedule while setting per-page memory attributes | Scheduling improvement for large memory ranges |
| 14 | e24c03506e3c | 19a9a1ab5c3d | KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD | Config naming unification |
| 15 | 3cfed31caac9 | d0d87226f535 | KVM: guest_memfd: return folio from __kvm_gmem_get_pfn() | API evolution: return the folio instead of only setting the pfn |
| 16 | 244147e581fb | d04c77d23122 | KVM: guest_memfd: delay folio_mark_uptodate() until after successful preparation | Correctness: mark uptodate only after prepare succeeds |
| 17 | 7edb9f3ea128 | 564429a6bd8d | KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_* | Config naming unification |
| 18 | 9867b6e495b0 | e4ee54479273 | KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfns | Behavioral constraint required by CCA patch 20 |
| 19 | 7a4bf632a6a4 | dca6c8853232 | KVM: Add member to struct kvm_gfn_range to indicate private/shared | Needed for CCA memory-attribute change notifications |
| 20 | 4eb848a26ec9 | 923310be23b2 | KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() | Function rename; CCA patch 24 uses this name |
| 21 | 2b3c6d0abdc6 | 638ea79669f8 | KVM: arm64: Refactor user_mem_abort() | Splits user_mem_abort() into shared/gmem paths |
| 22 | e908e1b7fa31 | a7b57e099592 | KVM: arm64: Handle guest_memfd-backed guest page faults | Adds gmem_abort() to handle guest private-memory faults |

Phase 0.3: UAPI/API completion (3 commits)

After applying all 46 CCA patches, the following key UAPI definitions and API signatures turned out to be entirely missing from 6.6.y, breaking the build. These 3 commits fill in the missing interface definitions:

| # | Branch commit | Original commit | Title | Missing pieces filled in |
| --- | --- | --- | --- | --- |
| 23 | 362de6414aa9 | 16f95f3b95ca | KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace | KVM_EXIT_MEMORY_FAULT (exit reason #39), the kvm_run.memory_fault struct, the 3-parameter kvm_prepare_memory_fault_exit() |
| 24 | f52c7a72d8b8 | 8dd2eee9d526 | KVM: x86/mmu: Handle page fault for private memory | The KVM_MEMORY_EXIT_FLAG_PRIVATE definition; upgrades kvm_prepare_memory_fault_exit() to the 6-parameter version (adding is_write/is_exec/is_private); kvm_faultin_pfn_private() and kvm_max_level_for_order() |
| 25 | f33220e586ee | 1fbee5b01a0f | KVM: guest_memfd: Provide "struct page" as output from kvm_gmem_get_pfn() | Adds the `struct page **page` output parameter to kvm_gmem_get_pfn(); CCA's arm64 fault handling needs it to obtain the struct page |

📋 Note on #23-#25: the original plan contained only 22 prerequisite backports. During interface-completeness verification after all CCA patches had been applied, it turned out that the KVM_EXIT_MEMORY_FAULT UAPI (exit reason #39), the kvm_run.memory_fault struct and the KVM_MEMORY_EXIT_FLAG_PRIVATE definition do not exist at all in 6.6.y, and that the signatures of kvm_prepare_memory_fault_exit() and kvm_gmem_get_pfn() did not match what the CCA patches require. To stay consistent with upstream, these 3 regular upstream commits were cherry-picked rather than adding custom `deepin:` fix commits.


Part 2: Fromlist CCA patches (46)

The 46 CCA patches come from the v12 series posted to the upstream mailing list (not yet merged into mainline) and are applied in their original order, 01-46.

Full commit list

| # | Branch commit | Title | Group |
| --- | --- | --- | --- |
| 01 | 24b9abbadd7f | kvm: arm64: Include kvm_emulate.h in kvm/arm_psci.h | Infrastructure |
| 02 | 518467276a83 | arm64: RME: Handle Granule Protection Faults (GPFs) | Infrastructure |
| 03 | 076d987f7772 | arm64: RMI: Add SMC definitions for calling the RMM | RMI interface |
| 04 | 3a3f7ad88a02 | arm64: RMI: Add wrappers for RMI calls | RMI interface |
| 05 | a7f730878bef | arm64: RMI: Check for RMI support at KVM init | RMI interface |
| 06 | aeb5303ad0b8 | arm64: RMI: Define the user ABI | RMI interface |
| 07 | afc2ab68ad30 | arm64: RMI: Basic infrastructure for creating a realm | Realm creation |
| 08 | 077c40cc0590 | kvm: arm64: Don't expose unsupported capabilities for realm guests | Realm creation |
| 09 | f2ade45d5ffa | KVM: arm64: Allow passing machine type in KVM creation | Realm creation |
| 10 | b55df100a8da | arm64: RMI: RTT tear down | Realm creation |
| 11 | 7b5895a4420d | arm64: RMI: Activate realm on first VCPU run | Realm creation |
| 12 | 49f5140f8958 | arm64: RMI: Allocate/free RECs to match vCPUs | Realm creation |
| 13 | 960dd69f0be4 | KVM: arm64: vgic: Provide helper for number of list registers | Interrupts/timers |
| 14 | cfd4093a365d | arm64: RMI: Support for the VGIC in realms | Interrupts/timers |
| 15 | b422fc1dac78 | KVM: arm64: Support timers in realm RECs | Interrupts/timers |
| 16 | 37d1dfaf5825 | arm64: RMI: Handle realm enter/exit | Runtime handling |
| 17 | 9a682ece496c | arm64: RMI: Handle RMI_EXIT_RIPAS_CHANGE | Runtime handling |
| 18 | 5247c359d8d5 | KVM: arm64: Handle realm MMIO emulation | Runtime handling |
| 19 | d3bbfb218950 | KVM: arm64: Expose support for private memory | guest_memfd integration |
| 20 | b283df588980 | arm64: RMI: Allow populating initial contents | guest_memfd integration |
| 21 | d092e884102a | arm64: RMI: Set RIPAS of initial memslots | guest_memfd integration |
| 22 | fd2d851ec369 | arm64: RMI: Create the realm descriptor | guest_memfd integration |
| 23 | 14ed1026558e | arm64: RMI: Add a VMID allocator for realms | guest_memfd integration |
| 24 | 4d43d62fad0b | arm64: RMI: Runtime faulting of memory | guest_memfd integration ⚠️ |
| 25 | 07f2c44fbdd1 | KVM: arm64: Handle realm VCPU load | VCPU management |
| 26 | be825b367036 | KVM: arm64: Validate register access for a Realm VM | VCPU management |
| 27 | c4a1a72d6bf4 | KVM: arm64: Handle Realm PSCI requests | VCPU management |
| 28 | 2eb2b434e6eb | KVM: arm64: WARN on injected undef exceptions | VCPU management |
| 29 | 770c17b40b43 | arm64: Don't expose stolen time for realm guests | VCPU management |
| 30 | 82f0a4d8fc4b | arm64: RMI: allow userspace to inject aborts | VCPU management |
| 31 | 351eefbe3e11 | arm64: RMI: support RSI_HOST_CALL | Extended features |
| 32 | 77b631f8cdad | arm64: RMI: Allow checking SVE on VM instance | Extended features |
| 33 | 95143a558cf7 | arm64: RMI: Always use 4k pages for realms | Extended features |
| 34 | 43b2e9d91ea0 | arm64: RMI: Prevent Device mappings for Realms | Extended features |
| 35 | 5686e6e6d77b | HACK: Restore per-CPU cpu_armpmu pointer | PMU support |
| 36 | c4ea18576076 | arm_pmu: Provide a mechanism for disabling the physical IRQ | PMU support |
| 37 | 41aacd68e2c9 | arm64: RMI: Enable PMU support with a realm guest | PMU support |
| 38 | 66f8d0071a09 | arm64: RMI: Propagate number of breakpoints and watchpoints to userspace | Debug support |
| 39 | 89a8a4caad52 | arm64: RMI: Set breakpoint parameters through SET_ONE_REG | Debug support |
| 40 | 5308dd832f16 | arm64: RMI: Initialize PMCR.N with number counter supported by RMM | PMU support |
| 41 | 9de55f73cfa9 | arm64: RMI: Propagate max SVE vector length from RMM | SVE support |
| 42 | d7347fb7a726 | arm64: RMI: Configure max SVE vector length for a Realm | SVE support |
| 43 | 1bcf11badcea | arm64: RMI: Provide register list for unfinalized RMI RECs | Register management |
| 44 | 5ea16f211e4f | arm64: RMI: Provide accurate register list | Register management |
| 45 | 9b38228b2b70 | KVM: arm64: Expose KVM_ARM_VCPU_REC to user space | Userspace ABI |
| 46 | a0ba4561deac | arm64: RMI: Enable realms to be created | Final enablement |

Feature groups in detail

Infrastructure and RMI interface (patches 01-06): fix a header dependency, add GPF (Granule Protection Fault) handling, define the RMI SMC call interface (`rmi_smc.h` and `rmi_cmds.h`, both new files), detect RMI support at KVM init and introduce helpers such as kvm_is_realm(), and define the Realm userspace ABI, adding the KVM_CAP_ARM_RMI capability.

Realm VM creation and management (patches 07-12): Realm creation infrastructure (IPA-limit checks, Realm Descriptor management), capability filtering for Realm guests (masking unsupported features), allowing KVM_VM_TYPE_ARM_REALM as the machine type at VM creation, RTT (Realm Translation Table) tear-down, Realm activation on first VCPU run, and REC (Realm Execution Context) allocation/free to match vCPUs.

Interrupt and timer support (patches 13-15): a helper to query the number of VGIC list registers, full VGIC support inside Realms, and timer support in Realm RECs.

Runtime handling (patches 16-18): Realm enter/exit handling (`rmi-exit.c`, a new file), handling of RMI_EXIT_RIPAS_CHANGE (Realm IPA State change requests), and Realm MMIO emulation.

guest_memfd integration (patches 19-24): expose private-memory support, populate initial Realm memory contents via kvm_gmem_populate(), set the RIPAS of initial memslots, create the Realm Descriptor, add a Realm VMID allocator, and handle runtime memory faults (patch 24 is the most complex patch of the series: it handles Realm private-memory faults in gmem_abort(), involving RMI data/RTT creation).

VCPU and register management (patches 25-30): Realm VCPU load, register-access validation, Realm PSCI request handling, WARN checks on injected undefined exceptions, disabling stolen time for Realm guests, and allowing userspace to inject aborts.

Extended features (patches 31-46): RSI_HOST_CALL support, SVE checks and forcing 4K pages, forbidding Device mappings for Realms, PMU support (including restoring the per-CPU cpu_armpmu pointer, an IRQ-disable mechanism and PMU counter initialization), breakpoint/watchpoint parameter propagation and configuration, obtaining and configuring the maximum SVE vector length from the RMM, accurate register lists for Realm RECs, exposing KVM_ARM_VCPU_REC to userspace, and finally enabling Realm creation via a static branch.

⚠️ Patch 24 (Runtime faulting of memory) is the most complex patch of the whole series; it involves the largest number of equivalent API substitutions and is the key risk point for runtime correctness.


Part 3: Commit ordering

The 25 upstream backports are applied before the 46 CCA patches, except that the 3 Phase 0.3 UAPI/API completion commits are interleaved between CCA patch 23 and patch 24 (because patch 24 is the first patch that actually uses the 6-parameter kvm_prepare_memory_fault_exit() and the `struct page **page` parameter of kvm_gmem_get_pfn()):

 Upstream #1-#22 (Phase 0.1 + 0.2)
   ↓
 CCA patches 01-23 (Fromlist)
   ↓
 Upstream #23-#25 (Phase 0.3: UAPI/API completion)
   ↓
 CCA patches 24-46 (Fromlist)

Part 4: Adaptations relative to the original upstream/fromlist state

Because the original CCA patches target v6.19-rc1+ while the target branch is v6.6.127 — roughly three years of kernel evolution apart — the backport required the following main adaptations:

4.1 Full guest_memfd subsystem backport

Why: `virt/kvm/guest_memfd.c` does not exist at all in 6.6.y, and Kconfig options such as `CONFIG_KVM_GUEST_MEMFD` and `CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES` are missing. CCA patches 19, 20 and 24 depend hard on this subsystem.

How: 22 guest_memfd-related commits were selected from upstream mainline and cherry-picked in dependency order, covering the core framework, bug fixes, API evolution and naming unification. These commits have strict ordering dependencies among themselves.

Cherry-pick conflict resolution

  • include/uapi/linux/kvm.h: the highest KVM_CAP number in 6.6.y is 229; the new CAP numbers had to be inserted correctly after the existing deepin-specific CAPs (e.g. the HYGON-related ones)
  • virt/kvm/Kconfig / Makefile.kvm: add the guest_memfd Kconfig options and build rules
  • include/linux/kvm_host.h: insert the new memory-attribute and gmem declarations into the structures already present in 6.6.y

4.2 KVM_EXIT_MEMORY_FAULT UAPI completion

Why: CCA patch 24 calls kvm_prepare_memory_fault_exit() in gmem_abort() to report private/shared memory-access mismatches to userspace. But KVM_EXIT_MEMORY_FAULT (exit reason #39), the kvm_run.memory_fault struct and KVM_MEMORY_EXIT_FLAG_PRIVATE are all entirely missing from 6.6.y.

How: 3 additional upstream commits (#23-#25) were backported to fill in these UAPI definitions and API signatures, rather than adding custom `deepin:` fix commits, to keep the code consistent with upstream.

Conflict resolution

  • include/uapi/linux/kvm.h: insert KVM_EXIT_MEMORY_FAULT = 39 into the 6.6.y exit-reason list and add the memory_fault member to the kvm_run union
  • include/linux/kvm_host.h: keep the Phase 0 infrastructure already present in 6.6.y (memory attributes, gmem declarations) while correctly merging in the 6-parameter kvm_prepare_memory_fault_exit()
  • arch/x86/kvm/mmu/mmu.c: kvm_mmu_max_mapping_level() signature change (adds a max_level parameter); keep the kvm_slot_has_gmem() name used in 6.6.y (upstream still has kvm_slot_can_be_private() here; it was already renamed by Phase 0 commit #20)
  • arch/x86/kvm/mmu/mmu_internal.h: add the refcounted_page field to struct kvm_page_fault
  • arch/x86/kvm/svm/sev.c: the SNP/SEV functions introduced by the upstream commit (snp_rmptable_psmash, sev_handle_rmp_fault, etc.) have no corresponding base code in 6.6.y and were dropped entirely (keeping the empty HEAD side)
  • virt/kvm/guest_memfd.c: after adding the `struct page **page` parameter to kvm_gmem_get_pfn(), add a folio_file_page() call in the function body to set the page output correctly

4.3 API adaptation in mmu.c

The CCA patches (in particular patches 07, 10, 17, 24, 33 and 34) use several APIs in arch/arm64/kvm/mmu.c that do not exist in 6.6.y. The backport makes the following equivalent substitutions:

| v6.19 API used by the original patches | 6.6.y equivalent | Notes |
| --- | --- | --- |
| KVM_PGT_FN(func)(args) | Direct func(args) call plus a kvm_is_realm() check | KVM_PGT_FN is the v6.19 pKVM MMU dispatch macro (introduced by fce886a60207), part of a large pKVM refactor unsuitable for standalone backporting |
| kvm_fault_lock(kvm) | read_lock(&kvm->mmu_lock) | The kvm_fault_lock() helper does not exist in 6.6.y |
| kvm_fault_unlock(kvm) | read_unlock(&kvm->mmu_lock) | Likewise |
| kvm_release_faultin_page(kvm, pfn, ...) | kvm_release_pfn_clean(pfn) | kvm_release_faultin_page() does not exist in 6.6.y |
| kvm_stage2_destroy(pgt) | kvm_pgtable_stage2_destroy(pgt) | Renamed/split in v6.19 by d68d66e57e2b |
| kvm_init_ipa_range() | The equivalent logic inlined in kvm_init_stage2_mmu() | v6.19 split IPA-range initialization into a separate function |

4.4 Context adaptation in arm.c

Several functions in arch/arm64/kvm/arm.c were refactored in v6.19:

  • kvm_arch_init_vm(): v6.19 adds a kvm_init_nested() call after the lockdep setup; 6.6.y does not have it. The Realm-type parsing code is inserted between the lockdep setup and kvm_share_hyp() in 6.6.y
  • kvm_arch_vcpu_run_pid_change(): v6.19 has removed the kvm_arch_vcpu_run_map_fp() call, which 6.6.y still has. The Realm-activation code is inserted after the vcpu_has_run_once() check
  • kvm_vm_ioctl_check_extension(): the last case in 6.6.y is KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES (#229); v6.19 has several newer cases. KVM_CAP_ARM_RMI is assigned a new number on this branch

4.5 Other adaptations

  • arch/arm64/kvm/reset.c: v6.19 has a system_supported_vcpu_features(kvm) function (taking a kvm argument) that does not exist in 6.6.y; the equivalent realm feature checks are implemented directly in 6.6.y
  • arch/arm64/kvm/inject_fault.c: v6.19 adds generic injection primitives such as __kvm_inject_exception(), which 6.6.y lacks; the CCA patch's WARN checks are inserted directly into the existing 6.6.y functions
  • drivers/perf/arm_pmu.c: patch 35 is a HACK restoring the cpu_armpmu per-CPU pointer (deleted in v6.19 by fa9d27773873); the pointer still exists in 6.6.y, so this patch's adaptation is simplified
  • arch/arm64/kvm/pmu-emul.c: v6.19 went through the PMCR.N → nr_pmu_counters rename; 6.6.y still uses the original variable names and was adapted accordingly
  • include/uapi/linux/kvm.h: add the KVM_VM_TYPE_ARM_MASK and KVM_VM_TYPE_ARM_REALM definitions; 6.6.y only has KVM_VM_TYPE_ARM_IPA_SIZE_MASK
  • arch/arm64/include/asm/kvm_host.h: KVM_VCPU_MAX_FEATURES raised from 7 to 8 to reserve a bit for KVM_ARM_VCPU_REC

4.6 Upstream refactors intentionally not backported

The following upstream refactors were intentionally not backported, to limit the invasiveness of the changes to the 6.6.y codebase:

| Upstream commit | Title | Reason not to backport |
| --- | --- | --- |
| fce886a60207 | KVM: arm64: Plumb the pKVM MMU in KVM | Large pKVM refactor introducing the KVM_PGT_FN macro; would heavily reshape mmu.c and is not required for CCA |
| dc06193532af | KVM: Move x86's API to release a faultin page to common KVM | Introduces kvm_release_faultin_page(); kvm_release_pfn_clean() is an equivalent substitute |
| 85c7869e30b7 | KVM: arm64: Use __kvm_faultin_pfn() to handle memory aborts | Moves arm64 to the new faultin API; the existing API serves as a substitute |
| d68d66e57e2b | KVM: arm64: Split kvm_pgtable_stage2_destroy() | Function rename/split; keeping the existing 6.6.y function name suffices |
| 8cc9dc1ae4fb | KVM: arm64: Rename the device variable to s2_force_noncacheable | Pure rename; the 6.6.y variable name device is kept |

Part 5: Main files involved

New files

| File | Description |
| --- | --- |
| virt/kvm/guest_memfd.c | Guest private-memory backend (backported from upstream) |
| arch/arm64/include/asm/rmi_smc.h | RMI SMC call definitions |
| arch/arm64/include/asm/rmi_cmds.h | RMI call wrappers |
| arch/arm64/include/asm/kvm_rmi.h | KVM-internal RMI interface |
| arch/arm64/kvm/rmi.c | Main RMI implementation |
| arch/arm64/kvm/rmi-exit.c | RMI exit handling |

Main modified files

| File | Changes |
| --- | --- |
| include/uapi/linux/kvm.h | New UAPI definitions: KVM_EXIT_MEMORY_FAULT, KVM_CAP_ARM_RMI, KVM_VM_TYPE_ARM_REALM, the memory_fault struct, etc. |
| include/linux/kvm_host.h | New memory-attributes infrastructure; declarations for kvm_prepare_memory_fault_exit(), kvm_gmem_get_pfn(), etc. |
| arch/arm64/kvm/mmu.c | Realm memory management (IPA limit, RTT tear-down, fault handling, RIPAS changes, forcing 4K pages) |
| arch/arm64/kvm/arm.c | Realm creation/destruction, capabilities, VCPU lifecycle management |
| arch/arm64/include/asm/kvm_host.h | Adds a realm field to struct kvm_arch; raises KVM_VCPU_MAX_FEATURES |
| arch/arm64/include/asm/kvm_emulate.h | Helpers such as kvm_is_realm() and vcpu_is_rec() |
| arch/arm64/kvm/guest.c | Realm register-access validation and register-list management |
| arch/arm64/kvm/reset.c | Realm VCPU initialization and PMCR.N setup |
| arch/arm64/kvm/psci.c | Realm PSCI request handling |
| arch/arm64/kvm/inject_fault.c | Realm exception-injection checks |
| arch/arm64/kvm/mmio.c | Realm MMIO emulation |
| arch/arm64/mm/fault.c | GPF exception handler registration |
| arch/arm64/kvm/vgic/ | Realm VGIC support |
| arch/arm64/kvm/arch_timer.c | Realm timer synchronization |
| arch/arm64/kvm/pmu-emul.c | Realm PMU support |
| drivers/perf/arm_pmu.c | PMU IRQ-disable mechanism |
| virt/kvm/Kconfig | New KVM_GUEST_MEMFD and KVM_GENERIC_MEMORY_ATTRIBUTES options |
| Documentation/virt/kvm/api.rst | Documentation for KVM_CAP_ARM_RMI and related interfaces |

Link: #1319

chao-p and others added 30 commits March 1, 2026 09:14
[ Upstream commit 5a47555 ]

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.

Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.

To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation.  For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

[ Upstream commit a7800aa ]

Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem.  With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings.   E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protections.  Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.

Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping.  Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd.  While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption.  And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping.  And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory.  That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem.  I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay.  Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 80583d0 ]

With migration disabled, one function becomes unused:

virt/kvm/guest_memfd.c:262:12: error: 'kvm_gmem_migrate_folio' defined but not used [-Werror=unused-function]
  262 | static int kvm_gmem_migrate_folio(struct address_space *mapping,
      |            ^~~~~~~~~~~~~~~~~~~~~~

Remove the #ifdef around the reference so that fallback_migrate_folio()
is never used.  The gmem implementation of the hook is trivial; since
the gmem mapping is unmovable, the pages should not be migrated anyway.

Fixes: a7800aa ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 1d23040 ]

truncate_inode_pages_range() may attempt to zero pages before truncating
them, and this will occur before arch-specific invalidations can be
triggered via .invalidate_folio/.free_folio hooks via kvm_gmem_aops. For
AMD SEV-SNP this would result in an RMP #PF being generated by the
hardware, which is currently treated as fatal (and even if specifically
allowed for, would not result in anything other than garbage being
written to guest pages due to encryption). On Intel TDX this would also
result in undesirable behavior.

Set the AS_INACCESSIBLE flag to prevent the MM from attempting
unexpected accesses of this sort during operations like truncation.

This may also in some cases yield a decent performance improvement for
guest_memfd userspace implementations that hole-punch ranges immediately
after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since
the current implementation of truncate_inode_pages_range() always ends
up zero'ing an entire 4K range if it is backed by a 2M folio.

Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240329212444.395559-6-michael.roth@amd.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 7062372 ]

Some SNP ioctls will require the page not to be in the pagecache, and as such they
will want to return EEXIST to userspace.  Start by passing the error up from
filemap_grab_folio.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit fa30b0d ]

Because kvm_gmem_get_pfn() is called from the page fault path without
any of the slots_lock, filemap lock or mmu_lock taken, it is
possible for it to race with kvm_gmem_unbind().  This is not a
problem, as any PTE that is installed temporarily will be zapped
before the guest has the occasion to run.

However, it is not possible to have a complete unbind+bind
racing with the page fault, because deleting the memslot
will call synchronize_srcu_expedited() and wait for the
page fault to be resolved.  Thus, we can still warn if
the file is there and is not the one we expect.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 3bb2531 ]

guest_memfd pages are generally expected to be in some arch-defined
initial state prior to using them for guest memory. For SEV-SNP this
initial state is 'private', or 'guest-owned', and requires additional
operations to move these pages into a 'private' state by updating the
corresponding entries the RMP table.

Allow for an arch-defined hook to handle updates of this sort, and go
ahead and implement one for x86 so KVM implementations like AMD SVM can
register a kvm_x86_ops callback to handle these updates for SEV-SNP
guests.

The preparation callback is always called when allocating/grabbing
folios via gmem, and it is up to the architecture to keep track of
whether or not the pages are already in the expected state (e.g. the RMP
table in the case of SEV-SNP).

In some cases, it is necessary to defer the preparation of the pages to
handle things like in-place encryption of initial guest memory payloads
before marking these pages as 'private'/'guest-owned'.  Add an argument
(always true for now) to kvm_gmem_get_folio() that allows for the
preparation callback to be bypassed.  To detect possible issues in
the way userspace initializes memory, it is only possible to add an
unprepared page if it is not already included in the filemap.

Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20231230172351.574091-5-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 17573fd ]

In preparation for adding a function that walks a set of pages
provided by userspace and populates them in a guest_memfd,
add a version of kvm_gmem_get_pfn() that has a "bool prepare"
argument and passes it down to kvm_gmem_get_folio().

Populating guest memory has to call repeatedly __kvm_gmem_get_pfn()
on the same file, so make the new function take struct file*.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 1f6c06b ]

During guest run-time, kvm_arch_gmem_prepare() is issued as needed to
prepare newly-allocated gmem pages prior to mapping them into the guest.
In the case of SEV-SNP, this mainly involves setting the pages to
private in the RMP table.

However, for the GPA ranges comprising the initial guest payload, which
are encrypted/measured prior to starting the guest, the gmem pages need
to be accessed prior to setting them to private in the RMP table so they
can be initialized with the userspace-provided data. Additionally, an
SNP firmware call is needed afterward to encrypt them in-place and
measure the contents into the guest's launch digest.

While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that
this handling can be done in an open-coded/vendor-specific manner, this
may expose more gmem-internal state/dependencies to external callers
than necessary. Try to avoid this by implementing an interface that
tries to handle as much of the common functionality inside gmem as
possible, while also making it generic enough to potentially be
usable/extensible for TDX as well.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit a90764f ]

In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
allow for special handling of this sort when freeing memory in response
to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
ahead and define an arch-specific hook for x86 since it will be needed
for handling memory used for SEV-SNP guests.

Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20231230172351.574091-6-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit d814738 ]

kvm_gmem_populate() is a potentially lengthy operation that can involve
multiple calls to the firmware.  Interrupt it if a signal arrives.

Fixes: 1f6c06b ("KVM: guest_memfd: Add interface for populating gmem pages with user data")
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Cc: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit e300614 ]

Use a guard to simplify early returns, and add two more easy
shortcuts.  If the requested attributes are invalid, the attributes
xarray will never show them as set.  And if testing a single page,
kvm_get_memory_attributes() is more efficient.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 47bb584 ]

When running an SEV-SNP guest with a sufficiently large amount of memory (1TB+),
the host can experience CPU soft lockups when running an operation in
kvm_vm_set_mem_attributes() to set memory attributes on the whole
range of guest memory.

watchdog: BUG: soft lockup - CPU#8 stuck for 26s! [qemu-kvm:6372]
CPU: 8 UID: 0 PID: 6372 Comm: qemu-kvm Kdump: loaded Not tainted 6.15.0-rc7.20250520.el9uek.rc1.x86_64 #1 PREEMPT(voluntary)
Hardware name: Oracle Corporation ORACLE SERVER E4-2c/Asm,MB Tray,2U,E4-2c, BIOS 78016600 11/13/2024
RIP: 0010:xas_create+0x78/0x1f0
Code: 00 00 00 41 80 fc 01 0f 84 82 00 00 00 ba 06 00 00 00 bd 06 00 00 00 49 8b 45 08 4d 8d 65 08 41 39 d6 73 20 83 ed 06 48 85 c0 <74> 67 48 89 c2 83 e2 03 48 83 fa 02 75 0c 48 3d 00 10 00 00 0f 87
RSP: 0018:ffffad890a34b940 EFLAGS: 00000286
RAX: ffff96f30b261daa RBX: ffffad890a34b9c8 RCX: 0000000000000000
RDX: 000000000000001e RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffad890a356868
R13: ffffad890a356860 R14: 0000000000000000 R15: ffffad890a356868
FS:  00007f5578a2a400(0000) GS:ffff97ed317e1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f015c70fb18 CR3: 00000001109fd006 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
 <TASK>
 xas_store+0x58/0x630
 __xa_store+0xa5/0x130
 xa_store+0x2c/0x50
 kvm_vm_set_mem_attributes+0x343/0x710 [kvm]
 kvm_vm_ioctl+0x796/0xab0 [kvm]
 __x64_sys_ioctl+0xa3/0xd0
 do_syscall_64+0x8c/0x7a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f5578d031bb
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d 4c 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe0a742b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000000004020aed2 RCX: 00007f5578d031bb
RDX: 00007ffe0a742c80 RSI: 000000004020aed2 RDI: 000000000000000b
RBP: 0000010000000000 R08: 0000010000000000 R09: 0000017680000000
R10: 0000000000000080 R11: 0000000000000246 R12: 00005575e5f95120
R13: 00007ffe0a742c80 R14: 0000000000000008 R15: 00005575e5f961e0

While looping through the range of memory setting the attributes,
call cond_resched() to give the scheduler a chance to run a higher
priority task on the runqueue if necessary and avoid staying in
kernel mode long enough to trigger the lockup.

Fixes: 5a47555 ("KVM: Introduce per-page memory attributes")
Cc: stable@vger.kernel.org # 6.12.x
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250609091121.2497429-2-liam.merwick@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[ Upstream commit 19a9a1a ]

Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to
CONFIG_KVM_GUEST_MEMFD. The original name implied that the feature only
supported "private" memory. However, CONFIG_KVM_PRIVATE_MEM enables
guest_memfd in general, which is not exclusively for private memory.
Subsequent patches in this series will add guest_memfd support for
non-CoCo VMs, whose memory is not private.

Renaming the Kconfig option to CONFIG_KVM_GUEST_MEMFD more accurately
reflects its broader scope as the main Kconfig option for all
guest_memfd-backed memory. This provides clearer semantics for the
option and avoids confusion as new features are introduced.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit d0d8722 ]

Right now this is simply more consistent and avoids use of pfn_to_page()
and put_page().  It will be put to more use in upcoming patches, to
ensure that the up-to-date flag is set at the very end of both the
kvm_gmem_get_pfn() and kvm_gmem_populate() flows.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
…preparation

[ Upstream commit d04c77d ]

The up-to-date flag as is now is not too useful; it tells guest_memfd not
to overwrite the contents of a folio, but it doesn't say that the page
is ready to be mapped into the guest.  For encrypted guests, mapping
a private page requires that the "preparation" phase has succeeded,
and at the same time the same page cannot be prepared twice.

So, ensure that folio_mark_uptodate() is only called on a prepared page.  If
kvm_gmem_prepare_folio() or the post_populate callback fail, the folio
will not be marked up-to-date; it's not a problem to call clear_highpage()
again on such a page prior to the next preparation attempt.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 564429a ]

Add "ARCH" to the symbols; shortly, the "prepare" phase will include both
the arch-independent step to clear out contents left in the page by the
host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE.
For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit e4ee544 ]

This check is currently performed by sev_gmem_post_populate(), but it
applies to all callers of kvm_gmem_populate(): the point of the function
is that the memory is being encrypted and some work has to be done
on all the gfns in order to encrypt them.

Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior
to invoking the callback, and stop the operation if a shared page
is encountered.  Because CONFIG_KVM_PRIVATE_MEM in principle does
not require attributes, this makes kvm_gmem_populate() depend on
CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them).

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit dca6c88 ]

Add new members to struct kvm_gfn_range to indicate which mapping
(private-vs-shared) to operate on: enum kvm_gfn_range_filter
attr_filter. Update the core zapping operations to set them appropriately.

TDX utilizes two GPA aliases for the same memslots, one for memory that is
for private memory and one that is for shared. For private memory, KVM
cannot always perform the same operations it does on memory for default
VMs, such as zapping pages and having them be faulted back in, as this
requires guest coordination. However, some operations such as guest driven
conversion of memory between private and shared should zap private memory.

Internally to the MMU, private and shared mappings are tracked on separate
roots. Mapping and zapping operations will operate on the respective GFN
alias for each root (private or shared). So zapping operations will by
default zap both aliases. Add fields in struct kvm_gfn_range to allow
callers to specify which aliases so they can only target the aliases
appropriate for their specific operation.

There was feedback that the target aliases should be specified such that
the default value (0) operates on both aliases. Several options were
considered, including variations with separate bools whose default
behavior was to process both aliases; these either allowed nonsensical
configurations or were confusing for the caller. A simple enum was also
explored and came close, but was hard to process in the caller. Instead,
use an enum with the default value (0) reserved as a disallowed value,
and catch ranges that did not have the target aliases specified by
looking for that specific value.

Set target alias with enum appropriately for these MMU operations:
 - For KVM's mmu notifier callbacks, zap shared pages only because private
   pages won't have a userspace mapping
 - For setting memory attributes, kvm_arch_pre_set_memory_attributes()
   chooses the aliases based on the attribute.
 - For guest_memfd invalidations, zap private only.

Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 923310b ]

Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve
clarity and accurately reflect its purpose.

The function kvm_slot_can_be_private() was previously used to check if a
given kvm_memory_slot is backed by guest_memfd. However, its name
implied that the memory in such a slot was exclusively "private".

As guest_memfd support expands to include non-private memory (e.g.,
shared host mappings), it's important to remove this association. The
new name, kvm_slot_has_gmem(), states that the slot is backed by
guest_memfd without making assumptions about the memory's privacy
attributes.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 638ea79 ]

Refactor user_mem_abort() to improve code clarity and simplify
assumptions within the function.

Key changes include:

* Immediately set force_pte to true at the beginning of the function if
  logging_active is true. This simplifies the flow and makes the
  condition for forcing a PTE more explicit.

* Remove the misleading comment stating that logging_active is
  guaranteed to never be true for VM_PFNMAP memslots, as this assertion
  is not entirely correct.

* Extract reusable code blocks into new helper functions:
  * prepare_mmu_memcache(): Encapsulates the logic for preparing and
    topping up the MMU page cache.
  * adjust_nested_fault_perms(): Isolates the adjustments to shadow S2
    permissions and the encoding of nested translation levels.

* Update min(a, (long)b) to min_t(long, a, b) for better type safety and
  consistency.

* Perform other minor tidying up of the code.

These changes primarily aim to simplify user_mem_abort() and make its
logic easier to understand and maintain, setting the stage for future
modifications.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Tao Chan <chentao@kylinos.cn>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit a7b57e0 ]

Add arm64 architecture support for handling guest page faults on memory
slots backed by guest_memfd.

This change introduces a new function, gmem_abort(), which encapsulates
the fault handling logic specific to guest_memfd-backed memory. The
kvm_handle_guest_abort() entry point is updated to dispatch to
gmem_abort() when a fault occurs on a guest_memfd-backed memory slot (as
determined by kvm_slot_has_gmem()).

Until guest_memfd gains support for huge pages, the fault granule for
these memory regions is restricted to PAGE_SIZE.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a potential build error (like below, when asm/kvm_emulate.h gets
included after the kvm/arm_psci.h) by including the missing header file
in kvm/arm_psci.h:

./include/kvm/arm_psci.h: In function ‘kvm_psci_version’:
./include/kvm/arm_psci.h:29:13: error: implicit declaration of function
   ‘vcpu_has_feature’; did you mean ‘cpu_have_feature’? [-Werror=implicit-function-declaration]
   29 |         if (vcpu_has_feature(vcpu, KVM_ARM_VCPU_PSCI_0_2)) {
      |             ^~~~~~~~~~~~~~~~
      |             cpu_have_feature

Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
If the host attempts to access granules that have been delegated for use
in a realm, these accesses will be caught and will trigger a Granule
Protection Fault (GPF).

A fault during a page walk signals a bug in the kernel and is handled by
oopsing the kernel. A non-page-walk fault could be caused by user space
having access to a page which has been delegated to the kernel; this
triggers a SIGBUS to allow debugging why user space is trying to access
a delegated page.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Steven Price <steven.price@arm.com>
The RMM (Realm Management Monitor) provides functionality that can be
accessed by SMC calls from the host.

The SMC definitions are based on DEN0137[1] version 1.0-rel0

[1] https://developer.arm.com/documentation/den0137/1-0rel0/

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
The wrappers make the call sites easier to read and deal with the
boilerplate of handling the error codes from the RMM.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Query the RMI version number and check if it is a compatible version. A
static key is also provided to signal that a supported RMM is available.

Functions are provided to query if a VM or VCPU is a realm (or rec)
which currently will always return false.

Later patches make use of struct realm and the states as the ioctls
interfaces are added to support realm and REC creation and destruction.

Signed-off-by: Steven Price <steven.price@arm.com>
There is one CAP which identifies the presence of CCA, and two ioctls.
One ioctl is used to populate memory and the other is used, when user
space is providing the PSCI implementation, to identify the target of
the operation.

Signed-off-by: Steven Price <steven.price@arm.com>
Introduce the skeleton functions for creating and destroying a realm.
The IPA size requested is checked against what the RMM supports.

The actual work of constructing the realm will be added in future
patches.

Signed-off-by: Steven Price <steven.price@arm.com>
… guests

RMM v1.0 provides no mechanism for the host to perform debug operations
on the guest. So limit the extensions that are visible to an allowlist
so that only those capabilities we can support are advertised.

Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Steven Price and others added 21 commits March 2, 2026 14:22
The RMM only allows setting the GPRs (x0-x30) and PC for a realm
guest. Check this in kvm_arm_set_reg() so that the VMM can receive a
suitable error return if other registers are written to.

The RMM makes similar restrictions for reading of the guest's registers
(this is *confidential* compute after all), however we don't impose the
restriction here. This allows the VMM to read (stale) values from the
registers which might be useful to read back the initial values even if
the RMM doesn't provide the latest version. For migration of a realm VM,
a new interface will be needed so that the VMM can receive an
(encrypted) blob of the VM's state.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
The RMM needs to be informed of the target REC when a PSCI call is made
with an MPIDR argument. Expose an ioctl to the userspace in case the PSCI
is handled by it.

Co-developed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
The RMM doesn't allow injection of an undefined exception into a realm
guest. Add a WARN to catch if this ever happens.

Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
The VMM has no control over or visibility of the vCPU execution of a
realm guest, and is therefore unable to provide meaningful stolen time
statistics. Reflect this by not advertising KVM_CAP_STEAL_TIME when
running a realm guest.

Note that steal time accounting is not available when a guest is running
within an Arm CCA realm (machine type KVM_VM_TYPE_ARM_REALM).

Signed-off-by: Steven Price <steven.price@arm.com>
For Realm guests it is impossible to directly inject a synchronous
exception. Instead the RMM can be asked to inject a Synchronous External
Abort (SEA) when the next REC enter is performed.

Expose the KVM_SET_VCPU_EVENTS API to provide the means for the VMM
to trigger an SEA injection, when the previous exit was due to a Data
abort for an emulated unprotected access.

Signed-off-by: Steven Price <steven.price@arm.com>
Forward RSI_HOST_CALLS to KVM's HVC handler.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Given we have different types of VMs supported, check the
support for SVE for the given instance of the VM to accurately
report the status.

Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
The minimum granule size supported by the RMM is a 4k page, so force 4k
pages for realm guests.

Signed-off-by: Steven Price <steven.price@arm.com>
Physical device assignment is not supported by RMM v1.0, so it
doesn't make much sense to allow device mappings within the realm.
Prevent them when the guest is a realm.

Signed-off-by: Steven Price <steven.price@arm.com>
Commit fa9d27773873 ("perf: arm_pmu: Kill last use of per-CPU cpu_armpmu
pointer") removed the per-CPU cpu_armpmu. Rather than refactoring the
code to deal with this, just reintroduce it. The CCA PMU code will be
changing when switching to the RMM v2.0 ABI and will need completely
reworking.

Signed-off-by: Steven Price <steven.price@arm.com>
Arm CCA assigns the physical PMU device to the guest running in realm
world, however the IRQs are routed via the host. To enter a realm guest
while a PMU IRQ is pending it is necessary to block the physical IRQ to
prevent an immediate exit. Provide a mechanism in the PMU driver for KVM
to control the physical IRQ.

Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Use the PMU registers from the RmiRecExit structure to identify when an
overflow interrupt is due and inject it into the guest. Also hook up the
configuration option for enabling the PMU within the guest.

When entering a realm guest with a PMU interrupt pending, it is
necessary to disable the physical interrupt. Otherwise when the RMM
restores the PMU state the physical interrupt will trigger causing an
immediate exit back to the host. The guest is expected to acknowledge
the interrupt causing a host exit (to update the GIC state) which gives
the opportunity to re-enable the physical interrupt before the next PMU
event.

Number of PMU counters is configured by the VMM by writing to PMCR.N.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Steven Price <steven.price@arm.com>
… to userspace

The RMM describes the maximum number of BPs/WPs available to the guest
in the Feature Register 0. Propagate those numbers into ID_AA64DFR0_EL1,
which is visible to userspace. A VMM needs this information in order to
set up realm parameters.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Allow userspace to configure the number of breakpoints and watchpoints
of a Realm VM through KVM_SET_ONE_REG ID_AA64DFR0_EL1.

The KVM sys_reg handler checks the user value against the maximum value
given by RMM (arm64_check_features() gets it from the
read_sanitised_id_aa64dfr0_el1() reset handler).

Userspace discovers that it can write these fields by issuing a
KVM_ARM_GET_REG_WRITABLE_MASKS ioctl.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
… by RMM

Provide an accurate number of available PMU counters to userspace when
setting up a Realm.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
RMM provides the maximum vector length it supports for a guest in its
feature register. Make it visible to the rest of KVM and to userspace
via KVM_REG_ARM64_SVE_VLS.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Obtain the max vector length configured by userspace on the vCPUs, and
write it into the Realm parameters. By default the vCPU is configured
with the max vector length reported by RMM, and userspace can reduce it
with a write to KVM_REG_ARM64_SVE_VLS.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Steven Price <steven.price@arm.com>
KVM_GET_REG_LIST should not be called before SVE is finalized. The ioctl
handler currently returns -EPERM in this case. But because it uses
kvm_arm_vcpu_is_finalized(), it now also rejects the call for
unfinalized REC even though finalizing the REC can only be done late,
after Realm descriptor creation.

Move the check to copy_sve_reg_indices(). One adverse side effect of
this change is that a KVM_GET_REG_LIST call that only probes for the
array size will now succeed even if SVE is not finalized, but that seems
harmless since the following KVM_GET_REG_LIST with the full array will
fail.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Userspace can set a few registers with KVM_SET_ONE_REG (9 GP registers
at runtime, and 3 system registers during initialization). Update the
register list returned by KVM_GET_REG_LIST.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Steven Price <steven.price@arm.com>
Increment KVM_VCPU_MAX_FEATURES to expose the new capability to user
space.

Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
All the pieces are now in place, so enable kvm_rmi_is_available when the
RMM is detected.

Signed-off-by: Steven Price <steven.price@arm.com>
@Avenger-285714 Avenger-285714 requested review from Copilot and opsiff March 2, 2026 07:08
@sourcery-ai sourcery-ai bot left a comment

Sorry @Avenger-285714, your pull request is larger than the review limit of 150000 diff characters

@deepin-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from avenger-285714. For more information see the Code Review Process.


@Avenger-285714 Avenger-285714 changed the title [Deepin-Kernel-SIG] [linux 6.6.y] [Arm] [Fromlist] [Security] arm64: Support for Arm CCA in KVM [WIP] [Deepin-Kernel-SIG] [linux 6.6.y] [Arm] [Fromlist] [Security] arm64: Support for Arm CCA in KVM Mar 2, 2026
Copilot AI left a comment


Pull request overview

This PR backports Arm CCA (RME/Realm) enablement for KVM to the linux-6.6.y kernel series, including the prerequisite generic KVM infrastructure (guest_memfd, per-page memory attributes, and UAPI extensions) needed to support private vs. shared guest memory.

Changes:

  • Add generic KVM guest_memfd and per-page memory attribute infrastructure, including new ioctls/UAPI (KVM_CREATE_GUEST_MEMFD, KVM_SET_MEMORY_ATTRIBUTES, KVM_SET_USER_MEMORY_REGION2, KVM_EXIT_MEMORY_FAULT).
  • Integrate Arm64 Realm/RMI plumbing into KVM (new RMI headers, realm VM/vCPU lifecycle, MMU fault handling, VGIC/timer/PSCI adaptations).
  • Extend x86 KVM paths to interoperate with generic private memory infrastructure (e.g., memory-fault exit info, gmem prepare/invalidate hooks, SNP-related populate flow in SEV code).

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
virt/kvm/kvm_mm.h Adds guest_memfd interface declarations/stubs.
virt/kvm/kvm_main.c Wires memslot lifecycle to gmem bind/unbind; adds memory attributes and guest_memfd ioctls.
virt/kvm/guest_memfd.c Introduces guest_memfd backing implementation (folio management, bind/unbind, populate).
virt/kvm/Makefile.kvm Builds guest_memfd support when enabled.
virt/kvm/Kconfig Adds KVM_GUEST_MEMFD and generic memory-attribute/private-mem Kconfig symbols.
include/uapi/linux/kvm.h Adds UAPI for memory attributes, guest_memfd, memory-fault exit, arm64 VM type bits, and new caps.
include/linux/perf/arm_pmu.h Exposes per-CPU arm PMU pointer and physical IRQ toggling API.
include/linux/kvm_host.h Adds mem attributes API hooks and gmem PFN retrieval API.
include/kvm/arm_psci.h Adds arm64 KVM emulate header dependency.
include/kvm/arm_pmu.h Adds helper macro used by realm PMU/IRQ handling.
include/kvm/arm_arch_timer.h Exposes realm timer update helper.
fs/anon_inodes.c Exports anon inode helper for guest_memfd file creation.
drivers/perf/arm_pmu.c Implements PMU physical IRQ enable/disable helper and exposes cpu_armpmu.
arch/x86/kvm/x86.c Adds x86 arch hooks for gmem prepare/invalidate; exposes memory-fault-info capability.
arch/x86/kvm/svm/sev.c Adds SNP launch/update flow using gmem populate support.
arch/x86/kvm/mmu/mmu_internal.h Extends page fault tracking for private memory and refcounted pages.
arch/x86/kvm/mmu/mmu.c Adds private-memory PFN faultin path and memory-attribute integration.
arch/x86/kvm/Kconfig Enables generic private-mem + gmem hooks under SEV.
arch/x86/include/asm/kvm_host.h Adds x86 arch “has_private_mem” plumbing and gmem ops hooks.
arch/x86/include/asm/kvm-x86-ops.h Adds optional x86 ops entries for gmem prepare/invalidate.
arch/arm64/mm/fault.c Adds GPF handling for RME Granule Protection Faults.
arch/arm64/kvm/vgic/vgic.h Adds realm-specific LR count helper and RMI include.
arch/arm64/kvm/vgic/vgic.c Adds realm save/restore paths for VGIC state.
arch/arm64/kvm/vgic/vgic-v3.c Skips host-side APR/trap handling for realm vCPUs.
arch/arm64/kvm/vgic/vgic-init.c Blocks unsupported VGICv2 emulation for realms.
arch/arm64/kvm/sys_regs.c Tightens ID reg validation and hides sysregs for realms.
arch/arm64/kvm/rmi-exit.c Implements realm REC exit decoding/handling.
arch/arm64/kvm/reset.c Adds realm-aware SVE max VL handling and REC cleanup.
arch/arm64/kvm/psci.c Integrates realm PSCI completion semantics.
arch/arm64/kvm/pmu-emul.c Reads realm PMU overflow status from REC exit context.
arch/arm64/kvm/mmu.c Adds realm mapping/unmapping paths and gmem fault handling for private memory.
arch/arm64/kvm/mmio.c Adjusts MMIO emulation return path for realm REC ABI.
arch/arm64/kvm/inject_fault.c Adjusts exception injection behavior for realm RECs.
arch/arm64/kvm/hypercalls.c Hides FW reg indices for realms.
arch/arm64/kvm/guest.c Restricts and validates writable regs and event injection for realms.
arch/arm64/kvm/arm.c Adds realm VM type, capability filtering, REC run loop, and RMI init/populate ioctls.
arch/arm64/kvm/arch_timer.c Adds realm timer IRQ update path and realm-specific offset behavior.
arch/arm64/kvm/Makefile Builds new RMI implementation files.
arch/arm64/kvm/Kconfig Enables generic memory-attributes integration for arm64 KVM and related selects.
arch/arm64/kernel/cpufeature.c Exposes RME feature bits in CPU feature framework.
arch/arm64/include/uapi/asm/kvm.h Adds KVM_ARM_VCPU_REC feature bit.
arch/arm64/include/asm/virt.h Declares static key for RMI availability.
arch/arm64/include/asm/rmi_smc.h Adds RMI SMC IDs and ABI structures.
arch/arm64/include/asm/rmi_cmds.h Adds inline wrappers for RMI commands.
arch/arm64/include/asm/kvm_rmi.h Adds realm/REC structures and internal APIs.
arch/arm64/include/asm/kvm_host.h Adds realm fields to KVM arch structs and realm-private-mem indicator.
arch/arm64/include/asm/kvm_emulate.h Adds realm helpers (kvm_is_realm, kvm_gpa_from_fault, vcpu_is_rec, etc.).
Documentation/virt/kvm/api.rst Documents new UAPIs/capabilities for guest_memfd, memory attributes, and Arm RMI.


if (IS_ERR(folio))
return PTR_ERR(folio);

*page = folio_file_page(folio, gfn - slot->base_gfn);

Copilot AI Mar 2, 2026


The page returned via folio_file_page() is indexed by file offset, but this uses gfn - slot->base_gfn (offset within the memslot) and ignores slot->gmem.pgoff. For non-zero slot->gmem.pgoff, this returns the wrong page pointer. Use the same file index that __kvm_gmem_get_pfn() computes (e.g. gfn - slot->base_gfn + slot->gmem.pgoff).

Comment on lines +2413 to +2415
bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
unsigned long attrs)
{

Copilot AI Mar 2, 2026


kvm_range_has_memory_attributes() is implemented with a 4-argument signature (attrs only), but new callers added in this PR pass five arguments (mask + attrs). This is a build break and also suggests the helper's semantics are incomplete. Update the function signature and implement masked matching (or update all callers to the 4-arg form).

Comment on lines +775 to +782
```c
void arm_pmu_set_phys_irq(bool enable)
{
	int cpu = get_cpu();
	struct arm_pmu *pmu = per_cpu(cpu_armpmu, cpu);
	int irq;

	irq = armpmu_get_cpu_irq(pmu, cpu);
	if (irq && !enable)
```
Copilot AI (Mar 2, 2026):

arm_pmu_set_phys_irq() dereferences pmu->hw_events via armpmu_get_cpu_irq(pmu, cpu) without checking whether pmu is NULL. per_cpu(cpu_armpmu, cpu) is set/cleared during CPU hotplug and could be NULL; add a NULL check (and possibly validate cpu_irq_ops) to avoid a potential NULL deref.

Comment on lines +19 to +20
```c
	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
	struct kvm_gmem *gmem;
```
Copilot AI (Mar 2, 2026):

inode->i_mapping doesn't have an i_private_list member (the address_space member is private_list). This will fail to compile and is also inconsistent with later code that uses inode->i_mapping->private_list.

Comment on lines +410 to +412
```c
	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
					 O_RDWR, NULL);
	if (IS_ERR(file)) {
```
Copilot AI (Mar 2, 2026):

anon_inode_create_getfile() is used here but is not declared in include/linux/anon_inodes.h and doesn't appear to be defined anywhere in the tree. As-is this will not compile/link. Either switch to an existing helper like anon_inode_getfile_secure()/anon_inode_getfile(), or add/declare an anon_inode_create_getfile() wrapper in fs/anon_inodes.c and the header.

```c
	return __anon_inode_getfile(name, fops, priv, flags,
				    context_inode, true);
}
EXPORT_SYMBOL_GPL(anon_inode_create_getfile);
```
Copilot AI (Mar 2, 2026):

This exports anon_inode_create_getfile, but there is no corresponding function definition in this file (only anon_inode_getfile() and anon_inode_getfile_secure()). Exporting an undefined symbol will fail the build, and it doesn't add a prototype for anon_inode_create_getfile() either. Define the wrapper (likely calling __anon_inode_getfile(..., secure=true)) and add it to include/linux/anon_inodes.h, or export the correct existing symbol instead.

Comment on lines +659 to +662
```c
	while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order),
						KVM_MEMORY_ATTRIBUTE_PRIVATE,
						KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
		if (!max_order)
```
Copilot AI (Mar 2, 2026):

kvm_range_has_memory_attributes() is called with five arguments here, but its declaration/definition in this PR only accepts four. This is a build break; please align the function signature and update all call sites consistently (e.g. if a mask parameter is intended, add it to the prototype and implement masked matching in the helper).

Comment on lines +7296 to +7297
```c
	if (level == PG_LEVEL_2M)
		return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
```
Copilot AI (Mar 2, 2026):

kvm_range_has_memory_attributes() is invoked with five arguments here, but in this PR it's declared/defined with only four parameters. This mismatch will fail to compile; align the helper's prototype/definition with the intended call pattern (or adjust callers).

Comment on lines +2418 to +2419
```c
bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
				     unsigned long attrs);
```
Copilot AI (Mar 2, 2026):

kvm_range_has_memory_attributes() is declared with a 4-argument signature, but there are new callers in this PR that pass five arguments (e.g. x86 MMU hugepage checks and guest_memfd populate). This will not compile; update the prototype (and the implementation) to match the intended signature and semantics.

```kconfig
	select SCHED_INFO
	select GUEST_PERF_EVENTS if PERF_EVENTS
	select KVM_GENERIC_MEMORY_ATTRIBUTES
	select HAVE_KVM_ARCH_GMEM_POPULATE
```
Copilot AI (Mar 2, 2026):

HAVE_KVM_ARCH_GMEM_POPULATE is selected here, but that Kconfig symbol isn't defined anywhere (at least not in this PR or the existing tree). Selecting an undefined symbol will break Kconfig processing; either add the missing symbol definition (likely in virt/kvm/Kconfig) or drop this select if it's not needed.

Suggested change:

```diff
-	select HAVE_KVM_ARCH_GMEM_POPULATE
```

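If the symbol is meant to exist rather than be dropped, a minimal sketch of the missing definition, mirroring how other promptless `HAVE_KVM_*` bools are declared (placement in `virt/kvm/Kconfig` is taken from the comment's suggestion, not verified against this tree):

```kconfig
config HAVE_KVM_ARCH_GMEM_POPULATE
	bool
```

A promptless `bool` like this has no user-visible prompt; it only becomes `y` when an architecture Kconfig `select`s it, which is exactly how the `select` flagged above would consume it.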