[WIP] [Deepin-Kernel-SIG] [linux 6.6.y] [Arm] [Fromlist] [Security] arm64: Support for Arm CCA in KVM#1520
[ Upstream commit 5a47555 ] In confidential computing usages, whether a page is private or shared is necessary information for KVM to perform operations like page fault handling, page zapping, etc. There are other potential use cases for per-page memory attributes, e.g. to make memory read-only (or no-exec, or exec-only, etc.) without having to modify memslots. Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory attributes for a guest memory range. Use an xarray to store the per-page attributes internally, with a naive, not fully optimized implementation, i.e. prioritize correctness over performance for the initial implementation. Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX attributes/protections in the future, e.g. to give userspace fine-grained control over read, write, and execute protections for guest memory. Provide arch hooks for handling attribute changes before and after common code sets the new attributes, e.g. x86 will use the "pre" hook to zap all relevant mappings, and the "post" hook to track whether or not hugepages can be used to map the range. To simplify the implementation, wrap the entire sequence with kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly guaranteed to be an invalidation. For the initial use case, x86 *will* always invalidate memory, and preventing arch code from creating new mappings while the attributes are in flux makes it much easier to reason about the correctness of consuming attributes. It's possible that future usages may not require an invalidation, e.g. if KVM ends up supporting RWX protections and userspace grants _more_ protections, but again opt for simplicity and punt optimizations to if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com Cc: Fuad Tabba <tabba@google.com> Cc: Xu Yilun <yilun.xu@intel.com> Cc: Mickaël Salaün <mic@digikod.net> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20231027182217.3615211-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit a7800aa ] Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based memory that is tied to a specific KVM virtual machine and whose primary purpose is to serve guest memory. A guest-first memory subsystem allows for optimizations and enhancements that are kludgy or outright infeasible to implement/support in a generic memory subsystem. With guest_memfd, guest protections and mapping sizes are fully decoupled from host userspace mappings. E.g. KVM currently doesn't support mapping memory as writable in the guest without it also being writable in host userspace, as KVM's ABI uses VMA protections to define the allowed guest protections. Userspace can fudge this by establishing two mappings, a writable mapping for the guest and a readable one for itself, but that's suboptimal on multiple fronts. Similarly, KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size, e.g. KVM doesn't support creating a 1GiB guest mapping unless userspace also has a 1GiB guest mapping. Decoupling the mapping sizes would allow userspace to precisely map only what is needed without impacting guest performance, e.g. to harden against unintentional accesses to guest memory. Decoupling guest and userspace mappings may also allow for a cleaner alternative to high-granularity mappings for HugeTLB, which has reached a bit of an impasse and is unlikely to ever be merged. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to mmap() guest memory). More immediately, being able to map memory into KVM guests without mapping said memory into the host is critical for Confidential VMs (CoCo VMs), the initial use case for guest_memfd.
While AMD's SEV and Intel's TDX prevent untrusted software from reading guest private data by encrypting guest memory with a key that isn't usable by the untrusted host, projects such as Protected KVM (pKVM) provide confidentiality and integrity *without* relying on memory encryption. And with SEV-SNP and TDX, accessing guest private memory can be fatal to the host, i.e. KVM must prevent host userspace from accessing guest memory irrespective of hardware behavior. Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as being mappable only by KVM (or a similarly enlightened kernel subsystem). That approach was abandoned largely due to it needing to play games with PROT_NONE to prevent userspace from accessing guest memory. Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping guest private memory into userspace, but that approach failed to meet several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel wouldn't easily be able to enforce a 1:1 page:guest association, let alone a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory that isn't backed by 'struct page', e.g. if devices gain support for exposing encrypted memory regions to guests. Attempt #3 was to extend the memfd() syscall and wrap shmem to provide dedicated file-based guest memory. That approach made it as far as v10 before feedback from Hugh Dickins and Christian Brauner (and others) led to its demise. Hugh's objection was that piggybacking shmem made no sense for KVM's use case as KVM didn't actually *want* the features provided by shmem. I.e. KVM was using memfd() and shmem to avoid having to manage memory directly, not because memfd() and shmem were the optimal solution, e.g. things like read/write/mmap in shmem were dead weight. Christian pointed out flaws with implementing a partial overlay (wrapping only _some_ of shmem), e.g.
poking at inode_operations or super_operations would show shmem stuff, but address_space_operations and file_operations would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM stop being lazy and create a proper API. Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org Cc: Fuad Tabba <tabba@google.com> Cc: Vishal Annapurve <vannapurve@google.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Maciej Szmigiero <mail@maciej.szmigiero.name> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Quentin Perret <qperret@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Wang <wei.w.wang@intel.com> Cc: Liam Merwick <liam.merwick@oracle.com> Cc: Isaku Yamahata <isaku.yamahata@gmail.com> Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Kirill A.
Shutemov <kirill.shutemov@linux.intel.com> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20231027182217.3615211-17-seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 80583d0 ] With migration disabled, one function becomes unused: virt/kvm/guest_memfd.c:262:12: error: 'kvm_gmem_migrate_folio' defined but not used [-Werror=unused-function] 262 | static int kvm_gmem_migrate_folio(struct address_space *mapping, | ^~~~~~~~~~~~~~~~~~~~~~ Remove the #ifdef around the reference so that fallback_migrate_folio() is never used. The gmem implementation of the hook is trivial; since the gmem mapping is unmovable, the pages should not be migrated anyway. Fixes: a7800aa ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Reported-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 1d23040 ] truncate_inode_pages_range() may attempt to zero pages before truncating them, and this will occur before arch-specific invalidations can be triggered via .invalidate_folio/.free_folio hooks via kvm_gmem_aops. For AMD SEV-SNP this would result in an RMP #PF being generated by the hardware, which is currently treated as fatal (and even if specifically allowed for, would not result in anything other than garbage being written to guest pages due to encryption). On Intel TDX this would also result in undesirable behavior. Set the AS_INACCESSIBLE flag to prevent the MM from attempting unexpected accesses of this sort during operations like truncation. This may also in some cases yield a decent performance improvement for guest_memfd userspace implementations that hole-punch ranges immediately after private->shared conversions via KVM_SET_MEMORY_ATTRIBUTES, since the current implementation of truncate_inode_pages_range() always ends up zeroing an entire 4K range if it is backed by a 2M folio. Link: https://lore.kernel.org/lkml/ZR9LYhpxTaTk6PJX@google.com/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240329212444.395559-6-michael.roth@amd.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 7062372 ] Some SNP ioctls will require the page not to be in the pagecache, and as such they will want to return EEXIST to userspace. Start by passing the error up from filemap_grab_folio. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit fa30b0d ] Because kvm_gmem_get_pfn() is called from the page fault path without any of the slots_lock, filemap lock or mmu_lock taken, it is possible for it to race with kvm_gmem_unbind(). This is not a problem, as any PTE that is installed temporarily will be zapped before the guest has the occasion to run. However, it is not possible to have a complete unbind+bind racing with the page fault, because deleting the memslot will call synchronize_srcu_expedited() and wait for the page fault to be resolved. Thus, we can still warn if the file is there and is not the one we expect. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 3bb2531 ] guest_memfd pages are generally expected to be in some arch-defined initial state prior to using them for guest memory. For SEV-SNP this initial state is 'private', or 'guest-owned', and requires additional operations to move these pages into a 'private' state by updating the corresponding entries in the RMP table. Allow for an arch-defined hook to handle updates of this sort, and go ahead and implement one for x86 so KVM implementations like AMD SVM can register a kvm_x86_ops callback to handle these updates for SEV-SNP guests. The preparation callback is always called when allocating/grabbing folios via gmem, and it is up to the architecture to keep track of whether or not the pages are already in the expected state (e.g. the RMP table in the case of SEV-SNP). In some cases, it is necessary to defer the preparation of the pages to handle things like in-place encryption of initial guest memory payloads before marking these pages as 'private'/'guest-owned'. Add an argument (always true for now) to kvm_gmem_get_folio() that allows for the preparation callback to be bypassed. To detect possible issues in the way userspace initializes memory, it is only possible to add an unprepared page if it is not already included in the filemap. Link: https://lore.kernel.org/lkml/ZLqVdvsF11Ddo7Dq@google.com/ Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-5-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 17573fd ] In preparation for adding a function that walks a set of pages provided by userspace and populates them in a guest_memfd, add a version of kvm_gmem_get_pfn() that has a "bool prepare" argument and passes it down to kvm_gmem_get_folio(). Populating guest memory has to call __kvm_gmem_get_pfn() repeatedly on the same file, so make the new function take a struct file*. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 1f6c06b ] During guest run-time, kvm_arch_gmem_prepare() is issued as needed to prepare newly-allocated gmem pages prior to mapping them into the guest. In the case of SEV-SNP, this mainly involves setting the pages to private in the RMP table. However, for the GPA ranges comprising the initial guest payload, which are encrypted/measured prior to starting the guest, the gmem pages need to be accessed prior to setting them to private in the RMP table so they can be initialized with the userspace-provided data. Additionally, an SNP firmware call is needed afterward to encrypt them in-place and measure the contents into the guest's launch digest. While it is possible to bypass the kvm_arch_gmem_prepare() hooks so that this handling can be done in an open-coded/vendor-specific manner, this may expose more gmem-internal state/dependencies to external callers than necessary. Try to avoid this by implementing an interface that tries to handle as much of the common functionality inside gmem as possible, while also making it generic enough to potentially be usable/extensible for TDX as well. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit a90764f ] In some cases, like with SEV-SNP, guest memory needs to be updated in a platform-specific manner before it can be safely freed back to the host. Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to allow for special handling of this sort when freeing memory in response to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go ahead and define an arch-specific hook for x86 since it will be needed for handling memory used for SEV-SNP guests. Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20231230172351.574091-6-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit d814738 ] kvm_gmem_populate() is a potentially lengthy operation that can involve multiple calls to the firmware. Interrupt it if a signal arrives. Fixes: 1f6c06b ("KVM: guest_memfd: Add interface for populating gmem pages with user data") Cc: Isaku Yamahata <isaku.yamahata@intel.com> Cc: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit e300614 ] Use a guard to simplify early returns, and add two more easy shortcuts. If the requested attributes are invalid, the attributes xarray will never show them as set. And if testing a single page, kvm_get_memory_attributes() is more efficient. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 47bb584 ] When running an SEV-SNP guest with a sufficiently large amount of memory (1TB+), the host can experience CPU soft lockups when running an operation in kvm_vm_set_mem_attributes() to set memory attributes on the whole range of guest memory.

watchdog: BUG: soft lockup - CPU#8 stuck for 26s! [qemu-kvm:6372]
CPU: 8 UID: 0 PID: 6372 Comm: qemu-kvm Kdump: loaded Not tainted 6.15.0-rc7.20250520.el9uek.rc1.x86_64 #1 PREEMPT(voluntary)
Hardware name: Oracle Corporation ORACLE SERVER E4-2c/Asm,MB Tray,2U,E4-2c, BIOS 78016600 11/13/2024
RIP: 0010:xas_create+0x78/0x1f0
Code: 00 00 00 41 80 fc 01 0f 84 82 00 00 00 ba 06 00 00 00 bd 06 00 00 00 49 8b 45 08 4d 8d 65 08 41 39 d6 73 20 83 ed 06 48 85 c0 <74> 67 48 89 c2 83 e2 03 48 83 fa 02 75 0c 48 3d 00 10 00 00 0f 87
RSP: 0018:ffffad890a34b940 EFLAGS: 00000286
RAX: ffff96f30b261daa RBX: ffffad890a34b9c8 RCX: 0000000000000000
RDX: 000000000000001e RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffad890a356868
R13: ffffad890a356860 R14: 0000000000000000 R15: ffffad890a356868
FS:  00007f5578a2a400(0000) GS:ffff97ed317e1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f015c70fb18 CR3: 00000001109fd006 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
 <TASK>
 xas_store+0x58/0x630
 __xa_store+0xa5/0x130
 xa_store+0x2c/0x50
 kvm_vm_set_mem_attributes+0x343/0x710 [kvm]
 kvm_vm_ioctl+0x796/0xab0 [kvm]
 __x64_sys_ioctl+0xa3/0xd0
 do_syscall_64+0x8c/0x7a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f5578d031bb
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d 4c 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe0a742b88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000000004020aed2 RCX: 00007f5578d031bb
RDX: 00007ffe0a742c80 RSI: 000000004020aed2 RDI: 000000000000000b
RBP: 0000010000000000 R08: 0000010000000000 R09: 0000017680000000
R10: 0000000000000080 R11: 0000000000000246 R12: 00005575e5f95120
R13: 00007ffe0a742c80 R14: 0000000000000008 R15: 00005575e5f961e0

While looping through the range of memory setting the attributes, call cond_resched() to give the scheduler a chance to run a higher priority task on the runqueue if necessary and avoid staying in kernel mode long enough to trigger the lockup. Fixes: 5a47555 ("KVM: Introduce per-page memory attributes") Cc: stable@vger.kernel.org # 6.12.x Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20250609091121.2497429-2-liam.merwick@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
[ Upstream commit 19a9a1a ] Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD. The original name implied that the feature only supported "private" memory. However, CONFIG_KVM_PRIVATE_MEM enables guest_memfd in general, which is not exclusively for private memory. Subsequent patches in this series will add guest_memfd support for non-CoCo VMs, whose memory is not private. Renaming the Kconfig option to CONFIG_KVM_GUEST_MEMFD more accurately reflects its broader scope as the main Kconfig option for all guest_memfd-backed memory. This provides clearer semantics for the option and avoids confusion as new features are introduced. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit d0d8722 ] Right now this is simply more consistent and avoids use of pfn_to_page() and put_page(). It will be put to more use in upcoming patches, to ensure that the up-to-date flag is set at the very end of both the kvm_gmem_get_pfn() and kvm_gmem_populate() flows. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit d04c77d ] As it stands, the up-to-date flag is not too useful; it tells guest_memfd not to overwrite the contents of a folio, but it doesn't say that the page is ready to be mapped into the guest. For encrypted guests, mapping a private page requires that the "preparation" phase has succeeded, and at the same time the same page cannot be prepared twice. So, ensure that folio_mark_uptodate() is only called on a prepared page. If kvm_gmem_prepare_folio() or the post_populate callback fails, the folio will not be marked up-to-date; it's not a problem to call clear_highpage() again on such a page prior to the next preparation attempt. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 564429a ] Add "ARCH" to the symbols; shortly, the "prepare" phase will include both the arch-independent step to clear out contents left in the page by the host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE. For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit e4ee544 ] This check is currently performed by sev_gmem_post_populate(), but it applies to all callers of kvm_gmem_populate(): the point of the function is that the memory is being encrypted and some work has to be done on all the gfns in order to encrypt them. Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior to invoking the callback, and stop the operation if a shared page is encountered. Because CONFIG_KVM_PRIVATE_MEM in principle does not require attributes, this makes kvm_gmem_populate() depend on CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them). Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit dca6c88 ] Add new members to struct kvm_gfn_range to indicate which mapping (private-vs-shared) to operate on: enum kvm_gfn_range_filter attr_filter. Update the core zapping operations to set them appropriately. TDX utilizes two GPA aliases for the same memslots, one for private memory and one for shared. For private memory, KVM cannot always perform the same operations it does on memory for default VMs, such as zapping pages and having them be faulted back in, as this requires guest coordination. However, some operations such as guest driven conversion of memory between private and shared should zap private memory. Internally to the MMU, private and shared mappings are tracked on separate roots. Mapping and zapping operations will operate on the respective GFN alias for each root (private or shared). So zapping operations will by default zap both aliases. Add fields in struct kvm_gfn_range to allow callers to specify which aliases so they can only target the aliases appropriate for their specific operation. There was feedback that target aliases should be specified such that the default value (0) is to operate on both aliases. Several options were considered, including variations with separate bools defined such that the default behavior was to process both aliases; they either allowed nonsensical configurations, or were confusing for the caller. A simple enum was also explored and was close, but was hard to process in the caller. Instead, use an enum with the default value (0) reserved as a disallowed value. Catch ranges that didn't have the target aliases specified by looking for that specific value. Set target alias with enum appropriately for these MMU operations: - For KVM's mmu notifier callbacks, zap shared pages only because private pages won't have a userspace mapping - For setting memory attributes, kvm_arch_pre_set_memory_attributes() chooses the aliases based on the attribute.
- For guest_memfd invalidations, zap private only. Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/ Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 923310b ] Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve clarity and accurately reflect its purpose. The function kvm_slot_can_be_private() was previously used to check if a given kvm_memory_slot is backed by guest_memfd. However, its name implied that the memory in such a slot was exclusively "private". As guest_memfd support expands to include non-private memory (e.g., shared host mappings), it's important to remove this association. The new name, kvm_slot_has_gmem(), states that the slot is backed by guest_memfd without making assumptions about the memory's privacy attributes. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit 638ea79 ] Refactor user_mem_abort() to improve code clarity and simplify assumptions within the function. Key changes include: * Immediately set force_pte to true at the beginning of the function if logging_active is true. This simplifies the flow and makes the condition for forcing a PTE more explicit. * Remove the misleading comment stating that logging_active is guaranteed to never be true for VM_PFNMAP memslots, as this assertion is not entirely correct. * Extract reusable code blocks into new helper functions: * prepare_mmu_memcache(): Encapsulates the logic for preparing and topping up the MMU page cache. * adjust_nested_fault_perms(): Isolates the adjustments to shadow S2 permissions and the encoding of nested translation levels. * Update min(a, (long)b) to min_t(long, a, b) for better type safety and consistency. * Perform other minor tidying up of the code. These changes primarily aim to simplify user_mem_abort() and make its logic easier to understand and maintain, setting the stage for future modifications. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Tao Chan <chentao@kylinos.cn> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[ Upstream commit a7b57e0 ] Add arm64 architecture support for handling guest page faults on memory slots backed by guest_memfd. This change introduces a new function, gmem_abort(), which encapsulates the fault handling logic specific to guest_memfd-backed memory. The kvm_handle_guest_abort() entry point is updated to dispatch to gmem_abort() when a fault occurs on a guest_memfd-backed memory slot (as determined by kvm_slot_has_gmem()). Until guest_memfd gains support for huge pages, the fault granule for these memory regions is restricted to PAGE_SIZE. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: James Houghton <jthoughton@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a potential build error (like the one below, which occurs when asm/kvm_emulate.h is included after kvm/arm_psci.h) by including the missing header file in kvm/arm_psci.h:
./include/kvm/arm_psci.h: In function ‘kvm_psci_version’:
./include/kvm/arm_psci.h:29:13: error: implicit declaration of function
‘vcpu_has_feature’; did you mean ‘cpu_have_feature’? [-Werror=implicit-function-declaration]
29 | if (vcpu_has_feature(vcpu, KVM_ARM_VCPU_PSCI_0_2)) {
| ^~~~~~~~~~~~~~~~
| cpu_have_feature
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
If the host attempts to access granules that have been delegated for use in a realm these accesses will be caught and will trigger a Granule Protection Fault (GPF). A fault during a page walk signals a bug in the kernel and is handled by oopsing the kernel. A non-page walk fault could be caused by user space having access to a page which has been delegated to the kernel and will trigger a SIGBUS to allow debugging why user space is trying to access a delegated page. Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Steven Price <steven.price@arm.com>
The RMM (Realm Management Monitor) provides functionality that can be accessed by SMC calls from the host. The SMC definitions are based on DEN0137[1] version 1.0-rel0 [1] https://developer.arm.com/documentation/den0137/1-0rel0/ Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com>
The wrappers make the call sites easier to read and deal with the boilerplate of handling the error codes from the RMM. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com>
Query the RMI version number and check if it is a compatible version. A static key is also provided to signal that a supported RMM is available. Functions are provided to query if a VM or VCPU is a realm (or REC), which currently will always return false. Later patches make use of struct realm and the states as the ioctl interfaces are added to support realm and REC creation and destruction. Signed-off-by: Steven Price <steven.price@arm.com>
There is one CAP which identifies the presence of CCA, and two ioctls. One ioctl is used to populate memory and the other is used when user space is providing the PSCI implementation to identify the target of the operation. Signed-off-by: Steven Price <steven.price@arm.com>
Introduce the skeleton functions for creating and destroying a realm. The IPA size requested is checked against what the RMM supports. The actual work of constructing the realm will be added in future patches. Signed-off-by: Steven Price <steven.price@arm.com>
RMM v1.0 provides no mechanism for the host to perform debug operations on the guest. So limit the extensions that are visible to an allowlist so that only those capabilities we can support are advertised. Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com>
The RMM only allows setting the GPRS (x0-x30) and PC for a realm guest. Check this in kvm_arm_set_reg() so that the VMM can receive a suitable error return if other registers are written to. The RMM makes similar restrictions for reading of the guest's registers (this is *confidential* compute after all), however we don't impose the restriction here. This allows the VMM to read (stale) values from the registers which might be useful to read back the initial values even if the RMM doesn't provide the latest version. For migration of a realm VM, a new interface will be needed so that the VMM can receive an (encrypted) blob of the VM's state. Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Steven Price <steven.price@arm.com>
The RMM needs to be informed of the target REC when a PSCI call is made with an MPIDR argument. Expose an ioctl to the userspace in case the PSCI is handled by it. Co-developed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com>
The RMM doesn't allow injection of an undefined exception into a realm guest. Add a WARN to catch if this ever happens. Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com>
The VMM has no control over or visibility of the vCPU execution of a realm guest, and is therefore unable to provide meaningful stolen time statistics. Reflect this by not advertising KVM_CAP_STEAL_TIME when running a realm guest. Note that steal time accounting is not available when a guest is running within an Arm CCA realm (machine type KVM_VM_TYPE_ARM_REALM). Signed-off-by: Steven Price <steven.price@arm.com>
For Realm guests it is impossible to directly inject a synchronous exception. Instead the RMM can be asked to inject a Synchronous External Abort (SEA) when the next REC enter is performed. Expose the KVM_SET_VCPU_EVENTS API to provide the means for the VMM to trigger an SEA injection, when the previous exit was due to a Data abort for an emulated unprotected access. Signed-off-by: Steven Price <steven.price@arm.com>
Forward RSI_HOST_CALLS to KVM's HVC handler. Signed-off-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com>
Now that different types of VMs are supported, check SVE support for the given instance of the VM to report the status accurately. Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Joey Gouly <joey.gouly@arm.com>
The minimum granule size supported by the RMM is a 4k page, so force 4k pages for realm guests. Signed-off-by: Steven Price <steven.price@arm.com>
Physical device assignment is not supported by RMM v1.0, so it doesn't make much sense to allow device mappings within the realm. Prevent them when the guest is a realm. Signed-off-by: Steven Price <steven.price@arm.com>
Commit fa9d27773873 ("perf: arm_pmu: Kill last use of per-CPU cpu_armpmu
pointer") removed the per-CPU cpu_armpmu. Rather than refactoring the
code to deal with this, just reintroduce it. The CCA PMU code will be
changing when switching to the RMM v2.0 ABI and will need completely
reworking.
Signed-off-by: Steven Price <steven.price@arm.com>
Arm CCA assigns the physical PMU device to the guest running in realm world, however the IRQs are routed via the host. To enter a realm guest while a PMU IRQ is pending it is necessary to block the physical IRQ to prevent an immediate exit. Provide a mechanism in the PMU driver for KVM to control the physical IRQ. Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com>
Use the PMU registers from the RmiRecExit structure to identify when an overflow interrupt is due and inject it into the guest. Also hook up the configuration option for enabling the PMU within the guest. When entering a realm guest with a PMU interrupt pending, it is necessary to disable the physical interrupt. Otherwise, when the RMM restores the PMU state, the physical interrupt will trigger, causing an immediate exit back to the host. The guest is expected to acknowledge the interrupt, causing a host exit (to update the GIC state), which gives the opportunity to re-enable the physical interrupt before the next PMU event. The number of PMU counters is configured by the VMM by writing to PMCR.N. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Steven Price <steven.price@arm.com>
… to userspace The RMM describes the maximum number of BPs/WPs available to the guest in the Feature Register 0. Propagate those numbers into ID_AA64DFR0_EL1, which is visible to userspace. A VMM needs this information in order to set up realm parameters. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Allow userspace to configure the number of breakpoints and watchpoints of a Realm VM through KVM_SET_ONE_REG ID_AA64DFR0_EL1. The KVM sys_reg handler checks the user value against the maximum value given by RMM (arm64_check_features() gets it from the read_sanitised_id_aa64dfr0_el1() reset handler). Userspace discovers that it can write these fields by issuing a KVM_ARM_GET_REG_WRITABLE_MASKS ioctl. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com>
… by RMM Provide an accurate number of available PMU counters to userspace when setting up a Realm. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Joey Gouly <joey.gouly@arm.com>
RMM provides the maximum vector length it supports for a guest in its feature register. Make it visible to the rest of KVM and to userspace via KVM_REG_ARM64_SVE_VLS. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Obtain the max vector length configured by userspace on the vCPUs, and write it into the Realm parameters. By default the vCPU is configured with the max vector length reported by RMM, and userspace can reduce it with a write to KVM_REG_ARM64_SVE_VLS. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Steven Price <steven.price@arm.com>
KVM_GET_REG_LIST should not be called before SVE is finalized. The ioctl handler currently returns -EPERM in this case. But because it uses kvm_arm_vcpu_is_finalized(), it now also rejects the call for unfinalized REC even though finalizing the REC can only be done late, after Realm descriptor creation. Move the check to copy_sve_reg_indices(). One adverse side effect of this change is that a KVM_GET_REG_LIST call that only probes for the array size will now succeed even if SVE is not finalized, but that seems harmless since the following KVM_GET_REG_LIST with the full array will fail. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com>
Userspace can set a few registers with KVM_SET_ONE_REG (9 GP registers at runtime, and 3 system registers during initialization). Update the register list returned by KVM_GET_REG_LIST. Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Steven Price <steven.price@arm.com>
Increment KVM_VCPU_MAX_FEATURES to expose the new capability to user space. Signed-off-by: Steven Price <steven.price@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com>
All the pieces are now in place, so enable kvm_rmi_is_available when the RMM is detected. Signed-off-by: Steven Price <steven.price@arm.com>
Pull request overview
This PR backports Arm CCA (RME/Realm) enablement for KVM to the linux-6.6.y kernel series, including the prerequisite generic KVM infrastructure (guest_memfd, per-page memory attributes, and UAPI extensions) needed to support private vs. shared guest memory.
Changes:
- Add generic KVM guest_memfd and per-page memory attribute infrastructure, including new ioctls/UAPI (KVM_CREATE_GUEST_MEMFD, KVM_SET_MEMORY_ATTRIBUTES, KVM_SET_USER_MEMORY_REGION2, KVM_EXIT_MEMORY_FAULT).
- Integrate Arm64 Realm/RMI plumbing into KVM (new RMI headers, realm VM/vCPU lifecycle, MMU fault handling, VGIC/timer/PSCI adaptations).
- Extend x86 KVM paths to interoperate with generic private memory infrastructure (e.g., memory-fault exit info, gmem prepare/invalidate hooks, SNP-related populate flow in SEV code).
Reviewed changes
Copilot reviewed 49 out of 49 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| virt/kvm/kvm_mm.h | Adds guest_memfd interface declarations/stubs. |
| virt/kvm/kvm_main.c | Wires memslot lifecycle to gmem bind/unbind; adds memory attributes and guest_memfd ioctls. |
| virt/kvm/guest_memfd.c | Introduces guest_memfd backing implementation (folio management, bind/unbind, populate). |
| virt/kvm/Makefile.kvm | Builds guest_memfd support when enabled. |
| virt/kvm/Kconfig | Adds KVM_GUEST_MEMFD and generic memory-attribute/private-mem Kconfig symbols. |
| include/uapi/linux/kvm.h | Adds UAPI for memory attributes, guest_memfd, memory-fault exit, arm64 VM type bits, and new caps. |
| include/linux/perf/arm_pmu.h | Exposes per-CPU arm PMU pointer and physical IRQ toggling API. |
| include/linux/kvm_host.h | Adds mem attributes API hooks and gmem PFN retrieval API. |
| include/kvm/arm_psci.h | Adds arm64 KVM emulate header dependency. |
| include/kvm/arm_pmu.h | Adds helper macro used by realm PMU/IRQ handling. |
| include/kvm/arm_arch_timer.h | Exposes realm timer update helper. |
| fs/anon_inodes.c | Exports anon inode helper for guest_memfd file creation. |
| drivers/perf/arm_pmu.c | Implements PMU physical IRQ enable/disable helper and exposes cpu_armpmu. |
| arch/x86/kvm/x86.c | Adds x86 arch hooks for gmem prepare/invalidate; exposes memory-fault-info capability. |
| arch/x86/kvm/svm/sev.c | Adds SNP launch/update flow using gmem populate support. |
| arch/x86/kvm/mmu/mmu_internal.h | Extends page fault tracking for private memory and refcounted pages. |
| arch/x86/kvm/mmu/mmu.c | Adds private-memory PFN faultin path and memory-attribute integration. |
| arch/x86/kvm/Kconfig | Enables generic private-mem + gmem hooks under SEV. |
| arch/x86/include/asm/kvm_host.h | Adds x86 arch “has_private_mem” plumbing and gmem ops hooks. |
| arch/x86/include/asm/kvm-x86-ops.h | Adds optional x86 ops entries for gmem prepare/invalidate. |
| arch/arm64/mm/fault.c | Adds GPF handling for RME Granule Protection Faults. |
| arch/arm64/kvm/vgic/vgic.h | Adds realm-specific LR count helper and RMI include. |
| arch/arm64/kvm/vgic/vgic.c | Adds realm save/restore paths for VGIC state. |
| arch/arm64/kvm/vgic/vgic-v3.c | Skips host-side APR/trap handling for realm vCPUs. |
| arch/arm64/kvm/vgic/vgic-init.c | Blocks unsupported VGICv2 emulation for realms. |
| arch/arm64/kvm/sys_regs.c | Tightens ID reg validation and hides sysregs for realms. |
| arch/arm64/kvm/rmi-exit.c | Implements realm REC exit decoding/handling. |
| arch/arm64/kvm/reset.c | Adds realm-aware SVE max VL handling and REC cleanup. |
| arch/arm64/kvm/psci.c | Integrates realm PSCI completion semantics. |
| arch/arm64/kvm/pmu-emul.c | Reads realm PMU overflow status from REC exit context. |
| arch/arm64/kvm/mmu.c | Adds realm mapping/unmapping paths and gmem fault handling for private memory. |
| arch/arm64/kvm/mmio.c | Adjusts MMIO emulation return path for realm REC ABI. |
| arch/arm64/kvm/inject_fault.c | Adjusts exception injection behavior for realm RECs. |
| arch/arm64/kvm/hypercalls.c | Hides FW reg indices for realms. |
| arch/arm64/kvm/guest.c | Restricts and validates writable regs and event injection for realms. |
| arch/arm64/kvm/arm.c | Adds realm VM type, capability filtering, REC run loop, and RMI init/populate ioctls. |
| arch/arm64/kvm/arch_timer.c | Adds realm timer IRQ update path and realm-specific offset behavior. |
| arch/arm64/kvm/Makefile | Builds new RMI implementation files. |
| arch/arm64/kvm/Kconfig | Enables generic memory-attributes integration for arm64 KVM and related selects. |
| arch/arm64/kernel/cpufeature.c | Exposes RME feature bits in CPU feature framework. |
| arch/arm64/include/uapi/asm/kvm.h | Adds KVM_ARM_VCPU_REC feature bit. |
| arch/arm64/include/asm/virt.h | Declares static key for RMI availability. |
| arch/arm64/include/asm/rmi_smc.h | Adds RMI SMC IDs and ABI structures. |
| arch/arm64/include/asm/rmi_cmds.h | Adds inline wrappers for RMI commands. |
| arch/arm64/include/asm/kvm_rmi.h | Adds realm/REC structures and internal APIs. |
| arch/arm64/include/asm/kvm_host.h | Adds realm fields to KVM arch structs and realm-private-mem indicator. |
| arch/arm64/include/asm/kvm_emulate.h | Adds realm helpers (kvm_is_realm, kvm_gpa_from_fault, vcpu_is_rec, etc.). |
| Documentation/virt/kvm/api.rst | Documents new UAPIs/capabilities for guest_memfd, memory attributes, and Arm RMI. |
```c
if (IS_ERR(folio))
	return PTR_ERR(folio);

*page = folio_file_page(folio, gfn - slot->base_gfn);
```
The page returned via folio_file_page() is indexed by file offset, but this uses gfn - slot->base_gfn (offset within the memslot) and ignores slot->gmem.pgoff. For non-zero slot->gmem.pgoff, this returns the wrong page pointer. Use the same file index that __kvm_gmem_get_pfn() computes (e.g. gfn - slot->base_gfn + slot->gmem.pgoff).
```c
bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
				     unsigned long attrs)
{
```
kvm_range_has_memory_attributes() is implemented with a 4-argument signature (attrs only), but new callers added in this PR pass five arguments (mask + attrs). This is a build break and also suggests the helper's semantics are incomplete. Update the function signature and implement masked matching (or update all callers to the 4-arg form).
```c
void arm_pmu_set_phys_irq(bool enable)
{
	int cpu = get_cpu();
	struct arm_pmu *pmu = per_cpu(cpu_armpmu, cpu);
	int irq;

	irq = armpmu_get_cpu_irq(pmu, cpu);
	if (irq && !enable)
```
arm_pmu_set_phys_irq() dereferences pmu->hw_events via armpmu_get_cpu_irq(pmu, cpu) without checking whether pmu is NULL. per_cpu(cpu_armpmu, cpu) is set/cleared during CPU hotplug and could be NULL; add a NULL check (and possibly validate cpu_irq_ops) to avoid a potential NULL deref.
```c
struct list_head *gmem_list = &inode->i_mapping->i_private_list;
struct kvm_gmem *gmem;
```
inode->i_mapping doesn't have an i_private_list member (the address_space member is private_list). This will fail to compile and is also inconsistent with later code that uses inode->i_mapping->private_list.
```c
file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
				 O_RDWR, NULL);
if (IS_ERR(file)) {
```
anon_inode_create_getfile() is used here but is not declared in include/linux/anon_inodes.h and doesn't appear to be defined anywhere in the tree. As-is this will not compile/link. Either switch to an existing helper like anon_inode_getfile_secure()/anon_inode_getfile(), or add/declare an anon_inode_create_getfile() wrapper in fs/anon_inodes.c and the header.
```c
	return __anon_inode_getfile(name, fops, priv, flags,
				    context_inode, true);
}
EXPORT_SYMBOL_GPL(anon_inode_create_getfile);
```
This exports anon_inode_create_getfile, but there is no corresponding function definition in this file (only anon_inode_getfile() and anon_inode_getfile_secure()). Exporting an undefined symbol will fail the build, and it doesn't add a prototype for anon_inode_create_getfile() either. Define the wrapper (likely calling __anon_inode_getfile(..., secure=true)) and add it to include/linux/anon_inodes.h, or export the correct existing symbol instead.
```c
while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order),
					KVM_MEMORY_ATTRIBUTE_PRIVATE,
					KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
	if (!max_order)
```
kvm_range_has_memory_attributes() is called with five arguments here, but its declaration/definition in this PR only accepts four. This is a build break; please align the function signature and update all call sites consistently (e.g. if a mask parameter is intended, add it to the prototype and implement masked matching in the helper).
```c
if (level == PG_LEVEL_2M)
	return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
```
kvm_range_has_memory_attributes() is invoked with five arguments here, but in this PR it's declared/defined with only four parameters. This mismatch will fail to compile; align the helper's prototype/definition with the intended call pattern (or adjust callers).
```c
bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
				     unsigned long attrs);
```
kvm_range_has_memory_attributes() is declared with a 4-argument signature, but there are new callers in this PR that pass five arguments (e.g. x86 MMU hugepage checks and guest_memfd populate). This will not compile; update the prototype (and the implementation) to match the intended signature and semantics.
```
select SCHED_INFO
select GUEST_PERF_EVENTS if PERF_EVENTS
select KVM_GENERIC_MEMORY_ATTRIBUTES
select HAVE_KVM_ARCH_GMEM_POPULATE
```
HAVE_KVM_ARCH_GMEM_POPULATE is selected here, but that Kconfig symbol isn't defined anywhere (at least not in this PR or the existing tree). Selecting an undefined symbol will break Kconfig processing; either add the missing symbol definition (likely in virt/kvm/Kconfig) or drop this select if it's not needed.
The following description was produced with AI assistance:
arm64: Support for Arm CCA (Confidential Compute Architecture) in KVM
Overview
This PR backports Arm CCA (Confidential Compute Architecture) support to the linux-6.6.y (v6.6.127) kernel. CCA is Arm's confidential computing architecture: through the Realm Management Extension (RME) and the Realm Management Monitor (RMM), it provides hardware-level memory isolation and protection for virtual machines, making guest memory invisible to the hypervisor. The backport is based on the upstream v12 patch series "[PATCH v12 00/46] arm64: Support for Arm CCA in KVM" (original baseline v6.19-rc1+) and comprises 71 commits.
Commit breakdown
Backported commits carry a "[ Upstream commit <sha1> ]" tag; patches taken from the mailing list are prefixed with "Fromlist:".
1. Upstream backport commits in detail (25)
Several pieces of infrastructure the CCA series depends on are entirely absent from 6.6.y and had to be backported from upstream first. These commits fall into three sub-phases:
Phase 0.1: guest_memfd core framework (10 commits)
The entire virt/kvm/guest_memfd.c subsystem (the guest private memory backend) and Kconfig options such as CONFIG_KVM_GUEST_MEMFD and CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES do not exist in 6.6.y. The following 10 commits build the minimal guest_memfd framework CCA needs:
- 968978aaba80, 5a475554db1e — the KVM_SET_MEMORY_ATTRIBUTES ioctl and the xarray-backed per-page memory attribute framework
- 303b8bf0c6ad, a7800aa80ea4 — virt/kvm/guest_memfd.c, implementing the guest private memory backend
- 9d80bea67d28, 80583d0cfd8f, 78aaab8f673e, 1d23040caa8b, 04540c790841, 70623723778a, c9b28b8e40c8, fa30b0dc91c8, adb2ddb3b378, 3bb2531e20bf — culminating in the kvm_arch_gmem_prepare() callback, used by CCA for Realm memory initialization
- fd27a620480e, 17573fd971f9, b0fe205e0697, 1f6c06b17751 — kvm_gmem_populate(), used by CCA patch 20 to populate initial Realm memory
- 7b47c551d87b, a90764f0e4ed — the kvm_arch_gmem_invalidate() callback, used by CCA for Realm memory reclaim
Phase 0.2: API fixes and naming alignment (12 commits)
On top of the core framework, a series of fix, rename, and API-evolution commits is needed to match the interfaces the CCA patches expect:
- df348ae0e862, d81473840ce1, ad1f35ab6858, e300614f10bd, f74c8d9744e4, 47bb584237cc, e24c03506e3c, 19a9a1ab5c3d, 3cfed31caac9, d0d87226f535, 244147e581fb, d04c77d23122, 7edb9f3ea128, 564429a6bd8d, 9867b6e495b0, e4ee54479273, 7a4bf632a6a4, dca6c8853232, 4eb848a26ec9, 923310be23b2, 2b3c6d0abdc6, 638ea79669f8 — ending with user_mem_abort() reworked for the shared/gmem paths
- e908e1b7fa31, a7b57e099592 — gmem_abort(), handling faults on guest private memory
Phase 0.3: UAPI/API completion (3 commits)
After applying all 46 CCA patches, the following key UAPI definitions and API signatures turned out to be entirely missing from 6.6.y, causing build failures. These 3 commits fill in the missing interfaces:
- 362de6414aa9, 16f95f3b95ca — KVM_EXIT_MEMORY_FAULT (exit reason #39), the kvm_run.memory_fault structure, and the 3-argument kvm_prepare_memory_fault_exit()
- f52c7a72d8b8, 8dd2eee9d526 — the KVM_MEMORY_EXIT_FLAG_PRIVATE definition, upgrading kvm_prepare_memory_fault_exit() to the 6-argument form (adding is_write/is_exec/is_private), plus kvm_faultin_pfn_private() and kvm_max_level_for_order()
- f33220e586ee, 1fbee5b01a0f — adding a struct page **page output parameter to kvm_gmem_get_pfn(); CCA's arm64 fault handling needs it to obtain the struct page
2. Fromlist CCA patches (46)
The 46 CCA patches come from the v12 series posted to the upstream mailing list (not yet merged into mainline) and are applied in their original 01-46 order.
Full commit list
24b9abbadd7f, 518467276a83, 076d987f7772, 3a3f7ad88a02, a7f730878bef, aeb5303ad0b8, afc2ab68ad30, 077c40cc0590, f2ade45d5ffa, b55df100a8da, 7b5895a4420d, 49f5140f8958, 960dd69f0be4, cfd4093a365d, b422fc1dac78, 37d1dfaf5825, 9a682ece496c, 5247c359d8d5, d3bbfb218950, b283df588980, d092e884102a, fd2d851ec369, 14ed1026558e, 4d43d62fad0b, 07f2c44fbdd1, be825b367036, c4a1a72d6bf4, 2eb2b434e6eb, 770c17b40b43, 82f0a4d8fc4b, 351eefbe3e11, 77b631f8cdad, 95143a558cf7, 43b2e9d91ea0, 5686e6e6d77b, c4ea18576076, 41aacd68e2c9, 66f8d0071a09, 89a8a4caad52, 5308dd832f16, 9de55f73cfa9, d7347fb7a726, 1bcf11badcea, 5ea16f211e4f, 9b38228b2b70, a0ba4561deac
Functional groups in detail
Infrastructure and the RMI interface (patches 01-06): fix header dependencies; add GPF (Granule Protection Fault) handling; define the RMI SMC call interface (rmi_smc.h and rmi_cmds.h, both new files); detect RMI support at KVM initialization and introduce helpers such as kvm_is_realm(); define the Realm userspace ABI and add the new KVM_CAP_ARM_RMI capability.
Realm VM creation and management (patches 07-12): Realm creation infrastructure (IPA limit checks, Realm Descriptor management); capability filtering for Realm guests (masking unsupported features); allowing a Realm to be created with the KVM_VM_TYPE_ARM_REALM machine type; RTT (Realm Translation Table) teardown; Realm activation on first VCPU run; REC (Realm Execution Context) allocation and release.
Interrupt and timer support (patches 13-15): a helper to query the number of VGIC list registers; full VGIC support inside a Realm; timer support in the Realm REC.
Runtime handling (patches 16-18): Realm entry/exit handling (rmi-exit.c, a new file); handling of RMI_EXIT_RIPAS_CHANGE (Realm IPA State change requests); Realm MMIO emulation.
guest_memfd integration (patches 19-24): expose private memory support; allow kvm_gmem_populate() to fill the Realm's initial memory contents; set the RIPAS of the initial memslots; create the Realm Descriptor; the Realm VMID allocator; runtime memory fault handling (patch 24 is the most complex patch — it handles faults on Realm private memory in gmem_abort(), involving RMI data/RTT creation).
VCPU and register management (patches 25-30): Realm VCPU load; register access validation; Realm PSCI request handling; a WARN check when injecting undefined exceptions; disabling stolen time for Realm guests; allowing userspace to inject aborts.
Extended features (patches 31-46): RSI_HOST_CALL support; SVE checks and forcing 4K pages; forbidding device mappings for Realms; PMU support (including restoring the per-CPU cpu_armpmu pointer, the IRQ disable mechanism, and PMU counter initialization); breakpoint/watchpoint parameter propagation and configuration; obtaining and configuring the SVE vector length from the RMM; an accurate register list for Realm RECs; exposing KVM_ARM_VCPU_REC to userspace; and finally enabling the static branch that allows Realms to be created.
3. Commit ordering
The 25 upstream backports are applied before the 46 CCA patches, with the 3 UAPI/API completion commits of phase 0.3 interleaved between CCA patch 23 and patch 24 (because patch 24 is the first patch that actually uses the 6-argument kvm_prepare_memory_fault_exit() and the kvm_gmem_get_pfn() variant with the struct page **page parameter).
4. Adaptations relative to the original upstream/fromlist state
Because the original CCA patches are based on v6.19-rc1+ while the target branch is v6.6.127 — roughly three years of kernel evolution apart — the following major adaptations were made during the backport:
4.1 Full backport of the guest_memfd subsystem
Reason: virt/kvm/guest_memfd.c does not exist at all in 6.6.y, and Kconfig options such as CONFIG_KVM_GUEST_MEMFD and CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES are missing. CCA patches 19, 20, and 24 depend heavily on this subsystem.
Approach: 22 guest_memfd-related commits were cherry-picked from mainline in dependency order, covering the core framework, bug fixes, API evolution, and naming alignment. These commits have strict ordering dependencies among themselves.
Cherry-pick conflict resolution:
- include/uapi/linux/kvm.h: the highest KVM_CAP number in 6.6.y is 229; the new CAP numbers had to be inserted correctly after the existing deepin-specific CAPs (e.g. the HYGON-related ones)
- virt/kvm/Kconfig and Makefile.kvm: add the guest_memfd-related Kconfig options and build rules
- include/linux/kvm_host.h: insert the new memory-attribute and gmem declarations into the structure definitions 6.6.y already has
4.2 Completing the KVM_EXIT_MEMORY_FAULT UAPI
Reason: CCA patch 24 calls kvm_prepare_memory_fault_exit() from gmem_abort() to report private/shared memory access mismatches to userspace, but KVM_EXIT_MEMORY_FAULT (exit reason #39), the kvm_run.memory_fault structure, and KVM_MEMORY_EXIT_FLAG_PRIVATE are all completely missing from 6.6.y.
Approach: three additional upstream commits (#23-#25) were backported to fill in these UAPI definitions and API signatures, rather than using custom deepin: fix commits, to keep the code consistent with upstream.
Conflict resolution:
- include/uapi/linux/kvm.h: insert KVM_EXIT_MEMORY_FAULT = 39 into the 6.6.y exit-reason list and add the memory_fault member to the kvm_run union
- include/linux/kvm_host.h: keep the Phase 0 infrastructure code already present in 6.6.y (memory attributes, gmem function declarations) while correctly merging in the 6-argument kvm_prepare_memory_fault_exit()
- arch/x86/kvm/mmu/mmu.c: the kvm_mmu_max_mapping_level() signature changed (gaining a max_level parameter); keep the kvm_slot_has_gmem() name used in 6.6.y (upstream uses kvm_slot_can_be_private() here, renamed in Phase 0 commit #20)
- arch/x86/kvm/mmu/mmu_internal.h: add the refcounted_page field to struct kvm_page_fault
- arch/x86/kvm/svm/sev.c: the SNP/SEV functions introduced by the upstream commit (snp_rmptable_psmash, sev_handle_rmp_fault, etc.) have no corresponding base code in 6.6.y and were all dropped (keeping the empty HEAD side)
- virt/kvm/guest_memfd.c: after kvm_gmem_get_pfn() gained the struct page **page parameter, a folio_file_page() call was added in the function body to set the page output correctly
4.3 API adaptation in mmu.c
The CCA patches (in particular patches 07, 10, 17, 24, 33, and 34) use several APIs in arch/arm64/kvm/mmu.c that do not exist in 6.6.y. The backport makes the following equivalent substitutions:

| CCA patch API | 6.6.y replacement | Notes |
|---|---|---|
| KVM_PGT_FN(func)(args) | func(args) plus a kvm_is_realm() check | KVM_PGT_FN is the v6.19 pKVM MMU dispatch macro (introduced by fce886a60207); part of a large pKVM refactor, unsuitable for a standalone backport |
| kvm_fault_lock(kvm) | read_lock(&kvm->mmu_lock) | the kvm_fault_lock() helper does not exist |
| kvm_fault_unlock(kvm) | read_unlock(&kvm->mmu_lock) | |
| kvm_release_faultin_page(kvm, pfn, ...) | kvm_release_pfn_clean(pfn) | kvm_release_faultin_page() does not exist |
| kvm_stage2_destroy(pgt) | kvm_pgtable_stage2_destroy(pgt) | renamed/split by d68d66e57e2b |
| kvm_init_ipa_range() | equivalent logic in kvm_init_stage2_mmu() | |

4.4 Context adaptation in arm.c
Several functions in arch/arm64/kvm/arm.c were refactored in v6.19:
- kvm_arch_init_vm(): v6.19 adds a kvm_init_nested() call after the lockdep setup, which 6.6.y lacks; the Realm type parsing code is inserted between 6.6.y's lockdep setup and kvm_share_hyp()
- kvm_arch_vcpu_run_pid_change(): v6.19 removed the kvm_arch_vcpu_run_map_fp() call, which 6.6.y still has; the Realm activation code is inserted after the vcpu_has_run_once() check
- kvm_vm_ioctl_check_extension(): the last case in 6.6.y is KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES (#229), while v6.19 has several newer cases; KVM_CAP_ARM_RMI is assigned a new number in this branch
4.5 Other adaptations
- arch/arm64/kvm/reset.c: v6.19 has a system_supported_vcpu_features(kvm) function (taking a kvm argument) that 6.6.y lacks; the equivalent realm feature checks are implemented directly in 6.6.y
- arch/arm64/kvm/inject_fault.c: v6.19 added generic injection primitives such as __kvm_inject_exception(), which 6.6.y lacks; the CCA patches' WARN checks are inserted directly into the existing 6.6.y functions
- drivers/perf/arm_pmu.c: patch 35 is a HACK restoring the cpu_armpmu per-CPU pointer (deleted in v6.19 by fa9d27773873); the pointer still exists in 6.6.y, so the adaptation of this patch is simpler
- arch/arm64/kvm/pmu-emul.c: v6.19 went through the PMCR.N → nr_pmu_counters rename; 6.6.y still uses the original variable name and has been adapted accordingly
- include/uapi/linux/kvm.h: add the KVM_VM_TYPE_ARM_MASK and KVM_VM_TYPE_ARM_REALM definitions; 6.6.y only has KVM_VM_TYPE_ARM_IPA_SIZE_MASK
- arch/arm64/include/asm/kvm_host.h: bump KVM_VCPU_MAX_FEATURES from 7 to 8, reserving a bit for KVM_ARM_VCPU_REC
4.6 Upstream refactors intentionally not backported
The following upstream refactors were intentionally not backported, to limit the intrusiveness of the changes to the 6.6.y code base:
- fce886a60207 — the large pKVM refactor around the KVM_PGT_FN macro; it would substantially restructure mmu.c and is not required for CCA
- dc06193532af — kvm_release_faultin_page(); kvm_release_pfn_clean() is an equivalent replacement
- 85c7869e30b7
- d68d66e57e2b
- 8cc9dc1ae4fb — device
5. Main files involved
New files
virt/kvm/guest_memfd.c
arch/arm64/include/asm/rmi_smc.h
arch/arm64/include/asm/rmi_cmds.h
arch/arm64/include/asm/kvm_rmi.h
arch/arm64/kvm/rmi.c
arch/arm64/kvm/rmi-exit.c
Main modified files
include/uapi/linux/kvm.h
include/linux/kvm_host.h
arch/arm64/kvm/mmu.c
arch/arm64/kvm/arm.c
arch/arm64/include/asm/kvm_host.h
arch/arm64/include/asm/kvm_emulate.h
arch/arm64/kvm/guest.c
arch/arm64/kvm/reset.c
arch/arm64/kvm/psci.c
arch/arm64/kvm/inject_fault.c
arch/arm64/kvm/mmio.c
arch/arm64/mm/fault.c
arch/arm64/kvm/vgic/
arch/arm64/kvm/arch_timer.c
arch/arm64/kvm/pmu-emul.c
drivers/perf/arm_pmu.c
virt/kvm/Kconfig
Documentation/virt/kvm/api.rst
Link: #1319