Date: Tue, 6 Jun 2023 19:03:45 +0000
Subject: [RFC PATCH 00/19] hugetlb support for KVM guest_mem
From: Ackerley Tng
To: akpm@linux-foundation.org, mike.kravetz@oracle.com, muchun.song@linux.dev,
    pbonzini@redhat.com, seanjc@google.com, shuah@kernel.org,
    willy@infradead.org
Cc: brauner@kernel.org, chao.p.peng@linux.intel.com, coltonlewis@google.com,
    david@redhat.com, dhildenb@redhat.com, dmatlack@google.com,
    erdemaktas@google.com, hughd@google.com, isaku.yamahata@gmail.com,
    jarkko@kernel.org, jmattson@google.com, joro@8bytes.org,
    jthoughton@google.com, jun.nakajima@intel.com,
    kirill.shutemov@linux.intel.com, liam.merwick@oracle.com,
    mail@maciej.szmigiero.name, mhocko@suse.com, michael.roth@amd.com,
    qperret@google.com, rientjes@google.com, rppt@kernel.org,
    steven.price@arm.com, tabba@google.com, vannapurve@google.com,
    vbabka@suse.cz, vipinsh@google.com, vkuznets@redhat.com,
    wei.w.wang@intel.com,
    yu.c.zhang@linux.intel.com, kvm@vger.kernel.org, linux-api@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-kselftest@vger.kernel.org, linux-mm@kvack.org, qemu-devel@nongnu.org,
    x86@kernel.org, Ackerley Tng
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,

This patchset builds upon a soon-to-be-published WIP patchset that Sean has
pushed to https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo, mentioned
at [1].

The tree for this patchset can be found at:
https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1

This patchset introduces hugetlb support for KVM's guest_mem (aka gmem),
allowing VM private memory (for confidential computing) to be backed by
hugetlb pages.

guest_mem provides userspace with a handle, with which userspace can allocate
and deallocate memory for confidential VMs without mapping the memory into
userspace.

Why use hugetlb instead of introducing a new allocator, like gmem does for 4K
and transparent hugepages?

+ hugetlb provides the following useful functionality, which would otherwise
  have to be reimplemented:
    + Allocation of hugetlb pages at boot time, including
        + Parsing of kernel boot parameters to configure hugetlb
        + Tracking of usage in hstate
        + gmem will share the same system-wide pool of hugetlb pages, so users
          don't have to have separate pools for hugetlb and gmem
    + Page accounting with subpools
        + hugetlb pages are tracked in subpools, which gmem uses to reserve
          pages from the global hstate
    + Memory charging
        + hugetlb provides code that charges memory to cgroups
    + Reporting: hugetlb usage and availability are available at
      /proc/meminfo, etc

The first 11 patches in this patchset are a series of refactorings that
decouple hugetlb from hugetlbfs. The central thread binding the refactoring is
that some functions (like inode_resv_map(), inode_subpool(), inode_hstate(),
etc) rely on a hugetlbfs concept: that the resv_map, subpool and hstate are
stored in a specific field of a hugetlb inode.

Refactoring to parametrize functions by hstate, subpool and resv_map will
allow hugetlb to be used by gmem and in other places where these data
structures aren't necessarily stored in the same positions in the inode.

The refactoring proposed here is just the minimum required to get a
proof-of-concept working with gmem. I would like to get opinions on this
approach before doing further refactoring.
(See TODOs)

TODOs:

+ hugetlb/hugetlbfs refactoring
    + remove_inode_hugepages() no longer needs to be exposed; it is
      hugetlbfs-specific and used only in inode.c
    + remove_mapping_hugepages(), remove_inode_single_folio(),
      hugetlb_unreserve_pages() shouldn't need to take inode as a parameter
        + Updating inode->i_blocks can be refactored to a separate function
          and called from hugetlbfs and gmem
    + alloc_hugetlb_folio_from_subpool() shouldn't need to be parametrized by
      vma
    + hugetlb_reserve_pages() should be refactored to be symmetric with
      hugetlb_unreserve_pages()
        + It should be parametrized by resv_map
        + alloc_hugetlb_folio_from_subpool() could perhaps use
          hugetlb_reserve_pages()?
+ gmem
    + Figure out if resv_map should be used by gmem at all
        + Probably needs more refactoring to decouple resv_map from hugetlb
          functions

Questions for the community:

1. In this patchset, every gmem file backed with hugetlb is given a new
   subpool. Is that desirable?
    + In hugetlbfs, a subpool always belongs to a mount, and hugetlbfs has one
      mount per hugetlb size (2M, 1G, etc)
    + memfd_create(MFD_HUGETLB) effectively returns a full hugetlbfs file, so
      it (rightfully) uses the hugetlbfs kernel mounts and their subpools
    + I gave each file a subpool mostly to speed up implementation and still
      be able to reserve hugetlb pages from the global hstate based on the
      gmem file size.
    + gmem, unlike hugetlbfs, isn't meant to be a full filesystem, so
        + Should there be multiple mounts, one for each hugetlb size?
        + Will the mounts be initialized on boot or on first gmem file
          creation?
        + Or is one subpool per gmem file fine?
2. Should resv_map be used for gmem at all, since gmem doesn't allow userspace
   reservations?
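
To make the direction of the refactoring (and question 1) a bit more concrete,
here is a minimal userspace sketch. It is not code from this series and not
kernel code: every toy_* name and struct layout below is hypothetical and made
up purely for illustration. It only shows the idea that a core helper takes
hstate and subpool as explicit parameters, so that a hugetlbfs-style caller
(which finds them through its inode) and a gmem-style caller (which keeps one
subpool per file) can both draw from the same global pool.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, NOT the kernel's struct hstate / struct hugepage_subpool. */
struct toy_hstate  { long free_pages; };   /* global pool for one page size */
struct toy_subpool { long reserved; };     /* pages reserved from that pool */

/* hugetlbfs-like: hstate and subpool are reached via a field of the inode. */
struct toy_hugetlbfs_info { struct toy_hstate *hstate; struct toy_subpool *spool; };
struct toy_inode { void *fs_info; };

/* gmem-like (assumed layout): each file carries its own subpool. */
struct toy_gmem { struct toy_hstate *hstate; struct toy_subpool spool; };

/*
 * Core helper, parametrized by hstate and subpool. It knows nothing about
 * where the caller stores them. Returns 0 on success, -1 if the global pool
 * cannot satisfy the request.
 */
int toy_reserve(struct toy_hstate *h, struct toy_subpool *spool, long npages)
{
	assert(h != NULL && spool != NULL);
	if (npages < 0 || npages > h->free_pages)
		return -1;
	h->free_pages -= npages;
	spool->reserved += npages;
	return 0;
}

/* hugetlbfs-style wrapper: looks the parameters up in the inode. */
int toy_hugetlbfs_reserve(struct toy_inode *inode, long npages)
{
	struct toy_hugetlbfs_info *info = inode->fs_info;

	return toy_reserve(info->hstate, info->spool, npages);
}

/* gmem-style wrapper: passes its per-file subpool directly. */
int toy_gmem_reserve(struct toy_gmem *gmem, long npages)
{
	return toy_reserve(gmem->hstate, &gmem->spool, npages);
}
```

In this toy model, a hugetlbfs inode and a gmem file pointing at the same
toy_hstate compete for the same free_pages, while their reservations are
accounted in separate subpools. Whether one subpool per gmem file is the right
granularity is exactly what question 1 asks.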

[1] https://lore.kernel.org/lkml/ZEM5Zq8oo+xnApW9@google.com/

---

Ackerley Tng (19):
  mm: hugetlb: Expose get_hstate_idx()
  mm: hugetlb: Move and expose hugetlbfs_zero_partial_page
  mm: hugetlb: Expose remove_inode_hugepages
  mm: hugetlb: Decouple hstate, subpool from inode
  mm: hugetlb: Allow alloc_hugetlb_folio() to be parametrized by subpool
    and hstate
  mm: hugetlb: Provide hugetlb_filemap_add_folio()
  mm: hugetlb: Refactor vma_*_reservation functions
  mm: hugetlb: Refactor restore_reserve_on_error
  mm: hugetlb: Use restore_reserve_on_error directly in filesystems
  mm: hugetlb: Parametrize alloc_hugetlb_folio_from_subpool() by
    resv_map
  mm: hugetlb: Parametrize hugetlb functions by resv_map
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  KVM: guest_mem: Refactor kvm_gmem fd creation to be in layers
  KVM: guest_mem: Refactor cleanup to separate inode and file cleanup
  KVM: guest_mem: hugetlb: initialization and cleanup
  KVM: guest_mem: hugetlb: allocate and truncate from hugetlb
  KVM: selftests: Add basic selftests for hugetlbfs-backed guest_mem
  KVM: selftests: Support various types of backing sources for private
    memory
  KVM: selftests: Update test for various private memory backing source
    types

 fs/hugetlbfs/inode.c                          | 102 ++--
 include/linux/hugetlb.h                       |  86 ++-
 include/linux/mm.h                            |   1 +
 include/uapi/linux/kvm.h                      |  25 +
 mm/hugetlb.c                                  | 324 +++++++-----
 mm/truncate.c                                 |  24 +-
 .../testing/selftests/kvm/guest_memfd_test.c  |  33 +-
 .../testing/selftests/kvm/include/test_util.h |  14 +
 tools/testing/selftests/kvm/lib/test_util.c   |  74 +++
 .../kvm/x86_64/private_mem_conversions_test.c |  38 +-
 virt/kvm/guest_mem.c                          | 488 ++++++++++++++----
 11 files changed, 882 insertions(+), 327 deletions(-)

--
2.41.0.rc0.172.g3f132b7071-goog