From: Fuad Tabba
Date: Fri, 23 Sep 2022 16:20:13 +0100
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: Andy Lutomirski
Cc: Sean Christopherson, David Hildenbrand, Chao Peng, kvm list,
    Linux Kernel Mailing List, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, Linux API, linux-doc@vger.kernel.org,
    qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
    Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, "the arch/x86 maintainers", "H. Peter Anvin",
    Hugh Dickins, Jeff Layton, "J . Bruce Fields", Andrew Morton, Shuah Khan,
    Mike Rapoport, Steven Price, "Maciej S . Szmigiero", Vlastimil Babka,
    Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov", "Nakajima, Jun",
    Dave Hansen, Andi Kleen, aarcange@redhat.com, ddutile@redhat.com,
    dhildenb@redhat.com, Quentin Perret, Michael Roth, Michal Hocko,
    Muchun Song, wei.w.wang@intel.com, Will Deacon, Marc Zyngier

Hi,

<...>

> > Regarding pKVM's use case, with the shim approach I believe this can be
> > done by allowing userspace to mmap() the "hidden" memfd, but with a ton
> > of restrictions piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly control usage without taking on too much complexity in the
> > kernel, but working through things, routing the behavior through the shim
> > itself might not be all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because things got too complex, but with the shim it doesn't seem
> > _that_ bad.
>
> What's the exact use case?  Is it just to pre-populate the memory?

To prepopulate memory, and to access memory that can go back and forth
between being shared and being private.

Cheers,
/fuad

> >
> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is allowed,
> >      i.e. mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a
> >      mapping for the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict
> >      things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap()
> >      everything in one shot.
> >
> >   5. Require that there are no outstanding references at munmap().  Or
> >      if this can't be guaranteed by userspace, maybe add some way for
> >      userspace to wait until it's ok to convert to private?  E.g. so
> >      that get_pfn() doesn't need to do an expensive check every time.
>
> Hmm.  I haven't looked at the code to see if this would really work, but
> I think this could be done more in line with how the rest of the kernel
> works by using the rmap infrastructure.  When the pKVM memfd is in
> not-yet-private mode, just let it be mmapped as usual (but don't allow
> any form of GUP or pinning).  Then have an ioctl to switch to shared mode
> that takes locks or sets flags so that no new faults can be serviced and
> does unmap_mapping_range.
>
> As long as the shim arranges to have its own vm_ops, I don't immediately
> see any reason this can't work.
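
For readers following the thread, the quoted restrictions 1, 2 and 4, plus
the conversion step suggested in the last quoted paragraph, can be sketched
as a small state machine. This is only a userspace toy model under stated
assumptions, not kernel code and not anything from the patch series: the
struct and every rmfd_* function name are hypothetical, restrictions 3
(notifier hooks) and 5 (waiting for outstanding references) are not
modelled, and clearing the "mapped" flag merely stands in for what
unmap_mapping_range would do.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a restricted memfd; not a kernel structure. */
struct rmfd {
    size_t size;        /* size of the file in bytes */
    bool mapped;        /* true while the single userspace mapping exists */
    bool is_private;    /* true once converted; no new mappings allowed */
    unsigned int pins;  /* references handed out through get_pfn() */
};

/* Restriction 1: the mapping is all or nothing, and at most one exists. */
int rmfd_mmap(struct rmfd *f, size_t off, size_t len)
{
    if (f->is_private)
        return -EPERM;   /* converted: no new faults or mappings */
    if (f->mapped)
        return -EBUSY;   /* only one mapping at a time */
    if (off != 0 || len != f->size)
        return -EINVAL;  /* must cover the entire file */
    f->mapped = true;
    return 0;
}

/* Restriction 4: no VMA splitting, so munmap() must drop everything. */
int rmfd_munmap(struct rmfd *f, size_t off, size_t len)
{
    if (!f->mapped)
        return -EINVAL;
    if (off != 0 || len != f->size)
        return -EINVAL;  /* partial unmap would split the mapping */
    f->mapped = false;
    return 0;
}

/* Restriction 2: no get_pfn() references while a mapping exists. */
int rmfd_get_pfn(struct rmfd *f)
{
    if (f->mapped)
        return -EBUSY;
    f->pins++;
    return 0;
}

void rmfd_put_pfn(struct rmfd *f)
{
    if (f->pins > 0)
        f->pins--;
}

/*
 * Conversion as suggested in the quoted reply: first set a flag so that
 * no new mapping can be created, then tear down the existing mapping.
 */
int rmfd_convert(struct rmfd *f)
{
    f->is_private = true;  /* block new mappings first */
    f->mapped = false;     /* stand-in for unmap_mapping_range */
    return 0;
}
```

In this model a caller that maps the whole file, converts it, and then
tries to map it again gets -EPERM, while get_pfn() only starts to succeed
once the mapping is gone.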