Message-ID: <397f3cb2-1351-afcf-cd87-e8f9fb482059@redhat.com>
Date: Wed, 13 Jul 2022 16:00:59 +0200
To: Khalid Aziz, Andrew Morton, Mike Kravetz
Cc: willy@infradead.org, aneesh.kumar@linux.ibm.com, arnd@arndb.de,
    21cnbao@gmail.com, corbet@lwn.net, dave.hansen@linux.intel.com,
    ebiederm@xmission.com, hagen@jauu.net, jack@suse.cz,
    keescook@chromium.org, kirill@shutemov.name, kucharsk@gmail.com,
    linkinjeon@kernel.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, longpeng2@huawei.com,
    luto@kernel.org, markhemm@googlemail.com, pcc@google.com,
    rppt@kernel.org, sieberf@amazon.com, sjpark@amazon.de,
    surenb@google.com, tst@schoebel-theuer.de, yzaikin@google.com
References: <20220701212403.77ab8139b6e1aca87fae119e@linux-foundation.org>
    <0864a811-53c8-a87b-a32d-d6f4c7945caa@redhat.com>
    <357da99d-d096-a790-31d7-ee477e37c705@oracle.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v2 0/9] Add support for shared PTEs across processes
In-Reply-To: <357da99d-d096-a790-31d7-ee477e37c705@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 08.07.22 21:36, Khalid Aziz wrote:
> On 7/8/22 05:47, David Hildenbrand wrote:
>> On 02.07.22 06:24, Andrew Morton wrote:
>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz wrote:
>>>
>>>> This patch series implements a mechanism in kernel to allow
>>>> userspace processes to opt into sharing PTEs. It adds a new
>>>> in-memory filesystem - msharefs.
>>>
>>> Dumb question: why do we need a new filesystem for this? Is it not
>>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
>>>
>>
>> IIRC, the general opinion at LSF/MM was that this approach at hand
>> makes people nervous and I at least am not convinced that we really want
>> to have this upstream.
>
> Hi David,

Hi Khalid,

>
> You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and
> this just extends that concept to processes.

They share a *mm* including a consistent virtual memory layout (VMA
list). Page table sharing is just a side product of that. You could even
call page tables just an implementation detail to produce that
consistent virtual memory layout -- described for that MM via a
different data structure.

> A number of people have commented on potential usefulness of this concept
> and implementation.

... and a lot of people raised concerns. Yes, page table sharing to
reduce memory consumption/TLB misses/... is something reasonable to
have. But that doesn't require mshare, as hugetlb has proven.

The design might be useful for a handful of corner (!) cases, but as the
cover letter only talks about memory consumption of page tables, I'll
not care about those. Once these corner cases are explained and deemed
important, we might want to think of possible alternatives to explore
the solution space.

> There were concerns raised about being able to make this safe and reliable.
> I had agreed to send a
> second version of the patch incorporating feedback from last review and LSF/MM, and that is what v2 patch is about.
> The

Okay, most of the changes I saw are related to the user interface, not to
any of the actual dirty implementation-detail concerns. And the cover
letter is not really clear about what's actually happening under the hood
and what the (IMHO) weird semantics of the design imply (as can be seen
from Andrew's reply).

> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion
> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is
> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that
> wrong.

Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
and asked the honest question if we can just remove it and replace it by
something generic in the future. And as I learned, we most probably
cannot rip that out without affecting existing user space. Even
replacing it by mshare() would degrade existing user space.

So the natural thing to reduce page table consumption (again, what this
cover letter talks about) for user space (semi-?)automatically for
MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
code to cache and reuse page tables (PTE and PMD tables should be
sufficient) where suitable.

For reasonably aligned mappings and mapping sizes, it shouldn't be too
hard (I know, locking ...) to cache and reuse page tables attached to
files -- similar to what hugetlb does, just in a generic way. We might
want a mechanism to enable/disable this for specific processes and/or
VMAs, but these are minor details.

And that could come for free for existing user space, because page
tables, and how they are handled, would just be an implementation detail.

I'd be really interested in what major roadblocks/downsides
file-based page table sharing has.
Because I am not convinced that a mechanism like mshare() -- that has to
be explicitly implemented+used by user space -- is required for that.

--
Thanks,

David / dhildenb
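
As a rough illustration of the "reasonably aligned mappings and mapping sizes" condition and of the page table memory at stake, here is a minimal sketch. It assumes x86-64 with 4 KiB pages, where one PTE table holds 512 entries and therefore maps 2 MiB. The function names and the eligibility rule are hypothetical simplifications for illustration only; they are not kernel APIs and not what hugetlb or the mshare patches actually implement.

```python
PAGE_SIZE = 4096                                 # assumption: 4 KiB base pages
ENTRIES_PER_TABLE = 512                          # assumption: x86-64, 8-byte entries
PTE_TABLE_SPAN = PAGE_SIZE * ENTRIES_PER_TABLE   # one PTE table maps 2 MiB


def shareable_pte_tables(vaddr, file_offset, length):
    """Count PTE tables a mapping covers completely (hypothetical rule).

    A table cached on the file can only be reused if the same table slot
    maps the same file page in every process, so the virtual address and
    the file offset must be congruent modulo the table span.
    """
    if (vaddr - file_offset) % PTE_TABLE_SPAN != 0:
        return 0
    # Round the start up and the end down to table-span boundaries.
    start = -(-vaddr // PTE_TABLE_SPAN) * PTE_TABLE_SPAN
    end = ((vaddr + length) // PTE_TABLE_SPAN) * PTE_TABLE_SPAN
    return max(0, (end - start) // PTE_TABLE_SPAN)


def pte_table_bytes(length, processes, shared):
    """Memory used by PTE tables when `processes` map `length` bytes."""
    tables = -(-length // PTE_TABLE_SPAN)        # ceiling division
    copies = 1 if shared else processes
    return tables * PAGE_SIZE * copies


if __name__ == "__main__":
    one_gib = 1 << 30
    print(shareable_pte_tables(0x7F0000000000, 0, one_gib))
    print(pte_table_bytes(one_gib, 1000, shared=False))
    print(pte_table_bytes(one_gib, 1000, shared=True))
```

Under these assumptions, 1000 processes mapping the same 1 GiB file each need 512 PTE tables of their own, while a shared set would need 512 tables in total. A mapping whose address is not congruent with its file offset modulo 2 MiB would not qualify under this simplified rule.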