Date: Fri, 30 Sep 2022 02:29:32 +0800
From: Chih-En Lin
To: David Hildenbrand
Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox, Christophe Leroy,
    "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", Luis Chamberlain,
    Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
    "Kirill A . Shutemov", Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
    Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual, Minchan Kim,
    Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
    Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu, Dinglan Peng,
    Pedro Fonseca, Jim Huang, Huichun Feng
Subject: Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
References: <20220927162957.270460-1-shiyn.lin@gmail.com>
    <20220927162957.270460-10-shiyn.lin@gmail.com>
    <3D21021E-490F-4FE0-9C75-BB3A46A66A26@vmware.com>
    <39c5ef18-1138-c879-2c6d-c013c79fa335@redhat.com>
In-Reply-To: <39c5ef18-1138-c879-2c6d-c013c79fa335@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
> > > IMHO, a relaxed form that focuses on only the memory consumption reduction
> > > could *possibly* be accepted upstream if it's not too invasive or complex.
> > > During fork(), we'd do exactly what we used to do to PTEs (increment
> > > mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
> > > duplicate swap entries; all while holding the page table lock), however,
> > > sharing the prepared page table with the child process using COW after we
> > > prepared it.
> > >
> > > Any (most once we want to *optimize* rmap handling) modification attempts
> > > require breaking COW -- copying the page table for the faulting process. But
> > > at that point, the PTEs are already write-protected and properly accounted
> > > (refcount/mapcount/PageAnonExclusive).
> > >
> > > Doing it that way might not require any questionable GUP hacks and swapping,
> > > MMU notifiers etc. "might just work as expected" because the accounting
> > > remains unchanged" -- we simply de-duplicate the page table itself we'd have
> > > after fork and any modification attempts simply replace the mapped copy.
> >
> > Agree.
> > However for GUP hacks, if we want to do the COW to page table, we still
> > need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
> > check whether the PTE table is available or not before we do the COW to
> > the table). Otherwise, it will be more complicated since it might need
> > to handle situations like while preparing the COW work, it just figuring
> > out that it needs to duplicate the whole table and roll back (recover
> > the state and copy it to new table). Hopefully, I'm not wrong here.
>
> The nice thing is that GUP itself *usually* doesn't modify page tables. One
> corner case is follow_pfn_pte(). All other modifications should happen in
> the actual fault handler that has to deal with such kind of unsharing either
> way when modifying the PTE.
>
> If the pages are already in a COW-ed pagetable in the desired "shared" state
> (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
> pages will just work as expected and we shouldn't be surprised by another
> set of GUP+COW CVEs.
>
> We'd really only deduplicate the page table and not play other tricks with
> the actual page table content that differ from the existing way of handling
> fork().
>
> I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
> not modifying the page table. I think we only need "we have to unshare this
> page table now" in follow_pfn_pte() and inside the fault handling when GUP
> triggers a fault.
>
> I hope my assumption is correct, or am I missing something?
>

My concern is the case where a page has been pinned and we then COW the
page table so that it becomes shared.
It might not be allowed to map the pinned page R/O into both processes.
So, if fork() is already setting up the shared state, it has to restore
the table and copy it to a new one, since the pinned page needs to be
copied immediately. We can hold the shared state after such a situation
occurs. So we still need some trick to let fork() know which page tables
already contain a pinned page (or any page that won't let us share)
before it starts duplicating. Am I wrong here?

After that, since we handled the accounting in fork(), we don't need
the ownership (pmd_t pointer) anymore. We have to find another way to
mark the table as exclusive. (Right now, the COW_PTE_OWNER_EXCLUSIVE
flag is stored in that space.)

> >
> > > But devil is in the detail (page table lock, TLB flushing).
> >
> > Sure, it might be an overhead in the page fault and needs to be handled
> > carefully. ;)
> >
> > > "will make fork() even have more overhead" is not a good excuse for such
> > > complexity/hacks -- sure, it will make your benchmark results look better in
> > > comparison ;)
> >
> > ;);)
> > I think that, even if we do the accounting with the COW page table, it
> > still has a little bit improve.
>
> :)
>
> My gut feeling is that this is true. While we have to do a pass over the
> parent page table during fork and wrprotect all PTEs etc., we don't have to
> duplicate the page table content and allocate/free memory for that.
>
> One interesting case is when we cannot share an anon page with the child
> process because it maybe pinned -- and we have to copy it via
> copy_present_page(). In that case, the page table between the parent and the
> child would differ and we'd not be able to share the page table.

That is what I was trying to say above.
This case might come up in the middle of setting up the shared page
table, and recovering from it might cost more overhead. Therefore, if
GUP wants to pin a mapped page, we can mark the PTE table first, so
fork() won't waste time doing the work for sharing.
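
To make the "mark the table first" idea a bit more concrete, here is a
rough userspace toy model. It is only a sketch and not kernel code: the
types and helpers (toy_pte, toy_pte_table, the no_share flag,
toy_pin_page(), toy_fork_table()) are all made up for illustration and
do not come from the patch. It assumes one flag per table is enough to
tell fork() to skip sharing, and it ignores locking, TLB flushing,
refcount/mapcount accounting and swap entries entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PTRS_PER_TABLE 8	/* tiny on purpose, hypothetical value */

/* Hypothetical stand-in for a PTE: only the bits this sketch needs. */
struct toy_pte {
	bool present;
	bool writable;
	bool maybe_pinned;	/* page has to be copied at fork time */
};

/* Hypothetical stand-in for a PTE table that can be shared by refcount. */
struct toy_pte_table {
	int refcount;
	bool no_share;		/* set by the pin path, checked by fork */
	struct toy_pte pte[TOY_PTRS_PER_TABLE];
};

/* Pin path: mark the table before pinning so fork will not share it. */
static void toy_pin_page(struct toy_pte_table *t, size_t idx)
{
	assert(idx < TOY_PTRS_PER_TABLE && t->pte[idx].present);
	t->no_share = true;
	t->pte[idx].maybe_pinned = true;
}

/*
 * Fork step for one table. Returns the table the child should use:
 * the parent's table (shared, refcount bumped) or a private copy.
 * Returns NULL if the allocation for the private copy fails.
 */
static struct toy_pte_table *toy_fork_table(struct toy_pte_table *parent)
{
	struct toy_pte_table *child = parent;
	size_t i;

	assert(parent->refcount > 0);

	if (parent->no_share) {
		/* Fall back to a separate table for the child. */
		child = calloc(1, sizeof(*child));
		if (!child)
			return NULL;
		child->refcount = 1;
	}

	for (i = 0; i < TOY_PTRS_PER_TABLE; i++) {
		struct toy_pte *pte = &parent->pte[i];

		if (!pte->present)
			continue;
		if (pte->maybe_pinned) {
			/* Child gets its own page; parent entry is untouched. */
			assert(child != parent);
			child->pte[i] = *pte;
			child->pte[i].maybe_pinned = false;
			continue;
		}
		/* Ordinary page: write-protect, as fork does today. */
		pte->writable = false;
		if (child != parent)
			child->pte[i] = *pte;
	}

	if (child == parent)
		parent->refcount++;
	return child;
}
```

With this model, a table that never saw toy_pin_page() comes back from
toy_fork_table() as the same pointer with its refcount bumped, while a
marked table always takes the copy path, so there is nothing to roll
back in the middle of the sharing work.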

> That case could be caught in copy_pte_range(): in case we'd have to allocate
> a page via page_copy_prealloc(), we'd have to fall back to the ordinary
> "separate page table for the child" way of doing things.
>
> But that looks doable to me.

Sounds good. :)

> --
> Thanks,
>
> David / dhildenb
>

Thanks,
Chih-En Lin