Message-ID: <3654e74b-8145-33bb-1eb7-fb5e2ffd2fba@redhat.com>
Date: Thu, 29 Sep 2022 21:00:05 +0200
Subject: Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
From: David Hildenbrand
Organization: Red Hat
To: Chih-En Lin
Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox, Christophe Leroy,
 "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", Luis Chamberlain,
 Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
 "Kirill A . Shutemov", Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
 Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual, Minchan Kim,
 Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
 Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu, Dinglan Peng,
 Pedro Fonseca, Jim Huang, Huichun Feng
References: <20220927162957.270460-1-shiyn.lin@gmail.com>
 <20220927162957.270460-10-shiyn.lin@gmail.com>
 <3D21021E-490F-4FE0-9C75-BB3A46A66A26@vmware.com>
 <39c5ef18-1138-c879-2c6d-c013c79fa335@redhat.com>
 <834c258d-4c0e-1753-3608-8a7e28c14d07@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 29.09.22 20:57, Chih-En Lin wrote:
> On Thu, Sep 29, 2022 at 08:38:52PM +0200, David Hildenbrand wrote:
>> On 29.09.22 20:29, Chih-En Lin wrote:
>>> On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
>>>>>> IMHO, a relaxed form that focuses on only the memory consumption reduction
>>>>>> could *possibly* be accepted upstream if it's not too invasive or complex.
>>>>>> During fork(), we'd do exactly what we used to do to PTEs (increment
>>>>>> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
>>>>>> duplicate swap entries; all while holding the page table lock), however,
>>>>>> sharing the prepared page table with the child process using COW after we
>>>>>> prepared it.
>>>>>>
>>>>>> Any (most once we want to *optimize* rmap handling) modification attempts
>>>>>> require breaking COW -- copying the page table for the faulting process. But
>>>>>> at that point, the PTEs are already write-protected and properly accounted
>>>>>> (refcount/mapcount/PageAnonExclusive).
>>>>>>
>>>>>> Doing it that way might not require any questionable GUP hacks and swapping,
>>>>>> MMU notifiers etc. "might just work as expected" because the accounting
>>>>>> remains unchanged -- we simply de-duplicate the page table itself we'd have
>>>>>> after fork and any modification attempts simply replace the mapped copy.
>>>>>
>>>>> Agree.
>>>>> However for GUP hacks, if we want to do the COW to page table, we still
>>>>> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
>>>>> check whether the PTE table is available or not before we do the COW to
>>>>> the table). Otherwise, it will be more complicated since it might need
>>>>> to handle situations like while preparing the COW work, it just figuring
>>>>> out that it needs to duplicate the whole table and roll back (recover
>>>>> the state and copy it to new table). Hopefully, I'm not wrong here.
>>>>
>>>> The nice thing is that GUP itself *usually* doesn't modify page tables. One
>>>> corner case is follow_pfn_pte().
>>>> All other modifications should happen in
>>>> the actual fault handler that has to deal with such kind of unsharing either
>>>> way when modifying the PTE.
>>>>
>>>> If the pages are already in a COW-ed pagetable in the desired "shared" state
>>>> (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
>>>> pages will just work as expected and we shouldn't be surprised by another
>>>> set of GUP+COW CVEs.
>>>>
>>>> We'd really only deduplicate the page table and not play other tricks with
>>>> the actual page table content that differ from the existing way of handling
>>>> fork().
>>>>
>>>> I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
>>>> not modifying the page table. I think we only need "we have to unshare this
>>>> page table now" in follow_pfn_pte() and inside the fault handling when GUP
>>>> triggers a fault.
>>>>
>>>> I hope my assumption is correct, or am I missing something?
>>>>
>>>
>>> My consideration is when we pinned the page and did the COW to make the
>>> page table be shared. It might not allow mapping the pinned page (R/O)
>>> into both processes.
>>>
>>> So, if the fork is working on the shared state, it needs to recover the
>>> table and copy to a new one since that pinned page will need to copy
>>> immediately. We can hold the shared state after occurring such a
>>> situation. So we still need some trick to let the fork() know which page
>>> table already has the pinned page (or such page won't let us share)
>>> before going to duplicate.
>>>
>>> Am I wrong here?
>>
>> I think you might be overthinking this. Let's keep it simple:
>>
>> 1) Handle pinned anon pages just as I described below, falling back to the
>> "slow" path of page table copying.
>>
>> 2) Once we passed that stage, you can be sure that the COW-ed page table
>> cannot have actually pinned anon pages. All anon pages in such a page table
>> have PageAnonExclusive cleared and are "maybe shared".
>> GUP cannot succeed in
>> pinning these pages anymore, because it will only pin exclusive anon pages!
>>
>> 3) If anybody wants to take a R/O pin on a shared anon page that is mapped
>> into a COW-ed page table, we trigger a fault with FAULT_FLAG_UNSHARE instead
>> of pinning the page. This has to break COW on the page table and properly
>> map an exclusive anon page into it, breaking COW.
>>
>> Do you see a problem with that?
>>
>>>
>>> After that, since we handled the accounting in fork(), we don't need
>>> ownership (pmd_t pointer) anymore. We have to find another way to mark
>>> the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
>>> stored at that space.)
>>>
>>>>>
>>>>>> But devil is in the detail (page table lock, TLB flushing).
>>>>>
>>>>> Sure, it might be an overhead in the page fault and needs to be handled
>>>>> carefully. ;)
>>>>>
>>>>>> "will make fork() even have more overhead" is not a good excuse for such
>>>>>> complexity/hacks -- sure, it will make your benchmark results look better in
>>>>>> comparison ;)
>>>>>
>>>>> ;);)
>>>>> I think that, even if we do the accounting with the COW page table, it
>>>>> still has a little bit of improvement.
>>>>
>>>> :)
>>>>
>>>> My gut feeling is that this is true. While we have to do a pass over the
>>>> parent page table during fork and wrprotect all PTEs etc., we don't have to
>>>> duplicate the page table content and allocate/free memory for that.
>>>>
>>>> One interesting case is when we cannot share an anon page with the child
>>>> process because it may be pinned -- and we have to copy it via
>>>> copy_present_page(). In that case, the page table between the parent and the
>>>> child would differ and we'd not be able to share the page table.
>>>
>>> That is what I want to say above.
>>> The case might happen in the middle of the shared page table progress.
>>> It might cost more overhead to recover it.
>>> Therefore, if GUP wants to
>>> pin the mapped page we can mark the PTE table first, so fork() won't
>>> waste time doing the work for sharing.
>>
>> Having pinned pages is a corner case for most apps. No need to worry about
>> optimizing this corner case for now.
>>
>> I see what you are trying to optimize, but I don't think this is needed in a
>> first version, and probably never is needed.
>>
>>
>> Any attempts to mark page tables in a certain way from GUP
>> (COW_PTE_OWNER_EXCLUSIVE) are problematic either way: GUP-fast
>> (get_user_pages_fast) can race with pretty much anything, even with
>> concurrent fork. I suspect your current code might be really racy in that
>> regard.
>
> I see.
> Now, I know why optimizing that corner case is not worth it.
> Thank you for explaining that.

Falling back after already processing some PTEs requires some care, though.
I guess it's not too hard to get it right -- it might be harder to get
it "clean". But we can talk about that detail later.

--
Thanks,

David / dhildenb
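
For readers who want to watch the accounting play out, the three-step scheme discussed in this thread (fall back to copying when a page may be pinned, otherwise share a fully accounted table, and unshare on a R/O pin) can be modeled in a few lines of Python. This is only a toy userspace sketch under loud assumptions, not kernel code: `Page`, `Pte`, `Table`, `MM`, `fork`, `unshare_table` and `pin_readonly` are hypothetical names invented for illustration. Locking, TLB flushing and GUP-fast races are ignored, the child's PTE list is built eagerly for brevity, and the unshare step always copies the page where a real implementation might be able to reuse it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:                      # hypothetical stand-in for an anon page
    mapcount: int = 1
    refcount: int = 1
    anon_exclusive: bool = True
    maybe_pinned: bool = False


@dataclass
class Pte:                       # hypothetical stand-in for one PTE
    page: Page
    writable: bool = True


@dataclass
class Table:                     # hypothetical stand-in for one PTE table
    ptes: List[Optional[Pte]]
    sharers: int = 1


@dataclass
class MM:
    table: Table


def fork(parent: MM) -> MM:
    """One pass over the parent's table: account every page as usual."""
    table = parent.table
    child_ptes: List[Optional[Pte]] = []   # toy shortcut: built eagerly
    saw_pinned = False
    for pte in table.ptes:
        if pte is None:
            child_ptes.append(None)
        elif pte.page.maybe_pinned:
            # Step 1: cannot share; the child gets a fresh exclusive copy.
            saw_pinned = True
            child_ptes.append(Pte(Page()))
        else:
            page = pte.page
            page.mapcount += 1
            page.refcount += 1
            page.anon_exclusive = False     # now "maybe shared"
            pte.writable = False            # write-protect
            child_ptes.append(Pte(page, writable=False))
    if saw_pinned:
        return MM(Table(child_ptes))        # slow path: private table
    table.sharers += 1                      # step 2: share the table itself
    return MM(table)


def unshare_table(mm: MM) -> None:
    """Break COW on the table; the PTE accounting is already correct."""
    old = mm.table
    if old.sharers == 1:
        return
    old.sharers -= 1
    mm.table = Table([Pte(p.page, p.writable) if p else None for p in old.ptes])


def pin_readonly(mm: MM, idx: int) -> Page:
    """Step 3: a R/O pin on a shared anon page goes through an unshare fault."""
    if not mm.table.ptes[idx].page.anon_exclusive:
        unshare_table(mm)
        old_page = mm.table.ptes[idx].page
        old_page.mapcount -= 1
        old_page.refcount -= 1
        mm.table.ptes[idx] = Pte(Page(), writable=False)  # exclusive copy
    page = mm.table.ptes[idx].page
    page.refcount += 1
    page.maybe_pinned = True
    return page


parent = MM(Table([Pte(Page()), Pte(Page()), None]))
child = fork(parent)
shared_after_fork = child.table is parent.table
pinned = pin_readonly(child, 0)
grandchild = fork(child)         # child now holds a pinned page: slow path
```

In this model the first `fork()` shares the table, the pin in the child forces a private table plus an exclusive page, and the second `fork()` cannot share because the child's table now maps a page that may be pinned. Copying the table in `unshare_table()` touches no counters, which is the point made in the thread: the accounting was already done during `fork()`.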