From: Ryan Roberts
Date: Tue, 5 Dec 2023 14:16:53 +0000
Subject: Re: [PATCH v3 01/15] mm: Batch-copy PTE ranges during fork()
To: David Hildenbrand, Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier, Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu, Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino, Andrew Morton, Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan, Barry Song <21cnbao@gmail.com>, Alistair Popple, Yang Shi
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Message-ID: <0b48135a-a44b-4b5a-a33b-abd3a3b47ff8@arm.com>
References: <20231204105440.61448-1-ryan.roberts@arm.com> <20231204105440.61448-2-ryan.roberts@arm.com> <104de2d6-ecf9-4b0c-a982-5bd8e1aea758@redhat.com> <5b8b9f8c-8e9b-42a5-b8b2-9b96903f3ada@redhat.com>

On 05/12/2023 12:04, David Hildenbrand wrote:
> On 05.12.23 12:30, Ryan Roberts wrote:
>> On 04/12/2023 17:27, David Hildenbrand wrote:
>>>>
>>>> With rmap batching from [1] -- rebased+changed on top of that -- we could
>>>> turn that into an effective (untested):
>>>>
>>>>            if (page && folio_test_anon(folio)) {
>>>> +               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
>>>> +                                               pte, enforce_uffd_wp, &nr_dirty,
>>>> +                                               &nr_writable);
>>>>                    /*
>>>>                     * If this page may have been pinned by the parent process,
>>>>                     * copy the page immediately for the child so that we'll always
>>>>                     * guarantee the pinned page won't be randomly replaced in the
>>>>                     * future.
>>>>                     */
>>>> -               folio_get(folio);
>>>> -               if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
>>>> +               folio_ref_add(folio, nr);
>>>> +               if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page, nr, src_vma))) {
>>>>                            /* Page may be pinned, we have to copy. */
>>>> -                       folio_put(folio);
>>>> -                       return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>> -                                                addr, rss, prealloc, page);
>>>> +                       folio_ref_sub(folio, nr);
>>>> +                       ret = copy_present_page(dst_vma, src_vma, dst_pte,
>>>> +                                               src_pte, addr, rss, prealloc,
>>>> +                                               page);
>>>> +                       return ret == 0 ? 1 : ret;
>>>>                    }
>>>> -               rss[MM_ANONPAGES]++;
>>>> +               rss[MM_ANONPAGES] += nr;
>>>>            } else if (page) {
>>>> -               folio_get(folio);
>>>> -               folio_dup_file_rmap_pte(folio, page);
>>>> -               rss[mm_counter_file(page)]++;
>>>> +               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr, end,
>>>> +                                               pte, enforce_uffd_wp, &nr_dirty,
>>>> +                                               &nr_writable);
>>>> +               folio_ref_add(folio, nr);
>>>> +               folio_dup_file_rmap_ptes(folio, page, nr);
>>>> +               rss[mm_counter_file(page)] += nr;
>>>>            }
>>>>
>>>> We'll have to test performance, but it could be that we want to specialize
>>>> more on !folio_test_large(). That code is very performance-sensitive.
>>>>
>>>> [1] https://lkml.kernel.org/r/20231204142146.91437-1-david@redhat.com
>>>
>>> So, on top of [1] without rmap batching but with a slightly modified version of
>>
>> Can you clarify what you mean by "without rmap batching"? I thought [1]
>> implicitly adds rmap batching? (e.g. folio_dup_file_rmap_ptes(), which you've
>> added in the code snippet above).
>
> Not calling the batched variants but essentially doing what your code does
> (with some minor improvements, like updating the rss counters only once).
>
> The snippet above is only linked below. I had the performance numbers for [1]
> ready, so I gave it a test on top of that.
>
> To keep it simple, you might just benchmark with and without your patches.
>
>>
>>> yours (that keeps the existing code structure as pointed out and e.g. updates
>>> the counters), running my fork() microbenchmark with 1 GiB of memory:
>>>
>>> Compared to [1], with all order-0 pages it gets 13--14% _slower_ and with all
>>> PTE-mapped THP (order-9) it gets ~29--30% _faster_.
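
(For anyone following along: stripped of all the kernel details, the gist of the
batching above is "find the length of the contiguous run of PTEs that map the
same folio, then do the refcount and rss bookkeeping once per run instead of
once per PTE". A standalone userspace sketch of that idea -- made-up types and
names, not the kernel code:

    #include <stdio.h>

    struct folio { int refcount; };
    struct pte { struct folio *folio; };

    /* Count how many consecutive entries, starting at i, map the same folio. */
    static int nr_cont_mapped(struct pte *ptes, int i, int nr_total)
    {
            int nr = 1;

            while (i + nr < nr_total && ptes[i + nr].folio == ptes[i].folio)
                    nr++;
            return nr;
    }

    int main(void)
    {
            struct folio a = { 0 }, b = { 0 };
            struct pte ptes[] = {
                    { &a }, { &a }, { &a }, { &b }, { &b }, { &a },
            };
            int nr_total = sizeof(ptes) / sizeof(ptes[0]);
            long rss = 0;

            for (int i = 0; i < nr_total; ) {
                    int nr = nr_cont_mapped(ptes, i, nr_total);

                    /* One refcount/rss update covers the whole run. */
                    ptes[i].folio->refcount += nr;
                    rss += nr;
                    i += nr;
            }

            printf("rss=%ld a.refcount=%d b.refcount=%d\n",
                   rss, a.refcount, b.refcount);
            return 0;
    }

The kernel version obviously has to check much more per entry (uffd-wp, the
dirty/write bits, folio boundaries), but the once-per-run bookkeeping is the
part that pays off for PTE-mapped THP.)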
>>
>> What test are you running - I'd like to reproduce if possible, since it sounds
>> like I've got some work to do to remove the order-0 regression.
>
> Essentially just allocating 1 GiB of memory and measuring how long it takes to
> call fork().
>
> order-0 benchmarks:
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/order-0-benchmarks.c?ref_type=heads
>
> e.g.: $ ./order-0-benchmarks fork 100
>
> pte-mapped-thp benchmarks:
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-thp-benchmarks.c?ref_type=heads
>
> e.g.: $ ./pte-mapped-thp-benchmarks fork 100
>
> Ideally, pin to one CPU and get stable performance numbers by disabling
> SMT+turbo etc.

This is great - thanks! I'll get to work...

>
>>
>>> So it looks like we really want to have a completely separate code path for
>>> "!folio_test_large()" to keep that case as fast as possible. And "Likely" we
>>> want to use "likely(!folio_test_large())". ;)
>>
>> Yuk, but fair enough. If I can repro the perf numbers, I'll have a go at
>> reworking this.
>>
>> I think you're also implicitly suggesting that this change needs to depend on
>> [1]? Which is a shame...
>
> Not necessarily. It certainly cleans up the code, but we can do that in any
> order reasonable.
>
>> I guess I should also go through a similar exercise for patch 2 in this series.
>
> Yes. There are "unmap" and "pte-dontneed" benchmarks contained in both files
> above.
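
For the archives, a minimal sketch of what such a fork() timing loop could look
like (this is an assumption about the shape of the test, not David's actual
order-0-benchmarks.c; everything below is made up for illustration): allocate
and touch 1 GiB of anonymous memory, pin to one CPU as suggested above, then
time fork()+wait over N iterations:

    /* Hypothetical fork() microbenchmark sketch, not the benchmark linked above. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define MEM_SIZE (1024UL * 1024 * 1024)   /* 1 GiB */

    int main(int argc, char **argv)
    {
            int iterations = argc > 1 ? atoi(argv[1]) : 100;

            /* Pin to CPU 0 for more stable numbers. */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(0, &set);
            sched_setaffinity(0, sizeof(set), &set);

            /* Populate 1 GiB of anonymous memory so fork() has PTEs to copy. */
            char *mem = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (mem == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            memset(mem, 1, MEM_SIZE);

            struct timespec start, end;
            clock_gettime(CLOCK_MONOTONIC, &start);
            for (int i = 0; i < iterations; i++) {
                    pid_t pid = fork();
                    if (pid == 0)
                            _exit(0);   /* child exits immediately */
                    waitpid(pid, NULL, 0);
            }
            clock_gettime(CLOCK_MONOTONIC, &end);

            double total_us = (end.tv_sec - start.tv_sec) * 1e6 +
                              (end.tv_nsec - start.tv_nsec) / 1e3;
            printf("fork: %.1f us per iteration\n", total_us / iterations);
            return 0;
    }

Built with something like "gcc -O2 fork-bench.c" and run as "./a.out 100";
disabling SMT and turbo, as David suggests, should reduce run-to-run variance.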