From: Barry Song <21cnbao@gmail.com>
Date: Mon, 27 Nov 2023 22:59:57 +1300
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
References: <271f1e98-6217-4b40-bae0-0ac9fe5851cb@redhat.com> <20231127084217.13110-1-v-songbaohua@oppo.com>
To: Ryan Roberts
Cc: david@redhat.com, akpm@linux-foundation.org, andreyknvl@gmail.com,
 anshuman.khandual@arm.com, ardb@kernel.org, catalin.marinas@arm.com,
 dvyukov@google.com, glider@google.com, james.morse@arm.com,
 jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com,
 maz@kernel.org,
 oliver.upton@linux.dev, ryabinin.a.a@gmail.com, suzuki.poulose@arm.com,
 vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com, will@kernel.org,
 willy@infradead.org, yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts wrote:
>
> On 27/11/2023 08:42, Barry Song wrote:
> >>> +             for (i = 0; i < nr; i++, page++) {
> >>> +                     if (anon) {
> >>> +                             /*
> >>> +                              * If this page may have been pinned by the
> >>> +                              * parent process, copy the page immediately for
> >>> +                              * the child so that we'll always guarantee the
> >>> +                              * pinned page won't be randomly replaced in the
> >>> +                              * future.
> >>> +                              */
> >>> +                             if (unlikely(page_try_dup_anon_rmap(
> >>> +                                             page, false, src_vma))) {
> >>> +                                     if (i != 0)
> >>> +                                             break;
> >>> +                                     /* Page may be pinned, we have to copy. */
> >>> +                                     return copy_present_page(
> >>> +                                             dst_vma, src_vma, dst_pte,
> >>> +                                             src_pte, addr, rss, prealloc,
> >>> +                                             page);
> >>> +                             }
> >>> +                             rss[MM_ANONPAGES]++;
> >>> +                             VM_BUG_ON(PageAnonExclusive(page));
> >>> +                     } else {
> >>> +                             page_dup_file_rmap(page, false);
> >>> +                             rss[mm_counter_file(page)]++;
> >>> +                     }
> >>>               }
> >>> -             rss[MM_ANONPAGES]++;
> >>> -     } else if (page) {
> >>> -             folio_get(folio);
> >>> -             page_dup_file_rmap(page, false);
> >>> -             rss[mm_counter_file(page)]++;
> >>> +
> >>> +             nr = i;
> >>> +             folio_ref_add(folio, nr);
> >>
> >> You're changing the order of mapcount vs. refcount increment. Don't.
> >> Make sure your refcount >= mapcount.
> >>
> >> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >> then decrementing in case of error accordingly. Errors due to pinned
> >> pages are the corner case.
> >>
> >> I'll note that it will make a lot of sense to have batch variants of
> >> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>
> >
> > i still don't understand why it is not a entire map+1, but an increment
> > in each basepage.
>
> Because we are PTE-mapping the folio, we have to account each individual page.
> If we accounted the entire folio, where would we unaccount it? Each page can be
> unmapped individually (e.g. munmap() part of the folio) so need to account each
> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> atomic, so we can account the entire thing.

Hi Ryan,

There is no problem. For example, a large folio is entirely mapped in
process A with CONTPTE, and only page2 is mapped in process B.
Then we will have:

entire_map = 0
page0.map = -1
page1.map = -1
page2.map = 0
page3.map = -1
....

> >
> > as long as it is a CONTPTE large folio, there is no much difference with
> > PMD-mapped large folio. it has all the chance to be DoubleMap and need
> > split.
> >
> > When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> > similar things on a part of the large folio in process A,
> >
> > this large folio will have partially mapped subpage in A (all CONTPE bits
> > in all subpages need to be removed though we only unmap a part of the
> > large folio as HW requires consistent CONTPTEs); and it has entire map in
> > process B(all PTEs are still CONPTES in process B).
> >
> > isn't it more sensible for this large folios to have entire_map = 0(for
> > process B), and subpages which are still mapped in process A has map_count
> > =0? (start from -1).
> >
> >> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >> check once if the folio maybe pinned, and in that case, you can simply
> >> drop all references again. So you either have all or no ptes to process,
> >> which makes that code easier.
>
> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> fundamentally you can only use entire_mapcount if its only possible to map and
> unmap the whole folio atomically.

My point is that the CONT bit should either be set in all 16 PTEs or be
dropped in all 16 PTEs. If all PTEs have CONT, the folio is entirely mapped;
otherwise, it is partially mapped. If a large folio is mapped in one process
with all CONTPTEs and meanwhile in another process with a partial mapping
(w/o CONTPTE), it is DoubleMapped.

Since we always hold the ptl to set or drop CONTPTE bits, set/drop is still
atomic within the spinlock-protected region.

>
> >>
> >> But that can be added on top, and I'll happily do that.
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
>

Thanks
Barry
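
To make the accounting argued for above concrete, here is a minimal
user-space sketch of it. It is a toy model, not kernel code: struct
toy_folio and every toy_* helper are hypothetical names invented for
illustration, and the block size of 16 is taken from the "16 PTEs" mentioned
above. It only models the counters (entire_map plus one map count per page,
all starting from -1), not PTEs, locking, or refcounts.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_PAGES 16	/* assumption: 16 PTEs per contiguous block */

/* Hypothetical model of the counters only; not the kernel's struct folio. */
struct toy_folio {
	int entire_mapcount;			/* -1: no whole-folio mapping */
	int page_mapcount[TOY_NR_PAGES];	/* -1: page has no PTE mapping */
};

static void toy_folio_init(struct toy_folio *f)
{
	f->entire_mapcount = -1;
	for (int i = 0; i < TOY_NR_PAGES; i++)
		f->page_mapcount[i] = -1;
}

/* A process maps the whole folio with the CONT bit set in every PTE. */
static void toy_map_entire(struct toy_folio *f)
{
	f->entire_mapcount++;
}

/* A process maps one page of the folio with an ordinary PTE. */
static void toy_map_page(struct toy_folio *f, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_PAGES);
	f->page_mapcount[idx]++;
}

/*
 * A process holding an entire mapping unmaps pages [start, end). The CONT
 * bit has to be dropped from all of its PTEs, so its entire mapping is
 * converted into per-page mappings for the pages that stay mapped.
 */
static void toy_unmap_range_from_entire(struct toy_folio *f, int start, int end)
{
	assert(f->entire_mapcount >= 0);
	assert(start >= 0 && start <= end && end <= TOY_NR_PAGES);
	f->entire_mapcount--;
	for (int i = 0; i < TOY_NR_PAGES; i++)
		if (i < start || i >= end)
			f->page_mapcount[i]++;
}

/* Total mappings of one page: whole-folio mappings plus per-page ones. */
static int toy_page_mappings(const struct toy_folio *f, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_PAGES);
	return (f->entire_mapcount + 1) + (f->page_mapcount[idx] + 1);
}

/* "DoubleMapped": an entire mapping coexists with a per-page mapping. */
static bool toy_double_mapped(const struct toy_folio *f)
{
	if (f->entire_mapcount < 0)
		return false;
	for (int i = 0; i < TOY_NR_PAGES; i++)
		if (f->page_mapcount[i] >= 0)
			return true;
	return false;
}
```

Calling toy_folio_init(), then toy_map_entire() for process A and
toy_map_page(&f, 2) for process B leaves entire_mapcount at 0 and only
page_mapcount[2] at 0, which is the state listed in the example earlier in
this mail. Whether the real rmap code can afford this conversion on every
partial unmap is exactly the open question in the thread; the sketch takes no
position on that.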