References: <20230207035139.272707-1-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
From: Pasha Tatashin
Date: Thu, 9 Feb 2023 13:15:56 -0500
Subject: Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
To: Chih-En Lin
Cc: Andrew Morton, Qi Zheng, David Hildenbrand,
    "Matthew Wilcox (Oracle)", Christophe Leroy, John Hubbard,
    Nadav Amit, Barry Song, Steven Rostedt, Masami Hiramatsu,
    Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
    Alexander Shishkin, Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu,
    Vlastimil Babka, "Zach O'Keefe", Yun Zhou, Hugh Dickins,
    Suren Baghdasaryan, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
    Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin, Gautam Menghani,
    Catalin Marinas, Mark Brown, Will Deacon, Vincenzo Frascino,
    Thomas Gleixner, "Eric W. Biederman", Andy Lutomirski,
    Sebastian Andrzej Siewior, "Liam R. Howlett", Fenghua Yu, Andrei Vagin,
    Barret Rhoden, Michal Hocko, "Jason A.
    Donenfeld", Alexey Gladkov, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-trace-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
    Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Mon, Feb 6, 2023 at 10:52 PM Chih-En Lin wrote:
>
> v3 -> v4
> - Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g.,
>   s390 and powerpc32, don't support the PMD entry and PTE table
>   operations.
> - Fix unmatch type of break_cow_pte_range() in
>   migrate_vma_collect_pmd().
> - Don’t break COW PTE in folio_referenced_one().
> - Fix the wrong VMA range checking in break_cow_pte_range().
> - Only break COW when we modify the soft-dirty bit in
>   clear_refs_pte_range().
> - Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
> - Change the tlb flush from flush_tlb_mm_range() (x86 specific) to
>   tlb_flush_pmd_range().
> - Handle VM_DONTCOPY with COW PTE fork.
> - Fix the wrong address and invalid vma in recover_pte_range().
> - Fix the infinite page fault loop in GUP routine.
>   In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
>   handler, we return -EMLINK to let the GUP handles the page fault
>   (call faultin_page() in __get_user_pages()).
> - return not_found(pvmw) if the break COW PTE failed in
>   page_vma_mapped_walk().
> - Since COW PTE has the same result as the normal COW selftest, it
>   probably passed the COW selftest.
>
>     # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
>     not ok 33 No leak from parent into child
>     # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
>     not ok 44 No leak from parent into child
>     # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
>     not ok 55 No leak from child into parent
>     # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
>     not ok 66 No leak from child into parent
>
>     Bail out! 4 out of 147 tests failed
>     # Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0
>
>   See the more information about anon cow hugetlb tests:
>     https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/
>
>
> v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/
>
> RFC v2 -> v3
> - Change the sysctl with PID to prctl(PR_SET_COW_PTE).
> - Account all the COW PTE mapped pages in fork() instead of defer it to
>   page fault (break COW PTE).
> - If there is an unshareable mapped page (maybe pinned or private
>   device), recover all the entries that are already handled by COW PTE
>   fork, then copy to the new one.
> - Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP,
>   follow_pfn_pte().
> - Remove the PTE ownership since we don't need it.
> - Use pte lock to protect the break COW PTE and free COW-ed PTE.
> - Do TLB flushing in break COW PTE handler.
> - Handle THP, KSM, madvise, mprotect, uffd and migrate device.
> - Handle the replacement page of uprobe.
> - Handle the clear_refs_write() of fs/proc.
> - All of the benchmarks dropped since the accounting and pte lock.
>   The benchmarks of v3 is worse than RFC v2, most of the cases are
>   similar to the normal fork, but there still have an use case
>   (TriforceAFL) is better than the normal fork version.
>
> RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/
>
> RFC v1 -> RFC v2
> - Change the clone flag method to sysctl with PID.
> - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
>   MMF_COW_PTE_READY, for the sysctl.
> - Change the owner pointer to use the folio padding.
> - Handle all the VMAs that cover the PTE table when doing the break COW PTE.
> - Remove the self-defined refcount to use the _refcount for the page
>   table page.
> - Add the exclusive flag to let the page table only own by one task in
>   some situations.
> - Invalidate address range MMU notifier and start the write_seqcount
>   when doing the break COW PTE.
> - Handle the swap cache and swapoff.
>
> RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/
>
> ---
>
> Currently, copy-on-write is only used for the mapped memory; the child
> process still needs to copy the entire page table from the parent
> process during forking. The parent process might take a lot of time and
> memory to copy the page table when the parent has a big page table
> allocated. For example, the memory usage of a process after forking with
> 1 GB mapped memory is as follows:

For some reason, I was not able to reproduce the performance improvement
with a simple fork() latency measurement program. These are the results
I saw:

Base:
Fork latency per gigabyte: 0.004416 seconds
Fork latency per gigabyte: 0.004382 seconds
Fork latency per gigabyte: 0.004442 seconds

COW kernel:
Fork latency per gigabyte: 0.004524 seconds
Fork latency per gigabyte: 0.004764 seconds
Fork latency per gigabyte: 0.004547 seconds

AMD EPYC 7B12 64-Core Processor
Base:
Fork latency per gigabyte: 0.003923 seconds
Fork latency per gigabyte: 0.003909 seconds
Fork latency per gigabyte: 0.003955 seconds

COW kernel:
Fork latency per gigabyte: 0.004221 seconds
Fork latency per gigabyte: 0.003882 seconds
Fork latency per gigabyte: 0.003854 seconds

Given that the page table for the child is not copied, I was expecting
fork() to be faster with the COW kernel, and its latency not to depend on
the size of the parent.
Test program:

/* Headers needed by the calls below */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/time.h>

#define USEC	1000000
#define GIG	(1ul << 30)
#define NGIG	32
#define SIZE	(NGIG * GIG)
#define NPROC	16

int main(void)
{
	int page_size = getpagesize();
	struct timeval start, end;
	long duration, i;
	char *p;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	madvise(p, SIZE, MADV_NOHUGEPAGE);

	/* Touch every page */
	for (i = 0; i < SIZE; i += page_size)
		p[i] = 0;

	gettimeofday(&start, NULL);
	for (i = 0; i < NPROC; i++) {
		int pid = fork();

		if (pid == 0) {
			sleep(30);
			exit(0);
		}
	}
	gettimeofday(&end, NULL);

	/* Normalize per proc and per gig */
	duration = ((end.tv_sec - start.tv_sec) * USEC
		    + (end.tv_usec - start.tv_usec)) / NPROC / NGIG;

	printf("Fork latency per gigabyte: %ld.%06ld seconds\n",
	       duration / USEC, duration % USEC);
	return 0;
}