Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp4700471rwb; Mon, 21 Nov 2022 10:41:57 -0800 (PST) X-Google-Smtp-Source: AA0mqf4pu8UvjEc1K3mrV5K48POUH0USbYmf+AlSJ77VgaibCJabpguOVfQAeL9Ff+Oret4Zp984 X-Received: by 2002:a17:902:f813:b0:17f:8011:dd03 with SMTP id ix19-20020a170902f81300b0017f8011dd03mr320979plb.59.1669056117573; Mon, 21 Nov 2022 10:41:57 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1669056117; cv=none; d=google.com; s=arc-20160816; b=E2plVWBV1i7TmCu7Zj/btnypnC/bCt7hd55iYenFNh7WRQ2UZBJswHd4pPJbaPXtwd 1JG9GLanZT4vza5hO3cLs0w+8GnlEl//tv7Swa5r73r3H8q2gJPLak8ZTHgF7jIYcJd/ 0cgTq2nkJxjcNSTMGuXoS2/w5kv4htQVmIBUeN5VQnCvxCwx/iMiwKMPopjxFhj3q6Ll xKW9HbZ8BQ/ze1/QbT4ezpMcmLUn3CCGVzzJBwpwIO8jckRI0rDbhTJMIDznzoQgKydd 22vC5MT6qn9ph8Qaemyc0n+quaYBIbWovHPl+MmKKZYwIezocfOkLr/o12Gx6UNutmkj KOHQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=HKC6PzeRn6bPjEwwYUvrQnoSSD38TBwL6brYXNQ38R8=; b=kGd15hTlGXcKtar0DK0nHFz8dIK96yia+UFgO67tc9yoBrYe4dHicIytZ7dKLpiuJB f795K0HBCFZ2mNiiNYlooLXS+govE+VMnVpZgIQhB4Fte8F2T1Vx+5q00bTKFmp1rSAW SHsYAVNnJFMLE17w0SHoYJ8m2cK6JYS4Jb7NED/psKklrCqbgIePWQGx300ud3fVE36M k44m6W2moc4tp8tzvEindm20a89pF2AtVjwqmIVbtt5aUudBrB/fkpJh37vsgkN13KKD nPylhPKssVcT7pHhrsEDsqrGHfkq82SMTynjlVHI08nZBDokb7d3VIFesueUqbbjuOb2 HQIw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=hFt9gHc3; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id b10-20020a6541ca000000b0046fea833b07si12002407pgq.123.2022.11.21.10.41.45; Mon, 21 Nov 2022 10:41:57 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=hFt9gHc3; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229716AbiKUSdD (ORCPT + 92 others); Mon, 21 Nov 2022 13:33:03 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43086 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229682AbiKUSdB (ORCPT ); Mon, 21 Nov 2022 13:33:01 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B9D2AD0DC5 for ; Mon, 21 Nov 2022 10:32:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1669055521; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=HKC6PzeRn6bPjEwwYUvrQnoSSD38TBwL6brYXNQ38R8=; b=hFt9gHc3vxwslBRw8P9VnrCXMlACTeDdSSsHfsFhtoN0IGQ7GPGnrTwUb3XZtXQ6Ad+jug YfQ1tVjL69WWXU8BSZH/xijZCWTWlcyrXK2HHVMqM6JRYbBOXSUlpLHpNVP2712TvofrPk EPplZSSCKrQFuyl+ICZVKwHxxutYHtM= Received: from mail-qk1-f200.google.com (mail-qk1-f200.google.com [209.85.222.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-127-uMkCMel8MAmdW9tLiUcodA-1; Mon, 21 Nov 2022 13:31:59 -0500 X-MC-Unique: uMkCMel8MAmdW9tLiUcodA-1 Received: by mail-qk1-f200.google.com with SMTP id bq13-20020a05620a468d00b006fa5a75759aso16561017qkb.13 for ; Mon, 21 Nov 2022 10:31:59 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=HKC6PzeRn6bPjEwwYUvrQnoSSD38TBwL6brYXNQ38R8=; b=bwGdNMNsD4b2v3cNhn1hbjnJqXGyjnkJqtLhQTwXvbV/HaK+BHQ+D4C2bOFvNdCref Ro+9rkm//W5dlod3tYUOkooeVJz902cWiIY0KAToBqvBfBAK/aTG0Zr4WIFS8wtUrXak RT5eZvZ3O6eUG/em3toPStE1tK9UGpSZQfVr2cpoggp1kdaCTZ/fYsbLvuXt5J9EG6ZG RA4TmiDI4Yhq74ebRqWLOdaevuCxCYubef3TqDCOAKElXmu+zIzSSLFiJEnICBCxarft 39DfnPO315czJzt31Yg0idV6Xq8PjIeWEV++a0TT/NR6C+T/gUNsY5eV56276YaiiMnh Jhqw== X-Gm-Message-State: ANoB5pk3dPZdoNoh7FHgeyN/YR4/L+Ip4Ul7WKs68XKbrEdhDA1s238O 5582E9IBPvpF/a9qBDrn/3mVNH3BXRFD9FFZALBjVHQAHJ01PQbSBjT5kjOg42qq/KCrkYwWnOh +GwsK0yA+Exbdb7fAJlOAj1yE X-Received: by 2002:a37:5841:0:b0:6ec:b463:3b88 with SMTP id m62-20020a375841000000b006ecb4633b88mr17087692qkb.720.1669055518890; Mon, 21 Nov 2022 10:31:58 -0800 (PST) X-Received: by 2002:a37:5841:0:b0:6ec:b463:3b88 with SMTP id m62-20020a375841000000b006ecb4633b88mr17087671qkb.720.1669055518522; Mon, 21 Nov 2022 10:31:58 -0800 (PST) Received: from x1n (bras-base-aurron9127w-grc-46-70-31-27-79.dsl.bell.ca. [70.31.27.79]) by smtp.gmail.com with ESMTPSA id r13-20020a05620a298d00b006eee3a09ff3sm8695375qkp.69.2022.11.21.10.31.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 21 Nov 2022 10:31:58 -0800 (PST) Date: Mon, 21 Nov 2022 13:31:57 -0500 From: Peter Xu To: hev Cc: Huacai Chen , Huacai Chen , loongarch@lists.linux.dev, Xuefeng Li , Guo Ren , Xuerui Wang , Jiaxun Yang , "David S . Miller" , sparclinux@vger.kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org Subject: Re: [PATCH 04/47] LoongArch: Set _PAGE_DIRTY only if _PAGE_WRITE is set in {pmd,pte}_mkdirty() Message-ID: References: <20221117042532.4064448-1-chenhuacai@loongson.cn> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Nov 19, 2022 at 10:20:20PM +0800, hev wrote: > Hi, Peter, Hev, > > On Fri, Nov 18, 2022 at 2:53 AM Peter Xu wrote: > > > > On Thu, Nov 17, 2022 at 10:12:07AM -0500, Peter Xu wrote: > > > Hi, Huacai, > > > > > > On Thu, Nov 17, 2022 at 12:25:32PM +0800, Huacai Chen wrote: > > > > Now {pmd,pte}_mkdirty() set _PAGE_DIRTY bit unconditionally, this causes > > > > random segmentation fault after commit 0ccf7f168e17bb7e ("mm/thp: carry > > > > over dirty bit when thp splits on pmd"). > > > > > > > > The reason is: when fork(), parent process use pmd_wrprotect() to clear > > > > huge page's _PAGE_WRITE and _PAGE_DIRTY (for COW); > > > > > > Is it safe to drop dirty bit when wr-protect? It means the mm can reclaim > > > the page directly assuming the page contains rubbish. > > > > > > Consider after fork() and memory pressure kicks the kswapd, I don't see > > > anything stops the kswapd from recycling the pages and lose the data in > > > both processes. > > > > Feel free to ignore this question.. I think I got an answer from Hev (and > > I then got a follow up question): > > > > https://lore.kernel.org/all/Y3Z9Zf0jARMOkFBq@x1n/ > > > > > > > > > then pte_mkdirty() set > > > > _PAGE_DIRTY as well as _PAGE_MODIFIED while splitting dirty huge pages; > > > > once _PAGE_DIRTY is set, there will be no tlb modify exception so the COW > > > > machanism fails; and at last memory corruption occurred between parent > > > > and child processes. > > > > > > > > So, we should set _PAGE_DIRTY only when _PAGE_WRITE is set in {pmd,pte}_ > > > > mkdirty(). > > > > > > > > Cc: stable@vger.kernel.org > > > > Cc: Peter Xu > > > > Signed-off-by: Huacai Chen > > > > --- > > > > Note: CC sparc maillist because they have similar issues. > > > > > > I also had a look on sparc64, it seems to not do the same as loongarch > > > here (not removing dirty in wr-protect): > > > > > > static inline pmd_t pmd_wrprotect(pmd_t pmd) > > > { > > > pte_t pte = __pte(pmd_val(pmd)); > > > > > > pte = pte_wrprotect(pte); > > > > > > return __pmd(pte_val(pte)); > > > } > > > > > > static inline pte_t pte_wrprotect(pte_t pte) > > > { > > > unsigned long val = pte_val(pte), tmp; > > > > > > __asm__ __volatile__( > > > "\n661: andn %0, %3, %0\n" > > > " nop\n" > > > "\n662: nop\n" > > > " nop\n" > > > " .section .sun4v_2insn_patch, \"ax\"\n" > > > " .word 661b\n" > > > " sethi %%uhi(%4), %1\n" > > > " sllx %1, 32, %1\n" > > > " .word 662b\n" > > > " or %1, %%lo(%4), %1\n" > > > " andn %0, %1, %0\n" > > > " .previous\n" > > > : "=r" (val), "=r" (tmp) > > > : "0" (val), "i" (_PAGE_WRITE_4U | _PAGE_W_4U), > > > "i" (_PAGE_WRITE_4V | _PAGE_W_4V)); > > > > > > return __pte(val); > > > } > > > > (Same here; I just overlooked what does _PAGE_W_4U meant..) > > > > > > > > > > > > > arch/loongarch/include/asm/pgtable.h | 8 ++++++-- > > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h > > > > index 946704bee599..debbe116f105 100644 > > > > --- a/arch/loongarch/include/asm/pgtable.h > > > > +++ b/arch/loongarch/include/asm/pgtable.h > > > > @@ -349,7 +349,9 @@ static inline pte_t pte_mkclean(pte_t pte) > > > > > > > > static inline pte_t pte_mkdirty(pte_t pte) > > > > { > > > > - pte_val(pte) |= (_PAGE_DIRTY | _PAGE_MODIFIED); > > > > + pte_val(pte) |= _PAGE_MODIFIED; > > > > + if (pte_val(pte) & _PAGE_WRITE) > > > > + pte_val(pte) |= _PAGE_DIRTY; > > > > > > I'm not sure whether mm has rule to always set write bit then set dirty > > > bit, need to be careful here because the outcome may differ when use: > > > > > > pte_mkdirty(pte_mkwrite(pte)) > > > (expected) > > > > > > VS: > > > > > > pte_mkwrite(pte_mkdirty(pte)) > > > (dirty not set) > > > > > > I had a feeling I miss some arch-specific details here on why loongarch > > > needs such implementation, but I can't quickly tell. > > > > After a closer look I think it's fine for loongarch as pte_mkwrite will > > also set the dirty bit unconditionally, so at least the two ways will still > > generate the same pte (DIRTY+MODIFIED+WRITE). > > > > But this whole thing is still confusing to me. It'll still be great if > > anyone can help explain why the _DIRTY cannot be set only in pte_mkwrite() > > if that's the solo place in charge of "whether the pte is writable". > > > > The other follow up question is: how do we mark "this pte contains valid > > data" (the common definition of "dirty bit"), while "this pte is not > > writable" on loongarch? > > > > It can happen when we're installing a page with non-zero data meanwhile > > wr-protected. That's actually a valid case for userfaultfd wr-protect mode > > where user specified UFFDIO_COPY ioctl with flag UFFDIO_COPY_MODE_WP, where > > we'll install a non-zero page from user buffer but don't grant write bit. > > > > From code-wise, I think it can be done currently with this on loongarch: > > > > pte_wrprotect(pte_mkwrite(pte_mkdirty(pte))) > > > > Where pte_wrprotect(pte_mkwrite(pte)) is not a no-op but applying MODIFIED. > > We would like to note that on LoongArch (for misunderstanding naming): > * _PAGE_DIRTY meaning hardware writable. > * _PAGE_WRITE meaning software writable. > * _PAGE_MODIFIED meaning software dirty, this page contains updated valid data. > > PTE APIs: > * pte_mkwrite: Allow to write, only needs set _PAGE_WRITE. > * pte_mkdirty: Mark as dirty, only needs set _PAGE_MODIFIED. > * pte_dirty: Test is dirty, only test _PAGE_MODIFIED. > * pte_wrprotect: Clear both writable, force to raise exception to > handle_mm_fault. > > If a pte is only set _PAGE_WRITE without _PAGE_DIRTY by pte_mkwrite, > then a write memory access will cause mmu exception, and the > (_PAGE_DIRTY|_PAGE_MODIFIED) will be set in this exception handler. I > think the _PAGE_DIRTY is also possible to set in pte_mkwrite for > speedup, then _PAGE_MODIFIED must be set at the same time. To avoid > the page data being modified but not detected by pte_dirty. (Current > code may needs to fix > > pte_mkdirty mark pte as dirty is the main function, It can also make > pte writeable by hardware(_PAGE_DIRTY) for speedup (too) if and only > if the pte is writable(_PAGE_WRITE). (mkdirty sets _PAGE_DIRTY > unconditionally is the root cause of the huge page COW issue. Yes, and that's why Huacai's patch can fix this issue for Loongarch, but sparc64 still suffers from it so far. > > For write-protection, pte_wrprotect will clear both writable(software > and hardware) in pte to force a MMU exception to handle_mm_fault. > > So yeah, the pte marked as dirty(_PAGE_MODIFIED) and without any > writable in the following code: > > pte_wrprotect(pte_mkwrite(pte_mkdirty(pte))) Thanks for the further explanation, Hev. I think so far I keep the questioning on whether it's a good optimization to apply _DIRTY in pte_mkdirty here. Since we have pte_mkwrite() API after all, even if there's an optimization IMHO it should be done by the kernel generic code already, I don't immediately see why it needs to be arch-specific. IOW, I'm not sure whether such optimization for loongarch will bring any real performance benefit? The thing is, generic mm code should already have called pte_mkwrite() along with pte_mkdirty() when proper, so _DIRTY will be properly applied even if pte_mkdirty() only applies _MODIFIED in loongarch code without extra mmu faults - if you see the code base that's really the major cases, except a few very special conditions where we want to only set _MODIFIED but may want to keep the pte read-only, but in that case it's never wise to set _DIRTY in pte_mkdirty anyway. I just don't know whether there's anything else I could have overlooked, so maybe removing "set _DIRTY" in pte_mkdirty() will regress in some other way. For now, I think I'll go ahead and try to propose another trivial fix for the pmd split generic code so we'll have the pte_mkdirty back (I think I need to reorder pte_wrprotect() so that it needs to appear after pte_mkdirty()) but hopefully also work for sparc64; loongarch will also benefit if before this patch lands. (Sorry to be off-topic on this thread) -- Peter Xu