Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp7554342rwd; Tue, 6 Jun 2023 12:30:40 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ6TUmAVqWtak90OdWTK265p7L7Ut0aFARLb8HdD6JIRoJUl9CkwEEp45UIht/XwEd3cRp3V X-Received: by 2002:ad4:5bea:0:b0:625:b517:6dd0 with SMTP id k10-20020ad45bea000000b00625b5176dd0mr697523qvc.8.1686079839862; Tue, 06 Jun 2023 12:30:39 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1686079839; cv=none; d=google.com; s=arc-20160816; b=jrkq2EtrpfYAc/cX9rQSdi4h3ZT5HC7dFy+dOziBykamYi5tmhp7AtcwZe5QnKAkRw iK/QSTOUXXWWJ1IZJ9osO3EgHwlWgF2G+ByB13ZTy1Wlh/U1puOIVAhCwf+4KMkMnM9t 1oP0TenqyeVIncxQah6+8aW5OAXSTb0U3zi/udYOwSylhjAjsBD2+tjAgqqBsLA8iDcK Z8aYMXE/T4dJmVsU1+txReX0jLkowT2VedpUlQKZ8Ux/Ja+dHp1oFeiw9FUi+OdzM81F lwg+qF8e38dEgX/NpP1c0Efy2iJUqcocvYog2L5sWUOC8Wqwa7Qa8rxusdyTPvoJ2G2f HH/w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=6V4VnQ+NwlTLpp1UfRmPdcygP5cp3PVVXwMSQln1Mw4=; b=fYR05LyS0VFaiHpsXD3WcXTyVE6GOKB32RuPgPdwyqvMIqI++PrX4NQ4u58IGUDM6Q iA1mRkX+GnPqzCVGYKfsQGUXOh8JI0MF2oxzggIOc7pi2cSFsPTiZCKsvVyyPpO6qhgu C4KcoTbILeJ65fPPQWe8GBvNHKhD2XKAvoQqcaEwN5r9WFWdHoVmdhvVDRrnZyWDqbPg mks3fGpeSsqnwPeEnfQ5tjk57H/dJULjb01xA3GYNDQEdR7fKEwBwE9+WlsbEVidRL+o nEr+F2EBf945Nxm1327oO9adIESQ0Hbk6H1/RrGENfh0KXXSXowIMtSyJTW907Z/3f34 5+lw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=huHRTZhm; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id e19-20020a05620a12d300b0075dabfdefb4si3516785qkl.331.2023.06.06.12.30.25; Tue, 06 Jun 2023 12:30:39 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20221208 header.b=huHRTZhm; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239478AbjFFTNm (ORCPT + 99 others); Tue, 6 Jun 2023 15:13:42 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45132 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239436AbjFFTNT (ORCPT ); Tue, 6 Jun 2023 15:13:19 -0400 Received: from mail-pj1-x102c.google.com (mail-pj1-x102c.google.com [IPv6:2607:f8b0:4864:20::102c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 66D601BD7 for ; Tue, 6 Jun 2023 12:12:15 -0700 (PDT) Received: by mail-pj1-x102c.google.com with SMTP id 98e67ed59e1d1-2568caabfbfso3223395a91.3 for ; Tue, 06 Jun 2023 12:12:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1686078735; x=1688670735; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=6V4VnQ+NwlTLpp1UfRmPdcygP5cp3PVVXwMSQln1Mw4=; b=huHRTZhmyVde5umjaMPBXE9BqvehtfkERNX0+20lrgbb+8BZmowdyr2pLmm/hBeKqj SXskEgPUvfeQg9JvqFPSuMluR8H4EJB7LzMVVw+tekJ49ehCm0xmpS2aHsvURYkf7hbQ nHNrQOAeNuseR6C5I6RVw3E3+iSfhpG7uCpaWTEUkMhH338GO1B7+ejPK6h+8DWZDhfH bKHBTGbLMVCH29lRdVlWC7QLN1boxnVeAk2gA9lRKEfS+Nsj+m3PqpbzYsRkul3tDm5M r4Y/Pg2yjy2IgTq3EKGfiPj69z5p07fNLL1s+jJeFHBuKgUu1FbdkxQUuE5myqzXEmic AzBQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1686078735; x=1688670735; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=6V4VnQ+NwlTLpp1UfRmPdcygP5cp3PVVXwMSQln1Mw4=; b=kedxGxn1WHUldLaEDPdd/UV33oBa9DozNQFobyl9EhimxnoGQ3A+YTP/f1zIJZfkwa ssvUzvzdweeJqW82xIxbLy6M8lL4kN5uD/KtTSgo8l/9fvs9oSsB0lMCULF9QYT76wSW wdzNgDNqjC8AuYJlMTT2Q0NDzXo7n93s5flr7hkSIoRcWbs5Mc15AaQNXd1+56BL/8sO QszrKTjajFkuv6TtrbGF6HTJFaLS1rjBlZeni4O9Tu35UcicH35eqt/oaUTHOP9O6KnY OhnQ65e5XdHsFJsScLT7zXPfC/1XQe2HGULV6TtfWXsq178+WnXPZ9munOl8KAnQiS08 tbqA== X-Gm-Message-State: AC+VfDyVHLGEc15UuHOoGULrMK48S/RV+dXz5Hk8nPlDzqyd/atMp00w XN064d9adbGqOAHfucpSqNhRFdms4T03S/W9Dm/rqGSn X-Received: by 2002:a17:90a:10c9:b0:259:6a29:ef1c with SMTP id b9-20020a17090a10c900b002596a29ef1cmr1545385pje.5.1686078734710; Tue, 06 Jun 2023 12:12:14 -0700 (PDT) MIME-Version: 1.0 References: <20230602230552.350731-1-peterx@redhat.com> <20230602230552.350731-5-peterx@redhat.com> In-Reply-To: From: Yang Shi Date: Tue, 6 Jun 2023 12:12:03 -0700 Message-ID: Subject: Re: [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry To: Peter Xu Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, David Hildenbrand , Alistair Popple , Andrew Morton , Andrea Arcangeli , "Kirill A . Shutemov" , Johannes Weiner , John Hubbard , Naoya Horiguchi , Muhammad Usama Anjum , Hugh Dickins , Mike Rapoport Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_ENVFROM_END_DIGIT, FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jun 5, 2023 at 12:20=E2=80=AFPM Peter Xu wrote: > > On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote: > > On Fri, Jun 2, 2023 at 4:06=E2=80=AFPM Peter Xu wro= te: > > > > > > For most of the page walk paths, logically it'll always be good to ha= ve the > > > pmd retries if hit pmd_trans_unstable() race. We can treat it as non= e > > > pmd (per comment above pmd_trans_unstable()), but in most cases we're= not > > > even treating that as a none pmd. If to fix it anyway, a retry will = be the > > > most accurate. > > > > > > I've went over all the pmd_trans_unstable() special cases and this pa= tch > > > should cover all the rest places where we should retry properly with > > > unstable pmd. With the newly introduced ACTION_AGAIN since 2020 we c= an > > > easily achieve that. > > > > > > These are the call sites that I think should be fixed with it: > > > > > > *** fs/proc/task_mmu.c: > > > smaps_pte_range[634] if (pmd_trans_unstable(pmd)) > > > clear_refs_pte_range[1194] if (pmd_trans_unstable(pmd)) > > > pagemap_pmd_range[1542] if (pmd_trans_unstable(pmdp)) > > > gather_pte_stats[1891] if (pmd_trans_unstable(pmd)) > > > *** mm/memcontrol.c: > > > mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd= )) > > > mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd)) > > > *** mm/memory-failure.c: > > > hwpoison_pte_range[794] if (pmd_trans_unstable(pmdp)) > > > *** mm/mempolicy.c: > > > queue_folios_pte_range[517] if (pmd_trans_unstable(pmd)) > > > *** mm/madvise.c: > > > madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd)) > > > madvise_free_pte_range[625] if (pmd_trans_unstable(pmd)) > > > > > > IIUC most of them may or may not be a big issue even without a retry, > > > either because they're already not strict (smaps, pte_stats, MADV_COL= D, > > > .. it can mean e.g. the statistic may be inaccurate or one less 2M ch= unk to > > > cold worst case), but some of them could have functional error withou= t the > > > retry afaiu (e.g. pagemap, where we can have the output buffer shifte= d over > > > the unstable pmd range.. so IIUC the pagemap result can be wrong). > > > > > > While these call sites all look fine, and don't need any change: > > > > > > *** include/linux/pgtable.h: > > > pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_= unstable(pmd); > > > *** mm/gup.c: > > > follow_pmd_mask[695] if (pmd_trans_unstable(pmd)) > > > *** mm/mapping_dirty_helpers.c: > > > wp_clean_pmd_entry[131] if (!pmd_trans_unstable(&pmdval)) > > > *** mm/memory.c: > > > do_anonymous_page[4060] if (unlikely(pmd_trans_unstable(vmf->p= md))) > > > *** mm/migrate_device.c: > > > migrate_vma_insert_page[616] if (unlikely(pmd_trans_unstable(pmdp))= ) > > > *** mm/mincore.c: > > > mincore_pte_range[116] if (pmd_trans_unstable(pmd)) { > > > > > > Signed-off-by: Peter Xu > > > --- > > > fs/proc/task_mmu.c | 17 +++++++++++++---- > > > mm/madvise.c | 8 ++++++-- > > > mm/memcontrol.c | 8 ++++++-- > > > mm/memory-failure.c | 4 +++- > > > mm/mempolicy.c | 4 +++- > > > 5 files changed, 31 insertions(+), 10 deletions(-) > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > index 6259dd432eeb..823eaba5c6bf 100644 > > > --- a/fs/proc/task_mmu.c > > > +++ b/fs/proc/task_mmu.c > > > @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned = long addr, unsigned long end, > > > goto out; > > > } > > > > > > - if (pmd_trans_unstable(pmd)) > > > + if (pmd_trans_unstable(pmd)) { > > > + walk->action =3D ACTION_AGAIN; > > > goto out; > > > + } > > > + > > > /* > > > * The mmap_lock held all the way back in m_start() is what > > > * keeps khugepaged out of here and from collapsing things > > > @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, un= signed long addr, > > > return 0; > > > } > > > > > > - if (pmd_trans_unstable(pmd)) > > > + if (pmd_trans_unstable(pmd)) { > > > + walk->action =3D ACTION_AGAIN; > > > return 0; > > > + } > > > > > > pte =3D pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > > for (; addr !=3D end; pte++, addr +=3D PAGE_SIZE) { > > > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsi= gned long addr, unsigned long end, > > > return err; > > > } > > > > > > - if (pmd_trans_unstable(pmdp)) > > > + if (pmd_trans_unstable(pmdp)) { > > > + walk->action =3D ACTION_AGAIN; > > > return 0; > > > > Had a quick look at the pagemap code, I agree with your analysis, > > "returning 0" may mess up pagemap, retry should be fine. But I'm > > wondering whether we should just fill in empty entries. Anyway I don't > > have a strong opinion on this, just a little bit concerned by > > potential indefinite retry. > > Yes, none pte is still an option. But if we're going to fix this anyway, > it seems better to fix it with the accurate new thing that poped up, and > it's even less change (just apply walk->action rather than doing random > stuff in different call sites). > > I see that you have worry on deadloop over this, so I hope to discuss > altogether here. > > Unlike normal checks, pmd_trans_unstable() check means something must hav= e > changed in the past very short period or it should just never if nothing > changed concurrently from under us, so it's not a "if (flag=3D=3Dtrue)" c= heck > which is even more likely to loop. > > If we see the places that I didn't touch, most of them suggested a retry = in > one form or another. So if there's a worry this will also not the first > time to do a retry (and for such a "unstable" API, that's really the most > natural thing to do which is to retry until it's stable). IIUC other than do_anonymous_page() suggests retry (retry page fault), others may not, for example: - follow_pmd_mask: return -EBUSY - wp_clean_pmd_entry: actually just retry for pmd_none case, but the pagewalk code does handle pmd_none by skipping it, so it basically just retry once - min_core_pte_range: treated as unmapped range by calling __mincore_unmapped_range Anyway I really don't have a strong opinion on this. I may be just over-concerned. I just thought if nobody cares whether the result is accurate or not, why do we bother fixing those cases? > > So in general, it seems to me if we deadloop over pmd_trans_unstable() fo= r > whatever reason then something more wrong could have happened.. > > Thanks, > > -- > Peter Xu >