From: "Huang, Ying"
To: "zhangpeng (AS)"
Cc: Yin Fengwei
Subject: Re: [RFC PATCH] mm: filemap: avoid unnecessary major faults in filemap_fault()
In-Reply-To: <48235d73-3dc6-263d-7822-6d479b753d46@huawei.com> (zhangpeng's message of "Thu, 23 Nov 2023 15:57:44 +0800")
References: <20231122140052.4092083-1-zhangpeng362@huawei.com> <801bd0c9-7d0c-4231-93e5-7532e8231756@intel.com> <48235d73-3dc6-263d-7822-6d479b753d46@huawei.com>
Date: Fri, 24 Nov 2023 12:13:56 +0800
Message-ID: <87y1en7pq3.fsf@yhuang6-desk2.ccr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

"zhangpeng (AS)" writes:

> On 2023/11/23 13:26, Yin Fengwei wrote:
>
>> On 11/23/23 12:12, zhangpeng (AS) wrote:
>>> On 2023/11/23 9:09, Yin Fengwei wrote:
>>>
>>>> Hi Peng,
>>>>
>>>> On 11/22/23 22:00, Peng Zhang wrote:
>>>>> From: ZhangPeng
>>>>>
>>>>> A major fault occurs when an application uses
>>>>> mlockall(MCL_CURRENT | MCL_FUTURE), which leads to an unexpected
>>>>> performance issue[1].
>>>>>
>>>>> This is caused by the pte being temporarily cleared during a
>>>>> read/modify/write update of the pte, e.g. in
>>>>> do_numa_page()/change_pte_range().
>>>>>
>>>>> For the data segment of a user-mode program, the global variable area
>>>>> is a private mapping. After the page cache is loaded, a private
>>>>> anonymous page is generated once COW is triggered. mlockall() can lock
>>>>> COW pages (anonymous pages), but the original file pages cannot be
>>>>> locked and may be reclaimed. If the global variable (private anon page)
>>>>> is accessed while vmf->pte is zeroed in a NUMA fault, a file page fault
>>>>> will be triggered.
>>>>>
>>>>> At this time, the original private file page may have been reclaimed.
>>>>> If the page cache is not available at this time, a major fault will be
>>>>> triggered and the file will be read, causing additional overhead.
>>>>>
>>>>> Fix this by rechecking the pte while holding the ptl in filemap_fault()
>>>>> before triggering a major fault.
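For anyone who wants to watch the symptom from user space, here is a minimal
sketch. It does not reproduce the NUMA race described above; it only shows
how a program with a global in its data segment can call mlockall() and read
its own major-fault count through getrusage(). The names global_buf,
major_faults() and touch_locked_global() are made up for illustration and are
not taken from the report in [1].

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical global in the data segment: a private mapping of the binary. */
static char global_buf[1 << 16] = { 1 };

/* Return this process's major-fault count so far, or -1 on error. */
static long major_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}

/*
 * Lock current and future mappings, write to the global so that its pages
 * are private anonymous copies (COW) even if mlockall() failed, and return
 * how many major faults were counted meanwhile (-1 if getrusage() failed).
 */
static long touch_locked_global(void)
{
    long before = major_faults();

    /* Can fail with EPERM or ENOMEM under a low RLIMIT_MEMLOCK; carry on. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        fprintf(stderr, "mlockall: %s\n", strerror(errno));

    memset(global_buf, 0x5a, sizeof(global_buf));

    long after = major_faults();

    /* Drop the locks again so later allocations are not affected. */
    munlockall();

    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Sampling major_faults() periodically in a long-running mlockall()ed process
on a NUMA machine would be one way to notice unexpected major faults of the
kind the patch is aimed at.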
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
>>>>>
>>>>> Signed-off-by: ZhangPeng
>>>>> Signed-off-by: Kefeng Wang
>>>>> ---
>>>>>   mm/filemap.c | 14 ++++++++++++++
>>>>>   1 file changed, 14 insertions(+)
>>>>>
>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>> index 71f00539ac00..bb5e6a2790dc 100644
>>>>> --- a/mm/filemap.c
>>>>> +++ b/mm/filemap.c
>>>>> @@ -3226,6 +3226,20 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>>>>>               mapping_locked = true;
>>>>>           }
>>>>>       } else {
>>>>> +        pte_t *ptep = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>>> +                          vmf->address, &vmf->ptl);
>>>>> +        if (ptep) {
>>>>> +            /*
>>>>> +             * Recheck pte with ptl locked as the pte can be cleared
>>>>> +             * temporarily during a read/modify/write update.
>>>>> +             */
>>>>> +            if (unlikely(!pte_none(ptep_get(ptep))))
>>>>> +                ret = VM_FAULT_NOPAGE;
>>>>> +            pte_unmap_unlock(ptep, vmf->ptl);
>>>>> +            if (unlikely(ret))
>>>>> +                return ret;
>>>>> +        }
>>>> I am curious. Did you try not taking the PTL here and just checking
>>>> whether the PTE is not NONE?
>>> Thank you for your reply.
>>>
>>> If we don't take PTL, the current use case won't trigger this issue either.
>> Is this verified by testing or just in theory?
>
> If we add a delay between ptep_modify_prot_start() and
> ptep_modify_prot_commit(), this issue will also trigger. Without the delay,
> we haven't reproduced this problem so far.
>
>>> In most cases, if we don't take PTL, this issue won't be triggered. However,
>>> there is still a possibility of triggering this issue. The corner case is that
>>> task 2 triggers a page fault when task 1 is between ptep_modify_prot_start()
>>> and ptep_modify_prot_commit() in do_numa_page(). Furthermore, task 2 passes
>>> the check whether the PTE is not NONE before task 1 updates the PTE in
>>> ptep_modify_prot_commit() without taking PTL.
>> There are very few operations between ptep_modify_prot_start() and
>> ptep_modify_prot_commit(), while the code path from the page fault to this
>> check is long. My understanding is that it's very likely the PTE is not NONE
>> when doing the PTE check here without holding PTL (This is my theory. :)).
>
> Yes, there is a high probability that this issue won't occur without taking PTL.
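To make the corner case above concrete, here is a small user-space model of
the interleaving. It is only a sketch: struct toy_pte and every toy_*
function are hypothetical stand-ins, the "lock" is a plain flag, and the two
tasks are replayed step by step in a single thread rather than run
concurrently. It shows that a lockless check can observe the transient zero
entry, while a check made under the lock cannot.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for one PTE plus its lock; 0 plays the role of pte_none(). */
struct toy_pte {
    uint64_t val;
    bool ptl_held;
};

/* Take the toy lock; returns false where a real task would have to wait. */
static bool toy_ptl_trylock(struct toy_pte *p)
{
    if (p->ptl_held)
        return false;
    p->ptl_held = true;
    return true;
}

static void toy_ptl_unlock(struct toy_pte *p)
{
    p->ptl_held = false;
}

/* Models ptep_modify_prot_start(): caller holds the lock, entry is cleared. */
static uint64_t toy_modify_start(struct toy_pte *p)
{
    uint64_t old = p->val;

    p->val = 0;
    return old;
}

/* Models ptep_modify_prot_commit(): caller holds the lock, value written back. */
static void toy_modify_commit(struct toy_pte *p, uint64_t newval)
{
    p->val = newval;
}

/* Lockless check: can observe the transient zero. */
static bool toy_peek_is_none(const struct toy_pte *p)
{
    return p->val == 0;
}

/* Locked check: returns false if the lock is busy, else fills *is_none. */
static bool toy_locked_is_none(struct toy_pte *p, bool *is_none)
{
    if (!toy_ptl_trylock(p))
        return false;
    *is_none = (p->val == 0);
    toy_ptl_unlock(p);
    return true;
}

/* Replay the corner case; true means only the lockless check was fooled. */
static bool replay_corner_case(void)
{
    struct toy_pte pte = { .val = 0x1000, .ptl_held = false };
    bool none_locked = true;

    toy_ptl_trylock(&pte);                        /* task 1 takes the lock   */
    uint64_t old = toy_modify_start(&pte);        /* task 1: entry is now 0  */
    bool none_lockless = toy_peek_is_none(&pte);  /* task 2 peeks: sees none */
    bool got_lock = toy_locked_is_none(&pte, &none_locked); /* lock is busy  */
    toy_modify_commit(&pte, old | 1);             /* task 1 writes back      */
    toy_ptl_unlock(&pte);
    bool ok = toy_locked_is_none(&pte, &none_locked); /* task 2 after waiting */

    return none_lockless && !got_lock && ok && !none_locked;
}
```

In this model replay_corner_case() returns true. The model says nothing about
how often the window is actually hit on real hardware, which is the open
question in the exchange above.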
>
>> On the other hand, acquiring/releasing PTL may have a performance impact.
>> It may not be a big deal because of the I/O operations in this code path.
>> But it's better to collect some performance data IMHO.
>
> We tested the performance of file private mapping page fault (page_fault2.c of
> will-it-scale [1]) and file shared mapping page fault (page_fault3.c of
> will-it-scale). The difference in performance (in operations per second)
> before and after the patch is applied is about 0.7% on an x86 physical machine.

Is it an improvement or a reduction?

--
Best Regards,
Huang, Ying

> [1] https://github.com/antonblanchard/will-it-scale/tree/master
>
>>
>> Regards
>> Yin, Fengwei
>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>> +
>>>>>           /* No page in the page cache at all */
>>>>>           count_vm_event(PGMAJFAULT);
>>>>>           count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
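
On the sign question: one way to report the will-it-scale result without
ambiguity is as a signed relative change in operations per second. The
sketch below assumes the two throughput numbers are already available; the
helper names are hypothetical, and the figures in the note after the code
are invented for illustration and are not measurements from this thread.

```c
/* Signed relative change in percent; positive means more ops/s afterwards. */
static double relative_change_pct(double before_ops, double after_ops)
{
    return (after_ops - before_ops) / before_ops * 100.0;
}

/* Put a word on the sign so a report cannot be read both ways. */
static const char *change_direction(double pct)
{
    if (pct > 0.0)
        return "improvement";
    if (pct < 0.0)
        return "reduction";
    return "no change";
}
```

With made-up inputs of 1000000 ops/s before and 993000 ops/s after,
relative_change_pct() gives roughly -0.7 and change_direction() reports
"reduction".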