From: "Huang, Ying"
To: "zhangpeng (AS)"
Cc: Yin Fengwei
Subject: Re: [RFC PATCH] mm: filemap: avoid unnecessary major faults in filemap_fault()
Date: Fri, 24 Nov 2023 12:26:36 +0800
Message-ID: <87ttpb7p4z.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87y1en7pq3.fsf@yhuang6-desk2.ccr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

"Huang, Ying" writes:

> "zhangpeng (AS)" writes:
>
>> On 2023/11/23 13:26, Yin Fengwei wrote:
>>
>>> On 11/23/23 12:12, zhangpeng (AS) wrote:
>>>> On 2023/11/23 9:09, Yin Fengwei wrote:
>>>>
>>>>> Hi Peng,
>>>>>
>>>>> On 11/22/23 22:00, Peng Zhang wrote:
>>>>>> From: ZhangPeng
>>>>>>
>>>>>> A major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>>> in an application, which led to an unexpected performance issue [1].
>>>>>>
>>>>>> This is caused by the pte being temporarily cleared during a
>>>>>> read/modify/write update of the pte, e.g. in
>>>>>> do_numa_page()/change_pte_range().
>>>>>>
>>>>>> For the data segment of a user-mode program, the global variable area
>>>>>> is a private mapping. After the page cache is loaded, a private anonymous
>>>>>> page is generated once COW is triggered. mlockall() can lock COW pages
>>>>>> (anonymous pages), but the original file pages cannot be locked and may
>>>>>> be reclaimed. If the global variable (private anon page) is accessed while
>>>>>> vmf->pte is zeroed in a NUMA fault, a file page fault will be triggered.
>>>>>>
>>>>>> At this time, the original private file page may have been reclaimed.
>>>>>> If the page cache is not available at this time, a major fault will be
>>>>>> triggered and the file will be read, causing additional overhead.
>>>>>>
>>>>>> Fix this by rechecking the pte while holding the ptl in filemap_fault()
>>>>>> before triggering a major fault.
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
>>>>>>
>>>>>> Signed-off-by: ZhangPeng
>>>>>> Signed-off-by: Kefeng Wang
>>>>>> ---
>>>>>>  mm/filemap.c | 14 ++++++++++++++
>>>>>>  1 file changed, 14 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>>> index 71f00539ac00..bb5e6a2790dc 100644
>>>>>> --- a/mm/filemap.c
>>>>>> +++ b/mm/filemap.c
>>>>>> @@ -3226,6 +3226,20 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>>>>>>              mapping_locked = true;
>>>>>>          }
>>>>>>      } else {
>>>>>> +        pte_t *ptep = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>>>> +                          vmf->address, &vmf->ptl);
>>>>>> +        if (ptep) {
>>>>>> +            /*
>>>>>> +             * Recheck pte with ptl locked as the pte can be cleared
>>>>>> +             * temporarily during a read/modify/write update.
>>>>>> +             */
>>>>>> +            if (unlikely(!pte_none(ptep_get(ptep))))
>>>>>> +                ret = VM_FAULT_NOPAGE;
>>>>>> +            pte_unmap_unlock(ptep, vmf->ptl);
>>>>>> +            if (unlikely(ret))
>>>>>> +                return ret;
>>>>>> +        }
>>>>> I am curious. Did you try not taking the PTL here and just checking
>>>>> whether the PTE is not NONE?
>>>> Thank you for your reply.
>>>>
>>>> If we don't take the PTL, the current use case won't trigger this issue either.
>>> Is this verified by testing or just in theory?
>>
>> If we add a delay between ptep_modify_prot_start() and ptep_modify_prot_commit(),
>> this issue will also trigger. Without the delay, we haven't reproduced this
>> problem so far.
>>
>>>> In most cases, if we don't take the PTL, this issue won't be triggered. However,
>>>> there is still a possibility of triggering it. The corner case is that
>>>> task 2 triggers a page fault when task 1 is between ptep_modify_prot_start()
>>>> and ptep_modify_prot_commit() in do_numa_page(). Furthermore, task 2 passes the
>>>> check of whether the PTE is not NONE, without taking the PTL, before task 1
>>>> updates the PTE in ptep_modify_prot_commit().
>>> There are very few operations between ptep_modify_prot_start() and
>>> ptep_modify_prot_commit(), while the code path from the page fault to this
>>> check is long. My understanding is that it's very likely the PTE is not NONE
>>> when doing the PTE check here without holding the PTL (This is my theory. :)).
>>
>> Yes, there is a high probability that this issue won't occur without taking the PTL.
>>
>>> On the other hand, acquiring/releasing the PTL may have a performance impact.
>>> It may not be a big deal because of the I/O operations in this code path. But
>>> it's better to collect some performance data IMHO.
>>
>> We tested the performance of file private mapping page fault (page_fault2.c of
>> will-it-scale [1]) and file shared mapping page fault (page_fault3.c of will-it-scale).
>> The difference in performance (in operations per second) before and after the
>> patch is applied is about 0.7% on an x86 physical machine.
>
> Is it an improvement or a reduction?

And I think that you need to test ramdisk cases too, to verify whether
this will cause a performance regression, and by how much.

--
Best Regards,
Huang, Ying

> --
> Best Regards,
> Huang, Ying
>
>> [1] https://github.com/antonblanchard/will-it-scale/tree/master
>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>
>>>>>> +
>>>>>>          /* No page in the page cache at all */
>>>>>>          count_vm_event(PGMAJFAULT);
>>>>>>          count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
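
For anyone collecting the fault numbers asked for above, here is a minimal
user-space sketch of one way to observe major and minor fault counts around a
file private mapping workload. It is not part of the patch and not taken from
will-it-scale; the helper names read_fault_counts() and
touch_private_file_mapping(), and the /tmp path, are assumptions made only for
illustration. It maps a temporary file MAP_PRIVATE, writes one byte to every
page so that each page takes a COW fault, and reports the ru_majflt and
ru_minflt deltas from getrusage().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type: snapshot of this process's fault counters. */
struct fault_counts {
	long major;	/* faults that required I/O (ru_majflt) */
	long minor;	/* faults served without I/O (ru_minflt) */
};

/* Read the current counters; both fields are -1 if getrusage() fails. */
static struct fault_counts read_fault_counts(void)
{
	struct rusage ru;
	struct fault_counts fc = { -1, -1 };

	if (getrusage(RUSAGE_SELF, &ru) == 0) {
		fc.major = ru.ru_majflt;
		fc.minor = ru.ru_minflt;
	}
	return fc;
}

/*
 * Hypothetical helper: map a temporary file MAP_PRIVATE, write one byte
 * to every page (each write should take a COW fault) and store the fault
 * deltas in *delta. Returns 0 on success, -1 on error.
 */
static int touch_private_file_mapping(size_t npages, struct fault_counts *delta)
{
	char path[] = "/tmp/pf-sketch-XXXXXX";	/* assumed location */
	long page = sysconf(_SC_PAGESIZE);
	struct fault_counts before, after;
	size_t len, i;
	char *p;
	int fd;

	if (page <= 0 || npages == 0)
		return -1;
	len = npages * (size_t)page;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);	/* the file disappears once fd is closed */

	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	before = read_fault_counts();
	for (i = 0; i < npages; i++)
		p[i * (size_t)page] = 1;	/* first write to each page */
	after = read_fault_counts();

	delta->major = after.major - before.major;
	delta->minor = after.minor - before.minor;

	munmap(p, len);
	close(fd);
	return 0;
}
```

Calling touch_private_file_mapping(64, &d) from a small main() and printing
d.major and d.minor before and after applying the patch gives a rough view of
the fault counts. Changing the assumed path so that the temporary file lives on
a ramdisk or on a disk-backed filesystem would be one way to compare the two
cases; throughput comparisons still need a proper benchmark.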