Subject: Re: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD
From: Qian Cai
Date: Wed, 11 Sep 2019 15:42:54 -0400
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Alexander Viro, Mike Kravetz,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, Davidlohr Bueso
Message-Id: <70714929-2CE3-42F4-BD31-427077C9E24E@lca.pw>
In-Reply-To: <1a8e6c0a-6ba6-d71f-974e-f8a9c623c25b@redhat.com>
References: <20190911150537.19527-1-longman@redhat.com>
    <20190911150537.19527-6-longman@redhat.com>
    <1a8e6c0a-6ba6-d71f-974e-f8a9c623c25b@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

> On Sep 11, 2019, at 12:34 PM, Waiman Long wrote:
>
> On 9/11/19 5:01 PM, Qian Cai wrote:
>>
>>> On Sep 11, 2019, at 11:05 AM, Waiman Long wrote:
>>>
>>> When allocating a large amount of static hugepages (~500-1500GB) on a
>>> system with large number of CPUs (4, 8 or even 16 sockets), performance
>>> degradation (random multi-second delays) was observed when thousands
>>> of processes are trying to fault in the data into the huge pages. The
>>> likelihood of the delay increases with the number of sockets and hence
>>> the CPUs a system has. This only happens in the initial setup phase
>>> and will be gone after all the necessary data are faulted in.
>>>
>>> These random delays, however, are deemed unacceptable. The cause of
>>> that delay is the long wait time in acquiring the mmap_sem when trying
>>> to share the huge PMDs.
>>>
>>> To remove the unacceptable delays, we have to limit the amount of wait
>>> time on the mmap_sem. So the new down_write_timedlock() function is
>>> used to acquire the write lock on the mmap_sem with a timeout value of
>>> 10ms which should not cause a perceivable delay. If timeout happens,
>>> the task will abandon its effort to share the PMD and allocate its own
>>> copy instead.
>>>
>>> When too many timeouts happens (threshold currently set at 256), the
>>> system may be too large for PMD sharing to be useful without undue delay.
>>> So the sharing will be disabled in this case.
>>>
>>> Signed-off-by: Waiman Long
>>> ---
>>> include/linux/fs.h |  7 +++++++
>>> mm/hugetlb.c       | 24 +++++++++++++++++++++---
>>> 2 files changed, 28 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 997a530ff4e9..e9d3ad465a6b 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -40,6 +40,7 @@
>>>  #include
>>>  #include
>>>  #include
>>> +#include
>>>
>>>  #include
>>>  #include
>>> @@ -519,6 +520,12 @@ static inline void i_mmap_lock_write(struct address_space *mapping)
>>>          down_write(&mapping->i_mmap_rwsem);
>>>  }
>>>
>>> +static inline bool i_mmap_timedlock_write(struct address_space *mapping,
>>> +                                          ktime_t timeout)
>>> +{
>>> +        return down_write_timedlock(&mapping->i_mmap_rwsem, timeout);
>>> +}
>>> +
>>>  static inline void i_mmap_unlock_write(struct address_space *mapping)
>>>  {
>>>          up_write(&mapping->i_mmap_rwsem);
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 6d7296dd11b8..445af661ae29 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -4750,6 +4750,8 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>>>          }
>>>  }
>>>
>>> +#define PMD_SHARE_DISABLE_THRESHOLD     (1 << 8)
>>> +
>>>  /*
>>>   * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
>>>   * and returns the corresponding pte. While this is not necessary for the
>>> @@ -4770,11 +4772,24 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>>          pte_t *spte = NULL;
>>>          pte_t *pte;
>>>          spinlock_t *ptl;
>>> +        static atomic_t timeout_cnt;
>>>
>>> -        if (!vma_shareable(vma, addr))
>>> -                return (pte_t *)pmd_alloc(mm, pud, addr);
>>> +        /*
>>> +         * Don't share if it is not sharable or locking attempt timed out
>>> +         * after 10ms. After 256 timeouts, PMD sharing will be permanently
>>> +         * disabled as it is just too slow.
>> It looks like this kind of policy interacts with kernel debug options like KASAN (which is going to slow the system down
>> anyway) could introduce tricky issues due to different timings on a debug kernel.
>
> With respect to lockdep, down_write_timedlock() works like a trylock. So
> a lot of checking will be skipped. Also the lockdep code won't be run
> until the lock is acquired. So its execution time has no effect on the
> timeout.

Not only lockdep, but also things like KASAN, debug_pagealloc, page_poison,
kmemleak, debug objects, etc. are all going to slow things down in
huge_pmd_share(). That makes it tricky to pick the right timeout value for
those debug kernels without changing the previous behavior.
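
To make the concern concrete, here is a rough userspace model in C of the
policy as the patch comment describes it. It is only a sketch and not the
kernel code: struct share_model, try_share() and run_attempts() are
hypothetical names, lock waits are plain integers rather than real rwsem
contention, and the exact comparisons are my assumptions. Only the 10ms
timeout and the threshold of 256 come from the patch.

```c
#include <stdbool.h>

#define PMD_SHARE_DISABLE_THRESHOLD (1 << 8) /* from the patch: 256 timeouts */
#define LOCK_TIMEOUT_MS 10                   /* from the patch: 10ms timeout */

/* Hypothetical model state; the patch keeps a static atomic_t timeout_cnt. */
struct share_model {
	int timeout_cnt;
};

/*
 * Model one sharing attempt. wait_ms is how long the caller would have to
 * wait for the write lock. Returns true if the PMD would be shared, false
 * if the caller gives up and allocates its own copy.
 */
bool try_share(struct share_model *m, int wait_ms)
{
	/* Assumption: once the threshold is hit, sharing stays disabled. */
	if (m->timeout_cnt >= PMD_SHARE_DISABLE_THRESHOLD)
		return false;

	/* Assumption: a wait longer than the timeout counts as a timeout. */
	if (wait_ms > LOCK_TIMEOUT_MS) {
		m->timeout_cnt++;
		return false;
	}
	return true;
}

/*
 * Run n attempts, each waiting base_wait_ms * slowdown, where slowdown
 * stands in for the overhead of debug options. Returns how many attempts
 * ended up sharing.
 */
int run_attempts(struct share_model *m, int n, int base_wait_ms, int slowdown)
{
	int shared = 0;

	for (int i = 0; i < n; i++)
		if (try_share(m, base_wait_ms * slowdown))
			shared++;
	return shared;
}
```

With a base wait of 4ms and a slowdown of 1, all 1000 attempts in this model
share. With a slowdown of 3 the same workload waits 12ms each time, hits 256
timeouts, and sharing is turned off for good, so a debug kernel would behave
differently from before even though the workload is unchanged.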