From: Laurent Dufour
Subject: Re: [PATCH v7 00/24] Speculative page faults
To: Andrew Morton
Cc: paulmck@linux.vnet.ibm.com, peterz@infradead.org, kirill@shutemov.name,
    ak@linux.intel.com, mhocko@kernel.org, dave@stgolabs.net, jack@suse.cz,
    Matthew Wilcox, benh@kernel.crashing.org, mpe@ellerman.id.au,
    paulus@samba.org, Thomas Gleixner, Ingo Molnar, hpa@zytor.com,
    Will Deacon, Sergey Senozhatsky, Andrea Arcangeli, Alexei Starovoitov,
    kemi.wang@intel.com, sergey.senozhatsky.work@gmail.com, Daniel Jordan,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com, npiggin@gmail.com,
    bsingharora@gmail.com, Tim Chen, linuxppc-dev@lists.ozlabs.org,
    x86@kernel.org
References: <1517935810-31177-1-git-send-email-ldufour@linux.vnet.ibm.com>
    <20180208125301.99445c91979343756e4cca9b@linux-foundation.org>
In-Reply-To: <20180208125301.99445c91979343756e4cca9b@linux-foundation.org>
Date: Tue, 13 Feb 2018 08:56:31 +0100
X-Mailing-List: linux-kernel@vger.kernel.org

On 08/02/2018 21:53, Andrew Morton wrote:
> On Tue, 6 Feb 2018 17:49:46 +0100 Laurent Dufour wrote:
>
>> This is a port to kernel 4.15 of the work done by Peter Zijlstra to
>> handle page faults without holding the mm semaphore [1].
>>
>> The idea is to try to handle user space page faults without holding the
>> mmap_sem. This should allow better concurrency for massively threaded
>> processes, since the page fault handler will not wait for other threads'
>> memory layout changes to complete, assuming that those changes are made
>> in another part of the process's memory space. This type of page fault
>> is named a speculative page fault. If the speculative page fault fails
>> because a concurrent change is detected or because the underlying PMD or
>> PTE tables are not yet allocated, its processing is abandoned and a
>> classic page fault is then tried.
>>
>> The speculative page fault (SPF) has to look for the VMA matching the
>> fault address without holding the mmap_sem. This is done by introducing
>> a rwlock which protects the access to the mm_rb tree.
>> Previously this was done using SRCU, but it introduced a lot of
>> scheduling to process the VMA freeing operations, which hurt
>> performance by 20% as reported by Kemi Wang [2]. Using a rwlock to
>> protect access to the mm_rb tree limits the locking contention to these
>> operations, which are expected to be of O(log n) order. In addition, to
>> ensure that the VMA is not freed behind our back, a reference count is
>> added and 2 services (get_vma() and put_vma()) are introduced to handle
>> the reference count. When a VMA is fetched from the RB tree using
>> get_vma(), it must later be released using put_vma(). Furthermore, to
>> allow the VMA to be used again by the classic page fault handler, a
>> service is introduced: can_reuse_spf_vma(). This service is expected to
>> be called with the mmap_sem held. It checks that the VMA still matches
>> the specified address and releases its reference count; as the mmap_sem
>> is held, it is ensured that the VMA will not be freed behind our back.
>> In general, the VMA's reference count could be decremented when holding
>> the mmap_sem but it should not be increased, as holding the mmap_sem
>> ensures that the VMA is stable. I can no longer see the overhead I got
>> with the will-it-scale benchmark.
>>
>> The VMA's attributes checked during the speculative page fault
>> processing have to be protected against parallel changes. This is done
>> by using a per VMA sequence lock. This sequence lock allows the
>> speculative page fault handler to quickly check for parallel changes in
>> progress and to abort the speculative page fault in that case.
>>
>> Once the VMA is found, the speculative page fault handler checks the
>> VMA's attributes to verify whether the page fault can be handled
>> correctly or not. Thus the VMA is protected through a sequence lock
>> which allows fast detection of concurrent VMA changes.
>> If such a change is detected, the speculative page fault is aborted and
>> a *classic* page fault is tried. VMA sequence locking is added wherever
>> VMA attributes which are checked during the page fault are modified.
>>
>> When the PTE is fetched, the VMA is checked to see if it has been
>> changed, so once the page table is locked, the VMA is valid. Any other
>> change leading to touching this PTE will need to lock the page table,
>> so no parallel change is possible at this time.
>>
>> The locking of the PTE is done with interrupts disabled, which allows
>> checking the PMD to ensure that there is not an ongoing collapsing
>> operation. Since khugepaged first sets the PMD to pmd_none and then
>> waits for the other CPUs to have caught the IPI interrupt, if the pmd
>> is valid at the time the PTE is locked, we have the guarantee that the
>> collapsing operation will have to wait on the PTE lock to move forward.
>> This allows the SPF handler to map the PTE safely. If the PMD value is
>> different from the one recorded at the beginning of the SPF operation,
>> the classic page fault handler will be called to handle the operation
>> while holding the mmap_sem. As the PTE lock is taken with interrupts
>> disabled, the lock is taken using spin_trylock() to avoid a deadlock
>> when handling a page fault while a TLB invalidate is requested by
>> another CPU holding the PTE.
>>
>> Support for THP is not done because when checking the PMD, we can be
>> confused by an in-progress collapsing operation done by khugepaged. The
>> issue is that pmd_none() could be true either if the PMD is not yet
>> populated or if the underlying PTEs are about to be collapsed. So we
>> cannot safely allocate a PMD if pmd_none() is true.
>>
>> This series builds on top of v4.15-mmotm-2018-01-31-16-51 and is
>> functional on x86 and PowerPC.
>
> One question which people will want to answer is "is this thing
> working".
> ie, how frequently does the code fall back to the regular
> heavyweight fault path.
>
> I see that trace events have been added for this, but the overall
> changelog doesn't describe them.  I think this material is important
> enough to justify including it here.

Got it, I'll detail the new perf and trace events here.

> Also, a few words to help people figure out how to gather these stats
> would be nice.  And maybe helper scripts if appropriate?

I'll provide some command line examples detailing how to capture those
events.

> I'm wondering if this info should even be presented via
> /proc/self/something, dunno.

My understanding is that this is part of the kernel ABI, so I was not
comfortable touching it, but if needed I could probably put some numbers
there.

> And it would be interesting to present the fallback frequency in the
> benchmark results.

Yes, these numbers are missing.

Here are the numbers I captured during a kernbench run on an 80-CPU Power
node:

      87549520      faults
             0      spf

This is expected, as kernbench's processes are not multithreaded.

When running ebizzy on the same node:

        711589      faults
        692649      spf
         10579      pagefault:spf_pte_lock
          7815      pagefault:spf_vma_changed
             0      pagefault:spf_vma_noanon
           417      pagefault:spf_vma_notsup
             0      pagefault:spf_vma_access
             0      pagefault:spf_pmd_changed

Here about 97% of the page faults were handled speculatively.

>
>> ------------------
>> Benchmarks results
>>
>> There is no functional change compared to the v6 so benchmark results are
>> the same.
>> Please see https://lkml.org/lkml/2018/1/12/515 for details.
>
> Please include this vitally important info in the [0/n], don't make
> people chase links.

Sorry, will do next time.

>
> And I'd really like to see some quantitative testing results for real
> workloads, not just a bunch of microbenchmarks.  Help us understand how
> useful this patchset is to our users.

We did unofficial runs using a "popular in-memory multithreaded database
product" on a 176-core SMT8 Power system, which showed a 30% improvement
in the number of transactions processed per second.
Here are the perf data captured during 2 of these runs:

                vanilla         spf
faults          89.418          101.364
spf             n/a             97.989

With the SPF kernel, most of the page faults were processed
speculatively.

Laurent.
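
For anyone who wants to check the fallback frequency from the ebizzy
counters quoted above, here is a minimal sketch in Python. It is not part
of the patch series: the helper name speculative_ratio() is hypothetical,
and it assumes that "spf" counts faults handled speculatively while
"faults" counts all page faults.

```python
def speculative_ratio(faults: int, spf: int) -> float:
    """Return the fraction of page faults that were handled speculatively."""
    if faults <= 0:
        raise ValueError("faults must be positive")
    return spf / faults


# Counters copied from the ebizzy run reported in this mail.
ebizzy = {"faults": 711589, "spf": 692649}

ratio = speculative_ratio(ebizzy["faults"], ebizzy["spf"])
# Faults that were not handled speculatively (assumed: classic path).
not_speculative = ebizzy["faults"] - ebizzy["spf"]

print(f"{ratio:.1%} speculative, {not_speculative} not speculative")
```

The same helper can be fed the counters of any other run, as long as both
values come from the same measurement.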