Date: Tue, 3 Apr 2018 16:37:18 -0400
From: Jerome Glisse <jglisse@redhat.com>
To: Laurent Dufour
Cc: paulmck@linux.vnet.ibm.com, peterz@infradead.org, akpm@linux-foundation.org,
	kirill@shutemov.name, ak@linux.intel.com, mhocko@kernel.org, dave@stgolabs.net,
	jack@suse.cz, Matthew Wilcox, benh@kernel.crashing.org, mpe@ellerman.id.au,
	paulus@samba.org, Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
	Sergey Senozhatsky, Andrea Arcangeli, Alexei Starovoitov, kemi.wang@intel.com,
	sergey.senozhatsky.work@gmail.com, Daniel Jordan, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com,
	npiggin@gmail.com, bsingharora@gmail.com, Tim Chen,
	linuxppc-dev@lists.ozlabs.org, x86@kernel.org
Subject: Re: [PATCH v9 00/24] Speculative page faults
Message-ID: <20180403203718.GE5935@redhat.com>
References: <1520963994-28477-1-git-send-email-ldufour@linux.vnet.ibm.com>
In-Reply-To: <1520963994-28477-1-git-send-email-ldufour@linux.vnet.ibm.com>

On Tue, Mar 13, 2018 at 06:59:30PM +0100, Laurent Dufour wrote:
> This is a port on kernel 4.16 of the work done by Peter Zijlstra to
> handle page faults without holding the mm semaphore [1].
>
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> processes, since the page fault handler will no longer wait for other
> threads' memory layout changes to complete, assuming that such a change
> happens in another part of the process's memory space. This type of page
> fault is named a speculative page fault. If the speculative page fault
> fails, because a concurrent change is detected or because the underlying
> PMD or PTE tables are not yet allocated, the speculative handling is
> aborted and a classic page fault is then tried.
>
> The speculative page fault (SPF) has to look for the VMA matching the
> fault address without holding the mmap_sem. This is done by introducing
> a rwlock which protects access to the mm_rb tree. Previously this was
> done using SRCU, but it introduced a lot of scheduling to process the
> VMAs' freeing operation, which hurt performance by 20% as reported by
> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree limits
> the locking contention to these operations, which are expected to be
> O(log n). In addition, to ensure that the VMA is not freed behind our
> back, a reference count is added and two services (get_vma() and
> put_vma()) are introduced to handle it. When a VMA is fetched from the
> RB tree using get_vma(), it must later be released using put_vma().
> Furthermore, to allow the VMA to be used again by the classic page fault
> handler, a service, can_reuse_spf_vma(), is introduced. It is expected
> to be called with the mmap_sem held; it checks that the VMA still
> matches the specified address and releases its reference count. Since
> the mmap_sem is held, the VMA is guaranteed not to be freed behind our
> back. In general, the VMA's reference count can be decremented while
> holding the mmap_sem, but it should not be incremented, as holding the
> mmap_sem already ensures that the VMA is stable. I can no longer see the
> overhead I previously got with the will-it-scale benchmark.
>
> The VMA's attributes checked during the speculative page fault
> processing have to be protected against parallel changes. This is done
> by using a per-VMA sequence lock. This sequence lock allows the
> speculative page fault handler to quickly check for parallel changes in
> progress and to abort the speculative page fault in that case.
>
> Once the VMA is found, the speculative page fault handler checks the
> VMA's attributes to verify whether the page fault can be handled this
> way or not.
> Thus the VMA is protected through a sequence lock which allows fast
> detection of concurrent VMA changes. If such a change is detected, the
> speculative page fault is aborted and a *classic* page fault is tried
> instead. VMA sequence locking is added wherever VMA attributes which
> are checked during the page fault are modified.
>
> When the PTE is fetched, the VMA is checked to see if it has been
> changed, so once the page table is locked the VMA is known to be valid.
> Any other change touching this PTE will need to lock the page table,
> so no parallel change is possible at this time.

What would have been nice is some high-level pseudo code before all the
above detailed description. Something like:

    speculative_fault(addr) {
        mm_lock_for_vma_snapshot()
        vma_snapshot = snapshot_vma_infos(addr)
        mm_unlock_for_vma_snapshot()
        ...
        if (!vma_can_speculatively_fault(vma_snapshot, addr))
            return;
        ...
        /* Do fault ie alloc memory, read from file ... */
        page = ...;

        preempt_disable();
        if (vma_snapshot_still_valid(vma_snapshot, addr) &&
            vma_pte_map_lock(vma_snapshot, addr)) {
            if (pte_same(ptep, orig_pte)) {
                /* Setup new pte */
                page = NULL;
            }
        }
        preempt_enable();
        if (page)
            put(page);
    }

I just find pseudo code easier for grasping the high-level view of the
expected code flow.

> The locking of the PTE is done with interrupts disabled; this allows
> checking the PMD to ensure that there is no ongoing collapsing
> operation. Since khugepaged first sets the PMD to pmd_none and then
> waits for the other CPUs to have caught the IPI interrupt, if the PMD
> is valid at the time the PTE is locked, we have the guarantee that the
> collapsing operation will have to wait on the PTE lock to move forward.
> This allows the SPF handler to map the PTE safely. If the PMD value is
> different from the one recorded at the beginning of the SPF operation,
> the classic page fault handler will be called to handle the operation
> while holding the mmap_sem.
> As the PTE lock is taken with interrupts disabled, the lock is taken
> using spin_trylock() to avoid a deadlock when handling a page fault
> while a TLB invalidate is requested by another CPU holding the PTE
> lock.
>
> Support for THP is not done, because when checking the PMD we can be
> confused by an in-progress collapsing operation done by khugepaged.
> The issue is that pmd_none() could be true either if the PMD is not
> yet populated or if the underlying PTEs are about to be collapsed. So
> we cannot safely allocate a PMD if pmd_none() is true.

Might be a good topic for LSF/MM: should we set the pmd to something
else than 0 when collapsing a pmd (and the same for pud)? This would
allow supporting THP.

[...]

>
> Ebizzy:
> -------
> The test counts the number of records per second it can manage; the
> higher the better. I run it as 'ebizzy -mTRp'. To get consistent
> results I repeated the test 100 times and measured the average. The
> number is the records processed per second; the higher the better.
>
>                      BASE       SPF        delta
> 16 CPUs x86 VM       14902.6    95905.16   543.55%
> 80 CPUs P8 node      37240.24   78185.67   109.95%

I find those results interesting, as it seems that SPF does not scale
well on big configurations. Note that it still shows a sizeable
improvement, so it is still a very interesting feature, I believe.
Still, understanding what is happening here might be a good idea. From
the numbers below it seems there are two causes for the scaling issue:
first, PTE lock contention (kind of expected, I guess); second, changes
to the VMA while faulting. Have you thought about this? Or do I read
those numbers the wrong way?
>
> Here are the performance counters read during a run on a 16 CPUs x86 VM:
> Performance counter stats for './ebizzy -mRTp':
>          888157      faults
>          884773      spf
>              92      pagefault:spf_pte_lock
>            2379      pagefault:spf_vma_changed
>               0      pagefault:spf_vma_noanon
>              80      pagefault:spf_vma_notsup
>               0      pagefault:spf_vma_access
>               0      pagefault:spf_pmd_changed
>
> And the ones captured during a run on an 80 CPUs Power node:
> Performance counter stats for './ebizzy -mRTp':
>          762134      faults
>          728663      spf
>           19101      pagefault:spf_pte_lock
>           13969      pagefault:spf_vma_changed
>               0      pagefault:spf_vma_noanon
>             272      pagefault:spf_vma_notsup
>               0      pagefault:spf_vma_access
>               0      pagefault:spf_pmd_changed

There is one aspect that I would like to see covered. Maybe I am not
understanding something fundamental, but it seems to me that SPF can
trigger OOM, or at the very least over-stress page allocation. Assume
you have a lot of concurrent SPFs to an anonymous vma and they all
allocate new pages; then you might over-allocate for a single address by
a factor correlated with the number of CPUs in your system. Now,
multiply this over several distinct addresses and you might be
allocating a lot of memory transiently, ie just for a short period of
time. The fact that you quickly free the pages when you fail should keep
the OOM reaper away, but this might still severely stress the memory
allocation path.

Am I missing something in how this all works? Or is the above something
that might be of concern? Should there be some bound on the maximum
number of concurrent SPFs (and thus a bound on the maximum transient
page allocation)?

Cheers,
Jérôme