Subject: Re: [RFC v5 0/2] mm: zap pages with read mmap_sem in munmap for large mapping
From: Yang Shi <yang.shi@linux.alibaba.com>
To: mhocko@kernel.org, willy@infradead.org, ldufour@linux.vnet.ibm.com, kirill@shutemov.name, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Mon, 23 Jul 2018 15:00:05 -0700
In-Reply-To: <1531956101-8526-1-git-send-email-yang.shi@linux.alibaba.com>

Hi folks,

Any comment on this version?

Thanks,
Yang

On 7/18/18 4:21 PM, Yang Shi wrote:
> Background:
> Recently, when we ran some vm scalability tests on machines with large memory,
> we ran into a couple of mmap_sem scalability issues when unmapping a large
> memory space; please refer to https://lkml.org/lkml/2017/12/14/733 and
> https://lkml.org/lkml/2018/2/20/576.
>
>
> History:
> akpm suggested unmapping a large mapping section by section, dropping
> mmap_sem in between, to mitigate the issue (see
> https://lkml.org/lkml/2018/3/6/784).
>
> The v1 patch series was submitted to the mailing list per Andrew's suggestion
> (see https://lkml.org/lkml/2018/3/20/786), and I received a lot of great
> feedback and suggestions.
>
> This topic was then discussed at the LSFMM summit 2018, where Michal Hocko
> suggested (as he also did in the v1 review) trying a "two phases" approach:
> zap pages with read mmap_sem, then do the cleanup with write mmap_sem (for
> the discussion details, see https://lwn.net/Articles/753269/).
>
>
> Approach:
> Zapping pages is the most time-consuming part. According to the suggestion
> from Michal Hocko [1], zapping pages can be done while holding read mmap_sem,
> like what MADV_DONTNEED does; write mmap_sem is then re-acquired to clean up
> the vmas.
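>
> For reference, this is roughly the locking shape of the MADV_DONTNEED path
> that the suggestion builds on. This is a simplified sketch of the 4.18-era
> code, with the madvise bookkeeping and most error handling elided; it is not
> the literal mm/madvise.c source:
>
>     static int madvise_dontneed_sketch(struct mm_struct *mm,
>                                        unsigned long start, size_t len)
>     {
>             struct vm_area_struct *vma;
>
>             /* Only the read side of mmap_sem is taken... */
>             down_read(&mm->mmap_sem);
>             vma = find_vma(mm, start);
>             /* ...and the pages are zapped while holding it. */
>             if (vma)
>                     zap_page_range(vma, start, len);
>             up_read(&mm->mmap_sem);
>             return vma ? 0 : -ENOMEM;
>     }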
>
> But we can't call MADV_DONTNEED directly, since there are two major drawbacks:
>   * The unexpected state seen by a page fault that wins the race in the
>     middle of munmap: it may return the zero page instead of the content
>     or a SIGSEGV.
>   * It can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings,
>     which is a showstopper per akpm.
>
> And some parts may still need write mmap_sem, for example, vma splitting.
> So the design is as follows:
>     acquire write mmap_sem
>     lookup vmas (find and split vmas)
>     detach vmas
>     deal with special mappings
>     downgrade_write
>
>     zap pages
>     free page tables
>     release mmap_sem
>
> VM events that take read mmap_sem (i.e. page fault, gup, etc.) may come in
> during page zapping, but since the vmas have been detached beforehand, they
> will not be able to find a valid vma and will just return SIGSEGV or -EFAULT
> as expected.
>
> Vmas with VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobes are considered
> special mappings. They are dealt with before zapping pages, with write
> mmap_sem held; basically, just their vm_flags are updated.
>
> And since those flags are also manipulated by unmap_single_vma(), which is
> called by unmap_vmas() with read mmap_sem held in this case, a new parameter
> called "skip_flags" is added to unmap_region(), unmap_vmas() and
> unmap_single_vma() to prevent vm_flags from being updated in the read-side
> critical section. If it is true, unmapping of those special mappings is
> simply skipped. Currently, the new munmap path is the only caller that
> passes true (see the sketch below).
>
> With this approach we don't have to re-acquire write mmap_sem to clean up
> the vmas, which avoids a race window in which the address space might be
> changed.
>
> And since the lock acquire/release cost is kept to a minimum, almost the
> same as before, the optimization can be extended to mappings of any size
> without incurring a significant penalty on small mappings.
>
> For the time being, this is only done in the munmap syscall path. Other
> vm_munmap() or do_munmap() call sites (i.e. mmap, mremap, etc.) remain
> intact for stability reasons.
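>
> Putting it together, the flow looks roughly like the sketch below. The
> helper names split_and_detach_vmas(), zap_detached_vmas() and
> free_detached_pgtables() are invented for illustration; they are not the
> functions the patches actually add:
>
>     static int munmap_zap_rlock_sketch(struct mm_struct *mm,
>                                        unsigned long start, size_t len)
>     {
>             LIST_HEAD(detached);
>
>             /* Phase 1: everything that modifies the vma tree runs with
>              * write mmap_sem held: find and split the boundary vmas,
>              * unlink the covered vmas, and fix up the vm_flags of
>              * VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings. */
>             if (down_write_killable(&mm->mmap_sem))
>                     return -EINTR;
>             split_and_detach_vmas(mm, start, len, &detached);
>             downgrade_write(&mm->mmap_sem);
>
>             /* Phase 2: the expensive zap runs with only read mmap_sem.
>              * Racing faults/gup can no longer find the detached vmas,
>              * so they return SIGSEGV/-EFAULT instead of a zero page. */
>             zap_detached_vmas(mm, &detached, start, len);
>             free_detached_pgtables(mm, &detached);
>             up_read(&mm->mmap_sem);
>             return 0;
>     }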
>
> Changelog:
> v4 -> v5:
>   * Detach vmas before zapping pages so that we don't have to use VM_DEAD
>     to mark a vma that is being unmapped, since the vmas have already been
>     detached from the rbtree by the time pages are zapped. Per Kirill
>   * Eliminated the VM_DEAD stuff
>   * With this change we don't have to re-acquire write mmap_sem to do the
>     cleanup, so we can eliminate a potential race window
>   * Eliminated the PUD_SIZE check and extended this optimization to all
>     sizes
>
> v3 -> v4:
>   * Extended check_stable_address_space() to check VM_DEAD as Michal
>     suggested
>   * Dealt with the vm_flags update of VM_LOCKED | VM_HUGETLB | VM_PFNMAP
>     and uprobe mappings with the exclusive lock held. The actual unmapping
>     is still done with read mmap_sem to address akpm's concern
>   * Cleaned up vmas by calling do_munmap() instead of carrying the vmas
>     over, to prevent a race condition, as Kirill suggested
>   * Extracted more common code
>   * Addressed some code cleanup comments from akpm
>   * Dropped the uprobe and arch-specific code; now all the changes are mm
>     only
>   * Still kept the PUD_SIZE threshold; if everyone thinks it is better to
>     extend to all sizes or a smaller size, I will remove it
>   * Made this optimization explicitly 64-bit only per akpm's suggestion
>
> v2 -> v3:
>   * Refactored the do_munmap() code to extract the common part per Peter's
>     suggestion
>   * Introduced the VM_DEAD flag per Michal's suggestion. Just handled
>     VM_DEAD in x86's page fault handler for the time being; other
>     architectures will be covered once the patch series is reviewed
>   * Now lookup vmas (find and split) and set the VM_DEAD flag with write
>     mmap_sem, then zap the mapping with read mmap_sem, then clean up
>     pgtables and vmas with write mmap_sem, per Peter's suggestion
>
> v1 -> v2:
>   * Re-implemented the code per the discussion at the LSFMM summit
>
>
> Regression and performance data:
> Ran the regression tests below with the threshold manually set to 4K in
> the code:
>   * Full LTP
>   * Trinity (munmap/all vm syscalls)
>   * Stress-ng: mmap/mmapfork/mmapfixed/mmapaddr/mmapmany/vm
>   * mm-tests: kernbench, phpbench, sysbench-mariadb, will-it-scale
>   * vm-scalability
>
> With the patches, the exclusive mmap_sem hold time when munmapping an 80GB
> address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped from
> the second level to the microsecond level.
>
> munmap_test-15002 [008]   594.380138: funcgraph_entry: |               vm_munmap_zap_rlock() {
> munmap_test-15002 [008]   594.380146: funcgraph_entry: !2485684 us |     unmap_region();
> munmap_test-15002 [008]   596.865836: funcgraph_exit:  !2485692 us |   }
>
> Here the execution time of unmap_region() is used to estimate the time
> spent holding read mmap_sem; the remaining time is spent holding the
> exclusive lock. A sketch of a reproducer for this measurement follows
> below, after the diffstat.
>
> Yang Shi (2):
>       mm: refactor do_munmap() to extract the common part
>       mm: mmap: zap pages with read mmap_sem in munmap
>
>  include/linux/mm.h |   2 +-
>  mm/memory.c        |  35 +++++++++++++-----
>  mm/mmap.c          | 219 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------
>  3 files changed, 199 insertions(+), 57 deletions(-)
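>
> For completeness: the munmap_test harness used for the trace above is not
> part of this series. A minimal stand-in, illustrative only and assuming a
> machine with enough memory for the 80GB mapping, could look like this:
>
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/mman.h>
>
>     #define MAP_LEN (80UL << 30)    /* 80GB, matching the test above */
>
>     int main(void)
>     {
>             /* Reserve a large anonymous mapping. */
>             char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
>             if (p == MAP_FAILED) {
>                     perror("mmap");
>                     return 1;
>             }
>             /* Touch every page so munmap() has real zapping work to do. */
>             memset(p, 1, MAP_LEN);
>             /* This munmap() is what the funcgraph trace above measures. */
>             if (munmap(p, MAP_LEN)) {
>                     perror("munmap");
>                     return 1;
>             }
>             return 0;
>     }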