From: Yang Shi <yang.shi@linux.alibaba.com>
To: akpm@linux-foundation.org
Cc: yang.shi@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 0/8] Drop mmap_sem during unmapping large map
Date: Wed, 21 Mar 2018 05:31:18 +0800
Message-Id: <1521581486-99134-1-git-send-email-yang.shi@linux.alibaba.com>
Background:

Recently, when we ran some vm scalability tests on machines with large
memory, we hit a couple of mmap_sem scalability issues when unmapping a
large memory space; please refer to https://lkml.org/lkml/2017/12/14/733
and https://lkml.org/lkml/2018/2/20/576. akpm then suggested unmapping
a large mapping section by section and dropping mmap_sem in between to
mitigate the issue (see https://lkml.org/lkml/2018/3/6/784). This
series of patches adopts that suggestion.

Approach:

A couple of approaches were explored.

#1. Unmap the large map section by section in vm_munmap(). It works,
but only sys_munmap() can benefit from the change.

#2. Do the unmapping deeper in the call chain, i.e. in zap_pmd_range().
This way no magic section size has to be defined for unmapping. But
there are two major issues:

  * mmap_sem may have been acquired by either down_write() or
    down_read() on the possible call paths, so every call path has to
    be audited to determine which variant (_write or _read) to use.
    This increases the complexity significantly.

  * The race condition below might be introduced:

         CPU A                      CPU B
       ----------                 ----------
       do_munmap
         zap_pmd_range
           up_write                 do_munmap
                                      down_write
                                      ......
                                      remove_vma_list
                                      up_write
           down_write
         access vmas  <-- use-after-free bug

    Also, unmapping by section requires splitting vmas, so the code
    would have to deal with partially unmapped vmas, which again
    increases the complexity significantly.

#3. Do it in do_munmap(). There the split vma / unmap region / free
page tables / free vmas sequence can be kept atomic for every section,
and not only sys_munmap() benefits, but also mremap and SysV shm. The
only problem is that some call paths may not want mmap_sem dropped, so
an extra parameter, called "atomic", is introduced to do_munmap():
passing "true" tells do_munmap() not to drop mmap_sem, passing "false"
allows dropping it. Since all callers of do_munmap() acquire mmap_sem
with down_write(), only that one variant needs to be handled. When
re-acquiring mmap_sem, plain down_write() is used for now, since
dealing with the return value of down_write_killable() sounds
unnecessary. Besides this, a magic section size has to be defined
explicitly; HPAGE_PUD_SIZE is used for now. According to my tests,
HPAGE_PUD_SIZE sounds good enough, while a smaller size looks to have
too much overhead; this is also why down_write() is used for
re-acquiring mmap_sem instead of down_write_killable(). A sketch of
the resulting loop appears after the performance data below.

Regression and performance data:

Tests were run on a machine with 32 cores of E5-2680 @ 2.70GHz and
384GB memory.

The full LTP suite was run; no regression was found.

Measurement of SyS_munmap() execution time:

    size     pristine        patched         delta
    80GB     5008377 us      4905841 us      -2%
    160GB    9129243 us      9145306 us      +0.18%
    320GB    17915310 us     17990174 us     +0.42%

Throughput of page faults (#/s) with vm-scalability:

                         pristine    patched    delta
    mmap-pread-seq       554894      563517     +1.6%
    mmap-pread-seq-mt    581232      580772     -0.079%
    mmap-xread-seq-mt    99182       105400     +6.3%

Throughput of page faults (#/s) with the below stress-ng test:

    stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf --timeout 600s

    pristine    patched    delta
    100165      108396     +8.2%
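To make the shape of approach #3 concrete, here is a minimal sketch of
the sectioned unmap loop as it could sit in mm/mmap.c. Only
do_munmap_range() and the "atomic" parameter come from the series; the
MUNMAP_SECTION_SIZE macro and the exact loop structure below are
illustrative assumptions, not the actual patch:

    /*
     * Illustrative sketch only.  do_munmap_range() stands for the
     * single-shot unmap helper factored out by patch 1/8: it performs
     * the split vma / unmap region / free page tables / free vmas
     * sequence for the given range, with mmap_sem write-locked by the
     * caller, so the sequence stays atomic per section.
     */
    #define MUNMAP_SECTION_SIZE	HPAGE_PUD_SIZE

    int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
                  struct list_head *uf, bool atomic)
    {
            unsigned long end;

            len = PAGE_ALIGN(len);
            if (len == 0)
                    return -EINVAL;
            end = start + len;

            /* Small maps, and callers that must not drop mmap_sem. */
            if (atomic || len <= MUNMAP_SECTION_SIZE)
                    return do_munmap_range(mm, start, len, uf);

            while (start < end) {
                    unsigned long size = min(end - start,
                                             (unsigned long)MUNMAP_SECTION_SIZE);
                    int ret;

                    ret = do_munmap_range(mm, start, size, uf);
                    if (ret)
                            return ret;
                    start += size;

                    if (start < end) {
                            /*
                             * Give waiters (e.g. page faults in other
                             * threads) a chance to take the lock.  The
                             * vma tree may change while the lock is
                             * dropped; revalidation is omitted here.
                             */
                            up_write(&mm->mmap_sem);
                            down_write(&mm->mmap_sem);
                    }
            }
            return 0;
    }

The point of sectioning at the do_munmap() level is that every section
goes through the full split/unmap/free sequence before mmap_sem is
released, so no other thread ever observes a partially unmapped
section.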
There are 8 patches in this series:

1/8:       introduce the "atomic" parameter and do_munmap_range(), and
           modify do_munmap() to call do_munmap_range() to unmap memory
           section by section
2/8 - 6/8: modify the do_munmap() call sites in mm/mmap.c, mm/mremap.c,
           fs/proc/vmcore.c, ipc/shm.c and mm/nommu.c to adopt the
           "atomic" parameter
7/8 - 8/8: modify the do_munmap() call sites in arch/x86 to adopt the
           "atomic" parameter

Yang Shi (8):
  mm: mmap: unmap large mapping by section
  mm: mmap: pass atomic parameter to do_munmap() call sites
  mm: mremap: pass atomic parameter to do_munmap()
  mm: nommu: add atomic parameter to do_munmap()
  ipc: shm: pass atomic parameter to do_munmap()
  fs: proc/vmcore: pass atomic parameter to do_munmap()
  x86: mpx: pass atomic parameter to do_munmap()
  x86: vma: pass atomic parameter to do_munmap()

 arch/x86/entry/vdso/vma.c |  2 +-
 arch/x86/mm/mpx.c         |  2 +-
 fs/proc/vmcore.c          |  4 ++--
 include/linux/mm.h        |  2 +-
 ipc/shm.c                 |  9 ++++++---
 mm/mmap.c                 | 48 ++++++++++++++++++++++++++++++++++++++++++------
 mm/mremap.c               | 10 ++++++----
 mm/nommu.c                |  5 +++--
 8 files changed, 62 insertions(+), 20 deletions(-)
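For illustration, a typical call-site conversion in patches 2/8 - 8/8
would look like the following (hypothetical hunk; which value each
actual call site passes is decided per patch):

    /* before */
    ret = do_munmap(mm, addr, len, &uf);

    /* after: pass true where dropping mmap_sem must not happen,
     * false where it is tolerable (e.g. the sys_munmap() path). */
    ret = do_munmap(mm, addr, len, &uf, true);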