From: Yang Shi <yang.shi@linux.alibaba.com>
To: akpm@linux-foundation.org
Cc: yang.shi@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
Date: Wed, 21 Mar 2018 05:31:19 +0800
Message-Id: <1521581486-99134-2-git-send-email-yang.shi@linux.alibaba.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1521581486-99134-1-git-send-email-yang.shi@linux.alibaba.com>
References: <1521581486-99134-1-git-send-email-yang.shi@linux.alibaba.com>

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the hung task issue below may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
      Tainted: G            E   4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ps              D    0 14018      1 0x00000004
 ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
 ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
 [] ? __schedule+0x250/0x730
 [] schedule+0x36/0x80
 [] rwsem_down_read_failed+0xf0/0x150
 [] call_rwsem_down_read_failed+0x18/0x30
 [] down_read+0x20/0x40
 [] proc_pid_cmdline_read+0xd9/0x4e0
 [] ? do_filp_open+0xa5/0x100
 [] __vfs_read+0x37/0x150
 [] ? security_file_permission+0x9b/0xc0
 [] vfs_read+0x96/0x130
 [] SyS_read+0x55/0xc0
 [] entry_SYSCALL_64_fastpath+0x1a/0xc5

This happens because munmap holds mmap_sem from the very beginning all
the way to the end without releasing it in the middle, and unmapping a
large mapping may take a long time (~18 seconds to unmap a 320GB mapping
with every single page mapped, on an idle machine).

Since unmapping does not require atomicity, unmap large mappings
(> HPAGE_PUD_SIZE) section by section, and release mmap_sem after each
HPAGE_PUD_SIZE section if mmap_sem is contended and the call path can
tolerate being interrupted.  This is controlled by "atomic", a newly
added parameter to do_munmap(); "false" means it is fine to unlock and
relock mmap_sem in the middle.

Not only munmap may benefit from this change, but also mremap and shm,
since they all call do_munmap() to do the real work.

Below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB of memory.

Measurement of SyS_munmap() execution time:

size     pristine        patched         delta
 80GB    5008377 us      4905841 us      -2%
160GB    9129243 us      9145306 us      +0.18%
320GB    17915310 us     17990174 us     +0.42%

Throughput of page faults (#/s) with vm-scalability:

                     pristine    patched    delta
mmap-pread-seq       554894      563517     +1.6%
mmap-pread-seq-mt    581232      580772     -0.079%
mmap-xread-seq-mt     99182      105400     +6.3%

Throughput of page faults (#/s) with the below stress-ng test:
stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf --timeout 600s

pristine    patched    delta
100165      108396     +8.2%

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/mm.h |  2 +-
 mm/mmap.c          | 40 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 39 insertions(+), 3 deletions(-)
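
Note (illustration only, not part of the patch): the SyS_munmap()
execution times above could presumably be reproduced with a minimal
userspace harness along the following lines -- mmap a large anonymous
region, fault in every page, then time the munmap() call.  The actual
test harness used for these numbers is not included in this mail, so
treat this as an assumed methodology sketch.

/* Hypothetical reproduction sketch: time munmap() of a fully faulted mapping. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
        /* Size in GB (e.g. 80, 160, 320); default to 1 for a quick run. */
        size_t gb = argc > 1 ? strtoul(argv[1], NULL, 0) : 1;
        size_t len = gb << 30;
        size_t page = sysconf(_SC_PAGESIZE);
        struct timespec t0, t1;

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Touch every page so the whole range is actually mapped. */
        for (size_t off = 0; off < len; off += page)
                p[off] = 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (munmap(p, len)) {
                perror("munmap");
                return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("munmap(%zu GB) took %ld us\n", gb,
               (t1.tv_sec - t0.tv_sec) * 1000000L +
               (t1.tv_nsec - t0.tv_nsec) / 1000);
        return 0;
}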
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..2e447d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2212,7 +2212,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
 	struct list_head *uf);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
-		     struct list_head *uf);
+		     struct list_head *uf, bool atomic);
 
 static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/mm/mmap.c b/mm/mmap.c
index 9efdc021..ad6ae7a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2632,8 +2632,8 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
-	      struct list_head *uf)
+static int do_munmap_range(struct mm_struct *mm, unsigned long start,
+			   size_t len, struct list_head *uf)
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
@@ -2733,6 +2733,42 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return 0;
 }
 
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *uf, bool atomic)
+{
+	int ret = 0;
+	size_t step = HPAGE_PUD_SIZE;
+
+	/*
+	 * unmap large mapping (> huge pud size) section by section
+	 * in order to give mmap_sem waiters a chance to acquire it.
+	 */
+	if (len <= step)
+		ret = do_munmap_range(mm, start, len, uf);
+	else {
+		do {
+			ret = do_munmap_range(mm, start, step, uf);
+			if (ret < 0)
+				break;
+
+			if (rwsem_is_contended(&mm->mmap_sem) && !atomic &&
+			    need_resched()) {
+				VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+				up_write(&mm->mmap_sem);
+				cond_resched();
+				down_write(&mm->mmap_sem);
+			}
+
+			start += step;
+			len -= step;
+			if (len <= step)
+				step = len;
+		} while (len > 0);
+	}
+
+	return ret;
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
-- 
1.8.3.1
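
Note (illustration only, not part of this patch): existing callers of
do_munmap() such as vm_munmap(), mremap and the SysV shm code still pass
only four arguments; presumably a later patch in this series updates them
for the new signature.  Assuming vm_munmap() keeps roughly its current
shape, the intended calling convention might look like the sketch below,
where "false" tells do_munmap() it may drop and retake mmap_sem between
sections, while a caller that cannot tolerate that would pass "true".

/* Hypothetical sketch of an updated caller; not taken from this series. */
int vm_munmap(unsigned long start, size_t len)
{
	int ret;
	struct mm_struct *mm = current->mm;
	LIST_HEAD(uf);

	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;

	/* Plain munmap() has no atomicity requirement, so pass atomic=false. */
	ret = do_munmap(mm, start, len, &uf, false);
	up_write(&mm->mmap_sem);
	userfaultfd_unmap_complete(mm, &uf);
	return ret;
}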