Date: Tue, 24 Jul 2018 12:26:54 +0300
From: "Kirill A. Shutemov"
To: Yang Shi
Cc: mhocko@kernel.org, willy@infradead.org, ldufour@linux.vnet.ibm.com,
    akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC v5 PATCH 2/2] mm: mmap: zap pages with read mmap_sem in munmap
Message-ID: <20180724092653.rr66rq32dyaob5tc@kshutemo-mobl1>
References: <1531956101-8526-1-git-send-email-yang.shi@linux.alibaba.com>
 <1531956101-8526-3-git-send-email-yang.shi@linux.alibaba.com>
In-Reply-To: <1531956101-8526-3-git-send-email-yang.shi@linux.alibaba.com>
User-Agent: NeoMutt/20180622

On Thu, Jul 19, 2018 at 07:21:41AM +0800, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
>
> INFO: task ps:14018 blocked for more than 120 seconds.
>       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> ps              D    0 14018      1 0x00000004
>  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> Call Trace:
>  [] ? __schedule+0x250/0x730
>  [] schedule+0x36/0x80
>  [] rwsem_down_read_failed+0xf0/0x150
>  [] call_rwsem_down_read_failed+0x18/0x30
>  [] down_read+0x20/0x40
>  [] proc_pid_cmdline_read+0xd9/0x4e0
>  [] ? do_filp_open+0xa5/0x100
>  [] __vfs_read+0x37/0x150
>  [] ? security_file_permission+0x9b/0xc0
>  [] vfs_read+0x96/0x130
>  [] SyS_read+0x55/0xc0
>  [] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> It is because munmap holds mmap_sem exclusively from very beginning to
> all the way down to the end, and doesn't release it in the middle. When
> unmapping large mapping, it may take long time (take ~18 seconds to
> unmap 320GB mapping with every single page mapped on an idle machine).
>
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hocko [1], zapping pages can be done with holding
> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas.
>
> But, some part may need write mmap_sem, for example, vma splitting. So,
> the design is as follows:
>     acquire write mmap_sem
>     lookup vmas (find and split vmas)
>     detach vmas
>     deal with special mappings
>     downgrade_write
>
>     zap pages
>     free page tables
>     release mmap_sem
>
> The vm events with read mmap_sem may come in during page zapping, but
> since vmas have been detached before, they, i.e. page fault, gup, etc,
> will not be able to find valid vma, then just return SIGSEGV or -EFAULT
> as expected.
>
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, they are
> considered as special mappings. They will be dealt with before zapping
> pages with write mmap_sem held. Basically, just update vm_flags.
>
> And, since they are also manipulated by unmap_single_vma() which is
> called by unmap_vma() with read mmap_sem held in this case, to
> prevent from updating vm_flags in read critical section, a new
> parameter, called "skip_flags" is added to unmap_region(), unmap_vmas()
> and unmap_single_vma(). If it is true, then just skip unmap those
> special mappings. Currently, the only place which pass true to this
> parameter is us.
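
For reference, the lock handling described in the quoted design boils down to
roughly the sketch below (my simplification of the text above, not the exact
patch; error handling, the vma lookup/split helpers and the special-mapping
fixups are left out):

        if (down_write_killable(&mm->mmap_sem))
                return -EINTR;

        /* find/split vmas, detach them from the mm, fix up special mappings */

        /* write -> read, there is no window where the lock is fully dropped */
        downgrade_write(&mm->mmap_sem);

        /* zap pages and free page tables with only the read lock held */

        up_read(&mm->mmap_sem);
        return 0;
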
>
> With this approach we don't have to re-acquire mmap_sem again to clean
> up vmas to avoid race window which might get the address space changed.
>
> And, since the lock acquire/release cost is managed to the minimum and
> almost as same as before, the optimization could be extended to any size
> of mapping without incuring significan penalty to small mappings.
>
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
> intact for stability reason.
>
> With the patches, exclusive mmap_sem hold time when munmap a 80GB
> address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to
> us level from second.
>
> munmap_test-15002 [008]   594.380138: funcgraph_entry: |  vm_munmap_zap_rlock() {
> munmap_test-15002 [008]   594.380146: funcgraph_entry: !2485684 us |    unmap_region();
> munmap_test-15002 [008]   596.865836: funcgraph_exit:  !2485692 us |  }
>
> Here the excution time of unmap_region() is used to evaluate the time of
> holding read mmap_sem, then the remaining time is used with holding
> exclusive lock.
>
> [1] https://lwn.net/Articles/753269/
>
> Suggested-by: Michal Hocko
> Suggested-by: Kirill A. Shutemov
> Cc: Matthew Wilcox
> Cc: Laurent Dufour
> Cc: Andrew Morton
> Signed-off-by: Yang Shi
> ---
>  include/linux/mm.h |  2 +-
>  mm/memory.c        | 35 +++++++++++++------
>  mm/mmap.c          | 99 +++++++++++++++++++++++++++++++++++++++++++++++++-----
>  3 files changed, 117 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a0fbb9f..95a4e97 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1321,7 +1321,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
>                      unsigned long size);
>  void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> -                unsigned long start, unsigned long end);
> +                unsigned long start, unsigned long end, bool skip_flags);

skip_flags is not specific enough. Which flags? Maybe skip_vm_flags or
something.

>
>  /**
>   * mm_walk - callbacks for walk_page_range
> diff --git a/mm/memory.c b/mm/memory.c
> index 7206a63..00ecdae 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1514,7 +1514,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>  static void unmap_single_vma(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, unsigned long start_addr,
>                  unsigned long end_addr,
> -                struct zap_details *details)
> +                struct zap_details *details, bool skip_flags)
>  {
>          unsigned long start = max(vma->vm_start, start_addr);
>          unsigned long end;
> @@ -1525,11 +1525,13 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>          if (end <= vma->vm_start)
>                  return;
>
> -        if (vma->vm_file)
> -                uprobe_munmap(vma, start, end);
> +        if (!skip_flags) {
> +                if (vma->vm_file)
> +                        uprobe_munmap(vma, start, end);
>
> -        if (unlikely(vma->vm_flags & VM_PFNMAP))
> -                untrack_pfn(vma, 0, 0);
> +                if (unlikely(vma->vm_flags & VM_PFNMAP))
> +                        untrack_pfn(vma, 0, 0);
> +        }
>
>          if (start != end) {
>                  if (unlikely(is_vm_hugetlb_page(vma))) {
> @@ -1546,7 +1548,19 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>                   */
>                  if (vma->vm_file) {
>                          i_mmap_lock_write(vma->vm_file->f_mapping);
> -                        __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
> +                        if (!skip_flags)
> +                                /*
> +                                 * The vma is being unmapped with read
> +                                 * mmap_sem.
> +                                 * Can't update vm_flags, it will be
> +                                 * updated later with exclusive lock
> +                                 * held
> +                                 */

Later? When? Don't we run this after mmap_sem is downgraded to read?

And please wrap it in {}.
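
Something along these lines (my sketch, only to illustrate the brace
placement, not a tested hunk; the comment wording would also need the fix
mentioned above, since vm_flags are updated before the downgrade):

                        if (!skip_flags) {
                                /*
                                 * The vma is being unmapped with the read
                                 * mmap_sem held; vm_flags have already been
                                 * updated under the write lock.
                                 */
                                __unmap_hugepage_range(tlb, vma, start, end, NULL);
                        } else {
                                __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
                        }
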
> +                                __unmap_hugepage_range(tlb, vma, start,
> +                                                       end, NULL);
> +                        else
> +                                __unmap_hugepage_range_final(tlb, vma,
> +                                                             start, end, NULL);
>                          i_mmap_unlock_write(vma->vm_file->f_mapping);
>                  }
>          } else
> @@ -1574,13 +1588,14 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>   */
>  void unmap_vmas(struct mmu_gather *tlb,
>                  struct vm_area_struct *vma, unsigned long start_addr,
> -                unsigned long end_addr)
> +                unsigned long end_addr, bool skip_flags)
>  {
>          struct mm_struct *mm = vma->vm_mm;
>
>          mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
>          for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> -                unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> +                unmap_single_vma(tlb, vma, start_addr, end_addr, NULL,
> +                                 skip_flags);
>          mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
>  }
>
> @@ -1604,7 +1619,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>          update_hiwater_rss(mm);
>          mmu_notifier_invalidate_range_start(mm, start, end);
>          for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
> -                unmap_single_vma(&tlb, vma, start, end, NULL);
> +                unmap_single_vma(&tlb, vma, start, end, NULL, false);
>
>                  /*
>                   * zap_page_range does not specify whether mmap_sem should be
> @@ -1641,7 +1656,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>          tlb_gather_mmu(&tlb, mm, address, end);
>          update_hiwater_rss(mm);
>          mmu_notifier_invalidate_range_start(mm, address, end);
> -        unmap_single_vma(&tlb, vma, address, end, details);
> +        unmap_single_vma(&tlb, vma, address, end, details, false);
>          mmu_notifier_invalidate_range_end(mm, address, end);
>          tlb_finish_mmu(&tlb, address, end);
>  }
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2504094..f5d5312 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -73,7 +73,7 @@
>
>  static void unmap_region(struct mm_struct *mm,
>                  struct vm_area_struct *vma, struct vm_area_struct *prev,
> -                unsigned long start, unsigned long end);
> +                unsigned long start, unsigned long end, bool skip_flags);
>
>  /* description of effects of mapping type and prot in current implementation.
>   * this is due to the limited x86 page protection hardware. The expected
> @@ -1824,7 +1824,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>          fput(file);
>
>          /* Undo any partial mapping done by a device driver. */
> -        unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
> +        unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end, false);
>          charged = 0;
>          if (vm_flags & VM_SHARED)
>                  mapping_unmap_writable(file->f_mapping);
> @@ -2559,7 +2559,7 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
>   */
>  static void unmap_region(struct mm_struct *mm,
>                  struct vm_area_struct *vma, struct vm_area_struct *prev,
> -                unsigned long start, unsigned long end)
> +                unsigned long start, unsigned long end, bool skip_flags)
>  {
>          struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
>          struct mmu_gather tlb;
> @@ -2567,7 +2567,7 @@ static void unmap_region(struct mm_struct *mm,
>          lru_add_drain();
>          tlb_gather_mmu(&tlb, mm, start, end);
>          update_hiwater_rss(mm);
> -        unmap_vmas(&tlb, vma, start, end);
> +        unmap_vmas(&tlb, vma, start, end, skip_flags);
>          free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
>                          next ? next->vm_start : USER_PGTABLES_CEILING);
>          tlb_finish_mmu(&tlb, start, end);
> @@ -2778,6 +2778,79 @@ static inline void munmap_mlock_vma(struct vm_area_struct *vma,
>          }
>  }
>
> +/*
> + * Zap pages with read mmap_sem held
> + *
> + * uf is the list for userfaultfd
> + */
> +static int do_munmap_zap_rlock(struct mm_struct *mm, unsigned long start,
> +                               size_t len, struct list_head *uf)
> +{
> +        unsigned long end = 0;
> +        struct vm_area_struct *start_vma = NULL, *prev, *vma;
> +        int ret = 0;
> +
> +        if (!munmap_addr_sanity(start, len))
> +                return -EINVAL;
> +
> +        len = PAGE_ALIGN(len);
> +
> +        end = start + len;
> +
> +        /*
> +         * need write mmap_sem to split vmas and detach vmas
> +         * splitting vma up-front to save PITA to clean if it is failed

Please fix all the comments to have consistent style.

> +         */
> +        if (down_write_killable(&mm->mmap_sem))
> +                return -EINTR;
> +
> +        ret = munmap_lookup_vma(mm, &start_vma, &prev, start, end);
> +        if (ret != 1)
> +                goto out;
> +
> +        if (unlikely(uf)) {
> +                ret = userfaultfd_unmap_prep(start_vma, start, end, uf);
> +                if (ret)
> +                        goto out;
> +        }
> +
> +        /* Handle mlocked vmas */
> +        if (mm->locked_vm)
> +                munmap_mlock_vma(start_vma, end);
> +
> +        /* Detach vmas from rbtree */
> +        detach_vmas_to_be_unmapped(mm, start_vma, prev, end);
> +
> +        /*
> +         * Clear uprobe, VM_PFNMAP and hugetlb mapping in advance since they
> +         * need update vm_flags with write mmap_sem
> +         */
> +        vma = start_vma;
> +        for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
> +                if (vma->vm_file)
> +                        uprobe_munmap(vma, vma->vm_start, vma->vm_end);
> +                if (unlikely(vma->vm_flags & VM_PFNMAP))
> +                        untrack_pfn(vma, 0, 0);
> +                if (is_vm_hugetlb_page(vma))
> +                        vma->vm_flags &= ~VM_MAYSHARE;
> +        }
> +
> +        downgrade_write(&mm->mmap_sem);
> +
> +        /* zap mappings with read mmap_sem */
> +        unmap_region(mm, start_vma, prev, start, end, true);
> +
> +        arch_unmap(mm, start_vma, start, end);
> +        remove_vma_list(mm, start_vma);
> +        up_read(&mm->mmap_sem);
> +
> +        return 0;
> +
> +out:
> +        up_write(&mm->mmap_sem);
> +        return ret;
> +}
> +
>  /* Munmap is split into 2 main parts -- this part which finds
>   * what needs doing, and the areas themselves, which do the
>   * work. This now handles partial unmappings.
> @@ -2826,7 +2899,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>           * Remove the vma's, and unmap the actual pages
>           */
>          detach_vmas_to_be_unmapped(mm, vma, prev, end);
> -        unmap_region(mm, vma, prev, start, end);
> +        unmap_region(mm, vma, prev, start, end, false);
>
>          arch_unmap(mm, vma, start, end);
>
> @@ -2836,6 +2909,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>          return 0;
>  }
>
> +static int vm_munmap_zap_rlock(unsigned long start, size_t len)
> +{
> +        int ret;
> +        struct mm_struct *mm = current->mm;
> +        LIST_HEAD(uf);
> +
> +        ret = do_munmap_zap_rlock(mm, start, len, &uf);
> +        userfaultfd_unmap_complete(mm, &uf);
> +        return ret;
> +}
> +
>  int vm_munmap(unsigned long start, size_t len)
>  {
>          int ret;
> @@ -2855,10 +2939,9 @@ int vm_munmap(unsigned long start, size_t len)
>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>  {
>          profile_munmap(addr);
> -        return vm_munmap(addr, len);
> +        return vm_munmap_zap_rlock(addr, len);
>  }
>
> -
>  /*
>   * Emulation of deprecated remap_file_pages() syscall.
>   */
> @@ -3146,7 +3229,7 @@ void exit_mmap(struct mm_struct *mm)
>          tlb_gather_mmu(&tlb, mm, 0, -1);
>          /* update_hiwater_rss(mm) here? but nobody should be looking */
>          /* Use -1 here to ensure all VMAs in the mm are unmapped */
> -        unmap_vmas(&tlb, vma, 0, -1);
> +        unmap_vmas(&tlb, vma, 0, -1, false);
>          free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>          tlb_finish_mmu(&tlb, 0, -1);
>
> --
> 1.8.3.1
>

--
 Kirill A. Shutemov