Date: Mon, 2 Jul 2018 15:53:11 +0200
From: Michal Hocko
To: Yang Shi
Cc: willy@infradead.org, ldufour@linux.vnet.ibm.com, akpm@linux-foundation.org,
    peterz@infradead.org, mingo@redhat.com, acme@kernel.org,
    alexander.shishkin@linux.intel.com, jolsa@redhat.com, namhyung@kernel.org,
    tglx@linutronix.de, hpa@zytor.com, linux-mm@kvack.org, x86@kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: [RFC v3 PATCH 4/5] mm: mmap: zap pages with read mmap_sem for large mapping
Message-ID: <20180702135311.GY19043@dhcp22.suse.cz>
References: <1530311985-31251-1-git-send-email-yang.shi@linux.alibaba.com>
    <1530311985-31251-5-git-send-email-yang.shi@linux.alibaba.com>
In-Reply-To: <1530311985-31251-5-git-send-email-yang.shi@linux.alibaba.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat 30-06-18 06:39:44, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
>
> INFO: task ps:14018 blocked for more than 120 seconds.
> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> ps D 0 14018 1 0x00000004
> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> Call Trace:
> [] ? __schedule+0x250/0x730
> [] schedule+0x36/0x80
> [] rwsem_down_read_failed+0xf0/0x150
> [] call_rwsem_down_read_failed+0x18/0x30
> [] down_read+0x20/0x40
> [] proc_pid_cmdline_read+0xd9/0x4e0
> [] ? do_filp_open+0xa5/0x100
> [] __vfs_read+0x37/0x150
> [] ? security_file_permission+0x9b/0xc0
> [] vfs_read+0x96/0x130
> [] SyS_read+0x55/0xc0
> [] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).
>
> Zapping pages is the most time consuming part, according to the
> suggestion from Michal Hock [1], zapping pages can be done with holding

s@Hock@Hocko@

> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
> mmap_sem to cleanup vmas.
> All zapped vmas will have VM_DEAD flag set,
> the page fault to VM_DEAD vma will trigger SIGSEGV.

This really deserves an explanation of why the whole dance is needed. It
would also be good to mention how you achieve the overall consistency.
E.g. you are dropping mmap_sem and then re-taking it for write. What if
a pending write lock succeeds in between and modifies the address space?
Does it matter? If not, why?

> Define large mapping size thresh as PUD size or 1GB, just zap pages with
> read mmap_sem for mappings which are >= thresh value.
>
> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, then just
> fallback to regular path since unmapping those mappings need acquire
> write mmap_sem.
>
> For the time being, just do this in munmap syscall path. Other
> vm_munmap() or do_munmap() call sites remain intact for stability
> reason.

What are those stability reasons?

> The below is some regression and performance data collected on a machine
> with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.
>
> With the patched kernel, write mmap_sem hold time is dropped to us level
> from second.

I haven't read through the implementation carefully TBH, but the
changelog needs quite some work to explain the solution and the
resulting semantics of munmap after the change.

-- 
Michal Hocko
SUSE Labs
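
For readers following the thread, the eligibility rule stated in the quoted changelog (take the read-lock zap path only for mappings >= 1GB that carry none of VM_LOCKED | VM_HUGETLB | VM_PFNMAP) could be sketched in plain user-space C roughly as below. This is not code from the patch: the function name, the SKETCH_* macros and their bit values are all made up for illustration and do not match the kernel's VM_* definitions, and the uprobe check mentioned in the changelog is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits only; these are NOT the kernel's VM_* values. */
#define SKETCH_VM_LOCKED  (1u << 0)
#define SKETCH_VM_HUGETLB (1u << 1)
#define SKETCH_VM_PFNMAP  (1u << 2)

/* Threshold from the changelog: PUD size or 1GB. */
#define SKETCH_LARGE_MAPPING_THRESH ((uint64_t)1 << 30)

/*
 * Hypothetical helper: returns true when a mapping of 'len' bytes with
 * the given flags would take the "zap pages with read mmap_sem" path as
 * the changelog describes it, false when it falls back to the regular
 * path. The uprobe condition is omitted from this sketch.
 */
static bool sketch_zap_with_read_lock(uint64_t len, unsigned int flags)
{
	const unsigned int fallback = SKETCH_VM_LOCKED | SKETCH_VM_HUGETLB |
				      SKETCH_VM_PFNMAP;

	if (len < SKETCH_LARGE_MAPPING_THRESH)
		return false;	/* below the threshold: regular path */
	if (flags & fallback)
		return false;	/* these mappings need write mmap_sem */
	return true;
}
```

Under these assumptions a 320GB anonymous mapping would qualify, while a 2GB hugetlb mapping or a single 4KB page would not. The sketch covers only the decision; it says nothing about the consistency question raised above, i.e. what happens between dropping and re-taking mmap_sem.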