Received: by 2002:ac0:a5a7:0:0:0:0:0 with SMTP id m36-v6csp916623imm; Fri, 3 Aug 2018 14:03:13 -0700 (PDT) X-Google-Smtp-Source: AAOMgpf1XcqPMue5iuLayqW5QdoXPAW4LE3fWShMJnlFw8x8rkYY71viVPgPo8lvX2PqDB7N+nd4 X-Received: by 2002:a17:902:1665:: with SMTP id g92-v6mr5024951plg.293.1533330193863; Fri, 03 Aug 2018 14:03:13 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1533330193; cv=none; d=google.com; s=arc-20160816; b=X4Mgr0Mkam7941Zs8m2lakTZpZNoymrMiyFO9tWWI5wrmmBECxXCrlvnh107B/6mdF J3TZo3ioFPJaku+1Oojd6QDh+ZYJXuiWkMIAOyW4FWXiFOQtUiLvBu4nkOngk6LXY4TX KXHFuIL6L3ylY57hWTVM7MrNNFkNdc+XH6F45vh7nsMz8MHJPjbiqmIoMQXne/oRmKEB l79NI2FBUN7m2R/y5zPpFFRyENP+7vT2SGwvMVsPlmo3azmfmXgadQP+mg313A1u2HmK qW/1BlhBJc80hDUz5+SXRz/ljgrEIeQjpOCOLP5KATEJHsGlun3WTiLruFjglz+Gh44Y zSDQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-language :content-transfer-encoding:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject:arc-authentication-results; bh=I2KEQWGuUTw7xCheDI1yuMNGjb9MbLbbsP5rhu16wns=; b=zwlpfoYT1YWubio3x3b+2palN9JiBYew5GkTlJxbZizlA1TgPS2BsRzENPZWxz0iGr aNIMvxebKOp5EiZ56Ku9k2e6i7rjQohvrF591qRayhHaPEJ/xGEvnB86N+0MMXoJaVda C4ErZ+ehttxiXP7ooY0bGqDKROEe5q7BzgaSXoVyHCwTNJukeMiv1KEvkGfRbZP9Eeqw V1fweYD4Bf0OKHZ97DIio7JnTHYWY5DWcq0tP5Kw5uWdK1tlhVQoA0Rim9tHLVi3/puz HYPt02oDsig19GL/poctJpoYkDNt5rwXF1njOgQ6HnKL0Ql87yvE+qG417HxW40U4YDy mf5Q== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l17-v6si5178193pgn.182.2018.08.03.14.02.57; Fri, 03 Aug 2018 14:03:13 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731741AbeHCXAF (ORCPT + 99 others); Fri, 3 Aug 2018 19:00:05 -0400 Received: from out30-133.freemail.mail.aliyun.com ([115.124.30.133]:46630 "EHLO out30-133.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728477AbeHCXAF (ORCPT ); Fri, 3 Aug 2018 19:00:05 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R501e4;CH=green;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e01429;MF=yang.shi@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0T5z-Ft5_1533330119; Received: from US-143344MP.local(mailfrom:yang.shi@linux.alibaba.com fp:SMTPD_---0T5z-Ft5_1533330119) by smtp.aliyun-inc.com(127.0.0.1); Sat, 04 Aug 2018 05:02:02 +0800 Subject: Re: [RFC v6 PATCH 2/2] mm: mmap: zap pages with read mmap_sem in munmap To: Michal Hocko Cc: willy@infradead.org, ldufour@linux.vnet.ibm.com, kirill@shutemov.name, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <1532628614-111702-1-git-send-email-yang.shi@linux.alibaba.com> <1532628614-111702-3-git-send-email-yang.shi@linux.alibaba.com> <20180803090759.GI27245@dhcp22.suse.cz> From: Yang Shi Message-ID: Date: Fri, 3 Aug 2018 14:01:58 -0700 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.7.0 MIME-Version: 1.0 In-Reply-To: <20180803090759.GI27245@dhcp22.suse.cz> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-Language: en-US Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 8/3/18 2:07 AM, Michal Hocko wrote: > On Fri 27-07-18 02:10:14, Yang Shi wrote: >> When running some mmap/munmap scalability tests with large memory (i.e. >>> 300GB), the below hung task issue may happen occasionally. >> INFO: task ps:14018 blocked for more than 120 seconds. >> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >> message. >> ps D 0 14018 1 0x00000004 >> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 >> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 >> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 >> Call Trace: >> [] ? __schedule+0x250/0x730 >> [] schedule+0x36/0x80 >> [] rwsem_down_read_failed+0xf0/0x150 >> [] call_rwsem_down_read_failed+0x18/0x30 >> [] down_read+0x20/0x40 >> [] proc_pid_cmdline_read+0xd9/0x4e0 >> [] ? do_filp_open+0xa5/0x100 >> [] __vfs_read+0x37/0x150 >> [] ? security_file_permission+0x9b/0xc0 >> [] vfs_read+0x96/0x130 >> [] SyS_read+0x55/0xc0 >> [] entry_SYSCALL_64_fastpath+0x1a/0xc5 >> >> It is because munmap holds mmap_sem exclusively from very beginning to >> all the way down to the end, and doesn't release it in the middle. When >> unmapping large mapping, it may take long time (take ~18 seconds to >> unmap 320GB mapping with every single page mapped on an idle machine). >> >> Zapping pages is the most time consuming part, according to the >> suggestion from Michal Hocko [1], zapping pages can be done with holding >> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write >> mmap_sem to cleanup vmas. >> >> But, some part may need write mmap_sem, for example, vma splitting. So, >> the design is as follows: >> acquire write mmap_sem >> lookup vmas (find and split vmas) >> detach vmas >> deal with special mappings >> downgrade_write >> >> zap pages >> free page tables >> release mmap_sem >> >> The vm events with read mmap_sem may come in during page zapping, but >> since vmas have been detached before, they, i.e. page fault, gup, etc, >> will not be able to find valid vma, then just return SIGSEGV or -EFAULT >> as expected. >> >> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobe, they are >> considered as special mappings. They will be dealt with before zapping >> pages with write mmap_sem held. Basically, just update vm_flags. > Well, I think it would be safer to simply fallback to the current > implementation with these mappings and deal with them on top. This would > make potential issues easier to bisect and partial reverts as well. Do you mean just call do_munmap()? It sounds ok. Although we may waste some cycles to repeat what has done, it sounds not too bad since those special mappings should be not very common. > >> And, since they are also manipulated by unmap_single_vma() which is >> called by unmap_vma() with read mmap_sem held in this case, to >> prevent from updating vm_flags in read critical section, a new >> parameter, called "skip_flags" is added to unmap_region(), unmap_vmas() >> and unmap_single_vma(). If it is true, then just skip unmap those >> special mappings. Currently, the only place which pass true to this >> parameter is us. > skip parameters are usually ugly and lead to more mess later on. Can we > do without them? We need a way to tell unmap_region() that it is called in a kind of special context which updating vm_flags is not allowed. I didn't think of a better way. We could add a new API to do what unmap_region() does without updating vm_flags, but we would have to  duplicate some code. > >> With this approach we don't have to re-acquire mmap_sem again to clean >> up vmas to avoid race window which might get the address space changed. > By with this approach you mean detaching right? Yes, the detaching approach. > >> And, since the lock acquire/release cost is managed to the minimum and >> almost as same as before, the optimization could be extended to any size >> of mapping without incurring significant penalty to small mappings. > I guess you mean to say that lock downgrade approach doesn't lead to > regressions because the overal time mmap_sem is taken is not longer? Yes. And, there is not lock take/retake cost since we don't release it. > >> For the time being, just do this in munmap syscall path. Other >> vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain >> intact for stability reason. > You have used this argument previously and several people have asked. > I think it is just wrong. Either the concept is safe and all callers can > use it or it is not and then those subtle differences should be called > out. Your previous response was that you simply haven't tested other > paths. Well, that is not an argument, I am afraid. The whole thing > should be done at a proper layer. If there are some difficulties to > achieve that for all callers then OK just be explicit about that. I can > imagine some callers really require the exclusive look when munmap > returns for example. Yes, the statement here sounds ambiguous. There are definitely some difficulties to achieve that in mmap and mremap. Since they acquire write mmap_sem at the very beginning, then do their stuff, which may call do_munmap if overlapped address space has to be changed. But, the optimized do_munmap would like to be called without mmap_sem held so that we can do the optimization. So, if we want to do the similar optimization for mmap/mremap path, I'm afraid we would have to redesign them. I assumes munmap itself is the main source of the latency issue. mmap/mremap might hit the latency problem if they are trying to map or remap a huge overlapped address space, but it should be rare. So, I leave them untouched. > >> With the patches, exclusive mmap_sem hold time when munmap a 80GB >> address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to >> us level from second. >> >> munmap_test-15002 [008] 594.380138: funcgraph_entry: | vm_munmap_zap_rlock() { >> munmap_test-15002 [008] 594.380146: funcgraph_entry: !2485684 us | unmap_region(); >> munmap_test-15002 [008] 596.865836: funcgraph_exit: !2485692 us | } >> >> Here the excution time of unmap_region() is used to evaluate the time of >> holding read mmap_sem, then the remaining time is used with holding >> exclusive lock. > I will be reading through the patch and follow up on that separately. Thanks, Yang