Subject: Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
From: Yang Shi <yang.shi@linux.alibaba.com>
To: Michal Hocko
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Wed, 21 Mar 2018 09:31:22 -0700
In-Reply-To: <20180321130833.GM23100@dhcp22.suse.cz>
References: <1521581486-99134-1-git-send-email-yang.shi@linux.alibaba.com>
 <1521581486-99134-2-git-send-email-yang.shi@linux.alibaba.com>
 <20180321130833.GM23100@dhcp22.suse.cz>
On 3/21/18 6:08 AM, Michal Hocko wrote:
> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>>
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>       Tainted: G            E   4.9.79-009.ali3000.alios7.x86_64 #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>> ps              D    0 14018      1 0x00000004
>>  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>> Call Trace:
>>  [] ? __schedule+0x250/0x730
>>  [] schedule+0x36/0x80
>>  [] rwsem_down_read_failed+0xf0/0x150
>>  [] call_rwsem_down_read_failed+0x18/0x30
>>  [] down_read+0x20/0x40
>>  [] proc_pid_cmdline_read+0xd9/0x4e0
>>  [] ? do_filp_open+0xa5/0x100
>>  [] __vfs_read+0x37/0x150
>>  [] ? security_file_permission+0x9b/0xc0
>>  [] vfs_read+0x96/0x130
>>  [] SyS_read+0x55/0xc0
>>  [] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from the very beginning all the way
>> to the end, and doesn't release it in the middle. When unmapping a large
>> mapping, it may take a long time (~18 seconds to unmap a 320GB mapping
>> with every single page mapped, on an idle machine).
> Yes, this definitely sucks. One way to work that around is to split the
> unmap to two phases. One to drop all the pages. That would only need
> mmap_sem for read and then tear down the mapping with the mmap_sem for
> write. This wouldn't help for parallel mmap_sem writers but those really
> need a different approach (e.g. the range locking).

Couldn't a page fault sneak in and map a page which has already been
unmapped?
range locking should help a lot on manipulating small sections of a large
mapping in parallel, or on multiple small mappings. It may not achieve much
for a single large mapping, though.

>
>> Since unmapping doesn't require any atomicity, so here unmap large
> How come? Could you be more specific why? Once you drop the lock the
> address space might change under your feet and you might be unmapping a
> completely different vma. That would require userspace doing nasty
> things of course (e.g. MAP_FIXED) but I am worried that userspace really
> depends on mmap/munmap atomicity these days.

Sorry for the ambiguity; the statement does look misleading. munmap does
need certain atomicity, particularly for the below sequence:

    splitting vma
    unmap region
    free pagetables
    free vmas

Otherwise it may run into the below race condition:

    CPU A                          CPU B
    ----------                     ----------
    do_munmap
      zap_pmd_range
        up_write
                                   do_munmap
                                     down_write
                                     ......
                                     remove_vma_list
                                     up_write
      down_write
      access vmas  <-- use-after-free bug

This is why I do the range unmap in do_munmap() rather than doing it at a
deeper location, i.e. zap_pmd_range(). I elaborated on this in the cover
letter.

Thanks,
Yang