Date: Fri, 29 Jun 2018 13:39:54 +0200
From: Michal Hocko
To: Yang Shi
Cc: Peter Zijlstra, Nadav Amit, Matthew Wilcox, ldufour@linux.vnet.ibm.com,
 Andrew Morton, Ingo Molnar, acme@kernel.org,
 alexander.shishkin@linux.intel.com, jolsa@redhat.com, namhyung@kernel.org,
 "open list:MEMORY MANAGEMENT", linux-kernel@vger.kernel.org
Subject: Re: [RFC v2 PATCH 2/2] mm: mmap: zap pages with read mmap_sem for large mapping
Message-ID: <20180629113954.GB5963@dhcp22.suse.cz>
References: <20180620071817.GJ13685@dhcp22.suse.cz>
 <263935d9-d07c-ab3e-9e42-89f73f57be1e@linux.alibaba.com>
 <20180626074344.GZ2458@hirez.programming.kicks-ass.net>
 <20180627072432.GC32348@dhcp22.suse.cz>
 <20180628115101.GE32348@dhcp22.suse.cz>
 <2ecdb667-f4de-673d-6a5f-ee50df505d0c@linux.alibaba.com>

On Thu 28-06-18 17:59:25, Yang Shi wrote:
>
>
> On 6/28/18 12:10 PM, Yang Shi wrote:
> >
> >
> > On 6/28/18 4:51 AM, Michal Hocko wrote:
> > > On Wed 27-06-18 10:23:39, Yang Shi wrote:
> > > >
> > > > On 6/27/18 12:24 AM, Michal Hocko wrote:
> > > > > On Tue 26-06-18 18:03:34, Yang Shi wrote:
> > > > > > On 6/26/18 12:43 AM, Peter Zijlstra wrote:
> > > > > > > On Mon, Jun 25, 2018 at 05:06:23PM -0700, Yang Shi wrote:
> > > > > > > > By looking this deeper, we may not be able to cover all the
> > > > > > > > unmapping range for VM_DEAD, for example, if the start addr is
> > > > > > > > in the middle of a vma. We can't set VM_DEAD to that vma since
> > > > > > > > that would trigger SIGSEGV for still mapped area.
> > > > > > > >
> > > > > > > > splitting can't be done with read mmap_sem held, so maybe just
> > > > > > > > set VM_DEAD to non-overlapped vmas. Access to overlapped vmas
> > > > > > > > (first and last) will still have undefined behavior.
> > > > > > > Acquire mmap_sem for writing, split, mark VM_DEAD, drop mmap_sem.
> > > > > > > Acquire mmap_sem for reading, madv_free drop mmap_sem. Acquire
> > > > > > > mmap_sem for writing, free everything left, drop mmap_sem.
> > > > > > >
> > > > > > > ?
> > > > > > >
> > > > > > > Sure, you acquire the lock 3 times, but both write instances
> > > > > > > should be 'short', and I suppose you can do a demote between 1
> > > > > > > and 2 if you care.
> > > > > > Thanks, Peter. Yes, by looking the code and trying two different
> > > > > > approaches, it looks this approach is the most straight-forward one.
> > > > > Yes, you just have to be careful about the max vma count limit.
> > > > Yes, we should just need copy what do_munmap does as below:
> > > >
> > > > if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
> > > >         return -ENOMEM;
> > > >
> > > > If the max map count limit has been reached, it will return failure
> > > > before zapping mappings.
> > > Yeah, but as soon as you drop the lock and retake it, somebody might
> > > have changed the address space and we might get inconsistency.
> > >
> > > So I am wondering whether we really need upgrade_read (to promote read
> > > to write lock) and do the
> > >     down_write
> > >     split & set up VM_DEAD
> > >     downgrade_write
> > >     unmap
> > >     upgrade_read
> > >     zap ptes
> > >     up_write
>
> Promoting to write lock may be a trouble. There might be other users in the
> critical section with read lock, we have to wait them to finish.

Yes. Is that a problem though?

> > I'm supposed address space changing just can be done by mmap, mremap,
> > mprotect. If so, we may utilize the new VM_DEAD flag. If the VM_DEAD
> > flag is set for the vma, just return failure since it is being unmapped.
> >
> > Does it sounds reasonable?
>
> It looks we just need care about MAP_FIXED (mmap) and MREMAP_FIXED (mremap),
> right?
>
> How about letting them return -EBUSY or -EAGAIN to notify the application?

Well, none of those is documented to return EBUSY, and EAGAIN already has
a meaning for locked memory.

> This changes the behavior a little bit, MAP_FIXED and mremap may fail if
> they fail the race with munmap (if the mapping is larger than 1GB). I'm not
> sure if any multi-threaded application uses MAP_FIXED and MREMAP_FIXED very
> heavily which may run into the race condition. I guess it should be rare to
> meet all the conditions to trigger the race.
>
> The programmer should be very cautious about MAP_FIXED/MREMAP_FIXED since
> they may corrupt its own address space as the man page noted.

Well, I suspect you are overcomplicating this a bit.
This should be a really straightforward thing - well, except for VM_DEAD,
which is quite tricky already. We should rather not spread this
trickiness outside of the #PF path. And I would even try hard to start
that part simple to see whether it actually matters. Relying on races
between threads without any locking is quite questionable already.
Nobody has pointed to a sane usecase so far.
-- 
Michal Hocko
SUSE Labs