Date: Tue, 5 Jan 2021 16:37:27 +0100
From: Peter Zijlstra
To: Linus Torvalds
Cc: Andy Lutomirski, Peter Xu, Nadav Amit, Yu Zhao, Andrea Arcangeli,
    linux-mm, lkml, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, stable,
    Minchan Kim, Will Deacon
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Message-ID: <20210105153727.GK3040@hirez.programming.kicks-ass.net>
References: <9E301C7C-882A-4E0F-8D6D-1170E792065A@gmail.com>
    <1FCC8F93-FF29-44D3-A73A-DF943D056680@gmail.com>
    <20201221223041.GL6640@xz-x1>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Dec 21, 2020 at 08:16:11PM -0800, Linus Torvalds wrote:
> So I think the basic rule is that "if you hold mmap_sem for writing,
> you're always safe". And that really should be considered the
> "default" locking.
>
> ANY time you make a modification to the VM layer, you should basically
> always treat it as a write operation, and get the mmap_sem for
> writing.
>
> Yeah, yeah, that's a bit simplified, and it ignores various special
> cases (and the hardware page table walkers that obviously take no
> locks at all), but if you hold the mmap_sem for writing you won't
> really race with anything else - not page faults, and not other
> "modify this VM".
>
> To a first approximation, everybody that changes the VM should take
> the mmap_sem for writing, and the readers should be just about
> page fault handling (and I count GUP as "page fault handling" too -
> it's kind of the same "look up page" rather than "modify vm" kind of
> operation).
>
> And there are just a _lot_ more page faults than there are things that
> modify the page tables and the vma's.
>
> So having that mental model of "lookup of pages in a VM take mmap_sem
> for reading, any modification of the VM uses it for writing" makes
> sense both from a performance angle and a logical standpoint. It's the
> correct model.

> And it's worth noting that COW is still "lookup of pages", even though
> it might modify the page tables in the process. The same way lookup
> can modify the page tables to mark things accessed or dirty.
>
> So COW is still a lookup operation, in ways that "change the
> writability of this range" very much is not. COW is "lookup for
> write", and the magic we do to copy to make that write valid is still
> all about the lookup of the page.

(your other email clarified this point; the COW needs to copy while
holding the PTL and we need TLBI under PTL if we're to change this)

> Which brings up another mental mistake I saw earlier in this thread:
> you should not think "mmap_sem is for vma, and the page table lock is
> for the page table changes".
>
> mmap_sem is the primary lock for any modifications to the VM layout,
> whether it be in the vma's or in the page tables.
>
> Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
> it is partly because
>
>  (a) we have things that historically walked the page tables _without_
>      walking the vma's (notably the virtual memory scanning)
>
>  (b) we do allow concurrent page faults, so we then need a lower-level
>      lock to serialize the parallelism we _do_ have.

And I'm thinking the speculative page fault series steps right into all
this, it fundamentally avoids mmap_sem and entirely relies on the PTL.

Which opens it up to exactly these races explored here.

The range lock approach does not suffer this, but I'm still worried
about the actual performance of that thing.