Received: by 2002:a05:6a10:f347:0:0:0:0 with SMTP id d7csp4676827pxu; Mon, 21 Dec 2020 20:18:23 -0800 (PST) X-Google-Smtp-Source: ABdhPJyhkg+FfPvECarETu2K4QfR8LNu4yY+I3PcYW/mopfNPSFAikDrjyBZfNNzZHjoq33zVTOI X-Received: by 2002:a17:906:1719:: with SMTP id c25mr410705eje.251.1608610702779; Mon, 21 Dec 2020 20:18:22 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1608610702; cv=none; d=google.com; s=arc-20160816; b=gjVfbz6uKsh+0GSTYhB8EuJDoEa4jfOv8dJ50ZgjlKb7Ph47kiaNZaYMZURynASn0u Jm0DmyMIZm3ni/7iFj1XobncQeJX8z4CvL/MFqEK8LiTjz6XxDRt2RHvkzMvZbPfW+Jy DFtJf5vxHwEKfGCjhpJj/B0y0jLmZZkYtKxsbuDqnmrvS7mFp8lCsz2oMZOKAJ2R1/rK cqPcvWToT6TFcVBPIHqrRP9u8v7Wp6UvE4KARNVLdClArvSlPqZ6pgCn0CZKMbDOCaVs S4FnrnolvzOMIopr38ho6cudif9SelUhXv2BRMatlw6hbYTFiW/ntufA55l29zqP/YcM ojgQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=C/IyHJco45Qv6BHyymqm5fsLF1iVOYE2Rc667l3brbw=; b=XUDsITGfDaYS8Bq5TKgNPc2zKof7z7zkMI0rL5TPhWt0HJM46eUYYG8HLZH1DPoRHM ZNeLURF/uopGJUxfEuXw/nYg7Rhm4NZhY4HXMxe+fyGOdWc7lz1SWJFBJUME1OsV0H1u 3IwfcoZ3dBuTttIppyWEmVGqtiiHk9GUlkbgvMZerpovmqg+wIoVfplpO0kLZwU72ab1 mxaSsetucaOm8p1RF4fKTgEEsdpOJv21wwwHvctgPZeR46OqLftetqPJSTRyxEQ9JQ3N UHjrSyRdL79/0HmFj1DT1QDqqSoFlmLlOf/h4KGwpwomCyrvT2+1UY01uuK5SERU058U jKOA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux-foundation.org header.s=google header.b=IR0T55Dm; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id dc22si9308480edb.328.2020.12.21.20.17.59; Mon, 21 Dec 2020 20:18:22 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linux-foundation.org header.s=google header.b=IR0T55Dm; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725850AbgLVERN (ORCPT + 99 others); Mon, 21 Dec 2020 23:17:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58566 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725783AbgLVERM (ORCPT ); Mon, 21 Dec 2020 23:17:12 -0500 Received: from mail-lf1-x12f.google.com (mail-lf1-x12f.google.com [IPv6:2a00:1450:4864:20::12f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D5A14C0613D6 for ; Mon, 21 Dec 2020 20:16:31 -0800 (PST) Received: by mail-lf1-x12f.google.com with SMTP id s26so28797264lfc.8 for ; Mon, 21 Dec 2020 20:16:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=C/IyHJco45Qv6BHyymqm5fsLF1iVOYE2Rc667l3brbw=; b=IR0T55DmJ9mkOAt5PW8Gqr31ryut37JKHFwkKEKNUnz84Mk9NwhsAQe1HxZsHQ7zmB MdR95VrHiV3LKUGcBK3v6VgHVBsa/aLcrgAlzNjpqqiV6O7hKjnM4BwDsBhDCmFtduFt aGec+wYY5jH8H2b3n90B1OMSiOJSvUAF62wS0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=C/IyHJco45Qv6BHyymqm5fsLF1iVOYE2Rc667l3brbw=; b=NPvvlrl+65faEGmgWKQSIFVudeY+eyZbKTFPgJmk/+yAfuBeU3iy4VfNOcBY8dJbKK KnGU8ERbDtZ2S7D8uadDXoVkfLvQIe7wRGiPIFaDc27kBbnbm4Wj03DeNtuU8edDFY02 7JU9T57c/R/MdsmFRFiBlqPRQ3N6B9CFWeewp9ejWAk+lVc27ZwDk7/CD+mdH3QWgAIe 8/TJQzoU6R5l8tPrmna+Ui0XVuI9r8AjyiZM1bSXsW3/iKbdsCg810Crx9gCv+im8UUw 2UcD17C3IRSjyQkM5VU3woah2odW1jGtMkluyRDWIur1x2elhJtYbdgy9mL78miwRNzJ Eysw== X-Gm-Message-State: AOAM531kUAfDRDbCCU9Qnbg+qd5dl5qlJk5qvAQ6+txzZjt54S+ZlN9P PMuejwSVfHycDDOaageyzYNcSD6y+2qdHg== X-Received: by 2002:a05:651c:48e:: with SMTP id s14mr9321344ljc.159.1608610589562; Mon, 21 Dec 2020 20:16:29 -0800 (PST) Received: from mail-lf1-f44.google.com (mail-lf1-f44.google.com. [209.85.167.44]) by smtp.gmail.com with ESMTPSA id i29sm2384185lfc.193.2020.12.21.20.16.28 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 21 Dec 2020 20:16:28 -0800 (PST) Received: by mail-lf1-f44.google.com with SMTP id m12so28852319lfo.7 for ; Mon, 21 Dec 2020 20:16:28 -0800 (PST) X-Received: by 2002:a2e:b4af:: with SMTP id q15mr8557233ljm.507.1608610587583; Mon, 21 Dec 2020 20:16:27 -0800 (PST) MIME-Version: 1.0 References: <20201221172711.GE6640@xz-x1> <76B4F49B-ED61-47EA-9BE4-7F17A26B610D@gmail.com> <9E301C7C-882A-4E0F-8D6D-1170E792065A@gmail.com> <1FCC8F93-FF29-44D3-A73A-DF943D056680@gmail.com> <20201221223041.GL6640@xz-x1> In-Reply-To: From: Linus Torvalds Date: Mon, 21 Dec 2020 20:16:11 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect To: Andy Lutomirski Cc: Peter Xu , Nadav Amit , Yu Zhao , Andrea Arcangeli , linux-mm , lkml , Pavel Emelyanov , Mike Kravetz , Mike Rapoport , stable , Minchan Kim , Will Deacon , Peter Zijlstra Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 21, 2020 at 7:19 PM Andy Lutomirski wrote: > > Ugh, this is unpleasantly complicated. I probably should have phrased it differently, because the case you quote is actually a good one: > I will admit that any API that > takes an address and more-or-less-blindly marks it RO makes me quite > nervous even assuming all the relevant locks are held. At least > userfaultfd refuses to operate on VM_SHARED VMAs, but we have another > instance of this (with mmap_sem held for write) in x86: > mark_screen_rdonly(). Dare I ask how broken this is? We could likely > get away with deleting it entirely. So I think the basic rule is that "if you hold mmap_sem for writing, you're always safe". And that really should be considered the "default" locking. ANY time you make a modification to the VM layer, you should basically always treat it as a write operation, and get the mmap_sem for writing. Yeah, yeah, that's a bit simplified, and it ignores various special cases (and the hardware page table walkers that obviously take no locks at all), but if you hold the mmap_sem for writing you won't really race with anything else - not page faults, and not other "modify this VM". So mark_screen_rdonly() looks trivially fine to me. I don't think it really necessarily matters any more, and it's a legacy thing for a legacy hardware issue, but I don't think there's any reason at all to remove it either. To a first approximation, everybody that changes the VM should take the mmap_sem for writing, and the readers should just be just about page fault handling (and I count GUP as "page fault handling" too - it's kind of the same "look up page" rather than "modify vm" kind of operation). And there are just a _lot_ more page faults than there are things that modify the page tables and the vma's. So having that mental model of "lookup of pages in a VM take mmap_semn for reading, any modification of the VM uses it for writing" makes sense both from a performance angle and a logical standpoint. It's the correct model. And it's worth noting that COW is still "lookup of pages", even though it might modify the page tables in the process. The same way lookup can modify the page tables to mark things accessed or dirty. So COW is still a lookup operation, in ways that "change the writabiility of this range" very much is not. COW is "lookup for write", and the magic we do to copy to make that write valid is still all about the lookup of the page. Which brings up another mental mistake I saw earlier in this thread: you should not think "mmap_sem is for vma, and the page table lock is for the page table changes". mmap_sem is the primary lock for any modifications to the VM layout, whether it be in the vma's or in the page tables. Now, the page table lock does exist _in_addition_to_ the mmap_sem, but it is partly because (a) we have things that historically walked the page tables _without_ walking the vma's (notably the virtual memory scanning) (b) we do allow concurrent page faults, so we then need a lower-level lock to serialize the parallelism we _do_ have. And yes, we have ended up allowing some of those "modify the VM state" to then take the mmap_sem for reading, just because their modifications are slight enough that we can then say "ok, this is a write modification, but the writes it does are protected by the page table lock, we'll just get the mmap_sem for reading". It's an optimization, but it should be considered exactly that: not fundamental, but just a clever trick and an optimization. It's why I went "userfaultfd is buggy" immediately when I noticed. It basically became clear that "oh, that's an optimization, but it's an *invalid* one", in that it didn't actually work and give the same end result. So when I said: > Anything that changes RW->RO - like fork(), for example - needs to > take the mmap_lock. I didn't mean that we should consider that RW->RO change to be this complex semantic marker that we should honor and say "ok, because it's a RW->RO change, we need to hold the mmap_sem". I phrased it really badly, in other words. What I *should* have said is that *because* userfaultfd is changing the VM layout, it should always act as if it had to take the mmap_sem for writing, and that the RW->RO change is an example of when downgrading that write-lock to a read lock is simply not valid - because it basically breaks the rules about what a lookup (ie a read) can do. A lookup can never turn a writable page non-writable. A lookup - through COW - _can_ turn a read-only page writable. So that's why "RO->RW" can be ok under the read lock, but RW->RO is not. Does that clarify the situation when I phrase it that way instead? Linus