From: Andy Lutomirski
Date: Wed, 9 Jan 2019 21:26:41 -0800
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
To: Linus Torvalds
Cc: Dave Chinner, Jiri Kosina, Matthew Wilcox, Jann Horn, Andrew Morton,
    Greg KH, Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jan 9, 2019 at 5:18 PM Linus Torvalds wrote:
>
> On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner wrote:
> >
> > I wouldn't look at ext4 as an example of a reliable, problem free
> > direct IO implementation because, historically speaking, it's been a
> > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and
> > been far worse than XFS from data integrity, performance and
> > reliability perspectives.
>
> That's some big words from somebody who just admitted to much worse hacks.
>
> Seriously. XFS is buggy in this regard, ext4 apparently isn't.
>
> Thinking that it's better to just invalidate the cache for direct IO
> reads is all kinds of odd.
>

This whole discussion seems to have gone a little bit off the rails...

Linus, I think I agree with Dave's overall sentiment, though, and I
think you should consider reverting your patch. Here's why.

The basic idea behind the attack is that the authors found efficient
ways to do two things: evict a page from page cache and detect,
*without filling the cache*, whether a page is cached. The combination
lets an attacker efficiently tell when another process reads a page.

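As a rough illustration of the second primitive (detecting residency without filling the cache), the sketch below probes a file offset with a non-blocking read. It assumes Linux, Python's `os.preadv` wrapper around preadv2(), and a filesystem that honours `RWF_NOWAIT` for buffered reads. `probe_cached` is a hypothetical helper name, and the eviction half of the attack is not shown.

```python
import os


def probe_cached(fd, offset, length=1):
    """Return True if the range looks cached, False if not, None if unknown.

    Assumption: Linux with preadv2() and RWF_NOWAIT support for buffered
    reads on this filesystem. Anything else is reported as None.
    """
    flag = getattr(os, "RWF_NOWAIT", None)
    if flag is None or not hasattr(os, "preadv"):
        return None                  # platform does not expose the interface
    buf = bytearray(length)
    try:
        got = os.preadv(fd, [buf], offset, flag)
    except BlockingIOError:
        return False                 # EAGAIN: the read would have had to wait for IO
    except OSError:
        return None                  # e.g. unsupported flag, or a non-seekable fd
    return True if got > 0 else None  # 0 bytes means EOF, which tells us nothing
```

If the kernel answers EAGAIN it has declined to start IO, so the probe itself should not pull the page into the cache. That non-destructive property is what the attack needs, and it is the same information mincore() reports.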
We need to keep in mind that this attack is a sophisticated attack,
and anyone using it won't have any problem using a nontrivial way to
detect whether a page is in page cache. So, unless we're going to try
for real to make it hard to tell whether a page is cached without
causing that page to become cached, it's not worth playing
whack-a-mole.

And, right now, mincore is whacking a mole. RWF_NOWAIT appears to do
essentially the same thing at very little cost. I haven't really dug
in, but I assume that various prefaulting tricks combined with various
pagetable probing tricks can do similar things, but that's at least a
*lot* more complicated.

So unless we're going to lock down RWF_NOWAIT as well, I see no reason
to lock down mincore().

Direct IO is a red herring -- O_DIRECT is destructive enough that it
seems likely to make the attacks a lot less efficient.

--- begin digression ---

Since direct IO has been brought up, I have a question. I've wondered
for years why direct IO works the way it does. If I were implementing
it from scratch, my first inclination would be to use the page cache
instead of fighting it.

To do a single-page direct read, I would look that page up in the page
cache (i.e. i_pages these days). If the page is there, I would do a
normal buffered read. If the page is not there, I would insert a
record into i_pages indicating that direct IO is in progress and then
I would do the IO into the destination page. If any other read,
direct or otherwise, sees a record saying "under direct IO", it would
wait.

To do a single-page direct write, I would look it up in i_pages. If
it's there, I would do a buffered write followed by a sync (because
applications expect a sync). If it's not there, I would again add a
record saying "under direct IO" and do the IO.

The idea is that this works as much like buffered IO as possible,
except that the pages backing the IO aren't normal sharable page cache
pages.

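The single-page scheme described above can be modelled in a few lines. This is a toy user-space simulation under stated assumptions, not kernel code: `ToyInode`, `UNDER_DIRECT_IO`, and the dict standing in for i_pages are all hypothetical names, and the "disk" is just another dict.

```python
import threading

UNDER_DIRECT_IO = object()  # hypothetical record stored where a cached page would be


class ToyInode:
    """Toy model: i_pages maps page index -> cached bytes or UNDER_DIRECT_IO."""

    def __init__(self, disk):
        self.disk = dict(disk)   # backing store: page index -> bytes
        self.i_pages = {}        # stand-in for the real i_pages tree
        self.cond = threading.Condition()

    def _wait_until_unmarked(self, index):
        # Caller holds self.cond. Any IO that sees the record waits.
        while self.i_pages.get(index) is UNDER_DIRECT_IO:
            self.cond.wait()

    def buffered_read(self, index):
        with self.cond:
            self._wait_until_unmarked(index)
            if index not in self.i_pages:
                self.i_pages[index] = self.disk[index]  # fills the cache
            return self.i_pages[index]

    def direct_read(self, index):
        with self.cond:
            self._wait_until_unmarked(index)
            if index in self.i_pages:
                return self.i_pages[index]   # cached: act like a buffered read
            self.i_pages[index] = UNDER_DIRECT_IO
        data = self.disk[index]              # IO into the caller's own page
        with self.cond:
            del self.i_pages[index]          # nothing is left behind in the cache
            self.cond.notify_all()
        return data

    def direct_write(self, index, data):
        with self.cond:
            self._wait_until_unmarked(index)
            if index in self.i_pages:
                self.i_pages[index] = data   # buffered write ...
                self.disk[index] = data      # ... followed by a sync
                return
            self.i_pages[index] = UNDER_DIRECT_IO
        self.disk[index] = data              # IO from the caller's own page
        with self.cond:
            del self.i_pages[index]
            self.cond.notify_all()
```

In this model a direct read of an uncached page leaves `i_pages` empty afterwards, while a direct write to a cached page updates both the cache and the backing store, which is the "as much like buffered IO as possible" behaviour the text asks for.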
The multi-page case would be just an optimization on top of the
single-page case. The idea would be to somehow mark i_pages with
entire extents under direct IO. It's a radix tree -- this can, at
least in theory, be done efficiently. As long as all direct IO
operations run in increasing order of offset, there shouldn't be lock
ordering problems.

Other than history and possibly performance, is there any reason that
direct IO doesn't work this way?

P.S. What, if anything, prevents direct writes from causing trouble
when the underlying FS or backing store needs stable pages?
Similarly, what, if anything, prevents direct reads from temporarily
exposing unintended data to user code if the fs or underlying device
transforms the data during the read process (e.g. by decrypting
something)?

--- end digression ---
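The lock-ordering remark in the digression (multi-page direct IO is safe as long as marks are taken in increasing order of offset) can be sketched as follows. This is a toy model only: `ExtentMarks` and `run_direct_io` are hypothetical names, and plain per-page locks stand in for "under direct IO" records; a real implementation would presumably mark whole extents in the tree rather than lock page by page.

```python
import threading


class ExtentMarks:
    """Hypothetical per-page "under direct IO" marks, modelled as plain locks."""

    def __init__(self, npages):
        self.marks = [threading.Lock() for _ in range(npages)]

    def run_direct_io(self, pages, io):
        # Always take marks in increasing offset order, whatever order the
        # caller listed the pages in. Two overlapping operations then contend
        # on their lowest shared page first and cannot deadlock.
        ordered = sorted(set(pages))
        for index in ordered:
            self.marks[index].acquire()
        try:
            return io(ordered)
        finally:
            for index in reversed(ordered):
                self.marks[index].release()
```

Two callers that pass overlapping ranges in opposite orders, say `[1, 2, 3, 4]` and `[4, 3, 2]`, both complete, because each one sorts before acquiring.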