Date: Tue, 8 Jan 2019 15:43:36 +1100
From: Dave Chinner
To: Linus Torvalds
Cc: Matthew Wilcox, Jann Horn, Jiri Kosina, Andrew Morton, Greg KH,
	Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
Message-ID: <20190108044336.GB27534@dastard>
References: <20190106001138.GW6310@bombadil.infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Jan 06, 2019 at 01:46:37PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds wrote:
> >
> > Slightly updated patch in case somebody
> > wants to try things out.
>
> I decided to just apply that patch. It is *not* marked for stable,
> very intentionally, because I expect that we will need to wait and see
> if there are issues with it, and whether we might have to do something
> entirely different (more like the traditional behavior with some extra
> "only for owner" logic).

So, I read the paper and before I was halfway through it I figured
there are a bunch of other similar page cache invalidation attacks
we can perform without needing mincore. i.e. focussing on mmap() and
mincore() misses the wider issues we have with global shared caches.

My first thought:

	fd = open(some_file, O_RDONLY);
	iov[0].iov_base = buf;
	iov[0].iov_len = 1;
	ret = preadv2(fd, iov, 1, off, RWF_NOWAIT);
	switch (ret) {
	case 1:
		/* page resident in page cache */
		break;
	case -1:
		if (errno == EAGAIN) {
			/* page not resident in page cache */
			break;
		}
		/* fall through */
	default:
		/* at or beyond EOF, or some other error */
		break;
	}

This is "as documented" in the man page for preadv2:

	RWF_NOWAIT (since Linux 4.14)
		Do not wait for data which is not immediately
		available. If this flag is specified, the preadv2()
		system call will return instantly if it would have
		to read data from the backing storage or wait for a
		lock. If some data was successfully read, it will
		return the number of bytes read. If no bytes were
		read, it will return -1 and set errno to EAGAIN.
		Currently, this flag is meaningful only for
		preadv2().

IOWs, we've explicitly designed interfaces to communicate whether
data is "immediately accessible" or not to the application so it can
make sensible decisions about IO scheduling. i.e. IO won't block the
main application processing loop and so can be scheduled in the
background by the app and the data processed when that IO returns.

That just so happens to be exactly the same information about the
page cache that mincore is making available to userspace.
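For comparison, the mincore() query that exposes the same residency information could look roughly like the sketch below. It is an illustration only, not code from this thread: `count_resident_pages()` is a hypothetical helper name, error handling is minimal, and on a kernel carrying the change discussed here the answer may no longer reflect actual page cache residency.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): return the number of pages
 * of `path` that mincore() reports as resident, or -1 on error (including
 * an empty file, which cannot be mapped). If total_pages is non-NULL it
 * receives the number of pages that were queried.
 */
long count_resident_pages(const char *path, long *total_pages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long npages, resident = 0;
	unsigned char *vec = NULL;
	void *map = MAP_FAILED;
	struct stat st;
	long ret = -1;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || st.st_size == 0)
		goto out;

	/* One byte of vec per page of the mapping. */
	npages = (st.st_size + page_size - 1) / page_size;
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	vec = malloc(npages);
	if (!vec)
		goto out;

	/* Mapping the file does not fault pages in; mincore() only asks. */
	if (mincore(map, st.st_size, vec) == 0) {
		for (long i = 0; i < npages; i++)
			resident += vec[i] & 1;	/* bit 0 set: page is resident */
		ret = resident;
		if (total_pages)
			*total_pages = npages;
	}
out:
	free(vec);
	if (map != MAP_FAILED)
		munmap(map, st.st_size);
	close(fd);
	return ret;
}
```

A caller would pass a file path and compare the return value against `*total_pages`; a per-page answer is available by inspecting the vector directly instead of summing it.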

If we "remove" this information from the interfaces like it has been
done for mincore(), it doesn't mean userspace can't get it in other
ways. e.g. it now just has to time the read(2) syscall duration and
infer whether the data came from the page cache or disk from the
timing information.

IMO, there's nothing new or novel about this page cache information
leak - it was just a matter of time before some researcher put 2 and
2 together and realised that sharing the page cache across a
security boundary is little different from sharing deduplicated
pages across those same security boundaries. i.e. as long as we
share caches across security boundaries and userspace can control
both cache invalidation and instantiation, we cannot prevent
userspace from constructing these invalidate+read information
exfiltration scenarios.

And then there is overlayfs. Overlay is really just a way to
efficiently share the page cache of the same underlying read-only
directory tree across all containers on a host. i.e. we have been
specifically designing our container hosting systems to share the
underlying read-only page cache across all security boundaries on
the host. If overlay is vulnerable to these shared page cache
attacks (seems extremely likely) then we've got much bigger problems
than mincore to think about....

> But doing a test patch during the merge window (which is about to
> close) sounds like the right thing to do.

IMO it seems like the wrong thing to do. It's just a hacky band-aid
over a specific extraction method and does nothing to reduce the
actual scope of the information leak. Focussing on the band-aid
means you've missed all the other avenues through which the same
information is exposed and all the infrastructure we've built on the
core concept of sharing kernel side pages across security
boundaries.

And that's even without considering whether the change breaks
userspace. Which it does. e.g.
vmtouch is fairly widely used to manage page cache instantiation for
rapid bring-up and migration of guest VMs and containers. They save
the hot page cache information from a running container and then use
that to instantiate the page cache in new instances running the same
workload so they run at full speed right from the start. This use
case calls mincore() to pull the page cache information from the
running container.

If anyone else proposed merging a syscall implementation change that
was extremely likely to break userspace you'd be shouting at them
that "we don't break userspace"....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com