Date: Thu, 10 Jan 2019 12:15:33 +1100
From: Dave Chinner
To: Jiri Kosina
Cc: Linus Torvalds, Matthew Wilcox, Jann Horn, Andrew Morton, Greg KH,
    Peter Zijlstra, Michal Hocko, Linux-MM, kernel list, Linux API
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
Message-ID: <20190110011533.GI27534@dastard>
References: <20190106001138.GW6310@bombadil.infradead.org>
    <20190108044336.GB27534@dastard>
    <20190109022430.GE27534@dastard>
    <20190109043906.GF27534@dastard>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jan 09, 2019 at 11:08:57AM +0100, Jiri Kosina wrote:
> On Wed, 9 Jan 2019, Dave Chinner wrote:
>
> > FWIW, I just realised that the easiest, most reliable way to invalidate
> > the page cache over a file range is simply to do an O_DIRECT read on it.
>
> Neat, good catch indeed. Still, it's only the invalidation part, but the
> residency check is the crucial one.
>
> > > Rationale has been provided by Daniel Gruss in this thread -- if the
> > > attacker is left with cache timing as the only available vector, he's
> > > going to be much more successful with mounting a hardware cache timing
> > > attack anyway.
> >
> > No, he said:
> >
> >	"Restricting mincore() is sufficient to fix the hardware-agnostic
> >	part."
> >
> > That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and
> > provides exactly the same information about the page cache as mincore.
>
> Yeah, preadv2(RWF_NOWAIT) is in the same territory as mincore(), it has
> "just" been overlooked. I can't speak for Daniel, but I believe he might
> be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT
> is sufficient ...".

Good luck with restricting RWF_NOWAIT. I eagerly await all the
fstests that exercise both the existing and new behaviours to
demonstrate they work correctly.

> > Timed read/mmap access loops for cache observation are also hardware
> > agnostic, and on fast SSD based storage will only have marginally lower
> > bandwidth than preadv2(RWF_NOWAIT).
> >
> > Attackers will pick whatever leak vector we don't fix, so we either fix
> > them all (which I think is probably impossible without removing caching
> > altogether)
>
> We can't really fix the fact that it's possible to do the timing on the HW
> caches though.

We can't really fix the fact that it's possible to do the timing on
the page cache, either.

> > or we start thinking about how we need to isolate the page cache so that
> > information isn't shared across important security boundaries (e.g. page
> > cache contents are per-mount namespace).
>
> Umm, sorry for being dense, but how would that help that particular attack
> scenario on a system that doesn't really employ any namespacing?

What's your security boundary?

The "detect what code an app is running" exploit is based on
invalidating and then observing how shared, non-user-owned files
mapped with execute privileges change cache residency.

If the security boundary is within the local container, should users
inside that container be allowed to invalidate the cache of
executable files and libraries they don't own? In this case, we
can't stop observation, because that only requires read permissions
and high precision timing, hence the only thing that can be done
here is to prevent non-owners from invalidating the page cache.

If the security boundary is a namespace or guest VM, then permission
checks don't work - the user may own the file within that container.
The problem now is that the page cache is observable and
controllable from both sides of the fence. Hence the only way to
prevent observation of the code being run in a different namespace
is to prevent the page being shared across both containers.

The exfiltration exploit requires the page cache to be observable
and controllable on both sides of the security boundary. Should
users be able to observe and control the cached pages accessed by a
different container? KSM page deduplication lessons say no. This is
an even harder problem, because page cache residency can be observed
from remote machines....

What scares me is that new features being proposed could make our
exposure a whole lot worse. e.g. the recent virtio-pmem
("fake-dax") proposal will directly share host page cache pages into
guest VMs w/ DAX capability. i.e. the guest directly accesses the
host page cache. This opens up the potential for host page cache
timing attacks from the guest VMs, and potential guest to guest
observation/exploitation is possible if the same files are mapped
into multiple guests....

IOWs the two questions here are simply: "What's your security
boundary?" and "Is the page cache visible and controllable on both
sides?".

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
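
For reference, the two hardware-agnostic residency checks compared in
this thread, mincore() and preadv2(RWF_NOWAIT), can be sketched roughly
as below. This is only an illustrative sketch and not code from the
thread: the helper names page_resident_mincore() and
page_resident_nowait() are made up for the example, and the fallback
RWF_NOWAIT define assumes the value used in the Linux uapi headers.
Both helpers return 1 if the page is resident, 0 if it is not, and -1
on any error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008 /* assumed: value from the Linux uapi headers */
#endif

/*
 * Hypothetical helper: map one page of the file and ask mincore()
 * whether it is resident. Returns 1 (resident), 0 (not resident),
 * or -1 on error.
 */
int page_resident_mincore(int fd, off_t page_index)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char vec = 0;
	void *addr;
	int rc;

	addr = mmap(NULL, (size_t)page_size, PROT_READ, MAP_SHARED,
		    fd, page_index * page_size);
	if (addr == MAP_FAILED)
		return -1;

	rc = mincore(addr, (size_t)page_size, &vec);
	munmap(addr, (size_t)page_size);
	if (rc != 0)
		return -1;

	/* Only the least significant bit of each vector byte is defined. */
	return vec & 1;
}

/*
 * Hypothetical helper: try to read one byte without waiting for I/O.
 * With RWF_NOWAIT the read fails with EAGAIN if the data would have
 * to come from backing storage. Returns 1 (resident), 0 (not
 * resident), or -1 for EOF, an unsupported filesystem, or any other
 * error.
 */
int page_resident_nowait(int fd, off_t page_index)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	ssize_t n;

	n = preadv2(fd, &iov, 1, page_index * page_size, RWF_NOWAIT);
	if (n == 1)
		return 1;
	if (n < 0 && errno == EAGAIN)
		return 0;
	return -1;
}
```

Both helpers only need a file descriptor opened for reading, which is
consistent with the point above that observation requires nothing
more than read permission.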