From: Nicolai Stange
To: Nicolai Stange
Cc: Vlastimil Babka, "Kirill A. Shutemov", Tejun Heo, Kevin Easton,
    Cyril Hrubis, Daniel Gruss, Andy Lutomirski, Linus Torvalds,
    Dave Chinner, Dominique Martinet, Jiri Kosina, Matthew Wilcox,
    Jann Horn, Andrew Morton, Greg KH, Peter Zijlstra, Michal Hocko,
    Linux-MM, kernel list, Linux API
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
References: <20190110004424.GH27534@dastard> <20190110070355.GJ27534@dastard>
    <20190110122442.GA21216@nautica> <20190111020340.GM27534@dastard>
    <20190111040434.GN27534@dastard> <20190111073606.GP27534@dastard>
Date: Wed, 20 Feb 2019 16:49:29 +0100
In-Reply-To: (Linus Torvalds's message of "Fri, 11 Jan 2019 08:26:14 -0800")
Message-ID: <87imxejw8m.fsf@suse.de>
X-Mailing-List: linux-kernel@vger.kernel.org

Linus Torvalds writes:

> So in order to use it as a signal, first you have to first scrub the
> cache (because if the page was already there, there's no signal at
> all), and then for the signal to be as useful as possible, you're also
> going to want to try to get out more than one bit of information: you
> are going to try to see the patterns and the timings of how it gets
> filled.
>
> And that's actually quite painful. You don't know the initial cache
> state, and you're not (in general) controlling the machine entirely,
> because there's also that actual other entity that you're trying to
> attack and see what it does.
>
> So what you want to do is basically to first make sure the cache is
> scrubbed (only for the pages you're interested in!), then trigger
> whatever behavior you are looking for, and then look how that affected
> the cache.
>
> In other words, you want *multiple* residency status check - first to
> see what the cache state is (because you're going to want that for
> scrubbing), then to see that "yes, it's gone" when doing the
> scrubbing, and then to see the *pattern* and timings of how things are
> brought in.

In an attempt to gain a better understanding of the guided eviction
part, and of the relevance of mincore() & friends to it, I worked on
reproducing the results from [1], section 6.1 ("Efficient Page Cache
Eviction on Linux").

In case anybody wants to run their own experiments: the sources can be
found at [2].

Disclaimer: I don't have access to the sources used by the authors of
[1], nor do I know anything about their experimental setup. So it might
very well be the case that my implementation is completely different
and/or inefficient.

Anyway, quoting from [1], section 6.1:

  "Eviction Set 1. These are pages already in the page cache, used by
   other processes. To keep them in the page cache, a thread
   continuously accesses these pages while also keeping the system load
   low by using sched yield and sleep. Consequently, they are among the
   most recently accessed pages of the system and eviction of these
   pages becomes highly unlikely."

I had two questions:

1.) Do the actual contents of "Eviction set 1" matter for the guided
    eviction's performance, or can they just as well be arbitrary but
    fixed? If the set's actual contents weren't of any importance, then
    mincore() would not be needed to initialize it with "pages already
    in the page cache".

2.) How does keeping some fixed set resident + doing some IO compare to
    simply streaming a huge random file through the page cache?

(To make it explicit: I didn't look into the probe part of the attack,
nor into checking the victim page's residency status as a termination
condition for the eviction run.)
Obviously, there are two primary factors affecting the victim page
eviction performance: the file page cache size and the disk read
bandwidth.

Naively speaking, I would suppose that keeping a certain set resident
is a cheap and stable way to constrain IO to the remaining portion of
the page cache, and thus to reduce the amount of data that has to be
read from disk until the victim page gets evicted.

Results summary (details can be found at the end of this mail):

- The baseline benchmark of simply streaming random data through the
  page cache behaves as expected: avg of "Inactive(file)" / avg of
  "victim page eviction time" yields ~480MB/s, which approximately
  matches my disk's read bandwidth (note: the victim page was mapped as
  !PROT_EXEC).

- I didn't do any sophisticated fine-tuning of the parameters, but for
  the configuration yielding the best result, the average victim page
  eviction time was 147ms (stddev(*): 69ms, stderr: 1ms) with the
  "random but fixed resident set method". That's an improvement by a
  factor of 2.6 over the baseline "streaming random data method" (with
  the same amount of anonymous memory, i.e. "Eviction set 3",
  allocated: 7GB out of a total of 8GB).

- In principle, question 1.) can't be answered by experiment without
  controlling the initial, legitimate system workload. I did some lax
  tests on my desktop running firefox, libreoffice etc. though and, of
  course, overall responsiveness got a lot better if "Eviction set 1"
  had been populated with pages already resident at the time the
  "attack" started. But the victim page eviction times didn't seem to
  improve -- at least not by factors large enough for my biased mind to
  recognize any change.

In conclusion, keeping an arbitrary but fixed "Eviction set 1" resident
improved the victim page eviction performance by some factor over the
"streaming" baseline. Here, "Eviction set 1" was populated from a
single, attacker-controlled file, and mincore() was not needed for
determining its initial contents.
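The derived figures in the summary above can be rechecked from the raw numbers listed at the end of this mail. The following is only a sketch of that arithmetic in Python; the list and function names are made up for illustration and do not come from [2].

```python
# Raw numbers copied from the results tables at the end of this mail,
# ordered by anonymous memory allocation: 7, 6, 5, 4, 3 GB.
INACTIVE_FILE_MB = [189.3389, 709.7063, 1221.0910, 1735.7510, 2247.3340]
BASELINE_MS = [386.7712, 1453.3975, 2533.6084, 3622.4995, 4694.4668]
RESIDENT_SET_MS = [146.6296, 428.9594, 620.2084, 863.1423, 977.8963]


def quotient_mb_per_s(inactive_mb, eviction_ms):
    """Avg Inactive(file) divided by avg eviction time, in MB/s."""
    return inactive_mb / (eviction_ms / 1000.0)


def improvement(baseline_ms, resident_set_ms):
    """Factor by which the resident set method beats the streaming baseline."""
    return baseline_ms / resident_set_ms


quotients = [quotient_mb_per_s(mb, ms)
             for mb, ms in zip(INACTIVE_FILE_MB, BASELINE_MS)]
factors = [round(improvement(b, r), 1)
           for b, r in zip(BASELINE_MS, RESIDENT_SET_MS)]
```

All five quotients land between roughly 479 and 490 MB/s, which is where the "~480MB/s" figure comes from, and the first entry of `factors` is the 2.6 quoted for the 7GB configuration.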
To my surprise though, I needed to rely on mincore() at some other
place, namely *my* implementation of keeping the resident set resident.

My first naive approach was to have a single thread repeatedly iterate
over the pages and read the first few bytes from each through a mmapped
area. That did not work out because, even for smaller resident sets of
1GB and even when run with realtime priority, the accessing thread
would at some point encounter a reclaimed page and have to wait for the
page fault to get served. While it waits, even more of the resident
pages are likely to get reclaimed, causing additional delays later on.
Eventually, the resident page accessor loses the game and encounters
page faults for almost the whole resident set (which isn't resident
anymore).

I worked around this by making the accessor thread check page residency
via mincore(), touch only the resident pages and queue the others to
some refault ("rewarm") thread. From briefly looking at iostat, this
rewarmer thread actually seemed to saturate the disk, so there was no
need for additional IO to put pressure on the page cache.

For completeness, the pages from the resident set that were actually
found resident made up ~97% of all file page cache pages
(Inactive+Active(file)).

Note that the way mincore() is used here is different from the probe
part of the attack: for probing, we'd like to know when a victim page
has been faulted in again, while the residency keeper needs to check
that some page has not been reclaimed before accessing it. Furthermore,
mincore() is run on pages from an attacker-controlled and -owned file
here.

AFAICS, the patches currently under review (cf. [3]) don't mitigate
this latter abuse of mincore() & Co.
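The accessor/rewarmer split described above can be sketched roughly as follows. This is an illustration only and not the code from [2]: it is written in Python with mincore() called through ctypes, the function names (`resident_pages`, `accessor_pass`, `rewarm_pass`) are hypothetical, and using pread() for the refault is an assumption. It assumes Linux and a writable, shared mapping of a file the caller owns.

```python
import ctypes
import mmap
import os
import queue

PAGE_SIZE = mmap.PAGESIZE

# dlopen(NULL): resolves libc symbols such as mincore() on Linux.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]
_libc.mincore.restype = ctypes.c_int


def resident_pages(mapping):
    """Return one bool per page: True if mincore() reports it resident."""
    length = len(mapping)
    n_pages = (length + PAGE_SIZE - 1) // PAGE_SIZE
    vec = (ctypes.c_ubyte * n_pages)()
    # from_buffer() needs a writable mapping; only its address is used.
    view = (ctypes.c_char * length).from_buffer(mapping)
    try:
        if _libc.mincore(ctypes.addressof(view), length, vec) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        del view  # drop the buffer export so the mapping can be closed
    return [bool(b & 1) for b in vec]


def accessor_pass(mapping, rewarm_queue):
    """Touch resident pages only; hand the others to the rewarmer."""
    touched = 0
    for idx, resident in enumerate(resident_pages(mapping)):
        if resident:
            mapping[idx * PAGE_SIZE]  # one-byte read, should not block on IO
            touched += 1
        else:
            rewarm_queue.put(idx)  # never fault it in from this thread
    return touched


def rewarm_pass(fd, rewarm_queue):
    """Refault queued pages from disk (assumption: plain pread())."""
    refaulted = 0
    while True:
        try:
            idx = rewarm_queue.get_nowait()
        except queue.Empty:
            return refaulted
        os.pread(fd, PAGE_SIZE, idx * PAGE_SIZE)
        refaulted += 1
```

In a real run, `accessor_pass()` and `rewarm_pass()` would loop in separate threads. A page can still be reclaimed between the mincore() call and the touch, so the accessor may occasionally block anyway; the sketch only narrows that window.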
I personally doubt that doing something about it would be worth it
though: first of all, until proven otherwise, I'm assuming that the
improvement of the "resident set" based eviction method over the
"streaming" baseline is not by orders of magnitude, and that the victim
page eviction times interesting to attackers (let's say better than
500ms) are achievable only under certain conditions: no swap and the
ability to block a large fraction of total memory with anonymous
mappings.

What's more, I can imagine that there are other ways to keep the
resident set resident without relying on mincore() at all: I haven't
tried it, but simply spawning a larger number of accessor threads, each
at a different position within "Eviction set 1", and hoping that most
of the time at least one of them finds a resident page to touch, might
work, too.

Thanks,

Nicolai


Experiment setup + numbers
==========================

My laptop has 8GB of RAM and an SSD, and runs OpenSUSE 42.3 with kernel
package version 4.4.165-81. I stopped almost all services, but didn't
set up any CPU isolation.

I ran 5 x 2 experiments where I allocated (and filled, of course)
3,4,5,6,7GB of anonymous memory each and compared the baseline
"read-in-a-huge-file" (mapped PROT_EXEC) results with the resident set
based eviction for each of these. While doing so, I continuously
measured eviction times of a !PROT_EXEC victim page and reported the
'Active(file)' + 'Inactive(file)' statistics from /proc/meminfo.
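For readers who want to collect the same statistics, a minimal sketch of sampling 'Active(file)' and 'Inactive(file)' from /proc/meminfo is given below. It is not taken from [2]; the function names are hypothetical, and it assumes the values are reported in kB and converted by dividing by 1024.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field name: integer value} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip anything that is not a "Name: value [unit]" line
        fields[name.strip()] = int(parts[0])
    return fields


def file_cache_mb(fields):
    """Return (Inactive(file), Inactive+Active(file)), converted from kB."""
    inactive = fields["Inactive(file)"] / 1024
    active = fields["Active(file)"] / 1024
    return inactive, inactive + active


def sample_file_cache_mb(path="/proc/meminfo"):
    """Take one sample from the running system."""
    with open(path) as f:
        return file_cache_mb(parse_meminfo(f.read()))
```

Calling `sample_file_cache_mb()` periodically during an eviction run and averaging the samples would give numbers comparable to the "Inactive(file)" and "Inactive+Active(file)" rows reported below.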
Baseline "streaming" benchmark results:

  anon mem (GB):                        7          6          5          4          3
  Inactive(file) (MB):           189.3389   709.7063  1221.0910  1735.7510  2247.3340
  eviction time (ms):            386.7712  1453.3975  2533.6084  3622.4995  4694.4668
  quotient (MB/s):               489.5372   488.3085   481.9573   479.1584   478.7198

and, for comparison with the results from the resident set based
eviction below:

  Inactive+Active(file) (MB):         358       1390       2415       3445       4469


Resident set based eviction:

For the results below, I chose to draw the resident set from a single
file filled with random data and made it total mem - allocated anon mem
in size. The disk bandwidth was saturated all the time.

  anon mem (GB):                        7          6          5          4          3
  eviction time (ms):            146.6296   428.9594   620.2084   863.1423   977.8963
  improvement over baseline:          2.6        3.4        4.1        4.2        4.8
  Inactive+Active(file) (MB):         429       1449       2471       3494       4515
  resident from res. set (MB):        417       1428       2426       3424       4429
  fraction:                         97.3%      98.6%      98.2%      98.0%      98.1%


[1] https://arxiv.org/abs/1901.01161 ("Page Cache Attacks")
[2] https://github.com/nicstange/pgc
[3] https://lkml.kernel.org/r/20190130124420.1834-1-vbabka@suse.cz

(*) The distribution of eviction times is not Gaussian.

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)