2019-01-05 17:28:29

by Jiri Kosina

[permalink] [raw]
Subject: [PATCH] mm/mincore: allow for making sys_mincore() privileged

From: Jiri Kosina <[email protected]>

There are possibilities [1] how mincore() could be used as a converyor of
a sidechannel information about pagecache metadata.

Provide vm.mincore_privileged sysctl, which makes it possible to mincore()
start returning -EPERM in case it's invoked by a process lacking
CAP_SYS_ADMIN.

The default behavior stays "mincore() can be used by anybody" in order to
be conservative with respect to userspace behavior.

[1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/

Signed-off-by: Jiri Kosina <[email protected]>
---
Documentation/sysctl/vm.txt | 9 +++++++++
kernel/sysctl.c | 8 ++++++++
mm/mincore.c | 5 +++++
3 files changed, 22 insertions(+)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 187ce4f599a2..afb8635e925e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -41,6 +41,7 @@ Currently, these files are in /proc/sys/vm:
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
+- mincore_privileged
- mmap_min_addr
- mmap_rnd_bits
- mmap_rnd_compat_bits
@@ -485,6 +486,14 @@ files and similar are considered.
The default is 1 percent.

==============================================================
+mincore_privileged:
+
+mincore() could be potentially used to mount a side-channel attack against
+pagecache metadata. This sysctl provides system administrators means to
+make it available only to processess that own CAP_SYS_ADMIN capability.
+
+The default is 0, which means mincore() can be used without restrictions.
+==============================================================

mmap_min_addr

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1825f712e73b..f03cb07c8dd4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -114,6 +114,7 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
#ifndef CONFIG_MMU
extern int sysctl_nr_trim_pages;
#endif
+extern int sysctl_mincore_privileged;

/* Constants used for minimum and maximum */
#ifdef CONFIG_LOCKUP_DETECTOR
@@ -1684,6 +1685,13 @@ static struct ctl_table vm_table[] = {
.extra2 = (void *)&mmap_rnd_compat_bits_max,
},
#endif
+ {
+ .procname = "mincore_privileged",
+ .data = &sysctl_mincore_privileged,
+ .maxlen = sizeof(sysctl_mincore_privileged),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
{ }
};

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..77d4928cdfaa 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -21,6 +21,8 @@
#include <linux/uaccess.h>
#include <asm/pgtable.h>

+int sysctl_mincore_privileged;
+
static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
@@ -228,6 +230,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
unsigned long pages;
unsigned char *tmp;

+ if (sysctl_mincore_privileged && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
/* Check the start address: needs to be page-aligned.. */
if (start & ~PAGE_MASK)
return -EINVAL;
--
Jiri Kosina
SUSE Labs


2019-01-05 19:21:27

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On 5.1.2019 18:27, Jiri Kosina wrote:
> From: Jiri Kosina <[email protected]>
>
> There are possibilities [1] how mincore() could be used as a converyor of
> a sidechannel information about pagecache metadata.
>
> Provide vm.mincore_privileged sysctl, which makes it possible to mincore()
> start returning -EPERM in case it's invoked by a process lacking
> CAP_SYS_ADMIN.

Haven't checked the details yet, but wouldn't it be safe if anonymous private
mincore() kept working, and restrictions were applied only to page cache?

> The default behavior stays "mincore() can be used by anybody" in order to
> be conservative with respect to userspace behavior.

What if we lied instead of returned -EPERM, to not break userspace so obviously?
I guess false positive would be the safer lie?

> [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/
>
> Signed-off-by: Jiri Kosina <[email protected]>
> ---
> Documentation/sysctl/vm.txt | 9 +++++++++
> kernel/sysctl.c | 8 ++++++++
> mm/mincore.c | 5 +++++
> 3 files changed, 22 insertions(+)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 187ce4f599a2..afb8635e925e 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -41,6 +41,7 @@ Currently, these files are in /proc/sys/vm:
> - min_free_kbytes
> - min_slab_ratio
> - min_unmapped_ratio
> +- mincore_privileged
> - mmap_min_addr
> - mmap_rnd_bits
> - mmap_rnd_compat_bits
> @@ -485,6 +486,14 @@ files and similar are considered.
> The default is 1 percent.
>
> ==============================================================
> +mincore_privileged:
> +
> +mincore() could be potentially used to mount a side-channel attack against
> +pagecache metadata. This sysctl provides system administrators means to
> +make it available only to processess that own CAP_SYS_ADMIN capability.
> +
> +The default is 0, which means mincore() can be used without restrictions.
> +==============================================================
>
> mmap_min_addr
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 1825f712e73b..f03cb07c8dd4 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -114,6 +114,7 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
> #ifndef CONFIG_MMU
> extern int sysctl_nr_trim_pages;
> #endif
> +extern int sysctl_mincore_privileged;
>
> /* Constants used for minimum and maximum */
> #ifdef CONFIG_LOCKUP_DETECTOR
> @@ -1684,6 +1685,13 @@ static struct ctl_table vm_table[] = {
> .extra2 = (void *)&mmap_rnd_compat_bits_max,
> },
> #endif
> + {
> + .procname = "mincore_privileged",
> + .data = &sysctl_mincore_privileged,
> + .maxlen = sizeof(sysctl_mincore_privileged),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> { }
> };
>
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 218099b5ed31..77d4928cdfaa 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -21,6 +21,8 @@
> #include <linux/uaccess.h>
> #include <asm/pgtable.h>
>
> +int sysctl_mincore_privileged;
> +
> static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
> unsigned long end, struct mm_walk *walk)
> {
> @@ -228,6 +230,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
> unsigned long pages;
> unsigned char *tmp;
>
> + if (sysctl_mincore_privileged && !capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
> /* Check the start address: needs to be page-aligned.. */
> if (start & ~PAGE_MASK)
> return -EINVAL;
>


2019-01-05 19:26:33

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, 5 Jan 2019, Vlastimil Babka wrote:

> > There are possibilities [1] how mincore() could be used as a converyor of
> > a sidechannel information about pagecache metadata.
> >
> > Provide vm.mincore_privileged sysctl, which makes it possible to mincore()
> > start returning -EPERM in case it's invoked by a process lacking
> > CAP_SYS_ADMIN.
>
> Haven't checked the details yet, but wouldn't it be safe if anonymous private
> mincore() kept working, and restrictions were applied only to page cache?

I was considering that, but then I decided not to do so, as that'd make
the interface even more confusing and semantics non-obvious in the
'privileged' case.

> > The default behavior stays "mincore() can be used by anybody" in order to
> > be conservative with respect to userspace behavior.
>
> What if we lied instead of returned -EPERM, to not break userspace so
> obviously? I guess false positive would be the safer lie?

So your proposal basically would be

if (privileged && !CAP_SYS_ADMIN)
if (pagecache)
return false;
else
return do_mincore()

right ?

I think userspace would hate us for that semantics, but on the other hand
I can sort of understand the 'mincore() is racy anyway, so what' argument,
if that's what you are suggesting.

But then, I have no idea what userspace is using mincore() for.
https://codesearch.debian.net/search?q=mincore might provide some insight
I guess (thanks Matthew).

--
Jiri Kosina
SUSE Labs


2019-01-05 19:40:00

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On 5.1.2019 20:24, Jiri Kosina wrote:
> On Sat, 5 Jan 2019, Vlastimil Babka wrote:
>
>>> There are possibilities [1] how mincore() could be used as a converyor of
>>> a sidechannel information about pagecache metadata.
>>>
>>> Provide vm.mincore_privileged sysctl, which makes it possible to mincore()
>>> start returning -EPERM in case it's invoked by a process lacking
>>> CAP_SYS_ADMIN.
>>
>> Haven't checked the details yet, but wouldn't it be safe if anonymous private
>> mincore() kept working, and restrictions were applied only to page cache?
>
> I was considering that, but then I decided not to do so, as that'd make
> the interface even more confusing and semantics non-obvious in the
> 'privileged' case.
>
>>> The default behavior stays "mincore() can be used by anybody" in order to
>>> be conservative with respect to userspace behavior.
>>
>> What if we lied instead of returned -EPERM, to not break userspace so
>> obviously? I guess false positive would be the safer lie?
>
> So your proposal basically would be
>
> if (privileged && !CAP_SYS_ADMIN)
> if (pagecache)
> return false;

I was thinking about "return true" here, assuming that userspace generally wants
to ensure itself there won't be page faults when it starts doing something
critical, and if it sees a "false" it will try to do some kind of prefaulting,
possibly in a loop. There might be somebody trying to make sure something is out
of pagecache (it wants to see "false"), but can't think of anything except
benchmarks?

> else
> return do_mincore()
>
> right ?
>
> I think userspace would hate us for that semantics, but on the other hand
> I can sort of understand the 'mincore() is racy anyway, so what' argument,
> if that's what you are suggesting.
>
> But then, I have no idea what userspace is using mincore() for.
> https://codesearch.debian.net/search?q=mincore might provide some insight
> I guess (thanks Matthew).
>


2019-01-05 19:46:15

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Hi Jiri,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Jiri-Kosina/mm-mincore-allow-for-making-sys_mincore-privileged/20190106-013707
config: c6x-evmc6678_defconfig (attached as .config)
compiler: c6x-elf-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=c6x

All errors (new ones prefixed by >>):

>> kernel/sysctl.o:(.fardata+0x6a0): undefined reference to `sysctl_mincore_privileged'

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (1.00 kB)
.config.gz (5.13 kB)
Download all attachments

2019-01-05 19:48:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 9:27 AM Jiri Kosina <[email protected]> wrote:
>
> From: Jiri Kosina <[email protected]>
>
> There are possibilities [1] how mincore() could be used as a converyor of
> a sidechannel information about pagecache metadata.

Can we please just limit it to vma's that are either anonymous, or map
a file that the user actually owns?

Then the capability check could be for "override the file owner check"
instead, which makes tons of sense.

No new sysctl's for something like this, please.

Linus

2019-01-05 19:59:22

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Hi Jiri,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Jiri-Kosina/mm-mincore-allow-for-making-sys_mincore-privileged/20190106-013707
config: microblaze-nommu_defconfig (attached as .config)
compiler: microblaze-linux-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=microblaze

All errors (new ones prefixed by >>):

>> kernel/sysctl.o:(.data+0x67c): undefined reference to `sysctl_mincore_privileged'

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (1.02 kB)
.config.gz (12.27 kB)
Download all attachments

2019-01-05 20:16:08

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, 5 Jan 2019, Linus Torvalds wrote:

> > There are possibilities [1] how mincore() could be used as a converyor of
> > a sidechannel information about pagecache metadata.
>
> Can we please just limit it to vma's that are either anonymous, or map
> a file that the user actually owns?
>
> Then the capability check could be for "override the file owner check"
> instead, which makes tons of sense.

Makes sense.

I am still not completely sure what to return in such cases though; we can
either blatantly lie and always pretend that the pages are resident (to
avoid calling process entering some prefaulting mode), or return -ENOMEM
for mappings of files that don't belong to the user (in case it's not
CAP_SYS_ADMIN one).

--
Jiri Kosina
SUSE Labs


2019-01-05 20:16:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 11:46 AM Linus Torvalds
<[email protected]> wrote:
>
> Can we please just limit it to vma's that are either anonymous, or map
> a file that the user actually owns?

.. or slightly simpler: a file that the user opened for writing.

IOW, some (TOTALLY UNTESTED!) patch like this?

Linus


Attachments:
patch.diff (1.13 kB)

2019-01-05 20:19:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

[ Crossed emails ]

On Sat, Jan 5, 2019 at 12:12 PM Jiri Kosina <[email protected]> wrote:
>
> I am still not completely sure what to return in such cases though; we can
> either blatantly lie and always pretend that the pages are resident

That's what my untested patch did. Or maybe just claim they are all
not present?

And again, that patch was entirely untested, so it may be garbage and
have some fundamental problem. I also don't know exactly what rule
might make most sense, but "you can write to the file" certainly to me
implies that you also could know what parts of it are in-core.

Who actually _uses_ mincore()? That's probably the best guide to what
we should do. Maybe they open the file read-only even if they are the
owner, and we really should look at file ownership instead.

I tried to make that "can_do_mincore()" function easy to understand
and easy to just modify to some sane state.

Again, my patch is meant as a "perhaps something like this?" rather
than some "this is _exactly_ how it must be done". Take the patch as a
quick suggestion, not some final answer.

Linus

2019-01-05 20:44:50

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, 5 Jan 2019, Linus Torvalds wrote:

> > I am still not completely sure what to return in such cases though; we can
> > either blatantly lie and always pretend that the pages are resident
>
> That's what my untested patch did. Or maybe just claim they are all
> not present?

Thinking about it a little bit more, I believe Vlastimil has a good point
with 'non present' potentially causing more bogus activity in userspace in
response (in an effort to actually make them present, and failing
indefinitely).

IOW, I think it's a reasonable expectation that the common scenario is
"check if it's present, and if not, try to fault it in" instead of "check
if it's present, and if it is, try to evict it".

> And again, that patch was entirely untested, so it may be garbage and
> have some fundamental problem.

I will be travelling for next ~24 hours, but I have just asked our QA guys
to run it through some basic battery of testing (which will probably
happen on monday anyway).

> I also don't know exactly what rule might make most sense, but "you can
> write to the file" certainly to me implies that you also could know what
> parts of it are in-core.

I think it's reasonable; I can't really imagine any sidechannel to a
global state be possibly mounted on valid R/W mappings. I'd guess that
probably the most interesting here are the code segments of shared
libraries, allowing to trace victim's execution.

> Who actually _uses_ mincore()? That's probably the best guide to what
> we should do. Maybe they open the file read-only even if they are the
> owner, and we really should look at file ownership instead.

Yeah, well

https://codesearch.debian.net/search?q=mincore

is a bit too much mess to get some idea quickly I am afraid.

--
Jiri Kosina
SUSE Labs


2019-01-05 21:56:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <[email protected]> wrote:
>
> > Who actually _uses_ mincore()? That's probably the best guide to what
> > we should do. Maybe they open the file read-only even if they are the
> > owner, and we really should look at file ownership instead.
>
> Yeah, well
>
> https://codesearch.debian.net/search?q=mincore
>
> is a bit too much mess to get some idea quickly I am afraid.

Yeah, heh.

And the first hit is 'fincore', which probably nobody cares about
anyway, but it does

fd = open (name, O_RDONLY)
..
mmap(window, len, PROT_NONE, MAP_PRIVATE, ..

so if we want to keep that working, we'd really need to actually check
file ownership rather than just looking at f_mode.

But I don't know if anybody *uses* and cares about fincore, and it's
particularly questionable for non-root users.

And the Android go runtime code seems to oddly use mincore to figure
out page size:

// try using mincore to detect the physical page size.
// mincore should return EINVAL when address is not a multiple of
system page size.

which is all kinds of odd, but whatever.. Why mincore, rather than
something sane and obvious like mmap? Don't ask me...

Anyway, the Debian code search just results in mostly non-present
stuff. It's sad that google code search is no more. It was great for
exactly these kinds of questions.

The mono runtime seems to have some mono_pages_not_faulted() function,
but I don't know if people use it for file mappings, and I couldn't
find any interesting users of it.

I didn't find anything that seems to really care, but I gave up after
a few pages of really boring stuff.

Linus

2019-01-05 22:58:04

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 6:27 PM Jiri Kosina <[email protected]> wrote:
> There are possibilities [1] how mincore() could be used as a converyor of
> a sidechannel information about pagecache metadata.
>
> Provide vm.mincore_privileged sysctl, which makes it possible to mincore()
> start returning -EPERM in case it's invoked by a process lacking
> CAP_SYS_ADMIN.
>
> The default behavior stays "mincore() can be used by anybody" in order to
> be conservative with respect to userspace behavior.
>
> [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/

Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read()
handler) is less problematic because it only returns data about the
state of page tables, and doesn't query the address_space? In other
words, it permits monitoring evictions, but non-intrusively detecting
that something has been loaded into memory by another process is
harder?

2019-01-05 23:09:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 2:55 PM Jann Horn <[email protected]> wrote:
>
> Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read()
> handler) is less problematic because it only returns data about the
> state of page tables, and doesn't query the address_space? In other
> words, it permits monitoring evictions, but non-intrusively detecting
> that something has been loaded into memory by another process is
> harder?

Yes. I think it was brought up during the original report, but to use
the pagemap for this, you'd basically need to first populate all the
pages, and then wait for pageout.

So pagemap *does* leak very similar data, but it's much harder to use
as an attack vector.

That said, I think "mincore()" is just the simple one. You *can* (and
this was also discussed on the security list) do things like

- fault in a single page

- the kernel will opportunistically fault in pages that it already
has available _around_ that page.

- then use pagemap (or just _timing_ - no real kernel support needed)
to see if those pages are now mapped in your address space

so honestly, mincore is just the "big hammer" and easy way to get at
some of this data. But it's probably worth closing exactly because
it's easy. There are other ways to get at the "are these pages mapped"
information, but they are a lot more combersome to use.

Side note: maybe we could just remove the "__mincore_unmapped_range()"
thing entirely, and basically make mincore() do what pagemap does,
which is to say "are the pages mapped in this VM".

That would be nicer than my patch, simply because removing code is
always nice. And arguably it's a better semantic anyway.

Linus

2019-01-05 23:11:28

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, 5 Jan 2019, Jann Horn wrote:

> > Provide vm.mincore_privileged sysctl, which makes it possible to mincore()
> > start returning -EPERM in case it's invoked by a process lacking
> > CAP_SYS_ADMIN.
> >
> > The default behavior stays "mincore() can be used by anybody" in order to
> > be conservative with respect to userspace behavior.
> >
> > [1] https://www.theregister.co.uk/2019/01/05/boffins_beat_page_cache/
>
> Just checking: I guess /proc/$pid/pagemap (iow, the pagemap_read()
> handler) is less problematic because it only returns data about the
> state of page tables, and doesn't query the address_space? In other
> words, it permits monitoring evictions, but non-intrusively detecting
> that something has been loaded into memory by another process is
> harder?

So I was just about to immediately reply that we don't expose pagemap
anymore due to rowhammer, but apparently that's not true any more
(this behavioud was originally introduced by ab676b7d6fbf, but then
changed via 1c90308e7a77 (and further adjusted for swap entries in
ab6ecf247a, but I guess that's not all that interesting).

Hmm.

But unless it has been mapped with MAP_POPULATE (whcih is outside the
attacker's control), there is no guarantee that the mappings would
actually be there, right?

--
Jiri Kosina
SUSE Labs


2019-01-05 23:19:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 3:05 PM Linus Torvalds
<[email protected]> wrote:
>
> That would be nicer than my patch, simply because removing code is
> always nice. And arguably it's a better semantic anyway.

Yeah, I wonder why we did that thing where mincore() walks the page
tables, but if they are empty it looks in the page cache.

[... goes and looks in history ..]

It goes back to forever, it looks like. I can't find a reason.

Anyway, a removal patch would look something like the attached, I
think. That makes mincore() actually say how many pages are in _this_
mapping, not how many pages could be paged in without doing IO.

Hmm. Maybe we should try this first. Simplicity is always good.

Again, obviously untested.

Linus


Attachments:
patch.diff (2.65 kB)

2019-01-05 23:31:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds
<[email protected]> wrote:
>
> It goes back to forever, it looks like. I can't find a reason.

Our man-pages talk abouit the "without doing IO" part. That may be the
result of our code, though, not the reason for it.

The BSD man-page has other flags, but doesn't describe what "in core"
really means:

MINCORE_INCORE Page is in core (resident).

MINCORE_REFERENCED Page has been referenced by us.

MINCORE_MODIFIED Page has been modified by us.

MINCORE_REFERENCED_OTHER Page has been referenced.

MINCORE_MODIFIED_OTHER Page has been modified.

MINCORE_SUPER Page is part of a large (``super'') page.

but the fact that it has MINCORE_MODIFIED_OTHER does obviously imply
that yes, historically it really did look up the pages elsewhere, not
just in the page tables.

Still, maybe we can get away with just making it be about our own page
tables. That would be lovely.

Linus

2019-01-05 23:41:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds
<[email protected]> wrote:
>
> It goes back to forever, it looks like. I can't find a reason.

mincore() was originally added in 2.3.52pre3, it looks like. Around
2000 or so. But sadly before the BK history.

And that comment about

"Later we can get more picky about what "in core" means precisely."

that still exists above mincore_page() goes back to the original patch.

Linus

2019-01-06 00:14:01

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 05, 2019 at 03:39:10PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > It goes back to forever, it looks like. I can't find a reason.
>
> mincore() was originally added in 2.3.52pre3, it looks like. Around
> 2000 or so. But sadly before the BK history.
>
> And that comment about
>
> "Later we can get more picky about what "in core" means precisely."
>
> that still exists above mincore_page() goes back to the original patch.

FreeBSD claims to have a manpage from SunOS 4.1.3 with mincore (!)

https://www.freebsd.org/cgi/man.cgi?query=mincore&apropos=0&sektion=0&manpath=SunOS+4.1.3&arch=default&format=html

DESCRIPTION
mincore() returns the primary memory residency status of pages in the
address space covered by mappings in the range [addr, addr + len). The
status is returned as a char-per-page in the character array referenced
by *vec (which the system assumes to be large enough to encompass all
the pages in the address range). The least significant bit of each
character is set to 1 to indicate that the referenced page is in pri-
mary memory, 0 if it is not. The settings of other bits in each char-
acter is undefined and may contain other information in the future.


2019-01-06 00:24:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 4:11 PM Matthew Wilcox <[email protected]> wrote:
>
> FreeBSD claims to have a manpage from SunOS 4.1.3 with mincore (!)
>
> https://www.freebsd.org/cgi/man.cgi?query=mincore&apropos=0&sektion=0&manpath=SunOS+4.1.3&arch=default&format=html
>
> DESCRIPTION
> mincore() returns the primary memory residency status of pages in the
> address space covered by mappings in the range [addr, addr + len).

It's still not clear that "primary memory residency status" actually means.

Does it mean "mapped", or does it mean "exists in caches and doesn't need IO".

I don't even know what kind of caches SunOS 4.1.3 had. The Linux
implementation depends on the page cache, and wouldn't work (at least
not very well) in a system that has a traditional disk buffer cache.

Anyway, I guess it's mostly moot. From a "does this cause regressions"
standpoint, the only thing that matters is really whatever Linux
programs that have used this since it was introduced 18+ years ago.

But I think my patch to just rip out all that page lookup, and just
base it on the page table state has the fundamental advantage that it
gets rid of code. Maybe I should jst commit it, and see if anything
breaks? We do have options in case things break, and then we'd at
least know who cares (and perhaps a lot more information of _why_ they
care).

Linus

2019-01-06 01:52:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 4:22 PM Linus Torvalds
<[email protected]> wrote:
>
> But I think my patch to just rip out all that page lookup, and just
> base it on the page table state has the fundamental advantage that it
> gets rid of code. Maybe I should jst commit it, and see if anything
> breaks? We do have options in case things break, and then we'd at
> least know who cares (and perhaps a lot more information of _why_ they
> care).

Slightly updated patch in case somebody wants to try things out.

Also, any comments about the pmd_trans_unstable() case?

Linus


Attachments:
patch.diff (3.44 kB)

2019-01-06 11:39:09

by Kevin Easton

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <[email protected]> wrote:
> >
> > > Who actually _uses_ mincore()? That's probably the best guide to what
> > > we should do. Maybe they open the file read-only even if they are the
> > > owner, and we really should look at file ownership instead.
> >
> > Yeah, well
> >
> > https://codesearch.debian.net/search?q=mincore
> >
> > is a bit too much mess to get some idea quickly I am afraid.

> Anyway, the Debian code search just results in mostly non-present
> stuff. It's sad that google code search is no more. It was great for
> exactly these kinds of questions.

If you select the "Group search results by Debian source package"
option on the search results page it makes it a lot easier to skim
through.

It looks to me like Firefox is expecting mincore() not to fail on
libraries that it has mapped:

https://sources.debian.org/src/firefox-esr/60.4.0esr-1/mozglue/linker/BaseElf.cpp/?hl=98#L98

- Kevin
>
> The mono runtime seems to have some mono_pages_not_faulted() function,
> but I don't know if people use it for file mappings, and I couldn't
> find any interesting users of it.
>
> I didn't find anything that seems to really care, but I gave up after
> a few pages of really boring stuff.
>
> Linus
>
>

2019-01-06 21:48:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds
<[email protected]> wrote:
>
> Slightly updated patch in case somebody wants to try things out.

I decided to just apply that patch. It is *not* marked for stable,
very intentionally, because I expect that we will need to wait and see
if there are issues with it, and whether we might have to do something
entirely different (more like the traditional behavior with some extra
"only for owner" logic).

But doing a test patch during the merge window (which is about to
close) sounds like the right thing to do.

Linus

2019-01-07 04:35:59

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds wrote on Sat, Jan 05, 2019:
> But I think my patch to just rip out all that page lookup, and just
> base it on the page table state has the fundamental advantage that it
> gets rid of code. Maybe I should jst commit it, and see if anything
> breaks? We do have options in case things break, and then we'd at
> least know who cares (and perhaps a lot more information of _why_ they
> care).

There actually are many tools like fincore which depend on mincore to
try to tell whether a file is "loaded in cache" or not (I personally use
vmtouch[1], but I know of at least nocache[2] uses it as well to only
try to evict used pages)

[1] https://hoytech.com/vmtouch/
[2] https://github.com/Feh/nocache


I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or
prefetch/lock whole files so my "production" use-cases don't actually
rely on the mincore part of them; but when playing with these actions
it's actually fairly useful to be able to visualize which part of a file
ended in cache or monitor how a file's content evolve in cache...

There are various non-obvious behaviours where being able to poke around
is enlightening (e.g. fadvise dontneed is actually a hint, so even if
nothing uses the file linux sometimes keep the data around if it thinks
that would be useful and nocache has a mode to call fadvise multiple
times and things like that...)


Anyway, I agree the use of mincore for this is rather ugly, and
frankly some "cache management API" might be better in the long run if
only for performance reason (don't try these tools on a hundred TB
sparse file...), but until that pipe dream comes true I think mincore as
it was is useful for system admins.



Linus Torvalds wrote on Sun, Jan 06, 2019:
> I decided to just apply that patch. It is *not* marked for stable,
> very intentionally, because I expect that we will need to wait and see
> if there are issues with it, and whether we might have to do something
> entirely different (more like the traditional behavior with some extra
> "only for owner" logic).

FWIW I personally don't care much about "only for owner" or depending on
mmap options; I don't understand much of the security implications
honestly so I'm not sure how these limitations actually help.
On the other hand, a simple CAP_SYS_ADMIN check making the call take
either behaviour should be safe and would cover what I described above.


(by the way, while we are discussing permissions, a regular user can use
fadvise dontneed on files it doesn't own as well as long as it can open
them for reading; I'm not sure if that would need restricting as well in
the context of the security issue. Frankly even with mincore someone
could likely tell the difference through timing, if they just do it a
few times. Do magic, probe, flush out, repeat until satisfied.)


Thanks,
--
Dominique Martinet | Asmadeus

2019-01-07 10:13:56

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds <[email protected]> writes:

> On Sat, Jan 5, 2019 at 3:16 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> It goes back to forever, it looks like. I can't find a reason.
>
> mincore() was originally added in 2.3.52pre3, it looks like. Around
> 2000 or so. But sadly before the BK history.

Yeah, it's here in the commit titled "Import 2.3.52pre3" (takes a second
or two to load):

https://github.com/mpe/linux-fullhistory/commit/a1bcda3256956318c95c8da8bee09f79190bb034#diff-fd2d793b8b4760b4887c8c7bbb3451d7R1730

But no further detail.

(Instructions for using that tree https://github.com/mpe/linux-fullhistory/wiki)

cheers

2019-01-07 10:34:41

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On 1/7/19 5:32 AM, Dominique Martinet wrote:
> Linus Torvalds wrote on Sat, Jan 05, 2019:
>> But I think my patch to just rip out all that page lookup, and just
>> base it on the page table state has the fundamental advantage that it
>> gets rid of code. Maybe I should jst commit it, and see if anything
>> breaks? We do have options in case things break, and then we'd at
>> least know who cares (and perhaps a lot more information of _why_ they
>> care).
>
> There actually are many tools like fincore which depend on mincore to
> try to tell whether a file is "loaded in cache" or not (I personally use
> vmtouch[1], but I know of at least nocache[2] uses it as well to only
> try to evict used pages)

nocache could probably do fine without mincore. IIUC the point is to not
evict anything that was already resident prior to running some command
wrapped in nocache. Without the mincore checks,
posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that
others have mapped. That means without mincore() it will drop data
that's in cache but not currently in use by anybody, which shouldn't
cause large performance regressions?

> [1] https://hoytech.com/vmtouch/
> [2] https://github.com/Feh/nocache
>
>
> I mostly use these to either fadvise(POSIX_FADV_DONTNEED) or
> prefetch/lock whole files so my "production" use-cases don't actually
> rely on the mincore part of them;

Ah so you seem to confirm my above point.

...

> FWIW I personally don't care much about "only for owner" or depending on
> mmap options; I don't understand much of the security implications
> honestly so I'm not sure how these limitations actually help.
> On the other hand, a simple CAP_SYS_ADMIN check making the call take
> either behaviour should be safe and would cover what I described above.

So without CAP_SYS_ADMIN, mincore() would return mapping status, and
with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy
:( Maybe if we introduced mincore2() with flags similar to BSD mentioned
earlier in the thread, and the cache residency flag would require
CAP_SYS_ADMIN or something similar.
> (by the way, while we are discussing permissions, a regular user can use
> fadvise dontneed on files it doesn't own as well as long as it can open
> them for reading; I'm not sure if that would need restricting as well in
> the context of the security issue.

Probably not, as I've mentioned it won't evict what's mapped by somebody
else. And eviction is also possible via controlling LRU, which is what
the paper [1] does anyway (and also mentions that DONTNEED doesn't
work). Being able to evict somebody's page is AFAIU not sufficient for
attack, the side channel is about knowing that somebody brought that
page back to RAM by touching it.

> Frankly even with mincore someone
> could likely tell the difference through timing, if they just do it a
> few times. Do magic, probe, flush out, repeat until satisfied.)

That's my bigger concern here. In [1] there's described a remote attack
(on webserver) using the page fault timing differences for present/not
present page cache pages. Noisy but works, and I expect locally it to be
much less noisy. Yet the countermeasures section only mentions
restricting mincore() as if it was sufficient (and also how to make
evictions harder, but that's secondary IMHO).

[1] https://arxiv.org/abs/1901.01161

>
> Thanks,
>


2019-01-07 11:57:50

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Vlastimil Babka wrote on Mon, Jan 07, 2019:
> On 1/7/19 5:32 AM, Dominique Martinet wrote:
> > Linus Torvalds wrote on Sat, Jan 05, 2019:
> >> But I think my patch to just rip out all that page lookup, and just
> >> base it on the page table state has the fundamental advantage that it
> >> gets rid of code. Maybe I should jst commit it, and see if anything
> >> breaks? We do have options in case things break, and then we'd at
> >> least know who cares (and perhaps a lot more information of _why_ they
> >> care).
> >
> > There actually are many tools like fincore which depend on mincore to
> > try to tell whether a file is "loaded in cache" or not (I personally use
> > vmtouch[1], but I know of at least nocache[2] uses it as well to only
> > try to evict used pages)
>
> nocache could probably do fine without mincore. IIUC the point is to not
> evict anything that was already resident prior to running some command
> wrapped in nocache. Without the mincore checks,
> posix_fadvise(POSIX_FADV_DONTNEED) will still not drop anything that
> others have mapped. That means without mincore() it will drop data
> that's in cache but not currently in use by anybody, which shouldn't
> cause large performance regressions?

Yes, I sent them a patch to allow this behaviour (which got merged ages
ago); but the default nocache usage will try to preserve pages that were
mapped before even if the user mapped it themselves so it's a little bit
more robust that just trusting linux to do the right thing. I did
various tests with fadvise dontneed and honestly the way it works is far
from obvious when operating as a single user at least; I didn't test
multi-users as that didn't really concern me.

With the current mincore change, it will think everything was "in core"
and not flush anything unless my option to just fadvise dontneed
everything is passed though ; so even if we can make it work it is a
change of behaviour that is breaking an existing application, and it has
no way of telling it didn't work.


Honestly though, as I said, mincore() is much more useful for debugging
for me ; the application can be changed if required. I just pointed it
out as it'll need changing, and it has no obvious way of testing at
runtime if the syscall works (except dumb kernel version check, but that
won't work with stable backports); so it's not that obvious.

> > FWIW I personally don't care much about "only for owner" or depending on
> > mmap options; I don't understand much of the security implications
> > honestly so I'm not sure how these limitations actually help.
> > On the other hand, a simple CAP_SYS_ADMIN check making the call take
> > either behaviour should be safe and would cover what I described above.
>
> So without CAP_SYS_ADMIN, mincore() would return mapping status, and
> with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy
> :( Maybe if we introduced mincore2() with flags similar to BSD mentioned
> earlier in the thread, and the cache residency flag would require
> CAP_SYS_ADMIN or something similar.

I agree, that's rather clumsy... Or rather might lead to some unexpected
behaviours. I'm open to other ideas.
I'm not sure how the BSD flags help though?

> > (by the way, while we are discussing permissions, a regular user can use
> > fadvise dontneed on files it doesn't own as well as long as it can open
> > them for reading; I'm not sure if that would need restricting as well in
> > the context of the security issue.
>
> Probably not, as I've mentioned it won't evict what's mapped by somebody
> else. And eviction is also possible via controlling LRU, which is what
> the paper [1] does anyway (and also mentions that DONTNEED doesn't
> work). Being able to evict somebody's page is AFAIU not sufficient for
> attack, the side channel is about knowing that somebody brought that
> page back to RAM by touching it.

Thanks for the link to the paper, I hadn't taken the time to extract it
from the news article but it's much more interesting indeed.
Some of what's been said here makes more sense to me now (checking
about ownership and similar might indeed be good enough).

For evicting page I agree most targets would be different users so I
guess that makes sense as well; there seem to be other ways of
efficiently evicting a page from cache anyway.

(BTW, this was brought up earlier, the paper also mentions
/proc/self/pagemap as not being useful.)


> > Frankly even with mincore someone
> > could likely tell the difference through timing, if they just do it a
> > few times. Do magic, probe, flush out, repeat until satisfied.)

(I meant "without mincore", obviously, as you understood properly)

> That's my bigger concern here. In [1] there's described a remote attack
> (on webserver) using the page fault timing differences for present/not
> present page cache pages. Noisy but works, and I expect locally it to be
> much less noisy. Yet the countermeasures section only mentions
> restricting mincore() as if it was sufficient (and also how to make
> evictions harder, but that's secondary IMHO).

I'd suggest making clock rougher for non-root users but javascript tried
that and it wasn't enough... :)
Honestly won't be of much help there, good luck?

Thanks,
--
Dominique

2019-01-07 12:04:29

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On 1/7/19 12:08 PM, Dominique Martinet wrote:
>
> With the current mincore change, it will think everything was "in core"
> and not flush anything unless my option to just fadvise dontneed
> everything is passed though ; so even if we can make it work it is a
> change of behaviour that is breaking an existing application, and it has
> no way of telling it didn't work.

IIUC the current change is commit 574823bfab82 ("Change mincore() to count
"mapped" pages rather than "cached" pages") which will not pretend
everything
is "in core", but only pages that the calling process has populated page
table mapping for (which implies in core, but the opposite doesn't
hold). "nocache" most certainly doesn't populate the mappings before
calling mincore(), as that would bring pages to page cache and defeat
the purpose of determining if they were already there prior the nocache
execution. Instead it will think that nothing was "in core", and thus
later call fadvise dontneed or everything, but as I've said earlier that
shouldn't matter much.

> Honestly though, as I said, mincore() is much more useful for debugging
> for me ; the application can be changed if required. I just pointed it
> out as it'll need changing, and it has no obvious way of testing at
> runtime if the syscall works (except dumb kernel version check, but that
> won't work with stable backports); so it's not that obvious.

Agree.

>>> FWIW I personally don't care much about "only for owner" or depending on
>>> mmap options; I don't understand much of the security implications
>>> honestly so I'm not sure how these limitations actually help.
>>> On the other hand, a simple CAP_SYS_ADMIN check making the call take
>>> either behaviour should be safe and would cover what I described above.
>>
>> So without CAP_SYS_ADMIN, mincore() would return mapping status, and
>> with CAP_SYS_ADMIN, it would return cache residency status? Very clumsy
>> :( Maybe if we introduced mincore2() with flags similar to BSD mentioned
>> earlier in the thread, and the cache residency flag would require
>> CAP_SYS_ADMIN or something similar.
>
> I agree, that's rather clumsy... Or rather might lead to some unexpected
> behaviours. I'm open to other ideas.
> I'm not sure how the BSD flags help though?

Definitely it would be a long-term solution, introducing new API,
waiting for userspace to use it... and meanwhile we would have to keep
the status quo or some kind of the clumsy/subtle approach.

>>> (by the way, while we are discussing permissions, a regular user can use
>>> fadvise dontneed on files it doesn't own as well as long as it can open
>>> them for reading; I'm not sure if that would need restricting as well in
>>> the context of the security issue.
>>
>> Probably not, as I've mentioned it won't evict what's mapped by somebody
>> else. And eviction is also possible via controlling LRU, which is what
>> the paper [1] does anyway (and also mentions that DONTNEED doesn't
>> work). Being able to evict somebody's page is AFAIU not sufficient for
>> attack, the side channel is about knowing that somebody brought that
>> page back to RAM by touching it.
>
> Thanks for the link to the paper, I hadn't taken the time to extract it
> from the news article but it's much more interesting indeed.

It went public only this morning, the article was older :)


2019-01-07 17:28:29

by Daniel Gruss

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On 1/7/19 12:08 PM, Dominique Martinet wrote:
>> That's my bigger concern here. In [1] there's described a remote attack
>> (on webserver) using the page fault timing differences for present/not
>> present page cache pages. Noisy but works, and I expect locally it to be
>> much less noisy. Yet the countermeasures section only mentions
>> restricting mincore() as if it was sufficient (and also how to make
>> evictions harder, but that's secondary IMHO).
>
> I'd suggest making clock rougher for non-root users but javascript tried
> that and it wasn't enough... :)
> Honestly won't be of much help there, good luck?

Restricting mincore() is sufficient to fix the hardware-agnostic part.
If the attack is not hardware-agnostic anymore, an attacker could also
just use a hardware cache attack, which has a higher temporal and
spatial resolution, so there's no reason why the attacker would use page
cache attacks instead then.


Cheers,
Daniel

2019-01-08 04:47:25

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sun, Jan 06, 2019 at 01:46:37PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > Slightly updated patch in case somebody wants to try things out.
>
> I decided to just apply that patch. It is *not* marked for stable,
> very intentionally, because I expect that we will need to wait and see
> if there are issues with it, and whether we might have to do something
> entirely different (more like the traditional behavior with some extra
> "only for owner" logic).

So, I read the paper and before I was half way through it I figured
there are a bunch of other similar page cache invalidation attacks
we can perform without needing mincore. i.e. Focussing on mmap() and
mincore() misses the wider issues we have with global shared caches.

My first thought:

fd = open(some_file, O_RDONLY);
iov[0].iov_base = buf;
iov[0].iov_len = 1;
ret = preadv2(fd, iov, 1, off, RWF_NOWAIT);
switch (ret) {
case -EAGAIN:
/* page not resident in page cache */
break;
case 1:
/* page resident in page cache */
break;
default:
/* beyond EOF or some other error */
break;
}

This is "as documented" in the man page for preadv2:

RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available.
If this flag is specified, the preadv2() system call will
return instantly if it would have to read data from the
backing storage or wait for a lock. If some data was
successfully read, it will return the number of bytes read.
If no bytes were read, it will return -1 and set errno to
EAGAIN. Currently, this flag is meaningful only for
preadv2().

IOWs, we've explicitly designed interfaces to communicate whether
data is "immediately accessible" or not to the application so they
can make sensible decisions about IO scheduling. i.e. IO won't
block the main application processing loop and so can be scheduled in
the background by the app and the data processed when that IO
returns. That just so happens to be exactly the same information
about the page cache that mincore is making available to userspace.

If we "remove" this information from the interfaces like it has been
done for mincore(), it doesn't mean userspace can't get it in other
ways. e.g. it now just has to time the read(2) syscall duration and
infer whether the data came from the page cache or disk from the
timing information.

IMO, there's nothing new or novel about this page cache information
leak - it was just a matter of time before some researcher put 2 and
2 together and realised that sharing the page cache across a
security boundary is little different from sharing deduplicated
pages across those same security boundaries. i.e. As long as we
shared caches across security boundaries and userspace can control
both cache invalidation and instantiation, we cannot prevent
userspace from constructing these invalidate+read information
exfiltration scenarios.

And then there is overlayfs. Overlay is really just a way to
efficiently share the page cache of the same underlying read-only
directory tree across all containers on a host. i.e. we have been
specifically designing our container hosting systems to share the
underlying read-only page cache across all security boundaries on
the host. If overlay is vulnerable to these shared page cache
attacks (seems extremely likely) then we've got much bigger problems
than mincore to think about....

> But doing a test patch during the merge window (which is about to
> close) sounds like the right thing to do.

IMO it seems like the wrong thing to do. It's just a hacky band-aid
over a specific extraction method and does nothing to reduce the
actual scope of the information leak. Focussing on the band-aid
means you've missed all the other avenues that the same information
is exposed and all the infrastructure we've build on the core
concept of sharing kernel side pages across security boundaries.

And that's even without considering whether the change breaks
userspace. Which it does. e.g. vmtouch is fairly widely used to
manage page cache instantiation for rapid bring-up and migration of
guest VMs and containers. They save the hot page cache information
from a running container and then using that to instantiate the page
cache in new instances running the same workload so they run at full
speed right from the start. This use case calls mincore() to pull
the page cache information from the running container.

If anyone else proposed merging a syscall implementation change that
was extremely likely to break userspace you'd be shouting at them
that "we don't break userspace"....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-08 08:59:54

by Kevin Easton

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 12:43 PM Jiri Kosina <[email protected]> wrote:
> >
> > > Who actually _uses_ mincore()? That's probably the best guide to what
> > > we should do. Maybe they open the file read-only even if they are the
> > > owner, and we really should look at file ownership instead.
> >
> > Yeah, well
> >
> > https://codesearch.debian.net/search?q=mincore
> >
> > is a bit too much mess to get some idea quickly I am afraid.
>
> Yeah, heh.
>
> And the first hit is 'fincore', which probably nobody cares about
> anyway, but it does
>
> fd = open (name, O_RDONLY)
> ..
> mmap(window, len, PROT_NONE, MAP_PRIVATE, ..
>
> so if we want to keep that working, we'd really need to actually check
> file ownership rather than just looking at f_mode.
>
> But I don't know if anybody *uses* and cares about fincore, and it's
> particularly questionable for non-root users.
>
...
> I didn't find anything that seems to really care, but I gave up after
> a few pages of really boring stuff.

I've gone through everything in the Debian code search, and this is the
stuff that seems like it would be affected at all by the current patch:

util-linux
Contains 'fincore' as already noted above.

e2fsprogs
e4defrag tries to drop pages that it caused to be loaded into the
page cache, but it's not clear that this ever worked as designed
anyway (it calls mincore() before ioctl(fd, EXT4_IOC_MOVE_EXT ..)
but then after the sync_file_range it drops the pages that *were*
in the page cache at the time of mincore()).

pgfincore
postgresql extension used to try to dump/restore page cache status
of database backing files across reboots. It uses a fresh mapping
with mincore() to try to determine the current page cache status of
a file.

nocache
LD_PRELOAD library that tries to drop any pages that the victim
program has caused to be loaded into the page cache, uses mincore
on a fresh mapping to see what was resident beforehand. Also
includes 'cachestats' command that's basically another 'fincore'.

xfsprogs
xfs_io has a 'mincore' sub-command that is roughly equivalent to
'fincore'.

vmtouch
vmtouch is "Portable file system cache diagnostics and control",
among other things it implements 'fincore' type functionality, and
one of its touted use-cases is "Preserving virtual memory profile
when failing over servers".

qemu
qemu uses mincore() with a fresh PROT_NONE, MAP_PRIVATE mapping to
implement the "x-check-cache-dropped" option.
( https://patchwork.kernel.org/patch/10395865/ )

(Everything else I could see was either looking at anonymous VMAs, its
own existing mapping that it's been using for actual IO, or was just
using mincore() to see if an address was part of any mapping at all).

- Kevin

2019-01-08 09:17:58

by Bernd Petrovitsch

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On 05/01/2019 20:38, Vlastimil Babka wrote:
[...]
> I was thinking about "return true" here, assuming that userspace generally wants
> to ensure itself there won't be page faults when it starts doing something
> critical, and if it sees a "false" it will try to do some kind of prefaulting,
> possibly in a loop. There might be somebody trying to make sure something is out

Isn't that racy by design as the pages may get flushed out after the check?
Shouldn't the application use e.g. mlock()/.... to guarantee no page
faults in the first place?

MfG,
Bernd
--
Bernd Petrovitsch Email : [email protected]
LUGA : http://www.luga.at

2019-01-08 11:39:11

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Tue, 8 Jan 2019, Bernd Petrovitsch wrote:

> Shouldn't the application use e.g. mlock()/.... to guarantee no page
> faults in the first place?

Calling mincore() on pages you've just mlock()ed is sort of pointless
though.

--
Jiri Kosina
SUSE Labs


2019-01-08 13:55:56

by Bernd Petrovitsch

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On 08/01/2019 12:37, Jiri Kosina wrote:
> On Tue, 8 Jan 2019, Bernd Petrovitsch wrote:
>
>> Shouldn't the application use e.g. mlock()/.... to guarantee no page
>> faults in the first place?
>
> Calling mincore() on pages you've just mlock()ed is sort of pointless
> though.

Obviously;-)

Sorry for being unclear above: If I want my application to
avoid suffering from page faults, I use simply mlock()
(and/or friends) to nail the relevant pages into physical
RAM and not "look if they are out, if yes, get them in" which
has also the risk that these important pages are too soon
evicted again.

But perhaps I'm missing some very special use cases ....

MfG,
Brend
--
Bernd Petrovitsch Email : [email protected]
LUGA : http://www.luga.at

2019-01-08 14:52:22

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Tue, Jan 08, 2019 at 02:53:00PM +0100, Bernd Petrovitsch wrote:
> On 08/01/2019 12:37, Jiri Kosina wrote:
> > On Tue, 8 Jan 2019, Bernd Petrovitsch wrote:
> >
> >> Shouldn't the application use e.g. mlock()/.... to guarantee no page
> >> faults in the first place?
> >
> > Calling mincore() on pages you've just mlock()ed is sort of pointless
> > though.
>
> Obviously;-)
>
> Sorry for being unclear above: If I want my application to
> avoid suffering from page faults, I use simply mlock()
> (and/or friends) to nail the relevant pages into physical
> RAM and not "look if they are out, if yes, get them in" which
> has also the risk that these important pages are too soon
> evicted again.

Note, that mlock() doesn't prevent minor page faults. Mlocked memory is
still subject to mechanisms that makes the page temporary unmapped. For
instance migration (including NUMA balancing), compaction, khugepaged...

--
Kirill A. Shutemov

2019-01-08 18:54:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Mon, Jan 7, 2019 at 8:43 PM Dave Chinner <[email protected]> wrote:
>
> So, I read the paper and before I was half way through it I figured
> there are a bunch of other similar page cache invalidation attacks
> we can perform without needing mincore. i.e. Focussing on mmap() and
> mincore() misses the wider issues we have with global shared caches.

Oh, agreed, and that was discussed in the original report too.

The thing is, you can also depend on our pre-faulting of pages in the
page fault handler, and use that to get the cached status of nearby
pages. So do something like "fault one page, then do mincore() to see
how many pages near it were mapped". See our "do_fault_around()"
logic.

But mincore is certainly the easiest interface, and the one that
doesn't require much effort or setup. It's also the one where our old
behavior was actually arguably simply stupid and actively wrong (ie
"in caches" isn't even strictly speaking a valid question, since the
caches in question may be invalid). So let's try to see if giving
mincore() slightly more well-defined semantics actually causes any
pain.

I do think that the RWF_NOWAIT case might also be interesting to look at.

Linus

2019-01-09 02:35:50

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, 9 Jan 2019, Dave Chinner wrote:

> > But mincore is certainly the easiest interface, and the one that
> > doesn't require much effort or setup.
>
> Off the top of my head, here's a few vectors for reading the page
> cache residency state without perturbing the page cache residency
> pattern:
> - mincore
> - preadv2(RWF_NOWAIT)
> - fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls
> - madvise(MADV_RANDOM); timed read of first byte in each page

While I obviously agree that all those are creating pagecache sidechannel
in principle, I think we really should mostly focus on the first two (with
mincore() already having been covered).

Rationale has been provided by Daniel Gruss in this thread -- if the
attacker is left with cache timing as the only available vector, he's
going to be much more successful with mounting hardware cache timing
attack anyway.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-01-09 03:09:28

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Tue, Jan 08, 2019 at 09:57:49AM -0800, Linus Torvalds wrote:
> On Mon, Jan 7, 2019 at 8:43 PM Dave Chinner <[email protected]> wrote:
> >
> > So, I read the paper and before I was half way through it I figured
> > there are a bunch of other similar page cache invalidation attacks
> > we can perform without needing mincore. i.e. Focussing on mmap() and
> > mincore() misses the wider issues we have with global shared caches.
>
> Oh, agreed, and that was discussed in the original report too.
>
> The thing is, you can also depend on our pre-faulting of pages in the
> page fault handler, and use that to get the cached status of nearby
> pages. So do something like "fault one page, then do mincore() to see
> how many pages near it were mapped". See our "do_fault_around()"
> logic.

Observing fault-around could help you detect what code an application is
running, but it's not necessary (and can be turned off). Also, such
an it observation is not dependent on using mincore. neither
fault-around nor mincore are required functionality to exploit the
information leaks.

And, FWIW, fault-around actually destroys the information in the
exfiltration channel described in the paper because it perturbs the
carefully constructed page cache residency pattern that encodes the
message.

> But mincore is certainly the easiest interface, and the one that
> doesn't require much effort or setup.

Off the top of my head, here's a few vectors for reading the page
cache residency state without perturbing the page cache residency
pattern:
- mincore
- preadv2(RWF_NOWAIT)
- fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls
- madvise(MADV_RANDOM); timed read of first byte in each page

i.e. mincore is a messenger, but it's not the only trivial
observation technique available. The only difference between mincore
and the others will be the observation latency and hence channel
bandwidth.

IOWs, the question we need to focus on now is not "does breaking
mincore affect anyone", it is "how the hell do we mitigate and
isolate an information leak exposed by fundamental OS functionality
that *everything* depends on for performance"?

> It's also the one where our old
> behavior was actually arguably simply stupid and actively wrong (ie
> "in caches" isn't even strictly speaking a valid question, since the
> caches in question may be invalid).

This is irrelevant to the problem reported. Sure, mincore may be
an awful interface, but it's semantics are not the cause of the
information leak. You're just shooting the messenger...

> I do think that the RWF_NOWAIT case might also be interesting to look at.

As are all the other mechanisms you can use to observer page cache
residency without perturbing it's state.

Keep in mind that the researchers documented a remote observation
technique that leaked the information across the network to a remote
host, so this leak has much, much wider scope than changing mincore
can address...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-09 04:41:39

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 09, 2019 at 03:31:35AM +0100, Jiri Kosina wrote:
> On Wed, 9 Jan 2019, Dave Chinner wrote:
>
> > > But mincore is certainly the easiest interface, and the one that
> > > doesn't require much effort or setup.
> >
> > Off the top of my head, here's a few vectors for reading the page
> > cache residency state without perturbing the page cache residency
> > pattern:
> > - mincore
> > - preadv2(RWF_NOWAIT)
> > - fadvise(POSIX_FADV_RANDOM); timed read(2) syscalls
> > - madvise(MADV_RANDOM); timed read of first byte in each page
>
> While I obviously agree that all those are creating pagecache sidechannel
> in principle, I think we really should mostly focus on the first two (with
> mincore() already having been covered).

FWIW, I just realised that the easiest, most reliable way to
invalidate the page cache over a file range is simply to do a
O_DIRECT read on it. IOWs, all three requirements of this
information leak - highly specific, reliable cache invalidation
control, controlled cache instantiation and 3rd-party detection of
cache residency can all be performed with just the read(2)
syscall...

> Rationale has been provided by Daniel Gruss in this thread -- if the
> attacker is left with cache timing as the only available vector, he's
> going to be much more successful with mounting hardware cache timing
> attack anyway.

No, he said:

"Restricting mincore() is sufficient to fix the hardware-agnostic
part."

That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic
and provides exactly the same information about the page cache as
mincore. Timed read/mmap access loops for cache observation are
also hardware agnostic, and on fast SSD based storage will only be
marginally slower bandwidth than preadv2(RWF_NOWAIT).

Attackers will pick whatever leak vector we don't fix, so we either
fix them all (which I think is probably impossible without removing
caching altogether) or we start thinking about how we need to
isolate the page cache so that information isn't shared across
important security boundaries (e.g. page cache contents are
per-mount namespace).

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-09 10:13:52

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, 9 Jan 2019, Dave Chinner wrote:

> FWIW, I just realised that the easiest, most reliable way to invalidate
> the page cache over a file range is simply to do a O_DIRECT read on it.

Neat, good catch indeed. Still, it's only the invalidation part, but the
residency check is the crucial one.

> > Rationale has been provided by Daniel Gruss in this thread -- if the
> > attacker is left with cache timing as the only available vector, he's
> > going to be much more successful with mounting hardware cache timing
> > attack anyway.
>
> No, he said:
>
> "Restricting mincore() is sufficient to fix the hardware-agnostic
> part."
>
> That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and
> provides exactly the same information about the page cache as mincore.

Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has
"just" been overlooked. I can't speak for Daniel, but I believe he might
be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT
is sufficient ...".

> Timed read/mmap access loops for cache observation are also hardware
> agnostic, and on fast SSD based storage will only be marginally slower
> bandwidth than preadv2(RWF_NOWAIT).
>
> Attackers will pick whatever leak vector we don't fix, so we either fix
> them all (which I think is probably impossible without removing caching
> altogether)

We can't really fix the fact that it's possible to do the timing on the HW
caches though.

> or we start thinking about how we need to isolate the page cache so that
> information isn't shared across important security boundaries (e.g. page
> cache contents are per-mount namespace).

Umm, sorry for being dense, but how would that help that particular attack
scenario on a system that doesn't really employ any namespacing? (which I
still believe is a majority of the systems out there, but I might have
just missed the containers train long time ago :) ).

--
Jiri Kosina
SUSE Labs


2019-01-09 20:32:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Tue, Jan 8, 2019 at 8:39 PM Dave Chinner <[email protected]> wrote:
>
> FWIW, I just realised that the easiest, most reliable way to
> invalidate the page cache over a file range is simply to do a
> O_DIRECT read on it.

If that's the case, that's actually an O_DIRECT bug.

It should only invalidate the caches on write.

On reads, it wants to either _flush_ any direct caches before the
read, or just take the data from the caches. At no point is
"invalidate" a valid model.

Of course, I'm not in the least bit shocked if O_DIRECT is buggy like
this. But looking at least at the ext4 routine, the read just does

ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,

and I don't see any invalidation.

Having read access to a file absolutely should *not* mean that you can
flush caches on it. That's a write op.

Any filesystem that invalidates the caches on read is utterly buggy.

Can you actually point to such a thing? Let's get that fixed, because
it's completely wrong regardless of this whole mincore issue.

Linus

2019-01-10 00:45:52

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 09, 2019 at 10:25:43AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2019 at 8:39 PM Dave Chinner <[email protected]> wrote:
> >
> > FWIW, I just realised that the easiest, most reliable way to
> > invalidate the page cache over a file range is simply to do a
> > O_DIRECT read on it.
>
> If that's the case, that's actually an O_DIRECT bug.
>
> It should only invalidate the caches on write.

Sounds nice from a theoretical POV, but reality has taught us
very different lessons.

FWIW, a quick check of XFS's history so you understand how long this
behaviour has been around. It was introduced in the linux port in
2001 as direct IO support was being added:

commit e837eac23662afae603aaaef7c94bc839c1b8f67
Author: Steve Lord <[email protected]>
Date: Mon Mar 5 16:47:52 2001 +0000

Add bounds checking for direct I/O, do the cache invalidation for
data coherency on direct I/O.

This was basically a direct port of the flush+invalidation code in
the Irix direct IO path, which was introduced in 1995:

> revision 1.149
> date: 1995/08/11 20:09:44; author: ajs; state: Exp; lines: +70 -2
> 280514 Adding page cache flusing calls to make direct
> I/O coherent with buffered I/O.

IOWs, history tells us that invalidation for direct IO reads has
been done on XFS for almost 25 years. I know for certain that there
have been applications out there that depend on this
invalidation-on-read behaviour (another of those "reality bites"
lessons) so we can't just remove it because you *think* it is a bug.

i.e. we *could* remove the invalidation on read, but this we have a
major behavioural change to the XFS direct IO path. This means we
need to determine if we've just awoken sleeping data corruption
krakens as well as determine if there are any performance
regressions that result from the behavioural change.

Which brings me to validation. If the recent
clone/dedupe/copy_file_range() debacle has taught me anything, it's
that validating a "simple" IO path mechanism is going to take months
worth of machine time before we have any confidence that the change
is not going to expose users to new data corruption problems.

That's the difficulty here - it only takes 5 minutes to change
the code, but there's months of machine time needed to determine if
it's really safe to make that code change. Testing has a nasty habit
of finding invalid assumptions; when those are assumptions about
data coherency and integrity we can't test them on our users.

And, really, this would be just another band-aid over a symptom of
the information leak - it doesn't prevent users from being able to
control page cache invalidation. It just removes one method, just
like hacking mincore only removes one method of observing the page
cache. And, like mincore(), there's every chance it impacts on
userspace in a negative manner and so we need to be very careful
here.

> On reads, it wants to either _flush_ any direct caches before the
> read, or just take the data from the caches. At no point is
> "invalidate" a valid model.
>
> Of course, I'm not in the least bit shocked if O_DIRECT is buggy like
> this. But looking at least at the ext4 routine, the read just does
>
> ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
>
> and I don't see any invalidation.

I wouldn't look at ext4 as an example of a reliable, problem free
direct IO implementation because, historically speaking, it's been a
series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and
been far worse than XFS from data integrity, performance and
reliability perspectives.

IMO, "because ext4" has been a poor reason for justifying anything
for a long time, not the least when talking about features that
didn't even originate in extN....

> Can you actually point to such a thing? Let's get that fixed, because
> it's completely wrong regardless of this whole mincore issue.

The biggest problem that remains today is that we have no mechanism
for serialising page faults against DIO. If we leave pages cached in
memory while we have a AIO+DIO read (or write!) in progress, we can
dirty the page and run a buffered read before the AIO+DIO read
returns. This now leaves us in the state where where the AIO+DIO
read returns different (stale) data to a buffered read that has
already completed because it hit the dirty page cache. i.e. we
still have nasty page cache vs direct IO coherency problems, and
they are largely unsolvable because of the limitations of the core
kernel infrastructure architecture.

Yes, you can argue that userspace is doing an insane thing, but
every so often we come across coherency issues like this that are
out of a user's control (e.g. backup scan vs app accesses) and we do
our best to ensure that they don't cause problems given the
constraints we have. Invalidating the page cache on dio reads
mostly mitigates these coherency race conditions and that's why it's
still there in the XFS code paths...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-10 01:17:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 09, 2019 at 11:08:57AM +0100, Jiri Kosina wrote:
> On Wed, 9 Jan 2019, Dave Chinner wrote:
>
> > FWIW, I just realised that the easiest, most reliable way to invalidate
> > the page cache over a file range is simply to do a O_DIRECT read on it.
>
> Neat, good catch indeed. Still, it's only the invalidation part, but the
> residency check is the crucial one.
>
> > > Rationale has been provided by Daniel Gruss in this thread -- if the
> > > attacker is left with cache timing as the only available vector, he's
> > > going to be much more successful with mounting hardware cache timing
> > > attack anyway.
> >
> > No, he said:
> >
> > "Restricting mincore() is sufficient to fix the hardware-agnostic
> > part."
> >
> > That's not correct - preadv2(RWF_NOWAIT) is also hardware agnostic and
> > provides exactly the same information about the page cache as mincore.
>
> Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has
> "just" been overlooked. I can't speak for Daniel, but I believe he might
> be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT
> is sufficient ...".

Good luck with restricting RWF_NOWAIT. I eagerly await all the
fstests that exercise both the existing and new behaviours to
demonstrate they work correctly.

> > Timed read/mmap access loops for cache observation are also hardware
> > agnostic, and on fast SSD based storage will only be marginally slower
> > bandwidth than preadv2(RWF_NOWAIT).
> >
> > Attackers will pick whatever leak vector we don't fix, so we either fix
> > them all (which I think is probably impossible without removing caching
> > altogether)
>
> We can't really fix the fact that it's possible to do the timing on the HW
> caches though.

We can't really fix the fact that it's possible to do the timing on
the page cache, either.

> > or we start thinking about how we need to isolate the page cache so that
> > information isn't shared across important security boundaries (e.g. page
> > cache contents are per-mount namespace).
>
> Umm, sorry for being dense, but how would that help that particular attack
> scenario on a system that doesn't really employ any namespacing?

What's your security boundary?

The "detect what code an app is running" exploit is based on
invalidating and then observing how shared, non-user-owned files
mapped with execute privileges change cache residency.

If the security boundary is within the local container, should users
inside that container be allowed to invalidate the cache of
executable files and libraries they don't own? In this case, we
can't stop observation, because that only require read permissions
and high precision timing, hence the only thing that can be done
here is prevent non-owners from invalidating the page cache.

If the security boundary is a namespace or guest VM, then permission
checks don't work - the user may own the file within that container.
This problem now is that the page cache is observable and
controllable from both sides of the fence. Hence the only way to
prevent observation of the code being run in a different namespace
is to prevent the page being shared across both containers.

The exfiltration exploit requires the page cache to be observable
and controllable on both sides of the security boundary. Should
users be able to observe and control the cached pages accessed by a
different container? KSM page deduplication lessons say no. This is
an even harder problem, because page cache residency can be observed
from remote machines....

What scares me is that new features being proposed could make our
exposure a whole lot worse. e.g. the recent virtio-pmem ("fake-dax")
proposal will directly share host page cache pages into guest VMs w/
DAX capability. i.e. the guest directly accesses the host page
cache. This opens up the potential for host page cache timing
attacks from the guest VMs, and potential guest to guest
observation/exploitation is possible if the same files are mapped
into multiple guests....

IOws the two questions here are simply: "What's your security
boundary?" and "Is the page cache visible and controllable on both
sides?".

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-10 01:20:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <[email protected]> wrote:
>
> I wouldn't look at ext4 as an example of a reliable, problem free
> direct IO implementation because, historically speaking, it's been a
> series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and
> been far worse than XFS from data integrity, performance and
> reliability perspectives.

That's some big words from somebody who just admitted to much worse hacks.

Seriously. XFS is buggy in this regard, ext4 apparently isn't.

Thinking that it's better to just invalidate the cache for direct IO
reads is all kinds of odd.

Linus

2019-01-10 05:30:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 9, 2019 at 5:18 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <[email protected]> wrote:
> >
> > I wouldn't look at ext4 as an example of a reliable, problem free
> > direct IO implementation because, historically speaking, it's been a
> > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and
> > been far worse than XFS from data integrity, performance and
> > reliability perspectives.
>
> That's some big words from somebody who just admitted to much worse hacks.
>
> Seriously. XFS is buggy in this regard, ext4 apparently isn't.
>
> Thinking that it's better to just invalidate the cache for direct IO
> reads is all kinds of odd.
>

This whole discussion seems to have gone a little bit off the rails...

Linus, I think I agree with Dave's overall sentiment, though, and I
think you should consider reverting your patch. Here's why. The
basic idea behind the attack is that the authors found efficient ways
to do two things: evict a page from page cache and detect, *without
filling the cache*, whether a page is cached. The combination lets an
attacker efficiently tell when another process reads a page. We need
to keep in mind that this attack is a sophisticated attack, and anyone
using it won't have any problem using a nontrivial way to detect
whether a page is in page cache.

So, unless we're going to try for real to make it hard to tell whether
a page is cached without causing that page to become cached, it's not
worth playing whack-a-mole. And, right now, mincore is whacking a
mole. RWF_NOWAIT appears to do essentially the same thing at very
little cost. I haven't really dug in, but I assume that various
prefaulting tricks combined with various pagetable probing tricks can
do similar things, but that's at least a *lot* more complicated.

So unless we're going to lock down RWF_NOWAIT as well, I see no reason
to lock down mincore(). Direct IO is a red herring -- O_DIRECT is
destructive enough that it seems likely to make the attacks a lot less
efficient.


--- begin digression ---

Since direct IO has been brought up, I have a question. I've wondered
for years why direct IO works the way it does. If I were implementing
it from scratch, my first inclination would be to use the page cache
instead of fighting it. To do a single-page direct read, I would look
that page up in the page cache (i.e. i_pages these days). If the page
is there, I would do a normal buffered read. If the page is not
there, I would insert a record into i_pages indicating that direct IO
is in progress and then I would do the IO into the destination page.
If any other read, direct or otherwise, sees a record saying "under
direct IO", it would wait.

To do a single-page direct write, I would look it up in i_pages. If
it's there, I would do a buffered write followed by a sync (because
applications expect a sync). If it's not there, I would again add a
record saying "under direct IO" and do the IO.

The idea is that this works as much like buffered IO as possible,
except that the pages backing the IO aren't normal sharable page cache
pages.

The multi-page case would be just an optimization on top of the
single-page case. The idea would be to somehow mark i_pages with
entire extents under direct IO. It's a radix tree -- this can, at
least in theory, be done efficiently. As long as all direct IO
operations run in increasing order of offset, there shouldn't be lock
ordering problems.

Other than history and possibly performance, is there any reason that
direct IO doesn't work this way?

P.S. What, if anything, prevents direct writes from causing trouble
when the underlying FS or backing store needs stable pages?
Similarly, what, if anything, prevents direct reads from temporarily
exposing unintended data to user code if the fs or underlying device
transforms the data during the read process (e.g. by decrypting
something)?

--- end digression ---

2019-01-10 07:05:37

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 09, 2019 at 05:18:21PM -0800, Linus Torvalds wrote:
> On Wed, Jan 9, 2019 at 4:44 PM Dave Chinner <[email protected]> wrote:
> >
> > I wouldn't look at ext4 as an example of a reliable, problem free
> > direct IO implementation because, historically speaking, it's been a
> > series of nasty hacks (*cough* mount -o dioread_nolock *cough*) and
> > been far worse than XFS from data integrity, performance and
> > reliability perspectives.
>
> That's some big words from somebody who just admitted to much worse hacks.

Sorry, what hacks did I just admit to making? This O_DIRECT
behaviour long predates me - I'm just the messenger and you are
shooting from the hip.

Linus, the point I was making is that there are many, many ways to
control page cache invalidation and measure page cache residency,
and that trying to address them one-by-one is just a game of
whack-a-mole.

In future, can you please try not to go off the rails when someone
mentions O_DIRECT? You have a terrible habit of going off on
misdirected rants about O_DIRECT and/or XFS at any opportunity you
can get, and all it does is derail whatever useful conversation was
taking place.

> Seriously. XFS is buggy in this regard, ext4 apparently isn't.

So you keep asserting despite being presented with evidence that it
mitigates other longstanding bugs that are really hard to solve.
Ignoring all the evidence you've been presented with and
re-asserting your original statement doesn't make it correct.

Did you not think to ask "what are those problems, and what can do
to solve them so we can remove the invalidation mitigations that XFS
uses?". That would be a useful contribution, whereas shouting about
how O_DIRECT is broken just pisses off the people working their
asses off to fix the problems you just heard about and are ranting
about.

> Thinking that it's better to just invalidate the cache for direct IO
> reads is all kinds of odd.

No, it's practicality. If we can't fix the problem, we have to
mitigate it. When we fix the underlying problem we can remove the
mitigation code. having you assert that it's broken and demand that
it be removed doesn't change the fact that we haven't fixed the
underlying problems. It's being worked on, but we're not there yet.

-Dave.
--
Dave Chinner
[email protected]

2019-01-10 07:55:56

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, 10 Jan 2019, Dave Chinner wrote:

> > Yeah, preadv2(RWF_NOWAIT) is in the same teritory as mincore(), it has
> > "just" been overlooked. I can't speak for Daniel, but I believe he might
> > be ok with rephrasing the above as "Restricting mincore() and RWF_NOWAIT
> > is sufficient ...".
>
> Good luck with restricting RWF_NOWAIT. I eagerly await all the
> fstests that exercise both the existing and new behaviours to
> demonstrate they work correctly.

Well, we can still resurrect my original aproach of doing this opt-in
based on a sysctl setting, and letting the admin choose his poison.

If 'secure' mode is selected, RWF_NOWAIT will then probably just always
fail wit EAGAIN.

--
Jiri Kosina
SUSE Labs


2019-01-10 12:27:37

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds wrote on Thu, Jan 10, 2019:
> (Except, of course, if somebody actually notices outside of tests.
> Which may well happen and just force us to revert that commit. But
> that's a separate issue entirely).

Both Dave and I pointed at a couple of utilities that break with
this. nocache can arguably work with the new behaviour but will behave
differently; vmtouch on the other hand is no longer able to display
what's in cache or not - people use that for example to "warm up" a
container in page cache based on how it appears after it had been
running for a while is a pretty valid usecase to me.

From the list Kevin harvested out of the debian code search, the
postgresql use case is pretty similar - probe what pages of the database
were in cache at shutdown so when you restart it you can preload these
and reach "cruse speed" faster.

Sure that's probably not billions of users but this all looks fairly
valid to me...

--
Dominique

2019-01-10 14:52:05

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 11:44:24AM +1100, Dave Chinner wrote:
> And, really, this would be just another band-aid over a symptom of
> the information leak - it doesn't prevent users from being able to
> control page cache invalidation. It just removes one method, just
> like hacking mincore only removes one method of observing the page
> cache. And, like mincore(), there's every chance it impacts on
> userspace in a negative manner and so we need to be very careful
> here.

Putting the mincore() / cache timing information leak aside though,
the current behaviour of XFS means that an attacker can screw up the
performance of random applications just by repeatedly doing O_DIRECT
reads of libc.so.

Maybe O_DIRECT reads should be forbidden from files on XFS unless you
also have write access to them? (eg owner).


2019-01-10 16:02:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 9, 2019 at 11:04 PM Dave Chinner <[email protected]> wrote:
>
> Sorry, what hacks did I just admit to making? This O_DIRECT
> behaviour long predates me - I'm just the messenger and you are
> shooting from the hip.

Sure, sorry. I find this whole thing annoying.

> Linus, the point I was making is that there are many, many ways to
> control page cache invalidation and measure page cache residency,
> and that trying to address them one-by-one is just a game of
> whack-a-mole.

.. and I agree. But let's a step back.Because there are different issues.

First off, the whole page cache attack is not necessarily something
many people will care about. As has been pointed out, it's often a
matter of convenience and (relative) portability.

And no, we're *never* going to stop all side channel leaks. Some parts
of caching (notably the timing effects of it) are pretty fundamental.

So at no point is this going to be some kind of absolute line in the
sand _anyway_. There is no black-and-white "you're protected", there's
only levels of convenience.

A remote attacker is hopefully going to be limited by the interfaces
to just timing attacks, although who knows what something like JS
might expose. Presumably neither mincore() nor arbitrary O_DIRECT or
pread2() flags.

Anyway, the reason I was trying to plug mincore() is largely that that
code didn't make much sense to begin with, and simply this:

mm/mincore.c | 94 +++++++++---------------------------------------------------
1 file changed, 13 insertions(+), 81 deletions(-)

if we can make people happier by removing lines of code and making the
semantics more clear anyway, it's worth trying.

No?

Is that everything? No. As mentioned, you'll never get to that "ok, we
plugged everything" point anyway. But removing a fairly easy way to
probe the cache that has no real upsides should be fairly
non-controversial.

But I do have to say that in many ways the page cache is *not* a great
attack vector because there's often lots of it, and it's fairly hard
to control. Once something is in the page cache for whatever reason,
it tends to be pretty sticky, and flushing it tends to be fairly hard
to predict.

And a cheap and residency (whether a simple probe like mincore of or a
NOWAIT flag) check is actually important just to try to control the
flushing part. Brute-forcing the flushing is generally very expensive,
but if you can't even see if you flushed it, it's way more so.

If there's a way to control the cache residency directly, that's
actually a much bigger hole than any residency check ever were.

Because once you can flush caches by reading, at that point you can
just flush a particular page and look at the IO stats for the root
partition or something. No residency check even needed.

So I do think that yes, as long as you can do a directed cache flush,
mincore is *entirely* immaterial.

Still, giving mincore clearer semantics and simpler code? Win-win.

(Except, of course, if somebody actually notices outside of tests.
Which may well happen and just force us to revert that commit. But
that's a separate issue entirely).

But I do think that we should strive to *never* invalidate caches on
read accesses. I don't actually see where you are doing that,
honestly: at least dio_complete() only does it for writes.

So I'm actually hoping that you are mis-remembering this and it turns
out that O_DIRECT reads don't invalidate caches.

Linus

2019-01-10 16:27:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote:
> Since direct IO has been brought up, I have a question. I've wondered
> for years why direct IO works the way it does. If I were implementing
> it from scratch, my first inclination would be to use the page cache
> instead of fighting it. To do a single-page direct read, I would look
> that page up in the page cache (i.e. i_pages these days). If the page
> is there, I would do a normal buffered read. If the page is not
> there, I would insert a record into i_pages indicating that direct IO
> is in progress and then I would do the IO into the destination page.
> If any other read, direct or otherwise, sees a record saying "under
> direct IO", it would wait.

OK, you're in the same ballpark I am ;-) Kent Overstreet pointed out
that what you want to do here is great for the mixed case, but it's
pretty inefficient for IOs to files which are wholly uncached.

So what I'm currently thinking about is an rwsem which works like this:

O_DIRECT task:
if i_pages is empty, take rwsem for read, recheck i_pages is empty, do IO,
drop rwsem.
if i_pages is not empty, insert XA_LOCK_ENTRY, when IO complete, wake waitqueue for that (mapping, index).

buffered IO:
if i_pages is empty, take rwsem for write, allocate page, insert page, drop rwsem.
if i_pages is not empty, look up index, if entry is XA_LOCK_ENTRY sleep on
waitqueue. otherwise proceed as now.


2019-01-10 22:07:34

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 06:47:11AM -0800, Matthew Wilcox wrote:
> On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote:
> > Since direct IO has been brought up, I have a question. I've wondered
> > for years why direct IO works the way it does. If I were implementing
> > it from scratch, my first inclination would be to use the page cache
> > instead of fighting it. To do a single-page direct read, I would look
> > that page up in the page cache (i.e. i_pages these days). If the page
> > is there, I would do a normal buffered read. If the page is not
> > there, I would insert a record into i_pages indicating that direct IO
> > is in progress and then I would do the IO into the destination page.
> > If any other read, direct or otherwise, sees a record saying "under
> > direct IO", it would wait.
>
> OK, you're in the same ballpark I am ;-) Kent Overstreet pointed out
> that what you want to do here is great for the mixed case, but it's
> pretty inefficient for IOs to files which are wholly uncached.
>
> So what I'm currently thinking about is an rwsem which works like this:
>
> O_DIRECT task:
> if i_pages is empty, take rwsem for read, recheck i_pages is empty, do IO,
> drop rwsem.

GUP does page fault on user buffer which is a mmapped region of same
file. page fault sets up for buffered IO, tries to take rwsem for
write, deadlocks.

Most of the schemes we come up with fall down at this point - you
can't hold a lock over gup that is also used in the buffered IO
path. That's why XFS (and now ext4) have the IOLOCK and MMAPLOCK
for truncation serialisation - we can't lock out both read()/write()
and mmap IO paths with the same lock...

> if i_pages is not empty, insert XA_LOCK_ENTRY, when IO complete, wake waitqueue for that (mapping, index).

I assume you really mean add a tag to the entry?

But this means there is no record ofthe direct IO being in flight
except for the rwsem being held across the IO. Even if we did insert
a flag to say "DIO in progress" and not rely on the lock....

> buffered IO:
> if i_pages is empty, take rwsem for write, allocate page, insert page, drop rwsem.
> if i_pages is not empty, look up index, if entry is XA_LOCK_ENTRY sleep on
> waitqueue. otherwise proceed as now.

... we'll sleep on that flags in the page fault and deadlock anyway.

I'm pretty sure we explored this "record DIO state in the radix
tree" 2 or 3 years ago and came to the conclusion that it didn't
work for reasons like the above. i.e. it doesn't solve the problems
we currently have with locking and serialisation between DIO and
mmap...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-10 22:13:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 4:25 AM Dominique Martinet
<[email protected]> wrote:
>
> Linus Torvalds wrote on Thu, Jan 10, 2019:
> > (Except, of course, if somebody actually notices outside of tests.
> > Which may well happen and just force us to revert that commit. But
> > that's a separate issue entirely).
>
> Both Dave and I pointed at a couple of utilities that break with
> this. nocache can arguably work with the new behaviour but will behave
> differently; vmtouch on the other hand is no longer able to display
> what's in cache or not - people use that for example to "warm up" a
> container in page cache based on how it appears after it had been
> running for a while is a pretty valid usecase to me.

So honestly, the main reason I'm loath to revert is that yes, we know
of theoretical differences, but they seem to all be
performance-related.

It would be really good to hear numbers. Is the warm-up optimization
something that changes things from 3ms to 3.5ms? Or does it change
things from 3ms to half a second?

Because Dave is absolutely correct that mincore() isn't really even
all that interesting an information leak if you can do the same with
RWF_NOWAIT. But the other side of that same coin is that if we're not
able to block mincore() sanely, then there's no point at looking at
RWF_NOWAIT either.

And we *can* do sane things about RWF_NOWAIT. For example, we could
start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
page cache" to "probe and fill", and be much harder to use as an
attack vector..

Do we want to do that? Maybe, maybe not. But if mincore() can't be
fixed, there's no point in even trying.

Now, if the mincore() change results in a big performance hit for
people who use it as a heuristic for filling caches etc, then
reverting the trial balloon is obviously something we must do, but at
that point I'd also like to know which load it was that cared so much,
and just what it did. Because we did have an alternate patch that just
said "was the file writably opened, then we can do the page cache
probing". But at least one user (fincore) didn't do even that.

So right now, I consider the mincore change to be a "try to probe the
state of mincore users", and we haven't really gotten a lot of
information back yet.

Linus

2019-01-11 02:09:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 1:44 PM Dave Chinner <[email protected]> wrote:
>
> GUP does page fault on user buffer which is a mmapped region of same
> file. page fault sets up for buffered IO, tries to take rwsem for
> write, deadlocks.
>
> Most of the schemes we come up with fall down at this point - you
> can't hold a lock over gup that is also used in the buffered IO
> path. That's why XFS (and now ext4) have the IOLOCK and MMAPLOCK
> for truncation serialisation - we can't lock out both read()/write()
> and mmap IO paths with the same lock...

Side note: a somewhat similar version of is true even in the absence
of GUP and dio, for the case of doing a mmap of a file, and then
reading or writing from the mapped region into the file itself.

There are "interesting" locking scenarios wrt just holding the page
locked, and trying to then fill that page with information with just a
regular "copy_from_user()".

Page fault -> try to read the file -> oops, the page we're trying to
read from is locked because we're trying to write to it.

So we have that odd dance in generic_perform_write() which does

- touch the first user byte without holding any lock

- do write_begin() (which gets the page lock)

- copy from user space using the "atomic" copy (which just gives up on fault)

- if nothing got copied, go back and try again with a smaller copy
that can't cross a page. We might have raced with pageout.

It might be possible to do something similar for direct IO, although
simpler: just do the GUP entirely atomically (and in the fault case
just fall back to non-direct IO).

Linus

2019-01-11 04:01:22

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 09, 2019 at 09:26:41PM -0800, Andy Lutomirski wrote:
> Since direct IO has been brought up, I have a question. I've wondered
> for years why direct IO works the way it does. If I were implementing
> it from scratch, my first inclination would be to use the page cache
> instead of fighting it. To do a single-page direct read, I would look
> that page up in the page cache (i.e. i_pages these days). If the page
> is there, I would do a normal buffered read. If the page is not

Therein lies the problem. Copying data is prohibitively expensive,
and that's the primary reason for O_DIRECT existing. i.e. O_DIRECT
is a low-overhead, zero-copy data movement interface.

The moment we switch from using CPU to dispatch IO to copying data,
performance goes down because we will be unable to keep storage
pipelines full. IOWs, any rework of O_DIRECT that involves copying
data is a non-starter.

But let's bring this back to the issue at hand - observability of
page cache residency of file pages. If th epage is caceh resident,
then it will have a latency of copying that data out of the page
(i.e. very low latency). If the page is not resident, then it will
do IO and take much, much longer to complete. i.e. we have clear
timing differences between cachce hit and cache miss IO. This is
exactly the timing information needed for observing page cache
residency.

We need to work out how to make page cache residency less
observable, not add new, near perfect observation mechanisms that
third parties can easily exploit...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-11 04:06:06

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote:
> And we *can* do sane things about RWF_NOWAIT. For example, we could
> start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
> page cache" to "probe and fill", and be much harder to use as an
> attack vector..

We can only do that if the application submits the read via AIO and
has an async IO completion reporting mechanism. Otherwise we have
to wait for the IO to complete to copy the data into the user's
buffer. And given that the app is using RWF_NOWAIT to explicitly
avoid blocking on the IO....

Also, keep in mind that RWF_NOWAIT also prevents blocking on
filesystem locks and full request queues. One of the prime drivers
of RWF_NOWAIT was to prevent AIO submission from blocking on
filesystem locks - it allows userspace to submit other IO while it
can't get all the access it requires to a single file or a single
block device is congested.

Hence I don't think there's a such a simple answer here - blocking
for IO breaks RWF_NOWAIT.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-11 05:33:33

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds wrote:
> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <[email protected]> wrote:
> >
> > On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote:
> > > And we *can* do sane things about RWF_NOWAIT. For example, we could
> > > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
> > > page cache" to "probe and fill", and be much harder to use as an
> > > attack vector..
> >
> > We can only do that if the application submits the read via AIO and
> > has an async IO completion reporting mechanism.
>
> Oh, no, you misunderstand.
>
> RWF_NOWAIT has a lot of situations where it will potentially return
> early (the DAX and direct IO ones have their own), but I was thinking
> of the one in generic_file_buffered_read(), which triggers when you
> don't find a page mapping. That looks like the obvious "probe page
> cache" case.
>
> But we could literally move that test down just a few lines. Let it
> start read-ahead.
>
> .. and then it will actually trigger on the *second* case instead, where we have
>
> if (!PageUptodate(page)) {
> if (iocb->ki_flags & IOCB_NOWAIT) {
> put_page(page);
> goto would_block;
> }
>
> and that's where RWF_MNOWAIT would act.
>
> It would still return EAGAIN.
>
> But it would have started filling the page cache. So now the act of
> probing would fill the page cache, and the attacker would be left high
> and dry - the fact that the page cache now exists is because of the
> attack, not because of whatever it was trying to measure.
>
> See?

Except for fadvise(POSIX_FADV_RANDOM) which triggers this code in
page_cache_sync_readahead():

/* be dumb */
if (filp && (filp->f_mode & FMODE_RANDOM)) {
force_page_cache_readahead(mapping, filp, offset, req_size);
return;
}

So it will only read the single page we tried to access and won't
perturb the rest of the message encoded into subsequent pages in
file.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-11 05:34:25

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged



> On Jan 10, 2019, at 8:04 PM, Dave Chinner <[email protected]> wrote:
>
>> On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds wrote:
>>> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <[email protected]> wrote:
>>>
>>>> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote:
>>>> And we *can* do sane things about RWF_NOWAIT. For example, we could
>>>> start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
>>>> page cache" to "probe and fill", and be much harder to use as an
>>>> attack vector..
>>>
>>> We can only do that if the application submits the read via AIO and
>>> has an async IO completion reporting mechanism.
>>
>> Oh, no, you misunderstand.
>>
>> RWF_NOWAIT has a lot of situations where it will potentially return
>> early (the DAX and direct IO ones have their own), but I was thinking
>> of the one in generic_file_buffered_read(), which triggers when you
>> don't find a page mapping. That looks like the obvious "probe page
>> cache" case.
>>
>> But we could literally move that test down just a few lines. Let it
>> start read-ahead.
>>
>> .. and then it will actually trigger on the *second* case instead, where we have
>>
>> if (!PageUptodate(page)) {
>> if (iocb->ki_flags & IOCB_NOWAIT) {
>> put_page(page);
>> goto would_block;
>> }
>>
>> and that's where RWF_MNOWAIT would act.
>>
>> It would still return EAGAIN.
>>
>> But it would have started filling the page cache. So now the act of
>> probing would fill the page cache, and the attacker would be left high
>> and dry - the fact that the page cache now exists is because of the
>> attack, not because of whatever it was trying to measure.
>>
>> See?
>
> Except for fadvise(POSIX_FADV_RANDOM) which triggers this code in
> page_cache_sync_readahead():
>
> /* be dumb */
> if (filp && (filp->f_mode & FMODE_RANDOM)) {
> force_page_cache_readahead(mapping, filp, offset, req_size);
> return;
> }
>
> So it will only read the single page we tried to access and won't
> perturb the rest of the message encoded into subsequent pages in
> file.
>

There are two types of attacks. One is an intentional side channel where two cooperating processes communicate. This is, under some circumstances, a problem, but it’s not one we’re about to solve in general. The other is an attacker monitoring an unwilling process. I think we care a lot more about that, and Linus’ idea will help.

2019-01-11 05:38:37

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds wrote on Thu, Jan 10, 2019:
> On Thu, Jan 10, 2019 at 4:25 AM Dominique Martinet
> <[email protected]> wrote:
> > Linus Torvalds wrote on Thu, Jan 10, 2019:
> > > (Except, of course, if somebody actually notices outside of tests.
> > > Which may well happen and just force us to revert that commit. But
> > > that's a separate issue entirely).
> >
> > Both Dave and I pointed at a couple of utilities that break with
> > this. nocache can arguably work with the new behaviour but will behave
> > differently; vmtouch on the other hand is no longer able to display
> > what's in cache or not - people use that for example to "warm up" a
> > container in page cache based on how it appears after it had been
> > running for a while is a pretty valid usecase to me.
>
> So honestly, the main reason I'm loath to revert is that yes, we know
> of theoretical differences, but they seem to all be
> performance-related.

I don't see what other use mincore could have, yes - even the
"debugging" use I gave is performance investigations and not hard
problems (and I probably would go straight to perf nowadays, you'd get
the info that the program doesn't use cache from the call graphs)

> It would be really good to hear numbers. Is the warm-up optimization
> something that changes things from 3ms to 3.5ms? Or does it change
> things from 3ms to half a second?

This is heavily workload and storage hardware dependant, so hard to give
some absolute value.

Trying with some big server, fast SSD, mysql and doing:
# echo 3 > /proc/sys/vm/drop_caches
# (optional) prefetch table and innodb files
# systemctl restart mariadb
# time mysql -q db "select * from mytable where id in $ENTRIES" > /dev/null
# time mysql -q db "select * from mytable where id in $ENTRIES2" > /dev/null
# time mysql -q db "select * from mytable where id in $ENTRIES3" > /dev/null
(where ENTRIES* are lists of 1000 id, and id is indexed; the table is 8GB
for 62590661 entries so 1000 entries is approx 128KB of data out of that
file)

I get on average over a few queries approximately a real time of 350ms,
230ms and 220ms immediately after drop cache and service restart, and
150ms, 60ms and 60ms after a prefetch (hand-wavy average over 3 runs, I
didn't have the patience to do proper testing).
(In both cases, user/sys are less than 10ms; I don't see much difference
there)

If I restart the service without dropping caches and redo the query I
get 60ms from the first query onwards so I must not be preloading
everything properly, some real script that would look all over a
container to properly restore the page cache would do better than me
blindly preloading a few files.

Either way, we're talking about a factor of 2-3 until the application has
been looking at most of the entries, and I didn't try to see how that
would look like on spinning disks or the kind of slow storage one would
get on VPS somewhere in the cloud - I'm sure someone with time to waste
could get much more impressive figures, but this already look pretty
worthwhile to me.

--
Dominique Martinet | Asmadeus

2019-01-11 06:01:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner <[email protected]> wrote:
>
> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds wrote:
> > And we *can* do sane things about RWF_NOWAIT. For example, we could
> > start async IO on RWF_NOWAIT, and suddenly it would go from "probe the
> > page cache" to "probe and fill", and be much harder to use as an
> > attack vector..
>
> We can only do that if the application submits the read via AIO and
> has an async IO completion reporting mechanism.

Oh, no, you misunderstand.

RWF_NOWAIT has a lot of situations where it will potentially return
early (the DAX and direct IO ones have their own), but I was thinking
of the one in generic_file_buffered_read(), which triggers when you
don't find a page mapping. That looks like the obvious "probe page
cache" case.

But we could literally move that test down just a few lines. Let it
start read-ahead.

.. and then it will actually trigger on the *second* case instead, where we have

if (!PageUptodate(page)) {
if (iocb->ki_flags & IOCB_NOWAIT) {
put_page(page);
goto would_block;
}

and that's where RWF_MNOWAIT would act.

It would still return EAGAIN.

But it would have started filling the page cache. So now the act of
probing would fill the page cache, and the attacker would be left high
and dry - the fact that the page cache now exists is because of the
attack, not because of whatever it was trying to measure.

See?

But obviously this kind of change only matters if we also have
mincore() not returning the probe data. mincore() obviously can't do
the same kind of read-ahead to defeat things.

Linus

2019-01-11 08:58:39

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds wrote on Thu, Jan 10, 2019:
> But those numbers aren't about the mincore() change. That's just from
> dropping caches.
>
> Now, what's the difference with the mincore change, and without? Is it
> actually measurable?
>
> Because that's all that matters: is the mincore change something you
> can even notice? Is it a big regression?
>
> The fact that things are slower when they are cold in the cache isn't
> the issue. The issue is whether the change to mincore semantics makes
> any difference to real loads.

mincore itself isn't used to reload the data, but is necessary to know
*what* you need to reload.
If you don't know what pages are hot, how can you preload them?

For small enough a database and with enough memory you can act stupid
and load the whole tables in cache, that's what I did because I was lazy
and only had a big mysql data set around. But the container warming up
automaton Dave mentioned and postgresql db preloading with pgfincore
explicitely depend on being able to tell what they need to preload.


pgfincore documentation states:
----
set of PostgreSQL functions to manage blocks in memory

Those functions let you know which and how many disk block from a
relation are in the page cache of the operating system, and eventually
write the result to a file. Then using this file, it is possible to
restore the page cache state for each block of the relation.
----
If you cannot dump an arbitrary "hot state" x, you cannot restore it.


This is all basically a repeat of the other subthread; sure precaching
itself doesn't need mincore and if you're all-knowing the syscall isn't
needed, but mere mortals need it.


If it's about the commit itself, vmtouch tells me 0 page in these
database files are in cache when I reboot to a 5.0-rc1 kernel and run
some queries, so the difference after a fresh boot is exactly what I
stated. I'm probably missing something but I'm not quite sure in what
situation the "new mincore" has any use right now.
--
Dominique Martinet | Asmadeus

2019-01-11 09:00:45

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 11:08:07PM -0800, Linus Torvalds wrote:
> On Thu, Jan 10, 2019 at 8:04 PM Dave Chinner <[email protected]> wrote:
> >
> > So it will only read the single page we tried to access and won't
> > perturb the rest of the message encoded into subsequent pages in
> > file.
>
> Dave, you're being intentionally obtuse, aren't you?
>
> It's only that single page that *matters*. That's the page that the
> probe reveals the status of - but it's also the page that the probe
> then *changes* the status of.

It changes the state of it /after/ we've already got the information
we need from it. It's not up to date, it has to come from disk, we
return EAGAIN, which means it was not in the cache.

i.e. if we return EAGAIN, we've leaked the inforation the attacker
wants regardless of how the act of initiating readahead on the page
change the state of the page. Yes, it raises the complexity bar a
bit, and lowers the monitoring frequency somewhat, but that's about
it.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-11 09:00:50

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, 10 Jan 2019, Dave Chinner wrote:

> Sounds nice from a theoretical POV, but reality has taught us very
> different lessons.
>
> FWIW, a quick check of XFS's history so you understand how long this
> behaviour has been around. It was introduced in the linux port in 2001
> as direct IO support was being added:
>
> commit e837eac23662afae603aaaef7c94bc839c1b8f67
> Author: Steve Lord <[email protected]>
> Date: Mon Mar 5 16:47:52 2001 +0000
>
> Add bounds checking for direct I/O, do the cache invalidation for
> data coherency on direct I/O.

Out of curiosity, which repository is this from please? Even google
doesn't seem to know about this SHA.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-01-11 10:22:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 8:04 PM Dave Chinner <[email protected]> wrote:
>
> So it will only read the single page we tried to access and won't
> perturb the rest of the message encoded into subsequent pages in
> file.

Dave, you're being intentionally obtuse, aren't you?

It's only that single page that *matters*. That's the page that the
probe reveals the status of - but it's also the page that the probe
then *changes* the status of.

See?

Think of it as "the act of measurement changes that which is
measured". And that makes the measurement pointless.

Linus

2019-01-11 10:38:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 8:58 PM Dominique Martinet
<[email protected]> wrote:
>
> I get on average over a few queries approximately a real time of 350ms,
> 230ms and 220ms immediately after drop cache and service restart, and
> 150ms, 60ms and 60ms after a prefetch (hand-wavy average over 3 runs, I
> didn't have the patience to do proper testing).
> (In both cases, user/sys are less than 10ms; I don't see much difference
> there)

But those numbers aren't about the mincore() change. That's just from
dropping caches.

Now, what's the difference with the mincore change, and without? Is it
actually measurable?

Because that's all that matters: is the mincore change something you
can even notice? Is it a big regression?

The fact that things are slower when they are cold in the cache isn't
the issue. The issue is whether the change to mincore semantics makes
any difference to real loads.

Linus

2019-01-11 10:40:48

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 08:08:37PM -0800, Andy Lutomirski wrote:
> > On Jan 10, 2019, at 8:04 PM, Dave Chinner <[email protected]>
> > wrote:
> >
> >> On Thu, Jan 10, 2019 at 06:18:16PM -0800, Linus Torvalds
> >> wrote:
> >>> On Thu, Jan 10, 2019 at 6:03 PM Dave Chinner
> >>> <[email protected]> wrote:
> >>>
> >>>> On Thu, Jan 10, 2019 at 02:11:01PM -0800, Linus Torvalds
> >>>> wrote: And we *can* do sane things about RWF_NOWAIT. For
> >>>> example, we could start async IO on RWF_NOWAIT, and suddenly
> >>>> it would go from "probe the page cache" to "probe and fill",
> >>>> and be much harder to use as an attack vector..
> >>>
> >>> We can only do that if the application submits the read via
> >>> AIO and has an async IO completion reporting mechanism.
> >>
> >> Oh, no, you misunderstand.
> >>
> >> RWF_NOWAIT has a lot of situations where it will potentially
> >> return early (the DAX and direct IO ones have their own), but I
> >> was thinking of the one in generic_file_buffered_read(), which
> >> triggers when you don't find a page mapping. That looks like
> >> the obvious "probe page cache" case.
> >>
> >> But we could literally move that test down just a few lines.
> >> Let it start read-ahead.
> >>
> >> .. and then it will actually trigger on the *second* case
> >> instead, where we have
> >>
> >> if (!PageUptodate(page)) { if (iocb->ki_flags &
> >> IOCB_NOWAIT) { put_page(page); goto would_block;
> >> }
> >>
> >> and that's where RWF_MNOWAIT would act.
> >>
> >> It would still return EAGAIN.
> >>
> >> But it would have started filling the page cache. So now the
> >> act of probing would fill the page cache, and the attacker
> >> would be left high and dry - the fact that the page cache now
> >> exists is because of the attack, not because of whatever it was
> >> trying to measure.
> >>
> >> See?
> >
> > Except for fadvise(POSIX_FADV_RANDOM) which triggers this code
> > in page_cache_sync_readahead():
> >
> > /* be dumb */ if (filp && (filp->f_mode & FMODE_RANDOM))
> > { force_page_cache_readahead(mapping, filp, offset,
> > req_size); return; }
> >
> > So it will only read the single page we tried to access and
> > won't perturb the rest of the message encoded into subsequent
> > pages in file.
>
> There are two types of attacks. One is an intentional side
> channel where two cooperating processes communicate. This is,
> under some circumstances, a problem,

Yes, that's the covert communication channel that can cross container
and machine boundaries without any required privileges.

> but it’s not one
> we’re about to solve in general. The other is an attacker
> monitoring an unwilling process.

Which uses exactly the same mechanisms as the first case. i.e.
controlled invalidation and page cache residency monitoring.If we
aren't going to solve the first problem case, the we aren't going to
solve the second because they are one and the same problem...

However, I suspect you have misunderstood the monitoring mechanism
here - dispatch IO for this page doesn't prevent the information
leak about that page. It's when we return EAGAIN that we leak
information about page cache residency.

What Linus is attempting to do is perturb the nearby state of the
page cache by triggering async readahead in the EAGAIN case. Async
readahead will fill all the holes in readahead window and hence
destroy the information about where the page fault landed and
instantiated the page cache. That would prevent the attacker from
determining what code the executable is running as they would only
be able to check a single page in an executable at a time and that
makes the attack highly impractical.

But if the attacker uses FADV_RANDOM, readahead is only triggered
for the page the attacker is trying to read. Hence it does not
disturb the nearby page cache residency pattern the executable's
page faults left behind and so doesn't destroy the information that
they are trying to extract from the unwilling process.

Sure, Linus's change makes monitoring the executable file after
FADV_RANDOM a "read-once" mechanism. However, detection of what code
is executing is a repeated invalidate-and-sweep exercise to begin
with, so it basically doesn't change the information or the rate at
which the monitoring process can extract information from the file.

/me hasn't thought about this sort of stuff since he was running
page cache invalidation attacks on Irix system libraries way back in
2002 when he found a libc bug that killed the init process and
paniced the kernel.

This isn't my first rodeo - it's been well known for a long, long
time that the system page cache can be exploited to monitor
executing code...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-11 20:52:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 10, 2019 at 11:36 PM Dave Chinner <[email protected]> wrote:
>
> > It's only that single page that *matters*. That's the page that the
> > probe reveals the status of - but it's also the page that the probe
> > then *changes* the status of.
>
> It changes the state of it /after/ we've already got the information
> we need from it. It's not up to date, it has to come from disk, we
> return EAGAIN, which means it was not in the cache.

Oh, I see the confusion.

Yes, you get the information about whether something was in the cache
or not, so the side channel does exist to some degree.

But it's actually hugely reduced for a rather important reason: the
_primary_ reason for needing to know whether some page is in the cache
or not is not actually to see if it was ever accessed - it's to see
that the cache has been scrubbed (and to _guide_ the scrubbing), and
*when* it was accessed.

Think of it this way: the buffer cache residency is actually a
horribly bad signal on its own mainly because you generally have a
very high hit-rate. In most normal non-streaming situations with
sufficient amounts of memory you have pretty much everything cached.

So in order to use it as a signal, first you have to first scrub the
cache (because if the page was already there, there's no signal at
all), and then for the signal to be as useful as possible, you're also
going to want to try to get out more than one bit of information: you
are going to try to see the patterns and the timings of how it gets
filled.

And that's actually quite painful. You don't know the initial cache
state, and you're not (in general) controlling the machine entirely,
because there's also that actual other entity that you're trying to
attack and see what it does.

So what you want to do is basically to first make sure the cache is
scrubbed (only for the pages you're interested in!), then trigger
whatever behavior you are looking for, and then look how that affected
the cache.

In other words, you want *multiple* residency status check - first to
see what the cache state is (because you're going to want that for
scrubbing), then to see that "yes, it's gone" when doing the
scrubbing, and then to see the *pattern* and timings of how things are
brought in.

And then you're likely to want to do this over and over again, so that
you can get real data out of the signal.

This is why something that doesn't perturb what you measure is really
important. If the act of measurement brings the page in, then you
can't use it for that "did I successfully scrub it" phase at all, and
you can't use it for measurement but once, so your view into patterns
and timings is going to be *much* worse.

And notice that this is true even if the act of measurement only
affects the *one* page you're measuring. Sure, any additional noise
around it would likely be annoying too, but it's not really necessary
to make the attack much harder to carry out. In fact, it's almost
irrelevant, since the signal you're trying to *see* is going to be
affected by prefetching etc too, so the patterns and timings you need
to look at are in bigger chunks than the readahead thing.

So yes, you as an attacker can remove the prefetching from *your*
load, but you can't remove it from the target load anyway, so you'll
just have to live with it.

Can you brute-force scrubbing? Yes. For something like an L1 cache,
that's easy (well, QoS domains make it harder). For something like a
disk cache, it's much harder, and makes any attempt to read out state
a lot slower. The paper that started this all uses mincore() not just
to see "is the page now scrubbed", but also to guide the scrubbing
itself (working set estimation etc).

And note that in many ways, the *scrubbing* is really the harder part.
Populating the cache is really easy: just read the data you want to
populate.

So if you are looking for a particular signal, say "did this error
case trigger so that it faulted in *that* piece of information", you'd
want to scrub the target, populate everything else, and then try to
measure at "did I trigger that target". Except you wouldn't want to do
it one page at a time but see as much pattern of "they were touched in
this order" as you can, and you'd like to get timing information of
how the pages you are interested were populated too.

And you'd generally do this over and over and over again because
you're trying to read out some signal.

Notice what the expensive operation was? It's the scrubbing.The "did
the target do IO" you might actually even see other ways for the
trivial cases, like even just look at iostat: just pre-populate
everything but the part you care about, then try to trigger whatever
you're searching for, and see if it caused IO or not.

So it's a bit like a chalkboard: in order to read out the result, you
need to erase it first, and doing that blindly is nasty. And you want
to look at timings, which is also really nasty if every time you look,
you smudge the very place you looked at. It makes it hard to see what
somebody else is writing on the board if you're always overwriting
what you just looked at. Did you get some new information? If not, now
you have to go back and do that scrubbing again, and you'll likely be
missing what *else* the person wrote.

Ans as always: there is no "black and white". There is no "absolute
security", and similarly, there is no "absolute leak proof". It's all
about making it inconvenient enough that it's not really practical.

Linus

2019-01-16 16:28:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 16, 2019 at 12:42 PM Josh Snyder <[email protected]> wrote:
>
> For Netflix, losing accurate information from the mincore syscall would
> lengthen database cluster maintenance operations from days to months. We
> rely on cross-process mincore to migrate the contents of a page cache from
> machine to machine, and across reboots.

Ok, this is the kind of feedback we need, and means I guess we can't
just use the mapping existence for mincore.

The two other ways that we considered were:

(a) owner of the file gets to know cache information for that file.

(b) having the fd opened *writably* gets you cache residency information.

Sadly, taking a look at happycache, you open the file read-only, so
(b) doesn't work.

Judging just from the source code, I can't tell how the user ownership
works. Any input on that?

And if you're not the owner of the file, do you have another
suggestion for that "Yes, I have the right to see what's in-core for
this file". Because the problem is literally that if it's some random
read-only system file, the kernel shouldn't leak access patterns to
it..

Linus

2019-01-16 16:33:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged



> On Jan 15, 2019, at 9:00 PM, Linus Torvalds <[email protected]> wrote:
>
>> On Wed, Jan 16, 2019 at 12:42 PM Josh Snyder <[email protected]> wrote:
>>
>> For Netflix, losing accurate information from the mincore syscall would
>> lengthen database cluster maintenance operations from days to months. We
>> rely on cross-process mincore to migrate the contents of a page cache from
>> machine to machine, and across reboots.
>
> Ok, this is the kind of feedback we need, and means I guess we can't
> just use the mapping existence for mincore.
>
> The two other ways that we considered were:
>
> (a) owner of the file gets to know cache information for that file.
>
> (b) having the fd opened *writably* gets you cache residency information.
>
> Sadly, taking a look at happycache, you open the file read-only, so
> (b) doesn't work.
>
> Judging just from the source code, I can't tell how the user ownership
> works. Any input on that?
>
> And if you're not the owner of the file, do you have another
> suggestion for that "Yes, I have the right to see what's in-core for
> this file". Because the problem is literally that if it's some random
> read-only system file, the kernel shouldn't leak access patterns to
> it..
>
>


Something like CAP_DAC_READ_SEARCH might not be crazy.

2019-01-16 16:37:02

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds wrote on Wed, Jan 16, 2019:
> Anybody willing to test the above patch instead? And replace the
>
> || capable(CAP_SYS_ADMIN)
>
> check with something like
>
> || inode_permission(inode, MAY_WRITE) == 0
>
> instead?
>
> (This is obviously after you've reverted the "only check mmap
> residency" patch..)

That seems to work on an x86_64 vm.

I've tested with the attached patch:
- root can lookup pages on any file I tried;
- user can lookup page on file it owns, assuming it can write to it
(e.g. it won't work on a 0400 file you own)
- user cannot lookup pages on e.g. /lib64/libc-2.28.so

There is a difference with your previous patch though, that used to list
no page in core when it didn't know; this patch lists pages as in core
when it refuses to tell. I don't think that's very important, though.

If anything, the 0400 user-owner file might be a problem in some edge
case (e.g. if you're preloading git directories, many objects are 0444);
should we *also* check ownership?...

--
Dominique


Attachments:
(No filename) (1.04 kB)
mincore.diff (1.20 kB)
Download all attachments

2019-01-16 17:16:23

by Josh Snyder

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds wrote on Thu, Jan 10, 2019:
> So right now, I consider the mincore change to be a "try to probe the
> state of mincore users", and we haven't really gotten a lot of
> information back yet.

<raises hand>

For Netflix, losing accurate information from the mincore syscall would
lengthen database cluster maintenance operations from days to months. We
rely on cross-process mincore to migrate the contents of a page cache from
machine to machine, and across reboots.

To do this, I wrote and maintain happycache [1], a page cache dumper/loader
tool. It is quite similar in architecture to pgfincore, except that it is
agnostic to workload. The gist of happycache's operation is "produce a dump
of residence status for each page, do some operation, then reload exactly
the same pages which were present before." happycache is entirely dependent
on accurate reporting of the in-core status of file-backed pages, as
accessed by another process.

We primarily use happycache with Cassandra, which (like Postgres +
pgfincore) relies heavily on OS page cache to reduce disk accesses. Because
our workloads never experience a cold page cache, we are able to provision
hardware for a peak utilization level that is far lower than the
hypothetical "every query is a cache miss" peak.

A database warmed by happycache can be ready for service in seconds
(bounded only by the performance of the drives and the I/O subsystem), with
no period of in-service degradation. By contrast, putting a database in
service without a page cache entails a potentially unbounded period of
degradation (at Netflix, the time to populate a single node's cache via
natural cache misses varies by workload from hours to weeks). If a single
node upgrade were to take weeks, then upgrading an entire cluster would
take months. Since we want to apply security upgrades (and other things) on
a somewhat tighter schedule, we would have to develop more complex
solutions to provide the same functionality already provided by mincore.

At the bottom line, happycache is designed to benignly exploit the same
information leak documented in the paper [2]. I think it makes perfect
sense to remove cross-process mincore functionality from unprivileged
users, but not to remove it entirely.

Josh Snyder
Netflix Cloud Database Engineering

[1] https://github.com/hashbrowncipher/happycache
[2] https://arxiv.org/abs/1901.01161

2019-01-16 17:25:32

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Fri, Jan 11, 2019 at 08:26:14AM -0800, Linus Torvalds wrote:
> On Thu, Jan 10, 2019 at 11:36 PM Dave Chinner <[email protected]> wrote:
> >
> > > It's only that single page that *matters*. That's the page that the
> > > probe reveals the status of - but it's also the page that the probe
> > > then *changes* the status of.
> >
> > It changes the state of it /after/ we've already got the information
> > we need from it. It's not up to date, it has to come from disk, we
> > return EAGAIN, which means it was not in the cache.
>
> Oh, I see the confusion.
>
> Yes, you get the information about whether something was in the cache
> or not, so the side channel does exist to some degree.
>
> But it's actually hugely reduced for a rather important reason: the
> _primary_ reason for needing to know whether some page is in the cache
> or not is not actually to see if it was ever accessed - it's to see
> that the cache has been scrubbed (and to _guide_ the scrubbing), and
> *when* it was accessed.

Oh, you're assuming that you need to probe the page cache to
determine if brute force invalidation has progressed far enough
to invalidate the page in question.

I'm assuming that you can invalidate the page cache reliably by a
means that does not repeated require probing to detect invalidation
has occurred. I've mentioned one method in this discussion
already...

IOWs, just because the specific brute force attack documented in the
paper required repeated probing it doesn't mean all future
invalidation attacks will require repeated probing. Hence a robust
defense mechanism should not rely on the attacker requiring multiple
observations of the page cache to extract the information they are
seeking...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-16 18:05:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 16, 2019 at 5:25 PM Andy Lutomirski <[email protected]> wrote:
>
> Something like CAP_DAC_READ_SEARCH might not be crazy.

I agree that it would work. In fact' it's what Jiri's patch basically
did. Except Jiri used CAP_SYS_ADMIN instead.

But that then basically limits it to root (or root-like with
capability masks), which is quite likely to not work in practice all
that well. That's why I wanted to find alternatives.

*Very* few people want to run their databases as root.

Jiri's original patch kind of acknowledged that by making the new test
be conditional, and off by default. So then it's a "only do this for
lockdown mode, because normal people won't find it acceptable".

And I'm not a huge fan of that approach. If you don't protect normal
people, then what's the point, really?

Linus

2019-01-16 18:06:59

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds wrote on Wed, Jan 16, 2019:
> *Very* few people want to run their databases as root.

In the case of happycache, this isn't the database doing the
dump/restore, but a separate process that could have the cap - it's
better if we can do without though, and from his readme he runs as user
cassandra in the /var/lib/cassandra directory for example so that'd
match the file owner.

For pgfincore, it's a postgres extension so the main process does it -
but it does have files open as write as well as being the owner.

> Jiri's original patch kind of acknowledged that by making the new test
> be conditional, and off by default. So then it's a "only do this for
> lockdown mode, because normal people won't find it acceptable".
>
> And I'm not a huge fan of that approach. If you don't protect normal
> people, then what's the point, really?

I agree with that.
"Being owner or has cap" (whichever cap) is probably OK.
On the other hand, writeability check makes more sense in general -
could we somehow check if the user has write access to the file instead
of checking if it currently is opened read-write?

--
Dominique

2019-01-16 18:28:06

by Josh Snyder

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <[email protected]>
wrote:
>
> There is a difference with your previous patch though, that used to list no
> page in core when it didn't know; this patch lists pages as in core when it
> refuses to tell. I don't think that's very important, though.

Is there a reason not to return -EPERM in this case?

>
> If anything, the 0400 user-owner file might be a problem in some edge
> case (e.g. if you're preloading git directories, many objects are 0444);
> should we *also* check ownership?...

Yes, this seems valuable. Some databases with immutable files (e.g. git, as
you've mentioned) conceivably operate this way.

Josh

2019-01-16 21:06:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <[email protected]> wrote:
>
> I'm assuming that you can invalidate the page cache reliably by a
> means that does not repeated require probing to detect invalidation
> has occurred. I've mentioned one method in this discussion
> already...

Yes. And it was made clear to you that it was a bug in xfs dio and
what the right thing to do was.

And you ignored that, and claimed it was a feature.

Why do you then bother arguing this thing? We absolutely agree that
xfs has an information leak. If you don't care, just _say_ so. Don't
try to argue against other people who are trying to fix things.

We can easily just say "ok, xfs people don't care", and ignore the xfs
invalidation issue. That's fine.

But don't try to make it a big deal for other filesystems that _don't_
have the bug. I even pointed out how ext4 does the page cache flushing
correcrly. You pooh-poohed it.

You can't have it both ways.

Either you care or you don't. If you don't care (and so far everything
you said seems to imply you don't), then why are you even discussing
this? Just admit you don't care, and we're done.

Linus

2019-01-16 21:10:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 16, 2019 at 5:46 PM Dominique Martinet
<[email protected]> wrote:
>
> "Being owner or has cap" (whichever cap) is probably OK.
> On the other hand, writeability check makes more sense in general -
> could we somehow check if the user has write access to the file instead
> of checking if it currently is opened read-write?

That's likely the best option. We could say "is it open for write, or
_could_ we open it for writing?"

It's a slightly annoying special case, and I'd have preferred to avoid
it, but it doesn't sound *compilcated*.

I'm on the road, but I did send out this:

https://lore.kernel.org/lkml/CAHk-=wif_9nvNHJiyxHzJ80_WUb0P7CXNBvXkjZz-r1u0ozp7g@mail.gmail.com/

originally. The "let's try to only do the mmap residency" was the
optimistic "maybe we can just get rid of this complexity entirely"
version..

Anybody willing to test the above patch instead? And replace the

|| capable(CAP_SYS_ADMIN)

check with something like

|| inode_permission(inode, MAY_WRITE) == 0

instead?

(This is obviously after you've reverted the "only check mmap
residency" patch..)

Linus

2019-01-16 22:56:48

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, 16 Jan 2019, Linus Torvalds wrote:

> > "Being owner or has cap" (whichever cap) is probably OK. On the other
> > hand, writeability check makes more sense in general - could we
> > somehow check if the user has write access to the file instead of
> > checking if it currently is opened read-write?
>
> That's likely the best option. We could say "is it open for write, or
> _could_ we open it for writing?"
>
> It's a slightly annoying special case, and I'd have preferred to avoid
> it, but it doesn't sound *compilcated*.
>
> I'm on the road, but I did send out this:
>
> https://lore.kernel.org/lkml/CAHk-=wif_9nvNHJiyxHzJ80_WUb0P7CXNBvXkjZz-r1u0ozp7g@mail.gmail.com/
>
> originally. The "let's try to only do the mmap residency" was the
> optimistic "maybe we can just get rid of this complexity entirely"
> version..
>
> Anybody willing to test the above patch instead? And replace the
>
> || capable(CAP_SYS_ADMIN)
>
> check with something like
>
> || inode_permission(inode, MAY_WRITE) == 0
>
> instead?
>
> (This is obviously after you've reverted the "only check mmap
> residency" patch..)

So that seems to deal with mincore() in a reasonable way indeed.

It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it
provide any good answer what to do about it, does it?

Thanks,

--
Jiri Kosina
SUSE Labs


2019-01-16 23:31:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 17, 2019 at 4:12 AM Jiri Kosina <[email protected]> wrote:
>
> So that seems to deal with mincore() in a reasonable way indeed.
>
> It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it
> provide any good answer what to do about it, does it?

As I suggested earlier in the thread, the fix for RWF_NOWAIT might be
to just move the test down to after readahead.

We could/should be smarter than this,but it *might* be as simple as just

diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..7bcdd36e629d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(

page = find_get_page(mapping, index);
if (!page) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- goto would_block;
page_cache_sync_readahead(mapping,
ra, filp,
index, last_index - index);

which starts readahead even if IOCB_NOWAIT is set and will then test
IOCB_NOWAIT _later_ and not actually wait for it.

Linus

2019-01-17 00:36:10

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, 17 Jan 2019, Linus Torvalds wrote:

> > So that seems to deal with mincore() in a reasonable way indeed.
> >
> > It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it
> > provide any good answer what to do about it, does it?
>
> As I suggested earlier in the thread, the fix for RWF_NOWAIT might be
> to just move the test down to after readahead.

So I've done some basic smoke testing (~2 hours of LTP+xfstests) on the
kernel with the three topmost patches from

https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel

applied (also attaching to this mail), and no obvious breakage popped up.

So if noone sees any principal problem there, I'll happily submit it with
proper attribution etc.

Thanks,

--
Jiri Kosina
SUSE Labs


Attachments:
0001-Revert-Change-mincore-to-count-mapped-pages-rather-t.patch (4.15 kB)
0002-mm-mincore-make-mincore-more-conservative.patch (2.26 kB)
0003-mm-filemap-initiate-readahead-even-if-IOCB_NOWAIT-is.patch (1.34 kB)
Download all attachments

2019-01-17 00:39:37

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 16, 2019 at 09:23:04PM +0100, Jiri Kosina wrote:
> On Thu, 17 Jan 2019, Linus Torvalds wrote:
> > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be
> > to just move the test down to after readahead.

Your patch 3/3 just removes the test. Am I right in thinking that it
doesn't need to be *moved* because the existing test after !PageUptodate
catches it?

Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there
any in LTP?

Some typos in the commit messages:

> Another aproach (checking file access permissions in order to decide
"approach"

> Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative
>
> The semantics of what mincore() considers to be resident is not completely
> clearar, but Linux has always (since 2.3.52, which is when mincore() was
"clear"

> initially done) treated it as "page is available in page cache".
>
> That's potentially a problem, as that [in]directly exposes meta-information
> about pagecache / memory mapping state even about memory not strictly belonging
> to the process executing the syscall, opening possibilities for sidechannel
> attacks.
>
> Change the semantics of mincore() so that it only reveals pagecache information
> for non-anonymous mappings that belog to files that the calling process could
"belong"


2019-01-17 00:40:10

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, 16 Jan 2019, Matthew Wilcox wrote:

> > On Thu, 17 Jan 2019, Linus Torvalds wrote:
> > > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be
> > > to just move the test down to after readahead.
>
> Your patch 3/3 just removes the test. Am I right in thinking that it
> doesn't need to be *moved* because the existing test after !PageUptodate
> catches it?

Exactly. It just initiates read-ahead for IOCB_NOWAIT cases as well, and
if it's actually set, it'll be handled by the !PageUpdtodate case.

> Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there
> any in LTP?

Not in the released version AFAIK. I've asked the LTP maintainer (in our
internal bugzilla) to take care of this thread a few days ago, but not
sure what came out of it. Adding him (Cyril) to CC.

> Some typos in the commit messages:
>
> > Another aproach (checking file access permissions in order to decide
> "approach"
>
> > Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative
> >
> > The semantics of what mincore() considers to be resident is not completely
> > clearar, but Linux has always (since 2.3.52, which is when mincore() was
> "clear"
>
> > initially done) treated it as "page is available in page cache".
> >
> > That's potentially a problem, as that [in]directly exposes meta-information
> > about pagecache / memory mapping state even about memory not strictly belonging
> > to the process executing the syscall, opening possibilities for sidechannel
> > attacks.
> >
> > Change the semantics of mincore() so that it only reveals pagecache information
> > for non-anonymous mappings that belog to files that the calling process could
> "belong"

Thanks.

--
Jiri Kosina
SUSE Labs


2019-01-17 02:08:06

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Jiri Kosina wrote on Wed, Jan 16, 2019:
> So if noone sees any principal problem there, I'll happily submit it with
> proper attribution etc.

I'm not convinced just the write permission check is enough for
mincore(), as Josh also seems to share the concern I raised (e.g. map a
git directory "hot" pages)
We probably need to add an inode_owner_or_capable() or similar, the open
question is do we still need the write access check after that - I don't
really know how expensive these calls are.

Thanks,
--
Dominique

2019-01-17 02:29:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 16, 2019 at 4:54 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <[email protected]> wrote:
> >
> > I'm assuming that you can invalidate the page cache reliably by a
> > means that does not repeated require probing to detect invalidation
> > has occurred. I've mentioned one method in this discussion
> > already...
>
> Yes. And it was made clear to you that it was a bug in xfs dio and
> what the right thing to do was.

Side note: I actually think we *do* the right thing. Even for xfs. I
couldn't find the alleged place that invalidates the page cache on dio
reads.

The *generic* dio code only does it for writes (which is correct and
fine). And maybe xfs has some extra invalidation, but I don't see it.

So I actually hope your "you can use direct-io read to do directed
invalidating of the page cache" isn't true. I admittedly did *not* try
to delve very deeply into it, but the invalidates I found looked
correct. The generic code does it for writes, and at least ext4 does
the "writeback and wait" for reads.

There *does* seem to be a 'invalidate_inode_pages2_range()' call in
iomap_dio_rw(). That has a *comment* that says it only is for writes,
but it looks to me like it would trigger for reads too.

Just a plain bug/oversight? Or me misreading things.

So yes, maybe xfs does that "invalidate on read", but it really seems
to be just a bug. If the xfs people insist on keeping the bug, fine
(looks like gfs2 and xfs are the only users), but it seems kind of
sad.

Linus

2019-01-17 03:39:24

by Kevin Easton

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Tue, Jan 15, 2019 at 11:52:25PM -0800, Josh Snyder wrote:
> On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <[email protected]>
> wrote:
> >
> > There is a difference with your previous patch though, that used to list no
> > page in core when it didn't know; this patch lists pages as in core when it
> > refuses to tell. I don't think that's very important, though.
>
> Is there a reason not to return -EPERM in this case?

When I was looking through the Debian Code Search results, quite a few
of the hits were for code that uses mincore() as a way to check if
_anything_ is mapped at an address or not. This code doesn't care about
the in core / not in core result of mincore(), just whether it returned
an error or not.

I think a new error return would break most of the instances of that
code I saw.

- Kevin


2019-01-17 04:31:08

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 16, 2019 at 05:00:25PM +1200, Linus Torvalds wrote:
> And if you're not the owner of the file, do you have another
> suggestion for that "Yes, I have the right to see what's in-core for
> this file". Because the problem is literally that if it's some random
> read-only system file, the kernel shouldn't leak access patterns to
> it..

This probably isn't a good heuristic, but thought I'd mention it
anyway ... if the file is executable and you're not the owner, mincore
always/never says its pages are resident. That'd fix all library leaks,
but then there's probably a smart way of figuring out something from
access patterns to a data file of some kind (/etc/passwd?)

2019-01-17 08:45:54

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, Jan 16, 2019 at 04:54:49PM +1200, Linus Torvalds wrote:
> On Wed, Jan 16, 2019 at 11:45 AM Dave Chinner <[email protected]> wrote:
> >
> > I'm assuming that you can invalidate the page cache reliably by a
> > means that does not repeated require probing to detect invalidation
> > has occurred. I've mentioned one method in this discussion
> > already...
>
> Yes. And it was made clear to you that it was a bug in xfs dio and
> what the right thing to do was.
>
> And you ignored that, and claimed it was a feature.

Linus, either you aren't listening or you're being intentionally
provocative.

So, for the *third* time this thread: we can probably remove this
code but first we need to be sure it doesn't cause unexpected
regressions before we commit such a change. We are not cowboys who
test userspace behavioural changes on users without review or
discussion.

Indeed, I wrote a patch to remove the invalidation /several days
ago/ and put it into my test trees, and it's been there since. Just
because you don't see immediate changes doesn't mean it isn't
happening.

> Either you care or you don't. If you don't care (and so far everything
> you said seems to imply you don't),

Linus, this is just a personal attack and IMO a violation of the
CoC. It's straight out wrong, insulting, totally unprofessional and
completely uncalled for.

This is most definitely not a useful technical response to the
issues I raised. i.e you cut out all the context of my response
about whether "no probing necessary" page cache invalidation attacks
are something we need to care about in the future. We don't need you
to shout about existing "no probing necessary" page cache
invalidation attacks that are already being addressed, we need to
determine if it's going to be a recurring problem in future because
that directly affects the mitigation strategies we can implement.

-Dave.
--
Dave Chinner
[email protected]

2019-01-17 08:48:52

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Fri, Jan 11, 2019 at 08:36:55AM +0100, Jiri Kosina wrote:
> On Thu, 10 Jan 2019, Dave Chinner wrote:
>
> > Sounds nice from a theoretical POV, but reality has taught us very
> > different lessons.
> >
> > FWIW, a quick check of XFS's history so you understand how long this
> > behaviour has been around. It was introduced in the linux port in 2001
> > as direct IO support was being added:
> >
> > commit e837eac23662afae603aaaef7c94bc839c1b8f67
> > Author: Steve Lord <[email protected]>
> > Date: Mon Mar 5 16:47:52 2001 +0000
> >
> > Add bounds checking for direct I/O, do the cache invalidation for
> > data coherency on direct I/O.
>
> Out of curiosity, which repository is this from please? Even google
> doesn't seem to know about this SHA.

because oss.sgi.com is no longer with us, it's fallen out of all the
search engines. It was from the "archive/xfs-import.git" tree on
oss.sgi.com:

https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi

but archive.org doesn't have a copy of the git tree. It contained
the XFS history right back to the first Irix commit in 1993. Some of
us still have copies of it sitting around....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-17 08:58:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 17, 2019 at 9:37 AM Matthew Wilcox <[email protected]> wrote:
>
> Your patch 3/3 just removes the test. Am I right in thinking that it
> doesn't need to be *moved* because the existing test after !PageUptodate
> catches it?

That's the _hope_.

That's the simplest patch I can come up with as a potential solution.
But it's possible that there's some nasty performance regression
because somebody really relies on not even triggering read-ahead, and
we might need to do some totally different thing.

So it may be that somebody has a case that really wants something
else, and we'd need to move the RWF_NOWAIT test elsewhere and do
something slightly more complicated. As with the mincore() change,
maybe reality doesn't like the simplest fix...

> Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there
> any in LTP?

RWF_NOWAIT is actually _fairly_ new. It was introduced "only" about
18 months ago and made it into 4.13.

Which makes me hopeful there aren't a lot of people who care deeply.

And starting readahead *may* actually be what a RWF_NOWAIT read user
generally wants, so for all we know it might even improve performance
and/or allow new uses. With the "start readahead but don't wait for
it" semantics, you can have a model where you try to handle all the
requests that can be handled out of cache first (using RWF_NOWAIT) and
then when you've run out of cached cases you clear the RWF_NOWAIT
flag, but now the IO has been started early (and could overlap with
the cached request handling), so then when you actually do a blocking
version, you get much better performance.

So there is an argument that removing that one RWF_NOWAIT case might
actually be a good thing in general, outside of the "don't allow
probing the cache without changing the state of it" issue.

But that's handwavy and optimistic. Reality is often not as accomodating ;)

Linus

2019-01-17 09:38:38

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, 17 Jan 2019, Dave Chinner wrote:

> > > commit e837eac23662afae603aaaef7c94bc839c1b8f67
> > > Author: Steve Lord <[email protected]>
> > > Date: Mon Mar 5 16:47:52 2001 +0000
> > >
> > > Add bounds checking for direct I/O, do the cache invalidation for
> > > data coherency on direct I/O.
> >
> > Out of curiosity, which repository is this from please? Even google
> > doesn't seem to know about this SHA.
>
> because oss.sgi.com is no longer with us, it's fallen out of all the
> search engines. It was from the "archive/xfs-import.git" tree on
> oss.sgi.com:
>
> https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi
>
> but archive.org doesn't have a copy of the git tree. It contained
> the XFS history right back to the first Irix commit in 1993. Some of
> us still have copies of it sitting around....

For cases like this, would it be worth pushing it to git.kernel.org as an
frozen historical reference archive?

Thanks,

--
Jiri Kosina
SUSE Labs


2019-01-17 09:56:04

by Cyril Hrubis

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Hi!
> > Your patch 3/3 just removes the test. Am I right in thinking that it
> > doesn't need to be *moved* because the existing test after !PageUptodate
> > catches it?
>
> Exactly. It just initiates read-ahead for IOCB_NOWAIT cases as well, and
> if it's actually set, it'll be handled by the !PageUpdtodate case.
>
> > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there
> > any in LTP?
>
> Not in the released version AFAIK. I've asked the LTP maintainer (in our
> internal bugzilla) to take care of this thread a few days ago, but not
> sure what came out of it. Adding him (Cyril) to CC.

So far not much, I've looked over our mincore() tests and noted down how
to improve them here:

https://github.com/linux-test-project/ltp/issues/461

We do plan to test the final mincore() fix:

https://github.com/linux-test-project/ltp/issues/460

And we do have RWF_NOWAIT tests on our TODO for some time as well:

https://github.com/linux-test-project/ltp/issues/286

I guess I can raise priority for that one so that we have basic
functional tests in a week or so. Also if anyone has some RWF_NOWAIT
tests already it would be nice if these could be shared with us.


[A bit off topic rant]

I've been telling kernel developers for years that if they have a test
code they used when developing a kernel feature that they should share
it with us (LTP community) and we will turn these into automated tests
and maintain them for free. LTP is also used in many QA departements
around the word so such tests will end up executed in different
environments also for free. Sadly this does not happen much and there
are only few exceptions so far. But maybe I wasn't shouting loudly
enough.

--
Cyril Hrubis
[email protected]

2019-01-17 21:52:02

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On 1/16/2019 8:52 AM, Josh Snyder wrote:
> On Tue, Jan 15, 2019 at 10:34 PM Dominique Martinet <[email protected]>
> wrote:
>>
>> There is a difference with your previous patch though, that used to list no
>> page in core when it didn't know; this patch lists pages as in core when it
>> refuses to tell. I don't think that's very important, though.

I've argued previously that reporting false positives (as your patch does)
should be better, otherwise there might be somebody trying to fault in their
pages in a loop until mincore reports positive, which would become an endless
loop. So agreed with your change.

Or maybe we could resort to the 5.0-rc1 page table check (that is now being
reverted) but only in cases when we are not allowed the page cache residency
check? Or would that be needlessly complicated? And it would be able to leak if
a page was evicted from the page cache...

> Is there a reason not to return -EPERM in this case?

That would definitely break somebody.

>>
>> If anything, the 0400 user-owner file might be a problem in some edge
>> case (e.g. if you're preloading git directories, many objects are 0444);
>> should we *also* check ownership?...
>
> Yes, this seems valuable. Some databases with immutable files (e.g. git, as
> you've mentioned) conceivably operate this way.
>
> Josh
>


2019-01-17 22:30:56

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 17, 2019 at 09:18:41AM +0100, Jiri Kosina wrote:
> On Thu, 17 Jan 2019, Dave Chinner wrote:
>
> > > > commit e837eac23662afae603aaaef7c94bc839c1b8f67
> > > > Author: Steve Lord <[email protected]>
> > > > Date: Mon Mar 5 16:47:52 2001 +0000
> > > >
> > > > Add bounds checking for direct I/O, do the cache invalidation for
> > > > data coherency on direct I/O.
> > >
> > > Out of curiosity, which repository is this from please? Even google
> > > doesn't seem to know about this SHA.
> >
> > because oss.sgi.com is no longer with us, it's fallen out of all the
> > search engines. It was from the "archive/xfs-import.git" tree on
> > oss.sgi.com:
> >
> > https://web.archive.org/web/20120326044237/http://oss.sgi.com:80/cgi-bin/gitweb.cgi
> >
> > but archive.org doesn't have a copy of the git tree. It contained
> > the XFS history right back to the first Irix commit in 1993. Some of
> > us still have copies of it sitting around....
>
> For cases like this, would it be worth pushing it to git.kernel.org as an
> frozen historical reference archive?

I'm not sure we should be putting code from Irix on kernel.org. I
uploaded a copy to github a few months ago so XFS devs could easily
reference relevant commits in email. The reasons we decided not to
upload it to kernel.org should be clear from the readme...

https://github.com/dchinner/xfs-history

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-01-18 04:52:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Fri, Jan 18, 2019 at 9:45 AM Vlastimil Babka <[email protected]> wrote:
>
> Or maybe we could resort to the 5.0-rc1 page table check (that is now being
> reverted) but only in cases when we are not allowed the page cache residency
> check? Or would that be needlessly complicated?

I think it would be good fallback semantics, but I'm not sure it's
worth it. Have you tried writing a patch for it? I don't think you'd
want to do the check *when* you find a hole, so you'd have to do it
upfront and then pass the cached data down with the private pointer
(or have a separate "struct mm_walk" structure, perhaps?

So I suspect we're better off with the patch we have. But if somebody
*wants* to try to do that fancier patch, and it doesn't look
horrendous, I think it might be the "quality" solution.

Linus

2019-01-18 04:58:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 17, 2019 at 4:51 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Jan 17, 2019 at 9:37 AM Matthew Wilcox <[email protected]> wrote:
> >
> > Your patch 3/3 just removes the test. Am I right in thinking that it
> > doesn't need to be *moved* because the existing test after !PageUptodate
> > catches it?
>
> That's the _hope_.
>
> That's the simplest patch I can come up with as a potential solution.
> But it's possible that there's some nasty performance regression
> because somebody really relies on not even triggering read-ahead, and
> we might need to do some totally different thing.

Oh, and somebody should probably check that there isn't some simple
way to just avoid that readahead code entirely.

In particular, right now we skip readahead for at least these cases:

/* no read-ahead */
if (!ra->ra_pages)
return;

if (blk_cgroup_congested())
return;

and I don't think we need to worry about the cgroup congestion case -
if the attack has to also congest its cgroup with IO, I think they
have bigger problems.

And I think 'ra_pages' can be zero only in the presence of IO errors,
but I might be wrong. It would be good if somebody double-checks that.

Linus

2019-01-18 14:25:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Hello, Linus.

On Sat, Jan 05, 2019 at 01:54:03PM -0800, Linus Torvalds wrote:
> And the first hit is 'fincore', which probably nobody cares about
> anyway, but it does
>
> fd = open (name, O_RDONLY)
> ..
> mmap(window, len, PROT_NONE, MAP_PRIVATE, ..

So, folks here have been using fincore(1) for diagnostic purposes and
are also looking to expand on it to investigate per-cgroup cache
usages (mmap -> mincore -> /proc/self/pagemap -> /proc/kpagecgroup ->
cgroup path).

These are all root-only usages to find out what's going on with the
whole page cache. We aren't attached to doing things this particular
way but it'd suck if there's no way.

Thanks.

--
tejun

2019-01-18 19:04:17

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On 1/18/19 5:49 AM, Linus Torvalds wrote:
> On Fri, Jan 18, 2019 at 9:45 AM Vlastimil Babka <[email protected]> wrote:
>>
>> Or maybe we could resort to the 5.0-rc1 page table check (that is now being
>> reverted) but only in cases when we are not allowed the page cache residency
>> check? Or would that be needlessly complicated?
>
> I think it would be good fallback semantics, but I'm not sure it's
> worth it. Have you tried writing a patch for it? I don't think you'd
> want to do the check *when* you find a hole, so you'd have to do it
> upfront and then pass the cached data down with the private pointer
> (or have a separate "struct mm_walk" structure, perhaps?
>
> So I suspect we're better off with the patch we have. But if somebody
> *wants* to try to do that fancier patch, and it doesn't look
> horrendous, I think it might be the "quality" solution.

I thought to drop the idea because of leaking that page has been
evicted, but then I realized there are other ways to check for that
anyway in /proc. So I'll try, but probably not until after next week. If
somebody else wants to, they are welcome. As you say, the current
solution should be ok, so that would be a patch on top anyway, for
bisectability etc.

Vlastimil

> Linus
>


2019-01-23 20:28:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 17, 2019 at 9:23 AM Jiri Kosina <[email protected]> wrote:
>
> So I've done some basic smoke testing (~2 hours of LTP+xfstests) on the
> kernel with the three topmost patches from
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel
>
> applied (also attaching to this mail), and no obvious breakage popped up.
>
> So if noone sees any principal problem there, I'll happily submit it with
> proper attribution etc.

So this seems to have died down, and a week later we seem to not have
a lot of noise here any more. I think it means people either weren't
testing it, or just didn't find any real problems.

I've reverted the 'let's try to just remove the code' part in my tree.
But I didn't apply the two other patches yet. Any final comments
before that should happen?

Linus

2019-01-23 20:36:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 24, 2019 at 9:27 AM Linus Torvalds
<[email protected]> wrote:
>
> I've reverted the 'let's try to just remove the code' part in my tree.
> But I didn't apply the two other patches yet. Any final comments
> before that should happen?

Side note: the inode_permission() addition to can_do_mincore() in that
patch 0002, seems to be questionable. We do

+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+ return vma_is_anonymous(vma)
+ || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE))
+ || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
+}

note how it tests whether vma->vm_file is NULL for the FMODE_WRITE
test, but not for the inode_permission() test.

So either we test unnecessarily in the second line, or we don't
properly test it in the third one.

I think the "test vm_file" thing may be unnecessary, because a
non-anonymous mapping should always have a file pointer and an inode.
But I could imagine some odd case (vdso mapping, anyone?) that
doesn't have a vm_file, but also isn't anonymous.

Anybody?

Anyway, it's one reason why I didn't actually apply those other two
patches yet. This may be a 5.1 issue..

Linus

2019-01-23 23:13:31

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, 24 Jan 2019, Linus Torvalds wrote:

> Side note: the inode_permission() addition to can_do_mincore() in that
> patch 0002, seems to be questionable. We do
>
> +static inline bool can_do_mincore(struct vm_area_struct *vma)
> +{
> + return vma_is_anonymous(vma)
> + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE))
> + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
> +}
>
> note how it tests whether vma->vm_file is NULL for the FMODE_WRITE
> test, but not for the inode_permission() test.
>
> So either we test unnecessarily in the second line, or we don't
> properly test it in the third one.
>
> I think the "test vm_file" thing may be unnecessary, because a
> non-anonymous mapping should always have a file pointer and an inode.
> But I could imagine some odd case (vdso mapping, anyone?) that
> doesn't have a vm_file, but also isn't anonymous.

Hmm, good point.

So dropping the 'vma->vm_file' test and checking whether given vma is
special mapping should hopefully provide the desired semantics, shouldn't
it?

--
Jiri Kosina
SUSE Labs


2019-01-24 00:22:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, Jan 24, 2019 at 12:12 PM Jiri Kosina <[email protected]> wrote:
>
> >
> > I think the "test vm_file" thing may be unnecessary, because a
> > non-anonymous mapping should always have a file pointer and an inode.
> > But I could imagine some odd case (vdso mapping, anyone?) that
> > doesn't have a vm_file, but also isn't anonymous.
>
> Hmm, good point.
>
> So dropping the 'vma->vm_file' test and checking whether given vma is
> special mapping should hopefully provide the desired semantics, shouldn't
> it?

Maybe. But on the whole I think it would be simpler and more
straightforward to just instead add a vm_file test for the
inode_permission() case. That way you at least know that you aren't
following a NULL pointer.

If the file then turns out to be some special thing, it doesn't really
_matter_, I think. It won't have anything in the page cache etc, but
the code should "work".

Linus

2019-01-24 00:26:16

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds wrote on Thu, Jan 24, 2019:
> I've reverted the 'let's try to just remove the code' part in my tree.
> But I didn't apply the two other patches yet. Any final comments
> before that should happen?

I mentionned when sending the updated version that just checking file
permission might not be enough, e.g. a git tree is full of read-only
objects that someone might want to preload and think we might really
want to check both despite the overhead in the denied case.

Josh agreed and I meant to send a new version since nothing was
happening but work priorities got the better of me, and I was kind of
waiting for the ltp testcases[1] as well because aside from the few
tests I ran by hand I'm not sure the few hours of ltp/xfstests Jiri ran
did much but this is probably going to be a chicken-or-egg problem..

[1] https://github.com/linux-test-project/ltp/issues/461

Jiri Kosina wrote on Thu, Jan 24, 2019:
> On Thu, 24 Jan 2019, Linus Torvalds wrote:
>
> > Side note: the inode_permission() addition to can_do_mincore() in that
> > patch 0002, seems to be questionable. We do
> >
> > +static inline bool can_do_mincore(struct vm_area_struct *vma)
> > +{
> > + return vma_is_anonymous(vma)
> > + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE))
> > + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
> > +}
> >
> > note how it tests whether vma->vm_file is NULL for the FMODE_WRITE
> > test, but not for the inode_permission() test.
> >
> > So either we test unnecessarily in the second line, or we don't
> > properly test it in the third one.
> >
> > I think the "test vm_file" thing may be unnecessary, because a
> > non-anonymous mapping should always have a file pointer and an inode.
> > But I could imagine some odd case (vdso mapping, anyone?) that
> > doesn't have a vm_file, but also isn't anonymous.
>
> Hmm, good point.
>
> So dropping the 'vma->vm_file' test and checking whether given vma is
> special mapping should hopefully provide the desired semantics, shouldn't
> it?

I think it's probably better to keep this simple, if we're going to
check something before accessing vm_file we might as well directly check
it.

I was thinking of something along the lines of:
return vma_is_anonymous(vma) || (vma->vm_file &&
(inode_owner_or_capable(file_inode(vma->vm_file))
|| inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0));

I dropped the first f_mode check because none of the known mincore users
open the files read-write, and the check is redundant with
inode_permission() so while it would probably be an optimisation in some
cases I do not think it is useful in practice.
On the other hand, I have no idea how expensive the inode_permission and
owner checks really are - do they try to refresh attributes on a
networked filesystem or would it trust the cache or is it fs dependant?

Honestly this is more a case of "the people who's be interested in
seeing this have no idea what they're doing" than lack of interest.. I
wouldn't mind if there were tests doing mincore on a bunch of special
files/mappings but I just tried on a few regular files by hand, this
isn't proper coverage; I'll try to take more time to test various
mappings today (JST).


Thanks,
--
Dominique

2019-01-24 12:45:40

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Dominique Martinet wrote on Thu, Jan 24, 2019:
> I was thinking of something along the lines of:
> return vma_is_anonymous(vma) || (vma->vm_file &&
> (inode_owner_or_capable(file_inode(vma->vm_file))
> || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0));
>
> I dropped the first f_mode check because none of the known mincore users
> open the files read-write, and the check is redundant with
> inode_permission() so while it would probably be an optimisation in some
> cases I do not think it is useful in practice.
> On the other hand, I have no idea how expensive the inode_permission and
> owner checks really are - do they try to refresh attributes on a
> networked filesystem or would it trust the cache or is it fs dependant?
>
> Honestly this is more a case of "the people who's be interested in
> seeing this have no idea what they're doing" than lack of interest.. I
> wouldn't mind if there were tests doing mincore on a bunch of special
> files/mappings but I just tried on a few regular files by hand, this
> isn't proper coverage; I'll try to take more time to test various
> mappings today (JST).

I've done some tests with this, it appears OK.

Obviously the tests I previously had done still work:
- user's own files are ok, even if read-only now.
- non-user writable files are ok.
- non-user non-writable files (e.g. system libs) aren't.
- root can still do anything.

On new tests:
- there are vmas with no file that aren't anonymous and come all the
way there (vvar and vdso), so factoring vma->vm_file check is definitely
needed.
- vsyscall doesn't reach can_do_mincore()
- [heap] [stack] and other fileless regular maps are anonymous

- I tried a char device (/dev/zero) and it was marked anonymous despite
mapping with MAP_SHARED, which is somewhat expected I guess?
- I couldn't map /proc or /sys files (no such device), so no mincore
there.


I'd post my test program but I actually added pr_info messages in
can_do_mincore to check what it returned because madvise dontneed isn't
guaranteed to evict pages so we can't rely on madvise dontneed + mincore
to return 0; not sure what to do for ltp... If anyone has a good idea of
how to check if mincore actually got granted permissions without
drop_caches I'll post to the ltp github.


Anything else to try?

Jiri, you've offered resubmitting the last two patches properly, can you
incorporate this change or should I just send this directly? (I'd take
most of your commit message and add your name somewhere)


Thanks,
--
Dominique

2019-01-24 14:26:33

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, 24 Jan 2019, Dominique Martinet wrote:

> Jiri, you've offered resubmitting the last two patches properly, can you
> incorporate this change or should I just send this directly? (I'd take
> most of your commit message and add your name somewhere)

I've been running some basic smoke testing with the kernel from

https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel-v2

(attaching the respective two patches to apply on top of latest Linus'
tree to this mail as well), and everything looks good so far.

Thanks,

--
Jiri Kosina
SUSE Labs


Attachments:
0001-mm-mincore-make-mincore-more-conservative.patch (2.28 kB)
0002-mm-filemap-initiate-readahead-even-if-IOCB_NOWAIT-is.patch (1.34 kB)
Download all attachments

2019-01-27 22:36:17

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Thu, 24 Jan 2019, Jiri Kosina wrote:

> > Jiri, you've offered resubmitting the last two patches properly, can you
> > incorporate this change or should I just send this directly? (I'd take
> > most of your commit message and add your name somewhere)
>
> I've been running some basic smoke testing with the kernel from
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel-v2
>
> (attaching the respective two patches to apply on top of latest Linus'
> tree to this mail as well), and everything looks good so far.

So, any objections to aproaching it this way?

I've not been able to spot any obvious breakage so far with it.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-01-28 00:06:26

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Jiri Kosina wrote on Sun, Jan 27, 2019:
> So, any objections to aproaching it this way?

I'm not sure why I'm the main recipient of that mail but answering
because I am -- let's get these patches in through the regular -mm tree
though

--
Dominique

2019-01-28 13:53:47

by Cyril Hrubis

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Hi!
> > Of course, there aren't any tests for RWF_NOWAIT in xfstests. Are there
> > any in LTP?

Just FYI I've send a patch with basic RWF_NOWAIT test for review to LTP
ML and also CCed mailing lists from this thread.

https://lkml.org/lkml/2019/1/28/416

--
Cyril Hrubis
[email protected]

2019-01-29 23:53:50

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Mon, 28 Jan 2019, Dominique Martinet wrote:

> > So, any objections to aproaching it this way?
>
> I'm not sure why I'm the main recipient of that mail but answering
> because I am -- let's get these patches in through the regular -mm tree
> though

*prod to mm maintainers* (at least for an opinion)

--
Jiri Kosina
SUSE Labs


2019-01-30 09:11:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed 30-01-19 00:52:02, Jiri Kosina wrote:
> On Mon, 28 Jan 2019, Dominique Martinet wrote:
>
> > > So, any objections to aproaching it this way?
> >
> > I'm not sure why I'm the main recipient of that mail but answering
> > because I am -- let's get these patches in through the regular -mm tree
> > though
>
> *prod to mm maintainers* (at least for an opinion)

Could you repost those patches please? The thread is long and it is not
really clear what is the most up-to-date state of patches (at least to
me).
--
Michal Hocko
SUSE Labs

2019-01-30 12:30:33

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

On Wed, 30 Jan 2019, Michal Hocko wrote:

> > > I'm not sure why I'm the main recipient of that mail but answering
> > > because I am -- let's get these patches in through the regular -mm
> > > tree though
> >
> > *prod to mm maintainers* (at least for an opinion)
>
> Could you repost those patches please? The thread is long and it is not
> really clear what is the most up-to-date state of patches (at least to
> me).

Vlastimil seems to have one extra patch to go on top, so we agreed that
he'll be sending that as a complete self-contained series (either as a
followup to the very first e-mail in this monsterthread, or completely
separately) shortly.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-01-30 12:45:41

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

From: Jiri Kosina <[email protected]>

preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache contents, as
it reveals metadata about residency of pages in pagecache.

If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page not
resident" information, and vice versa.

Close that sidechannel by always initiating readahead on the cache if we
encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing
the pagecache residency itself will actually populate the cache, making the
sidechannel useless.

Originally-by: Linus Torvalds <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Kevin Easton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Cyril Hrubis <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Daniel Gruss <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/filemap.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..7bcdd36e629d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,

page = find_get_page(mapping, index);
if (!page) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- goto would_block;
page_cache_sync_readahead(mapping,
ra, filp,
index, last_index - index);
--
2.20.1


2019-01-30 12:45:46

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH 1/3] mm/mincore: make mincore() more conservative

From: Jiri Kosina <[email protected]>

The semantics of what mincore() considers to be resident is not completely
clear, but Linux has always (since 2.3.52, which is when mincore() was
initially done) treated it as "page is available in page cache".

That's potentially a problem, as that [in]directly exposes meta-information
about pagecache / memory mapping state even about memory not strictly belonging
to the process executing the syscall, opening possibilities for sidechannel
attacks.

Change the semantics of mincore() so that it only reveals pagecache information
for non-anonymous mappings that belog to files that the calling process could
(if it tried to) successfully open for writing.

Originally-by: Linus Torvalds <[email protected]>
Originally-by: Dominique Martinet <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Kevin Easton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Cyril Hrubis <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Daniel Gruss <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/mincore.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..747a4907a3ac 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
return 0;
}

+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+ return vma_is_anonymous(vma) ||
+ (vma->vm_file &&
+ (inode_owner_or_capable(file_inode(vma->vm_file))
+ || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0));
+}
+
/*
* Do a chunk of "sys_mincore()". We've already checked
* all the arguments, we hold the mmap semaphore: we should
@@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
vma = find_vma(current->mm, addr);
if (!vma || addr < vma->vm_start)
return -ENOMEM;
- mincore_walk.mm = vma->vm_mm;
end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
+ if (!can_do_mincore(vma)) {
+ unsigned long pages = (end - addr) >> PAGE_SHIFT;
+ memset(vec, 1, pages);
+ return pages;
+ }
+ mincore_walk.mm = vma->vm_mm;
err = walk_page_range(addr, end, &mincore_walk);
if (err < 0)
return err;
--
2.20.1


2019-01-30 12:46:14

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

I've collected the patches from the discussion for formal posting. The first
two should be settled already, third one is the possible improvement I've
mentioned earlier, where only in restricted case we resort to existence of page
table mapping (the original and later reverted approach from Linus) instead of
faking the result completely. Review and testing welcome.

The consensus seems to be going through -mm tree for 5.1, unless Linus wants
them alredy for 5.0.

Jiri Kosina (2):
mm/mincore: make mincore() more conservative
mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

Vlastimil Babka (1):
mm/mincore: provide mapped status when cached status is not allowed

mm/filemap.c | 2 --
mm/mincore.c | 54 ++++++++++++++++++++++++++++++++++++++++------------
2 files changed, 42 insertions(+), 14 deletions(-)

--
2.20.1


2019-01-30 12:46:32

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

After "mm/mincore: make mincore() more conservative" we sometimes restrict the
information about page cache residency, which we have to do without breaking
existing userspace, if possible. We thus fake the resulting values as 1, which
should be safer than faking them as 0, as there might theoretically exist code
that would try to fault in the page(s) until mincore() returns 1.

Faking 1 however means that such code would not fault in a page even if it was
not in page cache, with unwanted performance implications. We can improve the
situation by revisting the approach of 574823bfab82 ("Change mincore() to count
"mapped" pages rather than "cached" pages") but only applying it to cases where
page cache residency check is restricted. Thus mincore() will return 0 for an
unmapped page (which may or may not be resident in a pagecache), and 1 after
the process faults it in.

One potential downside is that mincore() will be again able to recognize when a
previously mapped page was reclaimed. While that might be useful for some
attack scenarios, it's not as crucial as recognizing that somebody else faulted
the page in, and there are also other ways to recognize reclaimed pages anyway.

Cc: Jiri Kosina <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Kevin Easton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Cyril Hrubis <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Daniel Gruss <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/mincore.c | 49 +++++++++++++++++++++++++++++++++----------------
1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 747a4907a3ac..d6784a803ae7 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -21,12 +21,18 @@
#include <linux/uaccess.h>
#include <asm/pgtable.h>

+struct mincore_walk_private {
+ unsigned char *vec;
+ bool can_check_pagecache;
+};
+
static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
#ifdef CONFIG_HUGETLB_PAGE
unsigned char present;
- unsigned char *vec = walk->private;
+ struct mincore_walk_private *walk_private = walk->private;
+ unsigned char *vec = walk_private->vec;

/*
* Hugepages under user process are always in RAM and never
@@ -35,7 +41,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
present = pte && !huge_pte_none(huge_ptep_get(pte));
for (; addr != end; vec++, addr += PAGE_SIZE)
*vec = present;
- walk->private = vec;
+ walk_private->vec = vec;
#else
BUG();
#endif
@@ -85,7 +91,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
}

static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
- struct vm_area_struct *vma, unsigned char *vec)
+ struct vm_area_struct *vma, unsigned char *vec,
+ bool can_check_pagecache)
{
unsigned long nr = (end - addr) >> PAGE_SHIFT;
int i;
@@ -95,7 +102,9 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,

pgoff = linear_page_index(vma, addr);
for (i = 0; i < nr; i++, pgoff++)
- vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+ vec[i] = can_check_pagecache ?
+ mincore_page(vma->vm_file->f_mapping, pgoff)
+ : 0;
} else {
for (i = 0; i < nr; i++)
vec[i] = 0;
@@ -106,8 +115,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
static int mincore_unmapped_range(unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
- walk->private += __mincore_unmapped_range(addr, end,
- walk->vma, walk->private);
+ struct mincore_walk_private *walk_private = walk->private;
+ unsigned char *vec = walk_private->vec;
+
+ walk_private->vec += __mincore_unmapped_range(addr, end, walk->vma,
+ vec, walk_private->can_check_pagecache);
return 0;
}

@@ -117,7 +129,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
spinlock_t *ptl;
struct vm_area_struct *vma = walk->vma;
pte_t *ptep;
- unsigned char *vec = walk->private;
+ struct mincore_walk_private *walk_private = walk->private;
+ unsigned char *vec = walk_private->vec;
int nr = (end - addr) >> PAGE_SHIFT;

ptl = pmd_trans_huge_lock(pmd, vma);
@@ -128,7 +141,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}

if (pmd_trans_unstable(pmd)) {
- __mincore_unmapped_range(addr, end, vma, vec);
+ __mincore_unmapped_range(addr, end, vma, vec,
+ walk_private->can_check_pagecache);
goto out;
}

@@ -138,7 +152,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,

if (pte_none(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
- vma, vec);
+ vma, vec, walk_private->can_check_pagecache);
else if (pte_present(pte))
*vec = 1;
else { /* pte is a swap entry */
@@ -152,8 +166,12 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
*vec = 1;
} else {
#ifdef CONFIG_SWAP
- *vec = mincore_page(swap_address_space(entry),
+ if (walk_private->can_check_pagecache)
+ *vec = mincore_page(
+ swap_address_space(entry),
swp_offset(entry));
+ else
+ *vec = 0;
#else
WARN_ON(1);
*vec = 1;
@@ -187,22 +205,21 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
struct vm_area_struct *vma;
unsigned long end;
int err;
+ struct mincore_walk_private walk_private = {
+ .vec = vec
+ };
struct mm_walk mincore_walk = {
.pmd_entry = mincore_pte_range,
.pte_hole = mincore_unmapped_range,
.hugetlb_entry = mincore_hugetlb,
- .private = vec,
+ .private = &walk_private
};

vma = find_vma(current->mm, addr);
if (!vma || addr < vma->vm_start)
return -ENOMEM;
end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
- if (!can_do_mincore(vma)) {
- unsigned long pages = (end - addr) >> PAGE_SHIFT;
- memset(vec, 1, pages);
- return pages;
- }
+ walk_private.can_check_pagecache = can_do_mincore(vma);
mincore_walk.mm = vma->vm_mm;
err = walk_page_range(addr, end, &mincore_walk);
if (err < 0)
--
2.20.1


2019-01-30 15:06:51

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

* Vlastimil Babka:

> preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache
> contents, as it reveals metadata about residency of pages in
> pagecache.
>
> If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page
> not resident" information, and vice versa.
>
> Close that sidechannel by always initiating readahead on the cache if
> we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place,
> probing the pagecache residency itself will actually populate the
> cache, making the sidechannel useless.

I think this needs to use a different flag because the semantics are so
much different. If I understand this change correctly, previously,
RWF_NOWAIT essentially avoided any I/O, and now it does not.

Thanks,
Florian

2019-01-30 15:18:00

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Wed, 30 Jan 2019, Florian Weimer wrote:

> > preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache
> > contents, as it reveals metadata about residency of pages in
> > pagecache.
> >
> > If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page
> > not resident" information, and vice versa.
> >
> > Close that sidechannel by always initiating readahead on the cache if
> > we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place,
> > probing the pagecache residency itself will actually populate the
> > cache, making the sidechannel useless.
>
> I think this needs to use a different flag because the semantics are so
> much different. If I understand this change correctly, previously,
> RWF_NOWAIT essentially avoided any I/O, and now it does not.

It still avoid synchronous I/O, due to this code still being in place:

if (!PageUptodate(page)) {
if (iocb->ki_flags & IOCB_NOWAIT) {
put_page(page);
goto would_block;
}

but goes the would_block path only after initiating asynchronous
readahead.

--
Jiri Kosina
SUSE Labs


2019-01-31 09:44:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

On Wed 30-01-19 13:44:18, Vlastimil Babka wrote:
> From: Jiri Kosina <[email protected]>
>
> The semantics of what mincore() considers to be resident is not completely
> clear, but Linux has always (since 2.3.52, which is when mincore() was
> initially done) treated it as "page is available in page cache".
>
> That's potentially a problem, as that [in]directly exposes meta-information
> about pagecache / memory mapping state even about memory not strictly belonging
> to the process executing the syscall, opening possibilities for sidechannel
> attacks.
>
> Change the semantics of mincore() so that it only reveals pagecache information
> for non-anonymous mappings that belog to files that the calling process could
> (if it tried to) successfully open for writing.

I agree that this is a better way than the original 574823bfab82
("Change mincore() to count "mapped" pages rather than "cached" pages").
One thing is still not clear to me though. Is the new owner/writeable
check OK for the Netflix-like usecases? I mean does happycache have
appropriate access to the cache data? I have tried to re-read the
original thread but couldn't find any confirmation.

I nit below

> Originally-by: Linus Torvalds <[email protected]>
> Originally-by: Dominique Martinet <[email protected]>
> Cc: Dominique Martinet <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Kevin Easton <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Cyril Hrubis <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Daniel Gruss <[email protected]>
> Signed-off-by: Jiri Kosina <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>

other than that looks good to me.
Acked-by: Michal Hocko <[email protected]>

If this still doesn't help happycache kind of workloads then we should
add a capability check IMO but this looks like a decent foundation to
me.

> ---
> mm/mincore.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 218099b5ed31..747a4907a3ac 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -169,6 +169,14 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> return 0;
> }
>
> +static inline bool can_do_mincore(struct vm_area_struct *vma)
> +{
> + return vma_is_anonymous(vma) ||
> + (vma->vm_file &&
> + (inode_owner_or_capable(file_inode(vma->vm_file))
> + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0));
> +}

This is hard to read. Can we do
if (vma_is_anonymous(vma))
return true;
if (!vma->vm_file)
return false;
return inode_owner_or_capable(file_inode(vma->vm_file)) ||
inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;

--
Michal Hocko
SUSE Labs

2019-01-31 09:52:07

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

Michal Hocko wrote on Thu, Jan 31, 2019:
> > Change the semantics of mincore() so that it only reveals pagecache information
> > for non-anonymous mappings that belog to files that the calling process could
> > (if it tried to) successfully open for writing.
>
> I agree that this is a better way than the original 574823bfab82
> ("Change mincore() to count "mapped" pages rather than "cached" pages").
> One thing is still not clear to me though. Is the new owner/writeable
> check OK for the Netflix-like usecases? I mean does happycache have
> appropriate access to the cache data? I have tried to re-read the
> original thread but couldn't find any confirmation.

It's enough for my use cases and Josh didn't seem to oppose, but since
he's not in Cc I don't think he would answer -- added him now :)

FWIW happycache writes in the current directory by default so I assume
in the way they use it it would usually have access one way or another.

> If this still doesn't help happycache kind of workloads then we should
> add a capability check IMO but this looks like a decent foundation to
> me.

the inode_owner_or_capable/inode_permission helpers already do allow
quite a few capabilities there


--
Dominique

2019-01-31 09:57:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

[Cc fs-devel]

On Wed 30-01-19 13:44:19, Vlastimil Babka wrote:
> From: Jiri Kosina <[email protected]>
>
> preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache contents, as
> it reveals metadata about residency of pages in pagecache.
>
> If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page not
> resident" information, and vice versa.
>
> Close that sidechannel by always initiating readahead on the cache if we
> encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing
> the pagecache residency itself will actually populate the cache, making the
> sidechannel useless.

I guess the current wording doesn't disallow background IO to be
triggered for EAGAIN case. I am not sure whether that breaks clever
applications which try to perform larger IO for those cases though.

> Originally-by: Linus Torvalds <[email protected]>
> Cc: Dominique Martinet <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Kevin Easton <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Cyril Hrubis <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Daniel Gruss <[email protected]>
> Signed-off-by: Jiri Kosina <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/filemap.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9f5e323e883e..7bcdd36e629d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
>
> page = find_get_page(mapping, index);
> if (!page) {
> - if (iocb->ki_flags & IOCB_NOWAIT)
> - goto would_block;
> page_cache_sync_readahead(mapping,
> ra, filp,
> index, last_index - index);

Maybe a stupid question but I am not really familiar with this path but
what exactly does prevent a sync read down page_cache_sync_readahead
path?
--
Michal Hocko
SUSE Labs

2019-01-31 10:09:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

On Wed 30-01-19 13:44:20, Vlastimil Babka wrote:
> After "mm/mincore: make mincore() more conservative" we sometimes restrict the
> information about page cache residency, which we have to do without breaking
> existing userspace, if possible. We thus fake the resulting values as 1, which
> should be safer than faking them as 0, as there might theoretically exist code
> that would try to fault in the page(s) until mincore() returns 1.
>
> Faking 1 however means that such code would not fault in a page even if it was
> not in page cache, with unwanted performance implications. We can improve the
> situation by revisting the approach of 574823bfab82 ("Change mincore() to count
> "mapped" pages rather than "cached" pages") but only applying it to cases where
> page cache residency check is restricted. Thus mincore() will return 0 for an
> unmapped page (which may or may not be resident in a pagecache), and 1 after
> the process faults it in.
>
> One potential downside is that mincore() will be again able to recognize when a
> previously mapped page was reclaimed. While that might be useful for some
> attack scenarios, it's not as crucial as recognizing that somebody else faulted
> the page in, and there are also other ways to recognize reclaimed pages anyway.

Is this really worth it? Do we know about any specific usecase that
would benefit from this change? TBH I would rather wait for the report
than add a hard to evaluate side channel.

> Cc: Jiri Kosina <[email protected]>
> Cc: Dominique Martinet <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Kevin Easton <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Cyril Hrubis <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Daniel Gruss <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
> ---
> mm/mincore.c | 49 +++++++++++++++++++++++++++++++++----------------
> 1 file changed, 33 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 747a4907a3ac..d6784a803ae7 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -21,12 +21,18 @@
> #include <linux/uaccess.h>
> #include <asm/pgtable.h>
>
> +struct mincore_walk_private {
> + unsigned char *vec;
> + bool can_check_pagecache;
> +};
> +
> static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
> unsigned long end, struct mm_walk *walk)
> {
> #ifdef CONFIG_HUGETLB_PAGE
> unsigned char present;
> - unsigned char *vec = walk->private;
> + struct mincore_walk_private *walk_private = walk->private;
> + unsigned char *vec = walk_private->vec;
>
> /*
> * Hugepages under user process are always in RAM and never
> @@ -35,7 +41,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
> present = pte && !huge_pte_none(huge_ptep_get(pte));
> for (; addr != end; vec++, addr += PAGE_SIZE)
> *vec = present;
> - walk->private = vec;
> + walk_private->vec = vec;
> #else
> BUG();
> #endif
> @@ -85,7 +91,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
> }
>
> static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
> - struct vm_area_struct *vma, unsigned char *vec)
> + struct vm_area_struct *vma, unsigned char *vec,
> + bool can_check_pagecache)
> {
> unsigned long nr = (end - addr) >> PAGE_SHIFT;
> int i;
> @@ -95,7 +102,9 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
>
> pgoff = linear_page_index(vma, addr);
> for (i = 0; i < nr; i++, pgoff++)
> - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
> + vec[i] = can_check_pagecache ?
> + mincore_page(vma->vm_file->f_mapping, pgoff)
> + : 0;
> } else {
> for (i = 0; i < nr; i++)
> vec[i] = 0;
> @@ -106,8 +115,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
> static int mincore_unmapped_range(unsigned long addr, unsigned long end,
> struct mm_walk *walk)
> {
> - walk->private += __mincore_unmapped_range(addr, end,
> - walk->vma, walk->private);
> + struct mincore_walk_private *walk_private = walk->private;
> + unsigned char *vec = walk_private->vec;
> +
> + walk_private->vec += __mincore_unmapped_range(addr, end, walk->vma,
> + vec, walk_private->can_check_pagecache);
> return 0;
> }
>
> @@ -117,7 +129,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> spinlock_t *ptl;
> struct vm_area_struct *vma = walk->vma;
> pte_t *ptep;
> - unsigned char *vec = walk->private;
> + struct mincore_walk_private *walk_private = walk->private;
> + unsigned char *vec = walk_private->vec;
> int nr = (end - addr) >> PAGE_SHIFT;
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> @@ -128,7 +141,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> }
>
> if (pmd_trans_unstable(pmd)) {
> - __mincore_unmapped_range(addr, end, vma, vec);
> + __mincore_unmapped_range(addr, end, vma, vec,
> + walk_private->can_check_pagecache);
> goto out;
> }
>
> @@ -138,7 +152,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>
> if (pte_none(pte))
> __mincore_unmapped_range(addr, addr + PAGE_SIZE,
> - vma, vec);
> + vma, vec, walk_private->can_check_pagecache);
> else if (pte_present(pte))
> *vec = 1;
> else { /* pte is a swap entry */
> @@ -152,8 +166,12 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> *vec = 1;
> } else {
> #ifdef CONFIG_SWAP
> - *vec = mincore_page(swap_address_space(entry),
> + if (walk_private->can_check_pagecache)
> + *vec = mincore_page(
> + swap_address_space(entry),
> swp_offset(entry));
> + else
> + *vec = 0;
> #else
> WARN_ON(1);
> *vec = 1;
> @@ -187,22 +205,21 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
> struct vm_area_struct *vma;
> unsigned long end;
> int err;
> + struct mincore_walk_private walk_private = {
> + .vec = vec
> + };
> struct mm_walk mincore_walk = {
> .pmd_entry = mincore_pte_range,
> .pte_hole = mincore_unmapped_range,
> .hugetlb_entry = mincore_hugetlb,
> - .private = vec,
> + .private = &walk_private
> };
>
> vma = find_vma(current->mm, addr);
> if (!vma || addr < vma->vm_start)
> return -ENOMEM;
> end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> - if (!can_do_mincore(vma)) {
> - unsigned long pages = (end - addr) >> PAGE_SHIFT;
> - memset(vec, 1, pages);
> - return pages;
> - }
> + walk_private.can_check_pagecache = can_do_mincore(vma);
> mincore_walk.mm = vma->vm_mm;
> err = walk_page_range(addr, end, &mincore_walk);
> if (err < 0)
> --
> 2.20.1

--
Michal Hocko
SUSE Labs

2019-01-31 10:16:06

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu, 31 Jan 2019, Michal Hocko wrote:

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 9f5e323e883e..7bcdd36e629d 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
> >
> > page = find_get_page(mapping, index);
> > if (!page) {
> > - if (iocb->ki_flags & IOCB_NOWAIT)
> > - goto would_block;
> > page_cache_sync_readahead(mapping,
> > ra, filp,
> > index, last_index - index);
>
> Maybe a stupid question but I am not really familiar with this path but
> what exactly does prevent a sync read down page_cache_sync_readahead
> path?

page_cache_sync_readahead() only submits the read ahead request(s), it
doesn't wait for it to finish.

--
Jiri Kosina
SUSE Labs


2019-01-31 10:24:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu 31-01-19 11:15:28, Jiri Kosina wrote:
> On Thu, 31 Jan 2019, Michal Hocko wrote:
>
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 9f5e323e883e..7bcdd36e629d 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
> > >
> > > page = find_get_page(mapping, index);
> > > if (!page) {
> > > - if (iocb->ki_flags & IOCB_NOWAIT)
> > > - goto would_block;
> > > page_cache_sync_readahead(mapping,
> > > ra, filp,
> > > index, last_index - index);
> >
> > Maybe a stupid question but I am not really familiar with this path but
> > what exactly does prevent a sync read down page_cache_sync_readahead
> > path?
>
> page_cache_sync_readahead() only submits the read ahead request(s), it
> doesn't wait for it to finish.

OK, I guess my question was not precise. What does prevent taking fs
locks down the path?
--
Michal Hocko
SUSE Labs

2019-01-31 10:30:56

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu, 31 Jan 2019, Michal Hocko wrote:

> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 9f5e323e883e..7bcdd36e629d 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
> > > >
> > > > page = find_get_page(mapping, index);
> > > > if (!page) {
> > > > - if (iocb->ki_flags & IOCB_NOWAIT)
> > > > - goto would_block;
> > > > page_cache_sync_readahead(mapping,
> > > > ra, filp,
> > > > index, last_index - index);
> > >
> > > Maybe a stupid question but I am not really familiar with this path but
> > > what exactly does prevent a sync read down page_cache_sync_readahead
> > > path?
> >
> > page_cache_sync_readahead() only submits the read ahead request(s), it
> > doesn't wait for it to finish.
>
> OK, I guess my question was not precise. What does prevent taking fs
> locks down the path?

Well, RWF_NOWAIT doesn't mean the kernel can't reschedule while executing
preadv2(), right? It just means it will not wait for the arrival of the
whole data blob into pagecache in case it's not there.

--
Jiri Kosina
SUSE Labs


2019-01-31 10:49:07

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

* Jiri Kosina:

> On Wed, 30 Jan 2019, Florian Weimer wrote:
>
>> > preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache
>> > contents, as it reveals metadata about residency of pages in
>> > pagecache.
>> >
>> > If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page
>> > not resident" information, and vice versa.
>> >
>> > Close that sidechannel by always initiating readahead on the cache if
>> > we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place,
>> > probing the pagecache residency itself will actually populate the
>> > cache, making the sidechannel useless.
>>
>> I think this needs to use a different flag because the semantics are so
>> much different. If I understand this change correctly, previously,
>> RWF_NOWAIT essentially avoided any I/O, and now it does not.
>
> It still avoid synchronous I/O, due to this code still being in place:
>
> if (!PageUptodate(page)) {
> if (iocb->ki_flags & IOCB_NOWAIT) {
> put_page(page);
> goto would_block;
> }
>
> but goes the would_block path only after initiating asynchronous
> readahead.

But it wouldn't schedule asynchronous readahead before?

I'm worried that something, say PostgreSQL doing a sequential scan,
would implement a two-pass approach, first using RWF_NOWAIT to process
what's in the kernel page cache, and then read the rest without it. If
RWF_NOWAIT is treated as a prefetch hint, there could be much more read
activity, and a lot of it would be pointless because the data might have
to be evicted before userspace can use it.

Thanks,
Florian

2019-01-31 11:33:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu 31-01-19 11:30:24, Jiri Kosina wrote:
> On Thu, 31 Jan 2019, Michal Hocko wrote:
>
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index 9f5e323e883e..7bcdd36e629d 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
> > > > >
> > > > > page = find_get_page(mapping, index);
> > > > > if (!page) {
> > > > > - if (iocb->ki_flags & IOCB_NOWAIT)
> > > > > - goto would_block;
> > > > > page_cache_sync_readahead(mapping,
> > > > > ra, filp,
> > > > > index, last_index - index);
> > > >
> > > > Maybe a stupid question but I am not really familiar with this path but
> > > > what exactly does prevent a sync read down page_cache_sync_readahead
> > > > path?
> > >
> > > page_cache_sync_readahead() only submits the read ahead request(s), it
> > > doesn't wait for it to finish.
> >
> > OK, I guess my question was not precise. What does prevent taking fs
> > locks down the path?
>
> Well, RWF_NOWAIT doesn't mean the kernel can't reschedule while executing
> preadv2(), right? It just means it will not wait for the arrival of the
> whole data blob into pagecache in case it's not there.

No, it can reschedule for sure but the man page says:
: If this flag is specified, the preadv2() system call will return
: instantly if it would have to read data from the backing storage or wait
: for a lock.

I assume that the lock is meant to be a filesystem lock here.
--
Michal Hocko
SUSE Labs

2019-01-31 11:35:17

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu, 31 Jan 2019, Florian Weimer wrote:

> >> I think this needs to use a different flag because the semantics are so
> >> much different. If I understand this change correctly, previously,
> >> RWF_NOWAIT essentially avoided any I/O, and now it does not.
> >
> > It still avoid synchronous I/O, due to this code still being in place:
> >
> > if (!PageUptodate(page)) {
> > if (iocb->ki_flags & IOCB_NOWAIT) {
> > put_page(page);
> > goto would_block;
> > }
> >
> > but goes the would_block path only after initiating asynchronous
> > readahead.
>
> But it wouldn't schedule asynchronous readahead before?

It would, that's kind of the whole point.

> I'm worried that something, say PostgreSQL doing a sequential scan,
> would implement a two-pass approach, first using RWF_NOWAIT to process
> what's in the kernel page cache, and then read the rest without it. If
> RWF_NOWAIT is treated as a prefetch hint, there could be much more read
> activity, and a lot of it would be pointless because the data might have
> to be evicted before userspace can use it.

So are you aware of anything already existing, that'd implement this
semantics? I've quickly grepped https://github.com/postgres/postgres for
RWF_NOWAIT, and they don't seem to use it at all. RWF_NOWAIT is rather
new.

The usecase I am aware of is to make sure that the thread doing
io_submit() doesn't get blocked for too long, because it has other things
to do quickly in order to avoid starving other sub-threads (and delegate
the I/O submission to asynchronous context).

--
Jiri Kosina
SUSE Labs


2019-01-31 12:04:52

by Daniel Gruss

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On 1/30/19 1:44 PM, Vlastimil Babka wrote:
> Close that sidechannel by always initiating readahead on the cache if we
> encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing
> the pagecache residency itself will actually populate the cache, making the
> sidechannel useless.

I fear this does not really close the side channel. You can time the
preadv2 function and infer which path it took, so you just bring it down
to the same as using mmap and timing accesses.
If I understood it correctly, this patch just removes the advantages of
preadv2 over mmmap+access for the attacker.


Cheers,
Daniel

2019-01-31 12:08:41

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu, 31 Jan 2019, Daniel Gruss wrote:

> If I understood it correctly, this patch just removes the advantages of
> preadv2 over mmmap+access for the attacker.

Which is the desired effect. We are not trying to solve the timing aspect,
as I don't think there is a reasonable way to do it, is there?

--
Jiri Kosina
SUSE Labs


2019-01-31 12:09:06

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On 1/31/19 1:04 PM, Daniel Gruss wrote:
> On 1/30/19 1:44 PM, Vlastimil Babka wrote:
>> Close that sidechannel by always initiating readahead on the cache if we
>> encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing
>> the pagecache residency itself will actually populate the cache, making the
>> sidechannel useless.
>
> I fear this does not really close the side channel. You can time the
> preadv2 function and infer which path it took, so you just bring it down
> to the same as using mmap and timing accesses.
> If I understood it correctly, this patch just removes the advantages of
> preadv2 over mmmap+access for the attacker.

But isn't that the same with mincore()? We can't simply remove the
possibility of mmap+access, but we are closing the simpler methods?

Vlastimil


> Cheers,
> Daniel
>


2019-01-31 12:58:03

by Daniel Gruss

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On 1/31/19 1:08 PM, Jiri Kosina wrote:
> On Thu, 31 Jan 2019, Daniel Gruss wrote:
>
>> If I understood it correctly, this patch just removes the advantages of
>> preadv2 over mmmap+access for the attacker.
>
> Which is the desired effect. We are not trying to solve the timing aspect,
> as I don't think there is a reasonable way to do it, is there?

There are two building blocks to cache attacks, bringing the cache into
a state, and observing a state change, you can mitigate them by breaking
either of these building blocks.

For most attacks the attacker would be interested in observing *when* a
specific victim page is loaded into the page cache rather than observing
whether it is in the page cache right now (it could be there for ages if
the system was not under memory pressure).
So, one could try to prevent interference in the page cache between
attacker and victim -> working set algorithms do that to some extent.
Simpler idea (with more side effects) would be limiting the maximum
share of the page cache per user (or per process, depending on the
threat model)...


Cheers,
Daniel

2019-01-31 17:48:13

by Josh Snyder

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

On Thu, Jan 31, 2019 at 1:44 AM Michal Hocko <[email protected]> wrote:
> One thing is still not clear to me though. Is the new owner/writeable
> check OK for the Netflix-like usecases? I mean does happycache have
> appropriate access to the cache data? I have tried to re-read the
> original thread but couldn't find any confirmation.

The owner/writable check will suit every database that I've ever used
happycache with, including cassandra, postgres and git.

Acked-by: Josh Snyder <[email protected]>

2019-01-31 18:15:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu, Jan 31, 2019 at 2:23 AM Michal Hocko <[email protected]> wrote:
>
> OK, I guess my question was not precise. What does prevent taking fs
> locks down the path?

IOCB_NOWAIT has never meant that, and will never mean it.

We will never give user space those kinds of guarantees. We do locking
for various reasons. For example, we'll do the mm lock just when
fetching/storing data from/to user space if there's a page fault. Or -
more obviously - we'll also check for - and sleep on - mandatory locks
in rw_verify_area().

There is nothing like "atomic IO" to user space. We simply do not give
those kinds of guarantees. That's even more true when this is a
information leak that we shouldn't expose to user space in the first
place.

Linus

2019-02-01 03:17:45

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu, Jan 31, 2019 at 10:56:44AM +0100, Michal Hocko wrote:
> [Cc fs-devel]
>
> On Wed 30-01-19 13:44:19, Vlastimil Babka wrote:
> > From: Jiri Kosina <[email protected]>
> >
> > preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache contents, as
> > it reveals metadata about residency of pages in pagecache.
> >
> > If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page not
> > resident" information, and vice versa.
> >
> > Close that sidechannel by always initiating readahead on the cache if we
> > encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing
> > the pagecache residency itself will actually populate the cache, making the
> > sidechannel useless.
>
> I guess the current wording doesn't disallow background IO to be
> triggered for EAGAIN case. I am not sure whether that breaks clever
> applications which try to perform larger IO for those cases though.

Actually, it does:

RWF_NOWAIT (since Linux 4.14)

Do not wait for data which is not immediately available. If
this flag is specified, the preadv2() system call will return
instantly if it would have to read data from the backing storage
or wait for a lock.

page_cache_sync_readahead() can block on page allocation, it calls
->readpages() which means there are page locks and filesystem locks
in play (e.g. for block mapping), there's potential for blocking on
metadata IO (both submission and completion) to read block maps, the
data readahead can be submitted for IO so it can get stuck anywhere
in the IO path, etc...

Basically, it completely subverts the documented behaviour of
RWF_NOWAIT.

There are applications (like Samba (*)) that are planning to use
this to avoid blocking their main processing threads on buffered
IO. This change makes RWF_NOWAIT pretty much useless to them - it
/was/ the only solution we had for reliably issuing non-blocking IO,
with this patch it isn't a viable solution at all.

(*) https://github.com/samba-team/samba/commit/6381044c0270a647c20935d22fd23f235d19b328

IOWs, if this change goes through, it needs to be documented as an
intentional behavioural bug in the preadv2 manpage so that userspace
developers are aware of the new limitations of RWF_NOWAIT and should
avoid it like the plague.

But worse than that is nobody has bothered to (or ask someone
familiar with the code to) do an audit of RWF_NOWAIT usage after I
pointed out the behavioural issues. The one person who was engaged
and /had done an audit/ got shouted down with so much bullshit they
just walked away....

So, I'll invite the incoherent, incandescent O_DIRECT rage flames of
Linus to be unleashed again and point out the /other reference/ to
IOCB_NOWAIT in mm/filemap.c. That is, in generic_file_read_iter(),
in the *generic O_DIRECT read path*:

if (iocb->ki_flags & IOCB_DIRECT) {
.....
if (iocb->ki_flags & IOCB_NOWAIT) {
if (filemap_range_has_page(mapping, iocb->ki_pos,
iocb->ki_pos + count - 1))
return -EAGAIN;
} else {
.....

This page cache probe is about 100 lines of code down from the code
that this patch modifies, in it's direct caller. It's not hard to
find, I shouldn't have to point it out, nor have to explain how it
makes this patch completely irrelevant.

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 9f5e323e883e..7bcdd36e629d 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
> >
> > page = find_get_page(mapping, index);
> > if (!page) {
> > - if (iocb->ki_flags & IOCB_NOWAIT)
> > - goto would_block;
> > page_cache_sync_readahead(mapping,
> > ra, filp,
> > index, last_index - index);
>
> Maybe a stupid question but I am not really familiar with this path but
> what exactly does prevent a sync read down page_cache_sync_readahead
> path?

It's effectively useless as a workaround because you can avoid the
readahead IO being issued relatively easily:

void page_cache_sync_readahead(struct address_space *mapping,
struct file_ra_state *ra, struct file *filp,
pgoff_t offset, unsigned long req_size)
{
/* no read-ahead */
if (!ra->ra_pages)
return;

if (blk_cgroup_congested())
return;
....

IOWs, we just have to issue enough IO to congest the block device (or,
even easier, a rate-limited cgroup), and we can still use RWF_NOWAIT
to probe the page cache. Or if we can convince ra->ra_pages to be
zero (e.g. it's on bdi device with no readahead configured because
it's real fast) then it doesn't work there, either.

So this a) isn't a robust workaround, b) it breaks documented API
semantics and c) isn't the only path to page cache probing via
RWF_NOWAIT. It's just a new game of whack-a-mole.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-02-01 05:28:27

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu, Jan 31, 2019 at 09:54:16AM -0800, Linus Torvalds wrote:
> On Thu, Jan 31, 2019 at 2:23 AM Michal Hocko <[email protected]> wrote:
> >
> > OK, I guess my question was not precise. What does prevent taking fs
> > locks down the path?
>
> IOCB_NOWAIT has never meant that, and will never mean it.

I think you're wrong, Linus. IOCB_NOWAIT was specifically designed
to prevent blocking on filesystem locks during AIO submission. The
initial commits spell that out pretty clearly:

commit b745fafaf70c0a98a2e1e7ac8cb14542889ceb0e
Author: Goldwyn Rodrigues <[email protected]>
Date: Tue Jun 20 07:05:43 2017 -0500

fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT

RWF_NOWAIT informs kernel to bail out if an AIO request will block
for reasons such as file allocations, or a writeback triggered,
or would block while allocating requests while performing
direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

FMODE_AIO_NOWAIT is a flag which identifies the file opened is capable
of returning -EAGAIN if the AIO call will block. This must be set by
supporting filesystems in the ->open() call.

Filesystems xfs, btrfs and ext4 would be supported in the following patches.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

commit 29a5d29ec181ebdc98a26cedbd76ce9870248892
Author: Goldwyn Rodrigues <[email protected]>
Date: Tue Jun 20 07:05:48 2017 -0500

xfs: nowait aio support

If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable
immediately.

IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if it needs allocation either due to file extension, writing to a hole,
or COW or waiting for other DIOs to finish.

Return -EAGAIN if we don't have extent list in memory.

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

commit 728fbc0e10b7f3ce2ee043b32e3453fd5201c055
Author: Goldwyn Rodrigues <[email protected]>
Date: Tue Jun 20 07:05:47 2017 -0500

ext4: nowait aio support

Return EAGAIN if any of the following checks fail for direct I/O:
+ i_rwsem is lockable
+ Writing beyond end of file (will trigger allocation)
+ Blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

> We will never give user space those kinds of guarantees. We do locking
> for various reasons. For example, we'll do the mm lock just when
> fetching/storing data from/to user space if there's a page fault.

You are conflating "best effort non-blocking operation" with
"atomic guarantee". RWF_NOWAIT/IOCB_NOWAIT is the
former, not the latter.

i.e. RWF_NOWAIT addresses the "every second IO submission blocks"
problems that AIO submission suffered from due to filesystem lock
contention, not the rare and unusual things like "page fault during
get_user_pages in direct IO submission". Maybe one day, but right
now those rare cases are not pain points for applications that
require nonblock AIO submission via RWF_NOWAIT.

> Or -
> more obviously - we'll also check for - and sleep on - mandatory locks
> in rw_verify_area().

Well, only if you don't use fcntl(O_NONBLOCK) on the file to tell
mandatory locking to fail with -EAGAIN instead of sleeping.

-Dave.
--
Dave Chinner
[email protected]

2019-02-01 07:14:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu, Jan 31, 2019 at 9:16 PM Dave Chinner <[email protected]> wrote:
>
> You are conflating "best effort non-blocking operation" with
> "atomic guarantee". RWF_NOWAIT/IOCB_NOWAIT is the
> former, not the latter.

Right.

That's my *point*, Dave.

It's not 'atomic guarantee", and never will be. We are in 100%
agreement. That's what I _said_.

And part of "best effort" is very much "not a security information leak".

I really don't see why you are so argumentative.

As I mentioned earlier in the thread, it's actually quite possible
that users will actually find that starting read-ahead is a *good*
thing, Dave.

Even - in fact *particularly* - the user you brought up: samba using
RWF_NOWAIT to try to do things synchronously quickly.

So Dave, why are you being so negative?

Linus

2019-02-01 07:22:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Thu, Jan 31, 2019 at 11:05 PM Linus Torvalds
<[email protected]> wrote:
>
> And part of "best effort" is very much "not a security information leak".

Side note: it's entirely possible that the preadv2(RWF_NOWAIT)
interface is actually already effectively too slow to be effectively
used as much of an attack vector.

One of the advantages of mincore() for the attack was that you could
just get a lot of page status information in one go. With RWF_NOWAIT,
you only really get "up to the first non-cached page", so it's already
a weaker signal than mincore() gave.

System calls aren't horrendously slow (at least not with fixed
non-meltdown CPU's), but it might still be a somewhat noticeable
inconvenience in an attack that is already probably not all that easy
to do on an arbitrary target.

So it might not be a huge deal. But I think we should at least try to
make things less useful for these kinds of attack vectors.

And no, that doesn't mean "stop all theoretical attacks". It means
"let's try to make things less convenient as a data leak".

That's why things like "oh, you can still see the signal if you can
keep the backing device congested" is not something I'd worry about.
It's just another (big) inconvenience, and not all that simple to do.
At some point, it's simply not worth it as an attack vector any more.

Linus

2019-02-01 08:56:55

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

Here's updated version with Michal's suggestion, and acks:

I think this patch is fine to go, less sure about 2/3 and 3/3.

----8<----
From 49f17d9f6a42ecc2a508125b0c880ff0402a6f49 Mon Sep 17 00:00:00 2001
From: Jiri Kosina <[email protected]>
Date: Wed, 16 Jan 2019 20:53:17 +0100
Subject: [PATCH v2] mm/mincore: make mincore() more conservative

The semantics of what mincore() considers to be resident is not completely
clear, but Linux has always (since 2.3.52, which is when mincore() was
initially done) treated it as "page is available in page cache".

That's potentially a problem, as that [in]directly exposes meta-information
about pagecache / memory mapping state even about memory not strictly belonging
to the process executing the syscall, opening possibilities for sidechannel
attacks.

Change the semantics of mincore() so that it only reveals pagecache information
for non-anonymous mappings that belog to files that the calling process could
(if it tried to) successfully open for writing.

[[email protected]: restructure can_do_mincore() conditions]
Originally-by: Linus Torvalds <[email protected]>
Originally-by: Dominique Martinet <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Kevin Easton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Cyril Hrubis <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Daniel Gruss <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Josh Snyder <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/mincore.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..b8842b849604 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -169,6 +169,16 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
return 0;
}

+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+ if (vma_is_anonymous(vma))
+ return true;
+ if (!vma->vm_file)
+ return false;
+ return inode_owner_or_capable(file_inode(vma->vm_file)) ||
+ inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
+}
+
/*
* Do a chunk of "sys_mincore()". We've already checked
* all the arguments, we hold the mmap semaphore: we should
@@ -189,8 +199,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
vma = find_vma(current->mm, addr);
if (!vma || addr < vma->vm_start)
return -ENOMEM;
- mincore_walk.mm = vma->vm_mm;
end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
+ if (!can_do_mincore(vma)) {
+ unsigned long pages = (end - addr) >> PAGE_SHIFT;
+ memset(vec, 1, pages);
+ return pages;
+ }
+ mincore_walk.mm = vma->vm_mm;
err = walk_page_range(addr, end, &mincore_walk);
if (err < 0)
return err;
--
2.20.1



2019-02-01 09:06:17

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

On 1/31/19 11:09 AM, Michal Hocko wrote:
> On Wed 30-01-19 13:44:20, Vlastimil Babka wrote:
>> After "mm/mincore: make mincore() more conservative" we sometimes restrict the
>> information about page cache residency, which we have to do without breaking
>> existing userspace, if possible. We thus fake the resulting values as 1, which
>> should be safer than faking them as 0, as there might theoretically exist code
>> that would try to fault in the page(s) until mincore() returns 1.
>>
>> Faking 1 however means that such code would not fault in a page even if it was
>> not in page cache, with unwanted performance implications. We can improve the
>> situation by revisting the approach of 574823bfab82 ("Change mincore() to count
>> "mapped" pages rather than "cached" pages") but only applying it to cases where
>> page cache residency check is restricted. Thus mincore() will return 0 for an
>> unmapped page (which may or may not be resident in a pagecache), and 1 after
>> the process faults it in.
>>
>> One potential downside is that mincore() will be again able to recognize when a
>> previously mapped page was reclaimed. While that might be useful for some
>> attack scenarios, it's not as crucial as recognizing that somebody else faulted
>> the page in, and there are also other ways to recognize reclaimed pages anyway.
>
> Is this really worth it? Do we know about any specific usecase that
> would benefit from this change? TBH I would rather wait for the report
> than add a hard to evaluate side channel.

Well it's not that complicated IMHO. Linus said it's worth trying, so
let's see how he likes the result. The side channel exists anyway as
long as process can e.g. check if its rss shrinked, and I doubt we are
going to remove that possibility.

Also CC Josh Snyder since I forgot originally, and keeping rest of mail
for reference.

>> Cc: Jiri Kosina <[email protected]>
>> Cc: Dominique Martinet <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Dave Chinner <[email protected]>
>> Cc: Kevin Easton <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Cyril Hrubis <[email protected]>
>> Cc: Tejun Heo <[email protected]>
>> Cc: Kirill A. Shutemov <[email protected]>
>> Cc: Daniel Gruss <[email protected]>
>> Signed-off-by: Vlastimil Babka <[email protected]>
>> ---
>> mm/mincore.c | 49 +++++++++++++++++++++++++++++++++----------------
>> 1 file changed, 33 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/mincore.c b/mm/mincore.c
>> index 747a4907a3ac..d6784a803ae7 100644
>> --- a/mm/mincore.c
>> +++ b/mm/mincore.c
>> @@ -21,12 +21,18 @@
>> #include <linux/uaccess.h>
>> #include <asm/pgtable.h>
>>
>> +struct mincore_walk_private {
>> + unsigned char *vec;
>> + bool can_check_pagecache;
>> +};
>> +
>> static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>> unsigned long end, struct mm_walk *walk)
>> {
>> #ifdef CONFIG_HUGETLB_PAGE
>> unsigned char present;
>> - unsigned char *vec = walk->private;
>> + struct mincore_walk_private *walk_private = walk->private;
>> + unsigned char *vec = walk_private->vec;
>>
>> /*
>> * Hugepages under user process are always in RAM and never
>> @@ -35,7 +41,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
>> present = pte && !huge_pte_none(huge_ptep_get(pte));
>> for (; addr != end; vec++, addr += PAGE_SIZE)
>> *vec = present;
>> - walk->private = vec;
>> + walk_private->vec = vec;
>> #else
>> BUG();
>> #endif
>> @@ -85,7 +91,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
>> }
>>
>> static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
>> - struct vm_area_struct *vma, unsigned char *vec)
>> + struct vm_area_struct *vma, unsigned char *vec,
>> + bool can_check_pagecache)
>> {
>> unsigned long nr = (end - addr) >> PAGE_SHIFT;
>> int i;
>> @@ -95,7 +102,9 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
>>
>> pgoff = linear_page_index(vma, addr);
>> for (i = 0; i < nr; i++, pgoff++)
>> - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
>> + vec[i] = can_check_pagecache ?
>> + mincore_page(vma->vm_file->f_mapping, pgoff)
>> + : 0;
>> } else {
>> for (i = 0; i < nr; i++)
>> vec[i] = 0;
>> @@ -106,8 +115,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
>> static int mincore_unmapped_range(unsigned long addr, unsigned long end,
>> struct mm_walk *walk)
>> {
>> - walk->private += __mincore_unmapped_range(addr, end,
>> - walk->vma, walk->private);
>> + struct mincore_walk_private *walk_private = walk->private;
>> + unsigned char *vec = walk_private->vec;
>> +
>> + walk_private->vec += __mincore_unmapped_range(addr, end, walk->vma,
>> + vec, walk_private->can_check_pagecache);
>> return 0;
>> }
>>
>> @@ -117,7 +129,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>> spinlock_t *ptl;
>> struct vm_area_struct *vma = walk->vma;
>> pte_t *ptep;
>> - unsigned char *vec = walk->private;
>> + struct mincore_walk_private *walk_private = walk->private;
>> + unsigned char *vec = walk_private->vec;
>> int nr = (end - addr) >> PAGE_SHIFT;
>>
>> ptl = pmd_trans_huge_lock(pmd, vma);
>> @@ -128,7 +141,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>> }
>>
>> if (pmd_trans_unstable(pmd)) {
>> - __mincore_unmapped_range(addr, end, vma, vec);
>> + __mincore_unmapped_range(addr, end, vma, vec,
>> + walk_private->can_check_pagecache);
>> goto out;
>> }
>>
>> @@ -138,7 +152,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>
>> if (pte_none(pte))
>> __mincore_unmapped_range(addr, addr + PAGE_SIZE,
>> - vma, vec);
>> + vma, vec, walk_private->can_check_pagecache);
>> else if (pte_present(pte))
>> *vec = 1;
>> else { /* pte is a swap entry */
>> @@ -152,8 +166,12 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>> *vec = 1;
>> } else {
>> #ifdef CONFIG_SWAP
>> - *vec = mincore_page(swap_address_space(entry),
>> + if (walk_private->can_check_pagecache)
>> + *vec = mincore_page(
>> + swap_address_space(entry),
>> swp_offset(entry));
>> + else
>> + *vec = 0;
>> #else
>> WARN_ON(1);
>> *vec = 1;
>> @@ -187,22 +205,21 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
>> struct vm_area_struct *vma;
>> unsigned long end;
>> int err;
>> + struct mincore_walk_private walk_private = {
>> + .vec = vec
>> + };
>> struct mm_walk mincore_walk = {
>> .pmd_entry = mincore_pte_range,
>> .pte_hole = mincore_unmapped_range,
>> .hugetlb_entry = mincore_hugetlb,
>> - .private = vec,
>> + .private = &walk_private
>> };
>>
>> vma = find_vma(current->mm, addr);
>> if (!vma || addr < vma->vm_start)
>> return -ENOMEM;
>> end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
>> - if (!can_do_mincore(vma)) {
>> - unsigned long pages = (end - addr) >> PAGE_SHIFT;
>> - memset(vec, 1, pages);
>> - return pages;
>> - }
>> + walk_private.can_check_pagecache = can_do_mincore(vma);
>> mincore_walk.mm = vma->vm_mm;
>> err = walk_page_range(addr, end, &mincore_walk);
>> if (err < 0)
>> --
>> 2.20.1
>


2019-02-01 09:12:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

On Fri 01-02-19 10:04:23, Vlastimil Babka wrote:
> The side channel exists anyway as long as process can e.g. check if
> its rss shrinked, and I doubt we are going to remove that possibility.

Well, but rss update will not tell you that the page has been faulted in
which is the most interesting part. You shouldn't be able to sniff on
/proc/$vicimt/smaps as an attacker.
--
Michal Hocko
SUSE Labs

2019-02-01 09:49:24

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

On 2/1/19 10:11 AM, Michal Hocko wrote:
> On Fri 01-02-19 10:04:23, Vlastimil Babka wrote:
>> The side channel exists anyway as long as process can e.g. check if
>> its rss shrinked, and I doubt we are going to remove that possibility.
>
> Well, but rss update will not tell you that the page has been faulted in
> which is the most interesting part.

Sure, but the patch doesn't add back that capability neither. It allows
to recognize page being reclaimed, and I argue you can infer that from
rss change as well. That change is mentioned in the last paragraph in
changelog, and I thought "add a hard to evaluate side channel" in your
reply referred to that. It doesn't add back the "original" side channel
to detect somebody else accessed a page.

> You shouldn't be able to sniff on
> /proc/$vicimt/smaps as an attacker.




2019-02-06 20:15:06

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

On Fri, 1 Feb 2019, Vlastimil Babka wrote:

> > Well, but rss update will not tell you that the page has been faulted in
> > which is the most interesting part.
>
> Sure, but the patch doesn't add back that capability neither. It allows
> to recognize page being reclaimed, and I argue you can infer that from
> rss change as well. That change is mentioned in the last paragraph in
> changelog, and I thought "add a hard to evaluate side channel" in your
> reply referred to that. It doesn't add back the "original" side channel
> to detect somebody else accessed a page.

On Fri, 1 Feb 2019, Vlastimil Babka wrote:

> > Is this really worth it? Do we know about any specific usecase that
> > would benefit from this change? TBH I would rather wait for the report
> > than add a hard to evaluate side channel.
>
> Well it's not that complicated IMHO. Linus said it's worth trying, so
> let's see how he likes the result. The side channel exists anyway as
> long as process can e.g. check if its rss shrinked, and I doubt we are
> going to remove that possibility.

Linus, do you have any opinion here?

I have a hunch that mm maintainers are keeping this on a backburner
because there might still open question(s) in the air.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-02-12 03:45:33

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

On Fri, 1 Feb 2019, Vlastimil Babka wrote:

> >> After "mm/mincore: make mincore() more conservative" we sometimes restrict the
> >> information about page cache residency, which we have to do without breaking
> >> existing userspace, if possible. We thus fake the resulting values as 1, which
> >> should be safer than faking them as 0, as there might theoretically exist code
> >> that would try to fault in the page(s) until mincore() returns 1.
> >>
> >> Faking 1 however means that such code would not fault in a page even if it was
> >> not in page cache, with unwanted performance implications. We can improve the
> >> situation by revisting the approach of 574823bfab82 ("Change mincore() to count
> >> "mapped" pages rather than "cached" pages") but only applying it to cases where
> >> page cache residency check is restricted. Thus mincore() will return 0 for an
> >> unmapped page (which may or may not be resident in a pagecache), and 1 after
> >> the process faults it in.
> >>
> >> One potential downside is that mincore() will be again able to recognize when a
> >> previously mapped page was reclaimed. While that might be useful for some
> >> attack scenarios, it's not as crucial as recognizing that somebody else faulted
> >> the page in, and there are also other ways to recognize reclaimed pages anyway.
> >
> > Is this really worth it? Do we know about any specific usecase that
> > would benefit from this change? TBH I would rather wait for the report
> > than add a hard to evaluate side channel.
>
> Well it's not that complicated IMHO. Linus said it's worth trying, so
> let's see how he likes the result. The side channel exists anyway as
> long as process can e.g. check if its rss shrinked, and I doubt we are
> going to remove that possibility.

So, where do we go from here?

Either Linus and Andrew like the mincore() return value tweak, or this
could be further discussed (*). But in either of the cases, I think
patches 1 and 2 should be at least queued for 5.1.

(*) I'd personally include it as well, as I don't see how it would break
anything, it's pretty straightforward, and brings back some sanity to
mincore() return value.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-02-12 06:37:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

On Tue 12-02-19 04:44:30, Jiri Kosina wrote:
> On Fri, 1 Feb 2019, Vlastimil Babka wrote:
>
> > >> After "mm/mincore: make mincore() more conservative" we sometimes restrict the
> > >> information about page cache residency, which we have to do without breaking
> > >> existing userspace, if possible. We thus fake the resulting values as 1, which
> > >> should be safer than faking them as 0, as there might theoretically exist code
> > >> that would try to fault in the page(s) until mincore() returns 1.
> > >>
> > >> Faking 1 however means that such code would not fault in a page even if it was
> > >> not in page cache, with unwanted performance implications. We can improve the
> > >> situation by revisting the approach of 574823bfab82 ("Change mincore() to count
> > >> "mapped" pages rather than "cached" pages") but only applying it to cases where
> > >> page cache residency check is restricted. Thus mincore() will return 0 for an
> > >> unmapped page (which may or may not be resident in a pagecache), and 1 after
> > >> the process faults it in.
> > >>
> > >> One potential downside is that mincore() will be again able to recognize when a
> > >> previously mapped page was reclaimed. While that might be useful for some
> > >> attack scenarios, it's not as crucial as recognizing that somebody else faulted
> > >> the page in, and there are also other ways to recognize reclaimed pages anyway.
> > >
> > > Is this really worth it? Do we know about any specific usecase that
> > > would benefit from this change? TBH I would rather wait for the report
> > > than add a hard to evaluate side channel.
> >
> > Well it's not that complicated IMHO. Linus said it's worth trying, so
> > let's see how he likes the result. The side channel exists anyway as
> > long as process can e.g. check if its rss shrinked, and I doubt we are
> > going to remove that possibility.
>
> So, where do we go from here?
>
> Either Linus and Andrew like the mincore() return value tweak, or this
> could be further discussed (*). But in either of the cases, I think
> patches 1 and 2 should be at least queued for 5.1.

I would go with patch 1 for 5.1. Patches 2 still sounds controversial or
incomplete to me. And patch 3, well I will leave the decision to
Andrew/Linus.

> (*) I'd personally include it as well, as I don't see how it would break
> anything, it's pretty straightforward, and brings back some sanity to
> mincore() return value.

--
Michal Hocko
SUSE Labs

2019-02-12 13:09:51

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

On Tue, 12 Feb 2019, Michal Hocko wrote:

> I would go with patch 1 for 5.1. Patches 2 still sounds controversial or
> incomplete to me.

Is it because of the disagreement what 'non-blocking' really means, or do
you see something else missing?

Merging patch just patch 1 withouth patch 2 is probably sort of useless
excercise, unfortunately.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-02-12 14:04:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/mincore: provide mapped status when cached status is not allowed

On Tue 12-02-19 14:09:03, Jiri Kosina wrote:
> On Tue, 12 Feb 2019, Michal Hocko wrote:
>
> > I would go with patch 1 for 5.1. Patches 2 still sounds controversial or
> > incomplete to me.
>
> Is it because of the disagreement what 'non-blocking' really means, or do
> you see something else missing?

Not only. See the remark from Dave [1] that the patch in its current
form seems to be incomplete. Also FS people were not involved
properly to evaluate all the potential fallouts. Even if the only way
forward is to "cripple" IOCB_NOWAIT then the documentation should go
along with the change rather than suprise people much later when the
system behaves unexpectedly. So I _think_ this patch is not really ready
yet.

Also I haven't heard any discussion whether we can reduce the effect of
the change in a similar way we do for mincore.

> Merging patch just patch 1 withouth patch 2 is probably sort of useless
> excercise, unfortunately.

Why would that be the case. We know that mincore is the simplest way
_right now_. Closing it makes sense on its own.

[1] http://lkml.kernel.org/r/20190201014446.GU6173@dastard
--
Michal Hocko
SUSE Labs

2019-02-12 15:49:59

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O

On Fri, 1 Feb 2019, Dave Chinner wrote:

> So, I'll invite the incoherent, incandescent O_DIRECT rage flames of
> Linus to be unleashed again and point out the /other reference/ to
> IOCB_NOWAIT in mm/filemap.c. That is, in generic_file_read_iter(),
> in the *generic O_DIRECT read path*:
>
> if (iocb->ki_flags & IOCB_DIRECT) {
> .....
> if (iocb->ki_flags & IOCB_NOWAIT) {
> if (filemap_range_has_page(mapping, iocb->ki_pos,
> iocb->ki_pos + count - 1))
> return -EAGAIN;
> } else {
> .....

OK, thanks Dave, this is a good point I've missed in this mail before
(probabably as I focused only on the aspect of disagreement what NONBLOCK
actually means :) ). I will look into fixing it for next iteration.

> It's effectively useless as a workaround because you can avoid the
> readahead IO being issued relatively easily:
>
> void page_cache_sync_readahead(struct address_space *mapping,
> struct file_ra_state *ra, struct file *filp,
> pgoff_t offset, unsigned long req_size)
> {
> /* no read-ahead */
> if (!ra->ra_pages)
> return;
>
> if (blk_cgroup_congested())
> return;
> ....
>
> IOWs, we just have to issue enough IO to congest the block device (or,
> even easier, a rate-limited cgroup), and we can still use RWF_NOWAIT
> to probe the page cache. Or if we can convince ra->ra_pages to be
> zero (e.g. it's on bdi device with no readahead configured because
> it's real fast) then it doesn't work there, either.

It's though questionable whether the noise level here wouldn't be too high
already for any sidechannel to work reliably. So I'd suggest to operate
under the assumption that it would be too noisy, unless anyone is able to
prove otherwise.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-02-20 15:50:57

by Nicolai Stange

[permalink] [raw]
Subject: Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

Linus Torvalds <[email protected]> writes:

<snip>

> So in order to use it as a signal, first you have to first scrub the
> cache (because if the page was already there, there's no signal at
> all), and then for the signal to be as useful as possible, you're also
> going to want to try to get out more than one bit of information: you
> are going to try to see the patterns and the timings of how it gets
> filled.
>
> And that's actually quite painful. You don't know the initial cache
> state, and you're not (in general) controlling the machine entirely,
> because there's also that actual other entity that you're trying to
> attack and see what it does.
>
> So what you want to do is basically to first make sure the cache is
> scrubbed (only for the pages you're interested in!), then trigger
> whatever behavior you are looking for, and then look how that affected
> the cache.
>
> In other words, you want *multiple* residency status check - first to
> see what the cache state is (because you're going to want that for
> scrubbing), then to see that "yes, it's gone" when doing the
> scrubbing, and then to see the *pattern* and timings of how things are
> brought in.

<snap>

In an attempt to gain a better understanding of the guided eviction part
resp. the relevance of mincore() & friends to that, I worked on
reproducing the results from [1], section 6.1 ("Efficient Page Cache
Eviction on Linux").

In case anybody wants to run their own experiments: the sources can
be found at [2].

Disclaimer: I don't have access to the sources used by the [1]-paper's
authors nor do I know anything about their experimental setup. So it
might very well be the case, that my implementation is completely
different and/or inefficient.

Anyways, quoting from [1], section 6.1:

"Eviction Set 1. These are pages already in the page cache,
used by other processes. To keep them in the page cache,
a thread continuously accesses these pages while also keep-
ing the system load low by using sched yield and sleep.
Consequently, they are among the most recently accessed
pages of the system and eviction of these pages becomes
highly unlikely."

I had two questions:
1.) Do the actual contents of "Eviction set 1" matter for the guided
eviction's performance or can they as well be arbitrary but fixed?
Because if the set's actual contents weren't of any importance,
then mincore() would not be needed to initialize it with "pages
already in the page cache".
2.) How does keeping some fixed set resident + doing some IO compare to
simply streaming a huge random file through the page cache?

(To make it explicit: I didn't look into the probe part of the attack
or the checking of the victim page's residency status as a
termination condition for the eviction run.)

Obviously, there are two primary factors affecting the victim page
eviction performance: the file page cache size and disk read
bandwidth.

Naively speaking, I would suppose that keeping a certain set resident is
a cheap and stable way to constrain IO to the remaining portion of the
page cache and thus, reduce the amount of data required to be read from
disk until the victim page gets evicted.


Results summary (details can be found at the end of this mail):
- The baseline benchmark of simply streaming random data
through the page cache behaves as expected:

avg of "Inactive(file)" / avg of "victim page eviction time"

yields ~480MB/s, which approx. matches my disk's read bandwidth
(note: the victim page was mapped as !PROT_EXEC).

- I didn't do any sophisticated fine-tuning wrt. to parameters, but
for the configuration yielding the best result, the average victim
page eviction time was 147ms (stddev(*): 69ms, stderr: 1ms) with the
"random but fixed resident set method". That's an improvement by a
factor of 2.6 over the baseline "streaming random data method" (with
the same amount of anonymous memory, i.e. "Eviction set 3",
allocated: 7GB out of a total of 8GB).

- In principle, question 1.) can't be answered by experiment without
controlling the initial, legitimate system workload. I did some lax
tests on my desktop running firefox, libreoffice etc. though and of
course, overall responsiveness got a lot better if the "Eviction set
1" had been populated with pages already resident at the time the
"attack" started. But the victim page eviction times didn't seem to
improve -- at least not by factors such that my biased mind would
have been able to recognize any change.

In conclusion, keeping an arbitrary, but fixed "Eviction set 1" resident
improved the victim page eviction performance by some factor over the
"streaming" baseline, where "Eviction set 1" was populated from a
single, attacker-controlled file and mincore() was not needed for
determining its initial contents.

To my surprise though, I needed to rely on mincore() at some other
place, namely *my* implementation of keeping the resident set
resident. My first naive approach was to have a single thread repeatedly
iterating over the pages and reading the first few bytes from each
through a mmapped area. That did not work out, because, even for smaller
resident sets of 1GB and if run with realtime priority, the accessing
thread would at some point in time encounter a reclaimed page and have
to wait for the page fault to get served. While waiting for that, even
more of the resident pages are likely to get reclaimed, causing
additional delays later on. Eventually, the resident page accessor
looses the game and will encounter page faults for almost the whole
resident set (which isn't resident anymore). I worked around this by
making the accessor thread check page residency via mincore(), touch
only the resident ones and queue the others to some refault ("rewarm")
thread. From briefly looking at iostat, this rewarmer thread actually
seemed to saturate the disk and thus, there was no need for additional
IO to put pressure on the page cache. For completeness, the amount of
pages from the resident set actually found resident made up for ~97% of
all file page cache pages (Inactive+Active(file)).

Note that the way mincore() is used here is different than for the
probe part of the attack: for probing, we'd like to know when a victim
page has been faulted in again, while the residency keeper needs to
check that some page has not been reclaimed before accessing
it. Furthermore, mincore() is run on pages from an attacker-controlled
and -owned file here. AFAICS, the patches currently under review
(c.f. [3]) don't mitigate against this latter abuse of mincore() & Co.

I personally doubt that doing something about it would be worth it
though: first of all, until proven otherwise, I'm assuming that the
improvement of the "resident set" based eviction method over the
"streaming" baseline is not by orders of magnitude and that the victim
page eviction times interesting to attackers (let's say better than
500ms) are achievable only under certain conditions: no swap and the
ability to block a large fraction of total memory with anonymous
mappings.

What's more, I can imagine that there are other ways to keep the
resident set resident without relying on mincore() at all: I haven't
tried it, but simply spawning a larger number of accessor threads, each
at a different position within "Eviction set 1", and hoping that most of
the time at least one of them finds a resident page to touch, might
work, too.


Thanks,

Nicolai


Experiment setup + numbers
==========================
My laptop has 8GB of RAM, a SSD and runs OpenSUSE 42.3 kernel package
version 4.4.165-81. I stopped almost all services, but didn't setup any
CPU isolation.

I ran 5 x 2 experiments where I allocated (and filled, of course)
3,4,5,6,7GB of anonymous memory each and compared the baseline
"read-in-a-huge-file" (mapped PROT_EXEC) results with the resident set
based eviction for each of these. While doing so, I continuously
measured eviction times of a !PROT_EXEC victim page and reported the
'Active(file)' + 'Inactive(file)' statistics from /proc/meminfo.


Baseline "streaming" benchmark results:

anon mem (GB): 7 6 5 4 3
Inactive(file) (MB): 189.3389 709.7063 1221.0910 1735.7510 2247.3340
eviction time (ms): 386.7712 1453.3975 2533.6084 3622.4995 4694.4668
quotient (MB/s): 489.5372 488.3085 481.9573 479.1584 478.7198

and, for comparison with the results from the resident set based eviction
below:
Inactive+Active(file) (MB): 358 1390 2415 3445 4469


Resident set based eviction:

For the results below, I chose to draw the resident set from a single
file filled with random data and made it total mem - allocated anon
mem in size. The disk bandwidth was saturated all the time.

anon mem (GB): 7 6 5 4 3
eviction time (ms): 146.6296 428.9594 620.2084 863.1423 977.8963
improvement over baseline: 2.6 3.4 4.1 4.2 4.8
Inactive+Active(file) (MB): 429 1449 2471 3494 4515
resident from res. set (MB): 417 1428 2426 3424 4429
fraction: 97.3% 98.6% 98.2% 98.0% 98.1%



[1] https://arxiv.org/abs/1901.01161 ("Page Cache Attacks")
[2] https://github.com/nicstange/pgc
[3] https://lkml.kernel.org/r/[email protected]
(*) The distribution of eviction times is not Gaussian.


--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)

2019-03-06 14:45:31

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

On Wed, 30 Jan 2019, Vlastimil Babka wrote:

> I've collected the patches from the discussion for formal posting. The first
> two should be settled already, third one is the possible improvement I've
> mentioned earlier, where only in restricted case we resort to existence of page
> table mapping (the original and later reverted approach from Linus) instead of
> faking the result completely. Review and testing welcome.
>
> The consensus seems to be going through -mm tree for 5.1, unless Linus wants
> them alredy for 5.0.
>
> Jiri Kosina (2):
> mm/mincore: make mincore() more conservative
> mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O
>
> Vlastimil Babka (1):
> mm/mincore: provide mapped status when cached status is not allowed

Andrew,

could you please take at least the correct and straightforward fix for
mincore() before we figure out how to deal with the slightly less
practical RWF_NOWAIT? Thanks.

--
Jiri Kosina
SUSE Labs


2019-03-06 22:36:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

On Wed, 6 Mar 2019 13:11:39 +0100 (CET) Jiri Kosina <[email protected]> wrote:

> On Wed, 30 Jan 2019, Vlastimil Babka wrote:
>
> > I've collected the patches from the discussion for formal posting. The first
> > two should be settled already, third one is the possible improvement I've
> > mentioned earlier, where only in restricted case we resort to existence of page
> > table mapping (the original and later reverted approach from Linus) instead of
> > faking the result completely. Review and testing welcome.
> >
> > The consensus seems to be going through -mm tree for 5.1, unless Linus wants
> > them alredy for 5.0.
> >
> > Jiri Kosina (2):
> > mm/mincore: make mincore() more conservative
> > mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O
> >
> > Vlastimil Babka (1):
> > mm/mincore: provide mapped status when cached status is not allowed
>
> Andrew,
>
> could you please take at least the correct and straightforward fix for
> mincore() before we figure out how to deal with the slightly less
> practical RWF_NOWAIT? Thanks.

I assume we're talking about [1/3] and [2/3] from this thread?

Can we have a resend please? Gather the various acks and revisions,
make changelog changes to address the review questions and comments?

Thanks.

2019-03-06 22:49:00

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

On Wed, 6 Mar 2019, Andrew Morton wrote:

> > could you please take at least the correct and straightforward fix for
> > mincore() before we figure out how to deal with the slightly less
> > practical RWF_NOWAIT? Thanks.
>
> I assume we're talking about [1/3] and [2/3] from this thread?
>
> Can we have a resend please? Gather the various acks and revisions,
> make changelog changes to address the review questions and comments?

1/3 is clearly the one to be merged. The version with all the acks
gathered is in this thread, at

https://lore.kernel.org/lkml/[email protected]/

Attaching the patch also at the end of this mail so that it could be
easily picked up.

I am unfortunately not sure what changelog changes you are talking about,
there were none requested during the review as far as I know.

2/3 is clearly postponed for now, it needs more thinking.

3/3 is actually waiting for your decision, see

https://lore.kernel.org/lkml/[email protected]/

The 1/3 patch to be merged in any case:


=== cut here ===

From: Jiri Kosina <[email protected]>
Date: Wed, 16 Jan 2019 20:53:17 +0100
Subject: [PATCH v2] mm/mincore: make mincore() more conservative

The semantics of what mincore() considers to be resident is not completely
clear, but Linux has always (since 2.3.52, which is when mincore() was
initially done) treated it as "page is available in page cache".

That's potentially a problem, as that [in]directly exposes meta-information
about pagecache / memory mapping state even about memory not strictly belonging
to the process executing the syscall, opening possibilities for sidechannel
attacks.

Change the semantics of mincore() so that it only reveals pagecache information
for non-anonymous mappings that belog to files that the calling process could
(if it tried to) successfully open for writing.

[[email protected]: restructure can_do_mincore() conditions]
Originally-by: Linus Torvalds <[email protected]>
Originally-by: Dominique Martinet <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Kevin Easton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Cyril Hrubis <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Daniel Gruss <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Josh Snyder <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/mincore.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..b8842b849604 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -169,6 +169,16 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
return 0;
}

+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+ if (vma_is_anonymous(vma))
+ return true;
+ if (!vma->vm_file)
+ return false;
+ return inode_owner_or_capable(file_inode(vma->vm_file)) ||
+ inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
+}
+
/*
* Do a chunk of "sys_mincore()". We've already checked
* all the arguments, we hold the mmap semaphore: we should
@@ -189,8 +199,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
vma = find_vma(current->mm, addr);
if (!vma || addr < vma->vm_start)
return -ENOMEM;
- mincore_walk.mm = vma->vm_mm;
end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
+ if (!can_do_mincore(vma)) {
+ unsigned long pages = (end - addr) >> PAGE_SHIFT;
+ memset(vec, 1, pages);
+ return pages;
+ }
+ mincore_walk.mm = vma->vm_mm;
err = walk_page_range(addr, end, &mincore_walk);
if (err < 0)
return err;


--
Jiri Kosina
SUSE Labs


2019-03-06 23:15:29

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

On Wed, 30 Jan 2019 13:44:18 +0100 Vlastimil Babka <[email protected]> wrote:

> From: Jiri Kosina <[email protected]>
>
> The semantics of what mincore() considers to be resident is not completely
> clear, but Linux has always (since 2.3.52, which is when mincore() was
> initially done) treated it as "page is available in page cache".
>
> That's potentially a problem, as that [in]directly exposes meta-information
> about pagecache / memory mapping state even about memory not strictly belonging
> to the process executing the syscall, opening possibilities for sidechannel
> attacks.
>
> Change the semantics of mincore() so that it only reveals pagecache information
> for non-anonymous mappings that belog to files that the calling process could
> (if it tried to) successfully open for writing.

"for writing" comes as a bit of a surprise. Why not for reading?

Could we please explain the reasoning in the changelog and in the
(presently absent) comments which describe can_do_mincore()?

> @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
> vma = find_vma(current->mm, addr);
> if (!vma || addr < vma->vm_start)
> return -ENOMEM;
> - mincore_walk.mm = vma->vm_mm;
> end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> + if (!can_do_mincore(vma)) {
> + unsigned long pages = (end - addr) >> PAGE_SHIFT;

I'm not sure this is correct in all cases. If

addr = 4095
vma->vm_end = 4096
pages = 1000

then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
should have been 1.

Please check?

A mincore test suite in tools/testing/selftests would be useful,
methinks. To exercise such corner cases, check for future breakage,
etc.

> + memset(vec, 1, pages);
> + return pages;
> + }
> + mincore_walk.mm = vma->vm_mm;
> err = walk_page_range(addr, end, &mincore_walk);
> if (err < 0)
> return err;



2019-03-06 23:25:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina <[email protected]> wrote:

> 3/3 is actually waiting for your decision, see
>
> https://lore.kernel.org/lkml/[email protected]/

I pity anyone who tried to understand this code by reading this code.
Can we please get some careful commentary in there explaining what is
going on, and why things are thus?

I guess the [3/3] change makes sense, although it's unclear whether
anyone really needs it? 5.0 was released with 574823bfab8 ("Change
mincore() to count "mapped" pages rather than "cached" pages") so we'll
have a release cycle to somewhat determine how much impact 574823bfab8
has on users. How about I queue up [3/3] and we reevaluate its
desirability in a couple of months?




2019-03-06 23:33:01

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

Andrew Morton wrote on Wed, Mar 06, 2019:
> On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina <[email protected]> wrote:
>
> > 3/3 is actually waiting for your decision, see
> >
> > https://lore.kernel.org/lkml/[email protected]/
>
> I pity anyone who tried to understand this code by reading this code.
> Can we please get some careful commentary in there explaining what is
> going on, and why things are thus?
>
> I guess the [3/3] change makes sense, although it's unclear whether
> anyone really needs it? 5.0 was released with 574823bfab8 ("Change
> mincore() to count "mapped" pages rather than "cached" pages") so we'll
> have a release cycle to somewhat determine how much impact 574823bfab8
> has on users. How about I queue up [3/3] and we reevaluate its
> desirability in a couple of months?

FWIW,

574823bfab8 has been reverted in 30bac164aca750, included in 5.0-rc4, so
the controversial change has only been there from 5.0-rc1 to 5.0-rc3

--
Dominique

2019-03-06 23:40:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

On Thu, 7 Mar 2019 00:32:09 +0100 Dominique Martinet <[email protected]> wrote:

> Andrew Morton wrote on Wed, Mar 06, 2019:
> > On Wed, 6 Mar 2019 23:48:03 +0100 (CET) Jiri Kosina <[email protected]> wrote:
> >
> > > 3/3 is actually waiting for your decision, see
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> >
> > I pity anyone who tried to understand this code by reading this code.
> > Can we please get some careful commentary in there explaining what is
> > going on, and why things are thus?
> >
> > I guess the [3/3] change makes sense, although it's unclear whether
> > anyone really needs it? 5.0 was released with 574823bfab8 ("Change
> > mincore() to count "mapped" pages rather than "cached" pages") so we'll
> > have a release cycle to somewhat determine how much impact 574823bfab8
> > has on users. How about I queue up [3/3] and we reevaluate its
> > desirability in a couple of months?
>
> FWIW,
>
> 574823bfab8 has been reverted in 30bac164aca750, included in 5.0-rc4, so
> the controversial change has only been there from 5.0-rc1 to 5.0-rc3

Ah, OK, thanks, I misread.

Linus, do you have thoughts on
http://lkml.kernel.org/r/[email protected] ?


2019-03-07 00:02:29

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

On Wed, 6 Mar 2019, Andrew Morton wrote:

> > The semantics of what mincore() considers to be resident is not completely
> > clear, but Linux has always (since 2.3.52, which is when mincore() was
> > initially done) treated it as "page is available in page cache".
> >
> > That's potentially a problem, as that [in]directly exposes meta-information
> > about pagecache / memory mapping state even about memory not strictly belonging
> > to the process executing the syscall, opening possibilities for sidechannel
> > attacks.
> >
> > Change the semantics of mincore() so that it only reveals pagecache information
> > for non-anonymous mappings that belog to files that the calling process could
> > (if it tried to) successfully open for writing.
>
> "for writing" comes as a bit of a surprise. Why not for reading?

I guess this is a rhetorical question from you :) but fair enough, good
point, I'll explain this a bit more in the changelog and in the code
comments.

> > @@ -189,8 +197,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
> > vma = find_vma(current->mm, addr);
> > if (!vma || addr < vma->vm_start)
> > return -ENOMEM;
> > - mincore_walk.mm = vma->vm_mm;
> > end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
> > + if (!can_do_mincore(vma)) {
> > + unsigned long pages = (end - addr) >> PAGE_SHIFT;
>
> I'm not sure this is correct in all cases. If
>
> addr = 4095
> vma->vm_end = 4096
> pages = 1000
>
> then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
> should have been 1.

Good catch! It should rather be something like

unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT);

I'll fix that up and resend tomorrow.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-03-07 00:41:31

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

Jiri Kosina wrote on Thu, Mar 07, 2019:
> > I'm not sure this is correct in all cases. If
> >
> > addr = 4095
> > vma->vm_end = 4096
> > pages = 1000
> >
> > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
> > should have been 1.
>
> Good catch! It should rather be something like
>
> unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT);

That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would
still want to count that as one page.
I'm not too familiar with this area of the code, but I think there's a
handy macro we can use for this, perhaps
DIV_ROUND_UP(end - addr, PAGE_SIZE) ?

kernel/kexec_core.c has defined PAGE_COUNT() which seems more
appropriate but I do not see a global equivalent
#define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

--
Dominique

2019-03-07 05:48:12

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm/mincore: make mincore() more conservative

On Thu, 7 Mar 2019, Dominique Martinet wrote:

> > >
> > > addr = 4095
> > > vma->vm_end = 4096
> > > pages = 1000
> > >
> > > then `end' is 4096 and `(end - addr) << PAGE_SHIFT' is zero, but it
> > > should have been 1.
> >
> > Good catch! It should rather be something like
> >
> > unsigned long pages = (end >> PAGE_SHIFT) - (addr >> PAGE_SHIFT);
>
> That would be 0 for addr = 0 and vma->vm_end = 1; I assume we would
> still want to count that as one page.

Yeah, that was bogus as well, ETOOTIRED yesterday, sorry for the noise.
Both the variants are off.

> I'm not too familiar with this area of the code, but I think there's a
> handy macro we can use for this, perhaps
> DIV_ROUND_UP(end - addr, PAGE_SIZE) ?
>
> kernel/kexec_core.c has defined PAGE_COUNT() which seems more
> appropriate but I do not see a global equivalent
> #define PAGE_COUNT(x) (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

I'll fix that up when doing the other changes requested by Andrew.

Thanks,

--
Jiri Kosina
SUSE Labs


2019-03-09 16:56:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/3] mincore() and IOCB_NOWAIT adjustments

On Wed, Mar 6, 2019 at 3:38 PM Andrew Morton <[email protected]> wrote:
>
> Linus, do you have thoughts on
> http://lkml.kernel.org/r/[email protected] ?

I think that's fine, and probably the right thing to do, but I also
suspect that nobody actually cares ;(

Linus

2019-03-12 14:18:31

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH v2 0/2] prevent mincore() page cache leaks

Here's a new version of the mincore() patches, with feedback from Andrew Morton
applied. The IOCB_NOWAIT patch was dropped since David Chinner pointed out it's
incomplete. We definitely want the first patch, while for the second Linus
said:

I think that's fine, and probably the right thing to do, but I also
suspect that nobody actually cares ;(

Whether or not somebody cares, we should hear of no breakage. If somebody does
care after all, without second patch we might hear of breakage, so I would
suggest applying it. It's not that complicated after all (famous last words?)


Jiri Kosina (1):
mm/mincore: make mincore() more conservative

Vlastimil Babka (1):
mm/mincore: provide mapped status when cached status is not allowed

mm/mincore.c | 80 ++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 68 insertions(+), 12 deletions(-)

--
2.20.1


2019-03-12 14:18:33

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH v2 2/2] mm/mincore: provide mapped status when cached status is not allowed

After "mm/mincore: make mincore() more conservative" we sometimes restrict the
information about page cache residency, which needs to be done without breaking
existing userspace, as much as possible. Instead of returning with error, we
thus fake the results. For that we return residency values as 1, which should
be safer than faking them as 0, as there might theoretically exist code that
would try to fault in the page(s) in a loop until mincore() returns 1 for them.

Faking 1 however means that such code would not fault in a page even if it was
not truly in page cache, with possibly unwanted performance implications. We
can improve the situation by revisting the approach of 574823bfab82 ("Change
mincore() to count "mapped" pages rather than "cached" pages"), later reverted
by 30bac164aca7 and replaced by restricting/faking the results. In this patch
we apply the approach only to cases where page cache residency check is
restricted. Thus mincore() will return 0 for an unmapped page (which may or may
not be resident in a pagecache), and 1 after the process faults it in.

One potential downside is that mincore() users will be now able to recognize
when a previously mapped page was reclaimed. While that might be useful for
some attack scenarios, it is not as crucial as recognizing that somebody else
faulted the page in, which is the main reason we are making mincore() more
conservative. For detecting that pages being reclaimed, there are also other
existing ways anyway.

Cc: Jiri Kosina <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Kevin Easton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Cyril Hrubis <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Daniel Gruss <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/mincore.c | 67 +++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 51 insertions(+), 16 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index c3f058bd0faf..c9a265abc631 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -21,12 +21,23 @@
#include <linux/uaccess.h>
#include <asm/pgtable.h>

+/*
+ * mincore() page walk's private structure. Contains pointer to the array
+ * of return values to be set, and whether the current vma passed the
+ * can_do_mincore() check.
+ */
+struct mincore_walk_private {
+ unsigned char *vec;
+ bool can_check_pagecache;
+};
+
static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
#ifdef CONFIG_HUGETLB_PAGE
unsigned char present;
- unsigned char *vec = walk->private;
+ struct mincore_walk_private *walk_private = walk->private;
+ unsigned char *vec = walk_private->vec;

/*
* Hugepages under user process are always in RAM and never
@@ -35,7 +46,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
present = pte && !huge_pte_none(huge_ptep_get(pte));
for (; addr != end; vec++, addr += PAGE_SIZE)
*vec = present;
- walk->private = vec;
+ walk_private->vec = vec;
#else
BUG();
#endif
@@ -85,7 +96,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
}

static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
- struct vm_area_struct *vma, unsigned char *vec)
+ struct vm_area_struct *vma, unsigned char *vec,
+ bool can_check_pagecache)
{
unsigned long nr = (end - addr) >> PAGE_SHIFT;
int i;
@@ -95,7 +107,14 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,

pgoff = linear_page_index(vma, addr);
for (i = 0; i < nr; i++, pgoff++)
- vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+ /*
+ * Return page cache residency state if we are allowed
+ * to, otherwise return mapping state, which is 0 for
+ * an unmapped range.
+ */
+ vec[i] = can_check_pagecache ?
+ mincore_page(vma->vm_file->f_mapping, pgoff)
+ : 0;
} else {
for (i = 0; i < nr; i++)
vec[i] = 0;
@@ -106,8 +125,11 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
static int mincore_unmapped_range(unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
- walk->private += __mincore_unmapped_range(addr, end,
- walk->vma, walk->private);
+ struct mincore_walk_private *walk_private = walk->private;
+ unsigned char *vec = walk_private->vec;
+
+ walk_private->vec += __mincore_unmapped_range(addr, end, walk->vma,
+ vec, walk_private->can_check_pagecache);
return 0;
}

@@ -117,7 +139,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
spinlock_t *ptl;
struct vm_area_struct *vma = walk->vma;
pte_t *ptep;
- unsigned char *vec = walk->private;
+ struct mincore_walk_private *walk_private = walk->private;
+ unsigned char *vec = walk_private->vec;
int nr = (end - addr) >> PAGE_SHIFT;

ptl = pmd_trans_huge_lock(pmd, vma);
@@ -128,7 +151,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}

if (pmd_trans_unstable(pmd)) {
- __mincore_unmapped_range(addr, end, vma, vec);
+ __mincore_unmapped_range(addr, end, vma, vec,
+ walk_private->can_check_pagecache);
goto out;
}

@@ -138,7 +162,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,

if (pte_none(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
- vma, vec);
+ vma, vec, walk_private->can_check_pagecache);
else if (pte_present(pte))
*vec = 1;
else { /* pte is a swap entry */
@@ -152,8 +176,20 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
*vec = 1;
} else {
#ifdef CONFIG_SWAP
- *vec = mincore_page(swap_address_space(entry),
+ /*
+ * If tmpfs pages are being swapped out, treat
+ * it with same restrictions on mincore() as
+ * the page cache so we don't expose that
+ * somebody else brought them back from swap.
+ * In the restricted case return 0 as swap
+ * entry means the page is not mapped.
+ */
+ if (walk_private->can_check_pagecache)
+ *vec = mincore_page(
+ swap_address_space(entry),
swp_offset(entry));
+ else
+ *vec = 0;
#else
WARN_ON(1);
*vec = 1;
@@ -195,22 +231,21 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
struct vm_area_struct *vma;
unsigned long end;
int err;
+ struct mincore_walk_private walk_private = {
+ .vec = vec
+ };
struct mm_walk mincore_walk = {
.pmd_entry = mincore_pte_range,
.pte_hole = mincore_unmapped_range,
.hugetlb_entry = mincore_hugetlb,
- .private = vec,
+ .private = &walk_private
};

vma = find_vma(current->mm, addr);
if (!vma || addr < vma->vm_start)
return -ENOMEM;
end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
- if (!can_do_mincore(vma)) {
- unsigned long pages = DIV_ROUND_UP(end - addr, PAGE_SIZE);
- memset(vec, 1, pages);
- return pages;
- }
+ walk_private.can_check_pagecache = can_do_mincore(vma);
mincore_walk.mm = vma->vm_mm;
err = walk_page_range(addr, end, &mincore_walk);
if (err < 0)
--
2.20.1


2019-03-12 14:18:39

by Vlastimil Babka

[permalink] [raw]
Subject: [PATCH v2 1/2] mm/mincore: make mincore() more conservative

From: Jiri Kosina <[email protected]>

The semantics of what mincore() considers to be resident is not completely
clear, but Linux has always (since 2.3.52, which is when mincore() was
initially done) treated it as "page is available in page cache".

That's potentially a problem, as that [in]directly exposes meta-information
about pagecache / memory mapping state even about memory not strictly belonging
to the process executing the syscall, opening possibilities for sidechannel
attacks.

Change the semantics of mincore() so that it only reveals pagecache information
for non-anonymous mappings that belog to files that the calling process could
(if it tried to) successfully open for writing; otherwise we'd be including
shared non-exclusive mappings, which

- is the sidechannel

- is not the usecase for mincore(), as that's primarily used for data, not
(shared) text

[[email protected]: restructure can_do_mincore() conditions]
Originally-by: Linus Torvalds <[email protected]>
Originally-by: Dominique Martinet <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Kevin Easton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Cyril Hrubis <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Daniel Gruss <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Josh Snyder <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/mincore.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..c3f058bd0faf 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -169,6 +169,22 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
return 0;
}

+static inline bool can_do_mincore(struct vm_area_struct *vma)
+{
+ if (vma_is_anonymous(vma))
+ return true;
+ if (!vma->vm_file)
+ return false;
+ /*
+ * Reveal pagecache information only for non-anonymous mappings that
+ * correspond to the files the calling process could (if tried) open
+ * for writing; otherwise we'd be including shared non-exclusive
+ * mappings, which opens a side channel.
+ */
+ return inode_owner_or_capable(file_inode(vma->vm_file)) ||
+ inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
+}
+
/*
* Do a chunk of "sys_mincore()". We've already checked
* all the arguments, we hold the mmap semaphore: we should
@@ -189,8 +205,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
vma = find_vma(current->mm, addr);
if (!vma || addr < vma->vm_start)
return -ENOMEM;
- mincore_walk.mm = vma->vm_mm;
end = min(vma->vm_end, addr + (pages << PAGE_SHIFT));
+ if (!can_do_mincore(vma)) {
+ unsigned long pages = DIV_ROUND_UP(end - addr, PAGE_SIZE);
+ memset(vec, 1, pages);
+ return pages;
+ }
+ mincore_walk.mm = vma->vm_mm;
err = walk_page_range(addr, end, &mincore_walk);
if (err < 0)
return err;
--
2.20.1