From: Peter Zijstra <[email protected]>
In order to create VMAs that are not accessible to userspace create a new
VM_NOUSER flag. This can be used in conjunction with
install_special_mapping() to inject 'kernel' data into the userspace map.
Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
_PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
Signed-off-by: Peter Zijstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/include/uapi/asm/mman.h | 4 ++++
include/linux/mm.h | 2 ++
include/linux/mman.h | 4 ++++
mm/mmap.c | 12 ++++++++++--
4 files changed, 20 insertions(+), 2 deletions(-)
--- a/arch/x86/include/uapi/asm/mman.h
+++ b/arch/x86/include/uapi/asm/mman.h
@@ -26,6 +26,10 @@
((key) & 0x8 ? VM_PKEY_BIT3 : 0))
#endif
+#define arch_vm_get_page_prot_excl(vm_flags) __pgprot( \
+ ((vm_flags) & VM_NOUSER ? _PAGE_USER : 0) \
+ )
+
#include <asm-generic/mman.h>
#endif /* _ASM_X86_MMAN_H */
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -193,6 +193,7 @@ extern unsigned int kobjsize(const void
#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
#define VM_WIPEONFORK 0x02000000 /* Wipe VMA contents in child. */
#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */
+#define VM_ARCH_0 0x08000000 /* Architecture-specific flag */
#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
#define VM_HUGEPAGE 0x20000000 /* MADV_HUGEPAGE marked this vma */
@@ -224,6 +225,7 @@ extern unsigned int kobjsize(const void
#endif
#if defined(CONFIG_X86)
+# define VM_NOUSER VM_ARCH_0 /* Not accessible by userspace */
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -43,6 +43,10 @@ static inline void vm_unacct_memory(long
#define arch_vm_get_page_prot(vm_flags) __pgprot(0)
#endif
+#ifndef arch_vm_get_page_prot_excl
+#define arch_vm_get_page_prot_excl(vm_flags) __pgprot(0)
+#endif
+
#ifndef arch_validate_prot
/*
* This is called from mprotect(). PROT_GROWSDOWN and PROT_GROWSUP have
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -102,9 +102,17 @@ pgprot_t protection_map[16] __ro_after_i
pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
- return __pgprot(pgprot_val(protection_map[vm_flags &
- (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
+ pgprot_t prot;
+
+ prot = protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
+
+ prot = __pgprot(pgprot_val(prot) |
pgprot_val(arch_vm_get_page_prot(vm_flags)));
+
+ prot = __pgprot(pgprot_val(prot) &
+ ~pgprot_val(arch_vm_get_page_prot_excl(vm_flags)));
+
+ return prot;
}
EXPORT_SYMBOL(vm_get_page_prot);
On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <[email protected]> wrote:
> From: Peter Zijstra <[email protected]>
>
> In order to create VMAs that are not accessible to userspace create a new
> VM_NOUSER flag. This can be used in conjunction with
> install_special_mapping() to inject 'kernel' data into the userspace map.
>
> Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
> pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
> _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
How does this interact with get_user_pages(), etc?
On Tue, Dec 12, 2017 at 10:00:08AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <[email protected]> wrote:
> > From: Peter Zijstra <[email protected]>
> >
> > In order to create VMAs that are not accessible to userspace create a new
> > VM_NOUSER flag. This can be used in conjunction with
> > install_special_mapping() to inject 'kernel' data into the userspace map.
> >
> > Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
> > pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
> > _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
>
> How does this interact with get_user_pages(), etc?
gup would find the page. These patches do in fact rely on that through
the populate things.
On Tue, Dec 12, 2017 at 10:05 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Dec 12, 2017 at 10:00:08AM -0800, Andy Lutomirski wrote:
>> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <[email protected]> wrote:
>> > From: Peter Zijstra <[email protected]>
>> >
>> > In order to create VMAs that are not accessible to userspace create a new
>> > VM_NOUSER flag. This can be used in conjunction with
>> > install_special_mapping() to inject 'kernel' data into the userspace map.
>> >
>> > Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
>> > pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
>> > _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
>>
>> How does this interact with get_user_pages(), etc?
>
> gup would find the page. These patches do in fact rely on that through
> the populate things.
>
Blech. So you can write(2) from the LDT to a file and you can even
sendfile it, perhaps. What happens if it's get_user_page()'d when
modify_ldt() wants to free it?
This patch series scares the crap out of me.
On Tue, Dec 12, 2017 at 10:06:51AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 12, 2017 at 10:05 AM, Peter Zijlstra <[email protected]> wrote:
> > gup would find the page. These patches do in fact rely on that through
> > the populate things.
> >
>
> Blech. So you can write(2) from the LDT to a file and you can even
> sendfile it, perhaps.
Hmm, indeed.. I suppose I could go fix that. But how bad is it to leak
the LDT contents?
What would be far worse of course is if we could read(2) data into the
ldt, I'll look into that.
> What happens if it's get_user_page()'d when
> modify_ldt() wants to free it?
modify_ldt should never free pages, we only ever free pages when we
destroy the mm.
On Tue, Dec 12, 2017 at 10:00:08AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <[email protected]> wrote:
> > From: Peter Zijstra <[email protected]>
> >
> > In order to create VMAs that are not accessible to userspace create a new
> > VM_NOUSER flag. This can be used in conjunction with
> > install_special_mapping() to inject 'kernel' data into the userspace map.
> >
> > Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
> > pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
> > _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
>
> How does this interact with get_user_pages(), etc?
So I went through that code and I think I found a bug related to this.
get_user_pages_fast() will ultimately end up doing
pte_access_permitted() before getting the page, follow_page OTOH does
not do this, which makes for a curious difference between the two.
So I'm thinking we want the below irrespective of the VM_NOUSER patch,
but with VM_NOUSER it would mean write(2) will no longer be able to
access the page.
diff --git a/mm/gup.c b/mm/gup.c
index dfcde13f289a..b852f37a2b0c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -153,6 +153,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
if (flags & FOLL_GET) {
+ if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
+ page = ERR_PTR(-EFAULT);
+ goto out;
+ }
+
get_page(page);
/* drop the pgmap reference now that we hold the page */
On Wed, Dec 13, 2017 at 01:22:11PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 12, 2017 at 10:00:08AM -0800, Andy Lutomirski wrote:
> > On Tue, Dec 12, 2017 at 9:32 AM, Thomas Gleixner <[email protected]> wrote:
> > > From: Peter Zijstra <[email protected]>
> > >
> > > In order to create VMAs that are not accessible to userspace create a new
> > > VM_NOUSER flag. This can be used in conjunction with
> > > install_special_mapping() to inject 'kernel' data into the userspace map.
> > >
> > > Similar to how arch_vm_get_page_prot() allows adding _PAGE_flags to
> > > pgprot_t, introduce arch_vm_get_page_prot_excl() which masks
> > > _PAGE_flags from pgprot_t and use this to implement VM_NOUSER for x86.
> >
> > How does this interact with get_user_pages(), etc?
>
> So I went through that code and I think I found a bug related to this.
>
> get_user_pages_fast() will ultimately end up doing
> pte_access_permitted() before getting the page, follow_page OTOH does
> not do this, which makes for a curious difference between the two.
>
> So I'm thinking we want the below irrespective of the VM_NOUSER patch,
> but with VM_NOUSER it would mean write(2) will no longer be able to
> access the page.
Oh..
We do call pte_access_permitted(), but only for write access.
See can_follow_write_pte().
The issue seems bigger: we also need such calls for other page table levels :-/
Dave, what is effect of this on protection keys?
>
> diff --git a/mm/gup.c b/mm/gup.c
> index dfcde13f289a..b852f37a2b0c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -153,6 +153,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> }
>
> if (flags & FOLL_GET) {
> + if (!pte_access_permitted(pte, !!(flags & FOLL_WRITE))) {
> + page = ERR_PTR(-EFAULT);
> + goto out;
> + }
> +
> get_page(page);
>
> /* drop the pgmap reference now that we hold the page */
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
--
Kirill A. Shutemov
On Wed, Dec 13, 2017 at 03:57:40PM +0300, Kirill A. Shutemov wrote:
> On Wed, Dec 13, 2017 at 01:22:11PM +0100, Peter Zijlstra wrote:
> > get_user_pages_fast() will ultimately end up doing
> > pte_access_permitted() before getting the page, follow_page OTOH does
> > not do this, which makes for a curious difference between the two.
> >
> > So I'm thinking we want the below irrespective of the VM_NOUSER patch,
> > but with VM_NOUSER it would mean write(2) will no longer be able to
> > access the page.
>
> Oh..
>
> We do call pte_access_permitted(), but only for write access.
> See can_follow_write_pte().
My can_follow_write_pte() looks like:
static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
{
return pte_write(pte) ||
((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
}
am I perchance looking at the wrong tee?
> The issue seems bigger: we also need such calls for other page table levels :-/
Sure.
On Wed, Dec 13, 2017 at 03:34:55PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 03:57:40PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Dec 13, 2017 at 01:22:11PM +0100, Peter Zijlstra wrote:
>
> > > get_user_pages_fast() will ultimately end up doing
> > > pte_access_permitted() before getting the page, follow_page OTOH does
> > > not do this, which makes for a curious difference between the two.
> > >
> > > So I'm thinking we want the below irrespective of the VM_NOUSER patch,
> > > but with VM_NOUSER it would mean write(2) will no longer be able to
> > > access the page.
> >
> > Oh..
> >
> > We do call pte_access_permitted(), but only for write access.
> > See can_follow_write_pte().
>
> My can_follow_write_pte() looks like:
>
> static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> {
> return pte_write(pte) ||
> ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> }
>
> am I perchance looking at the wrong tee?
I'm looking at Linus' tree.
It was changed recently:
5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup paths")
+Dan.
--
Kirill A. Shutemov
On Wed, Dec 13, 2017 at 05:43:39PM +0300, Kirill A. Shutemov wrote:
> > am I perchance looking at the wrong tee?
>
> I'm looking at Linus' tree.
Clearly I'm not synced up enough... :/
> It was changed recently:
> 5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup paths")
>
Indeed. So FOLL_GET should also get these tests and, as you said, the
other levels too.
I would like FOLL_POPULATE (doesn't have FOLL_GET) to be allowed
'access'.
On Wed, Dec 13, 2017 at 04:00:07PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 05:43:39PM +0300, Kirill A. Shutemov wrote:
> > > am I perchance looking at the wrong tee?
> >
> > I'm looking at Linus' tree.
>
> Clearly I'm not synced up enough... :/
>
> > It was changed recently:
> > 5c9d2d5c269c ("mm: replace pte_write with pte_access_permitted in fault + gup paths")
> >
>
> Indeed. So FOLL_GET should also get these tests and, as you said, the
> other levels too.
>
> I would like FOLL_POPULATE (doesn't have FOLL_GET) to be allowed
> 'access'.
Similarly, should we avoid arch_vma_access_permitted() if !FOLL_GET ?
On 12/13/2017 04:57 AM, Kirill A. Shutemov wrote:
> Dave, what is effect of this on protection keys?
The goal was to make pkeys-protected userspace memory access
_consistent_ with normal access. Specifically, we want a kernel to
disallow access (or writes) to memory where userspace mapping has a pkey
whose permissions are in conflict with the access.
For instance:
This will fault writing a byte to 'addr':
char *addr = malloc(PAGE_SIZE);
pkey_mprotect(addr, PAGE_SIZE, 13);
pkey_deny_access(13);
*addr[0] = 'f';
But this will write one byte to addr successfully (if it uses the kernel
mapping of the physical page backing 'addr'):
char *addr = malloc(PAGE_SIZE);
pkey_mprotect(addr, PAGE_SIZE, 13);
pkey_deny_access(13);
read(fd, addr, 1);
On Wed, Dec 13, 2017 at 07:14:41AM -0800, Dave Hansen wrote:
> On 12/13/2017 04:57 AM, Kirill A. Shutemov wrote:
> > Dave, what is effect of this on protection keys?
>
> The goal was to make pkeys-protected userspace memory access
> _consistent_ with normal access. Specifically, we want a kernel to
> disallow access (or writes) to memory where userspace mapping has a pkey
> whose permissions are in conflict with the access.
>
> For instance:
>
> This will fault writing a byte to 'addr':
>
> char *addr = malloc(PAGE_SIZE);
> pkey_mprotect(addr, PAGE_SIZE, 13);
> pkey_deny_access(13);
> *addr[0] = 'f';
>
> But this will write one byte to addr successfully (if it uses the kernel
> mapping of the physical page backing 'addr'):
>
> char *addr = malloc(PAGE_SIZE);
> pkey_mprotect(addr, PAGE_SIZE, 13);
> pkey_deny_access(13);
> read(fd, addr, 1);
>
This seems confused to me; why are these two cases different?
On 12/13/2017 07:32 AM, Peter Zijlstra wrote:
>> This will fault writing a byte to 'addr':
>>
>> char *addr = malloc(PAGE_SIZE);
>> pkey_mprotect(addr, PAGE_SIZE, 13);
>> pkey_deny_access(13);
>> *addr[0] = 'f';
>>
>> But this will write one byte to addr successfully (if it uses the kernel
>> mapping of the physical page backing 'addr'):
>>
>> char *addr = malloc(PAGE_SIZE);
>> pkey_mprotect(addr, PAGE_SIZE, 13);
>> pkey_deny_access(13);
>> read(fd, addr, 1);
>>
> This seems confused to me; why are these two cases different?
Protection keys doesn't work in the kernel direct map, so if the read()
was implemented by writing to the direct map alias of 'addr' then this
would bypass protection keys.
On Wed, Dec 13, 2017 at 07:47:46AM -0800, Dave Hansen wrote:
> On 12/13/2017 07:32 AM, Peter Zijlstra wrote:
> >> This will fault writing a byte to 'addr':
> >>
> >> char *addr = malloc(PAGE_SIZE);
> >> pkey_mprotect(addr, PAGE_SIZE, 13);
> >> pkey_deny_access(13);
> >> *addr[0] = 'f';
> >>
> >> But this will write one byte to addr successfully (if it uses the kernel
> >> mapping of the physical page backing 'addr'):
> >>
> >> char *addr = malloc(PAGE_SIZE);
> >> pkey_mprotect(addr, PAGE_SIZE, 13);
> >> pkey_deny_access(13);
> >> read(fd, addr, 1);
> >>
> > This seems confused to me; why are these two cases different?
>
> Protection keys doesn't work in the kernel direct map, so if the read()
> was implemented by writing to the direct map alias of 'addr' then this
> would bypass protection keys.
Which is why get_user_pages() _should_ enforce this.
What use are protection keys if you can trivially circumvent them?
On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <[email protected]> wrote:
>
> Which is why get_user_pages() _should_ enforce this.
>
> What use are protection keys if you can trivially circumvent them?
No, we will *not* worry about protection keys in get_user_pages().
They are not "security". They are a debug aid and safety against random mis-use.
In particular, they are very much *NOT* about "trivially circumvent
them". The user could just change their mapping thing, for chrissake!
We already allow access to PROT_NONE for gdb and friends, very much on purpose.
We're not going to make the VM more complex for something that
absolutely nobody cares about, and has zero security issues.
Linus
On 12/13/2017 10:08 AM, Linus Torvalds wrote:
> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstr <[email protected]> wrote:
>> Which is why get_user_pages() _should_ enforce this.
>>
>> What use are protection keys if you can trivially circumvent them?
> No, we will *not* worry about protection keys in get_user_pages().
We did introduce some support for it here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=33a709b25a760b91184bb335cf7d7c32b8123013
> They are not "security". They are a debug aid and safety against
> random mis-use.
Totally agree. It's not about security. As I mentioned in the commit,
the goal here was to try to make pkey-protected access behavior
consistent with mprotect().
I still think this was nice to do and probably surprises users less than
if we didn't have it.
> We already allow access to PROT_NONE for gdb and friends, very much on purpose.
Yup, exactly, and that's one of the reasons that I tried to call those
out as "remote" access that are specicifially no subject to protection keys.
On Wed, Dec 13, 2017 at 10:21 AM, Dave Hansen <[email protected]> wrote:
> On 12/13/2017 10:08 AM, Linus Torvalds wrote:
>> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstr <[email protected]> wrote:
>>> Which is why get_user_pages() _should_ enforce this.
>>>
>>> What use are protection keys if you can trivially circumvent them?
>> No, we will *not* worry about protection keys in get_user_pages().
>
> We did introduce some support for it here:
>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=33a709b25a760b91184bb335cf7d7c32b8123013
Ugh. I never realized.
We should revert that, I feel. It's literally extra complexity for no
actual real gain, and there is a real downside: the extra complexity
that will cause people to get things wrong.
This thread about us getting it wrong is just the proof. I vote for
not trying to "fix" this case, let's just remove it.
Linus
On Wed, Dec 13, 2017 at 10:08 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> Which is why get_user_pages() _should_ enforce this.
>>
>> What use are protection keys if you can trivially circumvent them?
>
> No, we will *not* worry about protection keys in get_user_pages().
>
Hmm. If I goof some pointer and pass that bogus pointer to read(2),
and I'm using pkey to protect my mmapped database, I think i'd rather
that read(2) fail. Sure, pkey is trivially circumventable using
wrpkru or mprotect, but those are obvious dangerous functions.
On Wed, Dec 13, 2017 at 10:08:30AM -0800, Linus Torvalds wrote:
> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Which is why get_user_pages() _should_ enforce this.
> >
> > What use are protection keys if you can trivially circumvent them?
>
> No, we will *not* worry about protection keys in get_user_pages().
>
> They are not "security". They are a debug aid and safety against random mis-use.
>
> In particular, they are very much *NOT* about "trivially circumvent
> them". The user could just change their mapping thing, for chrissake!
>
> We already allow access to PROT_NONE for gdb and friends, very much on purpose.
>
> We're not going to make the VM more complex for something that
> absolutely nobody cares about, and has zero security issues.
OK, that might have been my phrasing that was off -- mostly because I
was looking at it from the VM_NOUSER angle, but currently:
- gup_pte_range() has pte_access_permitted()
- follow_page_pte() has pte_access_permitted() for FOLL_WRITE only
All I'm saying is that that is inconsistent and we should change
follow_page_pte() to use pte_access_permitted() for FOLL_GET, such that
__get_user_pages_fast() and __get_user_pages() have matching semantics.
Now, if VM_NOUSER were to live, the above change would ensure write(2)
cannot read from such VMAs, where the existing test for FOLL_WRITE
already disallows read(2) from writing to them.
But even without VM_NOUSER it makes the VM more consistent than it is
today.
On Wed, Dec 13, 2017 at 10:32 AM, Peter Zijlstra <[email protected]> wrote:
>
> Now, if VM_NOUSER were to live, the above change would ensure write(2)
> cannot read from such VMAs, where the existing test for FOLL_WRITE
> already disallows read(2) from writing to them.
So I don't mind at all the notion of disallowing access to some
special mappings at the vma level. So a VM_NOUSER flag that just
disallows get_user_pages entirely I'm ok with.
It's the protection keys in particular that I don't like having to
worry about. They are subtle and have odd architecture-specific
meaning, and needs to be checked at all levels in the page table tree.
Linus
On Tue, Dec 12, 2017 at 06:32:26PM +0100, Thomas Gleixner wrote:
> From: Peter Zijstra <[email protected]>
>
> In order to create VMAs that are not accessible to userspace create a new
> VM_NOUSER flag. This can be used in conjunction with
> install_special_mapping() to inject 'kernel' data into the userspace map.
Maybe I misunderstand the intent behind this, but I was recently looking
at something kind of similar. I was calling it VM_NOTLB and it wouldn't
put TLB entries into the userspace map at all. The idea was to be able
to use the user address purely as a handle for specific kernel pages,
which were guaranteed to never be mapped into userspace, so we didn't
need to send TLB invalidations when we took those pages away from the user
process again. But we'd be able to pass the address to read() or write().
So I was going to check the VMA flags in no_page_table() and return the
struct page that was notmapped there. I didn't get as far as constructing
a prototype yet, and I'm not entirely sure I understand the purpose of
this patch, so perhaps there's no synergy here at all (and perhaps my
idea wouldn't have worked anyway).
On Wed, Dec 13, 2017 at 01:50:22PM -0800, Matthew Wilcox wrote:
> On Tue, Dec 12, 2017 at 06:32:26PM +0100, Thomas Gleixner wrote:
> > From: Peter Zijstra <[email protected]>
> >
> > In order to create VMAs that are not accessible to userspace create a new
> > VM_NOUSER flag. This can be used in conjunction with
> > install_special_mapping() to inject 'kernel' data into the userspace map.
>
> Maybe I misunderstand the intent behind this, but I was recently looking
> at something kind of similar. I was calling it VM_NOTLB and it wouldn't
> put TLB entries into the userspace map at all. The idea was to be able
> to use the user address purely as a handle for specific kernel pages,
> which were guaranteed to never be mapped into userspace, so we didn't
> need to send TLB invalidations when we took those pages away from the user
> process again. But we'd be able to pass the address to read() or write().
>
> So I was going to check the VMA flags in no_page_table() and return the
> struct page that was notmapped there. I didn't get as far as constructing
> a prototype yet, and I'm not entirely sure I understand the purpose of
> this patch, so perhaps there's no synergy here at all (and perhaps my
> idea wouldn't have worked anyway).
Yeah, completely different. This here actually needs the page table
entries. Currently we keep the LDT in kernel memory, but with PTI we
loose the entire kernel map.
Since the LDT is strictly per process, the idea was to actually inject
it into the userspace map. Except of course, userspace must not actually
be able to access it. So by mapping it !_PAGE_USER its 'invisible'.
But the CPU very much needs the mapping, it will load the LDT entries
through them.
On Wed, Dec 13, 2017 at 11:12:33PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 01:50:22PM -0800, Matthew Wilcox wrote:
> > On Tue, Dec 12, 2017 at 06:32:26PM +0100, Thomas Gleixner wrote:
> > > From: Peter Zijstra <[email protected]>
> > > In order to create VMAs that are not accessible to userspace create a new
> > > VM_NOUSER flag. This can be used in conjunction with
> > > install_special_mapping() to inject 'kernel' data into the userspace map.
> >
> > Maybe I misunderstand the intent behind this, but I was recently looking
> > at something kind of similar. I was calling it VM_NOTLB and it wouldn't
> > put TLB entries into the userspace map at all. The idea was to be able
> > to use the user address purely as a handle for specific kernel pages,
> > which were guaranteed to never be mapped into userspace, so we didn't
> > need to send TLB invalidations when we took those pages away from the user
> > process again. But we'd be able to pass the address to read() or write().
>
> Since the LDT is strictly per process, the idea was to actually inject
> it into the userspace map. Except of course, userspace must not actually
> be able to access it. So by mapping it !_PAGE_USER its 'invisible'.
>
> But the CPU very much needs the mapping, it will load the LDT entries
> through them.
So can I use your VM_NOUSER VMAs for my purpose? That is, can I change
the page table without flushing the TLB? The only access to these PTEs
will be through the kernel mapping, so I don't see why I'd need to.
On Wed, Dec 13, 2017 at 4:10 PM, Matthew Wilcox <[email protected]> wrote:
> On Wed, Dec 13, 2017 at 11:12:33PM +0100, Peter Zijlstra wrote:
>> On Wed, Dec 13, 2017 at 01:50:22PM -0800, Matthew Wilcox wrote:
>> > On Tue, Dec 12, 2017 at 06:32:26PM +0100, Thomas Gleixner wrote:
>> > > From: Peter Zijstra <[email protected]>
>> > > In order to create VMAs that are not accessible to userspace create a new
>> > > VM_NOUSER flag. This can be used in conjunction with
>> > > install_special_mapping() to inject 'kernel' data into the userspace map.
>> >
>> > Maybe I misunderstand the intent behind this, but I was recently looking
>> > at something kind of similar. I was calling it VM_NOTLB and it wouldn't
>> > put TLB entries into the userspace map at all. The idea was to be able
>> > to use the user address purely as a handle for specific kernel pages,
>> > which were guaranteed to never be mapped into userspace, so we didn't
>> > need to send TLB invalidations when we took those pages away from the user
>> > process again. But we'd be able to pass the address to read() or write().
>>
>> Since the LDT is strictly per process, the idea was to actually inject
>> it into the userspace map. Except of course, userspace must not actually
>> be able to access it. So by mapping it !_PAGE_USER its 'invisible'.
>>
>> But the CPU very much needs the mapping, it will load the LDT entries
>> through them.
>
> So can I use your VM_NOUSER VMAs for my purpose? That is, can I change
> the page table without flushing the TLB? The only access to these PTEs
> will be through the kernel mapping, so I don't see why I'd need to.
I doubt it, since if it's in the kernel pagetables at all, then the
mapping can be cached for kernel purposes.
But I still think this discussion is off in the weeds. x86 does not
actually need any of this stuff.
Linus Torvalds <[email protected]> writes:
> On Wed, Dec 13, 2017 at 7:54 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> Which is why get_user_pages() _should_ enforce this.
>>
>> What use are protection keys if you can trivially circumvent them?
>
> No, we will *not* worry about protection keys in get_user_pages().
>
> They are not "security". They are a debug aid and safety against random mis-use.
>
> In particular, they are very much *NOT* about "trivially circumvent
> them". The user could just change their mapping thing, for chrissake!
>
> We already allow access to PROT_NONE for gdb and friends, very much on purpose.
>
Can you clarify this? We recently did fix read access on PROT_NONE via
gup here for ppc64 https://lkml.kernel.org/r/[email protected]
What is the expected behaviour against gup and get_user_pages for
PROT_NONE.
Another issue is we end up behaving differently with PROT_NONE mapping
based on whether autonuma is enabled or not. For a PROT_NONE mapping we
return true with pte_protnone().
-aneesh