2014-10-25 00:22:56

by Andy Lutomirski

Subject: vmalloced stacks on x86_64?

Is there any good reason not to use vmalloc for x86_64 stacks?

The tricky bits I've thought of are:

- On any context switch, we probably need to probe the new stack
before switching to it. That way, if it's going to fault due to an
out-of-sync pgd, we still have a stack available to handle the fault.

- Any time we change cr3, we may need to check that the pgd
corresponding to rsp is there. If not, we need to sync it over.

- For simplicity, we probably want all stack ptes to be present all
the time. This is fine; vmalloc already works that way.

- If we overrun the stack, we double-fault. This should be easy to
detect: any double-fault where rsp is less than 20 bytes from the
bottom of the stack is a failure to deliver a non-IST exception due to
a stack overflow. The question is: what do we do if this happens?
We could just panic (guaranteed to work). We could also try to
recover by killing the offending task, but that might be a bit
challenging, since we're in IST context. We could do something truly
awful: increment RSP by a few hundred bytes, point RIP at do_exit, and
return from the double fault.
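
The detection test itself could be as simple as this (untested sketch; the
helper and the exact margin are illustrative, not an existing kernel function):

#include <linux/ptrace.h>
#include <linux/sched.h>	/* current, task_stack_page() */

/* In the #DF handler: if rsp sits within a few bytes of the bottom of the
 * current task's stack, the CPU failed to push a non-IST exception frame,
 * i.e. we ran off the end of the stack. */
static bool df_due_to_stack_overflow(struct pt_regs *regs)
{
	unsigned long bottom = (unsigned long)task_stack_page(current);

	return regs->sp >= bottom && regs->sp < bottom + 20;
}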

Thoughts? This shouldn't be all that much code.
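
To give a sense of scale: the pre-switch probe from the first item could be
little more than a single touch of the new stack (sketch only; the helper
name is made up, and the natural call site would be context_switch() just
before switch_to()):

#include <linux/sched.h>	/* struct task_struct, task_stack_page() */

static inline void probe_next_stack(struct task_struct *next)
{
	/* Force any vmalloc-area fault (out-of-sync pgd) to happen now,
	 * while rsp still points at the old, known-good stack. */
	(void)*(volatile char *)task_stack_page(next);
}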

--Andy


2014-10-25 02:38:41

by H. Peter Anvin

Subject: Re: vmalloced stacks on x86_64?

On 10/24/2014 05:22 PM, Andy Lutomirski wrote:
> Is there any good reason not to use vmalloc for x86_64 stacks?

Additional TLB pressure, if nothing else.

Now, on the flipside: what is the *benefit*?

-hpa

2014-10-25 19:03:09

by Andy Lutomirski

Subject: Re: vmalloced stacks on x86_64?

On Oct 25, 2014 2:15 AM, "Ingo Molnar" <[email protected]> wrote:
>
>
> * Andy Lutomirski <[email protected]> wrote:
>
> > Is there any good reason not to use vmalloc for x86_64 stacks?
>
> In addition to what hpa mentioned, __pa()/__va() on-kstack DMA
> gets tricky, for legacy drivers. (Not sure how many of these are
> left though.)

Hopefully very few. DMA debugging warns if the driver uses the DMA
API, and if the driver doesn't, then IOMMUs will break it.

virtio-net is an oddball offender. I have a patch.

--Andy

>
> Thanks,
>
> Ingo

2014-10-25 20:24:33

by Ingo Molnar

Subject: Re: vmalloced stacks on x86_64?


* Andy Lutomirski <[email protected]> wrote:

> Is there any good reason not to use vmalloc for x86_64 stacks?

In addition to what hpa mentioned, __pa()/__va() on-kstack DMA
gets tricky, for legacy drivers. (Not sure how many of these are
left though.)

Thanks,

Ingo

2014-10-25 21:12:11

by Andy Lutomirski

Subject: Re: vmalloced stacks on x86_64?

On Oct 24, 2014 7:38 PM, "H. Peter Anvin" <[email protected]> wrote:
>
> On 10/24/2014 05:22 PM, Andy Lutomirski wrote:
> > Is there any good reason not to use vmalloc for x86_64 stacks?
>
> Additional TLB pressure, if nothing else.

I wonder how much this matters. It certainly helps on context
switches if the new stack is in the same TLB entry. But, for entries
that use less than one page of stack, I can imagine this making almost
no difference.

>
> Now, on the flipside: what is the *benefit*?

Immediate exception on overflow, and no high order allocation issues.
The former is a nice mitigation against exploits based on overflowing
the stack.
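
For comparison, the two allocation strategies would look roughly like this
(a sketch, not the actual fork.c code; the function names are made up):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <asm/page.h>		/* THREAD_SIZE, THREAD_SIZE_ORDER */

/* Today: one physically contiguous, order-2 block; this can fail under
 * fragmentation even when plenty of individual pages are free. */
static void *alloc_stack_contiguous(void)
{
	struct page *page = alloc_pages(GFP_KERNEL, THREAD_SIZE_ORDER);

	return page ? page_address(page) : NULL;
}

/* Proposed: order-0 pages stitched together in the vmalloc area, so an
 * overrun lands in an unmapped page and faults immediately. */
static void *alloc_stack_virtual(void)
{
	return vmalloc(THREAD_SIZE);
}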

--Andy

>
> -hpa
>

2014-10-25 22:26:12

by Richard Weinberger

Subject: Re: vmalloced stacks on x86_64?

On Sat, Oct 25, 2014 at 2:22 AM, Andy Lutomirski <[email protected]> wrote:
> Is there any good reason not to use vmalloc for x86_64 stacks?
>
> The tricky bits I've thought of are:
>
> - On any context switch, we probably need to probe the new stack
> before switching to it. That way, if it's going to fault due to an
> out-of-sync pgd, we still have a stack available to handle the fault.
>
> - Any time we change cr3, we may need to check that the pgd
> corresponding to rsp is there. If not, we need to sync it over.
>
> - For simplicity, we probably want all stack ptes to be present all
> the time. This is fine; vmalloc already works that way.
>
> - If we overrun the stack, we double-fault. This should be easy to
> detect: any double-fault where rsp is less than 20 bytes from the
> bottom of the stack is a failure to deliver a non-IST exception due to
> a stack overflow. The question is: what do we do if this happens?
> We could just panic (guaranteed to work). We could also try to
> recover by killing the offending task, but that might be a bit
> challenging, since we're in IST context. We could do something truly
> awful: increment RSP by a few hundred bytes, point RIP at do_exit, and
> return from the double fault.
>
> Thoughts? This shouldn't be all that much code.

FWIW, grsecurity has this already.
Maybe we can reuse their GRKERNSEC_KSTACKOVERFLOW feature.
It allocates the kernel stack using vmalloc() and installs guard pages.

--
Thanks,
//richard

2014-10-25 23:16:46

by Andy Lutomirski

Subject: Re: vmalloced stacks on x86_64?

On Sat, Oct 25, 2014 at 3:26 PM, Richard Weinberger
<[email protected]> wrote:
> On Sat, Oct 25, 2014 at 2:22 AM, Andy Lutomirski <[email protected]> wrote:
>> Is there any good reason not to use vmalloc for x86_64 stacks?
>>
>> The tricky bits I've thought of are:
>>
>> - On any context switch, we probably need to probe the new stack
>> before switching to it. That way, if it's going to fault due to an
>> out-of-sync pgd, we still have a stack available to handle the fault.
>>
>> - Any time we change cr3, we may need to check that the pgd
>> corresponding to rsp is there. If not, we need to sync it over.
>>
>> - For simplicity, we probably want all stack ptes to be present all
>> the time. This is fine; vmalloc already works that way.
>>
>> - If we overrun the stack, we double-fault. This should be easy to
>> detect: any double-fault where rsp is less than 20 bytes from the
>> bottom of the stack is a failure to deliver a non-IST exception due to
>> a stack overflow. The question is: what do we do if this happens?
>> We could just panic (guaranteed to work). We could also try to
>> recover by killing the offending task, but that might be a bit
>> challenging, since we're in IST context. We could do something truly
>> awful: increment RSP by a few hundred bytes, point RIP at do_exit, and
>> return from the double fault.
>>
>> Thoughts? This shouldn't be all that much code.
>
> FWIW, grsecurity has this already.
> Maybe we can reuse their GRKERNSEC_KSTACKOVERFLOW feature.
> It allocates the kernel stack using vmalloc() and installs guard pages.
>

On brief inspection, grsecurity isn't actually vmallocing the stack.
It seems to be allocating it the normal way and then vmapping it.
That allows it to modify sg_set_buf to work on stack addresses (sigh).
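
(For reference, that vmap() flavor is roughly the following: ordinary
physically-backed pages, with only the mapping used for rsp living in the
vmalloc area. This is a sketch of the idea, not grsecurity's actual code.)

#include <linux/kernel.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <asm/page.h>		/* THREAD_SIZE */
#include <asm/pgtable.h>	/* PAGE_KERNEL */

static void *alloc_vmapped_stack(void)
{
	struct page *pages[THREAD_SIZE / PAGE_SIZE];
	int i;

	for (i = 0; i < ARRAY_SIZE(pages); i++) {
		pages[i] = alloc_page(GFP_KERNEL);
		if (!pages[i])
			goto fail;
	}
	/* The backing pages stay ordinary pages; only the alias used as
	 * the stack lives in the vmalloc address range. */
	return vmap(pages, ARRAY_SIZE(pages), VM_MAP, PAGE_KERNEL);

fail:
	while (i--)
		__free_page(pages[i]);
	return NULL;
}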

After each switch_mm, it probes the whole kernel stack. (This seems
dangerous to me -- if the live stack isn't mapped in the new mm, won't
that double-fault?) I also see no evidence that it probes the new
stack when switching stacks. I suspect that it only works because it
gets lucky.

If we're worried about on-stack DMA, we could (by config option or
otherwise) allow DMA on a vmalloced stack, at least through the sg
interfaces. And we could WARN and fix it :)

--Andy

P.S. I see what appears to be some of my code in grsec. I feel
entirely justified in taking good bits of grsec and sticking them in
the upstream kernel.

2014-10-25 23:32:05

by Richard Weinberger

Subject: Re: vmalloced stacks on x86_64?

On 26.10.2014 at 01:16, Andy Lutomirski wrote:
> On Sat, Oct 25, 2014 at 3:26 PM, Richard Weinberger
> <[email protected]> wrote:
>> On Sat, Oct 25, 2014 at 2:22 AM, Andy Lutomirski <[email protected]> wrote:
>>> Is there any good reason not to use vmalloc for x86_64 stacks?
>>>
>>> The tricky bits I've thought of are:
>>>
>>> - On any context switch, we probably need to probe the new stack
>>> before switching to it. That way, if it's going to fault due to an
>>> out-of-sync pgd, we still have a stack available to handle the fault.
>>>
>>> - Any time we change cr3, we may need to check that the pgd
>>> corresponding to rsp is there. If not, we need to sync it over.
>>>
>>> - For simplicity, we probably want all stack ptes to be present all
>>> the time. This is fine; vmalloc already works that way.
>>>
>>> - If we overrun the stack, we double-fault. This should be easy to
>>> detect: any double-fault where rsp is less than 20 bytes from the
>>> bottom of the stack is a failure to deliver a non-IST exception due to
>>> a stack overflow. The question is: what do we do if this happens?
>>> We could just panic (guaranteed to work). We could also try to
>>> recover by killing the offending task, but that might be a bit
>>> challenging, since we're in IST context. We could do something truly
>>> awful: increment RSP by a few hundred bytes, point RIP at do_exit, and
>>> return from the double fault.
>>>
>>> Thoughts? This shouldn't be all that much code.
>>
>> FWIW, grsecurity has this already.
>> Maybe we can reuse their GRKERNSEC_KSTACKOVERFLOW feature.
>> It allocates the kernel stack using vmalloc() and installs guard pages.
>>
>
> On brief inspection, grsecurity isn't actually vmallocing the stack.
> It seems to be allocating it the normal way and then vmapping it.
> That allows it to modify sg_set_buf to work on stack addresses (sigh).

Oh, you're right. They have changed it. (But not the Kconfig help, of course.)
Last time I looked they did a vmalloc().
I'm not sure which version of the patch it was, but I think it was code like this one:
http://www.grsecurity.net/~spender/kstackovf32.diff

Thanks,
//richard

2014-10-26 04:11:54

by Frederic Weisbecker

Subject: Re: vmalloced stacks on x86_64?

2014-10-25 2:22 GMT+02:00 Andy Lutomirski <[email protected]>:
> Is there any good reason not to use vmalloc for x86_64 stacks?
>
> The tricky bits I've thought of are:
>
> - On any context switch, we probably need to probe the new stack
> before switching to it. That way, if it's going to fault due to an
> out-of-sync pgd, we still have a stack available to handle the fault.

Would that prevent any further faults on a vmalloc'ed kernel
stack? We would need to ensure that pre-faulting, say, the first byte
is enough to sync the whole new stack; otherwise we risk another
fault later, and some places really can't fault safely.

>
> - Any time we change cr3, we may need to check that the pgd
> corresponding to rsp is there. If not, we need to sync it over.
>
> - For simplicity, we probably want all stack ptes to be present all
> the time. This is fine; vmalloc already works that way.
>
> - If we overrun the stack, we double-fault. This should be easy to
> detect: any double-fault where rsp is less than 20 bytes from the
> bottom of the stack is a failure to deliver a non-IST exception due to
> a stack overflow. The question is: what do we do if this happens?
> We could just panic (guaranteed to work). We could also try to
> recover by killing the offending task, but that might be a bit
> challenging, since we're in IST context. We could do something truly
> awful: increment RSP by a few hundred bytes, point RIP at do_exit, and
> return from the double fault.
>
> Thoughts? This shouldn't be all that much code.

2014-10-26 05:49:48

by Andy Lutomirski

Subject: Re: vmalloced stacks on x86_64?

On Oct 25, 2014 9:11 PM, "Frederic Weisbecker" <[email protected]> wrote:
>
> 2014-10-25 2:22 GMT+02:00 Andy Lutomirski <[email protected]>:
> > Is there any good reason not to use vmalloc for x86_64 stacks?
> >
> > The tricky bits I've thought of are:
> >
> > - On any context switch, we probably need to probe the new stack
> > before switching to it. That way, if it's going to fault due to an
> > out-of-sync pgd, we still have a stack available to handle the fault.
>
> Would that prevent any further faults on a vmalloc'ed kernel
> stack? We would need to ensure that pre-faulting, say, the first byte
> is enough to sync the whole new stack; otherwise we risk another
> fault later, and some places really can't fault safely.
>

I think so. The vmalloc faults only happen when the entire top-level
page table entry is missing, and those cover giant swaths of address
space.
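
(Roughly what the x86_64 vmalloc fault path does with that top-level entry,
simplified from memory; see vmalloc_fault() in arch/x86/mm/fault.c for the
real thing.)

#include <linux/sched.h>	/* current */
#include <asm/pgtable.h>

/* Returns 0 after copying the missing top-level entry from the reference
 * (init_mm) tables, or -1 if the address isn't mapped there either. */
static int sync_one_pgd(unsigned long address)
{
	pgd_t *pgd = pgd_offset(current->active_mm, address);
	pgd_t *pgd_ref = pgd_offset_k(address);

	if (pgd_none(*pgd_ref))
		return -1;
	if (pgd_none(*pgd))
		set_pgd(pgd, *pgd_ref);
	return 0;
}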

I don't know whether the vmalloc code guarantees not to span a pmd
(pud? why couldn't these be called pte0, pte1, pte2, etc.?) boundary.

--Andy

> >
> > - Any time we change cr3, we may need to check that the pgd
> > corresponding to rsp is there. If not, we need to sync it over.
> >
> > - For simplicity, we probably want all stack ptes to be present all
> > the time. This is fine; vmalloc already works that way.
> >
> > - If we overrun the stack, we double-fault. This should be easy to
> > detect: any double-fault where rsp is less than 20 bytes from the
> > bottom of the stack is a failure to deliver a non-IST exception due to
> > a stack overflow. The question is: what do we do if this happens?
> > We could just panic (guaranteed to work). We could also try to
> > recover by killing the offending task, but that might be a bit
> > challenging, since we're in IST context. We could do something truly
> > awful: increment RSP by a few hundred bytes, point RIP at do_exit, and
> > return from the double fault.
> >
> > Thoughts? This shouldn't be all that much code.

2014-10-26 16:46:08

by Eric Dumazet

Subject: Re: vmalloced stacks on x86_64?

On Fri, 2014-10-24 at 19:38 -0700, H. Peter Anvin wrote:
> On 10/24/2014 05:22 PM, Andy Lutomirski wrote:
> > Is there any good reason not to use vmalloc for x86_64 stacks?
>
> Additional TLB pressure, if nothing else.

It seems TLB pressure gets less and less attention these days...

Is it still worth trying to reduce it?

I was wondering, for example, why 'hashdist' is not cleared if the current
host runs a NUMA-enabled kernel but has a single node.

Something like the following, maybe?

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7dbe5ec9d9cd08afac13797e2adac291fb703eec..0846ef054b0620a7be0c6f69b1a2f21c78d57d3b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1181,7 +1181,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	hashdist=	[KNL,NUMA] Large hashes allocated during boot
 			are distributed across NUMA nodes. Defaults on
 			for 64-bit NUMA, off otherwise.
-			Format: 0 | 1 (for off | on)
+			Format: 0 | 1 | 2 (for off | on if NUMA host | on)
 
 	hcl=		[IA-64] SGI's Hardware Graph compatibility layer

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1a883705a12a8a12410914be93b2ee65807cc423..8aded4c11c8c1cc5778e9ae2b9cd5146070b5b03 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -668,7 +668,8 @@ static int __init dummy_numa_init(void)

 	node_set(0, numa_nodes_parsed);
 	numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
-
+	if (hashdist == HASHDIST_DEFAULT)
+		hashdist = 0;
 	return 0;
 }


2014-10-26 18:16:12

by Linus Torvalds

Subject: Re: vmalloced stacks on x86_64?

On Sat, Oct 25, 2014 at 4:16 PM, Andy Lutomirski <[email protected]> wrote:
>
> On brief inspection, grsecurity isn't actually vmallocing the stack.
> It seems to be allocating it the normal way and then vmapping it.
> That allows it to modify sg_set_buf to work on stack addresses (sigh).

Perhaps more importantly, the vmalloc space is a limited resource (at
least on 32-bit), and using vmap probably results in less
fragmentation.

I don't think either is really even an option on 32-bit due to the
limited address space. On 64-bit, maybe a virtually remapped stack
would be ok.

Linus

2014-10-26 20:29:58

by Frederic Weisbecker

Subject: Re: vmalloced stacks on x86_64?

On Sat, Oct 25, 2014 at 10:49:25PM -0700, Andy Lutomirski wrote:
> On Oct 25, 2014 9:11 PM, "Frederic Weisbecker" <[email protected]> wrote:
> >
> > 2014-10-25 2:22 GMT+02:00 Andy Lutomirski <[email protected]>:
> > > Is there any good reason not to use vmalloc for x86_64 stacks?
> > >
> > > The tricky bits I've thought of are:
> > >
> > > - On any context switch, we probably need to probe the new stack
> > > before switching to it. That way, if it's going to fault due to an
> > > out-of-sync pgd, we still have a stack available to handle the fault.
> >
> > Would that prevent any further faults on a vmalloc'ed kernel
> > stack? We would need to ensure that pre-faulting, say, the first byte
> > is enough to sync the whole new stack; otherwise we risk another
> > fault later, and some places really can't fault safely.
> >
>
> I think so. The vmalloc faults only happen when the entire top-level
> page table entry is missing, and those cover giant swaths of address
> space.
>
> I don't know whether the vmalloc code guarantees not to span a pmd
> (pud? why couldn't these be called pte0, pte1, pte2, etc.?) boundary.

So dereferencing stack[0] is probably enough for 8KB worth of stack. I think
we have vmalloc_sync_all(), but I heard this only works on x86-64.

Too bad we don't have a universal solution; I have that problem with per-cpu
allocated memory faulting at random places. I hit at least two places where it
got harmful: context tracking and perf callchains. We fixed the latter using
open-coded per-cpu allocation. I still haven't found a solution for context tracking.

2014-10-27 01:12:30

by Andy Lutomirski

Subject: Re: vmalloced stacks on x86_64?

On Sun, Oct 26, 2014 at 1:29 PM, Frederic Weisbecker <[email protected]> wrote:
> On Sat, Oct 25, 2014 at 10:49:25PM -0700, Andy Lutomirski wrote:
>> On Oct 25, 2014 9:11 PM, "Frederic Weisbecker" <[email protected]> wrote:
>> >
>> > 2014-10-25 2:22 GMT+02:00 Andy Lutomirski <[email protected]>:
>> > > Is there any good reason not to use vmalloc for x86_64 stacks?
>> > >
>> > > The tricky bits I've thought of are:
>> > >
>> > > - On any context switch, we probably need to probe the new stack
>> > > before switching to it. That way, if it's going to fault due to an
>> > > out-of-sync pgd, we still have a stack available to handle the fault.
>> >
>> > Would that prevent any further faults on a vmalloc'ed kernel
>> > stack? We would need to ensure that pre-faulting, say, the first byte
>> > is enough to sync the whole new stack; otherwise we risk another
>> > fault later, and some places really can't fault safely.
>> >
>>
>> I think so. The vmalloc faults only happen when the entire top-level
>> page table entry is missing, and those cover giant swaths of address
>> space.
>>
>> I don't know whether the vmalloc code guarantees not to span a pmd
>> (pud? why couldn't these be called pte0, pte1, pte2, etc.?) boundary.
>
> So dereferencing stack[0] is probably enough for 8KB worth of stack. I think
> we have vmalloc_sync_all(), but I heard this only works on x86-64.
>

I have no desire to do this for 32-bit. But we don't need
vmalloc_sync_all -- we just need to sync the one required entry.

> Too bad we don't have a universal solution; I have that problem with per-cpu
> allocated memory faulting at random places. I hit at least two places where
> it got harmful: context tracking and perf callchains. We fixed the latter
> using open-coded per-cpu allocation. I still haven't found a solution for context tracking.

In principle, we could pre-populate all top-level pgd entries at boot,
but that would cost up to 256 pages of memory, I think.
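
(Back-of-the-envelope: the kernel half of the 512-entry pgd is 256 slots,
and populating each one takes a 4 kB pud page, hence the 256 pages, roughly
1 MB, presumably allocated once for init_mm and then shared by every
process pgd.)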

--Andy