2014-10-21 02:00:47

by Dave Jones

Subject: Re: [RFC 2/2] x86_64: expand kernel stack to 16K

On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <[email protected]> wrote:
> >
> > If we removed struct thread_info from the stack allocation then one
> > could do a guard page below the stack. Of course, we'd have to use IST
> > for #PF in that case, which makes it a non-production option.
>
> We could just have the guard page in between the stack and the
> thread_info, take a double fault, and then just map it back in on
> double fault.
>
> That would give us 8kB of "normal" stack, with a very loud fault - and
> then an extra 7kB or so of stack (whatever the size of thread-info is)
> - after the first time it traps.
>
> That said, it's still likely a non-production option due to the page
> table games we'd have to play at fork/clone time.

[thread necrophilia]

So digging this back up, it occurs to me that after we bumped to 16K,
we never did anything like the debug stuff you suggested here.

The reason I'm bringing this up is that, for the last few weeks, I've
been seeing things like...

[27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left

So we're now eating past that first 8KB in some situations.

Do we care? Or shall we only start worrying if it gets even deeper?
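
For reference, that line is the CONFIG_DEBUG_STACK_USAGE check that runs at
process exit. Roughly (this is a sketch from memory, and the helper name is
made up, so don't treat it as a verbatim copy of kernel/exit.c): it scans up
from the bottom of the stack for the first word that was ever written, which
only works because the stack is zero-filled at fork when the option is on.

static unsigned long stack_bytes_left(struct task_struct *p)
{
	/* end_of_stack() points at the lowest usable word, just above
	 * thread_info; anything still zero was never touched. */
	unsigned long *n = end_of_stack(p);

	while (!*n)
		n++;

	return (unsigned long)n - (unsigned long)end_of_stack(p);
}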

Dave


2014-10-21 04:59:33

by Andy Lutomirski

Subject: Re: [RFC 2/2] x86_64: expand kernel stack to 16K

On 10/20/2014 07:00 PM, Dave Jones wrote:
> On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
> > On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <[email protected]> wrote:
> > >
> > > If we removed struct thread_info from the stack allocation then one
> > > could do a guard page below the stack. Of course, we'd have to use IST
> > > for #PF in that case, which makes it a non-production option.

Why is thread_info in the stack allocation anyway? Every time I look at
the entry asm, one (minor) thing that contributes to general
brain-hurtingness / sense of horrified awe is the incomprehensible (to
me) split between task_struct and thread_info.

struct thread_info is at the bottom of the stack, right? If we don't
want to merge it into task_struct, couldn't we stick it at the top of
the stack instead? Anything that can overwrite the *top* of the stack
gives trivial user-controlled CPL0 execution regardless.
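
To make the layout concrete, here's a simplified sketch of the scheme being
discussed (the union is real, but the helper below is a made-up name showing
the classic mask-the-stack-pointer trick; the actual x86_64 code currently
derives thread_info from a per-cpu variable instead, with the same effect):

/* One THREAD_SIZE allocation holds both pieces: thread_info at the
 * lowest addresses, the stack growing down towards it. */
union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE / sizeof(long)];
};

/* Round the stack pointer down to the THREAD_SIZE boundary to find
 * thread_info - which is also why a stack overflow walks straight
 * into it before anything else notices. */
static inline struct thread_info *stack_thread_info(void)
{
	unsigned long sp;

	asm ("mov %%rsp, %0" : "=r" (sp));
	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}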

> >
> > We could just have the guard page in between the stack and the
> > thread_info, take a double fault, and then just map it back in on
> > double fault.
> >
> > That would give us 8kB of "normal" stack, with a very loud fault - and
> > then an extra 7kB or so of stack (whatever the size of thread-info is)
> > - after the first time it traps.
> >
> > That said, it's still likely a non-production option due to the page
> > table games we'd have to play at fork/clone time.

What's wrong with vmalloc? Doesn't it already have guard pages?
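
For illustration (not a real patch, and the helper name is invented), the
allocation side could be as simple as this; vmalloc leaves an unmapped guard
page after each area, so running off the end of the stack faults immediately
instead of silently scribbling on whatever sits below it:

#include <linux/vmalloc.h>

static void *alloc_thread_stack(void)
{
	/* vmalloc space has a hole after every allocation, so a write
	 * past THREAD_SIZE hits an unmapped page and oopses loudly.
	 * A real version would also want THREAD_SIZE alignment and
	 * would have to cope with the stack no longer being
	 * physically contiguous. */
	return vmalloc(THREAD_SIZE);
}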

(Also, we have a shiny hardware dirty bit, so we could relatively
cheaply check whether we're near the limit without any weird
#PF-in-weird-context issues.)
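
Something along these lines, maybe (very much a sketch: the function name is
made up, and it assumes the stack pages each have their own 4K PTE, which the
direct map doesn't always guarantee):

static int stack_pages_dirtied(struct task_struct *tsk)
{
	unsigned long base = (unsigned long)task_stack_page(tsk);
	unsigned int level;
	int i, dirty = 0;

	for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
		/* Only meaningful with 4K mappings; a 2M direct-map
		 * entry has a single dirty bit for the whole thing. */
		pte_t *pte = lookup_address(base + i * PAGE_SIZE, &level);

		if (pte && pte_dirty(*pte))
			dirty++;
	}
	return dirty;
}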

Also, muahaha, I've infected more people with the crazy idea that
intentional double-faults are okay. Suckers! Soon I'll have Linux
returning from interrupts with lret! (IIRC Windows used to do
intentional *triple* faults on context switches, so this should be
considered entirely sensible.)

>
> [thread necrophilia]
>
> So digging this back up, it occurs to me that after we bumped to 16K,
> we never did anything like the debug stuff you suggested here.
>
> The reason I'm bringing this up is that, for the last few weeks, I've
> been seeing things like...
>
> [27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left
>
> So we're now eating past that first 8KB in some situations.
>
> Do we care? Or shall we only start worrying if it gets even deeper?

I would *love* to have an immediate, loud failure when we overrun the
stack. This will unavoidably increase the number of TLB misses, but
that probably isn't so bad.

--Andy