LinuxLists.cc - Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

2007-08-01 08:11:46

Subject: Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

On 7/31/07, Eric Sandeen <[email protected]> wrote:

> No, what I had did only that, so it was still a matter of probabilities...

How expensive would it be to allocate two pages, then use the MMU to mark
the second page unwritable? Hardware-wise it should be possible (for
constant 4k page sizes; I have not worked with variable-pagesize MMUs),
and since it's a per-context-switch constant operation, it would be a
special case in the fault handler rather than adding another entry to
the VM for every process.
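
To make the idea concrete, here is a rough userspace analogue of the
two-page scheme, using mmap() and mprotect() rather than kernel page
tables. It is only a sketch: struct guarded_stack and both helper names
are made up for illustration, and a real kernel version would have to
edit kernel ptes directly, which this does not model.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical userspace model of a two-page stack: one writable page
 * plus one read-only page that traps writes. Not kernel code. */
struct guarded_stack {
	void *guard;      /* lower page: readable, not writable */
	void *usable;     /* upper page: the actual stack page */
	size_t page_size;
};

/* Map two pages and write-protect the lower one. Returns 0 on success. */
int guarded_stack_alloc(struct guarded_stack *gs)
{
	long page = sysconf(_SC_PAGESIZE);
	void *base;

	assert(gs != NULL);
	if (page <= 0)
		return -1;

	base = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	/* Stacks grow down on x86, so an overflow runs into the lower page. */
	if (mprotect(base, (size_t)page, PROT_READ) != 0) {
		munmap(base, 2 * (size_t)page);
		return -1;
	}

	gs->guard = base;
	gs->usable = (char *)base + page;
	gs->page_size = (size_t)page;
	return 0;
}

/* Unmap both pages. Returns 0 on success. */
int guarded_stack_free(struct guarded_stack *gs)
{
	assert(gs != NULL);
	return munmap(gs->guard, 2 * gs->page_size);
}
```

After guarded_stack_alloc() succeeds, reads from gs.guard still work but
any write to it faults, which is the behaviour I'd want from the second
page.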

Using large hardware pages to cover the kernel mapping could be worked
around by leaving the area where the current process stack resides
mapped via 4k pages. Of course, I haven't touched a modern PC MMU in
ages, so I could be missing something fundamentally difficult.

The other issue is with the layered IO design - no matter what we
configure the stack size to, it is still possible to create a set of
translation layers that will cause it to crash regularly: XFS on
dm_crypt on loop on XFS on dm_crypt on loop on ad infinitum.

That said, I'm missing something here - why is the stack growing?
Filesystems should be issuing bios with callbacks, so they should be
back off the stack, same with dm, loop, etc. Am I missing a step where
they use a wrapper function that pretends to be synchronous?
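
A toy model of what I mean, with no claim that it matches the real
block layer: if each layer calls straight into the layer below, the
nesting depth equals the number of layers, whereas if each layer only
queues work for the layer below and returns, the depth stays constant.
struct io_layer, nested_submit() and queued_submit() are hypothetical
names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one translation layer (XFS, dm_crypt, loop...). */
struct io_layer {
	const char *name;
	struct io_layer *lower;   /* next layer down, NULL at the bottom */
};

/* Each layer calls directly into the one below it.
 * Returns the deepest call nesting reached; pass depth = 1 for the top. */
int nested_submit(const struct io_layer *layer, int depth)
{
	assert(layer != NULL);
	if (layer->lower == NULL)
		return depth;
	return nested_submit(layer->lower, depth + 1);
}

/* Each layer hands its request back to a loop at the top instead of
 * calling down. Every handler runs at depth 1 under that loop.
 * Returns the deepest call nesting reached. */
int queued_submit(const struct io_layer *top)
{
	const struct io_layer *pending = top;
	int max_depth = 0;

	assert(top != NULL);
	while (pending != NULL) {
		int depth = 1;            /* handler runs directly under the loop */

		if (depth > max_depth)
			max_depth = depth;
		pending = pending->lower; /* "queue" the request for the next layer */
	}
	return max_depth;
}
```

For an XFS on dm_crypt on loop on XFS on dm_crypt on loop chain, the
nested variant reaches a depth of six and the queued variant stays at
one, so only the first grows with the number of layers.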

2007-08-01 13:34:27

by Andrea Arcangeli

Subject: Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

On Wed, Aug 01, 2007 at 04:11:23AM -0400, Dan Merillat wrote:
> How expensive would it be to allocate two pages, then use the MMU to mark
> the second page unwritable? Hardware-wise it should be possible (for

Tweaking kernel ptes is prohibitive during clone() because that's
kernel memory and it would require a flush tlb all with IPIs that
won't scale (IPIs are really the blocker). Basically vmalloc already
does what you suggest with the gap page and yet we can't use it for
performance reasons. Kernel stack should be readable by any context to
allow sysrq+t kind of things, so I doubt it's feasible to do tricks to
avoid IPIs.

2007-08-01 15:47:49

by Alan

Subject: Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

On Wed, 1 Aug 2007 15:33:58 +0200
Andrea Arcangeli <[email protected]> wrote:

> On Wed, Aug 01, 2007 at 04:11:23AM -0400, Dan Merillat wrote:
> > How expensive would it be to allocate two pages, then use the MMU to mark
> > the second page unwritable? Hardware-wise it should be possible (for
>
> Tweaking kernel ptes is prohibitive during clone() because that's
> kernel memory and it would require a flush tlb all with IPIs that
> won't scale (IPIs are really the blocker)

Agreed - except when doing debug work, where it's an acceptable cost. You
still have to sort the debug side out because you are going to fault the
kernel stack which will probably then cause a triple fault and reboot on
the spot.

2007-08-10 01:03:50

by Dan Merillat

Subject: Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

On 8/1/07, Alan Cox <[email protected]> wrote:
> On Wed, 1 Aug 2007 15:33:58 +0200
> Andrea Arcangeli <[email protected]> wrote:
> > Tweaking kernel ptes is prohibitive during clone() because that's
> > kernel memory and it would require a flush tlb all with IPIs that
> > won't scale (IPIs are really the blocker)
>
> Agreed - except when doing debug work, where it's an acceptable cost. You
> still have to sort the debug side out because you are going to fault the
> kernel stack which will probably then cause a triple fault and reboot on
> the spot.

I was assuming debugging work, yes. I was also thinking it wouldn't
be done at clone() time, but mapped (on a single CPU) at the time of a
context switch. That would eliminate the IPI, but would probably make the
rest of the TLB handling much too ugly to contemplate. As an
alternative, could the TLB flush and associated IPI be deferred until
the process migrates? The first migration would trigger the flush/IPI;
further migrations would work as they do now, no? I'd happily run it with
various dm/md layers underneath.

On 8/1/07, Denis Vlasenko <[email protected]> wrote:
> Hmm, neat. Why do you need to _allocate second page_ at all?
> Just mark it "not present"...

Because the kernel mapping covers all physical memory contiguously, so
if the page isn't allocated, it could be in use by a kernel data
structure you need to access. That's the same reason the kernel stack
has to be contiguous pages. Well, for non-highmem at least. Either way,
you don't want to mark an in-use page as inaccessible; you never know
what's under there.
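
As a rough illustration of the arithmetic, assuming a simple linear
mapping where kernel virtual address = physical address + a fixed
offset: the page next to a stack page in virtual space is always the
adjacent physical frame, whoever owns it. The TOY_ names and helpers
below are made up for this sketch, and 0xC0000000 is just the usual
32-bit x86 3G/1G split, picked for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_OFFSET 0xC0000000u  /* assumed start of the linear mapping */
#define TOY_PAGE_SIZE   4096u        /* assumed constant 4k pages */

/* Linear mapping: every lowmem physical address has one fixed virtual alias. */
uint32_t toy_phys_to_virt(uint32_t phys)
{
	return phys + TOY_PAGE_OFFSET;
}

uint32_t toy_virt_to_phys(uint32_t virt)
{
	assert(virt >= TOY_PAGE_OFFSET);
	return virt - TOY_PAGE_OFFSET;
}

/* Physical frame number of the page just below a stack page. In a linear
 * mapping this is fixed by arithmetic; it cannot be chosen to be "free". */
uint32_t toy_frame_below(uint32_t stack_virt)
{
	uint32_t frame = toy_virt_to_phys(stack_virt) / TOY_PAGE_SIZE;

	assert(frame > 0);
	return frame - 1;
}
```

With a stack page at virtual 0xC0105000 (physical frame 0x105), the
page below is necessarily frame 0x104. Unless that frame was allocated
along with the stack, marking it inaccessible hides whatever the
allocator put there.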