2008-01-25 00:24:21

by Ingo Molnar

[permalink] [raw]
Subject: [x86.git] new CPA implementation


* Andi Kleen <[email protected]> wrote:

> > please unify the files first, we dont want to let pageattr_32.c and
> > pageattr_64.c diverge even more. Once we get these files unified we
> > layer more features ontop of it.
>
> They work significantly differently in a few important areas (e.g.
> particularly regarding NX handling and the kernel mapping) Off the top
> of my head I don't know of a clean way to unify them. 32bit and 64bit
> kernel mappings differ in many significant ways. Maybe it's possible,
> but it's certainly not something I would want to tackle short term in
> a hurry. [...]

FYI, we've unified and streamlined them in latest x86.git. There's now a
single arch/x86/mm/pagattr.c implementation of the c_p_a() APIs that is
used by both the 32-bit and by the 64-bit x86 kernel.

It's a significantly cleaner and simpler c_p_a() implementation, and
it's also structured in a way to make it easy for gbpages (and clflush)
to be added as well.

One of the big simplifications was to remove largepage reassembly. (We
could perhaps still add that back in the future, if someone shows the
performance benefits and real-life significance of it. But the
refcounting was nasty and error-prone and was buggy even with your
latest CPA patches.)

other features:

- the new implementation is much more scalable, because it is lockless
in the fastpath - while previous c_p_a() implementations used a global
spinlock / or the global init_task.mmap_sem semaphore.

- new 64-bit CONFIG_DEBUG_PAGEALLOC support has been implemented and
has been tested to work fine.

- PAGEALLOC does not require PSE to be cleared from the CPU anymore.
(The pagetables will still be broken up into 4K ptes during bootup,
but that happens as part of the regular c_p_a() sequence. (and thus
we get more testing of the largepage-splitup code)

- the CPA-testsuite now passes without failures on both 32-bit and
64-bit. (it never fully worked with your CPA series.)

> The kernel mappings between 32bit and 64bit are quite different. [...]

10 lines of code (out of 300 lines of pageattr.c) handles that
difference.

the new code is still being tested, but it's looking pretty good so far.
(Note: certain bits within the new cpa series are not backmerged yet.)

Ingo


2008-01-25 08:41:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [x86.git] new CPA implementation


> One of the big simplifications was to remove largepage reassembly. (We
> could perhaps still add that back in the future, if someone shows the
> performance benefits and real-life significance of it.

Let's call it a deoptimization, but ok. I suspect you'll hear about
it again at some point in the future in form of performance regressions.

I'm a little surprised you chose this to simply way though. My feeling was always
that the primary way to simply cpa would have been to get rid of the
separate flushing step (which in hindsight was probably not a useful
optimization and it caused fairly tricky code)

Also Linus used to have pretty strong opinions in the past about using
direct pages -- good luck getting it past him.

> But the
> refcounting was nasty and error-prone and was buggy even with your
> latest CPA patches.)

What was the remaining problem?

>
> other features:
>
> - the new implementation is much more scalable, because it is lockless
> in the fastpath

What fast path? This should not really be called that often, especially
not when DEBUG_PAGEALLOC has its own simple implementation.

Anyways the most important general optimization imho (which you unfortunately
dropped) was to get rid of the WBINVDs which unlike everything else
cpa does are _really_ costly.

> - while previous c_p_a() implementations used a global
> spinlock / or the global init_task.mmap_sem semaphore.

It'll be interesting to see how you avoided all the races.

> - new 64-bit CONFIG_DEBUG_PAGEALLOC support has been implemented and
> has been tested to work fine.

That was on my todo list, but yes it was pretty easy now. The only
missing bit really came from the PAT patchkit to add infrastructure
to 64bit to set up 4K pages at boot.

> - PAGEALLOC does not require PSE to be cleared from the CPU anymore.
> (The pagetables will still be broken up into 4K ptes during bootup,
> but that happens as part of the regular c_p_a() sequence. (and thus
> we get more testing of the largepage-splitup code)

Clearing the bit was always a nasty hack. Good to finally clean it up.

However I hope you don't allocate memory in the kernel_map_pages in
regular operation now to do split on demand. Doing so would be a mistake
imho because there are all kinds of nasty corner cases with potential
recursion etc.

> - the CPA-testsuite now passes without failures on both 32-bit and
> 64-bit. (it never fully worked with your CPA series.)

Without reassembly implemented CPA_TEST will always imply running a lot of
the direct memory as 4K pages so it can't be safely enabled on production
kernels anymore. You should probably at least add a warning or limit
the test to only work on a small portion of the direct mapping now.

Anyways I'll look at redoing GBpages support on top of that new implementation
later. Without reassembly it should be nearly trivial now. Hopefully it can then
be still merged for .25 then.

BTW to play with open cards I found now on my own testing a new GBpages problem that
I'm investigating -- kexec seems to have trouble with it. I'll try to come up
with a fix for that, although I admit I currently don't have any clue why
they even interact.

-Andi

2008-01-25 13:30:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [x86.git] new CPA implementation


* Andi Kleen <[email protected]> wrote:

> > - the new implementation is much more scalable, because it is
> > lockless
> > in the fastpath
>
> What fast path? This should not really be called that often,
> especially not when DEBUG_PAGEALLOC has its own simple implementation.

that's a point you are still missing badly in all these discussions
about unification and sound design practices: code reuse and a clean,
layered design. PAGEALLOC now uses change_page_attr() again and that
approach is working really well.

You made PAGEALLOC use a special-purpose remapping function but that
would make c_p_a() unrobust indirectly - simply because it would be used
much less.

So i've removed that change and have fixed c_p_a() instead to both be
fast and scalable enough for PAGEALLOC and to be robust enough for
PAGEALLOC. Both sides of the equation (PAGEALLOC and c_p_a()) benefit
from that.

> Anyways the most important general optimization imho (which you
> unfortunately dropped) was to get rid of the WBINVDs which unlike
> everything else cpa does are _really_ costly.

it wasnt "dropped" - your wbinvd->clflush feature was never submitted in
a clean fashion. The "great CPA" patchset you submitted to lkml 3 weeks
ago was a badly interwined tangle of features and fixes, which just
added ontop of the existing set of bad practices and design. We asked
you to restructure, we even added your patchset to x86.git to encourage
you to do it (and you wrote the whole code to begin with), but you did
not clean it up and you generally were not showing interest in such
efforts. We finally got sick of your stance and i rewrote the code with
Thomas.

The current CPA patchset in x86.git does it the right way around:
_first_ it got rid of the sources of unrobustness, consolidated and
cleaned up the structure and design of the codebase, and now we can
layer features ontop of it. (once we can trust the new codebase)

> > - the CPA-testsuite now passes without failures on both 32-bit and
> > 64-bit. (it never fully worked with your CPA series.)
>
> Without reassembly implemented CPA_TEST will always imply running a
> lot of the direct memory as 4K pages so it can't be safely enabled on
> production kernels anymore. You should probably at least add a warning
> or limit the test to only work on a small portion of the direct
> mapping now.

good point - i have made it dependent on DEBUG_KERNEL now and extended
the help text.

Ingo

2008-01-25 14:51:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [x86.git] new CPA implementation

Ingo Molnar <[email protected]> writes:

> * Andi Kleen <[email protected]> wrote:
>
>> > - the new implementation is much more scalable, because it is
>> > lockless
>> > in the fastpath
>>
>> What fast path? This should not really be called that often,
>> especially not when DEBUG_PAGEALLOC has its own simple implementation.

> that's a point you are still missing badly in all these discussions
> about unification and sound design practices: code reuse and a clean,
> layered design.

kernel_map_pages does its own thing for flushes. I must admit I always
considered that abuse because it gives a somewhat fragile special case
where c_p_a() is not allowed to set up any state for g_f_t() for some
specific cases. Clean layering looks different to me.

> PAGEALLOC now uses change_page_attr() again and that
> approach is working really well.

I don't really see any attempt to stop the

allocation -> kernel_map_pages -> split -> allocation -> kernel_map_pages ->
other split -> allocation -> ....

recursion. Ok perhaps it is bounded in most cases, still looks
quite risky to me. For gbpages it will be even more likely a problem
because the max recursion depth is far deeper with the one more
level to split. Probably will resort to clear direct_gbpages
with DEBUG_PAGEALLOC, but that also seems unclean.

I also don't like that you always force GFP_ATOMIC for all c_p_a()s
now. That makes c_p_a() much less reliable than it should be.
GFP_ATOMIC without a clear fallback is usually a mistake and most non
debugalloc c_p_a users don't have one. Yes i386 did this always, but
it was always wrong and now x86-64 got that misfeature too.

When I said earlier that not clearing PSE is a good idea I had assumed
you would set some other flag that forces all pages to be 4k from
the start, but that doesn't seem to be the case.

> it wasnt "dropped" - your wbinvd->clflush feature was never submitted in
> a clean fashion. The "great CPA" patchset you submitted to lkml 3 weeks
> ago was a badly interwined tangle of features and fixes,

I reviewed the sequence of changes on git-x86 you did and they seem
hardly be more separated into fixes and cleanups and changes than my
patchkit was. The main difference seems to be that it doesn't add anything,
just removes a lot of features. Since removing tends to be easier
it is a little shorter, but not much. So I think this particular criticism
is unfair since your own attempt at this was not any better.

-Andi

2008-01-26 16:30:46

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [x86.git] new CPA implementation

On Fri, 25 Jan 2008 15:51:18 +0100
Andi Kleen <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > * Andi Kleen <[email protected]> wrote:
> >
> >> > - the new implementation is much more scalable, because it is
> >> > lockless
> >> > in the fastpath
> >>
> >> What fast path? This should not really be called that often,
> >> especially not when DEBUG_PAGEALLOC has its own simple
> >> implementation.
>
> > that's a point you are still missing badly in all these discussions
> > about unification and sound design practices: code reuse and a
> > clean, layered design.
>
> kernel_map_pages does its own thing for flushes. I must admit I always
> considered that abuse because it gives a somewhat fragile special case
> where c_p_a() is not allowed to set up any state for g_f_t() for some
> specific cases. Clean layering looks different to me.
>
> > PAGEALLOC now uses change_page_attr() again and that
> > approach is working really well.
>
> I don't really see any attempt to stop the
>
> allocation -> kernel_map_pages -> split -> allocation ->
> kernel_map_pages -> other split -> allocation -> ....
>

but.. if you start out with everything not split, and only split on free....
well arguably that's the same thing yeah.. but it could be done such that you can use the
page you just freed for the split ;-)