2004-09-15 20:01:40

by Zwane Mwaikambo

[permalink] [raw]
Subject: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

William spotted this stray bit, LOCK_SECTION isn't used anymore on x86_64.
Andrew, I've diffed against -mm because I'd like the irq enable on
contention patch to be merged; I believe making the spinlock functions
out of line was a prerequisite Andi wanted before agreeing to the irq
enable code.

Signed-off-by: Zwane Mwaikambo <[email protected]>

Index: linux-2.6.9-rc1-mm5/include/asm-x86_64/spinlock.h
===================================================================
RCS file: /home/cvsroot/linux-2.6.9-rc1-mm5/include/asm-x86_64/spinlock.h,v
retrieving revision 1.1.1.1
diff -u -p -B -r1.1.1.1 spinlock.h
--- linux-2.6.9-rc1-mm5/include/asm-x86_64/spinlock.h 13 Sep 2004 12:46:47 -0000 1.1.1.1
+++ linux-2.6.9-rc1-mm5/include/asm-x86_64/spinlock.h 15 Sep 2004 08:47:52 -0000
@@ -45,20 +45,18 @@ typedef struct {
#define spin_lock_string \
"\n1:\t" \
"lock ; decb %0\n\t" \
- "js 2f\n" \
- LOCK_SECTION_START("") \
+ "jns 3f\n" \
"2:\t" \
"rep;nop\n\t" \
"cmpb $0,%0\n\t" \
"jle 2b\n\t" \
"jmp 1b\n" \
- LOCK_SECTION_END
+ "3:\n\t"

#define spin_lock_string_flags \
"\n1:\t" \
"lock ; decb %0\n\t" \
- "js 2f\n\t" \
- LOCK_SECTION_START("") \
+ "jns 4f\n\t" \
"2:\t" \
"test $0x200, %1\n\t" \
"jz 3f\n\t" \
@@ -69,7 +67,7 @@ typedef struct {
"jle 3b\n\t" \
"cli\n\t" \
"jmp 1b\n" \
- LOCK_SECTION_END
+ "4:\n\t"

/*
* This works. Despite all the confusion.


2004-09-15 21:45:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

Zwane Mwaikambo <[email protected]> wrote:
>
> William spotted this stray bit, LOCK_SECTION isn't used anymore on x86_64.

btw, Ingo and I were scratching heads over an x86_64 oops in current -linus
trees.

If you enable profiling and frame pointers, profile_pc() goes splat
dereferencing the `regs' argument when it decides that the pc refers to a
lock section. Ingo said `regs' had a value of 0x2, iirc. Consider this a
bug report ;)

2004-09-15 21:57:42

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

On Wed, 15 Sep 2004, Andrew Morton wrote:

> Zwane Mwaikambo <[email protected]> wrote:
> >
> > William spotted this stray bit, LOCK_SECTION isn't used anymore on x86_64.
>
> btw, Ingo and I were scratching heads over an x86_64 oops in current -linus
> trees.
>
> If you enable profiling and frame pointers, profile_pc() goes splat
> dereferencing the `regs' argument when it decides that the pc refers to a
> lock section. Ingo said `regs' had a value of 0x2, iirc. Consider this a
> bug report ;)

Yeah, profile_pc on x86_64 is just broken; I'll have a look.

Thanks,
Zwane

2004-09-15 22:17:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm


* Andrew Morton <[email protected]> wrote:

> Zwane Mwaikambo <[email protected]> wrote:
> >
> > William spotted this stray bit, LOCK_SECTION isn't used anymore on x86_64.
>
> btw, Ingo and I were scratching heads over an x86_64 oops in current -linus
> trees.
>
> If you enable profiling and frame pointers, profile_pc() goes splat
> dereferencing the `regs' argument when it decides that the pc refers to a
> lock section. Ingo said `regs' had a value of 0x2, iirc. Consider this a
> bug report ;)

'regs->rbp' had the wrong value, I think.

Ingo

2004-09-15 22:44:33

by Andrew Chew

[permalink] [raw]
Subject: RE: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

> > If you enable profiling and frame pointers, profile_pc() goes splat
> > dereferencing the `regs' argument when it decides that the
> pc refers
> > to a lock section. Ingo said `regs' had a value of 0x2, iirc.
> > Consider this a bug report ;)

I've been seeing this as well, although I'm pretty sure I have profiling
disabled. Are you sure it isn't the combination of SMP support and
frame pointers that causes this bug to occur? (It looks like
profile_pc() takes on a different implementation only if both CONFIG_SMP
and CONFIG_FRAME_POINTER are defined.)

2004-09-16 06:14:05

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

On Wed, Sep 15, 2004 at 02:45:23PM -0700, Andrew Morton wrote:
> Zwane Mwaikambo <[email protected]> wrote:
> >
> > William spotted this stray bit, LOCK_SECTION isn't used anymore on x86_64.
>
> btw, Ingo and I were scratching heads over an x86_64 oops in current -linus
> trees.
>
> If you enable profiling and frame pointers, profile_pc() goes splat
> dereferencing the `regs' argument when it decides that the pc refers to a
> lock section. Ingo said `regs' had a value of 0x2, iirc. Consider this a
> bug report ;)

Known problem. Interrupts don't save regs->rbp, but the new profile_pc
that was introduced recently uses it.

One quick fix is to just use SAVE_ALL in the interrupt entry code,
but I don't like this because it will affect interrupt latency.

The real fix is to fix profile_pc to not reference it.

-Andi

2004-09-16 06:26:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm


* Andi Kleen <[email protected]> wrote:

> Known problem. Interrupts don't save regs->rbp, but the new profile_pc
> that was introduced recently uses it.
>
> One quick fix is to just use SAVE_ALL in the interrupt entry code, but
> I don't like this because it will affect interrupt latency.
>
> The real fix is to fix profile_pc to not reference it.

it would be nice if we could profile the callers of the spinlock
functions instead of the opaque spinlock functions.

the ebp trick is nice, but forcing a formal stack frame for every
function has global performance implications. Couldn't we define some
sort of current-> field [or current_thread_info() field] that the
spinlock code could set and clear, which field would be listened to by
profile_pc(), so that the time spent spinning would be attributed to the
caller? Something like:

spin_lock():
	current->real_pc = __builtin_return_address(0);
	...
	current->real_pc = 0;

profile_pc():
	...
	if (current->real_pc)
		pc = current->real_pc;
	else
		pc = regs->rip;

this basically means that we set up a 'conditional frame pointer' in
the spinlock functions - but only in the spinlock functions! It is true
that this would be 1-2 cycles more expensive than using the frame
pointer register, but considering the totality of performance I believe
this is vastly superior.

AFAIR __builtin_return_address(0) is nice and sweet on all platforms,
and it works even with -fomit-frame-pointer. (levels 1 and more are not
guaranteed to work, but level 0 always works.) On x86/x64 gcc derives
it from %esp, so the overhead of setting up current->real_pc is small.
On platforms that have 'current' (or current_thread_info()) in a
register, saving and clearing current->real_pc ought to be a matter of
2-3 instructions. (could be 2 instructions total on x64 - the same cost
as the ebp saving of -fno-omit-frame-pointer!)

->real_pc would have a small race window from the point it's cleared
until the function returns, but it's not a big issue because 99% of the
spinlock-related profile overhead is either in the spinning section or
in the first access to the cacheline. If some small overhead is still
measured in spin_lock() itself it's not a big issue, as long as the
brunt of the overhead is attributed to the ->real_pc caller.

spin_unlock() doesn't even need to set up ->real_pc - making this
solution even cheaper.

hm?

Ingo

2004-09-16 06:44:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

On Thu, Sep 16, 2004 at 08:27:59AM +0200, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > Known problem. Interrupts don't save regs->rbp, but the new profile_pc
> > that was introduced recently uses it.
> >
> > One quick fix is to just use SAVE_ALL in the interrupt entry code, but
> > I don't like this because it will affect interrupt latency.
> >
> > The real fix is to fix profile_pc to not reference it.
>
> it would be nice if we could profile the callers of the spinlock
> functions instead of the opaque spinlock functions.
>
> the ebp trick is nice, but forcing a formal stack frame for every
> function has global performance implications. Couldnt we define some

I don't think that is needed anyway. It seems like overkill to me to
make a relatively fast path slower just for the profiler interrupt.

I think the idea was that the spinlock functions should be small
enough that they don't have any stack local variables. In that case,
for the standard non-FP build, you can just use *regs->rsp. The only
problem was with CONFIG_FRAME_POINTER, because then the compiler puts
an additional word onto the stack. I think the right way is to just
correct for this word and still use rsp.

Here's an untested patch to implement this. Does this fix the problem for
you?

-Andi


Index: linux/arch/x86_64/kernel/time.c
===================================================================
--- linux.orig/arch/x86_64/kernel/time.c 2004-09-13 22:19:22.%N +0200
+++ linux/arch/x86_64/kernel/time.c 2004-09-16 08:43:06.%N +0200
@@ -184,9 +184,11 @@
unsigned long profile_pc(struct pt_regs *regs)
{
unsigned long pc = instruction_pointer(regs);
-
+ /* [0] is the frame pointer, [1] is the return address.
+ This assumes that the spinlock function is small enough
+ to not have any stack variables. */
if (in_lock_functions(pc))
- return *(unsigned long *)regs->rbp;
+ return ((unsigned long *)regs->rsp)[1];
return pc;
}
EXPORT_SYMBOL(profile_pc);

2004-09-16 06:49:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm


* Andi Kleen <[email protected]> wrote:

> I think the idea was that the spinlock functions should be small
> enough that they don't have any stack local variables. [...]

this might work for x64, but even there, are you sure it works even with
CONFIG_PREEMPT enabled (there the spinlocks are more complex)? It sure
won't work on x86, so I think we need a generic 'soft PC' solution.

the alternative would be to unwind the stack - quite some task on some
platforms ...

Ingo

2004-09-16 06:53:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

On Thu, Sep 16, 2004 at 08:51:01AM +0200, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > I think the idea was that the spinlock functions should be small
> > enough that they don't have any stack local variables. [...]
>
> this might work for x64, but even there, are you sure it works even with
> CONFIG_PREEMPT enabled (there the spinlocks are more complex)? It sure

I would expect so. x86-64 should have enough callee-clobbered registers
by default to handle it. Unless someone makes them complex enough that
they need more than 4 variables or so, or adds another function call.
But I hope this won't happen.

> wont work on x86 so i think we need a generic 'soft PC' solution.
>
> the alternative would be to unwind the stack - quite some task on some
> platforms ...

Sometimes call graph profiling would be very useful, but I wouldn't
want the profiler to do it by default, especially not for this silly
simple case. dwarf2 unwinding is complex enough that just requiring
frame pointers for the CG case would look attractive.

-Andi

2004-09-16 06:56:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm


* Andi Kleen <[email protected]> wrote:

> > the alternative would be to unwind the stack - quite some task on some
> > platforms ...
>
> Sometimes call graph profiling would be very useful, but I wouldn't
> want the profiler to do it by default, especially not for this silly
> simple case. dwarf2 unwinding is complex enough that just requiring
> frame pointers for the CG case would look attractive.

but ... frame pointers are overhead to all of the kernel. (the overhead
is less pronounced on architectures with more registers, but even with
15 registers, 14 or 15 can still be a difference.) While unwinding is
overhead to profiling only - if enabled. Also, there could be other
functionality (exception handling?) that could benefit from dwarf2
unwinding.

Ingo

2004-09-16 07:10:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

On Thu, Sep 16, 2004 at 08:58:05AM +0200, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > the alternative would be to unwind the stack - quite some task on some
> > > platforms ...
> >
> > Sometimes call graph profiling would be very useful, but I wouldn't
> > want the profiler to do it by default, especially not for this silly
> > simple case. dwarf2 unwinding is complex enough that just requiring
> > frame pointers for the CG case would look attractive.
>
> but ... frame pointers are overhead to all of the kernel. (the overhead

Yes, it may be up to a few percent in extreme cases because it adds a
stall on rsp on K8. On the other hand the code may get slightly
smaller, because an rbp reference is one byte shorter than an rsp
reference.

> is less pronounced on architectures with more registers, but even with
> 15 registers, 14 or 15 can still be a difference.) While unwinding is
> overhead to profiling only - if enabled. Also, there could be other
> functionality (exception handling?) that could benefit from dwarf2
> unwinding.

Your oopses could get better backtraces, but that could be done
with frame pointers too (I'm surprised nobody did it already btw...)

You can try to write a dwarf2 unwinder for the kernel (actually
I think IA64 already has one, but it's quite complex as expected
and doesn't easily map to anything else). Even with that doing
a dwarf2 unwind interpretation will have much more overhead.
For me it doesn't look unreasonable to recompile the kernel for
special profiles though.

-Andi

2004-09-16 07:19:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm


* Andi Kleen <[email protected]> wrote:

> > is less pronounced on architectures with more registers, but even with
> > 15 registers, 14 or 15 can still be a difference.) While unwinding is
> > overhead to profiling only - if enabled. Also, there could be other
> > functionality (exception handling?) that could benefit from dwarf2
> > unwinding.
>
> Your oopses could get better backtraces, but that could be done with
> frame pointers too (I'm surprised nobody did it already btw...)

I did it for x86 a number of times. It gets really messy - you need to
save ebp across interrupt and exception contexts, and the ptrace code
has deep assumptions there. But frame pointers on x86 are really,
really bad - 6 instead of 7 registers, and quite a few more spills.
_Despite this overhead_, there are distros that picked frame pointers
on x86 due to the debuggability plus! So by not having a clean
unwinding and backtracing infrastructure we are forcing distributors
to compile an inferior kernel.

> You can try to write a dwarf2 unwinder for the kernel (actually I
> think IA64 already has one, but it's quite complex as expected and
> doesn't easily map to anything else). Even with that doing a dwarf2
> unwind interpretation will have much more overhead. For me it doesn't
> look unreasonable to recompile the kernel for special profiles though.

the main issue is production level distro kernels - they will pick the
profilable option. So we must work on this issue some more. My ->real_pc
solution solves the profiling problem at a minimal cost (the cost to
spin_lock is close to the cost of ebp saving (on x86), and the cost to
other functions is zero). If a distro has the option between compiling
with frame pointers or compiling with ->real_pc, I'm sure they will
pick the ->real_pc solution. I don't say it's an elegant solution, but
it will work on all architectures - and kernel/spinlock.c is shared
amongst all architectures.

(likewise, they'd pick the dwarf2 unwinder over ->real_pc because that
removes the spin_lock overhead but a dwarf2 unwinder doesn't seem to be
in the works ...)

Even on x64, you really don't want profiling to break just because
someone compiles with spinlock debugging enabled and you happen to run
out of the 7 callee-saved registers ... I think your patch is dangerous
in this respect - it might work if you can detect for sure at build time
whether there's any local variable. Tricks like this really tend to
haunt us later.

Ingo

2004-09-16 07:34:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

On Thu, Sep 16, 2004 at 09:19:31AM +0200, Ingo Molnar wrote:
> i did it for x86 a number of times. It gets really messy - you need to
> save ebp across interrupt and exception contexts and the ptrace code has
> deep assumptions there. But frame pointers on x86 are really really bad
> - 6 instead of 7 registers, quite a number of more spills. _Despite
> this overhead_, there are distros that picked framepointers on x86 due
> to the debuggability plus! So by not having a clean unwinding and
> backtracing infrastructure we are forcing distributors to compile an
> inferior kernel.

I think you're overstating the case here. We certainly have adequate
profiling and debugging infrastructure on x86-64 with frame pointers
off.

In fact on x86-64 it works even better because kgdb actually
works properly with dwarf2 and without FP and the kernel has all the proper
unwind info.

I was talking about call graph profiling, but that's an additional
new feature that could be added in the future (oprofile already
supports it with an extra patch on i386). It may need it. But
not having it wasn't a big issue so far.

I think most users of CG profiling want to have it for user space
anyway, not necessarily for the kernel.

>
> > You can try to write a dwarf2 unwinder for the kernel (actually I
> > think IA64 already has one, but it's quite complex as expected and
> > doesn't easily map to anything else). Even with that doing a dwarf2
> > unwind interpretation will have much more overhead. For me it doesn't
> > look unreasonable to recompile the kernel for special profiles though.
>
> the main issue is production level distro kernels - they will pick the
> profilable option. So we must work on this issue some more. My ->real_pc

Something is mixed up here:

The whole problem only happens on kernels using frame pointers. I
never saw it, simply because I don't use frame pointers.

On a frame-pointer-less kernel profiling works just fine, and with
this fix it should work the same on an FP kernel.

> solution solves the profiling problem at a minimal cost (the cost to

I think my new profile_pc gives it at an even smaller cost though :)


> (likewise, they'd pick the dwarf2 unwinder over ->real_pc because that
> removes the spin_lock overhead but a dwarf2 unwinder doesn't seem to be
> in the works ...)
>
> Even on x64, you really don't want profiling to break just because
> someone compiles with spinlock debugging enabled and you happen to run
> out of the 7 callee-saved registers ... I think your patch is dangerous

If someone adds such bloated debug code they can deal with it
themselves. I don't think we should try to handle such extreme
cases by default.

Ok, adding a comment there to warn them in advance may be fair.

> in this respect - it might work if you can detect for sure at build time
> whether there's any local variable. Tricks like this really tend to
> haunt us later.

There are already lots of such assumptions in the kernel (e.g. WCHAN and
others). I don't think adding one more is a big issue.

-Andi

2004-09-16 07:43:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm


* Andi Kleen <[email protected]> wrote:

> Something is mixed up here:
>
> The whole problem only happens on kernels using frame pointers. I
> never saw it, simply because I don't use frame pointers.
>
> On a frame-pointer-less kernel profiling works just fine, and with
> this fix it should work the same on an FP kernel.

it only works on frame-pointer-less kernels because the spinlock
profile unwinding is _conditional_ on an FP kernel right now:

#if defined(CONFIG_SMP) && defined(CONFIG_FRAME_POINTER)
unsigned long profile_pc(struct pt_regs *regs)
{
...

on non-FP kernels you'll see all the overhead in the single spin_lock()
function, agreed?

> > in this respect - it might work if you can detect for sure at build time
> > whether there's any local variable. Tricks like this really tend to
> > haunt us later.
>
> There are already lots of such assumptions in the kernel (e.g. WCHAN
> and others). I don't think adding one more is a big issue.

wchan has only one assumption: that all __sched section functions have
a valid frame pointer. This is not unrobust at all. (it is
nonperformant on register-starved platforms though, and having proper
unwinding would fix wchan too.) In fact ->real_pc could be used by
__sched functions as well; it would likely be cheaper (on x86) than
having to compile with -fno-omit-frame-pointer.

Ingo

2004-09-16 07:53:42

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

On Thu, Sep 16, 2004 at 09:44:31AM +0200, Ingo Molnar wrote:
> it only works on frame-pointer-less kernels because the spinlock profile
> unwinding is _conditional_ on an FP kernel right now:
>
> #if defined(CONFIG_SMP) && defined(CONFIG_FRAME_POINTER)
> unsigned long profile_pc(struct pt_regs *regs)
> {
> ...
>
> on non-FP kernels you'll see all the overhead in the single spin_lock()
> function, agreed?

Agreed. Ok, it should be easy to fix with the same scheme, except that
it has to use offset 0 instead of 1 for the !FP case. I will cook up a
patch.

BTW the SMP check is also wrong, because preemptible kernels also need
this unwinding, although they should not spend too much time in
locking.

> > > in this respect - it might work if you can detect for sure at build time
> > > whether there's any local variable. Tricks like this really tend to
> > > haunt us later.
> >
> > There are already lots of such assumptions in the kernel (e.g. WCHAN
> > and others). I don't think adding one more is a big issue.
>
> wchan has only one assumption: that all __sched section functions

It's more than WCHAN, e.g. sysrq-t assumes it too.

-Andi

2004-09-16 09:01:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

On Thu, Sep 16, 2004 at 09:44:31AM +0200, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > Something is mixed up here:
> >
> > The whole problem only happens on kernels using frame pointers. I
> > never saw it, simply because I don't use frame pointers.
> >
> > On a frame-pointer-less kernel profiling works just fine, and with
> > this fix it should work the same on an FP kernel.
>
> it only works on frame-pointer-less kernels because the spinlock profile
> unwinding is _conditional_ on an FP kernel right now:

Actually my idea doesn't work, because there is a one-instruction race
where rbp is not yet set up on the stack. I think I'll give up and
just save %rbp in this case.

-Andi

2004-09-16 17:01:36

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm

On Thu, 16 Sep 2004, Ingo Molnar wrote:

> the ebp trick is nice, but forcing a formal stack frame for every
> function has global performance implications. Couldn't we define some
> sort of current-> field [or current_thread_info() field] that the
> spinlock code could set and clear, which field would be listened to by
> profile_pc(), so that the time spent spinning would be attributed to the
> caller? Something like:

I think the generic route is nice, but wouldn't this break with the
following?

taskA:
	spin_lock(lockA); // contended
	<interrupt>
int1:
	spin_lock(lockB)

I was thinking along the lines of a per_cpu real_pc, but realised it
falls prey to the same problem as above ... Unless we have irq threads,
then of course your solution works.

Zwane

2004-09-16 19:29:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] remove LOCK_SECTION from x86_64 spin_lock asm


* Zwane Mwaikambo <[email protected]> wrote:

> On Thu, 16 Sep 2004, Ingo Molnar wrote:
>
> > the ebp trick is nice, but forcing a formal stack frame for every
> > function has global performance implications. Couldn't we define some
> > sort of current-> field [or current_thread_info() field] that the
> > spinlock code could set and clear, which field would be listened to by
> > profile_pc(), so that the time spent spinning would be attributed to the
> > caller? Something like:
>
> I think the generic route is nice but wouldn't this break with the
> following.
>
> taskA:
> spin_lock(lockA); // contended
> <interrupt>
> int1:
> spin_lock(lockB)
>
> I was thinking along the likes of a per_cpu real_pc, but realised it
> falls prey to the same problem as above... Unless we have irq threads,
> then of course your solution works.

you mean the nesting? spin_lock() should save/restore the value instead
of setting/clearing it - and fork() should initialize it to zero.

Ingo