2011-03-24 04:59:15

by Nikanth Karthikesan

[permalink] [raw]
Subject: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
signal can stall other CPUs. And as the number of cores increase this penalty
scales proportionately. So it is best to try and avoid atomic instructions
wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if it
finds the bit set already.

Signed-off-by: Nikanth Karthikesan <[email protected]>

---

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 903683b..26a42ff 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -203,19 +203,6 @@ static inline int test_and_set_bit(int nr, volatile unsigned long *addr)
}

/**
- * test_and_set_bit_lock - Set a bit and return its old value for lock
- * @nr: Bit to set
- * @addr: Address to count from
- *
- * This is the same as test_and_set_bit on x86.
- */
-static __always_inline int
-test_and_set_bit_lock(int nr, volatile unsigned long *addr)
-{
- return test_and_set_bit(nr, addr);
-}
-
-/**
* __test_and_set_bit - Set a bit and return its old value
* @nr: Bit to set
* @addr: Address to count from
@@ -339,6 +326,25 @@ static int test_bit(int nr, const volatile unsigned long *addr);
: variable_test_bit((nr), (addr)))

/**
+ * test_and_set_bit_lock - Set a bit and return its old value for lock
+ * @nr: Bit to set
+ * @addr: Address to count from
+ *
+ * This is the same as test_and_set_bit on x86. But atomic operation is
+ * avoided, if the bit was already set.
+ */
+static __always_inline int
+test_and_set_bit_lock(int nr, volatile unsigned long *addr)
+{
+#ifdef CONFIG_SMP
+ barrier();
+ if (test_bit(nr, addr))
+ return 1;
+#endif
+ return test_and_set_bit(nr, addr);
+}
+
+/**
* __ffs - find first set bit in word
* @word: The word to search
*


2011-03-24 08:51:43

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

>>> On 24.03.11 at 05:56, Nikanth Karthikesan <[email protected]> wrote:
> On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
> signal can stall other CPUs. And as the number of cores increase this
> penalty
> scales proportionately. So it is best to try and avoid atomic instructions
> wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if it
> finds the bit set already.

Why don't you do this in test_and_set_bit() instead?

Jan

> Signed-off-by: Nikanth Karthikesan <[email protected]>
>
> ---
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 903683b..26a42ff 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -203,19 +203,6 @@ static inline int test_and_set_bit(int nr, volatile
> unsigned long *addr)
> }
>
> /**
> - * test_and_set_bit_lock - Set a bit and return its old value for lock
> - * @nr: Bit to set
> - * @addr: Address to count from
> - *
> - * This is the same as test_and_set_bit on x86.
> - */
> -static __always_inline int
> -test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> -{
> - return test_and_set_bit(nr, addr);
> -}
> -
> -/**
> * __test_and_set_bit - Set a bit and return its old value
> * @nr: Bit to set
> * @addr: Address to count from
> @@ -339,6 +326,25 @@ static int test_bit(int nr, const volatile unsigned long
> *addr);
> : variable_test_bit((nr), (addr)))
>
> /**
> + * test_and_set_bit_lock - Set a bit and return its old value for lock
> + * @nr: Bit to set
> + * @addr: Address to count from
> + *
> + * This is the same as test_and_set_bit on x86. But atomic operation is
> + * avoided, if the bit was already set.
> + */
> +static __always_inline int
> +test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> +{
> +#ifdef CONFIG_SMP
> + barrier();
> + if (test_bit(nr, addr))
> + return 1;
> +#endif
> + return test_and_set_bit(nr, addr);
> +}
> +
> +/**
> * __ffs - find first set bit in word
> * @word: The word to search
> *


2011-03-24 08:57:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Nikanth Karthikesan <[email protected]> wrote:

> On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
> signal can stall other CPUs. And as the number of cores increase this penalty
> scales proportionately. So it is best to try and avoid atomic instructions
> wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if it
> finds the bit set already.
>
> Signed-off-by: Nikanth Karthikesan <[email protected]>
>
> ---
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 903683b..26a42ff 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -203,19 +203,6 @@ static inline int test_and_set_bit(int nr, volatile unsigned long *addr)
> }
>
> /**
> - * test_and_set_bit_lock - Set a bit and return its old value for lock
> - * @nr: Bit to set
> - * @addr: Address to count from
> - *
> - * This is the same as test_and_set_bit on x86.
> - */
> -static __always_inline int
> -test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> -{
> - return test_and_set_bit(nr, addr);
> -}
> -
> -/**
> * __test_and_set_bit - Set a bit and return its old value
> * @nr: Bit to set
> * @addr: Address to count from
> @@ -339,6 +326,25 @@ static int test_bit(int nr, const volatile unsigned long *addr);
> : variable_test_bit((nr), (addr)))
>
> /**
> + * test_and_set_bit_lock - Set a bit and return its old value for lock
> + * @nr: Bit to set
> + * @addr: Address to count from
> + *
> + * This is the same as test_and_set_bit on x86. But atomic operation is
> + * avoided, if the bit was already set.
> + */
> +static __always_inline int
> +test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> +{
> +#ifdef CONFIG_SMP
> + barrier();
> + if (test_bit(nr, addr))
> + return 1;
> +#endif
> + return test_and_set_bit(nr, addr);
> +}

On modern x86 CPUs there's no "#LOCK signal" anymore - it's replaced by a
M[O]ESI cache coherency bus. I'd expect modern x86 CPUs to be pretty fast when
the cacheline is local and the bit is set already.

So you really need to back up your patch with actual hard numbers. Putting this
code into user-space and using pthreads to loop on the same global variable and
testing the before/after effect would be sufficient i think. You can use 'perf
stat --repeat 10' kind of measurements to see whether there's any improvement
larger than the noise of the measurement.

Thanks,

Ingo

2011-03-24 14:52:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Thu, Mar 24, 2011 at 04:56:47AM -0400, Ingo Molnar wrote:
>
> * Nikanth Karthikesan <[email protected]> wrote:
>
> > On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
> > signal can stall other CPUs. And as the number of cores increase this penalty
> > scales proportionately. So it is best to try and avoid atomic instructions
> > wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if it
> > finds the bit set already.
> >
> > Signed-off-by: Nikanth Karthikesan <[email protected]>

[..]

> > + * test_and_set_bit_lock - Set a bit and return its old value for lock
> > + * @nr: Bit to set
> > + * @addr: Address to count from
> > + *
> > + * This is the same as test_and_set_bit on x86. But atomic operation is
> > + * avoided, if the bit was already set.
> > + */
> > +static __always_inline int
> > +test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> > +{
> > +#ifdef CONFIG_SMP
> > + barrier();
> > + if (test_bit(nr, addr))
> > + return 1;
> > +#endif
> > + return test_and_set_bit(nr, addr);
> > +}
>
> On modern x86 CPUs there's no "#LOCK signal" anymore - it's replaced
> by a M[O]ESI cache coherency bus. I'd expect modern x86 CPUs to be
> pretty fast when the cacheline is local and the bit is set already.

Correct. However, LOCK still could have some overhead associated with it
and avoiding it by using only an non-atomic op (test_bit()) could bring
some miniscule speedup...

> So you really need to back up your patch with actual hard numbers.
> Putting this code into user-space and using pthreads to loop on
> the same global variable and testing the before/after effect would
> be sufficient i think. You can use 'perf stat --repeat 10' kind of
> measurements to see whether there's any improvement larger than the
> noise of the measurement.

and Ingo's question is right on the money - is this speedup noticeable
or does it simply disappear in the noise?

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2011-03-24 16:47:21

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

>>> On 24.03.11 at 15:52, Borislav Petkov <[email protected]> wrote:

(haven't seen Ingo's original reply, so responding here)

> On Thu, Mar 24, 2011 at 04:56:47AM -0400, Ingo Molnar wrote:
>>
>> * Nikanth Karthikesan <[email protected]> wrote:
>>
>> > On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
>> > signal can stall other CPUs. And as the number of cores increase this
> penalty
>> > scales proportionately. So it is best to try and avoid atomic instructions
>> > wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if
> it
>> > finds the bit set already.
>> >
>> > Signed-off-by: Nikanth Karthikesan <[email protected]>
>
> [..]
>
>> > + * test_and_set_bit_lock - Set a bit and return its old value for lock
>> > + * @nr: Bit to set
>> > + * @addr: Address to count from
>> > + *
>> > + * This is the same as test_and_set_bit on x86. But atomic operation is
>> > + * avoided, if the bit was already set.
>> > + */
>> > +static __always_inline int
>> > +test_and_set_bit_lock(int nr, volatile unsigned long *addr)
>> > +{
>> > +#ifdef CONFIG_SMP
>> > + barrier();
>> > + if (test_bit(nr, addr))
>> > + return 1;
>> > +#endif
>> > + return test_and_set_bit(nr, addr);
>> > +}
>>
>> On modern x86 CPUs there's no "#LOCK signal" anymore - it's replaced
>> by a M[O]ESI cache coherency bus. I'd expect modern x86 CPUs to be
>> pretty fast when the cacheline is local and the bit is set already.

Are you certain? Iirc the lock prefix implies minimally a read-for-
ownership (if CPUs are really smart enough to optimize away the
write - I wonder whether that would be correct at all when it
comes to locked operations), which means a cacheline can still be
bouncing heavily.

>> So you really need to back up your patch with actual hard numbers.
>> Putting this code into user-space and using pthreads to loop on
>> the same global variable and testing the before/after effect would
>> be sufficient i think. You can use 'perf stat --repeat 10' kind of
>> measurements to see whether there's any improvement larger than the
>> noise of the measurement.
>
> and Ingo's question is right on the money - is this speedup noticeable
> or does it simply disappear in the noise?

This cacheline bouncing was actually observed and measured
on SGI UV systems, but I'm not certain we're permitted to publish
that data. I'm copying the two SGI guys who had reported that
issue (and the special case fix, which Nikanth simply generalized)
to us, for them to decide.

Jan

2011-03-24 17:10:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Wed, Mar 23, 2011 at 9:56 PM, Nikanth Karthikesan <[email protected]> wrote:
> On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
> signal can stall other CPUs. And as the number of cores increase this penalty
> scales proportionately. So it is best to try and avoid atomic instructions
> wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if it
> finds the bit set already.

This is potentially _very_ wrong. It means that test_and_set_bit() is
no longer a serializing instruction in the failure case, and I wonder
what effect that will have on the thousands of users.

It also means that test_and_set_bit() on an uncached entry now starts
out with a read-for-ownership cache operation, which can be quite a
bit slower than the exclusive ownership thing for the hopefully common
case where it succeeds.

So no, I really think this is seriously wrong. It basically makes it
impossible for the user of the bitop function to do a good job if it
wants to.

WHICH test_and_set_bit() are you having performance issues with?
Because I think the right approach is to do this optimization on a
case-by-case basis in the code that actually does the operation, not
in the low-level routine.

Linus

2011-03-24 17:13:13

by tip-bot for Jack Steiner

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Thu, Mar 24, 2011 at 10:26:01AM +0530, Nikanth Karthikesan wrote:
> On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
> signal can stall other CPUs. And as the number of cores increase this penalty
> scales proportionately. So it is best to try and avoid atomic instructions
> wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if it
> finds the bit set already.
>
> Signed-off-by: Nikanth Karthikesan <[email protected]>
>



Don't we also have an issue related to the coherency protocols.

If the cacheline is referenced by a test-and-set instruction and
the line does not currently reside in the local caches, it is fetched
for exclusive access using a single off-socket request.

If the code first reads the CL, then does a test-and-set, the line
may be first read in shared mode. Then a second off-socket request must be issued to
obtain exclusive access. This can be a serious issue on large systems.
If the bit is typically already set, the new code is a big win but is this
the case? I suspect not.


Do we need a new variant of test-and-set? The new variant would
first test the bit, then set it if not already set. This would be used only
in places where the bit is likely already set. The downside is that it is
frequently difficult to predict whether the bit is already set.


> ---
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 903683b..26a42ff 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -203,19 +203,6 @@ static inline int test_and_set_bit(int nr, volatile unsigned long *addr)
> }
>
> /**
> - * test_and_set_bit_lock - Set a bit and return its old value for lock
> - * @nr: Bit to set
> - * @addr: Address to count from
> - *
> - * This is the same as test_and_set_bit on x86.
> - */
> -static __always_inline int
> -test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> -{
> - return test_and_set_bit(nr, addr);
> -}
> -
> -/**
> * __test_and_set_bit - Set a bit and return its old value
> * @nr: Bit to set
> * @addr: Address to count from
> @@ -339,6 +326,25 @@ static int test_bit(int nr, const volatile unsigned long *addr);
> : variable_test_bit((nr), (addr)))
>
> /**
> + * test_and_set_bit_lock - Set a bit and return its old value for lock
> + * @nr: Bit to set
> + * @addr: Address to count from
> + *
> + * This is the same as test_and_set_bit on x86. But atomic operation is
> + * avoided, if the bit was already set.
> + */
> +static __always_inline int
> +test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> +{
> +#ifdef CONFIG_SMP
> + barrier();
> + if (test_bit(nr, addr))
> + return 1;
> +#endif
> + return test_and_set_bit(nr, addr);
> +}
> +
> +/**
> * __ffs - find first set bit in word
> * @word: The word to search
> *

2011-03-24 17:19:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Jan Beulich <[email protected]> wrote:

> >>> On 24.03.11 at 15:52, Borislav Petkov <[email protected]> wrote:
>
> (haven't seen Ingo's original reply, so responding here)
>
> > On Thu, Mar 24, 2011 at 04:56:47AM -0400, Ingo Molnar wrote:
> >>
> >> * Nikanth Karthikesan <[email protected]> wrote:
> >>
> >> > On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
> >> > signal can stall other CPUs. And as the number of cores increase this
> > penalty
> >> > scales proportionately. So it is best to try and avoid atomic instructions
> >> > wherever possible. test_and_set_bit_lock() can avoid using LOCK_PREFIX if
> > it
> >> > finds the bit set already.
> >> >
> >> > Signed-off-by: Nikanth Karthikesan <[email protected]>
> >
> > [..]
> >
> >> > + * test_and_set_bit_lock - Set a bit and return its old value for lock
> >> > + * @nr: Bit to set
> >> > + * @addr: Address to count from
> >> > + *
> >> > + * This is the same as test_and_set_bit on x86. But atomic operation is
> >> > + * avoided, if the bit was already set.
> >> > + */
> >> > +static __always_inline int
> >> > +test_and_set_bit_lock(int nr, volatile unsigned long *addr)
> >> > +{
> >> > +#ifdef CONFIG_SMP
> >> > + barrier();
> >> > + if (test_bit(nr, addr))
> >> > + return 1;
> >> > +#endif
> >> > + return test_and_set_bit(nr, addr);
> >> > +}
> >>
> >> On modern x86 CPUs there's no "#LOCK signal" anymore - it's replaced
> >> by a M[O]ESI cache coherency bus. I'd expect modern x86 CPUs to be
> >> pretty fast when the cacheline is local and the bit is set already.
>
> Are you certain? Iirc the lock prefix implies minimally a read-for-
> ownership (if CPUs are really smart enough to optimize away the
> write - I wonder whether that would be correct at all when it
> comes to locked operations), which means a cacheline can still be
> bouncing heavily.

Yeah. On what workload was this?

Generally you use test_and_set_bit() if you expect it to be 'owned' by whoever
calls it, and released by someone else.

It would be really useful to run perf top on an affected box and see which
kernel function causes this. It might be better to add a test_bit() to the
affected codepath - instead of bloating all test_and_set_bit() users.

Note that the patch can also cause overhead: the test_bit() can miss the cache,
it will bring in the cacheline shared, and the subsequent test_and_set() call
will then dirty the cacheline - so the CPU might miss again and has to wait for
other CPUs to first flush this cacheline.

So we really need more details here.

Thanks,

Ingo

2011-03-24 17:31:26

by tip-bot for Jack Steiner

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

>
> This cacheline bouncing was actually observed and measured
> on SGI UV systems, but I'm not certain we're permitted to publish
> that data. I'm copying the two SGI guys who had reported that
> issue (and the special case fix, which Nikanth simply generalized)
> to us, for them to decide.

We frequently run into the cacheline bouncing issues. I don't have
the data handy that you refer to, but feel free to publish it.

--- jack

2011-03-24 18:39:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

Nikanth Karthikesan <[email protected]> writes:

> On x86_64 SMP with lots of CPU atomic instructions which assert the LOCK #
> signal can stall other CPUs. And as the number of cores increase this penalty

This description is very wrong. No modern CPU still has a LOCK # signal
or does global stalls for LOCK.

Do you actually have any data this is a problem and how much
difference the patch makes?

Also there's the missing barrier now of course.

-Andi
--
[email protected] -- Speaking for myself only

2011-03-24 20:00:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Jack Steiner <[email protected]> wrote:

> >
> > This cacheline bouncing was actually observed and measured
> > on SGI UV systems, but I'm not certain we're permitted to publish
> > that data. I'm copying the two SGI guys who had reported that
> > issue (and the special case fix, which Nikanth simply generalized)
> > to us, for them to decide.
>
> We frequently run into the cacheline bouncing issues. I don't have
> the data handy that you refer to, but feel free to publish it.

One good way to see cache bounces is to run a misses/accesses ratio profile:

perf top -e cache-misses -e cache-references --count-filter 10

Note the two events: this runs a 'weighted' profile, you'll see (LLC)
cache-misses of a function relative to cache-references it does, a
misses/references ratio in essence.

The --count-filter filters out rare entries. (so that rare functions
accidentally producing a large ratio do not clutter the output)

For example during a scheduler-intense workload you'll get something like:

PerfTop: 32652 irqs/sec kernel:71.2% exact: 0.0% [cache-misses/cache-references], (all, 16 CPUs)
-------------------------------------------------------------------------------------------------------

weight samples pcnt function DSO
______ _______ _____ ____________________________ ____________________

1.9 606 3.2% irqtime_account_process_tick [kernel.kallsyms]
1.6 854 4.4% update_vsyscall [kernel.kallsyms]
1.5 446 2.3% atomic_cmpxchg [kernel.kallsyms]
1.5 758 3.9% tick_do_update_jiffies64 [kernel.kallsyms]
1.4 149 0.8% arch_local_irq_save [kernel.kallsyms]
1.3 1524 7.9% do_timer [kernel.kallsyms]
1.2 215 1.1% clear_page_c [kernel.kallsyms]
1.2 128 0.7% dso__find_symbol /home/mingo/bin/perf
1.0 281 1.5% calc_global_load [kernel.kallsyms]
0.9 560 2.9% profile_tick [kernel.kallsyms]
0.7 246 1.3% _raw_spin_lock [kernel.kallsyms]
0.6 2523 13.1% current_kernel_time [kernel.kallsyms]

This output is very different from a plain cycles (or even cache-misses)
measured profile and is very good at identifying 'bouncy' cache-miss sources.

Another good 'view' is store-references against store-misses:

PerfTop: 29530 irqs/sec kernel:99.5% exact: 0.0% [L1-dcache-store-misses/L1-dcache-stores], (all, 16 CPUs)
-------------------------------------------------------------------------------------------------------

weight samples pcnt function DSO
______ _______ _____ ________________________ __________________________________

1271.3 3814 3.2% apic_timer_interrupt [kernel.kallsyms]
844.0 844 0.7% read_tsc [kernel.kallsyms]
615.0 615 0.5% timekeeping_get_ns [kernel.kallsyms]
520.0 520 0.4% intel_pmu_disable_all [kernel.kallsyms]
390.0 390 0.3% tick_dev_program_event [kernel.kallsyms]
308.3 1850 1.5% update_vsyscall [kernel.kallsyms]
251.7 755 0.6% hrtimer_interrupt [kernel.kallsyms]
246.0 246 0.2% find_busiest_group [kernel.kallsyms]
222.7 668 0.6% native_apic_mem_write [kernel.kallsyms]
149.0 298 0.2% apic_write [kernel.kallsyms]
137.0 274 0.2% irq_enter [kernel.kallsyms]
105.0 105 0.1% arch_local_irq_save [kernel.kallsyms]
101.0 101 0.1% tick_program_event [kernel.kallsyms]
95.5 191 0.2% ack_APIC_irq [kernel.kallsyms]

You might want to experiment around with the events to see which one expresses
things best for you on the system in question.

Thanks,

Ingo

2011-03-24 20:42:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

Ingo Molnar <[email protected]> writes:
>
> One good way to see cache bounces is to run a misses/accesses ratio profile:
>
> perf top -e cache-misses -e cache-references --count-filter 10
>
> Note the two events: this runs a 'weighted' profile, you'll see (LLC)
> cache-misses of a function relative to cache-references it does, a
> misses/references ratio in essence.

If anyone does this on a Nehalem please only use 2.6.39-rc*+ which
includes the offcore patches. Anything before that will get you complete
bogus numbers.

BTW with Lin-Ming's memory latency profiling code you can do this much
more directly.

-Andi
--
[email protected] -- Speaking for myself only

2011-03-24 20:51:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Andi Kleen <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
> >
> > One good way to see cache bounces is to run a misses/accesses ratio profile:
> >
> > perf top -e cache-misses -e cache-references --count-filter 10
> >
> > Note the two events: this runs a 'weighted' profile, you'll see (LLC)
> > cache-misses of a function relative to cache-references it does, a
> > misses/references ratio in essence.
>
> If anyone does this on a Nehalem please only use 2.6.39-rc*+ which
> includes the offcore patches. Anything before that will get you complete
> bogus numbers.

In that case the second example i quoted can be used:

perf top -e L1-dcache-store-misses -e L1-dcache-stores --count-filter 10

Or:

perf top -e L1-dcache-load-misses -e L1-dcache-loads --count-filter 10

Thanks,

Ingo

2011-03-24 20:51:41

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

Le jeudi 24 mars 2011 à 21:00 +0100, Ingo Molnar a écrit :

> One good way to see cache bounces is to run a misses/accesses ratio profile:
>
> perf top -e cache-misses -e cache-references --count-filter 10
>

Oh well , something must be broken...

"perf top" is working here, but if I use any "-e ...." argument, it
fails :

# perf top -e cache-misses

Error: sys_perf_event_open() syscall returned with 2 (No such file or directory).
/bin/dmesg may provide additional information.

Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?

# grep PERF_EVENTS .config
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
CONFIG_HAVE_PERF_EVENTS_NMI=y

Nothing in dmesg...

uname -a
Linux ed001 2.6.38-08165-g2712750-dirty #517 SMP Thu Mar 24 21:31:00 CET 2011 i686 i686 i386 GNU/Linux


2011-03-24 20:54:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Eric Dumazet <[email protected]> wrote:

> Le jeudi 24 mars 2011 ? 21:00 +0100, Ingo Molnar a ?crit :
>
> > One good way to see cache bounces is to run a misses/accesses ratio profile:
> >
> > perf top -e cache-misses -e cache-references --count-filter 10
> >
>
> Oh well , something must be broken...
>
> "perf top" is working here, but if I use any "-e ...." argument, it
> fails :
>
> # perf top -e cache-misses
>
> Error: sys_perf_event_open() syscall returned with 2 (No such file or directory).
> /bin/dmesg may provide additional information.
>
> Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
>
> # grep PERF_EVENTS .config
> CONFIG_HAVE_PERF_EVENTS=y
> CONFIG_PERF_EVENTS=y
> CONFIG_HAVE_PERF_EVENTS_NMI=y
>
> Nothing in dmesg...
>
> uname -a
> Linux ed001 2.6.38-08165-g2712750-dirty #517 SMP Thu Mar 24 21:31:00 CET 2011 i686 i686 i386 GNU/Linux

Does it work better if you run it as root?

To easiest way to debug as user is to do:

echo -1 > /proc/sys/kernel/perf_event_paranoid

and to put this into /etc/sysctl.conf:

kernel.perf_event_paranoid = -1

Thanks,

Ingo

2011-03-24 21:02:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

Le jeudi 24 mars 2011 à 21:54 +0100, Ingo Molnar a écrit :

> Does it work better if you run it as root?
>

I confess I always am root on my dev machines :(

> To easiest way to debug as user is to do:
>
> echo -1 > /proc/sys/kernel/perf_event_paranoid
>
> and to put this into /etc/sysctl.conf:
>
> kernel.perf_event_paranoid = -1
>


Hmm... on one machine (HP BL460c G1 ) I do have :

[ 0.015998] Performance Events: PEBS fmt0+, Core2 events, Broken BIOS detected, using software events only.
[ 0.015998] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 186 is 43003c)


and the other one (HB BL460c G6)

[ 0.122760] Performance Events: PEBS fmt1+, Nehalem events, Broken BIOS detected, using software events only.
[ 0.123047] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)


Hmm.. I have latest HP BIOS.

Version: I15
Release Date: 10/25/2010

and

Version: I24
Release Date: 09/02/2010

I am pretty sure I used "-e ..." some time ago with same BIOS.


2011-03-24 21:37:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

> In that case the second example i quoted can be used:
>
> perf top -e L1-dcache-store-misses -e L1-dcache-stores --count-filter 10

That will include L2 and L3 stores -- you won't get any indication
on misses to other sockets.

-Andi
--
[email protected] -- Speaking for myself only.

2011-03-24 21:43:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

Eric Dumazet <[email protected]> writes:
>
> [ 0.015998] Performance Events: PEBS fmt0+, Core2 events, Broken BIOS detected, using software events only.
> [ 0.015998] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 186 is 43003c)

There should be a BIOS setting to disable that (something with p-states
likely).

> I am pretty sure I used "-e ..." some time ago with same BIOS.

The "if anyone else uses the PMU throw in your toys and sulk" check was
only recently added.

Or fix perf just ignore counters that are already used. The others
should work fine.

-Andi

--
[email protected] -- Speaking for myself only

2011-03-24 23:27:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Thu, Mar 24, 2011 at 2:42 PM, Andi Kleen <[email protected]> wrote:
>
> The "if anyone else uses the PMU throw in your toys and sulk" check was
> only recently added.

,, and I'd like to point out that we should just say "screw the
f*cking BIOS, it's doing things wrong". And then just take over the
PMU events, and make sure that they aren't routed to SCI. Instead of
the current "ok, roll over and die when the BIOS does something
idiotic".

People continuously claim that the BIOS really needs it, and I have
never EVER seen any good explanation of why that particular sh*t
argument would b true. It seems to be purely about politics, where
some idiotic vendor (namely HP) has convinced Intel that they really
need it. To the point where some engineers seem to have bought into
the whole thing and actually believe that fairy tale ("firmware can do
better" - hah! They must be feeding people some bad drugs at the
cafeteria)

Linus

2011-03-24 23:56:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

> never EVER seen any good explanation of why that particular sh*t
> argument would b true. It seems to be purely about politics, where
> some idiotic vendor (namely HP) has convinced Intel that they really
> need it. To the point where some engineers seem to have bought into
> the whole thing and actually believe that fairy tale ("firmware can do
> better" - hah! They must be feeding people some bad drugs at the
> cafeteria)

For the record I don't think it's a good idea for the BIOS to do
this (and I'm not aware of any engineer who does),
but I think Linux should do better than just disabling PMU use when
this happens.

However I suspect taking over SCI would cause endless problems
and is very likely not a good idea.

-Andi

--
[email protected] -- Speaking for myself only.

2011-03-25 05:47:29

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

Le vendredi 25 mars 2011 à 00:56 +0100, Andi Kleen a écrit :
> > never EVER seen any good explanation of why that particular sh*t
> > argument would b true. It seems to be purely about politics, where
> > some idiotic vendor (namely HP) has convinced Intel that they really
> > need it. To the point where some engineers seem to have bought into
> > the whole thing and actually believe that fairy tale ("firmware can do
> > better" - hah! They must be feeding people some bad drugs at the
> > cafeteria)
>
> For the record I don't think it's a good idea for the BIOS to do
> this (and I'm not aware of any engineer who does),
> but I think Linux should do better than just disabling PMU use when
> this happens.
>
> However I suspect taking over SCI would cause endless problems
> and is very likely not a good idea.

I tried many different changes in BIOS and all failed (the machine is
damn slow at boot, this takes age).

I am stuck :(

Thanks

2011-03-25 09:23:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Andi Kleen <[email protected]> wrote:

> > never EVER seen any good explanation of why that particular sh*t
> > argument would b true. It seems to be purely about politics, where
> > some idiotic vendor (namely HP) has convinced Intel that they really
> > need it. To the point where some engineers seem to have bought into
> > the whole thing and actually believe that fairy tale ("firmware can do
> > better" - hah! They must be feeding people some bad drugs at the
> > cafeteria)
>
> For the record I don't think it's a good idea for the BIOS to do
> this (and I'm not aware of any engineer who does),

There's really just two sane options:

- complain about the BIOS corrupting CPU state and refusing to use the PMU
- complain about the BIOS corrupting CPU state and using the PMU against the BIOS

We went for the first one but i'll be more than glad to implement Linus's much
more aggressive second option.

Btw., for the record, the thing you have been advocating in the past was a
third option: for the kernel to step aside quietly and to let the BIOS corrupt
a counter or two. You even sent us some sort of BIOS specification about how to
implement that. That's pretty much the worst solution imaginable.

Thanks,

Ingo

2011-03-25 09:32:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Eric Dumazet <[email protected]> wrote:

> Le vendredi 25 mars 2011 ? 00:56 +0100, Andi Kleen a ?crit :
> > > never EVER seen any good explanation of why that particular sh*t
> > > argument would b true. It seems to be purely about politics, where
> > > some idiotic vendor (namely HP) has convinced Intel that they really
> > > need it. To the point where some engineers seem to have bought into
> > > the whole thing and actually believe that fairy tale ("firmware can do
> > > better" - hah! They must be feeding people some bad drugs at the
> > > cafeteria)
> >
> > For the record I don't think it's a good idea for the BIOS to do
> > this (and I'm not aware of any engineer who does),
> > but I think Linux should do better than just disabling PMU use when
> > this happens.
> >
> > However I suspect taking over SCI would cause endless problems
> > and is very likely not a good idea.
>
> I tried many different changes in BIOS and all failed (the machine is
> damn slow at boot, this takes age).
>
> I am stuck :(

Could you please try the patch below?

Thanks,

Ingo

------------------->
>From 14df27334ac47a5cec67fb2238d14499346acc38 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 25 Mar 2011 10:24:23 +0100
Subject: [PATCH] perf, x86: Complain louder about BIOSen corrupting CPU/PMU state and continue

Eric Dumazet reported that hardware PMU events do not work on his
system, due to the BIOS corrupting PMU state:

Performance Events: PEBS fmt0+, Core2 events, Broken BIOS detected, using software events only.
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 186 is 43003c)

Linus suggested that we continue in the face of such BIOS-induced CPU
state corruption:

http://lkml.org/lkml/2011/3/24/608

Such BIOSes will have to be fixed - developers rely on a working and fully
capable PMU and BIOS interfering with CPU state is simply not acceptable.

So this patch changes perf to continue when it detects such BIOS
interaction, some hardware events may be unreliable due to the BIOS writing
and re-writing them - there's not much the kernel can do about that.

Reported-by: Eric Dumazet <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Steven Rostedt <[email protected]>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index ec46eea..eb00677 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -500,12 +500,17 @@ static bool check_hw_exists(void)
return true;

bios_fail:
- printk(KERN_CONT "Broken BIOS detected, using software events only.\n");
+ /*
+ * We still allow the PMU driver to operate:
+ */
+ printk(KERN_CONT "Broken BIOS detected, complain to your hardware vendor.\n");
printk(KERN_ERR FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x is %Lx)\n", reg, val);
- return false;
+
+ return true;

msr_fail:
printk(KERN_CONT "Broken PMU hardware detected, using software events only.\n");
+
return false;
}

2011-03-25 09:35:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Linus Torvalds <[email protected]> wrote:

> ,, and I'd like to point out that we should just say "screw the
> f*cking BIOS, it's doing things wrong". And then just take over the
> PMU events, and make sure that they aren't routed to SCI. Instead of
> the current "ok, roll over and die when the BIOS does something
> idiotic".
>
> People continuously claim that the BIOS really needs it, and I have
> never EVER seen any good explanation of why that particular sh*t
> argument would b true. It seems to be purely about politics, where
> some idiotic vendor (namely HP) has convinced Intel that they really
> need it. To the point where some engineers seem to have bought into
> the whole thing and actually believe that fairy tale ("firmware can do
> better" - hah! They must be feeding people some bad drugs at the
> cafeteria)

Ok, fully agreed, and i've changed the code to "detect the BIOS breakage,
warn about it but otherwise ignore the BIOS".

Thanks,

Ingo

2011-03-25 09:38:10

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

Le vendredi 25 mars 2011 à 06:47 +0100, Eric Dumazet a écrit :

> I tried many different changes in BIOS and all failed (the machine is
> damn slow at boot, this takes age).

I even tried on a more recent hardware (ProLiant BL460c G7)

BIOS : Version: I27 Release Date: 09/30/2010

[ 0.021849] CPU0: Intel(R) Xeon(R) CPU E5640 @ 2.67GHz stepping 02
[ 0.123904] Performance Events: PEBS fmt1+, Westmere events, Broken BIOS detected, using software events only.
[ 0.124125] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)

I guess I'll have to patch my kernel, I doubt HP will make any change in this area.

2011-03-25 09:45:06

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

Le vendredi 25 mars 2011 à 10:32 +0100, Ingo Molnar a écrit :
> * Eric Dumazet <[email protected]> wrote:
>
> > Le vendredi 25 mars 2011 à 00:56 +0100, Andi Kleen a écrit :
> > > > never EVER seen any good explanation of why that particular sh*t
> > > > argument would b true. It seems to be purely about politics, where
> > > > some idiotic vendor (namely HP) has convinced Intel that they really
> > > > need it. To the point where some engineers seem to have bought into
> > > > the whole thing and actually believe that fairy tale ("firmware can do
> > > > better" - hah! They must be feeding people some bad drugs at the
> > > > cafeteria)
> > >
> > > For the record I don't think it's a good idea for the BIOS to do
> > > this (and I'm not aware of any engineer who does),
> > > but I think Linux should do better than just disabling PMU use when
> > > this happens.
> > >
> > > However I suspect taking over SCI would cause endless problems
> > > and is very likely not a good idea.
> >
> > I tried many different changes in BIOS and all failed (the machine is
> > damn slow at boot, this takes age).
> >
> > I am stuck :(
>
> Could you please try the patch below?

This obviously works, but you probably need to make a full pass to make
sure we dont have a MSR failure -this should return false in this case.




2011-03-25 09:59:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Eric Dumazet <[email protected]> wrote:

> Le vendredi 25 mars 2011 ? 10:32 +0100, Ingo Molnar a ?crit :
> > * Eric Dumazet <[email protected]> wrote:
> >
> > > Le vendredi 25 mars 2011 ? 00:56 +0100, Andi Kleen a ?crit :
> > > > > never EVER seen any good explanation of why that particular sh*t
> > > > > argument would b true. It seems to be purely about politics, where
> > > > > some idiotic vendor (namely HP) has convinced Intel that they really
> > > > > need it. To the point where some engineers seem to have bought into
> > > > > the whole thing and actually believe that fairy tale ("firmware can do
> > > > > better" - hah! They must be feeding people some bad drugs at the
> > > > > cafeteria)
> > > >
> > > > For the record I don't think it's a good idea for the BIOS to do
> > > > this (and I'm not aware of any engineer who does),
> > > > but I think Linux should do better than just disabling PMU use when
> > > > this happens.
> > > >
> > > > However I suspect taking over SCI would cause endless problems
> > > > and is very likely not a good idea.
> > >
> > > I tried many different changes in BIOS and all failed (the machine is
> > > damn slow at boot, this takes age).
> > >
> > > I am stuck :(
> >
> > Could you please try the patch below?
>
> This obviously works, [...]

Congrats to your now much-improved perf experience! :-)

> [...] but you probably need to make a full pass to make sure we dont have a
> MSR failure -this should return false in this case.

Wanted to keep this patch simple - we are not really hitting MSR failure cases
in practice, and by getting the message the user is at least warned that
*something* is amiss.

Thanks,

Ingo

2011-03-25 10:05:21

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

>>> On 24.03.11 at 18:19, Ingo Molnar <[email protected]> wrote:
> * Jan Beulich <[email protected]> wrote:
>> Are you certain? Iirc the lock prefix implies minimally a read-for-
>> ownership (if CPUs are really smart enough to optimize away the
>> write - I wonder whether that would be correct at all when it
>> comes to locked operations), which means a cacheline can still be
>> bouncing heavily.
>
> Yeah. On what workload was this?
>
> Generally you use test_and_set_bit() if you expect it to be 'owned' by
> whoever calls it, and released by someone else.
>
> It would be really useful to run perf top on an affected box and see which
> kernel function causes this. It might be better to add a test_bit() to the
> affected codepath - instead of bloating all test_and_set_bit() users.

Indeed, I agree with you and Linus in this aspect.

> Note that the patch can also cause overhead: the test_bit() can miss the
> cache, it will bring in the cacheline shared, and the subsequent test_and_set()
> call will then dirty the cacheline - so the CPU might miss again and has to wait
> for other CPUs to first flush this cacheline.
>
> So we really need more details here.

The problem was observed with __lock_page() (in a variant not
upstream for reasons not known to me), and prefixing e.g.
trylock_page() with an extra PageLocked() check yielded the
below quoted improvements.

Jack - were there any similar measurements done on upstream
code?

Jan


**** Quoting Jack Steiner <[email protected]> ****

The following tests were run on UVSW :
768p Westmere
128 nodes


Boot times - greater than 2X reduction in boot time:
2286s PTF #8
1899s PTF #8
975s new algorithm
962s new algorithm

Boot messages referring to udev timeouts - eliminated:
(After the udevadm settle timeout, the events queue contains):

7174 PTF #8
9435 PTF #8
0 new algorithm
0 new algorithm

AIM7 results - no difference at low numbers of tasks. Improvements at high counts:
Jobs/Min at 2000 users
5100 PTF #8
17750 new algorithm

Wallclock seconds to run test at 2000 users
2250s PTF #8
650s new algorithm

CPU Seconds at 2000 users
1300000 PTF #8
14000 new algorithm


Test of large parallel app faulting for text.

Text resident in page cache (10000 pages):
REAL USER SYS
22.830s 23m5.567s 85m59.042s PTF #8 run1
26.267s 34m3.536s 104m20.035s PTF #8 run2
10.890s 19m27.305s 39m50.949s new algorithm run1
10.860s 20m42.698s 40m48.889s new algorithm run2

Text on Disk (1000 pages)
REAL USER SYS
31.658s 9m25.379s 71m11.967s PTF #8
24.348s 6m15.323s 45m27.578s new algorithm

_________________________________________________________________________________
The following tests were run on UV48:
4 racks
256 sockets
2452p westmere

Boot time:
4562 sec PTF#8
1965 sec new

MPI "helloworld" with 1024 ranks
35 sec PTF #8
22 sec new


Test of large parallel app faulting for text.
Text resident in page cache (10000 pages):
REAL USER SYS
46.394s 141m19s 366m53s PTF #8
38.986s 137m36 264m52s PTF #8
7.987s 34m50s 42m36s new algorithm
10.550s 43m31s 59m45s new algorithm


AIM7 Results (this is the original AIM7 - not the recent opensource version)
------------------------------
Jobs/Min
TASKS PTF #8 new
1 487.8 486.6
10 4405.8 4940.6
100 18570.5 18198.9
1000 17262.3 17167.1
2000 4879.3 18163.9
4000 ** 18846.2
------------------------------
Real Seconds
TASKS PTF #8 new
1 11.9 12.0
10 13.2 11.8
100 31.3 32.0
1000 337.2 339.0
2000 2385.6 640.8
4000 ** 1235.3
------------------------------
CPU Seconds
TASKS PTF #8 new
1 1.6 1.6
10 11.5 12.9
100 132.2 137.2
1000 4486.5 6586.3
2000 1758419.7 27845.7
4000 ** 65619.5

** Timed out

2011-03-25 10:19:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Fri, 2011-03-25 at 10:22 +0100, Ingo Molnar wrote:
> * Andi Kleen <[email protected]> wrote:
>
> > > never EVER seen any good explanation of why that particular sh*t
> > > argument would b true. It seems to be purely about politics, where
> > > some idiotic vendor (namely HP) has convinced Intel that they really
> > > need it. To the point where some engineers seem to have bought into
> > > the whole thing and actually believe that fairy tale ("firmware can do
> > > better" - hah! They must be feeding people some bad drugs at the
> > > cafeteria)
> >
> > For the record I don't think it's a good idea for the BIOS to do
> > this (and I'm not aware of any engineer who does),
>
> There's really just two sane options:
>
> - complain about the BIOS corrupting CPU state and refusing to use the PMU
> - complain about the BIOS corrupting CPU state and using the PMU against the BIOS
>
> We went for the first one but i'll be more than glad to implement Linus's much
> more aggressive second option.
>
> Btw., for the record, the thing you have been advocating in the past was a
> third option: for the kernel to step aside quietly and to let the BIOS corrupt
> a counter or two. You even sent us some sort of BIOS specification about how to
> implement that. That's pretty much the worst solution imaginable.

Also seriously complicated by the kexec case where the previous kernel
didn't clean up PMU state. There is simply no sane way to detect if its
actually used and by whoem.

The whole PMU 'sharing' concept championed by Andi is utter crap.

As for simply using it despite the BIOS corrupting it, that might not
always work, the BIOS might simply over-write your state because it
one-sidedly declares to own the MSRs (observed behaviour).

Its all a big clusterfuck and really the best way (IMO) is what we have
now to put pressure on and force the BIOS vendors to play nice.

I assume both HP and DELL will be seriously unhappy with the kernel
spewing FIRMWARE BUG messages on boot on their boxen, the question is,
will they be unhappy enough to fix it..

Now Ingo's patch keeps the warning and lets you take the PMU back and
live with whatever consequences that brings (incorrect counts etc), that
might also work but puts less pressure on the vendors because things
appear to work.

2011-03-25 10:51:23

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Fri, Mar 25, 2011 at 05:59:31AM -0400, Ingo Molnar wrote:
> > [...] but you probably need to make a full pass to make sure we dont have a
> > MSR failure -this should return false in this case.
>
> Wanted to keep this patch simple - we are not really hitting MSR
> failure cases in practice, and by getting the message the user is at
> least warned that *something* is amiss.

AFAIR, Robert was telling me couple of months ago about some braindead
BIOSes doing power management by using at least one perf counter. A
radical approach like that might interfere with that BIOS since it will
try to reprogram the counter continuously and f*ck up OS PMU use.

Anyway, adding Robert for clarification.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2011-03-25 11:08:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Fri, 2011-03-25 at 10:59 +0100, Ingo Molnar wrote:
>
> Wanted to keep this patch simple - we are not really hitting MSR failure cases
> in practice,

Everything qemu/kvm hits those.

2011-03-25 11:10:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Jan Beulich <[email protected]> wrote:

> >>> On 24.03.11 at 18:19, Ingo Molnar <[email protected]> wrote:
> > * Jan Beulich <[email protected]> wrote:
> >> Are you certain? Iirc the lock prefix implies minimally a read-for-
> >> ownership (if CPUs are really smart enough to optimize away the
> >> write - I wonder whether that would be correct at all when it
> >> comes to locked operations), which means a cacheline can still be
> >> bouncing heavily.
> >
> > Yeah. On what workload was this?
> >
> > Generally you use test_and_set_bit() if you expect it to be 'owned' by
> > whoever calls it, and released by someone else.
> >
> > It would be really useful to run perf top on an affected box and see which
> > kernel function causes this. It might be better to add a test_bit() to the
> > affected codepath - instead of bloating all test_and_set_bit() users.
>
> Indeed, I agree with you and Linus in this aspect.
>
> > Note that the patch can also cause overhead: the test_bit() can miss the
> > cache, it will bring in the cacheline shared, and the subsequent test_and_set()
> > call will then dirty the cacheline - so the CPU might miss again and has to wait
> > for other CPUs to first flush this cacheline.
> >
> > So we really need more details here.
>
> The problem was observed with __lock_page() (in a variant not
> upstream for reasons not known to me), and prefixing e.g.
> trylock_page() with an extra PageLocked() check yielded the
> below quoted improvements.

The page lock flag is indeed one of those (rather rare) exceptions to typical
object locking patterns. So in that particular case adding the PageLocked()
test to trylock_page() would be the right approach to improving performance.

In the common case this change actively hurts for various reasons:

- can turn a cache miss into two cache misses
- adds an often unnecessary branch instruction
- adds often unnecessary bloat
- leaks a barrier

Thanks,

Ingo

2011-03-25 11:12:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2011-03-25 at 10:59 +0100, Ingo Molnar wrote:
> >
> > Wanted to keep this patch simple - we are not really hitting MSR failure cases
> > in practice,
>
> Everything qemu/kvm hits those.

and other virtualizers as well. I meant on real hardware with a real BIOS.

Thanks,

Ingo

2011-03-25 12:08:07

by Nikanth Karthikesan

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Friday, March 25, 2011 04:40:13 pm Ingo Molnar wrote:
> * Jan Beulich <[email protected]> wrote:
> > >>> On 24.03.11 at 18:19, Ingo Molnar <[email protected]> wrote:
> > > * Jan Beulich <[email protected]> wrote:
> > >> Are you certain? Iirc the lock prefix implies minimally a read-for-
> > >> ownership (if CPUs are really smart enough to optimize away the
> > >> write - I wonder whether that would be correct at all when it
> > >> comes to locked operations), which means a cacheline can still be
> > >> bouncing heavily.
> > >
> > > Yeah. On what workload was this?
> > >
> > > Generally you use test_and_set_bit() if you expect it to be 'owned' by
> > > whoever calls it, and released by someone else.
> > >
> > > It would be really useful to run perf top on an affected box and see
> > > which kernel function causes this. It might be better to add a
> > > test_bit() to the affected codepath - instead of bloating all
> > > test_and_set_bit() users.
> >
> > Indeed, I agree with you and Linus in this aspect.
> >
> > > Note that the patch can also cause overhead: the test_bit() can miss
> > > the cache, it will bring in the cacheline shared, and the subsequent
> > > test_and_set() call will then dirty the cacheline - so the CPU might
> > > miss again and has to wait for other CPUs to first flush this
> > > cacheline.
> > >
> > > So we really need more details here.
> >
> > The problem was observed with __lock_page() (in a variant not
> > upstream for reasons not known to me), and prefixing e.g.
> > trylock_page() with an extra PageLocked() check yielded the
> > below quoted improvements.
>
> The page lock flag is indeed one of those (rather rare) exceptions to
> typical object locking patterns. So in that particular case adding the
> PageLocked() test to trylock_page() would be the right approach to
> improving performance.
>
> In the common case this change actively hurts for various reasons:
>
> - can turn a cache miss into two cache misses
> - adds an often unnecessary branch instruction
> - adds often unnecessary bloat
> - leaks a barrier
>

Yes, I think I am observing these ill-effects when testing the code copied to
user-space.

Thanks
Nikanth

2011-03-25 13:13:17

by tip-bot for Jack Steiner

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Fri, Mar 25, 2011 at 10:06:10AM +0000, Jan Beulich wrote:
> >>> On 24.03.11 at 18:19, Ingo Molnar <[email protected]> wrote:
> > * Jan Beulich <[email protected]> wrote:
> >> Are you certain? Iirc the lock prefix implies minimally a read-for-
> >> ownership (if CPUs are really smart enough to optimize away the
> >> write - I wonder whether that would be correct at all when it
> >> comes to locked operations), which means a cacheline can still be
> >> bouncing heavily.
> >
> > Yeah. On what workload was this?
> >
> > Generally you use test_and_set_bit() if you expect it to be 'owned' by
> > whoever calls it, and released by someone else.
> >
> > It would be really useful to run perf top on an affected box and see which
> > kernel function causes this. It might be better to add a test_bit() to the
> > affected codepath - instead of bloating all test_and_set_bit() users.
>
> Indeed, I agree with you and Linus in this aspect.
>
> > Note that the patch can also cause overhead: the test_bit() can miss the
> > cache, it will bring in the cacheline shared, and the subsequent test_and_set()
> > call will then dirty the cacheline - so the CPU might miss again and has to wait
> > for other CPUs to first flush this cacheline.
> >
> > So we really need more details here.
>
> The problem was observed with __lock_page() (in a variant not
> upstream for reasons not known to me), and prefixing e.g.
> trylock_page() with an extra PageLocked() check yielded the
> below quoted improvements.
>
> Jack - were there any similar measurements done on upstream
> code?

Not yet but it is high on my list to test. I suspect a similar problem exists.
I'll post the results as soon as I have them.

>
> Jan
>
>
> **** Quoting Jack Steiner <[email protected]> ****
>
> The following tests were run on UVSW :
> 768p Westmere
> 128 nodes
>
>
> Boot times - greater than 2X reduction in boot time:
> 2286s PTF #8
> 1899s PTF #8
> 975s new algorithm
> 962s new algorithm
>
> Boot messages referring to udev timeouts - eliminated:
> (After the udevadm settle timeout, the events queue contains):
>
> 7174 PTF #8
> 9435 PTF #8
> 0 new algorithm
> 0 new algorithm
>
> AIM7 results - no difference at low numbers of tasks. Improvements at high counts:
> Jobs/Min at 2000 users
> 5100 PTF #8
> 17750 new algorithm
>
> Wallclock seconds to run test at 2000 users
> 2250s PTF #8
> 650s new algorithm
>
> CPU Seconds at 2000 users
> 1300000 PTF #8
> 14000 new algorithm
>
>
> Test of large parallel app faulting for text.
>
> Text resident in page cache (10000 pages):
> REAL USER SYS
> 22.830s 23m5.567s 85m59.042s PTF #8 run1
> 26.267s 34m3.536s 104m20.035s PTF #8 run2
> 10.890s 19m27.305s 39m50.949s new algorithm run1
> 10.860s 20m42.698s 40m48.889s new algorithm run2
>
> Text on Disk (1000 pages)
> REAL USER SYS
> 31.658s 9m25.379s 71m11.967s PTF #8
> 24.348s 6m15.323s 45m27.578s new algorithm
>
> _________________________________________________________________________________
> The following tests were run on UV48:
> 4 racks
> 256 sockets
> 2452p westmere
>
> Boot time:
> 4562 sec PTF#8
> 1965 sec new
>
> MPI "helloworld" with 1024 ranks
> 35 sec PTF #8
> 22 sec new
>
>
> Test of large parallel app faulting for text.
> Text resident in page cache (10000 pages):
> REAL USER SYS
> 46.394s 141m19s 366m53s PTF #8
> 38.986s 137m36 264m52s PTF #8
> 7.987s 34m50s 42m36s new algorithm
> 10.550s 43m31s 59m45s new algorithm
>
>
> AIM7 Results (this is the original AIM7 - not the recent opensource version)
> ------------------------------
> Jobs/Min
> TASKS PTF #8 new
> 1 487.8 486.6
> 10 4405.8 4940.6
> 100 18570.5 18198.9
> 1000 17262.3 17167.1
> 2000 4879.3 18163.9
> 4000 ** 18846.2
> ------------------------------
> Real Seconds
> TASKS PTF #8 new
> 1 11.9 12.0
> 10 13.2 11.8
> 100 31.3 32.0
> 1000 337.2 339.0
> 2000 2385.6 640.8
> 4000 ** 1235.3
> ------------------------------
> CPU Seconds
> TASKS PTF #8 new
> 1 1.6 1.6
> 10 11.5 12.9
> 100 132.2 137.2
> 1000 4486.5 6586.3
> 2000 1758419.7 27845.7
> 4000 ** 65619.5
>
> ** Timed out
>

Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On 25.03.11 06:21:16, Peter Zijlstra wrote:
> On Fri, 2011-03-25 at 10:22 +0100, Ingo Molnar wrote:
> > * Andi Kleen <[email protected]> wrote:

> > > For the record I don't think it's a good idea for the BIOS to do
> > > this (and I'm not aware of any engineer who does),
> >
> > There's really just two sane options:
> >
> > - complain about the BIOS corrupting CPU state and refusing to use the PMU
> > - complain about the BIOS corrupting CPU state and using the PMU against the BIOS
> >
> > We went for the first one but i'll be more than glad to implement Linus's much
> > more aggressive second option.
> >
> > Btw., for the record, the thing you have been advocating in the past was a
> > third option: for the kernel to step aside quietly and to let the BIOS corrupt
> > a counter or two. You even sent us some sort of BIOS specification about how to
> > implement that. That's pretty much the worst solution imaginable.

Option 2 wont work, I have seen BIOSes that block access to the
counter registers, then there is no way for the OS to take over
control.

So, if you want to use perf anyway on such systems, you will have to
implement option 3 to mark the counter as "reserved" ...


> Also seriously complicated by the kexec case where the previous kernel
> didn't clean up PMU state. There is simply no sane way to detect if its
> actually used and by whoem.
>
> The whole PMU 'sharing' concept championed by Andi is utter crap.

... but this seems not to be an option.

>
> As for simply using it despite the BIOS corrupting it, that might not
> always work, the BIOS might simply over-write your state because it
> one-sidedly declares to own the MSRs (observed behaviour).
>
> Its all a big clusterfuck and really the best way (IMO) is what we have
> now to put pressure on and force the BIOS vendors to play nice.
>
> I assume both HP and DELL will be seriously unhappy with the kernel
> spewing FIRMWARE BUG messages on boot on their boxen, the question is,
> will they be unhappy enough to fix it..

So, we better stick then with option 1. My experience is that new
system's bioses try not to claim perfctrs (affected systems I have
seen are about 2-3 years old), but I am not really sure here.

> Now Ingo's patch keeps the warning and lets you take the PMU back and
> live with whatever consequences that brings (incorrect counts etc), that
> might also work but puts less pressure on the vendors because things
> appear to work.

And yes, using the counter anyway may corrupt counter values.

-Robert


--
Advanced Micro Devices, Inc.
Operating System Research Center

Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On 25.03.11 05:32:28, Ingo Molnar wrote:
>
> * Eric Dumazet <[email protected]> wrote:
>
> > Le vendredi 25 mars 2011 ? 00:56 +0100, Andi Kleen a ?crit :
> > > > never EVER seen any good explanation of why that particular sh*t
> > > > argument would b true. It seems to be purely about politics, where
> > > > some idiotic vendor (namely HP) has convinced Intel that they really
> > > > need it. To the point where some engineers seem to have bought into
> > > > the whole thing and actually believe that fairy tale ("firmware can do
> > > > better" - hah! They must be feeding people some bad drugs at the
> > > > cafeteria)
> > >
> > > For the record I don't think it's a good idea for the BIOS to do
> > > this (and I'm not aware of any engineer who does),
> > > but I think Linux should do better than just disabling PMU use when
> > > this happens.
> > >
> > > However I suspect taking over SCI would cause endless problems
> > > and is very likely not a good idea.
> >
> > I tried many different changes in BIOS and all failed (the machine is
> > damn slow at boot, this takes age).
> >
> > I am stuck :(
>
> Could you please try the patch below?
>
> Thanks,
>
> Ingo
>
> ------------------->
> From 14df27334ac47a5cec67fb2238d14499346acc38 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <[email protected]>
> Date: Fri, 25 Mar 2011 10:24:23 +0100
> Subject: [PATCH] perf, x86: Complain louder about BIOSen corrupting CPU/PMU state and continue
>
> Eric Dumazet reported that hardware PMU events do not work on his
> system, due to the BIOS corrupting PMU state:
>
> Performance Events: PEBS fmt0+, Core2 events, Broken BIOS detected, using software events only.
> [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 186 is 43003c)
>
> Linus suggested that we continue in the face of such BIOS-induced CPU
> state corruption:
>
> http://lkml.org/lkml/2011/3/24/608
>
> Such BIOSes will have to be fixed - developers rely on a working and fully
> capable PMU and BIOS interfering with CPU state is simply not acceptable.
>
> So this patch changes perf to continue when it detects such BIOS
> interaction, some hardware events may be unreliable due to the BIOS writing
> and re-writing them - there's not much the kernel can do about that.
>
> Reported-by: Eric Dumazet <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> LKML-Reference: <new-submission>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/kernel/cpu/perf_event.c | 9 +++++++--
> 1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index ec46eea..eb00677 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -500,12 +500,17 @@ static bool check_hw_exists(void)
> return true;
>
> bios_fail:
> - printk(KERN_CONT "Broken BIOS detected, using software events only.\n");
> + /*
> + * We still allow the PMU driver to operate:
> + */
> + printk(KERN_CONT "Broken BIOS detected, complain to your hardware vendor.\n");
> printk(KERN_ERR FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x is %Lx)\n", reg, val);
> - return false;
> +
> + return true;

Ingo, you jump out the loop here. This skips checks on other registers
and msr access. And if we want to continue anyway, checking msr access
becomes more important as the bios may block it.

-Robert

>
> msr_fail:
> printk(KERN_CONT "Broken PMU hardware detected, using software events only.\n");
> +
> return false;
> }
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Advanced Micro Devices, Inc.
Operating System Research Center

2011-03-25 16:29:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Fri, Mar 25, 2011 at 3:06 AM, Jan Beulich <[email protected]> wrote:
>
> The problem was observed with __lock_page() (in a variant not
> upstream for reasons not known to me), and prefixing e.g.
> trylock_page() with an extra PageLocked() check yielded the
> below quoted improvements.

Ok. __lock_page() _definitely_ should do the test_bit() thing first,
because it's normally called from lock_page() that has already tested
the bit.

But it already seems to do that, so I'm wondering what your variant is.

I'm also a bit surprised that lock_page() is that hot (unless your
_lock_page() variant is simply too broken and ends up spinning?).
Maybe we have some path that takes the page lock unnecessarily? What's
the load?

Linus

2011-03-25 16:46:28

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

>>> On 25.03.11 at 17:29, Linus Torvalds <[email protected]> wrote:
> On Fri, Mar 25, 2011 at 3:06 AM, Jan Beulich <[email protected]> wrote:
>>
>> The problem was observed with __lock_page() (in a variant not
>> upstream for reasons not known to me), and prefixing e.g.
>> trylock_page() with an extra PageLocked() check yielded the
>> below quoted improvements.
>
> Ok. __lock_page() _definitely_ should do the test_bit() thing first,
> because it's normally called from lock_page() that has already tested
> the bit.
>
> But it already seems to do that, so I'm wondering what your variant is.

lock_page() does, but here it's really about __lock_page().

We've got a patch from Nick Piggin in the tree that aims at
speeding up unlock_page, which turns __lock_page() (on
2.6.32) into

void __lock_page(struct page *page)
{
wait_queue_head_t *wq = page_waitqueue(page);
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

do {
prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
SetPageWaiters(page);
if (likely(PageLocked(page)))
sync_page(page);
} while (!trylock_page(page));
finish_wait(wq, &wait.wait);
}

(No, I do not know why the patch isn't upstream, but Nick is on Cc,
so maybe he can tell.)

> I'm also a bit surprised that lock_page() is that hot (unless your
> _lock_page() variant is simply too broken and ends up spinning?).
> Maybe we have some path that takes the page lock unnecessarily? What's
> the load?

So yes, there is spinning involved. A first improvement from Jack
was to make the SetPageWaiters() (introduced by that patch)
conditional upon the bit not already being set.

But that wasn't yielding the desired improvement, hence they
added a PageLocked() || ... in front of the trylock_page(), which
in turn raised the question why trylock_page() wouldn't do this
generally.

Jan

2011-03-25 16:50:48

by tip-bot for Jack Steiner

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Fri, Mar 25, 2011 at 09:29:34AM -0700, Linus Torvalds wrote:
> On Fri, Mar 25, 2011 at 3:06 AM, Jan Beulich <[email protected]> wrote:
> >
> > The problem was observed with __lock_page() (in a variant not
> > upstream for reasons not known to me), and prefixing e.g.
> > trylock_page() with an extra PageLocked() check yielded the
> > below quoted improvements.
>
> Ok. __lock_page() _definitely_ should do the test_bit() thing first,
> because it's normally called from lock_page() that has already tested
> the bit.
>
> But it already seems to do that, so I'm wondering what your variant is.
>
> I'm also a bit surprised that lock_page() is that hot (unless your
> _lock_page() variant is simply too broken and ends up spinning?).
> Maybe we have some path that takes the page lock unnecessarily? What's
> the load?

We see the problem primarily on launching very large MPI applications.
The master process rapidly forks a large number (1 per cpu) of processes.
Each faults in a large number of text pages.

The text pages are resident in the page cache. No IO is involved but
the page lock quickly becomes a very hot contended cacheline.

Note also that this is observed in a 2.6.32 distro kernel that has a
different implementation of __lock_page. I think a similar problem
exists in the upstream kernel but have not had a chance to investigate.

We also see a similar problem during boot when a large number of udevd
processes are created.


--- jack

2011-03-25 17:15:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

> Also seriously complicated by the kexec case where the previous kernel
> didn't clean up PMU state. There is simply no sane way to detect if its

That's a good point, but we can easily stop the PMU before kexec.

> actually used and by whoem.

You check if the counter is enabled. If it's already enabled it's
used by someone else.

> The whole PMU 'sharing' concept championed by Andi is utter crap.

Why? It's the same thing as having some less counters. You need
to already support that for architectural perfmon with variable
counters anyways or for sharing with oprofile.

> As for simply using it despite the BIOS corrupting it, that might not
> always work, the BIOS might simply over-write your state because it
> one-sidedly declares to own the MSRs (observed behaviour).

Yes, that doesn't work. If someone else is active you have to step back.

> Its all a big clusterfuck and really the best way (IMO) is what we have
> now to put pressure on and force the BIOS vendors to play nice.

It just results in users like Eric being screwed. I doubt that any
BIOS writer ever heard about it. Congratulations for a great plan.

-Andi

2011-03-25 17:22:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

> Could you please try the patch below?

This is very likely not a good idea. I heard some systems can hang
if you overwrite their PMUs this way and you're unlucky enough :-(

-Andi

2011-03-25 19:22:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Andi Kleen <[email protected]> wrote:

> > Also seriously complicated by the kexec case where the previous kernel
> > didn't clean up PMU state. There is simply no sane way to detect if its
>
> That's a good point, but we can easily stop the PMU before kexec.

Wrong - there's lots of existing Linux versions out there that will kexec with
PMU state active. We cannot change them retroactively.

> > actually used and by whoem.
>
> You check if the counter is enabled. If it's already enabled it's used by
> someone else.

Wrong - or it can be enabled if it was left enabled accidentally. We treat PMU
state like we treat other CPU state.

> > The whole PMU 'sharing' concept championed by Andi is utter crap.
>
> Why? It's the same thing as having some less counters.

Wrong again - 25% or 50% of the events stolen with no user choice is a pretty
big deal.

Consider the example in this thread: cachemiss profiling done via perf, which
needs two events. If one of the generic counters is 'stolen' by a BIOS and
Linux accepts this silently then it means the loss of real functionality.

In other words, '25% of a specific hardware functionality stolen by the BIOS'
(or more) is absolutely not acceptable.

> [...] You need to already support that for architectural perfmon with
> variable counters anyways or for sharing with oprofile.

Wrong, that's different - multiplexing the PMU is obviously done within the OS.
It's evidently user configurable - users can use oprofile or perf - and the
user can still fully utilise the PMU to the extent he wishes to - it's his
hardware.

It is not possible for the kernel to stop the BIOS from using the PMU though,
so it's not 'sharing' no matter what 'protocol' is used, it's 'stealing'.

> > As for simply using it despite the BIOS corrupting it, that might not
> > always work, the BIOS might simply over-write your state because it
> > one-sidedly declares to own the MSRs (observed behaviour).
>
> Yes, that doesn't work. If someone else is active you have to step back.
>
> > Its all a big clusterfuck and really the best way (IMO) is what we have
> > now to put pressure on and force the BIOS vendors to play nice.
>
> It just results in users like Eric being screwed. I doubt that any
> BIOS writer ever heard about it. Congratulations for a great plan.

I'd encourage you to think through this topic instead of making derisive
comments about others ...

Thanks,

Ingo

2011-03-25 19:26:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Andi Kleen <[email protected]> wrote:

> > Could you please try the patch below?
>
> This is very likely not a good idea. I heard some systems can hang
> if you overwrite their PMUs this way and you're unlucky enough :-(

Wrong - in that case all kernels from 2.6.32 to 2.6.38 would already be locking
up. This patch simply restores behavior to essentially what it was before
v2.6.38 - but also adds the warning.

Thanks,

Ingo

2011-03-25 19:32:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Robert Richter <[email protected]> wrote:

> > Its all a big clusterfuck and really the best way (IMO) is what we have
> > now to put pressure on and force the BIOS vendors to play nice.
> >
> > I assume both HP and DELL will be seriously unhappy with the kernel
> > spewing FIRMWARE BUG messages on boot on their boxen, the question is,
> > will they be unhappy enough to fix it..
>
> So, we better stick then with option 1. My experience is that new
> system's bioses try not to claim perfctrs (affected systems I have
> seen are about 2-3 years old), but I am not really sure here.

That's good news - BIOSen unilaterally stealing PMU real estate is a really
utterly crazy concept.

For a limited physical resource like the PMU the correct approach to add
PMU-using features is to add an OS driver that implements the feature via the
regular PMU access functions. We already have such features so it's very much
possible. That way it all becomes controllable and configurable to the user.

Thanks,

Ingo

2011-03-25 20:27:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Fri, 2011-03-25 at 10:38 +0100, Eric Dumazet wrote:
>
> BIOS : Version: I27 Release Date: 09/30/2010
>
> [ 0.021849] CPU0: Intel(R) Xeon(R) CPU E5640 @ 2.67GHz stepping 02
> [ 0.123904] Performance Events: PEBS fmt1+, Westmere events, Broken BIOS detected, using software events only.
> [ 0.124125] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)

Do the below instructions yield a working system?:

> > On the BL460c G6/G7, one could perform the following steps:
> >
> > To configure BIOS low-latency options using RBSU
> > 1. Press F9 during POST to enter RBSU.
> > 2. Press CTRL-A to open the menu.
> > 3. Select Service Options.
> > 4. Disable either or both of the following options:
> > o Processor Power and Utilization Monitoring
> > o Memory Pre-Failure Notification
> >
> > By doing the above the BIOS should release the performance counter, and the
> > kernel's "perf" should not encounter a conflict.
> >
> > The document that you are looking for is here: (please see page 8 and later):
> > http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf


2011-03-26 08:15:40

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

Le vendredi 25 mars 2011 à 21:29 +0100, Peter Zijlstra a écrit :
> On Fri, 2011-03-25 at 10:38 +0100, Eric Dumazet wrote:
> >
> > BIOS : Version: I27 Release Date: 09/30/2010
> >
> > [ 0.021849] CPU0: Intel(R) Xeon(R) CPU E5640 @ 2.67GHz stepping 02
> > [ 0.123904] Performance Events: PEBS fmt1+, Westmere events, Broken BIOS detected, using software events only.
> > [ 0.124125] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
>
> Do the below instructions yield a working system?:
>
> > > On the BL460c G6/G7, one could perform the following steps:
> > >
> > > To configure BIOS low-latency options using RBSU
> > > 1. Press F9 during POST to enter RBSU.
> > > 2. Press CTRL-A to open the menu.
> > > 3. Select Service Options.
> > > 4. Disable either or both of the following options:
> > > o Processor Power and Utilization Monitoring
> > > o Memory Pre-Failure Notification
> > >
> > > By doing the above the BIOS should release the performance counter, and the
> > > kernel's "perf" should not encounter a conflict.
> > >
> > > The document that you are looking for is here: (please see page 8 and later):
> > > http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf
>
>
>

Thanks Peter !

I was not aware of the hidden "CTRL-A" feature on RBSU, so I never met
these configuration settings ("Service Options" added in main menu)

I already was in "Power regulator mode : Static High mode"
QPI Link power management : Disabled

I have latest released I24 BIOS
DMI: HP ProLiant BL460c G6, BIOS I24 01/29/2011

1) I disabled "Power and Utilization Monitoring" _and_ "Memory
Pre-Failure Notification"

Still I got the 'MSR 38d is 330' error

2) I then disabled Intel Turbo boost

Same error

3) Minimum Processor idle Power state
From C1E-State --> No C-states

Same error

Tricky isnt it ?


2011-03-26 09:42:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

On Sat, 2011-03-26 at 09:15 +0100, Eric Dumazet wrote:
> Le vendredi 25 mars 2011 à 21:29 +0100, Peter Zijlstra a écrit :
> > On Fri, 2011-03-25 at 10:38 +0100, Eric Dumazet wrote:
> > >
> > > BIOS : Version: I27 Release Date: 09/30/2010
> > >
> > > [ 0.021849] CPU0: Intel(R) Xeon(R) CPU E5640 @ 2.67GHz stepping 02
> > > [ 0.123904] Performance Events: PEBS fmt1+, Westmere events, Broken BIOS detected, using software events only.
> > > [ 0.124125] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
> >
> > Do the below instructions yield a working system?:
> >
> > > > On the BL460c G6/G7, one could perform the following steps:
> > > >
> > > > To configure BIOS low-latency options using RBSU
> > > > 1. Press F9 during POST to enter RBSU.
> > > > 2. Press CTRL-A to open the menu.
> > > > 3. Select Service Options.
> > > > 4. Disable either or both of the following options:
> > > > o Processor Power and Utilization Monitoring
> > > > o Memory Pre-Failure Notification
> > > >
> > > > By doing the above the BIOS should release the performance counter, and the
> > > > kernel's "perf" should not encounter a conflict.
> > > >
> > > > The document that you are looking for is here: (please see page 8 and later):
> > > > http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf
> >
> >
> >
>
> Thanks Peter !
>
> I was not aware of the hidden "CTRL-A" feature on RBSU, so I never met
> these configuration settings ("Service Options" added in main menu)
>
> I already was in "Power regulator mode : Static High mode"
> QPI Link power management : Disabled
>
> I have latest released I24 BIOS
> DMI: HP ProLiant BL460c G6, BIOS I24 01/29/2011
>
> 1) I disabled "Power and Utilization Monitoring" _and_ "Memory
> Pre-Failure Notification"
>
> Still I got the 'MSR 38d is 330' error
>
> 2) I then disabled Intel Turbo boost
>
> Same error
>
> 3) Minimum Processor idle Power state
> From C1E-State --> No C-states
>
> Same error
>
> Tricky isnt it ?

Definitely, OK thanks for testing, I'll contact HP again.

2011-03-26 09:57:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible


* Eric Dumazet <[email protected]> wrote:

> > > > By doing the above the BIOS should release the performance counter, and
> > > > the kernel's "perf" should not encounter a conflict.
> > > >
> > > > The document that you are looking for is here: (please see page 8 and
> > > > later):
> > > > http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf
>
> Thanks Peter !
>
> I was not aware of the hidden "CTRL-A" feature on RBSU, so I never met these
> configuration settings ("Service Options" added in main menu)
>
> I already was in "Power regulator mode : Static High mode" QPI Link power
> management : Disabled
>
> I have latest released I24 BIOS
> DMI: HP ProLiant BL460c G6, BIOS I24 01/29/2011
>
> 1) I disabled "Power and Utilization Monitoring" _and_ "Memory
> Pre-Failure Notification"
>
> Still I got the 'MSR 38d is 330' error
>
> 2) I then disabled Intel Turbo boost
>
> Same error
>
> 3) Minimum Processor idle Power state
> From C1E-State --> No C-states
>
> Same error
>
> Tricky isnt it ?

This kind of user-configurability reminds me of Mr. Prosser from the local
council who wants to knock down Arthur Dent’s house with a bulldozer to make
way for a new bypass:

"But the plans were on display ..."

"On display? I eventually had to go down to the cellar to find them."

"That's the display department."

"With a flashlight."

"Ah, well the lights had probably gone."

"So had the stairs."

"But look, you found the notice didn't you?"

"Yes," said Arthur, "yes I did. It was on display in the bottom of a locked
filing cabinet stuck in a disused lavatory with a sign on the door saying
'Beware of the Leopard'."

Ingo