2014-10-03 19:44:36

by Steinar H. Gunderson

Subject: Slowdown due to threads bouncing between HT cores

Hi,

I did a chess benchmark of my new machine (2x E5-2650v3, so 20x2.3GHz
Haswell-EP), and it performed a bit worse than comparable Windows setups.
It looks like the scheduler somehow doesn't perform as well with
hyperthreading; HT is on in the BIOS, but I'm only using 20 threads
(chess scales sublinearly, so using all 40 usually isn't a good idea),
so really, the threads should just get one core each and that's it.
It looks like they are bouncing between cores, reducing overall performance
by ~20% for some reason. (The machine is otherwise generally idle.)

First some details to reproduce more easily. Kernel version is 3.16.3, 64-bit
x86, Debian stable (so gcc 4.7.2). The benchmark binary is a chess engine
known as Stockfish; this is the compile I used (because that's what everyone
else is benchmarking with):

http://abrok.eu/stockfish/builds/dbd6156fceaf9bec8e9ff14f99c325c36b284079/linux64modernsse/stockfish_13111907_x64_modern_sse42

Stockfish is GPL, so the source is readily available if you should need it.

The benchmark is run by just running the binary, then giving it these
commands one by one:

uci
setoption name Threads value 20
setoption name Hash value 1024
position fen rnbq1rk1/pppnbppp/4p3/3pP1B1/3P3P/2N5/PPP2PP1/R2QKBNR w KQ - 0 7
go wtime 7200000 winc 30000 btime 7200000 binc 30000

After ~3 minutes, it will output “bestmove d1g4 ponder f8e8”. A few lines
above that, you'll see a line with something similar to “nps 13266463”.
That's nodes per second, and you want it to be higher.

So, benchmark:

- Default: 13266 kN/sec
- Change from ondemand to performance on all cores: 14600 kN/sec
- taskset -c 0-19 (locking affinity to only one set of hyperthreads):
17512 kN/sec

There is some local variation, but it's typically within a few percent.
Does anyone know what's going on? I have CONFIG_SCHED_SMT=y and
CONFIG_SCHED_MC=y.
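
In case anyone wants to reproduce the two non-default configurations, this
is roughly what I did -- a sketch, assuming the usual sysfs cpufreq layout
and root:

# switch every core from the ondemand to the performance governor
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
	echo performance > $g
done

# pin the engine to one hyperthread of each physical core
taskset -c 0-19 ./stockfish_13111907_x64_modern_sse42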

/* Steinar */
--
Homepage: http://www.sesse.net/


2014-10-03 21:14:23

by Steinar H. Gunderson

Subject: Re: Slowdown due to threads bouncing between HT cores

On Fri, Oct 03, 2014 at 11:11:52PM +0200, Marc Burkhardt wrote:
> As I understand your mail, your problem is quite similar, isn't it?

I guess it depends on how often your process migrates. If it happens, like,
every second, it's not a big problem (and probably is expected).
If it happens all the time, it might be; it depends a bit on a number of
factors.

/* Steinar */
--
Homepage: http://www.sesse.net/

2014-10-03 21:17:14

by Marc Burkhardt

Subject: Re: Slowdown due to threads bouncing between HT cores

* Steinar H. Gunderson <[email protected]> [2014-10-03 21:44:29 +0200]:

Hi Steinar,

I had a question a while ago: https://lkml.org/lkml/2014/5/25/57

I compiled mpv, which is a fairly quick compile but has a pretty predictable
amount of time spent in the linking phase of the build.

I, however, did not get an answer to this question as of today.

As I understand your mail, your problem is quite similar, isn't it?

Thanks,
Marc

> Hi,
>
> I did a chess benchmark of my new machine (2x E5-2650v3, so 20x2.3GHz
> Haswell-EP), and it performed a bit worse than comparable Windows setups.
> It looks like the scheduler somehow doesn't perform as well with
> hyperthreading; HT is on in the BIOS, but I'm only using 20 threads
> (chess scales sublinearly, so using all 40 usually isn't a good idea),
> so really, the threads should just get one core each and that's it.
> It looks like they are bouncing between cores, reducing overall performance
> by ~20% for some reason. (The machine is otherwise generally idle.)
>
> First some details to reproduce more easily. Kernel version is 3.16.3, 64-bit
> x86, Debian stable (so gcc 4.7.2). The benchmark binary is a chess engine
> known as Stockfish; this is the compile I used (because that's what everyone
> else is benchmarking with):
>
> http://abrok.eu/stockfish/builds/dbd6156fceaf9bec8e9ff14f99c325c36b284079/linux64modernsse/stockfish_13111907_x64_modern_sse42
>
> Stockfish is GPL, so the source is readily available if you should need it.
>
> The benchmark is run by just running the binary, then giving it these
> commands one by one:
>
> uci
> setoption name Threads value 20
> setoption name Hash value 1024
> position fen rnbq1rk1/pppnbppp/4p3/3pP1B1/3P3P/2N5/PPP2PP1/R2QKBNR w KQ - 0 7
> go wtime 7200000 winc 30000 btime 7200000 binc 30000
>
> After ~3 minutes, it will output “bestmove d1g4 ponder f8e8”. A few lines
> above that, you'll see a line with something similar to “nps 13266463”.
> That's nodes per second, and you want it to be higher.
>
> So, benchmark:
>
> - Default: 13266 kN/sec
> - Change from ondemand to performance on all cores: 14600 kN/sec
> - taskset -c 0-19 (locking affinity to only one set of hyperthreads):
> 17512 kN/sec
>
> There is some local variation, but it's typically within a few percent.
> Does anyone know what's going on? I have CONFIG_SCHED_SMT=y and
> CONFIG_SCHED_MC=y.
>
> /* Steinar */
> --
> Homepage: http://www.sesse.net/

--
Marc Burkhardt

2014-10-04 09:22:13

by Marc Burkhardt

Subject: Re: Slowdown due to threads bouncing between HT cores

* Steinar H. Gunderson <[email protected]> [2014-10-03 23:14:17 +0200]:

> On Fri, Oct 03, 2014 at 11:11:52PM +0200, Marc Burkhardt wrote:
> > As I understand your mail, your problem is quite similar, isn't it?
>
> I guess it depends on how often your process migrates. If it happens, like,
> every second, it's not a big problem (and probably is expected).
> If it happens all the time, it might be; it depends a bit on a number of
> factors.

Hi Steinar,

you say the behavior is probably expected, but I still don't see why this
would be normal or even wanted. There's a cost to migrating a process to
another CPU. If my machine has 4 cores, 3 of them idle, and the 1 running
process keeps being migrated ... where's the benefit of migrating it?

Do you have any source/docs where I could read about this?

Thanks,
Marc

>
> /* Steinar */
> --
> Homepage: http://www.sesse.net/

--
Marc Burkhardt

2014-10-04 13:41:17

by Andi Kleen

Subject: Re: Slowdown due to threads bouncing between HT cores

"Steinar H. Gunderson" <[email protected]> writes:
>
> So, benchmark:
>
> - Default: 13266 kN/sec
> - Change from ondemand to performance on all cores: 14600 kN/sec
> - taskset -c 0-19 (locking affinity to only one set of hyperthreads):
> 17512 kN/sec

iirc the main reasons for unbound threads migrating away are:

- something else gets scheduled on these logical CPUs, so
the scheduler tries to balance the run queue lengths

You could check that with perf timechart or perf sched record/map
or kernelshark.

- there is some IO or communication which causes wakeup affinity.

You could try disabling WAKEUP_PREEMPTION or NEXT_BUDDY in
/sys/kernel/debug/sched_features
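
For example -- a sketch, assuming debugfs is mounted at /sys/kernel/debug:

# record ~10 seconds of scheduling events while the benchmark runs,
# then render the per-cpu map of which task ran where
perf sched record -- sleep 10
perf sched map

# show the current feature flags, then disable the two suspects
cat /sys/kernel/debug/sched_features
echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched_features
echo NO_NEXT_BUDDY > /sys/kernel/debug/sched_features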

-Andi
--
[email protected] -- Speaking for myself only

2014-10-04 14:12:51

by Steinar H. Gunderson

Subject: Re: Slowdown due to threads bouncing between HT cores

On Sat, Oct 04, 2014 at 06:41:15AM -0700, Andi Kleen wrote:
> - something else gets scheduled on these logical CPUs, so
> the scheduler tries to balance the run queue lengths
>
> You could check that with perf timechart or perf sched record/map
> or kernelshark.

I've never read any of these maps before, but perf sched map really doesn't
indicate to me that there's a lot of other stuff going on. It seems to mainly
show a lot of Stockfish processes bouncing around seemingly randomly, with no
apparent regard for hyperthread pairs. Of course, there's the odd other
job, including ksoftirqd or an RCU process.

I can send you a copy of the map if you want, but it is of course rather
large.

> - there is some IO or communication which causes wakeup affinity.

There's a fair amount of communication between the threads; I don't know the
architecture very deeply (multithreading in chess is rather nontrivial),
but as far as I know, the worker threads access shared data through shm,
sometimes using pthread mutexes to lock some of it.

This also means, by the way, that occasionally they will sleep. They're not
by default going to hog the CPU 100% of the time, more like 90%.

> You could try disabling WAKEUP_PREEMPTION or NEXT_BUDDY in
> /sys/kernel/debug/sched_features

NO_NEXT_BUDDY was already set. (Changing it to NEXT_BUDDY didn't seem to help
anything.) I tried setting NO_WAKEUP_PREEMPTION, and it didn't make a
difference that I could see; they still bounce around a lot.

/* Steinar */
--
Homepage: http://www.sesse.net/

2014-10-04 14:50:12

by Chuck Ebbert

Subject: Re: Slowdown due to threads bouncing between HT cores

On Fri, 3 Oct 2014 21:44:29 +0200
"Steinar H. Gunderson" <[email protected]> wrote:

> Hi,
>
> I did a chess benchmark of my new machine (2x E5-2650v3, so 20x2.3GHz
> Haswell-EP), and it performed a bit worse than comparable Windows setups.
> It looks like the scheduler somehow doesn't perform as well with
> hyperthreading; HT is on in the BIOS, but I'm only using 20 threads
> (chess scales sublinearly, so using all 40 usually isn't a good idea),
> so really, the threads should just get one core each and that's it.
> It looks like they are bouncing between cores, reducing overall performance
> by ~20% for some reason. (The machine is otherwise generally idle.)
>

Try playing with /proc/sys/kernel/sched_migration_cost_ns. This sets
the number of nanoseconds the kernel will wait before considering
moving a thread to another CPU. I have mine set to 50000000.
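
For example -- a sketch, needs root:

# the default is 500000 ns (0.5 ms); raising it makes the balancer treat
# a recently-run thread as cache hot for longer before moving it
cat /proc/sys/kernel/sched_migration_cost_ns
echo 50000000 > /proc/sys/kernel/sched_migration_cost_ns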

2014-10-05 11:19:21

by Steinar H. Gunderson

Subject: Re: Slowdown due to threads bouncing between HT cores

On Sat, Oct 04, 2014 at 09:50:04AM -0500, Chuck Ebbert wrote:
> Try playing with /proc/sys/kernel/sched_migration_cost_ns. This sets
> the number of nanoseconds the kernel will wait before considering
> moving a thread to another CPU. I have mine set to 50000000.

I can't get any good effect out of this. I tried both 50 ms (your value) and
500 ms, and while it seems (by eyeballing the per-cpu display in top) to
_sometimes_ lock the processes to cores, more often than not, they still
bounce around between all 40. Worse still, when it _does_ lock them down, it
often seems to lock them to hyperthread pairs (e.g., I've seen it put threads
only on virtual cores 0-9 and 20-29, which means it has all of them on one
socket!)

Notwithstanding what top shows, the benchmarks don't improve; setting cores
manually with taskset is still much better.
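
(For reference, this is how I read off which virtual cores are hyperthread
siblings -- a sketch using standard tools:)

# logical CPU -> core -> socket table; HT siblings share a CORE id
lscpu -e=CPU,CORE,SOCKET

# or ask sysfs directly for a given cpu's sibling(s)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list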

/* Steinar */
--
Homepage: http://www.sesse.net/

2014-10-08 15:37:51

by Mike Galbraith

Subject: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

Seems you opened a can of futex worms...

On Fri, 2014-10-03 at 21:44 +0200, Steinar H. Gunderson wrote:
> Hi,
>
> I did a chess benchmark of my new machine (2x E5-2650v3, so 20x2.3GHz
> Haswell-EP), and it performed a bit worse than comparable Windows setups.
> It looks like the scheduler somehow doesn't perform as well with
> hyperthreading; HT is on in the BIOS, but I'm only using 20 threads
> (chess scales sublinearly, so using all 40 usually isn't a good idea),
> so really, the threads should just get one core each and that's it.
> It looks like they are bouncing between cores, reducing overall performance
> by ~20% for some reason. (The machine is otherwise generally idle.)
>
> First some details to reproduce more easily. Kernel version is 3.16.3, 64-bit
> x86, Debian stable (so gcc 4.7.2). The benchmark binary is a chess engine
> known as Stockfish; this is the compile I used (because that's what everyone
> else is benchmarking with):
>
> http://abrok.eu/stockfish/builds/dbd6156fceaf9bec8e9ff14f99c325c36b284079/linux64modernsse/stockfish_13111907_x64_modern_sse42
>
> Stockfish is GPL, so the source is readily available if you should need it.
>
> The benchmark is run by just running the binary, then giving it these
> commands one by one:
>
> uci
> setoption name Threads value 20
> setoption name Hash value 1024
> position fen rnbq1rk1/pppnbppp/4p3/3pP1B1/3P3P/2N5/PPP2PP1/R2QKBNR w KQ - 0 7
> go wtime 7200000 winc 30000 btime 7200000 binc 30000
>
> After ~3 minutes, it will output “bestmove d1g4 ponder f8e8”. A few lines
> above that, you'll see a line with something similar to “nps 13266463”.
> That's nodes per second, and you want it to be higher.
>
> So, benchmark:
>
> - Default: 13266 kN/sec
> - Change from ondemand to performance on all cores: 14600 kN/sec
> - taskset -c 0-19 (locking affinity to only one set of hyperthreads):
> 17512 kN/sec
>
> There is some local variation, but it's typically within a few percent.
> Does anyone know what's going on? I have CONFIG_SCHED_SMT=y and
> CONFIG_SCHED_MC=y.
>
> /* Steinar */

I don't see that on the 2 x E5-2697 box I borrowed to take a peek. Once
I got stockfish to actually run to completion by hunting down and brute
force reverting the below, I see ~32 million nodes/sec throughput with
3.17 whether I use taskset or just let it do its thing.

Without the revert, the thing starts up fine, runs for 5 seconds or so,
then comes to a screeching halt with one thread looping endlessly...

1412780609.892144 futex(0xd3ed18, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000034>
1412780609.892216 futex(0xd3ed44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 248307, {1412780609, 897000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.004857>
1412780609.897144 futex(0xd3ed18, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000021>
1412780609.897202 futex(0xd3ed44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 248309, {1412780609, 902000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.004862>
1412780609.902157 futex(0xd3ed18, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000025>
1412780609.902226 futex(0xd3ed44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 248311, {1412780609, 907000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.004845>
1412780609.907144 futex(0xd3ed18, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000021>
1412780609.907202 futex(0xd3ed44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 248313, {1412780609, 912000000}, ffffffff^CProcess 2756 detached
<detached ...>
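
(A trace like the above can be captured with something along these lines --
a sketch; 2756 was the looping thread's pid in this run, -ttt prints epoch
timestamps and -T the time spent in each call:)

strace -ttt -T -e trace=futex -p 2756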

I have not seen this on my little single socket E5620 box, nor on my 8
socket 64 core DL980, but the DL980 is a poor crippled thing (8GB ram,
interleaved), so may be too much of a slug (race? me? really!?!) to make
anything bad happen. The 2 socket E5-2697 (28 core) box OTOH is a fully
repeatable fail.

#!/bin/sh

stockfish <<%EOF
uci
setoption name Threads value 28
setoption name Hash value 1024
position fen rnbq1rk1/pppnbppp/4p3/3pP1B1/3P3P/2N5/PPP2PP1/R2QKBNR w KQ - 0 7
go wtime 7200000 winc 30000 btime 7200000 binc 30000

%EOF

11d4616bd07f38d496bd489ed8fad1dc4d928823 is the first bad commit
commit 11d4616bd07f38d496bd489ed8fad1dc4d928823
Author: Linus Torvalds <[email protected]>
Date: Thu Mar 20 22:11:17 2014 -0700

futex: revert back to the explicit waiter counting code

Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
taking the hb->lock if there's nothing to wake up") causes java threads
getting stuck on futexes when runing specjbb on a power7 numa box.

The cause appears to be that the powerpc spinlocks aren't using the same
ticket lock model that we use on x86 (and other) architectures, which in
turn result in the "spin_is_locked()" test in hb_waiters_pending()
occasionally reporting an unlocked spinlock even when there are pending
waiters.

So this reinstates Davidlohr Bueso's original explicit waiter counting
code, which I had convinced Davidlohr to drop in favor of figuring out
the pending waiters by just using the existing state of the spinlock and
the wait queue.

Reported-and-tested-by: Srikar Dronamraju <[email protected]>
Original-code-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

:040000 040000 6db0586dfa6e007f8feb4d230b5dfa70751a74e2 3c77ee44544ac3e9dfd17abac706cd4ebbacdd51 M kernel


git bisect start
# good: [d8ec26d7f8287f5788a494f56e8814210f0e64be] Linux 3.13
git bisect good d8ec26d7f8287f5788a494f56e8814210f0e64be
# bad: [455c6fdbd219161bd09b1165f11699d6d73de11c] Linux 3.14
git bisect bad 455c6fdbd219161bd09b1165f11699d6d73de11c
# good: [82c477669a4665eb4e52030792051e0559ee2a36] Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 82c477669a4665eb4e52030792051e0559ee2a36
# good: [ca2a650f3dfdc30d71d21bcbb04d2d057779f3f9] Merge branch 'for-linus' of git://git.infradead.org/users/vkoul/slave-dma
git bisect good ca2a650f3dfdc30d71d21bcbb04d2d057779f3f9
# good: [f2de3a159937bfb1ab1ca671e0f2d06cda286a24] Merge tag 'sound-3.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good f2de3a159937bfb1ab1ca671e0f2d06cda286a24
# good: [9b3e7c9b9ab5c2827c1ecd45327b851a1bd01c2a] Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 9b3e7c9b9ab5c2827c1ecd45327b851a1bd01c2a
# good: [52a4c6404f91f2d2c5592ee6365a8418c4565f53] selinux: add gfp argument to security_xfrm_policy_alloc and fix callers
git bisect good 52a4c6404f91f2d2c5592ee6365a8418c4565f53
# good: [53611c0ce9f6e2fa2e31f9ab4ad8c08c512085ba] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good 53611c0ce9f6e2fa2e31f9ab4ad8c08c512085ba
# bad: [b37199e626b31e1175fb06764c5d1d687723aac2] rcuwalk: recheck mount_lock after mountpoint crossing attempts
git bisect bad b37199e626b31e1175fb06764c5d1d687723aac2
# good: [004e5cf743086990e5fc04a14437b3966d7fa9a2] Merge branch 'exynos-drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes
git bisect good 004e5cf743086990e5fc04a14437b3966d7fa9a2
# good: [7c3895383fea48dab2374b04a936de4717a85a81] Merge branch 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
git bisect good 7c3895383fea48dab2374b04a936de4717a85a81
# good: [477cc48484ea0bfdee9e8319259b35a0ec03f332] Merge tag 'trace-fixes-v3.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
git bisect good 477cc48484ea0bfdee9e8319259b35a0ec03f332
# bad: [084c6c5013af3c62f1c344435214496f5ac999f2] Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 084c6c5013af3c62f1c344435214496f5ac999f2
# bad: [08edb33c4e1b810011f21d7705811b7b9a0535f0] Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
git bisect bad 08edb33c4e1b810011f21d7705811b7b9a0535f0
# bad: [708f04d2abf4e90abee61d9ffb1f165038017ecf] block: free q->flush_rq in blk_init_allocated_queue error paths
git bisect bad 708f04d2abf4e90abee61d9ffb1f165038017ecf
# bad: [11d4616bd07f38d496bd489ed8fad1dc4d928823] futex: revert back to the explicit waiter counting code
git bisect bad 11d4616bd07f38d496bd489ed8fad1dc4d928823
# first bad commit: [11d4616bd07f38d496bd489ed8fad1dc4d928823] futex: revert back to the explicit waiter counting code


Bend spindle mutilate like so, and it works:

kernel/futex.c | 58 ++++++++++------------------------------------------------
1 file changed, 10 insertions(+), 48 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af..cbb9fc4 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -250,7 +250,6 @@ static const struct futex_q futex_q_init = {
* waiting on a futex.
*/
struct futex_hash_bucket {
- atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
} ____cacheline_aligned_in_smp;
@@ -270,37 +269,22 @@ static inline void futex_get_mm(union futex_key *key)
smp_mb__after_atomic();
}

-/*
- * Reflects a new waiter being added to the waitqueue.
- */
-static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
+static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
- atomic_inc(&hb->waiters);
/*
- * Full barrier (A), see the ordering comment above.
+ * Tasks trying to enter the critical region are most likely
+ * potential waiters that will be added to the plist. Ensure
+ * that wakers won't miss to-be-slept tasks in the window between
+ * the wait call and the actual plist_add.
*/
- smp_mb__after_atomic();
-#endif
-}
-
-/*
- * Reflects a waiter being removed from the waitqueue by wakeup
- * paths.
- */
-static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
-{
-#ifdef CONFIG_SMP
- atomic_dec(&hb->waiters);
-#endif
-}
+ if (spin_is_locked(&hb->lock))
+ return true;
+ smp_rmb(); /* Make sure we check the lock state first */

-static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
-{
-#ifdef CONFIG_SMP
- return atomic_read(&hb->waiters);
+ return !plist_head_empty(&hb->chain);
#else
- return 1;
+ return true;
#endif
}

@@ -1071,7 +1055,6 @@ static void __unqueue_futex(struct futex_q *q)

hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
plist_del(&q->list, &hb->chain);
- hb_waiters_dec(hb);
}

/*
@@ -1356,9 +1339,7 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
*/
if (likely(&hb1->chain != &hb2->chain)) {
plist_del(&q->list, &hb1->chain);
- hb_waiters_dec(hb1);
plist_add(&q->list, &hb2->chain);
- hb_waiters_inc(hb2);
q->lock_ptr = &hb2->lock;
}
get_futex_key_refs(key2);
@@ -1549,7 +1530,6 @@ retry:
hb2 = hash_futex(&key2);

retry_private:
- hb_waiters_inc(hb2);
double_lock_hb(hb1, hb2);

if (likely(cmpval != NULL)) {
@@ -1559,7 +1539,6 @@ retry_private:

if (unlikely(ret)) {
double_unlock_hb(hb1, hb2);
- hb_waiters_dec(hb2);

ret = get_user(curval, uaddr1);
if (ret)
@@ -1618,7 +1597,6 @@ retry_private:
break;
case -EFAULT:
double_unlock_hb(hb1, hb2);
- hb_waiters_dec(hb2);
put_futex_key(&key2);
put_futex_key(&key1);
ret = fault_in_user_writeable(uaddr2);
@@ -1633,7 +1611,6 @@ retry_private:
* - The user space value changed.
*/
double_unlock_hb(hb1, hb2);
- hb_waiters_dec(hb2);
put_futex_key(&key2);
put_futex_key(&key1);
cond_resched();
@@ -1709,7 +1686,6 @@ retry_private:

out_unlock:
double_unlock_hb(hb1, hb2);
- hb_waiters_dec(hb2);

/*
* drop_futex_key_refs() must be called outside the spinlocks. During
@@ -1737,17 +1713,6 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
struct futex_hash_bucket *hb;

hb = hash_futex(&q->key);
-
- /*
- * Increment the counter before taking the lock so that
- * a potential waker won't miss a to-be-slept task that is
- * waiting for the spinlock. This is safe as all queue_lock()
- * users end up calling queue_me(). Similarly, for housekeeping,
- * decrement the counter at queue_unlock() when some error has
- * occurred and we don't end up adding the task to the list.
- */
- hb_waiters_inc(hb);
-
q->lock_ptr = &hb->lock;

spin_lock(&hb->lock); /* implies MB (A) */
@@ -1759,7 +1724,6 @@ queue_unlock(struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
spin_unlock(&hb->lock);
- hb_waiters_dec(hb);
}

/**
@@ -2482,7 +2446,6 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
* Unqueue the futex_q and determine which it was.
*/
plist_del(&q->list, &hb->chain);
- hb_waiters_dec(hb);

/* Handle spurious wakeups gracefully */
ret = -EWOULDBLOCK;
@@ -3035,7 +2998,6 @@ static int __init futex_init(void)
futex_detect_cmpxchg();

for (i = 0; i < futex_hashsize; i++) {
- atomic_set(&futex_queues[i].waiters, 0);
plist_head_init(&futex_queues[i].chain);
spin_lock_init(&futex_queues[i].lock);
}


2014-10-08 16:14:25

by Thomas Gleixner

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, 8 Oct 2014, Mike Galbraith wrote:
> Seems you opened a can of futex worms...

Bah.

> I don't see that on the 2 x E5-2697 box I borrowed to take a peek. Once
> I got stockfish to actually run to completion by hunting down and brute
> force reverting the below, I see ~32 million nodes/sec throughput with
> 3.17 whether I use taskset or just let it do its thing.
>
> Without the revert, the thing starts up fine, runs for 5 seconds or so,
> then comes to a screeching halt with one thread looping endlessly...
>
> 1412780609.892144 futex(0xd3ed18, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000034>
> 1412780609.892216 futex(0xd3ed44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 248307, {1412780609, 897000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.004857>
> 1412780609.897144 futex(0xd3ed18, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000021>
> 1412780609.897202 futex(0xd3ed44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 248309, {1412780609, 902000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.004862>
> 1412780609.902157 futex(0xd3ed18, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000025>
> 1412780609.902226 futex(0xd3ed44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 248311, {1412780609, 907000000}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.004845>
> 1412780609.907144 futex(0xd3ed18, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000021>
> 1412780609.907202 futex(0xd3ed44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 248313, {1412780609, 912000000}, ffffffff^CProcess 2756 detached
> <detached ...>

So that thread waits 5ms for the futex, times out, fiddles with a
different futex and waits some more...

From a short glance at the code, it looks like a condition variable...
So if nothing updates and signals the condition, it will show exactly
that behaviour.

> I have not seen this on my little single socket E5620 box, nor on my 8
> socket 64 core DL980, but the DL980 is a poor crippled thing (8GB ram,
> interleaved), so may be too much of a slug (race? me? really!?!) to make
> anything bad happen. The 2 socket E5-2697 (28 core) box OTOH is a fully
> repeatable fail.

> 11d4616bd07f38d496bd489ed8fad1dc4d928823 is the first bad commit
> commit 11d4616bd07f38d496bd489ed8fad1dc4d928823
> Author: Linus Torvalds <[email protected]>
> Date: Thu Mar 20 22:11:17 2014 -0700
>
> futex: revert back to the explicit waiter counting code
>
> Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
> taking the hb->lock if there's nothing to wake up") causes java threads
> getting stuck on futexes when runing specjbb on a power7 numa box.
>
> The cause appears to be that the powerpc spinlocks aren't using the same
> ticket lock model that we use on x86 (and other) architectures, which in
> turn result in the "spin_is_locked()" test in hb_waiters_pending()
> occasionally reporting an unlocked spinlock even when there are pending
> waiters.

Well, unfortunately we cannot revert that for obvious reasons, and I
really doubt that it is the real problem.

It looks far more like an issue with the stocking fish code, but hell
with futexes one can never be sure.

Thanks,

tglx

2014-10-08 16:23:23

by Steinar H. Gunderson

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, Oct 08, 2014 at 05:37:44PM +0200, Mike Galbraith wrote:
> Seems you opened a can of futex worms...

Awesome.

> I don't see that on the 2 x E5-2697 box I borrowed to take a peek. Once
> I got stockfish to actually run to completion by hunting down and brute
> force reverting the below, I see ~32 million nodes/sec throughput with
> 3.17 whether I use taskset or just let it do its thing.

Interesting. If you open up top, do you see (like me) the load being spread
out across multiple CPUs, or do they actually stay in place like they're
supposed to?

I suppose you run with Threads set to 28, right? (And not, say, 56.)

/* Steinar */
--
Homepage: http://www.sesse.net/

2014-10-08 16:46:10

by Steinar H. Gunderson

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, Oct 08, 2014 at 06:14:18PM +0200, Thomas Gleixner wrote:
> It looks far more like an issue with the stocking fish code, but hell
> with futexes one can never be sure.

OK, maybe we should move to a more recent Stockfish version first of all;
the specific benchmark was about that specific binary, but for tracking down
futex issues we can see if more recent code fixes it (the SMP code in this
thing keeps getting developed).

I'm moving to 2ac206e847a04a7de07690dd575c6949e5625115 (current head) of
https://github.com/mcostalba/Stockfish.git, and building with
“make -j ARCH=x86-64-bmi2”.
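
In full, that's roughly the following -- a sketch, assuming git and a gcc
new enough for -mbmi2:

git clone https://github.com/mcostalba/Stockfish.git
cd Stockfish/src
git checkout 2ac206e847a04a7de07690dd575c6949e5625115
make -j ARCH=x86-64-bmi2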

I still don't see any hangs, but I do see the same behavior of moving around
between CPUs as the older version exhibited. In a test run (using the given
test script, just with 28 replaced by 20), I get 18273 kN/sec with default,
and 21875 kN/sec when using taskset.

/* Steinar */
--
Homepage: http://www.sesse.net/

2014-10-08 17:04:05

by Linus Torvalds

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, Oct 8, 2014 at 11:37 AM, Mike Galbraith
<[email protected]> wrote:
>
> 11d4616bd07f38d496bd489ed8fad1dc4d928823 is the first bad commit
> commit 11d4616bd07f38d496bd489ed8fad1dc4d928823
> Author: Linus Torvalds <[email protected]>
> Date: Thu Mar 20 22:11:17 2014 -0700
>
> futex: revert back to the explicit waiter counting code

While that revert might make things a tiny bit slower (I hated doing
it, but the clever approach sadly didn't work on powerpc and depended
on x86 locking semantics), I seriously doubt it's really relevant.
It's more likely that the *real* problem itself is very
timing-dependent, and the subtle synchronization changes here then
expose it or hide it, rather than really fixing anything.

So like Thomas, I would suspect a race condition in the futex use, and
then the exact futex implementation details are just exposing it
incidentally.

Linus

2014-10-08 17:05:22

by Steinar H. Gunderson

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, Oct 08, 2014 at 01:04:01PM -0400, Linus Torvalds wrote:
> So like Thomas, I would suspect a race condition in the futex use, and
> then the exact futex implementation details are just exposing it
> incidentally.

FWIW, Stockfish does not use futex directly; it uses pthreads (or Win32
threads if you are on that OS :-) ). That doesn't preclude a race condition
somewhere, of course.

/* Steinar */
--
Homepage: http://www.sesse.net/

2014-10-08 17:52:26

by Mike Galbraith

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, 2014-10-08 at 18:45 +0200, Steinar H. Gunderson wrote:
> On Wed, Oct 08, 2014 at 06:14:18PM +0200, Thomas Gleixner wrote:
> > It looks far more like an issue with the stocking fish code, but hell
> > with futexes one can never be sure.
>
> OK, maybe we should move to a more recent Stockfish version first of all;
> the specific benchmark was about that specific binary, but for tracking down
> futex issues we can see if more recent code fixes it (the SMP in this thing
> keeps getting developed).

Mine was a self-built master, downloaded as a tarball.

> I'm moving to 2ac206e847a04a7de07690dd575c6949e5625115 (current head) of
> https://github.com/mcostalba/Stockfish.git, and building with
> “make -j ARCH=x86-64-bmi2”.
>
> I still don't see any hangs, but I do see the same behavior of moving around
> between CPUs as the older version exhibited. In a test run (using the given
> test script, just with 28 replaced by 20), I get 18273 kN/sec with default,
> and 21875 kN/sec when using taskset.

Sure, that will likely be workqueues or whatnot running when one of the
chess threads wakes, so select_idle_sibling() shows one of its two
faces.. the butt ugly one. That's what I wanted to play with on a too
darn big socket box when I discovered the thing didn't want to play.

-Mike

2014-10-08 17:59:50

by Mike Galbraith

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, 2014-10-08 at 13:04 -0400, Linus Torvalds wrote:
> On Wed, Oct 8, 2014 at 11:37 AM, Mike Galbraith
> <[email protected]> wrote:
> >
> > 11d4616bd07f38d496bd489ed8fad1dc4d928823 is the first bad commit
> > commit 11d4616bd07f38d496bd489ed8fad1dc4d928823
> > Author: Linus Torvalds <[email protected]>
> > Date: Thu Mar 20 22:11:17 2014 -0700
> >
> > futex: revert back to the explicit waiter counting code
>
> While that revert might make things a tiny bit slower (I hated doing
> it, but the clever approach sadly didn't work on powerpc and depended
> on x86 locking semantics), I seriously doubt it's really relevant.
> It's more likely that the *real* problem itself is very
> timing-dependent, and the subtle synchronization changes here then
> expose it or hide it, rather than really fixing anything.
>
> So like Thomas, I would suspect a race condition in the futex use, and
> then the exact futex implementation details are just exposing it
> incidentally.

Whew, good, futex.c is hard. Heads up chess guys <punt>.

-Mike

2014-10-24 15:25:27

by Thomas Gleixner

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Wed, 8 Oct 2014, Mike Galbraith wrote:
> On Wed, 2014-10-08 at 13:04 -0400, Linus Torvalds wrote:
> > On Wed, Oct 8, 2014 at 11:37 AM, Mike Galbraith
> > <[email protected]> wrote:
> > >
> > > 11d4616bd07f38d496bd489ed8fad1dc4d928823 is the first bad commit
> > > commit 11d4616bd07f38d496bd489ed8fad1dc4d928823
> > > Author: Linus Torvalds <[email protected]>
> > > Date: Thu Mar 20 22:11:17 2014 -0700
> > >
> > > futex: revert back to the explicit waiter counting code
> >
> > While that revert might make things a tiny bit slower (I hated doing
> > it, but the clever approach sadly didn't work on powerpc and depended
> > on x86 locking semantics), I seriously doubt it's really relevant.
> > It's more likely that the *real* problem itself is very
> > timing-dependent, and the subtle synchronization changes here then
> > expose it or hide it, rather than really fixing anything.
> >
> > So like Thomas, I would suspect a race condition in the futex use, and
> > then the exact futex implementation details are just exposing it
> > incidentally.
>
> Whew, good, futex.c is hard. Heads up chess guys <punt>.

I wonder whether the barrier fix which got into 3.17 late fixes that
issue as well.

Thanks,

tglx

2014-10-24 16:38:48

by Mike Galbraith

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Fri, 2014-10-24 at 17:25 +0200, Thomas Gleixner wrote:
> On Wed, 8 Oct 2014, Mike Galbraith wrote:
> > On Wed, 2014-10-08 at 13:04 -0400, Linus Torvalds wrote:
> > > On Wed, Oct 8, 2014 at 11:37 AM, Mike Galbraith
> > > <[email protected]> wrote:
> > > >
> > > > 11d4616bd07f38d496bd489ed8fad1dc4d928823 is the first bad commit
> > > > commit 11d4616bd07f38d496bd489ed8fad1dc4d928823
> > > > Author: Linus Torvalds <[email protected]>
> > > > Date: Thu Mar 20 22:11:17 2014 -0700
> > > >
> > > > futex: revert back to the explicit waiter counting code
> > >
> > > While that revert might make things a tiny bit slower (I hated doing
> > > it, but the clever approach sadly didn't work on powerpc and depended
> > > on x86 locking semantics), I seriously doubt it's really relevant.
> > > It's more likely that the *real* problem itself is very
> > > timing-dependent, and the subtle synchronization changes here then
> > > expose it or hide it, rather than really fixing anything.
> > >
> > > So like Thomas, I would suspect a race condition in the futex use, and
> > > then the exact futex implementation details are just exposing it
> > > incidentally.
> >
> > Whew, good, futex.c is hard. Heads up chess guys <punt>.
>
> I wonder whether the barrier fix which got into 3.17 late fixes that
> issue as well.

Yes, it did.

-Mike

2014-10-26 10:39:29

by Steinar H. Gunderson

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Fri, Oct 24, 2014 at 06:38:41PM +0200, Mike Galbraith wrote:
>>> Whew, good, futex.c is hard. Heads up chess guys <punt>.
>> I wonder whether the barrier fix which got into 3.17 late fixes that
>> issue as well.
> Yes, it did.

This is only about the lockup, right, not that the threads bounce around a
lot and make things slower?

/* Steinar */
--
Homepage: http://www.sesse.net/

2014-10-26 13:16:20

by Mike Galbraith

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Sun, 2014-10-26 at 11:39 +0100, Steinar H. Gunderson wrote:
> On Fri, Oct 24, 2014 at 06:38:41PM +0200, Mike Galbraith wrote:
> >>> Whew, good, futex.c is hard. Heads up chess guys <punt>.
> >> I wonder whether the barrier fix which got into 3.17 late fixes that
> >> issue as well.
> > Yes, it did.
>
> This is only about the lockup, right, not that the threads bounce around a
> lot and make things slower?

Yes. I couldn't reproduce your results; on my 28 core+HT box, there was
little difference between pinned and unpinned. Maybe the box was too quiet,
with not enough random activity to motivate select_idle_sibling().

-Mike

2014-10-26 13:58:42

by Mike Galbraith

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Sun, 2014-10-26 at 14:16 +0100, Mike Galbraith wrote:
> On Sun, 2014-10-26 at 11:39 +0100, Steinar H. Gunderson wrote:
> > On Fri, Oct 24, 2014 at 06:38:41PM +0200, Mike Galbraith wrote:
> > >>> Whew, good, futex.c is hard. Heads up chess guys <punt>.
> > >> I wonder whether the barrier fix which got into 3.17 late fixes that
> > >> issue as well.
> > > Yes, it did.
> >
> > This is only about the lockup, right, not that the threads bounce around a
> > lot and make things slower?
>
> Yes. I couldn't reproduce your results; on my 28 core+HT box, there was
> little difference between pinned and unpinned. Maybe the box was too quiet,
> with not enough random activity to motivate select_idle_sibling().

Can you try the below?

---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4477,6 +4477,12 @@ static int select_idle_sibling(struct ta
return i;

/*
+ * Don't bounce hefty tasks about, they may be hitting contention.
+ */
+ if (p->se.load.weight - (p->se.load.weight/4) < p->se.avg.load_avg_contrib)
+ return target;
+
+ /*
* Otherwise, iterate the domains and find an elegible idle cpu.
*/
sd = rcu_dereference(per_cpu(sd_llc, target));
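
(The check above bails out of the idle-sibling search when a task's average
load contribution exceeds ~75% of its weight, i.e. it is runnable nearly all
the time. A quick way to eyeball those numbers for the engine -- a sketch,
assuming CONFIG_SCHED_DEBUG so /proc/<pid>/sched is populated, and a single
stockfish process:)

grep -E 'se.load.weight|load_avg' /proc/$(pidof stockfish)/sched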

2014-10-26 14:11:54

by Steinar H. Gunderson

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Sun, Oct 26, 2014 at 02:58:36PM +0100, Mike Galbraith wrote:
> Can you try the below?

I can bake it into the kernel the next time I boot, but that is unlikely to
be anytime soon, I'm afraid. I'm maybe a bit surprised nobody else can
reproduce my issue; I assume it won't help if I give out shell access
somehow.

/* Steinar */
--
Homepage: http://www.sesse.net/

2014-10-26 14:41:40

by Mike Galbraith

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Sun, 2014-10-26 at 15:11 +0100, Steinar H. Gunderson wrote:
> On Sun, Oct 26, 2014 at 02:58:36PM +0100, Mike Galbraith wrote:
> > Can you try the below?
>
> I can bake it into the kernel the next time I boot, but that is unlikely to
> be anytime soon, I'm afraid. I'm maybe a bit surprised nobody else can
> reproduce my issue; I assume it won't help if I give out shell access
> somehow.

Sure it would, any bug I can get my hands on, I can toss rocks and
sticks at until one of us gives up :) Trouble is me having time to go
play with your box, and you trusting me to not do naughty things while
there, /me nearly always being in godmode (permission denied allergy),
that would be a problem.

-Mike


2014-10-27 10:05:09

by Mike Galbraith

Subject: Re: bisected: futex regression >= 3.14 - was - Slowdown due to threads bouncing between HT cores

On Sun, 2014-10-26 at 15:11 +0100, Steinar H. Gunderson wrote:
> I'm maybe a bit surprised nobody else can
> reproduce my issue;

I tried again this morning, and failed again this morning.

From notes...

box = 2 socket, 14 cores/socket +ht, kernel = v3.18-rc2

stockfish 28 threads
nps 30511800 time 150120
nps 31337817 time 197553
nps 30675979 time 183005
nps 30706678 time 205037
nps 33389720 time 319808 (note massive runtime (ms) variance)
avg 30046381 211104

Adds perturbation source - ring-test -t 56 -w 1 -s 1 -p
(pipe ring, work 1ms, sleep 1ms, thread 0-55 pinned to cpu 0-55)

One ring did nada, bump to 5 concurrent rings.

nps 29443509 time 177969
nps 28713560 time 159663
nps 30628452 time 192714
nps 28834888 time 151611
nps 32611498 time 387760

Hm, still nada, try _10_.

nps 29998092 time 164625
nps 31142316 time 178655

Heh, 10 rings of pokers, still nada, try heavier tbench.

...

4 tbench pairs - 4 full cpus worth of competition
nps 27157639 time 144184
nps 26511984 time 143998
nps 28408851 time 182903
nps 28501510 time 157897
nps 28357045 time 172855
avg 27787405 160367 (avg runtime drops?.. shrug)

Ah, finally a nps delta.. but nothing unexpected.

vs zero competition, average nps is 0.924

4 cpus shared ~= stockfish runs on 26 cpus, 26/28 = 0.928

Test session summary: the sun moved westward.