2013-06-30 07:57:42

by Mike Galbraith

Subject: sched: context tracking demolishes pipe-test

taskset -c 3 pipe-test 1

switch-rate
3.10.0-master 426.4 KHz
3.10.0-masterx 591.8 KHz 1.387 (591=notgood, 426=omfg)

3.10.0-master 3.10.0-masterx
7.13% [k] __schedule 9.58% [k] __schedule
! 6.02% [.] __read_nocancel 8.25% [k] system_call
! 5.82% [.] __write_nocancel 5.34% [k] _raw_spin_lock_irqsave
! 4.28% [k] system_call 4.26% [k] pipe_read
! 4.13% [k] user_exit 3.53% [k] copy_user_generic_string
3.64% [k] _raw_spin_lock_irqsave 3.52% [k] native_sched_clock
3.37% [k] pipe_read 3.43% [k] __switch_to
2.72% [k] copy_user_generic_string 3.21% [.] main
2.60% [.] main 2.96% [k] mutex_lock
2.59% [k] native_sched_clock 2.89% [k] mutex_unlock
2.47% [k] __switch_to 2.86% [k] sched_clock_local
2.10% [k] mutex_unlock 2.63% [.] __read_nocancel
2.08% [k] sched_clock_local 2.47% [k] pipe_write
2.05% [k] mutex_lock 2.37% [k] _raw_spin_unlock_irqrestore
1.91% [k] pipe_write 1.96% [.] __write_nocancel
1.89% [k] user_enter 1.62% [k] vfs_write
1.72% [k] _raw_spin_unlock_irqrestore 1.54% [k] do_sync_write
1.51% [k] rcu_eqs_enter_common.isra.47 1.52% [k] fget_light
1.41% [k] rcu_eqs_exit_common.isra.45 1.48% [k] fsnotify
1.30% [k] tracesys 1.39% [k] dequeue_task_fair
1.23% [k] fsnotify 1.39% [k] update_curr
1.23% [k] int_check_syscall_exit_worka 1.37% [k] do_sync_read
1.20% [k] fget_light 1.33% [.] __GI___libc_write
1.16% [k] vfs_write 1.29% [k] try_to_wake_up
1.05% [k] syscall_trace_enter 1.09% [k] _raw_spin_lock_irq
1.04% [k] do_sync_write 1.08% [k] pipe_iov_copy_from_user
1.03% [k] context_tracking_task_switch 1.05% [k] finish_task_switch
0.99% [k] update_curr 1.04% [k] prepare_to_wait

marge:/usr/local/src/kernel/linux-3.x.git # diff -up config.horrible config.not-as-bad
--- config.horrible 2013-06-30 08:38:47.785258079 +0200
+++ config.not-as-bad 2013-06-30 08:33:08.246761153 +0200
@@ -48,7 +48,7 @@ CONFIG_BUILDTIME_EXTABLE_SORT=y
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
-CONFIG_LOCALVERSION="-master"
+CONFIG_LOCALVERSION="-masterx"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
@@ -126,7 +126,7 @@ CONFIG_TREE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_RCU_USER_QS=y
-CONFIG_CONTEXT_TRACKING_FORCE=y
+# CONFIG_CONTEXT_TRACKING_FORCE is not set
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set

marge:/usr/local/src/kernel/linux-3.x.git # grep NO_HZ .config
CONFIG_NO_HZ_COMMON=y
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
# CONFIG_RCU_FAST_NO_HZ is not set
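
For readers without the benchmark handy: pipe-test is a scheduler ping-pong, two tasks pinned to one CPU bouncing a byte over pipes, so the switch rates above are essentially the price of a syscall-plus-reschedule round trip. A minimal sketch of the same idea (not the actual tool; the loop count and the switch accounting here are assumptions) would look something like:

/*
 * Minimal pipe ping-pong sketch -- NOT the real pipe-test, just the
 * same idea: two tasks on one CPU bounce a byte through pipes, so
 * every read blocks and forces a context switch.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define LOOPS	1000000L	/* round trips; arbitrary */

int main(void)
{
	int ping[2], pong[2];
	struct timespec t0, t1;
	char c = 0;
	double secs;
	long i;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* child: echo every byte straight back */
		while (read(ping[0], &c, 1) == 1 && write(pong[1], &c, 1) == 1)
			;
		exit(0);
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < LOOPS; i++) {
		write(ping[1], &c, 1);
		read(pong[0], &c, 1);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	/* two switches per round trip when both tasks share a CPU */
	printf("%.1f KHz\n", 2 * LOOPS / secs / 1000.0);
	return 0;
}

Built with gcc -O2 (plus -lrt on older glibc) and run under taskset -c 3 as above, kernel-to-kernel differences show up directly in the reported rate.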


2013-06-30 21:29:52

by Peter Zijlstra

Subject: Re: sched: context tracking demolishes pipe-test

On Sun, Jun 30, 2013 at 09:57:36AM +0200, Mike Galbraith wrote:
> taskset -c 3 pipe-test 1
>
> switch-rate
> 3.10.0-master 426.4 KHz
> 3.10.0-masterx 591.8 Khz 1.387 (591=notgood, 426=omfg)
>
> 3.10.0-master 3.10.0-masterx
> 7.13% [k] __schedule 9.58% [k] __schedule
> ! 6.02% [.] __read_nocancel 8.25% [k] system_call
> ! 5.82% [.] __write_nocancel 5.34% [k] _raw_spin_lock_irqsave
> ! 4.28% [k] system_call 4.26% [k] pipe_read
> ! 4.13% [k] user_exit 3.53% [k] copy_user_generic_string
> 3.64% [k] _raw_spin_lock_irqsave 3.52% [k] native_sched_clock
> 3.37% [k] pipe_read 3.43% [k] __switch_to
> 2.72% [k] copy_user_generic_string 3.21% [.] main
> 2.60% [.] main 2.96% [k] mutex_lock
> 2.59% [k] native_sched_clock 2.89% [k] mutex_unlock
> 2.47% [k] __switch_to 2.86% [k] sched_clock_local
> 2.10% [k] mutex_unlock 2.63% [.] __read_nocancel
> 2.08% [k] sched_clock_local 2.47% [k] pipe_write
> 2.05% [k] mutex_lock 2.37% [k] _raw_spin_unlock_irqrestore
> 1.91% [k] pipe_write 1.96% [.] __write_nocancel
> 1.89% [k] user_enter 1.62% [k] vfs_write
> 1.72% [k] _raw_spin_unlock_irqrestore 1.54% [k] do_sync_write
> 1.51% [k] rcu_eqs_enter_common.isra.47 1.52% [k] fget_light
> 1.41% [k] rcu_eqs_exit_common.isra.45 1.48% [k] fsnotify
> 1.30% [k] tracesys 1.39% [k] dequeue_task_fair
> 1.23% [k] fsnotify 1.39% [k] update_curr
> 1.23% [k] int_check_syscall_exit_worka 1.37% [k] do_sync_read
> 1.20% [k] fget_light 1.33% [.] __GI___libc_write
> 1.16% [k] vfs_write 1.29% [k] try_to_wake_up
> 1.05% [k] syscall_trace_enter 1.09% [k] _raw_spin_lock_irq
> 1.04% [k] do_sync_write 1.08% [k] pipe_iov_copy_from_user
> 1.03% [k] context_tracking_task_switch 1.05% [k] finish_task_switch
> 0.99% [k] update_curr 1.04% [k] prepare_to_wait
>

Yeah, who'd have thought that putting stuff in the syscall path would've
made syscalls more expensive ;-)

But yeah, that's a _lot_ more expensive.. I'd not be surprised if more
people would find that objectionable.
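
A side note, not from the thread: with CONTEXT_TRACKING_FORCE=y the user_exit()/user_enter() probes visible in the profile appear to run on every kernel entry and exit, so the cost is paid per syscall rather than per switch. A crude way to put a per-syscall number on the two configs, without the scheduler in the picture, is just to time a cheap syscall in a loop (hypothetical helper, not part of anyone's patch set):

/* Rough per-syscall cost probe -- compare -master vs -masterx kernels. */
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/syscall.h>

#define LOOPS	10000000L

int main(void)
{
	struct timespec t0, t1;
	double ns;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < LOOPS; i++)
		syscall(SYS_getppid);	/* cheap; raw syscall() bypasses any libc caching */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns/syscall\n", ns / LOOPS);
	return 0;
}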

2013-07-01 06:08:00

by Mike Galbraith

Subject: Re: sched: context tracking demolishes pipe-test

On Sun, 2013-06-30 at 23:29 +0200, Peter Zijlstra wrote:

> Yeah, who'd have thought that putting stuff in the syscall path would've
> made syscalls more expensive ;-)

(careful, you'll injure my innocence, Santa and the Tooth Fairy exist!)

> But yeah, that's a _lot_ more expensive.. I'd not be surprised if more
> people would find that objectionable.

Distros may want some hot patching or _something_ before doing the usual
new=turn-it-on thing. Per trusty (spelled with 'c') old Q6600, the cost
is pretty darn high.

-regress is my ~light regression testing config, carried forward from
2.6.32...master across all intervening trees.
-regressx is plus CONFIG_RCU_USER_QS=y CONFIG_CONTEXT_TRACKING_FORCE=y
-regressxx is plus full dynticks

taskset -c 3 pipe-test 1

2.6.32-regress 654.5 KHz 1.000
3.10.0-regress 652.1 KHz .996 1.000
3.10.0-regressx 476.8 KHz .728 .731
3.10.0-regressxx 275.6 KHz .421 .422

tbench

2.6.32-regress 327.502 MB/sec 1.000
3.10.0-regress 319.621 MB/sec .975 1.000
3.10.0-regressx 292.894 MB/sec .894 .916
3.10.0-regressxx 243.738 MB/sec .744 .762

netperf TCP_RR

2.6.32-regress 104041.84 Trans/sec 1.000
3.10.0-regress 94961.34 Trans/sec .912 1.000
3.10.0-regressx 82383.33 Trans/sec .791 .867
3.10.0-regressxx 61527.59 Trans/sec .591 .647
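
(Reading the unlabeled ratio columns: the first looks to be relative to 2.6.32-regress and the second to 3.10.0-regress, e.g. pipe-test 476.8 / 654.5 ~ .728 and 476.8 / 652.1 ~ .731, so the second column isolates what the context tracking / dynticks options cost on top of an otherwise identical 3.10 kernel.)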

2013-07-01 08:06:49

by Peter Zijlstra

Subject: Re: sched: context tracking demolishes pipe-test

On Mon, Jul 01, 2013 at 08:07:55AM +0200, Mike Galbraith wrote:
> On Sun, 2013-06-30 at 23:29 +0200, Peter Zijlstra wrote:
>
> > Yeah, who'd have thought that putting stuff in the syscall path would've
> > made syscalls more expensive ;-)
>
> (careful, you'll injure my innocence, Santa and the Tooth Fairy exist!)
>
> > But yeah, that's a _lot_ more expensive.. I'd not be surprised if more
> > people would find that objectionable.
>
> Distros may want some hot patching or _something_ before doing the usual
> new=turn-it-on thing. Per trusty (spelled with 'c') old Q6600, the cost
> is pretty darn high.
>
> -regress is my ~light regression testing config, carried forward from
> 3.6.32...master across all intervening trees.
> -regressx is plus CONFIG_RCU_USER_QS=y CONFIG_CONTEXT_TRACKING_FORCE=y
> -regressxx is plus full dynticks
>
> taskset -c 3 pipe-test 1
>
> 2.6.32-regress 654.5 Khz 1.000
> 3.10.0-regress 652.1 Khz .996 1.000
> 3.10.0-regressx 476.8 Khz .728 .731
> 3.10.0-regressxx 275.6 Khz .421 .422
>
> tbench
>
> 2.6.32-regress 327.502 MB/sec 1.000
> 3.10.0-regress 319.621 MB/sec .975 1.000
> 3.10.0-regressx 292.894 MB/sec .894 .916
> 3.10.0-regressxx 243.738 MB/sec .744 .762
>
> netperf TCP_RR
>
> 2.6.32-regress 104041.84 Trans/sec 1.000
> 3.10.0-regress 94961.34 Trans/sec .912 1.000
> 3.10.0-regressx 82383.33 Trans/sec .791 .867
> 3.10.0-regressxx 61527.59 Trans/sec .591 .647

So aside from the context tracking stuff, there's still a regression
we might want to look at. That's still a ~10% drop against 2.6.32 for
TCP_RR and a few percent for tbench.

2013-07-01 09:12:29

by Mike Galbraith

Subject: Re: sched: context tracking demolishes pipe-test

On Mon, 2013-07-01 at 10:06 +0200, Peter Zijlstra wrote:

> So aside from the context tracking stuff, there's still a regression
> we might want to look at. That's still a ~10% drop against 2.6.32 for
> TCP_RR and few percents for tbench.

Yeah, known, and some of it's ours.

-Mike

2013-07-01 09:21:05

by Mike Galbraith

Subject: Re: sched: context tracking demolishes pipe-test

On Mon, 2013-07-01 at 11:12 +0200, Mike Galbraith wrote:
> On Mon, 2013-07-01 at 10:06 +0200, Peter Zijlstra wrote:
>
> > So aside from the context tracking stuff, there's still a regression
> > we might want to look at. That's still a ~10% drop against 2.6.32 for
> > TCP_RR and few percents for tbench.
>
> Yeah, known, and some of it's ours.

(btw tbench has a ~5% phase-of-moon jitter, you can pretty much
disregard that one)

2013-07-02 04:03:53

by Mike Galbraith

Subject: Re: sched: context tracking demolishes pipe-test

On Mon, 2013-07-01 at 11:20 +0200, Mike Galbraith wrote:
> On Mon, 2013-07-01 at 11:12 +0200, Mike Galbraith wrote:
> > On Mon, 2013-07-01 at 10:06 +0200, Peter Zijlstra wrote:
> >
> > > So aside from the context tracking stuff, there's still a regression
> > > we might want to look at. That's still a ~10% drop against 2.6.32 for
> > > TCP_RR and few percents for tbench.
> >
> > Yeah, known, and some of it's ours.
>
> (btw tbench has a ~5% phase-of-moon jitter, you can pretty much
> disregard that one)

Hm. Seems we don't own much of the TCP_RR regression after all; somewhere
along the line while my silly-tester hat was moldering, we got some
cycles back.. in the light config case anyway.

With wakeup granularity set to zero, per pipe-test the scheduler is within
run-to-run variance of .32, sometimes appearing a tad lighter, though usually
a wee bit heavier. The TCP_RR throughput delta does not correlate.

echo 0 > sched_wakeup_granularity_ns

pipe-test
2.6.32-regress 689.8 KHz 1.000
3.10.0-regress 682.5 KHz .989

netperf TCP_RR
2.6.32-regress 117910.11 Trans/sec 1.000
3.10.0-regress 96955.12 Trans/sec .822

It should be closer than this.

3.10.0-regress 2.6.32-regress
3.85% [kernel] [k] tcp_ack 4.04% [kernel] [k] tcp_sendmsg
3.34% [kernel] [k] __schedule 3.63% [kernel] [k] schedule
2.93% [kernel] [k] tcp_sendmsg 2.86% [kernel] [k] tcp_recvmsg
2.54% [kernel] [k] tcp_rcv_established 2.83% [kernel] [k] tcp_ack
2.26% [kernel] [k] tcp_transmit_skb 2.19% [kernel] [k] system_call
1.90% [kernel] [k] __netif_receive_skb_core 2.16% [kernel] [k] tcp_transmit_skb
1.87% [kernel] [k] tcp_v4_rcv 2.07% libc-2.14.1.so [.] __libc_recv
1.84% [kernel] [k] tcp_write_xmit 1.95% [kernel] [k] _spin_lock_bh
1.70% [kernel] [k] __switch_to 1.89% libc-2.14.1.so [.] __libc_send
1.57% [kernel] [k] tcp_recvmsg 1.77% [kernel] [k] tcp_rcv_established
1.54% [kernel] [k] _raw_spin_lock_bh 1.70% [kernel] [k] netif_receive_skb
1.52% libc-2.14.1.so [.] __libc_recv 1.61% [kernel] [k] tcp_v4_rcv
1.43% [kernel] [k] ip_rcv 1.49% [kernel] [k] native_sched_clock
1.35% [kernel] [k] local_bh_enable 1.49% [kernel] [k] tcp_write_xmit
1.33% [kernel] [k] _raw_spin_lock_irqsave 1.46% [kernel] [k] __switch_to
1.26% [kernel] [k] ip_queue_xmit 1.35% [kernel] [k] dev_queue_xmit
1.16% [kernel] [k] __inet_lookup_established 1.29% [kernel] [k] __alloc_skb
1.14% [kernel] [k] mod_timer 1.27% [kernel] [k] skb_release_data
1.13% [kernel] [k] process_backlog 1.26% netserver [.] recv_tcp_rr
1.13% [kernel] [k] read_tsc 1.22% [kernel] [k] local_bh_enable
1.13% libc-2.14.1.so [.] __libc_send 1.18% netperf [.] send_tcp_rr
1.12% [kernel] [k] system_call 1.18% [kernel] [k] sched_clock_local
1.07% [kernel] [k] tcp_event_data_recv 1.11% [kernel] [k] copy_user_generic_string
1.04% [kernel] [k] ip_finish_output 1.07% [kernel] [k] _spin_lock_irqsave

-Mike

2013-07-02 07:20:08

by Mike Galbraith

Subject: Re: sched: context tracking demolishes pipe-test

CCs net-wizards since staring at profile/annotation isn't helping, and I
know all too well how these things laugh perversely at bisection.

On Tue, 2013-07-02 at 06:03 +0200, Mike Galbraith wrote:
> On Mon, 2013-07-01 at 11:20 +0200, Mike Galbraith wrote:
> > On Mon, 2013-07-01 at 11:12 +0200, Mike Galbraith wrote:
> > > On Mon, 2013-07-01 at 10:06 +0200, Peter Zijlstra wrote:
> > >
> > > > So aside from the context tracking stuff, there's still a regression
> > > > we might want to look at. That's still a ~10% drop against 2.6.32 for
> > > > TCP_RR and few percents for tbench.
> > >
> > > Yeah, known, and some of it's ours.
> >
> > (btw tbench has a ~5% phase-of-moon jitter, you can pretty much
> > disregard that one)
>
> Hm. Seems we don't own much of TCP_RR regression after all, somewhere
> along the line while my silly-tester hat was moldering, we got some
> cycles back.. in the light config case anyway.
>
> With wakeup granularity set to zero, per pipe-test, scheduler is within
> variance of .32, sometimes appearing a tad lighter, though usually a wee
> bit heavier. TCP_RR throughput delta does not correlate.
>
> echo 0 > sched_wakeup_granularity_ns
>
> pipe-test
> 2.6.32-regress 689.8 Khz 1.000
> 3.10.0-regress 682.5 Khz .989
>
> netperf TCP_RR
> 2.6.32-regress 117910.11 Trans/sec 1.000
> 3.10.0-regress 96955.12 Trans/sec .822
>
> It should be closer than this.
>
> 3.10.0-regress 2.6.32-regress
> 3.85% [kernel] [k] tcp_ack 4.04% [kernel] [k] tcp_sendmsg
> 3.34% [kernel] [k] __schedule 3.63% [kernel] [k] schedule
> 2.93% [kernel] [k] tcp_sendmsg 2.86% [kernel] [k] tcp_recvmsg
> 2.54% [kernel] [k] tcp_rcv_established 2.83% [kernel] [k] tcp_ack
> 2.26% [kernel] [k] tcp_transmit_skb 2.19% [kernel] [k] system_call
> 1.90% [kernel] [k] __netif_receive_skb_core 2.16% [kernel] [k] tcp_transmit_skb
> 1.87% [kernel] [k] tcp_v4_rcv 2.07% libc-2.14.1.so [.] __libc_recv
> 1.84% [kernel] [k] tcp_write_xmit 1.95% [kernel] [k] _spin_lock_bh
> 1.70% [kernel] [k] __switch_to 1.89% libc-2.14.1.so [.] __libc_send
> 1.57% [kernel] [k] tcp_recvmsg 1.77% [kernel] [k] tcp_rcv_established
> 1.54% [kernel] [k] _raw_spin_lock_bh 1.70% [kernel] [k] netif_receive_skb
> 1.52% libc-2.14.1.so [.] __libc_recv 1.61% [kernel] [k] tcp_v4_rcv
> 1.43% [kernel] [k] ip_rcv 1.49% [kernel] [k] native_sched_clock
> 1.35% [kernel] [k] local_bh_enable 1.49% [kernel] [k] tcp_write_xmit
> 1.33% [kernel] [k] _raw_spin_lock_irqsave 1.46% [kernel] [k] __switch_to
> 1.26% [kernel] [k] ip_queue_xmit 1.35% [kernel] [k] dev_queue_xmit
> 1.16% [kernel] [k] __inet_lookup_established 1.29% [kernel] [k] __alloc_skb
> 1.14% [kernel] [k] mod_timer 1.27% [kernel] [k] skb_release_data
> 1.13% [kernel] [k] process_backlog 1.26% netserver [.] recv_tcp_rr
> 1.13% [kernel] [k] read_tsc 1.22% [kernel] [k] local_bh_enable
> 1.13% libc-2.14.1.so [.] __libc_send 1.18% netperf [.] send_tcp_rr
> 1.12% [kernel] [k] system_call 1.18% [kernel] [k] sched_clock_local
> 1.07% [kernel] [k] tcp_event_data_recv 1.11% [kernel] [k] copy_user_generic_string
> 1.04% [kernel] [k] ip_finish_output 1.07% [kernel] [k] _spin_lock_irqsave
>
> -Mike
>