Subject: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

We had to insert preempt enable/disable calls into the fastpath a while
ago. This was mainly due to the amount of state that has to be kept
consistent while allocating from the per-cpu freelist. In particular, the
page field is not covered by the this_cpu_cmpxchg used in the fastpath to
make the atomic state change for fastpath allocation and freeing.

This patch removes the need for the page field to describe the state of the
per cpu list. The freelist pointer can be used to determine the page struct
address if necessary.

However, this currently does not work for the termination value of a list,
which is NULL and the same for all slab pages. If we instead use a valid
pointer into the page with its lowest bit set, then every freelist pointer
can be used to determine the address of the page struct, and the page field
is no longer needed in the per-cpu area for a slab. Testing for the end of
the list becomes a test of whether the lowest bit is set.

So the first patch changes the freelist termination pointer to do just
that. The second removes the page field, and the third can then remove the
preempt enable/disable.

Removing the ->page field reduces the cache footprint of the fastpath, so
hopefully overall allocator effectiveness will increase further. Also, RT
uses full preemption, which means that currently pretty expensive code has
to be inserted into the fastpath. This approach allows the removal of that
code and a corresponding performance increase.

For V1, a number of changes were made relative to the RFC to avoid the
overhead of virt_to_page and page_address.

Slab Benchmarks on a kernel with CONFIG_PREEMPT show an improvement of
20%-50% of fastpath latency:

Before:

Single thread testing
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 68 cycles kfree -> 107 cycles
10000 times kmalloc(16) -> 69 cycles kfree -> 108 cycles
10000 times kmalloc(32) -> 78 cycles kfree -> 112 cycles
10000 times kmalloc(64) -> 97 cycles kfree -> 112 cycles
10000 times kmalloc(128) -> 111 cycles kfree -> 119 cycles
10000 times kmalloc(256) -> 114 cycles kfree -> 139 cycles
10000 times kmalloc(512) -> 110 cycles kfree -> 142 cycles
10000 times kmalloc(1024) -> 114 cycles kfree -> 156 cycles
10000 times kmalloc(2048) -> 155 cycles kfree -> 174 cycles
10000 times kmalloc(4096) -> 203 cycles kfree -> 209 cycles
10000 times kmalloc(8192) -> 361 cycles kfree -> 265 cycles
10000 times kmalloc(16384) -> 597 cycles kfree -> 286 cycles

2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 114 cycles
10000 times kmalloc(16)/kfree -> 115 cycles
10000 times kmalloc(32)/kfree -> 117 cycles
10000 times kmalloc(64)/kfree -> 115 cycles
10000 times kmalloc(128)/kfree -> 111 cycles
10000 times kmalloc(256)/kfree -> 116 cycles
10000 times kmalloc(512)/kfree -> 110 cycles
10000 times kmalloc(1024)/kfree -> 114 cycles
10000 times kmalloc(2048)/kfree -> 110 cycles
10000 times kmalloc(4096)/kfree -> 107 cycles
10000 times kmalloc(8192)/kfree -> 108 cycles
10000 times kmalloc(16384)/kfree -> 706 cycles


After:


Single thread testing
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 41 cycles kfree -> 81 cycles
10000 times kmalloc(16) -> 47 cycles kfree -> 88 cycles
10000 times kmalloc(32) -> 48 cycles kfree -> 93 cycles
10000 times kmalloc(64) -> 58 cycles kfree -> 89 cycles
10000 times kmalloc(128) -> 84 cycles kfree -> 104 cycles
10000 times kmalloc(256) -> 92 cycles kfree -> 125 cycles
10000 times kmalloc(512) -> 86 cycles kfree -> 129 cycles
10000 times kmalloc(1024) -> 88 cycles kfree -> 125 cycles
10000 times kmalloc(2048) -> 120 cycles kfree -> 159 cycles
10000 times kmalloc(4096) -> 176 cycles kfree -> 183 cycles
10000 times kmalloc(8192) -> 294 cycles kfree -> 233 cycles
10000 times kmalloc(16384) -> 585 cycles kfree -> 291 cycles

2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 100 cycles
10000 times kmalloc(16)/kfree -> 108 cycles
10000 times kmalloc(32)/kfree -> 101 cycles
10000 times kmalloc(64)/kfree -> 109 cycles
10000 times kmalloc(128)/kfree -> 125 cycles
10000 times kmalloc(256)/kfree -> 60 cycles
10000 times kmalloc(512)/kfree -> 60 cycles
10000 times kmalloc(1024)/kfree -> 67 cycles
10000 times kmalloc(2048)/kfree -> 60 cycles
10000 times kmalloc(4096)/kfree -> 65 cycles
10000 times kmalloc(8192)/kfree -> 60 cycles
10000 times kmalloc(16384)/kfree -> 686 cycles


2014-12-11 13:35:45

by Jesper Dangaard Brouer

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Wed, 10 Dec 2014 10:30:17 -0600
Christoph Lameter <[email protected]> wrote:

[...]
>
> Slab Benchmarks on a kernel with CONFIG_PREEMPT show an improvement of
> 20%-50% of fastpath latency:
>
> Before:
>
> Single thread testing
[...]
> 2. Kmalloc: alloc/free test
[...]
> 10000 times kmalloc(256)/kfree -> 116 cycles
[...]
>
>
> After:
>
> Single thread testing
[...]
> 2. Kmalloc: alloc/free test
[...]
> 10000 times kmalloc(256)/kfree -> 60 cycles
[...]

It looks like an impressive saving 116 -> 60 cycles. I just don't see
the same kind of improvements with my similar tests[1][2].

My test[1] is just a fast-path loop over kmem_cache_alloc+free on
256-byte objects. (Results are after explicitly inlining the new
function is_pointer_to_page().)

baseline: 47 cycles(tsc) 19.032 ns
patchset: 45 cycles(tsc) 18.135 ns

I do see the improvement, but it is not as high as I would have expected.

(CPU E5-2695)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_kmem_cache1.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/qmempool_bench.c

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:

> It looks like an impressive saving 116 -> 60 cycles. I just don't see
> the same kind of improvements with my similar tests[1][2].

This is particularly for a CONFIG_PREEMPT kernel. There will be no effect
on !CONFIG_PREEMPT I hope.

> I do see the improvement, but it is not as high as I would have expected.

Do you have CONFIG_PREEMPT set?

2014-12-11 16:51:23

by Jesper Dangaard Brouer

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Thu, 11 Dec 2014 09:03:24 -0600 (CST)
Christoph Lameter <[email protected]> wrote:

> On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:
>
> > It looks like an impressive saving 116 -> 60 cycles. I just don't see
> > the same kind of improvements with my similar tests[1][2].
>
> This is particularly for a CONFIG_PREEMPT kernel. There will be no effect
> on !CONFIG_PREEMPT I hope.
>
> > I do see the improvement, but it is not as high as I would have expected.
>
> Do you have CONFIG_PREEMPT set?

Yes.

$ grep CONFIG_PREEMPT .config
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y

Full config here:
http://people.netfilter.org/hawk/kconfig/config01-slub-fastpath01

I was expecting to see at least (specifically) a 4.291 ns improvement, as
this is the measured[1] cost of preempt_{disable,enable} on my system.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:

> I was expecting to see at least (specifically) a 4.291 ns improvement, as
> this is the measured[1] cost of preempt_{disable,enable} on my system.

Right. Those calls are taken out of the fastpaths by this patchset for
the CONFIG_PREEMPT case. So the numbers that you got do not make much
sense to me.

2014-12-11 18:20:07

by Jesper Dangaard Brouer

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1


Warning, I'm getting crashes with this patchset, during my network load testing.
I don't have a nice crash dump to show, yet, but it is in the slub code.


2014-12-11 19:14:53

by Jesper Dangaard Brouer

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Thu, 11 Dec 2014 11:18:31 -0600 (CST)
Christoph Lameter <[email protected]> wrote:

> On Thu, 11 Dec 2014, Jesper Dangaard Brouer wrote:
>
> > I was expecting to see at least (specifically) a 4.291 ns improvement, as
> > this is the measured[1] cost of preempt_{disable,enable} on my system.
>
> Right. Those calls are taken out of the fastpaths by this patchset for
> the CONFIG_PREEMPT case. So the numbers that you got do not make much
> sense to me.

True, that is also what I'm saying. I'll try to figure out what is
going on tomorrow.

You are welcome to run my test harness:
http://netoptimizer.blogspot.dk/2014/11/announce-github-repo-prototype-kernel.html
https://github.com/netoptimizer/prototype-kernel/blob/master/getting_started.rst

Just load module: time_bench_kmem_cache1
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_kmem_cache1.c


2014-12-12 10:39:42

by Jesper Dangaard Brouer

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Thu, 11 Dec 2014 18:37:58 +0100
Jesper Dangaard Brouer <[email protected]> wrote:

> Warning, I'm getting crashes with this patchset, during my network load testing.
> I don't have a nice crash dump to show, yet, but it is in the slub code.

Crash/OOM during an IP-forwarding network overload test[1] with pktgen,
using a single flow and thus activating a single CPU on the target
(device under test).

Testing was done with net-next at commit 52c9b12d380, with the patchset
applied. Baseline testing was done without the patchset.

[1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen02_burst.sh

[ 135.258503] console [netcon0] enabled
[ 164.970377] ixgbe 0000:04:00.0 eth4: detected SFP+: 5
[ 165.078455] ixgbe 0000:04:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: None
[ 165.266662] ixgbe 0000:04:00.1 eth5: detected SFP+: 6
[ 165.396958] ixgbe 0000:04:00.1 eth5: NIC Link is Up 10 Gbps, Flow Control: None
[...]
[ 290.298350] ksoftirqd/11: page allocation failure: order:0, mode:0x20
[ 290.298632] CPU: 11 PID: 64 Comm: ksoftirqd/11 Not tainted 3.18.0-rc7-net-next+ #852
[ 290.299109] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[ 290.299377] 0000000000000000 ffff88046c4eba28 ffffffff8164f6f2 ffff88047fd6d1a0
[ 290.300169] 0000000000000020 ffff88046c4ebab8 ffffffff8111d241 0000000000000000
[ 290.300833] 0000003000000000 ffff88047ffd9b38 ffff880003d86400 0000000000000040
[ 290.301496] Call Trace:
[ 290.301763] [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[ 290.302035] [<ffffffff8111d241>] warn_alloc_failed+0xd1/0x130
[ 290.302307] [<ffffffff81120ced>] __alloc_pages_nodemask+0x71d/0xa80
[ 290.302572] [<ffffffff81536b70>] __alloc_page_frag+0x130/0x150
[ 290.302840] [<ffffffff8153b63e>] __alloc_rx_skb+0x5e/0x110
[ 290.303112] [<ffffffff8153b74d>] __napi_alloc_skb+0x1d/0x40
[ 290.303383] [<ffffffffa00f15b1>] ixgbe_clean_rx_irq+0xf1/0x8e0 [ixgbe]
[ 290.303655] [<ffffffffa00f2a7d>] ixgbe_poll+0x41d/0x7c0 [ixgbe]
[ 290.303920] [<ffffffff8154817c>] net_rx_action+0x14c/0x270
[ 290.304185] [<ffffffff8107ad7a>] __do_softirq+0x10a/0x220
[ 290.304455] [<ffffffff8107aeb0>] run_ksoftirqd+0x20/0x50
[ 290.304724] [<ffffffff810962e9>] smpboot_thread_fn+0x159/0x270
[ 290.304991] [<ffffffff81096190>] ? SyS_setgroups+0x180/0x180
[ 290.305260] [<ffffffff81092846>] kthread+0xd6/0xf0
[ 290.305525] [<ffffffff81092770>] ? kthread_create_on_node+0x170/0x170
[ 290.305568] rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[ 290.305570] rsyslogd cpuset=/ mems_allowed=0
[ 290.306534] [<ffffffff81656a2c>] ret_from_fork+0x7c/0xb0
[ 290.306800] [<ffffffff81092770>] ? kthread_create_on_node+0x170/0x170
[ 290.307068] CPU: 1 PID: 2264 Comm: rsyslogd Not tainted 3.18.0-rc7-net-next+ #852
[ 290.307553] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[ 290.307823] 0000000000000000 ffff88045248f8f8 ffffffff8164f6f2 0000000012a112a1
[ 290.308480] 0000000000000000 ffff88045248f978 ffffffff8164c061 ffff88045248f958
[ 290.309137] ffffffff810bd1e9 ffff88045248fa18 ffffffff8112a42b ffff88045248f948
[ 290.309805] Call Trace:
[ 290.310064] [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[ 290.310326] [<ffffffff8164c061>] dump_header.isra.8+0x96/0x201
[ 290.310593] [<ffffffff810bd1e9>] ? rcu_oom_notify+0xd9/0xf0
[ 290.310863] [<ffffffff8112a42b>] ? shrink_zones+0x25b/0x390
[ 290.316403] [<ffffffff8111b4c2>] oom_kill_process+0x202/0x370
[ 290.316672] [<ffffffff8107ee72>] ? has_capability_noaudit+0x12/0x20
[ 290.316943] [<ffffffff8111bcbe>] out_of_memory+0x4ee/0x530
[ 290.317212] [<ffffffff8112100e>] __alloc_pages_nodemask+0xa3e/0xa80
[ 290.317480] [<ffffffff8115d5a7>] alloc_pages_current+0x97/0x120
[ 290.317749] [<ffffffff81117dc7>] __page_cache_alloc+0xa7/0xd0
[ 290.318010] [<ffffffff8111a387>] filemap_fault+0x1c7/0x400
[ 290.318278] [<ffffffff8113da06>] __do_fault+0x36/0xd0
[ 290.318544] [<ffffffff8113fc8f>] do_read_fault.isra.78+0x1bf/0x2c0
[ 290.318815] [<ffffffff810ae1c0>] ? autoremove_wake_function+0x40/0x40
[ 290.319083] [<ffffffff8114128e>] handle_mm_fault+0x67e/0xc20
[ 290.319346] [<ffffffff81042dba>] __do_page_fault+0x18a/0x5a0
[ 290.319610] [<ffffffff810ae180>] ? abort_exclusive_wait+0xa0/0xa0
[ 290.319877] [<ffffffff810431dc>] do_page_fault+0xc/0x10
[ 290.320142] [<ffffffff81658062>] page_fault+0x22/0x30
[ 290.320441] Mem-Info:
[ 290.320703] Node 0 DMA per-cpu:
[ 290.321011] CPU 0: hi: 0, btch: 1 usd: 0
[ 290.321272] CPU 1: hi: 0, btch: 1 usd: 0
[ 290.321532] CPU 2: hi: 0, btch: 1 usd: 0
[ 290.321792] CPU 3: hi: 0, btch: 1 usd: 0
[ 290.322055] CPU 4: hi: 0, btch: 1 usd: 0
[ 290.322319] CPU 5: hi: 0, btch: 1 usd: 0
[ 290.322581] CPU 6: hi: 0, btch: 1 usd: 0
[ 290.322845] CPU 7: hi: 0, btch: 1 usd: 0
[ 290.323108] CPU 8: hi: 0, btch: 1 usd: 0
[ 290.323367] CPU 9: hi: 0, btch: 1 usd: 0
[ 290.323625] CPU 10: hi: 0, btch: 1 usd: 0
[ 290.323885] CPU 11: hi: 0, btch: 1 usd: 0
[ 290.324143] Node 0 DMA32 per-cpu:
[ 290.324445] CPU 0: hi: 186, btch: 31 usd: 0
[ 290.324704] CPU 1: hi: 186, btch: 31 usd: 0
[ 290.324962] CPU 2: hi: 186, btch: 31 usd: 0
[ 290.325227] CPU 3: hi: 186, btch: 31 usd: 0
[ 290.325488] CPU 4: hi: 186, btch: 31 usd: 0
[ 290.325753] CPU 5: hi: 186, btch: 31 usd: 0
[ 290.326016] CPU 6: hi: 186, btch: 31 usd: 0
[ 290.326279] CPU 7: hi: 186, btch: 31 usd: 0
[ 290.326546] CPU 8: hi: 186, btch: 31 usd: 0
[ 290.326811] CPU 9: hi: 186, btch: 31 usd: 0
[ 290.327075] CPU 10: hi: 186, btch: 31 usd: 0
[ 290.327344] CPU 11: hi: 186, btch: 31 usd: 0
[ 290.327609] Node 0 Normal per-cpu:
[ 290.327916] CPU 0: hi: 186, btch: 31 usd: 25
[ 290.328179] CPU 1: hi: 186, btch: 31 usd: 0
[ 290.328444] CPU 2: hi: 186, btch: 31 usd: 0
[ 290.328708] CPU 3: hi: 186, btch: 31 usd: 0
[ 290.328969] CPU 4: hi: 186, btch: 31 usd: 0
[ 290.329230] CPU 5: hi: 186, btch: 31 usd: 0
[ 290.329491] CPU 6: hi: 186, btch: 31 usd: 0
[ 290.329753] CPU 7: hi: 186, btch: 31 usd: 0
[ 290.330014] CPU 8: hi: 186, btch: 31 usd: 0
[ 290.330275] CPU 9: hi: 186, btch: 31 usd: 0
[ 290.330536] CPU 10: hi: 186, btch: 31 usd: 0
[ 290.330801] CPU 11: hi: 186, btch: 31 usd: 0
[ 290.331066] active_anon:109 inactive_anon:0 isolated_anon:0
[ 290.331066] active_file:132 inactive_file:104 isolated_file:0
[ 290.331066] unevictable:2141 dirty:0 writeback:0 unstable:0
[ 290.331066] free:26484 slab_reclaimable:3264 slab_unreclaimable:3985491
[ 290.331066] mapped:1957 shmem:17 pagetables:618 bounce:0
[ 290.331066] free_cma:0
[ 290.332411] Node 0 DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 290.333825] lowmem_reserve[]: 0 1917 15995 15995
[ 290.334317] Node 0 DMA32 free:64740kB min:8092kB low:10112kB high:12136kB active_anon:296kB inactive_anon:0kB active_file:136kB inactive_file:32kB unevictable:1940kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:1956kB dirty:0kB writeback:0kB mapped:1516kB shmem:0kB slab_reclaimable:1436kB slab_unreclaimable:1864332kB kernel_stack:144kB pagetables:460kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:25708 all_unreclaimable? yes
[ 290.335947] lowmem_reserve[]: 0 0 14077 14077
[ 290.336439] Node 0 Normal free:24532kB min:59424kB low:74280kB high:89136kB active_anon:140kB inactive_anon:0kB active_file:392kB inactive_file:384kB unevictable:6624kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:6624kB dirty:0kB writeback:0kB mapped:6312kB shmem:68kB slab_reclaimable:11620kB slab_unreclaimable:14078392kB kernel_stack:2864kB pagetables:2012kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 290.338061] lowmem_reserve[]: 0 0 0 0
[ 290.338546] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15856kB
[ 290.339996] Node 0 DMA32: 473*4kB (UEM) 221*8kB (UEM) 116*16kB (UEM) 86*32kB (UEM) 55*64kB (UEM) 24*128kB (UEM) 12*256kB (UM) 3*512kB (EM) 1*1024kB (E) 2*2048kB (UR) 10*4096kB (MR) = 65548kB
[ 290.341804] Node 0 Normal: 994*4kB (UEM) 577*8kB (EM) 203*16kB (EM) 113*32kB (UEM) 47*64kB (UEM) 13*128kB (UM) 4*256kB (UM) 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 25248kB
[ 290.343466] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 290.343947] 2081 total pagecache pages
[ 290.344210] 13 pages in swap cache
[ 290.344473] Swap cache stats: add 4436, delete 4423, find 5/8
[ 290.344739] Free swap = 8198904kB
[ 290.345000] Total swap = 8216572kB
[ 290.345262] 4184707 pages RAM
[ 290.345523] 0 pages HighMem/MovableOnly
[ 290.345788] 85688 pages reserved
[ 290.346049] 0 pages hwpoisoned
[ 290.346307] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 290.346788] [ 680] 0 680 2678 264 9 107 -1000 udevd
[ 290.347267] [ 1833] 0 1833 10161 0 24 70 0 monitor
[ 290.347750] [ 1834] 0 1834 10196 517 27 131 0 ovsdb-server
[ 290.348230] [ 1844] 0 1844 10299 50 24 67 0 monitor
[ 290.348711] [ 1845] 0 1845 10338 2114 41 0 0 ovs-vswitchd
[ 290.349194] [ 2261] 0 2261 62333 386 22 139 0 rsyslogd
[ 290.349676] [ 2293] 81 2293 5366 344 13 69 0 dbus-daemon
[ 290.350157] [ 2315] 68 2315 9070 403 29 313 0 hald
[ 290.350632] [ 2316] 0 2316 5097 339 23 45 0 hald-runner
[ 290.351111] [ 2345] 0 2345 5627 0 25 42 0 hald-addon-inpu
[ 290.351594] [ 2354] 68 2354 4498 339 20 37 0 hald-addon-acpi
[ 290.352069] [ 2363] 0 2363 2677 256 9 106 -1000 udevd
[ 290.352540] [ 2471] 0 2471 30430 129 18 558 0 pmqos-static.py
[ 290.353011] [ 2486] 0 2486 16672 368 33 179 -1000 sshd
[ 290.353481] [ 2497] 0 2497 44314 550 61 1064 0 tuned
[ 290.353956] [ 2511] 0 2511 29328 363 16 152 0 crond
[ 290.354430] [ 2528] 0 2528 5400 0 14 46 0 atd
[ 290.354906] [ 2541] 0 2541 26020 228 12 28 0 rhsmcertd
[ 290.355386] [ 2562] 0 2562 1031 308 9 18 0 mingetty
[ 290.355858] [ 2564] 0 2564 1031 308 9 18 0 mingetty
[ 290.356336] [ 2566] 0 2566 1031 308 9 17 0 mingetty
[ 290.356813] [ 2568] 0 2568 1031 308 9 18 0 mingetty
[ 290.357291] [ 2570] 0 2570 1031 308 9 18 0 mingetty
[ 290.357766] [ 2571] 0 2571 2677 256 9 106 -1000 udevd
[ 290.358245] [ 2573] 0 2573 1031 308 9 18 0 mingetty
[ 290.358719] [ 2576] 0 2576 25109 985 52 212 0 sshd
[ 290.359196] [ 2598] 500 2598 25109 695 50 235 0 sshd
[ 290.359673] [ 2611] 500 2611 27820 348 19 806 0 bash
[ 290.360147] Out of memory: Kill process 1845 (ovs-vswitchd) score 0 or sacrifice child
[ 290.360624] Killed process 1845 (ovs-vswitchd) total-vm:41352kB, anon-rss:732kB, file-rss:7724kB
[ 290.450766] ksoftirqd/11: page allocation failure: order:0, mode:0x204020
[ 290.451031] ksoftirqd/11: page allocation failure: order:0, mode:0x204020
[ 290.451033] CPU: 11 PID: 64 Comm: ksoftirqd/11 Not tainted 3.18.0-rc7-net-next+ #852
[ 290.451033] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[ 290.451034] 0000000000000000 ffff88046c4eb2e8 ffffffff8164f6f2 0000000014801480
[ 290.451035] 0000000000204020 ffff88046c4eb378 ffffffff8111d241 0000000000000000
[ 290.451036] 0000003000000000 ffff88047ffd9b38 0000000000000001 0000000000000040
[ 290.451037] Call Trace:
[ 290.451040] [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[ 290.451042] [<ffffffff8111d241>] warn_alloc_failed+0xd1/0x130
[ 290.451045] [<ffffffff81137c89>] ? compaction_suitable+0x19/0x20
[ 290.451046] [<ffffffff81120ced>] __alloc_pages_nodemask+0x71d/0xa80
[ 290.451049] [<ffffffff81348aea>] ? vsnprintf+0x3ba/0x590
[ 290.451052] [<ffffffff8115d5a7>] alloc_pages_current+0x97/0x120
[ 290.451054] [<ffffffff8116563d>] new_slab+0x2ad/0x310
[ 290.451056] [<ffffffff811660e7>] __slab_alloc.isra.63+0x207/0x4d0
[ 290.451057] [<ffffffff8116645b>] kmem_cache_alloc_node+0xab/0x110
[ 290.451059] [<ffffffff81536e47>] __alloc_skb+0x47/0x1d0
[ 290.451063] [<ffffffff8138f5a1>] ? vgacon_set_cursor_size.isra.7+0xa1/0x120
[ 290.451066] [<ffffffff815636c4>] netpoll_send_udp+0x84/0x3f0
[ 290.451068] [<ffffffffa028b8bf>] write_msg+0xcf/0x140 [netconsole]
[ 290.451070] [<ffffffff810b3edb>] call_console_drivers.constprop.24+0x9b/0xa0
[ 290.451071] [<ffffffff810b452d>] console_unlock+0x36d/0x450
[ 290.451072] [<ffffffff810b4960>] vprintk_emit+0x350/0x570
[ 290.451073] [<ffffffff8164be24>] printk+0x5c/0x5e
[ 290.451075] [<ffffffff8111d23c>] warn_alloc_failed+0xcc/0x130
[ 290.451077] [<ffffffff8154987c>] ? dev_hard_start_xmit+0x16c/0x320
[ 290.451079] [<ffffffff81120ced>] __alloc_pages_nodemask+0x71d/0xa80
[ 290.451081] [<ffffffff81567c22>] ? sch_direct_xmit+0x112/0x220
[ 290.451083] [<ffffffff8115d5a7>] alloc_pages_current+0x97/0x120
[ 290.451084] [<ffffffff8116563d>] new_slab+0x2ad/0x310
[ 290.451085] [<ffffffff811660e7>] __slab_alloc.isra.63+0x207/0x4d0
[ 292.302602] hald-addon-acpi invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[ 292.303094] hald-addon-acpi cpuset=/ mems_allowed=0
[ 292.303456] CPU: 4 PID: 2354 Comm: hald-addon-acpi Not tainted 3.18.0-rc7-net-next+ #852
[ 292.303939] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[ 292.304209] 0000000000000000 ffff88044f2af8f8 ffffffff8164f6f2 00000000000038ce
[ 292.304884] 0000000000000000 ffff88044f2af978 ffffffff8164c061 ffff88044f2af958
[ 292.305560] ffffffff810bd1e9 ffff88044f2afa18 ffffffff8112a42b ffff88044f2af948
[ 292.306231] Call Trace:
[ 292.306497] [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[ 292.306768] [<ffffffff8164c061>] dump_header.isra.8+0x96/0x201
[ 292.307038] [<ffffffff810bd1e9>] ? rcu_oom_notify+0xd9/0xf0
[ 292.307305] [<ffffffff8112a42b>] ? shrink_zones+0x25b/0x390
[ 292.307577] [<ffffffff8111b4c2>] oom_kill_process+0x202/0x370
[ 292.307846] [<ffffffff8107ee72>] ? has_capability_noaudit+0x12/0x20
[ 292.308117] [<ffffffff8111bcbe>] out_of_memory+0x4ee/0x530
[ 292.308386] [<ffffffff8112100e>] __alloc_pages_nodemask+0xa3e/0xa80
[ 292.308659] [<ffffffff8115d5a7>] alloc_pages_current+0x97/0x120
[ 292.308930] [<ffffffff81117dc7>] __page_cache_alloc+0xa7/0xd0
[ 292.309200] [<ffffffff8111a387>] filemap_fault+0x1c7/0x400
[ 292.309470] [<ffffffff8113da06>] __do_fault+0x36/0xd0
[ 292.309740] [<ffffffff8113fc8f>] do_read_fault.isra.78+0x1bf/0x2c0
[ 292.310010] [<ffffffff8114128e>] handle_mm_fault+0x67e/0xc20
[ 292.310280] [<ffffffff810970d5>] ? finish_task_switch+0x45/0xf0
[ 292.310551] [<ffffffff81042dba>] __do_page_fault+0x18a/0x5a0
[ 292.310821] [<ffffffff816559d2>] ? do_nanosleep+0x92/0xe0
[ 292.311087] [<ffffffff810c4d88>] ? hrtimer_nanosleep+0xb8/0x1a0
[ 292.311353] [<ffffffff810c3e90>] ? hrtimer_get_res+0x50/0x50
[ 292.311618] [<ffffffff810431dc>] do_page_fault+0xc/0x10
[ 292.311884] [<ffffffff81658062>] page_fault+0x22/0x30
[ 292.312145] Mem-Info:
[ 292.312403] Node 0 DMA per-cpu:
[ 292.312714] CPU 0: hi: 0, btch: 1 usd: 0
[ 292.312977] CPU 1: hi: 0, btch: 1 usd: 0
[ 292.313241] CPU 2: hi: 0, btch: 1 usd: 0
[ 292.313505] CPU 3: hi: 0, btch: 1 usd: 0
[ 292.313772] CPU 4: hi: 0, btch: 1 usd: 0
[ 292.314038] CPU 5: hi: 0, btch: 1 usd: 0
[ 292.314304] CPU 6: hi: 0, btch: 1 usd: 0
[ 292.314571] CPU 7: hi: 0, btch: 1 usd: 0
[ 292.314840] CPU 8: hi: 0, btch: 1 usd: 0
[ 292.315107] CPU 9: hi: 0, btch: 1 usd: 0
[ 292.315372] CPU 10: hi: 0, btch: 1 usd: 0
[ 292.315640] CPU 11: hi: 0, btch: 1 usd: 0
[ 292.315905] Node 0 DMA32 per-cpu:
[ 292.316219] CPU 0: hi: 186, btch: 31 usd: 0
[ 292.316487] CPU 1: hi: 186, btch: 31 usd: 0
[ 292.316754] CPU 2: hi: 186, btch: 31 usd: 0
[ 292.317019] CPU 3: hi: 186, btch: 31 usd: 0
[ 292.317284] CPU 4: hi: 186, btch: 31 usd: 0
[ 292.317553] CPU 5: hi: 186, btch: 31 usd: 0
[ 292.317819] CPU 6: hi: 186, btch: 31 usd: 0
[ 292.318086] CPU 7: hi: 186, btch: 31 usd: 0
[ 292.318352] CPU 8: hi: 186, btch: 31 usd: 0
[ 292.318623] CPU 9: hi: 186, btch: 31 usd: 0
[ 292.318892] CPU 10: hi: 186, btch: 31 usd: 0
[ 292.319161] CPU 11: hi: 186, btch: 31 usd: 0
[ 292.319427] Node 0 Normal per-cpu:
[ 292.319742] CPU 0: hi: 186, btch: 31 usd: 2
[ 292.320009] CPU 1: hi: 186, btch: 31 usd: 0
[ 292.320275] CPU 2: hi: 186, btch: 31 usd: 0
[ 292.320542] CPU 3: hi: 186, btch: 31 usd: 0
[ 292.320811] CPU 4: hi: 186, btch: 31 usd: 0
[ 292.321079] CPU 5: hi: 186, btch: 31 usd: 0
[ 292.321346] CPU 6: hi: 186, btch: 31 usd: 0
[ 292.321614] CPU 7: hi: 186, btch: 31 usd: 0
[ 292.321880] CPU 8: hi: 186, btch: 31 usd: 0
[ 292.322146] CPU 9: hi: 186, btch: 31 usd: 0
[ 292.322412] CPU 10: hi: 186, btch: 31 usd: 0
[ 292.322681] CPU 11: hi: 186, btch: 31 usd: 0
[ 292.322947] active_anon:0 inactive_anon:2 isolated_anon:0
[ 292.322947] active_file:81 inactive_file:42 isolated_file:0
[ 292.322947] unevictable:0 dirty:0 writeback:0 unstable:0
[ 292.322947] free:24558 slab_reclaimable:3128 slab_unreclaimable:3989981
[ 292.322947] mapped:39 shmem:0 pagetables:577 bounce:0
[ 292.322947] free_cma:0
[ 292.324305] Node 0 DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 292.325716] lowmem_reserve[]: 0 1917 15995 15995
[ 292.326216] Node 0 DMA32 free:59736kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:820 all_unreclaimable? yes
[ 292.327636] lowmem_reserve[]:Normal free:22900kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:2004kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3288 all_unreclaimable? yes
[ 292.360844] monitor invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[ 292.361326] monitor cpuset=/ mems_allowed=0
[ 292.361687] CPU: 1 PID: 1844 Comm: monitor Not tainted 3.18.0-rc7-net-next+ #852
[ 292.362162] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[ 292.362429] 0000000000000000 ffff880456aef8f8 ffffffff8164f6f2 0000000000003cf9
[ 292.363095] 0000000000000000 ffff880456aef978 ffffffff8164c061 ffff880456aef958
[ 292.363764] ffffffff810bd1e9 ffff880456aefa18 ffffffff8112a42b ffff880456aef988
[ 292.364433] Call Trace:
[ 292.364695] [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[ 292.381944] active_anon:0 inactive_anon:2 isolated_anon:0
[ 292.381944] active_file:81 inactive_file:42 isolated_file:0
[ 292.381944] unevictable:0 dirty:0 writeback:0 unstable:0
[ 292.381944] free:24864 slab_reclaimable:3128 slab_unreclaimable:3989981
[ 292.381944] mapped:39 shmem:0 pagetables:577 bounce:0
[ 292.381944] free_cma:0
DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
DMA32 free:60068kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:516 all_unreclaimable? yes
Normal free:23532kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:2004kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3240 all_unreclaimable? yes
[ 292.419725] ovsdb-server invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[ 292.420210] ovsdb-server cpuset=/ mems_allowed=0
[ 292.420570] CPU: 2 PID: 1834 Comm: ovsdb-server Not tainted 3.18.0-rc7-net-next+ #852
[ 292.421053] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[ 292.421320] 0000000000000000[ 292.440836] active_anon:0 inactive_anon:2 isolated_anon:0
[ 292.440836] active_file:81 inactive_file:42 isolated_file:0
[ 292.440836] unevictable:0 dirty:0 writeback:0 unstable:0
[ 292.440836] free:24864 slab_reclaimable:3128 slab_unreclaimable:3989981
[ 292.440836] mapped:39 shmem:0 pagetables:577 bounce:0
[ 292.440836] free_cma:0
DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
DMA32 free:60068kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:516 all_unreclaimable? yes
Normal free:23532kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:2004kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3240 all_unreclaimable? yes
[ 292.502168] active_anon:0 inactive_anon:2 isolated_anon:0
[ 292.502168] active_file:81 inactive_file:42 isolated_file:0
[ 292.502168] unevictable:0 dirty:0 writeback:0 unstable:0
[ 292.502168] free:24864 slab_reclaimable:3128 slab_unreclaimable:3989981
[ 292.502168] mapped:39 shmem:0 pagetables:577 bounce:0
[ 292.502168] free_cma:0
DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
DMA32 free:60068kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:516 all_unreclaimable? yes
Normal free:23532kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:2004kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3240 all_unreclaimable? yes
[ 292.539902] hald invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[ 292.540379] hald cpuset=/ mems_allowed=0
[ 292.540736] CPU: 9 PID: 2315 Comm: hald Not tainted 3.18.0-rc7-net-next+ #852
[ 292.541004] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.0a 09/27/2013
[ 292.541266] 0000000000000000 ffff88044f8eb4a8 ffffffff8164f6f2 0000000000004977
[ 292.541934] 0000000000000000 ffff88044f8eb528 ffffffff8164c061 ffff88044f8eb508
[ 292.542600] ffffffff810bd1e9 ffff88044f8eb5c8 ffffffff8112a42b ffff88044f8eb4f8
[ 292.543265] Call Trace:
[ 292.543530] [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[ 292.543797] [<ffffffff8164c061>] dump_header.isra.8+0x96/0x201
[ 292.544064] [<ffffffff810bd1e9>] ? rcu_oom_notify+0xd9/0xf0
[ 292.544331] [<ffffffff8112a42b>] ? shrink_zones+0x25b/0x390
[ 292.544597] [<ffffffff8111b4c2>] oom_kill_process+0x202/0x370
[ 292.544867] [<ffffffff8111aff5>] ? oom_unkillable_task.isra.5+0xc5/0xf0
[ 292.545134] [<ffffffff8111bcbe>] out_of_memory+0x4ee/0x530
[ 292.545399] [<ffffffff8112100e>] __alloc_pages_nodemask+0xa3e/0xa80
[ 292.545664] [<ffffffff8115fa1f>] alloc_pages_vma+0x9f/0x1b0
[ 292.545934] [<ffffffff8115283b>] read_swap_cache_async+0x13b/0x1e0
[ 292.546202] [<ffffffff81152a06>] swapin_readahead+0x126/0x190
[ 292.546467] [<ffffffff81118ada>] ? pagecache_get_page+0x2a/0x1e0
[ 292.564580] active_anon:0 inactive_anon:2 isolated_anon:0
[ 292.564580] active_file:81 inactive_file:42 isolated_file:0
[ 292.564580] unevictable:0 dirty:0 writeback:0 unstable:0
[ 292.564580] free:24864 slab_reclaimable:3128 slab_unreclaimable:3989981
[ 292.564580] mapped:39 shmem:0 pagetables:480 bounce:0
[ 292.564580] free_cma:0
DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
DMA32 free:60068kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872080kB kernel_stack:144kB pagetables:304kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:516 all_unreclaimable? yes
Normal free:23532kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:8kB active_file:240kB inactive_file:200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11156kB slab_unreclaimable:14087812kB kernel_stack:2848kB pagetables:1616kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3240 all_unreclaimable? yes

[ 293.207640] Call Trace:
[ 293.207903] [<ffffffff8164f6f2>] dump_stack+0x4e/0x71
[ 293.208166] [<ffffffff8164c061>] dump_header.isra.8+0x96/0x201
[ 293.208431] [<ffffffff810bd1e9>] ? rcu_oom_notify+0xd9/0xf0
[ 293.208696] [<ffffffff8112a42b>] ? shrink_zones+0x25b/0x390
[ 293.208963] [<ffffffff8111b4c2>] oom_kill_process+0x202/0x370
[ 293.209232] [<ffffffff8111aff5>] ? oom_unkillable_task.isra.5+0xc5/0xf0
[ 293.209500] [<ffffffff8111bcbe>] out_of_memory+0x4ee/0x530
[ 293.209761] [<ffffffff8112100e>] __alloc_pages_nodemask+0xa3e/0xa80
[ 293.210029] [<ffffffff8115fa1f>] alloc_pages_vma+0x9f/0x1b0
[ 293.210297] [<ffffffff8115283b>] read_swap_cache_async+0x13b/0x1e0
[ 293.210562] [<ffffffff81152a06>] swapin_readahead+0x126/0x190
[ 293.210828] [<ffffffff81118ada>] ? pagecache_get_page+0x2a/0x1e0
[ 293.211092] [<ffffffff811415d8>] handle_mm_fault+0x9c8/0xc20
[ 293.211357] [<ffffffff810a364f>] ? dequeue_entity+0x10f/0x600
[ 293.211626] [<ffffffff81042dba>] __do_page_fault+0x18a/0x5a0
[ 293.211892] [<ffffffff810970d5>] ? finish_task_switch+0x45/0xf0
[ 293.212156] [<ffffffff81651070>] ? __schedule+0x290/0x7f0
[ 293.212416] [<ffffffff810431dc>] do_page_fault+0xc/0x10
[ 293.212676] [<ffffffff81658062>] page_fault+0x22/0x30
[ 293.212937] [<ffffffff8118b189>] ? do_sys_poll+0x179/0x5b0
[ 293.213196] [<ffffffff8118b13d>] ? do_sys_poll+0x12d/0x5b0
[ 293.213459] [<ffffffff815ead03>] ? unix_stream_sendmsg+0x413/0x450
[ 293.213724] [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[ 293.213992] [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[ 293.214259] [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[ 293.214528] [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[ 293.214792] [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[ 293.215056] [<ffffffff81189f00>] ? poll_select_copy_remaining+0x140/0x140
[ 293.215321] [<ffffffff8117ac45>] ? SYSC_newfstat+0x25/0x30
[ 293.215586] [<ffffffff8118b697>] SyS_poll+0x77/0x100
[ 293.215851] [<ffffffff81656ad2>] system_call_fastpath+0x12/0x17
[ 293.216115] Mem-Info:
[ 293.216374] Node 0 DMA per-cpu:
[ 293.216685] CPU 0: hi: 0, btch: 1 usd: 0
[ 293.216948] CPU 1: hi: 0, btch: 1 usd: 0
[ 293.217209] CPU 2: hi: 0, btch: 1 usd: 0
[ 293.217473] CPU 3: hi: 0, btch: 1 usd: 0
[ 293.217734] CPU 4: hi: 0, btch: 1 usd: 0
[ 293.217999] CPU 5: hi: 0, btch: 1 usd: 0
[ 293.218261] CPU 6: hi: 0, btch: 1 usd: 0
[ 293.218525] CPU 7: hi: 0, btch: 1 usd: 0
[ 293.218787] CPU 8: hi: 0, btch: 1 usd: 0
[ 293.219051] CPU 9: hi: 0, btch: 1 usd: 0
[ 293.219314] CPU 10: hi: 0, btch: 1 usd: 0
[ 293.219580] CPU 11: hi: 0, btch: 1 usd: 0
[ 293.219843] Node 0 DMA32 per-cpu:
[ 293.220149] CPU 0: hi: 186, btch: 31 usd: 0
[ 293.220411] CPU 1: hi: 186, btch: 31 usd: 0
[ 293.220677] CPU 2: hi: 186, btch: 31 usd: 0
[ 293.220940] CPU 3: hi: 186, btch: 31 usd: 0
[ 293.221203] CPU 4: hi: 186, btch: 31 usd: 0
[ 293.221467] CPU 5: hi: 186, btch: 31 usd: 0
[ 293.221730] CPU 6: hi: 186, btch: 31 usd: 0
[ 293.221992] CPU 7: hi: 186, btch: 31 usd: 0
[ 293.222253] CPU 8: hi: 186, btch: 31 usd: 0
[ 293.222519] CPU 9: hi: 186, btch: 31 usd: 0
[ 293.222782] CPU 10: hi: 186, btch: 31 usd: 0
[ 293.223043] CPU 11: hi: 186, btch: 31 usd: 0
[ 293.223306] Node 0 Normal per-cpu:
[ 293.223615] CPU 0: hi: 186, btch: 31 usd: 0
[ 293.223878] CPU 1: hi: 186, btch: 31 usd: 0
[ 293.224142] CPU 2: hi: 186, btch: 31 usd: 0
[ 293.224404] CPU 3: hi: 186, btch: 31 usd: 0
[ 293.224672] CPU 4: hi: 186, btch: 31 usd: 0
[ 293.224934] CPU 5: hi: 186, btch: 31 usd: 0
[ 293.225198] CPU 6: hi: 186, btch: 31 usd: 0
[ 293.225462] CPU 7: hi: 186, btch: 31 usd: 0
[ 293.225726] CPU 8: hi: 186, btch: 31 usd: 0
[ 293.225989] CPU 9: hi: 186, btch: 31 usd: 0
[ 293.226249] CPU 10: hi: 186, btch: 31 usd: 0
[ 293.226507] CPU 11: hi: 186, btch: 31 usd: 0
[ 293.226769] active_anon:0 inactive_anon:0 isolated_anon:0
[ 293.226769] active_file:148 inactive_file:26 isolated_file:0
[ 293.226769] unevictable:0 dirty:0 writeback:0 unstable:0
[ 293.226769] free:25182 slab_reclaimable:3095 slab_unreclaimable:3990011
[ 293.226769] mapped:17 shmem:0 pagetables:427 bounce:0
[ 293.226769] free_cma:0
[ 293.228100] Node 0 DMA free:15856kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 293.229516] lowmem_reserve[]: 0 1917 15995 15995
[ 293.230008] Node 0 DMA32 free:60388kB min:8092kB low:10112kB high:12136kB active_anon:0kB inactive_anon:0kB active_file:84kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2042792kB managed:1964512kB mlocked:0kB dirty:0kB writeback:0kB mapped:68kB shmem:0kB slab_reclaimable:1356kB slab_unreclaimable:1872156kB kernel_stack:144kB pagetables:232kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5548 all_unreclaimable? yes
[ 293.236698] lowmem_reserve[]: 0 0 14077 14077
[ 293.237185] Node 0 Normal free:24484kB min:59424kB low:74280kB high:89136kB active_anon:0kB inactive_anon:0kB active_file:508kB inactive_file:56kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14680064kB managed:14415676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11024kB slab_unreclaimable:14087856kB kernel_stack:2832kB pagetables:1476kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3904 all_unreclaimable? yes
[ 293.238816] lowmem_reserve[]: 0 0 0 0
[ 293.239309] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15856kB
[ 293.240758] Node 0 DMA32: 408*4kB (UEM) 43*8kB (UEM) 23*16kB (EM) 20*32kB (EM) 17*64kB (EM) 10*128kB (UEM) 9*256kB (UM) 5*512kB (EM) 3*1024kB (EM) 3*2048kB (MR) 10*4096kB (MR) = 60392kB
[ 293.242558] Node 0 Normal: 912*4kB (UEM) 585*8kB (UEM) 193*16kB (UEM) 100*32kB (UEM) 46*64kB (UEM) 16*128kB (M) 3*256kB (UM) 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 24472kB
[ 293.244228] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 293.244707] 172 total pagecache pages
[ 293.244964] 0 pages in swap cache
[ 293.245226] Swap cache stats: add 4615, delete 4615, find 2631/2669
[ 293.245492] Free swap = 8209840kB
[ 293.245751] Total swap = 8216572kB
[ 293.246012] 4184707 pages RAM
[ 293.246272] 0 pages HighMem/MovableOnly
[ 293.246535] 85688 pages reserved
[ 293.246797] 0 pages hwpoisoned
[ 293.247059] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 293.247544] [ 680] 0 680 2678 0 9 107 -1000 udevd
[ 293.248022] [ 1833] 0 1833 10161 0 24 70 0 monitor
[ 293.248502] [ 1834] 0 1834 10196 0 27 131 0 ovsdb-server
[ 293.248984] [ 1844] 0 1844 10299 0 24 82 0 monitor
[ 293.249467] [ 2261] 0 2261 62333 0 22 143 0 rsyslogd
[ 293.249949] [ 2293] 81 2293 5366 1 13 69 0 dbus-daemon
[ 293.250431] [ 2345] 0 2345 5627 0 25 42 0 hald-addon-inpu
[ 293.250912] [ 2354] 68 2354 4498 1 20 37 0 hald-addon-acpi
[ 293.251396] [ 2363] 0 2363 2677 0 9 106 -1000 udevd
[ 293.251870] [ 2486] 0 2486 16672 0 33 179 -1000 sshd
[ 293.252346] [ 2511] 0 2511 29328 1 16 152 0 crond
[ 293.252825] [ 2528] 0 2528 5400 0 14 46 0 atd
[ 293.253304] [ 2541] 0 2541 26020 1 12 28 0 rhsmcertd
[ 293.253787] [ 2562] 0 2562 1031 1 9 18 0 mingetty
[ 293.254267] [ 2564] 0 2564 1031 1 9 18 0 mingetty
[ 293.254747] [ 2566] 0 2566 1031 1 9 17 0 mingetty
[ 293.255218] [ 2568] 0 2568 1031 1 9 18 0 mingetty
[ 293.255688] [ 2570] 0 2570 1031 1 9 18 0 mingetty
[ 293.256157] [ 2571] 0 2571 2677 0 9 106 -1000 udevd
[ 293.256634] [ 2573] 0 2573 1031 1 9 18 0 mingetty
[ 293.257108] [ 2576] 0 2576 25109 1 52 234 0 sshd
[ 293.257585] [ 2598] 500 2598 25109 0 50 247 0 sshd
[ 293.258059] Out of memory: Kill process 2598 (sshd) score 0 or sacrifice child
[ 293.258536] Killed process 2598 (sshd) total-vm:100436kB, anon-rss:0kB, file-rss:0kB
[... etc ...]

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Fri, 12 Dec 2014, Jesper Dangaard Brouer wrote:

> Crash/OOM during IP-forwarding network overload test[1] with pktgen,
> single flow thus activating a single CPU on target (device under test).

Hmmm... Bisected it and the patch that removes the page pointer from
kmem_cache_cpu causes a memory leak. Pretty obvious with hackbench.

2014-12-15 07:55:26

by Joonsoo Kim

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Wed, Dec 10, 2014 at 10:30:17AM -0600, Christoph Lameter wrote:
> We had to insert a preempt enable/disable in the fastpath a while ago. This
> was mainly due to a lot of state that is kept to be allocating from the per
> cpu freelist. In particular the page field is not covered by
> this_cpu_cmpxchg used in the fastpath to do the necessary atomic state
> change for fast path allocation and freeing.
>
> This patch removes the need for the page field to describe the state of the
> per cpu list. The freelist pointer can be used to determine the page struct
> address if necessary.
>
> However, currently this does not work for the termination value of a list
> which is NULL and the same for all slab pages. If we use a valid pointer
> into the page as well as set the last bit then all freelist pointers can
> always be used to determine the address of the page struct and we will not
> need the page field anymore in the per cpu area for a slab. Testing for the
> end of the list is a test of whether the first bit is set.
>
> So the first patch changes the termination pointer for freelists to do just
> that. The second removes the page field, and the third can then remove the
> preempt enable/disable.
>
> Removing the ->page field reduces the cache footprint of the fastpath so hopefully overall
> allocator effectiveness will increase further. Also RT uses full preemption which means
> that currently pretty expensive code has to be inserted into the fastpath. This approach
> allows the removal of that code and a corresponding performance increase.
>
> For V1 a number of changes were made to avoid the overhead of virt_to_page
> and page_address from the RFC.
>
> Slab Benchmarks on a kernel with CONFIG_PREEMPT show an improvement of
> 20%-50% of fastpath latency:
>
> Before:
>
> Single thread testing
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 68 cycles kfree -> 107 cycles
> 10000 times kmalloc(16) -> 69 cycles kfree -> 108 cycles
> 10000 times kmalloc(32) -> 78 cycles kfree -> 112 cycles
> 10000 times kmalloc(64) -> 97 cycles kfree -> 112 cycles
> 10000 times kmalloc(128) -> 111 cycles kfree -> 119 cycles
> 10000 times kmalloc(256) -> 114 cycles kfree -> 139 cycles
> 10000 times kmalloc(512) -> 110 cycles kfree -> 142 cycles
> 10000 times kmalloc(1024) -> 114 cycles kfree -> 156 cycles
> 10000 times kmalloc(2048) -> 155 cycles kfree -> 174 cycles
> 10000 times kmalloc(4096) -> 203 cycles kfree -> 209 cycles
> 10000 times kmalloc(8192) -> 361 cycles kfree -> 265 cycles
> 10000 times kmalloc(16384) -> 597 cycles kfree -> 286 cycles
>
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 114 cycles
> 10000 times kmalloc(16)/kfree -> 115 cycles
> 10000 times kmalloc(32)/kfree -> 117 cycles
> 10000 times kmalloc(64)/kfree -> 115 cycles
> 10000 times kmalloc(128)/kfree -> 111 cycles
> 10000 times kmalloc(256)/kfree -> 116 cycles
> 10000 times kmalloc(512)/kfree -> 110 cycles
> 10000 times kmalloc(1024)/kfree -> 114 cycles
> 10000 times kmalloc(2048)/kfree -> 110 cycles
> 10000 times kmalloc(4096)/kfree -> 107 cycles
> 10000 times kmalloc(8192)/kfree -> 108 cycles
> 10000 times kmalloc(16384)/kfree -> 706 cycles
>
>
> After:
>
>
> Single thread testing
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 41 cycles kfree -> 81 cycles
> 10000 times kmalloc(16) -> 47 cycles kfree -> 88 cycles
> 10000 times kmalloc(32) -> 48 cycles kfree -> 93 cycles
> 10000 times kmalloc(64) -> 58 cycles kfree -> 89 cycles
> 10000 times kmalloc(128) -> 84 cycles kfree -> 104 cycles
> 10000 times kmalloc(256) -> 92 cycles kfree -> 125 cycles
> 10000 times kmalloc(512) -> 86 cycles kfree -> 129 cycles
> 10000 times kmalloc(1024) -> 88 cycles kfree -> 125 cycles
> 10000 times kmalloc(2048) -> 120 cycles kfree -> 159 cycles
> 10000 times kmalloc(4096) -> 176 cycles kfree -> 183 cycles
> 10000 times kmalloc(8192) -> 294 cycles kfree -> 233 cycles
> 10000 times kmalloc(16384) -> 585 cycles kfree -> 291 cycles
>
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 100 cycles
> 10000 times kmalloc(16)/kfree -> 108 cycles
> 10000 times kmalloc(32)/kfree -> 101 cycles
> 10000 times kmalloc(64)/kfree -> 109 cycles
> 10000 times kmalloc(128)/kfree -> 125 cycles
> 10000 times kmalloc(256)/kfree -> 60 cycles
> 10000 times kmalloc(512)/kfree -> 60 cycles
> 10000 times kmalloc(1024)/kfree -> 67 cycles
> 10000 times kmalloc(2048)/kfree -> 60 cycles
> 10000 times kmalloc(4096)/kfree -> 65 cycles
> 10000 times kmalloc(8192)/kfree -> 60 cycles

Hello, Christoph.

I haven't reviewed it in detail, but, at a glance, the overall patchset looks
good. However, the results above look odd: the improvement is beyond what we
can expect. Do you have any idea why allocating objects larger than 256 bytes
is so fast?

Thanks.

2014-12-17 07:13:51

by Joonsoo Kim

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

2014-12-15 16:59 GMT+09:00 Joonsoo Kim <[email protected]>:
> On Wed, Dec 10, 2014 at 10:30:17AM -0600, Christoph Lameter wrote:
>> We had to insert a preempt enable/disable in the fastpath a while ago. This
>> was mainly due to a lot of state that is kept to be allocating from the per
>> cpu freelist. In particular the page field is not covered by
>> this_cpu_cmpxchg used in the fastpath to do the necessary atomic state
>> change for fast path allocation and freeing.
>>
>> This patch removes the need for the page field to describe the state of the
>> per cpu list. The freelist pointer can be used to determine the page struct
>> address if necessary.
>>
>> However, currently this does not work for the termination value of a list
>> which is NULL and the same for all slab pages. If we use a valid pointer
>> into the page as well as set the last bit then all freelist pointers can
>> always be used to determine the address of the page struct and we will not
>> need the page field anymore in the per cpu area for a slab. Testing for the
>> end of the list is a test of whether the first bit is set.
>>
>> So the first patch changes the termination pointer for freelists to do just
>> that. The second removes the page field, and the third can then remove the
>> preempt enable/disable.
>>
>> Removing the ->page field reduces the cache footprint of the fastpath so hopefully overall
>> allocator effectiveness will increase further. Also RT uses full preemption which means
>> that currently pretty expensive code has to be inserted into the fastpath. This approach
>> allows the removal of that code and a corresponding performance increase.
>>
>> For V1 a number of changes were made to avoid the overhead of virt_to_page
>> and page_address from the RFC.
>>
>> Slab Benchmarks on a kernel with CONFIG_PREEMPT show an improvement of
>> 20%-50% of fastpath latency:
>>
>> Before:
>>
>> Single thread testing
>> 1. Kmalloc: Repeatedly allocate then free test
>> 10000 times kmalloc(8) -> 68 cycles kfree -> 107 cycles
>> 10000 times kmalloc(16) -> 69 cycles kfree -> 108 cycles
>> 10000 times kmalloc(32) -> 78 cycles kfree -> 112 cycles
>> 10000 times kmalloc(64) -> 97 cycles kfree -> 112 cycles
>> 10000 times kmalloc(128) -> 111 cycles kfree -> 119 cycles
>> 10000 times kmalloc(256) -> 114 cycles kfree -> 139 cycles
>> 10000 times kmalloc(512) -> 110 cycles kfree -> 142 cycles
>> 10000 times kmalloc(1024) -> 114 cycles kfree -> 156 cycles
>> 10000 times kmalloc(2048) -> 155 cycles kfree -> 174 cycles
>> 10000 times kmalloc(4096) -> 203 cycles kfree -> 209 cycles
>> 10000 times kmalloc(8192) -> 361 cycles kfree -> 265 cycles
>> 10000 times kmalloc(16384) -> 597 cycles kfree -> 286 cycles
>>
>> 2. Kmalloc: alloc/free test
>> 10000 times kmalloc(8)/kfree -> 114 cycles
>> 10000 times kmalloc(16)/kfree -> 115 cycles
>> 10000 times kmalloc(32)/kfree -> 117 cycles
>> 10000 times kmalloc(64)/kfree -> 115 cycles
>> 10000 times kmalloc(128)/kfree -> 111 cycles
>> 10000 times kmalloc(256)/kfree -> 116 cycles
>> 10000 times kmalloc(512)/kfree -> 110 cycles
>> 10000 times kmalloc(1024)/kfree -> 114 cycles
>> 10000 times kmalloc(2048)/kfree -> 110 cycles
>> 10000 times kmalloc(4096)/kfree -> 107 cycles
>> 10000 times kmalloc(8192)/kfree -> 108 cycles
>> 10000 times kmalloc(16384)/kfree -> 706 cycles
>>
>>
>> After:
>>
>>
>> Single thread testing
>> 1. Kmalloc: Repeatedly allocate then free test
>> 10000 times kmalloc(8) -> 41 cycles kfree -> 81 cycles
>> 10000 times kmalloc(16) -> 47 cycles kfree -> 88 cycles
>> 10000 times kmalloc(32) -> 48 cycles kfree -> 93 cycles
>> 10000 times kmalloc(64) -> 58 cycles kfree -> 89 cycles
>> 10000 times kmalloc(128) -> 84 cycles kfree -> 104 cycles
>> 10000 times kmalloc(256) -> 92 cycles kfree -> 125 cycles
>> 10000 times kmalloc(512) -> 86 cycles kfree -> 129 cycles
>> 10000 times kmalloc(1024) -> 88 cycles kfree -> 125 cycles
>> 10000 times kmalloc(2048) -> 120 cycles kfree -> 159 cycles
>> 10000 times kmalloc(4096) -> 176 cycles kfree -> 183 cycles
>> 10000 times kmalloc(8192) -> 294 cycles kfree -> 233 cycles
>> 10000 times kmalloc(16384) -> 585 cycles kfree -> 291 cycles
>>
>> 2. Kmalloc: alloc/free test
>> 10000 times kmalloc(8)/kfree -> 100 cycles
>> 10000 times kmalloc(16)/kfree -> 108 cycles
>> 10000 times kmalloc(32)/kfree -> 101 cycles
>> 10000 times kmalloc(64)/kfree -> 109 cycles
>> 10000 times kmalloc(128)/kfree -> 125 cycles
>> 10000 times kmalloc(256)/kfree -> 60 cycles
>> 10000 times kmalloc(512)/kfree -> 60 cycles
>> 10000 times kmalloc(1024)/kfree -> 67 cycles
>> 10000 times kmalloc(2048)/kfree -> 60 cycles
>> 10000 times kmalloc(4096)/kfree -> 65 cycles
>> 10000 times kmalloc(8192)/kfree -> 60 cycles
>
> Hello, Christoph.
>
> I haven't reviewed it in detail, but, at a glance, the overall patchset looks
> good. However, the results above look odd: the improvement is beyond what we
> can expect. Do you have any idea why allocating objects larger than 256 bytes
> is so fast?

Ping... and I found another way to remove preempt_disable/enable
without complex changes.

What we want to ensure is that we fetch the tid and the kmem_cache_cpu
on the same cpu. We can achieve that goal with the condition loop below.

I ran Jesper's benchmark and saw a 3~5% win in a fast-path loop over
kmem_cache_alloc+free with CONFIG_PREEMPT.

14.5 ns -> 13.8 ns

See following patch.

Thanks.

----------->8-------------
diff --git a/mm/slub.c b/mm/slub.c
index 95d2142..e537af5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2399,8 +2399,10 @@ redo:
* on a different processor between the determination of the pointer
* and the retrieval of the tid.
*/
- preempt_disable();
- c = this_cpu_ptr(s->cpu_slab);
+ do {
+ tid = this_cpu_read(s->cpu_slab->tid);
+ c = this_cpu_ptr(s->cpu_slab);
+ } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

/*
* The transaction ids are globally unique per cpu and per operation on
@@ -2408,8 +2410,6 @@ redo:
* occurs on the right processor and that there was no operation on the
* linked list in between.
*/
- tid = c->tid;
- preempt_enable();

object = c->freelist;
page = c->page;
@@ -2655,11 +2655,10 @@ redo:
* data is retrieved via this pointer. If we are on the same cpu
* during the cmpxchg then the free will succedd.
*/
- preempt_disable();
- c = this_cpu_ptr(s->cpu_slab);
-
- tid = c->tid;
- preempt_enable();
+ do {
+ tid = this_cpu_read(s->cpu_slab->tid);
+ c = this_cpu_ptr(s->cpu_slab);
+ } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

if (likely(page == c->page)) {
set_freepointer(s, object, c->freelist);

2014-12-17 12:09:09

by Jesper Dangaard Brouer

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Wed, 17 Dec 2014 16:13:49 +0900 Joonsoo Kim <[email protected]> wrote:

> Ping... and I found another way to remove preempt_disable/enable
> without complex changes.
>
> What we want to ensure is that we fetch the tid and the kmem_cache_cpu
> on the same cpu. We can achieve that goal with the condition loop below.
>
> I ran Jesper's benchmark and saw a 3~5% win in a fast-path loop over
> kmem_cache_alloc+free with CONFIG_PREEMPT.
>
> 14.5 ns -> 13.8 ns

Hi Kim,

I've tested you patch. Full report below patch.

Summary, I'm seeing 18.599 ns -> 17.523 ns (-1.076ns better).

For network overload tests:

Dropping packets in iptables raw hits the slub fast-path.
Here I'm seeing an improvement of 3ns.

For IP-forward, which also invokes the slub slower path, I'm seeing
an improvement of 6ns (I was not expecting to see any improvement
here; the kmem_cache_alloc code is 24 bytes smaller, so perhaps it's
saving some icache).

Full report below the patch...

> See following patch.
>
> Thanks.
>
> ----------->8-------------
> diff --git a/mm/slub.c b/mm/slub.c
> index 95d2142..e537af5 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2399,8 +2399,10 @@ redo:
> * on a different processor between the determination of the pointer
> * and the retrieval of the tid.
> */
> - preempt_disable();
> - c = this_cpu_ptr(s->cpu_slab);
> + do {
> + tid = this_cpu_read(s->cpu_slab->tid);
> + c = this_cpu_ptr(s->cpu_slab);
> + } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>
> /*
> * The transaction ids are globally unique per cpu and per operation on
> @@ -2408,8 +2410,6 @@ redo:
> * occurs on the right processor and that there was no operation on the
> * linked list in between.
> */
> - tid = c->tid;
> - preempt_enable();
>
> object = c->freelist;
> page = c->page;
> @@ -2655,11 +2655,10 @@ redo:
> * data is retrieved via this pointer. If we are on the same cpu
> * during the cmpxchg then the free will succedd.
> */
> - preempt_disable();
> - c = this_cpu_ptr(s->cpu_slab);
> -
> - tid = c->tid;
> - preempt_enable();
> + do {
> + tid = this_cpu_read(s->cpu_slab->tid);
> + c = this_cpu_ptr(s->cpu_slab);
> + } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>
> if (likely(page == c->page)) {
> set_freepointer(s, object, c->freelist);

SLUB evaluation 03
==================

Testing the patch from Joonsoo Kim <[email protected]> for slub fast-path
preempt_{disable,enable} avoidance.

Kernel
======
Compiler: GCC 4.9.1

Kernel config ::

$ grep PREEMPT .config
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
# CONFIG_DEBUG_PREEMPT is not set

$ egrep -e "SLUB|SLAB" .config
# CONFIG_SLUB_DEBUG is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLUB_CPU_PARTIAL is not set
# CONFIG_SLUB_STATS is not set

On top of::

commit f96fe225677b3efb74346ebd56fafe3997b02afa
Merge: 5543798 eea3e8f
Author: Linus Torvalds <[email protected]>
Date: Fri Dec 12 16:11:12 2014 -0800

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net


Setup
=====

netfilter_unload_modules.sh
netfilter_unload_modules.sh
sudo rmmod nf_reject_ipv4 nf_reject_ipv6

base_device_setup.sh eth4 # 10G sink/receiving interface (ixgbe)
base_device_setup.sh eth5
sudo ethtool --coalesce eth4 rx-usecs 30
sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0
sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5


# sudo tuned-adm active
Current active profile: latency-performance

Drop in raw
-----------
alias iptables='sudo iptables'
iptables -t raw -N simple || iptables -t raw -F simple
iptables -t raw -I simple -d 198.18.0.0/15 -j DROP
iptables -t raw -D PREROUTING -j simple
iptables -t raw -I PREROUTING -j simple

Generator
---------
./pktgen02_burst.sh -d 198.18.0.2 -i eth8 -m 90:E2:BA:0A:56:B4 -b 8 -t 3 -s 64


Patch by Joonsoo Kim to avoid preempt in slub
=============================================

baseline: without patch
-----------------------

baseline kernel v3.18-7016-gf96fe22 at commit f96fe22567

Type:kmem fastpath reuse Per elem: 46 cycles(tsc) 18.599 ns
- (measurement period time:1.859917529 sec time_interval:1859917529)
- (invoke count:100000000 tsc_interval:4649791431)

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.077 ns
- (measurement period time:1.025993290 sec time_interval:1025993290)
- (invoke count:25600000 tsc_interval:2564981743)

single flow/CPU
* IP-forward
- instant rx:0 tx:1165376 pps n:60 average: rx:0 tx:1165928 pps
(instant variation TX -0.407 ns (min:-0.828 max:0.507) RX 0.000 ns)
* Drop in RAW (slab fast-path test)
- instant rx:3245248 tx:0 pps n:60 average: rx:3245325 tx:0 pps
(instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.007 ns)

Christoph's slab_test, baseline kernel (at commit f96fe22567)::

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 68 cycles
10000 times kmalloc(16)/kfree -> 68 cycles
10000 times kmalloc(32)/kfree -> 69 cycles
10000 times kmalloc(64)/kfree -> 68 cycles
10000 times kmalloc(128)/kfree -> 68 cycles
10000 times kmalloc(256)/kfree -> 68 cycles
10000 times kmalloc(512)/kfree -> 74 cycles
10000 times kmalloc(1024)/kfree -> 75 cycles
10000 times kmalloc(2048)/kfree -> 74 cycles
10000 times kmalloc(4096)/kfree -> 74 cycles
10000 times kmalloc(8192)/kfree -> 75 cycles
10000 times kmalloc(16384)/kfree -> 510 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163bd0 00000000000000e1 T kmem_cache_alloc
ffffffff81163ac0 000000000000010c T kmem_cache_alloc_node
ffffffff81162cb0 000000000000013b T kmem_cache_free


with patch
----------

single flow/CPU
* IP-forward
- instant rx:0 tx:1174652 pps n:60 average: rx:0 tx:1174222 pps
(instant variation TX 0.311 ns (min:-0.230 max:1.018) RX 0.000 ns)
* compare against baseline:
- 1174222-1165928 = +8294pps
- (1/1174222*10^9)-(1/1165928*10^9) = -6.058ns

* Drop in RAW (slab fast-path test)
- instant rx:3277440 tx:0 pps n:74 average: rx:3277737 tx:0 pps
(instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.028 ns)
* compare against baseline:
- 3277737-3245325 = +32412 pps
- (1/3277737*10^9)-(1/3245325*10^9) = -3.047ns

SLUB fast-path test: time_bench_kmem_cache1
* modprobe time_bench_kmem_cache1 ; rmmod time_bench_kmem_cache1; sudo dmesg -c

Type:kmem fastpath reuse Per elem: 43 cycles(tsc) 17.523 ns (step:0)
- (measurement period time:1.752338378 sec time_interval:1752338378)
- (invoke count:100000000 tsc_interval:4380843588)
* difference: 17.523 - 18.599 = -1.076ns

alloc N-pattern before free with 256 elements

Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.369 ns (step:0)
- (measurement period time:1.033447112 sec time_interval:1033447112)
- (invoke count:25600000 tsc_interval:2583616203)
* difference: 40.369 - 40.077 = +0.292ns


Christoph's slab_test::

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 65 cycles
10000 times kmalloc(16)/kfree -> 66 cycles
10000 times kmalloc(32)/kfree -> 65 cycles
10000 times kmalloc(64)/kfree -> 66 cycles
10000 times kmalloc(128)/kfree -> 66 cycles
10000 times kmalloc(256)/kfree -> 71 cycles
10000 times kmalloc(512)/kfree -> 72 cycles
10000 times kmalloc(1024)/kfree -> 71 cycles
10000 times kmalloc(2048)/kfree -> 71 cycles
10000 times kmalloc(4096)/kfree -> 71 cycles
10000 times kmalloc(8192)/kfree -> 65 cycles
10000 times kmalloc(16384)/kfree -> 511 cycles

$ nm --print-size vmlinux | egrep -e 'kmem_cache_alloc|kmem_cache_free|is_pointer_to_page'
ffffffff81163ba0 00000000000000c9 T kmem_cache_alloc
ffffffff81163aa0 00000000000000f8 T kmem_cache_alloc_node
ffffffff81162cb0 0000000000000133 T kmem_cache_free



Kernel size change
------------------

$ scripts/bloat-o-meter vmlinux vmlinux-kim-preempt-avoid
add/remove: 0/0 grow/shrink: 0/8 up/down: 0/-248 (-248)
function old new delta
kmem_cache_free 315 307 -8
kmem_cache_alloc_node 268 248 -20
kmem_cache_alloc 225 201 -24
kfree 274 250 -24
__kmalloc_node_track_caller 356 324 -32
__kmalloc_node 340 308 -32
__kmalloc 324 273 -51
__kmalloc_track_caller 343 286 -57


Qmempool notes:
---------------

On baseline kernel:

Type:qmempool fastpath reuse SOFTIRQ Per elem: 33 cycles(tsc) 13.287 ns
- (measurement period time:0.398628965 sec time_interval:398628965)
- (invoke count:30000000 tsc_interval:996571541)

Type:qmempool fastpath reuse BH-disable Per elem: 47 cycles(tsc) 19.180 ns
- (measurement period time:0.575425927 sec time_interval:575425927)
- (invoke count:30000000 tsc_interval:1438563781)

qmempool_bench: N-pattern with 256 elements

Type:qmempool alloc+free N-pattern Per elem: 62 cycles(tsc) 24.955 ns (step:0)
- (measurement period time:0.638871008 sec time_interval:638871008)
- (invoke count:25600000 tsc_interval:1597176303)


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Wed, 17 Dec 2014, Joonsoo Kim wrote:

> Ping... and I found another way to remove preempt_disable/enable
> without complex changes.
>
> What we want to ensure is getting tid and kmem_cache_cpu
> on the same cpu. We can achieve that goal with below condition loop.
>
> I ran Jesper's benchmark and saw 3~5% win in a fast-path loop over
> kmem_cache_alloc+free in CONFIG_PREEMPT.
>
> 14.5 ns -> 13.8 ns
>
> See following patch.

Good idea. How does this affect the !CONFIG_PREEMPT case?

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Wed, 17 Dec 2014, Joonsoo Kim wrote:

> + do {
> + tid = this_cpu_read(s->cpu_slab->tid);
> + c = this_cpu_ptr(s->cpu_slab);
> + } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));


The assembly code produced is a bit weird. I think the compiler undoes what
you wanted to do:

46fb: 49 8b 1e mov (%r14),%rbx rbx = c =s->cpu_slab?
46fe: 65 4c 8b 6b 08 mov %gs:0x8(%rbx),%r13 r13 = tid
4703: e8 00 00 00 00 callq 4708 <kmem_cache_alloc+0x48> ??
4708: 89 c0 mov %eax,%eax ??
470a: 48 03 1c c5 00 00 00 add 0x0(,%rax,8),%rbx ??
4711: 00
4712: 4c 3b 6b 08 cmp 0x8(%rbx),%r13 tid == c->tid
4716: 49 89 d8 mov %rbx,%r8
4719: 75 e0 jne 46fb <kmem_cache_alloc+0x3b>

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

On Wed, 17 Dec 2014, Christoph Lameter wrote:

> On Wed, 17 Dec 2014, Joonsoo Kim wrote:
>
> > + do {
> > + tid = this_cpu_read(s->cpu_slab->tid);
> > + c = this_cpu_ptr(s->cpu_slab);
> > + } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

Here is another one without debugging:

0xffffffff811d23bb <+59>: mov %gs:0x8(%r9),%rdx tid(rdx) = this_cpu_read()
0xffffffff811d23c0 <+64>: mov %r9,%r8
0xffffffff811d23c3 <+67>: add %gs:0x7ee37d9d(%rip),%r8 c (r8) =
0xffffffff811d23cb <+75>: cmp 0x8(%r8),%rdx c->tid == tid
0xffffffff811d23cf <+79>: jne 0xffffffff811d23bb <kmem_cache_alloc+59>

Actually that looks ok.

2014-12-18 14:34:31

by Joonsoo Kim

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

2014-12-17 21:08 GMT+09:00 Jesper Dangaard Brouer <[email protected]>:
> On Wed, 17 Dec 2014 16:13:49 +0900 Joonsoo Kim <[email protected]> wrote:
>
>> Ping... and I found another way to remove preempt_disable/enable
>> without complex changes.
>>
>> What we want to ensure is getting tid and kmem_cache_cpu
>> on the same cpu. We can achieve that goal with below condition loop.
>>
>> I ran Jesper's benchmark and saw 3~5% win in a fast-path loop over
>> kmem_cache_alloc+free in CONFIG_PREEMPT.
>>
>> 14.5 ns -> 13.8 ns
>
> Hi Kim,
>
> I've tested your patch. Full report below the patch.
>
> Summary, I'm seeing 18.599 ns -> 17.523 ns (-1.076ns better).

Thanks for testing! :)
It will help to convince others.

Thanks.

> For network overload tests:
>
> Dropping packets in iptables raw, which is hitting the slub fast-path.
> Here I'm seeing an improvement of 3ns.
>
> For IP-forward, which is also invoking the slub slower path, I'm seeing
> an improvement of 6ns (I was not expecting to see any improvement
> here, the kmem_cache_alloc code is 24 bytes smaller, so perhaps it's
> saving some icache).
>
> Full report below patch...
>
>> See following patch.
>>
>> Thanks.
>>
>> ----------->8-------------
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 95d2142..e537af5 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2399,8 +2399,10 @@ redo:
>> * on a different processor between the determination of the pointer
>> * and the retrieval of the tid.
>> */
>> - preempt_disable();
>> - c = this_cpu_ptr(s->cpu_slab);
>> + do {
>> + tid = this_cpu_read(s->cpu_slab->tid);
>> + c = this_cpu_ptr(s->cpu_slab);
>> + } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>>
>> /*
>> * The transaction ids are globally unique per cpu and per operation on
>> @@ -2408,8 +2410,6 @@ redo:
>> * occurs on the right processor and that there was no operation on the
>> * linked list in between.
>> */
>> - tid = c->tid;
>> - preempt_enable();
>>
>> object = c->freelist;
>> page = c->page;
>> @@ -2655,11 +2655,10 @@ redo:
>> * data is retrieved via this pointer. If we are on the same cpu
>> * during the cmpxchg then the free will succeed.
>> */
>> - preempt_disable();
>> - c = this_cpu_ptr(s->cpu_slab);
>> -
>> - tid = c->tid;
>> - preempt_enable();
>> + do {
>> + tid = this_cpu_read(s->cpu_slab->tid);
>> + c = this_cpu_ptr(s->cpu_slab);
>> + } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>>
>> if (likely(page == c->page)) {
>> set_freepointer(s, object, c->freelist);
>
> SLUB evaluation 03
> ==================
>
> Testing patch from Joonsoo Kim <[email protected]> slub fast-path
> preempt_{disable,enable} avoidance.
>
> Kernel
> ======
> Compiler: GCC 4.9.1
>
> Kernel config ::
>
> $ grep PREEMPT .config
> CONFIG_PREEMPT_RCU=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_PREEMPT_NONE is not set
> # CONFIG_PREEMPT_VOLUNTARY is not set
> CONFIG_PREEMPT=y
> CONFIG_PREEMPT_COUNT=y
> # CONFIG_DEBUG_PREEMPT is not set
>
> $ egrep -e "SLUB|SLAB" .config
> # CONFIG_SLUB_DEBUG is not set
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y
> # CONFIG_SLUB_CPU_PARTIAL is not set
> # CONFIG_SLUB_STATS is not set
>
> On top of::
>
> commit f96fe225677b3efb74346ebd56fafe3997b02afa
> Merge: 5543798 eea3e8f
> Author: Linus Torvalds <[email protected]>
> Date: Fri Dec 12 16:11:12 2014 -0800
>
> Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
>
>
> Setup
> =====
>
> netfilter_unload_modules.sh
> netfilter_unload_modules.sh
> sudo rmmod nf_reject_ipv4 nf_reject_ipv6
>
> base_device_setup.sh eth4 # 10G sink/receiving interface (ixgbe)
> base_device_setup.sh eth5
> sudo ethtool --coalesce eth4 rx-usecs 30
> sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0
> sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5
>
>
> # sudo tuned-adm active
> Current active profile: latency-performance
>
> Drop in raw
> -----------
> alias iptables='sudo iptables'
> iptables -t raw -N simple || iptables -t raw -F simple
> iptables -t raw -I simple -d 198.18.0.0/15 -j DROP
> iptables -t raw -D PREROUTING -j simple
> iptables -t raw -I PREROUTING -j simple
>
> Generator
> ---------
> ./pktgen02_burst.sh -d 198.18.0.2 -i eth8 -m 90:E2:BA:0A:56:B4 -b 8 -t 3 -s 64
>
>
> Patch by Joonsoo Kim to avoid preempt in slub
> =============================================
>
> baseline: without patch
> -----------------------
>
> baseline kernel v3.18-7016-gf96fe22 at commit f96fe22567
>
> Type:kmem fastpath reuse Per elem: 46 cycles(tsc) 18.599 ns
> - (measurement period time:1.859917529 sec time_interval:1859917529)
> - (invoke count:100000000 tsc_interval:4649791431)
>
> alloc N-pattern before free with 256 elements
>
> Type:kmem alloc+free N-pattern Per elem: 100 cycles(tsc) 40.077 ns
> - (measurement period time:1.025993290 sec time_interval:1025993290)
> - (invoke count:25600000 tsc_interval:2564981743)

2014-12-18 14:38:11

by Joonsoo Kim

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

2014-12-18 0:36 GMT+09:00 Christoph Lameter <[email protected]>:
> On Wed, 17 Dec 2014, Joonsoo Kim wrote:
>
>> Ping... and I found another way to remove preempt_disable/enable
>> without complex changes.
>>
>> What we want to ensure is getting tid and kmem_cache_cpu
>> on the same cpu. We can achieve that goal with below condition loop.
>>
>> I ran Jesper's benchmark and saw 3~5% win in a fast-path loop over
>> kmem_cache_alloc+free in CONFIG_PREEMPT.
>>
>> 14.5 ns -> 13.8 ns
>>
>> See following patch.
>
> Good idea. How does this affect the !CONFIG_PREEMPT case?

One more this_cpu_xxx operation makes the fastpath slower in the
!CONFIG_PREEMPT case, by roughly 3~5%.

We can deal with each case separately, although it looks dirty.

#ifdef CONFIG_PREEMPT
XXX
#else
YYY
#endif

Thanks.

2014-12-18 14:41:37

by Joonsoo Kim

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

2014-12-18 1:10 GMT+09:00 Christoph Lameter <[email protected]>:
> On Wed, 17 Dec 2014, Joonsoo Kim wrote:
>
>> + do {
>> + tid = this_cpu_read(s->cpu_slab->tid);
>> + c = this_cpu_ptr(s->cpu_slab);
>> + } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
>
>
> Assembly code produced is a bit weird. I think the compiler undoes what
> you wanted to do:

I checked my compiled code and there seems to be no problem.
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

Thanks.

> 46fb: 49 8b 1e mov (%r14),%rbx rbx = c =s->cpu_slab?
> 46fe: 65 4c 8b 6b 08 mov %gs:0x8(%rbx),%r13 r13 = tid
> 4703: e8 00 00 00 00 callq 4708 <kmem_cache_alloc+0x48> ??
> 4708: 89 c0 mov %eax,%eax ??
> 470a: 48 03 1c c5 00 00 00 add 0x0(,%rax,8),%rbx ??
> 4711: 00
> 4712: 4c 3b 6b 08 cmp 0x8(%rbx),%r13 tid == c->tid
> 4716: 49 89 d8 mov %rbx,%r8
> 4719: 75 e0 jne 46fb <kmem_cache_alloc+0x3b>
>

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1


On Thu, 18 Dec 2014, Joonsoo Kim wrote:
> > Good idea. How does this affect the !CONFIG_PREEMPT case?
>
> One more this_cpu_xxx operation makes the fastpath slower in the
> !CONFIG_PREEMPT case, by roughly 3~5%.
>
> We can deal with each case separately, although it looks dirty.

Ok maybe you can come up with a solution that is as clean as possible?

2014-12-18 15:08:24

by Joonsoo Kim

Subject: Re: [PATCH 0/7] slub: Fastpath optimization (especially for RT) V1

2014-12-18 23:57 GMT+09:00 Christoph Lameter <[email protected]>:
>
> On Thu, 18 Dec 2014, Joonsoo Kim wrote:
>> > Good idea. How does this affect the !CONFIG_PREEMPT case?
>>
>> One more this_cpu_xxx operation makes the fastpath slower in the
>> !CONFIG_PREEMPT case, by roughly 3~5%.
>>
>> We can deal with each case separately, although it looks dirty.
>
> Ok maybe you can come up with a solution that is as clean as possible?

Okay. Will do!

Thanks.