2017-12-08 08:50:22

by Kemi Wang

Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement



On 2017-11-30 17:45, Michal Hocko wrote:
> On Thu 30-11-17 17:32:08, kemi wrote:

> Do not get me wrong. If we want to make per-node stats more optimal,
> then by all means let's do that. But having 3 sets of counters is just
> way to much.
>

Hi Michal,
Apologies for the late response in this email thread.

After thinking about how to optimize our per-node stats more gracefully,
we could add a u64 vm_numa_stat_diff[] array to struct per_cpu_nodestat.
That way, everything stays in per-CPU counters, and they are summed up
when the NUMA stats are read via /proc or /sys.
What do you think of that? Thanks.
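
To make the idea concrete, here is a minimal sketch of what that could look
like (illustrative only, not a patch; the helper names and the exact field
layout are placeholders):

struct per_cpu_nodestat {
	s8 stat_threshold;
	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
#ifdef CONFIG_NUMA
	u64 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];	/* proposed addition */
#endif
};

/* Fast path: bump the local CPU's counter, no shared cache line touched. */
static inline void __inc_numa_node_state(struct pglist_data *pgdat,
					 enum numa_stat_item item)
{
	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
	u64 __percpu *p = pcp->vm_numa_stat_diff + item;

	__this_cpu_inc(*p);
}

/* Slow path (/proc or /sys read): fold the per-CPU diffs on demand. */
static u64 sum_numa_node_state(struct pglist_data *pgdat,
			       enum numa_stat_item item)
{
	u64 sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu_ptr(pgdat->per_cpu_nodestats,
				   cpu)->vm_numa_stat_diff[item];

	return sum;
}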

The motivation for this modification is listed below:
1) Thanks to the 0-day system, a bug was reported against the V1 patch:

[ 0.000000] BUG: unable to handle kernel paging request at 0392b000
[ 0.000000] IP: __inc_numa_state+0x2a/0x34
[ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff53
[ 0.000000] Oops: 0002 [#1] PREEMPT SMP
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.0-12996-g81611e2 #1
[ 0.000000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 0.000000] task: cbf56000 task.stack: cbf4e000
[ 0.000000] EIP: __inc_numa_state+0x2a/0x34
[ 0.000000] EFLAGS: 00210006 CPU: 0
[ 0.000000] EAX: 0392b000 EBX: 00000000 ECX: 00000000 EDX: cbef90ef
[ 0.000000] ESI: cffdb320 EDI: 00000004 EBP: cbf4fd80 ESP: cbf4fd7c
[ 0.000000] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 0.000000] CR0: 80050033 CR2: 0392b000 CR3: 0c0a8000 CR4: 000406b0
[ 0.000000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 0.000000] DR6: fffe0ff0 DR7: 00000400
[ 0.000000] Call Trace:
[ 0.000000] zone_statistics+0x4d/0x5b
[ 0.000000] get_page_from_freelist+0x257/0x993
[ 0.000000] __alloc_pages_nodemask+0x108/0x8c8
[ 0.000000] ? __bitmap_weight+0x38/0x41
[ 0.000000] ? pcpu_next_md_free_region+0xe/0xab
[ 0.000000] ? pcpu_chunk_refresh_hint+0x8b/0xbc
[ 0.000000] ? pcpu_chunk_slot+0x1e/0x24
[ 0.000000] ? pcpu_chunk_relocate+0x15/0x6d
[ 0.000000] ? find_next_bit+0xa/0xd
[ 0.000000] ? cpumask_next+0x15/0x18
[ 0.000000] ? pcpu_alloc+0x399/0x538
[ 0.000000] cache_grow_begin+0x85/0x31c
[ 0.000000] ____cache_alloc+0x147/0x1e0
[ 0.000000] ? debug_smp_processor_id+0x12/0x14
[ 0.000000] kmem_cache_alloc+0x80/0x145
[ 0.000000] create_kmalloc_cache+0x22/0x64
[ 0.000000] kmem_cache_init+0xf9/0x16c
[ 0.000000] start_kernel+0x1d4/0x3d6
[ 0.000000] i386_start_kernel+0x9a/0x9e
[ 0.000000] startup_32_smp+0x15f/0x170

That is because the u64 percpu pointer vm_numa_stat is used before it has been initialized.

[...]
> +extern u64 __percpu *vm_numa_stat;
[...]
> +#ifdef CONFIG_NUMA
> + size = sizeof(u64) * num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS;
> + align = __alignof__(u64[num_possible_nodes() * NR_VM_NUMA_STAT_ITEMS]);
> + vm_numa_stat = (u64 __percpu *)__alloc_percpu(size, align);
> +#endif

The pointer is used in mm_init->kmem_cache_init->create_kmalloc_cache->...->
__alloc_pages() when CONFIG_SLAB/CONFIG_ZONE_DMA is set in kconfig, while
vm_numa_stat is only initialized in setup_per_cpu_pageset(), which runs after
mm_init() is called. The proposal mentioned above can fix this by making the
NUMA stats counters ready before mm_init() is called
(start_kernel->build_all_zonelists() can help do that; see the boot-ordering
sketch below).

2) Compared to the V1 patch, this modification makes the semantics of the
per-node NUMA stats clearer for review and maintenance.
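
To make the ordering issue in point 1) easier to follow, here is a simplified
view of the boot sequence involved (illustrative, pieced together from the
description above rather than taken from the patch):

/*
 * start_kernel()
 *   build_all_zonelists()       <-- per-node stats storage can be made
 *                                   usable from here on under the proposal
 *   mm_init()
 *     kmem_cache_init()
 *       create_kmalloc_cache()
 *         ... -> __alloc_pages() -> zone_statistics()   <-- first user
 *   ...
 *   setup_per_cpu_pageset()     <-- the V1 patch only set up vm_numa_stat
 *                                   here, i.e. too late
 */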


2017-12-08 08:48:01

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

On Fri 08-12-17 16:38:46, kemi wrote:
>
>
> On 2017年11月30日 17:45, Michal Hocko wrote:
> > On Thu 30-11-17 17:32:08, kemi wrote:
>
> > Do not get me wrong. If we want to make per-node stats more optimal,
> > then by all means let's do that. But having 3 sets of counters is just
> > way to much.
> >
>
> Hi, Michal
> Apologize to respond later in this email thread.
>
> After thinking about how to optimize our per-node stats more gracefully,
> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
> we can keep everything in per cpu counter and sum them up when read /proc
> or /sys for numa stats.
> What's your idea for that? thanks

I would like to see a strong argument why we cannot make it a _standard_
node counter.
--
Michal Hocko
SUSE Labs

2017-12-12 02:07:32

by Kemi Wang

Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement



On 2017-12-08 16:47, Michal Hocko wrote:
> On Fri 08-12-17 16:38:46, kemi wrote:
>>
>>
>> On 2017年11月30日 17:45, Michal Hocko wrote:
>>> On Thu 30-11-17 17:32:08, kemi wrote:
>>
>> After thinking about how to optimize our per-node stats more gracefully,
>> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
>> we can keep everything in per cpu counter and sum them up when read /proc
>> or /sys for numa stats.
>> What's your idea for that? thanks
>
> I would like to see a strong argument why we cannot make it a _standard_
> node counter.
>

All right.
This issue was first reported and discussed at the 2017 MM Summit; see the
topic "Provoking and fixing memory bottlenecks - Focused on the page
allocator" presented by Jesper:

http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
(slide 15/16)

As you know, the page allocator is too slow and has become a bottleneck
in high-speed networking.
Jesper also showed some data in that presentation: with a micro benchmark
that stresses the order-0 fast path (per-CPU pages), a *32%* extra CPU
cycle cost (143->97) comes from CONFIG_NUMA.

When I took a look at this issue, I reproduced it and got a result similar
to Jesper's. Furthermore, with Jesper's help, the overhead was root-caused:
it comes from an extra level of function calls such as zone_statistics()
(*10%* of cycles, nearly 1/3 of that overhead, including __inc_numa_state),
policy_zonelist, get_task_policy(), policy_nodemask, etc. (perf profiling of
CPU cycles). zone_statistics() is the biggest contributor introduced by
CONFIG_NUMA in the fast path where we can do something to optimize the page
allocator. Plus, the overhead of zone_statistics() increases significantly
with more and more CPU cores and nodes due to cache bouncing.

Therefore, we submitted a patch earlier to mitigate the overhead of
zone_statistics() by reducing the update frequency of the global NUMA
counters (enlarging the threshold size, as suggested by Dave Hansen). I
would also like to have an implementation of a "_standard_" node counter
for NUMA stats, but I wonder how we can keep the performance gain at the
same time.
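
For reference, the batching scheme from that earlier threshold patch works
roughly as follows (paraphrased from the 4.14-era mm/vmstat.c, so treat the
exact names and threshold as approximate): each CPU accumulates NUMA events
in a u16 per-CPU diff and only folds them into the shared zone counter when
the diff approaches U16_MAX, which keeps writes to the shared cache line rare.

#define NUMA_STATS_THRESHOLD (U16_MAX - 2)

void __inc_numa_state(struct zone *zone, enum numa_stat_item item)
{
	struct per_cpu_pageset __percpu *pcp = zone->pageset;
	u16 __percpu *p = pcp->vm_numa_stat_diff + item;
	u16 v;

	v = __this_cpu_inc_return(*p);

	if (unlikely(v > NUMA_STATS_THRESHOLD)) {
		/* Rare slow path: fold the whole batch into the zone counter. */
		zone_numa_state_add(v, zone, item);
		__this_cpu_write(*p, 0);
	}
}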

2017-12-12 08:11:32

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

On Tue 12-12-17 10:05:26, kemi wrote:
>
>
> On 2017年12月08日 16:47, Michal Hocko wrote:
> > On Fri 08-12-17 16:38:46, kemi wrote:
> >>
> >>
> >> On 2017年11月30日 17:45, Michal Hocko wrote:
> >>> On Thu 30-11-17 17:32:08, kemi wrote:
> >>
> >> After thinking about how to optimize our per-node stats more gracefully,
> >> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
> >> we can keep everything in per cpu counter and sum them up when read /proc
> >> or /sys for numa stats.
> >> What's your idea for that? thanks
> >
> > I would like to see a strong argument why we cannot make it a _standard_
> > node counter.
> >
>
> all right.
> This issue is first reported and discussed in 2017 MM summit, referred to
> the topic "Provoking and fixing memory bottlenecks -Focused on the page
> allocator presentation" presented by Jesper.
>
> http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit
> 2017-JesperBrouer.pdf (slide 15/16)
>
> As you know, page allocator is too slow and has becomes a bottleneck
> in high-speed network.
> Jesper also showed some data in that presentation: with micro benchmark
> stresses order-0 fast path(per CPU pages), *32%* extra CPU cycles cost
> (143->97) comes from CONFIG_NUMA.
>
> When I took a look at this issue, I reproduced this issue and got a
> similar result to Jesper's. Furthermore, with the help from Jesper,
> the overhead is root caused and the real cause of this overhead comes
> from an extra level of function calls such as zone_statistics() (*10%*,
> nearly 1/3, including __inc_numa_state), policy_zonelist, get_task_policy(),
> policy_nodemask and etc (perf profiling cpu cycles). zone_statistics()
> is the biggest one introduced by CONFIG_NUMA in fast path that we can
> do something for optimizing page allocator. Plus, the overhead of
> zone_statistics() significantly increase with more and more cpu
> cores and nodes due to cache bouncing.
>
> Therefore, we submitted a patch before to mitigate the overhead of
> zone_statistics() by reducing global NUMA counter update frequency
> (enlarge threshold size, as suggested by Dave Hansen). I also would
> like to have an implementation of a "_standard_node counter" for NUMA
> stats, but I wonder how we can keep the performance gain at the
> same time.

I understand all that. But we do have a way to put all that overhead
away by disabling the stats altogether. I presume that CPU cycle
sensitive workloads would simply use that option because the stats are
quite limited in their usefulness anyway IMHO. So we are back to: Do
normal workloads care all that much to have a 3rd way to account for
events? I haven't heard a sound argument for that.

--
Michal Hocko
SUSE Labs

2017-12-14 01:42:32

by Kemi Wang

Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement



On 2017-12-12 16:11, Michal Hocko wrote:
> On Tue 12-12-17 10:05:26, kemi wrote:
>>
>>
>> On 2017年12月08日 16:47, Michal Hocko wrote:
>>> On Fri 08-12-17 16:38:46, kemi wrote:
>>>>
>>>>
>>>> On 2017年11月30日 17:45, Michal Hocko wrote:
>>>>> On Thu 30-11-17 17:32:08, kemi wrote:
>>>>
>>>> After thinking about how to optimize our per-node stats more gracefully,
>>>> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
>>>> we can keep everything in per cpu counter and sum them up when read /proc
>>>> or /sys for numa stats.
>>>> What's your idea for that? thanks
>>>
>>> I would like to see a strong argument why we cannot make it a _standard_
>>> node counter.
>>>
>>
>> all right.
>> This issue is first reported and discussed in 2017 MM summit, referred to
>> the topic "Provoking and fixing memory bottlenecks -Focused on the page
>> allocator presentation" presented by Jesper.
>>
>> http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit
>> 2017-JesperBrouer.pdf (slide 15/16)
>>
>> As you know, page allocator is too slow and has becomes a bottleneck
>> in high-speed network.
>> Jesper also showed some data in that presentation: with micro benchmark
>> stresses order-0 fast path(per CPU pages), *32%* extra CPU cycles cost
>> (143->97) comes from CONFIG_NUMA.
>>
>> When I took a look at this issue, I reproduced this issue and got a
>> similar result to Jesper's. Furthermore, with the help from Jesper,
>> the overhead is root caused and the real cause of this overhead comes
>> from an extra level of function calls such as zone_statistics() (*10%*,
>> nearly 1/3, including __inc_numa_state), policy_zonelist, get_task_policy(),
>> policy_nodemask and etc (perf profiling cpu cycles). zone_statistics()
>> is the biggest one introduced by CONFIG_NUMA in fast path that we can
>> do something for optimizing page allocator. Plus, the overhead of
>> zone_statistics() significantly increase with more and more cpu
>> cores and nodes due to cache bouncing.
>>
>> Therefore, we submitted a patch before to mitigate the overhead of
>> zone_statistics() by reducing global NUMA counter update frequency
>> (enlarge threshold size, as suggested by Dave Hansen). I also would
>> like to have an implementation of a "_standard_node counter" for NUMA
>> stats, but I wonder how we can keep the performance gain at the
>> same time.
>
> I understand all that. But we do have a way to put all that overhead
> away by disabling the stats altogether. I presume that CPU cycle
> sensitive workloads would simply use that option because the stats are
> quite limited in their usefulness anyway IMHO. So we are back to: Do
> normal workloads care all that much to have 3rd way to account for
> events? I haven't heard a sound argument for that.
>

I'm not a fan of adding code that nobody (or 0.001% of users) cares about.
We can't depend on that tunable interface too much, because our customers,
or even kernel hackers, may not know about the newly added interface, or
sometimes NUMA stats can't be disabled in their environments. That's the
reason why we spent time on this optimization rather than simply adding a
runtime configuration interface.

Furthermore, the code we optimized is a core area of the kernel that can
benefit most kernel activity, more or less I think.

All right, let's look at it another way: does a per-node u64 percpu array
for NUMA stats really make the code too complicated and hard to maintain?
I don't think so, IMHO.



2017-12-14 07:29:45

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

On Thu 14-12-17 09:40:32, kemi wrote:
>
>
> On 2017年12月12日 16:11, Michal Hocko wrote:
> > On Tue 12-12-17 10:05:26, kemi wrote:
> >>
> >>
> >> On 2017年12月08日 16:47, Michal Hocko wrote:
> >>> On Fri 08-12-17 16:38:46, kemi wrote:
> >>>>
> >>>>
> >>>> On 2017年11月30日 17:45, Michal Hocko wrote:
> >>>>> On Thu 30-11-17 17:32:08, kemi wrote:
> >>>>
> >>>> After thinking about how to optimize our per-node stats more gracefully,
> >>>> we may add u64 vm_numa_stat_diff[] in struct per_cpu_nodestat, thus,
> >>>> we can keep everything in per cpu counter and sum them up when read /proc
> >>>> or /sys for numa stats.
> >>>> What's your idea for that? thanks
> >>>
> >>> I would like to see a strong argument why we cannot make it a _standard_
> >>> node counter.
> >>>
> >>
> >> all right.
> >> This issue is first reported and discussed in 2017 MM summit, referred to
> >> the topic "Provoking and fixing memory bottlenecks -Focused on the page
> >> allocator presentation" presented by Jesper.
> >>
> >> http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit
> >> 2017-JesperBrouer.pdf (slide 15/16)
> >>
> >> As you know, page allocator is too slow and has becomes a bottleneck
> >> in high-speed network.
> >> Jesper also showed some data in that presentation: with micro benchmark
> >> stresses order-0 fast path(per CPU pages), *32%* extra CPU cycles cost
> >> (143->97) comes from CONFIG_NUMA.
> >>
> >> When I took a look at this issue, I reproduced this issue and got a
> >> similar result to Jesper's. Furthermore, with the help from Jesper,
> >> the overhead is root caused and the real cause of this overhead comes
> >> from an extra level of function calls such as zone_statistics() (*10%*,
> >> nearly 1/3, including __inc_numa_state), policy_zonelist, get_task_policy(),
> >> policy_nodemask and etc (perf profiling cpu cycles). zone_statistics()
> >> is the biggest one introduced by CONFIG_NUMA in fast path that we can
> >> do something for optimizing page allocator. Plus, the overhead of
> >> zone_statistics() significantly increase with more and more cpu
> >> cores and nodes due to cache bouncing.
> >>
> >> Therefore, we submitted a patch before to mitigate the overhead of
> >> zone_statistics() by reducing global NUMA counter update frequency
> >> (enlarge threshold size, as suggested by Dave Hansen). I also would
> >> like to have an implementation of a "_standard_node counter" for NUMA
> >> stats, but I wonder how we can keep the performance gain at the
> >> same time.
> >
> > I understand all that. But we do have a way to put all that overhead
> > away by disabling the stats altogether. I presume that CPU cycle
> > sensitive workloads would simply use that option because the stats are
> > quite limited in their usefulness anyway IMHO. So we are back to: Do
> > normal workloads care all that much to have 3rd way to account for
> > events? I haven't heard a sound argument for that.
> >
>
> I'm not a fan of adding code that nobody(or 0.001%) cares.
> We can't depend on that tunable interface too much, because our customers
> or even kernel hacker may not know that new added interface,

Come on. If somebody wants to tune the system to squeeze out every single
cycle then there is tuning required anyway and those people can figure it out.

> or sometimes
> NUMA stats can't be disabled in their environments.

why?

> That's the reason
> why we spent time to do that optimization other than simply adding a runtime
> configuration interface.
>
> Furthermore, the code we optimized for is the core area of kernel that can
> benefit most of kernel actions, more or less I think.
>
> All right, let's think about it in another way, does a u64 percpu array per-node
> for NUMA stats really make code too much complicated and hard to maintain?
> I'm afraid not IMHO.

I disagree. The whole NUMA stat thing has turned out to be nasty to
maintain, for a very limited gain. Now you are just shifting that
elsewhere. Look, there are other counters maintained in the allocator, and
we do not want to treat them specially. We have a nice per-cpu infrastructure
here, so I really fail to see why we should code around it. If that can
be improved then by all means let's do it.

So unless you have a strong usecase I would vote for a simpler code.

--
Michal Hocko
SUSE Labs

2017-12-14 08:57:57

by Kemi Wang

Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement



On 2017-12-14 15:29, Michal Hocko wrote:
> On Thu 14-12-17 09:40:32, kemi wrote:
>>
>>
>> or sometimes
>> NUMA stats can't be disabled in their environments.
>
> why?
>
>> That's the reason
>> why we spent time to do that optimization other than simply adding a runtime
>> configuration interface.
>>
>> Furthermore, the code we optimized for is the core area of kernel that can
>> benefit most of kernel actions, more or less I think.
>>
>> All right, let's think about it in another way, does a u64 percpu array per-node
>> for NUMA stats really make code too much complicated and hard to maintain?
>> I'm afraid not IMHO.
>
> I disagree. The whole numa stat things has turned out to be nasty to
> maintain. For a very limited gain. Now you are just shifting that
> elsewhere. Look, there are other counters taken in the allocator, we do
> not want to treat them specially. We have a nice per-cpu infrastructure
> here so I really fail to see why we should code-around it. If that can
> be improved then by all means let's do it.
>

Yes, I agree with you that we could improve the current per-cpu
infrastructure. Could we perhaps increase the size of vm_node_stat_diff from
s8 to s16 for this "per-cpu infrastructure" (the per-cpu counter
infrastructure already uses s32)? The s8 type no longer seems sufficient with
more and more CPU cores, especially for monotonically increasing counters
like the NUMA counters.

                                  before   after (NUMA stats moved to per_cpu_nodestat
                                                  and s8 changed to s16)
sizeof(struct per_cpu_nodestat)     28       68

If that is OK, we can also keep the performance improvement in a clean way.
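
A minimal sketch of that widening (illustrative only; the item counts below
are the 4.14-era values and may differ on other kernels):

struct per_cpu_nodestat {
	s16 stat_threshold;				/* was s8 */
	s16 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];	/* was s8 */
#ifdef CONFIG_NUMA
	s16 vm_numa_stat_diff[NR_VM_NUMA_STAT_ITEMS];	/* new */
#endif
};

/*
 * Size estimate with NR_VM_NODE_STAT_ITEMS == 27 and NR_VM_NUMA_STAT_ITEMS == 6:
 *   before: (1 + 27) * sizeof(s8)      = 28 bytes
 *   after:  (1 + 27 + 6) * sizeof(s16) = 68 bytes
 */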

2017-12-14 09:23:43

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: NUMA stats code cleanup and enhancement

On Thu 14-12-17 16:55:54, kemi wrote:
>
>
> On 2017年12月14日 15:29, Michal Hocko wrote:
> > On Thu 14-12-17 09:40:32, kemi wrote:
> >>
> >>
> >> or sometimes
> >> NUMA stats can't be disabled in their environments.
> >
> > why?
> >
> >> That's the reason
> >> why we spent time to do that optimization other than simply adding a runtime
> >> configuration interface.
> >>
> >> Furthermore, the code we optimized for is the core area of kernel that can
> >> benefit most of kernel actions, more or less I think.
> >>
> >> All right, let's think about it in another way, does a u64 percpu array per-node
> >> for NUMA stats really make code too much complicated and hard to maintain?
> >> I'm afraid not IMHO.
> >
> > I disagree. The whole numa stat things has turned out to be nasty to
> > maintain. For a very limited gain. Now you are just shifting that
> > elsewhere. Look, there are other counters taken in the allocator, we do
> > not want to treat them specially. We have a nice per-cpu infrastructure
> > here so I really fail to see why we should code-around it. If that can
> > be improved then by all means let's do it.
> >
>
> Yes, I agree with you that we may improve current per-cpu infrastructure.
> May we have a chance to increase the size of vm_node_stat_diff from s8 to s16 for
> this "per-cpu infrastructure" (s32 in per-cpu counter infrastructure)? The
> limitation of type s8 seems not enough with more and more cpu cores, especially
> for those monotone increasing type of counters like NUMA counters.
>
> before after(moving numa to per_cpu_nodestat
> and change s8 to s16)
> sizeof(struct per_cpu_nodestat) 28 68
>
> If ok, we can also keep that improvement in a nice way.

I wouldn't be opposed. Maybe we should make it nr_cpus sized.

--
Michal Hocko
SUSE Labs