2017-06-04 11:33:26

by Johannes Weiner

Subject: Re: Regression on ARMs in next-20170531

On Wed, May 31, 2017 at 06:43:33PM +0100, Russell King - ARM Linux wrote:
> On Wed, May 31, 2017 at 09:45:45AM -0700, Tony Lindgren wrote:
> > Mark Brown noticed that so far the only booting
> > ARMs are all with CONFIG_SMP disabled, and I just
> > confirmed that's the case.
>
> > 8< --------------------
> > Unable to handle kernel paging request at virtual address 2e116007
> > pgd = c0004000
> > [2e116007] *pgd=00000000
> > Internal error: Oops: 5 [#1] SMP ARM
> > Modules linked in:
> > CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc3-00153-gb6bc6724488a #200
> > Hardware name: Generic DRA74X (Flattened Device Tree)
> > task: c0d0adc0 task.stack: c0d00000
> > PC is at __mod_node_page_state+0x2c/0xc8
> > LR is at __per_cpu_offset+0x0/0x8
> > pc : [<c0271de8>] lr : [<c0d07da4>] psr: 600000d3
> > sp : c0d01eec ip : 00000000 fp : c15782f4
> > r10: 00000000 r9 : c1591280 r8 : 00004000
> > r7 : 00000001 r6 : 00000006 r5 : 2e116000 r4 : 00000007
> > r3 : 00000007 r2 : 00000001 r1 : 00000006 r0 : c0dc27c0
> > Flags: nZCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment none
> ...
> > Code: e79e5103 e28c3001 e0833001 e1a04003 (e19440d5)
>
> This disassembles to:
>
> 0: e79e5103 ldr r5, [lr, r3, lsl #2]
> 4: e28c3001 add r3, ip, #1
> 8: e0833001 add r3, r3, r1
> c: e1a04003 mov r4, r3
> 10: e19440d5 ldrsb r4, [r4, r5]
>
> I don't have a similarly configured kernel, but here is what I have
> for the start of this function:
>
> 00000680 <__mod_node_page_state>:
> 680: e1a0c00d mov ip, sp
> 684: e92dd870 push {r4, r5, r6, fp, ip, lr, pc}
> 688: e24cb004 sub fp, ip, #4
> 68c: e590cc00 ldr ip, [r0, #3072] ; 0xc00
> 690: e1a0400d mov r4, sp
> 694: ee1d6f90 mrc 15, 0, r6, cr13, cr0, {4}
> 698: e08c5001 add r5, ip, r1
> 69c: e2855001 add r5, r5, #1
> 6a0: e1a03005 mov r3, r5
> 6a4: e196c0dc ldrsb ip, [r6, ip]
> 6a8: e19630d3 ldrsb r3, [r6, r3]
>
> r5 in your code is the equivalent of r6, r4 => r3, r3 => r5.
> lr is the __per_cpu_offset array, so the first instruction is
> trying to load the percpu offset.
>
> The faulting code is:
>
> x = delta + __this_cpu_read(*p);
>
> specifically "__this_cpu_read(*p)".
>
> "ip" holds "pcp" from:
>
> struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
>
> and you may notice that it's zero in the register dump. So,
> pgdat->per_cpu_nodestats is NULL here.
>
> This seems to be setup in setup_per_cpu_pageset(), which in the init
> order, happens way after mm_init() (which contains kmem_cache_init()).

Thanks for the analysis, Russell.

I think it's NULL because the slab allocation happens before even the
root_mem_cgroup is set up, and so root_mem_cgroup -> lruvec -> pgdat
gives us garbage.

Tony, Josef, since the patches are dropped from -next, could you test
the -mm tree at git://git.cmpxchg.org/linux-mmots.git and verify that
this patch below fixes the issue?

---

From 47007dfcd7873cb93d11466a93b1f41f6a7a434f Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Sun, 4 Jun 2017 07:02:44 -0400
Subject: [PATCH] mm: memcontrol: per-lruvec stats infrastructure fix 2

Even with the previous fix routing !page->mem_cgroup stats to the root
cgroup, we still see crashes in certain configurations, as the root is
not initialized for the earliest possible accounting sites.

Don't track uncharged pages at all, not even in the root. This takes
care of early accounting as well as special pages that aren't tracked.

Because we still need to account at the pgdat level, we can no longer
implement the lruvec_page_state functions on top of the lruvec_state
ones. But that's okay, it was a little silly to look up the nodeinfo
and descend to the lruvec, only to container_of() back to the nodeinfo
where the lruvec_stat structure is sitting.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bea6f08e9e16..da9360885260 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -585,27 +585,27 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
static inline void __mod_lruvec_page_state(struct page *page,
enum node_stat_item idx, int val)
{
- struct mem_cgroup *memcg;
- struct lruvec *lruvec;
-
- /* Special pages in the VM aren't charged, use root */
- memcg = page->mem_cgroup ? : root_mem_cgroup;
+ struct mem_cgroup_per_node *pn;

- lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
- __mod_lruvec_state(lruvec, idx, val);
+ __mod_node_page_state(page_pgdat(page), idx, val);
+ if (mem_cgroup_disabled() || !page->mem_cgroup)
+ return;
+ __mod_memcg_state(page->mem_cgroup, idx, val);
+ pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
+ __this_cpu_add(pn->lruvec_stat->count[idx], val);
}

static inline void mod_lruvec_page_state(struct page *page,
enum node_stat_item idx, int val)
{
- struct mem_cgroup *memcg;
- struct lruvec *lruvec;
-
- /* Special pages in the VM aren't charged, use root */
- memcg = page->mem_cgroup ? : root_mem_cgroup;
+ struct mem_cgroup_per_node *pn;

- lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
- mod_lruvec_state(lruvec, idx, val);
+ mod_node_page_state(page_pgdat(page), idx, val);
+ if (mem_cgroup_disabled() || !page->mem_cgroup)
+ return;
+ mod_memcg_state(page->mem_cgroup, idx, val);
+ pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
+ this_cpu_add(pn->lruvec_stat->count[idx], val);
}

unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
--
2.13.0


2017-06-06 05:55:22

by Tony Lindgren

Subject: Re: Regression on ARMs in next-20170531

* Johannes Weiner <[email protected]> [170604 04:36]:
> I think it's NULL because the slab allocation happens before even the
> root_mem_cgroup is set up, and so root_mem_cgroup -> lruvec -> pgdat
> gives us garbage.
>
> Tony, Josef, since the patches are dropped from -next, could you test
> the -mm tree at git://git.cmpxchg.org/linux-mmots.git and verify that
> this patch below fixes the issue?

Looks like next-20170605 is broken for ARMs again. And the patch
below does not apply for me against mmots/master or next.
Care to update?

Regards,

Tony

2017-06-06 12:30:26

by Tony Lindgren

Subject: Re: Regression on ARMs in next-20170531

* Tony Lindgren <[email protected]> [170605 22:55]:
> * Johannes Weiner <[email protected]> [170604 04:36]:
> > I think it's NULL because the slab allocation happens before even the
> > root_mem_cgroup is set up, and so root_mem_cgroup -> lruvec -> pgdat
> > gives us garbage.
> >
> > Tony, Josef, since the patches are dropped from -next, could you test
> > the -mm tree at git://git.cmpxchg.org/linux-mmots.git and verify that
> > this patch below fixes the issue?
>
> Looks like next-20170605 is broken for ARMs again. And the patch
> below does not apply for me against mmots/master or next.
> Care to update?

Oh, I got it to apply on next-20170605; I must have had something
else applied that caused issues on my earlier attempt. Now I'm
getting the following error.

Regards,

Tony

8< -----------------
Unable to handle kernel paging request at virtual address 2ea2d007
pgd = c0004000
[2ea2d007] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc3-next-20170605+ #1227
Hardware name: Generic OMAP4 (Flattened Device Tree)
task: c0d0ae00 task.stack: c0d00000
PC is at __mod_node_page_state+0x2c/0xc8
LR is at __per_cpu_offset+0x0/0x8
pc : [<c0280078>] lr : [<c0d07d6c>] psr: 200001d3
sp : c0d01eec ip : 00000000 fp : 00000001
r10: c0c7cf68 r9 : 00008000 r8 : 00000000
r7 : 00000001 r6 : 00000006 r5 : 2ea2d000 r4 : 00000007
r3 : 00000007 r2 : 00000001 r1 : 00000006 r0 : c0dc1fc0
Flags: nzCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 8000404a DAC: 00000051
Process swapper (pid: 0, stack limit = 0xc0d00218)
Stack: (0xc0d01eec to 0xc0d02000)
1ee0: 400001d3 c0dc1fc0 c028018c 00000001 c1599440
1f00: c0d58834 efd83000 00000000 c02af214 01000000 c157a890 00002000 00008000
1f20: 00000001 00000001 00008000 c02aeb4c 00000000 00008000 c0d58834 00008000
1f40: 01008000 c0c23a88 c0d58834 c1580034 400001d3 c02afa9c 00000000 c086b230
1f60: c0d58834 000000c0 01000000 c157a78c c0abe0fc 00000080 00002000 c0dd4000
1f80: efffec40 c0c55a48 00000000 c0c23a88 c157a78c c0c5be48 c0c5bde8 c157a890
1fa0: c0dd4000 c0c25a9c 00000000 ffffffff c0dd4000 c0d07940 c0dd4000 c0c00abc
1fc0: ffffffff ffffffff 00000000 c0c006a0 00000000 c0c55a48 c0dd4214 c0d07958
1fe0: c0c55a44 c0d0cae4 8000406a 411fc093 00000000 8000807c 00000000 00000000
[<c0280078>] (__mod_node_page_state) from [<c028018c>] (mod_node_page_state+0x2c/0x4c)
[<c028018c>] (mod_node_page_state) from [<c02af214>] (cache_alloc_refill+0x654/0x898)
[<c02af214>] (cache_alloc_refill) from [<c02afa9c>] (kmem_cache_alloc+0x2d4/0x364)
[<c02afa9c>] (kmem_cache_alloc) from [<c0c23a88>] (create_kmalloc_cache+0x20/0x8c)
[<c0c23a88>] (create_kmalloc_cache) from [<c0c25a9c>] (kmem_cache_init+0xac/0x11c)
[<c0c25a9c>] (kmem_cache_init) from [<c0c00abc>] (start_kernel+0x1b8/0x3d8)
[<c0c00abc>] (start_kernel) from [<8000807c>] (0x8000807c)
Code: e79e5103 e28c3001 e0833001 e1a04003 (e19440d5)

2017-06-06 14:37:18

by Johannes Weiner

Subject: Re: Regression on ARMs in next-20170531

On Tue, Jun 06, 2017 at 05:30:10AM -0700, Tony Lindgren wrote:
> PC is at __mod_node_page_state+0x2c/0xc8
> LR is at __per_cpu_offset+0x0/0x8
> pc : [<c0280078>] lr : [<c0d07d6c>] psr: 200001d3
> sp : c0d01eec ip : 00000000 fp : 00000001
> r10: c0c7cf68 r9 : 00008000 r8 : 00000000
> r7 : 00000001 r6 : 00000006 r5 : 2ea2d000 r4 : 00000007
> r3 : 00000007 r2 : 00000001 r1 : 00000006 r0 : c0dc1fc0
> Flags: nzCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment none
> Control: 10c5387d Table: 8000404a DAC: 00000051
> Process swapper (pid: 0, stack limit = 0xc0d00218)
> Stack: (0xc0d01eec to 0xc0d02000)
> 1ee0: 400001d3 c0dc1fc0 c028018c 00000001 c1599440
> 1f00: c0d58834 efd83000 00000000 c02af214 01000000 c157a890 00002000 00008000
> 1f20: 00000001 00000001 00008000 c02aeb4c 00000000 00008000 c0d58834 00008000
> 1f40: 01008000 c0c23a88 c0d58834 c1580034 400001d3 c02afa9c 00000000 c086b230
> 1f60: c0d58834 000000c0 01000000 c157a78c c0abe0fc 00000080 00002000 c0dd4000
> 1f80: efffec40 c0c55a48 00000000 c0c23a88 c157a78c c0c5be48 c0c5bde8 c157a890
> 1fa0: c0dd4000 c0c25a9c 00000000 ffffffff c0dd4000 c0d07940 c0dd4000 c0c00abc
> 1fc0: ffffffff ffffffff 00000000 c0c006a0 00000000 c0c55a48 c0dd4214 c0d07958
> 1fe0: c0c55a44 c0d0cae4 8000406a 411fc093 00000000 8000807c 00000000 00000000
> [<c0280078>] (__mod_node_page_state) from [<c028018c>] (mod_node_page_state+0x2c/0x4c)
> [<c028018c>] (mod_node_page_state) from [<c02af214>] (cache_alloc_refill+0x654/0x898)
> [<c02af214>] (cache_alloc_refill) from [<c02afa9c>] (kmem_cache_alloc+0x2d4/0x364)
> [<c02afa9c>] (kmem_cache_alloc) from [<c0c23a88>] (create_kmalloc_cache+0x20/0x8c)
> [<c0c23a88>] (create_kmalloc_cache) from [<c0c25a9c>] (kmem_cache_init+0xac/0x11c)
> [<c0c25a9c>] (kmem_cache_init) from [<c0c00abc>] (start_kernel+0x1b8/0x3d8)

That's the one Russell analyzed and I misinterpreted. We put a fix
into -next to initialize pgdat->per_cpu_nodestats in time for slab
initialization during boot.

Is today's -next working again?

2017-06-06 15:03:49

by Andrew Morton

Subject: Re: Regression on ARMs in next-20170531

On Tue, 6 Jun 2017 10:36:48 -0400 Johannes Weiner <[email protected]> wrote:

> On Tue, Jun 06, 2017 at 05:30:10AM -0700, Tony Lindgren wrote:
> > PC is at __mod_node_page_state+0x2c/0xc8
> > LR is at __per_cpu_offset+0x0/0x8
> > pc : [<c0280078>] lr : [<c0d07d6c>] psr: 200001d3
> > sp : c0d01eec ip : 00000000 fp : 00000001
> > r10: c0c7cf68 r9 : 00008000 r8 : 00000000
> > r7 : 00000001 r6 : 00000006 r5 : 2ea2d000 r4 : 00000007
> > r3 : 00000007 r2 : 00000001 r1 : 00000006 r0 : c0dc1fc0
> > Flags: nzCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment none
> > Control: 10c5387d Table: 8000404a DAC: 00000051
> > Process swapper (pid: 0, stack limit = 0xc0d00218)
> > Stack: (0xc0d01eec to 0xc0d02000)
> > 1ee0: 400001d3 c0dc1fc0 c028018c 00000001 c1599440
> > 1f00: c0d58834 efd83000 00000000 c02af214 01000000 c157a890 00002000 00008000
> > 1f20: 00000001 00000001 00008000 c02aeb4c 00000000 00008000 c0d58834 00008000
> > 1f40: 01008000 c0c23a88 c0d58834 c1580034 400001d3 c02afa9c 00000000 c086b230
> > 1f60: c0d58834 000000c0 01000000 c157a78c c0abe0fc 00000080 00002000 c0dd4000
> > 1f80: efffec40 c0c55a48 00000000 c0c23a88 c157a78c c0c5be48 c0c5bde8 c157a890
> > 1fa0: c0dd4000 c0c25a9c 00000000 ffffffff c0dd4000 c0d07940 c0dd4000 c0c00abc
> > 1fc0: ffffffff ffffffff 00000000 c0c006a0 00000000 c0c55a48 c0dd4214 c0d07958
> > 1fe0: c0c55a44 c0d0cae4 8000406a 411fc093 00000000 8000807c 00000000 00000000
> > [<c0280078>] (__mod_node_page_state) from [<c028018c>] (mod_node_page_state+0x2c/0x4c)
> > [<c028018c>] (mod_node_page_state) from [<c02af214>] (cache_alloc_refill+0x654/0x898)
> > [<c02af214>] (cache_alloc_refill) from [<c02afa9c>] (kmem_cache_alloc+0x2d4/0x364)
> > [<c02afa9c>] (kmem_cache_alloc) from [<c0c23a88>] (create_kmalloc_cache+0x20/0x8c)
> > [<c0c23a88>] (create_kmalloc_cache) from [<c0c25a9c>] (kmem_cache_init+0xac/0x11c)
> > [<c0c25a9c>] (kmem_cache_init) from [<c0c00abc>] (start_kernel+0x1b8/0x3d8)
>
> That's the one Russell analyzed and I misinterpreted. We put a fix
> into -next to initialize pgdat->per_cpu_nodestats in time for slab
> initialization during boot.
>
> Is today's -next working again?

I'll be even less than usually functional for the next week, so please
cc Stephen on any -next hotfixes.

2017-06-07 04:38:51

by Tony Lindgren

Subject: Re: Regression on ARMs in next-20170531

* Andrew Morton <[email protected]> [170606 08:07]:
> On Tue, 6 Jun 2017 10:36:48 -0400 Johannes Weiner <[email protected]> wrote:
>
> > On Tue, Jun 06, 2017 at 05:30:10AM -0700, Tony Lindgren wrote:
> > > PC is at __mod_node_page_state+0x2c/0xc8
> > > LR is at __per_cpu_offset+0x0/0x8
> > > pc : [<c0280078>] lr : [<c0d07d6c>] psr: 200001d3
> > > sp : c0d01eec ip : 00000000 fp : 00000001
> > > r10: c0c7cf68 r9 : 00008000 r8 : 00000000
> > > r7 : 00000001 r6 : 00000006 r5 : 2ea2d000 r4 : 00000007
> > > r3 : 00000007 r2 : 00000001 r1 : 00000006 r0 : c0dc1fc0
> > > Flags: nzCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment none
> > > Control: 10c5387d Table: 8000404a DAC: 00000051
> > > Process swapper (pid: 0, stack limit = 0xc0d00218)
> > > Stack: (0xc0d01eec to 0xc0d02000)
> > > 1ee0: 400001d3 c0dc1fc0 c028018c 00000001 c1599440
> > > 1f00: c0d58834 efd83000 00000000 c02af214 01000000 c157a890 00002000 00008000
> > > 1f20: 00000001 00000001 00008000 c02aeb4c 00000000 00008000 c0d58834 00008000
> > > 1f40: 01008000 c0c23a88 c0d58834 c1580034 400001d3 c02afa9c 00000000 c086b230
> > > 1f60: c0d58834 000000c0 01000000 c157a78c c0abe0fc 00000080 00002000 c0dd4000
> > > 1f80: efffec40 c0c55a48 00000000 c0c23a88 c157a78c c0c5be48 c0c5bde8 c157a890
> > > 1fa0: c0dd4000 c0c25a9c 00000000 ffffffff c0dd4000 c0d07940 c0dd4000 c0c00abc
> > > 1fc0: ffffffff ffffffff 00000000 c0c006a0 00000000 c0c55a48 c0dd4214 c0d07958
> > > 1fe0: c0c55a44 c0d0cae4 8000406a 411fc093 00000000 8000807c 00000000 00000000
> > > [<c0280078>] (__mod_node_page_state) from [<c028018c>] (mod_node_page_state+0x2c/0x4c)
> > > [<c028018c>] (mod_node_page_state) from [<c02af214>] (cache_alloc_refill+0x654/0x898)
> > > [<c02af214>] (cache_alloc_refill) from [<c02afa9c>] (kmem_cache_alloc+0x2d4/0x364)
> > > [<c02afa9c>] (kmem_cache_alloc) from [<c0c23a88>] (create_kmalloc_cache+0x20/0x8c)
> > > [<c0c23a88>] (create_kmalloc_cache) from [<c0c25a9c>] (kmem_cache_init+0xac/0x11c)
> > > [<c0c25a9c>] (kmem_cache_init) from [<c0c00abc>] (start_kernel+0x1b8/0x3d8)
> >
> > That's the one Russell analyzed and I misinterpreted. We put a fix
> > into -next to initialize pgdat->per_cpu_nodestats in time for slab
> > initialization during boot.

OK

> > Is today's -next working again?

Yes, just tested that next-20170606 is working again, thanks!

> I'll be even less than usually functional for the next week, so please
> cc Stephen on any -next hotfixes.

OK, good to know in case new issues show up.

Regards,

Tony