2013-10-25 16:15:23

by Flavio Leitner

[permalink] [raw]
Subject: RIP: mem_cgroup_move_account+0xf4/0x290


While playing with guests and net-next kernel, I've triggered
this with some frequency. Even Fedora 19 kernel reproduces.

It it a known issue?

Thanks,
fbl

[ 6790.349763] kvm: zapping shadow pages for mmio generation wraparound
[ 6792.283879] kvm: zapping shadow pages for mmio generation wraparound
[ 7535.654438] perf samples too long (2719 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 7535.665948] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 11.560 msecs
[ 7691.048392] virbr0: port 1(vnet0) entered disabled state
[ 7691.056281] device vnet0 left promiscuous mode
[ 7691.061674] virbr0: port 1(vnet0) entered disabled state
[ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
[ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
[ 7691.178574] PGD 0
[ 7691.181042] Oops: 0000 [#1] SMP
[ 7691.184761] Modules linked in: vhost_net vhost macvtap macvlan tun veth openvswitch xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 vxlan ip_tunnel gre libcrc32c ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw coretemp kvm_intel snd_hda_codec_realtek snd_hda_intel nfsd snd_hda_codec kvm auth_rpcgss nfs_acl snd_hwdep lockd snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc sunrpc snd_timer crc32c_intel i7core_edac bnx2 shpchp ptp snd iTCO_wdt joydev pps_core iTCO_vendor_support pcspkr soundcore microcode serio_raw lpc_ich edac_core mfd_core i2c_i801 acpi_cpufreq hid_logitech_dj nouveau ata_generic pata_acpi video i2c_algo_bit drm_kms_helper ttm drm mxm_wmi i2c_core pata_marvell wmi [last unloaded: openvswitch]
[ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
[ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
[ 7691.306066] Workqueue: events css_killed_work_fn
[ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
[ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
[ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
[ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
[ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
[ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
[ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
[ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
[ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
[ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
[ 7691.399235] Stack:
[ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
[ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
[ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
[ 7691.426578] Call Trace:
[ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
[ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
[ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
[ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
[ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
[ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
[ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
[ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
[ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
[ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
[ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
[ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
[ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9 f7 d9 89 cf 65 48 01 7e
[ 7691.528638] RIP [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290


2013-10-26 03:39:50

by Johannes Weiner

[permalink] [raw]
Subject: Re: RIP: mem_cgroup_move_account+0xf4/0x290

On Fri, Oct 25, 2013 at 02:15:55PM -0200, Flavio Leitner wrote:
>
> While playing with guests and net-next kernel, I've triggered
> this with some frequency. Even Fedora 19 kernel reproduces.
>
> It it a known issue?
>
> Thanks,
> fbl
>
> [ 6790.349763] kvm: zapping shadow pages for mmio generation wraparound
> [ 6792.283879] kvm: zapping shadow pages for mmio generation wraparound
> [ 7535.654438] perf samples too long (2719 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
> [ 7535.665948] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 11.560 msecs
> [ 7691.048392] virbr0: port 1(vnet0) entered disabled state
> [ 7691.056281] device vnet0 left promiscuous mode
> [ 7691.061674] virbr0: port 1(vnet0) entered disabled state
> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> [ 7691.178574] PGD 0
> [ 7691.181042] Oops: 0000 [#1] SMP
> [ 7691.184761] Modules linked in: vhost_net vhost macvtap macvlan tun veth openvswitch xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 vxlan ip_tunnel gre libcrc32c ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw coretemp kvm_intel snd_hda_codec_realtek snd_hda_intel nfsd snd_hda_codec kvm auth_rpcgss nfs_acl snd_hwdep lockd snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc sunrpc snd_timer crc32c_intel i7core_edac bnx2 shpchp ptp snd iTCO_wdt joydev pps_core iTCO_vendor_support pcspkr soundcore microcode serio_raw lpc_ich edac_core mfd_core i2c_i801 acpi_cpufreq hid_logitech_dj nouveau ata_generic pata_acpi video i2c_algo_bit drm_kms_helper ttm drm mxm_wmi i2c_core pata_marvell wmi [last unloaded: openvswitch]
> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
> [ 7691.306066] Workqueue: events css_killed_work_fn
> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
> [ 7691.399235] Stack:
> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
> [ 7691.426578] Call Trace:
> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9 f7 d9 89 cf 65 48 01 7e

This is

All code
========
0: 85 f6 test %esi,%esi
2: 48 8b 55 d0 mov -0x30(%rbp),%rdx
6: 44 8b 4d c8 mov -0x38(%rbp),%r9d
a: 4c 8b 45 c0 mov -0x40(%rbp),%r8
e: 0f 85 b3 00 00 00 jne 0xc7
14: 41 8b 4c 24 18 mov 0x18(%r12),%ecx
19: 85 c9 test %ecx,%ecx
1b: 0f 88 a6 00 00 00 js 0xc7
21: 48 8b b2 30 02 00 00 mov 0x230(%rdx),%rsi
28: 45 89 ca mov %r9d,%r10d
2b:* 4c 39 56 18 cmp %r10,0x18(%rsi) <-- trapping instruction
2f: 0f 8c 36 01 00 00 jl 0x16b
35: 44 89 c9 mov %r9d,%ecx
38: f7 d9 neg %ecx
3a: 89 cf mov %ecx,%edi
3c: 65 gs
3d: 48 rex.W
3e: 01 .byte 0x1
3f: 7e .byte 0x7e

which corresponds to

WARN_ON_ONCE(from->stat->count[idx] < nr_pages);

Humm. from->stat is a percpu pointer... This patch should fix it:

---
>From 4e9fe9d7e8502eab1c8bb4761de838f61cd4a8e0 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Fri, 25 Oct 2013 23:23:31 -0400
Subject: [patch] mm: memcg: fix percpu variable access crash

3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
added a WARN_ON_ONCE() to sanity check the page statistics counter
when moving charges. Unfortunately, it dereferences the percpu
counter directly, which may result in a crash like this:

[ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
[ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
[ 7691.178574] PGD 0
[ 7691.181042] Oops: 0000 [#1] SMP
[...]
[ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
[ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
[ 7691.306066] Workqueue: events css_killed_work_fn
[ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
[ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
[ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
[ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
[ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
[ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
[ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
[ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
[ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
[ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
[ 7691.399235] Stack:
[ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
[ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
[ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
[ 7691.426578] Call Trace:
[ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
[ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
[ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
[ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
[ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
[ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
[ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
[ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
[ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
[ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
[ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
[ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
[ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18
85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9
f7 d9 89 cf 65 48 01 7e
[ 7691.528638] RIP [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290

Add the required __this_cpu_read().

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4097a78..a4864b6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
{
/* Update stat data for mem_cgroup */
preempt_disable();
- WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
+ WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages);
__this_cpu_add(from->stat->count[idx], -nr_pages);
__this_cpu_add(to->stat->count[idx], nr_pages);
preempt_enable();
--
1.8.4.1

2013-10-26 12:46:41

by 含黛

[permalink] [raw]
Subject: Re: RIP: mem_cgroup_move_account+0xf4/0x290

On 10/26/2013 11:39 AM, Johannes Weiner wrote:
> On Fri, Oct 25, 2013 at 02:15:55PM -0200, Flavio Leitner wrote:
>> While playing with guests and net-next kernel, I've triggered
>> this with some frequency. Even Fedora 19 kernel reproduces.
>>
>> It it a known issue?
>>
>> Thanks,
>> fbl
>>
>> [ 6790.349763] kvm: zapping shadow pages for mmio generation wraparound
>> [ 6792.283879] kvm: zapping shadow pages for mmio generation wraparound
>> [ 7535.654438] perf samples too long (2719 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
>> [ 7535.665948] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 11.560 msecs
>> [ 7691.048392] virbr0: port 1(vnet0) entered disabled state
>> [ 7691.056281] device vnet0 left promiscuous mode
>> [ 7691.061674] virbr0: port 1(vnet0) entered disabled state
>> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
>> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>> [ 7691.178574] PGD 0
>> [ 7691.181042] Oops: 0000 [#1] SMP
>> [ 7691.184761] Modules linked in: vhost_net vhost macvtap macvlan tun veth openvswitch xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 vxlan ip_tunnel gre libcrc32c ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw coretemp kvm_intel snd_hda_codec_realtek snd_hda_intel nfsd snd_hda_codec kvm auth_rpcgss nfs_acl snd_hwdep lockd snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc sunrpc snd_timer crc32c_intel i7core_edac bnx2 shpchp ptp snd iTCO_wdt joydev pps_core iTCO_vendor_support pcspkr soundcore microcode serio_raw lpc_ich edac_core mfd_core i2c_i801 acpi_cpufreq hid_logitech_dj nouveau ata_generic pata_acpi video i2c_algo_bit drm_kms_helper ttm drm mxm_wmi i2c_core pata_marvell wmi [last unloaded: openvswitch]
>> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
>> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
>> [ 7691.306066] Workqueue: events css_killed_work_fn
>> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
>> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
>> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
>> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
>> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
>> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
>> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
>> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
>> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
>> [ 7691.399235] Stack:
>> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
>> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
>> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
>> [ 7691.426578] Call Trace:
>> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
>> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
>> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
>> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
>> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
>> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
>> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
>> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
>> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
>> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
>> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9 f7 d9 89 cf 65 48 01 7e
> This is
>
> All code
> ========
> 0: 85 f6 test %esi,%esi
> 2: 48 8b 55 d0 mov -0x30(%rbp),%rdx
> 6: 44 8b 4d c8 mov -0x38(%rbp),%r9d
> a: 4c 8b 45 c0 mov -0x40(%rbp),%r8
> e: 0f 85 b3 00 00 00 jne 0xc7
> 14: 41 8b 4c 24 18 mov 0x18(%r12),%ecx
> 19: 85 c9 test %ecx,%ecx
> 1b: 0f 88 a6 00 00 00 js 0xc7
> 21: 48 8b b2 30 02 00 00 mov 0x230(%rdx),%rsi
> 28: 45 89 ca mov %r9d,%r10d
> 2b:* 4c 39 56 18 cmp %r10,0x18(%rsi) <-- trapping instruction
> 2f: 0f 8c 36 01 00 00 jl 0x16b
> 35: 44 89 c9 mov %r9d,%ecx
> 38: f7 d9 neg %ecx
> 3a: 89 cf mov %ecx,%edi
> 3c: 65 gs
> 3d: 48 rex.W
> 3e: 01 .byte 0x1
> 3f: 7e .byte 0x7e
>
> which corresponds to
>
> WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
>
> Humm. from->stat is a percpu pointer... This patch should fix it:
>
> ---
> From 4e9fe9d7e8502eab1c8bb4761de838f61cd4a8e0 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Fri, 25 Oct 2013 23:23:31 -0400
> Subject: [patch] mm: memcg: fix percpu variable access crash
>
> 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
> added a WARN_ON_ONCE() to sanity check the page statistics counter
> when moving charges. Unfortunately, it dereferences the percpu
> counter directly, which may result in a crash like this:
>
> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> [ 7691.178574] PGD 0
> [ 7691.181042] Oops: 0000 [#1] SMP
> [...]
> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
> [ 7691.306066] Workqueue: events css_killed_work_fn
> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
> [ 7691.399235] Stack:
> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
> [ 7691.426578] Call Trace:
> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18
> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9
> f7 d9 89 cf 65 48 01 7e
> [ 7691.528638] RIP [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>
> Add the required __this_cpu_read().

Sorry for my mistake and thanks for the fix up, it looks good to me.

Reviewed-by: Sha Zhengju <[email protected]>


Thanks,
Sha
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memcontrol.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4097a78..a4864b6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> {
> /* Update stat data for mem_cgroup */
> preempt_disable();
> - WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
> + WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages);
> __this_cpu_add(from->stat->count[idx], -nr_pages);
> __this_cpu_add(to->stat->count[idx], nr_pages);
> preempt_enable();

2013-10-26 15:41:14

by Greg Thelen

[permalink] [raw]
Subject: Re: RIP: mem_cgroup_move_account+0xf4/0x290

On Sat, Oct 26 2013, 含黛 wrote:

> On 10/26/2013 11:39 AM, Johannes Weiner wrote:
>> On Fri, Oct 25, 2013 at 02:15:55PM -0200, Flavio Leitner wrote:
>>> While playing with guests and net-next kernel, I've triggered
>>> this with some frequency. Even Fedora 19 kernel reproduces.
>>>
>>> It it a known issue?
>>>
>>> Thanks,
>>> fbl
>>>
>>> [ 6790.349763] kvm: zapping shadow pages for mmio generation wraparound
>>> [ 6792.283879] kvm: zapping shadow pages for mmio generation wraparound
>>> [ 7535.654438] perf samples too long (2719 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
>>> [ 7535.665948] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 11.560 msecs
>>> [ 7691.048392] virbr0: port 1(vnet0) entered disabled state
>>> [ 7691.056281] device vnet0 left promiscuous mode
>>> [ 7691.061674] virbr0: port 1(vnet0) entered disabled state
>>> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
>>> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>>> [ 7691.178574] PGD 0
>>> [ 7691.181042] Oops: 0000 [#1] SMP
>>> [ 7691.184761] Modules linked in: vhost_net vhost macvtap macvlan tun veth
>>> openvswitch xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast
>>> ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp
>>> llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
>>> nf_nat_ipv6 vxlan ip_tunnel gre libcrc32c ip6table_mangle ip6table_security
>>> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
>>> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
>>> iptable_security iptable_raw coretemp kvm_intel snd_hda_codec_realtek
>>> snd_hda_intel nfsd snd_hda_codec kvm auth_rpcgss nfs_acl snd_hwdep lockd
>>> snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc sunrpc snd_timer
>>> crc32c_intel i7core_edac bnx2 shpchp ptp snd iTCO_wdt joydev pps_core
>>> iTCO_vendor_support pcspkr soundcore microcode serio_raw lpc_ich edac_core
>>> mfd_core i2c_i801 acpi_cpufreq hid_logitech_dj nouveau ata_generic pata_acpi
>>> video i2c_algo_bit drm_kms_helper ttm drm mxm_wmi i2c_core pata_marvell wmi
>>> [last unloaded: openvswitch]
>>> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
>>> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
>>> [ 7691.306066] Workqueue: events css_killed_work_fn
>>> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
>>> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>>> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
>>> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
>>> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
>>> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
>>> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
>>> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
>>> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
>>> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
>>> [ 7691.399235] Stack:
>>> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
>>> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
>>> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
>>> [ 7691.426578] Call Trace:
>>> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
>>> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
>>> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
>>> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
>>> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
>>> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
>>> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
>>> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
>>> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
>>> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>>> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
>>> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00
>>> 00 41 8b 4c 24 18 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c>
>>> 39 56 18 0f 8c 36 01 00 00 44 89 c9 f7 d9 89 cf 65 48 01 7e
>> This is
>>
>> All code
>> ========
>> 0: 85 f6 test %esi,%esi
>> 2: 48 8b 55 d0 mov -0x30(%rbp),%rdx
>> 6: 44 8b 4d c8 mov -0x38(%rbp),%r9d
>> a: 4c 8b 45 c0 mov -0x40(%rbp),%r8
>> e: 0f 85 b3 00 00 00 jne 0xc7
>> 14: 41 8b 4c 24 18 mov 0x18(%r12),%ecx
>> 19: 85 c9 test %ecx,%ecx
>> 1b: 0f 88 a6 00 00 00 js 0xc7
>> 21: 48 8b b2 30 02 00 00 mov 0x230(%rdx),%rsi
>> 28: 45 89 ca mov %r9d,%r10d
>> 2b:* 4c 39 56 18 cmp %r10,0x18(%rsi) <-- trapping instruction
>> 2f: 0f 8c 36 01 00 00 jl 0x16b
>> 35: 44 89 c9 mov %r9d,%ecx
>> 38: f7 d9 neg %ecx
>> 3a: 89 cf mov %ecx,%edi
>> 3c: 65 gs
>> 3d: 48 rex.W
>> 3e: 01 .byte 0x1
>> 3f: 7e .byte 0x7e
>>
>> which corresponds to
>>
>> WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
>>
>> Humm. from->stat is a percpu pointer... This patch should fix it:
>>
>> ---
>> From 4e9fe9d7e8502eab1c8bb4761de838f61cd4a8e0 Mon Sep 17 00:00:00 2001
>> From: Johannes Weiner <[email protected]>
>> Date: Fri, 25 Oct 2013 23:23:31 -0400
>> Subject: [patch] mm: memcg: fix percpu variable access crash
>>
>> 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
>> added a WARN_ON_ONCE() to sanity check the page statistics counter
>> when moving charges. Unfortunately, it dereferences the percpu
>> counter directly, which may result in a crash like this:
>>
>> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
>> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>> [ 7691.178574] PGD 0
>> [ 7691.181042] Oops: 0000 [#1] SMP
>> [...]
>> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
>> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
>> [ 7691.306066] Workqueue: events css_killed_work_fn
>> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
>> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
>> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
>> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
>> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
>> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
>> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
>> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
>> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
>> [ 7691.399235] Stack:
>> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
>> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
>> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
>> [ 7691.426578] Call Trace:
>> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
>> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
>> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
>> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
>> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
>> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
>> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
>> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
>> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
>> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
>> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18
>> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9
>> f7 d9 89 cf 65 48 01 7e
>> [ 7691.528638] RIP [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>>
>> Add the required __this_cpu_read().
>
> Sorry for my mistake and thanks for the fix up, it looks good to me.
>
> Reviewed-by: Sha Zhengju <[email protected]>
>
>
> Thanks,
> Sha
>>
>> Signed-off-by: Johannes Weiner <[email protected]>
>> ---
>> mm/memcontrol.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 4097a78..a4864b6 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> {
>> /* Update stat data for mem_cgroup */
>> preempt_disable();
>> - WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
>> + WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages);
>> __this_cpu_add(from->stat->count[idx], -nr_pages);
>> __this_cpu_add(to->stat->count[idx], nr_pages);
>> preempt_enable();
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

I was just polishing up this following patch which I think is better
because it should avoid spurious warnings.

---8<---

>From c1f43ef0f4cc42fb2ecaeaca71bd247365e3521e Mon Sep 17 00:00:00 2001
From: Greg Thelen <[email protected]>
Date: Fri, 25 Oct 2013 21:59:57 -0700
Subject: [PATCH] memcg: remove incorrect underflow check

When a memcg is deleted mem_cgroup_reparent_charges() moves charged
memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per
cgroup writeback pages accounting" there's bad pointer read. The goal
was to check for counter underflow. The counter is a per cpu counter
and there are two problems with the code:
(1) per cpu access function isn't used, instead a naked pointer is
used which easily causes panic.
(2) the check doesn't sum all cpus

Test:
$ cd /sys/fs/cgroup/memory
$ mkdir x
$ echo 3 > /proc/sys/vm/drop_caches
$ (echo $BASHPID >> x/tasks && exec cat) &
[1] 7154
$ grep ^mapped x/memory.stat
mapped_file 53248
$ echo 7154 > tasks
$ rmdir x
<PANIC>

The fix is to remove the check. It's currently dangerous and isn't
worth fixing it to use something expensive, such as
percpu_counter_sum(), for each reparented page. __this_cpu_read()
isn't enough to fix this because there's no guarantees of the current
cpus count. The only guarantees is that the sum of all per-cpu
counter is >= nr_pages.

Signed-off-by: Greg Thelen <[email protected]>
---
mm/memcontrol.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 34d3ca9..aa8185c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,6 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
{
/* Update stat data for mem_cgroup */
preempt_disable();
- WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
__this_cpu_add(from->stat->count[idx], -nr_pages);
__this_cpu_add(to->stat->count[idx], nr_pages);
preempt_enable();
--
1.8.4.1

2013-10-28 10:00:58

by Sha Zhengju

[permalink] [raw]
Subject: Re: RIP: mem_cgroup_move_account+0xf4/0x290

On Sat, Oct 26, 2013 at 11:41 PM, Greg Thelen <[email protected]> wrote:
> On Sat, Oct 26 2013, wrote:
>
>> On 10/26/2013 11:39 AM, Johannes Weiner wrote:
>>> On Fri, Oct 25, 2013 at 02:15:55PM -0200, Flavio Leitner wrote:
>>>> While playing with guests and net-next kernel, I've triggered
>>>> this with some frequency. Even Fedora 19 kernel reproduces.
>>>>
>>>> It it a known issue?
>>>>
>>>> Thanks,
>>>> fbl
>>>>
>>>> [ 6790.349763] kvm: zapping shadow pages for mmio generation wraparound
>>>> [ 6792.283879] kvm: zapping shadow pages for mmio generation wraparound
>>>> [ 7535.654438] perf samples too long (2719 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
>>>> [ 7535.665948] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 11.560 msecs
>>>> [ 7691.048392] virbr0: port 1(vnet0) entered disabled state
>>>> [ 7691.056281] device vnet0 left promiscuous mode
>>>> [ 7691.061674] virbr0: port 1(vnet0) entered disabled state
>>>> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
>>>> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>>>> [ 7691.178574] PGD 0
>>>> [ 7691.181042] Oops: 0000 [#1] SMP
>>>> [ 7691.184761] Modules linked in: vhost_net vhost macvtap macvlan tun veth
>>>> openvswitch xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast
>>>> ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp
>>>> llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
>>>> nf_nat_ipv6 vxlan ip_tunnel gre libcrc32c ip6table_mangle ip6table_security
>>>> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
>>>> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
>>>> iptable_security iptable_raw coretemp kvm_intel snd_hda_codec_realtek
>>>> snd_hda_intel nfsd snd_hda_codec kvm auth_rpcgss nfs_acl snd_hwdep lockd
>>>> snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc sunrpc snd_timer
>>>> crc32c_intel i7core_edac bnx2 shpchp ptp snd iTCO_wdt joydev pps_core
>>>> iTCO_vendor_support pcspkr soundcore microcode serio_raw lpc_ich edac_core
>>>> mfd_core i2c_i801 acpi_cpufreq hid_logitech_dj nouveau ata_generic pata_acpi
>>>> video i2c_algo_bit drm_kms_helper ttm drm mxm_wmi i2c_core pata_marvell wmi
>>>> [last unloaded: openvswitch]
>>>> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
>>>> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
>>>> [ 7691.306066] Workqueue: events css_killed_work_fn
>>>> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
>>>> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>>>> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
>>>> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
>>>> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
>>>> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
>>>> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
>>>> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
>>>> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
>>>> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
>>>> [ 7691.399235] Stack:
>>>> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
>>>> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
>>>> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
>>>> [ 7691.426578] Call Trace:
>>>> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
>>>> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
>>>> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
>>>> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
>>>> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
>>>> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
>>>> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
>>>> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
>>>> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
>>>> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>>>> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
>>>> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>>>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00
>>>> 00 41 8b 4c 24 18 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c>
>>>> 39 56 18 0f 8c 36 01 00 00 44 89 c9 f7 d9 89 cf 65 48 01 7e
>>> This is
>>>
>>> All code
>>> ========
>>> 0: 85 f6 test %esi,%esi
>>> 2: 48 8b 55 d0 mov -0x30(%rbp),%rdx
>>> 6: 44 8b 4d c8 mov -0x38(%rbp),%r9d
>>> a: 4c 8b 45 c0 mov -0x40(%rbp),%r8
>>> e: 0f 85 b3 00 00 00 jne 0xc7
>>> 14: 41 8b 4c 24 18 mov 0x18(%r12),%ecx
>>> 19: 85 c9 test %ecx,%ecx
>>> 1b: 0f 88 a6 00 00 00 js 0xc7
>>> 21: 48 8b b2 30 02 00 00 mov 0x230(%rdx),%rsi
>>> 28: 45 89 ca mov %r9d,%r10d
>>> 2b:* 4c 39 56 18 cmp %r10,0x18(%rsi) <-- trapping instruction
>>> 2f: 0f 8c 36 01 00 00 jl 0x16b
>>> 35: 44 89 c9 mov %r9d,%ecx
>>> 38: f7 d9 neg %ecx
>>> 3a: 89 cf mov %ecx,%edi
>>> 3c: 65 gs
>>> 3d: 48 rex.W
>>> 3e: 01 .byte 0x1
>>> 3f: 7e .byte 0x7e
>>>
>>> which corresponds to
>>>
>>> WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
>>>
>>> Humm. from->stat is a percpu pointer... This patch should fix it:
>>>
>>> ---
>>> From 4e9fe9d7e8502eab1c8bb4761de838f61cd4a8e0 Mon Sep 17 00:00:00 2001
>>> From: Johannes Weiner <[email protected]>
>>> Date: Fri, 25 Oct 2013 23:23:31 -0400
>>> Subject: [patch] mm: memcg: fix percpu variable access crash
>>>
>>> 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
>>> added a WARN_ON_ONCE() to sanity check the page statistics counter
>>> when moving charges. Unfortunately, it dereferences the percpu
>>> counter directly, which may result in a crash like this:
>>>
>>> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
>>> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>>> [ 7691.178574] PGD 0
>>> [ 7691.181042] Oops: 0000 [#1] SMP
>>> [...]
>>> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
>>> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
>>> [ 7691.306066] Workqueue: events css_killed_work_fn
>>> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
>>> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>>> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
>>> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
>>> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
>>> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
>>> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
>>> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
>>> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
>>> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
>>> [ 7691.399235] Stack:
>>> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
>>> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
>>> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
>>> [ 7691.426578] Call Trace:
>>> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
>>> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
>>> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
>>> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
>>> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
>>> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
>>> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
>>> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
>>> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
>>> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>>> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
>>> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
>>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18
>>> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9
>>> f7 d9 89 cf 65 48 01 7e
>>> [ 7691.528638] RIP [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
>>>
>>> Add the required __this_cpu_read().
>>
>> Sorry for my mistake and thanks for the fix up, it looks good to me.
>>
>> Reviewed-by: Sha Zhengju <[email protected]>
>>
>>
>> Thanks,
>> Sha
>>>
>>> Signed-off-by: Johannes Weiner <[email protected]>
>>> ---
>>> mm/memcontrol.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 4097a78..a4864b6 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>>> {
>>> /* Update stat data for mem_cgroup */
>>> preempt_disable();
>>> - WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
>>> + WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages);
>>> __this_cpu_add(from->stat->count[idx], -nr_pages);
>>> __this_cpu_add(to->stat->count[idx], nr_pages);
>>> preempt_enable();
>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
> I was just polishing up this following patch which I think is better
> because it should avoid spurious warnings.
>
> ---8<---
>
> From c1f43ef0f4cc42fb2ecaeaca71bd247365e3521e Mon Sep 17 00:00:00 2001
> From: Greg Thelen <[email protected]>
> Date: Fri, 25 Oct 2013 21:59:57 -0700
> Subject: [PATCH] memcg: remove incorrect underflow check
>
> When a memcg is deleted mem_cgroup_reparent_charges() moves charged
> memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per
> cgroup writeback pages accounting" there's bad pointer read. The goal
> was to check for counter underflow. The counter is a per cpu counter
> and there are two problems with the code:
> (1) per cpu access function isn't used, instead a naked pointer is
> used which easily causes panic.
> (2) the check doesn't sum all cpus
>
> Test:
> $ cd /sys/fs/cgroup/memory
> $ mkdir x
> $ echo 3 > /proc/sys/vm/drop_caches
> $ (echo $BASHPID >> x/tasks && exec cat) &
> [1] 7154
> $ grep ^mapped x/memory.stat
> mapped_file 53248
> $ echo 7154 > tasks
> $ rmdir x
> <PANIC>
>
> The fix is to remove the check. It's currently dangerous and isn't
> worth fixing it to use something expensive, such as
> percpu_counter_sum(), for each reparented page. __this_cpu_read()
> isn't enough to fix this because there's no guarantees of the current
> cpus count. The only guarantees is that the sum of all per-cpu
> counter is >= nr_pages.

Yes... you're right. We should check the total stats of 'from' memcg
rather than the per cpu stat. I vote to remove the check completely.
Thanks for fixing this.

>
> Signed-off-by: Greg Thelen <[email protected]>


Reviewed-by: Sha Zhengju <[email protected]>
Thanks.


> ---
> mm/memcontrol.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 34d3ca9..aa8185c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3773,7 +3773,6 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> {
> /* Update stat data for mem_cgroup */
> preempt_disable();
> - WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
> __this_cpu_add(from->stat->count[idx], -nr_pages);
> __this_cpu_add(to->stat->count[idx], nr_pages);
> preempt_enable();
> --
> 1.8.4.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/



--
Thanks,
Sha

2013-10-29 14:29:10

by Johannes Weiner

[permalink] [raw]
Subject: Re: RIP: mem_cgroup_move_account+0xf4/0x290

On Sat, Oct 26, 2013 at 08:41:11AM -0700, Greg Thelen wrote:
> On Sat, Oct 26 2013, 含黛 wrote:
>
> > On 10/26/2013 11:39 AM, Johannes Weiner wrote:
> >> On Fri, Oct 25, 2013 at 02:15:55PM -0200, Flavio Leitner wrote:
> >>> While playing with guests and net-next kernel, I've triggered
> >>> this with some frequency. Even Fedora 19 kernel reproduces.
> >>>
> >>> It it a known issue?
> >>>
> >>> Thanks,
> >>> fbl
> >>>
> >>> [ 6790.349763] kvm: zapping shadow pages for mmio generation wraparound
> >>> [ 6792.283879] kvm: zapping shadow pages for mmio generation wraparound
> >>> [ 7535.654438] perf samples too long (2719 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
> >>> [ 7535.665948] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 11.560 msecs
> >>> [ 7691.048392] virbr0: port 1(vnet0) entered disabled state
> >>> [ 7691.056281] device vnet0 left promiscuous mode
> >>> [ 7691.061674] virbr0: port 1(vnet0) entered disabled state
> >>> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
> >>> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >>> [ 7691.178574] PGD 0
> >>> [ 7691.181042] Oops: 0000 [#1] SMP
> >>> [ 7691.184761] Modules linked in: vhost_net vhost macvtap macvlan tun veth
> >>> openvswitch xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast
> >>> ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp
> >>> llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
> >>> nf_nat_ipv6 vxlan ip_tunnel gre libcrc32c ip6table_mangle ip6table_security
> >>> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
> >>> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> >>> iptable_security iptable_raw coretemp kvm_intel snd_hda_codec_realtek
> >>> snd_hda_intel nfsd snd_hda_codec kvm auth_rpcgss nfs_acl snd_hwdep lockd
> >>> snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc sunrpc snd_timer
> >>> crc32c_intel i7core_edac bnx2 shpchp ptp snd iTCO_wdt joydev pps_core
> >>> iTCO_vendor_support pcspkr soundcore microcode serio_raw lpc_ich edac_core
> >>> mfd_core i2c_i801 acpi_cpufreq hid_logitech_dj nouveau ata_generic pata_acpi
> >>> video i2c_algo_bit drm_kms_helper ttm drm mxm_wmi i2c_core pata_marvell wmi
> >>> [last unloaded: openvswitch]
> >>> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
> >>> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
> >>> [ 7691.306066] Workqueue: events css_killed_work_fn
> >>> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
> >>> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >>> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
> >>> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
> >>> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
> >>> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
> >>> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
> >>> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
> >>> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
> >>> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
> >>> [ 7691.399235] Stack:
> >>> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
> >>> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
> >>> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
> >>> [ 7691.426578] Call Trace:
> >>> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
> >>> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
> >>> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
> >>> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
> >>> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
> >>> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
> >>> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
> >>> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
> >>> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
> >>> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >>> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
> >>> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00
> >>> 00 41 8b 4c 24 18 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c>
> >>> 39 56 18 0f 8c 36 01 00 00 44 89 c9 f7 d9 89 cf 65 48 01 7e
> >> This is
> >>
> >> All code
> >> ========
> >> 0: 85 f6 test %esi,%esi
> >> 2: 48 8b 55 d0 mov -0x30(%rbp),%rdx
> >> 6: 44 8b 4d c8 mov -0x38(%rbp),%r9d
> >> a: 4c 8b 45 c0 mov -0x40(%rbp),%r8
> >> e: 0f 85 b3 00 00 00 jne 0xc7
> >> 14: 41 8b 4c 24 18 mov 0x18(%r12),%ecx
> >> 19: 85 c9 test %ecx,%ecx
> >> 1b: 0f 88 a6 00 00 00 js 0xc7
> >> 21: 48 8b b2 30 02 00 00 mov 0x230(%rdx),%rsi
> >> 28: 45 89 ca mov %r9d,%r10d
> >> 2b:* 4c 39 56 18 cmp %r10,0x18(%rsi) <-- trapping instruction
> >> 2f: 0f 8c 36 01 00 00 jl 0x16b
> >> 35: 44 89 c9 mov %r9d,%ecx
> >> 38: f7 d9 neg %ecx
> >> 3a: 89 cf mov %ecx,%edi
> >> 3c: 65 gs
> >> 3d: 48 rex.W
> >> 3e: 01 .byte 0x1
> >> 3f: 7e .byte 0x7e
> >>
> >> which corresponds to
> >>
> >> WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
> >>
> >> Humm. from->stat is a percpu pointer... This patch should fix it:
> >>
> >> ---
> >> From 4e9fe9d7e8502eab1c8bb4761de838f61cd4a8e0 Mon Sep 17 00:00:00 2001
> >> From: Johannes Weiner <[email protected]>
> >> Date: Fri, 25 Oct 2013 23:23:31 -0400
> >> Subject: [patch] mm: memcg: fix percpu variable access crash
> >>
> >> 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
> >> added a WARN_ON_ONCE() to sanity check the page statistics counter
> >> when moving charges. Unfortunately, it dereferences the percpu
> >> counter directly, which may result in a crash like this:
> >>
> >> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
> >> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >> [ 7691.178574] PGD 0
> >> [ 7691.181042] Oops: 0000 [#1] SMP
> >> [...]
> >> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
> >> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
> >> [ 7691.306066] Workqueue: events css_killed_work_fn
> >> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
> >> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
> >> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
> >> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
> >> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
> >> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
> >> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
> >> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
> >> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
> >> [ 7691.399235] Stack:
> >> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
> >> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
> >> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
> >> [ 7691.426578] Call Trace:
> >> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
> >> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
> >> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
> >> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
> >> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
> >> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
> >> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
> >> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
> >> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
> >> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
> >> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18
> >> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9
> >> f7 d9 89 cf 65 48 01 7e
> >> [ 7691.528638] RIP [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >>
> >> Add the required __this_cpu_read().
> >
> > Sorry for my mistake and thanks for the fix up, it looks good to me.
> >
> > Reviewed-by: Sha Zhengju <[email protected]>
> >
> >
> > Thanks,
> > Sha
> >>
> >> Signed-off-by: Johannes Weiner <[email protected]>
> >> ---
> >> mm/memcontrol.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 4097a78..a4864b6 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> >> {
> >> /* Update stat data for mem_cgroup */
> >> preempt_disable();
> >> - WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
> >> + WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages);
> >> __this_cpu_add(from->stat->count[idx], -nr_pages);
> >> __this_cpu_add(to->stat->count[idx], nr_pages);
> >> preempt_enable();
> >
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to [email protected]. For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
> I was just polishing up this following patch which I think is better
> because it should avoid spurious warnings.
>
> ---8<---
>
> >From c1f43ef0f4cc42fb2ecaeaca71bd247365e3521e Mon Sep 17 00:00:00 2001
> From: Greg Thelen <[email protected]>
> Date: Fri, 25 Oct 2013 21:59:57 -0700
> Subject: [PATCH] memcg: remove incorrect underflow check
>
> When a memcg is deleted mem_cgroup_reparent_charges() moves charged
> memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per
> cgroup writeback pages accounting" there's bad pointer read. The goal
> was to check for counter underflow. The counter is a per cpu counter
> and there are two problems with the code:
> (1) per cpu access function isn't used, instead a naked pointer is
> used which easily causes panic.
> (2) the check doesn't sum all cpus
>
> Test:
> $ cd /sys/fs/cgroup/memory
> $ mkdir x
> $ echo 3 > /proc/sys/vm/drop_caches
> $ (echo $BASHPID >> x/tasks && exec cat) &
> [1] 7154
> $ grep ^mapped x/memory.stat
> mapped_file 53248
> $ echo 7154 > tasks
> $ rmdir x
> <PANIC>
>
> The fix is to remove the check. It's currently dangerous and isn't
> worth fixing it to use something expensive, such as
> percpu_counter_sum(), for each reparented page. __this_cpu_read()
> isn't enough to fix this because there's no guarantees of the current
> cpus count. The only guarantees is that the sum of all per-cpu
> counter is >= nr_pages.
>
> Signed-off-by: Greg Thelen <[email protected]>

Makes sense to me. Thanks, Greg!

Acked-by: Johannes Weiner <[email protected]>

2013-10-29 16:22:49

by Flavio Leitner

[permalink] [raw]
Subject: Re: RIP: mem_cgroup_move_account+0xf4/0x290

On Sat, Oct 26, 2013 at 08:41:11AM -0700, Greg Thelen wrote:
> On Sat, Oct 26 2013, 含黛 wrote:
>
> > On 10/26/2013 11:39 AM, Johannes Weiner wrote:
> >> On Fri, Oct 25, 2013 at 02:15:55PM -0200, Flavio Leitner wrote:
> >>> While playing with guests and net-next kernel, I've triggered
> >>> this with some frequency. Even Fedora 19 kernel reproduces.
> >>>
> >>> It it a known issue?
> >>>
> >>> Thanks,
> >>> fbl
> >>>
> >>> [ 6790.349763] kvm: zapping shadow pages for mmio generation wraparound
> >>> [ 6792.283879] kvm: zapping shadow pages for mmio generation wraparound
> >>> [ 7535.654438] perf samples too long (2719 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
> >>> [ 7535.665948] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 11.560 msecs
> >>> [ 7691.048392] virbr0: port 1(vnet0) entered disabled state
> >>> [ 7691.056281] device vnet0 left promiscuous mode
> >>> [ 7691.061674] virbr0: port 1(vnet0) entered disabled state
> >>> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
> >>> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >>> [ 7691.178574] PGD 0
> >>> [ 7691.181042] Oops: 0000 [#1] SMP
> >>> [ 7691.184761] Modules linked in: vhost_net vhost macvtap macvlan tun veth
> >>> openvswitch xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast
> >>> ipt_MASQUERADE ip6t_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp
> >>> llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6
> >>> nf_nat_ipv6 vxlan ip_tunnel gre libcrc32c ip6table_mangle ip6table_security
> >>> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
> >>> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> >>> iptable_security iptable_raw coretemp kvm_intel snd_hda_codec_realtek
> >>> snd_hda_intel nfsd snd_hda_codec kvm auth_rpcgss nfs_acl snd_hwdep lockd
> >>> snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc sunrpc snd_timer
> >>> crc32c_intel i7core_edac bnx2 shpchp ptp snd iTCO_wdt joydev pps_core
> >>> iTCO_vendor_support pcspkr soundcore microcode serio_raw lpc_ich edac_core
> >>> mfd_core i2c_i801 acpi_cpufreq hid_logitech_dj nouveau ata_generic pata_acpi
> >>> video i2c_algo_bit drm_kms_helper ttm drm mxm_wmi i2c_core pata_marvell wmi
> >>> [last unloaded: openvswitch]
> >>> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
> >>> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
> >>> [ 7691.306066] Workqueue: events css_killed_work_fn
> >>> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
> >>> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >>> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
> >>> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
> >>> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
> >>> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
> >>> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
> >>> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
> >>> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
> >>> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
> >>> [ 7691.399235] Stack:
> >>> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
> >>> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
> >>> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
> >>> [ 7691.426578] Call Trace:
> >>> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
> >>> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
> >>> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
> >>> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
> >>> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
> >>> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
> >>> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
> >>> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
> >>> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
> >>> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >>> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
> >>> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00
> >>> 00 41 8b 4c 24 18 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c>
> >>> 39 56 18 0f 8c 36 01 00 00 44 89 c9 f7 d9 89 cf 65 48 01 7e
> >> This is
> >>
> >> All code
> >> ========
> >> 0: 85 f6 test %esi,%esi
> >> 2: 48 8b 55 d0 mov -0x30(%rbp),%rdx
> >> 6: 44 8b 4d c8 mov -0x38(%rbp),%r9d
> >> a: 4c 8b 45 c0 mov -0x40(%rbp),%r8
> >> e: 0f 85 b3 00 00 00 jne 0xc7
> >> 14: 41 8b 4c 24 18 mov 0x18(%r12),%ecx
> >> 19: 85 c9 test %ecx,%ecx
> >> 1b: 0f 88 a6 00 00 00 js 0xc7
> >> 21: 48 8b b2 30 02 00 00 mov 0x230(%rdx),%rsi
> >> 28: 45 89 ca mov %r9d,%r10d
> >> 2b:* 4c 39 56 18 cmp %r10,0x18(%rsi) <-- trapping instruction
> >> 2f: 0f 8c 36 01 00 00 jl 0x16b
> >> 35: 44 89 c9 mov %r9d,%ecx
> >> 38: f7 d9 neg %ecx
> >> 3a: 89 cf mov %ecx,%edi
> >> 3c: 65 gs
> >> 3d: 48 rex.W
> >> 3e: 01 .byte 0x1
> >> 3f: 7e .byte 0x7e
> >>
> >> which corresponds to
> >>
> >> WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
> >>
> >> Humm. from->stat is a percpu pointer... This patch should fix it:
> >>
> >> ---
> >> From 4e9fe9d7e8502eab1c8bb4761de838f61cd4a8e0 Mon Sep 17 00:00:00 2001
> >> From: Johannes Weiner <[email protected]>
> >> Date: Fri, 25 Oct 2013 23:23:31 -0400
> >> Subject: [patch] mm: memcg: fix percpu variable access crash
> >>
> >> 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
> >> added a WARN_ON_ONCE() to sanity check the page statistics counter
> >> when moving charges. Unfortunately, it dereferences the percpu
> >> counter directly, which may result in a crash like this:
> >>
> >> [ 7691.163363] BUG: unable to handle kernel paging request at 000060fbc0002a20
> >> [ 7691.171145] IP: [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >> [ 7691.178574] PGD 0
> >> [ 7691.181042] Oops: 0000 [#1] SMP
> >> [...]
> >> [ 7691.285989] CPU: 1 PID: 14 Comm: kworker/1:0 Tainted: G I 3.12.0-rc6-01188-gb45bd46 #1
> >> [ 7691.295779] Hardware name: /DX58SO, BIOS SOX5810J.86A.5599.2012.0529.2218 05/29/2012
> >> [ 7691.306066] Workqueue: events css_killed_work_fn
> >> [ 7691.311303] task: ffff880429555dc0 ti: ffff88042957a000 task.ti: ffff88042957a000
> >> [ 7691.319673] RIP: 0010:[<ffffffff8119dcb4>] [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >> [ 7691.329728] RSP: 0018:ffff88042957bcc8 EFLAGS: 00010002
> >> [ 7691.335747] RAX: 0000000000000246 RBX: ffff88042b17bc30 RCX: 0000000000000004
> >> [ 7691.343720] RDX: ffff880424cd6000 RSI: 000060fbc0002a08 RDI: ffff880424cd622c
> >> [ 7691.351735] RBP: ffff88042957bd20 R08: ffff880424cd4000 R09: 0000000000000001
> >> [ 7691.359751] R10: 0000000000000001 R11: 0000000000000001 R12: ffffea00103ef0c0
> >> [ 7691.367745] R13: ffff880424cd6000 R14: 0000000000000000 R15: ffff880424cd622c
> >> [ 7691.375738] FS: 0000000000000000(0000) GS:ffff88043fc20000(0000) knlGS:0000000000000000
> >> [ 7691.384755] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> [ 7691.391238] CR2: 000060fbc0002a20 CR3: 0000000001c0c000 CR4: 00000000000027e0
> >> [ 7691.399235] Stack:
> >> [ 7691.401672] ffff88042957bce8 ffff88042957bce8 ffffffff81312b6d ffff880424cd4000
> >> [ 7691.409968] ffff880400000001 ffff880424cd6000 ffffea00103ef0c0 ffff880424cd0430
> >> [ 7691.418264] ffff88042b17bc30 ffffea00103ef0e0 ffff880424cd6000 ffff88042957bda8
> >> [ 7691.426578] Call Trace:
> >> [ 7691.429513] [<ffffffff81312b6d>] ? list_del+0xd/0x30
> >> [ 7691.435250] [<ffffffff8119f5e7>] mem_cgroup_reparent_charges+0x247/0x460
> >> [ 7691.442874] [<ffffffff8119f9af>] mem_cgroup_css_offline+0xaf/0x1b0
> >> [ 7691.449942] [<ffffffff810da877>] offline_css+0x27/0x50
> >> [ 7691.455874] [<ffffffff810dcf8d>] css_killed_work_fn+0x2d/0xa0
> >> [ 7691.462466] [<ffffffff810821f5>] process_one_work+0x175/0x430
> >> [ 7691.469041] [<ffffffff81082e1b>] worker_thread+0x11b/0x3a0
> >> [ 7691.475345] [<ffffffff81082d00>] ? rescuer_thread+0x340/0x340
> >> [ 7691.481919] [<ffffffff81089860>] kthread+0xc0/0xd0
> >> [ 7691.487478] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >> [ 7691.494352] [<ffffffff8166ea3c>] ret_from_fork+0x7c/0xb0
> >> [ 7691.500464] [<ffffffff810897a0>] ? insert_kthread_work+0x40/0x40
> >> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 00 00 41 8b 4c 24 18
> >> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 01 00 00 44 89 c9
> >> f7 d9 89 cf 65 48 01 7e
> >> [ 7691.528638] RIP [<ffffffff8119dcb4>] mem_cgroup_move_account+0xf4/0x290
> >>
> >> Add the required __this_cpu_read().
> >
> > Sorry for my mistake and thanks for the fix up, it looks good to me.
> >
> > Reviewed-by: Sha Zhengju <[email protected]>
> >
> >
> > Thanks,
> > Sha
> >>
> >> Signed-off-by: Johannes Weiner <[email protected]>
> >> ---
> >> mm/memcontrol.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 4097a78..a4864b6 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> >> {
> >> /* Update stat data for mem_cgroup */
> >> preempt_disable();
> >> - WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
> >> + WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages);
> >> __this_cpu_add(from->stat->count[idx], -nr_pages);
> >> __this_cpu_add(to->stat->count[idx], nr_pages);
> >> preempt_enable();
> >
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to [email protected]. For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
> I was just polishing up this following patch which I think is better
> because it should avoid spurious warnings.
>
> ---8<---
>
> From c1f43ef0f4cc42fb2ecaeaca71bd247365e3521e Mon Sep 17 00:00:00 2001
> From: Greg Thelen <[email protected]>
> Date: Fri, 25 Oct 2013 21:59:57 -0700
> Subject: [PATCH] memcg: remove incorrect underflow check
>
> When a memcg is deleted mem_cgroup_reparent_charges() moves charged
> memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per
> cgroup writeback pages accounting" there's bad pointer read. The goal
> was to check for counter underflow. The counter is a per cpu counter
> and there are two problems with the code:
> (1) per cpu access function isn't used, instead a naked pointer is
> used which easily causes panic.
> (2) the check doesn't sum all cpus
>
> Test:
> $ cd /sys/fs/cgroup/memory
> $ mkdir x
> $ echo 3 > /proc/sys/vm/drop_caches
> $ (echo $BASHPID >> x/tasks && exec cat) &
> [1] 7154
> $ grep ^mapped x/memory.stat
> mapped_file 53248
> $ echo 7154 > tasks
> $ rmdir x
> <PANIC>
>
> The fix is to remove the check. It's currently dangerous and isn't
> worth fixing it to use something expensive, such as
> percpu_counter_sum(), for each reparented page. __this_cpu_read()
> isn't enough to fix this because there's no guarantees of the current
> cpus count. The only guarantees is that the sum of all per-cpu
> counter is >= nr_pages.
>
> Signed-off-by: Greg Thelen <[email protected]>
> ---

Tested-by: Flavio Leitner <[email protected]>