2015-11-19 10:19:26

by Raghavendra K T

Subject: BUG: Unable to handle kernel paging request for data at address __percpu_counter_add

Hi,

While creating thousands of docker containers on a power8 bare-metal
system (config: 4.3.0 kernel, 1TB RAM, 20 cores (=160 CPUs)), I hit the
problem below after creating around 5600 containers.
[This looks similar to
https://bugzilla.kernel.org/show_bug.cgi?id=101011, but this
kernel already has Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9),
since it is the 4.3.0 tagged kernel.]

Any hints on how to go about the fix? Please let me know if you think
any more information is needed.

The docker daemon is device mapper based (and it took a day to recreate
the problem).

[With CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK disabled, I am able
to create 10k containers without any problem.]

Nov 14 17:27:00 docker5 kernel: [40161.570029] Unable to handle kernel
paging request for data at address 0x3fedfa0000
Nov 14 17:27:00 docker5 kernel: [40161.570125] Faulting instruction
address: 0xc00000000056de90
Nov 14 17:27:00 docker5 kernel: [40161.570136] Oops: Kernel access of
bad area, sig: 11 [#1]
Nov 14 17:27:00 docker5 kernel: [40161.570143] SMP NR_CPUS=256 NUMA PowerNV
Nov 14 17:27:00 docker5 kernel: [40161.570177] Modules linked in:
veth(E) xt_nat(E) xt_tcpudp(E) xt_addrtype(E) xt_conntrack(E)
ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E)
nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) iptable_filter(E)
ip_tables(E) x_tables(E) nf_nat(E) nf_conntrack(E) bridge(E) stp(E)
llc(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E)
dm_bufio(E) libcrc32c(E) uio_pdrv_genirq(E) powernv_rng(E) uio(E)
autofs4(E) ses(E) enclosure(E) mlx4_en(E) vxlan(E) ip6_udp_tunnel(E)
udp_tunnel(E) lpfc(E) mlx4_core(E) scsi_transport_fc(E) ipr(E)
Nov 14 17:27:00 docker5 kernel: [40161.570755] CPU: 154 PID: 77177 Comm:
docker Tainted: G E 4.3.0+ #34
Nov 14 17:27:00 docker5 kernel: [40161.570830] task: c00000eaec7f2780
ti: c00000eaa4ac0000 task.ti: c00000eaa4ac0000
Nov 14 17:27:00 docker5 kernel: [40161.570904] NIP: c00000000056de90 LR:
c0000000002273e0 CTR: 0000000000000000
Nov 14 17:27:00 docker5 kernel: [40161.570978] REGS: c00000eaa4ac3530
TRAP: 0300 Tainted: G E (4.3.0+)
Nov 14 17:27:00 docker5 kernel: [40161.571051] MSR: 9000000100009033
<SF,HV,EE,ME,IR,DR,RI,LE> CR: 28028428 XER: 20000000
Nov 14 17:27:00 docker5 kernel: [40161.571244] CFAR: c000000000008468
DAR: 0000003fedfa0000 DSISR: 40000000 SOFTE: 0
Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR00: c0000000002273e0
c00000eaa4ac37b0 c0000000014d6c00 c00000f1f7603fb8
Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR04: 0000000000000001
0000000000000040 0000000000000001 0000000000000001
Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR08: 000000000000007d
0000000000000000 0000003fedfa0000 0000003fedfa0000
Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR12: c0000000003a4700
c000000007fbb700 c000000000cff0f8 0000000000000000
Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR16: c00000e790430400
0000000000000000 0000000000000000 c00000e7a7e1a000
Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR20: c00000e7c9d16800
0000000000000000 c00000000176cfc4 0000000000000001
Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR24: 000000000009eca9
0000000000000001 0000000000000000 c00000ffcf8cb800
Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR28: c00000f21a739af0
0000000000000001 c000000001505414 c00000f1f7603fb8
Nov 14 17:27:00 docker5 kernel: [40161.572243] NIP [c00000000056de90]
__percpu_counter_add+0x30/0x100
Nov 14 17:27:00 docker5 kernel: [40161.572310] LR [c0000000002273e0]
account_page_dirtied+0x100/0x250
Nov 14 17:27:00 docker5 kernel: [40161.572373] Call Trace:
Nov 14 17:27:00 docker5 kernel: [40161.572401] [c00000eaa4ac37b0]
[c00000eaa4ac37f0] 0xc00000eaa4ac37f0 (unreliable)
Nov 14 17:27:00 docker5 kernel: [40161.572491] [c00000eaa4ac37f0]
[c0000000002273e0] account_page_dirtied+0x100/0x250
Nov 14 17:27:00 docker5 kernel: [40161.572580] [c00000eaa4ac3840]
[c00000000031031c] __set_page_dirty+0x7c/0x130
Nov 14 17:27:00 docker5 kernel: [40161.572656] [c00000eaa4ac3890]
[c0000000003106f8] mark_buffer_dirty+0x178/0x1c0
Nov 14 17:27:00 docker5 kernel: [40161.572746] [c00000eaa4ac38d0]
[c0000000003a5c54] ext4_commit_super+0x1d4/0x340
Nov 14 17:27:00 docker5 kernel: [40161.572835] [c00000eaa4ac3970]
[c0000000003a8d58] ext4_setup_super+0x118/0x250
Nov 14 17:27:00 docker5 kernel: [40161.572924] [c00000eaa4ac3a00]
[c0000000003abce4] ext4_fill_super+0x1c04/0x3250
Nov 14 17:27:00 docker5 kernel: [40161.573013] [c00000eaa4ac3b50]
[c0000000002c9964] mount_bdev+0x234/0x270
Nov 14 17:27:00 docker5 kernel: [40161.573089] [c00000eaa4ac3bd0]
[c0000000003a3178] ext4_mount+0x48/0x60
Nov 14 17:27:00 docker5 kernel: [40161.573165] [c00000eaa4ac3c10]
[c0000000002cad9c] mount_fs+0x8c/0x230
Nov 14 17:27:00 docker5 kernel: [40161.573242] [c00000eaa4ac3cb0]
[c0000000002f0518] vfs_kern_mount+0x78/0x180
Nov 14 17:27:00 docker5 kernel: [40161.573319] [c00000eaa4ac3d00]
[c0000000002f5150] do_mount+0x2e0/0xf60
Nov 14 17:27:00 docker5 kernel: [40161.573436] [c00000eaa4ac3dd0]
[c0000000002f61c4] SyS_mount+0xa4/0x110
Nov 14 17:27:00 docker5 kernel: [40161.573579] [c00000eaa4ac3e30]
[c000000000009260] system_call+0x38/0xd0
Nov 14 17:27:00 docker5 kernel: [40161.573718] Instruction dump:
Nov 14 17:27:00 docker5 kernel: [40161.573790] 3c4c00f7 38428da0
7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ffc1
Nov 14 17:27:00 docker5 kernel: [40161.574046] 7c7f1b78 7c9d2378
e94d0030 e9230020 <7fc952aa> 7fde2214 7fbe2800 409c0014
Nov 14 17:27:00 docker5 kernel: [40161.574298] ---[ end trace
25e9f03d556f3e5b ]---

root@docker5:~/linux# addr2line 0xc00000000056de90 -e vmlinux.nostrip
lib/percpu_counter.c:80
root@docker5:~/linux# addr2line c0000000002273e0 -e vmlinux.nostrip
include/linux/backing-dev.h:61
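
The faulting address in the oops can be cross-checked against the register dump. The sketch below decodes the instruction word marked with angle brackets in the "Instruction dump" and recomputes the effective address from the GPR values printed above. The helper name `decode_x_form` is made up for this illustration. The two preceding words in the dump decode as `ld r10,48(r13)` and `ld r9,32(r3)`, so r9 holding zero is consistent with a NULL pointer loaded from the counter structure being added to a per-CPU offset, but that reading is an interpretation and is not confirmed in this thread.

```python
# Decode the faulting PowerPC instruction word (the one in <...> in the
# "Instruction dump") and recompute the address it tried to load from.

def decode_x_form(word):
    """Split a 32-bit PowerPC X-form instruction into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # primary opcode
        "rt": (word >> 21) & 0x1F,      # target register
        "ra": (word >> 16) & 0x1F,      # base register
        "rb": (word >> 11) & 0x1F,      # index register
        "xo": (word >> 1) & 0x3FF,      # extended opcode
    }

# Register values copied from the oops (GPR08 line and DAR).
gpr = {9: 0x0000000000000000, 10: 0x0000003FEDFA0000}
dar = 0x0000003FEDFA0000

fields = decode_x_form(0x7FC952AA)

# Primary opcode 31 with extended opcode 341 is lwax
# (load word algebraic indexed).
is_lwax = fields["opcode"] == 31 and fields["xo"] == 341

# For an indexed load the effective address is (RA|0) + RB.
base = 0 if fields["ra"] == 0 else gpr[fields["ra"]]
effective_address = base + gpr[fields["rb"]]

print(fields)
print(hex(effective_address), effective_address == dar)
```

The recomputed address matches the DAR value, which means the whole faulting address comes from r10 while the base pointer in r9 contributes nothing.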

- Raghu


2015-11-23 21:13:27

by Tejun Heo

Subject: Re: BUG: Unable to handle kernel paging request for data at address __percpu_counter_add

Hello,

On Thu, Nov 19, 2015 at 03:54:35PM +0530, Raghavendra K T wrote:
> While I was creating thousands of docker container on a power8 baremetal
> (config: 4.3.0 kernel 1TB RAM, 20core (=160 cpu) system. After creating
> around 5600 container
> I have hit below problem.
> [This is looking similar to
> https://bugzilla.kernel.org/show_bug.cgi?id=101011, but
> kernel had Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9) since
> it is 4.3.0 tagged kernel]
>
> Any hints on how to go about the fix. Please let me know if you think any
> more information needed.
>
> docker daemon is device mapper based. (and it took a day to recreate the
> problem)
>
> [ by disabling CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK I am able to
> create 10k containers without any problem]

This could be the same problem that Ilya is trying to fix, i.e. the
blkdev i_wb pointing to a stale wb. Can you please see whether the
following patch resolves the issue?

http://lkml.kernel.org/g/[email protected]
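
As a rough illustration of what "i_wb pointing to a stale wb" means, here is a toy model in Python. It is not kernel code: the classes `PercpuCounter`, `Writeback`, and `Inode` and the function `account_page_dirtied` are simplified stand-ins named after the symbols in the trace, and the assumption that a released wb leaves its counter storage unset is made purely for the sketch.

```python
class PercpuCounter:
    """Toy stand-in for a per-CPU counter (hypothetical, not kernel code)."""

    def __init__(self, ncpus):
        self.counters = [0] * ncpus  # one slot per CPU

    def destroy(self):
        self.counters = None  # assumed: storage is gone after destroy

    def add(self, cpu, amount):
        # Raises TypeError if the counter was already destroyed.
        self.counters[cpu] += amount


class Writeback:
    """Toy stand-in for a wb that owns a per-CPU dirty-page counter."""

    def __init__(self, ncpus):
        self.dirtied = PercpuCounter(ncpus)

    def release(self):
        self.dirtied.destroy()


class Inode:
    """Toy inode that caches a reference to its wb."""

    def __init__(self, wb):
        self.i_wb = wb


def account_page_dirtied(inode, cpu):
    # Accounting goes through whatever wb the inode still points at.
    inode.i_wb.dirtied.add(cpu, 1)


wb = Writeback(ncpus=4)
inode = Inode(wb)
account_page_dirtied(inode, cpu=2)  # fine: the wb is still live

wb.release()  # the wb goes away, but inode.i_wb still refers to it

try:
    account_page_dirtied(inode, cpu=2)  # accounting through the stale wb
    stale_access_failed = False
except TypeError:
    stale_access_failed = True

print("stale access failed:", stale_access_failed)
```

In the model, the second accounting call fails because the inode was never told that its wb had been released, which is the general shape of the problem described above.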

Thanks.

--
tejun

2015-11-24 06:00:20

by Raghavendra K T

Subject: Re: BUG: Unable to handle kernel paging request for data at address __percpu_counter_add

On 11/24/2015 02:43 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Nov 19, 2015 at 03:54:35PM +0530, Raghavendra K T wrote:
>> While I was creating thousands of docker container on a power8 baremetal
>> (config: 4.3.0 kernel 1TB RAM, 20core (=160 cpu) system. After creating
>> around 5600 container
>> I have hit below problem.
>> [This is looking similar to
>> https://bugzilla.kernel.org/show_bug.cgi?id=101011, but
>> kernel had Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9) since
>> it is 4.3.0 tagged kernel]
>>
>> Any hints on how to go about the fix. Please let me know if you think any
>> more information needed.
>>
>> docker daemon is device mapper based. (and it took a day to recreate the
>> problem)
>>
>> [ by disabling CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK I am able to
>> create 10k containers without any problem]
>
> Could be the same problem that Ilya is trying to fix. ie. blkdev i_wb
> pointing to a stale wb. Can you please see whether the following
> patch resolves the issue?
>
> http://lkml.kernel.org/g/[email protected]
>

Thanks, Tejun, for the pointer.
I will check whether the patch resolves the issue. (Reproduction takes a
long time, so it may take a while to report back.)

2015-11-30 07:13:33

by Raghavendra K T

Subject: Re: BUG: Unable to handle kernel paging request for data at address __percpu_counter_add

On 11/24/2015 02:43 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Nov 19, 2015 at 03:54:35PM +0530, Raghavendra K T wrote:
>> While I was creating thousands of docker container on a power8 baremetal
>> (config: 4.3.0 kernel 1TB RAM, 20core (=160 cpu) system. After creating
>> around 5600 container
>> I have hit below problem.
>> [This is looking similar to
>> https://bugzilla.kernel.org/show_bug.cgi?id=101011, but
>> kernel had Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9) since
>> it is 4.3.0 tagged kernel]
>>
>> Any hints on how to go about the fix. Please let me know if you think any
>> more information needed.
>>
>> docker daemon is device mapper based. (and it took a day to recreate the
>> problem)
>>
>> [ by disabling CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK I am able to
>> create 10k containers without any problem]
>
> Could be the same problem that Ilya is trying to fix. ie. blkdev i_wb
> pointing to a stale wb. Can you please see whether the following
> patch resolves the issue?
>
> http://lkml.kernel.org/g/[email protected]
>

Hi Tejun,

Thanks again for the pointer. With the patch applied, I was able to
create more than 10k containers without any problem with
CGROUP_WRITEBACK enabled, whereas earlier I had hit this problem a few
times at around 5k+ containers.
(Also replying to Ilya's thread.)