2014-10-11 03:44:20

by Carlos Carvalho

Subject: massive memory leak in 3.1[3-5] with nfs4+kerberos

We're observing a big memory leak in 3.1[3-5]. We've gone up to 3.15.8 and back
to 3.14 because of LTS. Today we're running 3.14.21. The problem has existed
for several months but recently has become a show-stopper.

Here are the values of SUnreclaim: from /proc/meminfo, sampled every 4h
(units are kB):

87192
297044
765320
2325160
3306056
4412808
4799292
5085392
4999936
5521648
6628496
7785460
8518084
8988404
9141220
9533224
10053484
10954000
11716700
12369516
12847412
13318872
13846196
14339476
14815600
15293564
15798024
17092772
19240084
21679888
22399060
22943812
23407004
24049804
26210880
28034980
29059812 <== almost 30GB!
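For reference, a collection loop along these lines would produce such a series; this is a minimal sketch (the 4h interval matches the text, but the log path and the `sample_sunreclaim` helper name are assumptions, not something stated in the thread):

```shell
#!/bin/sh
# Extract the SUnreclaim value (in kB) from a meminfo-style file.
# The path is a parameter so the extraction can be checked against
# canned input; it defaults to the real /proc/meminfo.
sample_sunreclaim() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^SUnreclaim:/ {print $2}' "$meminfo"
}

# Example collection loop (log path is an assumption):
# while true; do
#     echo "$(date +%s) $(sample_sunreclaim)" >> /var/log/sunreclaim.log
#     sleep 14400   # 4h; seconds for POSIX sleep portability
# done
```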

After a few days the machine has lost so much memory that it panics or becomes
very slow due to lack of cache, and we have to reboot it. It's a busy file
server hosting home directories.

We have several other busy servers (including identical hardware) but the
memory leak happens only in this machine. What is different about it is that
it's the only place where we use:
- nfs4 with authentication and encryption by kerberos
- raid10

All others do only nfs3 or no nfs, and raid6. That's why we suspect it's an nfs4
problem.

What about these patches: http://permalink.gmane.org/gmane.linux.nfs/62012
Bruce said they were accepted but they're not in 3.14. Were they rejected or
forgotten? Could they have any relation to this memory leak?



2014-10-28 19:12:31

by Carlos Carvalho

Subject: Re: massive memory leak in 3.1[3-5] with nfs4+kerberos

J. Bruce Fields ([email protected]) wrote on Tue, Oct 28, 2014 at 12:24:50PM BRST:
> On Tue, Oct 28, 2014 at 12:14:28PM -0200, Carlos Carvalho wrote:
> > J. Bruce Fields ([email protected]) wrote on Tue, Oct 14, 2014 at 05:42:45PM BRT:
> > > On Mon, Oct 13, 2014 at 08:50:27PM -0300, Carlos Carvalho wrote:
> > > > J. Bruce Fields ([email protected]) wrote on Mon, Oct 13, 2014 at 10:58:40AM BRT:
> > > > Note the big xprt_alloc. slabinfo is found in the kernel tree at tools/vm.
> > > > Another way to see it:
> > > >
> > > > urquell# sort -n /sys/kernel/slab/kmalloc-2048/alloc_calls | tail -n 2
> > > > 1519 nfsd4_create_session+0x24a/0x810 age=189221/25894524/71426273 pid=5372-5436 cpus=0-11,13-16,19-20 nodes=0-1
> > > > 3380755 xprt_alloc+0x1e/0x190 age=5/27767270/71441075 pid=6-32599 cpus=0-31 nodes=0-1
> > >
> > > Agreed that the xprt_alloc is suspicious, though I don't really
> > > understand these statistics.
> > >
> > > Since you have 4.1 clients, maybe this would be explained by a leak in
> > > the backchannel code.
> >
> > We've set clients to use 4.0 and it only made the problem worse; the growth in
> > unreclaimable memory was faster.
> >
> > > It could certainly still be worth testing 3.17 if possible.
> >
> > We tested it and it SEEMS the problem doesn't appear in 3.17.1; the SUnreclaim
> > value oscillates up and down as usual, instead of increasing monotonically.
> > However it didn't last long enough for us to get conclusive numbers because
> > after about 5-6h the machine fills the screen with "NMI watchdog CPU #... is
> > locked for more than 22s".
>
> Are there backtraces with those messages?

First one, nfsd:

Oct 22 18:37:12 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [nfsd:12603]
Oct 22 18:37:12 urquell kernel: Modules linked in:
Oct 22 18:37:12 urquell kernel: CPU: 23 PID: 12603 Comm: nfsd Not tainted 3.17.1-urquell-slabdebug #2
Oct 22 18:37:12 urquell kernel: Hardware name: SGI.COM SGI MIS Server/S2600JF, BIOS SE5C600.86B.01.03.0002.062020121504 06/20/2012
Oct 22 18:37:12 urquell kernel: task: ffff880fbb321000 ti: ffff880fbaf58000 task.ti: ffff880fbaf58000
Oct 22 18:37:12 urquell kernel: RIP: 0010:[<ffffffff82511685>] [<ffffffff82511685>] _raw_spin_unlock_irqrestore+0x5/0x10
Oct 22 18:37:12 urquell kernel: RSP: 0018:ffff880fbaf5bde0 EFLAGS: 00000296
Oct 22 18:37:12 urquell kernel: RAX: ffffffff82841930 RBX: ffffffff820963db RCX: 0000000000000000
Oct 22 18:37:12 urquell kernel: RDX: ffffffff82841948 RSI: 0000000000000296 RDI: ffffffff82841940
Oct 22 18:37:12 urquell kernel: RBP: ffff880fd1ac52d0 R08: 0000000000000000 R09: 0000000000000000
Oct 22 18:37:12 urquell kernel: R10: ffff880fd1ac5b28 R11: ffff880fff804780 R12: 00000000ffffffff
Oct 22 18:37:12 urquell kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Oct 22 18:37:12 urquell kernel: FS: 0000000000000000(0000) GS:ffff880fffde0000(0000) knlGS:0000000000000000
Oct 22 18:37:12 urquell kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 18:37:12 urquell kernel: CR2: 00007f22e0b01400 CR3: 000000000280e000 CR4: 00000000000407e0
Oct 22 18:37:12 urquell kernel: Stack:
Oct 22 18:37:12 urquell kernel: ffffffff821dd282 ffff880fbaf5bde8 ffff880fbaf5bde8 ffff881ff785c6d8
Oct 22 18:37:12 urquell kernel: ffff880fd1ac52d0 ffff881ff785c6d8 0000000000000020 ffffffff8285dac0
Oct 22 18:37:12 urquell kernel: ffffffff821e3550 ffff880fbaf5be28 ffff880fbaf5be28 ffff881ff785c6d8
Oct 22 18:37:12 urquell kernel: Call Trace:
Oct 22 18:37:12 urquell kernel: [<ffffffff821dd282>] ? __destroy_client+0xd2/0x130
Oct 22 18:37:12 urquell kernel: [<ffffffff821e3550>] ? nfs4_state_shutdown_net+0x150/0x1e0
Oct 22 18:37:12 urquell kernel: [<ffffffff821c1152>] ? nfsd_shutdown_net+0x32/0x60
Oct 22 18:37:12 urquell kernel: [<ffffffff821c11da>] ? nfsd_last_thread+0x5a/0x90
Oct 22 18:37:12 urquell kernel: [<ffffffff821c163b>] ? nfsd_destroy+0x4b/0x70
Oct 22 18:37:12 urquell kernel: [<ffffffff821c17a4>] ? nfsd+0x144/0x170
Oct 22 18:37:12 urquell kernel: [<ffffffff821c1660>] ? nfsd_destroy+0x70/0x70
Oct 22 18:37:12 urquell kernel: [<ffffffff8207f38c>] ? kthread+0xbc/0xe0
Oct 22 18:37:12 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70
Oct 22 18:37:12 urquell kernel: [<ffffffff82511dac>] ? ret_from_fork+0x7c/0xb0
Oct 22 18:37:12 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70

The next one, shortly after, from another task and core:

Oct 22 18:37:40 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [migration/9:54]
Oct 22 18:37:40 urquell kernel: Modules linked in:
Oct 22 18:37:40 urquell kernel: CPU: 9 PID: 54 Comm: migration/9 Tainted: G L 3.17.1-urquell-slabdebug #2
Oct 22 18:37:40 urquell kernel: Hardware name: SGI.COM SGI MIS Server/S2600JF, BIOS SE5C600.86B.01.03.0002.062020121504 06/20/2012
Oct 22 18:37:40 urquell kernel: task: ffff881ff932a000 ti: ffff881ff888c000 task.ti: ffff881ff888c000
Oct 22 18:37:40 urquell kernel: RIP: 0010:[<ffffffff820c51ec>] [<ffffffff820c51ec>] multi_cpu_stop+0x4c/0xb0
Oct 22 18:37:40 urquell kernel: RSP: 0000:ffff881ff888fd98 EFLAGS: 00000293
Oct 22 18:37:40 urquell kernel: RAX: 0000000000000000 RBX: ffff881ff932a000 RCX: ffff88203fc2d5a8
Oct 22 18:37:40 urquell kernel: RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff881ff412bbb0
Oct 22 18:37:40 urquell kernel: RBP: ffff881ff412bbb0 R08: ffff881ff888c000 R09: 0000000000000001
Oct 22 18:37:40 urquell kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff881ff6e18d68
Oct 22 18:37:40 urquell kernel: R13: 0000000000000046 R14: ffffffff8209197b R15: ffff88203fc31200
Oct 22 18:37:40 urquell kernel: FS: 0000000000000000(0000) GS:ffff88203fc20000(0000) knlGS:0000000000000000
Oct 22 18:37:40 urquell kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 22 18:37:40 urquell kernel: CR2: 00007fd3488e5640 CR3: 000000000280e000 CR4: 00000000000407e0
Oct 22 18:37:40 urquell kernel: Stack:
Oct 22 18:37:40 urquell kernel: 0000000000000000 ffff88203fc2d5a0 ffff881ff412bbd8 ffff88203fc2d5a8
Oct 22 18:37:40 urquell kernel: ffff881ff412bbb0 ffffffff820c4e50 ffff880fffc11200 ffffffff82084351
Oct 22 18:37:40 urquell kernel: 0000000000000001 ffff881ff92b8000 0000000000000000 0000000000000000
Oct 22 18:37:40 urquell kernel: Call Trace:
Oct 22 18:37:40 urquell kernel: [<ffffffff820c4e50>] ? cpu_stopper_thread+0x70/0x120
Oct 22 18:37:40 urquell kernel: [<ffffffff82084351>] ? ttwu_do_wakeup+0x11/0x80
Oct 22 18:37:40 urquell kernel: [<ffffffff8208254c>] ? smpboot_thread_fn+0xfc/0x190
Oct 22 18:37:40 urquell kernel: [<ffffffff82082450>] ? SyS_setgroups+0x130/0x130
Oct 22 18:37:40 urquell kernel: [<ffffffff8207f38c>] ? kthread+0xbc/0xe0
Oct 22 18:37:40 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70
Oct 22 18:37:40 urquell kernel: [<ffffffff82511dac>] ? ret_from_fork+0x7c/0xb0
Oct 22 18:37:40 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70

These alternate in a series:

Oct 22 18:37:40 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [nfsd:12603]
[REMOVED]
Oct 22 18:37:40 urquell kernel: [<ffffffff821e3550>] ? nfs4_state_shutdown_net+0x150/0x1e0
Oct 22 18:37:40 urquell kernel: [<ffffffff821c1152>] ? nfsd_shutdown_net+0x32/0x60
Oct 22 18:37:40 urquell kernel: [<ffffffff821c11da>] ? nfsd_last_thread+0x5a/0x90
Oct 22 18:37:40 urquell kernel: [<ffffffff821c163b>] ? nfsd_destroy+0x4b/0x70
Oct 22 18:37:40 urquell kernel: [<ffffffff821c17a4>] ? nfsd+0x144/0x170
Oct 22 18:37:40 urquell kernel: [<ffffffff821c1660>] ? nfsd_destroy+0x70/0x70
Oct 22 18:37:40 urquell kernel: [<ffffffff8207f38c>] ? kthread+0xbc/0xe0
Oct 22 18:37:40 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70
Oct 22 18:37:40 urquell kernel: [<ffffffff82511dac>] ? ret_from_fork+0x7c/0xb0
Oct 22 18:37:40 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70

Oct 22 18:38:08 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [migration/9:54]
[REMOVED]
Oct 22 18:38:08 urquell kernel: [<ffffffff820c4e50>] ? cpu_stopper_thread+0x70/0x120
Oct 22 18:38:08 urquell kernel: [<ffffffff82084351>] ? ttwu_do_wakeup+0x11/0x80
Oct 22 18:38:08 urquell kernel: [<ffffffff8208254c>] ? smpboot_thread_fn+0xfc/0x190
Oct 22 18:38:08 urquell kernel: [<ffffffff82082450>] ? SyS_setgroups+0x130/0x130
Oct 22 18:38:08 urquell kernel: [<ffffffff8207f38c>] ? kthread+0xbc/0xe0
Oct 22 18:38:08 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70
Oct 22 18:38:08 urquell kernel: [<ffffffff82511dac>] ? ret_from_fork+0x7c/0xb0
Oct 22 18:38:08 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70

Etc. There's some variation in the call trace:

Oct 22 18:38:44 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [nfsd:12603]
[REMOVED]
Oct 22 18:38:44 urquell kernel: [<ffffffff82096093>] ? __wake_up+0x43/0x70
Oct 22 18:38:44 urquell kernel: [<ffffffff821dcf49>] ? nfs4_put_stid+0x29/0x100
Oct 22 18:38:44 urquell kernel: [<ffffffff821dd282>] ? __destroy_client+0xd2/0x130
Oct 22 18:38:44 urquell kernel: [<ffffffff821e3550>] ? nfs4_state_shutdown_net+0x150/0x1e0
Oct 22 18:38:44 urquell kernel: [<ffffffff821c1152>] ? nfsd_shutdown_net+0x32/0x60
Oct 22 18:38:44 urquell kernel: [<ffffffff821c11da>] ? nfsd_last_thread+0x5a/0x90
Oct 22 18:38:44 urquell kernel: [<ffffffff821c163b>] ? nfsd_destroy+0x4b/0x70
Oct 22 18:38:44 urquell kernel: [<ffffffff821c17a4>] ? nfsd+0x144/0x170
Oct 22 18:38:44 urquell kernel: [<ffffffff821c1660>] ? nfsd_destroy+0x70/0x70
Oct 22 18:38:44 urquell kernel: [<ffffffff8207f38c>] ? kthread+0xbc/0xe0
Oct 22 18:38:44 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70
Oct 22 18:38:44 urquell kernel: [<ffffffff82511dac>] ? ret_from_fork+0x7c/0xb0
Oct 22 18:38:44 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70


Then reboot. The next NMI appeared the next day:

Oct 23 11:43:24 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/u64:3:6414]
Oct 23 11:43:24 urquell kernel: Modules linked in:
Oct 23 11:43:24 urquell kernel: CPU: 5 PID: 6414 Comm: kworker/u64:3 Not tainted 3.17.1-urquell-slabdebug #2
Oct 23 11:43:24 urquell kernel: Hardware name: SGI.COM SGI MIS Server/S2600JF, BIOS SE5C600.86B.01.03.0002.062020121504 06/20/2012
Oct 23 11:43:24 urquell kernel: Workqueue: nfsd4 laundromat_main
Oct 23 11:43:24 urquell kernel: task: ffff881ff0d49800 ti: ffff8813f5a04000 task.ti: ffff8813f5a04000
Oct 23 11:43:24 urquell kernel: RIP: 0010:[<ffffffff85245cc8>] [<ffffffff85245cc8>] _atomic_dec_and_lock+0x18/0x60
Oct 23 11:43:24 urquell kernel: RSP: 0018:ffff8813f5a07d08 EFLAGS: 00000246
Oct 23 11:43:24 urquell kernel: RAX: 0000000000000000 RBX: ffffffff85096290 RCX: 00000000ffffffff
Oct 23 11:43:24 urquell kernel: RDX: 0000000000000000 RSI: ffffffff850963db RDI: ffff8813f5a07cf8
Oct 23 11:43:24 urquell kernel: RBP: ffffffff85096093 R08: 0000000000000000 R09: 0000000000000001
Oct 23 11:43:24 urquell kernel: R10: 0000000000000001 R11: 0140000000000000 R12: 0000000000000000
Oct 23 11:43:24 urquell kernel: R13: ffff880ff5c752d0 R14: ffffffff851bdaa0 R15: 0000000000000206
Oct 23 11:43:24 urquell kernel: FS: 0000000000000000(0000) GS:ffff880fffca0000(0000) knlGS:0000000000000000
Oct 23 11:43:24 urquell kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 23 11:43:24 urquell kernel: CR2: ffffffffff600400 CR3: 000000000580e000 CR4: 00000000000407e0
Oct 23 11:43:24 urquell kernel: Stack:
Oct 23 11:43:24 urquell kernel: ffff881fee03dba0 ffffffff85096093 ffff8813f5a07cf8 ffffffff851dcf49
Oct 23 11:43:24 urquell kernel: ffff881fee039950 ffff881fee0398d8 ffff8813f5a07d48 ffffffff851dd282
Oct 23 11:43:24 urquell kernel: ffff8813f5a07d48 ffff8813f5a07d48 ffff8813f5a07db8 ffff881ff7fbbd98
Oct 23 11:43:24 urquell kernel: Call Trace:
Oct 23 11:43:24 urquell kernel: [<ffffffff85096093>] ? __wake_up+0x43/0x70
Oct 23 11:43:24 urquell kernel: [<ffffffff851dcf49>] ? nfs4_put_stid+0x29/0x100
Oct 23 11:43:24 urquell kernel: [<ffffffff851dd282>] ? __destroy_client+0xd2/0x130
Oct 23 11:43:24 urquell kernel: [<ffffffff851dd449>] ? laundromat_main+0x149/0x3f0
Oct 23 11:43:24 urquell kernel: [<ffffffff8507a6a4>] ? process_one_work+0x134/0x340
Oct 23 11:43:24 urquell kernel: [<ffffffff8507adb4>] ? worker_thread+0x114/0x460
Oct 23 11:43:24 urquell kernel: [<ffffffff85095b5c>] ? __wake_up_common+0x4c/0x80
Oct 23 11:43:24 urquell kernel: [<ffffffff8507aca0>] ? init_pwq+0x160/0x160
Oct 23 11:43:24 urquell kernel: [<ffffffff8507f38c>] ? kthread+0xbc/0xe0
Oct 23 11:43:24 urquell kernel: [<ffffffff8507f2d0>] ? __kthread_parkme+0x70/0x70
Oct 23 11:43:24 urquell kernel: [<ffffffff85511dac>] ? ret_from_fork+0x7c/0xb0
Oct 23 11:43:24 urquell kernel: [<ffffffff8507f2d0>] ? __kthread_parkme+0x70/0x70

Oct 23 11:43:52 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/u64:3:6414]
Oct 23 11:43:52 urquell kernel: Modules linked in:
Oct 23 11:43:52 urquell kernel: CPU: 5 PID: 6414 Comm: kworker/u64:3 Tainted: G L 3.17.1-urquell-slabdebug #2
Oct 23 11:43:52 urquell kernel: Hardware name: SGI.COM SGI MIS Server/S2600JF, BIOS SE5C600.86B.01.03.0002.062020121504 06/20/2012
Oct 23 11:43:52 urquell kernel: Workqueue: nfsd4 laundromat_main
Oct 23 11:43:52 urquell kernel: task: ffff881ff0d49800 ti: ffff8813f5a04000 task.ti: ffff8813f5a04000
Oct 23 11:43:52 urquell kernel: RIP: 0010:[<ffffffff85511685>] [<ffffffff85511685>] _raw_spin_unlock_irqrestore+0x5/0x10
Oct 23 11:43:52 urquell kernel: RSP: 0018:ffff8813f5a07d40 EFLAGS: 00000296
Oct 23 11:43:52 urquell kernel: RAX: ffffffff85841930 RBX: ffffffff850963db RCX: 0000000000000000
Oct 23 11:43:52 urquell kernel: RDX: ffffffff85841948 RSI: 0000000000000296 RDI: ffffffff85841940
Oct 23 11:43:52 urquell kernel: RBP: ffff881fee0398d8 R08: 0000000000000000 R09: 0000000000000001
Oct 23 11:43:52 urquell kernel: R10: 0000000000000001 R11: 0140000000000000 R12: ffff8813f5a07d48
Oct 23 11:43:52 urquell kernel: R13: ffff8813f5a07d48 R14: 0000000000000000 R15: 0000000000000001
Oct 23 11:43:52 urquell kernel: FS: 0000000000000000(0000) GS:ffff880fffca0000(0000) knlGS:0000000000000000
Oct 23 11:43:52 urquell kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 23 11:43:52 urquell kernel: CR2: ffffffffff600400 CR3: 000000000580e000 CR4: 00000000000407e0
Oct 23 11:43:52 urquell kernel: Stack:
Oct 23 11:43:52 urquell kernel: ffffffff851dd282 ffff8813f5a07d48 ffff8813f5a07d48 ffff8813f5a07db8
Oct 23 11:43:52 urquell kernel: ffff881ff7fbbd98 ffff8813f5a07db8 ffff881ff7fbbeb0 000000005449058d
Oct 23 11:43:52 urquell kernel: ffffffff851dd449 0000000000000000 0000000000000000 0000000000000000
Oct 23 11:43:52 urquell kernel: Call Trace:
Oct 23 11:43:52 urquell kernel: [<ffffffff851dd282>] ? __destroy_client+0xd2/0x130
Oct 23 11:43:52 urquell kernel: [<ffffffff851dd449>] ? laundromat_main+0x149/0x3f0
Oct 23 11:43:52 urquell kernel: [<ffffffff8507a6a4>] ? process_one_work+0x134/0x340
Oct 23 11:43:52 urquell kernel: [<ffffffff8507adb4>] ? worker_thread+0x114/0x460
Oct 23 11:43:52 urquell kernel: [<ffffffff85095b5c>] ? __wake_up_common+0x4c/0x80
Oct 23 11:43:52 urquell kernel: [<ffffffff8507aca0>] ? init_pwq+0x160/0x160
Oct 23 11:43:52 urquell kernel: [<ffffffff8507f38c>] ? kthread+0xbc/0xe0
Oct 23 11:43:52 urquell kernel: [<ffffffff8507f2d0>] ? __kthread_parkme+0x70/0x70
Oct 23 11:43:52 urquell kernel: [<ffffffff85511dac>] ? ret_from_fork+0x7c/0xb0
Oct 23 11:43:52 urquell kernel: [<ffffffff8507f2d0>] ? __kthread_parkme+0x70/0x70

And it repeats, always the kworker on the same core.

We then reverted to 3.14.21, since an uptime of ~5h (under load) is not viable
in this mission-critical server.


2014-10-28 14:14:30

by Carlos Carvalho

Subject: Re: massive memory leak in 3.1[3-5] with nfs4+kerberos

J. Bruce Fields ([email protected]) wrote on Tue, Oct 14, 2014 at 05:42:45PM BRT:
> On Mon, Oct 13, 2014 at 08:50:27PM -0300, Carlos Carvalho wrote:
> > J. Bruce Fields ([email protected]) wrote on Mon, Oct 13, 2014 at 10:58:40AM BRT:
> > Note the big xprt_alloc. slabinfo is found in the kernel tree at tools/vm.
> > Another way to see it:
> >
> > urquell# sort -n /sys/kernel/slab/kmalloc-2048/alloc_calls | tail -n 2
> > 1519 nfsd4_create_session+0x24a/0x810 age=189221/25894524/71426273 pid=5372-5436 cpus=0-11,13-16,19-20 nodes=0-1
> > 3380755 xprt_alloc+0x1e/0x190 age=5/27767270/71441075 pid=6-32599 cpus=0-31 nodes=0-1
>
> Agreed that the xprt_alloc is suspicious, though I don't really
> understand these statistics.
>
> Since you have 4.1 clients, maybe this would be explained by a leak in
> the backchannel code.

We've set clients to use 4.0 and it only made the problem worse; the growth in
unreclaimable memory was faster.

> It could certainly still be worth testing 3.17 if possible.

We tested it and it SEEMS the problem doesn't appear in 3.17.1; the SUnreclaim
value oscillates up and down as usual, instead of increasing monotonically.
However it didn't last long enough for us to get conclusive numbers because
after about 5-6h the machine fills the screen with "NMI watchdog CPU #... is
locked for more than 22s". It spits these messages for many cores at once, and
becomes unresponsive; we have to reboot it from the console with alt+sysreq.

Do these 2 new pieces of info give a clue?

2014-10-28 14:24:51

by J. Bruce Fields

Subject: Re: massive memory leak in 3.1[3-5] with nfs4+kerberos

On Tue, Oct 28, 2014 at 12:14:28PM -0200, Carlos Carvalho wrote:
> J. Bruce Fields ([email protected]) wrote on Tue, Oct 14, 2014 at 05:42:45PM BRT:
> > On Mon, Oct 13, 2014 at 08:50:27PM -0300, Carlos Carvalho wrote:
> > > J. Bruce Fields ([email protected]) wrote on Mon, Oct 13, 2014 at 10:58:40AM BRT:
> > > Note the big xprt_alloc. slabinfo is found in the kernel tree at tools/vm.
> > > Another way to see it:
> > >
> > > urquell# sort -n /sys/kernel/slab/kmalloc-2048/alloc_calls | tail -n 2
> > > 1519 nfsd4_create_session+0x24a/0x810 age=189221/25894524/71426273 pid=5372-5436 cpus=0-11,13-16,19-20 nodes=0-1
> > > 3380755 xprt_alloc+0x1e/0x190 age=5/27767270/71441075 pid=6-32599 cpus=0-31 nodes=0-1
> >
> > Agreed that the xprt_alloc is suspicious, though I don't really
> > understand these statistics.
> >
> > Since you have 4.1 clients, maybe this would be explained by a leak in
> > the backchannel code.
>
> We've set clients to use 4.0 and it only made the problem worse; the growth in
> unreclaimable memory was faster.
>
> > It could certainly still be worth testing 3.17 if possible.
>
> We tested it and it SEEMS the problem doesn't appear in 3.17.1; the SUnreclaim
> value oscillates up and down as usual, instead of increasing monotonically.
> However it didn't last long enough for us to get conclusive numbers because
> after about 5-6h the machine fills the screen with "NMI watchdog CPU #... is
> locked for more than 22s".

Are there backtraces with those messages?

--b.

> It spits these messages for many cores at once, and
> becomes unresponsive; we have to reboot it from the console with alt+sysreq.
>
> Do these 2 new pieces of info give a clue?

2014-10-14 20:42:47

by J. Bruce Fields

Subject: Re: massive memory leak in 3.1[3-5] with nfs4+kerberos

On Mon, Oct 13, 2014 at 08:50:27PM -0300, Carlos Carvalho wrote:
> J. Bruce Fields ([email protected]) wrote on Mon, Oct 13, 2014 at 10:58:40AM BRT:
> > On Sat, Oct 11, 2014 at 12:36:27AM -0300, Carlos Carvalho wrote:
> > > We're observing a big memory leak in 3.1[3-5]. We've gone up to 3.15.8 and back
> > > to 3.14 because of LTS. Today we're running 3.14.21. The problem has existed
> > > for several months but recently has become a show-stopper.
> >
> > Is there an older version that you know was OK?
>
> Perhaps something as old as 3.8 but I'm not sure if it still worked. We jumped
> from 3.8 to 3.13 and this one certainly leaks.
...
> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
...
> 3374235 3374199 99% 2.07K 224949 15 7198368K kmalloc-2048

> urquell# ./slabinfo -r kmalloc-2048
...
> kmalloc-2048: Kernel object allocation
> -----------------------------------------------------------------------
...
> 3377159 xprt_alloc+0x1e/0x190 age=0/27663979/71308304 pid=6-32599 cpus=0-31 nodes=0-1
...
> Note the big xprt_alloc. slabinfo is found in the kernel tree at tools/vm.
> Another way to see it:
>
> urquell# sort -n /sys/kernel/slab/kmalloc-2048/alloc_calls | tail -n 2
> 1519 nfsd4_create_session+0x24a/0x810 age=189221/25894524/71426273 pid=5372-5436 cpus=0-11,13-16,19-20 nodes=0-1
> 3380755 xprt_alloc+0x1e/0x190 age=5/27767270/71441075 pid=6-32599 cpus=0-31 nodes=0-1

Agreed that the xprt_alloc is suspicious, though I don't really
understand these statistics.

Since you have 4.1 clients, maybe this would be explained by a leak in
the backchannel code.

> Yet another puzzling thing for us is that the number of allocs and frees is
> nearly equal:
>
> urquell# awk '{summ += $1} END {print summ}' /sys/kernel/slab/kmalloc-2048/alloc_calls
> 3385122
> urquell# awk '{summ += $1} END {print summ}' /sys/kernel/slab/kmalloc-2048/free_calls
> 3385273

I can't tell what these numbers actually mean. (E.g. is it really
tracking every single alloc and free since the kernel booted, or does
this just represent recent behavior?)

>
> > It would also be interesting to know whether the problem is with nfs4 or
> > krb5. But I don't know if you have an easy way to test that. (E.g.
> > temporarily downgrade to nfs3 while keeping krb5 and see if that
> > matters?)
>
> That'd be quite hard to do...
>
> > Do you know if any of your clients are using NFSv4.1?
>
> All of them. Clients are a few general login servers and about a hundred
> terminals. All of them are diskless and mount their root via nfs3 without
> kerberos. The login servers mount the user home dirs with nfs4.1 WITHOUT
> kerberos. The terminals run ubuntu and mount with nfs4.1 AND kerberos. Here is
> their /proc/version:

OK, thanks. I'm not seeing it yet.

I don't see any relevant-looking fix in recent git history, though I might
have overlooked something, especially in the recent rewrite of the
NFSv4 state code.

It could certainly still be worth testing 3.17 if possible.

--b.

2014-10-13 23:50:30

by Carlos Carvalho

Subject: Re: massive memory leak in 3.1[3-5] with nfs4+kerberos

J. Bruce Fields ([email protected]) wrote on Mon, Oct 13, 2014 at 10:58:40AM BRT:
> On Sat, Oct 11, 2014 at 12:36:27AM -0300, Carlos Carvalho wrote:
> > We're observing a big memory leak in 3.1[3-5]. We've gone up to 3.15.8 and back
> > to 3.14 because of LTS. Today we're running 3.14.21. The problem has existed
> > for several months but recently has become a show-stopper.
>
> Is there an older version that you know was OK?

Perhaps something as old as 3.8 but I'm not sure if it still worked. We jumped
from 3.8 to 3.13 and this one certainly leaks.

> > Here are the values of SUnreclaim: from /proc/meminfo, sampled every 4h
> > (units are kB):
[TRIMMED]
> > 28034980
> > 29059812 <== almost 30GB!
>
> Can you figure out from /proc/slabinfo which slab is the problem?

I don't understand it precisely but here's what slabtop says:

urquell# slabtop -o -sc
Active / Total Objects (% used) : 62466073 / 63672277 (98.1%)
Active / Total Slabs (% used) : 1903762 / 1903762 (100.0%)
Active / Total Caches (% used) : 122 / 140 (87.1%)
Active / Total Size (% used) : 31965140.38K / 32265879.49K (99.1%)
Minimum / Average / Maximum Object : 0.08K / 0.51K / 8.07K

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
13782656 13745696 99% 1.00K 430708 32 13782656K ext4_inode_cache
3374235 3374199 99% 2.07K 224949 15 7198368K kmalloc-2048
6750996 6750937 99% 0.57K 241107 28 3857712K kmalloc-512
9438787 9121010 96% 0.26K 304477 31 2435816K dentry
11455656 11444417 99% 0.17K 249036 46 1992288K buffer_head
161378 161367 99% 4.07K 23054 7 737728K kmalloc-4096
6924380 6924248 99% 0.09K 150530 46 602120K kmalloc-16
6754644 6754432 99% 0.08K 132444 51 529776K kmalloc-8
662286 648372 97% 0.62K 12986 51 415552K radix_tree_node
161730 161709 99% 1.69K 8985 18 287520K TCP
1987470 1840303 92% 0.13K 66249 30 264996K kmalloc-64
203366 162439 79% 0.69K 4421 46 141472K sock_inode_cache
626112 285190 45% 0.11K 17392 36 69568K ext4_extent_status
177123 177040 99% 0.31K 3473 51 55568K skbuff_head_cache
357272 357220 99% 0.12K 10508 34 42032K jbd2_inode
136422 136375 99% 0.20K 3498 39 27984K ext4_groupinfo_4k

Note that ext4* appears to be the villain, but it isn't, because
echo 3 > /proc/sys/vm/drop_caches
gets rid of it. The problem seems to be the kmalloc-* caches, particularly
kmalloc-2048, which never gets smaller.
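To back the claim that kmalloc-2048 never shrinks, the same kind of sampling can be pointed at one cache in /proc/slabinfo; a minimal sketch, assuming the standard slabinfo v2.1 column layout (the `slab_kb` helper name is made up here):

```shell
#!/bin/sh
# Print the footprint of one slab cache in kB, computed as
# num_objs * objsize / 1024 from /proc/slabinfo, whose v2.1 columns are:
#   name  active_objs  num_objs  objsize  objperslab  pagesperslab ...
slab_kb() {
    cache="$1"
    slabinfo="${2:-/proc/slabinfo}"
    awk -v c="$cache" '$1 == c {printf "%d\n", $3 * $4 / 1024}' "$slabinfo"
}

# Sampled periodically alongside SUnreclaim, e.g.:
# slab_kb kmalloc-2048
```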

Maybe this gives you a clue:

urquell# ./slabinfo -r kmalloc-2048

Slabcache: kmalloc-2048 Aliases: 0 Order : 3 Objects: 3379188

Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 2048 Total : 225282 Sanity Checks : On Total: 7382040576
SlabObj: 2120 Full : 225277 Redzoning : On Used : 6920577024
SlabSiz: 32768 Partial: 5 Poisoning : On Loss : 461463552
Loss : 72 CpuSlab: 0 Tracking : On Lalig: 243301536
Align : 8 Objects: 15 Tracing : Off Lpadd: 218072976

kmalloc-2048 has no kmem_cache operations

kmalloc-2048: Kernel object allocation
-----------------------------------------------------------------------
1 mcheck_cpu_init+0x2dd/0x4b0 age=71315213 pid=0 cpus=0 nodes=0
4 mempool_create_node+0x67/0x130 age=71310870/71310884/71310896 pid=845 cpus=8,10 nodes=1
1 pcpu_extend_area_map+0x36/0xf0 age=71314878 pid=1 cpus=10 nodes=1
1 __vmalloc_node_range+0xb1/0x250 age=71314877 pid=1 cpus=10 nodes=1
7 alloc_fdmem+0x17/0x30 age=19467/40818077/71308204 pid=5541-16287 cpus=0-1,4,6-7 nodes=0
2 __register_sysctl_table+0x52/0x540 age=71315213/71315213/71315213 pid=0 cpus=0 nodes=0
1 register_leaf_sysctl_tables+0x7f/0x1f0 age=71315213 pid=0 cpus=0 nodes=0
1 ext4_kvmalloc+0x1e/0x60 age=71310060 pid=2778 cpus=0 nodes=0
2 ext4_kvzalloc+0x1e/0x70 age=71310041/71310197/71310354 pid=2778 cpus=4,9 nodes=0-1
4 ext4_fill_super+0x9e/0x2d50 age=71310064/71310406/71310853 pid=987-2778 cpus=0,9-10 nodes=0-1
4 journal_init_common+0x12/0x150 age=71310051/71310386/71310852 pid=987-2778 cpus=0,8-9 nodes=0-1
50 create_client.isra.79+0x6d/0x450 age=56450/25384654/71293503 pid=5372-5435 cpus=0-17,19-21,23 nodes=0-1
1550 nfsd4_create_session+0x24a/0x810 age=56450/25266733/71293502 pid=5372-5436 cpus=0-11,13-16,19-20,27 nodes=0-1
136 pci_alloc_dev+0x22/0x60 age=71314711/71314783/71314907 pid=1 cpus=0 nodes=0-1
1 acpi_ev_create_gpe_block+0x110/0x323 age=71315026 pid=1 cpus=0 nodes=0
2 tty_write+0x1f9/0x290 age=11025/35647516/71284008 pid=5965-16318 cpus=5-6 nodes=0
2 kobj_map_init+0x20/0xa0 age=71315078/71315145/71315213 pid=0-1 cpus=0 nodes=0
7 scsi_host_alloc+0x30/0x410 age=71312864/71313119/71314655 pid=4 cpus=0 nodes=0
52 scsi_alloc_sdev+0x52/0x290 age=71311100/71311851/71314235 pid=6-289 cpus=0-1,3,8 nodes=0-1
50 _scsih_slave_alloc+0x36/0x1f0 age=71311100/71311758/71312386 pid=7-289 cpus=0,8 nodes=0-1
1 ata_attach_transport+0x12/0x550 age=71314887 pid=1 cpus=10 nodes=1
2 usb_create_shared_hcd+0x3d/0x1b0 age=71312498/71312500/71312502 pid=4 cpus=0 nodes=0
2 input_alloc_absinfo+0x1f/0x50 age=71312135/71312157/71312180 pid=175 cpus=0 nodes=0
18 sk_prot_alloc.isra.51+0xab/0x180 age=71307028/71307092/71307162 pid=5268 cpus=0-2 nodes=0
12 reqsk_queue_alloc+0x5b/0xf0 age=71308217/71308264/71308326 pid=5281-5531 cpus=0,7 nodes=0
9 __alloc_skb+0x82/0x2a0 age=11/62981/85980 pid=16286-28415 cpus=0,2,12,20 nodes=0-1
19 alloc_netdev_mqs+0x5d/0x380 age=71309462/71309950/71314887 pid=1-4109 cpus=1,9-10 nodes=0-1
48 neigh_sysctl_register+0x39/0x270 age=71309462/71310133/71314878 pid=1-4109 cpus=1-3,5-6,9-10,18 nodes=0-1
2 neigh_hash_alloc+0x9d/0xb0 age=71122283/71208056/71293830 pid=4732-5411 cpus=0,12 nodes=0-1
4 __rtnl_register+0x9e/0xd0 age=71312234/71314317/71315079 pid=1 cpus=0,2,10 nodes=0-1
2 nf_ct_l4proto_register+0x96/0x140 age=71312233/71312234/71312235 pid=1 cpus=2 nodes=0
25 __devinet_sysctl_register+0x39/0xf0 age=71309462/71310428/71314878 pid=1-4109 cpus=1,3,5-6,9-10,18 nodes=0-1
3377159 xprt_alloc+0x1e/0x190 age=0/27663979/71308304 pid=6-32599 cpus=0-31 nodes=0-1
6 cache_create_net+0x32/0x80 age=71314835/71314848/71314876 pid=1 cpus=8,10 nodes=1
1 uncore_types_init+0x33/0x1a5 age=71314842 pid=1 cpus=31 nodes=1
2 netdev_create_hash+0x12/0x30 age=71314887/71314887/71314887 pid=1 cpus=10 nodes=1

kmalloc-2048: Kernel object freeing
------------------------------------------------------------------------
1236114 <not-available> age=4366207748 pid=0 cpus=0 nodes=0-1
1 pcpu_extend_area_map+0x79/0xf0 age=71312606 pid=281 cpus=0 nodes=0
13387 do_readv_writev+0x10f/0x2b0 age=11019/32837523/71273914 pid=5372-5436 cpus=0-31 nodes=0-1
181 __free_fdtable+0xd/0x20 age=64631/41719568/71310089 pid=632-32741 cpus=0-26,28-29,31 nodes=0-1
25068 free_session+0x113/0x140 age=65100/41914089/71277241 pid=6-32564 cpus=0-9,11-23,25-31 nodes=0-1
1008 destroy_client+0x34b/0x3f0 age=88408/41955492/71233598 pid=6-32564 cpus=0-9,13-23,25-26,28-31 nodes=0-1
5568 nfsd4_create_session+0x567/0x810 age=56743/41522683/71293808 pid=5372-5436 cpus=0-28,30 nodes=0-1
5 acpi_pci_irq_find_prt_entry+0x253/0x26d age=71312722/71313708/71315113 pid=1-4 cpus=0,10 nodes=0-1
1 free_tty_struct+0x23/0x40 age=68221685 pid=933 cpus=3 nodes=0
14 flush_to_ldisc+0x86/0x170 age=2758222/47649639/71281276 pid=254-12135 cpus=0,2,9-10,12,15-17,25 nodes=0-1
2094711 __kfree_skb+0x9/0xa0 age=48/27035171/71311197 pid=0-32767 cpus=0-31 nodes=0-1
1 pskb_expand_head+0x150/0x220 age=71315138 pid=1 cpus=10 nodes=1
6 xt_free_table_info+0x4e/0x130 age=71308862/71308863/71308864 pid=5003-5005 cpus=25,29-30 nodes=0
3 inetdev_event+0x36d/0x4e0 age=71310679/71310692/71310703 pid=1160-1199 cpus=5-6,18 nodes=0
5 inetdev_event+0x37a/0x4e0 age=71310679/71310715/71310749 pid=1160-1199 cpus=3,5,18-19 nodes=0
3155 unix_stream_sendmsg+0x39c/0x3b0 age=31207/33532093/71308550 pid=398-32756 cpus=0-31 nodes=0-1
4 addrconf_notify+0x1b1/0x770 age=71310703/71310724/71310749 pid=1160-1199 cpus=3,5,18-19 nodes=0

NUMA nodes : 0 1
--------------------------------
All slabs 118.1K 107.1K
Partial slabs 1 4

Note the big xprt_alloc. slabinfo is found in the kernel tree at tools/vm.
Another way to see it:

urquell# sort -n /sys/kernel/slab/kmalloc-2048/alloc_calls | tail -n 2
1519 nfsd4_create_session+0x24a/0x810 age=189221/25894524/71426273 pid=5372-5436 cpus=0-11,13-16,19-20 nodes=0-1
3380755 xprt_alloc+0x1e/0x190 age=5/27767270/71441075 pid=6-32599 cpus=0-31 nodes=0-1

Yet another puzzling thing for us is that the total numbers of allocs and frees
are nearly equal:

urquell# awk '{summ += $1} END {print summ}' /sys/kernel/slab/kmalloc-2048/alloc_calls
3385122
urquell# awk '{summ += $1} END {print summ}' /sys/kernel/slab/kmalloc-2048/free_calls
3385273
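Since alloc_calls lists only objects that are still allocated, attributed to the
allocating call site, ranking its entries points at the likely leaker even when
the grand totals look balanced. A small parser for that output (a hypothetical
sketch, assuming the SLUB debug line format shown above):

```python
import re

def parse_calls(text):
    """Parse SLUB alloc_calls/free_calls lines of the form
    '<count> <symbol>+0x<off>/0x<size> age=...' into {symbol: count}."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+([^+\s]+)", line)
        if m:
            counts[m.group(2)] = counts.get(m.group(2), 0) + int(m.group(1))
    return counts

def top_allocators(text, n=3):
    """Return the n call sites currently holding the most live objects."""
    return sorted(parse_calls(text).items(), key=lambda kv: -kv[1])[:n]

sample = """\
1519 nfsd4_create_session+0x24a/0x810 age=189221/25894524/71426273
3380755 xprt_alloc+0x1e/0x190 age=5/27767270/71441075
"""
# The dominant entry is the suspect: here xprt_alloc holds ~3.4M live objects.
print(top_allocators(sample))
```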

> It would also be interesting to know whether the problem is with nfs4 or
> krb5. But I don't know if you have an easy way to test that. (E.g.
> temporarily downgrade to nfs3 while keeping krb5 and see if that
> matters?)

That'd be quite hard to do...

> Do you know if any of your clients are using NFSv4.1?

All of them. Clients are a few general login servers and about a hundred
terminals. All of them are diskless and mount their root via nfs3 without
kerberos. The login servers mount the user home dirs with nfs4.1 WITHOUT
kerberos. The terminals run ubuntu and mount with nfs4.1 AND kerberos. Here is
their /proc/version:

Linux version 3.14.14-kernel (root@urquell) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #15 SMP Thu Jul 31 11:15:52 BRT 2014

and here their mount:

urquell.home:/home on /home type nfs4 (rw,vers=4.1,addr=10.17.110.3,clientaddr=10.17.110.11)

> What filesystem are you exporting, with what options?

ext4. For the terminals:
/exports 10.254.0.0/16(sec=krb5p:none,ro,async,fsid=0,crossmnt,subtree_check,no_root_squash)
/exports/home 10.254.0.0/16(sec=krb5p,rw,async,no_subtree_check,root_squash)

For the servers:
/exports/home server.home(rw,async,root_squash,no_subtree_check)

> > What about these patches: http://permalink.gmane.org/gmane.linux.nfs/62012
> > Bruce said they were accepted but they're not in 3.14. Were they rejected or
> > forgotten? Could they have any relation to this memory leak?
>
> Those are in 3.15.
>
> There'd be no harm in trying them, but on a quick skim I don't think
> they're likely to explain your symptoms.

Yes, we tried kernels up to 3.15.8, to no avail.

2014-10-13 13:58:41

by J. Bruce Fields

[permalink] [raw]
Subject: Re: massive memory leak in 3.1[3-5] with nfs4+kerberos

On Sat, Oct 11, 2014 at 12:36:27AM -0300, Carlos Carvalho wrote:
> We're observing a big memory leak in 3.1[3-5]. We've gone until 3.15.8 and back
> to 3.14 because of LTS. Today we're running 3.14.21. The problem has existed
> for several months but recently has become a show-stopper.

Is there an older version that you know was OK?

> Here are the values of SUnreclaim: from /proc/meminfo, sampled at every 4h
> (units are kB):
>
> 87192
> 297044
> 765320
> 2325160
> 3306056
> 4412808
> 4799292
> 5085392
> 4999936
> 5521648
> 6628496
> 7785460
> 8518084
> 8988404
> 9141220
> 9533224
> 10053484
> 10954000
> 11716700
> 12369516
> 12847412
> 13318872
> 13846196
> 14339476
> 14815600
> 15293564
> 15798024
> 17092772
> 19240084
> 21679888
> 22399060
> 22943812
> 23407004
> 24049804
> 26210880
> 28034980
> 29059812 <== almost 30GB!

Can you figure out from /proc/slabinfo which slab is the problem?
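For reference, one way to answer that is to rank slabs by approximate footprint
(active objects times object size). A sketch, assuming the standard
/proc/slabinfo v2.1 column layout of name, active_objs, num_objs, objsize, ...:

```python
def top_slabs(slabinfo_text, n=5):
    """Rank slab caches by approximate live memory (active_objs * objsize)."""
    rows = []
    for line in slabinfo_text.splitlines()[2:]:   # skip version + header lines
        fields = line.split()
        if len(fields) >= 4:
            name, active, objsize = fields[0], int(fields[1]), int(fields[3])
            rows.append((active * objsize // 1024, name))  # footprint in KiB
    return sorted(rows, reverse=True)[:n]

# Usage on a live box (reading /proc/slabinfo usually needs root):
#   print(top_slabs(open("/proc/slabinfo").read()))
```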

> After a few days the machine has lost so much memory that it panics or becomes
> very slow due to lack of cache and we have to reboot it. It's a busy file
> server of home directories.
>
> We have several other busy servers (including identical hardware) but the
> memory leak happens only in this machine. What is different with it is that
> it's the only place where we use:
> - nfs4 with authentication and encryption by kerberos
> - raid10
>
> All others do only nfs3 or no nfs, and raid6. That's why we suspect it's a nfs4
> problem.

It would also be interesting to know whether the problem is with nfs4 or
krb5. But I don't know if you have an easy way to test that. (E.g.
temporarily downgrade to nfs3 while keeping krb5 and see if that
matters?)

Do you know if any of your clients are using NFSv4.1?

What filesystem are you exporting, with what options?

> What about these patches: http://permalink.gmane.org/gmane.linux.nfs/62012
> Bruce said they were accepted but they're not in 3.14. Were they rejected or
> forgotten? Could they have any relation to this memory leak?

Those are in 3.15.

There'd be no harm in trying them, but on a quick skim I don't think
they're likely to explain your symptoms.

--b.

2014-10-28 19:37:17

by Carlos Carvalho

[permalink] [raw]
Subject: Re: massive memory leak in 3.1[3-5] with nfs4+kerberos

J. Bruce Fields ([email protected]) wrote on Tue, Oct 28, 2014 at 05:29:40PM BRST:
> On Tue, Oct 28, 2014 at 05:12:29PM -0200, Carlos Carvalho wrote:
> > J. Bruce Fields ([email protected]) wrote on Tue, Oct 28, 2014 at 12:24:50PM BRST:
> > > On Tue, Oct 28, 2014 at 12:14:28PM -0200, Carlos Carvalho wrote:
> > > > J. Bruce Fields ([email protected]) wrote on Tue, Oct 14, 2014 at 05:42:45PM BRT:
> > > > > On Mon, Oct 13, 2014 at 08:50:27PM -0300, Carlos Carvalho wrote:
> > > > > > J. Bruce Fields ([email protected]) wrote on Mon, Oct 13, 2014 at 10:58:40AM BRT:
> > > > > > Note the big xprt_alloc. slabinfo is found in the kernel tree at tools/vm.
> > > > > > Another way to see it:
> > > > > >
> > > > > > urquell# sort -n /sys/kernel/slab/kmalloc-2048/alloc_calls | tail -n 2
> > > > > > 1519 nfsd4_create_session+0x24a/0x810 age=189221/25894524/71426273 pid=5372-5436 cpus=0-11,13-16,19-20 nodes=0-1
> > > > > > 3380755 xprt_alloc+0x1e/0x190 age=5/27767270/71441075 pid=6-32599 cpus=0-31 nodes=0-1
> > > > >
> > > > > Agreed that the xprt_alloc is suspicious, though I don't really
> > > > > understand these statistics.
> > > > >
> > > > > Since you have 4.1 clients, maybe this would be explained by a leak in
> > > > > the backchannel code.
> > > >
> > > > We've set clients to use 4.0 and it only made the problem worse; the growth in
> > > > unreclaimable memory was faster.
> > > >
> > > > > It could certainly still be worth testing 3.17 if possible.
> > > >
> > > > We tested it and it SEEMS the problem doesn't appear in 3.17.1; the SUnreclaim
> > > > value oscillates up and down as usual, instead of increasing monotonically.
> > > > However it didn't last long enough for us to get conclusive numbers because
> > > > after about 5-6h the machine fills the screen with "NMI watchdog CPU #... is
> > > > locked for more than 22s".
> > >
> > > Are there backtraces with those messages?
> >
> > First one, nfsd:
> >
> > Oct 22 18:37:12 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [nfsd:12603]
> > Oct 22 18:37:12 urquell kernel: Modules linked in:
> > Oct 22 18:37:12 urquell kernel: CPU: 23 PID: 12603 Comm: nfsd Not tainted 3.17.1-urquell-slabdebug #2
> > Oct 22 18:37:12 urquell kernel: Hardware name: SGI.COM SGI MIS Server/S2600JF, BIOS SE5C600.86B.01.03.0002.062020121504 06/20/2012
> > Oct 22 18:37:12 urquell kernel: task: ffff880fbb321000 ti: ffff880fbaf58000 task.ti: ffff880fbaf58000
> > Oct 22 18:37:12 urquell kernel: RIP: 0010:[<ffffffff82511685>] [<ffffffff82511685>] _raw_spin_unlock_irqrestore+0x5/0x10
> > Oct 22 18:37:12 urquell kernel: RSP: 0018:ffff880fbaf5bde0 EFLAGS: 00000296
> > Oct 22 18:37:12 urquell kernel: RAX: ffffffff82841930 RBX: ffffffff820963db RCX: 0000000000000000
> > Oct 22 18:37:12 urquell kernel: RDX: ffffffff82841948 RSI: 0000000000000296 RDI: ffffffff82841940
> > Oct 22 18:37:12 urquell kernel: RBP: ffff880fd1ac52d0 R08: 0000000000000000 R09: 0000000000000000
> > Oct 22 18:37:12 urquell kernel: R10: ffff880fd1ac5b28 R11: ffff880fff804780 R12: 00000000ffffffff
> > Oct 22 18:37:12 urquell kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> > Oct 22 18:37:12 urquell kernel: FS: 0000000000000000(0000) GS:ffff880fffde0000(0000) knlGS:0000000000000000
> > Oct 22 18:37:12 urquell kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Oct 22 18:37:12 urquell kernel: CR2: 00007f22e0b01400 CR3: 000000000280e000 CR4: 00000000000407e0
> > Oct 22 18:37:12 urquell kernel: Stack:
> > Oct 22 18:37:12 urquell kernel: ffffffff821dd282 ffff880fbaf5bde8 ffff880fbaf5bde8 ffff881ff785c6d8
> > Oct 22 18:37:12 urquell kernel: ffff880fd1ac52d0 ffff881ff785c6d8 0000000000000020 ffffffff8285dac0
> > Oct 22 18:37:12 urquell kernel: ffffffff821e3550 ffff880fbaf5be28 ffff880fbaf5be28 ffff881ff785c6d8
> > Oct 22 18:37:12 urquell kernel: Call Trace:
> > Oct 22 18:37:12 urquell kernel: [<ffffffff821dd282>] ? __destroy_client+0xd2/0x130
> > Oct 22 18:37:12 urquell kernel: [<ffffffff821e3550>] ? nfs4_state_shutdown_net+0x150/0x1e0
> > Oct 22 18:37:12 urquell kernel: [<ffffffff821c1152>] ? nfsd_shutdown_net+0x32/0x60
> > Oct 22 18:37:12 urquell kernel: [<ffffffff821c11da>] ? nfsd_last_thread+0x5a/0x90
> > Oct 22 18:37:12 urquell kernel: [<ffffffff821c163b>] ? nfsd_destroy+0x4b/0x70
> > Oct 22 18:37:12 urquell kernel: [<ffffffff821c17a4>] ? nfsd+0x144/0x170
> > Oct 22 18:37:12 urquell kernel: [<ffffffff821c1660>] ? nfsd_destroy+0x70/0x70
> > Oct 22 18:37:12 urquell kernel: [<ffffffff8207f38c>] ? kthread+0xbc/0xe0
> > Oct 22 18:37:12 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70
> > Oct 22 18:37:12 urquell kernel: [<ffffffff82511dac>] ? ret_from_fork+0x7c/0xb0
> > Oct 22 18:37:12 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70
>
> That means the server was shutting down.

Not at that moment. The reboot happened about 10 minutes later; the other traces
happened before I rebooted.

> Is this some kind of cluster failover or were you shutting it down by hand?

By hand, and there's no cluster.

2014-10-28 19:29:41

by J. Bruce Fields

[permalink] [raw]
Subject: Re: massive memory leak in 3.1[3-5] with nfs4+kerberos

On Tue, Oct 28, 2014 at 05:12:29PM -0200, Carlos Carvalho wrote:
> J. Bruce Fields ([email protected]) wrote on Tue, Oct 28, 2014 at 12:24:50PM BRST:
> > On Tue, Oct 28, 2014 at 12:14:28PM -0200, Carlos Carvalho wrote:
> > > J. Bruce Fields ([email protected]) wrote on Tue, Oct 14, 2014 at 05:42:45PM BRT:
> > > > On Mon, Oct 13, 2014 at 08:50:27PM -0300, Carlos Carvalho wrote:
> > > > > J. Bruce Fields ([email protected]) wrote on Mon, Oct 13, 2014 at 10:58:40AM BRT:
> > > > > Note the big xprt_alloc. slabinfo is found in the kernel tree at tools/vm.
> > > > > Another way to see it:
> > > > >
> > > > > urquell# sort -n /sys/kernel/slab/kmalloc-2048/alloc_calls | tail -n 2
> > > > > 1519 nfsd4_create_session+0x24a/0x810 age=189221/25894524/71426273 pid=5372-5436 cpus=0-11,13-16,19-20 nodes=0-1
> > > > > 3380755 xprt_alloc+0x1e/0x190 age=5/27767270/71441075 pid=6-32599 cpus=0-31 nodes=0-1
> > > >
> > > > Agreed that the xprt_alloc is suspicious, though I don't really
> > > > understand these statistics.
> > > >
> > > > Since you have 4.1 clients, maybe this would be explained by a leak in
> > > > the backchannel code.
> > >
> > > We've set clients to use 4.0 and it only made the problem worse; the growth in
> > > unreclaimable memory was faster.
> > >
> > > > It could certainly still be worth testing 3.17 if possible.
> > >
> > > We tested it and it SEEMS the problem doesn't appear in 3.17.1; the SUnreclaim
> > > value oscillates up and down as usual, instead of increasing monotonically.
> > > However it didn't last long enough for us to get conclusive numbers because
> > > after about 5-6h the machine fills the screen with "NMI watchdog CPU #... is
> > > locked for more than 22s".
> >
> > Are there backtraces with those messages?
>
> First one, nfsd:
>
> Oct 22 18:37:12 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [nfsd:12603]
> Oct 22 18:37:12 urquell kernel: Modules linked in:
> Oct 22 18:37:12 urquell kernel: CPU: 23 PID: 12603 Comm: nfsd Not tainted 3.17.1-urquell-slabdebug #2
> Oct 22 18:37:12 urquell kernel: Hardware name: SGI.COM SGI MIS Server/S2600JF, BIOS SE5C600.86B.01.03.0002.062020121504 06/20/2012
> Oct 22 18:37:12 urquell kernel: task: ffff880fbb321000 ti: ffff880fbaf58000 task.ti: ffff880fbaf58000
> Oct 22 18:37:12 urquell kernel: RIP: 0010:[<ffffffff82511685>] [<ffffffff82511685>] _raw_spin_unlock_irqrestore+0x5/0x10
> Oct 22 18:37:12 urquell kernel: RSP: 0018:ffff880fbaf5bde0 EFLAGS: 00000296
> Oct 22 18:37:12 urquell kernel: RAX: ffffffff82841930 RBX: ffffffff820963db RCX: 0000000000000000
> Oct 22 18:37:12 urquell kernel: RDX: ffffffff82841948 RSI: 0000000000000296 RDI: ffffffff82841940
> Oct 22 18:37:12 urquell kernel: RBP: ffff880fd1ac52d0 R08: 0000000000000000 R09: 0000000000000000
> Oct 22 18:37:12 urquell kernel: R10: ffff880fd1ac5b28 R11: ffff880fff804780 R12: 00000000ffffffff
> Oct 22 18:37:12 urquell kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> Oct 22 18:37:12 urquell kernel: FS: 0000000000000000(0000) GS:ffff880fffde0000(0000) knlGS:0000000000000000
> Oct 22 18:37:12 urquell kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Oct 22 18:37:12 urquell kernel: CR2: 00007f22e0b01400 CR3: 000000000280e000 CR4: 00000000000407e0
> Oct 22 18:37:12 urquell kernel: Stack:
> Oct 22 18:37:12 urquell kernel: ffffffff821dd282 ffff880fbaf5bde8 ffff880fbaf5bde8 ffff881ff785c6d8
> Oct 22 18:37:12 urquell kernel: ffff880fd1ac52d0 ffff881ff785c6d8 0000000000000020 ffffffff8285dac0
> Oct 22 18:37:12 urquell kernel: ffffffff821e3550 ffff880fbaf5be28 ffff880fbaf5be28 ffff881ff785c6d8
> Oct 22 18:37:12 urquell kernel: Call Trace:
> Oct 22 18:37:12 urquell kernel: [<ffffffff821dd282>] ? __destroy_client+0xd2/0x130
> Oct 22 18:37:12 urquell kernel: [<ffffffff821e3550>] ? nfs4_state_shutdown_net+0x150/0x1e0
> Oct 22 18:37:12 urquell kernel: [<ffffffff821c1152>] ? nfsd_shutdown_net+0x32/0x60
> Oct 22 18:37:12 urquell kernel: [<ffffffff821c11da>] ? nfsd_last_thread+0x5a/0x90
> Oct 22 18:37:12 urquell kernel: [<ffffffff821c163b>] ? nfsd_destroy+0x4b/0x70
> Oct 22 18:37:12 urquell kernel: [<ffffffff821c17a4>] ? nfsd+0x144/0x170
> Oct 22 18:37:12 urquell kernel: [<ffffffff821c1660>] ? nfsd_destroy+0x70/0x70
> Oct 22 18:37:12 urquell kernel: [<ffffffff8207f38c>] ? kthread+0xbc/0xe0
> Oct 22 18:37:12 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70
> Oct 22 18:37:12 urquell kernel: [<ffffffff82511dac>] ? ret_from_fork+0x7c/0xb0
> Oct 22 18:37:12 urquell kernel: [<ffffffff8207f2d0>] ? __kthread_parkme+0x70/0x70

That means the server was shutting down. Is this some kind of cluster
failover or were you shutting it down by hand?

...
> Then reboot. Next NMI appeared next day:
>
> Oct 23 11:43:24 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/u64:3:6414]
> Oct 23 11:43:24 urquell kernel: Modules linked in:
> Oct 23 11:43:24 urquell kernel: CPU: 5 PID: 6414 Comm: kworker/u64:3 Not tainted 3.17.1-urquell-slabdebug #2
> Oct 23 11:43:24 urquell kernel: Hardware name: SGI.COM SGI MIS Server/S2600JF, BIOS SE5C600.86B.01.03.0002.062020121504 06/20/2012
> Oct 23 11:43:24 urquell kernel: Workqueue: nfsd4 laundromat_main
> Oct 23 11:43:24 urquell kernel: task: ffff881ff0d49800 ti: ffff8813f5a04000 task.ti: ffff8813f5a04000
> Oct 23 11:43:24 urquell kernel: RIP: 0010:[<ffffffff85245cc8>] [<ffffffff85245cc8>] _atomic_dec_and_lock+0x18/0x60
> Oct 23 11:43:24 urquell kernel: RSP: 0018:ffff8813f5a07d08 EFLAGS: 00000246
> Oct 23 11:43:24 urquell kernel: RAX: 0000000000000000 RBX: ffffffff85096290 RCX: 00000000ffffffff
> Oct 23 11:43:24 urquell kernel: RDX: 0000000000000000 RSI: ffffffff850963db RDI: ffff8813f5a07cf8
> Oct 23 11:43:24 urquell kernel: RBP: ffffffff85096093 R08: 0000000000000000 R09: 0000000000000001
> Oct 23 11:43:24 urquell kernel: R10: 0000000000000001 R11: 0140000000000000 R12: 0000000000000000
> Oct 23 11:43:24 urquell kernel: R13: ffff880ff5c752d0 R14: ffffffff851bdaa0 R15: 0000000000000206
> Oct 23 11:43:24 urquell kernel: FS: 0000000000000000(0000) GS:ffff880fffca0000(0000) knlGS:0000000000000000
> Oct 23 11:43:24 urquell kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Oct 23 11:43:24 urquell kernel: CR2: ffffffffff600400 CR3: 000000000580e000 CR4: 00000000000407e0
> Oct 23 11:43:24 urquell kernel: Stack:
> Oct 23 11:43:24 urquell kernel: ffff881fee03dba0 ffffffff85096093 ffff8813f5a07cf8 ffffffff851dcf49
> Oct 23 11:43:24 urquell kernel: ffff881fee039950 ffff881fee0398d8 ffff8813f5a07d48 ffffffff851dd282
> Oct 23 11:43:24 urquell kernel: ffff8813f5a07d48 ffff8813f5a07d48 ffff8813f5a07db8 ffff881ff7fbbd98
> Oct 23 11:43:24 urquell kernel: Call Trace:
> Oct 23 11:43:24 urquell kernel: [<ffffffff85096093>] ? __wake_up+0x43/0x70
> Oct 23 11:43:24 urquell kernel: [<ffffffff851dcf49>] ? nfs4_put_stid+0x29/0x100
> Oct 23 11:43:24 urquell kernel: [<ffffffff851dd282>] ? __destroy_client+0xd2/0x130
> Oct 23 11:43:24 urquell kernel: [<ffffffff851dd449>] ? laundromat_main+0x149/0x3f0
> Oct 23 11:43:24 urquell kernel: [<ffffffff8507a6a4>] ? process_one_work+0x134/0x340
> Oct 23 11:43:24 urquell kernel: [<ffffffff8507adb4>] ? worker_thread+0x114/0x460
> Oct 23 11:43:24 urquell kernel: [<ffffffff85095b5c>] ? __wake_up_common+0x4c/0x80
> Oct 23 11:43:24 urquell kernel: [<ffffffff8507aca0>] ? init_pwq+0x160/0x160
> Oct 23 11:43:24 urquell kernel: [<ffffffff8507f38c>] ? kthread+0xbc/0xe0
> Oct 23 11:43:24 urquell kernel: [<ffffffff8507f2d0>] ? __kthread_parkme+0x70/0x70
> Oct 23 11:43:24 urquell kernel: [<ffffffff85511dac>] ? ret_from_fork+0x7c/0xb0
> Oct 23 11:43:24 urquell kernel: [<ffffffff8507f2d0>] ? __kthread_parkme+0x70/0x70

Hm. So maybe we forgot to drop cl_lock somewhere?

--b.

>
> Oct 23 11:43:52 urquell kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/u64:3:6414]
> Oct 23 11:43:52 urquell kernel: Modules linked in:
> Oct 23 11:43:52 urquell kernel: CPU: 5 PID: 6414 Comm: kworker/u64:3 Tainted: G L 3.17.1-urquell-slabdebug #2
> Oct 23 11:43:52 urquell kernel: Hardware name: SGI.COM SGI MIS Server/S2600JF, BIOS SE5C600.86B.01.03.0002.062020121504 06/20/2012
> Oct 23 11:43:52 urquell kernel: Workqueue: nfsd4 laundromat_main
> Oct 23 11:43:52 urquell kernel: task: ffff881ff0d49800 ti: ffff8813f5a04000 task.ti: ffff8813f5a04000
> Oct 23 11:43:52 urquell kernel: RIP: 0010:[<ffffffff85511685>] [<ffffffff85511685>] _raw_spin_unlock_irqrestore+0x5/0x10
> Oct 23 11:43:52 urquell kernel: RSP: 0018:ffff8813f5a07d40 EFLAGS: 00000296
> Oct 23 11:43:52 urquell kernel: RAX: ffffffff85841930 RBX: ffffffff850963db RCX: 0000000000000000
> Oct 23 11:43:52 urquell kernel: RDX: ffffffff85841948 RSI: 0000000000000296 RDI: ffffffff85841940
> Oct 23 11:43:52 urquell kernel: RBP: ffff881fee0398d8 R08: 0000000000000000 R09: 0000000000000001
> Oct 23 11:43:52 urquell kernel: R10: 0000000000000001 R11: 0140000000000000 R12: ffff8813f5a07d48
> Oct 23 11:43:52 urquell kernel: R13: ffff8813f5a07d48 R14: 0000000000000000 R15: 0000000000000001
> Oct 23 11:43:52 urquell kernel: FS: 0000000000000000(0000) GS:ffff880fffca0000(0000) knlGS:0000000000000000
> Oct 23 11:43:52 urquell kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Oct 23 11:43:52 urquell kernel: CR2: ffffffffff600400 CR3: 000000000580e000 CR4: 00000000000407e0
> Oct 23 11:43:52 urquell kernel: Stack:
> Oct 23 11:43:52 urquell kernel: ffffffff851dd282 ffff8813f5a07d48 ffff8813f5a07d48 ffff8813f5a07db8
> Oct 23 11:43:52 urquell kernel: ffff881ff7fbbd98 ffff8813f5a07db8 ffff881ff7fbbeb0 000000005449058d
> Oct 23 11:43:52 urquell kernel: ffffffff851dd449 0000000000000000 0000000000000000 0000000000000000
> Oct 23 11:43:52 urquell kernel: Call Trace:
> Oct 23 11:43:52 urquell kernel: [<ffffffff851dd282>] ? __destroy_client+0xd2/0x130
> Oct 23 11:43:52 urquell kernel: [<ffffffff851dd449>] ? laundromat_main+0x149/0x3f0
> Oct 23 11:43:52 urquell kernel: [<ffffffff8507a6a4>] ? process_one_work+0x134/0x340
> Oct 23 11:43:52 urquell kernel: [<ffffffff8507adb4>] ? worker_thread+0x114/0x460
> Oct 23 11:43:52 urquell kernel: [<ffffffff85095b5c>] ? __wake_up_common+0x4c/0x80
> Oct 23 11:43:52 urquell kernel: [<ffffffff8507aca0>] ? init_pwq+0x160/0x160
> Oct 23 11:43:52 urquell kernel: [<ffffffff8507f38c>] ? kthread+0xbc/0xe0
> Oct 23 11:43:52 urquell kernel: [<ffffffff8507f2d0>] ? __kthread_parkme+0x70/0x70
> Oct 23 11:43:52 urquell kernel: [<ffffffff85511dac>] ? ret_from_fork+0x7c/0xb0
> Oct 23 11:43:52 urquell kernel: [<ffffffff8507f2d0>] ? __kthread_parkme+0x70/0x70
>
> And it repeats, always the same kworker on the same core.
>
> We then reverted to 3.14.21, since an uptime of ~5h (under load) is not viable
> for this mission-critical server.
>