2008-06-13 20:15:57

by J. Bruce Fields

Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?

On Thu, Jun 12, 2008 at 02:54:09PM -0500, Weathers, Norman R. wrote:
>
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of J. Bruce Fields
> > Sent: Wednesday, June 11, 2008 5:55 PM
> > To: Weathers, Norman R.
> > Cc: Jeff Layton; [email protected];
> > [email protected]
> > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> >
> > On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> > > I will try and get it patched and retested, but it may be a day or two
> > > before I can get back the information due to production jobs now
> > > running. Once they finish up, I will get back with the info.
> >
> > Understood.
> >
>
>
> I was able to get my big user to cooperate and let me in to be able to
> get the information that you were needing. The full output from the
> /proc/slab_allocator file is at
> http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 . The 16
> thread case is very interesting. Also, there is a small txt file in the
> directory that has some rpc errors, but I imagine the way that I am
> running the box (oversubscribed threads) has more to do with the rpc
> errors than anything else. For those of you wanting the gist of the
> story, the size-4096 slab has the following very large allocation:
>
> size-4096: 2 sys_init_module+0x140b/0x1980
> size-4096: 1 __vmalloc_area_node+0x188/0x1b0
> size-4096: 1 seq_read+0x1d9/0x2e0
> size-4096: 1 slabstats_open+0x2b/0x80
> size-4096: 5 vc_allocate+0x167/0x190
> size-4096: 3 input_allocate_device+0x12/0x80
> size-4096: 1 hid_add_field+0x122/0x290
> size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
> size-4096: 1846825 __alloc_skb+0x7d/0x170
> size-4096: 3 alloc_netdev+0x33/0xa0
> size-4096: 10 neigh_sysctl_register+0x52/0x2b0
> size-4096: 5 devinet_sysctl_register+0x28/0x110
> size-4096: 1 pidmap_init+0x15/0x60
> size-4096: 1 netlink_proto_init+0x44/0x190
> size-4096: 1 ip_rt_init+0xfd/0x2f0
> size-4096: 1 cipso_v4_init+0x13/0x70
> size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
> size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
> size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
> size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
> size-4096: 1 joydev_connect+0x53/0x390 [joydev]
> size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
> size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
> size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
> size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
> size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
> size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
> size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
>
> The big one seems to be the __alloc_skb. (This is with 16 threads, and
> it says that we are using up somewhere between 12 and 14 GB of memory,
> about 2 to 3 gig of that is disk cache). If I were to put any more
> threads out there, the server would become almost unresponsive (it was
> bad enough as it was).
>
> At the same time, I also noticed this:
>
> skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
>
> Don't know for sure if that is meaningful or not....

OK, so, starting at net/core/skbuff.c, this means that this memory was
allocated by __alloc_skb() calls with something nonzero in the third
("fclone") argument. The only such caller is alloc_skb_fclone().
Callers of alloc_skb_fclone() include:

sk_stream_alloc_skb:
do_tcp_sendpages
tcp_sendmsg
tcp_fragment
tso_fragment
tcp_mtu_probe
tcp_send_fin
tcp_connect
buf_acquire:
lots of callers in tipc code (whatever that is).

So unless you're using tipc, or you have something in userspace going
haywire (perhaps netstat would help rule that out?), then I suppose
there's something wrong with knfsd's tcp code. Which makes sense, I
guess.
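
For anyone who wants to confirm from the full /proc/slab_allocator dump which caller dominates a given cache, here is a rough userspace sketch. It assumes every line has the "cache: count caller+offset/length [module]" shape shown in the quoted output; the struct and function names (slab_entry, parse_slab_line, top_caller) are made up for illustration and are not kernel interfaces.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line: "cache: count caller+off/len [module]" (assumed layout). */
struct slab_entry {
	char cache[64];
	unsigned long count;
	char caller[128];
};

/* Parse a single line; returns 1 on success, 0 if the line does not match. */
int parse_slab_line(const char *line, struct slab_entry *e)
{
	return sscanf(line, " %63[^:]: %lu %127s",
		      e->cache, &e->count, e->caller) == 3;
}

/*
 * Scan a dump for the entry with the highest count in the named cache.
 * Returns 1 and fills *best if any line matched, otherwise 0.
 */
int top_caller(FILE *in, const char *cache, struct slab_entry *best)
{
	char line[512];
	struct slab_entry e;
	int found = 0;

	while (fgets(line, sizeof(line), in)) {
		if (!parse_slab_line(line, &e) || strcmp(e.cache, cache) != 0)
			continue;
		if (!found || e.count > best->count) {
			*best = e;
			found = 1;
		}
	}
	return found;
}
```

Calling top_caller(stdin, "size-4096", &best) from a small main(), with the dump redirected to stdin, should report the __alloc_skb line for the output quoted above.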

I'd think this sort of allocation would be limited by the number of
sockets times the size of the send and receive buffers.
svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
connections" printk there, are you?) The total buffer size should be
bounded by something like 4 megs.
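
As a back-of-the-envelope check of those numbers, the sketch below plugs in the figures reported in this thread: 16 nfsd threads and 1846825 live size-4096 objects from __alloc_skb. The helper names are invented for illustration, and spreading the memory evenly across sockets is only an assumption made to get a feel for the scale.

```c
#include <stdio.h>

/* Socket cap quoted above from svc_check_conn_limits(): (nrthreads+3)*20. */
unsigned long conn_limit(unsigned long nrthreads)
{
	return (nrthreads + 3) * 20;
}

/* Bytes held by `count` live objects of a fixed-size slab cache. */
unsigned long long slab_bytes(unsigned long long count,
			      unsigned long long objsize)
{
	return count * objsize;
}

/* Print the comparison for the 16-thread run reported in this thread. */
void print_estimate(void)
{
	unsigned long limit = conn_limit(16);
	unsigned long long bytes = slab_bytes(1846825ULL, 4096ULL);

	printf("socket cap:        %lu\n", limit);
	printf("size-4096 in skbs: %llu bytes (%.2f GiB)\n",
	       bytes, (double)bytes / (1024.0 * 1024.0 * 1024.0));
	/* Assumption: memory spread evenly over the maximum socket count. */
	printf("per socket at cap: %.1f MiB\n",
	       (double)bytes / (double)limit / (1024.0 * 1024.0));
}
```

With 16 threads the cap works out to 380 sockets, and the size-4096 objects alone come to 7564595200 bytes, about 7 GiB. Even if the 4 meg figure is read as a per-socket bound, 380 sockets would account for only around 1.5 GiB, so the observed usage is well past what the limits should allow.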

--b.

>
>
>
> > > Thanks everyone for looking at this, by the way!
> >
> > And thanks for your persistence.
> >
> > --b.
> >
>
>
> Anytime. This is the part of the job that is fun (except for my
> users...). Anyone can watch a system run, it's dealing with the unknown
> that makes it interesting.

OK! Because I'm a bit stuck, so this will take some more work....

--b.

>
>
> Norman Weathers
>
>
> > >
> > > >
> > > >
> > > > diff --git a/mm/slab.c b/mm/slab.c
> > > > index 06236e4..b379e31 100644
> > > > --- a/mm/slab.c
> > > > +++ b/mm/slab.c
> > > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> > > > * above the next power of two: caches with object sizes just above a
> > > > * power of two have a significant amount of internal fragmentation.
> > > > */
> > > > - if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > > + if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > > 2 * sizeof(unsigned long long)))
> > > > flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > > > if (!(flags & SLAB_DESTROY_BY_RCU))
> > > >
> > >
> > >
> > > Norman Weathers
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > linux-nfs" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >


2008-06-13 21:53:38

by Weathers, Norman R.

Subject: RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?



> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of J. Bruce Fields
> Sent: Friday, June 13, 2008 3:16 PM
> To: Weathers, Norman R.
> Cc: Jeff Layton; [email protected];
> [email protected]; Neil Brown
> Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
>
> On Thu, Jun 12, 2008 at 02:54:09PM -0500, Weathers, Norman R. wrote:
> >
> >
> > > -----Original Message-----
> > > From: [email protected]
> > > > [mailto:[email protected]] On Behalf Of J. Bruce Fields
> > > Sent: Wednesday, June 11, 2008 5:55 PM
> > > To: Weathers, Norman R.
> > > Cc: Jeff Layton; [email protected];
> > > [email protected]
> > > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger?
> > >
> > > > On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, Norman R. wrote:
> > > > > I will try and get it patched and retested, but it may be a day or two
> > > > > before I can get back the information due to production jobs now
> > > > > running. Once they finish up, I will get back with the info.
> > >
> > > Understood.
> > >
> >
> >
> > I was able to get my big user to cooperate and let me in to be able to
> > get the information that you were needing. The full output from the
> > /proc/slab_allocator file is at
> > http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 . The 16
> > thread case is very interesting. Also, there is a small txt file in the
> > directory that has some rpc errors, but I imagine the way that I am
> > running the box (oversubscribed threads) has more to do with the rpc
> > errors than anything else. For those of you wanting the gist of the
> > story, the size-4096 slab has the following very large allocation:
> >
> > size-4096: 2 sys_init_module+0x140b/0x1980
> > size-4096: 1 __vmalloc_area_node+0x188/0x1b0
> > size-4096: 1 seq_read+0x1d9/0x2e0
> > size-4096: 1 slabstats_open+0x2b/0x80
> > size-4096: 5 vc_allocate+0x167/0x190
> > size-4096: 3 input_allocate_device+0x12/0x80
> > size-4096: 1 hid_add_field+0x122/0x290
> > size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
> > size-4096: 1846825 __alloc_skb+0x7d/0x170
> > size-4096: 3 alloc_netdev+0x33/0xa0
> > size-4096: 10 neigh_sysctl_register+0x52/0x2b0
> > size-4096: 5 devinet_sysctl_register+0x28/0x110
> > size-4096: 1 pidmap_init+0x15/0x60
> > size-4096: 1 netlink_proto_init+0x44/0x190
> > size-4096: 1 ip_rt_init+0xfd/0x2f0
> > size-4096: 1 cipso_v4_init+0x13/0x70
> > size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
> > size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
> > size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
> > size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
> > size-4096: 1 joydev_connect+0x53/0x390 [joydev]
> > size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
> > size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
> > size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
> > size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
> > size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
> >
> > The big one seems to be the __alloc_skb. (This is with 16 threads, and
> > it says that we are using up somewhere between 12 and 14 GB of memory,
> > about 2 to 3 gig of that is disk cache). If I were to put any more
> > threads out there, the server would become almost unresponsive (it was
> > bad enough as it was).
> >
> > At the same time, I also noticed this:
> >
> > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> >
> > Don't know for sure if that is meaningful or not....
>
> OK, so, starting at net/core/skbuff.c, this means that this memory was
> allocated by __alloc_skb() calls with something nonzero in the third
> ("fclone") argument. The only such caller is alloc_skb_fclone().
> Callers of alloc_skb_fclone() include:
>
> sk_stream_alloc_skb:
> do_tcp_sendpages
> tcp_sendmsg
> tcp_fragment
> tso_fragment

Interesting that you should mention tso_fragment... We recently went
through and turned on TSO on all of our systems to see if it helped with
performance, so this could have something to do with that. I will try
disabling TSO on all of the servers and monitor whether that helps with
the memory. I think it might help some, but I still think there may be
something else going on in a deep corner...

> tcp_mtu_probe
> tcp_send_fin
> tcp_connect
> buf_acquire:
> lots of callers in tipc code (whatever that is).
>
> So unless you're using tipc, or you have something in userspace going
> haywire (perhaps netstat would help rule that out?), then I suppose
> there's something wrong with knfsd's tcp code. Which makes sense, I
> guess.
>

Not sure what tipc is either....

> I'd think this sort of allocation would be limited by the number of
> sockets times the size of the send and receive buffers.
> svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
> sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
> connections" printk there, are you?) The total buffer size should be
> bounded by something like 4 megs.
>
> --b.
>

Yes, we are getting a continuous stream of the too many open connections
scrolling across our logs.


> >
> >
> >
> > > > Thanks everyone for looking at this, by the way!
> > >
> > > And thanks for your persistence.
> > >
> > > --b.
> > >
> >
> >
> > Anytime. This is the part of the job that is fun (except for my
> > users...). Anyone can watch a system run, it's dealing with the unknown
> > that makes it interesting.
>
> OK! Because I'm a bit stuck, so this will take some more work....
>
> --b.
>

No problem. I feel good if I exercised some deep corner of the code
and found something that needed to be flushed out; that's what the
experience is all about, isn't it?


> >
> >
> > Norman Weathers
> >
> >
> > > >
> > > > >
> > > > >
> > > > > diff --git a/mm/slab.c b/mm/slab.c
> > > > > index 06236e4..b379e31 100644
> > > > > --- a/mm/slab.c
> > > > > +++ b/mm/slab.c
> > > > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name,
> > > > > size_t size, size_t align,
> > > > > * above the next power of two: caches with object
> > > > > sizes just above a
> > > > > * power of two have a significant amount of internal
> > > > > fragmentation.
> > > > > */
> > > > > - if (size < 4096 || fls(size - 1) == fls(size-1
> + REDZONE_ALIGN +
> > > > > + if (size < 8192 || fls(size - 1) == fls(size-1
> + REDZONE_ALIGN +
> > > > > 2 *
> > > > > sizeof(unsigned long long)))
> > > > > flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > > > > if (!(flags & SLAB_DESTROY_BY_RCU))
> > > > >
> > > >
> > > >
> > > > Norman Weathers
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> > > linux-nfs" in
> > > the body of a message to [email protected]
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>