From: "Weathers, Norman R." Subject: RE: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger? Date: Fri, 13 Jun 2008 16:53:31 -0500 Message-ID: <0122F800A3B64C449565A9E8C297701002D75DB6@hoexmb9.conoco.net> References: <0122F800A3B64C449565A9E8C297701002D75D9F@hoexmb9.conoco.net> <20080610171602.GG20184@fieldses.org> <0122F800A3B64C449565A9E8C297701002D75DA3@hoexmb9.conoco.net> <20080611184613.GM15380@fieldses.org> <20080611195222.GP15380@fieldses.org> <20080611160947.5f08fb16@tleilax.poochiereds.net> <20080611205749.GA25194@fieldses.org> <0122F800A3B64C449565A9E8C297701002D75DAA@hoexmb9.conoco.net> <20080611225431.GD25194@fieldses.org> <0122F800A3B64C449565A9E8C297701002D75DAE@hoexmb9.conoco.net> <20080613201552.GH8501@fieldses.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Cc: "Jeff Layton" , , , "Neil Brown" To: "J. Bruce Fields" Return-path: Received: from mailman2.ppco.com ([138.32.41.14]:49959 "EHLO mailman2.ppco.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754329AbYFMVxi convert rfc822-to-8bit (ORCPT ); Fri, 13 Jun 2008 17:53:38 -0400 In-Reply-To: <20080613201552.GH8501@fieldses.org> Sender: linux-nfs-owner@vger.kernel.org List-ID: > -----Original Message----- > From: linux-nfs-owner@vger.kernel.org > [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields > Sent: Friday, June 13, 2008 3:16 PM > To: Weathers, Norman R. > Cc: Jeff Layton; linux-kernel@vger.kernel.org; > linux-nfs@vger.kernel.org; Neil Brown > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger? > > On Thu, Jun 12, 2008 at 02:54:09PM -0500, Weathers, Norman R. wrote: > > > > > > > -----Original Message----- > > > From: linux-nfs-owner@vger.kernel.org > > > [mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. > Bruce Fields > > > Sent: Wednesday, June 11, 2008 5:55 PM > > > To: Weathers, Norman R. > > > Cc: Jeff Layton; linux-kernel@vger.kernel.org; > > > linux-nfs@vger.kernel.org > > > Subject: Re: CONFIG_DEBUG_SLAB_LEAK omits size-4096 and larger? > > > > > > On Wed, Jun 11, 2008 at 05:46:13PM -0500, Weathers, > Norman R. wrote: > > > > I will try and get it patched and retested, but it may be a > > > day or two > > > > before I can get back the information due to production jobs now > > > > running. Once they finish up, I will get back with the info. > > > > > > Understood. > > > > > > > > > I was able to get my big user to cooperate and let me in to > be able to > > get the information that you were needing. The full output from the > > /proc/slab_allocator file is at > > http://www.shashi-weathers.net/linux/cluster/NFS_DEBUG_2 . The 16 > > thread case is very interesting. Also, there is a small > txt file in the > > directory that has some rpc errors, but I imagine the way that I am > > running the box (oversubscribed threads) has more to do with the rpc > > errors than anything else. 
> >
> > For those of you wanting the gist of the story, the size-4096 slab has
> > the following very large allocation:
> >
> > size-4096: 2 sys_init_module+0x140b/0x1980
> > size-4096: 1 __vmalloc_area_node+0x188/0x1b0
> > size-4096: 1 seq_read+0x1d9/0x2e0
> > size-4096: 1 slabstats_open+0x2b/0x80
> > size-4096: 5 vc_allocate+0x167/0x190
> > size-4096: 3 input_allocate_device+0x12/0x80
> > size-4096: 1 hid_add_field+0x122/0x290
> > size-4096: 9 reqsk_queue_alloc+0x5f/0xf0
> > size-4096: 1846825 __alloc_skb+0x7d/0x170
> > size-4096: 3 alloc_netdev+0x33/0xa0
> > size-4096: 10 neigh_sysctl_register+0x52/0x2b0
> > size-4096: 5 devinet_sysctl_register+0x28/0x110
> > size-4096: 1 pidmap_init+0x15/0x60
> > size-4096: 1 netlink_proto_init+0x44/0x190
> > size-4096: 1 ip_rt_init+0xfd/0x2f0
> > size-4096: 1 cipso_v4_init+0x13/0x70
> > size-4096: 3 journal_init_revoke+0xe7/0x270 [jbd]
> > size-4096: 3 journal_init_revoke+0x18a/0x270 [jbd]
> > size-4096: 2 journal_init_inode+0x84/0x150 [jbd]
> > size-4096: 2 bnx2_alloc_mem+0x18/0x1f0 [bnx2]
> > size-4096: 1 joydev_connect+0x53/0x390 [joydev]
> > size-4096: 13 kmem_alloc+0xb3/0x100 [xfs]
> > size-4096: 5 addrconf_sysctl_register+0x31/0x130 [ipv6]
> > size-4096: 7 rpc_clone_client+0x84/0x140 [sunrpc]
> > size-4096: 3 rpc_create+0x254/0x4d0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x53/0x1f0 [sunrpc]
> > size-4096: 16 __svc_create_thread+0x72/0x1f0 [sunrpc]
> > size-4096: 1 nfsd_racache_init+0x2e/0x140 [nfsd]
> >
> > The big one seems to be the __alloc_skb. (This is with 16 threads, and
> > it says that we are using up somewhere between 12 and 14 GB of memory;
> > about 2 to 3 GB of that is disk cache). If I were to put any more
> > threads out there, the server would become almost unresponsive (it was
> > bad enough as it was).
> >
> > At the same time, I also noticed this:
> >
> > skbuff_fclone_cache: 1842524 __alloc_skb+0x50/0x170
> >
> > Don't know for sure if that is meaningful or not....
>
> OK, so, starting at net/core/skbuff.c, this means that this memory was
> allocated by __alloc_skb() calls with something nonzero in the third
> ("fclone") argument. The only such caller is alloc_skb_fclone().
> Callers of alloc_skb_fclone() include:
>
>     sk_stream_alloc_skb:
>         do_tcp_sendpages
>         tcp_sendmsg
>         tcp_fragment
>         tso_fragment

Interesting you should mention TSO... We recently went through and
turned on TSO on all of our systems, trying it out to see if it helped
with performance... This could have something to do with that. I can try
disabling TSO on all of the servers and see if that helps with the
memory. Actually, I think I will, and I will monitor the situation. I
think it might help some, but I still think there may be something else
going on in a deep corner...

>     tcp_mtu_probe
>     tcp_send_fin
>     tcp_connect
>     buf_acquire:
>         lots of callers in tipc code (whatever that is).
>
> So unless you're using tipc, or you have something in userspace going
> haywire (perhaps netstat would help rule that out?), then I suppose
> there's something wrong with knfsd's tcp code. Which makes sense, I
> guess.

Not for sure what tipc is either....

> I'd think this sort of allocation would be limited by the number of
> sockets times the size of the send and receive buffers.
> svc_xprt.c:svc_check_conn_limits() claims to be limiting the number of
> sockets to (nrthreads+3)*20. (You aren't hitting the "too many open
> connections" printk there, are you?) The total buffer size should be
> bounded by something like 4 megs.
>
> --b.

Yes, we are getting a continuous stream of the "too many open
connections" messages scrolling across our logs.
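As a rough sanity check on the cap described above, here is a minimal
sketch of the (nrthreads + 3) * 20 arithmetic. It is only an
illustration of the rule as Bruce states it, not the actual
svc_check_conn_limits() code:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	/* nfsd thread count; 16 matches the case discussed above */
	unsigned long threads = argc > 1 ? strtoul(argv[1], NULL, 0) : 16;

	/* the per-server socket cap as described: (nrthreads + 3) * 20 */
	unsigned long limit = (threads + 3) * 20;

	printf("%lu nfsd threads -> roughly %lu sockets allowed before the\n"
	       "\"too many open connections\" message starts appearing\n",
	       threads, limit);

	/*
	 * For comparison, the size-4096 listing above shows 1846825 live
	 * objects allocated by __alloc_skb(), i.e. on the order of 7 GB,
	 * far more than a few hundred sockets with bounded send/receive
	 * buffers should be able to pin.
	 */
	return 0;
}

With 16 threads that works out to a cap of 380 sockets.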
> > > > Thanks everyone for looking at this, by the way!
> > >
> > > And thanks for your persistence.
> > >
> > > --b.
> >
> > Anytime. This is the part of the job that is fun (except for my
> > users...). Anyone can watch a system run; it's dealing with the
> > unknown that makes it interesting.
>
> OK! Because I'm a bit stuck, so this will take some more work....
>
> --b.

No problems. I feel good if I exercised some deep corner of the code and
found something that needed to be flushed out; that's what the
experience is all about, isn't it?

> >
> > Norman Weathers
> >
> > > > > diff --git a/mm/slab.c b/mm/slab.c
> > > > > index 06236e4..b379e31 100644
> > > > > --- a/mm/slab.c
> > > > > +++ b/mm/slab.c
> > > > > @@ -2202,7 +2202,7 @@ kmem_cache_create (const char *name, size_t size, size_t align,
> > > > >  	 * above the next power of two: caches with object sizes just above a
> > > > >  	 * power of two have a significant amount of internal fragmentation.
> > > > >  	 */
> > > > > -	if (size < 4096 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > > > +	if (size < 8192 || fls(size - 1) == fls(size-1 + REDZONE_ALIGN +
> > > > >  						2 * sizeof(unsigned long long)))
> > > > >  		flags |= SLAB_RED_ZONE | SLAB_STORE_USER;
> > > > >  	if (!(flags & SLAB_DESTROY_BY_RCU))
> > > >
> > > > Norman Weathers
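To see why the unpatched test skips size-4096 and larger, here is a
quick user-space sketch of the condition in that hunk. The fls()
stand-in and the assumption that REDZONE_ALIGN works out to 8 bytes are
mine, not part of the patch:

#include <stdio.h>

/* user-space stand-in for the kernel's fls(): 1-based index of the highest set bit */
static int fls(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

int main(void)
{
	/* assumed: REDZONE_ALIGN evaluates to 8 on this configuration */
	const unsigned int redzone_align = 8;
	const unsigned int sizes[] = { 1024, 2048, 4096, 8192 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		unsigned int size = sizes[i];

		/* the unpatched condition from kmem_cache_create() quoted above */
		int debug = size < 4096 ||
			    fls(size - 1) == fls(size - 1 + redzone_align +
						 2 * sizeof(unsigned long long));

		printf("size-%u: SLAB_STORE_USER %s\n", size,
		       debug ? "set" : "not set (omitted from /proc/slab_allocators)");
	}
	return 0;
}

Run as-is, this reports the debug flags "not set" for size-4096 and
size-8192: adding the red zone and last-user tracking would push those
objects past the next power of two. Bumping the first test to
size < 8192, as the patch above does, is what made the size-4096
entries show up in /proc/slab_allocators at all.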