Date: Thu, 18 Jun 2009 21:56:46 +0400
From: Michael Tokarev
To: David Rientjes
Cc: "J. Bruce Fields", Justin Piszcz, linux-kernel@vger.kernel.org
Subject: Re: 2.6.29.1: nfsd: page allocation failure - nfsd or kernel problem?

David Rientjes wrote:
> On Thu, 18 Jun 2009, Michael Tokarev wrote:
>
>> David Rientjes wrote:
>>> On Thu, 18 Jun 2009, Michael Tokarev wrote:
>>>
>>>>> http://bugzilla.kernel.org/show_bug.cgi?id=13518
>>>> Does not look similar.
>>>>
>>>> I reproduced the issue here. The slab that is growing here is
>>>> buffer_head. It grows slowly -- right now, after ~5 minutes of
>>>> constant writes over NFS, it holds 428423 objects and is growing
>>>> at a rate of about 5000 objects/minute. When the writes stop, the
>>>> cache slowly shrinks back to an acceptable size, probably as the
>>>> data actually gets written to disk.
>>> Not sure if you're referring to the bugzilla entry or Justin's
>>> reported issue. Justin's issue is actually allocating a
>>> skbuff_head_cache slab while the system is oom.
>> We have the same issue -- I replied to Justin's initial email with
>> exactly the same trace as his. I didn't see your reply -- the one
>> you're referring to below -- until today.
>
> If it's the exact same trace, then the page allocation failure is
> occurring as the result of slab's growth of the skbuff_head_cache
> cache, not buffer_head.

See http://lkml.org/lkml/2009/6/16/550 -- the second message in that
thread is mine, and it shows exactly the same trace.

> So it appears as though the issue you're raising is that buffer_head
> is consuming far too much memory, which causes the system to be oom
> when attempting a GFP_ATOMIC allocation for skbuff_head_cache and is
> otherwise unseen with alloc_buffer_head() because it is allowed to
> invoke direct reclaim:
>
>  $ grep -r alloc_buffer_head\( fs/*
>  fs/buffer.c:       bh = alloc_buffer_head(GFP_NOFS);
>  fs/buffer.c:struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
>  fs/gfs2/log.c:     bh = alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);
>  fs/jbd/journal.c:  new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
>  fs/jbd2/journal.c: new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);

Might be. Here is the scenario I see. On a freshly booted server with
1.9GB of RAM, slabtop shows about 11K entries in the buffer_head slab
and about 1.7GB of free RAM.
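(For reference, I am watching these numbers with something like the
loop below -- a rough sketch, assuming the usual /proc/slabinfo layout
where the second column is the active-object count:)

  # Poll the buffer_head slab size and free memory once a second.
  while sleep 1; do
      awk '$1 == "buffer_head" { print "buffer_head objects:", $2 }' /proc/slabinfo
      grep MemFree: /proc/meminfo
  done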
When another machine starts writing to this one over NFS, the
buffer_head slab grows quite rapidly, up to about 450K entries (48940K
total) while free memory drops to almost zero -- all within the first
1-2 minutes (GigE network, writing /dev/zero with dd). The cache does
not grow further, but only because there is no free memory left to
grow into. On a 4GB machine it grows to about 920K objects. From time
to time during the write the same warning appears, and the write rate
drops from ~70MB/sec (close to the real speed of the target drive,
which can do ~80MB/sec) to almost zero for several seconds.

>> As far as I can see, the warning itself, while harmless, indicates
>> some deeper problem. Namely, we shouldn't have an OOM condition --
>> the system is doing nothing but NFS, there's only one NFS client
>> writing a single large file, and the system has 2GB (or 4GB on the
>> other machine) of RAM. It should not OOM to start with.
>
> Thanks to the page allocation failure that Justin posted earlier,
> which shows the state of the available system memory, it shows that
> the machine truly is oom. You seem to have isolated that to an
> enormous amount of buffer_head slab, which is a good start.

It does not really seem to be the slab itself. In my case the total
amount of buffer_heads is about 49MB, which is very small compared
with the amount of memory in the system. But as far as I can *guess*,
a buffer_head is just that -- a head, a pointer to some other place...
unwritten or cached data? Note that the only way I have found to
shrink that buffer_head cache back is to remove the file in question
on the server.

>> Well, there ARE side effects actually. When the issue happens, I/O
>> over NFS slows down to almost zero bytes/sec for a while, and
>> resumes slowly after about half a minute -- sometimes faster,
>> sometimes slower. Again, the warning itself is harmless, but it
>> shows a deeper issue. I don't think it's wise to ignore the symptom
>> -- the actual cause should be fixed instead. I think.
>
> Since the GFP_ATOMIC allocation cannot trigger reclaim itself, it
> must rely on other allocations or background writeout to free the
> memory, and this will be considerably slower than a blocking
> allocation. The page allocation failure messages from Justin's post
> indicate there are 0 pages under writeback at the time of oom, yet
> ZONE_NORMAL has reclaimable memory; this is the result of the
> nonblocking allocation.

So... what's the "consensus" so far? Just shut up the warning, as you
initially proposed? At least I don't see an immediate alternative.
But then, I don't know the kernel internals either :)

Thanks!

/mjt
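P.S. In case anyone wants to reproduce this, the sequence below is
roughly what I am doing (a sketch -- /mnt/server is the client-side
mount of the NFS export and /srv/export is the exported directory on
the server; both paths are placeholders for my setup):

  # On the NFS client: stream zeroes into one large file on the mount.
  dd if=/dev/zero of=/mnt/server/bigfile bs=1M count=20000

  # Meanwhile on the server: watch buffer_head grow (slabtop, or the
  # loop above) and MemFree drop; the stalls and the page allocation
  # warning show up within the first minute or two.

  # Removing the file on the server is what shrinks the cache back:
  rm /srv/export/bigfile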