Hi.
I think I have found a regression introduced by commit 9d0be50 "ext4:
Calculate metadata requirements more accurately".
I am using ext4 on an NFSv4 server running unpatched kernel 2.6.33.3.
The client is currently running unpatched 2.6.33.3, although I also saw
the problem with the client running 2.6.32.10.
The output from 'df' on the client varies wildly in the presence of
certain writes. I have not pinned down an exact write pattern that
causes it, but I do have an application that causes it fairly reliably.
When the bug happens, I see swings like this:
Sun May 9 23:04:58 2010 blocks=961173888 available=28183168
Sun May 9 23:04:59 2010 blocks=961173888 available=12823424
Sun May 9 23:05:00 2010 blocks=961173888 available=28183040
(produced by a script that checks statvfs output every second; units are
kB, df output is effectively identical)
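For reference, the check loop amounts to roughly the following C sketch
(illustrative only, not the exact script I ran):

/* Poll statvfs() on the given path once a second and print the totals
 * in kB, in the same format as the samples above. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	struct statvfs st;

	for (;;) {
		if (statvfs(path, &st) != 0) {
			perror("statvfs");
			return 1;
		}
		time_t now = time(NULL);
		printf("%.24s blocks=%llu available=%llu\n", ctime(&now),
		       (unsigned long long)st.f_blocks * st.f_frsize / 1024,
		       (unsigned long long)st.f_bavail * st.f_frsize / 1024);
		fflush(stdout);
		sleep(1);
	}
}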
There is no possible way this system could write and then erase 15GB of
disk space in a second, as the drive can sustain only about 40MB/sec.
This problem is not present in any of the 2.6.32.* kernels. I used git
bisect to narrow the range down to between 3e8d95d (good) and 741f21e8
(bad), but gave up at that point because the intermediate kernels would
oops before I could test them. There are two ext4 patches in that range,
9d0be50 and ee5f4d9; the other patches are S390 and SH arch fixes.
I checked out a copy of 2.6.33.3 and reverted commit 9d0be50. There was
a small conflict in fs/ext4/inode.c, which I merged by hand. The
resulting kernel has not exhibited the problem in an hour of testing,
whereas previously I could trigger it within a minute or two.
If you need more information, I can gladly provide it.
--
Bruce Guenter <[email protected]> http://untroubled.org/
Hi,
> I think I have found a regression introduced by commit 9d0be50 "ext4:
> Calculate metadata requirements more accurately".
Thanks for the report!
> I am using ext4 on an NFSv4 server running unpatched kernel 2.6.33.3.
> The client is currently running unpatched 2.6.33.3, although I also saw
> the problem with the client running 2.6.32.10.
>
> The output from 'df' on the client varies wildly in the presence of
> certain writes. I have not pinned down an exact write pattern that
> causes it, but I do have an application that causes it fairly reliably.
> When the bug happens, I see swings like this:
>
> Sun May 9 23:04:58 2010 blocks=961173888 available=28183168
> Sun May 9 23:04:59 2010 blocks=961173888 available=12823424
> Sun May 9 23:05:00 2010 blocks=961173888 available=28183040
>
> (produced by a script that checks statvfs output every second; units are
> kB, df output is effectively identical)
Hmm, I'm not seeing anything obviously wrong with that patch but
apparently the number of blocks reserved for delayed allocation is
miscomputed by a lot...
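To spell out why a reservation error shows up in 'df': under delayed
allocation, the available space reported by statfs() is roughly the free
block count minus the blocks reserved for not-yet-allocated data and its
estimated metadata, so an inflated metadata estimate directly depresses
'df' until writeback happens and the excess reservation is released. A
toy illustration (not the ext4 code; the kB figures are simply picked to
match your samples, and the amount of dirty data is a guess):

/* Toy model only -- not the ext4 code.  'available' while dirty data is
 * pending is roughly free - reserved_data - reserved_metadata, so an
 * over-estimated metadata reservation makes 'df' swing by far more than
 * the data actually written. */
#include <stdio.h>

int main(void)
{
	long long free_kb = 28183168;           /* steady-state 'available' from the report */
	long long data_kb = 40 * 1024;          /* guess: ~40MB of dirty data pending       */
	long long meta_kb = 15359744 - data_kb; /* hypothetical over-sized metadata reserve */

	printf("while the reservation is held: %lld kB available\n",
	       free_kb - data_kb - meta_kb);
	printf("after writeback releases it:   %lld kB available\n",
	       free_kb - data_kb);
	return 0;
}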
Is your ext4 filesystem created from scratch or converted from ext3? Is
your application using lots of different files, or just a couple of small
ones? Anyway, a reproducing program would be best in this case...
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Thu, May 20, 2010 at 06:11:21PM +0200, Jan Kara wrote:
> Hmm, I'm not seeing anything obviously wrong with that patch but
> apparently the number of blocks reserved for delayed allocation is
> miscomputed by a lot...
IIRC this only affects NFS clients. The same test running locally on
the server doesn't show the same problem. I will retest shortly and
confirm (and also test with 2.6.34).
> Is your ext4 filesystem created from scratch or converted from ext3?
I converted from ext3.
> Is
> your application using lots of different files, or just a couple of small
> ones?
The application writes over NFS two files using writev. One is
currently about 38MB and the other is under 1MB.
The "application" FWIW is saving games over NFS from Wesnoth. At least
that's the most obvious application I have found that can reliably
trigger the problem. I did notice the problem at several times when the
game was not even running, but couldn't pin down what else was
happening.
> Anyway, a reproducing program would be best in this case...
I will try to code something up to reproduce the symptoms using the
system call trace as a reference.
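Roughly what I have in mind is the sketch below (untested; the file
names and sizes are just placeholders chosen to mimic the save pattern).
I'd run it on the NFS mount while the statvfs watcher polls the
filesystem on the server:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Rewrite 'name' from scratch, 'total' bytes in 'chunk'-sized writev()s. */
static void rewrite(const char *name, size_t total, size_t chunk)
{
	int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror(name);
		exit(1);
	}
	char *buf = malloc(chunk);
	if (!buf) {
		perror("malloc");
		exit(1);
	}
	memset(buf, 'x', chunk);
	for (size_t done = 0; done < total; done += chunk) {
		struct iovec iov = { .iov_base = buf, .iov_len = chunk };
		if (writev(fd, &iov, 1) < 0) {
			perror("writev");
			exit(1);
		}
	}
	free(buf);
	close(fd);
}

int main(void)
{
	for (;;) {
		rewrite("big-save.gz", 38u << 20, 64 * 1024);    /* ~38MB file */
		rewrite("small-save.gz", 512 * 1024, 64 * 1024); /* <1MB file  */
		sleep(1);
	}
}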
--
Bruce Guenter <[email protected]> http://untroubled.org/
On Thu, May 20, 2010 at 10:32:14AM -0600, Bruce Guenter wrote:
> IIRC this only affects NFS clients. The same test running locally on
> the server doesn't show the same problem. I will retest shortly and
> confirm (and also test with 2.6.34).
Sorry, I was mistaken. Running 'df' either locally or remotely shows
the same results.
I have so far been unable to reproduce the problem with a simplified
test program, and have also been unable to reproduce it on 2.6.34 with
any program.
Hopefully this means the problem has been solved, but I'll let you know
if it reappears in 2.6.34.
--
Bruce Guenter <[email protected]> http://untroubled.org/
On Thu, May 20, 2010 at 02:59:11PM -0600, Bruce Guenter wrote:
> On Thu, May 20, 2010 at 10:32:14AM -0600, Bruce Guenter wrote:
> > IIRC this only affects NFS clients. The same test running locally on
> > the server doesn't show the same problem. I will retest shortly and
> > confirm (and also test with 2.6.34).
>
> Sorry, I was mistaken. Running 'df' either locally or remotely shows
> the same results.
>
> I have so far been unable to reproduce the problem with a simplified
> test program, and have also been unable to reproduce it on 2.6.34 with
> any program.
>
> Hopefully this means the problem has been solved, but I'll let you know
> if it reappears in 2.6.34.
Sorry for not responding right away; I've been a bit swamped lately
and this fell through the cracks in my inbox.
Yes, this is fixed in 2.6.34. The fix for your problem was commit
d330a5be. If I have a moment to breathe, I'll try to get that
backported to the 2.6.33 stable series, but upgrading to 2.6.34 works
too. :-)
- Ted