2009-08-21 00:00:35

by Simon Kirby

Subject: Re: 2.6.28.9: EXT3/NFS inodes corruption

On Thu, Aug 20, 2009 at 07:19:53PM +0200, Sylvain Rochet wrote:

> So, everything is fine, but the problem happened only one time on this
> server, so we cannot conclude anything after a few weeks. However,
> I now have physical access back, so we will switch back to the former
> server where the problem happened quite frequently, then we will see!

Not to derail the thread, but you were definitely seeing the same issues
with stock 2.6.30.4, right? We had all sorts of corruption happening for
files served via NFS with 2.6.28 and 2.6.29, but everything was magically
fixed on 2.6.30 (though we needed a lot of fscking). I never did track
down what change fixed it, since it took a while to reproduce.

Hmm. I just noticed what seems to be a new occurrence of "deleted inode
referenced" on a box with 2.6.30. We saw many when we first upgraded to
2.6.30 due to the corruption caused by 2.6.29, but those all occurred
within a day or so and were fsck'd. I would have thought the backup
sweeps would have tripped over that inode way before now...

Just wondering if you can confirm that the errors you saw with 2.6.30.4
were not leftover from older kernels.

Cheers,

Simon-


2009-08-21 10:51:57

by Sylvain Rochet

Subject: Re: 2.6.28.9: EXT3/NFS inodes corruption

Hi,


On Thu, Aug 20, 2009 at 05:00:35PM -0700, Simon Kirby wrote:
> On Thu, Aug 20, 2009 at 07:19:53PM +0200, Sylvain Rochet wrote:
>
> > So, everything is fine, but the problem happened only one time on this
> > server, so we cannot conclude anything after a few weeks. However,
> > I now have physical access back, so we will switch back to the former
> > server where the problem happened quite frequently, then we will see!
>
> Not to derail the thread, but you were definitely seeing the same issues
> with stock 2.6.30.4, right?

Nope, the last issue we had came from 2.6.28.9.

We upgraded to 2.6.30.3 on Jan's advice, then "upgraded" to 2.6.30.3
plus Jan's first patch, which adds some debug output
(0001-ext3-Debug-unlinking-of-inodes.patch). Finally, we upgraded to
2.6.30.4 with both Jan's first patch and his second one
(0001-fs-Make-sure-data-stored-into-inode-is-properly-see.patch), which
adds an smp_mb() in the unlock_new_inode() function.
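
As the name of the second patch suggests, the point of that barrier is to
make sure data stored into a new inode is seen by other CPUs before the
inode is unlocked. Below is a minimal userspace sketch of that
publish/observe pattern, not the kernel patch itself. It assumes C11
atomics and pthreads. The names fake_inode, init_and_unlock,
wait_and_read and run_publish_demo are hypothetical, the is_new flag only
plays the role of the inode's "new" state, and
atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a freshly allocated inode: the payload
 * fields are plain data, is_new plays the role of the "new" state. */
struct fake_inode {
    int ino;
    int nlink;
    atomic_bool is_new;
};

/* Writer: initialise the inode, then "unlock" it. The full fence plays
 * the role of smp_mb(): the stores above it must be visible to any
 * thread that observes is_new == false. */
static void *init_and_unlock(void *arg)
{
    struct fake_inode *inode = arg;

    inode->ino = 42;
    inode->nlink = 1;
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&inode->is_new, false, memory_order_relaxed);
    return NULL;
}

/* Reader: wait until the inode is no longer "new", then read it.
 * The reader-side fence pairs with the writer's fence. */
static int wait_and_read(struct fake_inode *inode)
{
    while (atomic_load_explicit(&inode->is_new, memory_order_relaxed))
        ; /* spin; a real kernel would sleep instead */
    atomic_thread_fence(memory_order_seq_cst);

    /* Without the fences, this could observe stale, half-initialised data. */
    assert(inode->nlink == 1);
    return inode->ino;
}

/* Runs one publish/observe round and returns the inode number the reader
 * saw, or -1 if the writer thread could not be started. */
int run_publish_demo(void)
{
    struct fake_inode inode = { .ino = 0, .nlink = 0 };
    pthread_t writer;
    int seen;

    atomic_init(&inode.is_new, true);
    if (pthread_create(&writer, NULL, init_and_unlock, &inode) != 0)
        return -1;
    seen = wait_and_read(&inode);
    pthread_join(writer, NULL);
    return seen;
}
```

Build with -pthread. Removing both fences turns the plain reads of ino and
nlink into a data race, which is the class of bug a missing barrier would
expose only on SMP machines.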


> We had all sorts of corruption happening for files served via NFS with
> 2.6.28 and 2.6.29, but everything was magically fixed on 2.6.30
> (though we needed a lot of fscking). I never did track down what
> change fixed it, since it took a while to reproduce.

Same here, everything has been fine since 2.6.30. In a few days we will
switch back to the quad-core server where the corruption happened. We
are currently using a bi-Opteron server because we suspected hardware
issues on the quad-core, yet the corruption happened once on the
bi-Opteron too (which, IMHO, is sufficient evidence to rule out a
hardware issue). I guess the issue was (or is) kind of SMP-related.

And yep, we also spent a long time playing with fsck ;-) Luckily, the
corruption only occurs on new files, and new files are mostly caches,
sessions, logs and the like, so fsck took its chainsaw to mostly
unimportant files.


> Hmm. I just noticed what seems to be a new occurrence of "deleted inode
> referenced" on a box with 2.6.30. We saw many when we first upgraded to
> 2.6.30 due to the corruption caused by 2.6.29, but those all occurred
> within a day or so and were fsck'd. I would have thought the backup
> sweeps would have tripped over that inode way before now...
>
> Just wondering if you can confirm that the errors you saw with 2.6.30.4
> were not leftover from older kernels.

The few garbaged inodes from 2.6.28.9 (and earlier) were moved to
lost+found to prevent them from being used again. We ran fsck when we
moved to 2.6.30.4, which fixed everything, and we have not had any
corruption with 2.6.30.4 yet.


Sylvain

