Hi
Since upgrading to 2.6.25-rc1 I see filesystem corruption on my XFS
filesystem. I can reproduce this by doing "git reset --hard v2.6.25-rc1"
on a git checkout which is on some other revision. Git outputs strange
error messages (like file xxx is a directory when xxx really is a file)
and sometimes the filesystem "hangs" (I can no longer do any operations
on it even from another shell). If I reboot with a working kernel and
check the filesystem xfs_check reports many errors. I also see the
problem when doing other (not related to git) operations on the
filesystem. Git reset is just the easiest way to reproduce it.
I was able to track this corruption down to commit
a69b176df246d59626e6a9c640b44c0921fa4566 ([XFS] Use the generic bitops
rather than implementing them ourselves.) using git bisect.
Reverting edd319dc527733e61eec5bdc9ce20c94634b6482 ([XFS] Fix
xfs_lowbit64) to avoid merge conflicts and the faulty commit on top of
2.6.25-rc3 fixes the problem.
My filesystem is on an LVM2 logical volume and my computer is a
PowerBook G4 (model 5,8). I'm using GCC 4.2.3.
My problem is similar to the problem Johannes Berg reported in:
http://oss.sgi.com/archives/xfs/2008-02/msg00244.html
AFAIK Johannes also uses a PowerBook. Maybe this is an endianness
issue.
Gaudenz
--
Ever tried. Ever failed. No matter.
Try again. Fail again. Fail better.
~ Samuel Beckett ~
Hi,
> Git reset is just the easiest way to reproduce it.
Interesting :)
> I was able to track this corruption down to commit
> a69b176df246d59626e6a9c640b44c0921fa4566 ([XFS] Use the generic bitops
> rather than implementing them ourselves.) using git bisect.
>
> Reverting edd319dc527733e61eec5bdc9ce20c94634b6482 ([XFS] Fix
> xfs_lowbit64) to avoid merge conflicts and the faulty commit on top of
> 2.6.25-rc3 fixes the problem.
Odd. The replaced code doesn't look like it has any sort of endianness
assumptions.
> My filesystem is on an LVM2 logical volume and my computer is a
> PowerBook G4 (model 5,8). I'm using GCC 4.2.3.
>
> My problem is similar to the problem Johannes Berg reported in:
> http://oss.sgi.com/archives/xfs/2008-02/msg00244.html
>
> AFAIK Johannes also uses a PowerBook.
Indeed, I do, forgot to mention that, thanks for copying me.
johannes
Gaudenz Steinlin wrote:
> Hi
>
> Since upgrading to 2.6.25-rc1 I see filesystem corruption on my XFS
> filesystem. I can reproduce this by doing "git reset --hard v2.6.25-rc1"
> on a git checkout which is on some other revision. Git outputs strange
> error messages (like file xxx is a directory when xxx really is a file)
> and sometimes the filesystem "hangs" (I can no longer do any operations
> on it even from another shell). If I reboot with a working kernel and
> check the filesystem xfs_check reports many errors. I also see the
> problem when doing other (not related to git) operations on the
> filesystem. Git reset is just the easiest way to reproduce it.
>
> I was able to track this corruption down to commit
> a69b176df246d59626e6a9c640b44c0921fa4566 ([XFS] Use the generic bitops
> rather than implementing them ourselves.) using git bisect.
>
> Reverting edd319dc527733e61eec5bdc9ce20c94634b6482 ([XFS] Fix
> xfs_lowbit64) to avoid merge conflicts and the faulty commit on top of
> 2.6.25-rc3 fixes the problem.
If you're feeling motivated, maybe you can narrow it down to which of
the changes - xfs_highbit32, xfs_highbit64, xfs_lowbit32, or
xfs_lowbit64 - is causing the problem? (or maybe they all are ...)
Or maybe someone looking at the commit can immediately see the
problem... but I can't :)
-Eric
On Monday, 25 of February 2008, Eric Sandeen wrote:
> Gaudenz Steinlin wrote:
> > Hi
> >
> > Since upgrading to 2.6.25-rc1 I see filesystem corruption on my XFS
> > filesystem. I can reproduce this by doing "git reset --hard v2.6.25-rc1"
> > on a git checkout which is on some other revision. Git outputs strange
> > error messages (like file xxx is a directory when xxx really is a file)
> > and sometimes the filesystem "hangs" (I can no longer do any operations
> > on it even from another shell). If I reboot with a working kernel and
> > check the filesystem xfs_check reports many errors. I also see the
> > problem when doing other (not related to git) operations on the
> > filesystem. Git reset is just the easiest way to reproduce it.
> >
> > I was able to track this corruption down to commit
> > a69b176df246d59626e6a9c640b44c0921fa4566 ([XFS] Use the generic bitops
> > rather than implementing them ourselves.) using git bisect.
> >
> > Reverting edd319dc527733e61eec5bdc9ce20c94634b6482 ([XFS] Fix
> > xfs_lowbit64) to avoid merge conflicts and the faulty commit on top of
> > 2.6.25-rc3 fixes the problem.
>
> If you're feeling motivated, maybe you can narrow it down to which of
> the changes - xfs_highbit32, xfs_highbit64, xfs_lowbit32, or
> xfs_lowbit64 - is causing the problem? (or maybe they all are ...)
>
> Or maybe someone looking at the commit can immediately see the
> problem... but I can't :)
Well, IMO a reproducible filesystem corruption is a serious enough issue
for reverting all of the commits in question.
Thanks,
Rafael
Rafael J. Wysocki wrote:
> On Monday, 25 of February 2008, Eric Sandeen wrote:
>> If you're feeling motivated, maybe you can narrow it down to which of
>> the changes - xfs_highbit32, xfs_highbit64, xfs_lowbit32, or
>> xfs_lowbit64 - is causing the problem? (or maybe they all are ...)
>>
>> Or maybe someone looking at the commit can immediately see the
>> problem... but I can't :)
>
> Well, IMO a reproducible filesystem corruption is a serious enough issue
> for reverting all of the commits in question.
I'm not suggesting a partial revert; I just wonder which part of the
change is causing the problem, as part of the debugging process.
-Eric
On Tuesday, 26 of February 2008, Eric Sandeen wrote:
> Rafael J. Wysocki wrote:
> > On Monday, 25 of February 2008, Eric Sandeen wrote:
>
>
> >> If you're feeling motivated, maybe you can narrow it down to which of
> >> the changes - xfs_highbit32, xfs_highbit64, xfs_lowbit32, or
> >> xfs_lowbit64 - is causing the problem? (or maybe they all are ...)
> >>
> >> Or maybe someone looking at the commit can immediately see the
> >> problem... but I can't :)
> >
> > Well, IMO a reproducible filesystem corruption is a serious enough issue
> > for reverting all of the commits in question.
>
> I'm not suggesting a partial revert; I just wonder which part of the
> change is causing the problem, as part of the debugging process.
Understood.
My point is, if that's not practical (whatever the reason), I'd consider
reverting all of the commits in question.
Thanks,
Rafael
On Tue, Feb 26, 2008 at 12:52:56AM +0100, Rafael J. Wysocki wrote:
> > I'm not suggesting a partial revert; I just wonder which part of the
> > change is causing the problem, as part of the debugging process.
>
> Understood.
>
> My point is, if that's not practical (whatever the reason), I'd consider
> reverting all of the commits in question.
If you could revert all of them and verify it makes the problem go away
that would be a very good start already.
On Tuesday, 26 of February 2008, Christoph Hellwig wrote:
> On Tue, Feb 26, 2008 at 12:52:56AM +0100, Rafael J. Wysocki wrote:
> > > I'm not suggesting a partial revert; I just wonder which part of the
> > > change is causing the problem, as part of the debugging process.
> >
> > Understood.
> >
> > My point is, if that's not practical (whatever the reason), I'd consider
> > reverting all of the commits in question.
>
> If you could revert all of them and verify it makes the problem go away
> that would be a very good start already.
The original reporter (CC added) said exactly that, if I understood him
correctly:
http://lkml.org/lkml/2008/2/25/123
Thanks,
Rafael
On Tue, Feb 26, 2008 at 01:13:56AM +0100, Rafael J. Wysocki wrote:
> On Tuesday, 26 of February 2008, Christoph Hellwig wrote:
> > On Tue, Feb 26, 2008 at 12:52:56AM +0100, Rafael J. Wysocki wrote:
> > > > I'm not suggesting a partial revert; I just wonder which part of the
> > > > change is causing the problem, as part of the debugging process.
> > >
> > > Understood.
> > >
> > > My point is, if that's not practical (whatever the reason), I'd consider
> > > reverting all of the commits in question.
> >
> > If you could revert all of them and verify it makes the problem go away
> > that would be a very good start already.
>
> The original reporter (CC added) said exactly that, if I understood him
> correctly:
>
> http://lkml.org/lkml/2008/2/25/123
Sorry if I was not clear. The problematic commit after bisecting is
a69b176df246d59626e6a9c640b44c0921fa4566. Reverting this commit and
commit edd319dc527733e61eec5bdc9ce20c94634b6482 fixes the problem. So
all other commits in the XFS merge for 2.6.25 seem to be OK. I had to
revert the second commit only to avoid a merge conflict.
And I forget to mention on my first post: Please CC me on all replies.
I'm not subscribed to the lists.
Gaudenz
--
Ever tried. Ever failed. No matter.
Try again. Fail again. Fail better.
~ Samuel Beckett ~
On Tue, Feb 26, 2008 at 01:13:56AM +0100, Rafael J. Wysocki wrote:
> On Tuesday, 26 of February 2008, Christoph Hellwig wrote:
> > On Tue, Feb 26, 2008 at 12:52:56AM +0100, Rafael J. Wysocki wrote:
> > > > I'm not suggesting a partial revert; I just wonder which part of the
> > > > change is causing the problem, as part of the debugging process.
I debuged this a bit further by testing the 4 changed functions
individually. The problem only occurs with the new version of
xfs_lowbit64.
Gaudenz
--
Ever tried. Ever failed. No matter.
Try again. Fail again. Fail better.
~ Samuel Beckett ~
Gaudenz Steinlin wrote:
> On Tue, Feb 26, 2008 at 01:13:56AM +0100, Rafael J. Wysocki wrote:
>> On Tuesday, 26 of February 2008, Christoph Hellwig wrote:
>>> On Tue, Feb 26, 2008 at 12:52:56AM +0100, Rafael J. Wysocki wrote:
>>>>> I'm not suggesting a partial revert; I just wonder which part of the
>>>>> change is causing the problem, as part of the debugging process.
>
> I debuged this a bit further by testing the 4 changed functions
> individually. The problem only occurs with the new version of
> xfs_lowbit64.
FWIW, Dave & I did some testing/debugging on 32-bit powerpc, and it is
indeed only xfs_lowbit64 which is doing the wrong thing on that arch,
because generic find_next_bit is doing the wrong thing on big-endian
32-bit systems, for sizes > 32 bits, near as I can tell.
Rather than reverting it all, I think just changing xfs_lowbit64 back to:
int
xfs_lowbit64(
__uint64_t v)
{
__uint32_t w = (__uint32_t)v;
int n = 0;
if (w) { /* lower bits */
n = ffs(w);
} else { /* upper bits */
w = (__uint32_t)(v >> 32);
if (w && (n = ffs(w)))
n += 32;
}
return n - 1;
}
for now should fix it (this is essentially just ffs64())
-Eric
Eric Sandeen wrote:
> Gaudenz Steinlin wrote:
>> On Tue, Feb 26, 2008 at 01:13:56AM +0100, Rafael J. Wysocki wrote:
>>> On Tuesday, 26 of February 2008, Christoph Hellwig wrote:
>>>> On Tue, Feb 26, 2008 at 12:52:56AM +0100, Rafael J. Wysocki wrote:
>>>>>> I'm not suggesting a partial revert; I just wonder which part of the
>>>>>> change is causing the problem, as part of the debugging process.
>> I debuged this a bit further by testing the 4 changed functions
>> individually. The problem only occurs with the new version of
>> xfs_lowbit64.
>
> FWIW, Dave & I did some testing/debugging on 32-bit powerpc, and it is
> indeed only xfs_lowbit64 which is doing the wrong thing on that arch,
> because generic find_next_bit is doing the wrong thing on big-endian
> 32-bit systems, for sizes > 32 bits, near as I can tell.
>
> Rather than reverting it all, I think just changing xfs_lowbit64 back to:
> ...
Thanks Eric (and Dave), we'll look at your proposed fix. But for now,
the bit ops cleanup patch has been fully reverted, see Lachlan's earlier
mail and git pull request.
We're also expanding our internal h/w test matrix to include some
additional platforms to avoid this kind of thing in the future.
Cheers
-- Mark
> I debuged this a bit further by testing the 4 changed functions
> individually. The problem only occurs with the new version of
> xfs_lowbit64.
Eh, uh, of course. Now that I look at that code it becomes obvious.
find_first_bit() works on unsigned longs, not 64-bit quantities, so
find_first_bit((unsigned long *)&t, 64) isn't equivalent to finding the
lowest bit set in a 64-bit quantity.
Think of the memory layout of a 64-bit word:
LE: low 32 bits | high 32 bits
BE: high 32 bits | low 32 bits
Take a look at the start of include/asm-powerpc/bitops.h, and note how
bitops don't define the memory layout at all :)
So find_first_bit(&t, 64) on BE will give you the number of the first
bit of the 32-bit rotated quantity, ie. of ((t<<32) | (t>>32)).
The problem doesn't happen with highbit64 because fls64 was specifically
coded for this purpose.
You really need to keep xfs_lowbit64 defined as it was before, or, maybe
even better, define ffs64 in parallel to fls64.
johannes
Johannes Berg wrote:
>> I debuged this a bit further by testing the 4 changed functions
>> individually. The problem only occurs with the new version of
>> xfs_lowbit64.
...
> You really need to keep xfs_lowbit64 defined as it was before, or, maybe
> even better, define ffs64 in parallel to fls64.
Yep, I agree.
-Eric