2010-05-11 09:27:25

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

Hi folks !

So I've been adapting upstream to my little D-Link DNS323 NAS box, which
is based on a Marvell 88f5182 chipset (i.e., the Orion5 family), and started
hitting a problem with both ext3 and ext4 (on top of md/raid1).

First, the setup: this is a VIVT-cache ARM CPU, so it -could- be some
cache issue, though that's a bit weird, as I would expect metadata to be
in normal kernel pages and thus not hit cache aliases. Then again, I'm no
specialist in how we deal with those critters.

Linux version 2.6.34-rc7-00019-g6c4f192-dirty (benh@pasglop) (gcc version 4.4.0 (GCC) ) #6 PREEMPT Tue May 11 17:02:04 EST 2010
CPU: Feroceon [41069260] revision 0 (ARMv5TEJ), cr=a0053177
CPU: VIVT data cache, VIVT instruction cache

There are two 1TB disks on the built-in SATA chipset, with an almost-1TB
partition on each (sda1 and sdb1), which are set up as a raid1 md.

Then I create a filesystem (I started with ext4 and mkfs'ed it back to ext3
after I started having problems, but the corruption persists).

I'm booted off an nfs root and basically the problem happens when I rsync over
a pre-made root to the disks.

On a freshly mkfs'ed ext3 or 4, mounted in /mnt/raid, I rsync the thing over,
use it a bit, and generally observe the first symptoms in the form of a few

EXT3-fs (md0): warning: ext3_rename: Deleting old file (34103297), 2, error=-2
EXT3-fs (md0): warning: ext3_rename: Deleting old file (34113314), 2, error=-2
EXT3-fs (md0): warning: ext3_rename: Deleting old file (34127945), 2, error=-2

in my kernel log. I had a very similar message with ext4 iirc.

I unmount the filesystem and fsck it (it takes almost 1h) and I then get
a bunch of:

Problem in HTREE directory inode 5005385: node (12) has bad max hash
Problem in HTREE directory inode 5005385: node (13) has bad max hash
Problem in HTREE directory inode 5005385: node (14) has bad max hash
Invalid HTREE directory inode 5005385 (/raid-foo/var/lib/dpkg/info). Clear HTree index<y>? yes

followed by a bunch of

Inode 4981405 ref count is 1, should be 2. Fix<y>? yes

This is reasonably reproducible; I have re-done it twice and get similar
errors.

Tomorrow, if time permits, I'll see if I can reproduce on a smaller partition
without md, and eventually narrow it down to a smaller and more predictable
set of operations.

Since I doubt ext3 is busted so dramatically in mainline for "normal" machines,
I tend to suspect things could be related to the infamous VIVT caches. On the
other hand, it's pretty clearly metadata or journal corruption, and I'm not
sure we ever do things that could cause aliases (such as vmap etc.) on
these things, and they shouldn't be mapped into userspace... unless it's fsck
itself that causes aliases to occur at the block device level? (I do unmount,
though, before I run fsck.)

On the other hand, it could also be a busticated marvell SATA driver :-)

I have no problem with the vendor kernel, but it's ancient (2.6.12) and based
on an out-of-tree variant of a Marvell-originated BSP, so everything is
completely different, especially in the area of drivers for the chipset.

Anyways, I'll see if I can gather more data tomorrow as time, viruses and sick
kids permit.

In the meantime, any hint appreciated.

Cheers,
Ben.




2010-05-11 10:16:47

by Jamie Lokier

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

Benjamin Herrenschmidt wrote:
> On the other hand, it could also be a busticated marvell SATA driver :-)

I've seen similar ext3 corruption errors with IDE on a Sigma
Designs EM8620 with an ARM core. It's completely unrelated :-) But the
fault was in the IDE driver doing cache flushes too late after
starting to set up DMA. To diagnose it, we luckily found a
reproducible test case, then sprinkled full cache flushes at various
places to narrow it down, and eventually found that one full cache
flush in the driver itself sorted it out, which told us where the
fault was. From there it was easy.

-- Jamie

2010-05-11 10:47:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Tue, 2010-05-11 at 11:16 +0100, Jamie Lokier wrote:
>
> I've seen similar ext3 corruption errors with an IDE on a Sigma
> Designs EM8620 with ARM core. It's completely unrelated :-) But the
> fault was in the IDE driver doing cache flushes too late after
> starting to set up DMA. To diagnose it, we luckily found a
> reproducible test case, then sprinkled full cache flushes at various
> places to narrow it down, and eventually found that one full cache
> flush in the driver itself sorted it out, which told us where the
> fault was. From there it was easy.

Ok, well, that's -a- possibility... we'll see. I'm doing some more tests
to try to characterize the problem a bit better, then we can play with
the driver. I think it's also worth trying with earlier kernels in case
it's some kind of regression, though that means I'll have to backport my
board support for that little box :-)

Cheers,
Ben.

2010-05-11 10:57:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Tue, 2010-05-11 at 19:23 +1000, Benjamin Herrenschmidt wrote:

> Since I doubt ext3 is busted so dramatically in mainline for "normal" machines,
> I tend to suspect things could be related to the infamous vivt caches. On the
> other hand, it's pretty clearly metadata or journal corruption and I'm not
> sure we ever do things that could cause aliases (such as vmap etc..) on
> these things, and they shouldn't be mapped into userspace... unless it's fsck
> itself that causes aliases to occur at the block device level ? (I do unmount
> though before I run fsck).
>
> On the other hand, it could also be a busticated marvell SATA driver :-)
>
> I have no problem with the vendor kernel, but it's ancient (2.6.12) and based
> on an out of tree variant of a Marvell originated BSP, so everything is
> completely different, especially in the area of drivers for the chipset.
>
> Anyways, I'll see if I can gather more data tomorrow as time, viruses and sick
> kids permits.
>
> In the meantime, any hint appreciated.

Another quick test, which brings more info, using a smaller (about 5GB)
partition and no md or raid involved:

- Boot with NFS root
- mkfs /dev/sdb2 (no md or raid involved)
- mount /dev/sdb2 /mnt/test
- rsync -avx /test-stuff /mnt/test
- cd /mnt/test
- md5sum -c ~/test-stuff-sums.txt

That gives me a whole bunch of:

md5sum: ./usr/bin/debconf-escape: No such file or directory
./usr/bin/debconf-escape: FAILED open or read
./usr/bin/stat: OK
md5sum: ./usr/bin/chrt: No such file or directory
./usr/bin/chrt: FAILED open or read

In fact, if I do ls /mnt/test/usr/bin/ I see debconf but if I do
ls /mnt/test/usr/bin/chrt then I get No such file or directory.

So something is badly wrong :-)

Now, trying without the dir_index feature (mkfs.ext3 -O ^dir_index),
it works fine. All my md5sums are correct and fsck passes.

So there's what looks like a problem specific to HTrees. I don't think
it's a SATA driver problem (it doesn't smell like one, but we can't
completely dismiss the possibility yet). It could be a VIVT issue, but
then why? I don't see ext3 playing with virtual mappings, and none of
that should alias with userspace...

Or is it incorrectly accessing pages while they are being DMA'ed to or
from? I.e., accessing pages with the CPU between dma_map_* and
dma_unmap_*? That would break on a number of setups, including swiotlb
on x86, so I tend to doubt it, but who knows...

Anyways, enough for tonight.

Cheers,
Ben.

2010-05-11 11:15:18

by Santosh Shilimkar

[permalink] [raw]
Subject: RE: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

> -----Original Message-----
> From: [email protected] [mailto:linux-arm-kernel-
> [email protected]] On Behalf Of Benjamin Herrenschmidt
> Sent: Tuesday, May 11, 2010 4:28 PM
> To: [email protected]
> Cc: Nicolas Pitre; Saeed Bishara; James E.J. Bottomley; Andrew Morton; [email protected];
> [email protected]
> Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)
>
> On Tue, 2010-05-11 at 19:23 +1000, Benjamin Herrenschmidt wrote:
>
> > Since I doubt ext3 is busted so dramatically in mainline for "normal" machines,
> > I tend to suspect things could be related to the infamous vivt caches. On the
> > other hand, it's pretty clearly metadata or journal corruption and I'm not
> > sure we ever do things that could cause aliases (such as vmap etc..) on
> > these things, and they shouldn't be mapped into userspace... unless it's fsck
> > itself that causes aliases to occur at the block device level ? (I do unmount
> > though before I run fsck).
> >
> > On the other hand, it could also be a busticated marvell SATA driver :-)
> >
> > I have no problem with the vendor kernel, but it's ancient (2.6.12) and based
> > on an out of tree variant of a Marvell originated BSP, so everything is
> > completely different, especially in the area of drivers for the chipset.
> >
> > Anyways, I'll see if I can gather more data tomorrow as time, viruses and sick
> > kids permits.
> >
> > In the meantime, any hint appreciated.
>
> A quick other test which brings more infos, using a smaller (about 5GB)
> partition and no md or raid involved:
>
> - Boot with NFS root
> - mkfs /dev/sdb2 (no md or raid involved)
> - mount /dev/sdb2 /mnt/test
> - rsync -avx /test-stuff /mnt/test
> - cd /mnt/test
> - md5sum -c ~/test-stuff-sums.txt
>
> That gives me a whole bunch of:
>
> md5sum: ./usr/bin/debconf-escape: No such file or directory
> ./usr/bin/debconf-escape: FAILED open or read
> ./usr/bin/stat: OK
> md5sum: ./usr/bin/chrt: No such file or directory
> ./usr/bin/chrt: FAILED open or read
>
> In fact, if I do ls /mnt/test/usr/bin/ I see debconf but if I do
> ls /mnt/test/usr/bin/chrt then I get No such file or directory.
>
> So something is badly wrong :-)
>
> Now, trying without the dir_index feature (mkfs.ext3 -O ^dir_index)
> and it works fine. All my md5sum's are correct and fsck passes.
>
> So there's what looks like a problem specific to htree's. I don't think
> it's a SATA driver problem (doesn't smell like it but we can't
> completely dismiss the possibility yet). Could be a VIVT issue but then
> why ? I don't see ext3 playing with virtual mappings and none of that
> should alias with userspace...
>
> Or is it incorrectly accessing pages while they are DMA'ed to or from ?
> IE. Accessing with the CPU pages between dma_map_* and dma_unmap_* ?
> That will break on a number of setups including swiotlb on x86 so I tend
> to doubt it but who knows...

Not sure if it is related, but we saw a similar issue with the MMC driver
on a Cortex-A9 MP core. I was seeing similarly weird issues with ext3,
while everything used to be fine with ext2.

All the errors you mentioned used to come up, and at times the whole FS
used to get corrupted.

It was root-caused to the MMC driver, which uses a DMA engine for TX/RX.
The issue was mainly with writes, which is why ext3, which does
journaling and hence more writes, was having issues.
There was a memory write barrier missing before the DMA descriptors
were handed over to the DMA controller.
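The ordering fix Santosh describes can be sketched in plain C. This is a minimal, single-threaded illustration only: the descriptor layout, field names and DESC_OWNED_BY_HW flag are made up, and wmb() is stubbed as a compiler barrier rather than the kernel's real one.

```c
#include <stdint.h>

/* Illustrative TX descriptor; the layout and names are invented. */
struct tx_desc {
	uint32_t addr;		/* DMA address of the data buffer */
	uint16_t len;		/* length of the transfer */
	uint16_t flags;		/* ownership bit the engine polls */
};

#define DESC_OWNED_BY_HW 0x8000

/* Stand-in for the kernel's wmb(); a compiler barrier is enough for
 * this single-threaded demonstration. */
#define wmb() __asm__ __volatile__("" ::: "memory")

/* Fully populate the descriptor, then issue a write barrier, and only
 * then flip the ownership bit.  Without the barrier, a weakly ordered
 * CPU may let the ownership store become visible before the body, and
 * the DMA engine would pick up a half-written descriptor. */
void hand_off_to_dma(struct tx_desc *d, uint32_t buf_addr, uint16_t len)
{
	d->addr = buf_addr;
	d->len = len;
	wmb();				/* descriptor body visible first... */
	d->flags = DESC_OWNED_BY_HW;	/* ...then the ownership transfer */
}
```

The same ordering bug would be invisible on a strongly ordered machine, which fits Santosh's observation that it only bit on the Cortex-A9 MP core.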

Regards,
Santosh


2010-05-12 15:00:58

by Jan Kara

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

> On Tue, 2010-05-11 at 19:23 +1000, Benjamin Herrenschmidt wrote:
>
> > Since I doubt ext3 is busted so dramatically in mainline for "normal" machines,
> > I tend to suspect things could be related to the infamous vivt caches. On the
> > other hand, it's pretty clearly metadata or journal corruption and I'm not
> > sure we ever do things that could cause aliases (such as vmap etc..) on
> > these things, and they shouldn't be mapped into userspace... unless it's fsck
> > itself that causes aliases to occur at the block device level ? (I do unmount
> > though before I run fsck).
> >
> > On the other hand, it could also be a busticated marvell SATA driver :-)
> >
> > I have no problem with the vendor kernel, but it's ancient (2.6.12) and based
> > on an out of tree variant of a Marvell originated BSP, so everything is
> > completely different, especially in the area of drivers for the chipset.
> >
> > Anyways, I'll see if I can gather more data tomorrow as time, viruses and sick
> > kids permits.
> >
> > In the meantime, any hint appreciated.
>
> A quick other test which brings more infos, using a smaller (about 5GB)
> partition and no md or raid involved:
>
> - Boot with NFS root
> - mkfs /dev/sdb2 (no md or raid involved)
> - mount /dev/sdb2 /mnt/test
> - rsync -avx /test-stuff /mnt/test
> - cd /mnt/test
> - md5sum -c ~/test-stuff-sums.txt
>
> That gives me a whole bunch of:
>
> md5sum: ./usr/bin/debconf-escape: No such file or directory
> ./usr/bin/debconf-escape: FAILED open or read
> ./usr/bin/stat: OK
> md5sum: ./usr/bin/chrt: No such file or directory
> ./usr/bin/chrt: FAILED open or read
Could you get the filesystem image with "e2image -r /dev/sdb2 buggy-image",
bzip2 it and make it available somewhere? Maybe I could guess something
from the way the filesystem gets corrupted.
Oh, and also overwrite the partition with zeros before calling mkfs, to make
the analysis simpler.

> In fact, if I do ls /mnt/test/usr/bin/ I see debconf but if I do
> ls /mnt/test/usr/bin/chrt then I get No such file or directory.
>
> So something is badly wrong :-)
>
> Now, trying without the dir_index feature (mkfs.ext3 -O ^dir_index)
> and it works fine. All my md5sum's are correct and fsck passes.
Funny. Not sure how that could happen...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2010-05-12 22:17:43

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)


> Could you get the filesystem image with: e2image -r /dev/sdb2 buggy-image
> bzip2 it and make it available somewhere? Maybe I could guess something
> from the way the filesystem gets corrupted.
> Oh, and also overwrite the partition with zeros before calling mkfs to make
> the analysis simpler.

Will do asap.

> > In fact, if I do ls /mnt/test/usr/bin/ I see debconf but if I do
> > ls /mnt/test/usr/bin/chrt then I get No such file or directory.
> >
> > So something is badly wrong :-)
> >
> > Now, trying without the dir_index feature (mkfs.ext3 -O ^dir_index)
> > and it works fine. All my md5sum's are correct and fsck passes.
> Funny. Not sure how that could happen...

Yeah, strange. I looked at the code a bit and I don't see anything
obvious, ext3 seems to be using the same standard buffer head access
methods for the htree as for the rest of the metadata, and I see no
fancy playing with virtual addresses that could explain a VIVT problem.

It could be an issue with the SATA controller that gets more easily
triggered by the access patterns caused by the htree, though that's a
bit strange.

I'll let you know when I have something to look at.

Cheers,
Ben.



2010-05-12 22:22:17

by Jamie Lokier

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

Shilimkar, Santosh wrote:
> There was a memory write barrier missing before the DMA descriptors
> are handed over to DMA controller.

On that note, are the cache flush functions implicit memory barriers?

-- Jamie

2010-05-12 22:47:11

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Wed, 2010-05-12 at 23:21 +0100, Jamie Lokier wrote:
> Shilimkar, Santosh wrote:
> > There was a memory write barrier missing before the DMA descriptors
> > are handed over to DMA controller.
>
> On that note, are the cache flush functions implicit memory barriers?

(Adding Fujita on CC)

That's a very good question. The generic inline implementation of
dma_sync_* is:

static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
					   size_t size,
					   enum dma_data_direction dir)
{
	struct dma_map_ops *ops = get_dma_ops(dev);

	BUG_ON(!valid_dma_direction(dir));
	if (ops->sync_single_for_cpu)
		ops->sync_single_for_cpu(dev, addr, size, dir);
	debug_dma_sync_single_for_cpu(dev, addr, size, dir);
}

Which means that for coherent architectures that do not implement
the ops->sync_* hooks, we are probably missing a barrier here...

Thus if the above is expected to be a memory barrier, it's broken on
cache coherent powerpc for example. On non-coherent powerpc, we do cache
flushes and those are implicit barriers.

Now, in the case at hand, which is my ARM-based NAS, I believe this
is non-cache-coherent and thus uses cache flush ops. I don't know ARM
well enough, but I would expect these to be implicit barriers. Russell?
Nico?

IE. You may have found a bug here though I don't know whether it's the
bug we are hitting right now :-)

Cheers,
Ben.

2010-05-12 23:08:41

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Thu, May 13, 2010 at 08:47:11AM +1000, Benjamin Herrenschmidt wrote:
> Now, in the case at hand, which is my ARM based NAS, I believe this
> is non cache-coherent and thus uses cache flush ops. I don't know ARM
> well enough but I would expect these to be implicit barriers. Russell ?
> Nico ?

ARMv5 doesn't have a weak memory ordering model, and doesn't have any
memory barrier instructions.

2010-05-12 23:41:51

by James Bottomley

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Thu, 2010-05-13 at 08:47 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2010-05-12 at 23:21 +0100, Jamie Lokier wrote:
> > Shilimkar, Santosh wrote:
> > > There was a memory write barrier missing before the DMA descriptors
> > > are handed over to DMA controller.
> >
> > On that note, are the cache flush functions implicit memory barriers?

Not exactly ... they *should* be stream ordered with respect to accesses
to the memory they're flushing (which isn't the same thing, and no-one
ever went broke overestimating the stupidity of chip designers, but if a
flush instruction needs explicit ordering, I'd expect that to be built
into the arch layer).

> (Adding Fujita on CC)
>
> That's a very good question. The generic inline implementation of
> dma_sync_* is:
>
> static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
> 					   size_t size,
> 					   enum dma_data_direction dir)
> {
> 	struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> 	if (ops->sync_single_for_cpu)
> 		ops->sync_single_for_cpu(dev, addr, size, dir);
> 	debug_dma_sync_single_for_cpu(dev, addr, size, dir);
> }
>
> Which means that for coherent architectures that do not implement
> the ops->sync_* hooks, we are probably missing a barrier here...
>
> Thus if the above is expected to be a memory barrier, it's broken on
> cache coherent powerpc for example. On non-coherent powerpc, we do cache
> flushes and those are implicit barriers.

Can you explain this a little more. On a cache coherent machine, the
sync is a nop ... why would you want a nop to be any type of barrier?

James



2010-05-13 00:17:13

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Wed, 2010-05-12 at 17:00 +0200, Jan Kara wrote:

> Could you get the filesystem image with: e2image -r /dev/sdb2 buggy-image
> bzip2 it and make it available somewhere? Maybe I could guess something
> from the way the filesystem gets corrupted.
> Oh, and also overwrite the partition with zeros before calling mkfs to make
> the analysis simpler.

Here we go.

Created a smaller partition (100MB). dd'ed /dev/zero to it, dd'ed it off
to NFS host to locally cmp it with /dev/zero to make sure it's nice and
clean. It is. Then mkfs.ext3, mount, rsync over /usr/lib of my test
root, unmount, and dd it off to the NFS host again.

The result shows the same errors with fsck.ext3 -n -f ./test.img

I uploaded it at:

http://gate.crashing.org/~benh/test.img.bz2

Cheers,
Ben.

> > In fact, if I do ls /mnt/test/usr/bin/ I see debconf but if I do
> > ls /mnt/test/usr/bin/chrt then I get No such file or directory.
> >
> > So something is badly wrong :-)
> >
> > Now, trying without the dir_index feature (mkfs.ext3 -O ^dir_index)
> > and it works fine. All my md5sum's are correct and fsck passes.
> Funny. Not sure how that could happen...
>
> Honza



2010-05-13 00:18:10

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Wed, 2010-05-12 at 18:41 -0500, James Bottomley wrote:
> > Which means that for coherent architectures that do not implement
> > the ops->sync_* hooks, we are probably missing a barrier here...
> >
> > Thus if the above is expected to be a memory barrier, it's broken on
> > cache coherent powerpc for example. On non-coherent powerpc, we do
> cache
> > flushes and those are implicit barriers.
>
> Can you explain this a little more. On a cache coherent machine, the
> sync is a nop ... why would you want a nop to be any type of barrier?

Well if the driver can peek at the data after the sync, and have any
kind of ordering guarantee that it doesn't get stale data (the load
isn't prefetched or speculated early), that would require an mb() or at
least rmb().

It would seem sensible for drivers to assume that something like
dma_cache_sync_for_cpu() thus has the semantics of an rmb() at least,
no ?

Cheers,
Ben.

2010-05-13 03:13:18

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Thu, 13 May 2010 08:47:11 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

> On Wed, 2010-05-12 at 23:21 +0100, Jamie Lokier wrote:
> > Shilimkar, Santosh wrote:
> > > There was a memory write barrier missing before the DMA descriptors
> > > are handed over to DMA controller.
> >
> > On that note, are the cache flush functions implicit memory barriers?
>
> (Adding Fujita on CC)
>
> That's a very good question. The generic inline implementation of
> dma_sync_* is:
>
> static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
> 					   size_t size,
> 					   enum dma_data_direction dir)
> {
> 	struct dma_map_ops *ops = get_dma_ops(dev);
>
> 	BUG_ON(!valid_dma_direction(dir));
> 	if (ops->sync_single_for_cpu)
> 		ops->sync_single_for_cpu(dev, addr, size, dir);
> 	debug_dma_sync_single_for_cpu(dev, addr, size, dir);
> }
>
> Which means that for coherent architectures that do not implement
> the ops->sync_* hooks, we are probably missing a barrier here...
>
> Thus if the above is expected to be a memory barrier, it's broken on
> cache coherent powerpc for example. On non-coherent powerpc, we do cache
> flushes and those are implicit barriers.

X86 OOSTORE uses a memory barrier in dma_sync_single_for_device (it seems
some MIPS archs also use one, along with cache operations).

I think the DMA-API says that:

- dma_sync_single_for_device() makes sure the data is ready for DMA.

- dma_sync_single_for_cpu() makes sure that drivers don't get stale
data after DMA.

I guess this means that if an architecture needs a memory barrier (not
only cache operations) to guarantee the above, the architecture needs to
take care of it.
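In sketch form, the ownership discipline that summary implies looks like this. The sync calls below are no-op stand-ins for the kernel's streaming-DMA API, present only so the pattern compiles and runs; in a real driver they would perform whatever cache operations and barriers the architecture requires.

```c
#include <stddef.h>
#include <stdint.h>

enum dma_data_direction { DMA_FROM_DEVICE, DMA_TO_DEVICE };

/* No-op stand-ins for the kernel's dma_sync_single_for_cpu() and
 * dma_sync_single_for_device(); on a non-coherent arch these would
 * invalidate/flush caches and, per Fujita's reading, supply any
 * barrier needed to make the guarantees hold. */
static void dma_sync_single_for_cpu(void *buf, size_t len,
				    enum dma_data_direction dir)
{
	(void)buf; (void)len; (void)dir;
}

static void dma_sync_single_for_device(void *buf, size_t len,
				       enum dma_data_direction dir)
{
	(void)buf; (void)len; (void)dir;
}

/* The CPU may only touch a streaming buffer between sync_for_cpu and
 * sync_for_device; reading it outside this window risks stale data. */
uint8_t read_first_byte_after_dma(uint8_t *buf, size_t len)
{
	dma_sync_single_for_cpu(buf, len, DMA_FROM_DEVICE);
	uint8_t b = buf[0];	/* safe: the CPU owns the buffer here */
	dma_sync_single_for_device(buf, len, DMA_FROM_DEVICE);
	return b;		/* after this, the device owns it again */
}
```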

2010-05-13 04:42:31

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Thu, 2010-05-13 at 12:12 +0900, FUJITA Tomonori wrote:
> X86 OOSTORE uses a memory barrier in dma_sync_single_for_device (it seems
> some MIPS archs also use one, along with cache operations).
>
> I think the DMA-API says that:
>
> - dma_sync_single_for_device() makes sure the data is ready for DMA.
>
> - dma_sync_single_for_cpu() makes sure that drivers don't get stale
> data after DMA.
>
> I guess this means that if an architecture needs a memory barrier (not
> only cache operations) to guarantee the above, the architecture needs
> to take care of it.

Right, but doing an indirect function pointer call for a memory barrier
in what can be a pretty hot code path doesn't sound like a great idea
if it can be avoided. I'd rather we define some inline that the arch
can provide if it wishes.

Cheers,
Ben.

2010-05-13 15:12:58

by Jan Kara

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Thu 13-05-10 10:15:14, Benjamin Herrenschmidt wrote:
> On Wed, 2010-05-12 at 17:00 +0200, Jan Kara wrote:
>
> > Could you get the filesystem image with: e2image -r /dev/sdb2 buggy-image
> > bzip2 it and make it available somewhere? Maybe I could guess something
> > from the way the filesystem gets corrupted.
> > Oh, and also overwrite the partition with zeros before calling mkfs to make
> > the analysis simpler.
>
> Here we go.
>
> Created a smaller partition (100MB). dd'ed /dev/zero to it, dd'ed it off
> to NFS host to locally cmp it with /dev/zero to make sure it's nice and
> clean. It is. Then mkfs.ext3, mount, rsync over /usr/lib of my test
> root, unmount, and dd it off to the NFS host again.
>
> The result shows the same errors with fsck.ext3 -n -f ./test.img
>
> I uploaded it at:
>
> http://gate.crashing.org/~benh/test.img.bz2
Thanks. I had a look at it. It seems that it's really the HTREE code
that's causing problems. When a directory is using HTREEs, it should contain
directory entries in some hash range in each block. Say entries with hash
0-0xabcd1234 in block 0, entries with hash 0xabcd1235-0xffffffff in block
1. But the corrupted directories do not obey this rule. In particular e.g.
block 1 should contain entries with hashes 0-0x2fac6780 but it contains
lots of entries with hash larger than 0x2fac6780. I've spent quite some
time looking into the code and I'm not yet sure how that could happen.
My first suspect would be the code splitting blocks when a block gets full.
The code allocates a new block for the directory, then inside the buffer
creates an array of

struct dx_map_entry
{
	u32 hash;
	u16 offs;
	u16 size;
};

entries, one describing each directory entry in the block being split.
This array is then sorted, we find a place to split, copy the appropriate
entries to the new block, and compact the old block.
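The split path Jan describes can be sketched in userspace C. This is a deliberately simplified model, not the actual fs/ext3/namei.c code: the real dx_sort_map() uses its own sort with a hand-rolled swap() macro, and the real split point is chosen by accumulated entry size rather than a plain half-way cut.

```c
#include <stdint.h>
#include <stdlib.h>

/* Mirror of the on-stack map entry ext3 builds while splitting a
 * directory block (u32/u16 rendered as stdint types). */
struct dx_map_entry {
	uint32_t hash;		/* hash of the dirent's name */
	uint16_t offs;		/* offset of the dirent inside the block */
	uint16_t size;		/* length of the dirent */
};

/* qsort comparator by hash -- stands in for ext3's dx_sort_map(). */
static int cmp_hash(const void *a, const void *b)
{
	const struct dx_map_entry *x = a, *y = b;
	return (x->hash > y->hash) - (x->hash < y->hash);
}

/* Sort the map and pick a split point: entries [0, split) stay in the
 * old block, [split, count) move to the new one.  A simple half-way
 * split here; ext3 actually splits by accumulated size. */
size_t sort_and_split(struct dx_map_entry *map, size_t count)
{
	qsort(map, count, sizeof(*map), cmp_hash);
	return count / 2;
}
```

The key invariant is that after the move, every entry with a hash below the split value lives in the old block and the rest in the new one; duplicate 'offs' values break exactly this step.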
In the binary dump of one of corrupted directories, I've found the built
array:
000e70 00 00 00 00 00 00 00 00 16 bd f8 01 f0 03 10 00
000e80 7a b1 0d 05 e0 03 10 00 2e 19 bb 08 cc 03 14 00
000e90 60 bd bf 09 e0 03 10 00 a6 79 5d 0a b0 03 0c 00
000ea0 08 5a 1b 0b a0 03 10 00 ac b6 15 0d a0 03 10 00
000eb0 d8 3e cc 0f 84 03 10 00 8c 0f f1 27 74 03 10 00
000ec0 24 09 85 2f 60 03 14 00 b8 61 c3 35 60 03 14 00
000ed0 c4 27 62 37 60 03 14 00 7c d9 07 39 60 03 14 00
000ee0 80 d6 70 3a 60 03 14 00 28 49 0e 3b 94 03 0c 00
000ef0 fe c6 81 3f 94 03 0c 00 78 b8 1e 40 b0 03 0c 00
000f00 2a 3e 04 42 b0 03 0c 00 82 0a 81 46 b0 03 0c 00
000f10 bc 2a 87 47 b0 03 0c 00 ea 9e f3 47 b0 03 0c 00
000f20 20 ac 28 48 bc 02 18 00 ac 20 f2 4c 5c 02 18 00
000f30 62 10 1d 54 bc 02 18 00 5e f9 f6 57 b0 03 0c 00
000f40 78 d4 58 5e 5c 02 18 00 da ab 8f 62 5c 02 18 00
000f50 b0 15 98 68 e4 01 18 00 2e a4 15 79 5c 02 18 00
000f60 b2 d9 16 79 bc 02 18 00 3e eb fe 7a a0 01 18 00
000f70 40 fa c2 81 a0 01 18 00 a0 c3 79 84 a0 01 18 00
000f80 3a cc a8 85 a0 01 18 00 aa 10 53 87 44 01 18 00
000f90 78 33 b0 8a 44 01 18 00 f4 da a1 8b 8c 02 18 00
000fa0 3e 6a d6 8c 10 02 1c 00 2a 27 17 8e 44 01 18 00
000fb0 c0 4a 91 90 10 02 1c 00 b2 61 6c 9a 70 01 14 00
000fc0 e0 ef eb a3 8c 02 18 00 c2 e3 00 a5 8c 02 18 00
000fd0 c0 5d 6e b6 5c 02 18 00 a8 1c 6e b7 5c 02 18 00
000fe0 fc 45 ea ba 40 00 14 00 52 d5 21 be a0 01 18 00
000ff0 4e 66 55 d1 a0 01 18 00 b0 35 b4 d3 8c 02 18 00
<next directory block starts>

If you look at the array in more detail, you'll notice that the 'offs' part
of the structure is sometimes identical. That should never happen, because
'offs' contains the offset of the corresponding directory entry in the block.
So when offsets are identical in this array, the subsequent move will copy
some entries several times and leave entries that should have been moved
sitting in the old block, resulting in the corruption we see.
The question is, how could the offsets be the same? dx_make_map seems to get
it right, and dx_sort_map as well. Maybe I'd peek into the disassembly of
dx_sort_map to see whether the swap() macro does what it should... If that
looks OK, you could try adding some debug checks to dx_sort_map and try
to catch the moment when duplicate offsets are created...
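A debug check along the lines Jan suggests might look like this. The helper is hypothetical, not existing ext3 code: after dx_sort_map() runs, every 'offs' in the map must be distinct, since each must name a different dirent in the block.

```c
#include <stddef.h>
#include <stdint.h>

/* Same shape as the map entry ext3 builds during a block split. */
struct dx_map_entry {
	uint32_t hash;
	uint16_t offs;
	uint16_t size;
};

/* Hypothetical sanity check: returns 1 if no two entries share an
 * offset, 0 if a duplicate exists.  O(n^2) is fine for a debug-only
 * check over the few dozen entries of one directory block. */
int dx_map_offsets_unique(const struct dx_map_entry *map, size_t count)
{
	for (size_t i = 0; i < count; i++)
		for (size_t j = i + 1; j < count; j++)
			if (map[i].offs == map[j].offs)
				return 0;	/* duplicate: corruption ahead */
	return 1;
}
```

In-kernel, a WARN_ON(!dx_map_offsets_unique(...)) right after the sort would catch the moment the duplicates appear, whether they come from a swap() miscompile or from bad input to dx_make_map.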

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-05-13 15:39:57

by James Bottomley

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Thu, 2010-05-13 at 10:18 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2010-05-12 at 18:41 -0500, James Bottomley wrote:
> > > Which means that for coherent architectures that do not implement
> > > the ops->sync_* hooks, we are probably missing a barrier here...
> > >
> > > Thus if the above is expected to be a memory barrier, it's broken on
> > > cache coherent powerpc for example. On non-coherent powerpc, we do
> > cache
> > > flushes and those are implicit barriers.
> >
> > Can you explain this a little more. On a cache coherent machine, the
> > sync is a nop ... why would you want a nop to be any type of barrier?
>
> Well if the driver can peek at the data after the sync, and have any
> kind of ordering guarantee that it doesn't get stale data (the load
> isn't prefetched or speculated early), that would require an mb() or at
> least rmb().

So the guarantee that it doesn't look at stale data after the sync on a
cache coherent machine means ordering the dma write to physical memory
with the subsequent cpu read ... no memory barrier can actually do that.
Usually this is done externally, by making sure the memory change is
visible before sending the irq that tells the driver it is there ... on
some numa systems, this can be a problem (hence the mmiowb/relaxed read
thing).

> It would seem sensible for drivers to assume that something like
> dma_cache_sync_for_cpu() thus has the semantics of an rmb() at least,
> no ?

I still don't see why ... I don't see how you'd ever get a read of the
area speculated before the event that tells the driver its OK to read
the memory. In theory, I agree that it looks logical to require the
read never be speculated before the sync ... but in practice, I don't
see there ever being a problem with this since the sync isn't the event
that says the memory is safe to read.

James



2010-05-13 21:35:36

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Thu, 2010-05-13 at 17:12 +0200, Jan Kara wrote:

> If you look at the array more in detail, you'll notice that 'offs' part of
> structure is sometimes identical. That should never happen because 'offs'
> contains offset of the corresponding directory entry in a block. So when
> offsets are identical in this array, subsequent move will copy some entries
> several times and leave entries that should be moved in the old block,
> resulting in a corruption we see.
> The question is, how could offsets be the same? dx_make_map seems to get
> it right and dx_sort_map as well. Maybe I'd peek into disassembly of
> dx_sort_map to see whether swap() macro does what it should... If that
> looks OK, you could try adding some debug checks into dx_sort_map and try
> to catch the moment when duplicate offsets are created...

Thanks. I'll have a look. It could be a miscompile indeed. My cross-gcc
is a 4.4.0, maybe that's the root of the problem.

Cheers,
Ben.



2010-05-13 23:54:54

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Thu, 2010-05-13 at 17:12 +0200, Jan Kara wrote:

> If you look at the array more in detail, you'll notice that 'offs' part of
> structure is sometimes identical. That should never happen because 'offs'
> contains offset of the corresponding directory entry in a block. So when
> offsets are identical in this array, subsequent move will copy some entries
> several times and leave entries that should be moved in the old block,
> resulting in a corruption we see.
> The question is, how could offsets be the same? dx_make_map seems to get
> it right and dx_sort_map as well. Maybe I'd peek into disassembly of
> dx_sort_map to see whether swap() macro does what it should... If that
> looks OK, you could try adding some debug checks into dx_sort_map and try
> to catch the moment when duplicate offsets are created...

Ok so a very quick test with another compiler and the problem -appears-
gone. I'll investigate more in depth later tonight or this week-end.

Looks like it may all have been a false alarm, just don't use
gcc-4.4.0 :-)

Cheers,
Ben.


2010-05-13 23:53:50

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)


> > Well if the driver can peek at the data after the sync, and have any
> > kind of ordering guarantee that it doesn't get stale data (the load
> > isn't prefetched or speculated early), that would require an mb() or at
> > least rmb().
>
> So the guarantee that it doesn't look at stale data after the sync on a
> cache coherent machine means ordering the dma write to physical memory
> with the subsequent cpu read ... no memory barrier can actually do that.
> Usually this is done externally, by making sure the memory change is
> visible before sending the irq that tells the driver it is there ... on
> some numa systems, this can be a problem (hence the mmiowb/relaxed read
> thing).

Right. I was more thinking about something along the lines of the device
writes to some descriptor and then queues it up in a list. So we have
two different DMA data structures with a data dependency.

The question is whether the driver can expect to just do the sync, read
the queue, then sync again, then read the descriptor, or does it also
need an explicit rmb in between ?

Let's assume that the ordering is maintained at the DMA -> coherency
domain level; the question is whether the sync op is guaranteed to toss
prefetched / speculative loads (i.e. guarantee the order of the two loads
done by the CPU), or whether we need an explicit rmb.

> > It would seem sensible for drivers to assume that something like
> > dma_cache_sync_for_cpu() thus has the semantics of an rmb() at least,
> > no ?
>
> I still don't see why ... I don't see how you'd ever get a read of the
> area speculated before the event that tells the driver it's OK to read
> the memory. In theory, I agree that it looks logical to require that the
> read never be speculated before the sync ... but in practice, I don't
> see there ever being a problem with this, since the sync isn't the event
> that says the memory is safe to read.

Cheers,
Ben.

> James
>

2010-05-14 17:42:40

by Jamie Lokier

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

Russell King - ARM Linux wrote:
> On Thu, May 13, 2010 at 08:47:11AM +1000, Benjamin Herrenschmidt wrote:
> > Now, in the case at hand, which is my ARM based NAS, I believe this
> > is non cache-coherent and thus uses cache flush ops. I don't know ARM
> > well enough but I would expect these to be implicit barriers. Russell ?
> > Nico ?
>
> ARMv5 doesn't have a weak memory ordering model, and doesn't have any
> memory barrier instructions.

It does have buffered writes, doesn't it? Are they always flushed by
the cache flush ops?

-- Jamie

2010-05-14 17:59:27

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: Rampant ext3/4 corruption on 2.6.34-rc7 with VIVT ARM (Marvell 88f5182)

On Fri, May 14, 2010 at 06:41:53PM +0100, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > On Thu, May 13, 2010 at 08:47:11AM +1000, Benjamin Herrenschmidt wrote:
> > > Now, in the case at hand, which is my ARM based NAS, I believe this
> > > is non cache-coherent and thus uses cache flush ops. I don't know ARM
> > > well enough but I would expect these to be implicit barriers. Russell ?
> > > Nico ?
> >
> > ARMv5 doesn't have a weak memory ordering model, and doesn't have any
> > memory barrier instructions.
>
> It does have buffered writes, doesn't it? Are they always flushed by
> the cache flush ops?

Surprisingly, for DMA it's something we've always done. Odd that.