2021-01-05 15:50:39

by Russell King (Oracle)

[permalink] [raw]
Subject: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

Hi,

This is an update on where I am with this long standing issue at the
current time.

Since 5.4, I have been struggling with several of my ARM64 systems, of
different SoC vendors and differing filesystem media, were sporadically
reporting inode checksum failures on their root filesystems. The time
taken to report this has been anything between a few hours and three
months of uptime, making the problem unrealistic to bisect.

The issue was first seen on my SolidRun Clearfog CX LX2160A based
system, but was also subsequently noticed on my Armada 8040 based
systems running kernels 5.4 and later. Kernel 5.2 has proven stable
with 566 days of uptime with no issue.

It has taken a long time to get debugging in place to see what is going
on - and this is currently detailed on the front page of
http://www.armlinux.org.uk right now, which has formed a blog of this problem
- since almost no one has taken any interest in it.

However, over the last couple of days, a way to reproduce it has been
found, at least for the LX2160A based system. Power down, leave the
machine powered off for some time. Power up, log in and run:

while :; do sleep 5; find /var /usr /bin /sbin -type f -print0 | \
xargs -0 md5sum >/dev/null; done

Within a few minutes it seems to have spat out an inode checksum
failure if the problem exists. However, testing for the problem _not_
existing is quite difficult - just because it doesn't appear in the
first few minutes does not mean it has been solved - see above where it
can take three months.

However, evidence is currently pointing towards commit 22ec71615d82
("arm64: io: Relax implicit barriers in default I/O accessors") having
revealed this problem. Will is very certain that this change is
correct, and we feel that it may have exposed some other issue in the
Aarch64 code.

Further attempts seem to suggest that the problem is specifically the
barrier in __iormb(). Leaving __iowmb() untouched, and changing the
barrier in __iormb() from dma_rmb() to rmb() _appears_ to result in the
problem disappearing. "Appears" is stressed because further testing is
needed - and that is probably going to take many months before we know
for certain.

However, this suggests that there is a memory ordering bug with aarch64
somewhere. Will can follow up with his own thoughts to this email.

We don't know if it is:
- the kernel.
- the Cortex A72.
- the Cache coherent interconnect.

I don't think it's the CCI, as I believe the Armada 8040 uses Marvell's
own IP for that based around Aurora 2 (the functional spec doesn't make
it clear.) Remember, I'm seeing this problem on both Armada 8040 and
LX2160A. We don't know of any known errata for the A72 in this area.
So, we're down to something in the kernel.

It is possible that it could be compiler related, but I don't see that;
if the "dmb oshld" were strong enough, then it should mean that the
subsequent reads to checksum the inode data after the inode data has
been DMA'd into memory should be reading the correct values from memory
already - but they aren't. And if changing "dmb oshld" to "dsb ld" means
that the code can then read the right values, that to me points fairly
definitively to a hardware problem.

Now, ext4fs is pretty good at checksumming the metadata in the
filesystem - each inode is individually checksummed with CRC32C and two
16-bit halves are stored in each inode. Directories are also
checksummed. ext4fs validates the inode checksum on every ext4_iget()
call. Do other filesystems do similar?

Anyway, here is the patch I'm currently running, which _seems_ so far
to be the minimal fix for my problems. Will thinks that this is hiding
the real problem by adding barriers, but I don't see there's much
choice but to apply this - I don't see what other debugging could be
done without the use of expensive hardware simulation, or detailed
hardware level tracing - the kind of which a silicon vendor or ARM Ltd
would have.

I'm at the end of what I can do with this; I'm going to keep this patch
in my kernel, since it fixes it for me.

Will would like a reliable reproducer - yes, that would be ideal, but
I'm afraid that's a mamoth task in itself. It's taken a year to find
this method of reproducing it.

There's also the matter that in one case I've seen, the ext4 checksum
has been wrong. The subsequent hexdump has been correct, and the post-
hexdump checksum recalculation has remained incorrect - and the same
value as the first incorrect checksum. However, the inode with the
_same_ checksum has subsequently validated correctly by the kernel and
by e2fsck. I can not explain this.

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index ff50dd731852..be63c47aecc4 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -95,7 +95,7 @@ static inline u64 __raw_readq(const volatile void __iomem *addr)
({ \
unsigned long tmp; \
\
- dma_rmb(); \
+ rmb(); \
\
/* \
* Create a dummy control dependency from the IO read to any \

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


2021-01-05 22:40:36

by Russell King (Oracle)

[permalink] [raw]
Subject: Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

On Tue, Jan 05, 2021 at 10:27:28AM -0800, Darrick J. Wong wrote:
> On Tue, Jan 05, 2021 at 03:47:26PM +0000, Russell King - ARM Linux admin wrote:
> > Hi,
> >
> > This is an update on where I am with this long standing issue at the
> > current time.
> >
> > Since 5.4, I have been struggling with several of my ARM64 systems, of
> > different SoC vendors and differing filesystem media, were sporadically
> > reporting inode checksum failures on their root filesystems. The time
> > taken to report this has been anything between a few hours and three
> > months of uptime, making the problem unrealistic to bisect.
>
> Aha, I was wondering what happened to this bug report. :)

Yes... it's continued causing me grief!

> > The issue was first seen on my SolidRun Clearfog CX LX2160A based
> > system, but was also subsequently noticed on my Armada 8040 based
> > systems running kernels 5.4 and later. Kernel 5.2 has proven stable
> > with 566 days of uptime with no issue.
> >
> > It has taken a long time to get debugging in place to see what is going
> > on - and this is currently detailed on the front page of
> > http://www.armlinux.org.uk right now, which has formed a blog of this problem
> > - since almost no one has taken any interest in it.
> >
> > However, over the last couple of days, a way to reproduce it has been
> > found, at least for the LX2160A based system. Power down, leave the
> > machine powered off for some time. Power up, log in and run:
> >
> > while :; do sleep 5; find /var /usr /bin /sbin -type f -print0 | \
> > xargs -0 md5sum >/dev/null; done
>
> Does that fill up the page cache enough to push memory reclaim?

No - I have around 30GB of memory available on the LX2160A system, and
around 3GB that the above is crunching through.

> > Within a few minutes it seems to have spat out an inode checksum
> > failure if the problem exists. However, testing for the problem _not_
> > existing is quite difficult - just because it doesn't appear in the
> > first few minutes does not mean it has been solved - see above where it
> > can take three months.
> >
> > However, evidence is currently pointing towards commit 22ec71615d82
> > ("arm64: io: Relax implicit barriers in default I/O accessors") having
> > revealed this problem. Will is very certain that this change is
> > correct, and we feel that it may have exposed some other issue in the
> > Aarch64 code.
> >
> > Further attempts seem to suggest that the problem is specifically the
> > barrier in __iormb(). Leaving __iowmb() untouched, and changing the
> > barrier in __iormb() from dma_rmb() to rmb() _appears_ to result in the
> > problem disappearing. "Appears" is stressed because further testing is
> > needed - and that is probably going to take many months before we know
> > for certain.
> >
> > However, this suggests that there is a memory ordering bug with aarch64
> > somewhere. Will can follow up with his own thoughts to this email.
> >
> > We don't know if it is:
> > - the kernel.
> > - the Cortex A72.
> > - the Cache coherent interconnect.
> >
> > I don't think it's the CCI, as I believe the Armada 8040 uses Marvell's
> > own IP for that based around Aurora 2 (the functional spec doesn't make
> > it clear.) Remember, I'm seeing this problem on both Armada 8040 and
> > LX2160A. We don't know of any known errata for the A72 in this area.
> > So, we're down to something in the kernel.
> >
> > It is possible that it could be compiler related, but I don't see that;
> > if the "dmb oshld" were strong enough, then it should mean that the
> > subsequent reads to checksum the inode data after the inode data has
> > been DMA'd into memory should be reading the correct values from memory
> > already - but they aren't. And if changing "dmb oshld" to "dsb ld" means
> > that the code can then read the right values, that to me points fairly
> > definitively to a hardware problem.
> >
> > Now, ext4fs is pretty good at checksumming the metadata in the
> > filesystem - each inode is individually checksummed with CRC32C and two
> > 16-bit halves are stored in each inode. Directories are also
> > checksummed. ext4fs validates the inode checksum on every ext4_iget()
> > call. Do other filesystems do similar?
>
> XFS and ext4 both validate the ondisk csum when constructing their
> incore inodes, and set them when flushing the incore inode back to disk.
>
> I vaguely wonder if there's something else going on here, like ... a
> background memory reclaim thread running on one cpu writes out an inode
> core with new checksum (because reading the file bumped the atime), and
> then another cpu comes along and has to reconstitute the (just
> reclaimed) incore inode, but for whatever reason doesn't get the version
> that the other cpu just wrote?
>
> That's like 130% speculative though, and note that I have no idea what
> the "outer shareable" domain[1] is.
>
> [1] https://developer.arm.com/docs/ddi0597/h/base-instructions-alphabetic-order/dmb-data-memory-barrier

In some of these cases I have observed, the inode appears not to have
changed for quite a long time - it seems to be "media read due to
iget -> checksum failure", although I can't say that definitively.

Given that most of the instances I've been able to provoke over the last
two days have been on the LX2160A, and have been a case of "power on
from cold, boot, login, run the above command, it fails on the first
iteration", it's unlikely that there's much in the way of inode
modification going on.

I've had a few instances now where, for example:

provided = 4b98d7c4 calculated = 5cec3c81
inode(ffffffa6d4dd9c00)
00000000: a4 81 00 00 41 0a 00 00 03 57 f3 5f 25 18 fd 5d
00000010: 70 57 fe 5a 00 00 00 00 00 00 01 00 08 00 00 00
00000020: 00 00 08 00 01 00 00 00 0a f3 01 00 04 00 00 00
00000030: 00 00 00 00 00 00 00 00 01 00 00 00 36 14 11 00
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000060: 00 00 00 00 1a f4 40 7c 00 00 00 00 00 00 00 00
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 c4 d7 00 00
00000080: 20 00 98 4b a8 c9 54 61 00 00 00 00 40 58 34 9f
00000090: 25 18 fd 5d a8 c9 54 61 00 00 00 00 00 00 00 00
000000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
recalculated = 5cec3c81
EXT4-fs error (device nvme0n1p2): ext4_lookup:1707: inode #270077: comm md5sum: iget: checksum invalid

The reason this is interesting is that the hexdump is _correct_ for
the stored checksum (which is 0x4b98d7c4 - stored at hex offsets +83,
+82, +7d, +7c). However, the initial calculation when validating the
inode was incorrect, and the recalculation after dumping the data was
_also_ incorrect.

As mentioned on my writeup on http://www.armlinux.org.uk (please read it,
there's lots of information there about the history of this and what
has been tried, including the patch for the above output, and I don't
want to keep having to repeat it in emails as different people ask
the same question!) I've already tried disabling the ARM64 accelerated
CRC32 code, and that has made no difference to this problem.

One thing I have recently been trying is to disable preemption within
ext4_inode_csum_verify() - that _seems_ to make it much harder to
reproduce the problem. I've managed it only once despite trying
multiple times. Unfortunately, the filesystem was not mounted
errors=remount-ro, so I couldn't verify the hexdump.

I'm also trying booting with maxcpus=2, so that the kernel is limited
to a single CPU cluster. So far, I haven't been able to provoke the
problem. However, this means absolutely nothing, as past experience
has shown it may take up to three months to occur.

> > Anyway, here is the patch I'm currently running, which _seems_ so far
> > to be the minimal fix for my problems. Will thinks that this is hiding
> > the real problem by adding barriers, but I don't see there's much
> > choice but to apply this - I don't see what other debugging could be
> > done without the use of expensive hardware simulation, or detailed
> > hardware level tracing - the kind of which a silicon vendor or ARM Ltd
> > would have.
>
> (FWIW I haven't seen checksum errors on xfs or ext4 on arm64, though
> most of my testing is relegated to beating on a raspberry pi very
> slowly...)

There's a thought that this problem could have something to do with
migrating across CPUs in different clusters (hence why trying
maxcpus=2). The LX2160A has 16 CPUs - 8 clusters of 2 CPUs. The Armada
8040 has 2 clusters of 2 CPUs. As I understand it, the bcm2711 has four
CPUs which aren't separated into clusters.

This may explain why, /if/ ext4_inode_csum_verify() migrating between
the clusters is important to trigger this, it hasn't been seen on any
Raspberry Pis (I'm also told that Raspbian defaults to 32-bit kernels
which may be another factor why there haven't been people reporting
this.)

> > I'm at the end of what I can do with this; I'm going to keep this patch
> > in my kernel, since it fixes it for me.
>
> Well if you've managed to hit this on multiple different machines after
> a long soak time, I wonder how many other people will trip over this too.
> It wouldn't be the first time a fs stunts performance to avoid
> corruption. ;)

Well, replacing the __iormb() barrier with a dsb appears to either solve
the problem or make it much less likely to happen - although Will Deacon
feels that this is plastering over the problem, and it will come back to
bite in the future.

However, I have put a kernel on one of the affected systems (which is a
long-running system) that revert's Will's 22ec71615d82 commit I mention
above. This /was/ the situation before 5.4-rc1, and it seems to allow
these platforms to work. If that proves to be stable in six months time,
and there has been no other progress on this bug, then as far as I'm
concerned, that is quite simply the required fix for this.

Quite simply, I've given up with Aarch64 over the last year over this;
it doesn't interest me at the moment - not even for development of
other stuff - as fundamentally I have had my trust in it totally
shattered. ARM32 is clearly the superior architecture! :)

> > Will would like a reliable reproducer - yes, that would be ideal, but
> > I'm afraid that's a mamoth task in itself. It's taken a year to find
> > this method of reproducing it.
> >
> > There's also the matter that in one case I've seen, the ext4 checksum
> > has been wrong. The subsequent hexdump has been correct, and the post-
> > hexdump checksum recalculation has remained incorrect - and the same
> > value as the first incorrect checksum. However, the inode with the
> > _same_ checksum has subsequently validated correctly by the kernel and
> > by e2fsck. I can not explain this.
>
> Strange. You're just using the same ext4_inode_csum() that everything
> else uses, right?

Yes, except for the patch to dump out a hexdump of the inode on
checksum failure and to then recalculate it.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

2021-01-06 17:22:52

by Will Deacon

[permalink] [raw]
Subject: Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

On Wed, Jan 06, 2021 at 01:52:53PM +0000, Russell King - ARM Linux admin wrote:
> On Wed, Jan 06, 2021 at 11:53:59AM +0000, Mark Rutland wrote:
> > ... and are you using defconfig or something else?
>
> Not sure I replied to this. I'm not using the defconfig, I've my own
> .config
>
> As I mentioned, Will has built a 5.10 kernel using Arnd's gcc 4.9.4
> and hasn't been able to reproduce it. He's sent me his kernel, which
> I've booted here, and haven't yet been able to provoke it.
>
> Meanwhile, my 5.9 kernel continues to exhibit this problem, so I've
> sent Will my .config (which I'll include here.) There are differences
> in some of the block layer configuration. There's differences in the
> errata configuration, but we don't think that's a cause (they're not
> relevant for Cortex A72).
>
> Our plan is:
> - Will is switching to 5.9, and using my config as a base for his
> platform.
> - Will is going to send me his modified version of my config.
> - We are both going to build using the same kernel sources and same
> config.
> - We are going to test our own kernels, and also swap kernel images
> and test each others.
>
> Watch this space for more news...

I've managed to reproduce the corruption on my AMD Seattle board (8x A57).
I haven't had a chance to dig deeper yet, but here's the recipe which works
for me:

1. I'm using GCC 4.9.4 simply to try to get as close as I can to rmk's
setup. I don't know if this is necessary or not, but the toolchain is
here:

https://kernel.org/pub/tools/crosstool/files/bin/arm64/4.9.4/arm64-gcc-4.9.4-nolibc-aarch64-linux-gnu.tar.xz

and I needed to pull down an old libmpfr to get cc1 to work:

http://ports.ubuntu.com/pool/main/m/mpfr4/libmpfr4_3.1.2-1_arm64.deb

2. I build a 5.9 kernel with the config here:

https://mirrors.edge.kernel.org/pub/linux/kernel/people/will/bugs/rmk/config-5.9.0

and the resulting Image is here:

https://mirrors.edge.kernel.org/pub/linux/kernel/people/will/bugs/rmk/Image-5.9.0

3. Using that kernel, I boot into a 64-bit Debian 10 filesystem and open a
couple of terminals over SSH.

4. In one terminal, I run:

$ while (true); do find /var /usr /bin /sbin -type f -print0 | xargs -0
md5sum > /dev/null; echo 2 | sudo tee /proc/sys/vm/drop_caches; done

(note that sudo will prompt you for a password on the first iteration)

5. In the other terminal, I run:

$ while (true); do ./hackbench ; sleep 1; done

where hackbench is built from:

https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

and compiled according to comment in the source code.

With that, I see the following after ten seconds or so:

EXT4-fs error (device sda2): ext4_lookup:1707: inode #674497: comm md5sum: iget: checksum invalid

Russell, Mark -- does this recipe explode reliably for you too?

Will

2021-01-07 21:50:22

by Arnd Bergmann

[permalink] [raw]
Subject: Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

On Thu, Jan 7, 2021 at 5:27 PM Theodore Ts'o <[email protected]> wrote:
>
> On Thu, Jan 07, 2021 at 01:37:47PM +0000, Russell King - ARM Linux admin wrote:
> > > The gcc bugzilla mentions backports into gcc-linaro, but I do not see
> > > them in my git history.
> >
> > So, do we raise the minimum gcc version for the kernel as a whole to 5.1
> > or just for aarch64?
>
> Russell, Arnd, thanks so much for tracking down the root cause of the
> bug!

There is one more thing that I wondered about when looking through
the ext4 code: Should it just call the crc32c_le() function directly
instead of going through the crypto layer? It seems that with Ard's
rework from 2018, that can just call the underlying architecture specific
implementation anyway.

> I will note that RHEL 7 uses gcc 4.8. I personally don't have an
> objections to requiring developers using RHEL 7 to have to install a
> more modern gcc (since I use Debian Testing and gcc 10.2.1, myself,
> and gcc 5.1 is so five years ago :-), but I could imagine that being
> considered inconvenient for some.

The main users of gcc-4.9 that I recall from previous discussions
were Android and Debian 8, but both of them are done now: Debian 8
has reached its end of life last summer, and Android uses clang
for building new kernels.

Arnd

2021-01-07 22:43:40

by Eric Biggers

[permalink] [raw]
Subject: Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

On Thu, Jan 07, 2021 at 10:14:46PM +0000, Russell King - ARM Linux admin wrote:
> On Thu, Jan 07, 2021 at 10:48:05PM +0100, Arnd Bergmann wrote:
> > On Thu, Jan 7, 2021 at 5:27 PM Theodore Ts'o <[email protected]> wrote:
> > >
> > > On Thu, Jan 07, 2021 at 01:37:47PM +0000, Russell King - ARM Linux admin wrote:
> > > > > The gcc bugzilla mentions backports into gcc-linaro, but I do not see
> > > > > them in my git history.
> > > >
> > > > So, do we raise the minimum gcc version for the kernel as a whole to 5.1
> > > > or just for aarch64?
> > >
> > > Russell, Arnd, thanks so much for tracking down the root cause of the
> > > bug!
> >
> > There is one more thing that I wondered about when looking through
> > the ext4 code: Should it just call the crc32c_le() function directly
> > instead of going through the crypto layer? It seems that with Ard's
> > rework from 2018, that can just call the underlying architecture specific
> > implementation anyway.
>
> Yes, I've been wondering about that too. To me, it looks like the
> ext4 code performs a layering violation by going "under the covers"
> - there are accessor functions to set the CRC and retrieve it. ext4
> instead just makes the assumption that the CRC value is stored after
> struct shash_desc. Especially as the crypto/crc32c code references
> the value using:
>
> struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
>
> Not even crypto drivers are allowed to assume that desc+1 is where
> the CRC is stored.

It violates how the shash API is meant to be used in general, but there is a
test that enforces that the shash_desc_ctx for crc32c must be just the single
u32 crc value. See alg_test_crc32c() in crypto/testmgr.c. So it's apparently
intended to work.

>
> However, struct shash_desc is already 128 bytes in size on aarch64,

Ard Biesheuvel recently sent a patch to reduce the alignment of struct
shash_desc to ARCH_SLAB_MINALIGN
(https://lkml.kernel.org/linux-crypto/[email protected]/),
since apparently most of the bloat is from alignment for DMA, which isn't
necessary. I think that reduces the size by a lot on arm64.

> and the proper way of doing it via SHASH_DESC_ON_STACK() is overkill,
> being strangely 2 * sizeof(struct shash_desc) + 360 (which looks like
> another bug to me!)

Are you referring to the '2 * sizeof(struct shash_desc)' rather than just
'sizeof(struct shash_desc)'? As mentioned in the comment above
HASH_MAX_DESCSIZE, there can be a nested shash_desc due to HMAC.
So I believe the value is correct.

> So, I agree with you wrt crc32c_le(), especially as it would be more
> efficient, and as the use of crc32c is already hard coded in the ext4
> code - not only with crypto_alloc_shash("crc32c", 0, 0) but also with
> the fixed-size structure in ext4_chksum().
>
> However, it's ultimately up to the ext4 maintainers to decide.

As I mentioned in my other response, crc32c_le() isn't a proper library API
(like some of the newer lib/crypto/ stuff) but rather just a wrapper for the
shash API, and it doesn't handle modules being dynamically loaded/unloaded.
So switching to it may cause a performance regression.

What I'd recommend is making crc32c_le() able to call architecture-speccific
implementations directly, similar to blake2s() and chacha20() in lib/crypto/.
Then there would be no concern about when modules get loaded, etc...

- Eric

2021-01-08 08:23:44

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

On Thu, 7 Jan 2021 at 23:42, Eric Biggers <[email protected]> wrote:
>
> On Thu, Jan 07, 2021 at 10:14:46PM +0000, Russell King - ARM Linux admin wrote:
> > On Thu, Jan 07, 2021 at 10:48:05PM +0100, Arnd Bergmann wrote:
> > > On Thu, Jan 7, 2021 at 5:27 PM Theodore Ts'o <[email protected]> wrote:
> > > >
> > > > On Thu, Jan 07, 2021 at 01:37:47PM +0000, Russell King - ARM Linux admin wrote:
> > > > > > The gcc bugzilla mentions backports into gcc-linaro, but I do not see
> > > > > > them in my git history.
> > > > >
> > > > > So, do we raise the minimum gcc version for the kernel as a whole to 5.1
> > > > > or just for aarch64?
> > > >
> > > > Russell, Arnd, thanks so much for tracking down the root cause of the
> > > > bug!
> > >
> > > There is one more thing that I wondered about when looking through
> > > the ext4 code: Should it just call the crc32c_le() function directly
> > > instead of going through the crypto layer? It seems that with Ard's
> > > rework from 2018, that can just call the underlying architecture specific
> > > implementation anyway.
> >
> > Yes, I've been wondering about that too. To me, it looks like the
> > ext4 code performs a layering violation by going "under the covers"
> > - there are accessor functions to set the CRC and retrieve it. ext4
> > instead just makes the assumption that the CRC value is stored after
> > struct shash_desc. Especially as the crypto/crc32c code references
> > the value using:
> >
> > struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
> >
> > Not even crypto drivers are allowed to assume that desc+1 is where
> > the CRC is stored.
>
> It violates how the shash API is meant to be used in general, but there is a
> test that enforces that the shash_desc_ctx for crc32c must be just the single
> u32 crc value. See alg_test_crc32c() in crypto/testmgr.c. So it's apparently
> intended to work.
>
> >
> > However, struct shash_desc is already 128 bytes in size on aarch64,
>
> Ard Biesheuvel recently sent a patch to reduce the alignment of struct
> shash_desc to ARCH_SLAB_MINALIGN
> (https://lkml.kernel.org/linux-crypto/[email protected]/),
> since apparently most of the bloat is from alignment for DMA, which isn't
> necessary. I think that reduces the size by a lot on arm64.
>
> > and the proper way of doing it via SHASH_DESC_ON_STACK() is overkill,
> > being strangely 2 * sizeof(struct shash_desc) + 360 (which looks like
> > another bug to me!)
>
> Are you referring to the '2 * sizeof(struct shash_desc)' rather than just
> 'sizeof(struct shash_desc)'? As mentioned in the comment above
> HASH_MAX_DESCSIZE, there can be a nested shash_desc due to HMAC.
> So I believe the value is correct.
>
> > So, I agree with you wrt crc32c_le(), especially as it would be more
> > efficient, and as the use of crc32c is already hard coded in the ext4
> > code - not only with crypto_alloc_shash("crc32c", 0, 0) but also with
> > the fixed-size structure in ext4_chksum().
> >
> > However, it's ultimately up to the ext4 maintainers to decide.
>
> As I mentioned in my other response, crc32c_le() isn't a proper library API
> (like some of the newer lib/crypto/ stuff) but rather just a wrapper for the
> shash API, and it doesn't handle modules being dynamically loaded/unloaded.
> So switching to it may cause a performance regression.
>
> What I'd recommend is making crc32c_le() able to call architecture-speccific
> implementations directly, similar to blake2s() and chacha20() in lib/crypto/.
> Then there would be no concern about when modules get loaded, etc...
>

I have looked into this in the past, both for crc32(c) and crc-t10dif,
both of which use a horrid method of wrapping a shash into a library
API. This was before we had static calls, though, and this work kind
of stalled on that. It should be straight-forward to redefine the
crc32c() library function as a static call, and have an optimized
module (or builtin) perform the [conditional] static call update at
module_init() time. The only missing piece is autoloading such
modules, which is tricky with softdeps if the dependency is from the
core kernel.

Currently, we have many users of crc32(c) in the kernel that go via
the shash (or synchronous ahash) layer to perform crc32c, all of which
would be better served by a library API, given that the hash type is a
compile time constant, and only synchronous calls are made.




drivers/infiniband/hw/i40iw/i40iw_utils.c: tfm =
crypto_alloc_shash("crc32c", 0, 0);
drivers/infiniband/sw/rxe/rxe_verbs.c: tfm = crypto_alloc_shash("crc32", 0, 0);
drivers/infiniband/sw/siw/siw_main.c: siw_crypto_shash =
crypto_alloc_shash("crc32c", 0, 0);
drivers/md/dm-crypt.c: tcw->crc32_tfm = crypto_alloc_shash("crc32", 0,
drivers/nvme/host/tcp.c: tfm = crypto_alloc_ahash("crc32c", 0,
CRYPTO_ALG_ASYNC);
drivers/nvme/target/tcp.c: tfm = crypto_alloc_ahash("crc32c", 0,
CRYPTO_ALG_ASYNC);
drivers/scsi/iscsi_tcp.c: tfm = crypto_alloc_ahash("crc32c", 0,
CRYPTO_ALG_ASYNC);
drivers/target/iscsi/iscsi_target_login.c: tfm =
crypto_alloc_ahash("crc32c", 0, CRYPTO_ALG_ASYNC);
fs/ext4/super.c: sbi->s_chksum_driver = crypto_alloc_shash("crc32c", 0, 0);
fs/f2fs/super.c: sbi->s_chksum_driver = crypto_alloc_shash("crc32", 0, 0);
fs/jbd2/journal.c: journal->j_chksum_driver =
crypto_alloc_shash("crc32c", 0, 0);
fs/jbd2/journal.c: journal->j_chksum_driver =
crypto_alloc_shash("crc32c", 0, 0);
lib/libcrc32c.c: tfm = crypto_alloc_shash("crc32c", 0, 0);

2021-01-08 09:25:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

On Thu, Jan 07, 2021 at 10:20:38PM +0100, Arnd Bergmann wrote:
> On Thu, Jan 7, 2021 at 2:37 PM Russell King - ARM Linux admin

> > So, do we raise the minimum gcc version for the kernel as a whole to 5.1
> > or just for aarch64?
>
> I'd personally love to see gcc-5 as the global minimum version, as that
> would let us finally use --std=gnu11 features instead of gnu89. [There are
> a couple of useful features that are incompatible with gnu89, and
> gnu99/gnu11 support in gcc didn't like the kernel sources]

+1 for raising the tree-wide minimum (again!). We actually have a bunch
of work-arounds for 4.9 bugs we can get rid of as well.

2021-01-08 09:28:43

by Will Deacon

[permalink] [raw]
Subject: Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

On Fri, Jan 08, 2021 at 10:21:54AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 07, 2021 at 10:20:38PM +0100, Arnd Bergmann wrote:
> > On Thu, Jan 7, 2021 at 2:37 PM Russell King - ARM Linux admin
>
> > > So, do we raise the minimum gcc version for the kernel as a whole to 5.1
> > > or just for aarch64?
> >
> > I'd personally love to see gcc-5 as the global minimum version, as that
> > would let us finally use --std=gnu11 features instead of gnu89. [There are
> > a couple of useful features that are incompatible with gnu89, and
> > gnu99/gnu11 support in gcc didn't like the kernel sources]
>
> +1 for raising the tree-wide minimum (again!). We actually have a bunch
> of work-arounds for 4.9 bugs we can get rid of as well.

We even just added another one for arm64 KVM! [1]

So yes, I'm in favour of leaving gcc 4.9 to rot as well, especially after
this ext4 debugging experience.

Will

[1] https://git.kernel.org/linus/9fd339a45be5

2021-01-12 13:24:26

by Lukas Wunner

[permalink] [raw]
Subject: Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

On Fri, Jan 08, 2021 at 12:02:53PM -0800, Linus Torvalds wrote:
> I appreciate Arnd pointing out "--std=gnu11", though. What are the
> actual relevant language improvements?
>
> Variable declarations in for-loops is the only one I can think of. I
> think that would clean up some code (and some macros), but might not
> be compelling on its own.

Anonymous structs/unions. I used to have a use case for that in
struct efi_dev_path in include/linux/efi.h, but Ard Biesheuvel
refactored it in a gnu89-compatible way for v5.7 with db8952e7094f.

[The above was copy-pasted from last time this discussion came up
in July 2020. Back then, Kirill Shutemov likewise mentioned the
local variables in loops feature:
https://lore.kernel.org/lkml/[email protected]/
]

Thanks,

Lukas