Hi,
I've hit a mainline gcc 10.2 (also gcc 9.3) bug which triggers at -O3
causing wrong codegen.
Config needs to have initramfs + gzip compressed.
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_KERNEL_GZIP=y
CONFIG_DECOMPRESS_GZIP=y
CONFIG_INITRAMFS_COMPRESSION_GZIP=y
lib/zlib_inflate/inffast.c
if (dist > 2) {
unsigned short *sfrom;
sfrom = (unsigned short *)(from);
loops = len >> 1;
do
*sout++ = *sfrom++;
^^^^^^ ^^^^^^^^
while (--loops);
out = (unsigned char *)sout;
from = (unsigned char *)sfrom;
}
...
The gist of issue is that despite use of unsigned short pointers, gcc is
generating wider load/stores (8-byte LDD/STD on arcv2 and 16-byte on
aarch64) causing extraneous bytes to copied into inflated gzip binaries
manifesting later as corrupted fragments in the binaries.
I've opened a gcc bug at:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363
The workaround is to build lib/zlib_inflate/inffast.c with -O2, although
I reckon not many arches build with -O3 as default. I'll be proposing an
ARC only patch to build this file with -O2, unless people think it needs
to be generalized.
Also problem originally seen on 5.6 kernel, although I confirm it shows
on latest mainline as well.
Unraveling this pretty fun, gory details for those interested at:
https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/issues/372
Thx,
-Vineet
On Fri, Apr 30, 2021 at 1:46 PM Vineet Gupta <[email protected]> wrote:
>
> I've hit a mainline gcc 10.2 (also gcc 9.3) bug which triggers at -O3
> causing wrong codegen.
I'd be more than happy to just disable CC_OPTIMIZE_FOR_PERFORMANCE_O3 entirely.
The advantages are very questionable - with a lot of the optimizations
at O3 being about loops, something which the kernel to a close
approximation doesn't have.
Most kernel loops are "count on one hand" iterations, and loop
optimizations generally just make things worse.
And we've had problems with -O3 before, because not only are the
optimizations a bit esoteric, they are often relatively untested. If
you look around at various projects (outside the kernel), -O2 is
generally the "default".
And that's entirely ignoring the gcc history - where -O3 has often
been very buggy indeed. It's gotten much better, but I just don't see
the upside of using -O3.
In fact, it looks like we already have that
depends on ARC
for -O3, exactly because nobody really wants to use this.
So this bug seems to be entirely ARC-specific, in that only ARC can
use -O3 for the kernel already.
Linus
On 4/30/21 3:06 PM, Linus Torvalds wrote:
> On Fri, Apr 30, 2021 at 1:46 PM Vineet Gupta <[email protected]> wrote:
>>
>> I've hit a mainline gcc 10.2 (also gcc 9.3) bug which triggers at -O3
>> causing wrong codegen.
>
> I'd be more than happy to just disable CC_OPTIMIZE_FOR_PERFORMANCE_O3 entirely.
>
> The advantages are very questionable - with a lot of the optimizations
> at O3 being about loops, something which the kernel to a close
> approximation doesn't have.
>
> Most kernel loops are "count on one hand" iterations, and loop
> optimizations generally just make things worse.
>
> And we've had problems with -O3 before, because not only are the
> optimizations a bit esoteric, they are often relatively untested. If
> you look around at various projects (outside the kernel), -O2 is
> generally the "default".
I agree that -O2 is default, but we've had -O3 default for ARC kernel
forever, since last decade seriously. The reason I turned it on back
then was upside of 10% performance improvement on select LMBench numbers
on hardware at the time which for a rookie kernel hacker was yay momemt.
I can revisit this and see if that is still true.
> And that's entirely ignoring the gcc history - where -O3 has often
> been very buggy indeed. It's gotten much better, but I just don't see
> the upside of using -O3.
>
> In fact, it looks like we already have that
>
> depends on ARC
>
> for -O3, exactly because nobody really wants to use this.
Either that or that people are not brave enough ;-) Perhaps gcc folks
would like me to retain this as a testing ground if nothing else.
> So this bug seems to be entirely ARC-specific, in that only ARC can
> use -O3 for the kernel already.
kid in me complaining "that's not fair !"
-Vineet
On Fri, Apr 30, 2021 at 3:44 PM Vineet Gupta <[email protected]> wrote:
>
> I agree that -O2 is default, but we've had -O3 default for ARC kernel
> forever, since last decade seriously. The reason I turned it on back
> then was upside of 10% performance improvement on select LMBench numbers
> on hardware at the time which for a rookie kernel hacker was yay momemt.
> I can revisit this and see if that is still true.
It would be interesting if you can actually show 10% improvement, and
also pinpoint things in the profile.
I (long long long ago) actually used -O6 for kernel builds as a "give
me everything you have" (I don't think gcc has actually ever done
anything more than O3, but whatever).
In fact, you can find
Q: What Does gcc -O6 Do?
in some kernel FAQ's from those historical times.
We eventually gave up on it, because it just generated bigger and
slower code, and people got very tired of all the compiler bugs.
Linus
Hello.
On Fri, Apr 30, 2021 at 03:06:30PM -0700, Linus Torvalds wrote:
> On Fri, Apr 30, 2021 at 1:46 PM Vineet Gupta <[email protected]> wrote:
> >
> > I've hit a mainline gcc 10.2 (also gcc 9.3) bug which triggers at -O3
> > causing wrong codegen.
>
> I'd be more than happy to just disable CC_OPTIMIZE_FOR_PERFORMANCE_O3 entirely.
FWIW, we used to find real bugs using -O3 in the past [1].
> The advantages are very questionable - with a lot of the optimizations
> at O3 being about loops, something which the kernel to a close
> approximation doesn't have.
>
> Most kernel loops are "count on one hand" iterations, and loop
> optimizations generally just make things worse.
>
> And we've had problems with -O3 before, because not only are the
> optimizations a bit esoteric, they are often relatively untested. If
> you look around at various projects (outside the kernel), -O2 is
> generally the "default".
>
> And that's entirely ignoring the gcc history - where -O3 has often
> been very buggy indeed. It's gotten much better, but I just don't see
> the upside of using -O3.
>
> In fact, it looks like we already have that
>
> depends on ARC
>
> for -O3, exactly because nobody really wants to use this.
>
> So this bug seems to be entirely ARC-specific, in that only ARC can
> use -O3 for the kernel already.
>
> Linus
[1] https://lore.kernel.org/lkml/[email protected]/
--
Oleksandr Natalenko (post-factum)
On Fri, Apr 30, 2021 at 1:46 PM Vineet Gupta <[email protected]> wrote:
>
> I've hit a mainline gcc 10.2 (also gcc 9.3) bug which triggers at -O3
> causing wrong codegen.
So it does seem to be a gcc bug or at least mis-feature where gcc gets
the aliasing decision wrong when vectorizing the code.
That said, the fact that gcc even tries to vectorize the code shows
how pointless it was for us to (years ago) try to make it use 16-bit
accesses in the first place.
So can you try this patch that basically reverts some of those
kernel-specific changes to zlib's inffast.c - and does so by just
upgrading it to a newer version from a more modern zlib (which in this
case still means "from 2017", but that's the most recent version there
is).
This is a fairly quick hack, and I really tried to keep it to just
inffast.c and inftrees.c with a few minimal fixups to inflate.c
itself.
Most of the changes are for naming (which seems to be at least partly
for C++ namespace reasons, ie "this" -> "here"), but there's some
cleanup wrt maximum table sizes etc too.
NOTE! I have not tested this patch in any other way than "it compiles
with allmodconfig". The diff looks reasonable to me, but that's all
I'm really going to say.
Does this avoid the gcc -O3 problem for you?
It would be lovely if somebody also took a look at some of the other
zlib code, like inflate.c itself. But some of that code has rather
subtle changes for things like s390 hardware support, and we have
thihngs like our fallthrough rules etc, so it gets a bit hairier.
Which is why I did just this fairly minimal part.
Note that the commit message has a "Not-yet-signed-off-by:" from me.
That's purely because I wanted it to be obvious that this needs a lot
more testing - I'm not comfy with this patch unless somebody takes it
upon themselves to actually test it under different loads (and
different architectures).
Linus
On 5/3/21 10:41 AM, Linus Torvalds wrote:
> On Fri, Apr 30, 2021 at 1:46 PM Vineet Gupta <[email protected]> wrote:
>> I've hit a mainline gcc 10.2 (also gcc 9.3) bug which triggers at -O3
>> causing wrong codegen.
> So it does seem to be a gcc bug or at least mis-feature where gcc gets
> the aliasing decision wrong when vectorizing the code.
>
> That said, the fact that gcc even tries to vectorize the code shows
> how pointless it was for us to (years ago) try to make it use 16-bit
> accesses in the first place.
>
> So can you try this patch that basically reverts some of those
> kernel-specific changes to zlib's inffast.c - and does so by just
> upgrading it to a newer version from a more modern zlib (which in this
> case still means "from 2017", but that's the most recent version there
> is).
>
> This is a fairly quick hack, and I really tried to keep it to just
> inffast.c and inftrees.c with a few minimal fixups to inflate.c
> itself.
>
> Most of the changes are for naming (which seems to be at least partly
> for C++ namespace reasons, ie "this" -> "here"), but there's some
> cleanup wrt maximum table sizes etc too.
>
> NOTE! I have not tested this patch in any other way than "it compiles
> with allmodconfig". The diff looks reasonable to me, but that's all
> I'm really going to say.
>
> Does this avoid the gcc -O3 problem for you?
Yes it does. I built the following:
2021-05-03 b93bedcf8fa4 Update zlib inffast code to more modern zlib
2021-02-26 ea680985468f ARC: fix CONFIG_HARDENED_USERCOPY
2021-04-23 f7118f8ada1b ARC: entry: fix off-by-one error in syscall
number validation
2021-04-21 1cb7eefda7ed ARC: kgdb: add 'fallthrough' to prevent a warning
2021-03-22 163630b2d95b arc: Fix typos/spellos
2021-04-11 d434405aaab7 Linux 5.12-rc7
And it seems to be DTRT
| Linux version 5.12.0-rc7-00005-gb93bedcf8fa4
(vineetg@vineetg-Latitude-7400) (arc-linux-gcc.br_real (Buildroot
2021.02-6-g5e29ba7bf732) 10.2.0, GNU ld (GNU Binutils) 2.36.1) #678
PREEMPT Mon May 3 11:29:32 PDT 2021
| Memory @ 80000000 [1024M]
| ...
| with environment:
| HOME=/
| TERM=linux
| Starting syslogd: OK
| Starting klogd: OK
| Running sysctl: OK
| $
| $ zcat /proc/config.gz | egrep "(OPTIM|COMPRESSION_GZIP)"
| CONFIG_INITRAMFS_COMPRESSION_GZIP=y
| # CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE is not set
| CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3=y
|
| $ free -m
| total used free shared buff/cache
available
| Mem: 1012 3 978 31 31 972
| Swap:
|
> It would be lovely if somebody also took a look at some of the other
> zlib code, like inflate.c itself. But some of that code has rather
> subtle changes for things like s390 hardware support, and we have
> thihngs like our fallthrough rules etc, so it gets a bit hairier.
I took a quick look, but there some to be subtle state machine changes
in inflate.c which I'm not comfortable mucking around with.
> Which is why I did just this fairly minimal part.
>
> Note that the commit message has a "Not-yet-signed-off-by:" from me.
> That's purely because I wanted it to be obvious that this needs a lot
> more testing - I'm not comfy with this patch unless somebody takes it
> upon themselves to actually test it under different loads (and
> different architectures).
Maybe give it some time to bake in linux-next for a 5.14 inclusion ?
Thx again for jumping in and steering gcc folks to right conclusions.
-Vineet
On Mon, May 03, 2021 at 10:41:45AM -0700, Linus Torvalds wrote:
> It would be lovely if somebody also took a look at some of the other
> zlib code, like inflate.c itself. But some of that code has rather
> subtle changes for things like s390 hardware support, and we have
> thihngs like our fallthrough rules etc, so it gets a bit hairier.
>
> Which is why I did just this fairly minimal part.
>
> Note that the commit message has a "Not-yet-signed-off-by:" from me.
> That's purely because I wanted it to be obvious that this needs a lot
> more testing - I'm not comfy with this patch unless somebody takes it
> upon themselves to actually test it under different loads (and
> different architectures).
The patch below is required on top of your patch to make it compile
for s390 as well.
Tested with kernel image decompression, and also btrfs with file
compression; both software and hardware compression.
Everything seems to work.
diff --git a/lib/zlib_dfltcc/dfltcc_inflate.c b/lib/zlib_dfltcc/dfltcc_inflate.c
index fb60b5a6a1cb..3bb3e431b79f 100644
--- a/lib/zlib_dfltcc/dfltcc_inflate.c
+++ b/lib/zlib_dfltcc/dfltcc_inflate.c
@@ -1,5 +1,7 @@
// SPDX-License-Identifier: Zlib
+#include <linux/zlib.h>
+#include "../zlib_inflate/inftrees.h"
#include "../zlib_inflate/inflate.h"
#include "dfltcc_util.h"
#include "dfltcc.h"
@@ -122,7 +124,7 @@ dfltcc_inflate_action dfltcc_inflate(
param->cvt = CVT_ADLER32;
param->sbb = state->bits;
param->hl = state->whave; /* Software and hardware history formats match */
- param->ho = (state->write - state->whave) & ((1 << HB_BITS) - 1);
+ param->ho = (state->wnext - state->whave) & ((1 << HB_BITS) - 1);
if (param->hl)
param->nt = 0; /* Honor history for the first block */
param->cv = state->check;
@@ -137,7 +139,7 @@ dfltcc_inflate_action dfltcc_inflate(
state->last = cc == DFLTCC_CC_OK;
state->bits = param->sbb;
state->whave = param->hl;
- state->write = (param->ho + param->hl) & ((1 << HB_BITS) - 1);
+ state->wnext = (param->ho + param->hl) & ((1 << HB_BITS) - 1);
state->check = param->cv;
if (cc == DFLTCC_CC_OP2_CORRUPT && param->oesc != 0) {
/* Report an error if stream is corrupted */
diff --git a/lib/zlib_inflate/inflate.h b/lib/zlib_inflate/inflate.h
index 6ece8efd879b..52314b8b28ea 100644
--- a/lib/zlib_inflate/inflate.h
+++ b/lib/zlib_inflate/inflate.h
@@ -1,3 +1,6 @@
+#ifndef __ZLIB_INFLATE_INFLATE_H
+#define __ZLIB_INFLATE_INFLATE_H
+
/* inflate.h -- internal inflate state definition
* Copyright (C) 1995-2016 Mark Adler
* For conditions of distribution and use, see copyright notice in zlib.h
@@ -123,3 +126,5 @@ struct inflate_state {
int back; /* bits back of last unprocessed length/lit */
unsigned was; /* initial length of match */
};
+
+#endif /* __ZLIB_INFLATE_INFLATE_H */
diff --git a/lib/zlib_inflate/inftrees.h b/lib/zlib_inflate/inftrees.h
index fe4307fcfbe3..39d5d90d1258 100644
--- a/lib/zlib_inflate/inftrees.h
+++ b/lib/zlib_inflate/inftrees.h
@@ -1,3 +1,6 @@
+#ifndef __ZLIB_INFLATE_INFTREES_H
+#define __ZLIB_INFLATE_INFTREES_H
+
/* inftrees.h -- header to use inftrees.c
* Copyright (C) 1995-2005, 2010 Mark Adler
* For conditions of distribution and use, see copyright notice in zlib.h
@@ -60,3 +63,5 @@ typedef enum {
int zlib_inflate_table (codetype type, unsigned short *lens,
unsigned codes, code **table,
unsigned *bits, unsigned short *work);
+
+#endif /* __ZLIB_INFLATE_INFTREES_H */