2021-09-06 15:03:32

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Sun, Sep 05, 2021 at 11:24:05AM -0700, Linus Torvalds wrote:
> ... but make it a config option so that broken environments can disable
> it when required.
>
> We really should always have a clean build, and will disable specific
> over-eager warnings as required, if we can't fix them. But while I
> fairly religiously enforce that in my own tree, it doesn't get enforced
> by various build robots that don't necessarily report warnings.
>
> So this just makes '-Werror' a default compiler flag, but allows people
> to disable it for their configuration if they have some particular
> issues.
>
> Occasionally, new compiler versions end up enabling new warnings, and it
> can take a while before we have them fixed (or the warnings disabled if
> that is what it takes), so the config option allows for that situation.
>
> Hopefully this will mean that I get fewer pull requests that have new
> warnings that were not noticed by various automation we have in place.
>
> Knock wood.
>

I guess the good news is that some builds still pass.

Build results:
total: 153 pass: 89 fail: 64
Failed builds:
alpha:defconfig
alpha:allmodconfig
arcv2:defconfig
arcv2:axs103_defconfig
arcv2:vdk_hs38_smp_defconfig
arm:s3c2410_defconfig
arm:ixp4xx_defconfig
arm:omap1_defconfig
arm:footbridge_defconfig
arm:keystone_defconfig
arm:vexpress_defconfig
arm:imx_v4_v5_defconfig
arm:s3c6400_defconfig
arm:s5pv210_defconfig
arm:integrator_defconfig
arm:pxa910_defconfig
arm:clps711x_defconfig
csky:defconfig
h8300:edosk2674_defconfig
h8300:h8300h-sim_defconfig
h8300:h8s-sim_defconfig
hexagon:defconfig
i386:allyesconfig
i386:allmodconfig
ia64:defconfig
m68k:defconfig
m68k:allmodconfig
m68k:sun3_defconfig
m68k_nommu:m5272c3_defconfig
m68k_nommu:m5307c3_defconfig
m68k_nommu:m5249evb_defconfig
m68k_nommu:m5407c3_defconfig
m68k_nommu:m5475evb_defconfig
microblaze:mmu_defconfig
mips:allmodconfig
mips:bcm63xx_defconfig
mips:e55_defconfig
mips:malta_defconfig
nds32:defconfig
nds32:allmodconfig
nios2:3c120_defconfig
parisc:allmodconfig
parisc:generic-32bit_defconfig
parisc64:generic-64bit_defconfig
powerpc:allmodconfig
powerpc:cell_defconfig
powerpc:maple_defconfig
powerpc:ppc6xx_defconfig
powerpc:mpc83xx_defconfig
powerpc:tqm8xx_defconfig
powerpc:83xx/mpc834x_mds_defconfig
riscv32:allmodconfig
riscv:allmodconfig
s390:allmodconfig
sh:defconfig
sh:dreamcast_defconfig
sh:microdev_defconfig
sh:shx3_defconfig
sparc32:defconfig
sparc64:allmodconfig
sparc64:defconfig
um:defconfig
xtensa:defconfig
xtensa:allmodconfig
Qemu test results:
total: 479 pass: 340 fail: 139
Failed tests:
<many>

Guenter


2021-09-06 16:14:59

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Mon, Sep 6, 2021 at 7:26 AM Guenter Roeck <[email protected]> wrote:
>
> Build results:
> total: 153 pass: 89 fail: 64

Well, that sadly proves the point of that patch. x86-64 may be clean,
because I have required it manually. Others not necessarily so much..

I've got at least one sparc64 fix in my inbox. It _might_ fix some
other cases too (syscall checking), but I suspect it's one of those
"death by a thousand cuts" situations, not just one or two issues that
show up.

Do you end up exposing the errors anywhere where I can take a look?

If some of them are just because of bad tooling on certain
architectures (ie fundamentally "this is unfixable, because we use
gcc-XYZ that just always causes warnings") then those we could/should
just disable -Werror for those and forget about them.

But hopefully most cases are just "people haven't cared enough" and
easily fixed.

Linus

2021-09-06 16:50:52

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Mon, Sep 6, 2021 at 9:12 AM Linus Torvalds
<[email protected]> wrote:
>
> I've got at least one sparc64 fix in my inbox. It _might_ fix some
> other cases too (syscall checking), but I suspect it's one of those
> "death by a thousand cuts" situations, not just one or two issues that
> show up.

I pushed out that "don't make the syscall checking produce errors from
warnings" patch by Stephen Rothwell.

One down, N to go.

Linus

2021-09-06 17:21:45

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/6/21 9:12 AM, Linus Torvalds wrote:
> On Mon, Sep 6, 2021 at 7:26 AM Guenter Roeck <[email protected]> wrote:
>>
>> Build results:
>> total: 153 pass: 89 fail: 64
>
> Well, that sadly proves the point of that patch. x86-64 may be clean,
> because I have required it manually. Others not necessarily so much..
>
> I've got at least one sparc64 fix in my inbox. It _might_ fix some
> other cases too (syscall checking), but I suspect it's one of those
> "death by a thousand cuts" situations, not just one or two issues that
> show up.
>
> Do you end up exposing the errors anywhere where I can take a look?
>

Logs are available from KernelCI.
See https://linux.kernelci.org/job/mainline/
I expect that 0-day will also have a field day.

> If some of them are just because of bad tooling on certain
> architectures (ie fundamentally "this is unfixable, because we use
> gcc-XYZ that just always causes warnings") then those we could/should
> just disable -Werror for those and forget about them.
>
> But hopefully most cases are just "people haven't cared enough" and
> easily fixed.
>

We'll see. For my testbed I disabled the new configuration flag
for the time being because its primary focus is boot tests, and
there won't be any boot tests if images fail to build.

Guenter

2021-09-06 23:09:21

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

[ Adding some subsystem maintainers ]

On Mon, Sep 6, 2021 at 10:06 AM Guenter Roeck <[email protected]> wrote:
>
> > But hopefully most cases are just "people haven't cared enough" and
> > easily fixed.
>
> We'll see. For my testbed I disabled the new configuration flag
> for the time being because its primary focus is boot tests, and
> there won't be any boot tests if images fail to build.

Sure, reasonable.

I've checked a few of the build errors by doing the appropriate cross
compiles, and it doesn't seem bad - but it does seem like we have a
number of really pointless long-standing warnings that should have
been fixed long ago.

For example, looking at sparc64, there are several build errors due to
those warnings now being fatal:

- drivers/gpu/drm/ttm/ttm_pool.c:386

This is a type mismatch error. It looks like __fls() on sparc64
returns 'int'. And the ttm_pool.c code assumes it returns 'unsigned
long'.

Oddly enough, the very line after that line does "min_t(unsigned
int" to get the types in line.

So the immediate reason is "sparc64 is different". But the deeper
reason seems to be that ttm_pool.c has odd type assumptions. But that
warning should have been fixed long ago, either way.

Christian/Huang? I get the feeling that both lines in that file
should use the min_t(). Hmm?

- drivers/input/joystick/analog.c:160

#warning Precise timer not defined for this architecture.

Unfortunate. I suspect that warning just has to be removed. It has
never caused anything to be fixed, it's old to the point of predating
the git history. Dmitry?

- at least a couple of stringop-overread errors. Attached is a
possible fix for one of them.

The stringop overread is odd, because another one of them is

fs/qnx4/dir.c: In function ‘qnx4_readdir’:
fs/qnx4/dir.c:51:32: error: ‘strnlen’ specified bound 48 exceeds source size 16 [-Werror=stringop-overread]
   51 |                         size = strnlen(de->di_fname, size);
      |                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~

but I'm not seeing why that one happens on sparc64, but not on arm64
or x86-64. There doesn't seem to be anything architecture-specific
anywhere in that area.

Funky.

Davem - attached patch compiles cleanly for me, but I'm not sure it's
necessarily the right thing to do, and I didn't check the code
generation. Maybe it screws up. Can somebody test on sparc64 and
perhaps think about it more than I did?

Linus


Attachments:
patch.diff (741.00 B)
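
For reference, the min()/min_t() mismatch from the ttm_pool.c item in the message above can be illustrated with a small user-space sketch. The macros demo_min() and demo_min_t() are simplified stand-ins with hypothetical names, modeled on the pointer-comparison type check that produces "comparison of distinct pointer types lacks a cast"; they are not the kernel's actual definitions. demo_fls() merely imitates an __fls() that returns 'int', and demo_order() is an assumed caller. The sketch relies on GNU C extensions (typeof and statement expressions).

```c
#include <limits.h>

/*
 * Simplified stand-in for a type-checked min(): comparing the addresses
 * of two temporaries makes the compiler warn when x and y differ in type.
 */
#define demo_min(x, y) ({				\
	__typeof__(x) _x = (x);				\
	__typeof__(y) _y = (y);				\
	(void)(&_x == &_y);				\
	_x < _y ? _x : _y; })

/* Stand-in for min_t(): convert both operands to one explicit type first. */
#define demo_min_t(type, x, y) ({			\
	type _tx = (type)(x);				\
	type _ty = (type)(y);				\
	_tx < _ty ? _tx : _ty; })

/*
 * Hypothetical helper imitating an __fls() that returns int:
 * index of the highest set bit. The argument must be non-zero.
 */
static int demo_fls(unsigned long word)
{
	return (int)(sizeof(word) * CHAR_BIT) - 1 - __builtin_clzl(word);
}

/* Clamp an allocation order to the highest set bit of num_pages. */
static unsigned int demo_order(unsigned long num_pages, unsigned int max_order)
{
	/*
	 * demo_min(max_order, demo_fls(num_pages)) would warn here because
	 * 'unsigned int' and 'int' are distinct types; demo_min_t() avoids
	 * that by stating the comparison type explicitly.
	 */
	return demo_min_t(unsigned int, max_order, demo_fls(num_pages));
}
```

Using the explicit-type form on both lines keeps the code independent of whatever type a given architecture's __fls() happens to return.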

2021-09-06 23:52:29

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Mon, Sep 06, 2021 at 04:06:04PM -0700, Linus Torvalds wrote:
> [ Adding some subsystem maintainers ]
>
> On Mon, Sep 6, 2021 at 10:06 AM Guenter Roeck <[email protected]> wrote:
> >
> > > But hopefully most cases are just "people haven't cared enough" and
> > > easily fixed.
> >
> > We'll see. For my testbed I disabled the new configuration flag
> > for the time being because its primary focus is boot tests, and
> > there won't be any boot tests if images fail to build.
>
> Sure, reasonable.
>
> I've checked a few of the build errors by doing the appropriate cross
> compiles, and it doesn't seem bad - but it does seem like we have a
> number of really pointless long-standing warnings that should have
> been fixed long ago.
>
> For example, looking at sparc64, there are several build errors due to
> those warnings now being fatal:
>
> - drivers/gpu/drm/ttm/ttm_pool.c:386
>
> This is a type mismatch error. It looks like __fls() on sparc64
> returns 'int'. And the ttm_pool.c code assumes it returns 'unsigned
> long'.
>
> Oddly enough, the very line after that line does "min_t(unsigned
> int" to get the types in line.
>
> So the immediate reason is "sparc64 is different". But the deeper
> reason seems to be that ttm_pool.c has odd type assumptions. But that
> warning should have been fixed long ago, either way.
>
> Christian/Huang? I get the feeling that both lines in that file
> should use the min_t(). Hmm?
>
> - drivers/input/joystick/analog.c:160
>
> #warning Precise timer not defined for this architecture.
>
> Unfortunate. I suspect that warning just has to be removed. It has
> never caused anything to be fixed, it's old to the point of predating
> the git history. Dmitry?
>
My solution would be to just remove the old code (that isn't using ktime)
including the module parameter that disables it. Sure, we want to be
backward compatible, but that code is 15+ years old and should really be
retired.

> - at least a couple of stringop-overread errors. Attached is a
> possible fix for one of them.
>
> The stringop overread is odd, because another one of them is
>
> fs/qnx4/dir.c: In function ‘qnx4_readdir’:
> fs/qnx4/dir.c:51:32: error: ‘strnlen’ specified bound 48 exceeds source size 16 [-Werror=stringop-overread]
>    51 |                         size = strnlen(de->di_fname, size);
>       |                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> but I'm not seeing why that one happens on sparc64, but not on arm64
> or x86-64. There doesn't seem to be anything architecture-specific
> anywhere in that area.
>
> Funky.
>
Not really. That is because de->di_fname is always 16 bytes but size
can be 48 if the node is really a link. The use of de is overloaded
in that case; de is struct qnx4_inode_entry (where di_fname is 16 bytes)
but the actual data is struct qnx4_link_info where the name is 48 bytes
long. A possible fix (compile tested only) is below.

I think the warning/error is only reported with gcc 11.x. Do you possibly
use an older compiler for x86/arm64 ?

Anyway, below is a partial list of build errors I have seen. Some of
them are easy to fix (such as the ones due to unused functions),
but others seem to be tricky.

Guenter

---
diff --git a/fs/qnx4/dir.c b/fs/qnx4/dir.c
index a6ee23aadd28..f75dcadd98e5 100644
--- a/fs/qnx4/dir.c
+++ b/fs/qnx4/dir.c
@@ -44,20 +44,17 @@ static int qnx4_readdir(struct file *file, struct dir_context *ctx)
 				continue;
 			if (!(de->di_status & (QNX4_FILE_USED|QNX4_FILE_LINK)))
 				continue;
-			if (!(de->di_status & QNX4_FILE_LINK))
-				size = QNX4_SHORT_NAME_MAX;
-			else
-				size = QNX4_NAME_MAX;
-			size = strnlen(de->di_fname, size);
-			QNX4DEBUG((KERN_INFO "qnx4_readdir:%.*s\n", size, de->di_fname));
-			if (!(de->di_status & QNX4_FILE_LINK))
+			if (!(de->di_status & QNX4_FILE_LINK)) {
+				size = strnlen(de->di_fname, QNX4_SHORT_NAME_MAX);
 				ino = blknum * QNX4_INODES_PER_BLOCK + ix - 1;
-			else {
+			} else {
 				le = (struct qnx4_link_info*)de;
+				size = strnlen(le->dl_fname, QNX4_NAME_MAX);
 				ino = ( le32_to_cpu(le->dl_inode_blk) - 1 ) *
 					QNX4_INODES_PER_BLOCK +
 					le->dl_inode_ndx;
 			}
+			QNX4DEBUG((KERN_INFO "qnx4_readdir:%.*s\n", size, de->di_fname));
 			if (!dir_emit(ctx, de->di_fname, size, ino, DT_UNKNOWN)) {
 				brelse(bh);
 				return 0;
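
As a user-space illustration of the overloading described above (not the kernel code itself), the sketch below models the two layouts with simplified structs and measures the name through the type that matches the entry. Only the 16- and 48-byte name sizes come from the discussion; the struct names, the padding fields, the DEMO_FILE_LINK value, and the use of a union are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L /* for strnlen() */
#include <stddef.h>
#include <string.h>

#define DEMO_SHORT_NAME_MAX	16
#define DEMO_NAME_MAX		48
#define DEMO_FILE_LINK		0x08	/* arbitrary flag value for the demo */

/* Regular entry: short name, status byte at the end of a 64-byte record. */
struct demo_inode_entry {
	char di_fname[DEMO_SHORT_NAME_MAX];
	unsigned char di_other[47];
	unsigned char di_status;
};

/* Link entry: same record size, but the name field is 48 bytes long. */
struct demo_link_info {
	char dl_fname[DEMO_NAME_MAX];
	unsigned char dl_other[15];
	unsigned char dl_status;
};

/* The status byte must sit at the same offset for the overlay to work. */
_Static_assert(offsetof(struct demo_inode_entry, di_status) ==
	       offsetof(struct demo_link_info, dl_status),
	       "status byte must overlap in both layouts");

/* One directory slot holds either layout; the union states that openly. */
union demo_dir_entry {
	struct demo_inode_entry inode;
	struct demo_link_info link;
};

/* Bound the length by the array that actually holds the name. */
static size_t demo_name_len(const union demo_dir_entry *e)
{
	if (e->inode.di_status & DEMO_FILE_LINK)
		return strnlen(e->link.dl_fname, DEMO_NAME_MAX);
	return strnlen(e->inode.di_fname, DEMO_SHORT_NAME_MAX);
}
```

Because each strnlen() call names an array at least as large as its bound, the compiler has no 16-byte object to compare against a 48-byte limit.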

---
alpha.log:arch/alpha/kernel/setup.c:493:13: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
alpha.log:drivers/atm/ambassador.c:1747:58: error: passing argument 1 of 'virt_to_bus' discards 'volatile' qualifier from pointer target type [-Werror=discarded-qualifiers]
alpha.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: 'cdn_dp_resume' defined but not used [-Werror=unused-function]
alpha.log:drivers/net/ethernet/3com/3c515.c:1053:22: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
alpha.log:drivers/net/ethernet/amd/ni65.c:751:37: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
alpha.log:drivers/net/hamradio/6pack.c:71:41: error: unsigned conversion from 'int' to 'unsigned char' changes value from '256' to '0' [-Werror=overflow]
alpha.log:drivers/net/wan/lmc/lmc_main.c:1782:50: error: passing argument 1 of 'virt_to_bus' discards 'volatile' qualifier from pointer target type [-Werror=discarded-qualifiers]
alpha.log:drivers/net/wan/lmc/lmc_main.c:1791:53: error: passing argument 1 of 'virt_to_bus' discards 'volatile' qualifier from pointer target type [-Werror=discarded-qualifiers]
alpha.log:drivers/net/wan/lmc/lmc_main.c:1793:51: error: passing argument 1 of 'virt_to_bus' discards 'volatile' qualifier from pointer target type [-Werror=discarded-qualifiers]
alpha.log:drivers/net/wan/lmc/lmc_main.c:1804:50: error: passing argument 1 of 'virt_to_bus' discards 'volatile' qualifier from pointer target type [-Werror=discarded-qualifiers]
alpha.log:drivers/net/wan/lmc/lmc_main.c:1806:50: error: passing argument 1 of 'virt_to_bus' discards 'volatile' qualifier from pointer target type [-Werror=discarded-qualifiers]
alpha.log:drivers/net/wan/lmc/lmc_main.c:1807:51: error: passing argument 1 of 'virt_to_bus' discards 'volatile' qualifier from pointer target type [-Werror=discarded-qualifiers]
alpha.log:drivers/spi/spi-tegra20-slink.c:1188:12: error: 'tegra_slink_runtime_suspend' defined but not used [-Werror=unused-function]
alpha.log:drivers/spi/spi-tegra20-slink.c:1200:12: error: 'tegra_slink_runtime_resume' defined but not used [-Werror=unused-function]
alpha.log:fs/qnx4/dir.c:51:32: error: 'strnlen' specified bound 48 exceeds source size 16 [-Werror=stringop-overread]
m68k.log:./arch/m68k/include/asm/raw_io.h:20:19: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
m68k.log:./arch/m68k/include/asm/raw_io.h:30:32: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
m68k.log:./arch/m68k/include/asm/string.h:72:25: error: '__builtin_memcpy' reading 6 bytes from a region of size 0 [-Werror=stringop-overread]
m68k.log:arch/m68k/mvme147/config.c:174:2: error: #warning check me! [-Werror=cpp]
m68k.log:arch/m68k/mvme16x/config.c:439:2: error: #warning check me! [-Werror=cpp]
m68k.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: 'cdn_dp_resume' defined but not used [-Werror=unused-function]
m68k.log:drivers/input/joystick/analog.c:160:2: error: #warning Precise timer not defined for this architecture. [-Werror=cpp]
m68k.log:drivers/spi/spi-tegra20-slink.c:1188:12: error: 'tegra_slink_runtime_suspend' defined but not used [-Werror=unused-function]
m68k.log:drivers/spi/spi-tegra20-slink.c:1200:12: error: 'tegra_slink_runtime_resume' defined but not used [-Werror=unused-function]
mips.log:./arch/mips/include/asm/sibyte/bcm1480_scd.h:261: error: "M_SPC_CFG_CLEAR" redefined [-Werror]
mips.log:./arch/mips/include/asm/sibyte/bcm1480_scd.h:262: error: "M_SPC_CFG_ENABLE" redefined [-Werror]
mips.log:drivers/input/joystick/analog.c:160:2: error: #warning Precise timer not defined for this architecture. [-Werror=cpp]
ppc.log:drivers/net/ethernet/cirrus/cs89x0.c:897:41: error: implicit declaration of function 'isa_virt_to_bus' [-Werror=implicit-function-declaration]
riscv32.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: 'cdn_dp_resume' defined but not used [-Werror=unused-function]
riscv.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: 'cdn_dp_resume' defined but not used [-Werror=unused-function]
s390.log:arch/s390/kernel/syscall.c:168:1: error: '__do_syscall' uses dynamic stack allocation [-Werror]
s390.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: 'cdn_dp_resume' defined but not used [-Werror=unused-function]
s390.log:drivers/input/joystick/analog.c:160:2: error: #warning Precise timer not defined for this architecture. [-Werror=cpp]
s390.log:drivers/spi/spi-tegra20-slink.c:1188:12: error: 'tegra_slink_runtime_suspend' defined but not used [-Werror=unused-function]
s390.log:drivers/spi/spi-tegra20-slink.c:1200:12: error: 'tegra_slink_runtime_resume' defined but not used [-Werror=unused-function]
s390.log:lib/test_kasan.c:767:1: error: 'kasan_alloca_oob_left' uses dynamic stack allocation [-Werror]
s390.log:lib/test_kasan.c:782:1: error: 'kasan_alloca_oob_right' uses dynamic stack allocation [-Werror]
s390.log:s390-linux-objcopy: error: the input file 'arch/s390/boot/compressed/syms.bin' is empty
sparc64.log:arch/sparc/kernel/mdesc.c:647:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
sparc64.log:arch/sparc/kernel/mdesc.c:692:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
sparc64.log:arch/sparc/kernel/mdesc.c:719:21: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
sparc64.log:drivers/input/joystick/analog.c:160:2: error: #warning Precise timer not defined for this architecture. [-Werror=cpp]
sparc64.log:fs/qnx4/dir.c:51:32: error: 'strnlen' specified bound 48 exceeds source size 16 [-Werror=stringop-overread]
sparc64.log:./include/linux/minmax.h:20:35: error: comparison of distinct pointer types lacks a cast [-Werror]
sparc.log:crypto/blake2b_generic.c:109:1: error: the frame size of 2288 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
sparc.log:drivers/input/joystick/analog.c:160:2: error: #warning Precise timer not defined for this architecture. [-Werror=cpp]
sparc.log:drivers/spi/spi-tegra20-slink.c:1188:12: error: 'tegra_slink_runtime_suspend' defined but not used [-Werror=unused-function]
sparc.log:drivers/spi/spi-tegra20-slink.c:1200:12: error: 'tegra_slink_runtime_resume' defined but not used [-Werror=unused-function]
sparc.log:drivers/tty/serial/sunzilog.c:1128:13: error: 'sunzilog_putchar' defined but not used [-Werror=unused-function]
sparc.log:drivers/usb/host/ehci-hcd.c:1301: error: "PLATFORM_DRIVER" redefined [-Werror]
sparc.log:drivers/usb/host/ehci-sh.c:173:31: error: 'ehci_hcd_sh_driver' defined but not used [-Werror=unused-variable]
sparc.log:fs/qnx4/dir.c:51:32: error: 'strnlen' specified bound 48 exceeds source size 16 [-Werror=stringop-overread]
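
Several of the "defined but not used" entries above involve functions that are only referenced in some configurations. One common way to quiet that warning without extra #ifdef blocks is to mark the function as possibly unused; the kernel spells this __maybe_unused, which expands to the compiler's unused attribute. The sketch below is a user-space illustration under that assumption. DEMO_PM, struct demo_dev, demo_resume() and demo_power_on() are hypothetical names and are unrelated to the drivers in the log.

```c
/* User-space spelling of the attribute the kernel calls __maybe_unused. */
#define demo_maybe_unused __attribute__((__unused__))

struct demo_dev {
	int powered;
};

/*
 * Only referenced when DEMO_PM is defined. Without the attribute,
 * -Wunused-function fires in every other configuration, and -Werror
 * then turns that into a build failure.
 */
static int demo_maybe_unused demo_resume(struct demo_dev *dev)
{
	dev->powered = 1;
	return 0;
}

/* Power the device up, going through the PM callback when it is built in. */
static int demo_power_on(struct demo_dev *dev)
{
#ifdef DEMO_PM
	return demo_resume(dev);
#else
	dev->powered = 1;	/* no PM support: treat the device as always on */
	return 0;
#endif
}
```

The alternative is to wrap the function itself in the same #ifdef as its only caller; the attribute keeps the function compiled, and therefore type-checked, in every configuration.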

2021-09-07 01:14:59

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Mon, Sep 6, 2021 at 4:49 PM Guenter Roeck <[email protected]> wrote:
>
> > but I'm not seeing why that one happens on sparc64, but not on arm64
> > or x86-64. There doesn't seem to be anything architecture-specific
> > anywhere in that area.
> >
> > Funky.
>
> Not really. That is because de->di_fname is always 16 bytes but size
> can be 48 if the node is really a link. The use of de is overloaded
> in that case; de is struct qnx4_inode_entry (where di_fname is 16 bytes)
> but the actual data is struct qnx4_link_info where the name is 48 bytes
> long. A possible fix (compile tested only) is below.
>
> I think the warning/error is only reported with gcc 11.x. Do you possibly
> use an older compiler for x86/arm64 ?

No. Literally the same exact version. All of them are

gcc version 11.2.1 20210728

from F34.

I suspect it's something about the config - a sparc64 allmodconfig
presumably doesn't end up having some of the things x86-64 has enabled
(because of different core config parameters), and then optimizes
differently as a result and shows the issue that way.

Or something. <wild handwaving>

Linus

2021-09-07 02:31:42

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/6/21 6:12 PM, Linus Torvalds wrote:
> On Mon, Sep 6, 2021 at 4:49 PM Guenter Roeck <[email protected]> wrote:
>>
>>> but I'm not seeing why that one happens on sparc64, but not on arm64
>>> or x86-64. There doesn't seem to be anything architecture-specific
>>> anywhere in that area.
>>>
>>> Funky.
>>
>> Not really. That is because de->di_fname is always 16 bytes but size
>> can be 48 if the node is really a link. The use of de is overloaded
>> in that case; de is struct qnx4_inode_entry (where di_fname is 16 bytes)
>> but the actual data is struct qnx4_link_info where the name is 48 bytes
>> long. A possible fix (compile tested only) is below.
>>
>> I think the warning/error is only reported with gcc 11.x. Do you possibly
>> use an older compiler for x86/arm64 ?
>
> No. Literally the same exact version. All of them are
>
> gcc version 11.2.1 20210728
>
> from F34.
>
> I suspect it's something about the config - a sparc64 allmodconfig
> presumably doesn't end up having some of the things x86-64 has enabled
> (because of different core config parameters), and then optimizes
> differently as a result and shows the issue that way.
>
> Or something. <wild handwaving>
>

Looks like Arnd stumbled into the qnx4 problem before:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578

He might have an idea how to fix it for good.

Guenter

2021-09-07 02:33:30

by Nathan Chancellor

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Mon, Sep 06, 2021 at 09:12:12AM -0700, Linus Torvalds wrote:
> On Mon, Sep 6, 2021 at 7:26 AM Guenter Roeck <[email protected]> wrote:
> >
> > Build results:
> > total: 153 pass: 89 fail: 64
>
> Well, that sadly proves the point of that patch. x86-64 may be clean,
> because I have required it manually. Others not necessarily so much..
>
> I've got at least one sparc64 fix in my inbox. It _might_ fix some
> other cases too (syscall checking), but I suspect it's one of those
> "death by a thousand cuts" situations, not just one or two issues that
> show up.
>
> Do you end up exposing the errors anywhere where I can take a look?
>
> If some of them are just because of bad tooling on certain
> architectures (ie fundamentally "this is unfixable, because we use
> gcc-XYZ that just always causes warnings") then those we could/should
> just disable -Werror for those and forget about them.
>
> But hopefully most cases are just "people haven't cared enough" and
> easily fixed.

Our clang builds got bit pretty hard by this. From my local builds
(clang-14), the following ones failed (file name describes the config)
along with the errors plus some triage. -Wframe-larger-than= appears to
be the most common warning. I apologize if this email is too long or
convoluted, I can try to break it down better in the future.



arm32-allmodconfig.log: arch/arm/lib/xor-neon.c:30:2: error: This code requires at least version 4.6 of GCC [-Werror,-W#warnings]
arm32-alpine.log: arch/arm/lib/xor-neon.c:30:2: error: This code requires at least version 4.6 of GCC [-Werror,-W#warnings]
arm32-debian.log: arch/arm/lib/xor-neon.c:30:2: error: This code requires at least version 4.6 of GCC [-Werror,-W#warnings]
arm32-fedora.log: arch/arm/lib/xor-neon.c:30:2: error: This code requires at least version 4.6 of GCC [-Werror,-W#warnings]
arm32-opensuse.log: arch/arm/lib/xor-neon.c:30:2: error: This code requires at least version 4.6 of GCC [-Werror,-W#warnings]
arm32-v7-archlinux.log:arch/arm/lib/xor-neon.c:30:2: error: This code requires at least version 4.6 of GCC [-Werror,-W#warnings]

This has been tracked for a while with no real resolution:

https://github.com/ClangBuiltLinux/linux/issues/496
https://github.com/ClangBuiltLinux/linux/issues/503
https://lore.kernel.org/r/[email protected]/
https://lore.kernel.org/r/[email protected]/
https://lore.kernel.org/r/[email protected]/



arm32-allmodconfig.log: drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c:157:11: error: variable 'err' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
arm32-allmodconfig.log: drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c:257:7: error: variable 'err' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
arm32-allmodconfig.log: drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c:262:7: error: variable 'err' is used uninitialized whenever switch case is taken [-Werror,-Wsometimes-uninitialized]

This affected a great number of configs. It is known and is being fixed:

https://lore.kernel.org/r/CA+G9fYsV7sTfaefGj3bpkvVdRQUeiWCVRiu6ovjtM=qri-HJ8g@mail.gmail.com/
https://lore.kernel.org/r/[email protected]/

Unfortunately, these uninitialized-variable warnings will keep plaguing us:
GCC does not report them because -Wmaybe-uninitialized is disabled, as it
is not as reliable as the clang warning.
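
The shape behind this kind of clang error is typically a return code that is assigned on some branches only. A minimal sketch with hypothetical names (demo_handle_event() and its event constants are not taken from the mlx5 code) shows the warning-prone form in a comment and one straightforward fix:

```c
#include <errno.h>

enum demo_event { DEMO_EVENT_ADD, DEMO_EVENT_DEL, DEMO_EVENT_OTHER };

/*
 * The warning-prone shape is:
 *
 *	int err;
 *	if (ev == DEMO_EVENT_ADD)
 *		err = demo_add();
 *	return err;	(uninitialized whenever the 'if' is false)
 *
 * Giving every path an explicit value removes the ambiguity.
 */
static int demo_handle_event(enum demo_event ev, int *count)
{
	int err;

	switch (ev) {
	case DEMO_EVENT_ADD:
		(*count)++;
		err = 0;
		break;
	case DEMO_EVENT_DEL:
		if (*count == 0) {
			err = -ENOENT;	/* nothing left to delete */
		} else {
			(*count)--;
			err = 0;
		}
		break;
	default:
		err = -EINVAL;	/* the branch that is easy to forget */
		break;
	}
	return err;
}
```

Initializing 'err' at its declaration would also silence the warning, but assigning on each path lets the compiler keep flagging any branch added later without a value.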



arm32-allmodconfig.log: crypto/wp512.c:782:13: error: stack frame size (1176) exceeds limit (1024) in function 'wp512_process_buffer' [-Werror,-Wframe-larger-than]
arm32-allmodconfig.log: drivers/firmware/tegra/bpmp-debugfs.c:294:12: error: stack frame size (1256) exceeds limit (1024) in function 'bpmp_debug_show' [-Werror,-Wframe-larger-than]
arm32-allmodconfig.log: drivers/firmware/tegra/bpmp-debugfs.c:357:16: error: stack frame size (1264) exceeds limit (1024) in function 'bpmp_debug_store' [-Werror,-Wframe-larger-than]
arm32-allmodconfig.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3043:6: error: stack frame size (1384) exceeds limit (1024) in function 'bw_calcs' [-Werror,-Wframe-larger-than]
arm32-allmodconfig.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5560) exceeds limit (1024) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]
arm32-allmodconfig.log: drivers/mtd/chips/cfi_cmdset_0001.c:1872:12: error: stack frame size (1064) exceeds limit (1024) in function 'cfi_intelext_writev' [-Werror,-Wframe-larger-than]
arm32-allmodconfig.log: drivers/ntb/hw/idt/ntb_hw_idt.c:1041:27: error: stack frame size (1032) exceeds limit (1024) in function 'idt_scan_mws' [-Werror,-Wframe-larger-than]
arm32-allmodconfig.log: drivers/staging/fbtft/fbtft-core.c:902:12: error: stack frame size (1072) exceeds limit (1024) in function 'fbtft_init_display_from_property' [-Werror,-Wframe-larger-than]
arm32-allmodconfig.log: drivers/staging/fbtft/fbtft-core.c:992:5: error: stack frame size (1064) exceeds limit (1024) in function 'fbtft_init_display' [-Werror,-Wframe-larger-than]
arm32-allmodconfig.log: drivers/staging/rtl8723bs/core/rtw_security.c:1288:5: error: stack frame size (1040) exceeds limit (1024) in function 'rtw_aes_decrypt' [-Werror,-Wframe-larger-than]
arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3043:6: error: stack frame size (1376) exceeds limit (1024) in function 'bw_calcs' [-Werror,-Wframe-larger-than]
arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5384) exceeds limit (1024) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]

Aside from the dce_calcs.c warnings, these do not seem too bad. I
believe allmodconfig turns on UBSAN but it could also be aggressive
inlining by clang. I intend to look at all -Wframe-larger-than warnings
closely later.
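
Whatever makes a frame grow (instrumentation, inlining, or simply large locals), the limit is checked per function. For the large-locals case, one common remedy is to move the buffer off the stack. The sketch below is a user-space illustration with hypothetical names (demo_checksum(), DEMO_SCRATCH_SIZE); kernel code would typically allocate with kmalloc() rather than malloc().

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_SCRATCH_SIZE 2048

/*
 * A local 'unsigned char scratch[DEMO_SCRATCH_SIZE];' would by itself push
 * this function's frame past a 1024-byte limit. Allocating the buffer keeps
 * the frame small, at the cost of an allocation that can fail.
 */
static int demo_checksum(const unsigned char *data, size_t len,
			 unsigned int *sum)
{
	unsigned char *scratch;
	size_t i;

	if (len > DEMO_SCRATCH_SIZE)
		return -1;

	scratch = malloc(DEMO_SCRATCH_SIZE);
	if (!scratch)
		return -1;

	/* Work on a private copy, as a stand-in for real scratch usage. */
	memcpy(scratch, data, len);
	*sum = 0;
	for (i = 0; i < len; i++)
		*sum += scratch[i];

	free(scratch);
	return 0;
}
```

When the growth comes from inlining instead, the fix is usually different (for example, keeping a large helper out of line), so this sketch covers only one of the causes mentioned above.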



arm64-alpine.log:drivers/scsi/lpfc/lpfc_init.c:11608:48: error: shift count >= width of type [-Werror,-Wshift-count-overflow]
arm64-alpine.log:drivers/scsi/lpfc/lpfc_init.c:8280:29: error: no member named 'c_stat' in 'struct lpfc_sli4_hba'
arm64-alpine.log:drivers/scsi/lpfc/lpfc_init.c:9092:48: error: shift count >= width of type [-Werror,-Wshift-count-overflow]
arm64-alpine.log:drivers/scsi/lpfc/lpfc_nvme.c:1592:3: error: use of undeclared identifier 'start'
arm64-alpine.log:drivers/scsi/lpfc/lpfc_nvme.c:1639:28: error: use of undeclared identifier 'start'; did you mean 'cstat'?
arm64-alpine.log:drivers/scsi/lpfc/lpfc_scsi.c:5587:2: error: use of undeclared identifier 'start'
arm64-alpine.log:drivers/scsi/lpfc/lpfc_scsi.c:5670:27: error: use of undeclared identifier 'start'
x86_64-alpine.log:drivers/scsi/lpfc/lpfc_init.c:11608:48: error: shift count >= width of type [-Werror,-Wshift-count-overflow]
x86_64-alpine.log:drivers/scsi/lpfc/lpfc_init.c:8280:29: error: no member named 'c_stat' in 'struct lpfc_sli4_hba'
x86_64-alpine.log:drivers/scsi/lpfc/lpfc_init.c:9092:48: error: shift count >= width of type [-Werror,-Wshift-count-overflow]
x86_64-alpine.log:drivers/scsi/lpfc/lpfc_nvme.c:1592:3: error: use of undeclared identifier 'start'
x86_64-alpine.log:drivers/scsi/lpfc/lpfc_nvme.c:1639:28: error: use of undeclared identifier 'start'
x86_64-alpine.log:drivers/scsi/lpfc/lpfc_scsi.c:5587:2: error: use of undeclared identifier 'start'
x86_64-alpine.log:drivers/scsi/lpfc/lpfc_scsi.c:5670:27: error: use of undeclared identifier 'start'; did you mean 'stac'?

The -Wshift-count-overflow warnings only show because there are other
errors in this file. This appears to be because some variables or
members of a structure are not defined when the lpfc debug config is
disabled. I'll send patches for these later.
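
The failure mode described here, a declaration guarded by a config option while a use of it is not, can be sketched as follows. DEMO_DEBUG_FS, struct demo_hba and demo_submit() are hypothetical names for illustration and do not reflect the lpfc sources; the point is only that the declaration and every use sit under the same guard.

```c
#include <stdint.h>

struct demo_hba {
	uint64_t submitted;
#ifdef DEMO_DEBUG_FS
	uint64_t last_start;	/* exists only in debug builds */
#endif
};

/* Count a submission; the debug field is touched only when it exists. */
static uint64_t demo_submit(struct demo_hba *hba, uint64_t now)
{
#ifdef DEMO_DEBUG_FS
	uint64_t start = now;	/* declared under the guard ... */
#else
	(void)now;		/* timestamp is unused without debug support */
#endif

	hba->submitted++;

#ifdef DEMO_DEBUG_FS
	hba->last_start = start; /* ... so every use needs the same guard */
#endif
	return hba->submitted;
}
```

Dropping either of the two guarded blocks while keeping the other reproduces the "undeclared identifier" or "no member named" class of error in non-debug builds.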



arm64-archlinux.log: arch/arm64/crypto/aes-neonbs-glue.c:270:12: error: stack frame size (1056) exceeds limit (1024) in function 'aesbs_xts_setkey' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/char/ipmi/ipmi_msghandler.c:4850:13: error: stack frame size (1072) exceeds limit (1024) in function 'ipmi_panic_request_and_wait' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/crypto/ccp/ccp-ops.c:629:1: error: stack frame size (1072) exceeds limit (1024) in function 'ccp_run_aes_gcm_cmd' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/firmware/tegra/bpmp-debugfs.c:294:12: error: stack frame size (1296) exceeds limit (1024) in function 'bpmp_debug_show' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/firmware/tegra/bpmp-debugfs.c:357:16: error: stack frame size (1328) exceeds limit (1024) in function 'bpmp_debug_store' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/gpu/drm/radeon/radeon_cs.c:661:5: error: stack frame size (1184) exceeds limit (1024) in function 'radeon_cs_ioctl' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:1779:5: error: stack frame size (1152) exceeds limit (1024) in function 'arm_smmu_atc_inv_domain' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:1851:13: error: stack frame size (1136) exceeds limit (1024) in function '__arm_smmu_tlb_inv_range' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:2295:13: error: stack frame size (1136) exceeds limit (1024) in function 'arm_smmu_disable_ats' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:952:13: error: stack frame size (1152) exceeds limit (1024) in function 'arm_smmu_sync_cd' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/net/wireguard/allowedips.c:255:6: error: stack frame size (1120) exceeds limit (1024) in function 'wg_allowedips_free' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/net/wireguard/allowedips.c:53:13: error: stack frame size (1088) exceeds limit (1024) in function 'root_free_rcu' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/staging/fbtft/fb_hx8353d.c:20:12: error: stack frame size (1040) exceeds limit (1024) in function 'init_display' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/staging/fbtft/fb_ssd1331.c:131:12: error: stack frame size (1360) exceeds limit (1024) in function 'set_gamma' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/staging/fbtft/fb_ssd1351.c:120:12: error: stack frame size (1360) exceeds limit (1024) in function 'set_gamma' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/staging/fbtft/fbtft-core.c:902:12: error: stack frame size (1040) exceeds limit (1024) in function 'fbtft_init_display_from_property' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/vhost/scsi.c:1543:1: error: stack frame size (1152) exceeds limit (1024) in function 'vhost_scsi_set_endpoint' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/vhost/scsi.c:1670:1: error: stack frame size (1136) exceeds limit (1024) in function 'vhost_scsi_clear_endpoint' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: drivers/vhost/scsi.c:1831:12: error: stack frame size (1376) exceeds limit (1024) in function 'vhost_scsi_release' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: fs/binfmt_elf.c:766:12: error: stack frame size (1072) exceeds limit (1024) in function 'parse_elf_properties' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: fs/binfmt_elf.c:766:12: error: stack frame size (1088) exceeds limit (1024) in function 'parse_elf_properties' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: fs/select.c:970:12: error: stack frame size (1040) exceeds limit (1024) in function 'do_sys_poll' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: net/core/rtnetlink.c:3251:12: error: stack frame size (1088) exceeds limit (1024) in function '__rtnl_newlink' [-Werror,-Wframe-larger-than]
arm64-archlinux.log: net/sunrpc/auth_gss/gss_krb5_crypto.c:599:1: error: stack frame size (1152) exceeds limit (1024) in function 'gss_krb5_aes_encrypt' [-Werror,-Wframe-larger-than]
arm64-fedora.log: arch/arm64/crypto/aes-neonbs-glue.c:270:12: error: stack frame size (1040) exceeds limit (1024) in function 'aesbs_xts_setkey' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/block/drbd/drbd_main.c:2507:5: error: stack frame size (1088) exceeds limit (1024) in function 'set_resource_options' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/char/ipmi/ipmi_msghandler.c:4850:13: error: stack frame size (1072) exceeds limit (1024) in function 'ipmi_panic_request_and_wait' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/clk/zynqmp/clkc.c:768:12: error: stack frame size (1040) exceeds limit (1024) in function 'zynqmp_clock_probe' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/crypto/ccp/ccp-ops.c:629:1: error: stack frame size (1056) exceeds limit (1024) in function 'ccp_run_aes_gcm_cmd' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/firmware/tegra/bpmp-debugfs.c:294:12: error: stack frame size (1296) exceeds limit (1024) in function 'bpmp_debug_show' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/firmware/tegra/bpmp-debugfs.c:357:16: error: stack frame size (1328) exceeds limit (1024) in function 'bpmp_debug_store' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:1779:5: error: stack frame size (1152) exceeds limit (1024) in function 'arm_smmu_atc_inv_domain' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:1851:13: error: stack frame size (1136) exceeds limit (1024) in function '__arm_smmu_tlb_inv_range' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:2295:13: error: stack frame size (1136) exceeds limit (1024) in function 'arm_smmu_disable_ats' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:952:13: error: stack frame size (1152) exceeds limit (1024) in function 'arm_smmu_sync_cd' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/net/ethernet/freescale/dpaa/dpaa_eth.c:3308:12: error: stack frame size (8400) exceeds limit (1024) in function 'dpaa_eth_probe' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c:534:12: error: stack frame size (4208) exceeds limit (1024) in function 'dpaa_set_coalesce' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/net/wireguard/allowedips.c:255:6: error: stack frame size (1120) exceeds limit (1024) in function 'wg_allowedips_free' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/net/wireguard/allowedips.c:53:13: error: stack frame size (1088) exceeds limit (1024) in function 'root_free_rcu' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/net/wireless/ath/ath11k/mac.c:2539:12: error: stack frame size (1040) exceeds limit (1024) in function 'ath11k_mac_op_hw_scan' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/vhost/scsi.c:1543:1: error: stack frame size (1152) exceeds limit (1024) in function 'vhost_scsi_set_endpoint' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/vhost/scsi.c:1670:1: error: stack frame size (1136) exceeds limit (1024) in function 'vhost_scsi_clear_endpoint' [-Werror,-Wframe-larger-than]
arm64-fedora.log: drivers/vhost/scsi.c:1831:12: error: stack frame size (1360) exceeds limit (1024) in function 'vhost_scsi_release' [-Werror,-Wframe-larger-than]
arm64-fedora.log: fs/binfmt_elf.c:766:12: error: stack frame size (1072) exceeds limit (1024) in function 'parse_elf_properties' [-Werror,-Wframe-larger-than]
arm64-fedora.log: fs/jffs2/xattr.c:775:6: error: stack frame size (1168) exceeds limit (1024) in function 'jffs2_build_xattr_subsystem' [-Werror,-Wframe-larger-than]
arm64-fedora.log: fs/select.c:970:12: error: stack frame size (1040) exceeds limit (1024) in function 'do_sys_poll' [-Werror,-Wframe-larger-than]
arm64-fedora.log: include/linux/module.h:76:12: error: stack frame size (1120) exceeds limit (1024) in function 'init_module' [-Werror,-Wframe-larger-than]
arm64-fedora.log: kernel/cgroup/cpuset.c:1536:12: error: stack frame size (1600) exceeds limit (1024) in function 'update_cpumask' [-Werror,-Wframe-larger-than]
arm64-fedora.log: kernel/cgroup/cpuset.c:1985:12: error: stack frame size (1712) exceeds limit (1024) in function 'update_prstate' [-Werror,-Wframe-larger-than]
arm64-fedora.log: kernel/cgroup/cpuset.c:3200:13: error: stack frame size (1632) exceeds limit (1024) in function 'cpuset_hotplug_workfn' [-Werror,-Wframe-larger-than]
arm64-fedora.log: kernel/irq/affinity.c:338:12: error: stack frame size (1136) exceeds limit (1024) in function 'irq_build_affinity_masks' [-Werror,-Wframe-larger-than]
arm64-fedora.log: kernel/sched/core.c:7934:1: error: stack frame size (1088) exceeds limit (1024) in function '__sched_setaffinity' [-Werror,-Wframe-larger-than]
arm64-fedora.log: kernel/sched/isolation.c:80:19: error: stack frame size (1088) exceeds limit (1024) in function 'housekeeping_setup' [-Werror,-Wframe-larger-than]
arm64-fedora.log: net/core/rtnetlink.c:3251:12: error: stack frame size (1088) exceeds limit (1024) in function '__rtnl_newlink' [-Werror,-Wframe-larger-than]
arm64-fedora.log: net/sunrpc/auth_gss/gss_krb5_crypto.c:599:1: error: stack frame size (1152) exceeds limit (1024) in function 'gss_krb5_aes_encrypt' [-Werror,-Wframe-larger-than]

It appears that both Arch Linux and Fedora define CONFIG_FRAME_WARN
as 1024, below its default of 2048. I am not sure these look particularly
scary (although there are some that are rather large that need to be
looked at).
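
To make the frame-size numbers above concrete, here is a minimal userspace
sketch (not kernel code; the function names and SCRATCH_BYTES are made up
for illustration) of the kind of pattern -Wframe-larger-than flags, next to
a variant that moves the buffer off the stack. Whether a given compiler
really warns depends on optimization level and inlining, so take it as an
illustration only.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size, chosen to be larger than a 1024-byte frame limit. */
#define SCRATCH_BYTES 1200

/*
 * The scratch buffer lives in this function's stack frame, so the frame
 * is at least SCRATCH_BYTES large. A build with a 1024-byte limit would
 * be expected to complain about this function.
 */
size_t checksum_on_stack(const unsigned char *src, size_t len)
{
	unsigned char scratch[SCRATCH_BYTES];
	size_t i, sum = 0;

	if (len > sizeof(scratch))
		len = sizeof(scratch);
	memcpy(scratch, src, len);
	for (i = 0; i < len; i++)
		sum += scratch[i];
	return sum;
}

/* Same logic, but the buffer is heap-allocated, so the frame stays small. */
size_t checksum_on_heap(const unsigned char *src, size_t len)
{
	unsigned char *scratch = malloc(SCRATCH_BYTES);
	size_t i, sum = 0;

	if (!scratch)
		return 0;	/* sketch only: real code would report the failure */
	if (len > SCRATCH_BYTES)
		len = SCRATCH_BYTES;
	memcpy(scratch, src, len);
	for (i = 0; i < len; i++)
		sum += scratch[i];
	free(scratch);
	return sum;
}
```

Most of the functions listed above are only a few dozen bytes over the
limit, which looks more like accumulated locals and inlining than a single
big buffer, so the fix there is not necessarily this simple.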



i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:452:13: error: stack frame size (1628) exceeds limit (1024) in function 'dcn_bw_calc_rq_dlg_ttu' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:1085:13: error: stack frame size (1356) exceeds limit (1024) in function 'dml20_DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerformanceCalculation' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:3286:6: error: stack frame size (1484) exceeds limit (1024) in function 'dml20_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20v2.c:1145:13: error: stack frame size (1228) exceeds limit (1024) in function 'dml20v2_DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerformanceCalculation' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20v2.c:3393:6: error: stack frame size (1372) exceeds limit (1024) in function 'dml20v2_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_mode_vba_21.c:1466:13: error: stack frame size (1308) exceeds limit (1024) in function 'DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerformanceCalculation' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_mode_vba_21.c:3397:6: error: stack frame size (1564) exceeds limit (1024) in function 'dml21_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_rq_dlg_calc_21.c:1657:6: error: stack frame size (1100) exceeds limit (1024) in function 'dml21_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_rq_dlg_calc_21.c:829:13: error: stack frame size (1084) exceeds limit (1024) in function 'dml_rq_dlg_get_dlg_params' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.c:1831:6: error: stack frame size (1108) exceeds limit (1024) in function 'dml30_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.c:981:13: error: stack frame size (1148) exceeds limit (1024) in function 'dml_rq_dlg_get_dlg_params' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_rq_dlg_calc_31.c:939:13: error: stack frame size (1372) exceeds limit (1024) in function 'dml_rq_dlg_get_dlg_params' [-Werror,-Wframe-larger-than]
i386-debian.log: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dml1_display_rq_dlg_calc.c:997:6: error: stack frame size (1212) exceeds limit (1024) in function 'dml1_rq_dlg_get_dlg_params' [-Werror,-Wframe-larger-than]

I am guessing these are all excessive due to the floating point logic
and more inlining. I have investigated some of these previously due to
prior reports:

https://github.com/ClangBuiltLinux/linux/issues/693
https://github.com/ClangBuiltLinux/linux/issues/694
https://github.com/ClangBuiltLinux/linux/issues/695



powerpc64le-debian.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_mode_vba_30.c:1918:13: error: stack frame size (2208) exceeds limit (2048) in function 'DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerformanceCalculation' [-Werror,-Wframe-larger-than]
powerpc64le-debian.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_mode_vba_30.c:3644:6: error: stack frame size (2496) exceeds limit (2048) in function 'dml30_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
powerpc64le-debian.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c:3956:6: error: stack frame size (2720) exceeds limit (2048) in function 'dml31_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
powerpc64le-fedora.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_mode_vba_30.c:1918:13: error: stack frame size (2208) exceeds limit (2048) in function 'DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerformanceCalculation' [-Werror,-Wframe-larger-than]
powerpc64le-fedora.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_mode_vba_30.c:3644:6: error: stack frame size (2496) exceeds limit (2048) in function 'dml30_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
powerpc64le-fedora.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c:3956:6: error: stack frame size (2720) exceeds limit (2048) in function 'dml31_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
powerpc64le-opensuse.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_mode_vba_30.c:1918:13: error: stack frame size (2208) exceeds limit (2048) in function 'DISPCLKDPPCLKDCFCLKDeepSleepPrefetchParametersWatermarksAndPerformanceCalculation' [-Werror,-Wframe-larger-than]
powerpc64le-opensuse.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_mode_vba_30.c:3644:6: error: stack frame size (2496) exceeds limit (2048) in function 'dml30_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
powerpc64le-opensuse.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_mode_vba_31.c:3956:6: error: stack frame size (2720) exceeds limit (2048) in function 'dml31_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]


I suspect this is a backend problem because these do not really appear
in any other configurations (maybe with Altivec?)



powerpc64le-debian.log:drivers/net/ethernet/sfc/falcon/farch.c:994:10: error: shift count is negative [-Werror,-Wshift-count-negative]
powerpc64le-debian.log:drivers/net/ethernet/sfc/farch.c:985:10: error: shift count is negative [-Werror,-Wshift-count-negative]
powerpc64le-opensuse.log:drivers/net/ethernet/sfc/falcon/farch.c:994:10: error: shift count is negative [-Werror,-Wshift-count-negative]
powerpc64le-opensuse.log:drivers/net/ethernet/sfc/farch.c:985:10: error: shift count is negative [-Werror,-Wshift-count-negative]

I believe this is a false positive due to a bug with how clang models
asm goto in the control flow graph: https://bugs.llvm.org/show_bug.cgi?id=51682



riscv-allmodconfig.log:crypto/ecc.c:1276:13: error: stack frame size (3392) exceeds limit (2048) in function 'ecc_point_mult' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:crypto/ecc.c:1358:6: error: stack frame size (3168) exceeds limit (2048) in function 'ecc_point_mult_shamir' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/block/drbd/drbd_receiver.c:924:12: error: stack frame size (2080) exceeds limit (2048) in function 'conn_connect' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/block/loop.c:1818:12: error: stack frame size (2592) exceeds limit (2048) in function 'lo_ioctl' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/crypto/ccree/cc_hash.c:1882:5: error: stack frame size (2528) exceeds limit (2048) in function 'cc_init_hash_sram' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/gpu/drm/nouveau/nvkm/subdev/fb/ramgf100.c:567:1: error: stack frame size (3520) exceeds limit (2048) in function 'gf100_ram_new_' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/gpu/drm/nouveau/nvkm/subdev/fb/ramgk104.c:1521:1: error: stack frame size (5856) exceeds limit (2048) in function 'gk104_ram_new_' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/gpu/drm/nouveau/nvkm/subdev/fb/ramgt215.c:940:1: error: stack frame size (2624) exceeds limit (2048) in function 'gt215_ram_new' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: unused function 'cdn_dp_resume' [-Werror,-Wunused-function]
riscv-allmodconfig.log:drivers/hwmon/occ/common.c:1150:5: error: stack frame size (3008) exceeds limit (2048) in function 'occ_setup' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/infiniband/hw/ocrdma/ocrdma_stats.c:686:16: error: stack frame size (20736) exceeds limit (2048) in function 'ocrdma_dbgfs_ops_read' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/media/common/cx2341x.c:1574:5: error: stack frame size (2944) exceeds limit (2048) in function 'cx2341x_handler_init' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/media/i2c/cx25840/cx25840-core.c:2294:12: error: stack frame size (2976) exceeds limit (2048) in function 'cx25840_reset' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/media/i2c/cx25840/cx25840-core.c:5651:13: error: stack frame size (2400) exceeds limit (2048) in function 'cx23888_std_setup' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/media/pci/cx23885/cx23885-dvb.c:1187:12: error: stack frame size (2688) exceeds limit (2048) in function 'dvb_register' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/media/pci/ddbridge/ddbridge-core.c:2365:6: error: stack frame size (2336) exceeds limit (2048) in function 'ddb_ports_init' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/media/usb/dvb-usb-v2/mxl111sf-i2c.c:799:5: error: stack frame size (2208) exceeds limit (2048) in function 'mxl111sf_i2c_xfer' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/media/usb/gspca/sn9c2028.c:802:12: error: stack frame size (3168) exceeds limit (2048) in function 'sd_start' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/mtd/chips/cfi_cmdset_0001.c:1872:12: error: stack frame size (2432) exceeds limit (2048) in function 'cfi_intelext_writev' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/ethernet/intel/i40e/i40e_ddp.c:264:5: error: stack frame size (2368) exceeds limit (2048) in function 'i40e_ddp_load' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wan/slic_ds26522.c:203:12: error: stack frame size (15328) exceeds limit (2048) in function 'slic_ds26522_probe' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wireless/ath/ath11k/qmi.c:2695:13: error: stack frame size (4384) exceeds limit (2048) in function 'ath11k_qmi_driver_event_work' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wireless/atmel/atmel.c:1050:13: error: stack frame size (2656) exceeds limit (2048) in function 'rx_done_irq' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wireless/atmel/atmel.c:1305:5: error: stack frame size (5152) exceeds limit (2048) in function 'atmel_open' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_n.c:15430:13: error: stack frame size (2400) exceeds limit (2048) in function 'wlc_phy_workarounds_nphy_gainctrl' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_n.c:17019:13: error: stack frame size (6272) exceeds limit (2048) in function 'wlc_phy_workarounds_nphy' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_n.c:25631:1: error: stack frame size (3136) exceeds limit (2048) in function 'wlc_phy_cal_txiqlo_nphy' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_n.c:28303:1: error: stack frame size (2464) exceeds limit (2048) in function 'wlc_phy_txpwr_index_nphy' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wireless/intel/ipw2x00/ipw2100.c:5471:12: error: stack frame size (2880) exceeds limit (2048) in function 'ipw2100_configure_security' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/net/wireless/intel/iwlwifi/mvm/ftm-initiator.c:855:5: error: stack frame size (4672) exceeds limit (2048) in function 'iwl_mvm_ftm_start' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/phy/ti/phy-j721e-wiz.c:1133:12: error: stack frame size (2336) exceeds limit (2048) in function 'wiz_probe' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/rtc/rtc-r9701.c:89:12: error: stack frame size (2400) exceeds limit (2048) in function 'r9701_set_datetime' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/scsi/pm8001/pm80xx_hwi.c:3537:12: error: stack frame size (2368) exceeds limit (2048) in function 'mpi_hw_event' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/scsi/qla2xxx/qla_bsg.c:2787:1: error: stack frame size (3296) exceeds limit (2048) in function 'qla2x00_process_vendor_specific' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/staging/media/hantro/hantro_g2_hevc_dec.c:536:5: error: stack frame size (3616) exceeds limit (2048) in function 'hantro_g2_hevc_dec_run' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/staging/rtl8723bs/core/rtw_security.c:1288:5: error: stack frame size (6976) exceeds limit (2048) in function 'rtw_aes_decrypt' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/staging/rtl8723bs/core/rtw_security.c:865:19: error: stack frame size (5536) exceeds limit (2048) in function 'aes_cipher' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/staging/wlan-ng/cfg80211.c:436:12: error: stack frame size (3904) exceeds limit (2048) in function 'prism2_connect' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/usb/misc/sisusbvga/sisusb.c:1878:12: error: stack frame size (3680) exceeds limit (2048) in function 'sisusb_init_gfxcore' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:drivers/video/fbdev/omap2/omapfb/displays/panel-lgphilips-lb035q02.c:117:12: error: stack frame size (14400) exceeds limit (2048) in function 'lb035q02_connect' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:fs/io_uring.c:6578:12: error: stack frame size (2112) exceeds limit (2048) in function 'io_issue_sqe' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:fs/ocfs2/dlm/dlmdomain.c:1852:12: error: stack frame size (2272) exceeds limit (2048) in function 'dlm_join_domain' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:fs/ocfs2/dlm/dlmmaster.c:701:28: error: stack frame size (2208) exceeds limit (2048) in function 'dlm_get_lock_resource' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:fs/ocfs2/dlm/dlmrecovery.c:427:12: error: stack frame size (2976) exceeds limit (2048) in function 'dlm_do_recovery' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:include/linux/module.h:76:12: error: stack frame size (2848) exceeds limit (2048) in function 'init_module' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:lib/bitfield_kunit.c:60:20: error: stack frame size (11328) exceeds limit (10240) in function 'test_bitfields_constants' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:lib/test_kasan.c:946:13: error: stack frame size (3104) exceeds limit (2048) in function 'kasan_bitops_generic' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:lib/test_scanf.c:217:20: error: stack frame size (4640) exceeds limit (2048) in function 'numbers_simple' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:net/mac80211/mesh.c:1516:6: error: stack frame size (2272) exceeds limit (2048) in function 'ieee80211_mesh_rx_queued_mgmt' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:net/qrtr/ns.c:661:13: error: stack frame size (2144) exceeds limit (2048) in function 'qrtr_ns_worker' [-Werror,-Wframe-larger-than]
riscv-allmodconfig.log:sound/usb/mixer_s1810c.c:543:5: error: stack frame size (2208) exceeds limit (2048) in function 'snd_sc1810_init_mixer' [-Werror,-Wframe-larger-than]

I suspect this is a backend problem because these do not really appear
in any other configurations (might also be something with a sanitizer?)



s390x-defconfig.log: include/asm-generic/io.h:464:31: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:477:61: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:490:61: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:501:33: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:511:59: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:521:59: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:609:20: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:617:20: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:625:20: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:634:21: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:643:21: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
s390x-defconfig.log: include/asm-generic/io.h:652:21: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]

This affected all s390x configs I test. fs/btrfs force enables W=1 so we
get these. This is known and had a solution rejected at pull time:

https://github.com/ClangBuiltLinux/linux/issues/1285
https://lore.kernel.org/r/[email protected]/
https://lore.kernel.org/r/CAK8P3a2oZ-+qd3Nhpy9VVXCJB3DU5N-y-ta2JpP0t6NHh=GVXw@mail.gmail.com/
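
For anyone not familiar with this warning, the pattern behind it is roughly
"null pointer constant plus offset". The sketch below is a userspace
stand-in, not the code from asm-generic/io.h: FAKE_IOBASE and port_to_ptr()
are hypothetical names, and the integer-based variant is only one possible
way to express the same address computation without pointer arithmetic on
null.

```c
#include <stdint.h>

/* Hypothetical stand-in for an I/O base that is defined as a null pointer. */
#define FAKE_IOBASE ((unsigned char *)0)

/*
 * The flagged pattern looks like this (left as a comment on purpose):
 *
 *	unsigned char *p = FAKE_IOBASE + port;
 *
 * Arithmetic on a null pointer is undefined behavior in C, which is what
 * -Wnull-pointer-arithmetic reports.
 */

/*
 * Alternative sketch: do the addition on integers and convert once.
 * The integer-to-pointer conversion is implementation-defined, but no
 * arithmetic is performed on a null pointer.
 */
static inline unsigned char *port_to_ptr(uintptr_t base, unsigned long port)
{
	return (unsigned char *)(base + port);
}
```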


s390x-allmodconfig.log:drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5184) exceeds limit (2048) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]

Same deal as the other dc_calcs.c warnings.



x86_64-allmodconfig-O3.log:drivers/net/ethernet/microchip/sparx5/sparx5_calendar.c:566:5: error: stack frame size (2504) exceeds limit (2048) in function 'sparx5_config_dsm_calendar' [-Werror,-Wframe-larger-than]

Probably aggressive inlining due to testing -O3.

x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:452:13: error: stack frame size (1800) exceeds limit (1280) in function 'dcn_bw_calc_rq_dlg_ttu' [-Werror,-Wframe-larger-than]
x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_rq_dlg_calc_21.c:1657:6: error: stack frame size (1336) exceeds limit (1280) in function 'dml21_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.c:1831:6: error: stack frame size (1352) exceeds limit (1280) in function 'dml30_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_rq_dlg_calc_31.c:1676:6: error: stack frame size (1336) exceeds limit (1280) in function 'dml31_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
x86_64-alpine.log:drivers/vhost/scsi.c:1831:12: error: stack frame size (1320) exceeds limit (1280) in function 'vhost_scsi_release' [-Werror,-Wframe-larger-than]

Another instance where distros lower CONFIG_FRAME_WARN below the 2048
default. Again, none look particularly scary but should still probably
be dealt with.

Cheers,
Nathan

2021-09-07 09:42:24

by Arnd Bergmann

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Tue, Sep 7, 2021 at 1:51 AM Guenter Roeck <[email protected]> wrote:
> On Mon, Sep 06, 2021 at 04:06:04PM -0700, Linus Torvalds wrote:
> > [ Adding some subsystem maintainers ]
> >
> > On Mon, Sep 6, 2021 at 10:06 AM Guenter Roeck <[email protected]> wrote:
> > >
> > > > But hopefully most cases are just "people haven't cared enough" and
> > > > easily fixed.
> > >
> > > We'll see. For my testbed I disabled the new configuration flag
> > > for the time being because its primary focus is boot tests, and
> > > there won't be any boot tests if images fail to build.
> >
> > Sure, reasonable.
> >
> > I've checked a few of the build errors by doing the appropriate cross
> > compiles, and it doesn't seem bad - but it does seem like we have a
> > number of really pointless long-standing warnings that should have
> > been fixed long ago.

I have a tree with fixes for anything that has hit on arm, arm64 or x86.
There are many reasons why some patch never made it in, but usually
it's because I was not persistent about resending the fix when the first
version didn't make it. In other cases I wasn't sure about my own fix.

> > For example, looking at sparc64, there are several build errors due to
> > those warnings now being fatal:
> >
> > - drivers/gpu/drm/ttm/ttm_pool.c:386
> >
> > This is a type mismatch error. It looks like __fls() on sparc64
> > returns 'int'. And the ttm_pool.c code assumes it returns 'unsigned
> > long'.
> > Oddly enough, the very line after that line does "min_t(unsigned
> > int" to get the types in line.
> > So the immediate reason is "sparc64 is different".

arc is the same as sparc here, but everything else uses unsigned long.
We've come a long way in making all those helper functions consistent
in their types, but there are still a number of exceptions.
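
To illustrate why the return type matters, here is a stripped-down sketch
of a type-checked min() next to a min_t()-style variant. These are
simplified stand-ins written for this example (strict_min, cast_min,
highest_bit_int and pick_order are hypothetical names), they rely on GNU C
extensions, and they are not the kernel's actual macros or the ttm_pool.c
code.

```c
/* Type-checked minimum: comparing the two pointers makes the compiler
 * warn when x and y have different types. */
#define strict_min(x, y) ({			\
	__typeof__(x) _sx = (x);		\
	__typeof__(y) _sy = (y);		\
	(void)(&_sx == &_sy);			\
	_sx < _sy ? _sx : _sy; })

/* min_t()-style minimum: both operands are converted to 'type' first. */
#define cast_min(type, x, y) ({			\
	type _cx = (type)(x);			\
	type _cy = (type)(y);			\
	_cx < _cy ? _cx : _cy; })

/* Hypothetical helper that returns 'int', the way __fls() is described
 * for sparc64 in this thread. 'word' must be nonzero. */
static inline int highest_bit_int(unsigned long word)
{
	return (int)(sizeof(word) * 8 - 1) - __builtin_clzl(word);
}

/* Same types on both sides, so strict_min() is quiet here. */
unsigned long smaller_count(unsigned long a, unsigned long b)
{
	return strict_min(a, b);
}

/*
 * strict_min(highest_bit_int(num_pages), max_order) would mix 'int' and
 * 'unsigned int' and trigger the pointer-comparison warning, so the
 * caller states the intended type explicitly instead.
 */
unsigned int pick_order(unsigned long num_pages, unsigned int max_order)
{
	return cast_min(unsigned int, highest_bit_int(num_pages), max_order);
}
```

The explicit cast silences the warning regardless of what the helper
returns on a given architecture, which is why it hides this kind of
inconsistency rather than fixing it.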

> > But the deeper
> > reason seems to be that ttm_pool.c has odd type assumptions. But that
> > warning should have been fixed long ago, either way.
> >
> > Christian/Huang? I get the feeling that both lines in that file
> > should use the min_t(). Hmm?
> >
> > - drivers/input/joystick/analog.c:160
> >
> > #warning Precise timer not defined for this architecture.
> >
> > Unfortunate. I suspect that warning just has to be removed. It has
> > never caused anything to be fixed, it's old to the point of predating
> > the git history. Dmitry?
> >
> My solution would be to just remove the old code (that isn't using ktime)
> including the module parameter that disables it. Sure, we want to be
> backward compatible, but that code is 15+ years old and should really be
> retired.

Agreed. I added a couple of architectures to the #ifdef check over time,
but realistically this driver is only ever used on x86-32 anyway, and
we don't even care about the others here.

If we remove the #else path here, I'd make it "depends on ISA ||
COMPILE_TEST".

> > - at least a couple of stringop-overread errors. Attached is a
> > possible fix for one of them.
> >
> > The stringop overread is odd, because another one of them is
> >
> > fs/qnx4/dir.c: In function ‘qnx4_readdir’:
> > fs/qnx4/dir.c:51:32: error: ‘strnlen’ specified bound 48 exceeds
> > source size 16 [-Werror=stringop-overread]
> > 51 | size = strnlen(de->di_fname, size);
> > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > but I'm not seeing why that one happens on sparc64, but not on arm64
> > or x86-64. There doesn't seem to be anything architecture-specific
> > anywhere in that area.
> >
> > Funky.
> >
> Not really. That is because de->di_fname is always 16 bytes but size
> can be 48 if the node is really a link. The use of de is overloaded
> in that case; de is struct qnx4_inode_entry (where di_fname is 16 bytes)
> but the actual data is struct qnx4_link_info where the name is 48 bytes
> long. A possible fix (compile tested only) is below.
>
> I think the warning/error is only reported with gcc 11.x. Do you possibly
> use an older compiler for x86/arm64 ?
>
> Anyway, below is a partial list of build errors I have seen. Some of
> them are easy to fix (such as the ones due to unused functions),
> but others seem to be tricky.

This one is worse, I think this is the same warning as the one I
reported as a false-positive in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99673
and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578

I submitted a patch in
https://lore.kernel.org/all/[email protected]/

To summarize the problems:

- gcc and clang have different approaches to this type of warning: clang
tries to only produce diagnostics that are 100% reproducible regardless
of compiler internals, while gcc tries to use as much information as it has
to warn about things that may go wrong, including things that it only knows
because of inlining. Making this warning reliable is a variation of the
halting problem, just like the -Wmaybe-uninitialized warnings. The diagnostic
is definitely helpful and I found real bugs because of it, but you can never
be sure that you have found all instances.

- Some of the -Wstringop-overread warnings (and related ones) from gcc are
actually wrong, because the object size is just a heuristic. If you have
multiple overlapping fixed-length fields in a union, gcc-11 picks one of the
union members to determine the size of the string buffer within the structure,
even when the string operation uses a different union member as the output, and
that member has the correct size.
This is also a common problem: when a new warning option gets introduced
first, there are false positives that get fixed in subsequent compiler versions.
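
The overlapping-fields case can be sketched in plain C. This is a minimal
illustration, not the qnx4 code: the struct names, the STATUS_LINK flag and the
padding are assumptions made for the example, and only the 16 and 48 byte name
sizes come from the error message quoted earlier. Giving the two layouts an
explicit union lets each strnlen() bound come from the member that is actually
being read:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SHORT_NAME_MAX 16
#define LONG_NAME_MAX  48
#define STATUS_LINK    0x08	/* hypothetical "this entry is a link" flag */

/* Two layouts sharing one 64-byte slot; status sits at offset 63 in both. */
struct inode_entry {
	char fname[SHORT_NAME_MAX];
	unsigned char pad[47];
	unsigned char status;
};

struct link_info {
	char fname[LONG_NAME_MAX];
	unsigned char pad[15];
	unsigned char status;
};

static_assert(sizeof(struct inode_entry) == sizeof(struct link_info),
	      "both layouts must cover the same slot");

union dir_entry {
	struct inode_entry inode;
	struct link_info link;
};

/* The bound passed to strnlen() always matches the array being read. */
static size_t dir_entry_name_len(const union dir_entry *de)
{
	if (de->inode.status & STATUS_LINK)
		return strnlen(de->link.fname, sizeof(de->link.fname));
	return strnlen(de->inode.fname, sizeof(de->inode.fname));
}
```

With this shape the compiler never sees a 48-byte bound applied to a 16-byte
array, so there is nothing for an object-size heuristic to guess at.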


> alpha.log:arch/alpha/kernel/setup.c:493:13: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]

I sent a couple of fixes for these: this is another false-positive bug in gcc
that made it into the release, and it triggers on every architecture that
accesses the boot parameters at a fixed pointer. The problem is that gcc treats
'(void *)0x12345' the same as 'NULL + 0x12345', and decides that this is an
invalid NULL pointer access, so the array has zero readable bytes.

> alpha.log:drivers/atm/ambassador.c:1747:58: error: passing argument 1 of 'virt_to_bus' discards 'volatile' qualifier from pointer target type [-Werror=discarded-qualifiers]

Surely an alpha specific mistake, though we could fix all those drivers to
drop the 'volatile'.

> alpha.log:drivers/net/ethernet/amd/ni65.c:751:37: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]

Nobody tests ISA drivers on 64-bit architectures...

> alpha.log:drivers/net/hamradio/6pack.c:71:41: error: unsigned conversion from 'int' to 'unsigned char' changes value from '256' to '0' [-Werror=overflow]

This driver is apparently broken for any HZ >= 1024
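
The arithmetic behind that diagnostic can be shown in isolation. A minimal
sketch, assuming the driver stores a value derived from HZ in an 8-bit field
(here HZ / 4, which matches the 256 in the error message when HZ is 1024); the
function names are hypothetical and this is not the 6pack code:

```c
#include <assert.h>
#include <limits.h>

static_assert(CHAR_BIT == 8, "sketch assumes 8-bit unsigned char");

/* Converting to unsigned char keeps only the low 8 bits of the value. */
static unsigned char ticks_truncated(int hz)
{
	return (unsigned char)(hz / 4);
}

/* One possible alternative: clamp instead of silently wrapping. */
static unsigned char ticks_clamped(int hz)
{
	int ticks = hz / 4;

	assert(hz >= 0);
	return ticks > UCHAR_MAX ? UCHAR_MAX : (unsigned char)ticks;
}
```

With hz = 1024 the truncating version yields 0, while hz = 1000 still fits
(250), which is why the problem only shows up on configurations with a high
tick rate.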

> ppc.log:drivers/net/ethernet/cirrus/cs89x0.c:897:41: error: implicit declaration of function 'isa_virt_to_bus' [-Werror=implicit-function-declaration]

My fix is in the network tree.

> riscv32.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: 'cdn_dp_resume' defined but not used [-Werror=unused-function]
> riscv.log:drivers/gpu/drm/rockchip/cdn-dp-core.c:1126:12: error: 'cdn_dp_resume' defined but not used [-Werror=unused-function]

A fix was submitted today, we get at least a dozen of those for each kernel
release, and there is a plan for avoiding them altogether, but it's a giant
treewide change that nobody has managed to tackle.
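
One shape such a change could take (an assumption here, the message does not
describe the plan) is to reference the callback through a compile-time
conditional instead of hiding it behind #ifdef. A minimal sketch in plain C,
with a hypothetical SKETCH_PM_SLEEP switch and hypothetical dev_ops/dev_resume
names; the kernel's pm_ptr() helper is similar in spirit. The function is
always parsed and always referenced, so -Wunused-function has nothing to
report, yet the pointer folds to NULL when the feature is off:

```c
#include <assert.h>
#include <stddef.h>

#ifndef SKETCH_PM_SLEEP
#define SKETCH_PM_SLEEP 0	/* hypothetical stand-in for a Kconfig symbol */
#endif

/* Evaluates to ptr when cond is a non-zero constant, NULL otherwise. */
#define PTR_IF_SKETCH(cond, ptr) ((cond) ? (ptr) : NULL)

struct dev_ops {
	int (*resume)(void *dev);
};

/* Always compiled and type-checked, even when nothing will call it. */
static int dev_resume(void *dev)
{
	assert(dev != NULL);
	return 0;
}

/* Relies on gcc/clang folding the constant conditional in an initializer. */
static const struct dev_ops sketch_ops = {
	.resume = PTR_IF_SKETCH(SKETCH_PM_SLEEP, dev_resume),
};
```

Because the reference is a constant expression, the compiler can still discard
dev_resume() entirely when the switch is 0.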

> s390.log:arch/s390/kernel/syscall.c:168:1: error: '__do_syscall' uses dynamic stack allocation [-Werror]

This is add_random_kstack_offset(). No idea why it doesn't trigger on x86, but
that warning should probably get shut up inside of the macro.

> sparc64.log:arch/sparc/kernel/mdesc.c:647:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
> sparc64.log:arch/sparc/kernel/mdesc.c:692:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
> sparc64.log:arch/sparc/kernel/mdesc.c:719:21: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]

Same as on alpha

Arnd

2021-09-07 09:51:46

by Arnd Bergmann

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Tue, Sep 7, 2021 at 4:32 AM Nathan Chancellor <[email protected]> wrote:
>
> arm32-allmodconfig.log: crypto/wp512.c:782:13: error: stack frame size (1176) exceeds limit (1024) in function 'wp512_process_buffer' [-Werror,-Wframe-larger-than]
> arm32-allmodconfig.log: drivers/firmware/tegra/bpmp-debugfs.c:294:12: error: stack frame size (1256) exceeds limit (1024) in function 'bpmp_debug_show' [-Werror,-Wframe-larger-than]
> arm32-allmodconfig.log: drivers/firmware/tegra/bpmp-debugfs.c:357:16: error: stack frame size (1264) exceeds limit (1024) in function 'bpmp_debug_store' [-Werror,-Wframe-larger-than]
> arm32-allmodconfig.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3043:6: error: stack frame size (1384) exceeds limit (1024) in function 'bw_calcs' [-Werror,-Wframe-larger-than]
> arm32-allmodconfig.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5560) exceeds limit (1024) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]
> arm32-allmodconfig.log: drivers/mtd/chips/cfi_cmdset_0001.c:1872:12: error: stack frame size (1064) exceeds limit (1024) in function 'cfi_intelext_writev' [-Werror,-Wframe-larger-than]
> arm32-allmodconfig.log: drivers/ntb/hw/idt/ntb_hw_idt.c:1041:27: error: stack frame size (1032) exceeds limit (1024) in function 'idt_scan_mws' [-Werror,-Wframe-larger-than]
> arm32-allmodconfig.log: drivers/staging/fbtft/fbtft-core.c:902:12: error: stack frame size (1072) exceeds limit (1024) in function 'fbtft_init_display_from_property' [-Werror,-Wframe-larger-than]
> arm32-allmodconfig.log: drivers/staging/fbtft/fbtft-core.c:992:5: error: stack frame size (1064) exceeds limit (1024) in function 'fbtft_init_display' [-Werror,-Wframe-larger-than]
> arm32-allmodconfig.log: drivers/staging/rtl8723bs/core/rtw_security.c:1288:5: error: stack frame size (1040) exceeds limit (1024) in function 'rtw_aes_decrypt' [-Werror,-Wframe-larger-than]
> arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3043:6: error: stack frame size (1376) exceeds limit (1024) in function 'bw_calcs' [-Werror,-Wframe-larger-than]
> arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5384) exceeds limit (1024) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]
>
> Aside from the dce_calcs.c warnings, these do not seem too bad. I
> believe allmodconfig turns on UBSAN but it could also be aggressive
> inlining by clang. I intend to look at all -Wframe-larger-than warnings
> closely later.

I've had them close to zero in the past, but a couple of new ones came in.

The amdgpu ones are probably not fixable unless they stop using 64-bit floats
in the kernel for random calculations. The crypto/* ones tend to be compiler
bugs, but hard to fix.

> It appears that both Arch Linux and Fedora define CONFIG_FRAME_WARN
> as 1024, below its default of 2048. I am not sure these look particularly
> scary (although there are some that are rather large that need to be
> looked at).

For 64-bit, you usually need 1280 bytes stack space to get a reasonably clean
build, anything that uses more than that tends to be a bug in the code but we
never warned about those by default as the default warning limit in defconfig
is 2048.

I think the distros using 1024 did that because they use a common base config
for 32-bit and 64-bit targets.

> I suspect this is a backend problem because these do not really appear
> in any other configurations (might also be something with a sanitizer?)

Agreed. Someone needs to bisect the .config or the compiler flags to see what
triggers them.

> s390x-defconfig.log: include/asm-generic/io.h:464:31: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:477:61: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:490:61: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:501:33: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:511:59: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:521:59: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:609:20: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:617:20: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:625:20: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:634:21: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:643:21: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
> s390x-defconfig.log: include/asm-generic/io.h:652:21: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]
>
> This affected all s390x configs I test. fs/btrfs force enables W=1 so we
> get these. This is known and had a solution rejected at pull time:
>
> https://github.com/ClangBuiltLinux/linux/issues/1285
> https://lore.kernel.org/r/[email protected]/
> https://lore.kernel.org/r/CAK8P3a2oZ-+qd3Nhpy9VVXCJB3DU5N-y-ta2JpP0t6NHh=GVXw@mail.gmail.com/

I posted a new idea for a patch, but it needs more work. I'm happy to work with
any volunteers that want to help tighten the Kconfig dependencies to ensure that
those drivers are only built on architectures that provide I/O port accesses.

> x86_64-allmodconfig-O3.log:drivers/net/ethernet/microchip/sparx5/sparx5_calendar.c:566:5: error: stack frame size (2504) exceeds limit (2048) in function 'sparx5_config_dsm_calendar' [-Werror,-Wframe-larger-than]
>
> Probably aggressive inlining due to testing -O3.

If inlining causes it, it was already bad without the inlining. It looks like
there are some large arrays on the stack of some of the called functions, so a
driver fix is needed anyway.

> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:452:13: error: stack frame size (1800) exceeds limit (1280) in function 'dcn_bw_calc_rq_dlg_ttu' [-Werror,-Wframe-larger-than]
> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_rq_dlg_calc_21.c:1657:6: error: stack frame size (1336) exceeds limit (1280) in function 'dml21_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.c:1831:6: error: stack frame size (1352) exceeds limit (1280) in function 'dml30_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_rq_dlg_calc_31.c:1676:6: error: stack frame size (1336) exceeds limit (1280) in function 'dml31_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
> x86_64-alpine.log:drivers/vhost/scsi.c:1831:12: error: stack frame size (1320) exceeds limit (1280) in function 'vhost_scsi_release' [-Werror,-Wframe-larger-than]
>
> Another instance where distros lower CONFIG_FRAME_WARN below the 2048
> default. Again, none look particularly scary but should still probably
> be dealt with.

I would argue that they are still scary and should be addressed in the code,
it's just that we don't see them on build bots that use the 2048 byte default.

Arnd

2021-09-07 16:04:02

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/6/21 7:29 PM, Guenter Roeck wrote:
> On 9/6/21 6:12 PM, Linus Torvalds wrote:
>> On Mon, Sep 6, 2021 at 4:49 PM Guenter Roeck <[email protected]> wrote:
>>>
>>>> but I'm not seeing why that one happens on sparc64, but not on arm64
>>>> or x86-64. There doesn't seem to be anything architecture-specific
>>>> anywhere in that area.
>>>>
>>>> Funky.
>>>
>>> Not really. That is because de->di_fname is always 16 bytes but size
>>> can be 48 if the node is really a link. The use of de is overloaded
>>> in that case; de is struct qnx4_inode_entry (where di_fname is 16 bytes)
>>> but the actual data is struct qnx4_link_info where the name is 48 bytes
>>> long. A possible fix (compile tested only) is below.
>>>
>>> I think the warning/error is only reported with gcc 11.x. Do you possibly
>>> use an older compiler for x86/arm64 ?
>>
>> No. Literally the same exact version. All of them are
>>
>>      gcc version 11.2.1 20210728
>>
>> from F34.
>>
>> I suspect it's something about the config - a sparc64 allmodconfig
>> presumably doesn't end up having some of the things x86-64 has enabled
>> (because of different core config parameters), and then optimizes
>> differently as a result and shows the issue that way.
>>
>> Or something. <wild handwaving>
>>
>
> Looks like Arnd stumbled into the qnx4 problem before:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578
>

... and submitted a patch for it:

https://lore.kernel.org/lkml/[email protected]/

Looks like it got lost.

Guenter

2021-09-07 17:45:43

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Tue, Sep 7, 2021 at 2:11 AM Arnd Bergmann <[email protected]> wrote:
>
> > x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:452:13: error: stack frame size (1800) exceeds limit (1280) in function 'dcn_bw_calc_rq_dlg_ttu' [-Werror,-Wframe-larger-than]
> > x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_rq_dlg_calc_21.c:1657:6: error: stack frame size (1336) exceeds limit (1280) in function 'dml21_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
> > x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.c:1831:6: error: stack frame size (1352) exceeds limit (1280) in function 'dml30_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
> > x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_rq_dlg_calc_31.c:1676:6: error: stack frame size (1336) exceeds limit (1280) in function 'dml31_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
> > x86_64-alpine.log:drivers/vhost/scsi.c:1831:12: error: stack frame size (1320) exceeds limit (1280) in function 'vhost_scsi_release' [-Werror,-Wframe-larger-than]
> >
> > Another instance where distros lower CONFIG_FRAME_WARN below the 2048
> > default. Again, none look particularly scary but should still probably
> > be dealt with.
>
> I would argue that they are still scary and should be addressed in the
> code, it's just that we don't see them on build bots that use the 2048 byte default.

No, they are scary for another reason entirely: clang is clearly doing
a *HORRIBLE* job with stack usage.

To take that dml30_rq_dlg_get_dlg_reg() function as an example: yes,
it has a few structures on the stack, but gcc allocates 512-720 bytes
of stack space depending on my config.

Not 1280 bytes.

So it's not even *close* to the 1024 byte limit with gcc, much less the 2kB one.

I don't know why clang basically decides to use almost double the
stack space. Maybe it's some other config option that does it, I tried
a fairly normal one and an "almost everything enabled" one, and
couldn't get close to the reported stack frame size with gcc.

Just to try to make things as close as possible, I tried with the
exact same normal non-debug config (apart from obvious
compiler-dependent things), and picked that dml30_rq_dlg_get_dlg_reg()
function to look at (for no real reason other than that the stack
frame was biggest above).

Gcc did a 720-byte stack frame for that case. Not great, but whatever.

clang did a 1136-byte stack frame for the same thing.

Do I know why? No. I do note that that code is disgusting.

It's passing one of those structs around by value, for example. That's
a 72-byte structure that is copied on the stack due to stupid calling
conventions. Maybe clang generates a few extra temporaries for it as
part of the function call stack setup? Who knows..

Linus

2021-09-07 17:49:42

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Tue, Sep 7, 2021 at 10:10 AM Linus Torvalds
<[email protected]> wrote:
>
> Do I know why? No. I do note that that code is disgusting.
>
> It's passing one of those structs around by value, for example. That's
> a 72-byte structure that is copied on the stack due to stupid calling
> conventions. Maybe clang generates a few extra temporaries for it as
> part of the function call stack setup? Who knows..

Ooh, yes.

This attached patch is crap - it converts the helper functions to use
const pointers instead of passing the whole structure, but it then
only converts that one file that *uses* them.

So the end result will not compile in general, but you can do

make drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.o

and it compiles for me.

And while gcc doesn't care that much - it will apparently either
generate the argument stack every call - clang cares deeply.

The nasty 1136-byte stack frame that clang generates turns into just a
320-byte one, and code generation in general looks a *lot* better.

Now, as mentioned, this patch is broken and incomplete. But I really
think the AMD GPU people need to do this. It makes those functions go
from practically unusable to not horribly disgusting.

So Harry/Leo/Alex/Christian and amd-gfx list - can you look into
making this ugly "make one file compile better" patch actually work
properly?

It *looks* to me like that code was perhaps written for a C++ compiler
and the helpers have been written as a "pass by reference", and the
arguments used to be

const display_data_rq_misc_params_st& rq_misc_param

and then the compiler will pass the argument as a pointer. And then it
was converted to C, and the "pass by reference" in the function
declaration was turned into "pass by value", to avoid changing "." to
"->" in the use.

But a '&arg' thing in C++ really is a '*arg' pointer in C, and should
have been done as that.

Of course, it's also possible that that code was simply written by
somebody who didn't understand just *how* horrible it is to pass
structures bigger than a word or two by value.
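
The difference between the two calling styles can be sketched in a few lines
of C. The struct below is hypothetical (nine doubles, 72 bytes where a double
is 8 bytes, chosen only to match the size mentioned earlier) and the helper
names are made up; this is not the AMD display code:

```c
#include <assert.h>
#include <stddef.h>

struct misc_params {
	double v[9];
};

/* By value: the caller typically has to build a full copy for each call. */
static double total_by_value(struct misc_params p)
{
	double sum = 0.0;

	for (size_t i = 0; i < sizeof(p.v) / sizeof(p.v[0]); i++)
		sum += p.v[i];
	return sum;
}

/* By const pointer: only an address is passed; "." becomes "->" in the body. */
static double total_by_pointer(const struct misc_params *p)
{
	double sum = 0.0;

	assert(p != NULL);
	for (size_t i = 0; i < sizeof(p->v) / sizeof(p->v[0]); i++)
		sum += p->v[i];
	return sum;
}
```

Both helpers compute the same result; the pointer version simply avoids the
per-call copy, which is the conversion the "." to "->" change amounts to.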

Do we have a compiler warning for passing big structures by value?

Linus


Attachments:
patch.diff (17.03 kB)

2021-09-07 18:52:17

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/7/21 10:10 AM, Linus Torvalds wrote:
> On Tue, Sep 7, 2021 at 2:11 AM Arnd Bergmann <[email protected]> wrote:
>>
>>> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:452:13: error: stack frame size (1800) exceeds limit (1280) in function 'dcn_bw_calc_rq_dlg_ttu' [-Werror,-Wframe-larger-than]
>>> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_rq_dlg_calc_21.c:1657:6: error: stack frame size (1336) exceeds limit (1280) in function 'dml21_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
>>> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.c:1831:6: error: stack frame size (1352) exceeds limit (1280) in function 'dml30_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
>>> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_rq_dlg_calc_31.c:1676:6: error: stack frame size (1336) exceeds limit (1280) in function 'dml31_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
>>> x86_64-alpine.log:drivers/vhost/scsi.c:1831:12: error: stack frame size (1320) exceeds limit (1280) in function 'vhost_scsi_release' [-Werror,-Wframe-larger-than]
>>>

FWIW, the above is because of

static void vhost_scsi_flush(struct vhost_scsi *vs)
{
        struct vhost_scsi_inflight *old_inflight[VHOST_SCSI_MAX_VQ];

where VHOST_SCSI_MAX_VQ=128. Presumably some versions of clang inline
this function. gcc has the same problem here - its stack frame size is
also > 1024 for vhost_scsi_flush().
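
The size of that array follows directly from the declaration: 128 pointers at
8 bytes each is 1024 bytes on a 64-bit target, before any other locals. A
minimal sketch of the arithmetic, plus one possible way to keep such an array
off the stack; the names are hypothetical and the heap variant is an
illustration, not a proposed patch for vhost:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MAX_VQ 128

struct inflight;	/* opaque; only pointers to it are stored */

/* Bytes an on-stack "struct inflight *old[SKETCH_MAX_VQ]" would occupy. */
static size_t old_inflight_bytes(void)
{
	return sizeof(struct inflight *[SKETCH_MAX_VQ]);
}

/* Alternative: take the array from the heap so the frame stays small. */
static int flush_with_heap_array(size_t nvq)
{
	struct inflight **old;

	assert(nvq <= SKETCH_MAX_VQ);
	old = calloc(nvq ? nvq : 1, sizeof(*old));
	if (!old)
		return -1;
	/* ... collect and wait for the old inflight entries here ... */
	free(old);
	return 0;
}
```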

Guenter

2021-09-07 19:13:39

by Nathan Chancellor

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/7/2021 10:48 AM, Guenter Roeck wrote:
> On 9/7/21 10:10 AM, Linus Torvalds wrote:
>> On Tue, Sep 7, 2021 at 2:11 AM Arnd Bergmann <[email protected]> wrote:
>>>
>>>> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:452:13:
>>>> error: stack frame size (1800) exceeds limit (1280) in function
>>>> 'dcn_bw_calc_rq_dlg_ttu' [-Werror,-Wframe-larger-than]
>>>> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_rq_dlg_calc_21.c:1657:6:
>>>> error: stack frame size (1336) exceeds limit (1280) in function
>>>> 'dml21_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
>>>> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.c:1831:6:
>>>> error: stack frame size (1352) exceeds limit (1280) in function
>>>> 'dml30_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
>>>> x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn31/display_rq_dlg_calc_31.c:1676:6:
>>>> error: stack frame size (1336) exceeds limit (1280) in function
>>>> 'dml31_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
>>>> x86_64-alpine.log:drivers/vhost/scsi.c:1831:12: error: stack frame
>>>> size (1320) exceeds limit (1280) in function 'vhost_scsi_release'
>>>> [-Werror,-Wframe-larger-than]
>>>>
>
> FWIW, the above is because of
>
> static void vhost_scsi_flush(struct vhost_scsi *vs)
> {
>         struct vhost_scsi_inflight *old_inflight[VHOST_SCSI_MAX_VQ];
>
> where VHOST_SCSI_MAX_VQ=128. Presumably some versions of clang inline
> this function. gcc has the same problem here - its stack frame size is
> also > 1024 for vhost_scsi_flush().

Good to know. When investigating these, I intend to compare against GCC
to see what the difference is to know if it is a problem with the code
or a compiler issue.

Cheers,
Nathan

2021-09-07 21:10:27

by Harry Wentland

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds



On 2021-09-07 1:33 p.m., Linus Torvalds wrote:
> On Tue, Sep 7, 2021 at 10:10 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> Do I know why? No. I do note that that code is disgusting.
>>
>> It's passing one of those structs around by value, for example. That's
>> a 72-byte structure that is copied on the stack due to stupid calling
>> conventions. Maybe clang generates a few extra temporaries for it as
>> part of the function call stack setup? Who knows..
>
> Ooh, yes.
>
> This attached patch is crap - it converts the helper functions to use
> const pointers instead of passing the whole structure, but it then
> only converts that one file that *uses* them.
>
> So the end result will not compile in general, but you can do
>
> make drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.o
>
> and it compiles for me.
>
> And while gcc doesn't care that much - it will apparently either
> generate the argument stack every call - clang cares deeply.
>
> The nasty 1136-byte stack frame that clang generates turns into just a
> 320-byte one, and code generation in general looks a *lot* better.
>
> Now, as mentioned, this patch is broken and incomplete. But I really
> think the AMD GPU people need to do this. It makes those functions go
> from practically unusable to not horribly disgusting.
>
> So Harry/Leo/Alex/Christian and amd-gfx list - can you look into
> making this ugly "make one file compile better" patch actually work
> properly?
>

Yes, will take a look at this tonight. We definitely shouldn't be passing
large structs by value.

Harry

> It *looks* to me like that code was perhaps written for a C++ compiler
> and the helpers have been written as a "pass by reference", and the
> arguments used to be
>
> const display_data_rq_misc_params_st& rq_misc_param
>
> and then the compiler will pass the argument as a pointer. And then it
> was converted to C, and the "pass by reference" in the function
> declaration was turned into "pass by value", to avoid changing "." to
> "->" in the use.
>
> But a '&arg' thing in C++ really is a '*arg' pointer in C, and should
> have been done as that.
>
> Of course, it's also possible that that code was simply written by
> somebody who didn't understand just *how* horrible it is to pass
> structures bigger than a word or two by value.
>
> Do we have a compiler warning for passing big structures by value?
>
> Linus
>

2021-09-08 04:31:19

by Harry Wentland

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds



On 2021-09-07 5:07 p.m., Harry Wentland wrote:
>
>
> On 2021-09-07 1:33 p.m., Linus Torvalds wrote:
>> On Tue, Sep 7, 2021 at 10:10 AM Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> Do I know why? No. I do note that that code is disgusting.
>>>
>>> It's passing one of those structs around by value, for example. That's
>>> a 72-byte structure that is copied on the stack due to stupid calling
>>> conventions. Maybe clang generates a few extra temporaries for it as
>>> part of the function call stack setup? Who knows..
>>
>> Ooh, yes.
>>
>> This attached patch is crap - it converts the helper functions to use
>> const pointers instead of passing the whole structure, but it then
>> only converts that one file that *uses* them.
>>
>> So the end result will not compile in general, but you can do
>>
>> make drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.o
>>
>> and it compiles for me.
>>
>> And while gcc doesn't care that much - it will apparently either
>> generate the argument stack every call - clang cares deeply.
>>
>> The nasty 1136-byte stack frame that clang generates turns into just a
>> 320-byte one, and code generation in general looks a *lot* better.
>>
>> Now, as mentioned, this patch is broken and incomplete. But I really
>> think the AMD GPU people need to do this. It makes those functions go
>> from practically unusable to not horribly disgusting.
>>
>> So Harry/Leo/Alex/Christian and amd-gfx list - can you look into
>> making this ugly "make one file compile better" patch actually work
>> properly?
>>
>
> Yes, will take a look at this tonight. We definitely shouldn't be passing
> large structs by value.
>

Attached patches fix these x86_64 ones reported by Nick:

x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dcn_calcs.c:452:13: error: stack frame size (1800) exceeds limit (1280) in function 'dcn_bw_calc_rq_dlg_ttu' [-Werror,-Wframe-larger-than]
x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_rq_dlg_calc_21.c:1657:6: error: stack frame size (1336) exceeds limit (1280) in function 'dml21_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]
x86_64-alpine.log:drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn30/display_rq_dlg_calc_30.c:1831:6: error: stack frame size (1352) exceeds limit (1280) in function 'dml30_rq_dlg_get_dlg_reg' [-Werror,-Wframe-larger-than]

I'm also seeing one more that might be more challenging to fix but is nearly at 1024:

drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_mode_vba_21.c:3397:6: error: stack frame size of 1064 bytes in function 'dml21_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than=]

The attached patches build and boot without error or warning on a Radeon RX 5500 XT.

Harry

> Harry
>
>> It *looks* to me like that code was perhaps written for a C++ compiler
>> and the helpers have been written as a "pass by reference", and the
>> arguments used to be
>>
>> const display_data_rq_misc_params_st& rq_misc_param
>>
>> and then the compiler will pass the argument as a pointer. And then it
>> was converted to C, and the "pass by reference" in the function
>> declaration was turned into "pass by value", to avoid changing "." to
>> "->" in the use.
>>
>> But a '&arg' thing in C++ really is a '*arg' pointer in C, and should
>> have been done as that.
>>
>> Of course, it's also possible that that code was simply written by
>> somebody who didn't understand just *how* horrible it is to pass
>> structures bigger than a word or two by value.
>>
>> Do we have a compiler warning for passing big structures by value?
>>
>> Linus
>>
>


Attachments:
0001-drm-amd-display-Pass-display_pipe_params_st-as-const.patch (30.44 kB)
0002-drm-amd-display-Allocate-structs-needed-by-dcn_bw_ca.patch (22.48 kB)

2021-09-08 04:43:09

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Mon, Sep 06, 2021 at 04:49:21PM -0700, Guenter Roeck wrote:
> On Mon, Sep 06, 2021 at 04:06:04PM -0700, Linus Torvalds wrote:
[ ... ]

> > - at least a couple of stringop-overread errors. Attached is a
> > possible fix for one of them.
> >

I keep seeing problems like this.

drivers/net/ethernet/i825xx/82596.c: In function 'i82596_probe':
./arch/m68k/include/asm/string.h:72:25: error: '__builtin_memcpy' reading 6 bytes from a region of size 0 [-Werror=stringop-overread]
72 | #define memcpy(d, s, n) __builtin_memcpy(d, s, n)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/i825xx/82596.c:1147:17: note: in expansion of macro 'memcpy'
1147 | memcpy(eth_addr, (void *) 0xfffc1f2c, ETH_ALEN); /* YUCK! Get addr from NOVRAM */
| ^~~~~~
cc1: all warnings being treated as errors

It is seen with gcc 11.x whenever a memXXX or strXXX function parameter
is a pointer to a fixed address. gcc is happy if "(void *) 0xfffc1f2c"
is passed to a global function which does nothing but return the address,
such as:

void *sanitize_address(void *address)
{
        return address;
}

and:

memcpy(eth_addr, sanitize_address((void *) 0xfffc1f2c), ETH_ALEN);

but that just seems weird. Is there a better solution ?

Thanks,
Guenter
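
A different way to hide the origin of such a pointer from the optimizer,
without an out-of-line function, is an empty inline asm statement. This is a
minimal sketch under stated assumptions: it relies on the GNU C asm extension
(gcc and clang), the helper names are hypothetical, and whether it silences
this particular gcc 11 diagnostic in every configuration has not been verified
here. The kernel's OPTIMIZER_HIDE_VAR() macro is built on the same idea.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/*
 * The empty asm claims to modify p, so the compiler can no longer trace
 * the returned pointer back to a literal address.
 */
static inline const void *opaque_ptr(uintptr_t addr)
{
	const void *p = (const void *)addr;

	__asm__("" : "+r"(p));
	return p;
}

/* Copy a 6-byte hardware address from a fixed location (src_addr). */
static void read_eth_addr(unsigned char *dst, uintptr_t src_addr)
{
	assert(dst != NULL);
	memcpy(dst, opaque_ptr(src_addr), ETH_ALEN);
}
```

In a driver the call would pass the fixed address as an integer, for example
read_eth_addr(eth_addr, 0xfffc1f2c), instead of casting it to a pointer at the
call site.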

2021-09-08 04:44:11

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Tue, Sep 7, 2021 at 8:52 PM Harry Wentland <[email protected]> wrote:
>
> Attached patches fix these x86_64 ones reported by Nick:

Hmm.

You didn't seem to fix up the calling convention for print__xyz(),
which still take those xyz structs as pass-by-value.

Obviously it would be good to do things incrementally, so if that
attached patch was just [1/N] I won't complain..

> I'm also seeing one more that might be more challenging to fix but is nearly at 1024:
>
> drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_mode_vba_21.c:3397:6: error: stack frame size of 1064 bytes in function 'dml21_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than=]

Oh Gods, that function is truly something else..

Is there some reason why it's one humongous function, with the
occasional single-line comment?

Because it really looks to me like pretty much everywhere I see one of
those rare comments, I would go "this part should be a function of its
own", and then there would be one caller function that just calls
each of those sub-functions one after the other.

That would - I think - make the code easier to read, and then it would
also make it very obvious where it magically uses a lot of stack.
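
Such a restructuring might look roughly like this. It is a sketch only: the
config struct, the two checks, and their thresholds are invented for
illustration and have nothing to do with the real display-mode code. Each
stage keeps its temporaries in its own frame, and the top-level function only
sequences the calls:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mode_cfg {
	unsigned int width;
	unsigned int height;
	unsigned int refresh_hz;
};

/* Stage 1: its temporaries live only for the duration of this call. */
static bool check_bandwidth(const struct mode_cfg *cfg)
{
	unsigned long long pixels_per_sec =
		(unsigned long long)cfg->width * cfg->height * cfg->refresh_hz;

	return pixels_per_sec <= 600000000ULL;	/* made-up limit */
}

/* Stage 2: independent of stage 1, so it is readable on its own. */
static bool check_prefetch(const struct mode_cfg *cfg)
{
	unsigned int lines_needed = cfg->height / 64 + 1;

	return lines_needed <= 64;		/* made-up limit */
}

/* The caller just runs the stages one after the other. */
static bool mode_supported(const struct mode_cfg *cfg)
{
	assert(cfg != NULL);
	return check_bandwidth(cfg) && check_prefetch(cfg);
}
```

A compiler is still free to inline small static helpers back into the caller;
the kernel's noinline_for_stack annotation exists for cases where that would
defeat the purpose.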

My suspicion is actually "nowhere". The stack use is just hugely
spread out, and the compiler has just kept accumulating more spill
variables on the frame with no single big reason.

Yes, I see a couple of local structures:

Pipe myPipe;
HostVM myHostVM;

but more than that I see several function calls that have basically 62
arguments. And I wish I was making that number up. I'm not. That
"CalculatePrefetchSchedule()" call literally has 62 arguments.

But *all* of the top-level loops in that function literally look like
they could - and should - be functions in their own right. Some of
them would be fairly complex even so (ie that code under the comment

//Prefetch Check

would be quite the big function all of its own).

We have a coding style thing:

Documentation/process/coding-style.rst

that says that you should strive to have functions that are "short and
sweet" and fit on one or two screenfuls of text.

That one function from hell is 1832 lines of code.

It really could be improved upon.
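
A rough sketch of the restructuring described above, under stated assumptions: the context struct, its fields, and the step functions are all hypothetical and far simpler than the real display-mode code. The idea is that each former top-level loop becomes its own function, and shared state travels through one pointer rather than through dozens of arguments.

```c
#define MAX_PIPES 4

/* Hypothetical shared state, passed by pointer instead of as many arguments. */
struct mode_ctx {
	int num_pipes;
	int required_bw[MAX_PIPES];
	int available_bw;
	int total_bw;
	int prefetch_ok;
};

/* One former top-level loop, now a function with its own small frame. */
static __attribute__((noinline)) void sum_required_bandwidth(struct mode_ctx *ctx)
{
	ctx->total_bw = 0;
	for (int i = 0; i < ctx->num_pipes; i++)
		ctx->total_bw += ctx->required_bw[i];
}

/* Stand-in for the code under the "Prefetch Check" comment. */
static __attribute__((noinline)) void prefetch_check(struct mode_ctx *ctx)
{
	ctx->prefetch_ok = ctx->total_bw <= ctx->available_bw;
}

/* The caller just runs the steps one after the other. */
int mode_support_full(struct mode_ctx *ctx)
{
	sum_required_bandwidth(ctx);
	prefetch_check(ctx);
	return ctx->prefetch_ok;
}
```

The noinline attribute is there so the compiler does not fold the helpers back into the caller and merge their frames again; in kernel code the existing noinline_for_stack annotation serves that purpose. With the steps separated, a per-function frame-size report shows which step, if any, is the expensive one.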

Linus

2021-09-08 05:00:17

by Al Viro

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Tue, Sep 07, 2021 at 09:28:38PM -0700, Guenter Roeck wrote:

> memcpy(eth_addr, sanitize_address((void *) 0xfffc1f2c), ETH_ALEN);
>
> but that just seems weird. Is there a better solution ?

(char (*)[ETH_ALEN])? Said that, shouldn't that be doing something like
ioremap(), rather than casting explicit constants?

2021-09-08 05:15:44

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/7/21 9:48 PM, Al Viro wrote:
> On Tue, Sep 07, 2021 at 09:28:38PM -0700, Guenter Roeck wrote:
>
>> memcpy(eth_addr, sanitize_address((void *) 0xfffc1f2c), ETH_ALEN);
>>
>> but that just seems weird. Is there a better solution ?
>
> (char (*)[ETH_ALEN])? Said that, shouldn't that be doing something like
> ioremap(), rather than casting explicit constants?
>

Typecasts or even assigning the address to a variable does not help.
The sanitizer function can not be static either.

I don't know the hardware, so I can not answer the ioremap() question.

This is just one example, though; there are several sprinkled throughout
the code. Another is:

arch/parisc/kernel/setup.c: running_on_qemu = (memcmp(&PAGE0->pad0, "SeaBIOS", 8) == 0);

where

#define PAGE0 ((struct zeropage *)__PAGE_OFFSET)

and __PAGE_OFFSET depends on the configuration.

That code runs early in setup; I don't think ioremap() would even
be available at that time. A workaround for that problem would be
a global variable pointing to PAGE0 (or of course an address sanitizer
function), but again that seems odd just to make the compiler happy.

Guenter

2021-09-08 05:26:39

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Tue, Sep 7, 2021 at 9:28 PM Guenter Roeck <[email protected]> wrote:
>
> It is seen with gcc 11.x whenever a memXXX or strXXX function parameter
> is a pointer to a fixed address.

I wonder why I don't see it with gcc 11.2 here on x86-64.

> gcc is happy if "(void *) 0xfffc1f2c"
> is passed to a global function which does nothing but return the address,
> such as:
>
> void *sanitize_address(void *address)
> {
> return address;
> }

We have had reasons to do things like that before for somewhat similar
(well, opposite) reasons - trying to disassociate some pointer from
its originating symbol type.

Look at RELOC_HIDE().

It might be worth it having something similar for "absolute_pointer()".

Entirely untested "written-in-the-MUA" garbage:

#define absolute_pointer(val) \
({ void *__res; __asm__("":"=r" (__res):"0" ((unsigned long)(val))); __res; })

I dunno.
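
A userspace sketch of how such a macro could be exercised, assuming gcc or clang (it relies on GNU statement expressions and inline asm). The cast uses uintptr_t instead of unsigned long so the sketch stays portable outside the kernel, and copy_from_address() is a hypothetical helper added only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Same shape as the macro proposed above: the empty asm takes the value in a
 * register and hands it back, so the optimizer can no longer tell that the
 * resulting pointer came from an integer constant.
 */
#define absolute_pointer(val) \
	({ void *__res; __asm__("" : "=r" (__res) : "0" ((uintptr_t)(val))); __res; })

/* Hypothetical helper: copy 'len' bytes starting at a raw address. */
void copy_from_address(void *dst, uintptr_t addr, size_t len)
{
	memcpy(dst, absolute_pointer(addr), len);
}
```

In the driver case the call would presumably look like memcpy(eth_addr, absolute_pointer(0xfffc1f2c), ETH_ALEN). The empty asm emits no instructions, so the only cost is that the compiler loses its knowledge of the pointer's origin.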

Linus

2021-09-08 06:19:11

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/7/21 9:55 PM, Linus Torvalds wrote:
> On Tue, Sep 7, 2021 at 9:28 PM Guenter Roeck <[email protected]> wrote:
>>
>> It is seen with gcc 11.x whenever a memXXX or strXXX function parameter
>> is a pointer to a fixed address.
>
> I wonder why I don't see it with gcc 11.2 here on x86-64.
>

I see the problem only on some architectures. No idea what triggers it,
but it is definitely architecture dependent.

>> gcc is happy if "(void *) 0xfffc1f2c"
>> is passed to a global function which does nothing but return the address,
>> such as:
>>
>> void *sanitize_address(void *address)
>> {
>> return address;
>> }
>
> We have had reasons to do things like that before for somewhat similar
> (well, opposite) reasons - trying to disassociate some pointer from
> its originating symbol type.
>
> Look at RELOC_HIDE().
>
> It might be worth it having something similar for "absolute_pointer()".
>
> Entirely untested "written-in-the-MUA" garbage:
>
> #define absolute_pointer(val) \
> ({ void *__res; __asm__("":"=r" (__res):"0" ((unsigned long)(val))); __res; })
>

or:

#define absolute_pointer(val) RELOC_HIDE(val, 0)

or maybe:

#define absolute_pointer(val) RELOC_HIDE((void *)val, 0)

would do the same (though the first variant needs a pointer as argument).
All of those compile.

I tested the first and the last option on qemu:parisc and confirmed that
both work as expected.

I'd be happy to send a formal patch. Which one do you prefer, and where
should I put it ?

Thanks,
Guenter

2021-09-08 07:14:37

by Geert Uytterhoeven

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Wed, Sep 8, 2021 at 7:16 AM Guenter Roeck <[email protected]> wrote:
> On 9/7/21 9:48 PM, Al Viro wrote:
> > On Tue, Sep 07, 2021 at 09:28:38PM -0700, Guenter Roeck wrote:
> >> memcpy(eth_addr, sanitize_address((void *) 0xfffc1f2c), ETH_ALEN);
> >>
> >> but that just seems weird. Is there a better solution ?
> >
> > (char (*)[ETH_ALEN])? Said that, shouldn't that be doing something like
> > ioremap(), rather than casting explicit constants?
>
> Typecasts or even assigning the address to a variable does not help.
> The sanitizer function can not be static either.

So it can only be fixed by obfuscating the constant address in a
chain of out-of-line functions...
How is this compiler to be used for bare-metal programming?

> I don't know the hardware, so I can not answer the ioremap() question.

Yes it should. But this driver dates back to 2.1.110, when only
half of the architectures already had ioremap().

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2021-09-08 09:53:14

by Arnd Bergmann

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Wed, Sep 8, 2021 at 9:49 AM Geert Uytterhoeven <[email protected]> wrote:
> On Wed, Sep 8, 2021 at 7:16 AM Guenter Roeck <[email protected]> wrote:
> > On 9/7/21 9:48 PM, Al Viro wrote:
> > > On Tue, Sep 07, 2021 at 09:28:38PM -0700, Guenter Roeck wrote:
> > >> memcpy(eth_addr, sanitize_address((void *) 0xfffc1f2c), ETH_ALEN);
> > >>
> > >> but that just seems weird. Is there a better solution ?
> > >
> > > (char (*)[ETH_ALEN])? Said that, shouldn't that be doing something like
> > > ioremap(), rather than casting explicit constants?
> >
> > Typecasts or even assigning the address to a variable does not help.
> > The sanitizer function can not be static either.
>
> So it can only be fixed by obfuscating the constant address in a
> chain of out-of-line functions...
> How is this compiler to be used for bare-metal programming?

I reported this as a gcc bug when I first saw it back in March:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578

Martin Sebor suggested marking the pointer as 'volatile' as a workaround,
which is probably fine for bare-metal programming, but I would consider
that bad style for the kernel boot arguments. The RELOC_HIDE trick is probably
fine here, as there are only a couple of instances, and for the network
driver, using volatile is probably appropriate as well.

I still hope this can be fixed in a future gcc-11.x release. Maybe we should
add further instances of the problem on the gcc bug to boost the priority?

> > I don't know the hardware, so I can not answer the ioremap() question.
>
> Yes it should. But this driver dates back to 2.1.110, when only
> half of the architectures already had ioremap().

How does mvme16x even create the mapping? Is this a virtual address
that is hardwired to the bus or do you have a static mapping somewhere?
I see two other drivers accessing the nvram here

arch/m68k/mvme16x/config.c:static MK48T08ptr_t volatile rtc = (MK48T08ptr_t)MVME_RTC_BASE;
arch/m68k/mvme16x/rtc.c: volatile MK48T08ptr_t rtc = (MK48T08ptr_t)MVME_RTC_BASE;

The same trick should work here, just create a local variable with a
volatile pointer and read from that.

Arnd

2021-09-08 10:20:59

by Geert Uytterhoeven

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

Hi Arnd,

On Wed, Sep 8, 2021 at 11:50 AM Arnd Bergmann <[email protected]> wrote:
> On Wed, Sep 8, 2021 at 9:49 AM Geert Uytterhoeven <[email protected]> wrote:
> > On Wed, Sep 8, 2021 at 7:16 AM Guenter Roeck <[email protected]> wrote:
> > > On 9/7/21 9:48 PM, Al Viro wrote:
> > > > On Tue, Sep 07, 2021 at 09:28:38PM -0700, Guenter Roeck wrote:
> > > >> memcpy(eth_addr, sanitize_address((void *) 0xfffc1f2c), ETH_ALEN);
> > > >>
> > > >> but that just seems weird. Is there a better solution ?
> > > >
> > > > (char (*)[ETH_ALEN])? Said that, shouldn't that be doing something like
> > > > ioremap(), rather than casting explicit constants?
> > >
> > > Typecasts or even assigning the address to a variable does not help.
> > > The sanitizer function can not be static either.
> >
> > So it can only be fixed by obfuscating the constant address in a
> > chain of out-of-line functions...
> > How is this compiler to be used for bare-metal programming?
>
> I reported this as a gcc bug when I first saw it back in March:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578
>
> Martin Sebor suggested marking the pointer as 'volatile' as a workaround,
> which is probably fine for bare-metal programming, but I would consider
> that bad style for the kernel boot arguments. The RELOC_HIDE trick is probably
> fine here, as there are only a couple of instances, and for the network
> driver, using volatile is probably appropriate as well.

Yeah, volatile should be fine for drivers.
In fact this is one of the few places where I/O registers are accessed
without involving volatile.

> I still hope this can be fixed in a future gcc-11.x release. Maybe we should
> add further instances of the problem on the gcc bug to boost the priority?
>
> > > I don't know the hardware, so I can not answer the ioremap() question.
> >
> > Yes it should. But this driver dates back to 2.1.110, when only
> > half of the architectures already had ioremap().
>
> How does mvme16x even create the mapping? Is this a virtual address
> that is hardwired to the bus or do you have a static mapping somewhere?

It's part of the transparent mapping of the top address space for
I/O devices in arch/m68k/kernel/head.S.

Gr{oetje,eeting}s,

Geert

2021-09-08 12:22:55

by Geert Uytterhoeven

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Wed, Sep 8, 2021 at 11:50 AM Arnd Bergmann <[email protected]> wrote:
> On Wed, Sep 8, 2021 at 9:49 AM Geert Uytterhoeven <[email protected]> wrote:
> > On Wed, Sep 8, 2021 at 7:16 AM Guenter Roeck <[email protected]> wrote:
> > > On 9/7/21 9:48 PM, Al Viro wrote:
> > > > On Tue, Sep 07, 2021 at 09:28:38PM -0700, Guenter Roeck wrote:
> > > >> memcpy(eth_addr, sanitize_address((void *) 0xfffc1f2c), ETH_ALEN);
> > > >>
> > > >> but that just seems weird. Is there a better solution ?
> > > >
> > > > (char (*)[ETH_ALEN])? Said that, shouldn't that be doing something like
> > > > ioremap(), rather than casting explicit constants?
> > >
> > > Typecasts or even assigning the address to a variable does not help.
> > > The sanitizer function can not be static either.
> >
> > So it can only be fixed by obfuscating the constant address in a
> > chain of out-of-line functions...
> > How is this compiler to be used for bare-metal programming?
>
> I reported this as a gcc bug when I first saw it back in March:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578
>
> Martin Sebor suggested marking the pointer as 'volatile' as a workaround,
> which is probably fine for bare-metal programming, but I would consider
> that bad style for the kernel boot arguments. The RELOC_HIDE trick is probably
> fine here, as there are only a couple of instances, and for the network
> driver, using volatile is probably appropriate as well.

A related one, I guess, is:

arch/m68k/include/asm/string.h:72:25: error: argument 2 null where non-null expected [-Werror=nonnull]
72 | #define memcpy(d, s, n) __builtin_memcpy(d, s, n)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/staging/vc04_services/interface/vchiq_arm/vchiq_arm.c:387:4: note: in expansion of macro ‘memcpy’
387 | memcpy((char *)kmap(pages[0]) +
| ^~~~~~

Seen with my sun3-allmodconfig build.
As NO_DMA=y, dmam_alloc_coherent() returns NULL, and the compiler
discovers that g_fragments_base is never assigned to and thus must
be NULL.

Gr{oetje,eeting}s,

Geert

2021-09-08 13:07:05

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/8/21 2:50 AM, Arnd Bergmann wrote:
> On Wed, Sep 8, 2021 at 9:49 AM Geert Uytterhoeven <[email protected]> wrote:
>> On Wed, Sep 8, 2021 at 7:16 AM Guenter Roeck <[email protected]> wrote:
>>> On 9/7/21 9:48 PM, Al Viro wrote:
>>>> On Tue, Sep 07, 2021 at 09:28:38PM -0700, Guenter Roeck wrote:
>>>>> memcpy(eth_addr, sanitize_address((void *) 0xfffc1f2c), ETH_ALEN);
>>>>>
>>>>> but that just seems weird. Is there a better solution ?
>>>>
>>>> (char (*)[ETH_ALEN])? Said that, shouldn't that be doing something like
>>>> ioremap(), rather than casting explicit constants?
>>>
>>> Typecasts or even assigning the address to a variable does not help.
>>> The sanitizer function can not be static either.
>>
>> So it can only be fixed by obfuscating the constant address in a
>> chain of out-of-line functions...
>> How is this compiler to be used for bare-metal programming?
>
> I reported this as a gcc bug when I first saw it back in March:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578
>
> Martin Sebor suggested marking the pointer as 'volatile' as a workaround,
> which is probably fine for bare-metal programming, but I would consider
> that bad style for the kernel boot arguments. The RELOC_HIDE trick is probably
> fine here, as there are only a couple of instances, and for the network
> driver, using volatile is probably appropriate as well.
>
> I still hope this can be fixed in a future gcc-11.x release. Maybe we should
> add further instances of the problem on the gcc bug to boost the priority?
>
>>> I don't know the hardware, so I can not answer the ioremap() question.
>>
>> Yes it should. But this driver dates back to 2.1.110, when only
>> half of the architectures already had ioremap().
>
> How does mvme16x even create the mapping? Is this a virtual address
> that is hardwired to the bus or do you have a static mapping somewhere?
> I see two other drivers accessing the nvram here
>
> arch/m68k/mvme16x/config.c:static MK48T08ptr_t volatile rtc = (MK48T08ptr_t)MVME_RTC_BASE;

Is that even correct ? I am always shaky with qualifiers, but doesn't
that mean that the pointer is volatile, not the object it points to ?
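
A small check of that reading, as a sketch with hypothetical names (rtc_ptr_t stands in for MK48T08ptr_t, and the struct layout is invented). In C, a qualifier applied to a typedef'd pointer type qualifies the pointer itself, whichever side of the typedef name it is written on; only a qualifier written on the pointed-to type makes the object volatile. The sketch uses C11 _Generic to report which case applies.

```c
/* Hypothetical register layout and typedef standing in for MK48T08ptr_t. */
struct rtc_regs {
	unsigned char ctrl;
	unsigned char secs;
};
typedef struct rtc_regs *rtc_ptr_t;

/* 1 if the object p points to is volatile-qualified, 0 otherwise. */
#define POINTEE_IS_VOLATILE(p) \
	_Generic(&(p)->secs, volatile unsigned char *: 1, unsigned char *: 0)

static struct rtc_regs fake_rtc;

int qualifier_after_typedef(void)
{
	rtc_ptr_t volatile p = &fake_rtc;	/* volatile pointer, plain pointee */

	return POINTEE_IS_VOLATILE(p);
}

int qualifier_before_typedef(void)
{
	volatile rtc_ptr_t p = &fake_rtc;	/* same type as above */

	return POINTEE_IS_VOLATILE(p);
}

int qualifier_on_struct(void)
{
	volatile struct rtc_regs *p = &fake_rtc;	/* plain pointer, volatile pointee */

	return POINTEE_IS_VOLATILE(p);
}
```

The first two functions return 0 and the third returns 1, which matches the suspicion: with a pointer typedef, both spellings quoted from the mvme16x code declare a volatile pointer to a non-volatile object.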

> arch/m68k/mvme16x/rtc.c: volatile MK48T08ptr_t rtc = (MK48T08ptr_t)MVME_RTC_BASE;
>
> The same trick should work here, just create a local variable with a
> volatile pointer and read from that.
>

I had tried that; it doesn't work because then the compiler complains
that the 'volatile' qualifier is discarded when passing the argument.

drivers/net/ethernet/i825xx/82596.c: In function 'i82596_probe':
drivers/net/ethernet/i825xx/82596.c:1147:34: error: passing argument 2 of '__builtin_memcpy' discards 'volatile' qualifier from pointer target type

Oddly enough, a memcpy on the 'rtc' variable doesn't fail,
neither with nor without volatile. Something else is going on.

Guenter

2021-09-08 13:30:40

by Al Viro

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Wed, Sep 08, 2021 at 05:42:30AM -0700, Guenter Roeck wrote:

> Oddly enough, a memcpy on the 'rtc' variable doesn't fail,
> neither with nor without volatile. Something else is going on.

While we are at it, would memcpy_fromio() complain? Seeing that
this is what's really intended there...

2021-09-08 14:04:17

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/8/21 6:19 AM, Al Viro wrote:
> On Wed, Sep 08, 2021 at 05:42:30AM -0700, Guenter Roeck wrote:
>
>> Oddly enough, a memcpy on the 'rtc' variable doesn't fail,
>> neither with nor without volatile. Something else is going on.
>
> While we are at it, would memcpy_fromio() complain? Seeing that
> this is what's really intended there...
>

It doesn't make a difference on m68k.

#define memcpy_fromio memcpy_fromio
static inline void memcpy_fromio(void *dst, const volatile void __iomem *src,
				 int count)
{
	__builtin_memcpy(dst, (void __force *) src, count);
}

It boils down to the use of __builtin_memcpy(). m68k implements its own version
of memcpy(). If that is used, everything works fine. However, if a file includes
<linux/string.h>, memcpy is replaced with __builtin_memcpy:

#define __HAVE_ARCH_MEMCPY
extern void *memcpy(void *, const void *, __kernel_size_t);
#define memcpy(d, s, n) __builtin_memcpy(d, s, n)

and the compilation fails.

That also explains why only some architectures/files are affected.
Presumably those are the architectures which use __builtin_memcpy().

Guenter

2021-09-08 15:01:06

by David Laight

Subject: RE: [PATCH] Enable '-Werror' by default for all kernel builds

From: Arnd Bergmann
> Sent: 08 September 2021 10:50
...
> > > I don't know the hardware, so I can not answer the ioremap() question.
> >
> > Yes it should. But this driver dates back to 2.1.110, when only
> > half of the architectures already had ioremap().
>
> How does mvme16x even create the mapping? Is this a virtual address
> that is hardwired to the bus or do you have a static mapping somewhere?
> I see two other drivers accessing the nvram here
>
> arch/m68k/mvme16x/config.c:static MK48T08ptr_t volatile rtc = (MK48T08ptr_t)MVME_RTC_BASE;
> arch/m68k/mvme16x/rtc.c: volatile MK48T08ptr_t rtc = (MK48T08ptr_t)MVME_RTC_BASE;
>
> The same trick should work here, just create a local variable with a
> volatile pointer and read from that.

Or define a C 'extern' for the actual data and get the linker script
to assign a fixed value to the symbol.
(Although that does pollute the global namespace.)

An alternative is to use an asm statement so the compiler
cannot track the actual value.
Something like:

#define launder(x) asm volatile( "" : "+r" (x))

MK48T08ptr_t rtc = (void *)MVME_RTC_BASE;
launder(rtc);

That also works a bit like READ_ONCE() except that it works
on a value that is (hopefully) already in a register rather
than during the read from memory.
Useful when the compiler's 'value tracking' pessimises code.
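
A compilable sketch of the launder() idea, assuming gcc or clang. The macro is spelled with __asm__ so it also builds in strict ISO modes, and read_byte_at() is a hypothetical helper that exists only to show the macro in use.

```c
#include <stddef.h>
#include <stdint.h>

/* "+r" marks x as both read and written, so its known value is forgotten. */
#define launder(x) __asm__ volatile("" : "+r" (x))

/* Hypothetical helper: read one byte at a raw address plus an offset. */
unsigned char read_byte_at(uintptr_t addr, size_t offset)
{
	const unsigned char *p = (const unsigned char *)addr;

	launder(p);	/* from here on the compiler treats p as opaque */
	return p[offset];
}
```

Unlike a macro that returns a fresh pointer, this form modifies an existing variable in place, so it needs a named local (such as rtc in the snippet above) rather than a bare constant expression.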

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-09-08 20:58:31

by Nathan Chancellor

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

Hi Arnd,

On Tue, Sep 07, 2021 at 11:11:17AM +0200, Arnd Bergmann wrote:
> On Tue, Sep 7, 2021 at 4:32 AM Nathan Chancellor <[email protected]> wrote:
> >
> > arm32-allmodconfig.log: crypto/wp512.c:782:13: error: stack frame size (1176) exceeds limit (1024) in function 'wp512_process_buffer' [-Werror,-Wframe-larger-than]
> > arm32-allmodconfig.log: drivers/firmware/tegra/bpmp-debugfs.c:294:12: error: stack frame size (1256) exceeds limit (1024) in function 'bpmp_debug_show' [-Werror,-Wframe-larger-than]
> > arm32-allmodconfig.log: drivers/firmware/tegra/bpmp-debugfs.c:357:16: error: stack frame size (1264) exceeds limit (1024) in function 'bpmp_debug_store' [-Werror,-Wframe-larger-than]
> > arm32-allmodconfig.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3043:6: error: stack frame size (1384) exceeds limit (1024) in function 'bw_calcs' [-Werror,-Wframe-larger-than]
> > arm32-allmodconfig.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5560) exceeds limit (1024) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]
> > arm32-allmodconfig.log: drivers/mtd/chips/cfi_cmdset_0001.c:1872:12: error: stack frame size (1064) exceeds limit (1024) in function 'cfi_intelext_writev' [-Werror,-Wframe-larger-than]
> > arm32-allmodconfig.log: drivers/ntb/hw/idt/ntb_hw_idt.c:1041:27: error: stack frame size (1032) exceeds limit (1024) in function 'idt_scan_mws' [-Werror,-Wframe-larger-than]
> > arm32-allmodconfig.log: drivers/staging/fbtft/fbtft-core.c:902:12: error: stack frame size (1072) exceeds limit (1024) in function 'fbtft_init_display_from_property' [-Werror,-Wframe-larger-than]
> > arm32-allmodconfig.log: drivers/staging/fbtft/fbtft-core.c:992:5: error: stack frame size (1064) exceeds limit (1024) in function 'fbtft_init_display' [-Werror,-Wframe-larger-than]
> > arm32-allmodconfig.log: drivers/staging/rtl8723bs/core/rtw_security.c:1288:5: error: stack frame size (1040) exceeds limit (1024) in function 'rtw_aes_decrypt' [-Werror,-Wframe-larger-than]
> > arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3043:6: error: stack frame size (1376) exceeds limit (1024) in function 'bw_calcs' [-Werror,-Wframe-larger-than]
> > arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5384) exceeds limit (1024) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]
> >
> > Aside from the dce_calcs.c warnings, these do not seem too bad. I
> > believe allmodconfig turns on UBSAN but it could also be aggressive
> > inlining by clang. I intend to look at all -Wframe-larger-than warnings
> > closely later.
>
> I've had them close to zero in the past, but a couple of new ones came in.
>
> The amdgpu ones are probably not fixable unless they stop using 64-bit
> floats in the kernel for random calculations. The crypto/* ones tend to
> be compiler bugs, but hard to fix

I have started taking a look at these. Most of the allmodconfig ones
appear to be related to CONFIG_KASAN, which is now supported for
CONFIG_ARM.

The two in bpmp-debugfs.c appear regardless of CONFIG_KASAN and it turns
out that you actually submitted a patch for these:

https://lore.kernel.org/r/[email protected]/

Is it worth resending or pinging that?

The dce_calcs.c ones also appear without CONFIG_KASAN, which you noted
is probably unavoidable.

The other ones only appear with CONFIG_KASAN. I have not investigated
each instance to see exactly how much KASAN makes the stack blow up.
Perhaps it is worth setting the default of CONFIG_FRAME_WARN to a higher
value with clang+COMPILE_TEST+KASAN?

> > It appears that both Arch Linux and Fedora define CONFIG_FRAME_WARN
> > as 1024, below its default of 2048. I am not sure these look particularly
> > scary (although there are some that are rather large that need to be
> > looked at).
>
> For 64-bit, you usually need 1280 bytes stack space to get a reasonably
> clean build, anything that uses more than that tends to be a bug in the
> code but we never warned about those by default as the default warning
> limit in defconfig is 2048.
>
> I think the distros using 1024 did that because they use a common base config
> for 32-bit and 64-bit targets.

That is a fair explanation.

> > I suspect this is a backend problem because these do not really appear
> > in any other configurations (might also be something with a sanitizer?)
>
> Agreed. Someone needs to bisect the .config or the compiler flags to see what
> triggers them.

For other people following along, there were a lot of
-Wframe-larger-than instances from RISC-V allmodconfig.

Turns out this is because CONFIG_KASAN_STACK is not respected with
RISC-V. They do not set CONFIG_KASAN_SHADOW_OFFSET so following along in
scripts/Makefile.kasan, CFLAGS_KASAN_SHADOW does not get set to
anything, which means that only '-fsanitize=kernel-address' gets added
to the command line, with none of the other parameters.

I guess there are a couple of ways to tackle this:

1. RISC-V could implement CONFIG_KASAN_SHADOW_OFFSET. They mention that
the logic of KASAN_SHADOW_OFFSET was taken from arm64 but they did
not borrow the Kconfig logic it seems.

2. asan-stack could be hoisted out of the else branch so that it is
always enabled/disabled regardless of KASAN_SHADOW_OFFSET being
defined, which resolved all of these warnings for me in my testing.

I am adding the KASAN and RISC-V folks to CC for this reason.

Cheers,
Nathan

2021-09-08 21:26:37

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/8/21 1:55 PM, Nathan Chancellor wrote:
> Hi Arnd,
>
> On Tue, Sep 07, 2021 at 11:11:17AM +0200, Arnd Bergmann wrote:
>> On Tue, Sep 7, 2021 at 4:32 AM Nathan Chancellor <[email protected]> wrote:
>>>
>>> arm32-allmodconfig.log: crypto/wp512.c:782:13: error: stack frame size (1176) exceeds limit (1024) in function 'wp512_process_buffer' [-Werror,-Wframe-larger-than]
>>> arm32-allmodconfig.log: drivers/firmware/tegra/bpmp-debugfs.c:294:12: error: stack frame size (1256) exceeds limit (1024) in function 'bpmp_debug_show' [-Werror,-Wframe-larger-than]
>>> arm32-allmodconfig.log: drivers/firmware/tegra/bpmp-debugfs.c:357:16: error: stack frame size (1264) exceeds limit (1024) in function 'bpmp_debug_store' [-Werror,-Wframe-larger-than]
>>> arm32-allmodconfig.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3043:6: error: stack frame size (1384) exceeds limit (1024) in function 'bw_calcs' [-Werror,-Wframe-larger-than]
>>> arm32-allmodconfig.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5560) exceeds limit (1024) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]
>>> arm32-allmodconfig.log: drivers/mtd/chips/cfi_cmdset_0001.c:1872:12: error: stack frame size (1064) exceeds limit (1024) in function 'cfi_intelext_writev' [-Werror,-Wframe-larger-than]
>>> arm32-allmodconfig.log: drivers/ntb/hw/idt/ntb_hw_idt.c:1041:27: error: stack frame size (1032) exceeds limit (1024) in function 'idt_scan_mws' [-Werror,-Wframe-larger-than]
>>> arm32-allmodconfig.log: drivers/staging/fbtft/fbtft-core.c:902:12: error: stack frame size (1072) exceeds limit (1024) in function 'fbtft_init_display_from_property' [-Werror,-Wframe-larger-than]
>>> arm32-allmodconfig.log: drivers/staging/fbtft/fbtft-core.c:992:5: error: stack frame size (1064) exceeds limit (1024) in function 'fbtft_init_display' [-Werror,-Wframe-larger-than]
>>> arm32-allmodconfig.log: drivers/staging/rtl8723bs/core/rtw_security.c:1288:5: error: stack frame size (1040) exceeds limit (1024) in function 'rtw_aes_decrypt' [-Werror,-Wframe-larger-than]
>>> arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3043:6: error: stack frame size (1376) exceeds limit (1024) in function 'bw_calcs' [-Werror,-Wframe-larger-than]
>>> arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5384) exceeds limit (1024) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]
>>>
>>> Aside from the dce_calcs.c warnings, these do not seem too bad. I
>>> believe allmodconfig turns on UBSAN but it could also be aggressive
>>> inlining by clang. I intend to look at all -Wframe-larger-than warnings
>>> closely later.
>>
>> I've had them close to zero in the past, but a couple of new ones came in.
>>
>> The amdgpu ones are probably not fixable unless they stop using 64-bit
>> floats in the kernel for random calculations. The crypto/* ones tend to
>> be compiler bugs, but hard to fix
>
> I have started taking a look at these. Most of the allmodconfig ones
> appear to be related to CONFIG_KASAN, which is now supported for
> CONFIG_ARM.
>

Would it make sense to make KASAN depend on !COMPILE_TEST ?
After all, the point of KASAN is runtime testing, not build testing.

Guenter

2021-09-08 22:01:16

by Marco Elver

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Wed, Sep 08, 2021 at 02:16PM -0700, Guenter Roeck wrote:
> On 9/8/21 1:55 PM, Nathan Chancellor wrote:
[...]
> > I have started taking a look at these. Most of the allmodconfig ones
> > appear to be related to CONFIG_KASAN, which is now supported for
> > CONFIG_ARM.
> >
>
> Would it make sense to make KASAN depend on !COMPILE_TEST ?
> After all, the point of KASAN is runtime testing, not build testing.

It'd be good to avoid. It has helped uncover build issues with KASAN in
the past. Or at least make it dependent on the problematic architecture.
For example if arm is a problem, something like this:

--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -71,7 +71,7 @@ config ARM
select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6
select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU
- select HAVE_ARCH_KASAN if MMU && !XIP_KERNEL
+ select HAVE_ARCH_KASAN if MMU && !XIP_KERNEL && (!COMPILE_TEST || !CC_IS_CLANG)
select HAVE_ARCH_MMAP_RND_BITS if MMU
select HAVE_ARCH_PFN_VALID
select HAVE_ARCH_SECCOMP

More generally, with clang, the problem is known and due to KASAN stack
instrumentation (CONFIG_KASAN_STACK):

| config KASAN_STACK
| bool "Enable stack instrumentation (unsafe)" if CC_IS_CLANG && !COMPILE_TEST
| depends on KASAN_GENERIC || KASAN_SW_TAGS
| depends on !ARCH_DISABLE_KASAN_INLINE
| default y if CC_IS_GCC
| help
| The LLVM stack address sanitizer has a know problem that
| causes excessive stack usage in a lot of functions, see
| https://bugs.llvm.org/show_bug.cgi?id=38809
| Disabling asan-stack makes it safe to run kernels build
| with clang-8 with KASAN enabled, though it loses some of
| the functionality.
| This feature is always disabled when compile-testing with clang
| to avoid cluttering the output in stack overflow warnings,
| but clang users can still enable it for builds without
| CONFIG_COMPILE_TEST. On gcc it is assumed to always be safe
| to use and enabled by default.
| If the architecture disables inline instrumentation, stack
| instrumentation is also disabled as it adds inline-style
| instrumentation that is run unconditionally.

This is already disabled if COMPILE_TEST and building with clang. As
far as I know, there's no easy fix for clang and it's been discussed
many times over with LLVM devs.

Thanks,
-- Marco

2021-09-09 00:51:18

by Harry Wentland

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds



On 2021-09-08 12:41 a.m., Linus Torvalds wrote:
> On Tue, Sep 7, 2021 at 8:52 PM Harry Wentland <[email protected]> wrote:
>>
>> Attached patches fix these x86_64 ones reported by Nick:
>
> Hmm.
>
> You didn't seem to fix up the calling convention for print__xyz(),
> which still take those xyz structs as pass-by-value.
>
> Obviously it would be good to do things incrementally, so if that
> attached patch was just [1/N] I won't complain..
>

You're right. I was focussed on the stack frame limit but fixed up
the rest as well now and sent the series out.

https://lkml.org/lkml/2021/9/8/933

>> I'm also seeing one more that might be more challenging to fix but is nearly at 1024:
>>
>> drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_mode_vba_21.c:3397:6: error: stack frame size of 1064 bytes in function 'dml21_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than=]
>
> Oh Gods, that function is truly something else..
>
> Is there some reason why it's one humongous function, with the
> occasional single-line comment?
>
> Because it really looks to me like pretty much everywhere I see one of
> those rare comments, I would go "this part should be a function of its
> own", and then there would be one caller function that just calls
> each of those sub-functions one after the other.
>

Yeah, that's what I'm thinking as well. It would likely fix the stack
size, even without dynamically allocating the two structs you mention
below.
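
A rough sketch of that idea, not the actual DML code: each stage gets its
own helper, so the scratch data of one stage is gone from the stack before
the next stage runs. Every name here (struct stage_scratch, check_prefetch(),
check_bandwidth(), mode_support_full()) is made up for illustration, and the
noinline attribute is an assumption about what it takes to keep the compiler
from merging the frames again.

```c
#include <assert.h>
#include <stddef.h>

#define SCRATCH_ENTRIES 128

/* Hypothetical per-stage scratch data (512 bytes with 4-byte ints). */
struct stage_scratch {
	unsigned int values[SCRATCH_ENTRIES];
};

/* GCC/clang attribute: keep each helper's locals in its own frame. */
#define stage_noinline __attribute__((noinline))

/* Stage 1: the scratch buffer only exists while this helper runs. */
static stage_noinline unsigned int check_prefetch(size_t pipes)
{
	struct stage_scratch s;
	unsigned int sum = 0;
	size_t i;

	assert(pipes <= SCRATCH_ENTRIES);
	for (i = 0; i < pipes; i++)
		s.values[i] = (unsigned int)i * 2;
	for (i = 0; i < pipes; i++)
		sum += s.values[i];
	return sum;
}

/* Stage 2: same pattern, separate frame, separate scratch buffer. */
static stage_noinline unsigned int check_bandwidth(size_t pipes)
{
	struct stage_scratch s;
	unsigned int sum = 0;
	size_t i;

	assert(pipes <= SCRATCH_ENTRIES);
	for (i = 0; i < pipes; i++)
		s.values[i] = (unsigned int)i * 3;
	for (i = 0; i < pipes; i++)
		sum += s.values[i];
	return sum;
}

/* The former single huge function shrinks to a sequence of calls. */
static unsigned int mode_support_full(size_t pipes)
{
	return check_prefetch(pipes) + check_bandwidth(pipes);
}
```

With this layout the peak stack use is roughly the caller's frame plus the
largest single helper, rather than the sum of all stages.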

> That would - I think - make the code easier to read, and then it would
> also make it very obvious where it magically uses a lot of stack.
>
> My suspicion is actually "nowhere". The stack use is just hugely
> spread out, and the compiler has just kept accumulating more spill
> variables on the frame with no single big reason.
>
> Yes, I see a couple of local structures:
>
> Pipe myPipe;
> HostVM myHostVM;
>
> but more than that I see several function calls that have basically 62
> arguments. And I wish I was making that number up. I'm not. That
> "CalculatePrefetchSchedule()" call literally has 62 arguments.
>
> But *all* of the top-level loops in that function literally look like
> they could - and should - be functions in their own right. Some of
> them would be fairly complex even so (ie that code under the comment
>
> //Prefetch Check
>
> would be quite the big function all of its own).
>
> We have a coding style thing:
>
> Documentation/process/coding-style.rst
>
> that says that you should strive to have functions that are "short and
> sweet" and fit on one or two screenfuls of text.
>
> That one function from hell is 1832 lines of code.
>
> It really could be improved upon.
>

Absolutely.

The file comes with an (easy to miss) disclaimer that it is
"gcc-parseable HW gospel":
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/display/dc/dml/dcn20/display_mode_vba_20.c?h=v5.14#n32

The display_mode_vba* stuff deals with bandwidth formulas from HW
designers. At some point in the past we attempted to convert them
to something more readable and elegant but would often run into
difficulties getting support from the right people when things
wouldn't work. Using the HW designer's code directly tends to
short circuit any arguments about SW correctness.

In short, I don't really like this code but it works. It helps
prevent black screens and underflows on the display.

We try to follow the coding-style.rst for the most part elsewhere,
though there are still plenty of areas where we can improve.

Harry

> Linus
>

2021-09-09 06:02:10

by Christoph Hellwig

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
> It'd be good to avoid. It has helped uncover build issues with KASAN in
> the past. Or at least make it dependent on the problematic architecture.
> For example if arm is a problem, something like this:

I'm also seeing quite a few stack size warnings with KASAN on x86_64
without COMPILE_TEST using gcc 10.2.1 from Debian. In fact there are a
few warnings without KASAN, but with KASAN there are a lot more.
I'll try to find some time to dig into them.

While we're at it, with -Werror something like this is really futile:

drivers/gpu/drm/amd/amdgpu/amdgpu_object.c: In function ‘amdgpu_bo_support_uswc’:
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:493:2: warning: #warning
Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance thanks to write-combining [-Wcpp]
493 | #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
| ^~~~~~~

2021-09-09 06:11:36

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/8/21 10:58 PM, Christoph Hellwig wrote:
> On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
>> It'd be good to avoid. It has helped uncover build issues with KASAN in
>> the past. Or at least make it dependent on the problematic architecture.
>> For example if arm is a problem, something like this:
>
> I'm also seeing quite a few stack size warnings with KASAN on x86_64
> without COMPILE_TEST using gcc 10.2.1 from Debian. In fact there are a
> few warnings without KASAN, but with KASAN there are a lot more.
> I'll try to find some time to dig into them.
>
> While we're at it, with -Werror something like this is really futile:
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c: In function ‘amdgpu_bo_support_uswc’:
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:493:2: warning: #warning
> Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance thanks to write-combining [-Wcpp]
> 493 | #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
> | ^~~~~~~
>

I have been wondering if all those #warning "errors" should either
be removed or be replaced with "#pragma message".
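
For illustration, a minimal sketch of the "#pragma message" variant. The
CONFIG_* symbols are left undefined here as stand-ins for a config that
lacks them, and HAVE_WRITE_COMBINING and write_combining_supported() are
hypothetical names. The assumption is that the compiler reports a pragma
message as a note that is not promoted by -Werror; that is how gcc behaves
as far as I know, but other compilers may differ.

```c
/*
 * Stand-ins for Kconfig symbols. In a kernel build these would come from
 * the generated config header; here both are deliberately left undefined.
 */
/* #define CONFIG_MTRR 1 */
/* #define CONFIG_X86_PAT 1 */

#if !defined(CONFIG_MTRR) || !defined(CONFIG_X86_PAT)
/* Shown at build time, but intended to stay non-fatal. */
#pragma message("CONFIG_MTRR and CONFIG_X86_PAT not set: no write-combining")
#define HAVE_WRITE_COMBINING 0
#else
#define HAVE_WRITE_COMBINING 1
#endif

/* Hypothetical helper so callers can test the outcome at run time. */
static int write_combining_supported(void)
{
	return HAVE_WRITE_COMBINING;
}
```

The hint still shows up in the build log, but whether people notice a note
as readily as a warning is a separate question.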

Guenter

2021-09-09 07:34:59

by Christian König

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

Am 09.09.21 um 08:07 schrieb Guenter Roeck:
> On 9/8/21 10:58 PM, Christoph Hellwig wrote:
>> On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
>>> It'd be good to avoid. It has helped uncover build issues with KASAN in
>>> the past. Or at least make it dependent on the problematic
>>> architecture.
>>> For example if arm is a problem, something like this:
>>
>> I'm also seeing quite a few stack size warnings with KASAN on x86_64
>> without COMPILE_TEST using gcc 10.2.1 from Debian. In fact there are a
>> few warnings without KASAN, but with KASAN there are a lot more.
>> I'll try to find some time to dig into them.
>>
>> While we're at it, with -Werror something like this is really futile:
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c: In function
>> ‘amdgpu_bo_support_uswc’:
>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:493:2: warning: #warning
>> Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance
>> thanks to write-combining [-Wcpp]
>>    493 | #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for
>> better performance \
>>        |  ^~~~~~~

Ah, yes good point!

>
> I have been wondering if all those #warning "errors" should either
> be removed or be replaced with "#pragma message".

Well, we started to add those warnings because people compiled their
kernel without CONFIG_MTRR and CONFIG_X86_PAT and were then wondering why
the performance of the display driver was so crappy.

When those warnings now generate an error that you have to disable
explicitly, that might not be bad at all.

It at least points people to this setting and makes it really clear that
they are doing something very unusual and need to keep in mind that it
might not have the desired result.

Regards,
Christian.

>
> Guenter

2021-09-09 10:55:14

by Marco Elver

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Thu, 9 Sept 2021 at 07:59, Christoph Hellwig <[email protected]> wrote:
> On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
> > It'd be good to avoid. It has helped uncover build issues with KASAN in
> > the past. Or at least make it dependent on the problematic architecture.
> > For example if arm is a problem, something like this:
>
> I'm also seeing quite a few stack size warnings with KASAN on x86_64
> without COMPILE_TEST using gcc 10.2.1 from Debian. In fact there are a
> few warnings without KASAN, but with KASAN there are a lot more.
> I'll try to find some time to dig into them.

Right, this reminded me that we actually at least double the real
stack size for KASAN builds, because it inherently requires more stack
space. I think we need Wframe-larger-than to match that, otherwise
we'll just keep having this problem:

https://lkml.kernel.org/r/[email protected]

> While we're at it, with -Werror something like this is really futile:
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c: In function ‘amdgpu_bo_support_uswc’:
> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:493:2: warning: #warning
> Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance thanks to write-combining [-Wcpp]
> 493 | #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
> | ^~~~~~~

2021-09-09 11:03:23

by Arnd Bergmann

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Thu, Sep 9, 2021 at 12:54 PM Marco Elver <[email protected]> wrote:
> On Thu, 9 Sept 2021 at 07:59, Christoph Hellwig <[email protected]> wrote:
> > On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
> > > It'd be good to avoid. It has helped uncover build issues with KASAN in
> > > the past. Or at least make it dependent on the problematic architecture.
> > > For example if arm is a problem, something like this:
> >
> > I'm also seeing quite a few stack size warnings with KASAN on x86_64
> > without COMPILE_TEST using gcc 10.2.1 from Debian. In fact there are a
> > few warnings without KASAN, but with KASAN there are a lot more.
> > I'll try to find some time to dig into them.
>
> Right, this reminded me that we actually at least double the real
> stack size for KASAN builds, because it inherently requires more stack
> space. I think we need Wframe-larger-than to match that, otherwise
> we'll just keep having this problem:
>
> https://lkml.kernel.org/r/[email protected]

The problem with this is that it completely defeats the point of the
stack size warnings in allmodconfig kernels when they have KASAN
enabled and end up missing obvious code bugs in drivers that put
large structures on the stack. Let's not go there.
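
To make the kind of bug concrete, here is a userspace sketch with
hypothetical names (struct big_state, checksum_on_stack(),
checksum_on_heap()); malloc() merely stands in for whatever allocator a
driver would really use. The first function is what the frame size warning
is meant to catch, the second is one usual way out.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical driver state, larger than a 1024-byte frame limit. */
struct big_state {
	unsigned char buf[1500];
};

/* Anti-pattern: all 1500 bytes land in this function's stack frame. */
static unsigned int checksum_on_stack(unsigned char seed)
{
	struct big_state st;
	unsigned int sum = 0;
	size_t i;

	memset(st.buf, seed, sizeof(st.buf));
	for (i = 0; i < sizeof(st.buf); i++)
		sum += st.buf[i];
	return sum;
}

/* Same work with the buffer on the heap; the frame holds only a pointer. */
static int checksum_on_heap(unsigned char seed, unsigned int *out)
{
	struct big_state *st = malloc(sizeof(*st));
	unsigned int sum = 0;
	size_t i;

	if (!st)
		return -1;	/* allocation failure must be handled */
	memset(st->buf, seed, sizeof(st->buf));
	for (i = 0; i < sizeof(st->buf); i++)
		sum += st->buf[i];
	free(st);
	*out = sum;
	return 0;
}
```

Raising the warning limit would hide the first variant along with the
instrumentation overhead, which is the concern here.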

Arnd

2021-09-09 12:07:05

by Marco Elver

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Thu, 9 Sept 2021 at 13:00, Arnd Bergmann <[email protected]> wrote:
> On Thu, Sep 9, 2021 at 12:54 PM Marco Elver <[email protected]> wrote:
> > On Thu, 9 Sept 2021 at 07:59, Christoph Hellwig <[email protected]> wrote:
> > > On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
> > > > It'd be good to avoid. It has helped uncover build issues with KASAN in
> > > > the past. Or at least make it dependent on the problematic architecture.
> > > > For example if arm is a problem, something like this:
> > >
> > > I'm also seeing quite a few stack size warnings with KASAN on x86_64
> > > without COMPILE_TEST using gcc 10.2.1 from Debian. In fact there are a
> > > few warnings without KASAN, but with KASAN there are a lot more.
> > > I'll try to find some time to dig into them.
> >
> > Right, this reminded me that we actually at least double the real
> > stack size for KASAN builds, because it inherently requires more stack
> > space. I think we need Wframe-larger-than to match that, otherwise
> > we'll just keep having this problem:
> >
> > https://lkml.kernel.org/r/[email protected]
>
> The problem with this is that it completely defeats the point of the
> stack size warnings in allmodconfig kernels when they have KASAN
> enabled and end up missing obvious code bugs in drivers that put
> large structures on the stack. Let's not go there.

Sure, but the reality is that the real stack size is already doubled
for KASAN. And that should be reflected in Wframe-larger-than.

Either that, or we just have to live with the occasional warning (that
is likely benign). But with WERROR we're now forced to make the
defaults as sane as possible. If the worry is allmodconfig, maybe we
do have to make KASAN dependent on !COMPILE_TEST, even though that's
not great either because it has caught real issues in the past (it'll
also mean doing the same for all other instrumentation-based tools,
like KCSAN, UBSAN, etc.).

2021-09-09 13:32:09

by Arnd Bergmann

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Thu, Sep 9, 2021 at 1:43 PM Marco Elver <[email protected]> wrote:
> On Thu, 9 Sept 2021 at 13:00, Arnd Bergmann <[email protected]> wrote:
> > On Thu, Sep 9, 2021 at 12:54 PM Marco Elver <[email protected]> wrote:
> > > On Thu, 9 Sept 2021 at 07:59, Christoph Hellwig <[email protected]> wrote:
> > > > On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
> > > > > It'd be good to avoid. It has helped uncover build issues with KASAN in
> > > > > the past. Or at least make it dependent on the problematic architecture.
> > > > > For example if arm is a problem, something like this:
> > > >
> > > > I'm also seeing quite a few stack size warnings with KASAN on x86_64
> > > > without COMPILE_TEST using gcc 10.2.1 from Debian. In fact there are a
> > > > few warnings without KASAN, but with KASAN there are a lot more.
> > > > I'll try to find some time to dig into them.
> > >
> > > Right, this reminded me that we actually at least double the real
> > > stack size for KASAN builds, because it inherently requires more stack
> > > space. I think we need Wframe-larger-than to match that, otherwise
> > > we'll just keep having this problem:
> > >
> > > https://lkml.kernel.org/r/[email protected]
> >
> > The problem with this is that it completely defeats the point of the
> > stack size warnings in allmodconfig kernels when they have KASAN
> > enabled and end up missing obvious code bugs in drivers that put
> > large structures on the stack. Let's not go there.
>
> Sure, but the reality is that the real stack size is already doubled
> for KASAN. And that should be reflected in Wframe-larger-than.

I don't think "double" is an accurate description of what is going on;
it's much more complex than this. There are some functions
that completely explode with KASAN_STACK enabled on clang,
and many other function instances that don't grow much at all.

I've been building randconfig kernels for a long time with KASAN_STACK
enabled on gcc, with the limit increased to 1440 bytes for 32-bit
and not increased beyond the normal 2048 bytes for 64-bit. I have
some patches to address the outliers and should go through and
resend some of those.

With the same limits and patches using clang, and KASAN=y but
KASAN_STACK=n I also get no warnings in randconfig builds,
but KASAN_STACK on clang doesn't really seem to have a good
limit that would make an allmodconfig kernel build with no warnings.

These are the worst offenders I see based on configuration, using
a 32-bit ARM allmodconfig with my fixups:

gcc-11, KASAN, no KASAN_STACK, FRAME_WARN=1024:
(nothing)

gcc-11, KASAN_STACK:
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c:782:1:
warning: the frame size of 1416 bytes is larger than 1024 bytes
[-Wframe-larger-than=]
drivers/media/dvb-frontends/mxl5xx.c:1575:1: warning: the frame size
of 1240 bytes is larger than 1024 bytes [-Wframe-larger-than=]
drivers/mtd/nftlcore.c:468:1: warning: the frame size of 1232 bytes is
larger than 1024 bytes [-Wframe-larger-than=]
drivers/char/ipmi/ipmi_msghandler.c:4880:1: warning: the frame size of
1232 bytes is larger than 1024 bytes [-Wframe-larger-than=]
drivers/mtd/chips/cfi_cmdset_0001.c:1870:1: warning: the frame size of
1224 bytes is larger than 1024 bytes [-Wframe-larger-than=]
drivers/net/wireless/ath/ath9k/ar9003_paprd.c:749:1: warning: the
frame size of 1216 bytes is larger than 1024 bytes
[-Wframe-larger-than=]
drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c:136:1: warning:
the frame size of 1216 bytes is larger than 1024 bytes
[-Wframe-larger-than=]
drivers/ntb/hw/idt/ntb_hw_idt.c:1116:1: warning: the frame size of
1200 bytes is larger than 1024 bytes [-Wframe-larger-than=]
net/dcb/dcbnl.c:1172:1: warning: the frame size of 1192 bytes is
larger than 1024 bytes [-Wframe-larger-than=]
fs/select.c:1042:1: warning: the frame size of 1192 bytes is larger
than 1024 bytes [-Wframe-larger-than=]

clang-12 KASAN, no KASAN_STACK, FRAME_WARN=1024:

kernel/trace/trace_events_hist.c:4601:13: error: stack frame size 1384
exceeds limit 1024 in function 'hist_trigger_print_key'
[-Werror,-Wframe-larger-than]
drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3045:6:
error: stack frame size 1384 exceeds limit 1024 in function 'bw_calcs'
[-Werror,-Wframe-larger-than]
drivers/staging/fbtft/fbtft-core.c:992:5: error: stack frame size 1208
exceeds limit 1024 in function 'fbtft_init_display'
[-Werror,-Wframe-larger-than]
crypto/wp512.c:782:13: error: stack frame size 1176 exceeds limit 1024
in function 'wp512_process_buffer' [-Werror,-Wframe-larger-than]
drivers/staging/fbtft/fbtft-core.c:902:12: error: stack frame size
1080 exceeds limit 1024 in function 'fbtft_init_display_from_property'
[-Werror,-Wframe-larger-than]
drivers/mtd/chips/cfi_cmdset_0001.c:1872:12: error: stack frame size
1064 exceeds limit 1024 in function 'cfi_intelext_writev'
[-Werror,-Wframe-larger-than]
drivers/staging/rtl8723bs/core/rtw_security.c:1288:5: error: stack
frame size 1040 exceeds limit 1024 in function 'rtw_aes_decrypt'
[-Werror,-Wframe-larger-than]
drivers/ntb/hw/idt/ntb_hw_idt.c:1041:27: error: stack frame size 1032
exceeds limit 1024 in function 'idt_scan_mws'
[-Werror,-Wframe-larger-than]

clang-12, KASAN_STACK:

drivers/infiniband/hw/ocrdma/ocrdma_stats.c:686:16: error: stack frame
size 20608 exceeds limit 1024 in function 'ocrdma_dbgfs_ops_read'
[-Werror,-Wframe-larger-than]
lib/bitfield_kunit.c:60:20: error: stack frame size 10336 exceeds
limit 10240 in function 'test_bitfields_constants'
[-Werror,-Wframe-larger-than]
drivers/net/wireless/ralink/rt2x00/rt2800lib.c:9012:13: error: stack
frame size 9952 exceeds limit 1024 in function 'rt2800_init_rfcsr'
[-Werror,-Wframe-larger-than]
drivers/net/usb/r8152.c:7486:13: error: stack frame size 8768 exceeds
limit 1024 in function 'r8156b_hw_phy_cfg'
[-Werror,-Wframe-larger-than]
drivers/media/dvb-frontends/nxt200x.c:915:12: error: stack frame size
8192 exceeds limit 1024 in function 'nxt2004_init'
[-Werror,-Wframe-larger-than]
drivers/net/wan/slic_ds26522.c:203:12: error: stack frame size 8064
exceeds limit 1024 in function 'slic_ds26522_probe'
[-Werror,-Wframe-larger-than]
drivers/firmware/broadcom/bcm47xx_sprom.c:188:13: error: stack frame
size 8064 exceeds limit 1024 in function 'bcm47xx_sprom_fill_auto'
[-Werror,-Wframe-larger-than]
drivers/media/dvb-frontends/drxd_hard.c:2857:12: error: stack frame
size 7584 exceeds limit 1024 in function 'drxd_set_frontend'
[-Werror,-Wframe-larger-than]
drivers/media/dvb-frontends/nxt200x.c:519:12: error: stack frame size
6848 exceeds limit 1024 in function
'nxt200x_setup_frontend_parameters' [-Werror,-Wframe-larger-than]
drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_n.c:17019:13:
error: stack frame size 6560 exceeds limit 1024 in function
'wlc_phy_workarounds_nphy' [-Werror,-Wframe-larger-than]

> Either that, or we just have to live with the occasional warning (that
> is likely benign). But with WERROR we're now forced to make the
> defaults as sane as possible. If the worry is allmodconfig, maybe we
> do have to make KASAN dependent on !COMPILE_TEST, even though that's
> not great either because it has caught real issues in the past (it'll
> also mean doing the same for all other instrumentation-based tools,
> like KCSAN, UBSAN, etc.).

I would prefer going back to marking KASAN_STACK as broken on clang. It does
not seem like the warnings on the symbol were enough to stop people from
attempting to use it, and the remaining warnings seem fixable with a small
increase of FRAME_WARN when using KASAN with clang but no KASAN_STACK,
or when using KASAN_STACK with gcc.

Arnd

2021-09-09 15:01:33

by Guenter Roeck

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On 9/9/21 12:30 AM, Christian König wrote:
> Am 09.09.21 um 08:07 schrieb Guenter Roeck:
>> On 9/8/21 10:58 PM, Christoph Hellwig wrote:
>>> On Wed, Sep 08, 2021 at 11:58:56PM +0200, Marco Elver wrote:
>>>> It'd be good to avoid. It has helped uncover build issues with KASAN in
>>>> the past. Or at least make it dependent on the problematic architecture.
>>>> For example if arm is a problem, something like this:
>>>
>>> I'm also seeing quite a few stack size warnings with KASAN on x86_64
>>> without COMPILE_TEST using gcc 10.2.1 from Debian. In fact there are a
>>> few warnings without KASAN, but with KASAN there are a lot more.
>>> I'll try to find some time to dig into them.
>>>
>>> While we're at it, with -Werror something like this is really futile:
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c: In function ‘amdgpu_bo_support_uswc’:
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:493:2: warning: #warning
>>> Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance thanks to write-combining [-Wcpp]
>>>    493 | #warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
>>>        |  ^~~~~~~
>
> Ah, yes good point!
>
>>
>> I have been wondering if all those #warning "errors" should either
>> be removed or be replaced with "#pragma message".
>
> Well, we started to add those warnings because people compiled their kernel without CONFIG_MTRR and CONFIG_X86_PAT and were then wondering why the performance of the display driver was so crappy.
>
> When those warnings now generate an error that you have to disable explicitly, that might not be bad at all.
>
> It at least points people to this setting and makes it really clear that they are doing something very unusual and need to keep in mind that it might not have the desired result.
>

That specific warning is surrounded with "#ifndef CONFIG_COMPILE_TEST"
so it doesn't really matter because it doesn't cause test build failures.
Of course, we could do the same for any #warning which does now
cause a test build failure.
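
A minimal sketch of that guard pattern, assuming a compile-test build
(CONFIG_COMPILE_TEST is defined by hand here as a stand-in for Kconfig).
USWC_SUPPORTED and uswc_supported() are hypothetical names, and this is not
a copy of the amdgpu source.

```c
/* Stand-in: a compile-test build would get this symbol from Kconfig. */
#define CONFIG_COMPILE_TEST 1

#if !defined(CONFIG_MTRR) || !defined(CONFIG_X86_PAT)
#ifndef CONFIG_COMPILE_TEST
/* Only real builds with an unusual config see this (fatal with -Werror). */
#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance
#endif
#define USWC_SUPPORTED 0
#else
#define USWC_SUPPORTED 1
#endif

/* Hypothetical helper reporting what the preprocessor decided. */
static int uswc_supported(void)
{
	return USWC_SUPPORTED;
}
```

With the guard in place, test builds compile cleanly, while a hand-made
config without the two options still gets the hint.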

Guenter

2021-09-09 16:56:40

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Wed, Sep 8, 2021 at 10:59 PM Christoph Hellwig <[email protected]> wrote:
>
> While we're at it, with -Werror something like this is really futile:

Yeah, I'm thinking we could do

-Wno-error=cpp

to at least allow the cpp warnings to come through without being fatal.

Because while they can be annoying too, they are most definitely under
our direct control, so..

I didn't actually test that, but I think it should work.

That said, maybe they should just be removed. They might be better off
just as Kconfig rules, rather than as a "hey, you screwed up your
Kconfig" warning after the fact.

Linus

2021-09-09 17:06:59

by Linus Torvalds

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Thu, Sep 9, 2021 at 4:43 AM Marco Elver <[email protected]> wrote:
>
> Sure, but the reality is that the real stack size is already doubled
> for KASAN. And that should be reflected in Wframe-larger-than.

I don't think that's true.

Quite the reverse, in fact.

Yes, the *dynamic* stack size is doubled due to KASAN, because it will
cause much deeper callchains.

But the individual frames don't grow that much apart from compilers
doing stupid things (ie apparently clang and KASAN_STACK), and if
anything, the deeper dynamic call chains means that the individual
frame size being small is even *more* important, but we do compensate
for the deeper stacks by making THREAD_SIZE_ORDER bigger at least on
x86.
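
As a worked example of that compensation, under assumptions that are not
stated in this mail: 4 kB pages, a base THREAD_SIZE_ORDER of 2 on x86-64,
and one extra order when KASAN is enabled. PAGE_SIZE_BYTES and thread_size()
are names made up for the sketch.

```c
/* Assumed page size; other architectures and configs differ. */
#define PAGE_SIZE_BYTES 4096UL

/*
 * Total thread stack in bytes, assuming order 2 without KASAN and
 * order 3 with it (an assumption, not taken from this thread).
 */
static unsigned long thread_size(int kasan_enabled)
{
	unsigned int order = 2 + (kasan_enabled ? 1 : 0);

	return PAGE_SIZE_BYTES << order;
}
```

Under those assumptions the stack goes from 16 kB to 32 kB: the total
budget doubles, which says nothing about how large one frame may be.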

Honestly, I am not even happy with the current "2048 bytes for
64-bit". The excuse has been that 64-bit needs more stack, but all it
ever did was clearly to just allow people to just do bad things.

Because a 1kB stack frame is horrendous even in 64-bit. That's not
"spill some registers" kind of stack frame. That's "put a big
structure on the stack" kind of stack frame regardless of any other
issues.

And no, "but we have 16kB of stack and we'll switch stacks on
interrupts" is not an excuse for one single level to use up 1kB, much
less 2kB. Does anybody seriously believe that we don't quite normally
have stacks that are easily tens of frames deep?

Without having some true "this is the full callchain" information, the
best we can do is just limit individual stack frames. And 2kB is
*excessive*.
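
The arithmetic behind that, as a small sketch. The 16 kB figure is the one
from the paragraph above; frames_that_fit() is a hypothetical helper that
assumes the worst case where every frame in the chain hits the limit.

```c
#include <stddef.h>

/* The "16kB of stack" mentioned above. */
#define THREAD_STACK_BYTES (16u * 1024u)

/* Number of frames of frame_bytes each that fit into stack_bytes. */
static size_t frames_that_fit(size_t stack_bytes, size_t frame_bytes)
{
	return frame_bytes ? stack_bytes / frame_bytes : 0;
}
```

frames_that_fit(THREAD_STACK_BYTES, 2048) is 8 and
frames_that_fit(THREAD_STACK_BYTES, 1024) is 16, both well short of a call
chain that is tens of frames deep.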

Linus

2021-09-21 15:44:33

by Arnd Bergmann

Subject: Re: [PATCH] Enable '-Werror' by default for all kernel builds

On Wed, Sep 8, 2021 at 10:55 PM Nathan Chancellor <[email protected]> wrote:
> On Tue, Sep 07, 2021 at 11:11:17AM +0200, Arnd Bergmann wrote:
> > On Tue, Sep 7, 2021 at 4:32 AM Nathan Chancellor <[email protected]> wrote:
> > > [...] function 'rtw_aes_decrypt' [-Werror,-Wframe-larger-than]
> > > arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:3043:6: error: stack frame size (1376) exceeds limit (1024) in function 'bw_calcs' [-Werror,-Wframe-larger-than]
> > > arm32-fedora.log: drivers/gpu/drm/amd/amdgpu/../display/dc/calcs/dce_calcs.c:77:13: error: stack frame size (5384) exceeds limit (1024) in function 'calculate_bandwidth' [-Werror,-Wframe-larger-than]
> > >
> > > Aside from the dce_calcs.c warnings, these do not seem too bad. I
> > > believe allmodconfig turns on UBSAN but it could also be aggressive
> > > inlining by clang. I intend to look at all -Wframe-larger-than warnings
> > > closely later.
> >
> > I've had them close to zero in the past, but a couple of new ones came in.
> >
> > The amdgpu ones are probably not fixable unless they stop using 64-bit
> > floats in the kernel for random calculations. The crypto/* ones tend
> > to be compiler bugs, but hard to fix
>
> I have started taking a look at these. Most of the allmodconfig ones
> appear to be related to CONFIG_KASAN, which is now supported for
> CONFIG_ARM.
>
> The two in bpmp-debugfs.c appear regardless of CONFIG_KASAN and it turns
> out that you actually submitted a patch for these:
>
> https://lore.kernel.org/r/[email protected]/
>
> Is it worth resending or pinging that?

I'm now restarting from a clean tree for my randconfig patches to see which
ones are actually needed, will hopefully get to that.

> The dce_calcs.c ones also appear without CONFIG_KASAN, which you noted
> is probably unavoidable.

(adding amdgpu folks to Cc here)

Harry Wentland did a nice rework for dcn_calcs.c that should also be
portable to dce_calcs.c, I hope that he will be able to get to that as well.

Looking at my older patches now, I found that I had only suppressed that one
and given up fixing it, but I did put my analysis into
https://bugs.llvm.org/show_bug.cgi?id=42551, which should be helpful
for addressing it in either the kernel or the compiler.

Arnd