LinuxLists.cc - mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

2022-06-22 10:33:03

Subject: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

Hi All,

I have recently (since yesterday) started building the mainline kernel
with clang-14 and I am seeing a build failure with allmodconfig.

In file included from drivers/net/ethernet/huawei/hinic/hinic_devlink.c:15:
In file included from ./include/linux/netlink.h:7:
In file included from ./include/linux/skbuff.h:15:
In file included from ./include/linux/time.h:60:
In file included from ./include/linux/time32.h:13:
In file included from ./include/linux/timex.h:67:
In file included from ./arch/x86/include/asm/timex.h:5:
In file included from ./arch/x86/include/asm/processor.h:22:
In file included from ./arch/x86/include/asm/msr.h:11:
In file included from ./arch/x86/include/asm/cpumask.h:5:
In file included from ./include/linux/cpumask.h:12:
In file included from ./include/linux/bitmap.h:11:
In file included from ./include/linux/string.h:253:
./include/linux/fortify-string.h:344:4: error: call to __write_overflow_field declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror,-Wattribute-warning]
__write_overflow_field(p_size_field, size);

After a bisect it pointed to 281d0c962752 ("fortify: Add Clang support").
Since it was introduced in v5.18, not sure if its already reported before.

--
Regards
Sudip

2022-06-22 14:05:54

by Linus Torvalds

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 22, 2022 at 5:23 AM Sudip Mukherjee
<[email protected]> wrote:
>
> I have recently (since yesterday) started building the mainline kernel
> with clang-14 and I am seeing a build failure with allmodconfig.

Yeah, the clang build has never been allmodconfig-clean, although I
think it's starting to get pretty close.

I build the kernel I actually _use_ with clang, and make sure it's
clean in sane configurations, but my full allmodconfig build I do with
gcc.

Partly because of that "the clang build hasn't quite gotten there yet"
and partly because last I tried it was even slower to build (not a big
issue for my default config, but does matter for the allmodconfig
build, even on my beefy home machine)

I would love for people to start doing allmodconfig builds with clang
too, but it would require some initial work to fix it... Hint, hint.

And in the case of this warning attribute case, the clang error messages are

(a) verbose

(b) useless

because they point to where the warning attribute is (I know where it
is), but don't point to where it's actually triggering (ie where it
was actually inlined and called from).

The gcc equivalent of that warning actually says exactly where the
problem is. The clang one is useless, which is probably part of why
people aren't fixing them, because even if they would want to, they
just give up.

Nick, Nathan, any chance of getting better error messages out of
clang? In some cases, they are very good, so it's not like clang does
bad error messages by default. But in this case, the error message
really is *entirely* useless.

Linus

2022-06-22 15:29:27

by Nathan Chancellor

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

Hi Linus,

On Wed, Jun 22, 2022 at 08:47:22AM -0500, Linus Torvalds wrote:
> On Wed, Jun 22, 2022 at 5:23 AM Sudip Mukherjee
> <[email protected]> wrote:
> >
> > I have recently (since yesterday) started building the mainline kernel
> > with clang-14 and I am seeing a build failure with allmodconfig.

Right, this is known. Kees sent a fix for that warning recently but it
went to net-next instead of net it seems:

https://git.kernel.org/netdev/net-next/c/2c0ab32b73cfe39a609192f338464e948fc39117

I am not sure if that change could be cherry-picked or applied to net so
that it could be fixed in mainline, I see the netdev maintainers are
already on CC so maybe they can comment on that?

> Yeah, the clang build has never been allmodconfig-clean, although I
> think it's starting to get pretty close.

Right, we are almost there with ARCH=arm64 and ARCH=x86_64. Both arm64
and x86_64 suffer from the warning that Sudip reported and arm64 grew a
new warning this release from commit 274d79e51875 ("ASoC: Intel: avs:
Configure modules according to their type"):

sound/soc/intel/avs/path.c:815:18: error: stack frame size (2176) exceeds limit (2048) in 'avs_path_create' [-Werror,-Wframe-larger-than]
struct avs_path *avs_path_create(struct avs_dev *adev, u32 dma_id,
^
1 error generated.

I haven't fully figured out what is going wrong with this one, I did
write some analysis on our bug tracker if anyone is curious:

https://github.com/ClangBuiltLinux/linux/issues/1642#issuecomment-1156815611

With those two warnings fixed, arm64 and x86_64 allmodconfig become
-Werror clean with clang-11 through clang-15 on mainline; 5.15 is
already there.

Other architectures aren't that far behind either, ARCH=arm is the worst
one because of KASAN, which has been brought up a few times before
without any real resolution.

> I build the kernel I actually _use_ with clang, and make sure it's
> clean in sane configurations, but my full allmodconfig build I do with
> gcc.
>
> Partly because of that "the clang build hasn't quite gotten there yet"
> and partly because last I tried it was even slower to build (not a big
> issue for my default config, but does matter for the allmodconfig
> build, even on my beefy home machine)
>
> I would love for people to start doing allmodconfig builds with clang
> too, but it would require some initial work to fix it... Hint, hint.

Right, we are working on a statically linked and optimized build of LLVM
that people can use similar to the GCC builds provided on kernel.org,
which should make the compile time problem not as bad as well as making
it easier for developers to get access to a recent version of clang with
all the fixes and improvements that we have made in upstream LLVM.

> And in the case of this warning attribute case, the clang error messages are
>
> (a) verbose
>
> (b) useless
>
> because they point to where the warning attribute is (I know where it
> is), but don't point to where it's actually triggering (ie where it
> was actually inlined and called from).
>
> The gcc equivalent of that warning actually says exactly where the
> problem is. The clang one is useless, which is probably part of why
> people aren't fixing them, because even if they would want to, they
> just give up.
>
> Nick, Nathan, any chance of getting better error messages out of
> clang? In some cases, they are very good, so it's not like clang does
> bad error messages by default. But in this case, the error message
> really is *entirely* useless.

Right, known papercut :( Kees also noticed this and reported it on our
issue tracker:

https://github.com/ClangBuiltLinux/linux/issues/1571

I don't do as much on the LLVM side as Nick so I'll let him comment on
how feasible implementing that in clang will be, he already has some
comments on the issue tracker there.

Cheers,
Nathan

2022-06-22 15:35:31

by Sudip Mukherjee

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

2022-06-22 15:54:34

by Linus Torvalds

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 22, 2022 at 10:08 AM Sudip Mukherjee
<[email protected]> wrote:
>
> Yeah, true. I had to check to find out its from the memcpy() in check_image_valid().

Funky but true - I can reproduce it, and just commenting out that
memcpy fixes the warning.

And I see nothing wrong with that code - it's copying a 'struct
fw_section_info_st' between two other structs that seem to have arrays
that are appropriately sized.

Replacing the memcpy() with just a structure assignment seems to get
rid of the warning, and seems to be a simple fix (patch attached), but
I don't understand why that memcpy() would warn.

This looks like a clang bug to me, but maybe I'm missing something.

Linus

Attachments:

patch.diff (829.00 B)

2022-06-22 16:09:53

by Sudip Mukherjee

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 22, 2022 at 10:19:31AM -0500, Linus Torvalds wrote:
> On Wed, Jun 22, 2022 at 10:08 AM Sudip Mukherjee
> <[email protected]> wrote:
> >
> > Yeah, true. I had to check to find out its from the memcpy() in check_image_valid().
>
> Funky but true - I can reproduce it, and just commenting out that
> memcpy fixes the warning.
>
> And I see nothing wrong with that code - it's copying a 'struct
> fw_section_info_st' between two other structs that seem to have arrays
> that are appropriately sized.

imho, there is no check for 'i' and it can become more than MAX_FW_TYPE_NUM and
in that case it will overwrite.

This fixes the error for me and I think will also address the concern that
clang is raising.

diff --git a/drivers/net/ethernet/huawei/hinic/hinic_devlink.c b/drivers/net/ethernet/huawei/hinic/hinic_devlink.c
index 60ae8bfc5f69..bc4b3ec15925 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_devlink.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_devlink.c
@@ -41,7 +41,7 @@ static bool check_image_valid(struct hinic_devlink_priv *priv, const u8 *buf,
return false;
}

- for (i = 0; i < fw_image->fw_info.fw_section_cnt; i++) {
+ for (i = 0; i < fw_image->fw_info.fw_section_cnt && i < MAX_FW_TYPE_NUM; i++) {
len += fw_image->fw_section_info[i].fw_section_len;
memcpy(&host_image->image_section_info[i],
&fw_image->fw_section_info[i],

--
Regards
Sudip

2022-06-22 16:15:48

by Linus Torvalds

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 22, 2022 at 11:00 AM Sudip Mukherjee
<[email protected]> wrote:
>
> imho, there is no check for 'i' and it can become more than MAX_FW_TYPE_NUM and
> in that case it will overwrite.

No. That's already checked a few lines before, in the

if (fw_image->fw_info.fw_section_cnt > MAX_FW_TYPE_NUM) {
.. error out

path. And fw_section_cnt as a value is an unsigned bitfield of 16
bits, so there's no chance of some kind of integer signedness
confusion.

So clang is just wrong here.

The fact that you can apparently silence the error with an extra bogus
check does hopefully give clang people a clue about *where* clang is
wrong, but it's not an acceptable workaround for the kernel.

We don't write worse source code to make bad compilers happy.

My "use a struct assignment" is more acceptable because at least then
the source code doesn't get worse. It arguably should have been done
that way the whole time, even if 'memcpy()' is the traditional C way
of doing struct assignments (traditional as in "_really_ old
traditional C").

Linus

2022-06-22 16:28:10

by Sudip Mukherjee

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 22, 2022 at 11:07:40AM -0500, Linus Torvalds wrote:
> On Wed, Jun 22, 2022 at 11:00 AM Sudip Mukherjee
> <[email protected]> wrote:
> >
> > imho, there is no check for 'i' and it can become more than MAX_FW_TYPE_NUM and
> > in that case it will overwrite.
>
> No. That's already checked a few lines before, in the
>
> if (fw_image->fw_info.fw_section_cnt > MAX_FW_TYPE_NUM) {
> .. error out
>
> path. And fw_section_cnt as a value is an unsigned bitfield of 16
> bits, so there's no chance of some kind of integer signedness
> confusion.

oops. yeah, sorry missed that.

>
> So clang is just wrong here.
>
> The fact that you can apparently silence the error with an extra bogus
> check does hopefully give clang people a clue about *where* clang is
> wrong, but it's not an acceptable workaround for the kernel.
>
> We don't write worse source code to make bad compilers happy.
>
> My "use a struct assignment" is more acceptable because at least then
> the source code doesn't get worse. It arguably should have been done
> that way the whole time, even if 'memcpy()' is the traditional C way
> of doing struct assignments (traditional as in "_really_ old
> traditional C").

Incidentally, its same as what Kees sent.

2c0ab32b73cf ("hinic: Replace memcpy() with direct assignment") in next-20220622.

--
Regards
Sudip

2022-06-22 16:30:04

by Linus Torvalds

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 22, 2022 at 10:02 AM Nathan Chancellor <[email protected]> wrote:
>
> Right, we are working on a statically linked and optimized build of LLVM
> that people can use similar to the GCC builds provided on kernel.org,
> which should make the compile time problem not as bad as well as making
> it easier for developers to get access to a recent version of clang with
> all the fixes and improvements that we have made in upstream LLVM.

So I'm on the road, and will try to see how well I can do that
allmodconfig build on my poor laptop and see what else goes wrong for
now.

But I do have to say that it's just a lot better if the regular distro
compiler build works out of the box. I did build my own clang binary
for a while, just because I was an early adopter of the whole "ask
goto with outputs" thing, but I've been *so* much happier now that I
don't need to do that any more.

So I would prefer not going backwards.

I wish the standard clang build just stopped doing the crazy shared
library thing. The security arguments for it are just ridiculous, when
any shared library update for any security reason would mean a full
clang update _anyway_.

I realize it's partly distro rules too, but the distros only do that
"we always build shared libraries" when the project itself makes that
an option. And it's a really stupid option for the regular C compiler
case.

Side note: I think gcc takes almost exactly the opposite approach, and
does much better as a result. It doesn't do a shared libary, but what
it *does* do is make 'gcc' itself a reasonably small binary that just
does the command line front-end parsing.

The advantage of the gcc model is that it works *really* well for the
"test compiler options" kind of stage, where you only run that much
simpler 'gcc' wrapper binary.

We don't actually take full advantage of that, because we do end up
doing a real "build" of an empty file, so "cc1" does actually get
executed, but even then it's done fairly efficiently with 'vfork()'.
That "cc-option" part of the kernel build is actually noticeable
during configuration etc, and clang is *much* slower because of how it
is built.

See

time clang -Werror -c -x c /dev/null

and compare it with gcc. And if you want to see a really *big*
difference, pick a command line option that causes an error because it
doesn't exist..

I really wish clang wasn't so much noticeably slower. It's limiting
what I do with it, and I've had other developers say the same.

Linus

2022-06-22 17:38:30

by Sudip Mukherjee

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 22, 2022 at 11:21:09AM -0500, Linus Torvalds wrote:
> On Wed, Jun 22, 2022 at 10:02 AM Nathan Chancellor <[email protected]> wrote:
> >
> > Right, we are working on a statically linked and optimized build of LLVM
> > that people can use similar to the GCC builds provided on kernel.org,
> > which should make the compile time problem not as bad as well as making
> > it easier for developers to get access to a recent version of clang with
> > all the fixes and improvements that we have made in upstream LLVM.
>
> So I'm on the road, and will try to see how well I can do that
> allmodconfig build on my poor laptop and see what else goes wrong for
> now.

Tried it after applying your patch. There was no build failure, but some warnings:

fs/reiserfs/reiserfs.o: warning: objtool: leaf_copy_items_entirely+0x7fd: stack state mismatch: cfa1=4+240 cfa2=4+232
arch/x86/kvm/kvm.o: warning: objtool: emulator_cmpxchg_emulated+0x705: call to __ubsan_handle_load_invalid_value() with UACCESS enabled
arch/x86/kvm/kvm.o: warning: objtool: paging64_update_accessed_dirty_bits+0x39e: call to __ubsan_handle_load_invalid_value() with UACCESS enabled
arch/x86/kvm/kvm.o: warning: objtool: paging32_update_accessed_dirty_bits+0x390: call to __ubsan_handle_load_invalid_value() with UACCESS enabled
arch/x86/kvm/kvm.o: warning: objtool: ept_update_accessed_dirty_bits+0x43f: call to __ubsan_handle_load_invalid_value() with UACCESS enabled
drivers/video/fbdev/smscufx.o: warning: objtool: ufx_ops_write() falls through to next function ufx_ops_setcolreg()
drivers/video/fbdev/udlfb.o: warning: objtool: dlfb_ops_write() falls through to next function dlfb_ops_setcolreg()
drivers/soc/qcom/qcom_rpmh.o: warning: objtool: rpmh_rsc_write_ctrl_data() falls through to next function trace_raw_output_rpmh_tx_done()
drivers/scsi/mpi3mr/mpi3mr.o: warning: objtool: mpi3mr_op_request_post() falls through to next function mpi3mr_check_rh_fault_ioc()
drivers/gpu/drm/radeon/radeon.o: warning: objtool: sumo_dpm_set_power_state() falls through to next function sumo_dpm_post_set_power_state()
drivers/net/ethernet/wiznet/w5100.o: warning: objtool: w5100_tx_skb() falls through to next function w5100_get_drvinfo()
drivers/net/ethernet/wiznet/w5100.o: warning: objtool: w5100_rx_skb() falls through to next function w5100_mmio_probe()
vmlinux.o: warning: objtool: __startup_64() falls through to next function startup_64_setup_env()
vmlinux.o: warning: objtool: sync_regs+0x24: call to memcpy() leaves .noinstr.text section
vmlinux.o: warning: objtool: vc_switch_off_ist+0xbe: call to memcpy() leaves .noinstr.text section
vmlinux.o: warning: objtool: fixup_bad_iret+0x36: call to memset() leaves .noinstr.text section
vmlinux.o: warning: objtool: __sev_get_ghcb+0xa0: call to memcpy() leaves .noinstr.text section
vmlinux.o: warning: objtool: __sev_put_ghcb+0x35: call to memcpy() leaves .noinstr.text section

The build took 16 minutes 6 seconds on the build machines in my office (Codethink).

--
Regards
Sudip

2022-06-22 18:47:02

by Jakub Kicinski

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, 22 Jun 2022 08:01:57 -0700 Nathan Chancellor wrote:
> On Wed, Jun 22, 2022 at 08:47:22AM -0500, Linus Torvalds wrote:
> > On Wed, Jun 22, 2022 at 5:23 AM Sudip Mukherjee
> > <[email protected]> wrote:
> > >
> > > I have recently (since yesterday) started building the mainline kernel
> > > with clang-14 and I am seeing a build failure with allmodconfig.
>
> Right, this is known. Kees sent a fix for that warning recently but it
> went to net-next instead of net it seems:
>
> https://git.kernel.org/netdev/net-next/c/2c0ab32b73cfe39a609192f338464e948fc39117
>
> I am not sure if that change could be cherry-picked or applied to net so
> that it could be fixed in mainline, I see the netdev maintainers are
> already on CC so maybe they can comment on that?

Sorry about the hassle, done now. It's commit 1e70212e0315 ("hinic:
Replace memcpy() with direct assignment") in net and should make it
to Linus tomorrow.

2022-06-22 18:54:12

by Linus Torvalds

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 22, 2022 at 12:26 PM Sudip Mukherjee
<[email protected]> wrote:
>
> Tried it after applying your patch. There was no build failure, but some warnings:

So some of those objtool warnings are, I think, because clang does odd
and crazy things for when it decides "this is not reachable" code.

I don't much like it, and neither does objtool, but it is what it is.
When clang decides "I'm calling a function that cannot return", it
will have a "call" instruction and then it will just fall off the face
of the earth after that.

That includes falling through to the next function, or just to random
other labels after the function, and then objtool as a result
complains about a stack state mismatch (when the fallthrough is the
same function, but now the stack pointer is different in different
parts), or of the "falls through to next function".

I think it's a clang misfeature in that if something goes wrong, you
basically execute random code. I'd much rather see clang insert a real
'ud' instruction or 'int3' or whatever. But it doesn't.

I didn't check whether gcc avoids that "don't make assumptions about
non-return functions" or whether it's just that objtool recognizes
whatever pattern gcc uses.

So *some* of the warnings are basically code generation disagreements,
but aren't signs of actual problems per se.

Others may be because objdump knows about gcc foibles in ways it
doesn't know about some clang foibles. I think the "call to memcpy()
leaves .noinstr.text section" might be of that type: clang seems to
sometimes generate out-of-line memcpy calls for absolutely ridiculous
things (I've seen a 16-byte memcpy done that way - just stupid when
it's just two load/store pairs, and just the function call overhead is
much more than that).

And then a couple seem to be real mis-features in our code:

> arch/x86/kvm/kvm.o: warning: objtool: emulator_cmpxchg_emulated+0x705: call to __ubsan_handle_load_invalid_value() with UACCESS enabled
> arch/x86/kvm/kvm.o: warning: objtool: paging64_update_accessed_dirty_bits+0x39e: call to __ubsan_handle_load_invalid_value() with UACCESS enabled
> arch/x86/kvm/kvm.o: warning: objtool: paging32_update_accessed_dirty_bits+0x390: call to __ubsan_handle_load_invalid_value() with UACCESS enabled
> arch/x86/kvm/kvm.o: warning: objtool: ept_update_accessed_dirty_bits+0x43f: call to __ubsan_handle_load_invalid_value() with UACCESS enabled

and I actually sent email to the kvm and x86 people involved in those
code sequences. It's the __try_cmpxchg_user() macro that isn't great.

Not a fatal problem, but a sign of something we should do better, and
that one doesn't seem to be due to any actual clang misfeature.

> The build took 16 minutes 6 seconds on the build machines in my office (Codethink).

Oh yeah. I'm on a laptop that is a few years old (good laptop, just
not top-of-the-line any more), and it's been chugging along for almost
two hours by now. It's still working on it, and making "solid
progress". But it does not seem to have hit any other failures after
that memcpy fixlet, so I guess I will get to that successful end some
time soon.

I don't typically do allmodconfig builds while on the road, much less
with clang. With gcc, I seem to recall it being a bit over an hour.
Clang is worse.

Linus

2022-06-22 22:54:06

by Nick Desaulniers

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

+ llvm list, moving net and net-folk to bcc. Follow along on lore if
still interested.

On Wed, Jun 22, 2022 at 10:49 AM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Jun 22, 2022 at 12:26 PM Sudip Mukherjee
> <[email protected]> wrote:
> >
> > Tried it after applying your patch. There was no build failure, but some warnings:
>
> So some of those objtool warnings are, I think, because clang does odd
> and crazy things for when it decides "this is not reachable" code.
>
> I don't much like it, and neither does objtool, but it is what it is.
> When clang decides "I'm calling a function that cannot return", it
> will have a "call" instruction and then it will just fall off the face
> of the earth after that.
>
> That includes falling through to the next function, or just to random
> other labels after the function, and then objtool as a result
> complains about a stack state mismatch (when the fallthrough is the
> same function, but now the stack pointer is different in different
> parts), or of the "falls through to next function".
>
> I think it's a clang misfeature in that if something goes wrong, you
> basically execute random code. I'd much rather see clang insert a real
> 'ud' instruction or 'int3' or whatever. But it doesn't.

So adding `-mllvm -trap-unreachable` will turn these
`__builtin_unreachable()`'s into trapping instructions. I think we
should just do that/enable that in the kernel. The following patch
eliminates ALL of the fallthrough warnings observed from objtool on
x86_64 defconfig builds.

```
diff --git a/scripts/Makefile.clang b/scripts/Makefile.clang
index 87285b76adb2..1fbf8a8f3751 100644
--- a/scripts/Makefile.clang
+++ b/scripts/Makefile.clang
@@ -36,6 +36,7 @@ endif
# so they can be implemented or wrapped in cc-option.
CLANG_FLAGS += -Werror=unknown-warning-option
CLANG_FLAGS += -Werror=ignored-optimization-argument
+CLANG_FLAGS += -mllvm -trap-unreachable
KBUILD_CFLAGS += $(CLANG_FLAGS)
KBUILD_AFLAGS += $(CLANG_FLAGS)
export CLANG_FLAGS
```

There's more I need to do for LTO; `-mllvm` flags need to be passed to
the linker in that case. Let me do a few more builds, collect
statistics on build size differences (guessing neglidgeable), then
will send out a more formal patch.

>
> I didn't check whether gcc avoids that "don't make assumptions about
> non-return functions" or whether it's just that objtool recognizes
> whatever pattern gcc uses.
>
> So *some* of the warnings are basically code generation disagreements,
> but aren't signs of actual problems per se.
>
> Others may be because objdump knows about gcc foibles in ways it
> doesn't know about some clang foibles. I think the "call to memcpy()
> leaves .noinstr.text section" might be of that type: clang seems to
> sometimes generate out-of-line memcpy calls for absolutely ridiculous
> things (I've seen a 16-byte memcpy done that way - just stupid when
> it's just two load/store pairs, and just the function call overhead is
> much more than that).

IIRC, that was from CONFIG_KASAN.

Looking at the disassembly (llvm-objdump's `--disassemble-symbols=`
flag is handy) of the following from an allmodconfig build:

vmlinux.o: warning: objtool: sync_regs+0x24: call to memcpy() leaves
.noinstr.text section

168B memcpy

vmlinux.o: warning: objtool: vc_switch_off_ist+0xbe: call to memcpy()
leaves .noinstr.text section

168B memcpy

vmlinux.o: warning: objtool: fixup_bad_iret+0x36: call to memset()
leaves .noinstr.text section

168B memset

vmlinux.o: warning: objtool: __sev_get_ghcb+0xa0: call to memcpy()
leaves .noinstr.text section

4096B memcpy

vmlinux.o: warning: objtool: __sev_put_ghcb+0x35: call to memcpy()
leaves .noinstr.text section

4096B memcpy

So it doesn't seem like it's the same issue of "dump memcpy of small
`n` that we'd seen previously). I suspect that objtool's assumption
that the compiler can't turn assignments into libcalls is...compiler
specific.

Replying to earlier points in the thread now:

On Wed, Jun 22, 2022 at 9:21 AM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Jun 22, 2022 at 10:02 AM Nathan Chancellor <[email protected]> wrote:
> >
> > Right, we are working on a statically linked and optimized build of LLVM
> > that people can use similar to the GCC builds provided on kernel.org,
> > which should make the compile time problem not as bad as well as making
> > it easier for developers to get access to a recent version of clang with
> > all the fixes and improvements that we have made in upstream LLVM.
>
> So I'm on the road, and will try to see how well I can do that
> allmodconfig build on my poor laptop and see what else goes wrong for
> now.
>
> But I do have to say that it's just a lot better if the regular distro
> compiler build works out of the box. I did build my own clang binary
> for a while, just because I was an early adopter of the whole "ask
> goto with outputs" thing, but I've been *so* much happier now that I
> don't need to do that any more.
>
> So I would prefer not going backwards.

I agree.

> I wish the standard clang build just stopped doing the crazy shared
> library thing. The security arguments for it are just ridiculous, when
> any shared library update for any security reason would mean a full
> clang update _anyway_.

Regarding the "security" argument against the use of shared libraries;
I agree with you. If the compiler will just crash when given five
open parenthesis as inputs, it never was designed for untrusted input
in the first place.

That said, if I had to host executables and libraries for download,
perhaps it would be a smaller bill to serve a bunch of libraries over
fully statically linked (and thus larger) executables.

I think it's too late and a non-starter to suggest removing the
ability to build libllvm.so or libclang.so to the LLVM community at
this point though.

>
> I realize it's partly distro rules too, but the distros only do that
> "we always build shared libraries" when the project itself makes that
> an option. And it's a really stupid option for the regular C compiler
> case.

Right, in that case our hands are somewhat tied. They're not our
distributions. Even if we do our own builds/distributions of clang,
you can lead a horse to water...

That said, I think we can can help distros better configure their
builds. For instance, a sweet spot might be to statically link clang,
but dynamically link all of the GNU-binutils-like substitutes. Those
are seldom invoked and not the bottleneck in any profile. LLVM's
cmake doesn't have an option to do that easily though; we should
provide one then recommend distros use it. Make it easy to do the
right thing.

>
> Side note: I think gcc takes almost exactly the opposite approach, and
> does much better as a result. It doesn't do a shared libary, but what
> it *does* do is make 'gcc' itself a reasonably small binary that just
> does the command line front-end parsing.
>
> The advantage of the gcc model is that it works *really* well for the
> "test compiler options" kind of stage, where you only run that much
> simpler 'gcc' wrapper binary.
>
> We don't actually take full advantage of that, because we do end up
> doing a real "build" of an empty file, so "cc1" does actually get
> executed, but even then it's done fairly efficiently with 'vfork()'.
> That "cc-option" part of the kernel build is actually noticeable
> during configuration etc, and clang is *much* slower because of how it
> is built.
>
> See
>
> time clang -Werror -c -x c /dev/null
>
> and compare it with gcc. And if you want to see a really *big*
> difference, pick a command line option that causes an error because it
> doesn't exist..

Looking at a profile, there's a lot of stupid stuff we're doing. We
can probably get faster "at doing nothing." See
https://gist.github.com/nickdesaulniers/81a87ffa784c13d0bf60f60b1d54651b
for the profile and my commentary/initial thoughts.

>
> I really wish clang wasn't so much noticeably slower. It's limiting
> what I do with it, and I've had other developers say the same.

We can do better. I'll keep pushing on this up my chain of command.
That statement stands in stark contrast to the below:

On Wed, Jun 22, 2022 at 6:47 AM Linus Torvalds
<[email protected]> wrote:
>
> I build the kernel I actually _use_ with clang, and make sure it's
> clean in sane configurations, but my full allmodconfig build I do with
> gcc.

A fantastic and motivational endorsement for the hard work we put in,
which is why it would be a travesty if build times and allmodconfig
hygiene caused us to lose your support.

>
> Partly because of that "the clang build hasn't quite gotten there yet"
> and partly because last I tried it was even slower to build (not a big
> issue for my default config, but does matter for the allmodconfig
> build, even on my beefy home machine)
>
> I would love for people to start doing allmodconfig builds with clang
> too, but it would require some initial work to fix it... Hint, hint.

As Nathan notes, we've been working on it. Long tail that's constantly
regressing, but WIP nonetheless.

>
> And in the case of this warning attribute case, the clang error messages are
>
> (a) verbose
>
> (b) useless
>
> because they point to where the warning attribute is (I know where it
> is), but don't point to where it's actually triggering (ie where it
> was actually inlined and called from).
>
> The gcc equivalent of that warning actually says exactly where the
> problem is. The clang one is useless, which is probably part of why
> people aren't fixing them, because even if they would want to, they
> just give up.
>
> Nick, Nathan, any chance of getting better error messages out of
> clang? In some cases, they are very good, so it's not like clang does
> bad error messages by default. But in this case, the error message
> really is *entirely* useless.

Yeah, it's definitely not helpful in its current form. I'll have to
think a bit more about how we can retain and display inlining
decisions, which is what's necessary here to make the diagnostic
actionable.

Building with `KCFLAGS=-Rpass=inline` does provide some hints, but
also a lot of unhelpful noise:

```
$ make LLVM=1 -j72 drivers/net/ethernet/huawei/hinic/hinic_devlink.o
KCFLAGS=-Rpass=inline
...
drivers/net/ethernet/huawei/hinic/hinic_devlink.c:46:3: remark:
'_Z18fortify_memcpy_chkmmmmmPKc' inlined into 'check_image_valid':
always inline attribute at callsite check_image_valid:23:3;
[-Rpass=inline]
memcpy(&host_image->image_section_info[i],
^
...
```
AFAIK, the current architecture of LLVM doesn't retain inlining
decisions made, so clang can point to the definition of a function
that shouldn't have been called (one annotated w/
__attribute__((error(""))) or __attribute__((warning("")))) but it
can't tell you which call site specifically was problematic. There's
similarly unhelpful diagnostics sometimes with inline asm that feels
vaguely reminiscent I document here:
https://github.com/ClangBuiltLinux/linux/issues/1571#issuecomment-1135199630.

As to _why_ clang isn't getting this object size correct, I wasn't
able to find out today, but will keep digging. Stay tuned.
https://github.com/ClangBuiltLinux/linux/issues/1592
--
Thanks,
~Nick Desaulniers

2022-06-23 10:25:28

by David Laight

[permalink] [raw]

Subject: RE: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

From: Nick Desaulniers
> Sent: 22 June 2022 23:40
....
> > We don't actually take full advantage of that, because we do end up
> > doing a real "build" of an empty file, so "cc1" does actually get
> > executed, but even then it's done fairly efficiently with 'vfork()'.
> > That "cc-option" part of the kernel build is actually noticeable
> > during configuration etc, and clang is *much* slower because of how it
> > is built.
> >
> > See
> >
> > time clang -Werror -c -x c /dev/null
> >
> > and compare it with gcc. And if you want to see a really *big*
> > difference, pick a command line option that causes an error because it
> > doesn't exist..
>
> Looking at a profile, there's a lot of stupid stuff we're doing. We
> can probably get faster "at doing nothing." See
> https://gist.github.com/nickdesaulniers/81a87ffa784c13d0bf60f60b1d54651b
> for the profile and my commentary/initial thoughts.
>
> >
> > I really wish clang wasn't so much noticeably slower. It's limiting
> > what I do with it, and I've had other developers say the same.
>
> We can do better. I'll keep pushing on this up my chain of command.
> That statement stands in stark contrast to the below:

The slow startup must also make a big difference to anything
that uses autoconf.
That tends to run a lot of compiles of trivial code just to
find out that the system is 'normal'.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-06-24 00:08:11

by Nick Desaulniers

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 22, 2022 at 3:40 PM Nick Desaulniers
<[email protected]> wrote:
>
> On Wed, Jun 22, 2022 at 10:49 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Wed, Jun 22, 2022 at 12:26 PM Sudip Mukherjee
> > <[email protected]> wrote:
> > >
> > > Tried it after applying your patch. There was no build failure, but some warnings:
> >
> > So some of those objtool warnings are, I think, because clang does odd
> > and crazy things for when it decides "this is not reachable" code.
> >
> > I don't much like it, and neither does objtool, but it is what it is.
> > When clang decides "I'm calling a function that cannot return", it
> > will have a "call" instruction and then it will just fall off the face
> > of the earth after that.
> >
> > That includes falling through to the next function, or just to random
> > other labels after the function, and then objtool as a result
> > complains about a stack state mismatch (when the fallthrough is the
> > same function, but now the stack pointer is different in different
> > parts), or of the "falls through to next function".
> >
> > I think it's a clang misfeature in that if something goes wrong, you
> > basically execute random code. I'd much rather see clang insert a real
> > 'ud' instruction or 'int3' or whatever. But it doesn't.
>
> So adding `-mllvm -trap-unreachable` will turn these
> `__builtin_unreachable()`'s into trapping instructions. I think we
> should just do that/enable that in the kernel. The following patch
> eliminates ALL of the fallthrough warnings observed from objtool on
> x86_64 defconfig builds.
>
> ```
> diff --git a/scripts/Makefile.clang b/scripts/Makefile.clang
> index 87285b76adb2..1fbf8a8f3751 100644
> --- a/scripts/Makefile.clang
> +++ b/scripts/Makefile.clang
> @@ -36,6 +36,7 @@ endif
> # so they can be implemented or wrapped in cc-option.
> CLANG_FLAGS += -Werror=unknown-warning-option
> CLANG_FLAGS += -Werror=ignored-optimization-argument
> +CLANG_FLAGS += -mllvm -trap-unreachable
> KBUILD_CFLAGS += $(CLANG_FLAGS)
> KBUILD_AFLAGS += $(CLANG_FLAGS)
> export CLANG_FLAGS
> ```
>
> There's more I need to do for LTO; `-mllvm` flags need to be passed to
> the linker in that case. Let me do a few more builds, collect
> statistics on build size differences (guessing neglidgeable), then
> will send out a more formal patch.

Looks like these are actually from calls to
__ubsan_handle_divrem_overflow which is __noreturn when panic_on_warn
is set by the corresponding config. I wonder if we should be
unconditionally adding __ubsan_handle_divrem_overflow to the allow
list `global_noreturns` in tools/objtool/check.c? It seems like the
kconfig defines aren't passed through to the tools/ sources.

List of fallthrough warnings from allmodconfig for reference:
https://lore.kernel.org/lkml/YrNQrPNF%2FXfriP99@debian/

--
Thanks,
~Nick Desaulniers

2022-06-28 23:14:22

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Thu, Jun 23, 2022 at 04:33:35PM -0700, Nick Desaulniers wrote:
> On Wed, Jun 22, 2022 at 3:40 PM Nick Desaulniers
> <[email protected]> wrote:
> >
> > On Wed, Jun 22, 2022 at 10:49 AM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Wed, Jun 22, 2022 at 12:26 PM Sudip Mukherjee
> > > <[email protected]> wrote:
> > > >
> > > > Tried it after applying your patch. There was no build failure, but some warnings:
> > >
> > > So some of those objtool warnings are, I think, because clang does odd
> > > and crazy things for when it decides "this is not reachable" code.
> > >
> > > I don't much like it, and neither does objtool, but it is what it is.
> > > When clang decides "I'm calling a function that cannot return", it
> > > will have a "call" instruction and then it will just fall off the face
> > > of the earth after that.

FWIW, GCC does the same thing for calls to __noreturn functions. After
the call it just "falls off the face of the earth". So there's nothing
special about Clang here.

> Looks like these are actually from calls to
> __ubsan_handle_divrem_overflow which is __noreturn when panic_on_warn
> is set by the corresponding config. I wonder if we should be
> unconditionally adding __ubsan_handle_divrem_overflow to the allow
> list `global_noreturns` in tools/objtool/check.c? It seems like the
> kconfig defines aren't passed through to the tools/ sources.
>
> List of fallthrough warnings from allmodconfig for reference:
> https://lore.kernel.org/lkml/YrNQrPNF%2FXfriP99@debian/

As discussed with Nick on IRC, the problem highlighted by the
fallthrough warnings seems to be a mismatch in expectations as to
whether __ubsan_handle_divrem_overflow() is noreturn.

When Clang inserts calls to it, it assumes it's noreturn. But in fact
its kernel implementation can actually return, if !panic_on_warn.

So it needs to be changed to *never* return, otherwise hilarity will
ensue when it returns and "falls off the face of the earth".

And then objtool needs to be told that it's in fact noreturn.

So, something like this:

diff --git a/lib/ubsan.c b/lib/ubsan.c
index 36bd75e33426..d257baa62721 100644
--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -177,6 +177,7 @@ void __ubsan_handle_divrem_overflow(void *_data, void *lhs, void *rhs)
pr_err("division by zero\n");

ubsan_epilogue();
+ panic("can't return from __ubsan_handle_divrem_overflow()");
}
EXPORT_SYMBOL(__ubsan_handle_divrem_overflow);

diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 864bb9dd3584..6f67d48fb3e4 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -187,6 +187,7 @@ static bool __dead_end_function(struct objtool_file *file, struct symbol *func,
"__invalid_creds",
"cpu_startup_entry",
"__ubsan_handle_builtin_unreachable",
+ "__ubsan_handle_divrem_overflow",
"ex_handler_msr_mce",
};

2022-06-29 16:44:06

by Linus Torvalds

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Tue, Jun 28, 2022 at 3:43 PM Josh Poimboeuf <[email protected]> wrote:
>
> So, something like this:

No, clang should just be fixed.

These UBSAN reports should usually be WARN_ON_ONCE.

It's all the same issues we've had before: causing a panic will just
kill the machine, and gets us fewer reports.

Now, UBSAN is something that presumably normal people don't actually
run on real hardware, so it's probably less of a deal than some. But
hey, maybe somebody wants to actually run an UBSAN kernel on a real
load with a full accelerated graphical UI and real drivers: a panic
may end up killing the kernel, and there you sit, with a dead machine
and no idea what went wrong.

So the whole "panic if UBSAN reports something" is COMPLETE GARBAGE.
It actually makes the whole point of running UBSAN go away. You *lose*
coverage.

So please don't make the kernel worse because clang got something like
this wrong.

Just fix clang.

And fix your mindset.

Linus

2022-06-29 16:45:55

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 29, 2022 at 09:08:20AM -0700, Linus Torvalds wrote:
> On Tue, Jun 28, 2022 at 3:43 PM Josh Poimboeuf <[email protected]> wrote:
> >
> > So, something like this:
>
> No, clang should just be fixed.
>
> These UBSAN reports should usually be WARN_ON_ONCE.
>
> It's all the same issues we've had before: causing a panic will just
> kill the machine, and gets us fewer reports.
>
> Now, UBSAN is something that presumably normal people don't actually
> run on real hardware, so it's probably less of a deal than some. But
> hey, maybe somebody wants to actually run an UBSAN kernel on a real
> load with a full accelerated graphical UI and real drivers: a panic
> may end up killing the kernel, and there you sit, with a dead machine
> and no idea what went wrong.
>
> So the whole "panic if UBSAN reports something" is COMPLETE GARBAGE.
> It actually makes the whole point of running UBSAN go away. You *lose*
> coverage.
>
> So please don't make the kernel worse because clang got something like
> this wrong.
>
> Just fix clang.
>
> And fix your mindset.

Yeah, good point. All the other UBSAN handlers (other than builtin
unreachable) try to recover. There's nothing special about divrem
overflow which requires it to be fatal.

So clang needs to stop assuming the divrem overflow handler is noreturn.

--
Josh

2022-06-29 21:23:15

by Nick Desaulniers

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 29, 2022 at 9:34 AM Josh Poimboeuf <[email protected]> wrote:
>
> On Wed, Jun 29, 2022 at 09:08:20AM -0700, Linus Torvalds wrote:
> > On Tue, Jun 28, 2022 at 3:43 PM Josh Poimboeuf <[email protected]> wrote:
> > >
> > > So, something like this:
> >
> > No, clang should just be fixed.
> >
> > These UBSAN reports should usually be WARN_ON_ONCE.
> >
> > It's all the same issues we've had before: causing a panic will just
> > kill the machine, and gets us fewer reports.
> >
> > Now, UBSAN is something that presumably normal people don't actually
> > run on real hardware, so it's probably less of a deal than some. But
> > hey, maybe somebody wants to actually run an UBSAN kernel on a real
> > load with a full accelerated graphical UI and real drivers: a panic
> > may end up killing the kernel, and there you sit, with a dead machine
> > and no idea what went wrong.
> >
> > So the whole "panic if UBSAN reports something" is COMPLETE GARBAGE.
> > It actually makes the whole point of running UBSAN go away. You *lose*
> > coverage.
> >
> > So please don't make the kernel worse because clang got something like
> > this wrong.
> >
> > Just fix clang.
> >
> > And fix your mindset.
>
> Yeah, good point. All the other UBSAN handlers (other than builtin
> unreachable) try to recover. There's nothing special about divrem
> overflow which requires it to be fatal.
>
> So clang needs to stop assuming the divrem overflow handler is noreturn.

I haven't been able to verify in the generated IR or in LLVM sources
that that is the case; it doesn't appear to be any assumption about
__ubsan_handle_divrem_overflow being noreturn.

Instead, it looks like this has more to do with undefined behavior,
IIUC. It seems that for `-fsanitize=integer-divide-by-zero`
(CONFIG_UBSAN_DIV_ZERO), we introduce new control flow to warn on div
by zero, but the results of the division are undefined, so
unconditionally trying to consume the result is UB.

So:

int foo (int x, int y) {
return x / y;
}

doesn't check for divide by zero, -fsanitize=integer-divide-by-zero
introduces control flow to check for divide by zero and call
__ubsan_handle_divrem_overflow in that exceptional case, but the
result of the division is still undefined for that case, so control
flow that tries to use that result is simplified into using the result
in the non-zero divisor case, and calls to
__ubsan_handle_divrem_overflow followed by (what is effectively)
__builtin_unreachable().

For hahas I tried:

```
diff --git a/drivers/video/fbdev/smscufx.c b/drivers/video/fbdev/smscufx.c
index d7aa5511c361..cb2eab660fd2 100644
--- a/drivers/video/fbdev/smscufx.c
+++ b/drivers/video/fbdev/smscufx.c
@@ -907,8 +907,11 @@ static ssize_t ufx_ops_write(struct fb_info
*info, const char __user *buf,
result = fb_sys_write(info, buf, count, ppos);

if (result > 0) {
- int start = max((int)(offset / info->fix.line_length), 0);
- int lines = min((u32)((result / info->fix.line_length) + 1),
+ int start, lines;
+ if (!info->fix.line_length)
+ return 0;
+ start = max((int)(offset / info->fix.line_length), 0);
+ lines = min((u32)((result / info->fix.line_length) + 1),
(u32)info->var.yres);

ufx_handle_damage(dev, 0, start, info->var.xres, lines);
```
That way we don't unconditionally consume the results of a possible
divide by zero, and that fixes the warning from that TU. Does
inserting guards against divide by zero on the kernel side seem like
reasonable fixes for these warnings?
--
Thanks,
~Nick Desaulniers

2022-06-29 22:14:06

by Linus Torvalds

[permalink] [raw]

Subject: Re: mainline build failure due to 281d0c962752 ("fortify: Add Clang support")

On Wed, Jun 29, 2022 at 2:21 PM Nick Desaulniers
<[email protected]> wrote:
>
> For hahas I tried:
>
> ```
> diff --git a/drivers/video/fbdev/smscufx.c b/drivers/video/fbdev/smscufx.c

Yeah, no.

I think the proper fix is to just say

depends on !CLANG

in the UBSAN Kconfig.

The whole point of UBSAN is to *detect* undefined behavior, but continue.

If we wanted to crash on undefined behavior, we'd just leave it alone,
without any compiler option.

Think of it this way: you have a user (corporate-speak: "customer")
that is seeing behavior you can't figure out. You enable a lot of
these debug options to see "are they triggering something I can't
trigger?".

But you want to get the *report* about that, you don't want to
actually break the users load. You want the user to be able to do
what they always did, knowing that at least you're not making their
problem any worse. Sure, it's slower, but it should just continue as
well as it ever did.

Whatever UBSAN then detects may OR MAY NOT be the actual cause of the
user report. So no, it's not ok to say "Ahh, I found it, I will now
print out the report and crash".

So at no point is it ok to say "oh, that's undefined, so we can do
anything we want". NOT AT ALL. That's actually true in general, but
it's _particularly_ true when you build with UBSAN, since now you're
actively trying to get reports about undefined behavior, and crashing
will destroy that.

It will destroy that particularly for a kernel where crashing in an
unusual place that the developer doesn't see quite possibly means "odd
device driver". In this case it would be the particular frame buffer
the user uses - crashing there probably means that there is no output
to help debug things.

But this is actually true even for non-kernel uses: if you are running
an app as a developer, and you're looking for problems, having one
place with undefined behavior in no way means that you don't care
about any other undefined places. So crashing is literally *never* the
right thing to do.

So no, the whole "it's still undefined, so we can't return" argument
is completely broken.

Linus