2023-02-14 08:46:39

by Xi Ruoyao

Subject: "kernel ade access" oops on LoongArch

This is a "help wanted" message :(.

I've recently run into a strange kernel oops while testing Glibc for LoongArch. A log looks like this:

[11569.195043] Kernel ade access[#1]:
[11569.198441] CPU: 1 PID: 1132296 Comm: ld-linux-loonga Not tainted 6.2.0-rc8+ #61
[11569.205792] Hardware name: Loongson Loongson-3A5000-HV-7A2000-1w-V0.1-EVB/Loongson-LS3A5000-7A2000-1w-EVB-V1.21, BIOS Loongson-UDK2018-V4.0.05383-beta10 1
[11569.219536] $ 0 : 0000000000000000 90000000005e3448 90000001113a0000 90000001113a3ab0
[11569.227505] $ 4 : 90000001113a3af8 1000000000cf16d0 5555555555555850 000000000000000c
[11569.235475] $ 8 : 90000000009caa10 0000000000000000 00000000000002ca 000000000000008b
[11569.243438] $12 : 0000000000000001 9000000000cf1258 ffffffffffffffff 00007ffffb93c000
[11569.251402] $16 : 0000000000000000 0000000000000140 0000000000000000 0000000000000020
[11569.259366] $20 : 90000001113a3ec8 9000000000a97ee0 00007ffffb93bfa0 1555555555555613
[11569.267334] $24 : 1000000000cf16d0 000000000000000c 9000000000cf1258 90000000009caa10
[11569.275303] $28 : 90000001113a3af8 0aaaaaaaaaaaab0a 00007ffffb93bde0 90000001113a3ec0
[11569.283268] era : 90000000009caa10 cmp_ex_search+0x0/0x28
[11569.288814] ra : 90000000005e3448 bsearch+0x58/0xa8
[11569.293921] CSR crmd: 000000b0
[11569.293923] CSR prmd: 00000004
[11569.297037] CSR euen: 00000000
[11569.300152] CSR ecfg: 00071c1c
[11569.303266] CSR estat: 00480000
[11569.309587] ExcCode : 8 (SubCode 1)
[11569.313049] BadVA : 1000000000cf16d0
[11569.316596] PrId : 0014c011 (Loongson-64bit)
[11569.320923] Modules linked in: amdgpu nls_cp936 vfat fat input_leds drm_ttm_helper ttm video gpu_sched drm_buddy snd_hda_codec_generic drm_display_helper ledtrig_audio drm_kms_helper led_class snd_hda_intel sha256_generic snd_intel_dspcfg cfbfillrect libsha256 snd_hda_codec syscopyarea snd_hda_core hid_generic cfbimgblt cfg80211 snd_pcm sysfillrect usbhid sysimgblt snd_timer cfbcopyarea hid snd igb soundcore efivarfs
[11569.357709] Process ld-linux-loonga (pid: 1132296, threadinfo=000000003cbd0caa, task=000000005bcd27a6)
[11569.366977] Stack : 00007ffffb93bd60 0000000000000000 9000000180a36a40 0000000000000001
[11569.374940] 90000001113a3bb0 00007ffffb93c000 9000000000224c94 90000000009cab2c
[11569.382899] 0000000000000001 9000000000224c94 00007ffff3258000 900000000025a1b4
[11569.390866] 90000001113a3bb0 900000000022f4cc 00007ffffb93c000 900000000022f74c
[11569.398834] 9000000180a36a40 0000000000000001 0000000000000000 00007ffffb93c000
[11569.406800] 90000001113a3bb0 900000000022f8f8 90000001113a3ec0 00007ffffb93bde0
[11569.414768] 00007ffffb93bd60 0000000000000000 0000000000000000 00007fffff7c4600
[11569.422734] 9000000182ebab70 9000000000d08000 0000000046505501 900000000022ee6c
[11569.430698] 0000000000000000 9000000000224b84 90000001113a0000 90000001113a3cf0
[11569.438661] 0000000000000000 00007ffffb93c0d0 0000000000000000 0000000000000040
[11569.446627] ...
[11569.449058] Call Trace:
[11569.449062] [<90000000009caa10>] cmp_ex_search+0x0/0x28
[11569.456681] [<90000000005e3448>] bsearch+0x58/0xa8
[11569.461443] [<90000000009cab2c>] search_extable+0x28/0x34
[11569.466807] [<900000000025a1b4>] search_exception_tables+0x48/0x7c
[11569.472953] [<900000000022f4cc>] fixup_exception+0x18/0xcc
[11569.478410] [<900000000022f74c>] do_sigsegv+0x174/0x1b0
[11569.483605] [<900000000022f8f8>] do_page_fault+0x170/0x344
[11569.489058] [<900000000022ee6c>] tlb_do_page_fault_1+0x128/0x1c4
[11569.495029] [<9000000000224b84>] handle_signal+0x634/0x884
[11569.500487] [<9000000000225704>] arch_do_signal_or_restart+0xb4/0xe0
[11569.506808] [<90000000002b5b30>] exit_to_user_mode_prepare+0xbc/0x100
[11569.513214] [<9000000000a02628>] syscall_exit_to_user_mode+0x30/0x4c
[11569.519533] [<90000000002214a4>] handle_syscall+0xc4/0x160

[11569.526472] Code: 4c000020 02800404 4c000020 <240000ac> 26000084 0010b0a5 680014a4 00129484 00111004

[11569.537704] ---[ end trace 0000000000000000 ]---

"BadVA : 1000000000cf16d0" may suggest the highest bit of an address is
somehow cleared.

The issue is not deterministic, but it seems easy to reproduce:

1. Compile Glibc:

../glibc/configure --prefix=/usr \
--disable-werror \
--enable-kernel=5.19 \
--enable-stack-protector=strong \
--with-headers=/usr/include \
libc_cv_slibdir=/usr/lib
make -j4

2. Check Glibc:

make check -j4

3. If the oops did not happen during the last step, run a specific test
in an infinite loop:

while true; do make test t=malloc/tst-mallocfork3-malloc-check; done

An oops will then likely show up within several minutes.

Though the oops is nondeterministic, I'm almost sure it's not a hardware
stability issue, because I'm getting exactly the same stack trace for each
oops. However, I cannot easily rule out the possibility that the compiler
miscompiles the kernel code.

I'm running 6.2-rc8 with the following patches from loongarch-next:

ACPI: Define ACPI_MACHINE_WIDTH to 64 for LoongArch
PCI: loongson: Improve the MRRS quirk for LS7A
PCI: Add quirk for LS7A to avoid reboot failure
irqchip/loongson-liointc: Save/restore int_edge/int_pol registers during S3/S4
LoongArch: Add vector extensions support
tools: Add LoongArch build infrastructure
libbpf: Add LoongArch support to bpf_tracing.h
selftests/seccomp: Add LoongArch selftesting support
SH: cpuinfo: Fix a warning for CONFIG_CPUMASK_OFFSTACK
LoongArch: Add CPU HWMon platform driver

Any ideas on how to fix the issue, or suggestions for debugging it further?

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University


2023-02-14 14:52:08

by Huacai Chen

Subject: Re: "kernel ade access" oops on LoongArch

Hi, Ruoyao,

It seems to be related to Youling's relative exception table patchset.

Huacai


2023-02-15 04:53:09

by Youling Tang

Subject: Re: "kernel ade access" oops on LoongArch

Hi, Ruoyao

On 02/14/2023 04:46 PM, Xi Ruoyao wrote:
> 2. Check Glibc:
>
> make check -j4

When I try to build glibc, it fails as shown below :(.

git clone https://sourceware.org/git/glibc.git
mkdir build_glibc
cd build_glibc
../glibc/configure --prefix=/usr --disable-werror --enable-kernel=5.19 \
--enable-stack-protector=strong --with-headers=/usr/include \
libc_cv_slibdir=/usr/lib
make -j4

log:
/home/loongson/build_glibc/csu/crtn.o
In file included from ../include/stdlib.h:15,
from /home/loongson/build_glibc/cstdlib:79,
from /usr/include/c++/13.0.0/ext/string_conversions.h:41,
from /usr/include/c++/13.0.0/bits/basic_string.h:4040,
from /usr/include/c++/13.0.0/string:52,
from /usr/include/c++/13.0.0/bits/locale_classes.h:40,
from /usr/include/c++/13.0.0/bits/ios_base.h:41,
from /usr/include/c++/13.0.0/ios:42:
../stdlib/stdlib.h:141:8: error: ‘_Float32’ does not name a type
  141 | extern _Float32 strtof32 (const char *__restrict __nptr,
      |        ^~~~~~~~
../stdlib/stdlib.h:147:8: error: ‘_Float64’ does not name a type
  147 | extern _Float64 strtof64 (const char *__restrict __nptr,
      |        ^~~~~~~~
...
/usr/bin/ld: /home/loongson/build_glibc/libc.a(dl-reloc-static-pie.o):
in function `_dl_relocate_static_pie':
/home/loongson/glibc/elf/dl-reloc-static-pie.c:44: undefined reference
to `_DYNAMIC'
/usr/bin/ld: /home/loongson/glibc/elf/dl-reloc-static-pie.c:44:
undefined reference to `_DYNAMIC'
/usr/bin/ld: /home/loongson/build_glibc/support/test-run-command: hidden
symbol `_DYNAMIC' isn't defined
/usr/bin/ld: final link failed: bad value

Youling.


2023-02-15 05:36:15

by Xi Ruoyao

Subject: Re: "kernel ade access" oops on LoongArch

On Wed, 2023-02-15 at 12:52 +0800, Youling Tang wrote:
> ../stdlib/stdlib.h:141:8: error: ‘_Float32’ does not name a type
>    141 | extern _Float32 strtof32 (const char *__restrict __nptr,
>        |        ^~~~~~~~

This is because Glibc expects GCC 13 to support _Float32, but early GCC
13 snapshots did not.

> /usr/bin/ld: /home/loongson/build_glibc/libc.a(dl-reloc-static-pie.o):
> in function `_dl_relocate_static_pie':
> /home/loongson/glibc/elf/dl-reloc-static-pie.c:44: undefined reference
> to `_DYNAMIC'

Oh, this one is my fault. The check for compiler static PIE support was
not written correctly. I'll fix it in Glibc later; for now, you can
update GCC to the latest git master to proceed.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2023-02-15 07:23:11

by Youling Tang

Subject: Re: "kernel ade access" oops on LoongArch



On 02/15/2023 01:35 PM, Xi Ruoyao wrote:
> On Wed, 2023-02-15 at 12:52 +0800, Youling Tang wrote:
>> ../stdlib/stdlib.h:141:8: error: ‘_Float32’ does not name a type
>>   141 | extern _Float32 strtof32 (const char *__restrict __nptr,
>>       |        ^~~~~~~~
>
> This is because Glibc expects GCC 13 to support _Float32, but early GCC
> 13 snapshots did not.
>
>> /usr/bin/ld: /home/loongson/build_glibc/libc.a(dl-reloc-static-pie.o):
>> in function `_dl_relocate_static_pie':
>> /home/loongson/glibc/elf/dl-reloc-static-pie.c:44: undefined reference
>> to `_DYNAMIC'
>
> Oh, this one is my fault. The check for compiler static PIE support was
> not written correctly. I'll fix it for Glibc later, but now you can
> update GCC to the latest git master to proceed.
>

Tested on a Loongson-3C5000L-LL machine running a CLFS 7.3 system.

$ gcc -v
gcc version 13.0.0 20221018 (experimental) (GCC)

# make check -j32
/home/loongson/build_glibc/math/test-tgmath3-atan2.c: In function ‘test_atan2_84’:
/home/loongson/build_glibc/math/test-tgmath3-atan2.c:903:59: error: conflicting types for ‘var__Float32x’; have ‘double’
  903 |   extern typeof (atan2 (vol_var__Float32x, vol_var_char)) var__Float32x __attribute__ ((unused));
      |                                                           ^~~~~~~~~~~~~

There was a build error in make check, so I only tested
tst-mallocfork3-malloc-check separately.

# make test t=malloc/tst-mallocfork3-malloc-check
make[2]: Leaving directory '/home/loongson/glibc/malloc'
PASS: malloc/tst-mallocfork3-malloc-check
original exit status 0
info: signals received during fork: 301
info: signals received during free: 1693
info: signals received during malloc: 119
make[1]: Leaving directory '/home/loongson/glibc'

All five runs passed, and the serial console did not display any call
trace.

Youling.


2023-02-15 07:52:05

by Jinyang He

Subject: Re: "kernel ade access" oops on LoongArch

On 2023-02-15 15:23, Youling Tang wrote:

> All five runs passed, and the serial console did not display any call
> trace.
>
> Youling.
>
I tested it using the "while true..." command Ruoyao gave, on a
Loongson-3A5000 with CLFS 7.1, with both the 6.2-rc8 kernel with those
patches and the 6.2-rc7 kernel from loongson-next. No call trace was
displayed either.

Jinyang


2023-02-15 08:07:32

by Xi Ruoyao

Subject: Re: "kernel ade access" oops on LoongArch

On Wed, 2023-02-15 at 15:51 +0800, Jinyang He wrote:

> I tested it using the "while true..." command Ruoyao gave, on a
> Loongson-3A5000 with CLFS 7.1, with both the 6.2-rc8 kernel with those
> patches and the 6.2-rc7 kernel from loongson-next. No call trace was
> displayed either.

Hmm... I've read the code for a while, and I can't see how it could
end up accessing a bad address either. Maybe my hardware or compiler
really is faulty?


--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2023-02-15 08:26:09

by Youling Tang

Subject: Re: "kernel ade access" oops on LoongArch



On 02/15/2023 04:07 PM, Xi Ruoyao wrote:
> Hmm... I've read the code for a while, and I can't see how it could
> end up accessing a bad address either. Maybe my hardware or compiler
> really is faulty?

Can you modify the kernel as follows and test it, to rule out any
possible relationship with the link position and alignment of the
exception table data (or use EXCEPTION_TABLE(12))?

--- a/arch/loongarch/kernel/vmlinux.lds.S
+++ b/arch/loongarch/kernel/vmlinux.lds.S
@@ -4,7 +4,6 @@
#include <asm/thread_info.h>

#define PAGE_SIZE _PAGE_SIZE
-#define RO_EXCEPTION_TABLE_ALIGN 4

/*
* Put .bss..swapper_pg_dir as the first thing in .bss. This will
@@ -54,6 +53,8 @@ SECTIONS
. = ALIGN(PECOFF_SEGMENT_ALIGN);
_etext = .;

+ EXCEPTION_TABLE(16)
+

Youling.



2023-02-15 08:35:57

by Xi Ruoyao

Subject: Re: "kernel ade access" oops on LoongArch

On Wed, 2023-02-15 at 16:25 +0800, Youling Tang wrote:
> --- a/arch/loongarch/kernel/vmlinux.lds.S
> +++ b/arch/loongarch/kernel/vmlinux.lds.S
> @@ -4,7 +4,6 @@
>   #include <asm/thread_info.h>
>
>   #define PAGE_SIZE _PAGE_SIZE
> -#define RO_EXCEPTION_TABLE_ALIGN       4
>
>   /*
>    * Put .bss..swapper_pg_dir as the first thing in .bss. This will
> @@ -54,6 +53,8 @@ SECTIONS
>          . = ALIGN(PECOFF_SEGMENT_ALIGN);
>          _etext = .;
>
> +       EXCEPTION_TABLE(16)
> +

It seems the kernel refuses to boot after the change, but I'm not
completely sure: I'm 5 km away from the board and operating it via ssh,
so it may be a reboot failure or a network failure. I'll report again
this evening.

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2023-02-15 11:52:25

by Xi Ruoyao

Subject: Re: "kernel ade access" oops on LoongArch

On Wed, 2023-02-15 at 16:35 +0800, Xi Ruoyao wrote:
> It seems the kernel refuses to boot after the change, but I'm not
> completely sure: I'm 5 km away from the board and operating it via ssh,
> so it may be a reboot failure or a network failure. I'll report again
> this evening.

It was a reboot failure.

Now it has booted successfully, but the stack trace still shows up
(during the 25th run of the make test t=... command).

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2023-02-15 12:52:00

by Xi Ruoyao

Subject: Re: "kernel ade access" oops on LoongArch

On Wed, 2023-02-15 at 19:52 +0800, Xi Ruoyao wrote:
> It was a reboot failure.
>
> Now it has booted successfully, but the stack trace still shows up
> (during the 25th run of the make test t=... command).

Ouch, I know what's happening...

In the architecture-independent code we have something like

extern struct exception_table_entry a[], b[];
bsearch(a, b - a);

According to the C standard, when you write "b - a" where a and b are
pointers to type T, "b" and "a" must point to elements of the same
array of T (or one past its last element). Since
sizeof(struct exception_table_entry) is 12, the compiler can assume
((uintptr_t)b - (uintptr_t)a) % 12 == 0 and optimize "b - a" to
something like

(((uintptr_t)b - (uintptr_t)a) >> 2) * inv3

Here inv3 is the multiplicative inverse of 3 in the ring of integers
modulo 2**64, so the compiler can avoid an expensive divide instruction.
But in my vmlinux, ((uintptr_t)b - (uintptr_t)a) is somehow not a
multiple of 12:

(gdb) p ((uintptr_t)__stop___ex_table - (uintptr_t)__start___ex_table) % sizeof(struct exception_table_entry)
$9 = 8

So I guess

#define RO_EXCEPTION_TABLE_ALIGN 12

will work. I'll take a try...

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2023-02-15 13:06:44

by Xi Ruoyao

Subject: Re: "kernel ade access" oops on LoongArch

On Wed, 2023-02-15 at 20:51 +0800, Xi Ruoyao wrote:
> So I guess
>
> #define RO_EXCEPTION_TABLE_ALIGN       12
>
> will work.  I'll take a try...

No, it's not related...

The reason is that the "LoongArch: Add vector extensions support" patch
in my local repo is not the same as the version in loongarch-next! My
local version contains some ".section __ex_table" directives, and their
contents seem to predate the relative exception table change.

Sorry for wasting an afternoon of your time :(.

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University