When allocating the pages for bss the start address needs to be rounded
down instead of up.
Otherwise the start of the bss segment may be unmapped.
The was reported to happen on Aarch64:
Memory allocated by set_brk():
Before: start=0x420000 end=0x420000
After: start=0x41f000 end=0x420000
The triggering binary looks like this:
Elf file type is EXEC (Executable file)
Entry point 0x400144
There are 4 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x0000000000000178 0x0000000000000178 R E 0x10000
LOAD 0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
0x0000000000000000 0x0000000000000008 RW 0x10000
NOTE 0x0000000000000120 0x0000000000400120 0x0000000000400120
0x0000000000000024 0x0000000000000024 R 0x4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
Section to Segment mapping:
Segment Sections...
00 .note.gnu.build-id .text .eh_frame
01 .bss
02 .note.gnu.build-id
03
Reported-by: Sebastian Ott <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: [email protected]
Signed-off-by: Thomas Weißschuh <[email protected]>
---
I'm not really familiar with the ELF loading process, so putting this
out as RFC.
A example binary compiled with aarch64-linux-gnu-gcc 13.2.0 is available
at https://test.t-8ch.de/binfmt-bss-repro.bin
---
fs/binfmt_elf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7b3d2d491407..4008a57d388b 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
static int set_brk(unsigned long start, unsigned long end, int prot)
{
- start = ELF_PAGEALIGN(start);
+ start = ELF_PAGESTART(start);
end = ELF_PAGEALIGN(end);
if (end > start) {
/*
---
base-commit: aed8aee11130a954356200afa3f1b8753e8a9482
change-id: 20230914-bss-alloc-f523fa61718c
Best regards,
--
Thomas Weißschuh <[email protected]>
On 2023-09-15 23:15:05+0100, Pedro Falcato wrote:
> On Fri, Sep 15, 2023 at 4:54 AM Thomas Weißschuh <[email protected]> wrote:
> >
> > When allocating the pages for bss the start address needs to be rounded
> > down instead of up.
> > Otherwise the start of the bss segment may be unmapped.
> >
> > The was reported to happen on Aarch64:
> >
> > Memory allocated by set_brk():
> > Before: start=0x420000 end=0x420000
> > After: start=0x41f000 end=0x420000
> >
> > The triggering binary looks like this:
> >
> > Elf file type is EXEC (Executable file)
> > Entry point 0x400144
> > There are 4 program headers, starting at offset 64
> >
> > Program Headers:
> > Type Offset VirtAddr PhysAddr
> > FileSiz MemSiz Flags Align
> > LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
> > 0x0000000000000178 0x0000000000000178 R E 0x10000
> > LOAD 0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
> > 0x0000000000000000 0x0000000000000008 RW 0x10000
> > NOTE 0x0000000000000120 0x0000000000400120 0x0000000000400120
> > 0x0000000000000024 0x0000000000000024 R 0x4
> > GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
> > 0x0000000000000000 0x0000000000000000 RW 0x10
> >
> > Section to Segment mapping:
> > Segment Sections...
> > 00 .note.gnu.build-id .text .eh_frame
> > 01 .bss
> > 02 .note.gnu.build-id
> > 03
> >
> > Reported-by: Sebastian Ott <[email protected]>
> > Closes: https://lore.kernel.org/lkml/[email protected]/
> > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > Cc: [email protected]
> > Signed-off-by: Thomas Weißschuh <[email protected]>
> > ---
> >
> > I'm not really familiar with the ELF loading process, so putting this
> > out as RFC.
> >
> > A example binary compiled with aarch64-linux-gnu-gcc 13.2.0 is available
> > at https://test.t-8ch.de/binfmt-bss-repro.bin
> > ---
> > fs/binfmt_elf.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index 7b3d2d491407..4008a57d388b 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
> >
> > static int set_brk(unsigned long start, unsigned long end, int prot)
> > {
> > - start = ELF_PAGEALIGN(start);
> > + start = ELF_PAGESTART(start);
> > end = ELF_PAGEALIGN(end);
> > if (end > start) {
> > /*
>
> I don't see how this change can be correct. set_brk takes the start of
> .bss as the start, so doing ELF_PAGESTART(start) will give you what
> may very well be another ELF segment. In the common case, you'd map an
> anonymous page on top of someone's .data, which will misload the ELF.
That does make sense, and indeed it breaks more complex binaries.
> The current logic looks OK to me (gosh this code would ideally take a
> good refactoring...). I still can't quite tell how padzero() (in the
> original report) is -EFAULTing though.
As a test I replaced the asm clear_user() in padzero() with the generic
memset()-based implementation from include/asm-generic/uaccess.h.
It does provide better diagnostics, see below.
Who should have mapped this partial .bss page if there is no .data?
Maybe the logic needs to be a bit more complex and check if this page
has been already mapped for .data and in that case don't map it again.
[ 5.620235] Run /init as init process
[ 5.662763] CUSTOM DEBUG ELF_PAGEALIGN(start)=0x420000 ELF_PAGEALIGN(end)=0x420000 ELF_PAGESTART(0x41f000)
[ 5.667176] Unable to handle kernel paging request at virtual address 000000000041ffe8
[ 5.668062] Mem abort info:
[ 5.668429] ESR = 0x0000000096000045
[ 5.669400] EC = 0x25: DABT (current EL), IL = 32 bits
[ 5.670119] SET = 0, FnV = 0
[ 5.670608] EA = 0, S1PTW = 0
[ 5.671172] FSC = 0x05: level 1 translation fault
[ 5.672024] Data abort info:
[ 5.673273] ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000
[ 5.674169] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[ 5.674991] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 5.676871] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000043f20000
[ 5.677776] [000000000041ffe8] pgd=0800000043c62003, p4d=0800000043c62003, pud=0800000043c62003, pmd=0000000000000000
[ 5.681522] Internal error: Oops: 0000000096000045 [#1] PREEMPT SMP
[ 5.682604] Modules linked in:
[ 5.683576] CPU: 0 PID: 1 Comm: init Not tainted 6.6.0-rc1+ #241 00a261b9689606c4fc0c90eb29739c5b0eec7b82
[ 5.684706] Hardware name: linux,dummy-virt (DT)
[ 5.685462] pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5.686094] pc : __memset+0x50/0x188
[ 5.686572] lr : padzero+0x84/0xa0
[ 5.686956] sp : ffffffc08003bc70
[ 5.687307] x29: ffffffc08003bc70 x28: 0000000000000000 x27: ffffff80026afa00
[ 5.688091] x26: 000000000041fff0 x25: 000000000041ffe8 x24: 0000000000400144
[ 5.688698] x23: 0000000000000000 x22: 0000000000000000 x21: 0000000000000000
[ 5.689275] x20: 0000000000000fe8 x19: 000000000041ffe8 x18: ffffffffffffffff
[ 5.689928] x17: ffffffdf46cd2984 x16: ffffffdf46cd2880 x15: 0720072007200720
[ 5.690597] x14: 0720072007200720 x13: 0720072007200720 x12: 0000000000000000
[ 5.691192] x11: 00000000ffffefff x10: 0000000000000000 x9 : ffffffdf46e1ba28
[ 5.691906] x8 : 000000000041ffe8 x7 : 0000000000000000 x6 : 0000000000057fa8
[ 5.692496] x5 : 0000000000000fff x4 : 0000000000000008 x3 : 0000000000000000
[ 5.693168] x2 : 0000000000000018 x1 : 0000000000000000 x0 : 000000000041ffe8
[ 5.693985] Call trace:
[ 5.694318] __memset+0x50/0x188
[ 5.694708] load_elf_binary+0x630/0x15d0
[ 5.695132] bprm_execve+0x2bc/0x7c0
[ 5.695505] kernel_execve+0x144/0x1c8
[ 5.695882] run_init_process+0xf8/0x110
[ 5.696264] kernel_init+0x8c/0x200
[ 5.696624] ret_from_fork+0x10/0x20
[ 5.697216] Code: d65f03c0 cb0803e4 f2400c84 54000080 (a9001d07)
[ 5.698936] ---[ end trace 0000000000000000 ]---
[ 5.701625] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 5.702502] SMP: stopping secondary CPUs
[ 5.703608] Kernel Offset: 0x1ec6a00000 from 0xffffffc080000000
[ 5.704119] PHYS_OFFSET: 0x40000000
[ 5.704491] CPU features: 0x0000000d,00020000,0000420b
[ 5.705276] Memory Limit: none
Hej Thomas,
On Thu, 14 Sep 2023, Thomas Weißschuh wrote:
> fs/binfmt_elf.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 7b3d2d491407..4008a57d388b 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
>
> static int set_brk(unsigned long start, unsigned long end, int prot)
> {
> - start = ELF_PAGEALIGN(start);
> + start = ELF_PAGESTART(start);
> end = ELF_PAGEALIGN(end);
> if (end > start) {
> /*
>
My arm box failed to boot with that patch applied on top of 6.6-rc1 .
There was nothing suspicious on the serial console it just hung somewhere
in userspace initialization. Sadly there was also nothing in the system
logs. 6.6-rc1 worked fine.
Sebastian
On Fri, Sep 15, 2023 at 4:54 AM Thomas Weißschuh <[email protected]> wrote:
>
> When allocating the pages for bss the start address needs to be rounded
> down instead of up.
> Otherwise the start of the bss segment may be unmapped.
>
> The was reported to happen on Aarch64:
>
> Memory allocated by set_brk():
> Before: start=0x420000 end=0x420000
> After: start=0x41f000 end=0x420000
>
> The triggering binary looks like this:
>
> Elf file type is EXEC (Executable file)
> Entry point 0x400144
> There are 4 program headers, starting at offset 64
>
> Program Headers:
> Type Offset VirtAddr PhysAddr
> FileSiz MemSiz Flags Align
> LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
> 0x0000000000000178 0x0000000000000178 R E 0x10000
> LOAD 0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
> 0x0000000000000000 0x0000000000000008 RW 0x10000
> NOTE 0x0000000000000120 0x0000000000400120 0x0000000000400120
> 0x0000000000000024 0x0000000000000024 R 0x4
> GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
> 0x0000000000000000 0x0000000000000000 RW 0x10
>
> Section to Segment mapping:
> Segment Sections...
> 00 .note.gnu.build-id .text .eh_frame
> 01 .bss
> 02 .note.gnu.build-id
> 03
>
> Reported-by: Sebastian Ott <[email protected]>
> Closes: https://lore.kernel.org/lkml/[email protected]/
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: [email protected]
> Signed-off-by: Thomas Weißschuh <[email protected]>
> ---
>
> I'm not really familiar with the ELF loading process, so putting this
> out as RFC.
>
> A example binary compiled with aarch64-linux-gnu-gcc 13.2.0 is available
> at https://test.t-8ch.de/binfmt-bss-repro.bin
> ---
> fs/binfmt_elf.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 7b3d2d491407..4008a57d388b 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -112,7 +112,7 @@ static struct linux_binfmt elf_format = {
>
> static int set_brk(unsigned long start, unsigned long end, int prot)
> {
> - start = ELF_PAGEALIGN(start);
> + start = ELF_PAGESTART(start);
> end = ELF_PAGEALIGN(end);
> if (end > start) {
> /*
I don't see how this change can be correct. set_brk takes the start of
.bss as the start, so doing ELF_PAGESTART(start) will give you what
may very well be another ELF segment. In the common case, you'd map an
anonymous page on top of someone's .data, which will misload the ELF.
The current logic looks OK to me (gosh this code would ideally take a
good refactoring...). I still can't quite tell how padzero() (in the
original report) is -EFAULTing though.
--
Pedro
Hello,
kernel test robot noticed "segfault_at_ip_sp_error" on:
commit: 13bd7a228b281e5cef2f51a236cafaa3400592a5 ("[PATCH RFC] binfmt_elf: fully allocate bss pages")
url: https://github.com/intel-lab-lkp/linux/commits/Thomas-Wei-schuh/binfmt_elf-fully-allocate-bss-pages/20230915-000102
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH RFC] binfmt_elf: fully allocate bss pages
in testcase: boot
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]
[ 11.004901][ T1] ### dt-test ### EXPECT_NOT / : WARNING: <<all>>
[ 11.005947][ T1] ### dt-test ### EXPECT_NOT / : ------------[ cut here ]------------
[ 11.006784][ T1] ### dt-test ### pass of_unittest_lifecycle():3252
[ 11.008735][ T1] ### dt-test ### pass of_unittest_lifecycle():3253
[ 11.009666][ T1] ### dt-test ### pass of_unittest_check_tree_linkage():271
[ 11.010598][ T1] ### dt-test ### pass of_unittest_check_tree_linkage():272
[ 11.011531][ T1] ### dt-test ### FAIL of_unittest_overlay_high_level():3542 overlay_base_root not initialized
[ 11.012852][ T1] ### dt-test ### end of unittest - 303 passed, 1 failed
[ 11.022721][ T39] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 11.042019][ T1] Sending DHCP requests ., OK
[ 12.032757][ T1] IP-Config: Got DHCP answer from 10.0.2.2, my address is 10.0.2.15
[ 12.033736][ T1] IP-Config: Complete:
[ 12.034229][ T1] device=eth0, hwaddr=52:54:00:12:34:56, ipaddr=10.0.2.15, mask=255.255.255.0, gw=10.0.2.2
[ 12.035554][ T1] host=vm-meta-36, domain=, nis-domain=(none)
[ 12.036331][ T1] bootserver=10.0.2.2, rootserver=10.0.2.2, rootpath=
[ 12.036337][ T1] nameserver0=10.0.2.3
[ 12.038817][ T1] clk: Disabling unused clocks
[ 12.041570][ T1] Freeing unused kernel image (initmem) memory: 1036K
[ 12.059292][ T1] Write protecting kernel text and read-only data: 10632k
[ 12.075444][ T1] Run /init as init process
[ 12.075883][ T1] with arguments:
[ 12.076211][ T1] /init
[ 12.076481][ T1] with environment:
[ 12.076818][ T1] HOME=/
[ 12.077095][ T1] TERM=linux
[ 12.077397][ T1] RESULT_ROOT=/result/boot/1/vm-snb/debian-11.1-i386-20220923.cgz/i386-randconfig-016-20230915/gcc-12/13bd7a228b281e5cef2f51a236cafaa3400592a5/5
[ 12.078684][ T1] BOOT_IMAGE=/pkg/linux/i386-randconfig-016-20230915/gcc-12/13bd7a228b281e5cef2f51a236cafaa3400592a5/vmlinuz-6.6.0-rc1-00073-g13bd7a228b28
[ 12.079910][ T1] branch=linux-review/Thomas-Wei-schuh/binfmt_elf-fully-allocate-bss-pages/20230915-000102
[ 12.080775][ T1] job=/lkp/jobs/scheduled/vm-meta-36/boot-1-debian-11.1-i386-20220923.cgz-i386-randconfig-016-20230915-13bd7a228b28-20230917-97632-11h3y6y-5.yaml
[ 12.082051][ T1] user=lkp
[ 12.082345][ T1] ARCH=i386
[ 12.082639][ T1] kconfig=i386-randconfig-016-20230915
[ 12.083177][ T1] commit=13bd7a228b281e5cef2f51a236cafaa3400592a5
[ 12.083743][ T1] max_uptime=600
[ 12.084074][ T1] LKP_SERVER=internal-lkp-server
[ 12.084522][ T1] selinux=0
[ 12.084820][ T1] softlockup_panic=1
[ 12.085181][ T1] prompt_ramdisk=0
[ 12.085551][ T1] vga=normal
[ 12.117728][ T1] [1]: RTC configured in localtime, applying delta of 0 minutes to system time.
Welcome to Debian GNU/Linux 11 (bullseye)!
[ 12.189049][ T58] process 58 ((sd-executor)) attempted a POSIX timer syscall while CONFIG_POSIX_TIMERS is not set
[ 12.234253][ T63] systemd-getty-g[63]: segfault at 484771 ip 00480047 sp bffb6e4c error 7 in true[480000+1000] likely on CPU 0 (core 0, socket 0)
[ 12.242969][ T63] Code: 00 00 00 b8 82 00 00 00 00 00 00 34 00 20 00 0b 00 28 00 1e 00 1d 00 06 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 60 01 00 <00> 60 01 00 00 04 00 00 00 04 00 00 00 03 00 00 00 94 01 00 00 94
All code
========
0: 00 00 add %al,(%rax)
2: 00 b8 82 00 00 00 add %bh,0x82(%rax)
8: 00 00 add %al,(%rax)
a: 00 34 00 add %dh,(%rax,%rax,1)
d: 20 00 and %al,(%rax)
f: 0b 00 or (%rax),%eax
11: 28 00 sub %al,(%rax)
13: 1e (bad)
14: 00 1d 00 06 00 00 add %bl,0x600(%rip) # 0x61a
1a: 00 34 00 add %dh,(%rax,%rax,1)
1d: 00 00 add %al,(%rax)
1f: 34 00 xor $0x0,%al
21: 00 00 add %al,(%rax)
23: 34 00 xor $0x0,%al
25: 00 00 add %al,(%rax)
27: 60 (bad)
28: 01 00 add %eax,(%rax)
2a:* 00 60 01 add %ah,0x1(%rax) <-- trapping instruction
2d: 00 00 add %al,(%rax)
2f: 04 00 add $0x0,%al
31: 00 00 add %al,(%rax)
33: 04 00 add $0x0,%al
35: 00 00 add %al,(%rax)
37: 03 00 add (%rax),%eax
39: 00 00 add %al,(%rax)
3b: 94 xchg %eax,%esp
3c: 01 00 add %eax,(%rax)
3e: 00 .byte 0x0
3f: 94 xchg %eax,%esp
Code starting with the faulting instruction
===========================================
0: 00 60 01 add %ah,0x1(%rax)
3: 00 00 add %al,(%rax)
5: 04 00 add $0x0,%al
7: 00 00 add %al,(%rax)
9: 04 00 add $0x0,%al
b: 00 00 add %al,(%rax)
d: 03 00 add (%rax),%eax
f: 00 00 add %al,(%rax)
11: 94 xchg %eax,%esp
12: 01 00 add %eax,(%rax)
14: 00 .byte 0x0
15: 94 xchg %eax,%esp
[ 12.256651][ T62] systemd-fstab-g[62]: segfault at 0 ip 004a0004 sp bf81264b error 6 in systemd-fstab-generator[4a0000+2000] likely on CPU 0 (core 0, socket 0)
[ 12.257967][ T62] Code: Unable to access opcode bytes at 0x49ffda.
Code starting with the faulting instruction
===========================================
[ 12.266578][ T60] systemd-cryptse[60]: segfault at 0 ip 00453004 sp bfeefa7b error 6 in systemd-cryptsetup-generator[453000+1000] likely on CPU 1 (core 1, socket 0)
[ 12.271885][ T60] Code: Unable to access opcode bytes at 0x452fda.
Code starting with the faulting instruction
===========================================
[ 12.276875][ T61] systemd-debug-g[61]: segfault at fffff000 ip 00464004 sp bfd3675b error 7 in systemd-debug-generator[464000+1000] likely on CPU 1 (core 1, socket 0)
[ 12.278229][ T61] Code: Unable to access opcode bytes at 0x463fda.
Code starting with the faulting instruction
===========================================
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230918/[email protected]
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hej,
since we figured that the proposed patch is not going to work I've spent a
couple more hours looking at this (some static binaries on arm64 segfault
during load [0]). The segfault happens because of a failed clear_user()
call in load_elf_binary(). The address we try to write zeros to is mapped with
correct permissions.
After some experiments I've noticed that writing to anonymous mappings work
fine and all the error cases happend on file backed VMAs. Debugging showed that
in elf_map() we call vm_mmap() with a file offset of 15 pages - for a binary
that's less than 1KiB in size.
Looking at the ELF headers again that 15 pages offset originates from the offset
of the 2nd segment - so, I guess the loader did as instructed and that binary is
just too nasty?
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x0000000000000178 0x0000000000000178 R E 0x10000
LOAD 0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
0x0000000000000000 0x0000000000000008 RW 0x10000
NOTE 0x0000000000000120 0x0000000000400120 0x0000000000400120
0x0000000000000024 0x0000000000000024 R 0x4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
As an additional test I've added a bunch of zeros at the end of that binary
so that the offset is within that file and it did load just fine.
On the other hand there is this section header:
[ 4] .bss NOBITS 000000000041ffe8 0000ffe8
0000000000000008 0000000000000000 WA 0 0 1
"sh_offset
This member's value gives the byte offset from the beginning of the file to
the first byte in the section. One section type, SHT_NOBITS described
below, occupies no space in the file, and its sh_offset member locates
the conceptual placement in the file.
"
So, still not sure what to do here..
Sebastian
[0] https://lore.kernel.org/lkml/[email protected]/
Sebastian Ott <[email protected]> writes:
> Hej,
>
> since we figured that the proposed patch is not going to work I've spent a
> couple more hours looking at this (some static binaries on arm64 segfault
> during load [0]). The segfault happens because of a failed clear_user()
> call in load_elf_binary(). The address we try to write zeros to is mapped with
> correct permissions.
>
> After some experiments I've noticed that writing to anonymous mappings work
> fine and all the error cases happend on file backed VMAs. Debugging showed that
> in elf_map() we call vm_mmap() with a file offset of 15 pages - for a binary
> that's less than 1KiB in size.
>
> Looking at the ELF headers again that 15 pages offset originates from the offset
> of the 2nd segment - so, I guess the loader did as instructed and that binary is
> just too nasty?
>
> Program Headers:
> Type Offset VirtAddr PhysAddr
> FileSiz MemSiz Flags Align
> LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
> 0x0000000000000178 0x0000000000000178 R E 0x10000
> LOAD 0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
> 0x0000000000000000 0x0000000000000008 RW 0x10000
> NOTE 0x0000000000000120 0x0000000000400120 0x0000000000400120
> 0x0000000000000024 0x0000000000000024 R 0x4
> GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
> 0x0000000000000000 0x0000000000000000 RW 0x10
>
> As an additional test I've added a bunch of zeros at the end of that binary
> so that the offset is within that file and it did load just fine.
>
> On the other hand there is this section header:
> [ 4] .bss NOBITS 000000000041ffe8 0000ffe8
> 0000000000000008 0000000000000000 WA 0 0 1
>
> "sh_offset
> This member's value gives the byte offset from the beginning of the file to
> the first byte in the section. One section type, SHT_NOBITS described
> below, occupies no space in the file, and its sh_offset member locates
> the conceptual placement in the file.
> "
>
> So, still not sure what to do here..
>
> Sebastian
>
> [0] https://lore.kernel.org/lkml/[email protected]/
I think that .bss section that is being generated is atrocious.
At the same time I looked at what the linux elf loader is trying to do,
and the elf loader's handling of program segments with memsz > filesz
has serious remnants a.out of programs allocating memory with the brk
syscall.
Lots of the structure looks like it started with the assumption that
there would only be a single program header with memsz > filesz the way
and that was the .bss. The way things were in the a.out days and
handling of other cases has been debugged in later.
So I have modified elf_map to always return successfully when there is
a zero filesz in the program header for an elf segment.
Then I have factored out a function clear_tail that ensures the zero
padding for an entire elf segment is present.
Please test this and see if it causes your test case to work.
Please also dig into gcc or whichever code generates that horrendous
.bss section and see if that can be fixed so the code can work on older
kernels. A section that only contains .bss has no business not being
properly aligned. Even if the data in that section doesn't start at the
beginning of the page, there is no reason to feed nasty data to other
programs. It just increases the odds of complications for no good
reason. At a minimum that is going to be needed to run that code on
older kernels.
Eric
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7b3d2d491407..f6608df75df6 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -110,43 +110,66 @@ static struct linux_binfmt elf_format = {
#define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
-static int set_brk(unsigned long start, unsigned long end, int prot)
-{
- start = ELF_PAGEALIGN(start);
- end = ELF_PAGEALIGN(end);
- if (end > start) {
- /*
- * Map the last of the bss segment.
- * If the header is requesting these pages to be
- * executable, honour that (ppc32 needs this).
- */
- int error = vm_brk_flags(start, end - start,
- prot & PROT_EXEC ? VM_EXEC : 0);
- if (error)
- return error;
- }
- current->mm->start_brk = current->mm->brk = end;
- return 0;
-}
-
/* We need to explicitly zero any fractional pages
after the data section (i.e. bss). This would
contain the junk from the file that should not
be in memory
*/
-static int padzero(unsigned long elf_bss)
+static int padzero(unsigned long elf_bss, unsigned long elf_brk)
{
unsigned long nbyte;
nbyte = ELF_PAGEOFFSET(elf_bss);
if (nbyte) {
nbyte = ELF_MIN_ALIGN - nbyte;
+ if (nbyte > elf_brk - elf_bss)
+ nbyte = elf_brk - elf_bss;
if (clear_user((void __user *) elf_bss, nbyte))
return -EFAULT;
}
return 0;
}
+static int clear_tail(struct elf_phdr *phdr, unsigned long load_bias, int prot)
+{
+ unsigned long start, end;
+
+ /* Is there a tail to clear? */
+ if (phdr->p_filesz >= phdr->p_memsz)
+ return 0;
+
+ /* Where does the tail start? */
+ if (phdr->p_filesz)
+ start = load_bias + phdr->p_vaddr + phdr->p_filesz;
+ else
+ start = ELF_PAGESTART(load_bias + phdr->p_vaddr);
+
+ /* Where does the tail end? */
+ end = load_bias + phdr->p_vaddr + phdr->p_memsz;
+
+ /*
+ * This bss-zeroing can fail if the ELF
+ * file specifies odd protections. So
+ * we don't check the return value
+ */
+ padzero(start, end);
+
+ start = ELF_PAGEALIGN(start);
+ end = ELF_PAGEALIGN(end);
+ if (end > start) {
+ /*
+ * Map the last of the bss segment.
+ * If the header is requesting these pages to be
+ * executable, honour that (ppc32 needs this).
+ */
+ int error = vm_brk_flags(start, end - start,
+ prot & PROT_EXEC ? VM_EXEC : 0);
+ if (error)
+ return error;
+ }
+ return 0;
+}
+
/* Let's use some macros to make this stack manipulation a little clearer */
#ifdef CONFIG_STACK_GROWSUP
#define STACK_ADD(sp, items) ((elf_addr_t __user *)(sp) + (items))
@@ -379,7 +402,7 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
/* mmap() will return -EINVAL if given a zero size, but a
* segment with zero filesize is perfectly valid */
- if (!size)
+ if (!eppnt->p_filesz)
return addr;
/*
@@ -596,8 +619,6 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
struct elf_phdr *eppnt;
unsigned long load_addr = 0;
int load_addr_set = 0;
- unsigned long last_bss = 0, elf_bss = 0;
- int bss_prot = 0;
unsigned long error = ~0UL;
unsigned long total_size;
int i;
@@ -661,50 +682,13 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
goto out;
}
- /*
- * Find the end of the file mapping for this phdr, and
- * keep track of the largest address we see for this.
- */
- k = load_addr + eppnt->p_vaddr + eppnt->p_filesz;
- if (k > elf_bss)
- elf_bss = k;
-
- /*
- * Do the same thing for the memory mapping - between
- * elf_bss and last_bss is the bss section.
- */
- k = load_addr + eppnt->p_vaddr + eppnt->p_memsz;
- if (k > last_bss) {
- last_bss = k;
- bss_prot = elf_prot;
- }
+ /* Map anonymous pages and clear the tail if needed */
+ error = clear_tail(eppnt, load_addr, elf_prot);
+ if (error)
+ goto out;
}
}
- /*
- * Now fill out the bss section: first pad the last page from
- * the file up to the page boundary, and zero it from elf_bss
- * up to the end of the page.
- */
- if (padzero(elf_bss)) {
- error = -EFAULT;
- goto out;
- }
- /*
- * Next, align both the file and mem bss up to the page size,
- * since this is where elf_bss was just zeroed up to, and where
- * last_bss will end after the vm_brk_flags() below.
- */
- elf_bss = ELF_PAGEALIGN(elf_bss);
- last_bss = ELF_PAGEALIGN(last_bss);
- /* Finally, if there is still more bss to allocate, do it. */
- if (last_bss > elf_bss) {
- error = vm_brk_flags(elf_bss, last_bss - elf_bss,
- bss_prot & PROT_EXEC ? VM_EXEC : 0);
- if (error)
- goto out;
- }
-
error = load_addr;
out:
return error;
@@ -828,8 +812,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
struct elf_phdr *elf_property_phdata = NULL;
- unsigned long elf_bss, elf_brk;
- int bss_prot = 0;
+ unsigned long elf_brk;
int retval, i;
unsigned long elf_entry;
unsigned long e_entry;
@@ -1020,7 +1003,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (retval < 0)
goto out_free_dentry;
- elf_bss = 0;
elf_brk = 0;
start_code = ~0UL;
@@ -1040,32 +1022,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (elf_ppnt->p_type != PT_LOAD)
continue;
- if (unlikely (elf_brk > elf_bss)) {
- unsigned long nbyte;
-
- /* There was a PT_LOAD segment with p_memsz > p_filesz
- before this one. Map anonymous pages, if needed,
- and clear the area. */
- retval = set_brk(elf_bss + load_bias,
- elf_brk + load_bias,
- bss_prot);
- if (retval)
- goto out_free_dentry;
- nbyte = ELF_PAGEOFFSET(elf_bss);
- if (nbyte) {
- nbyte = ELF_MIN_ALIGN - nbyte;
- if (nbyte > elf_brk - elf_bss)
- nbyte = elf_brk - elf_bss;
- if (clear_user((void __user *)elf_bss +
- load_bias, nbyte)) {
- /*
- * This bss-zeroing can fail if the ELF
- * file specifies odd protections. So
- * we don't check the return value
- */
- }
- }
- }
elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
!!interpreter, false);
@@ -1208,42 +1164,31 @@ static int load_elf_binary(struct linux_binprm *bprm)
goto out_free_dentry;
}
- k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;
+ /* Map anonymous pages and clear the tail if needed */
+ retval = clear_tail(elf_ppnt, load_bias, elf_prot);
+ if (retval)
+ goto out_free_dentry;
- if (k > elf_bss)
- elf_bss = k;
+ k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;
if ((elf_ppnt->p_flags & PF_X) && end_code < k)
end_code = k;
if (end_data < k)
end_data = k;
k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
if (k > elf_brk) {
- bss_prot = elf_prot;
elf_brk = k;
}
}
e_entry = elf_ex->e_entry + load_bias;
phdr_addr += load_bias;
- elf_bss += load_bias;
elf_brk += load_bias;
start_code += load_bias;
end_code += load_bias;
start_data += load_bias;
end_data += load_bias;
- /* Calling set_brk effectively mmaps the pages that we need
- * for the bss and break sections. We must do this before
- * mapping in the interpreter, to make sure it doesn't wind
- * up getting placed where the bss needs to go.
- */
- retval = set_brk(elf_bss, elf_brk, bss_prot);
- if (retval)
- goto out_free_dentry;
- if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
- retval = -EFAULT; /* Nobody gets to see this, but.. */
- goto out_free_dentry;
- }
+ current->mm->start_brk = current->mm->brk = ELF_PAGEALIGN(elf_brk);
if (interpreter) {
elf_entry = load_elf_interp(interp_elf_ex,
@@ -1369,7 +1314,6 @@ static int load_elf_library(struct file *file)
{
struct elf_phdr *elf_phdata;
struct elf_phdr *eppnt;
- unsigned long elf_bss, bss, len;
int retval, error, i, j;
struct elfhdr elf_ex;
@@ -1425,19 +1369,9 @@ static int load_elf_library(struct file *file)
if (error != ELF_PAGESTART(eppnt->p_vaddr))
goto out_free_ph;
- elf_bss = eppnt->p_vaddr + eppnt->p_filesz;
- if (padzero(elf_bss)) {
- error = -EFAULT;
+ error = clear_tail(eppnt, 0, 0);
+ if (error)
goto out_free_ph;
- }
-
- len = ELF_PAGEALIGN(eppnt->p_filesz + eppnt->p_vaddr);
- bss = ELF_PAGEALIGN(eppnt->p_memsz + eppnt->p_vaddr);
- if (bss > len) {
- error = vm_brk(len, bss - len);
- if (error)
- goto out_free_ph;
- }
error = 0;
out_free_ph:
Sebastian Ott <[email protected]> writes:
> On Sun, 24 Sep 2023, Eric W. Biederman wrote:
>> Sebastian Ott <[email protected]> writes:
>>
>>> Hej,
>>>
>>> since we figured that the proposed patch is not going to work I've spent a
>>> couple more hours looking at this (some static binaries on arm64 segfault
>>> during load [0]). The segfault happens because of a failed clear_user()
>>> call in load_elf_binary(). The address we try to write zeros to is mapped with
>>> correct permissions.
>>>
>>> After some experiments I've noticed that writing to anonymous mappings work
>>> fine and all the error cases happend on file backed VMAs. Debugging showed that
>>> in elf_map() we call vm_mmap() with a file offset of 15 pages - for a binary
>>> that's less than 1KiB in size.
>>>
>>> Looking at the ELF headers again that 15 pages offset originates from the offset
>>> of the 2nd segment - so, I guess the loader did as instructed and that binary is
>>> just too nasty?
>>>
>>> Program Headers:
>>> Type Offset VirtAddr PhysAddr
>>> FileSiz MemSiz Flags Align
>>> LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
>>> 0x0000000000000178 0x0000000000000178 R E 0x10000
>>> LOAD 0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
>>> 0x0000000000000000 0x0000000000000008 RW 0x10000
>>> NOTE 0x0000000000000120 0x0000000000400120 0x0000000000400120
>>> 0x0000000000000024 0x0000000000000024 R 0x4
>>> GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
>>> 0x0000000000000000 0x0000000000000000 RW 0x10
>>>
>>> As an additional test I've added a bunch of zeros at the end of that binary
>>> so that the offset is within that file and it did load just fine.
>>>
>>> On the other hand there is this section header:
>>> [ 4] .bss NOBITS 000000000041ffe8 0000ffe8
>>> 0000000000000008 0000000000000000 WA 0 0 1
>>>
>>> "sh_offset
>>> This member's value gives the byte offset from the beginning of the file to
>>> the first byte in the section. One section type, SHT_NOBITS described
>>> below, occupies no space in the file, and its sh_offset member locates
>>> the conceptual placement in the file.
>>> "
>>>
>>> So, still not sure what to do here..
>>>
>>> Sebastian
>>>
>>> [0] https://lore.kernel.org/lkml/[email protected]/
>>
>> I think that .bss section that is being generated is atrocious.
>>
>> At the same time I looked at what the linux elf loader is trying to do,
>> and the elf loader's handling of program segments with memsz > filesz
>> has serious remnants a.out of programs allocating memory with the brk
>> syscall.
>>
>> Lots of the structure looks like it started with the assumption that
>> there would only be a single program header with memsz > filesz the way
>> and that was the .bss. The way things were in the a.out days and
>> handling of other cases has been debugged in later.
>>
>> So I have modified elf_map to always return successfully when there is
>> a zero filesz in the program header for an elf segment.
>>
>> Then I have factored out a function clear_tail that ensures the zero
>> padding for an entire elf segment is present.
>>
>> Please test this and see if it causes your test case to work.
>
> Sadly, that causes issues for other programs:
Bah. Too much cleanup at once.
I will respin.
> [ 44.164596] Run /init as init process
> [ 44.168763] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [ 44.176409] CPU: 32 PID: 1 Comm: init Not tainted 6.6.0-rc2+ #89
> [ 44.182404] Hardware name: GIGABYTE R181-T92-00/MT91-FS4-00, BIOS F34 08/13/2020
> [ 44.189786] Call trace:
> [ 44.192220] dump_backtrace+0xa4/0x130
> [ 44.195961] show_stack+0x20/0x38
> [ 44.199264] dump_stack_lvl+0x48/0x60
> [ 44.202917] dump_stack+0x18/0x28
> [ 44.206219] panic+0x2e0/0x350
> [ 44.209264] do_exit+0x370/0x390
> [ 44.212481] do_group_exit+0x3c/0xa0
> [ 44.216044] get_signal+0x800/0x808
> [ 44.219521] do_signal+0xfc/0x200
> [ 44.222824] do_notify_resume+0xc8/0x418
> [ 44.226734] el0_da+0x114/0x120
> [ 44.229866] el0t_64_sync_handler+0xb8/0x130
> [ 44.234124] el0t_64_sync+0x194/0x198
> [ 44.237776] SMP: stopping secondary CPUs
> [ 44.241740] Kernel Offset: disabled
> [ 44.245215] CPU features: 0x03000000,14028142,10004203
> [ 44.250342] Memory Limit: none
> [ 44.253383] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
Eric
On Sun, 24 Sep 2023, Eric W. Biederman wrote:
> Sebastian Ott <[email protected]> writes:
>
>> Hej,
>>
>> since we figured that the proposed patch is not going to work I've spent a
>> couple more hours looking at this (some static binaries on arm64 segfault
>> during load [0]). The segfault happens because of a failed clear_user()
>> call in load_elf_binary(). The address we try to write zeros to is mapped with
>> correct permissions.
>>
>> After some experiments I've noticed that writing to anonymous mappings work
>> fine and all the error cases happend on file backed VMAs. Debugging showed that
>> in elf_map() we call vm_mmap() with a file offset of 15 pages - for a binary
>> that's less than 1KiB in size.
>>
>> Looking at the ELF headers again that 15 pages offset originates from the offset
>> of the 2nd segment - so, I guess the loader did as instructed and that binary is
>> just too nasty?
>>
>> Program Headers:
>> Type Offset VirtAddr PhysAddr
>> FileSiz MemSiz Flags Align
>> LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
>> 0x0000000000000178 0x0000000000000178 R E 0x10000
>> LOAD 0x000000000000ffe8 0x000000000041ffe8 0x000000000041ffe8
>> 0x0000000000000000 0x0000000000000008 RW 0x10000
>> NOTE 0x0000000000000120 0x0000000000400120 0x0000000000400120
>> 0x0000000000000024 0x0000000000000024 R 0x4
>> GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
>> 0x0000000000000000 0x0000000000000000 RW 0x10
>>
>> As an additional test I've added a bunch of zeros at the end of that binary
>> so that the offset is within that file and it did load just fine.
>>
>> On the other hand there is this section header:
>> [ 4] .bss NOBITS 000000000041ffe8 0000ffe8
>> 0000000000000008 0000000000000000 WA 0 0 1
>>
>> "sh_offset
>> This member's value gives the byte offset from the beginning of the file to
>> the first byte in the section. One section type, SHT_NOBITS described
>> below, occupies no space in the file, and its sh_offset member locates
>> the conceptual placement in the file.
>> "
>>
>> So, still not sure what to do here..
>>
>> Sebastian
>>
>> [0] https://lore.kernel.org/lkml/[email protected]/
>
> I think that .bss section that is being generated is atrocious.
>
> At the same time I looked at what the linux elf loader is trying to do,
> and the elf loader's handling of program segments with memsz > filesz
> has serious remnants a.out of programs allocating memory with the brk
> syscall.
>
> Lots of the structure looks like it started with the assumption that
> there would only be a single program header with memsz > filesz the way
> and that was the .bss. The way things were in the a.out days and
> handling of other cases has been debugged in later.
>
> So I have modified elf_map to always return successfully when there is
> a zero filesz in the program header for an elf segment.
>
> Then I have factored out a function clear_tail that ensures the zero
> padding for an entire elf segment is present.
>
> Please test this and see if it causes your test case to work.
Sadly, that causes issues for other programs:
[ 44.164596] Run /init as init process
[ 44.168763] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 44.176409] CPU: 32 PID: 1 Comm: init Not tainted 6.6.0-rc2+ #89
[ 44.182404] Hardware name: GIGABYTE R181-T92-00/MT91-FS4-00, BIOS F34 08/13/2020
[ 44.189786] Call trace:
[ 44.192220] dump_backtrace+0xa4/0x130
[ 44.195961] show_stack+0x20/0x38
[ 44.199264] dump_stack_lvl+0x48/0x60
[ 44.202917] dump_stack+0x18/0x28
[ 44.206219] panic+0x2e0/0x350
[ 44.209264] do_exit+0x370/0x390
[ 44.212481] do_group_exit+0x3c/0xa0
[ 44.216044] get_signal+0x800/0x808
[ 44.219521] do_signal+0xfc/0x200
[ 44.222824] do_notify_resume+0xc8/0x418
[ 44.226734] el0_da+0x114/0x120
[ 44.229866] el0t_64_sync_handler+0xb8/0x130
[ 44.234124] el0t_64_sync+0x194/0x198
[ 44.237776] SMP: stopping secondary CPUs
[ 44.241740] Kernel Offset: disabled
[ 44.245215] CPU features: 0x03000000,14028142,10004203
[ 44.250342] Memory Limit: none
[ 44.253383] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---