Hi,
This is the continuation of the work Eric started for handling
"p_memsz > p_filesz" in arbitrary segments (rather than just the last,
BSS, segment). I've added the suggested changes:
- drop unused "elf_bss" variable
- refactor load_elf_interp() to use elf_load()
- refactor load_elf_library() to use elf_load()
- report padzero() errors when PROT_WRITE is present
- drop vm_brk()
Thanks!
-Kees
v4:
- refactor load_elf_library() too
- don't refactor padzero(), just test in the only remaining caller
- drop now-unused vm_brk()
v3: https://lore.kernel.org/all/[email protected]
v2: https://lore.kernel.org/lkml/[email protected]
v1: https://lore.kernel.org/lkml/[email protected]
Eric W. Biederman (1):
binfmt_elf: Support segments with 0 filesz and misaligned starts
Kees Cook (5):
binfmt_elf: elf_bss no longer used by load_elf_binary()
binfmt_elf: Use elf_load() for interpreter
binfmt_elf: Use elf_load() for library
binfmt_elf: Only report padzero() errors when PROT_WRITE
mm: Remove unused vm_brk()
fs/binfmt_elf.c | 214 ++++++++++++++++-----------------------------
include/linux/mm.h | 3 +-
mm/mmap.c | 6 --
mm/nommu.c | 5 --
4 files changed, 76 insertions(+), 152 deletions(-)
--
2.34.1
From: "Eric W. Biederman" <[email protected]>
Implement a helper elf_load() that wraps elf_map() and performs all
of the necessary work to ensure that when "memsz > filesz" the bytes
described by "memsz > filesz" are zeroed.
An outstanding issue is if the first segment has filesz 0, and has a
randomized location. But that is the same as today.
In this change I replaced an open coded padzero() that did not clear
all of the way to the end of the page, with padzero() that does.
I also stopped checking the return of padzero() as there is at least
one known case where testing for failure is the wrong thing to do.
It looks like binfmt_elf_fdpic may have the proper set of tests
for when error handling can be safely completed.
I found a couple of commits in the old history
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git,
that look very interesting in understanding this code.
commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail")
commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c")
commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2")
Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"):
> commit 39b56d902bf35241e7cba6cc30b828ed937175ad
> Author: Pavel Machek <[email protected]>
> Date: Wed Feb 9 22:40:30 2005 -0800
>
> [PATCH] binfmt_elf: clearing bss may fail
>
> So we discover that Borland's Kylix application builder emits weird elf
> files which describe a non-writeable bss segment.
>
> So remove the clear_user() check at the place where we zero out the bss. I
> don't _think_ there are any security implications here (plus we've never
> checked that clear_user() return value, so whoops if it is a problem).
>
> Signed-off-by: Pavel Machek <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
It seems pretty clear that binfmt_elf_fdpic with skipping clear_user() for
non-writable segments and otherwise calling clear_user(), aka padzero(),
and checking it's return code is the right thing to do.
I just skipped the error checking as that avoids breaking things.
And notably, it looks like Borland's Kylix died in 2005 so it might be
safe to just consider read-only segments with memsz > filesz an error.
Reported-by: Sebastian Ott <[email protected]>
Reported-by: Thomas Weißschuh <[email protected]>
Closes: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
---
fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
1 file changed, 48 insertions(+), 63 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7b3d2d491407..2a615f476e44 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -110,25 +110,6 @@ static struct linux_binfmt elf_format = {
#define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
-static int set_brk(unsigned long start, unsigned long end, int prot)
-{
- start = ELF_PAGEALIGN(start);
- end = ELF_PAGEALIGN(end);
- if (end > start) {
- /*
- * Map the last of the bss segment.
- * If the header is requesting these pages to be
- * executable, honour that (ppc32 needs this).
- */
- int error = vm_brk_flags(start, end - start,
- prot & PROT_EXEC ? VM_EXEC : 0);
- if (error)
- return error;
- }
- current->mm->start_brk = current->mm->brk = end;
- return 0;
-}
-
/* We need to explicitly zero any fractional pages
after the data section (i.e. bss). This would
contain the junk from the file that should not
@@ -406,6 +387,51 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
return(map_addr);
}
+static unsigned long elf_load(struct file *filep, unsigned long addr,
+ const struct elf_phdr *eppnt, int prot, int type,
+ unsigned long total_size)
+{
+ unsigned long zero_start, zero_end;
+ unsigned long map_addr;
+
+ if (eppnt->p_filesz) {
+ map_addr = elf_map(filep, addr, eppnt, prot, type, total_size);
+ if (BAD_ADDR(map_addr))
+ return map_addr;
+ if (eppnt->p_memsz > eppnt->p_filesz) {
+ zero_start = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+ eppnt->p_filesz;
+ zero_end = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+ eppnt->p_memsz;
+
+ /* Zero the end of the last mapped page */
+ padzero(zero_start);
+ }
+ } else {
+ map_addr = zero_start = ELF_PAGESTART(addr);
+ zero_end = zero_start + ELF_PAGEOFFSET(eppnt->p_vaddr) +
+ eppnt->p_memsz;
+ }
+ if (eppnt->p_memsz > eppnt->p_filesz) {
+ /*
+ * Map the last of the segment.
+ * If the header is requesting these pages to be
+ * executable, honour that (ppc32 needs this).
+ */
+ int error;
+
+ zero_start = ELF_PAGEALIGN(zero_start);
+ zero_end = ELF_PAGEALIGN(zero_end);
+
+ error = vm_brk_flags(zero_start, zero_end - zero_start,
+ prot & PROT_EXEC ? VM_EXEC : 0);
+ if (error)
+ map_addr = error;
+ }
+ return map_addr;
+}
+
+
static unsigned long total_mapping_size(const struct elf_phdr *phdr, int nr)
{
elf_addr_t min_addr = -1;
@@ -829,7 +855,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
struct elf_phdr *elf_property_phdata = NULL;
unsigned long elf_bss, elf_brk;
- int bss_prot = 0;
int retval, i;
unsigned long elf_entry;
unsigned long e_entry;
@@ -1040,33 +1065,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (elf_ppnt->p_type != PT_LOAD)
continue;
- if (unlikely (elf_brk > elf_bss)) {
- unsigned long nbyte;
-
- /* There was a PT_LOAD segment with p_memsz > p_filesz
- before this one. Map anonymous pages, if needed,
- and clear the area. */
- retval = set_brk(elf_bss + load_bias,
- elf_brk + load_bias,
- bss_prot);
- if (retval)
- goto out_free_dentry;
- nbyte = ELF_PAGEOFFSET(elf_bss);
- if (nbyte) {
- nbyte = ELF_MIN_ALIGN - nbyte;
- if (nbyte > elf_brk - elf_bss)
- nbyte = elf_brk - elf_bss;
- if (clear_user((void __user *)elf_bss +
- load_bias, nbyte)) {
- /*
- * This bss-zeroing can fail if the ELF
- * file specifies odd protections. So
- * we don't check the return value
- */
- }
- }
- }
-
elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
!!interpreter, false);
@@ -1162,7 +1160,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
}
- error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
+ error = elf_load(bprm->file, load_bias + vaddr, elf_ppnt,
elf_prot, elf_flags, total_size);
if (BAD_ADDR(error)) {
retval = IS_ERR_VALUE(error) ?
@@ -1217,10 +1215,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (end_data < k)
end_data = k;
k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
- if (k > elf_brk) {
- bss_prot = elf_prot;
+ if (k > elf_brk)
elf_brk = k;
- }
}
e_entry = elf_ex->e_entry + load_bias;
@@ -1232,18 +1228,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
start_data += load_bias;
end_data += load_bias;
- /* Calling set_brk effectively mmaps the pages that we need
- * for the bss and break sections. We must do this before
- * mapping in the interpreter, to make sure it doesn't wind
- * up getting placed where the bss needs to go.
- */
- retval = set_brk(elf_bss, elf_brk, bss_prot);
- if (retval)
- goto out_free_dentry;
- if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
- retval = -EFAULT; /* Nobody gets to see this, but.. */
- goto out_free_dentry;
- }
+ current->mm->start_brk = current->mm->brk = ELF_PAGEALIGN(elf_brk);
if (interpreter) {
elf_entry = load_elf_interp(interp_elf_ex,
--
2.34.1
Handle arbitrary memsz>filesz in interpreter ELF segments, instead of
only supporting it in the last segment (which is expected to be the
BSS).
Cc: Eric Biederman <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Reported-by: Pedro Falcato <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Kees Cook <[email protected]>
---
fs/binfmt_elf.c | 46 +---------------------------------------------
1 file changed, 1 insertion(+), 45 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 0214d5a949fc..db47cb802f89 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -622,8 +622,6 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
struct elf_phdr *eppnt;
unsigned long load_addr = 0;
int load_addr_set = 0;
- unsigned long last_bss = 0, elf_bss = 0;
- int bss_prot = 0;
unsigned long error = ~0UL;
unsigned long total_size;
int i;
@@ -660,7 +658,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
else if (no_base && interp_elf_ex->e_type == ET_DYN)
load_addr = -vaddr;
- map_addr = elf_map(interpreter, load_addr + vaddr,
+ map_addr = elf_load(interpreter, load_addr + vaddr,
eppnt, elf_prot, elf_type, total_size);
total_size = 0;
error = map_addr;
@@ -686,51 +684,9 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
error = -ENOMEM;
goto out;
}
-
- /*
- * Find the end of the file mapping for this phdr, and
- * keep track of the largest address we see for this.
- */
- k = load_addr + eppnt->p_vaddr + eppnt->p_filesz;
- if (k > elf_bss)
- elf_bss = k;
-
- /*
- * Do the same thing for the memory mapping - between
- * elf_bss and last_bss is the bss section.
- */
- k = load_addr + eppnt->p_vaddr + eppnt->p_memsz;
- if (k > last_bss) {
- last_bss = k;
- bss_prot = elf_prot;
- }
}
}
- /*
- * Now fill out the bss section: first pad the last page from
- * the file up to the page boundary, and zero it from elf_bss
- * up to the end of the page.
- */
- if (padzero(elf_bss)) {
- error = -EFAULT;
- goto out;
- }
- /*
- * Next, align both the file and mem bss up to the page size,
- * since this is where elf_bss was just zeroed up to, and where
- * last_bss will end after the vm_brk_flags() below.
- */
- elf_bss = ELF_PAGEALIGN(elf_bss);
- last_bss = ELF_PAGEALIGN(last_bss);
- /* Finally, if there is still more bss to allocate, do it. */
- if (last_bss > elf_bss) {
- error = vm_brk_flags(elf_bss, last_bss - elf_bss,
- bss_prot & PROT_EXEC ? VM_EXEC : 0);
- if (error)
- goto out;
- }
-
error = load_addr;
out:
return error;
--
2.34.1
With the BSS handled generically via the new filesz/memsz mismatch
handling logic in elf_load(), elf_bss no longer needs to be tracked.
Drop the variable.
Cc: Eric Biederman <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Suggested-by: Eric Biederman <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
---
fs/binfmt_elf.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 2a615f476e44..0214d5a949fc 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -854,7 +854,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
struct elf_phdr *elf_property_phdata = NULL;
- unsigned long elf_bss, elf_brk;
+ unsigned long elf_brk;
int retval, i;
unsigned long elf_entry;
unsigned long e_entry;
@@ -1045,7 +1045,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (retval < 0)
goto out_free_dentry;
- elf_bss = 0;
elf_brk = 0;
start_code = ~0UL;
@@ -1208,8 +1207,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;
- if (k > elf_bss)
- elf_bss = k;
if ((elf_ppnt->p_flags & PF_X) && end_code < k)
end_code = k;
if (end_data < k)
@@ -1221,7 +1218,6 @@ static int load_elf_binary(struct linux_binprm *bprm)
e_entry = elf_ex->e_entry + load_bias;
phdr_addr += load_bias;
- elf_bss += load_bias;
elf_brk += load_bias;
start_code += load_bias;
end_code += load_bias;
--
2.34.1
Pedro Falcato <[email protected]> writes:
> On Fri, Sep 29, 2023 at 4:24 AM Kees Cook <[email protected]> wrote:
>>
>> From: "Eric W. Biederman" <[email protected]>
>>
>> Implement a helper elf_load() that wraps elf_map() and performs all
>> of the necessary work to ensure that when "memsz > filesz" the bytes
>> described by "memsz > filesz" are zeroed.
>>
>> An outstanding issue is if the first segment has filesz 0, and has a
>> randomized location. But that is the same as today.
>>
>> In this change I replaced an open coded padzero() that did not clear
>> all of the way to the end of the page, with padzero() that does.
>>
>> I also stopped checking the return of padzero() as there is at least
>> one known case where testing for failure is the wrong thing to do.
>> It looks like binfmt_elf_fdpic may have the proper set of tests
>> for when error handling can be safely completed.
>>
>> I found a couple of commits in the old history
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git,
>> that look very interesting in understanding this code.
>>
>> commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail")
>> commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c")
>> commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2")
>>
>> Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"):
>> > commit 39b56d902bf35241e7cba6cc30b828ed937175ad
>> > Author: Pavel Machek <[email protected]>
>> > Date: Wed Feb 9 22:40:30 2005 -0800
>> >
>> > [PATCH] binfmt_elf: clearing bss may fail
>> >
>> > So we discover that Borland's Kylix application builder emits weird elf
>> > files which describe a non-writeable bss segment.
>> >
>> > So remove the clear_user() check at the place where we zero out the bss. I
>> > don't _think_ there are any security implications here (plus we've never
>> > checked that clear_user() return value, so whoops if it is a problem).
>> >
>> > Signed-off-by: Pavel Machek <[email protected]>
>> > Signed-off-by: Andrew Morton <[email protected]>
>> > Signed-off-by: Linus Torvalds <[email protected]>
>>
>> It seems pretty clear that binfmt_elf_fdpic with skipping clear_user() for
>> non-writable segments and otherwise calling clear_user(), aka padzero(),
>> and checking it's return code is the right thing to do.
>>
>> I just skipped the error checking as that avoids breaking things.
>>
>> And notably, it looks like Borland's Kylix died in 2005 so it might be
>> safe to just consider read-only segments with memsz > filesz an error.
>>
>> Reported-by: Sebastian Ott <[email protected]>
>> Reported-by: Thomas Weißschuh <[email protected]>
>> Closes: https://lkml.kernel.org/r/[email protected]
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> Link: https://lore.kernel.org/r/[email protected]
>> Signed-off-by: Kees Cook <[email protected]>
>> ---
>> fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
>> 1 file changed, 48 insertions(+), 63 deletions(-)
>>
>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>> index 7b3d2d491407..2a615f476e44 100644
>> --- a/fs/binfmt_elf.c
>> +++ b/fs/binfmt_elf.c
>> @@ -110,25 +110,6 @@ static struct linux_binfmt elf_format = {
>>
>> #define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
>>
>> -static int set_brk(unsigned long start, unsigned long end, int prot)
>> -{
>> - start = ELF_PAGEALIGN(start);
>> - end = ELF_PAGEALIGN(end);
>> - if (end > start) {
>> - /*
>> - * Map the last of the bss segment.
>> - * If the header is requesting these pages to be
>> - * executable, honour that (ppc32 needs this).
>> - */
>> - int error = vm_brk_flags(start, end - start,
>> - prot & PROT_EXEC ? VM_EXEC : 0);
>> - if (error)
>> - return error;
>> - }
>> - current->mm->start_brk = current->mm->brk = end;
>> - return 0;
>> -}
>> -
>> /* We need to explicitly zero any fractional pages
>> after the data section (i.e. bss). This would
>> contain the junk from the file that should not
>> @@ -406,6 +387,51 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
>> return(map_addr);
>> }
>>
>> +static unsigned long elf_load(struct file *filep, unsigned long addr,
>> + const struct elf_phdr *eppnt, int prot, int type,
>> + unsigned long total_size)
>> +{
>> + unsigned long zero_start, zero_end;
>> + unsigned long map_addr;
>> +
>> + if (eppnt->p_filesz) {
>> + map_addr = elf_map(filep, addr, eppnt, prot, type, total_size);
>> + if (BAD_ADDR(map_addr))
>> + return map_addr;
>> + if (eppnt->p_memsz > eppnt->p_filesz) {
>> + zero_start = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
>> + eppnt->p_filesz;
>> + zero_end = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
>> + eppnt->p_memsz;
>> +
>> + /* Zero the end of the last mapped page */
>> + padzero(zero_start);
>> + }
>> + } else {
>> + map_addr = zero_start = ELF_PAGESTART(addr);
>> + zero_end = zero_start + ELF_PAGEOFFSET(eppnt->p_vaddr) +
>> + eppnt->p_memsz;
>
> What happens if a previous segment has mapped ELF_PAGESTART(addr)?
> Don't we risk mapping over that?
It is bug of whomever produced the ELF executable.
The architectural page size is known is part of the per architecture
sysv ABI. Typical it is the same or larger than the hardware page
size.
ELF executables are always mmaped in page sized chunks. Which makes a
starting offset part-way through a page weird, and a bit awkward, but it
is something the code already attempts to handle.
> Whereas AFAIK old logic would just padzero the bss bytes.
No. The old logic would either map that region with elf_map, and then
call padzero to zero out the bss bytes, or the old logic would fail if
the file offset was not contained within the file.
The updated logic if "filesz == 0" simply ignores the file offset
and always mmaps /dev/zero instead. This means that giving a bogus
file offset does not unnecessarily cause an executable to fail.
If the desired behavior is to have file contents and bss on the same
page of data, the generator of the elf program header needs to
have "memsz > filesz". That is already well supported for everything
except the elf interpreters.
Eric
Pedro Falcato <[email protected]> writes:
> On Fri, Sep 29, 2023 at 4:24 AM Kees Cook <[email protected]> wrote:
>>
>> Hi,
>>
>> This is the continuation of the work Eric started for handling
>> "p_memsz > p_filesz" in arbitrary segments (rather than just the last,
>> BSS, segment). I've added the suggested changes:
>>
>> - drop unused "elf_bss" variable
>> - refactor load_elf_interp() to use elf_load()
>> - refactor load_elf_library() to use elf_load()
>> - report padzero() errors when PROT_WRITE is present
>> - drop vm_brk()
>>
>> Thanks!
>>
>> -Kees
>>
>> v4:
>> - refactor load_elf_library() too
>> - don't refactor padzero(), just test in the only remaining caller
>> - drop now-unused vm_brk()
>> v3: https://lore.kernel.org/all/[email protected]
>> v2: https://lore.kernel.org/lkml/[email protected]
>> v1: https://lore.kernel.org/lkml/[email protected]
>>
>> Eric W. Biederman (1):
>> binfmt_elf: Support segments with 0 filesz and misaligned starts
>>
>> Kees Cook (5):
>> binfmt_elf: elf_bss no longer used by load_elf_binary()
>> binfmt_elf: Use elf_load() for interpreter
>> binfmt_elf: Use elf_load() for library
>> binfmt_elf: Only report padzero() errors when PROT_WRITE
>> mm: Remove unused vm_brk()
>>
>> fs/binfmt_elf.c | 214 ++++++++++++++++-----------------------------
>> include/linux/mm.h | 3 +-
>> mm/mmap.c | 6 --
>> mm/nommu.c | 5 --
>> 4 files changed, 76 insertions(+), 152 deletions(-)
>
> Sorry for taking so long to take a look at this.
> While I didn't test PPC64 (I don't own PPC64 hardware, and I wasn't
> the original reporter), I did manage to craft a reduced test case of:
>
> a.out:
>
> Program Headers:
> Type Offset VirtAddr PhysAddr
> FileSiz MemSiz Flags Align
> PHDR 0x0000000000000040 0x0000000000000040 0x0000000000000040
> 0x00000000000001f8 0x00000000000001f8 R 0x8
> INTERP 0x0000000000000238 0x0000000000000238 0x0000000000000238
> 0x0000000000000020 0x0000000000000020 R 0x1
> [Requesting program interpreter: /home/pfalcato/musl/lib/libc.so]
> LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
> 0x0000000000000428 0x0000000000000428 R 0x1000
> LOAD 0x0000000000001000 0x0000000000001000 0x0000000000001000
> 0x00000000000000cd 0x00000000000000cd R E 0x1000
> LOAD 0x0000000000002000 0x0000000000002000 0x0000000000002000
> 0x0000000000000084 0x0000000000000084 R 0x1000
> LOAD 0x0000000000002e50 0x0000000000003e50 0x0000000000003e50
> 0x00000000000001c8 0x00000000000001c8 RW 0x1000
> DYNAMIC 0x0000000000002e50 0x0000000000003e50 0x0000000000003e50
> 0x0000000000000180 0x0000000000000180 RW 0x8
> GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
> 0x0000000000000000 0x0000000000000000 RW 0x10
> GNU_RELRO 0x0000000000002e50 0x0000000000003e50 0x0000000000003e50
> 0x00000000000001b0 0x00000000000001b0 R 0x1
>
> /home/pfalcato/musl/lib/libc.so:
> Program Headers:
> Type Offset VirtAddr PhysAddr
> FileSiz MemSiz Flags Align
> PHDR 0x0000000000000040 0x0000000000000040 0x0000000000000040
> 0x0000000000000230 0x0000000000000230 R 0x8
> LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
> 0x0000000000049d9c 0x0000000000049d9c R 0x1000
> LOAD 0x0000000000049da0 0x000000000004ada0 0x000000000004ada0
> 0x0000000000057d30 0x0000000000057d30 R E 0x1000
> LOAD 0x00000000000a1ad0 0x00000000000a3ad0 0x00000000000a3ad0
> 0x00000000000005f0 0x00000000000015f0 RW 0x1000
> LOAD 0x00000000000a20c0 0x00000000000a60c0 0x00000000000a60c0
> 0x0000000000000428 0x0000000000002f80 RW 0x1000
> DYNAMIC 0x00000000000a1f38 0x00000000000a3f38 0x00000000000a3f38
> 0x0000000000000110 0x0000000000000110 RW 0x8
> GNU_RELRO 0x00000000000a1ad0 0x00000000000a3ad0 0x00000000000a3ad0
> 0x00000000000005f0 0x0000000000002530 R 0x1
> GNU_EH_FRAME 0x0000000000049d10 0x0000000000049d10 0x0000000000049d10
> 0x0000000000000024 0x0000000000000024 R 0x4
> GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
> 0x0000000000000000 0x0000000000000000 RW 0x0
> NOTE 0x0000000000000270 0x0000000000000270 0x0000000000000270
> 0x0000000000000018 0x0000000000000018 R 0x4
>
> Section to Segment mapping:
> Segment Sections...
> 00
> 01 .note.gnu.build-id .dynsym .gnu.hash .hash .dynstr .rela.dyn
> .rela.plt .rodata .eh_frame_hdr .eh_frame
> 02 .text .plt
> 03 .data.rel.ro .dynamic .got .toc
> 04 .data .got.plt .bss
> 05 .dynamic
> 06 .data.rel.ro .dynamic .got .toc
> 07 .eh_frame_hdr
> 08
> 09 .note.gnu.build-id
>
>
> So on that end, you can take my
>
> Tested-by: Pedro Falcato <[email protected]>
>
> Although this still doesn't address the other bug I found
> (https://github.com/heatd/elf-bug-questionmark), where segments can
> accidentally overwrite cleared BSS if we end up in a situation where
> e.g we have the following segments:
>
> Program Headers:
> Type Offset VirtAddr PhysAddr
> FileSiz MemSiz Flags Align
> LOAD 0x0000000000001000 0x0000000000400000 0x0000000000400000
> 0x0000000000000045 0x0000000000000045 R E 0x1000
> LOAD 0x0000000000002000 0x0000000000401000 0x0000000000401000
> 0x000000000000008c 0x000000000000008c R 0x1000
> LOAD 0x0000000000000000 0x0000000000402000 0x0000000000402000
> 0x0000000000000000 0x0000000000000801 RW 0x1000
> LOAD 0x0000000000002801 0x0000000000402801 0x0000000000402801
> 0x0000000000000007 0x0000000000000007 RW 0x1000
> NOTE 0x0000000000002068 0x0000000000401068 0x0000000000401068
> 0x0000000000000024 0x0000000000000024 0x4
>
> Section to Segment mapping:
> Segment Sections...
> 00 .text
> 01 .rodata .note.gnu.property .note.gnu.build-id
> 02 .bss
> 03 .data
> 04 .note.gnu.build-id
>
> since the mmap of .data will end up happening over .bss.
This is simply invalid userspace, doubly so with the alignment set to
0x1000.
The best the kernel can do is have a pre-pass through the elf program
headers (before the point of no-return) and if they describe overlapping
segments or out of order segments, cause execve syscall to fail.
Eric
Sebastian Ott <[email protected]> writes:
> Hello Kees,
>
> On Thu, 28 Sep 2023, Kees Cook wrote:
>> This is the continuation of the work Eric started for handling
>> "p_memsz > p_filesz" in arbitrary segments (rather than just the last,
>> BSS, segment). I've added the suggested changes:
>>
>> - drop unused "elf_bss" variable
>> - refactor load_elf_interp() to use elf_load()
>> - refactor load_elf_library() to use elf_load()
>> - report padzero() errors when PROT_WRITE is present
>> - drop vm_brk()
>
> While I was debugging the initial issue I stumbled over the following
> - care to take it as part of this series?
>
> ----->8
> [PATCH] mm: vm_brk_flags don't bail out while holding lock
>
> Calling vm_brk_flags() with flags set other than VM_EXEC
> will exit the function without releasing the mmap_write_lock.
>
> Just do the sanity check before the lock is acquired. This
> doesn't fix an actual issue since no caller sets a flag other
> than VM_EXEC.
That seems like a sensible patch.
Have you by any chance read this code enough to understand what is
gained by calling vm_brk_flags rather than vm_mmap without a file?
Unless there is a real advantage it probably makes sense to replace
the call of vm_brk_flags with vm_mmap(NULL, ...) as binfmt_elf_fdpic
has already done.
That would allow removing vm_brk_flags and sys_brk would be the last
caller of do_brk_flags.
Eric
> Cc: Andrew Morton <[email protected]>
> Cc: [email protected]
> Signed-off-by: Sebastian Ott <[email protected]>
> ---
> mm/mmap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b56a7f0c9f85..7ed286662839 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3143,13 +3143,13 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
> if (!len)
> return 0;
>
> - if (mmap_write_lock_killable(mm))
> - return -EINTR;
> -
> /* Until we need other flags, refuse anything except VM_EXEC. */
> if ((flags & (~VM_EXEC)) != 0)
> return -EINVAL;
>
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;
> +
> ret = check_brk_limits(addr, len);
> if (ret)
> goto limits_failed;
Hello Kees,
On Thu, 28 Sep 2023, Kees Cook wrote:
> This is the continuation of the work Eric started for handling
> "p_memsz > p_filesz" in arbitrary segments (rather than just the last,
> BSS, segment). I've added the suggested changes:
>
> - drop unused "elf_bss" variable
> - refactor load_elf_interp() to use elf_load()
> - refactor load_elf_library() to use elf_load()
> - report padzero() errors when PROT_WRITE is present
> - drop vm_brk()
While I was debugging the initial issue I stumbled over the following
- care to take it as part of this series?
----->8
[PATCH] mm: vm_brk_flags don't bail out while holding lock
Calling vm_brk_flags() with flags set other than VM_EXEC
will exit the function without releasing the mmap_write_lock.
Just do the sanity check before the lock is acquired. This
doesn't fix an actual issue since no caller sets a flag other
than VM_EXEC.
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Ott <[email protected]>
---
mm/mmap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index b56a7f0c9f85..7ed286662839 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3143,13 +3143,13 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
if (!len)
return 0;
- if (mmap_write_lock_killable(mm))
- return -EINTR;
-
/* Until we need other flags, refuse anything except VM_EXEC. */
if ((flags & (~VM_EXEC)) != 0)
return -EINVAL;
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+
ret = check_brk_limits(addr, len);
if (ret)
goto limits_failed;
--
2.41.0
On Fri, Sep 29, 2023 at 4:24 AM Kees Cook <[email protected]> wrote:
>
> Hi,
>
> This is the continuation of the work Eric started for handling
> "p_memsz > p_filesz" in arbitrary segments (rather than just the last,
> BSS, segment). I've added the suggested changes:
>
> - drop unused "elf_bss" variable
> - refactor load_elf_interp() to use elf_load()
> - refactor load_elf_library() to use elf_load()
> - report padzero() errors when PROT_WRITE is present
> - drop vm_brk()
>
> Thanks!
>
> -Kees
>
> v4:
> - refactor load_elf_library() too
> - don't refactor padzero(), just test in the only remaining caller
> - drop now-unused vm_brk()
> v3: https://lore.kernel.org/all/[email protected]
> v2: https://lore.kernel.org/lkml/[email protected]
> v1: https://lore.kernel.org/lkml/[email protected]
>
> Eric W. Biederman (1):
> binfmt_elf: Support segments with 0 filesz and misaligned starts
>
> Kees Cook (5):
> binfmt_elf: elf_bss no longer used by load_elf_binary()
> binfmt_elf: Use elf_load() for interpreter
> binfmt_elf: Use elf_load() for library
> binfmt_elf: Only report padzero() errors when PROT_WRITE
> mm: Remove unused vm_brk()
>
> fs/binfmt_elf.c | 214 ++++++++++++++++-----------------------------
> include/linux/mm.h | 3 +-
> mm/mmap.c | 6 --
> mm/nommu.c | 5 --
> 4 files changed, 76 insertions(+), 152 deletions(-)
Sorry for taking so long to take a look at this.
While I didn't test PPC64 (I don't own PPC64 hardware, and I wasn't
the original reporter), I did manage to craft a reduced test case of:
a.out:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000000040 0x0000000000000040
0x00000000000001f8 0x00000000000001f8 R 0x8
INTERP 0x0000000000000238 0x0000000000000238 0x0000000000000238
0x0000000000000020 0x0000000000000020 R 0x1
[Requesting program interpreter: /home/pfalcato/musl/lib/libc.so]
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000428 0x0000000000000428 R 0x1000
LOAD 0x0000000000001000 0x0000000000001000 0x0000000000001000
0x00000000000000cd 0x00000000000000cd R E 0x1000
LOAD 0x0000000000002000 0x0000000000002000 0x0000000000002000
0x0000000000000084 0x0000000000000084 R 0x1000
LOAD 0x0000000000002e50 0x0000000000003e50 0x0000000000003e50
0x00000000000001c8 0x00000000000001c8 RW 0x1000
DYNAMIC 0x0000000000002e50 0x0000000000003e50 0x0000000000003e50
0x0000000000000180 0x0000000000000180 RW 0x8
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
GNU_RELRO 0x0000000000002e50 0x0000000000003e50 0x0000000000003e50
0x00000000000001b0 0x00000000000001b0 R 0x1
/home/pfalcato/musl/lib/libc.so:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000000040 0x0000000000000040
0x0000000000000230 0x0000000000000230 R 0x8
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000049d9c 0x0000000000049d9c R 0x1000
LOAD 0x0000000000049da0 0x000000000004ada0 0x000000000004ada0
0x0000000000057d30 0x0000000000057d30 R E 0x1000
LOAD 0x00000000000a1ad0 0x00000000000a3ad0 0x00000000000a3ad0
0x00000000000005f0 0x00000000000015f0 RW 0x1000
LOAD 0x00000000000a20c0 0x00000000000a60c0 0x00000000000a60c0
0x0000000000000428 0x0000000000002f80 RW 0x1000
DYNAMIC 0x00000000000a1f38 0x00000000000a3f38 0x00000000000a3f38
0x0000000000000110 0x0000000000000110 RW 0x8
GNU_RELRO 0x00000000000a1ad0 0x00000000000a3ad0 0x00000000000a3ad0
0x00000000000005f0 0x0000000000002530 R 0x1
GNU_EH_FRAME 0x0000000000049d10 0x0000000000049d10 0x0000000000049d10
0x0000000000000024 0x0000000000000024 R 0x4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x0
NOTE 0x0000000000000270 0x0000000000000270 0x0000000000000270
0x0000000000000018 0x0000000000000018 R 0x4
Section to Segment mapping:
Segment Sections...
00
01 .note.gnu.build-id .dynsym .gnu.hash .hash .dynstr .rela.dyn
.rela.plt .rodata .eh_frame_hdr .eh_frame
02 .text .plt
03 .data.rel.ro .dynamic .got .toc
04 .data .got.plt .bss
05 .dynamic
06 .data.rel.ro .dynamic .got .toc
07 .eh_frame_hdr
08
09 .note.gnu.build-id
So on that end, you can take my
Tested-by: Pedro Falcato <[email protected]>
Although this still doesn't address the other bug I found
(https://github.com/heatd/elf-bug-questionmark), where segments can
accidentally overwrite cleared BSS if we end up in a situation where
e.g we have the following segments:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000001000 0x0000000000400000 0x0000000000400000
0x0000000000000045 0x0000000000000045 R E 0x1000
LOAD 0x0000000000002000 0x0000000000401000 0x0000000000401000
0x000000000000008c 0x000000000000008c R 0x1000
LOAD 0x0000000000000000 0x0000000000402000 0x0000000000402000
0x0000000000000000 0x0000000000000801 RW 0x1000
LOAD 0x0000000000002801 0x0000000000402801 0x0000000000402801
0x0000000000000007 0x0000000000000007 RW 0x1000
NOTE 0x0000000000002068 0x0000000000401068 0x0000000000401068
0x0000000000000024 0x0000000000000024 0x4
Section to Segment mapping:
Segment Sections...
00 .text
01 .rodata .note.gnu.property .note.gnu.build-id
02 .bss
03 .data
04 .note.gnu.build-id
since the mmap of .data will end up happening over .bss.
--
Pedro
On Fri, Sep 29, 2023 at 4:24 AM Kees Cook <[email protected]> wrote:
>
> From: "Eric W. Biederman" <[email protected]>
>
> Implement a helper elf_load() that wraps elf_map() and performs all
> of the necessary work to ensure that when "memsz > filesz" the bytes
> described by "memsz > filesz" are zeroed.
>
> An outstanding issue is if the first segment has filesz 0, and has a
> randomized location. But that is the same as today.
>
> In this change I replaced an open coded padzero() that did not clear
> all of the way to the end of the page, with padzero() that does.
>
> I also stopped checking the return of padzero() as there is at least
> one known case where testing for failure is the wrong thing to do.
> It looks like binfmt_elf_fdpic may have the proper set of tests
> for when error handling can be safely completed.
>
> I found a couple of commits in the old history
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git,
> that look very interesting in understanding this code.
>
> commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail")
> commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c")
> commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2")
>
> Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"):
> > commit 39b56d902bf35241e7cba6cc30b828ed937175ad
> > Author: Pavel Machek <[email protected]>
> > Date: Wed Feb 9 22:40:30 2005 -0800
> >
> > [PATCH] binfmt_elf: clearing bss may fail
> >
> > So we discover that Borland's Kylix application builder emits weird elf
> > files which describe a non-writeable bss segment.
> >
> > So remove the clear_user() check at the place where we zero out the bss. I
> > don't _think_ there are any security implications here (plus we've never
> > checked that clear_user() return value, so whoops if it is a problem).
> >
> > Signed-off-by: Pavel Machek <[email protected]>
> > Signed-off-by: Andrew Morton <[email protected]>
> > Signed-off-by: Linus Torvalds <[email protected]>
>
> It seems pretty clear that binfmt_elf_fdpic with skipping clear_user() for
> non-writable segments and otherwise calling clear_user(), aka padzero(),
> and checking it's return code is the right thing to do.
>
> I just skipped the error checking as that avoids breaking things.
>
> And notably, it looks like Borland's Kylix died in 2005 so it might be
> safe to just consider read-only segments with memsz > filesz an error.
>
> Reported-by: Sebastian Ott <[email protected]>
> Reported-by: Thomas Weißschuh <[email protected]>
> Closes: https://lkml.kernel.org/r/[email protected]
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Kees Cook <[email protected]>
> ---
> fs/binfmt_elf.c | 111 +++++++++++++++++++++---------------------------
> 1 file changed, 48 insertions(+), 63 deletions(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 7b3d2d491407..2a615f476e44 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -110,25 +110,6 @@ static struct linux_binfmt elf_format = {
>
> #define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
>
> -static int set_brk(unsigned long start, unsigned long end, int prot)
> -{
> - start = ELF_PAGEALIGN(start);
> - end = ELF_PAGEALIGN(end);
> - if (end > start) {
> - /*
> - * Map the last of the bss segment.
> - * If the header is requesting these pages to be
> - * executable, honour that (ppc32 needs this).
> - */
> - int error = vm_brk_flags(start, end - start,
> - prot & PROT_EXEC ? VM_EXEC : 0);
> - if (error)
> - return error;
> - }
> - current->mm->start_brk = current->mm->brk = end;
> - return 0;
> -}
> -
> /* We need to explicitly zero any fractional pages
> after the data section (i.e. bss). This would
> contain the junk from the file that should not
> @@ -406,6 +387,51 @@ static unsigned long elf_map(struct file *filep, unsigned long addr,
> return(map_addr);
> }
>
> +static unsigned long elf_load(struct file *filep, unsigned long addr,
> + const struct elf_phdr *eppnt, int prot, int type,
> + unsigned long total_size)
> +{
> + unsigned long zero_start, zero_end;
> + unsigned long map_addr;
> +
> + if (eppnt->p_filesz) {
> + map_addr = elf_map(filep, addr, eppnt, prot, type, total_size);
> + if (BAD_ADDR(map_addr))
> + return map_addr;
> + if (eppnt->p_memsz > eppnt->p_filesz) {
> + zero_start = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
> + eppnt->p_filesz;
> + zero_end = map_addr + ELF_PAGEOFFSET(eppnt->p_vaddr) +
> + eppnt->p_memsz;
> +
> + /* Zero the end of the last mapped page */
> + padzero(zero_start);
> + }
> + } else {
> + map_addr = zero_start = ELF_PAGESTART(addr);
> + zero_end = zero_start + ELF_PAGEOFFSET(eppnt->p_vaddr) +
> + eppnt->p_memsz;
What happens if a previous segment has mapped ELF_PAGESTART(addr)?
Don't we risk mapping over that?
Whereas AFAIK old logic would just padzero the bss bytes.
--
Pedro
On Fri, Sep 29, 2023 at 01:33:50PM +0200, Sebastian Ott wrote:
> Hello Kees,
>
> On Thu, 28 Sep 2023, Kees Cook wrote:
> > This is the continuation of the work Eric started for handling
> > "p_memsz > p_filesz" in arbitrary segments (rather than just the last,
> > BSS, segment). I've added the suggested changes:
> >
> > - drop unused "elf_bss" variable
> > - refactor load_elf_interp() to use elf_load()
> > - refactor load_elf_library() to use elf_load()
> > - report padzero() errors when PROT_WRITE is present
> > - drop vm_brk()
>
> While I was debugging the initial issue I stumbled over the following
> - care to take it as part of this series?
> ----->8
> [PATCH] mm: vm_brk_flags don't bail out while holding lock
>
> Calling vm_brk_flags() with flags set other than VM_EXEC
> will exit the function without releasing the mmap_write_lock.
>
> Just do the sanity check before the lock is acquired. This
> doesn't fix an actual issue since no caller sets a flag other
> than VM_EXEC.
Oh, eek. Yeah, that seems like a good idea. :)
Reviewed-by: Kees Cook <[email protected]>
-Kees
--
Kees Cook
On Fri, Sep 29, 2023 at 12:58:18PM +0100, Pedro Falcato wrote:
> So on that end, you can take my
>
> Tested-by: Pedro Falcato <[email protected]>
Thanks!
--
Kees Cook