At the moment, we have that rather ugly mmget_still_valid() helper to
work around <https://crbug.com/project-zero/1790>: ELF core dumping
doesn't take the mmap_sem while traversing the task's VMAs, and if
anything (like userfaultfd) then remotely messes with the VMA tree,
fireworks ensue. So at the moment we use mmget_still_valid() to bail
out in any writers that might be operating on a remote mm's VMAs.
With this series, I'm trying to get rid of the need for that as
cleanly as possible.
In particular, I want to avoid holding the mmap_sem across unbounded
sleeps.
Patches 1, 2 and 3 are relatively unrelated cleanups in the core
dumping code.
Patches 4 and 5 implement the main change: Instead of repeatedly
accessing the VMA list with sleeps in between, we snapshot it at the
start with proper locking, and then later we just use our copy of
the VMA list. This ensures that the kernel won't crash, that VMA
metadata in the coredump is consistent even in the presence of
concurrent modifications, and that any virtual addresses that aren't
being concurrently modified have their contents show up in the core
dump properly.
The disadvantage of this approach is that we need a bit more memory
during core dumping for storing metadata about all VMAs.
After this series has landed, we should be able to rip out
mmget_still_valid().
Testing done so far:
- Creating a simple core dump on X86-64 still works.
- The created coredump on X86-64 opens in GDB, and both the stack and the
exectutable look vaguely plausible.
- 32-bit ARM compiles with FDPIC support, both with MMU and !MMU config.
I'm CCing some folks from the architectures that use FDPIC in case
anyone wants to give this a spin.
This series is based on
<https://lore.kernel.org/linux-fsdevel/[email protected]/>
(Christoph Hellwig's "remove set_fs calls from the coredump code v4").
changed in v2:
- replace "Fix handling of partial writes in dump_emit()" with
"Let dump_emit() bail out on short writes" (Linus)
- get rid of the useless complicated cache flushing in
"Take mmap_sem in get_dump_page()" (Linus)
Jann Horn (5):
binfmt_elf_fdpic: Stop using dump_emit() on user pointers on !MMU
coredump: Let dump_emit() bail out on short writes
coredump: Refactor page range dumping into common helper
binfmt_elf, binfmt_elf_fdpic: Use a VMA list snapshot
mm/gup: Take mmap_sem in get_dump_page()
fs/binfmt_elf.c | 170 ++++++++++++---------------------------
fs/binfmt_elf_fdpic.c | 106 +++++++++---------------
fs/coredump.c | 123 +++++++++++++++++++++++++---
include/linux/coredump.h | 12 +++
mm/gup.c | 60 +++++++-------
5 files changed, 245 insertions(+), 226 deletions(-)
base-commit: 6a8b55ed4056ea5559ebe4f6a4b247f627870d4c
prerequisite-patch-id: c0a20b414eebc48fe0a8ca570b05de34c7980396
prerequisite-patch-id: 51973b8db0fa4b114e0c3fd8936b634d9d5061c5
prerequisite-patch-id: 0e1e8de282ca6d458dc6cbdc6b6ec5879edd8a05
prerequisite-patch-id: d5ee749c4d3a22ec80bd0dd88aadf89aeb569db8
prerequisite-patch-id: 46ce14e59e98e212a1eca0aef69c6dcdb62b8242
--
2.26.2.526.g744177e7f7-goog
dump_emit() is for kernel pointers, and VMAs describe userspace memory.
Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
even if it probably doesn't matter much on !MMU systems - especially given
that it looks like we can just use the same get_dump_page() as on MMU if
we move it out of the CONFIG_MMU block.
Signed-off-by: Jann Horn <[email protected]>
---
fs/binfmt_elf_fdpic.c | 8 ------
mm/gup.c | 58 +++++++++++++++++++++----------------------
2 files changed, 29 insertions(+), 37 deletions(-)
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index c62c17a5c34a9..f5b47076fa762 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1495,14 +1495,11 @@ static bool elf_fdpic_dump_segments(struct coredump_params *cprm)
struct vm_area_struct *vma;
for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
-#ifdef CONFIG_MMU
unsigned long addr;
-#endif
if (!maydump(vma, cprm->mm_flags))
continue;
-#ifdef CONFIG_MMU
for (addr = vma->vm_start; addr < vma->vm_end;
addr += PAGE_SIZE) {
bool res;
@@ -1518,11 +1515,6 @@ static bool elf_fdpic_dump_segments(struct coredump_params *cprm)
if (!res)
return false;
}
-#else
- if (!dump_emit(cprm, (void *) vma->vm_start,
- vma->vm_end - vma->vm_start))
- return false;
-#endif
}
return true;
}
diff --git a/mm/gup.c b/mm/gup.c
index 50681f0286ded..76080c4dbff05 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1490,35 +1490,6 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
up_read(&mm->mmap_sem);
return ret; /* 0 or negative error code */
}
-
-/**
- * get_dump_page() - pin user page in memory while writing it to core dump
- * @addr: user address
- *
- * Returns struct page pointer of user page pinned for dump,
- * to be freed afterwards by put_page().
- *
- * Returns NULL on any kind of failure - a hole must then be inserted into
- * the corefile, to preserve alignment with its headers; and also returns
- * NULL wherever the ZERO_PAGE, or an anonymous pte_none, has been found -
- * allowing a hole to be left in the corefile to save diskspace.
- *
- * Called without mmap_sem, but after all other threads have been killed.
- */
-#ifdef CONFIG_ELF_CORE
-struct page *get_dump_page(unsigned long addr)
-{
- struct vm_area_struct *vma;
- struct page *page;
-
- if (__get_user_pages(current, current->mm, addr, 1,
- FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
- NULL) < 1)
- return NULL;
- flush_cache_page(vma, addr, page_to_pfn(page));
- return page;
-}
-#endif /* CONFIG_ELF_CORE */
#else /* CONFIG_MMU */
static long __get_user_pages_locked(struct task_struct *tsk,
struct mm_struct *mm, unsigned long start,
@@ -1565,6 +1536,35 @@ static long __get_user_pages_locked(struct task_struct *tsk,
}
#endif /* !CONFIG_MMU */
+/**
+ * get_dump_page() - pin user page in memory while writing it to core dump
+ * @addr: user address
+ *
+ * Returns struct page pointer of user page pinned for dump,
+ * to be freed afterwards by put_page().
+ *
+ * Returns NULL on any kind of failure - a hole must then be inserted into
+ * the corefile, to preserve alignment with its headers; and also returns
+ * NULL wherever the ZERO_PAGE, or an anonymous pte_none, has been found -
+ * allowing a hole to be left in the corefile to save diskspace.
+ *
+ * Called without mmap_sem, but after all other threads have been killed.
+ */
+#ifdef CONFIG_ELF_CORE
+struct page *get_dump_page(unsigned long addr)
+{
+ struct vm_area_struct *vma;
+ struct page *page;
+
+ if (__get_user_pages(current, current->mm, addr, 1,
+ FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
+ NULL) < 1)
+ return NULL;
+ flush_cache_page(vma, addr, page_to_pfn(page));
+ return page;
+}
+#endif /* CONFIG_ELF_CORE */
+
#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
{
--
2.26.2.526.g744177e7f7-goog
dump_emit() has a retry loop, but there seems to be no way for that retry
logic to actually be used; and it was also buggy, writing the same data
repeatedly after a short write.
Let's just bail out on a short write.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
---
fs/coredump.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/fs/coredump.c b/fs/coredump.c
index 408418e6aa131..d6fcc36a7db1f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -823,17 +823,17 @@ int dump_emit(struct coredump_params *cprm, const void *addr, int nr)
ssize_t n;
if (cprm->written + nr > cprm->limit)
return 0;
- while (nr) {
- if (dump_interrupted())
- return 0;
- n = __kernel_write(file, addr, nr, &pos);
- if (n <= 0)
- return 0;
- file->f_pos = pos;
- cprm->written += n;
- cprm->pos += n;
- nr -= n;
- }
+
+
+ if (dump_interrupted())
+ return 0;
+ n = __kernel_write(file, addr, nr, &pos);
+ if (n != nr)
+ return 0;
+ file->f_pos = pos;
+ cprm->written += n;
+ cprm->pos += n;
+
return 1;
}
EXPORT_SYMBOL(dump_emit);
--
2.26.2.526.g744177e7f7-goog
Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of pages
into the coredump file. Extract that logic into a common helper.
Any other binfmt that actually wants to create coredumps will probably need
the same function; so stop making get_dump_page() depend on
CONFIG_ELF_CORE.
Signed-off-by: Jann Horn <[email protected]>
---
fs/binfmt_elf.c | 22 ++--------------------
fs/binfmt_elf_fdpic.c | 18 +++---------------
fs/coredump.c | 33 +++++++++++++++++++++++++++++++++
include/linux/coredump.h | 2 ++
mm/gup.c | 2 --
5 files changed, 40 insertions(+), 37 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index b29b84595b09f..fb36469848323 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2323,26 +2323,8 @@ static int elf_core_dump(struct coredump_params *cprm)
for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
vma = next_vma(vma, gate_vma)) {
- unsigned long addr;
- unsigned long end;
-
- end = vma->vm_start + vma_filesz[i++];
-
- for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) {
- struct page *page;
- int stop;
-
- page = get_dump_page(addr);
- if (page) {
- void *kaddr = kmap(page);
- stop = !dump_emit(cprm, kaddr, PAGE_SIZE);
- kunmap(page);
- put_page(page);
- } else
- stop = !dump_skip(cprm, PAGE_SIZE);
- if (stop)
- goto cleanup;
- }
+ if (!dump_user_range(cprm, vma->vm_start, vma_filesz[i++]))
+ goto cleanup;
}
dump_truncate(cprm);
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index f5b47076fa762..938f66f4de9b2 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1500,21 +1500,9 @@ static bool elf_fdpic_dump_segments(struct coredump_params *cprm)
if (!maydump(vma, cprm->mm_flags))
continue;
- for (addr = vma->vm_start; addr < vma->vm_end;
- addr += PAGE_SIZE) {
- bool res;
- struct page *page = get_dump_page(addr);
- if (page) {
- void *kaddr = kmap(page);
- res = dump_emit(cprm, kaddr, PAGE_SIZE);
- kunmap(page);
- put_page(page);
- } else {
- res = dump_skip(cprm, PAGE_SIZE);
- }
- if (!res)
- return false;
- }
+ if (!dump_user_range(cprm, vma->vm_start,
+ vma->vma_end - vma->vm_start))
+ return false;
}
return true;
}
diff --git a/fs/coredump.c b/fs/coredump.c
index d6fcc36a7db1f..88f625eecaac1 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -859,6 +859,39 @@ int dump_skip(struct coredump_params *cprm, size_t nr)
}
EXPORT_SYMBOL(dump_skip);
+#ifdef CONFIG_ELF_CORE
+int dump_user_range(struct coredump_params *cprm, unsigned long start,
+ unsigned long len)
+{
+ unsigned long addr;
+
+ for (addr = start; addr < start + len; addr += PAGE_SIZE) {
+ struct page *page;
+ int stop;
+
+ /*
+ * To avoid having to allocate page tables for virtual address
+ * ranges that have never been used yet, use a helper that
+ * returns NULL when encountering an empty page table entry that
+ * would otherwise have been filled with the zero page.
+ */
+ page = get_dump_page(addr);
+ if (page) {
+ void *kaddr = kmap(page);
+
+ stop = !dump_emit(cprm, kaddr, PAGE_SIZE);
+ kunmap(page);
+ put_page(page);
+ } else {
+ stop = !dump_skip(cprm, PAGE_SIZE);
+ }
+ if (stop)
+ return 0;
+ }
+ return 1;
+}
+#endif
+
int dump_align(struct coredump_params *cprm, int align)
{
unsigned mod = cprm->pos & (align - 1);
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index abf4b4e65dbb9..4289dc21c04ff 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -16,6 +16,8 @@ extern int dump_skip(struct coredump_params *cprm, size_t nr);
extern int dump_emit(struct coredump_params *cprm, const void *addr, int nr);
extern int dump_align(struct coredump_params *cprm, int align);
extern void dump_truncate(struct coredump_params *cprm);
+int dump_user_range(struct coredump_params *cprm, unsigned long start,
+ unsigned long len);
#ifdef CONFIG_COREDUMP
extern void do_coredump(const kernel_siginfo_t *siginfo);
#else
diff --git a/mm/gup.c b/mm/gup.c
index 76080c4dbff05..9a7e83772f1fe 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1550,7 +1550,6 @@ static long __get_user_pages_locked(struct task_struct *tsk,
*
* Called without mmap_sem, but after all other threads have been killed.
*/
-#ifdef CONFIG_ELF_CORE
struct page *get_dump_page(unsigned long addr)
{
struct vm_area_struct *vma;
@@ -1563,7 +1562,6 @@ struct page *get_dump_page(unsigned long addr)
flush_cache_page(vma, addr, page_to_pfn(page));
return page;
}
-#endif /* CONFIG_ELF_CORE */
#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
--
2.26.2.526.g744177e7f7-goog
In both binfmt_elf and binfmt_elf_fdpic, use a new helper
dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
VMA, if we have one) while protected by the mmap_sem, and then use that
snapshot instead of walking the VMA list without locking.
An alternative approach would be to keep the mmap_sem held across the
entire core dumping operation; however, keeping the mmap_sem locked while
we may be blocked for an unbounded amount of time (e.g. because we're
dumping to a FUSE filesystem or so) isn't really optimal; the mmap_sem
blocks things like the ->release handler of userfaultfd, and we don't
really want critical system daemons to grind to a halt just because someone
"gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or something
like that.
Since both the normal ELF code and the FDPIC ELF code need this
functionality (and if any other binfmt wants to add coredump support in the
future, they'd probably need it, too), implement this with a common helper
in fs/coredump.c.
A downside of this approach is that we now need a bigger amount of kernel
memory per userspace VMA in the normal ELF case, and that we need O(n)
kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
be terribly bad.
Signed-off-by: Jann Horn <[email protected]>
---
fs/binfmt_elf.c | 152 +++++++++++++--------------------------
fs/binfmt_elf_fdpic.c | 86 ++++++++++------------
fs/coredump.c | 68 ++++++++++++++++++
include/linux/coredump.h | 10 +++
4 files changed, 168 insertions(+), 148 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index fb36469848323..dffe9dc8497ca 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1292,8 +1292,12 @@ static bool always_dump_vma(struct vm_area_struct *vma)
return false;
}
+#define DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER 1
+
/*
* Decide what to dump of a segment, part, all or none.
+ * The result must be fixed up via vma_dump_size_fixup() once we're in a context
+ * that's allowed to sleep arbitrarily long.
*/
static unsigned long vma_dump_size(struct vm_area_struct *vma,
unsigned long mm_flags)
@@ -1348,30 +1352,15 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
/*
* If this looks like the beginning of a DSO or executable mapping,
- * check for an ELF header. If we find one, dump the first page to
- * aid in determining what was mapped here.
+ * we'll check for an ELF header. If we find one, we'll dump the first
+ * page to aid in determining what was mapped here.
+ * However, we shouldn't sleep on userspace reads while holding the
+ * mmap_sem, so we just return a placeholder for now that will be fixed
+ * up later in vma_dump_size_fixup().
*/
if (FILTER(ELF_HEADERS) &&
- vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ)) {
- u32 __user *header = (u32 __user *) vma->vm_start;
- u32 word;
- /*
- * Doing it this way gets the constant folded by GCC.
- */
- union {
- u32 cmp;
- char elfmag[SELFMAG];
- } magic;
- BUILD_BUG_ON(SELFMAG != sizeof word);
- magic.elfmag[EI_MAG0] = ELFMAG0;
- magic.elfmag[EI_MAG1] = ELFMAG1;
- magic.elfmag[EI_MAG2] = ELFMAG2;
- magic.elfmag[EI_MAG3] = ELFMAG3;
- if (unlikely(get_user(word, header)))
- word = 0;
- if (word == magic.cmp)
- return PAGE_SIZE;
- }
+ vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ))
+ return DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER;
#undef FILTER
@@ -1381,6 +1370,22 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
return vma->vm_end - vma->vm_start;
}
+/* Fix up the result from vma_dump_size(), now that we're allowed to sleep. */
+static void vma_dump_size_fixup(struct core_vma_metadata *meta)
+{
+ char elfmag[SELFMAG];
+
+ if (meta->dump_size != DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER)
+ return;
+
+ if (copy_from_user(elfmag, (void __user *)meta->start, SELFMAG)) {
+ meta->dump_size = 0;
+ return;
+ }
+ meta->dump_size =
+ (memcmp(elfmag, ELFMAG, SELFMAG) == 0) ? PAGE_SIZE : 0;
+}
+
/* An ELF note in memory */
struct memelfnote
{
@@ -2124,32 +2129,6 @@ static void free_note_info(struct elf_note_info *info)
#endif
-static struct vm_area_struct *first_vma(struct task_struct *tsk,
- struct vm_area_struct *gate_vma)
-{
- struct vm_area_struct *ret = tsk->mm->mmap;
-
- if (ret)
- return ret;
- return gate_vma;
-}
-/*
- * Helper function for iterating across a vma list. It ensures that the caller
- * will visit `gate_vma' prior to terminating the search.
- */
-static struct vm_area_struct *next_vma(struct vm_area_struct *this_vma,
- struct vm_area_struct *gate_vma)
-{
- struct vm_area_struct *ret;
-
- ret = this_vma->vm_next;
- if (ret)
- return ret;
- if (this_vma == gate_vma)
- return NULL;
- return gate_vma;
-}
-
static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
elf_addr_t e_shoff, int segs)
{
@@ -2176,9 +2155,8 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
static int elf_core_dump(struct coredump_params *cprm)
{
int has_dumped = 0;
- int segs, i;
+ int vma_count, segs, i;
size_t vma_data_size = 0;
- struct vm_area_struct *vma, *gate_vma;
struct elfhdr elf;
loff_t offset = 0, dataoff;
struct elf_note_info info = { };
@@ -2186,30 +2164,21 @@ static int elf_core_dump(struct coredump_params *cprm)
struct elf_shdr *shdr4extnum = NULL;
Elf_Half e_phnum;
elf_addr_t e_shoff;
- elf_addr_t *vma_filesz = NULL;
+ struct core_vma_metadata *vma_meta;
+
+ if (dump_vma_snapshot(cprm, &vma_count, &vma_meta, vma_dump_size))
+ return 0;
+
+ for (i = 0; i < vma_count; i++) {
+ vma_dump_size_fixup(vma_meta + i);
+ vma_data_size += vma_meta[i].dump_size;
+ }
- /*
- * We no longer stop all VM operations.
- *
- * This is because those proceses that could possibly change map_count
- * or the mmap / vma pages are now blocked in do_exit on current
- * finishing this core dump.
- *
- * Only ptrace can touch these memory addresses, but it doesn't change
- * the map_count or the pages allocated. So no possibility of crashing
- * exists while dumping the mm->vm_next areas to the core file.
- */
-
/*
* The number of segs are recored into ELF header as 16bit value.
* Please check DEFAULT_MAX_MAP_COUNT definition when you modify here.
*/
- segs = current->mm->map_count;
- segs += elf_core_extra_phdrs();
-
- gate_vma = get_gate_vma(current->mm);
- if (gate_vma != NULL)
- segs++;
+ segs = vma_count + elf_core_extra_phdrs();
/* for notes section */
segs++;
@@ -2247,24 +2216,6 @@ static int elf_core_dump(struct coredump_params *cprm)
dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
- /*
- * Zero vma process will get ZERO_SIZE_PTR here.
- * Let coredump continue for register state at least.
- */
- vma_filesz = kvmalloc(array_size(sizeof(*vma_filesz), (segs - 1)),
- GFP_KERNEL);
- if (!vma_filesz)
- goto cleanup;
-
- for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
- vma = next_vma(vma, gate_vma)) {
- unsigned long dump_size;
-
- dump_size = vma_dump_size(vma, cprm->mm_flags);
- vma_filesz[i++] = dump_size;
- vma_data_size += dump_size;
- }
-
offset += vma_data_size;
offset += elf_core_extra_data_size();
e_shoff = offset;
@@ -2285,22 +2236,20 @@ static int elf_core_dump(struct coredump_params *cprm)
goto cleanup;
/* Write program headers for segments dump */
- for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
- vma = next_vma(vma, gate_vma)) {
+ for (i = 0; i < vma_count; i++) {
+ struct core_vma_metadata *meta = vma_meta + i;
struct elf_phdr phdr;
phdr.p_type = PT_LOAD;
phdr.p_offset = offset;
- phdr.p_vaddr = vma->vm_start;
+ phdr.p_vaddr = meta->start;
phdr.p_paddr = 0;
- phdr.p_filesz = vma_filesz[i++];
- phdr.p_memsz = vma->vm_end - vma->vm_start;
+ phdr.p_filesz = meta->dump_size;
+ phdr.p_memsz = meta->end - meta->start;
offset += phdr.p_filesz;
- phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
- if (vma->vm_flags & VM_WRITE)
- phdr.p_flags |= PF_W;
- if (vma->vm_flags & VM_EXEC)
- phdr.p_flags |= PF_X;
+ phdr.p_flags = meta->flags & VM_READ ? PF_R : 0;
+ phdr.p_flags |= meta->flags & VM_WRITE ? PF_W : 0;
+ phdr.p_flags |= meta->flags & VM_EXEC ? PF_X : 0;
phdr.p_align = ELF_EXEC_PAGESIZE;
if (!dump_emit(cprm, &phdr, sizeof(phdr)))
@@ -2321,9 +2270,10 @@ static int elf_core_dump(struct coredump_params *cprm)
if (!dump_skip(cprm, dataoff - cprm->pos))
goto cleanup;
- for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
- vma = next_vma(vma, gate_vma)) {
- if (!dump_user_range(cprm, vma->vm_start, vma_filesz[i++]))
+ for (i = 0; i < vma_count; i++) {
+ struct core_vma_metadata *meta = vma_meta + i;
+
+ if (!dump_user_range(cprm, meta->start, meta->dump_size))
goto cleanup;
}
dump_truncate(cprm);
@@ -2339,7 +2289,7 @@ static int elf_core_dump(struct coredump_params *cprm)
cleanup:
free_note_info(&info);
kfree(shdr4extnum);
- kvfree(vma_filesz);
+ kvfree(vma_meta);
kfree(phdr4note);
return has_dumped;
}
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 938f66f4de9b2..bde51f40085b9 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1190,7 +1190,8 @@ static int elf_fdpic_map_file_by_direct_mmap(struct elf_fdpic_params *params,
*
* I think we should skip something. But I am not sure how. H.J.
*/
-static int maydump(struct vm_area_struct *vma, unsigned long mm_flags)
+static unsigned long vma_dump_size(struct vm_area_struct *vma,
+ unsigned long mm_flags)
{
int dump_ok;
@@ -1219,7 +1220,7 @@ static int maydump(struct vm_area_struct *vma, unsigned long mm_flags)
kdcore("%08lx: %08lx: %s (DAX private)", vma->vm_start,
vma->vm_flags, dump_ok ? "yes" : "no");
}
- return dump_ok;
+ goto out;
}
/* By default, dump shared memory if mapped from an anonymous file. */
@@ -1228,13 +1229,13 @@ static int maydump(struct vm_area_struct *vma, unsigned long mm_flags)
dump_ok = test_bit(MMF_DUMP_ANON_SHARED, &mm_flags);
kdcore("%08lx: %08lx: %s (share)", vma->vm_start,
vma->vm_flags, dump_ok ? "yes" : "no");
- return dump_ok;
+ goto out;
}
dump_ok = test_bit(MMF_DUMP_MAPPED_SHARED, &mm_flags);
kdcore("%08lx: %08lx: %s (share)", vma->vm_start,
vma->vm_flags, dump_ok ? "yes" : "no");
- return dump_ok;
+ goto out;
}
#ifdef CONFIG_MMU
@@ -1243,14 +1244,16 @@ static int maydump(struct vm_area_struct *vma, unsigned long mm_flags)
dump_ok = test_bit(MMF_DUMP_MAPPED_PRIVATE, &mm_flags);
kdcore("%08lx: %08lx: %s (!anon)", vma->vm_start,
vma->vm_flags, dump_ok ? "yes" : "no");
- return dump_ok;
+ goto out;
}
#endif
dump_ok = test_bit(MMF_DUMP_ANON_PRIVATE, &mm_flags);
kdcore("%08lx: %08lx: %s", vma->vm_start, vma->vm_flags,
dump_ok ? "yes" : "no");
- return dump_ok;
+
+out:
+ return dump_ok ? vma->vm_end - vma->vm_start : 0;
}
/* An ELF note in memory */
@@ -1490,31 +1493,30 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
/*
* dump the segments for an MMU process
*/
-static bool elf_fdpic_dump_segments(struct coredump_params *cprm)
+static bool elf_fdpic_dump_segments(struct coredump_params *cprm,
+ struct core_vma_metadata *vma_meta,
+ int vma_count)
{
- struct vm_area_struct *vma;
+ int i;
- for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
- unsigned long addr;
+ for (i = 0; i < vma_count; i++) {
+ struct core_vma_metadata *meta = vma_meta + i;
- if (!maydump(vma, cprm->mm_flags))
- continue;
-
- if (!dump_user_range(cprm, vma->vm_start,
- vma->vma_end - vma->vm_start))
+ if (!dump_user_range(cprm, meta->start, meta->dump_size))
return false;
}
return true;
}
-static size_t elf_core_vma_data_size(unsigned long mm_flags)
+static size_t elf_core_vma_data_size(unsigned long mm_flags,
+ struct core_vma_metadata *vma_meta,
+ int vma_count)
{
- struct vm_area_struct *vma;
size_t size = 0;
+ int i;
- for (vma = current->mm->mmap; vma; vma = vma->vm_next)
- if (maydump(vma, mm_flags))
- size += vma->vm_end - vma->vm_start;
+ for (i = 0; i < vma_count; i++)
+ size += vma_meta[i].dump_size;
return size;
}
@@ -1529,9 +1531,8 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
{
#define NUM_NOTES 6
int has_dumped = 0;
- int segs;
+ int vma_count, segs;
int i;
- struct vm_area_struct *vma;
struct elfhdr *elf = NULL;
loff_t offset = 0, dataoff;
int numnote;
@@ -1552,18 +1553,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
elf_addr_t e_shoff;
struct core_thread *ct;
struct elf_thread_status *tmp;
-
- /*
- * We no longer stop all VM operations.
- *
- * This is because those proceses that could possibly change map_count
- * or the mmap / vma pages are now blocked in do_exit on current
- * finishing this core dump.
- *
- * Only ptrace can touch these memory addresses, but it doesn't change
- * the map_count or the pages allocated. So no possibility of crashing
- * exists while dumping the mm->vm_next areas to the core file.
- */
+ struct core_vma_metadata *vma_meta = NULL;
/* alloc memory for large data structures: too large to be on stack */
elf = kmalloc(sizeof(*elf), GFP_KERNEL);
@@ -1588,6 +1578,9 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
goto cleanup;
#endif
+ if (dump_vma_snapshot(cprm, &vma_count, &vma_meta, vma_dump_size))
+ goto cleanup;
+
for (ct = current->mm->core_state->dumper.next;
ct; ct = ct->next) {
tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
@@ -1611,8 +1604,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
fill_prstatus(prstatus, current, cprm->siginfo->si_signo);
elf_core_copy_regs(&prstatus->pr_reg, cprm->regs);
- segs = current->mm->map_count;
- segs += elf_core_extra_phdrs();
+ segs = vma_count + elf_core_extra_phdrs();
/* for notes section */
segs++;
@@ -1680,7 +1672,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
/* Page-align dumped data */
dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
- offset += elf_core_vma_data_size(cprm->mm_flags);
+ offset += elf_core_vma_data_size(cprm->mm_flags, vma_meta, vma_count);
offset += elf_core_extra_data_size();
e_shoff = offset;
@@ -1700,24 +1692,23 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
goto cleanup;
/* write program headers for segments dump */
- for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
+ for (i = 0; i < vma_count; i++) {
+ struct core_vma_metadata *meta = vma_meta + i;
struct elf_phdr phdr;
size_t sz;
- sz = vma->vm_end - vma->vm_start;
+ sz = meta->end - meta->start;
phdr.p_type = PT_LOAD;
phdr.p_offset = offset;
- phdr.p_vaddr = vma->vm_start;
+ phdr.p_vaddr = meta->start;
phdr.p_paddr = 0;
- phdr.p_filesz = maydump(vma, cprm->mm_flags) ? sz : 0;
+ phdr.p_filesz = meta->dump_size;
phdr.p_memsz = sz;
offset += phdr.p_filesz;
- phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
- if (vma->vm_flags & VM_WRITE)
- phdr.p_flags |= PF_W;
- if (vma->vm_flags & VM_EXEC)
- phdr.p_flags |= PF_X;
+ phdr.p_flags = meta->flags & VM_READ ? PF_R : 0;
+ phdr.p_flags |= meta->flags & VM_WRITE ? PF_W : 0;
+ phdr.p_flags |= meta->flags & VM_EXEC ? PF_X : 0;
phdr.p_align = ELF_EXEC_PAGESIZE;
if (!dump_emit(cprm, &phdr, sizeof(phdr)))
@@ -1745,7 +1736,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
if (!dump_skip(cprm, dataoff - cprm->pos))
goto cleanup;
- if (!elf_fdpic_dump_segments(cprm))
+ if (!elf_fdpic_dump_segments(cprm, vma_meta, vma_count))
goto cleanup;
if (!elf_core_write_extra_data(cprm))
@@ -1769,6 +1760,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
list_del(tmp);
kfree(list_entry(tmp, struct elf_thread_status, list));
}
+ kvfree(vma_meta);
kfree(phdr4note);
kfree(elf);
kfree(prstatus);
diff --git a/fs/coredump.c b/fs/coredump.c
index 88f625eecaac1..4213eab89190f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -918,3 +918,71 @@ void dump_truncate(struct coredump_params *cprm)
}
}
EXPORT_SYMBOL(dump_truncate);
+
+static struct vm_area_struct *first_vma(struct task_struct *tsk,
+ struct vm_area_struct *gate_vma)
+{
+ struct vm_area_struct *ret = tsk->mm->mmap;
+
+ if (ret)
+ return ret;
+ return gate_vma;
+}
+/*
+ * Helper function for iterating across a vma list. It ensures that the caller
+ * will visit `gate_vma' prior to terminating the search.
+ */
+static struct vm_area_struct *next_vma(struct vm_area_struct *this_vma,
+ struct vm_area_struct *gate_vma)
+{
+ struct vm_area_struct *ret;
+
+ ret = this_vma->vm_next;
+ if (ret)
+ return ret;
+ if (this_vma == gate_vma)
+ return NULL;
+ return gate_vma;
+}
+
+/*
+ * Under the mmap_sem, take a snapshot of relevant information about the task's
+ * VMAs.
+ */
+int dump_vma_snapshot(struct coredump_params *cprm, int *vma_count,
+ struct core_vma_metadata **vma_meta,
+ unsigned long (*dump_size_cb)(struct vm_area_struct *, unsigned long))
+{
+ struct vm_area_struct *vma, *gate_vma;
+ struct mm_struct *mm = current->mm;
+ int i;
+
+ if (down_read_killable(&mm->mmap_sem))
+ return -EINTR;
+
+ gate_vma = get_gate_vma(mm);
+ *vma_count = mm->map_count + (gate_vma ? 1 : 0);
+
+ *vma_meta = kvmalloc_array(*vma_count, sizeof(**vma_meta), GFP_KERNEL);
+ if (!*vma_meta) {
+ up_read(&mm->mmap_sem);
+ return -ENOMEM;
+ }
+
+ for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
+ vma = next_vma(vma, gate_vma)) {
+ (*vma_meta)[i++] = (struct core_vma_metadata) {
+ .start = vma->vm_start,
+ .end = vma->vm_end,
+ .flags = vma->vm_flags,
+ .dump_size = dump_size_cb(vma, cprm->mm_flags)
+ };
+ }
+
+ up_read(&mm->mmap_sem);
+
+ if (WARN_ON(i != *vma_count))
+ return -EFAULT;
+
+ return 0;
+}
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 4289dc21c04ff..d3387866dce7b 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -7,6 +7,13 @@
#include <linux/fs.h>
#include <asm/siginfo.h>
+struct core_vma_metadata {
+ unsigned long start, end;
+ unsigned long filesize;
+ unsigned long flags;
+ unsigned long dump_size;
+};
+
/*
* These are the only things you should do on a core-file: use only these
* functions to write out all the necessary info.
@@ -18,6 +25,9 @@ extern int dump_align(struct coredump_params *cprm, int align);
extern void dump_truncate(struct coredump_params *cprm);
int dump_user_range(struct coredump_params *cprm, unsigned long start,
unsigned long len);
+int dump_vma_snapshot(struct coredump_params *cprm, int *vma_count,
+ struct core_vma_metadata **vma_meta,
+ unsigned long (*dump_size_cb)(struct vm_area_struct *, unsigned long));
#ifdef CONFIG_COREDUMP
extern void do_coredump(const kernel_siginfo_t *siginfo);
#else
--
2.26.2.526.g744177e7f7-goog
Properly take the mmap_sem before calling into the GUP code from
get_dump_page(); and play nice, allowing the GUP code to drop the mmap_sem
if it has to sleep.
As Linus pointed out, we don't actually need the VMA because
__get_user_pages() will flush the dcache for us if necessary.
Signed-off-by: Jann Horn <[email protected]>
---
mm/gup.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 9a7e83772f1fe..03f659ddd830a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1548,19 +1548,23 @@ static long __get_user_pages_locked(struct task_struct *tsk,
* NULL wherever the ZERO_PAGE, or an anonymous pte_none, has been found -
* allowing a hole to be left in the corefile to save diskspace.
*
- * Called without mmap_sem, but after all other threads have been killed.
+ * Called without mmap_sem (takes and releases the mmap_sem by itself).
*/
struct page *get_dump_page(unsigned long addr)
{
- struct vm_area_struct *vma;
+ struct mm_struct *mm = current->mm;
struct page *page;
+ int locked = 1;
+ int ret;
- if (__get_user_pages(current, current->mm, addr, 1,
- FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
- NULL) < 1)
+ if (down_read_killable(&mm->mmap_sem))
return NULL;
- flush_cache_page(vma, addr, page_to_pfn(page));
- return page;
+ ret = __get_user_pages_locked(current, mm, addr, 1, &page, NULL,
+ &locked,
+ FOLL_FORCE | FOLL_DUMP | FOLL_GET);
+ if (locked)
+ up_read(&mm->mmap_sem);
+ return (ret == 1) ? page : NULL;
}
#if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
--
2.26.2.526.g744177e7f7-goog
On Wed, Apr 29, 2020 at 11:49:49PM +0200, Jann Horn wrote:
> At the moment, we have that rather ugly mmget_still_valid() helper to
> work around <https://crbug.com/project-zero/1790>: ELF core dumping
> doesn't take the mmap_sem while traversing the task's VMAs, and if
> anything (like userfaultfd) then remotely messes with the VMA tree,
> fireworks ensue. So at the moment we use mmget_still_valid() to bail
> out in any writers that might be operating on a remote mm's VMAs.
>
> With this series, I'm trying to get rid of the need for that as
> cleanly as possible.
> In particular, I want to avoid holding the mmap_sem across unbounded
> sleeps.
>
>
> Patches 1, 2 and 3 are relatively unrelated cleanups in the core
> dumping code.
>
> Patches 4 and 5 implement the main change: Instead of repeatedly
> accessing the VMA list with sleeps in between, we snapshot it at the
> start with proper locking, and then later we just use our copy of
> the VMA list. This ensures that the kernel won't crash, that VMA
> metadata in the coredump is consistent even in the presence of
> concurrent modifications, and that any virtual addresses that aren't
> being concurrently modified have their contents show up in the core
> dump properly.
>
> The disadvantage of this approach is that we need a bit more memory
> during core dumping for storing metadata about all VMAs.
>
> After this series has landed, we should be able to rip out
> mmget_still_valid().
>
>
> Testing done so far:
>
> - Creating a simple core dump on X86-64 still works.
> - The created coredump on X86-64 opens in GDB, and both the stack and the
> exectutable look vaguely plausible.
> - 32-bit ARM compiles with FDPIC support, both with MMU and !MMU config.
>
> I'm CCing some folks from the architectures that use FDPIC in case
> anyone wants to give this a spin.
I've never had any reason to use FDPIC, and I don't have any binaries
that would use it. Nicolas Pitre added ARM support, so I guess he
would be the one to talk to about it. (Added Nicolas.)
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 10.2Mbps down 587kbps up
On Wed, Apr 29, 2020 at 2:57 PM Russell King - ARM Linux admin
<[email protected]> wrote:
>
> I've never had any reason to use FDPIC, and I don't have any binaries
> that would use it. Nicolas Pitre added ARM support, so I guess he
> would be the one to talk to about it. (Added Nicolas.)
While we're at it, is there anybody who knows binfmt_flat?
It might be Nicolas too.
binfmt_flat doesn't do core-dumping, but it has some other oddities.
In particular, I'd like to bring sanity to the installation of the new
creds, and all the _normal_ binfmt cases do it largely close together
with setup_new_exec().
binfmt_flat is doing odd things. It's doing this:
/* Flush all traces of the currently running executable */
if (id == 0) {
ret = flush_old_exec(bprm);
if (ret)
goto err;
/* OK, This is the point of no return */
set_personality(PER_LINUX_32BIT);
setup_new_exec(bprm);
}
in load_flat_file() - which is also used to loading _libraries_. Where
it makes no sense at all.
It does the
install_exec_creds(bprm);
in load_flat_binary() (which makes more sense: that is only for actual
binary loading, no library case).
I would _like_ for every binfmt loader to do
/* Flush all traces of the currently running executable */
retval = flush_old_exec(bprm);
if (retval)
return retval;
.. possibly set up personalities here ..
setup_new_exec(bprm);
install_exec_creds(bprm);
all together, and at least merge 'setup_new_exec()' with 'install_exec_creds()'.
And I think all the binfmt handlers would be ok with that, but the
flat one in particular is really oddly set up.
*Particularly* with that flush_old_exec/setup_new_exec() being done by
the same routine that is also loading libraries (and called from
'calc_reloc()' from binary loading too).
Adding Greg Ungerer for m68knommu. Can somebody sort out why that
flush_old_exec/setup_new_exec() isn't in load_flat_binary() like
install_exec_creds() is?
Most of that file goes back to pre-git days. And most of the commits
since are not so much about binfmt_flat, as they are about cleanups or
changes elsewhere where binfmt_flat was just a victim.
Linus
On Wed, 29 Apr 2020, Linus Torvalds wrote:
> While we're at it, is there anybody who knows binfmt_flat?
I'd say Greg Ungerer.
> It might be Nicolas too.
I only contributed the necessary changes to make it work on targets with
a MMU. Once fdpic started to worked I used that instead.
FWIW I couldn't find a toolchain that would produce FLAT binaries with
dynamic libs on ARM so I only used static binaries.
Nicolas
On Wed, 29 Apr 2020, Russell King - ARM Linux admin wrote:
> On Wed, Apr 29, 2020 at 11:49:49PM +0200, Jann Horn wrote:
> > At the moment, we have that rather ugly mmget_still_valid() helper to
> > work around <https://crbug.com/project-zero/1790>: ELF core dumping
> > doesn't take the mmap_sem while traversing the task's VMAs, and if
> > anything (like userfaultfd) then remotely messes with the VMA tree,
> > fireworks ensue. So at the moment we use mmget_still_valid() to bail
> > out in any writers that might be operating on a remote mm's VMAs.
> >
> > With this series, I'm trying to get rid of the need for that as
> > cleanly as possible.
> > In particular, I want to avoid holding the mmap_sem across unbounded
> > sleeps.
> >
> >
> > Patches 1, 2 and 3 are relatively unrelated cleanups in the core
> > dumping code.
> >
> > Patches 4 and 5 implement the main change: Instead of repeatedly
> > accessing the VMA list with sleeps in between, we snapshot it at the
> > start with proper locking, and then later we just use our copy of
> > the VMA list. This ensures that the kernel won't crash, that VMA
> > metadata in the coredump is consistent even in the presence of
> > concurrent modifications, and that any virtual addresses that aren't
> > being concurrently modified have their contents show up in the core
> > dump properly.
> >
> > The disadvantage of this approach is that we need a bit more memory
> > during core dumping for storing metadata about all VMAs.
> >
> > After this series has landed, we should be able to rip out
> > mmget_still_valid().
> >
> >
> > Testing done so far:
> >
> > - Creating a simple core dump on X86-64 still works.
> > - The created coredump on X86-64 opens in GDB, and both the stack and the
> > exectutable look vaguely plausible.
> > - 32-bit ARM compiles with FDPIC support, both with MMU and !MMU config.
> >
> > I'm CCing some folks from the architectures that use FDPIC in case
> > anyone wants to give this a spin.
>
> I've never had any reason to use FDPIC, and I don't have any binaries
> that would use it. Nicolas Pitre added ARM support, so I guess he
> would be the one to talk to about it. (Added Nicolas.)
It's been a while since I worked with it. However Christophe Lyon (in
CC) added support for ARM FDPIC to mainline gcc recently, so hopefully
he might still be set up and able to help.
Nicolas
On 30/4/20 9:03 am, Linus Torvalds wrote:
> On Wed, Apr 29, 2020 at 2:57 PM Russell King - ARM Linux admin
> <[email protected]> wrote:
>>
>> I've never had any reason to use FDPIC, and I don't have any binaries
>> that would use it. Nicolas Pitre added ARM support, so I guess he
>> would be the one to talk to about it. (Added Nicolas.)
>
> While we're at it, is there anybody who knows binfmt_flat?
>
> It might be Nicolas too.
>
> binfmt_flat doesn't do core-dumping, but it has some other oddities.
> In particular, I'd like to bring sanity to the installation of the new
> creds, and all the _normal_ binfmt cases do it largely close together
> with setup_new_exec().
>
> binfmt_flat is doing odd things. It's doing this:
>
> /* Flush all traces of the currently running executable */
> if (id == 0) {
> ret = flush_old_exec(bprm);
> if (ret)
> goto err;
>
> /* OK, This is the point of no return */
> set_personality(PER_LINUX_32BIT);
> setup_new_exec(bprm);
> }
>
> in load_flat_file() - which is also used to loading _libraries_. Where
> it makes no sense at all.
I haven't looked at the shared lib support in there for a long time,
but I thought that "id" is only 0 for the actual final program.
Libraries have a slot or id number associated with them.
> It does the
>
> install_exec_creds(bprm);
>
> in load_flat_binary() (which makes more sense: that is only for actual
> binary loading, no library case).
>
> I would _like_ for every binfmt loader to do
>
> /* Flush all traces of the currently running executable */
> retval = flush_old_exec(bprm);
> if (retval)
> return retval;
>
> .. possibly set up personalities here ..
>
> setup_new_exec(bprm);
> install_exec_creds(bprm);
>
> all together, and at least merge 'setup_new_exec()' with 'install_exec_creds()'.
>
> And I think all the binfmt handlers would be ok with that, but the
> flat one in particular is really oddly set up.
>
> *Particularly* with that flush_old_exec/setup_new_exec() being done by
> the same routine that is also loading libraries (and called from
> 'calc_reloc()' from binary loading too).
>
> Adding Greg Ungerer for m68knommu. Can somebody sort out why that
> flush_old_exec/setup_new_exec() isn't in load_flat_binary() like
> install_exec_creds() is?
>
> Most of that file goes back to pre-git days. And most of the commits
> since are not so much about binfmt_flat, as they are about cleanups or
> changes elsewhere where binfmt_flat was just a victim.
I'll have a look at this.
Quick hack test shows moving setup_new_exec(bprm) to be just before
install_exec_creds(bprm) works fine for the static binaries case.
Doing the flush_old_exec(bprm) there too crashed out - I'll need to
dig into that to see why.
Regards
Greg
On Fri, May 01, 2020 at 12:10:05AM +1000, Greg Ungerer wrote:
>
>
> On 30/4/20 9:03 am, Linus Torvalds wrote:
> >On Wed, Apr 29, 2020 at 2:57 PM Russell King - ARM Linux admin
> ><[email protected]> wrote:
> >>
> >>I've never had any reason to use FDPIC, and I don't have any binaries
> >>that would use it. Nicolas Pitre added ARM support, so I guess he
> >>would be the one to talk to about it. (Added Nicolas.)
> >
> >While we're at it, is there anybody who knows binfmt_flat?
> >
> >It might be Nicolas too.
> >
> >binfmt_flat doesn't do core-dumping, but it has some other oddities.
> >In particular, I'd like to bring sanity to the installation of the new
> >creds, and all the _normal_ binfmt cases do it largely close together
> >with setup_new_exec().
> >
> >binfmt_flat is doing odd things. It's doing this:
> >
> > /* Flush all traces of the currently running executable */
> > if (id == 0) {
> > ret = flush_old_exec(bprm);
> > if (ret)
> > goto err;
> >
> > /* OK, This is the point of no return */
> > set_personality(PER_LINUX_32BIT);
> > setup_new_exec(bprm);
> > }
> >
> >in load_flat_file() - which is also used to loading _libraries_. Where
> >it makes no sense at all.
>
> I haven't looked at the shared lib support in there for a long time,
> but I thought that "id" is only 0 for the actual final program.
> Libraries have a slot or id number associated with them.
This sounds correct. My understanding of FLAT shared library support
is that it's really bad and based on having preassigned slot indices
for each library on the system, and a global array per-process to give
to data base address for each library. Libraries are compiled to know
their own slot numbers so that they just load from fixed_reg[slot_id]
to get what's effectively their GOT pointer.
I'm not sure if anybody has actually used this in over a decade. Last
time I looked the tooling appeared broken, but in this domain lots of
users have forked private tooling that's not publicly available or at
least not publicly indexed, so it's hard to say for sure.
Rich
On Thu, Apr 30, 2020 at 7:10 AM Greg Ungerer <[email protected]> wrote:
>
> > in load_flat_file() - which is also used to loading _libraries_. Where
> > it makes no sense at all.
>
> I haven't looked at the shared lib support in there for a long time,
> but I thought that "id" is only 0 for the actual final program.
> Libraries have a slot or id number associated with them.
Yes, that was my assumption, but looking at the code, it really isn't
obvious that that is the case at all.
'id' gets calculated from fields that very much look like they could
be zero (eg by taking the top bits from another random field).
> > Most of that file goes back to pre-git days. And most of the commits
> > since are not so much about binfmt_flat, as they are about cleanups or
> > changes elsewhere where binfmt_flat was just a victim.
>
> I'll have a look at this.
Thanks.
> Quick hack test shows moving setup_new_exec(bprm) to be just before
> install_exec_creds(bprm) works fine for the static binaries case.
> Doing the flush_old_exec(bprm) there too crashed out - I'll need to
> dig into that to see why.
Just moving setup_new_exec() would at least allow us to then join the
two together, and just say "setup_new_exec() does the credential
installation too".
So to some degree, that's the important one.
But that flush_old_exec() does look odd in load_flat_file(). It's not
like anything but executing a binary should flush the old exec.
Certainly not loading a library, however odd that flat library code
is.
My _guess_ is that the reason for this is that "load_flat_file()" also
does a lot of verification of the file and does that whole "return
-ENOEXEC if the file format isn't right". So we don't want to flush
the old exec before that is done, but we obviously also don't want to
flush the old exec after we've actually loaded the new one into
memory..
So the location of flush_old_exec() makes that kind of sense, but it
would have made it better if that flat file support had a clear
separation of "check the file" from "load the file".
Oh well. As mentioned, the whole "at least put setup_new_exec() and
install_exec_creds() together" is the bigger thing.
But if it's true that nobody really uses the odd flat library support
any more and there are no testers, maybe we should consider ripping it
out...
Linus
Linus Torvalds <[email protected]> writes:
> On Thu, Apr 30, 2020 at 7:10 AM Greg Ungerer <[email protected]> wrote:
>> > Most of that file goes back to pre-git days. And most of the commits
>> > since are not so much about binfmt_flat, as they are about cleanups or
>> > changes elsewhere where binfmt_flat was just a victim.
>>
>> I'll have a look at this.
>
> Thanks.
>
>> Quick hack test shows moving setup_new_exec(bprm) to be just before
>> install_exec_creds(bprm) works fine for the static binaries case.
>> Doing the flush_old_exec(bprm) there too crashed out - I'll need to
>> dig into that to see why.
>
> Just moving setup_new_exec() would at least allow us to then join the
> two together, and just say "setup_new_exec() does the credential
> installation too".
But it is only half a help if we allow failure points between
flush_old_exec and install_exec_creds.
Greg do things work acceptably if install_exec_creds is moved to right
after setup_new_exec? (patch below)
Looking at the code in load_flat_file after setup_new_exec it looks like
the kinds of things that in binfmt_elf.c we do after install_exec_creds
(aka vm_map). So I think we want install_exec_creds sooner, instead
of setup_new_exec later.
> But if it's true that nobody really uses the odd flat library support
> any more and there are no testers, maybe we should consider ripping it
> out...
I looked a little deeper and there is another reason to think about
ripping out the flat library loader. The code is recursive, and
supports a maximum of 4 shared libraries in the entire system.
load_flat_binary
load_flat_file
calc_reloc
load_flat_shared_libary
load_flat_file
....
I am mystified with what kind of system can survive with a grand total
of 4 shared libaries. I think my a.out slackware system that I ran on
my i486 had more shared libraries.
Having read just a bit more it is definitely guaranteed (by the code)
that the first time load_flat_file is called id 0 will be used (aka id 0
is guaranteed to be the binary), and the ids 1, 2, 3 and 4 will only be
used if a relocation includes that id to reference an external shared
library. That part of the code is drop dead simple.
---
This is what I was thinking about applying.
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index 831a2b25ba79..1a1d1fcb893f 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -541,6 +541,7 @@ static int load_flat_file(struct linux_binprm *bprm,
/* OK, This is the point of no return */
set_personality(PER_LINUX_32BIT);
setup_new_exec(bprm);
+ install_exec_creds(bprm);
}
/*
@@ -963,8 +964,6 @@ static int load_flat_binary(struct linux_binprm *bprm)
}
}
- install_exec_creds(bprm);
-
set_binfmt(&flat_format);
#ifdef CONFIG_MMU
On 4/30/20 9:51 AM, Rich Felker wrote:
> This sounds correct. My understanding of FLAT shared library support
> is that it's really bad and based on having preassigned slot indices
> for each library on the system, and a global array per-process to give
> to data base address for each library. Libraries are compiled to know
> their own slot numbers so that they just load from fixed_reg[slot_id]
> to get what's effectively their GOT pointer.
>
> I'm not sure if anybody has actually used this in over a decade. Last
> time I looked the tooling appeared broken, but in this domain lots of
> users have forked private tooling that's not publicly available or at
> least not publicly indexed, so it's hard to say for sure.
Lots of people in this area are also still using 10 year old tools because it
breaks every time they upgrade.
Heck, nommu support for architectures musl doesn't support yet is _explicitly_
the main thing keeping uClibc alive:
https://www.openwall.com/lists/musl/2015/05/30/1
Rob
On 1/5/20 5:07 am, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
>> On Thu, Apr 30, 2020 at 7:10 AM Greg Ungerer <[email protected]> wrote:
>
>>>> Most of that file goes back to pre-git days. And most of the commits
>>>> since are not so much about binfmt_flat, as they are about cleanups or
>>>> changes elsewhere where binfmt_flat was just a victim.
>>>
>>> I'll have a look at this.
>>
>> Thanks.
>>
>>> Quick hack test shows moving setup_new_exec(bprm) to be just before
>>> install_exec_creds(bprm) works fine for the static binaries case.
>>> Doing the flush_old_exec(bprm) there too crashed out - I'll need to
>>> dig into that to see why.
>>
>> Just moving setup_new_exec() would at least allow us to then join the
>> two together, and just say "setup_new_exec() does the credential
>> installation too".
>
> But it is only half a help if we allow failure points between
> flush_old_exec and install_exec_creds.
>
> Greg do things work acceptably if install_exec_creds is moved to right
> after setup_new_exec? (patch below)
Yes, confirmed. Worked fine with that patch applied.
> Looking at the code in load_flat_file after setup_new_exec it looks like
> the kinds of things that in binfmt_elf.c we do after install_exec_creds
> (aka vm_map). So I think we want install_exec_creds sooner, instead
> of setup_new_exec later.
>
>> But if it's true that nobody really uses the odd flat library support
>> any more and there are no testers, maybe we should consider ripping it
>> out...
>
> I looked a little deeper and there is another reason to think about
> ripping out the flat library loader. The code is recursive, and
> supports a maximum of 4 shared libraries in the entire system.
>
> load_flat_binary
> load_flat_file
> calc_reloc
> load_flat_shared_libary
> load_flat_file
> ....
>
> I am mystified with what kind of system can survive with a grand total
> of 4 shared libaries. I think my a.out slackware system that I ran on
> my i486 had more shared libraries.
The kind of embedded systems that were built with this stuff 20 years
ago didn't have lots of applications and libraries. I think we found
back then that most of your savings were from making libc shared.
Less significant gains from making other libraries shared. And there
was a bit of extra pain in setting them up with the shared library
code generation options (that had to be unique for each one).
The whole mechanism is a bit of hack, and there was a few other
limitations with the way it worked (I don't recall what they were
right now).
I am definitely in favor of removing it.
Regards
Greg
> Having read just a bit more it is definitely guaranteed (by the code)
> that the first time load_flat_file is called id 0 will be used (aka id 0
> is guaranteed to be the binary), and the ids 1, 2, 3 and 4 will only be
> used if a relocation includes that id to reference an external shared
> library. That part of the code is drop dead simple.
>
> ---
>
> This is what I was thinking about applying.
>
> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
> index 831a2b25ba79..1a1d1fcb893f 100644
> --- a/fs/binfmt_flat.c
> +++ b/fs/binfmt_flat.c
> @@ -541,6 +541,7 @@ static int load_flat_file(struct linux_binprm *bprm,
> /* OK, This is the point of no return */
> set_personality(PER_LINUX_32BIT);
> setup_new_exec(bprm);
> + install_exec_creds(bprm);
> }
>
> /*
> @@ -963,8 +964,6 @@ static int load_flat_binary(struct linux_binprm *bprm)
> }
> }
>
> - install_exec_creds(bprm);
> -
> set_binfmt(&flat_format);
>
> #ifdef CONFIG_MMU
>
>
On 1/5/20 12:51 am, Rich Felker wrote:
> On Fri, May 01, 2020 at 12:10:05AM +1000, Greg Ungerer wrote:
>>
>>
>> On 30/4/20 9:03 am, Linus Torvalds wrote:
>>> On Wed, Apr 29, 2020 at 2:57 PM Russell King - ARM Linux admin
>>> <[email protected]> wrote:
>>>>
>>>> I've never had any reason to use FDPIC, and I don't have any binaries
>>>> that would use it. Nicolas Pitre added ARM support, so I guess he
>>>> would be the one to talk to about it. (Added Nicolas.)
>>>
>>> While we're at it, is there anybody who knows binfmt_flat?
>>>
>>> It might be Nicolas too.
>>>
>>> binfmt_flat doesn't do core-dumping, but it has some other oddities.
>>> In particular, I'd like to bring sanity to the installation of the new
>>> creds, and all the _normal_ binfmt cases do it largely close together
>>> with setup_new_exec().
>>>
>>> binfmt_flat is doing odd things. It's doing this:
>>>
>>> /* Flush all traces of the currently running executable */
>>> if (id == 0) {
>>> ret = flush_old_exec(bprm);
>>> if (ret)
>>> goto err;
>>>
>>> /* OK, This is the point of no return */
>>> set_personality(PER_LINUX_32BIT);
>>> setup_new_exec(bprm);
>>> }
>>>
>>> in load_flat_file() - which is also used to loading _libraries_. Where
>>> it makes no sense at all.
>>
>> I haven't looked at the shared lib support in there for a long time,
>> but I thought that "id" is only 0 for the actual final program.
>> Libraries have a slot or id number associated with them.
>
> This sounds correct. My understanding of FLAT shared library support
> is that it's really bad and based on having preassigned slot indices
> for each library on the system, and a global array per-process to give
> to data base address for each library. Libraries are compiled to know
> their own slot numbers so that they just load from fixed_reg[slot_id]
> to get what's effectively their GOT pointer.
>
> I'm not sure if anybody has actually used this in over a decade. Last
> time I looked the tooling appeared broken, but in this domain lots of
> users have forked private tooling that's not publicly available or at
> least not publicly indexed, so it's hard to say for sure.
Be at least 12 or 13 years since I last had a working shared library
build for m68knommu. I have not bothered with it since then, not that I
even used it much when it worked. Seemed more pain than it was worth.
Regards
Greg
On 1/5/20 2:54 am, Linus Torvalds wrote:
> On Thu, Apr 30, 2020 at 7:10 AM Greg Ungerer <[email protected]> wrote:
>>
>>> in load_flat_file() - which is also used to loading _libraries_. Where
>>> it makes no sense at all.
>>
>> I haven't looked at the shared lib support in there for a long time,
>> but I thought that "id" is only 0 for the actual final program.
>> Libraries have a slot or id number associated with them.
>
> Yes, that was my assumption, but looking at the code, it really isn't
> obvious that that is the case at all.
>
> 'id' gets calculated from fields that very much look like they could
> be zero (eg by taking the top bits from another random field).
>
>>> Most of that file goes back to pre-git days. And most of the commits
>>> since are not so much about binfmt_flat, as they are about cleanups or
>>> changes elsewhere where binfmt_flat was just a victim.
>>
>> I'll have a look at this.
>
> Thanks.
>
>> Quick hack test shows moving setup_new_exec(bprm) to be just before
>> install_exec_creds(bprm) works fine for the static binaries case.
>> Doing the flush_old_exec(bprm) there too crashed out - I'll need to
>> dig into that to see why.
>
> Just moving setup_new_exec() would at least allow us to then join the
> two together, and just say "setup_new_exec() does the credential
> installation too".
>
> So to some degree, that's the important one.
>
> But that flush_old_exec() does look odd in load_flat_file(). It's not
> like anything but executing a binary should flush the old exec.
> Certainly not loading a library, however odd that flat library code
> is.
>
> My _guess_ is that the reason for this is that "load_flat_file()" also
> does a lot of verification of the file and does that whole "return
> -ENOEXEC if the file format isn't right". So we don't want to flush
> the old exec before that is done, but we obviously also don't want to
> flush the old exec after we've actually loaded the new one into
> memory..
Yeah, that is what it looks like. Looking at the history, the introduction
of setup_new_exec() [in commit 221af7f87b974] was probably just
added where the the existing flush_old_exec() was.
> So the location of flush_old_exec() makes that kind of sense, but it
> would have made it better if that flat file support had a clear
> separation of "check the file" from "load the file".
>
> Oh well. As mentioned, the whole "at least put setup_new_exec() and
> install_exec_creds() together" is the bigger thing.
>
> But if it's true that nobody really uses the odd flat library support
> any more and there are no testers, maybe we should consider ripping it
> out...
I am for that. If nobody pipes up and complains I'll look at taking it out.
Regards
Greg
Greg Ungerer <[email protected]> writes:
> On 1/5/20 5:07 am, Eric W. Biederman wrote:
>> Linus Torvalds <[email protected]> writes:
>>
>>> On Thu, Apr 30, 2020 at 7:10 AM Greg Ungerer <[email protected]> wrote:
>>
>>>>> Most of that file goes back to pre-git days. And most of the commits
>>>>> since are not so much about binfmt_flat, as they are about cleanups or
>>>>> changes elsewhere where binfmt_flat was just a victim.
>>>>
>>>> I'll have a look at this.
>>>
>>> Thanks.
>>>
>>>> Quick hack test shows moving setup_new_exec(bprm) to be just before
>>>> install_exec_creds(bprm) works fine for the static binaries case.
>>>> Doing the flush_old_exec(bprm) there too crashed out - I'll need to
>>>> dig into that to see why.
>>>
>>> Just moving setup_new_exec() would at least allow us to then join the
>>> two together, and just say "setup_new_exec() does the credential
>>> installation too".
>>
>> But it is only half a help if we allow failure points between
>> flush_old_exec and install_exec_creds.
>>
>> Greg do things work acceptably if install_exec_creds is moved to right
>> after setup_new_exec? (patch below)
>
> Yes, confirmed. Worked fine with that patch applied.
Good. Thank you.
That is what we need for other cleanups. All three of those together.
>> This is what I was thinking about applying.
>>
>> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
>> index 831a2b25ba79..1a1d1fcb893f 100644
>> --- a/fs/binfmt_flat.c
>> +++ b/fs/binfmt_flat.c
>> @@ -541,6 +541,7 @@ static int load_flat_file(struct linux_binprm *bprm,
>> /* OK, This is the point of no return */
>> set_personality(PER_LINUX_32BIT);
>> setup_new_exec(bprm);
>> + install_exec_creds(bprm);
>> }
>> /*
>> @@ -963,8 +964,6 @@ static int load_flat_binary(struct linux_binprm *bprm)
>> }
>> }
>> - install_exec_creds(bprm);
>> -
>> set_binfmt(&flat_format);
>> #ifdef CONFIG_MMU
Eric
On 5/1/20 1:00 AM, Greg Ungerer wrote:
>> This sounds correct. My understanding of FLAT shared library support
>> is that it's really bad and based on having preassigned slot indices
>> for each library on the system, and a global array per-process to give
>> to data base address for each library. Libraries are compiled to know
>> their own slot numbers so that they just load from fixed_reg[slot_id]
>> to get what's effectively their GOT pointer.
fdpic is to elf what binflt is to a.out, and a.out shared libraries were never
pretty. Or easy.
>> I'm not sure if anybody has actually used this in over a decade. Last
>> time I looked the tooling appeared broken, but in this domain lots of
>> users have forked private tooling that's not publicly available or at
>> least not publicly indexed, so it's hard to say for sure.
>
> Be at least 12 or 13 years since I last had a working shared library
> build for m68knommu. I have not bothered with it since then, not that I
> even used it much when it worked. Seemed more pain than it was worth.
Shared libraries worked fine with fdpic on sh2 last I checked, it's basically
just ELF PIC with the ability to move the 4 segments (text/rodata/bss/data)
independently of each other. (4 base pointers, no waiting.)
I don't think I've _ever_ used shared binflt libraries. I left myself
breadcrumbs back when I was wrestling with that stuff:
https://landley.net/notes-2014.html#07-12-2014
But it looks like that last time I touched anything using elf2flt was:
https://landley.net/notes-2018.html#08-05-2018
And that was just because arm's fdpic support stayed out of tree for years so I
dug up binflt and gave it another go. (It sucked so much I wound up building
static pie for cortex-m, taking the efficiency hit, and moving on. Running pie
binaries on nommu _works_, it's just incredibly inefficient. Since the writeable
and readable segments of the ELF are all relative to the same single base
pointer, you can't share the read-only parts of the binaries without address
remapping, so if you launch 4 instances of PIE bash on nommu you've loaded 4
instances of the bash text and rodata, and of course none of it can even be
demand faulted. In theory shared libraries _do_ help there but I hit some ld.so
bug and didn't want to debug a half-assed solution, so big hammer and moved on
until arm fdpic got merged and fixed it _properly_...)
Rob
P.S. The reason for binflt is bare metal hardware engineers who are conceptually
uncomfortable with software love them, because it's as close to "objcopy -O
binary" as they can get. Meanwhile on j-core we've had an 8k ROM boot loader
that loads vmlinux images and does the ELF relocations for 5 years now, and ever
since the switch to device tree that's our _only_ way to feed a dtb to the
kernel without statically linking it in, so it's ELF all the way down for us.
On Wed, Apr 29, 2020 at 11:49:50PM +0200, Jann Horn wrote:
> dump_emit() is for kernel pointers, and VMAs describe userspace memory.
> Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
> even if it probably doesn't matter much on !MMU systems - especially given
> that it looks like we can just use the same get_dump_page() as on MMU if
> we move it out of the CONFIG_MMU block.
Looks sensible. Did you get a chance to test this with a nommu setup?
On Wed, Apr 29, 2020 at 11:49:52PM +0200, Jann Horn wrote:
> Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of pages
> into the coredump file. Extract that logic into a common helper.
>
> Any other binfmt that actually wants to create coredumps will probably need
> the same function; so stop making get_dump_page() depend on
> CONFIG_ELF_CORE.
Why is the #ifdef CONFIG_ELF_CORE in gup.c removed when the only
remaining caller is under the same ifdef?
Otherwise this looks fine to me.
On Wed, Apr 29, 2020 at 11:49:53PM +0200, Jann Horn wrote:
> In both binfmt_elf and binfmt_elf_fdpic, use a new helper
> dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
> VMA, if we have one) while protected by the mmap_sem, and then use that
> snapshot instead of walking the VMA list without locking.
>
> An alternative approach would be to keep the mmap_sem held across the
> entire core dumping operation; however, keeping the mmap_sem locked while
> we may be blocked for an unbounded amount of time (e.g. because we're
> dumping to a FUSE filesystem or so) isn't really optimal; the mmap_sem
> blocks things like the ->release handler of userfaultfd, and we don't
> really want critical system daemons to grind to a halt just because someone
> "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or something
> like that.
>
> Since both the normal ELF code and the FDPIC ELF code need this
> functionality (and if any other binfmt wants to add coredump support in the
> future, they'd probably need it, too), implement this with a common helper
> in fs/coredump.c.
>
> A downside of this approach is that we now need a bigger amount of kernel
> memory per userspace VMA in the normal ELF case, and that we need O(n)
> kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
> be terribly bad.
>
> Signed-off-by: Jann Horn <[email protected]>
> ---
> fs/binfmt_elf.c | 152 +++++++++++++--------------------------
> fs/binfmt_elf_fdpic.c | 86 ++++++++++------------
> fs/coredump.c | 68 ++++++++++++++++++
> include/linux/coredump.h | 10 +++
> 4 files changed, 168 insertions(+), 148 deletions(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index fb36469848323..dffe9dc8497ca 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1292,8 +1292,12 @@ static bool always_dump_vma(struct vm_area_struct *vma)
> return false;
> }
>
> +#define DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER 1
> +
> /*
> * Decide what to dump of a segment, part, all or none.
> + * The result must be fixed up via vma_dump_size_fixup() once we're in a context
> + * that's allowed to sleep arbitrarily long.
> */
> static unsigned long vma_dump_size(struct vm_area_struct *vma,
> unsigned long mm_flags)
> @@ -1348,30 +1352,15 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
>
> /*
> * If this looks like the beginning of a DSO or executable mapping,
> - * check for an ELF header. If we find one, dump the first page to
> - * aid in determining what was mapped here.
> + * we'll check for an ELF header. If we find one, we'll dump the first
> + * page to aid in determining what was mapped here.
> + * However, we shouldn't sleep on userspace reads while holding the
> + * mmap_sem, so we just return a placeholder for now that will be fixed
> + * up later in vma_dump_size_fixup().
> */
> if (FILTER(ELF_HEADERS) &&
> - vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ)) {
> - u32 __user *header = (u32 __user *) vma->vm_start;
> - u32 word;
> - /*
> - * Doing it this way gets the constant folded by GCC.
> - */
> - union {
> - u32 cmp;
> - char elfmag[SELFMAG];
> - } magic;
> - BUILD_BUG_ON(SELFMAG != sizeof word);
> - magic.elfmag[EI_MAG0] = ELFMAG0;
> - magic.elfmag[EI_MAG1] = ELFMAG1;
> - magic.elfmag[EI_MAG2] = ELFMAG2;
> - magic.elfmag[EI_MAG3] = ELFMAG3;
> - if (unlikely(get_user(word, header)))
> - word = 0;
> - if (word == magic.cmp)
> - return PAGE_SIZE;
> - }
> + vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ))
> + return DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER;
>
> #undef FILTER
>
> @@ -1381,6 +1370,22 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
> return vma->vm_end - vma->vm_start;
> }
>
> +/* Fix up the result from vma_dump_size(), now that we're allowed to sleep. */
> +static void vma_dump_size_fixup(struct core_vma_metadata *meta)
> +{
> + char elfmag[SELFMAG];
> +
> + if (meta->dump_size != DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER)
> + return;
> +
> + if (copy_from_user(elfmag, (void __user *)meta->start, SELFMAG)) {
> + meta->dump_size = 0;
> + return;
> + }
> + meta->dump_size =
> + (memcmp(elfmag, ELFMAG, SELFMAG) == 0) ? PAGE_SIZE : 0;
> +}
While this code looks entirely correct, it took me way too long to
follow. I'd take te DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER into the caller,
and then have a simple function like this:
static void vma_dump_size_fixup(struct core_vma_metadata *meta)
{
char elfmag[SELFMAG];
if (copy_from_user(elfmag, (void __user *)meta->start, SELFMAG) ||
memcmp(elfmag, ELFMAG, sizeof(elfmag)) != 0)
meta->dump_size = 0;
else
meta->dump_size = PAGE_SIZE;
}
Also a few comments explaining why we do this fixup would help readers
of the code.
> - if (vma->vm_flags & VM_WRITE)
> - phdr.p_flags |= PF_W;
> - if (vma->vm_flags & VM_EXEC)
> - phdr.p_flags |= PF_X;
> + phdr.p_flags = meta->flags & VM_READ ? PF_R : 0;
> + phdr.p_flags |= meta->flags & VM_WRITE ? PF_W : 0;
> + phdr.p_flags |= meta->flags & VM_EXEC ? PF_X : 0;
Sorry for another nitpick, but I find the spelled out version with the
if a lot easier to read.
> +int dump_vma_snapshot(struct coredump_params *cprm, int *vma_count,
> + struct core_vma_metadata **vma_meta,
> + unsigned long (*dump_size_cb)(struct vm_area_struct *, unsigned long))
Plase don't use single tabs for indentating parameter continuations
(we actually have two styles - either two tabs or aligned after the
opening brace, pick your poison :))
> + *vma_meta = kvmalloc_array(*vma_count, sizeof(**vma_meta), GFP_KERNEL);
> + if (!*vma_meta) {
> + up_read(&mm->mmap_sem);
> + return -ENOMEM;
> + }
> +
> + for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
> + vma = next_vma(vma, gate_vma)) {
> + (*vma_meta)[i++] = (struct core_vma_metadata) {
> + .start = vma->vm_start,
> + .end = vma->vm_end,
> + .flags = vma->vm_flags,
> + .dump_size = dump_size_cb(vma, cprm->mm_flags)
> + };
This looks a little weird. Why not kcalloc + just initialize the four
fields we actually fill out here?
On Tue, May 5, 2020 at 12:48 PM Christoph Hellwig <[email protected]> wrote:
> On Wed, Apr 29, 2020 at 11:49:50PM +0200, Jann Horn wrote:
> > dump_emit() is for kernel pointers, and VMAs describe userspace memory.
> > Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
> > even if it probably doesn't matter much on !MMU systems - especially given
> > that it looks like we can just use the same get_dump_page() as on MMU if
> > we move it out of the CONFIG_MMU block.
>
> Looks sensible. Did you get a chance to test this with a nommu setup?
Nope. Do you happen to have a recommendation for a convenient
environment I can use with QEMU, or something like that? I'm guessing
that just running a standard armel Debian userspace with a !mmu ARM
kernel wouldn't work so well?
On Tue, May 5, 2020 at 12:50 PM Christoph Hellwig <[email protected]> wrote:
> On Wed, Apr 29, 2020 at 11:49:52PM +0200, Jann Horn wrote:
> > Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of pages
> > into the coredump file. Extract that logic into a common helper.
> >
> > Any other binfmt that actually wants to create coredumps will probably need
> > the same function; so stop making get_dump_page() depend on
> > CONFIG_ELF_CORE.
>
> Why is the #ifdef CONFIG_ELF_CORE in gup.c removed when the only
> remaining caller is under the same ifdef?
Oh, whoops, good point - I should go put that back.
On Tue, May 5, 2020 at 1:04 PM Christoph Hellwig <[email protected]> wrote:
> On Wed, Apr 29, 2020 at 11:49:53PM +0200, Jann Horn wrote:
> > In both binfmt_elf and binfmt_elf_fdpic, use a new helper
> > dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
> > VMA, if we have one) while protected by the mmap_sem, and then use that
> > snapshot instead of walking the VMA list without locking.
[...]
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index fb36469848323..dffe9dc8497ca 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -1292,8 +1292,12 @@ static bool always_dump_vma(struct vm_area_struct *vma)
> > return false;
> > }
> >
> > +#define DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER 1
> > +
> > /*
> > * Decide what to dump of a segment, part, all or none.
> > + * The result must be fixed up via vma_dump_size_fixup() once we're in a context
> > + * that's allowed to sleep arbitrarily long.
> > */
> > static unsigned long vma_dump_size(struct vm_area_struct *vma,
> > unsigned long mm_flags)
> > @@ -1348,30 +1352,15 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
> >
> > /*
> > * If this looks like the beginning of a DSO or executable mapping,
> > - * check for an ELF header. If we find one, dump the first page to
> > - * aid in determining what was mapped here.
> > + * we'll check for an ELF header. If we find one, we'll dump the first
> > + * page to aid in determining what was mapped here.
> > + * However, we shouldn't sleep on userspace reads while holding the
> > + * mmap_sem, so we just return a placeholder for now that will be fixed
> > + * up later in vma_dump_size_fixup().
> > */
> > if (FILTER(ELF_HEADERS) &&
> > - vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ)) {
> > - u32 __user *header = (u32 __user *) vma->vm_start;
> > - u32 word;
> > - /*
> > - * Doing it this way gets the constant folded by GCC.
> > - */
> > - union {
> > - u32 cmp;
> > - char elfmag[SELFMAG];
> > - } magic;
> > - BUILD_BUG_ON(SELFMAG != sizeof word);
> > - magic.elfmag[EI_MAG0] = ELFMAG0;
> > - magic.elfmag[EI_MAG1] = ELFMAG1;
> > - magic.elfmag[EI_MAG2] = ELFMAG2;
> > - magic.elfmag[EI_MAG3] = ELFMAG3;
> > - if (unlikely(get_user(word, header)))
> > - word = 0;
> > - if (word == magic.cmp)
> > - return PAGE_SIZE;
> > - }
> > + vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ))
> > + return DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER;
> >
> > #undef FILTER
> >
> > @@ -1381,6 +1370,22 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
> > return vma->vm_end - vma->vm_start;
> > }
> >
> > +/* Fix up the result from vma_dump_size(), now that we're allowed to sleep. */
> > +static void vma_dump_size_fixup(struct core_vma_metadata *meta)
> > +{
> > + char elfmag[SELFMAG];
> > +
> > + if (meta->dump_size != DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER)
> > + return;
> > +
> > + if (copy_from_user(elfmag, (void __user *)meta->start, SELFMAG)) {
> > + meta->dump_size = 0;
> > + return;
> > + }
> > + meta->dump_size =
> > + (memcmp(elfmag, ELFMAG, SELFMAG) == 0) ? PAGE_SIZE : 0;
> > +}
>
> While this code looks entirely correct, it took me way too long to
> follow. I'd take te DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER into the caller,
> and then have a simple function like this:
>
> static void vma_dump_size_fixup(struct core_vma_metadata *meta)
> {
> char elfmag[SELFMAG];
>
> if (copy_from_user(elfmag, (void __user *)meta->start, SELFMAG) ||
> memcmp(elfmag, ELFMAG, sizeof(elfmag)) != 0)
> meta->dump_size = 0;
> else
> meta->dump_size = PAGE_SIZE;
> }
I guess I can make that change, even if I personally think it's less
clear if parts of the fixup logic spill over into the caller instead
of being handled locally here. :P
> Also a few comments explaining why we do this fixup would help readers
> of the code.
>
> > - if (vma->vm_flags & VM_WRITE)
> > - phdr.p_flags |= PF_W;
> > - if (vma->vm_flags & VM_EXEC)
> > - phdr.p_flags |= PF_X;
> > + phdr.p_flags = meta->flags & VM_READ ? PF_R : 0;
> > + phdr.p_flags |= meta->flags & VM_WRITE ? PF_W : 0;
> > + phdr.p_flags |= meta->flags & VM_EXEC ? PF_X : 0;
>
> Sorry for another nitpick, but I find the spelled out version with the
> if a lot easier to read.
Huh... I find it easier to scan if it is three lines with the same
pattern, but I'm not too attached to it.
In that case, I guess I should change it like this? The old code had a
ternary for VM_READ and branches for the other two, which didn't seem
very pretty to me.
phdr.p_flags = 0;
if (meta->flags & VM_READ)
phdr.p_flags |= PF_R;
if (meta->flags & VM_WRITE)
phdr.p_flags |= PF_W;
if (meta->flags & VM_EXEC)
phdr.p_flags |= PF_X;
> > +int dump_vma_snapshot(struct coredump_params *cprm, int *vma_count,
> > + struct core_vma_metadata **vma_meta,
> > + unsigned long (*dump_size_cb)(struct vm_area_struct *, unsigned long))
>
> Plase don't use single tabs for indentating parameter continuations
> (we actually have two styles - either two tabs or aligned after the
> opening brace, pick your poison :))
I did that because if I use either one of those styles, I'll have to
either move the callback type into a typedef or add line breaks in the
parameters of the callback type. I guess I can write it like this...
int dump_vma_snapshot(struct coredump_params *cprm, int *vma_count,
struct core_vma_metadata **vma_meta,
unsigned long (*dump_size_cb)(struct vm_area_struct *,
unsigned long));
but if you also dislike that, let me know and I'll add a typedef instead. :P
> > + *vma_meta = kvmalloc_array(*vma_count, sizeof(**vma_meta), GFP_KERNEL);
> > + if (!*vma_meta) {
> > + up_read(&mm->mmap_sem);
> > + return -ENOMEM;
> > + }
> > +
> > + for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
> > + vma = next_vma(vma, gate_vma)) {
> > + (*vma_meta)[i++] = (struct core_vma_metadata) {
> > + .start = vma->vm_start,
> > + .end = vma->vm_end,
> > + .flags = vma->vm_flags,
> > + .dump_size = dump_size_cb(vma, cprm->mm_flags)
> > + };
>
> This looks a little weird. Why not kcalloc + just initialize the four
> fields we actually fill out here?
Yeah, I can just change the syntax here to normal member writes if you
want. I just thought C99-style initialization looked nicer, but I
guess that's unusual in the kernel...
(And I just noticed that that "filesize" member of that struct
core_vma_metadata I'm defining is entirely unused... I'll have to
remove that in the next iteration.)
On Tue, May 05, 2020 at 01:42:12PM +0200, Jann Horn wrote:
> On Tue, May 5, 2020 at 12:48 PM Christoph Hellwig <[email protected]> wrote:
> > On Wed, Apr 29, 2020 at 11:49:50PM +0200, Jann Horn wrote:
> > > dump_emit() is for kernel pointers, and VMAs describe userspace memory.
> > > Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
> > > even if it probably doesn't matter much on !MMU systems - especially given
> > > that it looks like we can just use the same get_dump_page() as on MMU if
> > > we move it out of the CONFIG_MMU block.
> >
> > Looks sensible. Did you get a chance to test this with a nommu setup?
>
> Nope. Do you happen to have a recommendation for a convenient
> environment I can use with QEMU, or something like that? I'm guessing
> that just running a standard armel Debian userspace with a !mmu ARM
> kernel wouldn't work so well?
Nommu generally needs special userspace either using uclibc-ng or musl.
When I did the RISC-V nommu work I used buildroot for my root file
systems. We haven't gotten elffdpic to work on RISC-V yet, so I can't
use that setup for testing, but it should support ARM as well.
On Tue, May 5, 2020 at 2:15 PM Christoph Hellwig <[email protected]> wrote:
> On Tue, May 05, 2020 at 01:42:12PM +0200, Jann Horn wrote:
> > On Tue, May 5, 2020 at 12:48 PM Christoph Hellwig <[email protected]> wrote:
> > > On Wed, Apr 29, 2020 at 11:49:50PM +0200, Jann Horn wrote:
> > > > dump_emit() is for kernel pointers, and VMAs describe userspace memory.
> > > > Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
> > > > even if it probably doesn't matter much on !MMU systems - especially given
> > > > that it looks like we can just use the same get_dump_page() as on MMU if
> > > > we move it out of the CONFIG_MMU block.
> > >
> > > Looks sensible. Did you get a chance to test this with a nommu setup?
> >
> > Nope. Do you happen to have a recommendation for a convenient
> > environment I can use with QEMU, or something like that? I'm guessing
> > that just running a standard armel Debian userspace with a !mmu ARM
> > kernel wouldn't work so well?
>
> Nommu generally needs special userspace either using uclibc-ng or musl.
> When I did the RISC-V nommu work I used buildroot for my root file
> systems. We haven't gotten elffdpic to work on RISC-V yet, so I can't
> use that setup for testing, but it should support ARM as well.
I've finally gotten around to testing this, and discovered that I
actually had to change something in the patch - thanks for asking me
to test this.
Some notes on running ARM nommu testing:
I ended up running QEMU with "-machine versatilepb". To make that
work, I applied this patch:
<https://github.com/buildroot/buildroot/blob/master/board/qemu/arm-versatile/patches/linux/versatile-nommu.patch>
A couple of directories up, there are also a README and a kernel
config for that.
Note that the emulated harddrive of this board doesn't seem to work,
because it's connected via PCI, and nommu generally can't use PCI; but
you can boot from initramfs, and you can copy files from/to the host
with netcat, since the emulated network card does work. (To avoid
having to bring up the interface from userspace, you can use
"ip=10.0.2.1::10.0.2.2:255.255.255.0" on the kernel cmdline if the
corresponding feature is enabled in the kernel config.)
The first trouble I ran into with trying to run FDPIC userspace (based
on musl) was that Linux has support for running ARM userspace in
"26-bit mode", which is some ARM feature from the dark ages, with no
support in QEMU; and while normally Linux only tries to enable that
thing when the binary explicitly requires it, the FDPIC path isn't
wired up to the appropriate personality logic properly, and so you get
a spectacular explosion, where eventually the kernel oopses with a
message about how it's trying to load an invalid value into CPSR
because first the kernel tries to return to 26-bit mode, and then,
through some mysterious spooky action at a distance, the kernel
(AFAICS) ends up trying to do a syscall return with the stack pointer
pointing somewhere in the middle of the kernel stack (and not where
the entry register frame is).
Anyway, my hacky workaround for that is:
diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index b9241051e5cb..d5aa409e366c 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -70,7 +70,7 @@ static inline void
arch_thread_struct_whitelist(unsigned long *offset,
if (current->personality & ADDR_LIMIT_32BIT) \
regs->ARM_cpsr = USR_MODE; \
else \
- regs->ARM_cpsr = USR26_MODE; \
+ { WARN(1, "setting USR26_MODE"); regs->ARM_cpsr = USR_MODE; } \
if (elf_hwcap & HWCAP_THUMB && pc & 1) \
regs->ARM_cpsr |= PSR_T_BIT; \
regs->ARM_cpsr |= PSR_ENDSTATE; \
Next up: Early on in the libc startup code, musl aborts execution by
intentionally executing an undefined instruction in
__set_thread_area(), because it can't figure out any working
implementation of atomic cmpxchg. For the MMU case, there is a kuser
helper (what x86 would call vsyscall); but for NOMMU ARM, no working
implementation exists. So I gave up on musl and went with uclibc-ng
(built via buildroot) instead, since uclibc-ng has support for
compiling out thread support.
Annoyingly, buildroot doesn't support FDPIC (at least not for nommu
ARM). So I ended up telling it to build a small FLAT userspace, and
used a standard ARM toolchain to build a tiny static PIE ELF binary
with no reliance on libc (the FDPIC loader can actually load normal
ELF mostly fine as long as it's PIE, at the cost of having to
duplicate the text section for every instance) - luckily I didn't need
the ELF binary to actually do anything complicated, and so working
without any libc was tolerable:
arm-linux-gnueabi-gcc-10 -fPIC -c -o test_crash.o test_crash.c
arm-linux-gnueabi-ld -pie --no-dynamic-linker -o test_crash test_crash.o
Next fun part: gdb-multiarch doesn't seem to be able to open FDPIC
core dumps properly - none of the register status is available. I took
apart the core dump before and after the patch in a hex editor,
though, and it seems to have all the expected stuff in it. I'm
guessing that maybe GDB got thrown off by struct elf_prstatus having a
different layout if the core dump was generated on nommu? GDB's
elf32_arm_nabi_grok_prstatus() seems to only handle the struct size
for the non-FDPIC struct.