2018-07-18 23:02:26

by Omar Sandoval

Subject: [PATCH v3 0/8] /proc/kcore improvements

From: Omar Sandoval <[email protected]>

Hi,

This series makes a few improvements to /proc/kcore. Patches 1, 2, and 3
are prep patches. Patch 4 is a fix/cleanup. Patch 5 is another prep
patch. Patches 6 and 7 are optimizations to ->read(). Patch 8 adds
vmcoreinfo to /proc/kcore.

Again, based on v4.18-rc4 + James' patch in the mm tree, and tested with
crash and readelf.

Thanks!

Changes from v2:

- Add __init to kclist_add() as per Andrew
- Got rid of the conversion of kcore_need_update from int to atomic_t and
just used xchg() instead of atomic_cmpxchg() (split out into a new
patch instead of combining it with the rwlock -> rwsem conversion)
- Add comment about the increase in file size to patch 8

Changes from v1:

- Rebased onto v4.18-rc4 + James' patch
(https://patchwork.kernel.org/patch/10519739/) in the mm tree
- Fix spurious sparse warning (see the report and response in
https://patchwork.kernel.org/patch/10512431/)

Omar Sandoval (8):
proc/kcore: don't grab lock for kclist_add()
proc/kcore: don't grab lock for memory hotplug notifier
proc/kcore: replace kclist_lock rwlock with rwsem
proc/kcore: fix memory hotplug vs multiple opens race
proc/kcore: hold lock during read
proc/kcore: clean up ELF header generation
proc/kcore: optimize multiple page reads
proc/kcore: add vmcoreinfo note to /proc/kcore

 fs/proc/Kconfig            |   1 +
 fs/proc/kcore.c            | 534 +++++++++++++++++--------------------
 include/linux/crash_core.h |   2 +
 include/linux/kcore.h      |   2 +-
 kernel/crash_core.c        |   4 +-
 5 files changed, 251 insertions(+), 292 deletions(-)

--
2.18.0



2018-07-18 23:00:16

by Omar Sandoval

Subject: [PATCH v3 8/8] proc/kcore: add vmcoreinfo note to /proc/kcore

From: Omar Sandoval <[email protected]>

The vmcoreinfo information is useful for runtime debugging tools, not
just for crash dumps. A lot of this information can be determined by
other means, but this is much more convenient, and it only adds a page
at most to the file.

Signed-off-by: Omar Sandoval <[email protected]>
---
 fs/proc/Kconfig            |  1 +
 fs/proc/kcore.c            | 18 ++++++++++++++++--
 include/linux/crash_core.h |  2 ++
 kernel/crash_core.c        |  4 ++--
 4 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 0eaeb41453f5..817c02b13b1d 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -31,6 +31,7 @@ config PROC_FS
config PROC_KCORE
bool "/proc/kcore support" if !ARM
depends on PROC_FS && MMU
+ select CRASH_CORE
help
Provides a virtual ELF core file of the live kernel. This can
be read with gdb and other ELF tools. No modifications can be
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 25fefdd05ee5..ab7c1a1dad50 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -10,6 +10,7 @@
* Safe accesses to vmalloc/direct-mapped discontiguous areas, Kanoj Sarcar <[email protected]>
*/

+#include <linux/crash_core.h>
#include <linux/mm.h>
#include <linux/proc_fs.h>
#include <linux/kcore.h>
@@ -81,10 +82,13 @@ static size_t get_kcore_size(int *nphdr, size_t *phdrs_len, size_t *notes_len,
}

*phdrs_len = *nphdr * sizeof(struct elf_phdr);
- *notes_len = (3 * (sizeof(struct elf_note) + ALIGN(sizeof(CORE_STR), 4)) +
+ *notes_len = (4 * sizeof(struct elf_note) +
+ 3 * ALIGN(sizeof(CORE_STR), 4) +
+ VMCOREINFO_NOTE_NAME_BYTES +
ALIGN(sizeof(struct elf_prstatus), 4) +
ALIGN(sizeof(struct elf_prpsinfo), 4) +
- ALIGN(arch_task_struct_size, 4));
+ ALIGN(arch_task_struct_size, 4) +
+ ALIGN(vmcoreinfo_size, 4));
*data_offset = PAGE_ALIGN(sizeof(struct elfhdr) + *phdrs_len +
*notes_len);
return *data_offset + size;
@@ -406,6 +410,16 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
sizeof(prpsinfo));
append_kcore_note(notes, &i, CORE_STR, NT_TASKSTRUCT, current,
arch_task_struct_size);
+ /*
+ * vmcoreinfo_size is mostly constant after init time, but it
+ * can be changed by crash_save_vmcoreinfo(). Racing here with a
+ * panic on another CPU before the machine goes down is insanely
+ * unlikely, but it's better to not leave potential buffer
+ * overflows lying around, regardless.
+ */
+ append_kcore_note(notes, &i, VMCOREINFO_NOTE_NAME, 0,
+ vmcoreinfo_data,
+ min(vmcoreinfo_size, notes_len - i));

tsz = min_t(size_t, buflen, notes_offset + notes_len - *fpos);
if (copy_to_user(buffer, notes + *fpos - notes_offset, tsz)) {
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index b511f6d24b42..525510a9f965 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -60,6 +60,8 @@ phys_addr_t paddr_vmcoreinfo_note(void);
#define VMCOREINFO_CONFIG(name) \
vmcoreinfo_append_str("CONFIG_%s=y\n", #name)

+extern unsigned char *vmcoreinfo_data;
+extern size_t vmcoreinfo_size;
extern u32 *vmcoreinfo_note;

Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index b66aced5e8c2..d02c58b94460 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -14,8 +14,8 @@
#include <asm/sections.h>

/* vmcoreinfo stuff */
-static unsigned char *vmcoreinfo_data;
-static size_t vmcoreinfo_size;
+unsigned char *vmcoreinfo_data;
+size_t vmcoreinfo_size;
u32 *vmcoreinfo_note;

/* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
--
2.18.0


2018-07-18 23:00:28

by Omar Sandoval

Subject: [PATCH v3 3/8] proc/kcore: replace kclist_lock rwlock with rwsem

From: Omar Sandoval <[email protected]>

Now kclist_lock is only needed from user context and at fs init time,
and the following changes need to sleep while holding it.

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/proc/kcore.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index e83f15a4f66d..ae43a97d511d 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -59,7 +59,7 @@ struct memelfnote
};

static LIST_HEAD(kclist_head);
-static DEFINE_RWLOCK(kclist_lock);
+static DECLARE_RWSEM(kclist_lock);
static int kcore_need_update = 1;

/* This doesn't grab kclist_lock, so it should only be used at init time. */
@@ -117,7 +117,7 @@ static void __kcore_update_ram(struct list_head *list)
struct kcore_list *tmp, *pos;
LIST_HEAD(garbage);

- write_lock(&kclist_lock);
+ down_write(&kclist_lock);
if (xchg(&kcore_need_update, 0)) {
list_for_each_entry_safe(pos, tmp, &kclist_head, list) {
if (pos->type == KCORE_RAM
@@ -128,7 +128,7 @@ static void __kcore_update_ram(struct list_head *list)
} else
list_splice(list, &garbage);
proc_root_kcore->size = get_kcore_size(&nphdr, &size);
- write_unlock(&kclist_lock);
+ up_write(&kclist_lock);

free_kclist_ents(&garbage);
}
@@ -451,11 +451,11 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
int nphdr;
unsigned long start;

- read_lock(&kclist_lock);
+ down_read(&kclist_lock);
size = get_kcore_size(&nphdr, &elf_buflen);

if (buflen == 0 || *fpos >= size) {
- read_unlock(&kclist_lock);
+ up_read(&kclist_lock);
return 0;
}

@@ -472,11 +472,11 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
tsz = buflen;
elf_buf = kzalloc(elf_buflen, GFP_ATOMIC);
if (!elf_buf) {
- read_unlock(&kclist_lock);
+ up_read(&kclist_lock);
return -ENOMEM;
}
elf_kcore_store_hdr(elf_buf, nphdr, elf_buflen);
- read_unlock(&kclist_lock);
+ up_read(&kclist_lock);
if (copy_to_user(buffer, elf_buf + *fpos, tsz)) {
kfree(elf_buf);
return -EFAULT;
@@ -491,7 +491,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
if (buflen == 0)
return acc;
} else
- read_unlock(&kclist_lock);
+ up_read(&kclist_lock);

/*
* Check to see if our file offset matches with any of
@@ -504,12 +504,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
while (buflen) {
struct kcore_list *m;

- read_lock(&kclist_lock);
+ down_read(&kclist_lock);
list_for_each_entry(m, &kclist_head, list) {
if (start >= m->addr && start < (m->addr+m->size))
break;
}
- read_unlock(&kclist_lock);
+ up_read(&kclist_lock);

if (&m->list == &kclist_head) {
if (clear_user(buffer, tsz))
--
2.18.0


2018-07-18 23:00:33

by Omar Sandoval

Subject: [PATCH v3 4/8] proc/kcore: fix memory hotplug vs multiple opens race

From: Omar Sandoval <[email protected]>

There's a theoretical race condition that will cause /proc/kcore to miss
a memory hotplug event:

CPU0                                  CPU1
// hotplug event 1
kcore_need_update = 1

open_kcore()                          open_kcore()
    kcore_update_ram()                    kcore_update_ram()
        // Walk RAM                           // Walk RAM
        __kcore_update_ram()                  __kcore_update_ram()
            kcore_need_update = 0

// hotplug event 2
kcore_need_update = 1
                                              kcore_need_update = 0

Note that CPU1 set up the RAM kcore entries with the state after hotplug
event 1 but cleared the flag for hotplug event 2. The RAM entries will
therefore be stale until there is another hotplug event.

This is an extremely unlikely sequence of events, but the fix makes the
synchronization saner anyway: we serialize the entire update sequence,
which means that whoever clears the flag will always succeed in
replacing the kcore list.

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/proc/kcore.c | 93 +++++++++++++++++++++++--------------------------
1 file changed, 44 insertions(+), 49 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ae43a97d511d..95aa988c5b5d 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -98,53 +98,15 @@ static size_t get_kcore_size(int *nphdr, size_t *elf_buflen)
return size + *elf_buflen;
}

-static void free_kclist_ents(struct list_head *head)
-{
- struct kcore_list *tmp, *pos;
-
- list_for_each_entry_safe(pos, tmp, head, list) {
- list_del(&pos->list);
- kfree(pos);
- }
-}
-/*
- * Replace all KCORE_RAM/KCORE_VMEMMAP information with passed list.
- */
-static void __kcore_update_ram(struct list_head *list)
-{
- int nphdr;
- size_t size;
- struct kcore_list *tmp, *pos;
- LIST_HEAD(garbage);
-
- down_write(&kclist_lock);
- if (xchg(&kcore_need_update, 0)) {
- list_for_each_entry_safe(pos, tmp, &kclist_head, list) {
- if (pos->type == KCORE_RAM
- || pos->type == KCORE_VMEMMAP)
- list_move(&pos->list, &garbage);
- }
- list_splice_tail(list, &kclist_head);
- } else
- list_splice(list, &garbage);
- proc_root_kcore->size = get_kcore_size(&nphdr, &size);
- up_write(&kclist_lock);
-
- free_kclist_ents(&garbage);
-}
-
-
#ifdef CONFIG_HIGHMEM
/*
* If no highmem, we can assume [0...max_low_pfn) continuous range of memory
* because memory hole is not as big as !HIGHMEM case.
* (HIGHMEM is special because part of memory is _invisible_ from the kernel.)
*/
-static int kcore_update_ram(void)
+static int kcore_ram_list(struct list_head *head)
{
- LIST_HEAD(head);
struct kcore_list *ent;
- int ret = 0;

ent = kmalloc(sizeof(*ent), GFP_KERNEL);
if (!ent)
@@ -152,9 +114,8 @@ static int kcore_update_ram(void)
ent->addr = (unsigned long)__va(0);
ent->size = max_low_pfn << PAGE_SHIFT;
ent->type = KCORE_RAM;
- list_add(&ent->list, &head);
- __kcore_update_ram(&head);
- return ret;
+ list_add(&ent->list, head);
+ return 0;
}

#else /* !CONFIG_HIGHMEM */
@@ -253,11 +214,10 @@ kclist_add_private(unsigned long pfn, unsigned long nr_pages, void *arg)
return 1;
}

-static int kcore_update_ram(void)
+static int kcore_ram_list(struct list_head *list)
{
int nid, ret;
unsigned long end_pfn;
- LIST_HEAD(head);

/* Not inialized....update now */
/* find out "max pfn" */
@@ -269,15 +229,50 @@ static int kcore_update_ram(void)
end_pfn = node_end;
}
/* scan 0 to max_pfn */
- ret = walk_system_ram_range(0, end_pfn, &head, kclist_add_private);
- if (ret) {
- free_kclist_ents(&head);
+ ret = walk_system_ram_range(0, end_pfn, list, kclist_add_private);
+ if (ret)
return -ENOMEM;
+ return 0;
+}
+#endif /* CONFIG_HIGHMEM */
+
+static int kcore_update_ram(void)
+{
+ LIST_HEAD(list);
+ LIST_HEAD(garbage);
+ int nphdr;
+ size_t size;
+ struct kcore_list *tmp, *pos;
+ int ret = 0;
+
+ down_write(&kclist_lock);
+ if (!xchg(&kcore_need_update, 0))
+ goto out;
+
+ ret = kcore_ram_list(&list);
+ if (ret) {
+ /* Couldn't get the RAM list, try again next time. */
+ WRITE_ONCE(kcore_need_update, 1);
+ list_splice_tail(&list, &garbage);
+ goto out;
+ }
+
+ list_for_each_entry_safe(pos, tmp, &kclist_head, list) {
+ if (pos->type == KCORE_RAM || pos->type == KCORE_VMEMMAP)
+ list_move(&pos->list, &garbage);
+ }
+ list_splice_tail(&list, &kclist_head);
+
+ proc_root_kcore->size = get_kcore_size(&nphdr, &size);
+
+out:
+ up_write(&kclist_lock);
+ list_for_each_entry_safe(pos, tmp, &garbage, list) {
+ list_del(&pos->list);
+ kfree(pos);
}
- __kcore_update_ram(&head);
return ret;
}
-#endif /* CONFIG_HIGHMEM */

/*****************************************************************************/
/*
--
2.18.0


2018-07-18 23:00:35

by Omar Sandoval

Subject: [PATCH v3 6/8] proc/kcore: clean up ELF header generation

From: Omar Sandoval <[email protected]>

Currently, the ELF file header, program headers, and note segment are
allocated all at once, in some icky code dating back to 2.3. Programs
tend to read the file header, then the program headers, then the note
segment, all separately, so this is a waste of effort. It's cleaner and
more efficient to handle the three separately.

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/proc/kcore.c | 350 +++++++++++++++++++-----------------------------
1 file changed, 141 insertions(+), 209 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index e317ac890871..e2ca58d49938 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -49,15 +49,6 @@ static struct proc_dir_entry *proc_root_kcore;
#define kc_offset_to_vaddr(o) ((o) + PAGE_OFFSET)
#endif

-/* An ELF note in memory */
-struct memelfnote
-{
- const char *name;
- int type;
- unsigned int datasz;
- void *data;
-};
-
static LIST_HEAD(kclist_head);
static DECLARE_RWSEM(kclist_lock);
static int kcore_need_update = 1;
@@ -73,7 +64,8 @@ void __init kclist_add(struct kcore_list *new, void *addr, size_t size,
list_add_tail(&new->list, &kclist_head);
}

-static size_t get_kcore_size(int *nphdr, size_t *elf_buflen)
+static size_t get_kcore_size(int *nphdr, size_t *phdrs_len, size_t *notes_len,
+ size_t *data_offset)
{
size_t try, size;
struct kcore_list *m;
@@ -87,15 +79,15 @@ static size_t get_kcore_size(int *nphdr, size_t *elf_buflen)
size = try;
*nphdr = *nphdr + 1;
}
- *elf_buflen = sizeof(struct elfhdr) +
- (*nphdr + 2)*sizeof(struct elf_phdr) +
- 3 * ((sizeof(struct elf_note)) +
- roundup(sizeof(CORE_STR), 4)) +
- roundup(sizeof(struct elf_prstatus), 4) +
- roundup(sizeof(struct elf_prpsinfo), 4) +
- roundup(arch_task_struct_size, 4);
- *elf_buflen = PAGE_ALIGN(*elf_buflen);
- return size + *elf_buflen;
+
+ *phdrs_len = *nphdr * sizeof(struct elf_phdr);
+ *notes_len = (3 * (sizeof(struct elf_note) + ALIGN(sizeof(CORE_STR), 4)) +
+ ALIGN(sizeof(struct elf_prstatus), 4) +
+ ALIGN(sizeof(struct elf_prpsinfo), 4) +
+ ALIGN(arch_task_struct_size, 4));
+ *data_offset = PAGE_ALIGN(sizeof(struct elfhdr) + *phdrs_len +
+ *notes_len);
+ return *data_offset + size;
}

#ifdef CONFIG_HIGHMEM
@@ -241,7 +233,7 @@ static int kcore_update_ram(void)
LIST_HEAD(list);
LIST_HEAD(garbage);
int nphdr;
- size_t size;
+ size_t phdrs_len, notes_len, data_offset;
struct kcore_list *tmp, *pos;
int ret = 0;

@@ -263,7 +255,8 @@ static int kcore_update_ram(void)
}
list_splice_tail(&list, &kclist_head);

- proc_root_kcore->size = get_kcore_size(&nphdr, &size);
+ proc_root_kcore->size = get_kcore_size(&nphdr, &phdrs_len, &notes_len,
+ &data_offset);

out:
up_write(&kclist_lock);
@@ -274,228 +267,168 @@ static int kcore_update_ram(void)
return ret;
}

-/*****************************************************************************/
-/*
- * determine size of ELF note
- */
-static int notesize(struct memelfnote *en)
+static void append_kcore_note(char *notes, size_t *i, const char *name,
+ unsigned int type, const void *desc,
+ size_t descsz)
{
- int sz;
-
- sz = sizeof(struct elf_note);
- sz += roundup((strlen(en->name) + 1), 4);
- sz += roundup(en->datasz, 4);
-
- return sz;
-} /* end notesize() */
-
-/*****************************************************************************/
-/*
- * store a note in the header buffer
- */
-static char *storenote(struct memelfnote *men, char *bufp)
-{
- struct elf_note en;
-
-#define DUMP_WRITE(addr,nr) do { memcpy(bufp,addr,nr); bufp += nr; } while(0)
-
- en.n_namesz = strlen(men->name) + 1;
- en.n_descsz = men->datasz;
- en.n_type = men->type;
-
- DUMP_WRITE(&en, sizeof(en));
- DUMP_WRITE(men->name, en.n_namesz);
-
- /* XXX - cast from long long to long to avoid need for libgcc.a */
- bufp = (char*) roundup((unsigned long)bufp,4);
- DUMP_WRITE(men->data, men->datasz);
- bufp = (char*) roundup((unsigned long)bufp,4);
-
-#undef DUMP_WRITE
-
- return bufp;
-} /* end storenote() */
-
-/*
- * store an ELF coredump header in the supplied buffer
- * nphdr is the number of elf_phdr to insert
- */
-static void elf_kcore_store_hdr(char *bufp, int nphdr, int dataoff)
-{
- struct elf_prstatus prstatus; /* NT_PRSTATUS */
- struct elf_prpsinfo prpsinfo; /* NT_PRPSINFO */
- struct elf_phdr *nhdr, *phdr;
- struct elfhdr *elf;
- struct memelfnote notes[3];
- off_t offset = 0;
- struct kcore_list *m;
-
- /* setup ELF header */
- elf = (struct elfhdr *) bufp;
- bufp += sizeof(struct elfhdr);
- offset += sizeof(struct elfhdr);
- memcpy(elf->e_ident, ELFMAG, SELFMAG);
- elf->e_ident[EI_CLASS] = ELF_CLASS;
- elf->e_ident[EI_DATA] = ELF_DATA;
- elf->e_ident[EI_VERSION]= EV_CURRENT;
- elf->e_ident[EI_OSABI] = ELF_OSABI;
- memset(elf->e_ident+EI_PAD, 0, EI_NIDENT-EI_PAD);
- elf->e_type = ET_CORE;
- elf->e_machine = ELF_ARCH;
- elf->e_version = EV_CURRENT;
- elf->e_entry = 0;
- elf->e_phoff = sizeof(struct elfhdr);
- elf->e_shoff = 0;
- elf->e_flags = ELF_CORE_EFLAGS;
- elf->e_ehsize = sizeof(struct elfhdr);
- elf->e_phentsize= sizeof(struct elf_phdr);
- elf->e_phnum = nphdr;
- elf->e_shentsize= 0;
- elf->e_shnum = 0;
- elf->e_shstrndx = 0;
-
- /* setup ELF PT_NOTE program header */
- nhdr = (struct elf_phdr *) bufp;
- bufp += sizeof(struct elf_phdr);
- offset += sizeof(struct elf_phdr);
- nhdr->p_type = PT_NOTE;
- nhdr->p_offset = 0;
- nhdr->p_vaddr = 0;
- nhdr->p_paddr = 0;
- nhdr->p_filesz = 0;
- nhdr->p_memsz = 0;
- nhdr->p_flags = 0;
- nhdr->p_align = 0;
-
- /* setup ELF PT_LOAD program header for every area */
- list_for_each_entry(m, &kclist_head, list) {
- phdr = (struct elf_phdr *) bufp;
- bufp += sizeof(struct elf_phdr);
- offset += sizeof(struct elf_phdr);
-
- phdr->p_type = PT_LOAD;
- phdr->p_flags = PF_R|PF_W|PF_X;
- phdr->p_offset = kc_vaddr_to_offset(m->addr) + dataoff;
- phdr->p_vaddr = (size_t)m->addr;
- if (m->type == KCORE_RAM)
- phdr->p_paddr = __pa(m->addr);
- else if (m->type == KCORE_TEXT)
- phdr->p_paddr = __pa_symbol(m->addr);
- else
- phdr->p_paddr = (elf_addr_t)-1;
- phdr->p_filesz = phdr->p_memsz = m->size;
- phdr->p_align = PAGE_SIZE;
- }
-
- /*
- * Set up the notes in similar form to SVR4 core dumps made
- * with info from their /proc.
- */
- nhdr->p_offset = offset;
-
- /* set up the process status */
- notes[0].name = CORE_STR;
- notes[0].type = NT_PRSTATUS;
- notes[0].datasz = sizeof(struct elf_prstatus);
- notes[0].data = &prstatus;
-
- memset(&prstatus, 0, sizeof(struct elf_prstatus));
-
- nhdr->p_filesz = notesize(&notes[0]);
- bufp = storenote(&notes[0], bufp);
-
- /* set up the process info */
- notes[1].name = CORE_STR;
- notes[1].type = NT_PRPSINFO;
- notes[1].datasz = sizeof(struct elf_prpsinfo);
- notes[1].data = &prpsinfo;
-
- memset(&prpsinfo, 0, sizeof(struct elf_prpsinfo));
- prpsinfo.pr_state = 0;
- prpsinfo.pr_sname = 'R';
- prpsinfo.pr_zomb = 0;
-
- strcpy(prpsinfo.pr_fname, "vmlinux");
- strlcpy(prpsinfo.pr_psargs, saved_command_line, sizeof(prpsinfo.pr_psargs));
-
- nhdr->p_filesz += notesize(&notes[1]);
- bufp = storenote(&notes[1], bufp);
-
- /* set up the task structure */
- notes[2].name = CORE_STR;
- notes[2].type = NT_TASKSTRUCT;
- notes[2].datasz = arch_task_struct_size;
- notes[2].data = current;
-
- nhdr->p_filesz += notesize(&notes[2]);
- bufp = storenote(&notes[2], bufp);
-
-} /* end elf_kcore_store_hdr() */
+ struct elf_note *note = (struct elf_note *)&notes[*i];
+
+ note->n_namesz = strlen(name) + 1;
+ note->n_descsz = descsz;
+ note->n_type = type;
+ *i += sizeof(*note);
+ memcpy(&notes[*i], name, note->n_namesz);
+ *i = ALIGN(*i + note->n_namesz, 4);
+ memcpy(&notes[*i], desc, descsz);
+ *i = ALIGN(*i + descsz, 4);
+}

-/*****************************************************************************/
-/*
- * read from the ELF header and then kernel memory
- */
static ssize_t
read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
{
char *buf = file->private_data;
- size_t size, tsz;
- size_t elf_buflen;
+ size_t phdrs_offset, notes_offset, data_offset;
+ size_t phdrs_len, notes_len;
+ struct kcore_list *m;
+ size_t tsz;
int nphdr;
unsigned long start;
size_t orig_buflen = buflen;
int ret = 0;

down_read(&kclist_lock);
- size = get_kcore_size(&nphdr, &elf_buflen);

- if (buflen == 0 || *fpos >= size)
- goto out;
+ get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
+ phdrs_offset = sizeof(struct elfhdr);
+ notes_offset = phdrs_offset + phdrs_len;
+
+ /* ELF file header. */
+ if (buflen && *fpos < sizeof(struct elfhdr)) {
+ struct elfhdr ehdr = {
+ .e_ident = {
+ [EI_MAG0] = ELFMAG0,
+ [EI_MAG1] = ELFMAG1,
+ [EI_MAG2] = ELFMAG2,
+ [EI_MAG3] = ELFMAG3,
+ [EI_CLASS] = ELF_CLASS,
+ [EI_DATA] = ELF_DATA,
+ [EI_VERSION] = EV_CURRENT,
+ [EI_OSABI] = ELF_OSABI,
+ },
+ .e_type = ET_CORE,
+ .e_machine = ELF_ARCH,
+ .e_version = EV_CURRENT,
+ .e_phoff = sizeof(struct elfhdr),
+ .e_flags = ELF_CORE_EFLAGS,
+ .e_ehsize = sizeof(struct elfhdr),
+ .e_phentsize = sizeof(struct elf_phdr),
+ .e_phnum = nphdr,
+ };
+
+ tsz = min_t(size_t, buflen, sizeof(struct elfhdr) - *fpos);
+ if (copy_to_user(buffer, (char *)&ehdr + *fpos, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }

- /* trim buflen to not go beyond EOF */
- if (buflen > size - *fpos)
- buflen = size - *fpos;
+ buffer += tsz;
+ buflen -= tsz;
+ *fpos += tsz;
+ }

- /* construct an ELF core header if we'll need some of it */
- if (*fpos < elf_buflen) {
- char * elf_buf;
+ /* ELF program headers. */
+ if (buflen && *fpos < phdrs_offset + phdrs_len) {
+ struct elf_phdr *phdrs, *phdr;

- tsz = elf_buflen - *fpos;
- if (buflen < tsz)
- tsz = buflen;
- elf_buf = kzalloc(elf_buflen, GFP_KERNEL);
- if (!elf_buf) {
+ phdrs = kzalloc(phdrs_len, GFP_KERNEL);
+ if (!phdrs) {
ret = -ENOMEM;
goto out;
}
- elf_kcore_store_hdr(elf_buf, nphdr, elf_buflen);
- if (copy_to_user(buffer, elf_buf + *fpos, tsz)) {
- kfree(elf_buf);
+
+ phdrs[0].p_type = PT_NOTE;
+ phdrs[0].p_offset = notes_offset;
+ phdrs[0].p_filesz = notes_len;
+
+ phdr = &phdrs[1];
+ list_for_each_entry(m, &kclist_head, list) {
+ phdr->p_type = PT_LOAD;
+ phdr->p_flags = PF_R | PF_W | PF_X;
+ phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
+ phdr->p_vaddr = (size_t)m->addr;
+ if (m->type == KCORE_RAM)
+ phdr->p_paddr = __pa(m->addr);
+ else if (m->type == KCORE_TEXT)
+ phdr->p_paddr = __pa_symbol(m->addr);
+ else
+ phdr->p_paddr = (elf_addr_t)-1;
+ phdr->p_filesz = phdr->p_memsz = m->size;
+ phdr->p_align = PAGE_SIZE;
+ phdr++;
+ }
+
+ tsz = min_t(size_t, buflen, phdrs_offset + phdrs_len - *fpos);
+ if (copy_to_user(buffer, (char *)phdrs + *fpos - phdrs_offset,
+ tsz)) {
+ kfree(phdrs);
ret = -EFAULT;
goto out;
}
- kfree(elf_buf);
+ kfree(phdrs);
+
+ buffer += tsz;
buflen -= tsz;
*fpos += tsz;
- buffer += tsz;
+ }
+
+ /* ELF note segment. */
+ if (buflen && *fpos < notes_offset + notes_len) {
+ struct elf_prstatus prstatus = {};
+ struct elf_prpsinfo prpsinfo = {
+ .pr_sname = 'R',
+ .pr_fname = "vmlinux",
+ };
+ char *notes;
+ size_t i = 0;
+
+ strlcpy(prpsinfo.pr_psargs, saved_command_line,
+ sizeof(prpsinfo.pr_psargs));
+
+ notes = kzalloc(notes_len, GFP_KERNEL);
+ if (!notes) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ append_kcore_note(notes, &i, CORE_STR, NT_PRSTATUS, &prstatus,
+ sizeof(prstatus));
+ append_kcore_note(notes, &i, CORE_STR, NT_PRPSINFO, &prpsinfo,
+ sizeof(prpsinfo));
+ append_kcore_note(notes, &i, CORE_STR, NT_TASKSTRUCT, current,
+ arch_task_struct_size);

- /* leave now if filled buffer already */
- if (buflen == 0)
+ tsz = min_t(size_t, buflen, notes_offset + notes_len - *fpos);
+ if (copy_to_user(buffer, notes + *fpos - notes_offset, tsz)) {
+ kfree(notes);
+ ret = -EFAULT;
goto out;
+ }
+ kfree(notes);
+
+ buffer += tsz;
+ buflen -= tsz;
+ *fpos += tsz;
}

/*
* Check to see if our file offset matches with any of
* the addresses in the elf_phdr on our list.
*/
- start = kc_offset_to_vaddr(*fpos - elf_buflen);
+ start = kc_offset_to_vaddr(*fpos - data_offset);
if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
tsz = buflen;
-
- while (buflen) {
- struct kcore_list *m;

+ while (buflen) {
list_for_each_entry(m, &kclist_head, list) {
if (start >= m->addr && start < (m->addr+m->size))
break;
@@ -557,7 +490,6 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
return orig_buflen - buflen;
}

-
static int open_kcore(struct inode *inode, struct file *filp)
{
if (!capable(CAP_SYS_RAWIO))
--
2.18.0


2018-07-18 23:00:42

by Omar Sandoval

Subject: [PATCH v3 7/8] proc/kcore: optimize multiple page reads

From: Omar Sandoval <[email protected]>

The current code does a full search of the segment list every time for
every page. This is wasteful, since it's almost certain that the next
page will be in the same segment. Instead, check if the previous segment
covers the current page before doing the list search.

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/proc/kcore.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index e2ca58d49938..25fefdd05ee5 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -428,10 +428,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
tsz = buflen;

+ m = NULL;
while (buflen) {
- list_for_each_entry(m, &kclist_head, list) {
- if (start >= m->addr && start < (m->addr+m->size))
- break;
+ /*
+ * If this is the first iteration or the address is not within
+ * the previous entry, search for a matching entry.
+ */
+ if (!m || start < m->addr || start >= m->addr + m->size) {
+ list_for_each_entry(m, &kclist_head, list) {
+ if (start >= m->addr &&
+ start < m->addr + m->size)
+ break;
+ }
}

if (&m->list == &kclist_head) {
--
2.18.0


2018-07-18 23:01:00

by Omar Sandoval

Subject: [PATCH v3 2/8] proc/kcore: don't grab lock for memory hotplug notifier

From: Omar Sandoval <[email protected]>

The memory hotplug notifier kcore_callback() only needs kclist_lock to
prevent races with __kcore_update_ram(), but we can easily eliminate
that race by using an atomic xchg() in __kcore_update_ram(). This is
preparation for converting kclist_lock to an rwsem.

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/proc/kcore.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index b0b9a76f28d6..e83f15a4f66d 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -118,7 +118,7 @@ static void __kcore_update_ram(struct list_head *list)
LIST_HEAD(garbage);

write_lock(&kclist_lock);
- if (kcore_need_update) {
+ if (xchg(&kcore_need_update, 0)) {
list_for_each_entry_safe(pos, tmp, &kclist_head, list) {
if (pos->type == KCORE_RAM
|| pos->type == KCORE_VMEMMAP)
@@ -127,7 +127,6 @@ static void __kcore_update_ram(struct list_head *list)
list_splice_tail(list, &kclist_head);
} else
list_splice(list, &garbage);
- kcore_need_update = 0;
proc_root_kcore->size = get_kcore_size(&nphdr, &size);
write_unlock(&kclist_lock);

@@ -593,9 +592,8 @@ static int __meminit kcore_callback(struct notifier_block *self,
switch (action) {
case MEM_ONLINE:
case MEM_OFFLINE:
- write_lock(&kclist_lock);
kcore_need_update = 1;
- write_unlock(&kclist_lock);
+ break;
}
return NOTIFY_OK;
}
--
2.18.0


2018-07-18 23:01:05

by Omar Sandoval

Subject: [PATCH v3 5/8] proc/kcore: hold lock during read

From: Omar Sandoval <[email protected]>

Now that we're using an rwsem, we can hold it during the entirety of
read_kcore() and have a common return path. This is preparation for the
next change.

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/proc/kcore.c | 70 ++++++++++++++++++++++++++++---------------------
1 file changed, 40 insertions(+), 30 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 95aa988c5b5d..e317ac890871 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -440,19 +440,18 @@ static ssize_t
read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
{
char *buf = file->private_data;
- ssize_t acc = 0;
size_t size, tsz;
size_t elf_buflen;
int nphdr;
unsigned long start;
+ size_t orig_buflen = buflen;
+ int ret = 0;

down_read(&kclist_lock);
size = get_kcore_size(&nphdr, &elf_buflen);

- if (buflen == 0 || *fpos >= size) {
- up_read(&kclist_lock);
- return 0;
- }
+ if (buflen == 0 || *fpos >= size)
+ goto out;

/* trim buflen to not go beyond EOF */
if (buflen > size - *fpos)
@@ -465,28 +464,26 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
tsz = elf_buflen - *fpos;
if (buflen < tsz)
tsz = buflen;
- elf_buf = kzalloc(elf_buflen, GFP_ATOMIC);
+ elf_buf = kzalloc(elf_buflen, GFP_KERNEL);
if (!elf_buf) {
- up_read(&kclist_lock);
- return -ENOMEM;
+ ret = -ENOMEM;
+ goto out;
}
elf_kcore_store_hdr(elf_buf, nphdr, elf_buflen);
- up_read(&kclist_lock);
if (copy_to_user(buffer, elf_buf + *fpos, tsz)) {
kfree(elf_buf);
- return -EFAULT;
+ ret = -EFAULT;
+ goto out;
}
kfree(elf_buf);
buflen -= tsz;
*fpos += tsz;
buffer += tsz;
- acc += tsz;

/* leave now if filled buffer already */
if (buflen == 0)
- return acc;
- } else
- up_read(&kclist_lock);
+ goto out;
+ }

/*
* Check to see if our file offset matches with any of
@@ -499,25 +496,29 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
while (buflen) {
struct kcore_list *m;

- down_read(&kclist_lock);
list_for_each_entry(m, &kclist_head, list) {
if (start >= m->addr && start < (m->addr+m->size))
break;
}
- up_read(&kclist_lock);

if (&m->list == &kclist_head) {
- if (clear_user(buffer, tsz))
- return -EFAULT;
+ if (clear_user(buffer, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
} else if (m->type == KCORE_VMALLOC) {
vread(buf, (char *)start, tsz);
/* we have to zero-fill user buffer even if no read */
- if (copy_to_user(buffer, buf, tsz))
- return -EFAULT;
+ if (copy_to_user(buffer, buf, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
} else if (m->type == KCORE_USER) {
/* User page is handled prior to normal kernel page: */
- if (copy_to_user(buffer, (char *)start, tsz))
- return -EFAULT;
+ if (copy_to_user(buffer, (char *)start, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
} else {
if (kern_addr_valid(start)) {
/*
@@ -525,26 +526,35 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
* hardened user copy kernel text checks.
*/
if (probe_kernel_read(buf, (void *) start, tsz)) {
- if (clear_user(buffer, tsz))
- return -EFAULT;
+ if (clear_user(buffer, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
} else {
- if (copy_to_user(buffer, buf, tsz))
- return -EFAULT;
+ if (copy_to_user(buffer, buf, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
}
} else {
- if (clear_user(buffer, tsz))
- return -EFAULT;
+ if (clear_user(buffer, tsz)) {
+ ret = -EFAULT;
+ goto out;
+ }
}
}
buflen -= tsz;
*fpos += tsz;
buffer += tsz;
- acc += tsz;
start += tsz;
tsz = (buflen > PAGE_SIZE ? PAGE_SIZE : buflen);
}

- return acc;
+out:
+ up_write(&kclist_lock);
+ if (ret)
+ return ret;
+ return orig_buflen - buflen;
}


--
2.18.0


2018-07-18 23:02:08

by Omar Sandoval

Subject: [PATCH v3 1/8] proc/kcore: don't grab lock for kclist_add()

From: Omar Sandoval <[email protected]>

kclist_add() is only called at init time, so there's no point in
grabbing any locks. We're also going to replace the rwlock with an rwsem,
which we don't want to try grabbing during early boot.

While we're here, mark kclist_add() with __init so that we'll get a
warning if it's called from non-init code.

Signed-off-by: Omar Sandoval <[email protected]>
---
fs/proc/kcore.c | 7 +++----
include/linux/kcore.h | 2 +-
2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 66c373230e60..b0b9a76f28d6 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -62,16 +62,15 @@ static LIST_HEAD(kclist_head);
static DEFINE_RWLOCK(kclist_lock);
static int kcore_need_update = 1;

-void
-kclist_add(struct kcore_list *new, void *addr, size_t size, int type)
+/* This doesn't grab kclist_lock, so it should only be used at init time. */
+void __init kclist_add(struct kcore_list *new, void *addr, size_t size,
+ int type)
{
new->addr = (unsigned long)addr;
new->size = size;
new->type = type;

- write_lock(&kclist_lock);
list_add_tail(&new->list, &kclist_head);
- write_unlock(&kclist_lock);
}

static size_t get_kcore_size(int *nphdr, size_t *elf_buflen)
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index 8de55e4b5ee9..c20f296438fb 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -35,7 +35,7 @@ struct vmcoredd_node {
};

#ifdef CONFIG_PROC_KCORE
-extern void kclist_add(struct kcore_list *, void *, size_t, int type);
+void __init kclist_add(struct kcore_list *, void *, size_t, int type);
#else
static inline
void kclist_add(struct kcore_list *new, void *addr, size_t size, int type)
--
2.18.0


2018-07-24 15:14:09

by Tetsuo Handa

Subject: Re: [PATCH v3 5/8] proc/kcore: hold lock during read

On 2018/07/19 7:58, Omar Sandoval wrote:
> From: Omar Sandoval <[email protected]>
>
> Now that we're using an rwsem, we can hold it during the entirety of
> read_kcore() and have a common return path. This is preparation for the
> next change.
>
> Signed-off-by: Omar Sandoval <[email protected]>
> ---
> fs/proc/kcore.c | 70 ++++++++++++++++++++++++++++---------------------
> 1 file changed, 40 insertions(+), 30 deletions(-)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 95aa988c5b5d..e317ac890871 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -440,19 +440,18 @@ static ssize_t
> read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> {
> char *buf = file->private_data;
> - ssize_t acc = 0;
> size_t size, tsz;
> size_t elf_buflen;
> int nphdr;
> unsigned long start;
> + size_t orig_buflen = buflen;
> + int ret = 0;
>
> down_read(&kclist_lock);

(...snipped...)

> +out:
> + up_write(&kclist_lock);

Oops. This needs to be up_read().

> + if (ret)
> + return ret;
> + return orig_buflen - buflen;
> }
>
[ 43.508922] ------------[ cut here ]------------
[ 43.509931] DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
[ 43.509940] WARNING: CPU: 0 PID: 7933 at kernel/locking/rwsem.c:133 up_write+0x75/0x80
[ 43.512792] Modules linked in: pcspkr sg vmw_vmci i2c_piix4 sd_mod ata_generic pata_acpi vmwgfx drm_kms_helper syscopyarea ahci sysfillrect libahci sysimgblt fb_sys_fops mptspi ata_piix ttm scsi_transport_spi mptscsih drm e1000 mptbase libata i2c_core serio_raw ipv6 crc_ccitt
[ 43.517692] CPU: 0 PID: 7933 Comm: kexec Not tainted 4.18.0-rc6-next-20180724+ #715
[ 43.519237] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[ 43.521498] RIP: 0010:up_write+0x75/0x80
[ 43.522391] Code: 00 5b c3 e8 0d 03 3a 00 85 c0 74 d9 83 3d 6a e4 10 02 00 75 d0 48 c7 c6 c8 2f e0 81 48 c7 c7 a3 e3 de 81 31 c0 e8 fb 8d fa ff <0f> 0b eb b7 0f 1f 80 00 00 00 00 8b 05 42 3e 06 02 53 48 89 fb 85
[ 43.526253] RSP: 0018:ffffc90007ec7bc8 EFLAGS: 00010282
[ 43.527303] RAX: 0000000000000000 RBX: ffffffff8207b5e0 RCX: 0000000000000006
[ 43.528874] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013ba15b70
[ 43.530496] RBP: ffffc90007ec7e28 R08: 0000000000000000 R09: 0000000000000001
[ 43.532007] R10: 0000000000000000 R11: 292928746e657272 R12: ffffffff8207b660
[ 43.533448] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000e000
[ 43.534969] FS: 00007fdd6e0c0740(0000) GS:ffff88013ba00000(0000) knlGS:0000000000000000
[ 43.536623] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 43.537943] CR2: 00000000017dc000 CR3: 00000001376b5006 CR4: 00000000001606f0
[ 43.539608] Call Trace:
[ 43.540154] read_kcore+0x81/0x630
[ 43.540932] proc_reg_read+0x34/0x60
[ 43.541722] __vfs_read+0x2e/0x160
[ 43.542450] vfs_read+0x84/0x130
[ 43.543166] ksys_read+0x50/0xc0
[ 43.543856] do_syscall_64+0x4f/0x1f0
[ 43.544684] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 43.545909] RIP: 0033:0x7fdd6d798c70
[ 43.546688] Code: 0b 31 c0 48 83 c4 08 e9 be fe ff ff 48 8d 3d 07 b9 09 00 e8 52 8a 02 00 66 90 83 3d 2d c3 2d 00 00 75 10 b8 00 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 4e cc 01 00 48 89 04 24
[ 43.550607] RSP: 002b:00007fffe5207428 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 43.552192] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fdd6d798c70
[ 43.553711] RDX: 0000000000010000 RSI: 00000000017cd840 RDI: 0000000000000004
[ 43.555240] RBP: 0000000000010000 R08: 0000000000000001 R09: 0000000000010000
[ 43.556779] R10: 0000000000000079 R11: 0000000000000246 R12: 0000000000000004
[ 43.558284] R13: 00000000017cd840 R14: 00007fffe52074d8 R15: 00007fffe52076c0
[ 43.559809] irq event stamp: 25471
[ 43.560597] hardirqs last enabled at (25471): [<ffffffff81800966>] restore_regs_and_return_to_kernel+0x0/0x2a
[ 43.562821] hardirqs last disabled at (25470): [<ffffffff81800fa6>] error_exit+0x6/0x20
[ 43.564515] softirqs last enabled at (24962): [<ffffffff81a001db>] __do_softirq+0x1db/0x48e
[ 43.566308] softirqs last disabled at (24955): [<ffffffff8107197d>] irq_exit+0xcd/0xe0
[ 43.567982] ---[ end trace 0140237dd1b1be70 ]---
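
The DEBUG_LOCKS_WARN_ON(sem->owner != get_current()) check that fires in the splat above can be illustrated with a small userspace sketch. This is a toy model, not the kernel's rwsem implementation; the names (toy_rwsem, task) are hypothetical stand-ins. The point is that only a writer records itself as owner, so a path that takes the lock with down_read() but exits through the write-release path trips the owner check:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the DEBUG_RWSEMS owner check from the splat above.
 * Hypothetical userspace sketch, not the kernel's rwsem code. */

struct task { int pid; };

struct toy_rwsem {
	int readers;		/* count of read holders */
	struct task *owner;	/* recorded only when write-held */
};

static void toy_down_read(struct toy_rwsem *sem)  { sem->readers++; }
static void toy_up_read(struct toy_rwsem *sem)    { sem->readers--; }

static void toy_down_write(struct toy_rwsem *sem, struct task *t)
{
	sem->owner = t;		/* writer records itself as owner */
}

/* Returns 1 when the release is refused, mirroring
 * DEBUG_LOCKS_WARN_ON(sem->owner != get_current()). */
static int toy_up_write(struct toy_rwsem *sem, struct task *t)
{
	if (sem->owner != t)
		return 1;	/* read-held or foreign: warning fires */
	sem->owner = NULL;
	return 0;
}
```

Releasing a read-held semaphore via toy_up_write() returns 1, which is exactly the asymmetry Tetsuo spotted in read_kcore(): down_read() paired with up_write().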


2018-07-25 23:36:08

by Omar Sandoval

Subject: Re: [PATCH v3 5/8] proc/kcore: hold lock during read

On Wed, Jul 25, 2018 at 12:11:26AM +0900, Tetsuo Handa wrote:
> On 2018/07/19 7:58, Omar Sandoval wrote:
> > From: Omar Sandoval <[email protected]>
> >
> > Now that we're using an rwsem, we can hold it during the entirety of
> > read_kcore() and have a common return path. This is preparation for the
> > next change.
> >
> > Signed-off-by: Omar Sandoval <[email protected]>
> > ---
> > fs/proc/kcore.c | 70 ++++++++++++++++++++++++++++---------------------
> > 1 file changed, 40 insertions(+), 30 deletions(-)
> >
> > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > index 95aa988c5b5d..e317ac890871 100644
> > --- a/fs/proc/kcore.c
> > +++ b/fs/proc/kcore.c
> > @@ -440,19 +440,18 @@ static ssize_t
> > read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > {
> > char *buf = file->private_data;
> > - ssize_t acc = 0;
> > size_t size, tsz;
> > size_t elf_buflen;
> > int nphdr;
> > unsigned long start;
> > + size_t orig_buflen = buflen;
> > + int ret = 0;
> >
> > down_read(&kclist_lock);
>
> (...snipped...)
>
> > +out:
> > + up_write(&kclist_lock);
>
> Oops. This needs to be up_read().

Yeah, thanks, I'll fix this and rerun my tests with lockdep.

2018-08-28 11:01:39

by Dominique Martinet

Subject: KASAN error in [PATCH v3 7/8] proc/kcore: optimize multiple page reads


> The current code does a full search of the segment list every time for
> every page. This is wasteful, since it's almost certain that the next
> page will be in the same segment. Instead, check if the previous segment
> covers the current page before doing the list search.
>
> Signed-off-by: Omar Sandoval <[email protected]>
> ---
> fs/proc/kcore.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index e2ca58d49938..25fefdd05ee5 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -428,10 +428,18 @@ read_kcore(struct file *file, char __user *buffer,
> size_t buflen, loff_t *fpos)
> if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
> tsz = buflen;
>
> + m = NULL;
> while (buflen) {
> - list_for_each_entry(m, &kclist_head, list) {
> - if (start >= m->addr && start < (m->addr+m->size))
> - break;
> + /*
> + * If this is the first iteration or the address is not within
> + * the previous entry, search for a matching entry.
> + */
> + if (!m || start < m->addr || start >= m->addr + m->size) {

This line apparently triggers a KASAN warning since I rebooted on
4.19-rc1

This is 100% reproducible on my machine when the kdump service starts
(fedora28 x86_64 VM), here's the full stack (on 4.19-rc1):
[ 38.161102] BUG: KASAN: global-out-of-bounds in read_kcore+0xd5c/0xf20
[ 38.162123] Read of size 8 at addr ffffffffa6c0f770 by task kexec/6201

[ 38.163386] CPU: 16 PID: 6201 Comm: kexec Not tainted 4.19.0-rc1+ #13
[ 38.164374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[ 38.166211] Call Trace:
[ 38.166658] dump_stack+0x71/0xa9
[ 38.167443] print_address_description+0x65/0x22e
[ 38.168194] ? read_kcore+0xd5c/0xf20
[ 38.168778] kasan_report.cold.6+0x241/0x306
[ 38.169458] read_kcore+0xd5c/0xf20
[ 38.170018] ? open_kcore+0x1d0/0x1d0
[ 38.170605] ? avc_has_perm_noaudit+0x370/0x370
[ 38.171291] ? kasan_unpoison_shadow+0x30/0x40
[ 38.171973] ? kasan_kmalloc+0xbf/0xe0
[ 38.172562] ? kmem_cache_alloc_trace+0x105/0x200
[ 38.173289] ? open_kcore+0x5f/0x1d0
[ 38.173858] ? open_kcore+0x5f/0x1d0
[ 38.174428] ? deref_stack_reg+0xe0/0xe0
[ 38.175038] proc_reg_read+0x18b/0x220
[ 38.175652] ? proc_reg_unlocked_ioctl+0x210/0x210
[ 38.176399] __vfs_read+0xe1/0x6b0
[ 38.176930] ? __x64_sys_copy_file_range+0x450/0x450
[ 38.177723] ? do_filp_open+0x190/0x250
[ 38.178313] ? may_open_dev+0xc0/0xc0
[ 38.178886] ? __fsnotify_update_child_dentry_flags.part.3+0x330/0x330
[ 38.179883] ? __fsnotify_inode_delete+0x20/0x20
[ 38.180608] ? __inode_security_revalidate+0x8e/0xb0
[ 38.181378] vfs_read+0xde/0x2c0
[ 38.181889] ksys_read+0xb2/0x160
[ 38.182413] ? kernel_write+0x130/0x130
[ 38.183000] ? task_work_run+0x74/0x1c0
[ 38.183621] do_syscall_64+0xa0/0x2e0
[ 38.184183] ? async_page_fault+0x8/0x30
[ 38.184802] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 38.185610] RIP: 0033:0x7fc525d74091
[ 38.186155] Code: fe ff ff 50 48 8d 3d b6 b6 09 00 e8 59 05 02 00 66 0f 1f 84 00 00 00 00 00 48 8d 05 51 39 2d 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 41 54 49 89 d4 55 48
[ 38.188980] RSP: 002b:00007fffd6802a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 38.190153] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc525d74091
[ 38.191240] RDX: 0000000000010000 RSI: 00000000017f90b0 RDI: 0000000000000004
[ 38.192329] RBP: 0000000000010000 R08: 00007fc526043420 R09: 0000000000000001
[ 38.194593] R10: 00000000017e8010 R11: 0000000000000246 R12: 00000000017f90b0
[ 38.196823] R13: 0000000000000004 R14: 00007fffd6802ac8 R15: 00007fffd6802cb0

[ 38.200526] The buggy address belongs to the variable:
[ 38.202539] kclist_head+0x10/0x440

[ 38.205568] Memory state around the buggy address:
[ 38.207411] ffffffffa6c0f600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 38.209648] ffffffffa6c0f680: 00 00 00 00 00 00 00 00 04 fa fa fa fa fa fa fa
[ 38.211812] >ffffffffa6c0f700: 00 00 00 00 00 fa fa fa fa fa fa fa 00 00 fa fa
[ 38.213936] ^
[ 38.216010] ffffffffa6c0f780: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
[ 38.218178] ffffffffa6c0f800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 38.220306] ==================================================================

where
[ 37.636589] read_kcore+0xd5c/0xf20
symbolizes to
[< none >] read_kcore+0xd5c/0xf20 fs/proc/kcore.c:454
, the above check.


I haven't checked but I think I am in the first case below:
if (&m->list == &kclist_head) {
meaning no address matched in the list, so you cannot check m->addr and
m->size in this case -- I'm afraid you will have to run through the list
again when that happens, even though there likely won't be any match for
the next address either.


> + list_for_each_entry(m, &kclist_head, list) {
> + if (start >= m->addr &&
> + start < m->addr + m->size)
> + break;
> + }
> }
>
> if (&m->list == &kclist_head) {

--
Dominique Martinet
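
The pitfall Dominique describes can be reproduced in a standalone userspace sketch of the kernel's intrusive list (kcore_entry and find_entry are hypothetical illustration names, not the actual fs/proc/kcore.c symbols): when a list_for_each_entry()-style loop completes without hitting the break, the cursor aliases the list head, and reading addr/size through it is exactly the out-of-bounds access KASAN reports. A search helper therefore has to signal a miss explicitly rather than hand back that bogus cursor:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace sketch of the intrusive-list pitfall described
 * above. Hypothetical names; not the kernel's fs/proc/kcore.c code. */

struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct kcore_entry {
	unsigned long addr;
	unsigned long size;
	struct list_head list;
};

/* Search mimicking read_kcore()'s loop, with the cached-cursor fast
 * path from the optimization patch. Returns NULL when no segment
 * covers 'start' -- on a full traversal the raw cursor would alias
 * the list head, whose "addr"/"size" are garbage (the KASAN bug). */
static struct kcore_entry *find_entry(struct list_head *head,
				      struct kcore_entry *cached,
				      unsigned long start)
{
	struct list_head *pos;

	/* Fast path: the next page is usually in the same segment. */
	if (cached && start >= cached->addr &&
	    start < cached->addr + cached->size)
		return cached;

	for (pos = head->next; pos != head; pos = pos->next) {
		struct kcore_entry *m =
			container_of(pos, struct kcore_entry, list);
		if (start >= m->addr && start < m->addr + m->size)
			return m;
	}
	return NULL;	/* miss: caller must not reuse a stale cursor */
}
```

Returning NULL on a miss is the same invariant the eventual `m = NULL;` fix restores in read_kcore(): after a page with no matching segment, the cached cursor is discarded instead of being compared against on the next iteration.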

2018-08-29 04:05:32

by Dominique Martinet

Subject: [PATCH] proc/kcore: fix invalid memory access in multi-page read optimization

The 'm' kcore_list item can point to kclist_head, and it is incorrect to
look at m->addr / m->size in this case.
There is no choice but to run through the list of entries for every address
if we did not find any entry in the previous iteration

Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads")
Signed-off-by: Dominique Martinet <[email protected]>
---

Now that I'm looking at bf991c2231117 again, I guess it would be slightly
more efficient to remove the !m check and initialize m to point to
kclist_head like this:
m = list_entry(&kclist_head, struct kcore_list, list);
but it feels a bit forced to me; deferring the choice to others.

fs/proc/kcore.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ad72261ee3fe..50036f6e1f52 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -451,7 +451,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
* If this is the first iteration or the address is not within
* the previous entry, search for a matching entry.
*/
- if (!m || start < m->addr || start >= m->addr + m->size) {
+ if (!m || &m->list == &kclist_head || start < m->addr ||
+ start >= m->addr + m->size) {
list_for_each_entry(m, &kclist_head, list) {
if (start >= m->addr &&
start < m->addr + m->size)
--
2.17.1


2018-09-04 18:05:26

by Omar Sandoval

Subject: Re: [PATCH] proc/kcore: fix invalid memory access in multi-page read optimization

On Wed, Aug 29, 2018 at 06:04:07AM +0200, Dominique Martinet wrote:
> The 'm' kcore_list item can point to kclist_head, and it is incorrect to
> look at m->addr / m->size in this case.
> There is no choice but to run through the list of entries for every address
> if we did not find any entry in the previous iteration
>
> Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads")
> Signed-off-by: Dominique Martinet <[email protected]>
> ---
>
> I guess now I'm looking at bf991c2231117 again that it would be slightly
> more efficient to remove the !m check and initialize m to point to
> kclist_head like this:
> m = list_entry(&kclist_head, struct kcore_list, list);
> but it feels a bit forced to me; deferring the choice to others.

Good catch! Sorry I missed this last week, Google decided this was spam
for some reason. How about fixing it like this? One less conditional in
the common case, no hacky list_entry :)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ad72261ee3fe..578926032880 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -464,6 +464,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
ret = -EFAULT;
goto out;
}
+ m = NULL;
} else if (m->type == KCORE_VMALLOC) {
vread(buf, (char *)start, tsz);
/* we have to zero-fill user buffer even if no read */

> fs/proc/kcore.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index ad72261ee3fe..50036f6e1f52 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -451,7 +451,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> * If this is the first iteration or the address is not within
> * the previous entry, search for a matching entry.
> */
> - if (!m || start < m->addr || start >= m->addr + m->size) {
> + if (!m || &m->list == &kclist_head || start < m->addr ||
> + start >= m->addr + m->size) {
> list_for_each_entry(m, &kclist_head, list) {
> if (start >= m->addr &&
> start < m->addr + m->size)
> --
> 2.17.1
>

2018-09-04 22:26:07

by Dominique Martinet

Subject: Re: [PATCH] proc/kcore: fix invalid memory access in multi-page read optimization

Omar Sandoval wrote on Tue, Sep 04, 2018:
> On Wed, Aug 29, 2018 at 06:04:07AM +0200, Dominique Martinet wrote:
> > The 'm' kcore_list item can point to kclist_head, and it is incorrect to
> > look at m->addr / m->size in this case.
> > There is no choice but to run through the list of entries for every address
> > if we did not find any entry in the previous iteration
> >
> > Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads")
> > Signed-off-by: Dominique Martinet <[email protected]>
> > ---
> >
> > I guess now I'm looking at bf991c2231117 again that it would be slightly
> > more efficient to remove the !m check and initialize m to point to
> > kclist_head like this:
> > m = list_entry(&kclist_head, struct kcore_list, list);
> > but it feels a bit forced to me; deferring the choice to others.
>
> Good catch! Sorry I missed this last week, Google decided this was spam
> for some reason.

Joys of self-hosted emails, it happens from time to time :/

> How about fixing it like this? One less conditional in the common
> case, no hacky list_entry :)

Good idea, I'll send a v2 in a few minutes after rebooting into it, no
reason it won't work but might as well make earth a slightly warmer
place along the way.

--
Dominique

2018-09-04 22:37:11

by Dominique Martinet

Subject: [PATCH v2] proc/kcore: fix invalid memory access in multi-page read optimization

From: Dominique Martinet <[email protected]>

The 'm' kcore_list item could point to kclist_head, and it is incorrect to
look at m->addr / m->size in this case.
There is no choice but to run through the list of entries for every address
if we did not find any entry in the previous iteration

Reset 'm' to NULL in that case at Omar Sandoval's suggestion.

Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads")
Signed-off-by: Dominique Martinet <[email protected]>
---
fs/proc/kcore.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ad72261ee3fe..578926032880 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -464,6 +464,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
ret = -EFAULT;
goto out;
}
+ m = NULL;
} else if (m->type == KCORE_VMALLOC) {
vread(buf, (char *)start, tsz);
/* we have to zero-fill user buffer even if no read */
--
2.17.1


2018-09-04 22:40:20

by Dominique Martinet

Subject: [PATCH v3] proc/kcore: fix invalid memory access in multi-page read optimization

The 'm' kcore_list item could point to kclist_head, and it is incorrect to
look at m->addr / m->size in this case.
There is no choice but to run through the list of entries for every address
if we did not find any entry in the previous iteration

Reset 'm' to NULL in that case at Omar Sandoval's suggestion.

Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads")
Signed-off-by: Dominique Martinet <[email protected]>
---

Sorry, resent v2 because From didn't match sob tag

fs/proc/kcore.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index ad72261ee3fe..578926032880 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -464,6 +464,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
ret = -EFAULT;
goto out;
}
+ m = NULL;
} else if (m->type == KCORE_VMALLOC) {
vread(buf, (char *)start, tsz);
/* we have to zero-fill user buffer even if no read */
--
2.17.1


2018-09-04 22:43:16

by Omar Sandoval

Subject: Re: [PATCH v3] proc/kcore: fix invalid memory access in multi-page read optimization

On Wed, Sep 05, 2018 at 12:38:22AM +0200, Dominique Martinet wrote:
> The 'm' kcore_list item could point to kclist_head, and it is incorrect to
> look at m->addr / m->size in this case.
> There is no choice but to run through the list of entries for every address
> if we did not find any entry in the previous iteration
>
> Reset 'm' to NULL in that case at Omar Sandoval's suggestion.
>
> Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads")

Reviewed-by: Omar Sandoval <[email protected]>

Thanks again for catching this!

> Signed-off-by: Dominique Martinet <[email protected]>
> ---
>
> Sorry, resent v2 because From didn't match sob tag
>
> fs/proc/kcore.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index ad72261ee3fe..578926032880 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -464,6 +464,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> ret = -EFAULT;
> goto out;
> }
> + m = NULL;
> } else if (m->type == KCORE_VMALLOC) {
> vread(buf, (char *)start, tsz);
> /* we have to zero-fill user buffer even if no read */
> --
> 2.17.1
>

2018-09-05 19:58:47

by Bhupesh Sharma

Subject: Re: [PATCH v3] proc/kcore: fix invalid memory access in multi-page read optimization

On Wed, Sep 5, 2018 at 4:11 AM, Omar Sandoval <[email protected]> wrote:
> On Wed, Sep 05, 2018 at 12:38:22AM +0200, Dominique Martinet wrote:
>> The 'm' kcore_list item could point to kclist_head, and it is incorrect to
>> look at m->addr / m->size in this case.
>> There is no choice but to run through the list of entries for every address
>> if we did not find any entry in the previous iteration
>>
>> Reset 'm' to NULL in that case at Omar Sandoval's suggestion.
>>
>> Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads")
>
> Reviewed-by: Omar Sandoval <[email protected]>
>
> Thanks again for catching this!
>
>> Signed-off-by: Dominique Martinet <[email protected]>
>> ---
>>
>> Sorry, resent v2 because From didn't match sob tag
>>
>> fs/proc/kcore.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
>> index ad72261ee3fe..578926032880 100644
>> --- a/fs/proc/kcore.c
>> +++ b/fs/proc/kcore.c
>> @@ -464,6 +464,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>> ret = -EFAULT;
>> goto out;
>> }
>> + m = NULL;
>> } else if (m->type == KCORE_VMALLOC) {
>> vread(buf, (char *)start, tsz);
>> /* we have to zero-fill user buffer even if no read */
>> --
>> 2.17.1

Looks sane to me, so:

Reviewed-by: Bhupesh Sharma <[email protected]>

Thanks.

2018-09-05 20:58:49

by Andrew Morton

Subject: Re: [PATCH v3] proc/kcore: fix invalid memory access in multi-page read optimization

On Wed, 5 Sep 2018 00:38:22 +0200 Dominique Martinet <[email protected]> wrote:

> The 'm' kcore_list item could point to kclist_head, and it is incorrect to
> look at m->addr / m->size in this case.
> There is no choice but to run through the list of entries for every address
> if we did not find any entry in the previous iteration
>
> Reset 'm' to NULL in that case at Omar Sandoval's suggestion.
>
> ...
>
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -464,6 +464,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> ret = -EFAULT;
> goto out;
> }
> + m = NULL;
> } else if (m->type == KCORE_VMALLOC) {
> vread(buf, (char *)start, tsz);
> /* we have to zero-fill user buffer even if no read */

lgtm. Let's add a nice little why-we're-doing-this?

--- a/fs/proc/kcore.c~proc-kcore-fix-invalid-memory-access-in-multi-page-read-optimization-v3-fix
+++ a/fs/proc/kcore.c
@@ -464,7 +464,7 @@ read_kcore(struct file *file, char __use
ret = -EFAULT;
goto out;
}
- m = NULL;
+ m = NULL; /* skip the list anchor */
} else if (m->type == KCORE_VMALLOC) {
vread(buf, (char *)start, tsz);
/* we have to zero-fill user buffer even if no read */
_


2018-09-05 22:02:38

by Dominique Martinet

Subject: Re: [PATCH v3] proc/kcore: fix invalid memory access in multi-page read optimization

Andrew Morton wrote on Wed, Sep 05, 2018:
> lgtm. Let's add a nice little why-were-doing-this?

Sure, thank you.

> --- a/fs/proc/kcore.c~proc-kcore-fix-invalid-memory-access-in-multi-page-read-optimization-v3-fix
> +++ a/fs/proc/kcore.c
> @@ -464,7 +464,7 @@ read_kcore(struct file *file, char __use
> ret = -EFAULT;
> goto out;
> }
> - m = NULL;
> + m = NULL; /* skip the list anchor */
> } else if (m->type == KCORE_VMALLOC) {
> vread(buf, (char *)start, tsz);
> /* we have to zero-fill user buffer even if no read */

--
Dominique