2024-01-19 07:29:42

by Hou Tao

Subject: [PATCH bpf 0/3] Fix the read of vsyscall page through bpf

From: Hou Tao <[email protected]>

Hi,

As reported by syzbot [1] and [2], reading the vsyscall page through
bpf_probe_read_kernel() or bpf_probe_read() triggers an oops.

Thomas Gleixner had proposed a test patch [3], but it seems that no
formal patch was posted in the following month [4], so I am posting it
instead and have added an Originally-from tag to patch #2.

Patch #1 makes is_vsyscall_vaddr() a common helper. Patch #2 fixes the
problem by disallowing vsyscall page reads for
copy_from_kernel_nofault(). Patch #3 adds a test case to ensure that
reading the vsyscall page through bpf is rejected. Although the vsyscall
page can be disabled by vsyscall=none, that doesn't affect the
reproduction of the problem or the added test.

Comments are always welcome.

[1]: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com/
[2]: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com/
[3]: https://lore.kernel.org/bpf/87r0jwquhv.ffs@tglx/
[4]: https://lore.kernel.org/bpf/[email protected]/

Hou Tao (3):
x86/mm: Move is_vsyscall_vaddr() into mm_internal.h
x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
selftest/bpf: Test the read of vsyscall page under x86-64

arch/x86/mm/fault.c | 11 +---
arch/x86/mm/maccess.c | 6 ++
arch/x86/mm/mm_internal.h | 13 ++++
.../selftests/bpf/prog_tests/read_vsyscall.c | 61 +++++++++++++++++++
.../selftests/bpf/progs/read_vsyscall.c | 45 ++++++++++++++
5 files changed, 127 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c

--
2.29.2



2024-01-19 07:29:48

by Hou Tao

Subject: [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h

From: Hou Tao <[email protected]>

Move is_vsyscall_vaddr() into mm_internal.h to make it available to
copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.

Signed-off-by: Hou Tao <[email protected]>
---
arch/x86/mm/fault.c | 11 ++---------
arch/x86/mm/mm_internal.h | 13 +++++++++++++
2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 679b09cfe241c..69e007761d9a9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -38,6 +38,8 @@
#define CREATE_TRACE_POINTS
#include <asm/trace/exceptions.h>

+#include "mm_internal.h"
+
/*
* Returns 0 if mmiotrace is disabled, or if the fault is not
* handled by mmiotrace:
@@ -798,15 +800,6 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
show_opcodes(regs, loglvl);
}

-/*
- * The (legacy) vsyscall page is the long page in the kernel portion
- * of the address space that has user-accessible permissions.
- */
-static bool is_vsyscall_vaddr(unsigned long vaddr)
-{
- return unlikely((vaddr & PAGE_MASK) == VSYSCALL_ADDR);
-}
-
static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
unsigned long address, u32 pkey, int si_code)
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index 3f37b5c80bb32..4ebf6051e1ed7 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -2,6 +2,10 @@
#ifndef __X86_MM_INTERNAL_H
#define __X86_MM_INTERNAL_H

+#include <uapi/asm/vsyscall.h>
+
+#include <asm/page_types.h>
+
void *alloc_low_pages(unsigned int num);
static inline void *alloc_low_page(void)
{
@@ -25,4 +29,13 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache);

extern unsigned long tlb_single_page_flush_ceiling;

+/*
+ * The (legacy) vsyscall page is the long page in the kernel portion
+ * of the address space that has user-accessible permissions.
+ */
+static inline bool is_vsyscall_vaddr(unsigned long vaddr)
+{
+ return unlikely((vaddr & PAGE_MASK) == VSYSCALL_ADDR);
+}
+
#endif /* __X86_MM_INTERNAL_H */
--
2.29.2


2024-01-19 07:29:53

by Hou Tao

Subject: [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()

From: Hou Tao <[email protected]>

When trying to use copy_from_kernel_nofault() to read vsyscall page
through a bpf program, the following oops was reported:

BUG: unable to handle page fault for address: ffffffffff600000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
......
Call Trace:
<TASK>
? copy_from_kernel_nofault+0x6f/0x110
bpf_probe_read_kernel+0x1d/0x50
bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
trace_call_bpf+0xc5/0x1c0
perf_call_bpf_enter.isra.0+0x69/0xb0
perf_syscall_enter+0x13e/0x200
syscall_trace_enter+0x188/0x1c0
do_syscall_64+0xb5/0xe0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
</TASK>
......
---[ end trace 0000000000000000 ]---

The oops happens as follows: a bpf program uses bpf_probe_read_kernel()
to read from the vsyscall page, bpf_probe_read_kernel() invokes
copy_from_kernel_nofault(), which in turn invokes __get_user_asm(). A
page fault exception is triggered accordingly, but handle_page_fault()
considers the vsyscall page address to be a userspace address rather
than a kernel address, so the fixup set up by bpf isn't applied.
Because the exception happens in kernel space and page fault handling
is disabled, page_fault_oops() is invoked and an oops happens.

Fix it by disallowing vsyscall page read for copy_from_kernel_nofault().

Originally-from: Thomas Gleixner <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com
Reported-by: xingwei lee <[email protected]>
Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com
Signed-off-by: Hou Tao <[email protected]>
---
arch/x86/mm/maccess.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
index 6993f026adec9..bb454e0abbfcf 100644
--- a/arch/x86/mm/maccess.c
+++ b/arch/x86/mm/maccess.c
@@ -3,6 +3,8 @@
#include <linux/uaccess.h>
#include <linux/kernel.h>

+#include "mm_internal.h"
+
#ifdef CONFIG_X86_64
bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
{
@@ -15,6 +17,10 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
return false;

+ /* vsyscall page is also considered as userspace address. */
+ if (is_vsyscall_vaddr(vaddr))
+ return false;
+
/*
* Allow everything during early boot before 'x86_virt_bits'
* is initialized. Needed for instruction decoding in early
--
2.29.2


2024-01-19 07:30:23

by Hou Tao

Subject: [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64

From: Hou Tao <[email protected]>

Using bpf_probe_read_kernel{_str}() or bpf_probe_read{_str}() to read
from the vsyscall page on x86-64 triggers an oops, so add a test case
to ensure that the problem is fixed.

Besides the four bpf helpers mentioned above, also test reading the
vsyscall page with bpf_probe_read_user{_str}() and
bpf_copy_from_user{_task}().

The vsyscall page can be disabled by CONFIG_LEGACY_VSYSCALL_NONE or the
vsyscall=none boot command line, but that doesn't affect the
reproduction of the problem or the returned error codes.

Signed-off-by: Hou Tao <[email protected]>
---
.../selftests/bpf/prog_tests/read_vsyscall.c | 61 +++++++++++++++++++
.../selftests/bpf/progs/read_vsyscall.c | 45 ++++++++++++++
2 files changed, 106 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c

diff --git a/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
new file mode 100644
index 0000000000000..d9247cc89cf3e
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
+#include "test_progs.h"
+#include "read_vsyscall.skel.h"
+
+#if defined(__x86_64__)
+/* For VSYSCALL_ADDR */
+#include <asm/vsyscall.h>
+#else
+/* To prevent build failure on non-x86 arch */
+#define VSYSCALL_ADDR 0UL
+#endif
+
+struct read_ret_desc {
+ const char *name;
+ int ret;
+} all_read[] = {
+ { .name = "probe_read_kernel", .ret = -ERANGE },
+ { .name = "probe_read_kernel_str", .ret = -ERANGE },
+ { .name = "probe_read", .ret = -ERANGE },
+ { .name = "probe_read_str", .ret = -ERANGE },
+ /* __access_ok() will fail */
+ { .name = "probe_read_user", .ret = -EFAULT },
+ /* __access_ok() will fail */
+ { .name = "probe_read_user_str", .ret = -EFAULT },
+ /* access_ok() will fail */
+ { .name = "copy_from_user", .ret = -EFAULT },
+ /* both vma_lookup() and expand_stack() will fail */
+ { .name = "copy_from_user_task", .ret = -EFAULT },
+};
+
+void test_read_vsyscall(void)
+{
+ struct read_vsyscall *skel;
+ unsigned int i;
+ int err;
+
+#if !defined(__x86_64__)
+ test__skip();
+ return;
+#endif
+ skel = read_vsyscall__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "read_vsyscall open_load"))
+ return;
+
+ skel->bss->target_pid = getpid();
+ err = read_vsyscall__attach(skel);
+ if (!ASSERT_EQ(err, 0, "read_vsyscall attach"))
+ goto out;
+
+ /* userspace may not have the vsyscall page due to LEGACY_VSYSCALL_NONE,
+ * but it doesn't affect the returned error codes.
+ */
+ skel->bss->user_ptr = (void *)VSYSCALL_ADDR;
+ usleep(1);
+
+ for (i = 0; i < ARRAY_SIZE(all_read); i++)
+ ASSERT_EQ(skel->bss->read_ret[i], all_read[i].ret, all_read[i].name);
+out:
+ read_vsyscall__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/read_vsyscall.c b/tools/testing/selftests/bpf/progs/read_vsyscall.c
new file mode 100644
index 0000000000000..986f96687ae15
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/read_vsyscall.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <bpf/bpf_helpers.h>
+
+#include "bpf_misc.h"
+
+int target_pid = 0;
+void *user_ptr = 0;
+int read_ret[8];
+
+char _license[] SEC("license") = "GPL";
+
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+int do_probe_read(void *ctx)
+{
+ char buf[8];
+
+ if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
+ return 0;
+
+ read_ret[0] = bpf_probe_read_kernel(buf, sizeof(buf), user_ptr);
+ read_ret[1] = bpf_probe_read_kernel_str(buf, sizeof(buf), user_ptr);
+ read_ret[2] = bpf_probe_read(buf, sizeof(buf), user_ptr);
+ read_ret[3] = bpf_probe_read_str(buf, sizeof(buf), user_ptr);
+ read_ret[4] = bpf_probe_read_user(buf, sizeof(buf), user_ptr);
+ read_ret[5] = bpf_probe_read_user_str(buf, sizeof(buf), user_ptr);
+
+ return 0;
+}
+
+SEC("fentry.s/" SYS_PREFIX "sys_nanosleep")
+int do_copy_from_user(void *ctx)
+{
+ char buf[8];
+
+ if ((bpf_get_current_pid_tgid() >> 32) != target_pid)
+ return 0;
+
+ read_ret[6] = bpf_copy_from_user(buf, sizeof(buf), user_ptr);
+ read_ret[7] = bpf_copy_from_user_task(buf, sizeof(buf), user_ptr,
+ bpf_get_current_task_btf(), 0);
+
+ return 0;
+}
--
2.29.2


2024-01-20 00:35:30

by Sohil Mehta

Subject: Re: [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h

On 1/18/2024 11:30 PM, Hou Tao wrote:
> From: Hou Tao <[email protected]>
>
> Moving is_vsyscall_vaddr() into mm_internal.h to make it available for
> copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.
>

Instead of mm_internal.h would a better place for is_vsyscall_vaddr() be
arch/x86/include/asm/vsyscall.h?

Sohil

2024-01-20 01:32:23

by Hou Tao

Subject: Re: [PATCH bpf 1/3] x86/mm: Move is_vsyscall_vaddr() into mm_internal.h

Hi,

On 1/20/2024 8:35 AM, Sohil Mehta wrote:
> On 1/18/2024 11:30 PM, Hou Tao wrote:
>> From: Hou Tao <[email protected]>
>>
>> Moving is_vsyscall_vaddr() into mm_internal.h to make it available for
>> copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.
>>
> Instead of mm_internal.h would a better place for is_vsyscall_vaddr() be
> arch/x86/include/asm/vsyscall.h?

Yes, asm/vsyscall.h is better indeed. Will update in v2. Thanks for the
suggestion.
>
> Sohil
>
> .


2024-01-22 06:31:13

by Yonghong Song

Subject: Re: [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64


On 1/18/24 11:30 PM, Hou Tao wrote:
> From: Hou Tao <[email protected]>
>
> Using bpf_probe_read_kernel{_str}() or bpf_probe_read{_str}() to read
> from vsyscall page under x86-64 will trigger oops, so add one test case
> to ensure that the problem is fixed.
>
> Beside those four bpf helpers mentioned above, testing the read of
> vsyscall page by using bpf_probe_read_user{_str} and
> bpf_copy_from_user{_task}() as well.
>
> vsyscall page could be disabled by CONFIG_LEGACY_VSYSCALL_NONE or
> vsyscall=none boot cmd-line, but it doesn't affect the reproduce of the
> problem and the returned error codes.
>
> Signed-off-by: Hou Tao <[email protected]>
> ---
> .../selftests/bpf/prog_tests/read_vsyscall.c | 61 +++++++++++++++++++
> .../selftests/bpf/progs/read_vsyscall.c | 45 ++++++++++++++
> 2 files changed, 106 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
> create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
> new file mode 100644
> index 0000000000000..d9247cc89cf3e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
> @@ -0,0 +1,61 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
> +#include "test_progs.h"
> +#include "read_vsyscall.skel.h"
> +
> +#if defined(__x86_64__)
> +/* For VSYSCALL_ADDR */
> +#include <asm/vsyscall.h>
> +#else
> +/* To prevent build failure on non-x86 arch */
> +#define VSYSCALL_ADDR 0UL
> +#endif
> +
> +struct read_ret_desc {
> + const char *name;
> + int ret;
> +} all_read[] = {
> + { .name = "probe_read_kernel", .ret = -ERANGE },
> + { .name = "probe_read_kernel_str", .ret = -ERANGE },
> + { .name = "probe_read", .ret = -ERANGE },
> + { .name = "probe_read_str", .ret = -ERANGE },
> + /* __access_ok() will fail */
> + { .name = "probe_read_user", .ret = -EFAULT },
> + /* __access_ok() will fail */
> + { .name = "probe_read_user_str", .ret = -EFAULT },
> + /* access_ok() will fail */
> + { .name = "copy_from_user", .ret = -EFAULT },
> + /* both vma_lookup() and expand_stack() will fail */
> + { .name = "copy_from_user_task", .ret = -EFAULT },

The above comments are not clear enough. For example, for
'__access_ok() will fail', the user will need to check the source code
to find where __access_ok() is invoked, and this could be hard, e.g.,
for probe_read_user_str(). As another example, for 'both vma_lookup()
and expand_stack() will fail', where are vma_lookup() and
expand_stack() called? The user needs to dig further to make sense of
it.

I suggest removing the above comments and adding a more detailed
explanation in the commit message, with call stacks indicating where
the failure/error return happens.

> +};
> +
> +void test_read_vsyscall(void)
> +{
> + struct read_vsyscall *skel;
> + unsigned int i;
> + int err;
> +
> +#if !defined(__x86_64__)
> + test__skip();
> + return;
> +#endif
> + skel = read_vsyscall__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "read_vsyscall open_load"))
> + return;
> +
> + skel->bss->target_pid = getpid();
> + err = read_vsyscall__attach(skel);
> + if (!ASSERT_EQ(err, 0, "read_vsyscall attach"))
> + goto out;
> +
> + /* userspace may don't have vsyscall page due to LEGACY_VSYSCALL_NONE,
> + * but it doesn't affect the returned error codes.
> + */
> + skel->bss->user_ptr = (void *)VSYSCALL_ADDR;
> + usleep(1);
> +
> + for (i = 0; i < ARRAY_SIZE(all_read); i++)
> + ASSERT_EQ(skel->bss->read_ret[i], all_read[i].ret, all_read[i].name);
> +out:
> + read_vsyscall__destroy(skel);
> +}
[...]

2024-01-23 00:55:21

by Sohil Mehta

Subject: Re: [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()

On 1/18/2024 11:30 PM, Hou Tao wrote:
> From: Hou Tao <[email protected]>
>
> When trying to use copy_from_kernel_nofault() to read vsyscall page
> through a bpf program, the following oops was reported:
>
> BUG: unable to handle page fault for address: ffffffffff600000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
> Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
> RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
> ......
> Call Trace:
> <TASK>
> ? copy_from_kernel_nofault+0x6f/0x110
> bpf_probe_read_kernel+0x1d/0x50
> bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
> trace_call_bpf+0xc5/0x1c0
> perf_call_bpf_enter.isra.0+0x69/0xb0
> perf_syscall_enter+0x13e/0x200
> syscall_trace_enter+0x188/0x1c0
> do_syscall_64+0xb5/0xe0
> entry_SYSCALL_64_after_hwframe+0x6e/0x76
> </TASK>
> ......
> ---[ end trace 0000000000000000 ]---
>
> The oops happens as follows: A bpf program uses bpf_probe_read_kernel()
> to read from vsyscall page, bpf_probe_read_kernel() invokes
> copy_from_kernel_nofault() in turn and then invokes __get_user_asm(). A
> page fault exception is triggered accordingly, but handle_page_fault()
> considers the vsyscall page address as a userspace address instead of
> a kernel space address, so the fix-up set-up by bpf isn't applied.

This comment and the one in the code below seem contradictory and
confusing. Do we want the vsyscall page address to be considered as a
userspace address or not?

IIUC, the issue here is that the vsyscall page (in xonly mode) is not
really mapped and therefore running copy_from_kernel_nofault() on this
address is incorrect. This patch fixes this by making
copy_from_kernel_nofault() return an error for a vsyscall address.


> Because the exception happens in kernel space and page fault handling is
> disabled, page_fault_oops() is invoked and an oops happens.
>
> Fix it by disallowing vsyscall page read for copy_from_kernel_nofault().
>

[Maybe I have misunderstood the issue here and following questions are
not even relevant.]

But, what about vsyscall=emulate? In that mode the page is actually
mapped. Would we want the page read to go through then?

> Originally-from: Thomas Gleixner <[email protected]>

Documentation/process/maintainer-tip.rst says to use "Originally-by:"


> diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
> index 6993f026adec9..bb454e0abbfcf 100644
> --- a/arch/x86/mm/maccess.c
> +++ b/arch/x86/mm/maccess.c
> @@ -3,6 +3,8 @@
> #include <linux/uaccess.h>
> #include <linux/kernel.h>
>
> +#include "mm_internal.h"
> +
> #ifdef CONFIG_X86_64
> bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
> {
> @@ -15,6 +17,10 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
> if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
> return false;
>
> + /* vsyscall page is also considered as userspace address. */

A bit more explanation about why this should happen might be useful.

> + if (is_vsyscall_vaddr(vaddr))
> + return false;
> +
> /*
> * Allow everything during early boot before 'x86_virt_bits'
> * is initialized. Needed for instruction decoding in early


2024-01-23 01:10:53

by Sohil Mehta

Subject: Re: [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64

On 1/18/2024 11:30 PM, Hou Tao wrote:

> vsyscall page could be disabled by CONFIG_LEGACY_VSYSCALL_NONE or
> vsyscall=none boot cmd-line, but it doesn't affect the reproduce of the
> problem and the returned error codes.
>

With vsyscall=emulate, a direct read of the vsyscall address from
userspace is expected to go through. This mode is deprecated, so maybe
it wouldn't matter much. Without the fix in patch 2/3, do you see the
same behavior with vsyscall=emulate set in the cmdline?

Sohil


2024-01-25 06:42:07

by Hou Tao

Subject: Re: [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64

Hi,

On 1/22/2024 2:30 PM, Yonghong Song wrote:
>
> On 1/18/24 11:30 PM, Hou Tao wrote:
>> From: Hou Tao <[email protected]>
>>
>> Using bpf_probe_read_kernel{_str}() or bpf_probe_read{_str}() to read
>> from vsyscall page under x86-64 will trigger oops, so add one test case
>> to ensure that the problem is fixed.
>>
>> Beside those four bpf helpers mentioned above, testing the read of
>> vsyscall page by using bpf_probe_read_user{_str} and
>> bpf_copy_from_user{_task}() as well.
>>
>> vsyscall page could be disabled by CONFIG_LEGACY_VSYSCALL_NONE or
>> vsyscall=none boot cmd-line, but it doesn't affect the reproduce of the
>> problem and the returned error codes.
>>
>> Signed-off-by: Hou Tao <[email protected]>
>> ---
>>   .../selftests/bpf/prog_tests/read_vsyscall.c  | 61 +++++++++++++++++++
>>   .../selftests/bpf/progs/read_vsyscall.c       | 45 ++++++++++++++
>>   2 files changed, 106 insertions(+)
>>   create mode 100644
>> tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/read_vsyscall.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
>> b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
>> new file mode 100644
>> index 0000000000000..d9247cc89cf3e
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/read_vsyscall.c
>> @@ -0,0 +1,61 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
>> +#include "test_progs.h"
>> +#include "read_vsyscall.skel.h"
>> +
>> +#if defined(__x86_64__)
>> +/* For VSYSCALL_ADDR */
>> +#include <asm/vsyscall.h>
>> +#else
>> +/* To prevent build failure on non-x86 arch */
>> +#define VSYSCALL_ADDR 0UL
>> +#endif
>> +
>> +struct read_ret_desc {
>> +    const char *name;
>> +    int ret;
>> +} all_read[] = {
>> +    { .name = "probe_read_kernel", .ret = -ERANGE },
>> +    { .name = "probe_read_kernel_str", .ret = -ERANGE },
>> +    { .name = "probe_read", .ret = -ERANGE },
>> +    { .name = "probe_read_str", .ret = -ERANGE },
>> +    /* __access_ok() will fail */
>> +    { .name = "probe_read_user", .ret = -EFAULT },
>> +    /* __access_ok() will fail */
>> +    { .name = "probe_read_user_str", .ret = -EFAULT },
>> +    /* access_ok() will fail */
>> +    { .name = "copy_from_user", .ret = -EFAULT },
>> +    /* both vma_lookup() and expand_stack() will fail */
>> +    { .name = "copy_from_user_task", .ret = -EFAULT },
>
> The above comments are not clear enough. For example,
> '__access_ok() will fail', user will need to
> check the source code where __access_ok() is and
> this could be hard e.g., for probe_read_user_str().
> Another example, 'both vma_lookup() and expand_stack() will fail',
> where is vma_lookup()/expand_stack()? User needs to further
> check to make sense.

Yes. These comments are highly coupled with the implementation.
>
> I suggest remove the above comments and add more
> detailed explanation in commit messages with callstack
> indicating where the fail/error return happens.

Will do in v2. Thanks for the suggestions.
>
>> +};
>> +
>> +void test_read_vsyscall(void)
>> +{
>> +    struct read_vsyscall *skel;
>> +    unsigned int i;
>> +    int err;
>> +
>> +#if !defined(__x86_64__)
>> +    test__skip();
>> +    return;
>> +#endif
>> +    skel = read_vsyscall__open_and_load();
>> +    if (!ASSERT_OK_PTR(skel, "read_vsyscall open_load"))
>> +        return;
>> +
>> +    skel->bss->target_pid = getpid();
>> +    err = read_vsyscall__attach(skel);
>> +    if (!ASSERT_EQ(err, 0, "read_vsyscall attach"))
>> +        goto out;
>> +
>> +    /* userspace may don't have vsyscall page due to
>> LEGACY_VSYSCALL_NONE,
>> +     * but it doesn't affect the returned error codes.
>> +     */
>> +    skel->bss->user_ptr = (void *)VSYSCALL_ADDR;
>> +    usleep(1);
>> +
>> +    for (i = 0; i < ARRAY_SIZE(all_read); i++)
>> +        ASSERT_EQ(skel->bss->read_ret[i], all_read[i].ret,
>> all_read[i].name);
>> +out:
>> +    read_vsyscall__destroy(skel);
>> +}
> [...]


2024-01-25 07:18:26

by Hou Tao

Subject: Re: [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()

Hi,

On 1/23/2024 8:18 AM, Sohil Mehta wrote:
> On 1/18/2024 11:30 PM, Hou Tao wrote:
>> From: Hou Tao <[email protected]>
>>
>> When trying to use copy_from_kernel_nofault() to read vsyscall page
>> through a bpf program, the following oops was reported:
>>
>> BUG: unable to handle page fault for address: ffffffffff600000
>> #PF: supervisor read access in kernel mode
>> #PF: error_code(0x0000) - not-present page
>> PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
>> Oops: 0000 [#1] PREEMPT SMP PTI
>> CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
>> RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
>> ......
>> Call Trace:
>> <TASK>
>> ? copy_from_kernel_nofault+0x6f/0x110
>> bpf_probe_read_kernel+0x1d/0x50
>> bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
>> trace_call_bpf+0xc5/0x1c0
>> perf_call_bpf_enter.isra.0+0x69/0xb0
>> perf_syscall_enter+0x13e/0x200
>> syscall_trace_enter+0x188/0x1c0
>> do_syscall_64+0xb5/0xe0
>> entry_SYSCALL_64_after_hwframe+0x6e/0x76
>> </TASK>
>> ......
>> ---[ end trace 0000000000000000 ]---
>>
>> The oops happens as follows: A bpf program uses bpf_probe_read_kernel()
>> to read from vsyscall page, bpf_probe_read_kernel() invokes
>> copy_from_kernel_nofault() in turn and then invokes __get_user_asm(). A
>> page fault exception is triggered accordingly, but handle_page_fault()
>> considers the vsyscall page address as a userspace address instead of
>> a kernel space address, so the fix-up set-up by bpf isn't applied.
> This comment and the one in the code below seem contradictory and
> confusing. Do we want the vsyscall page address to be considered as a
> userspace address or not?

handle_page_fault() already considers the vsyscall page to be a
userspace address, and in this patch we update
copy_from_kernel_nofault() to consider the vsyscall page a userspace
address as well.
>
> IIUC, the issue here is that the vsyscall page (in xonly mode) is not
> really mapped and therefore running copy_from_kernel_nofault() on this
> address is incorrect. This patch fixes this by making
> copy_from_kernel_nofault() return an error for a vsyscall address.
>

Yes, but the issue may occur in the vsyscall=none case as well, because
fault_in_kernel_space(), invoked by handle_page_fault(), will return
false. So in do_user_addr_fault(), when the SMAP feature is enabled,
the invocation of copy_from_kernel_nofault() will trigger an oops due
to the following code snippet:

        if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
                     !(error_code & X86_PF_USER) &&
                     !(regs->flags & X86_EFLAGS_AC))) {
                /*
                 * No extable entry here.  This was a kernel access to an
                 * invalid pointer.  get_kernel_nofault() will not get here.
                 */
                page_fault_oops(regs, error_code, address);
                return;
        }

>> Because the exception happens in kernel space and page fault handling is
>> disabled, page_fault_oops() is invoked and an oops happens.
>>
>> Fix it by disallowing vsyscall page read for copy_from_kernel_nofault().
>>
> [Maybe I have misunderstood the issue here and following questions are
> not even relevant.]
>
> But, what about vsyscall=emulate? In that mode the page is actually
> mapped. Would we want the page read to go through then?

Er, now that the vsyscall page is considered a userspace address, I
think we should reject reading it through copy_from_kernel_nofault()
even if it is mapped.

>
>> Originally-from: Thomas Gleixner <[email protected]>
> Documentation/process/maintainer-tip.rst says to use "Originally-by:"

Thanks for the tip. Will update.
>
>
>> diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
>> index 6993f026adec9..bb454e0abbfcf 100644
>> --- a/arch/x86/mm/maccess.c
>> +++ b/arch/x86/mm/maccess.c
>> @@ -3,6 +3,8 @@
>> #include <linux/uaccess.h>
>> #include <linux/kernel.h>
>>
>> +#include "mm_internal.h"
>> +
>> #ifdef CONFIG_X86_64
>> bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>> {
>> @@ -15,6 +17,10 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>> if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
>> return false;
>>
>> + /* vsyscall page is also considered as userspace address. */
> A bit more explanation about why this should happen might be useful.
>
>> + if (is_vsyscall_vaddr(vaddr))
>> + return false;
>> +
>> /*
>> * Allow everything during early boot before 'x86_virt_bits'
>> * is initialized. Needed for instruction decoding in early


2024-01-25 07:55:45

by Hou Tao

Subject: Re: [PATCH bpf 3/3] selftest/bpf: Test the read of vsyscall page under x86-64



On 1/23/2024 8:25 AM, Sohil Mehta wrote:
> On 1/18/2024 11:30 PM, Hou Tao wrote:
>
>> vsyscall page could be disabled by CONFIG_LEGACY_VSYSCALL_NONE or
>> vsyscall=none boot cmd-line, but it doesn't affect the reproduce of the
>> problem and the returned error codes.
>>
> With vsyscall=emulate a direct read of the vsyscall address from
> userspace is expected to go through. This is mode deprecated so maybe it
> wouldn't matter much. Without the fix in patch 2/3, do you see the same
> behavior with vsyscall=emulate set in the cmdline?

Er, I think it depends on whether or not the SMAP [1] feature is
available. When SMAP is enabled, even if the vsyscall page is
populated, reading the vsyscall page through bpf_probe_read_kernel()
will trigger a page fault and then an oops. But without SMAP,
bpf_probe_read_kernel() will succeed. So I think the test may need to
be skipped if vsyscall_mode is emulate.

[1]: https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
>
> Sohil