Hi,
Here is the v5 series of probe-event to support user-space access.
This version I simplified probe_user_read (Thanks PeterZ!) and move
open-coded non-pagefault strnlen_user code to mm/maccess.c.
Changes in v5:
- [3/6]: Simplify probe_user_read() and add strnlen_unsafe_user()
- [4/6]: Use strnlen_unsafe_user()
V4 series is here;
https://lkml.kernel.org/r/155135700242.438.12930250099644500209.stgit@devbox
====
Kprobe event user-space memory access features:
For user-space access extension, this series adds 2 features,
"ustring" type and user-space dereference syntax. "ustring" is
used for recording a null-terminated string in user-space from
kprobe events.
"ustring" type is easy, it is able to use instead of "string"
type, so if you want to record a user-space string via
"__user char *", you can use ustring type instead of string.
For example,
echo 'p do_sys_open path=+0($arg2):ustring' >> kprobe_events
will record the path string from user-space.
The user-space dereference syntax is also simple. Thi just
adds 'u' prefix before an offset value.
+|-u<OFFSET>(<FETCHARG>)
e.g. +u8(%ax), +u0(+0(%si))
This is more generic. If you want to refer the variable in user-
space from its address or access a field in data structure in
user-space, you need to use this.
For example, if you probe do_sched_setscheduler(pid, policy,
param) and record param->sched_priority, you can add new
probe as below;
p do_sched_setscheduler priority=+u0($arg3)
Actually, with this feature, "ustring" type is not absolutely
necessary, because these are same meanings.
+0($arg2):ustring == +u0($arg2):string
Note that kprobe event provides these methods, but it doesn't
change it from kernel to user automatically because we do not
know whether the given address is in userspace or kernel on
some arch.
For perf-probe, we can add some attribute for each argument
which indicate that the variable in user space. If gcc/clang
supports debuginfo for __user (address_space attribute),
perf-probe can support it and automatically choose the
correct dereference method.
Thank you,
---
Masami Hiramatsu (5):
uaccess: Use user_access_ok() in user_access_begin()
uaccess: Add non-pagefault user-space read functions
tracing/probe: Add ustring type for user-space string
tracing/probe: Support user-space dereference
selftests/ftrace: Add user-memory access syntax testcase
Peter Zijlstra (1):
uaccess: Add user_access_ok()
Documentation/trace/kprobetrace.rst | 28 ++++-
Documentation/trace/uprobetracer.rst | 9 +
arch/x86/include/asm/uaccess.h | 10 +-
include/linux/uaccess.h | 34 +++++-
kernel/trace/trace.c | 7 +
kernel/trace/trace_kprobe.c | 35 ++++++
kernel/trace/trace_probe.c | 36 +++++-
kernel/trace/trace_probe.h | 3
kernel/trace/trace_probe_tmpl.h | 37 +++++-
kernel/trace/trace_uprobe.c | 19 +++
mm/maccess.c | 122 +++++++++++++++++++-
.../ftrace/test.d/kprobe/kprobe_args_user.tc | 31 +++++
12 files changed, 338 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc
--
Masami Hiramatsu (Linaro) <[email protected]>
Add "ustring" type for fetching user-space string from kprobe event.
User can specify ustring type at uprobe event, and it is same as
"string" for uprobe.
Note that probe-event provides this option but it doesn't choose the
correct type automatically since we have not way to decide the address
is in user-space or not on some arch (and on some other arch, you can
fetch the string by "string" type). So user must carefully check the
target code (e.g. if you see __user on the target variable) and
use this new type.
Signed-off-by: Masami Hiramatsu <[email protected]>
Acked-by: Steven Rostedt (VMware) <[email protected]>
---
Changes in v5:
- Use strnlen_unsafe_user() in fetch_store_strlen_user().
Changes in v2:
- Use strnlen_user() instead of open code for fetch_store_strlen_user().
---
Documentation/trace/kprobetrace.rst | 9 +++++++--
kernel/trace/trace.c | 2 +-
kernel/trace/trace_kprobe.c | 29 +++++++++++++++++++++++++++++
kernel/trace/trace_probe.c | 14 +++++++++++---
kernel/trace/trace_probe.h | 1 +
kernel/trace/trace_probe_tmpl.h | 14 +++++++++++++-
kernel/trace/trace_uprobe.c | 12 ++++++++++++
7 files changed, 74 insertions(+), 7 deletions(-)
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 235ce2ab131a..a3ac7c9ac242 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -55,7 +55,8 @@ Synopsis of kprobe_events
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
- (x8/x16/x32/x64), "string" and bitfield are supported.
+ (x8/x16/x32/x64), "string", "ustring" and bitfield
+ are supported.
(\*1) only for the probe on function entry (offs == 0).
(\*2) only for return probe.
@@ -77,7 +78,11 @@ apply it to registers/stack-entries etc. (for example, '$stack1:x8[8]' is
wrong, but '+8($stack):x8[8]' is OK.)
String type is a special type, which fetches a "null-terminated" string from
kernel space. This means it will fail and store NULL if the string container
-has been paged out.
+has been paged out. "ustring" type is an alternative of string for user-space.
+Note that kprobe-event provides string/ustring types, but doesn't change it
+automatically. So user has to decide if the targe string in kernel or in user
+space carefully. On some arch, if you choose wrong one, it always fails to
+record string data.
The string array type is a bit different from other types. For other base
types, <base-type>[1] is equal to <base-type> (e.g. +0(%di):x32[1] is same
as +0(%di):x32.) But string[1] is not equal to string. The string type itself
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c4238b441624..4cacbb0e1538 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4643,7 +4643,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm\n"
#endif
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, string, symbol,\n"
- "\t b<bit-width>@<bit-offset>/<container-size>,\n"
+ "\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
"\t <type>\\[<array-size>\\]\n"
#ifdef CONFIG_HIST_TRIGGERS
"\t field: <stype> <name>;\n"
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 9eaf07f99212..43b5c3230773 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -872,6 +872,14 @@ fetch_store_strlen(unsigned long addr)
return (ret < 0) ? ret : len;
}
+/* Return the length of string -- including null terminal byte */
+static nokprobe_inline int
+fetch_store_strlen_user(unsigned long addr)
+{
+ return strnlen_unsafe_user((__force const void __user *)addr,
+ MAX_STRING_SIZE);
+}
+
/*
* Fetch a null-terminated string. Caller MUST set *(u32 *)buf with max
* length and relative data location.
@@ -896,6 +904,27 @@ fetch_store_string(unsigned long addr, void *dest, void *base)
return ret;
}
+/*
+ * Fetch a null-terminated string from user. Caller MUST set *(u32 *)buf
+ * with max length and relative data location.
+ */
+static nokprobe_inline int
+fetch_store_string_user(unsigned long addr, void *dest, void *base)
+{
+ int maxlen = get_loc_len(*(u32 *)dest);
+ u8 *dst = get_loc_data(dest, base);
+ long ret;
+
+ if (unlikely(!maxlen))
+ return -ENOMEM;
+ ret = strncpy_from_unsafe_user(dst, (__force const void __user *)addr,
+ maxlen);
+
+ if (ret >= 0)
+ *(u32 *)dest = make_data_loc(ret, (void *)dst - base);
+ return ret;
+}
+
static nokprobe_inline int
probe_mem_read(void *dest, void *src, size_t size)
{
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 9962cb5da8ac..a7012de37a00 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -73,6 +73,8 @@ static const struct fetch_type probe_fetch_types[] = {
/* Special types */
__ASSIGN_FETCH_TYPE("string", string, string, sizeof(u32), 1,
"__data_loc char[]"),
+ __ASSIGN_FETCH_TYPE("ustring", string, string, sizeof(u32), 1,
+ "__data_loc char[]"),
/* Basic types */
ASSIGN_FETCH_TYPE(u8, u8, 0),
ASSIGN_FETCH_TYPE(u16, u16, 0),
@@ -440,7 +442,8 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size,
goto fail;
/* Store operation */
- if (!strcmp(parg->type->name, "string")) {
+ if (!strcmp(parg->type->name, "string") ||
+ !strcmp(parg->type->name, "ustring")) {
if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_IMM &&
code->op != FETCH_OP_COMM) {
pr_info("string only accepts memory or address.\n");
@@ -459,7 +462,11 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size,
goto fail;
}
}
- code->op = FETCH_OP_ST_STRING; /* In DEREF case, replace it */
+ /* If op == DEREF, replace it with STRING */
+ if (!strcmp(parg->type->name, "ustring"))
+ code->op = FETCH_OP_ST_USTRING;
+ else
+ code->op = FETCH_OP_ST_STRING;
code->size = parg->type->size;
parg->dynamic = true;
} else if (code->op == FETCH_OP_DEREF) {
@@ -484,7 +491,8 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size,
/* Loop(Array) operation */
if (parg->count) {
if (scode->op != FETCH_OP_ST_MEM &&
- scode->op != FETCH_OP_ST_STRING) {
+ scode->op != FETCH_OP_ST_STRING &&
+ scode->op != FETCH_OP_ST_USTRING) {
pr_info("array only accepts memory or address\n");
ret = -EINVAL;
goto fail;
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 8a63f8bc01bc..cf4ba8bbb841 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -95,6 +95,7 @@ enum fetch_op {
FETCH_OP_ST_RAW, /* Raw: .size */
FETCH_OP_ST_MEM, /* Mem: .offset, .size */
FETCH_OP_ST_STRING, /* String: .offset, .size */
+ FETCH_OP_ST_USTRING, /* User String: .offset, .size */
// Stage 4 (modify) op
FETCH_OP_MOD_BF, /* Bitfield: .basesize, .lshift, .rshift */
// Stage 5 (loop) op
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index 4737bb8c07a3..7526f6f8d7b0 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -59,6 +59,9 @@ process_fetch_insn(struct fetch_insn *code, struct pt_regs *regs,
static nokprobe_inline int fetch_store_strlen(unsigned long addr);
static nokprobe_inline int
fetch_store_string(unsigned long addr, void *dest, void *base);
+static nokprobe_inline int fetch_store_strlen_user(unsigned long addr);
+static nokprobe_inline int
+fetch_store_string_user(unsigned long addr, void *dest, void *base);
static nokprobe_inline int
probe_mem_read(void *dest, void *src, size_t size);
@@ -91,6 +94,10 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
ret += fetch_store_strlen(val + code->offset);
code++;
goto array;
+ } else if (code->op == FETCH_OP_ST_USTRING) {
+ ret += fetch_store_strlen_user(val + code->offset);
+ code++;
+ goto array;
} else
return -EILSEQ;
}
@@ -106,6 +113,10 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
loc = *(u32 *)dest;
ret = fetch_store_string(val + code->offset, dest, base);
break;
+ case FETCH_OP_ST_USTRING:
+ loc = *(u32 *)dest;
+ ret = fetch_store_string_user(val + code->offset, dest, base);
+ break;
default:
return -EILSEQ;
}
@@ -123,7 +134,8 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
total += ret;
if (++i < code->param) {
code = s3;
- if (s3->op != FETCH_OP_ST_STRING) {
+ if (s3->op != FETCH_OP_ST_STRING &&
+ s3->op != FETCH_OP_ST_USTRING) {
dest += s3->size;
val += s3->size;
goto stage3;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 9bde07c06362..92facae8c3d8 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -173,6 +173,12 @@ fetch_store_string(unsigned long addr, void *dest, void *base)
return ret;
}
+static nokprobe_inline int
+fetch_store_string_user(unsigned long addr, void *dest, void *base)
+{
+ return fetch_store_string(addr, dest, base);
+}
+
/* Return the length of string -- including null terminal byte */
static nokprobe_inline int
fetch_store_strlen(unsigned long addr)
@@ -185,6 +191,12 @@ fetch_store_strlen(unsigned long addr)
return (len > MAX_STRING_SIZE) ? 0 : len;
}
+static nokprobe_inline int
+fetch_store_strlen_user(unsigned long addr)
+{
+ return fetch_store_strlen(addr);
+}
+
static unsigned long translate_user_vaddr(unsigned long file_offset)
{
unsigned long base_addr;
Support user-space dereference syntax for probe event arguments
to dereference the data-structure or array in user-space.
The syntax is just adding 'u' before an offset value.
+|-u<OFFSET>(<FETCHARG>)
e.g. +u8(%ax), +u0(+0(%si))
For example, if you probe do_sched_setscheduler(pid, policy,
param) and record param->sched_priority, you can add new
probe as below;
p do_sched_setscheduler priority=+u0($arg3)
Note that kprobe event provides this and it doesn't change the
dereference method automatically because we do not know whether
the given address is in userspace or kernel on some arch.
So as same as "ustring", this is an option for user, who has to
carefully choose the dereference method.
Signed-off-by: Masami Hiramatsu <[email protected]>
---
Changes in v4:
- Fix a bug for parsing argument and simplify code.
- Fix documents accroding to Steve's comment.
Changes in v3:
- Add a section for user memory access to the document.
---
Documentation/trace/kprobetrace.rst | 27 ++++++++++++++++++++++-----
Documentation/trace/uprobetracer.rst | 9 +++++----
kernel/trace/trace.c | 5 +++--
kernel/trace/trace_kprobe.c | 6 ++++++
kernel/trace/trace_probe.c | 24 ++++++++++++++++++------
kernel/trace/trace_probe.h | 2 ++
kernel/trace/trace_probe_tmpl.h | 23 ++++++++++++++++++-----
kernel/trace/trace_uprobe.c | 7 +++++++
8 files changed, 81 insertions(+), 22 deletions(-)
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index a3ac7c9ac242..0639ae492932 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -51,7 +51,7 @@ Synopsis of kprobe_events
$argN : Fetch the Nth function argument. (N >= 1) (\*1)
$retval : Fetch return value.(\*2)
$comm : Fetch current task comm.
- +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(\*3)
+ +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
@@ -61,6 +61,7 @@ Synopsis of kprobe_events
(\*1) only for the probe on function entry (offs == 0).
(\*2) only for return probe.
(\*3) this is useful for fetching a field of data structures.
+ (\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
Types
-----
@@ -79,10 +80,7 @@ wrong, but '+8($stack):x8[8]' is OK.)
String type is a special type, which fetches a "null-terminated" string from
kernel space. This means it will fail and store NULL if the string container
has been paged out. "ustring" type is an alternative of string for user-space.
-Note that kprobe-event provides string/ustring types, but doesn't change it
-automatically. So user has to decide if the targe string in kernel or in user
-space carefully. On some arch, if you choose wrong one, it always fails to
-record string data.
+See :ref:`user_mem_access` for more info..
The string array type is a bit different from other types. For other base
types, <base-type>[1] is equal to <base-type> (e.g. +0(%di):x32[1] is same
as +0(%di):x32.) But string[1] is not equal to string. The string type itself
@@ -97,6 +95,25 @@ Symbol type('symbol') is an alias of u32 or u64 type (depends on BITS_PER_LONG)
which shows given pointer in "symbol+offset" style.
For $comm, the default type is "string"; any other type is invalid.
+.. _user_mem_access:
+User Memory Access
+------------------
+Kprobe events supports user-space memory access. For that purpose, you can use
+either user-space dereference syntax or 'ustring' type.
+
+The user-space dereference syntax allows you to access a field of a data
+structure in user-space. This is done by adding the "u" prefix to the
+dereference syntax. For example, +u4(%si) means it will read memory from the
+address in the register %si offset by 4, and the mory is expected to be in
+user-space. You can use this for strings too, e.g. +u0(%si):string will read
+a string from the address in the register %si that is expected to be in user-
+space. 'ustring' is a shortcut way of performing the same task. That is,
++0(%si):ustring is equivalent to +u0(%si):string.
+
+Note that kprobe-event provides the user-memory access syntax but it doesn't
+use it transparently. This means if you use normal dereference or string type
+for user memory, it might fail, and always fails on some arch. So user has to
+check if the targe data is in kernel or in user space carefully.
Per-Probe Event Filtering
-------------------------
diff --git a/Documentation/trace/uprobetracer.rst b/Documentation/trace/uprobetracer.rst
index 4c3bfde2ba47..4656624a3da9 100644
--- a/Documentation/trace/uprobetracer.rst
+++ b/Documentation/trace/uprobetracer.rst
@@ -42,16 +42,17 @@ Synopsis of uprobe_tracer
@+OFFSET : Fetch memory at OFFSET (OFFSET from same file as PATH)
$stackN : Fetch Nth entry of stack (N >= 0)
$stack : Fetch stack address.
- $retval : Fetch return value.(*)
+ $retval : Fetch return value.(\*1)
$comm : Fetch current task comm.
- +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
+ +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*2)(\*3)
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "string" and bitfield are supported.
- (*) only for return probe.
- (**) this is useful for fetching a field of data structures.
+ (\*1) only for return probe.
+ (\*2) this is useful for fetching a field of data structures.
+ (\*3) Unlike kprobe event, "u" prefix will just be ignored.
Types
-----
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4cacbb0e1538..5408a82a015d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4638,10 +4638,11 @@ static const char readme_msg[] =
"\t args: <name>=fetcharg[:type]\n"
"\t fetcharg: %<register>, @<address>, @<symbol>[+|-<offset>],\n"
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
- "\t $stack<index>, $stack, $retval, $comm, $arg<N>\n"
+ "\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#else
- "\t $stack<index>, $stack, $retval, $comm\n"
+ "\t $stack<index>, $stack, $retval, $comm,\n"
#endif
+ "\t +|-[u]<offset>(<fetcharg>)\n"
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, string, symbol,\n"
"\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
"\t <type>\\[<array-size>\\]\n"
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 43b5c3230773..848a41adddb4 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -931,6 +931,12 @@ probe_mem_read(void *dest, void *src, size_t size)
return probe_kernel_read(dest, src, size);
}
+static nokprobe_inline int
+probe_mem_read_user(void *dest, void *src, size_t size)
+{
+ return probe_user_read(dest, src, size);
+}
+
/* Note that we don't verify it, since the code does not come from user space */
static int
process_fetch_insn(struct fetch_insn *code, struct pt_regs *regs, void *dest,
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index a7012de37a00..13593c6fe0ef 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -239,6 +239,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
{
struct fetch_insn *code = *pcode;
unsigned long param;
+ int deref = FETCH_OP_DEREF;
long offset = 0;
char *tmp;
int ret = 0;
@@ -301,8 +302,14 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
break;
case '+': /* deref memory */
- arg++; /* Skip '+', because kstrtol() rejects it. */
case '-':
+ if (arg[1] == 'u') {
+ deref = FETCH_OP_UDEREF;
+ arg[1] = arg[0];
+ arg++;
+ }
+ if (arg[0] == '+')
+ arg++; /* Skip '+', because kstrtol() rejects it. */
tmp = strchr(arg, '(');
if (!tmp)
return -EINVAL;
@@ -328,7 +335,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
return -E2BIG;
*pcode = code;
- code->op = FETCH_OP_DEREF;
+ code->op = deref;
code->offset = offset;
}
break;
@@ -444,13 +451,14 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size,
/* Store operation */
if (!strcmp(parg->type->name, "string") ||
!strcmp(parg->type->name, "ustring")) {
- if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_IMM &&
- code->op != FETCH_OP_COMM) {
+ if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_UDEREF
+ && code->op != FETCH_OP_IMM && code->op != FETCH_OP_COMM) {
pr_info("string only accepts memory or address.\n");
ret = -EINVAL;
goto fail;
}
- if (code->op != FETCH_OP_DEREF || parg->count) {
+ if ((code->op == FETCH_OP_IMM || code->op == FETCH_OP_COMM)
+ || parg->count) {
/*
* IMM and COMM is pointing actual address, those must
* be kept, and if parg->count != 0, this is an array
@@ -463,7 +471,8 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size,
}
}
/* If op == DEREF, replace it with STRING */
- if (!strcmp(parg->type->name, "ustring"))
+ if (!strcmp(parg->type->name, "ustring") ||
+ code->op == FETCH_OP_UDEREF)
code->op = FETCH_OP_ST_USTRING;
else
code->op = FETCH_OP_ST_STRING;
@@ -472,6 +481,9 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size,
} else if (code->op == FETCH_OP_DEREF) {
code->op = FETCH_OP_ST_MEM;
code->size = parg->type->size;
+ } else if (code->op == FETCH_OP_UDEREF) {
+ code->op = FETCH_OP_ST_UMEM;
+ code->size = parg->type->size;
} else {
code++;
if (code->op != FETCH_OP_NOP) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index cf4ba8bbb841..a5e8b2ac2c97 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -91,9 +91,11 @@ enum fetch_op {
FETCH_OP_FOFFS, /* File offset: .immediate */
// Stage 2 (dereference) op
FETCH_OP_DEREF, /* Dereference: .offset */
+ FETCH_OP_UDEREF, /* User-space Dereference: .offset */
// Stage 3 (store) ops
FETCH_OP_ST_RAW, /* Raw: .size */
FETCH_OP_ST_MEM, /* Mem: .offset, .size */
+ FETCH_OP_ST_UMEM, /* Mem: .offset, .size */
FETCH_OP_ST_STRING, /* String: .offset, .size */
FETCH_OP_ST_USTRING, /* User String: .offset, .size */
// Stage 4 (modify) op
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index 7526f6f8d7b0..06f2d901c4cf 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -64,6 +64,8 @@ static nokprobe_inline int
fetch_store_string_user(unsigned long addr, void *dest, void *base);
static nokprobe_inline int
probe_mem_read(void *dest, void *src, size_t size);
+static nokprobe_inline int
+probe_mem_read_user(void *dest, void *src, size_t size);
/* From the 2nd stage, routine is same */
static nokprobe_inline int
@@ -77,14 +79,21 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
stage2:
/* 2nd stage: dereference memory if needed */
- while (code->op == FETCH_OP_DEREF) {
- lval = val;
- ret = probe_mem_read(&val, (void *)val + code->offset,
- sizeof(val));
+ do {
+ if (code->op == FETCH_OP_DEREF) {
+ lval = val;
+ ret = probe_mem_read(&val, (void *)val + code->offset,
+ sizeof(val));
+ } else if (code->op == FETCH_OP_UDEREF) {
+ lval = val;
+ ret = probe_mem_read_user(&val,
+ (void *)val + code->offset, sizeof(val));
+ } else
+ break;
if (ret)
return ret;
code++;
- }
+ } while (1);
s3 = code;
stage3:
@@ -109,6 +118,10 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
case FETCH_OP_ST_MEM:
probe_mem_read(dest, (void *)val + code->offset, code->size);
break;
+ case FETCH_OP_ST_UMEM:
+ probe_mem_read_user(dest, (void *)val + code->offset,
+ code->size);
+ break;
case FETCH_OP_ST_STRING:
loc = *(u32 *)dest;
ret = fetch_store_string(val + code->offset, dest, base);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 92facae8c3d8..a86afc9e2a6a 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -140,6 +140,13 @@ probe_mem_read(void *dest, void *src, size_t size)
return copy_from_user(dest, vaddr, size) ? -EFAULT : 0;
}
+
+static nokprobe_inline int
+probe_mem_read_user(void *dest, void *src, size_t size)
+{
+ return probe_mem_read(dest, src, size);
+}
+
/*
* Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
* length and relative data location.
Add a user-memory access syntax testcase which checks
new user-memory access syntax and ustring type.
Signed-off-by: Masami Hiramatsu <[email protected]>
---
.../ftrace/test.d/kprobe/kprobe_args_user.tc | 31 ++++++++++++++++++++
1 file changed, 31 insertions(+)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc
new file mode 100644
index 000000000000..996da9d1eec5
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc
@@ -0,0 +1,31 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Kprobe event user-memory access
+
+[ -f kprobe_events ] || exit_unsupported # this is configurable
+
+grep -A10 "fetcharg:" README | grep -q 'ustring' || exit_unsupported
+grep -A10 "fetcharg:" README | grep -q '\[u\]<offset>' || exit_unsupported
+
+:;: "user-memory access syntax and ustring working on user memory";:
+echo 'p:myevent do_sys_open path=+0($arg2):ustring path2=+u0($arg2):string' \
+ > kprobe_events
+
+grep myevent kprobe_events | \
+ grep -q 'path=+0($arg2):ustring path2=+u0($arg2):string'
+echo 1 > events/kprobes/myevent/enable
+echo > /dev/null
+echo 0 > events/kprobes/myevent/enable
+
+grep myevent trace | grep -q 'path="/dev/null" path2="/dev/null"'
+
+:;: "user-memory access syntax and ustring not working with kernel memory";:
+echo 'p:myevent vfs_symlink path=+0($arg3):ustring path2=+u0($arg3):string' \
+ > kprobe_events
+echo 1 > events/kprobes/myevent/enable
+ln -s foo $TMPDIR/bar
+echo 0 > events/kprobes/myevent/enable
+
+grep myevent trace | grep -q 'path=(fault) path2=(fault)'
+
+exit 0
Use user_access_ok() instead of access_ok() in user_access_begin()
to validate the access context is user. This also allow us to
use (generic) strncpy_from_user() and strnlen_user() in atomic
state with setting USER_DS and disable pagefaults.
Signed-off-by: Masami Hiramatsu <[email protected]>
---
arch/x86/include/asm/uaccess.h | 2 +-
include/linux/uaccess.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 3125d129d3b6..3b7502a34114 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -715,7 +715,7 @@ extern struct movsl_mask {
*/
static __must_check inline bool user_access_begin(const void __user *ptr, size_t len)
{
- if (unlikely(!access_ok(ptr,len)))
+ if (unlikely(!user_access_ok(ptr, len)))
return 0;
__uaccess_begin_nospec();
return 1;
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index bf762689658b..1afd9dfabe67 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -282,7 +282,7 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
probe_kernel_read(&retval, addr, sizeof(retval))
#ifndef user_access_begin
-#define user_access_begin(ptr,len) access_ok(ptr, len)
+#define user_access_begin(ptr, len) user_access_ok(ptr, len)
#define user_access_end() do { } while (0)
#define unsafe_get_user(x, ptr, err) do { if (unlikely(__get_user(x, ptr))) goto err; } while (0)
#define unsafe_put_user(x, ptr, err) do { if (unlikely(__put_user(x, ptr))) goto err; } while (0)
From: Peter Zijlstra <[email protected]>
Add user_access_ok() macro which ensures current context
is user context, or explicitly do set_fs(USER_DS).
This function is very much like access_ok(), except it (may)
have different context validation. In general we must be
very careful when using this.
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
---
arch/x86/include/asm/uaccess.h | 8 +++++++-
include/linux/uaccess.h | 18 ++++++++++++++++++
2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 780f2b42c8ef..3125d129d3b6 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -92,12 +92,18 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
* checks that the pointer is in the user space range - after calling
* this function, memory access functions may still return -EFAULT.
*/
-#define access_ok(addr, size) \
+#define access_ok(addr, size) \
({ \
WARN_ON_IN_IRQ(); \
likely(!__range_not_ok(addr, size, user_addr_max())); \
})
+#define user_access_ok(addr, size) \
+({ \
+ WARN_ON_ONCE(!segment_eq(get_fs(), USER_DS)); \
+ likely(!__range_not_ok(addr, size, user_addr_max())); \
+})
+
/*
* These are the main single-value transfer routines. They automatically
* use the right size if we just have the right pointer type.
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 37b226e8df13..bf762689658b 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -10,6 +10,24 @@
#include <asm/uaccess.h>
+/**
+ * user_access_ok: Checks if a user space pointer is valid
+ * @addr: User space pointer to start of block to check
+ * @size: Size of block to check
+ *
+ * Context: User context or explicit set_fs(USER_DS).
+ *
+ * This function is very much like access_ok(), except it (may) have different
+ * context validation. In general we must be very careful when using this.
+ */
+#ifndef user_access_ok
+#define user_access_ok(addr, size) \
+({ \
+ WARN_ON_ONCE(!segment_eq(get_fs(), USER_DS)); \
+ access_ok(addr, size); \
+})
+#endif
+
/*
* Architectures should provide two primitives (raw_copy_{to,from}_user())
* and get rid of their private instances of copy_{to,from}_user() and
Add probe_user_read(), strncpy_from_unsafe_user() and
strnlen_unsafe_user() which allows caller to access user-space
in IRQ context.
Current probe_kernel_read() and strncpy_from_unsafe() are
not available for user-space memory, because it sets
KERNEL_DS while accessing data. On some arch, user address
space and kernel address space can be co-exist, but others
can not. In that case, setting KERNEL_DS means given
address is treated as a kernel address space.
Also strnlen_user() is only available from user context since
it can sleep if pagefault is enabled.
To access user-space memory without pagefault, we need
these new functions which sets USER_DS while accessing
the data.
Signed-off-by: Masami Hiramatsu <[email protected]>
---
Changes in v5:
- Simplify probe_user_read() (Thanks, Peter!)
- Add strnlen_unsafe_user()
Changes in v3:
- Use user_access_ok() for probe_user_read().
Changes in v2:
- Simplify strncpy_from_unsafe_user() using strncpy_from_user()
according to Linus's suggestion.
- Simplify probe_user_read() not using intermediate function.
---
include/linux/uaccess.h | 14 +++++
mm/maccess.c | 122 +++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 130 insertions(+), 6 deletions(-)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 1afd9dfabe67..5be7f9adb418 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -258,6 +258,17 @@ extern long probe_kernel_read(void *dst, const void *src, size_t size);
extern long __probe_kernel_read(void *dst, const void *src, size_t size);
/*
+ * probe_user_read(): safely attempt to read from a location in user space
+ * @dst: pointer to the buffer that shall take the data
+ * @src: address to read from
+ * @size: size of the data chunk
+ *
+ * Safely read from address @src to the buffer at @dst. If a kernel fault
+ * happens, handle that and return -EFAULT.
+ */
+extern long probe_user_read(void *dst, const void __user *src, size_t size);
+
+/*
* probe_kernel_write(): safely attempt to write to a location
* @dst: address to write to
* @src: pointer to the data that shall be written
@@ -270,6 +281,9 @@ extern long notrace probe_kernel_write(void *dst, const void *src, size_t size);
extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size);
extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
+extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
+ long count);
+extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);
/**
* probe_kernel_address(): safely attempt to read from a location
diff --git a/mm/maccess.c b/mm/maccess.c
index ec00be51a24f..d1b2ec78d9ef 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -5,8 +5,20 @@
#include <linux/mm.h>
#include <linux/uaccess.h>
+static __always_inline long
+probe_read_common(void *dst, const void __user *src, size_t size)
+{
+ long ret;
+
+ pagefault_disable();
+ ret = __copy_from_user_inatomic(dst, src, size);
+ pagefault_enable();
+
+ return ret ? -EFAULT : 0;
+}
+
/**
- * probe_kernel_read(): safely attempt to read from a location
+ * probe_kernel_read(): safely attempt to read from a kernel-space location
* @dst: pointer to the buffer that shall take the data
* @src: address to read from
* @size: size of the data chunk
@@ -29,17 +41,45 @@ long __probe_kernel_read(void *dst, const void *src, size_t size)
mm_segment_t old_fs = get_fs();
set_fs(KERNEL_DS);
- pagefault_disable();
- ret = __copy_from_user_inatomic(dst,
- (__force const void __user *)src, size);
- pagefault_enable();
+ ret = probe_read_common(dst, (__force const void __user *)src, size);
set_fs(old_fs);
- return ret ? -EFAULT : 0;
+ return ret;
}
EXPORT_SYMBOL_GPL(probe_kernel_read);
/**
+ * probe_user_read(): safely attempt to read from a user-space location
+ * @dst: pointer to the buffer that shall take the data
+ * @src: address to read from. This must be a user address.
+ * @size: size of the data chunk
+ *
+ * Safely read from user address @src to the buffer at @dst. If a kernel fault
+ * happens, handle that and return -EFAULT.
+ */
+
+long __weak probe_user_read(void *dst, const void __user *src, size_t size)
+ __attribute__((alias("__probe_user_read")));
+
+long __probe_user_read(void *dst, const void __user *src, size_t size)
+{
+ long ret = -EFAULT;
+ mm_segment_t old_fs = get_fs();
+
+ /*
+ * Since this can be called in IRQ context, we carefully set the
+ * USER_DS and use user_access_ok() which checks segment setting
+ * instead of task context.
+ */
+ set_fs(USER_DS);
+ if (user_access_ok(src, size))
+ ret = probe_read_common(dst, src, size);
+ set_fs(old_fs);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(probe_user_read);
+
+/**
* probe_kernel_write(): safely attempt to write to a location
* @dst: address to write to
* @src: pointer to the data that shall be written
@@ -66,6 +106,7 @@ long __probe_kernel_write(void *dst, const void *src, size_t size)
}
EXPORT_SYMBOL_GPL(probe_kernel_write);
+
/**
* strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address.
* @dst: Destination address, in kernel space. This buffer must be at
@@ -105,3 +146,72 @@ long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
return ret ? -EFAULT : src - unsafe_addr;
}
+
+/**
+ * strncpy_from_unsafe_user: - Copy a NUL terminated string from unsafe user
+ * address.
+ * @dst: Destination address, in kernel space. This buffer must be at
+ * least @count bytes long.
+ * @unsafe_addr: Unsafe user address.
+ * @count: Maximum number of bytes to copy, including the trailing NUL.
+ *
+ * Copies a NUL-terminated string from unsafe user address to kernel buffer.
+ *
+ * On success, returns the length of the string INCLUDING the trailing NUL.
+ *
+ * If access fails, returns -EFAULT (some data may have been copied
+ * and the trailing NUL added).
+ *
+ * If @count is smaller than the length of the string, copies @count-1 bytes,
+ * sets the last byte of @dst buffer to NUL and returns @count.
+ */
+long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
+ long count)
+{
+ mm_segment_t old_fs = get_fs();
+ long ret;
+
+ if (unlikely(count <= 0))
+ return 0;
+
+ set_fs(USER_DS);
+ pagefault_disable();
+ ret = strncpy_from_user(dst, unsafe_addr, count);
+ pagefault_enable();
+ set_fs(old_fs);
+ if (ret >= count) {
+ ret = count;
+ dst[ret - 1] = '\0';
+ } else if (ret > 0)
+ ret++;
+ return ret;
+}
+
+/**
+ * strnlen_unsafe_user: - Get the size of a user string INCLUDING final NUL.
+ * @unsafe_addr: The string to measure.
+ * @count: Maximum count (including NUL character)
+ *
+ * Get the size of a NUL-terminated string in user space without pagefault.
+ *
+ * Returns the size of the string INCLUDING the terminating NUL.
+ *
+ * If the string is too long, returns a number larger than @count. User
+ * has to check the return value against "> count".
+ * On exception (or invalid count), returns 0.
+ *
+ * Unlike strnlen_user, this can be used from IRQ handler etc. because
+ * it disables pagefaults.
+ */
+long strnlen_unsafe_user(const void __user *unsafe_addr, long count)
+{
+ mm_segment_t old_fs = get_fs();
+ int ret;
+
+ set_fs(USER_DS);
+ pagefault_disable();
+ ret = strnlen_user(unsafe_addr, count);
+ pagefault_enable();
+ set_fs(old_fs);
+ return ret;
+}
On 2/28/19 8:03 AM, Masami Hiramatsu wrote:
> Add probe_user_read(), strncpy_from_unsafe_user() and
> strnlen_unsafe_user() which allows caller to access user-space
> in IRQ context.
>
> Current probe_kernel_read() and strncpy_from_unsafe() are
> not available for user-space memory, because it sets
> KERNEL_DS while accessing data. On some arch, user address
> space and kernel address space can be co-exist, but others
> can not. In that case, setting KERNEL_DS means given
Just curious. Given the list of arch's currently linux supports,
do you know which arch's fall into "user address space and
kernel address space" can co-exist, and which arch's cannot?
Thanks!
Yonghong
> address is treated as a kernel address space.
> Also strnlen_user() is only available from user context since
> it can sleep if pagefault is enabled.
>
> To access user-space memory without pagefault, we need
> these new functions which sets USER_DS while accessing
> the data.
>
> Signed-off-by: Masami Hiramatsu <[email protected]>
> ---
> Changes in v5:
> - Simplify probe_user_read() (Thanks, Peter!)
> - Add strnlen_unsafe_user()
> Changes in v3:
> - Use user_access_ok() for probe_user_read().
> Changes in v2:
> - Simplify strncpy_from_unsafe_user() using strncpy_from_user()
> according to Linus's suggestion.
> - Simplify probe_user_read() not using intermediate function.
> ---
> include/linux/uaccess.h | 14 +++++
> mm/maccess.c | 122 +++++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 130 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 1afd9dfabe67..5be7f9adb418 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -258,6 +258,17 @@ extern long probe_kernel_read(void *dst, const void *src, size_t size);
> extern long __probe_kernel_read(void *dst, const void *src, size_t size);
>
> /*
> + * probe_user_read(): safely attempt to read from a location in user space
> + * @dst: pointer to the buffer that shall take the data
> + * @src: address to read from
> + * @size: size of the data chunk
> + *
> + * Safely read from address @src to the buffer at @dst. If a kernel fault
> + * happens, handle that and return -EFAULT.
> + */
> +extern long probe_user_read(void *dst, const void __user *src, size_t size);
> +
> +/*
> * probe_kernel_write(): safely attempt to write to a location
> * @dst: address to write to
> * @src: pointer to the data that shall be written
> @@ -270,6 +281,9 @@ extern long notrace probe_kernel_write(void *dst, const void *src, size_t size);
> extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size);
>
> extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
> +extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
> + long count);
> +extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);
>
> /**
> * probe_kernel_address(): safely attempt to read from a location
> diff --git a/mm/maccess.c b/mm/maccess.c
> index ec00be51a24f..d1b2ec78d9ef 100644
> --- a/mm/maccess.c
> +++ b/mm/maccess.c
> @@ -5,8 +5,20 @@
> #include <linux/mm.h>
> #include <linux/uaccess.h>
>
> +static __always_inline long
> +probe_read_common(void *dst, const void __user *src, size_t size)
> +{
> + long ret;
> +
> + pagefault_disable();
> + ret = __copy_from_user_inatomic(dst, src, size);
> + pagefault_enable();
> +
> + return ret ? -EFAULT : 0;
> +}
> +
> /**
> - * probe_kernel_read(): safely attempt to read from a location
> + * probe_kernel_read(): safely attempt to read from a kernel-space location
> * @dst: pointer to the buffer that shall take the data
> * @src: address to read from
> * @size: size of the data chunk
> @@ -29,17 +41,45 @@ long __probe_kernel_read(void *dst, const void *src, size_t size)
> mm_segment_t old_fs = get_fs();
>
> set_fs(KERNEL_DS);
> - pagefault_disable();
> - ret = __copy_from_user_inatomic(dst,
> - (__force const void __user *)src, size);
> - pagefault_enable();
> + ret = probe_read_common(dst, (__force const void __user *)src, size);
> set_fs(old_fs);
>
> - return ret ? -EFAULT : 0;
> + return ret;
> }
> EXPORT_SYMBOL_GPL(probe_kernel_read);
>
> /**
> + * probe_user_read(): safely attempt to read from a user-space location
> + * @dst: pointer to the buffer that shall take the data
> + * @src: address to read from. This must be a user address.
> + * @size: size of the data chunk
> + *
> + * Safely read from user address @src to the buffer at @dst. If a kernel fault
> + * happens, handle that and return -EFAULT.
> + */
> +
> +long __weak probe_user_read(void *dst, const void __user *src, size_t size)
> + __attribute__((alias("__probe_user_read")));
> +
> +long __probe_user_read(void *dst, const void __user *src, size_t size)
> +{
> + long ret = -EFAULT;
> + mm_segment_t old_fs = get_fs();
> +
> + /*
> + * Since this can be called in IRQ context, we carefully set the
> + * USER_DS and use user_access_ok() which checks segment setting
> + * instead of task context.
> + */
> + set_fs(USER_DS);
> + if (user_access_ok(src, size))
> + ret = probe_read_common(dst, src, size);
> + set_fs(old_fs);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(probe_user_read);
> +
> +/**
> * probe_kernel_write(): safely attempt to write to a location
> * @dst: address to write to
> * @src: pointer to the data that shall be written
> @@ -66,6 +106,7 @@ long __probe_kernel_write(void *dst, const void *src, size_t size)
> }
> EXPORT_SYMBOL_GPL(probe_kernel_write);
>
> +
> /**
> * strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address.
> * @dst: Destination address, in kernel space. This buffer must be at
> @@ -105,3 +146,72 @@ long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
>
> return ret ? -EFAULT : src - unsafe_addr;
> }
> +
> +/**
> + * strncpy_from_unsafe_user: - Copy a NUL terminated string from unsafe user
> + * address.
> + * @dst: Destination address, in kernel space. This buffer must be at
> + * least @count bytes long.
> + * @unsafe_addr: Unsafe user address.
> + * @count: Maximum number of bytes to copy, including the trailing NUL.
> + *
> + * Copies a NUL-terminated string from unsafe user address to kernel buffer.
> + *
> + * On success, returns the length of the string INCLUDING the trailing NUL.
> + *
> + * If access fails, returns -EFAULT (some data may have been copied
> + * and the trailing NUL added).
> + *
> + * If @count is smaller than the length of the string, copies @count-1 bytes,
> + * sets the last byte of @dst buffer to NUL and returns @count.
> + */
> +long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
> + long count)
> +{
> + mm_segment_t old_fs = get_fs();
> + long ret;
> +
> + if (unlikely(count <= 0))
> + return 0;
> +
> + set_fs(USER_DS);
> + pagefault_disable();
> + ret = strncpy_from_user(dst, unsafe_addr, count);
> + pagefault_enable();
> + set_fs(old_fs);
> + if (ret >= count) {
> + ret = count;
> + dst[ret - 1] = '\0';
> + } else if (ret > 0)
> + ret++;
> + return ret;
> +}
> +
> +/**
> + * strnlen_unsafe_user: - Get the size of a user string INCLUDING final NUL.
> + * @unsafe_addr: The string to measure.
> + * @count: Maximum count (including NUL character)
> + *
> + * Get the size of a NUL-terminated string in user space without pagefault.
> + *
> + * Returns the size of the string INCLUDING the terminating NUL.
> + *
> + * If the string is too long, returns a number larger than @count. User
> + * has to check the return value against "> count".
> + * On exception (or invalid count), returns 0.
> + *
> + * Unlike strnlen_user, this can be used from IRQ handler etc. because
> + * it disables pagefaults.
> + */
> +long strnlen_unsafe_user(const void __user *unsafe_addr, long count)
> +{
> + mm_segment_t old_fs = get_fs();
> + int ret;
> +
> + set_fs(USER_DS);
> + pagefault_disable();
> + ret = strnlen_user(unsafe_addr, count);
> + pagefault_enable();
> + set_fs(old_fs);
> + return ret;
> +}
>
Hi Yonghong,
On Thu, 28 Feb 2019 22:49:43 +0000
Yonghong Song <[email protected]> wrote:
>
>
> On 2/28/19 8:03 AM, Masami Hiramatsu wrote:
> > Add probe_user_read(), strncpy_from_unsafe_user() and
> > strnlen_unsafe_user() which allows caller to access user-space
> > in IRQ context.
> >
> > Current probe_kernel_read() and strncpy_from_unsafe() are
> > not available for user-space memory, because it sets
> > KERNEL_DS while accessing data. On some arch, user address
> > space and kernel address space can be co-exist, but others
> > can not. In that case, setting KERNEL_DS means given
>
> Just curious. Given the list of arch's currently linux supports,
> do you know which arch's fall into "user address space and
> kernel address space" can co-exist, and which arch's cannot?
As far as I can heard, (and based on probe_kernel_read() failure)
sparc32 (and sparc64?), arm64, and s390 will not work.
x86 works, but if user patch the 4G/4G, it shouldn't work.
Thank you,
>
> Thanks!
>
> Yonghong
>
>
> > address is treated as a kernel address space.
> > Also strnlen_user() is only available from user context since
> > it can sleep if pagefault is enabled.
> >
> > To access user-space memory without pagefault, we need
> > these new functions which sets USER_DS while accessing
> > the data.
> >
> > Signed-off-by: Masami Hiramatsu <[email protected]>
> > ---
> > Changes in v5:
> > - Simplify probe_user_read() (Thanks, Peter!)
> > - Add strnlen_unsafe_user()
> > Changes in v3:
> > - Use user_access_ok() for probe_user_read().
> > Changes in v2:
> > - Simplify strncpy_from_unsafe_user() using strncpy_from_user()
> > according to Linus's suggestion.
> > - Simplify probe_user_read() not using intermediate function.
> > ---
> > include/linux/uaccess.h | 14 +++++
> > mm/maccess.c | 122 +++++++++++++++++++++++++++++++++++++++++++++--
> > 2 files changed, 130 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> > index 1afd9dfabe67..5be7f9adb418 100644
> > --- a/include/linux/uaccess.h
> > +++ b/include/linux/uaccess.h
> > @@ -258,6 +258,17 @@ extern long probe_kernel_read(void *dst, const void *src, size_t size);
> > extern long __probe_kernel_read(void *dst, const void *src, size_t size);
> >
> > /*
> > + * probe_user_read(): safely attempt to read from a location in user space
> > + * @dst: pointer to the buffer that shall take the data
> > + * @src: address to read from
> > + * @size: size of the data chunk
> > + *
> > + * Safely read from address @src to the buffer at @dst. If a kernel fault
> > + * happens, handle that and return -EFAULT.
> > + */
> > +extern long probe_user_read(void *dst, const void __user *src, size_t size);
> > +
> > +/*
> > * probe_kernel_write(): safely attempt to write to a location
> > * @dst: address to write to
> > * @src: pointer to the data that shall be written
> > @@ -270,6 +281,9 @@ extern long notrace probe_kernel_write(void *dst, const void *src, size_t size);
> > extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size);
> >
> > extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
> > +extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
> > + long count);
> > +extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);
> >
> > /**
> > * probe_kernel_address(): safely attempt to read from a location
> > diff --git a/mm/maccess.c b/mm/maccess.c
> > index ec00be51a24f..d1b2ec78d9ef 100644
> > --- a/mm/maccess.c
> > +++ b/mm/maccess.c
> > @@ -5,8 +5,20 @@
> > #include <linux/mm.h>
> > #include <linux/uaccess.h>
> >
> > +static __always_inline long
> > +probe_read_common(void *dst, const void __user *src, size_t size)
> > +{
> > + long ret;
> > +
> > + pagefault_disable();
> > + ret = __copy_from_user_inatomic(dst, src, size);
> > + pagefault_enable();
> > +
> > + return ret ? -EFAULT : 0;
> > +}
> > +
> > /**
> > - * probe_kernel_read(): safely attempt to read from a location
> > + * probe_kernel_read(): safely attempt to read from a kernel-space location
> > * @dst: pointer to the buffer that shall take the data
> > * @src: address to read from
> > * @size: size of the data chunk
> > @@ -29,17 +41,45 @@ long __probe_kernel_read(void *dst, const void *src, size_t size)
> > mm_segment_t old_fs = get_fs();
> >
> > set_fs(KERNEL_DS);
> > - pagefault_disable();
> > - ret = __copy_from_user_inatomic(dst,
> > - (__force const void __user *)src, size);
> > - pagefault_enable();
> > + ret = probe_read_common(dst, (__force const void __user *)src, size);
> > set_fs(old_fs);
> >
> > - return ret ? -EFAULT : 0;
> > + return ret;
> > }
> > EXPORT_SYMBOL_GPL(probe_kernel_read);
> >
> > /**
> > + * probe_user_read(): safely attempt to read from a user-space location
> > + * @dst: pointer to the buffer that shall take the data
> > + * @src: address to read from. This must be a user address.
> > + * @size: size of the data chunk
> > + *
> > + * Safely read from user address @src to the buffer at @dst. If a kernel fault
> > + * happens, handle that and return -EFAULT.
> > + */
> > +
> > +long __weak probe_user_read(void *dst, const void __user *src, size_t size)
> > + __attribute__((alias("__probe_user_read")));
> > +
> > +long __probe_user_read(void *dst, const void __user *src, size_t size)
> > +{
> > + long ret = -EFAULT;
> > + mm_segment_t old_fs = get_fs();
> > +
> > + /*
> > + * Since this can be called in IRQ context, we carefully set the
> > + * USER_DS and use user_access_ok() which checks segment setting
> > + * instead of task context.
> > + */
> > + set_fs(USER_DS);
> > + if (user_access_ok(src, size))
> > + ret = probe_read_common(dst, src, size);
> > + set_fs(old_fs);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(probe_user_read);
> > +
> > +/**
> > * probe_kernel_write(): safely attempt to write to a location
> > * @dst: address to write to
> > * @src: pointer to the data that shall be written
> > @@ -66,6 +106,7 @@ long __probe_kernel_write(void *dst, const void *src, size_t size)
> > }
> > EXPORT_SYMBOL_GPL(probe_kernel_write);
> >
> > +
> > /**
> > * strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address.
> > * @dst: Destination address, in kernel space. This buffer must be at
> > @@ -105,3 +146,72 @@ long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
> >
> > return ret ? -EFAULT : src - unsafe_addr;
> > }
> > +
> > +/**
> > + * strncpy_from_unsafe_user: - Copy a NUL terminated string from unsafe user
> > + * address.
> > + * @dst: Destination address, in kernel space. This buffer must be at
> > + * least @count bytes long.
> > + * @unsafe_addr: Unsafe user address.
> > + * @count: Maximum number of bytes to copy, including the trailing NUL.
> > + *
> > + * Copies a NUL-terminated string from unsafe user address to kernel buffer.
> > + *
> > + * On success, returns the length of the string INCLUDING the trailing NUL.
> > + *
> > + * If access fails, returns -EFAULT (some data may have been copied
> > + * and the trailing NUL added).
> > + *
> > + * If @count is smaller than the length of the string, copies @count-1 bytes,
> > + * sets the last byte of @dst buffer to NUL and returns @count.
> > + */
> > +long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
> > + long count)
> > +{
> > + mm_segment_t old_fs = get_fs();
> > + long ret;
> > +
> > + if (unlikely(count <= 0))
> > + return 0;
> > +
> > + set_fs(USER_DS);
> > + pagefault_disable();
> > + ret = strncpy_from_user(dst, unsafe_addr, count);
> > + pagefault_enable();
> > + set_fs(old_fs);
> > + if (ret >= count) {
> > + ret = count;
> > + dst[ret - 1] = '\0';
> > + } else if (ret > 0)
> > + ret++;
> > + return ret;
> > +}
> > +
> > +/**
> > + * strnlen_unsafe_user: - Get the size of a user string INCLUDING final NUL.
> > + * @unsafe_addr: The string to measure.
> > + * @count: Maximum count (including NUL character)
> > + *
> > + * Get the size of a NUL-terminated string in user space without pagefault.
> > + *
> > + * Returns the size of the string INCLUDING the terminating NUL.
> > + *
> > + * If the string is too long, returns a number larger than @count. User
> > + * has to check the return value against "> count".
> > + * On exception (or invalid count), returns 0.
> > + *
> > + * Unlike strnlen_user, this can be used from IRQ handler etc. because
> > + * it disables pagefaults.
> > + */
> > +long strnlen_unsafe_user(const void __user *unsafe_addr, long count)
> > +{
> > + mm_segment_t old_fs = get_fs();
> > + int ret;
> > +
> > + set_fs(USER_DS);
> > + pagefault_disable();
> > + ret = strnlen_user(unsafe_addr, count);
> > + pagefault_enable();
> > + set_fs(old_fs);
> > + return ret;
> > +}
> >
--
Masami Hiramatsu <[email protected]>
On 2/28/19 6:29 PM, Masami Hiramatsu wrote:
> Hi Yonghong,
>
> On Thu, 28 Feb 2019 22:49:43 +0000
> Yonghong Song <[email protected]> wrote:
>
>>
>>
>> On 2/28/19 8:03 AM, Masami Hiramatsu wrote:
>>> Add probe_user_read(), strncpy_from_unsafe_user() and
>>> strnlen_unsafe_user() which allows caller to access user-space
>>> in IRQ context.
>>>
>>> Current probe_kernel_read() and strncpy_from_unsafe() are
>>> not available for user-space memory, because it sets
>>> KERNEL_DS while accessing data. On some arch, user address
>>> space and kernel address space can be co-exist, but others
>>> can not. In that case, setting KERNEL_DS means given
>>
>> Just curious. Given the list of arch's currently linux supports,
>> do you know which arch's fall into "user address space and
>> kernel address space" can co-exist, and which arch's cannot?
>
> As far as I can heard, (and based on probe_kernel_read() failure)
> sparc32 (and sparc64?), arm64, and s390 will not work.
> x86 works, but if user patch the 4G/4G, it shouldn't work.
Thanks for the info! Great to know the details.
>
> Thank you,
>
>>
>> Thanks!
>>
>> Yonghong
>>
>>
>>> address is treated as a kernel address space.
>>> Also strnlen_user() is only available from user context since
>>> it can sleep if pagefault is enabled.
>>>
>>> To access user-space memory without pagefault, we need
>>> these new functions which sets USER_DS while accessing
>>> the data.
>>>
>>> Signed-off-by: Masami Hiramatsu <[email protected]>
>>> ---
>>> Changes in v5:
>>> - Simplify probe_user_read() (Thanks, Peter!)
>>> - Add strnlen_unsafe_user()
>>> Changes in v3:
>>> - Use user_access_ok() for probe_user_read().
>>> Changes in v2:
>>> - Simplify strncpy_from_unsafe_user() using strncpy_from_user()
>>> according to Linus's suggestion.
>>> - Simplify probe_user_read() not using intermediate function.
>>> ---
>>> include/linux/uaccess.h | 14 +++++
>>> mm/maccess.c | 122 +++++++++++++++++++++++++++++++++++++++++++++--
>>> 2 files changed, 130 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>>> index 1afd9dfabe67..5be7f9adb418 100644
>>> --- a/include/linux/uaccess.h
>>> +++ b/include/linux/uaccess.h
>>> @@ -258,6 +258,17 @@ extern long probe_kernel_read(void *dst, const void *src, size_t size);
>>> extern long __probe_kernel_read(void *dst, const void *src, size_t size);
>>>
>>> /*
>>> + * probe_user_read(): safely attempt to read from a location in user space
>>> + * @dst: pointer to the buffer that shall take the data
>>> + * @src: address to read from
>>> + * @size: size of the data chunk
>>> + *
>>> + * Safely read from address @src to the buffer at @dst. If a kernel fault
>>> + * happens, handle that and return -EFAULT.
>>> + */
>>> +extern long probe_user_read(void *dst, const void __user *src, size_t size);
>>> +
>>> +/*
>>> * probe_kernel_write(): safely attempt to write to a location
>>> * @dst: address to write to
>>> * @src: pointer to the data that shall be written
>>> @@ -270,6 +281,9 @@ extern long notrace probe_kernel_write(void *dst, const void *src, size_t size);
>>> extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size);
>>>
>>> extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
>>> +extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
>>> + long count);
>>> +extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count);
>>>
>>> /**
>>> * probe_kernel_address(): safely attempt to read from a location
>>> diff --git a/mm/maccess.c b/mm/maccess.c
>>> index ec00be51a24f..d1b2ec78d9ef 100644
>>> --- a/mm/maccess.c
>>> +++ b/mm/maccess.c
>>> @@ -5,8 +5,20 @@
>>> #include <linux/mm.h>
>>> #include <linux/uaccess.h>
>>>
>>> +static __always_inline long
>>> +probe_read_common(void *dst, const void __user *src, size_t size)
>>> +{
>>> + long ret;
>>> +
>>> + pagefault_disable();
>>> + ret = __copy_from_user_inatomic(dst, src, size);
>>> + pagefault_enable();
>>> +
>>> + return ret ? -EFAULT : 0;
>>> +}
>>> +
>>> /**
>>> - * probe_kernel_read(): safely attempt to read from a location
>>> + * probe_kernel_read(): safely attempt to read from a kernel-space location
>>> * @dst: pointer to the buffer that shall take the data
>>> * @src: address to read from
>>> * @size: size of the data chunk
>>> @@ -29,17 +41,45 @@ long __probe_kernel_read(void *dst, const void *src, size_t size)
>>> mm_segment_t old_fs = get_fs();
>>>
>>> set_fs(KERNEL_DS);
>>> - pagefault_disable();
>>> - ret = __copy_from_user_inatomic(dst,
>>> - (__force const void __user *)src, size);
>>> - pagefault_enable();
>>> + ret = probe_read_common(dst, (__force const void __user *)src, size);
>>> set_fs(old_fs);
>>>
>>> - return ret ? -EFAULT : 0;
>>> + return ret;
>>> }
>>> EXPORT_SYMBOL_GPL(probe_kernel_read);
>>>
>>> /**
>>> + * probe_user_read(): safely attempt to read from a user-space location
>>> + * @dst: pointer to the buffer that shall take the data
>>> + * @src: address to read from. This must be a user address.
>>> + * @size: size of the data chunk
>>> + *
>>> + * Safely read from user address @src to the buffer at @dst. If a kernel fault
>>> + * happens, handle that and return -EFAULT.
>>> + */
>>> +
>>> +long __weak probe_user_read(void *dst, const void __user *src, size_t size)
>>> + __attribute__((alias("__probe_user_read")));
>>> +
>>> +long __probe_user_read(void *dst, const void __user *src, size_t size)
>>> +{
>>> + long ret = -EFAULT;
>>> + mm_segment_t old_fs = get_fs();
>>> +
>>> + /*
>>> + * Since this can be called in IRQ context, we carefully set the
>>> + * USER_DS and use user_access_ok() which checks segment setting
>>> + * instead of task context.
>>> + */
>>> + set_fs(USER_DS);
>>> + if (user_access_ok(src, size))
>>> + ret = probe_read_common(dst, src, size);
>>> + set_fs(old_fs);
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(probe_user_read);
>>> +
>>> +/**
>>> * probe_kernel_write(): safely attempt to write to a location
>>> * @dst: address to write to
>>> * @src: pointer to the data that shall be written
>>> @@ -66,6 +106,7 @@ long __probe_kernel_write(void *dst, const void *src, size_t size)
>>> }
>>> EXPORT_SYMBOL_GPL(probe_kernel_write);
>>>
>>> +
>>> /**
>>> * strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address.
>>> * @dst: Destination address, in kernel space. This buffer must be at
>>> @@ -105,3 +146,72 @@ long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
>>>
>>> return ret ? -EFAULT : src - unsafe_addr;
>>> }
>>> +
>>> +/**
>>> + * strncpy_from_unsafe_user: - Copy a NUL terminated string from unsafe user
>>> + * address.
>>> + * @dst: Destination address, in kernel space. This buffer must be at
>>> + * least @count bytes long.
>>> + * @unsafe_addr: Unsafe user address.
>>> + * @count: Maximum number of bytes to copy, including the trailing NUL.
>>> + *
>>> + * Copies a NUL-terminated string from unsafe user address to kernel buffer.
>>> + *
>>> + * On success, returns the length of the string INCLUDING the trailing NUL.
>>> + *
>>> + * If access fails, returns -EFAULT (some data may have been copied
>>> + * and the trailing NUL added).
>>> + *
>>> + * If @count is smaller than the length of the string, copies @count-1 bytes,
>>> + * sets the last byte of @dst buffer to NUL and returns @count.
>>> + */
>>> +long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr,
>>> + long count)
>>> +{
>>> + mm_segment_t old_fs = get_fs();
>>> + long ret;
>>> +
>>> + if (unlikely(count <= 0))
>>> + return 0;
>>> +
>>> + set_fs(USER_DS);
>>> + pagefault_disable();
>>> + ret = strncpy_from_user(dst, unsafe_addr, count);
>>> + pagefault_enable();
>>> + set_fs(old_fs);
>>> + if (ret >= count) {
>>> + ret = count;
>>> + dst[ret - 1] = '\0';
>>> + } else if (ret > 0)
>>> + ret++;
>>> + return ret;
>>> +}
>>> +
>>> +/**
>>> + * strnlen_unsafe_user: - Get the size of a user string INCLUDING final NUL.
>>> + * @unsafe_addr: The string to measure.
>>> + * @count: Maximum count (including NUL character)
>>> + *
>>> + * Get the size of a NUL-terminated string in user space without pagefault.
>>> + *
>>> + * Returns the size of the string INCLUDING the terminating NUL.
>>> + *
>>> + * If the string is too long, returns a number larger than @count. User
>>> + * has to check the return value against "> count".
>>> + * On exception (or invalid count), returns 0.
>>> + *
>>> + * Unlike strnlen_user, this can be used from IRQ handler etc. because
>>> + * it disables pagefaults.
>>> + */
>>> +long strnlen_unsafe_user(const void __user *unsafe_addr, long count)
>>> +{
>>> + mm_segment_t old_fs = get_fs();
>>> + int ret;
>>> +
>>> + set_fs(USER_DS);
>>> + pagefault_disable();
>>> + ret = strnlen_user(unsafe_addr, count);
>>> + pagefault_enable();
>>> + set_fs(old_fs);
>>> + return ret;
>>> +}
>>>
>
>
FYI, we noticed the following commit (built with gcc-8):
commit: 780464aed08ad00aa6d5f81ac8bef2e714dc8b06 ("[PATCH v5 2/6] uaccess: Use user_access_ok() in user_access_begin()")
url: https://github.com/0day-ci/linux/commits/Masami-Hiramatsu/tracing-probes-uaccess-Add-support-user-space-access/20190303-203749
in testcase: boot
on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+-----------------------------------------------------------------+------------+------------+
| | 0331497cc8 | 780464aed0 |
+-----------------------------------------------------------------+------------+------------+
| boot_successes | 4 | 0 |
| boot_failures | 0 | 4 |
| WARNING:at_arch/x86/include/asm/uaccess.h:#strnlen_user/0x | 0 | 4 |
| RIP:strnlen_user | 0 | 4 |
| WARNING:at_arch/x86/include/asm/uaccess.h:#strncpy_from_user/0x | 0 | 4 |
| RIP:strncpy_from_user | 0 | 4 |
+-----------------------------------------------------------------+------------+------------+
[ 3.970230] WARNING: CPU: 1 PID: 20 at arch/x86/include/asm/uaccess.h:718 strnlen_user+0x110/0x160
[ 3.970294] Modules linked in:
[ 3.970294] CPU: 1 PID: 20 Comm: kdevtmpfs Not tainted 5.0.0-rc8-00253-g780464a #1
[ 3.970294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 3.970294] RIP: 0010:strnlen_user+0x110/0x160
[ 3.970294] Code: 5e c3 48 ba 00 f0 ff ff ff 7f 00 00 e9 47 ff ff ff 48 ba 00 f0 ff ff ff ff ff 00 e9 38 ff ff ff 5b 31 c0 5d 41 5e c3 31 c0 c3 <0f> 0b e9 30 ff ff ff 4c 29 c1 48 8d 46 01 48 39 ce 48 0f 47 c2 eb
[ 4.003505] RSP: 0000:ffffa41d00cfbe58 EFLAGS: 00010286
[ 4.003505] RAX: ffffffffffffffff RBX: ffffffff8cafeec0 RCX: ffffffffffffffff
[ 4.003505] RDX: 00007ffffffff000 RSI: 0000000000001000 RDI: ffffffff8cafeec0
[ 4.003505] RBP: 0000000000001000 R08: ffffa41d00cfbed6 R09: 000000007350113f
[ 4.003505] R10: 0000000000000001 R11: 00000000ffffffff R12: ffffffff8cafeec0
[ 4.003505] R13: ffffffff8cacdf83 R14: 0000000000008000 R15: ffffffff8bf9dd90
[ 4.003505] FS: 0000000000000000(0000) GS:ffff8f8bffd00000(0000) knlGS:0000000000000000
[ 4.003505] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.003505] CR2: 00000000ffffffff CR3: 000000010ae0e000 CR4: 00000000000406e0
[ 4.003505] Call Trace:
[ 4.003505] strndup_user+0x14/0x60
[ 4.003505] ksys_mount+0x30/0xd0
[ 4.003505] ? handle_create+0x1f0/0x1f0
[ 4.003505] devtmpfsd+0x9c/0x190
[ 4.003505] kthread+0x11d/0x140
[ 4.003505] ? __kthread_parkme+0x80/0x80
[ 4.003505] ret_from_fork+0x35/0x40
[ 4.003505] ---[ end trace 223fbf0f5f9372dc ]---
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
Thanks,
lkp
On Sun, Mar 3, 2019 at 9:40 AM kernel test robot <[email protected]> wrote:
>
> commit: 780464aed08ad00aa6d5f81ac8bef2e714dc8b06 ("[PATCH v5 2/6] uaccess: Use user_access_ok() in user_access_begin()")
Hmm. Not an upstream commit ID, so this is presumably a backport.
Ok, let's see what it is using the web link:
> url: https://github.com/0day-ci/linux/commits/Masami-Hiramatsu/tracing-probes-uaccess-Add-support-user-space-access/20190303-203749
Yeah, that just gives a github 404 error.
I'm _assuming_ it's the WARN_ON_IN_IRQ() in the access_ok().
Which doesn't much make sense either (why wouldn't it happen in
mainline)? But without a useful web link to see what is actually being
tested, I guess it's not something I can even look at...
Linus
On Sun, 3 Mar 2019 11:53:58 -0800
Linus Torvalds <[email protected]> wrote:
> On Sun, Mar 3, 2019 at 9:40 AM kernel test robot <[email protected]> wrote:
> >
> > commit: 780464aed08ad00aa6d5f81ac8bef2e714dc8b06 ("[PATCH v5 2/6] uaccess: Use user_access_ok() in user_access_begin()")
>
> Hmm. Not an upstream commit ID, so this is presumably a backport.
>
> Ok, let's see what it is using the web link:
>
> > url: https://github.com/0day-ci/linux/commits/Masami-Hiramatsu/tracing-probes-uaccess-Add-support-user-space-access/20190303-203749
>
> Yeah, that just gives a github 404 error.
>
> I'm _assuming_ it's the WARN_ON_IN_IRQ() in the access_ok().
I think it comes from WARN_ON_ONCE(!segment_eq(get_fs(), USER_DS)) in
user_access_ok(). The call trace shows that strndup_user might be called
from kernel daemon context.
[ 4.003505] Call Trace:
[ 4.003505] strndup_user+0x14/0x60
[ 4.003505] ksys_mount+0x30/0xd0
[ 4.003505] ? handle_create+0x1f0/0x1f0
[ 4.003505] devtmpfsd+0x9c/0x190
I guess devtmpfsd has not set USER_DS. Hmm, in that case, how ksys_*()
parameters should be treated? Those APIs will take __user pointers, but
actually, in-kernel callers call ksys_*() with non __user variables.
For example,
static int devtmpfsd(void *p)
{
char options[] = "mode=0755";
int *err = p;
*err = ksys_unshare(CLONE_NEWNS);
if (*err)
goto out;
*err = ksys_mount("devtmpfs", "/", "devtmpfs", MS_SILENT, options);
if (*err)
...
__force __user casting doesn't help, because these parameters are in kernel
memory, not in user memory. I think if we forcibly set USER_DS, it should
fail on some arch.
Peter, I think we can remove that WARN_ON_ONCE() from user_access_ok(),
since user_access_begin() is not only actually start accessing user, but it
also accessing kernel memory.
> Which doesn't much make sense either (why wouldn't it happen in
> mainline)? But without a useful web link to see what is actually being
> tested, I guess it's not something I can even look at...
Yeah, we need working web link on the report...
Thank you,
--
Masami Hiramatsu <[email protected]>
On Sun, Mar 3, 2019 at 5:14 PM Masami Hiramatsu <[email protected]> wrote:
>
> I think it comes from WARN_ON_ONCE(!segment_eq(get_fs(), USER_DS)) in
> user_access_ok(). The call trace shows that strndup_user might be called
> from kernel daemon context.
Ahh, yes.
We've had this before. We've gotten rid of the actual "use system
calls", but we still have some of the init sequence in particular just
calling the wrappers instead.
And yes, ksys_mount() takes __user pointers.
It would be a lot better to use "do_mount()", which is the interface
that takes actual "char *" pointers.
Linus
On 3/4/19 3:53 AM, Linus Torvalds wrote:
> On Sun, Mar 3, 2019 at 9:40 AM kernel test robot <[email protected]> wrote:
>> commit: 780464aed08ad00aa6d5f81ac8bef2e714dc8b06 ("[PATCH v5 2/6] uaccess: Use user_access_ok() in user_access_begin()")
> Hmm. Not an upstream commit ID, so this is presumably a backport.
>
> Ok, let's see what it is using the web link:
>
>> url: https://github.com/0day-ci/linux/commits/Masami-Hiramatsu/tracing-probes-uaccess-Add-support-user-space-access/20190303-203749
> Yeah, that just gives a github 404 error.
Sorry for the broken link, It's ok now.
Best Regards,
Rong Chen
>
> I'm _assuming_ it's the WARN_ON_IN_IRQ() in the access_ok().
>
> Which doesn't much make sense either (why wouldn't it happen in
> mainline)? But without a useful web link to see what is actually being
> tested, I guess it's not something I can even look at...
>
> Linus
> _______________________________________________
> LKP mailing list
> [email protected]
> https://lists.01.org/mailman/listinfo/lkp
On Sun, 3 Mar 2019 18:37:59 -0800
Linus Torvalds <[email protected]> wrote:
> On Sun, Mar 3, 2019 at 5:14 PM Masami Hiramatsu <[email protected]> wrote:
> >
> > I think it comes from WARN_ON_ONCE(!segment_eq(get_fs(), USER_DS)) in
> > user_access_ok(). The call trace shows that strndup_user might be called
> > from kernel daemon context.
>
> Ahh, yes.
>
> We've had this before. We've gotten rid of the actual "use system
> calls", but we still have some of the init sequence in particular just
> calling the wrappers instead.
Are those safe if we are in init sequence?
>
> And yes, ksys_mount() takes __user pointers.
>
> It would be a lot better to use "do_mount()", which is the interface
> that takes actual "char *" pointers.
Unfortunately, it still takes a __user pointer.
long do_mount(const char *dev_name, const char __user *dir_name,
const char *type_page, unsigned long flags, void *data_page)
So what we need is
long do_mount(const char *dev_name, struct path *dir_path,
const char *type_page, unsigned long flags, void *data_page)
or introduce kern_do_mount()?
Since devtmpfsd calls ksys_chdir() and ksys_chroot(), we need to replace
those too. Fortunately, it seems that the last part which we have to fix.
Thank you,
>
> Linus
--
Masami Hiramatsu <[email protected]>
Hello,
On Mon, 4 Mar 2019 18:06:10 +0900
Masami Hiramatsu <[email protected]> wrote:
> On Sun, 3 Mar 2019 18:37:59 -0800
> Linus Torvalds <[email protected]> wrote:
>
> > On Sun, Mar 3, 2019 at 5:14 PM Masami Hiramatsu <[email protected]> wrote:
> > >
> > > I think it comes from WARN_ON_ONCE(!segment_eq(get_fs(), USER_DS)) in
> > > user_access_ok(). The call trace shows that strndup_user might be called
> > > from kernel daemon context.
> >
> > Ahh, yes.
> >
> > We've had this before. We've gotten rid of the actual "use system
> > calls", but we still have some of the init sequence in particular just
> > calling the wrappers instead.
>
> Are those safe if we are in init sequence?
>
> >
> > And yes, ksys_mount() takes __user pointers.
> >
> > It would be a lot better to use "do_mount()", which is the interface
> > that takes actual "char *" pointers.
>
> Unfortunately, it still takes a __user pointer.
>
> long do_mount(const char *dev_name, const char __user *dir_name,
> const char *type_page, unsigned long flags, void *data_page)
>
> So what we need is
>
> long do_mount(const char *dev_name, struct path *dir_path,
> const char *type_page, unsigned long flags, void *data_page)
>
> or introduce kern_do_mount()?
Would this work?
( BTW, what is this code for? It seems totally insane termination.
at least this should be done in copy_mount_options().
if (data_page)
((char *)data_page)[PAGE_SIZE - 1] = 0;
)
=======
devtmpfs: Avoid passing kernel memory to __user parameter
From: Masami Hiramatsu <[email protected]>
Since ksys_mount(), ksys_chdir() and ksys_chroot() takes
__user parameters, devtmpfsd must not pass the kernel memory
to them. On some arch, it is not possible to pass th kernel
memory to them, since the memory address spaces can be
different.
This introduces kern_do_mount(), kern_chdir() and kern_chroot()
and replaces ksys_* in devtmpfsd().
Signed-off-by: Masami Hiramatsu <[email protected]>
---
drivers/base/devtmpfs.c | 12 +++++++++---
fs/namespace.c | 31 +++++++++++++++++++++++++------
fs/open.c | 34 ++++++++++++++++++++++++++++++----
include/linux/fs.h | 2 ++
include/linux/syscalls.h | 6 ++++++
5 files changed, 72 insertions(+), 13 deletions(-)
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 0dbc43068eeb..e418b004d6c6 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -381,14 +381,20 @@ static int devtmpfsd(void *p)
{
char options[] = "mode=0755";
int *err = p;
+
*err = ksys_unshare(CLONE_NEWNS);
if (*err)
goto out;
- *err = ksys_mount("devtmpfs", "/", "devtmpfs", MS_SILENT, options);
+
+ *err = kern_do_mount("devtmpfs", "/", "devtmpfs", MS_SILENT, options);
+ if (*err)
+ goto out;
+ *err = kern_chdir("/.."); /* will traverse into overmounted root */
+ if (*err)
+ goto out;
+ *err = kern_chroot(".");
if (*err)
goto out;
- ksys_chdir("/.."); /* will traverse into overmounted root */
- ksys_chroot(".");
complete(&setup_done);
while (1) {
spin_lock(&req_lock);
diff --git a/fs/namespace.c b/fs/namespace.c
index 678ef175d63a..68a6b573107d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2761,8 +2761,9 @@ char *copy_mount_string(const void __user *data)
* Therefore, if this magic number is present, it carries no information
* and must be discarded.
*/
-long do_mount(const char *dev_name, const char __user *dir_name,
- const char *type_page, unsigned long flags, void *data_page)
+static inline long __do_mount(const char *dev_name, const char *kdir_name,
+ const char __user *dir_name, const char *type_page,
+ unsigned long flags, void *data_page)
{
struct path path;
unsigned int mnt_flags = 0, sb_flags;
@@ -2773,14 +2774,14 @@ long do_mount(const char *dev_name, const char __user *dir_name,
flags &= ~MS_MGC_MSK;
/* Basic sanity checks */
- if (data_page)
- ((char *)data_page)[PAGE_SIZE - 1] = 0;
-
if (flags & MS_NOUSER)
return -EINVAL;
/* ... and get the mountpoint */
- retval = user_path(dir_name, &path);
+ if (kdir_name)
+ retval = kern_path(kdir_name, LOOKUP_FOLLOW, &path);
+ else
+ retval = user_path(dir_name, &path);
if (retval)
return retval;
@@ -2849,6 +2850,24 @@ long do_mount(const char *dev_name, const char __user *dir_name,
return retval;
}
+long kern_do_mount(const char *dev_name, const char *dir_name,
+ const char *type_page, unsigned long flags, void *data_page)
+{
+ return __do_mount(dev_name, dir_name, NULL, type_page, flags,
+ data_page);
+}
+
+long do_mount(const char *dev_name, const char __user *dir_name,
+ const char *type_page, unsigned long flags, void *data_page)
+{
+ /* This must come from copy_mount_options */
+ if (data_page)
+ ((char *)data_page)[PAGE_SIZE - 1] = 0;
+
+ return __do_mount(dev_name, NULL, dir_name, type_page, flags,
+ data_page);
+}
+
static struct ucounts *inc_mnt_namespaces(struct user_namespace *ns)
{
return inc_ucount(ns, current_euid(), UCOUNT_MNT_NAMESPACES);
diff --git a/fs/open.c b/fs/open.c
index 0285ce7dbd51..2a9eee753d3a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -430,13 +430,16 @@ SYSCALL_DEFINE2(access, const char __user *, filename, int, mode)
return do_faccessat(AT_FDCWD, filename, mode);
}
-int ksys_chdir(const char __user *filename)
+static int __do_chdir(const char *kfname, const char __user *ufname)
{
struct path path;
int error;
unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
retry:
- error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
+ if (kfname)
+ error = kern_path(kfname, lookup_flags, &path);
+ else
+ error = user_path_at(AT_FDCWD, ufname, lookup_flags, &path);
if (error)
goto out;
@@ -456,6 +459,16 @@ int ksys_chdir(const char __user *filename)
return error;
}
+int kern_chdir(const char *filename)
+{
+ return __do_chdir(filename, NULL);
+}
+
+int ksys_chdir(const char __user *filename)
+{
+ return __do_chdir(NULL, filename);
+}
+
SYSCALL_DEFINE1(chdir, const char __user *, filename)
{
return ksys_chdir(filename);
@@ -483,13 +496,16 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
return error;
}
-int ksys_chroot(const char __user *filename)
+static int __do_chroot(const char *kfname, const char __user *ufname)
{
struct path path;
int error;
unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
retry:
- error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
+ if (kfname)
+ error = kern_path(kfname, lookup_flags, &path);
+ else
+ error = user_path_at(AT_FDCWD, ufname, lookup_flags, &path);
if (error)
goto out;
@@ -516,6 +532,16 @@ int ksys_chroot(const char __user *filename)
return error;
}
+int kern_chroot(const char *filename)
+{
+ return __do_chroot(filename, NULL);
+}
+
+int ksys_chroot(const char __user *filename)
+{
+ return __do_chroot(NULL, filename);
+}
+
SYSCALL_DEFINE1(chroot, const char __user *, filename)
{
return ksys_chroot(filename);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 29d8e2cfed0e..38be90b86435 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2279,6 +2279,8 @@ extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);
extern long do_mount(const char *, const char __user *,
const char *, unsigned long, void *);
+extern long kern_do_mount(const char *, const char *,
+ const char *, unsigned long, void *);
extern struct vfsmount *collect_mounts(const struct path *);
extern void drop_collected_mounts(struct vfsmount *);
extern int iterate_mounts(int (*)(struct vfsmount *, void *), void *,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 257cccba3062..ec8bfca990c9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1146,6 +1146,10 @@ asmlinkage long sys_ni_syscall(void);
* Kernel code should not call syscalls (i.e., sys_xyzyyz()) directly.
* Instead, use one of the functions which work equivalently, such as
* the ksys_xyzyyz() functions prototyped below.
+ * However, ksys_xyzxyz() functions should be used carefully, because
+ * kernel must not pass kernel memory to __user parameter. Force type
+ * casting is meaningless, since those address will be accessed by
+ * copy_from_user() etc. which will reject accessing kernel memory.
*/
int ksys_mount(char __user *dev_name, char __user *dir_name, char __user *type,
@@ -1153,8 +1157,10 @@ int ksys_mount(char __user *dev_name, char __user *dir_name, char __user *type,
int ksys_umount(char __user *name, int flags);
int ksys_dup(unsigned int fildes);
int ksys_chroot(const char __user *filename);
+int kern_chroot(const char *filename);
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count);
int ksys_chdir(const char __user *filename);
+int kern_chdir(const char *filename);
int ksys_fchmod(unsigned int fd, umode_t mode);
int ksys_fchown(unsigned int fd, uid_t user, gid_t group);
int ksys_getdents64(unsigned int fd, struct linux_dirent64 __user *dirent,
--
Masami Hiramatsu <[email protected]>
On Mon, Mar 4, 2019 at 4:16 PM Masami Hiramatsu <[email protected]> wrote:
>
> Hello,
>
> On Mon, 4 Mar 2019 18:06:10 +0900
> Masami Hiramatsu <[email protected]> wrote:
>
> > On Sun, 3 Mar 2019 18:37:59 -0800
> > Linus Torvalds <[email protected]> wrote:
> >
> > > On Sun, Mar 3, 2019 at 5:14 PM Masami Hiramatsu <[email protected]> wrote:
> > > >
> > > > I think it comes from WARN_ON_ONCE(!segment_eq(get_fs(), USER_DS)) in
> > > > user_access_ok(). The call trace shows that strndup_user might be called
> > > > from kernel daemon context.
> > >
> > > Ahh, yes.
> > >
> > > We've had this before. We've gotten rid of the actual "use system
> > > calls", but we still have some of the init sequence in particular just
> > > calling the wrappers instead.
> >
> > Are those safe if we are in init sequence?
> >
> > >
> > > And yes, ksys_mount() takes __user pointers.
> > >
> > > It would be a lot better to use "do_mount()", which is the interface
> > > that takes actual "char *" pointers.
> >
> > Unfortunately, it still takes a __user pointer.
> >
> > long do_mount(const char *dev_name, const char __user *dir_name,
> > const char *type_page, unsigned long flags, void *data_page)
> >
> > So what we need is
> >
> > long do_mount(const char *dev_name, struct path *dir_path,
> > const char *type_page, unsigned long flags, void *data_page)
> >
> > or introduce kern_do_mount()?
>
> Would this work?
>
> ( BTW, what is this code for? It seems totally insane termination.
> at least this should be done in copy_mount_options().
> if (data_page)
> ((char *)data_page)[PAGE_SIZE - 1] = 0;
> )
>
> =======
> devtmpfs: Avoid passing kernel memory to __user parameter
>
> From: Masami Hiramatsu <[email protected]>
>
> Since ksys_mount(), ksys_chdir() and ksys_chroot() takes
> __user parameters, devtmpfsd must not pass the kernel memory
> to them. On some arch, it is not possible to pass th kernel
> memory to them, since the memory address spaces can be
> different.
strncpy_from_user() also uses user_access_begin(), and that thing's
used from all the VFS syscalls via getname(). And those are used all
over the place in init/ - for example:
init/do_mounts.c uses ksys_open() and ksys_chroot()
init/do_mounts_initrd.c uses ksys_open(), ksys_chdir(), ksys_mount(),
ksys_mkdir() and ksys_unlink()
init/do_mounts_md.c uses ksys_open()
and so on.
(identify_ramdisk_image() even uses ksys_lseek() and ksys_read()
instead of just using kernel_read()...)
On Mon, Mar 4, 2019 at 1:06 AM Masami Hiramatsu <[email protected]> wrote:
>
> On Sun, 3 Mar 2019 18:37:59 -0800
> Linus Torvalds <[email protected]> wrote:
> > We've had this before. We've gotten rid of the actual "use system
> > calls", but we still have some of the init sequence in particular just
> > calling the wrappers instead.
>
> Are those safe if we are in init sequence?
Yes, they are, it runs with set_fs(KERNEL_DS).
But the patches made that now complain about copying from non-user
space, even though it's fine.
Basically, "strncpy_from_user()" shouldn't use "user_access_ok()",
since it actually can take a kernel address (with set_fs()).
Your "unsafe" version for tracing that actually sets "set_fs(USER_DS)"
is thje only thing that should use that helper.
> > And yes, ksys_mount() takes __user pointers.
> >
> > It would be a lot better to use "do_mount()", which is the interface
> > that takes actual "char *" pointers.
>
> Unfortunately, it still takes a __user pointer.
Ahh, yes, the name remains in user space.
Besides, I'm sure you'd just hit other cases instead where people use
set_fs() and copy strings.
> So what we need is
>
> long do_mount(const char *dev_name, struct path *dir_path,
> const char *type_page, unsigned long flags, void *data_page)
>
> or introduce kern_do_mount()?
It's actually fairly painful. Particularly because of that "void *data_page".
Your second email with "Would this work?" helper function _wopuldn't_
work correctly, exactly because you passed in a regular string to the
data page.
Also, I don't want to see code that replaces the unconditional "copy
path from user space" with a conditional "do we have path in kernel
space".
So together with the whole "uyou'll hit other peoblems anyway", I
don't think this is a good approach.
I think you simply need to have a separate "unsafe_strncpy()"
function, and not change the existing "strncpy_from_user()".
Linus
On Mon, 4 Mar 2019 10:59:22 -0800
Linus Torvalds <[email protected]> wrote:
> On Mon, Mar 4, 2019 at 1:06 AM Masami Hiramatsu <[email protected]> wrote:
> >
> > On Sun, 3 Mar 2019 18:37:59 -0800
> > Linus Torvalds <[email protected]> wrote:
> > > We've had this before. We've gotten rid of the actual "use system
> > > calls", but we still have some of the init sequence in particular just
> > > calling the wrappers instead.
> >
> > Are those safe if we are in init sequence?
>
> Yes, they are, it runs with set_fs(KERNEL_DS).
>
> But the patches made that now complain about copying from non-user
> space, even though it's fine.
>
> Basically, "strncpy_from_user()" shouldn't use "user_access_ok()",
> since it actually can take a kernel address (with set_fs()).
OK, so strncpy_from_user() or any other copy_from_user() should be
available for copying kernel memory if set_fs(KERNEL_DS).
> Your "unsafe" version for tracing that actually sets "set_fs(USER_DS)"
> is thje only thing that should use that helper.
I see, it ensures it is accessing user-memory.
>
> > > And yes, ksys_mount() takes __user pointers.
> > >
> > > It would be a lot better to use "do_mount()", which is the interface
> > > that takes actual "char *" pointers.
> >
> > Unfortunately, it still takes a __user pointer.
>
> Ahh, yes, the name remains in user space.
>
> Besides, I'm sure you'd just hit other cases instead where people use
> set_fs() and copy strings.
Yeah, under init/ I saw such cases.
>
> > So what we need is
> >
> > long do_mount(const char *dev_name, struct path *dir_path,
> > const char *type_page, unsigned long flags, void *data_page)
> >
> > or introduce kern_do_mount()?
>
> It's actually fairly painful. Particularly because of that "void *data_page".
Yeah, that is what I've hit while testing :-(
>
> Your second email with "Would this work?" helper function _wopuldn't_
> work correctly, exactly because you passed in a regular string to the
> data page.
>
> Also, I don't want to see code that replaces the unconditional "copy
> path from user space" with a conditional "do we have path in kernel
> space".
Yes, that's just a hack :)
>
> So together with the whole "uyou'll hit other peoblems anyway", I
> don't think this is a good approach.
>
> I think you simply need to have a separate "unsafe_strncpy()"
> function, and not change the existing "strncpy_from_user()".
Would you mean implementing yet another "strncpy_from_user without
pagefault"?
What we changed here is just use user_access_ok() instead access_ok()
in user_access_begin() because access_ok() may cause false-positive
warning if we use it in IRQ.
I think the better way to do this is allowing strncpy_from_user()
if some conditions are match, like
- strncpy_from_user() will be able to copy user memory with set_fs(USER_DS)
- strncpy_from_user() can copy kernel memory with set_fs(KERNEL_DS)
- strncpy_from_user() can access unsafe memory in IRQ context if
pagefault is disabled.
This is almost done, except for CONFIG_DEBUG_ATOMIC_SLEEP=y on x86.
So, what about adding a condition to WARN_ON_IN_IRQ() like below
instead of introducing user_access_ok() ?
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 780f2b42c8ef..ec0f0b74c9ab 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -70,7 +70,7 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
})
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
-# define WARN_ON_IN_IRQ() WARN_ON_ONCE(!in_task())
+# define WARN_ON_IN_IRQ() WARN_ON_ONCE(pagefault_disabled() && !in_task())
#else
# define WARN_ON_IN_IRQ()
#endif
Of course we have to move pagefault_disabled() macro in somewhere better place.
Thank you,
--
Masami Hiramatsu <[email protected]>
On Tue, 5 Mar 2019 11:36:35 +0900
Masami Hiramatsu <[email protected]> wrote:
> On Mon, 4 Mar 2019 10:59:22 -0800
> Linus Torvalds <[email protected]> wrote:
>
> I think the better way to do this is allowing strncpy_from_user()
> if some conditions are match, like
>
> - strncpy_from_user() will be able to copy user memory with set_fs(USER_DS)
> - strncpy_from_user() can copy kernel memory with set_fs(KERNEL_DS)
> - strncpy_from_user() can access unsafe memory in IRQ context if
> pagefault is disabled.
>
> This is almost done, except for CONFIG_DEBUG_ATOMIC_SLEEP=y on x86.
>
> So, what about adding a condition to WARN_ON_IN_IRQ() like below
> instead of introducing user_access_ok() ?
>
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index 780f2b42c8ef..ec0f0b74c9ab 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -70,7 +70,7 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
> })
>
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> -# define WARN_ON_IN_IRQ() WARN_ON_ONCE(!in_task())
> +# define WARN_ON_IN_IRQ() WARN_ON_ONCE(pagefault_disabled() && !in_task())
Oops, I meant !pagefault_disabled() && !in_task().
> #else
> # define WARN_ON_IN_IRQ()
> #endif
>
> Of course we have to move pagefault_disabled() macro in somewhere better place.
And I think we don't need it. anyway the macro is referenced via linux/uaccess.h.
Thank you,
--
Masami Hiramatsu <[email protected]>
On Tue, 5 Mar 2019 17:22:41 +0900
Masami Hiramatsu <[email protected]> wrote:
> On Tue, 5 Mar 2019 11:36:35 +0900
> Masami Hiramatsu <[email protected]> wrote:
>
> > On Mon, 4 Mar 2019 10:59:22 -0800
> > Linus Torvalds <[email protected]> wrote:
> >
> > I think the better way to do this is allowing strncpy_from_user()
> > if some conditions are match, like
> >
> > - strncpy_from_user() will be able to copy user memory with set_fs(USER_DS)
> > - strncpy_from_user() can copy kernel memory with set_fs(KERNEL_DS)
> > - strncpy_from_user() can access unsafe memory in IRQ context if
> > pagefault is disabled.
> >
> > This is almost done, except for CONFIG_DEBUG_ATOMIC_SLEEP=y on x86.
> >
> > So, what about adding a condition to WARN_ON_IN_IRQ() like below
> > instead of introducing user_access_ok() ?
OK, here is the patch. Does it work for us?
====
x86/uaccess: Verify access_ok() context strictly
From: Masami Hiramatsu <[email protected]>
WARN_ON_IN_IRQ() assumes that the access_ok() and following
user memory access can sleep. But this assumption is not
always correct; when the pagefault is disabled, following
memory access will just returns -EFAULT and never sleep.
Add pagefault_disabled() check in WARN_ON_ONCE() so that
it can ignore the case we call it with disabling pagefault.
For this purpose, this modified pagefault_disabled() as
an inline function.
Signed-off-by: Masami Hiramatsu <[email protected]>
---
arch/x86/include/asm/uaccess.h | 4 +++-
include/linux/uaccess.h | 5 ++++-
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 780f2b4..b98a552 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -70,7 +70,9 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
})
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
-# define WARN_ON_IN_IRQ() WARN_ON_ONCE(!in_task())
+static inline bool pagefault_disabled(void);
+# define WARN_ON_IN_IRQ() \
+ WARN_ON_ONCE(!in_task() && !pagefault_disabled())
#else
# define WARN_ON_IN_IRQ()
#endif
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 37b226e..ef3032d 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -203,7 +203,10 @@ static inline void pagefault_enable(void)
/*
* Is the pagefault handler disabled? If so, user access methods will not sleep.
*/
-#define pagefault_disabled() (current->pagefault_disabled != 0)
+static inline bool pagefault_disabled(void)
+{
+ return current->pagefault_disabled != 0;
+}
/*
* The pagefault handler is in general disabled by pagefault_disable() or
On Tue, Mar 05, 2019 at 11:36:35AM +0900, Masami Hiramatsu wrote:
> I think the better way to do this is allowing strncpy_from_user()
O
> if some conditions are match, like
>
> - strncpy_from_user() will be able to copy user memory with set_fs(USER_DS)
> - strncpy_from_user() can copy kernel memory with set_fs(KERNEL_DS)
> - strncpy_from_user() can access unsafe memory in IRQ context if
> pagefault is disabled.
>
> This is almost done, except for CONFIG_DEBUG_ATOMIC_SLEEP=y on x86.
>
> So, what about adding a condition to WARN_ON_IN_IRQ() like below
> instead of introducing user_access_ok() ?
>
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index 780f2b42c8ef..ec0f0b74c9ab 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -70,7 +70,7 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
> })
>
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> -# define WARN_ON_IN_IRQ() WARN_ON_ONCE(!in_task())
> +# define WARN_ON_IN_IRQ() WARN_ON_ONCE(pagefault_disabled() && !in_task())
That doesn't make any kind of sense to me; see faulthandler_disabled().
IOW. interrupt (and any atomic context really) won't take faults anyway.
I dislike that whole KERNEL_DS thing, but obviously that's not something
that's going away.
Would something like:
WARN_ON_ONCE(!(in_task || segment_eq(get_fs(), USER_DS)))
Work? Then we allow KERNEL_DS in task context, but for interrupt and
others require USER_DS.
On Tue, 5 Mar 2019 10:07:29 +0100
Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 05, 2019 at 11:36:35AM +0900, Masami Hiramatsu wrote:
> > I think the better way to do this is allowing strncpy_from_user()
> O
> > if some conditions are match, like
> >
> > - strncpy_from_user() will be able to copy user memory with set_fs(USER_DS)
> > - strncpy_from_user() can copy kernel memory with set_fs(KERNEL_DS)
> > - strncpy_from_user() can access unsafe memory in IRQ context if
> > pagefault is disabled.
> >
> > This is almost done, except for CONFIG_DEBUG_ATOMIC_SLEEP=y on x86.
> >
> > So, what about adding a condition to WARN_ON_IN_IRQ() like below
> > instead of introducing user_access_ok() ?
> >
> > diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> > index 780f2b42c8ef..ec0f0b74c9ab 100644
> > --- a/arch/x86/include/asm/uaccess.h
> > +++ b/arch/x86/include/asm/uaccess.h
> > @@ -70,7 +70,7 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
> > })
> >
> > #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> > -# define WARN_ON_IN_IRQ() WARN_ON_ONCE(!in_task())
> > +# define WARN_ON_IN_IRQ() WARN_ON_ONCE(pagefault_disabled() && !in_task())
>
> That doesn't make any kind of sense to me; see faulthandler_disabled().
> IOW. interrupt (and any atomic context really) won't take faults anyway.
Hmm, I thought CONFIG_DEBUG_ATOMIC_SLEEP=y tries to detect that some operations
which can sleep in atomic, like IRQ context, doesn't it?
(note that above should be !pagefault_disabled() anyway)
So I guessed WARN_ON_IN_IRQ() intended to detect the access_ok() was used
in atomic, because it might follow some copy_from_user() like operation
which can sleep when it hits a pagefault. Is my guess wrong?
If correct, I think if pagefault is disabled, the caller never sleep,
so we don't need to take care of that.
Could you tell me why WARN_ON_ONCE(!in_task()) is needed in access_ok()?
>
> I dislike that whole KERNEL_DS thing, but obviously that's not something
> that's going away.
>
> Would something like:
>
> WARN_ON_ONCE(!(in_task || segment_eq(get_fs(), USER_DS)))
>
> Work? Then we allow KERNEL_DS in task context, but for interrupt and
> others require USER_DS.
But what would this mean? I can't understand why we limit using
access_ok() so strictly and narrow the cases.
Thank you,
--
Masami Hiramatsu <[email protected]>
On Tue, Mar 05, 2019 at 10:58:01PM +0900, Masami Hiramatsu wrote:
> Could you tell me why WARN_ON_ONCE(!in_task()) is needed in access_ok()?
That came from here:
lkml.kernel.org/r/[email protected]
Because in-irq usage is dodgy, since we don't actually know what mm or
ds it loaded.
> > I dislike that whole KERNEL_DS thing, but obviously that's not something
> > that's going away.
> >
> > Would something like:
> >
> > WARN_ON_ONCE(!(in_task || segment_eq(get_fs(), USER_DS)))
> >
> > Work? Then we allow KERNEL_DS in task context, but for interrupt and
> > others require USER_DS.
>
> But what would this mean? I can't understand why we limit using
> access_ok() so strictly and narrow the cases.
Because it's been a source of bugs. Any sanity checking we can put in
seems like a good thing at this point.
On Tue, 5 Mar 2019 15:53:06 +0100
Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 05, 2019 at 10:58:01PM +0900, Masami Hiramatsu wrote:
>
> > Could you tell me why WARN_ON_ONCE(!in_task()) is needed in access_ok()?
>
> That came from here:
>
> lkml.kernel.org/r/[email protected]
>
> Because in-irq usage is dodgy, since we don't actually know what mm or
> ds it loaded.
Yes, I would like to allow it only if setting pagefault-disable correctly.
(and setting ds too, it is good to me)
>
> > > I dislike that whole KERNEL_DS thing, but obviously that's not something
> > > that's going away.
> > >
> > > Would something like:
> > >
> > > WARN_ON_ONCE(!(in_task || segment_eq(get_fs(), USER_DS)))
> > >
> > > Work? Then we allow KERNEL_DS in task context, but for interrupt and
> > > others require USER_DS.
> >
> > But what would this mean? I can't understand why we limit using
> > access_ok() so strictly and narrow the cases.
>
> Because it's been a source of bugs. Any sanity checking we can put in
> seems like a good thing at this point.
Hmm, I see yours is strict, fit with current code, but complicated rule.
- strncpy_from_user() can access user memory with set_fs(USER_DS)
in task context
- strncpy_from_user() can access kernel memory with set_fs(KERNEL_DS)
in task context
- strncpy_from_user() can access user memory in IRQ context if
pagefault is disabled and with set_fs(USER_DS). (but pagefault-disabled
is not verified)
- strncpy_from_user() never allowed to access kernel memory in IRQ context,
even if pagefault is disabled and with set_fs(KERNEL_DS).
And mine is simple.
- strncpy_from_user() can access user memory with set_fs(USER_DS)
in task context
- strncpy_from_user() can access kernel memory with set_fs(KERNEL_DS)
in task context
- strncpy_from_user() can access user/kernel memory (depends on DS)
in IRQ context if pagefault is disabled. (both verified)
Thank you,
--
Masami Hiramatsu <[email protected]>