Changes since v2 [1]:
* style fix in Documentation/speculation.txt (Geert)
* add Russell and Catalin to the cc on the ARM patches (Russell)
* clarify changelog for "x86: introduce __uaccess_begin_nospec and
ASM_IFENCE" (Eric, Linus, Josh)
* fix the dynamic 'mask' / 'ifence' toggle vs CONFIG_JUMP_LABEL=n
(Peter)
* include the get_user_{1,2,4,8} helpers in the ASM_IFENCE protections
(Linus)
* fix array_ptr_mask for ARCH=i386 builds (Kbuild robot)
* prioritize the get_user protections, and the fdtable fix
[1]: https://lwn.net/Articles/744141/
---
Quoting Mark's original RFC:
"Recently, Google Project Zero discovered several classes of attack
against speculative execution. One of these, known as variant-1, allows
explicit bounds checks to be bypassed under speculation, providing an
arbitrary read gadget. Further details can be found on the GPZ blog [2]
and the Documentation patch in this series."
This series incorporates Mark Rutland's latest ARM changes and adds
the x86 specific implementation of 'ifence_array_ptr'. That ifence
based approach is provided as an opt-in fallback, but the default
mitigation, '__array_ptr', uses a 'mask' approach that removes
conditional branch instructions, and otherwise aims to redirect
speculation to use a NULL pointer rather than a user controlled value.
The mask is generated by the following from Alexei, and Linus:
mask = ~(long)(_i | (_s - 1 - _i)) >> (BITS_PER_LONG - 1);
...and Linus provided an optimized mask generation helper for x86:
asm ("cmpq %1,%2; sbbq %0,%0;"
:"=r" (mask)
:"r"(sz),"r" (idx)
:"cc");
The 'array_ptr' mechanism can be switched between 'mask' and 'ifence'
via the spectre_v1={mask,ifence} command line option if
CONFIG_SPECTRE1_DYNAMIC=y, and the compile-time default is otherwise set
by selecting either CONFIG_SPECTRE1_MASK or CONFIG_SPECTRE1_IFENCE. This
level of sophistication is provided given concerns about 'value
speculation' [3].
The get_user protections and 'array_ptr' infrastructure are the only
concern of this patch set. Going forward 'array_ptr' is a tool that
sub-system maintainers can use to instrument array bounds checks like
'__fcheck_files'. When to use 'array_ptr' is saved for a future patch
set, and in the meantime the 'get_user' protections raise the bar for
launching a Spectre-v1 attack.
These patches are also available via the 'nospec-v3' git branch here:
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec-v3
Note that the BPF fix for Spectre variant1 is merged for 4.15-rc8.
[2]: https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
[3]: https://marc.info/?l=linux-netdev&m=151527996901350&w=2
---
Dan Williams (6):
x86: implement ifence()
x86: implement ifence_array_ptr() and array_ptr_mask()
asm/nospec: mask speculative execution flows
x86: introduce __uaccess_begin_nospec and ASM_IFENCE
x86: use __uaccess_begin_nospec and ASM_IFENCE in get_user paths
vfs, fdtable: prevent bounds-check bypass via speculative execution
Mark Rutland (3):
Documentation: document array_ptr
arm64: implement ifence_array_ptr()
arm: implement ifence_array_ptr()
Documentation/speculation.txt | 143 +++++++++++++++++++++++++++++++++++++
arch/arm/Kconfig | 1
arch/arm/include/asm/barrier.h | 24 ++++++
arch/arm64/Kconfig | 1
arch/arm64/include/asm/barrier.h | 24 ++++++
arch/x86/Kconfig | 3 +
arch/x86/include/asm/barrier.h | 50 +++++++++++++
arch/x86/include/asm/msr.h | 3 -
arch/x86/include/asm/smap.h | 4 +
arch/x86/include/asm/uaccess.h | 16 +++-
arch/x86/include/asm/uaccess_32.h | 6 +-
arch/x86/include/asm/uaccess_64.h | 12 ++-
arch/x86/lib/copy_user_64.S | 3 +
arch/x86/lib/getuser.S | 5 +
arch/x86/lib/usercopy_32.c | 8 +-
include/linux/fdtable.h | 7 +-
include/linux/nospec.h | 92 ++++++++++++++++++++++++
kernel/Kconfig.nospec | 46 ++++++++++++
kernel/Makefile | 1
kernel/nospec.c | 52 +++++++++++++
lib/Kconfig | 3 +
21 files changed, 484 insertions(+), 20 deletions(-)
create mode 100644 Documentation/speculation.txt
create mode 100644 include/linux/nospec.h
create mode 100644 kernel/Kconfig.nospec
create mode 100644 kernel/nospec.c
The new barrier, 'ifence', ensures that speculative execution never
crosses the fence.
Previously the kernel only needed this fence in 'rdtsc_ordered', but now
it is also proposed as a mitigation against Spectre variant1 attacks.
When used it needs to be placed in the success path after a bounds check
i.e.:
if (x < max) {
ifence();
val = array[x];
} else
return -EINVAL;
With this change the cpu will never issue speculative reads of
'array + x' with values of x >= max.
'ifence', via 'ifence_array_ptr', is an opt-in fallback to the default
mitigation provided by '__array_ptr'. It is also proposed for blocking
speculation in the 'get_user' path to bypass 'access_ok' checks. For
now, just provide the common definition for later patches to build upon.
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Alan Cox <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/barrier.h | 4 ++++
arch/x86/include/asm/msr.h | 3 +--
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 7fb336210e1b..b04f572d6d97 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,6 +24,10 @@
#define wmb() asm volatile("sfence" ::: "memory")
#endif
+/* prevent speculative execution past this barrier */
+#define ifence() alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC, \
+ "lfence", X86_FEATURE_LFENCE_RDTSC)
+
#ifdef CONFIG_X86_PPRO_FENCE
#define dma_rmb() rmb()
#else
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 07962f5f6fba..e426d2a33ff3 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -214,8 +214,7 @@ static __always_inline unsigned long long rdtsc_ordered(void)
* that some other imaginary CPU is updating continuously with a
* time stamp.
*/
- alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC,
- "lfence", X86_FEATURE_LFENCE_RDTSC);
+ ifence();
return rdtsc();
}
'__array_ptr' is proposed as a generic mechanism to mitigate against
Spectre-variant-1 attacks (i.e. attacks that bypass memory bounds
checks via speculative execution). The '__array_ptr' implementation
appears safe for current generation cpus across multiple architectures.
In comparison, 'ifence_array_ptr' uses a hard / architectural 'ifence'
approach to preclude the possibility of speculative execution. However, it
is not the default given a concern for avoiding instruction-execution
barriers in potential fast paths.
Based on an original implementation by Linus Torvalds, tweaked to remove
speculative flows by Alexei Starovoitov, and tweaked again by Linus to
introduce an x86 assembly implementation for the mask generation.
Co-developed-by: Linus Torvalds <[email protected]>
Co-developed-by: Alexei Starovoitov <[email protected]>
Co-developed-by: Peter Zijlstra <[email protected]>
Cc: Russell King <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Signed-off-by: Dan Williams <[email protected]>
---
arch/arm/Kconfig | 1 +
arch/arm64/Kconfig | 1 +
arch/x86/Kconfig | 3 ++
include/linux/nospec.h | 92 ++++++++++++++++++++++++++++++++++++++++++++++++
kernel/Kconfig.nospec | 46 ++++++++++++++++++++++++
kernel/Makefile | 1 +
kernel/nospec.c | 52 +++++++++++++++++++++++++++
lib/Kconfig | 3 ++
8 files changed, 199 insertions(+)
create mode 100644 include/linux/nospec.h
create mode 100644 kernel/Kconfig.nospec
create mode 100644 kernel/nospec.c
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 51c8df561077..fd4789ec8cac 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
+ select ARCH_HAS_IFENCE
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
select ARCH_HAS_STRICT_MODULE_RWX if MMU
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c9a7e9e1414f..22765c4b6986 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -16,6 +16,7 @@ config ARM64
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_GIGANTIC_PAGE if (MEMORY_ISOLATION && COMPACTION) || CMA
select ARCH_HAS_KCOV
+ select ARCH_HAS_IFENCE
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d4fc98c50378..68698289c83c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV if X86_64
+ select ARCH_HAS_IFENCE
select ARCH_HAS_PMEM_API if X86_64
# Causing hangs/crashes, see the commit that added this change for details.
select ARCH_HAS_REFCOUNT
@@ -442,6 +443,8 @@ config INTEL_RDT
Say N if unsure.
+source "kernel/Kconfig.nospec"
+
if X86_32
config X86_EXTENDED_PLATFORM
bool "Support for extended (non-PC) x86 platforms"
diff --git a/include/linux/nospec.h b/include/linux/nospec.h
new file mode 100644
index 000000000000..f6e7ba7a7344
--- /dev/null
+++ b/include/linux/nospec.h
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+
+#ifndef __NOSPEC_H__
+#define __NOSPEC_H__
+
+#include <linux/jump_label.h>
+#include <asm/barrier.h>
+
+/*
+ * If idx is negative or if idx >= sz then bit 63 of (idx | (sz - 1 - idx))
+ * is set and the arithmetic shift yields a mask of zero. When the mask is
+ * zero the bounds check has failed and __array_ptr will return NULL.
+ */
+#ifndef array_ptr_mask
+#define array_ptr_mask(idx, sz) \
+({ \
+ unsigned long mask; \
+ unsigned long _i = (idx); \
+ unsigned long _s = (sz); \
+ \
+ mask = ~(long)(_i | (_s - 1 - _i)) >> (BITS_PER_LONG - 1); \
+ mask; \
+})
+#endif
+
+/**
+ * __array_ptr - Generate a pointer to an array element, ensuring
+ * the pointer is bounded under speculation to NULL.
+ *
+ * @base: the base of the array
+ * @idx: the index of the element, must be less than LONG_MAX
+ * @sz: the number of elements in the array, must be less than LONG_MAX
+ *
+ * If @idx falls in the interval [0, @sz), returns the pointer to
+ * @arr[@idx], otherwise returns NULL.
+ */
+#define __array_ptr(base, idx, sz) \
+({ \
+ union { typeof(*(base)) *_ptr; unsigned long _bit; } __u; \
+ typeof(*(base)) *_arr = (base); \
+ unsigned long _i = (idx); \
+ unsigned long _mask = array_ptr_mask(_i, (sz)); \
+ \
+ __u._ptr = _arr + (_i & _mask); \
+ __u._bit &= _mask; \
+ __u._ptr; \
+})
+
+#if defined(ARCH_HAS_IFENCE) && !defined(ifence_array_ptr)
+#error Arch claims ARCH_HAS_IFENCE, but does not implement ifence_array_ptr
+#endif
+
+#ifdef CONFIG_SPECTRE1_DYNAMIC
+#ifndef HAVE_JUMP_LABEL
+#error Compiler lacks asm-goto, can generate unsafe code
+#endif
+
+#ifdef CONFIG_SPECTRE1_IFENCE
+DECLARE_STATIC_KEY_TRUE(nospec_key);
+#else
+DECLARE_STATIC_KEY_FALSE(nospec_key);
+#endif
+
+/*
+ * The expectation is that no compiler or cpu will mishandle __array_ptr
+ * leading to problematic speculative execution. Bypass the ifence
+ * based implementation by default.
+ */
+#define array_ptr(base, idx, sz) \
+({ \
+ typeof(*(base)) *__ret; \
+ \
+ if (static_branch_unlikely(&nospec_key)) \
+ __ret = ifence_array_ptr(base, idx, sz); \
+ else \
+ __ret = __array_ptr(base, idx, sz); \
+ __ret; \
+})
+#else /* CONFIG_SPECTRE1_DYNAMIC */
+/*
+ * If jump labels are disabled, hard-code array_ptr to either
+ * ifence_array_ptr or __array_ptr based on the config choice.
+ */
+#ifdef CONFIG_SPECTRE1_IFENCE
+#define array_ptr ifence_array_ptr
+#else
+/* fallback to __array_ptr by default */
+#define array_ptr __array_ptr
+#endif
+#endif /* CONFIG_SPECTRE1_DYNAMIC */
+#endif /* __NOSPEC_H__ */
diff --git a/kernel/Kconfig.nospec b/kernel/Kconfig.nospec
new file mode 100644
index 000000000000..33e34a87d067
--- /dev/null
+++ b/kernel/Kconfig.nospec
@@ -0,0 +1,46 @@
+# SPDX-License-Identifier: GPL-2.0
+
+menu "Speculative execution past bounds check"
+ depends on ARCH_HAS_IFENCE
+
+choice
+ prompt "Speculative execution past bounds check"
+ default SPECTRE1_MASK
+ help
+ Select the default mechanism for guarding against kernel
+ memory leaks via speculative execution past a boundary-check
+ (Spectre variant1) . This choice determines the contents of
+ the array_ptr() helper. Note, that vulnerable code paths need
+ to be instrumented with this helper to be protected.
+
+config SPECTRE1_MASK
+ bool "mask"
+ help
+ Provide an array_ptr() implementation that arranges for only
+ safe speculative flows to be exposed to the compiler/cpu. It
+ is preferred over "ifence" since it arranges for problematic
+ speculation to be disabled without need of an instruction
+ barrier.
+
+config SPECTRE1_IFENCE
+ bool "ifence"
+ depends on ARCH_HAS_IFENCE
+ help
+ Provide an array_ptr() implementation that is specified by the
+ cpu architecture to barrier all speculative execution. Unless
+ you have specific knowledge of the "mask" approach being
+ unsuitable with a given compiler/cpu, select "mask".
+
+endchoice
+
+config SPECTRE1_DYNAMIC
+ bool "Support dynamic switching of speculative execution mitigation"
+ depends on ARCH_HAS_IFENCE
+ depends on JUMP_LABEL
+ help
+ For architectures that support the 'ifence' mitigation, allow
+ dynamic switching between it and the 'mask' approach. This supports
+ evaluation or emergency switching.
+
+ If unsure, say Y.
+endmenu
diff --git a/kernel/Makefile b/kernel/Makefile
index 172d151d429c..d5269be9d58a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -101,6 +101,7 @@ obj-$(CONFIG_TRACEPOINTS) += trace/
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-$(CONFIG_CPU_PM) += cpu_pm.o
obj-$(CONFIG_BPF) += bpf/
+obj-$(CONFIG_SPECTRE1_DYNAMIC) += nospec.o
obj-$(CONFIG_PERF_EVENTS) += events/
diff --git a/kernel/nospec.c b/kernel/nospec.c
new file mode 100644
index 000000000000..992de957216d
--- /dev/null
+++ b/kernel/nospec.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2018 Intel Corporation. All rights reserved.
+#include <linux/module.h>
+#include <linux/compiler.h>
+#include <linux/jump_label.h>
+#include <linux/moduleparam.h>
+
+enum {
+ F_IFENCE,
+};
+
+#ifdef CONFIG_SPECTRE1_IFENCE
+static unsigned long nospec_flag = 1 << F_IFENCE;
+DEFINE_STATIC_KEY_TRUE(nospec_key);
+#else
+static unsigned long nospec_flag;
+DEFINE_STATIC_KEY_FALSE(nospec_key);
+#endif
+
+EXPORT_SYMBOL(nospec_key);
+
+static int param_set_nospec(const char *val, const struct kernel_param *kp)
+{
+ unsigned long *flags = kp->arg;
+
+ if (strcmp(val, "ifence") == 0 || strcmp(val, "ifence\n") == 0) {
+ if (!test_and_set_bit(F_IFENCE, flags))
+ static_key_enable(&nospec_key.key);
+ return 0;
+ } else if (strcmp(val, "mask") == 0 || strcmp(val, "mask\n") == 0) {
+ if (test_and_clear_bit(F_IFENCE, flags))
+ static_key_disable(&nospec_key.key);
+ return 0;
+ }
+ return -EINVAL;
+}
+
+static int param_get_nospec(char *buffer, const struct kernel_param *kp)
+{
+ unsigned long *flags = kp->arg;
+
+ return sprintf(buffer, "%s\n", test_bit(F_IFENCE, flags)
+ ? "ifence" : "mask");
+}
+
+static struct kernel_param_ops nospec_param_ops = {
+ .set = param_set_nospec,
+ .get = param_get_nospec,
+};
+
+core_param_cb(spectre_v1, &nospec_param_ops, &nospec_flag, 0600);
+MODULE_PARM_DESC(spectre_v1, "Spectre-v1 mitigation: 'mask' (default) vs 'ifence'");
diff --git a/lib/Kconfig b/lib/Kconfig
index c5e84fbcb30b..3cc7e7a03781 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -570,6 +570,9 @@ config STACKDEPOT
bool
select STACKTRACE
+config ARCH_HAS_IFENCE
+ bool
+
config SBITMAP
bool
From: Mark Rutland <[email protected]>
This patch implements ifence_array_ptr() for arm64, using an
LDR+CSEL+CSDB sequence to inhibit speculative use of the returned value.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/arm64/include/asm/barrier.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 77651c49ef44..74ffcddb26e6 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -40,6 +40,30 @@
#define dma_rmb() dmb(oshld)
#define dma_wmb() dmb(oshst)
+#define ifence_array_ptr(arr, idx, sz) \
+({ \
+ typeof(&(arr)[0]) __nap_arr = (arr); \
+ typeof(idx) __nap_idx = (idx); \
+ typeof(sz) __nap_sz = (sz); \
+ \
+ unsigned long __nap_ptr = (unsigned long)__nap_arr + \
+ sizeof(__nap_arr[0]) * __nap_idx; \
+ \
+ asm volatile( \
+ " cmp %[i], %[s]\n" \
+ " b.cs 1f\n" \
+ " ldr %[p], %[pp]\n" \
+ "1: csel %[p], %[p], xzr, cc\n" \
+ " hint #0x14 // CSDB\n" \
+ : [p] "=&r" (__nap_ptr) \
+ : [pp] "m" (__nap_ptr), \
+ [i] "r" ((unsigned long)__nap_idx), \
+ [s] "r" ((unsigned long)__nap_sz) \
+ : "cc"); \
+ \
+ (typeof(&(__nap_arr)[0]))__nap_ptr; \
+})
+
#define __smp_mb() dmb(ish)
#define __smp_rmb() dmb(ishld)
#define __smp_wmb() dmb(ishst)
From: Mark Rutland <[email protected]>
Document the rationale and usage of the new array_ptr() helper.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Documentation/speculation.txt | 143 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 143 insertions(+)
create mode 100644 Documentation/speculation.txt
diff --git a/Documentation/speculation.txt b/Documentation/speculation.txt
new file mode 100644
index 000000000000..1e59d1d9eaf4
--- /dev/null
+++ b/Documentation/speculation.txt
@@ -0,0 +1,143 @@
+This document explains potential effects of speculation, and how undesirable
+effects can be mitigated portably using common APIs.
+
+===========
+Speculation
+===========
+
+To improve performance and minimize average latencies, many contemporary CPUs
+employ speculative execution techniques such as branch prediction, performing
+work which may be discarded at a later stage.
+
+Typically speculative execution cannot be observed from architectural state,
+such as the contents of registers. However, in some cases it is possible to
+observe its impact on microarchitectural state, such as the presence or
+absence of data in caches. Such state may form side-channels which can be
+observed to extract secret information.
+
+For example, in the presence of branch prediction, it is possible for bounds
+checks to be ignored by code which is speculatively executed. Consider the
+following code:
+
+ int load_array(int *array, unsigned int idx)
+ {
+ if (idx >= MAX_ARRAY_ELEMS)
+ return 0;
+ else
+ return array[idx];
+ }
+
+Which, on arm64, may be compiled to an assembly sequence such as:
+
+ CMP <idx>, #MAX_ARRAY_ELEMS
+ B.LT less
+ MOV <returnval>, #0
+ RET
+ less:
+ LDR <returnval>, [<array>, <idx>]
+ RET
+
+It is possible that a CPU mis-predicts the conditional branch, and
+speculatively loads array[idx], even if idx >= MAX_ARRAY_ELEMS. This value
+will subsequently be discarded, but the speculated load may affect
+microarchitectural state which can be subsequently measured.
+
+More complex sequences involving multiple dependent memory accesses may result
+in sensitive information being leaked. Consider the following code, building
+on the prior example:
+
+ int load_dependent_arrays(int *arr1, int *arr2, int idx)
+ {
+ int val1, val2;
+
+ val1 = load_array(arr1, idx);
+ val2 = load_array(arr2, val1);
+
+ return val2;
+ }
+
+Under speculation, the first call to load_array() may return the value of an
+out-of-bounds address, while the second call will influence microarchitectural
+state dependent on this value. This may provide an arbitrary read primitive.
+
+====================================
+Mitigating speculation side-channels
+====================================
+
+The kernel provides a generic API to ensure that bounds checks are respected
+even under speculation. Architectures which are affected by speculation-based
+side-channels are expected to implement these primitives.
+
+The array_ptr() helper in <asm/barrier.h> can be used to prevent
+information from being leaked via side-channels.
+
+A call to array_ptr(arr, idx, sz) returns a sanitized pointer to
+arr[idx] only if idx falls in the [0, sz) interval. When idx < 0 or idx >= sz,
+NULL is returned. Additionally, array_ptr() ensures that an out-of-bounds
+pointer is not propagated to code which is speculatively executed.
+
+This can be used to protect the earlier load_array() example:
+
+ int load_array(int *array, unsigned int idx)
+ {
+ int *elem;
+
+ elem = array_ptr(array, idx, MAX_ARRAY_ELEMS);
+ if (elem)
+ return *elem;
+ else
+ return 0;
+ }
+
+This can also be used in situations where multiple fields on a structure are
+accessed:
+
+ struct foo array[SIZE];
+ int a, b;
+
+ void do_thing(int idx)
+ {
+ struct foo *elem;
+
+ elem = array_ptr(array, idx, SIZE);
+ if (elem) {
+ a = elem->field_a;
+ b = elem->field_b;
+ }
+ }
+
+It is imperative that the returned pointer is used. Pointers which are
+generated separately are subject to a number of potential CPU and compiler
+optimizations, and may still be used speculatively. For example, this means
+that the following sequence is unsafe:
+
+ struct foo array[SIZE];
+ int a, b;
+
+ void do_thing(int idx)
+ {
+ if (array_ptr(array, idx, SIZE) != NULL) {
+ // unsafe as wrong pointer is used
+ a = array[idx].field_a;
+ b = array[idx].field_b;
+ }
+ }
+
+Similarly, it is unsafe to compare the returned pointer with other pointers,
+as this may permit the compiler to substitute one pointer with another,
+permitting speculation. For example, the following sequence is unsafe:
+
+ struct foo array[SIZE];
+ int a, b;
+
+ void do_thing(int idx)
+ {
+ struct foo *elem = array_ptr(array, idx, SIZE);
+
+ // unsafe due to pointer substitution
+ if (elem == &array[idx]) {
+ a = elem->field_a;
+ b = elem->field_b;
+ }
+ }
+
From: Mark Rutland <[email protected]>
This patch implements ifence_array_ptr() for arm, using an
LDR+MOVCS+CSDB sequence to inhibit speculative use of the returned
value.
Cc: Russell King <[email protected]>
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/arm/include/asm/barrier.h | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 40f5c410fd8c..919235ed6e68 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,30 @@ extern void arm_heavy_mb(void);
#define dma_wmb() barrier()
#endif
+#define ifence_array_ptr(arr, idx, sz) \
+({ \
+ typeof(&(arr)[0]) __nap_arr = (arr); \
+ typeof(idx) __nap_idx = (idx); \
+ typeof(sz) __nap_sz = (sz); \
+ \
+ unsigned long __nap_ptr = (unsigned long)__nap_arr + \
+ sizeof(__nap_arr[0]) * __nap_idx; \
+ \
+ asm volatile( \
+ " cmp %[i], %[s]\n" \
+ " bcs 1f\n" \
+ " ldr %[p], %[pp]\n" \
+ "1: movcs %[p], #0\n" \
+ " .inst 0xe320f018 @ CSDB\n" \
+ : [p] "=&r" (__nap_ptr) \
+ : [pp] "m" (__nap_ptr), \
+ [i] "r" ((unsigned long)__nap_idx), \
+ [s] "r" ((unsigned long)__nap_sz) \
+ : "cc"); \
+ \
+ (typeof(&(__nap_arr)[0]))__nap_ptr; \
+})
+
#define __smp_mb() dmb(ish)
#define __smp_rmb() __smp_mb()
#define __smp_wmb() dmb(ishst)
'ifence_array_ptr' is provided as an alternative to the default
'__array_ptr' implementation that uses a mask to sanitize
user-controllable pointers. Later patches will allow it to be selected via
the kernel command line. The '__array_ptr' implementation otherwise
appears safe for current generation cpus across multiple architectures.
'array_ptr_mask' is used by the default 'array_ptr' implementation to
cheaply calculate an array bounds mask.
Suggested-by: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/barrier.h | 46 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index b04f572d6d97..1b507f9e2cc7 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -28,6 +28,52 @@
#define ifence() alternative_2("", "mfence", X86_FEATURE_MFENCE_RDTSC, \
"lfence", X86_FEATURE_LFENCE_RDTSC)
+/**
+ * ifence_array_ptr - Generate a pointer to an array element,
+ * ensuring the pointer is bounded under speculation.
+ *
+ * @arr: the base of the array
+ * @idx: the index of the element
+ * @sz: the number of elements in the array
+ *
+ * If @idx falls in the interval [0, @sz), returns the pointer to
+ * @arr[@idx], otherwise returns NULL.
+ */
+#define ifence_array_ptr(arr, idx, sz) \
+({ \
+ typeof(*(arr)) *__arr = (arr), *__ret; \
+ typeof(idx) __idx = (idx); \
+ typeof(sz) __sz = (sz); \
+ \
+ __ret = __idx < __sz ? __arr + __idx : NULL; \
+ ifence(); \
+ __ret; \
+})
+
+/**
+ * array_ptr_mask - generate a mask for array_ptr() that is ~0UL when
+ * the bounds check succeeds and 0 otherwise
+ */
+#define array_ptr_mask array_ptr_mask
+static inline unsigned long array_ptr_mask(unsigned long idx, unsigned long sz)
+{
+ unsigned long mask;
+
+ /*
+ * mask = index - size; if that result is >= 0 then the index is
+ * invalid and the mask is 0, else the mask is ~0
+ */
+#ifdef CONFIG_X86_32
+ asm ("cmpl %1,%2; sbbl %0,%0;"
+#else
+ asm ("cmpq %1,%2; sbbq %0,%0;"
+#endif
+ :"=r" (mask)
+ :"r"(sz),"r" (idx)
+ :"cc");
+ return mask;
+}
+
#ifdef CONFIG_X86_PPRO_FENCE
#define dma_rmb() rmb()
#else
As expected, static analysis reports that 'fd' is a user-controlled value
that is used as a data dependency to read from the 'fdt->fd' array. In
order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid 'file *' returned from __fcheck_files.
Cc: Al Viro <[email protected]>
Co-developed-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/fdtable.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 1c65817673db..9731f1a255db 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -10,6 +10,7 @@
#include <linux/compiler.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
+#include <linux/nospec.h>
#include <linux/types.h>
#include <linux/init.h>
#include <linux/fs.h>
@@ -81,9 +82,11 @@ struct dentry;
static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
{
struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+ struct file __rcu **fdp;
- if (fd < fdt->max_fds)
- return rcu_dereference_raw(fdt->fd[fd]);
+ fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
+ if (fdp)
+ return rcu_dereference_raw(*fdp);
return NULL;
}
For 'get_user' paths, do not allow the kernel to speculate on the value
of a user controlled pointer. In addition to the 'stac' instruction for
Supervisor Mode Access Protection, an 'ifence' causes the 'access_ok'
result to resolve in the pipeline before the cpu might take any
speculative action on the pointer value.
Since this is a major kernel interface that deals with user controlled
data, the '__uaccess_begin_nospec' mechanism will prevent speculative
execution past an 'access_ok' permission check. While speculative
execution past 'access_ok' is not enough to lead to a kernel memory
leak, it is a necessary precondition.
To be clear, '__uaccess_begin_nospec' and ASM_IFENCE are not addressing
any known issues with 'get_user'; they are addressing a class of
potential problems that could be near 'get_user' usages. In other words,
these helpers are for hygiene not clinical fixes.
There are no functional changes in this patch.
Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Andi Kleen <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/smap.h | 4 ++++
arch/x86/include/asm/uaccess.h | 10 ++++++++++
2 files changed, 14 insertions(+)
diff --git a/arch/x86/include/asm/smap.h b/arch/x86/include/asm/smap.h
index db333300bd4b..0b59707e0b46 100644
--- a/arch/x86/include/asm/smap.h
+++ b/arch/x86/include/asm/smap.h
@@ -40,6 +40,10 @@
#endif /* CONFIG_X86_SMAP */
+#define ASM_IFENCE \
+ ALTERNATIVE_2 "", "mfence", X86_FEATURE_MFENCE_RDTSC, \
+ "lfence", X86_FEATURE_LFENCE_RDTSC
+
#else /* __ASSEMBLY__ */
#include <asm/alternative.h>
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 574dff4d2913..a31fd4fc6483 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -124,6 +124,11 @@ extern int __get_user_bad(void);
#define __uaccess_begin() stac()
#define __uaccess_end() clac()
+#define __uaccess_begin_nospec() \
+({ \
+ stac(); \
+ ifence(); \
+})
/*
* This is a type: either unsigned long, if the argument fits into
@@ -487,6 +492,11 @@ struct __large_struct { unsigned long buf[100]; };
__uaccess_begin(); \
barrier();
+#define uaccess_try_nospec do { \
+ current->thread.uaccess_err = 0; \
+ __uaccess_begin_nospec(); \
+ barrier();
+
#define uaccess_catch(err) \
__uaccess_end(); \
(err) |= (current->thread.uaccess_err ? -EFAULT : 0); \
On Sat, Jan 13, 2018 at 10:18 AM, Dan Williams <[email protected]> wrote:
> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> index c97d935a29e8..85f400b8ee7c 100644
> --- a/arch/x86/lib/getuser.S
> +++ b/arch/x86/lib/getuser.S
> @@ -41,6 +41,7 @@ ENTRY(__get_user_1)
> cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
> jae bad_get_user
> ASM_STAC
> + ASM_IFENCE
> 1: movzbl (%_ASM_AX),%edx
> xor %eax,%eax
> ASM_CLAC
So I really would like to know from somebody (preferably somebody with
real microarchitectural knowledge) just how expensive that "lfence"
ends up being.
Because since we could just generate the masking of the address from
the exact same condition code that we already generate, the "lfence"
really can be replaced by just two ALU instructions instead:
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index c97d935a29e8..4c378b485399 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -40,6 +40,8 @@ ENTRY(__get_user_1)
mov PER_CPU_VAR(current_task), %_ASM_DX
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
+ sbb %_ASM_DX,%_ASM_DX
+ and %_ASM_DX,%_ASM_AX
ASM_STAC
1: movzbl (%_ASM_AX),%edx
xor %eax,%eax
which looks like it should have a fairly low maximum overhead (ok, the
above is totally untested, maybe I got the condition the wrong way
around _again_).
I _know_ that lfence is expensive as hell on P4, for example.
Yes, yes, "sbb" is often more expensive than most ALU instructions,
and Agner Fog says it has a 10-cycle latency on Prescott (which is
outrageous, but being one or two cycles more due to the flags
generation is normal). So the sbb/and may certainly add a few cycles
to the critical path, but on Prescott "lfence" is *50* cycles
according to those same tables by Agner Fog.
Is there anybody who is willing to say one way or another wrt the
"sbb/and" sequence vs "lfence".
Linus
Quoting Linus:
I do think that it would be a good idea to very expressly document
the fact that it's not that the user access itself is unsafe. I do
agree that things like "get_user()" want to be protected, but not
because of any direct bugs or problems with get_user() and friends,
but simply because get_user() is an excellent source of a pointer
that is obviously controlled from a potentially attacking user
space. So it's a prime candidate for then finding _subsequent_
accesses that can then be used to perturb the cache.
Note that '__copy_user_ll' is also called in the 'put_user' case, but
there is currently no indication that put_user in general deserves the
same hygiene as 'get_user'.
Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Andi Kleen <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/uaccess.h | 6 +++---
arch/x86/include/asm/uaccess_32.h | 6 +++---
arch/x86/include/asm/uaccess_64.h | 12 ++++++------
arch/x86/lib/copy_user_64.S | 3 +++
arch/x86/lib/getuser.S | 5 +++++
arch/x86/lib/usercopy_32.c | 8 ++++----
6 files changed, 24 insertions(+), 16 deletions(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index a31fd4fc6483..82c73f064e76 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -450,7 +450,7 @@ do { \
({ \
int __gu_err; \
__inttype(*(ptr)) __gu_val; \
- __uaccess_begin(); \
+ __uaccess_begin_nospec(); \
__get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT); \
__uaccess_end(); \
(x) = (__force __typeof__(*(ptr)))__gu_val; \
@@ -558,7 +558,7 @@ struct __large_struct { unsigned long buf[100]; };
* get_user_ex(...);
* } get_user_catch(err)
*/
-#define get_user_try uaccess_try
+#define get_user_try uaccess_try_nospec
#define get_user_catch(err) uaccess_catch(err)
#define get_user_ex(x, ptr) do { \
@@ -592,7 +592,7 @@ extern void __cmpxchg_wrong_size(void)
__typeof__(ptr) __uval = (uval); \
__typeof__(*(ptr)) __old = (old); \
__typeof__(*(ptr)) __new = (new); \
- __uaccess_begin(); \
+ __uaccess_begin_nospec(); \
switch (size) { \
case 1: \
{ \
diff --git a/arch/x86/include/asm/uaccess_32.h b/arch/x86/include/asm/uaccess_32.h
index 72950401b223..ba2dc1930630 100644
--- a/arch/x86/include/asm/uaccess_32.h
+++ b/arch/x86/include/asm/uaccess_32.h
@@ -29,21 +29,21 @@ raw_copy_from_user(void *to, const void __user *from, unsigned long n)
switch (n) {
case 1:
ret = 0;
- __uaccess_begin();
+ __uaccess_begin_nospec();
__get_user_asm_nozero(*(u8 *)to, from, ret,
"b", "b", "=q", 1);
__uaccess_end();
return ret;
case 2:
ret = 0;
- __uaccess_begin();
+ __uaccess_begin_nospec();
__get_user_asm_nozero(*(u16 *)to, from, ret,
"w", "w", "=r", 2);
__uaccess_end();
return ret;
case 4:
ret = 0;
- __uaccess_begin();
+ __uaccess_begin_nospec();
__get_user_asm_nozero(*(u32 *)to, from, ret,
"l", "k", "=r", 4);
__uaccess_end();
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index f07ef3c575db..62546b3a398e 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -55,31 +55,31 @@ raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
return copy_user_generic(dst, (__force void *)src, size);
switch (size) {
case 1:
- __uaccess_begin();
+ __uaccess_begin_nospec();
__get_user_asm_nozero(*(u8 *)dst, (u8 __user *)src,
ret, "b", "b", "=q", 1);
__uaccess_end();
return ret;
case 2:
- __uaccess_begin();
+ __uaccess_begin_nospec();
__get_user_asm_nozero(*(u16 *)dst, (u16 __user *)src,
ret, "w", "w", "=r", 2);
__uaccess_end();
return ret;
case 4:
- __uaccess_begin();
+ __uaccess_begin_nospec();
__get_user_asm_nozero(*(u32 *)dst, (u32 __user *)src,
ret, "l", "k", "=r", 4);
__uaccess_end();
return ret;
case 8:
- __uaccess_begin();
+ __uaccess_begin_nospec();
__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
ret, "q", "", "=r", 8);
__uaccess_end();
return ret;
case 10:
- __uaccess_begin();
+ __uaccess_begin_nospec();
__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
ret, "q", "", "=r", 10);
if (likely(!ret))
@@ -89,7 +89,7 @@ raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
__uaccess_end();
return ret;
case 16:
- __uaccess_begin();
+ __uaccess_begin_nospec();
__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
ret, "q", "", "=r", 16);
if (likely(!ret))
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 020f75cc8cf6..2429ca38dee6 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -31,6 +31,7 @@
*/
ENTRY(copy_user_generic_unrolled)
ASM_STAC
+ ASM_IFENCE
cmpl $8,%edx
jb 20f /* less then 8 bytes, go to byte copy loop */
ALIGN_DESTINATION
@@ -135,6 +136,7 @@ EXPORT_SYMBOL(copy_user_generic_unrolled)
*/
ENTRY(copy_user_generic_string)
ASM_STAC
+ ASM_IFENCE
cmpl $8,%edx
jb 2f /* less than 8 bytes, go to byte copy loop */
ALIGN_DESTINATION
@@ -175,6 +177,7 @@ EXPORT_SYMBOL(copy_user_generic_string)
*/
ENTRY(copy_user_enhanced_fast_string)
ASM_STAC
+ ASM_IFENCE
cmpl $64,%edx
jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */
movl %edx,%ecx
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index c97d935a29e8..85f400b8ee7c 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -41,6 +41,7 @@ ENTRY(__get_user_1)
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
ASM_STAC
+ ASM_IFENCE
1: movzbl (%_ASM_AX),%edx
xor %eax,%eax
ASM_CLAC
@@ -55,6 +56,7 @@ ENTRY(__get_user_2)
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
ASM_STAC
+ ASM_IFENCE
2: movzwl -1(%_ASM_AX),%edx
xor %eax,%eax
ASM_CLAC
@@ -69,6 +71,7 @@ ENTRY(__get_user_4)
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
ASM_STAC
+ ASM_IFENCE
3: movl -3(%_ASM_AX),%edx
xor %eax,%eax
ASM_CLAC
@@ -84,6 +87,7 @@ ENTRY(__get_user_8)
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
ASM_STAC
+ ASM_IFENCE
4: movq -7(%_ASM_AX),%rdx
xor %eax,%eax
ASM_CLAC
@@ -95,6 +99,7 @@ ENTRY(__get_user_8)
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user_8
ASM_STAC
+ ASM_IFENCE
4: movl -7(%_ASM_AX),%edx
5: movl -3(%_ASM_AX),%ecx
xor %eax,%eax
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1b377f734e64..7add8ba06887 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -331,12 +331,12 @@ do { \
unsigned long __copy_user_ll(void *to, const void *from, unsigned long n)
{
- stac();
+ __uaccess_begin_nospec();
if (movsl_is_ok(to, from, n))
__copy_user(to, from, n);
else
n = __copy_user_intel(to, from, n);
- clac();
+ __uaccess_end();
return n;
}
EXPORT_SYMBOL(__copy_user_ll);
@@ -344,7 +344,7 @@ EXPORT_SYMBOL(__copy_user_ll);
unsigned long __copy_from_user_ll_nocache_nozero(void *to, const void __user *from,
unsigned long n)
{
- stac();
+ __uaccess_begin_nospec();
#ifdef CONFIG_X86_INTEL_USERCOPY
if (n > 64 && static_cpu_has(X86_FEATURE_XMM2))
n = __copy_user_intel_nocache(to, from, n);
@@ -353,7 +353,7 @@ unsigned long __copy_from_user_ll_nocache_nozero(void *to, const void __user *fr
#else
__copy_user(to, from, n);
#endif
- clac();
+ __uaccess_end();
return n;
}
EXPORT_SYMBOL(__copy_from_user_ll_nocache_nozero);
On Sat, Jan 13, 2018 at 11:05 AM, Linus Torvalds
<[email protected]> wrote:
>
> I _know_ that lfence is expensive as hell on P4, for example.
>
> Yes, yes, "sbb" is often more expensive than most ALU instructions,
> and Agner Fog says it has a 10-cycle latency on Prescott (which is
> outrageous, but being one or two cycles more due to the flags
> generation is normal). So the sbb/and may certainly add a few cycles
> to the critical path, but on Prescott "lfence" is *50* cycles
> according to those same tables by Agner Fog.
Side note: I don't think P4 is really relevant for a performance
discussion, I was just giving it as an example where we do know actual
cycles.
I'm much more interested in modern Intel big-core CPU's, and just
wondering whether somebody could ask an architect.
Because I _suspect_ the answer from a CPU architect would be: "Christ,
the sbb/and sequence is much better because it doesn't have any extra
serialization", but maybe I'm wrong, and people feel that lfence is
particularly easy to do right without any real downside.
Linus
Linus Torvalds <[email protected]> writes:
> On Sat, Jan 13, 2018 at 11:05 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> I _know_ that lfence is expensive as hell on P4, for example.
>>
>> Yes, yes, "sbb" is often more expensive than most ALU instructions,
>> and Agner Fog says it has a 10-cycle latency on Prescott (which is
>> outrageous, but being one or two cycles more due to the flags
>> generation is normal). So the sbb/and may certainly add a few cycles
>> to the critical path, but on Prescott "lfence" is *50* cycles
>> according to those same tables by Agner Fog.
>
> Side note: I don't think P4 is really relevant for a performance
> discussion, I was just giving it as an example where we do know actual
> cycles.
>
> I'm much more interested in modern Intel big-core CPU's, and just
> wondering whether somebody could ask an architect.
>
> Because I _suspect_ the answer from a CPU architect would be: "Christ,
> the sbb/and sequence is much better because it doesn't have any extra
> serialization", but maybe I'm wrong, and people feel that lfence is
> particularly easy to do right without any real downside.
As an educated observer it seems like the cmpq/sbb/and sequence is an
improvement because it moves the dependency from one end of the cpu
pipeline to another. If any cpu does data speculation on anything other
than branch targets, that sequence could still be susceptible to
speculation.
From the AMD patches it appears that lfence is becoming a serializing
instruction, which in principle is much more expensive.
Also do we have alternatives for these sequences so if we run on an
in-order atom (or 386 or 486) where speculation does not occur, we can
avoid the cost?
Eric
On Sat, Jan 13, 2018 at 11:33 AM, Linus Torvalds
<[email protected]> wrote:
> On Sat, Jan 13, 2018 at 11:05 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> I _know_ that lfence is expensive as hell on P4, for example.
>>
>> Yes, yes, "sbb" is often more expensive than most ALU instructions,
>> and Agner Fog says it has a 10-cycle latency on Prescott (which is
>> outrageous, but being one or two cycles more due to the flags
>> generation is normal). So the sbb/and may certainly add a few cycles
>> to the critical path, but on Prescott "lfence" is *50* cycles
>> according to those same tables by Agner Fog.
>
> Side note: I don't think P4 is really relevant for a performance
> discussion, I was just giving it as an example where we do know actual
> cycles.
>
> I'm much more interested in modern Intel big-core CPU's, and just
> wondering whether somebody could ask an architect.
>
> Because I _suspect_ the answer from a CPU architect would be: "Christ,
> the sbb/and sequence is much better because it doesn't have any extra
> serialization", but maybe I'm wrong, and people feel that lfence is
> particularly easy to do right without any real downside.
>
From the last paragraph of this guidance:
https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf
...I read that as: Linux can constrain speculation with 'sbb; and'
instead of 'lfence', and any future changes will be handled like any
other new cpu enabling.
To your specific question of the relative costs, sbb is
architecturally cheaper, so let's go with that approach.
For this '__uaccess_begin_nospec' patch set, at a minimum the kernel
needs a helper that can be easily grep'd when/if it needs changing in
a future kernel. It also indicates that the command line approach to
dynamically switch the mitigation mechanism is over-engineering.
That said, for get_user specifically, can we do something even
cheaper? Dave H. reminds me that any valid user pointer that gets past
the address limit check will have the high bit clear. So instead of
calculating a mask, just unconditionally clear the high bit. It seems
the worst case is that userspace can speculatively leak something
that's already in its address space.
I'll respin this set along those lines, and drop the ifence bits.
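The "clear the high bit" idea can be sketched as follows (illustrative only, with a name of my own choosing; as Linus points out further down, the idea is defeated by set_fs(KERNEL_DS), where a kernel address is a legitimate argument):

```c
#include <stdint.h>

/*
 * Unconditionally clear the top address bit: a canonical x86-64 user
 * pointer already has bit 63 clear, so this is a no-op for valid user
 * pointers, while a hostile kernel address is redirected into the
 * user half of the address space.
 */
static uintptr_t clear_high_bit(uintptr_t ptr)
{
	return ptr & ~((uintptr_t)1 << (sizeof(uintptr_t) * 8 - 1));
}
```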
On Tue, Jan 16, 2018 at 2:23 PM, Dan Williams <[email protected]> wrote:
> On Sat, Jan 13, 2018 at 11:33 AM, Linus Torvalds
[..]
> I'll respin this set along those lines, and drop the ifence bits.
So now I'm not so sure. Yes, get_user_{1,2,4,8} can mask the pointer
with the address limit result, but this doesn't work for the
access_ok() + __get_user() case. We can either change the access_ok()
calling convention to return a properly masked pointer to be used in
subsequent calls to __get_user(), or go with lfence on every
__get_user call. There seem to be several drivers that open code
copy_from_user() with __get_user loops, so the 'fence every
__get_user' approach might have noticeable overhead. On the other hand
the access_ok conversion, while it could be scripted with coccinelle,
is ~300 sites (VERIFY_READ), if you're concerned about having
something small to merge for 4.15.
I think the access_ok() conversion to return a speculation sanitized
pointer or NULL is the way to go unless I'm missing something simpler.
Other ideas?
On Tue, Jan 16, 2018 at 08:30:17PM -0800, Dan Williams wrote:
> On Tue, Jan 16, 2018 at 2:23 PM, Dan Williams <[email protected]> wrote:
> > On Sat, Jan 13, 2018 at 11:33 AM, Linus Torvalds
> [..]
> > I'll respin this set along those lines, and drop the ifence bits.
>
> So now I'm not so sure. Yes, get_user_{1,2,4,8} can mask the pointer
> with the address limit result, but this doesn't work for the
> access_ok() + __get_user() case. We can either change the access_ok()
> calling convention to return a properly masked pointer to be used in
> subsequent calls to __get_user(), or go with lfence on every
> __get_user call. There seem to be several drivers that open code
> copy_from_user() with __get_user loops, so the 'fence every
> __get_user' approach might have noticeable overhead. On the other hand
> the access_ok conversion, while it could be scripted with coccinelle,
> is ~300 sites (VERIFY_READ), if you're concerned about having
> something small to merge for 4.15.
>
> I think the access_ok() conversion to return a speculation sanitized
> pointer or NULL is the way to go unless I'm missing something simpler.
> Other ideas?
What masked pointer? access_ok() exists for other architectures as well,
and the fewer callers remain outside of arch/*, the better.
Anything that open-codes copy_from_user() that way is *ALREADY* fucked if
it cares about the overhead - recent x86 boxen will have slowdown from
hell on stac()/clac() pairs. Anything like that on a hot path is already
deep in trouble and needs to be found and fixed. What drivers would those
be? We don't have that many __get_user() users left outside of arch/*
anymore...
On Tue, Jan 16, 2018 at 10:28 PM, Al Viro <[email protected]> wrote:
> On Tue, Jan 16, 2018 at 08:30:17PM -0800, Dan Williams wrote:
>> On Tue, Jan 16, 2018 at 2:23 PM, Dan Williams <[email protected]> wrote:
>> > On Sat, Jan 13, 2018 at 11:33 AM, Linus Torvalds
>> [..]
>> > I'll respin this set along those lines, and drop the ifence bits.
>>
>> So now I'm not so sure. Yes, get_user_{1,2,4,8} can mask the pointer
>> with the address limit result, but this doesn't work for the
>> access_ok() + __get_user() case. We can either change the access_ok()
>> calling convention to return a properly masked pointer to be used in
>> subsequent calls to __get_user(), or go with lfence on every
>> __get_user call. There seem to be several drivers that open code
>> copy_from_user() with __get_user loops, so the 'fence every
>> __get_user' approach might have noticeable overhead. On the other hand
>> the access_ok conversion, while it could be scripted with coccinelle,
>> is ~300 sites (VERIFY_READ), if you're concerned about having
>> something small to merge for 4.15.
>>
>> I think the access_ok() conversion to return a speculation sanitized
>> pointer or NULL is the way to go unless I'm missing something simpler.
>> Other ideas?
>
> What masked pointer?
The pointer value that is masked under speculation.
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index c97d935a29e8..4c378b485399 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -40,6 +40,8 @@ ENTRY(__get_user_1)
mov PER_CPU_VAR(current_task), %_ASM_DX
cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
+ sbb %_ASM_DX,%_ASM_DX
+ and %_ASM_DX,%_ASM_AX
ASM_STAC
1: movzbl (%_ASM_AX),%edx
xor %eax,%eax
...i.e. %_ASM_AX is guaranteed to be zero if userspace tries to cause
speculation with an address above the limit. The proposal is to make
access_ok() do that same masking, so we never speculate on pointers
from userspace aimed at kernel memory.
> access_ok() exists for other architectures as well,
I'd modify those as well...
> and the fewer callers remain outside of arch/*, the better.
>
> Anything that open-codes copy_from_user() that way is *ALREADY* fucked if
> it cares about the overhead - recent x86 boxen will have slowdown from
> hell on stac()/clac() pairs. Anything like that on a hot path is already
> deep in trouble and needs to be found and fixed. What drivers would those
> be?
So I took a closer look, and the pattern is not copy_from_user; it's
more like __get_user + write-to-hardware loops. If the performance is
already expected to be bad for those then perhaps an lfence each loop
iteration won't be much worse. It's still a waste because the lfence
is only needed once after the access_ok.
> We don't have that many __get_user() users left outside of arch/*
> anymore...
From: Dan Williams
> Sent: 17 January 2018 06:50
...
> > Anything that open-codes copy_from_user() that way is *ALREADY* fucked if
> > it cares about the overhead - recent x86 boxen will have slowdown from
> > hell on stac()/clac() pairs. Anything like that on a hot path is already
> > deep in trouble and needs to be found and fixed. What drivers would those
> > be?
>
> So I took a closer look and the pattern is not copy_from_user it's
> more like __get_user + write-to-hardware loops. If the performance is
> already expected to be bad for those then perhaps an lfence each loop
> iteration won't be much worse. It's still a waste because the lfence
> is only needed once after the access_ok.
Performance of PCIe writes isn't that bad (since they are posted), so
adding a synchronising instruction to __get_user() could easily
be noticeable.
Unfortunately you can't use copy_from_user() with a PCIe-mapped
target memory address (or even memcpy_toio()) because on x86
the copy is aliased to memcpy() and that uses 'rep movsb'
which has to do single byte transfers because the address is
uncached.
(This is really bad for PCIe reads.)
David
On Tue, 2018-01-16 at 14:41 -0800, Linus Torvalds wrote:
>
>
> On Jan 16, 2018 14:23, "Dan Williams" <[email protected]>
> wrote:
> > That said, for get_user specifically, can we do something even
> > cheaper. Dave H. reminds me that any valid user pointer that gets
> > past
> > the address limit check will have the high bit clear. So instead of
> > calculating a mask, just unconditionally clear the high bit. It
> > seems
> > worse case userspace can speculatively leak something that's
> > already
> > in its address space.
>
> That's not at all true.
>
> The address may be a kernel address. That's the whole point of
> 'set_fs()'.
Can we kill off the remaining users of set_fs() ?
Alan
On Tue, Jan 16, 2018 at 10:50 PM, Dan Williams <[email protected]> wrote:
> On Tue, Jan 16, 2018 at 10:28 PM, Al Viro <[email protected]> wrote:
[..]
>> Anything that open-codes copy_from_user() that way is *ALREADY* fucked if
>> it cares about the overhead - recent x86 boxen will have slowdown from
>> hell on stac()/clac() pairs. Anything like that on a hot path is already
>> deep in trouble and needs to be found and fixed. What drivers would those
>> be?
>
> So I took a closer look and the pattern is not copy_from_user it's
> more like __get_user + write-to-hardware loops. If the performance is
> already expected to be bad for those then perhaps an lfence each loop
> iteration won't be much worse. It's still a waste because the lfence
> is only needed once after the access_ok.
>
>> We don't have that many __get_user() users left outside of arch/*
>> anymore...
Given the concern about having something easy to backport first, I think
we should start with lfence in __uaccess_begin(). Any deeper changes
to the access_ok() + __get_user calling convention can build on top of
that baseline.
On Wed, Jan 17, 2018 at 02:17:26PM +0000, Alan Cox wrote:
> On Tue, 2018-01-16 at 14:41 -0800, Linus Torvalds wrote:
> >
> >
> > On Jan 16, 2018 14:23, "Dan Williams" <[email protected]>
> > wrote:
> > > That said, for get_user specifically, can we do something even
> > > cheaper. Dave H. reminds me that any valid user pointer that gets
> > > past
> > > the address limit check will have the high bit clear. So instead of
> > > calculating a mask, just unconditionally clear the high bit. It
> > > seems
> > > worse case userspace can speculatively leak something that's
> > > already
> > > in its address space.
> >
> > That's not at all true.
> >
> > The address may be a kernel address. That's the whole point of
> > 'set_fs()'.
>
> Can we kill off the remaining users of set_fs() ?
Not easily. They tend to come in pairs (the usual pattern is get_fs(),
save the result, set_fs(something), do work, set_fs(saved)), and
counting each such area as a single instance we have (in my tree right
now) 121 locations. Some could be killed (and will eventually be -
the number of set_fs()/access_ok()/__{get,put}_user()/__copy_...()
call sites had been seriously decreasing during the last couple of
years), but some are really hard to kill off.
How, for example, would you deal with this one:
/*
* Receive a datagram from a UDP socket.
*/
static int svc_udp_recvfrom(struct svc_rqst *rqstp)
{
struct svc_sock *svsk =
container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
struct svc_serv *serv = svsk->sk_xprt.xpt_server;
struct sk_buff *skb;
union {
struct cmsghdr hdr;
long all[SVC_PKTINFO_SPACE / sizeof(long)];
} buffer;
struct cmsghdr *cmh = &buffer.hdr;
struct msghdr msg = {
.msg_name = svc_addr(rqstp),
.msg_control = cmh,
.msg_controllen = sizeof(buffer),
.msg_flags = MSG_DONTWAIT,
};
...
err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
0, 0, MSG_PEEK | MSG_DONTWAIT);
With kernel_recvmsg() (and in my tree the above is its last surviving caller)
being
int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
struct kvec *vec, size_t num, size_t size, int flags)
{
mm_segment_t oldfs = get_fs();
int result;
iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
set_fs(KERNEL_DS);
result = sock_recvmsg(sock, msg, flags);
set_fs(oldfs);
return result;
}
EXPORT_SYMBOL(kernel_recvmsg);
We are asking for recvmsg() with zero data length; what we really want is
->msg_control. And _that_ is why we need that set_fs() - we want the damn
thing to go into a local variable.
But note that filling ->msg_control will happen in put_cmsg(), called
from ip_cmsg_recv_pktinfo(), called from ip_cmsg_recv_offset(),
called from udp_recvmsg(), called from sock_recvmsg_nosec(), called
from sock_recvmsg(). Or in another path in case of IPv6.
Sure, we can arrange for propagation of that all the way down those
call chains. My preference would be to try and mark that (one and
only) case in ->msg_flags, so that put_cmsg() would be able to
check. ___sys_recvmsg() sets that as
msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
so we ought to be free to use any bit other than those two. Since
put_cmsg() already checks ->msg_flags, that shouldn't add too much
overhead. But then we'll need to do something to prevent speculative
execution straying down that way, won't we? I'm not saying it can't
be done, but quite a few of the remaining call sites will take
serious work.
Incidentally, what about copy_to_iter() and friends? They
check iov_iter flavour and go either into the "copy to kernel buffer"
or "copy to userland" paths. Do we need to deal with mispredictions
there? We are calling a bunch of those on read()...
On Tue, Jan 16, 2018 at 8:30 PM, Dan Williams <[email protected]> wrote:
>
> I think the access_ok() conversion to return a speculation sanitized
> pointer or NULL is the way to go unless I'm missing something simpler.
No, that's way too big of a conversion.
Just make get_user() and friends (that currently use ASM_STAC) use the
address masking.
The people who use uaccess_begin() can use the lfence there.
Basically, the rule is trivial: find all 'stac' users, and use address
masking if those users already integrate the limit check, and lfence if
they don't.
Linus
On Wed, Jan 17, 2018 at 6:17 AM, Alan Cox <[email protected]> wrote:
>
> Can we kill off the remaining users of set_fs() ?
I would love to, but it's not going to happen short-term. If ever.
Some could be removed today: the code in arch/x86/net/bpf_jit_comp.c
seems to be literally the ramblings of a diseased mind. There's no
reason for the set_fs(), there's no reason for the
flush_icache_range() (it's a no-op on x86 anyway), and the smp_wmb()
looks bogus too.
I have no idea how that braindamage happened, but I assume it got
copied from some broken source.
But there are about ~100 set_fs() calls in generic code, and some of
those really are pretty fundamental. Doing things like "kernel_read()"
without set_fs() is basically impossible.
We've had set_fs() since the beginning. The naming is obviously very
historical. We have it for a very good reason, and I don't really see
that reason going away.
So realistically, we want to _minimize_ set_fs(), and we might want to
make sure that it's done only in limited settings (it might, for
example, be a good idea and a realistic goal to make sure that drivers
and modules can't do it, and use proper helper functions like that
"kernel_read()").
But getting rid of the concept entirely? Doesn't seem likely.
Linus
On Wed, Jan 17, 2018 at 10:52 AM, Al Viro <[email protected]> wrote:
> On Wed, Jan 17, 2018 at 02:17:26PM +0000, Alan Cox wrote:
[..]
> Incidentally, what about copy_to_iter() and friends? They
> check iov_iter flavour and go either into the "copy to kernel buffer"
> or "copy to userland" paths. Do we need to deal with mispredictions
> there? We are calling a bunch of those on read()...
>
Those should be protected by the conversion of __uaccess_begin to
__uaccess_begin_nospec that includes the lfence.
On Wed, Jan 17, 2018 at 11:26 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Jan 17, 2018 at 6:17 AM, Alan Cox <[email protected]> wrote:
>>
>> Can we kill off the remaining users of set_fs() ?
>
> I would love to, but it's not going to happen short-term. If ever.
>
> Some could be removed today: the code in arch/x86/net/bpf_jit_comp.c
> seems to be literally the ramblings of a diseased mind. There's no
> reason for the set_fs(), there's no reason for the
> flush_icache_range() (it's a no-op on x86 anyway), and the smp_wmb()
> looks bogus too.
>
> I have no idea how that braindamage happened, but I assume it got
> copied from some broken source.
At the time commit 0a14842f5a3c0e88a1e59fac5c3025db39721f74 went in,
this was the first JIT implementation for BPF, so maybe I wanted to make
sure other arches would not forget to flush the icache: you can bet that
my implementation served as a reference for other JITs.
At that time, various calls to flush_icache_range() were definitely in
arch/x86 or kernel/module.c.
(I believe I must have copied the code from kernel/module.c, but I am
not sure of that.)
>
> But there are about ~100 set_fs() calls in generic code, and some of
> those really are pretty fundamental. Doing things like "kernel_read()"
> without set_fs() is basically impossible.
>
> We've had set_fs() since the beginning. The naming is obviously very
> historical. We have it for a very good reason, and I don't really see
> that reason going away.
>
> So realistically, we want to _minimize_ set_fs(), and we might want to
> make sure that it's done only in limited settings (it might, for
> example, be a good idea and a realistic goal to make sure that drivers
> and modules can't do it, and use proper helper functions like that
> "read_kernel()").
>
> But getting rid of the concept entirely? Doesn't seem likely.
>
> Linus
On Wed, Jan 17, 2018 at 11:54:12AM -0800, Dan Williams wrote:
> On Wed, Jan 17, 2018 at 10:52 AM, Al Viro <[email protected]> wrote:
> > On Wed, Jan 17, 2018 at 02:17:26PM +0000, Alan Cox wrote:
> [..]
> > Incidentally, what about copy_to_iter() and friends? They
> > check iov_iter flavour and go either into the "copy to kernel buffer"
> > or "copy to userland" paths. Do we need to deal with mispredictions
> > there? We are calling a bunch of those on read()...
> >
>
> Those should be protected by the conversion of __uaccess_begin to
> __uaccess_begin_nospec that includes the lfence.
Huh? What the hell does it do to speculative execution of the "memcpy
those suckers" branch?
On Wed, Jan 17, 2018 at 12:05 PM, Al Viro <[email protected]> wrote:
> On Wed, Jan 17, 2018 at 11:54:12AM -0800, Dan Williams wrote:
>> On Wed, Jan 17, 2018 at 10:52 AM, Al Viro <[email protected]> wrote:
>> > On Wed, Jan 17, 2018 at 02:17:26PM +0000, Alan Cox wrote:
>> [..]
>> > Incidentally, what about copy_to_iter() and friends? They
>> > check iov_iter flavour and go either into the "copy to kernel buffer"
>> > or "copy to userland" paths. Do we need to deal with mispredictions
>> > there? We are calling a bunch of those on read()...
>> >
>>
>> Those should be protected by the conversion of __uaccess_begin to
>> __uaccess_begin_nospec that includes the lfence.
>
> Huh? What the hell does it do to speculative execution of "memcpy those
> suckers" branch?
'raw_copy_from_user' is changed to use '__uaccess_begin_nospec' instead
of plain '__uaccess_begin'; the only difference between the two is that
the former includes an lfence. So with this sequence:
if (access_ok(VERIFY_READ, from, n)) {
kasan_check_write(to, n);
n = raw_copy_from_user(to, from, n);
}
return n;
...'from' is guaranteed to be within the address limit with respect to
speculative execution, or otherwise never de-referenced.
On Wed, Jan 17, 2018 at 06:52:32PM +0000, Al Viro wrote:
[use of set_fs() by sunrpc]
> We are asking for recvmsg() with zero data length; what we really want is
> ->msg_control. And _that_ is why we need that set_fs() - we want the damn
> thing to go into local variable.
>
> But note that filling ->msg_control will happen in put_cmsg(), called
> from ip_cmsg_recv_pktinfo(), called from ip_cmsg_recv_offset(),
> called from udp_recvmsg(), called from sock_recvmsg_nosec(), called
> from sock_recvmsg(). Or in another path in case of IPv6.
> Sure, we can arrange for propagation of that all way down those
> call chains. My preference would be to try and mark that (one and
> only) case in ->msg_flags, so that put_cmsg() would be able to
> check. ___sys_recvmsg() sets that as
> msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
> so we ought to be free to use any bit other than those two. Since
> put_cmsg() already checks ->msg_flags, that shouldn't put too much
> overhead.
Folks, does anybody have objections against the following:
Get rid of kernel_recvmsg() (and thus set_fs()) use in sunrpc
In net/sunrpc/svcsock.c:svc_udp_recvfrom() we want to get
IP_PKTINFO; currently we stash the address of a local variable
into ->msg_control (which normally contains a userland pointer
for recipients) and issue a zero-length ->recvmsg() under
set_fs(KERNEL_DS).
Similar to the way put_cmsg() handles the 32bit case on biarch
targets, introduce a flag telling put_cmsg() to treat
->msg_control as a kernel pointer, using memcpy instead of
copy_to_user(). That allows us to avoid the use of
kernel_recvmsg() with its set_fs().
All places that might have non-NULL ->msg_control either pass the
msghdr only to ->sendmsg() and its ilk, or have ->msg_flags
sanitized before passing the msghdr anywhere. IOW, there is no
way for the new flag to reach put_cmsg() in the mainline kernel,
and after this change it will only be seen in the sunrpc case.
Incidentally, all other kernel_recvmsg() users are very easy to
convert to sock_recvmsg(), so that would allow us to kill
kernel_recvmsg() off...
Signed-off-by: Al Viro <[email protected]>
---
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 9286a5a8c60c..60947da9c806 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -298,6 +298,7 @@ struct ucred {
#else
#define MSG_CMSG_COMPAT 0 /* We never have 32 bit fixups */
#endif
+#define MSG_CMSG_KERNEL 0x10000000
/* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */
diff --git a/net/core/scm.c b/net/core/scm.c
index b1ff8a441748..1b73b682e441 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -237,10 +237,17 @@ int put_cmsg(struct msghdr * msg, int level, int type, int len, void *data)
cmhdr.cmsg_len = cmlen;
err = -EFAULT;
- if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
- goto out;
- if (copy_to_user(CMSG_DATA(cm), data, cmlen - sizeof(struct cmsghdr)))
- goto out;
+ if (unlikely(MSG_CMSG_KERNEL & msg->msg_flags)) {
+ struct cmsghdr *p = msg->msg_control;
+ memcpy(p, &cmhdr, sizeof cmhdr);
+ memcpy(CMSG_DATA(p), data, cmlen - sizeof(struct cmsghdr));
+ } else {
+ if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
+ goto out;
+ if (copy_to_user(CMSG_DATA(cm), data,
+ cmlen - sizeof(struct cmsghdr)))
+ goto out;
+ }
cmlen = CMSG_SPACE(len);
if (msg->msg_controllen < cmlen)
cmlen = msg->msg_controllen;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 5570719e4787..774904f67c93 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -545,7 +545,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
.msg_name = svc_addr(rqstp),
.msg_control = cmh,
.msg_controllen = sizeof(buffer),
- .msg_flags = MSG_DONTWAIT,
+ .msg_flags = MSG_DONTWAIT | MSG_CMSG_KERNEL,
};
size_t len;
int err;
@@ -565,8 +565,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
skb = NULL;
- err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
- 0, 0, MSG_PEEK | MSG_DONTWAIT);
+ err = sock_recvmsg(svsk->sk_sock, &msg, MSG_PEEK | MSG_DONTWAIT);
if (err >= 0)
skb = skb_recv_udp(svsk->sk_sk, 0, 1, &err);
On Wed, Jan 17, 2018 at 7:06 PM, Al Viro <[email protected]> wrote:
>
> Similar to the way put_cmsg() handles 32bit case on biarch
> targets, introduce a flag telling put_cmsg() to treat
> ->msg_control as kernel pointer, using memcpy instead of
> copy_to_user(). That allows to avoid the use of kernel_recvmsg()
> with its set_fs().
If this is the only case that kernel_recvmsg() exists for, then by all
means, that patch certainly looks like a good thing.
Linus
On Wed, Jan 17, 2018 at 07:16:02PM -0800, Linus Torvalds wrote:
> On Wed, Jan 17, 2018 at 7:06 PM, Al Viro <[email protected]> wrote:
> >
> > Similar to the way put_cmsg() handles 32bit case on biarch
> > targets, introduce a flag telling put_cmsg() to treat
> > ->msg_control as kernel pointer, using memcpy instead of
> > copy_to_user(). That allows to avoid the use of kernel_recvmsg()
> > with its set_fs().
>
> If this is the only case that kernel_recvmsg() exists for, then by all
> means, that patch certainly looks like a good thing.
In -next that's the only remaining caller. kernel_recvmsg() is
{
	mm_segment_t oldfs = get_fs();
	int result;

	iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
	set_fs(KERNEL_DS);
	result = sock_recvmsg(sock, msg, flags);
	set_fs(oldfs);
	return result;
}
and
	iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
	result = sock_recvmsg(sock, msg, flags);
works just fine for copying the data - that gets handled by copy_to_iter()
and copy_page_to_iter(). Those don't care about set_fs(); the trouble with
sunrpc call site is that we want to fill msg_control-pointed kernel object.
That gets copied by put_cmsg().
We could turn ->msg_control/->msg_controllen into another
iov_iter, but seeing that we never do scatter-gather for those
IMO that would be a massive overkill. A flag controlling whether
->msg_control is a kernel or userland pointer would do, especially
since we already have a flag for "do we want a native or compat
layout for cmsg" in there.
That's the only caller we need it for, but that thing looks cheap
enough. Obviously needs to pass testing, including "is it too ugly to
live as far as Davem is concerned" test, though...
> We could turn ->msg_control/->msg_controllen into another
> iov_iter, but seeing that we never do scatter-gather for those
> IMO that would be a massive overkill. A flag controlling whether
> ->msg_control is kernel or userland pointer would do, especially
> since we already have a flag for "do we want a native or compat
> layout for cmsg" in there.
While your current hack seems like a nice short term improvement
I think we need an iov_iter or iov_iter-light there in the long run.
Same for ioctl so that we can pass properly typed kernel or user
buffers through without all these set_fs hacks.
On Wed, Jan 17, 2018 at 11:26:08AM -0800, Linus Torvalds wrote:
> But there are about ~100 set_fs() calls in generic code, and some of
> those really are pretty fundamental. Doing things like "kernel_read()"
> without set_fs() is basically impossible.
Not if we move to iov_iter or iov_iter-like behavior for all reads
and writes. There is an issue with how vectored writes are handled
in plain read/write vs read_iter/write_iter, inherited from
readv/writev, but that's nothing a flag, or a second set of methods
with the same signature, couldn't solve.
But there are more annoying things, most notably in-kernel ioctl
calls. We have quite a few of them, and while many are just utterly
stupid and can be replaced with direct function calls or new methods
(I've done quite a few conversions of those), some might be left.
Something like iov_iter might be the answer again.
Then we have things like probe_kernel_read/probe_kernel_write which
abuse the exception handling in get/put_user(). But with a little
arch helper we don't strictly need get_fs/set_fs for that.
On Thu, Jan 18, 2018 at 8:38 AM, Christoph Hellwig <[email protected]> wrote:
>
> > But there are about ~100 set_fs() calls in generic code, and some of
> > those really are pretty fundamental. Doing things like "kernel_read()"
> > without set_fs() is basically impossible.
>
> Not if we move to iov_iter or iov_iter-like behavior for all reads
> and writes.
Not going to happen. Really. We have how many tens of thousands of
drivers again, all doing "copy_to_user()".
And the fact is, set_fs() really isn't even a problem for this. Never
really has been. From a security standpoint, it would actually be
*much* worse if we made those ten thousand places do "if (kernel_flag)
memcpy() else copy_to_user()".
We've had some issues with set_fs() being abused in interesting ways.
But "kernel_read()" and friends is not it.
Linus
On Thu, Jan 18, 2018 at 08:29:57AM -0800, Christoph Hellwig wrote:
> > We could turn ->msg_control/->msg_controllen into another
> > iov_iter, but seeing that we never do scatter-gather for those
> > IMO that would be a massive overkill. A flag controlling whether
> > ->msg_control is kernel or userland pointer would do, especially
> > since we already have a flag for "do we want a native or compat
> > layout for cmsg" in there.
>
> While your current hack seems like a nice short term improvement
> I think we need an iov_iter or iov_iter-light there in the long run.
For one caller in the entire history of the kernel?
> Same for ioctl so that we can pass properly typed kernel or user
> buffers through without all these set_fs hacks.
Umm... Most of the PITA with ioctls is due to compat ones being
reformatted for native and fed under set_fs(). I actually have
a series dealing with most of such places for net ioctls. Sure,
there's also ioctl_by_bdev(), but for those we might be better
off exposing the things like ->get_last_session() and its ilk to
filesystems that want to deal with cdroms...
It's kernel_setsockopt() that is the real PITA...
On Thu, Jan 18, 2018 at 08:49:31AM -0800, Linus Torvalds wrote:
> On Thu, Jan 18, 2018 at 8:38 AM, Christoph Hellwig <[email protected]> wrote:
> >
> > > But there are about ~100 set_fs() calls in generic code, and some of
> > > those really are pretty fundamental. Doing things like "kernel_read()"
> > > without set_fs() is basically impossible.
> >
> > Not if we move to iov_iter or iov_iter-like behavior for all reads
> > and writes.
>
> Not going to happen. Really. We have how many tens of thousands of
> drivers again, all doing "copy_to_user()".
The real PITA is not even that (we could provide helpers making
conversion from ->read() to ->read_iter() easy for char devices,
etc.). It's the semantics of readv(2). Consider e.g. readv()
from /dev/rtc, with iovec array consisting of 10 segments, each
int-sized. Right now we'll get rtc_dev_read() called in a loop,
once for each segment. Single read() into 40-byte buffer will
fill one long and bugger off. Converting it to ->read_iter()
will mean more than just "use copy_to_iter() instead of put_user()" -
that would be trivial. But to preserve the current behaviour
we would need something like
	total = 0;
	while (iov_iter_count(to)) {
		count = iov_iter_single_seg_count(to);
		/* current body of rtc_dev_read(), with
		 * put_user() replaced with copy_to_iter()
		 */
		....
		if (res < 0) {
			if (!total)
				total = res;
			break;
		}
		total += res;
		if (res != count)
			break;
	}
	return total;
in that thing. And similar boilerplates would be needed in
a whole lot of drivers. Sure, they are individually trivial,
but they would add up to shitloads of code to get wrong.
These are basically all ->read() instances that ignore *ppos
and, unlike pipes, do not attempt to fill as much of the
buffer as possible. We do have quite a few of such.
Some ->read() instances can be easily converted to ->read_iter()
and will, in fact, be better off that way. We had patches of
that sort and I'm certain that we still have such places left.
Ditto for ->write() and ->write_iter(). But those are not
even close to being the majority. Sorry.
We could, in principle, do something like
	dev_rtc_read_iter(iocb, to)
	{
		return loop_read_iter(iocb, to, modified_dev_rtc_read);
	}
with modified_dev_rtc_read() being the result of minimal
conversion (put_user() and copy_to_user() replaced with use
of copy_to_iter()). It would be less boilerplate that way,
but I really don't see big benefits from doing that.
On the write side the things are just as unpleasant - we have
a lot of ->write() instances that parse the beginning of the
buffer, ignore the rest and report that everything got written.
writev() on those will parse each iovec segment, ignoring the
junk in the end of each. Again, that loop needs to go somewhere.
And we do have a bunch of "parse the buffer and do some action
once" ->write() instances - in char devices, debugfs, etc.
On Thu, Jan 18, 2018 at 04:43:02AM +0000, Al Viro wrote:
> We could turn ->msg_control/->msg_controllen into another
> iov_iter, but seeing that we never do scatter-gather for those
> IMO that would be a massive overkill. A flag controlling whether
> ->msg_control is kernel or userland pointer would do, especially
> since we already have a flag for "do we want a native or compat
> layout for cmsg" in there.
>
> That's the only caller we need it for, but that thing looks cheap
> enough. Obviously needs to pass testing, including "is it too ugly to
> live as far as Davem is concerned" test, though...
BTW, there's another series of set_fs-removal patches in
net ioctls; still needs review, though. With that one we would be down
to 11 instances in the entire net/*:
* SO_RCVTIMEO/SO_SNDTIMEO handling in compat [sg]etsockopt()
* passing SIOC{ADD,DEL}TUNNEL down (ipmr_del_tunnel(),ipmr_new_tunnel(),
addrconf_set_dstaddr())
* SIOCGSTAMP/SIOCGSTAMPNS in compat ioctls
* SIOCADDRT/SIOCDELRT in compat ioctls
* kernel_[gs]etsockopt()
* ipv6_renew_options_kern()
I don't know if all of that stuff can be realistically done without set_fs().
kernel_setsockopt(), in particular, is unpleasant...
The patches need review and testing, obviously; I'll post them in followups,
the entire series (on top of net/master) is in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.net-ioctl
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
include/net/wext.h | 4 ++--
net/core/dev_ioctl.c | 18 ------------------
net/socket.c | 2 +-
net/wireless/wext-core.c | 13 +++++++++----
4 files changed, 12 insertions(+), 25 deletions(-)
diff --git a/include/net/wext.h b/include/net/wext.h
index e51f067fdb3a..aa192a670304 100644
--- a/include/net/wext.h
+++ b/include/net/wext.h
@@ -7,7 +7,7 @@
struct net;
#ifdef CONFIG_WEXT_CORE
-int wext_handle_ioctl(struct net *net, struct iwreq *iwr, unsigned int cmd,
+int wext_handle_ioctl(struct net *net, unsigned int cmd,
void __user *arg);
int compat_wext_handle_ioctl(struct net *net, unsigned int cmd,
unsigned long arg);
@@ -15,7 +15,7 @@ int compat_wext_handle_ioctl(struct net *net, unsigned int cmd,
struct iw_statistics *get_wireless_stats(struct net_device *dev);
int call_commit_handler(struct net_device *dev);
#else
-static inline int wext_handle_ioctl(struct net *net, struct iwreq *iwr, unsigned int cmd,
+static inline int wext_handle_ioctl(struct net *net, unsigned int cmd,
void __user *arg)
{
return -EINVAL;
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 5cdec23dd28e..d262f159f9fd 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -411,24 +411,6 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
if (cmd == SIOCGIFNAME)
return dev_ifname(net, (struct ifreq __user *)arg);
- /*
- * Take care of Wireless Extensions. Unfortunately struct iwreq
- * isn't a proper subset of struct ifreq (it's 8 byte shorter)
- * so we need to treat it specially, otherwise applications may
- * fault if the struct they're passing happens to land at the
- * end of a mapped page.
- */
- if (cmd >= SIOCIWFIRST && cmd <= SIOCIWLAST) {
- struct iwreq iwr;
-
- if (copy_from_user(&iwr, arg, sizeof(iwr)))
- return -EFAULT;
-
- iwr.ifr_name[sizeof(iwr.ifr_name) - 1] = 0;
-
- return wext_handle_ioctl(net, &iwr, cmd, arg);
- }
-
if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
return -EFAULT;
diff --git a/net/socket.c b/net/socket.c
index 6d29ebce93dd..d0e24f6ec1a9 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1005,7 +1005,7 @@ static long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg)
} else
#ifdef CONFIG_WEXT_CORE
if (cmd >= SIOCIWFIRST && cmd <= SIOCIWLAST) {
- err = dev_ioctl(net, cmd, argp);
+ err = wext_handle_ioctl(net, cmd, argp);
} else
#endif
switch (cmd) {
diff --git a/net/wireless/wext-core.c b/net/wireless/wext-core.c
index 6cdb054484d6..9efbfc753347 100644
--- a/net/wireless/wext-core.c
+++ b/net/wireless/wext-core.c
@@ -1035,18 +1035,23 @@ static int ioctl_standard_call(struct net_device * dev,
}
-int wext_handle_ioctl(struct net *net, struct iwreq *iwr, unsigned int cmd,
- void __user *arg)
+int wext_handle_ioctl(struct net *net, unsigned int cmd, void __user *arg)
{
struct iw_request_info info = { .cmd = cmd, .flags = 0 };
+ struct iwreq iwr;
int ret;
- ret = wext_ioctl_dispatch(net, iwr, cmd, &info,
+ if (copy_from_user(&iwr, arg, sizeof(iwr)))
+ return -EFAULT;
+
+ iwr.ifr_name[sizeof(iwr.ifr_name) - 1] = 0;
+
+ ret = wext_ioctl_dispatch(net, &iwr, cmd, &info,
ioctl_standard_call,
ioctl_private_call);
if (ret >= 0 &&
IW_IS_GET(cmd) &&
- copy_to_user(arg, iwr, sizeof(struct iwreq)))
+ copy_to_user(arg, &iwr, sizeof(struct iwreq)))
return -EFAULT;
return ret;
--
2.11.0
From: Al Viro <[email protected]>
Only two of dev_ioctl() callers may pass SIOCGIFCONF to it.
Separating that codepath from the rest of dev_ioctl() allows both
to simplify dev_ioctl() itself (all other cases work with struct ifreq *)
*and* seriously simplify the compat side of that beast: all it takes
is passing to inet_gifconf() an extra argument - the size of individual
records (sizeof(struct ifreq) or sizeof(struct compat_ifreq)). With
dev_ifconf() called directly from sock_do_ioctl()/compat_dev_ifconf()
that's easy to arrange.
As the result, compat side of SIOCGIFCONF doesn't need any
allocations, copy_in_user() back and forth, etc.
Signed-off-by: Al Viro <[email protected]>
---
include/linux/netdevice.h | 4 ++-
net/core/dev_ioctl.c | 29 +++++------------
net/ipv4/devinet.c | 16 +++++-----
net/socket.c | 79 ++++++++++++++---------------------------------
4 files changed, 42 insertions(+), 86 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ed0799a12bf2..221487f5b214 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2759,7 +2759,8 @@ static inline bool dev_validate_header(const struct net_device *dev,
return false;
}
-typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr, int len);
+typedef int gifconf_func_t(struct net_device * dev, char __user * bufptr,
+ int len, int size);
int register_gifconf(unsigned int family, gifconf_func_t *gifconf);
static inline int unregister_gifconf(unsigned int family)
{
@@ -3313,6 +3314,7 @@ void netdev_rx_handler_unregister(struct net_device *dev);
bool dev_valid_name(const char *name);
int dev_ioctl(struct net *net, unsigned int cmd, void __user *);
+int dev_ifconf(struct net *net, struct ifconf *, int);
int dev_ethtool(struct net *net, struct ifreq *);
unsigned int dev_get_flags(const struct net_device *);
int __dev_change_flags(struct net_device *, unsigned int flags);
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 7e690d0ccd05..5cdec23dd28e 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -66,9 +66,8 @@ EXPORT_SYMBOL(register_gifconf);
* Thus we will need a 'compatibility mode'.
*/
-static int dev_ifconf(struct net *net, char __user *arg)
+int dev_ifconf(struct net *net, struct ifconf *ifc, int size)
{
- struct ifconf ifc;
struct net_device *dev;
char __user *pos;
int len;
@@ -79,11 +78,8 @@ static int dev_ifconf(struct net *net, char __user *arg)
* Fetch the caller's info block.
*/
- if (copy_from_user(&ifc, arg, sizeof(struct ifconf)))
- return -EFAULT;
-
- pos = ifc.ifc_buf;
- len = ifc.ifc_len;
+ pos = ifc->ifc_buf;
+ len = ifc->ifc_len;
/*
* Loop over the interfaces, and write an info block for each.
@@ -95,10 +91,10 @@ static int dev_ifconf(struct net *net, char __user *arg)
if (gifconf_list[i]) {
int done;
if (!pos)
- done = gifconf_list[i](dev, NULL, 0);
+ done = gifconf_list[i](dev, NULL, 0, size);
else
done = gifconf_list[i](dev, pos + total,
- len - total);
+ len - total, size);
if (done < 0)
return -EFAULT;
total += done;
@@ -109,12 +105,12 @@ static int dev_ifconf(struct net *net, char __user *arg)
/*
* All done. Write the updated control block back to the caller.
*/
- ifc.ifc_len = total;
+ ifc->ifc_len = total;
/*
* Both BSD and Solaris return 0 here, so we do too.
*/
- return copy_to_user(arg, &ifc, sizeof(struct ifconf)) ? -EFAULT : 0;
+ return 0;
}
/*
@@ -412,17 +408,6 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
int ret;
char *colon;
- /* One special case: SIOCGIFCONF takes ifconf argument
- and requires shared lock, because it sleeps writing
- to user space.
- */
-
- if (cmd == SIOCGIFCONF) {
- rtnl_lock();
- ret = dev_ifconf(net, (char __user *) arg);
- rtnl_unlock();
- return ret;
- }
if (cmd == SIOCGIFNAME)
return dev_ifname(net, (struct ifreq __user *)arg);
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 7a93359fbc72..1771549d2438 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1188,22 +1188,25 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
goto out;
}
-static int inet_gifconf(struct net_device *dev, char __user *buf, int len)
+static int inet_gifconf(struct net_device *dev, char __user *buf, int len, int size)
{
struct in_device *in_dev = __in_dev_get_rtnl(dev);
struct in_ifaddr *ifa;
struct ifreq ifr;
int done = 0;
+ if (WARN_ON(size > sizeof(struct ifreq)))
+ goto out;
+
if (!in_dev)
goto out;
for (ifa = in_dev->ifa_list; ifa; ifa = ifa->ifa_next) {
if (!buf) {
- done += sizeof(ifr);
+ done += size;
continue;
}
- if (len < (int) sizeof(ifr))
+ if (len < size)
break;
memset(&ifr, 0, sizeof(struct ifreq));
strcpy(ifr.ifr_name, ifa->ifa_label);
@@ -1212,13 +1215,12 @@ static int inet_gifconf(struct net_device *dev, char __user *buf, int len)
(*(struct sockaddr_in *)&ifr.ifr_addr).sin_addr.s_addr =
ifa->ifa_local;
- if (copy_to_user(buf, &ifr, sizeof(struct ifreq))) {
+ if (copy_to_user(buf + done, &ifr, size)) {
done = -EFAULT;
break;
}
- buf += sizeof(struct ifreq);
- len -= sizeof(struct ifreq);
- done += sizeof(struct ifreq);
+ len -= size;
+ done += size;
}
out:
return done;
diff --git a/net/socket.c b/net/socket.c
index fbfae1ed3ff5..112216c537e7 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -961,10 +961,22 @@ static long sock_do_ioctl(struct net *net, struct socket *sock,
* If this ioctl is unknown try to hand it down
* to the NIC driver.
*/
- if (err == -ENOIOCTLCMD)
- err = dev_ioctl(net, cmd, argp);
+ if (err != -ENOIOCTLCMD)
+ return err;
- return err;
+ if (cmd == SIOCGIFCONF) {
+ struct ifconf ifc;
+ if (copy_from_user(&ifc, argp, sizeof(struct ifconf)))
+ return -EFAULT;
+ rtnl_lock();
+ err = dev_ifconf(net, &ifc, sizeof(struct ifreq));
+ rtnl_unlock();
+ if (!err && copy_to_user(argp, &ifc, sizeof(struct ifconf)))
+ err = -EFAULT;
+ return err;
+ }
+
+ return dev_ioctl(net, cmd, argp);
}
/*
@@ -2682,70 +2694,25 @@ static int dev_ifname32(struct net *net, struct compat_ifreq __user *uifr32)
return 0;
}
-static int dev_ifconf(struct net *net, struct compat_ifconf __user *uifc32)
+static int compat_dev_ifconf(struct net *net, struct compat_ifconf __user *uifc32)
{
struct compat_ifconf ifc32;
struct ifconf ifc;
- struct ifconf __user *uifc;
- struct compat_ifreq __user *ifr32;
- struct ifreq __user *ifr;
- unsigned int i, j;
int err;
if (copy_from_user(&ifc32, uifc32, sizeof(struct compat_ifconf)))
return -EFAULT;
- memset(&ifc, 0, sizeof(ifc));
- if (ifc32.ifcbuf == 0) {
- ifc32.ifc_len = 0;
- ifc.ifc_len = 0;
- ifc.ifc_req = NULL;
- uifc = compat_alloc_user_space(sizeof(struct ifconf));
- } else {
- size_t len = ((ifc32.ifc_len / sizeof(struct compat_ifreq)) + 1) *
- sizeof(struct ifreq);
- uifc = compat_alloc_user_space(sizeof(struct ifconf) + len);
- ifc.ifc_len = len;
- ifr = ifc.ifc_req = (void __user *)(uifc + 1);
- ifr32 = compat_ptr(ifc32.ifcbuf);
- for (i = 0; i < ifc32.ifc_len; i += sizeof(struct compat_ifreq)) {
- if (copy_in_user(ifr, ifr32, sizeof(struct compat_ifreq)))
- return -EFAULT;
- ifr++;
- ifr32++;
- }
- }
- if (copy_to_user(uifc, &ifc, sizeof(struct ifconf)))
- return -EFAULT;
+ ifc.ifc_len = ifc32.ifc_len;
+ ifc.ifc_req = compat_ptr(ifc32.ifcbuf);
- err = dev_ioctl(net, SIOCGIFCONF, uifc);
+ rtnl_lock();
+ err = dev_ifconf(net, &ifc, sizeof(struct compat_ifreq));
+ rtnl_unlock();
if (err)
return err;
- if (copy_from_user(&ifc, uifc, sizeof(struct ifconf)))
- return -EFAULT;
-
- ifr = ifc.ifc_req;
- ifr32 = compat_ptr(ifc32.ifcbuf);
- for (i = 0, j = 0;
- i + sizeof(struct compat_ifreq) <= ifc32.ifc_len && j < ifc.ifc_len;
- i += sizeof(struct compat_ifreq), j += sizeof(struct ifreq)) {
- if (copy_in_user(ifr32, ifr, sizeof(struct compat_ifreq)))
- return -EFAULT;
- ifr32++;
- ifr++;
- }
-
- if (ifc32.ifcbuf == 0) {
- /* Translate from 64-bit structure multiple to
- * a 32-bit one.
- */
- i = ifc.ifc_len;
- i = ((i / sizeof(struct ifreq)) * sizeof(struct compat_ifreq));
- ifc32.ifc_len = i;
- } else {
- ifc32.ifc_len = i;
- }
+ ifc32.ifc_len = ifc.ifc_len;
if (copy_to_user(uifc32, &ifc32, sizeof(struct compat_ifconf)))
return -EFAULT;
@@ -3142,7 +3109,7 @@ static int compat_sock_ioctl_trans(struct file *file, struct socket *sock,
case SIOCGIFNAME:
return dev_ifname32(net, argp);
case SIOCGIFCONF:
- return dev_ifconf(net, argp);
+ return compat_dev_ifconf(net, argp);
case SIOCETHTOOL:
return ethtool_ioctl(net, argp);
case SIOCWANDEV:
--
2.11.0
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
include/linux/inetdevice.h | 2 +-
net/ipv4/af_inet.c | 21 ++++++++++++++++-----
net/ipv4/devinet.c | 41 +++++++++++++++--------------------------
net/ipv4/ipconfig.c | 17 +++--------------
4 files changed, 35 insertions(+), 46 deletions(-)
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 1ac5bf95bfdd..e16fe7d44a71 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -173,7 +173,7 @@ static inline struct net_device *ip_dev_find(struct net *net, __be32 addr)
}
int inet_addr_onlink(struct in_device *in_dev, __be32 a, __be32 b);
-int devinet_ioctl(struct net *net, unsigned int cmd, void __user *);
+int devinet_ioctl(struct net *net, unsigned int cmd, struct ifreq *);
void devinet_init(void);
struct in_device *inetdev_by_index(struct net *, int);
__be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 54cccdd8b1e3..1c2bfee2e249 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -872,6 +872,8 @@ int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
struct sock *sk = sock->sk;
int err = 0;
struct net *net = sock_net(sk);
+ void __user *p = (void __user *)arg;
+ struct ifreq ifr;
switch (cmd) {
case SIOCGSTAMP:
@@ -891,17 +893,26 @@ int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
err = arp_ioctl(net, cmd, (void __user *)arg);
break;
case SIOCGIFADDR:
- case SIOCSIFADDR:
case SIOCGIFBRDADDR:
- case SIOCSIFBRDADDR:
case SIOCGIFNETMASK:
- case SIOCSIFNETMASK:
case SIOCGIFDSTADDR:
+ case SIOCGIFPFLAGS:
+ if (copy_from_user(&ifr, p, sizeof(struct ifreq)))
+ return -EFAULT;
+ err = devinet_ioctl(net, cmd, &ifr);
+ if (!err && copy_to_user(p, &ifr, sizeof(struct ifreq)))
+ err = -EFAULT;
+ break;
+
+ case SIOCSIFADDR:
+ case SIOCSIFBRDADDR:
+ case SIOCSIFNETMASK:
case SIOCSIFDSTADDR:
case SIOCSIFPFLAGS:
- case SIOCGIFPFLAGS:
case SIOCSIFFLAGS:
- err = devinet_ioctl(net, cmd, (void __user *)arg);
+ if (copy_from_user(&ifr, p, sizeof(struct ifreq)))
+ return -EFAULT;
+ err = devinet_ioctl(net, cmd, &ifr);
break;
default:
if (sk->sk_prot->ioctl)
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 1771549d2438..e056c0067f2c 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -946,11 +946,10 @@ static int inet_abc_len(__be32 addr)
}
-int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
+int devinet_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr)
{
- struct ifreq ifr;
struct sockaddr_in sin_orig;
- struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
+ struct sockaddr_in *sin = (struct sockaddr_in *)&ifr->ifr_addr;
struct in_device *in_dev;
struct in_ifaddr **ifap = NULL;
struct in_ifaddr *ifa = NULL;
@@ -959,22 +958,16 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
int ret = -EFAULT;
int tryaddrmatch = 0;
- /*
- * Fetch the caller's info block into kernel space
- */
-
- if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
- goto out;
- ifr.ifr_name[IFNAMSIZ - 1] = 0;
+ ifr->ifr_name[IFNAMSIZ - 1] = 0;
/* save original address for comparison */
memcpy(&sin_orig, sin, sizeof(*sin));
- colon = strchr(ifr.ifr_name, ':');
+ colon = strchr(ifr->ifr_name, ':');
if (colon)
*colon = 0;
- dev_load(net, ifr.ifr_name);
+ dev_load(net, ifr->ifr_name);
switch (cmd) {
case SIOCGIFADDR: /* Get interface address */
@@ -1014,7 +1007,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
rtnl_lock();
ret = -ENODEV;
- dev = __dev_get_by_name(net, ifr.ifr_name);
+ dev = __dev_get_by_name(net, ifr->ifr_name);
if (!dev)
goto done;
@@ -1031,7 +1024,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
This is checked above. */
for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
ifap = &ifa->ifa_next) {
- if (!strcmp(ifr.ifr_name, ifa->ifa_label) &&
+ if (!strcmp(ifr->ifr_name, ifa->ifa_label) &&
sin_orig.sin_addr.s_addr ==
ifa->ifa_local) {
break; /* found */
@@ -1044,7 +1037,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
if (!ifa) {
for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
ifap = &ifa->ifa_next)
- if (!strcmp(ifr.ifr_name, ifa->ifa_label))
+ if (!strcmp(ifr->ifr_name, ifa->ifa_label))
break;
}
}
@@ -1056,19 +1049,19 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
switch (cmd) {
case SIOCGIFADDR: /* Get interface address */
sin->sin_addr.s_addr = ifa->ifa_local;
- goto rarok;
+ break;
case SIOCGIFBRDADDR: /* Get the broadcast address */
sin->sin_addr.s_addr = ifa->ifa_broadcast;
- goto rarok;
+ break;
case SIOCGIFDSTADDR: /* Get the destination address */
sin->sin_addr.s_addr = ifa->ifa_address;
- goto rarok;
+ break;
case SIOCGIFNETMASK: /* Get the netmask for the interface */
sin->sin_addr.s_addr = ifa->ifa_mask;
- goto rarok;
+ break;
case SIOCSIFFLAGS:
if (colon) {
@@ -1076,11 +1069,11 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
if (!ifa)
break;
ret = 0;
- if (!(ifr.ifr_flags & IFF_UP))
+ if (!(ifr->ifr_flags & IFF_UP))
inet_del_ifa(in_dev, ifap, 1);
break;
}
- ret = dev_change_flags(dev, ifr.ifr_flags);
+ ret = dev_change_flags(dev, ifr->ifr_flags);
break;
case SIOCSIFADDR: /* Set interface address (and family) */
@@ -1095,7 +1088,7 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
break;
INIT_HLIST_NODE(&ifa->hash);
if (colon)
- memcpy(ifa->ifa_label, ifr.ifr_name, IFNAMSIZ);
+ memcpy(ifa->ifa_label, ifr->ifr_name, IFNAMSIZ);
else
memcpy(ifa->ifa_label, dev->name, IFNAMSIZ);
} else {
@@ -1182,10 +1175,6 @@ int devinet_ioctl(struct net *net, unsigned int cmd, void __user *arg)
rtnl_unlock();
out:
return ret;
-rarok:
- rtnl_unlock();
- ret = copy_to_user(arg, &ifr, sizeof(struct ifreq)) ? -EFAULT : 0;
- goto out;
}
static int inet_gifconf(struct net_device *dev, char __user *buf, int len, int size)
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index e9e488e72900..6895fff609b1 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -329,17 +329,6 @@ set_sockaddr(struct sockaddr_in *sin, __be32 addr, __be16 port)
sin->sin_port = port;
}
-static int __init ic_devinet_ioctl(unsigned int cmd, struct ifreq *arg)
-{
- int res;
-
- mm_segment_t oldfs = get_fs();
- set_fs(get_ds());
- res = devinet_ioctl(&init_net, cmd, (struct ifreq __user *) arg);
- set_fs(oldfs);
- return res;
-}
-
static int __init ic_dev_ioctl(unsigned int cmd, struct ifreq *arg)
{
int res;
@@ -375,19 +364,19 @@ static int __init ic_setup_if(void)
memset(&ir, 0, sizeof(ir));
strcpy(ir.ifr_ifrn.ifrn_name, ic_dev->dev->name);
set_sockaddr(sin, ic_myaddr, 0);
- if ((err = ic_devinet_ioctl(SIOCSIFADDR, &ir)) < 0) {
+ if ((err = devinet_ioctl(&init_net, SIOCSIFADDR, &ir)) < 0) {
pr_err("IP-Config: Unable to set interface address (%d)\n",
err);
return -1;
}
set_sockaddr(sin, ic_netmask, 0);
- if ((err = ic_devinet_ioctl(SIOCSIFNETMASK, &ir)) < 0) {
+ if ((err = devinet_ioctl(&init_net, SIOCSIFNETMASK, &ir)) < 0) {
pr_err("IP-Config: Unable to set interface netmask (%d)\n",
err);
return -1;
}
set_sockaddr(sin, ic_myaddr | ~ic_netmask, 0);
- if ((err = ic_devinet_ioctl(SIOCSIFBRDADDR, &ir)) < 0) {
+ if ((err = devinet_ioctl(&init_net, SIOCSIFBRDADDR, &ir)) < 0) {
pr_err("IP-Config: Unable to set interface broadcast address (%d)\n",
err);
return -1;
--
2.11.0
From: Al Viro <[email protected]>
another sock_do_ioctl() equivalent
Signed-off-by: Al Viro <[email protected]>
---
net/socket.c | 36 ++++--------------------------------
1 file changed, 4 insertions(+), 32 deletions(-)
diff --git a/net/socket.c b/net/socket.c
index f280258bd6a4..b267d051b50d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2861,33 +2861,6 @@ static int compat_siocwandev(struct net *net, struct compat_ifreq __user *uifr32
return dev_ioctl(net, SIOCWANDEV, uifr);
}
-static int bond_ioctl(struct net *net, unsigned int cmd,
- struct compat_ifreq __user *ifr32)
-{
- struct ifreq kifr;
- mm_segment_t old_fs;
- int err;
-
- switch (cmd) {
- case SIOCBONDENSLAVE:
- case SIOCBONDRELEASE:
- case SIOCBONDSETHWADDR:
- case SIOCBONDCHANGEACTIVE:
- if (copy_from_user(&kifr, ifr32, sizeof(struct compat_ifreq)))
- return -EFAULT;
-
- old_fs = get_fs();
- set_fs(KERNEL_DS);
- err = dev_ioctl(net, cmd,
- (struct ifreq __user __force *) &kifr);
- set_fs(old_fs);
-
- return err;
- default:
- return -ENOIOCTLCMD;
- }
-}
-
/* Handle ioctls that use ifreq::ifr_data and just need struct ifreq converted */
static int compat_ifr_data_ioctl(struct net *net, unsigned int cmd,
struct compat_ifreq __user *u_ifreq32)
@@ -3081,11 +3054,6 @@ static int compat_sock_ioctl_trans(struct file *file, struct socket *sock,
case SIOCGIFMAP:
case SIOCSIFMAP:
return compat_sioc_ifmap(net, cmd, argp);
- case SIOCBONDENSLAVE:
- case SIOCBONDRELEASE:
- case SIOCBONDSETHWADDR:
- case SIOCBONDCHANGEACTIVE:
- return bond_ioctl(net, cmd, argp);
case SIOCADDRT:
case SIOCDELRT:
return routing_ioctl(net, sock, cmd, argp);
@@ -3149,6 +3117,10 @@ static int compat_sock_ioctl_trans(struct file *file, struct socket *sock,
case SIOCGARP:
case SIOCDARP:
case SIOCATMARK:
+ case SIOCBONDENSLAVE:
+ case SIOCBONDRELEASE:
+ case SIOCBONDSETHWADDR:
+ case SIOCBONDCHANGEACTIVE:
return sock_do_ioctl(net, sock, cmd, arg);
}
--
2.11.0
From: Al Viro <[email protected]>
no users since 2014
Signed-off-by: Al Viro <[email protected]>
---
include/linux/net.h | 1 -
net/socket.c | 13 -------------
2 files changed, 14 deletions(-)
diff --git a/include/linux/net.h b/include/linux/net.h
index caeb159abda5..68acc54976bf 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -306,7 +306,6 @@ int kernel_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags);
int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
size_t size, int flags);
-int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how);
/* Routine returns the IP overhead imposed by a (caller-protected) socket. */
diff --git a/net/socket.c b/net/socket.c
index 982e9585fa31..3184573f58df 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3254,19 +3254,6 @@ int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
}
EXPORT_SYMBOL(kernel_sendpage_locked);
-int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg)
-{
- mm_segment_t oldfs = get_fs();
- int err;
-
- set_fs(KERNEL_DS);
- err = sock->ops->ioctl(sock, cmd, arg);
- set_fs(oldfs);
-
- return err;
-}
-EXPORT_SYMBOL(kernel_sock_ioctl);
-
int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how)
{
return sock->ops->shutdown(sock, how);
--
2.11.0
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
include/linux/netdevice.h | 3 +-
net/core/dev_ioctl.c | 85 +++++++++++++------------------------------
net/socket.c | 91 +++++++++++++++++++++++------------------------
3 files changed, 71 insertions(+), 108 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 221487f5b214..3b4f603bc360 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3313,7 +3313,8 @@ int netdev_rx_handler_register(struct net_device *dev,
void netdev_rx_handler_unregister(struct net_device *dev);
bool dev_valid_name(const char *name);
-int dev_ioctl(struct net *net, unsigned int cmd, void __user *);
+int dev_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr,
+ bool *need_copyout);
int dev_ifconf(struct net *net, struct ifconf *, int);
int dev_ethtool(struct net *net, struct ifreq *);
unsigned int dev_get_flags(const struct net_device *);
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index d262f159f9fd..0ab1af04296c 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -18,26 +18,10 @@
* match. --pb
*/
-static int dev_ifname(struct net *net, struct ifreq __user *arg)
+static int dev_ifname(struct net *net, struct ifreq *ifr)
{
- struct ifreq ifr;
- int error;
-
- /*
- * Fetch the caller's info block.
- */
-
- if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
- return -EFAULT;
- ifr.ifr_name[IFNAMSIZ-1] = 0;
-
- error = netdev_get_name(net, ifr.ifr_name, ifr.ifr_ifindex);
- if (error)
- return error;
-
- if (copy_to_user(arg, &ifr, sizeof(struct ifreq)))
- return -EFAULT;
- return 0;
+ ifr->ifr_name[IFNAMSIZ-1] = 0;
+ return netdev_get_name(net, ifr->ifr_name, ifr->ifr_ifindex);
}
static gifconf_func_t *gifconf_list[NPROTO];
@@ -402,24 +386,24 @@ EXPORT_SYMBOL(dev_load);
* positive or a negative errno code on error.
*/
-int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
+int dev_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr, bool *need_copyout)
{
- struct ifreq ifr;
int ret;
char *colon;
+ if (need_copyout)
+ *need_copyout = true;
if (cmd == SIOCGIFNAME)
- return dev_ifname(net, (struct ifreq __user *)arg);
-
- if (copy_from_user(&ifr, arg, sizeof(struct ifreq)))
- return -EFAULT;
+ return dev_ifname(net, ifr);
- ifr.ifr_name[IFNAMSIZ-1] = 0;
+ ifr->ifr_name[IFNAMSIZ-1] = 0;
- colon = strchr(ifr.ifr_name, ':');
+ colon = strchr(ifr->ifr_name, ':');
if (colon)
*colon = 0;
+ dev_load(net, ifr->ifr_name);
+
/*
* See which interface the caller is talking about.
*/
@@ -439,31 +423,19 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
case SIOCGIFMAP:
case SIOCGIFINDEX:
case SIOCGIFTXQLEN:
- dev_load(net, ifr.ifr_name);
rcu_read_lock();
- ret = dev_ifsioc_locked(net, &ifr, cmd);
+ ret = dev_ifsioc_locked(net, ifr, cmd);
rcu_read_unlock();
- if (!ret) {
- if (colon)
- *colon = ':';
- if (copy_to_user(arg, &ifr,
- sizeof(struct ifreq)))
- ret = -EFAULT;
- }
+ if (colon)
+ *colon = ':';
return ret;
case SIOCETHTOOL:
- dev_load(net, ifr.ifr_name);
rtnl_lock();
- ret = dev_ethtool(net, &ifr);
+ ret = dev_ethtool(net, ifr);
rtnl_unlock();
- if (!ret) {
- if (colon)
- *colon = ':';
- if (copy_to_user(arg, &ifr,
- sizeof(struct ifreq)))
- ret = -EFAULT;
- }
+ if (colon)
+ *colon = ':';
return ret;
/*
@@ -477,17 +449,11 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
case SIOCSIFNAME:
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
return -EPERM;
- dev_load(net, ifr.ifr_name);
rtnl_lock();
- ret = dev_ifsioc(net, &ifr, cmd);
+ ret = dev_ifsioc(net, ifr, cmd);
rtnl_unlock();
- if (!ret) {
- if (colon)
- *colon = ':';
- if (copy_to_user(arg, &ifr,
- sizeof(struct ifreq)))
- ret = -EFAULT;
- }
+ if (colon)
+ *colon = ':';
return ret;
/*
@@ -528,10 +494,11 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
/* fall through */
case SIOCBONDSLAVEINFOQUERY:
case SIOCBONDINFOQUERY:
- dev_load(net, ifr.ifr_name);
rtnl_lock();
- ret = dev_ifsioc(net, &ifr, cmd);
+ ret = dev_ifsioc(net, ifr, cmd);
rtnl_unlock();
+ if (need_copyout)
+ *need_copyout = false;
return ret;
case SIOCGIFMEM:
@@ -551,13 +518,9 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
cmd == SIOCGHWTSTAMP ||
(cmd >= SIOCDEVPRIVATE &&
cmd <= SIOCDEVPRIVATE + 15)) {
- dev_load(net, ifr.ifr_name);
rtnl_lock();
- ret = dev_ifsioc(net, &ifr, cmd);
+ ret = dev_ifsioc(net, ifr, cmd);
rtnl_unlock();
- if (!ret && copy_to_user(arg, &ifr,
- sizeof(struct ifreq)))
- ret = -EFAULT;
return ret;
}
return -ENOTTY;
diff --git a/net/socket.c b/net/socket.c
index d0e24f6ec1a9..982e9585fa31 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -973,10 +973,17 @@ static long sock_do_ioctl(struct net *net, struct socket *sock,
rtnl_unlock();
if (!err && copy_to_user(argp, &ifc, sizeof(struct ifconf)))
err = -EFAULT;
- return err;
+ } else {
+ struct ifreq ifr;
+ bool need_copyout;
+ if (copy_from_user(&ifr, argp, sizeof(struct ifreq)))
+ return -EFAULT;
+ err = dev_ioctl(net, cmd, &ifr, &need_copyout);
+ if (!err && need_copyout)
+ if (copy_to_user(argp, &ifr, sizeof(struct ifreq)))
+ return -EFAULT;
}
-
- return dev_ioctl(net, cmd, argp);
+ return err;
}
/*
@@ -1000,8 +1007,15 @@ static long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg)
sock = file->private_data;
sk = sock->sk;
net = sock_net(sk);
- if (cmd >= SIOCDEVPRIVATE && cmd <= (SIOCDEVPRIVATE + 15)) {
- err = dev_ioctl(net, cmd, argp);
+ if (unlikely(cmd >= SIOCDEVPRIVATE && cmd <= (SIOCDEVPRIVATE + 15))) {
+ struct ifreq ifr;
+ bool need_copyout;
+ if (copy_from_user(&ifr, argp, sizeof(struct ifreq)))
+ return -EFAULT;
+ err = dev_ioctl(net, cmd, &ifr, &need_copyout);
+ if (!err && need_copyout)
+ if (copy_to_user(argp, &ifr, sizeof(struct ifreq)))
+ return -EFAULT;
} else
#ifdef CONFIG_WEXT_CORE
if (cmd >= SIOCIWFIRST && cmd <= SIOCIWLAST) {
@@ -2704,9 +2718,9 @@ static int ethtool_ioctl(struct net *net, struct compat_ifreq __user *ifr32)
{
struct compat_ethtool_rxnfc __user *compat_rxnfc;
bool convert_in = false, convert_out = false;
- size_t buf_size = ALIGN(sizeof(struct ifreq), 8);
- struct ethtool_rxnfc __user *rxnfc;
- struct ifreq __user *ifr;
+ size_t buf_size = 0;
+ struct ethtool_rxnfc __user *rxnfc = NULL;
+ struct ifreq ifr;
u32 rule_cnt = 0, actual_rule_cnt;
u32 ethcmd;
u32 data;
@@ -2743,18 +2757,14 @@ static int ethtool_ioctl(struct net *net, struct compat_ifreq __user *ifr32)
case ETHTOOL_SRXCLSRLDEL:
buf_size += sizeof(struct ethtool_rxnfc);
convert_in = true;
+ rxnfc = compat_alloc_user_space(buf_size);
break;
}
- ifr = compat_alloc_user_space(buf_size);
- rxnfc = (void __user *)ifr + ALIGN(sizeof(struct ifreq), 8);
-
- if (copy_in_user(&ifr->ifr_name, &ifr32->ifr_name, IFNAMSIZ))
+ if (copy_from_user(&ifr.ifr_name, &ifr32->ifr_name, IFNAMSIZ))
return -EFAULT;
- if (put_user(convert_in ? rxnfc : compat_ptr(data),
- &ifr->ifr_ifru.ifru_data))
- return -EFAULT;
+ ifr.ifr_data = convert_in ? rxnfc : (void __user *)compat_rxnfc;
if (convert_in) {
/* We expect there to be holes between fs.m_ext and
@@ -2782,7 +2792,7 @@ static int ethtool_ioctl(struct net *net, struct compat_ifreq __user *ifr32)
return -EFAULT;
}
- ret = dev_ioctl(net, SIOCETHTOOL, ifr);
+ ret = dev_ioctl(net, SIOCETHTOOL, &ifr, NULL);
if (ret)
return ret;
@@ -2823,50 +2833,43 @@ static int ethtool_ioctl(struct net *net, struct compat_ifreq __user *ifr32)
static int compat_siocwandev(struct net *net, struct compat_ifreq __user *uifr32)
{
- void __user *uptr;
compat_uptr_t uptr32;
- struct ifreq __user *uifr;
+ struct ifreq ifr;
+ void __user *saved;
+ int err;
- uifr = compat_alloc_user_space(sizeof(*uifr));
- if (copy_in_user(uifr, uifr32, sizeof(struct compat_ifreq)))
+ if (copy_from_user(&ifr, uifr32, sizeof(struct compat_ifreq)))
return -EFAULT;
if (get_user(uptr32, &uifr32->ifr_settings.ifs_ifsu))
return -EFAULT;
- uptr = compat_ptr(uptr32);
-
- if (put_user(uptr, &uifr->ifr_settings.ifs_ifsu.raw_hdlc))
- return -EFAULT;
+ saved = ifr.ifr_settings.ifs_ifsu.raw_hdlc;
+ ifr.ifr_settings.ifs_ifsu.raw_hdlc = compat_ptr(uptr32);
- return dev_ioctl(net, SIOCWANDEV, uifr);
+ err = dev_ioctl(net, SIOCWANDEV, &ifr, NULL);
+ if (!err) {
+ ifr.ifr_settings.ifs_ifsu.raw_hdlc = saved;
+ if (copy_to_user(uifr32, &ifr, sizeof(struct compat_ifreq)))
+ err = -EFAULT;
+ }
+ return err;
}
/* Handle ioctls that use ifreq::ifr_data and just need struct ifreq converted */
static int compat_ifr_data_ioctl(struct net *net, unsigned int cmd,
struct compat_ifreq __user *u_ifreq32)
{
- struct ifreq __user *u_ifreq64;
- char tmp_buf[IFNAMSIZ];
- void __user *data64;
+ struct ifreq ifreq;
u32 data32;
- if (copy_from_user(&tmp_buf[0], &(u_ifreq32->ifr_ifrn.ifrn_name[0]),
- IFNAMSIZ))
+ if (copy_from_user(ifreq.ifr_name, u_ifreq32->ifr_name, IFNAMSIZ))
return -EFAULT;
- if (get_user(data32, &u_ifreq32->ifr_ifru.ifru_data))
+ if (get_user(data32, &u_ifreq32->ifr_data))
return -EFAULT;
- data64 = compat_ptr(data32);
+ ifreq.ifr_data = compat_ptr(data32);
- u_ifreq64 = compat_alloc_user_space(sizeof(*u_ifreq64));
-
- if (copy_to_user(&u_ifreq64->ifr_ifrn.ifrn_name[0], &tmp_buf[0],
- IFNAMSIZ))
- return -EFAULT;
- if (put_user(data64, &u_ifreq64->ifr_ifru.ifru_data))
- return -EFAULT;
-
- return dev_ioctl(net, cmd, u_ifreq64);
+ return dev_ioctl(net, cmd, &ifreq, NULL);
}
static int compat_sioc_ifmap(struct net *net, unsigned int cmd,
@@ -2874,7 +2877,6 @@ static int compat_sioc_ifmap(struct net *net, unsigned int cmd,
{
struct ifreq ifr;
struct compat_ifmap __user *uifmap32;
- mm_segment_t old_fs;
int err;
uifmap32 = &uifr32->ifr_ifru.ifru_map;
@@ -2888,10 +2890,7 @@ static int compat_sioc_ifmap(struct net *net, unsigned int cmd,
if (err)
return -EFAULT;
- old_fs = get_fs();
- set_fs(KERNEL_DS);
- err = dev_ioctl(net, cmd, (void __user __force *)&ifr);
- set_fs(old_fs);
+ err = dev_ioctl(net, cmd, &ifr, NULL);
if (cmd == SIOCGIFMAP && !err) {
err = copy_to_user(uifr32, &ifr, sizeof(ifr.ifr_name));
--
2.11.0
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
include/net/route.h | 2 +-
net/ipv4/af_inet.c | 7 ++++++-
net/ipv4/fib_frontend.c | 8 ++------
net/ipv4/ipconfig.c | 13 +------------
4 files changed, 10 insertions(+), 20 deletions(-)
diff --git a/include/net/route.h b/include/net/route.h
index d538e6db1afe..1eb9ce470e25 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -217,7 +217,7 @@ unsigned int inet_addr_type_dev_table(struct net *net,
const struct net_device *dev,
__be32 addr);
void ip_rt_multicast_event(struct in_device *);
-int ip_rt_ioctl(struct net *, unsigned int cmd, void __user *arg);
+int ip_rt_ioctl(struct net *, unsigned int cmd, struct rtentry *rt);
void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
struct rtable *rt_dst_alloc(struct net_device *dev,
unsigned int flags, u16 type,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1c2bfee2e249..c24008daa3d8 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -874,6 +874,7 @@ int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
struct net *net = sock_net(sk);
void __user *p = (void __user *)arg;
struct ifreq ifr;
+ struct rtentry rt;
switch (cmd) {
case SIOCGSTAMP:
@@ -884,8 +885,12 @@ int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
break;
case SIOCADDRT:
case SIOCDELRT:
+ if (copy_from_user(&rt, p, sizeof(struct rtentry)))
+ return -EFAULT;
+ err = ip_rt_ioctl(net, cmd, &rt);
+ break;
case SIOCRTMSG:
- err = ip_rt_ioctl(net, cmd, (void __user *)arg);
+ err = -EINVAL;
break;
case SIOCDARP:
case SIOCGARP:
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 08259d078b1c..f05afaf3235c 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -587,10 +587,9 @@ static int rtentry_to_fib_config(struct net *net, int cmd, struct rtentry *rt,
* Handle IP routing ioctl calls.
* These are used to manipulate the routing tables
*/
-int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg)
+int ip_rt_ioctl(struct net *net, unsigned int cmd, struct rtentry *rt)
{
struct fib_config cfg;
- struct rtentry rt;
int err;
switch (cmd) {
@@ -599,11 +598,8 @@ int ip_rt_ioctl(struct net *net, unsigned int cmd, void __user *arg)
if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
return -EPERM;
- if (copy_from_user(&rt, arg, sizeof(rt)))
- return -EFAULT;
-
rtnl_lock();
- err = rtentry_to_fib_config(net, cmd, &rt, &cfg);
+ err = rtentry_to_fib_config(net, cmd, rt, &cfg);
if (err == 0) {
struct fib_table *tb;
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 6895fff609b1..5f396afaa08d 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -340,17 +340,6 @@ static int __init ic_dev_ioctl(unsigned int cmd, struct ifreq *arg)
return res;
}
-static int __init ic_route_ioctl(unsigned int cmd, struct rtentry *arg)
-{
- int res;
-
- mm_segment_t oldfs = get_fs();
- set_fs(get_ds());
- res = ip_rt_ioctl(&init_net, cmd, (void __user *) arg);
- set_fs(oldfs);
- return res;
-}
-
/*
* Set up interface addresses and routes.
*/
@@ -412,7 +401,7 @@ static int __init ic_setup_routes(void)
set_sockaddr((struct sockaddr_in *) &rm.rt_genmask, 0, 0);
set_sockaddr((struct sockaddr_in *) &rm.rt_gateway, ic_gateway, 0);
rm.rt_flags = RTF_UP | RTF_GATEWAY;
- if ((err = ic_route_ioctl(SIOCADDRT, &rm)) < 0) {
+ if ((err = ip_rt_ioctl(&init_net, SIOCADDRT, &rm)) < 0) {
pr_err("IP-Config: Cannot add default route (%d)\n",
err);
return -1;
--
2.11.0
From: Al Viro <[email protected]>
Signed-off-by: Al Viro <[email protected]>
---
net/ipv4/ipconfig.c | 17 +++--------------
1 file changed, 3 insertions(+), 14 deletions(-)
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 5f396afaa08d..f75802ad960f 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -329,17 +329,6 @@ set_sockaddr(struct sockaddr_in *sin, __be32 addr, __be16 port)
sin->sin_port = port;
}
-static int __init ic_dev_ioctl(unsigned int cmd, struct ifreq *arg)
-{
- int res;
-
- mm_segment_t oldfs = get_fs();
- set_fs(get_ds());
- res = dev_ioctl(&init_net, cmd, (struct ifreq __user *) arg);
- set_fs(oldfs);
- return res;
-}
-
/*
* Set up interface addresses and routes.
*/
@@ -375,11 +364,11 @@ static int __init ic_setup_if(void)
* out, we'll try to muddle along.
*/
if (ic_dev_mtu != 0) {
- strcpy(ir.ifr_name, ic_dev->dev->name);
- ir.ifr_mtu = ic_dev_mtu;
- if ((err = ic_dev_ioctl(SIOCSIFMTU, &ir)) < 0)
+ rtnl_lock();
+ if ((err = dev_set_mtu(ic_dev->dev, ic_dev_mtu)) < 0)
pr_err("IP-Config: Unable to set interface mtu to %d (%d)\n",
ic_dev_mtu, err);
+ rtnl_unlock();
}
return 0;
}
--
2.11.0
From: Al Viro <[email protected]>
it's been equivalent to sock_do_ioctl() since 2009...
Signed-off-by: Al Viro <[email protected]>
---
net/socket.c | 38 --------------------------------------
1 file changed, 38 deletions(-)
diff --git a/net/socket.c b/net/socket.c
index 112216c537e7..f280258bd6a4 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2915,42 +2915,6 @@ static int compat_ifr_data_ioctl(struct net *net, unsigned int cmd,
return dev_ioctl(net, cmd, u_ifreq64);
}
-static int dev_ifsioc(struct net *net, struct socket *sock,
- unsigned int cmd, struct compat_ifreq __user *uifr32)
-{
- struct ifreq __user *uifr;
- int err;
-
- uifr = compat_alloc_user_space(sizeof(*uifr));
- if (copy_in_user(uifr, uifr32, sizeof(*uifr32)))
- return -EFAULT;
-
- err = sock_do_ioctl(net, sock, cmd, (unsigned long)uifr);
-
- if (!err) {
- switch (cmd) {
- case SIOCGIFFLAGS:
- case SIOCGIFMETRIC:
- case SIOCGIFMTU:
- case SIOCGIFMEM:
- case SIOCGIFHWADDR:
- case SIOCGIFINDEX:
- case SIOCGIFADDR:
- case SIOCGIFBRDADDR:
- case SIOCGIFDSTADDR:
- case SIOCGIFNETMASK:
- case SIOCGIFPFLAGS:
- case SIOCGIFTXQLEN:
- case SIOCGMIIPHY:
- case SIOCGMIIREG:
- if (copy_in_user(uifr32, uifr, sizeof(*uifr32)))
- err = -EFAULT;
- break;
- }
- }
- return err;
-}
-
static int compat_sioc_ifmap(struct net *net, unsigned int cmd,
struct compat_ifreq __user *uifr32)
{
@@ -3181,8 +3145,6 @@ static int compat_sock_ioctl_trans(struct file *file, struct socket *sock,
case SIOCGMIIPHY:
case SIOCGMIIREG:
case SIOCSMIIREG:
- return dev_ifsioc(net, sock, cmd, argp);
-
case SIOCSARP:
case SIOCGARP:
case SIOCDARP:
--
2.11.0
From: Al Viro <[email protected]>
same story...
Signed-off-by: Al Viro <[email protected]>
---
net/socket.c | 22 +---------------------
1 file changed, 1 insertion(+), 21 deletions(-)
diff --git a/net/socket.c b/net/socket.c
index b267d051b50d..6d29ebce93dd 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2675,25 +2675,6 @@ static int do_siocgstampns(struct net *net, struct socket *sock,
return err;
}
-static int dev_ifname32(struct net *net, struct compat_ifreq __user *uifr32)
-{
- struct ifreq __user *uifr;
- int err;
-
- uifr = compat_alloc_user_space(sizeof(struct ifreq));
- if (copy_in_user(uifr, uifr32, sizeof(struct compat_ifreq)))
- return -EFAULT;
-
- err = dev_ioctl(net, SIOCGIFNAME, uifr);
- if (err)
- return err;
-
- if (copy_in_user(uifr32, uifr, sizeof(struct compat_ifreq)))
- return -EFAULT;
-
- return 0;
-}
-
static int compat_dev_ifconf(struct net *net, struct compat_ifconf __user *uifc32)
{
struct compat_ifconf ifc32;
@@ -3043,8 +3024,6 @@ static int compat_sock_ioctl_trans(struct file *file, struct socket *sock,
case SIOCSIFBR:
case SIOCGIFBR:
return old_bridge_ioctl(argp);
- case SIOCGIFNAME:
- return dev_ifname32(net, argp);
case SIOCGIFCONF:
return compat_dev_ifconf(net, argp);
case SIOCETHTOOL:
@@ -3121,6 +3100,7 @@ static int compat_sock_ioctl_trans(struct file *file, struct socket *sock,
case SIOCBONDRELEASE:
case SIOCBONDSETHWADDR:
case SIOCBONDCHANGEACTIVE:
+ case SIOCGIFNAME:
return sock_do_ioctl(net, sock, cmd, arg);
}
--
2.11.0
On Thu, Jan 18, 2018 at 07:31:56PM +0000, Al Viro wrote:
> * SO_RCVTIMEO/SO_SNDTIMEO handling in compat [sg]etsockopt()
> * passing SIOC{ADD,DEL}TUNNEL down (ipmr_del_tunnel(),ipmr_new_tunnel(),
> addrconf_set_dstaddr())
> * SIOCGSTAMP/SIOCGSTAMPNS in compat ioctls
> * SIOCADDRT/SIOCDELRT in compat ioctls
> * kernel_[gs]etsockopt()
> * ipv6_renew_options_kern()
>
> I don't know if all of that stuff can be realistically done without set_fs().
> kernel_setsockopt(), in particular, is unpleasant...
Speaking of weird indirect calls: in net/packet/af_packet.c:packet_ioctl() we
have this:
#ifdef CONFIG_INET
case SIOCADDRT:
case SIOCDELRT:
case SIOCDARP:
case SIOCGARP:
case SIOCSARP:
case SIOCGIFADDR:
case SIOCSIFADDR:
case SIOCGIFBRDADDR:
case SIOCSIFBRDADDR:
case SIOCGIFNETMASK:
case SIOCSIFNETMASK:
case SIOCGIFDSTADDR:
case SIOCSIFDSTADDR:
case SIOCSIFFLAGS:
return inet_dgram_ops.ioctl(sock, cmd, arg);
#endif
That's inet_ioctl(sock, cmd, arg) disguised by an indirect call. AFAICS,
that line dates back to 2.1.89; back then inet_dgram_ops had been
exported and inet_ioctl() had been static. When SCTP went in
they'd exported inet_ioctl() rather than playing that kind of
game.
Is there anything subtle I'm missing here that would make it
wrong to replace that with an explicit call of inet_ioctl()?
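For reference, the replacement in question would be a one-liner; a sketch only, not a tested patch:

```diff
 #ifdef CONFIG_INET
 	case SIOCADDRT:
 	...
 	case SIOCSIFFLAGS:
-		return inet_dgram_ops.ioctl(sock, cmd, arg);
+		return inet_ioctl(sock, cmd, arg);
 #endif
```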
On Thu, Jan 18, 2018 at 07:31:56PM +0000, Al Viro wrote:
> * SIOCADDRT/SIOCDELRT in compat ioctls
To bring back a question I'd asked back in October - what do
we do about SIOC...RT compat?
To recap:
* AF_INET sockets expect struct rtentry; it differs
between 32bit and 64bit, so routing_ioctl() in net/socket.c
is called from compat_sock_ioctl_trans() and does the right
thing. All proto_ops instances with .family = PF_INET (and
only they) have inet_ioctl() as ->ioctl(), and end up with
ip_rt_ioctl() called for native ones. Three of those have
->compat_ioctl() set to inet_compat_ioctl(), the rest have
it NULL. In any case, inet_compat_ioctl() ignores those,
leaving them to compat_sock_ioctl_trans() to pick up.
* for AF_INET6 the situation is similar, except that
they use struct in6_rtmsg. Compat is also dealt with in
routing_ioctl(). inet6_ioctl() for all such proto_ops
(and only those), ipv6_route_ioctl() is what ends up
handling the native ones. No ->compat_ioctl() in any
of those.
* AF_PACKET sockets expect struct rtentry and
actually bounce the native calls to inet_ioctl(). No
->compat_ioctl() there, but routing_ioctl() in net/socket.c
does the right thing.
* AF_APPLETALK sockets expect struct rtentry.
Native handled in atrtr_ioctl(); there is ->compat_ioctl(),
but it ignores those ioctls, so we go through the conversion
in net/socket.c. Also happens to work correctly.
* ax25, ipx, netrom, rose and x25 use structures
of their own, and those structures have identical layouts on
32bit and 64bit. x25 has ->compat_ioctl() that does the
right thing (bounces to native), the rest either have
->compat_ioctl() ignoring those ioctls (ipx) or do not
have ->compat_ioctl() at all. That ends up with generic
code picking those and buggering them up - routing_ioctl()
assumes that we want either in6_rtmsg (ipv6) or rtentry
(everything else). Unfortunately, in the case of these
protocols we should just leave the suckers alone.
Back then Ralf verified that the bug exists
and said he'd put together a fix. Looks like that fix
has fallen through the cracks, though.
* all other protocols fail those; usually with
ENOTTY, except for AF_QIPCRTR that fails with EINVAL.
Either way, compat is not an issue.
Note that handling of SIOCADDRT on e.g. raw ipv4
sockets from a 32bit process is convoluted as hell. The
call chain is
compat_sys_ioctl()
compat_sock_ioctl()
inet_compat_ioctl()
compat_raw_ioctl()
=> -ENOIOCTLCMD, possibly
by way of ipmr_compat_ioctl()
compat_sock_ioctl_trans()
routing_ioctl() [conversion done here]
sock_do_ioctl()
inet_ioctl()
ip_rt_ioctl()
A lot of those are method calls, BTW, and the overhead on those has
just grown...
Does anybody have objections to the following?
1) Somewhere in net/core (or net/compat.c, for that matter) add
int compat_get_rtentry(struct rtentry *r, struct rtentry32 __user *p);
2) In inet_compat_ioctl() recognize SIOC{ADD,DEL}RT and do
err = compat_get_rtentry(&r, (void __user *)arg);
if (!err)
err = ip_rt_ioctl(...)
return err;
3) Add inet_compat_ioctl() as ->compat_ioctl in all PF_INET proto_ops.
4) Lift copyin from atrtr_ioctl() to atalk_ioctl(), teach
atalk_compat_ioctl() about these ioctls (using compat_get_rtentry()
and atrtr_ioctl(), that is).
5) Add ->compat_ioctl() to AF_PACKET, let it just call inet_compat_ioctl()
for those two.
6) Lift copyin from ipv6_route_ioctl() to inet6_ioctl().
Add inet6_compat_ioctl() that would recognize those two, do compat copyin
and call ipv6_route_ioctl(). Make it ->compat_ioctl for all PF_INET6
proto_ops.
7) Tell compat_sock_ioctl_trans() to move these two into the "just call
sock_do_ioctl()" group of cases. Or, with Ralf's fix, just remove these
two cases from compat_sock_ioctl_trans() completely. Either way,
routing_ioctl() becomes dead code and can be removed.
Looks nice:
Reviewed-by: Christoph Hellwig <[email protected]>
Looks good,
Reviewed-by: Christoph Hellwig <[email protected]>
> case SIOCGSTAMP:
> @@ -884,8 +885,12 @@ int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
> break;
> case SIOCADDRT:
> case SIOCDELRT:
> + if (copy_from_user(&rt, p, sizeof(struct rtentry)))
> + return -EFAULT;
> + err = ip_rt_ioctl(net, cmd, &rt);
> + break;
> case SIOCRTMSG:
> - err = ip_rt_ioctl(net, cmd, (void __user *)arg);
> + err = -EINVAL;
This looks odd, but ip_rt_ioctl never handled SIOCRTMSG to start with,
so it looks fine. Might be worth splitting into another prep patch
with a good changelog.
ip_rt_ioctl could also use some additional simplification if it's only
called for SIOCADDRT/SIOCDELRT and lose a level of indentation while
we're at it.
But otherwise this looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
On Thu, Jan 18, 2018 at 07:37:49PM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> it's been equivalent to sock_do_ioctl() since 2009...
>
> Signed-off-by: Al Viro <[email protected]>
Note that we have two different dev_ifsioc() definitions, which
is a little weird.
But the net/socket.c one does indeed seem superfluous as there is no
point in doing the compat_alloc_user_space / copy_in_user as far
as I can tell. compat_ifreq differs from ifreq in the size of
ifru_data, but it seems none of these ioctls actually use it.
A better changelog would be nice, though.
On Thu, Jan 18, 2018 at 07:37:50PM +0000, Al Viro wrote:
> From: Al Viro <[email protected]>
>
> another sock_do_ioctl() equivalent
Except for the additional data copy. Which might be worth mentioning
in the changelog.
Looks fine,
Reviewed-by: Christoph Hellwig <[email protected]>
Al this series looks fine to me, want me to toss it into net-next?
On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote:
>
> Al this series looks fine to me, want me to toss it into net-next?
Do you want them reposted (with updated commit messages), or would
you prefer a pull request (with or without rebase to current tip
of net-next)?
On Thu, Jan 25, 2018 at 12:01:25AM +0000, Al Viro wrote:
> On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote:
> >
> > Al this series looks fine to me, want me to toss it into net-next?
>
> Do you want them reposted (with updated commit messages), or would
> you prefer a pull request (with or without rebase to current tip
> of net-next)?
Below is a pull request for rebased branch. Patches themselves are
identical to what had been posted, Reviewed-by added and commit message
for "kill dev_ifsioc()" made more detailed.
The following changes since commit be1b6e8b5470e8311bfa1a3dfd7bd59e85a99759:
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue (2018-01-24 18:02:17 -0500)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git rebased-net-ioctl
for you to fetch changes up to 5c59e564e46dcbab2ee7a4e9e0243562a39679a2:
kill kernel_sock_ioctl() (2018-01-24 19:13:45 -0500)
----------------------------------------------------------------
Al Viro (10):
net: separate SIOCGIFCONF handling from dev_ioctl()
devinet_ioctl(): take copyin/copyout to caller
ip_rt_ioctl(): take copyin to caller
kill dev_ifsioc()
kill bond_ioctl()
kill dev_ifname32()
lift handling of SIOCIW... out of dev_ioctl()
ipconfig: use dev_set_mtu()
dev_ioctl(): move copyin/copyout to callers
kill kernel_sock_ioctl()
include/linux/inetdevice.h | 2 +-
include/linux/net.h | 1 -
include/linux/netdevice.h | 7 +-
include/net/route.h | 2 +-
include/net/wext.h | 4 +-
net/core/dev_ioctl.c | 132 ++++++----------------
net/ipv4/af_inet.c | 28 ++++-
net/ipv4/devinet.c | 57 ++++------
net/ipv4/fib_frontend.c | 8 +-
net/ipv4/ipconfig.c | 47 ++------
net/socket.c | 271 ++++++++++++---------------------------------
net/wireless/wext-core.c | 13 ++-
12 files changed, 173 insertions(+), 399 deletions(-)
From: Al Viro <[email protected]>
Date: Thu, 25 Jan 2018 00:01:25 +0000
> On Wed, Jan 24, 2018 at 03:52:44PM -0500, David Miller wrote:
>>
>> Al this series looks fine to me, want me to toss it into net-next?
>
> Do you want them reposted (with updated commit messages), or would
> you prefer a pull request (with or without rebase to current tip
> of net-next)?
A pull request works for me. Rebasing to net-next tip is pilot's
discretion.