Quoting Mark's original RFC:
"Recently, Google Project Zero discovered several classes of attack
against speculative execution. One of these, known as variant-1, allows
explicit bounds checks to be bypassed under speculation, providing an
arbitrary read gadget. Further details can be found on the GPZ blog [1]
and the Documentation patch in this series."
This series incorporates Mark Rutland's latest api and adds the x86
specific implementation of nospec_barrier. The
nospec_{array_ptr,ptr,barrier} helpers are then combined with a
kernel-wide analysis performed by Elena Reshetova to address static
analysis
reports where speculative execution on a userspace controlled value
could bypass a bounds check. The patches address a precondition for the
attack discussed in the Spectre paper [2].
A consideration worth noting for reviewing these patches is to weigh the
dramatic cost of being wrong about whether a given report is exploitable
vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
let's make the bar for applying these patches be "can you prove that the
bounds check bypass is *not* exploitable". Consider that the Spectre
paper reports one example of a speculation window being ~180 cycles.
Note that there is also a proposal from Linus, array_access [3], that
attempts to quash speculative execution past a bounds check without
introducing an lfence instruction. That may be a future optimization
possibility that is compatible with this api, but it would appear to
need compiler guarantees that it is not yet clear the kernel can rely
on. It is also not clear that it would be a significant performance
win vs lfence.
These patches will also be available via the 'nospec' git branch
here:
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec
[1]: https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
[2]: https://spectreattack.com/spectre.pdf
[3]: https://marc.info/?l=linux-kernel&m=151510446027625&w=2
---
Andi Kleen (1):
x86, barrier: stop speculation for failed access_ok
Dan Williams (13):
x86: implement nospec_barrier()
[media] uvcvideo: prevent bounds-check bypass via speculative execution
carl9170: prevent bounds-check bypass via speculative execution
p54: prevent bounds-check bypass via speculative execution
qla2xxx: prevent bounds-check bypass via speculative execution
cw1200: prevent bounds-check bypass via speculative execution
Thermal/int340x: prevent bounds-check bypass via speculative execution
ipv6: prevent bounds-check bypass via speculative execution
ipv4: prevent bounds-check bypass via speculative execution
vfs, fdtable: prevent bounds-check bypass via speculative execution
net: mpls: prevent bounds-check bypass via speculative execution
udf: prevent bounds-check bypass via speculative execution
userns: prevent bounds-check bypass via speculative execution
Mark Rutland (4):
asm-generic/barrier: add generic nospec helpers
Documentation: document nospec helpers
arm64: implement nospec_ptr()
arm: implement nospec_ptr()
Documentation/speculation.txt | 166 ++++++++++++++++++++
arch/arm/include/asm/barrier.h | 75 +++++++++
arch/arm64/include/asm/barrier.h | 55 +++++++
arch/x86/include/asm/barrier.h | 6 +
arch/x86/include/asm/uaccess.h | 17 ++
drivers/media/usb/uvc/uvc_v4l2.c | 7 +
drivers/net/wireless/ath/carl9170/main.c | 6 -
drivers/net/wireless/intersil/p54/main.c | 8 +
drivers/net/wireless/st/cw1200/sta.c | 10 +
drivers/net/wireless/st/cw1200/wsm.h | 4
drivers/scsi/qla2xxx/qla_mr.c | 15 +-
.../thermal/int340x_thermal/int340x_thermal_zone.c | 14 +-
fs/udf/misc.c | 39 +++--
include/asm-generic/barrier.h | 68 ++++++++
include/linux/fdtable.h | 5 -
kernel/user_namespace.c | 10 -
net/ipv4/raw.c | 9 +
net/ipv6/raw.c | 9 +
net/mpls/af_mpls.c | 12 +
19 files changed, 466 insertions(+), 69 deletions(-)
create mode 100644 Documentation/speculation.txt
From: Mark Rutland <[email protected]>
Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.
This patch adds helpers which can be used to inhibit the use of
out-of-bounds pointers under speculation.
A generic implementation is provided for compatibility, but does not
guarantee safety under speculation. Architectures are expected to
override these helpers as necessary.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/asm-generic/barrier.h | 68 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index fe297b599b0a..91c3071f49e5 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -54,6 +54,74 @@
#define read_barrier_depends() do { } while (0)
#endif
+/*
+ * Inhibit subsequent speculative memory accesses.
+ *
+ * Architectures with a suitable memory barrier should provide an
+ * implementation. This is non-portable, and generic code should use
+ * nospec_ptr().
+ */
+#ifndef __nospec_barrier
+#define __nospec_barrier() do { } while (0)
+#endif
+
+/**
+ * nospec_ptr() - Ensure a pointer is bounded, even under speculation.
+ *
+ * @ptr: the pointer to test
+ * @lo: the lower valid bound for @ptr, inclusive
+ * @hi: the upper valid bound for @ptr, exclusive
+ *
+ * If @ptr falls in the interval [@lo, @hi), returns @ptr, otherwise returns
+ * NULL.
+ *
+ * Architectures which do not provide __nospec_barrier() should override this
+ * to ensure that ptr falls in the [lo, hi) interval both under architectural
+ * execution and under speculation, preventing propagation of an out-of-bounds
+ * pointer to code which is speculatively executed.
+ */
+#ifndef nospec_ptr
+#define nospec_ptr(ptr, lo, hi) \
+({ \
+ typeof (ptr) __ret; \
+ typeof (ptr) __ptr = (ptr); \
+ typeof (ptr) __lo = (lo); \
+ typeof (ptr) __hi = (hi); \
+ \
+ __ret = (__lo <= __ptr && __ptr < __hi) ? __ptr : NULL; \
+ \
+ __nospec_barrier(); \
+ \
+ __ret; \
+})
+#endif
+
+/**
+ * nospec_array_ptr() - Generate a pointer to an array element, ensuring the
+ * pointer is bounded under speculation.
+ *
+ * @arr: the base of the array
+ * @idx: the index of the element
+ * @sz: the number of elements in the array
+ *
+ * If @idx falls in the interval [0, @sz), returns the pointer to @arr[@idx],
+ * otherwise returns NULL.
+ *
+ * This is a wrapper around nospec_ptr(), provided for convenience.
+ * Architectures should implement nospec_ptr() to ensure that the
+ * returned pointer is bounded even under speculation.
+ */
+#define nospec_array_ptr(arr, idx, sz) \
+({ \
+ typeof(*(arr)) *__arr = (arr); \
+ typeof(idx) __idx = (idx); \
+ typeof(sz) __sz = (sz); \
+ \
+ nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
+})
+
+#undef __nospec_barrier
+
#ifndef __smp_mb
#define __smp_mb() mb()
#endif
From: Mark Rutland <[email protected]>
Document the rationale and usage of the new nospec*() helpers.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Documentation/speculation.txt | 166 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 166 insertions(+)
create mode 100644 Documentation/speculation.txt
diff --git a/Documentation/speculation.txt b/Documentation/speculation.txt
new file mode 100644
index 000000000000..748fcd4dcda4
--- /dev/null
+++ b/Documentation/speculation.txt
@@ -0,0 +1,166 @@
+This document explains potential effects of speculation, and how undesirable
+effects can be mitigated portably using common APIs.
+
+===========
+Speculation
+===========
+
+To improve performance and minimize average latencies, many contemporary CPUs
+employ speculative execution techniques such as branch prediction, performing
+work which may be discarded at a later stage.
+
+Typically speculative execution cannot be observed from architectural state,
+such as the contents of registers. However, in some cases it is possible to
+observe its impact on microarchitectural state, such as the presence or
+absence of data in caches. Such state may form side-channels which can be
+observed to extract secret information.
+
+For example, in the presence of branch prediction, it is possible for bounds
+checks to be ignored by code which is speculatively executed. Consider the
+following code:
+
+ int load_array(int *array, unsigned int idx) {
+ if (idx >= MAX_ARRAY_ELEMS)
+ return 0;
+ else
+ return array[idx];
+ }
+
+Which, on arm64, may be compiled to an assembly sequence such as:
+
+ CMP <idx>, #MAX_ARRAY_ELEMS
+ B.LT less
+ MOV <returnval>, #0
+ RET
+ less:
+ LDR <returnval>, [<array>, <idx>]
+ RET
+
+It is possible that a CPU mis-predicts the conditional branch, and
+speculatively loads array[idx], even if idx >= MAX_ARRAY_ELEMS. This value
+will subsequently be discarded, but the speculated load may affect
+microarchitectural state which can be subsequently measured.
+
+More complex sequences involving multiple dependent memory accesses may result
+in sensitive information being leaked. Consider the following code, building on
+the prior example:
+
+ int load_dependent_arrays(int *arr1, int *arr2, int idx) {
+ int val1, val2;
+
+ val1 = load_array(arr1, idx);
+ val2 = load_array(arr2, val1);
+
+ return val2;
+ }
+
+Under speculation, the first call to load_array() may return the value of an
+out-of-bounds address, while the second call will influence microarchitectural
+state dependent on this value. This may provide an arbitrary read primitive.
+
+====================================
+Mitigating speculation side-channels
+====================================
+
+The kernel provides a generic API to ensure that bounds checks are respected
+even under speculation. Architectures which are affected by speculation-based
+side-channels are expected to implement these primitives.
+
+The following helpers found in <asm/barrier.h> can be used to prevent
+information from being leaked via side-channels.
+
+* nospec_ptr(ptr, lo, hi)
+
+ Returns a sanitized pointer that is bounded by the [lo, hi) interval. When
+ ptr < lo or ptr >= hi, NULL is returned. Prevents an out-of-bounds pointer
+ being propagated to code which is speculatively executed.
+
+ This is expected to be used by code which computes pointers to data
+ structures, where part of the address (such as an array index) may be
+ user-controlled.
+
+ This can be used to protect the earlier load_array() example:
+
+ int load_array(int *array, unsigned int idx)
+ {
+ int *elem;
+
+ if ((elem = nospec_ptr(array + idx, array, array + MAX_ARRAY_ELEMS)))
+ return *elem;
+ else
+ return 0;
+ }
+
+ This can also be used in situations where multiple fields on a structure are
+ accessed:
+
+ struct foo array[SIZE];
+ int a, b;
+
+ void do_thing(int idx)
+ {
+ struct foo *elem;
+
+ if ((elem = nospec_ptr(array + idx, array, array + SIZE))) {
+ a = elem->field_a;
+ b = elem->field_b;
+ }
+ }
+
+ It is imperative that the returned pointer is used. Pointers which are
+ generated separately are subject to a number of potential CPU and compiler
+ optimizations, and may still be used speculatively. For example, this means
+ that the following sequence is unsafe:
+
+ struct foo array[SIZE];
+ int a, b;
+
+ void do_thing(int idx)
+ {
+ if (nospec_ptr(array + idx, array, array + SIZE) != NULL) {
+ // unsafe as wrong pointer is used
+ a = array[idx].field_a;
+ b = array[idx].field_b;
+ }
+ }
+
+ Similarly, it is unsafe to compare the returned pointer with other pointers,
+ as this may permit the compiler to substitute one pointer with another,
+ permitting speculation. For example, the following sequence is unsafe:
+
+ struct foo array[SIZE];
+ int a, b;
+
+ void do_thing(int idx)
+ {
+ struct foo *elem = nospec_ptr(array + idx, array, array + SIZE);
+
+ // unsafe due to pointer substitution
+ if (elem == &array[idx]) {
+ a = elem->field_a;
+ b = elem->field_b;
+ }
+ }
+
+* nospec_array_ptr(arr, idx, sz)
+
+ Returns a sanitized pointer to arr[idx] only if idx falls in the [0, sz)
+ interval. When idx < 0 or idx >= sz, NULL is returned. Prevents an
+ out-of-bounds pointer being propagated to code which is speculatively
+ executed.
+
+ This is a convenience function which wraps nospec_ptr(), and has the same
+ caveats w.r.t. the use of the returned pointer.
+
+ For example, this may be used as follows:
+
+ int load_array(int *array, unsigned int idx)
+ {
+ int *elem;
+
+ if ((elem = nospec_array_ptr(array, idx, MAX_ARRAY_ELEMS)))
+ return *elem;
+ else
+ return 0;
+ }
+
From: Mark Rutland <[email protected]>
This patch implements nospec_ptr() for arm64, following the recommended
architectural sequence.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/arm64/include/asm/barrier.h | 55 ++++++++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 77651c49ef44..b4819f6a0e5c 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -40,6 +40,61 @@
#define dma_rmb() dmb(oshld)
#define dma_wmb() dmb(oshst)
+#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, w, sz) \
+({ \
+ typeof(*ptr) __nln_val; \
+ typeof(*ptr) __failval = \
+ (typeof(*ptr))(unsigned long)(failval); \
+ \
+ asm volatile ( \
+ " cmp %[c], %[l]\n" \
+ " ccmp %[c], %[h], 2, cs\n" \
+ " b.cs 1f\n" \
+ " ldr" #sz " %" #w "[v], %[p]\n" \
+ "1: csel %" #w "[v], %" #w "[v], %" #w "[f], cc\n" \
+ " hint #0x14 // CSDB\n" \
+ : [v] "=&r" (__nln_val) \
+ : [p] "m" (*(ptr)), [l] "r" (lo), [h] "r" (hi), \
+ [f] "rZ" (__failval), [c] "r" (cmpptr) \
+ : "cc"); \
+ \
+ __nln_val; \
+})
+
+#define __load_no_speculate(ptr, lo, hi, failval, cmpptr) \
+({ \
+ typeof(*(ptr)) __nl_val; \
+ \
+ switch (sizeof(__nl_val)) { \
+ case 1: \
+ __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
+ cmpptr, w, b); \
+ break; \
+ case 2: \
+ __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
+ cmpptr, w, h); \
+ break; \
+ case 4: \
+ __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
+ cmpptr, w, ); \
+ break; \
+ case 8: \
+ __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
+ cmpptr, x, ); \
+ break; \
+ default: \
+ BUILD_BUG(); \
+ } \
+ \
+ __nl_val; \
+})
+
+#define nospec_ptr(ptr, lo, hi) \
+({ \
+ typeof(ptr) __np_ptr = (ptr); \
+ __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
+})
+
#define __smp_mb() dmb(ish)
#define __smp_rmb() dmb(ishld)
#define __smp_wmb() dmb(ishst)
From: Mark Rutland <[email protected]>
This patch implements nospec_ptr() for arm, following the recommended
architectural sequences for the arm and thumb instruction sets.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/arm/include/asm/barrier.h | 75 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 40f5c410fd8c..6384c90e4b72 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -37,6 +37,81 @@
#define dmb(x) __asm__ __volatile__ ("" : : : "memory")
#endif
+#ifdef CONFIG_THUMB2_KERNEL
+#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, sz) \
+({ \
+ typeof(*ptr) __nln_val; \
+ typeof(*ptr) __failval = \
+ (typeof(*ptr))(unsigned long)(failval); \
+ \
+ asm volatile ( \
+ " cmp %[c], %[l]\n" \
+ " it hs\n" \
+ " cmphs %[h], %[c]\n" \
+ " blo 1f\n" \
+ " ld" #sz " %[v], %[p]\n" \
+ "1: it lo\n" \
+ " movlo %[v], %[f]\n" \
+ " .inst 0xf3af8014 @ CSDB\n" \
+ : [v] "=&r" (__nln_val) \
+ : [p] "m" (*(ptr)), [l] "r" (lo), [h] "r" (hi), \
+ [f] "r" (__failval), [c] "r" (cmpptr) \
+ : "cc"); \
+ \
+ __nln_val; \
+})
+#else
+#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, sz) \
+({ \
+ typeof(*ptr) __nln_val; \
+ typeof(*ptr) __failval = \
+ (typeof(*ptr))(unsigned long)(failval); \
+ \
+ asm volatile ( \
+ " cmp %[c], %[l]\n" \
+ " cmphs %[h], %[c]\n" \
+ " ldr" #sz "hi %[v], %[p]\n" \
+ " movls %[v], %[f]\n" \
+ " .inst 0xe320f014 @ CSDB\n" \
+ : [v] "=&r" (__nln_val) \
+ : [p] "m" (*(ptr)), [l] "r" (lo), [h] "r" (hi), \
+ [f] "r" (__failval), [c] "r" (cmpptr) \
+ : "cc"); \
+ \
+ __nln_val; \
+})
+#endif
+
+#define __load_no_speculate(ptr, lo, hi, failval, cmpptr) \
+({ \
+ typeof(*(ptr)) __nl_val; \
+ \
+ switch (sizeof(__nl_val)) { \
+ case 1: \
+ __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
+ cmpptr, b); \
+ break; \
+ case 2: \
+ __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
+ cmpptr, h); \
+ break; \
+ case 4: \
+ __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
+ cmpptr, ); \
+ break; \
+ default: \
+ BUILD_BUG(); \
+ } \
+ \
+ __nl_val; \
+})
+
+#define nospec_ptr(ptr, lo, hi) \
+({ \
+ typeof(ptr) __np_ptr = (ptr); \
+ __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
+})
+
#ifdef CONFIG_ARM_HEAVY_MB
extern void (*soc_mb)(void);
extern void arm_heavy_mb(void);
The new speculative execution barrier, nospec_barrier(), ensures
that any userspace-controllable speculation doesn't cross the boundary.
Any user-observable speculative activity on this CPU thread before this
point either completes, reaches a state where it can no longer cause
observable activity, or is aborted before instructions after the
barrier execute.
In the x86 case nospec_barrier() resolves to an lfence if
X86_FEATURE_LFENCE_RDTSC is present. Other architectures can define
their variants.
Note the expectation is that this barrier is never used directly, at
least outside of architecture specific code. It is implied by the
nospec_{array_ptr,ptr} macros.
x86, for now, depends on the barrier for protection while other
architectures place their speculation prevention in
nospec_{ptr,array_ptr} when a barrier instruction is not available or
too heavy-weight. In the x86 case lfence is not a fully serializing
instruction so it is not as expensive as other barriers.
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Arjan van de Ven <[email protected]>
Suggested-by: Alan Cox <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alan Cox <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/barrier.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 7fb336210e1b..1148cd9f5ae7 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,6 +24,12 @@
#define wmb() asm volatile("sfence" ::: "memory")
#endif
+/*
+ * CPUs without LFENCE don't really speculate much. Possibly fall back to IRET-to-self.
+ */
+#define __nospec_barrier() alternative("", "lfence", X86_FEATURE_LFENCE_RDTSC)
+#define nospec_barrier __nospec_barrier
+
#ifdef CONFIG_X86_PPRO_FENCE
#define dma_rmb() rmb()
#else
From: Andi Kleen <[email protected]>
When access_ok fails we should always stop speculating.
Add the required barriers to the x86 access_ok macro.
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/uaccess.h | 17 +++++++++++++----
include/asm-generic/barrier.h | 6 +++---
2 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 574dff4d2913..9b6f20cfaeb9 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -43,6 +43,8 @@ static inline void set_fs(mm_segment_t fs)
/*
* Test whether a block of memory is a valid user space address.
* Returns 0 if the range is valid, nonzero otherwise.
+ *
+ * We also disable speculation when a check fails.
*/
static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, unsigned long limit)
{
@@ -53,14 +55,19 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
* important to subtract the size from the
* limit, not add it to the address).
*/
- if (__builtin_constant_p(size))
- return unlikely(addr > limit - size);
+ if (__builtin_constant_p(size)) {
+ if (unlikely(addr > limit - size))
+ return true;
+ nospec_barrier();
+ return false;
+ }
/* Arbitrary sizes? Be careful about overflow */
addr += size;
- if (unlikely(addr < size))
+ if (unlikely(addr < size || addr > limit))
return true;
- return unlikely(addr > limit);
+ nospec_barrier();
+ return false;
}
#define __range_not_ok(addr, size, limit) \
@@ -94,6 +101,8 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
* Note that, depending on architecture, this function probably just
* checks that the pointer is in the user space range - after calling
* this function, memory access functions may still return -EFAULT.
+ *
+ * Stops speculation automatically
*/
#define access_ok(type, addr, size) \
({ \
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 91c3071f49e5..a11765eba860 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -59,7 +59,9 @@
*
* Architectures with a suitable memory barrier should provide an
* implementation. This is non-portable, and generic code should use
- * nospec_ptr().
+ * nospec_{array_ptr,ptr}. Arch-specific code should define and use
+ * nospec_barrier() for usages where nospec_{array_ptr,ptr} is
+ * unsuitable.
*/
#ifndef __nospec_barrier
#define __nospec_barrier() do { } while (0)
@@ -120,8 +122,6 @@
nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
})
-#undef __nospec_barrier
-
#ifndef __smp_mb
#define __smp_mb() mb()
#endif
Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency to read 'pin' from the
'selector->baSourceID' array. In order to avoid potential leaks of
kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'pin'.
Based on an original patch by Elena Reshetova.
Cc: Laurent Pinchart <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: [email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/media/usb/uvc/uvc_v4l2.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
index 3e7e283a44a8..7442626dc20e 100644
--- a/drivers/media/usb/uvc/uvc_v4l2.c
+++ b/drivers/media/usb/uvc/uvc_v4l2.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/wait.h>
#include <linux/atomic.h>
+#include <linux/compiler.h>
#include <media/v4l2-common.h>
#include <media/v4l2-ctrls.h>
@@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
struct uvc_entity *iterm = NULL;
u32 index = input->index;
int pin = 0;
+ __u8 *elem;
if (selector == NULL ||
(chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
@@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
break;
}
pin = iterm->id;
- } else if (index < selector->bNrInPins) {
- pin = selector->baSourceID[index];
+ } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
+ selector->bNrInPins))) {
+ pin = *elem;
list_for_each_entry(iterm, &chain->entities, chain) {
if (!UVC_ENTITY_IS_ITERM(iterm))
continue;
Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read from the 'ar9170_qmap' array. In
order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid result of 'ar9170_qmap[queue]'. In this case the
value of 'ar9170_qmap[queue]' is immediately reused as an index to the
'ar->edcf' array.
Based on an original patch by Elena Reshetova.
Cc: Christian Lamparter <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/net/wireless/ath/carl9170/main.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/ath/carl9170/main.c b/drivers/net/wireless/ath/carl9170/main.c
index 988c8857d78c..0ff34cbe2b62 100644
--- a/drivers/net/wireless/ath/carl9170/main.c
+++ b/drivers/net/wireless/ath/carl9170/main.c
@@ -41,6 +41,7 @@
#include <linux/module.h>
#include <linux/etherdevice.h>
#include <linux/random.h>
+#include <linux/compiler.h>
#include <net/mac80211.h>
#include <net/cfg80211.h>
#include "hw.h"
@@ -1384,11 +1385,12 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
const struct ieee80211_tx_queue_params *param)
{
struct ar9170 *ar = hw->priv;
+ const u8 *elem;
int ret;
mutex_lock(&ar->mutex);
- if (queue < ar->hw->queues) {
- memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
+ if ((elem = nospec_array_ptr(ar9170_qmap, queue, ar->hw->queues))) {
+ memcpy(&ar->edcf[*elem], param, sizeof(*param));
ret = carl9170_set_qos(ar);
} else {
ret = -EINVAL;
Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read from the 'priv->qos_params' array.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid result of 'priv->qos_params[queue]'.
Based on an original patch by Elena Reshetova.
Cc: Christian Lamparter <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/net/wireless/intersil/p54/main.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/net/wireless/intersil/p54/main.c b/drivers/net/wireless/intersil/p54/main.c
index ab6d39e12069..85c9cbee35fc 100644
--- a/drivers/net/wireless/intersil/p54/main.c
+++ b/drivers/net/wireless/intersil/p54/main.c
@@ -20,6 +20,7 @@
#include <linux/firmware.h>
#include <linux/etherdevice.h>
#include <linux/module.h>
+#include <linux/compiler.h>
#include <net/mac80211.h>
@@ -411,12 +412,13 @@ static int p54_conf_tx(struct ieee80211_hw *dev,
const struct ieee80211_tx_queue_params *params)
{
struct p54_common *priv = dev->priv;
+ struct p54_edcf_queue_param *p54_q;
int ret;
mutex_lock(&priv->conf_mutex);
- if (queue < dev->queues) {
- P54_SET_QUEUE(priv->qos_params[queue], params->aifs,
- params->cw_min, params->cw_max, params->txop);
+ if ((p54_q = nospec_array_ptr(priv->qos_params, queue, dev->queues))) {
+ P54_SET_QUEUE(p54_q[0], params->aifs, params->cw_min,
+ params->cw_max, params->txop);
ret = p54_set_edcf(priv);
} else
ret = -EINVAL;
Static analysis reports that 'handle' may be a user controlled value
that is used as a data dependency to read 'sp' from the
'req->outstanding_cmds' array. In order to avoid potential leaks of
kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'sp'. In this
case 'sp' is directly dereferenced later in the function.
Based on an original patch by Elena Reshetova.
Cc: [email protected]
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: [email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/scsi/qla2xxx/qla_mr.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/drivers/scsi/qla2xxx/qla_mr.c b/drivers/scsi/qla2xxx/qla_mr.c
index d5da3981cefe..128b41de3784 100644
--- a/drivers/scsi/qla2xxx/qla_mr.c
+++ b/drivers/scsi/qla2xxx/qla_mr.c
@@ -9,6 +9,7 @@
#include <linux/ktime.h>
#include <linux/pci.h>
#include <linux/ratelimit.h>
+#include <linux/compiler.h>
#include <linux/vmalloc.h>
#include <linux/bsg-lib.h>
#include <scsi/scsi_tcq.h>
@@ -2275,7 +2276,7 @@ qlafx00_ioctl_iosb_entry(scsi_qla_host_t *vha, struct req_que *req,
static void
qlafx00_status_entry(scsi_qla_host_t *vha, struct rsp_que *rsp, void *pkt)
{
- srb_t *sp;
+ srb_t *sp, **elem;
fc_port_t *fcport;
struct scsi_cmnd *cp;
struct sts_entry_fx00 *sts;
@@ -2304,8 +2305,9 @@ qlafx00_status_entry(scsi_qla_host_t *vha, struct rsp_que *rsp, void *pkt)
req = ha->req_q_map[que];
/* Validate handle. */
- if (handle < req->num_outstanding_cmds)
- sp = req->outstanding_cmds[handle];
+ if ((elem = nospec_array_ptr(req->outstanding_cmds, handle,
+ req->num_outstanding_cmds)))
+ sp = *elem;
else
sp = NULL;
@@ -2626,7 +2628,7 @@ static void
qlafx00_multistatus_entry(struct scsi_qla_host *vha,
struct rsp_que *rsp, void *pkt)
{
- srb_t *sp;
+ srb_t *sp, **elem;
struct multi_sts_entry_fx00 *stsmfx;
struct qla_hw_data *ha = vha->hw;
uint32_t handle, hindex, handle_count, i;
@@ -2655,8 +2657,9 @@ qlafx00_multistatus_entry(struct scsi_qla_host *vha,
req = ha->req_q_map[que];
/* Validate handle. */
- if (handle < req->num_outstanding_cmds)
- sp = req->outstanding_cmds[handle];
+ if ((elem = nospec_array_ptr(req->outstanding_cmds, handle,
+ req->num_outstanding_cmds)))
+ sp = *elem;
else
sp = NULL;
Static analysis reports that 'queue' may be a user controlled value that
is used as a data dependency to read 'txq_params' from the
'priv->tx_queue_params.params' array. In order to avoid potential leaks
of kernel memory values, block speculative execution of the instruction
stream that could issue reads based on an invalid value of 'txq_params'.
In this case 'txq_params' is referenced later in the function.
Based on an original patch by Elena Reshetova.
Cc: Solomon Peachy <[email protected]>
Cc: Kalle Valo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/net/wireless/st/cw1200/sta.c | 10 ++++++----
drivers/net/wireless/st/cw1200/wsm.h | 4 +---
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/net/wireless/st/cw1200/sta.c b/drivers/net/wireless/st/cw1200/sta.c
index 38678e9a0562..886942617f14 100644
--- a/drivers/net/wireless/st/cw1200/sta.c
+++ b/drivers/net/wireless/st/cw1200/sta.c
@@ -14,6 +14,7 @@
#include <linux/firmware.h>
#include <linux/module.h>
#include <linux/etherdevice.h>
+#include <linux/compiler.h>
#include "cw1200.h"
#include "sta.h"
@@ -612,18 +613,19 @@ int cw1200_conf_tx(struct ieee80211_hw *dev, struct ieee80211_vif *vif,
u16 queue, const struct ieee80211_tx_queue_params *params)
{
struct cw1200_common *priv = dev->priv;
+ struct wsm_set_tx_queue_params *txq_params;
int ret = 0;
/* To prevent re-applying PM request OID again and again*/
bool old_uapsd_flags;
mutex_lock(&priv->conf_mutex);
- if (queue < dev->queues) {
+ if ((txq_params = nospec_array_ptr(priv->tx_queue_params.params,
+ queue, dev->queues))) {
old_uapsd_flags = le16_to_cpu(priv->uapsd_info.uapsd_flags);
- WSM_TX_QUEUE_SET(&priv->tx_queue_params, queue, 0, 0, 0);
- ret = wsm_set_tx_queue_params(priv,
- &priv->tx_queue_params.params[queue], queue);
+ WSM_TX_QUEUE_SET(txq_params, 0, 0, 0);
+ ret = wsm_set_tx_queue_params(priv, txq_params, queue);
if (ret) {
ret = -EINVAL;
goto out;
diff --git a/drivers/net/wireless/st/cw1200/wsm.h b/drivers/net/wireless/st/cw1200/wsm.h
index 48086e849515..8c8d9191e233 100644
--- a/drivers/net/wireless/st/cw1200/wsm.h
+++ b/drivers/net/wireless/st/cw1200/wsm.h
@@ -1099,10 +1099,8 @@ struct wsm_tx_queue_params {
};
-#define WSM_TX_QUEUE_SET(queue_params, queue, ack_policy, allowed_time,\
- max_life_time) \
+#define WSM_TX_QUEUE_SET(p, ack_policy, allowed_time, max_life_time) \
do { \
- struct wsm_set_tx_queue_params *p = &(queue_params)->params[queue]; \
p->ackPolicy = (ack_policy); \
p->allowedMediumTime = (allowed_time); \
p->maxTransmitLifetime = (max_life_time); \
Static analysis reports that 'trip' may be a user controlled value that
is used as a data dependency to read '*temp' from the 'd->aux_trips'
array. In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid value of '*temp'.
Based on an original patch by Elena Reshetova.
Cc: Srinivas Pandruvada <[email protected]>
Cc: Zhang Rui <[email protected]>
Cc: Eduardo Valentin <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
.../thermal/int340x_thermal/int340x_thermal_zone.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
index 145a5c53ff5c..442a1d9bf7ad 100644
--- a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
+++ b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
@@ -17,6 +17,7 @@
#include <linux/init.h>
#include <linux/acpi.h>
#include <linux/thermal.h>
+#include <linux/compiler.h>
#include "int340x_thermal_zone.h"
static int int340x_thermal_get_zone_temp(struct thermal_zone_device *zone,
@@ -52,20 +53,21 @@ static int int340x_thermal_get_trip_temp(struct thermal_zone_device *zone,
int trip, int *temp)
{
struct int34x_thermal_zone *d = zone->devdata;
+ unsigned long *elem;
int i;
if (d->override_ops && d->override_ops->get_trip_temp)
return d->override_ops->get_trip_temp(zone, trip, temp);
- if (trip < d->aux_trip_nr)
- *temp = d->aux_trips[trip];
- else if (trip == d->crt_trip_id)
+ if ((elem = nospec_array_ptr(d->aux_trips, trip, d->aux_trip_nr))) {
+ *temp = *elem;
+ } else if (trip == d->crt_trip_id) {
*temp = d->crt_temp;
- else if (trip == d->psv_trip_id)
+ } else if (trip == d->psv_trip_id) {
*temp = d->psv_temp;
- else if (trip == d->hot_trip_id)
+ } else if (trip == d->hot_trip_id) {
*temp = d->hot_temp;
- else {
+ } else {
for (i = 0; i < INT340X_THERMAL_MAX_ACT_TRIP_COUNT; i++) {
if (d->act_trips[i].valid &&
d->act_trips[i].id == trip) {
Static analysis reports that 'offset' may be a user controlled value
that is used as a data dependency reading from a raw6_frag_vec buffer.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid '*(rfv->c + offset)' value.
Based on an original patch by Elena Reshetova.
Cc: "David S. Miller" <[email protected]>
Cc: Alexey Kuznetsov <[email protected]>
Cc: Hideaki YOSHIFUJI <[email protected]>
Cc: [email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
net/ipv6/raw.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 761a473a07c5..384e3d59d148 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -33,6 +33,7 @@
#include <linux/skbuff.h>
#include <linux/compat.h>
#include <linux/uaccess.h>
+#include <linux/compiler.h>
#include <asm/ioctls.h>
#include <net/net_namespace.h>
@@ -725,17 +726,17 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,
struct sk_buff *skb)
{
struct raw6_frag_vec *rfv = from;
+ char *rfv_buf;
- if (offset < rfv->hlen) {
+ if ((rfv_buf = nospec_array_ptr(rfv->c, offset, rfv->hlen))) {
int copy = min(rfv->hlen - offset, len);
if (skb->ip_summed == CHECKSUM_PARTIAL)
- memcpy(to, rfv->c + offset, copy);
+ memcpy(to, rfv_buf, copy);
else
skb->csum = csum_block_add(
skb->csum,
- csum_partial_copy_nocheck(rfv->c + offset,
- to, copy, 0),
+ csum_partial_copy_nocheck(rfv_buf, to, copy, 0),
odd);
odd = 0;
Static analysis reports that 'offset' may be a user controlled value
that is used as a data dependency reading from a raw_frag_vec buffer.
In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid '*(rfv->c + offset)' value.
Based on an original patch by Elena Reshetova.
Cc: "David S. Miller" <[email protected]>
Cc: Alexey Kuznetsov <[email protected]>
Cc: Hideaki YOSHIFUJI <[email protected]>
Cc: [email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
net/ipv4/raw.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 125c1eab3eaa..f72b20131a15 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -57,6 +57,7 @@
#include <linux/in_route.h>
#include <linux/route.h>
#include <linux/skbuff.h>
+#include <linux/compiler.h>
#include <linux/igmp.h>
#include <net/net_namespace.h>
#include <net/dst.h>
@@ -472,17 +473,17 @@ static int raw_getfrag(void *from, char *to, int offset, int len, int odd,
struct sk_buff *skb)
{
struct raw_frag_vec *rfv = from;
+ char *rfv_buf;
- if (offset < rfv->hlen) {
+ if ((rfv_buf = nospec_array_ptr(rfv->hdr.c, offset, rfv->hlen))) {
int copy = min(rfv->hlen - offset, len);
if (skb->ip_summed == CHECKSUM_PARTIAL)
- memcpy(to, rfv->hdr.c + offset, copy);
+ memcpy(to, rfv_buf, copy);
else
skb->csum = csum_block_add(
skb->csum,
- csum_partial_copy_nocheck(rfv->hdr.c + offset,
- to, copy, 0),
+ csum_partial_copy_nocheck(rfv_buf, to, copy, 0),
odd);
odd = 0;
Expectedly, static analysis reports that 'fd' is a user controlled value
that is used as a data dependency to read from the 'fdt->fd' array. In
order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue reads
based on an invalid 'file *' returned from __fcheck_files.
Based on an original patch by Elena Reshetova.
Cc: Al Viro <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/fdtable.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 1c65817673db..4a147c5c2533 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -81,9 +81,10 @@ struct dentry;
static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
{
struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+ struct file __rcu **fdp;
- if (fd < fdt->max_fds)
- return rcu_dereference_raw(fdt->fd[fd]);
+ if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
+ return rcu_dereference_raw(*fdp);
return NULL;
}
Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency reading 'rt' from the 'platform_label'
array. In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid 'rt' value.
Based on an original patch by Elena Reshetova.
Cc: "David S. Miller" <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: [email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
net/mpls/af_mpls.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8ca9915befc8..ebcf0e246cfe 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
#include <linux/ipv6.h>
#include <linux/mpls.h>
#include <linux/netconf.h>
+#include <linux/compiler.h>
#include <linux/vmalloc.h>
#include <linux/percpu.h>
#include <net/ip.h>
@@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
{
struct mpls_route *rt = NULL;
+ struct mpls_route __rcu **platform_label =
+ rcu_dereference(net->mpls.platform_label);
+ struct mpls_route __rcu **rtp;
- if (index < net->mpls.platform_labels) {
- struct mpls_route __rcu **platform_label =
- rcu_dereference(net->mpls.platform_label);
- rt = rcu_dereference(platform_label[index]);
- }
+ if ((rtp = nospec_array_ptr(platform_label, index,
+ net->mpls.platform_labels)))
+ rt = rcu_dereference(*rtp);
return rt;
}
Static analysis reports that 'eahd->appAttrLocation' and
'eahd->impAttrLocation' may be user controlled values that are used as
data dependencies for calculating source and destination buffers for
memmove operations. In order to avoid potential leaks of kernel memory
values, block speculative execution of the instruction stream that could
issue further reads based on invalid 'aal' or 'ial' values.
Based on an original patch by Elena Reshetova.
Cc: Jan Kara <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/udf/misc.c | 39 +++++++++++++++++++++------------------
1 file changed, 21 insertions(+), 18 deletions(-)
diff --git a/fs/udf/misc.c b/fs/udf/misc.c
index 401e64cde1be..9403160822de 100644
--- a/fs/udf/misc.c
+++ b/fs/udf/misc.c
@@ -51,6 +51,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,
int offset;
uint16_t crclen;
struct udf_inode_info *iinfo = UDF_I(inode);
+ uint8_t *ea_dst, *ea_src;
+ uint32_t aal, ial;
ea = iinfo->i_ext.i_data;
if (iinfo->i_lenEAttr) {
@@ -100,33 +102,34 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,
offset = iinfo->i_lenEAttr;
if (type < 2048) {
- if (le32_to_cpu(eahd->appAttrLocation) <
- iinfo->i_lenEAttr) {
- uint32_t aal =
- le32_to_cpu(eahd->appAttrLocation);
- memmove(&ea[offset - aal + size],
- &ea[aal], offset - aal);
+ aal = le32_to_cpu(eahd->appAttrLocation);
+ if ((ea_dst = nospec_array_ptr(ea, offset - aal + size,
+ iinfo->i_lenEAttr)) &&
+ (ea_src = nospec_array_ptr(ea, aal,
+ iinfo->i_lenEAttr))) {
+ memmove(ea_dst, ea_src, offset - aal);
offset -= aal;
eahd->appAttrLocation =
cpu_to_le32(aal + size);
}
- if (le32_to_cpu(eahd->impAttrLocation) <
- iinfo->i_lenEAttr) {
- uint32_t ial =
- le32_to_cpu(eahd->impAttrLocation);
- memmove(&ea[offset - ial + size],
- &ea[ial], offset - ial);
+
+ ial = le32_to_cpu(eahd->impAttrLocation);
+ if ((ea_dst = nospec_array_ptr(ea, offset - ial + size,
+ iinfo->i_lenEAttr)) &&
+ (ea_src = nospec_array_ptr(ea, ial,
+ iinfo->i_lenEAttr))) {
+ memmove(ea_dst, ea_src, offset - ial);
offset -= ial;
eahd->impAttrLocation =
cpu_to_le32(ial + size);
}
} else if (type < 65536) {
- if (le32_to_cpu(eahd->appAttrLocation) <
- iinfo->i_lenEAttr) {
- uint32_t aal =
- le32_to_cpu(eahd->appAttrLocation);
- memmove(&ea[offset - aal + size],
- &ea[aal], offset - aal);
+ aal = le32_to_cpu(eahd->appAttrLocation);
+ if ((ea_dst = nospec_array_ptr(ea, offset - aal + size,
+ iinfo->i_lenEAttr)) &&
+ (ea_src = nospec_array_ptr(ea, aal,
+ iinfo->i_lenEAttr))) {
+ memmove(ea_dst, ea_src, offset - aal);
offset -= aal;
eahd->appAttrLocation =
cpu_to_le32(aal + size);
Static analysis reports that 'pos' may be a user controlled value that
is used as a data dependency determining which extent to return out of
'map'. In order to avoid potential leaks of kernel memory values, block
speculative execution of the instruction stream that could issue further
reads based on an invalid speculative result from 'm_start()'.
Based on an original patch by Elena Reshetova.
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
kernel/user_namespace.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..e958f2e5c061 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -648,15 +648,13 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
{
loff_t pos = *ppos;
unsigned extents = map->nr_extents;
- smp_rmb();
- if (pos >= extents)
- return NULL;
+ /* paired with smp_wmb in map_write */
+ smp_rmb();
if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
- return &map->extent[pos];
-
- return &map->forward[pos];
+ return nospec_array_ptr(map->extent, pos, extents);
+ return nospec_array_ptr(map->forward, pos, extents);
}
static void *uid_m_start(struct seq_file *seq, loff_t *ppos)
On Fri, 2018-01-05 at 17:10 -0800, Dan Williams wrote:
> Static analysis reports that 'trip' may be a user controlled value
> that
> is used as a data dependency to read '*temp' from the 'd->aux_trips'
> array. In order to avoid potential leaks of kernel memory values,
> block
> speculative execution of the instruction stream that could issue
> reads
> based on an invalid value of '*temp'.
I'm not against the change, as this is in a very slow path. But the trip
is not an arbitrary value which a user can enter.
This trip value comes from a sysfs attribute in the thermal zone. For
example:
# cd /sys/class/thermal/thermal_zone1
# ls trip_point_?_temp
trip_point_0_temp trip_point_1_temp trip_point_2_temp trip_point_3_temp
trip_point_4_temp trip_point_5_temp trip_point_6_temp
Here the "trip" is one of the above trip_point_*_temp. So in this case
it can be from 0 to 6, as the user can't do
# cat trip_point_7_temp
as there is no sysfs attribute for trip_point_7_temp.
The actual "trip" was obtained in thermal core via
if (sscanf(attr->attr.name, "trip_point_%d_temp", &trip) != 1)
return -EINVAL;
Thanks,
Srinivas
>
> Based on an original patch by Elena Reshetova.
>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Zhang Rui <[email protected]>
> Cc: Eduardo Valentin <[email protected]>
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> .../thermal/int340x_thermal/int340x_thermal_zone.c | 14 ++++++++
> ------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> index 145a5c53ff5c..442a1d9bf7ad 100644
> --- a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> +++ b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> @@ -17,6 +17,7 @@
> #include <linux/init.h>
> #include <linux/acpi.h>
> #include <linux/thermal.h>
> +#include <linux/compiler.h>
> #include "int340x_thermal_zone.h"
>
> static int int340x_thermal_get_zone_temp(struct thermal_zone_device
> *zone,
> @@ -52,20 +53,21 @@ static int int340x_thermal_get_trip_temp(struct
> thermal_zone_device *zone,
> int trip, int *temp)
> {
> struct int34x_thermal_zone *d = zone->devdata;
> + unsigned long *elem;
> int i;
>
> if (d->override_ops && d->override_ops->get_trip_temp)
> return d->override_ops->get_trip_temp(zone, trip,
> temp);
>
> - if (trip < d->aux_trip_nr)
> - *temp = d->aux_trips[trip];
> - else if (trip == d->crt_trip_id)
> + if ((elem = nospec_array_ptr(d->aux_trips, trip, d-
> >aux_trip_nr))) {
> + *temp = *elem;
> + } else if (trip == d->crt_trip_id) {
> *temp = d->crt_temp;
> - else if (trip == d->psv_trip_id)
> + } else if (trip == d->psv_trip_id) {
> *temp = d->psv_temp;
> - else if (trip == d->hot_trip_id)
> + } else if (trip == d->hot_trip_id) {
> *temp = d->hot_temp;
> - else {
> + } else {
> for (i = 0; i < INT340X_THERMAL_MAX_ACT_TRIP_COUNT;
> i++) {
> if (d->act_trips[i].valid &&
> d->act_trips[i].id == trip) {
>
On Fri, Jan 5, 2018 at 5:53 PM, Srinivas Pandruvada
<[email protected]> wrote:
> On Fri, 2018-01-05 at 17:10 -0800, Dan Williams wrote:
>> Static analysis reports that 'trip' may be a user controlled value
>> that
>> is used as a data dependency to read '*temp' from the 'd->aux_trips'
>> array. In order to avoid potential leaks of kernel memory values,
>> block
>> speculative execution of the instruction stream that could issue
>> reads
>> based on an invalid value of '*temp'.
>
> Not against the change as this is in a very slow path. But the trip is
> not an arbitrary value which user can enter.
>
> This trip value is the one of the sysfs attribute in thermal zone. For
> example
>
> # cd /sys/class/thermal/thermal_zone1
> # ls trip_point_?_temp
> trip_point_0_temp trip_point_1_temp trip_point_2_temp trip_point_3_temp
> trip_point_4_temp trip_point_5_temp trip_point_6_temp
>
> Here the "trip" is one of the above trip_point_*_temp. So in this case
> it can be from 0 to 6 as user can't do
> # cat trip_point_7_temp
> as there is no sysfs attribute for trip_point_7_temp.
>
> The actual "trip" was obtained in thermal core via
>
> if (sscanf(attr->attr.name, "trip_point_%d_temp", &trip) != 1)
> return -EINVAL;
>
> Thanks,
> Srinivas
Ah, great, thanks. So do we even need the bounds check at that point?
Dan Williams <[email protected]> writes:
> Quoting Mark's original RFC:
>
> "Recently, Google Project Zero discovered several classes of attack
> against speculative execution. One of these, known as variant-1, allows
> explicit bounds checks to be bypassed under speculation, providing an
> arbitrary read gadget. Further details can be found on the GPZ blog [1]
> and the Documentation patch in this series."
>
> This series incorporates Mark Rutland's latest api and adds the x86
> specific implementation of nospec_barrier. The
> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
> wide analysis performed by Elena Reshetova to address static analysis
> reports where speculative execution on a userspace controlled value
> could bypass a bounds check. The patches address a precondition for the
> attack discussed in the Spectre paper [2].
Please expand this.
It is not clear what the static analysis is looking for. Having a clear
description of what is being fixed is crucial for allowing any of these
changes.
For the details given in the change description, what I read is magic
changes because a magic process says this code is vulnerable.
Given the similarities in the code that is being patched to many other
places in the kernel it is not at all clear that this small set of
changes is sufficient for any purpose.
> A consideration worth noting for reviewing these patches is to weigh the
> dramatic cost of being wrong about whether a given report is exploitable
> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
> lets make the bar for applying these patches be "can you prove that the
> bounds check bypass is *not* exploitable". Consider that the Spectre
> paper reports one example of a speculation window being ~180 cycles.
> Note that there is also a proposal from Linus, array_access [3], that
> attempts to quash speculative execution past a bounds check without
> introducing an lfence instruction. That may be a future optimization
> possibility that is compatible with this api, but it would appear to
> need guarantees from the compiler that it is not clear the kernel can
> rely on at this point. It is also not clear that it would be a
> significant performance win vs lfence.
It is also not clear that these changes fix anything, or are in any
sense correct for the problem they are trying to fix as the problem
is not clearly described.
In at least one place (mpls) you are patching a fast path. Compile out
or don't load mpls by all means. But it is not acceptable to change the
fast path without even considering performance.
So, because the description sucks and the due diligence is not there:
Nacked-by: "Eric W. Biederman" <[email protected]>
to the series.
>
> These patches will also be available via the 'nospec' git branch
> here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec
>
> [1]: https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
> [2]: https://spectreattack.com/spectre.pdf
> [3]: https://marc.info/?l=linux-kernel&m=151510446027625&w=2
>
> ---
>
> Andi Kleen (1):
> x86, barrier: stop speculation for failed access_ok
>
> Dan Williams (13):
> x86: implement nospec_barrier()
> [media] uvcvideo: prevent bounds-check bypass via speculative execution
> carl9170: prevent bounds-check bypass via speculative execution
> p54: prevent bounds-check bypass via speculative execution
> qla2xxx: prevent bounds-check bypass via speculative execution
> cw1200: prevent bounds-check bypass via speculative execution
> Thermal/int340x: prevent bounds-check bypass via speculative execution
> ipv6: prevent bounds-check bypass via speculative execution
> ipv4: prevent bounds-check bypass via speculative execution
> vfs, fdtable: prevent bounds-check bypass via speculative execution
> net: mpls: prevent bounds-check bypass via speculative execution
> udf: prevent bounds-check bypass via speculative execution
> userns: prevent bounds-check bypass via speculative execution
>
> Mark Rutland (4):
> asm-generic/barrier: add generic nospec helpers
> Documentation: document nospec helpers
> arm64: implement nospec_ptr()
> arm: implement nospec_ptr()
>
> Documentation/speculation.txt | 166 ++++++++++++++++++++
> arch/arm/include/asm/barrier.h | 75 +++++++++
> arch/arm64/include/asm/barrier.h | 55 +++++++
> arch/x86/include/asm/barrier.h | 6 +
> arch/x86/include/asm/uaccess.h | 17 ++
> drivers/media/usb/uvc/uvc_v4l2.c | 7 +
> drivers/net/wireless/ath/carl9170/main.c | 6 -
> drivers/net/wireless/intersil/p54/main.c | 8 +
> drivers/net/wireless/st/cw1200/sta.c | 10 +
> drivers/net/wireless/st/cw1200/wsm.h | 4
> drivers/scsi/qla2xxx/qla_mr.c | 15 +-
> .../thermal/int340x_thermal/int340x_thermal_zone.c | 14 +-
> fs/udf/misc.c | 39 +++--
> include/asm-generic/barrier.h | 68 ++++++++
> include/linux/fdtable.h | 5 -
> kernel/user_namespace.c | 10 -
> net/ipv4/raw.c | 9 +
> net/ipv6/raw.c | 9 +
> net/mpls/af_mpls.c | 12 +
> 19 files changed, 466 insertions(+), 69 deletions(-)
> create mode 100644 Documentation/speculation.txt
On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> When access_ok fails we should always stop speculating.
> Add the required barriers to the x86 access_ok macro.
Honestly, this seems completely bogus.
The description is pure garbage afaik.
The fact is, we have to stop speculating when access_ok() does *not*
fail - because that's when we'll actually do the access. And it's that
access that needs to be non-speculative.
That actually seems to be what the code does (it stops speculation
when __range_not_ok() returns false, but access_ok() is
!__range_not_ok()). But the explanation is crap, and dangerous.
Linus
On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams <[email protected]> wrote:
> +#ifndef nospec_ptr
> +#define nospec_ptr(ptr, lo, hi) \
Do we actually want this horrible interface?
It just causes the compiler - or inline asm - to generate worse code,
because it needs to compare against both high and low limits.
Basically all users are arrays that are zero-based, and where a
comparison against the high _index_ limit would be sufficient.
But the way this is all designed, it's literally designed for bad code
generation for the unusual case, and the usual array case is written
in the form of the unusual and wrong non-array case. That really seems
excessively stupid.
Linus
On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds
<[email protected]> wrote:
>
> The fact is, we have to stop speculating when access_ok() does *not*
> fail - because that's when we'll actually do the access. And it's that
> access that needs to be non-speculative.
I also suspect we should probably do this entirely differently.
Maybe the whole lfence can be part of uaccess_begin() instead (ie
currently 'stac()'). That would fit the existing structure better, I
think. And it would avoid any confusion about the whole "when to stop
speculation".
Linus
On Fri, Jan 5, 2018 at 6:55 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams <[email protected]> wrote:
>> +#ifndef nospec_ptr
>> +#define nospec_ptr(ptr, lo, hi) \
>
> Do we actually want this horrible interface?
>
> It just causes the compiler - or inline asm - to generate worse code,
> because it needs to compare against both high and low limits.
>
> Basically all users are arrays that are zero-based, and where a
> comparison against the high _index_ limit would be sufficient.
>
> But the way this is all designed, it's literally designed for bad code
> generation for the unusual case, and the usual array case is written
> in the form of the unusual and wrong non-array case. That really seems
> excessively stupid.
Yes, it appears we can kill nospec_ptr() and move nospec_array_ptr()
to assume 0-based arrays rather than using nospec_ptr().
On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
>> From: Andi Kleen <[email protected]>
>>
>> When access_ok fails we should always stop speculating.
>> Add the required barriers to the x86 access_ok macro.
>
> Honestly, this seems completely bogus.
>
> The description is pure garbage afaik.
>
> The fact is, we have to stop speculating when access_ok() does *not*
> fail - because that's when we'll actually do the access. And it's that
> access that needs to be non-speculative.
>
> That actually seems to be what the code does (it stops speculation
> when __range_not_ok() returns false, but access_ok() is
> !__range_not_ok()). But the explanation is crap, and dangerous.
Oh, bother, yes, good catch. It's been a long week. I'll take a look
at moving this to uaccess_begin().
On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman <[email protected]> wrote:
> Dan Williams <[email protected]> writes:
>
>> Quoting Mark's original RFC:
>>
>> "Recently, Google Project Zero discovered several classes of attack
>> against speculative execution. One of these, known as variant-1, allows
>> explicit bounds checks to be bypassed under speculation, providing an
>> arbitrary read gadget. Further details can be found on the GPZ blog [1]
>> and the Documentation patch in this series."
>>
>> This series incorporates Mark Rutland's latest api and adds the x86
>> specific implementation of nospec_barrier. The
>> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
>> wide analysis performed by Elena Reshetova to address static analysis
>> reports where speculative execution on a userspace controlled value
>> could bypass a bounds check. The patches address a precondition for the
>> attack discussed in the Spectre paper [2].
>
> Please expand this.
>
> It is not clear what the static analysis is looking for. Having a clear
> description of what is being fixed is crucial for allowing any of these
> changes.
>
> For the details given in the change description, what I read is magic
> changes because a magic process says this code is vulnerable.
Yes, that was my first reaction to the patches as well, I try below to
add some more background and guidance, but in the end these are static
analysis reports across a wide swath of sub-systems. It's going to
take some iteration with domain experts to improve the patch
descriptions, and that's the point of this series, to get the better
trained eyes from the actual sub-system owners to take a look at these
reports.
For example, I'm looking for feedback like what Srinivas gave where he
identified that the report is bogus, the branch condition can not be
seeded with bad values in that path. Be like Srinivas.
> Given the similarities in the code that is being patched to many other
> places in the kernel it is not at all clear that this small set of
> changes is sufficient for any purpose.
I find this assertion absurd. When in the past have we as kernel
developers ever been handed a static analysis report and then
questioned why the static analysis did not flag other call sites
before first reviewing the ones it did find?
>> A consideration worth noting for reviewing these patches is to weigh the
>> dramatic cost of being wrong about whether a given report is exploitable
>> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
>> lets make the bar for applying these patches be "can you prove that the
>> bounds check bypass is *not* exploitable". Consider that the Spectre
>> paper reports one example of a speculation window being ~180 cycles.
>
>
>> Note that there is also a proposal from Linus, array_access [3], that
>> attempts to quash speculative execution past a bounds check without
>> introducing an lfence instruction. That may be a future optimization
>> possibility that is compatible with this api, but it would appear to
>> need guarantees from the compiler that it is not clear the kernel can
>> rely on at this point. It is also not clear that it would be a
>> significant performance win vs lfence.
>
> It is also not clear that these changes fix anything, or are in any
> sense correct for the problem they are trying to fix as the problem
> is not clearly described.
I'll try my best. I don't have first-hand knowledge of how the static
analyzer is doing this job, and I don't think it matters for
evaluating these reports. I'll give you my thoughts on how I would
handle one of these reports if it flagged one of the sub-systems I
maintain.
Start with the example from the Spectre paper:
if (x < array1_size)
y = array2[array1[x] * 256];
In all the patches 'x' and 'array1' are called out explicitly. For example:
net: mpls: prevent bounds-check bypass via speculative execution
Static analysis reports that 'index' may be a user controlled value that
is used as a data dependency reading 'rt' from the 'platform_label'
array...
So the first thing to review is whether the analyzer got it wrong and
'x' is not arbitrarily controllable by userspace to cause speculation
outside of the checked bounds. Be like Srinivas. The next step is to
ask whether the code can be refactored so that 'x' is sanitized
earlier in the call stack, especially if the nospec_array_ptr() lands
in a hot path. The next aspect that I expect most would be tempted to
go check is whether 'array2[array1[x]]' occurs later in the code
stream, but with speculation windows being architecture dependent and
potentially large (~180 cycles in one case says the paper) I submit
that we should err on the side of caution and not guess if that second
dependent read has been emitted somewhere in the instruction stream.
> In at least one place (mpls) you are patching a fast path. Compile out
> or don't load mpls by all means. But it is not acceptable to change the
> fast path without even considering performance.
Performance matters greatly, but I need help to identify a workload
that is representative of this fast path to see what, if any, impact
is incurred. Even better is a review that says "nope, 'index' is not
subject to arbitrary userspace control at this point, drop the patch."
On Fri, Jan 05, 2018 at 05:11:10PM -0800, Dan Williams wrote:
> Static analysis reports that 'offset' may be a user controlled value
> that is used as a data dependency reading from a raw_frag_vec buffer.
> In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid '*(rfv->c + offset)' value.
>
> Based on an original patch by Elena Reshetova.
I thought we "proved" that this patch was not needed at all, based on
previous review. It doesn't look like that review cycle got
incorporated into this patch series at all, so I guess I have to go
back and do it all again :(
thanks,
greg k-h
On Fri, Jan 05, 2018 at 05:11:10PM -0800, Dan Williams wrote:
> Static analysis reports that 'offset' may be a user controlled value
Can I see the rule that determined that? It does not feel like that is
correct, given the 3+ levels deep that this function gets this value
from...
Same for the ipv6 patch, it's the same code logic.
thanks,
greg k-h
On Fri, Jan 05, 2018 at 05:10:48PM -0800, Dan Williams wrote:
> Static analysis reports that 'handle' may be a user controlled value
> that is used as a data dependency to read 'sp' from the
> 'req->outstanding_cmds' array. In order to avoid potential leaks of
> kernel memory values, block speculative execution of the instruction
> stream that could issue reads based on an invalid value of 'sp'. In this
> case 'sp' is directly dereferenced later in the function.
I'm pretty sure that 'handle' comes from the hardware, not from
userspace, from what I can tell here. If we want to start auditing
__iomem data sources, great! But that's a bigger task, and one I don't
think we are ready to tackle...
thanks,
greg k-h
On Fri, Jan 05, 2018 at 05:10:32PM -0800, Dan Williams wrote:
> Static analysis reports that 'index' may be a user controlled value that
> is used as a data dependency to read 'pin' from the
> 'selector->baSourceID' array. In order to avoid potential leaks of
> kernel memory values, block speculative execution of the instruction
> stream that could issue reads based on an invalid value of 'pin'.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: Laurent Pinchart <[email protected]>
> Cc: Mauro Carvalho Chehab <[email protected]>
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/media/usb/uvc/uvc_v4l2.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
> index 3e7e283a44a8..7442626dc20e 100644
> --- a/drivers/media/usb/uvc/uvc_v4l2.c
> +++ b/drivers/media/usb/uvc/uvc_v4l2.c
> @@ -22,6 +22,7 @@
> #include <linux/mm.h>
> #include <linux/wait.h>
> #include <linux/atomic.h>
> +#include <linux/compiler.h>
>
> #include <media/v4l2-common.h>
> #include <media/v4l2-ctrls.h>
> @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
> struct uvc_entity *iterm = NULL;
> u32 index = input->index;
> int pin = 0;
> + __u8 *elem;
>
> if (selector == NULL ||
> (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
> @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
> break;
> }
> pin = iterm->id;
> - } else if (index < selector->bNrInPins) {
> - pin = selector->baSourceID[index];
> + } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
> + selector->bNrInPins))) {
> + pin = *elem;
I dug through this before, and I couldn't find where index came from
userspace, I think seeing the coverity rule would be nice.
And if this value really is user controlled, then why is this the only
v4l driver that is affected? This is a common callback.
thanks,
greg k-h
On Sat, Jan 06, 2018 at 10:09:07AM +0100, Greg KH wrote:
> On Fri, Jan 05, 2018 at 05:10:32PM -0800, Dan Williams wrote:
> > Static analysis reports that 'index' may be a user controlled value that
> > is used as a data dependency to read 'pin' from the
> > 'selector->baSourceID' array. In order to avoid potential leaks of
> > kernel memory values, block speculative execution of the instruction
> > stream that could issue reads based on an invalid value of 'pin'.
> >
> > Based on an original patch by Elena Reshetova.
> >
> > Cc: Laurent Pinchart <[email protected]>
> > Cc: Mauro Carvalho Chehab <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Elena Reshetova <[email protected]>
> > Signed-off-by: Dan Williams <[email protected]>
> > ---
> > drivers/media/usb/uvc/uvc_v4l2.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
> > index 3e7e283a44a8..7442626dc20e 100644
> > --- a/drivers/media/usb/uvc/uvc_v4l2.c
> > +++ b/drivers/media/usb/uvc/uvc_v4l2.c
> > @@ -22,6 +22,7 @@
> > #include <linux/mm.h>
> > #include <linux/wait.h>
> > #include <linux/atomic.h>
> > +#include <linux/compiler.h>
> >
> > #include <media/v4l2-common.h>
> > #include <media/v4l2-ctrls.h>
> > @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
> > struct uvc_entity *iterm = NULL;
> > u32 index = input->index;
> > int pin = 0;
> > + __u8 *elem;
> >
> > if (selector == NULL ||
> > (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
> > @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
> > break;
> > }
> > pin = iterm->id;
> > - } else if (index < selector->bNrInPins) {
> > - pin = selector->baSourceID[index];
> > + } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
> > + selector->bNrInPins))) {
> > + pin = *elem;
>
> I dug through this before, and I couldn't find where index came from
> userspace, I think seeing the coverity rule would be nice.
Ok, I take it back, this looks correct. Ugh, the v4l ioctl api is
crazy complex (rightfully so), it's amazing that coverity could navigate
that whole thing :)
While I'm all for fixing this type of thing, I feel like we need to do
something "else" for this, as playing whack-a-mole for this pattern is
going to be a never-ending battle for all drivers forever. Either
we need some way to mark this data path to make it easy for tools like
sparse to flag easily, or we need to catch the issue in the driver
subsystems, which unfortunately would harm the drivers that don't have
this type of issue (like here).
I'm guessing that other operating systems, which don't have the luxury
of auditing all of their drivers are going for the "big hammer in the
subsystem" type of fix, right?
I don't have a good answer for this, but if there was some better way to
rewrite these types of patterns to just prevent the need for the
nospec_array_ptr() type thing, that might be the best overall for
everyone. Much like ebpf did with their changes. That way a simple
coccinelle rule would be able to catch the pattern and rewrite it.
Or am I just dreaming?
thanks,
greg k-h
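For context on the eBPF approach Greg mentions above: instead of a barrier, eBPF clamps the index with arithmetic that carries a data dependency, so even a mispredicted branch cannot produce an out-of-range access. A minimal sketch of that shape, assuming a power-of-two array size (the helper name is illustrative, not the actual bpf code):

```c
#include <assert.h>
#include <stddef.h>

/* Clamp an index without a branch: for a power-of-two 'size', the AND
 * guarantees the result is in [0, size), even under misprediction,
 * because the clamp is arithmetic rather than control flow. */
static inline size_t mask_index(size_t idx, size_t size_pow2)
{
	return idx & (size_pow2 - 1);
}
```

A pattern like this is mechanically recognizable, which is what would make a coccinelle rewrite plausible; the catch is that most driver arrays are not power-of-two sized.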
On Sat, Jan 06, 2018 at 10:03:22AM +0100, Greg KH wrote:
> On Fri, Jan 05, 2018 at 05:10:48PM -0800, Dan Williams wrote:
> > Static analysis reports that 'handle' may be a user controlled value
> > that is used as a data dependency to read 'sp' from the
> > 'req->outstanding_cmds' array. In order to avoid potential leaks of
> > kernel memory values, block speculative execution of the instruction
> > stream that could issue reads based on an invalid value of 'sp'. In this
> > case 'sp' is directly dereferenced later in the function.
>
> I'm pretty sure that 'handle' comes from the hardware, not from
> userspace, from what I can tell here. If we want to start auditing
> __iomem data sources, great! But that's a bigger task, and one I don't
> think we are ready to tackle...
And as Peter Zijlstra has already mentioned, if we have to look at those
codepaths, USB drivers are the first up for that mess, so having access
to the coverity rules would be a great help in starting that effort.
thanks,
greg k-h
Hello!
On 1/6/2018 4:10 AM, Dan Williams wrote:
> Static analysis reports that 'queue' may be a user controlled value that
> is used as a data dependency to read from the 'ar9170_qmap' array. In
> order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue reads
> based on an invalid result of 'ar9170_qmap[queue]'. In this case the
> value of 'ar9170_qmap[queue]' is immediately reused as an index to the
> 'ar->edcf' array.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: Christian Lamparter <[email protected]>
> Cc: Kalle Valo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/net/wireless/ath/carl9170/main.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/carl9170/main.c b/drivers/net/wireless/ath/carl9170/main.c
> index 988c8857d78c..0ff34cbe2b62 100644
> --- a/drivers/net/wireless/ath/carl9170/main.c
> +++ b/drivers/net/wireless/ath/carl9170/main.c
> @@ -41,6 +41,7 @@
> #include <linux/module.h>
> #include <linux/etherdevice.h>
> #include <linux/random.h>
> +#include <linux/compiler.h>
> #include <net/mac80211.h>
> #include <net/cfg80211.h>
> #include "hw.h"
> @@ -1384,11 +1385,12 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
> const struct ieee80211_tx_queue_params *param)
> {
> struct ar9170 *ar = hw->priv;
> + const u8 *elem;
> int ret;
>
> mutex_lock(&ar->mutex);
> - if (queue < ar->hw->queues) {
> - memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
> + if ((elem = nospec_array_ptr(ar9170_qmap, queue, ar->hw->queues))) {
I bet this causes checkpatch.pl to complain. I don't see a dire need to
upset it here; the assignment may well precede the *if*...
> + memcpy(&ar->edcf[*elem], param, sizeof(*param));
> ret = carl9170_set_qos(ar);
> } else {
> ret = -EINVAL;
>
MBR, Sergei
On 1/6/2018 4:10 AM, Dan Williams wrote:
> Static analysis reports that 'queue' may be a user controlled value that
> is used as a data dependency to read from the 'priv->qos_params' array.
> In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue reads
> based on an invalid result of 'priv->qos_params[queue]'.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: Christian Lamparter <[email protected]>
> Cc: Kalle Valo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/net/wireless/intersil/p54/main.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/wireless/intersil/p54/main.c b/drivers/net/wireless/intersil/p54/main.c
> index ab6d39e12069..85c9cbee35fc 100644
> --- a/drivers/net/wireless/intersil/p54/main.c
> +++ b/drivers/net/wireless/intersil/p54/main.c
[...]
> @@ -411,12 +412,13 @@ static int p54_conf_tx(struct ieee80211_hw *dev,
> const struct ieee80211_tx_queue_params *params)
> {
> struct p54_common *priv = dev->priv;
> + struct p54_edcf_queue_param *p54_q;
> int ret;
>
> mutex_lock(&priv->conf_mutex);
> - if (queue < dev->queues) {
> - P54_SET_QUEUE(priv->qos_params[queue], params->aifs,
> - params->cw_min, params->cw_max, params->txop);
> + if ((p54_q = nospec_array_ptr(priv->qos_params, queue, dev->queues))) {
Same complaint here...
> + P54_SET_QUEUE(p54_q[0], params->aifs, params->cw_min,
> + params->cw_max, params->txop);
> ret = p54_set_edcf(priv);
> } else
> ret = -EINVAL;
>
MBR, Sergei
On 1/6/2018 4:10 AM, Dan Williams wrote:
> Static analysis reports that 'trip' may be a user controlled value that
> is used as a data dependency to read '*temp' from the 'd->aux_trips'
> array. In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue reads
> based on an invalid value of '*temp'.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: Srinivas Pandruvada <[email protected]>
> Cc: Zhang Rui <[email protected]>
> Cc: Eduardo Valentin <[email protected]>
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> .../thermal/int340x_thermal/int340x_thermal_zone.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> index 145a5c53ff5c..442a1d9bf7ad 100644
> --- a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
> +++ b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
[...]
> @@ -52,20 +53,21 @@ static int int340x_thermal_get_trip_temp(struct thermal_zone_device *zone,
> int trip, int *temp)
> {
> struct int34x_thermal_zone *d = zone->devdata;
> + unsigned long *elem;
> int i;
>
> if (d->override_ops && d->override_ops->get_trip_temp)
> return d->override_ops->get_trip_temp(zone, trip, temp);
>
> - if (trip < d->aux_trip_nr)
> - *temp = d->aux_trips[trip];
> - else if (trip == d->crt_trip_id)
> + if ((elem = nospec_array_ptr(d->aux_trips, trip, d->aux_trip_nr))) {
And here...
> + *temp = *elem;
> + } else if (trip == d->crt_trip_id) {
> *temp = d->crt_temp;
> - else if (trip == d->psv_trip_id)
> + } else if (trip == d->psv_trip_id) {
> *temp = d->psv_temp;
> - else if (trip == d->hot_trip_id)
> + } else if (trip == d->hot_trip_id) {
> *temp = d->hot_temp;
> - else {
> + } else {
> for (i = 0; i < INT340X_THERMAL_MAX_ACT_TRIP_COUNT; i++) {
> if (d->act_trips[i].valid &&
> d->act_trips[i].id == trip) {
MBR, Sergei
On 1/6/2018 4:11 AM, Dan Williams wrote:
> Static analysis reports that 'offset' may be a user controlled value
> that is used as a data dependency reading from a raw6_frag_vec buffer.
> In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid '*(rfv->c + offset)' value.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: "David S. Miller" <[email protected]>
> Cc: Alexey Kuznetsov <[email protected]>
> Cc: Hideaki YOSHIFUJI <[email protected]>
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> net/ipv6/raw.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
> index 761a473a07c5..384e3d59d148 100644
> --- a/net/ipv6/raw.c
> +++ b/net/ipv6/raw.c
[...]
> @@ -725,17 +726,17 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,
> struct sk_buff *skb)
> {
> struct raw6_frag_vec *rfv = from;
> + char *rfv_buf;
>
> - if (offset < rfv->hlen) {
> + if ((rfv_buf = nospec_array_ptr(rfv->c, offset, rfv->hlen))) {
And here...
> int copy = min(rfv->hlen - offset, len);
>
> if (skb->ip_summed == CHECKSUM_PARTIAL)
> - memcpy(to, rfv->c + offset, copy);
> + memcpy(to, rfv_buf, copy);
> else
> skb->csum = csum_block_add(
> skb->csum,
> - csum_partial_copy_nocheck(rfv->c + offset,
> - to, copy, 0),
> + csum_partial_copy_nocheck(rfv_buf, to, copy, 0),
> odd);
>
> odd = 0;
MBR, Sergei
On 1/6/2018 4:11 AM, Dan Williams wrote:
> Static analysis reports that 'offset' may be a user controlled value
> that is used as a data dependency reading from a raw_frag_vec buffer.
> In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid '*(rfv->c + offset)' value.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: "David S. Miller" <[email protected]>
> Cc: Alexey Kuznetsov <[email protected]>
> Cc: Hideaki YOSHIFUJI <[email protected]>
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> net/ipv4/raw.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 125c1eab3eaa..f72b20131a15 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
[...]
> @@ -472,17 +473,17 @@ static int raw_getfrag(void *from, char *to, int offset, int len, int odd,
> struct sk_buff *skb)
> {
> struct raw_frag_vec *rfv = from;
> + char *rfv_buf;
>
> - if (offset < rfv->hlen) {
> + if ((rfv_buf = nospec_array_ptr(rfv->hdr.c, offset, rfv->hlen))) {
And here...
[...]
MBR, Sergei
On 1/6/2018 4:11 AM, Dan Williams wrote:
> Expectedly, static analysis reports that 'fd' is a user controlled value
> that is used as a data dependency to read from the 'fdt->fd' array. In
> order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue reads
> based on an invalid 'file *' returned from __fcheck_files.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: Al Viro <[email protected]>
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> include/linux/fdtable.h | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> index 1c65817673db..4a147c5c2533 100644
> --- a/include/linux/fdtable.h
> +++ b/include/linux/fdtable.h
> @@ -81,9 +81,10 @@ struct dentry;
> static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
> {
> struct fdtable *fdt = rcu_dereference_raw(files->fdt);
> + struct file __rcu **fdp;
>
> - if (fd < fdt->max_fds)
> - return rcu_dereference_raw(fdt->fd[fd]);
> + if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
And here...
> + return rcu_dereference_raw(*fdp);
> return NULL;
> }
>
MBR, Sergei
On 1/6/2018 4:11 AM, Dan Williams wrote:
> Static analysis reports that 'index' may be a user controlled value that
> is used as a data dependency reading 'rt' from the 'platform_label'
> array. In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid 'rt' value.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: "David S. Miller" <[email protected]>
> Cc: Eric W. Biederman <[email protected]>
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> net/mpls/af_mpls.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 8ca9915befc8..ebcf0e246cfe 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
[...]
> @@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
> static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
> {
> struct mpls_route *rt = NULL;
> + struct mpls_route __rcu **platform_label =
> + rcu_dereference(net->mpls.platform_label);
> + struct mpls_route __rcu **rtp;
>
> - if (index < net->mpls.platform_labels) {
> - struct mpls_route __rcu **platform_label =
> - rcu_dereference(net->mpls.platform_label);
> - rt = rcu_dereference(platform_label[index]);
> - }
> + if ((rtp = nospec_array_ptr(platform_label, index,
And here...
> + net->mpls.platform_labels)))
> + rt = rcu_dereference(*rtp);
> return rt;
> }
>
MBR, Sergei
On Sat, 6 Jan 2018 10:01:54 +0100
Greg KH <[email protected]> wrote:
> On Fri, Jan 05, 2018 at 05:11:10PM -0800, Dan Williams wrote:
> > Static analysis reports that 'offset' may be a user controlled value
>
> Can I see the rule that determined that? It does not feel like that is
> correct, given the 3+ levels deep that this function gets this value
> from...
On a current x86 you can execute something upwards of 150 instructions in
a speculation window.
Alan
On Fri, 5 Jan 2018 18:52:07 -0800
Linus Torvalds <[email protected]> wrote:
> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
> > From: Andi Kleen <[email protected]>
> >
> > When access_ok fails we should always stop speculating.
> > Add the required barriers to the x86 access_ok macro.
>
> Honestly, this seems completely bogus.
Also, for x86-64, if we are trusting that an AND with a constant won't
get speculated into something else, surely we can just AND the address
with ~(1 << 63) before copying from/to user space? The user will then
just speculatively steal their own memory.
Alan
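Alan's suggestion amounts to clearing the top address bit so that even a speculatively-executed copy can only touch the user half of the address space. A rough userspace illustration of the arithmetic (the bit-63 split is an x86-64 assumption; the real kernel mitigation is more involved):

```c
#include <assert.h>
#include <stdint.h>

/* On x86-64, kernel addresses have bit 63 set. Clearing it means a
 * pointer used under misprediction can only alias user-space
 * addresses, so "the user speculatively steals their own memory". */
static inline uint64_t mask_user_address(uint64_t addr)
{
	return addr & ~(1ULL << 63);
}
```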
On Saturday, January 6, 2018 2:10:37 AM CET Dan Williams wrote:
> Static analysis reports that 'queue' may be a user controlled value that
> is used as a data dependency to read from the 'ar9170_qmap' array. In
> order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue reads
> based on an invalid result of 'ar9170_qmap[queue]'. In this case the
> value of 'ar9170_qmap[queue]' is immediately reused as an index to the
> 'ar->edcf' array.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: Christian Lamparter <[email protected]>
> Cc: Kalle Valo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/net/wireless/ath/carl9170/main.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/carl9170/main.c b/drivers/net/wireless/ath/carl9170/main.c
> index 988c8857d78c..0ff34cbe2b62 100644
> --- a/drivers/net/wireless/ath/carl9170/main.c
> +++ b/drivers/net/wireless/ath/carl9170/main.c
> @@ -41,6 +41,7 @@
> #include <linux/module.h>
> #include <linux/etherdevice.h>
> #include <linux/random.h>
> +#include <linux/compiler.h>
> #include <net/mac80211.h>
> #include <net/cfg80211.h>
> #include "hw.h"
> @@ -1384,11 +1385,12 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
> const struct ieee80211_tx_queue_params *param)
> {
> struct ar9170 *ar = hw->priv;
> + const u8 *elem;
> int ret;
>
> mutex_lock(&ar->mutex);
> - if (queue < ar->hw->queues) {
> - memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
> + if ((elem = nospec_array_ptr(ar9170_qmap, queue, ar->hw->queues))) {
> + memcpy(&ar->edcf[*elem], param, sizeof(*param));
> ret = carl9170_set_qos(ar);
> } else {
> ret = -EINVAL;
>
>
About the "queue" in carl9170_op_conf_tx, p54_conf_tx and cw1200_conf_tx:
The only way a user can set this in any meaningful way would be via
a NL80211_CMD_SET_WIPHY netlink message. However, the value will get
vetted there by cfg80211's parse_txq_params [0]. This is long before
it reaches any of the *_op_conf_tx functions.
Furthermore, an invalid queue (param->ac) would cause a crash at
this line in mac80211 before it even reaches the driver [1]:
| sdata->tx_conf[params->ac] = p;
| ^^^^^^^^
| if (drv_conf_tx(local, sdata, >>>> params->ac <<<<, &p)) {
| ^^ (this is a wrapper for the *_op_conf_tx)
I don't think these chin-up exercises are needed.
[0] <https://github.com/torvalds/linux/blob/master/net/wireless/nl80211.c#L2070>
[1] <https://github.com/torvalds/linux/blob/master/net/mac80211/cfg.c#L2157>
On Fri, 05 Jan 2018 17:11:04 -0800
Dan Williams <[email protected]> wrote:
> Static analysis reports that 'offset' may be a user controlled value
> that is used as a data dependency reading from a raw6_frag_vec buffer.
> In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid '*(rfv->c + offset)' value.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: "David S. Miller" <[email protected]>
> Cc: Alexey Kuznetsov <[email protected]>
> Cc: Hideaki YOSHIFUJI <[email protected]>
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> net/ipv6/raw.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
> index 761a473a07c5..384e3d59d148 100644
> --- a/net/ipv6/raw.c
> +++ b/net/ipv6/raw.c
> @@ -33,6 +33,7 @@
> #include <linux/skbuff.h>
> #include <linux/compat.h>
> #include <linux/uaccess.h>
> +#include <linux/compiler.h>
> #include <asm/ioctls.h>
>
> #include <net/net_namespace.h>
> @@ -725,17 +726,17 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,
> struct sk_buff *skb)
> {
> struct raw6_frag_vec *rfv = from;
> + char *rfv_buf;
>
> - if (offset < rfv->hlen) {
> + if ((rfv_buf = nospec_array_ptr(rfv->c, offset, rfv->hlen))) {
> int copy = min(rfv->hlen - offset, len);
Minor nit.
Please don't do the assignment in the condition test here.
Instead:
rfv_buf = nospec_array_ptr(rfv->c, offset, rfv->hlen);
if (rfv_buf) {
> The only way a user can set this in any meaningful way would be via
> a NL80211_CMD_SET_WIPHY netlink message. However, the value will get
> vetted there by cfg80211's parse_txq_params [0]. This is long before
Far more than a couple of hundred instructions ? The problem is that the
processor will speculate that the parameter is valid and continue on that
basis until the speculation resolves or it hits some other limit that
stops it speculating further. That can be quite a distance on a modern
x86 processor, and for all we know could be even more on some of the
other CPUs involved.
> it reaches any of the *_op_conf_tx functions.
>
> And Furthermore a invalid queue (param->ac) would cause a crash in
> this line in mac80211 before it even reaches the driver [1]:
> | sdata->tx_conf[params->ac] = p;
> | ^^^^^^^^
Firstly it might not because the address you get as a result could be
valid kernel memory. In fact your attackers wants it to be valid kernel
memory since they are trying to find the value of a piece of that memory.
Secondly the processor is doing this speculatively so it won't fault. It
will eventually decide it went the wrong way and throw all the
speculative work away - leaving footprints. It won't fault unless the
speculative resolves that was the real path the code took.
If it's not a performance critical path then it's better to be safe.
Alan
On Sat, Jan 06, 2018 at 12:23:47PM +0000, Alan Cox wrote:
> On Sat, 6 Jan 2018 10:01:54 +0100
> Greg KH <[email protected]> wrote:
>
> > On Fri, Jan 05, 2018 at 05:11:10PM -0800, Dan Williams wrote:
> > > Static analysis reports that 'offset' may be a user controlled value
> >
> > Can I see the rule that determined that? It does not feel like that is
> > correct, given the 3+ levels deep that this function gets this value
> > from...
>
> On a current x86 you can execute something upwards of 150 instructions in
> a speculation window.
Yeah, I agree, it's deep :(
But for this patch, I thought the prior review determined that it was
not a problem. Was that somehow proven incorrect?
thanks,
greg k-h
On Sat, Jan 6, 2018 at 7:14 AM, Greg KH <[email protected]> wrote:
> On Sat, Jan 06, 2018 at 12:23:47PM +0000, Alan Cox wrote:
>> On Sat, 6 Jan 2018 10:01:54 +0100
>> Greg KH <[email protected]> wrote:
>>
>> > On Fri, Jan 05, 2018 at 05:11:10PM -0800, Dan Williams wrote:
>> > > Static analysis reports that 'offset' may be a user controlled value
>> >
>> > Can I see the rule that determined that? It does not feel like that is
>> > correct, given the 3+ levels deep that this function gets this value
>> > from...
>>
>> On a current x86 you can execute something upwards of 150 instructions in
>> a speculation window.
>
> Yeah, I agree, it's deep :(
>
> But for this patch, I thought the prior review determined that it was
> not a problem. Was that somehow proven incorrect?
I kept it in the series to get a re-review with the wider netdev
because I missed the discussion leading up to that 'drop the patch'
decision. Sorry, I should have noted that in the changelog or cover
letter.
On Sat, Jan 6, 2018 at 6:23 AM, Christian Lamparter <[email protected]> wrote:
> On Saturday, January 6, 2018 2:10:37 AM CET Dan Williams wrote:
>> Static analysis reports that 'queue' may be a user controlled value that
>> is used as a data dependency to read from the 'ar9170_qmap' array. In
>> order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue reads
>> based on an invalid result of 'ar9170_qmap[queue]'. In this case the
>> value of 'ar9170_qmap[queue]' is immediately reused as an index to the
>> 'ar->edcf' array.
>>
>> Based on an original patch by Elena Reshetova.
>>
>> Cc: Christian Lamparter <[email protected]>
>> Cc: Kalle Valo <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Signed-off-by: Elena Reshetova <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>> ---
>> drivers/net/wireless/ath/carl9170/main.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/carl9170/main.c b/drivers/net/wireless/ath/carl9170/main.c
>> index 988c8857d78c..0ff34cbe2b62 100644
>> --- a/drivers/net/wireless/ath/carl9170/main.c
>> +++ b/drivers/net/wireless/ath/carl9170/main.c
>> @@ -41,6 +41,7 @@
>> #include <linux/module.h>
>> #include <linux/etherdevice.h>
>> #include <linux/random.h>
>> +#include <linux/compiler.h>
>> #include <net/mac80211.h>
>> #include <net/cfg80211.h>
>> #include "hw.h"
>> @@ -1384,11 +1385,12 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
>> const struct ieee80211_tx_queue_params *param)
>> {
>> struct ar9170 *ar = hw->priv;
>> + const u8 *elem;
>> int ret;
>>
>> mutex_lock(&ar->mutex);
>> - if (queue < ar->hw->queues) {
>> - memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
>> + if ((elem = nospec_array_ptr(ar9170_qmap, queue, ar->hw->queues))) {
>> + memcpy(&ar->edcf[*elem], param, sizeof(*param));
>> ret = carl9170_set_qos(ar);
>> } else {
>> ret = -EINVAL;
>>
>>
> About the "queue" in carl9170_op_conf_tx, p54_conf_tx and cw1200_conf_tx:
>
> The only way a user can set this in any meaningful way would be via
> a NL80211_CMD_SET_WIPHY netlink message. However, the value will get
> vetted there by cfg80211's parse_txq_params [0]. This is long before
> it reaches any of the *_op_conf_tx functions.
>
> And Furthermore a invalid queue (param->ac) would cause a crash in
> this line in mac80211 before it even reaches the driver [1]:
> | sdata->tx_conf[params->ac] = p;
> | ^^^^^^^^
> | if (drv_conf_tx(local, sdata, >>>> params->ac <<<<, &p)) {
> | ^^ (this is a wrapper for the *_op_conf_tx)
>
> I don't think these chin-up exercises are needed.
Quite the contrary, you've identified a better place in the call stack
to sanitize the input and disable speculation. It seems we can then
kill the whole class of wireless driver reports at once.
On Saturday, January 6, 2018 4:06:21 PM CET Alan Cox wrote:
> > The only way a user can set this in any meaningful way would be via
> > a NL80211_CMD_SET_WIPHY netlink message. However, the value will get
> > vetted there by cfg80211's parse_txq_params [0]. This is long before
>
> Far more than a couple of hundred instructions ?
Well, the user would have to send a netlink message each time, so
cfg80211 can parse it (this is where the initial "if queue >= 4" check
happens). So the CPU would have to continue through and into
rdev_set_txq_params() to get to mac80211's ieee80211_set_txq_params(),
then pass through that before it gets to the driver's *_op_conf_tx.
Once there, the driver code acquires a mutex too before it gets to
check the queue value again.
Is this enough, and how would the mutex_lock fit in there? Or can
the CPU skip past this as well?
> The problem is that the processor will speculate that the parameter
> is valid and continue on that basis until the speculation resolves
> or it hits some other limit that stops it speculating further.
> That can be quite a distance on a modern x86 processor, and for all
> we know could be even more on some of the other CPUs involved.
> > it reaches any of the *_op_conf_tx functions.
> >
> > And Furthermore a invalid queue (param->ac) would cause a crash in
> > this line in mac80211 before it even reaches the driver [1]:
> > | sdata->tx_conf[params->ac] = p;
> > | ^^^^^^^^
>
> Firstly it might not because the address you get as a result could be
> valid kernel memory. In fact your attackers wants it to be valid kernel
> memory since they are trying to find the value of a piece of that memory.
>
> Secondly the processor is doing this speculatively so it won't fault. It
> will eventually decide it went the wrong way and throw all the
> speculative work away - leaving footprints. It won't fault unless the
> speculative resolves that was the real path the code took.
>
> If it's not a performance critical path then it's better to be safe.
Thank you for reading the canary too.
Christian
On Fri, Jan 05, 2018 at 09:23:06PM -0800, Dan Williams wrote:
> On Fri, Jan 5, 2018 at 6:55 PM, Linus Torvalds
> <[email protected]> wrote:
> > On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams <[email protected]> wrote:
> >> +#ifndef nospec_ptr
> >> +#define nospec_ptr(ptr, lo, hi) \
> >
> > Do we actually want this horrible interface?
> >
> > It just causes the compiler - or inline asm - to generate worse code,
> > because it needs to compare against both high and low limits.
> >
> > Basically all users are arrays that are zero-based, and where a
> > comparison against the high _index_ limit would be sufficient.
> >
> > But the way this is all designed, it's literally designed for bad code
> > generation for the unusual case, and the usual array case is written
> > in the form of the unusual and wrong non-array case. That really seems
> > excessively stupid.
>
> Yes, it appears we can kill nospec_ptr() and move nospec_array_ptr()
> to assume 0 based arrays rather than use nospec_ptr.
Sounds good to me; I can respin the arm/arm64 implementations accordingly.
We can always revisit that if we have non-array cases that need to be covered.
Thanks,
Mark.
On Fri, 2018-01-05 at 17:57 -0800, Dan Williams wrote:
> On Fri, Jan 5, 2018 at 5:53 PM, Srinivas Pandruvada
> <[email protected]> wrote:
> >
> > On Fri, 2018-01-05 at 17:10 -0800, Dan Williams wrote:
> > >
> > > Static analysis reports that 'trip' may be a user controlled value
> > > that is used as a data dependency to read '*temp' from the
> > > 'd->aux_trips' array. In order to avoid potential leaks of kernel
> > > memory values, block speculative execution of the instruction stream
> > > that could issue reads based on an invalid value of '*temp'.
> > Not against the change as this is in a very slow path. But the trip is
> > not an arbitrary value which the user can enter.
> >
> > This trip value is one of the sysfs attributes in the thermal zone.
> > For example:
> >
> > # cd /sys/class/thermal/thermal_zone1
> > # ls trip_point_?_temp
> > trip_point_0_temp trip_point_1_temp trip_point_2_temp trip_point_3_temp
> > trip_point_4_temp trip_point_5_temp trip_point_6_temp
> >
> > Here the "trip" is one of the above trip_point_*_temp. So in this case
> > it can be from 0 to 6, as the user can't do
> > # cat trip_point_7_temp
> > as there is no sysfs attribute for trip_point_7_temp.
> >
> > The actual "trip" was obtained in thermal core via
> >
> > if (sscanf(attr->attr.name, "trip_point_%d_temp", &trip) !=
> > 1)
> > return -EINVAL;
> >
> > Thanks,
> > Srinivas
> Ah, great, thanks. So do we even need the bounds check at that point?
We are not bounds checking; this is the way we identify the type of the trip.
Based on ACPI support we order trips:
- Aux trips max_count = aux_trip_nr
- One Critical trip
- One Hot trip
- One passive trip
- Rest all trips are active trips
So in the above example if d->aux_trip_nr is 1 then trip_point_0_temp
read/write is for the aux trip. If d->aux_trip_nr is 0 then it can be any
other non-aux trip.
But I am still not up to date with these attacks, and not sure about the
scope of a user controlled value. It is user controlled but limited
by the sysfs attributes. So I will test this patch and let you know if
there are any issues.
Thanks,
Srinivas
On Sat, Jan 6, 2018 at 1:40 AM, Greg KH <[email protected]> wrote:
> On Sat, Jan 06, 2018 at 10:09:07AM +0100, Greg KH wrote:
>> On Fri, Jan 05, 2018 at 05:10:32PM -0800, Dan Williams wrote:
>> > Static analysis reports that 'index' may be a user controlled value that
>> > is used as a data dependency to read 'pin' from the
>> > 'selector->baSourceID' array. In order to avoid potential leaks of
>> > kernel memory values, block speculative execution of the instruction
>> > stream that could issue reads based on an invalid value of 'pin'.
>> >
>> > Based on an original patch by Elena Reshetova.
>> >
>> > Cc: Laurent Pinchart <[email protected]>
>> > Cc: Mauro Carvalho Chehab <[email protected]>
>> > Cc: [email protected]
>> > Signed-off-by: Elena Reshetova <[email protected]>
>> > Signed-off-by: Dan Williams <[email protected]>
>> > ---
>> > drivers/media/usb/uvc/uvc_v4l2.c | 7 +++++--
>> > 1 file changed, 5 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
>> > index 3e7e283a44a8..7442626dc20e 100644
>> > --- a/drivers/media/usb/uvc/uvc_v4l2.c
>> > +++ b/drivers/media/usb/uvc/uvc_v4l2.c
>> > @@ -22,6 +22,7 @@
>> > #include <linux/mm.h>
>> > #include <linux/wait.h>
>> > #include <linux/atomic.h>
>> > +#include <linux/compiler.h>
>> >
>> > #include <media/v4l2-common.h>
>> > #include <media/v4l2-ctrls.h>
>> > @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
>> > struct uvc_entity *iterm = NULL;
>> > u32 index = input->index;
>> > int pin = 0;
>> > + __u8 *elem;
>> >
>> > if (selector == NULL ||
>> > (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
>> > @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
>> > break;
>> > }
>> > pin = iterm->id;
>> > - } else if (index < selector->bNrInPins) {
>> > - pin = selector->baSourceID[index];
>> > + } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
>> > + selector->bNrInPins))) {
>> > + pin = *elem;
>>
>> I dug through this before, and I couldn't find where index came from
>> userspace, I think seeing the coverity rule would be nice.
>
> Ok, I take it back, this looks correct. Ugh, the v4l ioctl api is
> crazy complex (rightfully so), it's amazing that coverity could navigate
> that whole thing :)
>
> While I'm all for fixing this type of thing, I feel like we need to do
> something "else" for this as playing whack-a-mole for this pattern is
> going to be a never-ending battle for all drivers for forever. Either
> we need some way to mark this data path to make it easy for tools like
> sparse to flag easily, or we need to catch the issue in the driver
> subsystems, which unfortunately would harm the drivers that don't have
> this type of issue (like here).
>
> I'm guessing that other operating systems, which don't have the luxury
> of auditing all of their drivers are going for the "big hammer in the
> subsystem" type of fix, right?
>
> I don't have a good answer for this, but if there was some better way to
> rewrite these types of patterns to just prevent the need for the
> nospec_array_ptr() type thing, that might be the best overall for
> everyone. Much like ebpf did with their changes. That way a simple
> coccinelle rule would be able to catch the pattern and rewrite it.
>
> Or am I just dreaming?
At least on the coccinelle front you're dreaming. Julia already took a
look and said:
"I don't think Coccinelle would be good for doing this (ie
implementing taint analysis) because the dataflow is too complicated."
Perhaps the Coverity instance Dave mentioned at Ksummit 2012 has a
role to play here?
On Sat, Jan 6, 2018 at 4:32 AM, Alan Cox <[email protected]> wrote:
>
> Also for x86-64 if we are trusting that an AND with a constant won't get
> speculated into something else surely we can just and the address with ~(1
> << 63) before copying from/to user space ? The user will then just
> speculatively steal their own memory.
User accesses *may* go to the kernel too, with set_fs(KERNEL_DS).
We've been getting rid of those, but they still exist.
We historically could perhaps have 'and'ed the address with
current->thread.addr_limit, but it's no longer a power-of-two mask, so
that doesn't work. We'd have to play tricks there, but it might work
to do something like
addr &= current->thread.addr_limit | 0xfff;
or similar.
But this is one area where the 'lfence' is probably not too bad. The
cost of the existing 'stac' instruction is much higher, and in fact
depending on how lfence works (and how stac affects the memory unit
pipeline), it might even _help_ to serialize the stac with the actual
address.
Linus
On Sat, Jan 6, 2018 at 6:48 AM, Stephen Hemminger
<[email protected]> wrote:
> On Fri, 05 Jan 2018 17:11:04 -0800
> Dan Williams <[email protected]> wrote:
>
>> Static analysis reports that 'offset' may be a user controlled value
>> that is used as a data dependency reading from a raw6_frag_vec buffer.
>> In order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue further
>> reads based on an invalid '*(rfv->c + offset)' value.
>>
>> Based on an original patch by Elena Reshetova.
>>
>> Cc: "David S. Miller" <[email protected]>
>> Cc: Alexey Kuznetsov <[email protected]>
>> Cc: Hideaki YOSHIFUJI <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Elena Reshetova <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>> ---
>> net/ipv6/raw.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
>> index 761a473a07c5..384e3d59d148 100644
>> --- a/net/ipv6/raw.c
>> +++ b/net/ipv6/raw.c
>> @@ -33,6 +33,7 @@
>> #include <linux/skbuff.h>
>> #include <linux/compat.h>
>> #include <linux/uaccess.h>
>> +#include <linux/compiler.h>
>> #include <asm/ioctls.h>
>>
>> #include <net/net_namespace.h>
>> @@ -725,17 +726,17 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,
>> struct sk_buff *skb)
>> {
>> struct raw6_frag_vec *rfv = from;
>> + char *rfv_buf;
>>
>> - if (offset < rfv->hlen) {
>> + if ((rfv_buf = nospec_array_ptr(rfv->c, offset, rfv->hlen))) {
>> int copy = min(rfv->hlen - offset, len);
>
> Minor nit.
>
> Please don't do assignment in condition test here.
> Instead.
> rfv_buf = nospec_array_ptr(rfv->c, offset, rfv->hlen);
> if (rfv_buf) {
Yeah, sorry about that. This was a holdover from an earlier version
where nospec_array_ptr() did not include the necessary barrier and we
relied on a new if_nospec helper, but now that if_nospec is no longer
being proposed I can go back and clean this up.
On Sat, Jan 6, 2018 at 8:29 AM, Dan Williams <[email protected]> wrote:
> On Sat, Jan 6, 2018 at 7:14 AM, Greg KH <[email protected]> wrote:
>> On Sat, Jan 06, 2018 at 12:23:47PM +0000, Alan Cox wrote:
>>> On Sat, 6 Jan 2018 10:01:54 +0100
>>> Greg KH <[email protected]> wrote:
>>>
>>> > On Fri, Jan 05, 2018 at 05:11:10PM -0800, Dan Williams wrote:
>>> > > Static analysis reports that 'offset' may be a user controlled value
>>> >
>>> > Can I see the rule that determined that? It does not feel like that is
>>> > correct, given the 3+ levels deep that this function gets this value
>>> > from...
>>>
>>> On a current x86 you can execute something upwards of 150 instructions in
>>> a speculation window.
>>
>> Yeah, I agree, it's deep :(
>>
>> But for this patch, I thought the prior review determined that it was
>> not a problem. Was that somehow proven incorrect?
>
> I kept it in the series to get a re-review with the wider netdev
> because I missed the discussion leading up to that 'drop the patch'
> decision. Sorry, I should have noted that in the changelog or cover
> letter.
Is there a microbenchmark that can stress this path? If the cost is
negligible, why play games with it being "probably ok"? Unless someone
can say 100% that 'offset' is always bounded to something safe by the time
we get here, just use nospec_array_ptr().
On Sat, Jan 06, 2018 at 12:32:42PM +0000, Alan Cox wrote:
> On Fri, 5 Jan 2018 18:52:07 -0800
> Linus Torvalds <[email protected]> wrote:
>
> > On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
> > > From: Andi Kleen <[email protected]>
> > >
> > > When access_ok fails we should always stop speculating.
> > > Add the required barriers to the x86 access_ok macro.
> >
> > Honestly, this seems completely bogus.
>
> Also for x86-64 if we are trusting that an AND with a constant won't get
> speculated into something else surely we can just and the address with ~(1
> << 63) before copying from/to user space ? The user will then just
> speculatively steal their own memory.
+1
Any type of straight line code can address variant 1.
Like changing:
array[index]
into
array[index & mask]
works even when 'mask' is a variable.
To proceed with a speculative load from the array the cpu has to
speculatively load 'mask' from memory and speculatively do the '&' alu op.
If the attacker cannot influence 'mask', its speculative value
will bound 'index & mask' to be within the array limits.
I think the "let's sprinkle lfence everywhere" approach is going to
cause serious performance degradation, yet the people pushing for lfence
didn't present any numbers.
The last time lfence was removed from the networking drivers via dma_rmb(),
the packet-per-second metric jumped 10-30%. lfence forces all outstanding
loads to complete. If any prior load is waiting on L3 or memory,
lfence will cause a 100+ ns stall and overall kernel performance will tank.
If kernel adopts this "lfence everywhere" approach it will be
the end of the kernel as we know it. All high performance operations
will move into user space. Networking and IO will be first.
Since it will take years to design new cpus and even longer
to upgrade all servers, the industry will have no choice
but to move as much logic as possible out of the kernel.
kpti already made crossing the user/kernel boundary slower, but the
kernel itself is still fast. If the kernel has lfence everywhere,
the kernel itself will be slow.
In that sense retpolining the kernel is not as horrible as it sounds,
since both user space and the kernel have to be retpolined.
On Sat, Jan 6, 2018 at 10:13 AM, Alexei Starovoitov
<[email protected]> wrote:
> On Sat, Jan 06, 2018 at 12:32:42PM +0000, Alan Cox wrote:
>> On Fri, 5 Jan 2018 18:52:07 -0800
>> Linus Torvalds <[email protected]> wrote:
>>
>> > On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
>> > > From: Andi Kleen <[email protected]>
>> > >
>> > > When access_ok fails we should always stop speculating.
>> > > Add the required barriers to the x86 access_ok macro.
>> >
>> > Honestly, this seems completely bogus.
>>
>> Also for x86-64 if we are trusting that an AND with a constant won't get
>> speculated into something else surely we can just and the address with ~(1
>> << 63) before copying from/to user space ? The user will then just
>> speculatively steal their own memory.
>
> +1
>
> Any type of straight line code can address variant 1.
> Like changing:
> array[index]
> into
> array[index & mask]
> works even when 'mask' is a variable.
> To proceed with speculative load from array cpu has to speculatively
> load 'mask' from memory and speculatively do '&' alu.
> If attacker cannot influence 'mask' the speculative value of it
> will bound 'index & mask' value to be within array limits.
>
> I think "lets sprinkle lfence everywhere" approach is going to
> cause serious performance degradation. Yet people pushing for lfence
> didn't present any numbers.
> Last time lfence was removed from the networking drivers via dma_rmb()
> packet-per-second metric jumped 10-30%. lfence forces all outstanding loads
> to complete. If any prior load is waiting on L3 or memory,
> lfence will cause 100+ ns stall and overall kernel performance will tank.
You are conflating dma_rmb() with the limited cases where
nospec_array_ptr() is used. I need help determining what the
performance impact of those limited places are.
> If kernel adopts this "lfence everywhere" approach it will be
> the end of the kernel as we know it. All high performance operations
> will move into user space. Networking and IO will be first.
> Since it will takes years to design new cpus and even longer
> to upgrade all servers the industry will have no choice,
> but to move as much logic as possible from the kernel.
>
> kpti already made crossing user/kernel boundary slower, but
> kernel itself is still fast. If kernel will have lfence everywhere
> the kernel itself will be slow.
>
> In that sense retpolining the kernel is not as horrible as it sounds,
> since both user space and kernel has to be retpolined.
retpoline is variant-2, this patch series is about variant-1.
On Sat, 6 Jan 2018 10:13:33 -0800
Alexei Starovoitov <[email protected]> wrote:
> On Sat, Jan 06, 2018 at 12:32:42PM +0000, Alan Cox wrote:
> > On Fri, 5 Jan 2018 18:52:07 -0800
> > Linus Torvalds <[email protected]> wrote:
> >
> > > On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
> > > > From: Andi Kleen <[email protected]>
> > > >
> > > > When access_ok fails we should always stop speculating.
> > > > Add the required barriers to the x86 access_ok macro.
> > >
> > > Honestly, this seems completely bogus.
> >
> > Also for x86-64 if we are trusting that an AND with a constant won't get
> > speculated into something else surely we can just and the address with ~(1
> > << 63) before copying from/to user space ? The user will then just
> > speculatively steal their own memory.
>
> +1
>
> Any type of straight line code can address variant 1.
> Like changing:
> array[index]
> into
> array[index & mask]
> works even when 'mask' is a variable.
That statement is unfortunately not one that we currently believe is
true for all architectures, platforms and implementations. It may be true
for some architectures, but processors can speculate on more than just
execution paths. For some architectures it may be the right way to
implement Linus' array_* methods.
> I think "lets sprinkle lfence everywhere" approach is going to
> cause serious performance degradation. Yet people pushing for lfence
> didn't present any numbers.
Normally people who propose security fixes don't have to argue about the
fact they added 30 clocks to avoid your box being 0wned.
If you look at the patches, very very few are in remotely hot paths, which
is good news. The ones in hot paths definitely need careful scrutiny, but
the priority ought to be fixing and then optimizing, unless it's obvious
how to tackle it.
Alan
On Sat, Jan 06, 2018 at 10:29:49AM -0800, Dan Williams wrote:
> On Sat, Jan 6, 2018 at 10:13 AM, Alexei Starovoitov
> <[email protected]> wrote:
> > On Sat, Jan 06, 2018 at 12:32:42PM +0000, Alan Cox wrote:
> >> On Fri, 5 Jan 2018 18:52:07 -0800
> >> Linus Torvalds <[email protected]> wrote:
> >>
> >> > On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
> >> > > From: Andi Kleen <[email protected]>
> >> > >
> >> > > When access_ok fails we should always stop speculating.
> >> > > Add the required barriers to the x86 access_ok macro.
> >> >
> >> > Honestly, this seems completely bogus.
> >>
> >> Also for x86-64 if we are trusting that an AND with a constant won't get
> >> speculated into something else surely we can just and the address with ~(1
> >> << 63) before copying from/to user space ? The user will then just
> >> speculatively steal their own memory.
> >
> > +1
> >
> > Any type of straight line code can address variant 1.
> > Like changing:
> > array[index]
> > into
> > array[index & mask]
> > works even when 'mask' is a variable.
> > To proceed with speculative load from array cpu has to speculatively
> > load 'mask' from memory and speculatively do '&' alu.
> > If attacker cannot influence 'mask' the speculative value of it
> > will bound 'index & mask' value to be within array limits.
> >
> > I think "lets sprinkle lfence everywhere" approach is going to
> > cause serious performance degradation. Yet people pushing for lfence
> > didn't present any numbers.
> > Last time lfence was removed from the networking drivers via dma_rmb()
> > packet-per-second metric jumped 10-30%. lfence forces all outstanding loads
> > to complete. If any prior load is waiting on L3 or memory,
> > lfence will cause 100+ ns stall and overall kernel performance will tank.
>
> You are conflating dma_rmb() with the limited cases where
> nospec_array_ptr() is used. I need help determining what the
> performance impact of those limited places are.
really? fdtable, access_ok, net/ipv[46] are not critical paths?
> > If kernel adopts this "lfence everywhere" approach it will be
> > the end of the kernel as we know it. All high performance operations
> > will move into user space. Networking and IO will be first.
> > Since it will takes years to design new cpus and even longer
> > to upgrade all servers the industry will have no choice,
> > but to move as much logic as possible from the kernel.
> >
> > kpti already made crossing user/kernel boundary slower, but
> > kernel itself is still fast. If kernel will have lfence everywhere
> > the kernel itself will be slow.
> >
> > In that sense retpolining the kernel is not as horrible as it sounds,
> > since both user space and kernel has to be retpolined.
>
> retpoline is variant-2, this patch series is about variant-1.
that's exactly the point. Don't slow down the kernel with lfences
to solve variant 1. retpoline for 2 is ok from long term kernel
viability perspective.
On Sat, Jan 06, 2018 at 06:38:59PM +0000, Alan Cox wrote:
> On Sat, 6 Jan 2018 10:13:33 -0800
> Alexei Starovoitov <[email protected]> wrote:
>
> > On Sat, Jan 06, 2018 at 12:32:42PM +0000, Alan Cox wrote:
> > > On Fri, 5 Jan 2018 18:52:07 -0800
> > > Linus Torvalds <[email protected]> wrote:
> > >
> > > > On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
> > > > > From: Andi Kleen <[email protected]>
> > > > >
> > > > > When access_ok fails we should always stop speculating.
> > > > > Add the required barriers to the x86 access_ok macro.
> > > >
> > > > Honestly, this seems completely bogus.
> > >
> > > Also for x86-64 if we are trusting that an AND with a constant won't get
> > > speculated into something else surely we can just and the address with ~(1
> > > << 63) before copying from/to user space ? The user will then just
> > > speculatively steal their own memory.
> >
> > +1
> >
> > Any type of straight line code can address variant 1.
> > Like changing:
> > array[index]
> > into
> > array[index & mask]
> > works even when 'mask' is a variable.
>
> That statement is unfortunately not one that we currently believe is
> true for all architectures, platforms and implementations. It may be true
> for some architectures but processors can speculate on more than just
> execution paths. For some architecutres it may be the right way to
> implement Linus array_* methods.
please name one where cpu will ignore above '&' and speculate
load from array based on 'index' only.
cpus execute what they see. speculative execution does the same
except results are not committed to visible registers and stay
in the renamed/shadow set. There is no 'undo' of the speculative execution.
The whole issue is that cache and branch predictor don't have
a shadow unlike registers.
> > I think "lets sprinkle lfence everywhere" approach is going to
> > cause serious performance degradation. Yet people pushing for lfence
> > didn't present any numbers.
>
> Normally people who propose security fixes don't have to argue about the
> fact they added 30 clocks to avoid your box being 0wned.
>
> If you look at the patches very very few are in remotely hot paths, which
> is good news. The ones in hot paths definitely need careful scrutiny but
> the priority ought to be fixing and then optimizing unless its obvious
> how to tackle it.
fdtable, access_ok and networking are clearly hot, no?
On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
<[email protected]> wrote:
[..]
>> retpoline is variant-2, this patch series is about variant-1.
>
> that's exactly the point. Don't slow down the kernel with lfences
> to solve variant 1. retpoline for 2 is ok from long term kernel
> viability perspective.
>
Setting aside that we still need to measure the impact of these
changes the end result will still be nospec_array_ptr() sprinkled in
various locations. So can we save the debate about what's inside that
macro on various architectures and at least proceed with annotating
the problematic locations? Perhaps we can go a step further and have a
config option to switch between the clever array_access() approach
from Linus that might be fine depending on the compiler, and the
cpu-vendor-recommended not to speculate implementation of
nospec_array_ptr().
On 01/05/18 17:09, Dan Williams wrote:
> Quoting Mark's original RFC:
> [..]
> These patches will also be available via the 'nospec' git branch
> here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec
Although I suppose -stable and distribution maintainers will keep a
close eye on these patches, is there a particular reason why they don't
include the relevant CVE number in their commit messages?
It sounds like Coverity was used to produce these patches? If so, is
there a plan to have smatch (hey Dan) or other open source static
analysis tool be possibly enhanced to do a similar type of work?
Thanks!
--
Florian
> It sounds like Coverity was used to produce these patches? If so, is
> there a plan to have smatch (hey Dan) or other open source static
> analysis tool be possibly enhanced to do a similar type of work?
I'd love for that to happen; the tricky part is being able to have even a
sort of sensible concept of a "trusted" vs "untrusted" value...
if you look at a very small window of code, that does not work well;
you likely need, as a tool, to even look across .c file boundaries.
On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
> On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
> <[email protected]> wrote:
> [..]
> >> retpoline is variant-2, this patch series is about variant-1.
> >
> > that's exactly the point. Don't slow down the kernel with lfences
> > to solve variant 1. retpoline for 2 is ok from long term kernel
> > viability perspective.
> >
>
> Setting aside that we still need to measure the impact of these
> changes the end result will still be nospec_array_ptr() sprinkled in
> various locations. So can we save the debate about what's inside that
> macro on various architectures and at least proceed with annotating
> the problematic locations? Perhaps we can go a step further and have a
> config option to switch between the clever array_access() approach
> from Linus that might be fine depending on the compiler, and the
> cpu-vendor-recommended not to speculate implementation of
> nospec_array_ptr().
recommended by panicking vendors who had no better ideas?
Ohh, speculation is exploitable, let's stop speculation.
Instead of fighting it we can safely steer it to where it doesn't leak
kernel data. The AND approach is doing exactly that.
It can probably be made independent of compiler choice by using a
setbe-like insn.
On Sat, Jan 6, 2018 at 11:25 AM, Alexei Starovoitov
<[email protected]> wrote:
> On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
>> On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
>> <[email protected]> wrote:
>> [..]
>> >> retpoline is variant-2, this patch series is about variant-1.
>> >
>> > that's exactly the point. Don't slow down the kernel with lfences
>> > to solve variant 1. retpoline for 2 is ok from long term kernel
>> > viability perspective.
>> >
>>
>> Setting aside that we still need to measure the impact of these
>> changes the end result will still be nospec_array_ptr() sprinkled in
>> various locations. So can we save the debate about what's inside that
>> macro on various architectures and at least proceed with annotating
>> the problematic locations? Perhaps we can go a step further and have a
>> config option to switch between the clever array_access() approach
>> from Linus that might be fine depending on the compiler, and the
>> cpu-vendor-recommended not to speculate implementation of
>> nospec_array_ptr().
>
> recommended by panicing vendors who had no better ideas?
> Ohh, speculation is exploitable, let's stop speculation.
> Instead of fighting it we can safely steer it where it doesn't leak
> kernel data. AND approach is doing exactly that.
> It probably can be made independent of compiler choice to use setbe-like insn.
Right, when that 'probably' is 'certainly' for the architecture you
care about just update the nospec_array_ptr() definition at that
point.
On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams <[email protected]> wrote:
> Quoting Mark's original RFC:
> [..]
> These patches will also be available via the 'nospec' git branch
> here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec
It appears that git.kernel.org has not mirrored out the new branch. In
the meantime here's an alternative location:
https://github.com/djbw/linux.git nospec
If there are updates to these patches they will appear in nospec-v2,
nospec-v3, etc... branches.
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
> > On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
> > <[email protected]> wrote:
> > [..]
> > >> retpoline is variant-2, this patch series is about variant-1.
> > >
> > > that's exactly the point. Don't slow down the kernel with lfences
> > > to solve variant 1. retpoline for 2 is ok from long term kernel
> > > viability perspective.
> > >
> >
> > Setting aside that we still need to measure the impact of these
> > changes, the end result will still be nospec_array_ptr() sprinkled in
> > various locations. So can we save the debate about what's inside that
> > macro on various architectures and at least proceed with annotating
> > the problematic locations? Perhaps we can go a step further and have a
> > config option to switch between the clever array_access() approach
> > from Linus that might be fine depending on the compiler, and the
> > cpu-vendor-recommended not to speculate implementation of
> > nospec_array_ptr().
>
> recommended by panicking vendors who had no better ideas?
> Ohh, speculation is exploitable, let's stop speculation.
> Instead of fighting it we can safely steer it where it doesn't leak
> kernel data. The AND approach is doing exactly that.
For one particular architecture and that's not a solution for generic
code.
Aside from that, I fundamentally disagree with your purely performance
optimized argumentation. We need to make sure that we have a solution which
kills the problem safely and then take it from there. Correctness first,
optimization later is the rule for this. Better safe than sorry.
Thanks,
tglx
> cpus execute what they see. speculative execution does the same
> except results are not committed to visible registers and stay
> in renamed/shadow set. There is no 'undo' of the speculative execution.
> The whole issue is that cache and branch predictor don't have
> a shadow unlike registers.
Can I suggest you read something like "Exploiting Value Locality to Exceed
The Dataflow Limit" by Lipasti and Shen 1996.
In other words there are at least two problems with Linus proposal
1. The ffff/0000 mask has to be generated and that has to involve
speculative flows.
2. There are processors on the planet that may speculate not just what
instruction to execute but, faced with a stall on an input, continue by
using an educated guess at the value that will appear at the input in
future.
> > > I think "lets sprinkle lfence everywhere" approach is going to
> > > cause serious performance degradation. Yet people pushing for lfence
> > > didn't present any numbers.
> >
> > Normally people who propose security fixes don't have to argue about the
> > fact they added 30 clocks to avoid your box being 0wned.
> >
> > If you look at the patches very very few are in remotely hot paths, which
> > is good news. The ones in hot paths definitely need careful scrutiny but
> > the priority ought to be fixing and then optimizing unless its obvious
> > how to tackle it.
>
> fdtable, access_ok and networking are clearly hot,
Those are not, when compared to a lot of paths executed far far
more (like packet processing). The ipv4/6 ones are mixing lfences and a
lot of stores so probably hurt with lfence. fdtable worries me
a lot less and is anyway clearly required.
I have the packet building ones on my TODO list to review for a reason.
Alan
On Sat, Jan 6, 2018 at 11:37 AM, Dan Williams <[email protected]> wrote:
> On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams <[email protected]> wrote:
>> Quoting Mark's original RFC:
>>
>> "Recently, Google Project Zero discovered several classes of attack
>> against speculative execution. One of these, known as variant-1, allows
>> explicit bounds checks to be bypassed under speculation, providing an
>> arbitrary read gadget. Further details can be found on the GPZ blog [1]
>> and the Documentation patch in this series."
>>
>> This series incorporates Mark Rutland's latest api and adds the x86
>> specific implementation of nospec_barrier. The
>> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
>> wide analysis performed by Elena Reshetova to address static analysis
>> reports where speculative execution on a userspace controlled value
>> could bypass a bounds check. The patches address a precondition for the
>> attack discussed in the Spectre paper [2].
>>
>> A consideration worth noting for reviewing these patches is to weigh the
>> dramatic cost of being wrong about whether a given report is exploitable
>> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
>> let's make the bar for applying these patches be "can you prove that the
>> bounds check bypass is *not* exploitable". Consider that the Spectre
>> paper reports one example of a speculation window being ~180 cycles.
>>
>> Note that there is also a proposal from Linus, array_access [3], that
>> attempts to quash speculative execution past a bounds check without
>> introducing an lfence instruction. That may be a future optimization
>> possibility that is compatible with this api, but it would appear to
>> need guarantees from the compiler that it is not clear the kernel can
>> rely on at this point. It is also not clear that it would be a
>> significant performance win vs lfence.
>>
>> These patches will also be available via the 'nospec' git branch
>> here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux nospec
>
> It appears that git.kernel.org has not mirrored out the new branch. In
> the meantime here's an alternative location:
>
> https://github.com/djbw/linux.git nospec
>
> If there are updates to these patches they will appear in nospec-v2,
> nospec-v3, etc... branches.
For completeness I appended the bpf fix [1] to the git branch.
https://lwn.net/Articles/743288/
On Sat, Jan 06, 2018 at 07:55:51PM +0000, Alan Cox wrote:
> > cpus execute what they see. speculative execution does the same
> > except results are not committed to visible registers and stay
> > in renamed/shadow set. There is no 'undo' of the speculative execution.
> > The whole issue is that cache and branch predictor don't have
> > a shadow unlike registers.
>
> Can I suggest you read something like "Exploiting Value Locality to Exceed
> The Dataflow Limit" by Lipasti and Shen 1996.
thanks for the pointer.
A quote from above paper:
"Value prediction consists of predicting entire 32- and 64-bit register values
based on previously-seen values"
> In other words there are at least two problems with Linus proposal
>
> 1. The ffff/0000 mask has to be generated and that has to involve
> speculative flows.
to answer above and Thomas's
"For one particular architecture and that's not a solution for generic code."
The following:
#define array_access(base, idx, max) ({ \
union { typeof(base[0]) _val; unsigned long _bit; } __u;\
unsigned long _i = (idx); \
unsigned long _m = (max); \
unsigned long _mask = ~(long)(_m - 1 - _i) >> 63; \
__u._val = base[_i & _mask]; \
__u._bit &= _mask; \
__u._val; })
is generic and has no speculative flows.
> 2. There are processors on the planet that may speculate not just what
> instruction to execute but, faced with a stall on an input, continue by
> using an educated guess at the value that will appear at the input in
> future.
correct. that's why earlier I mentioned that "if 'mask' cannot
be influenced by attacker".
Even if 'mask' in the 'index & mask' example is stalled, the educated
guess will come from the prior value (according to the quoted paper).
To be honest I haven't read that particular paper in the past, but
the abstract fits my understanding and this array_access() proposal.
Thanks for the pointer. Will read it fully to make sure
I didn't miss anything.
> "Value prediction consists of predicting entire 32- and 64-bit register values
> based on previously-seen values"
For their implementation yes
>
> > In other words there are at least two problems with Linus proposal
> >
> > 1. The ffff/0000 mask has to be generated and that has to involve
> > speculative flows.
>
> to answer above and Thomas's
> "For one particular architecture and that's not a solution for generic code."
>
> The following:
> #define array_access(base, idx, max) ({ \
> union { typeof(base[0]) _val; unsigned long _bit; } __u;\
> unsigned long _i = (idx); \
> unsigned long _m = (max); \
> unsigned long _mask = ~(long)(_m - 1 - _i) >> 63; \
> __u._val = base[_i & _mask]; \
> __u._bit &= _mask; \
> __u._val; })
>
> is generic and has no speculative flows.
In the value speculation case imagine it's been called 1000 times for
process A which has a limit of say 16 so that file->priv->max is 16, and
then is run for process B which is different.
A value speculating processor waiting for file->priv->max which has been
pushed out of cache by an attacker is at liberty to say "I've no idea
what max is but hey it was 16 last time so let's plug 16 in and keep going".
So while the change in the mask computation is clever and (subject to
compiler cleverness) safe against guesses of which path will be taken I
don't think it's generically safe.
Unfortunately a lot of things we index are of different sizes as seen by
different tasks, or when passed different other index values so this does
matter.
> Even if 'mask' in the 'index & mask' example is stalled, the educated
> guess will come from the prior value (according to the quoted paper).
Which might be for a different set of variables when the table is say per
process like file handles, or the value is different each call.
If we have a single array of fixed size then I suspect you are right but
usually we don't.
Alan
On Sat, Jan 06, 2018 at 06:38:59PM +0000, Alan Cox wrote:
> Normally people who propose security fixes don't have to argue about the
> fact they added 30 clocks to avoid your box being 0wned.
In fact it depends, because if a fix makes the system unusable for its
initial purpose, this fix will simply not be deployed at all, which is
the worst that can happen. Especially when it cannot be disabled by
config and people stop updating their systems to stay on the last
"known good" version. Fortunately in Linux we often have the choice so
that users rarely have a valid reason for not upgrading!
Willy
On Sat, Jan 06, 2018 at 08:22:13PM +0000, Alan Cox wrote:
> > "Value prediction consists of predicting entire 32- and 64-bit register values
> > based on previously-seen values"
>
> For their implementation yes
>
> >
> > > In other words there are at least two problems with Linus proposal
> > >
> > > 1. The ffff/0000 mask has to be generated and that has to involve
> > > speculative flows.
> >
> > to answer above and Thomas's
> > "For one particular architecture and that's not a solution for generic code."
> >
> > The following:
> > #define array_access(base, idx, max) ({ \
> > union { typeof(base[0]) _val; unsigned long _bit; } __u;\
> > unsigned long _i = (idx); \
> > unsigned long _m = (max); \
> > unsigned long _mask = ~(long)(_m - 1 - _i) >> 63; \
> > __u._val = base[_i & _mask]; \
> > __u._bit &= _mask; \
> > __u._val; })
> >
> > is generic and has no speculative flows.
>
> In the value speculation case imagine it's been called 1000 times for
> process A which has a limit of say 16 so that file->priv->max is 16, and
> then is run for process B which is different.
>
> A value speculating processor waiting for file->priv->max which has been
> pushed out of cache by an attacker is at liberty to say "I've no idea
> what max is but hey it was 16 last time so let's plug 16 in and keep going".
>
> So while the change in the mask computation is clever and (subject to
> compiler cleverness) safe against guesses of which path will be taken I
> don't think it's generically safe.
>
> Unfortunately a lot of things we index are of different sizes as seen by
> different tasks, or when passed different other index values so this does
> matter.
>
> > Even if 'mask' in the 'index & mask' example is stalled, the educated
> > guess will come from the prior value (according to the quoted paper).
>
> Which might be for a different set of variables when the table is say per
> process like file handles, or the value is different each call.
>
> If we have a single array of fixed size then I suspect you are right but
> usually we don't.
Thanks. I see your point. Agree on the above.
The variant 1 exploit does 2000 bytes a second using 64-bit address math.
Things like 'fd' are 32-bit, so it's an order of magnitude higher attack
complexity already (without any kernel changes).
If we do the above array_access() the exploit complexity increases even more.
Moreover the attacker would need to train fdt->max_fds on a known
good fdt with millions of files for 100s of iterations only to do
one speculative access on another fdt with small max_fds
(to exploit value speculation from the large max_fds)
while keeping the cache line for that speculative out-of-bounds access on
the small fdt cold and measuring cache load times on another cpu.
I frankly don't see such an attack being able to keep cache lines pristine
for that small fdt speculation while doing hundreds of non-speculative
accesses on another fdt. Way too many moving pieces.
Even if it were practical the speed is probably going to be in bytes per second,
so before it can read anything meaningful, attack detection techniques (that people
are actively working on) will be able to catch it.
At the end security cannot be absolute.
The current level of paranoia shouldn't force us to make hasty decisions.
So how about we do an array_access() macro similar to the above by default
with an extra CONFIG_ to convert it to lfence?
Why default to the AND approach instead of lfence?
Because the kernel should still be usable. If security
sacrifices performance so much, such security will be turned off.
Ex: kpti is supposed to add 5-30%. If it means 10% on production workload
and the datacenter capacity cannot grow 10% overnight, kpti will be off.
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> So how about we do an array_access() macro similar to the above by default
> with an extra CONFIG_ to convert it to lfence?
> Why default to the AND approach instead of lfence?
> Because the kernel should still be usable. If security
> sacrifices performance so much, such security will be turned off.
> Ex: kpti is supposed to add 5-30%. If it means 10% on production workload
> and the datacenter capacity cannot grow 10% overnight, kpti will be off.
That's the decision and responsibility of the person who disables it.
Thanks,
tglx
> Even if it were practical the speed is probably going to be in bytes per second,
> so before it can read anything meaningful, attack detection techniques (that people
> are actively working on) will be able to catch it.
> At the end security cannot be absolute.
> The current level of paranoia shouldn't force us to make hasty decisions.
I think there are at least three overlapping problem spaces here
1. This is a new field. That could mean that it turns out to be
really hard and everyone discovers that eBPF was pretty much the only
interesting attack. It could also mean we are going to see several years
of refinement by evil geniuses all over the world and what we see now is
the tip of the iceberg in cleverness.
2. It is very very complicated to answer a question like "is
sequence x safe on all of a vendor's microprocessors" even for the vendor
3. A lot of people outside of the professional security space are
very new to the concept of security economics and risk management as
opposed to seeing the fake binary nice green tick that says their
computers are secure that they can pass to their senior management or
show to audit.
> So how about we do an array_access() macro similar to the above by default
> with an extra CONFIG_ to convert it to lfence?
We don't have to decide today. Intel's current position is 'lfence'. Over
time we may see vendors provide more sequences. We will see vendors add
new instruction hints and the like (See the ARM whitepaper for example)
and the array access code will change and change again for the better.
The important thing is that there is something clean and all-architecture
that can be used today that doesn't keep forcing everyone to change
drivers when new/better ways to do the speculation management appear.
> Why default to the AND approach instead of lfence?
> Because the kernel should still be usable. If security
> sacrifices performance so much, such security will be turned off.
> Ex: kpti is supposed to add 5-30%. If it means 10% on production workload
> and the datacenter capacity cannot grow 10% overnight, kpti will be off.
The differences involved on the "lfence" versus "and" versus before are
not likely to be anywhere in that order of magnitude. As I said I want to
take a hard look at the IPv4/6 ones but most of them are not in places
where you'd expect a lot of data to be in flight in a perf critical path.
Alan
On Fri, Jan 5, 2018 at 7:09 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Jan 5, 2018 at 6:52 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> The fact is, we have to stop speculating when access_ok() does *not*
>> fail - because that's when we'll actually do the access. And it's that
>> access that needs to be non-speculative.
>
> I also suspect we should probably do this entirely differently.
>
> Maybe the whole lfence can be part of uaccess_begin() instead (ie
> currently 'stac()'). That would fit the existing structure better, I
> think. And it would avoid any confusion about the whole "when to stop
> speculation".
I assume if we put this in uaccess_begin() we also need to audit
paths that use access_ok() but don't go on to call uaccess_begin()? A
quick glance shows a few places where we are open coding the stac().
Perhaps land the lfence in stac() directly?
On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams <[email protected]> wrote:
>
> I assume if we put this in uaccess_begin() we also need to audit
> paths that use access_ok() but don't go on to call uaccess_begin()? A
> quick glance shows a few places where we are open coding the stac().
> Perhaps land the lfence in stac() directly?
Yeah, we should put it in uaccess_begin(), and in the actual user
accessor helpers that do stac. Some of them probably should be changed
to use uaccess_begin() instead while at it.
One question for the CPU people: do we actually care and need to do
this for things that might *write* to something? The speculative write
obviously is killed, but does it perhaps bring in a cacheline even
when killed?
Because maybe we don't need the lfence in put_user(), only in get_user()?
Linus
From: Willy Tarreau <[email protected]>
Date: Sat, 6 Jan 2018 21:42:29 +0100
> On Sat, Jan 06, 2018 at 06:38:59PM +0000, Alan Cox wrote:
>> Normally people who propose security fixes don't have to argue about the
>> fact they added 30 clocks to avoid your box being 0wned.
>
> In fact it depends, because if a fix makes the system unusable for its
> initial purpose, this fix will simply not be deployed at all, which is
> the worst that can happen.
+1
I completely agree with Willy and Alexei.
And the scale isn't even accurate, we're talking about at least
hundreds upon hundreds of clocks, not 30, if we add an operation whose
side effect is to wait for all pending loads to complete. So yeah
this is going to be heavily scrutinized.
On Sat, Jan 06, 2018 at 11:05:07PM +0000, Alan Cox wrote:
> > Even if it were practical the speed is probably going to be in bytes per second,
> > so before it can read anything meaningful, attack detection techniques (that people
> > are actively working on) will be able to catch it.
> > At the end security cannot be absolute.
> > The current level of paranoia shouldn't force us to make hasty decisions.
>
> I think there are at least three overlapping problem spaces here
>
> 1. This is a new field. That could mean that it turns out to be
> really hard and everyone discovers that eBPF was pretty much the only
> interesting attack. It could also mean we are going to see several years
> of refinement by evil geniuses all over the world and what we see now is
> the tip of the iceberg in cleverness.
yep. plenty of unknowns and what's happening now is an overreaction.
> 2. It is very very complicated to answer a question like "is
> sequence x safe on all of a vendor's microprocessors" even for the vendor
so far "is sequence x safe" was viewed by cpu vendors as
"is sequence x going to stop speculative execution".
The AND approach wasn't even considered and I suspect there may be
other ways to avoid the side channel attack, but vendors are too
focused on stopping speculation. Why? The vulnerability has already
been disclosed in detail. Public clouds have already patched their kernels.
What is the rush to push half-baked patches upstream
that don't even address variant 1?
> Intel's current position is 'lfence'.
Let's look at the facts. From GPZ blog:
"In simplified terms, this program is used to determine the address
of prog_map by guessing the offset from prog_map to a userspace
address and tail-calling through prog_map at the guessed offsets."
which clearly states that bpf_tail_call() was used in the attack.
Yet none of the intel nor arm patches address speculation in
this bpf helper!
It means that:
- gpz shared neither the exploit nor the detailed description
of the POC with cpu vendors until now
- coverity rules used to find all these places in the kernel
failed to find bpf_tail_call
- cpu vendors were speculating what variant 1 can actually do
Now the attack is well described, yet cpu vendors are still pushing
for lfence patches that don't make sense. Why?
Furthermore GPZ blog states:
"To cause the branch prediction to predict that the offset is below
the length of prog_map, tail calls to an in-bounds index are
performed in between."
which means that the attack is 'in-bound * largeN + out-of-bound * 1'.
Going back to our discussion a few emails ago about 'mask' as a variable,
it means that the value predictor, when speculating this last out-of-bounds
access, will correctly predict the exact 'mask', so the 'index & mask'
fix will certainly defeat this poc.
More from the blog:
"To increase the mis-speculation window, the cache line containing
the length of prog_map is bounced to another core."
and in the kernel we have:
struct bpf_map {
atomic_t refcnt;
enum bpf_map_type map_type;
u32 key_size;
u32 value_size;
u32 max_entries;
...
note that only 'refcnt' could be used for bouncing,
so if we place max_entries and refcnt into different cache
lines this poc will likely fail as well.
More from the blog:
"At this point, a second eBPF program can be used to actually leak data.
In pseudocode, this program looks as follows:
uint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;
bpf_tail_call(prog_map, progmap_index);
"
which means that the POC is relying on 64-bit address speculation.
In the places coverity found, the user supplied value is 32-bit,
so none of them are suitable for this POC.
We cannot rule out that the attack can be crafted to work
with 32-bit, but it means there is no urgency to lfence everything.
> The important thing is that there is something clean and all-architecture
> that can be used today that doesn't keep forcing everyone to change
> drivers when new/better ways to do the speculation management appear.
What I think is important is to understand vulnerability first.
I don't think it was done.
> The differences involved on the "lfence" versus "and" versus before are
> not likely to be anywhere in that order of magnitude.
we clearly disagree here. Both intel and arm patches proposed
to add lfence in bpf_map_lookup() which is the hottest function
we have and we do run it at 40+Gbps speeds where every nanosecond
counts, so no, lfence is not a solution.
On Sat, Jan 06, 2018 at 07:38:14PM -0800, Alexei Starovoitov wrote:
> yep. plenty of unknowns and what's happening now is an overreaction.
To be fair there's overreaction on both sides. The vast majority of
users need to get a 100% safe system and will never notice any
difference. A few of us are more concerned by the risk of performance
loss brought by these fixes. We do all need to run tests on the
patchsets to bring numbers on the table.
> What is the rush to push half-baked patches upstream
> that don't even address variant 1?
You probably need to trust the CPU vendors a bit for having had all
the details about the risks for a few months now and accept that if
they're destroying their product's performance compared to their main
competitor, they probably studied a number of other alternatives first.
It doesn't mean they thought about everything of course, and maybe they
need to study your proposal as a better solution to reduce criticism.
> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
> It means that:
> - gpz shared neither the exploit nor the detailed description
> of the POC with cpu vendors until now
Or it was considered less urgent to fix compared to the rest. It's
also possible that the scariest details were not written in the GPZ
article.
> Now the attack is well described, yet cpu vendors are still pushing
> for lfence patches that don't make sense. Why?
Imagine if you were in their position and were pushing a solution which
would later be found to be inefficient and to be vulnerable again. Your
name would appear twice in the press in a few months, this would be
terrible. It makes sense in their position to find the safest fix first
given that those like you or me who are really concerned about the performance
impact know how to add an option to a boot loader or revert a patch that
causes trouble.
> What I think is important is to understand vulnerability first.
> I don't think it was done.
I suspect it was clearly done by those who had no other option but
to work on this painful series over the last few months :-/
> > The differences involved on the "lfence" versus "and" versus before are
> > not likely to be anywhere in that order of magnitude.
>
> we clearly disagree here. Both intel and arm patches proposed
> to add lfence in bpf_map_lookup() which is the hottest function
> we have and we do run it at 40+Gbps speeds where every nanosecond
> counts, so no, lfence is not a solution.
Please help here by testing the patch series and reporting numbers
before, with the fix, and with your proposal. That's the best way to
catch their attention and to get your proposal considered as a viable
alternative (or as a partial one for certain environments). I did the
same when I believed it could be sufficient to add noise to RDTSC and
found it totally inefficient after testing. But it's better for
everyone that research and testing is done rather than criticizing the
proposed fixes (which is not fair to the people who work hard on them
for everyone else).
Cheers,
Willy
On Sat, Jan 06, 2018 at 09:41:17AM -0800, Dan Williams wrote:
> On Sat, Jan 6, 2018 at 1:40 AM, Greg KH <[email protected]> wrote:
> > On Sat, Jan 06, 2018 at 10:09:07AM +0100, Greg KH wrote:
> >> On Fri, Jan 05, 2018 at 05:10:32PM -0800, Dan Williams wrote:
> >> > Static analysis reports that 'index' may be a user controlled value that
> >> > is used as a data dependency to read 'pin' from the
> >> > 'selector->baSourceID' array. In order to avoid potential leaks of
> >> > kernel memory values, block speculative execution of the instruction
> >> > stream that could issue reads based on an invalid value of 'pin'.
> >> >
> >> > Based on an original patch by Elena Reshetova.
> >> >
> >> > Cc: Laurent Pinchart <[email protected]>
> >> > Cc: Mauro Carvalho Chehab <[email protected]>
> >> > Cc: [email protected]
> >> > Signed-off-by: Elena Reshetova <[email protected]>
> >> > Signed-off-by: Dan Williams <[email protected]>
> >> > ---
> >> > drivers/media/usb/uvc/uvc_v4l2.c | 7 +++++--
> >> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
> >> > index 3e7e283a44a8..7442626dc20e 100644
> >> > --- a/drivers/media/usb/uvc/uvc_v4l2.c
> >> > +++ b/drivers/media/usb/uvc/uvc_v4l2.c
> >> > @@ -22,6 +22,7 @@
> >> > #include <linux/mm.h>
> >> > #include <linux/wait.h>
> >> > #include <linux/atomic.h>
> >> > +#include <linux/compiler.h>
> >> >
> >> > #include <media/v4l2-common.h>
> >> > #include <media/v4l2-ctrls.h>
> >> > @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
> >> > struct uvc_entity *iterm = NULL;
> >> > u32 index = input->index;
> >> > int pin = 0;
> >> > + __u8 *elem;
> >> >
> >> > if (selector == NULL ||
> >> > (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
> >> > @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
> >> > break;
> >> > }
> >> > pin = iterm->id;
> >> > - } else if (index < selector->bNrInPins) {
> >> > - pin = selector->baSourceID[index];
> >> > + } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
> >> > + selector->bNrInPins))) {
> >> > + pin = *elem;
> >>
> >> I dug through this before, and I couldn't find where index came from
> >> userspace, I think seeing the coverity rule would be nice.
> >
> > Ok, I take it back, this looks correct. Ugh, the v4l ioctl api is
> > crazy complex (rightfully so), it's amazing that coverity could navigate
> > that whole thing :)
> >
> > While I'm all for fixing this type of thing, I feel like we need to do
> > something "else" for this as playing whack-a-mole for this pattern is
> > going to be a never-ending battle for all drivers for forever. Either
> > we need some way to mark this data path to make it easy for tools like
> > sparse to flag easily, or we need to catch the issue in the driver
> > subsystems, which, unfortunately, would harm the drivers that don't have
> > this type of issue (like here.)
> >
> > I'm guessing that other operating systems, which don't have the luxury
> > of auditing all of their drivers are going for the "big hammer in the
> > subsystem" type of fix, right?
> >
> > I don't have a good answer for this, but if there was some better way to
> > rewrite these types of patterns to just prevent the need for the
> > nospec_array_ptr() type thing, that might be the best overall for
> > everyone. Much like ebpf did with their changes. That way a simple
> > coccinelle rule would be able to catch the pattern and rewrite it.
> >
> > Or am I just dreaming?
>
> At least on the coccinelle front you're dreaming. Julia already took a
> look and said:
>
> "I don't think Coccinelle would be good for doing this (ie
> implementing taint analysis) because the dataflow is too complicated."
Sorry for the confusion, no, I don't mean the "taint tracking", I mean
the generic pattern of "speculative out of bounds access" that we are
fixing here.
Yes, as you mentioned before, there are tons of false-positives in the
tree, as to find the real problems you have to show that userspace
controls the access index. But if we have a generic pattern that can
rewrite that type of logic into one where it does not matter at all
(i.e. like the ebpf proposed changes), then it would not be an issue if
they are false or not, we just rewrite them all to be safe.
We need to find some way not only to fix these issues now (like you are
doing with this series), but to prevent them from ever coming back into
the codebase again. It's that second part that we need to keep in the
back of our minds here, while doing the first portion of this work.
> Perhaps the Coverity instance Dave mentioned at Ksummit 2012 has a
> role to play here?
We have a coverity instance that all kernel developers have access to
(just sign up and we grant it.) We have at least one person working
full time on fixing up errors that this instance reports. So if we
could get those rules added (which is why I asked for them), it would be
a great first line of defense to prevent the "adding new problems" issue
from happening right now for the 4.16-rc1 merge window.
thanks,
greg k-h
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
> It means that:
> - gpz shared neither the exploit nor the detailed description
> of the POC with cpu vendors until now
> - coverity rules used to find all these places in the kernel
> failed to find bpf_tail_call
> - cpu vendors were speculating what variant 1 can actually do
You forgot to mention that there might be other attacks than the public POC
which are not covered by a simple AND ....
Thanks,
tglx
> > 2. It is very very complicated to answer a question like "is
> > sequence x safe on all of vendor's microprocessors" even for the vendor
>
> so far "is sequence x safe" was viewed by cpu vendors as
> "is sequence x going to stop speculative execution".
Incorrect. Modern processors are very very sophisticated beasts and you
are dealing with a wide range of architectures and micro-architectures
that all have their own differences.
> > Intel's current position is 'lfence'.
> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
There are a set of patches under discussion for eBPF both the JIT and
interpreter. BTW I would expect that there are things Coverity didn't
find, and that we'll also see variants on the attack where different
tricks for measurement emerge that may change what is needed a little.
> which means that the POC is relying on 64-bit address speculation.
> In the places coverity found the user supplied value is 32-bit,
People have 32bit computers as well as 64bit and in some cases 32bit is
fine for an attack depending on where your target is relative to the object.
> > The differences involved on the "lfence" versus "and" versus before are
> > not likely to be anywhere in that order of magnitude.
>
> we clearly disagree here. Both intel and arm patches proposed
> to add lfence in bpf_map_lookup() which is the hottest function
> we have and we do run it at 40+Gbps speeds where every nanosecond
> counts, so no, lfence is not a solution.
The last solution I saw proposed in that patch for the JIT is to "and" with a
constant, which in that situation clearly makes the most sense, provided
it is safe on all the CPUs involved.
lfence timing is also heavily dependent upon what work has to be done to
retire previous live instructions. BPF does not normally do a lot of
writing so you'd expect the cost to be low.
Anyway - Intel's current position is that lfence is safe. It's not
impossible other sequences will in future be documented as safe by one or
more vendors of x86 processors.
Alan
On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote:
> From: Willy Tarreau <[email protected]>
> Date: Sat, 6 Jan 2018 21:42:29 +0100
>
> > On Sat, Jan 06, 2018 at 06:38:59PM +0000, Alan Cox wrote:
> >> Normally people who propose security fixes don't have to argue about the
> >> fact they added 30 clocks to avoid your box being 0wned.
> >
> > In fact it depends, because if a fix makes the system unusable for its
> > initial purpose, this fix will simply not be deployed at all, which is
> > the worst that can happen.
>
> +1
>
> I completely agree with Willy and Alexei.
>
> And the scale isn't even accurate, we're talking about at least
> hundreds upon hundreds of clocks, not 30, if we add an operation
> whose side effect is to wait for all pending loads to complete. So
> yeah this is going to be heavily scrutinized.
Plus this is the standard kernel code review MO: we've never blindly
accepted code just because *security* (otherwise we'd have grsec in by
now). We use the pushback to get better and more performant code.
What often happens is it turns out that the "either security or
performance" position was a false dichotomy and there is a way of
fixing stuff that's acceptable (although not usually perfect) for
everyone. I'm not saying this always happens, but it is reasonable to
let the iterative pushback see if we can get to better code in this
case rather than trying to cut it off with the "because *security*"
argument.
James
On Sun, 7 Jan 2018, James Bottomley wrote:
> On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote:
> > From: Willy Tarreau <[email protected]>
> > Date: Sat, 6 Jan 2018 21:42:29 +0100
> >
> > > On Sat, Jan 06, 2018 at 06:38:59PM +0000, Alan Cox wrote:
> > >> Normally people who propose security fixes don't have to argue about the
> > >> fact they added 30 clocks to avoid your box being 0wned.
> > >
> > > In fact it depends, because if a fix makes the system unusable for its
> > > initial purpose, this fix will simply not be deployed at all, which is
> > > the worst that can happen.
> >
> > +1
> >
> > I completely agree with Willy and Alexei.
> >
> > And the scale isn't even accurate, we're talking about at least
> > hundreds upon hundreds of clocks, not 30, if we add an operation
> > whose side effect is to wait for all pending loads to complete. So
> > yeah this is going to be heavily scrutinized.
>
> Plus this is the standard kernel code review MO: we've never blindly
> accepted code just because *security* (otherwise we'd have grsec in by
> now). We use the pushback to get better and more performant code.
> What often happens is it turns out that the "either security or
> performance" position was a false dichotomy and there is a way of
> fixing stuff that's acceptable (although not usually perfect) for
> everyone. I'm not saying this always happens, but it is reasonable to
> let the iterative pushback see if we can get to better code in this
> case rather than trying to cut it off with the "because *security*"
> argument.
In principle I agree, though there are a few things to consider:
1) We do not have the time to discuss that for the next 6 months
2) Alexei's analysis is purely based on the public information of the google
zero folks. If it were complete and the only attack vector, all fine.
If not and I doubt it is, we're going to regret this decision faster
than we made it and this is not the kind of play field where we can
afford that.
The whole 'we know better and performance is key' attitude is what led to
this disaster in the first place. We should finally start to learn.
Can we please stop that and live with the extra cycles for a few months up
to the point where we get more informed answers to all these questions?
Thanks,
tglx
> everyone. I'm not saying this always happens, but it is reasonable to
> let the iterative pushback see if we can get to better code in this
> case rather than trying to cut it off with the "because *security*"
> argument.
>
I'm not arguing otherwise - I'd just prefer most users machines are
secure while we have the discussion and while we see what other
architectural tricks people can come up with
Alan
On Sun, Jan 7, 2018 at 1:09 AM, Greg KH <[email protected]> wrote:
[..]
> Sorry for the confusion, no, I don't mean the "taint tracking", I mean
> the generic pattern of "speculative out of bounds access" that we are
> fixing here.
>
> Yes, as you mentioned before, there are tons of false-positives in the
> tree, since to find the real problems you have to show that userspace
> controls the access index. But if we have a generic pattern that can
> rewrite that type of logic into one where it does not matter at all
> (i.e. like the ebpf proposed changes), then it would not be an issue if
> they are false or not, we just rewrite them all to be safe.
>
> We need to find some way not only to fix these issues now (like you are
> doing with this series), but to prevent them from ever coming back into
> the codebase again. It's that second part that we need to keep in the
> back of our minds here, while doing the first portion of this work.
I understand the goal, but I'm not sure any of our current annotation
mechanisms are suitable. We have:
__attribute__((noderef, address_space(x)))
...for the '__user' annotation and other pointers that must not be
de-referenced without a specific accessor. We also have:
__attribute__((bitwise))
...for values that should not be consumed directly without a specific
conversion like endian swapping.
The problem is that we need to see if a value derived from a userspace
controlled input is used to trigger a chain of dependent reads. As far
as I can see the annotation would need to be guided by taint analysis
to be useful, at which point we can just "annotate" the problem spot
with nospec_array_ptr(). Otherwise it seems the scope of a
"__nospec_array_index" annotation would have a low signal to noise
ratio.
Stopping speculation past a uaccess_begin() boundary appears to handle
a wide swath of potential problems, and the rest likely needs taint
analysis, at least for now.
All that to say, yes, we need better tooling and infrastructure going forward.
On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau <[email protected]> wrote:
>
> To be fair there's overreaction on both sides. The vast majority of
> users need to get a 100% safe system and will never notice any
> difference.
There is no such thing as a "100% safe system". Never will be - unless
you make sure you have no users.
Also, people definitely *are* noticing the performance issues with the
current set of patches, and they are causing real problems. Go search
for reports of Amazon AWS slowdowns.
So this whole "security is so important that performance doesn't
matter" mindset is pure and utter garbage.
And the whole "normal people won't even notice" is pure garbage too.
Don't spread that bullshit when you see actual normal people
complaining.
Performance matters. A *LOT*.
Linus
On Sun, Jan 07, 2018 at 11:47:07AM -0800, Linus Torvalds wrote:
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
>
> Performance matters. A *LOT*.
Linus, no need to explain that to me, I'm precisely trying to see how
to disable PTI for a specific process because I face up to 45% loss in
certain circumstances, making it a no-go. But while a few of us have
very specific workloads emphasizing this impact, others have very
different ones and will not notice. For example my laptop did boot
pretty fine and I didn't notice anything until I fired a network
benchmark.
Willy
On Sun, Jan 7, 2018 at 11:47 AM, Linus Torvalds
<[email protected]> wrote:
> On Sat, Jan 6, 2018 at 10:33 PM, Willy Tarreau <[email protected]> wrote:
>>
>> To be fair there's overreaction on both sides. The vast majority of
>> users need to get a 100% safe system and will never notice any
>> difference.
>
> There is no such thing as a "100% safe system". Never will be - unless
> you make sure you have no users.
>
> Also, people definitely *are* noticing the performance issues with the
> current set of patches, and they are causing real problems. Go search
> for reports of Amazon AWS slowdowns.
>
> So this whole "security is so important that performance doesn't
> matter" mindset is pure and utter garbage.
>
> And the whole "normal people won't even notice" is pure garbage too.
> Don't spread that bullshit when you see actual normal people
> complaining.
>
> Performance matters. A *LOT*.
I'm thinking we should provide the option to at least build the
hot-path nospec_array_ptr() usages without an lfence.
CONFIG_SPECTRE1_PARANOIA_SAFE
CONFIG_SPECTRE1_PARANOIA_PERF
...if only for easing performance testing and let the distribution set
its policy.
Where hot-path usages can do:
nospec_relax(nospec_array_ptr())
...to optionally elide the lfence.
On Sun, Jan 7, 2018 at 12:12 PM, Willy Tarreau <[email protected]> wrote:
>
> Linus, no need to explain that to me, I'm precisely trying to see how
> to disable PTI for a specific process because I face up to 45% loss in
> certain circumstances, making it a no-go. But while a few of us have
> very specific workloads emphasizing this impact, others have very
> different ones and will not notice. For example my laptop did boot
> pretty fine and I didn't notice anything until I fired a network
> benchmark.
Sure, most people have hardware where the bottleneck is entirely
elsewhere (slow network, rotating disk, whatever).
But this whole "normal people won't notice" is dangerous thinking.
They may well notice very much, we simply don't know what they are
doing.
Quite honestly, it's equally correct to say "normal people won't be
affected by the security issue in the first place".
That laptop that you didn't have any issues with? Likely it never had
an exploit running on it either!
So the whole "normal people" argument is pure and utter garbage. It's
wrong. It's pure shit when it comes to performance, but it's also pure
shit when it comes to the security issue.
Don't use it.
We need to fix the security problem, but we need to do it *without*
these braindead arguments that performance is somehow secondary.
Linus
On Sun, 7 Jan 2018, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.
I surely agree, but we have gone the way of PTI without the ability of
exempting individual processes exactly for one reason:
Lack of time
It can be done on top of the PTI implementation and it won't take ages.
For spectre_v1/2 we face the same problem simply because we got informed so
much ahead of time and we were all twiddling thumbs, enjoying our christmas
vacation and having a good time.
The exploits are out in the wild and they are observed already, so we
really have to make a decision whether we want to fix that in the most
obvious ways even if it hurts performance right now and then take a break
from all that hell and sit down and sort the performance mess or whether we
want to discuss the right way to fix it for the next 3 months and leave all
doors open until the last bit of performance is squeezed out.
I totally can understand the anger of those who learned all of this a few
days ago and are now grasping straws to avoid the obviously coming
performance hit, but it's not our fault that we have to make a decision
which cannot make everyone happy tomorrow.
If the general sentiment is that we have to fix the performance issue
first, then please let me know and I take 3 weeks of vacation right now.
Thanks,
tglx
On Sun, Jan 07, 2018 at 12:17:11PM -0800, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.
OK OK. At least we should have security by default and let people trade
it against performance if they want and understand the risk. People
never know when they're secure enough (security can't be measured) but
they know fairly well when they're not performant enough to take action
(most often changing the machine).
Willy
From: Thomas Gleixner <[email protected]>
Date: Sun, 7 Jan 2018 19:31:41 +0100 (CET)
> 2) Alexei's analysis is purely based on the public information of the google
> zero folks. If it were complete and the only attack vector, all fine.
>
> If not and I doubt it is, we're going to regret this decision faster
> than we made it and this is not the kind of play field where we can
> afford that.
Please state this more clearly.
Do you know about other attack vectors and just are not allowed to
talk about them?
Or is this, ironically, speculation?
On Sun, Jan 07, 2018 at 11:08:24AM +0100, Thomas Gleixner wrote:
> On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> > which clearly states that bpf_tail_call() was used in the attack.
> > Yet none of the intel nor arm patches address speculation in
> > this bpf helper!
> > It means that:
> > - gpz shared neither the exploit nor a detailed description
> > of the POC with cpu vendors until now
> > - coverity rules used to find all these places in the kernel
> > failed to find bpf_tail_call
> > - cpu vendors were speculating what variant 1 can actually do
>
> You forgot to mention that there might be other attacks than the public POC
> which are not covered by a simple AND ....
if above statement is not a pure speculation please share CVE number.
For variant[123] they've been reserved for months.
> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.
right. they were informed in a way that they failed to address
variant1 with pre-embargo and public patches.
> The exploits are out in the wild and they are observed already, so we
this statement contradicts everything that was publicly stated.
Or you're referring to 'exploit' at the end of spectre paper?
> want to discuss the right way to fix it for the next 3 month and leave all
> doors open until the last bit of performance is squeezed out.
Let's look at facts:
- Linus explains his array_access() idea
- lfence proponents quickly point out that it needs gcc to be smart
enough to emit setbe and go back to lfence patches
- I spent half an hour to tweak array_access() to be generic
on all archs and compiler independent
- lfence proponents point out that AND with a variable somehow
won't work, yet after further discussion it's actually fine due to
the nature of variant1 attack that needs to train predictor with in-bounds
access to mispredict with out-of-bounds speculative load
- then lfence proponents claim 'non public POC' not covered by AND
- all of it in the matter of 2 days.
- and now the argument that it will take 3 months to discuss a solution
without lfence, yet still no performance numbers for lfence
though 'people in the know' were twiddling thumbs for months.
That's just not cool.
From: Thomas Gleixner <[email protected]>
Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)
> I surely agree, but we have gone the way of PTI without the ability of
> exempting individual processes exactly for one reason:
>
> Lack of time
>
> It can be done on top of the PTI implementation and it won't take ages.
>
> For spectre_v1/2 we face the same problem simply because we got informed so
> much ahead of time and we were all twiddling thumbs, enjoying our christmas
> vacation and having a good time.
I just want to point out that this should be noted in history as a
case where all of this controlled disclosure stuff seems to have made
things worse rather than better.
Why is there so much haste and paranoia if supposedly some group of
people had all this extra time to think about and deal with this bug?
From what I've seen, every single time, the worse a problem is, the
more important it is to expose it to as many smart folks as possible.
And to do so as fast as possible.
And to me that means full disclosure immediately for the super high
level stuff like what we are dealing with here.
Think I'm nuts? Ok, then how did we fare any better by keeping this
junk under wraps for weeks if not months? (seriously, did responsible
people really know about this as far back as... June 2017?)
Controlled disclosure for high profile bugs seems to only achieve two
things:
1) Vendors can cover their butts and come up with deflection
strategies.
2) The "theatre" aspect of security can be maximized as much as
possible. We even have a pretty web site and cute avatars this
time!
None of this has anything to do with having time to come up with the
best possible implementation of a fix. You know, the technical part?
So after what appears to be as much as 6 months of deliberating the
very wise men in the special room said: "KPTI and lfence"
Do I get this right?
On Sun, Jan 07, 2018 at 12:15:40PM -0800, Dan Williams wrote:
>
> I'm thinking we should provide the option to at least build the
> hot-path nospec_array_ptr() usages without an lfence.
>
> CONFIG_SPECTRE1_PARANOIA_SAFE
> CONFIG_SPECTRE1_PARANOIA_PERF
SAFE vs PERF naming is problematic and misleading, since users don't
have the data to make a decision they will be forced to go with SAFE.
What is not safe about the array_access() macro with AND?
How does the lfence approach make it safer?
Only because lfence was blessed by intel earlier when
they couldn't figure out a different way?
How about:
CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE
> ...if only for easing performance testing and let the distribution set
> its policy.
>
> Where hot-path usages can do:
>
> nospec_relax(nospec_array_ptr())
AND approach doesn't prevent speculation hence nospec_ is an incorrect prefix.
Alan's "speculation management" terminology fits well here.
Can we keep the array_access() name and change it underneath to
either mask or lfence?
On Sun, Jan 07, 2018 at 01:59:35PM +0000, Alan Cox wrote:
> > which means that the POC is relying on 64-bit address speculation.
> > In the places coverity found the user supplied value is 32-bit,
>
> People have 32bit computers as well as 64bit and in some cases 32bit is
> fine for an attack depending on where your target is relative to the object.
right. absolutely correct on 32-bit archs.
The question is whether truncation to 32-bit is enough to work around
spectre1 on 64-bit archs.
I hope the researchers can clarify.
> lfence timing is also heavily dependent upon what work has to be done to
> retire previous live instructions.
> BPF does not normally do a lot of writing so you'd expect the cost to be low.
right. to retire previous loads. Not sure what 'not a lot of writing'
has to do with lfence.
Our XDP based DDOS mostly does reads with few writes for stats into maps,
whereas XDP based load balancer modifies every packet.
XDP is root only, so not relevant in the spectre context. Just clarifying
read vs writes in BPF.
On Sun, Jan 07, 2018 at 09:23:47PM -0500, David Miller wrote:
> From: Thomas Gleixner <[email protected]>
> Date: Sun, 7 Jan 2018 21:56:39 +0100 (CET)
>
> > I surely agree, but we have gone the way of PTI without the ability of
> > exempting individual processes exactly for one reason:
> >
> > Lack of time
> >
> > It can be done on top of the PTI implementation and it won't take ages.
> >
> > For spectre_v1/2 we face the same problem simply because we got informed so
> > much ahead of time and we were all twiddling thumbs, enjoying our christmas
> > vacation and having a good time.
>
> I just want to point out that this should be noted in history as a
> case where all of this controlled disclosure stuff seems to have made
> things worse rather than better.
I will note that the "controlled disclosure" for this thing was a total
and complete mess, and unlike any that I have ever seen in the past.
The people involved in running it had no idea how to do it at all, and
because of that, it failed miserably, despite being warned about it
numerous times by numerous people.
> Why is there so much haste and paranoia if supposedly some group of
> people had all this extra time to think about and deal with this bug?
Because that group was so small and isolated that they did not actually
talk to anyone who could actually provide input to help deal with the
bug.
So we are stuck now with dealing with this "properly", which is fine,
but please don't think that this is an excuse to blame "controlled
disclosure". We know how to do that correctly, it did not happen in
this case at all because the people driving the problem refused to do
it.
> Think I'm nuts? Ok, then how did we fare any better by keeping this
> junk under wraps for weeks if not months? (seriously, did responsible
> people really know about this as far back as... June 2017?)
Some "people" did, just not some "responsible people" :)
Oh well, time for the kernel to fix hardware bugs again, that's what we
are here for, you would think we would be used to it by now...
greg k-h
On Sun, Jan 07, 2018 at 06:24:11PM -0800, Alexei Starovoitov wrote:
> How about:
> CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
> CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE
INSTRUCTION_FENCE if anything. LFENCE for Intel (and now also for AMD as
per 0592b0bce169) is a misnomer, IFENCE would be a better name for it.
On Sun, Jan 07, 2018 at 06:57:35PM -0800, Alexei Starovoitov wrote:
> On Sun, Jan 07, 2018 at 01:59:35PM +0000, Alan Cox wrote:
> > lfence timing is also heavily dependent upon what work has to be done to
> > retire previous live instructions.
> > BPF does not normally do a lot of writing so you'd expect the cost to be low.
>
> right. to retire previous loads. Not sure what 'not a lot of writing'
> has to do with lfence.
LFENCE will wait for completion on _ALL_ prior instructions, not just
loads.
Stores are by far the most expensive instructions to wait for, as they
only complete once their value is globally visible (on x86).
On Sat, Jan 06, 2018 at 08:41:34PM +0100, Thomas Gleixner wrote:
> optimized argumentation. We need to make sure that we have a solution which
> kills the problem safely and then take it from there. Correctness first,
> optimization later is the rule for this. Better safe than sorry.
Agreed, assuming the objective here is to achieve a complete spectre
fix fast.
Also note there's a whole set of stuff to do in addition to IBRS:
IBPB, stuff_RSB() and the register hygiene in kernel entry points and
vmexits, which alters the whole syscall stackframe to be able to clear
callee saved registers.
That register hygiene was one of the most tedious pieces to get right
along with the PTI "rep movsb" (no C) stack trampoline that never
calls into C code with zero stack available because it's very bad to
do so, considering C is free to use some stack for register spillage.
I suggest to discuss how important register hygiene is on top of IBRS,
IBPB and stuff_RSB() to fix spectre, not future optimizations that
only matter for old CPUS and are irrelevant for future silicon.
I also suggest to discuss how to automate the other parts of variant#1,
lfence/mfence across the bounds checks, depending on arch, with an open
source scanner, or whether to expect developers to think about it like we
think about mb() (except no regression test will ever notice a missing
bounds check speculation memory barrier).
Retpolines alone leave a whole set of stuff unfixed: register
hygiene still missing, bios/firmware calls still require ibrs, all asm
has to be audited by hand as there's no sure asm scanner I know of
(grep can go somewhere though) and the gcc dependency isn't very
flexible to begin with, and they don't help with lfence/mfence across
bound checks, they still require IBPB and stuff_RSB() to avoid
guest/user mode against guest/user spectre variant#2 attacks.
I don't see why we should talk about pure performance optimization at
this point instead of focusing on the above.
Not to mention that if you want to guarantee mathematically that guest
userland cannot read the guest kernel memory by starting a spectre
variant#2 attack from guest userland to host userland (David Gilbert's
new attack discovery). For that you'll have to set ibrs_enabled 2
ibpb_enabled 1 mode or ibrs_enabled 0 ibpb_enabled 2 mode in the host
kernel or alternatively ibrs_enabled 0 ibpb_enabled 2 in the guest
kernel.
ibrs 2 ibpb 1 will prevent qemu userland from using the IBP so guest
userland cannot probe it. ibrs 0 ibpb 2 will flush the IBP state at
vmexit so qemu userland won't be affected by it. ibrs 0 ibpb 2 in
guest will flush the IBP state at kernel entry so guest userland won't
be able to affect anything.
Of course such an attack from guest user -> guest kernel -> host
kernel -> host user -> host kernel -> guest kernel -> guest user and
probing IBP (RSB is fixed for good with unconditional stuff_RSB in
vmexit even when SMEP is set, precisely because SMEP won't stop guest
ring 3 to probe host ring 3 RSB and same for ring 0) is far fetched,
but retpolines alone cannot solve it unless you also build the qemu
userland with retpolines (which then means the whole userland has to
be built with retpolines because the qemu dependency chain is endless,
and includes glibc etc.).
As a reminder (for lkml): if you use KVM, spectre variant#2 is the
only attack that can affect guest/host memory isolation. spectre
variant#1 and meltdown (aka variant#3) always have been impossible
through KVM guest/user isolation. spectre variant#2 is the one that is
harder to fix and it's the most theoretical of them all and it may be
impossible to mount as an attack depending on host kernel code that
has to play against itself to achieve it. The setup for such an attack
is very tedious, takes half an hour or several hours depending on the
amounts of memory and you may have to know already accurately the
kernel that is running on the host. As opposed to spectre variant#1
and meltdown (aka variant#3), it's very unlikely anybody gets attacked
through spectre variant#2. It's also the side channel with the lowest
amount of kbytes/sec of bandwidth if mounted successfully in the first
place. However, if it can be mounted successfully it becomes almost as
much of a concern as the other two variants, which is why it needs fixing too.
Thanks,
Andrea
On Fri, Jan 05, 2018 at 10:30:16PM -0800, Dan Williams wrote:
> On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman <[email protected]> wrote:
> > In at least one place (mpls) you are patching a fast path. Compile out
> > or don't load mpls by all means. But it is not acceptable to change the
> > fast path without even considering performance.
>
> Performance matters greatly, but I need help to identify a workload
> that is representative for this fast path to see what, if any, impact
> is incurred. Even better is a review that says "nope, 'index' is not
> subject to arbitrary userspace control at this point, drop the patch."
I think we're focussing a little too much on pure userspace. That is, we
should be saying under the attackers control. Inbound network packets
could equally be under the attackers control.
Sure, userspace is the most direct and highest bandwidth one, but I
think we should treat all (kernel) external values with the same
paranoia.
On Fri 05-01-18 17:11:26, Dan Williams wrote:
> Static analysis reports that 'eahd->appAttrLocation' and
> 'eahd->impAttrLocation' may be a user controlled values that are used as
> data dependencies for calculating source and destination buffers for
> memmove operations. In order to avoid potential leaks of kernel memory
> values, block speculative execution of the instruction stream that could
> issue further reads based on invalid 'aal' or 'ial' values.
>
> Based on an original patch by Elena Reshetova.
>
> Cc: Jan Kara <[email protected]>
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
Umm, so I fail to see the user controlled input here.
udf_add_extendedattr() is called from a single place - udf_update_inode()
- and there all arguments except 'inode' are constant AFAICS. Also it gets
called only when block device or character device inode is created - that
already limits the possible user influence a lot since generally normal
users are not allowed to do that.
If the static checker is concerned that the inode contents are to some extent
user controllable, it is right but:
a) In this particular case udf_add_extendedattr() can get executed only on
freshly created inode so values are controlled by the kernel.
b) Even if in future we'd call udf_add_extendedattr() on inode loaded from
disk we do sanitycheck i_lenEAttr in udf_read_inode() when loading the value
from disk. And that would generally be way before we get to functions adding
extended attributes to the inode...
So overall I don't think this patch is needed.
Honza
> diff --git a/fs/udf/misc.c b/fs/udf/misc.c
> index 401e64cde1be..9403160822de 100644
> --- a/fs/udf/misc.c
> +++ b/fs/udf/misc.c
> @@ -51,6 +51,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,
> int offset;
> uint16_t crclen;
> struct udf_inode_info *iinfo = UDF_I(inode);
> + uint8_t *ea_dst, *ea_src;
> + uint32_t aal, ial;
>
> ea = iinfo->i_ext.i_data;
> if (iinfo->i_lenEAttr) {
> @@ -100,33 +102,34 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,
>
> offset = iinfo->i_lenEAttr;
> if (type < 2048) {
> - if (le32_to_cpu(eahd->appAttrLocation) <
> - iinfo->i_lenEAttr) {
> - uint32_t aal =
> - le32_to_cpu(eahd->appAttrLocation);
> - memmove(&ea[offset - aal + size],
> - &ea[aal], offset - aal);
> + aal = le32_to_cpu(eahd->appAttrLocation);
> + if ((ea_dst = nospec_array_ptr(ea, offset - aal + size,
> + iinfo->i_lenEAttr)) &&
> + (ea_src = nospec_array_ptr(ea, aal,
> + iinfo->i_lenEAttr))) {
> + memmove(ea_dst, ea_src, offset - aal);
> offset -= aal;
> eahd->appAttrLocation =
> cpu_to_le32(aal + size);
> }
> - if (le32_to_cpu(eahd->impAttrLocation) <
> - iinfo->i_lenEAttr) {
> - uint32_t ial =
> - le32_to_cpu(eahd->impAttrLocation);
> - memmove(&ea[offset - ial + size],
> - &ea[ial], offset - ial);
> +
> + ial = le32_to_cpu(eahd->impAttrLocation);
> + if ((ea_dst = nospec_array_ptr(ea, offset - ial + size,
> + iinfo->i_lenEAttr)) &&
> + (ea_src = nospec_array_ptr(ea, ial,
> + iinfo->i_lenEAttr))) {
> + memmove(ea_dst, ea_src, offset - ial);
> offset -= ial;
> eahd->impAttrLocation =
> cpu_to_le32(ial + size);
> }
> } else if (type < 65536) {
> - if (le32_to_cpu(eahd->appAttrLocation) <
> - iinfo->i_lenEAttr) {
> - uint32_t aal =
> - le32_to_cpu(eahd->appAttrLocation);
> - memmove(&ea[offset - aal + size],
> - &ea[aal], offset - aal);
> + aal = le32_to_cpu(eahd->appAttrLocation);
> + if ((ea_dst = nospec_array_ptr(ea, offset - aal + size,
> + iinfo->i_lenEAttr)) &&
> + (ea_src = nospec_array_ptr(ea, aal,
> + iinfo->i_lenEAttr))) {
> + memmove(ea_dst, ea_src, offset - aal);
> offset -= aal;
> eahd->appAttrLocation =
> cpu_to_le32(aal + size);
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi Dan,
Thank you for the patch.
On Saturday, 6 January 2018 03:10:32 EET Dan Williams wrote:
> Static analysis reports that 'index' may be a user controlled value that
> is used as a data dependency to read 'pin' from the
> 'selector->baSourceID' array. In order to avoid potential leaks of
> kernel memory values, block speculative execution of the instruction
> stream that could issue reads based on an invalid value of 'pin'.
I won't repeat the arguments already made in the thread regarding having
documented coverity rules for this, even if I agree with them.
> Based on an original patch by Elena Reshetova.
>
> Cc: Laurent Pinchart <[email protected]>
> Cc: Mauro Carvalho Chehab <[email protected]>
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/media/usb/uvc/uvc_v4l2.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/media/usb/uvc/uvc_v4l2.c
> b/drivers/media/usb/uvc/uvc_v4l2.c index 3e7e283a44a8..7442626dc20e 100644
> --- a/drivers/media/usb/uvc/uvc_v4l2.c
> +++ b/drivers/media/usb/uvc/uvc_v4l2.c
> @@ -22,6 +22,7 @@
> #include <linux/mm.h>
> #include <linux/wait.h>
> #include <linux/atomic.h>
> +#include <linux/compiler.h>
>
> #include <media/v4l2-common.h>
> #include <media/v4l2-ctrls.h>
> @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void
> *fh, struct uvc_entity *iterm = NULL;
> u32 index = input->index;
> int pin = 0;
> + __u8 *elem;
>
> if (selector == NULL ||
> (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
> @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void
> *fh, break;
> }
> pin = iterm->id;
> - } else if (index < selector->bNrInPins) {
> - pin = selector->baSourceID[index];
> + } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
> + selector->bNrInPins))) {
> + pin = *elem;
> list_for_each_entry(iterm, &chain->entities, chain) {
> if (!UVC_ENTITY_IS_ITERM(iterm))
> continue;
(adding a bit more context)
> if (iterm->id == pin)
> break;
> }
> }
>
> if (iterm == NULL || iterm->id != pin)
> return -EINVAL;
>
> memset(input, 0, sizeof(*input));
> input->index = index;
> strlcpy(input->name, iterm->name, sizeof(input->name));
> if (UVC_ENTITY_TYPE(iterm) == UVC_ITT_CAMERA)
> input->type = V4L2_INPUT_TYPE_CAMERA;
So pin is used to search for an entry in the chain->entities list. Entries in
that list are allocated separately through kmalloc and can thus end up in
different cache lines, so I agree we have an issue. However, this is mitigated
by the fact that typical UVC devices have a handful (sometimes up to a dozen)
entities, so an attacker would only be able to read memory values that are
equal to the entity IDs used by the device. Entity IDs can be freely allocated
but typically run consecutively from 0. It would take a specially-crafted UVC
device to be able to read all memory.
On the other hand, as this is nowhere close to being a fast path, I think we
can close this potential hole as proposed in the patch. So,
Reviewed-by: Laurent Pinchart <[email protected]>
Will you merge the whole series in one go, or would you like me to take the
patch in my tree? In the latter case I'll wait until nospec_array_ptr()
gets merged in mainline.
--
Regards,
Laurent Pinchart
On Mon, 8 Jan 2018 11:08:36 +0100
Peter Zijlstra <[email protected]> wrote:
> On Fri, Jan 05, 2018 at 10:30:16PM -0800, Dan Williams wrote:
> > On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman <[email protected]> wrote:
> > > In at least one place (mpls) you are patching a fast path. Compile out
> > > or don't load mpls by all means. But it is not acceptable to change the
> > > fast path without even considering performance.
> >
> > Performance matters greatly, but I need help to identify a workload
> > that is representative for this fast path to see what, if any, impact
> > is incurred. Even better is a review that says "nope, 'index' is not
> > subject to arbitrary userspace control at this point, drop the patch."
>
I think we're focusing a little too much on pure userspace. That is, we
should be saying under the attacker's control. Inbound network packets
could equally be under the attacker's control.
Inbound network packets don't come with a facility to read back and do
cache timing. For the more general case, timing attacks on network
activity are not exactly new, and you have to mitigate them in user space
because most of them are about how many instructions you execute on each
path. The ancient classic is telling whether a user exists by seeing if the
password was actually checked.
Alan
On Mon, Jan 08, 2018 at 11:43:42AM +0000, Alan Cox wrote:
> On Mon, 8 Jan 2018 11:08:36 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Jan 05, 2018 at 10:30:16PM -0800, Dan Williams wrote:
> > > On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman <[email protected]> wrote:
> > > > In at least one place (mpls) you are patching a fast path. Compile out
> > > > or don't load mpls by all means. But it is not acceptable to change the
> > > > fast path without even considering performance.
> > >
> > > Performance matters greatly, but I need help to identify a workload
> > > that is representative for this fast path to see what, if any, impact
> > > is incurred. Even better is a review that says "nope, 'index' is not
> > > subject to arbitrary userspace control at this point, drop the patch."
> >
> > I think we're focusing a little too much on pure userspace. That is, we
> > should be saying under the attacker's control. Inbound network packets
> > could equally be under the attacker's control.
>
> Inbound network packets don't come with a facility to read back and do
> cache timing.
But could they not be used in conjunction with a local task to prime the
stuff?
From: Linus Torvalds
> Sent: 07 January 2018 19:47
...
> Also, people definitely *are* noticing the performance issues with the
> current set of patches, and they are causing real problems. Go search
> for reports of Amazon AWS slowdowns.
Try over 35% slowdown....
Given that an AWS instance runs known code (user and kernel), why do we
need to worry about any of these sideband attacks?
David
On Mon, 8 Jan 2018 12:00:28 +0000
David Laight <[email protected]> wrote:
> From: Linus Torvalds
> > Sent: 07 January 2018 19:47
> ...
> > Also, people definitely *are* noticing the performance issues with the
> > current set of patches, and they are causing real problems. Go search
> > for reports of Amazon AWS slowdowns.
>
> Try over 35% slowdown....
> Given that AWS instance runs known code (user and kernel) why do we
> need to worry about any of these sideband attacks?
You may not need to. Amazon themselves obviously need to worry that no
other VM steals your data (or vice versa) but above that (and with raw
hardware appliances) if you control all the code you run then the nopti
and other disables may be useful (At the end of the day as with anything
else you do your own risk assessment).
Do remember to consider if you are running untrusted but supposedly
sandboxed code (eg Java, js).
I'm not using pti etc on my minecraft VMs - no point. If anyone gets to
run arbitrary code on them except me then it's already compromised.
Alan
From: Alan Cox
> Sent: 08 January 2018 12:13
...
> > Try over 35% slowdown....
> > Given that AWS instance runs known code (user and kernel) why do we
> > need to worry about any of these sideband attacks?
>
> You may not need to. Amazon themselves obviously need to worry that no
> other VM steals your data (or vice versa) but above that (and with raw
> hardware appliances) if you control all the code you run then the nopti
> and other disables may be useful (At the end of the day as with anything
> else you do your own risk assessment).
I believe AWS allows VM kernels to load user-written device drivers
so the security of other VMs cannot rely on whether a VM is booted
with PTI=yes or PTI=no.
David
On 01/05/18 22:30, Dan Williams wrote:
> On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman <[email protected]> wrote:
>> Please expand this.
>>
>> It is not clear what the static analysis is looking for. Have a clear
>> description of what is being fixed is crucial for allowing any of these
>> changes.
>>
>> For the details given in the change description what I read is magic
>> changes because a magic process says this code is vulnerable.
>
> Yes, that was my first reaction to the patches as well, I try below to
> add some more background and guidance, but in the end these are static
> analysis reports across a wide swath of sub-systems. It's going to
> take some iteration with domain experts to improve the patch
> descriptions, and that's the point of this series, to get the better
> trained eyes from the actual sub-system owners to take a look at these
> reports.
More information about what the static analysis is looking for would
definitely be welcome.
Additionally, since the analysis tool is not publicly available, how are
authors of new kernel code supposed to verify whether or not their code
needs to use nospec_array_ptr()? How are reviewers of kernel code
supposed to verify whether or not nospec_array_ptr() is missing where it
should be used?
Since this patch series only modifies the upstream kernel, how will
out-of-tree drivers be fixed, e.g. the nVidia driver and the Android
drivers?
Thanks,
Bart.
On Fri, 05 Jan 2018 17:10:03 -0800
Dan Williams <[email protected]> wrote:
> Document the rationale and usage of the new nospec*() helpers.
I have just a couple of overall comments.
- It would be nice if the document were done in RST and placed in the
core-API manual, perhaps using kerneldoc comments for the macros
themselves. It's already 99.9% RST now, so the changes required would
be minimal.
- More importantly: is there any way at all to give guidance to
developers wondering *when* they should use these primitives? I think
it would be easy to create a situation where they don't get used where
they are really needed; meanwhile, there may well be a flood of
"helpful" patches adding them where they make no sense at all.
Thanks,
jon
Hi Jon,
On Mon, Jan 08, 2018 at 09:29:17AM -0700, Jonathan Corbet wrote:
> On Fri, 05 Jan 2018 17:10:03 -0800
> Dan Williams <[email protected]> wrote:
>
> > Document the rationale and usage of the new nospec*() helpers.
>
> I have just a couple of overall comments.
>
> - It would be nice if the document were done in RST and placed in the
> core-API manual, perhaps using kerneldoc comments for the macros
> themselves. It's already 99.9% RST now, so the changes required would
> be minimal.
Is there any quickstart guide to RST that you can recommend?
I'm happy to clean up the documentation; I'm just not familiar with RST.
> - More importantly: is there any way at all to give guidance to
> developers wondering *when* they should use these primitives? I think
> it would be easy to create a situation where they don't get used where
> they are really needed; meanwhile, there may well be a flood of
> "helpful" patches adding them where they make no sense at all.
This is on my TODO list.
The unfortunate truth is that it's likely to be a subjective judgement
call in many cases, depending on how likely it is that the user can
influence the code in question, so it's difficult to provide
hard-and-fast rules.
Thanks,
Mark.
* Peter Zijlstra <[email protected]> wrote:
> On Sun, Jan 07, 2018 at 06:24:11PM -0800, Alexei Starovoitov wrote:
> > How about:
> > CONFIG_SPECTRE1_WORKAROUND_INDEX_MASK
> > CONFIG_SPECTRE1_WORKAROUND_LOAD_FENCE
>
> INSTRUCTION_FENCE if anything. LFENCE for Intel (and now also for AMD as
> per 0592b0bce169) is a misnomer, IFENCE would be a better name for it.
Btw., we should probably propagate that naming to most places and only point it
out in comments/documentation that Intel/AMD calls this CPU functionality 'LFENCE'
for historic reasons and that the opcode got stolen for the Spectre fixes.
Thanks,
Ingo
* Alan Cox <[email protected]> wrote:
> On Mon, 8 Jan 2018 11:08:36 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Jan 05, 2018 at 10:30:16PM -0800, Dan Williams wrote:
> > > On Fri, Jan 5, 2018 at 6:22 PM, Eric W. Biederman <[email protected]> wrote:
> > > > In at least one place (mpls) you are patching a fast path. Compile out
> > > > or don't load mpls by all means. But it is not acceptable to change the
> > > > fast path without even considering performance.
> > >
> > > Performance matters greatly, but I need help to identify a workload
> > > that is representative for this fast path to see what, if any, impact
> > > is incurred. Even better is a review that says "nope, 'index' is not
> > > subject to arbitrary userspace control at this point, drop the patch."
> >
> > I think we're focusing a little too much on pure userspace. That is, we
> > should be saying under the attacker's control. Inbound network packets
> > could equally be under the attacker's control.
>
> Inbound network packets don't come with a facility to read back and do
> cache timing. [...]
But the reply packets can be measured on the sending side, and the total
delay would thus carry the timing information.
Yes, a lot of noise gets added that way if we think 'packet goes through the
Internet' - but with gigabit local network access or even through localhost
access a lot of noise can be removed as well.
It's not as dangerous as a near instantaneous local attack, but 'needs a day of
runtime to brute-force through localhost or 10GigE' is still worrying in many
real-world security contexts.
So I concur with Peter that we should generally consider making all of our
responses to external data (maybe with the exception of pigeon post messages)
Spectre-safe.
Thanks,
Ingo
On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
<[email protected]> wrote:
> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams <[email protected]> wrote:
>>
>> I assume if we put this in uaccess_begin() we also need to audit
>> paths that use access_ok() but don't go on to call uaccess_begin()? A
>> quick glance shows a few places where we are open coding the stac().
>> Perhaps land the lfence in stac() directly?
>
> Yeah, we should put it in uaccess_begin(), and in the actual user
> accessor helpers that do stac. Some of them probably should be changed
> to use uaccess_begin() instead while at it.
>
> One question for the CPU people: do we actually care and need to do
> this for things that might *write* to something? The speculative write
> obviously is killed, but does it perhaps bring in a cacheline even
> when killed?
As far as I understand a write could trigger a request-for-ownership
read for the target cacheline.
> Because maybe we don't need the lfence in put_user(), only in get_user()?
Even though writes can trigger reads, as far as I can see the write
needs to be dependent on the first out-of-bounds read:
    if (x < max)
            y = array1[x];
    put_user(array2 + y, z);
...in other words that first read should be annotated with
nospec_array_ptr() making an lfence in put_user() or other writes
moot.
    yp = nospec_array_ptr(array1, x, max);
    if (yp)
            y = *yp;
    put_user(array2 + y, z);
On Mon, 8 Jan 2018 17:09:59 +0000
Mark Rutland <[email protected]> wrote:
> > I have just a couple of overall comments.
> >
> > - It would be nice if the document were done in RST and placed in the
> > core-API manual, perhaps using kerneldoc comments for the macros
> > themselves. It's already 99.9% RST now, so the changes required would
> > be minimal.
>
> Is there any quickstart guide to RST that you can recommend?
http://docutils.sourceforge.net/docs/user/rst/quickref.html works
reasonably well. We have some info in the kernel documentation as well,
see http://static.lwn.net/kerneldoc/doc-guide/sphinx.html
Thanks,
jon
On Mon, Jan 8, 2018 at 1:09 PM, Dan Williams <[email protected]> wrote:
> On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams <[email protected]> wrote:
>>>
>>> I assume if we put this in uaccess_begin() we also need to audit
>>> paths that use access_ok() but don't go on to call uaccess_begin()? A
>>> quick glance shows a few places where we are open coding the stac().
>>> Perhaps land the lfence in stac() directly?
>>
>> Yeah, we should put it in uaccess_begin(), and in the actual user
>> accessor helpers that do stac. Some of them probably should be changed
>> to use uaccess_begin() instead while at it.
>>
>> One question for the CPU people: do we actually care and need to do
>> this for things that might *write* to something? The speculative write
>> obviously is killed, but does it perhaps bring in a cacheline even
>> when killed?
>
> As far as I understand a write could trigger a request-for-ownership
> read for the target cacheline.
Oh, absolutely.
I just wonder at what point that happens.
Honestly, trying to get exclusive access to a cacheline can be _very_
expensive (not just for the local thread), so I would actually expect
that doing so for speculative writes is actually bad for performance.
That's doubly true because - unlike reads - there is no critical
latency issue, so trying to get the cache access started as early as
possible simply isn't all that important.
So I suspect that a write won't actually try to allocate the cacheline
until the write has actually retired.
End result: writes - unlike reads - *probably* will not speculatively
perturb the cache with speculative write addresses.
> Even though writes can trigger reads, as far as I can see the write
> needs to be dependent on the first out-of-bounds read
Yeah. A write on its own wouldn't matter, even if it were to perturb
the cache state, because the address already comes from user space, so
there's no new information in the cache perturbation for the attacker.
But that all implies that we shouldn't need the lfence for the
"put_user()" case, only for the get_user() (where the value we read
would then perhaps be used to do another access).
So we want to add the lfence (or "and") to get_user(), but not
necessarily put_user().
Agreed?
Linus
On Mon, Jan 8, 2018 at 3:44 PM, Linus Torvalds
<[email protected]> wrote:
> On Mon, Jan 8, 2018 at 1:09 PM, Dan Williams <[email protected]> wrote:
>> On Sat, Jan 6, 2018 at 5:20 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Sat, Jan 6, 2018 at 3:31 PM, Dan Williams <[email protected]> wrote:
>>>>
>>>> I assume if we put this in uaccess_begin() we also need to audit
>>>> paths that use access_ok() but don't go on to call uaccess_begin()? A
>>>> quick glance shows a few places where we are open coding the stac().
>>>> Perhaps land the lfence in stac() directly?
>>>
>>> Yeah, we should put it in uaccess_begin(), and in the actual user
>>> accessor helpers that do stac. Some of them probably should be changed
>>> to use uaccess_begin() instead while at it.
>>>
>>> One question for the CPU people: do we actually care and need to do
>>> this for things that might *write* to something? The speculative write
>>> obviously is killed, but does it perhaps bring in a cacheline even
>>> when killed?
>>
>> As far as I understand a write could trigger a request-for-ownership
>> read for the target cacheline.
>
> Oh, absolutely.
>
> I just wonder at what point that happens.
>
> Honestly, trying to get exclusive access to a cacheline can be _very_
> expensive (not just for the local thread), so I would actually expect
> that doing so for speculative writes is actually bad for performance.
>
> That's doubly true because - unlike reads - there is no critical
> latency issue, so trying to get the cache access started as early as
> possible simply isn't all that important.
>
> So I suspect that a write won't actually try to allocate the cacheline
> until the write has actually retired.
>
> End result: writes - unlike reads - *probably* will not speculatively
> perturb the cache with speculative write addresses.
>
>> Even though writes can trigger reads, as far as I can see the write
>> needs to be dependent on the first out-of-bounds read
>
> Yeah. A write on its own wouldn't matter, even if it were to perturb
> the cache state, because the address already comes from user space, so
> there's no new information in the cache perturbation for the attacker.
>
> But that all implies that we shouldn't need the lfence for the
> "put_user()" case, only for the get_user() (where the value we read
> would then perhaps be used to do another access).
>
> So we want to add the lfence (or "and") to get_user(), but not
> necessarily put_user().
Yes, perhaps __uaccess_begin_get() and __uaccess_begin_put() to keep
things separate?
> Agreed?
I've been thinking the "and" is only suitable for the array bounds
check, for get_user() we're trying to block speculation past
access_ok() at which point we can only do the lfence?
On Mon, Jan 8, 2018 at 3:53 PM, Dan Williams <[email protected]> wrote:
>
> I've been thinking the "and" is only suitable for the array bounds
> check, for get_user() we're trying to block speculation past
> access_ok() at which point we can only do the lfence?
Well, we *could* do the "and", at least for the simple cases (ie the
true "get_user()" that integrates the access_ok with the access).
IOW, mainly the code in arch/x86/lib/getuser.S.
But it probably is a lot simpler to just add the "lfence" to ASM_STAC,
because by definition those cases don't tend to be the truly critical
ones - people who use those functions tend to do one or two accesses,
and the real cost is likely the I$ misses and the D$ miss to get
current->addr_limit. Not to mention the "stac" itself, which is much
more expensive than the access on current microarchitectures.
But something like this *might* work:
index c97d935a29e8..7fa3d293beaf 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -38,8 +38,11 @@
 .text
 ENTRY(__get_user_1)
 	mov PER_CPU_VAR(current_task), %_ASM_DX
-	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
+	mov TASK_addr_limit(%_ASM_DX),%_ASM_DX
+	cmp %_ASM_DX,%_ASM_AX
 	jae bad_get_user
+	or $0xfff,%_ASM_DX
+	and %_ASM_DX,%_ASM_AX
 	ASM_STAC
 1:	movzbl (%_ASM_AX),%edx
 	xor %eax,%eax
(this only does the one-byte case - the 2/4/8 byte cases are exactly the same).
The above is completely untested and might have some stupid
thinko/typo, so take it purely as a "example patch" to show the
concept, rather than actually do it.
But just adding "lfence" to the existing ASM_STAC is a hell of a lot
easier, and the performance difference between that trivial patch and
the above "let's be clever with 'and'" might not be measurable.
I really have no idea how expensive lfence might actually end up being
in practice. It's possible that lfence is actually fairly cheap in
kernel code, since we tend to not have very high IPC anyway.
Linus
On Mon, Jan 8, 2018 at 3:23 AM, Laurent Pinchart
<[email protected]> wrote:
> Hi Dan,
>
> Thank you for the patch.
>
> On Saturday, 6 January 2018 03:10:32 EET Dan Williams wrote:
>> Static analysis reports that 'index' may be a user controlled value that
>> is used as a data dependency to read 'pin' from the
>> 'selector->baSourceID' array. In order to avoid potential leaks of
>> kernel memory values, block speculative execution of the instruction
>> stream that could issue reads based on an invalid value of 'pin'.
>
> I won't repeat the arguments already made in the thread regarding having
> documented coverity rules for this, even if I agree with them.
>
>> Based on an original patch by Elena Reshetova.
>>
>> Cc: Laurent Pinchart <[email protected]>
>> Cc: Mauro Carvalho Chehab <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Elena Reshetova <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>> ---
>> drivers/media/usb/uvc/uvc_v4l2.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/media/usb/uvc/uvc_v4l2.c
>> b/drivers/media/usb/uvc/uvc_v4l2.c index 3e7e283a44a8..7442626dc20e 100644
>> --- a/drivers/media/usb/uvc/uvc_v4l2.c
>> +++ b/drivers/media/usb/uvc/uvc_v4l2.c
>> @@ -22,6 +22,7 @@
>> #include <linux/mm.h>
>> #include <linux/wait.h>
>> #include <linux/atomic.h>
>> +#include <linux/compiler.h>
>>
>> #include <media/v4l2-common.h>
>> #include <media/v4l2-ctrls.h>
>> @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file, void
>> *fh, struct uvc_entity *iterm = NULL;
>> u32 index = input->index;
>> int pin = 0;
>> + __u8 *elem;
>>
>> if (selector == NULL ||
>> (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
>> @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file, void
>> *fh, break;
>> }
>> pin = iterm->id;
>> - } else if (index < selector->bNrInPins) {
>> - pin = selector->baSourceID[index];
>> + } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
>> + selector->bNrInPins))) {
>> + pin = *elem;
>> list_for_each_entry(iterm, &chain->entities, chain) {
>> if (!UVC_ENTITY_IS_ITERM(iterm))
>> continue;
>
> (adding a bit more context)
>
>> if (iterm->id == pin)
>> break;
>> }
>> }
>>
>> if (iterm == NULL || iterm->id != pin)
>> return -EINVAL;
>>
>> memset(input, 0, sizeof(*input));
>> input->index = index;
>> strlcpy(input->name, iterm->name, sizeof(input->name));
>> if (UVC_ENTITY_TYPE(iterm) == UVC_ITT_CAMERA)
>> input->type = V4L2_INPUT_TYPE_CAMERA;
>
> So pin is used to search for an entry in the chain->entities list. Entries in
> that list are allocated separately through kmalloc and can thus end up in
> different cache lines, so I agree we have an issue. However, this is mitigated
> by the fact that typical UVC devices have a handful (sometimes up to a dozen)
> entities, so an attacker would only be able to read memory values that are
> equal to the entity IDs used by the device. Entity IDs can be freely allocated
> but typically run consecutively from 0. It would take a specially-crafted UVC
> device to be able to read all memory.
>
> On the other hand, as this is nowhere close to being a fast path, I think we
> can close this potential hole as proposed in the patch. So,
>
> Reviewed-by: Laurent Pinchart <[email protected]>
Thanks Laurent!
> Will you merge the whole series in one go, or would you like me to take the
> patch in my tree? In the latter case I'll wait until nospec_array_ptr()
> gets merged in mainline.
I'll track it for now. Until the 'nospec_array_ptr()' discussion
resolves there won't be a stabilized commit-id for you to base a
branch on.
Dan Williams <[email protected]> writes:
> Static analysis reports that 'index' may be a user controlled value that
> is used as a data dependency reading 'rt' from the 'platform_label'
> array. In order to avoid potential leaks of kernel memory values, block
> speculative execution of the instruction stream that could issue further
> reads based on an invalid 'rt' value.
In detail.
a) This code is fast path packet forwarding code. Introducing an
unconditional pipeline stall is not ok.
AKA either there is no speculation and so this is invulnerable
or there is speculation and you are creating an unconditional
pipeline stall here.
My back-of-the-napkin calculations say that a pipeline stall
is about 20 cycles. Which is about the same length of time
as a modern cache miss.
On a good day this code will perform with 0 cache misses. On a less
good day 1 cache miss. Which means you are quite possibly doubling
the runtime of mpls_forward.
b) The array is dynamically allocated which should provide some
protection, as it will be more difficult to predict the address
of the array, which is needed to craft a malicious userspace value.
c) The code can be trivially modified to say:
static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
{
	struct mpls_route *rt = NULL;

	if (index < net->mpls.platform_labels) {
		struct mpls_route __rcu **platform_label =
			rcu_dereference(net->mpls.platform_label);
		rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
	}
	return rt;
}
AKA a static mask will ensure that there is not a primitive that can be
used to access all of memory. That is at most a 1-cycle slowdown in the
code, which is a much better trade off.
d) If we care more it is straightforward to modify
resize_platform_label_table() to ensure that the size of the array
is always a power of 2.
e) The fact that a pointer is returned from the array and it is treated
like a pointer would seem to provide a defense against the
exfiltration technique of using the value read as an index into
a small array, that user space code can probe aliased cache
lines of, to see which value was dereferenced.
So to this patch in particular.
Nacked-by: "Eric W. Biederman" <[email protected]>
This code path will be difficult to exploit. This change messes with
performance. There are ways to make this code path useless to an attacker
while preserving the performance of the code.
Eric
>
> Based on an original patch by Elena Reshetova.
>
> Cc: "David S. Miller" <[email protected]>
> Cc: Eric W. Biederman <[email protected]>
> Cc: [email protected]
> Signed-off-by: Elena Reshetova <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> net/mpls/af_mpls.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 8ca9915befc8..ebcf0e246cfe 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -8,6 +8,7 @@
> #include <linux/ipv6.h>
> #include <linux/mpls.h>
> #include <linux/netconf.h>
> +#include <linux/compiler.h>
> #include <linux/vmalloc.h>
> #include <linux/percpu.h>
> #include <net/ip.h>
> @@ -77,12 +78,13 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
> static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
> {
> struct mpls_route *rt = NULL;
> + struct mpls_route __rcu **platform_label =
> + rcu_dereference(net->mpls.platform_label);
> + struct mpls_route __rcu **rtp;
>
> - if (index < net->mpls.platform_labels) {
> - struct mpls_route __rcu **platform_label =
> - rcu_dereference(net->mpls.platform_label);
> - rt = rcu_dereference(platform_label[index]);
> - }
> + if ((rtp = nospec_array_ptr(platform_label, index,
> + net->mpls.platform_labels)))
> + rt = rcu_dereference(*rtp);
> return rt;
> }
>
On Mon, Jan 8, 2018 at 7:11 PM, Eric W. Biederman <[email protected]> wrote:
> Dan Williams <[email protected]> writes:
>
>> Static analysis reports that 'index' may be a user controlled value that
>> is used as a data dependency reading 'rt' from the 'platform_label'
>> array. In order to avoid potential leaks of kernel memory values, block
>> speculative execution of the instruction stream that could issue further
>> reads based on an invalid 'rt' value.
>
>
> In detail.
> a) This code is fast path packet forwarding code. Introducing an
> unconditional pipeline stall is not ok.
>
> AKA either there is no speculation and so this is invulnerable
> or there is speculation and you are creating an unconditional
> pipeline stall here.
>
> My back of the napkin calculations say that a pipeline stall
> is about 20 cycles. Which is about the same length of time
> as a modern cache miss.
>
> On a good day this code will perform with 0 cache misses. On a less
> good day 1 cache miss. Which means you are quite possibly doubling
> the runtime of mpls_forward.
>
> b) The array is dynamically allocated which should provide some
> protection, as it will be more difficult to predict the address
> of the array which is needed to craft a malicious userspace value.
>
> c) The code can be trivially modified to say:
>
> static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
> {
> struct mpls_route *rt = NULL;
>
> if (index < net->mpls.platform_labels) {
> struct mpls_route __rcu **platform_label =
> rcu_dereference(net->mpls.platform_label);
> rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
> }
> return rt;
> }
>
> AKA a static mask will ensure that there is not a primitive that can be
> used to access all of memory. That is max a 1 cycle slowdown in the
> code, which is a much better trade off.
>
> d) If we care more it is straightforward to modify
> resize_platform_label_table() to ensure that the size of the array
> is always a power of 2.
>
> e) The fact that a pointer is returned from the array and it is treated
> like a pointer would seem to provide a defense against the
> exfiltration technique of using the value read as an index into
> a small array, that user space code can probe aliased cached
> lines of, to see which value was dereferenced.
>
>
> So to this patch in particular.
> Nacked-by: "Eric W. Biederman" <[email protected]>
>
> This code path will be difficult to exploit. This change messes with
> performance. There are ways to make this code path useless while
> preserving the performance of the code.
>
Thanks, Eric, understood. The discussion over the weekend came to the
conclusion that using a mask will be the default approach. The
nospec_array_ptr() will be defined to something similar to the
following, originally from Linus and tweaked by Alexei and I:
#define __nospec_array_ptr(base, idx, sz) \
({ \
union { typeof(&base[0]) _ptr; unsigned long _bit; } __u; \
unsigned long _i = (idx); \
unsigned long _m = (sz); \
unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
OPTIMIZER_HIDE_VAR(_mask); \
__u._ptr = &base[_i & _mask]; \
__u._bit &= _mask; \
__u._ptr; \
})
Does that address your performance concerns?
On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams <[email protected]> wrote:
>
> originally from Linus and tweaked by Alexei and I:
Sadly, that tweak - while clever - is wrong.
> unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
Why?
Because "(long)(_m-1-_i)" is not negative just because "i >= m". It
can still be positive.
Think "m = 100", "i=bignum". The subtraction will overflow and become
positive again, the shift will shift to zero, and then the mask will
become ~0.
Now, you can fix it, but you need to be a tiny bit more clever. In
particular, just make sure that you retain the high bit of "_i",
basically making the rule be that a negative index is not ever valid.
And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
Now the sign bit is set if _i had it set, _or_ if the subtraction
turned negative, and you don't have to worry about the overflow
situation.
But it does require that extra step to be trustworthy. Still purely
cheap arithmetic operations, although there is possibly some
additional register pressure there.
Somebody might be able to come up with something even more minimal (or
find a fault in my fix of your tweak).
Obviously, with architecture-specific code, you may well be able to do
better, using the carry flag of the subtraction.
For example, on x86, I think you could do it with just two instructions:
# carry will be clear if idx >= max
cmpq %idx,%max
# mask will be clear if carry was clear, ~0 otherwise
sbbq %mask,%mask
to generate mask directly. I might have screwed that up. Worth perhaps trying?
Linus
On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
<[email protected]> wrote:
>
> # carry will be clear if idx >= max
> cmpq %idx,%max
Bah. Other way around.
cmpq %max,%idx
I'm a moron.
> # mask will be clear if carry was clear, ~0 otherwise
> sbbq %mask,%mask
>
> to generate mask directly. I might have screwed that up. Worth perhaps trying?
More importantly, worth _testing_ and fixing my hand-waving "asm like
this" crap.
But I do think that simple two-instruction cmpq/sbbq sequence could
get it right in just two trivial ALU instructions.
Linus
Hi Greg,
On Saturday, 6 January 2018 11:40:26 EET Greg KH wrote:
> On Sat, Jan 06, 2018 at 10:09:07AM +0100, Greg KH wrote:
> > On Fri, Jan 05, 2018 at 05:10:32PM -0800, Dan Williams wrote:
> >> Static analysis reports that 'index' may be a user controlled value that
> >> is used as a data dependency to read 'pin' from the
> >> 'selector->baSourceID' array. In order to avoid potential leaks of
> >> kernel memory values, block speculative execution of the instruction
> >> stream that could issue reads based on an invalid value of 'pin'.
> >>
> >> Based on an original patch by Elena Reshetova.
> >>
> >> Cc: Laurent Pinchart <[email protected]>
> >> Cc: Mauro Carvalho Chehab <[email protected]>
> >> Cc: [email protected]
> >> Signed-off-by: Elena Reshetova <[email protected]>
> >> Signed-off-by: Dan Williams <[email protected]>
> >> ---
> >>
> >> drivers/media/usb/uvc/uvc_v4l2.c | 7 +++++--
> >> 1 file changed, 5 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/media/usb/uvc/uvc_v4l2.c
> >> b/drivers/media/usb/uvc/uvc_v4l2.c index 3e7e283a44a8..7442626dc20e
> >> 100644
> >> --- a/drivers/media/usb/uvc/uvc_v4l2.c
> >> +++ b/drivers/media/usb/uvc/uvc_v4l2.c
> >> @@ -22,6 +22,7 @@
> >> #include <linux/mm.h>
> >> #include <linux/wait.h>
> >> #include <linux/atomic.h>
> >> +#include <linux/compiler.h>
> >>
> >> #include <media/v4l2-common.h>
> >> #include <media/v4l2-ctrls.h>
> >> @@ -810,6 +811,7 @@ static int uvc_ioctl_enum_input(struct file *file,
> >> void *fh,
> >> struct uvc_entity *iterm = NULL;
> >> u32 index = input->index;
> >> int pin = 0;
> >> + __u8 *elem;
> >>
> >> if (selector == NULL ||
> >> (chain->dev->quirks & UVC_QUIRK_IGNORE_SELECTOR_UNIT)) {
> >> @@ -820,8 +822,9 @@ static int uvc_ioctl_enum_input(struct file *file,
> >> void *fh,
> >> break;
> >> }
> >> pin = iterm->id;
> >> - } else if (index < selector->bNrInPins) {
> >> - pin = selector->baSourceID[index];
> >> + } else if ((elem = nospec_array_ptr(selector->baSourceID, index,
> >> + selector->bNrInPins))) {
> >> + pin = *elem;
> >
> > I dug through this before, and I couldn't find where index came from
> > userspace, I think seeing the coverity rule would be nice.
>
> Ok, I take it back, this looks correct. Ugh, the v4l ioctl api is
> crazy complex (rightfully so), it's amazing that coverity could navigate
> that whole thing :)
>
> While I'm all for fixing this type of thing, I feel like we need to do
> something "else" for this as playing whack-a-mole for this pattern is
> going to be a never-ending battle for all drivers for forever.
That's my concern too, as even if we managed to find and fix all the
occurrences of the problematic patterns (and we won't), new ones will keep
being merged all the time.
> Either we need some way to mark this data path to make it easy for tools
> like sparse to flag easily, or we need to catch the issue in the driver
> subsystems, which unfortunately would harm the drivers that don't have
> this type of issue (like here.)
But how would you do so ?
> I'm guessing that other operating systems, which don't have the luxury
> of auditing all of their drivers are going for the "big hammer in the
> subsystem" type of fix, right?
Other operating systems that ship closed-source drivers authored by hardware
vendors and not reviewed by third parties will likely stay vulnerable forever.
That's a small concern though as I expect those drivers to contain much larger
security holes anyway.
> I don't have a good answer for this, but if there was some better way to
> rewrite these types of patterns to just prevent the need for the
> nospec_array_ptr() type thing, that might be the best overall for
> everyone. Much like ebpf did with their changes. That way a simple
> coccinelle rule would be able to catch the pattern and rewrite it.
>
> Or am I just dreaming?
Likely, but sometimes dreams come true :-) Do you have an idea how this could
be done ?
--
Regards,
Laurent Pinchart
On Tue, Jan 09, 2018 at 10:40:21AM +0200, Laurent Pinchart wrote:
> On Saturday, 6 January 2018 11:40:26 EET Greg KH wrote:
> > On Sat, Jan 06, 2018 at 10:09:07AM +0100, Greg KH wrote:
> >
> > While I'm all for fixing this type of thing, I feel like we need to do
> > something "else" for this as playing whack-a-mole for this pattern is
> > going to be a never-ending battle for all drivers for forever.
>
> That's my concern too, as even if we managed to find and fix all the
> occurrences of the problematic patterns (and we won't), new ones will keep
> being merged all the time.
And what about the millions of lines of out-of-tree drivers that we all
rely on every day in our devices? What about the distro kernels that
add random new drivers?
We need some sort of automated way to scan for this.
Intel, any chance we can get your coverity rules? Given that the date
of this original patchset was from last August, has anyone looked at
what is now in Linus's tree? What about linux-next? I just added 3
brand-new driver subsystems to the kernel tree there, how do we know
there aren't problems in them?
And what about all of the other ways user-data can be affected? Again,
as Peter pointed out, USB devices. I want some chance to be able to at
least audit the codebase we have to see if that path is an issue.
Without any hint of how to do this in an automated manner, we are all
in deep shit for forever.
> > Either we need some way to mark this data path to make it easy for tools
> > like sparse to flag easily, or we need to catch the issue in the driver
> > subsystems, which unfortunately would harm the drivers that don't have
> > this type of issue (like here.)
>
> But how would you do so ?
I do not know, it all depends on the access pattern, right?
> > I'm guessing that other operating systems, which don't have the luxury
> > of auditing all of their drivers are going for the "big hammer in the
> > subsystem" type of fix, right?
>
> Other operating systems that ship closed-source drivers authored by hardware
> vendors and not reviewed by third parties will likely stay vulnerable forever.
> That's a small concern though as I expect those drivers to contain much larger
> security holes anyway.
Well yes, but odds are those operating systems are doing something to
mitigate this, right? Slowing down all user/kernel data paths?
Targeted code analysis tools? Something else? I doubt they just don't
care at all about it. At the least, I would think Coverity would be
trying to sell licenses for this :(
thanks,
greg k-h
Hi Greg,
On Tuesday, 9 January 2018 12:04:10 EET Greg KH wrote:
> On Tue, Jan 09, 2018 at 10:40:21AM +0200, Laurent Pinchart wrote:
> > On Saturday, 6 January 2018 11:40:26 EET Greg KH wrote:
> >> On Sat, Jan 06, 2018 at 10:09:07AM +0100, Greg KH wrote:
> >>
> >> While I'm all for fixing this type of thing, I feel like we need to do
> >> something "else" for this as playing whack-a-mole for this pattern is
> >> going to be a never-ending battle for all drivers for forever.
> >
> > That's my concern too, as even if we managed to find and fix all the
> > occurrences of the problematic patterns (and we won't), new ones will keep
> > being merged all the time.
>
> And what about the millions of lines of out-of-tree drivers that we all
> rely on every day in our devices? What about the distro kernels that
> add random new drivers?
Of course, even though the out-of-tree drivers probably come with lots of
security issues worse than this one.
> We need some sort of automated way to scan for this.
Is there any initiative to implement such a scan in an open-source tool ?
We also need to educate developers. An automatic scanner could help there, but
in the end the information has to spread to all our brains. It won't be easy,
and is likely not fully feasible, but it's no different than how developers
have to be educated about race conditions and locking for instance. It's a
mind set.
> Intel, any chance we can get your coverity rules? Given that the date
> of this original patchset was from last August, has anyone looked at
> what is now in Linus's tree? What about linux-next? I just added 3
> brand-new driver subsystems to the kernel tree there, how do we know
> there aren't problems in them?
>
> And what about all of the other ways user-data can be affected? Again,
> as Peter pointed out, USB devices. I want some chance to be able to at
> least audit the codebase we have to see if that path is an issue.
> Without any hint of how to do this in an automated manner, we are all
> in deep shit for forever.
Or at least until the hardware architecture evolves. Let's drop the x86
instruction set, expose the µops, and have gcc handle the scheduling. Sure, it
will mean recompiling everything for every x86 CPU model out there, but we
have source-based distros to the rescue :-D
> >> Either we need some way to mark this data path to make it easy for tools
> >> like sparse to flag easily, or we need to catch the issue in the driver
> >> subsystems, which unfortunately would harm the drivers that don't have
> >> this type of issue (like here.)
> >
> > But how would you do so ?
>
> I do not know, it all depends on the access pattern, right?
Any data coming from userspace could trigger such accesses. If we want
complete coverage the only way I can think of is starting from syscalls and
tainting data down the call stacks (__user could help to some extent), but
we'll likely be drowned in false positives. I don't see how we could mark
paths manually.
> >> I'm guessing that other operating systems, which don't have the luxury
> >> of auditing all of their drivers are going for the "big hammer in the
> >> subsystem" type of fix, right?
> >
> > Other operating systems that ship closed-source drivers authored by
> > hardware vendors and not reviewed by third parties will likely stay
> > vulnerable forever. That's a small concern though as I expect those
> > drivers to contain much larger security holes anyway.
>
> Well yes, but odds are those operating systems are doing something to
> mitigate this, right? Slowing down all user/kernel data paths?
> Targeted code analysis tools? Something else? I doubt they just don't
> care at all about it. At the least, I would think Coverity would be
> trying to sell licenses for this :(
Given their track record of security issues in drivers (and I won't even
mention the ones that are present by design, such as rootkits in game copy
protection systems for instance) I doubt they will do much besides sprinkling a
bit of PR dust, at least for the consumer market. On the server market it
might be different as there's less hardware variation, and thus less drivers
to handle.
--
Regards,
Laurent Pinchart
On Tue, Jan 09, 2018 at 04:26:28PM +0200, Laurent Pinchart wrote:
> Hi Greg,
>
> On Tuesday, 9 January 2018 12:04:10 EET Greg KH wrote:
> > On Tue, Jan 09, 2018 at 10:40:21AM +0200, Laurent Pinchart wrote:
> > > On Saturday, 6 January 2018 11:40:26 EET Greg KH wrote:
> > >> On Sat, Jan 06, 2018 at 10:09:07AM +0100, Greg KH wrote:
> > >>
> > >> While I'm all for fixing this type of thing, I feel like we need to do
> > >> something "else" for this as playing whack-a-mole for this pattern is
> > >> going to be a never-ending battle for all drivers for forever.
> > >
> > > That's my concern too, as even if we managed to find and fix all the
> > > occurrences of the problematic patterns (and we won't), new ones will keep
> > > being merged all the time.
> >
> > And what about the millions of lines of out-of-tree drivers that we all
> > rely on every day in our devices? What about the distro kernels that
> > add random new drivers?
>
> Of course, even though the out-of-tree drivers probably come with lots of
> security issues worse than this one.
Sure, but I have worked with some teams that have used coverity to find
and fix all of the reported bugs it finds. So some companies are
trying to fix their problems here, let's not make it impossible for them :)
> > We need some sort of automated way to scan for this.
>
> Is there any initiative to implement such a scan in an open-source tool ?
Sure, if you want to, but I have no such initiative...
> We also need to educate developers. An automatic scanner could help there, but
> in the end the information has to spread to all our brains. It won't be easy,
> and is likely not fully feasible, but it's no different than how developers
> have to be educated about race conditions and locking for instance. It's a
> mind set.
Agreed.
> > Intel, any chance we can get your coverity rules? Given that the date
> > of this original patchset was from last August, has anyone looked at
> > what is now in Linus's tree? What about linux-next? I just added 3
> > brand-new driver subsystems to the kernel tree there, how do we know
> > there aren't problems in them?
> >
> > And what about all of the other ways user-data can be affected? Again,
> > as Peter pointed out, USB devices. I want some chance to be able to at
> > least audit the codebase we have to see if that path is an issue.
> > Without any hint of how to do this in an automated manner, we are all
> > in deep shit for forever.
>
> Or at least until the hardware architecture evolves. Let's drop the x86
> instruction set, expose the µops, and have gcc handle the scheduling. Sure, it
> will mean recompiling everything for every x86 CPU model out there, but we
> have source-based distros to the rescue :-D
Then we are back at the Itanium mess, where all of the hardware issues
were supposed to be fixed by the compiler writers. We all remember how
well that worked out...
> > >> Either we need some way to mark this data path to make it easy for tools
> > >> like sparse to flag easily, or we need to catch the issue in the driver
> > >> subsystems, which unfortunately would harm the drivers that don't have
> > >> this type of issue (like here.)
> > >
> > > But how would you do so ?
> >
> > I do not know, it all depends on the access pattern, right?
>
> Any data coming from userspace could trigger such accesses. If we want
> complete coverage the only way I can think of is starting from syscalls and
> tainting data down the call stacks (__user could help to some extent), but
> we'll likely be drowned in false positives. I don't see how we could mark
> paths manually.
I agree, which is why I want to see how someone did this work
originally. We have no idea as no one is telling us anything :(
How do we "know" that these are the only problem areas? When was the
last scan run? On what tree? And so on...
thanks,
greg k-h
Dan Williams <[email protected]> writes:
> On Mon, Jan 8, 2018 at 7:11 PM, Eric W. Biederman <[email protected]> wrote:
>> Dan Williams <[email protected]> writes:
>>
>>> Static analysis reports that 'index' may be a user controlled value that
>>> is used as a data dependency reading 'rt' from the 'platform_label'
>>> array. In order to avoid potential leaks of kernel memory values, block
>>> speculative execution of the instruction stream that could issue further
>>> reads based on an invalid 'rt' value.
>>
>>
>> In detail.
>> a) This code is fast path packet forwarding code. Introducing an
>> unconditional pipeline stall is not ok.
>>
>> AKA either there is no speculation and so this is invulnerable
>> or there is speculation and you are creating an unconditional
>> pipeline stall here.
>>
>> My back of the napkin calculations say that a pipeline stall
>> is about 20 cycles. Which is about the same length of time
>> as a modern cache miss.
>>
>> On a good day this code will perform with 0 cache misses. On a less
>> good day 1 cache miss. Which means you are quite possibly doubling
>> the runtime of mpls_forward.
>>
>> b) The array is dynamically allocated which should provide some
>> protection, as it will be more difficult to predict the address
>> of the array which is needed to craft a malicious userspace value.
>>
>> c) The code can be trivially modified to say:
>>
>> static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
>> {
>> struct mpls_route *rt = NULL;
>>
>> if (index < net->mpls.platform_labels) {
>> struct mpls_route __rcu **platform_label =
>> rcu_dereference(net->mpls.platform_label);
>> rt = rcu_dereference(platform_label[index & ((1 << 20) - 1)]);
>> }
>> return rt;
>> }
>>
>> AKA a static mask will ensure that there is not a primitive that can be
>> used to access all of memory. That is max a 1 cycle slowdown in the
>> code, which is a much better trade off.
>>
>> d) If we care more it is straightforward to modify
>> resize_platform_label_table() to ensure that the size of the array
>> is always a power of 2.
>>
>> e) The fact that a pointer is returned from the array and it is treated
>> like a pointer would seem to provide a defense against the
>> exfiltration technique of using the value read as an index into
>> a small array, that user space code can probe aliased cached
>> lines of, to see which value was dereferenced.
>>
>>
>> So to this patch in particular.
>> Nacked-by: "Eric W. Biederman" <[email protected]>
>>
>> This code path will be difficult to exploit. This change messes with
>> performance. There are ways to make this code path useless while
>> preserving the performance of the code.
>>
>
> Thanks, Eric, understood. The discussion over the weekend came to the
> conclusion that using a mask will be the default approach. The
> nospec_array_ptr() will be defined to something similar to the
> following, originally from Linus and tweaked by Alexei and I:
>
> #define __nospec_array_ptr(base, idx, sz) \
> ({ \
> union { typeof(&base[0]) _ptr; unsigned long _bit; } __u; \
> unsigned long _i = (idx); \
> unsigned long _m = (sz); \
> unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
> OPTIMIZER_HIDE_VAR(_mask); \
> __u._ptr = &base[_i & _mask]; \
> __u._bit &= _mask; \
> __u._ptr; \
> })
>
> Does that address your performance concerns?
The user controlled value condition of your patchset implies that you
assume indirect branch predictor poisoning is handled in other ways.
Which means that we can assume speculation will take some variation of
the static call chain.
Further you are worrying about array accesses. Which means you
are worried about attacks that are the equivalent of meltdown that
can give you reading of all memory available to the kernel.
The mpls code in question reads a pointer from memory.
The only thing the code does with that pointer is verify it is not NULL
and dereference it.
That does not make it possible to extricate the pointer bits via a cache
side-channel as a pointer is 64 bits wide.
There might be a timing attack where it is measured how long the
packet takes to deliver. If you can find the base address of the array,
at best such a timing attack will tell you whether some arbitrary cache
line is already cached in the kernel. Which is not the class of attack
your patchset is worried about. Further there are more direct ways
to probe the cache from a local process.
So I submit to you that the mpls code is not vulnerable to the class of
attack you are addressing.
Further I would argue that anything that reads a pointer from memory is
a very strong clue that it falls outside the class of code that you are
addressing.
Show me where I am wrong and I will consider patches.
Eric
On Tue, Jan 9, 2018 at 8:17 AM, Eric W. Biederman <[email protected]> wrote:
> Dan Williams <[email protected]> writes:
[..]
> The user controlled value condition of your patchset implies that you
> assume indirect branch predictor poisoning is handled in other ways.
>
> Which means that we can assume speculation will take some variation of
> the static call chain.
>
> Further you are worrying about array accesses. Which means you
> are worried about attacks that are the equivalent of meltdown that
> can give you reading of all memory available to the kernel.
>
>
> The mpls code in question reads a pointer from memory.
>
>
> The only thing the code does with that pointer is verify it is not NULL
> and dereference it.
>
> That does not make it possible to extricate the pointer bits via a cache
> side-channel as a pointer is 64 bits wide.
>
> There might be a timing attack where it is measured how long the
> packet takes to deliver. If you can find the base address of the array,
> at best such a timing attack will tell you whether some arbitrary cache
> line is already cached in the kernel. Which is not the class of attack
> your patchset is worried about. Further there are more direct ways
> to probe the cache from a local process.
>
> So I submit to you that the mpls code is not vulnerable to the class of
> attack you are addressing.
>
> Further I would argue that anything that reads a pointer from memory is
> a very strong clue that it falls outside the class of code that you are
> addressing.
>
> Show me where I am wrong and I will consider patches.
No, the concern is a second dependent read (or write) within the
speculation window after this first bounds-checked dependent read.
I.e. this mpls code path appears to have the setup condition:
if (x < max)
val = array1[x];
...but it's not clear that there is an exploit condition later on in
the instruction stream:
array2[val] = y;
/* or */
y = array2[val];
My personal paranoia says submit the patch and not worry about finding
that later exploit condition; if DaveM wants to drop the patch that's
his prerogative. In general, with the performance conscious version of
nospec_array_ptr() being the default, why worry about what is / is not
in the speculation window?
On Fri, 5 Jan 2018, Dan Williams wrote:
[ ... snip ... ]
> Andi Kleen (1):
> x86, barrier: stop speculation for failed access_ok
>
> Dan Williams (13):
> x86: implement nospec_barrier()
> [media] uvcvideo: prevent bounds-check bypass via speculative execution
> carl9170: prevent bounds-check bypass via speculative execution
> p54: prevent bounds-check bypass via speculative execution
> qla2xxx: prevent bounds-check bypass via speculative execution
> cw1200: prevent bounds-check bypass via speculative execution
> Thermal/int340x: prevent bounds-check bypass via speculative execution
> ipv6: prevent bounds-check bypass via speculative execution
> ipv4: prevent bounds-check bypass via speculative execution
> vfs, fdtable: prevent bounds-check bypass via speculative execution
> net: mpls: prevent bounds-check bypass via speculative execution
> udf: prevent bounds-check bypass via speculative execution
> userns: prevent bounds-check bypass via speculative execution
>
> Mark Rutland (4):
> asm-generic/barrier: add generic nospec helpers
> Documentation: document nospec helpers
> arm64: implement nospec_ptr()
> arm: implement nospec_ptr()
So considering the recent publication of [1], how come we all of a sudden
don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
Is this going to be handled in eBPF in some other way?
Without that in place, and considering Jann Horn's paper, it would seem
like PTI doesn't really lock it down fully, right?
[1] https://bugs.chromium.org/p/project-zero/issues/attachmentText?aid=287305
--
Jiri Kosina
SUSE Labs
On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina <[email protected]> wrote:
> On Fri, 5 Jan 2018, Dan Williams wrote:
>
> [ ... snip ... ]
>> Andi Kleen (1):
>> x86, barrier: stop speculation for failed access_ok
>>
>> Dan Williams (13):
>> x86: implement nospec_barrier()
>> [media] uvcvideo: prevent bounds-check bypass via speculative execution
>> carl9170: prevent bounds-check bypass via speculative execution
>> p54: prevent bounds-check bypass via speculative execution
>> qla2xxx: prevent bounds-check bypass via speculative execution
>> cw1200: prevent bounds-check bypass via speculative execution
>> Thermal/int340x: prevent bounds-check bypass via speculative execution
>> ipv6: prevent bounds-check bypass via speculative execution
>> ipv4: prevent bounds-check bypass via speculative execution
>> vfs, fdtable: prevent bounds-check bypass via speculative execution
>> net: mpls: prevent bounds-check bypass via speculative execution
>> udf: prevent bounds-check bypass via speculative execution
>> userns: prevent bounds-check bypass via speculative execution
>>
>> Mark Rutland (4):
>> asm-generic/barrier: add generic nospec helpers
>> Documentation: document nospec helpers
>> arm64: implement nospec_ptr()
>> arm: implement nospec_ptr()
>
> So considering the recent publication of [1], how come we all of a sudden
> don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
> LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
>
> Is this going to be handled in eBPF in some other way?
>
> Without that in place, and considering Jann Horn's paper, it would seem
> like PTI doesn't really lock it down fully, right?
Here is the latest (v3) bpf fix:
https://patchwork.ozlabs.org/patch/856645/
I currently have v2 on my 'nospec' branch and will move that to v3 for
the next update, unless it goes upstream before then.
On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote:
> On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina <[email protected]> wrote:
> > On Fri, 5 Jan 2018, Dan Williams wrote:
> >
> > [ ... snip ... ]
> >> Andi Kleen (1):
> >> x86, barrier: stop speculation for failed access_ok
> >>
> >> Dan Williams (13):
> >> x86: implement nospec_barrier()
> >> [media] uvcvideo: prevent bounds-check bypass via speculative execution
> >> carl9170: prevent bounds-check bypass via speculative execution
> >> p54: prevent bounds-check bypass via speculative execution
> >> qla2xxx: prevent bounds-check bypass via speculative execution
> >> cw1200: prevent bounds-check bypass via speculative execution
> >> Thermal/int340x: prevent bounds-check bypass via speculative execution
> >> ipv6: prevent bounds-check bypass via speculative execution
> >> ipv4: prevent bounds-check bypass via speculative execution
> >> vfs, fdtable: prevent bounds-check bypass via speculative execution
> >> net: mpls: prevent bounds-check bypass via speculative execution
> >> udf: prevent bounds-check bypass via speculative execution
> >> userns: prevent bounds-check bypass via speculative execution
> >>
> >> Mark Rutland (4):
> >> asm-generic/barrier: add generic nospec helpers
> >> Documentation: document nospec helpers
> >> arm64: implement nospec_ptr()
> >> arm: implement nospec_ptr()
> >
> > So considering the recent publication of [1], how come we all of a sudden
> > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
> > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
> >
> > Is this going to be handled in eBPF in some other way?
> >
> > Without that in place, and considering Jann Horn's paper, it would seem
> > like PTI doesn't really lock it down fully, right?
>
> Here is the latest (v3) bpf fix:
>
> https://patchwork.ozlabs.org/patch/856645/
>
> I currently have v2 on my 'nospec' branch and will move that to v3 for
> the next update, unless it goes upstream before then.
That patch seems specific to CONFIG_BPF_SYSCALL. Is the bpf() syscall
the only attack vector? Or are there other ways to run bpf programs
that we should be worried about?
--
Josh
On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
> > From: Andi Kleen <[email protected]>
> >
> > When access_ok fails we should always stop speculating.
> > Add the required barriers to the x86 access_ok macro.
>
> Honestly, this seems completely bogus.
>
> The description is pure garbage afaik.
>
> The fact is, we have to stop speculating when access_ok() does *not*
> fail - because that's when we'll actually do the access. And it's that
> access that needs to be non-speculative.
>
> That actually seems to be what the code does (it stops speculation
> when __range_not_ok() returns false, but access_ok() is
> !__range_not_ok()). But the explanation is crap, and dangerous.
The description also seems to be missing the "why", as it's not
self-evident (to me, at least).
Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
i.e., wouldn't the pattern be:
get_user(uval, uptr);
if (uval < array_size) {
lfence();
foo = a2[a1[uval] * 256];
}
Shouldn't the lfence come much later, *after* reading the variable and
comparing it and branching accordingly?
--
Josh
On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf <[email protected]> wrote:
> On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
>> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
>> > From: Andi Kleen <[email protected]>
>> >
>> > When access_ok fails we should always stop speculating.
>> > Add the required barriers to the x86 access_ok macro.
>>
>> Honestly, this seems completely bogus.
>>
>> The description is pure garbage afaik.
>>
>> The fact is, we have to stop speculating when access_ok() does *not*
>> fail - because that's when we'll actually do the access. And it's that
>> access that needs to be non-speculative.
>>
>> That actually seems to be what the code does (it stops speculation
>> when __range_not_ok() returns false, but access_ok() is
>> !__range_not_ok()). But the explanation is crap, and dangerous.
>
> The description also seems to be missing the "why", as it's not
> self-evident (to me, at least).
>
> Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
>
> i.e., wouldn't the pattern be:
>
> get_user(uval, uptr);
> if (uval < array_size) {
> lfence();
> foo = a2[a1[uval] * 256];
> }
>
> Shouldn't the lfence come much later, *after* reading the variable and
> comparing it and branching accordingly?
The goal of putting the lfence in uaccess_begin() is to prevent
speculation past access_ok(). You are correct that the cpu could later
mis-speculate on uval; that's where taint-analysis tooling needs to
track uval to where it is used, and that's where the
nospec_array_ptr() patches come into play.
On Tue, Jan 09, 2018 at 01:47:09PM -0800, Dan Williams wrote:
> On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf <[email protected]> wrote:
> > On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
> >> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
> >> > From: Andi Kleen <[email protected]>
> >> >
> >> > When access_ok fails we should always stop speculating.
> >> > Add the required barriers to the x86 access_ok macro.
> >>
> >> Honestly, this seems completely bogus.
> >>
> >> The description is pure garbage afaik.
> >>
> >> The fact is, we have to stop speculating when access_ok() does *not*
> >> fail - because that's when we'll actually do the access. And it's that
> >> access that needs to be non-speculative.
> >>
> >> That actually seems to be what the code does (it stops speculation
> >> when __range_not_ok() returns false, but access_ok() is
> >> !__range_not_ok()). But the explanation is crap, and dangerous.
> >
> > The description also seems to be missing the "why", as it's not
> > self-evident (to me, at least).
> >
> > Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
> >
> > i.e., wouldn't the pattern be:
> >
> > get_user(uval, uptr);
> > if (uval < array_size) {
> > lfence();
> > foo = a2[a1[uval] * 256];
> > }
> >
> > Shouldn't the lfence come much later, *after* reading the variable and
> > comparing it and branching accordingly?
>
> The goal of putting the lfence in uaccess_begin() is to prevent
> speculation past access_ok().
Right, but what's the purpose of preventing speculation past
access_ok()?
--
Josh
On Tue, Jan 9, 2018 at 1:49 PM, Josh Poimboeuf <[email protected]> wrote:
> On Tue, Jan 09, 2018 at 01:47:09PM -0800, Dan Williams wrote:
>> On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf <[email protected]> wrote:
>> > On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote:
>> >> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <[email protected]> wrote:
>> >> > From: Andi Kleen <[email protected]>
>> >> >
>> >> > When access_ok fails we should always stop speculating.
>> >> > Add the required barriers to the x86 access_ok macro.
>> >>
>> >> Honestly, this seems completely bogus.
>> >>
>> >> The description is pure garbage afaik.
>> >>
>> >> The fact is, we have to stop speculating when access_ok() does *not*
>> >> fail - because that's when we'll actually do the access. And it's that
>> >> access that needs to be non-speculative.
>> >>
>> >> That actually seems to be what the code does (it stops speculation
>> >> when __range_not_ok() returns false, but access_ok() is
>> >> !__range_not_ok()). But the explanation is crap, and dangerous.
>> >
>> > The description also seems to be missing the "why", as it's not
>> > self-evident (to me, at least).
>> >
>> > Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence?
>> >
>> > i.e., wouldn't the pattern be:
>> >
>> > get_user(uval, uptr);
>> > if (uval < array_size) {
>> > lfence();
>> > foo = a2[a1[uval] * 256];
>> > }
>> >
>> > Shouldn't the lfence come much later, *after* reading the variable and
>> > comparing it and branching accordingly?
>>
>> The goal of putting the lfence in uaccess_begin() is to prevent
>> speculation past access_ok().
>
> Right, but what's the purpose of preventing speculation past
> access_ok()?
Caution. It's the same rationale for the nospec_array_ptr() patches.
If we, the kernel community, can identify any possible speculation past a
bounds check we should inject a speculation mitigation. Unless there's
a way to be 100% certain that the first unwanted speculation can *not*
be turned into a gadget later on in the instruction stream, err on the
side of shutting it down early.
On Tue, Jan 09, 2018 at 01:59:04PM -0800, Dan Williams wrote:
> > Right, but what's the purpose of preventing speculation past
> > access_ok()?
>
> Caution. It's the same rationale for the nospec_array_ptr() patches.
> If we, the kernel community, can identify any possible speculation past a
> bounds check we should inject a speculation mitigation. Unless there's
> a way to be 100% certain that the first unwanted speculation can *not*
> be turned into a gadget later on in the instruction stream, err on the
> side of shutting it down early.
I'm all for being cautious. The nospec_array_ptr() patches are fine,
and they make sense in light of the variant 1 CVE.
But that still doesn't answer my question. I haven't seen *any*
rationale for this patch. It would be helpful to at least describe
what's being protected against, even if it's hypothetical. How can we
review it if the commit log doesn't describe its purpose?
--
Josh
On Tue, Jan 9, 2018 at 2:23 PM, Josh Poimboeuf <[email protected]> wrote:
> On Tue, Jan 09, 2018 at 01:59:04PM -0800, Dan Williams wrote:
>> > Right, but what's the purpose of preventing speculation past
>> > access_ok()?
>>
>> Caution. It's the same rationale for the nospec_array_ptr() patches.
>> If we, the kernel community, can identify any possible speculation past a
>> bounds check we should inject a speculation mitigation. Unless there's
>> a way to be 100% certain that the first unwanted speculation can *not*
>> be turned into a gadget later on in the instruction stream, err on the
>> side of shutting it down early.
>
> I'm all for being cautious. The nospec_array_ptr() patches are fine,
> and they make sense in light of the variant 1 CVE.
>
> But that still doesn't answer my question. I haven't seen *any*
> rationale for this patch. It would be helpful to at least describe
> what's being protected against, even if it's hypothetical. How can we
> review it if the commit log doesn't describe its purpose?
Certainly the changelog needs improvement, I'll roll these concerns
into v2 and we can go from there.
On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams <[email protected]> wrote:
> >
> > originally from Linus and tweaked by Alexei and I:
>
> Sadly, that tweak - while clever - is wrong.
>
> > unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
>
> Why?
>
> Because "(long)(_m-1-_i)" is not negative just because "i >= m". It
> can still be positive.
>
> Think "m = 100", "i=bignum". The subtraction will overflow and become
> positive again, the shift will shift to zero, and then the mask will
> become ~0.
>
> Now, you can fix it, but you need to be a tiny bit more clever. In
> particular, just make sure that you retain the high bit of "_i",
> basically making the rule be that a negative index is not ever valid.
>
> And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
> Now the sign bit is set if _i had it set, _or_ if the subtraction
> turned negative, and you don't have to worry about the overflow
> situation.
>
> But it does require that extra step to be trustworthy. Still purely
> cheap arithmetic operations, although there is possibly some
> additional register pressure there.
>
> Somebody might be able to come up with something even more minimal (or
> find a fault in my fix of your tweak).
It looks like there is another problem, or I'm misreading the
cleverness. We want the mask to be ~0 in the ok case and 0 in the
out-of-bounds case. As far as I can see we end up with ~0 in the ok
case and ~1 in the bad case. Don't we need to do something like the
following, at which point we're getting out of the realm of "cheap
ALU instructions"?
#define __nospec_array_ptr(base, idx, sz) \
({ \
union { typeof(&base[0]) _ptr; unsigned long _bit; } __u; \
unsigned long _i = (idx); \
unsigned long _s = (sz); \
unsigned long _v = (long)(_i | _s - 1 - _i) \
>> BITS_PER_LONG - 1; \
unsigned long _mask = _v * ~0UL; \
OPTIMIZER_HIDE_VAR(_mask); \
__u._ptr = &base[_i & _mask]; \
__u._bit &= _mask; \
__u._ptr; \
})
elem = nospec_array_ptr(array, idx, 3);
106: b8 02 00 00 00 mov $0x2,%eax
10b: 48 63 ff movslq %edi,%rdi
10e: 48 29 f8 sub %rdi,%rax
111: 48 09 f8 or %rdi,%rax
114: 48 c1 e8 3f shr $0x3f,%rax
118: 48 21 c7 and %rax,%rdi
11b: 48 8d 54 bc 04 lea 0x4(%rsp,%rdi,4),%rdx
120: 48 21 d0 and %rdx,%rax
Dan Williams <[email protected]> writes:
> On Tue, Jan 9, 2018 at 8:17 AM, Eric W. Biederman <[email protected]> wrote:
>> Dan Williams <[email protected]> writes:
> [..]
>> The user controlled value condition of your patchset implies that you
>> assume indirect branch predictor poisoning is handled in other ways.
>>
>> Which means that we can assume speculation will take some variation of
>> the static call chain.
>>
>> Further you are worrying about array accesses. Which means you
>> are worried about attacks that are the equivalent of meltdown that
>> can give you reading of all memory available to the kernel.
>>
>>
>> The mpls code in question reads a pointer from memory.
>>
>>
>> The only thing the code does with that pointer is verify it is not NULL
>> and dereference it.
>>
>> That does not make it possible to extricate the pointer bits via a cache
>> side-channel as a pointer is 64bits wide.
>>
>> There might maybe be a timing attack where it is timed how long the
>> packet takes to deliver. If you can find the base address of the array,
>> at best such a timing attack will tell you if some arbitrary cache
>> line is already cached in the kernel. Which is not the class of attack
>> your patchset is worried about. Further there are more direct ways
>> to probe the cache from a local process.
>>
>> So I submit to you that the mpls code is not vulnerable to the class of
>> attack you are addressing.
>>
>> Further I would argue that anything that reads a pointer from memory is
>> a very strong clue that it falls outside the class of code that you are
>> addressing.
>>
>> Show me where I am wrong and I will consider patches.
>
> No, the concern is a second dependent read (or write) within the
> speculation window after this first bounds-checked dependent read.
> I.e. this mpls code path appears to have the setup condition:
>
> if (x < max)
> val = array1[x];
>
> ...but it's not clear that there is an exploit condition later on in
> the instruction stream:
>
> array2[val] = y;
> /* or */
> y = array2[val];
>
> My personal paranoia says submit the patch and not worry about finding
> that later exploit condition, if DaveM wants to drop the patch that's
> his prerogative. In general, with the performance conscious version of
> nospec_array_ptr() being the default, why worry about what is / is not
> in the speculation window?
Sigh.
I am not worrying about what is in the speculation window. My
assumption is that the speculation window could be infinite, as we don't
know the limitations of all possible processors.
I am saying there is not a way to get the data out of the speculation
window.
I was saying that when you have:
if (x < max)
val = array1[x];
When val is a pointer not an integer.
Then
array2[val] = y;
/* or */
y = array2[val];
Won't happen.
val->field;
Will happen.
Which looks similar. However, the address space of pointers is too
large, making it impossible for an attacker to know where to look in the
cache to see if "val->field" happened, at least on the assumption that
val is an arbitrary value.
Further, mpls_forward is small enough that the entire scope of "rt", the
value possibly read past the bounds check, is auditable without too much
trouble. I have looked and I don't see anything that could possibly
allow the value of "rt" to be exfiltrated. The problem continues to be
that it is a pointer, and the only operation on the pointer besides
dereferencing it is testing if it is NULL.
Other types of timing attacks are very hard if not impossible because
any packet presenting a value outside the bounds check will be
dropped. So it will be hard if not impossible to find something to time
to see how long it took to drop the packet.
So no this code path does not present with the classic signature of
variant 1: bounds check bypass CVE-2017-5753
Show me where I am wrong and I will gladly take patches. As it is the
mpls fast path does not appear vulnerable to the issue you are
addressing. So the best thing we can do for performance is nothing at
all.
All I am after is a plausible scenario. I just want to see it spelled
out which combinations of things could possibly be a problem.
Eric
On Tue, Jan 9, 2018 at 4:54 PM, Eric W. Biederman <[email protected]> wrote:
> Dan Williams <[email protected]> writes:
[..]
> Sigh.
> I am not worrying about what is in the speculation window. My
> assumption is that the speculation window could be infinite, as we don't
> know the limitations of all possible processors.
>
> I am saying there is not a way to get the data out of the speculation
> window.
>
> I was saying that when you have:
> if (x < max)
> val = array1[x];
>
> When val is a pointer not an integer.
> Then
> array2[val] = y;
> /* or */
> y = array2[val];
>
> Won't happen.
>
> val->field;
>
> Will happen.
>
> Which looks similar. However, the address space of pointers is too
> large, making it impossible for an attacker to know where to look in the
> cache to see if "val->field" happened, at least on the assumption that
> val is an arbitrary value.
>
> Further mpls_forward is small enough the entire scope of "rt" the value
> read possibly past the bound check is auditable without too much
> trouble. I have looked and I don't see anything that could possibly
> allow the value of "rt" to be exfiltrated. The problem continuing to be
> that it is a pointer and the only operation on the pointer besides
> dereferencing it is testing if it is NULL.
>
> Other types of timing attacks are very hard if not impossible because
> any packet presenting with a value outside the bounds check will be
> dropped. So it will be hard if not impossible to find something to time to
> see how long it took to drop the packet.
>
> So no this code path does not present with the classic signature of
> variant 1: bounds check bypass CVE-2017-5753
>
> Show me where I am wrong and I will gladly take patches. As it is the
> mpls fast path does not appear vulnerable to the issue you are
> addressing. So the best thing we can do for performance is nothing at
> all.
That's completely valid. Let's drop this one.
> All I am after is a plausible scenario. I just want to see it spelled
> out which combinations of things could possibly be a problem.
In fact, I'm not here to spell out the speculative execution problem
in this path beyond the initial user-controlled speculative read. As
noted in the cover letter thread, this and the other patches are
simply reports that are fully expected to be resolved as false
positives by sub-system owners in some cases. I'm otherwise more
focused on landing common infrastructure that could be used in the
future as data-flow-analysis tooling improves to find these locations
with higher accuracy. In other words, the goal at the end of this
exercise is to identify a default nospec_array_ptr() implementation
that all sub-systems could accept for when the problem report turns
out to be real, and your pushback has already resulted in good
discussion of alternate nospec_array_ptr() implementations.
Thanks for your patience to keep the conversation going.
On Tue, Jan 9, 2018 at 4:48 PM, Dan Williams <[email protected]> wrote:
> On Mon, Jan 8, 2018 at 8:13 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> On Mon, Jan 8, 2018 at 7:42 PM, Dan Williams <[email protected]> wrote:
>> >
>> > originally from Linus and tweaked by Alexei and I:
>>
>> Sadly, that tweak - while clever - is wrong.
>>
>> > unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
>>
>> Why?
>>
>> Because "(long)(_m-1-_i)" is not negative just because "i >= m". It
>> can still be positive.
>>
>> Think "m = 100", "i=bignum". The subtraction will overflow and become
>> positive again, the shift will shift to zero, and then the mask will
>> become ~0.
>>
>> Now, you can fix it, but you need to be a tiny bit more clever. In
>> particular, just make sure that you retain the high bit of "_i",
>> basically making the rule be that a negative index is not ever valid.
>>
>> And then instead of "(_m - 1 - _i)", you use "(_i | (_m - 1 - _i))".
>> Now the sign bit is set if _i had it set, _or_ if the subtraction
>> turned negative, and you don't have to worry about the overflow
>> situation.
>>
>> But it does require that extra step to be trustworthy. Still purely
>> cheap arithmetic operations, although there is possibly some
>> additional register pressure there.
>>
>> Somebody might be able to come up with something even more minimal (or
>> find a fault in my fix of your tweak).
>
> It looks like there is another problem, or I'm misreading the
> cleverness. We want the mask to be ~0 in the ok case and 0 in the
> out-of-bounds case. As far as I can see we end up with ~0 in the ok
> case, and ~1 in the bad case. Don't we need to do something like the
> following, at which point are we getting out of the realm of "cheap
> ALU instructions"?
>
> #define __nospec_array_ptr(base, idx, sz) \
> ({ \
> union { typeof(&base[0]) _ptr; unsigned long _bit; } __u; \
> unsigned long _i = (idx); \
> unsigned long _s = (sz); \
> unsigned long _v = (long)(_i | _s - 1 - _i) \
> >> BITS_PER_LONG - 1; \
> unsigned long _mask = _v * ~0UL; \
> OPTIMIZER_HIDE_VAR(_mask); \
> __u._ptr = &base[_i & _mask]; \
> __u._bit &= _mask; \
> __u._ptr; \
> })
Sorry, I'm slow; of course ~(-1L) is 0.
On Tue, Jan 09, 2018 at 04:48:24PM -0800, Dan Williams wrote:
>
> #define __nospec_array_ptr(base, idx, sz) \
> ({ \
> union { typeof(&base[0]) _ptr; unsigned long _bit; } __u; \
> unsigned long _i = (idx); \
> unsigned long _s = (sz); \
> unsigned long _v = (long)(_i | _s - 1 - _i) \
> >> BITS_PER_LONG - 1; \
> unsigned long _mask = _v * ~0UL; \
> OPTIMIZER_HIDE_VAR(_mask); \
> __u._ptr = &base[_i & _mask]; \
> __u._bit &= _mask; \
> __u._ptr; \
> })
_v * ~0UL doesn't seem right and is non-intuitive.
What's wrong with:
unsigned long _mask = ~(long)(_i | _s - 1 - _i) >> BITS_PER_LONG - 1;
and why OPTIMIZER_HIDE_VAR ?
Could you remove '&' ?
since it doesn't work for:
struct {
int fd[4];
...
} *fdt;
it cannot be used as array_access(fdt->fd, ...);
Could you please drop nospec_ prefix since it is misleading ?
This macro doesn't prevent speculation.
I think array_access() was the best name so far.
On 01/05/2018 05:10 PM, Dan Williams wrote:
> From: Mark Rutland <[email protected]>
>
> This patch implements nospec_ptr() for arm, following the recommended
> architectural sequences for the arm and thumb instruction sets.
>
Fedora picked up the series and it fails on arm:
In file included from ./include/linux/compiler.h:242:0,
from ./include/uapi/linux/swab.h:6,
from ./include/linux/swab.h:5,
from ./arch/arm/include/asm/opcodes.h:89,
from ./arch/arm/include/asm/bug.h:7,
from ./include/linux/bug.h:5,
from ./include/linux/mmdebug.h:5,
from ./include/linux/gfp.h:5,
from ./include/linux/slab.h:15,
from kernel/fork.c:14:
./include/linux/fdtable.h: In function '__fcheck_files':
./arch/arm/include/asm/barrier.h:112:41: error: expected declaration specifiers or '...' before numeric constant
__load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
^
./arch/arm/include/asm/barrier.h:68:32: note: in definition of macro '__load_no_speculate_n'
(typeof(*ptr)(unsigned long)(failval)); \
^~~~~~~
./arch/arm/include/asm/barrier.h:112:2: note: in expansion of macro '__load_no_speculate'
__load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
^~~~~~~~~~~~~~~~~~~
./include/asm-generic/barrier.h:122:2: note: in expansion of macro 'nospec_ptr'
nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
^~~~~~~~~~
./include/linux/fdtable.h:86:13: note: in expansion of macro 'nospec_array_ptr'
if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
^~~~~~~~~~~~~~~~
./arch/arm/include/asm/barrier.h:112:41: error: expected declaration specifiers or '...' before numeric constant
__load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
^
./arch/arm/include/asm/barrier.h:68:32: note: in definition of macro '__load_no_speculate_n'
(typeof(*ptr)(unsigned long)(failval)); \
^~~~~~~
./arch/arm/include/asm/barrier.h:112:2: note: in expansion of macro '__load_no_speculate'
__load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
^~~~~~~~~~~~~~~~~~~
./include/asm-generic/barrier.h:122:2: note: in expansion of macro 'nospec_ptr'
nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
^~~~~~~~~~
./include/linux/fdtable.h:86:13: note: in expansion of macro 'nospec_array_ptr'
if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
^~~~~~~~~~~~~~~~
./arch/arm/include/asm/barrier.h:112:41: error: expected declaration specifiers or '...' before numeric constant
__load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
^
./arch/arm/include/asm/barrier.h:68:32: note: in definition of macro '__load_no_speculate_n'
(typeof(*ptr)(unsigned long)(failval)); \
^~~~~~~
./arch/arm/include/asm/barrier.h:112:2: note: in expansion of macro '__load_no_speculate'
__load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
^~~~~~~~~~~~~~~~~~~
./include/asm-generic/barrier.h:122:2: note: in expansion of macro 'nospec_ptr'
nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
^~~~~~~~~~
./include/linux/fdtable.h:86:13: note: in expansion of macro 'nospec_array_ptr'
if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
I can't puzzle out what exactly is the problem here, except that it really
does not seem to like that failval. Does the arm compiler not like doing
the typeof with the __arr + __idx?
Thanks,
Laura
> Signed-off-by: Mark Rutland <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> arch/arm/include/asm/barrier.h | 75 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 75 insertions(+)
>
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index 40f5c410fd8c..6384c90e4b72 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -37,6 +37,81 @@
> #define dmb(x) __asm__ __volatile__ ("" : : : "memory")
> #endif
>
> +#ifdef CONFIG_THUMB2_KERNEL
> +#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, sz) \
> +({ \
> + typeof(*ptr) __nln_val; \
> + typeof(*ptr) __failval = \
> + (typeof(*ptr)(unsigned long)(failval)); \
> + \
> + asm volatile ( \
> + " cmp %[c], %[l]\n" \
> + " it hs\n" \
> + " cmphs %[h], %[c]\n" \
> + " blo 1f\n" \
> + " ld" #sz " %[v], %[p]\n" \
> + "1: it lo\n" \
> + " movlo %[v], %[f]\n" \
> + " .inst 0xf3af8014 @ CSDB\n" \
> + : [v] "=&r" (__nln_val) \
> + : [p] "m" (*(ptr)), [l] "r" (lo), [h] "r" (hi), \
> + [f] "r" (__failval), [c] "r" (cmpptr) \
> + : "cc"); \
> + \
> + __nln_val; \
> +})
> +#else
> +#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, sz) \
> +({ \
> + typeof(*ptr) __nln_val; \
> + typeof(*ptr) __failval = \
> + (typeof(*ptr)(unsigned long)(failval)); \
> + \
> + asm volatile ( \
> + " cmp %[c], %[l]\n" \
> + " cmphs %[h], %[c]\n" \
> + " ldr" #sz "hi %[v], %[p]\n" \
> + " movls %[v], %[f]\n" \
> + " .inst 0xe320f014 @ CSDB\n" \
> + : [v] "=&r" (__nln_val) \
> + : [p] "m" (*(ptr)), [l] "r" (lo), [h] "r" (hi), \
> + [f] "r" (__failval), [c] "r" (cmpptr) \
> + : "cc"); \
> + \
> + __nln_val; \
> +})
> +#endif
> +
> +#define __load_no_speculate(ptr, lo, hi, failval, cmpptr) \
> +({ \
> + typeof(*(ptr)) __nl_val; \
> + \
> + switch (sizeof(__nl_val)) { \
> + case 1: \
> + __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
> + cmpptr, b); \
> + break; \
> + case 2: \
> + __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
> + cmpptr, h); \
> + break; \
> + case 4: \
> + __nl_val = __load_no_speculate_n(ptr, lo, hi, failval, \
> + cmpptr, ); \
> + break; \
> + default: \
> + BUILD_BUG(); \
> + } \
> + \
> + __nl_val; \
> +})
> +
> +#define nospec_ptr(ptr, lo, hi) \
> +({ \
> + typeof(ptr) __np_ptr = (ptr); \
> + __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
> +})
> +
> #ifdef CONFIG_ARM_HEAVY_MB
> extern void (*soc_mb)(void);
> extern void arm_heavy_mb(void);
>
On Tue, Jan 9, 2018 at 5:57 PM, Alexei Starovoitov
<[email protected]> wrote:
> On Tue, Jan 09, 2018 at 04:48:24PM -0800, Dan Williams wrote:
>>
>> #define __nospec_array_ptr(base, idx, sz) \
>> ({ \
>> union { typeof(&base[0]) _ptr; unsigned long _bit; } __u; \
>> unsigned long _i = (idx); \
>> unsigned long _s = (sz); \
>> unsigned long _v = (long)(_i | _s - 1 - _i) \
>> >> BITS_PER_LONG - 1; \
>> unsigned long _mask = _v * ~0UL; \
>> OPTIMIZER_HIDE_VAR(_mask); \
>> __u._ptr = &base[_i & _mask]; \
>> __u._bit &= _mask; \
>> __u._ptr; \
>> })
>
> _v * ~0UL doesn't seem right and non intuitive.
> What's wrong with:
> unsigned long _mask = ~(long)(_i | _s - 1 - _i) >> BITS_PER_LONG - 1;
Yeah, I noticed it was ok immediately after I sent that.
> and why OPTIMIZER_HIDE_VAR ?
It was in Linus' original, but that was when it had the ternary
conditional; I'll drop it. It does not change the generated assembly.
> Could you remove '&' ?
Yes, that should be __u._ptr = base + (_i & _mask)
> since it doesn't work for:
> struct {
> int fd[4];
> ...
> } *fdt;
> it cannot be used as array_access(fdt->fd, ...);
>
> Could you please drop nospec_ prefix since it is misleading ?
When you came up with that tweak you noted:
"The following:
[..]
is generic and no speculative flows."
> This macro doesn't prevent speculation.
It masks dangerous speculation. At least, I read nospec as "No
Spectre" and it is a prefix used in the Spectre-v2 patches.
I also want to include the option, with a static branch, to switch it
to the hard "no speculation" version with an ifence if worse comes to
worst and we find a compiler / cpu where it doesn't work. The default
will be the fast and practical implementation.
> I think array_access() was the best name so far.
For other usages I need the pointer to the array element, and
array_access() by itself is unsuitable for __fcheck_files because we
still need rcu_dereference_raw() on the element de-reference. So, I
think it's better to get a sanitized array element pointer which can
be used with rcu, READ_ONCE(), etc... directly rather than try to do
the access in the same macro.
On Tue, Jan 09, 2018 at 06:22:09PM -0800, Dan Williams wrote:
>
> When you came up with that tweak you noted:
>
> "The following:
> [..]
> is generic and no speculative flows."
I meant 'no speculative control flow';
load speculation still happens.
>
> > This macro doesn't prevent speculation.
>
> It masks dangerous speculation. At least, I read nospec as "No
> Spectre" and it is a prefix used in the Spectre-v2 patches.
ahh. I thought 'nospec' means 'no speculation'.
I think it's too much of an honor to use the bug's name for a macro
that will be used in many places in the kernel.
> > I think array_access() was the best name so far.
>
> For other usages I need the pointer to the array element, also
> array_access() by itself is unsuitable for __fcheck_files because we
> still need rcu_dereference_raw() on the element de-reference. So, I
> think it's better to get a sanitized array element pointer which can
> be used with rcu, READ_ONCE(), etc... directly rather than try to do
> the access in the same macro.
makes sense, then array_ptr() should fit ?
I'm hearing rumors that the first cpu with variants 2 and 3 fixed
will be appearing in early 2019. Which is amazing considering cpu
release cycles, but it also means that variant 1 will stay with us
for a long time, and we'd better pick a clean interface and name for it.
On Tue, Jan 9, 2018 at 4:48 PM, Dan Williams <[email protected]> wrote:
>
> It looks like there is another problem, or I'm misreading the
> cleverness.
I think you were misreading it.
I was basically saying that this:
unsigned long _mask = ~(long)(_m - 1 - _i) >> BITS_PER_LONG - 1;\
doesn't work, and that the "_m -1 - _i" needs to be replaced by "_i |
_m -1 -_i".
So you have
unsigned long _mask = ~(long)(_i | (_m - 1 - _i)) >> BITS_PER_LONG - 1;\
which should give the right result. No?
But as mentioned, I think you can do it with two instructions if you
do an architecture-specific inline asm:
unsigned long mask;
asm("cmpq %1,%2; sbbq %0,%0"
:"=r" (mask)
:"g" (max),"r" (idx));
which is likely much faster, and has much better register usage ("max"
can be a constant or loaded directly from memory, and "mask" could be
assigned the same register as idx).
But once again, I didn't really test it.
Note that the "cmpq/sbbq" version works regardless of max/idx values,
since it literally does the math in BITS_PER_LONG+1 bits.
In contrast, the "binary or with idx" version only works if the high
bit set in idx cannot be valid (put another way: 'max' must not be
bigger than MAXLONG+1).
Linus
On 2018/1/10 10:04, Laura Abbott wrote:
> On 01/05/2018 05:10 PM, Dan Williams wrote:
>> From: Mark Rutland <[email protected]>
>>
>> This patch implements nospec_ptr() for arm, following the recommended
>> architectural sequences for the arm and thumb instruction sets.
>>
> Fedora picked up the series and it fails on arm:
>
> In file included from ./include/linux/compiler.h:242:0,
> from ./include/uapi/linux/swab.h:6,
> from ./include/linux/swab.h:5,
> from ./arch/arm/include/asm/opcodes.h:89,
> from ./arch/arm/include/asm/bug.h:7,
> from ./include/linux/bug.h:5,
> from ./include/linux/mmdebug.h:5,
> from ./include/linux/gfp.h:5,
> from ./include/linux/slab.h:15,
> from kernel/fork.c:14:
> ./include/linux/fdtable.h: In function '__fcheck_files':
> ./arch/arm/include/asm/barrier.h:112:41: error: expected declaration specifiers or '...' before numeric constant
> __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
> ^
> ./arch/arm/include/asm/barrier.h:68:32: note: in definition of macro '__load_no_speculate_n'
> (typeof(*ptr)(unsigned long)(failval)); \
> ^~~~~~~
> ./arch/arm/include/asm/barrier.h:112:2: note: in expansion of macro '__load_no_speculate'
> __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
> ^~~~~~~~~~~~~~~~~~~
> ./include/asm-generic/barrier.h:122:2: note: in expansion of macro 'nospec_ptr'
> nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
> ^~~~~~~~~~
> ./include/linux/fdtable.h:86:13: note: in expansion of macro 'nospec_array_ptr'
> if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
> ^~~~~~~~~~~~~~~~
> ./arch/arm/include/asm/barrier.h:112:41: error: expected declaration specifiers or '...' before numeric constant
> __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
> ^
> ./arch/arm/include/asm/barrier.h:68:32: note: in definition of macro '__load_no_speculate_n'
> (typeof(*ptr)(unsigned long)(failval)); \
> ^~~~~~~
> ./arch/arm/include/asm/barrier.h:112:2: note: in expansion of macro '__load_no_speculate'
> __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
> ^~~~~~~~~~~~~~~~~~~
> ./include/asm-generic/barrier.h:122:2: note: in expansion of macro 'nospec_ptr'
> nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
> ^~~~~~~~~~
> ./include/linux/fdtable.h:86:13: note: in expansion of macro 'nospec_array_ptr'
> if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
> ^~~~~~~~~~~~~~~~
> ./arch/arm/include/asm/barrier.h:112:41: error: expected declaration specifiers or '...' before numeric constant
> __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
> ^
> ./arch/arm/include/asm/barrier.h:68:32: note: in definition of macro '__load_no_speculate_n'
> (typeof(*ptr)(unsigned long)(failval)); \
> ^~~~~~~
> ./arch/arm/include/asm/barrier.h:112:2: note: in expansion of macro '__load_no_speculate'
> __load_no_speculate(&__np_ptr, lo, hi, 0, __np_ptr); \
> ^~~~~~~~~~~~~~~~~~~
> ./include/asm-generic/barrier.h:122:2: note: in expansion of macro 'nospec_ptr'
> nospec_ptr(__arr + __idx, __arr, __arr + __sz); \
> ^~~~~~~~~~
> ./include/linux/fdtable.h:86:13: note: in expansion of macro 'nospec_array_ptr'
> if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
>
> I can't puzzle out what exactly is the problem here, except that it really
> does not seem to like that failval. Does the arm compiler not like doing
> the typeof with the __arr + __idx?
>> +#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, sz) \
>> +({ \
>> + typeof(*ptr) __nln_val; \
>> + typeof(*ptr) __failval = \
>> + (typeof(*ptr)(unsigned long)(failval)); \
Just a typo:
- (typeof(*ptr)(unsigned long)(failval)); \
+ (typeof(*ptr))(unsigned long)(failval); \
Please try it.
Thanks
Hanjun
On 01/09/2018 11:40 PM, Hanjun Guo wrote:
> On 2018/1/10 10:04, Laura Abbott wrote:
>> On 01/05/2018 05:10 PM, Dan Williams wrote:
>>> From: Mark Rutland <[email protected]>
>>>
>>> This patch implements nospec_ptr() for arm, following the recommended
>>> architectural sequences for the arm and thumb instruction sets.
>>>
>> Fedora picked up the series and it fails on arm:
>>
>> [ ... snip ... ]
>>
>> I can't puzzle out what exactly is the problem here, except that it really
>> does not seem to like that failval. Does the arm compiler not like doing
>> the typeof with the __arr + __idx?
>
>>> +#define __load_no_speculate_n(ptr, lo, hi, failval, cmpptr, sz) \
>>> +({ \
>>> + typeof(*ptr) __nln_val; \
>>> + typeof(*ptr) __failval = \
>>> + (typeof(*ptr)(unsigned long)(failval)); \
>
> Just typo,
>
> - (typeof(*ptr)(unsigned long)(failval)); \
> + (typeof(*ptr))(unsigned long)(failval); \
>
> Please try it.
>
> Thanks
> Hanjun
>
Ah yeah, that's exactly it. I really missed the obvious.
Thanks,
Laura
On Tue, 9 Jan 2018, Josh Poimboeuf wrote:
> On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote:
> > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina <[email protected]> wrote:
> > > On Fri, 5 Jan 2018, Dan Williams wrote:
> > >
> > > [ ... snip ... ]
> > >> Andi Kleen (1):
> > >> x86, barrier: stop speculation for failed access_ok
> > >>
> > >> Dan Williams (13):
> > >> x86: implement nospec_barrier()
> > >> [media] uvcvideo: prevent bounds-check bypass via speculative execution
> > >> carl9170: prevent bounds-check bypass via speculative execution
> > >> p54: prevent bounds-check bypass via speculative execution
> > >> qla2xxx: prevent bounds-check bypass via speculative execution
> > >> cw1200: prevent bounds-check bypass via speculative execution
> > >> Thermal/int340x: prevent bounds-check bypass via speculative execution
> > >> ipv6: prevent bounds-check bypass via speculative execution
> > >> ipv4: prevent bounds-check bypass via speculative execution
> > >> vfs, fdtable: prevent bounds-check bypass via speculative execution
> > >> net: mpls: prevent bounds-check bypass via speculative execution
> > >> udf: prevent bounds-check bypass via speculative execution
> > >> userns: prevent bounds-check bypass via speculative execution
> > >>
> > >> Mark Rutland (4):
> > >> asm-generic/barrier: add generic nospec helpers
> > >> Documentation: document nospec helpers
> > >> arm64: implement nospec_ptr()
> > >> arm: implement nospec_ptr()
> > >
> > > So considering the recent publication of [1], how come we all of a sudden
> > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
> > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
> > >
> > > Is this going to be handled in eBPF in some other way?
> > >
> > > Without that in place, and considering Jann Horn's paper, it would seem
> > > like PTI doesn't really lock it down fully, right?
> >
> > Here is the latest (v3) bpf fix:
> >
> > https://patchwork.ozlabs.org/patch/856645/
> >
> > I currently have v2 on my 'nospec' branch and will move that to v3 for
> > the next update, unless it goes upstream before then.
Daniel, I guess you're planning to send this still for 4.15?
> That patch seems specific to CONFIG_BPF_SYSCALL. Is the bpf() syscall
> the only attack vector? Or are there other ways to run bpf programs
> that we should be worried about?
Seems like Alexei is probably the only person in the whole universe who
isn't CCed here ... let's fix that.
Thanks,
--
Jiri Kosina
SUSE Labs
On Thu, Jan 11, 2018 at 1:54 AM, Jiri Kosina <[email protected]> wrote:
> On Tue, 9 Jan 2018, Josh Poimboeuf wrote:
>
>> On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote:
>> > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina <[email protected]> wrote:
>> > > On Fri, 5 Jan 2018, Dan Williams wrote:
>> > >
>> > > [ ... snip ... ]
>> > >> [ ... snip ... ]
>> > >
>> > > So considering the recent publication of [1], how come we all of a sudden
>> > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
>> > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
>> > >
>> > > Is this going to be handled in eBPF in some other way?
>> > >
>> > > Without that in place, and considering Jann Horn's paper, it would seem
>> > > like PTI doesn't really lock it down fully, right?
>> >
>> > Here is the latest (v3) bpf fix:
>> >
>> > https://patchwork.ozlabs.org/patch/856645/
>> >
>> > I currently have v2 on my 'nospec' branch and will move that to v3 for
>> > the next update, unless it goes upstream before then.
>
> Daniel, I guess you're planning to send this still for 4.15?
It's pending in the bpf.git tree:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=b2157399cc9
>> That patch seems specific to CONFIG_BPF_SYSCALL. Is the bpf() syscall
>> the only attack vector? Or are there other ways to run bpf programs
>> that we should be worried about?
>
> Seems like Alexei is probably the only person in the whole universe who
> isn't CCed here ... let's fix that.
He will be cc'd on v2 of this series which will be available later today.
On 01/11/2018 04:58 PM, Dan Williams wrote:
> On Thu, Jan 11, 2018 at 1:54 AM, Jiri Kosina <[email protected]> wrote:
>> On Tue, 9 Jan 2018, Josh Poimboeuf wrote:
>>> On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote:
>>>> On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina <[email protected]> wrote:
>>>>> On Fri, 5 Jan 2018, Dan Williams wrote:
>>>>>
>>>>> [ ... snip ... ]
>>>>>> [ ... snip ... ]
>>>>>
>>>>> So considering the recent publication of [1], how come we all of a sudden
>>>>> don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and
>>>>> LDX_MEM_##SIZEOP, and something comparable for eBPF JIT?
>>>>>
>>>>> Is this going to be handled in eBPF in some other way?
>>>>>
>>>>> Without that in place, and considering Jann Horn's paper, it would seem
>>>>> like PTI doesn't really lock it down fully, right?
>>>>
>>>> Here is the latest (v3) bpf fix:
>>>>
>>>> https://patchwork.ozlabs.org/patch/856645/
>>>>
>>>> I currently have v2 on my 'nospec' branch and will move that to v3 for
>>>> the next update, unless it goes upstream before then.
>>
>> Daniel, I guess you're planning to send this still for 4.15?
>
> It's pending in the bpf.git tree:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=b2157399cc9
Sorry for the delay, just noticed the question now since not on Cc either:
It made it into DaveM's tree already and is part of his latest pull
request to Linus.
>>> That patch seems specific to CONFIG_BPF_SYSCALL. Is the bpf() syscall
>>> the only attack vector? Or are there other ways to run bpf programs
>>> that we should be worried about?
>>
>> Seems like Alexei is probably the only person in the whole universe who
>> isn't CCed here ... let's fix that.
>
> He will be cc'd on v2 of this series which will be available later today.
On Sat, Jan 6, 2018 at 1:03 AM, Greg KH <[email protected]> wrote:
> On Fri, Jan 05, 2018 at 05:10:48PM -0800, Dan Williams wrote:
>> Static analysis reports that 'handle' may be a user controlled value
>> that is used as a data dependency to read 'sp' from the
>> 'req->outstanding_cmds' array. In order to avoid potential leaks of
>> kernel memory values, block speculative execution of the instruction
>> stream that could issue reads based on an invalid value of 'sp'. In this
>> case 'sp' is directly dereferenced later in the function.
>
> I'm pretty sure that 'handle' comes from the hardware, not from
> userspace, from what I can tell here. If we want to start auditing
> __iomem data sources, great! But that's a bigger task, and one I don't
> think we are ready to tackle...
I think it falls in the hygiene bucket of shutting off an array index
from a source that could be under attacker control. Should we leave
this one un-patched while we decide if we generally have a problem
with trusting completion 'tags' from hardware? My vote is patch it for
now.
On Thu, Jan 11, 2018 at 02:15:12PM -0800, Dan Williams wrote:
> On Sat, Jan 6, 2018 at 1:03 AM, Greg KH <[email protected]> wrote:
> > On Fri, Jan 05, 2018 at 05:10:48PM -0800, Dan Williams wrote:
> >> Static analysis reports that 'handle' may be a user controlled value
> >> that is used as a data dependency to read 'sp' from the
> >> 'req->outstanding_cmds' array. In order to avoid potential leaks of
> >> kernel memory values, block speculative execution of the instruction
> >> stream that could issue reads based on an invalid value of 'sp'. In this
> >> case 'sp' is directly dereferenced later in the function.
> >
> > I'm pretty sure that 'handle' comes from the hardware, not from
> > userspace, from what I can tell here. If we want to start auditing
> > __iomem data sources, great! But that's a bigger task, and one I don't
> > think we are ready to tackle...
>
> I think it falls in the hygiene bucket of shutting off an array index
> from a source that could be under attacker control. Should we leave
> this one un-patched while we decide if we generally have a problem
> with trusting completion 'tags' from hardware? My vote is patch it for
> now.
Hah, if you are worried about "tags" from hardware, we have a lot more
auditing to do, right? I don't think anyone has looked into just basic
"bounds checking" for that type of information. For USB devices we have
_just_ started doing that over the past year; the odds of anyone looking
at PCI devices for this same problem are slim to none.
Again, here are my questions/objections right now to this series:
- How can we audit this stuff?
- How did you audit this stuff to find these usages?
- How do you know that this series fixes all of the issues?
- What exact tree/date did you run your audit against?
- How do you know that linux-next does not contain a boatload
more problems that we need to go back and fix after 4.16-rc1
is out?
- How can we prevent this type of pattern showing up again?
- How can we audit the data coming from hardware correctly?
I'm all for merging this series, but if anyone thinks that somehow the
whole problem is now "solved" in this area, they are sorely mistaken.
thanks,
greg k-h
On Fri, 2018-01-12 at 08:27 +0100, Greg KH wrote:
> On Thu, Jan 11, 2018 at 02:15:12PM -0800, Dan Williams wrote:
> >
> > On Sat, Jan 6, 2018 at 1:03 AM, Greg KH <[email protected]
> > > wrote:
> > >
> > > On Fri, Jan 05, 2018 at 05:10:48PM -0800, Dan Williams wrote:
> > > >
> > > > Static analysis reports that 'handle' may be a user controlled
> > > > value that is used as a data dependency to read 'sp' from the
> > > > 'req->outstanding_cmds' array. In order to avoid potential
> > > > leaks of kernel memory values, block speculative execution of
> > > > the instruction stream that could issue reads based on an
> > > > invalid value of 'sp'. In this case 'sp' is directly
> > > > dereferenced later in the function.
> > >
> > > I'm pretty sure that 'handle' comes from the hardware, not from
> > > userspace, from what I can tell here. If we want to start
> > > auditing __iomem data sources, great! But that's a bigger task,
> > > and one I don't think we are ready to tackle...
> >
> > I think it falls in the hygiene bucket of shutting off an array
> > index from a source that could be under attacker control. Should we
> > leave this one un-patched while we decide if we generally have a
> > problem with trusting completion 'tags' from hardware? My vote is
> > patch it for now.
>
> Hah, if you are worried about "tags" from hardware, we have a lot
> more auditing to do, right?
We'd also have a lot more to do: the assumption would have to be
malicious hardware, and most hardware has direct access to fairly vital
things. I really don't think we have to worry about side-channel
attacks from hardware until the direct attack vector is closed.
James
On 2018/1/6 9:09, Dan Williams wrote:
> Quoting Mark's original RFC:
>
> "Recently, Google Project Zero discovered several classes of attack
> against speculative execution. One of these, known as variant-1, allows
> explicit bounds checks to be bypassed under speculation, providing an
> arbitrary read gadget. Further details can be found on the GPZ blog [1]
> and the Documentation patch in this series."
>
> This series incorporates Mark Rutland's latest api and adds the x86
> specific implementation of nospec_barrier. The
> nospec_{array_ptr,ptr,barrier} helpers are then combined with a kernel
> wide analysis performed by Elena Reshetova to address static analysis
> reports where speculative execution on a userspace controlled value
@Elena, may I ask how you did this analysis? Was it done manually, or
with a tool?
Thanks!
> could bypass a bounds check. The patches address a precondition for the
> attack discussed in the Spectre paper [2].
>
> A consideration worth noting for reviewing these patches is to weigh the
> dramatic cost of being wrong about whether a given report is exploitable
> vs the overhead nospec_{array_ptr,ptr} may introduce. In other words,
> lets make the bar for applying these patches be "can you prove that the
> bounds check bypass is *not* exploitable". Consider that the Spectre
> paper reports one example of a speculation window being ~180 cycles.
[snip]
>
>
--
Regards
QingFeng Hao