This patch-set is the first-cut implementation of write-rare memory
protection, as previously agreed [1].
Its purpose is to keep write protected kernel data which is seldom
modified.
There is no read overhead; however, writing requires special operations that
are probably unsuitable for often-changing data.
The use is opt-in, by applying the modifier __wr_after_init to a variable
declaration.
As the name implies, the write protection kicks in only after init is
completed; before that moment, the data is modifiable in the usual way.
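For illustration only (the names below are made up, not from the patches),
opting a variable in would look roughly like this:

#include <linux/init.h>
#include <linux/cache.h>	/* __wr_after_init, from patch 1/6 */

/* Writable during init, write protected after mark_rodata_ro(). */
static unsigned long boot_policy_flags __wr_after_init;

static int __init policy_setup(void)
{
	boot_policy_flags = 0x1;	/* plain write: still allowed here */
	return 0;
}
/* After init, direct writes fault; the helpers from patch 2/6 must be
 * used instead, e.g. wr_assign(boot_policy_flags, 0x3).
 */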
Current Limitations:
* supports only data which is allocated statically, at build time.
* supports only x86_64
* might not work for very large amounts of data, since it relies on the
assumption that said data can be entirely remapped, at init.
Some notes:
- even though the code currently supports only x86_64, it is placed in the
generic locations, with the intention of extending it also to arm64
- the current section used for collecting wr-after-init data might need to
be moved, to work with the arm64 MMU
- the functionality is in its own c and h files, for now, to ease the
introduction (and refactoring) of code dealing with dynamic allocation
- recently some updated patches were posted for live-patch on arm64 [2];
they might help with adding arm64 support here
- to avoid the risk of weakening __ro_after_init, __wr_after_init data is
in a separate set of pages, and any invocation will confirm that the
memory affected falls within this range.
I have modified rodata_test accordingly, to check also this case.
- to avoid replicating the code which does the change of mapping, there is
only one function performing multiple, selectable operations, such as
memcpy() and memset(). I have also added rcu_assign_pointer() as a further
example, but I'm not too fond of this implementation either; I just
couldn't think of one that I would like significantly better.
- I have left out the patchset from Nadav that these patches depend on,
but it can be found here [3] (should I have resubmitted it?)
- I am not sure what the correct form is for giving proper credit for the
authoring of the wr_after_init mechanism; guidance would be appreciated
- In an attempt to spam fewer people, I have curbed the list of recipients.
If I have omitted someone who should have been kept/added, please
add them to the thread.
[1] https://www.openwall.com/lists/kernel-hardening/2018/11/22/8
[2] https://www.mail-archive.com/[email protected]/msg1793199.html
[3] https://www.mail-archive.com/[email protected]/msg1810245.html
Signed-off-by: Igor Stoppa <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Nadav Amit <[email protected]>
CC: Matthew Wilcox <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Kees Cook <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
Igor Stoppa (6):
[PATCH 1/6] __wr_after_init: linker section and label
[PATCH 2/6] __wr_after_init: write rare for static allocation
[PATCH 3/6] rodata_test: refactor tests
[PATCH 4/6] rodata_test: add verification for __wr_after_init
[PATCH 5/6] __wr_after_init: test write rare functionality
[PATCH 6/6] __wr_after_init: lkdtm test
drivers/misc/lkdtm/core.c | 3 +
drivers/misc/lkdtm/lkdtm.h | 3 +
drivers/misc/lkdtm/perms.c | 29 ++++++++
include/asm-generic/vmlinux.lds.h | 20 ++++++
include/linux/cache.h | 17 +++++
include/linux/prmem.h | 134 +++++++++++++++++++++++++++++++++++++
init/main.c | 2 +
mm/Kconfig | 4 ++
mm/Kconfig.debug | 9 +++
mm/Makefile | 2 +
mm/prmem.c | 124 ++++++++++++++++++++++++++++++++++
mm/rodata_test.c | 63 ++++++++++++------
mm/test_write_rare.c | 135 ++++++++++++++++++++++++++++++++++++++
13 files changed, 525 insertions(+), 20 deletions(-)
The write protection of the __wr_after_init data can be verified with the
same methodology used for const data.
Signed-off-by: Igor Stoppa <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Nadav Amit <[email protected]>
CC: Matthew Wilcox <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Kees Cook <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
mm/rodata_test.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/rodata_test.c b/mm/rodata_test.c
index 3c1e515ca9b1..a98d088ad9cc 100644
--- a/mm/rodata_test.c
+++ b/mm/rodata_test.c
@@ -16,7 +16,19 @@
#define INIT_TEST_VAL 0xC3
+/*
+ * Note: __ro_after_init data is, for all practical purposes, equivalent to
+ * const data, since both are write protected at the same time; there
+ * is no need for separate testing.
+ * __wr_after_init data, on the other hand, is altered also after the write
+ * protection takes place, so it must be verified that this cannot be
+ * exploited for altering more permanent data.
+ */
+
static const int rodata_test_data = INIT_TEST_VAL;
+static int wr_after_init_test_data __wr_after_init = INIT_TEST_VAL;
+extern long __start_wr_after_init;
+extern long __end_wr_after_init;
static bool test_data(char *data_type, const int *data,
unsigned long start, unsigned long end)
@@ -60,6 +72,9 @@ void rodata_test(void)
{
if (test_data("rodata", &rodata_test_data,
(unsigned long)&__start_rodata,
- (unsigned long)&__end_rodata))
+ (unsigned long)&__end_rodata) &&
+ test_data("wr after init data", &wr_after_init_test_data,
+ (unsigned long)&__start_wr_after_init,
+ (unsigned long)&__end_wr_after_init))
pr_info("all tests were successful\n");
}
--
2.19.1
Set of test cases meant to confirm that the write rare functionality
works as expected.
Signed-off-by: Igor Stoppa <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Nadav Amit <[email protected]>
CC: Matthew Wilcox <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Kees Cook <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
include/linux/prmem.h | 7 ++-
mm/Kconfig.debug | 9 +++
mm/Makefile | 1 +
mm/test_write_rare.c | 135 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 149 insertions(+), 3 deletions(-)
create mode 100644 mm/test_write_rare.c
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index b0131c1f5dc0..d2492ec24c8c 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -125,9 +125,10 @@ static inline void *wr_memcpy(void *p, const void *q, __kernel_size_t size)
*
* It is provided as macro, to match rcu_assign_pointer()
*/
-#define wr_rcu_assign_pointer(p, v) ({ \
- __wr_op((unsigned long)&p, v, sizeof(p), WR_RCU_ASSIGN_PTR); \
- p; \
+#define wr_rcu_assign_pointer(p, v) ({ \
+ __wr_op((unsigned long)&p, (unsigned long)v, sizeof(p), \
+ WR_RCU_ASSIGN_PTR); \
+ p; \
})
#endif
#endif
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 9a7b8b049d04..a26ecbd27aea 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -94,3 +94,12 @@ config DEBUG_RODATA_TEST
depends on STRICT_KERNEL_RWX
---help---
This option enables a testcase for the setting rodata read-only.
+
+config DEBUG_PRMEM_TEST
+ tristate "Run self test for statically allocated protected memory"
+ depends on STRICT_KERNEL_RWX
+ select PRMEM
+ default n
+ help
+ Tries to verify that the protection for statically allocated memory
+ works correctly and that the memory is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index ef3867c16ce0..8de1d468f4e7 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
obj-$(CONFIG_PRMEM) += prmem.o
+obj-$(CONFIG_DEBUG_PRMEM_TEST) += test_write_rare.o
obj-$(CONFIG_KSM) += ksm.o
obj-$(CONFIG_PAGE_POISONING) += page_poison.o
obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_write_rare.c b/mm/test_write_rare.c
new file mode 100644
index 000000000000..240cc43793d1
--- /dev/null
+++ b/mm/test_write_rare.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * test_write_rare.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <[email protected]>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/bug.h>
+#include <linux/prmem.h>
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+extern long __start_wr_after_init;
+extern long __end_wr_after_init;
+
+static __wr_after_init int scalar = '0';
+static __wr_after_init u8 array[PAGE_SIZE * 3] __aligned(PAGE_SIZE);
+
+/* The section must occupy a non-zero number of whole pages */
+static bool test_alignment(void)
+{
+ unsigned long pstart = (unsigned long)&__start_wr_after_init;
+ unsigned long pend = (unsigned long)&__end_wr_after_init;
+
+ if (WARN((pstart & ~PAGE_MASK) || (pend & ~PAGE_MASK) ||
+ (pstart >= pend), "Boundaries test failed."))
+ return false;
+ pr_info("Boundaries test passed.");
+ return true;
+}
+
+static inline bool test_pattern(void)
+{
+ return (memtst(array, '0', PAGE_SIZE / 2) ||
+ memtst(array + PAGE_SIZE / 2, '1', PAGE_SIZE * 3 / 4) ||
+ memtst(array + PAGE_SIZE * 5 / 4, '0', PAGE_SIZE / 2) ||
+ memtst(array + PAGE_SIZE * 7 / 4, '1', PAGE_SIZE * 3 / 4) ||
+ memtst(array + PAGE_SIZE * 5 / 2, '0', PAGE_SIZE / 2));
+}
+
+static bool test_wr_memset(void)
+{
+ int new_val = '1';
+
+ wr_memset(&scalar, new_val, sizeof(scalar));
+ if (WARN(memtst(&scalar, new_val, sizeof(scalar)),
+ "Scalar write rare memset test failed."))
+ return false;
+
+ pr_info("Scalar write rare memset test passed.");
+
+ wr_memset(array, '0', PAGE_SIZE * 3);
+ if (WARN(memtst(array, '0', PAGE_SIZE * 3),
+ "Array write rare memset test failed."))
+ return false;
+
+ wr_memset(array + PAGE_SIZE / 2, '1', PAGE_SIZE * 2);
+ if (WARN(memtst(array + PAGE_SIZE / 2, '1', PAGE_SIZE * 2),
+ "Array write rare memset test failed."))
+ return false;
+
+ wr_memset(array + PAGE_SIZE * 5 / 4, '0', PAGE_SIZE / 2);
+ if (WARN(memtst(array + PAGE_SIZE * 5 / 4, '0', PAGE_SIZE / 2),
+ "Array write rare memset test failed."))
+ return false;
+
+ if (WARN(test_pattern(), "Array write rare memset test failed."))
+ return false;
+
+ pr_info("Array write rare memset test passed.");
+ return true;
+}
+
+static u8 array_1[PAGE_SIZE * 2];
+static u8 array_2[PAGE_SIZE * 2];
+
+static bool test_wr_memcpy(void)
+{
+ int new_val = 0x12345678;
+
+ wr_assign(scalar, new_val);
+ if (WARN(memcmp(&scalar, &new_val, sizeof(scalar)),
+ "Scalar write rare memcpy test failed."))
+ return false;
+ pr_info("Scalar write rare memcpy test passed.");
+
+ wr_memset(array, '0', PAGE_SIZE * 3);
+ memset(array_1, '1', PAGE_SIZE * 2);
+ memset(array_2, '0', PAGE_SIZE * 2);
+ wr_memcpy(array + PAGE_SIZE / 2, array_1, PAGE_SIZE * 2);
+ wr_memcpy(array + PAGE_SIZE * 5 / 4, array_2, PAGE_SIZE / 2);
+
+ if (WARN(test_pattern(), "Array write rare memcpy test failed."))
+ return false;
+
+ pr_info("Array write rare memcpy test passed.");
+ return true;
+}
+
+static __wr_after_init int *dst;
+static int reference = 0x54;
+
+static bool test_wr_rcu_assign_pointer(void)
+{
+ wr_rcu_assign_pointer(dst, &reference);
+ return dst == &reference;
+}
+
+static int __init test_static_wr_init_module(void)
+{
+ pr_info("static write_rare test");
+ if (WARN(!(test_alignment() &&
+ test_wr_memset() &&
+ test_wr_memcpy() &&
+ test_wr_rcu_assign_pointer()),
+ "static rare-write test failed"))
+ return -EFAULT;
+ pr_info("static write_rare test passed");
+ return 0;
+}
+
+module_init(test_static_wr_init_module);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Igor Stoppa <[email protected]>");
+MODULE_DESCRIPTION("Test module for static write rare.");
--
2.19.1
Refactor the test cases, in preparation for using them also for testing
__wr_after_init memory.
Signed-off-by: Igor Stoppa <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Nadav Amit <[email protected]>
CC: Matthew Wilcox <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Kees Cook <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
mm/rodata_test.c | 48 ++++++++++++++++++++++++++++--------------------
1 file changed, 28 insertions(+), 20 deletions(-)
diff --git a/mm/rodata_test.c b/mm/rodata_test.c
index d908c8769b48..3c1e515ca9b1 100644
--- a/mm/rodata_test.c
+++ b/mm/rodata_test.c
@@ -14,44 +14,52 @@
#include <linux/uaccess.h>
#include <asm/sections.h>
-static const int rodata_test_data = 0xC3;
+#define INIT_TEST_VAL 0xC3
-void rodata_test(void)
+static const int rodata_test_data = INIT_TEST_VAL;
+
+static bool test_data(char *data_type, const int *data,
+ unsigned long start, unsigned long end)
{
- unsigned long start, end;
int zero = 0;
/* test 1: read the value */
/* If this test fails, some previous testrun has clobbered the state */
- if (!rodata_test_data) {
- pr_err("test 1 fails (start data)\n");
- return;
+ if (*data != INIT_TEST_VAL) {
+ pr_err("%s: test 1 fails (init data value)\n", data_type);
+ return false;
}
/* test 2: write to the variable; this should fault */
- if (!probe_kernel_write((void *)&rodata_test_data,
- (void *)&zero, sizeof(zero))) {
- pr_err("test data was not read only\n");
- return;
+ if (!probe_kernel_write((void *)data, (void *)&zero, sizeof(zero))) {
+ pr_err("%s: test data was not read only\n", data_type);
+ return false;
}
/* test 3: check the value hasn't changed */
- if (rodata_test_data == zero) {
- pr_err("test data was changed\n");
- return;
+ if (*data != INIT_TEST_VAL) {
+ pr_err("%s: test data was changed\n", data_type);
+ return false;
}
/* test 4: check if the rodata section is PAGE_SIZE aligned */
- start = (unsigned long)__start_rodata;
- end = (unsigned long)__end_rodata;
if (start & (PAGE_SIZE - 1)) {
- pr_err("start of .rodata is not page size aligned\n");
- return;
+ pr_err("%s: start of data is not page size aligned\n",
+ data_type);
+ return false;
}
if (end & (PAGE_SIZE - 1)) {
- pr_err("end of .rodata is not page size aligned\n");
- return;
+ pr_err("%s: end of data is not page size aligned\n",
+ data_type);
+ return false;
}
+ return true;
+}
- pr_info("all tests were successful\n");
+void rodata_test(void)
+{
+ if (test_data("rodata", &rodata_test_data,
+ (unsigned long)&__start_rodata,
+ (unsigned long)&__end_rodata))
+ pr_info("all tests were successful\n");
}
--
2.19.1
Verify that trying to modify a variable with the __wr_after_init
modifier will cause a crash.
Signed-off-by: Igor Stoppa <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Nadav Amit <[email protected]>
CC: Matthew Wilcox <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Kees Cook <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
drivers/misc/lkdtm/core.c | 3 +++
drivers/misc/lkdtm/lkdtm.h | 3 +++
drivers/misc/lkdtm/perms.c | 29 +++++++++++++++++++++++++++++
3 files changed, 35 insertions(+)
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..73c34b17c433 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -155,6 +155,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(ACCESS_USERSPACE),
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
+#ifdef CONFIG_PRMEM
+ CRASHTYPE(WRITE_WR_AFTER_INIT),
+#endif
CRASHTYPE(WRITE_KERN),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..abba2f52ffa6 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -38,6 +38,9 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
void __init lkdtm_perms_init(void);
void lkdtm_WRITE_RO(void);
void lkdtm_WRITE_RO_AFTER_INIT(void);
+#ifdef CONFIG_PRMEM
+void lkdtm_WRITE_WR_AFTER_INIT(void);
+#endif
void lkdtm_WRITE_KERN(void);
void lkdtm_EXEC_DATA(void);
void lkdtm_EXEC_STACK(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 53b85c9d16b8..f681730aa652 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
#include <linux/vmalloc.h>
#include <linux/mman.h>
#include <linux/uaccess.h>
+#include <linux/prmem.h>
#include <asm/cacheflush.h>
/* Whether or not to fill the target memory area with do_nothing(). */
@@ -27,6 +28,10 @@ static const unsigned long rodata = 0xAA55AA55;
/* This is marked __ro_after_init, so it should ultimately be .rodata. */
static unsigned long ro_after_init __ro_after_init = 0x55AA5500;
+/* This is marked __wr_after_init, so it should be in .rodata. */
+static
+unsigned long wr_after_init __wr_after_init = 0x55AA5500;
+
/*
* This just returns to the caller. It is designed to be copied into
* non-executable memory regions.
@@ -104,6 +109,28 @@ void lkdtm_WRITE_RO_AFTER_INIT(void)
*ptr ^= 0xabcd1234;
}
+#ifdef CONFIG_PRMEM
+
+void lkdtm_WRITE_WR_AFTER_INIT(void)
+{
+ unsigned long *ptr = &wr_after_init;
+
+ /*
+ * Verify we were written to during init. Since an Oops
+ * is considered a "success", a failure is to just skip the
+ * real test.
+ */
+ if ((*ptr & 0xAA) != 0xAA) {
+ pr_info("%p was NOT written during init!?\n", ptr);
+ return;
+ }
+
+ pr_info("attempting bad wr_after_init write at %p\n", ptr);
+ *ptr ^= 0xabcd1234;
+}
+
+#endif
+
void lkdtm_WRITE_KERN(void)
{
size_t size;
@@ -200,4 +227,6 @@ void __init lkdtm_perms_init(void)
/* Make sure we can write to __ro_after_init values during __init */
ro_after_init |= 0xAA;
+ /* Make sure we can write to __wr_after_init during __init */
+ wr_after_init |= 0xAA;
}
--
2.19.1
Implementation of write rare for statically allocated data, located in a
specific memory section through the use of the __write_rare label.
The basic functions are:
- wr_memset(): write rare counterpart of memset()
- wr_memcpy(): write rare counterpart of memcpy()
- wr_assign(): write rare counterpart of the assignment ('=') operator
- wr_rcu_assign_pointer(): write rare counterpart of rcu_assign_pointer()
The implementation is based on code from Andy Lutomirski and Nadav Amit
for patching the text on x86 [here goes reference to commits, once merged]
The modification of write protected data is done through an alternate,
writable mapping of the same pages.
This mapping is local to each core and is active only for the duration
of each write operation.
Local interrupts are disabled while the alternate mapping is active.
In theory, this could introduce an unpredictable delay in a preemptible
system; however, the amount of data to be altered is likely to be far
smaller than a page.
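As a rough usage sketch (the names below are illustrative, not part of the
patch), altering write rare data after init would look like:

#include <linux/cache.h>
#include <linux/prmem.h>

static struct throttle_cfg {
	unsigned long rate;
	unsigned long burst;
} cfg __wr_after_init;

/* Replace the whole structure through the alternate mapping. */
static void update_cfg(const struct throttle_cfg *newcfg)
{
	wr_memcpy(&cfg, newcfg, sizeof(cfg));
}

/* Or update a single field. */
static void update_rate(unsigned long rate)
{
	wr_assign(cfg.rate, rate);
}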
Signed-off-by: Igor Stoppa <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Nadav Amit <[email protected]>
CC: Matthew Wilcox <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Kees Cook <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
include/linux/prmem.h | 133 ++++++++++++++++++++++++++++++++++++++++++
init/main.c | 2 +
mm/Kconfig | 4 ++
mm/Makefile | 1 +
mm/prmem.c | 124 +++++++++++++++++++++++++++++++++++++++
5 files changed, 264 insertions(+)
create mode 100644 include/linux/prmem.h
create mode 100644 mm/prmem.c
diff --git a/include/linux/prmem.h b/include/linux/prmem.h
new file mode 100644
index 000000000000..b0131c1f5dc0
--- /dev/null
+++ b/include/linux/prmem.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * prmem.h: Header for memory protection library
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <[email protected]>
+ *
+ * Support for:
+ * - statically allocated write rare data
+ */
+
+#ifndef _LINUX_PRMEM_H
+#define _LINUX_PRMEM_H
+
+#include <linux/set_memory.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/compiler.h>
+#include <linux/irqflags.h>
+
+/**
+ * memtst() - test n bytes of the source to match the c value
+ * @p: beginning of the memory to test
+ * @c: byte to compare against
+ * @len: amount of bytes to test
+ *
+ * Returns 0 on success, non-zero otherwise.
+ */
+static inline int memtst(void *p, int c, __kernel_size_t len)
+{
+ __kernel_size_t i;
+
+ for (i = 0; i < len; i++) {
+ u8 d = *(i + (u8 *)p) - (u8)c;
+
+ if (unlikely(d))
+ return d;
+ }
+ return 0;
+}
+
+
+#ifndef CONFIG_PRMEM
+
+static inline void *wr_memset(void *p, int c, __kernel_size_t len)
+{
+ return memset(p, c, len);
+}
+
+static inline void *wr_memcpy(void *p, const void *q, __kernel_size_t size)
+{
+ return memcpy(p, q, size);
+}
+
+#define wr_assign(var, val) ((var) = (val))
+
+#define wr_rcu_assign_pointer(p, v) \
+ rcu_assign_pointer(p, v)
+
+#else
+
+enum wr_op_type {
+ WR_MEMCPY,
+ WR_MEMSET,
+ WR_RCU_ASSIGN_PTR,
+ WR_OPS_NUMBER,
+};
+
+void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
+ enum wr_op_type op);
+
+/**
+ * wr_memset() - sets n bytes of the destination to the c value
+ * @p: beginning of the memory to write to
+ * @c: byte to replicate
+ * @len: amount of bytes to copy
+ *
+ * Returns true on success, false otherwise.
+ */
+static inline void *wr_memset(void *p, int c, __kernel_size_t len)
+{
+ return __wr_op((unsigned long)p, (unsigned long)c, len, WR_MEMSET);
+}
+
+/**
+ * wr_memcpy() - copies n bytes from source to destination
+ * @dst: beginning of the memory to write to
+ * @src: beginning of the memory to read from
+ * @n_bytes: amount of bytes to copy
+ *
+ * Returns pointer to the destination
+ */
+static inline void *wr_memcpy(void *p, const void *q, __kernel_size_t size)
+{
+ return __wr_op((unsigned long)p, (unsigned long)q, size, WR_MEMCPY);
+}
+
+/**
+ * wr_assign() - sets a write-rare variable to a specified value
+ * @var: the variable to set
+ * @val: the new value
+ *
+ * Returns: the variable
+ *
+ * Note: it might be possible to optimize this, to use wr_memset in some
+ * cases (maybe with NULL?).
+ */
+
+#define wr_assign(var, val) ({ \
+ typeof(var) tmp = (typeof(var))val; \
+ \
+ wr_memcpy(&var, &tmp, sizeof(var)); \
+ var; \
+})
+
+/**
+ * wr_rcu_assign_pointer() - initialize a pointer in rcu mode
+ * @p: the rcu pointer
+ * @v: the new value
+ *
+ * Returns the value assigned to the rcu pointer.
+ *
+ * It is provided as macro, to match rcu_assign_pointer()
+ */
+#define wr_rcu_assign_pointer(p, v) ({ \
+ __wr_op((unsigned long)&p, v, sizeof(p), WR_RCU_ASSIGN_PTR); \
+ p; \
+})
+#endif
+#endif
diff --git a/init/main.c b/init/main.c
index a461150adfb1..a36f2e54f937 100644
--- a/init/main.c
+++ b/init/main.c
@@ -498,6 +498,7 @@ void __init __weak thread_stack_cache_init(void)
void __init __weak mem_encrypt_init(void) { }
void __init __weak poking_init(void) { }
+void __init __weak wr_poking_init(void) { }
bool initcall_debug;
core_param(initcall_debug, initcall_debug, bool, 0644);
@@ -734,6 +735,7 @@ asmlinkage __visible void __init start_kernel(void)
delayacct_init();
poking_init();
+ wr_poking_init();
check_bugs();
acpi_subsystem_init();
diff --git a/mm/Kconfig b/mm/Kconfig
index d85e39da47ae..9b09339c027f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -142,6 +142,10 @@ config ARCH_DISCARD_MEMBLOCK
config MEMORY_ISOLATION
bool
+config PRMEM
+ def_bool n
+ depends on STRICT_KERNEL_RWX && X86_64
+
#
# Only be set on architectures that have completely implemented memory hotplug
# feature. If you are not sure, don't touch it.
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..ef3867c16ce0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,6 +58,7 @@ obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_PRMEM) += prmem.o
obj-$(CONFIG_KSM) += ksm.o
obj-$(CONFIG_PAGE_POISONING) += page_poison.o
obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/prmem.c b/mm/prmem.c
new file mode 100644
index 000000000000..e8ab76701831
--- /dev/null
+++ b/mm/prmem.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * prmem.c: Memory Protection Library
+ *
+ * (C) Copyright 2017-2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/compiler.h>
+#include <linux/slab.h>
+#include <linux/mmu_context.h>
+#include <linux/rcupdate.h>
+#include <linux/prmem.h>
+
+static __ro_after_init bool wr_ready;
+static __ro_after_init struct mm_struct *wr_poking_mm;
+static __ro_after_init unsigned long wr_poking_base;
+
+/*
+ * The following two variables are statically allocated by the linker
+ * script at the the boundaries of the memory region (rounded up to
+ * multiples of PAGE_SIZE) reserved for __wr_after_init.
+ */
+extern long __start_wr_after_init;
+extern long __end_wr_after_init;
+
+static inline bool is_wr_after_init(unsigned long ptr, __kernel_size_t size)
+{
+ unsigned long start = (unsigned long)&__start_wr_after_init;
+ unsigned long end = (unsigned long)&__end_wr_after_init;
+ unsigned long low = ptr;
+ unsigned long high = ptr + size;
+
+ return likely(start <= low && low <= high && high <= end);
+}
+
+
+void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
+ enum wr_op_type op)
+{
+ temporary_mm_state_t prev;
+ unsigned long flags;
+ unsigned long offset;
+ unsigned long wr_poking_addr;
+
+ /* Confirm that the writable mapping exists. */
+ BUG_ON(!wr_ready);
+
+ if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
+ WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
+ return (void *)dst;
+
+ offset = dst - (unsigned long)&__start_wr_after_init;
+ wr_poking_addr = wr_poking_base + offset;
+ local_irq_save(flags);
+ prev = use_temporary_mm(wr_poking_mm);
+
+ kasan_disable_current();
+ if (op == WR_MEMCPY)
+ memcpy((void *)wr_poking_addr, (void *)src, len);
+ else if (op == WR_MEMSET)
+ memset((u8 *)wr_poking_addr, (u8)src, len);
+ else if (op == WR_RCU_ASSIGN_PTR)
+ /* generic version of rcu_assign_pointer */
+ smp_store_release((void **)wr_poking_addr,
+ RCU_INITIALIZER((void **)src));
+ kasan_enable_current();
+
+ barrier(); /* XXX redundant? */
+
+ unuse_temporary_mm(prev);
+ /* XXX make the verification optional? */
+ if (op == WR_MEMCPY)
+ BUG_ON(memcmp((void *)dst, (void *)src, len));
+ else if (op == WR_MEMSET)
+ BUG_ON(memtst((void *)dst, (u8)src, len));
+ else if (op == WR_RCU_ASSIGN_PTR)
+ BUG_ON(*(unsigned long *)dst != src);
+ local_irq_restore(flags);
+ return (void *)dst;
+}
+
+struct mm_struct *copy_init_mm(void);
+void __init wr_poking_init(void)
+{
+ unsigned long start = (unsigned long)&__start_wr_after_init;
+ unsigned long end = (unsigned long)&__end_wr_after_init;
+ unsigned long i;
+ unsigned long wr_range;
+
+ wr_poking_mm = copy_init_mm();
+ BUG_ON(!wr_poking_mm);
+
+ /* XXX What if it's too large to fit in the task unmapped mem? */
+ wr_range = round_up(end - start, PAGE_SIZE);
+
+ /* Randomize the poking address base*/
+ wr_poking_base = TASK_UNMAPPED_BASE +
+ (kaslr_get_random_long("Write Rare Poking") & PAGE_MASK) %
+ (TASK_SIZE - (TASK_UNMAPPED_BASE + wr_range));
+
+ /* Create alternate mapping for the entire wr_after_init range. */
+ for (i = start; i < end; i += PAGE_SIZE) {
+ struct page *page;
+ spinlock_t *ptl;
+ pte_t pte;
+ pte_t *ptep;
+ unsigned long wr_poking_addr;
+
+ BUG_ON(!(page = virt_to_page(i)));
+ wr_poking_addr = i - start + wr_poking_base;
+
+ /* The lock is not needed, but avoids open-coding. */
+ ptep = get_locked_pte(wr_poking_mm, wr_poking_addr, &ptl);
+ VM_BUG_ON(!ptep);
+
+ pte = mk_pte(page, PAGE_KERNEL);
+ set_pte_at(wr_poking_mm, wr_poking_addr, ptep, pte);
+ spin_unlock(ptl);
+ }
+ wr_ready = true;
+}
--
2.19.1
Introduce a section and a label for statically allocated write rare
data. The label is named "__wr_after_init".
As the name implies, after the init phase is completed, this section
will be modifiable only by invoking write rare functions.
The section must take up a set of full pages.
Signed-off-by: Igor Stoppa <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Nadav Amit <[email protected]>
CC: Matthew Wilcox <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Kees Cook <[email protected]>
CC: Dave Hansen <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
include/asm-generic/vmlinux.lds.h | 20 ++++++++++++++++++++
include/linux/cache.h | 17 +++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 3d7a6a9c2370..b711dbe6999f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -311,6 +311,25 @@
KEEP(*(__jump_table)) \
__stop___jump_table = .;
+/*
+ * Allow architectures to handle wr_after_init data on their
+ * own by defining an empty WR_AFTER_INIT_DATA.
+ * However, it's important that pages containing WR_RARE data do not
+ * hold anything else, to avoid both accidentally unprotecting something
+ * that is supposed to stay read-only all the time and also to protect
+ * something else that is supposed to be writeable all the time.
+ */
+#ifndef WR_AFTER_INIT_DATA
+#define WR_AFTER_INIT_DATA(align) \
+ . = ALIGN(PAGE_SIZE); \
+ __start_wr_after_init = .; \
+ . = ALIGN(align); \
+ *(.data..wr_after_init) \
+ . = ALIGN(PAGE_SIZE); \
+ __end_wr_after_init = .; \
+ . = ALIGN(align);
+#endif
+
/*
* Allow architectures to handle ro_after_init data on their
* own by defining an empty RO_AFTER_INIT_DATA.
@@ -332,6 +351,7 @@
__start_rodata = .; \
*(.rodata) *(.rodata.*) \
RO_AFTER_INIT_DATA /* Read only after init */ \
+ WR_AFTER_INIT_DATA(align) /* wr after init */ \
KEEP(*(__vermagic)) /* Kernel version magic */ \
. = ALIGN(8); \
__start___tracepoints_ptrs = .; \
diff --git a/include/linux/cache.h b/include/linux/cache.h
index 750621e41d1c..9a7e7134b887 100644
--- a/include/linux/cache.h
+++ b/include/linux/cache.h
@@ -31,6 +31,23 @@
#define __ro_after_init __attribute__((__section__(".data..ro_after_init")))
#endif
+/*
+ * __wr_after_init is used to mark objects that cannot be modified
+ * directly after init (i.e. after mark_rodata_ro() has been called).
+ * These objects become effectively read-only, from the perspective of
+ * performing a direct write, like a variable assignment.
+ * However, they can be altered through a dedicated function.
+ * It is intended for those objects which are occasionally modified after
+ * init, however they are modified so seldomly, that the extra cost from
+ * the indirect modification is either negligible or worth paying, for the
+ * sake of the protection gained.
+ */
+#ifndef __wr_after_init
+#define __wr_after_init \
+ __attribute__((__section__(".data..wr_after_init")))
+#endif
+
+
#ifndef ____cacheline_aligned
#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
#endif
--
2.19.1
I added some s390 and powerpc people.
On Tue, Dec 4, 2018 at 4:18 AM Igor Stoppa <[email protected]> wrote:
>
> Implementation of write rare for statically allocated data, located in a
> specific memory section through the use of the __write_rare label.
>
> The basic functions are:
> - wr_memset(): write rare counterpart of memset()
> - wr_memcpy(): write rare counterpart of memcpy()
> - wr_assign(): write rare counterpart of the assignment ('=') operator
> - wr_rcu_assign_pointer(): write rare counterpart of rcu_assign_pointer()
>
> The implementation is based on code from Andy Lutomirski and Nadav Amit
> for patching the text on x86 [here goes reference to commits, once merged]
>
> The modification of write protected data is done through an alternate
> mapping of the same pages, as writable.
> This mapping is local to each core and is active only for the duration
> of each write operation.
> Local interrupts are disabled, while the alternate mapping is active.
>
> In theory, it could introduce a non-predictable delay, in a preemptible
> system, however the amount of data to be altered is likely to be far
> smaller than a page.
>
> Signed-off-by: Igor Stoppa <[email protected]>
>
> CC: Andy Lutomirski <[email protected]>
> CC: Nadav Amit <[email protected]>
> CC: Matthew Wilcox <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Kees Cook <[email protected]>
> CC: Dave Hansen <[email protected]>
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> ---
> include/linux/prmem.h | 133 ++++++++++++++++++++++++++++++++++++++++++
> init/main.c | 2 +
> mm/Kconfig | 4 ++
> mm/Makefile | 1 +
> mm/prmem.c | 124 +++++++++++++++++++++++++++++++++++++++
> 5 files changed, 264 insertions(+)
> create mode 100644 include/linux/prmem.h
> create mode 100644 mm/prmem.c
>
> diff --git a/include/linux/prmem.h b/include/linux/prmem.h
> new file mode 100644
> index 000000000000..b0131c1f5dc0
> --- /dev/null
> +++ b/include/linux/prmem.h
> @@ -0,0 +1,133 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * prmem.h: Header for memory protection library
> + *
> + * (C) Copyright 2018 Huawei Technologies Co. Ltd.
> + * Author: Igor Stoppa <[email protected]>
> + *
> + * Support for:
> + * - statically allocated write rare data
> + */
> +
> +#ifndef _LINUX_PRMEM_H
> +#define _LINUX_PRMEM_H
> +
> +#include <linux/set_memory.h>
> +#include <linux/mm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/string.h>
> +#include <linux/slab.h>
> +#include <linux/mutex.h>
> +#include <linux/compiler.h>
> +#include <linux/irqflags.h>
> +
> +/**
> + * memtst() - test n bytes of the source to match the c value
> + * @p: beginning of the memory to test
> + * @c: byte to compare against
> + * @len: amount of bytes to test
> + *
> + * Returns 0 on success, non-zero otherwise.
> + */
> +static inline int memtst(void *p, int c, __kernel_size_t len)
> +{
> + __kernel_size_t i;
> +
> + for (i = 0; i < len; i++) {
> + u8 d = *(i + (u8 *)p) - (u8)c;
> +
> + if (unlikely(d))
> + return d;
> + }
> + return 0;
> +}
> +
> +
> +#ifndef CONFIG_PRMEM
> +
> +static inline void *wr_memset(void *p, int c, __kernel_size_t len)
> +{
> + return memset(p, c, len);
> +}
> +
> +static inline void *wr_memcpy(void *p, const void *q, __kernel_size_t size)
> +{
> + return memcpy(p, q, size);
> +}
> +
> +#define wr_assign(var, val) ((var) = (val))
> +
> +#define wr_rcu_assign_pointer(p, v) \
> + rcu_assign_pointer(p, v)
> +
> +#else
> +
> +enum wr_op_type {
> + WR_MEMCPY,
> + WR_MEMSET,
> + WR_RCU_ASSIGN_PTR,
> + WR_OPS_NUMBER,
> +};
> +
> +void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
> + enum wr_op_type op);
> +
> +/**
> + * wr_memset() - sets n bytes of the destination to the c value
> + * @p: beginning of the memory to write to
> + * @c: byte to replicate
> + * @len: amount of bytes to copy
> + *
> + * Returns true on success, false otherwise.
> + */
> +static inline void *wr_memset(void *p, int c, __kernel_size_t len)
> +{
> + return __wr_op((unsigned long)p, (unsigned long)c, len, WR_MEMSET);
> +}
> +
> +/**
> + * wr_memcpy() - copyes n bytes from source to destination
> + * @dst: beginning of the memory to write to
> + * @src: beginning of the memory to read from
> + * @n_bytes: amount of bytes to copy
> + *
> + * Returns pointer to the destination
> + */
> +static inline void *wr_memcpy(void *p, const void *q, __kernel_size_t size)
> +{
> + return __wr_op((unsigned long)p, (unsigned long)q, size, WR_MEMCPY);
> +}
> +
> +/**
> + * wr_assign() - sets a write-rare variable to a specified value
> + * @var: the variable to set
> + * @val: the new value
> + *
> + * Returns: the variable
> + *
> + * Note: it might be possible to optimize this, to use wr_memset in some
> + * cases (maybe with NULL?).
> + */
> +
> +#define wr_assign(var, val) ({ \
> + typeof(var) tmp = (typeof(var))val; \
> + \
> + wr_memcpy(&var, &tmp, sizeof(var)); \
> + var; \
> +})
> +
> +/**
> + * wr_rcu_assign_pointer() - initialize a pointer in rcu mode
> + * @p: the rcu pointer
> + * @v: the new value
> + *
> + * Returns the value assigned to the rcu pointer.
> + *
> + * It is provided as macro, to match rcu_assign_pointer()
> + */
> +#define wr_rcu_assign_pointer(p, v) ({ \
> + __wr_op((unsigned long)&p, v, sizeof(p), WR_RCU_ASSIGN_PTR); \
> + p; \
> +})
> +#endif
> +#endif
> diff --git a/init/main.c b/init/main.c
> index a461150adfb1..a36f2e54f937 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -498,6 +498,7 @@ void __init __weak thread_stack_cache_init(void)
> void __init __weak mem_encrypt_init(void) { }
>
> void __init __weak poking_init(void) { }
> +void __init __weak wr_poking_init(void) { }
>
> bool initcall_debug;
> core_param(initcall_debug, initcall_debug, bool, 0644);
> @@ -734,6 +735,7 @@ asmlinkage __visible void __init start_kernel(void)
> delayacct_init();
>
> poking_init();
> + wr_poking_init();
> check_bugs();
>
> acpi_subsystem_init();
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d85e39da47ae..9b09339c027f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -142,6 +142,10 @@ config ARCH_DISCARD_MEMBLOCK
> config MEMORY_ISOLATION
> bool
>
> +config PRMEM
> + def_bool n
> + depends on STRICT_KERNEL_RWX && X86_64
> +
> #
> # Only be set on architectures that have completely implemented memory hotplug
> # feature. If you are not sure, don't touch it.
> diff --git a/mm/Makefile b/mm/Makefile
> index d210cc9d6f80..ef3867c16ce0 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -58,6 +58,7 @@ obj-$(CONFIG_SPARSEMEM) += sparse.o
> obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
> obj-$(CONFIG_SLOB) += slob.o
> obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
> +obj-$(CONFIG_PRMEM) += prmem.o
> obj-$(CONFIG_KSM) += ksm.o
> obj-$(CONFIG_PAGE_POISONING) += page_poison.o
> obj-$(CONFIG_SLAB) += slab.o
> diff --git a/mm/prmem.c b/mm/prmem.c
> new file mode 100644
> index 000000000000..e8ab76701831
> --- /dev/null
> +++ b/mm/prmem.c
> @@ -0,0 +1,124 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * prmem.c: Memory Protection Library
> + *
> + * (C) Copyright 2017-2018 Huawei Technologies Co. Ltd.
> + * Author: Igor Stoppa <[email protected]>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/string.h>
> +#include <linux/compiler.h>
> +#include <linux/slab.h>
> +#include <linux/mmu_context.h>
> +#include <linux/rcupdate.h>
> +#include <linux/prmem.h>
> +
> +static __ro_after_init bool wr_ready;
> +static __ro_after_init struct mm_struct *wr_poking_mm;
> +static __ro_after_init unsigned long wr_poking_base;
> +
> +/*
> + * The following two variables are statically allocated by the linker
> + * script at the the boundaries of the memory region (rounded up to
> + * multiples of PAGE_SIZE) reserved for __wr_after_init.
> + */
> +extern long __start_wr_after_init;
> +extern long __end_wr_after_init;
> +
> +static inline bool is_wr_after_init(unsigned long ptr, __kernel_size_t size)
> +{
> + unsigned long start = (unsigned long)&__start_wr_after_init;
> + unsigned long end = (unsigned long)&__end_wr_after_init;
> + unsigned long low = ptr;
> + unsigned long high = ptr + size;
> +
> + return likely(start <= low && low <= high && high <= end);
> +}
> +
> +
> +void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
> + enum wr_op_type op)
> +{
You might end up wanting something like:
#ifdef __arch_wr_op
return __arch_wr_op(...);
#endif
if an arch (s390? powerpc?) decides to have a totally different
implementation of this.
Hi s390 and powerpc people: it would be nice if this generic
implementation *worked* on your architectures and that it will allow
you to add some straightforward way to add a better arch-specific
implementation if you think that would be better.
--Andy
> + temporary_mm_state_t prev;
> + unsigned long flags;
> + unsigned long offset;
> + unsigned long wr_poking_addr;
> +
> + /* Confirm that the writable mapping exists. */
> + BUG_ON(!wr_ready);
> +
> + if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
> + WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
> + return (void *)dst;
> +
> + offset = dst - (unsigned long)&__start_wr_after_init;
> + wr_poking_addr = wr_poking_base + offset;
> + local_irq_save(flags);
> + prev = use_temporary_mm(wr_poking_mm);
> +
> + kasan_disable_current();
> + if (op == WR_MEMCPY)
> + memcpy((void *)wr_poking_addr, (void *)src, len);
> + else if (op == WR_MEMSET)
> + memset((u8 *)wr_poking_addr, (u8)src, len);
> + else if (op == WR_RCU_ASSIGN_PTR)
> + /* generic version of rcu_assign_pointer */
> + smp_store_release((void **)wr_poking_addr,
> + RCU_INITIALIZER((void **)src));
> + kasan_enable_current();
Hmm. I suspect this will explode quite badly on sane architectures
like s390. (In my book, despite how weird s390 is, it has a vastly
nicer model of "user" memory than any other architecture I know
of...). I think you should use copy_to_user(), etc, instead. I'm not
entirely sure what the best smp_store_release() replacement is.
Making this change may also mean you can get rid of the
kasan_disable_current().
> +
> + barrier(); /* XXX redundant? */
I think it's redundant. If unuse_temporary_mm() allows earlier stores
to hit the wrong address space, then something is very very wrong, and
something is also very very wrong if the optimizer starts moving
stores across a function call that is most definitely a barrier.
> +
> + unuse_temporary_mm(prev);
> + /* XXX make the verification optional? */
> + if (op == WR_MEMCPY)
> + BUG_ON(memcmp((void *)dst, (void *)src, len));
> + else if (op == WR_MEMSET)
> + BUG_ON(memtst((void *)dst, (u8)src, len));
> + else if (op == WR_RCU_ASSIGN_PTR)
> + BUG_ON(*(unsigned long *)dst != src);
Hmm. If you allowed cmpxchg or even plain xchg, then these bug_ons
would be thoroughly buggy, but maybe they're okay. But they should,
at most, be WARN_ON_ONCE(), given that you can trigger them by writing
the same addresses from two threads at once, and this isn't even
entirely obviously bogus given the presence of smp_store_release().
> + local_irq_restore(flags);
> + return (void *)dst;
> +}
> +
> +struct mm_struct *copy_init_mm(void);
> +void __init wr_poking_init(void)
> +{
> + unsigned long start = (unsigned long)&__start_wr_after_init;
> + unsigned long end = (unsigned long)&__end_wr_after_init;
> + unsigned long i;
> + unsigned long wr_range;
> +
> + wr_poking_mm = copy_init_mm();
> + BUG_ON(!wr_poking_mm);
> +
> + /* XXX What if it's too large to fit in the task unmapped mem? */
> + wr_range = round_up(end - start, PAGE_SIZE);
> +
> + /* Randomize the poking address base*/
> + wr_poking_base = TASK_UNMAPPED_BASE +
> + (kaslr_get_random_long("Write Rare Poking") & PAGE_MASK) %
> + (TASK_SIZE - (TASK_UNMAPPED_BASE + wr_range));
> +
> + /* Create alternate mapping for the entire wr_after_init range. */
> + for (i = start; i < end; i += PAGE_SIZE) {
> + struct page *page;
> + spinlock_t *ptl;
> + pte_t pte;
> + pte_t *ptep;
> + unsigned long wr_poking_addr;
> +
> + BUG_ON(!(page = virt_to_page(i)));
> + wr_poking_addr = i - start + wr_poking_base;
> +
> + /* The lock is not needed, but avoids open-coding. */
> + ptep = get_locked_pte(wr_poking_mm, wr_poking_addr, &ptl);
> + VM_BUG_ON(!ptep);
> +
> + pte = mk_pte(page, PAGE_KERNEL);
> + set_pte_at(wr_poking_mm, wr_poking_addr, ptep, pte);
> + spin_unlock(ptl);
> + }
> + wr_ready = true;
> +}
> --
> 2.19.1
>
On Tue, Dec 04, 2018 at 02:18:01PM +0200, Igor Stoppa wrote:
> +void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
> + enum wr_op_type op)
> +{
> + temporary_mm_state_t prev;
> + unsigned long flags;
> + unsigned long offset;
> + unsigned long wr_poking_addr;
> +
> + /* Confirm that the writable mapping exists. */
> + BUG_ON(!wr_ready);
> +
> + if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
> + WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
> + return (void *)dst;
> +
> + offset = dst - (unsigned long)&__start_wr_after_init;
> + wr_poking_addr = wr_poking_base + offset;
> + local_irq_save(flags);
Why not local_irq_disable()? Do we have a use-case for wanting to access
this from interrupt context?
> + /* XXX make the verification optional? */
Well, yes. It seems like debug code to me.
> + /* Randomize the poking address base*/
> + wr_poking_base = TASK_UNMAPPED_BASE +
> + (kaslr_get_random_long("Write Rare Poking") & PAGE_MASK) %
> + (TASK_SIZE - (TASK_UNMAPPED_BASE + wr_range));
I don't think this is a great idea. We want to use the same mm for both
static and dynamic wr memory, yes? So we should have enough space for
all of ram, not splatter the static section all over the address space.
On x86-64 (4 level page tables), we have a 64TB space for all of physmem
and 128TB of user space, so we can place the base anywhere in a 64TB
range.
On Wed, Dec 05, 2018 at 03:13:56PM -0800, Andy Lutomirski wrote:
> > + if (op == WR_MEMCPY)
> > + memcpy((void *)wr_poking_addr, (void *)src, len);
> > + else if (op == WR_MEMSET)
> > + memset((u8 *)wr_poking_addr, (u8)src, len);
> > + else if (op == WR_RCU_ASSIGN_PTR)
> > + /* generic version of rcu_assign_pointer */
> > + smp_store_release((void **)wr_poking_addr,
> > + RCU_INITIALIZER((void **)src));
> > + kasan_enable_current();
>
> Hmm. I suspect this will explode quite badly on sane architectures
> like s390. (In my book, despite how weird s390 is, it has a vastly
> nicer model of "user" memory than any other architecture I know
> of...). I think you should use copy_to_user(), etc, instead. I'm not
> entirely sure what the best smp_store_release() replacement is.
> Making this change may also mean you can get rid of the
> kasan_disable_current().
If you make the MEMCPY one guarantee single-copy atomicity for native
words then you're basically done.
smp_store_release() can be implemented with:
smp_mb();
WRITE_ONCE();
So if we make MEMCPY provide the WRITE_ONCE(), all we need is that
barrier, which we can easily place at the call site and not overly
complicate our interface with this.
Because performance is down the drain already, an additional full
memory barrier is peanuts here (and in fact already implied by the x86
CR3 munging).
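A minimal sketch of that call-site pattern (illustrative names; it assumes
wr_memcpy() performs the WRITE_ONCE()-like store for a native-word-sized
destination):

#include <asm/barrier.h>
#include <linux/cache.h>
#include <linux/prmem.h>

struct foo;

static struct foo *shared_cfg __wr_after_init;

static void wr_publish_cfg(struct foo *newp)
{
	/* Order the initialization of *newp before making it visible. */
	smp_mb();
	/* Single pointer-sized store through the write rare mapping. */
	wr_memcpy(&shared_cfg, &newp, sizeof(shared_cfg));
}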
On 06/12/2018 01:13, Andy Lutomirski wrote:
>> + kasan_disable_current();
>> + if (op == WR_MEMCPY)
>> + memcpy((void *)wr_poking_addr, (void *)src, len);
>> + else if (op == WR_MEMSET)
>> + memset((u8 *)wr_poking_addr, (u8)src, len);
>> + else if (op == WR_RCU_ASSIGN_PTR)
>> + /* generic version of rcu_assign_pointer */
>> + smp_store_release((void **)wr_poking_addr,
>> + RCU_INITIALIZER((void **)src));
>> + kasan_enable_current();
>
> Hmm. I suspect this will explode quite badly on sane architectures
> like s390. (In my book, despite how weird s390 is, it has a vastly
> nicer model of "user" memory than any other architecture I know
> of...).
I see. I can try to also set up a qemu target for s390, for my tests.
There seems to be a Debian image, to get a fully bootable system.
> I think you should use copy_to_user(), etc, instead.
I'm having trouble with the "etc" part: as far as I can see, there is
both generic and arch-specific support for copying and clearing
user-space memory from the kernel, however I couldn't find anything that
looks like a memset_user().
I can of course roll my own, for example iterating copy_to_user() with
the support of a pre-allocated static buffer (1 page should be enough).
But, before I go down this path, I wanted to confirm that there's really
nothing better that I could use.
If that's really the case, the static buffer instance should be
replicated for each core, I think, since each core could be performing
its own memset_user() at the same time.
Alternatively, I could do a loop of WRITE_ONCE(); however, I'm not sure
how that would work with (lack of) alignment, and it might also require a
preamble/epilogue to deal with unaligned data.
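For the copy_to_user() route, a rough sketch of the kind of helper I have
in mind (hypothetical name; the pre-allocated per-core buffer is simplified
here to a small on-stack one):

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/uaccess.h>

static int memset_user_sketch(void __user *dst, int c, size_t len)
{
	u8 buf[64];
	size_t chunk;

	memset(buf, c, sizeof(buf));
	while (len) {
		chunk = min(len, sizeof(buf));
		if (copy_to_user(dst, buf, chunk))
			return -EFAULT;
		dst += chunk;
		len -= chunk;
	}
	return 0;
}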
> I'm not
> entirely sure what the best smp_store_release() replacement is.
> Making this change may also mean you can get rid of the
> kasan_disable_current().
>
>> +
>> + barrier(); /* XXX redundant? */
>
> I think it's redundant. If unuse_temporary_mm() allows earlier stores
> to hit the wrong address space, then something is very very wrong, and
> something is also very very wrong if the optimizer starts moving
> stores across a function call that is most definitely a barrier.
ok, thanks
>> +
>> + unuse_temporary_mm(prev);
>> + /* XXX make the verification optional? */
>> + if (op == WR_MEMCPY)
>> + BUG_ON(memcmp((void *)dst, (void *)src, len));
>> + else if (op == WR_MEMSET)
>> + BUG_ON(memtst((void *)dst, (u8)src, len));
>> + else if (op == WR_RCU_ASSIGN_PTR)
>> + BUG_ON(*(unsigned long *)dst != src);
>
> Hmm. If you allowed cmpxchg or even plain xchg, then these bug_ons
> would be thoroughly buggy, but maybe they're okay. But they should,
> at most, be WARN_ON_ONCE(),
I have to confess that I do not understand why Nadav's patchset was
required to use BUG_ON(), while here it's not correct, not even for
memcpy or memset.
Is it because it is single-threaded?
Or is it because text_poke() is patching code, instead of data?
I can turn to WARN_ON_ONCE(), but I'd like to understand the reason.
> given that you can trigger them by writing
> the same addresses from two threads at once, and this isn't even
> entirely obviously bogus given the presence of smp_store_release().
True; however, would it be reasonable to require the use of an explicit
writer lock from the user?
This operation is not exactly fast and should happen seldom; I'm not
sure if it's worth supporting cmpxchg. The speedup would be minimal.
I'd rather not implement the locking implicitly, even if it would be
possible to detect simultaneous writes, because it might lead to overall
inconsistent data.
--
igor
On 06/12/2018 06:44, Matthew Wilcox wrote:
> On Tue, Dec 04, 2018 at 02:18:01PM +0200, Igor Stoppa wrote:
>> +void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
>> + enum wr_op_type op)
>> +{
>> + temporary_mm_state_t prev;
>> + unsigned long flags;
>> + unsigned long offset;
>> + unsigned long wr_poking_addr;
>> +
>> + /* Confirm that the writable mapping exists. */
>> + BUG_ON(!wr_ready);
>> +
>> + if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
>> + WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
>> + return (void *)dst;
>> +
>> + offset = dst - (unsigned long)&__start_wr_after_init;
>> + wr_poking_addr = wr_poking_base + offset;
>> + local_irq_save(flags);
>
> Why not local_irq_disable()? Do we have a use-case for wanting to access
> this from interrupt context?
No, not that I can think of. It was "just in case", but I can remove it.
>> + /* XXX make the verification optional? */
>
> Well, yes. It seems like debug code to me.
Ok, I was not sure about this, because text_poke() does it as part of
its normal operations.
>> + /* Randomize the poking address base*/
>> + wr_poking_base = TASK_UNMAPPED_BASE +
>> + (kaslr_get_random_long("Write Rare Poking") & PAGE_MASK) %
>> + (TASK_SIZE - (TASK_UNMAPPED_BASE + wr_range));
>
> I don't think this is a great idea. We want to use the same mm for both
> static and dynamic wr memory, yes? So we should have enough space for
> all of ram, not splatter the static section all over the address space.
>
> On x86-64 (4 level page tables), we have a 64TB space for all of physmem
> and 128TB of user space, so we can place the base anywhere in a 64TB
> range.
I was actually wondering about the dynamic part.
It's still not clear to me if it's possible to write the code in a
sufficiently generic way that it could work on all 64 bit architectures.
I'll start with x86-64 as you suggest.
--
igor
On 06/12/2018 11:44, Peter Zijlstra wrote:
> On Wed, Dec 05, 2018 at 03:13:56PM -0800, Andy Lutomirski wrote:
>
>>> + if (op == WR_MEMCPY)
>>> + memcpy((void *)wr_poking_addr, (void *)src, len);
>>> + else if (op == WR_MEMSET)
>>> + memset((u8 *)wr_poking_addr, (u8)src, len);
>>> + else if (op == WR_RCU_ASSIGN_PTR)
>>> + /* generic version of rcu_assign_pointer */
>>> + smp_store_release((void **)wr_poking_addr,
>>> + RCU_INITIALIZER((void **)src));
>>> + kasan_enable_current();
>>
>> Hmm. I suspect this will explode quite badly on sane architectures
>> like s390. (In my book, despite how weird s390 is, it has a vastly
>> nicer model of "user" memory than any other architecture I know
>> of...). I think you should use copy_to_user(), etc, instead. I'm not
>> entirely sure what the best smp_store_release() replacement is.
>> Making this change may also mean you can get rid of the
>> kasan_disable_current().
>
> If you make the MEMCPY one guarantee single-copy atomicity for native
> words then you're basically done.
>
> smp_store_release() can be implemented with:
>
> smp_mb();
> WRITE_ONCE();
>
> So if we make MEMCPY provide the WRITE_ONCE(), all we need is that
> barrier, which we can easily place at the call site and not overly
> complicate our interface with this.
Ok, so the 3rd case (WR_RCU_ASSIGN_PTR) could be handled outside of this
function.
But, since memcpy() will now be replaced by copy_to_user(), can I assume
that copy_to_user() will also be atomic, if the destination is properly
aligned? On x86_64 it seems so, however it's not clear to me whether this
is the outcome of an optimization or something I can always expect to hold.
--
igor
On Mon, Dec 10, 2018 at 12:32:21AM +0200, Igor Stoppa wrote:
>
>
> On 06/12/2018 11:44, Peter Zijlstra wrote:
> > On Wed, Dec 05, 2018 at 03:13:56PM -0800, Andy Lutomirski wrote:
> >
> > > > + if (op == WR_MEMCPY)
> > > > + memcpy((void *)wr_poking_addr, (void *)src, len);
> > > > + else if (op == WR_MEMSET)
> > > > + memset((u8 *)wr_poking_addr, (u8)src, len);
> > > > + else if (op == WR_RCU_ASSIGN_PTR)
> > > > + /* generic version of rcu_assign_pointer */
> > > > + smp_store_release((void **)wr_poking_addr,
> > > > + RCU_INITIALIZER((void **)src));
> > > > + kasan_enable_current();
> > >
> > > Hmm. I suspect this will explode quite badly on sane architectures
> > > like s390. (In my book, despite how weird s390 is, it has a vastly
> > > nicer model of "user" memory than any other architecture I know
> > > of...). I think you should use copy_to_user(), etc, instead. I'm not
> > > entirely sure what the best smp_store_release() replacement is.
> > > Making this change may also mean you can get rid of the
> > > kasan_disable_current().
> >
> > If you make the MEMCPY one guarantee single-copy atomicity for native
> > words then you're basically done.
> >
> > smp_store_release() can be implemented with:
> >
> > smp_mb();
> > WRITE_ONCE();
> >
> > So if we make MEMCPY provide the WRITE_ONCE(), all we need is that
> > barrier, which we can easily place at the call site and not overly
> > complicate our interface with this.
>
> Ok, so the 3rd case (WR_RCU_ASSIGN_PTR) could be handled outside of this
> function.
> But, since now memcpy() will be replaced by copy_to_user(), can I assume
> that also copy_to_user() will be atomic, if the destination is properly
> aligned? On x86_64 it seems yes, however it's not clear to me if this is the
> outcome of an optimization or if I can expect it to be always true.
This would be a new constraint; one that needs to be documented and
verified by the various arch maintainers as they enable this feature on
their platform.
On Wed, 5 Dec 2018 15:13:56 -0800
Andy Lutomirski <[email protected]> wrote:
> I added some s390 and powerpc people.
>
> On Tue, Dec 4, 2018 at 4:18 AM Igor Stoppa <[email protected]> wrote:
> >
> > Implementation of write rare for statically allocated data, located in a
> > specific memory section through the use of the __write_rare label.
> >
> > The basic functions are:
> > - wr_memset(): write rare counterpart of memset()
> > - wr_memcpy(): write rare counterpart of memcpy()
> > - wr_assign(): write rare counterpart of the assignment ('=') operator
> > - wr_rcu_assign_pointer(): write rare counterpart of rcu_assign_pointer()
> >
> > The implementation is based on code from Andy Lutomirski and Nadav Amit
> > for patching the text on x86 [here goes reference to commits, once merged]
> >
> > The modification of write protected data is done through an alternate
> > mapping of the same pages, as writable.
> > This mapping is local to each core and is active only for the duration
> > of each write operation.
> > Local interrupts are disabled, while the alternate mapping is active.
> >
> > In theory, it could introduce a non-predictable delay, in a preemptible
> > system, however the amount of data to be altered is likely to be far
> > smaller than a page.
> >
> > Signed-off-by: Igor Stoppa <[email protected]>
> >
> > CC: Andy Lutomirski <[email protected]>
> > CC: Nadav Amit <[email protected]>
> > CC: Matthew Wilcox <[email protected]>
> > CC: Peter Zijlstra <[email protected]>
> > CC: Kees Cook <[email protected]>
> > CC: Dave Hansen <[email protected]>
> > CC: [email protected]
> > CC: [email protected]
> > CC: [email protected]
> > CC: [email protected]
> > ---
> > include/linux/prmem.h | 133 ++++++++++++++++++++++++++++++++++++++++++
> > init/main.c | 2 +
> > mm/Kconfig | 4 ++
> > mm/Makefile | 1 +
> > mm/prmem.c | 124 +++++++++++++++++++++++++++++++++++++++
> > 5 files changed, 264 insertions(+)
> > create mode 100644 include/linux/prmem.h
> > create mode 100644 mm/prmem.c
> >
> > diff --git a/include/linux/prmem.h b/include/linux/prmem.h
> > new file mode 100644
> > index 000000000000..b0131c1f5dc0
> > --- /dev/null
> > +++ b/include/linux/prmem.h
> > @@ -0,0 +1,133 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * prmem.h: Header for memory protection library
> > + *
> > + * (C) Copyright 2018 Huawei Technologies Co. Ltd.
> > + * Author: Igor Stoppa <[email protected]>
> > + *
> > + * Support for:
> > + * - statically allocated write rare data
> > + */
> > +
> > +#ifndef _LINUX_PRMEM_H
> > +#define _LINUX_PRMEM_H
> > +
> > +#include <linux/set_memory.h>
> > +#include <linux/mm.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/string.h>
> > +#include <linux/slab.h>
> > +#include <linux/mutex.h>
> > +#include <linux/compiler.h>
> > +#include <linux/irqflags.h>
> > +
> > +/**
> > + * memtst() - test whether n bytes of the source match the value c
> > + * @p: beginning of the memory to test
> > + * @c: byte to compare against
> > + * @len: number of bytes to test
> > + *
> > + * Returns 0 on success, non-zero otherwise.
> > + */
> > +static inline int memtst(void *p, int c, __kernel_size_t len)
> > +{
> > + __kernel_size_t i;
> > +
> > + for (i = 0; i < len; i++) {
> > + u8 d = *(i + (u8 *)p) - (u8)c;
> > +
> > + if (unlikely(d))
> > + return d;
> > + }
> > + return 0;
> > +}
> > +
> > +
> > +#ifndef CONFIG_PRMEM
> > +
> > +static inline void *wr_memset(void *p, int c, __kernel_size_t len)
> > +{
> > + return memset(p, c, len);
> > +}
> > +
> > +static inline void *wr_memcpy(void *p, const void *q, __kernel_size_t size)
> > +{
> > + return memcpy(p, q, size);
> > +}
> > +
> > +#define wr_assign(var, val) ((var) = (val))
> > +
> > +#define wr_rcu_assign_pointer(p, v) \
> > + rcu_assign_pointer(p, v)
> > +
> > +#else
> > +
> > +enum wr_op_type {
> > + WR_MEMCPY,
> > + WR_MEMSET,
> > + WR_RCU_ASSIGN_PTR,
> > + WR_OPS_NUMBER,
> > +};
> > +
> > +void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
> > + enum wr_op_type op);
> > +
> > +/**
> > + * wr_memset() - set n bytes of the destination to the value c
> > + * @p: beginning of the memory to write to
> > + * @c: byte to replicate
> > + * @len: number of bytes to set
> > + *
> > + * Returns pointer to the destination.
> > + */
> > +static inline void *wr_memset(void *p, int c, __kernel_size_t len)
> > +{
> > + return __wr_op((unsigned long)p, (unsigned long)c, len, WR_MEMSET);
> > +}
> > +
> > +/**
> > + * wr_memcpy() - copies n bytes from source to destination
> > + * @p: beginning of the memory to write to
> > + * @q: beginning of the memory to read from
> > + * @size: number of bytes to copy
> > + *
> > + * Returns pointer to the destination.
> > + */
> > +static inline void *wr_memcpy(void *p, const void *q, __kernel_size_t size)
> > +{
> > + return __wr_op((unsigned long)p, (unsigned long)q, size, WR_MEMCPY);
> > +}
> > +
> > +/**
> > + * wr_assign() - sets a write-rare variable to a specified value
> > + * @var: the variable to set
> > + * @val: the new value
> > + *
> > + * Returns: the variable
> > + *
> > + * Note: it might be possible to optimize this, to use wr_memset in some
> > + * cases (maybe with NULL?).
> > + */
> > +
> > +#define wr_assign(var, val) ({ \
> > + typeof(var) tmp = (typeof(var))(val); \
> > + \
> > + wr_memcpy(&(var), &tmp, sizeof(var)); \
> > + (var); \
> > +})
> > +
> > +/**
> > + * wr_rcu_assign_pointer() - initialize a pointer in rcu mode
> > + * @p: the rcu pointer
> > + * @v: the new value
> > + *
> > + * Returns the value assigned to the rcu pointer.
> > + *
> > + * It is provided as macro, to match rcu_assign_pointer()
> > + */
> > +#define wr_rcu_assign_pointer(p, v) ({ \
> > + __wr_op((unsigned long)&p, v, sizeof(p), WR_RCU_ASSIGN_PTR); \
> > + p; \
> > +})
> > +#endif
> > +#endif
> > diff --git a/init/main.c b/init/main.c
> > index a461150adfb1..a36f2e54f937 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -498,6 +498,7 @@ void __init __weak thread_stack_cache_init(void)
> > void __init __weak mem_encrypt_init(void) { }
> >
> > void __init __weak poking_init(void) { }
> > +void __init __weak wr_poking_init(void) { }
> >
> > bool initcall_debug;
> > core_param(initcall_debug, initcall_debug, bool, 0644);
> > @@ -734,6 +735,7 @@ asmlinkage __visible void __init start_kernel(void)
> > delayacct_init();
> >
> > poking_init();
> > + wr_poking_init();
> > check_bugs();
> >
> > acpi_subsystem_init();
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index d85e39da47ae..9b09339c027f 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -142,6 +142,10 @@ config ARCH_DISCARD_MEMBLOCK
> > config MEMORY_ISOLATION
> > bool
> >
> > +config PRMEM
> > + def_bool n
> > + depends on STRICT_KERNEL_RWX && X86_64
> > +
> > #
> > # Only be set on architectures that have completely implemented memory hotplug
> > # feature. If you are not sure, don't touch it.
> > diff --git a/mm/Makefile b/mm/Makefile
> > index d210cc9d6f80..ef3867c16ce0 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -58,6 +58,7 @@ obj-$(CONFIG_SPARSEMEM) += sparse.o
> > obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
> > obj-$(CONFIG_SLOB) += slob.o
> > obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
> > +obj-$(CONFIG_PRMEM) += prmem.o
> > obj-$(CONFIG_KSM) += ksm.o
> > obj-$(CONFIG_PAGE_POISONING) += page_poison.o
> > obj-$(CONFIG_SLAB) += slab.o
> > diff --git a/mm/prmem.c b/mm/prmem.c
> > new file mode 100644
> > index 000000000000..e8ab76701831
> > --- /dev/null
> > +++ b/mm/prmem.c
> > @@ -0,0 +1,124 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * prmem.c: Memory Protection Library
> > + *
> > + * (C) Copyright 2017-2018 Huawei Technologies Co. Ltd.
> > + * Author: Igor Stoppa <[email protected]>
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/string.h>
> > +#include <linux/compiler.h>
> > +#include <linux/slab.h>
> > +#include <linux/mmu_context.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/prmem.h>
> > +
> > +static __ro_after_init bool wr_ready;
> > +static __ro_after_init struct mm_struct *wr_poking_mm;
> > +static __ro_after_init unsigned long wr_poking_base;
> > +
> > +/*
> > + * The following two variables are statically allocated by the linker
> > + * script at the boundaries of the memory region (rounded up to
> > + * multiples of PAGE_SIZE) reserved for __wr_after_init.
> > + */
> > +extern long __start_wr_after_init;
> > +extern long __end_wr_after_init;
> > +
> > +static inline bool is_wr_after_init(unsigned long ptr, __kernel_size_t size)
> > +{
> > + unsigned long start = (unsigned long)&__start_wr_after_init;
> > + unsigned long end = (unsigned long)&__end_wr_after_init;
> > + unsigned long low = ptr;
> > + unsigned long high = ptr + size;
> > +
> > + return likely(start <= low && low <= high && high <= end);
> > +}
> > +
> > +
> > +void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
> > + enum wr_op_type op)
> > +{
>
> You might end up wanting something like:
>
> #ifdef __arch_wr_op
> return __arch_wr_op(...);
> #endif
>
> if an arch (s390? powerpc?) decides to have a totally different
> implementation of this.
>
> Hi s390 and powerpc people: it would be nice if this generic
> implementation *worked* on your architectures and that it will allow
> you to add some straightforward way to add a better arch-specific
> implementation if you think that would be better.
As the code is right now I can guarantee that it will not work on s390.
> > + temporary_mm_state_t prev;
> > + unsigned long flags;
> > + unsigned long offset;
> > + unsigned long wr_poking_addr;
> > +
> > + /* Confirm that the writable mapping exists. */
> > + BUG_ON(!wr_ready);
> > +
> > + if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
> > + WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
> > + return (void *)dst;
> > +
> > + offset = dst - (unsigned long)&__start_wr_after_init;
> > + wr_poking_addr = wr_poking_base + offset;
> > + local_irq_save(flags);
> > + prev = use_temporary_mm(wr_poking_mm);
> > +
> > + kasan_disable_current();
> > + if (op == WR_MEMCPY)
> > + memcpy((void *)wr_poking_addr, (void *)src, len);
> > + else if (op == WR_MEMSET)
> > + memset((u8 *)wr_poking_addr, (u8)src, len);
> > + else if (op == WR_RCU_ASSIGN_PTR)
> > + /* generic version of rcu_assign_pointer */
> > + smp_store_release((void **)wr_poking_addr,
> > + RCU_INITIALIZER((void **)src));
> > + kasan_enable_current();
>
> Hmm. I suspect this will explode quite badly on sane architectures
> like s390. (In my book, despite how weird s390 is, it has a vastly
> nicer model of "user" memory than any other architecture I know
> of...). I think you should use copy_to_user(), etc, instead. I'm not
> entirely sure what the best smp_store_release() replacement is.
> Making this change may also mean you can get rid of the
> kasan_disable_current().
use_temporary_mm() replaces the *user* mm; on s390 this is completely
independent from the kernel mm (different control registers: cr1 vs cr7).
The wr_poking_addr is not available when running in the kernel address
space.
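
To illustrate the direction suggested above (only a sketch, assuming the
temporary mm is installed and wr_poking_addr is treated as a proper user
space address), the poke itself would then go through the uaccess
primitives, e.g.:

	/* Sketch: write through the user address space, so that
	 * architectures with a separate user address space (such as
	 * s390) can still reach the alternate mapping.
	 */
	if (op == WR_MEMCPY)
		WARN_ON_ONCE(copy_to_user((void __user *)wr_poking_addr,
					  (const void *)src, len));

The WR_MEMSET case would need an equivalent uaccess-based helper (or a
small bounce buffer), which is part of what makes an arch-specific hook
attractive.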
> > +
> > + barrier(); /* XXX redundant? */
>
> I think it's redundant. If unuse_temporary_mm() allows earlier stores
> to hit the wrong address space, then something is very very wrong, and
> something is also very very wrong if the optimizer starts moving
> stores across a function call that is most definitely a barrier.
>
> > +
> > + unuse_temporary_mm(prev);
> > + /* XXX make the verification optional? */
> > + if (op == WR_MEMCPY)
> > + BUG_ON(memcmp((void *)dst, (void *)src, len));
> > + else if (op == WR_MEMSET)
> > + BUG_ON(memtst((void *)dst, (u8)src, len));
> > + else if (op == WR_RCU_ASSIGN_PTR)
> > + BUG_ON(*(unsigned long *)dst != src);
>
> Hmm. If you allowed cmpxchg or even plain xchg, then these bug_ons
> would be thoroughly buggy, but maybe they're okay. But they should,
> at most, be WARN_ON_ONCE(), given that you can trigger them by writing
> the same addresses from two threads at once, and this isn't even
> entirely obviously bogus given the presence of smp_store_release().
>
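
(A sketch of the milder check suggested above, for illustration only:

	if (op == WR_MEMCPY)
		WARN_ON_ONCE(memcmp((void *)dst, (void *)src, len));
	else if (op == WR_MEMSET)
		WARN_ON_ONCE(memtst((void *)dst, (u8)src, len));

so that a racing writer triggers a one-off warning instead of taking the
machine down.)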
> > + local_irq_restore(flags);
> > + return (void *)dst;
> > +}
> > +
> > +struct mm_struct *copy_init_mm(void);
> > +void __init wr_poking_init(void)
> > +{
> > + unsigned long start = (unsigned long)&__start_wr_after_init;
> > + unsigned long end = (unsigned long)&__end_wr_after_init;
> > + unsigned long i;
> > + unsigned long wr_range;
> > +
> > + wr_poking_mm = copy_init_mm();
> > + BUG_ON(!wr_poking_mm);
> > +
> > + /* XXX What if it's too large to fit in the task unmapped mem? */
> > + wr_range = round_up(end - start, PAGE_SIZE);
> > +
> > + /* Randomize the poking address base. */
> > + wr_poking_base = TASK_UNMAPPED_BASE +
> > + (kaslr_get_random_long("Write Rare Poking") & PAGE_MASK) %
> > + (TASK_SIZE - (TASK_UNMAPPED_BASE + wr_range));
The wr_poking_base address is a user space address, no? This will be usable
only with the uaccess primitives on s390.
> > +
> > + /* Create alternate mapping for the entire wr_after_init range. */
> > + for (i = start; i < end; i += PAGE_SIZE) {
> > + struct page *page;
> > + spinlock_t *ptl;
> > + pte_t pte;
> > + pte_t *ptep;
> > + unsigned long wr_poking_addr;
> > +
> > + BUG_ON(!(page = virt_to_page(i)));
> > + wr_poking_addr = i - start + wr_poking_base;
> > +
> > + /* The lock is not needed, but avoids open-coding. */
> > + ptep = get_locked_pte(wr_poking_mm, wr_poking_addr, &ptl);
> > + VM_BUG_ON(!ptep);
> > +
> > + pte = mk_pte(page, PAGE_KERNEL);
> > + set_pte_at(wr_poking_mm, wr_poking_addr, ptep, pte);
> > + spin_unlock(ptl);
> > + }
> > + wr_ready = true;
> > +}
> > --
> > 2.19.1
> >
>
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On 12/12/2018 11:49, Martin Schwidefsky wrote:
> On Wed, 5 Dec 2018 15:13:56 -0800
> Andy Lutomirski <[email protected]> wrote:
>> Hi s390 and powerpc people: it would be nice if this generic
>> implementation *worked* on your architectures and that it will allow
>> you to add some straightforward way to add a better arch-specific
>> implementation if you think that would be better.
>
> As the code is right now I can guarantee that it will not work on s390.
OK, I have thrown in the towel wrt developing for multiple architectures
at the same time.
ATM I'm oriented toward getting support for one (x86_64), leaving the
actual mechanism as architecture-specific.
Then I can add another one or two and see what makes sense to refactor.
This approach should minimize the overall churn.
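
As a rough sketch of what that split could look like (hypothetical names,
following the __arch_wr_op idea suggested earlier in the thread):

	/* mm/prmem.c - generic entry point */
	void *__wr_op(unsigned long dst, unsigned long src,
		      __kernel_size_t len, enum wr_op_type op)
	{
	#ifdef __arch_wr_op
		/* arch-provided poking mechanism */
		return __arch_wr_op(dst, src, len, op);
	#else
		/* temporary-mm based implementation, as in this patch */
		return __wr_op_generic(dst, src, len, op);
	#endif
	}

where both __arch_wr_op() and __wr_op_generic() are placeholder names;
x86_64 would be the first user, and other architectures could plug in
their own mechanism later.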
--
igor