This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.
The memory is managed in pools: when a memory pool is turned into R/O,
all the memory that is part of it, will become R/O.
A R/O pool can be destroyed, to recover its memory, but it cannot be
turned back into R/W mode.
This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.
However the data might need to be released, for example as part of module
unloading.
To do this, the memory must first be freed, then the pool can be destroyed.
An example is provided, in the form of self-testing.
Changes since the v10 version:
Initially I tried to provide support for hardening the LSM hooks, but the
LSM code was too much in a flux to have some chance to be merged.
Several drop-in replacement for kmalloc based functions, for example
kzalloc.
From this perspective I have also modified genalloc, to make its free
functionality follow more closely the kfree, which doesn't need to be told
the size of the allocation being released. This was sent out for review
twice, but it has not received any feedback, so far.
Also genalloc now comes with self-testing.
The latest can be found also here:
https://www.spinics.net/lists/kernel/msg2696152.html
The need to integrate with hardened user copy has driven an optimization
in the management of vmap_areas, where each struct page in a vmalloc area
has a reference to it, saving the search through the various areas.
I was planning - and can still do it - to provide hardening for some IMA
data, but in the meanwhile it seems that the XFS developers might be
interested in htis functionality:
http://www.openwall.com/lists/kernel-hardening/2018/01/24/1
So I'm sending it out as preview.
Igor Stoppa (6):
genalloc: track beginning of allocations
genalloc: selftest
struct page: add field for vm_struct
Protectable Memory
Documentation for Pmalloc
Pmalloc: self-test
Documentation/core-api/pmalloc.txt | 104 ++++++++
include/linux/genalloc-selftest.h | 30 +++
include/linux/genalloc.h | 6 +-
include/linux/mm_types.h | 1 +
include/linux/pmalloc.h | 215 ++++++++++++++++
include/linux/vmalloc.h | 1 +
init/main.c | 2 +
lib/Kconfig | 15 ++
lib/Makefile | 1 +
lib/genalloc-selftest.c | 402 +++++++++++++++++++++++++++++
lib/genalloc.c | 444 +++++++++++++++++++++----------
mm/Kconfig | 7 +
mm/Makefile | 2 +
mm/pmalloc-selftest.c | 65 +++++
mm/pmalloc-selftest.h | 30 +++
mm/pmalloc.c | 516 +++++++++++++++++++++++++++++++++++++
mm/usercopy.c | 25 +-
mm/vmalloc.c | 18 +-
18 files changed, 1744 insertions(+), 140 deletions(-)
create mode 100644 Documentation/core-api/pmalloc.txt
create mode 100644 include/linux/genalloc-selftest.h
create mode 100644 include/linux/pmalloc.h
create mode 100644 lib/genalloc-selftest.c
create mode 100644 mm/pmalloc-selftest.c
create mode 100644 mm/pmalloc-selftest.h
create mode 100644 mm/pmalloc.c
--
2.9.3
The genalloc library is only capable of tracking if a certain unit of
allocation is in use or not.
It is not capable of discerning where the memory associated to an
allocation request begins and where it ends.
The reason is that units of allocations are tracked by using a bitmap,
where each bit represents that the unit is either allocated (1) or
available (0).
The user of the API must keep track of how much space was requested, if
it ever needs to be freed.
This can cause errors being undetected.
Ex:
* Only a subset of the memory provided to an allocation request is freed
* The memory from a subsequent allocation is freed
* The memory being freed doesn't start at the beginning of an
allocation.
The bitmap is used because it allows to perform lockless read/write
access, where this is supported by hw through cmpxchg.
Similarly, it is possible to scan the bitmap for a sufficiently long
sequence of zeros, to identify zones available for allocation.
--
This patch doubles the space reserved in the bitmap for each allocation.
By using 2 bits per allocation, it is possible to encode also the
information of where the allocation starts:
(msb to the left, lsb to the right, in the following "dictionary")
11: first allocation unit in the allocation
10: any subsequent allocation unit (if any) in the allocation
00: available allocation unit
01: invalid
Ex, with the same notation as above - MSb.......LSb:
...000010111100000010101011 <-- Read in this direction.
\__|\__|\|\____|\______|
| | | | \___ 4 used allocation units
| | | \___________ 3 empty allocation units
| | \_________________ 1 used allocation unit
| \___________________ 2 used allocation units
\_______________________ 2 empty allocation units
Because of the encoding, the previous lockless operations are still
possible. The only caveat is to change the parameter of the zero-finding
function which establishes the alignment at which to perform the test
for first zero.
The original value of the parameter is 0, meaning that an allocation can
start at any point in the bitmap, while the new value is 1, meaning that
allocations can start only at even places (bit 0, bit 2, etc.)
The number of zeroes to look for, must therefore be doubled.
When it's time to free the memory associated to an allocation request,
it's a matter of checking if the corresponding allocation unit is really
the beginning of an allocation (both bits are set to 1).
Looking for the ending can also be performed locklessly.
It's sufficient to identify the first mapped allocation unit
that is represented either as free (00) or busy (11).
Even if the allocation status should change in the meanwhile, it doesn't
matter, since it can only transition between free (00) and
first-allocated (11).
The parameter indicating to the *_free() function the size of the space
that should be freed is not currently removed, to facilitate the
transition, but it is verified, whenever it is not zero.
If it is set to zero, then the free function will autonomously decide the
size to be free, by scanning the bitmap.
About the implementation: the patch introduces the concept of "bitmap
entry", which has a 1:1 mapping with allocation units, while the code
being patched has a 1:1 mapping between allocation units and bits.
This means that, now, the bitmap can be extended (by following powers of
2), to track also other properties of the allocations, if ever needed.
Signed-off-by: Igor Stoppa <[email protected]>
---
include/linux/genalloc.h | 3 +-
lib/genalloc.c | 417 ++++++++++++++++++++++++++++++++---------------
2 files changed, 289 insertions(+), 131 deletions(-)
diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 6dfec4d..a8fdabf 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -32,6 +32,7 @@
#include <linux/types.h>
#include <linux/spinlock_types.h>
+#include <linux/slab.h>
struct device;
struct device_node;
@@ -75,7 +76,7 @@ struct gen_pool_chunk {
phys_addr_t phys_addr; /* physical starting address of memory chunk */
unsigned long start_addr; /* start address of memory chunk */
unsigned long end_addr; /* end address of memory chunk (inclusive) */
- unsigned long bits[0]; /* bitmap for allocating memory chunk */
+ unsigned long entries[0]; /* bitmap for allocating memory chunk */
};
/*
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 144fe6b..13bc8cf 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -36,114 +36,221 @@
#include <linux/genalloc.h>
#include <linux/of_device.h>
+#define ENTRY_ORDER 1UL
+#define ENTRY_MASK ((1UL << ((ENTRY_ORDER) + 1UL)) - 1UL)
+#define ENTRY_HEAD ENTRY_MASK
+#define ENTRY_UNUSED 0UL
+#define BITS_PER_ENTRY (1U << ENTRY_ORDER)
+#define BITS_DIV_ENTRIES(x) ((x) >> ENTRY_ORDER)
+#define ENTRIES_TO_BITS(x) ((x) << ENTRY_ORDER)
+#define BITS_DIV_LONGS(x) ((x) / BITS_PER_LONG)
+#define ENTRIES_DIV_LONGS(x) (BITS_DIV_LONGS(ENTRIES_TO_BITS(x)))
+
+#define ENTRIES_PER_LONG BITS_DIV_ENTRIES(BITS_PER_LONG)
+
+/* Binary pattern of 1010...1010 that spans one unsigned long. */
+#define MASK (~0UL / 3 * 2)
+
+/**
+ * get_bitmap_entry - extracts the specified entry from the bitmap
+ * @map: pointer to a bitmap
+ * @entry_index: the index of the desired entry in the bitmap
+ *
+ * Returns the requested bitmap.
+ */
+static inline unsigned long get_bitmap_entry(unsigned long *map,
+ int entry_index)
+{
+ return (map[ENTRIES_DIV_LONGS(entry_index)] >>
+ ENTRIES_TO_BITS(entry_index % ENTRIES_PER_LONG)) &
+ ENTRY_MASK;
+}
+
+
+/**
+ * mem_to_units - convert references to memory into orders of allocation
+ * @size: amount in bytes
+ * @order: pow of 2 represented by each entry in the bitmap
+ *
+ * Returns the number of units representing the size.
+ */
+static inline unsigned long mem_to_units(unsigned long size,
+ unsigned long order)
+{
+ return (size + (1UL << order) - 1) >> order;
+}
+
+/**
+ * chunk_size - dimension of a chunk of memory
+ * @chunk: pointer to the struct describing the chunk
+ *
+ * Returns the size of the chunk.
+ */
static inline size_t chunk_size(const struct gen_pool_chunk *chunk)
{
return chunk->end_addr - chunk->start_addr + 1;
}
-static int set_bits_ll(unsigned long *addr, unsigned long mask_to_set)
+
+/**
+ * set_bits_ll - according to the mask, sets the bits specified by
+ * value, at the address specified.
+ * @addr: where to write
+ * @mask: filter to apply for the bits to alter
+ * @value: actual configuration of bits to store
+ *
+ * Returns 0 upon success, -EBUSY otherwise
+ */
+static int set_bits_ll(unsigned long *addr,
+ unsigned long mask, unsigned long value)
{
- unsigned long val, nval;
+ unsigned long nval;
+ unsigned long present;
+ unsigned long target;
nval = *addr;
do {
- val = nval;
- if (val & mask_to_set)
+ present = nval;
+ if (present & mask)
return -EBUSY;
+ target = present | value;
cpu_relax();
- } while ((nval = cmpxchg(addr, val, val | mask_to_set)) != val);
-
+ } while ((nval = cmpxchg(addr, present, target)) != target);
return 0;
}
-static int clear_bits_ll(unsigned long *addr, unsigned long mask_to_clear)
+
+/**
+ * cleart_bits_ll - according to the mask, clears the bits specified by
+ * value, at the address specified.
+ * @addr: where to write
+ * @mask: filter to apply for the bits to alter
+ * @value: actual configuration of bits to clear
+ *
+ * Returns 0 upon success, -EBUSY otherwise
+ */
+static int clear_bits_ll(unsigned long *addr,
+ unsigned long mask, unsigned long value)
{
- unsigned long val, nval;
+ unsigned long nval;
+ unsigned long present;
+ unsigned long target;
nval = *addr;
+ present = nval;
+ if (unlikely((present & mask) ^ value))
+ return -EBUSY;
do {
- val = nval;
- if ((val & mask_to_clear) != mask_to_clear)
+ present = nval;
+ if (unlikely((present & mask) ^ value))
return -EBUSY;
+ target = present & ~mask;
cpu_relax();
- } while ((nval = cmpxchg(addr, val, val & ~mask_to_clear)) != val);
-
+ } while ((nval = cmpxchg(addr, present, target)) != target);
return 0;
}
-/*
- * bitmap_set_ll - set the specified number of bits at the specified position
+
+/**
+ * get_boundary - verify that an allocation effectively
+ * starts at the given address, then measure its length.
* @map: pointer to a bitmap
- * @start: a bit position in @map
- * @nr: number of bits to set
+ * @start_entry: the index of the first entry in the bitmap
+ * @nentries: number of entries to alter
*
- * Set @nr bits start from @start in @map lock-lessly. Several users
- * can set/clear the same bitmap simultaneously without lock. If two
- * users set the same bit, one user will return remain bits, otherwise
- * return 0.
+ * Returns the length of an allocation, otherwise -EINVAL if the
+ * parameters do not refer to a correct allocation.
*/
-static int bitmap_set_ll(unsigned long *map, int start, int nr)
+static int get_boundary(unsigned long *map, int start_entry, int nentries)
{
- unsigned long *p = map + BIT_WORD(start);
- const int size = start + nr;
- int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
- unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
-
- while (nr - bits_to_set >= 0) {
- if (set_bits_ll(p, mask_to_set))
- return nr;
- nr -= bits_to_set;
- bits_to_set = BITS_PER_LONG;
- mask_to_set = ~0UL;
- p++;
- }
- if (nr) {
- mask_to_set &= BITMAP_LAST_WORD_MASK(size);
- if (set_bits_ll(p, mask_to_set))
- return nr;
- }
+ int i;
+ unsigned long bitmap_entry;
- return 0;
+
+ if (unlikely(get_bitmap_entry(map, start_entry) != ENTRY_HEAD))
+ return -EINVAL;
+ for (i = start_entry + 1; i < nentries; i++) {
+ bitmap_entry = get_bitmap_entry(map, i);
+ if (bitmap_entry == ENTRY_HEAD ||
+ bitmap_entry == ENTRY_UNUSED)
+ return i;
+ }
+ return nentries - start_entry;
}
+
+#define SET_BITS 1
+#define CLEAR_BITS 0
+
/*
- * bitmap_clear_ll - clear the specified number of bits at the specified position
+ * alter_bitmap_ll - set or clear the entries associated to an allocation
+ * @alteration: selection if the bits selected should be set or cleared
* @map: pointer to a bitmap
- * @start: a bit position in @map
- * @nr: number of bits to set
+ * @start: the index of the first entry in the bitmap
+ * @nentries: number of entries to alter
*
- * Clear @nr bits start from @start in @map lock-lessly. Several users
- * can set/clear the same bitmap simultaneously without lock. If two
- * users clear the same bit, one user will return remain bits,
- * otherwise return 0.
+ * The modification happens lock-lessly.
+ * Several users can write to the same map simultaneously, without lock.
+ * If two users alter the same bit, one user will return remaining
+ * entries, otherwise return 0.
*/
-static int bitmap_clear_ll(unsigned long *map, int start, int nr)
+static int alter_bitmap_ll(bool alteration, unsigned long *map,
+ int start_entry, int nentries)
{
- unsigned long *p = map + BIT_WORD(start);
- const int size = start + nr;
- int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
- unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
-
- while (nr - bits_to_clear >= 0) {
- if (clear_bits_ll(p, mask_to_clear))
- return nr;
- nr -= bits_to_clear;
- bits_to_clear = BITS_PER_LONG;
- mask_to_clear = ~0UL;
- p++;
- }
- if (nr) {
- mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
- if (clear_bits_ll(p, mask_to_clear))
- return nr;
+ unsigned long start_bit;
+ unsigned long end_bit;
+ unsigned long mask;
+ unsigned long value;
+ int nbits;
+ int bits_to_write;
+ int index;
+ int (*action)(unsigned long *addr,
+ unsigned long mask, unsigned long value);
+
+ action = (alteration == SET_BITS) ? set_bits_ll : clear_bits_ll;
+
+ /* Prepare for writing the initial part of the allocation, from
+ * starting entry, to the end of the UL bitmap element which
+ * contains it. It might be larger than the actual allocation.
+ */
+ start_bit = ENTRIES_TO_BITS(start_entry);
+ end_bit = ENTRIES_TO_BITS(start_entry + nentries);
+ nbits = ENTRIES_TO_BITS(nentries);
+ bits_to_write = BITS_PER_LONG - start_bit % BITS_PER_LONG;
+ mask = BITMAP_FIRST_WORD_MASK(start_bit);
+ /* Mark the beginning of the allocation. */
+ value = MASK | (1UL << (start_bit % BITS_PER_LONG));
+ index = BITS_DIV_LONGS(start_bit);
+
+ /* Writes entries to the bitmap, as long as the reminder is
+ * positive or zero.
+ * Might be skipped if the entries to write do not reach the end
+ * of a bitmap UL unit.
+ */
+ while (nbits >= bits_to_write) {
+ if (action(map + index, mask, value & mask))
+ return BITS_DIV_ENTRIES(nbits);
+ nbits -= bits_to_write;
+ bits_to_write = BITS_PER_LONG;
+ mask = ~0UL;
+ value = MASK;
+ index++;
}
+ /* Takes care of the ending part of the entries to mark. */
+ if (nbits > 0) {
+ mask ^= BITMAP_FIRST_WORD_MASK((end_bit) % BITS_PER_LONG);
+ bits_to_write = nbits;
+ if (action(map + index, mask, value & mask))
+ return BITS_DIV_ENTRIES(nbits);
+ }
return 0;
}
+
/**
* gen_pool_create - create a new special memory pool
- * @min_alloc_order: log base 2 of number of bytes each bitmap bit represents
+ * @min_alloc_order: log base 2 of number of bytes each bitmap entry represents
* @nid: node id of the node the pool structure should be allocated on, or -1
*
* Create a new special memory pool that can be used to manage special purpose
@@ -183,10 +290,12 @@ int gen_pool_add_virt(struct gen_pool *pool, unsigned long virt, phys_addr_t phy
size_t size, int nid)
{
struct gen_pool_chunk *chunk;
- int nbits = size >> pool->min_alloc_order;
- int nbytes = sizeof(struct gen_pool_chunk) +
- BITS_TO_LONGS(nbits) * sizeof(long);
+ int nentries;
+ int nbytes;
+ nentries = size >> pool->min_alloc_order;
+ nbytes = sizeof(struct gen_pool_chunk) +
+ ENTRIES_DIV_LONGS(nentries) * sizeof(long);
chunk = kzalloc_node(nbytes, GFP_KERNEL, nid);
if (unlikely(chunk == NULL))
return -ENOMEM;
@@ -248,7 +357,7 @@ void gen_pool_destroy(struct gen_pool *pool)
list_del(&chunk->next_chunk);
end_bit = chunk_size(chunk) >> order;
- bit = find_next_bit(chunk->bits, end_bit, 0);
+ bit = find_next_bit(chunk->entries, end_bit, 0);
BUG_ON(bit < end_bit);
kfree(chunk);
@@ -292,7 +401,7 @@ unsigned long gen_pool_alloc_algo(struct gen_pool *pool, size_t size,
struct gen_pool_chunk *chunk;
unsigned long addr = 0;
int order = pool->min_alloc_order;
- int nbits, start_bit, end_bit, remain;
+ int nentries, start_entry, end_entry, remain;
#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
BUG_ON(in_nmi());
@@ -301,29 +410,32 @@ unsigned long gen_pool_alloc_algo(struct gen_pool *pool, size_t size,
if (size == 0)
return 0;
- nbits = (size + (1UL << order) - 1) >> order;
+ nentries = mem_to_units(size, order);
rcu_read_lock();
list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk) {
if (size > atomic_read(&chunk->avail))
continue;
- start_bit = 0;
- end_bit = chunk_size(chunk) >> order;
+ start_entry = 0;
+ end_entry = chunk_size(chunk) >> order;
retry:
- start_bit = algo(chunk->bits, end_bit, start_bit,
- nbits, data, pool);
- if (start_bit >= end_bit)
+ start_entry = algo(chunk->entries, end_entry, start_entry,
+ nentries, data, pool);
+ if (start_entry >= end_entry)
continue;
- remain = bitmap_set_ll(chunk->bits, start_bit, nbits);
+ remain = alter_bitmap_ll(SET_BITS, chunk->entries,
+ start_entry, nentries);
if (remain) {
- remain = bitmap_clear_ll(chunk->bits, start_bit,
- nbits - remain);
- BUG_ON(remain);
+ remain = alter_bitmap_ll(CLEAR_BITS,
+ chunk->entries,
+ start_entry,
+ nentries - remain);
goto retry;
}
- addr = chunk->start_addr + ((unsigned long)start_bit << order);
- size = nbits << order;
+ addr = chunk->start_addr +
+ ((unsigned long)start_entry << order);
+ size = nentries << order;
atomic_sub(size, &chunk->avail);
break;
}
@@ -365,7 +477,7 @@ EXPORT_SYMBOL(gen_pool_dma_alloc);
* gen_pool_free - free allocated special memory back to the pool
* @pool: pool to free to
* @addr: starting address of memory to free back to pool
- * @size: size in bytes of memory to free
+ * @size: size in bytes of memory to free or 0, for auto-detection
*
* Free previously allocated special memory back to the specified
* pool. Can not be used in NMI handler on architectures without
@@ -375,22 +487,29 @@ void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size)
{
struct gen_pool_chunk *chunk;
int order = pool->min_alloc_order;
- int start_bit, nbits, remain;
+ int start_entry, remaining_entries, nentries, remain;
+ int boundary;
#ifndef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
BUG_ON(in_nmi());
#endif
- nbits = (size + (1UL << order) - 1) >> order;
rcu_read_lock();
list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk) {
if (addr >= chunk->start_addr && addr <= chunk->end_addr) {
BUG_ON(addr + size - 1 > chunk->end_addr);
- start_bit = (addr - chunk->start_addr) >> order;
- remain = bitmap_clear_ll(chunk->bits, start_bit, nbits);
+ start_entry = (addr - chunk->start_addr) >> order;
+ remaining_entries = (chunk->end_addr - addr) >> order;
+ boundary = get_boundary(chunk->entries, start_entry,
+ remaining_entries);
+ BUG_ON(boundary < 0);
+ nentries = boundary - start_entry;
+ BUG_ON(size &&
+ (nentries != mem_to_units(size, order)));
+ remain = alter_bitmap_ll(CLEAR_BITS, chunk->entries,
+ start_entry, nentries);
BUG_ON(remain);
- size = nbits << order;
- atomic_add(size, &chunk->avail);
+ atomic_add(nentries << order, &chunk->avail);
rcu_read_unlock();
return;
}
@@ -517,9 +636,9 @@ EXPORT_SYMBOL(gen_pool_set_algo);
* gen_pool_first_fit - find the first available region
* of memory matching the size requirement (no alignment constraint)
* @map: The address to base the search on
- * @size: The bitmap size in bits
- * @start: The bitnumber to start searching at
- * @nr: The number of zeroed bits we're looking for
+ * @size: The number of allocation units in the bitmap
+ * @start: The allocation unit to start searching at
+ * @nr: The number of allocation units we're looking for
* @data: additional data - unused
* @pool: pool to find the fit region memory from
*/
@@ -527,7 +646,15 @@ unsigned long gen_pool_first_fit(unsigned long *map, unsigned long size,
unsigned long start, unsigned int nr, void *data,
struct gen_pool *pool)
{
- return bitmap_find_next_zero_area(map, size, start, nr, 0);
+ unsigned long align_mask;
+ unsigned long bit_index;
+
+ align_mask = roundup_pow_of_two(BITS_PER_ENTRY) - 1;
+ bit_index = bitmap_find_next_zero_area(map, ENTRIES_TO_BITS(size),
+ ENTRIES_TO_BITS(start),
+ ENTRIES_TO_BITS(nr),
+ align_mask);
+ return BITS_DIV_ENTRIES(bit_index);
}
EXPORT_SYMBOL(gen_pool_first_fit);
@@ -535,9 +662,9 @@ EXPORT_SYMBOL(gen_pool_first_fit);
* gen_pool_first_fit_align - find the first available region
* of memory matching the size requirement (alignment constraint)
* @map: The address to base the search on
- * @size: The bitmap size in bits
- * @start: The bitnumber to start searching at
- * @nr: The number of zeroed bits we're looking for
+ * @size: The number of allocation units in the bitmap
+ * @start: The allocation unit to start searching at
+ * @nr: The number of allocation units we're looking for
* @data: data for alignment
* @pool: pool to get order from
*/
@@ -547,21 +674,28 @@ unsigned long gen_pool_first_fit_align(unsigned long *map, unsigned long size,
{
struct genpool_data_align *alignment;
unsigned long align_mask;
+ unsigned long bit_index;
int order;
alignment = data;
order = pool->min_alloc_order;
- align_mask = ((alignment->align + (1UL << order) - 1) >> order) - 1;
- return bitmap_find_next_zero_area(map, size, start, nr, align_mask);
+ align_mask = roundup_pow_of_two(
+ ENTRIES_TO_BITS(mem_to_units(alignment->align,
+ order))) - 1;
+ bit_index = bitmap_find_next_zero_area(map, ENTRIES_TO_BITS(size),
+ ENTRIES_TO_BITS(start),
+ ENTRIES_TO_BITS(nr),
+ align_mask);
+ return BITS_DIV_ENTRIES(bit_index);
}
EXPORT_SYMBOL(gen_pool_first_fit_align);
/**
* gen_pool_fixed_alloc - reserve a specific region
* @map: The address to base the search on
- * @size: The bitmap size in bits
- * @start: The bitnumber to start searching at
- * @nr: The number of zeroed bits we're looking for
+ * @size: The number of allocation units in the bitmap
+ * @start: The allocation unit to start searching at
+ * @nr: The number of allocation units we're looking for
* @data: data for alignment
* @pool: pool to get order from
*/
@@ -571,20 +705,23 @@ unsigned long gen_pool_fixed_alloc(unsigned long *map, unsigned long size,
{
struct genpool_data_fixed *fixed_data;
int order;
- unsigned long offset_bit;
- unsigned long start_bit;
+ unsigned long offset;
+ unsigned long align_mask;
+ unsigned long bit_index;
fixed_data = data;
order = pool->min_alloc_order;
- offset_bit = fixed_data->offset >> order;
if (WARN_ON(fixed_data->offset & ((1UL << order) - 1)))
return size;
+ offset = fixed_data->offset >> order;
+ align_mask = roundup_pow_of_two(BITS_PER_ENTRY) - 1;
+ bit_index = bitmap_find_next_zero_area(map, ENTRIES_TO_BITS(size),
+ ENTRIES_TO_BITS(start + offset),
+ ENTRIES_TO_BITS(nr), align_mask);
+ if (bit_index != ENTRIES_TO_BITS(offset))
+ return size;
- start_bit = bitmap_find_next_zero_area(map, size,
- start + offset_bit, nr, 0);
- if (start_bit != offset_bit)
- start_bit = size;
- return start_bit;
+ return BITS_DIV_ENTRIES(bit_index);
}
EXPORT_SYMBOL(gen_pool_fixed_alloc);
@@ -593,9 +730,9 @@ EXPORT_SYMBOL(gen_pool_fixed_alloc);
* of memory matching the size requirement. The region will be aligned
* to the order of the size specified.
* @map: The address to base the search on
- * @size: The bitmap size in bits
- * @start: The bitnumber to start searching at
- * @nr: The number of zeroed bits we're looking for
+ * @size: The number of allocation units in the bitmap
+ * @start: The allocation unit to start searching at
+ * @nr: The number of allocation units we're looking for
* @data: additional data - unused
* @pool: pool to find the fit region memory from
*/
@@ -603,9 +740,15 @@ unsigned long gen_pool_first_fit_order_align(unsigned long *map,
unsigned long size, unsigned long start,
unsigned int nr, void *data, struct gen_pool *pool)
{
- unsigned long align_mask = roundup_pow_of_two(nr) - 1;
-
- return bitmap_find_next_zero_area(map, size, start, nr, align_mask);
+ unsigned long align_mask;
+ unsigned long bit_index;
+
+ align_mask = roundup_pow_of_two(ENTRIES_TO_BITS(nr)) - 1;
+ bit_index = bitmap_find_next_zero_area(map, ENTRIES_TO_BITS(size),
+ ENTRIES_TO_BITS(start),
+ ENTRIES_TO_BITS(nr),
+ align_mask);
+ return BITS_DIV_ENTRIES(bit_index);
}
EXPORT_SYMBOL(gen_pool_first_fit_order_align);
@@ -613,9 +756,9 @@ EXPORT_SYMBOL(gen_pool_first_fit_order_align);
* gen_pool_best_fit - find the best fitting region of memory
* macthing the size requirement (no alignment constraint)
* @map: The address to base the search on
- * @size: The bitmap size in bits
- * @start: The bitnumber to start searching at
- * @nr: The number of zeroed bits we're looking for
+ * @size: The number of allocation units in the bitmap
+ * @start: The allocation unit to start searching at
+ * @nr: The number of allocation units we're looking for
* @data: additional data - unused
* @pool: pool to find the fit region memory from
*
@@ -626,27 +769,41 @@ unsigned long gen_pool_best_fit(unsigned long *map, unsigned long size,
unsigned long start, unsigned int nr, void *data,
struct gen_pool *pool)
{
- unsigned long start_bit = size;
+ unsigned long start_bit = ENTRIES_TO_BITS(size);
unsigned long len = size + 1;
unsigned long index;
+ unsigned long align_mask;
+ unsigned long bit_index;
- index = bitmap_find_next_zero_area(map, size, start, nr, 0);
+ align_mask = roundup_pow_of_two(BITS_PER_ENTRY) - 1;
+ bit_index = bitmap_find_next_zero_area(map, ENTRIES_TO_BITS(size),
+ ENTRIES_TO_BITS(start),
+ ENTRIES_TO_BITS(nr),
+ align_mask);
+ index = BITS_DIV_ENTRIES(bit_index);
while (index < size) {
- int next_bit = find_next_bit(map, size, index + nr);
- if ((next_bit - index) < len) {
- len = next_bit - index;
- start_bit = index;
+ int next_bit;
+
+ next_bit = find_next_bit(map, ENTRIES_TO_BITS(size),
+ ENTRIES_TO_BITS(index + nr));
+ if ((BITS_DIV_ENTRIES(next_bit) - index) < len) {
+ len = BITS_DIV_ENTRIES(next_bit) - index;
+ start_bit = ENTRIES_TO_BITS(index);
if (len == nr)
- return start_bit;
+ return BITS_DIV_ENTRIES(start_bit);
}
- index = bitmap_find_next_zero_area(map, size,
- next_bit + 1, nr, 0);
+ bit_index =
+ bitmap_find_next_zero_area(map,
+ ENTRIES_TO_BITS(size),
+ next_bit + 1,
+ ENTRIES_TO_BITS(nr),
+ align_mask);
+ index = BITS_DIV_ENTRIES(bit_index);
}
- return start_bit;
+ return BITS_DIV_ENTRIES(start_bit);
}
-EXPORT_SYMBOL(gen_pool_best_fit);
static void devm_gen_pool_release(struct device *dev, void *res)
{
--
2.9.3
Introduce a set of macros for writing concise test cases for genalloc.
The test cases are meant to provide regression testing, when working on
new functionality for genalloc.
Primarily they are meant to confirm that the various allocation strategy
will continue to work as expected.
The execution of the self testing is controlled through a Kconfig option.
Signed-off-by: Igor Stoppa <[email protected]>
---
include/linux/genalloc-selftest.h | 30 +++
init/main.c | 2 +
lib/Kconfig | 15 ++
lib/Makefile | 1 +
lib/genalloc-selftest.c | 402 ++++++++++++++++++++++++++++++++++++++
5 files changed, 450 insertions(+)
create mode 100644 include/linux/genalloc-selftest.h
create mode 100644 lib/genalloc-selftest.c
diff --git a/include/linux/genalloc-selftest.h b/include/linux/genalloc-selftest.h
new file mode 100644
index 0000000..7af1901
--- /dev/null
+++ b/include/linux/genalloc-selftest.h
@@ -0,0 +1,30 @@
+/*
+ * genalloc-selftest.h
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+
+#ifndef __GENALLOC_SELFTEST_H__
+#define __GENALLOC_SELFTEST_H__
+
+
+#ifdef CONFIG_GENERIC_ALLOCATOR_SELFTEST
+
+#include <linux/genalloc.h>
+
+void genalloc_selftest(void);
+
+#else
+
+static inline void genalloc_selftest(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index 0e4d39c..2bdacb9 100644
--- a/init/main.c
+++ b/init/main.c
@@ -87,6 +87,7 @@
#include <linux/io.h>
#include <linux/cache.h>
#include <linux/rodata_test.h>
+#include <linux/genalloc-selftest.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -649,6 +650,7 @@ asmlinkage __visible void __init start_kernel(void)
*/
mem_encrypt_init();
+ genalloc_selftest();
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/lib/Kconfig b/lib/Kconfig
index b1445b2..1c535f4 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -291,6 +291,21 @@ config DECOMPRESS_LZ4
config GENERIC_ALLOCATOR
bool
+config GENERIC_ALLOCATOR_SELFTEST
+ bool "genalloc tester"
+ default n
+ select GENERIC_ALLOCATOR
+ help
+ Enable automated testing of the generic allocator.
+ The testing is primarily for the tracking of allocated space.
+
+config GENERIC_ALLOCATOR_SELFTEST_VERBOSE
+ bool "make the genalloc tester more verbose"
+ default n
+ select GENERIC_ALLOCATOR_SELFTEST
+ help
+ More information will be displayed during the self-testing.
+
#
# reed solomon support is select'ed if needed
#
diff --git a/lib/Makefile b/lib/Makefile
index b8f2c16..ff7ad5f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_LIBCRC32C) += libcrc32c.o
obj-$(CONFIG_CRC8) += crc8.o
obj-$(CONFIG_XXHASH) += xxhash.o
obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
+obj-$(CONFIG_GENERIC_ALLOCATOR_SELFTEST) += genalloc-selftest.o
obj-$(CONFIG_842_COMPRESS) += 842/
obj-$(CONFIG_842_DECOMPRESS) += 842/
diff --git a/lib/genalloc-selftest.c b/lib/genalloc-selftest.c
new file mode 100644
index 0000000..007a0cf
--- /dev/null
+++ b/lib/genalloc-selftest.c
@@ -0,0 +1,402 @@
+/*
+ * genalloc-selftest.c
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/init.h>
+#include <linux/vmalloc.h>
+#include <asm/set_memory.h>
+#include <linux/string.h>
+#include <linux/debugfs.h>
+#include <linux/atomic.h>
+#include <linux/genalloc.h>
+
+
+
+/* Keep the bitmap small, while including case of cross-ulong mapping.
+ * For simplicity, the test cases use only 1 chunk of memory.
+ */
+#define BITMAP_SIZE_C 16
+#define ALLOC_ORDER 0
+
+#define ULONG_SIZE (sizeof(unsigned long))
+#define BITMAP_SIZE_UL (BITMAP_SIZE_C / ULONG_SIZE)
+#define MIN_ALLOC_SIZE (1 << ALLOC_ORDER)
+#define ENTRIES (BITMAP_SIZE_C * 8)
+#define CHUNK_SIZE (MIN_ALLOC_SIZE * ENTRIES)
+
+#ifndef CONFIG_GENERIC_ALLOCATOR_SELFTEST_VERBOSE
+
+static inline void print_first_chunk_bitmap(struct gen_pool *pool) {}
+
+#else
+
+static void print_first_chunk_bitmap(struct gen_pool *pool)
+{
+ struct gen_pool_chunk *chunk;
+ char bitmap[BITMAP_SIZE_C * 2 + 1];
+ unsigned long i;
+ char *bm = bitmap;
+ char *entry;
+
+ if (unlikely(pool == NULL || pool->chunks.next == NULL))
+ return;
+
+ chunk = container_of(pool->chunks.next, struct gen_pool_chunk,
+ next_chunk);
+ entry = (void *)chunk->entries;
+ for (i = 1; i <= BITMAP_SIZE_C; i++)
+ bm += snprintf(bm, 3, "%02hhx", entry[BITMAP_SIZE_C - i]);
+ *bm = '\0';
+ pr_notice("chunk: %p bitmap: 0x%s\n", chunk, bitmap);
+
+}
+
+#endif
+
+enum test_commands {
+ CMD_ALLOCATOR,
+ CMD_ALLOCATE,
+ CMD_FLUSH,
+ CMD_FREE,
+ CMD_NUMBER,
+ CMD_END = CMD_NUMBER,
+};
+
+struct null_struct {
+ void *null;
+};
+
+struct test_allocator {
+ genpool_algo_t algo;
+ union {
+ struct genpool_data_align align;
+ struct genpool_data_fixed offset;
+ struct null_struct null;
+ } data;
+};
+
+struct test_action {
+ unsigned int location;
+ char pattern[BITMAP_SIZE_C];
+ unsigned int size;
+};
+
+
+struct test_command {
+ enum test_commands command;
+ union {
+ struct test_allocator allocator;
+ struct test_action action;
+ };
+};
+
+
+/* To pass an array literal as parameter to a macro, it must go through
+ * this one, first.
+ */
+#define ARR(...) __VA_ARGS__
+
+#define SET_DATA(parameter, value) \
+ .parameter = { \
+ .parameter = value, \
+ } \
+
+#define SET_ALLOCATOR(alloc, parameter, value) \
+{ \
+ .command = CMD_ALLOCATOR, \
+ .allocator = { \
+ .algo = (alloc), \
+ .data = { \
+ SET_DATA(parameter, value), \
+ }, \
+ } \
+}
+
+#define ACTION_MEM(act, mem_size, mem_loc, match) \
+{ \
+ .command = act, \
+ .action = { \
+ .size = (mem_size), \
+ .location = (mem_loc), \
+ .pattern = match, \
+ }, \
+}
+
+#define ALLOCATE_MEM(mem_size, mem_loc, match) \
+ ACTION_MEM(CMD_ALLOCATE, mem_size, mem_loc, ARR(match))
+
+#define FREE_MEM(mem_size, mem_loc, match) \
+ ACTION_MEM(CMD_FREE, mem_size, mem_loc, ARR(match))
+
+#define FLUSH_MEM() \
+{ \
+ .command = CMD_FLUSH, \
+}
+
+#define END() \
+{ \
+ .command = CMD_END, \
+}
+
+static inline int compare_bitmaps(const struct gen_pool *pool,
+ const char *reference)
+{
+ struct gen_pool_chunk *chunk;
+ char *bitmap;
+ unsigned int i;
+
+ chunk = container_of(pool->chunks.next, struct gen_pool_chunk,
+ next_chunk);
+ bitmap = (char *)chunk->entries;
+
+ for (i = 0; i < BITMAP_SIZE_C; i++)
+ if (bitmap[i] != reference[i])
+ return -1;
+ return 0;
+}
+
+static void callback_set_allocator(struct gen_pool *pool,
+ const struct test_command *cmd,
+ unsigned long *locations)
+{
+ gen_pool_set_algo(pool, cmd->allocator.algo,
+ (void *)&cmd->allocator.data);
+}
+
+static void callback_allocate(struct gen_pool *pool,
+ const struct test_command *cmd,
+ unsigned long *locations)
+{
+ const struct test_action *action = &cmd->action;
+
+ locations[action->location] = gen_pool_alloc(pool, action->size);
+ BUG_ON(!locations[action->location]);
+ print_first_chunk_bitmap(pool);
+ BUG_ON(compare_bitmaps(pool, action->pattern));
+}
+
+static void callback_flush(struct gen_pool *pool,
+ const struct test_command *cmd,
+ unsigned long *locations)
+{
+ unsigned int i;
+
+ for (i = 0; i < ENTRIES; i++)
+ if (locations[i]) {
+ gen_pool_free(pool, locations[i], 0);
+ locations[i] = 0;
+ }
+}
+
+static void callback_free(struct gen_pool *pool,
+ const struct test_command *cmd,
+ unsigned long *locations)
+{
+ const struct test_action *action = &cmd->action;
+
+ gen_pool_free(pool, locations[action->location], 0);
+ locations[action->location] = 0;
+ print_first_chunk_bitmap(pool);
+ BUG_ON(compare_bitmaps(pool, action->pattern));
+}
+
+static void (* const callbacks[CMD_NUMBER])(struct gen_pool *,
+ const struct test_command *,
+ unsigned long *) = {
+ [CMD_ALLOCATOR] = callback_set_allocator,
+ [CMD_ALLOCATE] = callback_allocate,
+ [CMD_FREE] = callback_free,
+ [CMD_FLUSH] = callback_flush,
+};
+
+const struct test_command test_first_fit[] = {
+ SET_ALLOCATOR(gen_pool_first_fit, null, NULL),
+ ALLOCATE_MEM(3, 0, ARR({0x2b})),
+ ALLOCATE_MEM(2, 1, ARR({0xeb, 0x02})),
+ ALLOCATE_MEM(5, 2, ARR({0xeb, 0xae, 0x0a})),
+ FREE_MEM(2, 1, ARR({0x2b, 0xac, 0x0a})),
+ ALLOCATE_MEM(1, 1, ARR({0xeb, 0xac, 0x0a})),
+ FREE_MEM(0, 2, ARR({0xeb})),
+ FREE_MEM(0, 0, ARR({0xc0})),
+ FREE_MEM(0, 1, ARR({0x00})),
+ END(),
+};
+
+/* To make the test work for both 32bit and 64bit ulong sizes,
+ * allocate (8 / 2 * 4 - 1) = 15 bytes bytes, then 16, then 2.
+ * The first allocation prepares for the crossing of the 32bit ulong
+ * threshold. The following crosses the 32bit threshold and prepares for
+ * crossing the 64bit thresholds. The last is large enough (2 bytes) to
+ * cross the 64bit threshold.
+ * Then free the allocations in the order: 2nd, 1st, 3rd.
+ */
+const struct test_command test_ulong_span[] = {
+ SET_ALLOCATOR(gen_pool_first_fit, null, NULL),
+ ALLOCATE_MEM(15, 0, ARR({0xab, 0xaa, 0xaa, 0x2a})),
+ ALLOCATE_MEM(16, 1, ARR({0xab, 0xaa, 0xaa, 0xea,
+ 0xaa, 0xaa, 0xaa, 0x2a})),
+ ALLOCATE_MEM(2, 2, ARR({0xab, 0xaa, 0xaa, 0xea,
+ 0xaa, 0xaa, 0xaa, 0xea,
+ 0x02})),
+ FREE_MEM(0, 1, ARR({0xab, 0xaa, 0xaa, 0x2a,
+ 0x00, 0x00, 0x00, 0xc0,
+ 0x02})),
+ FREE_MEM(0, 0, ARR({0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0xc0,
+ 0x02})),
+ FREE_MEM(0, 2, ARR({0x00})),
+ END(),
+};
+
+/* Create progressively smaller allocations A B C D E.
+ * then free B and D.
+ * Then create new allocation that would fit in both of the gaps left by
+ * B and D. Verify that it uses the gap from B.
+ */
+const struct test_command test_first_fit_gaps[] = {
+ SET_ALLOCATOR(gen_pool_first_fit, null, NULL),
+ ALLOCATE_MEM(10, 0, ARR({0xab, 0xaa, 0x0a})),
+ ALLOCATE_MEM(8, 1, ARR({0xab, 0xaa, 0xba, 0xaa,
+ 0x0a})),
+ ALLOCATE_MEM(6, 2, ARR({0xab, 0xaa, 0xba, 0xaa,
+ 0xba, 0xaa})),
+ ALLOCATE_MEM(4, 3, ARR({0xab, 0xaa, 0xba, 0xaa,
+ 0xba, 0xaa, 0xab})),
+ ALLOCATE_MEM(2, 4, ARR({0xab, 0xaa, 0xba, 0xaa,
+ 0xba, 0xaa, 0xab, 0x0b})),
+ FREE_MEM(0, 1, ARR({0xab, 0xaa, 0x0a, 0x00,
+ 0xb0, 0xaa, 0xab, 0x0b})),
+ FREE_MEM(0, 3, ARR({0xab, 0xaa, 0x0a, 0x00,
+ 0xb0, 0xaa, 0x00, 0x0b})),
+ ALLOCATE_MEM(3, 3, ARR({0xab, 0xaa, 0xba, 0x02,
+ 0xb0, 0xaa, 0x00, 0x0b})),
+ FLUSH_MEM(),
+ END(),
+};
+
+/* Test first fit align */
+const struct test_command test_first_fit_align[] = {
+ SET_ALLOCATOR(gen_pool_first_fit_align, align, 4),
+ ALLOCATE_MEM(5, 0, ARR({0xab, 0x02})),
+ ALLOCATE_MEM(3, 1, ARR({0xab, 0x02, 0x2b})),
+ ALLOCATE_MEM(2, 2, ARR({0xab, 0x02, 0x2b, 0x0b})),
+ ALLOCATE_MEM(1, 3, ARR({0xab, 0x02, 0x2b, 0x0b, 0x03})),
+ FREE_MEM(0, 0, ARR({0x00, 0x00, 0x2b, 0x0b, 0x03})),
+ FREE_MEM(0, 2, ARR({0x00, 0x00, 0x2b, 0x00, 0x03})),
+ ALLOCATE_MEM(2, 0, ARR({0x0b, 0x00, 0x2b, 0x00, 0x03})),
+ FLUSH_MEM(),
+ END(),
+};
+
+
+/* Test fixed alloc */
+const struct test_command test_fixed_data[] = {
+ SET_ALLOCATOR(gen_pool_fixed_alloc, offset, 1),
+ ALLOCATE_MEM(5, 0, ARR({0xac, 0x0a})),
+ SET_ALLOCATOR(gen_pool_fixed_alloc, offset, 8),
+ ALLOCATE_MEM(3, 1, ARR({0xac, 0x0a, 0x2b})),
+ SET_ALLOCATOR(gen_pool_fixed_alloc, offset, 6),
+ ALLOCATE_MEM(2, 2, ARR({0xac, 0xba, 0x2b})),
+ SET_ALLOCATOR(gen_pool_fixed_alloc, offset, 30),
+ ALLOCATE_MEM(40, 3, ARR({0xac, 0xba, 0x2b, 0x00,
+ 0x00, 0x00, 0x00, 0xb0,
+ 0xaa, 0xaa, 0xaa, 0xaa,
+ 0xaa, 0xaa, 0xaa, 0xaa})),
+ FLUSH_MEM(),
+ END(),
+};
+
+
+/* Test first fit order align */
+const struct test_command test_first_fit_order_align[] = {
+ SET_ALLOCATOR(gen_pool_first_fit_order_align, null, NULL),
+ ALLOCATE_MEM(5, 0, ARR({0xab, 0x02})),
+ ALLOCATE_MEM(3, 1, ARR({0xab, 0x02, 0x2b})),
+ ALLOCATE_MEM(2, 2, ARR({0xab, 0xb2, 0x2b})),
+ ALLOCATE_MEM(1, 3, ARR({0xab, 0xbe, 0x2b})),
+ ALLOCATE_MEM(1, 4, ARR({0xab, 0xbe, 0xeb})),
+ ALLOCATE_MEM(2, 5, ARR({0xab, 0xbe, 0xeb, 0x0b})),
+ FLUSH_MEM(),
+ END(),
+};
+
+
+/* 007 Test best fit */
+const struct test_command test_best_fit[] = {
+ SET_ALLOCATOR(gen_pool_best_fit, null, NULL),
+ ALLOCATE_MEM(5, 0, ARR({0xab, 0x02})),
+ ALLOCATE_MEM(3, 1, ARR({0xab, 0xae})),
+ ALLOCATE_MEM(3, 2, ARR({0xab, 0xae, 0x2b})),
+ ALLOCATE_MEM(1, 3, ARR({0xab, 0xae, 0xeb})),
+ FREE_MEM(0, 0, ARR({0x00, 0xac, 0xeb})),
+ FREE_MEM(0, 2, ARR({0x00, 0xac, 0xc0})),
+ ALLOCATE_MEM(2, 0, ARR({0x00, 0xac, 0xcb})),
+ FLUSH_MEM(),
+ END(),
+};
+
+
+enum test_cases_indexes {
+ TEST_CASE_FIRST_FIT,
+ TEST_CASE_ULONG_SPAN,
+ TEST_CASE_FIRST_FIT_GAPS,
+ TEST_CASE_FIRST_FIT_ALIGN,
+ TEST_CASE_FIXED_DATA,
+ TEST_CASE_FIRST_FIT_ORDER_ALIGN,
+ TEST_CASE_BEST_FIT,
+ TEST_CASES_NUM,
+};
+
+const struct test_command *test_cases[TEST_CASES_NUM] = {
+ [TEST_CASE_FIRST_FIT] = test_first_fit,
+ [TEST_CASE_ULONG_SPAN] = test_ulong_span,
+ [TEST_CASE_FIRST_FIT_GAPS] = test_first_fit_gaps,
+ [TEST_CASE_FIRST_FIT_ALIGN] = test_first_fit_align,
+ [TEST_CASE_FIXED_DATA] = test_fixed_data,
+ [TEST_CASE_FIRST_FIT_ORDER_ALIGN] = test_first_fit_order_align,
+ [TEST_CASE_BEST_FIT] = test_best_fit,
+};
+
+
+void genalloc_selftest(void)
+{
+ static struct gen_pool *pool;
+ unsigned long locations[ENTRIES];
+ char chunk[CHUNK_SIZE];
+ int retval;
+ unsigned int i;
+ const struct test_command *cmd;
+
+ pool = gen_pool_create(ALLOC_ORDER, -1);
+ if (unlikely(!pool)) {
+ pr_err("genalloc-selftest: no memory for pool.");
+ return;
+ }
+
+ retval = gen_pool_add_virt(pool, (unsigned long)chunk, 0,
+ CHUNK_SIZE, -1);
+ if (unlikely(retval)) {
+ pr_err("genalloc-selftest: could not register chunk.");
+ goto destroy_pool;
+ }
+
+ memset(locations, 0, ENTRIES * sizeof(unsigned long));
+ for (i = 0; i < TEST_CASES_NUM; i++)
+ for (cmd = test_cases[i]; cmd->command < CMD_END; cmd++)
+ callbacks[cmd->command](pool, cmd, locations);
+ pr_notice("genalloc-selftest: executed successfully %d tests",
+ TEST_CASES_NUM);
+
+destroy_pool:
+ gen_pool_destroy(pool);
+}
--
2.9.3
When a page is used for virtual memory, it is often necessary to obtian
a handler to the corresponding vm_struct, which refers to the virtually
continuous area generated when invoking vmalloc.
The struct page has a "mapping" field, which can be re-used, to store a
pointer to the parent area. This will avoid more expensive searches.
As example, the function find_vm_area is reimplemented, to take advantage
of the newly introduced field.
Signed-off-by: Igor Stoppa <[email protected]>
---
include/linux/mm_types.h | 1 +
mm/vmalloc.c | 18 +++++++++++++-----
2 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b..c3a4825 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -84,6 +84,7 @@ struct page {
void *s_mem; /* slab first object */
atomic_t compound_mapcount; /* first tail page */
/* page_deferred_list().next -- second tail page */
+ struct vm_struct *area;
};
/* Second double word */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6739420..44c5dfc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1466,13 +1466,16 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
*/
struct vm_struct *find_vm_area(const void *addr)
{
- struct vmap_area *va;
+ struct page *page;
- va = find_vmap_area((unsigned long)addr);
- if (va && va->flags & VM_VM_AREA)
- return va->vm;
+ if (unlikely(!is_vmalloc_addr(addr)))
+ return NULL;
- return NULL;
+ page = vmalloc_to_page(addr);
+ if (unlikely(!page))
+ return NULL;
+
+ return page->area;
}
/**
@@ -1536,6 +1539,7 @@ static void __vunmap(const void *addr, int deallocate_pages)
struct page *page = area->pages[i];
BUG_ON(!page);
+ page->area = NULL;
__free_pages(page, 0);
}
@@ -1744,6 +1748,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
const void *caller)
{
struct vm_struct *area;
+ unsigned int page_counter;
void *addr;
unsigned long real_size = size;
@@ -1769,6 +1774,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
kmemleak_vmalloc(area, size, gfp_mask);
+ for (page_counter = 0; page_counter < area->nr_pages; page_counter++)
+ area->pages[page_counter]->area = area;
+
return addr;
fail:
--
2.9.3
Detailed documentation about the protectable memory allocator.
Signed-off-by: Igor Stoppa <[email protected]>
---
Documentation/core-api/pmalloc.txt | 104 +++++++++++++++++++++++++++++++++++++
1 file changed, 104 insertions(+)
create mode 100644 Documentation/core-api/pmalloc.txt
diff --git a/Documentation/core-api/pmalloc.txt b/Documentation/core-api/pmalloc.txt
new file mode 100644
index 0000000..9c39672
--- /dev/null
+++ b/Documentation/core-api/pmalloc.txt
@@ -0,0 +1,104 @@
+============================
+Protectable memory allocator
+============================
+
+Introduction
+------------
+
+When trying to perform an attack toward a system, the attacker typically
+wants to alter the execution flow, in a way that allows actions which
+would otherwise be forbidden.
+
+In recent years there has been lots of effort in preventing the execution
+of arbitrary code, so the attacker is progressively pushed to look for
+alternatives.
+
+If code changes are either detected or even prevented, what is left is to
+alter kernel data.
+
+As countermeasure, constant data is collected in a section which is then
+marked as readonly.
+To expand on this, also statically allocated variables which are tagged
+as __ro_after_init will receive a similar treatment.
+The difference from constant data is that such variables can be still
+altered freely during the kernel init phase.
+
+However, such solution does not address those variables which could be
+treated essentially as read-only, but whose size is not known at compile
+time or cannot be fully initialized during the init phase.
+
+
+Design
+------
+
+pmalloc builds on top of genalloc, using the same concept of memory pools
+A pool is a handle to a group of chunks of memory of various sizes.
+When created, a pool is empty. It will be populated by allocating chunks
+of memory, either when the first memory allocation request is received, or
+when a pre-allocation is performed.
+
+Either way, one or more memory pages will be obtaiend from vmalloc and
+registered in the pool as chunk. Subsequent requests will be satisfied by
+either using any available free space from the current chunks, or by
+allocating more vmalloc pages, should the current free space not suffice.
+
+This is the key point of pmalloc: it groups data that must be protected
+into a set of pages. The protection is performed through the mmu, which
+is a prerequisite and has a minimum granularity of one page.
+
+If the relevant variables were not grouped, there would be a problem of
+allowing writes to other variables that might happen to share the same
+page, but require further alterations over time.
+
+A pool is a group of pages that are write protected at the same time.
+Ideally, they have some high level correlation (ex: they belong to the
+same module), which justifies write protecting them all together.
+
+To keep it to a minimum, locking is left to the user of the API, in
+those cases where it's not strictly needed.
+Ideally, no further locking is required, since each module can have own
+pool (or pools), which should, for example, avoid the need for cross
+module or cross thread synchronization about write protecting a pool.
+
+The overhead of creating an additional pool is minimal: a handful of bytes
+from kmalloc space for the metadata and then what is left unused from the
+page(s) registered as chunks.
+
+Compared to plain use of vmalloc, genalloc has the advantage of tightly
+packing the allocations, reducing the number of pages used and therefore
+the pressure on the TLB. The slight overhead in execution time of the
+allocation should be mostly irrelevant, because pmalloc memory is not
+meant to be allocated/freed in tight loops. Rather it ought to be taken
+in use, initialized and write protected. Possibly destroyed.
+
+Considering that not much data is supposed to be dynamically allocated
+and then marked as read-only, it shouldn't be an issue that the address
+range for pmalloc is limited, on 32-bit systemd.
+
+Regarding SMP systems, the allocations are expected to happen mostly
+during an initial transient, after which there should be no more need to
+perform cross-processor synchronizations of page tables.
+
+
+Use
+---
+
+The typical sequence, when using pmalloc, is:
+
+1. create a pool
+2. [optional] pre-allocate some memory in the pool
+3. issue one or more allocation requests to the pool
+4. initialize the memory obtained
+ - iterate over points 3 & 4 as needed -
+5. write protect the pool
+6. use in read-only mode the handlers obtained throguh the allocations
+7. [optional] destroy the pool
+
+
+In a scenario where, for example due to some error, part or all of the
+allocations performed at point 3 must be reverted, it is possible to free
+them, as long as point 5 has not been executed, and the pool is still
+modifiable. Such freed memory can be re-used.
+Performing a free operation on a write-protected pool will, instead,
+simply release the corresponding memory from the accounting, but it will
+be still impossible to alter its content.
--
2.9.3
Add basic self-test functionality for pmalloc.
Signed-off-by: Igor Stoppa <[email protected]>
---
mm/Kconfig | 7 ++++++
mm/Makefile | 1 +
mm/pmalloc-selftest.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
mm/pmalloc-selftest.h | 30 ++++++++++++++++++++++++
mm/pmalloc.c | 3 +++
5 files changed, 106 insertions(+)
create mode 100644 mm/pmalloc-selftest.c
create mode 100644 mm/pmalloc-selftest.h
diff --git a/mm/Kconfig b/mm/Kconfig
index c782e8f..1de6ea6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -760,3 +760,10 @@ config GUP_BENCHMARK
performance of get_user_pages_fast().
See tools/testing/selftests/vm/gup_benchmark.c
+
+config PROTECTABLE_MEMORY_SELFTEST
+ bool "Run self test for pmalloc memory allocator"
+ default n
+ help
+ Tries to verify that pmalloc works correctly and that the memory
+ is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index a6a47e1..1e76a9b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pmalloc.o
+obj-$(CONFIG_PROTECTABLE_MEMORY_SELFTEST) += pmalloc-selftest.o
obj-$(CONFIG_KSM) += ksm.o
obj-$(CONFIG_PAGE_POISONING) += page_poison.o
obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/pmalloc-selftest.c b/mm/pmalloc-selftest.c
new file mode 100644
index 0000000..1c025f3
--- /dev/null
+++ b/mm/pmalloc-selftest.c
@@ -0,0 +1,65 @@
+/*
+ * pmalloc-selftest.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/pmalloc.h>
+#include <linux/mm.h>
+
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+#define validate_alloc(expected, variable, size) \
+ pr_notice("must be " expected ": %s", \
+ is_pmalloc_object(variable, size) > 0 ? "ok" : "no")
+
+#define is_alloc_ok(variable, size) \
+ validate_alloc("ok", variable, size)
+
+#define is_alloc_no(variable, size) \
+ validate_alloc("no", variable, size)
+
+void pmalloc_selftest(void)
+{
+ struct gen_pool *pool_unprot;
+ struct gen_pool *pool_prot;
+ void *var_prot, *var_unprot, *var_vmall;
+
+ pr_notice("pmalloc self-test");
+ pool_unprot = pmalloc_create_pool("unprotected", 0);
+ pool_prot = pmalloc_create_pool("protected", 0);
+ BUG_ON(!(pool_unprot && pool_prot));
+
+ var_unprot = pmalloc(pool_unprot, SIZE_1 - 1, GFP_KERNEL);
+ var_prot = pmalloc(pool_prot, SIZE_1, GFP_KERNEL);
+ var_vmall = vmalloc(SIZE_2);
+ is_alloc_ok(var_unprot, 10);
+ is_alloc_ok(var_unprot, SIZE_1);
+ is_alloc_ok(var_unprot, PAGE_SIZE);
+ is_alloc_no(var_unprot, SIZE_1 + 1);
+ is_alloc_no(var_vmall, 10);
+
+
+ pfree(pool_unprot, var_unprot);
+ vfree(var_vmall);
+
+ pmalloc_protect_pool(pool_prot);
+
+ /* This will intentionally trigger a WARN because the pool being
+ * destroyed is not protected, which is unusual and should happen
+ * on error paths only, where probably other warnings are already
+ * displayed.
+ */
+ pmalloc_destroy_pool(pool_unprot);
+
+ /* This must not cause WARNings */
+ pmalloc_destroy_pool(pool_prot);
+}
diff --git a/mm/pmalloc-selftest.h b/mm/pmalloc-selftest.h
new file mode 100644
index 0000000..3673d23
--- /dev/null
+++ b/mm/pmalloc-selftest.h
@@ -0,0 +1,30 @@
+/*
+ * pmalloc-selftest.h
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+
+#ifndef __PMALLOC_SELFTEST_H__
+#define __PMALLOC_SELFTEST_H__
+
+
+#ifdef CONFIG_PROTECTABLE_MEMORY_SELFTEST
+
+#include <linux/pmalloc.h>
+
+void pmalloc_selftest(void);
+
+#else
+
+static inline void pmalloc_selftest(void){};
+
+#endif
+
+#endif
diff --git a/mm/pmalloc.c b/mm/pmalloc.c
index a64ac49..a722d7b 100644
--- a/mm/pmalloc.c
+++ b/mm/pmalloc.c
@@ -25,6 +25,8 @@
#include <asm/cacheflush.h>
#include <asm/page.h>
+#include "pmalloc-selftest.h"
+
/**
* pmalloc_data contains the data specific to a pmalloc pool,
* in a format compatible with the design of gen_alloc.
@@ -508,6 +510,7 @@ static int __init pmalloc_late_init(void)
}
}
mutex_unlock(&pmalloc_mutex);
+ pmalloc_selftest();
return 0;
}
late_initcall(pmalloc_late_init);
--
2.9.3
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.
However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.
Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated
ones.
Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.
The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.
A module can request a pool and then refer any allocation request to the
pool handler it has received.
Once all the chunks of memory associated to a specific pool are
initialized, the pool can be protected.
After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).
The latter case is mainly meant for releasing memory, when a module is
unloaded.
A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.
Signed-off-by: Igor Stoppa <[email protected]>
---
include/linux/genalloc.h | 3 +
include/linux/pmalloc.h | 215 ++++++++++++++++++++
include/linux/vmalloc.h | 1 +
lib/genalloc.c | 27 +++
mm/Makefile | 1 +
mm/pmalloc.c | 513 +++++++++++++++++++++++++++++++++++++++++++++++
mm/usercopy.c | 25 ++-
7 files changed, 781 insertions(+), 4 deletions(-)
create mode 100644 include/linux/pmalloc.h
create mode 100644 mm/pmalloc.c
diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index a8fdabf..9f2974f 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -121,6 +121,9 @@ extern unsigned long gen_pool_alloc_algo(struct gen_pool *, size_t,
extern void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size,
dma_addr_t *dma);
extern void gen_pool_free(struct gen_pool *, unsigned long, size_t);
+
+extern void gen_pool_flush_chunk(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk);
extern void gen_pool_for_each_chunk(struct gen_pool *,
void (*)(struct gen_pool *, struct gen_pool_chunk *, void *), void *);
extern size_t gen_pool_avail(struct gen_pool *);
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 0000000..cb18739
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,215 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+
+
+#include <linux/genalloc.h>
+#include <linux/string.h>
+
+#define PMALLOC_DEFAULT_ALLOC_ORDER (-1)
+
+/*
+ * Library for dynamic allocation of pools of memory that can be,
+ * after initialization, marked as read-only.
+ *
+ * This is intended to complement __read_only_after_init, for those cases
+ * where either it is not possible to know the initialization value before
+ * init is completed, or the amount of data is variable and can be
+ * determined only at run-time.
+ *
+ * ***WARNING***
+ * The user of the API is expected to synchronize:
+ * 1) allocation,
+ * 2) writes to the allocated memory,
+ * 3) write protection of the pool,
+ * 4) freeing of the allocated memory, and
+ * 5) destruction of the pool.
+ *
+ * For a non-threaded scenario, this type of locking is not even required.
+ *
+ * Even if the library were to provide support for locking, point 2)
+ * would still depend on the user taking the lock.
+ */
+
+
+/**
+ * pmalloc_create_pool - create a new protectable memory pool -
+ * @name: the name of the pool, must be unique
+ * @min_alloc_order: log2 of the minimum allocation size obtainable
+ * from the pool
+ *
+ * Creates a new (empty) memory pool for allocation of protectable
+ * memory. Memory will be allocated upon request (through pmalloc).
+ *
+ * Returns a pointer to the new pool upon success, otherwise a NULL.
+ */
+struct gen_pool *pmalloc_create_pool(const char *name,
+ int min_alloc_order);
+
+
+int is_pmalloc_object(const void *ptr, const unsigned long n);
+
+/**
+ * pmalloc_prealloc - tries to allocate a memory chunk of the requested size
+ * @pool: handler to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ *
+ * Prepares a chunk of the requested size.
+ * This is intended to both minimize latency in later memory requests and
+ * avoid sleping during allocation.
+ * Memory allocated with prealloc is stored in one single chunk, as
+ * opposite to what is allocated on-demand when pmalloc runs out of free
+ * space already existing in the pool and has to invoke vmalloc.
+ *
+ * Returns true if the vmalloc call was successful, false otherwise.
+ */
+bool pmalloc_prealloc(struct gen_pool *pool, size_t size);
+
+/**
+ * pmalloc - allocate protectable memory from a pool
+ * @pool: handler to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ * @gfp: flags for page allocation
+ *
+ * Allocates memory from an unprotected pool. If the pool doesn't have
+ * enough memory, and the request did not include GFP_ATOMIC, an attempt
+ * is made to add a new chunk of memory to the pool
+ * (a multiple of PAGE_SIZE), in order to fit the new request.
+ * Otherwise, NULL is returned.
+ *
+ * Returns the pointer to the memory requested upon success,
+ * NULL otherwise (either no memory available or pool already read-only).
+ */
+void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp);
+
+
+/**
+ * pzalloc - zero-initialized version of pmalloc
+ * @pool: handler to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ * @gfp: flags for page allocation
+ *
+ * Executes pmalloc, initializing the memory requested to 0,
+ * before returning the pointer to it.
+ *
+ * Returns the pointer to the zeroed memory requested, upon success,
+ * NULL otherwise (either no memory available or pool already read-only).
+ */
+static inline void *pzalloc(struct gen_pool *pool, size_t size, gfp_t gfp)
+{
+ return pmalloc(pool, size, gfp | __GFP_ZERO);
+}
+
+/**
+ * pmalloc_array - allocates an array according to the parameters
+ * @pool: handler to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ * @gfp: flags for page allocation
+ *
+ * Executes pmalloc, if it has a chance to succeed.
+ *
+ * Returns either NULL or the pmalloc result.
+ */
+static inline void *pmalloc_array(struct gen_pool *pool, size_t n,
+ size_t size, gfp_t flags)
+{
+ if (unlikely(!(pool && n && size)))
+ return NULL;
+ return pmalloc(pool, n * size, flags);
+}
+
+/**
+ * pcalloc - allocates a 0-initialized array according to the parameters
+ * @pool: handler to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ * @gfp: flags for page allocation
+ *
+ * Executes pmalloc, if it has a chance to succeed.
+ *
+ * Returns either NULL or the pmalloc result.
+ */
+static inline void *pcalloc(struct gen_pool *pool, size_t n,
+ size_t size, gfp_t flags)
+{
+ return pmalloc_array(pool, n, size, flags | __GFP_ZERO);
+}
+
+/**
+ * pstrdup - duplicate a string, using pmalloc as allocator
+ * @pool: handler to the pool to be used for memory allocation
+ * @s: string to duplicate
+ * @gfp: flags for page allocation
+ *
+ * Generates a copy of the given string, allocating sufficient memory
+ * from the given pmalloc pool.
+ *
+ * Returns a pointer to the replica, NULL in case of recoverable error.
+ */
+static inline char *pstrdup(struct gen_pool *pool, const char *s, gfp_t gfp)
+{
+ size_t len;
+ char *buf;
+
+ if (unlikely(pool == NULL || s == NULL))
+ return NULL;
+
+ len = strlen(s) + 1;
+ buf = pmalloc(pool, len, gfp);
+ if (likely(buf))
+ strncpy(buf, s, len);
+ return buf;
+}
+
+/**
+ * pmalloc_protect_pool - turn a read/write pool read-only
+ * @pool: the pool to protect
+ *
+ * Write-protects all the memory chunks assigned to the pool.
+ * This prevents any further allocation.
+ *
+ * Returns 0 upon success, -EINVAL in abnormal cases.
+ */
+int pmalloc_protect_pool(struct gen_pool *pool);
+
+/**
+ * pfree - mark as unused memory that was previously in use
+ * @pool: handler to the pool to be used for memory allocation
+ * @addr: the beginning of the memory area to be freed
+ *
+ * The behavior of pfree is different, depending on the state of the
+ * protection.
+ * If the pool is not yet protected, the memory is marked as unused and
+ * will be availabel for further allocations.
+ * If the pool is already protected, the memory is marked as unused, but
+ * it will still be impossible to perform further allocation, because of
+ * the existing protection.
+ * The freed memory, in this case, will be truly released only when the
+ * pool is destroyed.
+ */
+static inline void pfree(struct gen_pool *pool, const void *addr)
+{
+ gen_pool_free(pool, (unsigned long)addr, 0);
+}
+
+/**
+ * pmalloc_destroy_pool - destroys a pool and all the associated memory
+ * @pool: the pool to destroy
+ *
+ * All the memory that was allocated through pmalloc in the pool will be freed.
+ *
+ * Returns 0 upon success, -EINVAL in abnormal cases.
+ */
+int pmalloc_destroy_pool(struct gen_pool *pool);
+
+#endif
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 1e5d8c3..116d280 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -20,6 +20,7 @@ struct notifier_block; /* in notifier.h */
#define VM_UNINITIALIZED 0x00000020 /* vm_struct is not fully initialized */
#define VM_NO_GUARD 0x00000040 /* don't add guard page */
#define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */
+#define VM_PMALLOC 0x00000100 /* pmalloc area - see docs */
/* bits [20..32] reserved for arch specific ioremap internals */
/*
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 13bc8cf..8ce616fb 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -519,6 +519,33 @@ void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size)
}
EXPORT_SYMBOL(gen_pool_free);
+
+/**
+ * gen_pool_flush_chunk - drops all the allocations from a specific chunk
+ * @pool: the generic memory pool
+ * @chunk: The chunk to wipe clear.
+ *
+ * This is meant to be called only while destroying a pool. It's up to the
+ * caller to avoid races, but really, at this point the pool should have
+ * already been retired and have become unavailable for any other sort of
+ * operation.
+ */
+void gen_pool_flush_chunk(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk)
+{
+ size_t size;
+
+ if (unlikely(!(pool && chunk)))
+ return;
+
+ size = chunk->end_addr + 1 - chunk->start_addr;
+ memset(chunk->entries, 0,
+ DIV_ROUND_UP(size >> pool->min_alloc_order * BITS_PER_ENTRY,
+ BITS_PER_BYTE));
+ atomic_set(&chunk->avail, size);
+}
+
+
/**
* gen_pool_for_each_chunk - call func for every chunk of generic memory pool
* @pool: the generic memory pool
diff --git a/mm/Makefile b/mm/Makefile
index e669f02..a6a47e1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -65,6 +65,7 @@ obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pmalloc.o
obj-$(CONFIG_KSM) += ksm.o
obj-$(CONFIG_PAGE_POISONING) += page_poison.o
obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/pmalloc.c b/mm/pmalloc.c
new file mode 100644
index 0000000..a64ac49
--- /dev/null
+++ b/mm/pmalloc.c
@@ -0,0 +1,513 @@
+/*
+ * pmalloc.c: Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/printk.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/genalloc.h>
+#include <linux/kernel.h>
+#include <linux/log2.h>
+#include <linux/slab.h>
+#include <linux/device.h>
+#include <linux/atomic.h>
+#include <linux/rculist.h>
+#include <linux/set_memory.h>
+#include <asm/cacheflush.h>
+#include <asm/page.h>
+
+/**
+ * pmalloc_data contains the data specific to a pmalloc pool,
+ * in a format compatible with the design of gen_alloc.
+ * Some of the fields are used for exposing the corresponding parameter
+ * to userspace, through sysfs.
+ */
+struct pmalloc_data {
+ struct gen_pool *pool; /* Link back to the associated pool. */
+ bool protected; /* Status of the pool: RO or RW. */
+ struct kobj_attribute attr_protected; /* Sysfs attribute. */
+ struct kobj_attribute attr_avail; /* Sysfs attribute. */
+ struct kobj_attribute attr_size; /* Sysfs attribute. */
+ struct kobj_attribute attr_chunks; /* Sysfs attribute. */
+ struct kobject *pool_kobject;
+ struct list_head node; /* list of pools */
+};
+
+static LIST_HEAD(pmalloc_final_list);
+static LIST_HEAD(pmalloc_tmp_list);
+static struct list_head *pmalloc_list = &pmalloc_tmp_list;
+static DEFINE_MUTEX(pmalloc_mutex);
+static struct kobject *pmalloc_kobject;
+
+static ssize_t pmalloc_pool_show_protected(struct kobject *dev,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ struct pmalloc_data *data;
+
+ data = container_of(attr, struct pmalloc_data, attr_protected);
+ if (data->protected)
+ return sprintf(buf, "protected\n");
+ else
+ return sprintf(buf, "unprotected\n");
+}
+
+static ssize_t pmalloc_pool_show_avail(struct kobject *dev,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ struct pmalloc_data *data;
+
+ data = container_of(attr, struct pmalloc_data, attr_avail);
+ return sprintf(buf, "%lu\n", gen_pool_avail(data->pool));
+}
+
+static ssize_t pmalloc_pool_show_size(struct kobject *dev,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ struct pmalloc_data *data;
+
+ data = container_of(attr, struct pmalloc_data, attr_size);
+ return sprintf(buf, "%lu\n", gen_pool_size(data->pool));
+}
+
+static void pool_chunk_number(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk, void *data)
+{
+ unsigned long *counter = data;
+
+ (*counter)++;
+}
+
+static ssize_t pmalloc_pool_show_chunks(struct kobject *dev,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ struct pmalloc_data *data;
+ unsigned long chunks_num = 0;
+
+ data = container_of(attr, struct pmalloc_data, attr_chunks);
+ gen_pool_for_each_chunk(data->pool, pool_chunk_number, &chunks_num);
+ return sprintf(buf, "%lu\n", chunks_num);
+}
+
+/**
+ * Exposes the pool and its attributes through sysfs.
+ */
+static struct kobject *pmalloc_connect(struct pmalloc_data *data)
+{
+ const struct attribute *attrs[] = {
+ &data->attr_protected.attr,
+ &data->attr_avail.attr,
+ &data->attr_size.attr,
+ &data->attr_chunks.attr,
+ NULL
+ };
+ struct kobject *kobj;
+
+ kobj = kobject_create_and_add(data->pool->name, pmalloc_kobject);
+ if (unlikely(!kobj))
+ return NULL;
+
+ if (unlikely(sysfs_create_files(kobj, attrs) < 0)) {
+ kobject_put(kobj);
+ kobj = NULL;
+ }
+ return kobj;
+}
+
+/**
+ * Removes the pool and its attributes from sysfs.
+ */
+static void pmalloc_disconnect(struct pmalloc_data *data,
+ struct kobject *kobj)
+{
+ const struct attribute *attrs[] = {
+ &data->attr_protected.attr,
+ &data->attr_avail.attr,
+ &data->attr_size.attr,
+ &data->attr_chunks.attr,
+ NULL
+ };
+
+ sysfs_remove_files(kobj, attrs);
+ kobject_put(kobj);
+}
+
+/**
+ * Declares an attribute of the pool.
+ */
+
+#define pmalloc_attr_init(data, attr_name) \
+do { \
+ sysfs_attr_init(&data->attr_##attr_name.attr); \
+ data->attr_##attr_name.attr.name = #attr_name; \
+ data->attr_##attr_name.attr.mode = VERIFY_OCTAL_PERMISSIONS(0444); \
+ data->attr_##attr_name.show = pmalloc_pool_show_##attr_name; \
+} while (0)
+
+struct gen_pool *pmalloc_create_pool(const char *name, int min_alloc_order)
+{
+ struct gen_pool *pool;
+ const char *pool_name;
+ struct pmalloc_data *data;
+
+ if (!name) {
+ WARN_ON(1);
+ return NULL;
+ }
+
+ if (min_alloc_order < 0)
+ min_alloc_order = ilog2(sizeof(unsigned long));
+
+ pool = gen_pool_create(min_alloc_order, NUMA_NO_NODE);
+ if (unlikely(!pool))
+ return NULL;
+
+ mutex_lock(&pmalloc_mutex);
+ list_for_each_entry(data, pmalloc_list, node)
+ if (!strcmp(name, data->pool->name))
+ goto same_name_err;
+
+ pool_name = kstrdup(name, GFP_KERNEL);
+ if (unlikely(!pool_name))
+ goto name_alloc_err;
+
+ data = kzalloc(sizeof(struct pmalloc_data), GFP_KERNEL);
+ if (unlikely(!data))
+ goto data_alloc_err;
+
+ data->protected = false;
+ data->pool = pool;
+ pmalloc_attr_init(data, protected);
+ pmalloc_attr_init(data, avail);
+ pmalloc_attr_init(data, size);
+ pmalloc_attr_init(data, chunks);
+ pool->data = data;
+ pool->name = pool_name;
+
+ list_add(&data->node, pmalloc_list);
+ if (pmalloc_list == &pmalloc_final_list)
+ data->pool_kobject = pmalloc_connect(data);
+ mutex_unlock(&pmalloc_mutex);
+ return pool;
+
+data_alloc_err:
+ kfree(pool_name);
+name_alloc_err:
+same_name_err:
+ mutex_unlock(&pmalloc_mutex);
+ gen_pool_destroy(pool);
+ return NULL;
+}
+
+static inline int check_alloc_params(struct gen_pool *pool, size_t req_size)
+{
+ struct pmalloc_data *data;
+ unsigned int order;
+
+ if (unlikely(!req_size || !pool))
+ return -1;
+
+ order = (unsigned int)pool->min_alloc_order;
+ data = pool->data;
+
+ if (data == NULL)
+ return -1;
+
+ if (unlikely(data->protected)) {
+ WARN_ON(1);
+ return -1;
+ }
+ return 0;
+}
+
+
+static inline bool chunk_tagging(void *chunk, bool tag)
+{
+ struct vm_struct *area;
+ struct page *page;
+
+ if (!is_vmalloc_addr(chunk))
+ return false;
+
+ page = vmalloc_to_page(chunk);
+ if (unlikely(!page))
+ return false;
+
+ area = page->area;
+ if (tag)
+ area->flags |= VM_PMALLOC;
+ else
+ area->flags &= ~VM_PMALLOC;
+ return true;
+}
+
+
+static inline bool tag_chunk(void *chunk)
+{
+ return chunk_tagging(chunk, true);
+}
+
+
+static inline bool untag_chunk(void *chunk)
+{
+ return chunk_tagging(chunk, false);
+}
+
+enum {
+ INVALID_PMALLOC_OBJECT = -1,
+ NOT_PMALLOC_OBJECT = 0,
+ VALID_PMALLOC_OBJECT = 1,
+};
+
+int is_pmalloc_object(const void *ptr, const unsigned long n)
+{
+ struct vm_struct *area;
+ struct page *page;
+ unsigned long area_start;
+ unsigned long area_end;
+ unsigned long object_start;
+ unsigned long object_end;
+
+
+ /* is_pmalloc_object gets called pretty late, so chances are high
+ * that the object is indeed of vmalloc type
+ */
+ if (unlikely(!is_vmalloc_addr(ptr)))
+ return NOT_PMALLOC_OBJECT;
+
+ page = vmalloc_to_page(ptr);
+ if (unlikely(!page))
+ return NOT_PMALLOC_OBJECT;
+
+ area = page->area;
+
+ if (likely(!(area->flags & VM_PMALLOC)))
+ return NOT_PMALLOC_OBJECT;
+
+ area_start = (unsigned long)area->addr;
+ area_end = area_start + area->nr_pages * PAGE_SIZE - 1;
+ object_start = (unsigned long)ptr;
+ object_end = object_start + n - 1;
+
+ if (likely((area_start <= object_start) &&
+ (object_end <= area_end)))
+ return VALID_PMALLOC_OBJECT;
+ else
+ return INVALID_PMALLOC_OBJECT;
+}
+
+
+bool pmalloc_prealloc(struct gen_pool *pool, size_t size)
+{
+ void *chunk;
+ size_t chunk_size;
+ bool add_error;
+ unsigned int order;
+
+ if (check_alloc_params(pool, size))
+ return false;
+
+ order = (unsigned int)pool->min_alloc_order;
+
+ /* Expand pool */
+ chunk_size = roundup(size, PAGE_SIZE);
+ chunk = vmalloc(chunk_size);
+ if (unlikely(chunk == NULL))
+ return false;
+
+ /* Locking is already done inside gen_pool_add */
+ add_error = gen_pool_add(pool, (unsigned long)chunk, chunk_size,
+ NUMA_NO_NODE);
+ if (unlikely(add_error != 0))
+ goto abort;
+
+ return true;
+abort:
+ vfree(chunk);
+ return false;
+
+}
+
+void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp)
+{
+ void *chunk;
+ size_t chunk_size;
+ bool add_error;
+ unsigned long retval;
+ unsigned int order;
+
+ if (check_alloc_params(pool, size))
+ return NULL;
+
+ order = (unsigned int)pool->min_alloc_order;
+
+retry_alloc_from_pool:
+ retval = gen_pool_alloc(pool, size);
+ if (retval)
+ goto return_allocation;
+
+ if (unlikely((gfp & __GFP_ATOMIC))) {
+ if (unlikely((gfp & __GFP_NOFAIL)))
+ goto retry_alloc_from_pool;
+ else
+ return NULL;
+ }
+
+ /* Expand pool */
+ chunk_size = roundup(size, PAGE_SIZE);
+ chunk = vmalloc(chunk_size);
+ if (unlikely(!chunk)) {
+ if (unlikely((gfp & __GFP_NOFAIL)))
+ goto retry_alloc_from_pool;
+ else
+ return NULL;
+ }
+ if (unlikely(!tag_chunk(chunk)))
+ goto free;
+
+ /* Locking is already done inside gen_pool_add */
+ add_error = gen_pool_add(pool, (unsigned long)chunk, chunk_size,
+ NUMA_NO_NODE);
+ if (unlikely(add_error))
+ goto abort;
+
+ retval = gen_pool_alloc(pool, size);
+ if (retval) {
+return_allocation:
+ *(size_t *)retval = size;
+ if (gfp & __GFP_ZERO)
+ memset((void *)retval, 0, size);
+ return (void *)retval;
+ }
+ /* Here there is no test for __GFP_NO_FAIL because, in case of
+ * concurrent allocation, one thread might add a chunk to the
+ * pool and this memory could be allocated by another thread,
+ * before the first thread gets a chance to use it.
+ * As long as vmalloc succeeds, it's ok to retry.
+ */
+ goto retry_alloc_from_pool;
+abort:
+ untag_chunk(chunk);
+free:
+ vfree(chunk);
+ return NULL;
+}
+
+static void pmalloc_chunk_set_protection(struct gen_pool *pool,
+
+ struct gen_pool_chunk *chunk,
+ void *data)
+{
+ const bool *flag = data;
+ size_t chunk_size = chunk->end_addr + 1 - chunk->start_addr;
+ unsigned long pages = chunk_size / PAGE_SIZE;
+
+ BUG_ON(chunk_size & (PAGE_SIZE - 1));
+
+ if (*flag)
+ set_memory_ro(chunk->start_addr, pages);
+ else
+ set_memory_rw(chunk->start_addr, pages);
+}
+
+static int pmalloc_pool_set_protection(struct gen_pool *pool, bool protection)
+{
+ struct pmalloc_data *data;
+ struct gen_pool_chunk *chunk;
+
+ if (unlikely(!pool))
+ return -EINVAL;
+
+ data = pool->data;
+
+ if (unlikely(!data))
+ return -EINVAL;
+
+ if (unlikely(data->protected == protection)) {
+ WARN_ON(1);
+ return 0;
+ }
+
+ data->protected = protection;
+ list_for_each_entry(chunk, &(pool)->chunks, next_chunk)
+ pmalloc_chunk_set_protection(pool, chunk, &protection);
+ return 0;
+}
+
+int pmalloc_protect_pool(struct gen_pool *pool)
+{
+ return pmalloc_pool_set_protection(pool, true);
+}
+
+
+static void pmalloc_chunk_free(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk, void *data)
+{
+ untag_chunk(chunk);
+ gen_pool_flush_chunk(pool, chunk);
+ vfree_atomic((void *)chunk->start_addr);
+}
+
+
+int pmalloc_destroy_pool(struct gen_pool *pool)
+{
+ struct pmalloc_data *data;
+
+ if (unlikely(pool == NULL))
+ return -EINVAL;
+
+ data = pool->data;
+
+ if (unlikely(data == NULL))
+ return -EINVAL;
+
+ mutex_lock(&pmalloc_mutex);
+ list_del(&data->node);
+ mutex_unlock(&pmalloc_mutex);
+
+ if (likely(data->pool_kobject))
+ pmalloc_disconnect(data, data->pool_kobject);
+
+ pmalloc_pool_set_protection(pool, false);
+ gen_pool_for_each_chunk(pool, pmalloc_chunk_free, NULL);
+ gen_pool_destroy(pool);
+ kfree(data);
+ return 0;
+}
+
+/**
+ * When the sysfs is ready to receive registrations, connect all the
+ * pools previously created. Also enable further pools to be connected
+ * right away.
+ */
+static int __init pmalloc_late_init(void)
+{
+ struct pmalloc_data *data, *n;
+
+ pmalloc_kobject = kobject_create_and_add("pmalloc", kernel_kobj);
+
+ mutex_lock(&pmalloc_mutex);
+ pmalloc_list = &pmalloc_final_list;
+
+ if (likely(pmalloc_kobject != NULL)) {
+ list_for_each_entry_safe(data, n, &pmalloc_tmp_list, node) {
+ list_move(&data->node, &pmalloc_final_list);
+ pmalloc_connect(data);
+ }
+ }
+ mutex_unlock(&pmalloc_mutex);
+ return 0;
+}
+late_initcall(pmalloc_late_init);
diff --git a/mm/usercopy.c b/mm/usercopy.c
index a9852b2..c3b1029 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -15,6 +15,7 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/mm.h>
+#include <linux/pmalloc.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/sched/task.h>
@@ -222,6 +223,7 @@ static inline const char *check_heap_object(const void *ptr, unsigned long n,
void __check_object_size(const void *ptr, unsigned long n, bool to_user)
{
const char *err;
+ int retv;
/* Skip all tests if size is zero. */
if (!n)
@@ -229,12 +231,12 @@ void __check_object_size(const void *ptr, unsigned long n, bool to_user)
/* Check for invalid addresses. */
err = check_bogus_address(ptr, n);
- if (err)
+ if (unlikely(err))
goto report;
/* Check for bad heap object. */
err = check_heap_object(ptr, n, to_user);
- if (err)
+ if (unlikely(err))
goto report;
/* Check for bad stack object. */
@@ -257,8 +259,23 @@ void __check_object_size(const void *ptr, unsigned long n, bool to_user)
/* Check for object in kernel to avoid text exposure. */
err = check_kernel_text_object(ptr, n);
- if (!err)
- return;
+ if (unlikely(err))
+ goto report;
+
+ /* Check if object is from a pmalloc chunk.
+ */
+ retv = is_pmalloc_object(ptr, n);
+ if (unlikely(retv)) {
+ if (unlikely(!to_user)) {
+ err = "<trying to write to pmalloc object>";
+ goto report;
+ }
+ if (retv < 0) {
+ err = "<invalid pmalloc object>";
+ goto report;
+ }
+ }
+ return;
report:
report_usercopy(ptr, n, to_user, err);
--
2.9.3
On Wed, Jan 24, 2018 at 6:56 PM, Igor Stoppa <[email protected]> wrote:
> The MMU available in many systems running Linux can often provide R/O
> protection to the memory pages it handles.
>
> However, the MMU-based protection works efficiently only when said pages
> contain exclusively data that will not need further modifications.
>
> Statically allocated variables can be segregated into a dedicated
> section, but this does not sit very well with dynamically allocated
> ones.
>
> Dynamic allocation does not provide, currently, any means for grouping
> variables in memory pages that would contain exclusively data suitable
> for conversion to read only access mode.
>
> The allocator here provided (pmalloc - protectable memory allocator)
> introduces the concept of pools of protectable memory.
>
> A module can request a pool and then refer any allocation request to the
> pool handler it has received.
>
> Once all the chunks of memory associated to a specific pool are
> initialized, the pool can be protected.
I'm not entirely convinced by the approach of marking small parts of
kernel memory as readonly for hardening.
Comments on some details are inline.
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 1e5d8c3..116d280 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -20,6 +20,7 @@ struct notifier_block; /* in notifier.h */
> #define VM_UNINITIALIZED 0x00000020 /* vm_struct is not fully initialized */
> #define VM_NO_GUARD 0x00000040 /* don't add guard page */
> #define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */
> +#define VM_PMALLOC 0x00000100 /* pmalloc area - see docs */
Is "see docs" specific enough to actually guide the reader to the
right documentation?
> +#define pmalloc_attr_init(data, attr_name) \
> +do { \
> + sysfs_attr_init(&data->attr_##attr_name.attr); \
> + data->attr_##attr_name.attr.name = #attr_name; \
> + data->attr_##attr_name.attr.mode = VERIFY_OCTAL_PERMISSIONS(0444); \
> + data->attr_##attr_name.show = pmalloc_pool_show_##attr_name; \
> +} while (0)
Is there a good reason for making all these files mode 0444 (as
opposed to setting them to 0400 and then allowing userspace to make
them accessible if desired)? /proc/slabinfo contains vaguely similar
data and is mode 0400 (or mode 0600, depending on the kernel config)
AFAICS.
> +void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp)
> +{
[...]
> + /* Expand pool */
> + chunk_size = roundup(size, PAGE_SIZE);
> + chunk = vmalloc(chunk_size);
You're allocating with vmalloc(), which, as far as I know, establishes
a second mapping in the vmalloc area for pages that are already mapped
as RW through the physmap. AFAICS, later, when you're trying to make
pages readonly, you're only changing the protections on the second
mapping in the vmalloc area, therefore leaving the memory writable
through the physmap. Is that correct? If so, please either document
the reasoning why this is okay or change it.
2 Minor typos inline below:
On 01/24/2018 09:56 AM, Igor Stoppa wrote:
> Detailed documentation about the protectable memory allocator.
>
> Signed-off-by: Igor Stoppa <[email protected]>
> ---
> Documentation/core-api/pmalloc.txt | 104 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 104 insertions(+)
> create mode 100644 Documentation/core-api/pmalloc.txt
>
> diff --git a/Documentation/core-api/pmalloc.txt b/Documentation/core-api/pmalloc.txt
> new file mode 100644
> index 0000000..9c39672
> --- /dev/null
> +++ b/Documentation/core-api/pmalloc.txt
> @@ -0,0 +1,104 @@
> +============================
> +Protectable memory allocator
> +============================
> +
> +Introduction
> +------------
> +
> +When trying to perform an attack toward a system, the attacker typically
> +wants to alter the execution flow, in a way that allows actions which
> +would otherwise be forbidden.
> +
> +In recent years there has been lots of effort in preventing the execution
> +of arbitrary code, so the attacker is progressively pushed to look for
> +alternatives.
> +
> +If code changes are either detected or even prevented, what is left is to
> +alter kernel data.
> +
> +As countermeasure, constant data is collected in a section which is then
> +marked as readonly.
> +To expand on this, also statically allocated variables which are tagged
> +as __ro_after_init will receive a similar treatment.
> +The difference from constant data is that such variables can be still
> +altered freely during the kernel init phase.
> +
> +However, such solution does not address those variables which could be
> +treated essentially as read-only, but whose size is not known at compile
> +time or cannot be fully initialized during the init phase.
> +
> +
> +Design
> +------
> +
> +pmalloc builds on top of genalloc, using the same concept of memory pools
> +A pool is a handle to a group of chunks of memory of various sizes.
> +When created, a pool is empty. It will be populated by allocating chunks
> +of memory, either when the first memory allocation request is received, or
> +when a pre-allocation is performed.
> +
> +Either way, one or more memory pages will be obtaiend from vmalloc and
obtained
> +registered in the pool as chunk. Subsequent requests will be satisfied by
> +either using any available free space from the current chunks, or by
> +allocating more vmalloc pages, should the current free space not suffice.
> +
> +This is the key point of pmalloc: it groups data that must be protected
> +into a set of pages. The protection is performed through the mmu, which
> +is a prerequisite and has a minimum granularity of one page.
> +
> +If the relevant variables were not grouped, there would be a problem of
> +allowing writes to other variables that might happen to share the same
> +page, but require further alterations over time.
> +
> +A pool is a group of pages that are write protected at the same time.
> +Ideally, they have some high level correlation (ex: they belong to the
> +same module), which justifies write protecting them all together.
> +
> +To keep it to a minimum, locking is left to the user of the API, in
> +those cases where it's not strictly needed.
> +Ideally, no further locking is required, since each module can have own
> +pool (or pools), which should, for example, avoid the need for cross
> +module or cross thread synchronization about write protecting a pool.
> +
> +The overhead of creating an additional pool is minimal: a handful of bytes
> +from kmalloc space for the metadata and then what is left unused from the
> +page(s) registered as chunks.
> +
> +Compared to plain use of vmalloc, genalloc has the advantage of tightly
> +packing the allocations, reducing the number of pages used and therefore
> +the pressure on the TLB. The slight overhead in execution time of the
> +allocation should be mostly irrelevant, because pmalloc memory is not
> +meant to be allocated/freed in tight loops. Rather it ought to be taken
> +in use, initialized and write protected. Possibly destroyed.
> +
> +Considering that not much data is supposed to be dynamically allocated
> +and then marked as read-only, it shouldn't be an issue that the address
> +range for pmalloc is limited, on 32-bit systemd.
> +
> +Regarding SMP systems, the allocations are expected to happen mostly
> +during an initial transient, after which there should be no more need to
> +perform cross-processor synchronizations of page tables.
> +
> +
> +Use
> +---
> +
> +The typical sequence, when using pmalloc, is:
> +
> +1. create a pool
> +2. [optional] pre-allocate some memory in the pool
> +3. issue one or more allocation requests to the pool
> +4. initialize the memory obtained
> + - iterate over points 3 & 4 as needed -
> +5. write protect the pool
> +6. use in read-only mode the handlers obtained throguh the allocations
through
> +7. [optional] destroy the pool
> +
> +
> +In a scenario where, for example due to some error, part or all of the
> +allocations performed at point 3 must be reverted, it is possible to free
> +them, as long as point 5 has not been executed, and the pool is still
> +modifiable. Such freed memory can be re-used.
> +Performing a free operation on a write-protected pool will, instead,
> +simply release the corresponding memory from the accounting, but it will
> +be still impossible to alter its content.
>
On 24/01/18 21:14, Ralph Campbell wrote:
> 2 Minor typos inline below:
thanks for proof-reading, will fix accordingly.
--
igor
Hi,
thanks for the review. My reply below.
On 24/01/18 21:10, Jann Horn wrote:
> I'm not entirely convinced by the approach of marking small parts of
> kernel memory as readonly for hardening.
Because of the physmap you mention later?
Regarding small parts vs big parts (what is big enough?) I did propose
the use of a custom zone at the very beginning, however I met 2 objections:
1. It's not a special case and there was no will to reserve another zone
This might be mitigated by aliasing with a zone that is already
defined, but not in use. For example DMA or DMA32.
But it looks like a good way to replicate the confusion that is page
struct. Anyway, I found the next objection more convincing.
2. What would be the size of this zone? It would become something that
is really application specific. At the very least it should become a
command line parameter. A distro would have to allocate a lot of
memory for it, because it cannot really know upfront what its users
will do. But, most likely, the vast majority of users would never
need that much.
If you have some idea of how to address these objections without using
vmalloc, or at least without using the same page provider that vmalloc
is using now, I'd be interested to hear it.
Besides the double mapping problem, the major benefit I can see from
having a contiguous area is that it simplifies the hardened user copy
verification, because there is a fixed range to test for overlap.
> Comments on some details are inline.
thank you
>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>> index 1e5d8c3..116d280 100644
>> --- a/include/linux/vmalloc.h
>> +++ b/include/linux/vmalloc.h
>> @@ -20,6 +20,7 @@ struct notifier_block; /* in notifier.h */
>> #define VM_UNINITIALIZED 0x00000020 /* vm_struct is not fully initialized */
>> #define VM_NO_GUARD 0x00000040 /* don't add guard page */
>> #define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */
>> +#define VM_PMALLOC 0x00000100 /* pmalloc area - see docs */
>
> Is "see docs" specific enough to actually guide the reader to the
> right documentation?
The doc file is named pmalloc.txt, but I can be more explicit.
>> +#define pmalloc_attr_init(data, attr_name) \
>> +do { \
>> + sysfs_attr_init(&data->attr_##attr_name.attr); \
>> + data->attr_##attr_name.attr.name = #attr_name; \
>> + data->attr_##attr_name.attr.mode = VERIFY_OCTAL_PERMISSIONS(0444); \
>> + data->attr_##attr_name.show = pmalloc_pool_show_##attr_name; \
>> +} while (0)
>
> Is there a good reason for making all these files mode 0444 (as
> opposed to setting them to 0400 and then allowing userspace to make
> them accessible if desired)? /proc/slabinfo contains vaguely similar
> data and is mode 0400 (or mode 0600, depending on the kernel config)
> AFAICS.
ok, you do have a point, so far I have been mostly focusing on the
"drop-in replacement for kmalloc" aspect.
>> +void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp)
>> +{
> [...]
>> + /* Expand pool */
>> + chunk_size = roundup(size, PAGE_SIZE);
>> + chunk = vmalloc(chunk_size);
>
> You're allocating with vmalloc(), which, as far as I know, establishes
> a second mapping in the vmalloc area for pages that are already mapped
> as RW through the physmap. AFAICS, later, when you're trying to make
> pages readonly, you're only changing the protections on the second
> mapping in the vmalloc area, therefore leaving the memory writable
> through the physmap. Is that correct? If so, please either document
> the reasoning why this is okay or change it.
About why vmalloc as backend for pmalloc, please refer to this:
http://www.openwall.com/lists/kernel-hardening/2018/01/24/11
I tried to give a short summary of what took me toward vmalloc.
vmalloc is also a convenient way of obtaining arbitrarily (within
reason) large amounts of virtually contiguous memory.
Your objection is toward the unprotected access, through the alternate
mapping, rather than to the idea of having pools that can be protected
individually, right?
In the mail I linked, I explained that I could not use kmalloc because
of the problem of splitting huge pages on ARM.
kmalloc does require the physmap, for performance reason.
However, vmalloc is already doing mapping of individual pages, because
it must ensure that they are virtually contiguous, so would it be
possible to have vmalloc _always_ outside of the physmap?
If I have understood correctly, the actual extension of physmap is
highly architecture and platform dependant, so it might be (but I have
not checked) that in some cases (like some 32bit systems) vmalloc is
typically outside of physmap, but probably that is not the case on 64bit?
Also, I need to understand how physmap works against vmalloc vs how it
works against kernel text and const/__ro_after_init sections.
Can they also be accessed (and written?) through the physmap?
But, to take a different angle: if an attacker knows where kernel
symbols are and has gained capability to write at arbitrary location(s)
in kernel data, what prevents a modification of mappings and permissions?
What is considered robust enough?
I have the impression that, without support from HW, to have some
one-way mechanism that protects some page permanently, it's always
possible to undo the various protections we are talking about, only harder.
From the perspective of protecting against accidental overwrites,
instead, the current implementation should be ok, since it's less likely
that some stray pointer happens to assume a value that goes through the
physmap.
But I'm interested to hear, if you have some suggestion about how to
prevent the side access through the physmap.
--
thanks, igor
On Thu, Jan 25, 2018 at 6:59 AM, Igor Stoppa <[email protected]> wrote:
>
> Hi,
>
> thanks for the review. My reply below.
>
> On 24/01/18 21:10, Jann Horn wrote:
>
> > I'm not entirely convinced by the approach of marking small parts of
> > kernel memory as readonly for hardening.
>
> Because of the physmap you mention later?
>
> Regarding small parts vs big parts (what is big enough?) I did propose
> the use of a custom zone at the very beginning, however I met 2 objections:
>
> 1. It's not a special case and there was no will to reserve another zone
> This might be mitigated by aliasing with a zone that is already
> defined, but not in use. For example DMA or DMA32.
> But it looks like a good way to replicate the confusion that is page
> struct. Anyway, I found the next objection more convincing.
>
> 2. What would be the size of this zone? It would become something that
> is really application specific. At the very least it should become a
> command line parameter. A distro would have to allocate a lot of
> memory for it, because it cannot really know upfront what its users
> will do. But, most likely, the vast majority of users would never
> need that much.
>
> If you have some idea of how to address these objections without using
> vmalloc, or at least without using the same page provider that vmalloc
> is using now, I'd be interested to hear it.
>
> Besides the double mapping problem, the major benefit I can see from
> having a contiguous area is that it simplifies the hardened user copy
> verification, because there is a fixed range to test for overlap.
>
> > Comments on some details are inline.
>
> thank you
>
> >> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> >> index 1e5d8c3..116d280 100644
> >> --- a/include/linux/vmalloc.h
> >> +++ b/include/linux/vmalloc.h
> >> @@ -20,6 +20,7 @@ struct notifier_block; /* in notifier.h */
> >> #define VM_UNINITIALIZED 0x00000020 /* vm_struct is not fully initialized */
> >> #define VM_NO_GUARD 0x00000040 /* don't add guard page */
> >> #define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */
> >> +#define VM_PMALLOC 0x00000100 /* pmalloc area - see docs */
> >
> > Is "see docs" specific enough to actually guide the reader to the
> > right documentation?
>
> The doc file is named pmalloc.txt, but I can be more explicit.
>
> >> +#define pmalloc_attr_init(data, attr_name) \
> >> +do { \
> >> + sysfs_attr_init(&data->attr_##attr_name.attr); \
> >> + data->attr_##attr_name.attr.name = #attr_name; \
> >> + data->attr_##attr_name.attr.mode = VERIFY_OCTAL_PERMISSIONS(0444); \
> >> + data->attr_##attr_name.show = pmalloc_pool_show_##attr_name; \
> >> +} while (0)
> >
> > Is there a good reason for making all these files mode 0444 (as
> > opposed to setting them to 0400 and then allowing userspace to make
> > them accessible if desired)? /proc/slabinfo contains vaguely similar
> > data and is mode 0400 (or mode 0600, depending on the kernel config)
> > AFAICS.
>
> ok, you do have a point, so far I have been mostly focusing on the
>
> "drop-in replacement for kmalloc" aspect.
>
> >> +void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp)
> >> +{
> > [...]
> >> + /* Expand pool */
> >> + chunk_size = roundup(size, PAGE_SIZE);
> >> + chunk = vmalloc(chunk_size);
> >
> > You're allocating with vmalloc(), which, as far as I know, establishes
> > a second mapping in the vmalloc area for pages that are already mapped
> > as RW through the physmap. AFAICS, later, when you're trying to make
> > pages readonly, you're only changing the protections on the second
> > mapping in the vmalloc area, therefore leaving the memory writable
> > through the physmap. Is that correct? If so, please either document
> > the reasoning why this is okay or change it.
>
> About why vmalloc as backend for pmalloc, please refer to this:
>
> http://www.openwall.com/lists/kernel-hardening/2018/01/24/11
>
> I tried to give a short summary of what took me toward vmalloc.
> vmalloc is also a convenient way of obtaining arbitrarily (within
> reason) large amounts of virtually contiguous memory.
>
> Your objection is toward the unprotected access, through the alternate
> mapping, rather than to the idea of having pools that can be protected
> individually, right?
>
> In the mail I linked, I explained that I could not use kmalloc because
> of the problem of splitting huge pages on ARM.
>
> kmalloc does require the physmap, for performance reason.
>
> However, vmalloc is already doing mapping of individual pages, because
> it must ensure that they are virtually contiguous, so would it be
> possible to have vmalloc _always_ outside of the physmap?
>
> If I have understood correctly, the actual extension of physmap is
> highly architecture and platform dependant, so it might be (but I have
> not checked) that in some cases (like some 32bit systems) vmalloc is
> typically outside of physmap, but probably that is not the case on 64bit?
>
> Also, I need to understand how physmap works against vmalloc vs how it
> works against kernel text and const/__ro_after_init sections.
>
> Can they also be accessed (and written?) through the physmap?
>
> But, to take a different angle: if an attacker knows where kernel
> symbols are and has gained capability to write at arbitrary location(s)
> in kernel data, what prevents a modification of mappings and permissions?
>
> What is considered robust enough?
>
> I have the impression that, without support from HW, to have some
> one-way mechanism that protects some page permanently, it's always
> possible to undo the various protections we are talking about, only harder.
>
> From the perspective of protecting against accidental overwrites,
> instead, the current implementation should be ok, since it's less likely
> that some stray pointer happens to assume a value that goes through the
> physmap.
>
> But I'm interested to hear, if you have some suggestion about how to
> prevent the side access through the physmap.
>
> --
> thanks, igor
DMA/physmap access coupled with a knowledge of which virtual mappings
are in the physical space should be enough for an attacker to bypass
the gating mechanism this work imposes. Not trivial, but not
impossible. Since there's no way to prevent that sort of access in
current hardware (especially something like a NIC or GPU working
independently of the CPU altogether), we have the option of checking
contents of a sealed page against a checksum/hash of the page prior to
returning its contents to the caller (since it needs to be read to be
verified), or some other mechanism within the read path to ensure that
no event since the last read affected the page/allocation. If the
structure containing the list of verifiers is separate from the page,
the attacker needs to resolve and change the contents of those
signatures for the pages they're affecting via DMA before the kernel
checks one against the other in the read path. I cant speak to
overhead, but it should complicate the logic of a successful attack
chain.
Off the cuff, if the allocator sums the contents when sealing a page,
stores it in a lookup table, and forces verification on every
read/lookup, it should prevent _use_ of memory which was modified
unless our attacker is clever enough to fix that up prior to the next
access. Since its write-once memory, race conditions on subsequent
access shouldn't be a problem.
--
Boris Lukashev
Systems Architect
Semper Victus
On Thu, Jan 25, 2018 at 10:14:28AM -0500, Boris Lukashev wrote:
> On Thu, Jan 25, 2018 at 6:59 AM, Igor Stoppa <[email protected]> wrote:
[...]
> DMA/physmap access coupled with a knowledge of which virtual mappings
> are in the physical space should be enough for an attacker to bypass
> the gating mechanism this work imposes. Not trivial, but not
> impossible. Since there's no way to prevent that sort of access in
> current hardware (especially something like a NIC or GPU working
> independently of the CPU altogether)
I am not saying this is impossible but this is unlikely they are several
mecanisms. First you have IOMMU it has been defaulted to on by OEM for
last few years (it use to be enabled only on server for virtualization).
Which means that a given device only can access memory that is mapped to
it through the IOMMU page table (usualy each device get their own distinct
IOMMU page table).
Then on device like GPU you have an MMU (no GPU without an MMU for the
last 10 years or more). The MMU is under the control of the kernel driver
of the GPU and for the open source driver we try hard to make sure it can
not be abuse and circumvent by userspace ie we restrict userspace process
to only access memory they own.
I am not saying that this can not happen but that we are trying our best
to avoid it.
Cheers,
J?r?me
On Wed, Jan 24, 2018 at 08:10:53PM +0100, Jann Horn wrote:
> I'm not entirely convinced by the approach of marking small parts of
> kernel memory as readonly for hardening.
It depends how significant the data stored in there are. For example,
storing function pointers in read-only memory provides significant
hardening.
> You're allocating with vmalloc(), which, as far as I know, establishes
> a second mapping in the vmalloc area for pages that are already mapped
> as RW through the physmap. AFAICS, later, when you're trying to make
> pages readonly, you're only changing the protections on the second
> mapping in the vmalloc area, therefore leaving the memory writable
> through the physmap. Is that correct? If so, please either document
> the reasoning why this is okay or change it.
Yes, this is still vulnerable to attacks through the physmap. That's also
true for marking structs as const. We should probably fix that at some
point, but at least they're not vulnerable to heap overruns by small
amounts ... you have to be able to overrun some other array by terabytes.
It's worth having a discussion about whether we want the pmalloc API
or whether we want a slab-based API. We can have a separate discussion
about an API to remove pages from the physmap.
On 26/01/18 07:35, Matthew Wilcox wrote:
> On Wed, Jan 24, 2018 at 08:10:53PM +0100, Jann Horn wrote:
>> I'm not entirely convinced by the approach of marking small parts of
>> kernel memory as readonly for hardening.
>
> It depends how significant the data stored in there are. For example,
> storing function pointers in read-only memory provides significant
> hardening.
>
>> You're allocating with vmalloc(), which, as far as I know, establishes
>> a second mapping in the vmalloc area for pages that are already mapped
>> as RW through the physmap. AFAICS, later, when you're trying to make
>> pages readonly, you're only changing the protections on the second
>> mapping in the vmalloc area, therefore leaving the memory writable
>> through the physmap. Is that correct? If so, please either document
>> the reasoning why this is okay or change it.
>
> Yes, this is still vulnerable to attacks through the physmap. That's also
> true for marking structs as const. We should probably fix that at some
> point, but at least they're not vulnerable to heap overruns by small
> amounts ... you have to be able to overrun some other array by terabytes.
Actually, I think there is something to say in favor of using a vmalloc
based approach, precisely because of the physmap :-P
If I understood correctly, the physmap is primarily meant to speed up
access to physical memory through the TLB. In particular, for kmalloc
based allocations.
Which means that, to perform a physmap-based attack to a kmalloced
allocation, one needs to know:
- the address of the target variable in the kmalloc range
- the randomized offset of the kernel
- the location of the physmap
But, for a vmalloc based allocation, there is one extra hoop: since the
mapping is really per page, now the attacker has actually to walk the
page table, to figure out where to poke in the physmap.
One more thought about physmap: does it map also code?
Because, if it does, and one wants to use it for an attack, isn't it
easier to look for some security test and replace a bne with be or
equivalent?
> It's worth having a discussion about whether we want the pmalloc API
> or whether we want a slab-based API.
pmalloc is meant to be useful where the attack surface is made up of
lots of small allocations - my first use case was the SE Linux policy
DB, where there is a variety of elements being allocated, in large
amount. To the point where having ready made caches would be wasteful.
Then there is the issue I already mentioned about arm/arm64 which would
require to break down large mappings, which seems to be against current
policy, as described in my previous mail:
http://www.openwall.com/lists/kernel-hardening/2018/01/24/11
I do not know exactly what you have in mind wrt slab, but my impression
is that it will most likely gravitate toward the pmalloc implementation.
It will need:
- "pools" or anyway some means to lock only a certain group of pages,
related to a specific kernel user
- (mostly) lockless allocation
- a way to manage granularity (or order of allocation)
Most of this is already provided by genalloc, which is what I ended up
almost re-implementing, before being pointed to it :-)
I only had to add the tracking of end of allocations, which is what the
patch 1/6 does - as side note, is anybody maintaining it?
I could not find an entry in MAINTAINERS
As I mentioned above, using vmalloc adds even an extra layer of protection.
The major downside is the increased TLB use, however this is not so
relevant for the volumes of data that I had to deal with so far:
only few 4K pages.
But you might have in mind something else.
I'd be interested to know what and what would be an obstacle in using
pmalloc. Maybe it can be solved.
--
igor
On 25/01/18 17:38, Jerome Glisse wrote:
> On Thu, Jan 25, 2018 at 10:14:28AM -0500, Boris Lukashev wrote:
>> On Thu, Jan 25, 2018 at 6:59 AM, Igor Stoppa <[email protected]> wrote:
>
> [...]
>
>> DMA/physmap access coupled with a knowledge of which virtual mappings
>> are in the physical space should be enough for an attacker to bypass
>> the gating mechanism this work imposes. Not trivial, but not
>> impossible. Since there's no way to prevent that sort of access in
>> current hardware (especially something like a NIC or GPU working
>> independently of the CPU altogether)
[...]
> I am not saying that this can not happen but that we are trying our best
> to avoid it.
How about an opt-in verification, similar to what proposed by Boris
Lukashev?
When reading back the data, one could access the pointer directly and
bypass the verification, or could use a function that explicitly checks
the integrity of the data.
Starting from an unprotected kmalloc allocation, even just turning the
data into R/O is an improvement, but if one can afford the overhead of
performing the verification, why not?
It would still be better if the service was provided by the library,
instead than implemented by individual users, I think.
--
igor
On Fri, Jan 26, 2018 at 7:28 AM, Igor Stoppa <[email protected]> wrote:
> On 25/01/18 17:38, Jerome Glisse wrote:
>> On Thu, Jan 25, 2018 at 10:14:28AM -0500, Boris Lukashev wrote:
>>> On Thu, Jan 25, 2018 at 6:59 AM, Igor Stoppa <[email protected]> wrote:
>>
>> [...]
>>
>>> DMA/physmap access coupled with a knowledge of which virtual mappings
>>> are in the physical space should be enough for an attacker to bypass
>>> the gating mechanism this work imposes. Not trivial, but not
>>> impossible. Since there's no way to prevent that sort of access in
>>> current hardware (especially something like a NIC or GPU working
>>> independently of the CPU altogether)
>
> [...]
>
>> I am not saying that this can not happen but that we are trying our best
>> to avoid it.
>
> How about an opt-in verification, similar to what proposed by Boris
> Lukashev?
>
> When reading back the data, one could access the pointer directly and
> bypass the verification, or could use a function that explicitly checks
> the integrity of the data.
>
> Starting from an unprotected kmalloc allocation, even just turning the
> data into R/O is an improvement, but if one can afford the overhead of
> performing the verification, why not?
>
I like the idea of making the verification call optional for consumers
allowing for fast/slow+hard paths depending on their needs.
Cant see any additional vectors for abuse (other than the original
ones effecting out-of-band modification) introduced by having
verify/normal callers, but i've not had enough coffee yet. Any access
races or things like that come to mind for anyone? Shouldn't happen
with a write-once allocation, but again, lacking coffee.
> It would still be better if the service was provided by the library,
> instead than implemented by individual users, I think.
>
> --
> igor
-Boris
On 24/01/18 19:56, Igor Stoppa wrote:
[...]
> +bool pmalloc_prealloc(struct gen_pool *pool, size_t size)
> +{
[...]
> +abort:
> + vfree(chunk);
this should be vfree_atomic()
[...]
> +void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp)
> +{
[...]
> +free:
> + vfree(chunk);
and this one too
I will fix them in the next iteration.
I am waiting to see if any more comments arrive.
Otherwise, I'll send it out probably next Tuesday.
--
igor
On 26/01/18 18:36, Boris Lukashev wrote:
> I like the idea of making the verification call optional for consumers
> allowing for fast/slow+hard paths depending on their needs.
> Cant see any additional vectors for abuse (other than the original
> ones effecting out-of-band modification) introduced by having
> verify/normal callers, but i've not had enough coffee yet. Any access
> races or things like that come to mind for anyone?
Well, the devil is in the details.
In this case, the question is how to perform the verification in a way
that is sufficiently robust against races.
After thinking about it for a while, I doubt it can be done reliably.
It might work for some small data types, but the typical use case I have
found myself dealing with, is protecting data structures.
That also brings up a separate problem: what would be the size of data
to hash? At one extreme there is a page, but it's probably too much, so
what is the correct size? it cannot be smaller than a specific
allocation, however that would imply looking for the hash related to the
data being accessed, with extra overhead.
And the data being accessed might be a field in a struct, for which we
would not have any hash.
There would be a hash only for the containing struct that was allocated ...
Overall, it seems a good idea in theory, but when I think about its
implementation, it seems like the overhead is so big that it would
discourage its use for almost any practical purpose.
If one really wants to be paranoid could, otoh have redundancy in a
different pool.
--
igor
On Thu, 25 Jan 2018, Matthew Wilcox wrote:
> It's worth having a discussion about whether we want the pmalloc API
> or whether we want a slab-based API. We can have a separate discussion
> about an API to remove pages from the physmap.
We could even do this in a more thorough way. Can we use a ring 1 / 2
distinction to create a hardened OS core that policies the rest of
the ever expanding kernel with all its modules and this and that feature?
I think that will long term be a better approach and allow more than the
current hardening approaches can get you. It seems that we are willing to
tolerate significant performance regressions now. So lets use the
protection mechanisms that the hardware offers.
+Boris Lukashev
On 02/02/18 20:39, Christopher Lameter wrote:
> On Thu, 25 Jan 2018, Matthew Wilcox wrote:
>
>> It's worth having a discussion about whether we want the pmalloc API
>> or whether we want a slab-based API. We can have a separate discussion
>> about an API to remove pages from the physmap.
>
> We could even do this in a more thorough way. Can we use a ring 1 / 2
> distinction to create a hardened OS core that policies the rest of
> the ever expanding kernel with all its modules and this and that feature?
What would be the differentiating criteria? Furthermore, what are the
chances
of invalidating the entire concept, because there is already an
hypervisor using
the higher level features?
That is what you are proposing, if I understand correctly.
But more on this below ...
> I think that will long term be a better approach and allow more than the
> current hardening approaches can get you. It seems that we are willing to
> tolerate significant performance regressions now. So lets use the
> protection mechanisms that the hardware offers.
I would rather *not* propose significant performance regression :-P
There might be some one-off case or anyway rare event which is
penalized, but my preference goes to not introducing any significant
performance penalty, during regular use.
After all, the lower the penalty, the wider the (potential) adoption.
More in detail: there are 2 major cases for wanting some form of
read-only protection.
1) extra ward against accidental corruption
The kernel provides many debugging tools and they can detect lots of
errors during development, but they require time and knowledge to use
them, which are not always available.
Furthermore, it is objectively true that not all the code has the same
level of maturity, especially when non-upstream code is used in some
custom product. It's not my main goal, but it would be nice if that case
too could be addressed by the protection.
Corruption *can* happen.
Having live guards against it, will definitely help spotting bugs or, at
the very least, crash/reboot a device before it can cause permanent data
corruption.
Protection against accidental corruption should be used as widely as
possible, therefore it cannot have an high price tag, in terms of lost
performance. Otherwise, there's the risk that it will be just a debug
feature, more like lockdep or ubsan.
2) protection against malicious attacks
This is harder, of course, but what is realistically to be expected?
If an attacker can gain full control of the kernel, the only way to do
damage control is to have HW and/or higher privilege SW that can somehow
limit the reach of the attacker.
To make it work for real, it should be mandated that either these extra
HW/SW means can tell apart legitimate kernel activity from rogue
actions, or they operate so independently from the kernel that a
compromise kernel cannot use any API to influence them.
The consensus seems to be to put aside (for now) this concern and
instead focus on what is a typical scenario:
- some bug is found that allows to read/write kernel memory
- some other bug is found, which leaks the address of a well known
variable, effectively revealing the randomized offset of each symbol
placed in linear memory, once their relative location is known.
What is described above is a toolkit that effectively can allow - with
patience - to attack anything that is writable by the kernel.
Including page tables and permissions.
However the typical attack is more like: "let's flip some bit(s)".
Which is where __ro_after_init has its purpose to exist.
My proposal is to extend the same sort of protection also to variables
allocated dynamically.
* make the pages read only, once the data is initialized
* use vmalloc to prevent that exfiltrating the address of an unrelated
variable can easily give away the location of the real target, because
of the individual page mapping vs linear mapping.
Boris Lukashev proposed additional hardening, when accessing a certain
variable, in the form of hash/checksum, but I could not come up with an
implementation that did not have too much overhead.
Re-considering this, one option would be to have a function
"pool_validate()" - probably expensive - that could be invoked by a
piece of code before using the data from the pool.
Not perfect, because it would not be atomic, but it could be used once,
at the beginning of a function, without adding overhead to each access
to the pool that the function would perform.
An attacker would have to time the attack so that the corruption of the
data wold happen after the pool is validated and before the data is read
from it.
Possible, but way tricker than the current unprotected situation.
What I am trying to say, is that even after having multi-ring
implementation (which would be more dependent on HW features), there
would be still the problem of validating the legitimacy of the use of
the API that such implementation would expose.
I'd rather try to preserve performance and still provide a defense
against the more trivial attacks, since other types of attacks are much
harder to perform in the wild.
Of course, I'm interested in alternatives (I'll comment separately on
the compound pages)
The way pmalloc is designed is to take advantage of any page provider.
So far, vmalloc seems to me the best option, but something else might
emerge that works better.
Yet the pmalloc API is, I think, what would be still needed, to let the
rest of the kernel take advantage of this feature.
--
igor
On Sat, Feb 3, 2018 at 2:57 PM, Igor Stoppa <[email protected]> wrote:
>>> On Thu, 25 Jan 2018, Matthew Wilcox wrote:
>
>>>> It's worth having a discussion about whether we want the pmalloc API
>>>> or whether we want a slab-based API.
> I'd love to have some feedback specifically about the API.
>
> I have also some idea about userspace and how to extend the pmalloc
> concept to it:
>
> http://www.openwall.com/lists/kernel-hardening/2018/01/30/20
>
> I'll be AFK intermittently for about 2 weeks, so i might not be able to
> reply immediately, but from my perspective this would be just the
> beginning of a broader hardening of both kernel and userspace that I'd
> like to pursue.
>
> --
> igor
Regarding the notion of validated protected memory, is there a method
by which the resulting checksum could be used in a lookup
table/function to resolve the location of the protected data?
Effectively a hash table of protected allocations, with a benefit of
dedup since any data matching the same key would be the same data
(multiple identical cred structs being pushed around). Should leave
the resolver address/csum in recent memory to check against, right?
--
Boris Lukashev
Systems Architect
Semper Victus
>> On Thu, 25 Jan 2018, Matthew Wilcox wrote:
>>> It's worth having a discussion about whether we want the pmalloc API
>>> or whether we want a slab-based API.
I'd love to have some feedback specifically about the API.
I have also some idea about userspace and how to extend the pmalloc
concept to it:
http://www.openwall.com/lists/kernel-hardening/2018/01/30/20
I'll be AFK intermittently for about 2 weeks, so i might not be able to
reply immediately, but from my perspective this would be just the
beginning of a broader hardening of both kernel and userspace that I'd
like to pursue.
--
igor
On 03/02/18 22:12, Boris Lukashev wrote:
> Regarding the notion of validated protected memory, is there a method
> by which the resulting checksum could be used in a lookup
> table/function to resolve the location of the protected data?
What I have in mind is a checksum at page/vmap_area level, so there
would be no 1:1 mapping between a specific allocation and the checksum.
An extreme case would be the one where an allocation crosses one or more
page boundaries, while the checksum refers to a (partially) overlapping
memory area.
Code accessing a pool could perform one (relatively expensive)
validation. But still something that would require a more sophisticated
attack, to subvert.
> Effectively a hash table of protected allocations, with a benefit of
> dedup since any data matching the same key would be the same data
> (multiple identical cred structs being pushed around). Should leave
> the resolver address/csum in recent memory to check against, right?
I see where you are trying to land, but I do not see how it would work
without a further intermediate step.
pmalloc dishes out virtual memory addresses, when called.
It doesn't know what the user of the allocation will put in it.
The user, otoh, has the direct address of the memory it got.
What you are suggesting, if I have understood it correctly, is that,
when the pool is protected, the addresses already given out, will become
traps that get resolved through a lookup table that is built based on
the content of each allocation.
That seems to generate a lot of overhead, not to mention the fact that
it might not play very well with the MMU.
If I misunderstood, then I'd need a step by step description of what
happens, because it's not clear to me how else the data would be
accessed if not through the address that was obtained when pmalloc was
invoked.
--
igor
On Sat, Feb 3, 2018 at 3:32 PM, Igor Stoppa <[email protected]> wrote:
>
>
> On 03/02/18 22:12, Boris Lukashev wrote:
>
>> Regarding the notion of validated protected memory, is there a method
>> by which the resulting checksum could be used in a lookup
>> table/function to resolve the location of the protected data?
>
> What I have in mind is a checksum at page/vmap_area level, so there
> would be no 1:1 mapping between a specific allocation and the checksum.
>
> An extreme case would be the one where an allocation crosses one or more
> page boundaries, while the checksum refers to a (partially) overlapping
> memory area.
>
> Code accessing a pool could perform one (relatively expensive)
> validation. But still something that would require a more sophisticated
> attack, to subvert.
>
>> Effectively a hash table of protected allocations, with a benefit of
>> dedup since any data matching the same key would be the same data
>> (multiple identical cred structs being pushed around). Should leave
>> the resolver address/csum in recent memory to check against, right?
>
> I see where you are trying to land, but I do not see how it would work
> without a further intermediate step.
>
> pmalloc dishes out virtual memory addresses, when called.
>
> It doesn't know what the user of the allocation will put in it.
> The user, otoh, has the direct address of the memory it got.
>
> What you are suggesting, if I have understood it correctly, is that,
> when the pool is protected, the addresses already given out, will become
> traps that get resolved through a lookup table that is built based on
> the content of each allocation.
>
> That seems to generate a lot of overhead, not to mention the fact that
> it might not play very well with the MMU.
That is effectively what i'm suggesting - as a form of protection for
consumers against direct reads of data which may have been corrupted
by some irrelevant means. In the context of pmalloc, it would probably
be a separate type of ro+verified pool which consumers would
explicitly opt into. Say there's a maintenance cycle on a <name some
scary thing controlled by Linux> and it wants to make sure that the
instructions it read in are what they should have been before running
them, those consumers might well take the penalty if it keeps <said
scary big thing> from doing <the thing we're scared of it doing>.
If such a resolver could be implemented in a manner which doesnt break
all the things (including acceptable performance for at least a
significant number of workloads), it might be useful as a general tool
for handing out memory to userspace, even in rw, as it provides
execution context in which other requirements can be forcibly
resolved, preventing unauthorized access to pages the consumer
shouldn't get in a very generic way. Spectre comes to mind as a
potential class of issues to be addressed this way, since speculative
load could be prevented if the resolution were to fail.
>
> If I misunderstood, then I'd need a step by step description of what
> happens, because it's not clear to me how else the data would be
> accessed if not through the address that was obtained when pmalloc was
> invoked.
>
> --
> igor
--
Boris Lukashev
Systems Architect
Semper Victus
On 04/02/18 00:29, Boris Lukashev wrote:
> On Sat, Feb 3, 2018 at 3:32 PM, Igor Stoppa <[email protected]> wrote:
[...]
>> What you are suggesting, if I have understood it correctly, is that,
>> when the pool is protected, the addresses already given out, will become
>> traps that get resolved through a lookup table that is built based on
>> the content of each allocation.
>>
>> That seems to generate a lot of overhead, not to mention the fact that
>> it might not play very well with the MMU.
>
> That is effectively what i'm suggesting - as a form of protection for
> consumers against direct reads of data which may have been corrupted
> by some irrelevant means. In the context of pmalloc, it would probably
> be a separate type of ro+verified pool
ok, that seems more like an extension though.
ATM I am having problems gaining traction to get even the basic merged :-)
I would consider this as a possibility for future work, unless it is
said that it's necessary for pmalloc to be accepted ...
--
igor
On Sat, 3 Feb 2018, Igor Stoppa wrote:
> > We could even do this in a more thorough way. Can we use a ring 1 / 2
> > distinction to create a hardened OS core that policies the rest of
> > the ever expanding kernel with all its modules and this and that feature?
>
> What would be the differentiating criteria? Furthermore, what are the
> chances
> of invalidating the entire concept, because there is already an
> hypervisor using
> the higher level features?
> That is what you are proposing, if I understand correctly.
Were there not 4 rings as well as methods by the processor vendors to
virtualize them as well?
> > I think that will long term be a better approach and allow more than the
> > current hardening approaches can get you. It seems that we are willing to
> > tolerate significant performance regressions now. So lets use the
> > protection mechanisms that the hardware offers.
>
> I would rather *not* propose significant performance regression :-P
But we already have implemented significant kernel hardening which causes
performance regressions. Using hardware capabilities allows the processor
vendor to further optimize these mechanisms whereas the software
preventative measures are eating up more and more performance as the pile
them on. Plus these are methods that can be worked around. Restrictions
implemented in a higher ring can be enforced and are much better than
just "hardening" (which is making life difficult for the hackers and
throwing away performannce for the average user).
On 05/02/18 17:40, Christopher Lameter wrote:
> On Sat, 3 Feb 2018, Igor Stoppa wrote:
>
>>> We could even do this in a more thorough way. Can we use a ring 1 / 2
>>> distinction to create a hardened OS core that policies the rest of
>>> the ever expanding kernel with all its modules and this and that feature?
>>
>> What would be the differentiating criteria? Furthermore, what are the
>> chances
>> of invalidating the entire concept, because there is already an
>> hypervisor using
>> the higher level features?
>> That is what you are proposing, if I understand correctly.
>
> Were there not 4 rings as well as methods by the processor vendors to
> virtualize them as well?
I think you are talking x86, mostly.
On ARM there are ELx and they are often (typically?) already used.
For x86 I cannot comment.
>>> I think that will long term be a better approach and allow more than the
>>> current hardening approaches can get you. It seems that we are willing to
>>> tolerate significant performance regressions now. So lets use the
>>> protection mechanisms that the hardware offers.
>>
>> I would rather *not* propose significant performance regression :-P
>
> But we already have implemented significant kernel hardening which causes
> performance regressions. Using hardware capabilities allows the processor
> vendor to further optimize these mechanisms whereas the software
> preventative measures are eating up more and more performance as the pile
> them on. Plus these are methods that can be worked around. Restrictions
> implemented in a higher ring can be enforced and are much better than
> just "hardening" (which is making life difficult for the hackers and
> throwing away performannce for the average user).
What you are proposing requires major restructuring of the memory
management - at the very least - provided that it doesn't cause the
conflicts I mentioned above.
Even after you do that, the system will still be working with memory
pages, there will be still a need to segregate data within certain
pages, or pay the penalty of handling exceptions, when data with
different permissions coexist within the same page.
The way the pmalloc API is designed is meant to facilitate the
segregation and to actually improve performance, by grouping types of
data with same scope and permission.
WRT the implementation, there is a minimal exposure to the memory
provider, both for allocation and release.
Same goes for the protection mechanism.
It's a single call to the function which makes pages read only.
It would be trivial to swap it out with a call to whatever framework you
want to come up with, for implementing ring/EL based protection.
From this perspective, you can easily provide patches that implement
what you are proposing, against pmalloc, if you really think that it's
the way to go.
I'll be happy to use them, if they provide improved performance and same
or better protection.
The way I designed pmalloc was really to be able to switch to some
alternate memory provider and/or protection mechanism, should a better
one arise.
But it can be done in a separate step, I think, since you are not
proposing to just change pmalloc, you are proposing to re-design how the
overall kernel memory hardening works (including executable pages, const
data, __ro_after_init, etc.)
--
igor
On Sun, Feb 4, 2018 at 7:05 AM, Igor Stoppa <[email protected]> wrote:
> On 04/02/18 00:29, Boris Lukashev wrote:
>> On Sat, Feb 3, 2018 at 3:32 PM, Igor Stoppa <[email protected]> wrote:
>
> [...]
>
>>> What you are suggesting, if I have understood it correctly, is that,
>>> when the pool is protected, the addresses already given out, will become
>>> traps that get resolved through a lookup table that is built based on
>>> the content of each allocation.
>>>
>>> That seems to generate a lot of overhead, not to mention the fact that
>>> it might not play very well with the MMU.
>>
>> That is effectively what i'm suggesting - as a form of protection for
>> consumers against direct reads of data which may have been corrupted
>> by some irrelevant means. In the context of pmalloc, it would probably
>> be a separate type of ro+verified pool
> ok, that seems more like an extension though.
>
> ATM I am having problems gaining traction to get even the basic merged :-)
>
> I would consider this as a possibility for future work, unless it is
> said that it's necessary for pmalloc to be accepted ...
I would agree: let's get basic functionality in first. Both
verification and the physmap part can be done separately, IMO.
-Kees
--
Kees Cook
Pixel Security
On 02/12/2018 03:27 PM, Kees Cook wrote:
> On Sun, Feb 4, 2018 at 7:05 AM, Igor Stoppa <[email protected]> wrote:
>> On 04/02/18 00:29, Boris Lukashev wrote:
>>> On Sat, Feb 3, 2018 at 3:32 PM, Igor Stoppa <[email protected]> wrote:
>>
>> [...]
>>
>>>> What you are suggesting, if I have understood it correctly, is that,
>>>> when the pool is protected, the addresses already given out, will become
>>>> traps that get resolved through a lookup table that is built based on
>>>> the content of each allocation.
>>>>
>>>> That seems to generate a lot of overhead, not to mention the fact that
>>>> it might not play very well with the MMU.
>>>
>>> That is effectively what i'm suggesting - as a form of protection for
>>> consumers against direct reads of data which may have been corrupted
>>> by some irrelevant means. In the context of pmalloc, it would probably
>>> be a separate type of ro+verified pool
>> ok, that seems more like an extension though.
>>
>> ATM I am having problems gaining traction to get even the basic merged :-)
>>
>> I would consider this as a possibility for future work, unless it is
>> said that it's necessary for pmalloc to be accepted ...
>
> I would agree: let's get basic functionality in first. Both
> verification and the physmap part can be done separately, IMO.
Skipping over physmap leaves a pretty big area of exposure that could
be difficult to solve later. I appreciate this might block basic
functionality but I don't think we should just gloss over it without
at least some idea of what we would do.
Thanks,
Laura
On Mon, Feb 12, 2018 at 4:40 PM, Laura Abbott <[email protected]> wrote:
> On 02/12/2018 03:27 PM, Kees Cook wrote:
>>
>> On Sun, Feb 4, 2018 at 7:05 AM, Igor Stoppa <[email protected]>
>> wrote:
>>>
>>> On 04/02/18 00:29, Boris Lukashev wrote:
>>>>
>>>> On Sat, Feb 3, 2018 at 3:32 PM, Igor Stoppa <[email protected]>
>>>> wrote:
>>>
>>>
>>> [...]
>>>
>>>>> What you are suggesting, if I have understood it correctly, is that,
>>>>> when the pool is protected, the addresses already given out, will
>>>>> become
>>>>> traps that get resolved through a lookup table that is built based on
>>>>> the content of each allocation.
>>>>>
>>>>> That seems to generate a lot of overhead, not to mention the fact that
>>>>> it might not play very well with the MMU.
>>>>
>>>>
>>>> That is effectively what i'm suggesting - as a form of protection for
>>>> consumers against direct reads of data which may have been corrupted
>>>> by some irrelevant means. In the context of pmalloc, it would probably
>>>> be a separate type of ro+verified pool
>>>
>>> ok, that seems more like an extension though.
>>>
>>> ATM I am having problems gaining traction to get even the basic merged
>>> :-)
>>>
>>> I would consider this as a possibility for future work, unless it is
>>> said that it's necessary for pmalloc to be accepted ...
>>
>>
>> I would agree: let's get basic functionality in first. Both
>> verification and the physmap part can be done separately, IMO.
>
>
> Skipping over physmap leaves a pretty big area of exposure that could
> be difficult to solve later. I appreciate this might block basic
> functionality but I don't think we should just gloss over it without
> at least some idea of what we would do.
What's our exposure on physmap for other regions? e.g. things that are
executable, or made read-only later (like __ro_after_init)?
-Kees
--
Kees Cook
Pixel Security
On Tue, Feb 13, 2018 at 2:25 AM, Kees Cook <[email protected]> wrote:
> On Mon, Feb 12, 2018 at 4:40 PM, Laura Abbott <[email protected]> wrote:
>> On 02/12/2018 03:27 PM, Kees Cook wrote:
>>>
>>> On Sun, Feb 4, 2018 at 7:05 AM, Igor Stoppa <[email protected]>
>>> wrote:
>>>>
>>>> On 04/02/18 00:29, Boris Lukashev wrote:
>>>>>
>>>>> On Sat, Feb 3, 2018 at 3:32 PM, Igor Stoppa <[email protected]>
>>>>> wrote:
>>>>
>>>>
>>>> [...]
>>>>
>>>>>> What you are suggesting, if I have understood it correctly, is that,
>>>>>> when the pool is protected, the addresses already given out, will
>>>>>> become
>>>>>> traps that get resolved through a lookup table that is built based on
>>>>>> the content of each allocation.
>>>>>>
>>>>>> That seems to generate a lot of overhead, not to mention the fact that
>>>>>> it might not play very well with the MMU.
>>>>>
>>>>>
>>>>> That is effectively what i'm suggesting - as a form of protection for
>>>>> consumers against direct reads of data which may have been corrupted
>>>>> by some irrelevant means. In the context of pmalloc, it would probably
>>>>> be a separate type of ro+verified pool
>>>>
>>>> ok, that seems more like an extension though.
>>>>
>>>> ATM I am having problems gaining traction to get even the basic merged
>>>> :-)
>>>>
>>>> I would consider this as a possibility for future work, unless it is
>>>> said that it's necessary for pmalloc to be accepted ...
>>>
>>>
>>> I would agree: let's get basic functionality in first. Both
>>> verification and the physmap part can be done separately, IMO.
>>
>>
>> Skipping over physmap leaves a pretty big area of exposure that could
>> be difficult to solve later. I appreciate this might block basic
>> functionality but I don't think we should just gloss over it without
>> at least some idea of what we would do.
>
> What's our exposure on physmap for other regions? e.g. things that are
> executable, or made read-only later (like __ro_after_init)?
I just checked on a system with a 4.9 kernel, and there seems to be no
physical memory that is mapped as writable in the init PGD and
executable elsewhere.
Ah, I think I missed something. At least on X86, set_memory_ro,
set_memory_rw, set_memory_nx and set_memory_x all use
change_page_attr_clear/change_page_attr_set, which use
change_page_attr_set_clr, which calls __change_page_attr_set_clr()
with a second parameter "checkalias" that is set to 1 unless the bit
being changed is the NX bit, and that parameter causes the invocation
of cpa_process_alias(), which will, for mapped ranges, also change the
attributes of physmap ranges. set_memory_ro() and so on are also used
by the module loading code.
But in the ARM64 code, I don't see anything similar. Does anyone with
a better understanding of ARM64 want to check whether I missed
something? Or maybe, with a recent kernel, check whether executable
module pages show up with a second writable mapping in the
"kernel_page_tables" file in debugfs?
On 02/12/2018 07:39 PM, Jann Horn wrote:
> On Tue, Feb 13, 2018 at 2:25 AM, Kees Cook <[email protected]> wrote:
>> On Mon, Feb 12, 2018 at 4:40 PM, Laura Abbott <[email protected]> wrote:
>>> On 02/12/2018 03:27 PM, Kees Cook wrote:
>>>>
>>>> On Sun, Feb 4, 2018 at 7:05 AM, Igor Stoppa <[email protected]>
>>>> wrote:
>>>>>
>>>>> On 04/02/18 00:29, Boris Lukashev wrote:
>>>>>>
>>>>>> On Sat, Feb 3, 2018 at 3:32 PM, Igor Stoppa <[email protected]>
>>>>>> wrote:
>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>>> What you are suggesting, if I have understood it correctly, is that,
>>>>>>> when the pool is protected, the addresses already given out, will
>>>>>>> become
>>>>>>> traps that get resolved through a lookup table that is built based on
>>>>>>> the content of each allocation.
>>>>>>>
>>>>>>> That seems to generate a lot of overhead, not to mention the fact that
>>>>>>> it might not play very well with the MMU.
>>>>>>
>>>>>>
>>>>>> That is effectively what i'm suggesting - as a form of protection for
>>>>>> consumers against direct reads of data which may have been corrupted
>>>>>> by some irrelevant means. In the context of pmalloc, it would probably
>>>>>> be a separate type of ro+verified pool
>>>>>
>>>>> ok, that seems more like an extension though.
>>>>>
>>>>> ATM I am having problems gaining traction to get even the basic merged
>>>>> :-)
>>>>>
>>>>> I would consider this as a possibility for future work, unless it is
>>>>> said that it's necessary for pmalloc to be accepted ...
>>>>
>>>>
>>>> I would agree: let's get basic functionality in first. Both
>>>> verification and the physmap part can be done separately, IMO.
>>>
>>>
>>> Skipping over physmap leaves a pretty big area of exposure that could
>>> be difficult to solve later. I appreciate this might block basic
>>> functionality but I don't think we should just gloss over it without
>>> at least some idea of what we would do.
>>
>> What's our exposure on physmap for other regions? e.g. things that are
>> executable, or made read-only later (like __ro_after_init)?
>
> I just checked on a system with a 4.9 kernel, and there seems to be no
> physical memory that is mapped as writable in the init PGD and
> executable elsewhere.
>
> Ah, I think I missed something. At least on X86, set_memory_ro,
> set_memory_rw, set_memory_nx and set_memory_x all use
> change_page_attr_clear/change_page_attr_set, which use
> change_page_attr_set_clr, which calls __change_page_attr_set_clr()
> with a second parameter "checkalias" that is set to 1 unless the bit
> being changed is the NX bit, and that parameter causes the invocation
> of cpa_process_alias(), which will, for mapped ranges, also change the
> attributes of physmap ranges. set_memory_ro() and so on are also used
> by the module loading code.
>
> But in the ARM64 code, I don't see anything similar. Does anyone with
> a better understanding of ARM64 want to check whether I missed
> something? Or maybe, with a recent kernel, check whether executable
> module pages show up with a second writable mapping in the
> "kernel_page_tables" file in debugfs?
>
No, arm64 doesn't fixup the aliases, mostly because arm64 uses larger
page sizes which can't be broken down at runtime. CONFIG_PAGE_POISONING
does use 4K pages which could be adjusted at runtime. So yes, you are
right we would have physmap exposure on arm64 as well.
To the original question, it does sound like we are actually okay
with the physmap.
Thanks,
Laura
On Tue, Feb 13, 2018 at 8:09 AM, Laura Abbott <[email protected]> wrote:
> No, arm64 doesn't fixup the aliases, mostly because arm64 uses larger
> page sizes which can't be broken down at runtime. CONFIG_PAGE_POISONING
> does use 4K pages which could be adjusted at runtime. So yes, you are
> right we would have physmap exposure on arm64 as well.
Errr, so that means even modules and kernel code are writable via the
arm64 physmap? That seems extraordinarily bad. :(
-Kees
--
Kees Cook
Pixel Security
On Wed, Feb 14, 2018 at 11:06 AM, Laura Abbott <[email protected]> wrote:
> On 02/13/2018 01:43 PM, Kees Cook wrote:
>>
>> On Tue, Feb 13, 2018 at 8:09 AM, Laura Abbott <[email protected]> wrote:
>>>
>>> No, arm64 doesn't fixup the aliases, mostly because arm64 uses larger
>>> page sizes which can't be broken down at runtime. CONFIG_PAGE_POISONING
>>> does use 4K pages which could be adjusted at runtime. So yes, you are
>>> right we would have physmap exposure on arm64 as well.
>>
>>
>> Errr, so that means even modules and kernel code are writable via the
>> arm64 physmap? That seems extraordinarily bad. :(
>>
>> -Kees
>>
>
> (adding linux-arm-kernel and changing the subject)
>
> Kernel code should be fine, if it isn't that is a bug that should be
> fixed. Modules yes are not fully protected. The conclusion from past
I think that's a pretty serious problem: we can't have aliases with
mismatched permissions; this degrades a deterministic protection
(read-only) to a probabilistic protection (knowing where the alias of
a target is mapped). Having an attack be "needs some info leaks"
instead of "need execution control to change perms" is a much lower
bar, IMO.
> experience has been that we cannot safely break down larger page sizes
> at runtime like x86 does. We could theoretically
> add support for fixing up the alias if PAGE_POISONING is enabled but
> I don't know who would actually use that in production. Performance
> is very poor at that point.
Why does using finer granularity on the physmap degrade performance? I
assume TLB pressure, but what is heavily using that area? (I must not
be understanding what physmap actually gets used for -- I thought it
was just a convenience to have a 1:1 virt/phys map for some lookups?)
-Kees
--
Kees Cook
Pixel Security
On Wed, Feb 14, 2018 at 11:29 AM, Kees Cook <[email protected]> wrote:
> Why does using finer granularity on the physmap degrade performance? I
> assume TLB pressure, but what is heavily using that area? (I must not
> be understanding what physmap actually gets used for -- I thought it
> was just a convenience to have a 1:1 virt/phys map for some lookups?)
Jann has sorted me out: it's that physmap isn't an _alias_ for the
buddy allocator memory areas; it's used directly.
-Kees
--
Kees Cook
Pixel Security
On Wed, Feb 14, 2018 at 11:06 AM, Laura Abbott <[email protected]> wrote:
> fixed. Modules yes are not fully protected. The conclusion from past
> experience has been that we cannot safely break down larger page sizes
> at runtime like x86 does. We could theoretically
> add support for fixing up the alias if PAGE_POISONING is enabled but
> I don't know who would actually use that in production. Performance
> is very poor at that point.
XPFO forces 4K pages on the physmap[1] for similar reasons. I have no
doubt about performance changes, but I'd be curious to see real
numbers. Did anyone do benchmarks on just the huge/4K change? (Without
also the XPFO overhead?)
If this, XPFO, and PAGE_POISONING all need it, I think we have to
start a closer investigation. :)
-Kees
[1] http://www.openwall.com/lists/kernel-hardening/2017/09/07/13
--
Kees Cook
Pixel Security
On 02/13/2018 01:43 PM, Kees Cook wrote:
> On Tue, Feb 13, 2018 at 8:09 AM, Laura Abbott <[email protected]> wrote:
>> No, arm64 doesn't fixup the aliases, mostly because arm64 uses larger
>> page sizes which can't be broken down at runtime. CONFIG_PAGE_POISONING
>> does use 4K pages which could be adjusted at runtime. So yes, you are
>> right we would have physmap exposure on arm64 as well.
>
> Errr, so that means even modules and kernel code are writable via the
> arm64 physmap? That seems extraordinarily bad. :(
>
> -Kees
>
(adding linux-arm-kernel and changing the subject)
Kernel code should be fine, if it isn't that is a bug that should be
fixed. Modules yes are not fully protected. The conclusion from past
experience has been that we cannot safely break down larger page sizes
at runtime like x86 does. We could theoretically
add support for fixing up the alias if PAGE_POISONING is enabled but
I don't know who would actually use that in production. Performance
is very poor at that point.
Thanks,
Laura
On 14 February 2018 at 19:06, Laura Abbott <[email protected]> wrote:
> On 02/13/2018 01:43 PM, Kees Cook wrote:
>>
>> On Tue, Feb 13, 2018 at 8:09 AM, Laura Abbott <[email protected]> wrote:
>>>
>>> No, arm64 doesn't fixup the aliases, mostly because arm64 uses larger
>>> page sizes which can't be broken down at runtime. CONFIG_PAGE_POISONING
>>> does use 4K pages which could be adjusted at runtime. So yes, you are
>>> right we would have physmap exposure on arm64 as well.
>>
>>
>> Errr, so that means even modules and kernel code are writable via the
>> arm64 physmap? That seems extraordinarily bad. :(
>>
>> -Kees
>>
>
> (adding linux-arm-kernel and changing the subject)
>
> Kernel code should be fine, if it isn't that is a bug that should be
> fixed.
We take care to ensure that the linear alias of the core kernel's
.text and .rodata segments are mapped read-only. When we first moved
the kernel out of the linear region, we did not map it there at all
anymore, but that broke hibernation so we had to put something back.
> Modules yes are not fully protected. The conclusion from past
> experience has been that we cannot safely break down larger page sizes
> at runtime like x86 does. We could theoretically
> add support for fixing up the alias if PAGE_POISONING is enabled but
> I don't know who would actually use that in production. Performance
> is very poor at that point.
>
As long as the linear alias of the module is mapped down to pages, we
should be able to tweak the permissions. I take it that PAGE_POISONING
does more than just that?
On 02/14/2018 11:28 AM, Ard Biesheuvel wrote:
> On 14 February 2018 at 19:06, Laura Abbott <[email protected]> wrote:
>> On 02/13/2018 01:43 PM, Kees Cook wrote:
>>>
>>> On Tue, Feb 13, 2018 at 8:09 AM, Laura Abbott <[email protected]> wrote:
>>>>
>>>> No, arm64 doesn't fixup the aliases, mostly because arm64 uses larger
>>>> page sizes which can't be broken down at runtime. CONFIG_PAGE_POISONING
>>>> does use 4K pages which could be adjusted at runtime. So yes, you are
>>>> right we would have physmap exposure on arm64 as well.
>>>
>>>
>>> Errr, so that means even modules and kernel code are writable via the
>>> arm64 physmap? That seems extraordinarily bad. :(
>>>
>>> -Kees
>>>
>>
>> (adding linux-arm-kernel and changing the subject)
>>
>> Kernel code should be fine, if it isn't that is a bug that should be
>> fixed.
>
> We take care to ensure that the linear alias of the core kernel's
> .text and .rodata segments are mapped read-only. When we first moved
> the kernel out of the linear region, we did not map it there at all
> anymore, but that broke hibernation so we had to put something back.
>
>> Modules yes are not fully protected. The conclusion from past
>> experience has been that we cannot safely break down larger page sizes
>> at runtime like x86 does. We could theoretically
>> add support for fixing up the alias if PAGE_POISONING is enabled but
>> I don't know who would actually use that in production. Performance
>> is very poor at that point.
>>
>
> As long as the linear alias of the module is mapped down to pages, we
> should be able to tweak the permissions. I take it that PAGE_POISONING
> does more than just that?
>
Page poisoning does exactly that. The argument I was trying to make
was that if nobody really uses page poisoning except for debugging
it might not be worth it to fix up the alias. Thinking a bit more,
this is a terrible argument for many reasons so yes I agree that
we can just fix up the alias if PAGE_POISONING (or other features)
are enabled.
Thanks,
Laura
On Wed, Feb 14, 2018 at 11:48:38AM -0800, Kees Cook wrote:
> On Wed, Feb 14, 2018 at 11:06 AM, Laura Abbott <[email protected]> wrote:
> > fixed. Modules yes are not fully protected. The conclusion from past
> > experience has been that we cannot safely break down larger page sizes
> > at runtime like x86 does. We could theoretically
> > add support for fixing up the alias if PAGE_POISONING is enabled but
> > I don't know who would actually use that in production. Performance
> > is very poor at that point.
>
> XPFO forces 4K pages on the physmap[1] for similar reasons. I have no
> doubt about performance changes, but I'd be curious to see real
> numbers. Did anyone do benchmarks on just the huge/4K change? (Without
> also the XPFO overhead?)
>
> If this, XPFO, and PAGE_POISONING all need it, I think we have to
> start a closer investigation. :)
I haven't but it shouldn't be too hard. What benchmarks are you
thinking?
Tycho
On Wed, Feb 14, 2018 at 2:13 PM, Tycho Andersen <[email protected]> wrote:
> On Wed, Feb 14, 2018 at 11:48:38AM -0800, Kees Cook wrote:
>> On Wed, Feb 14, 2018 at 11:06 AM, Laura Abbott <[email protected]> wrote:
>> > fixed. Modules yes are not fully protected. The conclusion from past
>> > experience has been that we cannot safely break down larger page sizes
>> > at runtime like x86 does. We could theoretically
>> > add support for fixing up the alias if PAGE_POISONING is enabled but
>> > I don't know who would actually use that in production. Performance
>> > is very poor at that point.
>>
>> XPFO forces 4K pages on the physmap[1] for similar reasons. I have no
>> doubt about performance changes, but I'd be curious to see real
>> numbers. Did anyone do benchmarks on just the huge/4K change? (Without
>> also the XPFO overhead?)
>>
>> If this, XPFO, and PAGE_POISONING all need it, I think we have to
>> start a closer investigation. :)
>
> I haven't but it shouldn't be too hard. What benchmarks are you
> thinking?
Unless I'm looking at some specific micro benchmark, I tend to default
to looking at kernel build benchmarks but that gets pretty noisy.
Laura regularly uses hackbench, IIRC. I'm not finding the pastebin I
had for that, though.
I wonder if we need a benchmark subdirectory in tools/testing/, so we
could collect some of these common tools? All benchmarks are terrible,
but at least we'd have the same terrible benchmarks. :)
-Kees
--
Kees Cook
Pixel Security
On 14/02/18 21:29, Kees Cook wrote:
> On Wed, Feb 14, 2018 at 11:06 AM, Laura Abbott <[email protected]> wrote:
[...]
>> Kernel code should be fine, if it isn't that is a bug that should be
>> fixed. Modules yes are not fully protected. The conclusion from past
>
> I think that's a pretty serious problem: we can't have aliases with
> mismatched permissions; this degrades a deterministic protection
> (read-only) to a probabilistic protection (knowing where the alias of
> a target is mapped). Having an attack be "needs some info leaks"
> instead of "need execution control to change perms" is a much lower
> bar, IMO.
Why "need execution control to change permission"?
Or, iow, what does it mean exactly?
ROP/JOP? Data-oriented control flow hijack?
Unless I misunderstand the meaning of "need execution control", I think
that "need write capability to arbitrary data address" should be
sufficient, albeit uncomfortable to use.
OTOH, "need read/write capability from/to arbitrary data address" would
be enough, I think, assuming that one knows the offset where to write to
- but that information could be inferred, for example, by scanning the
memory for known patterns.
IMHO the attack surface is so vast that it's not unreasonable to expect
that it will be possible to fish out means to perform arbitrary R/W into
kernel address space. Ex: some more recent/less tested driver.
One can argue that this sort of R/W activity probably does require some
form of execution control, but AFAIK, the only way to to prevent it, is
to have CFI - btw, is there any standardization in that sense?
So, from my (pessimistic?) perspective, the best that can be hoped for,
is to make it much harder to figure out where the data is located.
Virtual mapping has this side effect, compared to linear mapping.
But, once easier attack targets are removed, I suspect the page mapping
will become the next target.
--
igor
On Tue, Feb 20, 2018 at 8:28 AM, Igor Stoppa <[email protected]> wrote:
>
>
> On 14/02/18 21:29, Kees Cook wrote:
>> On Wed, Feb 14, 2018 at 11:06 AM, Laura Abbott <[email protected]> wrote:
>
> [...]
>
>>> Kernel code should be fine, if it isn't that is a bug that should be
>>> fixed. Modules yes are not fully protected. The conclusion from past
>>
>> I think that's a pretty serious problem: we can't have aliases with
>> mismatched permissions; this degrades a deterministic protection
>> (read-only) to a probabilistic protection (knowing where the alias of
>> a target is mapped). Having an attack be "needs some info leaks"
>> instead of "need execution control to change perms" is a much lower
>> bar, IMO.
>
> Why "need execution control to change permission"?
> Or, iow, what does it mean exactly?
> ROP/JOP? Data-oriented control flow hijack?
Right, I mean, if an attacker has already gained execute control, they
can just call the needed functions to change memory permissions. But
that isn't needed if there is a mismatch between physmap and virtmap:
i.e. they can write to the physmap without needing to change perms
first.
> One can argue that this sort of R/W activity probably does require some
> form of execution control, but AFAIK, the only way to to prevent it, is
> to have CFI - btw, is there any standardization in that sense?
I meant that I don't want a difference in protection between physmap
and virtmap. I'd like to be able to reason the smae about the
exposures in either.
> So, from my (pessimistic?) perspective, the best that can be hoped for,
> is to make it much harder to figure out where the data is located.
>
> Virtual mapping has this side effect, compared to linear mapping.
Right, this is good, for sure. No complaints there at all. It's why I
think pmalloc and arm64 physmap perms are separate issues.
-Kees
--
Kees Cook
Pixel Security