This is v5 of the x86 KASLR improvement series from Yinghai, Baoquan,
and myself. The current branch lives here:
http://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=kaslr/highmem
***Background:
Bugs have been reported around kdump, kexec, and some netboot situations
that didn't work when KASLR was enabled. While discussing the bugs, it
was found that the current KASLR implementation had various limitations,
but most importantly that it can only randomize in a 1GB region of
physical memory.
The current KASLR implementation only randomizes the base physical
address of the kernel. If the delta between the build-time load address
and the KASLR run-time load address (i.e. the physical address where the
kernel actually decompressed) is not equal to 0, relocation handling is
performed using the delta. Though in principle the kernel can be randomized
to any physical address, the physical kernel text mapping address space
is limited to 1G and the virtual address is just offset by the same
amount. On x86_64 the result is the following range:
[0xffffffff80000000, 0xffffffffc0000000)
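For illustration (made-up numbers): with LOAD_PHYSICAL_ADDR at 16M, if the
kernel is decompressed at physical 48M, the delta is 32M and the kernel
text runs at 0xffffffff83000000 instead of the default 0xffffffff81000000;
the physical and virtual offsets are locked together.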
hpa and Vivek suggested we should change this by decoupling the physical
and virtual address randomization of kernel text and letting them work
separately. Then the kernel text physical address can be randomized in the
region [16M, 64T), and the kernel text virtual address can be randomized in
the region [0xffffffff80000000, 0xffffffffc0000000).
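For example (illustrative values only), after this series the kernel could
be decompressed at physical address 8G while its text is mapped and runs
at 0xffffffff90000000; the two choices no longer constrain each other.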
***Problems that needed solving:
- When booting via startup_32, only a 0~4G identity mapping is
built. If the kernel can be placed randomly anywhere from 16M up to 64T,
the price of building identity mappings for the entire region is too high.
We need to build the identity mapping on demand, not covering all
of the physical address space.
- Decouple the physical address and virtual address randomization of kernel
text and let them work separately.
***Parts:
- The 1st part is clean-up and improvements to help the rest of the series.
(Patches 01-10)
- The 2nd part is Yinghai's building of identity mappings on demand.
This is used to solve the first problem mentioned above.
(Patches 11-12)
- The 3rd part is Baoquan's decoupling the physical address and virtual
address randomization of kernel text and letting them work separately,
based on Yinghai's ident mapping patches.
(Patches 13-21)
I've boot tested this a bunch on 32-bit and 64-bit, and things appear
to be working as expected. I've cleaned up the changelogs, improved
some comments, refactored a few things, etc. Changes are noted in the
individual changelogs.
Thanks!
-Kees
v4->v5:
- rewrote all the changelogs, and several comments.
- refactored e820 parser to use a while loop instead of goto.
- rearranged removal of CONFIG_RANDOMIZE_BASE_MAX_OFFSET to earlier.
- additionally dropped KERNEL_IMAGE_SIZE_DEFAULT
- refactored minimum address calculation
- refactored slot offset calculation for readability
- fixed 32-bit boot failures
- fixed CONFIG_RANDOMIZE_BASE=n boot failure
- improved debug reporting
[Baoquan's history]
v3->v4:
- Made changes according to Kees's comments.
Added one patch, 20/20, as Kees suggested, to use KERNEL_IMAGE_SIZE as the
maximum offset for virtual randomization, and cleaned up the now-unneeded
CONFIG_RANDOM_OFFSET_MAX:
x86, kaslr: Use KERNEL_IMAGE_SIZE as the offset max for kernel virtual randomization
v2->v3:
- It only takes care of the kaslr related patches.
For reviewers it's better to discuss only one issue in one thread.
* I took out one patch of Yinghai's because I think it's unnecessary:
- Patch 05/19 x86, kaslr: rename output_size to output_run_size
output_size is enough to represent the value:
output_len > run_size ? output_len : run_size
* I added Patch 04/19, a comment update patch. For the other patches, I just
adjusted the patch logs and made changes in several places compared with the
2nd round. Please check the change log under each patch's log for details.
* Adjusted the order of several patches to make review easier. This doesn't
affect the code.
v1->v2:
- In the 2nd round, Yinghai posted a big patchset including this KASLR fix
and another setup_data-related fix. The link is here:
http://lists-archives.com/linux-kernel/28346903-x86-updated-patches-for-kaslr-and-setup_data-etc-for-v4-3.html
You can get the code from Yinghai's git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-v4.3-next
v1:
- The first round can be found here:
https://lwn.net/Articles/637115/
From: Yinghai Lu <[email protected]>
Currently z_extract_offset is calculated in boot/compressed/mkpiggy.c.
This doesn't work well because mkpiggy.c doesn't know the details of the
decompressor in use. As a result, it can only make an estimation, which
has risks:
- output + output_len (VO) could be much bigger than input + input_len
(ZO). In this case, the decompressed kernel plus relocs could overwrite
the decompression code while it is running.
- The head code of ZO could be bigger than z_extract_offset. In this case
an overwrite could happen while the head code is running to move ZO to
the end of the buffer. Though currently the size of the head code is very
small, it's still a potential risk. Since there is no rule to limit the
size of the head code of ZO, it runs the risk of suddenly becoming a
(hard to find) bug.
Instead, this moves the z_extract_offset calculation into header.S, and
makes adjustments to be sure that the above two cases can never happen.
Since we have (previously) made ZO always be located against the end of the
decompression buffer, z_extract_offset is only used here to calculate an
appropriate buffer size (INIT_SIZE), and is no longer used elsewhere. As
such, it can be removed from voffset.h.
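To see the new calculation concretely, here is a worked example with
hypothetical sizes (not from a real build), say z_input_len = 6M and
z_output_len = 24M:

  ZO_z_extra_bytes        = (24M >> 12) + 65536 + 128 = 71808 (~70K)
  ZO_z_extract_offset     = 24M + 71808 - 6M          = ~18M + 70K
  ZO_z_min_extract_offset = the above, rounded up to a 4K boundary
  ZO_INIT_SIZE            = ZO__end - ZO_startup_32 + ZO_z_min_extract_offset

Since z_output_len > z_input_len here, the offset covers the full size
difference plus the safety margin, so the decompressed kernel can never
overwrite the still-running decompressor.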
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/Makefile | 2 +-
arch/x86/boot/compressed/misc.c | 5 +----
arch/x86/boot/compressed/mkpiggy.c | 15 +--------------
arch/x86/boot/header.S | 23 ++++++++++++++++++++++-
4 files changed, 25 insertions(+), 20 deletions(-)
diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
index b1ef9e489084..942f7dabfb1e 100644
--- a/arch/x86/boot/Makefile
+++ b/arch/x86/boot/Makefile
@@ -95,7 +95,7 @@ targets += voffset.h
$(obj)/voffset.h: vmlinux FORCE
$(call if_changed,voffset)
-sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|input_data\|_end\|z_.*\)$$/\#define ZO_\2 0x\1/p'
+sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|input_data\|_end\|_ehead\|_text\|z_.*\)$$/\#define ZO_\2 0x\1/p'
quiet_cmd_zoffset = ZOFFSET $@
cmd_zoffset = $(NM) $< | sed -n $(sed-zoffset) > $@
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index c4477d5f3fff..e2a998f8c304 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -83,13 +83,10 @@
* To avoid problems with the compressed data's meta information an extra 18
* bytes are needed. Leading to the formula:
*
- * extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size.
+ * extra_bytes = (uncompressed_size >> 12) + 32768 + 18.
*
* Adding 8 bytes per 32K is a bit excessive but much easier to calculate.
* Adding 32768 instead of 32767 just makes for round numbers.
- * Adding the decompressor_size is necessary as it musht live after all
- * of the data as well. Last I measured the decompressor is about 14K.
- * 10K of actual data and 4K of bss.
*
*/
diff --git a/arch/x86/boot/compressed/mkpiggy.c b/arch/x86/boot/compressed/mkpiggy.c
index b980046c3329..a613c84d9b88 100644
--- a/arch/x86/boot/compressed/mkpiggy.c
+++ b/arch/x86/boot/compressed/mkpiggy.c
@@ -21,8 +21,7 @@
* ----------------------------------------------------------------------- */
/*
- * Compute the desired load offset from a compressed program; outputs
- * a small assembly wrapper with the appropriate symbols defined.
+ * outputs a small assembly wrapper with the appropriate symbols defined.
*/
#include <stdlib.h>
@@ -35,7 +34,6 @@ int main(int argc, char *argv[])
{
uint32_t olen;
long ilen;
- unsigned long offs;
unsigned long run_size;
FILE *f = NULL;
int retval = 1;
@@ -67,15 +65,6 @@ int main(int argc, char *argv[])
ilen = ftell(f);
olen = get_unaligned_le32(&olen);
- /*
- * Now we have the input (compressed) and output (uncompressed)
- * sizes, compute the necessary decompression offset...
- */
-
- offs = (olen > ilen) ? olen - ilen : 0;
- offs += olen >> 12; /* Add 8 bytes for each 32K block */
- offs += 64*1024 + 128; /* Add 64K + 128 bytes slack */
- offs = (offs+4095) & ~4095; /* Round to a 4K boundary */
run_size = atoi(argv[2]);
printf(".section \".rodata..compressed\",\"a\",@progbits\n");
@@ -83,8 +72,6 @@ int main(int argc, char *argv[])
printf("z_input_len = %lu\n", ilen);
printf(".globl z_output_len\n");
printf("z_output_len = %lu\n", (unsigned long)olen);
- printf(".globl z_extract_offset\n");
- printf("z_extract_offset = 0x%lx\n", offs);
printf(".globl z_run_size\n");
printf("z_run_size = %lu\n", run_size);
diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 6236b9ec4b76..6565dcb2b899 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -440,7 +440,28 @@ setup_data: .quad 0 # 64-bit physical pointer to
pref_address: .quad LOAD_PHYSICAL_ADDR # preferred load addr
-#define ZO_INIT_SIZE (ZO__end - ZO_startup_32 + ZO_z_extract_offset)
+/* Check arch/x86/boot/compressed/misc.c for the formula of extra_bytes*/
+#define ZO_z_extra_bytes ((ZO_z_output_len >> 12) + 65536 + 128)
+#if ZO_z_output_len > ZO_z_input_len
+#define ZO_z_extract_offset (ZO_z_output_len + ZO_z_extra_bytes - \
+ ZO_z_input_len)
+#else
+#define ZO_z_extract_offset ZO_z_extra_bytes
+#endif
+
+/*
+ * extract_offset has to be bigger than ZO head section. Otherwise
+ * during head code running to move ZO to end of buffer, it will
+ * overwrite head code itself.
+ */
+#if (ZO__ehead - ZO_startup_32) > ZO_z_extract_offset
+#define ZO_z_min_extract_offset ((ZO__ehead - ZO_startup_32 + 4095) & ~4095)
+#else
+#define ZO_z_min_extract_offset ((ZO_z_extract_offset + 4095) & ~4095)
+#endif
+
+#define ZO_INIT_SIZE (ZO__end - ZO_startup_32 + ZO_z_min_extract_offset)
+
#define VO_INIT_SIZE (VO__end - VO__text)
#if ZO_INIT_SIZE > VO_INIT_SIZE
#define INIT_SIZE ZO_INIT_SIZE
--
2.6.3
From: Yinghai Lu <[email protected]>
Currently the physical randomization's lower boundary is the load
address. For bootloaders that load kernels into very high memory
(e.g. kexec), this means randomization takes place in a very small window
at the top of memory, ignoring the large region of physical memory below
the load address.
Since mem_avoid is already correctly tracking the regions that must be
avoided, this patch changes the minimum address to whichever is less:
512M (to conservatively avoid unknown things in lower memory) or the
load address. Now, for example, if the kernel is loaded at 8G, [512M,
8G) will be added into possible physical memory positions.
Signed-off-by: Yinghai Lu <[email protected]>
[kees: rewrote changelog, refactor to use min()]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index e83d3bb3808b..864a51863b45 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -459,7 +459,8 @@ void choose_kernel_location(unsigned char *input,
unsigned long output_size,
unsigned char **virt_offset)
{
- unsigned long random;
+ unsigned long random, min_addr;
+
*virt_offset = (unsigned char *)LOAD_PHYSICAL_ADDR;
#ifdef CONFIG_HIBERNATION
@@ -480,8 +481,11 @@ void choose_kernel_location(unsigned char *input,
mem_avoid_init((unsigned long)input, input_size,
(unsigned long)*output);
+ /* Low end should be the smaller of 512M or initial location. */
+ min_addr = min((unsigned long)*output, 512UL << 20);
+
/* Walk e820 and find a random address. */
- random = find_random_phy_addr((unsigned long)*output, output_size);
+ random = find_random_phy_addr(min_addr, output_size);
if (!random)
debug_putstr("KASLR could not find suitable E820 region...\n");
else {
--
2.6.3
From: Baoquan He <[email protected]>
Since struct slot_area and its new algorithm are used to track and select
memory ranges, the slots[] array and its associated functions are not
needed any more. This patch removes them.
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 22 ----------------------
1 file changed, 22 deletions(-)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index 0587eac3e05d..e83d3bb3808b 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -113,17 +113,6 @@ struct mem_vector {
#define MEM_AVOID_MAX 4
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
-static bool mem_contains(struct mem_vector *region, struct mem_vector *item)
-{
- /* Item at least partially before region. */
- if (item->start < region->start)
- return false;
- /* Item at least partially after region. */
- if (item->start + item->size > region->start + region->size)
- return false;
- return true;
-}
-
static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
{
/* Item one is entirely before item two. */
@@ -292,8 +281,6 @@ mem_min_overlap(struct mem_vector *img, struct mem_vector *out)
return min;
}
-static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
-
struct slot_area {
unsigned long addr;
int num;
@@ -324,15 +311,6 @@ static void store_slot_info(struct mem_vector *region, unsigned long image_size)
}
}
-static void slots_append(unsigned long addr)
-{
- /* Overflowing the slots list should be impossible. */
- if (slot_max >= KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN)
- return;
-
- slots[slot_max++] = addr;
-}
-
static unsigned long slots_fetch_random(void)
{
unsigned long random;
--
2.6.3
From: Baoquan He <[email protected]>
The current KASLR implementation randomizes the physical and virtual
addresses of the kernel together (both are offset by the same amount). It
calculates the delta between the physical address where vmlinux was linked
to load and where it is finally loaded. If the delta is not equal to 0
(i.e. the kernel was relocated), relocation handling needs to be done.
On 64-bit, this patch randomizes both the physical address where kernel
is decompressed and the virtual address where kernel text is mapped and
will execute from. We now have two values being chosen, so the function
arguments are reorganized to pass by pointer so they can be directly
updated. Since relocation handling only depends on the virtual address,
we must check the virtual delta, not the physical delta for processing
kernel relocations. This also populates the page table for the new
virtual address range. 32-bit does not support a separate virtual
address, so it continues to use the physical offset for its virtual
offset.
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog, limited virtual split to 64-bit only]
[kees: fix CONFIG_RANDOMIZE_BASE=n boot failure]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 45 +++++++++++++++++++++--------------------
arch/x86/boot/compressed/misc.c | 39 ++++++++++++++++++++---------------
arch/x86/boot/compressed/misc.h | 20 +++++++++---------
3 files changed, 57 insertions(+), 47 deletions(-)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index c58215d5a80d..53ceaa0a08b9 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -391,7 +391,7 @@ static void process_e820_entry(struct e820entry *entry,
}
}
-static unsigned long find_random_addr(unsigned long minimum,
+static unsigned long find_random_phy_addr(unsigned long minimum,
unsigned long size)
{
int i;
@@ -431,23 +431,24 @@ static unsigned long find_random_virt_offset(unsigned long minimum,
return random * CONFIG_PHYSICAL_ALIGN + minimum;
}
-unsigned char *choose_kernel_location(unsigned char *input,
- unsigned long input_size,
- unsigned char *output,
- unsigned long output_size)
+void choose_kernel_location(unsigned char *input,
+ unsigned long input_size,
+ unsigned char **output,
+ unsigned long output_size,
+ unsigned char **virt_offset)
{
- unsigned long choice = (unsigned long)output;
unsigned long random;
+ *virt_offset = (unsigned char *)LOAD_PHYSICAL_ADDR;
#ifdef CONFIG_HIBERNATION
if (!cmdline_find_option_bool("kaslr")) {
debug_putstr("KASLR disabled by default...\n");
- goto out;
+ return;
}
#else
if (cmdline_find_option_bool("nokaslr")) {
debug_putstr("KASLR disabled by cmdline...\n");
- goto out;
+ return;
}
#endif
@@ -455,23 +456,23 @@ unsigned char *choose_kernel_location(unsigned char *input,
/* Record the various known unsafe memory ranges. */
mem_avoid_init((unsigned long)input, input_size,
- (unsigned long)output);
+ (unsigned long)*output);
/* Walk e820 and find a random address. */
- random = find_random_addr(choice, output_size);
- if (!random) {
+ random = find_random_phy_addr((unsigned long)*output, output_size);
+ if (!random)
debug_putstr("KASLR could not find suitable E820 region...\n");
- goto out;
+ else {
+ if ((unsigned long)*output != random) {
+ fill_pagetable(random, output_size);
+ switch_pagetable();
+ *output = (unsigned char *)random;
+ }
}
- /* Always enforce the minimum. */
- if (random < choice)
- goto out;
-
- choice = random;
-
- fill_pagetable(choice, output_size);
- switch_pagetable();
-out:
- return (unsigned char *)choice;
+ /* Pick random virtual address starting from LOAD_PHYSICAL_ADDR. */
+ if (IS_ENABLED(CONFIG_X86_64))
+ random = find_random_virt_offset(LOAD_PHYSICAL_ADDR,
+ output_size);
+ *virt_offset = (unsigned char *)random;
}
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index c47ac162d3bd..3d5b0eda0711 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -251,7 +251,8 @@ void error(char *x)
}
#if CONFIG_X86_NEED_RELOCS
-static void handle_relocations(void *output, unsigned long output_len)
+static void handle_relocations(void *output, unsigned long output_len,
+ void *virt_offset)
{
int *reloc;
unsigned long delta, map, ptr;
@@ -263,11 +264,6 @@ static void handle_relocations(void *output, unsigned long output_len)
* and where it was actually loaded.
*/
delta = min_addr - LOAD_PHYSICAL_ADDR;
- if (!delta) {
- debug_putstr("No relocation needed... ");
- return;
- }
- debug_putstr("Performing relocations... ");
/*
* The kernel contains a table of relocation addresses. Those
@@ -279,6 +275,20 @@ static void handle_relocations(void *output, unsigned long output_len)
map = delta - __START_KERNEL_map;
/*
+ * 32-bit always performs relocations. 64-bit relocations are only
+ * needed if KASLR has chosen a different starting address offset
+ * from __START_KERNEL_map.
+ */
+ if (IS_ENABLED(CONFIG_X86_64))
+ delta = (unsigned long)virt_offset - LOAD_PHYSICAL_ADDR;
+
+ if (!delta) {
+ debug_putstr("No relocation needed... ");
+ return;
+ }
+ debug_putstr("Performing relocations... ");
+
+ /*
* Process relocations: 32 bit relocations first then 64 bit after.
* Three sets of binary relocations are added to the end of the kernel
* before compression. Each relocation table entry is the kernel
@@ -331,7 +341,8 @@ static void handle_relocations(void *output, unsigned long output_len)
#endif
}
#else
-static inline void handle_relocations(void *output, unsigned long output_len)
+static inline void handle_relocations(void *output, unsigned long output_len,
+ void *virt_offset)
{ }
#endif
@@ -391,7 +402,7 @@ asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap,
unsigned long output_len)
{
const unsigned long run_size = VO__end - VO__text;
- unsigned char *output_orig = output;
+ unsigned char *virt_offset;
real_mode = rmode;
@@ -429,9 +440,10 @@ asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap,
* the entire decompressed kernel plus relocation table, or the
* entire decompressed kernel plus .bss and .brk sections.
*/
- output = choose_kernel_location(input_data, input_len, output,
+ choose_kernel_location(input_data, input_len, &output,
output_len > run_size ? output_len
- : run_size);
+ : run_size,
+ &virt_offset);
/* Validate memory location choices. */
if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
@@ -452,12 +464,7 @@ asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap,
__decompress(input_data, input_len, NULL, NULL, output, output_len,
NULL, error);
parse_elf(output);
- /*
- * 32-bit always performs relocations. 64-bit relocations are only
- * needed if kASLR has chosen a different load address.
- */
- if (!IS_ENABLED(CONFIG_X86_64) || output != output_orig)
- handle_relocations(output, output_len);
+ handle_relocations(output, output_len, virt_offset);
debug_putstr("done.\nBooting the kernel.\n");
return output;
}
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 39d0e9a53736..0b455406a850 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -69,20 +69,22 @@ int cmdline_find_option_bool(const char *option);
#if CONFIG_RANDOMIZE_BASE
/* aslr.c */
-unsigned char *choose_kernel_location(unsigned char *input,
- unsigned long input_size,
- unsigned char *output,
- unsigned long output_size);
+void choose_kernel_location(unsigned char *input,
+ unsigned long input_size,
+ unsigned char **output,
+ unsigned long output_size,
+ unsigned char **virt_offset);
/* cpuflags.c */
bool has_cpuflag(int flag);
#else
static inline
-unsigned char *choose_kernel_location(unsigned char *input,
- unsigned long input_size,
- unsigned char *output,
- unsigned long output_size)
+void choose_kernel_location(unsigned char *input,
+ unsigned long input_size,
+ unsigned char **output,
+ unsigned long output_size,
+ unsigned char **virt_offset)
{
- return output;
+ *virt_offset = *output;
}
#endif
--
2.6.3
KASLR will be calling get_random_long() twice, but the debug output won't
distinguish between them. This patch adds a report on whether it is
fetching the physical or the virtual address offset. With this, once the
virtual offset is separate, the report changes:
Before:
KASLR using RDTSC...
KASLR using RDTSC...
After:
Physical KASLR using RDTSC...
Virtual KASLR using RDTSC...
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index b527ff10372a..c58215d5a80d 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -60,7 +60,7 @@ static unsigned long get_random_boot(void)
return hash;
}
-static unsigned long get_random_long(void)
+static unsigned long get_random_long(const char *purpose)
{
#ifdef CONFIG_X86_64
const unsigned long mix_const = 0x5d6008cbf3848dd3UL;
@@ -70,7 +70,8 @@ static unsigned long get_random_long(void)
unsigned long raw, random = get_random_boot();
bool use_i8254 = true;
- debug_putstr("KASLR using");
+ debug_putstr(purpose);
+ debug_putstr(" KASLR using");
if (has_cpuflag(X86_FEATURE_RDRAND)) {
debug_putstr(" RDRAND");
@@ -338,7 +339,7 @@ static unsigned long slots_fetch_random(void)
if (slot_max == 0)
return 0;
- return slots[get_random_long() % slot_max];
+ return slots[get_random_long("Physical") % slot_max];
}
static void process_e820_entry(struct e820entry *entry,
@@ -425,7 +426,7 @@ static unsigned long find_random_virt_offset(unsigned long minimum,
slots = (KERNEL_IMAGE_SIZE - minimum - image_size) /
CONFIG_PHYSICAL_ALIGN + 1;
- random = get_random_long() % slots;
+ random = get_random_long("Virtual") % slots;
return random * CONFIG_PHYSICAL_ALIGN + minimum;
}
--
2.6.3
From: Baoquan He <[email protected]>
The function store_slot_info() is used to calculate the slot info of the
passed-in memory region and store it into slot_areas[] after adjusting
for alignment and size requirements.
The function mem_min_overlap() is used to iterate over all mem_avoid
regions to find the earliest mem_avoid address that conflicts with the
given memory region. (For example, with the region [1024M, 2048M), if
there is a mem_avoid of [1536M, 1664M), this returns 1536M.) This can
be used to split memory regions when building the slot_area array.
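A quick worked example for store_slot_info() (illustrative numbers): for a
region of [1024M, 2048M), an image_size of 32M, and CONFIG_PHYSICAL_ALIGN
of 2M, it records:

  slot_area.addr = 1024M
  slot_area.num  = (1024M - 32M) / 2M + 1 = 497

i.e. 497 aligned positions in the region that still leave room for the
whole image.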
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 51 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index abe618d489ea..b06618000732 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -257,6 +257,40 @@ static bool mem_avoid_overlap(struct mem_vector *img)
return false;
}
+static unsigned long
+mem_min_overlap(struct mem_vector *img, struct mem_vector *out)
+{
+ int i;
+ struct setup_data *ptr;
+ unsigned long min = img->start + img->size;
+
+ for (i = 0; i < MEM_AVOID_MAX; i++) {
+ if (mem_overlaps(img, &mem_avoid[i]) &&
+ (mem_avoid[i].start < min)) {
+ *out = mem_avoid[i];
+ min = mem_avoid[i].start;
+ }
+ }
+
+ /* Check all entries in the setup_data linked list. */
+ ptr = (struct setup_data *)(unsigned long)real_mode->hdr.setup_data;
+ while (ptr) {
+ struct mem_vector avoid;
+
+ avoid.start = (unsigned long)ptr;
+ avoid.size = sizeof(*ptr) + ptr->len;
+
+ if (mem_overlaps(img, &avoid) && (avoid.start < min)) {
+ *out = avoid;
+ min = avoid.start;
+ }
+
+ ptr = (struct setup_data *)(unsigned long)ptr->next;
+ }
+
+ return min;
+}
+
static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
struct slot_area {
@@ -272,6 +306,23 @@ static unsigned long slot_max;
static unsigned long slot_area_index;
+static void store_slot_info(struct mem_vector *region, unsigned long image_size)
+{
+ struct slot_area slot_area;
+
+ slot_area.addr = region->start;
+ if (image_size <= CONFIG_PHYSICAL_ALIGN)
+ slot_area.num = region->size / CONFIG_PHYSICAL_ALIGN;
+ else
+ slot_area.num = (region->size - image_size) /
+ CONFIG_PHYSICAL_ALIGN + 1;
+
+ if (slot_area.num > 0) {
+ slot_areas[slot_area_index++] = slot_area;
+ slot_max += slot_area.num;
+ }
+}
+
static void slots_append(unsigned long addr)
{
/* Overflowing the slots list should be impossible. */
--
2.6.3
From: Baoquan He <[email protected]>
In order to support KASLR moving the kernel anywhere in the whole
of physical memory (could be 64T), we need to handle counting the
potential randomization locations in a more efficient manner.
In the worst case with 64T, there could be roughly 32 * 1024 * 1024
randomization slots if CONFIG_PHYSICAL_ALIGN is 0x200000. Currently
the starting address of each candidate position is stored into the slots[]
array, one at a time. This method costs too much memory, and it is also
very inefficient to fetch and save the slot information into the slot
array one by one.
This patch introduces struct slot_area to manage each contiguous region
of randomization slots. Each slot_area will contain the starting address
and how many available slots are in the area. As with the original code,
the slot_areas will avoid the mem_avoid regions, since those areas should
not be used for the relocated kernel position.
Since setup_data is a linked list, it could contain an unknown number
of memory regions to be avoided, which could cause us to fragment
the contiguous memory that the slot_area array is tracking. In normal
operation this fragmentation will be extremely unlikely, but we choose
a suitably large value (100) for the array. If there are more slot areas
available beyond 100, they will be ignored for KASLR selection.
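A rough back-of-the-envelope comparison (not measured) shows why the flat
array does not scale:

  64T / 2M alignment          = 33,554,432 slots
  33,554,432 slots * 8 bytes  = 256M for a flat slots[] array
  100 slot_areas * ~16 bytes  = ~1600 bytes with struct slot_area

Tracking contiguous runs instead of individual slots is what makes
whole-of-memory randomization practical.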
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index cd261debead9..abe618d489ea 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -258,8 +258,20 @@ static bool mem_avoid_overlap(struct mem_vector *img)
}
static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
+
+struct slot_area {
+ unsigned long addr;
+ int num;
+};
+
+#define MAX_SLOT_AREA 100
+
+static struct slot_area slot_areas[MAX_SLOT_AREA];
+
static unsigned long slot_max;
+static unsigned long slot_area_index;
+
static void slots_append(unsigned long addr)
{
/* Overflowing the slots list should be impossible. */
--
2.6.3
From: Yinghai Lu <[email protected]>
parse_elf is using a local memcpy to move sections to their final running
position. However, this memcpy only supports non-overlapping arguments
(or dest < src).
To avoid future hard-to-debug surprises, this adds a check to memcpy to
detect the unhandled condition (which should not happen currently).
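For illustration, with made-up addresses: copying n = 0x80 bytes from
src = 0x100 to dest = 0x140 means the ranges [0x100, 0x180) and
[0x140, 0x1c0) overlap with dest > src, so a simple forward copy would
read bytes it had already overwritten. With this patch, memcpy() detects
that case and calls error() instead of silently corrupting data.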
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/misc.c | 9 ++-------
arch/x86/boot/compressed/misc.h | 2 ++
arch/x86/boot/compressed/string.c | 29 +++++++++++++++++++++++++++--
3 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 562d647289ac..c47ac162d3bd 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -114,9 +114,6 @@
#undef memset
#define memzero(s, n) memset((s), 0, (n))
-
-static void error(char *m);
-
/*
* This is set up by the setup-routine at boot-time
*/
@@ -243,7 +240,7 @@ void __puthex(unsigned long value)
}
}
-static void error(char *x)
+void error(char *x)
{
error_putstr("\n\n");
error_putstr(x);
@@ -378,9 +375,7 @@ static void parse_elf(void *output)
#else
dest = (void *)(phdr->p_paddr);
#endif
- memcpy(dest,
- output + phdr->p_offset,
- phdr->p_filesz);
+ memcpy(dest, output + phdr->p_offset, phdr->p_filesz);
break;
default: /* Ignore other PT_* */ break;
}
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 11736a6a6670..39d0e9a53736 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -38,6 +38,8 @@ void __puthex(unsigned long value);
#define error_putstr(__x) __putstr(__x)
#define error_puthex(__x) __puthex(__x)
+void error(char *x);
+
#ifdef CONFIG_X86_VERBOSE_BOOTUP
#define debug_putstr(__x) __putstr(__x)
diff --git a/arch/x86/boot/compressed/string.c b/arch/x86/boot/compressed/string.c
index 00e788be1db9..3a935d0c82a8 100644
--- a/arch/x86/boot/compressed/string.c
+++ b/arch/x86/boot/compressed/string.c
@@ -1,7 +1,7 @@
#include "../string.c"
#ifdef CONFIG_X86_32
-void *memcpy(void *dest, const void *src, size_t n)
+void *__memcpy(void *dest, const void *src, size_t n)
{
int d0, d1, d2;
asm volatile(
@@ -15,7 +15,7 @@ void *memcpy(void *dest, const void *src, size_t n)
return dest;
}
#else
-void *memcpy(void *dest, const void *src, size_t n)
+void *__memcpy(void *dest, const void *src, size_t n)
{
long d0, d1, d2;
asm volatile(
@@ -30,6 +30,31 @@ void *memcpy(void *dest, const void *src, size_t n)
}
#endif
+extern void error(char *x);
+void *memcpy(void *dest, const void *src, size_t n)
+{
+ unsigned long start_dest, end_dest;
+ unsigned long start_src, end_src;
+ unsigned long max_start, min_end;
+
+ if (dest < src)
+ return __memcpy(dest, src, n);
+
+ start_dest = (unsigned long)dest;
+ end_dest = (unsigned long)dest + n;
+ start_src = (unsigned long)src;
+ end_src = (unsigned long)src + n;
+ max_start = (start_dest > start_src) ? start_dest : start_src;
+ min_end = (end_dest < end_src) ? end_dest : end_src;
+
+ if (max_start >= min_end)
+ return __memcpy(dest, src, n);
+
+ error("memcpy does not support overlapping with dest > src!\n");
+
+ return dest;
+}
+
void *memset(void *s, int c, size_t n)
{
int i;
--
2.6.3
From: Baoquan He <[email protected]>
To support randomizing the kernel virtual address separately from the
physical address, this patch adds find_random_virt_offset() to choose
a slot anywhere between LOAD_PHYSICAL_ADDR and KERNEL_IMAGE_SIZE.
Since this address is virtual, not physical, we can place the kernel
anywhere in this region, as long as it is aligned and (in the case of the
kernel being larger than the slot size) placed with enough room to load
the entire kernel image.
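For example (illustrative values): with KERNEL_IMAGE_SIZE = 1G, minimum =
LOAD_PHYSICAL_ADDR = 16M, image_size = 32M, and CONFIG_PHYSICAL_ALIGN = 2M:

  slots  = (1024M - 16M - 32M) / 2M + 1 = 489
  offset = (random % 489) * 2M + 16M

so the chosen virtual offset always leaves room for the entire image below
KERNEL_IMAGE_SIZE.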
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog, refactor slot calculation for readability]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index b06618000732..b527ff10372a 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -407,6 +407,29 @@ static unsigned long find_random_addr(unsigned long minimum,
return slots_fetch_random();
}
+static unsigned long find_random_virt_offset(unsigned long minimum,
+ unsigned long image_size)
+{
+ unsigned long slots, random;
+
+ /* Make sure minimum is aligned. */
+ minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);
+ /* Align image_size for easy slot calculations. */
+ image_size = ALIGN(image_size, CONFIG_PHYSICAL_ALIGN);
+
+ /*
+ * There are how many CONFIG_PHYSICAL_ALIGN-sized slots
+ * that can hold image_size within the range of minimum to
+ * KERNEL_IMAGE_SIZE?
+ */
+ slots = (KERNEL_IMAGE_SIZE - minimum - image_size) /
+ CONFIG_PHYSICAL_ALIGN + 1;
+
+ random = get_random_long() % slots;
+
+ return random * CONFIG_PHYSICAL_ALIGN + minimum;
+}
+
unsigned char *choose_kernel_location(unsigned char *input,
unsigned long input_size,
unsigned char *output,
--
2.6.3
From: Yinghai Lu <[email protected]>
Currently, the kernel run_size (size of code plus brk and bss) is
calculated via the shell script arch/x86/tools/calc_run_size.sh. It gets
the file offset and mem size of the .bss and .brk sections in vmlinux,
and adds them as follows:
run_size=$(( $offsetA + $sizeA + $sizeB ))
However, this is not correct (it is too large). To illustrate, here's
a walk-through of the script's calculation, compared to the correct way
to find it.
First, offsetA is found as the starting address of the first .bss or
.brk section seen in the ELF file. The sizeA and sizeB values are the
respective section sizes.
[bhe@x1 linux]$ objdump -h vmlinux
vmlinux: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
27 .bss 00170000 ffffffff81ec8000 0000000001ec8000 012c8000 2**12
ALLOC
28 .brk 00027000 ffffffff82038000 0000000002038000 012c8000 2**0
ALLOC
Here, offsetA is 0x012c8000, with sizeA at 0x00170000 and sizeB at
0x00027000. The resulting run_size is 0x145f000:
0x012c8000 + 0x00170000 + 0x00027000 = 0x145f000
However, if we instead examine the ELF LOAD program headers, we see a
different picture.
[bhe@x1 linux]$ readelf -l vmlinux
Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000200000 0xffffffff81000000 0x0000000001000000
0x0000000000b5e000 0x0000000000b5e000 R E 200000
LOAD 0x0000000000e00000 0xffffffff81c00000 0x0000000001c00000
0x0000000000145000 0x0000000000145000 RW 200000
LOAD 0x0000000001000000 0x0000000000000000 0x0000000001d45000
0x0000000000018158 0x0000000000018158 RW 200000
LOAD 0x000000000115e000 0xffffffff81d5e000 0x0000000001d5e000
0x000000000016a000 0x0000000000301000 RWE 200000
NOTE 0x000000000099bcac 0xffffffff8179bcac 0x000000000179bcac
0x00000000000001bc 0x00000000000001bc 4
Section to Segment mapping:
Segment Sections...
00 .text .notes __ex_table .rodata __bug_table .pci_fixup .tracedata
__ksymtab __ksymtab_gpl __ksymtab_strings __init_rodata __param
__modver
01 .data .vvar
02 .data..percpu
03 .init.text .init.data .x86_cpu_dev.init .parainstructions
.altinstructions .altinstr_replacement .iommu_table .apicdrivers
.exit.text .smp_locks .bss .brk
04 .notes
As mentioned, run_size needs to be the size of the running kernel
including .bss and .brk. We can see from the Section/Segment mapping
above that .bss and .brk are included in segment 03 (which corresponds
to the final LOAD program header). To find the run_size, we calculate
the end of the LOAD segment from its PhysAddr start (0x0000000001d5e000)
and its MemSiz (0x0000000000301000), minus the physical load address of
the kernel (the first LOAD segment's PhysAddr: 0x0000000001000000). The
resulting run_size is 0x105f000:
0x0000000001d5e000 + 0x0000000000301000 - 0x0000000001000000 = 0x105f000
So, from this we can see that the existing run_size calculation is
0x400000 too high. And, as it turns out, the correct run_size is
actually equal to VO_end - VO_text, which is certainly easier to calculate.
_end: 0xffffffff8205f000
_text:0xffffffff81000000
0xffffffff8205f000 - 0xffffffff81000000 = 0x105f000
As a result, run_size is a simple constant, so we don't need to pass it
around; we already have voffset.h for such things. We can share voffset.h
between misc.c and header.S instead of computing run_size in other ways.
This patch moves voffset.h creation code to boot/compressed/Makefile,
and switches misc.c to use the VO_end - VO_text calculation.
Dependence before:
boot/header.S ==> boot/voffset.h ==> vmlinux
boot/header.S ==> compressed/vmlinux ==> compressed/misc.c
Dependence after:
boot/header.S ==> compressed/vmlinux ==> compressed/misc.c ==>
boot/voffset.h ==> vmlinux
Fixes: e6023367d779 ("x86, kaslr: Prevent .bss from overlaping initrd")
Cc: Junjie Mao <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/Makefile | 11 +----------
arch/x86/boot/compressed/Makefile | 12 ++++++++++++
arch/x86/boot/compressed/misc.c | 3 +++
3 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
index 942f7dabfb1e..700a9c6e6159 100644
--- a/arch/x86/boot/Makefile
+++ b/arch/x86/boot/Makefile
@@ -86,15 +86,6 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE
SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))
-sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|_end\)$$/\#define VO_\2 0x\1/p'
-
-quiet_cmd_voffset = VOFFSET $@
- cmd_voffset = $(NM) $< | sed -n $(sed-voffset) > $@
-
-targets += voffset.h
-$(obj)/voffset.h: vmlinux FORCE
- $(call if_changed,voffset)
-
sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|input_data\|_end\|_ehead\|_text\|z_.*\)$$/\#define ZO_\2 0x\1/p'
quiet_cmd_zoffset = ZOFFSET $@
@@ -106,7 +97,7 @@ $(obj)/zoffset.h: $(obj)/compressed/vmlinux FORCE
AFLAGS_header.o += -I$(obj)
-$(obj)/header.o: $(obj)/voffset.h $(obj)/zoffset.h
+$(obj)/header.o: $(obj)/zoffset.h
LDFLAGS_setup.elf := -T
$(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6915ff2bd996..323674a9d6db 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -45,6 +45,18 @@ LDFLAGS_vmlinux := -T
hostprogs-y := mkpiggy
HOST_EXTRACFLAGS += -I$(srctree)/tools/include
+sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
+
+quiet_cmd_voffset = VOFFSET $@
+ cmd_voffset = $(NM) $< | sed -n $(sed-voffset) > $@
+
+targets += ../voffset.h
+
+$(obj)/../voffset.h: vmlinux FORCE
+ $(call if_changed,voffset)
+
+$(obj)/misc.o: $(obj)/../voffset.h
+
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
$(obj)/string.o $(obj)/cmdline.o \
$(obj)/piggy.o $(obj)/cpuflags.o
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 31e2d6155643..4d878a7b9fab 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -11,6 +11,7 @@
#include "misc.h"
#include "../string.h"
+#include "../voffset.h"
/* WARNING!!
* This code is compiled with -fPIC and it is relocated dynamically
@@ -415,6 +416,8 @@ asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap,
lines = real_mode->screen_info.orig_video_lines;
cols = real_mode->screen_info.orig_video_cols;
+ run_size = VO__end - VO__text;
+
console_init();
debug_putstr("early console in decompress_kernel\n");
--
2.6.3
From: Yinghai Lu <[email protected]>
Currently KASLR only supports relocation in a small physical range (from
16M to 1G), due to using the initial kernel page table identity mapping.
To support ranges above this, we need to have an identity mapping for
the desired memory range before we can decompress (and later run) the
kernel.
This patch adds an identity mapping for the needed memory range. This
happens in two possible boot paths:
If loaded via startup_32, we have to set up the needed identity mapping
ourselves.
If loaded from a 64-bit bootloader, the bootloader will have
already set up an identity mapping, and we'll start via ZO's
(arch/x86/boot/compressed/vmlinux) startup_64. In this case, the
bootloader's page tables need to be avoided when we select the new VO
(vmlinux) location. If not, the decompressor could overwrite them during
decompression.
To avoid the bootloader's page tables, we could walk the pagetable and
find every page that is used, and add them to mem_avoid, but this needs
extra code and would require increasing the size of the mem_avoid array.
Instead, we can create a new identity mapping, and the pages for the
pagetable will come from the _pgtable section of ZO, which means they
are already in the mem_avoid array. To do this, we reuse the code for the
kernel's identity mapping.
The _pgtable will be shared by the 32-bit and 64-bit paths to reduce
init_size, as ZO's _rodata to _end now contributes to init_size.
To handle the possible mappings, we need to increase the pgt buffer size:
When booting via startup_64, we need to cover the old VO, params, cmdline
and new VO. In an extreme case they could all be beyond the 512G boundary,
which needs (2+2)*4 pages with 2M mappings. We'll also need 2 pages for
the first 2M, for VGA RAM, and one more for the level-4 page table. This
gets us to 19 pages total.
When booting via startup_32, KASLR could move the new VO above 4G, so we
need to set up extra identity mappings for the new VO, which should need
(2+2) pages at most when it is beyond the 512G boundary. So 19 pages
is sufficient for this case as well.
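Tallied, the page budget from the reasoning above looks like this:

  level-4 page table                          1 page
  first 2M for VGA RAM                        2 pages
  old VO, params, cmdline, new VO
    at (2+2) pages each across 512G          16 pages
  total                                      19 pages

(Without CONFIG_X86_VERBOSE_BOOTUP the VGA RAM mapping is skipped, which
is why the header below also defines a 17-page variant.)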
Cc: Kees Cook <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/Makefile | 3 ++
arch/x86/boot/compressed/aslr.c | 14 ++++++
arch/x86/boot/compressed/head_64.S | 4 +-
arch/x86/boot/compressed/misc.h | 11 +++++
arch/x86/boot/compressed/misc_pgt.c | 93 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/boot.h | 19 ++++++++
6 files changed, 142 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/boot/compressed/misc_pgt.c
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index aa421126556d..af9ea6f0baa2 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -63,6 +63,9 @@ vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/aslr.o
+ifdef CONFIG_X86_64
+ vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/misc_pgt.o
+endif
$(obj)/eboot.o: KBUILD_CFLAGS += -fshort-wchar -mno-red-zone
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index 277c7a4ec3a3..cd261debead9 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -194,6 +194,7 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
*/
mem_avoid[0].start = input;
mem_avoid[0].size = (output + init_size) - input;
+ fill_pagetable(mem_avoid[0].start, mem_avoid[0].size);
/* Avoid initrd. */
initrd_start = (u64)real_mode->ext_ramdisk_image << 32;
@@ -202,6 +203,7 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
initrd_size |= real_mode->hdr.ramdisk_size;
mem_avoid[1].start = initrd_start;
mem_avoid[1].size = initrd_size;
+ /* No need to set mapping for initrd */
/* Avoid kernel command line. */
cmd_line = (u64)real_mode->ext_cmd_line_ptr << 32;
@@ -212,10 +214,19 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
;
mem_avoid[2].start = cmd_line;
mem_avoid[2].size = cmd_line_size;
+ fill_pagetable(mem_avoid[2].start, mem_avoid[2].size);
/* Avoid params */
mem_avoid[3].start = (unsigned long)real_mode;
mem_avoid[3].size = sizeof(*real_mode);
+ fill_pagetable(mem_avoid[3].start, mem_avoid[3].size);
+
+ /* don't need to set mapping for setup_data */
+
+#ifdef CONFIG_X86_VERBOSE_BOOTUP
+ /* for video ram */
+ fill_pagetable(0, PMD_SIZE);
+#endif
}
/* Does this memory vector overlap a known avoided area? */
@@ -371,6 +382,9 @@ unsigned char *choose_kernel_location(unsigned char *input,
goto out;
choice = random;
+
+ fill_pagetable(choice, output_size);
+ switch_pagetable();
out:
return (unsigned char *)choice;
}
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 36914515dd76..075bb15f9494 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -126,7 +126,7 @@ ENTRY(startup_32)
/* Initialize Page tables to 0 */
leal pgtable(%ebx), %edi
xorl %eax, %eax
- movl $((4096*6)/4), %ecx
+ movl $(BOOT_INIT_PGT_SIZE/4), %ecx
rep stosl
/* Build Level 4 */
@@ -478,4 +478,4 @@ boot_stack_end:
.section ".pgtable","a",@nobits
.balign 4096
pgtable:
- .fill 6*4096, 1, 0
+ .fill BOOT_PGT_SIZE, 1, 0
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index dcf01c22400e..11736a6a6670 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -84,6 +84,17 @@ unsigned char *choose_kernel_location(unsigned char *input,
}
#endif
+#ifdef CONFIG_X86_64
+void fill_pagetable(unsigned long start, unsigned long size);
+void switch_pagetable(void);
+extern unsigned char _pgtable[];
+#else
+static inline void fill_pagetable(unsigned long start, unsigned long size)
+{ }
+static inline void switch_pagetable(void)
+{ }
+#endif
+
#ifdef CONFIG_EARLY_PRINTK
/* early_serial_console.c */
extern int early_serial_base;
diff --git a/arch/x86/boot/compressed/misc_pgt.c b/arch/x86/boot/compressed/misc_pgt.c
new file mode 100644
index 000000000000..816551d0327c
--- /dev/null
+++ b/arch/x86/boot/compressed/misc_pgt.c
@@ -0,0 +1,93 @@
+#define __pa(x) ((unsigned long)(x))
+#define __va(x) ((void *)((unsigned long)(x)))
+
+#include "misc.h"
+
+#include <asm/init.h>
+#include <asm/pgtable.h>
+
+#include "../../mm/ident_map.c"
+#include "../string.h"
+
+struct alloc_pgt_data {
+ unsigned char *pgt_buf;
+ unsigned long pgt_buf_size;
+ unsigned long pgt_buf_offset;
+};
+
+static void *alloc_pgt_page(void *context)
+{
+ struct alloc_pgt_data *d = (struct alloc_pgt_data *)context;
+ unsigned char *p = (unsigned char *)d->pgt_buf;
+
+ if (d->pgt_buf_offset >= d->pgt_buf_size) {
+ debug_putstr("out of pgt_buf in misc.c\n");
+ debug_putaddr(d->pgt_buf_offset);
+ debug_putaddr(d->pgt_buf_size);
+ return NULL;
+ }
+
+ p += d->pgt_buf_offset;
+ d->pgt_buf_offset += PAGE_SIZE;
+
+ return p;
+}
+
+/*
+ * Use a normal definition of memset() from string.c. There are already
+ * included header files which expect a definition of memset() and by
+ * the time we define memset macro, it is too late.
+ */
+#undef memset
+
+unsigned long __force_order;
+static struct alloc_pgt_data pgt_data;
+static struct x86_mapping_info mapping_info;
+static pgd_t *level4p;
+
+void fill_pagetable(unsigned long start, unsigned long size)
+{
+ unsigned long end = start + size;
+
+ if (!level4p) {
+ pgt_data.pgt_buf_offset = 0;
+ mapping_info.alloc_pgt_page = alloc_pgt_page;
+ mapping_info.context = &pgt_data;
+ mapping_info.pmd_flag = __PAGE_KERNEL_LARGE_EXEC;
+
+ /*
+ * come from startup_32 ?
+ * then cr3 is _pgtable, we can reuse it.
+ */
+ level4p = (pgd_t *)read_cr3();
+ if ((unsigned long)level4p == (unsigned long)_pgtable) {
+ pgt_data.pgt_buf = (unsigned char *)_pgtable +
+ BOOT_INIT_PGT_SIZE;
+ pgt_data.pgt_buf_size = BOOT_PGT_SIZE -
+ BOOT_INIT_PGT_SIZE;
+ memset((unsigned char *)pgt_data.pgt_buf, 0,
+ pgt_data.pgt_buf_size);
+ debug_putstr("boot via startup_32\n");
+ } else {
+ pgt_data.pgt_buf = (unsigned char *)_pgtable;
+ pgt_data.pgt_buf_size = BOOT_PGT_SIZE;
+ memset((unsigned char *)pgt_data.pgt_buf, 0,
+ pgt_data.pgt_buf_size);
+ debug_putstr("boot via startup_64\n");
+ level4p = (pgd_t *)alloc_pgt_page(&pgt_data);
+ }
+ }
+
+ /* align boundary to 2M */
+ start = round_down(start, PMD_SIZE);
+ end = round_up(end, PMD_SIZE);
+ if (start >= end)
+ return;
+
+ kernel_ident_mapping_init(&mapping_info, level4p, start, end);
+}
+
+void switch_pagetable(void)
+{
+ write_cr3((unsigned long)level4p);
+}
diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
index 6b8d6e8cd449..52a9cbc1419f 100644
--- a/arch/x86/include/asm/boot.h
+++ b/arch/x86/include/asm/boot.h
@@ -32,7 +32,26 @@
#endif /* !CONFIG_KERNEL_BZIP2 */
#ifdef CONFIG_X86_64
+
#define BOOT_STACK_SIZE 0x4000
+
+#define BOOT_INIT_PGT_SIZE (6*4096)
+#ifdef CONFIG_RANDOMIZE_BASE
+/*
+ * 1 page for level4, 2 pages for first 2M.
+ * (2+2)*4 pages for kernel, param, cmd_line, random kernel
+ * if all cross 512G boundary.
+ * So total will be 19 pages.
+ */
+#ifdef CONFIG_X86_VERBOSE_BOOTUP
+#define BOOT_PGT_SIZE (19*4096)
+#else
+#define BOOT_PGT_SIZE (17*4096)
+#endif
+#else
+#define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
+#endif
+
#else
#define BOOT_STACK_SIZE 0x1000
#endif
--
2.6.3
From: Baoquan He <[email protected]>
The comment that describes the analysis for the size of the decompressor
code only took gzip into account (there are several other decompressors
that could be used). The actual z_extract_offset calculation in the code
was already handling the correct maximum size, but this documentation
hadn't been updated. This updates the documentation and fixes several
typos.
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog, cleaned up comment style]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/misc.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index e2a998f8c304..31e2d6155643 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -19,11 +19,13 @@
*/
/*
- * Getting to provable safe in place decompression is hard.
- * Worst case behaviours need to be analyzed.
- * Background information:
+ * Getting to provable safe in place decompression is hard. Worst case
+ * behaviours need be analyzed. Here let's take decompressing gzip-compressed
+ * kernel as example to illustrate it.
+ *
+ * The file layout of gzip compressed kernel is as follows. For more
+ * information, please refer to RFC 1951 and RFC 1952.
*
- * The file layout is:
* magic[2]
* method[1]
* flags[1]
@@ -70,13 +72,13 @@
* of 5 bytes per 32767 bytes.
*
* The worst case internal to a compressed block is very hard to figure.
- * The worst case can at least be boundined by having one bit that represents
+ * The worst case can at least be bounded by having one bit that represents
* 32764 bytes and then all of the rest of the bytes representing the very
* very last byte.
*
* All of which is enough to compute an amount of extra data that is required
* to be safe. To avoid problems at the block level allocating 5 extra bytes
- * per 32767 bytes of data is sufficient. To avoind problems internal to a
+ * per 32767 bytes of data is sufficient. To avoid problems internal to a
* block adding an extra 32767 bytes (the worst case uncompressed block size)
* is sufficient, to ensure that in the worst case the decompressed data for
* block will stop the byte before the compressed data for a block begins.
@@ -88,11 +90,17 @@
* Adding 8 bytes per 32K is a bit excessive but much easier to calculate.
* Adding 32768 instead of 32767 just makes for round numbers.
*
+ * Above analysis is for decompressing gzip compressed kernel only. Up to
+ * now 6 different decompressor are supported all together. And among them
+ * xz stores data in chunks and has maximum chunk of 64K. Hence safety
+ * margin should be updated to cover all decompressors so that we don't
+ * need to deal with each of them separately. Please check
+ * the description in lib/decompressor_xxx.c for specific information.
+ *
+ * extra_bytes = (uncompressed_size >> 12) + 65536 + 128.
+ *
*/
-/*
- * gzip declarations
- */
#define STATIC static
#undef memcpy
--
2.6.3
From: Yinghai Lu <[email protected]>
Since run_size is now calculated in misc.c, the old script and associated
argument passing is no longer needed. This patch removes them.
Cc: "H. Peter Anvin" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Junjie Mao <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/Makefile | 4 +---
arch/x86/boot/compressed/head_32.S | 3 +--
arch/x86/boot/compressed/head_64.S | 3 ---
arch/x86/boot/compressed/misc.c | 6 ++----
arch/x86/boot/compressed/mkpiggy.c | 10 ++-------
arch/x86/tools/calc_run_size.sh | 42 --------------------------------------
6 files changed, 6 insertions(+), 62 deletions(-)
delete mode 100644 arch/x86/tools/calc_run_size.sh
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 323674a9d6db..81a8d0fd34eb 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -109,10 +109,8 @@ suffix-$(CONFIG_KERNEL_XZ) := xz
suffix-$(CONFIG_KERNEL_LZO) := lzo
suffix-$(CONFIG_KERNEL_LZ4) := lz4
-RUN_SIZE = $(shell $(OBJDUMP) -h vmlinux | \
- $(CONFIG_SHELL) $(srctree)/arch/x86/tools/calc_run_size.sh)
quiet_cmd_mkpiggy = MKPIGGY $@
- cmd_mkpiggy = $(obj)/mkpiggy $< $(RUN_SIZE) > $@ || ( rm -f $@ ; false )
+ cmd_mkpiggy = $(obj)/mkpiggy $< > $@ || ( rm -f $@ ; false )
targets += piggy.S
$(obj)/piggy.S: $(obj)/vmlinux.bin.$(suffix-y) $(obj)/mkpiggy FORCE
diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 0c140f99c602..122b32faf3ef 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -210,7 +210,6 @@ relocated:
* Do the decompression, and jump to the new kernel..
*/
/* push arguments for decompress_kernel: */
- pushl $z_run_size /* size of kernel with .bss and .brk */
pushl $z_output_len /* decompressed length, end of relocs */
movl BP_init_size(%esi), %eax
@@ -226,7 +225,7 @@ relocated:
pushl %eax /* heap area */
pushl %esi /* real mode pointer */
call decompress_kernel /* returns kernel location in %eax */
- addl $28, %esp
+ addl $24, %esp
/*
* Jump to the decompressed kernel.
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 67dd8d300c61..36914515dd76 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -407,8 +407,6 @@ relocated:
* Do the decompression, and jump to the new kernel..
*/
pushq %rsi /* Save the real mode argument */
- movq $z_run_size, %r9 /* size of kernel with .bss and .brk */
- pushq %r9
movq %rsi, %rdi /* real mode address */
leaq boot_heap(%rip), %rsi /* malloc area for uncompression */
leaq input_data(%rip), %rdx /* input_data */
@@ -416,7 +414,6 @@ relocated:
movq %rbp, %r8 /* output target address */
movq $z_output_len, %r9 /* decompressed length, end of relocs */
call decompress_kernel /* returns kernel location in %rax */
- popq %r9
popq %rsi
/*
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 4d878a7b9fab..6adb5b827aca 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -393,9 +393,9 @@ asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap,
unsigned char *input_data,
unsigned long input_len,
unsigned char *output,
- unsigned long output_len,
- unsigned long run_size)
+ unsigned long output_len)
{
+ const unsigned long run_size = VO__end - VO__text;
unsigned char *output_orig = output;
real_mode = rmode;
@@ -416,8 +416,6 @@ asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap,
lines = real_mode->screen_info.orig_video_lines;
cols = real_mode->screen_info.orig_video_cols;
- run_size = VO__end - VO__text;
-
console_init();
debug_putstr("early console in decompress_kernel\n");
diff --git a/arch/x86/boot/compressed/mkpiggy.c b/arch/x86/boot/compressed/mkpiggy.c
index a613c84d9b88..c51486429bdc 100644
--- a/arch/x86/boot/compressed/mkpiggy.c
+++ b/arch/x86/boot/compressed/mkpiggy.c
@@ -34,13 +34,11 @@ int main(int argc, char *argv[])
{
uint32_t olen;
long ilen;
- unsigned long run_size;
FILE *f = NULL;
int retval = 1;
- if (argc < 3) {
- fprintf(stderr, "Usage: %s compressed_file run_size\n",
- argv[0]);
+ if (argc < 2) {
+ fprintf(stderr, "Usage: %s compressed_file\n", argv[0]);
goto bail;
}
@@ -65,15 +63,11 @@ int main(int argc, char *argv[])
ilen = ftell(f);
olen = get_unaligned_le32(&olen);
- run_size = atoi(argv[2]);
-
printf(".section \".rodata..compressed\",\"a\",@progbits\n");
printf(".globl z_input_len\n");
printf("z_input_len = %lu\n", ilen);
printf(".globl z_output_len\n");
printf("z_output_len = %lu\n", (unsigned long)olen);
- printf(".globl z_run_size\n");
- printf("z_run_size = %lu\n", run_size);
printf(".globl input_data, input_data_end\n");
printf("input_data:\n");
diff --git a/arch/x86/tools/calc_run_size.sh b/arch/x86/tools/calc_run_size.sh
deleted file mode 100644
index 1a4c17bb3910..000000000000
--- a/arch/x86/tools/calc_run_size.sh
+++ /dev/null
@@ -1,42 +0,0 @@
-#!/bin/sh
-#
-# Calculate the amount of space needed to run the kernel, including room for
-# the .bss and .brk sections.
-#
-# Usage:
-# objdump -h a.out | sh calc_run_size.sh
-
-NUM='\([0-9a-fA-F]*[ \t]*\)'
-OUT=$(sed -n 's/^[ \t0-9]*.b[sr][sk][ \t]*'"$NUM$NUM$NUM$NUM"'.*/\1\4/p')
-if [ -z "$OUT" ] ; then
- echo "Never found .bss or .brk file offset" >&2
- exit 1
-fi
-
-OUT=$(echo ${OUT# })
-sizeA=$(printf "%d" 0x${OUT%% *})
-OUT=${OUT#* }
-offsetA=$(printf "%d" 0x${OUT%% *})
-OUT=${OUT#* }
-sizeB=$(printf "%d" 0x${OUT%% *})
-OUT=${OUT#* }
-offsetB=$(printf "%d" 0x${OUT%% *})
-
-run_size=$(( $offsetA + $sizeA + $sizeB ))
-
-# BFD linker shows the same file offset in ELF.
-if [ "$offsetA" -ne "$offsetB" ] ; then
- # Gold linker shows them as consecutive.
- endB=$(( $offsetB + $sizeB ))
- if [ "$endB" != "$run_size" ] ; then
- printf "sizeA: 0x%x\n" $sizeA >&2
- printf "offsetA: 0x%x\n" $offsetA >&2
- printf "sizeB: 0x%x\n" $sizeB >&2
- printf "offsetB: 0x%x\n" $offsetB >&2
- echo ".bss and .brk are non-contiguous" >&2
- exit 1
- fi
-fi
-
-printf "%d\n" $run_size
-exit 0
--
2.6.3
From: Yinghai Lu <[email protected]>
Relocation handling performs bounds checking on the resulting calculated
addresses. The existing code uses output_len (VO size plus relocs size) as
the max address. This is not right, since the max_addr check should stop at
the end of VO and exclude .bss, .brk, etc., which follow it. The valid range
should be VO's [_text, __bss_start) in the loaded physical address space.
This patch adds an export for __bss_start in voffset.h and uses it to
get the correct max_addr.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/boot/compressed/misc.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 81a8d0fd34eb..aa421126556d 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -45,7 +45,7 @@ LDFLAGS_vmlinux := -T
hostprogs-y := mkpiggy
HOST_EXTRACFLAGS += -I$(srctree)/tools/include
-sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
+sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
quiet_cmd_voffset = VOFFSET $@
cmd_voffset = $(NM) $< | sed -n $(sed-voffset) > $@
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 6adb5b827aca..562d647289ac 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -259,7 +259,7 @@ static void handle_relocations(void *output, unsigned long output_len)
int *reloc;
unsigned long delta, map, ptr;
unsigned long min_addr = (unsigned long)output;
- unsigned long max_addr = min_addr + output_len;
+ unsigned long max_addr = min_addr + (VO___bss_start - VO__text);
/*
* Calculate the delta between where vmlinux was linked to load
--
2.6.3
From: Yinghai Lu <[email protected]>
The mem_avoid array is used to track positions that should be avoided
(like the compressed kernel, decompression code, etc) when selecting a
memory position for the relocated kernel. Since ZO is now at the end of
the decompression buffer and the decompression code (and its heap and
stack) are at the front, we can safely consolidate the decompression
entry, the heap entry, and the stack entry. The boot_params memory,
however, could be elsewhere, so it should be explicitly included.
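The resulting mem_avoid[] layout can be summarized as follows (a recap
of the diff below, not new behavior):

	mem_avoid[0]: [input, output + init_size)  /* ZO copy, heap, stack */
	mem_avoid[1]: initrd
	mem_avoid[2]: command line
	mem_avoid[3]: boot_params (real_mode)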
Cc: Kees Cook <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog, cleaned up code comment]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 72 ++++++++++++++++++++++++++++++++---------
1 file changed, 56 insertions(+), 16 deletions(-)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index 462654097d63..277c7a4ec3a3 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -109,7 +109,7 @@ struct mem_vector {
unsigned long size;
};
-#define MEM_AVOID_MAX 5
+#define MEM_AVOID_MAX 4
static struct mem_vector mem_avoid[MEM_AVOID_MAX];
static bool mem_contains(struct mem_vector *region, struct mem_vector *item)
@@ -134,22 +134,66 @@ static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
return true;
}
+/*
+ * In theory, KASLR can put the kernel anywhere in the area of [16M, 64T). The
+ * mem_avoid array is used to store the ranges that need to be avoided when
+ * KASLR searches for the relocated address. We must avoid any regions that
+ * are unsafe to overlap with during decompression, and other things like
+ * the initrd, cmdline and boot_params.
+ *
+ * How to calculate the unsafe area used by decompression is detailed here.
+ * The compressed vmlinux (ZO) plus relocs and the run space of ZO can't be
+ * overwritten by decompression output.
+ *
+ * ZO sits against the end of the decompression buffer, so we can calculate
+ * where text, data, bss, etc of ZO are positioned.
+ *
+ * The following are already enforced by the code:
+ * - init_size >= run_size,
+ * - input+input_len >= output+output_len,
+ * - run_size could be >= or < output_len
+ *
+ * The diagram below illustrates one such case, where:
+ * - init_size > run_size
+ * - input+input_len > output+output_len
+ * - run_size > output_len
+ *
+ * 0 output input input+input_len output+init_size
+ * | | | | |
+ * |-----|--------|--------|------------------|----|------------|----------|
+ * | | |
+ * output+init_size-ZO_INIT_SIZE output+output_len output+run_size
+ *
+ * [output, output+init_size) is for the buffer for decompressing the
+ * compressed kernel (ZO).
+ *
+ * [output, output+run_size) is for the uncompressed kernel (VO) run size.
+ * [output, output+output_len) is VO plus relocs
+ *
+ * [output+init_size-ZO_INIT_SIZE, output+init_size) is the copied ZO.
+ * [input, input+input_len) is the copied compressed payload (VO, i.e.
+ * vmlinux after objcopy, plus relocs), not the whole ZO.
+ *
+ * [input+input_len, output+init_size) is [_text, _end) for ZO. That is the
+ * first range in mem_avoid, which includes ZO's heap and stack. Also,
+ * [input, input+input_size) needs to be put in the mem_avoid array, but since it
+ * is adjacent to the first entry, they can be merged. This is how we get
+ * the first entry in mem_avoid[].
+ */
static void mem_avoid_init(unsigned long input, unsigned long input_size,
- unsigned long output, unsigned long output_size)
+ unsigned long output)
{
+ unsigned long init_size = real_mode->hdr.init_size;
u64 initrd_start, initrd_size;
u64 cmd_line, cmd_line_size;
- unsigned long unsafe, unsafe_len;
char *ptr;
/*
* Avoid the region that is unsafe to overlap during
- * decompression (see calculations at top of misc.c).
+ * decompression.
*/
- unsafe_len = (output_size >> 12) + 32768 + 18;
- unsafe = (unsigned long)input + input_size - unsafe_len;
- mem_avoid[0].start = unsafe;
- mem_avoid[0].size = unsafe_len;
+ mem_avoid[0].start = input;
+ mem_avoid[0].size = (output + init_size) - input;
/* Avoid initrd. */
initrd_start = (u64)real_mode->ext_ramdisk_image << 32;
@@ -169,13 +213,9 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
mem_avoid[2].start = cmd_line;
mem_avoid[2].size = cmd_line_size;
- /* Avoid heap memory. */
- mem_avoid[3].start = (unsigned long)free_mem_ptr;
- mem_avoid[3].size = BOOT_HEAP_SIZE;
-
- /* Avoid stack memory. */
- mem_avoid[4].start = (unsigned long)free_mem_end_ptr;
- mem_avoid[4].size = BOOT_STACK_SIZE;
+ /* Avoid params */
+ mem_avoid[3].start = (unsigned long)real_mode;
+ mem_avoid[3].size = sizeof(*real_mode);
}
/* Does this memory vector overlap a known avoided area? */
@@ -317,7 +357,7 @@ unsigned char *choose_kernel_location(unsigned char *input,
/* Record the various known unsafe memory ranges. */
mem_avoid_init((unsigned long)input, input_size,
- (unsigned long)output, output_size);
+ (unsigned long)output);
/* Walk e820 and find a random address. */
random = find_random_addr(choice, output_size);
--
2.6.3
From: Yinghai Lu <[email protected]>
In order to support on-demand page table creation when moving the
kernel for KASLR, we need to use kernel_ident_mapping_init in the
decompression code. This splits it out into its own file for use outside
of init_64.c. Additionally, checking for __pa/__va defines is added
since they need to be overridden in the decompression code.
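To illustrate why the #ifndef guards matter: the decompression code runs
identity-mapped, so it can define trivial __pa/__va before pulling in the
helper. A sketch of the intended usage (assumed, not part of this patch):

	/* In the decompressor, physical == virtual: */
	#define __pa(x)  ((unsigned long)(x))
	#define __va(x)  ((void *)((unsigned long)(x)))

	#include "../../mm/ident_map.c"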
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/include/asm/page.h | 5 +++
arch/x86/mm/ident_map.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/init_64.c | 74 +--------------------------------------------
3 files changed, 80 insertions(+), 73 deletions(-)
create mode 100644 arch/x86/mm/ident_map.c
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 802dde30c928..cf8f619b305f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -37,7 +37,10 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+#ifndef __pa
#define __pa(x) __phys_addr((unsigned long)(x))
+#endif
+
#define __pa_nodebug(x) __phys_addr_nodebug((unsigned long)(x))
/* __pa_symbol should be used for C visible symbols.
This seems to be the official gcc blessed way to do such arithmetic. */
@@ -51,7 +54,9 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
#define __pa_symbol(x) \
__phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
+#ifndef __va
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
+#endif
#define __boot_va(x) __va(x)
#define __boot_pa(x) __pa(x)
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
new file mode 100644
index 000000000000..751ca920773a
--- /dev/null
+++ b/arch/x86/mm/ident_map.c
@@ -0,0 +1,74 @@
+
+static void ident_pmd_init(unsigned long pmd_flag, pmd_t *pmd_page,
+ unsigned long addr, unsigned long end)
+{
+ addr &= PMD_MASK;
+ for (; addr < end; addr += PMD_SIZE) {
+ pmd_t *pmd = pmd_page + pmd_index(addr);
+
+ if (!pmd_present(*pmd))
+ set_pmd(pmd, __pmd(addr | pmd_flag));
+ }
+}
+static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
+ unsigned long addr, unsigned long end)
+{
+ unsigned long next;
+
+ for (; addr < end; addr = next) {
+ pud_t *pud = pud_page + pud_index(addr);
+ pmd_t *pmd;
+
+ next = (addr & PUD_MASK) + PUD_SIZE;
+ if (next > end)
+ next = end;
+
+ if (pud_present(*pud)) {
+ pmd = pmd_offset(pud, 0);
+ ident_pmd_init(info->pmd_flag, pmd, addr, next);
+ continue;
+ }
+ pmd = (pmd_t *)info->alloc_pgt_page(info->context);
+ if (!pmd)
+ return -ENOMEM;
+ ident_pmd_init(info->pmd_flag, pmd, addr, next);
+ set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+ }
+
+ return 0;
+}
+
+int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
+ unsigned long addr, unsigned long end)
+{
+ unsigned long next;
+ int result;
+ int off = info->kernel_mapping ? pgd_index(__PAGE_OFFSET) : 0;
+
+ for (; addr < end; addr = next) {
+ pgd_t *pgd = pgd_page + pgd_index(addr) + off;
+ pud_t *pud;
+
+ next = (addr & PGDIR_MASK) + PGDIR_SIZE;
+ if (next > end)
+ next = end;
+
+ if (pgd_present(*pgd)) {
+ pud = pud_offset(pgd, 0);
+ result = ident_pud_init(info, pud, addr, next);
+ if (result)
+ return result;
+ continue;
+ }
+
+ pud = (pud_t *)info->alloc_pgt_page(info->context);
+ if (!pud)
+ return -ENOMEM;
+ result = ident_pud_init(info, pud, addr, next);
+ if (result)
+ return result;
+ set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+ }
+
+ return 0;
+}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 214afda97911..65cfbeefbec4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -58,79 +58,7 @@
#include "mm_internal.h"
-static void ident_pmd_init(unsigned long pmd_flag, pmd_t *pmd_page,
- unsigned long addr, unsigned long end)
-{
- addr &= PMD_MASK;
- for (; addr < end; addr += PMD_SIZE) {
- pmd_t *pmd = pmd_page + pmd_index(addr);
-
- if (!pmd_present(*pmd))
- set_pmd(pmd, __pmd(addr | pmd_flag));
- }
-}
-static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
- unsigned long addr, unsigned long end)
-{
- unsigned long next;
-
- for (; addr < end; addr = next) {
- pud_t *pud = pud_page + pud_index(addr);
- pmd_t *pmd;
-
- next = (addr & PUD_MASK) + PUD_SIZE;
- if (next > end)
- next = end;
-
- if (pud_present(*pud)) {
- pmd = pmd_offset(pud, 0);
- ident_pmd_init(info->pmd_flag, pmd, addr, next);
- continue;
- }
- pmd = (pmd_t *)info->alloc_pgt_page(info->context);
- if (!pmd)
- return -ENOMEM;
- ident_pmd_init(info->pmd_flag, pmd, addr, next);
- set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
- }
-
- return 0;
-}
-
-int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
- unsigned long addr, unsigned long end)
-{
- unsigned long next;
- int result;
- int off = info->kernel_mapping ? pgd_index(__PAGE_OFFSET) : 0;
-
- for (; addr < end; addr = next) {
- pgd_t *pgd = pgd_page + pgd_index(addr) + off;
- pud_t *pud;
-
- next = (addr & PGDIR_MASK) + PGDIR_SIZE;
- if (next > end)
- next = end;
-
- if (pgd_present(*pgd)) {
- pud = pud_offset(pgd, 0);
- result = ident_pud_init(info, pud, addr, next);
- if (result)
- return result;
- continue;
- }
-
- pud = (pud_t *)info->alloc_pgt_page(info->context);
- if (!pud)
- return -ENOMEM;
- result = ident_pud_init(info, pud, addr, next);
- if (result)
- return result;
- set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
- }
-
- return 0;
-}
+#include "ident_map.c"
/*
* NOTE: pagetable_init alloc all the fixmap pagetables contiguous on the
--
2.6.3
From: Yinghai Lu <[email protected]>
This change makes later calculations about where the kernel is located
easier to reason about. To better understand this change, we must first
clarify what VO and ZO are. They were introduced in commits by hpa:
77d1a49 x86, boot: make symbols from the main vmlinux available
37ba7ab x86, boot: make kernel_alignment adjustable; new bzImage fields
Specifically:
VO:
- uncompressed kernel image
- size: VO__end - VO__text ("VO_INIT_SIZE" define)
ZO:
- bootable compressed kernel image (boot/compressed/vmlinux)
- head text + compressed kernel (VO and relocs table) + decompressor code
- size: ZO__end - ZO_startup_32 ("ZO_INIT_SIZE" define, though see below)
The INIT_SIZE definition is used to find the larger of the two image sizes:
#define ZO_INIT_SIZE (ZO__end - ZO_startup_32 + ZO_z_extract_offset)
#define VO_INIT_SIZE (VO__end - VO__text)
#if ZO_INIT_SIZE > VO_INIT_SIZE
#define INIT_SIZE ZO_INIT_SIZE
#else
#define INIT_SIZE VO_INIT_SIZE
#endif
The current code uses extract_offset to decide where to position the
copied ZO (i.e. ZO starts at extract_offset). (This is why ZO_INIT_SIZE
currently includes the extract_offset.)
When INIT_SIZE is bigger than VO_INIT_SIZE (uncommon but possible),
the copied ZO occupies the memory from extract_offset to the end of
decompression buffer. It overlaps with the soon-to-be-uncompressed kernel
like this:
|-----compressed kernel image------|
V V
0 extract_offset +INIT_SIZE
|-----------|---------------|-------------------------|--------|
| | | |
VO__text startup_32 of ZO VO__end ZO__end
^ ^
|-------uncompressed kernel image---------|
When INIT_SIZE is equal to VO_INIT_SIZE (the likely case), there's still
space left from the end of ZO to the end of the decompression buffer, as
shown below.
|-compressed kernel image-|
V V
0 extract_offset +INIT_SIZE
|-----------|---------------|-------------------------|--------|
| | | |
VO__text startup_32 of ZO ZO__end VO__end
^ ^
|------------uncompressed kernel image-------------|
To simplify calculations and avoid special cases, it is cleaner to
always place the compressed kernel image in memory so that ZO__end
is at the end of the decompression buffer, instead of placing its
start at extract_offset as is currently done.
This patch adds BP_init_size (which is the INIT_SIZE as passed in from
the boot_params) into asm-offsets.c to make it visible to the assembly
code. Then when moving the ZO, it calculates the starting position of
the copied ZO (via BP_init_size and the ZO run size) so that ZO__end
ends up at the end of the decompression buffer. To make the position
calculation safe, the end of ZO is page aligned (and a comment is added
to the existing VO alignment for good measure).
Signed-off-by: Yinghai Lu <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/head_32.S | 11 +++++++++--
arch/x86/boot/compressed/head_64.S | 8 ++++++--
arch/x86/boot/compressed/mkpiggy.c | 3 ---
arch/x86/boot/compressed/vmlinux.lds.S | 1 +
arch/x86/kernel/asm-offsets.c | 1 +
arch/x86/kernel/vmlinux.lds.S | 2 +-
6 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 8ef964ddc18e..0c140f99c602 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -148,7 +148,9 @@ preferred_addr:
1:
/* Target address to relocate to for decompression */
- addl $z_extract_offset, %ebx
+ movl BP_init_size(%esi), %eax
+ subl $_end, %eax
+ addl %eax, %ebx
/* Set up the stack */
leal boot_stack_end(%ebx), %esp
@@ -210,8 +212,13 @@ relocated:
/* push arguments for decompress_kernel: */
pushl $z_run_size /* size of kernel with .bss and .brk */
pushl $z_output_len /* decompressed length, end of relocs */
- leal z_extract_offset_negative(%ebx), %ebp
+
+ movl BP_init_size(%esi), %eax
+ subl $_end, %eax
+ movl %ebx, %ebp
+ subl %eax, %ebp
pushl %ebp /* output address */
+
pushl $z_input_len /* input_len */
leal input_data(%ebx), %eax
pushl %eax /* input_data */
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index b0c0d16ef58d..67dd8d300c61 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -102,7 +102,9 @@ ENTRY(startup_32)
1:
/* Target address to relocate to for decompression */
- addl $z_extract_offset, %ebx
+ movl BP_init_size(%esi), %eax
+ subl $_end, %eax
+ addl %eax, %ebx
/*
* Prepare for entering 64 bit mode
@@ -330,7 +332,9 @@ preferred_addr:
1:
/* Target address to relocate to for decompression */
- leaq z_extract_offset(%rbp), %rbx
+ movl BP_init_size(%rsi), %ebx
+ subl $_end, %ebx
+ addq %rbp, %rbx
/* Set up the stack */
leaq boot_stack_end(%rbx), %rsp
diff --git a/arch/x86/boot/compressed/mkpiggy.c b/arch/x86/boot/compressed/mkpiggy.c
index d8222f213182..b980046c3329 100644
--- a/arch/x86/boot/compressed/mkpiggy.c
+++ b/arch/x86/boot/compressed/mkpiggy.c
@@ -85,9 +85,6 @@ int main(int argc, char *argv[])
printf("z_output_len = %lu\n", (unsigned long)olen);
printf(".globl z_extract_offset\n");
printf("z_extract_offset = 0x%lx\n", offs);
- /* z_extract_offset_negative allows simplification of head_32.S */
- printf(".globl z_extract_offset_negative\n");
- printf("z_extract_offset_negative = -0x%lx\n", offs);
printf(".globl z_run_size\n");
printf("z_run_size = %lu\n", run_size);
diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
index 34d047c98284..e24e0a0c90c9 100644
--- a/arch/x86/boot/compressed/vmlinux.lds.S
+++ b/arch/x86/boot/compressed/vmlinux.lds.S
@@ -70,5 +70,6 @@ SECTIONS
_epgtable = . ;
}
#endif
+ . = ALIGN(PAGE_SIZE); /* keep ZO size page aligned */
_end = .;
}
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 5c042466f274..674134e9f5e5 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -80,6 +80,7 @@ void common(void) {
OFFSET(BP_hardware_subarch, boot_params, hdr.hardware_subarch);
OFFSET(BP_version, boot_params, hdr.version);
OFFSET(BP_kernel_alignment, boot_params, hdr.kernel_alignment);
+ OFFSET(BP_init_size, boot_params, hdr.init_size);
OFFSET(BP_pref_address, boot_params, hdr.pref_address);
OFFSET(BP_code32_start, boot_params, hdr.code32_start);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 4c941f88d405..9297a002d8e5 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -334,7 +334,7 @@ SECTIONS
__brk_limit = .;
}
- . = ALIGN(PAGE_SIZE);
+ . = ALIGN(PAGE_SIZE); /* keep VO_INIT_SIZE page aligned */
_end = .;
STABS_DEBUG
--
2.6.3
From: Yinghai Lu <[email protected]>
Since the boot_params can be found using the real_mode global variable,
there is no need to pass around a pointer to it. This slightly simplifies
the choose_kernel_location function and its callers.
Signed-off-by: Yinghai Lu <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 5 ++---
arch/x86/boot/compressed/misc.c | 2 +-
arch/x86/boot/compressed/misc.h | 6 ++----
3 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index 6a9b96b4624d..622aa881c6ab 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -295,8 +295,7 @@ static unsigned long find_random_addr(unsigned long minimum,
return slots_fetch_random();
}
-unsigned char *choose_kernel_location(struct boot_params *boot_params,
- unsigned char *input,
+unsigned char *choose_kernel_location(unsigned char *input,
unsigned long input_size,
unsigned char *output,
unsigned long output_size)
@@ -316,7 +315,7 @@ unsigned char *choose_kernel_location(struct boot_params *boot_params,
}
#endif
- boot_params->hdr.loadflags |= KASLR_FLAG;
+ real_mode->hdr.loadflags |= KASLR_FLAG;
/* Record the various known unsafe memory ranges. */
mem_avoid_init((unsigned long)input, input_size,
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 79dac1758e7c..f35ad9eb1bf1 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -428,7 +428,7 @@ asmlinkage __visible void *decompress_kernel(void *rmode, memptr heap,
* the entire decompressed kernel plus relocation table, or the
* entire decompressed kernel plus .bss and .brk sections.
*/
- output = choose_kernel_location(real_mode, input_data, input_len, output,
+ output = choose_kernel_location(input_data, input_len, output,
output_len > run_size ? output_len
: run_size);
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 3783dc3e10b3..dcf01c22400e 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -67,8 +67,7 @@ int cmdline_find_option_bool(const char *option);
#if CONFIG_RANDOMIZE_BASE
/* aslr.c */
-unsigned char *choose_kernel_location(struct boot_params *boot_params,
- unsigned char *input,
+unsigned char *choose_kernel_location(unsigned char *input,
unsigned long input_size,
unsigned char *output,
unsigned long output_size);
@@ -76,8 +75,7 @@ unsigned char *choose_kernel_location(struct boot_params *boot_params,
bool has_cpuflag(int flag);
#else
static inline
-unsigned char *choose_kernel_location(struct boot_params *boot_params,
- unsigned char *input,
+unsigned char *choose_kernel_location(unsigned char *input,
unsigned long input_size,
unsigned char *output,
unsigned long output_size)
--
2.6.3
From: Baoquan He <[email protected]>
Currently CONFIG_RANDOMIZE_BASE_MAX_OFFSET is used to limit the maximum
offset for kernel randomization. This limit doesn't need to be a CONFIG
since it is tied completely to KERNEL_IMAGE_SIZE, and will make no sense
once physical and virtual offsets are randomized separately. This patch
removes CONFIG_RANDOMIZE_BASE_MAX_OFFSET and consolidates the Kconfig
help text.
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog, dropped KERNEL_IMAGE_SIZE_DEFAULT, moved earlier]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/Kconfig | 57 +++++++++++++-----------------------
arch/x86/boot/compressed/aslr.c | 12 ++++----
arch/x86/include/asm/page_64_types.h | 8 ++---
arch/x86/mm/init_32.c | 3 --
4 files changed, 29 insertions(+), 51 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2dc18605831f..fd9ac711ada8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1936,51 +1936,36 @@ config RANDOMIZE_BASE
depends on RELOCATABLE
default n
---help---
- Randomizes the physical and virtual address at which the
- kernel image is decompressed, as a security feature that
- deters exploit attempts relying on knowledge of the location
- of kernel internals.
+ Randomizes the physical address at which the kernel image
+ is decompressed and the virtual address where the kernel
+ image is mapped, as a secrurity feature that deters exploit
+ attempts relying on knowledge of the location of kernel
+ internals.
+
+ The kernel physical address can be randomized from 16M to
+ 64T at most. The kernel virtual address will be offset
+ by up to KERNEL_IMAGE_SIZE. On 32-bit KERNEL_IMAGE_SIZE is
+ 512MiB. while on 64-bit this is limited by how the kernel
+ fixmap page table is positioned, so this cannot be larger
+ than 1GiB currently. Without RANDOMIZE_BASE there is a 512MiB
+ to 1.5GiB split between kernel and modules. When RANDOMIZE_BASE
+ is enabled, the modules area will shrink to compensate, up
+ to a 1GiB to 1GiB split, KERNEL_IMAGE_SIZE changes from 512MiB
+ to 1GiB.
Entropy is generated using the RDRAND instruction if it is
supported. If RDTSC is supported, it is used as well. If
neither RDRAND nor RDTSC are supported, then randomness is
read from the i8254 timer.
- The kernel will be offset by up to RANDOMIZE_BASE_MAX_OFFSET,
- and aligned according to PHYSICAL_ALIGN. Since the kernel is
- built using 2GiB addressing, and PHYSICAL_ALGIN must be at a
- minimum of 2MiB, only 10 bits of entropy is theoretically
- possible. At best, due to page table layouts, 64-bit can use
- 9 bits of entropy and 32-bit uses 8 bits.
+ Since the kernel is built using 2GiB addressing, and
+ PHYSICAL_ALGIN must be at a minimum of 2MiB, only 10 bits of
+ entropy is theoretically possible. At best, due to page table
+ layouts, 64-bit can use 9 bits of entropy and 32-bit uses 8
+ bits.
If unsure, say N.
-config RANDOMIZE_BASE_MAX_OFFSET
- hex "Maximum kASLR offset allowed" if EXPERT
- depends on RANDOMIZE_BASE
- range 0x0 0x20000000 if X86_32
- default "0x20000000" if X86_32
- range 0x0 0x40000000 if X86_64
- default "0x40000000" if X86_64
- ---help---
- The lesser of RANDOMIZE_BASE_MAX_OFFSET and available physical
- memory is used to determine the maximal offset in bytes that will
- be applied to the kernel when kernel Address Space Layout
- Randomization (kASLR) is active. This must be a multiple of
- PHYSICAL_ALIGN.
-
- On 32-bit this is limited to 512MiB by page table layouts. The
- default is 512MiB.
-
- On 64-bit this is limited by how the kernel fixmap page table is
- positioned, so this cannot be larger than 1GiB currently. Without
- RANDOMIZE_BASE, there is a 512MiB to 1.5GiB split between kernel
- and modules. When RANDOMIZE_BASE_MAX_OFFSET is above 512MiB, the
- modules area will shrink to compensate, up to the current maximum
- 1GiB to 1GiB split. The default is 1GiB.
-
- If unsure, leave at the default value.
-
# Relocation on x86 needs some additional build support
config X86_NEED_RELOCS
def_bool y
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index 622aa881c6ab..462654097d63 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -206,15 +206,13 @@ static bool mem_avoid_overlap(struct mem_vector *img)
return false;
}
-static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
- CONFIG_PHYSICAL_ALIGN];
+static unsigned long slots[KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN];
static unsigned long slot_max;
static void slots_append(unsigned long addr)
{
/* Overflowing the slots list should be impossible. */
- if (slot_max >= CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
- CONFIG_PHYSICAL_ALIGN)
+ if (slot_max >= KERNEL_IMAGE_SIZE / CONFIG_PHYSICAL_ALIGN)
return;
slots[slot_max++] = addr;
@@ -240,7 +238,7 @@ static void process_e820_entry(struct e820entry *entry,
return;
/* Ignore entries entirely above our maximum. */
- if (entry->addr >= CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
+ if (entry->addr >= KERNEL_IMAGE_SIZE)
return;
/* Ignore entries entirely below our minimum. */
@@ -265,8 +263,8 @@ static void process_e820_entry(struct e820entry *entry,
region.size -= region.start - entry->addr;
/* Reduce maximum size to fit end of image within maximum limit. */
- if (region.start + region.size > CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
- region.size = CONFIG_RANDOMIZE_BASE_MAX_OFFSET - region.start;
+ if (region.start + region.size > KERNEL_IMAGE_SIZE)
+ region.size = KERNEL_IMAGE_SIZE - region.start;
/* Walk each aligned slot and check for avoided areas. */
for (img.start = region.start, img.size = image_size ;
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 4928cf0d5af0..d5c2f8b40faa 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -47,12 +47,10 @@
* are fully set up. If kernel ASLR is configured, it can extend the
* kernel page table mapping, reducing the size of the modules area.
*/
-#define KERNEL_IMAGE_SIZE_DEFAULT (512 * 1024 * 1024)
-#if defined(CONFIG_RANDOMIZE_BASE) && \
- CONFIG_RANDOMIZE_BASE_MAX_OFFSET > KERNEL_IMAGE_SIZE_DEFAULT
-#define KERNEL_IMAGE_SIZE CONFIG_RANDOMIZE_BASE_MAX_OFFSET
+#if defined(CONFIG_RANDOMIZE_BASE)
+#define KERNEL_IMAGE_SIZE (1024 * 1024 * 1024)
#else
-#define KERNEL_IMAGE_SIZE KERNEL_IMAGE_SIZE_DEFAULT
+#define KERNEL_IMAGE_SIZE (512 * 1024 * 1024)
#endif
#endif /* _ASM_X86_PAGE_64_DEFS_H */
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index bd7a9b9e2e14..f2ee42d61894 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -804,9 +804,6 @@ void __init mem_init(void)
BUILD_BUG_ON(VMALLOC_START >= VMALLOC_END);
#undef high_memory
#undef __FIXADDR_TOP
-#ifdef CONFIG_RANDOMIZE_BASE
- BUILD_BUG_ON(CONFIG_RANDOMIZE_BASE_MAX_OFFSET > KERNEL_IMAGE_SIZE);
-#endif
#ifdef CONFIG_HIGHMEM
BUG_ON(PKMAP_BASE + LAST_PKMAP*PAGE_SIZE > FIXADDR_START);
--
2.6.3
From: Baoquan He <[email protected]>
When processing the relocation table, the offset used to calculate the
relocation is an int. This is sufficient for calculating the physical
address of the relocs entry on 32-bit systems and on 64-bit systems when
the relocation is under 2G. To handle relocations above 2G (seen in
situations like kexec, netboot, etc), this offset needs to be calculated
using a long to avoid wrapping and miscalculating the relocation.
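A minimal user-space sketch of the wrap (illustrative only, not kernel
code):

	#include <stdio.h>

	int main(void)
	{
		int reloc = 0x01000000;            /* 32-bit relocs entry */
		unsigned long map = 0x100000000UL; /* load delta above 4G */

		int wrapped = reloc;
		wrapped += map;                    /* truncated to 32 bits */

		long extended = reloc;
		extended += map;                   /* full 64-bit arithmetic */

		printf("int:  %#lx\n", (unsigned long)wrapped);  /* 0x1000000 */
		printf("long: %#lx\n", (unsigned long)extended); /* 0x101000000 */
		return 0;
	}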
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/misc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index f35ad9eb1bf1..c4477d5f3fff 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -295,7 +295,7 @@ static void handle_relocations(void *output, unsigned long output_len)
* So we work backwards from the end of the decompressed image.
*/
for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) {
- int extended = *reloc;
+ long extended = *reloc;
extended += map;
ptr = (unsigned long)extended;
--
2.6.3
From: Baoquan He <[email protected]>
This patch exchanges the prior slots[] array for the new slot_areas[]
array, and lifts the limitation of KERNEL_IMAGE_SIZE on the physical
address offset for 64-bit. As before, process_e820_entry walks memory and
populates slot_areas[], splitting on any detected mem_avoid collisions.
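The slot_area bookkeeping used by this hunk is introduced elsewhere in
the series and is not visible in the diff below; a sketch of what it
plausibly looks like (names match the hunk, details are assumptions):

	struct slot_area {
		unsigned long addr;	/* first usable slot in this area */
		int num;		/* number of aligned slots here */
	};

	#define MAX_SLOT_AREA 100

	static struct slot_area slot_areas[MAX_SLOT_AREA];
	static unsigned long slot_max;
	static int slot_area_index;

	static void store_slot_info(struct mem_vector *region,
				    unsigned long image_size)
	{
		struct slot_area slot_area;

		if (slot_area_index == MAX_SLOT_AREA)
			return;

		slot_area.addr = region->start;
		slot_area.num = (region->size - image_size) /
				CONFIG_PHYSICAL_ALIGN + 1;

		if (slot_area.num > 0) {
			slot_areas[slot_area_index++] = slot_area;
			slot_max += slot_area.num;
		}
	}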
Signed-off-by: Baoquan He <[email protected]>
[kees: rewrote changelog, refactored goto into while, limit 32-bit to 1G]
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/aslr.c | 92 ++++++++++++++++++++++++++++++-----------
1 file changed, 68 insertions(+), 24 deletions(-)
diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
index 53ceaa0a08b9..0587eac3e05d 100644
--- a/arch/x86/boot/compressed/aslr.c
+++ b/arch/x86/boot/compressed/aslr.c
@@ -335,25 +335,42 @@ static void slots_append(unsigned long addr)
static unsigned long slots_fetch_random(void)
{
+ unsigned long random;
+ int i;
+
/* Handle case of no slots stored. */
if (slot_max == 0)
return 0;
- return slots[get_random_long("Physical") % slot_max];
+ random = get_random_long("Physical") % slot_max;
+
+ for (i = 0; i < slot_area_index; i++) {
+ if (random >= slot_areas[i].num) {
+ random -= slot_areas[i].num;
+ continue;
+ }
+ return slot_areas[i].addr + random * CONFIG_PHYSICAL_ALIGN;
+ }
+
+ if (i == slot_area_index)
+ debug_putstr("slots_fetch_random() failed!?\n");
+ return 0;
}
static void process_e820_entry(struct e820entry *entry,
unsigned long minimum,
unsigned long image_size)
{
- struct mem_vector region, img;
+ struct mem_vector region, out;
+ struct slot_area slot_area;
+ unsigned long min, start_orig;
/* Skip non-RAM entries. */
if (entry->type != E820_RAM)
return;
- /* Ignore entries entirely above our maximum. */
- if (entry->addr >= KERNEL_IMAGE_SIZE)
+ /* On 32-bit, ignore entries entirely above our maximum. */
+ if (IS_ENABLED(CONFIG_X86_32) && entry->addr >= KERNEL_IMAGE_SIZE)
return;
/* Ignore entries entirely below our minimum. */
@@ -363,31 +380,54 @@ static void process_e820_entry(struct e820entry *entry,
region.start = entry->addr;
region.size = entry->size;
- /* Potentially raise address to minimum location. */
- if (region.start < minimum)
- region.start = minimum;
+ /* Give up if slot area array is full. */
+ while (slot_area_index < MAX_SLOT_AREA) {
+ start_orig = region.start;
- /* Potentially raise address to meet alignment requirements. */
- region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
+ /* Potentially raise address to minimum location. */
+ if (region.start < minimum)
+ region.start = minimum;
- /* Did we raise the address above the bounds of this e820 region? */
- if (region.start > entry->addr + entry->size)
- return;
+ /* Potentially raise address to meet alignment needs. */
+ region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);
- /* Reduce size by any delta from the original address. */
- region.size -= region.start - entry->addr;
+ /* Did we raise the address above this e820 region? */
+ if (region.start > entry->addr + entry->size)
+ return;
- /* Reduce maximum size to fit end of image within maximum limit. */
- if (region.start + region.size > KERNEL_IMAGE_SIZE)
- region.size = KERNEL_IMAGE_SIZE - region.start;
+ /* Reduce size by any delta from the original address. */
+ region.size -= region.start - start_orig;
- /* Walk each aligned slot and check for avoided areas. */
- for (img.start = region.start, img.size = image_size ;
- mem_contains(&region, &img) ;
- img.start += CONFIG_PHYSICAL_ALIGN) {
- if (mem_avoid_overlap(&img))
- continue;
- slots_append(img.start);
+ /* On 32-bit, reduce region size to fit within max size. */
+ if (IS_ENABLED(CONFIG_X86_32) &&
+ region.start + region.size > KERNEL_IMAGE_SIZE)
+ region.size = KERNEL_IMAGE_SIZE - region.start;
+
+ /* Return if region can't contain decompressed kernel */
+ if (region.size < image_size)
+ return;
+
+ /* If nothing overlaps, store the region and return. */
+ if (!mem_avoid_overlap(&region)) {
+ store_slot_info(&region, image_size);
+ return;
+ }
+
+ /* Otherwise, find the lowest overlap. */
+ min = mem_min_overlap(&region, &out);
+
+ /* Store the region if it can hold at least image_size. */
+ if (min > region.start + image_size) {
+ struct mem_vector tmp;
+
+ tmp.start = region.start;
+ tmp.size = min - region.start;
+ store_slot_info(&tmp, image_size);
+ }
+
+ /* Clip off the overlapping region and start over. */
+ region.size -= out.start - region.start + out.size;
+ region.start = out.start + out.size;
}
}
@@ -403,6 +443,10 @@ static unsigned long find_random_phy_addr(unsigned long minimum,
/* Verify potential e820 positions, appending to slots list. */
for (i = 0; i < real_mode->e820_entries; i++) {
process_e820_entry(&real_mode->e820_map[i], minimum, size);
+ if (slot_area_index == MAX_SLOT_AREA) {
+ debug_putstr("Aborted e820 scan (slot_areas full)!\n");
+ break;
+ }
}
return slots_fetch_random();
--
2.6.3
* Kees Cook <[email protected]> wrote:
> From: Yinghai Lu <[email protected]>
>
> Since the boot_params can be found using the real_mode global variable, there is
> no need to pass around a pointer to it. This slightly simplifies the
> choose_kernel_location function and its callers.
Yeah, so I really wanted to apply this series because the changelogs are now very
nice, but got held up by the very first patch again ...
Guys, 'real_mode' totally sucks as a variable name!
By removing a seemingly unnecessary parameter, you made the code actively worse to
read...
So please make a second patch on top of this one that renames 'real_mode' to
something that both displays what it is about (it sure isn't mainly about real
mode only!!), and also expresses that it's a global variable, not some local
function parameter.
In arch/x86/ we usually achieve that via prefixing it with x86_ or so. So
something like 'x86_boot_params' would work for me. I realize that this is
somewhat special, because this is pre-relocation code so we really have to be
careful about pointers - so maybe name it x86_boot_params_rm or so.
Also, I had a look at the whole file you are patching,
arch/x86/boot/compressed/misc.c, and for heaven's sake, please first improve the
comments on top of that file before further complicating that code with KASLR
details! Right now it consists of various low level ramblings chock-full of typos.
No mention of where KASLR fits into the picture at all!
Right now it says:
* misc.c
*
* This is a collection of several routines from gzip-1.0.3
* adapted for Linux.
Which was perhaps true 15 years ago, but sure it's not true today! We do
relocation and KASLR processing in this path as well, which is just as important
as running the decompressor. Also, gzip is not the only decompressor we support.
The high level purpose of the code in this file should be explained first, and the
file should probably be renamed to something like extract.c - to signal that this
file is about extracting a kernel image and preparing it for execution (i.e.
relinking it, etc.). It's (much!) more than just pure decompression.
Likewise, decompress_kernel() should be renamed to extract_kernel().
The low level explanations in extract.c should be moved to the function that is
affected by those details. The top of the file should only contain general
explanations, a high level description of in what state the kernel is when we call
this code, and the sort of stuff we do in that code.
The credits should be collected into a single, coherent block. Unnecessary fluff
should be cut.
Furthermore, I see similar problems with arch/x86/boot/compressed/aslr.c: for
example it has no high level description at the top of the file _at all_. WTF is
this with writing security-sensitive code that has no high level design
description whatsoever?? Also, why is it named aslr.c, why not kaslr.c? We do have
both KASLR and ASLR code in the kernel, confusing naming on the source code level
sure does not help.
Also, what's this thing about choose_kernel_location()? That name is actively
hiding the fact that the main purpose of that function is to randomize things ...
It should be named x86_randomize_kernel_address() or so.
We need to stop making a mess of the x86 boot code! This is a highly critical
piece of kernel code that every single x86 Linux user will execute, we should
improve its visual presentation and general readability to the level of being
proud of it ...
Thanks,
Ingo
* Kees Cook <[email protected]> wrote:
> From: Baoquan He <[email protected]>
>
> When processing the relocation table, the offset used to calculate the
> relocation is an int. This is sufficient for calculating the physical
> address of the relocs entry on 32-bit systems and on 64-bit systems when
> the relocation is under 2G. To handle relocations above 2G (seen in
> situations like kexec, netboot, etc), this offset needs to be calculated
> using a long to avoid wrapping and miscalculating the relocation.
>
> Signed-off-by: Baoquan He <[email protected]>
> [kees: rewrote changelog]
> Signed-off-by: Kees Cook <[email protected]>
> ---
> arch/x86/boot/compressed/misc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index f35ad9eb1bf1..c4477d5f3fff 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -295,7 +295,7 @@ static void handle_relocations(void *output, unsigned long output_len)
> * So we work backwards from the end of the decompressed image.
> */
> for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) {
> - int extended = *reloc;
> + long extended = *reloc;
> extended += map;
>
> ptr = (unsigned long)extended;
This patch and the code it patches are just plain sloppy. See this cast? This is an
object lesson in why type casts in C are actively dangerous: they have hidden the
32-bit truncation bug you've fixed with this patch.
But the primary bug, the cast, should be fixed! Together with all the other casts
of 'extended'.
Also, the boot code should be reviewed for unnecessary casts, it seems to be a
disease:
triton:~/tip> git grep -cE '\(unsigned.*;$' arch/x86/boot/compressed/
arch/x86/boot/compressed/aslr.c:14
arch/x86/boot/compressed/eboot.c:41
arch/x86/boot/compressed/misc.c:5
arch/x86/boot/compressed/misc.h:2
arch/x86/boot/compressed/mkpiggy.c:1
arch/x86/boot/compressed/string.c:4
For example the type dance and overloaded usage that choose_kernel_location() does
with the 'random' local variable in aslr.c is disgusting:
void choose_kernel_location(unsigned char *input,
unsigned long input_size,
unsigned char **output,
unsigned long output_size,
unsigned char **virt_offset)
{
unsigned long random, min_addr;
*virt_offset = (unsigned char *)LOAD_PHYSICAL_ADDR;
#ifdef CONFIG_HIBERNATION
if (!cmdline_find_option_bool("kaslr")) {
debug_putstr("KASLR disabled by default...\n");
return;
}
#else
if (cmdline_find_option_bool("nokaslr")) {
debug_putstr("KASLR disabled by cmdline...\n");
return;
}
#endif
real_mode->hdr.loadflags |= KASLR_FLAG;
/* Record the various known unsafe memory ranges. */
mem_avoid_init((unsigned long)input, input_size,
(unsigned long)*output);
/* Low end should be the smaller of 512M or initial location. */
min_addr = min((unsigned long)*output, 512UL << 20);
/* Walk e820 and find a random address. */
random = find_random_phy_addr(min_addr, output_size);
if (!random)
debug_putstr("KASLR could not find suitable E820 region...\n");
else {
if ((unsigned long)*output != random) {
fill_pagetable(random, output_size);
switch_pagetable();
*output = (unsigned char *)random;
}
}
/* Pick random virtual address starting from LOAD_PHYSICAL_ADDR. */
if (IS_ENABLED(CONFIG_X86_64))
random = find_random_virt_offset(LOAD_PHYSICAL_ADDR,
output_size);
*virt_offset = (unsigned char *)random;
}
Firstly, 'random' is a libc function name. We generally don't overload those.
Secondly, it's a random what? Variable names should make it plenty obvious. So it
should probably be named 'random_addr'.
Third:
/* Walk e820 and find a random address. */
random = find_random_phy_addr(min_addr, output_size);
yeah, so what that comment tells us we knew already, due to the function name!
What the comment should _really_ talk about is the high level purpose. Something
like: 'Walk the e820 map and find a random free RAM address to which we can still
decompress the whole kernel' would work so much better ...
Fourth, this function has seven (!!) type casts. We can sure do better.
Fifth:
/* Pick random virtual address starting from LOAD_PHYSICAL_ADDR. */
if (IS_ENABLED(CONFIG_X86_64))
random = find_random_virt_offset(LOAD_PHYSICAL_ADDR,
output_size);
*virt_offset = (unsigned char *)random;
So the purpose of this whole function is to pick _two_ random addresses: the
random physical address to place the kernel at, and on x86_64, to also randomize
the kernel virtual address, right? So exactly which comment tells us that it's
about this? Names like 'choose_kernel_location' are singular and are actively
misleading about this ...
... and then I haven't even mentioned small details like the imbalanced curly
braces.
This code sucks, and I'm not surprised at all that it was broken. It should be
improved before we can feature-extend it.
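To sketch what I mean (untested, names per the suggestions above):

	/*
	 * Walk the e820 map and find a random free RAM region to which
	 * we can still decompress the whole kernel:
	 */
	random_addr = find_random_phy_addr(min_addr, output_size);
	if (!random_addr) {
		debug_putstr("KASLR could not find suitable E820 region...\n");
	} else if ((unsigned long)*output != random_addr) {
		fill_pagetable(random_addr, output_size);
		switch_pagetable();
		*output = (unsigned char *)random_addr;
	}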
Thanks,
Ingo
* Kees Cook <[email protected]> wrote:
> From: Baoquan He <[email protected]>
>
> Currently CONFIG_RANDOMIZE_BASE_MAX_OFFSET is used to limit the maximum
> offset for kernel randomization. This limit doesn't need to be a CONFIG
> since it is tied completely to KERNEL_IMAGE_SIZE, and will make no sense
> once physical and virtual offsets are randomized separately. This patch
> removes CONFIG_RANDOMIZE_BASE_MAX_OFFSET and consolidates the Kconfig
> help text.
>
> Signed-off-by: Baoquan He <[email protected]>
> [kees: rewrote changelog, dropped KERNEL_IMAGE_SIZE_DEFAULT, moved earlier]
> Signed-off-by: Kees Cook <[email protected]>
> ---
> arch/x86/Kconfig | 57 +++++++++++++-----------------------
> arch/x86/boot/compressed/aslr.c | 12 ++++----
> arch/x86/include/asm/page_64_types.h | 8 ++---
> arch/x86/mm/init_32.c | 3 --
> 4 files changed, 29 insertions(+), 51 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2dc18605831f..fd9ac711ada8 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1936,51 +1936,36 @@ config RANDOMIZE_BASE
> depends on RELOCATABLE
> default n
> ---help---
> - Randomizes the physical and virtual address at which the
> - kernel image is decompressed, as a security feature that
> - deters exploit attempts relying on knowledge of the location
> - of kernel internals.
> + Randomizes the physical address at which the kernel image
> + is decompressed and the virtual address where the kernel
> + image is mapped, as a secrurity feature that deters exploit
Guys, please _read_ what you write: s/secrurity/security
> + attempts relying on knowledge of the location of kernel
> + internals.
> +
> + The kernel physical address can be randomized from 16M to
> + 64T at most.
The 64TB value sure reads weird if you are configuring a 32-bit system ...
A much better approach would be to split the help text into 32-bit and 64-bit
portions:
On 64-bit systems the kernel physical address will be randomized from 16M to the
top of available physical memory. (With a maximum of 64TB.)
On 32-bit systems the kernel physical address will be randomized from 16MB to
1GB.
Also note the assertive tone: if this Kconfig feature is enabled, we say that the
kernel address _will_ be randomized, and we should make sure it is. (If for some
weird reason randomization fails we should warn prominently during bootup.)
> The kernel virtual address will be offset
> + by up to KERNEL_IMAGE_SIZE. On 32-bit KERNEL_IMAGE_SIZE is
> + 512MiB. while on 64-bit this is limited by how the kernel
> + fixmap page table is positioned, so this cannot be larger
> + than 1GiB currently. Without RANDOMIZE_BASE there is a 512MiB
> + to 1.5GiB split between kernel and modules. When RANDOMIZE_BASE
> + is enabled, the modules area will shrink to compensate, up
> + to a 1GiB to 1GiB split, KERNEL_IMAGE_SIZE changes from 512MiB
> + to 1GiB.
Beyond the broken capitalization, I'll show you what 99.999% of users who are not
kernel hackers will understand from this paragraph, approximately:
To dream: ay, and them? To bear to sling afterprises
coil, and scover'd cowards of resolence dream: ay, the us for no mome
wish'd. Thus and sweary life, or nobles cast and makes, whips and that
is sicklied of resolence of so long afterprises us more; for whips
all; and name whething after bear to sleep; to beart-ache shocks the
undiscover'd consummative have, but that pith a sleep of somethe under
'tis the spurns of troud makes off thance doth make whose bourns of
dispriz'd consient and arms more.
So this is really deep kernel internals, I get a headache trying to interpret it,
and it's my job to interpret this! Please try to formulate key pieces of
information in Kconfig help texts in a more ... approachable fashion, and move the
jargon to .c source code files.
> Entropy is generated using the RDRAND instruction if it is
> supported. If RDTSC is supported, it is used as well. If
> neither RDRAND nor RDTSC are supported, then randomness is
> read from the i8254 timer.
Also, instead of 'used as well' I'd say "is mixed into the entropy pool as well"
or so, to make sure it's clear that we don't exclusively rely on RDRAND or RDTSC.
Also, could we always mix the i8254 timer into this as well, not just when RDTSC
is unavailable?
> - The kernel will be offset by up to RANDOMIZE_BASE_MAX_OFFSET,
> - and aligned according to PHYSICAL_ALIGN. Since the kernel is
> - built using 2GiB addressing, and PHYSICAL_ALGIN must be at a
> - minimum of 2MiB, only 10 bits of entropy is theoretically
> - possible. At best, due to page table layouts, 64-bit can use
> - 9 bits of entropy and 32-bit uses 8 bits.
> + Since the kernel is built using 2GiB addressing, and
> + PHYSICAL_ALGIN must be at a minimum of 2MiB, only 10 bits of
> + entropy is theoretically possible. At best, due to page table
> + layouts, 64-bit can use 9 bits of entropy and 32-bit uses 8
> + bits.
Please read what you write, there's a typo in this section.
Another request: please stop the MiB/GiB nonsense and call it MB/GB. This isn't
storage code that has to fight marketing lies. Only the KASLR section in
arch/x86/Kconfig* is using MiB/GiB, everything else uses MB/GB naming, we should
stick with that.
Thanks,
Ingo
* Kees Cook <[email protected]> wrote:
> From: Yinghai Lu <[email protected]>
>
> This change makes later calculations about where the kernel is located
> easier to reason about. To better understand this change, we must first
> clarify what VO and ZO are. They were introduced in commits by hpa:
>
> 77d1a49 x86, boot: make symbols from the main vmlinux available
> 37ba7ab x86, boot: make kernel_alignment adjustable; new bzImage fields
>
> Specifically:
>
> VO:
> - uncompressed kernel image
> - size: VO__end - VO__text ("VO_INIT_SIZE" define)
>
> ZO:
> - bootable compressed kernel image (boot/compressed/vmlinux)
> - head text + compressed kernel (VO and relocs table) + decompressor code
> - size: ZO__end - ZO_startup_32 ("ZO_INIT_SIZE" define, though see below)
>
> The INIT_SIZE definition is used to find the larger of the two image sizes:
>
> #define ZO_INIT_SIZE (ZO__end - ZO_startup_32 + ZO_z_extract_offset)
> #define VO_INIT_SIZE (VO__end - VO__text)
> #if ZO_INIT_SIZE > VO_INIT_SIZE
> #define INIT_SIZE ZO_INIT_SIZE
> #else
> #define INIT_SIZE VO_INIT_SIZE
> #endif
Please also harmonize all the prefixes, i.e. use VO__/ZO__ everywhere (rename
things where necessary), instead of this mixed up VO_/ZO_/VO__/ZO__ mess.
Thanks,
Ingo
* Kees Cook <[email protected]> wrote:
> From: Yinghai Lu <[email protected]>
>
> Currently z_extract_offset is calculated in boot/compressed/mkpiggy.c.
What is the high level meaning of z_extract_offset? I cannot tell without reading
the code - and the point of changelogs is for them to be readable without having
to read the code.
The rest of the changelog is pretty good.
Thanks,
Ingo
* Kees Cook <[email protected]> wrote:
> From: Yinghai Lu <[email protected]>
>
> Currently, the kernel run_size (size of code plus brk and bss) is
> calculated via the shell script arch/x86/tools/calc_run_size.sh.
> It gets the file offset and mem size of for the .bss and .brk sections in
> vmlinux, and adds them as follows:
'of for'?
> So, from this we can see that the existing run_size calculation is
> 0x400000 too high.
Btw., why is it too high? Were some sections discarded?
> [...] And, as it turns out, the correct run_size is
> actually equal to VO_end - VO_text, which is certainly easier to calculate.
> _end: 0xffffffff8205f000
> _text:0xffffffff81000000
>
> 0xffffffff8205f000 - 0xffffffff81000000 = 0x105f000
>
> As a result, run_size is a simple constant, so we don't need to pass it
> around; we already have voffset.h such things.
(Typo.)
> [...] We can share voffset.h
> between misc.c and header.S instead of getting run_size in other ways.
> This patch moves voffset.h creation code to boot/compressed/Makefile,
> and switches misc.c to use the VO_end - VO_text calculation.
Btw., 'run_size' is a pretty meaningless name, it doesn't really tell us the most
important thing: that this is the complete size of the uncompressed kernel, the
total physically continuous size we need to allocate to it so it can execute?
So can we rename it to something more expressive, such as kernel_total_size or so?
Thanks,
Ingo
* Kees Cook <[email protected]> wrote:
> When INIT_SIZE is bigger than VO_INIT_SIZE (uncommon but possible),
> the copied ZO occupies the memory from extract_offset to the end of the
> decompression buffer. It overlaps with the soon-to-be-uncompressed kernel
> like this:
>
> |-----compressed kernel image------|
> V V
> 0 extract_offset +INIT_SIZE
> |-----------|---------------|-------------------------|--------|
> | | | |
> VO__text startup_32 of ZO VO__end ZO__end
> ^ ^
> |-------uncompressed kernel image---------|
>
> When INIT_SIZE is equal to VO_INIT_SIZE (the likely case), there's still
> space left from the end of ZO to the end of the decompression buffer, as
> shown below.
>
> |-compressed kernel image-|
> V V
> 0 extract_offset +INIT_SIZE
> |-----------|---------------|-------------------------|--------|
> | | | |
> VO__text startup_32 of ZO ZO__end VO__end
> ^ ^
> |------------uncompressed kernel image-------------|
>
> To simplify calculations and avoid special cases, it is cleaner to
> always place the compressed kernel image in memory so that ZO__end
> is at the end of the decompression buffer, instead of placing its
> start at extract_offset as is currently done.
Btw., it would be nice to also put such a visualization (of the current layout of
these values) into the code itself.
Thanks,
Ingo
On 2016-04-14 Kees Cook wrote:
> From: Yinghai Lu <[email protected]>
>
> parse_elf is using a local memcpy to move sections to their final
> running position. However, this memcpy only supports non-overlapping
> arguments (or dest < src).
The same copy of memcpy is used by the decompressors too.
> To avoid future hard-to-debug surprises, this adds checking in memcpy
> to detect the unhandled condition (which should not be happening
> currently).
It's already a minor surprise that memcpy is expected to work for
overlapping buffers at all. It could be good to have a comment about
it because "scroll" and parse_elf seem to rely on it.
On the other hand, the new code and error message take quite a few bytes
of space, so a complete memmove can be smaller:
void *memmove(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* No overlap, or dest below src: a forward copy is safe. */
	if (d <= s || d - s >= n)
		return __memcpy(dest, src, n);

	/* Overlapping with dest above src: copy backwards. */
	while (n-- > 0)
		d[n] = s[n];
	return dest;
}
Note that memmove is needed by lib/decompress_unxz.c. It contains its
own small version inside a "#ifndef memmove" block. That #ifndef should
be taken into account when adding a memmove symbol. Changing
decompress_unxz.c is fine but then one needs to think about other
archs too.
--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
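[The #ifndef guard Lasse mentions looks roughly like this in
lib/decompress_unxz.c; a paraphrased sketch, included here so the interaction
with a new global memmove symbol is clear:]

    #ifndef memmove
    /* Fallback used only when no memmove symbol is provided elsewhere. */
    void *memmove(void *dest, const void *src, size_t size)
    {
            uint8_t *d = dest;
            const uint8_t *s = src;
            size_t i;

            if (d < s) {
                    for (i = 0; i < size; ++i)
                            d[i] = s[i];    /* copy forward when dest is lower */
            } else if (d > s) {
                    i = size;
                    while (i-- > 0)
                            d[i] = s[i];    /* copy backward when dest is higher */
            }

            return dest;
    }
    #endif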
On 2016-04-14 Kees Cook wrote:
> + * Above analysis is for decompressing gzip compressed kernel only. Up to
> + * now 6 different decompressor are supported all together.
There are six decompressors in Linux now, but the number can change and
the comment become outdated, so I suggest omitting the exact number here.
> And among them
> + * xz stores data in chunks and has maximum chunk of 64K. Hence safety
> + * margin should be updated to cover all decompressors so that we don't
> + * need to deal with each of them separately. Please check
> + * the description in lib/decompressor_xxx.c for specific information.
> + *
> + * extra_bytes = (uncompressed_size >> 12) + 65536 + 128.
lib/decompress_xxx.c, not lib/decompressor_xxx.c.
Referring to those files is problematic because only decompress_unxz.c
talks about the safety margin. The explanation of the safety margin for
gzip is still in misc.c instead of decompress_inflate.c.
I suspect that safety margin calculations haven't been made for other
compressors in Linux, so there is nothing that people could read for
more information. At least such information isn't present in the
comments or commit messages.
For example (and a bit off-topic), there is a possible sign of too small
safety margin in decompress_unlzo.c, where a memcpy call can get
overlapping dest and src buffers:
/* When the input data is not compressed at all,
* lzo1x_decompress_safe will fail, so call memcpy()
* instead */
if (unlikely(dst_len == src_len))
memcpy(out_buf, in_buf, src_len);
The overlap can only happen if there's enough incompressible data near
the end of the kernel image. It still works in practice as long as
memcpy works with overlapping buffers for dest < src.
--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
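[The margin formula quoted above translates directly into code; a minimal
sketch of the quoted expression (the helper name is illustrative):]

    /*
     * Worst-case extra space needed so in-place decompression cannot
     * overwrite compressed data that has not been consumed yet. The
     * 64K term covers xz's maximum uncompressed chunk size.
     */
    static unsigned long safety_margin(unsigned long uncompressed_size)
    {
            return (uncompressed_size >> 12) + 65536 + 128;
    }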
On Fri, Apr 15, 2016 at 12:29 AM, Ingo Molnar <[email protected]> wrote:
>
> * Kees Cook <[email protected]> wrote:
>
>> From: Yinghai Lu <[email protected]>
>>
>> Since the boot_params can be found using the real_mode global variable, there is
>> no need to pass around a pointer to it. This slightly simplifies the
>> choose_kernel_location function and its callers.
>
> Yeah, so I really wanted to apply this series because the changelogs are now very
> nice, but got held up by the very first patch again ...
Thanks for reading through everything! I'll be going through it all
and fixing up what you mentioned (along with anything else that jumps
out at me).
> Guys, 'real_mode' totally sucks as a variable name!
In our defense, that variable name predates git history. ;)
Regardless, I'll get it renamed.
> By removing a seemingly unnecessary parameter, you made the code actively worse to
> read...
>
> So please make a second patch on top of this one that renames 'real_mode' to
> something that both displays what it is about (it sure isn't mainly about real
> mode only!!), and also expresses that it's a global variable, not some local
> function parameter.
>
> In arch/x86/ we usually achieve that via prefixing it with x86_ or so. So
> something like 'x86_boot_params' would work for me. I realize that this is
> somewhat special, because this is pre-relocation code so we really have to be
> careful about pointers - so maybe name it x86_boot_params_rm or so.
>
> Also, I had a look at the whole file you are patching,
> arch/x86/boot/compressed/misc.c, and for heaven's sake, please first improve the
> comments on top of that file before further complicating that code with KASLR
> details! Right now it consists of various low-level ramblings chock-full of typos.
> No mention of where KASLR fits into the picture at all!
>
> Right now it says:
>
> * misc.c
> *
> * This is a collection of several routines from gzip-1.0.3
> * adapted for Linux.
>
> Which was perhaps true 15 years ago, but sure it's not true today! We do
> relocation and KASLR processing in this path as well, which is just as important
> as running the decompressor. Also, gzip is not the only decompressor we support.
>
> The high level purpose of the code in this file should be explained first, and the
> file should probably be renamed to something like extract.c - to signal that this
> file is about extracting a kernel image and preparing it for execution (i.e.
> relinking it, etc.). It's (much!) more than just pure decompression.
>
> Likewise, decompress_kernel() should be renamed to extract_kernel().
I'll get these all cleaned up. Yay comment-bit-rot. :)
> The low level explanations in extract.c should be moved to the function that is
> affected by those details. The top of the file should only contain general
> explanations, a high level description of in what state the kernel is when we call
> this code, and the sort of stuff we do in that code.
>
> The credits should be collected into a single, coherent block. Unnecessary fluff
> should be cut.
>
> Furthermore, I see similar problems with arch/x86/boot/compressed/aslr.c: for
> example it has no high level description at the top of the file _at all_. WTF is
> this with writing security-sensitive code that has no high level design
> description whatsoever?? Also, why is it named aslr.c, why not kaslr.c? We do have
> both KASLR and ASLR code in the kernel, confusing naming on the source code level
> sure does not help.
>
> Also, what's this thing about choose_kernel_location()? That name is actively
> hiding the fact that the main purpose of that function is to randomize things ...
>
> It should be named x86_randomize_kernel_address() or so.
When naming this I wanted to make it clear that the only thing it was
doing was picking a location. The move is done by the decompressor.
I'll try to clarify the name of the function.
> We need to stop making a mess of the x86 boot code! This is a highly critical
> piece of kernel code that every single x86 Linux user will execute, we should
> improve its visual presentation and general readability to the level of being
> proud of it ...
I totally agree. :) There is a bit of a culture of "don't change it
just to rename things" which I try to avoid colliding with. In this
case, though, they would all seriously improve readability.
-Kees
--
Kees Cook
Chrome OS & Brillo Security
On Fri, Apr 15, 2016 at 12:47 AM, Ingo Molnar <[email protected]> wrote:
>
> * Kees Cook <[email protected]> wrote:
>
>> From: Baoquan He <[email protected]>
>>
>> When processing the relocation table, the offset used to calculate the
>> relocation is an int. This is sufficient for calculating the physical
>> address of the relocs entry on 32-bit systems and on 64-bit systems when
>> the relocation is under 2G. To handle relocations above 2G (seen in
>> situations like kexec, netboot, etc), this offset needs to be calculated
>> using a long to avoid wrapping and miscalculating the relocation.
>>
>> Signed-off-by: Baoquan He <[email protected]>
>> [kees: rewrote changelog]
>> Signed-off-by: Kees Cook <[email protected]>
>> ---
>> arch/x86/boot/compressed/misc.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
>> index f35ad9eb1bf1..c4477d5f3fff 100644
>> --- a/arch/x86/boot/compressed/misc.c
>> +++ b/arch/x86/boot/compressed/misc.c
>> @@ -295,7 +295,7 @@ static void handle_relocations(void *output, unsigned long output_len)
>>  	 * So we work backwards from the end of the decompressed image.
>>  	 */
>>  	for (reloc = output + output_len - sizeof(*reloc); *reloc; reloc--) {
>> -		int extended = *reloc;
>> +		long extended = *reloc;
>>  		extended += map;
>>
>>  		ptr = (unsigned long)extended;
>
> This patch and the code it patches is just plain sloppy. See this cast? This is an
> object lesson of why type casts in C are actively dangerous, they have hidden the
> 32-bit truncation bug you've fixed with this patch.
>
> But the primary bug, the cast, should be fixed! Together with all the other casts
> of 'extended'.
In my defense, there's a ton of mixing of pointers vs unsigned longs
all through-out the boot code. And relocations are special since
they're explicitly designed to be sign-extended, etc. I'll clean
things up as best as I can.
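[The truncation being fixed here is easy to reproduce outside the kernel. A
hypothetical standalone demo, assuming 64-bit longs as on x86-64; the variable
names are illustrative:]

    #include <stdio.h>

    int main(void)
    {
            int reloc_entry = 0x1000000;          /* entry from the relocs table */
            unsigned long map = 0x123450000UL;    /* load delta when above 4G (kexec) */

            int  narrow = reloc_entry;
            long wide   = reloc_entry;

            narrow += map;  /* result stored back into 32 bits: high bits lost */
            wide   += map;  /* 64-bit arithmetic: correct target address */

            printf("narrow: %#lx\n", (unsigned long)narrow); /* 0x24450000 */
            printf("wide:   %#lx\n", (unsigned long)wide);   /* 0x124450000 */
            return 0;
    }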
>
> Also, the boot code should be reviewed for unnecessary casts, it seems to be a
> disease:
>
> triton:~/tip> git grep -cE '\(unsigned.*;$' arch/x86/boot/compressed/
> arch/x86/boot/compressed/aslr.c:14
> arch/x86/boot/compressed/eboot.c:41
> arch/x86/boot/compressed/misc.c:5
> arch/x86/boot/compressed/misc.h:2
> arch/x86/boot/compressed/mkpiggy.c:1
> arch/x86/boot/compressed/string.c:4
>
> For example the type dance and overloaded usage that choose_kernel_location() does
> with the 'random' local variable in aslr.c is disgusting:
>
> void choose_kernel_location(unsigned char *input,
>                             unsigned long input_size,
>                             unsigned char **output,
>                             unsigned long output_size,
>                             unsigned char **virt_offset)
> {
>         unsigned long random, min_addr;
>
>         *virt_offset = (unsigned char *)LOAD_PHYSICAL_ADDR;
>
> #ifdef CONFIG_HIBERNATION
>         if (!cmdline_find_option_bool("kaslr")) {
>                 debug_putstr("KASLR disabled by default...\n");
>                 return;
>         }
> #else
>         if (cmdline_find_option_bool("nokaslr")) {
>                 debug_putstr("KASLR disabled by cmdline...\n");
>                 return;
>         }
> #endif
>
>         real_mode->hdr.loadflags |= KASLR_FLAG;
>
>         /* Record the various known unsafe memory ranges. */
>         mem_avoid_init((unsigned long)input, input_size,
>                        (unsigned long)*output);
>
>         /* Low end should be the smaller of 512M or initial location. */
>         min_addr = min((unsigned long)*output, 512UL << 20);
>
>         /* Walk e820 and find a random address. */
>         random = find_random_phy_addr(min_addr, output_size);
>         if (!random)
>                 debug_putstr("KASLR could not find suitable E820 region...\n");
>         else {
>                 if ((unsigned long)*output != random) {
>                         fill_pagetable(random, output_size);
>                         switch_pagetable();
>                         *output = (unsigned char *)random;
>                 }
>         }
>
>         /* Pick random virtual address starting from LOAD_PHYSICAL_ADDR. */
>         if (IS_ENABLED(CONFIG_X86_64))
>                 random = find_random_virt_offset(LOAD_PHYSICAL_ADDR,
>                                                  output_size);
>         *virt_offset = (unsigned char *)random;
> }
>
> Firstly, 'random' is a libc function name. We generally don't overload those.
>
> Secondly, it's a random what? Variable names should make it plenty obvious. So it
> should probably be named 'random_addr'.
>
> Third:
>
> /* Walk e820 and find a random address. */
> random = find_random_phy_addr(min_addr, output_size);
>
> yeah, so what that comment tells us we knew already, due to the function name!
> What the comment should _really_ talk about is the high level purpose. Something
> like: 'Walk the e820 map and find a random free RAM address to which we can still
> decompress the whole kernel' would work so much better ...
>
> Fourth, this function has seven (!!) type casts. We can sure do better.
Between the e820 values, the asm linkages, the relocations, etc,
there's a lot of mixing of types. As mentioned, I'll clean it up.
> Fifth:
>
> /* Pick random virtual address starting from LOAD_PHYSICAL_ADDR. */
> if (IS_ENABLED(CONFIG_X86_64))
> random = find_random_virt_offset(LOAD_PHYSICAL_ADDR,
> output_size);
> *virt_offset = (unsigned char *)random;
>
> So the purpose of this whole function is to pick _two_ random addresses: the
> random physical address to place the kernel at, and on x86_64, to also randomize
> the kernel virtual address, right? So exactly which comment tells us that it's
> about this? Names like 'choose_kernel_location' are singular and are actively
> misleading about this ...
>
> ... and then I haven't even mentioned small details like the imbalanced curly
> braces.
I'll bite: which braces jumped out at you? I ran all this through
checkpatch.pl in the hopes of finding style mistakes... Is it the
mixing of single-line code with multi-line code in the if statements?
>
> This code sucks, and I'm not surprised at all that it was broken. It should be
> improved before we can feature-extend it.
>
> Thanks,
>
> Ingo
-Kees
--
Kees Cook
Chrome OS & Brillo Security
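[For what it's worth, applying those review points to the tail of the function
might look roughly like this; a sketch only, since the final naming was still
under discussion:]

    unsigned long random_addr;

    /*
     * Walk the e820 map and find a random free RAM region that can
     * still hold the whole decompressed kernel.
     */
    random_addr = find_random_phy_addr(min_addr, output_size);
    if (!random_addr) {
            debug_putstr("KASLR could not find suitable E820 region...\n");
    } else if ((unsigned long)*output != random_addr) {
            fill_pagetable(random_addr, output_size);
            switch_pagetable();
            *output = (unsigned char *)random_addr;
    }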
On Fri, Apr 15, 2016 at 1:07 AM, Ingo Molnar <[email protected]> wrote:
>
> * Kees Cook <[email protected]> wrote:
>
>> From: Baoquan He <[email protected]>
>>
>> Currently CONFIG_RANDOMIZE_BASE_MAX_OFFSET is used to limit the maximum
>> offset for kernel randomization. This limit doesn't need to be a CONFIG
>> since it is tied completely to KERNEL_IMAGE_SIZE, and will make no sense
>> once physical and virtual offsets are randomized separately. This patch
>> removes CONFIG_RANDOMIZE_BASE_MAX_OFFSET and consolidates the Kconfig
>> help text.
>>
>> Signed-off-by: Baoquan He <[email protected]>
>> [kees: rewrote changelog, dropped KERNEL_IMAGE_SIZE_DEFAULT, moved earlier]
>> Signed-off-by: Kees Cook <[email protected]>
>> ---
>> arch/x86/Kconfig | 57 +++++++++++++-----------------------
>> arch/x86/boot/compressed/aslr.c | 12 ++++----
>> arch/x86/include/asm/page_64_types.h | 8 ++---
>> arch/x86/mm/init_32.c | 3 --
>> 4 files changed, 29 insertions(+), 51 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 2dc18605831f..fd9ac711ada8 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1936,51 +1936,36 @@ config RANDOMIZE_BASE
>> depends on RELOCATABLE
>> default n
>> ---help---
>> - Randomizes the physical and virtual address at which the
>> - kernel image is decompressed, as a security feature that
>> - deters exploit attempts relying on knowledge of the location
>> - of kernel internals.
>> + Randomizes the physical address at which the kernel image
>> + is decompressed and the virtual address where the kernel
>> + image is mapped, as a secrurity feature that deters exploit
>
> Guys, please _read_ what you write: s/secrurity/security
Gah, sorry. I was reading these, but things slip by. I'll fix it. (And
add these to the common misspellings that checkpatch.pl looks for.)
>
>> + attempts relying on knowledge of the location of kernel
>> + internals.
>> +
>> + The kernel physical address can be randomized from 16M to
>> + 64T at most.)
>
> The 64TB value sure reads weird if you are configuring a 32-bit system ...
>
> A much better approach would be to split the help text into 32-bit and 64-bit
> portions:
>
> On 64-bit systems the kernel physical address will be randomized from 16M to the
> top of available physical memory. (With a maximum of 64TB.)
>
> On 32-bit systems the kernel physical address will be randomized from 16MB to
> 1GB.
Yup, good idea.
> Also note the assertive tone: if this Kconfig feature is enabled, we say that the
> kernel address _will_ be randomized, and we should make sure it is. (If for some
> weird reason randomization fails we should warn prominently during bootup.)
This will need some thought... is it better to fail to boot or to boot
without entropy? As-is, it's possible to randomly position the kernel
base address at exactly the location it was going to boot without
KASLR too, yet this is still considered random...
>
>
>> The kernel virtual address will be offset
>> + by up to KERNEL_IMAGE_SIZE. On 32-bit KERNEL_IMAGE_SIZE is
>> + 512MiB. while on 64-bit this is limited by how the kernel
>> + fixmap page table is positioned, so this cannot be larger
>> + than 1GiB currently. Without RANDOMIZE_BASE there is a 512MiB
>> + to 1.5GiB split between kernel and modules. When RANDOMIZE_BASE
>> + is enabled, the modules area will shrink to compensate, up
>> + to a 1GiB to 1GiB split, KERNEL_IMAGE_SIZE changes from 512MiB
>> + to 1GiB.
>
> Beyond the broken capitalization, I'll show you what 99.999% of users who are not
> kernel hackers will understand from this paragraph, approximately:
>
> To dream: ay, and them? To bear to sling afterprises
> coil, and scover'd cowards of resolence dream: ay, the us for no mome
> wish'd. Thus and sweary life, or nobles cast and makes, whips and that
> is sicklied of resolence of so long afterprises us more; for whips
> all; and name whething after bear to sleep; to beart-ache shocks the
> undiscover'd consummative have, but that pith a sleep of somethe under
> 'tis the spurns of troud makes off thance doth make whose bourns of
> dispriz'd consient and arms more.
>
> So this is really deep kernel internals, I get a headache trying to interpret it,
> and it's my job to interpret this! Please try to formulate key pieces of
> information in Kconfig help texts in a more ... approachable fashion, and move the
> jargon to .c source code files.
Trying to capture the effect of reducing the kernel/module split
without detailing the actual numbers may sound evasive, but I'll see
what I can do.
>
>> Entropy is generated using the RDRAND instruction if it is
>> supported. If RDTSC is supported, it is used as well. If
>> neither RDRAND nor RDTSC are supported, then randomness is
>> read from the i8254 timer.
>
> Also, instead of 'used as well' I'd say "is mixed into the entropy pool as well"
> or so, to make sure it's clear that we don't exclusively rely on RDRAND or RDTSC.
>
> Also, could we always mix the i8254 timer into this as well, not just when RDTSC
> is unavailable?
IIRC, hpa explicitly did not want to do this when he was making
suggestions on this area. I would need to dig out the thread -- I
can't find it now. I'd prefer to leave this as-is, since changing it
would add yet another variable to the behavioral changes of this
series.
>> - The kernel will be offset by up to RANDOMIZE_BASE_MAX_OFFSET,
>> - and aligned according to PHYSICAL_ALIGN. Since the kernel is
>> - built using 2GiB addressing, and PHYSICAL_ALGIN must be at a
>> - minimum of 2MiB, only 10 bits of entropy is theoretically
>> - possible. At best, due to page table layouts, 64-bit can use
>> - 9 bits of entropy and 32-bit uses 8 bits.
>> + Since the kernel is built using 2GiB addressing, and
>> + PHYSICAL_ALGIN must be at a minimum of 2MiB, only 10 bits of
>> + entropy is theoretically possible. At best, due to page table
>> + layouts, 64-bit can use 9 bits of entropy and 32-bit uses 8
>> + bits.
>
> Please read what you write, there's a typo in this section.
Gah, I need to teach my spell checker about #defines. ;) I read this
multiple times after you called it out and still couldn't see it
(http://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/). Finally
dropped the _ and the spell checker flagged it. ;)
> Another request: please stop the MiB/GiB nonsense and call it MB/GB. This isn't
> storage code that has to fight marketing lies. Only the KASLR section in
> arch/x86/Kconfig* is using MiB/GiB, everything else uses MB/GB naming, we should
> stick with that.
Totally fine by me. I prefer MB/GB. I wonder if it is worth
documenting this preference somewhere in the style guide? It's
certainly rare in the kernel, but it's present (and there are even
#defines for it *sob*).
$ git grep '[KMGTP]iB' | wc -l
1239
$ git grep '[KMGTP]B' | wc -l
192251
-Kees
--
Kees Cook
Chrome OS & Brillo Security
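[For context, the entropy-source fallback described in that help text maps to
logic roughly like the following; a paraphrased sketch of the aslr.c helper,
with details such as the final bit-diffusion step omitted:]

    static unsigned long get_random_long(void)
    {
            unsigned long raw, random = 0;
            bool use_i8254 = true;

            if (has_cpuflag(X86_FEATURE_RDRAND) && rdrand_long(&raw)) {
                    random ^= raw;          /* hardware RNG, when available */
                    use_i8254 = false;
            }

            if (has_cpuflag(X86_FEATURE_TSC)) {
                    random ^= rdtsc();      /* mixed in as well, not exclusive */
                    use_i8254 = false;
            }

            if (use_i8254)
                    random ^= i8254();      /* last resort: PIT counter */

            return random;
    }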
On Fri, Apr 15, 2016 at 1:12 AM, Ingo Molnar <[email protected]> wrote:
>
> * Kees Cook <[email protected]> wrote:
>
>> From: Yinghai Lu <[email protected]>
>>
>> Currently z_extract_offset is calculated in boot/compressed/mkpiggy.c.
>
> What is the high level meaning of z_extract_offset? I cannot tell without reading
> the code - and the point of changelogs is for them to be readable without having
> to read the code.
I'll clarify it in the changelog. (FWIW, it's the offset into the
extraction buffer where it's safe to put the data-to-be-extracted).
> The rest of the changelog is pretty good.
Thanks!
-Kees
--
Kees Cook
Chrome OS & Brillo Security
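[Put differently, the relationship could be captured in a comment like this;
a sketch using symbols from earlier in the thread, not the literal mkpiggy
code:]

    /*
     * z_extract_offset: how far into the decompression buffer the
     * compressed image (ZO) must be placed so that decompressing in
     * place never overwrites compressed bytes that have not been
     * consumed yet. Roughly:
     *
     *   z_extract_offset >= uncompressed_size + safety_margin
     *                       - compressed_size
     */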
On Fri, Apr 15, 2016 at 1:31 AM, Ingo Molnar <[email protected]> wrote:
>
> * Kees Cook <[email protected]> wrote:
>
>> From: Yinghai Lu <[email protected]>
>>
>> Currently, the kernel run_size (size of code plus brk and bss) is
>> calculated via the shell script arch/x86/tools/calc_run_size.sh.
>> It gets the file offset and mem size of for the .bss and .brk sections in
>> vmlinux, and adds them as follows:
>
> 'of for'?
>
>> So, from this we can see that the existing run_size calculation is
>> 0x400000 too high.
>
> Btw., why is it too high? Were some sections discarded?
I'm not sure why it's that specific value, but the old method used the
section's ELF _file_ position, not the virtual address, and that
seemed to cause the problem.
>
>> [...] And, as it turns out, the correct run_size is
>> actually equal to VO_end - VO_text, which is certainly easier to calculate.
>> _end: 0xffffffff8205f000
>> _text:0xffffffff81000000
>>
>> 0xffffffff8205f000 - 0xffffffff81000000 = 0x105f000
>>
>> As a result, run_size is a simple constant, so we don't need to pass it
>> around; we already have voffset.h such things.
>
> (Typo.)
>
>> [...] We can share voffset.h
>> between misc.c and header.S instead of getting run_size in other ways.
>> This patch moves voffset.h creation code to boot/compressed/Makefile,
>> and switches misc.c to use the VO_end - VO_text calculation.
>
> Btw., 'run_size' is a pretty meaningless name, it doesn't really tell us the most
> important thing: that this is the complete size of the uncompressed kernel, the
> total physically contiguous size we need to allocate to it so it can execute?
>
> So can we rename it to something more expressive, such as kernel_total_size or so?
You got it. Thanks again for digging through all this!
-Kees
--
Kees Cook
Chrome OS & Brillo Security
On Fri, Apr 15, 2016 at 7:42 AM, Lasse Collin <[email protected]> wrote:
> On 2016-04-14 Kees Cook wrote:
>> From: Yinghai Lu <[email protected]>
>>
>> parse_elf is using a local memcpy to move sections to their final
>> running position. However, this memcpy only supports non-overlapping
>> arguments (or dest < src).
>
> The same copy of memcpy is used by the decompressors too.
>
>> To avoid future hard-to-debug surprises, this adds checking in memcpy
>> to detect the unhandled condition (which should not be happening
>> currently).
>
> It's already a minor surprise that memcpy is expected to work for
> overlapping buffers at all. It could be good to have a comment about
> it because "scroll" and parse_elf seem to rely on it.
>
> On the other hand, the new code and error message take quite a few bytes
> of space, so a complete memmove can be smaller:
>
> void *memmove(void *dest, const void *src, size_t n)
> {
> unsigned char *d = dest;
> const unsigned char *s = src;
>
> if (d <= s || d - s >= n)
> return __memcpy(dest, src, n);
>
> while (n-- > 0)
> d[n] = s[n];
>
> return dest;
> }
>
> Note that memmove is needed by lib/decompress_unxz.c. It contains its
> own small version inside a "#ifndef memmove" block. That #ifndef should
> be taken into account when adding a memmove symbol. Changing
> decompress_unxz.c is fine but then one needs to think about other
> archs too.
Awesome, thanks! I'd much prefer to fully fix this instead of just
throwing a warning. I'll get this added.
-Kees
--
Kees Cook
Chrome OS & Brillo Security
* Kees Cook <[email protected]> wrote:
> > Also note the assertive tone: if this Kconfig feature is enabled, we say that
> > the kernel address _will_ be randomized, and we should make sure it is. (If
> > for some weird reason randomization fails we should warn prominently during
> > bootup.)
>
> This will need some thought... is it better to fail to boot or to boot without
> entropy? As-is, it's possible to randomly position the kernel base address at
> exactly the location it was going to boot without KASLR too, yet this is still
> considered random...
I think we should boot but print a prominent warning. On real systems it's not
supposed to happen, right? So we want to know if it happens, but don't want to
hassle the user with breaking the system by not booting.
> > Also, could we always mix the i8254 timer into this as well, not just when
> > RDTSC is unavailable?
>
> IIRC, hpa explicitly did not want to do this when he was making
> suggestions on this area. I would need to dig out the thread -- I
> can't find it now. I'd prefer to leave this as-is, since changing it
> would add yet another variable to the behavioral changes of this
> series.
Sure, can stay as-is for now.
Thanks,
Ingo
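[A minimal version of that warning could be as simple as the following sketch;
note that warn_putstr is hypothetical, since debug_putstr only prints when
"debug" is on the kernel command line:]

    /*
     * Sketch: boot on, but make the failure visible. warn_putstr is
     * a hypothetical unconditional-print helper, not existing code.
     */
    if (!random_addr)
            warn_putstr("KASLR: randomization failed, using default kernel location!\n");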
* Kees Cook <[email protected]> wrote:
> > So can we rename it to something more expressive, such as kernel_total_size or
> > so?
>
> You got it. Thanks again for digging through all this!
You are welcome! A couple of logistical suggestions:
Could you please split up the series a bit and limit the next series to say no
more than around 5 patches? (Can be a little bit more when justified to finish up
a particular line of thought) That way I can apply them in reviewable groups,
without having to reject the whole series because some patch deep into the series
has some problem.
I'd suggest starting with absolutely critical fixes (if any!) as-is, to make
backporting easier. By the looks of it I don't think there's any such patch in
this series, but just in case there are any, they can be at the front.
Then come the various cleanup patches and non-critical fixes - everything that is
not supposed to change the behavior of the kernel. I'd suggest doing them in
roughly this order:
- file renames first - so that any later revert in a smaller patch does not have
to go through a rename barrier.
- then .o-invariant trivial cleanups, the fixing, harmonization (and creation ;-)
of comments.
- then come more involved cleanups like moving logic from build time to boot
time, stricter bounds checks, non-essential fixes, etc.
It might be useful if you declared at this stage that you are mostly done with the
preparatory work and that the code base is ready for heavier changes, so that
people (and me) can review the whole source for anything missing. Often a car
needs a good power wash before we can tell what body work is needed.
... and once we are happy and proud about the code base, then come the more
exciting things: more fundamental changes, and new features - on top of a squeaky
clean code base.
This all can happen pretty quickly, as long as the ordering is proper.
Thanks,
Ingo
On Fri, Apr 15, 2016 at 1:09 AM, Ingo Molnar <[email protected]> wrote:
>
> * Kees Cook <[email protected]> wrote:
>
>> From: Yinghai Lu <[email protected]>
>>
>> This change makes later calculations about where the kernel is located
>> easier to reason about. To better understand this change, we must first
>> clarify what VO and ZO are. They were introduced in commits by hpa:
>>
>> 77d1a49 x86, boot: make symbols from the main vmlinux available
>> 37ba7ab x86, boot: make kernel_alignment adjustable; new bzImage fields
>>
>> Specifically:
>>
>> VO:
>> - uncompressed kernel image
>> - size: VO__end - VO__text ("VO_INIT_SIZE" define)
>>
>> ZO:
>> - bootable compressed kernel image (boot/compressed/vmlinux)
>> - head text + compressed kernel (VO and relocs table) + decompressor code
>> - size: ZO__end - ZO_startup_32 ("ZO_INIT_SIZE" define, though see below)
>>
>> The INIT_SIZE definition is used to find the larger of the two image sizes:
>>
>> #define ZO_INIT_SIZE (ZO__end - ZO_startup_32 + ZO_z_extract_offset)
>> #define VO_INIT_SIZE (VO__end - VO__text)
>> #if ZO_INIT_SIZE > VO_INIT_SIZE
>> #define INIT_SIZE ZO_INIT_SIZE
>> #else
>> #define INIT_SIZE VO_INIT_SIZE
>> #endif
>
> Please also harmonize all the prefixes, i.e. use VO__/ZO__ everywhere (rename
> things where necessary), instead of this mixed up VO_/ZO_/VO__/ZO__ mess.
I'm going to leave these as they are: they're auto-generated based on
various functions of interest:
In boot/Makefile:

    sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|input_data\|_end\|z_.*\)$$/\#define ZO_\2 0x\1/p'
i.e. ZO__end is _end's location. ZO_input_data is input_data's
position. I think it would further complicate things if we tried to
consolidate consecutive "_"s, or if we eliminated the leading "_"
(ZOinput_data and ZO_end).
-Kees
--
Kees Cook
Chrome OS & Brillo Security
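[For reference, run over nm output for ZO, that sed expression emits defines of
the following shape; the addresses here are illustrative, not from a real
build:]

    #define ZO_startup_32 0x0000000000000000
    #define ZO_input_data 0x00000000000003b4
    #define ZO_z_extract_offset 0x0000000000cd6000
    #define ZO__end 0x0000000000b2f000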