2014-11-14 20:45:45

by Kees Cook

[permalink] [raw]
Subject: [PATCH v2] x86, mm: set NX across entire PMD at boot

When setting up permissions on kernel memory at boot, the end of the
PMD that was split from bss remained executable. It should be NX like
the rest. This performs a PMD alignment instead of a PAGE alignment to
get the correct span of memory, and should be freed.

Before:
---[ High Kernel Mapping ]---
...
0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte
0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd
0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte
0xffffffff82df5000-0xffffffff82e00000 44K RW GLB x pte
0xffffffff82e00000-0xffffffffc0000000 978M pmd

After:
---[ High Kernel Mapping ]---
...
0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte
0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd
0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte
0xffffffff82df5000-0xffffffff82e00000 44K RW NX pte
0xffffffff82e00000-0xffffffffc0000000 978M pmd

Signed-off-by: Kees Cook <[email protected]>
---
v2:
- added call to free_init_pages(), as suggested by tglx
---
arch/x86/mm/init_64.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 4cb8763868fc..0d498c922668 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1124,6 +1124,7 @@ void mark_rodata_ro(void)
unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
unsigned long all_end = PFN_ALIGN(&_end);
+ unsigned long pmd_end = roundup(all_end, PMD_SIZE);

printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
(end - start) >> 10);
@@ -1135,7 +1136,7 @@ void mark_rodata_ro(void)
* The rodata/data/bss/brk section (but not the kernel text!)
* should also be not-executable.
*/
- set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);
+ set_memory_nx(rodata_start, (pmd_end - rodata_start) >> PAGE_SHIFT);

rodata_test();

@@ -1147,6 +1148,7 @@ void mark_rodata_ro(void)
set_memory_ro(start, (end-start) >> PAGE_SHIFT);
#endif

+ free_init_pages("unused kernel", all_end, pmd_end);
free_init_pages("unused kernel",
(unsigned long) __va(__pa_symbol(text_end)),
(unsigned long) __va(__pa_symbol(rodata_start)));
--
1.9.1


--
Kees Cook
Chrome OS Security


2014-11-15 01:29:49

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, Nov 14, 2014 at 12:45 PM, Kees Cook <[email protected]> wrote:
> When setting up permissions on kernel memory at boot, the end of the
> PMD that was split from bss remained executable. It should be NX like
> the rest. This performs a PMD alignment instead of a PAGE alignment to
> get the correct span of memory, and should be freed.
>
> Before:
> ---[ High Kernel Mapping ]---
> ...
> 0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte
> 0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd
> 0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte
> 0xffffffff82df5000-0xffffffff82e00000 44K RW GLB x pte
> 0xffffffff82e00000-0xffffffffc0000000 978M pmd
>
> After:
> ---[ High Kernel Mapping ]---
> ...
> 0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte
> 0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd
> 0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte
> 0xffffffff82df5000-0xffffffff82e00000 44K RW NX pte
> 0xffffffff82e00000-0xffffffffc0000000 978M pmd
>
> Signed-off-by: Kees Cook <[email protected]>
> ---
> v2:
> - added call to free_init_pages(), as suggested by tglx
> ---
> arch/x86/mm/init_64.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 4cb8763868fc..0d498c922668 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1124,6 +1124,7 @@ void mark_rodata_ro(void)
> unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
> unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
> unsigned long all_end = PFN_ALIGN(&_end);
> + unsigned long pmd_end = roundup(all_end, PMD_SIZE);
>
> printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
> (end - start) >> 10);
> @@ -1135,7 +1136,7 @@ void mark_rodata_ro(void)
> * The rodata/data/bss/brk section (but not the kernel text!)
> * should also be not-executable.
> */
> - set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);
> + set_memory_nx(rodata_start, (pmd_end - rodata_start) >> PAGE_SHIFT);
>
> rodata_test();
>
> @@ -1147,6 +1148,7 @@ void mark_rodata_ro(void)
> set_memory_ro(start, (end-start) >> PAGE_SHIFT);
> #endif
>
> + free_init_pages("unused kernel", all_end, pmd_end);
> free_init_pages("unused kernel",
> (unsigned long) __va(__pa_symbol(text_end)),
> (unsigned long) __va(__pa_symbol(rodata_start)));

something is wrong:

[ 7.842479] Freeing unused kernel memory: 3844K (ffffffff82e52000 -
ffffffff83213000)
[ 7.843305] Write protecting the kernel read-only data: 28672k
[ 7.844433] BUG: Bad page state in process swapper/0 pfn:043c0
[ 7.845093] page:ffffea000010f000 count:0 mapcount:-127 mapping:
(null) index:0x2
[ 7.846388] flags: 0x10000000000000()
[ 7.846871] page dumped because: nonzero mapcount
[ 7.847343] Modules linked in:
[ 7.847719] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
3.18.0-rc4-yh-01896-g40204c8-dirty #23
[ 7.848809] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
04/01/2014
[ 7.850014] ffffffff828300ca ffff880078babd68 ffffffff81ff47d0
0000000000000001
[ 7.850857] ffffea000010f000 ffff880078babd98 ffffffff8118c2bd
00000000001d4cc0
[ 7.851791] ffffea000010f000 ffffea000010f000 0000000000000000
ffff880078babdf8
[ 7.852700] Call Trace:
[ 7.852991] [<ffffffff81ff47d0>] dump_stack+0x45/0x57
[ 7.853494] [<ffffffff8118c2bd>] bad_page+0xfd/0x130
[ 7.854130] [<ffffffff8118c42c>] free_pages_prepare+0x13c/0x1c0
[ 7.854808] [<ffffffff8118c64d>] ? adjust_managed_page_count+0x5d/0x70
[ 7.855575] [<ffffffff8118f285>] free_hot_cold_page+0x35/0x180
[ 7.856326] [<ffffffff8118f3e3>] __free_pages+0x13/0x40
[ 7.856854] [<ffffffff8118f4dd>] free_reserved_area+0xcd/0x140
[ 7.857442] [<ffffffff81091778>] free_init_pages+0x98/0xb0
[ 7.858001] [<ffffffff81092085>] mark_rodata_ro+0xb5/0x120
[ 7.858622] [<ffffffff81fe3240>] ? rest_init+0xc0/0xc0
[ 7.859174] [<ffffffff81fe325d>] kernel_init+0x1d/0x100
[ 7.859724] [<ffffffff820066ec>] ret_from_fork+0x7c/0xb0
[ 7.860279] [<ffffffff81fe3240>] ? rest_init+0xc0/0xc0
[ 7.860836] Disabling lock debugging due to kernel taint
[ 7.861432] Freeing unused kernel memory: 376K (ffffffff843a2000 -
ffffffff84400000)
[ 7.866118] Freeing unused kernel memory: 1980K (ffff880002011000 -
ffff880002200000)
[ 7.870525] Freeing unused kernel memory: 1932K (ffff880002a1d000 -
ffff880002c00000)

[ 0.000000] .text: [0x01000000-0x0200d548]
[ 0.000000] .rodata: [0x02200000-0x02a1cfff]
[ 0.000000] .data: [0x02c00000-0x02e50e7f]
[ 0.000000] .init: [0x02e52000-0x03212fff]
[ 0.000000] .bss: [0x03221000-0x0437bfff]
[ 0.000000] .brk: [0x0437c000-0x043a1fff]

2014-11-15 02:29:47

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, Nov 14, 2014 at 5:29 PM, Yinghai Lu <[email protected]> wrote:
> On Fri, Nov 14, 2014 at 12:45 PM, Kees Cook <[email protected]> wrote:
>> v2:
>> - added call to free_init_pages(), as suggested by tglx

> something is wrong:
>
> [ 7.842479] Freeing unused kernel memory: 3844K (ffffffff82e52000 -
> ffffffff83213000)
> [ 7.843305] Write protecting the kernel read-only data: 28672k

....
should use attached one instead.

1. should use _brk_end instead of &end, as we only use partial of
brk.
2. [_brk_end, pm_end) page range is already converted. aka
is not wasted.

---
arch/x86/mm/init_64.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -1124,7 +1124,8 @@ void mark_rodata_ro(void)
unsigned long end = (unsigned long) &__end_rodata_hpage_align;
unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
- unsigned long all_end = PFN_ALIGN(&_end);
+ unsigned long all_end = PFN_ALIGN(_brk_end);
+ unsigned long pmd_end = roundup(all_end, PMD_SIZE);

printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
(end - start) >> 10);
@@ -1136,7 +1137,7 @@ void mark_rodata_ro(void)
* The rodata/data/bss/brk section (but not the kernel text!)
* should also be not-executable.
*/
- set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);
+ set_memory_nx(rodata_start, (pmd_end - rodata_start) >> PAGE_SHIFT);

rodata_test();

@@ -1148,6 +1149,7 @@ void mark_rodata_ro(void)
set_memory_ro(start, (end-start) >> PAGE_SHIFT);
#endif

+ /* all_end to pmd_end is handled via free_all_bootmem() */
free_init_pages("unused kernel",
(unsigned long) __va(__pa_symbol(text_end)),
(unsigned long) __va(__pa_symbol(rodata_start)));


Attachments:
nx_after_end.patch (1.49 kB)

2014-11-15 02:46:23

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, Nov 14, 2014 at 6:29 PM, Yinghai Lu <[email protected]> wrote:
> On Fri, Nov 14, 2014 at 5:29 PM, Yinghai Lu <[email protected]> wrote:
>> On Fri, Nov 14, 2014 at 12:45 PM, Kees Cook <[email protected]> wrote:
>>> v2:
>>> - added call to free_init_pages(), as suggested by tglx
>
>> something is wrong:
>>
>> [ 7.842479] Freeing unused kernel memory: 3844K (ffffffff82e52000 -
>> ffffffff83213000)
>> [ 7.843305] Write protecting the kernel read-only data: 28672k
>
> ....
> should use attached one instead.
>
> 1. should use _brk_end instead of &end, as we only use partial of
> brk.
> 2. [_brk_end, pm_end) page range is already converted. aka
> is not wasted.

Are you sure? For me, _brk_end isn't far enough:

[ 1.475572] all_end: 0xffffffff82df5000
[ 1.476736] _brk_end: 0xffffffff82dd6000

>
> ---
> arch/x86/mm/init_64.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -1124,7 +1124,8 @@ void mark_rodata_ro(void)
> unsigned long end = (unsigned long) &__end_rodata_hpage_align;
> unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
> unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
> - unsigned long all_end = PFN_ALIGN(&_end);
> + unsigned long all_end = PFN_ALIGN(_brk_end);
> + unsigned long pmd_end = roundup(all_end, PMD_SIZE);
>
> printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
> (end - start) >> 10);
> @@ -1136,7 +1137,7 @@ void mark_rodata_ro(void)
> * The rodata/data/bss/brk section (but not the kernel text!)
> * should also be not-executable.
> */
> - set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);
> + set_memory_nx(rodata_start, (pmd_end - rodata_start) >> PAGE_SHIFT);
>
> rodata_test();
>
> @@ -1148,6 +1149,7 @@ void mark_rodata_ro(void)
> set_memory_ro(start, (end-start) >> PAGE_SHIFT);
> #endif
>
> + /* all_end to pmd_end is handled via free_all_bootmem() */
> free_init_pages("unused kernel",
> (unsigned long) __va(__pa_symbol(text_end)),
> (unsigned long) __va(__pa_symbol(rodata_start)));

This patch produces the same results as my v1 patch:

0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte
0xffffffff82200000-0xffffffff82e00000 12M RW PSE GLB NX pmd
0xffffffff82e00000-0xffffffffc0000000 978M pmd

Is this correct? It sounded like tglx wanted the pmd split, like this:

0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd
0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte
0xffffffff82df5000-0xffffffff82e00000 44K RW NX pte
0xffffffff82e00000-0xffffffffc0000000 978M pmd


-Kees

--
Kees Cook
Chrome OS Security

2014-11-15 03:06:17

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, Nov 14, 2014 at 5:29 PM, Yinghai Lu <[email protected]> wrote:
> On Fri, Nov 14, 2014 at 12:45 PM, Kees Cook <[email protected]> wrote:
>> When setting up permissions on kernel memory at boot, the end of the
>> PMD that was split from bss remained executable. It should be NX like
>> the rest. This performs a PMD alignment instead of a PAGE alignment to
>> get the correct span of memory, and should be freed.
>>
>> Before:
>> ---[ High Kernel Mapping ]---
>> ...
>> 0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte
>> 0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd
>> 0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte
>> 0xffffffff82df5000-0xffffffff82e00000 44K RW GLB x pte
>> 0xffffffff82e00000-0xffffffffc0000000 978M pmd
>>
>> After:
>> ---[ High Kernel Mapping ]---
>> ...
>> 0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte
>> 0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd
>> 0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte
>> 0xffffffff82df5000-0xffffffff82e00000 44K RW NX pte
>> 0xffffffff82e00000-0xffffffffc0000000 978M pmd
>>
>> Signed-off-by: Kees Cook <[email protected]>
>> ---
>> v2:
>> - added call to free_init_pages(), as suggested by tglx
>> ---
>> arch/x86/mm/init_64.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 4cb8763868fc..0d498c922668 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -1124,6 +1124,7 @@ void mark_rodata_ro(void)
>> unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
>> unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
>> unsigned long all_end = PFN_ALIGN(&_end);
>> + unsigned long pmd_end = roundup(all_end, PMD_SIZE);
>>
>> printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
>> (end - start) >> 10);
>> @@ -1135,7 +1136,7 @@ void mark_rodata_ro(void)
>> * The rodata/data/bss/brk section (but not the kernel text!)
>> * should also be not-executable.
>> */
>> - set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);
>> + set_memory_nx(rodata_start, (pmd_end - rodata_start) >> PAGE_SHIFT);
>>
>> rodata_test();
>>
>> @@ -1147,6 +1148,7 @@ void mark_rodata_ro(void)
>> set_memory_ro(start, (end-start) >> PAGE_SHIFT);
>> #endif
>>
>> + free_init_pages("unused kernel", all_end, pmd_end);
>> free_init_pages("unused kernel",
>> (unsigned long) __va(__pa_symbol(text_end)),
>> (unsigned long) __va(__pa_symbol(rodata_start)));
>
> something is wrong:
>
> [ 7.842479] Freeing unused kernel memory: 3844K (ffffffff82e52000 -
> ffffffff83213000)
> [ 7.843305] Write protecting the kernel read-only data: 28672k
> [ 7.844433] BUG: Bad page state in process swapper/0 pfn:043c0
> [ 7.845093] page:ffffea000010f000 count:0 mapcount:-127 mapping:
> (null) index:0x2
> [ 7.846388] flags: 0x10000000000000()
> [ 7.846871] page dumped because: nonzero mapcount
> [ 7.847343] Modules linked in:
> [ 7.847719] CPU: 2 PID: 1 Comm: swapper/0 Not tainted
> 3.18.0-rc4-yh-01896-g40204c8-dirty #23
> [ 7.848809] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
> 04/01/2014
> [ 7.850014] ffffffff828300ca ffff880078babd68 ffffffff81ff47d0
> 0000000000000001
> [ 7.850857] ffffea000010f000 ffff880078babd98 ffffffff8118c2bd
> 00000000001d4cc0
> [ 7.851791] ffffea000010f000 ffffea000010f000 0000000000000000
> ffff880078babdf8
> [ 7.852700] Call Trace:
> [ 7.852991] [<ffffffff81ff47d0>] dump_stack+0x45/0x57
> [ 7.853494] [<ffffffff8118c2bd>] bad_page+0xfd/0x130
> [ 7.854130] [<ffffffff8118c42c>] free_pages_prepare+0x13c/0x1c0
> [ 7.854808] [<ffffffff8118c64d>] ? adjust_managed_page_count+0x5d/0x70
> [ 7.855575] [<ffffffff8118f285>] free_hot_cold_page+0x35/0x180
> [ 7.856326] [<ffffffff8118f3e3>] __free_pages+0x13/0x40
> [ 7.856854] [<ffffffff8118f4dd>] free_reserved_area+0xcd/0x140
> [ 7.857442] [<ffffffff81091778>] free_init_pages+0x98/0xb0
> [ 7.858001] [<ffffffff81092085>] mark_rodata_ro+0xb5/0x120
> [ 7.858622] [<ffffffff81fe3240>] ? rest_init+0xc0/0xc0
> [ 7.859174] [<ffffffff81fe325d>] kernel_init+0x1d/0x100
> [ 7.859724] [<ffffffff820066ec>] ret_from_fork+0x7c/0xb0
> [ 7.860279] [<ffffffff81fe3240>] ? rest_init+0xc0/0xc0
> [ 7.860836] Disabling lock debugging due to kernel taint
> [ 7.861432] Freeing unused kernel memory: 376K (ffffffff843a2000 -
> ffffffff84400000)
> [ 7.866118] Freeing unused kernel memory: 1980K (ffff880002011000 -
> ffff880002200000)
> [ 7.870525] Freeing unused kernel memory: 1932K (ffff880002a1d000 -
> ffff880002c00000)

Also, what tree is this? "Freeing %s" went away in
c88442ec45f30d587b38b935a14acde4e217a926 (and should probably be
re-added, which is what I assume has happened.)

>
> [ 0.000000] .text: [0x01000000-0x0200d548]
> [ 0.000000] .rodata: [0x02200000-0x02a1cfff]
> [ 0.000000] .data: [0x02c00000-0x02e50e7f]
> [ 0.000000] .init: [0x02e52000-0x03212fff]
> [ 0.000000] .bss: [0x03221000-0x0437bfff]
> [ 0.000000] .brk: [0x0437c000-0x043a1fff]

And which CONFIG turns on this reporting?

-Kees

--
Kees Cook
Chrome OS Security

2014-11-15 03:34:33

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, Nov 14, 2014 at 7:06 PM, Kees Cook <[email protected]> wrote:
> On Fri, Nov 14, 2014 at 5:29 PM, Yinghai Lu <[email protected]> wrote:
>>
>> [ 0.000000] .text: [0x01000000-0x0200d548]
>> [ 0.000000] .rodata: [0x02200000-0x02a1cfff]
>> [ 0.000000] .data: [0x02c00000-0x02e50e7f]
>> [ 0.000000] .init: [0x02e52000-0x03212fff]
>> [ 0.000000] .bss: [0x03221000-0x0437bfff]
>> [ 0.000000] .brk: [0x0437c000-0x043a1fff]
>
> And which CONFIG turns on this reporting?
>

My local patch.

this brk does not shrink yet.

Yinghai

2014-11-15 03:38:21

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, Nov 14, 2014 at 6:46 PM, Kees Cook <[email protected]> wrote:
> On Fri, Nov 14, 2014 at 6:29 PM, Yinghai Lu <[email protected]> wrote:

>> should use attached one instead.
>>
>> 1. should use _brk_end instead of &end, as we only use partial of
>> brk.
>> 2. [_brk_end, pm_end) page range is already converted. aka
>> is not wasted.
>
> Are you sure? For me, _brk_end isn't far enough:
>
> [ 1.475572] all_end: 0xffffffff82df5000
> [ 1.476736] _brk_end: 0xffffffff82dd6000

Yes. _brk_end should be small then &_end.

>
>>
>> ---
>> arch/x86/mm/init_64.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> Index: linux-2.6/arch/x86/mm/init_64.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init_64.c
>> +++ linux-2.6/arch/x86/mm/init_64.c
>> @@ -1124,7 +1124,8 @@ void mark_rodata_ro(void)
>> unsigned long end = (unsigned long) &__end_rodata_hpage_align;
>> unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
>> unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
>> - unsigned long all_end = PFN_ALIGN(&_end);
>> + unsigned long all_end = PFN_ALIGN(_brk_end);
>> + unsigned long pmd_end = roundup(all_end, PMD_SIZE);
>>
>> printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
>> (end - start) >> 10);
>> @@ -1136,7 +1137,7 @@ void mark_rodata_ro(void)
>> * The rodata/data/bss/brk section (but not the kernel text!)
>> * should also be not-executable.
>> */
>> - set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);
>> + set_memory_nx(rodata_start, (pmd_end - rodata_start) >> PAGE_SHIFT);
>>
>> rodata_test();
>>
>> @@ -1148,6 +1149,7 @@ void mark_rodata_ro(void)
>> set_memory_ro(start, (end-start) >> PAGE_SHIFT);
>> #endif
>>
>> + /* all_end to pmd_end is handled via free_all_bootmem() */
>> free_init_pages("unused kernel",
>> (unsigned long) __va(__pa_symbol(text_end)),
>> (unsigned long) __va(__pa_symbol(rodata_start)));
>
> This patch produces the same results as my v1 patch:
>
> 0xffffffff8202d000-0xffffffff82200000 1868K RW GLB NX pte
> 0xffffffff82200000-0xffffffff82e00000 12M RW PSE GLB NX pmd
> 0xffffffff82e00000-0xffffffffc0000000 978M pmd
>
> Is this correct? It sounded like tglx wanted the pmd split, like this:
>
> 0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd
> 0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte
> 0xffffffff82df5000-0xffffffff82e00000 44K RW NX pte
> 0xffffffff82e00000-0xffffffffc0000000 978M pmd

Need to remove GLB ?

Yinghai

2014-11-15 09:43:37

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, Nov 14, 2014 at 7:38 PM, Yinghai Lu <[email protected]> wrote:
> On Fri, Nov 14, 2014 at 6:46 PM, Kees Cook <[email protected]> wrote:
>> Is this correct? It sounded like tglx wanted the pmd split, like this:
>>
>> 0xffffffff82200000-0xffffffff82c00000 10M RW PSE GLB NX pmd
>> 0xffffffff82c00000-0xffffffff82df5000 2004K RW GLB NX pte
>> 0xffffffff82df5000-0xffffffff82e00000 44K RW NX pte
>> 0xffffffff82e00000-0xffffffffc0000000 978M pmd
>
> Need to remove GLB ?

Please check attached one that clean up the highmap tail...

Subject: [RFC PATCH] x86, 64bit: cleanup highmap tail near partial 2M range

1. should use _brk_end instead of &end, as we only use partial of
brk.
2. [_brk_end, pm_end) page range is already converted mem. and
is not wasted.
3. add cleanup_highmap_tail for [_brk_end, pm_end).

Kernel Layout:
[ 0.000000] .text: [0x01000000-0x0200d5c8]
[ 0.000000] .rodata: [0x02200000-0x02a1cfff]
[ 0.000000] .data: [0x02c00000-0x02e50e7f]
[ 0.000000] .init: [0x02e52000-0x03212fff]
[ 0.000000] .bss: [0x03221000-0x0437bfff]
[ 0.000000] .brk: [0x0437c000-0x043a1fff]

Actually used brk:
[ 0.272959] memblock_reserve: [0x0000000437c000-0x00000004382fff]
flags 0x0 BRK

Before patch:
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff81000000 16M pmd
0xffffffff81000000-0xffffffff82200000 18M ro PSE GLB x pmd
0xffffffff82200000-0xffffffff82c00000 10M ro PSE GLB NX pmd
0xffffffff82c00000-0xffffffff82e00000 2M RW PSE GLB NX pmd
0xffffffff82e00000-0xffffffff83000000 2M RW GLB NX pte
0xffffffff83000000-0xffffffff83200000 2M RW PSE GLB NX pmd
0xffffffff83200000-0xffffffff83400000 2M RW GLB NX pte
0xffffffff83400000-0xffffffff84200000 14M RW PSE GLB NX pmd
0xffffffff84200000-0xffffffff843a2000 1672K RW GLB NX pte
0xffffffff843a2000-0xffffffff84400000 376K RW GLB x pte
0xffffffff84400000-0xffffffffa0000000 444M pmd
After patch:
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff81000000 16M pmd
0xffffffff81000000-0xffffffff82200000 18M ro PSE GLB x pmd
0xffffffff82200000-0xffffffff82c00000 10M ro PSE GLB NX pmd
0xffffffff82c00000-0xffffffff82e00000 2M RW PSE GLB NX pmd
0xffffffff82e00000-0xffffffff83000000 2M RW GLB NX pte
0xffffffff83000000-0xffffffff83200000 2M RW PSE GLB NX pmd
0xffffffff83200000-0xffffffff83400000 2M RW GLB NX pte
0xffffffff83400000-0xffffffff84200000 14M RW PSE GLB NX pmd
0xffffffff84200000-0xffffffff84383000 1548K RW GLB NX pte
0xffffffff84383000-0xffffffff84400000 500K pte
0xffffffff84400000-0xffffffffa0000000 444M pmd

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/mm/init_64.c | 23 ++++++++++++++++++++++-
arch/x86/mm/pageattr.c | 2 +-
2 files changed, 23 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -375,6 +375,7 @@ void __init init_extra_mapping_uc(unsign
__init_extra_mapping(phys, size, PAGE_KERNEL_LARGE_NOCACHE);
}

+static pmd_t *last_pmd;
/*
* The head.S code sets up the kernel high mapping:
*
@@ -408,9 +409,26 @@ void __init cleanup_highmap(void)
continue;
if (vaddr < (unsigned long) _text || vaddr > end)
set_pmd(pmd, __pmd(0));
+ else
+ last_pmd = pmd;
}
}

+static void __init cleanup_highmap_tail(unsigned long addr)
+{
+ int i;
+ pte_t *pte;
+
+ if (!last_pmd || pmd_none(*last_pmd))
+ return;
+
+ pte = (pte_t *)pmd_page_vaddr(*last_pmd);
+ pte += pte_index(addr);
+
+ for (i = pte_index(addr); i < PTRS_PER_PTE; i++, pte++)
+ set_pte(pte, __pte(0));
+}
+
static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
@@ -1124,7 +1142,8 @@ void mark_rodata_ro(void)
unsigned long end = (unsigned long) &__end_rodata_hpage_align;
unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
- unsigned long all_end = PFN_ALIGN(&_end);
+ unsigned long all_end = PFN_ALIGN(_brk_end);
+ unsigned long pmd_end = roundup(all_end, PMD_SIZE);

printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
(end - start) >> 10);
@@ -1137,6 +1156,8 @@ void mark_rodata_ro(void)
* should also be not-executable.
*/
set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);
+ if (all_end < pmd_end)
+ cleanup_highmap_tail(all_end);

rodata_test();

Index: linux-2.6/arch/x86/mm/pageattr.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/pageattr.c
+++ linux-2.6/arch/x86/mm/pageattr.c
@@ -100,7 +100,7 @@ static inline unsigned long highmap_star

static inline unsigned long highmap_end_pfn(void)
{
- return __pa_symbol(roundup(_brk_end, PMD_SIZE)) >> PAGE_SHIFT;
+ return __pa_symbol(PFN_ALIGN(_brk_end)) >> PAGE_SHIFT;
}

#endif


Attachments:
nx_after_end.patch (4.81 kB)

2014-11-16 18:52:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, 14 Nov 2014, Yinghai Lu wrote:

> On Fri, Nov 14, 2014 at 6:46 PM, Kees Cook <[email protected]> wrote:
> > On Fri, Nov 14, 2014 at 6:29 PM, Yinghai Lu <[email protected]> wrote:
>
> >> should use attached one instead.
> >>
> >> 1. should use _brk_end instead of &end, as we only use partial of
> >> brk.
> >> 2. [_brk_end, pm_end) page range is already converted. aka
> >> is not wasted.
> >
> > Are you sure? For me, _brk_end isn't far enough:
> >
> > [ 1.475572] all_end: 0xffffffff82df5000
> > [ 1.476736] _brk_end: 0xffffffff82dd6000
>
> Yes. _brk_end should be small then &_end.

Wrong. _brk_end can move up to _end, i.e. to __brk_limit.

But it's safe to use _brk_end when mark_rodata_ro() is called because
extend_brk() is gone already at that point.

Thanks,

tglx

2014-11-16 21:26:48

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Sat, 15 Nov 2014, Yinghai Lu wrote:
> +static pmd_t *last_pmd;
> /*
> * The head.S code sets up the kernel high mapping:
> *
> @@ -408,9 +409,26 @@ void __init cleanup_highmap(void)
> continue;
> if (vaddr < (unsigned long) _text || vaddr > end)
> set_pmd(pmd, __pmd(0));
> + else
> + last_pmd = pmd;

Why do you need to store this? You can compute this.

> +static void __init cleanup_highmap_tail(unsigned long addr)

Brilliant stuff. mark_rodata_ro() is called AFTER free_initmem() which
will free exactly that code.

> +{
> + int i;
> + pte_t *pte;
> +
> + if (!last_pmd || pmd_none(*last_pmd))
> + return;

Aside of the above: This is complete and utter crap. Just another
useless function which can be avoided by switching on brain before
coding.

> Index: linux-2.6/arch/x86/mm/pageattr.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/pageattr.c
> +++ linux-2.6/arch/x86/mm/pageattr.c
> @@ -100,7 +100,7 @@ static inline unsigned long highmap_star
>
> static inline unsigned long highmap_end_pfn(void)
> {
> - return __pa_symbol(roundup(_brk_end, PMD_SIZE)) >> PAGE_SHIFT;
> + return __pa_symbol(PFN_ALIGN(_brk_end)) >> PAGE_SHIFT;
> }

Wrong as well for CONFIG_DEBUG_RODATA=n for obvious reasons.

I know why I have your patches on blacklist ....

Thanks,

tglx

2014-11-16 23:44:33

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, 14 Nov 2014, Kees Cook wrote:
> On Fri, Nov 14, 2014 at 6:29 PM, Yinghai Lu <[email protected]> wrote:
> > should use attached one instead.
> >
> > 1. should use _brk_end instead of &end, as we only use partial of
> > brk.
> > 2. [_brk_end, pm_end) page range is already converted. aka
> > is not wasted.
>
> Are you sure? For me, _brk_end isn't far enough:
>
> [ 1.475572] all_end: 0xffffffff82df5000
> [ 1.476736] _brk_end: 0xffffffff82dd6000

_brk_end is adjusted at boot time via extend_brk() up to __brk_limit,
which is the same as _end. We usually do not use all of that space. So
it's expected that _brk_end < _end.

> Is this correct? It sounded like tglx wanted the pmd split, like this:

Yes, I wanted to get rid of the high mapping for anything between
_brk_end and _end, and I brought you on the wrong track with my
suggestion to call free_init_pages(). Sorry about that.

That happened because I missed the completely non obvious fact, that
only the effective brk section is reserved for the kernel via
reserve_brk(). So the area between _brk_end and _end is already
reusable. Though that reuse works only by chance and not by design and
is completely undocumented as everything else in that area.

So the initial patch to get rid of the X mapping is of course to just
extend the area to the PMD. A little bit different to your initial
patch, but essentially the same.

- unsigned long all_end = PFN_ALIGN(&_end);
+ unsigned long all_end = roundup((unsigned long) &_end, PMD_SIZE);

I'm going to apply your V1 patch with the above roundup()
simplification. If a page of that area gets used later on then we are
going to split up the PMD anyway.

But still we want to get rid of that highmap between _brk_end and
_end, but there is absolutely no reason to come up with extra silly
functions for that.

So the obvious solution is to let setup_arch() reserve the memory up
to _end instead of _bss_stop, get rid of the extra reservation in
reserve_brk() and then let free_initmem() release the area between
_brk_end and _end. No extra hackery, no side effects, just works.

I spent quite some time to stare into that and I wonder about the
following related issues:

1) Why is the mark_rodata_ro() business a debug configuration, i.e
CONFIG_DEBUG_RODATA?

This does not make any sense at all. We really want RO and NX on by
default and AFAICT distros are turning that on anyway for obvious
reasons.

The only idiocity I found so far is the kgdb Documentation which
recommends to turn it off. Sigh.

So that should be changed to:

config ARCH_HAS_RONX
bool

config DISABLE_RONX
def_bool !ARCH_HAS_RONX || !KGDB_ENABLE_SECURITY_HOLES

plus

config ARCH_WANTS_KGDB_SECURITY_HOLES
bool

config KGDB_ENABLE_SECURITY_HOLES
bool "WTF?"
depends on BROKEN && ARCH_WANTS_KGDB_SECURITY_HOLES

2) What is actually the modules counterpart for mark_rodata_ro()?

CONFIG_DEBUG_SET_MODULE_RONX

Of course not enabled by default, but enabled by distros again.

See #1.

Now what's interesting aside of the general fuckup is that
CONFIG_DEBUG_RODATA is supported by:

arch/x86 and arch/parisc

But CONFIG_DEBUG_SET_MODULE_RONX is supported by:

arch/arm/ arch/arm64 arch/s390 and arch/x86

This does not make any sense at all.

Do arm/arm64/s390 have other means to make RO/NX work or are they
just doing it for modules? And how is that supposed to work with
KGDB if it is not aware of modules sections being RO/NX? KGDB has
only extra magic for CONFIG_DEBUG_RODATA, but not for
CONFIG_DEBUG_SET_MODULE_RONX.

Now for extended fun the x86 help text for that option says:

... Such protection may interfere with run-time code
patching and dynamic kernel tracing - and they might also protect
against certain classes of kernel exploits.
If in doubt, say "N".

Patently wrong. More sigh.

3) Why is mark_rodata_ro() called AFTER free_initmem() and therefor
cannot be marked __init ?

Just because ...

Thanks,

tglx

2014-11-17 03:26:27

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Sun, Nov 16, 2014 at 10:52 AM, Thomas Gleixner <[email protected]> wrote:
>> >
>> > Are you sure? For me, _brk_end isn't far enough:
>> >
>> > [ 1.475572] all_end: 0xffffffff82df5000
>> > [ 1.476736] _brk_end: 0xffffffff82dd6000
>>
>> Yes. _brk_end should be small then &_end.
>
> Wrong. _brk_end can move up to _end, i.e. to __brk_limit.

Confused. What is wrong? _brk_end is not smaller than &_end?

Yinghai

2014-11-17 03:30:58

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Sun, Nov 16, 2014 at 1:26 PM, Thomas Gleixner <[email protected]> wrote:
> On Sat, 15 Nov 2014, Yinghai Lu wrote:
>> +static pmd_t *last_pmd;
>> /*
>> * The head.S code sets up the kernel high mapping:
>> *
>> @@ -408,9 +409,26 @@ void __init cleanup_highmap(void)
>> continue;
>> if (vaddr < (unsigned long) _text || vaddr > end)
>> set_pmd(pmd, __pmd(0));
>> + else
>> + last_pmd = pmd;
>
> Why do you need to store this? You can compute this.

I'm not quite sure about the xen path.

>
>> +static void __init cleanup_highmap_tail(unsigned long addr)
>
> Brilliant stuff. mark_rodata_ro() is called AFTER free_initmem() which
> will free exactly that code.

I missed that.

Please check this one that address three problems that you pointed out.

Subject: [PATCH v2] x86, 64bit: cleanup highmap tail near partial 2M range

1. should use _brk_end instead of &_end in mark_rodata_ro().
_brk_end can move up to &_end, i.e. to __brk_limit. It's safe to
use _brk_end when mark_rodata_ro() is called because extend_brk()
is gone already at that point.
2. [_brk_end, pm_end) page range is already converted mem. and
is not wasted.
3. add cleanup_highmap_tail for [_brk_end, pm_end).

Kernel Layout:
[ 0.000000] .brk: [0x0437c000-0x043a1fff]

Actually used brk:
[ 0.272959] memblock_reserve: [0x0000000437c000-0x00000004382fff]
flags 0x0 BRK

Before patch:
---[ High Kernel Mapping ]---
...
0xffffffff83400000-0xffffffff84200000 14M RW PSE GLB NX pmd
0xffffffff84200000-0xffffffff843a2000 1672K RW GLB NX pte
0xffffffff843a2000-0xffffffff84400000 376K RW GLB x pte
0xffffffff84400000-0xffffffffa0000000 444M pmd
After patch:
---[ High Kernel Mapping ]---
...
0xffffffff83400000-0xffffffff84200000 14M RW PSE GLB NX pmd
0xffffffff84200000-0xffffffff84383000 1548K RW GLB NX pte
0xffffffff84383000-0xffffffff84400000 500K pte
0xffffffff84400000-0xffffffffa0000000 444M pmd

-v2: according to tglx
caculate the pmd postion instead of passing last_pmd.
cleanup_highmap_tail could not have __init, as it is called in mark_rodata_ro
and mark_rodata_ro is called after free_initmem.
highmap_end_pfn should keep PMD_SIZE alignment on !CONFIG_DEBUG_RODATA

Signed-off-by: Yinghai Lu <[email protected]>

---
arch/x86/mm/init_64.c | 22 +++++++++++++++++++++-
arch/x86/mm/pageattr.c | 4 ++++
2 files changed, 25 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -411,6 +411,23 @@ void __init cleanup_highmap(void)
}
}

+static void cleanup_highmap_tail(unsigned long addr)
+{
+ int i;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset_k(addr);
+ pud = (pud_t *)pgd_page_vaddr(*pgd) + pud_index(addr);
+ pmd = (pmd_t *)pud_page_vaddr(*pud) + pmd_index(addr);
+ pte = (pte_t *)pmd_page_vaddr(*pmd) + pte_index(addr);
+
+ for (i = pte_index(addr); i < PTRS_PER_PTE; i++, pte++)
+ set_pte(pte, __pte(0));
+}
+
static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
@@ -1124,7 +1141,8 @@ void mark_rodata_ro(void)
unsigned long end = (unsigned long) &__end_rodata_hpage_align;
unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
- unsigned long all_end = PFN_ALIGN(&_end);
+ unsigned long all_end = PFN_ALIGN(_brk_end);
+ unsigned long pmd_end = roundup(all_end, PMD_SIZE);

printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
(end - start) >> 10);
@@ -1137,6 +1155,8 @@ void mark_rodata_ro(void)
* should also be not-executable.
*/
set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);
+ if (all_end < pmd_end)
+ cleanup_highmap_tail(all_end);

rodata_test();

Index: linux-2.6/arch/x86/mm/pageattr.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/pageattr.c
+++ linux-2.6/arch/x86/mm/pageattr.c
@@ -100,7 +100,11 @@ static inline unsigned long highmap_star

static inline unsigned long highmap_end_pfn(void)
{
+#ifdef CONFIG_DEBUG_RODATA
+ return __pa_symbol(PFN_ALIGN(_brk_end)) >> PAGE_SHIFT;
+#else
return __pa_symbol(roundup(_brk_end, PMD_SIZE)) >> PAGE_SHIFT;
+#endif
}

#endif


Attachments:
nx_after_end_v2.patch (3.73 kB)

2014-11-17 04:00:19

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Sun, Nov 16, 2014 at 3:44 PM, Thomas Gleixner <[email protected]> wrote:
>
> _brk_end is adjusted at boot time via extend_brk() up to __brk_limit,
> which is the same as _end. We usually do not use all of that space. So
> it's expected that _brk_end < _end.
>
>> Is this correct? It sounded like tglx wanted the pmd split, like this:
>
> Yes, I wanted to get rid of the high mapping for anything between
> _brk_end and _end, and I brought you on the wrong track with my
> suggestion to call free_init_pages(). Sorry about that.

:)

>
> That happened because I missed the completely non obvious fact, that
> only the effective brk section is reserved for the kernel via
> reserve_brk(). So the area between _brk_end and _end is already
> reusable. Though that reuse works only by chance and not by design and
> is completely undocumented as everything else in that area.

where is info for everything else?

>
> So the initial patch to get rid of the X mapping is of course to just
> extend the area to the PMD. A little bit different to your initial
> patch, but essentially the same.
>
> - unsigned long all_end = PFN_ALIGN(&_end);
> + unsigned long all_end = roundup((unsigned long) &_end, PMD_SIZE);
>
> I'm going to apply your V1 patch with the above roundup()
> simplification. If a page of that area gets used later on then we are
> going to split up the PMD anyway.

should use _brk_end instead of &_end for the roundup?

>
> But still we want to get rid of that highmap between _brk_end and
> _end, but there is absolutely no reason to come up with extra silly
> functions for that.
>
> So the obvious solution is to let setup_arch() reserve the memory up
> to _end instead of _bss_stop, get rid of the extra reservation in
> reserve_brk() and then let free_initmem() release the area between
> _brk_end and _end. No extra hackery, no side effects, just works.

So where get the highmap to be removed?
free_init_pages via free_initmem()?

with the code change like you suggested,

---
arch/x86/kernel/setup.c | 6 +-----
arch/x86/mm/init.c | 6 ++++++
arch/x86/mm/init_64.c | 2 +-
3 files changed, 8 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -286,10 +286,6 @@ static void __init cleanup_highmap(void)

static void __init reserve_brk(void)
{
- if (_brk_end > _brk_start)
- memblock_reserve(__pa_symbol(_brk_start),
- _brk_end - _brk_start);
-
/* Mark brk area as locked down and no longer taking any
new allocations */
_brk_start = 0;
@@ -857,7 +853,7 @@ dump_kernel_offset(struct notifier_block
void __init setup_arch(char **cmdline_p)
{
memblock_reserve(__pa_symbol(_text),
- (unsigned long)__bss_stop - (unsigned long)_text);
+ (unsigned long)_end - (unsigned long)_text);

early_reserve_initrd();

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -1122,7 +1122,7 @@ void mark_rodata_ro(void)
unsigned long end = (unsigned long) &__end_rodata_hpage_align;
unsigned long text_end = PFN_ALIGN(&__stop___ex_table);
unsigned long rodata_end = PFN_ALIGN(&__end_rodata);
- unsigned long all_end = PFN_ALIGN(&_end);
+ unsigned long all_end = roundup((unsigned long)&_end, PMD_SIZE);

printk(KERN_INFO "Write protecting the kernel read-only data: %luk\n",
(end - start) >> 10);
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -637,9 +637,15 @@ void free_init_pages(char *what, unsigne

void free_initmem(void)
{
+ unsigned long all_end = roundup((unsigned long)&_end, PMD_SIZE);
+
free_init_pages("unused kernel",
(unsigned long)(&__init_begin),
(unsigned long)(&__init_end));
+
+#ifdef CONFIG_X86_64
+ free_init_pages("unused brk", PFN_ALIGN(_brk_end), all_end);
+#endif
}

#ifdef CONFIG_BLK_DEV_INITRD


got
[ 7.942636] Freeing unused brk memory: 500K (ffffffff84383000 -
ffffffff84400000)

---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff81000000 16M pmd
0xffffffff81000000-0xffffffff82200000 18M ro PSE GLB x pmd
0xffffffff82200000-0xffffffff82c00000 10M ro PSE GLB NX pmd
0xffffffff82c00000-0xffffffff82e00000 2M RW PSE GLB NX pmd
0xffffffff82e00000-0xffffffff83000000 2M RW GLB NX pte
0xffffffff83000000-0xffffffff83200000 2M RW PSE GLB NX pmd
0xffffffff83200000-0xffffffff83400000 2M RW GLB NX pte
0xffffffff83400000-0xffffffff84200000 14M RW PSE GLB NX pmd
0xffffffff84200000-0xffffffff84400000 2M RW GLB NX pte
0xffffffff84400000-0xffffffffa0000000 444M pmd

the _brk_end to &_end highmap is still there

Also if you want to remove highmap between _brk_end and _end,
Do you want highmap for
[text_end, rodata_start)
[rodata_end, _sdata),
as mark_rodata_ro is calling

free_init_pages("unused kernel",
(unsigned long) __va(__pa_symbol(text_end)),
(unsigned long) __va(__pa_symbol(rodata_start)));
free_init_pages("unused kernel",
(unsigned long) __va(__pa_symbol(rodata_end)),
(unsigned long) __va(__pa_symbol(_sdata)));

Thanks

Yinghai

2014-11-17 20:28:02

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Sun, Nov 16, 2014 at 3:44 PM, Thomas Gleixner <[email protected]> wrote:
> On Fri, 14 Nov 2014, Kees Cook wrote:
>> On Fri, Nov 14, 2014 at 6:29 PM, Yinghai Lu <[email protected]> wrote:
>> > should use attached one instead.
>> >
>> > 1. should use _brk_end instead of &end, as we only use partial of
>> > brk.
>> > 2. [_brk_end, pm_end) page range is already converted. aka
>> > is not wasted.
>>
>> Are you sure? For me, _brk_end isn't far enough:
>>
>> [ 1.475572] all_end: 0xffffffff82df5000
>> [ 1.476736] _brk_end: 0xffffffff82dd6000
>
> _brk_end is adjusted at boot time via extend_brk() up to __brk_limit,
> which is the same as _end. We usually do not use all of that space. So
> it's expected that _brk_end < _end.
>
>> Is this correct? It sounded like tglx wanted the pmd split, like this:
>
> Yes, I wanted to get rid of the high mapping for anything between
> _brk_end and _end, and I brought you on the wrong track with my
> suggestion to call free_init_pages(). Sorry about that.
>
> That happened because I missed the completely non obvious fact, that
> only the effective brk section is reserved for the kernel via
> reserve_brk(). So the area between _brk_end and _end is already
> reusable. Though that reuse works only by chance and not by design and
> is completely undocumented as everything else in that area.
>
> So the initial patch to get rid of the X mapping is of course to just
> extend the area to the PMD. A little bit different to your initial
> patch, but essentially the same.
>
> - unsigned long all_end = PFN_ALIGN(&_end);
> + unsigned long all_end = roundup((unsigned long) &_end, PMD_SIZE);
>
> I'm going to apply your V1 patch with the above roundup()
> simplification. If a page of that area gets used later on then we are
> going to split up the PMD anyway.

That's fine by me. Yinghai Lu seems to have potentially better
solutions, but my head is spinning after reading more of this thread.
:) I just want to make sure that at the end of the day, there are no
RW+x mappings.

> But still we want to get rid of that highmap between _brk_end and
> _end, but there is absolutely no reason to come up with extra silly
> functions for that.
>
> So the obvious solution is to let setup_arch() reserve the memory up
> to _end instead of _bss_stop, get rid of the extra reservation in
> reserve_brk() and then let free_initmem() release the area between
> _brk_end and _end. No extra hackery, no side effects, just works.
>
> I spent quite some time to stare into that and I wonder about the
> following related issues:
>
> 1) Why is the mark_rodata_ro() business a debug configuration, i.e
> CONFIG_DEBUG_RODATA?
>
> This does not make any sense at all. We really want RO and NX on by
> default and AFAICT distros are turning that on anyway for obvious
> reasons.

This is a historically badly named config item. Once arm and arm64
land their CONFIG_DEBUG_RODATA implementations, I was going to suggest
having this renamed.

>
> The only idiocity I found so far is the kgdb Documentation which
> recommends to turn it off. Sigh.
>
> So that should be changed to:
>
> config ARCH_HAS_RONX
> bool
>
> config DISABLE_RONX
> def_bool !ARCH_HAS_RONX || !KGDB_ENABLE_SECURITY_HOLES
>
> plus
>
> config ARCH_WANTS_KGDB_SECURITY_HOLES
> bool
>
> config KGDB_ENABLE_SECURITY_HOLES
> bool "WTF?"
> depends on BROKEN && ARCH_WANTS_KGDB_SECURITY_HOLES
>
> 2) What is actually the modules counterpart for mark_rodata_ro()?
>
> CONFIG_DEBUG_SET_MODULE_RONX
>
> Of course not enabled by default, but enabled by distros again.
>
> See #1.
>
> Now what's interesting aside of the general fuckup is that
> CONFIG_DEBUG_RODATA is supported by:
>
> arch/x86 and arch/parisc
>
> But CONFIG_DEBUG_SET_MODULE_RONX is supported by:
>
> arch/arm/ arch/arm64 arch/s390 and arch/x86

I have a series for arm that is waiting to be picked up by rmk:
https://patchwork.ozlabs.org/patch/400383/

Laura has been working on a series for arm64:
http://comments.gmane.org/gmane.linux.ports.arm.kernel/366777

So what you're seeing is us being in the middle of providing support
for it. It just happens that CONFIG_DEBUG_SET_MODULE_RONX is a bit
easier to implement, so it was completed first.

>
> This does not make any sense at all.
>
> Do arm/arm64/s390 have other means to make RO/NX work or are they
> just doing it for modules? And how is that supposed to work with
> KGDB if it is not aware of modules sections being RO/NX? KGDB has
> only extra magic for CONFIG_DEBUG_RODATA, but not for
> CONFIG_DEBUG_SET_MODULE_RONX.
>
> Now for extended fun the x86 help text for that option says:
>
> ... Such protection may interfere with run-time code
> patching and dynamic kernel tracing - and they might also protect
> against certain classes of kernel exploits.
> If in doubt, say "N".
>
> Patently wrong. More sigh.

Like the CONFIG naming, I hope to start cleaning these defaults up
once arm and arm64 are landed.

> 3) Why is mark_rodata_ro() called AFTER free_initmem() and therefor
> cannot be marked __init ?
>
> Just because ...

That's worth examining. Since most of the logic that does the ro/nx
work needs to stick around for things like ftrace, kgdb, etc, it's
possible there wouldn't be much savings from making mark_rodata_ro as
__init. I'll add this to my list to check.

Thanks!

-Kees

--
Kees Cook
Chrome OS Security

2014-11-17 20:32:36

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Mon, Nov 17, 2014 at 12:27:59PM -0800, Kees Cook wrote:
> I have a series for arm that is waiting to be picked up by rmk:
> https://patchwork.ozlabs.org/patch/400383/

It should've been in linux-next via my tree for the last two weeks
or so.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-11-17 20:43:30

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Mon, Nov 17, 2014 at 12:32 PM, Russell King - ARM Linux
<[email protected]> wrote:
> On Mon, Nov 17, 2014 at 12:27:59PM -0800, Kees Cook wrote:
>> I have a series for arm that is waiting to be picked up by rmk:
>> https://patchwork.ozlabs.org/patch/400383/
>
> It should've been in linux-next via my tree for the last two weeks
> or so.

Fantastic! Thank you!

-Kees

--
Kees Cook
Chrome OS Security

2014-11-18 07:39:46

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Mon, Nov 17, 2014 at 12:27 PM, Kees Cook <[email protected]> wrote:
> On Sun, Nov 16, 2014 at 3:44 PM, Thomas Gleixner <[email protected]> wrote:
>>
>> So the initial patch to get rid of the X mapping is of course to just
>> extend the area to the PMD. A little bit different to your initial
>> patch, but essentially the same.
>>
>> - unsigned long all_end = PFN_ALIGN(&_end);
>> + unsigned long all_end = roundup((unsigned long) &_end, PMD_SIZE);
>>
>> I'm going to apply your V1 patch with the above roundup()
>> simplification. If a page of that area gets used later on then we are
>> going to split up the PMD anyway.
>
> That's fine by me. Yinghai Lu seems to have potentially better
> solutions, but my head is spinning after reading more of this thread.
> :) I just want to make sure that at the end of the day, there are no
> RW+x mappings.
>

Please check v3 that cleanup highmap in the middle.


Before patch:
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff81000000 16M pmd
0xffffffff81000000-0xffffffff82200000 18M ro PSE GLB x pmd
0xffffffff82200000-0xffffffff82c00000 10M ro PSE GLB NX pmd
0xffffffff82c00000-0xffffffff82e00000 2M RW PSE GLB NX pmd
0xffffffff82e00000-0xffffffff83000000 2M RW GLB NX pte
0xffffffff83000000-0xffffffff83200000 2M RW PSE GLB NX pmd
0xffffffff83200000-0xffffffff83400000 2M RW GLB NX pte
0xffffffff83400000-0xffffffff84200000 14M RW PSE GLB NX pmd
0xffffffff84200000-0xffffffff843a2000 1672K RW GLB NX pte
0xffffffff843a2000-0xffffffff84400000 376K RW GLB x pte
0xffffffff84400000-0xffffffffa0000000 444M pmd

After patch:
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff81000000 16M pmd
0xffffffff81000000-0xffffffff82000000 16M ro PSE GLB x pmd
0xffffffff82000000-0xffffffff82011000 68K ro GLB x pte
0xffffffff82011000-0xffffffff82200000 1980K pte
0xffffffff82200000-0xffffffff82a00000 8M ro PSE GLB NX pmd
0xffffffff82a00000-0xffffffff82a1d000 116K ro GLB NX pte
0xffffffff82a1d000-0xffffffff82c00000 1932K pte
0xffffffff82c00000-0xffffffff82e00000 2M RW PSE GLB NX pmd
0xffffffff82e00000-0xffffffff82e52000 328K RW GLB NX pte
0xffffffff82e52000-0xffffffff83000000 1720K pte
0xffffffff83000000-0xffffffff83200000 2M pmd
0xffffffff83200000-0xffffffff83213000 76K pte
0xffffffff83213000-0xffffffff83400000 1972K RW GLB NX pte
0xffffffff83400000-0xffffffff84200000 14M RW PSE GLB NX pmd
0xffffffff84200000-0xffffffff84383000 1548K RW GLB NX pte
0xffffffff84383000-0xffffffff84400000 500K pte
0xffffffff84400000-0xffffffffa0000000 444M pmd

Thanks

Yinghai


Attachments:
nx_after_end_v3_1.patch (3.00 kB)
nx_after_end_v3_2.patch (6.13 kB)
Download all attachments

2014-11-18 17:12:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2] x86, mm: set NX across entire PMD at boot

On Fri, 14 Nov 2014, Kees Cook wrote:
> On Fri, Nov 14, 2014 at 6:29 PM, Yinghai Lu <[email protected]> wrote:
> > On Fri, Nov 14, 2014 at 5:29 PM, Yinghai Lu <[email protected]> wrote:
> >> On Fri, Nov 14, 2014 at 12:45 PM, Kees Cook <[email protected]> wrote:
> >>> v2:
> >>> - added call to free_init_pages(), as suggested by tglx
> >
> >> something is wrong:
> >>
> >> [ 7.842479] Freeing unused kernel memory: 3844K (ffffffff82e52000 -
> >> ffffffff83213000)
> >> [ 7.843305] Write protecting the kernel read-only data: 28672k
> >
> > ....
> > should use attached one instead.
> >
> > 1. should use _brk_end instead of &end, as we only use partial of
> > brk.
> > 2. [_brk_end, pm_end) page range is already converted. aka
> > is not wasted.
>
> Are you sure? For me, _brk_end isn't far enough:

_brk_end is guaranteed to be <= _end. But we really want to use
_brk_end here, because if we have the following situation:

_brk_start: 0x03ff0000
_brk_end: 0x03ffff00
_end: 0x04016000

So we have the following PMDs setup:

0x03e00000 pmd rw nx <- covers the top of .bss
and the start of .brk
0x04000000 pmd rw nx <- covers the end of .brk
and some random unused

So if _brk_end is less than 0x04000000, then cleanup_highmem() has
zapped the extra PMD already. So we don't want to call set_nx() on
that.

If _brk_end is >= 0x04000000 then we cover that last pmd with the
set_nx call.

Completely non obvious as anything else in that area.

Thanks,

tglx