LinuxLists.cc - [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/02/2017 09:23 PM, Fengguang Wu wrote:
[...]
> I confirm that the below patch provided by Daniel fixes the above
> issues on mainline kernel, too. Where should this patch be sent to?

If nobody objects, I could send it to -net tree via Dave due to being
BPF related, but I don't mind sending it elsewhere too (f.e. Linus
directly?) in order to stop your bot from continuing to send such mails.

The issue seems only related to i386 and doesn't trigger each time with
Fengguang's kernel config and qemu image when I try to reproduce it.
set_memory_ro()/set_memory_rw() on i386 seems to work in general, but
when it's used/reproduced, from time to time (perhaps some corner-case?)
it looks like that memory area can have issues much later on after being
fed back to the allocator which then causes a GPF from random locations.
Gut feeling, it might be an issue in set_memory_*() that my commit
uncovered. Still looking into it, but mean-time I could just send the
below, sure.

Thanks,
Daniel

> It'd be very noisy if all these Oops hit the upcoming RC1 kernel.
>
> Daniel thinks there may be deeper problem in i386 set_memory_rw().
> However that could take much longer time to debug.
>
> Thanks,
> Fengguang
> ---
>
> Re: [bpf] 9d876e79df: BUG: unable to handle kernel paging request at 653a8346
>
>> On Tue, Feb 28, 2017 at 04:39:36PM +0100, Daniel Borkmann wrote:
>
> I have a rough feeling what it is, but I didn't have cycles to work on
> it yet (due to travel, sorry about that). The issue is likely shut down
> by just doing:
>
> ---
> arch/x86/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- linux.orig/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
> +++ linux/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
> @@ -54,7 +54,7 @@ config X86
> select ARCH_HAS_KCOV if X86_64
> select ARCH_HAS_MMIO_FLUSH
> select ARCH_HAS_PMEM_API if X86_64
> - select ARCH_HAS_SET_MEMORY
> + select ARCH_HAS_SET_MEMORY if X86_64
> select ARCH_HAS_SG_CHAIN
> select ARCH_HAS_STRICT_KERNEL_RWX
> select ARCH_HAS_STRICT_MODULE_RWX

2017-03-08 19:26:45

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

Adding x86 people too, since this seems to be something off about
ARCH_HAS_SET_MEMORY for x86-32.

The code seems to be shared between x86-32 and 64, I'm not seeing why
set_memory_r[ow]() should fail on one but not the other.

Considering that it seems to be flaky even on 32-bit, maybe it's
timing-related, or possibly related to TLB sizes or whatever (ie more
likely hidden by a larger TLB on more modern hardware?)

Anyway, just looking at change_page_attr_set_clr(), I notice that the
page alias checking treats NX specially:

/* No alias checking for _NX bit modifications */
checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;

which seems insane. Why would NX be different from other protection
bits (like _PAGE_RW)?

But that doesn't explain why the bpf code would have issues with this
all only on x86-32.

Maybe somebody else can see why ARCH_HAS_SET_MEMORY would depend on
64-bit only..

Linus

On Thu, Mar 2, 2017 at 12:40 PM, Daniel Borkmann <[email protected]> wrote:
> On 03/02/2017 09:23 PM, Fengguang Wu wrote:
> [...]
>>
>> I confirm that the below patch provided by Daniel fixes the above
>> issues on mainline kernel, too. Where should this patch be sent to?
>
>
> If nobody objects, I could send it to -net tree via Dave due to being
> BPF related, but I don't mind sending it elsewhere too (f.e. Linus
> directly?) in order to stop your bot from continuing to send such mails.
>
> The issue seems only related to i386 and doesn't trigger each time with
> Fengguang's kernel config and qemu image when I try to reproduce it.
> set_memory_ro()/set_memory_rw() on i386 seems to work in general, but
> when it's used/reproduced, from time to time (perhaps some corner-case?)
> it looks like that memory area can have issues much later on after being
> fed back to the allocator which then causes a GPF from random locations.
> Gut feeling, it might be an issue in set_memory_*() that my commit
> uncovered. Still looking into it, but mean-time I could just send the
> below, sure.
>
> Thanks,
> Daniel
>
>
>> It'd be very noisy if all these Oops hit the upcoming RC1 kernel.
>>
>> Daniel thinks there may be deeper problem in i386 set_memory_rw().
>> However that could take much longer time to debug.
>>
>> Thanks,
>> Fengguang
>> ---
>>
>> Re: [bpf] 9d876e79df: BUG: unable to handle kernel paging request at
>> 653a8346
>>
>>> On Tue, Feb 28, 2017 at 04:39:36PM +0100, Daniel Borkmann wrote:
>>
>>
>> I have a rough feeling what it is, but I didn't have cycles to work on
>> it yet (due to travel, sorry about that). The issue is likely shut down
>> by just doing:
>>
>> ---
>> arch/x86/Kconfig | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> --- linux.orig/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>> +++ linux/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>> @@ -54,7 +54,7 @@ config X86
>> select ARCH_HAS_KCOV if X86_64
>> select ARCH_HAS_MMIO_FLUSH
>> select ARCH_HAS_PMEM_API if X86_64
>> - select ARCH_HAS_SET_MEMORY
>> + select ARCH_HAS_SET_MEMORY if X86_64
>> select ARCH_HAS_SG_CHAIN
>> select ARCH_HAS_STRICT_KERNEL_RWX
>> select ARCH_HAS_STRICT_MODULE_RWX
>
>

2017-03-08 22:27:52

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

[ + Kees, Laura, and Dave ]

On 03/08/2017 08:25 PM, Linus Torvalds wrote:
> Adding x86 people too, since this seems to be something off about
> ARCH_HAS_SET_MEMORY for x86-32.
>
> The code seems to be shared between x86-32 and 64, I'm not seeing why
> set_memory_r[ow]() should fail on one but not the other.
>
> Considering that it seems to be flaky even on 32-bit, maybe it's
> timing-related, or possibly related to TLB sizes or whatever (ie more
> likely hidden by a larger TLB on more modern hardware?)
>
> Anyway, just looking at change_page_attr_set_clr(), I notice that the
> page alias checking treats NX specially:
>
> /* No alias checking for _NX bit modifications */
> checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
>
> which seems insane. Why would NX be different from other protection
> bits (like _PAGE_RW)?
>
> But that doesn't explain why the bpf code would have issues with this
> all only on x86-32.
>
> Maybe somebody else can see why ARCH_HAS_SET_MEMORY would depend on
> 64-bit only..

Last days I tried to narrow this down a bit, I have a dirty hack I
used just for the sake of experimenting when booting into Fengguang's
x86-32 test kernel via qemu:

lib/Makefile | 3 ++-
lib/test_setmem.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 1 deletion(-)
create mode 100644 lib/test_setmem.c

diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a..702af98 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,8 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
bsearch.o find_bit.o llist.o memweight.o kfifo.o \
percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
- once.o refcount.o
+ once.o refcount.o test_setmem.o
+
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
obj-y += hexdump.o
diff --git a/lib/test_setmem.c b/lib/test_setmem.c
new file mode 100644
index 0000000..82b08db
--- /dev/null
+++ b/lib/test_setmem.c
@@ -0,0 +1,50 @@
+#include <linux/init.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+
+#include <asm/cacheflush.h>
+
+static int __init test_setmem(void)
+{
+ const gfp_t gfp_flags = GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO;
+ int ret, i, good = 0, max_runs = 1000, one = 1;
+ u32 pages, size;
+ u8 *buff;
+
+ for (i = 0; i < max_runs; i++) {
+ size = round_up(4096*8, PAGE_SIZE);
+ pages = size / PAGE_SIZE;
+
+ buff = __vmalloc(size, gfp_flags, PAGE_KERNEL);
+ if (buff == NULL) {
+ printk("%s: allocation failed!\n", __func__);
+ return 0;
+ }
+#if 1
+ buff[0] = 41;
+ buff[10] = 42;
+ buff[100] = 43;
+ buff[1000] = 44;
+ // can also be just read access, f.e. if (buff[0] + ...)
+#endif
+ WARN_ON_ONCE(set_memory_ro((unsigned long)buff, pages));
+// msleep(0);
+
+ ret = probe_kernel_write(&buff[1], &one, sizeof(one));
+ if (!ret)
+ printk("%s: write succeeded despite read-only!\n", __func__);
+ else
+ good++;
+
+ WARN_ON_ONCE(set_memory_rw((unsigned long)buff, pages));
+ vfree(buff);
+ }
+
+ printk("%s: write failed due to read-only: %d/%d\n", __func__, good, max_runs);
+ return 0;
+}
+
+late_initcall(test_setmem);
--
2.9.3

The issue seems to be accessing buff first (can be read or write access)
and then doing set_memory_ro() doesn't make it read-only immediately,
meaning the subsequent call into probe_kernel_write() will succeed without
error.

Then, if I don't touch buff first and only do the set_memory_ro() seems
to work and probe_kernel_write() will then fail as expected due to pages
being read-only now.

Now, if I access buff, do the set_memory_ro() and then a msleep(0), for
example, it "kind of" works most of the time (see last log extract below),
and probe_kernel_write() will fail.

None of this seems an issue with x86_64 and the test_setmem runs fine all
the time, same for the actual BPF stuff.

I've also added rodata_test from the log (perhaps related).

x86_64 (access data before set_memory_ro()):

[...]
[ 1.357834] Key type big_key registered
[ 1.418542] test_setmem: write failed due to read-only: 1000/1000
[ 1.419545] Magic number: 9:497:653
[ 1.419584] acpi LNXCPU:06: hash matches
[ 1.419750] rtc_cmos 00:02: setting system clock to 2017-03-08 22:37:44 UTC (1489012664)
[ 1.419777] PM: Hibernation image not present or could not be loaded.
[ 1.420643] Freeing unused kernel memory: 1720K
[ 1.420643] Write protecting the kernel read-only data: 14336k
[ 1.420997] Freeing unused kernel memory: 1644K
[ 1.421844] Freeing unused kernel memory: 652K
[ 1.423054] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[ 1.423057] rodata_test: all tests were successful
[...]

x86-32 (access data before set_memory_ro()):

[...]
[ 8.551454] Key type trusted registered
[ 8.551792] Key type encrypted registered
[ 8.552274] test_setmem: write succeeded despite read-only!
[ 8.552740] test_setmem: write succeeded despite read-only!
[ 8.553205] test_setmem: write succeeded despite read-only!
[ 8.553686] test_setmem: write succeeded despite read-only!
[ 8.554179] test_setmem: write succeeded despite read-only!
[ 8.554666] test_setmem: write succeeded despite read-only!
[ 8.555157] test_setmem: write succeeded despite read-only!
[...]
[ 9.042271] test_setmem: write succeeded despite read-only!
[ 9.042741] test_setmem: write failed due to read-only: 0/1000
[ 9.043343] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 9.044034] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 9.044597] EDD information not available.
[ 9.045233] Freeing unused kernel memory: 612K
[ 9.045732] Write protecting the kernel text: 12604k
[ 9.046234] Write protecting the kernel read-only data: 5124k
[ 9.046781] rodata_test: test data was not read only
[...]

x86-32 (no access data before set_memory_ro()):

[...]
[ 8.537275] Key type encrypted registered
[ 8.541771] test_setmem: write failed due to read-only: 1000/1000
[ 8.542481] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 8.543265] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 8.543824] EDD information not available.
[ 8.544467] Freeing unused kernel memory: 612K
[ 8.544909] Write protecting the kernel text: 12604k
[ 8.545399] Write protecting the kernel read-only data: 5128k
[ 8.545903] rodata_test: test data was not read only
[...]

x86-32 (access data + msleep(0) before set_memory_ro()):

* run 1:

[...]
[ 8.478233] Key type encrypted registered
[ 20.334254] test_setmem: write succeeded despite read-only!
[ 28.474980] test_setmem: write failed due to read-only: 999/1000
[ 28.478187] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 28.480476] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 28.481342] EDD information not available.
[ 28.482192] Freeing unused kernel memory: 612K
[ 28.482841] Write protecting the kernel text: 12604k
[ 28.483568] Write protecting the kernel read-only data: 5128k
[ 28.484364] rodata_test: test data was not read only
[...]

* run 2:

[...]
[ 8.472088] Key type encrypted registered
[ 28.466359] test_setmem: write failed due to read-only: 1000/1000
[ 28.469162] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 28.470838] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 28.471590] EDD information not available.
[ 28.472346] Freeing unused kernel memory: 612K
[ 28.472910] Write protecting the kernel text: 12604k
[ 28.473542] Write protecting the kernel read-only data: 5128k
[ 28.474232] rodata_test: test data was not read only
[...]

I've attached the .config and Fengguang's test script. Qemu version is
2.7.1 from Fedora. Can be run as:

./reproduce-quantal-ivb41-127-20170301175555-i386-randconfig-b0-02262304-4.10.0-04456-g3051bf3-1 arch/x86/boot/bzImage

Thanks,
Daniel

> Linus
>
> On Thu, Mar 2, 2017 at 12:40 PM, Daniel Borkmann <[email protected]> wrote:
>> On 03/02/2017 09:23 PM, Fengguang Wu wrote:
>> [...]
>>>
>>> I confirm that the below patch provided by Daniel fixes the above
>>> issues on mainline kernel, too. Where should this patch be sent to?
>>
>>
>> If nobody objects, I could send it to -net tree via Dave due to being
>> BPF related, but I don't mind sending it elsewhere too (f.e. Linus
>> directly?) in order to stop your bot from continuing to send such mails.
>>
>> The issue seems only related to i386 and doesn't trigger each time with
>> Fengguang's kernel config and qemu image when I try to reproduce it.
>> set_memory_ro()/set_memory_rw() on i386 seems to work in general, but
>> when it's used/reproduced, from time to time (perhaps some corner-case?)
>> it looks like that memory area can have issues much later on after being
>> fed back to the allocator which then causes a GPF from random locations.
>> Gut feeling, it might be an issue in set_memory_*() that my commit
>> uncovered. Still looking into it, but mean-time I could just send the
>> below, sure.
>>
>> Thanks,
>> Daniel
>>
>>
>>> It'd be very noisy if all these Oops hit the upcoming RC1 kernel.
>>>
>>> Daniel thinks there may be deeper problem in i386 set_memory_rw().
>>> However that could take much longer time to debug.
>>>
>>> Thanks,
>>> Fengguang
>>> ---
>>>
>>> Re: [bpf] 9d876e79df: BUG: unable to handle kernel paging request at
>>> 653a8346
>>>
>>>> On Tue, Feb 28, 2017 at 04:39:36PM +0100, Daniel Borkmann wrote:
>>>
>>>
>>> I have a rough feeling what it is, but I didn't have cycles to work on
>>> it yet (due to travel, sorry about that). The issue is likely shut down
>>> by just doing:
>>>
>>> ---
>>> arch/x86/Kconfig | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> --- linux.orig/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>>> +++ linux/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>>> @@ -54,7 +54,7 @@ config X86
>>> select ARCH_HAS_KCOV if X86_64
>>> select ARCH_HAS_MMIO_FLUSH
>>> select ARCH_HAS_PMEM_API if X86_64
>>> - select ARCH_HAS_SET_MEMORY
>>> + select ARCH_HAS_SET_MEMORY if X86_64
>>> select ARCH_HAS_SG_CHAIN
>>> select ARCH_HAS_STRICT_KERNEL_RWX
>>> select ARCH_HAS_STRICT_MODULE_RWX
>>
>>

Attachments:

config.bak (95.29 kB)
reproduce-quantal-ivb41-127-20170301175555-i386-randconfig-b0-02262304-4.10.0-04456-g3051bf3-1 (888.00 B)
Download all attachments

2017-03-08 22:36:49

by Kees Cook

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
> [ 28.474232] rodata_test: test data was not read only
> [...]

In my tests so far, I've never been able to get rodata_test to fail
(Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
recheck under Qemu 2.7.1. Do you see these failures on real hardware?

-Kees

--
Kees Cook
Pixel Security

2017-03-08 22:45:27

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>
> The issue seems to be accessing buff first (can be read or write access)
> and then doing set_memory_ro() doesn't make it read-only immediately,
> meaning the subsequent call into probe_kernel_write() will succeed without
> error.
>
> Then, if I don't touch buff first and only do the set_memory_ro() seems
> to work and probe_kernel_write() will then fail as expected due to pages
> being read-only now.

Ok, that definitely sounds like a TLB invalidate didn't happen.

> Now, if I access buff, do the set_memory_ro() and then a msleep(0), for
> example, it "kind of" works most of the time (see last log extract below),
> and probe_kernel_write() will fail.

Yeah, very much consistent with a missing TLB invalidate. Scheduling
will end up invalidating it, although if it's a global page even that
might not do it (but eventually the entry will just get flushed due to
other activity).

> None of this seems an issue with x86_64 and the test_setmem runs fine all
> the time, same for the actual BPF stuff.

The code does look somewhat confused about when to actually flush
things - see my earlier note about NX - but it would seem to always do
__flush_tlb_all() unless I missed something. At least as long as
CPA_FLUSHTLB is set. Maybe some case forgets to set that..

Linus

2017-03-08 23:19:03

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/08/2017 11:36 PM, Kees Cook wrote:
> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>> [ 28.474232] rodata_test: test data was not read only
>> [...]
>
> In my tests so far, I've never been able to get rodata_test to fail
> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
> recheck under Qemu 2.7.1. Do you see these failures on real hardware?

The x86_64 tests on real hardware, and the x86-32 only in the qemu
environment with the reproducer script. Haven't tested x86-32 kernel
outside of qemu so far.

Thanks,
Daniel

2017-03-09 00:23:58

by Laura Abbott

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/08/2017 02:36 PM, Kees Cook wrote:
> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>> [ 28.474232] rodata_test: test data was not read only
>> [...]
>
> In my tests so far, I've never been able to get rodata_test to fail
> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
> recheck under Qemu 2.7.1. Do you see these failures on real hardware?
>
> -Kees
>

FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24
and rawhide respectively.

I also notice that CONFIG_X86_PAE is turned off in the defconfig. If
I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem
goes away. I can't tell if this is an indication of magically hiding
the TLB problem or if there is an issue with !X86_PAE invalidation.

Thanks,
Laura

2017-03-09 01:34:59

by Fengguang Wu

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, Mar 08, 2017 at 02:43:44PM -0800, Linus Torvalds wrote:
>On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>>
>> The issue seems to be accessing buff first (can be read or write access)
>> and then doing set_memory_ro() doesn't make it read-only immediately,
>> meaning the subsequent call into probe_kernel_write() will succeed without
>> error.
>>
>> Then, if I don't touch buff first and only do the set_memory_ro() seems
>> to work and probe_kernel_write() will then fail as expected due to pages
>> being read-only now.
>
>Ok, that definitely sounds like a TLB invalidate didn't happen.
>
>> Now, if I access buff, do the set_memory_ro() and then a msleep(0), for
>> example, it "kind of" works most of the time (see last log extract below),
>> and probe_kernel_write() will fail.
>
>Yeah, very much consistent with a missing TLB invalidate. Scheduling
>will end up invalidating it, although if it's a global page even that
>might not do it (but eventually the entry will just get flushed due to
>other activity).
>
>> None of this seems an issue with x86_64 and the test_setmem runs fine all
>> the time, same for the actual BPF stuff.
>
>The code does look somewhat confused about when to actually flush
>things - see my earlier note about NX - but it would seem to always do
>__flush_tlb_all() unless I missed something. At least as long as
>CPA_FLUSHTLB is set. Maybe some case forgets to set that..

Not sure if it's relevant, but out of 189 boots there are 2 boots
showing the below "CPA: called for zero pte." warning.

[ 7.116932] random: trinity: uninitialized urandom read (4 bytes read)
[ 16.366468] sock: process `trinity-main' is using obsolete setsockopt SO_BSDCOMPAT
[ 17.202396] BUG: unable to handle kernel paging request at 655d9eb2
[ 17.204081] IP: __release_sock+0x6e/0x100
[ 17.205207] *pde = 00000000
[ 17.205208]
[ 17.206755] Oops: 0000 [#1]
[ 17.207686] CPU: 0 PID: 382 Comm: trinity-main Not tainted 4.10.0-rc8-02017-g9d876e7 #1
[ 17.209819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 17.212431] task: d625d200 task.stack: d6222000
[ 17.213655] EIP: __release_sock+0x6e/0x100
[ 17.214833] EFLAGS: 00010246 CPU: 0
[ 17.215951] EAX: 00000000 EBX: 655d9eb2 ECX: 00000000 EDX: 00000201
[ 17.217587] ESI: 00000605 EDI: d6064800 EBP: d6223ef4 ESP: d6223ee8
[ 17.219185] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[ 17.220602] CR0: 80050033 CR2: 655d9eb2 CR3: 1610f000 CR4: 00000610
[ 17.221966] DR0: 080cb000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 17.223444] DR6: ffff0ff0 DR7: 00000600
[ 17.224343] Call Trace:
[ 17.225007] release_sock+0x2e/0x80
[ 17.225900] sock_setsockopt+0x8c/0x880
[ 17.226857] SyS_socketcall+0x658/0x6a0
[ 17.227804] do_fast_syscall_32+0x9a/0x160
[ 17.228765] entry_SYSENTER_32+0x4c/0x7b
[ 17.229694] EIP: 0xb7777cc5
[ 17.230428] EFLAGS: 00000282 CPU: 0
[ 17.231263] EAX: ffffffda EBX: 0000000e ECX: bfedce00 EDX: bfedce80
[ 17.232582] ESI: 0000001a EDI: 000000ae EBP: b754f93c ESP: bfedcdec
[ 17.233882] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 17.235044] Code: eb 29 8d 76 00 89 da 89 f8 ff 97 98 01 00 00 31 c9 ba 06 08 00 00 b8 d8 19 b1 c1 e8 ed 3d 85 ff e8 e8 62 04 00 85 f6 89 f3 74 42 <8b>
33 0f 18 06 8b 43 48 a8 01 74 0e 83 e0 fe 74 09 80 3d 3d 9c
[ 17.240429] EIP: __release_sock+0x6e/0x100 SS:ESP: 0068:d6223ee8
[ 17.241689] CR2: 00000000655d9eb2
[ 17.242509] ---[ end trace dc10480164c75444 ]---
[ 17.243569] ------------[ cut here ]------------
[ 17.243574] WARNING: CPU: 0 PID: 15 at arch/x86/mm/pageattr.c:1150 __cpa_process_fault+0x388/0x390
[ 17.243575] CPA: called for zero pte. vaddr = d7ab4000 cpa->vaddr = d7ab4000
[ 17.243577] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G D 4.10.0-rc8-02017-g9d876e7 #1
[ 17.243578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 17.243582] Workqueue: events bpf_prog_free_deferred
[ 17.243583] Call Trace:
[ 17.243588] dump_stack+0x16/0x25
[ 17.243588] dump_stack+0x16/0x25
[ 17.243590] __warn+0xd1/0xf0
[ 17.243592] ? __cpa_process_fault+0x388/0x390
[ 17.243593] warn_slowpath_fmt+0x3b/0x40
[ 17.243594] __cpa_process_fault+0x388/0x390
[ 17.243596] ? lookup_address_in_pgd+0xa/0x90
[ 17.243598] __change_page_attr+0x520/0x6c0
[ 17.243600] ? pfn_range_is_mapped+0xe/0x80
[ 17.243601] __change_page_attr_set_clr+0x38/0x180
[ 17.243603] change_page_attr_set_clr+0x107/0x3f0
[ 17.243605] ? dequeue_entity+0x86/0x230
[ 17.243607] set_memory_rw+0x3a/0x40
[ 17.243608] bpf_prog_free_deferred+0x16/0x30
[ 17.243612] process_one_work+0xfc/0x440
[ 17.243614] ? pick_next_task_fair+0x149/0x1d0
[ 17.243615] worker_thread+0x37/0x4e0
[ 17.243617] kthread+0xdd/0x110
[ 17.243618] ? process_one_work+0x440/0x440
[ 17.243620] ? __kthread_create_on_node+0x100/0x100
[ 17.243622] ret_from_fork+0x21/0x2c
[ 17.243623] ---[ end trace dc10480164c75445 ]---
[ 17.243627] BUG: unable to handle kernel NULL pointer dereference at 00000007
[ 17.243630] IP: ___cache_free+0x14/0x140
[ 17.243631] *pde = 00000000
[ 17.243631]
[ 17.243633] Oops: 0000 [#2]
[ 17.243635] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G D W 4.10.0-rc8-02017-g9d876e7 #1
[ 17.243635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 17.243636] Workqueue: events bpf_prog_free_deferred
[ 17.243637] task: d524e8c0 task.stack: d5254000
[ 17.243639] EIP: ___cache_free+0x14/0x140
[ 17.243640] EFLAGS: 00010046 CPU: 0
[ 17.243641] EAX: d6945af8 EBX: 00000003 ECX: c10dc08e EDX: 3beb072d
[ 17.243642] ESI: d6945af8 EDI: 3beb072d EBP: d5255f00 ESP: d5255ed4
[ 17.243643] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 17.243644] CR0: 80050033 CR2: 00000007 CR3: 1610f000 CR4: 00000610
[ 17.243647] DR0: 080cb000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 17.243648] DR6: ffff0ff0 DR7: 00000600
[ 17.243648] Call Trace:
[ 17.243650] kfree+0x64/0xe0
[ 17.243652] ? bpf_prog_free_deferred+0x1e/0x30
[ 17.243653] bpf_prog_free_deferred+0x1e/0x30
[ 17.243654] process_one_work+0xfc/0x440
[ 17.243656] ? pick_next_task_fair+0x149/0x1d0
[ 17.243658] worker_thread+0x37/0x4e0
[ 17.243659] kthread+0xdd/0x110
[ 17.243661] ? process_one_work+0x440/0x440
[ 17.243662] ? __kthread_create_on_node+0x100/0x100
[ 17.243664] ret_from_fork+0x21/0x2c
[ 17.243664] Code: 89 da 89 f0 ff 0d 64 3e b4 c1 e8 f8 fe ff ff 83 c4 14 5b 5e 5d c3 90 55 89 e5 57 56 53 83 ec 20 e8 d2 21 72 00 8b 18 89 c6 89 d7 <8b> 43 04 39 03 73 65 a1 a0 fa 4b c2 85 c0 7f 1c 8b 03 8d 50 01
[ 17.243684] EIP: ___cache_free+0x14/0x140 SS:ESP: 0068:d5255ed4
[ 17.243684] CR2: 0000000000000007
[ 17.243685] ---[ end trace dc10480164c75446 ]---
[ 17.243686] Kernel panic - not syncing: Fatal exception
[ 17.243687] Kernel Offset: disabled

Regards,
Fengguang

2017-03-09 05:36:25

by Kees Cook

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, Mar 8, 2017 at 3:55 PM, Laura Abbott <[email protected]> wrote:
> On 03/08/2017 02:36 PM, Kees Cook wrote:
>> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>>> [ 28.474232] rodata_test: test data was not read only
>>> [...]
>>
>> In my tests so far, I've never been able to get rodata_test to fail
>> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
>> recheck under Qemu 2.7.1. Do you see these failures on real hardware?
>>
>> -Kees
>>
>
> FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24
> and rawhide respectively.
>
> I also notice that CONFIG_X86_PAE is turned off in the defconfig. If
> I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem
> goes away. I can't tell if this is an indication of magically hiding
> the TLB problem or if there is an issue with !X86_PAE invalidation.

I found my difference. I normally run qemu with "-cpu host" which
makes the failure go away. With "-cpu kvm64", I see the rodata_test
failure immediately. Seems like this may be a kvm cpu feature
emulation bug? I'll see if I can find the specific cpu feature in the
morning...

-Kees

--
Kees Cook
Pixel Security

2017-03-09 13:05:13

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 06:36 AM, Kees Cook wrote:
> On Wed, Mar 8, 2017 at 3:55 PM, Laura Abbott <[email protected]> wrote:
>> On 03/08/2017 02:36 PM, Kees Cook wrote:
>>> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>>>> [ 28.474232] rodata_test: test data was not read only
>>>> [...]
>>>
>>> In my tests so far, I've never been able to get rodata_test to fail
>>> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
>>> recheck under Qemu 2.7.1. Do you see these failures on real hardware?
>>>
>>> -Kees
>>
>> FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24
>> and rawhide respectively.
>>
>> I also notice that CONFIG_X86_PAE is turned off in the defconfig. If
>> I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem
>> goes away. I can't tell if this is an indication of magically hiding
>> the TLB problem or if there is an issue with !X86_PAE invalidation.
>
> I found my difference. I normally run qemu with "-cpu host" which
> makes the failure go away. With "-cpu kvm64", I see the rodata_test
> failure immediately. Seems like this may be a kvm cpu feature
> emulation bug? I'll see if I can find the specific cpu feature in the
> morning...

Interesting! Changing to "-cpu host" makes rodata_test succeed plus
my test_setmem and the test_bpf suite runs fine as well. Haven't seen
a corruption since. Switching back to "-cpu kvm64" I immediately see
mentioned issues again.

With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
was set each time we switched attrs and a cpa_flush_range() was
performed (with the correct number of pages and cache set to 0). That
would be a __flush_tlb_all() eventually.

Hmm, it indeed might seem likely that this could be an emulation bug.

Thanks,
Daniel

2017-03-09 13:11:08

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, 9 Mar 2017, Daniel Borkmann wrote:
> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
> was set each time we switched attrs and a cpa_flush_range() was
> performed (with the correct number of pages and cache set to 0). That
> would be a __flush_tlb_all() eventually.
>
> Hmm, it indeed might seem likely that this could be an emulation bug.

Which variant of __flush_tlb_all() is used when the test fails?

Check for the following flags in /proc/cpuinfo: pge invpcid

Thanks,

tglx

2017-03-09 13:41:56

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, 8 Mar 2017, Linus Torvalds wrote:

> Adding x86 people too, since this seems to be something off about
> ARCH_HAS_SET_MEMORY for x86-32.
>
> The code seems to be shared between x86-32 and 64, I'm not seeing why
> set_memory_r[ow]() should fail on one but not the other.

Indeed.

> Considering that it seems to be flaky even on 32-bit, maybe it's
> timing-related, or possibly related to TLB sizes or whatever (ie more
> likely hidden by a larger TLB on more modern hardware?)

The only difference I can see is the way how __tlb_flush_all() is
happening. We have 3 variants:

invpcid_flush_all() - depends on X86_FEATURE_INVPCID and X86_FEATURE_PGE
cr4 based flush - depends on X86_FEATURE_PGE
cr3 based flush

No idea which variant is used in that failure case.

> Anyway, just looking at change_page_attr_set_clr(), I notice that the
> page alias checking treats NX specially:
>
> /* No alias checking for _NX bit modifications */
> checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
>
> which seems insane. Why would NX be different from other protection
> bits (like _PAGE_RW)?

The reason is that the alias mapping should never be executable at all.

Thanks,

tglx

2017-03-09 14:01:57

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
>> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
>> was set each time we switched attrs and a cpa_flush_range() was
>> performed (with the correct number of pages and cache set to 0). That
>> would be a __flush_tlb_all() eventually.
>>
>> Hmm, it indeed might seem likely that this could be an emulation bug.
>
> Which variant of __flush_tlb_all() is used when the test fails?
>
> Check for the following flags in /proc/cpuinfo: pge invpcid

I added the following and booted with both variants:

printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));

"-cpu host" gives:

[ 8.326117] X86_FEATURE_PGE:1
[ 8.326381] X86_FEATURE_INVPCID:1

"-cpu kvm64" gives:

[ 8.517069] X86_FEATURE_PGE:1
[ 8.517393] X86_FEATURE_INVPCID:0

Thanks,
Daniel

2017-03-09 14:49:53

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, 9 Mar 2017, Daniel Borkmann wrote:

> On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
> > On Thu, 9 Mar 2017, Daniel Borkmann wrote:
> > > With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
> > > code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
> > > was set each time we switched attrs and a cpa_flush_range() was
> > > performed (with the correct number of pages and cache set to 0). That
> > > would be a __flush_tlb_all() eventually.
> > >
> > > Hmm, it indeed might seem likely that this could be an emulation bug.
> >
> > Which variant of __flush_tlb_all() is used when the test fails?
> >
> > Check for the following flags in /proc/cpuinfo: pge invpcid
>
> I added the following and booted with both variants:
>
> printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
>
> "-cpu host" gives:
>
> [ 8.326117] X86_FEATURE_PGE:1
> [ 8.326381] X86_FEATURE_INVPCID:1
>
> "-cpu kvm64" gives:
>
> [ 8.517069] X86_FEATURE_PGE:1
> [ 8.517393] X86_FEATURE_INVPCID:0

That's the one which fails. So it's using the CR4 based flushing. Just ran
a test on a physical system with PGE=1 and INVPCID=0. Works fine.

Emulation problem?

Thanks,

tglx

2017-03-09 14:56:26

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 02:25 PM, Daniel Borkmann wrote:
> On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
>> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>>> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
>>> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
>>> was set each time we switched attrs and a cpa_flush_range() was
>>> performed (with the correct number of pages and cache set to 0). That
>>> would be a __flush_tlb_all() eventually.
>>>
>>> Hmm, it indeed might seem likely that this could be an emulation bug.
>>
>> Which variant of __flush_tlb_all() is used when the test fails?
>>
>> Check for the following flags in /proc/cpuinfo: pge invpcid
>
> I added the following and booted with both variants:
>
> printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
>
> "-cpu host" gives:
>
> [ 8.326117] X86_FEATURE_PGE:1
> [ 8.326381] X86_FEATURE_INVPCID:1
>
> "-cpu kvm64" gives:
>
> [ 8.517069] X86_FEATURE_PGE:1
> [ 8.517393] X86_FEATURE_INVPCID:0

Fwiw, I tried switching from using cr4 (__native_flush_tlb_global_irq_disabled())
to slower cr3 (__native_flush_tlb()) in "-cpu kvm64" mode, and it looks like it
also lets all test cases pass (rodata_test, test_setmem, test_bpf), no corruption
happening, etc.

Test diff used:

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6fa8594..34f4582 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -188,9 +188,9 @@ static inline void __native_flush_tlb_single(unsigned long addr)

static inline void __flush_tlb_all(void)
{
- if (static_cpu_has(X86_FEATURE_PGE))
- __flush_tlb_global();
- else
+// if (static_cpu_has(X86_FEATURE_PGE))
+// __flush_tlb_global();
+// else
__flush_tlb();
}

2017-03-09 17:48:50

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 9, 2017 at 6:53 AM, Daniel Borkmann <[email protected]> wrote:
>
> Fwiw, I tried switching from using cr4
> (__native_flush_tlb_global_irq_disabled())
> to slower cr3 (__native_flush_tlb()) in "-cpu kvm64" mode, and it looks like
> it also lets all test cases pass (rodata_test, test_setmem, test_bpf), no
> corruption happening, etc.

Ok. I think this is conclusive: the qemu "-cpu kvm64" case is
definitely broken, since changing CR4.PGE is definitely
architecturally defined to flush all TLB entries.

This is not a guest kernel bug.

Of course, the bug may still be in the *host* kernel. Maybe the
emulation does something wrong. I see

if (((cr4 ^ old_cr4) & pdptr_bits) ||
(!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
kvm_mmu_reset_context(vcpu);

(where pdptr_bits includes the PGE bit), but I'm not sure if emulation
is supposed to do something else too.

Linus

2017-03-09 17:51:27

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 03:49 PM, Thomas Gleixner wrote:
> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>> On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
>>> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>>>> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
>>>> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
>>>> was set each time we switched attrs and a cpa_flush_range() was
>>>> performed (with the correct number of pages and cache set to 0). That
>>>> would be a __flush_tlb_all() eventually.
>>>>
>>>> Hmm, it indeed might seem likely that this could be an emulation bug.
>>>
>>> Which variant of __flush_tlb_all() is used when the test fails?
>>>
>>> Check for the following flags in /proc/cpuinfo: pge invpcid
>>
>> I added the following and booted with both variants:
>>
>> printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
>> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
>>
>> "-cpu host" gives:
>>
>> [ 8.326117] X86_FEATURE_PGE:1
>> [ 8.326381] X86_FEATURE_INVPCID:1
>>
>> "-cpu kvm64" gives:
>>
>> [ 8.517069] X86_FEATURE_PGE:1
>> [ 8.517393] X86_FEATURE_INVPCID:0
>
> That's the one which fails. So it's using the CR4 based flushing. Just ran
> a test on a physical system with PGE=1 and INVPCID=0. Works fine.
>
> Emulation problem?

So in the git qemu code base (target/i386/helper.c), cr3 vs cr4 looks
like the following, both sharing the tlb_flush() itself:

void cpu_x86_update_cr3(CPUX86State *env, target_ulong new_cr3)
{
X86CPU *cpu = x86_env_get_cpu(env);

env->cr[3] = new_cr3;
if (env->cr[0] & CR0_PG_MASK) {
qemu_log_mask(CPU_LOG_MMU,
"CR3 update: CR3=" TARGET_FMT_lx "\n", new_cr3);
tlb_flush(CPU(cpu));
}
}

void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4)
{
X86CPU *cpu = x86_env_get_cpu(env);
uint32_t hflags;

#if defined(DEBUG_MMU)
printf("CR4 update: %08x -> %08x\n", (uint32_t)env->cr[4], new_cr4);
#endif
if ((new_cr4 ^ env->cr[4]) &
(CR4_PGE_MASK | CR4_PAE_MASK | CR4_PSE_MASK |
CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_LA57_MASK)) {
tlb_flush(CPU(cpu));
}

[...]
}

I added some debugging around __native_flush_tlb_global_irq_disabled()
and if I understand it correctly, the idea of cr4 is that we need to
toggle X86_CR4_PGE in order to trigger a TLB flush.

What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is
consistent to native_read_cr4() and since cr4 is != 0, it tells me
based on the comment in native_read_cr4() that cr4 seems to be
supported. Thus, meaning we end up with writing ...

native_write_cr4(0x610);
native_write_cr4(0x610);

... twice, and this just doesn't trigger the desired TLB flush. I
changed the code into the following ...

cr4 = this_cpu_read(cpu_tlbstate.cr4);
/* clear PGE */
- native_write_cr4(cr4 & ~X86_CR4_PGE);
+ native_write_cr4(cr4 ^ X86_CR4_PGE);
/* write old PGE again and flush TLBs */
native_write_cr4(cr4);

... and the test cases seem to be working for me now with "-cpu kvm64",
so that seems to trigger the TLB we were missing.

I don't know enough about x86 internals to tell whether the change is
sane, though, but it seems at least for qemu fwiw. ;) Thoughts?

Thanks,
Daniel

2017-03-09 18:08:40

by David Miller

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

From: Daniel Borkmann <[email protected]>
Date: Thu, 09 Mar 2017 18:51:03 +0100

> I added some debugging around __native_flush_tlb_global_irq_disabled()
> and if I understand it correctly, the idea of cr4 is that we need to
> toggle X86_CR4_PGE in order to trigger a TLB flush.
>
> What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is
> consistent to native_read_cr4() and since cr4 is != 0, it tells me
> based on the comment in native_read_cr4() that cr4 seems to be
> supported. Thus, meaning we end up with writing ...
>
> native_write_cr4(0x610);
> native_write_cr4(0x610);
>
> ... twice, and this just doesn't trigger the desired TLB flush. I
> changed the code into the following ...
>
> cr4 = this_cpu_read(cpu_tlbstate.cr4);
> /* clear PGE */
> - native_write_cr4(cr4 & ~X86_CR4_PGE);
> + native_write_cr4(cr4 ^ X86_CR4_PGE);
> /* write old PGE again and flush TLBs */
> native_write_cr4(cr4);
>
> ... and the test cases seem to be working for me now with "-cpu
> kvm64",
> so that seems to trigger the TLB we were missing.

Great detective work Daniel.

2017-03-09 18:10:39

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 9, 2017 at 9:51 AM, Daniel Borkmann <[email protected]> wrote:
>
> What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is
> consistent to native_read_cr4() and since cr4 is != 0, it tells me
> based on the comment in native_read_cr4() that cr4 seems to be
> supported. Thus, meaning we end up with writing ...
>
> native_write_cr4(0x610);
> native_write_cr4(0x610);
>
> ... twice, and this just doesn't trigger the desired TLB flush.

Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
supports it).

But yes, if PGE is clear then that certainly explains the bug, and
it's not an emulation issue.

> I changed the code into the following ...
>
> cr4 = this_cpu_read(cpu_tlbstate.cr4);
> /* clear PGE */
> - native_write_cr4(cr4 & ~X86_CR4_PGE);
> + native_write_cr4(cr4 ^ X86_CR4_PGE);
> /* write old PGE again and flush TLBs */
> native_write_cr4(cr4);

Yeah, good for debugging, but not a good patch in general. The only
valid reason for not having PGE enabled would be that the CPU doesn't
support PGE at all.

Linus

2017-03-09 18:15:26

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds
<[email protected]> wrote:
>
> Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
> supports it).

Daniel, do you see the code in probe_page_size_mask() triggering?

/* Enable PGE if available */
if (boot_cpu_has(X86_FEATURE_PGE)) {
cr4_set_bits_and_update_boot(X86_CR4_PGE);
__supported_pte_mask |= _PAGE_GLOBAL;
} else
__supported_pte_mask &= ~_PAGE_GLOBAL;

but maybe there's something wrong with the percpu cr4 caching?

Linus

2017-03-09 18:31:22

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 07:15 PM, Linus Torvalds wrote:
> On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
>> supports it).
>
> Daniel, do you see the code in probe_page_size_mask() triggering?
>
> /* Enable PGE if available */
> if (boot_cpu_has(X86_FEATURE_PGE)) {
> cr4_set_bits_and_update_boot(X86_CR4_PGE);
> __supported_pte_mask |= _PAGE_GLOBAL;

We do have boot_cpu_has(X86_FEATURE_PGE) and go indeed into this
branch here. So it seems something must be clearing it later, hmm.

> } else
> __supported_pte_mask &= ~_PAGE_GLOBAL;
>
> but maybe there's something wrong with the percpu cr4 caching?
>
> Linus
>

2017-03-09 21:32:45

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

[ + Borislav ]

On 03/09/2017 07:31 PM, Daniel Borkmann wrote:
> On 03/09/2017 07:15 PM, Linus Torvalds wrote:
>> On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
>>> supports it).
>>
>> Daniel, do you see the code in probe_page_size_mask() triggering?
>>
>> /* Enable PGE if available */
>> if (boot_cpu_has(X86_FEATURE_PGE)) {
>> cr4_set_bits_and_update_boot(X86_CR4_PGE);
>> __supported_pte_mask |= _PAGE_GLOBAL;
>
> We do have boot_cpu_has(X86_FEATURE_PGE) and go indeed into this
> branch here. So it seems something must be clearing it later, hmm.
>
>> } else
>> __supported_pte_mask &= ~_PAGE_GLOBAL;
>>
>> but maybe there's something wrong with the percpu cr4 caching?

So, I think I got a little bit further. To the printk's I added
previously in the test_setmem code that run on the problematic
kernel, I now extended them into:

printk("static_cpu X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
printk("boot_cpu X86_FEATURE_PGE:%u\n", boot_cpu_has(X86_FEATURE_PGE));
printk("static_cpu X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
printk("boot_cpu X86_FEATURE_INVPCID:%u\n", boot_cpu_has(X86_FEATURE_INVPCID));

And here's what I get in the log:

"-cpu kvm64" gives:

[ 8.426865] static_cpu X86_FEATURE_PGE:1
[ 8.427148] boot_cpu X86_FEATURE_PGE:0
[ 8.427428] static_cpu X86_FEATURE_INVPCID:0
[ 8.427732] boot_cpu X86_FEATURE_INVPCID:0

"-cpu host" gives:

[ 8.426408] static_cpu X86_FEATURE_PGE:1
[ 8.426726] boot_cpu X86_FEATURE_PGE:0
[ 8.427037] static_cpu X86_FEATURE_INVPCID:1
[ 8.427375] boot_cpu X86_FEATURE_INVPCID:1

This means at that point in time static_cpu_has(X86_FEATURE_PGE) is
not the same as boot_cpu_has(X86_FEATURE_PGE).

The code that switches this off is in lguest_arch_host_init(). Right
before that, both are X86_FEATURE_PGE:1, X86_FEATURE_INVPCID:0 for
the "-cpu kvm64" case.

Then, the lguest code does:

get_online_cpus();
if (boot_cpu_has(X86_FEATURE_PGE)) { /* We have a broader idea of "global". */
/* Remember that this was originally set (for cleanup). */
cpu_had_pge = 1;
/*
* adjust_pge is a helper function which sets or unsets the PGE
* bit on its CPU, depending on the argument (0 == unset).
*/
on_each_cpu(adjust_pge, (void *)0, 1);
/* Turn off the feature in the global feature set. */
clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PGE);
}
put_online_cpus();

So, adjust_pge() clears X86_CR4_PGE, and boot cpu has X86_FEATURE_PGE
unset. This means, with static_cpu_has(X86_FEATURE_PGE) still 1, we
run into using cr4 for TLB flushing with no X86_CR4_PGE bit set (which
doesn't trigger the flush), whereas we should be using cr3 for flushing
from that point onwards instead.

I tried with this one, and things seem to work again:

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6fa8594..fc5abff 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -188,7 +188,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)

static inline void __flush_tlb_all(void)
{
- if (static_cpu_has(X86_FEATURE_PGE))
+ if (boot_cpu_has(X86_FEATURE_PGE))
__flush_tlb_global();
else
__flush_tlb();

Presumably coming from c109bf95992b ("x86/cpufeature: Remove cpu_has_pge")?

Thanks,
Daniel

2017-03-09 22:07:56

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 09, 2017 at 10:55:47PM +0100, Borislav Petkov wrote:
> Can you make that:
>
> setup_clear_cpu_cap(X86_FEATURE_PGE);
>
> and see if it fixes your issue?

Hmm, in reading the thread a bit more, that might not work. If I see it
correctly, lguest does

clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PGE);

after the alternatives have run and static_cpu_has() sites have already
been patched so clearing that bit won't bring anything.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2017-03-09 22:14:07

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 11:07 PM, Borislav Petkov wrote:
> On Thu, Mar 09, 2017 at 10:55:47PM +0100, Borislav Petkov wrote:
>> Can you make that:
>>
>> setup_clear_cpu_cap(X86_FEATURE_PGE);
>>
>> and see if it fixes your issue?
>
> Hmm, in reading the thread a bit more, that might not work. If I see it
> correctly, lguest does
>
> clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PGE);
>
> after the alternatives have run and static_cpu_has() sites have already
> been patched so clearing that bit won't bring anything.

Yeah, I just tried that out and it had no effect unfortunately, the
static_cpu_has() was still 1.

2017-03-09 22:14:11

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 09, 2017 at 10:32:12PM +0100, Daniel Borkmann wrote:
> get_online_cpus();
> if (boot_cpu_has(X86_FEATURE_PGE)) { /* We have a broader idea of "global". */
> /* Remember that this was originally set (for cleanup). */
> cpu_had_pge = 1;
> /*
> * adjust_pge is a helper function which sets or unsets the PGE
> * bit on its CPU, depending on the argument (0 == unset).
> */
> on_each_cpu(adjust_pge, (void *)0, 1);
> /* Turn off the feature in the global feature set. */
> clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PGE);

Can you make that:

setup_clear_cpu_cap(X86_FEATURE_PGE);

and see if it fixes your issue?

Thanks.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2017-03-09 22:48:49

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 09, 2017 at 11:11:17PM +0100, Daniel Borkmann wrote:
> Yeah, I just tried that out and it had no effect unfortunately, the
> static_cpu_has() was still 1.

Right, just as I thought.

I guess we could return to doing boot_cpu_has() in __flush_tlb_all()
then. I mean, the timing-sensitivity argument is meh - killing global
TLB entries a bit faster doesn't bring me a whole lot when I have to go
and walk pagetable and reestablish them, which is the real price to pay
anyway.

Thanks.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2017-03-09 23:26:06

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 9, 2017 at 2:48 PM, Borislav Petkov <[email protected]> wrote:
>
> I guess we could return to doing boot_cpu_has() in __flush_tlb_all()
> then. I mean, the timing-sensitivity argument is meh - killing global
> TLB entries a bit faster doesn't bring me a whole lot when I have to go
> and walk pagetable and reestablish them, which is the real price to pay
> anyway.

So should all of commit ("c109bf95992b x86/cpufeature: Remove
cpu_has_pge") just be reverted (and then marked for stable)?

Or do we have some alternate plan?

This has apparently been going on for a long while (it got merged into
4.7), but presumably it only actually _matters_ if lguest is enabled
and used and we've triggered that lguest_arch_host_init() code.

Maybe it's the lguest games with PGE that need to be removed?

Linus

2017-03-09 23:44:36

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 09, 2017 at 03:26:02PM -0800, Linus Torvalds wrote:
> So should all of commit ("c109bf95992b x86/cpufeature: Remove
> cpu_has_pge") just be reverted (and then marked for stable)?
>
> Or do we have some alternate plan?

I think we want to do this:

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6fa85944af83..fc5abff9b7fd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -188,7 +188,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)

static inline void __flush_tlb_all(void)
{
- if (static_cpu_has(X86_FEATURE_PGE))
+ if (boot_cpu_has(X86_FEATURE_PGE))
__flush_tlb_global();
else
__flush_tlb();
---

but it is late here so I'd prefer to do a real patch tomorrow when I'm
not almost sleeping on the keyboard. Unless Daniel wants to write one
and test it now.

> This has apparently been going on for a long while (it got merged into
> 4.7), but presumably it only actually _matters_ if lguest is enabled
> and used and we've triggered that lguest_arch_host_init() code.

That's what I gather too, yes.

What sane code would go and clear X86_FEATURE_PGE?!? :-)))

> Maybe it's the lguest games with PGE that need to be removed?

Well, as far as I can read the comment in lguest_arch_host_init(), it
does some monkey business with switching to the guest kernel where
global pages are not present anymore... or something. So it sounds to me
like lguest would break if we removed the games but I have no idea what
it does with that.

And besides, the small hunk above restores the situation before
("c109bf95992b x86/cpufeature: Remove cpu_has_pge") so applying it would
actually be a no-brainer.

Thanks.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2017-03-10 00:15:14

[permalink] [raw]

Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/10/2017 12:44 AM, Borislav Petkov wrote:
> On Thu, Mar 09, 2017 at 03:26:02PM -0800, Linus Torvalds wrote:
>> So should all of commit ("c109bf95992b x86/cpufeature: Remove
>> cpu_has_pge") just be reverted (and then marked for stable)?
>>
>> Or do we have some alternate plan?
>
> I think we want to do this:
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 6fa85944af83..fc5abff9b7fd 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -188,7 +188,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>
> static inline void __flush_tlb_all(void)
> {
> - if (static_cpu_has(X86_FEATURE_PGE))
> + if (boot_cpu_has(X86_FEATURE_PGE))
> __flush_tlb_global();
> else
> __flush_tlb();
> ---
>
> but it is late here so I'd prefer to do a real patch tomorrow when I'm
> not almost sleeping on the keyboard. Unless Daniel wants to write one
> and test it now.

I think we're in the same time zone. ;) I could send something
official tomorrow cooking a changelog with analysis, but I don't
mind at all if you want to go ahead with that either. Feel free
to add my SoB or Tested-by to it.

>> This has apparently been going on for a long while (it got merged into
>> 4.7), but presumably it only actually _matters_ if lguest is enabled
>> and used and we've triggered that lguest_arch_host_init() code.
>
> That's what I gather too, yes.
>
> What sane code would go and clear X86_FEATURE_PGE?!? :-)))
>
>> Maybe it's the lguest games with PGE that need to be removed?
>
> Well, as far as I can read the comment in lguest_arch_host_init(), it
> does some monkey business with switching to the guest kernel where
> global pages are not present anymore... or something. So it sounds to me
> like lguest would break if we removed the games but I have no idea what
> it does with that.
>
> And besides, the small hunk above restores the situation before
> ("c109bf95992b x86/cpufeature: Remove cpu_has_pge") so applying it would
> actually be a no-brainer.

Agree, looks only that hunk changed in behavior from c109bf95992b
("x86/cpufeature: Remove cpu_has_pge").

> Thanks.

2017-03-12 21:40:55