2017-03-01 12:57:06

by Fengguang Wu

[permalink] [raw]
Subject: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

Hi all,

Is it BPF triggering BUGs all over the places?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

commit 3051bf36c25d5153051704291782f8d44e744d36
Merge: 1e74a2e 005c349
Author: Linus Torvalds <[email protected]>
AuthorDate: Wed Feb 22 10:15:09 2017 -0800
Commit: Linus Torvalds <[email protected]>
CommitDate: Wed Feb 22 10:15:09 2017 -0800

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

Pull networking updates from David Miller:
"Highlights:

1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini
Varadhan.

2) Simplify classifier state on sk_buff in order to shrink it a bit.
From Willem de Bruijn.

3) Introduce SIPHASH and it's usage for secure sequence numbers and
syncookies. From Jason A. Donenfeld.

4) Reduce CPU usage for ICMP replies we are going to limit or
suppress, from Jesper Dangaard Brouer.

5) Introduce Shared Memory Communications socket layer, from Ursula
Braun.

6) Add RACK loss detection and allow it to actually trigger fast
recovery instead of just assisting after other algorithms have
triggered it. From Yuchung Cheng.

7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot.

8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert.

9) Export MPLS packet stats via netlink, from Robert Shearman.

10) Significantly improve inet port bind conflict handling, especially
when an application is restarted and changes it's setting of
reuseport. From Josef Bacik.

11) Implement TX batching in vhost_net, from Jason Wang.

12) Extend the dummy device so that VF (virtual function) features,
such as configuration, can be more easily tested. From Phil
Sutter.

13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric
Dumazet.

14) Add new bpf MAP, implementing a longest prefix match trie. From
Daniel Mack.

15) Packet sample offloading support in mlxsw driver, from Yotam Gigi.

16) Add new aquantia driver, from David VomLehn.

17) Add bpf tracepoints, from Daniel Borkmann.

18) Add support for port mirroring to b53 and bcm_sf2 drivers, from
Florian Fainelli.

19) Remove custom busy polling in many drivers, it is done in the core
networking since 4.5 times. From Eric Dumazet.

20) Support XDP adjust_head in virtio_net, from John Fastabend.

21) Fix several major holes in neighbour entry confirmation, from
Julian Anastasov.

22) Add XDP support to bnxt_en driver, from Michael Chan.

23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan.

24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi.

25) Support GRO in IPSEC protocols, from Steffen Klassert"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits)
Revert "ath10k: Search SMBIOS for OEM board file extension"
net: socket: fix recvmmsg not returning error from sock_error
bnxt_en: use eth_hw_addr_random()
bpf: fix unlocking of jited image when module ronx not set
arch: add ARCH_HAS_SET_MEMORY config
net: napi_watchdog() can use napi_schedule_irqoff()
tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"
net/hsr: use eth_hw_addr_random()
net: mvpp2: enable building on 64-bit platforms
net: mvpp2: switch to build_skb() in the RX path
net: mvpp2: simplify MVPP2_PRS_RI_* definitions
net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT
net: mvpp2: remove unused register definitions
net: mvpp2: simplify mvpp2_bm_bufs_add()
net: mvpp2: drop useless fields in mvpp2_bm_pool and related code
net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue'
net: mvpp2: release reference to txq_cpu[] entry after unmapping
net: mvpp2: handle too large value in mvpp2_rx_time_coal_set()
net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set()
net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set
...

1e74a2eb1f Merge tag 'gcc-plugins-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
005c3490e9 Revert "ath10k: Search SMBIOS for OEM board file extension"
3051bf36c2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
+-------------------------------------------------------+------------+------------+------------+
| | 1e74a2eb1f | 005c3490e9 | 3051bf36c2 |
+-------------------------------------------------------+------------+------------+------------+
| boot_successes | 1223 | 1098 | 242 |
| boot_failures | 1 | 126 | 72 |
| BUG:unable_to_handle_kernel | 1 | 117 | 69 |
| Oops | 1 | 126 | 72 |
| EIP:perf_callchain_user | 1 | | |
| Kernel_panic-not_syncing:Fatal_exception | 1 | 121 | 67 |
| EIP:netlink_release | 0 | 20 | 3 |
| EIP:bpf_prog_free | 0 | 22 | 3 |
| EIP:filp_close | 0 | 64 | 23 |
| EIP:netlink_update_listeners | 0 | 10 | 9 |
| EIP:security_inode_getattr | 0 | 2 | |
| EIP:__lock_acquire | 0 | 1 | 11 |
| Kernel_panic-not_syncing:Fatal_exception_in_interrupt | 0 | 5 | 4 |
| EIP:__rcu_process_callbacks | 0 | 2 | |
| EIP:__fget_light | 0 | 1 | |
| EIP:__unix_remove_socket | 0 | 0 | 13 |
| INFO:trying_to_register_non-static_key | 0 | 0 | 2 |
| EIP:mnt_want_write_file | 0 | 0 | 1 |
| EIP:skb_dequeue | 0 | 0 | 1 |
| EIP:strlen | 0 | 0 | 1 |
| EIP:__netlink_lookup | 0 | 0 | 2 |
| EIP:vfs_fsync_range | 0 | 0 | 1 |
| EIP:__unix_find_socket_byname | 0 | 0 | 1 |
| EIP:release_sock | 0 | 0 | 1 |
+-------------------------------------------------------+------------+------------+------------+

DANGER: RUNNING AS ROOT.
Unless you are running in a virtual machine, this could cause serious problems such as overwriting CMOS
or similar which could potentially make this machine unbootable without a firmware reset.

ctrl-c now unless you really know what you are doing.
[ 32.906310] BUG: unable to handle kernel paging request at 0000a7cf
[ 32.907583] IP: __unix_remove_socket+0x10/0x50
[ 32.908465] *pde = 00000000
[ 32.908466]
[ 32.909350] Oops: 0002 [#1] DEBUG_PAGEALLOC
[ 32.910011] CPU: 0 PID: 436 Comm: trinity-main Not tainted 4.10.0-04456-g3051bf3 #1
[ 32.910011] task: 4f024000 task.stack: 4fb5c000
[ 32.910011] EIP: __unix_remove_socket+0x10/0x50
[ 32.910011] EFLAGS: 00010206 CPU: 0
[ 32.910011] EAX: 4f255400 EBX: 4f255400 ECX: 0bf2a7cc EDX: 0000a7cf
[ 32.910011] ESI: 4f25577c EDI: 4f255638 EBP: 4fb5de68 ESP: 4fb5de64
[ 32.910011] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 32.910011] CR0: 80050033 CR2: 0000a7cf CR3: 0f02a000 CR4: 00000610
[ 32.910011] Call Trace:
[ 32.910011] unix_release_sock+0x2b/0x330
[ 32.910011] ? umount_tree+0x2d0/0x2d0
[ 32.910011] unix_release+0x16/0x30
[ 32.910011] sock_release+0x13/0x70
[ 32.910011] sock_close+0xb/0x10
[ 32.910011] __fput+0xa3/0x1e0
[ 32.910011] ____fput+0x8/0x10
[ 32.910011] task_work_run+0x6f/0xa0
[ 32.910011] do_exit+0x219/0xac0
[ 32.910011] do_group_exit+0x29/0x90
[ 32.910011] SyS_exit_group+0x11/0x20
[ 32.910011] do_fast_syscall_32+0x99/0x220
[ 32.910011] entry_SYSENTER_32+0x4c/0x7b
[ 32.910011] EIP: 0x377c6cc5
[ 32.910011] EFLAGS: 00000216 CPU: 0
[ 32.910011] EAX: ffffffda EBX: 00000000 ECX: 0000002d EDX: 377bd8ac
[ 32.910011] ESI: 00000000 EDI: 00000001 EBP: 3fedfb88 ESP: 3fedfa9c
[ 32.910011] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 007b
[ 32.910011] Code: 5c 24 04 c7 04 24 e8 e1 00 4c e8 e6 49 41 ff e9 7e ff ff ff 8d b4 26 00 00 00 00 8b 50 38 85 d2 74 26 55 89 e5 53 8b 48 34 85 c9 <89> 0a 74 03 89 51 04 89 c3 c7 40 38 00 00 00 00 8b 40 44 83 f8
[ 32.910011] EIP: __unix_remove_socket+0x10/0x50 SS:ESP: 0068:4fb5de64
[ 32.910011] CR2: 000000000000a7cf
[ 32.910011] ---[ end trace 940cd213af55418c ]---
[ 32.910011] Kernel panic - not syncing: Fatal exception

git bisect start d689555ceaa8b31008bda1c5e7a718e7db730148 c470abd4fde40ea6a0846a2beab642a578c0b8cd --
git bisect bad e45cee90c1347f9effa3ba4daa4508716a6bae88 # 02:35 4- 5 Merge 'hp-parisc/for-next' into devel-spot-201702262250
git bisect bad 9540e6d53b057fa1c95ded9fb80c331e4d0faa0a # 02:55 2- 4 Merge 'sound/topic/dollar-cove-ti' into devel-spot-201702262250
git bisect bad 349ed729f948dc7791353cc369b2de59b8e7ed11 # 03:13 27- 9 Merge 'jss-tpmdd/tabrm-v3' into devel-spot-201702262250
git bisect bad 00262ffd474b61eb4a9980751fd8179ffd7447db # 03:47 62- 6 Merge 'linux-review/Julian-Anastasov/ipv4-add-missing-initialization-for-flowi4_uid/20170226-222651' into devel-spot-201702262250
git bisect good b71a4e6d93e06fdcd652d82997c463fac31a58d3 # 04:56 305+ 0 0day base guard for 'devel-spot-201702262250'
git bisect bad 9c4713701c01e4cef6e2315c2818abc919ffb0de # 05:14 5- 4 bpf: Fix bpf_xdp_event_output
git bisect good ff58d005cd10fcd372787cceac547e11cf706ff6 # 06:58 301+ 0 Merge tag 'media/v4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect good fcdc103dac5bdabddd626a4dfe7e5aa5673a61c4 # 07:57 302+ 0 Merge tag 'linux-can-next-for-4.11-20170206' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
git bisect good 3105dfb2a99f52c67cd40ec092373081f022ec28 # 09:19 304+ 0 Merge tag 'wireless-drivers-next-for-davem-2017-02-16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
git bisect bad ff47d8c05019d6e7753cef270d6399cb5a33be57 # 09:31 2- 3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect good 32cc1bb3cdac937d8199013512c79ead259e4b63 # 11:14 303+ 0 Merge branch 'mvpp2-next'
git bisect good 7bb033829ef3ecfc491c0ed0197966e8f197fbdc # 12:28 300+ 0 Merge tag 'rodata-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
git bisect good 72a01d0b6afb5862998d84c19ddc9e1c39a9588c # 13:28 303+ 0 s390/qdio: fix up tiqdio_thinint_handler() kerneldoc
git bisect good 549f2bf594782faa440055255831c9a51940a651 # 13:59 310+ 0 s390/mm: add cond_resched call to kernel page table dumper
git bisect good 005c3490e9db23738d91e02788606c0fe4734723 # 14:50 303+ 32 Revert "ath10k: Search SMBIOS for OEM board file extension"
git bisect good 1e74a2eb1f5cc7f2f2b5aa9c9eeecbcf352220a3 # 15:37 310+ 0 Merge tag 'gcc-plugins-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
git bisect good 50a0d46c98b72cde3c6945f066c0adf31e4e8590 # 16:34 301+ 1 s390/zcrypt: make ap_bus explicitly non-modular
git bisect good d24b98e3a9c66b16ed029e1b2bcdf3c90e9d82d9 # 17:24 304+ 0 s390/syscall: fix single stepped system calls
git bisect bad 3051bf36c25d5153051704291782f8d44e744d36 # 17:36 22- 8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
# first bad commit: [3051bf36c25d5153051704291782f8d44e744d36] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect good 1e74a2eb1f5cc7f2f2b5aa9c9eeecbcf352220a3 # 18:33 906+ 1 Merge tag 'gcc-plugins-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
git bisect good 005c3490e9db23738d91e02788606c0fe4734723 # 19:14 905+ 126 Revert "ath10k: Search SMBIOS for OEM board file extension"
# extra tests with CONFIG_DEBUG_INFO_REDUCED
git bisect bad 3051bf36c25d5153051704291782f8d44e744d36 # 19:22 11- 8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
# extra tests on HEAD of linux-devel/devel-spot-201702262250
git bisect bad d689555ceaa8b31008bda1c5e7a718e7db730148 # 19:22 0- 6 0day head guard for 'devel-spot-201702262250'
# extra tests on tree/branch linus/master
git bisect bad 2d6be4abf514fc26c83d239c7f31da1f95e4a31d # 19:36 20- 5 Merge tag 'for-linus-4.11' of git://git.code.sf.net/p/openipmi/linux-ipmi
# extra tests on tree/branch linux-next/master
git bisect bad 8d01c069486aca75b8f6018a759215b0ed0c91f0 # 19:47 13- 3 Add linux-next specific files for 20170228

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/lkp Intel Corporation


Attachments:
(No filename) (13.75 kB)
dmesg-quantal-ivb41-127:20170301175555:i386-randconfig-b0-02262304:4.10.0-04456-g3051bf3:1.gz (18.16 kB)
dmesg-quantal-ivb41-78:20170301182742:i386-randconfig-b0-02262304:4.10.0-02518-g1e74a2e:1.gz (20.59 kB)
dmesg-quantal-ivb41-100:20170301190328:i386-randconfig-b0-02262304:4.10.0-rc8-02021-g005c349:1.gz (18.37 kB)
reproduce-quantal-ivb41-127:20170301175555:i386-randconfig-b0-02262304:4.10.0-04456-g3051bf3:1 (886.00 B)
config-4.10.0-04456-g3051bf3 (94.86 kB)
Download all attachments

2017-03-02 20:26:52

by Fengguang Wu

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, Mar 01, 2017 at 08:54:26PM +0800, Fengguang Wu wrote:
>Hi all,
>
>Is it BPF triggering BUGs all over the places?

It looks so, and here is a fix.

>1e74a2eb1f Merge tag 'gcc-plugins-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
>005c3490e9 Revert "ath10k: Search SMBIOS for OEM board file extension"
>3051bf36c2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
>+-------------------------------------------------------+------------+------------+------------+
>| | 1e74a2eb1f | 005c3490e9 | 3051bf36c2 |
>+-------------------------------------------------------+------------+------------+------------+
>| boot_successes | 1223 | 1098 | 242 |
>| boot_failures | 1 | 126 | 72 |
>| BUG:unable_to_handle_kernel | 1 | 117 | 69 |
>| Oops | 1 | 126 | 72 |
>| EIP:perf_callchain_user | 1 | | |
>| Kernel_panic-not_syncing:Fatal_exception | 1 | 121 | 67 |
>| EIP:netlink_release | 0 | 20 | 3 |
>| EIP:bpf_prog_free | 0 | 22 | 3 |
>| EIP:filp_close | 0 | 64 | 23 |
>| EIP:netlink_update_listeners | 0 | 10 | 9 |
>| EIP:security_inode_getattr | 0 | 2 | |
>| EIP:__lock_acquire | 0 | 1 | 11 |
>| Kernel_panic-not_syncing:Fatal_exception_in_interrupt | 0 | 5 | 4 |
>| EIP:__rcu_process_callbacks | 0 | 2 | |
>| EIP:__fget_light | 0 | 1 | |
>| EIP:__unix_remove_socket | 0 | 0 | 13 |
>| INFO:trying_to_register_non-static_key | 0 | 0 | 2 |
>| EIP:mnt_want_write_file | 0 | 0 | 1 |
>| EIP:skb_dequeue | 0 | 0 | 1 |
>| EIP:strlen | 0 | 0 | 1 |
>| EIP:__netlink_lookup | 0 | 0 | 2 |
>| EIP:vfs_fsync_range | 0 | 0 | 1 |
>| EIP:__unix_find_socket_byname | 0 | 0 | 1 |
>| EIP:release_sock | 0 | 0 | 1 |
>+-------------------------------------------------------+------------+------------+------------+

I confirm that the below patch provided by Daniel fixes the above
issues on mainline kernel, too. Where should this patch be sent to?
It'd be very noisy if all these Oops hit the upcoming RC1 kernel.

Daniel thinks there may be deeper problem in i386 set_memory_rw().
However that could take much longer time to debug.

Thanks,
Fengguang
---

Re: [bpf] 9d876e79df: BUG: unable to handle kernel paging request at 653a8346

> On Tue, Feb 28, 2017 at 04:39:36PM +0100, Daniel Borkmann wrote:

I have a rough feeling what it is, but I didn't have cycles to work on
it yet (due to travel, sorry about that). The issue is likely shut down
by just doing:

---
arch/x86/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- linux.orig/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
+++ linux/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
@@ -54,7 +54,7 @@ config X86
select ARCH_HAS_KCOV if X86_64
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_PMEM_API if X86_64
- select ARCH_HAS_SET_MEMORY
+ select ARCH_HAS_SET_MEMORY if X86_64
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX

2017-03-02 21:31:51

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/02/2017 09:23 PM, Fengguang Wu wrote:
[...]
> I confirm that the below patch provided by Daniel fixes the above
> issues on mainline kernel, too. Where should this patch be sent to?

If nobody objects, I could send it to -net tree via Dave due to being
BPF related, but I don't mind sending it elsewhere too (f.e. Linus
directly?) in order to stop your bot from continuing to send such mails.

The issue seems only related to i386 and doesn't trigger each time with
Fengguang's kernel config and qemu image when I try to reproduce it.
set_memory_ro()/set_memory_rw() on i386 seems to work in general, but
when it's used/reproduced, from time to time (perhaps some corner-case?)
it looks like that memory area can have issues much later on after being
fed back to the allocator which then causes a GPF from random locations.
Gut feeling, it might be an issue in set_memory_*() that my commit
uncovered. Still looking into it, but mean-time I could just send the
below, sure.

Thanks,
Daniel

> It'd be very noisy if all these Oops hit the upcoming RC1 kernel.
>
> Daniel thinks there may be deeper problem in i386 set_memory_rw().
> However that could take much longer time to debug.
>
> Thanks,
> Fengguang
> ---
>
> Re: [bpf] 9d876e79df: BUG: unable to handle kernel paging request at 653a8346
>
>> On Tue, Feb 28, 2017 at 04:39:36PM +0100, Daniel Borkmann wrote:
>
> I have a rough feeling what it is, but I didn't have cycles to work on
> it yet (due to travel, sorry about that). The issue is likely shut down
> by just doing:
>
> ---
> arch/x86/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- linux.orig/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
> +++ linux/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
> @@ -54,7 +54,7 @@ config X86
> select ARCH_HAS_KCOV if X86_64
> select ARCH_HAS_MMIO_FLUSH
> select ARCH_HAS_PMEM_API if X86_64
> - select ARCH_HAS_SET_MEMORY
> + select ARCH_HAS_SET_MEMORY if X86_64
> select ARCH_HAS_SG_CHAIN
> select ARCH_HAS_STRICT_KERNEL_RWX
> select ARCH_HAS_STRICT_MODULE_RWX

2017-03-08 19:26:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

Adding x86 people too, since this seems to be something off about
ARCH_HAS_SET_MEMORY for x86-32.

The code seems to be shared between x86-32 and 64, I'm not seeing why
set_memory_r[ow]() should fail on one but not the other.

Considering that it seems to be flaky even on 32-bit, maybe it's
timing-related, or possibly related to TLB sizes or whatever (ie more
likely hidden by a larger TLB on more modern hardware?)

Anyway, just looking at change_page_attr_set_clr(), I notice that the
page alias checking treats NX specially:

/* No alias checking for _NX bit modifications */
checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;

which seems insane. Why would NX be different from other protection
bits (like _PAGE_RW)?

But that doesn't explain why the bpf code would have issues with this
all only on x86-32.

Maybe somebody else can see why ARCH_HAS_SET_MEMORY would depend on
64-bit only..

Linus

On Thu, Mar 2, 2017 at 12:40 PM, Daniel Borkmann <[email protected]> wrote:
> On 03/02/2017 09:23 PM, Fengguang Wu wrote:
> [...]
>>
>> I confirm that the below patch provided by Daniel fixes the above
>> issues on mainline kernel, too. Where should this patch be sent to?
>
>
> If nobody objects, I could send it to -net tree via Dave due to being
> BPF related, but I don't mind sending it elsewhere too (f.e. Linus
> directly?) in order to stop your bot from continuing to send such mails.
>
> The issue seems only related to i386 and doesn't trigger each time with
> Fengguang's kernel config and qemu image when I try to reproduce it.
> set_memory_ro()/set_memory_rw() on i386 seems to work in general, but
> when it's used/reproduced, from time to time (perhaps some corner-case?)
> it looks like that memory area can have issues much later on after being
> fed back to the allocator which then causes a GPF from random locations.
> Gut feeling, it might be an issue in set_memory_*() that my commit
> uncovered. Still looking into it, but mean-time I could just send the
> below, sure.
>
> Thanks,
> Daniel
>
>
>> It'd be very noisy if all these Oops hit the upcoming RC1 kernel.
>>
>> Daniel thinks there may be deeper problem in i386 set_memory_rw().
>> However that could take much longer time to debug.
>>
>> Thanks,
>> Fengguang
>> ---
>>
>> Re: [bpf] 9d876e79df: BUG: unable to handle kernel paging request at
>> 653a8346
>>
>>> On Tue, Feb 28, 2017 at 04:39:36PM +0100, Daniel Borkmann wrote:
>>
>>
>> I have a rough feeling what it is, but I didn't have cycles to work on
>> it yet (due to travel, sorry about that). The issue is likely shut down
>> by just doing:
>>
>> ---
>> arch/x86/Kconfig | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> --- linux.orig/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>> +++ linux/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>> @@ -54,7 +54,7 @@ config X86
>> select ARCH_HAS_KCOV if X86_64
>> select ARCH_HAS_MMIO_FLUSH
>> select ARCH_HAS_PMEM_API if X86_64
>> - select ARCH_HAS_SET_MEMORY
>> + select ARCH_HAS_SET_MEMORY if X86_64
>> select ARCH_HAS_SG_CHAIN
>> select ARCH_HAS_STRICT_KERNEL_RWX
>> select ARCH_HAS_STRICT_MODULE_RWX
>
>

2017-03-08 22:27:52

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

[ + Kees, Laura, and Dave ]

On 03/08/2017 08:25 PM, Linus Torvalds wrote:
> Adding x86 people too, since this seems to be something off about
> ARCH_HAS_SET_MEMORY for x86-32.
>
> The code seems to be shared between x86-32 and 64, I'm not seeing why
> set_memory_r[ow]() should fail on one but not the other.
>
> Considering that it seems to be flaky even on 32-bit, maybe it's
> timing-related, or possibly related to TLB sizes or whatever (ie more
> likely hidden by a larger TLB on more modern hardware?)
>
> Anyway, just looking at change_page_attr_set_clr(), I notice that the
> page alias checking treats NX specially:
>
> /* No alias checking for _NX bit modifications */
> checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
>
> which seems insane. Why would NX be different from other protection
> bits (like _PAGE_RW)?
>
> But that doesn't explain why the bpf code would have issues with this
> all only on x86-32.
>
> Maybe somebody else can see why ARCH_HAS_SET_MEMORY would depend on
> 64-bit only..

Last days I tried to narrow this down a bit, I have a dirty hack I
used just for the sake of experimenting when booting into Fengguang's
x86-32 test kernel via qemu:

lib/Makefile | 3 ++-
lib/test_setmem.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 1 deletion(-)
create mode 100644 lib/test_setmem.c

diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a..702af98 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,8 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
bsearch.o find_bit.o llist.o memweight.o kfifo.o \
percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
- once.o refcount.o
+ once.o refcount.o test_setmem.o
+
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
obj-y += hexdump.o
diff --git a/lib/test_setmem.c b/lib/test_setmem.c
new file mode 100644
index 0000000..82b08db
--- /dev/null
+++ b/lib/test_setmem.c
@@ -0,0 +1,50 @@
+#include <linux/init.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+
+#include <asm/cacheflush.h>
+
+static int __init test_setmem(void)
+{
+ const gfp_t gfp_flags = GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO;
+ int ret, i, good = 0, max_runs = 1000, one = 1;
+ u32 pages, size;
+ u8 *buff;
+
+ for (i = 0; i < max_runs; i++) {
+ size = round_up(4096*8, PAGE_SIZE);
+ pages = size / PAGE_SIZE;
+
+ buff = __vmalloc(size, gfp_flags, PAGE_KERNEL);
+ if (buff == NULL) {
+ printk("%s: allocation failed!\n", __func__);
+ return 0;
+ }
+#if 1
+ buff[0] = 41;
+ buff[10] = 42;
+ buff[100] = 43;
+ buff[1000] = 44;
+ // can also be just read access, f.e. if (buff[0] + ...)
+#endif
+ WARN_ON_ONCE(set_memory_ro((unsigned long)buff, pages));
+// msleep(0);
+
+ ret = probe_kernel_write(&buff[1], &one, sizeof(one));
+ if (!ret)
+ printk("%s: write succeeded despite read-only!\n", __func__);
+ else
+ good++;
+
+ WARN_ON_ONCE(set_memory_rw((unsigned long)buff, pages));
+ vfree(buff);
+ }
+
+ printk("%s: write failed due to read-only: %d/%d\n", __func__, good, max_runs);
+ return 0;
+}
+
+late_initcall(test_setmem);
--
2.9.3

The issue seems to be accessing buff first (can be read or write access)
and then doing set_memory_ro() doesn't make it read-only immediately,
meaning the subsequent call into probe_kernel_write() will succeed without
error.

Then, if I don't touch buff first and only do the set_memory_ro() seems
to work and probe_kernel_write() will then fail as expected due to pages
being read-only now.

Now, if I access buff, do the set_memory_ro() and then a msleep(0), for
example, it "kind of" works most of the time (see last log extract below),
and probe_kernel_write() will fail.

None of this seems an issue with x86_64 and the test_setmem runs fine all
the time, same for the actual BPF stuff.

I've also added rodata_test from the log (perhaps related).

x86_64 (access data before set_memory_ro()):

[...]
[ 1.357834] Key type big_key registered
[ 1.418542] test_setmem: write failed due to read-only: 1000/1000
[ 1.419545] Magic number: 9:497:653
[ 1.419584] acpi LNXCPU:06: hash matches
[ 1.419750] rtc_cmos 00:02: setting system clock to 2017-03-08 22:37:44 UTC (1489012664)
[ 1.419777] PM: Hibernation image not present or could not be loaded.
[ 1.420643] Freeing unused kernel memory: 1720K
[ 1.420643] Write protecting the kernel read-only data: 14336k
[ 1.420997] Freeing unused kernel memory: 1644K
[ 1.421844] Freeing unused kernel memory: 652K
[ 1.423054] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[ 1.423057] rodata_test: all tests were successful
[...]

x86-32 (access data before set_memory_ro()):

[...]
[ 8.551454] Key type trusted registered
[ 8.551792] Key type encrypted registered
[ 8.552274] test_setmem: write succeeded despite read-only!
[ 8.552740] test_setmem: write succeeded despite read-only!
[ 8.553205] test_setmem: write succeeded despite read-only!
[ 8.553686] test_setmem: write succeeded despite read-only!
[ 8.554179] test_setmem: write succeeded despite read-only!
[ 8.554666] test_setmem: write succeeded despite read-only!
[ 8.555157] test_setmem: write succeeded despite read-only!
[...]
[ 9.042271] test_setmem: write succeeded despite read-only!
[ 9.042741] test_setmem: write failed due to read-only: 0/1000
[ 9.043343] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 9.044034] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 9.044597] EDD information not available.
[ 9.045233] Freeing unused kernel memory: 612K
[ 9.045732] Write protecting the kernel text: 12604k
[ 9.046234] Write protecting the kernel read-only data: 5124k
[ 9.046781] rodata_test: test data was not read only
[...]

x86-32 (no access data before set_memory_ro()):

[...]
[ 8.537275] Key type encrypted registered
[ 8.541771] test_setmem: write failed due to read-only: 1000/1000
[ 8.542481] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 8.543265] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 8.543824] EDD information not available.
[ 8.544467] Freeing unused kernel memory: 612K
[ 8.544909] Write protecting the kernel text: 12604k
[ 8.545399] Write protecting the kernel read-only data: 5128k
[ 8.545903] rodata_test: test data was not read only
[...]

x86-32 (access data + msleep(0) before set_memory_ro()):

* run 1:

[...]
[ 8.478233] Key type encrypted registered
[ 20.334254] test_setmem: write succeeded despite read-only!
[ 28.474980] test_setmem: write failed due to read-only: 999/1000
[ 28.478187] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 28.480476] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 28.481342] EDD information not available.
[ 28.482192] Freeing unused kernel memory: 612K
[ 28.482841] Write protecting the kernel text: 12604k
[ 28.483568] Write protecting the kernel read-only data: 5128k
[ 28.484364] rodata_test: test data was not read only
[...]

* run 2:

[...]
[ 8.472088] Key type encrypted registered
[ 28.466359] test_setmem: write failed due to read-only: 1000/1000
[ 28.469162] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 28.470838] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 28.471590] EDD information not available.
[ 28.472346] Freeing unused kernel memory: 612K
[ 28.472910] Write protecting the kernel text: 12604k
[ 28.473542] Write protecting the kernel read-only data: 5128k
[ 28.474232] rodata_test: test data was not read only
[...]

I've attached the .config and Fengguang's test script. Qemu version is
2.7.1 from Fedora. Can be run as:

./reproduce-quantal-ivb41-127-20170301175555-i386-randconfig-b0-02262304-4.10.0-04456-g3051bf3-1 arch/x86/boot/bzImage

Thanks,
Daniel

> Linus
>
> On Thu, Mar 2, 2017 at 12:40 PM, Daniel Borkmann <[email protected]> wrote:
>> On 03/02/2017 09:23 PM, Fengguang Wu wrote:
>> [...]
>>>
>>> I confirm that the below patch provided by Daniel fixes the above
>>> issues on mainline kernel, too. Where should this patch be sent to?
>>
>>
>> If nobody objects, I could send it to -net tree via Dave due to being
>> BPF related, but I don't mind sending it elsewhere too (f.e. Linus
>> directly?) in order to stop your bot from continuing to send such mails.
>>
>> The issue seems only related to i386 and doesn't trigger each time with
>> Fengguang's kernel config and qemu image when I try to reproduce it.
>> set_memory_ro()/set_memory_rw() on i386 seems to work in general, but
>> when it's used/reproduced, from time to time (perhaps some corner-case?)
>> it looks like that memory area can have issues much later on after being
>> fed back to the allocator which then causes a GPF from random locations.
>> Gut feeling, it might be an issue in set_memory_*() that my commit
>> uncovered. Still looking into it, but mean-time I could just send the
>> below, sure.
>>
>> Thanks,
>> Daniel
>>
>>
>>> It'd be very noisy if all these Oops hit the upcoming RC1 kernel.
>>>
>>> Daniel thinks there may be deeper problem in i386 set_memory_rw().
>>> However that could take much longer time to debug.
>>>
>>> Thanks,
>>> Fengguang
>>> ---
>>>
>>> Re: [bpf] 9d876e79df: BUG: unable to handle kernel paging request at
>>> 653a8346
>>>
>>>> On Tue, Feb 28, 2017 at 04:39:36PM +0100, Daniel Borkmann wrote:
>>>
>>>
>>> I have a rough feeling what it is, but I didn't have cycles to work on
>>> it yet (due to travel, sorry about that). The issue is likely shut down
>>> by just doing:
>>>
>>> ---
>>> arch/x86/Kconfig | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> --- linux.orig/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>>> +++ linux/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>>> @@ -54,7 +54,7 @@ config X86
>>> select ARCH_HAS_KCOV if X86_64
>>> select ARCH_HAS_MMIO_FLUSH
>>> select ARCH_HAS_PMEM_API if X86_64
>>> - select ARCH_HAS_SET_MEMORY
>>> + select ARCH_HAS_SET_MEMORY if X86_64
>>> select ARCH_HAS_SG_CHAIN
>>> select ARCH_HAS_STRICT_KERNEL_RWX
>>> select ARCH_HAS_STRICT_MODULE_RWX
>>
>>


Attachments:
config.bak (95.29 kB)
reproduce-quantal-ivb41-127-20170301175555-i386-randconfig-b0-02262304-4.10.0-04456-g3051bf3-1 (888.00 B)
Download all attachments

2017-03-08 22:36:49

by Kees Cook

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
> [ 28.474232] rodata_test: test data was not read only
> [...]

In my tests so far, I've never been able to get rodata_test to fail
(Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
recheck under Qemu 2.7.1. Do you see these failures on real hardware?

-Kees

--
Kees Cook
Pixel Security

2017-03-08 22:45:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>
> The issue seems to be accessing buff first (can be read or write access)
> and then doing set_memory_ro() doesn't make it read-only immediately,
> meaning the subsequent call into probe_kernel_write() will succeed without
> error.
>
> Then, if I don't touch buff first and only do the set_memory_ro() seems
> to work and probe_kernel_write() will then fail as expected due to pages
> being read-only now.

Ok, that definitely sounds like a TLB invalidate didn't happen.

> Now, if I access buff, do the set_memory_ro() and then a msleep(0), for
> example, it "kind of" works most of the time (see last log extract below),
> and probe_kernel_write() will fail.

Yeah, very much consistent with a missing TLB invalidate. Scheduling
will end up invalidating it, although if it's a global page even that
might not do it (but eventually the entry will just get flushed due to
other activity).

> None of this seems an issue with x86_64 and the test_setmem runs fine all
> the time, same for the actual BPF stuff.

The code does look somewhat confused about when to actually flush
things - see my earlier note about NX - but it would seem to always do
__flush_tlb_all() unless I missed something. At least as long as
CPA_FLUSHTLB is set. Maybe some case forgets to set that..

Linus

2017-03-08 23:19:03

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/08/2017 11:36 PM, Kees Cook wrote:
> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>> [ 28.474232] rodata_test: test data was not read only
>> [...]
>
> In my tests so far, I've never been able to get rodata_test to fail
> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
> recheck under Qemu 2.7.1. Do you see these failures on real hardware?

The x86_64 tests on real hardware, and the x86-32 only in the qemu
environment with the reproducer script. Haven't tested x86-32 kernel
outside of qemu so far.

Thanks,
Daniel

2017-03-09 00:23:58

by Laura Abbott

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/08/2017 02:36 PM, Kees Cook wrote:
> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>> [ 28.474232] rodata_test: test data was not read only
>> [...]
>
> In my tests so far, I've never been able to get rodata_test to fail
> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
> recheck under Qemu 2.7.1. Do you see these failures on real hardware?
>
> -Kees
>

FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24
and rawhide respectively.

I also notice that CONFIG_X86_PAE is turned off in the defconfig. If
I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem
goes away. I can't tell if this is an indication of magically hiding
the TLB problem or if there is an issue with !X86_PAE invalidation.


Thanks,
Laura

2017-03-09 01:34:59

by Fengguang Wu

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, Mar 08, 2017 at 02:43:44PM -0800, Linus Torvalds wrote:
>On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>>
>> The issue seems to be accessing buff first (can be read or write access)
>> and then doing set_memory_ro() doesn't make it read-only immediately,
>> meaning the subsequent call into probe_kernel_write() will succeed without
>> error.
>>
>> Then, if I don't touch buff first and only do the set_memory_ro() seems
>> to work and probe_kernel_write() will then fail as expected due to pages
>> being read-only now.
>
>Ok, that definitely sounds like a TLB invalidate didn't happen.
>
>> Now, if I access buff, do the set_memory_ro() and then a msleep(0), for
>> example, it "kind of" works most of the time (see last log extract below),
>> and probe_kernel_write() will fail.
>
>Yeah, very much consistent with a missing TLB invalidate. Scheduling
>will end up invalidating it, although if it's a global page even that
>might not do it (but eventually the entry will just get flushed due to
>other activity).
>
>> None of this seems an issue with x86_64 and the test_setmem runs fine all
>> the time, same for the actual BPF stuff.
>
>The code does look somewhat confused about when to actually flush
>things - see my earlier note about NX - but it would seem to always do
>__flush_tlb_all() unless I missed something. At least as long as
>CPA_FLUSHTLB is set. Maybe some case forgets to set that..

Not sure if it's relevant, but out of 189 boots there are 2 boots
showing the below "CPA: called for zero pte." warning.

[ 7.116932] random: trinity: uninitialized urandom read (4 bytes read)
[ 16.366468] sock: process `trinity-main' is using obsolete setsockopt SO_BSDCOMPAT
[ 17.202396] BUG: unable to handle kernel paging request at 655d9eb2
[ 17.204081] IP: __release_sock+0x6e/0x100
[ 17.205207] *pde = 00000000
[ 17.205208]
[ 17.206755] Oops: 0000 [#1]
[ 17.207686] CPU: 0 PID: 382 Comm: trinity-main Not tainted 4.10.0-rc8-02017-g9d876e7 #1
[ 17.209819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 17.212431] task: d625d200 task.stack: d6222000
[ 17.213655] EIP: __release_sock+0x6e/0x100
[ 17.214833] EFLAGS: 00010246 CPU: 0
[ 17.215951] EAX: 00000000 EBX: 655d9eb2 ECX: 00000000 EDX: 00000201
[ 17.217587] ESI: 00000605 EDI: d6064800 EBP: d6223ef4 ESP: d6223ee8
[ 17.219185] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[ 17.220602] CR0: 80050033 CR2: 655d9eb2 CR3: 1610f000 CR4: 00000610
[ 17.221966] DR0: 080cb000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 17.223444] DR6: ffff0ff0 DR7: 00000600
[ 17.224343] Call Trace:
[ 17.225007] release_sock+0x2e/0x80
[ 17.225900] sock_setsockopt+0x8c/0x880
[ 17.226857] SyS_socketcall+0x658/0x6a0
[ 17.227804] do_fast_syscall_32+0x9a/0x160
[ 17.228765] entry_SYSENTER_32+0x4c/0x7b
[ 17.229694] EIP: 0xb7777cc5
[ 17.230428] EFLAGS: 00000282 CPU: 0
[ 17.231263] EAX: ffffffda EBX: 0000000e ECX: bfedce00 EDX: bfedce80
[ 17.232582] ESI: 0000001a EDI: 000000ae EBP: b754f93c ESP: bfedcdec
[ 17.233882] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 17.235044] Code: eb 29 8d 76 00 89 da 89 f8 ff 97 98 01 00 00 31 c9 ba 06 08 00 00 b8 d8 19 b1 c1 e8 ed 3d 85 ff e8 e8 62 04 00 85 f6 89 f3 74 42 <8b>
33 0f 18 06 8b 43 48 a8 01 74 0e 83 e0 fe 74 09 80 3d 3d 9c
[ 17.240429] EIP: __release_sock+0x6e/0x100 SS:ESP: 0068:d6223ee8
[ 17.241689] CR2: 00000000655d9eb2
[ 17.242509] ---[ end trace dc10480164c75444 ]---
[ 17.243569] ------------[ cut here ]------------
[ 17.243574] WARNING: CPU: 0 PID: 15 at arch/x86/mm/pageattr.c:1150 __cpa_process_fault+0x388/0x390
[ 17.243575] CPA: called for zero pte. vaddr = d7ab4000 cpa->vaddr = d7ab4000
[ 17.243577] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G D 4.10.0-rc8-02017-g9d876e7 #1
[ 17.243578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 17.243582] Workqueue: events bpf_prog_free_deferred
[ 17.243583] Call Trace:
[ 17.243588] dump_stack+0x16/0x25
[ 17.243588] dump_stack+0x16/0x25
[ 17.243590] __warn+0xd1/0xf0
[ 17.243592] ? __cpa_process_fault+0x388/0x390
[ 17.243593] warn_slowpath_fmt+0x3b/0x40
[ 17.243594] __cpa_process_fault+0x388/0x390
[ 17.243596] ? lookup_address_in_pgd+0xa/0x90
[ 17.243598] __change_page_attr+0x520/0x6c0
[ 17.243600] ? pfn_range_is_mapped+0xe/0x80
[ 17.243601] __change_page_attr_set_clr+0x38/0x180
[ 17.243603] change_page_attr_set_clr+0x107/0x3f0
[ 17.243605] ? dequeue_entity+0x86/0x230
[ 17.243607] set_memory_rw+0x3a/0x40
[ 17.243608] bpf_prog_free_deferred+0x16/0x30
[ 17.243612] process_one_work+0xfc/0x440
[ 17.243614] ? pick_next_task_fair+0x149/0x1d0
[ 17.243615] worker_thread+0x37/0x4e0
[ 17.243617] kthread+0xdd/0x110
[ 17.243618] ? process_one_work+0x440/0x440
[ 17.243620] ? __kthread_create_on_node+0x100/0x100
[ 17.243622] ret_from_fork+0x21/0x2c
[ 17.243623] ---[ end trace dc10480164c75445 ]---
[ 17.243627] BUG: unable to handle kernel NULL pointer dereference at 00000007
[ 17.243630] IP: ___cache_free+0x14/0x140
[ 17.243631] *pde = 00000000
[ 17.243631]
[ 17.243633] Oops: 0000 [#2]
[ 17.243635] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G D W 4.10.0-rc8-02017-g9d876e7 #1
[ 17.243635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 17.243636] Workqueue: events bpf_prog_free_deferred
[ 17.243637] task: d524e8c0 task.stack: d5254000
[ 17.243639] EIP: ___cache_free+0x14/0x140
[ 17.243640] EFLAGS: 00010046 CPU: 0
[ 17.243641] EAX: d6945af8 EBX: 00000003 ECX: c10dc08e EDX: 3beb072d
[ 17.243642] ESI: d6945af8 EDI: 3beb072d EBP: d5255f00 ESP: d5255ed4
[ 17.243643] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[ 17.243644] CR0: 80050033 CR2: 00000007 CR3: 1610f000 CR4: 00000610
[ 17.243647] DR0: 080cb000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 17.243648] DR6: ffff0ff0 DR7: 00000600
[ 17.243648] Call Trace:
[ 17.243650] kfree+0x64/0xe0
[ 17.243652] ? bpf_prog_free_deferred+0x1e/0x30
[ 17.243653] bpf_prog_free_deferred+0x1e/0x30
[ 17.243654] process_one_work+0xfc/0x440
[ 17.243656] ? pick_next_task_fair+0x149/0x1d0
[ 17.243658] worker_thread+0x37/0x4e0
[ 17.243659] kthread+0xdd/0x110
[ 17.243661] ? process_one_work+0x440/0x440
[ 17.243662] ? __kthread_create_on_node+0x100/0x100
[ 17.243664] ret_from_fork+0x21/0x2c
[ 17.243664] Code: 89 da 89 f0 ff 0d 64 3e b4 c1 e8 f8 fe ff ff 83 c4 14 5b 5e 5d c3 90 55 89 e5 57 56 53 83 ec 20 e8 d2 21 72 00 8b 18 89 c6 89 d7 <8b> 43 04 39 03 73 65 a1 a0 fa 4b c2 85 c0 7f 1c 8b 03 8d 50 01
[ 17.243684] EIP: ___cache_free+0x14/0x140 SS:ESP: 0068:d5255ed4
[ 17.243684] CR2: 0000000000000007
[ 17.243685] ---[ end trace dc10480164c75446 ]---
[ 17.243686] Kernel panic - not syncing: Fatal exception
[ 17.243687] Kernel Offset: disabled

Regards,
Fengguang

2017-03-09 05:36:25

by Kees Cook

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, Mar 8, 2017 at 3:55 PM, Laura Abbott <[email protected]> wrote:
> On 03/08/2017 02:36 PM, Kees Cook wrote:
>> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>>> [ 28.474232] rodata_test: test data was not read only
>>> [...]
>>
>> In my tests so far, I've never been able to get rodata_test to fail
>> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
>> recheck under Qemu 2.7.1. Do you see these failures on real hardware?
>>
>> -Kees
>>
>
> FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24
> and rawhide respectively.
>
> I also notice that CONFIG_X86_PAE is turned off in the defconfig. If
> I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem
> goes away. I can't tell if this is an indication of magically hiding
> the TLB problem or if there is an issue with !X86_PAE invalidation.

I found my difference. I normally run qemu with "-cpu host" which
makes the failure go away. With "-cpu kvm64", I see the rodata_test
failure immediately. Seems like this may be a kvm cpu feature
emulation bug? I'll see if I can find the specific cpu feature in the
morning...

-Kees

--
Kees Cook
Pixel Security

2017-03-09 13:05:13

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 06:36 AM, Kees Cook wrote:
> On Wed, Mar 8, 2017 at 3:55 PM, Laura Abbott <[email protected]> wrote:
>> On 03/08/2017 02:36 PM, Kees Cook wrote:
>>> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <[email protected]> wrote:
>>>> [ 28.474232] rodata_test: test data was not read only
>>>> [...]
>>>
>>> In my tests so far, I've never been able to get rodata_test to fail
>>> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
>>> recheck under Qemu 2.7.1. Do you see these failures on real hardware?
>>>
>>> -Kees
>>
>> FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24
>> and rawhide respectively.
>>
>> I also notice that CONFIG_X86_PAE is turned off in the defconfig. If
>> I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem
>> goes away. I can't tell if this is an indication of magically hiding
>> the TLB problem or if there is an issue with !X86_PAE invalidation.
>
> I found my difference. I normally run qemu with "-cpu host" which
> makes the failure go away. With "-cpu kvm64", I see the rodata_test
> failure immediately. Seems like this may be a kvm cpu feature
> emulation bug? I'll see if I can find the specific cpu feature in the
> morning...

Interesting! Changing to "-cpu host" makes rodata_test succeed plus
my test_setmem and the test_bpf suite runs fine as well. Haven't seen
a corruption since. Switching back to "-cpu kvm64" I immediately see
mentioned issues again.

With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
was set each time we switched attrs and a cpa_flush_range() was
performed (with the correct number of pages and cache set to 0). That
would be a __flush_tlb_all() eventually.

Hmm, it indeed might seem likely that this could be an emulation bug.

Thanks,
Daniel

2017-03-09 13:11:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, 9 Mar 2017, Daniel Borkmann wrote:
> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
> was set each time we switched attrs and a cpa_flush_range() was
> performed (with the correct number of pages and cache set to 0). That
> would be a __flush_tlb_all() eventually.
>
> Hmm, it indeed might seem likely that this could be an emulation bug.

Which variant of __flush_tlb_all() is used when the test fails?

Check for the following flags in /proc/cpuinfo: pge invpcid

Thanks,

tglx

2017-03-09 13:41:56

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Wed, 8 Mar 2017, Linus Torvalds wrote:

> Adding x86 people too, since this seems to be something off about
> ARCH_HAS_SET_MEMORY for x86-32.
>
> The code seems to be shared between x86-32 and 64, I'm not seeing why
> set_memory_r[ow]() should fail on one but not the other.

Indeed.

> Considering that it seems to be flaky even on 32-bit, maybe it's
> timing-related, or possibly related to TLB sizes or whatever (ie more
> likely hidden by a larger TLB on more modern hardware?)

The only difference I can see is the way how __tlb_flush_all() is
happening. We have 3 variants:

invpcid_flush_all() - depends on X86_FEATURE_INVPCID and X86_FEATURE_PGE
cr4 based flush - depends on X86_FEATURE_PGE
cr3 based flush

No idea which variant is used in that failure case.

> Anyway, just looking at change_page_attr_set_clr(), I notice that the
> page alias checking treats NX specially:
>
> /* No alias checking for _NX bit modifications */
> checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
>
> which seems insane. Why would NX be different from other protection
> bits (like _PAGE_RW)?

The reason is that the alias mapping should never be executable at all.

Thanks,

tglx

2017-03-09 14:01:57

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
>> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
>> was set each time we switched attrs and a cpa_flush_range() was
>> performed (with the correct number of pages and cache set to 0). That
>> would be a __flush_tlb_all() eventually.
>>
>> Hmm, it indeed might seem likely that this could be an emulation bug.
>
> Which variant of __flush_tlb_all() is used when the test fails?
>
> Check for the following flags in /proc/cpuinfo: pge invpcid

I added the following and booted with both variants:

printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));

"-cpu host" gives:

[ 8.326117] X86_FEATURE_PGE:1
[ 8.326381] X86_FEATURE_INVPCID:1

"-cpu kvm64" gives:

[ 8.517069] X86_FEATURE_PGE:1
[ 8.517393] X86_FEATURE_INVPCID:0

Thanks,
Daniel

2017-03-09 14:49:53

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, 9 Mar 2017, Daniel Borkmann wrote:

> On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
> > On Thu, 9 Mar 2017, Daniel Borkmann wrote:
> > > With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
> > > code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
> > > was set each time we switched attrs and a cpa_flush_range() was
> > > performed (with the correct number of pages and cache set to 0). That
> > > would be a __flush_tlb_all() eventually.
> > >
> > > Hmm, it indeed might seem likely that this could be an emulation bug.
> >
> > Which variant of __flush_tlb_all() is used when the test fails?
> >
> > Check for the following flags in /proc/cpuinfo: pge invpcid
>
> I added the following and booted with both variants:
>
> printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
>
> "-cpu host" gives:
>
> [ 8.326117] X86_FEATURE_PGE:1
> [ 8.326381] X86_FEATURE_INVPCID:1
>
> "-cpu kvm64" gives:
>
> [ 8.517069] X86_FEATURE_PGE:1
> [ 8.517393] X86_FEATURE_INVPCID:0

That's the one which fails. So it's using the CR4 based flushing. Just ran
a test on a physical system with PGE=1 and INVPCID=0. Works fine.

Emulation problem?

Thanks,

tglx


2017-03-09 14:56:26

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 02:25 PM, Daniel Borkmann wrote:
> On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
>> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>>> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
>>> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
>>> was set each time we switched attrs and a cpa_flush_range() was
>>> performed (with the correct number of pages and cache set to 0). That
>>> would be a __flush_tlb_all() eventually.
>>>
>>> Hmm, it indeed might seem likely that this could be an emulation bug.
>>
>> Which variant of __flush_tlb_all() is used when the test fails?
>>
>> Check for the following flags in /proc/cpuinfo: pge invpcid
>
> I added the following and booted with both variants:
>
> printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
>
> "-cpu host" gives:
>
> [ 8.326117] X86_FEATURE_PGE:1
> [ 8.326381] X86_FEATURE_INVPCID:1
>
> "-cpu kvm64" gives:
>
> [ 8.517069] X86_FEATURE_PGE:1
> [ 8.517393] X86_FEATURE_INVPCID:0

Fwiw, I tried switching from using cr4 (__native_flush_tlb_global_irq_disabled())
to slower cr3 (__native_flush_tlb()) in "-cpu kvm64" mode, and it looks like it
also lets all test cases pass (rodata_test, test_setmem, test_bpf), no corruption
happening, etc.

Test diff used:

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6fa8594..34f4582 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -188,9 +188,9 @@ static inline void __native_flush_tlb_single(unsigned long addr)

static inline void __flush_tlb_all(void)
{
- if (static_cpu_has(X86_FEATURE_PGE))
- __flush_tlb_global();
- else
+// if (static_cpu_has(X86_FEATURE_PGE))
+// __flush_tlb_global();
+// else
__flush_tlb();
}

2017-03-09 17:48:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 9, 2017 at 6:53 AM, Daniel Borkmann <[email protected]> wrote:
>
> Fwiw, I tried switching from using cr4
> (__native_flush_tlb_global_irq_disabled())
> to slower cr3 (__native_flush_tlb()) in "-cpu kvm64" mode, and it looks like
> it also lets all test cases pass (rodata_test, test_setmem, test_bpf), no
> corruption happening, etc.

Ok. I think this is conclusive: the qemu "-cpu kvm64" case is
definitely broken, since changing CR4.PGE is definitely
architecturally defined to flush all TLB entries.

This is not a guest kernel bug.

Of course, the bug may still be in the *host* kernel. Maybe the
emulation does something wrong. I see

if (((cr4 ^ old_cr4) & pdptr_bits) ||
(!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
kvm_mmu_reset_context(vcpu);

(where pdptr_bits includes the PGE bit), but I'm not sure if emulation
is supposed to do something else too.

Linus

2017-03-09 17:51:27

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 03:49 PM, Thomas Gleixner wrote:
> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>> On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
>>> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>>>> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
>>>> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
>>>> was set each time we switched attrs and a cpa_flush_range() was
>>>> performed (with the correct number of pages and cache set to 0). That
>>>> would be a __flush_tlb_all() eventually.
>>>>
>>>> Hmm, it indeed might seem likely that this could be an emulation bug.
>>>
>>> Which variant of __flush_tlb_all() is used when the test fails?
>>>
>>> Check for the following flags in /proc/cpuinfo: pge invpcid
>>
>> I added the following and booted with both variants:
>>
>> printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
>> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
>>
>> "-cpu host" gives:
>>
>> [ 8.326117] X86_FEATURE_PGE:1
>> [ 8.326381] X86_FEATURE_INVPCID:1
>>
>> "-cpu kvm64" gives:
>>
>> [ 8.517069] X86_FEATURE_PGE:1
>> [ 8.517393] X86_FEATURE_INVPCID:0
>
> That's the one which fails. So it's using the CR4 based flushing. Just ran
> a test on a physical system with PGE=1 and INVPCID=0. Works fine.
>
> Emulation problem?

So in the git qemu code base (target/i386/helper.c), cr3 vs cr4 looks
like the following, both sharing the tlb_flush() itself:

void cpu_x86_update_cr3(CPUX86State *env, target_ulong new_cr3)
{
X86CPU *cpu = x86_env_get_cpu(env);

env->cr[3] = new_cr3;
if (env->cr[0] & CR0_PG_MASK) {
qemu_log_mask(CPU_LOG_MMU,
"CR3 update: CR3=" TARGET_FMT_lx "\n", new_cr3);
tlb_flush(CPU(cpu));
}
}

void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4)
{
X86CPU *cpu = x86_env_get_cpu(env);
uint32_t hflags;

#if defined(DEBUG_MMU)
printf("CR4 update: %08x -> %08x\n", (uint32_t)env->cr[4], new_cr4);
#endif
if ((new_cr4 ^ env->cr[4]) &
(CR4_PGE_MASK | CR4_PAE_MASK | CR4_PSE_MASK |
CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_LA57_MASK)) {
tlb_flush(CPU(cpu));
}

[...]
}

I added some debugging around __native_flush_tlb_global_irq_disabled()
and if I understand it correctly, the idea of cr4 is that we need to
toggle X86_CR4_PGE in order to trigger a TLB flush.

What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is
consistent to native_read_cr4() and since cr4 is != 0, it tells me
based on the comment in native_read_cr4() that cr4 seems to be
supported. Thus, meaning we end up with writing ...

native_write_cr4(0x610);
native_write_cr4(0x610);

... twice, and this just doesn't trigger the desired TLB flush. I
changed the code into the following ...

cr4 = this_cpu_read(cpu_tlbstate.cr4);
/* clear PGE */
- native_write_cr4(cr4 & ~X86_CR4_PGE);
+ native_write_cr4(cr4 ^ X86_CR4_PGE);
/* write old PGE again and flush TLBs */
native_write_cr4(cr4);

... and the test cases seem to be working for me now with "-cpu kvm64",
so that seems to trigger the TLB we were missing.

I don't know enough about x86 internals to tell whether the change is
sane, though, but it seems at least for qemu fwiw. ;) Thoughts?

Thanks,
Daniel

2017-03-09 18:08:40

by David Miller

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

From: Daniel Borkmann <[email protected]>
Date: Thu, 09 Mar 2017 18:51:03 +0100

> I added some debugging around __native_flush_tlb_global_irq_disabled()
> and if I understand it correctly, the idea of cr4 is that we need to
> toggle X86_CR4_PGE in order to trigger a TLB flush.
>
> What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is
> consistent to native_read_cr4() and since cr4 is != 0, it tells me
> based on the comment in native_read_cr4() that cr4 seems to be
> supported. Thus, meaning we end up with writing ...
>
> native_write_cr4(0x610);
> native_write_cr4(0x610);
>
> ... twice, and this just doesn't trigger the desired TLB flush. I
> changed the code into the following ...
>
> cr4 = this_cpu_read(cpu_tlbstate.cr4);
> /* clear PGE */
> - native_write_cr4(cr4 & ~X86_CR4_PGE);
> + native_write_cr4(cr4 ^ X86_CR4_PGE);
> /* write old PGE again and flush TLBs */
> native_write_cr4(cr4);
>
> ... and the test cases seem to be working for me now with "-cpu
> kvm64",
> so that seems to trigger the TLB we were missing.

Great detective work Daniel.

2017-03-09 18:10:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 9, 2017 at 9:51 AM, Daniel Borkmann <[email protected]> wrote:
>
> What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is
> consistent to native_read_cr4() and since cr4 is != 0, it tells me
> based on the comment in native_read_cr4() that cr4 seems to be
> supported. Thus, meaning we end up with writing ...
>
> native_write_cr4(0x610);
> native_write_cr4(0x610);
>
> ... twice, and this just doesn't trigger the desired TLB flush.

Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
supports it).

But yes, if PGE is clear then that certainly explains the bug, and
it's not an emulation issue.

> I changed the code into the following ...
>
> cr4 = this_cpu_read(cpu_tlbstate.cr4);
> /* clear PGE */
> - native_write_cr4(cr4 & ~X86_CR4_PGE);
> + native_write_cr4(cr4 ^ X86_CR4_PGE);
> /* write old PGE again and flush TLBs */
> native_write_cr4(cr4);

Yeah, good for debugging, but not a good patch in general. The only
valid reason for not having PGE enabled would be that the CPU doesn't
support PGE at all.

Linus

2017-03-09 18:15:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds
<[email protected]> wrote:
>
> Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
> supports it).

Daniel, do you see the code in probe_page_size_mask() triggering?

/* Enable PGE if available */
if (boot_cpu_has(X86_FEATURE_PGE)) {
cr4_set_bits_and_update_boot(X86_CR4_PGE);
__supported_pte_mask |= _PAGE_GLOBAL;
} else
__supported_pte_mask &= ~_PAGE_GLOBAL;

but maybe there's something wrong with the percpu cr4 caching?

Linus

2017-03-09 18:31:22

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 07:15 PM, Linus Torvalds wrote:
> On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
>> supports it).
>
> Daniel, do you see the code in probe_page_size_mask() triggering?
>
> /* Enable PGE if available */
> if (boot_cpu_has(X86_FEATURE_PGE)) {
> cr4_set_bits_and_update_boot(X86_CR4_PGE);
> __supported_pte_mask |= _PAGE_GLOBAL;

We do have boot_cpu_has(X86_FEATURE_PGE) and go indeed into this
branch here. So it seems something must be clearing it later, hmm.

> } else
> __supported_pte_mask &= ~_PAGE_GLOBAL;
>
> but maybe there's something wrong with the percpu cr4 caching?
>
> Linus
>

2017-03-09 21:32:45

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

[ + Borislav ]

On 03/09/2017 07:31 PM, Daniel Borkmann wrote:
> On 03/09/2017 07:15 PM, Linus Torvalds wrote:
>> On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
>>> supports it).
>>
>> Daniel, do you see the code in probe_page_size_mask() triggering?
>>
>> /* Enable PGE if available */
>> if (boot_cpu_has(X86_FEATURE_PGE)) {
>> cr4_set_bits_and_update_boot(X86_CR4_PGE);
>> __supported_pte_mask |= _PAGE_GLOBAL;
>
> We do have boot_cpu_has(X86_FEATURE_PGE) and go indeed into this
> branch here. So it seems something must be clearing it later, hmm.
>
>> } else
>> __supported_pte_mask &= ~_PAGE_GLOBAL;
>>
>> but maybe there's something wrong with the percpu cr4 caching?

So, I think I got a little bit further. To the printk's I added
previously in the test_setmem code that run on the problematic
kernel, I now extended them into:

printk("static_cpu X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
printk("boot_cpu X86_FEATURE_PGE:%u\n", boot_cpu_has(X86_FEATURE_PGE));
printk("static_cpu X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
printk("boot_cpu X86_FEATURE_INVPCID:%u\n", boot_cpu_has(X86_FEATURE_INVPCID));

And here's what I get in the log:

"-cpu kvm64" gives:

[ 8.426865] static_cpu X86_FEATURE_PGE:1
[ 8.427148] boot_cpu X86_FEATURE_PGE:0
[ 8.427428] static_cpu X86_FEATURE_INVPCID:0
[ 8.427732] boot_cpu X86_FEATURE_INVPCID:0

"-cpu host" gives:

[ 8.426408] static_cpu X86_FEATURE_PGE:1
[ 8.426726] boot_cpu X86_FEATURE_PGE:0
[ 8.427037] static_cpu X86_FEATURE_INVPCID:1
[ 8.427375] boot_cpu X86_FEATURE_INVPCID:1

This means at that point in time static_cpu_has(X86_FEATURE_PGE) is
not the same as boot_cpu_has(X86_FEATURE_PGE).

The code that switches this off is in lguest_arch_host_init(). Right
before that, both are X86_FEATURE_PGE:1, X86_FEATURE_INVPCID:0 for
the "-cpu kvm64" case.

Then, the lguest code does:

get_online_cpus();
if (boot_cpu_has(X86_FEATURE_PGE)) { /* We have a broader idea of "global". */
/* Remember that this was originally set (for cleanup). */
cpu_had_pge = 1;
/*
* adjust_pge is a helper function which sets or unsets the PGE
* bit on its CPU, depending on the argument (0 == unset).
*/
on_each_cpu(adjust_pge, (void *)0, 1);
/* Turn off the feature in the global feature set. */
clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PGE);
}
put_online_cpus();

So, adjust_pge() clears X86_CR4_PGE, and boot cpu has X86_FEATURE_PGE
unset. This means, with static_cpu_has(X86_FEATURE_PGE) still 1, we
run into using cr4 for TLB flushing with no X86_CR4_PGE bit set (which
doesn't trigger the flush), whereas we should be using cr3 for flushing
from that point onwards instead.

I tried with this one, and things seem to work again:

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6fa8594..fc5abff 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -188,7 +188,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)

static inline void __flush_tlb_all(void)
{
- if (static_cpu_has(X86_FEATURE_PGE))
+ if (boot_cpu_has(X86_FEATURE_PGE))
__flush_tlb_global();
else
__flush_tlb();

Presumably coming from c109bf95992b ("x86/cpufeature: Remove cpu_has_pge")?

Thanks,
Daniel

2017-03-09 22:07:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 09, 2017 at 10:55:47PM +0100, Borislav Petkov wrote:
> Can you make that:
>
> setup_clear_cpu_cap(X86_FEATURE_PGE);
>
> and see if it fixes your issue?

Hmm, in reading the thread a bit more, that might not work. If I see it
correctly, lguest does

clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PGE);

after the alternatives have run and static_cpu_has() sites have already
been patched so clearing that bit won't bring anything.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2017-03-09 22:14:07

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/09/2017 11:07 PM, Borislav Petkov wrote:
> On Thu, Mar 09, 2017 at 10:55:47PM +0100, Borislav Petkov wrote:
>> Can you make that:
>>
>> setup_clear_cpu_cap(X86_FEATURE_PGE);
>>
>> and see if it fixes your issue?
>
> Hmm, in reading the thread a bit more, that might not work. If I see it
> correctly, lguest does
>
> clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PGE);
>
> after the alternatives have run and static_cpu_has() sites have already
> been patched so clearing that bit won't bring anything.

Yeah, I just tried that out and it had no effect unfortunately, the
static_cpu_has() was still 1.

2017-03-09 22:14:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 09, 2017 at 10:32:12PM +0100, Daniel Borkmann wrote:
> get_online_cpus();
> if (boot_cpu_has(X86_FEATURE_PGE)) { /* We have a broader idea of "global". */
> /* Remember that this was originally set (for cleanup). */
> cpu_had_pge = 1;
> /*
> * adjust_pge is a helper function which sets or unsets the PGE
> * bit on its CPU, depending on the argument (0 == unset).
> */
> on_each_cpu(adjust_pge, (void *)0, 1);
> /* Turn off the feature in the global feature set. */
> clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PGE);

Can you make that:

setup_clear_cpu_cap(X86_FEATURE_PGE);

and see if it fixes your issue?

Thanks.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2017-03-09 22:48:49

by Borislav Petkov

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 09, 2017 at 11:11:17PM +0100, Daniel Borkmann wrote:
> Yeah, I just tried that out and it had no effect unfortunately, the
> static_cpu_has() was still 1.

Right, just as I thought.

I guess we could return to doing boot_cpu_has() in __flush_tlb_all()
then. I mean, the timing-sensitivity argument is meh - killing global
TLB entries a bit faster doesn't bring me a whole lot when I have to go
and walk pagetable and reestablish them, which is the real price to pay
anyway.

Thanks.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2017-03-09 23:26:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 9, 2017 at 2:48 PM, Borislav Petkov <[email protected]> wrote:
>
> I guess we could return to doing boot_cpu_has() in __flush_tlb_all()
> then. I mean, the timing-sensitivity argument is meh - killing global
> TLB entries a bit faster doesn't bring me a whole lot when I have to go
> and walk pagetable and reestablish them, which is the real price to pay
> anyway.

So should all of commit ("c109bf95992b x86/cpufeature: Remove
cpu_has_pge") just be reverted (and then marked for stable)?

Or do we have some alternate plan?

This has apparently been going on for a long while (it got merged into
4.7), but presumably it only actually _matters_ if lguest is enabled
and used and we've triggered that lguest_arch_host_init() code.

Maybe it's the lguest games with PGE that need to be removed?

Linus

2017-03-09 23:44:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 09, 2017 at 03:26:02PM -0800, Linus Torvalds wrote:
> So should all of commit ("c109bf95992b x86/cpufeature: Remove
> cpu_has_pge") just be reverted (and then marked for stable)?
>
> Or do we have some alternate plan?

I think we want to do this:

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6fa85944af83..fc5abff9b7fd 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -188,7 +188,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)

static inline void __flush_tlb_all(void)
{
- if (static_cpu_has(X86_FEATURE_PGE))
+ if (boot_cpu_has(X86_FEATURE_PGE))
__flush_tlb_global();
else
__flush_tlb();
---

but it is late here so I'd prefer to do a real patch tomorrow when I'm
not almost sleeping on the keyboard. Unless Daniel wants to write one
and test it now.

> This has apparently been going on for a long while (it got merged into
> 4.7), but presumably it only actually _matters_ if lguest is enabled
> and used and we've triggered that lguest_arch_host_init() code.

That's what I gather too, yes.

What sane code would go and clear X86_FEATURE_PGE?!? :-)))

> Maybe it's the lguest games with PGE that need to be removed?

Well, as far as I can read the comment in lguest_arch_host_init(), it
does some monkey business with switching to the guest kernel where
global pages are not present anymore... or something. So it sounds to me
like lguest would break if we removed the games but I have no idea what
it does with that.

And besides, the small hunk above restores the situation before
("c109bf95992b x86/cpufeature: Remove cpu_has_pge") so applying it would
actually be a no-brainer.

Thanks.

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2017-03-10 00:15:14

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On 03/10/2017 12:44 AM, Borislav Petkov wrote:
> On Thu, Mar 09, 2017 at 03:26:02PM -0800, Linus Torvalds wrote:
>> So should all of commit ("c109bf95992b x86/cpufeature: Remove
>> cpu_has_pge") just be reverted (and then marked for stable)?
>>
>> Or do we have some alternate plan?
>
> I think we want to do this:
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 6fa85944af83..fc5abff9b7fd 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -188,7 +188,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>
> static inline void __flush_tlb_all(void)
> {
> - if (static_cpu_has(X86_FEATURE_PGE))
> + if (boot_cpu_has(X86_FEATURE_PGE))
> __flush_tlb_global();
> else
> __flush_tlb();
> ---
>
> but it is late here so I'd prefer to do a real patch tomorrow when I'm
> not almost sleeping on the keyboard. Unless Daniel wants to write one
> and test it now.

I think we're in the same time zone. ;) I could send something
official tomorrow cooking a changelog with analysis, but I don't
mind at all if you want to go ahead with that either. Feel free
to add my SoB or Tested-by to it.

>> This has apparently been going on for a long while (it got merged into
>> 4.7), but presumably it only actually _matters_ if lguest is enabled
>> and used and we've triggered that lguest_arch_host_init() code.
>
> That's what I gather too, yes.
>
> What sane code would go and clear X86_FEATURE_PGE?!? :-)))
>
>> Maybe it's the lguest games with PGE that need to be removed?
>
> Well, as far as I can read the comment in lguest_arch_host_init(), it
> does some monkey business with switching to the guest kernel where
> global pages are not present anymore... or something. So it sounds to me
> like lguest would break if we removed the games but I have no idea what
> it does with that.
>
> And besides, the small hunk above restores the situation before
> ("c109bf95992b x86/cpufeature: Remove cpu_has_pge") so applying it would
> actually be a no-brainer.

Agree, looks only that hunk changed in behavior from c109bf95992b
("x86/cpufeature: Remove cpu_has_pge").

> Thanks.

2017-03-12 21:40:55

by Borislav Petkov

[permalink] [raw]
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

On Thu, Mar 09, 2017 at 03:26:02PM -0800, Linus Torvalds wrote:
> Maybe it's the lguest games with PGE that need to be removed?

Btw, tglx suggested something else the other day: warn when we're
changing boot_cpu_data x86_capability bits *after* alternatives have
run. The reasoning behind it being that potentially some patching
static_cpu_has() has done won't be correct anymore.

And it is pretty cheap to do it, it fires nicely on the 32-bit config
with LGUEST=y.

---
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index d59c15c3defd..f06c3dc6db70 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -124,8 +124,18 @@ extern const char * const x86_bug_flags[NBUGINTS*32];

#define boot_cpu_has(bit) cpu_has(&boot_cpu_data, bit)

-#define set_cpu_cap(c, bit) set_bit(bit, (unsigned long *)((c)->x86_capability))
-#define clear_cpu_cap(c, bit) clear_bit(bit, (unsigned long *)((c)->x86_capability))
+#define set_cpu_cap(c, bit) \
+({ \
+ WARN_ON(c == &boot_cpu_data && alternatives_patched); \
+ set_bit(bit, (unsigned long *)((c)->x86_capability)); \
+})
+
+#define clear_cpu_cap(c, bit) \
+({ \
+ WARN_ON(c == &boot_cpu_data && alternatives_patched); \
+ clear_bit(bit, (unsigned long *)((c)->x86_capability)); \
+})
+
#define setup_clear_cpu_cap(bit) do { \
clear_cpu_cap(&boot_cpu_data, bit); \
set_bit(bit, (unsigned long *)cpu_caps_cleared); \

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--