LinuxLists.cc - [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

2021-08-19 13:30:00

Subject: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

From: Mike Rapoport <[email protected]>

Jiri Olsa reported a fault when running:

# cat /proc/kallsyms | grep ksys_read
ffffffff8136d580 T ksys_read
# objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore

/proc/kcore: file format elf64-x86-64

Segmentation fault

krava33 login: [ 68.330612] general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI
[ 68.333118] CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508
[ 68.334922] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014
[ 68.336945] RIP: 0010:kern_addr_valid+0x150/0x300
[ 68.338082] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
[ 68.342220] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
[ 68.343428] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
[ 68.345029] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
[ 68.346599] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
[ 68.349000] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
[ 68.350804] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
[ 68.352609] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
[ 68.354638] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 68.356104] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
[ 68.357896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 68.359694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 68.361597] PKRU: 55555554
[ 68.362460] Call Trace:
[ 68.363252] read_kcore+0x57f/0x920
[ 68.364289] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.365630] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.366955] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.368277] ? trace_hardirqs_on+0x1b/0xd0
[ 68.369462] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.370793] ? lock_acquire+0x195/0x2f0
[ 68.371920] ? lock_acquire+0x195/0x2f0
[ 68.373035] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.374364] ? lock_acquire+0x195/0x2f0
[ 68.375498] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.376831] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.379883] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.381268] ? lock_release+0x22b/0x3e0
[ 68.382458] ? _raw_spin_unlock+0x1f/0x30
[ 68.383685] ? __handle_mm_fault+0xcfc/0x15f0
[ 68.384994] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.386389] ? lock_acquire+0x195/0x2f0
[ 68.387573] ? rcu_read_lock_sched_held+0x12/0x80
[ 68.388969] ? lock_release+0x22b/0x3e0
[ 68.390145] proc_reg_read+0x55/0xa0
[ 68.391257] ? vfs_read+0x78/0x1b0
[ 68.392336] vfs_read+0xa7/0x1b0
[ 68.393328] ksys_read+0x68/0xe0
[ 68.394308] do_syscall_64+0x3b/0x90
[ 68.395391] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 68.396804] RIP: 0033:0x7fcc11cf92e2
[ 68.397824] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ea 2e 0a 00 e8 95 e9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 68.402420] RSP: 002b:00007ffd6e0f8da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 68.404357] RAX: ffffffffffffffda RBX: 0000565439305b20 RCX: 00007fcc11cf92e2
[ 68.406061] RDX: 0000000000800000 RSI: 00007fcc0f980010 RDI: 0000000000000003
[ 68.407747] RBP: 00007fcc11dcd300 R08: 0000000000000003 R09: 00007fcc0d980010
[ 68.410937] R10: 0000000003826000 R11: 0000000000000246 R12: 00007fcc0f980010
[ 68.412624] R13: 0000000000000d68 R14: 00007fcc11dcc700 R15: 0000000000800000
[ 68.414322] Modules linked in: intel_rapl_msr intel_rapl_common nfit kvm_intel kvm irqbypass rapl iTCO_wdt iTCO_vendor_support i2c_i801 i2c_smbus lpc_ich drm drm_panel_orientation_quirks zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
[ 68.419591] ---[ end trace e2c30f827226966b ]---
[ 68.420969] RIP: 0010:kern_addr_valid+0x150/0x300
[ 68.422308] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
[ 68.426826] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
[ 68.428150] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
[ 68.429813] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
[ 68.431465] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
[ 68.433115] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
[ 68.434768] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
[ 68.436423] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
[ 68.438354] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 68.442077] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
[ 68.443727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 68.445370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 68.447010] PKRU: 55555554

The fault happens because kern_addr_valid() dereferences an existing but
not present PMD in the high kernel mappings.

Such PMDs are created when free_kernel_image_pages() frees regions larger
than 2MB. In this case, part of the freed memory is mapped with PMDs and
the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
mark the PMD as not present rather than wipe it completely.

Make kern_addr_valid() check whether higher-level page table entries are
present before trying to dereference them, both to fix this issue and to
avoid similar issues in the future.

Reported-by: Jiri Olsa <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
Cc: <[email protected]> # 4.4+
---

v2:
* drop pXd_none() checks and leave only pXd_present(), per David

v1: https://lore.kernel.org/lkml/[email protected]

arch/x86/mm/init_64.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ddeaba947eb3..879886c6cc53 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
 		return 0;

 	p4d = p4d_offset(pgd, addr);
-	if (p4d_none(*p4d))
+	if (!p4d_present(*p4d))
 		return 0;

 	pud = pud_offset(p4d, addr);
-	if (pud_none(*pud))
+	if (!pud_present(*pud))
 		return 0;

 	if (pud_large(*pud))
 		return pfn_valid(pud_pfn(*pud));

 	pmd = pmd_offset(pud, addr);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		return 0;

 	if (pmd_large(*pmd))
--
2.28.0

2021-08-19 13:38:01

by David Hildenbrand

Subject: Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

On 19.08.21 15:27, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Jiri Olsa reported a fault when running:
>
> # cat /proc/kallsyms | grep ksys_read
> ffffffff8136d580 T ksys_read
> # objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore
>
> /proc/kcore: file format elf64-x86-64
>
> Segmentation fault
>
> krava33 login: [ 68.330612] general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI
> [ 68.333118] CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508
> [ 68.334922] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014
> [ 68.336945] RIP: 0010:kern_addr_valid+0x150/0x300
> [ 68.338082] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
> [ 68.342220] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
> [ 68.343428] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
> [ 68.345029] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
> [ 68.346599] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
> [ 68.349000] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
> [ 68.350804] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
> [ 68.352609] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
> [ 68.354638] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 68.356104] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
> [ 68.357896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 68.359694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 68.361597] PKRU: 55555554
> [ 68.362460] Call Trace:
> [ 68.363252] read_kcore+0x57f/0x920
> [ 68.364289] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.365630] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.366955] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.368277] ? trace_hardirqs_on+0x1b/0xd0
> [ 68.369462] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.370793] ? lock_acquire+0x195/0x2f0
> [ 68.371920] ? lock_acquire+0x195/0x2f0
> [ 68.373035] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.374364] ? lock_acquire+0x195/0x2f0
> [ 68.375498] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.376831] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.379883] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.381268] ? lock_release+0x22b/0x3e0
> [ 68.382458] ? _raw_spin_unlock+0x1f/0x30
> [ 68.383685] ? __handle_mm_fault+0xcfc/0x15f0
> [ 68.384994] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.386389] ? lock_acquire+0x195/0x2f0
> [ 68.387573] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.388969] ? lock_release+0x22b/0x3e0
> [ 68.390145] proc_reg_read+0x55/0xa0
> [ 68.391257] ? vfs_read+0x78/0x1b0
> [ 68.392336] vfs_read+0xa7/0x1b0
> [ 68.393328] ksys_read+0x68/0xe0
> [ 68.394308] do_syscall_64+0x3b/0x90
> [ 68.395391] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 68.396804] RIP: 0033:0x7fcc11cf92e2
> [ 68.397824] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ea 2e 0a 00 e8 95 e9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> [ 68.402420] RSP: 002b:00007ffd6e0f8da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 68.404357] RAX: ffffffffffffffda RBX: 0000565439305b20 RCX: 00007fcc11cf92e2
> [ 68.406061] RDX: 0000000000800000 RSI: 00007fcc0f980010 RDI: 0000000000000003
> [ 68.407747] RBP: 00007fcc11dcd300 R08: 0000000000000003 R09: 00007fcc0d980010
> [ 68.410937] R10: 0000000003826000 R11: 0000000000000246 R12: 00007fcc0f980010
> [ 68.412624] R13: 0000000000000d68 R14: 00007fcc11dcc700 R15: 0000000000800000
> [ 68.414322] Modules linked in: intel_rapl_msr intel_rapl_common nfit kvm_intel kvm irqbypass rapl iTCO_wdt iTCO_vendor_support i2c_i801 i2c_smbus lpc_ich drm drm_panel_orientation_quirks zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
> [ 68.419591] ---[ end trace e2c30f827226966b ]---
> [ 68.420969] RIP: 0010:kern_addr_valid+0x150/0x300
> [ 68.422308] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
> [ 68.426826] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
> [ 68.428150] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
> [ 68.429813] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
> [ 68.431465] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
> [ 68.433115] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
> [ 68.434768] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
> [ 68.436423] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
> [ 68.438354] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 68.442077] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
> [ 68.443727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 68.445370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 68.447010] PKRU: 55555554
>
> The fault happens because kern_addr_valid() dereferences existent but not
> present PMD in the high kernel mappings.
>
> Such PMDs are created when free_kernel_image_pages() frees regions larger
> than 2Mb. In this case a part of the freed memory is mapped with PMDs and
> the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
> mark the PMD as not present rather than wipe it completely.
>
> Make kern_addr_valid() to check whether higher level page table entries are
> present before trying to dereference them to fix this issue and to avoid
> similar issues in the future.
>
> Reported-by: Jiri Olsa <[email protected]>
> Signed-off-by: Mike Rapoport <[email protected]>
> Cc: <[email protected]> # 4.4+
> ---
>
> v2:
> * drop pXd_none() checks and leave only pXd_present(), per David
>
> v1: https://lore.kernel.org/lkml/[email protected]
>
> arch/x86/mm/init_64.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index ddeaba947eb3..879886c6cc53 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
> return 0;
>
> p4d = p4d_offset(pgd, addr);
> - if (p4d_none(*p4d))
> + if (!p4d_present(*p4d))
> return 0;
>
> pud = pud_offset(p4d, addr);
> - if (pud_none(*pud))
> + if (!pud_present(*pud))
> return 0;
>
> if (pud_large(*pud))
> return pfn_valid(pud_pfn(*pud));
>
> pmd = pmd_offset(pud, addr);
> - if (pmd_none(*pmd))
> + if (!pmd_present(*pmd))
> return 0;
>
> if (pmd_large(*pmd))
>

Hopefully we won't have other similar BUGs in the code because we leave
fake swap entries lying around in the direct map.

Thanks!

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2021-08-19 15:35:38

by Jiri Olsa

Subject: Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

On Thu, Aug 19, 2021 at 04:27:17PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Jiri Olsa reported a fault when running:
>
> # cat /proc/kallsyms | grep ksys_read
> ffffffff8136d580 T ksys_read
> # objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore
>
> /proc/kcore: file format elf64-x86-64
>
> Segmentation fault
>
> krava33 login: [ 68.330612] general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI
> [ 68.333118] CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508
> [ 68.334922] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014
> [ 68.336945] RIP: 0010:kern_addr_valid+0x150/0x300
> [ 68.338082] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
> [ 68.342220] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
> [ 68.343428] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
> [ 68.345029] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
> [ 68.346599] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
> [ 68.349000] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
> [ 68.350804] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
> [ 68.352609] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
> [ 68.354638] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 68.356104] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
> [ 68.357896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 68.359694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 68.361597] PKRU: 55555554
> [ 68.362460] Call Trace:
> [ 68.363252] read_kcore+0x57f/0x920
> [ 68.364289] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.365630] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.366955] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.368277] ? trace_hardirqs_on+0x1b/0xd0
> [ 68.369462] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.370793] ? lock_acquire+0x195/0x2f0
> [ 68.371920] ? lock_acquire+0x195/0x2f0
> [ 68.373035] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.374364] ? lock_acquire+0x195/0x2f0
> [ 68.375498] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.376831] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.379883] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.381268] ? lock_release+0x22b/0x3e0
> [ 68.382458] ? _raw_spin_unlock+0x1f/0x30
> [ 68.383685] ? __handle_mm_fault+0xcfc/0x15f0
> [ 68.384994] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.386389] ? lock_acquire+0x195/0x2f0
> [ 68.387573] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.388969] ? lock_release+0x22b/0x3e0
> [ 68.390145] proc_reg_read+0x55/0xa0
> [ 68.391257] ? vfs_read+0x78/0x1b0
> [ 68.392336] vfs_read+0xa7/0x1b0
> [ 68.393328] ksys_read+0x68/0xe0
> [ 68.394308] do_syscall_64+0x3b/0x90
> [ 68.395391] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 68.396804] RIP: 0033:0x7fcc11cf92e2
> [ 68.397824] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ea 2e 0a 00 e8 95 e9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> [ 68.402420] RSP: 002b:00007ffd6e0f8da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 68.404357] RAX: ffffffffffffffda RBX: 0000565439305b20 RCX: 00007fcc11cf92e2
> [ 68.406061] RDX: 0000000000800000 RSI: 00007fcc0f980010 RDI: 0000000000000003
> [ 68.407747] RBP: 00007fcc11dcd300 R08: 0000000000000003 R09: 00007fcc0d980010
> [ 68.410937] R10: 0000000003826000 R11: 0000000000000246 R12: 00007fcc0f980010
> [ 68.412624] R13: 0000000000000d68 R14: 00007fcc11dcc700 R15: 0000000000800000
> [ 68.414322] Modules linked in: intel_rapl_msr intel_rapl_common nfit kvm_intel kvm irqbypass rapl iTCO_wdt iTCO_vendor_support i2c_i801 i2c_smbus lpc_ich drm drm_panel_orientation_quirks zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
> [ 68.419591] ---[ end trace e2c30f827226966b ]---
> [ 68.420969] RIP: 0010:kern_addr_valid+0x150/0x300
> [ 68.422308] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
> [ 68.426826] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
> [ 68.428150] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
> [ 68.429813] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
> [ 68.431465] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
> [ 68.433115] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
> [ 68.434768] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
> [ 68.436423] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
> [ 68.438354] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 68.442077] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
> [ 68.443727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 68.445370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 68.447010] PKRU: 55555554
>
> The fault happens because kern_addr_valid() dereferences existent but not
> present PMD in the high kernel mappings.
>
> Such PMDs are created when free_kernel_image_pages() frees regions larger
> than 2Mb. In this case a part of the freed memory is mapped with PMDs and
> the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
> mark the PMD as not present rather than wipe it completely.
>
> Make kern_addr_valid() to check whether higher level page table entries are
> present before trying to dereference them to fix this issue and to avoid
> similar issues in the future.
>
> Reported-by: Jiri Olsa <[email protected]>

Tested-by: Jiri Olsa <[email protected]>

thanks,
jirka

> Signed-off-by: Mike Rapoport <[email protected]>
> Cc: <[email protected]> # 4.4+
> ---
>
> v2:
> * drop pXd_none() checks and leave only pXd_present(), per David
>
> v1: https://lore.kernel.org/lkml/[email protected]
>
> arch/x86/mm/init_64.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index ddeaba947eb3..879886c6cc53 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
> return 0;
>
> p4d = p4d_offset(pgd, addr);
> - if (p4d_none(*p4d))
> + if (!p4d_present(*p4d))
> return 0;
>
> pud = pud_offset(p4d, addr);
> - if (pud_none(*pud))
> + if (!pud_present(*pud))
> return 0;
>
> if (pud_large(*pud))
> return pfn_valid(pud_pfn(*pud));
>
> pmd = pmd_offset(pud, addr);
> - if (pmd_none(*pmd))
> + if (!pmd_present(*pmd))
> return 0;
>
> if (pmd_large(*pmd))
> --
> 2.28.0
>

2021-08-25 21:08:16

by Dave Hansen

Subject: Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

On 8/19/21 6:27 AM, Mike Rapoport wrote:
> Such PMDs are created when free_kernel_image_pages() frees regions larger
> than 2Mb. In this case a part of the freed memory is mapped with PMDs and
> the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
> mark the PMD as not present rather than wipe it completely.
>
> Make kern_addr_valid() to check whether higher level page table entries are
> present before trying to dereference them to fix this issue and to avoid
> similar issues in the future.
>
> Reported-by: Jiri Olsa <[email protected]>
> Signed-off-by: Mike Rapoport <[email protected]>
> Cc: <[email protected]> # 4.4...
> pmd = pmd_offset(pud, addr);
> - if (pmd_none(*pmd))
> + if (!pmd_present(*pmd))
> return 0;

Yeah, that seems like the right fix. The one kern_addr_valid() user is
going to touch the memory so it *better* be present. p*d_none() was
definitely the wrong check.

Acked-by: Dave Hansen <[email protected]>

2021-09-02 08:55:38

by Mike Rapoport

Subject: Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

Any updates on this?

On Thu, Aug 19, 2021 at 04:27:17PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Jiri Olsa reported a fault when running:
>
> # cat /proc/kallsyms | grep ksys_read
> ffffffff8136d580 T ksys_read
> # objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore
>
> /proc/kcore: file format elf64-x86-64
>
> Segmentation fault
>
> krava33 login: [ 68.330612] general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI
> [ 68.333118] CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508
> [ 68.334922] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014
> [ 68.336945] RIP: 0010:kern_addr_valid+0x150/0x300
> [ 68.338082] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
> [ 68.342220] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
> [ 68.343428] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
> [ 68.345029] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
> [ 68.346599] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
> [ 68.349000] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
> [ 68.350804] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
> [ 68.352609] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
> [ 68.354638] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 68.356104] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
> [ 68.357896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 68.359694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 68.361597] PKRU: 55555554
> [ 68.362460] Call Trace:
> [ 68.363252] read_kcore+0x57f/0x920
> [ 68.364289] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.365630] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.366955] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.368277] ? trace_hardirqs_on+0x1b/0xd0
> [ 68.369462] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.370793] ? lock_acquire+0x195/0x2f0
> [ 68.371920] ? lock_acquire+0x195/0x2f0
> [ 68.373035] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.374364] ? lock_acquire+0x195/0x2f0
> [ 68.375498] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.376831] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.379883] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.381268] ? lock_release+0x22b/0x3e0
> [ 68.382458] ? _raw_spin_unlock+0x1f/0x30
> [ 68.383685] ? __handle_mm_fault+0xcfc/0x15f0
> [ 68.384994] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.386389] ? lock_acquire+0x195/0x2f0
> [ 68.387573] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.388969] ? lock_release+0x22b/0x3e0
> [ 68.390145] proc_reg_read+0x55/0xa0
> [ 68.391257] ? vfs_read+0x78/0x1b0
> [ 68.392336] vfs_read+0xa7/0x1b0
> [ 68.393328] ksys_read+0x68/0xe0
> [ 68.394308] do_syscall_64+0x3b/0x90
> [ 68.395391] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 68.396804] RIP: 0033:0x7fcc11cf92e2
> [ 68.397824] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ea 2e 0a 00 e8 95 e9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> [ 68.402420] RSP: 002b:00007ffd6e0f8da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 68.404357] RAX: ffffffffffffffda RBX: 0000565439305b20 RCX: 00007fcc11cf92e2
> [ 68.406061] RDX: 0000000000800000 RSI: 00007fcc0f980010 RDI: 0000000000000003
> [ 68.407747] RBP: 00007fcc11dcd300 R08: 0000000000000003 R09: 00007fcc0d980010
> [ 68.410937] R10: 0000000003826000 R11: 0000000000000246 R12: 00007fcc0f980010
> [ 68.412624] R13: 0000000000000d68 R14: 00007fcc11dcc700 R15: 0000000000800000
> [ 68.414322] Modules linked in: intel_rapl_msr intel_rapl_common nfit kvm_intel kvm irqbypass rapl iTCO_wdt iTCO_vendor_support i2c_i801 i2c_smbus lpc_ich drm drm_panel_orientation_quirks zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
> [ 68.419591] ---[ end trace e2c30f827226966b ]---
> [ 68.420969] RIP: 0010:kern_addr_valid+0x150/0x300
> [ 68.422308] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
> [ 68.426826] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
> [ 68.428150] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
> [ 68.429813] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
> [ 68.431465] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
> [ 68.433115] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
> [ 68.434768] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
> [ 68.436423] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
> [ 68.438354] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 68.442077] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
> [ 68.443727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 68.445370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 68.447010] PKRU: 55555554
>
> The fault happens because kern_addr_valid() dereferences existent but not
> present PMD in the high kernel mappings.
>
> Such PMDs are created when free_kernel_image_pages() frees regions larger
> than 2Mb. In this case a part of the freed memory is mapped with PMDs and
> the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
> mark the PMD as not present rather than wipe it completely.
>
> Make kern_addr_valid() to check whether higher level page table entries are
> present before trying to dereference them to fix this issue and to avoid
> similar issues in the future.
>
> Reported-by: Jiri Olsa <[email protected]>
> Signed-off-by: Mike Rapoport <[email protected]>
> Cc: <[email protected]> # 4.4+
> ---
>
> v2:
> * drop pXd_none() checks and leave only pXd_present(), per David
>
> v1: https://lore.kernel.org/lkml/[email protected]
>
> arch/x86/mm/init_64.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index ddeaba947eb3..879886c6cc53 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
> return 0;
>
> p4d = p4d_offset(pgd, addr);
> - if (p4d_none(*p4d))
> + if (!p4d_present(*p4d))
> return 0;
>
> pud = pud_offset(p4d, addr);
> - if (pud_none(*pud))
> + if (!pud_present(*pud))
> return 0;
>
> if (pud_large(*pud))
> return pfn_valid(pud_pfn(*pud));
>
> pmd = pmd_offset(pud, addr);
> - if (pmd_none(*pmd))
> + if (!pmd_present(*pmd))
> return 0;
>
> if (pmd_large(*pmd))
> --
> 2.28.0
>

--
Sincerely yours,
Mike.

2021-09-08 09:15:19

by Mike Rapoport

Subject: Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

Ping?

On Thu, Aug 19, 2021 at 04:27:17PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Jiri Olsa reported a fault when running:
>
> # cat /proc/kallsyms | grep ksys_read
> ffffffff8136d580 T ksys_read
> # objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore
>
> /proc/kcore: file format elf64-x86-64
>
> Segmentation fault
>
> krava33 login: [ 68.330612] general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI
> [ 68.333118] CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508
> [ 68.334922] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014
> [ 68.336945] RIP: 0010:kern_addr_valid+0x150/0x300
> [ 68.338082] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
> [ 68.342220] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
> [ 68.343428] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
> [ 68.345029] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
> [ 68.346599] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
> [ 68.349000] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
> [ 68.350804] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
> [ 68.352609] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
> [ 68.354638] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 68.356104] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
> [ 68.357896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 68.359694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 68.361597] PKRU: 55555554
> [ 68.362460] Call Trace:
> [ 68.363252] read_kcore+0x57f/0x920
> [ 68.364289] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.365630] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.366955] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.368277] ? trace_hardirqs_on+0x1b/0xd0
> [ 68.369462] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.370793] ? lock_acquire+0x195/0x2f0
> [ 68.371920] ? lock_acquire+0x195/0x2f0
> [ 68.373035] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.374364] ? lock_acquire+0x195/0x2f0
> [ 68.375498] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.376831] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.379883] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.381268] ? lock_release+0x22b/0x3e0
> [ 68.382458] ? _raw_spin_unlock+0x1f/0x30
> [ 68.383685] ? __handle_mm_fault+0xcfc/0x15f0
> [ 68.384994] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.386389] ? lock_acquire+0x195/0x2f0
> [ 68.387573] ? rcu_read_lock_sched_held+0x12/0x80
> [ 68.388969] ? lock_release+0x22b/0x3e0
> [ 68.390145] proc_reg_read+0x55/0xa0
> [ 68.391257] ? vfs_read+0x78/0x1b0
> [ 68.392336] vfs_read+0xa7/0x1b0
> [ 68.393328] ksys_read+0x68/0xe0
> [ 68.394308] do_syscall_64+0x3b/0x90
> [ 68.395391] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 68.396804] RIP: 0033:0x7fcc11cf92e2
> [ 68.397824] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ea 2e 0a 00 e8 95 e9 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> [ 68.402420] RSP: 002b:00007ffd6e0f8da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 68.404357] RAX: ffffffffffffffda RBX: 0000565439305b20 RCX: 00007fcc11cf92e2
> [ 68.406061] RDX: 0000000000800000 RSI: 00007fcc0f980010 RDI: 0000000000000003
> [ 68.407747] RBP: 00007fcc11dcd300 R08: 0000000000000003 R09: 00007fcc0d980010
> [ 68.410937] R10: 0000000003826000 R11: 0000000000000246 R12: 00007fcc0f980010
> [ 68.412624] R13: 0000000000000d68 R14: 00007fcc11dcc700 R15: 0000000000800000
> [ 68.414322] Modules linked in: intel_rapl_msr intel_rapl_common nfit kvm_intel kvm irqbypass rapl iTCO_wdt iTCO_vendor_support i2c_i801 i2c_smbus lpc_ich drm drm_panel_orientation_quirks zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
> [ 68.419591] ---[ end trace e2c30f827226966b ]---
> [ 68.420969] RIP: 0010:kern_addr_valid+0x150/0x300
> [ 68.422308] Code: 1f 40 00 48 8b 0d e8 12 61 01 48 85 f6 0f 85 ca 00 00 00 48 81 e1 00 f0 ff ff 48 21 c1 48 b8 00 00 00 00 80 88 ff ff 48 01 ca <48> 8b 3c 02 48 f7 c7 9f ff ff ff 0f 84 d8 fe ff ff 48 89 f8 0f 1f
> [ 68.426826] RSP: 0018:ffffc90000bcbc38 EFLAGS: 00010206
> [ 68.428150] RAX: ffff888000000000 RBX: 0000000000001000 RCX: 000ffffffcbff000
> [ 68.429813] RDX: 000ffffffcbff000 RSI: 0000000000000000 RDI: 800ffffffcbff062
> [ 68.431465] RBP: ffffc90000bcbea8 R08: 0000000000001000 R09: 0000000000000000
> [ 68.433115] R10: 0000000000000000 R11: 0000000000001000 R12: 00007fcc0fd80010
> [ 68.434768] R13: ffffffff83400000 R14: 0000000000400000 R15: ffffffff843d23e0
> [ 68.436423] FS: 00007fcc111fcc80(0000) GS:ffff888275e00000(0000) knlGS:0000000000000000
> [ 68.438354] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 68.442077] CR2: 00007fcc0fd80000 CR3: 000000011226e004 CR4: 0000000000770ee0
> [ 68.443727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 68.445370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 68.447010] PKRU: 55555554
>
> The fault happens because kern_addr_valid() dereferences existent but not
> present PMD in the high kernel mappings.
>
> Such PMDs are created when free_kernel_image_pages() frees regions larger
> than 2Mb. In this case a part of the freed memory is mapped with PMDs and
> the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
> mark the PMD as not present rather than wipe it completely.
>
> Make kern_addr_valid() to check whether higher level page table entries are
> present before trying to dereference them to fix this issue and to avoid
> similar issues in the future.
>
> Reported-by: Jiri Olsa <[email protected]>
> Signed-off-by: Mike Rapoport <[email protected]>
> Cc: <[email protected]> # 4.4+
> ---
>
> v2:
> * drop pXd_none() checks and leave only pXd_present(), per David
>
> v1: https://lore.kernel.org/lkml/[email protected]
>
> arch/x86/mm/init_64.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index ddeaba947eb3..879886c6cc53 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
>  		return 0;
>
>  	p4d = p4d_offset(pgd, addr);
> -	if (p4d_none(*p4d))
> +	if (!p4d_present(*p4d))
>  		return 0;
>
>  	pud = pud_offset(p4d, addr);
> -	if (pud_none(*pud))
> +	if (!pud_present(*pud))
>  		return 0;
>
>  	if (pud_large(*pud))
>  		return pfn_valid(pud_pfn(*pud));
>
>  	pmd = pmd_offset(pud, addr);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		return 0;
>
>  	if (pmd_large(*pmd))
> --
> 2.28.0
>

--
Sincerely yours,
Mike.

2021-09-08 10:37:25

by Borislav Petkov

Subject: Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

On Wed, Aug 25, 2021 at 11:47:10AM -0700, Dave Hansen wrote:
> On 8/19/21 6:27 AM, Mike Rapoport wrote:
> > Such PMDs are created when free_kernel_image_pages() frees regions larger
> > than 2Mb. In this case a part of the freed memory is mapped with PMDs and
> > the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
> > mark the PMD as not present rather than wipe it completely.
> >
> > Make kern_addr_valid() to check whether higher level page table entries are
> > present before trying to dereference them to fix this issue and to avoid
> > similar issues in the future.
> >
> > Reported-by: Jiri Olsa <[email protected]>
> > Signed-off-by: Mike Rapoport <[email protected]>
> > Cc: <[email protected]> # 4.4...
> > pmd = pmd_offset(pud, addr);
> > - if (pmd_none(*pmd))
> > + if (!pmd_present(*pmd))
> > return 0;
>
> Yeah, that seems like the right fix. The one kern_addr_valid() user is
> going to touch the memory so it *better* be present. p*d_none() was
> definitely the wrong check.
>
> Acked-by: Dave Hansen <[email protected]>

So I did stare at this for a while, trying to make sense of it and David
Hildenbrand asked for a Fixes: tag in v1 review and from doing a bit of
git archeology I think it should be:

c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping")

because that thing added the clearing of the Present bit for the high
kernel image mapping of those areas.

Right?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-09-08 10:55:17

by Borislav Petkov

Subject: Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

On Wed, Sep 08, 2021 at 12:35:21PM +0200, Borislav Petkov wrote:
> So I did stare at this for a while, trying to make sense of it and David
> Hildenbrand asked for a Fixes: tag in v1 review and from doing a bit of
> git archeology I think it should be:
>
> c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping")
>
> because that thing added the clearing of the Present bit for the high
> kernel image mapping of those areas.
>
> Right?

Hmm, but that commit is in v4.19. Mike has added

Cc: <[email protected]> # 4.4+

Mike, why 4.4 and newer?

Hmmm.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-09-08 11:35:56

by Borislav Petkov

Subject: Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

On Wed, Sep 08, 2021 at 02:22:31PM +0300, Mike Rapoport wrote:
> kern_addr_valid() wrongly uses pxy_none() rather than pxy_present() because
> according to 9a14aefc1d28 ("x86: cpa, fix lookup_address") there could be
> cases when page table entries exist but they are not valid.
> So a call to kern_addr_valid() for an address in the direct map would oops.
>
> I've stopped digging at 9a14aefc1d28 (which is in v2.6.26) and added the
> oldest stable we still support (4.4).
>
> I agree that before 4.19 it's more of a theoretical bug, but you know,
> things happen...

Hmmkay, I guess I should add the gist of that to the commit message so
that it is explained why 4.4.

I'm assuming the pxy_present() check is more strict than pxy_none() so
that backporting to all stable kernels should not introduce any risks...
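
The assumption above can be checked on a toy model. The following is a user-space sketch, not kernel code: toy_none(), toy_present() and the single-bit TOY_PRESENT flag are hypothetical simplifications, and the real x86 helpers are more involved. Under that model, every entry accepted by the present check is also accepted by the not-none check, so switching checks can only reject more entries.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified model: bit 0 stands in for the hardware Present bit. */
#define TOY_PRESENT 0x1ULL

/* "none": the entry was never populated, all bits are clear. */
int toy_none(uint64_t entry)
{
	return entry == 0;
}

/* "present": a walker may follow the entry to the next level. */
int toy_present(uint64_t entry)
{
	return (entry & TOY_PRESENT) != 0;
}

/* Old-style check: only empty entries are rejected. */
int old_check_passes(uint64_t entry)
{
	return !toy_none(entry);
}

/* New-style check: everything that is not present is rejected. */
int new_check_passes(uint64_t entry)
{
	return toy_present(entry);
}

/* Print how both checks classify one entry. */
void toy_report(const char *name, uint64_t entry)
{
	printf("%-22s old=%d new=%d\n", name,
	       old_check_passes(entry), new_check_passes(entry));
}

/* Classify the three interesting states of an entry. */
void toy_demo(void)
{
	toy_report("empty", 0x0ULL);
	toy_report("present", 0x1000ULL | TOY_PRESENT);
	toy_report("exists, not present", 0x1000ULL);
}
```

Calling toy_demo() prints one line per entry. Only the "exists, not present" entry is treated differently by the two checks: the old check lets it through, the new one rejects it.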

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-09-08 12:24:35

by Mike Rapoport

Subject: Re: [PATCH v2] x86/mm: fix kern_addr_valid to cope with existing but not present entries

On Wed, Sep 08, 2021 at 12:52:45PM +0200, Borislav Petkov wrote:
> On Wed, Sep 08, 2021 at 12:35:21PM +0200, Borislav Petkov wrote:
> > So I did stare at this for a while, trying to make sense of it and David
> > Hildenbrand asked for a Fixes: tag in v1 review and from doing a bit of
> > git archeology I think it should be:
> >
> > c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping")
> >
> > because that thing added the clearing of the Present bit for the high
> > kernel image mapping of those areas.
> >
> > Right?

Yes, in a sense.
As the only user of kern_addr_valid() is kcore and it only uses this check
for high kernel mappings, there should be no problem before 4.19.

But...

> Hmm, but that commit is in v4.19. Mike has added
>
> Cc: <[email protected]> # 4.4+
>
> Mike, why 4.4 and newer?

kern_addr_valid() wrongly uses pxy_none() rather than pxy_present() because
according to 9a14aefc1d28 ("x86: cpa, fix lookup_address") there could be
cases when page table entries exist but they are not valid.
So a call to kern_addr_valid() for an address in the direct map would oops.

I've stopped digging at 9a14aefc1d28 (which is in v2.6.26) and added the
oldest stable we still support (4.4).

I agree that before 4.19 it's more of a theoretical bug, but you know,
things happen...
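
To make the failure mode concrete, here is a toy two-level walk in user space. It is only a sketch under stated assumptions: tables are addressed by index rather than by physical address, and toy_mk_entry(), toy_walk() and the TOY_* names are invented for illustration. A stale entry whose Present bit is clear still carries bits in its address field; a walker that rejects only empty entries follows it, which here is reported as TOY_WOULD_FAULT and in the reported oops showed up as an access to a non-canonical address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PRESENT	0x1ULL
#define TOY_ENTRIES	4
#define TOY_TABLES	2

enum toy_result { TOY_WOULD_FAULT = -1, TOY_INVALID = 0, TOY_VALID = 1 };

/* Lower-level tables, addressed by index instead of physical address. */
static uint64_t toy_lower[TOY_TABLES][TOY_ENTRIES];

/* Mark exactly one lower-level slot as present. */
void toy_setup(void)
{
	toy_lower[1][2] = TOY_PRESENT;
}

/* Build an upper-level entry that refers to lower table `idx`. */
uint64_t toy_mk_entry(uint64_t idx, int present)
{
	return (idx << 12) | (present ? TOY_PRESENT : 0);
}

/*
 * Look up `slot` below one upper-level entry.
 * strict != 0: reject entries that are not present (the patched behaviour).
 * strict == 0: reject only all-zero entries (the old behaviour).
 */
int toy_walk(uint64_t upper, size_t slot, int strict)
{
	uint64_t idx;

	if (strict ? !(upper & TOY_PRESENT) : upper == 0)
		return TOY_INVALID;

	idx = upper >> 12;
	if (idx >= TOY_TABLES || slot >= TOY_ENTRIES)
		return TOY_WOULD_FAULT;	/* stand-in for following a bogus pointer */

	return (toy_lower[idx][slot] & TOY_PRESENT) ? TOY_VALID : TOY_INVALID;
}
```

After toy_setup(), walking toy_mk_entry(1, 1) at slot 2 is valid with either check. Walking the stale entry toy_mk_entry(0xfffff, 0) gives TOY_WOULD_FAULT with the old check and TOY_INVALID with the strict one.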

> Hmmm.

--
Sincerely yours,
Mike.

2021-09-08 19:08:49

by tip-bot2 for Mike Rapoport

Subject: [tip: x86/urgent] x86/mm: Fix kern_addr_valid() to cope with existing but not present entries

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 34b1999da935a33be6239226bfa6cd4f704c5c88
Gitweb: https://git.kernel.org/tip/34b1999da935a33be6239226bfa6cd4f704c5c88
Author: Mike Rapoport <[email protected]>
AuthorDate: Thu, 19 Aug 2021 16:27:17 +03:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Wed, 08 Sep 2021 20:50:32 +02:00

x86/mm: Fix kern_addr_valid() to cope with existing but not present entries

Jiri Olsa reported a fault when running:

# cat /proc/kallsyms | grep ksys_read
ffffffff8136d580 T ksys_read
# objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore

/proc/kcore: file format elf64-x86-64

Segmentation fault

general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI
CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014
RIP: 0010:kern_addr_valid
Call Trace:
read_kcore
? rcu_read_lock_sched_held
? rcu_read_lock_sched_held
? rcu_read_lock_sched_held
? trace_hardirqs_on
? rcu_read_lock_sched_held
? lock_acquire
? lock_acquire
? rcu_read_lock_sched_held
? lock_acquire
? rcu_read_lock_sched_held
? rcu_read_lock_sched_held
? rcu_read_lock_sched_held
? lock_release
? _raw_spin_unlock
? __handle_mm_fault
? rcu_read_lock_sched_held
? lock_acquire
? rcu_read_lock_sched_held
? lock_release
proc_reg_read
? vfs_read
vfs_read
ksys_read
do_syscall_64
entry_SYSCALL_64_after_hwframe

The fault happens because kern_addr_valid() dereferences existent but not
present PMD in the high kernel mappings.

Such PMDs are created when free_kernel_image_pages() frees regions larger
than 2Mb. In this case, a part of the freed memory is mapped with PMDs and
the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will
mark the PMD as not present rather than wipe it completely.
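
As a rough illustration of why only part of a large freed region is affected, the arithmetic below picks out the whole 2 MiB chunks inside a half-open range. This is a user-space sketch, not the kernel's code path: toy_whole_pmds() and struct toy_span are hypothetical names, and the only hardware fact assumed is that one x86-64 PMD entry spans 2 MiB with 4 KiB base pages.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PMD_SIZE	(2ULL << 20)	/* 2 MiB covered by one PMD entry */
#define TOY_PMD_MASK	(~(TOY_PMD_SIZE - 1))

/* Half-open address range [start, end). */
struct toy_span {
	uint64_t start;
	uint64_t end;
};

/* Round an address up to the next 2 MiB boundary. */
uint64_t toy_pmd_align_up(uint64_t addr)
{
	return (addr + TOY_PMD_SIZE - 1) & TOY_PMD_MASK;
}

/* Round an address down to the previous 2 MiB boundary. */
uint64_t toy_pmd_align_down(uint64_t addr)
{
	return addr & TOY_PMD_MASK;
}

/*
 * Return the part of [start, end) that consists of whole 2 MiB chunks.
 * The result is the empty span {0, 0} when no whole chunk fits.
 */
struct toy_span toy_whole_pmds(uint64_t start, uint64_t end)
{
	struct toy_span s = { toy_pmd_align_up(start), toy_pmd_align_down(end) };

	if (s.start >= s.end)
		s.start = s.end = 0;
	return s;
}

/* Number of whole 2 MiB chunks inside [start, end). */
uint64_t toy_whole_pmd_count(uint64_t start, uint64_t end)
{
	struct toy_span s = toy_whole_pmds(start, end);

	return (s.end - s.start) / TOY_PMD_SIZE;
}
```

For a range from 1 MiB to 7 MiB this yields two whole chunks, [2 MiB, 6 MiB). For a range from 1 MiB to 3.5 MiB it yields none, even though the range is larger than 2 MiB, because no aligned chunk fits inside it.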

Have kern_addr_valid() check whether higher level page table entries are
present before trying to dereference them to fix this issue and to avoid
similar issues in the future.

Stable backporting note:
------------------------

Note that the stable marking is for all active stable branches because
there could be cases where pagetable entries exist but are not valid -
see 9a14aefc1d28 ("x86: cpa, fix lookup_address"), for example. So make
sure to be on the safe side here and use pXY_present() accessors rather
than pXY_none() which could #GP when accessing pages in the direct map.

Also see:

c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping")

for more info.

Reported-by: Jiri Olsa <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Jiri Olsa <[email protected]>
Cc: <[email protected]> # 4.4+
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/mm/init_64.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ddeaba9..879886c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1433,18 +1433,18 @@ int kern_addr_valid(unsigned long addr)
 		return 0;

 	p4d = p4d_offset(pgd, addr);
-	if (p4d_none(*p4d))
+	if (!p4d_present(*p4d))
 		return 0;

 	pud = pud_offset(p4d, addr);
-	if (pud_none(*pud))
+	if (!pud_present(*pud))
 		return 0;

 	if (pud_large(*pud))
 		return pfn_valid(pud_pfn(*pud));

 	pmd = pmd_offset(pud, addr);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		return 0;

 	if (pmd_large(*pmd))