To clear a user buffer we cannot simply use memset, we have to use
clear_user(). Using a kernel config based on rawhide Fedora and a
virtio-mem device that registers a vmcore_cb, I can easily trigger:
[ 11.327580] systemd[1]: Starting Kdump Vmcore Save Service...
[ 11.339697] kdump[420]: Kdump is using the default log level(3).
[ 11.370964] kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[ 11.373997] kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
[ 11.385357] kdump[465]: saving vmcore-dmesg.txt complete
[ 11.386722] kdump[467]: saving vmcore
[ 16.531275] BUG: unable to handle page fault for address: 00007f2374e01000
[ 16.531705] #PF: supervisor write access in kernel mode
[ 16.532037] #PF: error_code(0x0003) - permissions violation
[ 16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
[ 16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
[ 16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
[ 16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
[ 16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
[ 16.535670] RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
[ 16.535998] RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
[ 16.536441] RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
[ 16.536878] RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
[ 16.537315] R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
[ 16.537755] R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
[ 16.538200] FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
[ 16.538696] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 16.539055] CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
[ 16.539510] Call Trace:
[ 16.539679] <TASK>
[ 16.539828] read_vmcore+0x236/0x2c0
[ 16.540063] ? enqueue_hrtimer+0x2f/0x80
[ 16.540323] ? inode_security+0x22/0x60
[ 16.540572] proc_reg_read+0x55/0xa0
[ 16.540807] vfs_read+0x95/0x190
[ 16.541022] ksys_read+0x4f/0xc0
[ 16.541238] do_syscall_64+0x3b/0x90
[ 16.541475] entry_SYSCALL_64_after_hwframe+0x44/0xae
To fix, properly use clear_user() when required.
Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Cc: Dave Young <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Philipp Rudo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: David Hildenbrand <[email protected]>
---
fs/proc/vmcore.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 30a3b66f475a..509f85148fee 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -154,9 +154,13 @@ ssize_t read_from_oldmem(char *buf, size_t count,
nr_bytes = count;
/* If pfn is not ram, return zeros for sparse dump files */
- if (!pfn_is_ram(pfn))
- memset(buf, 0, nr_bytes);
- else {
+ if (!pfn_is_ram(pfn)) {
+ tmp = 0;
+ if (!userbuf)
+ memset(buf, 0, nr_bytes);
+ else if (clear_user(buf, nr_bytes))
+ tmp = -EFAULT;
+ } else {
if (encrypted)
tmp = copy_oldmem_page_encrypted(pfn, buf,
nr_bytes,
@@ -165,12 +169,12 @@ ssize_t read_from_oldmem(char *buf, size_t count,
else
tmp = copy_oldmem_page(pfn, buf, nr_bytes,
offset, userbuf);
-
- if (tmp < 0) {
- up_read(&vmcore_cb_rwsem);
- return tmp;
- }
}
+ if (tmp < 0) {
+ up_read(&vmcore_cb_rwsem);
+ return tmp;
+ }
+
*ppos += nr_bytes;
count -= nr_bytes;
buf += nr_bytes;
base-commit: debe436e77c72fcee804fb867f275e6d31aa999c
--
2.31.1
On 11/11/21 at 08:18pm, David Hildenbrand wrote:
> To clear a user buffer we cannot simply use memset, we have to use
> clear_user(). Using a kernel config based on rawhide Fedora and a
> virtio-mem device that registers a vmcore_cb, I can easily trigger:
>
> [ 11.327580] systemd[1]: Starting Kdump Vmcore Save Service...
> [ 11.339697] kdump[420]: Kdump is using the default log level(3).
> [ 11.370964] kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
> [ 11.373997] kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
> [ 11.385357] kdump[465]: saving vmcore-dmesg.txt complete
> [ 11.386722] kdump[467]: saving vmcore
> [ 16.531275] BUG: unable to handle page fault for address: 00007f2374e01000
> [ 16.531705] #PF: supervisor write access in kernel mode
> [ 16.532037] #PF: error_code(0x0003) - permissions violation
> [ 16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
> [ 16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI
> [ 16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
> [ 16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
> [ 16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
> [ 16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
> [ 16.535670] RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
> [ 16.535998] RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
> [ 16.536441] RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
> [ 16.536878] RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
> [ 16.537315] R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
> [ 16.537755] R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
> [ 16.538200] FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
> [ 16.538696] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 16.539055] CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
> [ 16.539510] Call Trace:
> [ 16.539679] <TASK>
> [ 16.539828] read_vmcore+0x236/0x2c0
> [ 16.540063] ? enqueue_hrtimer+0x2f/0x80
> [ 16.540323] ? inode_security+0x22/0x60
> [ 16.540572] proc_reg_read+0x55/0xa0
> [ 16.540807] vfs_read+0x95/0x190
> [ 16.541022] ksys_read+0x4f/0xc0
> [ 16.541238] do_syscall_64+0x3b/0x90
> [ 16.541475] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> To fix, properly use clear_user() when required.
Looks a great fix to me, thanks for fixing this.
Check the code, clear_user invokes access_ok to do check, then call
memset(). It's unclear to me how the bug is triggered, could you
please tell more so that I can learn?
>
> Fixes: 997c136f518c ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
> Cc: Dave Young <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Philipp Rudo <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> fs/proc/vmcore.c | 20 ++++++++++++--------
> 1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 30a3b66f475a..509f85148fee 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -154,9 +154,13 @@ ssize_t read_from_oldmem(char *buf, size_t count,
> nr_bytes = count;
>
> /* If pfn is not ram, return zeros for sparse dump files */
> - if (!pfn_is_ram(pfn))
> - memset(buf, 0, nr_bytes);
> - else {
> + if (!pfn_is_ram(pfn)) {
> + tmp = 0;
> + if (!userbuf)
> + memset(buf, 0, nr_bytes);
> + else if (clear_user(buf, nr_bytes))
> + tmp = -EFAULT;
> + } else {
> if (encrypted)
> tmp = copy_oldmem_page_encrypted(pfn, buf,
> nr_bytes,
> @@ -165,12 +169,12 @@ ssize_t read_from_oldmem(char *buf, size_t count,
> else
> tmp = copy_oldmem_page(pfn, buf, nr_bytes,
> offset, userbuf);
> -
> - if (tmp < 0) {
> - up_read(&vmcore_cb_rwsem);
> - return tmp;
> - }
> }
> + if (tmp < 0) {
> + up_read(&vmcore_cb_rwsem);
> + return tmp;
> + }
> +
> *ppos += nr_bytes;
> count -= nr_bytes;
> buf += nr_bytes;
>
> base-commit: debe436e77c72fcee804fb867f275e6d31aa999c
> --
> 2.31.1
>
On 12.11.21 08:01, Baoquan He wrote:
> On 11/11/21 at 08:18pm, David Hildenbrand wrote:
>> To clear a user buffer we cannot simply use memset, we have to use
>> clear_user(). Using a kernel config based on rawhide Fedora and a
>> virtio-mem device that registers a vmcore_cb, I can easily trigger:
>>
>> [ 11.327580] systemd[1]: Starting Kdump Vmcore Save Service...
>> [ 11.339697] kdump[420]: Kdump is using the default log level(3).
>> [ 11.370964] kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
>> [ 11.373997] kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
>> [ 11.385357] kdump[465]: saving vmcore-dmesg.txt complete
>> [ 11.386722] kdump[467]: saving vmcore
>> [ 16.531275] BUG: unable to handle page fault for address: 00007f2374e01000
>> [ 16.531705] #PF: supervisor write access in kernel mode
>> [ 16.532037] #PF: error_code(0x0003) - permissions violation
>> [ 16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
>> [ 16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI
>> [ 16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
>> [ 16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
>> [ 16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
>> [ 16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
>> [ 16.535670] RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
>> [ 16.535998] RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
>> [ 16.536441] RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
>> [ 16.536878] RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
>> [ 16.537315] R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
>> [ 16.537755] R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
>> [ 16.538200] FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
>> [ 16.538696] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 16.539055] CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
>> [ 16.539510] Call Trace:
>> [ 16.539679] <TASK>
>> [ 16.539828] read_vmcore+0x236/0x2c0
>> [ 16.540063] ? enqueue_hrtimer+0x2f/0x80
>> [ 16.540323] ? inode_security+0x22/0x60
>> [ 16.540572] proc_reg_read+0x55/0xa0
>> [ 16.540807] vfs_read+0x95/0x190
>> [ 16.541022] ksys_read+0x4f/0xc0
>> [ 16.541238] do_syscall_64+0x3b/0x90
>> [ 16.541475] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>
>> To fix, properly use clear_user() when required.
>
> Looks a great fix to me, thanks for fixing this.
>
> Check the code, clear_user invokes access_ok to do check, then call
> memset(). It's unclear to me how the bug is triggered, could you
> please tell more so that I can learn?
>
TBH, I was testing virtio-mem+vmcore before without running into this
issue, but after I retested with upstream in a different setup
(different kernel config but eventually also different CPU features), I
ran into this.
Note that you were looking at the generic __clear_user() implementation,
the x86-64 variant is different, see arch/x86/lib/usercopy_64.c
I can spot that it triggers stac()/clac() (X86_SMAP):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
"that allows supervisor mode programs to optionally set user-space
memory mappings so that access to those mappings from supervisor mode
will cause a trap. This makes it harder for malicious programs to
"trick" the kernel into using instructions or data from a user-space
program"
Yes, that's most probably it :)
--
Thanks,
David / dhildenb
On 11/12/21 at 09:16am, David Hildenbrand wrote:
> On 12.11.21 08:01, Baoquan He wrote:
> > On 11/11/21 at 08:18pm, David Hildenbrand wrote:
> >> To clear a user buffer we cannot simply use memset, we have to use
> >> clear_user(). Using a kernel config based on rawhide Fedora and a
> >> virtio-mem device that registers a vmcore_cb, I can easily trigger:
> >>
> >> [ 11.327580] systemd[1]: Starting Kdump Vmcore Save Service...
> >> [ 11.339697] kdump[420]: Kdump is using the default log level(3).
> >> [ 11.370964] kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
> >> [ 11.373997] kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
> >> [ 11.385357] kdump[465]: saving vmcore-dmesg.txt complete
> >> [ 11.386722] kdump[467]: saving vmcore
> >> [ 16.531275] BUG: unable to handle page fault for address: 00007f2374e01000
> >> [ 16.531705] #PF: supervisor write access in kernel mode
> >> [ 16.532037] #PF: error_code(0x0003) - permissions violation
> >> [ 16.532396] PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
> >> [ 16.532872] Oops: 0003 [#1] PREEMPT SMP NOPTI
> >> [ 16.533154] CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
> >> [ 16.533513] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
> >> [ 16.534198] RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
> >> [ 16.534552] Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
> >> [ 16.535670] RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
> >> [ 16.535998] RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
> >> [ 16.536441] RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
> >> [ 16.536878] RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
> >> [ 16.537315] R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
> >> [ 16.537755] R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
> >> [ 16.538200] FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
> >> [ 16.538696] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 16.539055] CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
> >> [ 16.539510] Call Trace:
> >> [ 16.539679] <TASK>
> >> [ 16.539828] read_vmcore+0x236/0x2c0
> >> [ 16.540063] ? enqueue_hrtimer+0x2f/0x80
> >> [ 16.540323] ? inode_security+0x22/0x60
> >> [ 16.540572] proc_reg_read+0x55/0xa0
> >> [ 16.540807] vfs_read+0x95/0x190
> >> [ 16.541022] ksys_read+0x4f/0xc0
> >> [ 16.541238] do_syscall_64+0x3b/0x90
> >> [ 16.541475] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>
> >> To fix, properly use clear_user() when required.
> >
> > Looks a great fix to me, thanks for fixing this.
> >
> > Check the code, clear_user invokes access_ok to do check, then call
> > memset(). It's unclear to me how the bug is triggered, could you
> > please tell more so that I can learn?
> >
> TBH, I was testing virtio-mem+vmcore before without running into this
> issue, but after I retested with upstream in a different setup
> (different kernel config but eventually also different CPU features), I
> ran into this.
>
>
> Note that you were looking at the generic __clear_user() implementation,
> the x86-64 variant is different, see arch/x86/lib/usercopy_64.c
>
> I can spot that it triggers stac()/clac() (X86_SMAP):
> https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
>
> "that allows supervisor mode programs to optionally set user-space
> memory mappings so that access to those mappings from supervisor mode
> will cause a trap. This makes it harder for malicious programs to
> "trick" the kernel into using instructions or data from a user-space
> program"
OK, probably. I thought it's triggered in access_ok(), and tried to
figure out why. But seems we should do something to check this in
access_ok(), otherwise the logic of clear_user/_clear_user is not so
reasonable. Anyway, I have learned it, thanks a lot for digging it out.
By the way, I can't open above wiki article, found below commit from
hpa. Maybe we can add some into log to tell this, not strong opinin,
leave it to you.
For this patch, looks good to me.
Acked-by: Baoquan He <[email protected]>
~~~~~~~~~
commit 63bcff2a307b9bcc712a8251eb27df8b2e117967
Author: H. Peter Anvin <[email protected]>
Date: Fri Sep 21 12:43:12 2012 -0700
x86, smap: Add STAC and CLAC instructions to control user space access
When Supervisor Mode Access Prevention (SMAP) is enabled, access to
userspace from the kernel is controlled by the AC flag. To make the
performance of manipulating that flag acceptable, there are two new
instructions, STAC and CLAC, to set and clear it.
This patch adds those instructions, via alternative(), when the SMAP
feature is enabled. It also adds X86_EFLAGS_AC unconditionally to the
SYSCALL entry mask; there is simply no reason to make that one
conditional.
Signed-off-by: H. Peter Anvin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
> > "that allows supervisor mode programs to optionally set user-space
> > memory mappings so that access to those mappings from supervisor mode
> > will cause a trap. This makes it harder for malicious programs to
> > "trick" the kernel into using instructions or data from a user-space
> > program"
>
> OK, probably. I thought it's triggered in access_ok(), and tried to
> figure out why. But seems we should do something to check this in
> access_ok(), otherwise the logic of clear_user/_clear_user is not so
> reasonable. Anyway, I have learned it, thanks a lot for digging it out.
>
> By the way, I can't open above wiki article, found below commit from
> hpa. Maybe we can add some into log to tell this, not strong opinin,
> leave it to you.
Yes, now that we know the root cause I'll add some more details to the
patch description and resend -- thanks Baoquan!
On 11/12/21 at 10:08am, David Hildenbrand wrote:
> > > "that allows supervisor mode programs to optionally set user-space
> > > memory mappings so that access to those mappings from supervisor mode
> > > will cause a trap. This makes it harder for malicious programs to
> > > "trick" the kernel into using instructions or data from a user-space
> > > program"
> >
> > OK, probably. I thought it's triggered in access_ok(), and tried to
> > figure out why. But seems we should do something to check this in
> > access_ok(), otherwise the logic of clear_user/_clear_user is not so
> > reasonable. Anyway, I have learned it, thanks a lot for digging it out.
> >
> > By the way, I can't open above wiki article, found below commit from
> > hpa. Maybe we can add some into log to tell this, not strong opinin,
> > leave it to you.
>
> Yes, now that we know the root cause I'll add some more details to the
> patch description and resend -- thanks Baoquan!
Thanks for sending v2.