The sysfs interface to seq_file continues to be rather fragile
(seq_get_buf() should not be used outside of seq_file), as seen with
some recent exploits[1]. Move the seq_file buffer to the vmap area
(while retaining the accounting flag), since it has guard pages that will
catch and stop linear overflows. This seems justified given that sysfs's
use of seq_file almost always already uses PAGE_SIZE allocations, has
normally short-lived allocations, and is not normally on a performance
critical path.
Once seq_get_buf() has been removed (and all sysfs callbacks using
seq_file directly), this change can also be removed.
[1] https://blog.grimm-co.com/2021/03/new-old-bugs-in-linux-kernel.html
Signed-off-by: Kees Cook <[email protected]>
---
fs/sysfs/file.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 9aefa7779b29..351ff75a97b1 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -16,6 +16,7 @@
#include <linux/mutex.h>
#include <linux/seq_file.h>
#include <linux/mm.h>
+#include <linux/vmalloc.h>
#include "sysfs.h"
@@ -32,6 +33,31 @@ static const struct sysfs_ops *sysfs_file_ops(struct kernfs_node *kn)
return kobj->ktype ? kobj->ktype->sysfs_ops : NULL;
}
+/*
+ * To be proactively defensive against sysfs show() handlers that do not
+ * correctly stay within their PAGE_SIZE buffer, use the vmap area to gain
+ * the trailing guard page which will stop linear buffer overflows.
+ */
+static void *sysfs_kf_seq_start(struct seq_file *sf, loff_t *ppos)
+{
+ struct kernfs_open_file *of = sf->private;
+ struct kernfs_node *kn = of->kn;
+
+ WARN_ON_ONCE(sf->buf);
+ sf->buf = __vmalloc(kn->attr.size, GFP_KERNEL_ACCOUNT);
+ if (!sf->buf)
+ return ERR_PTR(-ENOMEM);
+ sf->size = kn->attr.size;
+
+ /*
+ * Use the same behavior and code as single_open(): continue
+ * if pos is at the beginning; otherwise, NULL.
+ */
+ if (*ppos)
+ return NULL;
+ return SEQ_OPEN_SINGLE;
+}
+
/*
* Reads on sysfs are handled through seq_file, which takes care of hairy
* details like buffering and seeking. The following function pipes
@@ -206,14 +232,17 @@ static const struct kernfs_ops sysfs_file_kfops_empty = {
};
static const struct kernfs_ops sysfs_file_kfops_ro = {
+ .seq_start = sysfs_kf_seq_start,
.seq_show = sysfs_kf_seq_show,
};
static const struct kernfs_ops sysfs_file_kfops_wo = {
+ .seq_start = sysfs_kf_seq_start,
.write = sysfs_kf_write,
};
static const struct kernfs_ops sysfs_file_kfops_rw = {
+ .seq_start = sysfs_kf_seq_start,
.seq_show = sysfs_kf_seq_show,
.write = sysfs_kf_write,
};
--
2.25.1
On Thu, Apr 01, 2021 at 03:13:20PM -0700, Kees Cook wrote:
> The sysfs interface to seq_file continues to be rather fragile
> (seq_get_buf() should not be used outside of seq_file), as seen with
> some recent exploits[1]. Move the seq_file buffer to the vmap area
> (while retaining the accounting flag), since it has guard pages that will
> catch and stop linear overflows. This seems justified given that sysfs's
> use of seq_file almost always already uses PAGE_SIZE allocations, has
> normally short-lived allocations, and is not normally on a performance
> critical path.
This looks completely weird to me. In the end sysfs uses nothing
of the seq_file infrastructure, so why do we even pretend to use it?
Just switch sysfs_file_kfops_ro and sysfs_file_kfops_rw from using
->seq_show to ->read and do the vmalloc there instead of pretending
this is a seq_file.
> Once seq_get_buf() has been removed (and all sysfs callbacks using
> seq_file directly), this change can also be removed.
And with sysfs out of the way I think kiling off the other few users
should be pretty easy as well.
On Fri, Apr 02, 2021 at 08:32:21AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 01, 2021 at 03:13:20PM -0700, Kees Cook wrote:
> > The sysfs interface to seq_file continues to be rather fragile
> > (seq_get_buf() should not be used outside of seq_file), as seen with
> > some recent exploits[1]. Move the seq_file buffer to the vmap area
> > (while retaining the accounting flag), since it has guard pages that will
> > catch and stop linear overflows. This seems justified given that sysfs's
> > use of seq_file almost always already uses PAGE_SIZE allocations, has
> > normally short-lived allocations, and is not normally on a performance
> > critical path.
>
> This looks completely weird to me. In the end sysfs uses nothing
> of the seq_file infrastructure, so why do we even pretend to use it?
> Just switch sysfs_file_kfops_ro and sysfs_file_kfops_rw from using
> ->seq_show to ->read and do the vmalloc there instead of pretending
> this is a seq_file.
As far as I can tell it's a result of kernfs using seq_file, but sysfs
never converted all its callbacks to use seq_file.
> > Once seq_get_buf() has been removed (and all sysfs callbacks using
> > seq_file directly), this change can also be removed.
>
> And with sysfs out of the way I think kiling off the other few users
> should be pretty easy as well.
Let me look at switching to "read" ... it is a twisty maze. :)
--
Kees Cook
Greeting,
FYI, we noticed the following commit (built with gcc-9):
commit: 5f65c1f63b0db3b6e346b8cee60b15a740052203 ("[PATCH v4 3/3] sysfs: Unconditionally use vmalloc for buffer")
url: https://github.com/0day-ci/linux/commits/Kees-Cook/sysfs-Unconditionally-use-vmalloc-for-buffer/20210402-061521
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git d225ef6fda7ce9ff7d28764bd1cceea2d0215e8b
in testcase: xfstests
version: xfstests-x86_64-73c0871-1_20210401
with following parameters:
disk: 4HDD
fs: btrfs
test: generic-group-19
ucode: 0xe2
test-description: xfstests is a regression test suite for xfs and other files ystems.
test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
on test machine: 4 threads Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz with 16G memory
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>
[ 27.280968] WARNING: CPU: 1 PID: 1 at fs/sysfs/file.c:46 sysfs_kf_seq_start (kbuild/src/consumer/fs/sysfs/file.c:46 (discriminator 1))
[ 27.288864] Modules linked in:
[ 27.291965] CPU: 1 PID: 1 Comm: systemd Not tainted 5.12.0-rc4-00029-g5f65c1f63b0d #1
[ 27.299882] Hardware name: HP HP Z238 Microtower Workstation/8183, BIOS N51 Ver. 01.63 10/05/2017
[ 27.308817] RIP: 0010:sysfs_kf_seq_start (kbuild/src/consumer/fs/sysfs/file.c:46 (discriminator 1))
[ 27.313656] Code: 03 48 85 c0 74 18 49 8b 44 24 50 48 89 43 08 31 c0 48 83 7d 00 00 5b 0f 94 c0 5d 41 5c c3 5b 48 c7 c0 f4 ff ff ff 5d 41 5c c3 <0f> 0b eb c1 0f 1f 44 00 00 0f 1f 44 00 00 41 55 49 89 d5 41 54 55
All code
========
0: 03 48 85 add -0x7b(%rax),%ecx
3: c0 (bad)
4: 74 18 je 0x1e
6: 49 8b 44 24 50 mov 0x50(%r12),%rax
b: 48 89 43 08 mov %rax,0x8(%rbx)
f: 31 c0 xor %eax,%eax
11: 48 83 7d 00 00 cmpq $0x0,0x0(%rbp)
16: 5b pop %rbx
17: 0f 94 c0 sete %al
1a: 5d pop %rbp
1b: 41 5c pop %r12
1d: c3 retq
1e: 5b pop %rbx
1f: 48 c7 c0 f4 ff ff ff mov $0xfffffffffffffff4,%rax
26: 5d pop %rbp
27: 41 5c pop %r12
29: c3 retq
2a:* 0f 0b ud2 <-- trapping instruction
2c: eb c1 jmp 0xffffffffffffffef
2e: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
33: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
38: 41 55 push %r13
3a: 49 89 d5 mov %rdx,%r13
3d: 41 54 push %r12
3f: 55 push %rbp
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: eb c1 jmp 0xffffffffffffffc5
4: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
9: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
e: 41 55 push %r13
10: 49 89 d5 mov %rdx,%r13
13: 41 54 push %r12
15: 55 push %rbp
[ 27.332521] RSP: 0018:ffffc90000037db0 EFLAGS: 00010286
[ 27.337804] RAX: ffff88810d399780 RBX: ffff88843d9da708 RCX: 0000000000000374
[ 27.345013] RDX: 0000000000000001 RSI: ffff88843d9da730 RDI: ffff88843d9da708
[ 27.352205] RBP: ffff88843d9da730 R08: ffff88810d2d2500 R09: ffff88810cd03cf0
[ 27.359396] R10: ffffc90000037ed8 R11: 0000000000000000 R12: ffff88810d2d2500
[ 27.366580] R13: ffffc90000037e60 R14: ffff88843d9da730 R15: ffffc90000037f10
[ 27.373761] FS: 00007fc66c548940(0000) GS:ffff88843f480000(0000) knlGS:0000000000000000
[ 27.381924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 27.387725] CR2: 000056311d28f778 CR3: 000000043d3f0003 CR4: 00000000003706e0
[ 27.394914] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 27.402106] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 27.409330] Call Trace:
[ 27.411836] kernfs_seq_start (kbuild/src/consumer/fs/kernfs/file.c:120)
[ 27.415733] seq_read_iter (kbuild/src/consumer/fs/seq_file.c:222)
[ 27.419528] new_sync_read (kbuild/src/consumer/fs/read_write.c:416 (discriminator 1))
[ 27.423332] vfs_read (kbuild/src/consumer/fs/read_write.c:496)
[ 27.426704] ksys_read (kbuild/src/consumer/fs/read_write.c:634)
[ 27.429983] do_syscall_64 (kbuild/src/consumer/arch/x86/entry/common.c:46)
[ 27.433606] entry_SYSCALL_64_after_hwframe (kbuild/src/consumer/arch/x86/entry/entry_64.S:112)
[ 27.438710] RIP: 0033:0x7fc66d6f9461
[ 27.442349] Code: fe ff ff 50 48 8d 3d fe d0 09 00 e8 e9 03 02 00 66 0f 1f 84 00 00 00 00 00 48 8d 05 99 62 0d 00 8b 00 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 c3 66 0f 1f 44 00 00 41 54 49 89 d4 55 48
All code
========
0: fe (bad)
1: ff (bad)
2: ff 50 48 callq *0x48(%rax)
5: 8d 3d fe d0 09 00 lea 0x9d0fe(%rip),%edi # 0x9d109
b: e8 e9 03 02 00 callq 0x203f9
10: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
17: 00 00
19: 48 8d 05 99 62 0d 00 lea 0xd6299(%rip),%rax # 0xd62b9
20: 8b 00 mov (%rax),%eax
22: 85 c0 test %eax,%eax
24: 75 13 jne 0x39
26: 31 c0 xor %eax,%eax
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 77 57 ja 0x89
32: c3 retq
33: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
39: 41 54 push %r12
3b: 49 89 d4 mov %rdx,%r12
3e: 55 push %rbp
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 77 57 ja 0x5f
8: c3 retq
9: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
f: 41 54 push %r12
11: 49 89 d4 mov %rdx,%r12
14: 55 push %rbp
15: 48 rex.W
[ 27.461227] RSP: 002b:00007ffd808104c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 27.468845] RAX: ffffffffffffffda RBX: 000056311d28c290 RCX: 00007fc66d6f9461
[ 27.476038] RDX: 0000000000001000 RSI: 000056311d28e770 RDI: 0000000000000004
[ 27.483247] RBP: 0000000000000d68 R08: 0000000000000003 R09: 00007fc66d7cb1a0
[ 27.490432] R10: 000056311d28c010 R11: 0000000000000246 R12: 00007fc66d7c6760
[ 27.497621] R13: 00007fc66d7c72a0 R14: 0000000000100000 R15: 00007ffd80810600
[ 27.504821] ---[ end trace 2a4eabfde283610b ]---
[ 27.545425] random: lvmconfig: uninitialized urandom read (4 bytes read)
[ 27.912622] intel_pmc_core INT33A1:00: initialized
0m...
[ 27.954401] acpi PNP0C14:02: duplicate WMI GUID 2B814318-4BE8-4707-9D84-A190A859B5D0 (first instance was on PNP0C14:00)
[ 27.965332] acpi PNP0C14:02: duplicate WMI GUID 41227C2D-80E1-423F-8B8E-87E32755A0EB (first instance was on PNP0C14:00)
[ 27.976171] acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
[ 28.033742] RAPL PMU: API unit is 2^-32 Joules, 4 fixed counters, 655360 ms ovfl timer
[ 28.041736] RAPL PMU: hw unit of domain pp0-core 2^-14 Joules
[ 28.041737] RAPL PMU: hw unit of domain package 2^-14 Joules
[ 28.041738] RAPL PMU: hw unit of domain dram 2^-14 Joules
[ 28.041738] RAPL PMU: hw unit of domain pp1-gpu 2^-14 Joules
[ 28.073062] hp_wmi: query 0x4 returned error 0x5
[ 28.073614] ahci 0000:00:17.0: controller can't do SNTF, turning off CAP_SNTF
[ 28.079699] hp_wmi: query 0xd returned error 0x5
[ 28.086036] ahci 0000:00:17.0: SSS flag set, parallel bus scan disabled
[ 28.090684] input: HP WMI hotkeys as /devices/virtual/input/input6
[ 28.097362] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 4 ports 6 Gbps 0xf impl RAID mode
[ 28.111742] ahci 0000:00:17.0: flags: 64bit ncq stag pm led clo only pio slum part ems deso sadm sds apst
1;39mFlush Journ
[ 28.143722] scsi host2: ahci
al to Persistent
[ 28.148012] scsi host3: ahci
[ 28.152282] ata1: SATA max UDMA/133 abar m2048@0xd104c000 port 0xd104c100 irq 123
[ 28.161008] ata2: SATA max UDMA/133 abar m2048@0xd104c000 port 0xd104c180 irq 123
[ 28.168587] ata3: SATA max UDMA/133 abar m2048@0xd104c000 port 0xd104c200 irq 123
[ 28.168588] ata4: SATA max UDMA/133 abar m2048@0xd104c000 port 0xd104c280 irq 123
[ 28.353928] i915 0000:00:02.0: vgaarb: deactivate vga console
em Initializatio
[ 28.379255] i915 0000:00:02.0: Direct firmware load for i915/skl_dmc_ver1_27.bin failed with error -2
[ 28.389824] i915 0000:00:02.0: [drm] Failed to load DMC firmware i915/skl_dmc_ver1_27.bin. Disabling runtime power management.
[ 28.401912] i915 0000:00:02.0: [drm] DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
0m] Reached targ
[ 28.438877] intel_rapl_common: Found RAPL domain core
[ 28.445313] intel_rapl_common: Found RAPL domain uncore
[ 28.489283] ata1.00: ACPI cmd f5/00:00:00:00:00:e0 (SECURITY FREEZE LOCK) filtered out
[ 28.497929] ata1.00: ACPI cmd b1/c1:00:00:00:00:e0 (DEVICE CONFIGURATION OVERLAY) filtered out
Startin
[ 28.506825] ata1.00: ATA-8: WDC WD10EARS-00Y5B1, 80.00A80, max UDMA/133
[ 28.514715] ata1.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 32), AA
[ 28.532413] ata1.00: ACPI cmd b1/c1:00:00:00:00:e0 (DEVICE CONFIGURATION OVERLAY) filtered out
[ 28.541251] ata1.00: configured for UDMA/133
[ 28.546422] scsi 0:0:0:0: Direct-Access ATA WDC WD10EARS-00Y 0A80 PQ: 0 ANSI: 5
1;39mD-Bus Syste
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml
bin/lkp run compatible-job.yaml
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang