This 2nd series documents the fixes better and includes a bdgrab() fix
for the issue noted by Minchan. A general fix has been proposed for two
of these issues however they are not yet deemed required upstream and so
we just open code individual solutions on the driver.
Luis Chamberlain (4):
zram: fix crashes due to use of cpu hotplug multistate
zram: avoid disksize setting when device is being claimed
zram: fix deadlock with sysfs attribute usage and driver removal
zram: fix possible races between sysfs use and bdev access
drivers/block/zram/zram_drv.c | 473 +++++++++++++++++++++++++++++-----
1 file changed, 414 insertions(+), 59 deletions(-)
--
2.27.0
As with other areas of the zram diver, check if the zram
block device is being claimed but do it early to avoid
trying to set something up for a zram device which may
be on its way out.
This doesn't fix a known crash, however the potential for
an issue issue is found through code inspection.
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/block/zram/zram_drv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 431b60cd85c1..6051d20b88c3 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1729,7 +1729,7 @@ static ssize_t disksize_store(struct device *dev,
mutex_lock(&zram_index_mutex);
- if (!zram_up) {
+ if (!zram_up || zram->claim) {
err = -ENODEV;
goto out;
}
--
2.27.0
Provide a simple state machine to fix races with driver exit where we
remove the CPU multistate callbacks and re-initialization / creation of
new per CPU instances which should be managed by these callbacks.
The zram driver makes use of cpu hotplug multistate support, whereby it
associates a struct zcomp per CPU. Each struct zcomp represents a
compression algorithm in charge of managing compression streams per CPU.
Although a compiled zram driver only supports a fixed set of compression
algorithms, each zram device gets a struct zcomp allocated per CPU. The
"multi" in CPU hotplug multstate refers to these per cpu struct zcomp
instances. Each of these will have the CPU hotplug callback called for
it on CPU plug / unplug. The kernel's CPU hotplug multistate keeps a
linked list of these different structures so that it will iterate over
them on CPU transitions.
By default at driver initialization we will create just one zram device
(num_devices=1) and a zcomp structure then set for the now default
lzo-rle comrpession algorithm. At driver removal we first remove each
zram device, and so we destroy the associated struct zcomp per CPU. But
since we expose sysfs attributes to create new devices or reset / initialize
existing zram devices, we can easily end up re-initializing a struct zcomp
for a zram device before the exit routine of the module removes the cpu
hotplug callback. When this happens the kernel's CPU hotplug will detect
that at least one instance (struct zcomp for us) exists. This can happen
in the following situation:
CPU 1 CPU 2
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();
disksize_store(...);
idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);
The warning comes up on cpuhp_remove_multi_state() when it sees that the
state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
In this case, that a struct zcom still exists, the driver allowed its
creation per CPU even though we could have just freed them per CPU
though a call on another CPU, and we are then later trying to remove the
hotplug callback.
Fix all this by providing a zram initialization boolean
protected the the shared in the driver zram_index_mutex,
which we can use to annotate when sysfs attributes are
safe to use or not -- once the driver is properly initialized.
When the driver is going down we also are sure to not let
userspace muck with attributes which may affect each per cpu
struct zcomp.
This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by issuing
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:
cd testcases/kernel/device-drivers/zram
while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
You end up with a splat as follows:
kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0. Priority:-2 extents:1 across:104857596k SSFS
kernel: zram0: detected capacitky change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G E 5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel: do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
The "Error: Removing state 63 which has instances left" refers
to the zram per CPU struc zcomp instances left.
[0] https://github.com/linux-test-project/ltp.git
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
1 file changed, 55 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index cf8deecc39ef..431b60cd85c1 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
+bool zram_up;
+
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
/*
@@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
comp = zram->comp;
disksize = zram->disksize;
zram->disksize = 0;
+ zram->comp = NULL;
set_capacity_and_notify(zram->disk, 0);
part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
struct zram *zram = dev_to_zram(dev);
int err;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ err = -ENODEV;
+ goto out;
+ }
+
disksize = memparse(buf, NULL);
- if (!disksize)
- return -EINVAL;
+ if (!disksize) {
+ err = -EINVAL;
+ goto out;
+ }
down_write(&zram->init_lock);
if (init_done(zram)) {
@@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);
+ mutex_unlock(&zram_index_mutex);
+
return len;
out_free_meta:
zram_meta_free(zram, disksize);
out_unlock:
up_write(&zram->init_lock);
+out:
+ mutex_unlock(&zram_index_mutex);
return err;
}
@@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
if (ret)
return ret;
- if (!do_reset)
- return -EINVAL;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ len = -ENODEV;
+ goto out;
+ }
+
+ if (!do_reset) {
+ len = -EINVAL;
+ goto out;
+ }
zram = dev_to_zram(dev);
bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
/* Do not reset an active device or claimed device */
if (bdev->bd_openers || zram->claim) {
mutex_unlock(&bdev->bd_mutex);
- return -EBUSY;
+ len = -EBUSY;
+ goto out;
}
/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
zram->claim = false;
mutex_unlock(&bdev->bd_mutex);
+out:
+ mutex_unlock(&zram_index_mutex);
return len;
}
@@ -2021,6 +2049,10 @@ static ssize_t hot_add_show(struct class *class,
int ret;
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ mutex_unlock(&zram_index_mutex);
+ return -ENODEV;
+ }
ret = zram_add();
mutex_unlock(&zram_index_mutex);
@@ -2048,6 +2080,11 @@ static ssize_t hot_remove_store(struct class *class,
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ ret = -ENODEV;
+ goto out;
+ }
+
zram = idr_find(&zram_index_idr, dev_id);
if (zram) {
ret = zram_remove(zram);
@@ -2057,6 +2094,7 @@ static ssize_t hot_remove_store(struct class *class,
ret = -ENODEV;
}
+out:
mutex_unlock(&zram_index_mutex);
return ret ? ret : count;
}
@@ -2083,12 +2121,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
static void destroy_devices(void)
{
+ mutex_lock(&zram_index_mutex);
+ zram_up = false;
class_unregister(&zram_control_class);
idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+ mutex_unlock(&zram_index_mutex);
}
static int __init zram_init(void)
@@ -2116,15 +2157,21 @@ static int __init zram_init(void)
return -EBUSY;
}
+ mutex_lock(&zram_index_mutex);
+
while (num_devices != 0) {
- mutex_lock(&zram_index_mutex);
ret = zram_add();
- mutex_unlock(&zram_index_mutex);
- if (ret < 0)
+ if (ret < 0) {
+ mutex_unlock(&zram_index_mutex);
goto out_error;
+ }
num_devices--;
}
+ zram_up = true;
+
+ mutex_unlock(&zram_index_mutex);
+
return 0;
out_error:
--
2.27.0
When sysfs attributes use a lock also used on driver removal we can
potentially deadlock. This happens when for instance a sysfs file on
a driver is used, then at the same time we have driver removal trigger.
The driver removal code holds a lock, and then the sysfs file entry waits
for the same lock. While holding the lock the driver removal tries to
remove the sysfs entries, but these cannot be removed yet as one is
waiting for a lock. This won't complete as the lock is already held.
Likewise module removal cannot complete, and so we deadlock.
To fix this we just *try* to get a refcount to the module when a shared
lock is used, prior to mucking with a sysfs attribute. If this fails we
just give up right away.
We use a try method as a full lock means we'd then make our sysfs attributes
busy us out from possible module removal. A try lock on the module removal
ensures we give priority to module removal and interacting with sysfs
attributes only comes second. Using a full lock could mean for instance
that if you don't stop poking at sysfs files you cannot remove a module.
Demo of a deadlock with the zram driver:
CPU A CPU B
whatever_store()
module_unload
mutex_lock(foo)
mutex_lock(foo)
del_gendisk(zram->disk);
device_del()
device_remove_groups()
In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the syfs file being poked completes
which is waiting for a lock already held.
This is a generic kernel issue with sysfs files which use any lock also
used on module removal. A generic solution has been proposed [0] however
there is no interest in having this upstream at this point, so we must
open code this on drivers which have this pattern of shared locks.
This issue can be reproduced easily on the zram driver as follows:
Loop 1 on one terminal:
while true;
do modprobe zram;
modprobe -r zram;
done
Loop 2 on a second terminal:
while true; do
echo 1024 > /sys/block/zram0/disksize;
echo 1 > /sys/block/zram0/reset;
done
Without this patch we end up in a deadlock, and the following
stack trace is produced which hints to us what the issue was:
INFO: task bash:888 blocked for more than 120 seconds.
Tainted: G E 5.12.0-rc1-next-20210304+ #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:bash state:D stack: 0 pid: 888 ppid: 887 flags:0x00000004
Call Trace:
__schedule+0x2e4/0x900
schedule+0x46/0xb0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.constprop.0+0x2c3/0x490
? _kstrtoull+0x35/0xd0
reset_store+0x6c/0x160 [zram]
kernfs_fop_write_iter+0x124/0x1b0
new_sync_write+0x11c/0x1b0
vfs_write+0x1c2/0x260
ksys_write+0x5f/0xe0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f34f2c3df33
RSP: 002b:00007ffe751df6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f34f2c3df33
RDX: 0000000000000002 RSI: 0000561ccb06ec10 RDI: 0000000000000001
RBP: 0000561ccb06ec10 R08: 000000000000000a R09: 0000000000000001
R10: 0000561ccb157590 R11: 0000000000000246 R12: 0000000000000002
R13: 00007f34f2d0e6a0 R14: 0000000000000002 R15: 00007f34f2d0e8a0
INFO: task modprobe:1104 can't die for more than 120 seconds.
task:modprobe state:D stack: 0 pid: 1104 ppid: 916 flags:0x00004004
Call Trace:
__schedule+0x2e4/0x900
schedule+0x46/0xb0
__kernfs_remove.part.0+0x228/0x2b0
? finish_wait+0x80/0x80
kernfs_remove_by_name_ns+0x50/0x90
remove_files+0x2b/0x60
sysfs_remove_group+0x38/0x80
sysfs_remove_groups+0x29/0x40
device_remove_attrs+0x4a/0x80
device_del+0x183/0x3e0
? mutex_lock+0xe/0x30
del_gendisk+0x27a/0x2d0
zram_remove+0x8a/0xb0 [zram]
? hot_remove_store+0xf0/0xf0 [zram]
zram_remove_cb+0xd/0x10 [zram]
idr_for_each+0x5e/0xd0
destroy_devices+0x39/0x6f [zram]
__do_sys_delete_module+0x190/0x2a0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f32adf727d7
RSP: 002b:00007ffc08bb38a8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055eea23cbb10 RCX: 00007f32adf727d7
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055eea23cbb78
RBP: 000055eea23cbb10 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f32adfe5ac0 R11: 0000000000000206 R12: 000055eea23cbb78
R13: 0000000000000000 R14: 0000000000000000 R15: 000055eea23cbc20
[0] https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/block/zram/zram_drv.c | 158 ++++++++++++++++++++++++++++++----
1 file changed, 139 insertions(+), 19 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6051d20b88c3..701ed28da125 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -239,10 +239,15 @@ static ssize_t initstate_show(struct device *dev,
u32 val;
struct zram *zram = dev_to_zram(dev);
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
val = init_done(zram);
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
+
return scnprintf(buf, PAGE_SIZE, "%u\n", val);
}
@@ -250,8 +255,16 @@ static ssize_t disksize_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct zram *zram = dev_to_zram(dev);
+ u64 disksize;
+
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
- return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
+ disksize = zram->disksize;
+
+ module_put(THIS_MODULE);
+
+ return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
}
static ssize_t mem_limit_store(struct device *dev,
@@ -261,14 +274,23 @@ static ssize_t mem_limit_store(struct device *dev,
char *tmp;
struct zram *zram = dev_to_zram(dev);
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
limit = memparse(buf, &tmp);
- if (buf == tmp) /* no chars parsed, invalid input */
- return -EINVAL;
+
+ /* no chars parsed, invalid input */
+ if (buf == tmp) {
+ len = -EINVAL;
+ goto out;
+ }
down_write(&zram->init_lock);
zram->limit_pages = PAGE_ALIGN(limit) >> PAGE_SHIFT;
up_write(&zram->init_lock);
+out:
+ module_put(THIS_MODULE);
return len;
}
@@ -283,6 +305,9 @@ static ssize_t mem_used_max_store(struct device *dev,
if (err || val != 0)
return -EINVAL;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
if (init_done(zram)) {
atomic_long_set(&zram->stats.max_used_pages,
@@ -290,6 +315,8 @@ static ssize_t mem_used_max_store(struct device *dev,
}
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
+
return len;
}
@@ -303,10 +330,13 @@ static ssize_t idle_store(struct device *dev,
if (!sysfs_streq(buf, "all"))
return -EINVAL;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
if (!init_done(zram)) {
- up_read(&zram->init_lock);
- return -EINVAL;
+ len = -EINVAL;
+ goto out;
}
for (index = 0; index < nr_pages; index++) {
@@ -321,8 +351,9 @@ static ssize_t idle_store(struct device *dev,
zram_slot_unlock(zram, index);
}
+out:
up_read(&zram->init_lock);
-
+ module_put(THIS_MODULE);
return len;
}
@@ -337,6 +368,9 @@ static ssize_t writeback_limit_enable_store(struct device *dev,
if (kstrtoull(buf, 10, &val))
return ret;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
zram->wb_limit_enable = val;
@@ -344,6 +378,8 @@ static ssize_t writeback_limit_enable_store(struct device *dev,
up_read(&zram->init_lock);
ret = len;
+ module_put(THIS_MODULE);
+
return ret;
}
@@ -353,12 +389,17 @@ static ssize_t writeback_limit_enable_show(struct device *dev,
bool val;
struct zram *zram = dev_to_zram(dev);
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
val = zram->wb_limit_enable;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
+
return scnprintf(buf, PAGE_SIZE, "%d\n", val);
}
@@ -372,6 +413,9 @@ static ssize_t writeback_limit_store(struct device *dev,
if (kstrtoull(buf, 10, &val))
return ret;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
zram->bd_wb_limit = val;
@@ -379,6 +423,8 @@ static ssize_t writeback_limit_store(struct device *dev,
up_read(&zram->init_lock);
ret = len;
+ module_put(THIS_MODULE);
+
return ret;
}
@@ -388,12 +434,17 @@ static ssize_t writeback_limit_show(struct device *dev,
u64 val;
struct zram *zram = dev_to_zram(dev);
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
val = zram->bd_wb_limit;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
+
return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
}
@@ -423,12 +474,15 @@ static ssize_t backing_dev_show(struct device *dev,
char *p;
ssize_t ret;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
file = zram->backing_dev;
if (!file) {
memcpy(buf, "none\n", 5);
- up_read(&zram->init_lock);
- return 5;
+ ret = 5;
+ goto out;
}
p = file_path(file, buf, PAGE_SIZE - 1);
@@ -442,6 +496,7 @@ static ssize_t backing_dev_show(struct device *dev,
buf[ret++] = '\n';
out:
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
return ret;
}
@@ -459,9 +514,14 @@ static ssize_t backing_dev_store(struct device *dev,
int err;
struct zram *zram = dev_to_zram(dev);
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
file_name = kmalloc(PATH_MAX, GFP_KERNEL);
- if (!file_name)
- return -ENOMEM;
+ if (!file_name) {
+ err = -ENOMEM;
+ goto out_module;
+ }
down_write(&zram->init_lock);
if (init_done(zram)) {
@@ -529,6 +589,7 @@ static ssize_t backing_dev_store(struct device *dev,
pr_info("setup backing device %s\n", file_name);
kfree(file_name);
+ module_put(THIS_MODULE);
return len;
out:
@@ -543,7 +604,8 @@ static ssize_t backing_dev_store(struct device *dev,
up_write(&zram->init_lock);
kfree(file_name);
-
+out_module:
+ module_put(THIS_MODULE);
return err;
}
@@ -628,26 +690,31 @@ static ssize_t writeback_store(struct device *dev,
struct bio bio;
struct bio_vec bio_vec;
struct page *page;
- ssize_t ret = len;
+ ssize_t ret = -EINVAL;
int mode, err;
unsigned long blk_idx = 0;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
if (sysfs_streq(buf, "idle"))
mode = IDLE_WRITEBACK;
else if (sysfs_streq(buf, "huge"))
mode = HUGE_WRITEBACK;
else {
if (strncmp(buf, PAGE_WB_SIG, sizeof(PAGE_WB_SIG) - 1))
- return -EINVAL;
+ goto out;
if (kstrtol(buf + sizeof(PAGE_WB_SIG) - 1, 10, &index) ||
index >= nr_pages)
- return -EINVAL;
+ goto out;
nr_pages = 1;
mode = PAGE_WRITEBACK;
}
+ ret = len;
+
down_read(&zram->init_lock);
if (!init_done(zram)) {
ret = -EINVAL;
@@ -781,6 +848,8 @@ static ssize_t writeback_store(struct device *dev,
__free_page(page);
release_init_lock:
up_read(&zram->init_lock);
+out:
+ module_put(THIS_MODULE);
return ret;
}
@@ -990,10 +1059,15 @@ static ssize_t comp_algorithm_show(struct device *dev,
size_t sz;
struct zram *zram = dev_to_zram(dev);
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
sz = zcomp_available_show(zram->compressor, buf);
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
+
return sz;
}
@@ -1013,15 +1087,20 @@ static ssize_t comp_algorithm_store(struct device *dev,
if (!zcomp_available_algorithm(compressor))
return -EINVAL;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_write(&zram->init_lock);
if (init_done(zram)) {
- up_write(&zram->init_lock);
pr_info("Can't change algorithm for initialized device\n");
- return -EBUSY;
+ len = -EBUSY;
+ goto out;
}
strcpy(zram->compressor, compressor);
+out:
up_write(&zram->init_lock);
+ module_put(THIS_MODULE);
return len;
}
@@ -1030,14 +1109,19 @@ static ssize_t compact_store(struct device *dev,
{
struct zram *zram = dev_to_zram(dev);
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
if (!init_done(zram)) {
- up_read(&zram->init_lock);
- return -EINVAL;
+ len = -EINVAL;
+ goto out;
}
zs_compact(zram->mem_pool);
+out:
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
return len;
}
@@ -1048,6 +1132,9 @@ static ssize_t io_stat_show(struct device *dev,
struct zram *zram = dev_to_zram(dev);
ssize_t ret;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"%8llu %8llu %8llu %8llu\n",
@@ -1056,6 +1143,7 @@ static ssize_t io_stat_show(struct device *dev,
(u64)atomic64_read(&zram->stats.invalid_io),
(u64)atomic64_read(&zram->stats.notify_free));
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
return ret;
}
@@ -1071,6 +1159,9 @@ static ssize_t mm_stat_show(struct device *dev,
memset(&pool_stats, 0x00, sizeof(struct zs_pool_stats));
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
if (init_done(zram)) {
mem_used = zs_get_total_pages(zram->mem_pool);
@@ -1092,6 +1183,7 @@ static ssize_t mm_stat_show(struct device *dev,
(u64)atomic64_read(&zram->stats.huge_pages),
(u64)atomic64_read(&zram->stats.huge_pages_since));
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
return ret;
}
@@ -1104,6 +1196,9 @@ static ssize_t bd_stat_show(struct device *dev,
struct zram *zram = dev_to_zram(dev);
ssize_t ret;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"%8llu %8llu %8llu\n",
@@ -1112,6 +1207,7 @@ static ssize_t bd_stat_show(struct device *dev,
FOUR_K((u64)atomic64_read(&zram->stats.bd_writes)));
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
return ret;
}
#endif
@@ -1123,6 +1219,9 @@ static ssize_t debug_stat_show(struct device *dev,
struct zram *zram = dev_to_zram(dev);
ssize_t ret;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"version: %d\n%8llu %8llu\n",
@@ -1131,6 +1230,8 @@ static ssize_t debug_stat_show(struct device *dev,
(u64)atomic64_read(&zram->stats.miss_free));
up_read(&zram->init_lock);
+ module_put(THIS_MODULE);
+
return ret;
}
@@ -1727,6 +1828,9 @@ static ssize_t disksize_store(struct device *dev,
struct zram *zram = dev_to_zram(dev);
int err;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
mutex_lock(&zram_index_mutex);
if (!zram_up || zram->claim) {
@@ -1767,6 +1871,7 @@ static ssize_t disksize_store(struct device *dev,
up_write(&zram->init_lock);
mutex_unlock(&zram_index_mutex);
+ module_put(THIS_MODULE);
return len;
@@ -1776,6 +1881,7 @@ static ssize_t disksize_store(struct device *dev,
up_write(&zram->init_lock);
out:
mutex_unlock(&zram_index_mutex);
+ module_put(THIS_MODULE);
return err;
}
@@ -1791,6 +1897,9 @@ static ssize_t reset_store(struct device *dev,
if (ret)
return ret;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
mutex_lock(&zram_index_mutex);
if (!zram_up) {
@@ -1828,6 +1937,7 @@ static ssize_t reset_store(struct device *dev,
out:
mutex_unlock(&zram_index_mutex);
+ module_put(THIS_MODULE);
return len;
}
@@ -2048,13 +2158,19 @@ static ssize_t hot_add_show(struct class *class,
{
int ret;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
mutex_lock(&zram_index_mutex);
if (!zram_up) {
mutex_unlock(&zram_index_mutex);
- return -ENODEV;
+ ret = -ENODEV;
+ goto out;
}
ret = zram_add();
+out:
mutex_unlock(&zram_index_mutex);
+ module_put(THIS_MODULE);
if (ret < 0)
return ret;
@@ -2078,6 +2194,9 @@ static ssize_t hot_remove_store(struct class *class,
if (dev_id < 0)
return -EINVAL;
+ if (!try_module_get(THIS_MODULE))
+ return -ENODEV;
+
mutex_lock(&zram_index_mutex);
if (!zram_up) {
@@ -2096,6 +2215,7 @@ static ssize_t hot_remove_store(struct class *class,
out:
mutex_unlock(&zram_index_mutex);
+ module_put(THIS_MODULE);
return ret ? ret : count;
}
static CLASS_ATTR_WO(hot_remove);
--
2.27.0
Its possible to run into a race where a sysfs knob is being used,
we get preempted, and a zram device is removed before we complete
use of the sysfs knob. This can happen for instance on block
devices, where for instance the zram block devices just part of
the private data of the block device. To ensure the private
data pointer is valid we must bdget() / bdput() in between
access.
For instance this can happen in the following two situations
as examples to illustrate this better:
CPU 1 CPU 2
destroy_devices
...
compact_store()
zram = dev_to_zram(dev);
idr_for_each(zram_remove_cb
zram_remove
...
kfree(zram)
down_read(&zram->init_lock);
CPU 1 CPU 2
hot_remove_store
compact_store()
zram = dev_to_zram(dev);
zram_remove
kfree(zram)
down_read(&zram->init_lock);
This issue does not fix a known crash, however this race was
spotted by Minchan Kim through code inspection upon code review
of the sysfs deadlock patch.
A generic solution for sysfs has been suggested and implemented (refer
to the dev_type_get() / dev_type_put() part of the patch) [0], however
there is no interest at this point in time in fixing this in sysfs itself.
[0] https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/block/zram/zram_drv.c | 260 +++++++++++++++++++++++++++++-----
1 file changed, 224 insertions(+), 36 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 701ed28da125..9201a012ea8b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -237,34 +237,55 @@ static ssize_t initstate_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
u32 val;
- struct zram *zram = dev_to_zram(dev);
+ int len;
+ struct zram *zram;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
val = init_done(zram);
up_read(&zram->init_lock);
- module_put(THIS_MODULE);
+ len = scnprintf(buf, PAGE_SIZE, "%u\n", val);
- return scnprintf(buf, PAGE_SIZE, "%u\n", val);
+ bdput(dev_to_bdev(dev));
+out_nodev:
+ module_put(THIS_MODULE);
+ return len;
}
static ssize_t disksize_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
u64 disksize;
+ int len;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
disksize = zram->disksize;
+ len = scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
- return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
+ return len;
}
static ssize_t mem_limit_store(struct device *dev,
@@ -272,11 +293,17 @@ static ssize_t mem_limit_store(struct device *dev,
{
u64 limit;
char *tmp;
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
limit = memparse(buf, &tmp);
/* no chars parsed, invalid input */
@@ -290,6 +317,8 @@ static ssize_t mem_limit_store(struct device *dev,
up_write(&zram->init_lock);
out:
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return len;
}
@@ -299,7 +328,7 @@ static ssize_t mem_used_max_store(struct device *dev,
{
int err;
unsigned long val;
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
err = kstrtoul(buf, 10, &val);
if (err || val != 0)
@@ -308,6 +337,13 @@ static ssize_t mem_used_max_store(struct device *dev,
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
if (init_done(zram)) {
atomic_long_set(&zram->stats.max_used_pages,
@@ -315,6 +351,8 @@ static ssize_t mem_used_max_store(struct device *dev,
}
up_read(&zram->init_lock);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return len;
@@ -323,8 +361,8 @@ static ssize_t mem_used_max_store(struct device *dev,
static ssize_t idle_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
- struct zram *zram = dev_to_zram(dev);
- unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+ struct zram *zram;
+ unsigned long nr_pages;
int index;
if (!sysfs_streq(buf, "all"))
@@ -333,6 +371,14 @@ static ssize_t idle_store(struct device *dev,
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+ nr_pages = zram->disksize >> PAGE_SHIFT;
+
down_read(&zram->init_lock);
if (!init_done(zram)) {
len = -EINVAL;
@@ -353,6 +399,8 @@ static ssize_t idle_store(struct device *dev,
out:
up_read(&zram->init_lock);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return len;
}
@@ -361,91 +409,126 @@ static ssize_t idle_store(struct device *dev,
static ssize_t writeback_limit_enable_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
u64 val;
- ssize_t ret = -EINVAL;
if (kstrtoull(buf, 10, &val))
- return ret;
+ return -EINVAL;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
zram->wb_limit_enable = val;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
- ret = len;
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
- return ret;
+ return len;
}
static ssize_t writeback_limit_enable_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
bool val;
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
val = zram->wb_limit_enable;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
+ len = scnprintf(buf, PAGE_SIZE, "%d\n", val);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
- return scnprintf(buf, PAGE_SIZE, "%d\n", val);
+ return len;
}
static ssize_t writeback_limit_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
u64 val;
- ssize_t ret = -EINVAL;
if (kstrtoull(buf, 10, &val))
- return ret;
+ return -EINVAL;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
zram->bd_wb_limit = val;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
- ret = len;
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
- return ret;
+ return len;
}
static ssize_t writeback_limit_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
u64 val;
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
+ int len;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
spin_lock(&zram->wb_limit_lock);
val = zram->bd_wb_limit;
spin_unlock(&zram->wb_limit_lock);
up_read(&zram->init_lock);
+ len = scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
- return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+ return len;
}
static void reset_bdev(struct zram *zram)
@@ -470,13 +553,20 @@ static ssize_t backing_dev_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct file *file;
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
char *p;
ssize_t ret;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ ret = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
file = zram->backing_dev;
if (!file) {
@@ -496,6 +586,8 @@ static ssize_t backing_dev_show(struct device *dev,
buf[ret++] = '\n';
out:
up_read(&zram->init_lock);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return ret;
}
@@ -512,11 +604,18 @@ static ssize_t backing_dev_store(struct device *dev,
unsigned long nr_pages, *bitmap = NULL;
struct block_device *bdev = NULL;
int err;
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ err = -ENODEV;
+ goto out_module;
+ }
+
+ zram = dev_to_zram(dev);
+
file_name = kmalloc(PATH_MAX, GFP_KERNEL);
if (!file_name) {
err = -ENOMEM;
@@ -589,6 +688,7 @@ static ssize_t backing_dev_store(struct device *dev,
pr_info("setup backing device %s\n", file_name);
kfree(file_name);
+ bdput(dev_to_bdev(dev));
module_put(THIS_MODULE);
return len;
@@ -604,6 +704,7 @@ static ssize_t backing_dev_store(struct device *dev,
up_write(&zram->init_lock);
kfree(file_name);
+ bdput(dev_to_bdev(dev));
out_module:
module_put(THIS_MODULE);
return err;
@@ -684,8 +785,8 @@ static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec,
static ssize_t writeback_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
- struct zram *zram = dev_to_zram(dev);
- unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+ struct zram *zram;
+ unsigned long nr_pages;
unsigned long index = 0;
struct bio bio;
struct bio_vec bio_vec;
@@ -697,6 +798,14 @@ static ssize_t writeback_store(struct device *dev,
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ ret = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+ nr_pages = zram->disksize >> PAGE_SHIFT;
+
if (sysfs_streq(buf, "idle"))
mode = IDLE_WRITEBACK;
else if (sysfs_streq(buf, "huge"))
@@ -849,6 +958,8 @@ static ssize_t writeback_store(struct device *dev,
release_init_lock:
up_read(&zram->init_lock);
out:
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return ret;
@@ -1057,15 +1168,24 @@ static ssize_t comp_algorithm_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
size_t sz;
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ sz = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
sz = zcomp_available_show(zram->compressor, buf);
up_read(&zram->init_lock);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return sz;
@@ -1074,7 +1194,7 @@ static ssize_t comp_algorithm_show(struct device *dev,
static ssize_t comp_algorithm_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
char compressor[ARRAY_SIZE(zram->compressor)];
size_t sz;
@@ -1090,6 +1210,13 @@ static ssize_t comp_algorithm_store(struct device *dev,
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_write(&zram->init_lock);
if (init_done(zram)) {
pr_info("Can't change algorithm for initialized device\n");
@@ -1100,6 +1227,8 @@ static ssize_t comp_algorithm_store(struct device *dev,
strcpy(zram->compressor, compressor);
out:
up_write(&zram->init_lock);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return len;
}
@@ -1107,11 +1236,18 @@ static ssize_t comp_algorithm_store(struct device *dev,
static ssize_t compact_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
if (!init_done(zram)) {
len = -EINVAL;
@@ -1121,6 +1257,8 @@ static ssize_t compact_store(struct device *dev,
zs_compact(zram->mem_pool);
out:
up_read(&zram->init_lock);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return len;
@@ -1135,6 +1273,13 @@ static ssize_t io_stat_show(struct device *dev,
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ ret = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"%8llu %8llu %8llu %8llu\n",
@@ -1143,6 +1288,8 @@ static ssize_t io_stat_show(struct device *dev,
(u64)atomic64_read(&zram->stats.invalid_io),
(u64)atomic64_read(&zram->stats.notify_free));
up_read(&zram->init_lock);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return ret;
@@ -1151,7 +1298,7 @@ static ssize_t io_stat_show(struct device *dev,
static ssize_t mm_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
struct zs_pool_stats pool_stats;
u64 orig_size, mem_used = 0;
long max_used;
@@ -1162,6 +1309,13 @@ static ssize_t mm_stat_show(struct device *dev,
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ ret = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
if (init_done(zram)) {
mem_used = zs_get_total_pages(zram->mem_pool);
@@ -1183,6 +1337,8 @@ static ssize_t mm_stat_show(struct device *dev,
(u64)atomic64_read(&zram->stats.huge_pages),
(u64)atomic64_read(&zram->stats.huge_pages_since));
up_read(&zram->init_lock);
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return ret;
@@ -1193,12 +1349,19 @@ static ssize_t mm_stat_show(struct device *dev,
static ssize_t bd_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
ssize_t ret;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ ret = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"%8llu %8llu %8llu\n",
@@ -1206,7 +1369,8 @@ static ssize_t bd_stat_show(struct device *dev,
FOUR_K((u64)atomic64_read(&zram->stats.bd_reads)),
FOUR_K((u64)atomic64_read(&zram->stats.bd_writes)));
up_read(&zram->init_lock);
-
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return ret;
}
@@ -1216,12 +1380,19 @@ static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
int version = 1;
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
ssize_t ret;
if (!try_module_get(THIS_MODULE))
return -ENODEV;
+ if (!bdgrab(dev_to_bdev(dev))) {
+ ret = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
"version: %d\n%8llu %8llu\n",
@@ -1229,7 +1400,8 @@ static ssize_t debug_stat_show(struct device *dev,
(u64)atomic64_read(&zram->stats.writestall),
(u64)atomic64_read(&zram->stats.miss_free));
up_read(&zram->init_lock);
-
+ bdput(dev_to_bdev(dev));
+out_nodev:
module_put(THIS_MODULE);
return ret;
@@ -1825,7 +1997,7 @@ static ssize_t disksize_store(struct device *dev,
{
u64 disksize;
struct zcomp *comp;
- struct zram *zram = dev_to_zram(dev);
+ struct zram *zram;
int err;
if (!try_module_get(THIS_MODULE))
@@ -1833,6 +2005,13 @@ static ssize_t disksize_store(struct device *dev,
mutex_lock(&zram_index_mutex);
+ if (!bdgrab(dev_to_bdev(dev))) {
+ err = -ENODEV;
+ goto out_nodev;
+ }
+
+ zram = dev_to_zram(dev);
+
if (!zram_up || zram->claim) {
err = -ENODEV;
goto out;
@@ -1880,6 +2059,8 @@ static ssize_t disksize_store(struct device *dev,
out_unlock:
up_write(&zram->init_lock);
out:
+ bdput(dev_to_bdev(dev));
+out_nodev:
mutex_unlock(&zram_index_mutex);
module_put(THIS_MODULE);
return err;
@@ -1902,6 +2083,11 @@ static ssize_t reset_store(struct device *dev,
mutex_lock(&zram_index_mutex);
+ if (!bdgrab(dev_to_bdev(dev))) {
+ len = -ENODEV;
+ goto out_nodev;
+ }
+
if (!zram_up) {
len = -ENODEV;
goto out;
@@ -1936,6 +2122,8 @@ static ssize_t reset_store(struct device *dev,
mutex_unlock(&bdev->bd_mutex);
out:
+ bdput(dev_to_bdev(dev));
+out_nodev:
mutex_unlock(&zram_index_mutex);
module_put(THIS_MODULE);
return len;
--
2.27.0
On Fri, Apr 23, 2021 at 01:11:06AM +0000, Luis Chamberlain wrote:
> As with other areas of the zram diver, check if the zram
> block device is being claimed but do it early to avoid
> trying to set something up for a zram device which may
> be on its way out.
I guess you're talking about zram_remove race with disksize_store.
Please add some code sequence diagram like previous patch, which
makes thing more clear.
Otherwise, looks good to me.
>
> This doesn't fix a known crash, however the potential for
> an issue issue is found through code inspection.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> drivers/block/zram/zram_drv.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 431b60cd85c1..6051d20b88c3 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1729,7 +1729,7 @@ static ssize_t disksize_store(struct device *dev,
>
> mutex_lock(&zram_index_mutex);
>
> - if (!zram_up) {
> + if (!zram_up || zram->claim) {
> err = -ENODEV;
> goto out;
> }
> --
> 2.27.0
>
On Fri, Apr 23, 2021 at 01:11:05AM +0000, Luis Chamberlain wrote:
> Provide a simple state machine to fix races with driver exit where we
> remove the CPU multistate callbacks and re-initialization / creation of
> new per CPU instances which should be managed by these callbacks.
>
> The zram driver makes use of cpu hotplug multistate support, whereby it
> associates a struct zcomp per CPU. Each struct zcomp represents a
> compression algorithm in charge of managing compression streams per CPU.
> Although a compiled zram driver only supports a fixed set of compression
> algorithms, each zram device gets a struct zcomp allocated per CPU. The
> "multi" in CPU hotplug multstate refers to these per cpu struct zcomp
> instances. Each of these will have the CPU hotplug callback called for
> it on CPU plug / unplug. The kernel's CPU hotplug multistate keeps a
> linked list of these different structures so that it will iterate over
> them on CPU transitions.
>
> By default at driver initialization we will create just one zram device
> (num_devices=1) and a zcomp structure then set for the now default
> lzo-rle comrpession algorithm. At driver removal we first remove each
> zram device, and so we destroy the associated struct zcomp per CPU. But
> since we expose sysfs attributes to create new devices or reset / initialize
> existing zram devices, we can easily end up re-initializing a struct zcomp
> for a zram device before the exit routine of the module removes the cpu
> hotplug callback. When this happens the kernel's CPU hotplug will detect
> that at least one instance (struct zcomp for us) exists. This can happen
> in the following situation:
>
> CPU 1 CPU 2
>
> class_unregister(...);
> idr_for_each(...);
> zram_debugfs_destroy();
> disksize_store(...);
> idr_destroy(...);
> unregister_blkdev(...);
> cpuhp_remove_multi_state(...);
>
> The warning comes up on cpuhp_remove_multi_state() when it sees that the
> state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
> In this case, that a struct zcom still exists, the driver allowed its
> creation per CPU even though we could have just freed them per CPU
> though a call on another CPU, and we are then later trying to remove the
> hotplug callback.
>
> Fix all this by providing a zram initialization boolean
> protected the the shared in the driver zram_index_mutex,
> which we can use to annotate when sysfs attributes are
> safe to use or not -- once the driver is properly initialized.
> When the driver is going down we also are sure to not let
> userspace muck with attributes which may affect each per cpu
> struct zcomp.
>
> This also fixes a series of possible memory leaks. The
> crashes and memory leaks can easily be caused by issuing
> the zram02.sh script from the LTP project [0] in a loop
> in two separate windows:
>
> cd testcases/kernel/device-drivers/zram
> while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
>
> You end up with a splat as follows:
>
> kernel: zram: Removed device: zram0
> kernel: zram: Added device: zram0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: Adding 104857596k swap on /dev/zram0. Priority:-2 extents:1 across:104857596k SSFS
> kernel: zram0: detected capacitky change from 209715200 to 0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: ------------[ cut here ]------------
> kernel: Error: Removing state 63 which has instances left.
> kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G E 5.12.0-rc1-next-20210304 #3
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Code: <etc>
> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
> kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
> kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
> kernel: RBP: 0000000000000000i R08: 0000000000000000 R09: ffffa800c139bcb8
> kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
> kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
> kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:0000000000000000
> kernel: CS: 0010 DS: 0000 ES 0000 CR0: 0000000080050033
> kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
> kernel: Call Trace:
> kernel: __cpuhp_remove_state+0x2e/0x80
> kernel: __do_sys_delete_module+0x190/0x2a0
> kernel: do_syscall_64+0x33/0x80
> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> The "Error: Removing state 63 which has instances left" refers
> to the zram per CPU struc zcomp instances left.
>
> [0] https://github.com/linux-test-project/ltp.git
>
> Signed-off-by: Luis Chamberlain <[email protected]>
Acked-by: Minchan Kim <[email protected]>
On Fri, Apr 23, 2021 at 01:11:04AM +0000, Luis Chamberlain wrote:
> This 2nd series documents the fixes better and includes a bdgrab() fix
> for the issue noted by Minchan. A general fix has been proposed for two
> of these issues however they are not yet deemed required upstream and so
> we just open code individual solutions on the driver.
>
> Luis Chamberlain (4):
> zram: fix crashes due to use of cpu hotplug multistate
> zram: avoid disksize setting when device is being claimed
> zram: fix deadlock with sysfs attribute usage and driver removal
> zram: fix possible races between sysfs use and bdev access
>
> drivers/block/zram/zram_drv.c | 473 +++++++++++++++++++++++++++++-----
> 1 file changed, 414 insertions(+), 59 deletions(-)
Hi Luis,
First of all, I am sorry too late review. Now I see [3/4] and [4/4] would
be not only zram issue since you shed a light in the descriptions.
Yeah, that would be helpful if it could be deal with under general
layer but looks like arguable or would take some times at least, IIUC.
On the case, yeah, we could fix it for zram first until the issue will
bring up further. Anyway, I'd like to see some wrapper rather than annotating
for every sysfs files for maintainance point of view.
At least, could you introduce one more patch "introduce zram sysfs wrapper"
on top of this series to centralize the work?
Thanks for your works!
Greg,
your feedback would be appreciated here.
On Wed, May 19, 2021 at 01:09:09PM -0700, Minchan Kim wrote:
> On Fri, Apr 23, 2021 at 01:11:04AM +0000, Luis Chamberlain wrote:
> > This 2nd series documents the fixes better and includes a bdgrab() fix
> > for the issue noted by Minchan. A general fix has been proposed for two
> > of these issues however they are not yet deemed required upstream and so
> > we just open code individual solutions on the driver.
> >
> > Luis Chamberlain (4):
> > zram: fix crashes due to use of cpu hotplug multistate
> > zram: avoid disksize setting when device is being claimed
> > zram: fix deadlock with sysfs attribute usage and driver removal
> > zram: fix possible races between sysfs use and bdev access
> >
> > drivers/block/zram/zram_drv.c | 473 +++++++++++++++++++++++++++++-----
> > 1 file changed, 414 insertions(+), 59 deletions(-)
>
> Hi Luis,
>
> First of all, I am sorry too late review. Now I see [3/4] and [4/4] would
> be not only zram issue since you shed a light in the descriptions.
> Yeah, that would be helpful if it could be deal with under general
> layer but looks like arguable or would take some times at least, IIUC.
>
> On the case, yeah, we could fix it for zram first until the issue will
> bring up further. Anyway, I'd like to see some wrapper rather than annotating
> for every sysfs files for maintainance point of view.
> At least, could you introduce one more patch "introduce zram sysfs wrapper"
> on top of this series to centralize the work?
>
> Thanks for your works!
Since I did the work for a general fix as an alternative proof of
concept to the ugliness reflected on those two last patches, I'd like
instead for Greg to re-consider merging a general fix.
Greg, can you comment on technical levels why a general core fix is not
desirable upstream for those two issues?
Based on feedback from folks, it does not seem the argument that we
don't support rmmod holds water. Specially given we are now even
discussing using live patching to suport error injection as a strategy
to not incur more code / boiler plate code even on the block layer [0].
The live patching approach would mean for testing purposes we'd
temporarily use error injection via live patching and then remove the
module after done testing.
[0] http://lkml.kernel.org/r/[email protected]
Luis
On Fri, May 21, 2021 at 10:01:52PM +0200, Greg Kroah-Hartman wrote:
> On Wed, May 19, 2021 at 08:20:23PM +0000, Luis Chamberlain wrote:
> > Greg,
> >
> > your feedback would be appreciated here.
>
> Appreciated where? This is a zram patchset, what do I need to mess with
> it for?
This patchset has 2 issues which I noted in the last series that are
generic, and could best be dealt with on sysfs, and suggested
how this could actually be dealt with on sysfs / kernfs.
> > Greg, can you comment on technical levels why a general core fix is not
> > desirable upstream for those two issues?
>
> What issues exactly?
When I suggested the generic way to fix this your main argument against
a generic solution was that we don't support module removal. Given that
argument did not seem to hold any water it begs the question if you
still would rather not see this fixed in sysfs / kernfs.
If you however are more open to it now, I can instead take that work, and
send a proper patch for review.
Luis
On Wed, May 19, 2021 at 08:20:23PM +0000, Luis Chamberlain wrote:
> Greg,
>
> your feedback would be appreciated here.
Appreciated where? This is a zram patchset, what do I need to mess with
it for?
>
> On Wed, May 19, 2021 at 01:09:09PM -0700, Minchan Kim wrote:
> > On Fri, Apr 23, 2021 at 01:11:04AM +0000, Luis Chamberlain wrote:
> > > This 2nd series documents the fixes better and includes a bdgrab() fix
> > > for the issue noted by Minchan. A general fix has been proposed for two
> > > of these issues however they are not yet deemed required upstream and so
> > > we just open code individual solutions on the driver.
> > >
> > > Luis Chamberlain (4):
> > > zram: fix crashes due to use of cpu hotplug multistate
> > > zram: avoid disksize setting when device is being claimed
> > > zram: fix deadlock with sysfs attribute usage and driver removal
> > > zram: fix possible races between sysfs use and bdev access
> > >
> > > drivers/block/zram/zram_drv.c | 473 +++++++++++++++++++++++++++++-----
> > > 1 file changed, 414 insertions(+), 59 deletions(-)
> >
> > Hi Luis,
> >
> > First of all, I am sorry too late review. Now I see [3/4] and [4/4] would
> > be not only zram issue since you shed a light in the descriptions.
> > Yeah, that would be helpful if it could be deal with under general
> > layer but looks like arguable or would take some times at least, IIUC.
> >
> > On the case, yeah, we could fix it for zram first until the issue will
> > bring up further. Anyway, I'd like to see some wrapper rather than annotating
> > for every sysfs files for maintainance point of view.
> > At least, could you introduce one more patch "introduce zram sysfs wrapper"
> > on top of this series to centralize the work?
> >
> > Thanks for your works!
>
> Since I did the work for a general fix as an alternative proof of
> concept to the ugliness reflected on those two last patches, I'd like
> instead for Greg to re-consider merging a general fix.
>
> Greg, can you comment on technical levels why a general core fix is not
> desirable upstream for those two issues?
What issues exactly?
totally confused,
greg k-h
On Fri, May 21, 2021 at 08:16:18PM +0000, Luis Chamberlain wrote:
> On Fri, May 21, 2021 at 10:01:52PM +0200, Greg Kroah-Hartman wrote:
> > On Wed, May 19, 2021 at 08:20:23PM +0000, Luis Chamberlain wrote:
> > > Greg,
> > >
> > > your feedback would be appreciated here.
> >
> > Appreciated where? This is a zram patchset, what do I need to mess with
> > it for?
>
> This patchset has 2 issues which I noted in the last series that are
> generic, and could best be dealt with on sysfs, and suggested
> how this could actually be dealt with on sysfs / kernfs.
>
> > > Greg, can you comment on technical levels why a general core fix is not
> > > desirable upstream for those two issues?
> >
> > What issues exactly?
>
> When I suggested the generic way to fix this your main argument against
> a generic solution was that we don't support module removal. Given that
> argument did not seem to hold any water it begs the question if you
> still would rather not see this fixed in sysfs / kernfs.
>
> If you however are more open to it now, I can instead take that work, and
> send a proper patch for review.
I looked at the last patch here and I really do not see the issue.
In order for the module to be removed, zram_exit() has to return, right?
And that function calls destroy_devices() which will then remove all
devices in sysfs that are associated with this driver. At that point in
time, sysfs detaches the attributes from kernfs so that any open file
handle that happened to be around for an attribute file will not call
back into the show/store function for that device.
Then destroy_devices() returns, and zram_exit() returns, and the module
is unloaded.
So how can a show/store function in zram_drv.c be called after
destroy_devices() returns?
The changelog text in patch 4/4 is odd, destroy_devices() shouldn't be
racing with anything as devices have reference counts in order to
protect this type of thing from happening, right? How can a store
function be called when a device is somehow removed from memory at the
same time? Don't we properly incremement/decrement the device
structure's reference count? If not, wouldn't that be the simplest
solution here?
And who is ripping out zram drivers while the system is running anyway?
What workflow causes this to happen so much so that the sysfs files need
to be "protected"? What tool/script/whatever is hammering on those
sysfs files so much while someone wants to unload the module?
What am I missing?
thanks,
greg k-h
On Fri, May 21, 2021 at 10:45:00PM +0200, Greg Kroah-Hartman wrote:
> On Fri, May 21, 2021 at 08:16:18PM +0000, Luis Chamberlain wrote:
> > On Fri, May 21, 2021 at 10:01:52PM +0200, Greg Kroah-Hartman wrote:
> > > On Wed, May 19, 2021 at 08:20:23PM +0000, Luis Chamberlain wrote:
> > > > Greg,
> > > >
> > > > your feedback would be appreciated here.
> > >
> > > Appreciated where? This is a zram patchset, what do I need to mess with
> > > it for?
> >
> > This patchset has 2 issues which I noted in the last series that are
> > generic, and could best be dealt with on sysfs, and suggested
> > how this could actually be dealt with on sysfs / kernfs.
> >
> > > > Greg, can you comment on technical levels why a general core fix is not
> > > > desirable upstream for those two issues?
> > >
> > > What issues exactly?
> >
> > When I suggested the generic way to fix this your main argument against
> > a generic solution was that we don't support module removal. Given that
> > argument did not seem to hold any water it begs the question if you
> > still would rather not see this fixed in sysfs / kernfs.
> >
> > If you however are more open to it now, I can instead take that work, and
> > send a proper patch for review.
>
> I looked at the last patch here and I really do not see the issue.
>
> In order for the module to be removed, zram_exit() has to return, right?
Yes, but the race is for when a module removal is ongoing, in other
words, it has not yet completed, and at the same time we race touching
sysfs files.
> So how can a show/store function in zram_drv.c be called after
> destroy_devices() returns?
The issue can come up if we have something poke at the sysfs files *while* a
removal is happening.
> The changelog text in patch 4/4 is odd, destroy_devices() shouldn't be
> racing with anything as devices have reference counts in order to
> protect this type of thing from happening, right?
The race is beyond the scope of the driver in that the sysfs file can be
poked at while a removal is ongoing. After that consider the possible
races that can happen.
> How can a store
> function be called when a device is somehow removed from memory at the
> same time? Don't we properly incremement/decrement the device
> structure's reference count? If not, wouldn't that be the simplest
> solution here?
In this case the proper refcounting has to be down on the type of
device, and so bdgrab(dev_to_bdev(dev)), because otherwise the pointer
for the zram device is just not valid, so we can't do sanity checks
on the sysfs calls for the zram device as the pointer can be invalid.
To validate the pointer we must do refcounting below a layer. In this
case for a disk.
And so genericaly this would be:
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 4a8bf8cda52b..7c897e3f4135 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -29,6 +29,7 @@
#include <linux/sched/mm.h>
#include <linux/sysfs.h>
#include <linux/dma-map-ops.h> /* for dma_default_coherent */
+#include <linux/blkdev.h>
#include "base.h"
#include "power/power.h"
@@ -1988,11 +1989,20 @@ static inline int device_is_not_partition(struct device *dev)
{
return !(dev->type == &part_type);
}
+static inline bool device_is_disk(struct device *dev)
+{
+ return (dev->type == &disk_type);
+}
#else
static inline int device_is_not_partition(struct device *dev)
{
return 1;
}
+
+static inline bool device_is_disk(struct device *dev)
+{
+ return false;
+}
#endif
static int
@@ -2037,6 +2047,19 @@ const char *dev_driver_string(const struct device *dev)
}
EXPORT_SYMBOL(dev_driver_string);
+static int dev_type_get(struct device *dev)
+{
+ if (device_is_disk(dev))
+ return !!bdgrab(dev_to_bdev(dev));
+ return 1;
+}
+
+static void dev_type_put(struct device *dev)
+{
+ if (device_is_disk(dev))
+ bdput(dev_to_bdev(dev));
+}
+
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
@@ -2046,12 +2069,20 @@ static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
struct device *dev = kobj_to_dev(kobj);
ssize_t ret = -EIO;
+ if (!dev_type_get(dev))
+ return -ENODEV;
+
if (dev_attr->show)
ret = dev_attr->show(dev, dev_attr, buf);
if (ret >= (ssize_t)PAGE_SIZE) {
printk("dev_attr_show: %pS returned bad count\n",
dev_attr->show);
}
+
+ dev_type_put(dev);
+
+ put_device(dev);
+
return ret;
}
@@ -2062,8 +2093,14 @@ static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,
struct device *dev = kobj_to_dev(kobj);
ssize_t ret = -EIO;
+ if (!dev_type_get(dev))
+ return -ENODEV;
+
if (dev_attr->store)
ret = dev_attr->store(dev, dev_attr, buf, count);
+
+ dev_type_put(dev);
+
return ret;
}
> And who is ripping out zram drivers while the system is running anyway?
This race was spotted by Minchan as I explained the race in patch 3.
However he had a point and the above is a generic attempt to resolving
that issue. The patch 4 is a open coding it on the zram driver alone.
> What workflow causes this to happen so much so that the sysfs files need
> to be "protected"? What tool/script/whatever is hammering on those
> sysfs files so much while someone wants to unload the module?
It is theoretical, however it is possible.
In the case of patch 3 though, the deadlock is possible and visible
while running an ltp script at the same time twice in a loop. I decided
to try to work on a generic solution as I was told a similar race was
known to have occurred when testing live patching code.
The generic approach to solving the race on patch 3 follows:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index cf8deecc39ef..863d84353316 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1876,6 +1876,7 @@ static struct attribute *zram_disk_attrs[] = {
static const struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
+ .owner = THIS_MODULE,
};
static const struct attribute_group *zram_disk_attr_groups[] = {
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7e0e62deab53..faedaaaed247 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
#include <linux/slab.h>
#include <linux/security.h>
#include <linux/hash.h>
+#include <linux/module.h>
#include "kernfs-internal.h"
@@ -415,9 +416,15 @@ struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
if (unlikely(!kn))
return NULL;
- if (!atomic_inc_unless_negative(&kn->active))
+ if (unlikely(kn->owner && !try_module_get(kn->owner)))
return NULL;
+ if (!atomic_inc_unless_negative(&kn->active)) {
+ if (kn->owner)
+ module_put(kn->owner);
+ return NULL;
+ }
+
if (kernfs_lockdep(kn))
rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
return kn;
@@ -440,6 +447,8 @@ void kernfs_put_active(struct kernfs_node *kn)
if (kernfs_lockdep(kn))
rwsem_release(&kn->dep_map, _RET_IP_);
v = atomic_dec_return(&kn->active);
+ if (kn->owner)
+ module_put(kn->owner);
if (likely(v != KN_DEACTIVATED_BIAS))
return;
@@ -612,6 +621,7 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
unsigned flags)
{
@@ -647,6 +657,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
kn->name = name;
kn->mode = mode;
+ kn->owner = owner;
kn->flags = flags;
if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
@@ -680,13 +691,14 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
unsigned flags)
{
struct kernfs_node *kn;
kn = __kernfs_new_node(kernfs_root(parent), parent,
- name, mode, uid, gid, flags);
+ name, mode, owner, uid, gid, flags);
if (kn) {
kernfs_get(parent);
kn->parent = parent;
@@ -967,7 +979,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
root->id_highbits = 1;
kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
- GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
+ NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
KERNFS_DIR);
if (!kn) {
idr_destroy(&root->ino_idr);
@@ -1006,6 +1018,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
* @parent: parent in which to create a new directory
* @name: name of the new directory
* @mode: mode of the new directory
+ * @owner: if set, the module owner of this directory
* @uid: uid of the new directory
* @gid: gid of the new directory
* @priv: opaque data associated with the new directory
@@ -1015,6 +1028,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
*/
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
void *priv, const void *ns)
{
@@ -1023,7 +1037,7 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
/* allocate */
kn = kernfs_new_node(parent, name, mode | S_IFDIR,
- uid, gid, KERNFS_DIR);
+ owner, uid, gid, KERNFS_DIR);
if (!kn)
return ERR_PTR(-ENOMEM);
@@ -1055,7 +1069,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
/* allocate */
kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
- GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+ NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index c75719312147..69080869abfc 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -961,6 +961,7 @@ const struct file_operations kernfs_file_fops = {
* @uid: uid of the file
* @gid: gid of the file
* @size: size of the file
+ * @owner: if set, the module owner of the file
* @ops: kernfs operations for the file
* @priv: private data for the file
* @ns: optional namespace tag of the file
@@ -970,7 +971,9 @@ const struct file_operations kernfs_file_fops = {
*/
struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode,
+ struct module *owner,
+ kuid_t uid, kgid_t gid,
loff_t size,
const struct kernfs_ops *ops,
void *priv, const void *ns,
@@ -983,7 +986,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
flags = KERNFS_FILE;
kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
- uid, gid, flags);
+ owner, uid, gid, flags);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index ccc3b44f6306..5598c1127db7 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -112,6 +112,7 @@ void kernfs_put_active(struct kernfs_node *kn);
int kernfs_add_one(struct kernfs_node *kn);
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
unsigned flags);
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 5432883d819f..8e130a411a9f 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
gid = target->iattr->ia_gid;
}
- kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
+ kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO,
+ target->owner, uid, gid,
KERNFS_LINK);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 59dffd5ca517..5aff6ff07392 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -57,7 +57,8 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
kobject_get_ownership(kobj, &uid, &gid);
kn = kernfs_create_dir_ns(parent, kobject_name(kobj),
- S_IRWXU | S_IRUGO | S_IXUGO, uid, gid,
+ S_IRWXU | S_IRUGO | S_IXUGO, NULL,
+ uid, gid,
kobj, ns);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 9aefa7779b29..acd0756199e1 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -314,8 +314,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
key = attr->key ?: (struct lock_class_key *)&attr->skey;
#endif
- kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
- size, ops, (void *)attr, ns, key);
+ kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner,
+ uid, gid, size, ops, (void *)attr, ns, key);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index 64e6a6698935..010c2dade2d6 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -136,6 +136,7 @@ static int internal_create_group(struct kobject *kobj, int update,
} else {
kn = kernfs_create_dir_ns(kobj->sd, grp->name,
S_IRWXU | S_IRUGO | S_IXUGO,
+ grp->owner,
uid, gid, kobj, NULL);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 9e8ca8743c26..43abaed26735 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -155,6 +155,7 @@ struct kernfs_node {
unsigned short flags;
umode_t mode;
+ struct module *owner;
struct kernfs_iattrs *iattr;
};
@@ -368,12 +369,14 @@ void kernfs_destroy_root(struct kernfs_root *root);
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
void *priv, const void *ns);
struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
const char *name);
struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
loff_t size,
const struct kernfs_ops *ops,
@@ -465,13 +468,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
static inline struct kernfs_node *
kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode, struct module *owner,
+ kuid_t uid, kgid_t gid,
void *priv, const void *ns)
{ return ERR_PTR(-ENOSYS); }
static inline struct kernfs_node *
__kernfs_create_file(struct kernfs_node *parent, const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode, struct module *owner,
+ kuid_t uid, kgid_t gid,
loff_t size, const struct kernfs_ops *ops,
void *priv, const void *ns, struct lock_class_key *key)
{ return ERR_PTR(-ENOSYS); }
@@ -558,14 +563,15 @@ static inline struct kernfs_node *
kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
void *priv)
{
- return kernfs_create_dir_ns(parent, name, mode,
+ return kernfs_create_dir_ns(parent, name, mode, parent->owner,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
priv, NULL);
}
static inline struct kernfs_node *
kernfs_create_file_ns(struct kernfs_node *parent, const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode, struct module *owner,
+ kuid_t uid, kgid_t gid,
loff_t size, const struct kernfs_ops *ops,
void *priv, const void *ns)
{
@@ -574,15 +580,16 @@ kernfs_create_file_ns(struct kernfs_node *parent, const char *name,
#ifdef CONFIG_DEBUG_LOCK_ALLOC
key = (struct lock_class_key *)&ops->lockdep_key;
#endif
- return __kernfs_create_file(parent, name, mode, uid, gid,
+ return __kernfs_create_file(parent, name, mode, owner, uid, gid,
size, ops, priv, ns, key);
}
static inline struct kernfs_node *
kernfs_create_file(struct kernfs_node *parent, const char *name, umode_t mode,
+ struct module *owner,
loff_t size, const struct kernfs_ops *ops, void *priv)
{
- return kernfs_create_file_ns(parent, name, mode,
+ return kernfs_create_file_ns(parent, name, mode, owner,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
size, ops, priv, NULL);
}
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index d76a1ddf83a3..6652504b860e 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -30,6 +30,7 @@ enum kobj_ns_type;
struct attribute {
const char *name;
umode_t mode;
+ struct module *owner;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
bool ignore_lockdep:1;
struct lock_class_key *key;
@@ -80,6 +81,7 @@ do { \
* @attrs: Pointer to NULL terminated list of attributes.
* @bin_attrs: Pointer to NULL terminated list of binary attributes.
* Either attrs or bin_attrs or both must be provided.
+ * @module: If set, module responsible for this attribute group
*/
struct attribute_group {
const char *name;
@@ -89,6 +91,7 @@ struct attribute_group {
struct bin_attribute *, int);
struct attribute **attrs;
struct bin_attribute **bin_attrs;
+ struct module *owner;
};
/*
@@ -100,38 +103,52 @@ struct attribute_group {
#define __ATTR(_name, _mode, _show, _store) { \
.attr = {.name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
#define __ATTR_PREALLOC(_name, _mode, _show, _store) { \
.attr = {.name = __stringify(_name), \
- .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
+ .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
#define __ATTR_RO(_name) { \
- .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0444, \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
}
#define __ATTR_RO_MODE(_name, _mode) { \
.attr = { .name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
}
#define __ATTR_RW_MODE(_name, _mode) { \
.attr = { .name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
.store = _name##_store, \
}
#define __ATTR_WO(_name) { \
- .attr = { .name = __stringify(_name), .mode = 0200 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0200, \
+ .owner = THIS_MODULE, \
+ }, \
.store = _name##_store, \
}
@@ -141,8 +158,11 @@ struct attribute_group {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
#define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) { \
- .attr = {.name = __stringify(_name), .mode = _mode, \
- .ignore_lockdep = true }, \
+ .attr = {.name = __stringify(_name), \
+ .mode = _mode, \
+ .ignore_lockdep = true, \
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
@@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = { \
#define ATTRIBUTE_GROUPS(_name) \
static const struct attribute_group _name##_group = { \
.attrs = _name##_attrs, \
+ .owner = THIS_MODULE, \
}; \
__ATTRIBUTE_GROUPS(_name)
@@ -193,20 +214,29 @@ struct bin_attribute {
/* macros to create static binary attributes easier */
#define __BIN_ATTR(_name, _mode, _read, _write, _size) { \
- .attr = { .name = __stringify(_name), .mode = _mode }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = _mode, \
+ .owner = THIS_MODULE, \
+ }, \
.read = _read, \
.write = _write, \
.size = _size, \
}
#define __BIN_ATTR_RO(_name, _size) { \
- .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0444, \
+ .owner = THIS_MODULE, \
+ }, \
.read = _name##_read, \
.size = _size, \
}
#define __BIN_ATTR_WO(_name, _size) { \
- .attr = { .name = __stringify(_name), .mode = 0200 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0200, \
+ .owner = THIS_MODULE, \
+ }, \
.write = _name##_write, \
.size = _size, \
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e049edd66776..7af508ae5348 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3833,7 +3833,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
key = &cft->lockdep_key;
#endif
kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
- cgroup_file_mode(cft),
+ cgroup_file_mode(cft), NULL,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
0, cft->kf_ops, cft,
NULL, key);
On Fri, May 21, 2021 at 09:08:17PM +0000, Luis Chamberlain wrote:
> On Fri, May 21, 2021 at 10:45:00PM +0200, Greg Kroah-Hartman wrote:
> > On Fri, May 21, 2021 at 08:16:18PM +0000, Luis Chamberlain wrote:
> > > On Fri, May 21, 2021 at 10:01:52PM +0200, Greg Kroah-Hartman wrote:
> > > > On Wed, May 19, 2021 at 08:20:23PM +0000, Luis Chamberlain wrote:
> > > > > Greg,
> > > > >
> > > > > your feedback would be appreciated here.
> > > >
> > > > Appreciated where? This is a zram patchset, what do I need to mess with
> > > > it for?
> > >
> > > This patchset has 2 issues which I noted in the last series that are
> > > generic, and could best be dealt with on sysfs, and suggested
> > > how this could actually be dealt with on sysfs / kernfs.
> > >
> > > > > Greg, can you comment on technical levels why a general core fix is not
> > > > > desirable upstream for those two issues?
> > > >
> > > > What issues exactly?
> > >
> > > When I suggested the generic way to fix this your main argument against
> > > a generic solution was that we don't support module removal. Given that
> > > argument did not seem to hold any water it begs the question if you
> > > still would rather not see this fixed in sysfs / kernfs.
> > >
> > > If you however are more open to it now, I can instead take that work, and
> > > send a proper patch for review.
> >
> > I looked at the last patch here and I really do not see the issue.
> >
> > In order for the module to be removed, zram_exit() has to return, right?
>
> Yes, but the race is for when a module removal is ongoing, in other
> words, it has not yet completed, and at the same time we race touching
> sysfs files.
>
> > So how can a show/store function in zram_drv.c be called after
> > destroy_devices() returns?
>
> The issue can come up if we have something poke at the sysfs files *while* a
> removal is happening.
And have you seen this in the real world? I keep asking this as module
removal is not an automated process so what triggers this?
> > The changelog text in patch 4/4 is odd, destroy_devices() shouldn't be
> > racing with anything as devices have reference counts in order to
> > protect this type of thing from happening, right?
>
> The race is beyond the scope of the driver in that the sysfs file can be
> poked at while a removal is ongoing. After that consider the possible
> races that can happen.
>
> > How can a store
> > function be called when a device is somehow removed from memory at the
> > same time? Don't we properly incremement/decrement the device
> > structure's reference count? If not, wouldn't that be the simplest
> > solution here?
>
> In this case the proper refcounting has to be down on the type of
> device, and so bdgrab(dev_to_bdev(dev)), because otherwise the pointer
> for the zram device is just not valid, so we can't do sanity checks
> on the sysfs calls for the zram device as the pointer can be invalid.
> To validate the pointer we must do refcounting below a layer. In this
> case for a disk.
>
> And so genericaly this would be:
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 4a8bf8cda52b..7c897e3f4135 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -29,6 +29,7 @@
> #include <linux/sched/mm.h>
> #include <linux/sysfs.h>
> #include <linux/dma-map-ops.h> /* for dma_default_coherent */
> +#include <linux/blkdev.h>
>
> #include "base.h"
> #include "power/power.h"
> @@ -1988,11 +1989,20 @@ static inline int device_is_not_partition(struct device *dev)
> {
> return !(dev->type == &part_type);
> }
> +static inline bool device_is_disk(struct device *dev)
> +{
> + return (dev->type == &disk_type);
> +}
> #else
> static inline int device_is_not_partition(struct device *dev)
> {
> return 1;
> }
> +
> +static inline bool device_is_disk(struct device *dev)
> +{
> + return false;
> +}
> #endif
>
> static int
> @@ -2037,6 +2047,19 @@ const char *dev_driver_string(const struct device *dev)
> }
> EXPORT_SYMBOL(dev_driver_string);
>
> +static int dev_type_get(struct device *dev)
> +{
> + if (device_is_disk(dev))
> + return !!bdgrab(dev_to_bdev(dev));
> + return 1;
> +}
Why do you feel that block devices are somehow more "special" here?
They are not, either this is "broken" for everyone, or it works for
everyone, don't special-case one tiny class of devices for unknown
reasons.
Your change causes another problem, if a sysfs file has show/store
happening, the reference count will always be bumped and so the module
would NOT be able to be freed. That looks like a lovely DoS that any
user could cause, right?
In sleeping on this the "correct" solution is to grab the bus lock (or
ktype lock) for the device during show/store so that the "delete device"
process can not race with it. Also let's make sure that the kref of the
kobject is being properly bumped during show/store as well, I don't
think that's happening which isn't a good thing no matter what
(different type of bug).
If that lock addition ends up showing up in benchmarks, then we can
always move it to rcu.
So in conclusion, the "correct" thing here seems to be two independant
things:
- make sure the reference count of the kobject is properly
incremented during show/store callbacks
- grab the kobject's type/bus/whatever lock during show/store so
that it can not race with deleting the device.
No bus/type should be special cased here, block devices are not special
by any means.
And don't mess with module reference counts, that way lies madness. We
want to lock data, not code :)
thanks,
greg k-h
On Sat, May 22, 2021 at 09:48:34AM +0200, Greg Kroah-Hartman wrote:
> On Fri, May 21, 2021 at 09:08:17PM +0000, Luis Chamberlain wrote:
> > On Fri, May 21, 2021 at 10:45:00PM +0200, Greg Kroah-Hartman wrote:
> > > I looked at the last patch here and I really do not see the issue.
> > >
> > > In order for the module to be removed, zram_exit() has to return, right?
> >
> > Yes, but the race is for when a module removal is ongoing, in other
> > words, it has not yet completed, and at the same time we race touching
> > sysfs files.
> >
> > > So how can a show/store function in zram_drv.c be called after
> > > destroy_devices() returns?
> >
> > The issue can come up if we have something poke at the sysfs files *while* a
> > removal is happening.
>
> And have you seen this in the real world? I keep asking this as module
> removal is not an automated process so what triggers this?
No, its not seen in the real world. It was theoretical, and noted as
possible by Minchan Kim. I reviewed it, and I agree the race is
possible.
> > And so genericaly this would be:
> >
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index 4a8bf8cda52b..7c897e3f4135 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -29,6 +29,7 @@
> > #include <linux/sched/mm.h>
> > #include <linux/sysfs.h>
> > #include <linux/dma-map-ops.h> /* for dma_default_coherent */
> > +#include <linux/blkdev.h>
> >
> > #include "base.h"
> > #include "power/power.h"
> > @@ -1988,11 +1989,20 @@ static inline int device_is_not_partition(struct device *dev)
> > {
> > return !(dev->type == &part_type);
> > }
> > +static inline bool device_is_disk(struct device *dev)
> > +{
> > + return (dev->type == &disk_type);
> > +}
> > #else
> > static inline int device_is_not_partition(struct device *dev)
> > {
> > return 1;
> > }
> > +
> > +static inline bool device_is_disk(struct device *dev)
> > +{
> > + return false;
> > +}
> > #endif
> >
> > static int
> > @@ -2037,6 +2047,19 @@ const char *dev_driver_string(const struct device *dev)
> > }
> > EXPORT_SYMBOL(dev_driver_string);
> >
> > +static int dev_type_get(struct device *dev)
> > +{
> > + if (device_is_disk(dev))
> > + return !!bdgrab(dev_to_bdev(dev));
> > + return 1;
> > +}
>
> Why do you feel that block devices are somehow more "special" here?
I am not saying they are.
> They are not, either this is "broken" for everyone, or it works for
> everyone, don't special-case one tiny class of devices for unknown
> reasons.
The reason dev_type_get() was implemented was precisely to allow for
this to be expanded with the other types as they the *get* is specific to
the type.
> Your change causes another problem, if a sysfs file has show/store
> happening, the reference count will always be bumped and so the module
> would NOT be able to be freed. That looks like a lovely DoS that any
> user could cause, right?
Yes true. I think the better way to resolve that is to introduce and use
*try* methods, and so rmmod always trumps a new *get* for these
operations.
That would sole the possible "DOS" issue, precisely how I resolved this
same concern for resolving the deadlock with try_module_get().
> In sleeping on this
Sorry, did you mean you thought about this, or you meant sleep as in
the sleep context?
> the "correct" solution is to grab the bus lock (or
> ktype lock) for the device during show/store so that the "delete device"
> process can not race with it.
The still presents the DOS issue, which I actually did want to avoid.
I avoided it on the other deadlock issue, and so ideally I'd also
want to avoid it here. But indeed, the bus route is simple and clean.
And an alternative to the DOS issue is we just use capability checks,
if we don't do that already.
> Also let's make sure that the kref of the
> kobject is being properly bumped during show/store as well, I don't
> think that's happening which isn't a good thing no matter what
> (different type of bug).
Yup...
> If that lock addition ends up showing up in benchmarks, then we can
> always move it to rcu.
OK.
> So in conclusion, the "correct" thing here seems to be two independant
> things:
> - make sure the reference count of the kobject is properly
> incremented during show/store callbacks
> - grab the kobject's type/bus/whatever lock during show/store so
> that it can not race with deleting the device.
Yup. The above was a proof of concept solution using type, but indeed,
the downside is we'd have to implement try methods when not found, and
likely the list of types is endless.
Are there places where we cannot use the bus?
> No bus/type should be special cased here, block devices are not special
> by any means.
>
> And don't mess with module reference counts, that way lies madness. We
> want to lock data, not code :)
Live patching needs to lock code ;) and hey it works ;)
Addressing the kobject refecount here should in theory address most
deadlocks (what my third patch addresses) as well becuase, as you imply,
our protection of the kobject should prevent removal, but that's not
always the case. I think you're failing to consider a shared global
driver lock, which can be used on sysfs files, which in turn have
*nothing* kref'd. And so the module removal can still try to nuke sysfs
files, if those sysfs files like to mess with the shared global driver
lock.
Luis
On Tue, May 25, 2021 at 01:16:07AM +0000, Luis Chamberlain wrote:
> On Sat, May 22, 2021 at 09:48:34AM +0200, Greg Kroah-Hartman wrote:
> > On Fri, May 21, 2021 at 09:08:17PM +0000, Luis Chamberlain wrote:
> > > On Fri, May 21, 2021 at 10:45:00PM +0200, Greg Kroah-Hartman wrote:
> > > > I looked at the last patch here and I really do not see the issue.
> > > >
> > > > In order for the module to be removed, zram_exit() has to return, right?
> > >
> > > Yes, but the race is for when a module removal is ongoing, in other
> > > words, it has not yet completed, and at the same time we race touching
> > > sysfs files.
> > >
> > > > So how can a show/store function in zram_drv.c be called after
> > > > destroy_devices() returns?
> > >
> > > The issue can come up if we have something poke at the sysfs files *while* a
> > > removal is happening.
> >
> > And have you seen this in the real world? I keep asking this as module
> > removal is not an automated process so what triggers this?
>
> No, its not seen in the real world. It was theoretical, and noted as
> possible by Minchan Kim. I reviewed it, and I agree the race is
> possible.
Ok, then really, it's not a big deal :)
> > Why do you feel that block devices are somehow more "special" here?
>
> I am not saying they are.
Your patch made them "special", don't do that.
> > They are not, either this is "broken" for everyone, or it works for
> > everyone, don't special-case one tiny class of devices for unknown
> > reasons.
>
> The reason dev_type_get() was implemented was precisely to allow for
> this to be expanded with the other types as they the *get* is specific to
> the type.
No, that's the wrong thing to do.
> > Your change causes another problem, if a sysfs file has show/store
> > happening, the reference count will always be bumped and so the module
> > would NOT be able to be freed. That looks like a lovely DoS that any
> > user could cause, right?
>
> Yes true. I think the better way to resolve that is to introduce and use
> *try* methods, and so rmmod always trumps a new *get* for these
> operations.
No, "try" methods suck, as the Yoda quote says.
> That would sole the possible "DOS" issue, precisely how I resolved this
> same concern for resolving the deadlock with try_module_get().
Should not be needed.
> > In sleeping on this
>
> Sorry, did you mean you thought about this, or you meant sleep as in
> the sleep context?
I thought about this, sorry for the confusion.
> > So in conclusion, the "correct" thing here seems to be two independant
> > things:
> > - make sure the reference count of the kobject is properly
> > incremented during show/store callbacks
> > - grab the kobject's type/bus/whatever lock during show/store so
> > that it can not race with deleting the device.
>
> Yup. The above was a proof of concept solution using type, but indeed,
> the downside is we'd have to implement try methods when not found, and
> likely the list of types is endless.
>
> Are there places where we cannot use the bus?
Not that I know of, all objects live on some sort of "type/bus", that's
the way the driver model works.
> > No bus/type should be special cased here, block devices are not special
> > by any means.
> >
> > And don't mess with module reference counts, that way lies madness. We
> > want to lock data, not code :)
>
> Live patching needs to lock code ;) and hey it works ;)
Live patching is vodoo magic. But it just "adds" code paths, and later,
when it feels all is good, then it can remove stuff (if it even does,
I do not remember). Adding is easy, removing is hard.
> Addressing the kobject refecount here should in theory address most
> deadlocks (what my third patch addresses) as well becuase, as you imply,
> our protection of the kobject should prevent removal, but that's not
> always the case. I think you're failing to consider a shared global
> driver lock, which can be used on sysfs files, which in turn have
> *nothing* kref'd. And so the module removal can still try to nuke sysfs
> files, if those sysfs files like to mess with the shared global driver
> lock.
If any driver has that kind of crud, they deserve the nightmare that
would happen if it interacts this way. Don't worry about that, it's not
a pattern that anyone should be using.
And again, if the code and data is still there, the lock is ok to grab,
there should not be a problem. If so, we can fix the driver.
thanks,
greg k-h
On Tue, May 25, 2021 at 09:41:26AM +0200, Greg Kroah-Hartman wrote:
> On Tue, May 25, 2021 at 01:16:07AM +0000, Luis Chamberlain wrote:
> > Live patching needs to lock code ;) and hey it works ;)
>
> Live patching is vodoo magic. But it just "adds" code paths, and later,
> when it feels all is good, then it can remove stuff (if it even does,
> I do not remember). Adding is easy, removing is hard.
I didn't say it was easy. I meant that we support it and we can consider
its support as well.
> > Addressing the kobject refecount here should in theory address most
> > deadlocks (what my third patch addresses) as well becuase, as you imply,
> > our protection of the kobject should prevent removal, but that's not
> > always the case. I think you're failing to consider a shared global
> > driver lock, which can be used on sysfs files, which in turn have
> > *nothing* kref'd. And so the module removal can still try to nuke sysfs
> > files, if those sysfs files like to mess with the shared global driver
> > lock.
>
> If any driver has that kind of crud, they deserve the nightmare that
> would happen if it interacts this way. Don't worry about that, it's not
> a pattern that anyone should be using.
>
> And again, if the code and data is still there, the lock is ok to grab,
> there should not be a problem. If so, we can fix the driver.
I went back to the drawing board with this in mind. But a few things to
note:
The issue of the deadlock does not imply a lock has to be global. So long
as the rmmod path uses a lock which is also used by sysfs files, you can end
up in a deadlock.
Despite this, I tried to remove the global lock on the zram driver, however it
just doesn't seem right to remove it, its being used to help protect a
generic state machine and a global lock seems perfectly reasonable for the
driver.
If you still believe the global is not needed, let me know what you come
up with as an alternative. I just can't find a clean way to do away with
that, *and*, I still also think this pattern might be prevalent in the
kernel in different places. I am not sure we can set as generic rule:
"Thou Shalt Not use the same lock on rmmod and sysfs files"
I'm moving forward in my v3 series by keeping it and instead now
providing module device attribute wrappers. The idea with this is,
that if we are not yet sure what to do yet, we can at least have
drivers integrate these helpers as well when and if they find they need
a similar solution.
Luis