The zram driver makes use of cpu hotplug multistate support,
whereby it associates a zram compression stream per CPU. To
support CPU hotplug multistate, a callback is registered to
allow the driver to do what it needs when a CPU is hotplugged.
It is however currently possible to end up removing the
zram driver callback prior to removing the zram compression
streams per CPU. This would leave these compression streams
hanging.
We need to fix the ordering of driver load / removal and zram
device additions in light of the driver's use of cpu
hotplug multistate. Since the zram driver exposes many
sysfs attributes which can also muck with the compression
streams, this also means we can easily hit page faults today.
Fix all this by providing a zram initialization boolean,
protected by the driver's shared zram_index_mutex,
which we can use to annotate when sysfs attributes are
safe to use or not -- once the driver is properly initialized.
When the driver is going down we also make sure not to let
userspace muck with attributes which may affect the zram per-CPU
compression streams.
This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by issuing
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:
cd testcases/kernel/device-drivers/zram
while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
You end up with a splat as follows:
kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0. Priority:-2 extents:1 across:104857596k SSFS
kernel: zram0: detected capacity change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G E 5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel: do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
The "Error: Removing state 63 which has instances left" refers
to the zram per-CPU compression streams that are still left.
[0] https://github.com/linux-test-project/ltp.git
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
1 file changed, 55 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a711a2e2a794..63b6119cee93 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
+bool zram_up;
+
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
/*
@@ -1699,6 +1701,7 @@ static void zram_reset_device(struct zram *zram)
comp = zram->comp;
disksize = zram->disksize;
zram->disksize = 0;
+ zram->comp = NULL;
set_capacity_and_notify(zram->disk, 0);
part_stat_set_all(zram->disk->part0, 0);
@@ -1719,9 +1722,18 @@ static ssize_t disksize_store(struct device *dev,
struct zram *zram = dev_to_zram(dev);
int err;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ err = -ENODEV;
+ goto out;
+ }
+
disksize = memparse(buf, NULL);
- if (!disksize)
- return -EINVAL;
+ if (!disksize) {
+ err = -EINVAL;
+ goto out;
+ }
down_write(&zram->init_lock);
if (init_done(zram)) {
@@ -1749,12 +1761,16 @@ static ssize_t disksize_store(struct device *dev,
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);
+ mutex_unlock(&zram_index_mutex);
+
return len;
out_free_meta:
zram_meta_free(zram, disksize);
out_unlock:
up_write(&zram->init_lock);
+out:
+ mutex_unlock(&zram_index_mutex);
return err;
}
@@ -1770,8 +1786,17 @@ static ssize_t reset_store(struct device *dev,
if (ret)
return ret;
- if (!do_reset)
- return -EINVAL;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ len = -ENODEV;
+ goto out;
+ }
+
+ if (!do_reset) {
+ len = -EINVAL;
+ goto out;
+ }
zram = dev_to_zram(dev);
bdev = zram->disk->part0;
@@ -1780,7 +1805,8 @@ static ssize_t reset_store(struct device *dev,
/* Do not reset an active device or claimed device */
if (bdev->bd_openers || zram->claim) {
mutex_unlock(&bdev->bd_mutex);
- return -EBUSY;
+ len = -EBUSY;
+ goto out;
}
/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1795,6 +1821,8 @@ static ssize_t reset_store(struct device *dev,
zram->claim = false;
mutex_unlock(&bdev->bd_mutex);
+out:
+ mutex_unlock(&zram_index_mutex);
return len;
}
@@ -2016,6 +2044,10 @@ static ssize_t hot_add_show(struct class *class,
int ret;
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ mutex_unlock(&zram_index_mutex);
+ return -ENODEV;
+ }
ret = zram_add();
mutex_unlock(&zram_index_mutex);
@@ -2043,6 +2075,11 @@ static ssize_t hot_remove_store(struct class *class,
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ ret = -ENODEV;
+ goto out;
+ }
+
zram = idr_find(&zram_index_idr, dev_id);
if (zram) {
ret = zram_remove(zram);
@@ -2052,6 +2089,7 @@ static ssize_t hot_remove_store(struct class *class,
ret = -ENODEV;
}
+out:
mutex_unlock(&zram_index_mutex);
return ret ? ret : count;
}
@@ -2078,12 +2116,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
static void destroy_devices(void)
{
+ mutex_lock(&zram_index_mutex);
+ zram_up = false;
class_unregister(&zram_control_class);
idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+ mutex_unlock(&zram_index_mutex);
}
static int __init zram_init(void)
@@ -2111,15 +2152,21 @@ static int __init zram_init(void)
return -EBUSY;
}
+ mutex_lock(&zram_index_mutex);
+
while (num_devices != 0) {
- mutex_lock(&zram_index_mutex);
ret = zram_add();
- mutex_unlock(&zram_index_mutex);
- if (ret < 0)
+ if (ret < 0) {
+ mutex_unlock(&zram_index_mutex);
goto out_error;
+ }
num_devices--;
}
+ zram_up = true;
+
+ mutex_unlock(&zram_index_mutex);
+
return 0;
out_error:
--
2.30.1
On Sat, Mar 06, 2021 at 02:20:34AM +0000, Luis Chamberlain wrote:
> The zram driver makes use of cpu hotplug multistate support,
> whereby it associates a zram compression stream per CPU. To
> support CPU hotplug multistate a callback enabled to allow
> the driver to do what it needs when a CPU hotplugs.
>
> It is however currently possible to end up removing the
> zram driver callback prior to removing the zram compression
> streams per CPU. This would leave these compression streams
> hanging.
>
> We need to fix ordering for driver load / removal, zram
> device additions, in light of the driver's use of cpu
> hotplug multistate. Since the zram driver exposes many
> sysfs attribute which can also muck with the comrpession
> streams this also means we can hit page faults today easily.
Hi Luis,
First of all, thanks for reporting these ancient bugs.
Looks like you found several bugs, and I am trying to digest them
from your description to understand more clearly. I need to ask
some stupid questions first.
If I understand correctly, the bugs you found are related to a module
unloading race while zram is still working.
If so, couldn't we fix the bug like this? (It's not a perfect
patch, but I just wanted to show how to close the module unloading race.)
(I might miss other pieces here. Let me know. Thanks!)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a711a2e2a794..646ae9e0b710 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1696,6 +1696,8 @@ static void zram_reset_device(struct zram *zram)
return;
}
+ module_put(THIS_MODULE);
+
comp = zram->comp;
disksize = zram->disksize;
zram->disksize = 0;
@@ -1744,13 +1746,19 @@ static ssize_t disksize_store(struct device *dev,
goto out_free_meta;
}
+ if (!try_module_get(THIS_MODULE))
+ goto out_free_zcomp;
+
zram->comp = comp;
zram->disksize = disksize;
+
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);
return len;
+out_free_zcomp:
+ zcomp_destroy(comp);
out_free_meta:
zram_meta_free(zram, disksize);
out_unlock:
On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote:
> On Sat, Mar 06, 2021 at 02:20:34AM +0000, Luis Chamberlain wrote:
> > The zram driver makes use of cpu hotplug multistate support,
> > whereby it associates a zram compression stream per CPU. To
> > support CPU hotplug multistate a callback enabled to allow
> > the driver to do what it needs when a CPU hotplugs.
> >
> > It is however currently possible to end up removing the
> > zram driver callback prior to removing the zram compression
> > streams per CPU. This would leave these compression streams
> > hanging.
> >
> > We need to fix ordering for driver load / removal, zram
> > device additions, in light of the driver's use of cpu
> > hotplug multistate. Since the zram driver exposes many
> > sysfs attribute which can also muck with the comrpession
> > streams this also means we can hit page faults today easily.
>
> Hi Luis,
>
> First of all, thanks for reporting the ancient bugs.
>
> Looks like you found several bugs and I am trying to digest them
> from your description to understand more clear. I need to ask
> stupid questions, first.
>
> If I understand correctly, bugs you found were related to module
> unloading race while the zram are still working.
> If so, couldn't we fix the bug like this(it's not a perfect
> patch but just wanted to show close module unloading race)?
> (I might miss other pieces here. Let me know. Thanks!)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index a711a2e2a794..646ae9e0b710 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1696,6 +1696,8 @@ static void zram_reset_device(struct zram *zram)
> return;
> }
>
> + module_put(THIS_MODULE);
> +
> comp = zram->comp;
> disksize = zram->disksize;
> zram->disksize = 0;
> @@ -1744,13 +1746,19 @@ static ssize_t disksize_store(struct device *dev,
> goto out_free_meta;
> }
>
> + if (!try_module_get(THIS_MODULE))
> + goto out_free_zcomp;
> +
> zram->comp = comp;
> zram->disksize = disksize;
> +
> set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
> up_write(&zram->init_lock);
>
> return len;
>
> +out_free_zcomp:
> + zcomp_destroy(comp);
> out_free_meta:
> zram_meta_free(zram, disksize);
> out_unlock:
This still allows a crash in the driver when running the zram02.sh script twice.
Mar 09 14:52:19 kdevops-blktests-block-dev kernel: zram0: detected capacity change from 209715200 to 0
Mar 09 14:52:19 kdevops-blktests-block-dev kernel: BUG: unable to handle page fault for address: ffffba7db7904008
Mar 09 14:52:19 kdevops-blktests-block-dev kernel: #PF: supervisor read access in kernel mode
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: #PF: error_code(0x0000) - not-present page
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: PGD 100000067 P4D 100000067 PUD 100311067 PMD 14cd2f067 PTE 0
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Oops: 0000 [#1] SMP NOPTI
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CPU: 0 PID: 2137 Comm: zram02.sh Tainted: G E 5.12.0-rc1-next-20210304+ #4
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RIP: 0010:zram_free_page+0x1b/0xf0 [zram]
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Code: 1f 44 00 00 48 89 c8 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 54 49 89 f4 55 89 f5 53 48 8b 17 48 c1 e5 04 48 89 f>
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RSP: 0018:ffffba7d808f3d88 EFLAGS: 00010286
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RAX: 0000000000000000 RBX: ffff9eee5317d200 RCX: 0000000000000000
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RDX: ffffba7db7904000 RSI: 0000000000000000 RDI: ffff9eee5317d200
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RBP: 0000000000000000 R08: 00000008f78bb1d3 R09: 0000000000000000
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000000
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R13: ffff9eee53d4cb00 R14: ffff9eee5317d220 R15: ffff9eee70508b80
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: FS: 00007f4bb1482580(0000) GS:ffff9eef77c00000(0000) knlGS:0000000000000000
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CR2: ffffba7db7904008 CR3: 0000000107e9c000 CR4: 0000000000350ef0
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Call Trace:
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: zram_reset_device+0xe9/0x150 [zram]
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: reset_store+0x9a/0x100 [zram]
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: kernfs_fop_write_iter+0x124/0x1b0
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: new_sync_write+0x11c/0x1b0
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: vfs_write+0x1c2/0x260
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: ksys_write+0x5f/0xe0
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: do_syscall_64+0x33/0x80
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RIP: 0033:0x7f4bb13aaf33
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Code: 8b 15 61 ef 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00
00 00 85 c0 75 14 b8 01 00 0>
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RSP: 002b:00007ffce0090d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RAX: ffffffffffffffda RBX: 000055a17c4a14b0 RCX: 00007f4bb13aaf33
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RDX: 0000000000000002 RSI: 000055a17c4a14b0 RDI: 0000000000000001
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RBP: 0000000000000002 R08: 000055a17c48a1d0 R09: 0000000000000000
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R10: 000055a17c48a1d1 R11: 0000000000000246 R12: 0000000000000001
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R13: 0000000000000002 R14: 7fffffffffffffff R15: 0000000000000000
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Modules linked in: zram(E) zsmalloc(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E) evdev(>
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: zram0: detected capacity change from 0 to 209715200
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CR2: ffffba7db7904008
Mar 09 14:52:20 kdevops-blktests-block-dev kernel: ---[ end trace 534ee1d0b880dd90 ]---
I can try to modify it to include the second patch first, as that is
required. There are two separate bugs here. Or was your goal to
try to address both with only one patch?
Luis
On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote:
> If I understand correctly, bugs you found were related to module
> unloading race while the zram are still working.
No, that is a simplification of the issue. The issue consists of
two separate issues:
a) a race against module unloading in light of incorrect, racy use of
   cpu hotplug multistate support
b) a module unload race with sysfs attributes on *any* driver whose
   sysfs attributes share the same lock as is used during
   module unload
It is important to realize that issue b) is actually a generic kernel
issue: it would be present on *any* driver which shares a lock between
a sysfs attribute and module unload. I looked into implementing a generic
solution in kernfs; however, we don't yet have the semantics to enable this
generically, so for now each driver needs to deal with those races
on its own. That sysfs race is dealt with in the second patch.
The first patch only deals with the cpu hotplug races exposed at module
unloading.
Luis
On Wed, Mar 10, 2021 at 01:11:15PM +0000, Luis Chamberlain wrote:
> I can try to modify it to include second patch first, as that is
> required. There are two separate bugs here.
I tried this; applying the required sysfs changes first and then
applying your idea as a secondary patch ends up like this:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4b57c84ba9d4..bb45c1e0f3e0 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1702,6 +1702,7 @@ static void zram_reset_device(struct zram *zram)
set_capacity_and_notify(zram->disk, 0);
part_stat_set_all(zram->disk->part0, 0);
+ module_put(THIS_MODULE);
up_write(&zram->init_lock);
/* I/O operation under all of CPU are done so let's free */
@@ -1747,6 +1748,7 @@ static ssize_t disksize_store(struct device *dev,
goto out_free_meta;
}
+ BUG_ON(!try_module_get(THIS_MODULE));
zram->comp = comp;
zram->disksize = disksize;
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
The BUG_ON() is doable as we *know* we already have a reference to the
module, due to the other try_module_get() which the first patch would
place at the top of disksize_store().
This however doesn't fix the issue. We end up in a situation where we
cannot unload the zram driver.
Luis
On Wed, Mar 10, 2021 at 01:11:15PM +0000, Luis Chamberlain wrote:
> On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote:
> > On Sat, Mar 06, 2021 at 02:20:34AM +0000, Luis Chamberlain wrote:
> > > The zram driver makes use of cpu hotplug multistate support,
> > > whereby it associates a zram compression stream per CPU. To
> > > support CPU hotplug multistate a callback enabled to allow
> > > the driver to do what it needs when a CPU hotplugs.
> > >
> > > It is however currently possible to end up removing the
> > > zram driver callback prior to removing the zram compression
> > > streams per CPU. This would leave these compression streams
> > > hanging.
> > >
> > > We need to fix ordering for driver load / removal, zram
> > > device additions, in light of the driver's use of cpu
> > > hotplug multistate. Since the zram driver exposes many
> > > sysfs attribute which can also muck with the comrpession
> > > streams this also means we can hit page faults today easily.
> >
> > Hi Luis,
> >
> > First of all, thanks for reporting the ancient bugs.
> >
> > Looks like you found several bugs and I am trying to digest them
> > from your description to understand more clear. I need to ask
> > stupid questions, first.
> >
> > If I understand correctly, bugs you found were related to module
> > unloading race while the zram are still working.
> > If so, couldn't we fix the bug like this(it's not a perfect
> > patch but just wanted to show close module unloading race)?
> > (I might miss other pieces here. Let me know. Thanks!)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index a711a2e2a794..646ae9e0b710 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1696,6 +1696,8 @@ static void zram_reset_device(struct zram *zram)
> > return;
> > }
> >
> > + module_put(THIS_MODULE);
> > +
> > comp = zram->comp;
> > disksize = zram->disksize;
> > zram->disksize = 0;
> > @@ -1744,13 +1746,19 @@ static ssize_t disksize_store(struct device *dev,
> > goto out_free_meta;
> > }
> >
> > + if (!try_module_get(THIS_MODULE))
> > + goto out_free_zcomp;
> > +
> > zram->comp = comp;
> > zram->disksize = disksize;
> > +
> > set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
> > up_write(&zram->init_lock);
> >
> > return len;
> >
> > +out_free_zcomp:
> > + zcomp_destroy(comp);
> > out_free_meta:
> > zram_meta_free(zram, disksize);
> > out_unlock:
>
> This still allows for a crash on the driver by running the zram02.sh script twice.
>
> Mar 09 14:52:19 kdevops-blktests-block-dev kernel: zram0: detected capacity change from 209715200 to 0
> Mar 09 14:52:19 kdevops-blktests-block-dev kernel: BUG: unable to handle page fault for address: ffffba7db7904008
> Mar 09 14:52:19 kdevops-blktests-block-dev kernel: #PF: supervisor read access in kernel mode
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: #PF: error_code(0x0000) - not-present page
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: PGD 100000067 P4D 100000067 PUD 100311067 PMD 14cd2f067 PTE 0
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Oops: 0000 [#1] SMP NOPTI
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CPU: 0 PID: 2137 Comm: zram02.sh Tainted: G E 5.12.0-rc1-next-20210304+ #4
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RIP: 0010:zram_free_page+0x1b/0xf0 [zram]
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Code: 1f 44 00 00 48 89 c8 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 54 49 89 f4 55 89 f5 53 48 8b 17 48 c1 e5 04 48 89 f>
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RSP:0018:ffffba7d808f3d88 EFLAGS: 00010286
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RAX: 0000000000000000 RBX: ffff9eee5317d200 RCX: 0000000000000000
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RDX: ffffba7db7904000 RSI: 0000000000000000 RDI: ffff9eee5317d200
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RBP: 0000000000000000 R08: 00000008f78bb1d3 R09: 0000000000000000
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000000
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R13: ffff9eee53d4cb00 R14: ffff9eee5317d220 R15: ffff9eee70508b80
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: FS: 00007f4bb1482580(0000) GS:ffff9eef77c00000(0000) knlGS:0000000000000000
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CR2: ffffba7db7904008 CR3: 0000000107e9c000 CR4: 0000000000350ef0
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Call Trace:
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: zram_reset_device+0xe9/0x150 [zram]
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: reset_store+0x9a/0x100 [zram]
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: kernfs_fop_write_iter+0x124/0x1b0
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: new_sync_write+0x11c/0x1b0
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: vfs_write+0x1c2/0x260
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: ksys_write+0x5f/0xe0
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: do_syscall_64+0x33/0x80
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RIP: 0033:0x7f4bb13aaf33
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Code: 8b 15 61 ef 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00
> 00 00 85 c0 75 14 b8 01 00 0>
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RSP: 002b:00007ffce0090d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RAX: ffffffffffffffda RBX: 000055a17c4a14b0 RCX: 00007f4bb13aaf33
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RDX: 0000000000000002 RSI: 000055a17c4a14b0 RDI: 0000000000000001
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: RBP: 0000000000000002 R08: 000055a17c48a1d0 R09: 0000000000000000
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R10: 000055a17c48a1d1 R11: 0000000000000246 R12: 0000000000000001
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: R13: 0000000000000002 R14: 7fffffffffffffff R15: 0000000000000000
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: Modules linked in: zram(E) zsmalloc(E) xfs(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) joydev(E) evdev(>
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: zram0: detected capacity change from 0 to 209715200
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: CR2: ffffba7db7904008
> Mar 09 14:52:20 kdevops-blktests-block-dev kernel: ---[ end trace 534ee1d0b880dd90 ]---
>
> I can try to modify it to include second patch first, as that is
> required. There are two separate bugs here. Or was your goalt to
> try to address both with only one patch?
I am trying to understand the problem first.
On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote:
> On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote:
> > If I understand correctly, bugs you found were related to module
> > unloading race while the zram are still working.
>
> No, that is a simplifcation of the issue. The issue consists of
> two separate issues:
>
> a) race against module unloading in light of incorrect racty use of
> cpu hotplug multistate support
Could you add some pseudo code sequence to show the race clearly?
It would be great if it went in the description, too, since it's
clearer to show the problem.
> b) module unload race with sysfs attribute race on *any* driver which
> has sysfs attributes which also shares the same lock as used during
> module unload
Yub, that part I missed. Maybe we need some wrapper for zram sysfs
to call try_module_get in the wrapper function and then call the
subroutine only if it got the refcount.
zram_sysfs_wrapper(func, A, B)
    if (!try_module_get(THIS_MODULE))
        return -ENODEV;
    ret = func(A, B);
    module_put(THIS_MODULE);
    return ret;
>
> It is important to realize that issue b) is actually a generic kernel
> issue, it would be present on *any* driver which shares a lock used on
> a sysfs attribute and module unload. I looked to implement a generic
> solution on kernfs, however we don't yet have semantics to enable this
> generically, and so for now each driver needs to deal with those races
> on their own. That sysfs race is dealt with in the second patch.
It's unfortunate. Let me Cc Greg, who might have some ideas or
find a mess in the zram sysfs implementation.
>
> The first patch only deals with the cpu hotplug races exposed at module
> unloading.
True. Thanks for the clarification.
On Thu, Mar 11, 2021 at 06:14:40PM -0800, Minchan Kim wrote:
> On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote:
> > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote:
> > > If I understand correctly, bugs you found were related to module
> > > unloading race while the zram are still working.
> >
> > No, that is a simplifcation of the issue. The issue consists of
> > two separate issues:
> >
> > a) race against module unloading in light of incorrect racty use of
> > cpu hotplug multistate support
>
>
> Could you add some pusedo code sequence to show the race cleary?
Let us deal with each issue one at a time. First, let's address
understanding the kernel warning, which can be reproduced easily by
triggering zram02.sh from LTP twice:
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
The first patch prevents this race. This race is possible because on
module init we associate callbacks for CPU hotplug add / remove:
static int __init zram_init(void)
{
...
ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare",
zcomp_cpu_up_prepare, zcomp_cpu_dead);
...
}
zcomp_cpu_dead() accesses zram->comp, and if zram->comp is
removed and this function is then called, clearly we'll be accessing
some random data here and can easily crash afterwards:
int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
{
struct zcomp *comp = hlist_entry(node, struct zcomp, node);
struct zcomp_strm *zstrm;
zstrm = per_cpu_ptr(comp->stream, cpu);
zcomp_strm_free(zstrm);
return 0;
}
And zram's sysfs reset_store() lets userspace call zram_reset_device(),
which calls zcomp_destroy():
void zcomp_destroy(struct zcomp *comp)
{
cpuhp_state_remove_instance(CPUHP_ZCOMP_PREPARE, &comp->node);
free_percpu(comp->stream);
kfree(comp);
}
> It would be great if it goes in the description, too since it's
> more clear to show the problme.
Does the above do it?
>
> > b) module unload race with sysfs attribute race on *any* driver which
> > has sysfs attributes which also shares the same lock as used during
> > module unload
>
> Yub, that part I missed. Maybe, we need some wrapper to zram sysfs
> to get try_module_get in the warapper funnction and then call sub
> rountine only if it got the refcount.
>
> zram_sysfs_wrapper(func, A, B)
> if (!try_module_get(THIS_MODULE)
> return -ENODEV;
> ret = func(A,B);
> module_put(THIS_MODULE);
> return ret;
I'd much prefer this be resolved in kernfs later; if you look at the kernel
there are already some drivers which may have realized this requirement
the hard way. Open coding this, I think, makes the race / intent clearer.
Right now we have no semantics possible for a generic solution, but I
can work on one later.
Luis
On Fri, Mar 12, 2021 at 06:32:38PM +0000, Luis Chamberlain wrote:
> On Thu, Mar 11, 2021 at 06:14:40PM -0800, Minchan Kim wrote:
> > On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote:
> > > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote:
> > > > If I understand correctly, bugs you found were related to module
> > > > unloading race while the zram are still working.
> > >
> > > No, that is a simplifcation of the issue. The issue consists of
> > > two separate issues:
> > >
> > > a) race against module unloading in light of incorrect racty use of
> > > cpu hotplug multistate support
> >
> >
> > Could you add some pusedo code sequence to show the race cleary?
>
> Let us deal with each issue one at time. First, let's address
> understanding the kernel warning can be reproduced easily by
> triggering zram02.sh from LTP twice:
>
> kernel: ------------[ cut here ]------------
> kernel: Error: Removing state 63 which has instances left.
> kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
>
> The first patch prevents this race. This race is possible because on
> module init we associate callbacks for CPU hotplug add / remove:
>
> static int __init zram_init(void)
> {
> ...
> ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare",
> zcomp_cpu_up_prepare, zcomp_cpu_dead);
> ...
> }
>
> The zcomp_cpu_dead() accesses the zcom->comp, and if zcomp->comp is
> removed and this function is called, clearly we'll be accessing some
> random data here and can easily crash afterwards:
I am trying to understand this crash. You mentioned the warning
above, but here you mention a crash (I understand the sysfs race, but
this is a different topic). What's the relation between the warning
above and the crash here? Are you saying the warning could sometimes
turn into the crash?
>
> int zcomp_cpu_dead(unsigned int cpu, struct hlist_node *node)
> {
> struct zcomp *comp = hlist_entry(node, struct zcomp, node);
> struct zcomp_strm *zstrm;
>
> zstrm = per_cpu_ptr(comp->stream, cpu);
> zcomp_strm_free(zstrm);
> return 0;
> }
>
> And zram's sysfs reset_store() lets userspace call zram_reset_device()
> which calls zcomp_destroy():
>
> void zcomp_destroy(struct zcomp *comp)
> {
> cpuhp_state_remove_instance(CPUHP_ZCOMP_PREPARE, &comp->node);
> free_percpu(comp->stream);
> kfree(comp);
> }
Do you mean the race between module unloading and zram_reset_device?
If I understood correctly, the approach I put in the first reply would
prevent this problem.
>
> > It would be great if it goes in the description too, since it
> > shows the problem more clearly.
>
> Does the above do it?
Rather than that, let's have this kind of explanation, which is
clearer (it's just an example, not describing the exact race you
saw; you could describe it better):
CPU A CPU B
module_unload
destroy_devices
zram_remove_cb
reset_store
zcomp_cpu_dead
zram_reset_device
zcomp_destroy
cpuhp_state_remove_instance
zcomp_strm_free
> >
> > > b) module unload race with sysfs attribute race on *any* driver which
> > > has sysfs attributes which also shares the same lock as used during
> > > module unload
> >
> > Yub, that part I missed. Maybe we need some wrapper for zram sysfs
> > to call try_module_get in the wrapper function and then call the
> > subroutine only if it got the refcount.
> >
> > zram_sysfs_wrapper(func, A, B)
> > if (!try_module_get(THIS_MODULE))
> > return -ENODEV;
> > ret = func(A,B);
> > module_put(THIS_MODULE);
> > return ret;
>
> I'd much prefer this be resolved in kernfs later; if you look at the
> kernel, there are already some drivers which may have realized this
> requirement the hard way. Open coding this, I think, makes the
> race / intent clearer.
>
> Right now we have no semantics possible for a generic solution, but I
> can work on one later.
Yub, no problem. My point is to have some zram sysfs wrapper to manage
module_get/put automatically, as I mentioned above, so there's no need
to deal with the module refcount in the worker functions.
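To make the proposed wrapper concrete, here is a rough userspace model in plain C. All names are illustrative: mod_try_get()/mod_put() stand in for try_module_get()/module_put(), and the real kernel primitives synchronize with module teardown in ways this sketch does not.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Rough model of the proposed zram sysfs wrapper: a handler runs its
 * worker function only while it holds a reference, and once "unload"
 * has begun no new reference can be taken. Names are invented for
 * illustration; this is not the kernel implementation. */

struct mod_ref {
	atomic_int refs;       /* references held by running handlers */
	atomic_bool unloading; /* set when module exit has begun */
};

static bool mod_try_get(struct mod_ref *m)
{
	if (atomic_load(&m->unloading))
		return false;
	atomic_fetch_add(&m->refs, 1);
	return true;
}

static void mod_put(struct mod_ref *m)
{
	atomic_fetch_sub(&m->refs, 1);
}

/* The wrapper: call func() only if a reference could be taken. */
static int zram_sysfs_wrapper(struct mod_ref *m, int (*func)(void))
{
	int ret;

	if (!mod_try_get(m))
		return -ENODEV;
	ret = func();
	mod_put(m);
	return ret;
}

static int store_ok(void)
{
	return 0;
}
```

As discussed later in the thread, a refcount-style guard like this alone does not close the CPU hotplug multistate race; it only keeps a handler from starting once unload has begun.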
On Fri, Mar 12, 2021 at 11:28:21AM -0800, Minchan Kim wrote:
> On Fri, Mar 12, 2021 at 06:32:38PM +0000, Luis Chamberlain wrote:
> > On Thu, Mar 11, 2021 at 06:14:40PM -0800, Minchan Kim wrote:
> > > On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote:
> > > > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote:
> > > > > If I understand correctly, bugs you found were related to module
> > > > > unloading race while the zram are still working.
> > > >
> > > > No, that is a simplification of the issue. The issue consists of
> > > > two separate issues:
> > > >
> > > > a) race against module unloading in light of incorrect racy use of
> > > > cpu hotplug multistate support
> > >
> > >
> > > Could you add some pseudo-code sequence to show the race clearly?
> >
> > Let us deal with each issue one at a time. First, let's establish
> > that the kernel warning can be reproduced easily by triggering
> > zram02.sh from LTP twice:
> >
> > kernel: ------------[ cut here ]------------
> > kernel: Error: Removing state 63 which has instances left.
> > kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> > kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> >
> > The first patch prevents this race. This race is possible because on
> > module init we associate callbacks for CPU hotplug add / remove:
> >
> > static int __init zram_init(void)
> > {
> > ...
> > ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare",
> > zcomp_cpu_up_prepare, zcomp_cpu_dead);
> > ...
> > }
> >
> > The zcomp_cpu_dead() accesses the zcomp, and if the zcomp has been
> > freed when this function is called, clearly we'll be accessing
> > random data here and can easily crash afterwards:
>
> I am trying to understand this crash. You mentioned the warning
> above, but here you mention a crash (I understand the sysfs race,
> but that is a different topic). What's the relation between the
> warning above and the crash here? Are you saying the warning can
> sometimes turn into a crash?
Indeed one issue is a consequence of the other but a bit better
description can be put together for both separately.
The warning comes up when CPU hotplug detects that the callback
is being removed while "instances" for multistate are left behind,
i.e., not removed. CPU hotplug multistate allows us to dedicate a callback
for zram so that it gets called every time a CPU hot plugs or unplugs.
In the zram driver's case we want to trigger the callback per each
struct zcomp, one is used per supported and used supported compression
algorithm. The zram driver allocates a struct zcomp for a compression
algorithm on an as-needed basis. The default compressor is CONFIG_ZRAM_DEF_COMP
which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
have different zram devices, zram devices which use the same compression
algorithm share the same struct zcomp. The "multi" in CPU hotplug multistate
refers to these different struct zcomp instances, each of these will
have the callback routine registered run for it. The kernel's CPU
hotplug multistate keeps a linked list of these different structures
so that it will iterate over them on CPU transitions. By default at
driver initialization we will create just one zram device (num_devices=1)
and a zcomp structure is then set up for the lzo-rle compression algorithm.
At driver removal we first remove each zram device, and so we destroy
the struct zcomp. But since we expose sysfs attributes to create new
devices or reset / initialize existing ones, we can easily end up
re-initializing a struct zcomp before the exit routine of the module
removes the cpu hotplug callback. When this happens the kernel's
CPU hotplug will detect that at least one instance (struct zcomp for
us) exists.
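As a toy illustration of the bookkeeping involved (userspace C with invented names, not kernel code), a multistate entry amounts to a per-state list of registered instances, and removing the state while that list is non-empty is exactly the condition the kernel warns about:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model of CPU hotplug multistate bookkeeping: a registered state
 * keeps a linked list of instances (struct zcomp in zram's case).
 * Removing the state while instances remain is the condition behind
 * "Error: Removing state 63 which has instances left". All names here
 * are invented for illustration. */

struct mstate_node {
	struct mstate_node *next;
};

struct mstate {
	struct mstate_node *head; /* instances registered for this state */
};

/* Analogous to cpuhp_state_add_instance(): link in one instance. */
static void mstate_add_instance(struct mstate *st, struct mstate_node *n)
{
	n->next = st->head;
	st->head = n;
}

/* Analogous to cpuhp_state_remove_instance(): unlink one instance. */
static void mstate_remove_instance(struct mstate *st, struct mstate_node *n)
{
	struct mstate_node **pp;

	for (pp = &st->head; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			return;
		}
	}
}

/* Analogous to cpuhp_remove_multi_state(): complain if instances were
 * left behind, as __cpuhp_remove_state_cpuslocked() does. */
static int mstate_remove_state(struct mstate *st)
{
	if (st->head)
		return -EBUSY; /* instances left: the warned-about case */
	return 0;
}
```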
And so we can have:
static void destroy_devices(void)
{
class_unregister(&zram_control_class);
idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
}
While destroy_devices() runs we can have this race:
CPU 1 CPU 2
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();
disksize_store(...)
idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);
The warning comes up on cpuhp_remove_multi_state() when it sees
that the state for CPUHP_ZCOMP_PREPARE does not have an empty
instance linked list.
After the idr_destroy() it's also easy to run into a crash. The
above predicament also leads to memory leaks.
And so we need to have a state machine for when we're up and when
we're going down generically.
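A minimal sketch of such a state machine, modeled in userspace C. The names are illustrative only; in the driver the flag would sit under an existing lock such as zram_index_mutex, here modeled with a pthread mutex:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Minimal userspace model of an "up / going down" state machine for
 * the driver: a boolean guarded by a mutex. Sysfs handlers refuse to
 * run with -ENODEV unless the driver is fully initialized. All names
 * are illustrative, not the actual zram code. */

static pthread_mutex_t index_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool zram_up;

static void driver_up(void)
{
	pthread_mutex_lock(&index_mutex);
	zram_up = true;  /* set only once init fully succeeded */
	pthread_mutex_unlock(&index_mutex);
}

static void driver_going_down(void)
{
	pthread_mutex_lock(&index_mutex);
	zram_up = false; /* from here on, attributes refuse to run */
	pthread_mutex_unlock(&index_mutex);
}

/* Every sysfs handler funnels through this guard. */
static int attr_store(int (*func)(void))
{
	int ret = -ENODEV;

	pthread_mutex_lock(&index_mutex);
	if (zram_up)
		ret = func();
	pthread_mutex_unlock(&index_mutex);
	return ret;
}

static int fake_disksize_store(void)
{
	return 0;
}
```

The point of the sketch: attributes consult the flag under the same lock that serializes driver bring-up and teardown, so they either run fully before teardown flips the flag or not at all.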
Let me know if it makes sense now, if so I can amend the commit log
accordingly.
Luis
On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote:
> On Fri, Mar 12, 2021 at 11:28:21AM -0800, Minchan Kim wrote:
> > On Fri, Mar 12, 2021 at 06:32:38PM +0000, Luis Chamberlain wrote:
> > > On Thu, Mar 11, 2021 at 06:14:40PM -0800, Minchan Kim wrote:
> > > > On Wed, Mar 10, 2021 at 09:21:28PM +0000, Luis Chamberlain wrote:
> > > > > On Mon, Mar 08, 2021 at 06:55:30PM -0800, Minchan Kim wrote:
> > > > > > If I understand correctly, bugs you found were related to module
> > > > > > unloading race while the zram are still working.
> > > > >
> > > > > No, that is a simplification of the issue. The issue consists of
> > > > > two separate issues:
> > > > >
> > > > > a) race against module unloading in light of incorrect racy use of
> > > > > cpu hotplug multistate support
> > > >
> > > >
> > > > Could you add some pseudo-code sequence to show the race clearly?
> > >
> > > Let us deal with each issue one at a time. First, let's establish
> > > that the kernel warning can be reproduced easily by triggering
> > > zram02.sh from LTP twice:
> > >
> > > kernel: ------------[ cut here ]------------
> > > kernel: Error: Removing state 63 which has instances left.
> > > kernel: WARNING: CPU: 7 PID: 70457 at kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> > > kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> > >
> > > The first patch prevents this race. This race is possible because on
> > > module init we associate callbacks for CPU hotplug add / remove:
> > >
> > > static int __init zram_init(void)
> > > {
> > > ...
> > > ret = cpuhp_setup_state_multi(CPUHP_ZCOMP_PREPARE, "block/zram:prepare",
> > > zcomp_cpu_up_prepare, zcomp_cpu_dead);
> > > ...
> > > }
> > >
> > > The zcomp_cpu_dead() accesses the zcomp, and if the zcomp has been
> > > freed when this function is called, clearly we'll be accessing
> > > random data here and can easily crash afterwards:
> >
> > I am trying to understand this crash. You mentioned the warning
> > above, but here you mention a crash (I understand the sysfs race,
> > but that is a different topic). What's the relation between the
> > warning above and the crash here? Are you saying the warning can
> > sometimes turn into a crash?
>
Hi Luis,
> Indeed one issue is a consequence of the other but a bit better
> description can be put together for both separately.
>
> The warning comes up when cpu hotplug detects that the callback
> is being removed, but yet "instances" for multistate are left behind,
> ie, not removed. CPU hotplug multistate allows us to dedicate a callback
> for zram so that it gets called every time a CPU hot plugs or unplugs.
> In the zram driver's case we want to trigger the callback per each
> struct zcomp, one is used per supported and used supported compression
you meant "each one is used per supported compression"?
> algorithm. The zram driver allocates a struct zcomp for a compression
> algorithm on an as-needed basis. The default compressor is CONFIG_ZRAM_DEF_COMP
> which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
> have different zram devices, zram devices which use the same compression
> algorithm share the same struct zcomp. The "multi" in CPU hotplug multistate
It allocates each own zcomp, not sharing even though zram devices
use the same compression algorithm.
> refers to these different struct zcomp instances, each of these will
> have the callback routine registered run for it. The kernel's CPU
> hotplug multistate keeps a linked list of these different structures
> so that it will iterate over them on CPU transitions. By default at
> driver initialization we will create just one zram device (num_devices=1)
> and a zcomp structure is then set up for the lzo-rle compression algorithm.
> At driver removal we first remove each zram device, and so we destroy
> the struct zcomp. But since we expose sysfs attributes to create new
> devices or reset / initialize existing ones, we can easily end up
> re-initializing a struct zcomp before the exit routine of the module
> removes the cpu hotplug callback. When this happens the kernel's
> CPU hotplug will detect that at least one instance (struct zcomp for
> us) exists.
It's very helpful to understand. Thanks a lot!
So IIUC, it's fundamentally a race between destroy_devices() (i.e.,
module unload) and every sysfs knob in zram, isn't it?
The specific example below is one of them; every sysfs handler in zram
could access a freed object (e.g., zram->something).
And you are claiming there isn't a good way to fix it in kernfs (please
add that to the description, too) even though it's a general problem.
(If we had one, we could detect the race and make zram_remove_cb fail,
so that unloading the module finally fails.)
So, shouldn't we introduce a global rw_semaphore?
destroy_devices
class_unregister
down_write(&zram_unload);
idr_for_each(zram_remove_cb);
up_write(&zram_unload);
And every sysfs handler takes the lock with down_read in the first place
and releases it right before returning. So, introduce a zram sysfs
wrapper to centralize all of the locking logic.
Does it make sense?
>
> And so we can have:
>
> static void destroy_devices(void)
> {
> class_unregister(&zram_control_class);
> idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> zram_debugfs_destroy();
> idr_destroy(&zram_index_idr);
> unregister_blkdev(zram_major, "zram");
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> }
>
> While destroy_devices() runs we can have this race:
>
>
> CPU 1 CPU 2
>
> class_unregister(...);
> idr_for_each(...);
I think sysfs was already removed here.
> zram_debugfs_destroy();
> disksize_store(...)
> idr_destroy(...);
> unregister_blkdev(...);
> cpuhp_remove_multi_state(...);
So,
CPU 1 CPU 2
destroy_devices
..
disksize_store()
zcomp_create
zcomp_init
idr_for_each(zram_remove_cb
zram_remove
zram_reset
zcomp_destroy
cpuhp_state_remove_instance
cpuhp_state_add_instance
..
cpuhp_remove_multi_state(...)
Detect!
kernel: Error: Removing state 63 which has instances left.
>
>
> The warning comes up on cpuhp_remove_multi_state() when it sees
> that the state for CPUHP_ZCOMP_PREPARE does not have an empty
> instance linked list.
>
> After the idr_destroy() it's also easy to run into a crash. The
> above predicament also leads to memory leaks.
>
> And so we need to have a state machine for when we're up and when
> we're going down generically.
>
> Let me know if it makes sense now, if so I can amend the commit log
> accordingly.
Yub, It's much better. Let's have it in the commit log.
Thanks.
On Mon, Mar 22, 2021 at 09:37:17AM -0700, Minchan Kim wrote:
> On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote:
> > Indeed one issue is a consequence of the other but a bit better
> > description can be put together for both separately.
> >
> > The warning comes up when cpu hotplug detects that the callback
> > is being removed, but yet "instances" for multistate are left behind,
> > ie, not removed. CPU hotplug multistate allows us to dedicate a callback
> > for zram so that it gets called every time a CPU hot plugs or unplugs.
> > In the zram driver's case we want to trigger the callback per each
> > struct zcomp, one is used per supported and used supported compression
>
> you meant "each one is used per supported compression"?
Yup.
> > algorithm. The zram driver allocates a struct zcomp for a compression
> > algorithm on an as-needed basis. The default compressor is CONFIG_ZRAM_DEF_COMP
> > which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
> > have different zram devices, zram devices which use the same compression
> > algorithm share the same struct zcomp. The "multi" in CPU hotplug multistate
>
> It allocates each own zcomp, not sharing even though zram devices
> use the same compression algorithm.
Right.
> > refers to these different struct zcomp instances, each of these will
> > have the callback routine registered run for it. The kernel's CPU
> > hotplug multistate keeps a linked list of these different structures
> > so that it will iterate over them on CPU transitions. By default at
> > driver initialization we will create just one zram device (num_devices=1)
> > and a zcomp structure is then set up for the lzo-rle compression algorithm.
> > At driver removal we first remove each zram device, and so we destroy
> > the struct zcomp. But since we expose sysfs attributes to create new
> > devices or reset / initialize existing ones, we can easily end up
> > re-initializing a struct zcomp before the exit routine of the module
> > removes the cpu hotplug callback. When this happens the kernel's
> > CPU hotplug will detect that at least one instance (struct zcomp for
> > us) exists.
>
> It's very helpful to understand. Thanks a lot!
>
> So IIUC, it's fundamentally a race between destroy_devices() (i.e.,
> module unload) and every sysfs knob in zram, isn't it?
I would not call it *every* sysfs knob, as not all of them deal with
things related to CPU hotplug multistate, right? Note that using just
try_module_get() alone (that is, the second patch only) does not fix
the race I am describing above.
There are two issues.
> The specific example below is one of them; every sysfs handler in zram
> could access a freed object (e.g., zram->something).
> And you are claiming there isn't a good way to fix it in kernfs (please
> add that to the description, too) even though it's a general problem.
Correct, the easier route would have been through the block layer, as
it's the one adding our sysfs attributes for us, but even it cannot deal
with this race on behalf of drivers given the currently exposed
semantics of kernfs.
> (If we had one, we could detect the race and make zram_remove_cb fail,
> so that unloading the module finally fails.)
>
> So, shouldn't we introduce a global rw_semaphore?
>
> destroy_devices
> class_unregister
> down_write(&zram_unload);
> idr_for_each(zram_remove_cb);
> up_write(&zram_unload);
>
> And every sysfs handler takes the lock with down_read in the first place
> and releases it right before returning. So, introduce a zram sysfs
> wrapper to centralize all of the locking logic.
Actually that does work, but only if we use write lock attempts, so as
to kill two birds with one stone and also address the deadlock with
sysfs attribute removal. With these locks we're not politely asking to
read some value off a zram device; we're ensuring we do not run at all
if our driver is about to be removed, effectively replacing
try_module_get() with something similar.
> Does it make sense?
>
> >
> > And so we can have:
> >
> > static void destroy_devices(void)
> > {
> > class_unregister(&zram_control_class);
> > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > zram_debugfs_destroy();
> > idr_destroy(&zram_index_idr);
> > unregister_blkdev(zram_major, "zram");
> > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> > }
> >
> > While destroy_devices() runs we can have this race:
> >
> >
> > CPU 1 CPU 2
> >
> > class_unregister(...);
> > idr_for_each(...);
> I think sysfs was already removed here.
> > zram_debugfs_destroy();
> > disksize_store(...)
> > idr_destroy(...);
> > unregister_blkdev(...);
> > cpuhp_remove_multi_state(...);
>
> So,
>
> CPU 1 CPU 2
>
> destroy_devices
> ..
> disksize_store()
> zcomp_create
> zcomp_init
> idr_for_each(zram_remove_cb
> zram_remove
> zram_reset
> zcomp_destroy
> cpuhp_state_remove_instance
> cpuhp_state_add_instance
> ..
>
> cpuhp_remove_multi_state(...)
> Detect!
> kernel: Error: Removing state 63 which has instances left.
Yup true.
> > The warning comes up on cpuhp_remove_multi_state() when it sees
> > that the state for CPUHP_ZCOMP_PREPARE does not have an empty
> > instance linked list.
> >
> > After the idr_destroy() it's also easy to run into a crash. The
> > above predicament also leads to memory leaks.
> >
> > And so we need to have a state machine for when we're up and when
> > we're going down generically.
> >
> > Let me know if it makes sense now, if so I can amend the commit log
> > accordingly.
>
> Yub, It's much better. Let's have it in the commit log.
Sure OK, well the following is what I end up with. With this approach
only one patch is needed.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a711a2e2a794..3b86ee94309e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -41,6 +41,15 @@ static DEFINE_IDR(zram_index_idr);
/* idr index must be protected */
static DEFINE_MUTEX(zram_index_mutex);
+/*
+ * Protects against:
+ * a) Hotplug cpu multistate races against compression algorithm removals
+ * and additions prior to its removal of the zram CPU hotplug callback
+ * b) Deadlock which is possible when sysfs attributes are used and we
+ * share a lock on removal.
+ */
+DECLARE_RWSEM(zram_unload);
+
static int zram_major;
static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
@@ -1723,6 +1732,9 @@ static ssize_t disksize_store(struct device *dev,
if (!disksize)
return -EINVAL;
+ if (!down_write_trylock(&zram_unload))
+ return -ENODEV;
+
down_write(&zram->init_lock);
if (init_done(zram)) {
pr_info("Cannot change disksize for initialized device\n");
@@ -1748,6 +1760,7 @@ static ssize_t disksize_store(struct device *dev,
zram->disksize = disksize;
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);
+ up_write(&zram_unload);
return len;
@@ -1755,6 +1768,7 @@ static ssize_t disksize_store(struct device *dev,
zram_meta_free(zram, disksize);
out_unlock:
up_write(&zram->init_lock);
+ up_write(&zram_unload);
return err;
}
@@ -1773,14 +1787,17 @@ static ssize_t reset_store(struct device *dev,
if (!do_reset)
return -EINVAL;
+ if (!down_write_trylock(&zram_unload))
+ return -ENODEV;
+
zram = dev_to_zram(dev);
bdev = zram->disk->part0;
mutex_lock(&bdev->bd_mutex);
/* Do not reset an active device or claimed device */
if (bdev->bd_openers || zram->claim) {
- mutex_unlock(&bdev->bd_mutex);
- return -EBUSY;
+ len = -EBUSY;
+ goto out;
}
/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1793,7 +1810,10 @@ static ssize_t reset_store(struct device *dev,
mutex_lock(&bdev->bd_mutex);
zram->claim = false;
+
+out:
mutex_unlock(&bdev->bd_mutex);
+ up_write(&zram_unload);
return len;
}
@@ -2015,10 +2035,15 @@ static ssize_t hot_add_show(struct class *class,
{
int ret;
+ if (!down_write_trylock(&zram_unload))
+ return -ENODEV;
+
mutex_lock(&zram_index_mutex);
ret = zram_add();
mutex_unlock(&zram_index_mutex);
+ up_write(&zram_unload);
+
if (ret < 0)
return ret;
return scnprintf(buf, PAGE_SIZE, "%d\n", ret);
@@ -2041,6 +2066,9 @@ static ssize_t hot_remove_store(struct class *class,
if (dev_id < 0)
return -EINVAL;
+ if (!down_write_trylock(&zram_unload))
+ return -ENODEV;
+
mutex_lock(&zram_index_mutex);
zram = idr_find(&zram_index_idr, dev_id);
@@ -2053,6 +2081,7 @@ static ssize_t hot_remove_store(struct class *class,
}
mutex_unlock(&zram_index_mutex);
+ up_write(&zram_unload);
return ret ? ret : count;
}
static CLASS_ATTR_WO(hot_remove);
@@ -2078,12 +2107,14 @@ static int zram_remove_cb(int id, void *ptr, void *data)
static void destroy_devices(void)
{
+ down_write(&zram_unload);
class_unregister(&zram_control_class);
idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+ up_write(&zram_unload);
}
static int __init zram_init(void)
@@ -2095,10 +2126,13 @@ static int __init zram_init(void)
if (ret < 0)
return ret;
+ down_write(&zram_unload);
+
ret = class_register(&zram_control_class);
if (ret) {
pr_err("Unable to register zram-control class\n");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+ up_write(&zram_unload);
return ret;
}
@@ -2108,6 +2142,7 @@ static int __init zram_init(void)
pr_err("Unable to get major number\n");
class_unregister(&zram_control_class);
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+ up_write(&zram_unload);
return -EBUSY;
}
@@ -2119,10 +2154,12 @@ static int __init zram_init(void)
goto out_error;
num_devices--;
}
+ up_write(&zram_unload);
return 0;
out_error:
+ up_write(&zram_unload);
destroy_devices();
return ret;
}
On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote:
> On Mon, Mar 22, 2021 at 09:37:17AM -0700, Minchan Kim wrote:
> > On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote:
> > > Indeed one issue is a consequence of the other but a bit better
> > > description can be put together for both separately.
> > >
> > > The warning comes up when cpu hotplug detects that the callback
> > > is being removed, but yet "instances" for multistate are left behind,
> > > ie, not removed. CPU hotplug multistate allows us to dedicate a callback
> > > for zram so that it gets called every time a CPU hot plugs or unplugs.
> > > In the zram driver's case we want to trigger the callback per each
> > > struct zcomp, one is used per supported and used supported compression
> >
> > you meant "each one is used per supported compression"?
>
> Yup.
>
> > > algorithm. The zram driver allocates a struct zcomp for a compression
> > > algorithm on an as-needed basis. The default compressor is CONFIG_ZRAM_DEF_COMP
> > > which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
> > > have different zram devices, zram devices which use the same compression
> > > algorithm share the same struct zcomp. The "multi" in CPU hotplug multistate
> >
> > It allocates each own zcomp, not sharing even though zram devices
> > use the same compression algorithm.
>
> Right.
>
> > > refers to these different struct zcomp instances, each of these will
> > > have the callback routine registered run for it. The kernel's CPU
> > > hotplug multistate keeps a linked list of these different structures
> > > so that it will iterate over them on CPU transitions. By default at
> > > driver initialization we will create just one zram device (num_devices=1)
> > > and a zcomp structure is then set up for the lzo-rle compression algorithm.
> > > At driver removal we first remove each zram device, and so we destroy
> > > the struct zcomp. But since we expose sysfs attributes to create new
> > > devices or reset / initialize existing ones, we can easily end up
> > > re-initializing a struct zcomp before the exit routine of the module
> > > removes the cpu hotplug callback. When this happens the kernel's
> > > CPU hotplug will detect that at least one instance (struct zcomp for
> > > us) exists.
> >
> > It's very helpful to understand. Thanks a lot!
> >
> > So IIUC, it's fundamentally a race between destroy_devices() (i.e.,
> > module unload) and every sysfs knob in zram, isn't it?
>
> I would not call it *every* sysfs knob, as not all of them deal with
> things related to CPU hotplug multistate, right? Note that using just
> try_module_get() alone (that is, the second patch only) does not fix
> the race I am describing above.
It wouldn't be a CPU hotplug multistate issue; I'd rather call it a
"zram instance race" bug.
What happens in this case?
CPU 1 CPU 2
destroy_devices
..
compact_store()
zram = dev_to_zram(dev);
idr_for_each(zram_remove_cb
zram_remove
..
kfree(zram)
down_read(&zram->init_lock);
CPU 1 CPU 2
hot_remove_store
compact_store()
zram = dev_to_zram(dev);
zram_remove
kfree(zram)
down_read(&zram->init_lock);
So, for me, we need to close the race between zram instance creation /
removal and sysfs rather than focusing on the CPU hotplug issue.
Maybe we could reuse zram_index_mutex, converting it into a
rw_semaphore. What do you think?
>
> There are two issues.
>
> > The specific example below is one of them; every sysfs handler in zram
> > could access a freed object (e.g., zram->something).
> > And you are claiming there isn't a good way to fix it in kernfs (please
> > add that to the description, too) even though it's a general problem.
>
> Correct, the easier route would have been through the block layer, as
> it's the one adding our sysfs attributes for us, but even it cannot deal
> with this race on behalf of drivers given the currently exposed
> semantics of kernfs.
>
> > (If we had one, we could detect the race and make zram_remove_cb fail,
> > so that unloading the module finally fails.)
> >
> > So, shouldn't we introduce a global rw_semaphore?
> >
> > destroy_devices
> > class_unregister
> > down_write(&zram_unload);
> > idr_for_each(zram_remove_cb);
> > up_write(&zram_unload);
> >
> > And every sysfs handler takes the lock with down_read in the first place
> > and releases it right before returning. So, introduce a zram sysfs
> > wrapper to centralize all of the locking logic.
>
> Actually that does work, but only if we use write lock attempts, so as
> to kill two birds with one stone and also address the deadlock with
> sysfs attribute removal. With these locks we're not politely asking to
> read some value off a zram device; we're ensuring we do not run at all
> if our driver is about to be removed, effectively replacing
> try_module_get() with something similar.
>
> > Does it make sense?
> >
> > >
> > > And so we can have:
> > >
> > > static void destroy_devices(void)
> > > {
> > > class_unregister(&zram_control_class);
> > > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > zram_debugfs_destroy();
> > > idr_destroy(&zram_index_idr);
> > > unregister_blkdev(zram_major, "zram");
> > > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> > > }
> > >
> > > While destroy_devices() runs we can have this race:
> > >
> > >
> > > CPU 1 CPU 2
> > >
> > > class_unregister(...);
> > > idr_for_each(...);
> > I think sysfs was already removed here.
> > > zram_debugfs_destroy();
> > > disksize_store(...)
> > > idr_destroy(...);
> > > unregister_blkdev(...);
> > > cpuhp_remove_multi_state(...);
> >
> > So,
> >
> > CPU 1 CPU 2
> >
> > destroy_devices
> > ..
> > disksize_store()
> > zcomp_create
> > zcomp_init
> > idr_for_each(zram_remove_cb
> > zram_remove
> > zram_reset
> > zcomp_destroy
> > cpuhp_state_remove_instance
> > cpuhp_state_add_instance
> > ..
> >
> > cpuhp_remove_multi_state(...)
> > Detect!
> > kernel: Error: Removing state 63 which has instances left.
>
> Yup true.
>
> > > The warning comes up on cpuhp_remove_multi_state() when it sees
> > > that the state for CPUHP_ZCOMP_PREPARE does not have an empty
> > > instance linked list.
> > >
> > > After the idr_destroy() it's also easy to run into a crash. The
> > > above predicament also leads to memory leaks.
> > >
> > > And so we need to have a state machine for when we're up and when
> > > we're going down generically.
> > >
> > > Let me know if it makes sense now, if so I can amend the commit log
> > > accordingly.
> >
> > Yub, It's much better. Let's have it in the commit log.
>
> Sure OK, well the following is what I end up with. With this approach
> only one patch is needed.
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index a711a2e2a794..3b86ee94309e 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -41,6 +41,15 @@ static DEFINE_IDR(zram_index_idr);
> /* idr index must be protected */
> static DEFINE_MUTEX(zram_index_mutex);
>
> +/*
> + * Protects against:
> + * a) Hotplug cpu multistate races against compression algorithm removals
> + * and additions prior to its removal of the zram CPU hotplug callback
> + * b) Deadlock which is possible when sysfs attributes are used and we
> + * share a lock on removal.
> + */
> +DECLARE_RWSEM(zram_unload);
> +
> static int zram_major;
> static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
>
> @@ -1723,6 +1732,9 @@ static ssize_t disksize_store(struct device *dev,
> if (!disksize)
> return -EINVAL;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> down_write(&zram->init_lock);
> if (init_done(zram)) {
> pr_info("Cannot change disksize for initialized device\n");
> @@ -1748,6 +1760,7 @@ static ssize_t disksize_store(struct device *dev,
> zram->disksize = disksize;
> set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
> up_write(&zram->init_lock);
> + up_write(&zram_unload);
>
> return len;
>
> @@ -1755,6 +1768,7 @@ static ssize_t disksize_store(struct device *dev,
> zram_meta_free(zram, disksize);
> out_unlock:
> up_write(&zram->init_lock);
> + up_write(&zram_unload);
> return err;
> }
>
> @@ -1773,14 +1787,17 @@ static ssize_t reset_store(struct device *dev,
> if (!do_reset)
> return -EINVAL;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> zram = dev_to_zram(dev);
> bdev = zram->disk->part0;
>
> mutex_lock(&bdev->bd_mutex);
> /* Do not reset an active device or claimed device */
> if (bdev->bd_openers || zram->claim) {
> - mutex_unlock(&bdev->bd_mutex);
> - return -EBUSY;
> + len = -EBUSY;
> + goto out;
> }
>
> /* From now on, anyone can't open /dev/zram[0-9] */
> @@ -1793,7 +1810,10 @@ static ssize_t reset_store(struct device *dev,
>
> mutex_lock(&bdev->bd_mutex);
> zram->claim = false;
> +
> +out:
> mutex_unlock(&bdev->bd_mutex);
> + up_write(&zram_unload);
>
> return len;
> }
> @@ -2015,10 +2035,15 @@ static ssize_t hot_add_show(struct class *class,
> {
> int ret;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> mutex_lock(&zram_index_mutex);
> ret = zram_add();
> mutex_unlock(&zram_index_mutex);
>
> + up_write(&zram_unload);
> +
> if (ret < 0)
> return ret;
> return scnprintf(buf, PAGE_SIZE, "%d\n", ret);
> @@ -2041,6 +2066,9 @@ static ssize_t hot_remove_store(struct class *class,
> if (dev_id < 0)
> return -EINVAL;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> mutex_lock(&zram_index_mutex);
>
> zram = idr_find(&zram_index_idr, dev_id);
> @@ -2053,6 +2081,7 @@ static ssize_t hot_remove_store(struct class *class,
> }
>
> mutex_unlock(&zram_index_mutex);
> + up_write(&zram_unload);
> return ret ? ret : count;
> }
> static CLASS_ATTR_WO(hot_remove);
> @@ -2078,12 +2107,14 @@ static int zram_remove_cb(int id, void *ptr, void *data)
>
> static void destroy_devices(void)
> {
> + down_write(&zram_unload);
> class_unregister(&zram_control_class);
> idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> zram_debugfs_destroy();
> idr_destroy(&zram_index_idr);
> unregister_blkdev(zram_major, "zram");
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + up_write(&zram_unload);
> }
>
> static int __init zram_init(void)
> @@ -2095,10 +2126,13 @@ static int __init zram_init(void)
> if (ret < 0)
> return ret;
>
> + down_write(&zram_unload);
> +
> ret = class_register(&zram_control_class);
> if (ret) {
> pr_err("Unable to register zram-control class\n");
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + up_write(&zram_unload);
> return ret;
> }
>
> @@ -2108,6 +2142,7 @@ static int __init zram_init(void)
> pr_err("Unable to get major number\n");
> class_unregister(&zram_control_class);
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + up_write(&zram_unload);
> return -EBUSY;
> }
>
> @@ -2119,10 +2154,12 @@ static int __init zram_init(void)
> goto out_error;
> num_devices--;
> }
> + up_write(&zram_unload);
>
> return 0;
>
> out_error:
> + up_write(&zram_unload);
> destroy_devices();
> return ret;
> }
As I mentioned above, it doesn't close the races I gave as examples,
unless I am missing something. Let's discuss further.
Thank you!
On Mon, Mar 22, 2021 at 03:12:01PM -0700, Minchan Kim wrote:
> On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote:
> >
> > I would not call it *every* syfs knob as not all deal with things which
> > are related to CPU hotplug multistate, right? Note that using just
> > try_module_get() alone (that is the second patch only, does not fix the
> > race I am describing above).
>
> It wouldn't be CPU hotplug multistate issue but I'd like to call it
> as more "zram instance race" bug.
> What happens in this case?
>
> CPU 1 CPU 2
>
> destroy_devices
> ..
> compact_store()
> zram = dev_to_zram(dev);
> idr_for_each(zram_remove_cb
> zram_remove
> ..
> kfree(zram)
> down_read(&zram->init_lock);
>
>
> CPU 1 CPU 2
>
> hot_remove_store
> compact_store()
> zram = dev_to_zram(dev);
> zram_remove
> kfree(zram)
> down_read(&zram->init_lock);
>
> So, for me we need to close the zram instance create/removal
> with sysfs rather than focusing on CPU hotplug issue.
Sure, that's a good point.
The race I noted, which ends up in a deadlock, is only possible if a
shared lock is used both on removal and in the sysfs knobs.
At first glance, the issue you describe above *seems* to be just a matter
of the proper care driver developers must take with the structures they
use. It is certainly yet another issue we need to address, and if we can
generalize a solution, even better. I now recall I *think* I spotted that
race a while ago and mentioned it to Kees and David Howells, but I didn't
have a solution for it yet. More on this below.
The issue you point out is real; however, you cannot disregard the
possible CPU hotplug race either, as it is a design consideration the
CPU hotplug multistate support explicitly warns about -- consider driver
removal. I agree that solving this "zram instance race" can perhaps fix
the hotplug race as well. If we can solve all 3 issues in one shot, even
better. But let's evaluate that prospect...
> Maybe, we could reuse zram_index_mutex with modifying it with
> rw_semaphore. What do you think?
Although ideal, given it would knock 3 birds with 1 stone, it ends up
actually making the sysfs attributes rather useless in light of the
requirements for each of the races. Namely, the sysfs deadlock race
*must* use a trylock approach, just as in the try_module_get() case.
It must use this approach so as to bail out immediately if our module
is being removed, and so on our __exit path. By trying to repurpose
zram_index_mutex we end up doing too much with it and making the
sysfs attributes rather fragile for most uses:
Consider disksize_show(), that would have to become:
static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, char *buf)
{
struct zram *zram = dev_to_zram(dev);
+	u64 disksize;
+ down_read(&zram_index_rwlock);
+ disksize = zram->disksize;
+ up_read(&zram_index_rwlock);
- return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
+ return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
}
What's wrong with this?
It can block during a write, yes, but there is one type of write that
will make this crash after the read lock is acquired: when the instance
is removed. What if we try down_read_trylock() instead?
static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, char *buf)
{
struct zram *zram = dev_to_zram(dev);
+	u64 disksize;
+ if (!down_read_trylock(&zram_index_rwlock))
+ return -ENODEV;
+
+ disksize = zram->disksize;
+ up_read(&zram_index_rwlock);
- return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
+ return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
}
What's wrong with this?
If it got the lock, it should be OK, as it keeps writers from taking the
lock for a bit. But then this just becomes pretty fragile: it will fail
whenever a write happens to be in flight, perhaps triggering quite a few
regressions in tests.
And if we use down_write_trylock() we end up with the same fragility:
the read fails with -ENODEV for any silly thing going on in the driver.
And come to think of it, the last patch I sent with a new
DECLARE_RWSEM(zram_unload) has this same issue, making most
sysfs attributes rather fragile.
So, I still believe we should split this up into two separate patches,
as I had originally proposed. I looked into the race you described, and
I believe I have a generic solution for it as well.
As for the sysfs deadlock possible with drivers, this fixes it in a generic way:
commit fac43d8025727a74f80a183cc5eb74ed902a5d14
Author: Luis Chamberlain <[email protected]>
Date: Sat Mar 27 14:58:15 2021 +0000
sysfs: add optional module_owner to attribute
This is needed as otherwise the owner of the attribute
or group read/store might have a shared lock used on driver removal,
and deadlock if we race with driver removal.
Signed-off-by: Luis Chamberlain <[email protected]>
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b44cd8ee2eb7..494695ff227e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1899,6 +1899,7 @@ static struct attribute *zram_disk_attrs[] = {
static const struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
+ .owner = THIS_MODULE,
};
static const struct attribute_group *zram_disk_attr_groups[] = {
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7e0e62deab53..faedaaaed247 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
#include <linux/slab.h>
#include <linux/security.h>
#include <linux/hash.h>
+#include <linux/module.h>
#include "kernfs-internal.h"
@@ -415,9 +416,15 @@ struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
if (unlikely(!kn))
return NULL;
- if (!atomic_inc_unless_negative(&kn->active))
+ if (unlikely(kn->owner && !try_module_get(kn->owner)))
return NULL;
+ if (!atomic_inc_unless_negative(&kn->active)) {
+ if (kn->owner)
+ module_put(kn->owner);
+ return NULL;
+ }
+
if (kernfs_lockdep(kn))
rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
return kn;
@@ -440,6 +447,8 @@ void kernfs_put_active(struct kernfs_node *kn)
if (kernfs_lockdep(kn))
rwsem_release(&kn->dep_map, _RET_IP_);
v = atomic_dec_return(&kn->active);
+ if (kn->owner)
+ module_put(kn->owner);
if (likely(v != KN_DEACTIVATED_BIAS))
return;
@@ -612,6 +621,7 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
unsigned flags)
{
@@ -647,6 +657,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
kn->name = name;
kn->mode = mode;
+ kn->owner = owner;
kn->flags = flags;
if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
@@ -680,13 +691,14 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
unsigned flags)
{
struct kernfs_node *kn;
kn = __kernfs_new_node(kernfs_root(parent), parent,
- name, mode, uid, gid, flags);
+ name, mode, owner, uid, gid, flags);
if (kn) {
kernfs_get(parent);
kn->parent = parent;
@@ -967,7 +979,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
root->id_highbits = 1;
kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
- GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
+ NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
KERNFS_DIR);
if (!kn) {
idr_destroy(&root->ino_idr);
@@ -1006,6 +1018,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
* @parent: parent in which to create a new directory
* @name: name of the new directory
* @mode: mode of the new directory
+ * @owner: if set, the module owner of this directory
* @uid: uid of the new directory
* @gid: gid of the new directory
* @priv: opaque data associated with the new directory
@@ -1015,6 +1028,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
*/
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
void *priv, const void *ns)
{
@@ -1023,7 +1037,7 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
/* allocate */
kn = kernfs_new_node(parent, name, mode | S_IFDIR,
- uid, gid, KERNFS_DIR);
+ owner, uid, gid, KERNFS_DIR);
if (!kn)
return ERR_PTR(-ENOMEM);
@@ -1055,7 +1069,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
/* allocate */
kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
- GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+ NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index c75719312147..69080869abfc 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -961,6 +961,7 @@ const struct file_operations kernfs_file_fops = {
* @uid: uid of the file
* @gid: gid of the file
* @size: size of the file
+ * @owner: if set, the module owner of the file
* @ops: kernfs operations for the file
* @priv: private data for the file
* @ns: optional namespace tag of the file
@@ -970,7 +971,9 @@ const struct file_operations kernfs_file_fops = {
*/
struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode,
+ struct module *owner,
+ kuid_t uid, kgid_t gid,
loff_t size,
const struct kernfs_ops *ops,
void *priv, const void *ns,
@@ -983,7 +986,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
flags = KERNFS_FILE;
kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
- uid, gid, flags);
+ owner, uid, gid, flags);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index ccc3b44f6306..5598c1127db7 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -112,6 +112,7 @@ void kernfs_put_active(struct kernfs_node *kn);
int kernfs_add_one(struct kernfs_node *kn);
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
unsigned flags);
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 5432883d819f..8e130a411a9f 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
gid = target->iattr->ia_gid;
}
- kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
+ kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO,
+ target->owner, uid, gid,
KERNFS_LINK);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 59dffd5ca517..5aff6ff07392 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -57,7 +57,8 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
kobject_get_ownership(kobj, &uid, &gid);
kn = kernfs_create_dir_ns(parent, kobject_name(kobj),
- S_IRWXU | S_IRUGO | S_IXUGO, uid, gid,
+ S_IRWXU | S_IRUGO | S_IXUGO, NULL,
+ uid, gid,
kobj, ns);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 9aefa7779b29..acd0756199e1 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -314,8 +314,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
key = attr->key ?: (struct lock_class_key *)&attr->skey;
#endif
- kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
- size, ops, (void *)attr, ns, key);
+ kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner,
+ uid, gid, size, ops, (void *)attr, ns, key);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index 64e6a6698935..010c2dade2d6 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -136,6 +136,7 @@ static int internal_create_group(struct kobject *kobj, int update,
} else {
kn = kernfs_create_dir_ns(kobj->sd, grp->name,
S_IRWXU | S_IRUGO | S_IXUGO,
+ grp->owner,
uid, gid, kobj, NULL);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 9e8ca8743c26..43abaed26735 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -155,6 +155,7 @@ struct kernfs_node {
unsigned short flags;
umode_t mode;
+ struct module *owner;
struct kernfs_iattrs *iattr;
};
@@ -368,12 +369,14 @@ void kernfs_destroy_root(struct kernfs_root *root);
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
void *priv, const void *ns);
struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
const char *name);
struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
loff_t size,
const struct kernfs_ops *ops,
@@ -465,13 +468,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
static inline struct kernfs_node *
kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode, struct module *owner,
+ kuid_t uid, kgid_t gid,
void *priv, const void *ns)
{ return ERR_PTR(-ENOSYS); }
static inline struct kernfs_node *
__kernfs_create_file(struct kernfs_node *parent, const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode, struct module *owner,
+ kuid_t uid, kgid_t gid,
loff_t size, const struct kernfs_ops *ops,
void *priv, const void *ns, struct lock_class_key *key)
{ return ERR_PTR(-ENOSYS); }
@@ -558,14 +563,15 @@ static inline struct kernfs_node *
kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
void *priv)
{
- return kernfs_create_dir_ns(parent, name, mode,
+ return kernfs_create_dir_ns(parent, name, mode, parent->owner,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
priv, NULL);
}
static inline struct kernfs_node *
kernfs_create_file_ns(struct kernfs_node *parent, const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode, struct module *owner,
+ kuid_t uid, kgid_t gid,
loff_t size, const struct kernfs_ops *ops,
void *priv, const void *ns)
{
@@ -574,15 +580,16 @@ kernfs_create_file_ns(struct kernfs_node *parent, const char *name,
#ifdef CONFIG_DEBUG_LOCK_ALLOC
key = (struct lock_class_key *)&ops->lockdep_key;
#endif
- return __kernfs_create_file(parent, name, mode, uid, gid,
+ return __kernfs_create_file(parent, name, mode, owner, uid, gid,
size, ops, priv, ns, key);
}
static inline struct kernfs_node *
kernfs_create_file(struct kernfs_node *parent, const char *name, umode_t mode,
+ struct module *owner,
loff_t size, const struct kernfs_ops *ops, void *priv)
{
- return kernfs_create_file_ns(parent, name, mode,
+ return kernfs_create_file_ns(parent, name, mode, owner,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
size, ops, priv, NULL);
}
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index d76a1ddf83a3..6652504b860e 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -30,6 +30,7 @@ enum kobj_ns_type;
struct attribute {
const char *name;
umode_t mode;
+ struct module *owner;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
bool ignore_lockdep:1;
struct lock_class_key *key;
@@ -80,6 +81,7 @@ do { \
* @attrs: Pointer to NULL terminated list of attributes.
* @bin_attrs: Pointer to NULL terminated list of binary attributes.
* Either attrs or bin_attrs or both must be provided.
> + * @owner: If set, the module responsible for this attribute group
*/
struct attribute_group {
const char *name;
@@ -89,6 +91,7 @@ struct attribute_group {
struct bin_attribute *, int);
struct attribute **attrs;
struct bin_attribute **bin_attrs;
+ struct module *owner;
};
/*
@@ -100,38 +103,52 @@ struct attribute_group {
#define __ATTR(_name, _mode, _show, _store) { \
.attr = {.name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
#define __ATTR_PREALLOC(_name, _mode, _show, _store) { \
.attr = {.name = __stringify(_name), \
- .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
+ .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
#define __ATTR_RO(_name) { \
- .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0444, \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
}
#define __ATTR_RO_MODE(_name, _mode) { \
.attr = { .name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
}
#define __ATTR_RW_MODE(_name, _mode) { \
.attr = { .name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
.store = _name##_store, \
}
#define __ATTR_WO(_name) { \
- .attr = { .name = __stringify(_name), .mode = 0200 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0200, \
+ .owner = THIS_MODULE, \
+ }, \
.store = _name##_store, \
}
@@ -141,8 +158,11 @@ struct attribute_group {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
#define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) { \
- .attr = {.name = __stringify(_name), .mode = _mode, \
- .ignore_lockdep = true }, \
+ .attr = {.name = __stringify(_name), \
+ .mode = _mode, \
+ .ignore_lockdep = true, \
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
@@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = { \
#define ATTRIBUTE_GROUPS(_name) \
static const struct attribute_group _name##_group = { \
.attrs = _name##_attrs, \
+ .owner = THIS_MODULE, \
}; \
__ATTRIBUTE_GROUPS(_name)
@@ -193,20 +214,29 @@ struct bin_attribute {
/* macros to create static binary attributes easier */
#define __BIN_ATTR(_name, _mode, _read, _write, _size) { \
- .attr = { .name = __stringify(_name), .mode = _mode }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = _mode, \
+ .owner = THIS_MODULE, \
+ }, \
.read = _read, \
.write = _write, \
.size = _size, \
}
#define __BIN_ATTR_RO(_name, _size) { \
- .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0444, \
+ .owner = THIS_MODULE, \
+ }, \
.read = _name##_read, \
.size = _size, \
}
#define __BIN_ATTR_WO(_name, _size) { \
- .attr = { .name = __stringify(_name), .mode = 0200 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0200, \
+ .owner = THIS_MODULE, \
+ }, \
.write = _name##_write, \
.size = _size, \
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20e5cc6..c3b650748029 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3820,7 +3820,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
key = &cft->lockdep_key;
#endif
kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
- cgroup_file_mode(cft),
+ cgroup_file_mode(cft), NULL,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
0, cft->kf_ops, cft,
NULL, key);
As for the race you described, I think this might resolve it:
diff --git a/drivers/base/core.c b/drivers/base/core.c
index f29839382f81..5e3f65ab8148 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -29,6 +29,7 @@
#include <linux/sched/mm.h>
#include <linux/sysfs.h>
#include <linux/dma-map-ops.h> /* for dma_default_coherent */
+#include <linux/blkdev.h>
#include "base.h"
#include "power/power.h"
@@ -1911,11 +1912,20 @@ static inline int device_is_not_partition(struct device *dev)
{
return !(dev->type == &part_type);
}
+static inline bool device_is_disk(struct device *dev)
+{
+ return (dev->type == &disk_type);
+}
#else
static inline int device_is_not_partition(struct device *dev)
{
return 1;
}
+
+static inline bool device_is_disk(struct device *dev)
+{
+ return false;
+}
#endif
static int
@@ -1960,6 +1970,19 @@ const char *dev_driver_string(const struct device *dev)
}
EXPORT_SYMBOL(dev_driver_string);
+static int dev_type_get(struct device *dev)
+{
+ if (device_is_disk(dev))
+ return !!bdgrab(dev_to_bdev(dev));
+ return 1;
+}
+
+static void dev_type_put(struct device *dev)
+{
+ if (device_is_disk(dev))
+ bdput(dev_to_bdev(dev));
+}
+
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
@@ -1969,12 +1992,20 @@ static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
struct device *dev = kobj_to_dev(kobj);
ssize_t ret = -EIO;
+ if (!dev_type_get(dev))
+ return -ENODEV;
+
if (dev_attr->show)
ret = dev_attr->show(dev, dev_attr, buf);
if (ret >= (ssize_t)PAGE_SIZE) {
printk("dev_attr_show: %pS returned bad count\n",
dev_attr->show);
}
+
+ dev_type_put(dev);
+
+ put_device(dev);
+
return ret;
}
@@ -1985,8 +2016,14 @@ static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,
struct device *dev = kobj_to_dev(kobj);
ssize_t ret = -EIO;
+ if (!dev_type_get(dev))
+ return -ENODEV;
+
if (dev_attr->store)
ret = dev_attr->store(dev, dev_attr, buf, count);
+
+ dev_type_put(dev);
+
return ret;
}
On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> As for the syfs deadlock possible with drivers, this fixes it in a generic way:
>
> commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> Author: Luis Chamberlain <[email protected]>
> Date: Sat Mar 27 14:58:15 2021 +0000
>
> sysfs: add optional module_owner to attribute
>
> This is needed as otherwise the owner of the attribute
> or group read/store might have a shared lock used on driver removal,
> and deadlock if we race with driver removal.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
No, please no. Module removal is a "best effort"; if the system dies
when it happens, that's on you. I am not willing to expend extra energy
and maintenance on core things like sysfs for stuff like this that does
not matter on any system other than a developer's box.
Lock data, not code, please. Trying to tie data structures' lifespans
to the lifespan of code is a tangled mess, and one that I do not want to
add to in any form.
sorry,
greg k-h
On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > As for the syfs deadlock possible with drivers, this fixes it in a generic way:
> >
> > commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> > Author: Luis Chamberlain <[email protected]>
> > Date: Sat Mar 27 14:58:15 2021 +0000
> >
> > sysfs: add optional module_owner to attribute
> >
> > This is needed as otherwise the owner of the attribute
> > or group read/store might have a shared lock used on driver removal,
> > and deadlock if we race with driver removal.
> >
> > Signed-off-by: Luis Chamberlain <[email protected]>
>
> No, please no. Module removal is a "best effort",
Not for live patching. I am not sure if I am missing any other valid
use case?
> if the system dies when it happens, that's on you.
I think the better approach for now is simply to have testers / etc.
deal with this open-coded. I cannot be sure that, other than live
patching, there are no other valid use cases for module removal, and for
the races we really may care about, userspace *will* typically be mucking
with sysfs attributes. Monitoring my system's sysfs attributes, I am
actually quite surprised at the random pokes at them.
> I am not willing to expend extra energy
> and maintance of core things like sysfs for stuff like this that does
> not matter in any system other than a developer's box.
Should we document this as well? Without that, it is unclear that tons of
random tests are being sanely nullified. At least the deadlock I spotted
can take a pretty common form in many drivers.
> Lock data, not code please. Trying to tie data structure's lifespans
> to the lifespan of code is a tangled mess, and one that I do not want to
> add to in any form.
Driver developers will simply have to open-code these protections. In
light of what I see with LTP / fuzzing, I suspect the use case will grow
and we'll have to revisit this in the future. But for now, sure, we can
just open-code the required protections everywhere so we don't crash on
module removal.
Luis
On Fri, Apr 02, 2021 at 06:30:16PM +0000, Luis Chamberlain wrote:
> On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > > As for the syfs deadlock possible with drivers, this fixes it in a generic way:
> > >
> > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> > > Author: Luis Chamberlain <[email protected]>
> > > Date: Sat Mar 27 14:58:15 2021 +0000
> > >
> > > sysfs: add optional module_owner to attribute
> > >
> > > This is needed as otherwise the owner of the attribute
> > > or group read/store might have a shared lock used on driver removal,
> > > and deadlock if we race with driver removal.
> > >
> > > Signed-off-by: Luis Chamberlain <[email protected]>
> >
> > No, please no. Module removal is a "best effort",
>
> Not for live patching. I am not sure if I am missing any other valid
> use case?
Live patching removes modules? We have so many code paths that are
"best effort" when it comes to module unloading; trying to resolve this
one is a valiant try, but not realistic.
> > if the system dies when it happens, that's on you.
>
> I think the better approach for now is simply to call testers / etc to
> deal with this open coded. I cannot be sure that other than live
> patching there may be other valid use cases for module removal, and for
> races we really may care for where userspace *will* typically be mucking
> with sysfs attributes. Monitoring my systems's sysfs attributes I am
> actually quite surprised at the random pokes at them.
>
> > I am not willing to expend extra energy
> > and maintance of core things like sysfs for stuff like this that does
> > not matter in any system other than a developer's box.
>
> Should we document this as well? Without this it is unclear that tons of
> random tests are sanely nullified. At least this dead lock I spotted can
> be pretty common form on many drivers.
What other drivers have this problem?
> > Lock data, not code please. Trying to tie data structure's lifespans
> > to the lifespan of code is a tangled mess, and one that I do not want to
> > add to in any form.
>
> Driver developers will simply have to open code these protections. In
> light of what I see on LTP / fuzzing, I suspect the use case will grow
> and we'll have to revisit this in the future. But for now, sure, we can
> just open code the required protections everywhere to not crash on module
> removal.
LTP and fuzzing do not remove modules either. So I do not understand the
root problem here; that's just something that does not happen on a real
system.
thanks,
greg k-h
On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> On Mon, Mar 22, 2021 at 03:12:01PM -0700, Minchan Kim wrote:
> > On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote:
> > >
> > > I would not call it *every* syfs knob as not all deal with things which
> > > are related to CPU hotplug multistate, right? Note that using just
> > > try_module_get() alone (that is the second patch only, does not fix the
> > > race I am describing above).
> >
> > It wouldn't be CPU hotplug multistate issue but I'd like to call it
> > as more "zram instance race" bug.
> > What happens in this case?
> >
> > CPU 1 CPU 2
> >
> > destroy_devices
> > ..
> > compact_store()
> > zram = dev_to_zram(dev);
> > idr_for_each(zram_remove_cb
> > zram_remove
> > ..
> > kfree(zram)
> > down_read(&zram->init_lock);
> >
> >
> > CPU 1 CPU 2
> >
> > hot_remove_store
> > compact_store()
> > zram = dev_to_zram(dev);
> > zram_remove
> > kfree(zram)
> > down_read(&zram->init_lock);
> >
> > So, for me we need to close the zram instance create/removal
> > with sysfs rather than focusing on CPU hotplug issue.
>
> Sure, that's a good point.
>
> The issue which I noted for the race which ends up in a deadlock is only
> possible if a shared lock is used on removal but also on sysfs knobs.
>
> At first glance, the issue you describe above *seems* to be just proper
> care driver developers must take with structures used. It is certainly
> yet another issue we need to address, and if we can generalize a
> solution even better. I now recall I *think* I spotted that race a while
> ago and mentioned it to Kees and David Howells but I didn't have a
> solution for it yet. More on this below.
>
> The issue you point out is real, however you cannot disregard the
> CPU hoplug possible race as well, it is a design consideration which
> the CPU hotplug multistate support warns for -- consider driver removal.
> I agree that perhaps solving this "zram instance race" can fix he
> hotplug race as well. If we can solves all 3 issues in one shot even
> better. But let's evaluate that prospect...
>
> > Maybe, we could reuse zram_index_mutex with modifying it with
> > rw_semaphore. What do you think?
>
> Although ideal given it would knock 3 birds with 1 stone, it ends up
> actually making the sysfs attributes rather useless in light of the
> requirements for each of the races. Namely, the sysfs deadlock race
> *must* use a try lock approach, just as the try_module_get() case.
> It must use this approach so to immediately just bail out if we have
> our module being removed, and so on our __exit path. By trying to
> repurpose zram_index_mutex we end up then just doing too much with it
> and making the syfs attributes rather fragile for most uses:
>
> Consider disksize_show(), that would have to become:
>
> static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, char *buf)
> {
> struct zram *zram = dev_to_zram(dev);
> + size_t disksize;
>
> + down_read(&zram_index_rwlock);
> + disksize = zram->disksize;
> + up_read(&zram_index_rwlock);
> - return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
> + return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
> }
>
> What's wrong with this?
>
> It can block during a write, yes, but there is a type of write which
> will make this crash after the read lock is acquired: when the instance
> is removed. What if we try down_read_trylock()?
>
> static ssize_t disksize_show(struct device *dev, struct device_attribute *attr, char *buf)
> {
> struct zram *zram = dev_to_zram(dev);
> + size_t disksize;
>
> + if (!down_read_trylock(&zram_index_rwlock))
> + return -ENODEV;
> +
> + disksize = zram->disksize;
> + up_read(&zram_index_rwlock);
> - return scnprintf(buf, PAGE_SIZE, "%llu\n", zram->disksize);
> + return scnprintf(buf, PAGE_SIZE, "%llu\n", disksize);
> }
>
> What's wrong with this?
>
> If it got the lock, it should be OK as it is preventing writes from
> taking the lock for a bit. But then this just becomes pretty fragile:
> it will fail whenever another read or write is happening, triggering
> perhaps quite a few test regressions.
>
> And if we use write_trylock() we end up with the same fragile nature
> of failing the read with ENODEV for any silly thing going on with the
> driver.
>
> And come to think of it the last patch I had sent with a new
> DECLARE_RWSEM(zram_unload) also has this same issue making most
> sysfs attributes rather fragile.
Thanks for looking into this. I agree the single zram_index_rwlock is
not the right approach to fix it. However, I still hope we can find a more
generic solution to fix them all at once, since I see it as a zram instance
racing problem.
An approach I am considering is to make struct zram include a kobject
and then have the zram sysfs entries auto-populated under that kobject, so
the zram sysfs lifetime is tied to the kobject. With that, the sysfs race
problem I mentioned above should be gone. Furthermore, zram_remove should
fail if any zram object is still alive
(i.e., zram->kobject->refcount > 1), so module_exit will fail, too.
One of the problems I see is how I could put the new zram object's
attribute group for the zram knobs under /sys/block/zram0, since the block
layer already created the zram0 kobject via device_add_disk().
On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote:
> On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote:
> > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > > And come to think of it the last patch I had sent with a new
> > > DECLARE_RWSEM(zram_unload) also has this same issue making most
> > > sysfs attributes rather fragile.
> >
> > Thanks for looking the way. I agree the single zram_index_rwlock is
> > not the right approach to fix it. However, I still hope we find more
> > generic solution to fix them at once since I see it's zram instance
> > racing problem.
>
> They are 3 separate problems. Related, but different.
What are the 3 different problems? I am asking since I remember only two:
one for CPU multistate and the other for sysfs during rmmod.
Knowing the one I am missing would help me understand why you are saying
they are all different problems (even though they are related).
> At this point I think it would be difficult to resolve all 3
> with one solution without creating side issues, but hey,
> I'm curious if you find a solution that does not create other
> issues.
Yeah, let me try something with the kobject approach and see how far I
can go, but first let me understand what I was missing from the problems
you mentioned above.
>
> > A approach I am considering is to make struct zram include kobject
> > and then make zram sysfs auto populated under the kobject. So, zram/sysfs
> > lifetime should be under the kobject. With it, sysfs race probem I
> > mentioned above should be gone. Furthermore, zram_remove should fail
> > if one of the alive zram objects is existing
> > (i.e., zram->kobject->refcount > 1) so module_exit will fail, too.
>
> If the idea then is to busy out rmmod if a sysfs attribute is being
> read, that could then mean rmmod can sometimes never complete. Hogging
> up / busying out sysfs attributes means the module cannot be removed.
It's true, but is it a big problem? There are many cases where the system
just returns an error if it's busy and relies on the admin. IMHO, rmmod
should be one of them.
> Which is why the *try_module_get()* I think is much more suitable, as
> it will always fail if we're already going down.
How does try_module_get() solve the problem? I thought it was a
general problem for every sysfs knob in zram. Do you mean
you will add module_get/put to every sysfs knob in zram?
>
> > I see one of the problems is how I could make new zram object's
> > attribute group for zram knobs under /sys/block/zram0 since block
> > layer already made zram0 kobject via device_add_disk.
>
> Right.. well the sysfs attribute races uncovered here actually do
> apply to any block driver as well. And which is why I was aiming
> for something generic if possible.
It would be great, but that's not what we have atm, so I want to
proceed with a fix anyway.
>
> I am not sure if you missed the last hunks of the generic solution,
> but that would resolve the issue you noted. Here is the same approach
> but in a non-generic solution, specific to just one attribute so far
> and to zram:
So the idea is to refcount the block_device's inode, and the module_exit
path also checks the inode refcount to make rmmod fail?
Hmm, might work, but I am not sure it's the right layer, either.
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 494695ff227e..b566916e4ad9 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1724,6 +1724,11 @@ static ssize_t disksize_store(struct device *dev,
>
> mutex_lock(&zram_index_mutex);
>
> + if (!bdgrab(dev_to_bdev(dev))) {
> + err = -ENODEV;
> + goto out_nodev;
> + }
> +
> if (!zram_up || zram->claim) {
> err = -ENODEV;
> goto out;
> @@ -1760,6 +1765,7 @@ static ssize_t disksize_store(struct device *dev,
> zram->disksize = disksize;
> set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
> up_write(&zram->init_lock);
> + bdput(dev_to_bdev(dev));
>
> mutex_unlock(&zram_index_mutex);
>
> @@ -1770,6 +1776,8 @@ static ssize_t disksize_store(struct device *dev,
> out_unlock:
> up_write(&zram->init_lock);
> out:
> + bdput(dev_to_bdev(dev));
> +out_nodev:
> mutex_unlock(&zram_index_mutex);
> return err;
> }
On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote:
> On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > And come to think of it the last patch I had sent with a new
> > DECLARE_RWSEM(zram_unload) also has this same issue making most
> > sysfs attributes rather fragile.
>
> Thanks for looking the way. I agree the single zram_index_rwlock is
> not the right approach to fix it. However, I still hope we find more
> generic solution to fix them at once since I see it's zram instance
> racing problem.
They are 3 separate problems. Related, but different.
At this point I think it would be difficult to resolve all 3
with one solution without creating side issues, but hey,
I'm curious if you find a solution that does not create other
issues.
> A approach I am considering is to make struct zram include kobject
> and then make zram sysfs auto populated under the kobject. So, zram/sysfs
> lifetime should be under the kobject. With it, sysfs race probem I
> mentioned above should be gone. Furthermore, zram_remove should fail
> if one of the alive zram objects is existing
> (i.e., zram->kobject->refcount > 1) so module_exit will fail, too.
If the idea then is to busy out rmmod if a sysfs attribute is being
read, that could then mean rmmod can sometimes never complete. Hogging
up / busying out sysfs attributes means the module cannot be removed.
Which is why the *try_module_get()* I think is much more suitable, as
it will always fail if we're already going down.
> I see one of the problems is how I could make new zram object's
> attribute group for zram knobs under /sys/block/zram0 since block
> layer already made zram0 kobject via device_add_disk.
Right.. well the sysfs attribute races uncovered here actually do
apply to any block driver as well. And which is why I was aiming
for something generic if possible.
I am not sure if you missed the last hunks of the generic solution,
but that would resolve the issue you noted. Here is the same approach
but in a non-generic solution, specific to just one attribute so far
and to zram:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 494695ff227e..b566916e4ad9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1724,6 +1724,11 @@ static ssize_t disksize_store(struct device *dev,
mutex_lock(&zram_index_mutex);
+ if (!bdgrab(dev_to_bdev(dev))) {
+ err = -ENODEV;
+ goto out_nodev;
+ }
+
if (!zram_up || zram->claim) {
err = -ENODEV;
goto out;
@@ -1760,6 +1765,7 @@ static ssize_t disksize_store(struct device *dev,
zram->disksize = disksize;
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);
+ bdput(dev_to_bdev(dev));
mutex_unlock(&zram_index_mutex);
@@ -1770,6 +1776,8 @@ static ssize_t disksize_store(struct device *dev,
out_unlock:
up_write(&zram->init_lock);
out:
+ bdput(dev_to_bdev(dev));
+out_nodev:
mutex_unlock(&zram_index_mutex);
return err;
}
On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote:
> On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote:
> > On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote:
> > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > > > And come to think of it the last patch I had sent with a new
> > > > DECLARE_RWSEM(zram_unload) also has this same issue making most
> > > > sysfs attributes rather fragile.
> > >
> > > Thanks for looking the way. I agree the single zram_index_rwlock is
> > > not the right approach to fix it. However, I still hope we find more
> > > generic solution to fix them at once since I see it's zram instance
> > > racing problem.
> >
> > They are 3 separate different problems. Related, but different.
>
> What are 3 different problems? I am asking since I remember only two:
> one for CPU multistate and the other one for sysfs during rmmod.
The third one is the race where sysfs attribute routines
dereference the gendisk private_data.
> > If the idea then is to busy out rmmod if a sysfs attribute is being
> > read, that could then mean rmmod can sometimes never complete. Hogging
> > up / busying out sysfs attributes means the module cannto be removed.
>
> It's true but is it a big problem? There are many cases that system
> just return error if it's busy and rely on the admin. IMHO, rmmod should
> be part of them.
It depends on the existing userspace scripts used for testing and the
expectations they set. Consider existing tests; you would know better, and
since you are the maintainer, you decide.
I at least know that for many other types of device drivers, rmmod is
a sledgehammer.
You decide. I just thought it would be good to highlight the effect now
rather than us considering it later.
> > Which is why the *try_module_get()* I think is much more suitable, as
> > it will always fails if we're already going down.
>
> How does try_module_get() solve the problem?
The try stuff only resolves the deadlock. The bdget() / bdput() resolves
the race to access the gendisk private_data.
> > > I see one of the problems is how I could make new zram object's
> > > attribute group for zram knobs under /sys/block/zram0 since block
> > > layer already made zram0 kobject via device_add_disk.
> >
> > Right.. well the syfs attribute races uncovered here actually do
> > apply to any block driver as well. And which is why I was aiming
> > for something generic if possible.
>
> It would be great but that's not the one we have atm so want to
> proceed to fix anyway.
What is not the one we have atm? I *do* have a proposed generic solution
for 2 of the 3 issues we have been discussing:
a) deadlock on sysfs access
b) gendisk private_data race
But so far Greg does not see enough justification for a), so we can either
show how widespread this issue is (which I can do using coccinelle), or we
just open code the try_module_get() / put on each driver that needs it
for now. Either way it would resolve the issue.
As for b), given that I think even you had missed my attempt to
generalize the bdget/bdput solution for any attribute type (did you see
my dev_type_get() and dev_type_put() proposed changes?), I don't think
this problem is yet well defined in a generic way for us to rule it out.
It is however easier to simply open code this per driver that needs it
for now, given that I don't think Greg is yet convinced the deadlock is
a widespread issue. I however am pretty sure both races *do*
exist outside of zram in many places.
> > I am not sure if you missed the last hunks of the generic solution,
> > but that would resolve the issue you noted. Here is the same approach
> > but in a non-generic solution, specific to just one attribute so far
> > and to zram:
>
> So the idea is to refcount the block_device's inode
Yes, that itself prevents races against the gendisk private_data
becoming invalid. Why? Because a bdget() would not be successful after
del_gendisk():
del_gendisk() --> invalidate_partition() --> __invalidate_device() --> invalidate_inodes()
> and the module_exit path
> also checks the inode refcount to make rmmod fail?
The try_module_get() approach resolves the deadlock race, but it does
so in a lazy way. I mean lazy in that rmmod then wins over sysfs knobs.
So touching sysfs knobs won't make an rmmod fail. I think that's the more
typically expected behaviour. Why? Because I find it odd that looping
forever touching sysfs attributes should prevent a module removal. But
that's a personal preference.
Luis
Hi,
> > Driver developers will simply have to open code these protections. In
> > light of what I see on LTP / fuzzing, I suspect the use case will grow
> > and we'll have to revisit this in the future. But for now, sure, we can
> > just open code the required protections everywhere to not crash on module
> > removal.
>
> LTP and fuzzing too do not remove modules. So I do not understand the
> root problem here, that's just something that does not happen on a real
> system.
If I am not mistaken, the issue that Luis tries to solve here was indeed
found by running LTP.
> On Sat, Apr 03, 2021 at 08:13:23AM +0200, Greg KH wrote:
> > On Fri, Apr 02, 2021 at 06:30:16PM +0000, Luis Chamberlain wrote:
> > > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> > > > No, please no. Module removal is a "best effort",
> > >
> > > Not for live patching. I am not sure if I am missing any other valid
> > > use case?
> >
> > live patching removes modules? We have so many code paths that are
> > "best effort" when it comes to module unloading, trying to resolve this
> > one is a valiant try, but not realistic.
>
> Miroslav, your input / help here would be valuable. I did the
> generalization work because you said it would be worthwhile for you too...
Yes, we have the option to revert and remove the existing live patch from
the system. I am not sure how (if) it is used in practice.
At least at SUSE we do not support the option. But we are only one of the
many downstream users. So yes, there is the option.
Miroslav
On Tue, Apr 06, 2021 at 02:00:19PM +0200, Miroslav Benes wrote:
> Hi,
>
> > > Driver developers will simply have to open code these protections. In
> > > light of what I see on LTP / fuzzing, I suspect the use case will grow
> > > and we'll have to revisit this in the future. But for now, sure, we can
> > > just open code the required protections everywhere to not crash on module
> > > removal.
> >
> > LTP and fuzzing too do not remove modules. So I do not understand the
> > root problem here, that's just something that does not happen on a real
> > system.
>
> If I am not mistaken, the issue that Luis tries to solve here was indeed
> found by running LTP.
>
> > On Sat, Apr 03, 2021 at 08:13:23AM +0200, Greg KH wrote:
> > > On Fri, Apr 02, 2021 at 06:30:16PM +0000, Luis Chamberlain wrote:
> > > > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> > > > > No, please no. Module removal is a "best effort",
> > > >
> > > > Not for live patching. I am not sure if I am missing any other valid
> > > > use case?
> > >
> > > live patching removes modules? We have so many code paths that are
> > > "best effort" when it comes to module unloading, trying to resolve this
> > > one is a valiant try, but not realistic.
> >
> > Miroslav, your input / help here would be valuable. I did the
> > generalization work because you said it would be worthy for you too...
>
> Yes, we have the option to revert and remove the existing live patch from
> the system. I am not sure how (if) it is used in practice.
>
> At least at SUSE we do not support the option. But we are only one of the
> many downstream users. So yes, there is the option.
Same for Red Hat. Unloading livepatch modules seems to work fine, but
isn't officially supported.
That said, if rmmod is just considered a development aid, and we're
going to be ignoring bugs, we should make it official with a new
TAINT_RMMOD.
--
Josh
On Tue, Apr 06, 2021 at 12:29:09AM +0000, Luis Chamberlain wrote:
> On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote:
> > On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote:
> > > On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote:
> > > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > > > > And come to think of it the last patch I had sent with a new
> > > > > DECLARE_RWSEM(zram_unload) also has this same issue making most
> > > > > sysfs attributes rather fragile.
> > > >
> > > > Thanks for looking the way. I agree the single zram_index_rwlock is
> > > > not the right approach to fix it. However, I still hope we find more
> > > > generic solution to fix them at once since I see it's zram instance
> > > > racing problem.
> > >
> > > They are 3 separate different problems. Related, but different.
> >
> > What are 3 different problems? I am asking since I remember only two:
> > one for CPU multistate and the other one for sysfs during rmmod.
>
> The third one is the race where sysfs attribute routines
> dereference the gendisk private_data.
First of all, thanks for keeping the discussion going, Luis.
That was the one I thought of as the race between sysfs and rmmod.
>
> > > If the idea then is to busy out rmmod if a sysfs attribute is being
> > > read, that could then mean rmmod can sometimes never complete. Hogging
> > > up / busying out sysfs attributes means the module cannto be removed.
> >
> > It's true but is it a big problem? There are many cases that system
> > just return error if it's busy and rely on the admin. IMHO, rmmod should
> > be part of them.
>
> It depends on existing userspace scripts which are used to test and
> expectations set. Consider existing tests, you would know better, and
> since you are the maintainer you decide.
>
> I at least know for many other types of device drivers an rmmod is
> a sledge hammer.
>
> You decide. I just thought it would be good to highlight the effect now
> rather than us considering it later.
To me, the rmmod failure is not a big problem for zram since it's
a common case in the system with -EBUSY. (Having said that, I agree it's
best if we could avoid the fail-and-retry; IOW, -EBUSY should be the
last resort unless we have a nicer way.)
>
> > > Which is why the *try_module_get()* I think is much more suitable, as
> > > it will always fails if we're already going down.
> >
> > How does the try_module_get solved the problem?
>
> The try stuff only resolves the deadlock. The bdget() / bdput() resolves
> the race to access the gendisk private_data.
That's the one I missed in this discussion. Now I am reading your [2/2]
in the original patch. I thought it was just that the zram instance was
destroyed by the sysfs race problem, so you had seen the deadlock. I might
be missing the point here, too.
Hmm, we are discussing several problems all at once. I feel it's time
to jump to v2 with your approach at this point. You said three different
problems. As I asked, please write them down in more detail, with the
code sequence, as we discussed in the other thread. If you mean a deadlock,
please write down which specific locks deadlocked.
It would make the discussion much easier. Let's discuss the issues
one by one, each in its own thread.
>
> > > > I see one of the problems is how I could make new zram object's
> > > > attribute group for zram knobs under /sys/block/zram0 since block
> > > > layer already made zram0 kobject via device_add_disk.
> > >
> > > Right.. well the syfs attribute races uncovered here actually do
> > > apply to any block driver as well. And which is why I was aiming
> > > for something generic if possible.
> >
> > It would be great but that's not the one we have atm so want to
> > proceed to fix anyway.
>
> What is not the one we have atm? I *do* have a proposed generic solution
> for 2/3 issues we have been disussing:
>
> a) deadlock on sysfs access
This is the one I didn't understand.
> b) gendisk private_data race
Yub.
>
> But so far Greg does not see enough justification for a), so we can either
> show how widespread this issue is (which I can do using coccinelle), or we
> just open code the try_module_get() / put on each driver that needs it
> for now. Either way it would resolve the issue.
If it's a general problem for drivers, I agree it's worth
addressing in the core, unless the driver introduced the mess. I have
no idea here since I didn't understand the problem yet.
>
> As for b), given that I think even you had missed my attempt to
> generalize the bdget/bdput solution for any attribute type (did you see
> my dev_type_get() and dev_type_put() proposed changes?), I don't think
> this problem is yet well defined in a generic way for us to rule it out.
> It is however easier to simply open code this per driver that needs it
> for now, given that I don't think Greg is yet convinced the deadlock is
> a widespread issue. I however am pretty sure both races *do*
> exist outside of zram in many places.
That would be a good sign for proposing a general solution.
>
> > > I am not sure if you missed the last hunks of the generic solution,
> > > but that would resolve the issue you noted. Here is the same approach
> > > but in a non-generic solution, specific to just one attribute so far
> > > and to zram:
> >
> > So idea is refcount of the block_device's inode
>
> Yes that itself prevents races against the gendisk private_data from
> being invalid. Why because a bdget() would not be successful after
> del_gendisk():
>
> del_gendisk() --> invalidate_partition() --> __invalidate_device() --> invalidate_inodes()
>
> > and module_exit path
> > checks also the inode refcount to make rmmod failure?
>
> They try_module_get() approach resolves the deadlock race, but it does
> so in a lazy way. I mean lazy in that then rmmod wins over sysfs knobs.
> So touching sysfs knobs won't make an rmmod fail. I think that's more
> typical expected behaviour. Why? Because I find it odd that looping
> forever touching sysfs attributes should prevent a module removal. But
> that's a personal preference.
I agree with you that that would be better, but let's see how cleanly
it can be done.
FYI, please look at hot_remove_store(), which can also remove a zram
instance on demand. I am looking forward to seeing your v2.
Thanks for your patience, Luis.
On Tue, Apr 06, 2021 at 06:23:46PM -0700, Minchan Kim wrote:
> On Tue, Apr 06, 2021 at 12:29:09AM +0000, Luis Chamberlain wrote:
> > On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote:
> > > On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote:
> > > > On Mon, Apr 05, 2021 at 10:07:24AM -0700, Minchan Kim wrote:
> > > > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > > > > > And come to think of it the last patch I had sent with a new
> > > > > > DECLARE_RWSEM(zram_unload) also has this same issue making most
> > > > > > sysfs attributes rather fragile.
> > > > >
> > > > > Thanks for looking the way. I agree the single zram_index_rwlock is
> > > > > not the right approach to fix it. However, I still hope we find more
> > > > > generic solution to fix them at once since I see it's zram instance
> > > > > racing problem.
> > > >
> > > > They are 3 separate different problems. Related, but different.
> > >
> > > What are 3 different problems? I am asking since I remember only two:
> > > one for CPU multistate and the other one for sysfs during rmmod.
> >
> > The third one is the race to use sysfs attributes and those routines
> > then derefernece th egendisk private_data.
>
> First of all, thanks for keeping discussion, Luis.
>
> That was the one I thought race between sysfs and during rmmod.
>
> >
> > > > If the idea then is to busy out rmmod if a sysfs attribute is being
> > > > read, that could then mean rmmod can sometimes never complete. Hogging
> > > > up / busying out sysfs attributes means the module cannto be removed.
> > >
> > > It's true but is it a big problem? There are many cases that system
> > > just return error if it's busy and rely on the admin. IMHO, rmmod should
> > > be part of them.
> >
> > It depends on existing userspace scripts which are used to test and
> > expectations set. Consider existing tests, you would know better, and
> > since you are the maintainer you decide.
> >
> > I at least know for many other types of device drivers an rmmod is
> > a sledge hammer.
> >
> > You decide. I just thought it would be good to highlight the effect now
> > rather than us considering it later.
>
> To me, the rmmod faillure is not a big problem for zram since it's
> common cases in the system with -EBUSY(Having said, I agree that's the
> best if we could avoid the fail-and-retrial. IOW, -EBUSY should be
> last resort unless we have nicer way.)
>
> >
> > > > Which is why the *try_module_get()* I think is much more suitable, as
> > > > it will always fails if we're already going down.
> > >
> > > How does the try_module_get solved the problem?
> >
> > The try stuff only resolves the deadlock. The bget() / bdput() resolves
> > the race to access to the gendisk private_data.
>
> That's the one I missed in this discussion. Now I am reading your [2/2]
> in original patch. I thought it was just zram instance was destroyed
> by sysfs race problem so you had seen the deadlock. I might miss the
> point here, too.
>
> Hmm, we are discussing several problems all at once. I feel it's time
> to jump v2 with your way in this point. You said three different
> problems. As I asked, please write it down with more detail with
> code sequence as we discussed other thread. If you mean a deadlock,
> please write what specific locks was deadlock with it.
> It would make discussion much easier. Let's discuss the issue
> one by one in each thread.
To clarify what I understood from the discussion until now:
1. zram shouldn't allow creating more zram instances during
rmmod (it causes the CPU multistate splat)
2. the private data of the gendisk shouldn't be destroyed while a zram
sysfs knob is working (it makes the system crash)
Thank you.
On Tue, Apr 06, 2021 at 10:54:23AM -0500, Josh Poimboeuf wrote:
> Same for Red Hat. Unloading livepatch modules seems to work fine, but
> isn't officially supported.
>
> That said, if rmmod is just considered a development aid, and we're
> going to be ignoring bugs, we should make it official with a new
> TAINT_RMMOD.
Another option would be to have live-patch modules leak a module
reference by default, except when some debug sysctl is set or something.
Then only those LP modules loaded while the sysctl is set to 'YOLO' can
be unloaded.
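Peter's suggestion could look roughly like the following non-buildable sketch; the livepatch_unload_allowed knob and its wiring are invented for illustration, only __module_get(THIS_MODULE) and klp_enable_patch() are real kernel APIs.

```c
/* Sketch only: a hypothetical knob; not an existing kernel interface. */
static int livepatch_unload_allowed;	/* would be set via a debug sysctl */

static int __init my_livepatch_init(void)
{
	/*
	 * Leak a reference to ourselves unless the knob was set at
	 * load time: the module refcount then never drops to zero,
	 * so rmmod refuses to unload this live patch.
	 */
	if (!livepatch_unload_allowed)
		__module_get(THIS_MODULE);

	return klp_enable_patch(&patch);
}
```

The design point: the decision to be unloadable is made once, at load time, so an admin cannot later force an unload of a patch that was loaded in the default (pinned) mode.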
On Tue, Apr 06, 2021 at 06:23:46PM -0700, Minchan Kim wrote:
> On Tue, Apr 06, 2021 at 12:29:09AM +0000, Luis Chamberlain wrote:
> > On Mon, Apr 05, 2021 at 12:58:05PM -0700, Minchan Kim wrote:
> > > On Mon, Apr 05, 2021 at 07:00:23PM +0000, Luis Chamberlain wrote:
> > > > Which is why the *try_module_get()* I think is much more suitable, as
> > > > it will always fails if we're already going down.
> > >
> > > How does the try_module_get solved the problem?
> >
> > The try stuff only resolves the deadlock. The bget() / bdput() resolves
> > the race to access to the gendisk private_data.
>
> That's the one I missed in this discussion. Now I am reading your [2/2]
> in original patch. I thought it was just zram instance was destroyed
> by sysfs race problem so you had seen the deadlock.
Patch [2/2] indeed dealt with a zram instance race, but the issue was not
a crash due to indirection, it was a deadlock. The deadlock
happens because a shared mutex is used both in sysfs and on __exit.
I'll expand on the example as you request so that both issues are
clearly understood.
The zram race you spotted, which could lead to a dereference and crash, is
a separate one, which the bdget() / bdput() on sysfs knobs resolves. That
race happens because zram's sysfs knobs currently don't prevent
del_gendisk() from completing, and so a dereference can happen.
> Hmm, we are discussing several problems all at once. I feel it's time
> to jump v2 with your way in this point. You said three different
> problems. As I asked, please write it down with more detail with
> code sequence as we discussed other thread. If you mean a deadlock,
> please write what specific locks was deadlock with it.
> It would make discussion much easier. Let's discuss the issue
> one by one in each thread.
Sure. Will post a v2.
> > But so far Greg does not see enough justification for a), so we can either
> > show how wider this issue is (which I can do using coccinelle), or we
> > just open code the try_module_get() / put on each driver that needs it
> > for now. Either way it would resolve the issue.
>
> I second if it's general problem for drivers, I agree it's worth to
> addresss in the core unless the driver introduce the mess. I have
> no idea here since I didn't understand the problem, yet.
I suspect these issues are generic, however hard to reproduce; it
just means busying out sysfs knobs while doing module removal can likely
cause crashes in some kernel drivers.
Since it seems the position is that module removal is best effort,
then if crashes on module removal are important to module maintainers,
such races are best addressed on the driver side, not in the core.
> > As for b), given that I think even you had missed my attempt to
> > generialzie the bdget/bdput solution for any attribute type (did you see
> > my dev_type_get() and dev_type_put() proposed changes?), I don't think
> > this problem is yet well defined in a generic way for us to rule it out.
> > It is however easier to simply open code this per driver that needs it
> > for now given that I don't think Greg is yet convinced the deadlock is
> > yet a widespread issue. I however am pretty sure both races races *do*
> > exist outside of zram in many places.
>
> It would be good sign to propose general solution.
Same situation here, as the race is with module removal.
Even in the case where module removal is possible today and likely
"supported" -- on livepatching, it seems we're slowly shying away from
that situation, and will likely expose a knob to ensure that removing a
livepatch (and doing module removal) requires a knob to be flipped
to say "Yes, I know what I am doing".
> > They try_module_get() approach resolves the deadlock race, but it does
> > so in a lazy way. I mean lazy in that then rmmod wins over sysfs knobs.
> > So touching sysfs knobs won't make an rmmod fail. I think that's more
> > typical expected behaviour. Why? Because I find it odd that looping
> > forever touching sysfs attributes should prevent a module removal. But
> > that's a personal preference.
>
> I agree with you that would be better but let's see how it could go
> clean.
OK great.
> FYI, please look at hot_remove_store, which can also remove a zram
> instance on demand. I am looking forward to seeing your v2.
Sure, I'll address all sysfs knobs in v2.
Luis
On Wed, Apr 07, 2021 at 04:09:44PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 06, 2021 at 10:54:23AM -0500, Josh Poimboeuf wrote:
>
> > Same for Red Hat. Unloading livepatch modules seems to work fine, but
> > isn't officially supported.
> >
> > That said, if rmmod is just considered a development aid, and we're
> > going to be ignoring bugs, we should make it official with a new
> > TAINT_RMMOD.
>
> Another option would be to have live-patch modules leak a module
> reference by default, except when some debug sysctl is set or something.
> Then only those LP modules loaded while the sysctl is set to 'YOLO' can
> be unloaded.
The issue is broader than just live patching.
My suggestion was that if we aren't going to fix bugs in kernel module
unloading, then unloading modules shouldn't be supported, and should
taint the kernel.
--
Josh
On Wed, Apr 07, 2021 at 10:30:31AM -0500, Josh Poimboeuf wrote:
> On Wed, Apr 07, 2021 at 04:09:44PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 06, 2021 at 10:54:23AM -0500, Josh Poimboeuf wrote:
> >
> > > Same for Red Hat. Unloading livepatch modules seems to work fine, but
> > > isn't officially supported.
> > >
> > > That said, if rmmod is just considered a development aid, and we're
> > > going to be ignoring bugs, we should make it official with a new
> > > TAINT_RMMOD.
> >
> > Another option would be to have live-patch modules leak a module
> > reference by default, except when some debug sysctl is set or something.
> > Then only those LP modules loaded while the sysctl is set to 'YOLO' can
> > be unloaded.
>
> The issue is broader than just live patching.
>
> My suggestion was that if we aren't going to fix bugs in kernel module
> unloading, then unloading modules shouldn't be supported, and should
> taint the kernel.
Hold up, what? However much I dislike modules (and that is lots), if you
don't want to support rmmod, you have to leak a reference to self in
init. Barring that you get to fix any and all unload bugs.
On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > As for the sysfs deadlock possible with drivers, this fixes it in a generic way:
> >
> > commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> > Author: Luis Chamberlain <[email protected]>
> > Date: Sat Mar 27 14:58:15 2021 +0000
> >
> > sysfs: add optional module_owner to attribute
> >
> > This is needed as otherwise the owner of the attribute
> > or group read/store might have a shared lock used on driver removal,
> > and deadlock if we race with driver removal.
> >
> > Signed-off-by: Luis Chamberlain <[email protected]>
>
> No, please no. Module removal is a "best effort", if the system dies
> when it happens, that's on you. I am not willing to expend extra energy
> and maintenance of core things like sysfs for stuff like this that does
> not matter in any system other than a developer's box.
So I mentioned this on IRC, and some folks were surprised to hear that
module unloading is unsupported and is just a development aid.
Is this stance documented anywhere?
If we really believe this to be true, we should make rmmod taint the
kernel.
--
Josh
On Tue, Apr 06, 2021 at 06:38:24PM -0700, Minchan Kim wrote:
> To clarify what I understood from the discussion until now:
>
> 1. zram shouldn't allow creating more zram instances during
> rmmod (it causes the CPU multistate splat)
Right!
> 2. the private data of the gendisk shouldn't be destroyed while a zram
> sysfs knob is working (it makes the system crash)
Yup, this is resolved with the bdgrab / bdput on each sysfs knob.
And the last issue is:
3) which patch 2/2 addressed: if a mutex is shared between sysfs
knobs and module removal, you must use try_module_get() to prevent
the deadlock.
Luis
Greg,
On Fri, Apr 02 2021 at 09:54, Greg KH wrote:
> On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
>> As for the sysfs deadlock possible with drivers, this fixes it in a generic way:
>>
>> commit fac43d8025727a74f80a183cc5eb74ed902a5d14
>> Author: Luis Chamberlain <[email protected]>
>> Date: Sat Mar 27 14:58:15 2021 +0000
>>
>> sysfs: add optional module_owner to attribute
>>
>> This is needed as otherwise the owner of the attribute
>> or group read/store might have a shared lock used on driver removal,
>> and deadlock if we race with driver removal.
>>
>> Signed-off-by: Luis Chamberlain <[email protected]>
>
> No, please no. Module removal is a "best effort", if the system dies
> when it happens, that's on you. I am not willing to expend extra energy
> and maintenance of core things like sysfs for stuff like this that does
> not matter in any system other than a developer's box.
>
> Lock data, not code please. Trying to tie data structure's lifespans
> to the lifespan of code is a tangled mess, and one that I do not want to
> add to in any form.
>
> sorry,
Sorry, but you are fundamentally off track here. This has absolutely
nothing to do with module removal.
The point is that module removal is the reverse operation of module
insertion. So far so good.
But module insertion can fail. So if you have nested functionalities
which hang off or are enabled by module insertion, then any failure in
that sequence has to be able to roll back and clean up properly no
matter what.
Which in turn makes module removal the reverse operation of module
insertion.
If you think otherwise, then please provide a proper plan for how nested
operations like sysfs -- not to mention more complex things like
multi-instance discovery, which can happen inside a module insertion
sequence -- can be properly rolled back.
Just declaring that rmmod is evil does not cut it. rmmod is the least of
the problems. If that fails, then a lot of rollback, failure handling
mechanisms are missing in the setup path already.
Anything which cannot cleanly roll back, no matter whether the failure
or rollback request happens at insertion time or later, is broken by
design.
So either you declare module removal as dysfunctional or you stop making
up semantically ill-defined and therefore useless claims about it.
Your argument in:
https://lore.kernel.org/linux-block/[email protected]/
"Lock data, not code please. Trying to tie data structure's lifespans
to the lifespan of code is a tangled mess, and one that I do not want to
add to in any form"
is just a useless blurb, because the fundamental purpose of discovery
code is to create the data structures which are tied to the code
associated with it.
Please stop this 'module removal is not supported' nonsense unless you
can prove complete independence of module init/discovery code from the
subsequently discovered entities depending on it.
Thanks,
tglx
On Thu, Apr 08, 2021 at 03:37:53AM +0200, Thomas Gleixner wrote:
> Greg,
>
> On Fri, Apr 02 2021 at 09:54, Greg KH wrote:
> > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> >> As for the sysfs deadlock possible with drivers, this fixes it in a generic way:
> >>
> >> commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> >> Author: Luis Chamberlain <[email protected]>
> >> Date: Sat Mar 27 14:58:15 2021 +0000
> >>
> >> sysfs: add optional module_owner to attribute
> >>
> >> This is needed as otherwise the owner of the attribute
> >> or group read/store might have a shared lock used on driver removal,
> >> and deadlock if we race with driver removal.
> >>
> >> Signed-off-by: Luis Chamberlain <[email protected]>
> >
> > No, please no. Module removal is a "best effort", if the system dies
> > when it happens, that's on you. I am not willing to expend extra energy
> > and maintenance of core things like sysfs for stuff like this that does
> > not matter in any system other than a developer's box.
> >
> > Lock data, not code please. Trying to tie data structure's lifespans
> > to the lifespan of code is a tangled mess, and one that I do not want to
> > add to in any form.
> >
> > sorry,
>
> Sorry, but you are fundamentally off track here. This has absolutely
> nothing to do with module removal.
>
> The point is that module removal is the reverse operation of module
> insertion. So far so good.
>
> But module insertion can fail. So if you have nested functionalities
> which hang off or are enabled by module insertion, then any failure in
> that sequence has to be able to roll back and clean up properly no
> matter what.
>
> Which in turn makes module removal the reverse operation of module
> insertion.
>
> If you think otherwise, then please provide a proper plan for how nested
> operations like sysfs -- not to mention more complex things like
> multi-instance discovery, which can happen inside a module insertion
> sequence -- can be properly rolled back.
>
> Just declaring that rmmod is evil does not cut it. rmmod is the least of
> the problems. If that fails, then a lot of rollback, failure handling
> mechanisms are missing in the setup path already.
>
> Anything which cannot cleanly roll back, no matter whether the failure
> or rollback request happens at insertion time or later, is broken by
> design.
>
> So either you declare module removal as dysfunctional or you stop making
> up semantically ill-defined and therefore useless claims about it.
>
> Your argument in:
>
> https://lore.kernel.org/linux-block/[email protected]/
>
> "Lock data, not code please. Trying to tie data structure's lifespans
> to the lifespan of code is a tangled mess, and one that I do not want to
> add to in any form"
>
> is just a useless blurb, because the fundamental purpose of discovery
> code is to create the data structures which are tied to the code
> associated with it.
>
> Please stop this 'module removal is not supported' nonsense unless you
> can prove complete independence of module init/discovery code from the
> subsequently discovered entities depending on it.
Ok, but to be fair, trying to add the crazy hacks that were being
proposed to sysfs for something that is obviously not a code path that
can be taken by a normal user or operation is just not going to fly.
Removing a module from a system has always been a "let's try it and
see!" type of operation for a very long time. While we try our best,
doing horrible hacks for this rare type of thing is generally not
considered a good idea, which is why I said that.
thanks,
greg k-h
On Wed, Apr 07, 2021 at 03:17:46PM -0500, Josh Poimboeuf wrote:
> On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > > As for the sysfs deadlock possible with drivers, this fixes it in a generic way:
> > >
> > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> > > Author: Luis Chamberlain <[email protected]>
> > > Date: Sat Mar 27 14:58:15 2021 +0000
> > >
> > > sysfs: add optional module_owner to attribute
> > >
> > > This is needed as otherwise the owner of the attribute
> > > or group read/store might have a shared lock used on driver removal,
> > > and deadlock if we race with driver removal.
> > >
> > > Signed-off-by: Luis Chamberlain <[email protected]>
> >
> > No, please no. Module removal is a "best effort", if the system dies
> > when it happens, that's on you. I am not willing to expend extra energy
> > and maintenance of core things like sysfs for stuff like this that does
> > not matter in any system other than a developer's box.
>
> So I mentioned this on IRC, and some folks were surprised to hear that
> module unloading is unsupported and is just a development aid.
>
> Is this stance documented anywhere?
>
> If we really believe this to be true, we should make rmmod taint the
> kernel.
My throw-away comment here seems to have gotten way more attention than
it deserved, sorry about that everyone.
Nothing is supported for anything here, it's all "best effort" :)
And I would love a taint for rmmod, but what is that going to help out
with?
thanks,
greg k-h
On Thu, 8 Apr 2021, Greg KH wrote:
> Removing a module from a system has always been a "let's try it and
> see!" type of operation for a very long time.
Which part of it?
If there is driver/subsystem code that can't handle the reverse
operation to modprobe, it clearly can't handle error handling during
modprobe (which, one would hope, is supported), and should be fixed.
If there is a particular issue in the kernel dynamic linker that causes
a crash on module removal, we'd be better off fixing it. Is there one
such issue that makes you claim module removal is unsupported?
Thanks,
--
Jiri Kosina
SUSE Labs
On Thu, Apr 08, 2021 at 10:01:23AM +0200, Jiri Kosina wrote:
> On Thu, 8 Apr 2021, Greg KH wrote:
>
> > Removing a module from a system has always been a "let's try it and
> > see!" type of operation for a very long time.
>
> Which part of it?
>
> If there is driver/subsystem code that can't handle the reverse
> operation to modprobe, it clearly can't handle error handling during
> modprobe (which, one would hope, is supported), and should be fixed.
Huh? No, that's not the issue here, it's the issue of different
userspace code paths into the module at the same time that it is trying
to be unloaded. That has nothing to do with loading the module the
first time as userspace is not touching those apis yet.
> If there is a particular issue in the kernel dynamic linker that causes
> a crash on module removal, we'd be better off fixing it. Is there one
> such issue that makes you claim module removal is unsupported?
The linker has nothing to do with this, it's userspace tasks touching
code paths.
thanks,
greg k-h
On Thu, 8 Apr 2021, Greg KH wrote:
> > If there is driver/subsystem code that can't handle the reverse
> > operation to modprobe, it clearly can't handle error handling during
> > modprobe (which, one would hope, is supported), and should be fixed.
>
> Huh? No, that's not the issue here, it's the issue of different
> userspace code paths into the module at the same time that it is trying
> to be unloaded. That has nothing to do with loading the module the
> first time as userspace is not touching those apis yet.
So do you claim that once the first (out of possibly many)
userspace-visible sysfs entry has been created during module insertion and
made available to userspace, there is never going to be rollback happening
that'd be removing that first sysfs entry again?
Thanks,
--
Jiri Kosina
SUSE Labs
On Thu, Apr 08, 2021 at 10:35:17AM +0200, Jiri Kosina wrote:
> On Thu, 8 Apr 2021, Greg KH wrote:
>
> > > If there is driver/subsystem code that can't handle the reverse
> > > operation to modprobe, it clearly can't handle error handling during
> > > modprobe (which, one would hope, is supported), and should be fixed.
> >
> > Huh? No, that's not the issue here, it's the issue of different
> > userspace code paths into the module at the same time that it is trying
> > to be unloaded. That has nothing to do with loading the module the
> > first time as userspace is not touching those apis yet.
>
> So do you claim that once the first (out of possibly many)
> userspace-visible sysfs entry has been created during module insertion and
> made available to userspace, there is never going to be rollback happening
> that'd be removing that first sysfs entry again?
{sigh}
I'm not trying to argue that, no.
What I am arguing is that the complexity of the original patch was
not worth the low probability of this actually being an issue hit in
real-life operations.
That's all, messing around with sysfs entries and module reference
counts is tricky and complex and a total mess. We have a separation
between normal sysfs files and devices being removed that should handle
the normal operations but there are still some crazy corner cases, of
which this seems to be one.
Module removal is not a "normal" operation that can be triggered by a
system automatically without a user asking for it. As someone reminded
me on IRC, we used to do this "automatically" for many problematic
drivers years ago for suspend/resume, that should all now be long fixed
up.
So to add crazy complexity to the kernel, for an operation that can only
be triggered manually by a root user, is not worth it in my opinion, as
the maintainer of the code that the complexity was asked to be added to.
My throw-away comment of "module unloading is not supported" was an
attempt to summarize all of the above into one single sentence that
seems to have struck a nerve with a lot of people, and I apologize for
that :(
thanks,
greg k-h
On Thu, 8 Apr 2021 08:18:21 +0200
Greg KH <[email protected]> wrote:
> And I would love a taint for rmmod, but what is that going to help out
> with?
Just like any other taint. If a rmmod can cause the system to lose
integrity, the rmmod could cause a subtle issue that manifests itself as
something more serious and may look unrelated. If you have a bug report
with the rmmod taint, one could ask to try to recreate the bug without
doing rmmod. Or perhaps we have several similar bug reports that all
show the rmmod taint. That would give us an impression that something
was removed and caused the system to lose stability.
-- Steve
On Thu, Apr 08, 2021 at 08:18:21AM +0200, Greg KH wrote:
> On Wed, Apr 07, 2021 at 03:17:46PM -0500, Josh Poimboeuf wrote:
> > On Fri, Apr 02, 2021 at 09:54:12AM +0200, Greg KH wrote:
> > > On Thu, Apr 01, 2021 at 11:59:25PM +0000, Luis Chamberlain wrote:
> > > > As for the sysfs deadlock possible with drivers, this fixes it in a generic way:
> > > >
> > > > commit fac43d8025727a74f80a183cc5eb74ed902a5d14
> > > > Author: Luis Chamberlain <[email protected]>
> > > > Date: Sat Mar 27 14:58:15 2021 +0000
> > > >
> > > > sysfs: add optional module_owner to attribute
> > > >
> > > > This is needed as otherwise the owner of the attribute
> > > > or group read/store might have a shared lock used on driver removal,
> > > > and deadlock if we race with driver removal.
> > > >
> > > > Signed-off-by: Luis Chamberlain <[email protected]>
> > >
> > > No, please no. Module removal is a "best effort", if the system dies
> > > when it happens, that's on you. I am not willing to expend extra energy
> > > and maintenance of core things like sysfs for stuff like this that does
> > > not matter in any system other than a developer's box.
> >
> > So I mentioned this on IRC, and some folks were surprised to hear that
> > module unloading is unsupported and is just a development aid.
> >
> > Is this stance documented anywhere?
> >
> > If we really believe this to be true, we should make rmmod taint the
> > kernel.
>
> My throw-away comment here seems to have gotten way more attention than
> it deserved, sorry about that everyone.
>
> Nothing is supported for anything here, it's all "best effort" :)
>
> And I would love a taint for rmmod, but what is that going to help out
> with?
I'm having trouble following this conclusion. Sure, we give our best
effort, but if "nothing is supported" then what are we even doing here?
Either we aim to support module unloading, or we don't.
If we do support it, "best effort" isn't a valid reason for nacking
fixes.
If we don't support it, but still want to keep it half-working for
development purposes, tainting on rmmod will make it crystal clear that
the system has been destabilized.
--
Josh
On Thu, Apr 08, 2021 at 10:55:53AM +0200, Greg KH wrote:
> So to add crazy complexity to the kernel,
I agree that this can be tricky. However, driver developers are going to
open code this either way. The problem with that, and one of my own
reasons for striving for at least *trying* for a generic solution, is
that driver developers may reach for a busy-wait solution rather than
the try method. The busy approach means you could end up in a situation
where userspace can prevent the full undoing / removal of a driver. The
try method is best effort in the sense that if the driver is going away,
the sysfs operation simply won't be allowed.
So a sensible consideration I think is we at least choose one of these:
a) We allow driver developers to open code it on drivers which need it on
each and every single sysfs knob on the driver where its needed, and
accept folks might do it wrong
b) Provide a macro which is opt-in, ie, not doing it for all
attributes, but one which the driver author *is* aware of, to
try / put the module around the attribute method.
c) Document this race and other races / failings so that driver
authors who do care about module removal are aware and know what
to do.
In this thread two races were exposed on sysfs:
* deadlock when a sysfs attribute uses a lock which is also used on
module __exit
* possible race against the device's private data, and this is type
specific. I think many people probably missed the last hunks of my
proposed patch which added dev_type_get() which were not part of the
deadlock fix. At least I showed how attributes for all block devices
have an exposed race, which can crash if removal of a block device
with del_gendisk() happens while a sysfs attribute is about to be
used.
I don't think either race is trivial for a driver developer to devise a
solution for. Most focus on this thread was on the first race; the
second however is also very real, and can likely cause more issues on
rollbacks from error paths unrelated to rmmod...
> for an operation that can only
> be triggered manually by a root user, is not worth it in my opinion, as
> the maintainer of that code the complexity was asked to be made to.
Three things:
1) Many driver maintainers *do* care that rmmod works well. To the point
that if it does not, they feel ashamed. The reason for this is that in
some subsystems this *is* a typical test case. So although rmmod may be
a voodoo thing for the core, many driver developers do want this to work
well.
As someone who also works on many test cases to expose odd issues in the
kernel unrelated to module removal, I can also say that module removal
exposes other possible races which would otherwise be difficult to
trigger. So it is also a helpful aid for a developer working on
different areas of the kernel.
2) Folks have pointed out that this is not just about rmmod, rolling
back removal of sysfs attributes due to error code paths is another
real scenario to consider. I don't think we have rules to tell userspace
to not muck with sysfs files after they are exposed. In fact those
uevents we send to userspace allow userspace to know exactly when to
muck with them.
3) Sadly, some sysfs attributes also purposely do things which *also*
mimic what is typically done on module removal, such as removal of an
interface, or a block device. That was the case with zram: it allows
removal / reset of a device... Yes these are odd things to do, but we
allow for it. And sysfs users *do* need to be aware of these corner
cases if they want to support them.
There **may** be some severity to some of the issues mentioned above,
allowing really bad things to be done from userspace even without module
removal... but I haven't yet had time to express the pattern with
coccinelle to see how commonplace some of these issues are. I was
focusing at first on a generic solution if possible, as I thought that
would be better evaluated first, and would let others slowly become
aware of the issue.
Luis
On Thu, Apr 08, 2021 at 10:55:53AM +0200, Greg KH wrote:
> Module removal is not a "normal" operation that can be triggered by a
> system automatically without a user asking for it. As someone reminded
> me on IRC, we used to do this "automatically" for many problematic
> drivers years ago for suspend/resume, that should all now be long fixed
> up.
I know what you're trying to say here, but I think you're just completely
wrong. :P
We absolutely must handle module unloading, and it is expected to work
correctly. _The reason it doesn't work correctly sometimes is because it
is a less common operation_. But that doesn't make it any less
important. Disk failures are rare too, but RAID handles them. If there
are bugs here it is due to _our lack of testing_.
Module unload needs to work for developers loading/unloading while they
work on a module[1], it needs to work for sysadmins who want to unload
a now-unused network protocol[2], it needs to work for people trying to
reset device hardware state[3], it needs to work for unloading unused
modules after root device, partition, and filesystem discovery in an
initramfs[4], and every other case.
The kernel module subsystem is not "best effort" and removal is not
"not normal". If unloading a module destabilizes the kernel, then there
is a bug that needs fixing. I'm not advocating for any path to fixing
the technical issues in this thread, but we absolutely cannot suddenly
claim that it's the user's fault that their system dies if they use
"rmmod". That's outrageous. O_o
-Kees
[1] Yes, I'm linking to the book you wrote ;)
https://www.oreilly.com/library/view/linux-device-drivers/0596005903/ch02.html
[2] https://www.cs.unh.edu/cnrg/people/gherrin/linux-net.html#tth_sEc11.2.2
[3] https://opensource.com/article/18/5/how-load-or-unload-linux-kernel-module
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sec-unloading_a_module
[4] https://git.busybox.net/busybox/tree/modutils/rmmod.c
--
Kees Cook