This is a follow-up to my v7 series of fixes for the zram driver [0],
which ended up uncovering a generic deadlock issue with sysfs and module
removal. I first reported this issue and proposed patches for it in
March 2021 [1]. At the end of this email you will find an itemized list
of changes since that v1 series. You can also find these changes on my
branch 20210927-sysfs-generic-deadlock-fix [4], which is based on
linux-next tag next-20210927.
Just a heads up, I'm going on vacation in two days and won't be back until
Monday, October 11th.
On this v8 I incorporate feedback from the v7 series, namely:
- Tejun requested I move the struct module to the last attribute when
extending functions
- As per discussion with Tejun, trimmed and clarified the commit log
and documentation on the generic fix on patch 7
- As requested by Bart Van Assche, I simplified the setting of the
struct test_config *config into one line instead of two in many
places on patch 3, which adds the new sysfs selftest
- Dan Williams had some questions about patch 7, so I clarified these
using a more elaborate example in the commit log to show
where the lock call was happening.
- Trimmed the Cc list considerably as it was way too long before
- Rebased onto linux-next tag next-20210927
Below is a list of changes to this patch set since its inception:
On v1:
- Open coded the sysfs deadlock race to only be localized by the zram
driver
Changes on v2:
- used bdgrab() as well for another race which was speculated by
Minchan
- improved documentation of fixes
Changes on v3:
- used localized zram macros for the sysfs attributes instead of
open coding on each routine
- replaced the bdget() usage with a generic get_device() and bus_get() on
dev_attr_show() / dev_attr_store() for the issue speculated by
Minchan
Changes on v4:
- Cosmetic fixes on the zram fixes as requested by Greg
- Split out the driver core fix as requested by Greg for the
issue speculated by Minchan. This fix ended up getting up to its 4th
patch iteration [2] and eventually hit linux-next. We got a 0day
suspend stress failure report for this patch [3]
Changes on v5:
- I ended up writing a test_sysfs driver and with it I ended up
proving that the issue speculated by Minchan was not possible, so
I asked Greg to drop the patch from his queue titled
"sysfs: fix kobject refcount to address races with kobject removal"
- checkpatch fixes for the zram changes
Changes on v6:
- I submitted my test_sysfs driver for inclusion upstream which easily
abstracted the deadlock issue in a driver generically [4]
- I rebased the zram fixes and also added a new patch for zram to use
ATTRIBUTE_GROUPS. As per Minchan, I sent the patches to be merged
through Andrew Morton.
- Greg ended up NACK'ing the patchset because he was still not sure the
fix was correct
Changes on v7:
- Formalizes the original proposed generic sysfs fix instead of using
macro helpers to work around the issue
- I decided it is best to merge all the effort together into
one patch set because communication was being lost when I split the
patches up. This was not helping in any way to either fix the zram
issues or come to consensus on a generic solution. The patches are
also merged now because they are all related now.
- Running checkpatch exposed that S_IRWXUGO and S_IRWXU|S_IRUGO|S_IXUGO
should be replaced, so I did that in this series in two new patches
- Adds a try_module_get() documentation extension with tribal
knowledge and new information which I suspect some folks still do not
believe. The new test_sysfs selftest however proves this information to
be correct, and the same selftest can be used to try to prove that
documentation incorrect
- Because the fix is now generic, zram's deadlock can easily be fixed
by just making it use ATTRIBUTE_GROUPS().
[0] https://lkml.kernel.org/r/[email protected]
[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]
[3] https://lkml.kernel.org/r/20210701022737.GC21279@xsang-OptiPlex-9020
[4] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
Luis Chamberlain (12):
LICENSES: Add the copyleft-next-0.3.1 license
testing: use the copyleft-next-0.3.1 SPDX tag
selftests: add tests_sysfs module
kernfs: add initial failure injection support
test_sysfs: add support to use kernfs failure injection
kernel/module: add documentation for try_module_get()
fs/kernfs/symlink.c: replace S_IRWXUGO with 0777 on
kernfs_create_link()
fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755
sysfs_create_dir_ns()
sysfs: fix deadlock race with module removal
test_sysfs: enable deadlock tests by default
zram: fix crashes with cpu hotplug multistate
zram: use ATTRIBUTE_GROUPS to fix sysfs deadlock module removal
.../fault-injection/fault-injection.rst | 22 +
LICENSES/dual/copyleft-next-0.3.1 | 237 +++
MAINTAINERS | 9 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
drivers/block/zram/zram_drv.c | 74 +-
fs/kernfs/Makefile | 1 +
fs/kernfs/dir.c | 44 +-
fs/kernfs/failure-injection.c | 91 ++
fs/kernfs/file.c | 19 +-
fs/kernfs/kernfs-internal.h | 75 +-
fs/kernfs/symlink.c | 4 +-
fs/sysfs/dir.c | 5 +-
fs/sysfs/file.c | 6 +-
fs/sysfs/group.c | 3 +-
include/linux/kernfs.h | 19 +-
include/linux/module.h | 34 +-
include/linux/sysfs.h | 52 +-
kernel/cgroup/cgroup.c | 2 +-
lib/Kconfig.debug | 25 +
lib/Makefile | 1 +
lib/test_kmod.c | 12 +-
lib/test_sysctl.c | 12 +-
lib/test_sysfs.c | 952 ++++++++++++
tools/testing/selftests/kmod/kmod.sh | 13 +-
tools/testing/selftests/sysctl/sysctl.sh | 12 +-
tools/testing/selftests/sysfs/Makefile | 12 +
tools/testing/selftests/sysfs/config | 5 +
tools/testing/selftests/sysfs/sysfs.sh | 1383 +++++++++++++++++
28 files changed, 3026 insertions(+), 102 deletions(-)
create mode 100644 LICENSES/dual/copyleft-next-0.3.1
create mode 100644 fs/kernfs/failure-injection.c
create mode 100644 lib/test_sysfs.c
create mode 100644 tools/testing/selftests/sysfs/Makefile
create mode 100644 tools/testing/selftests/sysfs/config
create mode 100755 tools/testing/selftests/sysfs/sysfs.sh
--
2.30.2
Provide a simple state machine to fix races with driver exit where we
remove the CPU multistate callbacks and re-initialization / creation of
new per CPU instances which should be managed by these callbacks.
The zram driver makes use of cpu hotplug multistate support, whereby it
associates a struct zcomp per CPU. Each struct zcomp represents a
compression algorithm in charge of managing compression streams per
CPU. Although a compiled zram driver only supports a fixed set of
compression algorithms, each zram device gets a struct zcomp allocated
per CPU. The "multi" in CPU hotplug multistate refers to these per
cpu struct zcomp instances. Each of these will have the CPU hotplug
callback called for it on CPU plug / unplug. The kernel's CPU hotplug
multistate keeps a linked list of these different structures so that
it will iterate over them on CPU transitions.
By default at driver initialization we will create just one zram device
(num_devices=1) and a zcomp structure is then set up for the now default
lzo-rle compression algorithm. At driver removal we first remove each
zram device, and so we destroy the associated struct zcomp per CPU. But
since we expose sysfs attributes to create new devices or reset /
initialize existing zram devices, we can easily end up re-initializing
a struct zcomp for a zram device before the exit routine of the module
removes the cpu hotplug callback. When this happens the kernel's CPU
hotplug will detect that at least one instance (struct zcomp for us)
exists. This can happen in the following situation:
CPU 1                            CPU 2
                                 disksize_store(...);
class_unregister(...);
idr_for_each(...);
zram_debugfs_destroy();
idr_destroy(...);
unregister_blkdev(...);
cpuhp_remove_multi_state(...);
The warning comes up on cpuhp_remove_multi_state() when it sees that the
state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
In this case a struct zcomp still exists: the driver allowed its
creation per CPU even though we could have just freed them per CPU
through a call on another CPU, and we are then later trying to remove the
hotplug callback.
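The interleaving above can be replayed with a small user-space model. This is only a sketch: struct multi_state, struct instance and the state_*() helpers are hypothetical names made up for illustration, not the kernel's cpuhp API. It just shows why removing a state whose instance list is not empty is reported as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct instance {
	struct instance *next;
};

struct multi_state {
	struct instance *head;	/* instances registered for this state */
};

static void state_add_instance(struct multi_state *st, struct instance *inst)
{
	assert(inst != NULL);
	inst->next = st->head;
	st->head = inst;
}

static void state_remove_instance(struct multi_state *st, struct instance *inst)
{
	struct instance **pp;

	for (pp = &st->head; *pp; pp = &(*pp)->next) {
		if (*pp == inst) {
			*pp = inst->next;
			inst->next = NULL;
			return;
		}
	}
}

/* Returns 0 on a clean removal, -1 if instances are still linked. */
static int state_remove(struct multi_state *st)
{
	if (st->head) {
		fprintf(stderr, "Error: removing state which has instances left\n");
		return -1;
	}
	return 0;
}

/* Replays the race: a re-init slips in before the state is removed. */
static int model_race(void)
{
	struct multi_state st = { NULL };
	struct instance zcomp = { NULL };

	state_add_instance(&st, &zcomp);	/* device initialized */
	state_remove_instance(&st, &zcomp);	/* exit path removes the device */
	state_add_instance(&st, &zcomp);	/* racing disksize_store() re-inits */
	return state_remove(&st);		/* instance is still linked */
}
```

In this model model_race() returns -1, which corresponds to the "has instances left" warning in the splat further down.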
Fix all this by providing a zram initialization boolean,
protected by the driver's shared zram_index_mutex, which we
can use to annotate when sysfs attributes are safe to use or
not -- once the driver is properly initialized. When the driver
is going down we also make sure to not let userspace muck with
attributes which may affect each per cpu struct zcomp.
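The shape of this guard can be sketched in user space as follows. This is a single-threaded model under stated assumptions: struct model_driver and the model_*() helpers are hypothetical, the "lock" is a plain flag standing in for zram_index_mutex, and the counter stands in for the per CPU struct zcomp instances. The real change is the diff below.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the guard. The "lock" is a plain flag so
 * the sketch stays single-threaded; the driver uses zram_index_mutex.
 */
struct model_driver {
	bool locked;	/* stands in for zram_index_mutex */
	bool up;	/* stands in for zram_up */
	int instances;	/* stands in for per CPU struct zcomp instances */
};

static void model_lock(struct model_driver *d)
{
	assert(!d->locked);
	d->locked = true;
}

static void model_unlock(struct model_driver *d)
{
	assert(d->locked);
	d->locked = false;
}

/* Same shape as disksize_store(): refuse work unless the driver is up. */
static int model_disksize_store(struct model_driver *d)
{
	int err = 0;

	model_lock(d);
	if (!d->up) {
		err = -ENODEV;
		goto out;
	}
	d->instances++;
out:
	model_unlock(d);
	return err;
}

/* Same shape as destroy_devices(): clear the flag before tearing down. */
static void model_destroy(struct model_driver *d)
{
	model_lock(d);
	d->up = false;
	d->instances = 0;
	model_unlock(d);
}
```

Once model_destroy() has run, a late model_disksize_store() call returns -ENODEV and creates no new instance, which is the property the exit path relies on.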
This also fixes a series of possible memory leaks. The
crashes and memory leaks can easily be caused by issuing
the zram02.sh script from the LTP project [0] in a loop
in two separate windows:
cd testcases/kernel/device-drivers/zram
while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
You end up with a splat as follows:
kernel: zram: Removed device: zram0
kernel: zram: Added device: zram0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: Adding 104857596k swap on /dev/zram0. <etc>
kernel: zram0: detected capacity change from 209715200 to 0
kernel: zram0: detected capacity change from 0 to 209715200
kernel: ------------[ cut here ]------------
kernel: Error: Removing state 63 which has instances left.
kernel: WARNING: CPU: 7 PID: 70457 at \
kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G \
E 5.12.0-rc1-next-20210304 #3
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
BIOS 1.14.0-2 04/01/2014
kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
kernel: Code: <etc>
kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffa800c139bcb8
kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: __cpuhp_remove_state+0x2e/0x80
kernel: __do_sys_delete_module+0x190/0x2a0
kernel: do_syscall_64+0x33/0x80
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
The "Error: Removing state 63 which has instances left" refers
to the zram per CPU struct zcomp instances left.
[0] https://github.com/linux-test-project/ltp.git
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
1 file changed, 55 insertions(+), 8 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f61910c65f0f..b26abcb955cc 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
+static bool zram_up;
+
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
/*
@@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
comp = zram->comp;
disksize = zram->disksize;
zram->disksize = 0;
+ zram->comp = NULL;
set_capacity_and_notify(zram->disk, 0);
part_stat_set_all(zram->disk->part0, 0);
@@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
struct zram *zram = dev_to_zram(dev);
int err;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ err = -ENODEV;
+ goto out;
+ }
+
disksize = memparse(buf, NULL);
- if (!disksize)
- return -EINVAL;
+ if (!disksize) {
+ err = -EINVAL;
+ goto out;
+ }
down_write(&zram->init_lock);
if (init_done(zram)) {
@@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
up_write(&zram->init_lock);
+ mutex_unlock(&zram_index_mutex);
+
return len;
out_free_meta:
zram_meta_free(zram, disksize);
out_unlock:
up_write(&zram->init_lock);
+out:
+ mutex_unlock(&zram_index_mutex);
return err;
}
@@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
if (ret)
return ret;
- if (!do_reset)
- return -EINVAL;
+ mutex_lock(&zram_index_mutex);
+
+ if (!zram_up) {
+ len = -ENODEV;
+ goto out;
+ }
+
+ if (!do_reset) {
+ len = -EINVAL;
+ goto out;
+ }
zram = dev_to_zram(dev);
bdev = zram->disk->part0;
@@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
/* Do not reset an active device or claimed device */
if (bdev->bd_openers || zram->claim) {
mutex_unlock(&bdev->bd_disk->open_mutex);
- return -EBUSY;
+ len = -EBUSY;
+ goto out;
}
/* From now on, anyone can't open /dev/zram[0-9] */
@@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
zram->claim = false;
mutex_unlock(&bdev->bd_disk->open_mutex);
+out:
+ mutex_unlock(&zram_index_mutex);
return len;
}
@@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
int ret;
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ mutex_unlock(&zram_index_mutex);
+ return -ENODEV;
+ }
ret = zram_add();
mutex_unlock(&zram_index_mutex);
@@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,
mutex_lock(&zram_index_mutex);
+ if (!zram_up) {
+ ret = -ENODEV;
+ goto out;
+ }
+
zram = idr_find(&zram_index_idr, dev_id);
if (zram) {
ret = zram_remove(zram);
@@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
ret = -ENODEV;
}
+out:
mutex_unlock(&zram_index_mutex);
return ret ? ret : count;
}
@@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
static void destroy_devices(void)
{
+ mutex_lock(&zram_index_mutex);
+ zram_up = false;
class_unregister(&zram_control_class);
idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
+ mutex_unlock(&zram_index_mutex);
}
static int __init zram_init(void)
@@ -2105,15 +2146,21 @@ static int __init zram_init(void)
return -EBUSY;
}
+ mutex_lock(&zram_index_mutex);
+
while (num_devices != 0) {
- mutex_lock(&zram_index_mutex);
ret = zram_add();
- mutex_unlock(&zram_index_mutex);
- if (ret < 0)
+ if (ret < 0) {
+ mutex_unlock(&zram_index_mutex);
goto out_error;
+ }
num_devices--;
}
+ zram_up = true;
+
+ mutex_unlock(&zram_index_mutex);
+
return 0;
out_error:
--
2.30.2
This extends test_sysfs with support for using the failure injection
wait completion and knobs to force a few race conditions which
demonstrates that kernfs active reference protection is sufficient
for kobject / device protection at higher layers.
This adds 4 new tests which try to remove the device attribute
store operation in 4 different situations:
1) at the start of kernfs_fop_write_iter()
2) before the of->mutex is held in kernfs_fop_write_iter()
3) after the of->mutex is held in kernfs_fop_write_iter()
4) after the kernfs node active reference is taken
A write fails in all cases except the last one, test #32. There
is a good explanation for this: *once* kernfs_get_active() gets called
we have a guarantee that the kernfs entry cannot be removed. If
kernfs_get_active() succeeds that entry cannot be removed and so
anything trying to remove that entry will have to wait. It is perhaps
not obvious, but a sysfs write will eventually trigger a
kernfs_get_active() call, and *only* if this succeeds will the sysfs
op be called. This, and the fact that you cannot remove the kernfs
entry while the kernfs entry is active, implies that a module that
created the respective sysfs / kernfs entry *cannot* possibly be
removed during a sysfs operation. Test #32 provides us with
proof of this. If it were not true, test #32 should crash.
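The guarantee described above can be illustrated with a tiny user-space model. This is a sketch only: struct model_node and the model_*() helpers are hypothetical and single-threaded, and they are not how kernfs implements its active reference. They only capture the two rules: taking an active reference fails once removal has started, and removal cannot complete while a reference is held.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model; not the kernfs implementation. */
struct model_node {
	int active;		/* number of in-flight operations */
	bool deactivated;	/* set once removal has started */
};

/* Like kernfs_get_active() in spirit: fails once removal has begun. */
static bool model_get_active(struct model_node *n)
{
	if (n->deactivated)
		return false;
	n->active++;
	return true;
}

static void model_put_active(struct model_node *n)
{
	assert(n->active > 0);
	n->active--;
}

/*
 * Begin removal. Returns true if removal can finish right away, false if
 * it has to wait for in-flight operations to drop their references.
 */
static bool model_remove(struct model_node *n)
{
	n->deactivated = true;
	return n->active == 0;
}

/* A write runs the store op only while holding an active reference. */
static int model_write(struct model_node *n, int *stored, int value)
{
	if (!model_get_active(n))
		return -ENODEV;	/* node is going away, the op never runs */
	*stored = value;
	model_put_active(n);
	return 0;
}
```

In this model a removal that starts while a reference is held reports that it must wait, and any write attempted after removal has started fails without ever touching the store op. That matches the split between tests 0029-0031 and test 0032.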
No null dereferences are reproduced, even though this has been observed
in some complex testing cases [0]. If this issue really exists we should
have enough tools in the test_sysfs toolbox now to try to reproduce
this easily without having to poke around other drivers. It is very likely
that the issue reported [0] was a side effect of
the first bug, which was zram specific. This is why it is important to
isolate the issue and try to reproduce it in a generic form using the
test_sysfs driver.
[0] https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luis Chamberlain <[email protected]>
---
lib/Kconfig.debug | 3 +
lib/test_sysfs.c | 31 +++++
tools/testing/selftests/sysfs/config | 3 +
tools/testing/selftests/sysfs/sysfs.sh | 175 +++++++++++++++++++++++++
4 files changed, 212 insertions(+)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a29b7d398c4e..176b822654e5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2358,6 +2358,9 @@ config TEST_SYSFS
depends on SYSFS
depends on NET
depends on BLOCK
+ select FAULT_INJECTION
+ select FAULT_INJECTION_DEBUG_FS
+ select FAIL_KERNFS_KNOBS
help
This builds the "test_sysfs" module. This driver enables to test the
sysfs file system safely without affecting production knobs which
diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
index 2043ca494af8..c6e62de61403 100644
--- a/lib/test_sysfs.c
+++ b/lib/test_sysfs.c
@@ -38,6 +38,11 @@
#include <linux/rtnetlink.h>
#include <linux/genhd.h>
#include <linux/blkdev.h>
+#include <linux/kernfs.h>
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+MODULE_IMPORT_NS(KERNFS_DEBUG_PRIVATE);
+#endif
static bool enable_lock;
module_param(enable_lock, bool_enable_only, 0644);
@@ -82,6 +87,13 @@ static bool enable_verbose_rmmod;
module_param(enable_verbose_rmmod, bool_enable_only, 0644);
MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+static bool enable_completion_on_rmmod;
+module_param(enable_completion_on_rmmod, bool_enable_only, 0644);
+MODULE_PARM_DESC(enable_completion_on_rmmod,
+ "enable sending a kernfs completion on rmmod");
+#endif
+
static int sysfs_test_major;
/**
@@ -285,6 +297,12 @@ static ssize_t config_show(struct device *dev,
"enable_verbose_writes:\t%s\n",
enable_verbose_writes ? "true" : "false");
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+ len += snprintf(buf+len, PAGE_SIZE - len,
+ "enable_completion_on_rmmod:\t%s\n",
+ enable_completion_on_rmmod ? "true" : "false");
+#endif
+
test_dev_config_unlock(test_dev);
return len;
@@ -904,10 +922,23 @@ static int __init test_sysfs_init(void)
}
module_init(test_sysfs_init);
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+/* The goal is to race our device removal with a pending kernfs -> store call */
+static void test_sysfs_kernfs_send_completion_rmmod(void)
+{
+ if (!enable_completion_on_rmmod)
+ return;
+ complete(&kernfs_debug_wait_completion);
+}
+#else
+static inline void test_sysfs_kernfs_send_completion_rmmod(void) {}
+#endif
+
static void __exit test_sysfs_exit(void)
{
if (enable_debugfs)
debugfs_remove(debugfs_dir);
+ test_sysfs_kernfs_send_completion_rmmod();
if (delay_rmmod_ms)
msleep(delay_rmmod_ms);
unregister_test_dev_sysfs(first_test_dev);
diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
index 9196f452ecd5..2876a229f95b 100644
--- a/tools/testing/selftests/sysfs/config
+++ b/tools/testing/selftests/sysfs/config
@@ -1,2 +1,5 @@
CONFIG_SYSFS=m
CONFIG_TEST_SYSFS=m
+CONFIG_FAULT_INJECTION=y
+CONFIG_FAULT_INJECTION_DEBUG_FS=y
+CONFIG_FAIL_KERNFS_KNOBS=y
diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
index b3f4c2236c7f..f928635d0e35 100755
--- a/tools/testing/selftests/sysfs/sysfs.sh
+++ b/tools/testing/selftests/sysfs/sysfs.sh
@@ -62,6 +62,10 @@ ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
+ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
+ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
+ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
+ALL_TESTS="$ALL_TESTS 0032:1:1:test_dev_x:block" # kernfs race removal after active
allow_user_defaults()
{
@@ -92,6 +96,9 @@ allow_user_defaults()
if [ -z $SYSFS_DEBUGFS_DIR ]; then
SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
fi
+ if [ -z $KERNFS_DEBUGFS_DIR ]; then
+ KERNFS_DEBUGFS_DIR="/sys/kernel/debug/kernfs"
+ fi
if [ -z $PAGE_SIZE ]; then
PAGE_SIZE=$(getconf PAGESIZE)
fi
@@ -167,6 +174,14 @@ modprobe_reset_enable_rtnl_lock_on_rmmod()
unset FIRST_MODPROBE_ARGS
}
+modprobe_reset_enable_completion()
+{
+ FIRST_MODPROBE_ARGS="enable_completion_on_rmmod=1 enable_verbose_writes=1"
+ FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_rmmod=1 delay_rmmod_ms=0"
+ modprobe_reset
+ unset FIRST_MODPROBE_ARGS
+}
+
load_req_mod()
{
modprobe_reset
@@ -197,6 +212,63 @@ debugfs_reset_first_test_dev_ignore_errors()
echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
}
+debugfs_kernfs_kernfs_fop_write_iter_exists()
+{
+ KNOB_DIR="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter"
+ if [[ ! -d $KNOB_DIR ]]; then
+ echo "kernfs debugfs does not exist $KNOB_DIR"
+ return 0;
+ fi
+ KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+ if [[ ! -d $KNOB_DEBUGFS ]]; then
+		echo -n "kernfs debugfs for configuring fail_kernfs_fop_write_iter "
+ echo "does not exist $KNOB_DIR"
+ return 0;
+ fi
+ return 1
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_once()
+{
+ KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+ echo 1 > $KNOB_DEBUGFS/interval
+ echo 100 > $KNOB_DEBUGFS/probability
+ echo 0 > $KNOB_DEBUGFS/space
+ # Disable verbose messages on the kernel ring buffer which may
+ # confuse developers with a kernel panic.
+ echo 0 > $KNOB_DEBUGFS/verbose
+
+ # Fail only once
+ echo 1 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_kernfs_fop_write_iter_set_fail_never()
+{
+ KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
+ echo 0 > $KNOB_DEBUGFS/times
+}
+
+debugfs_kernfs_set_wait_ms()
+{
+ SLEEP_AFTER_WAIT_MS="${KERNFS_DEBUGFS_DIR}/sleep_after_wait_ms"
+ echo $1 > $SLEEP_AFTER_WAIT_MS
+}
+
+debugfs_kernfs_disable_wait_kernfs_fop_write_iter()
+{
+ ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_"
+ for KNOB in ${ENABLE_WAIT_KNOB}*; do
+ echo 0 > $KNOB
+ done
+}
+
+debugfs_kernfs_enable_wait_kernfs_fop_write_iter()
+{
+ ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_$1"
+ echo -n "1" > $ENABLE_WAIT_KNOB
+ return $?
+}
+
set_orig()
{
if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
@@ -972,6 +1044,105 @@ sysfs_test_0028()
fi
}
+sysfs_race_kernfs_kernfs_fop_write_iter()
+{
+ TARGET="${DIR}/$(get_test_target $1)"
+ WAIT_AT=$2
+ EXPECT_WRITE_RETURNS=$3
+ MSDELAY=$4
+
+ modprobe_reset_enable_completion
+ ORIG=$(cat "${TARGET}")
+ TEST_STR=$(( $ORIG + 1 ))
+
+ echo -n "Test racing removal of sysfs store op with kernfs $WAIT_AT ... "
+
+ if debugfs_kernfs_kernfs_fop_write_iter_exists; then
+ echo -n "skipping test as CONFIG_FAIL_KERNFS_KNOBS "
+ echo " or CONFIG_FAULT_INJECTION_DEBUG_FS is disabled"
+ return $ksft_skip
+ fi
+
+ # Allow for failing the kernfs_kernfs_fop_write_iter call once,
+ # we'll provide exact context shortly afterwards.
+ debugfs_kernfs_kernfs_fop_write_iter_set_fail_once
+
+ # First disable all waits
+ debugfs_kernfs_disable_wait_kernfs_fop_write_iter
+
+ # Enable a wait_for_completion(&kernfs_debug_wait_completion) at the
+ # specified location inside the kernfs_fop_write_iter() routine
+ debugfs_kernfs_enable_wait_kernfs_fop_write_iter $WAIT_AT
+
+ # Configure kernfs so that after its wait_for_completion() it
+ # will msleep() this amount of time and schedule(). We figure this
+ # will be sufficient time to allow for our module removal to complete.
+ debugfs_kernfs_set_wait_ms $MSDELAY
+
+ # Now we trigger a kernfs write op, which will run kernfs_fop_write_iter,
+ # but will wait until our driver sends a respective completion
+ set_test_ignore_errors &
+ write_pid=$!
+
+ # At this point kernfs_fop_write_iter() hasn't run our op, its
+ # waiting for our completion at the specified time $WAIT_AT.
+ # We now remove our module which will send a
+ # complete(&kernfs_debug_wait_completion) right before we deregister
+ # our device and the sysfs device attributes are removed.
+ #
+ # After the completion is sent, the test_sysfs driver races with
+ # kernfs to do the device deregistration with the kernfs msleep
+ # and schedule(). This should mean we've forced trying to remove the
+ # module prior to allowing kernfs to run our store operation. If the
+ # race did happen we'll panic with a null dereference on the store op.
+ #
+ # If no race happens we should see no write operation triggered.
+ modprobe -r $TEST_DRIVER > /dev/null 2>&1
+
+ debugfs_kernfs_kernfs_fop_write_iter_set_fail_never
+
+ wait $write_pid
+ if [[ $? -eq $EXPECT_WRITE_RETURNS ]]; then
+ echo "ok"
+ else
+ echo "FAIL" >&2
+ fi
+}
+
+sysfs_test_0029()
+{
+ for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+ echo "Using delay-after-completion: $delay"
+ sysfs_race_kernfs_kernfs_fop_write_iter 0029 at_start 1 $delay
+ done
+}
+
+sysfs_test_0030()
+{
+ for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+ echo "Using delay-after-completion: $delay"
+ sysfs_race_kernfs_kernfs_fop_write_iter 0030 before_mutex 1 $delay
+ done
+}
+
+sysfs_test_0031()
+{
+ for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+ echo "Using delay-after-completion: $delay"
+ sysfs_race_kernfs_kernfs_fop_write_iter 0031 after_mutex 1 $delay
+ done
+}
+
+# A write only succeeds *iff* a module removal happens *after* the
+# kernfs active reference is obtained with kernfs_get_active().
+sysfs_test_0032()
+{
+ for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
+ echo "Using delay-after-completion: $delay"
+ sysfs_race_kernfs_kernfs_fop_write_iter 0032 after_active 0 $delay
+ done
+}
+
test_gen_desc()
{
echo -n "$1 x $(get_test_count $1)"
@@ -1013,6 +1184,10 @@ list_tests()
echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
+ echo "$(test_gen_desc 0029) - racing removal of store op with kernfs at start"
+ echo "$(test_gen_desc 0030) - racing removal of store op with kernfs before mutex"
+ echo "$(test_gen_desc 0031) - racing removal of store op with kernfs after mutex"
+ echo "$(test_gen_desc 0032) - racing removal of store op with kernfs after active"
}
usage()
--
2.30.2
This adds initial failure injection support to kernfs. We start
off with debug knobs which when enabled allow test drivers, such as
test_sysfs, to then make use of these to try to force certain
difficult races to take place with a high degree of certainty.
This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is
enabled in your kernel. If you don't have this enabled this provides
no new functionality. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new
routine kernfs_debug_should_wait() ends up being transformed to if
(false), and so the compiler should optimize these out as dead code,
producing no new effective binary changes.
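The compile-time stub pattern described above looks roughly like the following user-space sketch. MODEL_FAIL_KNOBS and the model_*() names are hypothetical stand-ins for CONFIG_FAIL_KERNFS_KNOBS and the kernfs helpers, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Leave MODEL_FAIL_KNOBS undefined to model CONFIG_FAIL_KERNFS_KNOBS=n. */
#ifdef MODEL_FAIL_KNOBS
bool model_should_wait_runtime(void);	/* would consult the fault attr */
#define model_debug_should_wait()	model_should_wait_runtime()
#else
#define model_debug_should_wait()	(false)
#endif

static int model_waits;

static void model_debug_wait(void)
{
	model_waits++;
}

/* With the knobs compiled out, the branch below is `if (false)`. */
static int model_write_iter(int len)
{
	if (model_debug_should_wait())
		model_debug_wait();
	return len;
}
```

With the macro left undefined the wait is never reached, so the compiler is free to drop the branch and the call entirely.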
We start off with enabling failure injections in kernfs by allowing us to
alter the way kernfs_fop_write_iter() behaves. We allow for the routine
kernfs_fop_write_iter() to wait for a certain condition in the kernel to
occur, after which it will sleep a predefined amount of time. This lets
kernfs users time exactly when they want kernfs_fop_write_iter() to
complete, allowing them to construct race conditions and test kernfs
for correctness.
You'd boot with this enabled on your kernel command line:
fail_kernfs_fop_write_iter=1,100,0,1
The values are <interval>,<probability>,<space>,<times>. We don't care for
space, so for now we ignore it. The above ensures a failure will trigger
only once.
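To make the four fields concrete, here is a user-space sketch that splits such a string into its parts. struct model_fault_attr and model_setup_fault_attr() are hypothetical names used only for this illustration, and the field types are an assumption. The kernel's own parsing lives in setup_fault_attr().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space mirror of the four comma-separated fields. */
struct model_fault_attr {
	unsigned long interval;
	unsigned long probability;
	int space;
	int times;
};

/* Returns 1 on success and 0 on a malformed string. */
static int model_setup_fault_attr(struct model_fault_attr *attr, const char *str)
{
	unsigned long interval, probability;
	int space, times;

	if (sscanf(str, "%lu,%lu,%d,%d",
		   &interval, &probability, &space, &times) != 4)
		return 0;

	attr->interval = interval;
	attr->probability = probability;
	attr->space = space;
	attr->times = times;
	return 1;
}
```

For "1,100,0,1" this yields interval=1, probability=100, space=0 and times=1, that is: consider every call, always fail, and do so at most once.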
*How* we allow for this routine to change behaviour is left to knobs we
expose under debugfs:
# ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
wait_after_active
wait_after_mutex
wait_at_start
wait_before_mutex
A debugfs entry also exists to allow us to sleep a configurable amount
of time after the completion:
/sys/kernel/debug/kernfs/sleep_after_wait_ms
These two sets of knobs allow us to construct races and demonstrate
how the kernfs active reference should suffice to protect against
races.
Enabling CONFIG_FAULT_INJECTION_DEBUG_FS enables us to configure the
different fault injection parameters for the new fail_kernfs_fop_write_iter
fault injection at run time:
ls -1 /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
interval
probability
space
task-filter
times
verbose
verbose_ratelimit_burst
verbose_ratelimit_interval_ms
Signed-off-by: Luis Chamberlain <[email protected]>
---
.../fault-injection/fault-injection.rst | 22 +++++
MAINTAINERS | 2 +-
fs/kernfs/Makefile | 1 +
fs/kernfs/failure-injection.c | 91 +++++++++++++++++++
fs/kernfs/file.c | 13 +++
fs/kernfs/kernfs-internal.h | 72 +++++++++++++++
include/linux/kernfs.h | 5 +
lib/Kconfig.debug | 10 ++
8 files changed, 215 insertions(+), 1 deletion(-)
create mode 100644 fs/kernfs/failure-injection.c
diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
index 4a25c5eb6f07..d4d34b082f47 100644
--- a/Documentation/fault-injection/fault-injection.rst
+++ b/Documentation/fault-injection/fault-injection.rst
@@ -28,6 +28,28 @@ Available fault injection capabilities
injects kernel RPC client and server failures.
+- fail_kernfs_fop_write_iter
+
+ Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
+ this does not immediately enable any errors to occur. You must configure
+ how you want this routine to fail or change behaviour by using the debugfs
+ knobs for it:
+
+ # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
+ wait_after_active
+ wait_after_mutex
+ wait_at_start
+ wait_before_mutex
+
+ You can also configure how long to sleep after a wait under
+
+ /sys/kernel/debug/kernfs/sleep_after_wait_ms
+
+  If you enable CONFIG_FAULT_INJECTION_DEBUG_FS the fail_kernfs_fop_write_iter failure
+ injection parameters are placed under:
+
+ /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
+
- fail_make_request
injects disk IO errors on devices permitted by setting
diff --git a/MAINTAINERS b/MAINTAINERS
index 1b4cefcb064c..fadfd961ad80 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10384,7 +10384,7 @@ M: Greg Kroah-Hartman <[email protected]>
M: Tejun Heo <[email protected]>
S: Supported
T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
-F: fs/kernfs/
+F: fs/kernfs/*
F: include/linux/kernfs.h
KEXEC
diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
index 4ca54ff54c98..bc5b32ca39f9 100644
--- a/fs/kernfs/Makefile
+++ b/fs/kernfs/Makefile
@@ -4,3 +4,4 @@
#
obj-y := mount.o inode.o dir.o file.o symlink.o
+obj-$(CONFIG_FAIL_KERNFS_KNOBS) += failure-injection.o
diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
new file mode 100644
index 000000000000..4130d202c13b
--- /dev/null
+++ b/fs/kernfs/failure-injection.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/fault-inject.h>
+#include <linux/delay.h>
+
+#include "kernfs-internal.h"
+
+static DECLARE_FAULT_ATTR(fail_kernfs_fop_write_iter);
+struct kernfs_config_fail kernfs_config_fail;
+
+#define kernfs_config_fail(when) \
+ kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+#define kernfs_config_fail(when) \
+ kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
+
+static int __init setup_fail_kernfs_fop_write_iter(char *str)
+{
+ return setup_fault_attr(&fail_kernfs_fop_write_iter, str);
+}
+
+__setup("fail_kernfs_fop_write_iter=", setup_fail_kernfs_fop_write_iter);
+
+struct dentry *kernfs_debugfs_root;
+struct dentry *config_fail_kernfs_fop_write_iter;
+
+static int __init kernfs_init_failure_injection(void)
+{
+ kernfs_config_fail.sleep_after_wait_ms = 100;
+ kernfs_debugfs_root = debugfs_create_dir("kernfs", NULL);
+
+ fault_create_debugfs_attr("fail_kernfs_fop_write_iter",
+ kernfs_debugfs_root, &fail_kernfs_fop_write_iter);
+
+ config_fail_kernfs_fop_write_iter =
+ debugfs_create_dir("config_fail_kernfs_fop_write_iter",
+ kernfs_debugfs_root);
+
+ debugfs_create_u32("sleep_after_wait_ms", 0600,
+ kernfs_debugfs_root,
+ &kernfs_config_fail.sleep_after_wait_ms);
+
+ debugfs_create_bool("wait_at_start", 0600,
+ config_fail_kernfs_fop_write_iter,
+ &kernfs_config_fail(at_start));
+ debugfs_create_bool("wait_before_mutex", 0600,
+ config_fail_kernfs_fop_write_iter,
+ &kernfs_config_fail(before_mutex));
+ debugfs_create_bool("wait_after_mutex", 0600,
+ config_fail_kernfs_fop_write_iter,
+ &kernfs_config_fail(after_mutex));
+ debugfs_create_bool("wait_after_active", 0600,
+ config_fail_kernfs_fop_write_iter,
+ &kernfs_config_fail(after_active));
+ return 0;
+}
+late_initcall(kernfs_init_failure_injection);
+
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
+{
+ if (!evaluate)
+ return 0;
+
+ return should_fail(&fail_kernfs_fop_write_iter, 0);
+}
+
+DECLARE_COMPLETION(kernfs_debug_wait_completion);
+EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
+
+void kernfs_debug_wait(void)
+{
+ unsigned long timeout;
+
+ timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
+ msecs_to_jiffies(3000));
+ if (!timeout)
+ pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
+ __func__);
+ else
+ pr_info("%s received completion with time left on timeout %u ms\n",
+ __func__, jiffies_to_msecs(timeout));
+
+ /*
+ * The goal is to wait for an event, and *then* once we have
+ * reached it, the other side will try to do something which
+ * it thinks will break. So we must give it some time to do
+ * that. The amount of time is configurable.
+ */
+ msleep(kernfs_config_fail.sleep_after_wait_ms);
+ pr_info("%s ended\n", __func__);
+}
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..4479c6580333 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
const struct kernfs_ops *ops;
char *buf;
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
+ kernfs_debug_wait();
+
if (of->atomic_write_len) {
if (len > of->atomic_write_len)
return -E2BIG;
@@ -280,17 +283,27 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
}
buf[len] = '\0'; /* guarantee string termination */
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, before_mutex))
+ kernfs_debug_wait();
+
/*
* @of->mutex nests outside active ref and is used both to ensure that
* the ops aren't called concurrently for the same open file.
*/
mutex_lock(&of->mutex);
+
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_mutex))
+ kernfs_debug_wait();
+
if (!kernfs_get_active(of->kn)) {
mutex_unlock(&of->mutex);
len = -ENODEV;
goto out_free;
}
+ if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_active))
+ kernfs_debug_wait();
+
ops = kernfs_ops(of->kn);
if (ops->write)
len = ops->write(of, buf, len, iocb->ki_pos);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index f9cc912c31e1..9e3abf597e2d 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -18,6 +18,7 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>
+#include <linux/stringify.h>
struct kernfs_iattrs {
kuid_t ia_uid;
@@ -147,4 +148,75 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
*/
extern const struct inode_operations kernfs_symlink_iops;
+/*
+ * failure-injection.c
+ */
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+
+/**
+ * struct kernfs_fop_write_iter_fail - how kernfs_fop_write_iter() fails
+ *
+ * This lets you configure which part of kernfs_fop_write_iter() should behave
+ * in a specific way to allow userspace to capture possible failures in
+ * kernfs. The wait knobs let you design tests which capture possible
+ * race conditions which would otherwise be difficult to reproduce. A
+ * secondary driver would signal kernfs's wait completion when it is done.
+ *
+ * The point of the wait completion failure injection tests is to confirm
+ * that the kernfs active refcount suffices to ensure other objects in other
+ * layers are also guaranteed to exist, even if they are opaque to kernfs. This
+ * includes kobjects, devices, and other objects built on top of this, like
+ * the block layer when using sysfs block device attributes.
+ *
+ * @wait_at_start: waits for completion from a third party at the start of
+ * the routine.
+ * @wait_before_mutex: waits for completion from a third party before we
+ * take the of->mutex.
+ * @wait_after_mutex: waits for completion from a third party after we
+ * have held the of->mutex.
+ * @wait_after_active: waits for completion from a third party after we
+ * have refcounted the struct kernfs_node.
+ */
+struct kernfs_fop_write_iter_fail {
+ bool wait_at_start;
+ bool wait_before_mutex;
+ bool wait_after_mutex;
+ bool wait_after_active;
+};
+
+/**
+ * struct kernfs_config_fail - kernfs configuration for failure injection
+ *
+ * You can enable kernfs failure injection on boot, and in particular we currently
+ * only support failures for kernfs_fop_write_iter(). However, we don't
+ * want to always enable errors on this call when failure injection is enabled
+ * as this routine is used by many parts of the kernel for proper functionality.
+ * The compromise we make is we let userspace start enabling which parts it
+ * wants to fail after boot, if and only if failure injection has been enabled.
+ *
+ * @kernfs_fop_write_iter_fail: configuration for how we want to allow
+ * for failure injection on kernfs_fop_write_iter()
+ * @sleep_after_wait_ms: how many ms to wait after completion is received.
+ */
+struct kernfs_config_fail {
+ struct kernfs_fop_write_iter_fail kernfs_fop_write_iter_fail;
+ u32 sleep_after_wait_ms;
+};
+
+extern struct kernfs_config_fail kernfs_config_fail;
+
+#define __kernfs_config_wait_var(func, when) \
+ (kernfs_config_fail. func ## _fail.wait_ ## when)
+#define __kernfs_debug_should_wait_func_name(func) __kernfs_debug_should_wait_## func
+
+#define kernfs_debug_should_wait(func, when) \
+ __kernfs_debug_should_wait_func_name(func)(__kernfs_config_wait_var(func, when))
+int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate);
+void kernfs_debug_wait(void);
+#else
+static inline void kernfs_init_failure_injection(void) {}
+#define kernfs_debug_should_wait(func, when) (false)
+static inline void kernfs_debug_wait(void) {}
+#endif
+
#endif /* __KERNFS_INTERNAL_H */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3ccce6f24548..cd968ee2b503 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -411,6 +411,11 @@ void kernfs_init(void);
struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
u64 id);
+
+#ifdef CONFIG_FAIL_KERNFS_KNOBS
+extern struct completion kernfs_debug_wait_completion;
+#endif
+
#else /* CONFIG_KERNFS */
static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ae19bf1a21b8..a29b7d398c4e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1902,6 +1902,16 @@ config FAULT_INJECTION_USERCOPY
Provides fault-injection capability to inject failures
in usercopy functions (copy_from_user(), get_user(), ...).
+config FAIL_KERNFS_KNOBS
+ bool "Fault-injection support in kernfs"
+ depends on FAULT_INJECTION
+ help
+ Provide fault-injection capability for kernfs. This only enables
+ the error injection functionality. To use it you must configure
+ which path you want to trigger an error on using debugfs under
+ /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/. By
+ default all of these are disabled.
+
config FAIL_MAKE_REQUEST
bool "Fault-injection capability for disk IO"
depends on FAULT_INJECTION && BLOCK
--
2.30.2
When driver sysfs attributes use a lock also used on module removal we
can race to deadlock. This happens, for instance, when a sysfs file on
a driver is being used and module removal is triggered at the same
time. The module removal code holds a lock, and the driver's sysfs file
operation then waits for the same lock. While holding the lock, module
removal tries to remove the sysfs entries, but these cannot be removed
yet because one of them is still in use, waiting for the lock. That wait
won't complete as the lock is already held. Likewise module removal
cannot complete, and so we deadlock.
This can now be easily reproduced with our sysfs selftest as follows:
./tools/testing/selftests/sysfs/sysfs.sh -t 0027
This uses a local driver lock. Test 0028 can also be used; it uses
rtnl_lock():
./tools/testing/selftests/sysfs/sysfs.sh -t 0028
To fix this we extend struct kernfs_node with a module reference
and call try_module_get() after kernfs_get_active() is called. As
documented in the prior patch, we now know that once kernfs_get_active()
succeeds the module is implicitly guaranteed to exist and cannot be removed.
This is because the module is the one in charge of removing the same
sysfs file it created, and removal of sysfs files on module exit will wait
until they don't have any active references. By using try_module_get()
after kernfs_get_active() we yield to let module removal trump calls to
process a sysfs operation, while also preventing module removal if a sysfs
operation is already in progress. This prevents the deadlock.
This deadlock was first reported with the zram driver, however the live
patching folks have acknowledged they have observed this as well with
live patching, when a live patch is removed. I was then able to
reproduce easily by creating a dedicated selftest for it.
A sketch of how this can happen follows. Consider foo, a local mutex
that is part of a driver and is used both in the driver's module exit
routine and in one of its sysfs ops:
foo.c:
static DEFINE_MUTEX(foo);
static ssize_t foo_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
...
mutex_lock(&foo);
...
mutex_unlock(&foo);
...
}
static DEVICE_ATTR_RW(foo);
...
void foo_exit(void)
{
mutex_lock(&foo);
...
mutex_unlock(&foo);
}
module_exit(foo_exit);
And this can lead to this condition:
CPU A                              CPU B
                                   foo_store()
foo_exit()
  mutex_lock(&foo)
                                   mutex_lock(&foo)
   del_gendisk(some_struct->disk);
     device_del()
       device_remove_groups()
In this situation foo_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the operation on the sysfs file
being poked at completes, and that operation is waiting for a lock which
is already held.
Signed-off-by: Luis Chamberlain <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
fs/kernfs/dir.c | 44 ++++++++++++++++++----
fs/kernfs/file.c | 6 ++-
fs/kernfs/kernfs-internal.h | 3 +-
fs/kernfs/symlink.c | 3 +-
fs/sysfs/dir.c | 2 +-
fs/sysfs/file.c | 6 ++-
fs/sysfs/group.c | 3 +-
include/linux/kernfs.h | 14 ++++---
include/linux/sysfs.h | 52 ++++++++++++++++++++------
kernel/cgroup/cgroup.c | 2 +-
11 files changed, 105 insertions(+), 34 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..4edf3b37fd2c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
- 0, rft->kf_ops, rft, NULL, NULL);
+ 0, rft->kf_ops, rft, NULL, NULL, NULL);
if (IS_ERR(kn))
return PTR_ERR(kn);
@@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
kn = __kernfs_create_file(parent_kn, name, 0444,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
- &kf_mondata_ops, priv, NULL, NULL);
+ &kf_mondata_ops, priv, NULL, NULL, NULL);
if (IS_ERR(kn))
return PTR_ERR(kn);
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..e841201fd11b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
#include <linux/slab.h>
#include <linux/security.h>
#include <linux/hash.h>
+#include <linux/module.h>
#include "kernfs-internal.h"
@@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
*/
struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
{
+ int v;
+
if (unlikely(!kn))
return NULL;
if (!atomic_inc_unless_negative(&kn->active))
return NULL;
+ /*
+ * If a module created the kernfs_node, the module cannot possibly be
+ * removed if the above atomic_inc_unless_negative() succeeded. So the
+ * try_module_get() below is not to protect the lifetime of the module
+ * as that is already guaranteed. The try_module_get() below is used
+ * to ensure that we don't deadlock in case a kernfs operation and
+ * module removal used a shared lock.
+ */
+ if (!try_module_get(kn->owner)) {
+ v = atomic_dec_return(&kn->active);
+ if (unlikely(v == KN_DEACTIVATED_BIAS))
+ wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+ return NULL;
+ }
+
if (kernfs_lockdep(kn))
rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
return kn;
@@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
if (kernfs_lockdep(kn))
rwsem_release(&kn->dep_map, _RET_IP_);
v = atomic_dec_return(&kn->active);
+
+ /*
+ * We prevent module exit *until* we know for sure all possible
+ * kernfs ops are done.
+ */
+ module_put(kn->owner);
+
if (likely(v != KN_DEACTIVATED_BIAS))
return;
@@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- unsigned flags)
+ unsigned flags,
+ struct module *owner)
{
struct kernfs_node *kn;
u32 id_highbits;
@@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
kn->name = name;
kn->mode = mode;
kn->flags = flags;
+ kn->owner = owner;
if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
struct iattr iattr = {
@@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- unsigned flags)
+ unsigned flags,
+ struct module *owner)
{
struct kernfs_node *kn;
kn = __kernfs_new_node(kernfs_root(parent), parent,
- name, mode, uid, gid, flags);
+ name, mode, uid, gid, flags, owner);
if (kn) {
kernfs_get(parent);
kn->parent = parent;
@@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
- KERNFS_DIR);
+ KERNFS_DIR, NULL);
if (!kn) {
idr_destroy(&root->ino_idr);
kfree(root);
@@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root)
* @gid: gid of the new directory
* @priv: opaque data associated with the new directory
* @ns: optional namespace tag of the directory
+ * @owner: if set, the module owner of this directory
*
* Returns the created node on success, ERR_PTR() value on failure.
*/
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- void *priv, const void *ns)
+ void *priv, const void *ns,
+ struct module *owner)
{
struct kernfs_node *kn;
int rc;
/* allocate */
kn = kernfs_new_node(parent, name, mode | S_IFDIR,
- uid, gid, KERNFS_DIR);
+ uid, gid, KERNFS_DIR, owner);
if (!kn)
return ERR_PTR(-ENOMEM);
@@ -1014,7 +1044,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
/* allocate */
kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
- GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+ GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR, NULL);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..0e125287e050 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = {
* @priv: private data for the file
* @ns: optional namespace tag of the file
* @key: lockdep key for the file's active_ref, %NULL to disable lockdep
+ * @owner: if set, the module owner of the file
*
* Returns the created node on success, ERR_PTR() value on error.
*/
@@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
loff_t size,
const struct kernfs_ops *ops,
void *priv, const void *ns,
- struct lock_class_key *key)
+ struct lock_class_key *key,
+ struct module *owner)
{
struct kernfs_node *kn;
unsigned flags;
@@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
flags = KERNFS_FILE;
kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
- uid, gid, flags);
+ uid, gid, flags, owner);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..6d275d661987 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn);
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- unsigned flags);
+ unsigned flags,
+ struct module *owner);
/*
* file.c
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..5a053eebee52 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
gid = target->iattr->ia_gid;
}
- kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+ kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
+ target->owner);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..9763c2fde3c7 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
kobject_get_ownership(kobj, &uid, &gid);
kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
- kobj, ns);
+ kobj, ns, NULL);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..af9e91fd1a92 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -292,7 +292,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
#endif
kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
- PAGE_SIZE, ops, (void *)attr, ns, key);
+ PAGE_SIZE, ops, (void *)attr, ns, key,
+ attr->owner);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, attr->name);
@@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
#endif
kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
- battr->size, ops, (void *)attr, ns, key);
+ battr->size, ops, (void *)attr, ns, key,
+ attr->owner);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..372864d1cb54 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
} else {
kn = kernfs_create_dir_ns(kobj->sd, grp->name,
S_IRWXU | S_IRUGO | S_IXUGO,
- uid, gid, kobj, NULL);
+ uid, gid, kobj, NULL,
+ grp->owner);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(kobj->sd, grp->name);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..818b00ebd107 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,7 @@ struct kernfs_node {
unsigned short flags;
umode_t mode;
struct kernfs_iattrs *iattr;
+ struct module *owner;
};
/*
@@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- void *priv, const void *ns);
+ void *priv, const void *ns,
+ struct module *owner);
struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
const char *name);
struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
@@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
loff_t size,
const struct kernfs_ops *ops,
void *priv, const void *ns,
- struct lock_class_key *key);
+ struct lock_class_key *key,
+ struct module *owner);
struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
const char *name,
struct kernfs_node *target);
@@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
static inline struct kernfs_node *
kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
umode_t mode, kuid_t uid, kgid_t gid,
- void *priv, const void *ns)
+ void *priv, const void *ns, struct module *owner)
{ return ERR_PTR(-ENOSYS); }
static inline struct kernfs_node *
__kernfs_create_file(struct kernfs_node *parent, const char *name,
umode_t mode, kuid_t uid, kgid_t gid,
loff_t size, const struct kernfs_ops *ops,
- void *priv, const void *ns, struct lock_class_key *key)
+ void *priv, const void *ns, struct lock_class_key *key,
+ struct module *owner)
{ return ERR_PTR(-ENOSYS); }
static inline struct kernfs_node *
@@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
{
return kernfs_create_dir_ns(parent, name, mode,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
- priv, NULL);
+ priv, NULL, parent->owner);
}
static inline int kernfs_remove_by_name(struct kernfs_node *parent,
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index e3f1e8ac1f85..babbabb460dc 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -30,6 +30,7 @@ enum kobj_ns_type;
struct attribute {
const char *name;
umode_t mode;
+ struct module *owner;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
bool ignore_lockdep:1;
struct lock_class_key *key;
@@ -80,6 +81,7 @@ do { \
* @attrs: Pointer to NULL terminated list of attributes.
* @bin_attrs: Pointer to NULL terminated list of binary attributes.
* Either attrs or bin_attrs or both must be provided.
+ * @owner: If set, module responsible for this attribute group
*/
struct attribute_group {
const char *name;
@@ -89,6 +91,7 @@ struct attribute_group {
struct bin_attribute *, int);
struct attribute **attrs;
struct bin_attribute **bin_attrs;
+ struct module *owner;
};
/*
@@ -100,38 +103,52 @@ struct attribute_group {
#define __ATTR(_name, _mode, _show, _store) { \
.attr = {.name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
#define __ATTR_PREALLOC(_name, _mode, _show, _store) { \
.attr = {.name = __stringify(_name), \
- .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
+ .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
#define __ATTR_RO(_name) { \
- .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0444, \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
}
#define __ATTR_RO_MODE(_name, _mode) { \
.attr = { .name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
}
#define __ATTR_RW_MODE(_name, _mode) { \
.attr = { .name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
.store = _name##_store, \
}
#define __ATTR_WO(_name) { \
- .attr = { .name = __stringify(_name), .mode = 0200 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0200, \
+ .owner = THIS_MODULE, \
+ }, \
.store = _name##_store, \
}
@@ -141,8 +158,11 @@ struct attribute_group {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
#define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) { \
- .attr = {.name = __stringify(_name), .mode = _mode, \
- .ignore_lockdep = true }, \
+ .attr = {.name = __stringify(_name), \
+ .mode = _mode, \
+ .ignore_lockdep = true, \
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
@@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = { \
#define ATTRIBUTE_GROUPS(_name) \
static const struct attribute_group _name##_group = { \
.attrs = _name##_attrs, \
+ .owner = THIS_MODULE, \
}; \
__ATTRIBUTE_GROUPS(_name)
@@ -199,20 +220,29 @@ struct bin_attribute {
/* macros to create static binary attributes easier */
#define __BIN_ATTR(_name, _mode, _read, _write, _size) { \
- .attr = { .name = __stringify(_name), .mode = _mode }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = _mode, \
+ .owner = THIS_MODULE, \
+ }, \
.read = _read, \
.write = _write, \
.size = _size, \
}
#define __BIN_ATTR_RO(_name, _size) { \
- .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0444, \
+ .owner = THIS_MODULE, \
+ }, \
.read = _name##_read, \
.size = _size, \
}
#define __BIN_ATTR_WO(_name, _size) { \
- .attr = { .name = __stringify(_name), .mode = 0200 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0200, \
+ .owner = THIS_MODULE, \
+ }, \
.write = _name##_write, \
.size = _size, \
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9e0390000025..c6b0a28f599c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
cgroup_file_mode(cft),
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
0, cft->kf_ops, cft,
- NULL, key);
+ NULL, key, NULL);
if (IS_ERR(kn))
return PTR_ERR(kn);
--
2.30.2
There is quite a bit of tribal knowledge around proper use of
try_module_get() and that it must be used only in a context which
can ensure the module won't be gone during the operation. Document
this little bit of tribal knowledge.
I'm extending this tribal knowledge with new developments which it
seems some folks do not yet believe to be true: we can be sure a
module will exist during the lifetime of a sysfs file operation.
For proof, refer to test_sysfs test #32:
./tools/testing/selftests/sysfs/sysfs.sh -t 0032
Without this being true, the write would fail or worse,
a crash would happen, in this test. It does not.
Signed-off-by: Luis Chamberlain <[email protected]>
---
include/linux/module.h | 34 ++++++++++++++++++++++++++++++++--
1 file changed, 32 insertions(+), 2 deletions(-)
diff --git a/include/linux/module.h b/include/linux/module.h
index c9f1200b2312..22eacd5e1e85 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
to handle the error case (which only happens with rmmod --wait). */
extern void __module_get(struct module *module);
-/* This is the Right Way to get a module: if it fails, it's being removed,
- * so pretend it's not there. */
+/**
+ * try_module_get() - yields to module removal and bumps refcnt otherwise
+ * @module: the module we should check for
+ *
+ * This can be used to try to bump the reference count of a module, so to
+ * prevent module removal. The reference count of a module is not allowed
+ * to be incremented if the module is already being removed.
+ *
+ * Care must be taken to ensure the module cannot be removed during the call to
+ * try_module_get(). This can be done by having another entity other than the
+ * module itself increment the module reference count, or through some other
+ * means which guarantees the module could not be removed during an operation.
+ * An example of this latter case is using try_module_get() in a sysfs file
+ * which the module created. The sysfs store / read file operations are
+ * guaranteed to exist through the use of kernfs's active reference (see
+ * kernfs_active()). If a sysfs file operation is being run, the module which
+ * created it must still exist as the module is in charge of removing the same
+ * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
+ * unless the same file is not active.
+ *
+ * One of the real values to try_module_get() is the module_is_live() check
+ * which ensures that the caller of try_module_get() can yield to userspace
+ * module removal requests and fail whatever it was about to process.
+ */
extern bool try_module_get(struct module *module);
+/**
+ * module_put() - release a reference count to a module
+ * @module: the module we should release a reference count for
+ *
+ * If you successfully bump a reference count to a module with try_module_get(),
+ * when you are finished you must call module_put() to release that reference
+ * count.
+ */
extern void module_put(struct module *module);
#else /*!CONFIG_MODULE_UNLOAD*/
--
2.30.2
If one ends up extending this line checkpatch will complain about the
use of S_IRWXUGO suggesting it is not preferred and that 0777
should be used instead. Take the tip from checkpatch and do that
change before we do our subsequent changes.
This makes no functional changes.
Signed-off-by: Luis Chamberlain <[email protected]>
---
fs/kernfs/symlink.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index c8f8e41b8411..19a6c71c6ff5 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,8 +36,7 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
gid = target->iattr->ia_gid;
}
- kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
- KERNFS_LINK);
+ kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
if (!kn)
return ERR_PTR(-ENOMEM);
--
2.30.2
On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> When driver sysfs attributes use a lock also used on module removal we
> can race to deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have module removal call
> trigger. The module removal call code holds a lock, and then the
> driver's sysfs file entry waits for the same lock. While holding the
> lock the module removal tries to remove the sysfs entries, but these
> cannot be removed yet as one is waiting for a lock. This won't complete
> as the lock is already held. Likewise module removal cannot complete,
> and so we deadlock.
>
> This can now be easily reproducible with our sysfs selftest as follows:
>
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
>
> This uses a local driver lock. Test 0028 can also be used, that uses
> the rtnl_lock():
>
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
>
> To fix this we extend the struct kernfs_node with a module reference
> and use the try_module_get() after kernfs_get_active() is called. As
> documented in the prior patch, we now know that once kernfs_get_active()
> is called the module is implicitly guarded to exist and cannot be removed.
> This is because the module is the one in charge of removing the same
> sysfs file it created, and removal of sysfs files on module exit will wait
> until they don't have any active references. By using a try_module_get()
> after kernfs_get_active() we yield to let module removal trump calls to
> process a sysfs operation, while also preventing module removal if a sysfs
> operation is in already progress. This prevents the deadlock.
>
> This deadlock was first reported with the zram driver, however the live
I don't see the lock pattern you mentioned in the zram driver. Can you
share the related zram code?
> patching folks have acknowledged they have observed this as well with
> live patching, when a live patch is removed. I was then able to
> reproduce easily by creating a dedicated selftest for it.
>
> A sketch of how this can happen follows, consider foo a local mutex
> part of a driver, and used on the driver's module exit routine and
> on one of its sysfs ops:
>
> foo.c:
> static DEFINE_MUTEX(foo);
> static ssize_t foo_store(struct device *dev,
> struct device_attribute *attr,
> const char *buf, size_t count)
> {
> ...
> mutex_lock(&foo);
> ...
> mutex_unlock(&foo);
> ...
> }
> static DEVICE_ATTR_RW(foo);
> ...
> void foo_exit(void)
> {
> mutex_lock(&foo);
> ...
> mutex_unlock(&foo);
> }
> module_exit(foo_exit);
>
> And this can lead to this condition:
>
> CPU A CPU B
> foo_store()
> foo_exit()
> mutex_lock(&foo)
> mutex_lock(&foo)
> del_gendisk(some_struct->disk);
> device_del()
> device_remove_groups()
I guess the deadlock exists if foo_exit() is called from anywhere. If so,
it looks like the issue may not be directly related to module removal, right?
Thanks,
Ming
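For readers following the thread, here is a rough userspace model of the
ordering the quoted commit log describes: take the kernfs active reference
first, then try to take the module reference, and fail the write if the
module is already on its way out. This is only a sketch under assumptions.
All ord_* names and struct layouts are hypothetical, the -ENODEV return for
the module case is assumed, and the model is single-threaded. It is not the
kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not kernel structures. */
struct ord_module {
	bool going;		/* set once module removal has started */
	int refcnt;
};

struct ord_node {
	bool deactivated;	/* set once the sysfs file is being removed */
	int active;
	struct ord_module *owner;
};

/* Models kernfs_get_active(): fails once the node is being removed. */
static bool ord_get_active(struct ord_node *kn)
{
	if (kn->deactivated)
		return false;
	kn->active++;
	return true;
}

static void ord_put_active(struct ord_node *kn)
{
	kn->active--;
}

/* Models try_module_get(): refuses a reference once removal has begun. */
static bool ord_try_module_get(struct ord_module *mod)
{
	if (mod->going)
		return false;
	mod->refcnt++;
	return true;
}

static void ord_module_put(struct ord_module *mod)
{
	mod->refcnt--;
}

/* Returns the length written, or -ENODEV when yielding to removal. */
static long ord_write(struct ord_node *kn, size_t len)
{
	if (!ord_get_active(kn))
		return -ENODEV;

	if (!ord_try_module_get(kn->owner)) {
		/* Removal wins: back out before touching any driver lock. */
		ord_put_active(kn);
		return -ENODEV;
	}

	/* The driver's store op would run here and may take its lock. */

	ord_module_put(kn->owner);
	ord_put_active(kn);
	return (long)len;
}
```

The point of the ordering is that the write path never reaches the driver's
lock once removal has started, so it cannot end up waiting on a lock held by
the exit routine.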
On Mon, Sep 27, 2021 at 09:37:57AM -0700, Luis Chamberlain wrote:
> This adds initial failure injection support to kernfs. We start
> off with debug knobs which when enabled allow test drivers, such as
> test_sysfs, to then make use of these to try to force certain
> difficult races to take place with a high degree of certainty.
>
> This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is
> enabled in your kernel. If you don't have this enabled this provides
> no new functionality. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new
> routine kernfs_debug_should_wait() ends up being transformed to if
> (false), and so the compiler should optimize these out as dead code
> producing no new effective binary changes.
>
> We start off with enabling failure injections in kernfs by allowing us to
> alter the way kernfs_fop_write_iter() behaves. We allow for the routine
> kernfs_fop_write_iter() to wait for a certain condition in the kernel to
> occur, after which it will sleep a predefined amount of time. This lets
> kernfs users time exactly when they want kernfs_fop_write_iter() to
> complete, allowing them to construct race conditions and test for
> correctness in kernfs.
>
> You'd boot with this enabled on your kernel command line:
>
> fail_kernfs_fop_write_iter=1,100,0,1
>
> The values are <interval,probability,size,times>, we don't care for
> size, so for now we ignore it. The above ensures a failure will trigger
> only once.
>
> *How* we allow for this routine to change behaviour is left to knobs we
> expose under debugfs:
>
> # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
I'd expect this to live under /sys/kernel/debug/fail_kernfs, like the
other fault injectors.
> wait_after_active
> wait_after_mutex
> wait_at_start
> wait_before_mutex
>
> A debugfs entry also exists to allow us to sleep a configurable amount
> of time after the completion:
>
> /sys/kernel/debug/kernfs/sleep_after_wait_ms
>
> These two sets of knobs allow us to construct races and demonstrate
> how the kernfs active reference should suffice to protect against
> races.
>
> Enabling CONFIG_FAULT_INJECTION_DEBUG_FS enables us to configure the
> different fault injection parameters for the new fail_kernfs_fop_write_iter
> fault injection at run time:
>
> ls -1 /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
> interval
> probability
> space
> times
> task-filter
> verbose
> verbose_ratelimit_burst
> verbose_ratelimit_interval_ms
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> .../fault-injection/fault-injection.rst | 22 +++++
> MAINTAINERS | 2 +-
> fs/kernfs/Makefile | 1 +
> fs/kernfs/failure-injection.c | 91 +++++++++++++++++++
> fs/kernfs/file.c | 13 +++
> fs/kernfs/kernfs-internal.h | 72 +++++++++++++++
> include/linux/kernfs.h | 5 +
> lib/Kconfig.debug | 10 ++
> 8 files changed, 215 insertions(+), 1 deletion(-)
> create mode 100644 fs/kernfs/failure-injection.c
>
> diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
> index 4a25c5eb6f07..d4d34b082f47 100644
> --- a/Documentation/fault-injection/fault-injection.rst
> +++ b/Documentation/fault-injection/fault-injection.rst
> @@ -28,6 +28,28 @@ Available fault injection capabilities
>
> injects kernel RPC client and server failures.
>
> +- fail_kernfs_fop_write_iter
> +
> + Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
> + this does not immediately enable any errors to occur. You must configure
> + how you want this routine to fail or change behaviour by using the debugfs
> + knobs for it:
> +
> + # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
> + wait_after_active
> + wait_after_mutex
> + wait_at_start
> + wait_before_mutex
This should be split up and detailed in the "debugfs entries" section
below here.
> +
> + You can also configure how long to sleep after a wait under
> +
> + /sys/kernel/debug/kernfs/sleep_after_wait_ms
> +
> + If you enable CONFIG_FAULT_INJECTION_DEBUG_FS the
> + fail_kernfs_fop_write_iter failure injection parameters are placed under:
> +
> + /sys/kernel/debug/kernfs/fail_kernfs_fop_write_iter/
> +
> - fail_make_request
>
> injects disk IO errors on devices permitted by setting
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1b4cefcb064c..fadfd961ad80 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10384,7 +10384,7 @@ M: Greg Kroah-Hartman <[email protected]>
> M: Tejun Heo <[email protected]>
> S: Supported
> T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
> -F: fs/kernfs/
> +F: fs/kernfs/*
> F: include/linux/kernfs.h
>
> KEXEC
> diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
> index 4ca54ff54c98..bc5b32ca39f9 100644
> --- a/fs/kernfs/Makefile
> +++ b/fs/kernfs/Makefile
> @@ -4,3 +4,4 @@
> #
>
> obj-y := mount.o inode.o dir.o file.o symlink.o
> +obj-$(CONFIG_FAIL_KERNFS_KNOBS) += failure-injection.o
> diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
> new file mode 100644
> index 000000000000..4130d202c13b
> --- /dev/null
> +++ b/fs/kernfs/failure-injection.c
I'd name this fault_inject.c, which matches the more common case:
$ find . -type f -name '*fault*inject*.c'
./fs/nfsd/fault_inject.c
./drivers/nvme/host/fault_inject.c
./drivers/scsi/ufs/ufs-fault-injection.c
./lib/fault-inject.c
./lib/fault-inject-usercopy.c
> @@ -0,0 +1,91 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/fault-inject.h>
> +#include <linux/delay.h>
> +
> +#include "kernfs-internal.h"
> +
> +static DECLARE_FAULT_ATTR(fail_kernfs_fop_write_iter);
> +struct kernfs_config_fail kernfs_config_fail;
> +
> +#define kernfs_config_fail(when) \
> + kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
> +
> +#define kernfs_config_fail(when) \
> + kernfs_config_fail.kernfs_fop_write_iter_fail.wait_ ## when
> +
> +static int __init setup_fail_kernfs_fop_write_iter(char *str)
> +{
> + return setup_fault_attr(&fail_kernfs_fop_write_iter, str);
> +}
> +
> +__setup("fail_kernfs_fop_write_iter=", setup_fail_kernfs_fop_write_iter);
> +
> +struct dentry *kernfs_debugfs_root;
> +struct dentry *config_fail_kernfs_fop_write_iter;
> +
> +static int __init kernfs_init_failure_injection(void)
> +{
> + kernfs_config_fail.sleep_after_wait_ms = 100;
> + kernfs_debugfs_root = debugfs_create_dir("kernfs", NULL);
> +
> + fault_create_debugfs_attr("fail_kernfs_fop_write_iter",
> + kernfs_debugfs_root, &fail_kernfs_fop_write_iter);
> +
> + config_fail_kernfs_fop_write_iter =
> + debugfs_create_dir("config_fail_kernfs_fop_write_iter",
> + kernfs_debugfs_root);
> +
> + debugfs_create_u32("sleep_after_wait_ms", 0600,
> + kernfs_debugfs_root,
> + &kernfs_config_fail.sleep_after_wait_ms);
> +
> + debugfs_create_bool("wait_at_start", 0600,
> + config_fail_kernfs_fop_write_iter,
> + &kernfs_config_fail(at_start));
> + debugfs_create_bool("wait_before_mutex", 0600,
> + config_fail_kernfs_fop_write_iter,
> + &kernfs_config_fail(before_mutex));
> + debugfs_create_bool("wait_after_mutex", 0600,
> + config_fail_kernfs_fop_write_iter,
> + &kernfs_config_fail(after_mutex));
> + debugfs_create_bool("wait_after_active", 0600,
> + config_fail_kernfs_fop_write_iter,
> + &kernfs_config_fail(after_active));
> + return 0;
> +}
> +late_initcall(kernfs_init_failure_injection);
> +
> +int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
> +{
> + if (!evaluate)
> + return 0;
> +
> + return should_fail(&fail_kernfs_fop_write_iter, 0);
> +}
Every caller ends up doing the wait, so how about just including that
here instead? It should make things much less intrusive and more readable.
And for the naming, other fault injectors use "should_fail_$topic", so
maybe better here would be something like may_wait_kernfs(...).
> +
> +DECLARE_COMPLETION(kernfs_debug_wait_completion);
> +EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
> +
> +void kernfs_debug_wait(void)
> +{
> + unsigned long timeout;
> +
> + timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
> + msecs_to_jiffies(3000));
> + if (!timeout)
> + pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
> + __func__);
> + else
> + pr_info("%s received completion with time left on timeout %u ms\n",
> + __func__, jiffies_to_msecs(timeout));
> +
> + /**
> + * The goal is wait for an event, and *then* once we have
> + * reached it, the other side will try to do something which
> + * it thinks will break. So we must give it some time to do
> + * that. The amount of time is configurable.
> + */
> + msleep(kernfs_config_fail.sleep_after_wait_ms);
> + pr_info("%s ended\n", __func__);
> +}
All the uses of "__func__" here seem redundant; I would drop them.
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 60e2a86c535e..4479c6580333 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> const struct kernfs_ops *ops;
> char *buf;
>
> + if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
> + kernfs_debug_wait();
So this could just be:
may_wait_kernfs(kernfs_fop_write_iter, at_start);
> +
> if (of->atomic_write_len) {
> if (len > of->atomic_write_len)
> return -E2BIG;
> @@ -280,17 +283,27 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> }
> buf[len] = '\0'; /* guarantee string termination */
>
> + if (kernfs_debug_should_wait(kernfs_fop_write_iter, before_mutex))
> + kernfs_debug_wait();
> +
> /*
> * @of->mutex nests outside active ref and is used both to ensure that
> * the ops aren't called concurrently for the same open file.
> */
> mutex_lock(&of->mutex);
> +
> + if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_mutex))
> + kernfs_debug_wait();
> +
> if (!kernfs_get_active(of->kn)) {
> mutex_unlock(&of->mutex);
> len = -ENODEV;
> goto out_free;
> }
>
> + if (kernfs_debug_should_wait(kernfs_fop_write_iter, after_active))
> + kernfs_debug_wait();
> +
> ops = kernfs_ops(of->kn);
> if (ops->write)
> len = ops->write(of, buf, len, iocb->ki_pos);
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index f9cc912c31e1..9e3abf597e2d 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -18,6 +18,7 @@
>
> #include <linux/kernfs.h>
> #include <linux/fs_context.h>
> +#include <linux/stringify.h>
>
> struct kernfs_iattrs {
> kuid_t ia_uid;
> @@ -147,4 +148,75 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
> */
> extern const struct inode_operations kernfs_symlink_iops;
>
> +/*
> + * failure-injection.c
> + */
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> +
> +/**
> + * struct kernfs_fop_write_iter_fail - how kernfs_fop_write_iter_fail fails
> + *
> + * This lets you configure what part of kernfs_fop_write_iter() should behave
> + * in a specific way to allow userspace to capture possible failures in
> + * kernfs. The wait knobs let you design tests which capture possible
> + * race conditions which would otherwise be difficult to reproduce. A
> + * secondary driver would tell kernfs's wait completion when it is done.
> + *
> + * The point of the wait completion failure injection tests is to confirm
> + * that the kernfs active refcount suffices to ensure other objects in other
> + * layers are also guaranteed to exist, even if they are opaque to kernfs. This
> + * includes kobjects, devices, and other objects built on top of this, like
> + * the block layer when using sysfs block device attributes.
> + *
> + * @wait_at_start: waits for completion from a third party at the start of
> + * the routine.
> + * @wait_before_mutex: waits for completion from a third party before we
> + * are allowed to continue before the of->mutex is held.
> + * @wait_after_mutex: waits for completion from a third party after we
> + * have held the of->mutex.
> + * @wait_after_active: waits for completion from a third party after we
> + * have refcounted the struct kernfs_node.
> + */
> +struct kernfs_fop_write_iter_fail {
> + bool wait_at_start;
> + bool wait_before_mutex;
> + bool wait_after_mutex;
> + bool wait_after_active;
> +};
> +
> +/**
> + * struct kernfs_config_fail - kernfs configuration for failure injection
> + *
> + * You can enable kernfs failure injection on boot, and in particular we currently
> + * only support failures for kernfs_fop_write_iter(). However, we don't
> + * want to always enable errors on this call when failure injection is enabled
> + * as this routine is used by many parts of the kernel for proper functionality.
> + * The compromise we make is we let userspace start enabling which parts it
> + * wants to fail after boot, if and only if failure injection has been enabled.
> + *
> + * @kernfs_fop_write_iter_fail: configuration for how we want to allow
> + * for failure injection on kernfs_fop_write_iter()
> + * @sleep_after_wait_ms: how many ms to wait after completion is received.
> + */
> +struct kernfs_config_fail {
> + struct kernfs_fop_write_iter_fail kernfs_fop_write_iter_fail;
> + u32 sleep_after_wait_ms;
> +};
> +
> +extern struct kernfs_config_fail kernfs_config_fail;
> +
> +#define __kernfs_config_wait_var(func, when) \
> + (kernfs_config_fail. func ## _fail.wait_ ## when)
^^ ^ ^
nit: needless spaces
> +#define __kernfs_debug_should_wait_func_name(func) __kernfs_debug_should_wait_## func
> +
> +#define kernfs_debug_should_wait(func, when) \
> + __kernfs_debug_should_wait_func_name(func)(__kernfs_config_wait_var(func, when))
> +int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate);
> +void kernfs_debug_wait(void);
> +#else
> +static inline void kernfs_init_failure_injection(void) {}
> +#define kernfs_debug_should_wait(func, when) (false)
> +static inline void kernfs_debug_wait(void) {}
> +#endif
> +
> #endif /* __KERNFS_INTERNAL_H */
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3ccce6f24548..cd968ee2b503 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -411,6 +411,11 @@ void kernfs_init(void);
>
> struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
> u64 id);
> +
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> +extern struct completion kernfs_debug_wait_completion;
> +#endif
> +
> #else /* CONFIG_KERNFS */
>
> static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index ae19bf1a21b8..a29b7d398c4e 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1902,6 +1902,16 @@ config FAULT_INJECTION_USERCOPY
> Provides fault-injection capability to inject failures
> in usercopy functions (copy_from_user(), get_user(), ...).
>
> +config FAIL_KERNFS_KNOBS
> + bool "Fault-injection support in kernfs"
> + depends on FAULT_INJECTION
> + help
> + Provide fault-injection capability for kernfs. This only enables
> + the error injection functionality. To use it you must configure
> + which path you want to trigger an error on using debugfs under
> + /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/. By
> + default all of these are disabled.
> +
> config FAIL_MAKE_REQUEST
> bool "Fault-injection capability for disk IO"
> depends on FAULT_INJECTION && BLOCK
> --
> 2.30.2
>
--
Kees Cook
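To make the suggestion above concrete, here is a small userspace sketch of
folding the "should wait" check and the wait itself into a single helper, so
each call site becomes one line. Everything here is an assumption for
illustration. The knob_* names and KNOB_SKETCH_ENABLED are hypothetical
stand-ins for the patch's knobs and CONFIG_FAIL_KERNFS_KNOBS, and a counter
stands in for the should_fail() check plus the completion wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for CONFIG_FAIL_KERNFS_KNOBS; set to 0 to compile waits out. */
#define KNOB_SKETCH_ENABLED 1

/* Mirrors the four wait knobs named in the patch. */
struct knob_write_iter {
	bool wait_at_start;
	bool wait_before_mutex;
	bool wait_after_mutex;
	bool wait_after_active;
};

static struct knob_write_iter knob_cfg;
static int knob_waits;	/* counts waits; replaces the real blocking wait */

#if KNOB_SKETCH_ENABLED
/* Single helper: evaluates the knob and performs the wait itself. */
static void knob_do_wait(bool enabled)
{
	if (!enabled)
		return;
	/* Kernel code would check should_fail() and then wait here. */
	knob_waits++;
}
#define knob_may_wait(when) knob_do_wait(knob_cfg.wait_##when)
#else
#define knob_may_wait(when) do { } while (0)
#endif

/* Models the shape of kernfs_fop_write_iter() with one-line call sites. */
static int knob_write_iter(int len)
{
	knob_may_wait(at_start);
	knob_may_wait(before_mutex);
	/* mutex_lock(&of->mutex) would go here */
	knob_may_wait(after_mutex);
	/* kernfs_get_active(of->kn) would go here */
	knob_may_wait(after_active);
	return len;
}
```

With the helper owning the wait, the instrumented routine gains four single
statements instead of four if-blocks, and the disabled case still compiles
to nothing.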
On Mon, Sep 27, 2021 at 09:37:58AM -0700, Luis Chamberlain wrote:
> This extends test_sysfs with support for using the failure injection
> wait completion and knobs to force a few race conditions which
> demonstrates that kernfs active reference protection is sufficient
> for kobject / device protection at higher layers.
>
> This adds 4 new tests which try to remove the device attribute
> store operation in 4 different situations:
>
> 1) at the start of kernfs_fop_write_iter()
> 2) before the of->mutex is held in kernfs_fop_write_iter()
> 3) after the of->mutex is held in kernfs_fop_write_iter()
> 4) after the kernfs node active reference is taken
>
> A write fails in all cases except the last one, test number #32. There
> is a good explanation for this: *once* kernfs_get_active() gets called
> we have a guarantee that the kernfs entry cannot be removed. If
> kernfs_get_active() succeeds that entry cannot be removed and so
> anything trying to remove that entry will have to wait. It is perhaps
> not obvious but since a sysfs write will trigger eventually a
> kernfs_get_active() call, and *only* if this succeeds will the sysfs
> op be called, this and the fact that you cannot remove the kernfs
> entry while the kernfs entry is active implies that a module that
> created the respective sysfs / kernfs entry *cannot* possibly be
> removed during a sysfs operation. And test number 32 provides us with
> proof of this. If it were not true test #32 should crash.
>
> No null dereferences are reproduced, even though this has been observed
> in some complex testing cases [0]. If this issue really exists we should
> have enough tools on the sysfs_test toolbox now to try to reproduce
> this easily without having to poke around other drivers. It very likely
> was the case that the issue reported [0] was possibly a side issue after
> the first bug which was zram specific. This is why it is important to
> isolate the issue and try to reproduce it in a generic form using the
> test_sysfs driver.
>
> [0] https://lkml.kernel.org/r/[email protected]
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> lib/Kconfig.debug | 3 +
> lib/test_sysfs.c | 31 +++++
> tools/testing/selftests/sysfs/config | 3 +
> tools/testing/selftests/sysfs/sysfs.sh | 175 +++++++++++++++++++++++++
> 4 files changed, 212 insertions(+)
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index a29b7d398c4e..176b822654e5 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2358,6 +2358,9 @@ config TEST_SYSFS
> depends on SYSFS
> depends on NET
> depends on BLOCK
> + select FAULT_INJECTION
> + select FAULT_INJECTION_DEBUG_FS
> + select FAIL_KERNFS_KNOBS
I don't like seeing "select" for user-configurable CONFIGs -- things
tend to end up weird. This should simply be:
depends on FAIL_KERNFS_KNOBS
> help
> This builds the "test_sysfs" module. This driver enables to test the
> sysfs file system safely without affecting production knobs which
> diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
> index 2043ca494af8..c6e62de61403 100644
> --- a/lib/test_sysfs.c
> +++ b/lib/test_sysfs.c
> @@ -38,6 +38,11 @@
> #include <linux/rtnetlink.h>
> #include <linux/genhd.h>
> #include <linux/blkdev.h>
> +#include <linux/kernfs.h>
> +
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
This isn't an optional config here (and following)?
> +MODULE_IMPORT_NS(KERNFS_DEBUG_PRIVATE);
> +#endif
>
> static bool enable_lock;
> module_param(enable_lock, bool_enable_only, 0644);
> @@ -82,6 +87,13 @@ static bool enable_verbose_rmmod;
> module_param(enable_verbose_rmmod, bool_enable_only, 0644);
> MODULE_PARM_DESC(enable_verbose_rmmod, "enable verbose print messages on rmmod");
>
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> +static bool enable_completion_on_rmmod;
> +module_param(enable_completion_on_rmmod, bool_enable_only, 0644);
> +MODULE_PARM_DESC(enable_completion_on_rmmod,
> + "enable sending a kernfs completion on rmmod");
> +#endif
> +
> static int sysfs_test_major;
>
> /**
> @@ -285,6 +297,12 @@ static ssize_t config_show(struct device *dev,
> "enable_verbose_writes:\t%s\n",
> enable_verbose_writes ? "true" : "false");
>
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> + len += snprintf(buf+len, PAGE_SIZE - len,
> + "enable_completion_on_rmmod:\t%s\n",
> + enable_completion_on_rmmod ? "true" : "false");
> +#endif
> +
> test_dev_config_unlock(test_dev);
>
> return len;
> @@ -904,10 +922,23 @@ static int __init test_sysfs_init(void)
> }
> module_init(test_sysfs_init);
>
> +#ifdef CONFIG_FAIL_KERNFS_KNOBS
> +/* The goal is to race our device removal with a pending kernfs -> store call */
> +static void test_sysfs_kernfs_send_completion_rmmod(void)
> +{
> + if (!enable_completion_on_rmmod)
> + return;
> + complete(&kernfs_debug_wait_completion);
> +}
> +#else
> +static inline void test_sysfs_kernfs_send_completion_rmmod(void) {}
> +#endif
> +
> static void __exit test_sysfs_exit(void)
> {
> if (enable_debugfs)
> debugfs_remove(debugfs_dir);
> + test_sysfs_kernfs_send_completion_rmmod();
> if (delay_rmmod_ms)
> msleep(delay_rmmod_ms);
> unregister_test_dev_sysfs(first_test_dev);
> diff --git a/tools/testing/selftests/sysfs/config b/tools/testing/selftests/sysfs/config
> index 9196f452ecd5..2876a229f95b 100644
> --- a/tools/testing/selftests/sysfs/config
> +++ b/tools/testing/selftests/sysfs/config
> @@ -1,2 +1,5 @@
> CONFIG_SYSFS=m
> CONFIG_TEST_SYSFS=m
> +CONFIG_FAULT_INJECTION=y
> +CONFIG_FAULT_INJECTION_DEBUG_FS=y
> +CONFIG_FAIL_KERNFS_KNOBS=y
> diff --git a/tools/testing/selftests/sysfs/sysfs.sh b/tools/testing/selftests/sysfs/sysfs.sh
> index b3f4c2236c7f..f928635d0e35 100755
> --- a/tools/testing/selftests/sysfs/sysfs.sh
> +++ b/tools/testing/selftests/sysfs/sysfs.sh
> @@ -62,6 +62,10 @@ ALL_TESTS="$ALL_TESTS 0025:1:1:test_dev_y:block"
> ALL_TESTS="$ALL_TESTS 0026:1:1:test_dev_y:block"
> ALL_TESTS="$ALL_TESTS 0027:1:0:test_dev_x:block" # deadlock test
> ALL_TESTS="$ALL_TESTS 0028:1:0:test_dev_x:block" # deadlock test with rntl_lock
> +ALL_TESTS="$ALL_TESTS 0029:1:1:test_dev_x:block" # kernfs race removal of store
> +ALL_TESTS="$ALL_TESTS 0030:1:1:test_dev_x:block" # kernfs race removal before mutex
> +ALL_TESTS="$ALL_TESTS 0031:1:1:test_dev_x:block" # kernfs race removal after mutex
> +ALL_TESTS="$ALL_TESTS 0032:1:1:test_dev_x:block" # kernfs race removal after active
>
> allow_user_defaults()
> {
> @@ -92,6 +96,9 @@ allow_user_defaults()
> if [ -z $SYSFS_DEBUGFS_DIR ]; then
> SYSFS_DEBUGFS_DIR="/sys/kernel/debug/test_sysfs"
> fi
> + if [ -z $KERNFS_DEBUGFS_DIR ]; then
> + KERNFS_DEBUGFS_DIR="/sys/kernel/debug/kernfs"
> + fi
> if [ -z $PAGE_SIZE ]; then
> PAGE_SIZE=$(getconf PAGESIZE)
> fi
> @@ -167,6 +174,14 @@ modprobe_reset_enable_rtnl_lock_on_rmmod()
> unset FIRST_MODPROBE_ARGS
> }
>
> +modprobe_reset_enable_completion()
> +{
> + FIRST_MODPROBE_ARGS="enable_completion_on_rmmod=1 enable_verbose_writes=1"
> + FIRST_MODPROBE_ARGS="$FIRST_MODPROBE_ARGS enable_verbose_rmmod=1 delay_rmmod_ms=0"
> + modprobe_reset
> + unset FIRST_MODPROBE_ARGS
> +}
> +
> load_req_mod()
> {
> modprobe_reset
> @@ -197,6 +212,63 @@ debugfs_reset_first_test_dev_ignore_errors()
> echo -n "1" >"$SYSFS_DEBUGFS_DIR"/reset_first_test_dev
> }
>
> +debugfs_kernfs_kernfs_fop_write_iter_exists()
> +{
> + KNOB_DIR="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter"
> + if [[ ! -d $KNOB_DIR ]]; then
> + echo "kernfs debugfs does not exist $KNOB_DIR"
> + return 0;
> + fi
> + KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
> + if [[ ! -d $KNOB_DEBUGFS ]]; then
> + echo -n "kernfs debugfs for configuring fail_kernfs_fop_write_iter "
> + echo "does not exist $KNOB_DEBUGFS"
> + return 0;
> + fi
> + return 1
> +}
> +
> +debugfs_kernfs_kernfs_fop_write_iter_set_fail_once()
> +{
> + KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
> + echo 1 > $KNOB_DEBUGFS/interval
> + echo 100 > $KNOB_DEBUGFS/probability
> + echo 0 > $KNOB_DEBUGFS/space
> + # Disable verbose messages on the kernel ring buffer which may
> + # confuse developers with a kernel panic.
> + echo 0 > $KNOB_DEBUGFS/verbose
> +
> + # Fail only once
> + echo 1 > $KNOB_DEBUGFS/times
> +}
> +
> +debugfs_kernfs_kernfs_fop_write_iter_set_fail_never()
> +{
> + KNOB_DEBUGFS="${KERNFS_DEBUGFS_DIR}/fail_kernfs_fop_write_iter"
> + echo 0 > $KNOB_DEBUGFS/times
> +}
> +
> +debugfs_kernfs_set_wait_ms()
> +{
> + SLEEP_AFTER_WAIT_MS="${KERNFS_DEBUGFS_DIR}/sleep_after_wait_ms"
> + echo $1 > $SLEEP_AFTER_WAIT_MS
> +}
> +
> +debugfs_kernfs_disable_wait_kernfs_fop_write_iter()
> +{
> + ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_"
> + for KNOB in ${ENABLE_WAIT_KNOB}*; do
> + echo 0 > $KNOB
> + done
> +}
> +
> +debugfs_kernfs_enable_wait_kernfs_fop_write_iter()
> +{
> + ENABLE_WAIT_KNOB="${KERNFS_DEBUGFS_DIR}/config_fail_kernfs_fop_write_iter/wait_$1"
> + echo -n "1" > $ENABLE_WAIT_KNOB
> + return $?
> +}
> +
> set_orig()
> {
> if [[ ! -z $TARGET ]] && [[ ! -z $ORIG ]]; then
> @@ -972,6 +1044,105 @@ sysfs_test_0028()
> fi
> }
>
> +sysfs_race_kernfs_kernfs_fop_write_iter()
> +{
> + TARGET="${DIR}/$(get_test_target $1)"
> + WAIT_AT=$2
> + EXPECT_WRITE_RETURNS=$3
> + MSDELAY=$4
> +
> + modprobe_reset_enable_completion
> + ORIG=$(cat "${TARGET}")
> + TEST_STR=$(( $ORIG + 1 ))
> +
> + echo -n "Test racing removal of sysfs store op with kernfs $WAIT_AT ... "
> +
> + if debugfs_kernfs_kernfs_fop_write_iter_exists; then
> + echo -n "skipping test as CONFIG_FAIL_KERNFS_KNOBS "
> + echo " or CONFIG_FAULT_INJECTION_DEBUG_FS is disabled"
> + return $ksft_skip
> + fi
> +
> + # Allow for failing the kernfs_kernfs_fop_write_iter call once,
> + # we'll provide exact context shortly afterwards.
> + debugfs_kernfs_kernfs_fop_write_iter_set_fail_once
> +
> + # First disable all waits
> + debugfs_kernfs_disable_wait_kernfs_fop_write_iter
> +
> + # Enable a wait_for_completion(&kernfs_debug_wait_completion) at the
> + # specified location inside the kernfs_fop_write_iter() routine
> + debugfs_kernfs_enable_wait_kernfs_fop_write_iter $WAIT_AT
> +
> + # Configure kernfs so that after its wait_for_completion() it
> + # will msleep() this amount of time and schedule(). We figure this
> + # will be sufficient time to allow for our module removal to complete.
> + debugfs_kernfs_set_wait_ms $MSDELAY
> +
> + # Now we trigger a kernfs write op, which will run kernfs_fop_write_iter,
> + # but will wait until our driver sends a respective completion
> + set_test_ignore_errors &
> + write_pid=$!
> +
> + # At this point kernfs_fop_write_iter() hasn't run our op, its
> + # waiting for our completion at the specified time $WAIT_AT.
> + # We now remove our module which will send a
> + # complete(&kernfs_debug_wait_completion) right before we deregister
> + # our device and the sysfs device attributes are removed.
> + #
> + # After the completion is sent, the test_sysfs driver races with
> + # kernfs to do the device deregistration with the kernfs msleep
> + # and schedule(). This should mean we've forced trying to remove the
> + # module prior to allowing kernfs to run our store operation. If the
> + # race did happen we'll panic with a null dereference on the store op.
> + #
> + # If no race happens we should see no write operation triggered.
> + modprobe -r $TEST_DRIVER > /dev/null 2>&1
> +
> + debugfs_kernfs_kernfs_fop_write_iter_set_fail_never
> +
> + wait $write_pid
> + if [[ $? -eq $EXPECT_WRITE_RETURNS ]]; then
> + echo "ok"
> + else
> + echo "FAIL" >&2
> + fi
> +}
> +
> +sysfs_test_0029()
> +{
> + for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
> + echo "Using delay-after-completion: $delay"
> + sysfs_race_kernfs_kernfs_fop_write_iter 0029 at_start 1 $delay
> + done
> +}
> +
> +sysfs_test_0030()
> +{
> + for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
> + echo "Using delay-after-completion: $delay"
> + sysfs_race_kernfs_kernfs_fop_write_iter 0030 before_mutex 1 $delay
> + done
> +}
> +
> +sysfs_test_0031()
> +{
> + for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
> + echo "Using delay-after-completion: $delay"
> + sysfs_race_kernfs_kernfs_fop_write_iter 0031 after_mutex 1 $delay
> + done
> +}
> +
> +# A write only succeeds *iff* a module removal happens *after* the
> +# kernfs active reference is obtained with kernfs_get_active().
> +sysfs_test_0032()
> +{
> + for delay in 0 2 4 8 16 32 64 128 246 512 1024; do
> + echo "Using delay-after-completion: $delay"
> + sysfs_race_kernfs_kernfs_fop_write_iter 0032 after_active 0 $delay
> + done
> +}
> +
> test_gen_desc()
> {
> echo -n "$1 x $(get_test_count $1)"
> @@ -1013,6 +1184,10 @@ list_tests()
> echo "$(test_gen_desc 0026) - block test writing y larger delay and resetting device"
> echo "$(test_gen_desc 0027) - test rmmod deadlock while writing x ... "
> echo "$(test_gen_desc 0028) - test rmmod deadlock using rtnl_lock while writing x ..."
> + echo "$(test_gen_desc 0029) - racing removal of store op with kernfs at start"
> + echo "$(test_gen_desc 0030) - racing removal of store op with kernfs before mutex"
> + echo "$(test_gen_desc 0031) - racing removal of store op with kernfs after mutex"
> + echo "$(test_gen_desc 0032) - racing removal of store op with kernfs after active"
> }
>
> usage()
> --
> 2.30.2
>
--
Kees Cook
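The commit log's argument rests on the kernfs active reference: once it is
taken, removal has to wait, and once removal has started, no new active
reference can be taken. The userspace sketch below models that with a single
counter that is pushed negative on deactivation. It is an illustrative model
only. The act_* names are hypothetical, and the bias constant is an
assumption chosen for the sketch, not a statement about kernfs internals.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed bias: large enough to keep the counter negative once applied. */
#define ACT_DEACTIVATED_BIAS (INT_MIN + 1)

struct act_node {
	atomic_int active;	/* >= 0 while live, negative once deactivated */
};

/* Take an active reference unless the node has been deactivated. */
static bool act_get(struct act_node *kn)
{
	int old = atomic_load(&kn->active);

	while (old >= 0) {
		/* On failure, old is reloaded and the sign is rechecked. */
		if (atomic_compare_exchange_weak(&kn->active, &old, old + 1))
			return true;
	}
	return false;
}

static void act_put(struct act_node *kn)
{
	atomic_fetch_sub(&kn->active, 1);
}

/* Start removal: from now on act_get() fails for new callers. */
static void act_deactivate(struct act_node *kn)
{
	atomic_fetch_add(&kn->active, ACT_DEACTIVATED_BIAS);
}

/* Removal may only finish once every earlier reference was dropped. */
static bool act_drained(struct act_node *kn)
{
	return atomic_load(&kn->active) == ACT_DEACTIVATED_BIAS;
}
```

This matches the behaviour test 0032 relies on: a writer that already holds
the reference keeps the entry alive until it drops it, while a writer that
arrives after deactivation is turned away.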
On Mon, Sep 27, 2021 at 09:37:59AM -0700, Luis Chamberlain wrote:
> There is quite a bit of tribal knowledge around proper use of
> try_module_get() and that it must be used only in a context which
> can ensure the module won't be gone during the operation. Document
> this little bit of tribal knowledge.
>
> I'm extending this tribal knowledge with new developments which it
> seems some folks do not yet believe to be true: we can be sure a
> module will exist during the lifetime of a sysfs file operation.
> For proof, refer to test_sysfs test #32:
>
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0032
>
> Without this being true, the write would fail or worse,
> a crash would happen, in this test. It does not.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> include/linux/module.h | 34 ++++++++++++++++++++++++++++++++--
> 1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/module.h b/include/linux/module.h
> index c9f1200b2312..22eacd5e1e85 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
> to handle the error case (which only happens with rmmod --wait). */
> extern void __module_get(struct module *module);
>
> -/* This is the Right Way to get a module: if it fails, it's being removed,
> - * so pretend it's not there. */
> +/**
> + * try_module_get() - yields to module removal and bumps refcnt otherwise
I find this hard to parse. How about:
"Take module refcount unless module is being removed"
> + * @module: the module we should check for
> + *
> + * This can be used to try to bump the reference count of a module, so to
> + * prevent module removal. The reference count of a module is not allowed
> + * to be incremented if the module is already being removed.
This I understand.
> + *
> + * Care must be taken to ensure the module cannot be removed during the call to
> + * try_module_get(). This can be done by having another entity other than the
> + * module itself increment the module reference count, or through some other
> + * means which guarantees the module could not be removed during an operation.
> + * An example of this latter case is using try_module_get() in a sysfs file
> + * which the module created. The sysfs store / read file operations are
> + * guaranteed to exist through the use of kernfs's active reference (see
> + * kernfs_active()). If a sysfs file operation is being run, the module which
> + * created it must still exist as the module is in charge of removing the same
> + * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
> + * unless the same file is not active.
I can't understand this paragraph at all. "Care must be taken ..."? Why?
Shouldn't callers of try_module_get() be satisfied with the results? I
don't follow the example at all. It seems to just say "sysfs store/read
functions don't need try_module_get() because whatever opened the sysfs
file is already keeping the module referenced." ?
> + *
> + * One of the real values to try_module_get() is the module_is_live() check
> + * which ensures the caller of try_module_get() can yield to userspace
> + * module removal requests and fail whatever it was about to process.
Please document the return value explicitly.
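The contract being asked for can be modelled roughly as below. This is a
single-threaded userspace sketch, not the kernel implementation: struct
model_module and the model_* names are made up for illustration, and the
real refcount handling (atomics, preemption) is left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct module; not the kernel type. */
enum model_state { MODEL_LIVE, MODEL_GOING };

struct model_module {
	enum model_state state;
	int refcnt;
};

/*
 * Returns true if a reference was taken, or if mod is NULL (there is no
 * module to pin). Returns false if the module is being removed; in that
 * case no reference is held and the caller must not call the put side.
 */
static bool model_try_module_get(struct model_module *mod)
{
	if (!mod)
		return true;
	if (mod->state != MODEL_LIVE)
		return false;	/* yield to removal */
	mod->refcnt++;
	return true;
}

/* Pairs with every successful model_try_module_get(). */
static void model_module_put(struct model_module *mod)
{
	if (!mod)
		return;
	assert(mod->refcnt > 0);
	mod->refcnt--;
}
```

A caller would bail out of whatever it was about to do on a false return,
and only call the put side after a true one.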
> + */
> extern bool try_module_get(struct module *module);
>
> +/**
> + * module_put() - release a reference count to a module
> + * @module: the module we should release a reference count for
> + *
> + * If you successfully bump a reference count to a module with try_module_get(),
> + * when you are finished you must call module_put() to release that reference
> + * count.
> + */
> extern void module_put(struct module *module);
>
> #else /*!CONFIG_MODULE_UNLOAD*/
> --
> 2.30.2
>
--
Kees Cook
On Mon, Sep 27, 2021 at 09:38:00AM -0700, Luis Chamberlain wrote:
> If one ends up extending this line checkpatch will complain about the
> use of S_IRWXUGO suggesting it is not preferred and that 0777
> should be used instead. Take the tip from checkpatch and do that
> change before we do our subsequent changes.
>
> This makes no functional changes.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
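The "no functional changes" claim rests on S_IRWXUGO being the OR of the
user, group and other rwx bits. A small userspace check of that identity,
using the POSIX bits from <sys/stat.h> (sketch_irwxugo() is a made-up name,
the kernel macro itself is not visible from userspace):

```c
#include <assert.h>
#include <sys/stat.h>

/* Userspace spelling of what the kernel calls S_IRWXUGO. */
static unsigned int sketch_irwxugo(void)
{
	return S_IRWXU | S_IRWXG | S_IRWXO;
}

/* Compile-time check that the symbolic form and the octal form agree. */
static_assert((S_IRWXU | S_IRWXG | S_IRWXO) == 0777,
	      "rwx for user, group and other is 0777");
```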
> ---
> fs/kernfs/symlink.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
> index c8f8e41b8411..19a6c71c6ff5 100644
> --- a/fs/kernfs/symlink.c
> +++ b/fs/kernfs/symlink.c
> @@ -36,8 +36,7 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
> gid = target->iattr->ia_gid;
> }
>
> - kn = kernfs_new_node(parent, name, S_IFLNK|S_IRWXUGO, uid, gid,
> - KERNFS_LINK);
> + kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
> if (!kn)
> return ERR_PTR(-ENOMEM);
>
> --
> 2.30.2
>
--
Kees Cook
On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> When driver sysfs attributes use a lock also used on module removal we
> can race to deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have module removal call
> trigger. The module removal call code holds a lock, and then the
> driver's sysfs file entry waits for the same lock. While holding the
> lock the module removal tries to remove the sysfs entries, but these
> cannot be removed yet as one is waiting for a lock. This won't complete
> as the lock is already held. Likewise module removal cannot complete,
> and so we deadlock.
>
> This is now easily reproducible with our sysfs selftest as follows:
>
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
>
> This uses a local driver lock. Test 0028 can also be used, that uses
> the rtnl_lock():
>
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
>
> To fix this we extend the struct kernfs_node with a module reference
> and use the try_module_get() after kernfs_get_active() is called. As
I would agree: kernfs must know about the module containing the ops
structure it has been given. (Without this, there are, at the very least,
removal races for looking at kernfs_ops structures.)
In other places in the kernel, function callback dependencies are more
explicit in that if code is holding such things, it has already taken a
module reference, etc. But kernfs is special in the sense that just
because a kernfs entry exists, we don't want to pin the module use count
too.
But simple locking isn't workable to solve this because kernfs_remove()
must be able to be called from a module_exit routine without deadlocking.
(i.e. we would create exactly the situation that caused this condition
to get noticed in the first place.)
> documented in the prior patch, we now know that once kernfs_get_active()
> is called the module is implicitly guarded to exist and cannot be removed.
> This is because the module is the one in charge of removing the same
> sysfs file it created, and removal of sysfs files on module exit will wait
> until they don't have any active references. By using a try_module_get()
> after kernfs_get_active() we yield to let module removal trump calls to
> process a sysfs operation, while also preventing module removal if a sysfs
> operation is already in progress. This prevents the deadlock.
>
> This deadlock was first reported with the zram driver, however the live
> patching folks have acknowledged they have observed this as well with
> live patching, when a live patch is removed. I was then able to
> reproduce easily by creating a dedicated selftest for it.
>
> A sketch of how this can happen follows, consider foo a local mutex
> part of a driver, and used on the driver's module exit routine and
> on one of its sysfs ops:
>
> foo.c:
> static DEFINE_MUTEX(foo);
> static ssize_t foo_store(struct device *dev,
> struct device_attribute *attr,
> const char *buf, size_t count)
> {
> ...
> mutex_lock(&foo);
> ...
> mutex_unlock(&foo);
> ...
> }
> static DEVICE_ATTR_RW(foo);
> ...
> void foo_exit(void)
> {
> mutex_lock(&foo);
> ...
> mutex_unlock(&foo);
> }
> module_exit(foo_exit);
>
> And this can lead to this condition:
>
> CPU A CPU B
> foo_store()
> foo_exit()
> mutex_lock(&foo)
> mutex_lock(&foo)
> del_gendisk(some_struct->disk);
> device_del()
> device_remove_groups()
Please expand this further: what does device_remove_groups() end up
waiting for that never happens?
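As far as I can tell the wait is in kernfs removal, which drains active
references before a node can go away, so the two sides can be modelled as
plain state. The sketch below is a single-threaded userspace model with
made-up names (struct model, model_*); it only tracks who would block on
what, with and without the try_module_get() step described in the commit
log.

```c
#include <assert.h>
#include <stdbool.h>

/* All names here are hypothetical; this models states, not kernel code. */
struct model {
	bool foo_locked;	/* driver mutex "foo" is held by foo_exit() */
	int active;		/* kernfs active references on the attribute */
	bool module_live;	/* cleared once rmmod starts tearing down */
	int modrefs;		/* module reference count */
};

/* rmmod: refuse if the module is pinned, otherwise mark it as going. */
static bool model_rmmod_begin(struct model *m)
{
	if (m->modrefs > 0)
		return false;	/* rmmod fails, module stays loaded */
	m->module_live = false;
	return true;
}

/* foo_exit() takes the driver lock; only reachable once rmmod began. */
static void model_exit_lock(struct model *m)
{
	assert(!m->module_live);
	m->foo_locked = true;
}

/* Start of a sysfs write: returns true if foo_store() gets to run. */
static bool model_store_begin(struct model *m, bool with_fix)
{
	m->active++;			/* kernfs_get_active() */
	if (with_fix) {
		if (!m->module_live) {	/* try_module_get() fails */
			m->active--;	/* back out, the write fails */
			return false;
		}
		m->modrefs++;
	}
	return true;
}

/*
 * Deadlock: foo_store() is blocked on foo while foo_exit(), holding foo,
 * waits for the attribute's active references to drop to zero.
 */
static bool model_deadlocked(const struct model *m, bool store_running)
{
	return store_running && m->foo_locked && m->active > 0;
}
```

Without the fix, a store that already holds an active reference and an exit
path that already holds foo block each other. With it, either the store
backs out (rmmod won) or rmmod is refused (the store won).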
>
> In this situation foo_store() is waiting for the mutex foo to
> become unlocked, but that won't happen until module removal is complete.
> But module removal won't complete until the sysfs file being poked at
> completes which is waiting for a lock already held.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
> fs/kernfs/dir.c | 44 ++++++++++++++++++----
> fs/kernfs/file.c | 6 ++-
> fs/kernfs/kernfs-internal.h | 3 +-
> fs/kernfs/symlink.c | 3 +-
> fs/sysfs/dir.c | 2 +-
> fs/sysfs/file.c | 6 ++-
> fs/sysfs/group.c | 3 +-
> include/linux/kernfs.h | 14 ++++---
> include/linux/sysfs.h | 52 ++++++++++++++++++++------
> kernel/cgroup/cgroup.c | 2 +-
> 11 files changed, 105 insertions(+), 34 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index b57b3db9a6a7..4edf3b37fd2c 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
>
> kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
> GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> - 0, rft->kf_ops, rft, NULL, NULL);
> + 0, rft->kf_ops, rft, NULL, NULL, NULL);
> if (IS_ERR(kn))
> return PTR_ERR(kn);
>
> @@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
>
> kn = __kernfs_create_file(parent_kn, name, 0444,
> GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
> - &kf_mondata_ops, priv, NULL, NULL);
> + &kf_mondata_ops, priv, NULL, NULL, NULL);
> if (IS_ERR(kn))
> return PTR_ERR(kn);
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index ba581429bf7b..e841201fd11b 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -14,6 +14,7 @@
> #include <linux/slab.h>
> #include <linux/security.h>
> #include <linux/hash.h>
> +#include <linux/module.h>
>
> #include "kernfs-internal.h"
>
> @@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
> */
> struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
> {
> + int v;
> +
> if (unlikely(!kn))
> return NULL;
>
> if (!atomic_inc_unless_negative(&kn->active))
> return NULL;
>
> + /*
> + * If a module created the kernfs_node, the module cannot possibly be
> + * removed if the above atomic_inc_unless_negative() succeeded. So the
> + * try_module_get() below is not to protect the lifetime of the module
> + * as that is already guaranteed. The try_module_get() below is used
> + * to ensure that we don't deadlock in case a kernfs operation and
> + * module removal used a shared lock.
> + */
> + if (!try_module_get(kn->owner)) {
> + v = atomic_dec_return(&kn->active);
> + if (unlikely(v == KN_DEACTIVATED_BIAS))
> + wake_up_all(&kernfs_root(kn)->deactivate_waitq);
> + return NULL;
> + }
The special casing in here makes me think this isn't happening in the right
place. (i.e. this looks like an open-coded version of kernfs_put_active())
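One way to avoid the open coding, sketched as a userspace model with C11
atomics: factor the "drop active ref and wake the remover" tail into a
helper shared by both paths. Everything here (struct model_node, the
model_* functions, the wakeups counter standing in for wake_up_all()) is
hypothetical, and where the module put sits relative to the decrement is a
judgment call in this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DEACTIVATED_BIAS	(INT_MIN + 1)	/* mirrors KN_DEACTIVATED_BIAS */

struct model_node {
	atomic_int active;
	bool owner_live;	/* stand-in for the try_module_get() outcome */
	int owner_refs;		/* stand-in for the module refcount */
	int wakeups;		/* counts would-be wake_up_all() calls */
};

/* Shared tail: drop the active ref, wake a waiting remover if needed. */
static void model_drop_active(struct model_node *kn)
{
	int v = atomic_fetch_sub(&kn->active, 1) - 1;

	if (v == MODEL_DEACTIVATED_BIAS)
		kn->wakeups++;
}

static struct model_node *model_get_active(struct model_node *kn)
{
	int v;

	if (!kn)
		return NULL;

	/* Equivalent of atomic_inc_unless_negative(). */
	v = atomic_load(&kn->active);
	do {
		if (v < 0)
			return NULL;
	} while (!atomic_compare_exchange_weak(&kn->active, &v, v + 1));

	if (!kn->owner_live) {		/* module is going away */
		model_drop_active(kn);
		return NULL;
	}
	kn->owner_refs++;
	return kn;
}

static void model_put_active(struct model_node *kn)
{
	if (!kn)
		return;
	assert(kn->owner_refs > 0);
	kn->owner_refs--;		/* module_put() equivalent */
	model_drop_active(kn);
}
```

The failure path in model_get_active() cannot simply call
model_put_active(), since no module reference was taken there; that is the
reason for the separate helper.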
> +
> if (kernfs_lockdep(kn))
> rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
> return kn;
> @@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
> if (kernfs_lockdep(kn))
> rwsem_release(&kn->dep_map, _RET_IP_);
> v = atomic_dec_return(&kn->active);
> +
> + /*
> + * We prevent module exit *until* we know for sure all possible
> + * kernfs ops are done.
> + */
> + module_put(kn->owner);
> +
> if (likely(v != KN_DEACTIVATED_BIAS))
> return;
What I don't understand, however, is what kernfs_get/put_active() is
intending to do -- it looks like it's trying to provide an interruption
point for open kernfs file operations?
This all seems extremely complex for what seems like it should just be a
global "am I being removed?" bool?
Regardless, while I do see the logic of associating the module get/put
with get/put of kernfs "active", why is it not better tied to strictly
kernfs open/close? That would seem to be much simpler and not require
any special handling?
For example, why does this not work?
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 60e2a86c535e..e44502ac244d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -525,6 +525,9 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
{
struct kernfs_open_node *on, *new_on = NULL;
+ if (!try_module_get(kn->owner))
+ return -ENODEV;
+
retry:
mutex_lock(&kernfs_open_file_mutex);
spin_lock_irq(&kernfs_open_node_lock);
@@ -592,6 +595,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
mutex_unlock(&kernfs_open_file_mutex);
kfree(on);
+ module_put(kn->owner);
}
static int kernfs_fop_open(struct inode *inode, struct file *file)
@@ -719,6 +723,7 @@ static int kernfs_fop_open(struct inode *inode, struct file *file)
kfree(of);
err_out:
kernfs_put_active(kn);
+ module_put(kn->owner);
return error;
}
>
> @@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> struct kernfs_node *parent,
> const char *name, umode_t mode,
> kuid_t uid, kgid_t gid,
> - unsigned flags)
> + unsigned flags,
> + struct module *owner)
> {
> struct kernfs_node *kn;
> u32 id_highbits;
> @@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> kn->name = name;
> kn->mode = mode;
> kn->flags = flags;
> + kn->owner = owner;
>
> if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
> struct iattr iattr = {
> @@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
> const char *name, umode_t mode,
> kuid_t uid, kgid_t gid,
> - unsigned flags)
> + unsigned flags,
> + struct module *owner)
> {
> struct kernfs_node *kn;
>
> kn = __kernfs_new_node(kernfs_root(parent), parent,
> - name, mode, uid, gid, flags);
> + name, mode, uid, gid, flags, owner);
> if (kn) {
> kernfs_get(parent);
> kn->parent = parent;
> @@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>
> kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
> GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> - KERNFS_DIR);
> + KERNFS_DIR, NULL);
> if (!kn) {
> idr_destroy(&root->ino_idr);
> kfree(root);
> @@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root)
> * @gid: gid of the new directory
> * @priv: opaque data associated with the new directory
> * @ns: optional namespace tag of the directory
> + * @owner: if set, the module owner of this directory
> *
> * Returns the created node on success, ERR_PTR() value on failure.
> */
> struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
> const char *name, umode_t mode,
> kuid_t uid, kgid_t gid,
> - void *priv, const void *ns)
> + void *priv, const void *ns,
> + struct module *owner)
> {
> struct kernfs_node *kn;
> int rc;
>
> /* allocate */
> kn = kernfs_new_node(parent, name, mode | S_IFDIR,
> - uid, gid, KERNFS_DIR);
> + uid, gid, KERNFS_DIR, owner);
> if (!kn)
> return ERR_PTR(-ENOMEM);
>
> @@ -1014,7 +1044,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
>
> /* allocate */
> kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
> - GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
> + GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR, NULL);
> if (!kn)
> return ERR_PTR(-ENOMEM);
>
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 4479c6580333..0e125287e050 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = {
> * @priv: private data for the file
> * @ns: optional namespace tag of the file
> * @key: lockdep key for the file's active_ref, %NULL to disable lockdep
> + * @owner: if set, the module owner of the file
> *
> * Returns the created node on success, ERR_PTR() value on error.
> */
> @@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
> loff_t size,
> const struct kernfs_ops *ops,
> void *priv, const void *ns,
> - struct lock_class_key *key)
> + struct lock_class_key *key,
> + struct module *owner)
> {
> struct kernfs_node *kn;
> unsigned flags;
> @@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
> flags = KERNFS_FILE;
>
> kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
> - uid, gid, flags);
> + uid, gid, flags, owner);
> if (!kn)
> return ERR_PTR(-ENOMEM);
>
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index 9e3abf597e2d..6d275d661987 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn);
> struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
> const char *name, umode_t mode,
> kuid_t uid, kgid_t gid,
> - unsigned flags);
> + unsigned flags,
> + struct module *owner);
>
> /*
> * file.c
> diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
> index 19a6c71c6ff5..5a053eebee52 100644
> --- a/fs/kernfs/symlink.c
> +++ b/fs/kernfs/symlink.c
> @@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
> gid = target->iattr->ia_gid;
> }
>
> - kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
> + kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
> + target->owner);
> if (!kn)
> return ERR_PTR(-ENOMEM);
>
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index b6b6796e1616..9763c2fde3c7 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
> kobject_get_ownership(kobj, &uid, &gid);
>
> kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
> - kobj, ns);
> + kobj, ns, NULL);
> if (IS_ERR(kn)) {
> if (PTR_ERR(kn) == -EEXIST)
> sysfs_warn_dup(parent, kobject_name(kobj));
> diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
> index 42dcf96881b6..af9e91fd1a92 100644
> --- a/fs/sysfs/file.c
> +++ b/fs/sysfs/file.c
> @@ -292,7 +292,8 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
> #endif
>
> kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
> - PAGE_SIZE, ops, (void *)attr, ns, key);
> + PAGE_SIZE, ops, (void *)attr, ns, key,
> + attr->owner);
> if (IS_ERR(kn)) {
> if (PTR_ERR(kn) == -EEXIST)
> sysfs_warn_dup(parent, attr->name);
> @@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
> #endif
>
> kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
> - battr->size, ops, (void *)attr, ns, key);
> + battr->size, ops, (void *)attr, ns, key,
> + attr->owner);
> if (IS_ERR(kn)) {
> if (PTR_ERR(kn) == -EEXIST)
> sysfs_warn_dup(parent, attr->name);
> diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
> index eeb0e3099421..372864d1cb54 100644
> --- a/fs/sysfs/group.c
> +++ b/fs/sysfs/group.c
> @@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
> } else {
> kn = kernfs_create_dir_ns(kobj->sd, grp->name,
> S_IRWXU | S_IRUGO | S_IXUGO,
> - uid, gid, kobj, NULL);
> + uid, gid, kobj, NULL,
> + grp->owner);
> if (IS_ERR(kn)) {
> if (PTR_ERR(kn) == -EEXIST)
> sysfs_warn_dup(kobj->sd, grp->name);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index cd968ee2b503..818b00ebd107 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -161,6 +161,7 @@ struct kernfs_node {
> unsigned short flags;
> umode_t mode;
> struct kernfs_iattrs *iattr;
> + struct module *owner;
> };
>
> /*
> @@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
> struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
> const char *name, umode_t mode,
> kuid_t uid, kgid_t gid,
> - void *priv, const void *ns);
> + void *priv, const void *ns,
> + struct module *owner);
> struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
> const char *name);
> struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
> @@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
> loff_t size,
> const struct kernfs_ops *ops,
> void *priv, const void *ns,
> - struct lock_class_key *key);
> + struct lock_class_key *key,
> + struct module *owner);
> struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
> const char *name,
> struct kernfs_node *target);
> @@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
> static inline struct kernfs_node *
> kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
> umode_t mode, kuid_t uid, kgid_t gid,
> - void *priv, const void *ns)
> + void *priv, const void *ns, struct module *owner)
> { return ERR_PTR(-ENOSYS); }
>
> static inline struct kernfs_node *
> __kernfs_create_file(struct kernfs_node *parent, const char *name,
> umode_t mode, kuid_t uid, kgid_t gid,
> loff_t size, const struct kernfs_ops *ops,
> - void *priv, const void *ns, struct lock_class_key *key)
> + void *priv, const void *ns, struct lock_class_key *key,
> + struct module *owner)
> { return ERR_PTR(-ENOSYS); }
>
> static inline struct kernfs_node *
> @@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
> {
> return kernfs_create_dir_ns(parent, name, mode,
> GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> - priv, NULL);
> + priv, NULL, parent->owner);
> }
>
> static inline int kernfs_remove_by_name(struct kernfs_node *parent,
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index e3f1e8ac1f85..babbabb460dc 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -30,6 +30,7 @@ enum kobj_ns_type;
> struct attribute {
> const char *name;
> umode_t mode;
> + struct module *owner;
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> bool ignore_lockdep:1;
> struct lock_class_key *key;
> @@ -80,6 +81,7 @@ do { \
> * @attrs: Pointer to NULL terminated list of attributes.
> * @bin_attrs: Pointer to NULL terminated list of binary attributes.
> * Either attrs or bin_attrs or both must be provided.
> + * @owner: If set, module responsible for this attribute group
> */
> struct attribute_group {
> const char *name;
> @@ -89,6 +91,7 @@ struct attribute_group {
> struct bin_attribute *, int);
> struct attribute **attrs;
> struct bin_attribute **bin_attrs;
> + struct module *owner;
> };
>
> /*
> @@ -100,38 +103,52 @@ struct attribute_group {
>
> #define __ATTR(_name, _mode, _show, _store) { \
> .attr = {.name = __stringify(_name), \
> - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
> + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
> + .owner = THIS_MODULE, \
> + }, \
> .show = _show, \
> .store = _store, \
> }
>
> #define __ATTR_PREALLOC(_name, _mode, _show, _store) { \
> .attr = {.name = __stringify(_name), \
> - .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
> + .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
> + .owner = THIS_MODULE, \
> + }, \
> .show = _show, \
> .store = _store, \
> }
>
> #define __ATTR_RO(_name) { \
> - .attr = { .name = __stringify(_name), .mode = 0444 }, \
> + .attr = { .name = __stringify(_name), \
> + .mode = 0444, \
> + .owner = THIS_MODULE, \
> + }, \
> .show = _name##_show, \
> }
>
> #define __ATTR_RO_MODE(_name, _mode) { \
> .attr = { .name = __stringify(_name), \
> - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
> + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
> + .owner = THIS_MODULE, \
> + }, \
> .show = _name##_show, \
> }
>
> #define __ATTR_RW_MODE(_name, _mode) { \
> .attr = { .name = __stringify(_name), \
> - .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
> + .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
> + .owner = THIS_MODULE, \
> + }, \
> .show = _name##_show, \
> .store = _name##_store, \
> }
>
> #define __ATTR_WO(_name) { \
> - .attr = { .name = __stringify(_name), .mode = 0200 }, \
> + .attr = { .name = __stringify(_name), \
> + .mode = 0200, \
> + .owner = THIS_MODULE, \
> + }, \
> .store = _name##_store, \
> }
>
> @@ -141,8 +158,11 @@ struct attribute_group {
>
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> #define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) { \
> - .attr = {.name = __stringify(_name), .mode = _mode, \
> - .ignore_lockdep = true }, \
> + .attr = {.name = __stringify(_name), \
> + .mode = _mode, \
> + .ignore_lockdep = true, \
> + .owner = THIS_MODULE, \
> + }, \
> .show = _show, \
> .store = _store, \
> }
> @@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = { \
> #define ATTRIBUTE_GROUPS(_name) \
> static const struct attribute_group _name##_group = { \
> .attrs = _name##_attrs, \
> + .owner = THIS_MODULE, \
> }; \
> __ATTRIBUTE_GROUPS(_name)
>
> @@ -199,20 +220,29 @@ struct bin_attribute {
>
> /* macros to create static binary attributes easier */
> #define __BIN_ATTR(_name, _mode, _read, _write, _size) { \
> - .attr = { .name = __stringify(_name), .mode = _mode }, \
> + .attr = { .name = __stringify(_name), \
> + .mode = _mode, \
> + .owner = THIS_MODULE, \
> + }, \
> .read = _read, \
> .write = _write, \
> .size = _size, \
> }
>
> #define __BIN_ATTR_RO(_name, _size) { \
> - .attr = { .name = __stringify(_name), .mode = 0444 }, \
> + .attr = { .name = __stringify(_name), \
> + .mode = 0444, \
> + .owner = THIS_MODULE, \
> + }, \
> .read = _name##_read, \
> .size = _size, \
> }
>
> #define __BIN_ATTR_WO(_name, _size) { \
> - .attr = { .name = __stringify(_name), .mode = 0200 }, \
> + .attr = { .name = __stringify(_name), \
> + .mode = 0200, \
> + .owner = THIS_MODULE, \
> + }, \
> .write = _name##_write, \
> .size = _size, \
> }
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9e0390000025..c6b0a28f599c 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
> cgroup_file_mode(cft),
> GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> 0, cft->kf_ops, cft,
> - NULL, key);
> + NULL, key, NULL);
> if (IS_ERR(kn))
> return PTR_ERR(kn);
>
> --
> 2.30.2
>
--
Kees Cook
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> Provide a simple state machine to fix races with driver exit where we
> remove the CPU multistate callbacks and re-initialization / creation of
> new per CPU instances which should be managed by these callbacks.
>
> The zram driver makes use of cpu hotplug multistate support, whereby it
> associates a struct zcomp per CPU. Each struct zcomp represents a
> compression algorithm in charge of managing compression streams per
> CPU. Although a compiled zram driver only supports a fixed set of
> compression algorithms, each zram device gets a struct zcomp allocated
> per CPU. The "multi" in CPU hotplug multistate refers to these per
> cpu struct zcomp instances. Each of these will have the CPU hotplug
> callback called for it on CPU plug / unplug. The kernel's CPU hotplug
> multistate keeps a linked list of these different structures so that
> it will iterate over them on CPU transitions.
>
> By default at driver initialization we will create just one zram device
> (num_devices=1) and a zcomp structure is then set up for the now default
> lzo-rle compression algorithm. At driver removal we first remove each
> zram device, and so we destroy the associated struct zcomp per CPU. But
> since we expose sysfs attributes to create new devices or reset /
> initialize existing zram devices, we can easily end up re-initializing
> a struct zcomp for a zram device before the exit routine of the module
> removes the cpu hotplug callback. When this happens the kernel's CPU
> hotplug will detect that at least one instance (struct zcomp for us)
> exists. This can happen in the following situation:
>
> CPU 1 CPU 2
>
> disksize_store(...);
> class_unregister(...);
> idr_for_each(...);
> zram_debugfs_destroy();
>
> idr_destroy(...);
> unregister_blkdev(...);
> cpuhp_remove_multi_state(...);
So this is strictly separate from the sysfs/module unloading race?
-Kees
>
> The warning comes up on cpuhp_remove_multi_state() when it sees that the
> state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
> In this case a struct zcomp still exists: the driver allowed its
> creation per CPU even though we could have just freed them per CPU
> through a call on another CPU, and we are then later trying to remove the
> hotplug callback.
>
> Fix all this by providing a zram initialization boolean
> protected by the driver's shared zram_index_mutex, which we
> can use to annotate when sysfs attributes are safe to use or
> not -- once the driver is properly initialized. When the driver
> is going down we also are sure to not let userspace muck with
> attributes which may affect each per cpu struct zcomp.
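The gating described here can be sketched in userspace as below. The tiny
spinlock stands in for zram_index_mutex, model_up for zram_up, and
model_instances for the per-device zcomp instances; all of these names are
made up, and the "store" is reduced to bumping a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Tiny spinlock standing in for zram_index_mutex; illustration only. */
static atomic_flag model_index_lock = ATOMIC_FLAG_INIT;
static bool model_up;		/* plays the role of zram_up */
static int model_instances;	/* zcomp instances currently registered */

static void model_lock(void)
{
	while (atomic_flag_test_and_set(&model_index_lock))
		;	/* spin until the flag is released */
}

static void model_unlock(void)
{
	atomic_flag_clear(&model_index_lock);
}

/* Models zram_init(): create the default device, then open the gate. */
static void model_init(void)
{
	model_lock();
	model_instances = 1;
	model_up = true;
	model_unlock();
}

/* Models disksize_store(): may create an instance, but only while up. */
static int model_disksize_store(void)
{
	int err = 0;

	model_lock();
	if (!model_up)
		err = -ENODEV;
	else
		model_instances++;
	model_unlock();
	return err;
}

/* Models destroy_devices(): close the gate first, then tear down. */
static void model_destroy(void)
{
	model_lock();
	model_up = false;
	model_instances = 0;	/* every device removed */
	/* The hotplug state removal would warn if anything were left. */
	assert(model_instances == 0);
	model_unlock();
}
```

Because the flag is checked under the same lock the teardown path holds, a
store that arrives after teardown has started sees -ENODEV instead of
creating a new instance.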
>
> This also fixes a series of possible memory leaks. The
> crashes and memory leaks can easily be caused by issuing
> the zram02.sh script from the LTP project [0] in a loop
> in two separate windows:
>
> cd testcases/kernel/device-drivers/zram
> while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
>
> You end up with a splat as follows:
>
> kernel: zram: Removed device: zram0
> kernel: zram: Added device: zram0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: Adding 104857596k swap on /dev/zram0. <etc>
> kernel: zram0: detected capacity change from 209715200 to 0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: ------------[ cut here ]------------
> kernel: Error: Removing state 63 which has instances left.
> kernel: WARNING: CPU: 7 PID: 70457 at \
> kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G \
> E 5.12.0-rc1-next-20210304 #3
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
> BIOS 1.14.0-2 04/01/2014
> kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Code: <etc>
> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
> kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
> kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
> kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffa800c139bcb8
> kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
> kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
> kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
> kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
> kernel: Call Trace:
> kernel: __cpuhp_remove_state+0x2e/0x80
> kernel: __do_sys_delete_module+0x190/0x2a0
> kernel: do_syscall_64+0x33/0x80
> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> The "Error: Removing state 63 which has instances left" refers
> to the zram per CPU struct zcomp instances left.
>
> [0] https://github.com/linux-test-project/ltp.git
>
> Acked-by: Minchan Kim <[email protected]>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> drivers/block/zram/zram_drv.c | 63 ++++++++++++++++++++++++++++++-----
> 1 file changed, 55 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index f61910c65f0f..b26abcb955cc 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -44,6 +44,8 @@ static DEFINE_MUTEX(zram_index_mutex);
> static int zram_major;
> static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
>
> +static bool zram_up;
> +
> /* Module params (documentation at end) */
> static unsigned int num_devices = 1;
> /*
> @@ -1704,6 +1706,7 @@ static void zram_reset_device(struct zram *zram)
> comp = zram->comp;
> disksize = zram->disksize;
> zram->disksize = 0;
> + zram->comp = NULL;
>
> set_capacity_and_notify(zram->disk, 0);
> part_stat_set_all(zram->disk->part0, 0);
> @@ -1724,9 +1727,18 @@ static ssize_t disksize_store(struct device *dev,
> struct zram *zram = dev_to_zram(dev);
> int err;
>
> + mutex_lock(&zram_index_mutex);
> +
> + if (!zram_up) {
> + err = -ENODEV;
> + goto out;
> + }
> +
> disksize = memparse(buf, NULL);
> - if (!disksize)
> - return -EINVAL;
> + if (!disksize) {
> + err = -EINVAL;
> + goto out;
> + }
>
> down_write(&zram->init_lock);
> if (init_done(zram)) {
> @@ -1754,12 +1766,16 @@ static ssize_t disksize_store(struct device *dev,
> set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
> up_write(&zram->init_lock);
>
> + mutex_unlock(&zram_index_mutex);
> +
> return len;
>
> out_free_meta:
> zram_meta_free(zram, disksize);
> out_unlock:
> up_write(&zram->init_lock);
> +out:
> + mutex_unlock(&zram_index_mutex);
> return err;
> }
>
> @@ -1775,8 +1791,17 @@ static ssize_t reset_store(struct device *dev,
> if (ret)
> return ret;
>
> - if (!do_reset)
> - return -EINVAL;
> + mutex_lock(&zram_index_mutex);
> +
> + if (!zram_up) {
> + len = -ENODEV;
> + goto out;
> + }
> +
> + if (!do_reset) {
> + len = -EINVAL;
> + goto out;
> + }
>
> zram = dev_to_zram(dev);
> bdev = zram->disk->part0;
> @@ -1785,7 +1810,8 @@ static ssize_t reset_store(struct device *dev,
> /* Do not reset an active device or claimed device */
> if (bdev->bd_openers || zram->claim) {
> mutex_unlock(&bdev->bd_disk->open_mutex);
> - return -EBUSY;
> + len = -EBUSY;
> + goto out;
> }
>
> /* From now on, anyone can't open /dev/zram[0-9] */
> @@ -1800,6 +1826,8 @@ static ssize_t reset_store(struct device *dev,
> zram->claim = false;
> mutex_unlock(&bdev->bd_disk->open_mutex);
>
> +out:
> + mutex_unlock(&zram_index_mutex);
> return len;
> }
>
> @@ -2010,6 +2038,10 @@ static ssize_t hot_add_show(struct class *class,
> int ret;
>
> mutex_lock(&zram_index_mutex);
> + if (!zram_up) {
> + mutex_unlock(&zram_index_mutex);
> + return -ENODEV;
> + }
> ret = zram_add();
> mutex_unlock(&zram_index_mutex);
>
> @@ -2037,6 +2069,11 @@ static ssize_t hot_remove_store(struct class *class,
>
> mutex_lock(&zram_index_mutex);
>
> + if (!zram_up) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> zram = idr_find(&zram_index_idr, dev_id);
> if (zram) {
> ret = zram_remove(zram);
> @@ -2046,6 +2083,7 @@ static ssize_t hot_remove_store(struct class *class,
> ret = -ENODEV;
> }
>
> +out:
> mutex_unlock(&zram_index_mutex);
> return ret ? ret : count;
> }
> @@ -2072,12 +2110,15 @@ static int zram_remove_cb(int id, void *ptr, void *data)
>
> static void destroy_devices(void)
> {
> + mutex_lock(&zram_index_mutex);
> + zram_up = false;
> class_unregister(&zram_control_class);
> idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> zram_debugfs_destroy();
> idr_destroy(&zram_index_idr);
> unregister_blkdev(zram_major, "zram");
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + mutex_unlock(&zram_index_mutex);
> }
>
> static int __init zram_init(void)
> @@ -2105,15 +2146,21 @@ static int __init zram_init(void)
> return -EBUSY;
> }
>
> + mutex_lock(&zram_index_mutex);
> +
> while (num_devices != 0) {
> - mutex_lock(&zram_index_mutex);
> ret = zram_add();
> - mutex_unlock(&zram_index_mutex);
> - if (ret < 0)
> + if (ret < 0) {
> + mutex_unlock(&zram_index_mutex);
> goto out_error;
> + }
> num_devices--;
> }
>
> + zram_up = true;
> +
> + mutex_unlock(&zram_index_mutex);
> +
> return 0;
>
> out_error:
> --
> 2.30.2
>
--
Kees Cook
On Tue, Oct 05, 2021 at 01:55:35PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> > Provide a simple state machine to fix races with driver exit where we
> > remove the CPU multistate callbacks and re-initialization / creation of
> > new per CPU instances which should be managed by these callbacks.
> >
> > The zram driver makes use of cpu hotplug multistate support, whereby it
> > associates a struct zcomp per CPU. Each struct zcomp represents a
> > compression algorithm in charge of managing compression streams per
> > CPU. Although a compiled zram driver only supports a fixed set of
> > compression algorithms, each zram device gets a struct zcomp allocated
> > per CPU. The "multi" in CPU hotplug multistate refers to these per
> > cpu struct zcomp instances. Each of these will have the CPU hotplug
> > callback called for it on CPU plug / unplug. The kernel's CPU hotplug
> > multistate keeps a linked list of these different structures so that
> > it will iterate over them on CPU transitions.
> >
> > By default at driver initialization we will create just one zram device
> > (num_devices=1) and a zcomp structure then set for the now default
> > lzo-rle compression algorithm. At driver removal we first remove each
> > zram device, and so we destroy the associated struct zcomp per CPU. But
> > since we expose sysfs attributes to create new devices or reset /
> > initialize existing zram devices, we can easily end up re-initializing
> > a struct zcomp for a zram device before the exit routine of the module
> > removes the cpu hotplug callback. When this happens the kernel's CPU
> > hotplug will detect that at least one instance (struct zcomp for us)
> > exists. This can happen in the following situation:
> >
> > CPU 1 CPU 2
> >
> > disksize_store(...);
> > class_unregister(...);
> > idr_for_each(...);
> > zram_debugfs_destroy();
> >
> > idr_destroy(...);
> > unregister_blkdev(...);
> > cpuhp_remove_multi_state(...);
>
> So this is strictly separate from the sysfs/module unloading race?
It is only related in the sense that the sysfs/module unloading race
happened *after* this other issue, but addressing these through
separate threads created a break in conversation and focus. For
instance, a theoretical race was mentioned in one thread, which
I worked to prove or disprove, and in the end I showed it was not
possible. But at this point, yes, this is a purely separate issue, and
this patch *should* be picked up already.
Andrew, can you merge this? It already has the respective maintainer
Ack, and I can continue to work on the rest of the patches. The only
issue I can think of would be a conflict with the last patch, but
that's a one-liner. I think chances are low that it would create a
conflict if it's all merged separately, and if so, it should be an
easy merge conflict to fix.
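For anyone skimming, the idea behind the zram_up state machine in the
patch can be sketched in userspace roughly as follows. This is only an
illustration under stated assumptions, not the driver code:
sketch_store(), sketch_init() and sketch_exit() are hypothetical names,
a pthread mutex stands in for zram_index_mutex, and a plain counter
stands in for the per-CPU struct zcomp state.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-ins for zram_index_mutex and zram_up (assumed names). */
static pthread_mutex_t sketch_index_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool sketch_up;

/* Hypothetical sysfs-style store op: refuses work once teardown began. */
static int sketch_store(int *resource)
{
	int ret = 0;

	pthread_mutex_lock(&sketch_index_mutex);
	if (!sketch_up) {
		ret = -ENODEV;
		goto out;
	}
	(*resource)++;	/* stands in for (re)initializing a struct zcomp */
out:
	pthread_mutex_unlock(&sketch_index_mutex);
	return ret;
}

/* Marks the driver as up only after setup is complete. */
static void sketch_init(void)
{
	pthread_mutex_lock(&sketch_index_mutex);
	sketch_up = true;
	pthread_mutex_unlock(&sketch_index_mutex);
}

/* Clears the flag and tears down state under the same mutex. */
static void sketch_exit(int *resource)
{
	pthread_mutex_lock(&sketch_index_mutex);
	sketch_up = false;
	*resource = 0;
	pthread_mutex_unlock(&sketch_index_mutex);
}
```

Because the flag is only read and written under the mutex, a store that
races with exit either completes before teardown starts or sees the
flag cleared and bails out with -ENODEV.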
Luis
On Tue, Oct 05, 2021 at 12:47:22PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:37:57AM -0700, Luis Chamberlain wrote:
> > This adds initial failure injection support to kernfs. We start
> > off with debug knobs which when enabled allow test drivers, such as
> > test_sysfs, to then make use of these to try to force certain
> > difficult races to take place with a high degree of certainty.
> >
> > This only adds runtime code *iff* the new bool CONFIG_FAIL_KERNFS_KNOBS is
> > enabled in your kernel. If you don't have this enabled this provides
> > no new functionality. When CONFIG_FAIL_KERNFS_KNOBS is disabled the new
> > routine kernfs_debug_should_wait() ends up being transformed to if
> > (false), and so the compiler should optimize these out as dead code
> > producing no new effective binary changes.
> >
> > We start off with enabling failure injections in kernfs by allowing us to
> > alter the way kernfs_fop_write_iter() behaves. We allow for the routine
> > kernfs_fop_write_iter() to wait for a certain condition in the kernel to
> > occur, after which it will sleep a predefined amount of time. This lets
> > kernfs users time exactly when they want kernfs_fop_write_iter() to
> > complete, allowing for developing race conditions and testing for
> > correctness in kernfs.
> >
> > You'd boot with this enabled on your kernel command line:
> >
> > fail_kernfs_fop_write_iter=1,100,0,1
> >
> > The values are <interval,probability,size,times>, we don't care for
> > size, so for now we ignore it. The above ensures a failure will trigger
> > only once.
> >
> > *How* we allow for this routine to change behaviour is left to knobs we
> > expose under debugfs:
> >
> > # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
>
> I'd expect this to live under /sys/kernel/debug/fail_kernfs, like the
> other fault injectors.
Yes I see, thanks will fix up!
> > diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst
> > index 4a25c5eb6f07..d4d34b082f47 100644
> > --- a/Documentation/fault-injection/fault-injection.rst
> > +++ b/Documentation/fault-injection/fault-injection.rst
> > @@ -28,6 +28,28 @@ Available fault injection capabilities
> >
> > injects kernel RPC client and server failures.
> >
> > +- fail_kernfs_fop_write_iter
> > +
> > + Allows for failures to be enabled inside kernfs_fop_write_iter(). Enabling
> > + this does not immediately enable any errors to occur. You must configure
> > + how you want this routine to fail or change behaviour by using the debugfs
> > + knobs for it:
> > +
> > + # ls -1 /sys/kernel/debug/kernfs/config_fail_kernfs_fop_write_iter/
> > + wait_after_active
> > + wait_after_mutex
> > + wait_at_start
> > + wait_before_mutex
>
> This should be split up and detailed in the "debugfs entries" section
> below here.
Done!
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 1b4cefcb064c..fadfd961ad80 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -10384,7 +10384,7 @@ M: Greg Kroah-Hartman <[email protected]>
> > M: Tejun Heo <[email protected]>
> > S: Supported
> > T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git
> > -F: fs/kernfs/
> > +F: fs/kernfs/*
> > F: include/linux/kernfs.h
> >
> > KEXEC
> > diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
> > index 4ca54ff54c98..bc5b32ca39f9 100644
> > --- a/fs/kernfs/Makefile
> > +++ b/fs/kernfs/Makefile
> > @@ -4,3 +4,4 @@
> > #
> >
> > obj-y := mount.o inode.o dir.o file.o symlink.o
> > +obj-$(CONFIG_FAIL_KERNFS_KNOBS) += failure-injection.o
> > diff --git a/fs/kernfs/failure-injection.c b/fs/kernfs/failure-injection.c
> > new file mode 100644
> > index 000000000000..4130d202c13b
> > --- /dev/null
> > +++ b/fs/kernfs/failure-injection.c
>
> I'd name this fault_inject.c, which matches the more common case:
>
> $ find . -type f -name '*fault*inject*.c'
> ./fs/nfsd/fault_inject.c
> ./drivers/nvme/host/fault_inject.c
> ./drivers/scsi/ufs/ufs-fault-injection.c
> ./lib/fault-inject.c
> ./lib/fault-inject-usercopy.c
Sure, done.
> > +int __kernfs_debug_should_wait_kernfs_fop_write_iter(bool evaluate)
> > +{
> > + if (!evaluate)
> > + return 0;
> > +
> > + return should_fail(&fail_kernfs_fop_write_iter, 0);
> > +}
>
> Every caller ends up doing the wait, so how about just including that
> here instead? It should make things much less intrusive and more readable.
>
> And for the naming, other fault injectors use "should_fail_$topic", so
> maybe better here would be something like may_wait_kernfs(...).
In case anyone is reading Project Hail Mary by Andy Weir: "Yes yes yes!"
Indeed, that's a great idea. Changed!
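Roughly, folding the wait into the check could look like the userspace
sketch below. It is only an assumption of the eventual shape: the
sketch_* names are hypothetical, the should_fail() probability check
is left out, and a counter replaces the real completion wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical knobs, named after the debugfs files in the patch. */
struct sketch_fop_write_iter_knobs {
	bool wait_at_start;
	bool wait_before_mutex;
	bool wait_after_mutex;
	bool wait_after_active;
};

struct sketch_config_fail {
	struct sketch_fop_write_iter_knobs kernfs_fop_write_iter_fail;
};

static struct sketch_config_fail sketch_config_fail;
static int sketch_waits;

/* Stand-in for kernfs_debug_wait(); only counts how often it ran. */
static void sketch_debug_wait(void)
{
	sketch_waits++;
}

/* Resolves to the knob for a given routine and wait point. */
#define sketch_wait_var(func, when) \
	(sketch_config_fail.func##_fail.wait_##when)

/* Single call site helper: check the knob and wait if it is set. */
#define sketch_may_wait(func, when)			\
	do {						\
		if (sketch_wait_var(func, when))	\
			sketch_debug_wait();		\
	} while (0)
```

A caller then needs just one line per wait point, for example
sketch_may_wait(kernfs_fop_write_iter, at_start).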
> > +
> > +DECLARE_COMPLETION(kernfs_debug_wait_completion);
> > +EXPORT_SYMBOL_NS_GPL(kernfs_debug_wait_completion, KERNFS_DEBUG_PRIVATE);
> > +
> > +void kernfs_debug_wait(void)
> > +{
> > + unsigned long timeout;
> > +
> > + timeout = wait_for_completion_timeout(&kernfs_debug_wait_completion,
> > + msecs_to_jiffies(3000));
> > + if (!timeout)
> > + pr_info("%s waiting for kernfs_debug_wait_completion timed out\n",
> > + __func__);
> > + else
> > + pr_info("%s received completion with time left on timeout %u ms\n",
> > + __func__, jiffies_to_msecs(timeout));
> > +
> > + /**
> > + * The goal is wait for an event, and *then* once we have
> > + * reached it, the other side will try to do something which
> > + * it thinks will break. So we must give it some time to do
> > + * that. The amount of time is configurable.
> > + */
> > + msleep(kernfs_config_fail.sleep_after_wait_ms);
> > + pr_info("%s ended\n", __func__);
> > +}
>
> All the uses of "__func__" here seems redundant; I would drop them.
Alright, and I also added the pr_fmt define which I forgot.
> > diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> > index 60e2a86c535e..4479c6580333 100644
> > --- a/fs/kernfs/file.c
> > +++ b/fs/kernfs/file.c
> > @@ -259,6 +259,9 @@ static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> > const struct kernfs_ops *ops;
> > char *buf;
> >
> > + if (kernfs_debug_should_wait(kernfs_fop_write_iter, at_start))
> > + kernfs_debug_wait();
>
> So this could just be:
>
> may_wait_kernfs(kernfs_fop_write_iter, at_start);
Yup! Thanks!
> > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> > index f9cc912c31e1..9e3abf597e2d 100644
> > --- a/fs/kernfs/kernfs-internal.h
> > +++ b/fs/kernfs/kernfs-internal.h
> > +#define __kernfs_config_wait_var(func, when) \
> > + (kernfs_config_fail. func ## _fail.wait_ ## when)
> ^^ ^ ^
> nit: needless spaces
Trimmed.
Luis
On Tue, Oct 05, 2021 at 12:51:33PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:37:58AM -0700, Luis Chamberlain wrote:
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index a29b7d398c4e..176b822654e5 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -2358,6 +2358,9 @@ config TEST_SYSFS
> > depends on SYSFS
> > depends on NET
> > depends on BLOCK
> > + select FAULT_INJECTION
> > + select FAULT_INJECTION_DEBUG_FS
> > + select FAIL_KERNFS_KNOBS
>
> I don't like seeing "select" for user-configurable CONFIGs -- things
> tend to end up weird. This should simply be:
>
> depends on FAIL_KERNFS_KNOBS
Sure.
> > diff --git a/lib/test_sysfs.c b/lib/test_sysfs.c
> > index 2043ca494af8..c6e62de61403 100644
> > --- a/lib/test_sysfs.c
> > +++ b/lib/test_sysfs.c
> > @@ -38,6 +38,11 @@
> > #include <linux/rtnetlink.h>
> > #include <linux/genhd.h>
> > #include <linux/blkdev.h>
> > +#include <linux/kernfs.h>
> > +
> > +#ifdef CONFIG_FAIL_KERNFS_KNOBS
>
> This isn't an optional config here (and following)?
Sure with the above change this is no longer needed. Removed all that
ifdef'ery.
Luis
On Tue, Oct 05, 2021 at 12:58:47PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:37:59AM -0700, Luis Chamberlain wrote:
> > diff --git a/include/linux/module.h b/include/linux/module.h
> > index c9f1200b2312..22eacd5e1e85 100644
> > --- a/include/linux/module.h
> > +++ b/include/linux/module.h
> > @@ -609,10 +609,40 @@ void symbol_put_addr(void *addr);
> > to handle the error case (which only happens with rmmod --wait). */
> > extern void __module_get(struct module *module);
> >
> > -/* This is the Right Way to get a module: if it fails, it's being removed,
> > - * so pretend it's not there. */
> > +/**
> > + * try_module_get() - yields to module removal and bumps refcnt otherwise
>
> I find this hard to parse. How about:
> "Take module refcount unless module is being removed"
Sure.
> > + * @module: the module we should check for
> > + *
> > + * This can be used to try to bump the reference count of a module, so to
> > + * prevent module removal. The reference count of a module is not allowed
> > + * to be incremented if the module is already being removed.
>
> This I understand.
>
> > + *
> > + * Care must be taken to ensure the module cannot be removed during the call to
> > + * try_module_get(). This can be done by having another entity other than the
> > + * module itself increment the module reference count, or through some other
> > + * means which guarantees the module could not be removed during an operation.
> > + * An example of this latter case is using try_module_get() in a sysfs file
> > + * which the module created. The sysfs store / read file operations are
> > + * guaranteed to exist through the use of kernfs's active reference (see
> > + * kernfs_active()). If a sysfs file operation is being run, the module which
> > + * created it must still exist as the module is in charge of removing the same
> > + * sysfs file being read. Also, a sysfs / kernfs file removal cannot happen
> > + * unless the same file is not active.
>
> I can't understand this paragraph at all. "Care must be taken ..."? Why?
Because the routine try_module_get() assumes the struct module pointer
is valid for the entire call. That can only be true if at least one
reference is held prior to this call.
> Shouldn't callers of try_module_get() be satisfied with the results?
Yes but only with the above care addressed.
> I don't follow the example at all. It seems to just say "sysfs store/read
> functions don't need try_module_get() because whatever opened the sysfs
> file is already keeping the module referenced." ?
That is exactly what I intended to clarify with that example, yes, a
reference is held but this is done implicitly. *If* a kernfs op is
active module removal waits for that active reference to go down. So
while a kernfs file is being used it is simply not possible for the
module to disappear underneath us. And the reason is that the module
that created the sysfs file must obviously destroy that same sysfs file.
But since kernfs ensures that sysfs file cannot be removed if a sysfs
file is being used, this implicitly holds a module reference.
Let me know if you can think of a better way to phrase this.
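To make the semantics concrete, here is a simplified userspace model of
"take a reference unless the module is being removed". It is a sketch
under assumptions, not the kernel implementation: struct sketch_module
and the sketch_* helpers are hypothetical, and C11 atomics replace the
kernel's atomic_t. Note that the caller must still guarantee the
pointer itself stays valid for the whole call, which is the "care" the
documentation refers to.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_module {
	atomic_bool going;	/* set once removal has started */
	atomic_int refcnt;	/* starts at 1, the base reference */
};

static void sketch_module_init(struct sketch_module *mod)
{
	atomic_init(&mod->going, false);
	atomic_init(&mod->refcnt, 1);
}

/*
 * Returns true and takes a reference unless removal has started.
 * A NULL module is treated as built-in code and always succeeds.
 */
static bool sketch_try_get(struct sketch_module *mod)
{
	int old;

	if (!mod)
		return true;
	if (atomic_load(&mod->going))
		return false;

	/* Increment only if the count has not already dropped to zero. */
	old = atomic_load(&mod->refcnt);
	while (old != 0) {
		if (atomic_compare_exchange_weak(&mod->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void sketch_put(struct sketch_module *mod)
{
	if (mod)
		atomic_fetch_sub(&mod->refcnt, 1);
}

/* Removal only proceeds when nobody but the base reference is left. */
static bool sketch_try_remove(struct sketch_module *mod)
{
	int expected = 1;

	if (!atomic_compare_exchange_strong(&mod->refcnt, &expected, 0))
		return false;
	atomic_store(&mod->going, true);
	return true;
}
```

In this model a held reference makes removal fail, and once removal has
started every later sketch_try_get() fails, so the caller can abort
whatever it was about to process.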
> > + *
> > + * One of the real values to try_module_get() is the module_is_live() check
> > + * which ensures the caller of try_module_get() can yield to userspace
> > + * module removal requests and fail whatever it was about to process.
>
> Please document the return value explicitly.
Sure thing.
Luis
On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > When driver sysfs attributes use a lock also used on module removal we
> > can race to deadlock. This happens when for instance a sysfs file on
> > a driver is used, then at the same time we have module removal call
> > trigger. The module removal call code holds a lock, and then the
> > driver's sysfs file entry waits for the same lock. While holding the
> > lock the module removal tries to remove the sysfs entries, but these
> > cannot be removed yet as one is waiting for a lock. This won't complete
> > as the lock is already held. Likewise module removal cannot complete,
> > and so we deadlock.
> >
> > This can now be easily reproducible with our sysfs selftest as follows:
> >
> > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> >
> > This uses a local driver lock. Test 0028 can also be used, that uses
> > the rtnl_lock():
> >
> > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> >
> > To fix this we extend the struct kernfs_node with a module reference
> > and use the try_module_get() after kernfs_get_active() is called. As
> > documented in the prior patch, we now know that once kernfs_get_active()
> > is called the module is implicitly guarded to exist and cannot be removed.
> > This is because the module is the one in charge of removing the same
> > sysfs file it created, and removal of sysfs files on module exit will wait
> > until they don't have any active references. By using a try_module_get()
> > after kernfs_get_active() we yield to let module removal trump calls to
> > process a sysfs operation, while also preventing module removal if a sysfs
> > operation is already in progress. This prevents the deadlock.
> >
> > This deadlock was first reported with the zram driver, however the live
>
> Looks not see the lock pattern you mentioned in zram driver, can you
> share the related zram code?
I recommend to not look at the zram driver, instead look at the
test_sysfs driver as that abstracts the issue more clearly and uses
two different locks as an example. The point is that if on module
removal *any* lock is used which is *also* used on the sysfs file
created by the module, you can deadlock.
> > And this can lead to this condition:
> >
> > CPU A CPU B
> > foo_store()
> > foo_exit()
> > mutex_lock(&foo)
> > mutex_lock(&foo)
> > del_gendisk(some_struct->disk);
> > device_del()
> > device_remove_groups()
>
> I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> look the issue may not be related with removing module directly, right?
No, the reason this can deadlock is that the module exit routine will
patiently wait for the sysfs / kernfs files to stop being used,
but clearly they cannot if the exit routine took the mutex also used
by the sysfs ops. That is, the special condition here is the removal of
the sysfs files, and the sysfs files using a lock also used on module
exit.
Luis
On Tue, Oct 05, 2021 at 01:50:31PM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > A sketch of how this can happen follows, consider foo a local mutex
> > part of a driver, and used on the driver's module exit routine and
> > on one of its sysfs ops:
> >
> > foo.c:
> > static DEFINE_MUTEX(foo);
> > static ssize_t foo_store(struct device *dev,
> > struct device_attribute *attr,
> > const char *buf, size_t count)
> > {
> > ...
> > mutex_lock(&foo);
> > ...
> > mutex_lock(&foo);
> > ...
> > }
> > static DEVICE_ATTR_RW(foo);
> > ...
> > void foo_exit(void)
> > {
> > mutex_lock(&foo);
> > ...
> > mutex_unlock(&foo);
> > }
> > module_exit(foo_exit);
> >
> > And this can lead to this condition:
> >
> > CPU A CPU B
> > foo_store()
> > foo_exit()
> > mutex_lock(&foo)
> > mutex_lock(&foo)
> > del_gendisk(some_struct->disk);
> > device_del()
> > device_remove_groups()
>
> Please expand this further, where does device_remove_groups() end up
> waiting for that never happens?
Sure. How about:
Furthermore, device_remove_groups() will just go on trying to remove
the sysfs files, which are kernfs entries. The way kernfs deals with
removal is that it will wait until all active references for the files
being removed are done. The active reference is obtained through
kernfs_get_active(). Removal ends up waiting through kernfs_drain()
for the active references to be done, and that only happens if the
kernfs file ops can complete. If these kernfs ops / sysfs files
are waiting for a mutex which was taken by the module's exit routine
prior to trying to remove the sysfs files, we deadlock.
> > In this situation foo_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the sysfs file being poked at
> > completes which is waiting for a lock already held.
> >
> > Signed-off-by: Luis Chamberlain <[email protected]>
> > ---
> > arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
> > fs/kernfs/dir.c | 44 ++++++++++++++++++----
> > fs/kernfs/file.c | 6 ++-
> > fs/kernfs/kernfs-internal.h | 3 +-
> > fs/kernfs/symlink.c | 3 +-
> > fs/sysfs/dir.c | 2 +-
> > fs/sysfs/file.c | 6 ++-
> > fs/sysfs/group.c | 3 +-
> > include/linux/kernfs.h | 14 ++++---
> > include/linux/sysfs.h | 52 ++++++++++++++++++++------
> > kernel/cgroup/cgroup.c | 2 +-
> > 11 files changed, 105 insertions(+), 34 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index b57b3db9a6a7..4edf3b37fd2c 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
> >
> > kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
> > GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
> > - 0, rft->kf_ops, rft, NULL, NULL);
> > + 0, rft->kf_ops, rft, NULL, NULL, NULL);
> > if (IS_ERR(kn))
> > return PTR_ERR(kn);
> >
> > @@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
> >
> > kn = __kernfs_create_file(parent_kn, name, 0444,
> > GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
> > - &kf_mondata_ops, priv, NULL, NULL);
> > + &kf_mondata_ops, priv, NULL, NULL, NULL);
> > if (IS_ERR(kn))
> > return PTR_ERR(kn);
> >
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index ba581429bf7b..e841201fd11b 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -14,6 +14,7 @@
> > #include <linux/slab.h>
> > #include <linux/security.h>
> > #include <linux/hash.h>
> > +#include <linux/module.h>
> >
> > #include "kernfs-internal.h"
> >
> > @@ -414,12 +415,29 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
> > */
> > struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
> > {
> > + int v;
> > +
> > if (unlikely(!kn))
> > return NULL;
> >
> > if (!atomic_inc_unless_negative(&kn->active))
> > return NULL;
> >
> > + /*
> > + * If a module created the kernfs_node, the module cannot possibly be
> > + * removed if the above atomic_inc_unless_negative() succeeded. So the
> > + * try_module_get() below is not to protect the lifetime of the module
> > + * as that is already guaranteed. The try_module_get() below is used
> > + * to ensure that we don't deadlock in case a kernfs operation and
> > + * module removal used a shared lock.
> > + */
> > + if (!try_module_get(kn->owner)) {
> > + v = atomic_dec_return(&kn->active);
> > + if (unlikely(v == KN_DEACTIVATED_BIAS))
> > + wake_up_all(&kernfs_root(kn)->deactivate_waitq);
> > + return NULL;
> > + }
>
> The special casing in here makes me think this isn't happening the right
> place. (i.e this looks like an open-coded version of kernfs_put_active())
No, well you see, in effect the special care taken in
kernfs_put_active() *is* the right way to inform a waiter that
the *taken* reference right above *also* is no longer active.
The special casing here is because we took the active reference
before the try_module_get() in the above atomic_inc_unless_negative()
call. Outside callers deal with this through kernfs_put_active().
We are special casing to deal with the deadlock case.
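A userspace model of that rollback may help. It is a sketch under
assumptions rather than the kernfs code: struct sketch_node and the
sketch_* helpers are hypothetical, owner_live stands in for the result
of try_module_get(kn->owner), and a counter stands in for
wake_up_all() on the deactivate waitqueue.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Same value kernfs uses for KN_DEACTIVATED_BIAS. */
#define SKETCH_DEACTIVATED_BIAS (INT_MIN + 1)

struct sketch_node {
	atomic_int active;	/* active reference count */
	bool owner_live;	/* stand-in for try_module_get(kn->owner) */
	int wakeups;		/* stand-in for wake_up_all() calls */
};

/* Increment unless the value is negative, i.e. the node is deactivated. */
static bool sketch_inc_unless_negative(atomic_int *v)
{
	int old = atomic_load(v);

	while (old >= 0) {
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one active reference and wake the drainer if we were the last. */
static void sketch_drop_active(struct sketch_node *kn)
{
	int v = atomic_fetch_sub(&kn->active, 1) - 1;

	if (v == SKETCH_DEACTIVATED_BIAS)
		kn->wakeups++;
}

static bool sketch_get_active(struct sketch_node *kn)
{
	if (!sketch_inc_unless_negative(&kn->active))
		return false;

	/* The owner is going away: undo the reference we just took. */
	if (!kn->owner_live) {
		sketch_drop_active(kn);
		return false;
	}
	return true;
}

static void sketch_put_active(struct sketch_node *kn)
{
	sketch_drop_active(kn);
}

/* Removal side: bias the counter so no new active references succeed. */
static void sketch_deactivate(struct sketch_node *kn)
{
	atomic_fetch_add(&kn->active, SKETCH_DEACTIVATED_BIAS);
}
```

The point of the rollback is visible here: if the failed path did not
drop the reference and wake the waiter, a concurrent drain could wait
forever on a reference nobody will ever put.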
> > +
> > if (kernfs_lockdep(kn))
> > rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
> > return kn;
> > @@ -442,6 +460,13 @@ void kernfs_put_active(struct kernfs_node *kn)
> > if (kernfs_lockdep(kn))
> > rwsem_release(&kn->dep_map, _RET_IP_);
> > v = atomic_dec_return(&kn->active);
> > +
> > + /*
> > + * We prevent module exit *until* we know for sure all possible
> > + * kernfs ops are done.
> > + */
> > + module_put(kn->owner);
> > +
> > if (likely(v != KN_DEACTIVATED_BIAS))
> > return;
>
> What I don't understand, however, is what kernfs_get/put_active() is
> intending to do -- it looks like it's trying to provide an interruption
> point for open kernfs file operations?
It is essentially ensuring that removal does not happen if any ops
are being used.
> This all seems extremely complex for what seems like it should just be a
> global "am I being removed?" bool?
It used to be worse :) And Tejun has cleaned this up over time. Yes,
perhaps we can improve that more, but given how sensitive this code
is I think such improvements should be made separately.
> Regardless, while I do see the logic of associating the module get/put
> with get/put of kernfs "active", why is it not better tied to strictly
> kernfs open/close?
It's not just files, consider kernfs_iop_mkdir() which also calls
kernfs_get_active(). How about kernfs_fop_mmap()? And so, the common
denominator is actually kernfs_get_active().
> That would seem to be much simpler and not require
> any special handling?
Yes true, but I think this would still leave open some other possible
deadlocks.
> For example, why does this not work?
It does for the write case for sure. I haven't written tests for the
other odd cases, but I suspect those would deadlock as well.
Luis
On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > When driver sysfs attributes use a lock also used on module removal we
> > > can race to deadlock. This happens when for instance a sysfs file on
> > > a driver is used, then at the same time we have module removal call
> > > trigger. The module removal call code holds a lock, and then the
> > > driver's sysfs file entry waits for the same lock. While holding the
> > > lock the module removal tries to remove the sysfs entries, but these
> > > cannot be removed yet as one is waiting for a lock. This won't complete
> > > as the lock is already held. Likewise module removal cannot complete,
> > > and so we deadlock.
> > >
> > > This can now be easily reproducible with our sysfs selftest as follows:
> > >
> > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > >
> > > This uses a local driver lock. Test 0028 can also be used, that uses
> > > the rtnl_lock():
> > >
> > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > >
> > > To fix this we extend the struct kernfs_node with a module reference
> > > and use the try_module_get() after kernfs_get_active() is called. As
> > > documented in the prior patch, we now know that once kernfs_get_active()
> > > is called the module is implicitly guarded to exist and cannot be removed.
> > > This is because the module is the one in charge of removing the same
> > > sysfs file it created, and removal of sysfs files on module exit will wait
> > > until they don't have any active references. By using a try_module_get()
> > > after kernfs_get_active() we yield to let module removal trump calls to
> > > process a sysfs operation, while also preventing module removal if a sysfs
> > > operation is already in progress. This prevents the deadlock.
> > >
> > > This deadlock was first reported with the zram driver, however the live
> >
> > Looks not see the lock pattern you mentioned in zram driver, can you
> > share the related zram code?
>
> I recommend to not look at the zram driver, instead look at the
> test_sysfs driver as that abstracts the issue more clearly and uses
It looks like test_sysfs isn't in Linus' tree, where can I find it? Also please
update your commit log about this wrong info if it can't be applied on
zram.
> two different locks as an example. The point is that if on module
> removal *any* lock is used which is *also* used on the sysfs file
> created by the module, you can deadlock.
>
> > > And this can lead to this condition:
> > >
> > > CPU A CPU B
> > > foo_store()
> > > foo_exit()
> > > mutex_lock(&foo)
> > > mutex_lock(&foo)
> > > del_gendisk(some_struct->disk);
> > > device_del()
> > > device_remove_groups()
> >
> > I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> > look the issue may not be related with removing module directly, right?
>
> No, the reason this can deadlock is that the module exit routine will
> patiently wait for the sysfs / kernfs files to stop being used,
Can you share the code which waits for the sysfs / kernfs files to
stop being used? And why does it make a difference in case of being
called from module_exit()?
Thanks,
Ming
On Tue, Oct 12, 2021 at 08:20:46AM +0800, Ming Lei wrote:
> On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > > When driver sysfs attributes use a lock also used on module removal we
> > > > can race to deadlock. This happens when for instance a sysfs file on
> > > > a driver is used, then at the same time we have module removal call
> > > > trigger. The module removal call code holds a lock, and then the
> > > > driver's sysfs file entry waits for the same lock. While holding the
> > > > lock the module removal tries to remove the sysfs entries, but these
> > > > cannot be removed yet as one is waiting for a lock. This won't complete
> > > > as the lock is already held. Likewise module removal cannot complete,
> > > > and so we deadlock.
> > > >
> > > > This can now be easily reproducible with our sysfs selftest as follows:
> > > >
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > > >
> > > > This uses a local driver lock. Test 0028 can also be used, that uses
> > > > the rtnl_lock():
> > > >
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > > >
> > > > To fix this we extend the struct kernfs_node with a module reference
> > > > and use the try_module_get() after kernfs_get_active() is called. As
> > > > documented in the prior patch, we now know that once kernfs_get_active()
> > > > is called the module is implicitly guarded to exist and cannot be removed.
> > > > This is because the module is the one in charge of removing the same
> > > > sysfs file it created, and removal of sysfs files on module exit will wait
> > > > until they don't have any active references. By using a try_module_get()
> > > > after kernfs_get_active() we yield to let module removal trump calls to
> > > > process a sysfs operation, while also preventing module removal if a sysfs
> > > > operation is already in progress. This prevents the deadlock.
> > > >
> > > > This deadlock was first reported with the zram driver, however the live
> > >
> > > Looks not see the lock pattern you mentioned in zram driver, can you
> > > share the related zram code?
> >
> > I recommend to not look at the zram driver, instead look at the
> > test_sysfs driver as that abstracts the issue more clearly and uses
>
> It looks like test_sysfs isn't in Linus' tree, where can I find it?
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> Also please
> update your commit log about this wrong info if it can't be applied on
> zram.
It does apply to zram, it is just that I have other fixes for zram in
my pipeline which will change the zram driver further, and so what makes
more sense is to abstract the issue into a selftest driver to
demonstrate the issue more clearly.
To reproduce the deadlock revert the patch in this thread and then run
either of these two tests as root:
./tools/testing/selftests/sysfs/sysfs.sh -w 0027
./tools/testing/selftests/sysfs/sysfs.sh -w 0028
You will need to enable the test_sysfs driver.
> > two different locks as an example. The point is that if on module
> > removal *any* lock is used which is *also* used on the sysfs file
> > created by the module, you can deadlock.
> >
> > > > And this can lead to this condition:
> > > >
> > > > CPU A CPU B
> > > > foo_store()
> > > > foo_exit()
> > > > mutex_lock(&foo)
> > > > mutex_lock(&foo)
> > > > del_gendisk(some_struct->disk);
> > > > device_del()
> > > > device_remove_groups()
> > >
> > > I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> > > look the issue may not be related with removing module directly, right?
> >
> > No, the reason this can deadlock is that the module exit routine will
> > patiently wait for the sysfs / kernfs files to be stop being used,
>
> Can you share the code which waits for the sysfs / kernfs files to be
> stop being used?
How about a call trace of the two tasks which deadlock? Here is one from
running test 0027:
kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more than 120 seconds.
[ 363.878341] Tainted: G E 5.15.0-rc3-next-20210927+ #83
[ 363.881218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.882255] task:sysfs.sh state:D stack: 0 pid: 1271 ppid: 1 flags:0x00000004
[ 363.882894] Call Trace:
[ 363.883091] <TASK>
[ 363.883259] __schedule+0x2fd/0x990
[ 363.883551] schedule+0x43/0xe0
[ 363.883800] schedule_preempt_disabled+0x14/0x20
[ 363.884160] __mutex_lock.constprop.0+0x249/0x470
[ 363.884524] test_dev_x_store+0xa5/0xc0 [test_sysfs]
[ 363.884915] kernfs_fop_write_iter+0x177/0x220
[ 363.885257] new_sync_write+0x11c/0x1b0
[ 363.885556] vfs_write+0x20d/0x2a0
[ 363.885821] ksys_write+0x5f/0xe0
[ 363.886081] do_syscall_64+0x38/0xc0
[ 363.886359] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 363.886748] RIP: 0033:0x7fee00f8bf33
[ 363.887029] RSP: 002b:00007ffd372c5d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 363.887633] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fee00f8bf33
[ 363.888217] RDX: 0000000000000003 RSI: 000055a4d14a0db0 RDI: 0000000000000001
[ 363.888761] RBP: 000055a4d14a0db0 R08: 000000000000000a R09: 0000000000000002
[ 363.889267] R10: 000055a4d1554ac0 R11: 0000000000000246 R12: 0000000000000003
[ 363.889983] R13: 00007fee0105c6a0 R14: 0000000000000003 R15: 00007fee0105c8a0
[ 363.890513] </TASK>
[ 363.890709] INFO: task modprobe:1276 blocked for more than 120 seconds.
[ 363.891185] Tainted: G E 5.15.0-rc3-next-20210927+ #83
[ 363.891781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.892353] task:modprobe state:D stack: 0 pid: 1276 ppid: 1 flags:0x00004000
[ 363.892955] Call Trace:
[ 363.893141] <TASK>
[ 363.893457] __schedule+0x2fd/0x990
[ 363.893865] schedule+0x43/0xe0
[ 363.894246] __kernfs_remove.part.0+0x21e/0x2a0
[ 363.894704] ? do_wait_intr_irq+0xa0/0xa0
[ 363.895142] kernfs_remove_by_name_ns+0x50/0x90
[ 363.895632] remove_files+0x2b/0x60
[ 363.896035] sysfs_remove_group+0x38/0x80
[ 363.896470] sysfs_remove_groups+0x29/0x40
[ 363.896912] device_remove_attrs+0x5b/0x90
[ 363.897352] device_del+0x183/0x400
[ 363.897758] unregister_test_dev_sysfs+0x5b/0xaa [test_sysfs]
[ 363.898317] test_sysfs_exit+0x45/0xfb0 [test_sysfs]
[ 363.898833] __do_sys_delete_module+0x18d/0x2a0
[ 363.899329] ? fpregs_assert_state_consistent+0x1e/0x40
[ 363.899868] ? exit_to_user_mode_prepare+0x3a/0x180
[ 363.900390] do_syscall_64+0x38/0xc0
[ 363.900810] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 363.901330] RIP: 0033:0x7f21915c57d7
[ 363.901747] RSP: 002b:00007ffd90869fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 363.902442] RAX: ffffffffffffffda RBX: 000055ce676ffc30 RCX: 00007f21915c57d7
[ 363.903104] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055ce676ffc98
[ 363.903782] RBP: 000055ce676ffc30 R08: 0000000000000000 R09: 0000000000000000
[ 363.904462] R10: 00007f2191638ac0 R11: 0000000000000206 R12: 000055ce676ffc98
[ 363.905128] R13: 0000000000000000 R14: 0000000000000000 R15: 000055ce676ffdf0
[ 363.905797] </TASK>
And gdb:
(gdb) l *(__kernfs_remove+0x21e)
0xffffffff8139288e is in __kernfs_remove (fs/kernfs/dir.c:476).
471 if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
472 lock_contended(&kn->dep_map, _RET_IP_);
473 }
474
475 /* but everyone should wait for draining */
476 wait_event(root->deactivate_waitq,
477 atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
478
479 if (kernfs_lockdep(kn)) {
480 lock_acquired(&kn->dep_map, _RET_IP_);
(gdb) l *(kernfs_remove_by_name_ns+0x50)
0xffffffff813938d0 is in kernfs_remove_by_name_ns (fs/kernfs/dir.c:1534).
1529
1530 kn = kernfs_find_ns(parent, name, ns);
1531 if (kn)
1532 __kernfs_remove(kn);
1533
1534 up_write(&kernfs_rwsem);
1535
1536 if (kn)
1537 return 0;
1538 else
The same happens for test 0028 except instead of a mutex
lock an rtnl_lock() is used.
Would this be better for the commit log?
> And why does it make a difference in case of being
> called from module_exit()?
Well, because that is where we remove the sysfs files. *If*
a developer happens to use a lock in a sysfs op and the same lock
is also used on module exit, this deadlock is bound to happen.
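To make the two blocked tasks in the trace easier to follow, here is a
minimal userspace sketch of the state they end up in. It is not kernel
code: the model_* names are hypothetical, the "lock" is a plain
atomic_flag standing in for the shared driver mutex, and the counter only
imitates how kn->active is biased on deactivation so that removal waits
for in-flight ops.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bias, modeled on the KN_DEACTIVATED_BIAS seen in the listing. */
#define MODEL_DEACTIVATED_BIAS (INT_MIN + 1)

struct model_state {
	atomic_flag lock;	/* lock shared by the sysfs op and the exit path */
	atomic_int active;	/* stand-in for kn->active */
};

void model_init(struct model_state *s)
{
	atomic_flag_clear(&s->lock);
	atomic_init(&s->active, 0);
}

/* Returns true if the lock was taken; false means a real mutex would block. */
bool model_trylock(struct model_state *s)
{
	return !atomic_flag_test_and_set(&s->lock);
}

void model_unlock(struct model_state *s)
{
	atomic_flag_clear(&s->lock);
}

/* A sysfs op takes an active reference unless the node is deactivated. */
bool model_get_active(struct model_state *s)
{
	int v = atomic_load(&s->active);

	while (v >= 0) {
		if (atomic_compare_exchange_weak(&s->active, &v, v + 1))
			return true;
	}
	return false;
}

void model_put_active(struct model_state *s)
{
	atomic_fetch_sub(&s->active, 1);
}

/* Removal marks the node deactivated, which refuses new active references. */
void model_deactivate(struct model_state *s)
{
	atomic_fetch_add(&s->active, MODEL_DEACTIVATED_BIAS);
}

/* Removal may only finish once every in-flight op dropped its reference. */
bool model_drained(struct model_state *s)
{
	return atomic_load(&s->active) == MODEL_DEACTIVATED_BIAS;
}
```

Walking the sequence by hand: the op calls model_get_active(), the exit
path wins model_trylock() and calls model_deactivate(), the op's
model_trylock() then fails, and model_drained() stays false until the op
calls model_put_active(), which it never reaches. That is the cycle the
two hung tasks above are stuck in.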
Luis
On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 12, 2021 at 08:20:46AM +0800, Ming Lei wrote:
> > On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
> > > On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
> > > > On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
> > > > > When driver sysfs attributes use a lock also used on module removal we
> > > > > can race to deadlock. This happens when for instance a sysfs file on
> > > > > a driver is used, then at the same time we have module removal call
> > > > > trigger. The module removal call code holds a lock, and then the
> > > > > driver's sysfs file entry waits for the same lock. While holding the
> > > > > lock the module removal tries to remove the sysfs entries, but these
> > > > > cannot be removed yet as one is waiting for a lock. This won't complete
> > > > > as the lock is already held. Likewise module removal cannot complete,
> > > > > and so we deadlock.
> > > > >
> > > > > This can now be easily reproducible with our sysfs selftest as follows:
> > > > >
> > > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
> > > > >
> > > > > This uses a local driver lock. Test 0028 can also be used, that uses
> > > > > the rtnl_lock():
> > > > >
> > > > > ./tools/testing/selftests/sysfs/sysfs.sh -t 0028
> > > > >
> > > > > To fix this we extend the struct kernfs_node with a module reference
> > > > > and use the try_module_get() after kernfs_get_active() is called. As
> > > > > documented in the prior patch, we now know that once kernfs_get_active()
> > > > > is called the module is implicitly guarded to exist and cannot be removed.
> > > > > This is because the module is the one in charge of removing the same
> > > > > sysfs file it created, and removal of sysfs files on module exit will wait
> > > > > until they don't have any active references. By using a try_module_get()
> > > > > after kernfs_get_active() we yield to let module removal trump calls to
> > > > > process a sysfs operation, while also preventing module removal if a sysfs
> > > > > operation is in already progress. This prevents the deadlock.
> > > > >
> > > > > This deadlock was first reported with the zram driver, however the live
> > > >
> > > > Looks not see the lock pattern you mentioned in zram driver, can you
> > > > share the related zram code?
> > >
> > > I recommend to not look at the zram driver, instead look at the
> > > test_sysfs driver as that abstracts the issue more clearly and uses
> >
> > Looks test_sysfs isn't in linus tree, where can I find it?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
>
> > Also please
> > update your commit log about this wrong info if it can't be applied on
> > zram.
>
> It does apply to zram, it is just that I have other fixes for zram in
> my pipeline which will change the zram driver further, and so what makes
> more sense is to abstract the issue into a selftest driver to
> demonstrate the issue more clearly.
>
> To reproduce the deadlock revert the patch in this thread and then run
> either of these two tests as root:
>
> ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
>
> You will need to enable the test_sysfs driver.
>
> > > two different locks as an example. The point is that if on module
> > > removal *any* lock is used which is *also* used on the sysfs file
> > > created by the module, you can deadlock.
> > >
> > > > > And this can lead to this condition:
> > > > >
> > > > > CPU A CPU B
> > > > > foo_store()
> > > > > foo_exit()
> > > > > mutex_lock(&foo)
> > > > > mutex_lock(&foo)
> > > > > del_gendisk(some_struct->disk);
> > > > > device_del()
> > > > > device_remove_groups()
> > > >
> > > > I guess the deadlock exists if foo_exit() is called anywhere. If yes,
> > > > look the issue may not be related with removing module directly, right?
> > >
> > > No, the reason this can deadlock is that the module exit routine will
> > > patiently wait for the sysfs / kernfs files to be stop being used,
> >
> > Can you share the code which waits for the sysfs / kernfs files to be
> > stop being used?
>
> How about a call trace of the two tasks which deadlock, here is one of
> running test 0027:
>
> kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more
> than 120 seconds.
> [ 363.878341] Tainted: G E
> 5.15.0-rc3-next-20210927+ #83
> [ 363.881218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 363.882255] task:sysfs.sh state:D stack: 0 pid: 1271 ppid:
> 1 flags:0x00000004
> [ 363.882894] Call Trace:
> [ 363.883091] <TASK>
> [ 363.883259] __schedule+0x2fd/0x990
> [ 363.883551] schedule+0x43/0xe0
> [ 363.883800] schedule_preempt_disabled+0x14/0x20
> [ 363.884160] __mutex_lock.constprop.0+0x249/0x470
> [ 363.884524] test_dev_x_store+0xa5/0xc0 [test_sysfs]
> [ 363.884915] kernfs_fop_write_iter+0x177/0x220
> [ 363.885257] new_sync_write+0x11c/0x1b0
> [ 363.885556] vfs_write+0x20d/0x2a0
> [ 363.885821] ksys_write+0x5f/0xe0
> [ 363.886081] do_syscall_64+0x38/0xc0
> [ 363.886359] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 363.886748] RIP: 0033:0x7fee00f8bf33
> [ 363.887029] RSP: 002b:00007ffd372c5d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ 363.887633] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fee00f8bf33
> [ 363.888217] RDX: 0000000000000003 RSI: 000055a4d14a0db0 RDI: 0000000000000001
> [ 363.888761] RBP: 000055a4d14a0db0 R08: 000000000000000a R09: 0000000000000002
> [ 363.889267] R10: 000055a4d1554ac0 R11: 0000000000000246 R12: 0000000000000003
> [ 363.889983] R13: 00007fee0105c6a0 R14: 0000000000000003 R15: 00007fee0105c8a0
> [ 363.890513] </TASK>
> [ 363.890709] INFO: task modprobe:1276 blocked for more than 120 seconds.
> [ 363.891185] Tainted: G E 5.15.0-rc3-next-20210927+ #83
> [ 363.891781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 363.892353] task:modprobe state:D stack: 0 pid: 1276 ppid: 1 flags:0x00004000
> [ 363.892955] Call Trace:
> [ 363.893141] <TASK>
> [ 363.893457] __schedule+0x2fd/0x990
> [ 363.893865] schedule+0x43/0xe0
> [ 363.894246] __kernfs_remove.part.0+0x21e/0x2a0
> [ 363.894704] ? do_wait_intr_irq+0xa0/0xa0
> [ 363.895142] kernfs_remove_by_name_ns+0x50/0x90
> [ 363.895632] remove_files+0x2b/0x60
> [ 363.896035] sysfs_remove_group+0x38/0x80
> [ 363.896470] sysfs_remove_groups+0x29/0x40
> [ 363.896912] device_remove_attrs+0x5b/0x90
> [ 363.897352] device_del+0x183/0x400
> [ 363.897758] unregister_test_dev_sysfs+0x5b/0xaa [test_sysfs]
> [ 363.898317] test_sysfs_exit+0x45/0xfb0 [test_sysfs]
> [ 363.898833] __do_sys_delete_module+0x18d/0x2a0
> [ 363.899329] ? fpregs_assert_state_consistent+0x1e/0x40
> [ 363.899868] ? exit_to_user_mode_prepare+0x3a/0x180
> [ 363.900390] do_syscall_64+0x38/0xc0
> [ 363.900810] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 363.901330] RIP: 0033:0x7f21915c57d7
> [ 363.901747] RSP: 002b:00007ffd90869fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [ 363.902442] RAX: ffffffffffffffda RBX: 000055ce676ffc30 RCX: 00007f21915c57d7
> [ 363.903104] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055ce676ffc98
> [ 363.903782] RBP: 000055ce676ffc30 R08: 0000000000000000 R09: 0000000000000000
> [ 363.904462] R10: 00007f2191638ac0 R11: 0000000000000206 R12: 000055ce676ffc98
> [ 363.905128] R13: 0000000000000000 R14: 0000000000000000 R15: 000055ce676ffdf0
> [ 363.905797] </TASK>
That doesn't show the deadlock is related with module_exit().
>
>
> And gdb:
>
> (gdb) l *(__kernfs_remove+0x21e)
> 0xffffffff8139288e is in __kernfs_remove (fs/kernfs/dir.c:476).
> 471 if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
> 472 lock_contended(&kn->dep_map, _RET_IP_);
> 473 }
> 474
> 475 /* but everyone should wait for draining */
> 476 wait_event(root->deactivate_waitq,
> 477 atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
> 478
> 479 if (kernfs_lockdep(kn)) {
> 480 lock_acquired(&kn->dep_map, _RET_IP_);
>
> (gdb) l *(kernfs_remove_by_name_ns+0x50)
> 0xffffffff813938d0 is in kernfs_remove_by_name_ns (fs/kernfs/dir.c:1534).
> 1529
> 1530 kn = kernfs_find_ns(parent, name, ns);
> 1531 if (kn)
> 1532 __kernfs_remove(kn);
> 1533
> 1534 up_write(&kernfs_rwsem);
> 1535
> 1536 if (kn)
> 1537 return 0;
> 1538 else
>
> The same happens for test 0028 except instead of a mutex
> lock an rtnl_lock() is used.
>
> Would this be better for the commit log?
>
> > And why does it make a difference in case of being
> > called from module_exit()?
>
> Well because that is where we remove the sysfs files. *If*
> a developer happens to use a lock on a sysfs op but it is
> also used on module exit, this deadlock is bound to happen.
It is clearly an AA deadlock. What I meant was that it isn't related to
module exit, because taking the lock and calling device_del() isn't always
done in module exit, so I doubt your fix of grabbing the module refcnt is
good or generic enough.
Apart from your cooked-up test_sysfs module, how many real drivers suffer
from the problem? What are they? Why can't we fix the exact driver?
Thanks,
Ming
On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
> On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> > > Looks test_sysfs isn't in linus tree, where can I find it?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> >
> > To reproduce the deadlock revert the patch in this thread and then run
> > either of these two tests as root:
> >
> > ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> > ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> >
> > You will need to enable the test_sysfs driver.
> > > Can you share the code which waits for the sysfs / kernfs files to be
> > > stop being used?
> >
> > How about a call trace of the two tasks which deadlock, here is one of
> > running test 0027:
> >
> > kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more
> > than 120 seconds.
<-- snip -->
> That doesn't show the deadlock is related with module_exit().
Not directly, no.
> It is clearly one AA deadlock, what I meant was that it isn't related with
> module exit cause lock & device_del() isn't always done in module exit, so
> I doubt your fix with grabbing module refcnt is good or generic enough.
A device_del() *can* happen in areas other than module exit, sure,
but the issue arises when a shared lock is taken *before* device_del() and
is also taken in a sysfs op. Typically this happens on module exit, and the
other common case in my experience is within sysfs ops themselves, as is
the case with the zram driver. Both cases are then covered by this fix.
If there are other areas, that is still driver specific, but of the
things we *can* generalize, definitely module exit is a common path.
> Except for your cooked test_sys module, how many real drivers do suffer the
> problem? What are they?
I only really seriously considered trying to generalize this after it
was hinted to me live patching was also affected, and so clearly
something generic was desirable.
There may well be other drivers, but hunting for them semantically
would require a fairly complex Coccinelle patch with iteration support.
> Why can't we fix the exact driver?
You can try. The way the lock is used in zram is correct, especially
after my other fix in this series which addresses another unrelated bug
with cpu hotplug multistate support. So we can then either take the
position of saying "Thou shalt not use a shared lock on module
exit and a sysfs op" and try to fix all places, or we generalize a fix
for this. A generic fix seems more desirable.
Luis
On Mon, Oct 11, 2021 at 03:26:02PM -0700, Luis Chamberlain wrote:
> On Tue, Oct 05, 2021 at 01:50:31PM -0700, Kees Cook wrote:
> > For example, why does this not work?
>
> It does for the write case for sure,
I misspoke. Just for the record, the changes you mentioned actually don't
suffice for the test cases in question for test_sysfs; the deadlock
still occurs with those changes. At first I thought they did, but I had
failed to first remove my own fix in fs/kernfs/dir.c. After removing that
and trying just the proposed changes, I confirm they do not fix the deadlock.
Luis
On Wed, Oct 13, 2021 at 05:35:31AM -0700, Luis Chamberlain wrote:
> On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
> > On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> > > > Looks test_sysfs isn't in linus tree, where can I find it?
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> > >
> > > To reproduce the deadlock revert the patch in this thread and then run
> > > either of these two tests as root:
> > >
> > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> > >
> > > You will need to enable the test_sysfs driver.
> > > > Can you share the code which waits for the sysfs / kernfs files to be
> > > > stop being used?
> > >
> > > How about a call trace of the two tasks which deadlock, here is one of
> > > running test 0027:
> > >
> > > kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more
> > > than 120 seconds.
>
> <-- snip -->
>
> > That doesn't show the deadlock is related with module_exit().
>
> Not directly no.
Then the patch title of 'sysfs: fix deadlock race with module removal'
is wrong.
>
> > It is clearly one AA deadlock, what I meant was that it isn't related with
> > module exit cause lock & device_del() isn't always done in module exit, so
> > I doubt your fix with grabbing module refcnt is good or generic enough.
>
> A device_del() *can* happen in other areas other than module exit sure,
> but the issue is if a shared lock is used *before* device_del() and also
> used on a sysfs op. Typically this can happen on module exit, and the
> other common use case in my experience is on sysfs ops, such is the case
> with the zram driver. Both cases are covered then by this fix.
Again, can you share the related zram code about the issue? In
zram_drv.c of linus or next tree, I don't see any lock being held before
calling del_gendisk().
>
> If there are other areas, that is still driver specific, but of the
> things we *can* generalize, definitely module exit is a common path.
>
> > Except for your cooked test_sys module, how many real drivers do suffer the
> > problem? What are they?
>
> I only really seriously considered trying to generalize this after it
IMO your generalization isn't good or correct, because this kind of issue
is _not_ related to module exit at all. What matters is just that a lock is
held before calling device_del() while the same lock is required
in the device's attribute show()/store() functions.
There are many cases in which we call device_del() from somewhere other than
module_exit(), such as scsi scan, scsi sysfs store(), or even handling an
event from the device side, nvme error handling, usb hotplug, ...
> was hinted to me live patching was also affected, and so clearly
> something generic was desirable.
It might be that only these two drivers (zram and live patch) have this bug,
and it is simply an AA bug in the driver. Not to mention that I don't see
such usage in zram_drv.c.
>
> There may be other drivers for sure, but a hunt for that with semantics
> would require a bit complex coccinelle patch with iteration support.
>
> > Why can't we fix the exact driver?
>
> You can try, the way the lock is used in zram is correct, specially
What is the lock in zram? Again can you share the related functions?
> after my other fix in this series which addresses another unrelated bug
> with cpu hotplug multistate support. So we then can proceed to either
> take the position to say: "Thou shalt not use a shared lock on module
> exit and a sysfs op" and try to fix all places, or we generalize a fix
> for this. A generic fix seems more desirable.
What matters is that the lock is held before calling device_del()
instead of being held in module_exit().
Thanks,
Ming
On Wed, Oct 13, 2021 at 11:04:07PM +0800, Ming Lei wrote:
> On Wed, Oct 13, 2021 at 05:35:31AM -0700, Luis Chamberlain wrote:
> > On Wed, Oct 13, 2021 at 09:07:03AM +0800, Ming Lei wrote:
> > > On Tue, Oct 12, 2021 at 02:18:28PM -0700, Luis Chamberlain wrote:
> > > > > Looks test_sysfs isn't in linus tree, where can I find it?
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
> > > >
> > > > To reproduce the deadlock revert the patch in this thread and then run
> > > > either of these two tests as root:
> > > >
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0027
> > > > ./tools/testing/selftests/sysfs/sysfs.sh -w 0028
> > > >
> > > > You will need to enable the test_sysfs driver.
> > > > > Can you share the code which waits for the sysfs / kernfs files to be
> > > > > stop being used?
> > > >
> > > > How about a call trace of the two tasks which deadlock, here is one of
> > > > running test 0027:
> > > >
> > > > kdevops login: [ 363.875459] INFO: task sysfs.sh:1271 blocked for more
> > > > than 120 seconds.
> >
> > <-- snip -->
> >
> > > That doesn't show the deadlock is related with module_exit().
> >
> > Not directly no.
>
> Then the patch title of 'sysfs: fix deadlock race with module removal'
> is wrong.
Well that is what it does though. The scope of the issue you are raising
is beyond module removal, but I do agree such races can exist outside of
module removal.
> > > It is clearly one AA deadlock, what I meant was that it isn't related with
> > > module exit cause lock & device_del() isn't always done in module exit, so
> > > I doubt your fix with grabbing module refcnt is good or generic enough.
> >
> > A device_del() *can* happen in other areas other than module exit sure,
> > but the issue is if a shared lock is used *before* device_del() and also
> > used on a sysfs op. Typically this can happen on module exit, and the
> > other common use case in my experience is on sysfs ops, such is the case
> > with the zram driver. Both cases are covered then by this fix.
>
> Again, can you share the related zram code about the issue? In
> zram_drv.c of linus or next tree, I don't see any lock is held before
> calling del_gendisk().
There is another bug with CPU hotplug multistate support in the zram
driver which a patch in this series fixes, refer to the patch titled
"zram: fix crashes with cpu hotplug multistate". In zram's case we need
to contend a generic lock on certain sysfs attributes due to the way CPU
hotplug is used.
If we tried to generalize this on the block layer the closest we get is
the disk->fops->owner, however zram is an example driver where the
disk->fops is actually even changed *after* module load, and so the
original disk->fops->owner can be dynamic. In zram's case the
fops->owner is the same, however we have no semantics to ensure this is
the case for all block drivers.
In the case for live patching, refer to the use of klp_mutex. The way
that was solved there was a combination of completions and deferred
works to solve it, so that all kobject_put calls are outside of the
critical sections, refer to commit 3ec24776bfd0 ("livepatch:
allow removal of a disabled patch").
And so it was encouraged that a generic solution be sought.
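As an illustration of the "keep the put outside of the critical section"
idea only, here is a small userspace sketch. It is not the livepatch
code: the model_* names are hypothetical, an atomic_flag stands in for a
driver lock such as klp_mutex, and "the final put would have to wait on a
sysfs op that needs the lock" is approximated by a failed trylock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a driver-wide lock. */
static atomic_flag model_lock = ATOMIC_FLAG_INIT;

struct model_obj {
	bool listed;	/* still reachable through the driver's own lists */
	bool released;	/* teardown that waits on sysfs users has completed */
};

static bool model_trylock(void)
{
	return !atomic_flag_test_and_set(&model_lock);
}

static void model_unlock(void)
{
	atomic_flag_clear(&model_lock);
}

/*
 * Stand-in for the final put. In the kernel this removes the sysfs files
 * and waits for in-flight ops, and those ops may need the driver lock.
 * Here that dependency is collapsed into one trylock: failure means the
 * real code would wait forever.
 */
bool model_final_put(struct model_obj *o)
{
	if (!model_trylock())
		return false;
	o->released = true;
	model_unlock();
	return true;
}

/* Deadlock-prone ordering: the final put is issued with the lock held. */
bool model_remove_put_inside(struct model_obj *o)
{
	bool ok;

	if (!model_trylock())
		return false;
	o->listed = false;
	ok = model_final_put(o);
	model_unlock();
	return ok;
}

/* Safer ordering: unlink under the lock, drop it, then do the final put. */
bool model_remove_put_outside(struct model_obj *o)
{
	if (!model_trylock())
		return false;
	o->listed = false;
	model_unlock();
	return model_final_put(o);
}
```

In this model model_remove_put_inside() can never complete the release,
while model_remove_put_outside() can. The cost is that every driver has
to restructure its own teardown this way, which is why a per-driver fix
does not scale.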
> > If there are other areas, that is still driver specific, but of the
> > things we *can* generalize, definitely module exit is a common path.
> >
> > > Except for your cooked test_sys module, how many real drivers do suffer the
> > > problem? What are they?
> >
> > I only really seriously considered trying to generalize this after it
>
> IMO your generalization isn't good or correct because this kind of issue
> is _not_ related with module exit at all. What matters is just that one lock is
> held before calling device_del(), meantime the same lock is required
> in the device's attribute show/store function().
Your point that a race for a deadlock still can exist beyond module
removal is valid but unfortunately there are no possible semantics I can
see to fix that generically at this time.
> There are many cases in which we call device_del() not from module_exit(),
> such as scsi scan, scsi sysfs store(), or even handling event from
> device side, nvme error handling, usb hotplug, ...
These are really good points.
> > was hinted to me live patching was also affected, and so clearly
> > something generic was desirable.
>
> It might be just the only two drivers(zram and live patch) with this bug, and
> it is one simply AA bug in driver. Not mention I don't see such usage in
> zram_drv.c.
Well... given what you say above about use cases other than module
removal which can remove sysfs files while they are in use, the
possibilities of this deadlock existing elsewhere should increase,
not decrease.
> > There may be other drivers for sure, but a hunt for that with semantics
> > would require a bit complex coccinelle patch with iteration support.
> >
> > > Why can't we fix the exact driver?
> >
> > You can try, the way the lock is used in zram is correct, specially
>
> What is the lock in zram? Again can you share the related functions?
If you have checked out the tree I mentioned, try looking at the code
there with the fix for CPU hotplug multistate in mind.
> > after my other fix in this series which addresses another unrelated bug
> > with cpu hotplug multistate support. So we then can proceed to either
> > take the position to say: "Thou shalt not use a shared lock on module
> > exit and a sysfs op" and try to fix all places, or we generalize a fix
> > for this. A generic fix seems more desirable.
>
> What matters is that the lock is held before calling device_del()
> instead of being held in module_exit().
I agree the possibilities can include more than just module exit.
Unfortunately I can't see a way to generalize this further. I tried,
see below, and this moves the ideas from a module to the kobject, but
even with that, it does not get us any closer to fixing this
generically. The reason the fix works for module removal is that the
try_module_get() call made when getting the kernfs active reference
will trump the module exit call completely, and so we *do* prevent
the context which will issue the lock in this case if a sysfs
operation is in progress.
Outside of that call sequence I am afraid we'd need separate solutions
or side with the 'thou shalt not use a shared lock on a sysfs op
and when issuing a device_del(), other than module exit'.
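A minimal userspace sketch of why the module reference is enough in the
module removal case. This is not the kernel implementation: the model_*
names are hypothetical, a single counter only approximates the module
refcount and state, and returning -ENODEV from the op is an assumption
made for the illustration.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* refcnt == 1: loaded with no users; refcnt == 0: module is going away. */
struct model_module {
	atomic_int refcnt;
};

void model_module_init(struct model_module *m)
{
	atomic_init(&m->refcnt, 1);
}

/* Fails once removal has started, so the op never reaches the shared lock. */
bool model_try_module_get(struct model_module *m)
{
	int v = atomic_load(&m->refcnt);

	while (v > 0) {
		if (atomic_compare_exchange_weak(&m->refcnt, &v, v + 1))
			return true;
	}
	return false;
}

void model_module_put(struct model_module *m)
{
	atomic_fetch_sub(&m->refcnt, 1);
}

/* Removal only starts if no op holds a reference: 1 -> 0 or nothing. */
bool model_try_unload(struct model_module *m)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&m->refcnt, &expected, 0);
}

/* Stand-in for a store op that would take the shared lock in its body. */
int model_store(struct model_module *owner, int *attr, int val)
{
	if (!model_try_module_get(owner))
		return -ENODEV;	/* yield to module removal */
	*attr = val;		/* the locked section would run here */
	model_module_put(owner);
	return 0;
}
```

Either the op holds a reference and model_try_unload() refuses to start,
so the exit routine never takes the lock, or removal has started and the
op bails out before touching the lock. A lock taken before device_del()
on a path other than module exit has no equivalent gate, which is the
part I could not generalize.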
Below is an attempt to generalize this further, but it does not work,
let me know if you have further ideas.
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..4edf3b37fd2c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -209,7 +209,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
- 0, rft->kf_ops, rft, NULL, NULL);
+ 0, rft->kf_ops, rft, NULL, NULL, NULL);
if (IS_ERR(kn))
return PTR_ERR(kn);
@@ -2482,7 +2482,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
kn = __kernfs_create_file(parent_kn, name, 0444,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
- &kf_mondata_ops, priv, NULL, NULL);
+ &kf_mondata_ops, priv, NULL, NULL, NULL);
if (IS_ERR(kn))
return PTR_ERR(kn);
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 7758223f040c..38f07072ab44 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3507,6 +3507,7 @@ bool kill_device(struct device *dev)
if (dev->p->dead)
return false;
dev->p->dead = true;
+ kobject_set_being_removed(&dev->kobj);
return true;
}
EXPORT_SYMBOL_GPL(kill_device);
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..7d14f6b2c12d 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
#include <linux/slab.h>
#include <linux/security.h>
#include <linux/hash.h>
+#include <linux/kobject.h>
#include "kernfs-internal.h"
@@ -414,15 +415,38 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
*/
struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
{
+ int v;
+
if (unlikely(!kn))
return NULL;
if (!atomic_inc_unless_negative(&kn->active))
return NULL;
+ /*
+ * If a kobject created the kernfs_node, the kobject cannot possibly be
+ * removed if the above atomic_inc_unless_negative() succeeded. But we
+ * need to inspect if its on its way out to ensure that we don't
+ * deadlock in case a kernfs operation and the code responsible for
+ * the kobject removal used a shared lock.
+ */
+ if (kn->kobj) {
+ if (WARN_ON(!kobject_get_unless_zero(kn->kobj))) {
+ goto fail;
+ } else if (kobject_being_removed(kn->kobj)) {
+ kobject_put(kn->kobj);
+ goto fail;
+ }
+ }
+
if (kernfs_lockdep(kn))
rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
return kn;
+fail:
+ v = atomic_dec_return(&kn->active);
+ if (unlikely(v == KN_DEACTIVATED_BIAS))
+ wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+ return NULL;
}
/**
@@ -442,6 +466,7 @@ void kernfs_put_active(struct kernfs_node *kn)
if (kernfs_lockdep(kn))
rwsem_release(&kn->dep_map, _RET_IP_);
v = atomic_dec_return(&kn->active);
+ kobject_put(kn->kobj);
if (likely(v != KN_DEACTIVATED_BIAS))
return;
@@ -572,7 +597,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- unsigned flags)
+ unsigned flags,
+ struct kobject *kobj)
{
struct kernfs_node *kn;
u32 id_highbits;
@@ -607,6 +633,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
kn->name = name;
kn->mode = mode;
kn->flags = flags;
+ kn->kobj = kobj;
if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
struct iattr iattr = {
@@ -640,12 +667,13 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- unsigned flags)
+ unsigned flags,
+ struct kobject *kobj)
{
struct kernfs_node *kn;
kn = __kernfs_new_node(kernfs_root(parent), parent,
- name, mode, uid, gid, flags);
+ name, mode, uid, gid, flags, kobj);
if (kn) {
kernfs_get(parent);
kn->parent = parent;
@@ -927,7 +955,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
- KERNFS_DIR);
+ KERNFS_DIR, NULL);
if (!kn) {
idr_destroy(&root->ino_idr);
kfree(root);
@@ -969,20 +997,22 @@ void kernfs_destroy_root(struct kernfs_root *root)
* @gid: gid of the new directory
* @priv: opaque data associated with the new directory
* @ns: optional namespace tag of the directory
+ * @kobj: if set, the kobject responsible for this directory
*
* Returns the created node on success, ERR_PTR() value on failure.
*/
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- void *priv, const void *ns)
+ void *priv, const void *ns,
+ struct kobject *kobj)
{
struct kernfs_node *kn;
int rc;
/* allocate */
kn = kernfs_new_node(parent, name, mode | S_IFDIR,
- uid, gid, KERNFS_DIR);
+ uid, gid, KERNFS_DIR, kobj);
if (!kn)
return ERR_PTR(-ENOMEM);
@@ -1014,7 +1044,8 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
/* allocate */
kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
- GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+ GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR,
+ parent->kobj);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..1b02f3e69c81 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -978,6 +978,7 @@ const struct file_operations kernfs_file_fops = {
* @priv: private data for the file
* @ns: optional namespace tag of the file
* @key: lockdep key for the file's active_ref, %NULL to disable lockdep
+ * @kobj: if set, the kobject responsible for the file
*
* Returns the created node on success, ERR_PTR() value on error.
*/
@@ -987,7 +988,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
loff_t size,
const struct kernfs_ops *ops,
void *priv, const void *ns,
- struct lock_class_key *key)
+ struct lock_class_key *key,
+ struct kobject *kobj)
{
struct kernfs_node *kn;
unsigned flags;
@@ -996,7 +998,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
flags = KERNFS_FILE;
kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
- uid, gid, flags);
+ uid, gid, flags, kobj);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..44983720d292 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -134,7 +134,8 @@ int kernfs_add_one(struct kernfs_node *kn);
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- unsigned flags);
+ unsigned flags,
+ struct kobject *kobj);
/*
* file.c
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..c877de06e53a 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
gid = target->iattr->ia_gid;
}
- kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+ kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK,
+ target->kobj);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..9cc159e9fb55 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -57,7 +57,7 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
kobject_get_ownership(kobj, &uid, &gid);
kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
- kobj, ns);
+ kobj, ns, kobj);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..e1a3315dba35 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -292,7 +292,7 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
#endif
kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
- PAGE_SIZE, ops, (void *)attr, ns, key);
+ PAGE_SIZE, ops, (void *)attr, ns, key, kobj);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, attr->name);
@@ -309,6 +309,7 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
struct lock_class_key *key = NULL;
const struct kernfs_ops *ops;
struct kernfs_node *kn;
+ struct kobject *kobj = parent->priv;
if (battr->mmap)
ops = &sysfs_bin_kfops_mmap;
@@ -327,7 +328,8 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
#endif
kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
- battr->size, ops, (void *)attr, ns, key);
+ battr->size, ops, (void *)attr, ns, key,
+ kobj);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..36022fe2b21d 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,7 +135,8 @@ static int internal_create_group(struct kobject *kobj, int update,
} else {
kn = kernfs_create_dir_ns(kobj->sd, grp->name,
S_IRWXU | S_IRUGO | S_IXUGO,
- uid, gid, kobj, NULL);
+ uid, gid, kobj, NULL,
+ kobj);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(kobj->sd, grp->name);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..38155414e6e5 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,7 @@ struct kernfs_node {
unsigned short flags;
umode_t mode;
struct kernfs_iattrs *iattr;
+ struct kobject *kobj;
};
/*
@@ -370,7 +371,8 @@ void kernfs_destroy_root(struct kernfs_root *root);
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
kuid_t uid, kgid_t gid,
- void *priv, const void *ns);
+ void *priv, const void *ns,
+ struct kobject *kobj);
struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
const char *name);
struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
@@ -379,7 +381,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
loff_t size,
const struct kernfs_ops *ops,
void *priv, const void *ns,
- struct lock_class_key *key);
+ struct lock_class_key *key,
+ struct kobject *kobj);
struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
const char *name,
struct kernfs_node *target);
@@ -472,14 +475,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
static inline struct kernfs_node *
kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
umode_t mode, kuid_t uid, kgid_t gid,
- void *priv, const void *ns)
+ void *priv, const void *ns, struct kobject *kobj)
{ return ERR_PTR(-ENOSYS); }
static inline struct kernfs_node *
__kernfs_create_file(struct kernfs_node *parent, const char *name,
umode_t mode, kuid_t uid, kgid_t gid,
loff_t size, const struct kernfs_ops *ops,
- void *priv, const void *ns, struct lock_class_key *key)
+ void *priv, const void *ns, struct lock_class_key *key,
+ struct kobject *kobj)
{ return ERR_PTR(-ENOSYS); }
static inline struct kernfs_node *
@@ -566,7 +570,7 @@ kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
{
return kernfs_create_dir_ns(parent, name, mode,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
- priv, NULL);
+ priv, NULL, parent->kobj);
}
static inline int kernfs_remove_by_name(struct kernfs_node *parent,
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index efd56f990a46..cb26ebeb7cf1 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -77,6 +77,7 @@ struct kobject {
unsigned int state_add_uevent_sent:1;
unsigned int state_remove_uevent_sent:1;
unsigned int uevent_suppress:1;
+ unsigned int being_removed:1;
};
extern __printf(2, 3)
@@ -117,6 +118,15 @@ extern void kobject_get_ownership(struct kobject *kobj,
kuid_t *uid, kgid_t *gid);
extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
+static inline bool kobject_being_removed(const struct kobject *kobj)
+{
+ if (!kobj)
+ return false;
+ return !!kobj->being_removed;
+}
+
+void kobject_set_being_removed(struct kobject *kobj);
+
/**
* kobject_has_children - Returns whether a kobject has children.
* @kobj: the object to test
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9e0390000025..c6b0a28f599c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3975,7 +3975,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
cgroup_file_mode(cft),
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
0, cft->kf_ops, cft,
- NULL, key);
+ NULL, key, NULL);
if (IS_ERR(kn))
return PTR_ERR(kn);
diff --git a/lib/kobject.c b/lib/kobject.c
index 4a56f519139d..ef89bf2ac218 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -221,6 +221,12 @@ static void kobject_init_internal(struct kobject *kobj)
kobj->state_initialized = 1;
}
+void kobject_set_being_removed(struct kobject *kobj)
+{
+ if (!kobj)
+ return;
+ kobj->being_removed = 1;
+}
static int kobject_add_internal(struct kobject *kobj)
{
On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> Provide a simple state machine to fix races with driver exit where we
> remove the CPU multistate callbacks and re-initialization / creation of
> new per CPU instances which should be managed by these callbacks.
>
> The zram driver makes use of cpu hotplug multistate support, whereby it
> associates a struct zcomp per CPU. Each struct zcomp represents a
> compression algorithm in charge of managing compression streams per
> CPU. Although a compiled zram driver only supports a fixed set of
> compression algorithms, each zram device gets a struct zcomp allocated
> per CPU. The "multi" in CPU hotplug multistate refers to these per
> cpu struct zcomp instances. Each of these will have the CPU hotplug
> callback called for it on CPU plug / unplug. The kernel's CPU hotplug
> multistate keeps a linked list of these different structures so that
> it will iterate over them on CPU transitions.
>
> By default at driver initialization we will create just one zram device
> (num_devices=1), and a zcomp structure is then set up for the now default
> lzo-rle compression algorithm. At driver removal we first remove each
> zram device, and so we destroy the associated struct zcomp per CPU. But
> since we expose sysfs attributes to create new devices or reset /
> initialize existing zram devices, we can easily end up re-initializing
> a struct zcomp for a zram device before the exit routine of the module
> removes the cpu hotplug callback. When this happens the kernel's CPU
> hotplug will detect that at least one instance (struct zcomp for us)
> exists. This can happen in the following situation:
>
> CPU 1 CPU 2
>
> disksize_store(...);
> class_unregister(...);
> idr_for_each(...);
> zram_debugfs_destroy();
>
> idr_destroy(...);
> unregister_blkdev(...);
> cpuhp_remove_multi_state(...);
>
> The warning comes up on cpuhp_remove_multi_state() when it sees that the
> state for CPUHP_ZCOMP_PREPARE does not have an empty instance linked list.
> In this case a struct zcomp still exists: the driver allowed its
> creation per CPU even though we could have just freed them per CPU
> through a call on another CPU, and we are then later trying to remove the
> hotplug callback.
>
> Fix all this by providing a zram initialization boolean,
> protected by the driver's shared zram_index_mutex, which we
> can use to annotate when sysfs attributes are safe to use or
> not -- once the driver is properly initialized. When the driver
> is going down we also are sure to not let userspace muck with
> attributes which may affect each per cpu struct zcomp.
>
> This also fixes a series of possible memory leaks. The
> crashes and memory leaks can easily be triggered by running
> the zram02.sh script from the LTP project [0] in a loop
> in two separate windows:
>
> cd testcases/kernel/device-drivers/zram
> while true; do PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh; done
>
> You end up with a splat as follows:
>
> kernel: zram: Removed device: zram0
> kernel: zram: Added device: zram0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: Adding 104857596k swap on /dev/zram0. <etc>
> kernel: zram0: detected capacity change from 209715200 to 0
> kernel: zram0: detected capacity change from 0 to 209715200
> kernel: ------------[ cut here ]------------
> kernel: Error: Removing state 63 which has instances left.
> kernel: WARNING: CPU: 7 PID: 70457 at \
> kernel/cpu.c:2069 __cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Modules linked in: zram(E-) zsmalloc(E) <etc>
> kernel: CPU: 7 PID: 70457 Comm: rmmod Tainted: G \
> E 5.12.0-rc1-next-20210304 #3
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), \
> BIOS 1.14.0-2 04/01/2014
> kernel: RIP: 0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
> kernel: Code: <etc>
> kernel: RSP: 0018:ffffa800c139be98 EFLAGS: 00010282
> kernel: RAX: 0000000000000000 RBX: ffffffff9083db58 RCX: ffff9609f7dd86d8
> kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9609f7dd86d0
> kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: ffffa800c139bcb8
> kernel: R10: ffffa800c139bcb0 R11: ffffffff908bea40 R12: 000000000000003f
> kernel: R13: 00000000000009d8 R14: 0000000000000000 R15: 0000000000000000
> kernel: FS: 00007f1b075a7540(0000) GS:ffff9609f7dc0000(0000) knlGS:<etc>
> kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007f1b07610490 CR3: 00000001bd04e000 CR4: 0000000000350ee0
> kernel: Call Trace:
> kernel: __cpuhp_remove_state+0x2e/0x80
> kernel: __do_sys_delete_module+0x190/0x2a0
> kernel: do_syscall_64+0x33/0x80
> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> The "Error: Removing state 63 which has instances left" refers
> to the zram per CPU struct zcomp instances left.
>
> [0] https://github.com/linux-test-project/ltp.git
>
> Acked-by: Minchan Kim <[email protected]>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
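For anyone following along, the "initialization boolean protected by a
mutex" idea from the quoted commit log can be modeled in userspace roughly
as below. This is only a sketch: every name in it (model_up, model_store()
and so on) is a hypothetical stand-in and not a symbol from the zram driver,
and a counter stands in for the per-CPU struct zcomp instances.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* All names below are hypothetical userspace stand-ins, not zram symbols. */
static pthread_mutex_t model_index_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool model_up;        /* "driver is initialized"; guarded by the mutex */
static int model_instances;  /* stands in for per-CPU struct zcomp instances */

/* Module init: only from here on may attribute stores do real work. */
void model_init(void)
{
	pthread_mutex_lock(&model_index_mutex);
	model_up = true;
	pthread_mutex_unlock(&model_index_mutex);
}

/* Attribute store (think disksize_store()): refuse unless the driver is up. */
int model_store(void)
{
	int ret = 0;

	pthread_mutex_lock(&model_index_mutex);
	if (!model_up)
		ret = -ENODEV;
	else
		model_instances++;	/* would allocate a new instance */
	pthread_mutex_unlock(&model_index_mutex);
	return ret;
}

/* Module exit: clear the flag first, then tear down; returns what is left. */
int model_exit(void)
{
	int left;

	pthread_mutex_lock(&model_index_mutex);
	model_up = false;
	model_instances = 0;	/* would free every instance */
	left = model_instances;
	pthread_mutex_unlock(&model_index_mutex);
	assert(left == 0);
	return left;
}
```

With this, a store that arrives after model_exit() gets -ENODEV and cannot
re-create an instance. The model deliberately ignores the ordering of sysfs
file removal, which is where the deadlock discussed in this thread comes from.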
Hello Luis,
Can you test the following patch and see if the issue can be addressed?
Please see the idea from the inline comment.
Also zram_index_mutex isn't needed in zram disk's store() compared with
your patch, then the deadlock issue you are addressing in this series can
be avoided.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..3c17927d23a7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
- zram_reset_device(zram);
pr_info("Removed device: %s\n", zram->disk->disk_name);
del_gendisk(zram->disk);
+
+ /*
+ * reset device after gendisk is removed, so any change from sysfs
+ * store won't come in, then we can really reset device here
+ */
+ zram_reset_device(zram);
+
blk_cleanup_disk(zram->disk);
kfree(zram);
return 0;
@@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
static void destroy_devices(void)
{
class_unregister(&zram_control_class);
+
+ /* hold the global lock so new device can't be added */
+ mutex_lock(&zram_index_mutex);
idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
+ mutex_unlock(&zram_index_mutex);
+
zram_debugfs_destroy();
idr_destroy(&zram_index_idr);
unregister_blkdev(zram_major, "zram");
Thanks,
Ming
On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
...
>
> Hello Luis,
>
> Can you test the following patch and see if the issue can be addressed?
>
> Please see the idea from the inline comment.
>
> Also zram_index_mutex isn't needed in zram disk's store() compared with
> your patch, then the deadlock issue you are addressing in this series can
> be avoided.
>
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fcaf2750f68f..3c17927d23a7 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
>
> /* Make sure all the pending I/O are finished */
> fsync_bdev(bdev);
> - zram_reset_device(zram);
>
> pr_info("Removed device: %s\n", zram->disk->disk_name);
>
> del_gendisk(zram->disk);
> +
> + /*
> + * reset device after gendisk is removed, so any change from sysfs
> + * store won't come in, then we can really reset device here
> + */
> + zram_reset_device(zram);
> +
> blk_cleanup_disk(zram->disk);
> kfree(zram);
> return 0;
> @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> static void destroy_devices(void)
> {
> class_unregister(&zram_control_class);
> +
> + /* hold the global lock so new device can't be added */
> + mutex_lock(&zram_index_mutex);
> idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> + mutex_unlock(&zram_index_mutex);
> +
Actually zram_index_mutex isn't needed when calling zram_remove_cb()
since the zram-control sysfs interface has been removed, so userspace
can't add new device any more, then the issue is supposed to be fixed
by the following one line change, please test it:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fcaf2750f68f..96dd641de233 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
- zram_reset_device(zram);
pr_info("Removed device: %s\n", zram->disk->disk_name);
del_gendisk(zram->disk);
+
+ /*
+ * reset device after gendisk is removed, so any change from sysfs
+ * store won't come in, then we can really reset device here
+ */
+ zram_reset_device(zram);
+
blk_cleanup_disk(zram->disk);
kfree(zram);
return 0;
Thanks,
Ming
On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
>
> ...
>
> >
> > Hello Luis,
> >
> > Can you test the following patch and see if the issue can be addressed?
> >
> > Please see the idea from the inline comment.
> >
> > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > your patch, then the deadlock issue you are addressing in this series can
> > be avoided.
> >
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index fcaf2750f68f..3c17927d23a7 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> >
> > /* Make sure all the pending I/O are finished */
> > fsync_bdev(bdev);
> > - zram_reset_device(zram);
> >
> > pr_info("Removed device: %s\n", zram->disk->disk_name);
> >
> > del_gendisk(zram->disk);
> > +
> > + /*
> > + * reset device after gendisk is removed, so any change from sysfs
> > + * store won't come in, then we can really reset device here
> > + */
> > + zram_reset_device(zram);
> > +
> > blk_cleanup_disk(zram->disk);
> > kfree(zram);
> > return 0;
> > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> > static void destroy_devices(void)
> > {
> > class_unregister(&zram_control_class);
> > +
> > + /* hold the global lock so new device can't be added */
> > + mutex_lock(&zram_index_mutex);
> > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > + mutex_unlock(&zram_index_mutex);
> > +
>
> Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> since the zram-control sysfs interface has been removed, so userspace
> can't add new device any more, then the issue is supposed to be fixed
> by the following one line change, please test it:
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index fcaf2750f68f..96dd641de233 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
>
> /* Make sure all the pending I/O are finished */
> fsync_bdev(bdev);
> - zram_reset_device(zram);
>
> pr_info("Removed device: %s\n", zram->disk->disk_name);
>
> del_gendisk(zram->disk);
> +
> + /*
> + * reset device after gendisk is removed, so any change from sysfs
> + * store won't come in, then we can really reset device here
> + */
> + zram_reset_device(zram);
> +
> blk_cleanup_disk(zram->disk);
> kfree(zram);
> return 0;
Sorry but nope, the cpu multistate issue is still present and we end up
eventually with page faults. I tried with both patches.
Oct 14 20:21:34 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:34 kdevops kernel: Error: Removing state 65 which has
instances left.
Oct 14 20:21:34 kdevops kernel: WARNING: CPU: 4 PID: 3358 at
kernel/cpu.c:2151 __cpuhp_remove_state_cpuslocked+0xf9/0x100
Oct 14 20:21:34 kdevops kernel: Modules linked in: zram(E-) zstd(E)
zsmalloc(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E)
crc32_pclmul(E) ghash_clmulni_intel(E) >
Oct 14 20:21:34 kdevops kernel: CPU: 4 PID: 3358 Comm: rmmod Tainted: G
E 5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:34 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:34 kdevops kernel: RIP:
0010:__cpuhp_remove_state_cpuslocked+0xf9/0x100
Oct 14 20:21:34 kdevops kernel: Code: 21 00 48 c7 43 18 00 00 00 00 5b
5d 41 5c 41 5d 41 5e 41 5f e9 d8 17 84 00 0f 0b 44 89 e6 48 c7 c7 78 0c
8b ad e8 56 92 7f 00 <0f> 0b >
Oct 14 20:21:34 kdevops kernel: RSP: 0018:ffffaac980a1fe90 EFLAGS:
00010286
Oct 14 20:21:34 kdevops kernel: RAX: 0000000000000000 RBX:
ffffffffada3e208 RCX: 0000000000000000
Oct 14 20:21:34 kdevops kernel: RDX: 0000000000000001 RSI:
ffffffffad8efdb6 RDI: 00000000ffffffff
Oct 14 20:21:34 kdevops kernel: RBP: 0000000000000000 R08:
0000000000000000 R09: ffffaac980a1fcc0
Oct 14 20:21:34 kdevops kernel: R10: ffffaac980a1fcb8 R11:
ffffffffadac3c68 R12: 0000000000000041
Oct 14 20:21:34 kdevops kernel: R13: 0000000000000a28 R14:
0000000000000000 R15: 0000000000000000
Oct 14 20:21:34 kdevops kernel: FS: 00007fc0c2882580(0000)
GS:ffff9ed6f7d00000(0000) knlGS:0000000000000000
Oct 14 20:21:34 kdevops kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:34 kdevops kernel: CR2: 00005621b0490b78 CR3:
000000011a538005 CR4: 0000000000370ee0
Oct 14 20:21:34 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:34 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:34 kdevops kernel: Call Trace:
Oct 14 20:21:34 kdevops kernel: <TASK>
Oct 14 20:21:34 kdevops kernel: __cpuhp_remove_state+0x4d/0xc0
Oct 14 20:21:34 kdevops kernel: __do_sys_delete_module+0x18d/0x2a0
Oct 14 20:21:34 kdevops kernel: ?
fpregs_assert_state_consistent+0x1e/0x40
Oct 14 20:21:34 kdevops kernel: ? exit_to_user_mode_prepare+0x3a/0x180
Oct 14 20:21:34 kdevops kernel: do_syscall_64+0x38/0xc0
Oct 14 20:21:34 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:34 kdevops kernel: RIP: 0033:0x7fc0c29a84a7
<etc>
Oct 14 20:21:35 kdevops kernel: sysfs: cannot create duplicate filename
'/devices/virtual/block/zram0'
Oct 14 20:21:35 kdevops kernel: CPU: 5 PID: 3388 Comm: modprobe Tainted:
G W E 5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel: <TASK>
Oct 14 20:21:35 kdevops kernel: dump_stack_lvl+0x48/0x5e
Oct 14 20:21:35 kdevops kernel: sysfs_warn_dup.cold+0x17/0x24
Oct 14 20:21:35 kdevops kernel: sysfs_create_dir_ns+0xbc/0xd0
Oct 14 20:21:35 kdevops kernel: kobject_add_internal+0xbd/0x2b0
Oct 14 20:21:35 kdevops kernel: kobject_add+0x7e/0xb0
Oct 14 20:21:35 kdevops kernel: ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel: ? preempt_count_add+0x68/0xa0
Oct 14 20:21:35 kdevops kernel: device_add+0x11a/0x980
Oct 14 20:21:35 kdevops kernel: ? dev_set_name+0x53/0x70
Oct 14 20:21:35 kdevops kernel: device_add_disk+0x9d/0x3a0
Oct 14 20:21:35 kdevops kernel: zram_add+0x1ad/0x200 [zram]
Oct 14 20:21:35 kdevops kernel: ? 0xffffffffc0c10000
Oct 14 20:21:35 kdevops kernel: zram_init+0xd7/0x1000 [zram]
Oct 14 20:21:35 kdevops kernel: do_one_initcall+0x41/0x200
Oct 14 20:21:35 kdevops kernel: ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel: ? kmem_cache_alloc_trace+0x2ab/0x420
Oct 14 20:21:35 kdevops kernel: do_init_module+0x5c/0x270
Oct 14 20:21:35 kdevops kernel: __do_sys_finit_module+0xae/0x110
Oct 14 20:21:35 kdevops kernel: do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7fca3aa555e9
Oct 14 20:21:35 kdevops kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00
00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8
4c 8b 4c 24 08 0f 05 <48> 3d >
Oct 14 20:21:35 kdevops kernel: RSP: 002b:00007fff142417b8 EFLAGS:
00000246 ORIG_RAX: 0000000000000139
Oct 14 20:21:35 kdevops kernel: RAX: ffffffffffffffda RBX:
0000558ba9491bd0 RCX: 00007fca3aa555e9
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000000000 RSI:
0000558ba9491f60 RDI: 0000000000000003
Oct 14 20:21:35 kdevops kernel: RBP: 0000000000040000 R08:
0000000000000000 R09: 0000558ba9491db0
Oct 14 20:21:35 kdevops kernel: R10: 0000000000000003 R11:
0000000000000246 R12: 0000558ba9491f60
Oct 14 20:21:35 kdevops kernel: R13: 0000000000000000 R14:
0000558ba9491d00 R15: 0000558ba9491bd0
Oct 14 20:21:35 kdevops kernel: </TASK>
<etc>
Oct 14 20:21:35 kdevops kernel: kobject_add_internal failed for zram0
with -EEXIST, don't try to register things with the same name in the
same directory.
Oct 14 20:21:35 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:35 kdevops kernel: WARNING: CPU: 5 PID: 3388 at
block/genhd.c:537 device_add_disk+0x1b9/0x3a0
Oct 14 20:21:35 kdevops kernel: Modules linked in: zram(E+) zstd(E)
zsmalloc(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E)
crc32_pclmul(E) ghash_clmulni_intel(E) >
Oct 14 20:21:35 kdevops kernel: CPU: 5 PID: 3388 Comm: modprobe Tainted:
G W E 5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:device_add_disk+0x1b9/0x3a0
Oct 14 20:21:35 kdevops kernel: Code: 00 03 01 00 00 0f 85 32 ff ff ff
e9 1e ff ff ff 0f 0b 41 bc ea ff ff ff e9 29 ff ff ff 4c 89 ff e8 5c 45
1c 00 e9 ef fe ff ff <0f> 0b >
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac980607d90 EFLAGS:
00010287
Oct 14 20:21:35 kdevops kernel: RAX: 0000000000000000 RBX:
0000000000000000 RCX: 0000000000023005
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000022e05 RSI:
ffffffffacc4b710 RDI: 0000000000000000
Oct 14 20:21:35 kdevops kernel: RBP: ffff9ed5d788a600 R08:
0000000000000000 R09: ffffaac980607a98
Oct 14 20:21:35 kdevops kernel: R10: ffff9ed5c795ef00 R11:
ffffffffadac3c68 R12: 00000000ffffffef
Oct 14 20:21:35 kdevops kernel: R13: ffff9ed5d5600000 R14:
ffffffffc0a52100 R15: ffff9ed5d5600040
Oct 14 20:21:35 kdevops kernel: FS: 00007fca3a935580(0000)
GS:ffff9ed6f7d40000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: 00007fff1423e6d8 CR3:
0000000136752002 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel: <TASK>
Oct 14 20:21:35 kdevops kernel: zram_add+0x1ad/0x200 [zram]
Oct 14 20:21:35 kdevops kernel: ? 0xffffffffc0c10000
Oct 14 20:21:35 kdevops kernel: zram_init+0xd7/0x1000 [zram]
Oct 14 20:21:35 kdevops kernel: do_one_initcall+0x41/0x200
Oct 14 20:21:35 kdevops kernel: ? _raw_spin_unlock_irqrestore+0x25/0x40
Oct 14 20:21:35 kdevops kernel: ? kmem_cache_alloc_trace+0x2ab/0x420
Oct 14 20:21:35 kdevops kernel: do_init_module+0x5c/0x270
Oct 14 20:21:35 kdevops kernel: __do_sys_finit_module+0xae/0x110
Oct 14 20:21:35 kdevops kernel: do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7fca3aa555e9
<etc>
Oct 14 20:21:35 kdevops kernel: ------------[ cut here ]------------
Oct 14 20:21:35 kdevops kernel: WARNING: CPU: 2 PID: 3457 at
block/genhd.c:564 del_gendisk+0x1a2/0x1d0
Oct 14 20:21:35 kdevops kernel: Modules linked in: 842(E)
842_decompress(E) 842_compress(E) zram(E-) zstd(E) zsmalloc(E)
kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E>
Oct 14 20:21:35 kdevops kernel: CPU: 2 PID: 3457 Comm: rmmod Tainted: G
W E 5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:del_gendisk+0x1a2/0x1d0
Oct 14 20:21:35 kdevops kernel: Code: 48 8d 78 40 e8 8f 87 1d 00 48 8b
7b 40 5b 5d 41 5c 48 83 c7 40 e9 4e 47 1c 00 48 8b 70 40 eb ce f6 43 61
04 0f 85 85 fe ff ff <0f> 0b >
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac9807cfe30 EFLAGS:
00010246
Oct 14 20:21:35 kdevops kernel: RAX: ffff9ed5d5600380 RBX:
ffff9ed5d788a600 RCX: 0000000000000000
Oct 14 20:21:35 kdevops kernel: RDX: 0000000000000000 RSI:
ffffffffad8efdb6 RDI: ffff9ed5d788a600
Oct 14 20:21:35 kdevops kernel: RBP: ffff9ed5d788b600 R08:
0000000000000000 R09: ffffaac9807cfc88
Oct 14 20:21:35 kdevops kernel: R10: ffffaac9807cfc80 R11:
ffffffffadac3c68 R12: ffff9ed5d5600000
Oct 14 20:21:35 kdevops kernel: R13: 0000000000000000 R14:
ffffffffc0a52360 R15: ffff9ed5c4a87b78
Oct 14 20:21:35 kdevops kernel: FS: 00007f292a2bb580(0000)
GS:ffff9ed6f7c80000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: 000056161b453b78 CR3:
000000013213e002 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel: <TASK>
Oct 14 20:21:35 kdevops kernel: zram_remove+0x96/0xc0 [zram]
Oct 14 20:21:35 kdevops kernel: ? hot_remove_store+0xe0/0xe0 [zram]
Oct 14 20:21:35 kdevops kernel: zram_remove_cb+0xd/0x10 [zram]
Oct 14 20:21:35 kdevops kernel: idr_for_each+0x5b/0xd0
Oct 14 20:21:35 kdevops kernel: destroy_devices+0x32/0x68 [zram]
Oct 14 20:21:35 kdevops kernel: __do_sys_delete_module+0x18d/0x2a0
Oct 14 20:21:35 kdevops kernel: ?
fpregs_assert_state_consistent+0x1e/0x40
Oct 14 20:21:35 kdevops kernel: ? exit_to_user_mode_prepare+0x3a/0x180
Oct 14 20:21:35 kdevops kernel: do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel:
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: RIP: 0033:0x7f292a3e14a7
<etc>
Oct 14 20:21:35 kdevops kernel: BUG: unable to handle page fault for
address: ffffffffc0a4e0ae
Oct 14 20:21:35 kdevops kernel: #PF: supervisor instruction fetch in
kernel mode
Oct 14 20:21:35 kdevops kernel: #PF: error_code(0x0010) - not-present
page
Oct 14 20:21:35 kdevops kernel: PGD 3ba0e067 P4D 3ba0e067 PUD 3ba10067
PMD 10526c067 PTE 0
Oct 14 20:21:35 kdevops kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
Oct 14 20:21:35 kdevops kernel: CPU: 6 PID: 3655 Comm: zram02.sh
Tainted: G W E 5.15.0-rc3-next-20210927+ #89
Oct 14 20:21:35 kdevops kernel: Hardware name: QEMU Standard PC (i440FX
+ PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Oct 14 20:21:35 kdevops kernel: RIP: 0010:0xffffffffc0a4e0ae
Oct 14 20:21:35 kdevops kernel: Code: Unable to access opcode bytes at
RIP 0xffffffffc0a4e084.
Oct 14 20:21:35 kdevops kernel: RSP: 0018:ffffaac980687da8 EFLAGS:
00010286
Oct 14 20:21:35 kdevops kernel: RAX: 0000000000000000 RBX:
ffff9ed5c40be400 RCX: 0000000080400035
Oct 14 20:21:35 kdevops kernel: RDX: 0000000080400036 RSI:
fffffa3544561080 RDI: 0000000040000000
Oct 14 20:21:35 kdevops kernel: RBP: 0000000001900000 R08:
ffff9ed5d5842cc0 R09: 0000000080400035
Oct 14 20:21:35 kdevops kernel: R10: ffff9ed5d5842c00 R11:
ffff9ed5f1341350 R12: 0000000001900000
Oct 14 20:21:35 kdevops kernel: R13: ffff9ed5d5666c00 R14:
ffff9ed5c40be420 R15: ffff9ed5dfa8c8c0
Oct 14 20:21:35 kdevops kernel: FS: 00007f978fe2d5c0(0000)
GS:ffff9ed6f7d80000(0000) knlGS:0000000000000000
Oct 14 20:21:35 kdevops kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 14 20:21:35 kdevops kernel: CR2: ffffffffc0a4e084 CR3:
0000000133fd4006 CR4: 0000000000370ee0
Oct 14 20:21:35 kdevops kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Oct 14 20:21:35 kdevops kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Oct 14 20:21:35 kdevops kernel: Call Trace:
Oct 14 20:21:35 kdevops kernel: <TASK>
Oct 14 20:21:35 kdevops kernel: ? kernfs_fop_write_iter+0x177/0x220
Oct 14 20:21:35 kdevops kernel: ? new_sync_write+0x11c/0x1b0
Oct 14 20:21:35 kdevops kernel: ? vfs_write+0x20d/0x2a0
Oct 14 20:21:35 kdevops kernel: ? ksys_write+0x5f/0xe0
Oct 14 20:21:35 kdevops kernel: ? do_syscall_64+0x38/0xc0
Oct 14 20:21:35 kdevops kernel: ?
entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 14 20:21:35 kdevops kernel: </TASK>
<etc, etc, etc, this goes on and on>
Luis
On Thu, Oct 14, 2021 at 01:24:32PM -0700, Luis Chamberlain wrote:
> On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> > On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> >
> > ...
> >
> > >
> > > Hello Luis,
> > >
> > > Can you test the following patch and see if the issue can be addressed?
> > >
> > > Please see the idea from the inline comment.
> > >
> > > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > > your patch, then the deadlock issue you are addressing in this series can
> > > be avoided.
> > >
> > >
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index fcaf2750f68f..3c17927d23a7 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > >
> > > /* Make sure all the pending I/O are finished */
> > > fsync_bdev(bdev);
> > > - zram_reset_device(zram);
> > >
> > > pr_info("Removed device: %s\n", zram->disk->disk_name);
> > >
> > > del_gendisk(zram->disk);
> > > +
> > > + /*
> > > + * reset device after gendisk is removed, so any change from sysfs
> > > + * store won't come in, then we can really reset device here
> > > + */
> > > + zram_reset_device(zram);
> > > +
> > > blk_cleanup_disk(zram->disk);
> > > kfree(zram);
> > > return 0;
> > > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> > > static void destroy_devices(void)
> > > {
> > > class_unregister(&zram_control_class);
> > > +
> > > + /* hold the global lock so new device can't be added */
> > > + mutex_lock(&zram_index_mutex);
> > > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > + mutex_unlock(&zram_index_mutex);
> > > +
> >
> > Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> > since the zram-control sysfs interface has been removed, so userspace
> > can't add new device any more, then the issue is supposed to be fixed
> > by the following one line change, please test it:
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index fcaf2750f68f..96dd641de233 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> >
> > /* Make sure all the pending I/O are finished */
> > fsync_bdev(bdev);
> > - zram_reset_device(zram);
> >
> > pr_info("Removed device: %s\n", zram->disk->disk_name);
> >
> > del_gendisk(zram->disk);
> > +
> > + /*
> > + * reset device after gendisk is removed, so any change from sysfs
> > + * store won't come in, then we can really reset device here
> > + */
> > + zram_reset_device(zram);
> > +
> > blk_cleanup_disk(zram->disk);
> > kfree(zram);
> > return 0;
>
> Sorry but nope, the cpu multistate issue is still present and we end up
> eventually with page faults. I tried with both patches.
In theory disksize_store() can't come in after del_gendisk() returns,
so zram_reset_device() should then clean up everything; that is the issue
you described in the commit log.
We need to understand the exact reason why there is still a cpuhp node
left. Can you share with us the exact steps for reproducing the issue?
Otherwise we may have to trace and narrow down the reason.
thanks,
Ming
On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 01:24:32PM -0700, Luis Chamberlain wrote:
> > On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> > > On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > > > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> > >
> > > ...
> > >
> > > >
> > > > Hello Luis,
> > > >
> > > > Can you test the following patch and see if the issue can be addressed?
> > > >
> > > > Please see the idea from the inline comment.
> > > >
> > > > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > > > your patch, then the deadlock issue you are addressing in this series can
> > > > be avoided.
> > > >
> > > >
> > > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > > index fcaf2750f68f..3c17927d23a7 100644
> > > > --- a/drivers/block/zram/zram_drv.c
> > > > +++ b/drivers/block/zram/zram_drv.c
> > > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > > >
> > > > /* Make sure all the pending I/O are finished */
> > > > fsync_bdev(bdev);
> > > > - zram_reset_device(zram);
> > > >
> > > > pr_info("Removed device: %s\n", zram->disk->disk_name);
> > > >
> > > > del_gendisk(zram->disk);
> > > > +
> > > > + /*
> > > > + * reset device after gendisk is removed, so any change from sysfs
> > > > + * store won't come in, then we can really reset device here
> > > > + */
> > > > + zram_reset_device(zram);
> > > > +
> > > > blk_cleanup_disk(zram->disk);
> > > > kfree(zram);
> > > > return 0;
> > > > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> > > > static void destroy_devices(void)
> > > > {
> > > > class_unregister(&zram_control_class);
> > > > +
> > > > + /* hold the global lock so new device can't be added */
> > > > + mutex_lock(&zram_index_mutex);
> > > > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > > + mutex_unlock(&zram_index_mutex);
> > > > +
> > >
> > > Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> > > since the zram-control sysfs interface has been removed, so userspace
> > > can't add new device any more, then the issue is supposed to be fixed
> > > by the following one line change, please test it:
> > >
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index fcaf2750f68f..96dd641de233 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > >
> > > /* Make sure all the pending I/O are finished */
> > > fsync_bdev(bdev);
> > > - zram_reset_device(zram);
> > >
> > > pr_info("Removed device: %s\n", zram->disk->disk_name);
> > >
> > > del_gendisk(zram->disk);
> > > +
> > > + /*
> > > + * reset device after gendisk is removed, so any change from sysfs
> > > + * store won't come in, then we can really reset device here
> > > + */
> > > + zram_reset_device(zram);
> > > +
> > > blk_cleanup_disk(zram->disk);
> > > kfree(zram);
> > > return 0;
> >
> > Sorry but nope, the cpu multistate issue is still present and we end up
> > eventually with page faults. I tried with both patches.
>
> In theory disksize_store() can't come in after del_gendisk() returns,
> then zram_reset_device() should cleanup everything, that is the issue
> you described in commit log.
>
> We need to understand the exact reason why there is still cpuhp node
> left, can you share us the exact steps for reproducing the issue?
> Otherwise we may have to trace and narrow down the reason.
See my commit log for my own fix for this issue.
Luis
On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
...
> >
> > We need to understand the exact reason why there is still cpuhp node
> > left, can you share us the exact steps for reproducing the issue?
> > Otherwise we may have to trace and narrow down the reason.
>
> See my commit log for my own fix for this issue.
OK, thanks!
I can reproduce the issue, and the reason is that reset_store() makes
zram_remove() fail when unloading the module, which then causes the warning.
The top 3 patches in the following tree can fix the issue:
https://github.com/ming1/linux/commits/my_v5.15-blk-dev
Thanks,
Ming
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> ...
> > >
> > > We need to understand the exact reason why there is still cpuhp node
> > > left, can you share us the exact steps for reproducing the issue?
> > > Otherwise we may have to trace and narrow down the reason.
> >
> > See my commit log for my own fix for this issue.
>
> OK, thanks!
>
> I can reproduce the issue, and the reason is that reset_store fails
> zram_remove() when unloading module, then the warning is caused.
>
> The top 3 patches in the following tree can fix the issue:
>
> https://github.com/ming1/linux/commits/my_v5.15-blk-dev
At a quick glance, those look sane to me, nice work.
greg k-h
On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> ...
> > >
> > > We need to understand the exact reason why there is still cpuhp node
> > > left, can you share us the exact steps for reproducing the issue?
> > > Otherwise we may have to trace and narrow down the reason.
> >
> > See my commit log for my own fix for this issue.
>
> OK, thanks!
>
> I can reproduce the issue, and the reason is that reset_store fails
> zram_remove() when unloading module, then the warning is caused.
>
> The top 3 patches in the following tree can fix the issue:
>
> https://github.com/ming1/linux/commits/my_v5.15-blk-dev
Thanks for trying an alternative fix! A crash stops yes, however this
also ends up leaving the driver in an unrecoverable state after a few
tries. Ie, you CTRL-C the scripts and try again over and over again and
the driver ends up in a situation where it just says:
zram: Can't change algorithm for initialized device
And the zram module can't be removed at that point.
Luis
On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > ...
> > > >
> > > > We need to understand the exact reason why there is still cpuhp node
> > > > left, can you share us the exact steps for reproducing the issue?
> > > > Otherwise we may have to trace and narrow down the reason.
> > >
> > > See my commit log for my own fix for this issue.
> >
> > OK, thanks!
> >
> > I can reproduce the issue, and the reason is that reset_store fails
> > zram_remove() when unloading module, then the warning is caused.
> >
> > The top 3 patches in the following tree can fix the issue:
> >
> > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
>
> Thanks for trying an alternative fix! A crash stops yes, however this
I doubt it is an alternative, since your patchset doesn't mention the exact
reason for 'Error: Removing state 63 which has instances left.', which is
simply caused by failing to remove zram because ->claim is set while
unloading the module.
Yeah, you mentioned the race between disksize_store() and zram_remove();
however, I don't think it is reproduced easily in the test because the race
window is pretty small. It can also be fixed easily in my 3rd patch
without any complicated tricks.
I haven't dug into the details of your patchset, which grabs a module
reference count during show/store of a kernfs attribute (done in your
patch 9), but IMO this approach isn't necessary:
1) any driver module has to clean up, in its module_exit(), anything which
may refer to symbols or data defined in this driver
2) device_del() is often done in module_exit(); once device_del()
returns, no new show/store on the device's kobject attributes
is possible.
3) it is _not_ a must or a pattern for fixing bugs to hold a lock while
calling device_del() when the same lock is also required in the device's
attribute show()/store(); that easily causes an AA deadlock. Your approach
just avoids the issue by not releasing the module until all show/store are
done.
Also, the usual model for the module refcount is that anyone who will
use the module grabs one extra ref, and releases it once the use is done.
For a block device, for example, the driver's module refcnt is grabbed
when the disk/part is opened, and released when the disk/part is closed.
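A minimal userspace sketch of that refcount model follows. It is only an
illustration: struct fake_module and the fake_*() helpers are hypothetical
names, not kernel APIs, and the counter is a plain int because the sketch
is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a module; not the kernel's struct module. */
struct fake_module {
	int  refcnt;	/* number of current users */
	bool going;	/* set once unload has started */
};

/* "If anyone will use the module, grab one extra ref." */
static bool fake_module_get(struct fake_module *m)
{
	if (m->going)
		return false;	/* unload already started, refuse new users */
	m->refcnt++;
	return true;
}

/* "Once the use is done, release it." */
static void fake_module_put(struct fake_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Unload only succeeds when nobody holds a reference. */
static bool fake_module_unload(struct fake_module *m)
{
	if (m->refcnt > 0)
		return false;
	m->going = true;
	return true;
}

/* Opening and closing the disk bracket the reference. */
static bool fake_disk_open(struct fake_module *m)
{
	return fake_module_get(m);
}

static void fake_disk_close(struct fake_module *m)
{
	fake_module_put(m);
}
```

In this model an unload attempt fails while the disk is open, and succeeds
again once fake_disk_close() has dropped the reference.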
> also ends up leaving the driver in an unrecoverable state after a few
> tries. Ie, you CTRL-C the scripts and try again over and over again and
> the driver ends up in a situation where it just says:
>
> zram: Can't change algorithm for initialized device
It means the algorithm can't be changed for an initialized device
at that exact time. That is understandable because two zram02.sh are
running concurrently.
Your test script just runs two ./zram02.sh tasks concurrently forever,
so what is your expected result for the test? Of course, it never
finishes.
I can't reproduce the 'unrecoverable' state in my test, can you share the
stack trace log after that happens?
Is zram02.sh still running, or is it sleeping somewhere, in the
'unrecoverable' state? If it is still running, it means the current sleep
point isn't interruptible when running 'CTRL-C'. In my test, after several
'CTRL-C', both zram02.sh instances started from two terminals can be
terminated. If it sleeps somewhere forever, that could be a problem.
>
> And the zram module can't be removed at that point.
That is just because systemd has the zram device open, or the disk is in
use as a swap disk; once systemd closes it, or after you run swapoff, the
module can be unloaded.
Thanks,
Ming
On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > ...
> > > > >
> > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > Otherwise we may have to trace and narrow down the reason.
> > > >
> > > > See my commit log for my own fix for this issue.
> > >
> > > OK, thanks!
> > >
> > > I can reproduce the issue, and the reason is that reset_store fails
> > > zram_remove() when unloading module, then the warning is caused.
> > >
> > > The top 3 patches in the following tree can fix the issue:
> > >
> > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> >
> > Thanks for trying an alternative fix! A crash stops yes, however this
>
> I doubt it is alternative since your patchset doesn't mention the exact
> reason of 'Error: Removing state 63 which has instances left.', that is
> simply caused by failing to remove zram because ->claim is set during
> unloading module.
Well I disagree because it does explain how the race can happen, and it
also explains how since the sysfs interface is exposed until module
removal completes, it leaves exposed knobs to allow re-initializing of a
struct zcomp for a zram device before the exit.
> Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> however I don't think it is reproduced easily in the test because the race
> window is pretty small, also it can be fixed easily in my 3rd path
> without any complicated tricks.
Reproducing for me is... extremely easy.
> Not dig into details of your patchset via grabbing module reference
> count during show/store attribute of kernfs which is done in your patch
> 9, but IMO this way isn't necessary:
That's to address the deadlock only.
> 1) any driver module has to cleanup anything which may refer to symbols
> or data defined in module_exit of this driver
Yes, and the cpu multistate hotplug documentation warns (although
such documentation is kind of hidden) that driver authors need to be
careful with module removal too; refer to the warning at the end of
__cpuhp_remove_state_cpuslocked() about module removal.
> 2) device_del() is often done in module_exit(), once device_del()
> returns, no any new show/store on the device's kobject attribute
> is possible.
Right and if a sysfs knob is exposed before device_del() completes
and is allowed to do things, the driver should take care to prevent
races for CPU multistate support. The small state machine I added ensures
we don't run over any expectations from cpu hotplug multistate support.
I've *never* suggested there cannot be alternatives to my solution with
the small state machine, but for you to say it is incorrect is simply
not right either.
> 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> calling device_del(), meantime the lock is required in the device's
> attribute show()/store(), which causes AA deadlock easily. Your approach
> just avoids the issue by not releasing module until all show/store are
> done.
Right, there are two approaches here:
a) Your approach is to accept the deadlock as a requirement and so
you would prefer to implement an alternative to using a shared lock
on module exit and sysfs op.
b) While I address such a deadlock head on as I think this sort of locking
should be allowed for two reasons:
b1) as we never documented such requirement otherwise.
b2) There is a possibility that other drivers already exist too
which *do* use a shared lock on module removal and sysfs ops
(and I just confirmed this to be true)
By you only addressing the deadlock as a requirement on approach a) you are
forgetting that there *may* already be present drivers which *do* implement
such patterns in the kernel. I worked on addressing the deadlock because
I was informed livepatching *did* have that issue as well and so very
likely a generic solution to the deadlock could be beneficial to other
random drivers.
So I *really* don't think it is wise for us to simply accept this newly
found deadlock as a *new* requirement, especially if we can fix it easily.
A cursory review using Coccinelle for potential issues with a mutex lock
directly used on module exit (so this doesn't cover drivers like zram
which uses a routine and then grabs the lock through indirection) and a
sysfs op shows these drivers are also affected by this deadlock:
* arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
* lib/test_firmware.c
Note that this cursory review does not cover spin_lock uses, and other
forms of locks. Consider the case where a routine is used and then that
routine grabs a lock, so one level indirection. There are many levels
of indirections possible here. And likewise there are different types
of locks.
> > also ends up leaving the driver in an unrecoverable state after a few
> > tries. Ie, you CTRL-C the scripts and try again over and over again and
> > the driver ends up in a situation where it just says:
> >
> > zram: Can't change algorithm for initialized device
>
> It means the algorithm can't be changed for one initialized device
> at the exact time. That is understandable because two zram02.sh are
> running concurrently.
Indeed but with your patch it can get stuck and cannot be taken out of this
state.
> Your test script just runs two ./zram02.sh tasks concurrently forever,
> so what is your expected result for the test? Of course, it can't be
> over.
>
> I can't reproduce the 'unrecoverable' state in my test, can you share the
> stack trace log after that happens?
Try a bit harder, cancel the scripts after running for a while randomly
(CTRL C a few times until the script finishes) and have them race again.
Do this a few times.
> > And the zram module can't be removed at that point.
>
> It is just that systemd opens the zram or the disk is opened as swap
> disk, and once systemd closes it or after you run swapoff, it can be
> unloaded.
With my patch this issue does not happen.
Luis
On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
> On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> > On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > > ...
> > > > > >
> > > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > > Otherwise we may have to trace and narrow down the reason.
> > > > >
> > > > > See my commit log for my own fix for this issue.
> > > >
> > > > OK, thanks!
> > > >
> > > > I can reproduce the issue, and the reason is that reset_store fails
> > > > zram_remove() when unloading module, then the warning is caused.
> > > >
> > > > The top 3 patches in the following tree can fix the issue:
> > > >
> > > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > >
> > > Thanks for trying an alternative fix! A crash stops yes, however this
> >
> > I doubt it is alternative since your patchset doesn't mention the exact
> > reason of 'Error: Removing state 63 which has instances left.', that is
> > simply caused by failing to remove zram because ->claim is set during
> > unloading module.
>
> Well I disagree because it does explain how the race can happen, and it
> also explains how since the sysfs interface is exposed until module
> removal completes, it leaves exposed knobs to allow re-initializing of a
> struct zcomp for a zram device before the exit.
>
> > Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> > however I don't think it is reproduced easily in the test because the race
> > window is pretty small, also it can be fixed easily in my 3rd path
> > without any complicated tricks.
>
> Reproducing for me is... extremely easy.
In my observation, failing zram_remove() is extremely easy to trigger. It
is caused by reset_store(), which sets ->claim to true, so zram_remove()
fails and zram_reset_device() is bypassed; that in turn causes the failure
'Error: Removing state 63 which has instances left.'.
Are we on the same page?
>
> > Not dig into details of your patchset via grabbing module reference
> > count during show/store attribute of kernfs which is done in your patch
> > 9, but IMO this way isn't necessary:
>
> That's to address the deadlock only.
>
> > 1) any driver module has to cleanup anything which may refer to symbols
> > or data defined in module_exit of this driver
>
> Yes, and as the cpu multistate hotplug documentation warns (although
> such documentation is kind of hidden) that driver authors need to be
> careful with module removal too, refer to the warning at the end of
> __cpuhp_remove_state_cpuslocked() about module removal.
It is zram's bug. zram has to clean up everything in module_exit().
Unfortunately zram_remove() can fail when called from module_exit()
because ->claim is set to true by reset_store(); then
zram_reset_device() (->zcomp_destroy) isn't called. This failure should
not happen when unloading the module, should it?
>
> > 2) device_del() is often done in module_exit(), once device_del()
> > returns, no any new show/store on the device's kobject attribute
> > is possible.
>
> Right and if a syfs knob is exposed before device_del() completely
> and is allowed to do things, the driver should take care to prevent
> races for CPU multistate support. The small state machine I added ensures
What is the race for CPU multistate support? If you mean 'Error: Removing
state 63 which has instances left.', it is zram's bug since zram has to
cleanup everything in module_exit().
> we don't run over any expectations from cpu hotplug multistate support.
>
> I've *never* suggested there cannot be alternatives to my solution with
> the small state machine, but for you to say it is incorrect is simply
> not right either.
>
> > 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> > calling device_del(), meantime the lock is required in the device's
> > attribute show()/store(), which causes AA deadlock easily. Your approach
> > just avoids the issue by not releasing module until all show/store are
> > done.
>
> Right, there are two approaches here:
>
> a) Your approach is to accept the deadlock as a requirement and so
> you would prefer to implement an alternative to using a shared lock
> on module exit and sysfs op.
wrt. in-tree zram, there is no deadlock, neither in Linus' tree nor after
applying my 3 patches. If you think there is, please share the code
or the lockdep warning with us.
>
> b) While I address such a deadlock head on as I think this sort of locking
> be allowed for two reasons:
> b1) as we never documented such requirement otherwise.
> b2) There is a possibility that other drivers already exist too
> which *do* use a shared lock on module removal and sysfs ops
> (and I just confirmed this to be true)
The 'deadlock' is actually caused by your out-of-tree patch 'zram: fix
crashes with cpu hotplug multistate', which adds mutex_lock(zram_index_mutex)
in destroy_devices().
We can fix this issue easily without needing the global lock, please see the
attached (pre-V2) patch.
>
> By you only addressing the deadlock as a requirement on approach a) you are
> forgetting that there *may* already be present drivers which *do* implement
> such patterns in the kernel. I worked on addressing the deadlock because
> I was informed livepatching *did* have that issue as well and so very
> likely a generic solution to the deadlock could be beneficial to other
> random drivers.
In-tree zram doesn't have such a deadlock. If livepatching has such an AA
deadlock, just fix it; it seems it has been fixed by 3ec24776bfd0.
>
> So I *really* don't think it is wise for us to simply accept this new
> found deadlock as a *new* requirement, specially if we can fix it easily.
>
> A cursory review using Coccinelle potential issues with mutex lock
> directly used on module exit (so this doesn't cover drivers like zram
> which uses a routine and then grabs the lock through indirection) and a
> sysfs op shows these drivers are also affected by this deadlock:
>
> * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
In fsl_wakeup_sys_exit(), device_remove_file() is called before
acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
> * lib/test_firmware.c
Yeah, there is the AA deadlock risk, but it should be fixed by moving
misc_deregister() out of &test_fw_mutex.
>
> Note that this cursory review does not cover spin_lock uses, and other
> forms locks. Consider the case where a routine is used and then that
> routine grabs a lock, so one level indirection. There are many levels
> of indirections possible here. And likewise there are different types
> of locks.
>
> > > also ends up leaving the driver in an unrecoverable state after a few
> > > tries. Ie, you CTRL-C the scripts and try again over and over again and
> > > the driver ends up in a situation where it just says:
> > >
> > > zram: Can't change algorithm for initialized device
> >
> > It means the algorithm can't be changed for one initialized device
> > at the exact time. That is understandable because two zram02.sh are
> > running concurrently.
>
> Indeed but with your patch it can get stuck and cannot be taken out of this
> state.
OK, I can keep the current behavior: fail open() while removing or
resetting, and meanwhile not hold open_mutex when syncing the bdev and
resetting the device; see the attached patch.
>
> > Your test script just runs two ./zram02.sh tasks concurrently forever,
> > so what is your expected result for the test? Of course, it can't be
> > over.
> >
> > I can't reproduce the 'unrecoverable' state in my test, can you share the
> > stack trace log after that happens?
>
> Try a bit harder, cancel the scripts after running for a while randomly
> (CTRL C a few times until the script finishes) and have them race again.
> Do this a few times.
>
> > > And the zram module can't be removed at that point.
> >
> > It is just that systemd opens the zram or the disk is opened as swap
> > disk, and once systemd closes it or after you run swapoff, it can be
> > unloaded.
>
> With my patch this issues does not happen.
It is because patch 2 holds ->open_mutex for syncing the bdev and resetting
zram, so several 'CTRL-C' are needed for terminating the test script, and
zram02.sh's cleanup handler can then be interrupted too. We can keep the
current behavior easily.
Please try the following patch against an upstream (linus or next) tree
(basically it folds the revised patches 2 and 3 of V1 and covers two issues:
not failing zram_remove() in module_exit(), and the race between
zram_remove() and disksize_store()), and see if everything is fine for you:
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a68297fb51a2..320822a80b64 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1967,25 +1967,45 @@ static int zram_add(void)
static int zram_remove(struct zram *zram)
{
struct block_device *bdev = zram->disk->part0;
+ bool claimed;
mutex_lock(&bdev->bd_disk->open_mutex);
- if (bdev->bd_openers || zram->claim) {
+ if (bdev->bd_openers) {
mutex_unlock(&bdev->bd_disk->open_mutex);
return -EBUSY;
}
- zram->claim = true;
+ claimed = zram->claim;
+ if (!claimed)
+ zram->claim = true;
mutex_unlock(&bdev->bd_disk->open_mutex);
zram_debugfs_unregister(zram);
- /* Make sure all the pending I/O are finished */
- fsync_bdev(bdev);
- zram_reset_device(zram);
+ if (claimed) {
+ /*
+ * If we were claimed by reset_store(), del_gendisk() will
+ * wait until sync & reset is completed, so do nothing here.
+ */
+ ;
+ } else {
+ /* Make sure all the pending I/O are finished */
+ sync_blockdev(bdev);
+ zram_reset_device(zram);
+ }
pr_info("Removed device: %s\n", zram->disk->disk_name);
del_gendisk(zram->disk);
+
+ WARN_ON_ONCE(claimed && zram->claim);
+
+ /*
+ * disksize store may come after the above zram_reset_device
+ * returns, so run the last reset to avoid the race
+ */
+ zram_reset_device(zram);
+
blk_cleanup_disk(zram->disk);
kfree(zram);
return 0;
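The control flow of the patch above can be modelled in userspace as
follows. This is a hedged sketch, not the driver code: struct fake_zram
and both fake_remove_*() helpers are hypothetical, the locking is omitted,
and the step that clears the claim assumes reset_store() has finished and
dropped its claim by the time del_gendisk() returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in; not the kernel's struct zram. */
struct fake_zram {
	int  openers;	/* models bdev->bd_openers */
	bool claim;	/* models zram->claim */
	int  resets;	/* resets performed by the remove path */
	bool disk;	/* true while the gendisk is registered */
};

/* Old behavior: a claim held by reset_store() makes removal fail. */
static int fake_remove_old(struct fake_zram *z)
{
	if (z->openers || z->claim)
		return -EBUSY;	/* reset is skipped, state is left behind */
	z->claim = true;
	z->resets++;		/* sync + zram_reset_device() */
	z->disk = false;	/* del_gendisk() */
	return 0;
}

/* Patched behavior: only real openers make removal fail. */
static int fake_remove_new(struct fake_zram *z)
{
	bool claimed;

	if (z->openers)
		return -EBUSY;
	claimed = z->claim;
	if (!claimed) {
		z->claim = true;
		z->resets++;	/* sync + zram_reset_device() */
	}
	z->disk = false;	/* del_gendisk() */
	if (claimed)
		z->claim = false; /* assumption: reset_store() has completed */
	assert(!(claimed && z->claim));	/* models the WARN_ON_ONCE() */
	z->resets++;		/* final reset closes the disksize_store() race */
	return 0;
}
```

With a pending claim, fake_remove_old() returns -EBUSY and leaves the
device untouched, while fake_remove_new() still removes the disk and runs
the final reset.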
Thanks,
Ming
> > By you only addressing the deadlock as a requirement on approach a) you are
> > forgetting that there *may* already be present drivers which *do* implement
> > such patterns in the kernel. I worked on addressing the deadlock because
> > I was informed livepatching *did* have that issue as well and so very
> > likely a generic solution to the deadlock could be beneficial to other
> > random drivers.
>
> In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> just fixed it, and seems it has been fixed by 3ec24776bfd0.
I would not call it a fix. It is a kind of ugly workaround because the
generic infrastructure lacked (lacks) the proper support in my opinion.
Luis is trying to fix that.
Just my two cents.
Miroslav
On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > By you only addressing the deadlock as a requirement on approach a) you are
> > > forgetting that there *may* already be present drivers which *do* implement
> > > such patterns in the kernel. I worked on addressing the deadlock because
> > > I was informed livepatching *did* have that issue as well and so very
> > > likely a generic solution to the deadlock could be beneficial to other
> > > random drivers.
> >
> > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > just fixed it, and seems it has been fixed by 3ec24776bfd0.
>
> I would not call it a fix. It is a kind of ugly workaround because the
> generic infrastructure lacked (lacks) the proper support in my opinion.
> Luis is trying to fix that.
What is the proper support of the generic infrastructure? I am not
familiar with livepatching's model (especially with module unload). Do you
mean livepatching has to do things the following way from sysfs:
1) during module exit:
mutex_lock(lp_lock);
kobject_put(lp_kobj);
mutex_unlock(lp_lock);
2) show()/store() method of attributes of lp_kobj
mutex_lock(lp_lock)
...
mutex_unlock(lp_lock)
IMO, the above usage simply causes an AA deadlock. Even in Luis's patch
'zram: fix crashes with cpu hotplug multistate', a new AA deadlock of the
same kind (hot_remove_store() vs. disksize_store() or reset_store()) is
added, because hot_remove_store() isn't called from module_exit().
Luis tries to delay unloading the module until all show()/store() are done.
But that can be achieved simply in the following way during module_exit():
kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
//no new store()/show() can come after
//kobject_del() returns
mutex_lock(lp_lock);
kobject_put(lp_kobj);
mutex_unlock(lp_lock);
Or can you explain your requirement on kobject/module unload in a bit
details?
Thanks,
Ming
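The two orderings above can be compared with a small userspace model. This is
only a sketch under stated assumptions, not kernel code: struct attr_node,
store_enter(), store_exit(), model_try_lock() and model_unlock() are made-up
names, the "lock" is a plain bool standing in for lp_lock, and the active
counter stands in for the reference sysfs holds while a show()/store() call
is in flight.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one sysfs attribute node. */
struct attr_node {
	int active;	/* in-flight show()/store() calls */
	bool removed;	/* set once removal has started */
};

/* sysfs entering store(): take an active reference unless removed. */
static bool store_enter(struct attr_node *n)
{
	if (n->removed)
		return false;
	n->active++;
	return true;
}

/* store() returning: drop the active reference. */
static void store_exit(struct attr_node *n)
{
	assert(n->active > 0);
	n->active--;
}

/* Non-blocking stand-ins for mutex_lock()/mutex_unlock() on lp_lock. */
static bool model_try_lock(bool *held)
{
	if (*held)
		return false;
	*held = true;
	return true;
}

static void model_unlock(bool *held)
{
	*held = false;
}

/* Ordering 1: exit takes the lock, then removes the node under it. */
static bool exit_lock_then_remove_deadlocks(void)
{
	struct attr_node n = { 0, false };
	bool lock = false;
	bool store_blocked, drain_blocked;

	store_enter(&n);		/* store() in flight, lock not taken yet */
	model_try_lock(&lock);		/* exit path takes the lock first */
	n.removed = true;		/* removal starts while the lock is held */

	store_blocked = !model_try_lock(&lock);	/* store() waits for the lock */
	drain_blocked = n.active > 0;		/* removal waits for store() */
	return store_blocked && drain_blocked;
}

/* Ordering 2: exit removes the node first, then takes the lock. */
static bool exit_remove_then_lock_deadlocks(void)
{
	struct attr_node n = { 0, false };
	bool lock = false;
	bool store_blocked, drain_blocked;

	store_enter(&n);		/* store() in flight */
	n.removed = true;		/* removal starts with the lock free */

	store_blocked = !model_try_lock(&lock);	/* store() gets the lock */
	if (!store_blocked) {
		model_unlock(&lock);
		store_exit(&n);		/* store() returns, reference dropped */
	}
	drain_blocked = n.active > 0;	/* nothing left to wait for */
	return store_blocked && drain_blocked;
}
```

In this model the first ordering leaves both sides waiting on each other,
while the second lets the in-flight store() finish before the exit path ever
takes the lock.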
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> Please try the following patch against upstream(linus or next) tree(basically
> fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> module_exit(), race between zram_remove() and disksize_store()), and see if
> everything is fine for you:
Page fault ...
[ 18.284256] zram: Removed device: zram0
[ 18.312974] BUG: unable to handle page fault for address: ffffad86de903008
[ 18.313707] #PF: supervisor read access in kernel mode
[ 18.314248] #PF: error_code(0x0000) - not-present page
[ 18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067 PTE 0
[ 18.315538] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 18.316012] CPU: 3 PID: 1198 Comm: rmmod Tainted: G E 5.15.0-rc3-next-20210927+ #89
[ 18.316979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[ 18.317876] RIP: 0010:zram_free_page+0x1b/0xf0 [zram]
[ 18.318430] Code: 1f 44 00 00 48 89 c8 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 41 54 49 89 f4 55 89 f5 53 48 8b 17 48 c1 e5 04 48 89 fb 48 01 ea <48> 8b 42 08 a9 00 00 00 20 74 14 48 25 ff ff ff df 48 89 42 08 48
[ 18.320412] RSP: 0018:ffffad86f8013df8 EFLAGS: 00010286
[ 18.320978] RAX: 0000000000000001 RBX: ffff9b7b435c7800 RCX: 0000000000000200
[ 18.321758] RDX: ffffad86de903000 RSI: 0000000000000000 RDI: ffff9b7b435c7800
[ 18.322524] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[ 18.323299] R10: 0000000000000200 R11: 0000000000000000 R12: 0000000000000000
[ 18.324030] R13: ffff9b7b55191800 R14: ffff9b7b435c7820 R15: ffff9b7b4677f960
[ 18.324784] FS: 00007fc8e4c90580(0000) GS:ffff9b7c77cc0000(0000) knlGS:0000000000000000
[ 18.325651] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.326272] CR2: ffffad86de903008 CR3: 000000014f1de003 CR4: 0000000000370ee0
[ 18.327047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 18.327818] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 18.328586] Call Trace:
[ 18.328852] <TASK>
[ 18.329284] zram_reset_device+0xd8/0x140 [zram]
[ 18.329983] zram_remove.cold+0xa/0x20 [zram]
[ 18.330644] ? hot_remove_store+0xe0/0xe0 [zram]
[ 18.331367] zram_remove_cb+0xd/0x10 [zram]
[ 18.332010] idr_for_each+0x5b/0xd0
[ 18.332578] destroy_devices+0x26/0x50 [zram]
[ 18.333238] __do_sys_delete_module+0x18d/0x2a0
[ 18.333913] ? fpregs_assert_state_consistent+0x1e/0x40
[ 18.334665] ? exit_to_user_mode_prepare+0x3a/0x180
[ 18.335395] do_syscall_64+0x38/0xc0
[ 18.335966] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 18.336681] RIP: 0033:0x7fc8e4db64a7
On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
> > On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> > > On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > > > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > > > ...
> > > > > > >
> > > > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > > > Otherwise we may have to trace and narrow down the reason.
> > > > > >
> > > > > > See my commit log for my own fix for this issue.
> > > > >
> > > > > OK, thanks!
> > > > >
> > > > > I can reproduce the issue, and the reason is that reset_store fails
> > > > > zram_remove() when unloading module, then the warning is caused.
> > > > >
> > > > > The top 3 patches in the following tree can fix the issue:
> > > > >
> > > > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > > >
> > > > Thanks for trying an alternative fix! A crash stops yes, however this
> > >
> > > I doubt it is alternative since your patchset doesn't mention the exact
> > > reason of 'Error: Removing state 63 which has instances left.', that is
> > > simply caused by failing to remove zram because ->claim is set during
> > > unloading module.
> >
> > Well I disagree because it does explain how the race can happen, and it
> > also explains how since the sysfs interface is exposed until module
> > removal completes, it leaves exposed knobs to allow re-initializing of a
> > struct zcomp for a zram device before the exit.
> >
> > > Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> > > however I don't think it is reproduced easily in the test because the race
> > > window is pretty small, also it can be fixed easily in my 3rd path
> > > without any complicated tricks.
> >
> > Reproducing for me is... extremely easy.
>
> In my observation, failing zram_remove() is extremely easy to trigger, which
> is caused by reset_store() which sets ->reclaim as true, so
> zram_remove() is failed and zram_reset_device() is bypassed , then the
> failure of 'Error: Removing state 63 which has instances left.' is caused.
>
> We are in same page?
The actual first issue is that the CPU hotplug remove callback is long gone
and in the meantime we allow a race to add a new "instance", in the zram
driver's case a cpu struct zcomp instance, through the sysfs interface.
Regardless of whether zram_remove() can fail or not, the above race needs to
be addressed.
> > > Not dig into details of your patchset via grabbing module reference
> > > count during show/store attribute of kernfs which is done in your patch
> > > 9, but IMO this way isn't necessary:
> >
> > That's to address the deadlock only.
> >
> > > 1) any driver module has to cleanup anything which may refer to symbols
> > > or data defined in module_exit of this driver
> >
> > Yes, and as the cpu multistate hotplug documentation warns (although
> > such documentation is kind of hidden) that driver authors need to be
> > careful with module removal too, refer to the warning at the end of
> > __cpuhp_remove_state_cpuslocked() about module removal.
>
> It is zram's bug. zram has to clean everything in module_exit(),
> unfortunately zram_remove() can be failed when calling from
> module_exit() because ->claim is set as true by reset_store(), then
> zram_reset_device()(->zcomp_destroy) isn't called, and this failure should
> not happen when unloading module, should it?
You're addressing a possibly failing zram_remove() while I address not
allowing entry to muck with the zram driver at all once we're bailing
on module removal.
> > > 2) device_del() is often done in module_exit(), once device_del()
> > > returns, no any new show/store on the device's kobject attribute
> > > is possible.
> >
> > Right and if a syfs knob is exposed before device_del() completely
> > and is allowed to do things, the driver should take care to prevent
> > races for CPU multistate support. The small state machine I added ensures
>
> What is the race for CPU multistate support? If you mean 'Error: Removing
> state 63 which has instances left.', it is zram's bug since zram has to
> cleanup everything in module_exit().
Yes. And it is what my out-of-tree, yet Acked, patch 'zram: fix
crashes with cpu hotplug multistate' does.
> > we don't run over any expectations from cpu hotplug multistate support.
> >
> > I've *never* suggested there cannot be alternatives to my solution with
> > the small state machine, but for you to say it is incorrect is simply
> > not right either.
> >
> > > 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> > > calling device_del(), meantime the lock is required in the device's
> > > attribute show()/store(), which causes AA deadlock easily. Your approach
> > > just avoids the issue by not releasing module until all show/store are
> > > done.
> >
> > Right, there are two approaches here:
> >
> > a) Your approach is to accept the deadlock as a requirement and so
> > you would prefer to implement an alternative to using a shared lock
> > on module exit and sysfs op.
>
> wrt. in-tree zram, there is neither any deadlock in linus tree, nor after
> applying my 3 patches. If you think there is, please share us the code
> or lockdep warning.
Right, 'zram: fix crashes with cpu hotplug multistate' is not yet
merged. My approach to fixing that does add a lock use on module removal,
which does introduce a possible deadlock with sysfs, which is later addressed
generically between sysfs and module removal for all drivers.
> > b) While I address such a deadlock head on as I think this sort of locking
> > be allowed for two reasons:
> > b1) as we never documented such requirement otherwise.
> > b2) There is a possibility that other drivers already exist too
> > which *do* use a shared lock on module removal and sysfs ops
> > (and I just confirmed this to be true)
>
> The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix
> crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex)
> in destroy_devices().
Yes yes, but you are completely throwing out the window that other
possible deadlocks can exist in the kernel *and* that *new* cases of
the deadlock can easily also be added!
> We can fix this issue easily without needing the global lock, please see the
> attached(pre-V2) patch.
So far your patches do not fix the issues though...
> > So I *really* don't think it is wise for us to simply accept this new
> > found deadlock as a *new* requirement, specially if we can fix it easily.
> >
> > A cursory review using Coccinelle potential issues with mutex lock
> > directly used on module exit (so this doesn't cover drivers like zram
> > which uses a routine and then grabs the lock through indirection) and a
> > sysfs op shows these drivers are also affected by this deadlock:
> >
> > * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
>
> In fsl_wakeup_sys_exit(), device_remove_file() is called before
> acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
>
> > * lib/test_firmware.c
>
> Yeah, there is the AA deadlock risk, but it should be fixed by moving
> misc_deregister() out of &test_fw_mutex.
And just like that you are ignoring other possible uses in the kernel
which might have similar deadlocks.
So do you want to take the position:
Hey driver authors: you cannot use any shared lock on module removal and
on sysfs ops?
Luis
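To make the "no entry once we're bailing on module removal" idea concrete,
here is a rough userspace sketch. It is not the code from the patch: the
names index_lock, driver_up, fake_module_init(), fake_module_exit() and
fake_attr_store() are invented for illustration, and a pthread mutex stands
in for the kernel mutex shared between the exit path and the sysfs op.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical lock shared by the exit path and the attribute ops. */
static pthread_mutex_t index_lock = PTHREAD_MUTEX_INITIALIZER;
static bool driver_up;	/* protected by index_lock */

/* Module init: mark the driver as usable. */
static void fake_module_init(void)
{
	pthread_mutex_lock(&index_lock);
	driver_up = true;
	pthread_mutex_unlock(&index_lock);
}

/* Attribute store(): refuse to touch driver state once exit has begun. */
static int fake_attr_store(int *value, int new_value)
{
	int ret = 0;

	pthread_mutex_lock(&index_lock);
	if (!driver_up)
		ret = -ENODEV;
	else
		*value = new_value;
	pthread_mutex_unlock(&index_lock);
	return ret;
}

/* Module exit: flip the state first, teardown would follow. */
static void fake_module_exit(void)
{
	pthread_mutex_lock(&index_lock);
	driver_up = false;
	pthread_mutex_unlock(&index_lock);
}
```

The sketch only shows the gating. It says nothing about what happens when the
exit path removes sysfs files while still holding the same lock, which is the
case argued about in this thread.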
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> So do you want to take the position:
>
> Hey driver authors: you cannot use any shared lock on module removal and
> on sysfs ops?
Yes, I would not recommend using such a lock at all. sysfs operations
happen on a per-device basis, so you can lock the device structure.
Module removal happens on a driver basis, and I have no idea what you
want to lock there, but odds are it is NOT shared with your per-device
structures either, right?
If so, then yes, that is a bug, but a very rare one as drivers should do
almost nothing except register/unregister_driver() in their module
init/exit calls.
zram is not a "normal" driver at all here, so fixing this type of
problem up should be done in the zram code, it is not a generic
module/sysfs issue at all.
thanks,
greg k-h
On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > Please try the following patch against upstream(linus or next) tree(basically
> > fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> > module_exit(), race between zram_remove() and disksize_store()), and see if
> > everything is fine for you:
>
> Page fault ...
>
> [ 18.284256] zram: Removed device: zram0
> [ 18.312974] BUG: unable to handle page fault for address: ffffad86de903008
> [ 18.313707] #PF: supervisor read access in kernel mode
> [ 18.314248] #PF: error_code(0x0000) - not-present page
> [ 18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
That is another race, between zram_reset_device() and disksize_store(),
which is supposed to be covered by ->init_lock. The delta fix against the
last patch I posted follows, and the whole patch can be found at this
github link:
https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb16894
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d0cae7a42f4d..a14ba3d350ea 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
 	set_capacity_and_notify(zram->disk, 0);
 	part_stat_set_all(zram->disk->part0, 0);

-	up_write(&zram->init_lock);
 	/* I/O operation under all of CPU are done so let's free */
 	zram_meta_free(zram, disksize);
 	memset(&zram->stats, 0, sizeof(zram->stats));
 	zcomp_destroy(comp);
 	reset_bdev(zram);
+	up_write(&zram->init_lock);
 }

 static ssize_t disksize_store(struct device *dev,
--
Ming
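The idea behind the delta above (free the metadata before dropping the lock,
so a concurrent disksize_store() cannot slip in between) can be modelled in
userspace roughly as follows. This is a sketch with invented names
(struct fake_dev, fake_dev_init(), fake_disksize_store(), fake_reset()); a
pthread mutex stands in for the write side of ->init_lock and a calloc()'d
array for the per-device metadata.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device; only what the sketch needs. */
struct fake_dev {
	pthread_mutex_t init_lock;	/* plays the role of ->init_lock */
	unsigned long *table;		/* plays the role of the metadata */
	size_t nr_pages;
};

static void fake_dev_init(struct fake_dev *d)
{
	pthread_mutex_init(&d->init_lock, NULL);
	d->table = NULL;
	d->nr_pages = 0;
}

/* Model of disksize_store(): refuse if the device is already set up. */
static int fake_disksize_store(struct fake_dev *d, size_t nr_pages)
{
	int ret = 0;

	pthread_mutex_lock(&d->init_lock);
	if (d->table) {
		ret = -EBUSY;
	} else {
		d->table = calloc(nr_pages, sizeof(*d->table));
		if (!d->table)
			ret = -ENOMEM;
		else
			d->nr_pages = nr_pages;
	}
	pthread_mutex_unlock(&d->init_lock);
	return ret;
}

/* Model of the reset path: free while still holding the lock. */
static void fake_reset(struct fake_dev *d)
{
	pthread_mutex_lock(&d->init_lock);
	free(d->table);
	d->table = NULL;
	d->nr_pages = 0;
	pthread_mutex_unlock(&d->init_lock);	/* unlock only after freeing */
}
```

With the unlock placed before the free, a store could allocate a fresh table
in the window and the reset path would then tear down state it no longer
owns. Keeping both under the lock closes that window in this model.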
On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
> On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > So do you want to take the position:
> >
> > Hey driver authors: you cannot use any shared lock on module removal and
> > on sysfs ops?
>
> Yes, I would not recommend using such a lock at all. sysfs operations
> happen on a per-device basis, so you can lock the device structure.
All devices are going to be removed on module removal and so cannot be locked.
Luis
On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > On Mon, Oct 18, 2021 at 12:32:11PM -0700, Luis Chamberlain wrote:
> > > On Sat, Oct 16, 2021 at 07:28:39PM +0800, Ming Lei wrote:
> > > > On Fri, Oct 15, 2021 at 10:31:31AM -0700, Luis Chamberlain wrote:
> > > > > On Fri, Oct 15, 2021 at 04:36:11PM +0800, Ming Lei wrote:
> > > > > > On Thu, Oct 14, 2021 at 05:22:40PM -0700, Luis Chamberlain wrote:
> > > > > > > On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> > > > > > ...
> > > > > > > >
> > > > > > > > We need to understand the exact reason why there is still cpuhp node
> > > > > > > > left, can you share us the exact steps for reproducing the issue?
> > > > > > > > Otherwise we may have to trace and narrow down the reason.
> > > > > > >
> > > > > > > See my commit log for my own fix for this issue.
> > > > > >
> > > > > > OK, thanks!
> > > > > >
> > > > > > I can reproduce the issue, and the reason is that reset_store fails
> > > > > > zram_remove() when unloading module, then the warning is caused.
> > > > > >
> > > > > > The top 3 patches in the following tree can fix the issue:
> > > > > >
> > > > > > https://github.com/ming1/linux/commits/my_v5.15-blk-dev
> > > > >
> > > > > Thanks for trying an alternative fix! A crash stops yes, however this
> > > >
> > > > I doubt it is alternative since your patchset doesn't mention the exact
> > > > reason of 'Error: Removing state 63 which has instances left.', that is
> > > > simply caused by failing to remove zram because ->claim is set during
> > > > unloading module.
> > >
> > > Well I disagree because it does explain how the race can happen, and it
> > > also explains how since the sysfs interface is exposed until module
> > > removal completes, it leaves exposed knobs to allow re-initializing of a
> > > struct zcomp for a zram device before the exit.
> > >
> > > > Yeah, you mentioned the race between disksize_store() vs. zram_remove(),
> > > > however I don't think it is reproduced easily in the test because the race
> > > > window is pretty small, also it can be fixed easily in my 3rd path
> > > > without any complicated tricks.
> > >
> > > Reproducing for me is... extremely easy.
> >
> > In my observation, failing zram_remove() is extremely easy to trigger, which
> > is caused by reset_store() which sets ->reclaim as true, so
> > zram_remove() is failed and zram_reset_device() is bypassed , then the
> > failure of 'Error: Removing state 63 which has instances left.' is caused.
> >
> > We are in same page?
>
> The actual first issue is the CPU hotplug remove callback is long gone and
> in the meantime we allow a race to add a new "instance", in the zram
> driver's case a cpu struct zcomp instance though the sysfs interface.
>
> Regardless of if zram_remove() can fail or not, the above race needs to
> be addressed.
>
> > > > Not dig into details of your patchset via grabbing module reference
> > > > count during show/store attribute of kernfs which is done in your patch
> > > > 9, but IMO this way isn't necessary:
> > >
> > > That's to address the deadlock only.
> > >
> > > > 1) any driver module has to cleanup anything which may refer to symbols
> > > > or data defined in module_exit of this driver
> > >
> > > Yes, and as the cpu multistate hotplug documentation warns (although
> > > such documentation is kind of hidden) that driver authors need to be
> > > careful with module removal too, refer to the warning at the end of
> > > __cpuhp_remove_state_cpuslocked() about module removal.
> >
> > It is zram's bug. zram has to clean everything in module_exit(),
> > unfortunately zram_remove() can be failed when calling from
> > module_exit() because ->claim is set as true by reset_store(), then
> > zram_reset_device()(->zcomp_destroy) isn't called, and this failure should
> > not happen when unloading module, should it?
>
> You're addressing a possible failig zram_remove() while I address not
> allowing entry to muck with the zram driver at all once we're bailing
> on module removal.
>
> > > > 2) device_del() is often done in module_exit(), once device_del()
> > > > returns, no any new show/store on the device's kobject attribute
> > > > is possible.
> > >
> > > Right and if a syfs knob is exposed before device_del() completely
> > > and is allowed to do things, the driver should take care to prevent
> > > races for CPU multistate support. The small state machine I added ensures
> >
> > What is the race for CPU multistate support? If you mean 'Error: Removing
> > state 63 which has instances left.', it is zram's bug since zram has to
> > cleanup everything in module_exit().
>
> Yes. And it is what my out of tree yet Acked patch, 'zram: fix
> crashes with cpu hotplug multistate' does.
Unfortunately that patch adds a new deadlock between hot_remove_store() and
disksize_store() & others, see my comment below.
>
> > > we don't run over any expectations from cpu hotplug multistate support.
> > >
> > > I've *never* suggested there cannot be alternatives to my solution with
> > > the small state machine, but for you to say it is incorrect is simply
> > > not right either.
> > >
> > > > 3) it is _not_ a must or pattern for fixing bugs to hold one lock before
> > > > calling device_del(), meantime the lock is required in the device's
> > > > attribute show()/store(), which causes AA deadlock easily. Your approach
> > > > just avoids the issue by not releasing module until all show/store are
> > > > done.
> > >
> > > Right, there are two approaches here:
> > >
> > > a) Your approach is to accept the deadlock as a requirement and so
> > > you would prefer to implement an alternative to using a shared lock
> > > on module exit and sysfs op.
> >
> > wrt. in-tree zram, there is neither any deadlock in linus tree, nor after
> > applying my 3 patches. If you think there is, please share us the code
> > or lockdep warning.
>
> Right, 'zram: fix crashes with cpu hotplug multistate' is not yet
> merged, my approach to fixing that does add a lock use on module removal
> which does introduce a possible deadlock with syfs, which is later addressed
> generically between sysfs and module removal for all drivers.
>
> > > b) While I address such a deadlock head on as I think this sort of locking
> > > be allowed for two reasons:
> > > b1) as we never documented such requirement otherwise.
> > > b2) There is a possibility that other drivers already exist too
> > > which *do* use a shared lock on module removal and sysfs ops
> > > (and I just confirmed this to be true)
> >
> > The 'deadlock' is actually caused by your out-of-tree patch of 'zram: fix
> > crashes with cpu hotplug multistate' which adds mutex_lock(zram_index_mutex)
> > in destroy_devices().
>
> Yes yes, but you are completely throwing out the window that other
> possible deadlocks can exist in the kernel *and* that *new* cases of
> the deadlock can easily also be added!
>
> > We can fix this issue easily without needing the global lock, please see the
> > attached(pre-V2) patch.
>
> So far your patches do not fix the issues though...
>
> > > So I *really* don't think it is wise for us to simply accept this new
> > > found deadlock as a *new* requirement, specially if we can fix it easily.
> > >
> > > A cursory review using Coccinelle potential issues with mutex lock
> > > directly used on module exit (so this doesn't cover drivers like zram
> > > which uses a routine and then grabs the lock through indirection) and a
> > > sysfs op shows these drivers are also affected by this deadlock:
> > >
> > > * arch/powerpc/sysdev/fsl_mpic_timer_wakeup.c
> >
> > In fsl_wakeup_sys_exit(), device_remove_file() is called before
> > acquiring &sysfs_lock, so there shouldn't be such AA deadlock.
> >
> > > * lib/test_firmware.c
> >
> > Yeah, there is the AA deadlock risk, but it should be fixed by moving
> > misc_deregister() out of &test_fw_mutex.
>
> And just like that you are ignoring other possible uses in the kernel
> which might have similar deadlocks.
>
> So do you want to take the position:
>
> Hey driver authors: you cannot use any shared lock on module removal and
> on sysfs ops?
IMO, yes. In your patch 'zram: fix crashes with cpu hotplug multistate',
you added mutex_lock(zram_index_mutex) to disksize_store() and other
attribute show() or store() methods. That added a new deadlock
between hot_remove_store() and disksize_store() & others, which can't be
addressed by your approach of holding the module refcnt.
So far I have not seen LTP tests covering the hot add/remove interface.
Thanks,
Ming
On Tue, Oct 19, 2021 at 09:30:05AM -0700, Luis Chamberlain wrote:
> On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
> > On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > > So do you want to take the position:
> > >
> > > Hey driver authors: you cannot use any shared lock on module removal and
> > > on sysfs ops?
> >
> > Yes, I would not recommend using such a lock at all. sysfs operations
> > happen on a per-device basis, so you can lock the device structure.
>
> All devices are going to be removed on module removal and so cannot be locked.
devices are not normally created by a driver, that is up to the bus
controller logic. A module will just disconnect itself from the device,
the device does not go away.
But yes, there are exceptions, and if you are doing something odd like
that, then you need to be aware of crazy things like this, so be
careful. But for all normal drivers, they do not have to worry about
this.
thanks,
greg k-h
On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > > Please try the following patch against upstream(linus or next) tree(basically
> > > fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> > > module_exit(), race between zram_remove() and disksize_store()), and see if
> > > everything is fine for you:
> >
> > Page fault ...
> >
> > [ 18.284256] zram: Removed device: zram0
> > [ 18.312974] BUG: unable to handle page fault for address: ffffad86de903008
> > [ 18.313707] #PF: supervisor read access in kernel mode
> > [ 18.314248] #PF: error_code(0x0000) - not-present page
> > [ 18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
>
> That is another race between zram_reset_device() and disksize_store(),
> which is supposed to be covered by ->init_lock, and follows the delta fix
> against the last patch I posted, and the whole patch can be found in the
> github link:
>
> https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb16894
>
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index d0cae7a42f4d..a14ba3d350ea 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> set_capacity_and_notify(zram->disk, 0);
> part_stat_set_all(zram->disk->part0, 0);
>
> - up_write(&zram->init_lock);
> /* I/O operation under all of CPU are done so let's free */
> zram_meta_free(zram, disksize);
> memset(&zram->stats, 0, sizeof(zram->stats));
> zcomp_destroy(comp);
> reset_bdev(zram);
> + up_write(&zram->init_lock);
> }
>
> static ssize_t disksize_store(struct device *dev,
With this, it still ends up in a state where we loop and can't get out:
zram: Can't change algorithm for initialized device
Luis
On Wed, Oct 20, 2021 at 12:39:22AM +0800, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > So do you want to take the position:
> >
> > Hey driver authors: you cannot use any shared lock on module removal and
> > on sysfs ops?
>
> IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate',
> when you added mutex_lock(zram_index_mutex) to disksize_store() and
> other attribute show() or store() method. You have added new deadlock
> between hot_remove_store() and disksize_store() & others, which can't be
> addressed by your approach of holding module refcnt.
>
> So far not see ltp tests covers hot add/remove interface yet.
Care to show what commands to use to cause this deadlock with my patches?
Luis
On Tue, Oct 19, 2021 at 07:28:35PM +0200, Greg KH wrote:
> On Tue, Oct 19, 2021 at 09:30:05AM -0700, Luis Chamberlain wrote:
> > On Tue, Oct 19, 2021 at 06:25:18PM +0200, Greg KH wrote:
> > > On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > > > So do you want to take the position:
> > > >
> > > > Hey driver authors: you cannot use any shared lock on module removal and
> > > > on sysfs ops?
> > >
> > > Yes, I would not recommend using such a lock at all. sysfs operations
> > > happen on a per-device basis, so you can lock the device structure.
> >
> > All devices are going to be removed on module removal and so cannot be locked.
>
> devices are not normally created by a driver, that is up to the bus
> controller logic. A module will just disconnect itself from the device,
> the device does not go away.
>
> But yes, there are exceptions, and if you are doing something odd like
> that, then you need to be aware of crazy things like this, so be
> careful. But for all normal drivers, they do not have to worry about
> this.
"Recommend" is a weak position to take given a possible deadlock with sysfs.
Do we want to at the very least document this is not a supported scheme?
If so I can also add a simple 1-level indirection coccinelle patch to
detect these schemes and complain about them as well, if we are going to
take this position.
But to simply disregard this as "not an issue", or to say we won't do
anything, seems pretty counterproductive given we *did* have drivers with
this issue before *and* still have them upstream, and can end up with more
drivers like this later.
Luis
On Tue, Oct 19, 2021 at 12:38:42PM -0700, Luis Chamberlain wrote:
> On Wed, Oct 20, 2021 at 12:39:22AM +0800, Ming Lei wrote:
> > On Tue, Oct 19, 2021 at 08:50:24AM -0700, Luis Chamberlain wrote:
> > > So do you want to take the position:
> > >
> > > Hey driver authors: you cannot use any shared lock on module removal and
> > > on sysfs ops?
> >
> > IMO, yes, in your patch of 'zram: fix crashes with cpu hotplug multistate',
> > when you added mutex_lock(zram_index_mutex) to disksize_store() and
> > other attribute show() or store() method. You have added new deadlock
> > between hot_remove_store() and disksize_store() & others, which can't be
> > addressed by your approach of holding module refcnt.
> >
> > So far not see ltp tests covers hot add/remove interface yet.
>
> Care to show what commands to use to cause this deadlock with my patches?
Build a kernel with your patches 4, 7, 8, 9, 11 and 12 (all others are test
module or documentation changes) with lockdep enabled, run the following
commands, and you will see the warning. It is a real deadlock, not a false
warning.
BTW, your patch 9 can't be applied cleanly against either the linus or the
next tree, so I edited it manually, but that makes no difference wrt. this
issue.
[root@ktest-09 ~]# lsblk | grep zram
zram0 253:0 0 0B 0 disk
[root@ktest-09 ~]# cat /sys/class/zram-control/hot_add
[root@ktest-09 ~]# lsblk | grep zram
zram0 253:0 0 0B 0 disk
zram1 253:1 0 0B 0 disk
[root@ktest-09 ~]# echo 256M > /sys/block/zram1/disksize
[root@ktest-09 ~]# echo 1 > /sys/class/zram-control/hot_remove
[root@ktest-09 ~]# dmesg
...
[ 75.599882] ======================================================
[ 75.601355] WARNING: possible circular locking dependency detected
[ 75.602818] 5.15.0-rc3_zram_fix_luis+ #24 Not tainted
[ 75.604038] ------------------------------------------------------
[ 75.605512] bash/1154 is trying to acquire lock:
[ 75.606634] ffff91ce026cd428 (kn->active#237){++++}-{0:0}, at: __kernfs_remove+0x1ab/0x1e0
[ 75.608570]
but task is already holding lock:
[ 75.609955] ffffffff839e3ef0 (zram_index_mutex){+.+.}-{3:3}, at: hot_remove_store+0x52/0xf0
[ 75.611910]
which lock already depends on the new lock.
[ 75.613896]
the existing dependency chain (in reverse order) is:
[ 75.615830]
-> #1 (zram_index_mutex){+.+.}-{3:3}:
[ 75.617483] __lock_acquire+0x4d2/0x930
[ 75.618650] lock_acquire+0xbb/0x2d0
[ 75.619748] __mutex_lock+0x8e/0x8a0
[ 75.620854] disksize_store+0x38/0x180
[ 75.621996] kernfs_fop_write_iter+0x134/0x1d0
[ 75.623287] new_sync_write+0x122/0x1b0
[ 75.624442] vfs_write+0x23e/0x350
[ 75.625506] ksys_write+0x68/0xe0
[ 75.626550] do_syscall_64+0x3b/0x90
[ 75.627649] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 75.629070]
-> #0 (kn->active#237){++++}-{0:0}:
[ 75.630677] check_prev_add+0x91/0xc10
[ 75.631816] validate_chain+0x474/0x500
[ 75.632972] __lock_acquire+0x4d2/0x930
[ 75.634131] lock_acquire+0xbb/0x2d0
[ 75.635234] kernfs_drain+0x139/0x190
[ 75.636355] __kernfs_remove+0x1ab/0x1e0
[ 75.637532] kernfs_remove_by_name_ns+0x3f/0x80
[ 75.638843] remove_files+0x2b/0x60
[ 75.639926] sysfs_remove_group+0x38/0x80
[ 75.641120] sysfs_remove_groups+0x29/0x40
[ 75.642334] device_remove_attrs+0x5b/0x90
[ 75.643552] device_del+0x184/0x400
[ 75.644635] zram_remove+0xac/0xc0
[ 75.645700] hot_remove_store+0xa3/0xf0
[ 75.646856] kernfs_fop_write_iter+0x134/0x1d0
[ 75.648147] new_sync_write+0x122/0x1b0
[ 75.649311] vfs_write+0x23e/0x350
[ 75.650372] ksys_write+0x68/0xe0
[ 75.651412] do_syscall_64+0x3b/0x90
[ 75.652512] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 75.653929]
other info that might help us debug this:
[ 75.656054] Possible unsafe locking scenario:
[ 75.657637] CPU0 CPU1
[ 75.658833] ---- ----
[ 75.660020] lock(zram_index_mutex);
[ 75.661024] lock(kn->active#237);
[ 75.662549] lock(zram_index_mutex);
[ 75.664103] lock(kn->active#237);
[ 75.665072]
*** DEADLOCK ***
[ 75.666736] 4 locks held by bash/1154:
[ 75.667767] #0: ffff91ce06983470 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x68/0xe0
[ 75.669802] #1: ffff91ce4123d290 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x100/0x1d0
[ 75.672050] #2: ffff91ce05a7ac40 (kn->active#238){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x108/0x1d0
[ 75.674383] #3: ffffffff839e3ef0 (zram_index_mutex){+.+.}-{3:3}, at: hot_remove_store+0x52/0xf0
[ 75.676595]
stack backtrace:
[ 75.677835] CPU: 2 PID: 1154 Comm: bash Not tainted 5.15.0-rc3_zram_fix_luis+ #24
[ 75.679768] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[ 75.681927] Call Trace:
[ 75.682674] dump_stack_lvl+0x57/0x7d
[ 75.683680] check_noncircular+0xff/0x110
[ 75.684758] ? stack_trace_save+0x4b/0x70
[ 75.685843] check_prev_add+0x91/0xc10
[ 75.686867] ? add_chain_cache+0x112/0x2d0
[ 75.687965] validate_chain+0x474/0x500
[ 75.689005] __lock_acquire+0x4d2/0x930
[ 75.690054] lock_acquire+0xbb/0x2d0
[ 75.691038] ? __kernfs_remove+0x1ab/0x1e0
[ 75.692131] ? __lock_release+0x179/0x2c0
[ 75.693212] ? kernfs_drain+0x5b/0x190
[ 75.694239] kernfs_drain+0x139/0x190
[ 75.695240] ? __kernfs_remove+0x1ab/0x1e0
[ 75.696341] __kernfs_remove+0x1ab/0x1e0
[ 75.697408] kernfs_remove_by_name_ns+0x3f/0x80
[ 75.698607] remove_files+0x2b/0x60
[ 75.699576] sysfs_remove_group+0x38/0x80
[ 75.700661] sysfs_remove_groups+0x29/0x40
[ 75.701770] device_remove_attrs+0x5b/0x90
[ 75.702870] device_del+0x184/0x400
[ 75.703835] zram_remove+0xac/0xc0
[ 75.704785] hot_remove_store+0xa3/0xf0
[ 75.705831] kernfs_fop_write_iter+0x134/0x1d0
[ 75.707004] new_sync_write+0x122/0x1b0
[ 75.708048] ? __do_fast_syscall_32+0xe0/0xf0
[ 75.709214] vfs_write+0x23e/0x350
[ 75.710161] ksys_write+0x68/0xe0
[ 75.711088] do_syscall_64+0x3b/0x90
[ 75.712078] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 75.713389] RIP: 0033:0x7fcc1893f927
[ 75.714381] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 75.718879] RSP: 002b:00007ffcd56d91a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 75.720832] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fcc1893f927
[ 75.722592] RDX: 0000000000000002 RSI: 000055d7d33f78c0 RDI: 0000000000000001
[ 75.724352] RBP: 000055d7d33f78c0 R08: 0000000000000000 R09: 00007fcc189f44e0
[ 75.726123] R10: 00007fcc189f43e0 R11: 0000000000000246 R12: 0000000000000002
[ 75.727884] R13: 00007fcc18a395a0 R14: 0000000000000002 R15: 00007fcc18a397a0
Thanks,
Ming
On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
> On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> > On Tue, Oct 19, 2021 at 08:28:21AM -0700, Luis Chamberlain wrote:
> > > On Tue, Oct 19, 2021 at 10:34:41AM +0800, Ming Lei wrote:
> > > > Please try the following patch against upstream(linus or next) tree(basically
> > > > fold revised 2 and 3 of V1, and cover two issues: not fail zram_remove in
> > > > module_exit(), race between zram_remove() and disksize_store()), and see if
> > > > everything is fine for you:
> > >
> > > Page fault ...
> > >
> > > [ 18.284256] zram: Removed device: zram0
> > > [ 18.312974] BUG: unable to handle page fault for address:
> > > ffffad86de903008
> > > [ 18.313707] #PF: supervisor read access in kernel mode
> > > [ 18.314248] #PF: error_code(0x0000) - not-present page
> > > [ 18.314797] PGD 100000067 P4D 100000067 PUD 10031e067 PMD 136a28067
> >
> > That is another race between zram_reset_device() and disksize_store(),
> > which is supposed to be covered by ->init_lock, and follows the delta fix
> > against the last patch I posted, and the whole patch can be found in the
> > github link:
> >
> > https://github.com/ming1/linux/commit/fa6045b1371eb301f392ac84adaf3ad53bb16894
> >
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index d0cae7a42f4d..a14ba3d350ea 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> > set_capacity_and_notify(zram->disk, 0);
> > part_stat_set_all(zram->disk->part0, 0);
> >
> > - up_write(&zram->init_lock);
> > /* I/O operation under all of CPU are done so let's free */
> > zram_meta_free(zram, disksize);
> > memset(&zram->stats, 0, sizeof(zram->stats));
> > zcomp_destroy(comp);
> > reset_bdev(zram);
> > + up_write(&zram->init_lock);
> > }
> >
> > static ssize_t disksize_store(struct device *dev,
>
> With this, it still ends up in a state where we loop and can't get out of:
>
> zram: Can't change algorithm for initialized device
Again, you are running two instances of zram02.sh[1] on /dev/zram0, so that isn't unexpected
behavior. The only difference here is timing. In my test VM,
this message shows up for a while in one task, then it may switch to
another task.
Just running your patches for a while shows no real difference here, and the
following message can be dumped from one task for a long time:
can't set '107374182400' to /sys/block/zram0/disksize
Also, you did not answer my question about the result you expect from your test when
running the following script from two terminals concurrently:
while true; do
PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
done
Thanks,
Ming
On Tue, 19 Oct 2021, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > forgetting that there *may* already be present drivers which *do* implement
> > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > I was informed livepatching *did* have that issue as well and so very
> > > > likely a generic solution to the deadlock could be beneficial to other
> > > > random drivers.
> > >
> > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> >
> > I would not call it a fix. It is a kind of ugly workaround because the
> > generic infrastructure lacked (lacks) the proper support in my opinion.
> > Luis is trying to fix that.
>
> What is the proper support of the generic infrastructure? I am not
> familiar with livepatching's model(especially with module unload), you mean
> livepatching have to do the following way from sysfs:
>
> 1) during module exit:
>
> mutex_lock(lp_lock);
> kobject_put(lp_kobj);
> mutex_unlock(lp_lock);
>
> 2) show()/store() method of attributes of lp_kobj
>
> mutex_lock(lp_lock)
> ...
> mutex_unlock(lp_lock)
Yes, this was exactly the case. We then reworked it a lot (see
958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so
now the call sequence is different. kobject_put() is basically offloaded
to a workqueue scheduled right from the store() method. Meaning that
Luis's work would probably not help us currently, but on the other hand
the issues with AA deadlock were one of the main drivers of the redesign
(if I remember correctly). There were other reasons too as the changelog
of the commit describes.
So, from my perspective, if there was a way to easily synchronize between
a data cleanup from module_exit callback and sysfs/kernfs operations, it
could spare people many headaches.
> IMO, the above usage simply caused AA deadlock. Even in Luis's patch
> 'zram: fix crashes with cpu hotplug multistate', new/same AA deadlock
> (hot_remove_store() vs. disksize_store() or reset_store()) is added
> because hot_remove_store() isn't called from module_exit().
>
> Luis tries to delay unloading module until all show()/store() are done. But
> that can be obtained by the following way simply during module_exit():
>
> kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
> //no new store()/show() can come after
> //kobject_del() returns
> mutex_lock(lp_lock);
> kobject_put(lp_kobj);
> mutex_unlock(lp_lock);
kobject_del() already calls kobject_put(). Did you mean __kobject_del()?
That one is internal though.
> Or can you explain your requirement on kobject/module unload in a bit
> details?
Does the above make sense?
Thanks
Miroslav
On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> On Tue, 19 Oct 2021, Ming Lei wrote:
>
> > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > random drivers.
> > > >
> > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > >
> > > I would not call it a fix. It is a kind of ugly workaround because the
> > > generic infrastructure lacked (lacks) the proper support in my opinion.
> > > Luis is trying to fix that.
> >
> > What is the proper support of the generic infrastructure? I am not
> > familiar with livepatching's model(especially with module unload), you mean
> > livepatching have to do the following way from sysfs:
> >
> > 1) during module exit:
> >
> > mutex_lock(lp_lock);
> > kobject_put(lp_kobj);
> > mutex_unlock(lp_lock);
> >
> > 2) show()/store() method of attributes of lp_kobj
> >
> > mutex_lock(lp_lock)
> > ...
> > mutex_unlock(lp_lock)
>
> Yes, this was exactly the case. We then reworked it a lot (see
> 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so
> now the call sequence is different. kobject_put() is basically offloaded
> to a workqueue scheduled right from the store() method. Meaning that
> Luis's work would probably not help us currently, but on the other hand
> the issues with AA deadlock were one of the main drivers of the redesign
> (if I remember correctly). There were other reasons too as the changelog
> of the commit describes.
>
> So, from my perspective, if there was a way to easily synchronize between
> a data cleanup from module_exit callback and sysfs/kernfs operations, it
> could spare people many headaches.
kobject_del() is supposed to do so, but while calling it you can't hold a
lock that is also required in the show()/store() methods. Once kobject_del()
returns, there are no pending show()/store() calls any more.
The question is why a shared lock is required for livepatching to
delete the kobject. What are you protecting when you delete a kobject?
>
> > IMO, the above usage simply caused AA deadlock. Even in Luis's patch
> > 'zram: fix crashes with cpu hotplug multistate', new/same AA deadlock
> > (hot_remove_store() vs. disksize_store() or reset_store()) is added
> > because hot_remove_store() isn't called from module_exit().
> >
> > Luis tries to delay unloading module until all show()/store() are done. But
> > that can be obtained by the following way simply during module_exit():
> >
> > kobject_del(lp_kobj); //all pending store()/show() from lp_kobj are done,
> > //no new store()/show() can come after
> > //kobject_del() returns
> > mutex_lock(lp_lock);
> > kobject_put(lp_kobj);
> > mutex_unlock(lp_lock);
>
> kobject_del() already calls kobject_put(). Did you mean __kobject_del().
> That one is internal though.
kobject_del() is the counterpart of kobject_add(), and kobject_put() will
call kobject_del() automatically if the kobject isn't deleted yet, but usually
kobject_put() is for releasing the object only. It is more common to
release a kobject by calling kobject_del() followed by kobject_put().
>
> > Or can you explain your requirement on kobject/module unload in a bit
> > details?
>
> Does the above makes sense?
I think the focus now is the shared lock between kobject_del() and
show()/store() of the kobject's attributes.
Thanks,
Ming
On Wed, 20 Oct 2021, Ming Lei wrote:
> On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > On Tue, 19 Oct 2021, Ming Lei wrote:
> >
> > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > random drivers.
> > > > >
> > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > >
> > > > I would not call it a fix. It is a kind of ugly workaround because the
> > > > generic infrastructure lacked (lacks) the proper support in my opinion.
> > > > Luis is trying to fix that.
> > >
> > > What is the proper support of the generic infrastructure? I am not
> > > familiar with livepatching's model(especially with module unload), you mean
> > > livepatching have to do the following way from sysfs:
> > >
> > > 1) during module exit:
> > >
> > > mutex_lock(lp_lock);
> > > kobject_put(lp_kobj);
> > > mutex_unlock(lp_lock);
> > >
> > > 2) show()/store() method of attributes of lp_kobj
> > >
> > > mutex_lock(lp_lock)
> > > ...
> > > mutex_unlock(lp_lock)
> >
> > Yes, this was exactly the case. We then reworked it a lot (see
> > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so
> > now the call sequence is different. kobject_put() is basically offloaded
> > to a workqueue scheduled right from the store() method. Meaning that
> > Luis's work would probably not help us currently, but on the other hand
> > the issues with AA deadlock were one of the main drivers of the redesign
> > (if I remember correctly). There were other reasons too as the changelog
> > of the commit describes.
> >
> > So, from my perspective, if there was a way to easily synchronize between
> > a data cleanup from module_exit callback and sysfs/kernfs operations, it
> > could spare people many headaches.
>
> kobject_del() is supposed to do so, but you can't hold a shared lock
> which is required in show()/store() method. Once kobject_del() returns,
> no pending show()/store() any more.
>
> The question is that why one shared lock is required for livepatching to
> delete the kobject. What are you protecting when you delete one kobject?
I think it boils down to the fact that we embed kobject statically to
structures which livepatch uses to maintain data. That is discouraged
generally, but all the attempts to implement it correctly were utter
failures.
Miroslav
On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> On Wed, 20 Oct 2021, Ming Lei wrote:
>
> > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > >
> > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > random drivers.
> > > > > >
> > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > >
> > > > > I would not call it a fix. It is a kind of ugly workaround because the
> > > > > generic infrastructure lacked (lacks) the proper support in my opinion.
> > > > > Luis is trying to fix that.
> > > >
> > > > What is the proper support of the generic infrastructure? I am not
> > > > familiar with livepatching's model(especially with module unload), you mean
> > > > livepatching have to do the following way from sysfs:
> > > >
> > > > 1) during module exit:
> > > >
> > > > mutex_lock(lp_lock);
> > > > kobject_put(lp_kobj);
> > > > mutex_unlock(lp_lock);
> > > >
> > > > 2) show()/store() method of attributes of lp_kobj
> > > >
> > > > mutex_lock(lp_lock)
> > > > ...
> > > > mutex_unlock(lp_lock)
> > >
> > > Yes, this was exactly the case. We then reworked it a lot (see
> > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so
> > > now the call sequence is different. kobject_put() is basically offloaded
> > > to a workqueue scheduled right from the store() method. Meaning that
> > > Luis's work would probably not help us currently, but on the other hand
> > > the issues with AA deadlock were one of the main drivers of the redesign
> > > (if I remember correctly). There were other reasons too as the changelog
> > > of the commit describes.
> > >
> > > So, from my perspective, if there was a way to easily synchronize between
> > > a data cleanup from module_exit callback and sysfs/kernfs operations, it
> > > could spare people many headaches.
> >
> > kobject_del() is supposed to do so, but you can't hold a shared lock
> > which is required in show()/store() method. Once kobject_del() returns,
> > no pending show()/store() any more.
> >
> > The question is that why one shared lock is required for livepatching to
> > delete the kobject. What are you protecting when you delete one kobject?
>
> I think it boils down to the fact that we embed kobject statically to
> structures which livepatch uses to maintain data. That is discouraged
> generally, but all the attempts to implement it correctly were utter
> failures.
Sounds like this is the real problem that needs to be fixed. kobjects
should always control the lifespan of the structure they are embedded
in. If not, then that is a design flaw of the user of the kobject :(
Where in the kernel is this happening? And where have the attempts
to fix this up been?
thanks,
greg k-h
On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> On Wed, 20 Oct 2021, Ming Lei wrote:
>
> > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > >
> > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > random drivers.
> > > > > >
> > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > >
> > > > > I would not call it a fix. It is a kind of ugly workaround because the
> > > > > generic infrastructure lacked (lacks) the proper support in my opinion.
> > > > > Luis is trying to fix that.
> > > >
> > > > What is the proper support of the generic infrastructure? I am not
> > > > familiar with livepatching's model(especially with module unload), you mean
> > > > livepatching have to do the following way from sysfs:
> > > >
> > > > 1) during module exit:
> > > >
> > > > mutex_lock(lp_lock);
> > > > kobject_put(lp_kobj);
> > > > mutex_unlock(lp_lock);
> > > >
> > > > 2) show()/store() method of attributes of lp_kobj
> > > >
> > > > mutex_lock(lp_lock)
> > > > ...
> > > > mutex_unlock(lp_lock)
> > >
> > > Yes, this was exactly the case. We then reworked it a lot (see
> > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so
> > > now the call sequence is different. kobject_put() is basically offloaded
> > > to a workqueue scheduled right from the store() method. Meaning that
> > > Luis's work would probably not help us currently, but on the other hand
> > > the issues with AA deadlock were one of the main drivers of the redesign
> > > (if I remember correctly). There were other reasons too as the changelog
> > > of the commit describes.
> > >
> > > So, from my perspective, if there was a way to easily synchronize between
> > > a data cleanup from module_exit callback and sysfs/kernfs operations, it
> > > could spare people many headaches.
> >
> > kobject_del() is supposed to do so, but you can't hold a shared lock
> > which is required in show()/store() method. Once kobject_del() returns,
> > no pending show()/store() any more.
> >
> > The question is that why one shared lock is required for livepatching to
> > delete the kobject. What are you protecting when you delete one kobject?
>
> I think it boils down to the fact that we embed kobject statically to
> structures which livepatch uses to maintain data. That is discouraged
> generally, but all the attempts to implement it correctly were utter
> failures.
OK, then it isn't the common usage, in which the kobject covers the release
of the external object. What is the exact kobject in livepatching?
But kobject_del() won't release the kobject, so you shouldn't need the lock
to delete the kobject first. After the kobject is deleted, there are no show()
or store() calls any more; isn't that the sync[1] you expected?
Thanks,
Ming
On Wed, Oct 20, 2021 at 09:15:20AM +0800, Ming Lei wrote:
> On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
> > On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index d0cae7a42f4d..a14ba3d350ea 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> > > set_capacity_and_notify(zram->disk, 0);
> > > part_stat_set_all(zram->disk->part0, 0);
> > >
> > > - up_write(&zram->init_lock);
> > > /* I/O operation under all of CPU are done so let's free */
> > > zram_meta_free(zram, disksize);
> > > memset(&zram->stats, 0, sizeof(zram->stats));
> > > zcomp_destroy(comp);
> > > reset_bdev(zram);
> > > + up_write(&zram->init_lock);
> > > }
> > >
> > > static ssize_t disksize_store(struct device *dev,
> >
> > With this, it still ends up in a state where we loop and can't get out of:
> >
> > zram: Can't change algorithm for initialized device
>
> Again, you are running two zram02.sh[1] on /dev/zram0, that isn't unexpected
You mean that it is not expected? If so then yes, of course.
> behavior. Here the difference is just timing.
Right, but that is what helped reproduce a difficult-to-reproduce customer
bug. Once you find an easy way to reproduce a reported issue you stick
with it and try to make the situation worse to ensure no more bugs are
present.
> Also you did not answer my question about your test expected result when
> running the following script from two terminal concurrently:
>
> while true; do
> PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
> done
If you run this, you should see no failures.
Once you start a second script, that one should cause odd issues on both
sides but never crash or stall the module.
A second series of tests is hitting CTRL-C on either one at random and
restarting testing once again at random.
Again, neither should crash the kernel or stall the module.
At the end of these tests you should be able to run the script alone
just once and not see issues.
Luis
On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
> On Wed, Oct 20, 2021 at 09:15:20AM +0800, Ming Lei wrote:
> > On Tue, Oct 19, 2021 at 12:36:42PM -0700, Luis Chamberlain wrote:
> > > On Wed, Oct 20, 2021 at 12:29:53AM +0800, Ming Lei wrote:
> > > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > > index d0cae7a42f4d..a14ba3d350ea 100644
> > > > --- a/drivers/block/zram/zram_drv.c
> > > > +++ b/drivers/block/zram/zram_drv.c
> > > > @@ -1704,12 +1704,12 @@ static void zram_reset_device(struct zram *zram)
> > > > set_capacity_and_notify(zram->disk, 0);
> > > > part_stat_set_all(zram->disk->part0, 0);
> > > >
> > > > - up_write(&zram->init_lock);
> > > > /* I/O operation under all of CPU are done so let's free */
> > > > zram_meta_free(zram, disksize);
> > > > memset(&zram->stats, 0, sizeof(zram->stats));
> > > > zcomp_destroy(comp);
> > > > reset_bdev(zram);
> > > > + up_write(&zram->init_lock);
> > > > }
> > > >
> > > > static ssize_t disksize_store(struct device *dev,
> > >
> > > With this, it still ends up in a state where we loop and can't get out of:
> > >
> > > zram: Can't change algorithm for initialized device
> >
> > Again, you are running two zram02.sh[1] on /dev/zram0, that isn't unexpected
>
> You mean that it is not expected? If so then yes, of course.
My meaning is clear: it is not unexpected, so it is expected.
>
> > behavior. Here the difference is just timing.
>
> Right, but that is what helped reproduce a difficult-to-reproduce customer
> bug. Once you find an easy way to reproduce a reported issue you stick
> with it and try to make the situation worse to ensure no more bugs are
> present.
>
> > Also you did not answer my question about your test expected result when
> > running the following script from two terminal concurrently:
> >
> > while true; do
> > PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
> > done
>
> If you run this, you should see no failures.
OK, I do not see any failure when running a single zram02.sh after applying my
patch V2.
>
> Once you start a second script that one should cause odd issues on both
> sides but never crash or stall the module.
No crash can be observed with my patch V2. What do you mean by 'stall'
the module? Is it that 'zram' can't be unloaded after the test is
terminated via multiple 'ctrl-c'?
>
> A second series of tests is hitting CTRL-C on either randomly and
> restarting testing once again randomly.
ltp/zram02.sh has a cleanup handler, installed via trap, to clean up everything
(swapoff/umount/reset/rmmod). ctrl-c will terminate the current foreground task
and cause the shell to run the cleanup handler first, but a further 'ctrl-c' will
terminate the cleanup handler, so the cleanup won't be completed: for example,
the zram disk is left as a swap device and zram can't be unloaded. The idea can
be observed via the following script:
#!/bin/bash
trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
sleep 30
After the above script is run in the foreground, when the 1st ctrl-c is pressed, 'sleep 30'
is terminated, then the trap command is run, so you can see "enter trap"
dumped. Then if you press a 2nd ctrl-c, 'sleep 20' is terminated immediately.
So 'swapoff' from zram02.sh's trap function can be terminated in this way.
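For illustration only, here is a minimal sketch of how a trap handler could
shield itself from further ctrl-c by ignoring SIGINT on entry. This is not
what zram02.sh does; the cleanup function and the echo that stands in for
swapoff/umount/reset/rmmod are hypothetical, and the two 'kill -INT $$'
calls merely simulate key presses. The trade-off is the one raised in this
thread: once SIGINT is ignored, the handler cannot be interrupted either.

```shell
#!/bin/bash
# Child script, run via "bash -c" so that $$ is the pid of the shell
# that owns the trap.
demo='
cleanup() {
    trap "" INT                 # ignore further SIGINT while cleaning up
    kill -INT $$                # simulate a 2nd ctrl-c during cleanup
    echo "cleanup finished"     # stand-in for swapoff/umount/reset/rmmod
    exit 0
}
trap cleanup INT
kill -INT $$                    # simulate the 1st ctrl-c
echo "not reached"
'
result=$(bash -c "$demo")
echo "$result"
```

Because 'trap "" INT' sets the signal to ignored, commands started from the
handler inherit that disposition as well.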
The zram disk being left as a swap disk can be observed with your patch too
after terminating via multiple ctrl-c, which has to be done this way because
the test is a dead loop.
So it is hard to clean up everything completely once multiple 'CTRL-C' are
involved, and it should be impossible. It takes multiple violent ctrl-c to
terminate the dead-loop test.
So it isn't reasonable to expect that zram can always be unloaded successfully
after the test script is terminated via multiple ctrl-c.
But zram can be unloaded after running swapoff manually, so from the driver's
viewpoint, nothing is wrong.
>
> Again, neither should crash the kernel or stall the module.
>
> In the end of these tests you should be able to run the script alone
> just once and not see issues.
Thanks,
Ming
On Thu, Oct 21, 2021 at 08:39:05AM +0800, Ming Lei wrote:
> On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
> > A second series of tests is hitting CTRL-C on either randomly and
> > restarting testing once again randomly.
>
> ltp/zram02.sh has cleanup handler via trap to clean everything(swapoff/umount/reset/
> rmmod), ctrl-c will terminate current foreground task and cause shell to run the
> cleanup handler first, but further 'ctrl-c' will terminate the cleanup handler,
> then the cleanup won't be done completely, such as zram disk is left as swap
> device and zram can't be unloaded. The idea can be observed via the following
> script:
>
> #!/bin/bash
> trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
> sleep 30
>
> After the above script is run foreground, when 1st ctrl-c is pressed, 'sleep 30'
> is terminated, then the trap command is run, so you can see "enter trap"
> dumped. Then if you pressed 2nd ctrl-c, 'sleep 20' is terminated immediately.
> So 'swapoff' from zram02.sh's trap function can be terminated in this way.
>
> zram disk being left as swap disk can be observed with your patch too
> after terminating via multiple ctrl-c which has to be done this way because
> the test is dead loop.
>
> So it is hard to cleanup everything completely after multiple 'CTRL-C' is
> involved, and it should be impossible. It needs violent multiple ctrl-c to
> terminate the dead-loop test.
>
> So it isn't reasonable to expect that zram can be always unloaded successfully
> after the test script is terminated via multiple ctrl-c.
For the life of me, I do not run into these issues with my patch. But
with yours I did.
To be clear, I run zram02.sh on two terminals. Then to interrupt I just leave
CTRL-C pressed to issue multiple terminations until the script is done
on each terminal, one at a time, until I see both have completed.
I repeat the same test, always noting that when I start on one terminal
the test succeeds. And also when I completely cancel one script the
test continues fine without issue.
> But zram can be unloaded after running swapoff manually, from driver
> viewpoint, nothing is wrong.
I had not run into that issue with my patch FWIW.
Luis
On Thu, Oct 21, 2021 at 10:18:47AM -0700, Luis Chamberlain wrote:
> On Thu, Oct 21, 2021 at 08:39:05AM +0800, Ming Lei wrote:
> > On Wed, Oct 20, 2021 at 08:48:04AM -0700, Luis Chamberlain wrote:
> > > A second series of tests is hitting CTRL-C on either randomly and
> > > restarting testing once again randomly.
> >
> > ltp/zram02.sh has cleanup handler via trap to clean everything(swapoff/umount/reset/
> > rmmod), ctrl-c will terminate current foreground task and cause shell to run the
> > cleanup handler first, but further 'ctrl-c' will terminate the cleanup handler,
> > then the cleanup won't be done completely, such as zram disk is left as swap
> > device and zram can't be unloaded. The idea can be observed via the following
> > script:
> >
> > #!/bin/bash
> > trap 'echo "enter trap"; sleep 20; echo "exit trap";' INT
> > sleep 30
> >
> > After the above script is run foreground, when 1st ctrl-c is pressed, 'sleep 30'
> > is terminated, then the trap command is run, so you can see "enter trap"
> > dumped. Then if you pressed 2nd ctrl-c, 'sleep 20' is terminated immediately.
> > So 'swapoff' from zram02.sh's trap function can be terminated in this way.
> >
> > zram disk being left as swap disk can be observed with your patch too
> > after terminating via multiple ctrl-c which has to be done this way because
> > the test is dead loop.
> >
> > So it is hard to cleanup everything completely after multiple 'CTRL-C' is
> > involved, and it should be impossible. It needs violent multiple ctrl-c to
> > terminate the dead-loop test.
> >
> > So it isn't reasonable to expect that zram can be always unloaded successfully
> > after the test script is terminated via multiple ctrl-c.
>
> For the life of me, I do not run into these issue with my patch. But
> with yours I had.
>
> To be clear, I run zram02.sh on two terminals. Then to interrupt I just leave
> CTRL-C pressed to issue multiple terminations until the script is done
> on each terminal at a time, until I see both have completed.
>
> I repeat the same test, noting always that when I start one one terminal
> the test is succeeding. And also when I cancel completely one script the
> test continue fine without issue.
As I explained wrt. the shell's trap, this issue can't be avoided from
userspace because the trap function can be terminated by ctrl-c too;
otherwise a shell script might not be terminable at all.
The unclean shutdown can be observed with a single 'while true; do zram02.sh; done'
too, on both your patches and mine.
Also, it is insane to write a test as a dead loop, and people seldom
do that; I do not see it done that way in either blktests or xfstests.
If you limit the completion time of this test to something long enough (one or
several hours) or to a big enough number of loops, I believe it can be done cleanly,
such as:
cnt=0
MAX=10000
while [ $cnt -lt $MAX ]; do
    PATH=$PATH:$PWD:$PWD/../../../lib/ ./zram02.sh;
    cnt=$((cnt + 1))
done
Thanks,
Ming
On Wed, 20 Oct 2021, Greg KH wrote:
> On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> > On Wed, 20 Oct 2021, Ming Lei wrote:
> >
> > > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > >
> > > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > > random drivers.
> > > > > > >
> > > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > >
> > > > > > I would not call it a fix. It is a kind of ugly workaround because the
> > > > > > generic infrastructure lacked (lacks) the proper support in my opinion.
> > > > > > Luis is trying to fix that.
> > > > >
> > > > > What is the proper support of the generic infrastructure? I am not
> > > > > familiar with livepatching's model(especially with module unload), you mean
> > > > > livepatching have to do the following way from sysfs:
> > > > >
> > > > > 1) during module exit:
> > > > >
> > > > > mutex_lock(lp_lock);
> > > > > kobject_put(lp_kobj);
> > > > > mutex_unlock(lp_lock);
> > > > >
> > > > > 2) show()/store() method of attributes of lp_kobj
> > > > >
> > > > > mutex_lock(lp_lock)
> > > > > ...
> > > > > mutex_unlock(lp_lock)
> > > >
> > > > Yes, this was exactly the case. We then reworked it a lot (see
> > > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so
> > > > now the call sequence is different. kobject_put() is basically offloaded
> > > > to a workqueue scheduled right from the store() method. Meaning that
> > > > Luis's work would probably not help us currently, but on the other hand
> > > > the issues with AA deadlock were one of the main drivers of the redesign
> > > > (if I remember correctly). There were other reasons too as the changelog
> > > > of the commit describes.
> > > >
> > > > So, from my perspective, if there was a way to easily synchronize between
> > > > a data cleanup from module_exit callback and sysfs/kernfs operations, it
> > > > could spare people many headaches.
> > >
> > > kobject_del() is supposed to do so, but you can't hold a shared lock
> > > which is required in show()/store() method. Once kobject_del() returns,
> > > no pending show()/store() any more.
> > >
> > > The question is that why one shared lock is required for livepatching to
> > > delete the kobject. What are you protecting when you delete one kobject?
> >
> > I think it boils down to the fact that we embed kobject statically to
> > structures which livepatch uses to maintain data. That is discouraged
> > generally, but all the attempts to implement it correctly were utter
> > failures.
>
> Sounds like this is the real problem that needs to be fixed. kobjects
> should always control the lifespan of the structure they are embedded
> in. If not, then that is a design flaw of the user of the kobject :(
Right, and you've already told us. A couple of times.
For example
here https://lore.kernel.org/all/[email protected]/
:)
> Where in the kernel is this happening? And where have been the attempts
> to fix this up?
include/linux/livepatch.h and kernel/livepatch/core.c. See
klp_{patch,object,func}.
It took some archeology, but I think
https://lore.kernel.org/all/[email protected]/
is it. Petr might correct me.
It was long before we added some important features to the code, so it
might be even more difficult today.
It resurfaced later when Tobin tried to fix some of kobject call sites in
the kernel...
https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/
There are probably more references.
Anyway, the current code works fine (well, one could argue about that). If
someone wants to take a (another) stab at this, then why not, but it
seemed like a rabbit hole without a substantial gain in the past. On the
other hand, we currently misuse the API to some extent.
/me scratches head
Miroslav
On Wed 2021-10-20 18:09:51, Ming Lei wrote:
> On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> > On Wed, 20 Oct 2021, Ming Lei wrote:
> >
> > > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > >
> > > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > > random drivers.
> > > > > > >
> > > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > >
> > > > > > I would not call it a fix. It is a kind of ugly workaround because the
> > > > > > generic infrastructure lacked (lacks) the proper support in my opinion.
> > > > > > Luis is trying to fix that.
> > > > >
> > > > > What is the proper support of the generic infrastructure? I am not
> > > > > familiar with livepatching's model(especially with module unload), you mean
> > > > > livepatching have to do the following way from sysfs:
> > > > >
> > > > > 1) during module exit:
> > > > >
> > > > > mutex_lock(lp_lock);
> > > > > kobject_put(lp_kobj);
> > > > > mutex_unlock(lp_lock);
> > > > >
> > > > > 2) show()/store() method of attributes of lp_kobj
> > > > >
> > > > > mutex_lock(lp_lock)
> > > > > ...
> > > > > mutex_unlock(lp_lock)
> > > >
> > > > Yes, this was exactly the case. We then reworked it a lot (see
> > > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so
> > > > now the call sequence is different. kobject_put() is basically offloaded
> > > > to a workqueue scheduled right from the store() method. Meaning that
> > > > Luis's work would probably not help us currently, but on the other hand
> > > > the issues with AA deadlock were one of the main drivers of the redesign
> > > > (if I remember correctly). There were other reasons too as the changelog
> > > > of the commit describes.
> > > >
> > > > So, from my perspective, if there was a way to easily synchronize between
> > > > a data cleanup from module_exit callback and sysfs/kernfs operations, it
> > > > could spare people many headaches.
> > >
> > > kobject_del() is supposed to do so, but you can't hold a shared lock
> > > which is required in show()/store() method. Once kobject_del() returns,
> > > no pending show()/store() any more.
> > >
> > > The question is that why one shared lock is required for livepatching to
> > > delete the kobject. What are you protecting when you delete one kobject?
> >
> > I think it boils down to the fact that we embed kobject statically to
> > structures which livepatch uses to maintain data. That is discouraged
> > generally, but all the attempts to implement it correctly were utter
> > failures.
>
> OK, then it isn't one common usage, in which kobject covers the release
> of the external object. What is the exact kobject in livepatching?
Below are more details about the livepatch code. I hope that it will
help you to see if zram has similar problems or not.
We have kobject in three structures: klp_func, klp_object, and
klp_patch, see include/linux/livepatch.h.
These structures have to be statically defined in the module sources
because they define what is livepatched, see
samples/livepatch/livepatch-sample.c
The kobject is used there to show information about the patch, patched
objects, and patched functions, in sysfs. And most importantly,
the sysfs interface can be used to disable the livepatch.
The problem with static structures is that the module must stay
in the memory as long as the sysfs interface exists. It can be
solved in module_exit() callback. It could wait until the sysfs
interface is destroyed.
kobject API does not support this scenario. The release() callbacks
are called asynchronously. It expects that the structure is bundled
in a dynamically allocated structure. As a result, the sysfs
interface can be removed even after the module removal.
The livepatching might create the dynamic structures by duplicating
the structures defined in the module statically. It might save us
some headaches with kobject release. But it would also need extra code
that would need to be maintained. The structures contain strings
that need to be duplicated and later freed...
> But kobject_del() won't release the kobject, you shouldn't need the lock
> to delete kobject first. After the kobject is deleted, no any show() and
> store() any more, isn't such sync[1] you expected?
Livepatch code never called kobject_del() under a lock. It would cause
the obvious deadlock. The historic code only waited in the
module_exit() callback until the sysfs interface was removed.
It has changed in the commit 958ef1e39d24d6cb8bf2a740 ("livepatch:
Simplify API by removing registration step"). A livepatch can now
never be enabled again once it has been disabled. The sysfs interface
is removed when the livepatch gets disabled. The module can
be removed only after the sysfs interface is destroyed, see
the module_put() in klp_free_patch_finish().
The livepatch code uses workqueue because the livepatch can be
disabled via sysfs interface. It obviously could not wait until
the sysfs interface is removed in the sysfs write() callback
that triggered the removal.
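To make the ordering concrete, here is a minimal single-threaded
userspace sketch. It is not the livepatch code and uses no kernel API:
struct model_iface and every model_*() helper are hypothetical names
made up for illustration. It only models the rule that a removal must
drain running handlers, so a handler cannot remove its own interface
synchronously and has to hand the removal over to deferred work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sysfs file; not a kernel structure. */
struct model_iface {
	int active;                              /* handlers currently running */
	bool removed;                            /* interface has been torn down */
	void (*deferred)(struct model_iface *);  /* pending "work item", if any */
};

/*
 * Removal has to drain running handlers first. A real synchronous
 * removal would block while active > 0; the model reports failure.
 */
static bool model_remove(struct model_iface *f)
{
	if (f->active > 0)
		return false;
	f->removed = true;
	return true;
}

/* Deferred work: runs only after the handler that queued it returned. */
static void model_remove_work(struct model_iface *f)
{
	assert(f->active == 0);
	(void)model_remove(f);
}

/* store()-like handler that tries to remove its own interface directly. */
static bool model_store_direct(struct model_iface *f)
{
	bool ok;

	f->active++;
	ok = model_remove(f);   /* cannot succeed: we are the active handler */
	f->active--;
	return ok;
}

/* store()-like handler that only schedules the removal. */
static void model_store_deferred(struct model_iface *f)
{
	f->active++;
	f->deferred = model_remove_work;
	f->active--;
}

/* Worker context: executes the queued work item, if there is one. */
static void model_run_deferred(struct model_iface *f)
{
	void (*work)(struct model_iface *) = f->deferred;

	f->deferred = NULL;
	if (work)
		work(f);
}
```

In this model the direct variant can never complete the removal, while
the deferred variant completes it once model_run_deferred() is called
after the handler has returned.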
HTH,
Petr
On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> On Wed 2021-10-20 18:09:51, Ming Lei wrote:
> > On Wed, Oct 20, 2021 at 10:19:27AM +0200, Miroslav Benes wrote:
> > > On Wed, 20 Oct 2021, Ming Lei wrote:
> > >
> > > > On Wed, Oct 20, 2021 at 08:43:37AM +0200, Miroslav Benes wrote:
> > > > > On Tue, 19 Oct 2021, Ming Lei wrote:
> > > > >
> > > > > > On Tue, Oct 19, 2021 at 08:23:51AM +0200, Miroslav Benes wrote:
> > > > > > > > > By you only addressing the deadlock as a requirement on approach a) you are
> > > > > > > > > forgetting that there *may* already be present drivers which *do* implement
> > > > > > > > > such patterns in the kernel. I worked on addressing the deadlock because
> > > > > > > > > I was informed livepatching *did* have that issue as well and so very
> > > > > > > > > likely a generic solution to the deadlock could be beneficial to other
> > > > > > > > > random drivers.
> > > > > > > >
> > > > > > > > In-tree zram doesn't have such deadlock, if livepatching has such AA deadlock,
> > > > > > > > just fixed it, and seems it has been fixed by 3ec24776bfd0.
> > > > > > >
> > > > > > > I would not call it a fix. It is a kind of ugly workaround because the
> > > > > > > generic infrastructure lacked (lacks) the proper support in my opinion.
> > > > > > > Luis is trying to fix that.
> > > > > >
> > > > > > What is the proper support of the generic infrastructure? I am not
> > > > > > familiar with livepatching's model(especially with module unload), you mean
> > > > > > livepatching have to do the following way from sysfs:
> > > > > >
> > > > > > 1) during module exit:
> > > > > >
> > > > > > mutex_lock(lp_lock);
> > > > > > kobject_put(lp_kobj);
> > > > > > mutex_unlock(lp_lock);
> > > > > >
> > > > > > 2) show()/store() method of attributes of lp_kobj
> > > > > >
> > > > > > mutex_lock(lp_lock)
> > > > > > ...
> > > > > > mutex_unlock(lp_lock)
> > > > >
> > > > > Yes, this was exactly the case. We then reworked it a lot (see
> > > > > 958ef1e39d24 ("livepatch: Simplify API by removing registration step"), so
> > > > > now the call sequence is different. kobject_put() is basically offloaded
> > > > > to a workqueue scheduled right from the store() method. Meaning that
> > > > > Luis's work would probably not help us currently, but on the other hand
> > > > > the issues with AA deadlock were one of the main drivers of the redesign
> > > > > (if I remember correctly). There were other reasons too as the changelog
> > > > > of the commit describes.
> > > > >
> > > > > So, from my perspective, if there was a way to easily synchronize between
> > > > > a data cleanup from module_exit callback and sysfs/kernfs operations, it
> > > > > could spare people many headaches.
> > > >
> > > > kobject_del() is supposed to do so, but you can't hold a shared lock
> > > > which is required in show()/store() method. Once kobject_del() returns,
> > > > no pending show()/store() any more.
> > > >
> > > > The question is that why one shared lock is required for livepatching to
> > > > delete the kobject. What are you protecting when you delete one kobject?
> > >
> > > I think it boils down to the fact that we embed kobject statically to
> > > structures which livepatch uses to maintain data. That is discouraged
> > > generally, but all the attempts to implement it correctly were utter
> > > failures.
> >
> > OK, then it isn't one common usage, in which kobject covers the release
> > of the external object. What is the exact kobject in livepatching?
>
> Below are more details about the livepatch code. I hope that it will
> help you to see if zram has similar problems or not.
>
> We have kobject in three structures: klp_func, klp_object, and
> klp_patch, see include/linux/livepatch.h.
>
> These structures have to be statically defined in the module sources
> because they define what is livepatched, see
> samples/livepatch/livepatch-sample.c
>
> The kobject is used there to show information about the patch, patched
> objects, and patched functions, in sysfs. And most importantly,
> the sysfs interface can be used to disable the livepatch.
>
> The problem with static structures is that the module must stay
> in the memory as long as the sysfs interface exists. It can be
> solved in module_exit() callback. It could wait until the sysfs
> interface is destroyed.
>
> kobject API does not support this scenario. The relase() callbacks
kobject_delete() is for supporting this scenario, that is why we don't
need to grab module refcnt before calling show()/store() of the
kobject's attributes.
kobject_delete() can be called in module_exit(), then any show()/store()
will be done after kobject_delete() returns.
> are called asynchronously. It expects that the structure is bundled
> in a dynamically allocated structure. As a result, the sysfs
> interface can be removed even after the module removal.
That should be one bug, otherwise store()/show() method could be called
into after the module is unloaded.
>
> The livepatching might create the dynamic structures by duplicating
> the structures defined in the module statically. It might safe us
> some headaches with kobject release. But it would also need an extra code
> that would need to be maintained. The structure constrains strings
> than need to be duplicated and later freed...
>
>
> > But kobject_del() won't release the kobject, you shouldn't need the lock
> > to delete kobject first. After the kobject is deleted, no any show() and
> > store() any more, isn't such sync[1] you expected?
>
> Livepatch code never called kobject_del() under a lock. It would cause
> the obvious deadlock. The historic code only waited in the
> module_exit() callback until the sysfs interface was removed.
OK, then Luis shouldn't consider livepatching as one such issue to solve
with one generic solution.
>
> It has changed in the commit 958ef1e39d24d6cb8bf2a740 ("livepatch:
> Simplify API by removing registration step"). The livepatch could
> never get enabled again after it was disabled now. The sysfs interface
> is removed when the livepatch gets disabled. The module could
> be removed only after the sysfs interface is destroyed, see
> the module_put() in klp_free_patch_finish().
OK, that is livepatching's implementation: all the kobjects are deleted
and freed after disabling the livepatch module. That looks like a
kill-me operation rather than a disable, so this isn't normal usage;
scsi has a similar sysfs 'delete' interface. Also, kobjects can't be
removed directly in enable's store(), since that could deadlock, so it
looks like a wq has to be used here to avoid the deadlock.
BTW, what is the livepatching module usage model? try_module_get() is
called in klp_init_patch_early()<-klp_enable_patch()<-module_init(), and
module_put() is called in klp_free_patch_finish(), which seems to be
called only after 'echo 0 > /sys/kernel/livepatch/$lp_mod/enabled'.
Usually when the module isn't used, module_exit() gets a chance to be
called by userspace rmmod, then all kobjects created in this module can
be deleted in module_exit().
>
> The livepatch code uses workqueue because the livepatch can be
> disabled via sysfs interface. It obviously could not wait until
> the sysfs interface is removed in the sysfs write() callback
> that triggered the removal.
If klp_free_patch_* is moved into module_exit() and enable's store() no
longer kills the kobjects, all kobjects can be deleted in module_exit(),
then wait_for_completion(patch->finish) may be removed, and the wq isn't
required for the async cleanup either.
Thanks,
Ming
On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > Livepatch code never called kobject_del() under a lock. It would cause
> > the obvious deadlock.
Never?
> > The historic code only waited in the
> > module_exit() callback until the sysfs interface was removed.
>
> OK, then Luis shouldn't consider livepatching as one such issue to solve
> with one generic solution.
It's not what I was told when the deadlock was found with zram, so I was
informed quite the contrary.
I'm working on a generic coccinelle patch which hunts for actual cases
using iteration (a feature of coccinelle for complex searches). The
search is pretty involved, so I don't think I'll have an answer to this
soon.
Since the question of how generic this deadlock is remains open,
I think it makes sense to put the generic deadlock fix off the table for
now, and we address this once we have a more concrete search with
coccinelle.
But to say we *don't* have drivers which can cause this is obviously
wrong as well, from a cursory search so far. But let's wait and see how
big this list actually is.
I'll drop the deadlock generic fixes and move on with at least a starter
kernfs / sysfs tests.
Luis
> >
> > The livepatch code uses workqueue because the livepatch can be
> > disabled via sysfs interface. It obviously could not wait until
> > the sysfs interface is removed in the sysfs write() callback
> > that triggered the removal.
>
> If klp_free_patch_* is moved into module_exit() and not let enable
> store() to kill kobjects, all kobjects can be deleted in module_exit(),
> then wait_for_completion(patch->finish) may be removed, also wq isn't
> required for the async cleanup.
It sounds like a nice cleanup. If we combine kobject_del() to prevent any
show()/store() accesses and free everything later in module_exit(), it
could work. If I am not missing something around how we maintain internal
lists of live patches and their modules.
Thanks
Miroslav
On Tue, 26 Oct 2021, Luis Chamberlain wrote:
> On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > Livepatch code never called kobject_del() under a lock. It would cause
> > > the obvious deadlock.
>
> Never?
kobject_put() to be precise.
When I started working on the support for module/live patches removal,
calling kobject_put() under our klp_mutex lock was the obvious first
choice given how the code was structured, but I ran into problems with
deadlocks immediately. So it was changed to async approach with the
workqueue. Thus the mainline code has never suffered from this, but we
knew about the issues.
> > > The historic code only waited in the
> > > module_exit() callback until the sysfs interface was removed.
> >
> > OK, then Luis shouldn't consider livepatching as one such issue to solve
> > with one generic solution.
>
> It's not what I was told when the deadlock was found with zram, so I was
> informed quite the contrary.
From my perspective, it is quite easy to get it wrong due to either a lack
of generic support, or missing rules/documentation. So if this thread
leads to "do not share locks between a module removal and a sysfs
operation" strict rule, it would be at least something. In the same
manner as Luis proposed to document try_module_get() expectations.
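A rule like that can be illustrated with a small single-threaded
userspace model. This is only a sketch under stated assumptions, not
kernel code: struct model_state and the model_*() functions are
hypothetical, the "lock" is a plain flag, and waiting is replaced by a
bounded number of scheduling steps so that the stuck case returns
instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one shared lock and a count of in-flight handlers. */
struct model_state {
	bool lock_held;   /* shared lock, taken by both removal and handlers */
	int active;       /* show()/store()-like handlers that have started */
};

/*
 * One scheduling step for an in-flight handler. It can finish only if
 * it manages to take the shared lock; it then drops it and returns.
 */
static void model_handler_step(struct model_state *s)
{
	if (s->active > 0 && !s->lock_held)
		s->active--;
}

/* Wait for handlers to drain, giving up after max_steps. */
static bool model_drain(struct model_state *s, int max_steps)
{
	int i;

	assert(max_steps >= 0);
	for (i = 0; i < max_steps && s->active > 0; i++)
		model_handler_step(s);
	return s->active == 0;
}

/* Removal path that holds the shared lock while draining handlers. */
static bool model_exit_locked(struct model_state *s)
{
	bool drained;

	s->lock_held = true;
	drained = model_drain(s, 1000);  /* stuck if a handler is in flight */
	s->lock_held = false;
	return drained;
}

/* Removal path that drains first and takes the lock only afterwards. */
static bool model_exit_unlocked(struct model_state *s)
{
	bool drained = model_drain(s, 1000);

	s->lock_held = true;    /* remaining cleanup under the lock */
	s->lock_held = false;
	return drained;
}
```

With one handler in flight, model_exit_locked() never sees the handler
finish because the handler needs the lock the removal path is holding,
while model_exit_unlocked() drains it on the first step.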
> I'm working on a generic coccinelle patch which hunts for actual cases
> using iteration (a feature of coccinelle for complex searches). The
> search is pretty involved, so I don't think I'll have an answer to this
> soon.
>
> Since the question of how generic this deadlock is remains questionable,
> I think it makes sense to put the generic deadlock fix off the table for
> now, and we address this once we have a more concrete search with
> coccinelle.
>
> But to say we *don't* have drivers which can cause this is obviously
> wrong as well, from a cursory search so far. But let's wait and see how
> big this list actually is.
>
> I'll drop the deadlock generic fixes and move on with at least a starter
> kernfs / sysfs tests.
It makes sense to me.
Thanks, Luis, for pursuing it.
Miroslav
On Wed, Oct 27, 2021 at 01:57:40PM +0200, Miroslav Benes wrote:
> On Tue, 26 Oct 2021, Luis Chamberlain wrote:
>
> > On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> > > OK, then Luis shouldn't consider livepatching as one such issue to solve
> > > with one generic solution.
> >
> > It's not what I was told when the deadlock was found with zram, so I was
> > informed quite the contrary.
>
> From my perspective, it is quite easy to get it wrong due to either a lack
> of generic support, or missing rules/documentation.
Indeed. I agree some level of guidance is needed, even if subtle, rather
than tribal knowledge. I'll start off with the test_sysfs demo'ing what
not to do and documenting this there. I don't think it makes sense to
formalize documentation for "thou shalt not do this" generically yet,
until a full depth search is done with Coccinelle.
> So if this thread
> leads to "do not share locks between a module removal and a sysfs
> operation" strict rule, it would be at least something.
I think that's where we are at. I'll wait to complete my coccinelle
deadlock hunt patch to complete the full search, and that could be
useful to *warn* about new use cases, so as to prevent this deadlock
in the future. Until then I agree that the complexity introduced is
not worth it given the evidence of users, but the full evidence of
actual users still remains to be determined. A perfect job to leave to
advances with Coccinelle.
> In the same
> manner as Luis proposed to document try_module_get() expectations.
Right, and so sysfs ops using try_module_get() *still* remain safe,
so I will keep that patch in my next iteration because there *are*
*many* use cases for that.
Luis
On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > Below are more details about the livepatch code. I hope that it will
> > help you to see if zram has similar problems or not.
> >
> > We have kobject in three structures: klp_func, klp_object, and
> > klp_patch, see include/linux/livepatch.h.
> >
> > These structures have to be statically defined in the module sources
> > because they define what is livepatched, see
> > samples/livepatch/livepatch-sample.c
> >
> > The kobject is used there to show information about the patch, patched
> > objects, and patched functions, in sysfs. And most importantly,
> > the sysfs interface can be used to disable the livepatch.
> >
> > The problem with static structures is that the module must stay
> > in the memory as long as the sysfs interface exists. It can be
> > solved in module_exit() callback. It could wait until the sysfs
> > interface is destroyed.
> >
> > kobject API does not support this scenario. The relase() callbacks
>
> kobject_delete() is for supporting this scenario, that is why we don't
> need to grab module refcnt before calling show()/store() of the
> kobject's attributes.
>
> kobject_delete() can be called in module_exit(), then any show()/store()
> will be done after kobject_delete() returns.
I am a bit confused. I do not see kobject_delete() anywhere in kernel
sources.
I see only kobject_del() and kobject_put(). AFAIK, they do _not_
guarantee that either the sysfs interface was destroyed or
the release callbacks were called. For example, see
schedule_delayed_work(&kobj->release, delay) in kobject_release().
In other words, anyone could still be using either the sysfs interface
or the related structures after kobject_del() or kobject_put()
returns.
IMHO, kobject API does not support static structures and module
removal.
Best Regards,
Petr
On Tue 2021-11-02 15:15:19, Petr Mladek wrote:
> On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > Below are more details about the livepatch code. I hope that it will
> > > help you to see if zram has similar problems or not.
> > >
> > > We have kobject in three structures: klp_func, klp_object, and
> > > klp_patch, see include/linux/livepatch.h.
> > >
> > > These structures have to be statically defined in the module sources
> > > because they define what is livepatched, see
> > > samples/livepatch/livepatch-sample.c
> > >
> > > The kobject is used there to show information about the patch, patched
> > > objects, and patched functions, in sysfs. And most importantly,
> > > the sysfs interface can be used to disable the livepatch.
> > >
> > > The problem with static structures is that the module must stay
> > > in the memory as long as the sysfs interface exists. It can be
> > > solved in module_exit() callback. It could wait until the sysfs
> > > interface is destroyed.
> > >
> > > kobject API does not support this scenario. The relase() callbacks
> >
> > kobject_delete() is for supporting this scenario, that is why we don't
> > need to grab module refcnt before calling show()/store() of the
> > kobject's attributes.
> >
> > kobject_delete() can be called in module_exit(), then any show()/store()
> > will be done after kobject_delete() returns.
>
> I am a bit confused. I do not see kobject_delete() anywhere in kernel
> sources.
>
> I see only kobject_del() and kobject_put(). AFAIK, they do _not_
> guarantee that either the sysfs interface was destroyed or
> the release callbacks were called. For example, see
> schedule_delayed_work(&kobj->release, delay) in kobject_release().
Grr, I always get confused by the code. kobject_del() actually waits
until the sysfs interface gets destroyed. This is why there is
the deadlock.
But kobject_put() is _not_ synchronous. And the comment above
kobject_add() repeats three times that kobject_put() must be called,
including on success:
    * Return: If this function returns an error, kobject_put() must be
    *         called to properly clean up the memory associated with the
    *         object.  Under no instance should the kobject that is passed
    *         to this function be directly freed with a call to kfree(),
    *         that can leak memory.
    *
    *         If this function returns success, kobject_put() must also be called
    *         in order to properly clean up the memory associated with the object.
    *
    *         In short, once this function is called, kobject_put() MUST be called
    *         when the use of the object is finished in order to properly free
    *         everything.
and similar text in Documentation/core-api/kobject.rst
    After a kobject has been registered with the kobject core successfully, it
    must be cleaned up when the code is finished with it. To do that, call
    kobject_put().
If I read the code correctly then kobject_put() calls kref_put()
that might call kobject_delayed_cleanup(). This function does a lot
of things and needs to access struct kobject.
> IMHO, kobject API does not support static structures and module
> removal.
If kobject_put() has to be called also for static structures then
module_exit() must explicitly wait until the clean up is finished.
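The lifetime problem can be sketched in plain userspace C. This is a
simplified model under stated assumptions, not the kobject
implementation: struct model_kobj, struct model_patch and the model_*()
helpers are hypothetical, and the release callback here runs
synchronously on the last put, whereas the kernel may defer it further.
The point it shows is only that the last reference may be dropped by
someone other than module_exit(), so module_exit() has to check or wait
for the release before the static structure goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a reference-counted kobject. */
struct model_kobj {
	int refcount;
	void (*release)(struct model_kobj *kobj);
};

/* Hypothetical stand-in for a statically defined structure. */
struct model_patch {
	const char *name;
	bool released;            /* set by the release callback */
	struct model_kobj kobj;   /* embedded, like a kobject */
};

/* Recover the containing structure from the embedded member. */
static struct model_patch *to_model_patch(struct model_kobj *kobj)
{
	return (struct model_patch *)((char *)kobj -
				      offsetof(struct model_patch, kobj));
}

static void model_patch_release(struct model_kobj *kobj)
{
	to_model_patch(kobj)->released = true;
}

static void model_kobj_init(struct model_kobj *kobj,
			    void (*release)(struct model_kobj *))
{
	kobj->refcount = 1;
	kobj->release = release;
}

static void model_kobj_get(struct model_kobj *kobj)
{
	kobj->refcount++;
}

/* Drop a reference; the release callback runs only on the last put. */
static void model_kobj_put(struct model_kobj *kobj)
{
	assert(kobj->refcount > 0);
	if (--kobj->refcount == 0)
		kobj->release(kobj);
}

/* What a module_exit()-like path would have to wait for. */
static bool model_safe_to_unload(const struct model_patch *patch)
{
	return patch->released;
}
```

If another user still holds a reference when the exit path does its
put, model_safe_to_unload() stays false until that user drops its
reference as well.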
Best Regards,
Petr
On Tue, Nov 02, 2021 at 03:15:15PM +0100, Petr Mladek wrote:
> On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > Below are more details about the livepatch code. I hope that it will
> > > help you to see if zram has similar problems or not.
> > >
> > > We have kobject in three structures: klp_func, klp_object, and
> > > klp_patch, see include/linux/livepatch.h.
> > >
> > > These structures have to be statically defined in the module sources
> > > because they define what is livepatched, see
> > > samples/livepatch/livepatch-sample.c
> > >
> > > The kobject is used there to show information about the patch, patched
> > > objects, and patched functions, in sysfs. And most importantly,
> > > the sysfs interface can be used to disable the livepatch.
> > >
> > > The problem with static structures is that the module must stay
> > > in the memory as long as the sysfs interface exists. It can be
> > > solved in module_exit() callback. It could wait until the sysfs
> > > interface is destroyed.
> > >
> > > kobject API does not support this scenario. The relase() callbacks
> >
> > kobject_delete() is for supporting this scenario, that is why we don't
> > need to grab module refcnt before calling show()/store() of the
> > kobject's attributes.
> >
> > kobject_delete() can be called in module_exit(), then any show()/store()
> > will be done after kobject_delete() returns.
>
> I am a bit confused. I do not see kobject_delete() anywhere in kernel
> sources.
>
> I see only kobject_del() and kobject_put(). AFAIK, they do _not_
> guarantee that either the sysfs interface was destroyed or
> the release callbacks were called. For example, see
> schedule_delayed_work(&kobj->release, delay) in kobject_release().
After kobject_del() returns, no one can run into show()/store() any
more, and all pending show()/store() calls have been drained by then.
But yes, the release handler may still be called later, and the kobject
has to be freed during or before module_exit().
https://lore.kernel.org/lkml/[email protected]/
>
> By other words, anyone could still be using either the sysfs interface
> or the related structures after kobject_del() or kobject_put()
> returns.
No, no one can do that after kobject_del() returns.
>
> IMHO, kobject API does not support static structures and module
> removal.
But so far klp_patch can only be defined as a static instance, and it
depends on the implementation, especially the release handler.
Thanks,
Ming
On Tue, Nov 02, 2021 at 03:51:33PM +0100, Petr Mladek wrote:
> On Tue 2021-11-02 15:15:19, Petr Mladek wrote:
> > On Tue 2021-10-26 23:37:30, Ming Lei wrote:
> > > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > > Below are more details about the livepatch code. I hope that it will
> > > > help you to see if zram has similar problems or not.
> > > >
> > > > We have kobject in three structures: klp_func, klp_object, and
> > > > klp_patch, see include/linux/livepatch.h.
> > > >
> > > > These structures have to be statically defined in the module sources
> > > > because they define what is livepatched, see
> > > > samples/livepatch/livepatch-sample.c
> > > >
> > > > The kobject is used there to show information about the patch, patched
> > > > objects, and patched functions, in sysfs. And most importantly,
> > > > the sysfs interface can be used to disable the livepatch.
> > > >
> > > > The problem with static structures is that the module must stay
> > > > in the memory as long as the sysfs interface exists. It can be
> > > > solved in module_exit() callback. It could wait until the sysfs
> > > > interface is destroyed.
> > > >
> > > > kobject API does not support this scenario. The release() callbacks
> > >
> > > kobject_delete() is for supporting this scenario, that is why we don't
> > > need to grab module refcnt before calling show()/store() of the
> > > kobject's attributes.
> > >
> > > kobject_delete() can be called in module_exit(), then any show()/store()
> > > will be done after kobject_delete() returns.
> >
> > I am a bit confused. I do not see kobject_delete() anywhere in kernel
> > sources.
> >
> > I see only kobject_del() and kobject_put(). AFAIK, they do _not_
> > guarantee that either the sysfs interface was destroyed or
> > the release callbacks were called. For example, see
> > schedule_delayed_work(&kobj->release, delay) in kobject_release().
>
> Grr, I always get confused by the code. kobject_del() actually waits
> until the sysfs interface gets destroyed. This is why there is
> the deadlock.
Right.
>
> But kobject_put() is _not_ synchronous. And the comment above
> kobject_add() repeats 3 times that kobject_put() must be called
> on success:
>
> * Return: If this function returns an error, kobject_put() must be
> * called to properly clean up the memory associated with the
> * object. Under no instance should the kobject that is passed
> * to this function be directly freed with a call to kfree(),
> * that can leak memory.
> *
> * If this function returns success, kobject_put() must also be called
> * in order to properly clean up the memory associated with the object.
> *
> * In short, once this function is called, kobject_put() MUST be called
> * when the use of the object is finished in order to properly free
> * everything.
>
> and similar text in Documentation/core-api/kobject.rst
>
> After a kobject has been registered with the kobject core successfully, it
> must be cleaned up when the code is finished with it. To do that, call
> kobject_put().
>
>
> If I read the code correctly then kobject_put() calls kref_put()
> that might call kobject_delayed_cleanup(). This function does a lot
> of things and needs to access struct kobject.
Yes, then what is the problem here wrt. kobject_put() which may not be
synchronous?
>
> > IMHO, kobject API does not support static structures and module
> > removal.
>
> If kobject_put() has to be called also for static structures then
> module_exit() must explicitly wait until the clean up is finished.
Right, that is exactly how klp_patch kobject is implemented. klp_patch
kobject has to be disabled first, then module refcnt can be dropped after
the klp_patch kobject is released. Then module_exit() is possible.
Thanks,
Ming
On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> On Tue, 26 Oct 2021, Luis Chamberlain wrote:
>
> > On Tue, Oct 26, 2021 at 11:37:30PM +0800, Ming Lei wrote:
> > > On Tue, Oct 26, 2021 at 10:48:18AM +0200, Petr Mladek wrote:
> > > > Livepatch code never called kobject_del() under a lock. It would cause
> > > > the obvious deadlock.
I have to correct myself. IMHO, the deadlock is far from obvious. I
always get lost in the code and the documentation is not clear.
> >
> > Never?
>
> kobject_put() to be precise.
IMHO, the problem is actually with kobject_del() that gets blocked
until the sysfs interface gets removed. kobject_put() will have
the same problem only when the clean up is not delayed.
> When I started working on the support for module/live patches removal,
> calling kobject_put() under our klp_mutex lock was the obvious first
> choice given how the code was structured, but I ran into problems with
> deadlocks immediately. So it was changed to async approach with the
> workqueue. Thus the mainline code has never suffered from this, but we
> knew about the issues.
>
> > > > The historic code only waited in the
> > > > module_exit() callback until the sysfs interface was removed.
> > >
> > > OK, then Luis shouldn't consider livepatching as one such issue to solve
> > > with one generic solution.
> >
> > It's not what I was told when the deadlock was found with zram, so I was
> > informed quite the contrary.
>
> From my perspective, it is quite easy to get it wrong due to either a lack
> of generic support, or missing rules/documentation. So if this thread
> leads to "do not share locks between a module removal and a sysfs
> operation" strict rule, it would be at least something. In the same
> manner as Luis proposed to document try_module_get() expectations.
The rule "do not share locks between a module removal and a sysfs
operation" is not clear to me.
IMHO, there are the following rules:

1. rule: kobject_del() or kobject_put() must not be called under a lock that
   is used by store()/show() callbacks.

   reason: kobject_del() waits until the sysfs interface is destroyed.
   It has to wait until all store()/show() callbacks are finished.

2. rule: kobject_del()/kobject_put() must not be called from the
   related store() callbacks.

   reason: same as in 1st rule.

3. rule: module_exit() must wait until all release() callbacks are called
   when kobjects are static.

   reason: kobject_put() must be called to clean up internal
   dependencies. The clean up might be done asynchronously
   and needs access to the kobject structure.
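
To make rule 1 concrete, here is a minimal userspace sketch. It is only a
toy model, not kernel code: struct model and the model_*() helpers are
hypothetical names, the lock is a plain flag, and "active_ops" merely stands
in for the in-flight store()/show() calls that kobject_del() has to wait for.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: none of these names are kernel APIs. */
struct model {
	bool lock_held;		/* lock shared by store() and the removal path */
	int active_ops;		/* store()/show() calls entered but not returned */
};

/* A store() call enters; the sysfs layer counts it as active. */
static void model_op_enter(struct model *m)
{
	m->active_ops++;
}

/* store() needs the shared lock to finish; returns false if it is stuck. */
static bool model_op_finish(struct model *m)
{
	assert(m->active_ops > 0);
	if (m->lock_held)
		return false;
	m->active_ops--;
	return true;
}

/* Stands in for kobject_del(): may return only once active_ops is zero. */
static bool model_del_can_return(const struct model *m)
{
	return m->active_ops == 0;
}

/* Removal runs the del step while holding the shared lock (breaks rule 1). */
static bool model_del_under_lock_deadlocks(void)
{
	struct model m = { .lock_held = false, .active_ops = 0 };

	model_op_enter(&m);	/* store() is in flight */
	m.lock_held = true;	/* removal path takes the lock first */
	/* store() cannot finish, so the del step can never return. */
	return !model_op_finish(&m) && !model_del_can_return(&m);
}

/* Removal runs the del step without holding the shared lock. */
static bool model_del_outside_lock_deadlocks(void)
{
	struct model m = { .lock_held = false, .active_ops = 0 };

	model_op_enter(&m);
	/* The lock is free, so store() drains and the del step can return. */
	return !model_op_finish(&m) && !model_del_can_return(&m);
}
```

In this model the first scenario reports a deadlock and the second does not.
The only difference between them is whether the removal path holds the
shared lock while it waits for the callbacks to drain.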
Best Regards,
Petr
PS: I am sorry if I am messing things up. I want to be sure that we are
all talking about the same thing and understand it the same way.
On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
> On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> > From my perspective, it is quite easy to get it wrong due to either a lack
> > of generic support, or missing rules/documentation. So if this thread
> > leads to "do not share locks between a module removal and a sysfs
> > operation" strict rule, it would be at least something. In the same
> > manner as Luis proposed to document try_module_get() expectations.
>
> The rule "do not share locks between a module removal and a sysfs
> operation" is not clear to me.
That's exactly it. It *is* not. The test_sysfs selftest will hopefully
help with this. But I'll wait to take a final position on whether or not
a generic fix should be merged until the Coccinelle patch which looks
for all use cases completes.
So I think that once that Coccinelle hunt is done for the deadlock, we
should also remind folks of the potential deadlock and some of the rules
you mentioned below so that if we take a position that we don't support
this, we at least inform developers why and what to avoid. If Coccinelle
finds quite a few cases, then perhaps the generic fix
might be worth evaluating.
> IMHO, there are the following rules:
>
> 1. rule: kobject_del() or kobject_put() must not be called under a lock that
> is used by store()/show() callbacks.
>
> reason: kobject_del() waits until the sysfs interface is destroyed.
> It has to wait until all store()/show() callbacks are finished.
Right, this is what actually started this entire conversation.
Note that as Ming pointed out, the generic kernfs fix I proposed would
only cover the case when kobject_del() ends up being called on module
exit, so it would not cover the cases where perhaps kobject_del() might
be called outside of module exit, and so the scope of the possible
deadlock then increases.
Likewise, the Coccinelle hunt I'm trying would only cover the module
exit case. I'm a bit afraid of the complexity of a generic hunt
as expressed in rule 1.
>
> 2. rule: kobject_del()/kobject_put() must not be called from the
> related store() callbacks.
>
> reason: same as in 1st rule.
Sensible corollary.
Given that the exact kobject_del() / kobject_put() which must not be
called from the respective sysfs ops depends on which kobject is
underneath the device for which the sysfs ops is being created,
it would make this hunt in Coccinelle a bit tricky. My current iteration
of a Coccinelle hunt cheats and looks at any sysfs-looking op and
ensures a module exit exists.
> 3. rule: module_exit() must wait until all release() callbacks are called
> when kobject are static.
>
> reason: kobject_put() must be called to clean up internal
> dependencies. The clean up might be done asynchronously
> and need access to the kobject structure.
This might be an easier rule to implement a respective Coccinelle rule
for.
Luis
On Tue, Nov 02, 2021 at 09:25:44AM -0700, Luis Chamberlain wrote:
> On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
> > On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> > > From my perspective, it is quite easy to get it wrong due to either a lack
> > > of generic support, or missing rules/documentation. So if this thread
> > > leads to "do not share locks between a module removal and a sysfs
> > > operation" strict rule, it would be at least something. In the same
> > > manner as Luis proposed to document try_module_get() expectations.
> >
> > The rule "do not share locks between a module removal and a sysfs
> > operation" is not clear to me.
>
> That's exactly it. It *is* not. The test_sysfs selftest will hopefully
> help with this. But I'll wait to take a final position on whether or not
> a generic fix should be merged until the Coccinelle patch which looks
> for all use cases completes.
>
> So I think that once that Coccinelle hunt is done for the deadlock, we
> should also remind folks of the potential deadlock and some of the rules
> you mentioned below so that if we take a position that we don't support
> this, we at least inform developers why and what to avoid. If Coccinelle
> finds quite a bit of cases, then perhaps evaluating the generic fix
> might be worth evaluating.
>
> > IMHO, there are the following rules:
> >
> > 1. rule: kobject_del() or kobject_put() must not be called under a lock that
> > is used by store()/show() callbacks.
> >
> > reason: kobject_del() waits until the sysfs interface is destroyed.
> > It has to wait until all store()/show() callbacks are finished.
>
> Right, this is what actually started this entire conversation.
>
> Note that as Ming pointed out, the generic kernfs fix I proposed would
> only cover the case when kobject_del() ends up being called on module
> exit, so it would not cover the cases where perhaps kobject_del() might
> be called outside of module exit, and so the scope of the possible
> deadlock then increases.
>
> Likewise, the Coccinelle hunt I'm trying would only cover the module
> exit case. I'm a bit afraid of the complexity of a generic hunt
> as expressed in rule 1.
The question is why a shared lock is required between kobject_del()
and its show()/store(); neither zram nor livepatch needs that. Is it
a common usage?
>
> >
> > 2. rule: kobject_del()/kobject_put() must not be called from the
> > related store() callbacks.
> >
> > reason: same as in 1st rule.
>
> Sensible corollary.
>
> Given that the exact kobject_del() / kobject_put() which must not be
> called from the respective sysfs ops depends on which kobject is
> underneath the device for which the sysfs ops is being created,
> it would make this hunt in Coccinelle a bit tricky. My current iteration
> of a coccinelle hunt cheats and looks at any sysfs looking op and
> ensures a module exit exists.
Actually kernfs/sysfs provides an interface for deleting a
kobject/attr from the attr's show()/store(), see the example of
sdev_store_delete(), and the livepatch example:
https://lore.kernel.org/lkml/[email protected]/
>
> > 3. rule: module_exit() must wait until all release() callbacks are called
> > when kobject are static.
> >
> > reason: kobject_put() must be called to clean up internal
> > dependencies. The clean up might be done asynchronously
> > and need access to the kobject structure.
>
> This might be an easier rule to implement a respective Coccinelle rule
> for.
If kobject_del() is done in module_exit() or before module_exit(), the
kobject should have been freed in module_exit() via kobject_put().
But yes, the release can be asynchronous because of
CONFIG_DEBUG_KOBJECT_RELEASE, which seems like a real issue.
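
A rough userspace sketch of that asynchronous case, assuming a toy model
where the last put only queues release() and a later worker step runs it.
struct model_kobj and the model_*() helpers are hypothetical names, not
kernel APIs. The point is only the ordering that rule 3 asks for.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: none of these names are kernel APIs. */
struct model_kobj {
	int refcount;
	bool release_pending;	/* release() queued but not yet run */
	bool released;		/* release() has run */
};

/* Models kobject_put() with delayed release: the last put only queues it. */
static void model_put(struct model_kobj *k)
{
	assert(k->refcount > 0);
	if (--k->refcount == 0)
		k->release_pending = true;
}

/* Models the deferred work that eventually runs release(). */
static void model_run_pending_release(struct model_kobj *k)
{
	if (k->release_pending) {
		k->release_pending = false;
		k->released = true;
	}
}

/* A static object's memory may go away only after release() has run. */
static bool model_exit_may_unload(const struct model_kobj *k)
{
	return k->released;
}
```

With one reference held, model_put() alone leaves model_exit_may_unload()
false. It only becomes true after model_run_pending_release(), which is the
wait that module_exit() would have to perform for a static kobject.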
Thanks,
Ming
On Wed, Nov 03, 2021 at 08:01:45AM +0800, Ming Lei wrote:
> On Tue, Nov 02, 2021 at 09:25:44AM -0700, Luis Chamberlain wrote:
> > On Tue, Nov 02, 2021 at 04:24:06PM +0100, Petr Mladek wrote:
> > > On Wed 2021-10-27 13:57:40, Miroslav Benes wrote:
> > > > From my perspective, it is quite easy to get it wrong due to either a lack
> > > > of generic support, or missing rules/documentation. So if this thread
> > > > leads to "do not share locks between a module removal and a sysfs
> > > > operation" strict rule, it would be at least something. In the same
> > > > manner as Luis proposed to document try_module_get() expectations.
> > >
> > > The rule "do not share locks between a module removal and a sysfs
> > > operation" is not clear to me.
> >
> > That's exactly it. It *is* not. The test_sysfs selftest will hopefully
> > help with this. But I'll wait to take a final position on whether or not
> > a generic fix should be merged until the Coccinelle patch which looks
> > for all use cases completes.
> >
> > So I think that once that Coccinelle hunt is done for the deadlock, we
> > should also remind folks of the potential deadlock and some of the rules
> > you mentioned below so that if we take a position that we don't support
> > this, we at least inform developers why and what to avoid. If Coccinelle
> > finds quite a bit of cases, then perhaps evaluating the generic fix
> > might be worth evaluating.
> >
> > > IMHO, there are the following rules:
> > >
> > > 1. rule: kobject_del() or kobject_put() must not be called under a lock that
> > > is used by store()/show() callbacks.
> > >
> > > reason: kobject_del() waits until the sysfs interface is destroyed.
> > > It has to wait until all store()/show() callbacks are finished.
> >
> > Right, this is what actually started this entire conversation.
> >
> > Note that as Ming pointed out, the generic kernfs fix I proposed would
> > only cover the case when kobject_del() ends up being called on module
> > exit, so it would not cover the cases where perhaps kobject_del() might
> > be called outside of module exit, and so the scope of the possible
> > deadlock then increases.
> >
> > Likewise, the Coccinelle hunt I'm trying would only cover the module
> > exit case. I'm a bit afraid of the complexity of a generic hunt
> > as expressed in rule 1.
>
> Question is that why one shared lock is required between kobject_del()
> and its show()/store(), both zram and livepatch needn't that. Is it
> one common usage?
That is the question the Coccinelle hunt is aimed at. Answering
that in the context of module removal is easier than the generic case.
But also note that I had mentioned before that we have semantics to
check *when* we're in the module removal case, and as such can address
that case. For the other cases we have no possible semantics to be able to
address a generic fix. I tried though, refer to my reply in this
thread and refer to the new kobject_being_removed() I'm adding:
https://lkml.kernel.org/r/[email protected]
So we have semantics for knowing when we are about to remove a module, but
my attempt with kobject_being_removed() isn't sufficient to address this
generically.
In either case, having a gauge of how common this is either on module
removal or generally would be wonderful. It is easier to answer the
question from a module removal perspective though.
> > > 2. rule: kobject_del()/kobject_put() must not be called from the
> > > related store() callbacks.
> > >
> > > reason: same as in 1st rule.
> >
> > Sensible corollary.
> >
> > Given that the exact kobject_del() / kobject_put() which must not be
> > called from the respective sysfs ops depends on which kobject is
> > underneath the device for which the sysfs ops is being created,
> > it would make this hunt in Coccinelle a bit tricky. My current iteration
> > of a coccinelle hunt cheats and looks at any sysfs looking op and
> > ensures a module exit exists.
>
> Actually kernfs/sysfs provides interface for supporting deleting
> kobject/attr from the attr's show()/store(), see example of
> sdev_store_delete(), and the livepatch example:
>
> https://lore.kernel.org/lkml/[email protected]/
Imagine that.. is that the suicidal thing?
> > > 3. rule: module_exit() must wait until all release() callbacks are called
> > > when kobject are static.
> > >
> > > reason: kobject_put() must be called to clean up internal
> > > dependencies. The clean up might be done asynchronously
> > > and need access to the kobject structure.
> >
> > This might be an easier rule to implement a respective Coccinelle rule
> > for.
>
> If kobject_del() is done in module_exit() or before module_exit(),
> kobject should have been freed in module_exit() via kobject_put().
>
> But yes, it can be asynchronously because of CONFIG_DEBUG_KOBJECT_RELEASE,
> seems like one real issue.
Alright thanks for confirming.
Luis