When sysfs attributes use a lock also used on module removal we can
race to deadlock. This happens when for instance a sysfs file on
a driver is used, then at the same time we have module removal call
trigger. The module removal call code holds a lock, and then the sysfs
file entry waits for the same lock. While holding the lock the module
removal tries to remove the sysfs entries, but these cannot be removed
yet as one is waiting for a lock. This won't complete as the lock is
already held. Likewise module removal cannot complete, and so we deadlock.
This can now be easily reproducible with our sysfs selftest as follows:
./tools/testing/selftests/sysfs/sysfs.sh -t 0027
To fix this we extend the struct kernfs_node with a module reference and
use the try_module_get() after kernfs_get_active() is called which
protects integrity and the existence of the kernfs node during the
operation.
So long as the kernfs node is protected with kernfs_get_active() we know
we can rely on its contents. And, as now just documented in the previous
patch, we also now know that once kernfs_get_active() is called the module
is also guarded to exist and cannot be removed.
If try_module_get() fails we fail the operation on the kernfs node.
We use a try method as a full lock means we'd then make our sysfs
attributes busy us out from possible module removal, and so userspace
could force denying module removal, a silly form of "DOS" against module
removal. A try lock on the module removal ensures we give priority to
module removal and interacting with sysfs attributes only comes second.
Using a full lock could mean for instance that if you don't stop poking
at sysfs files you cannot remove a module.
Races between removal of sysfs files and the module are not possible
given sysfs files are created by the same module, and when a sysfs file
is being used kernfs prevents removal of the sysfs file. So if module
removal is actually happening the removal would have to wait until
the sysfs file operation is complete.
This deadlock was first reported with the zram driver, however the live
patching folks have acknowledged they have observed this as well with
live patching, when a live patch is removed. I was then able to
reproduce easily by creating a dedicated selftests.
A sketch of how this can happen follows:
CPU A CPU B
whatever_store()
module_unload
mutex_lock(foo)
mutex_lock(foo)
del_gendisk(zram->disk);
device_del()
device_remove_groups()
In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the sysfs file being poked
completes which is waiting for a lock already held.
Signed-off-by: Luis Chamberlain <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
fs/kernfs/dir.c | 41 ++++++++++++++++++--
fs/kernfs/file.c | 7 +++-
fs/kernfs/kernfs-internal.h | 1 +
fs/kernfs/symlink.c | 3 +-
fs/sysfs/dir.c | 4 +-
fs/sysfs/file.c | 10 +++--
fs/sysfs/group.c | 1 +
include/linux/kernfs.h | 11 ++++--
include/linux/sysfs.h | 52 ++++++++++++++++++++------
kernel/cgroup/cgroup.c | 2 +-
11 files changed, 106 insertions(+), 30 deletions(-)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b57b3db9a6a7..d40c486f46b9 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -207,7 +207,7 @@ static int rdtgroup_add_file(struct kernfs_node *parent_kn, struct rftype *rft)
struct kernfs_node *kn;
int ret;
- kn = __kernfs_create_file(parent_kn, rft->name, rft->mode,
+ kn = __kernfs_create_file(parent_kn, rft->name, rft->mode, NULL,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
0, rft->kf_ops, rft, NULL, NULL);
if (IS_ERR(kn))
@@ -2480,7 +2480,7 @@ static int mon_addfile(struct kernfs_node *parent_kn, const char *name,
struct kernfs_node *kn;
int ret = 0;
- kn = __kernfs_create_file(parent_kn, name, 0444,
+ kn = __kernfs_create_file(parent_kn, name, 0444, NULL,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0,
&kf_mondata_ops, priv, NULL, NULL);
if (IS_ERR(kn))
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..f463878a5bf4 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -14,6 +14,7 @@
#include <linux/slab.h>
#include <linux/security.h>
#include <linux/hash.h>
+#include <linux/module.h>
#include "kernfs-internal.h"
@@ -414,12 +415,32 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
*/
struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
{
+ int v;
+
if (unlikely(!kn))
return NULL;
if (!atomic_inc_unless_negative(&kn->active))
return NULL;
+ /*
+ * If a module created the kernfs_node, the module cannot possibly be
+ * removed if the above atomic_inc_unless_negative() succeeded. So the
+ * try_module_get() below is not to protect the lifetime of the module
+ * as that is already guaranteed. The try_module_get() below is used
+ * to ensure that we don't deadlock in case a kernfs operation and
+ * module removal used a shared lock.
+ *
+ * We use a try method to avoid userspace DOS'ing module removal with
+ * access to kernfs ops.
+ */
+ if (!try_module_get(kn->owner)) {
+ v = atomic_dec_return(&kn->active);
+ if (unlikely(v == KN_DEACTIVATED_BIAS))
+ wake_up_all(&kernfs_root(kn)->deactivate_waitq);
+ return NULL;
+ }
+
if (kernfs_lockdep(kn))
rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
return kn;
@@ -442,6 +463,13 @@ void kernfs_put_active(struct kernfs_node *kn)
if (kernfs_lockdep(kn))
rwsem_release(&kn->dep_map, _RET_IP_);
v = atomic_dec_return(&kn->active);
+
+ /*
+ * We prevent module exit *until* we know for sure all possible
+ * kernfs ops are done.
+ */
+ module_put(kn->owner);
+
if (likely(v != KN_DEACTIVATED_BIAS))
return;
@@ -571,6 +599,7 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
unsigned flags)
{
@@ -606,6 +635,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
kn->name = name;
kn->mode = mode;
+ kn->owner = owner;
kn->flags = flags;
if (!uid_eq(uid, GLOBAL_ROOT_UID) || !gid_eq(gid, GLOBAL_ROOT_GID)) {
@@ -639,13 +669,14 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
unsigned flags)
{
struct kernfs_node *kn;
kn = __kernfs_new_node(kernfs_root(parent), parent,
- name, mode, uid, gid, flags);
+ name, mode, owner, uid, gid, flags);
if (kn) {
kernfs_get(parent);
kn->parent = parent;
@@ -926,7 +957,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
root->id_highbits = 1;
kn = __kernfs_new_node(root, NULL, "", S_IFDIR | S_IRUGO | S_IXUGO,
- GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
+ NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
KERNFS_DIR);
if (!kn) {
idr_destroy(&root->ino_idr);
@@ -965,6 +996,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
* @parent: parent in which to create a new directory
* @name: name of the new directory
* @mode: mode of the new directory
+ * @owner: if set, the module owner of this directory
* @uid: uid of the new directory
* @gid: gid of the new directory
* @priv: opaque data associated with the new directory
@@ -974,6 +1006,7 @@ void kernfs_destroy_root(struct kernfs_root *root)
*/
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
void *priv, const void *ns)
{
@@ -982,7 +1015,7 @@ struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
/* allocate */
kn = kernfs_new_node(parent, name, mode | S_IFDIR,
- uid, gid, KERNFS_DIR);
+ owner, uid, gid, KERNFS_DIR);
if (!kn)
return ERR_PTR(-ENOMEM);
@@ -1014,7 +1047,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
/* allocate */
kn = kernfs_new_node(parent, name, S_IRUGO|S_IXUGO|S_IFDIR,
- GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
+ NULL, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, KERNFS_DIR);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4479c6580333..73be39ddc856 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -974,6 +974,7 @@ const struct file_operations kernfs_file_fops = {
* @uid: uid of the file
* @gid: gid of the file
* @size: size of the file
+ * @owner: if set, the module owner of the file
* @ops: kernfs operations for the file
* @priv: private data for the file
* @ns: optional namespace tag of the file
@@ -983,7 +984,9 @@ const struct file_operations kernfs_file_fops = {
*/
struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode,
+ struct module *owner,
+ kuid_t uid, kgid_t gid,
loff_t size,
const struct kernfs_ops *ops,
void *priv, const void *ns,
@@ -996,7 +999,7 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
flags = KERNFS_FILE;
kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
- uid, gid, flags);
+ owner, uid, gid, flags);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 9e3abf597e2d..3551f62f1449 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -133,6 +133,7 @@ void kernfs_put_active(struct kernfs_node *kn);
int kernfs_add_one(struct kernfs_node *kn);
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
unsigned flags);
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 19a6c71c6ff5..1d781970bec7 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -36,7 +36,8 @@ struct kernfs_node *kernfs_create_link(struct kernfs_node *parent,
gid = target->iattr->ia_gid;
}
- kn = kernfs_new_node(parent, name, S_IFLNK|0777, uid, gid, KERNFS_LINK);
+ kn = kernfs_new_node(parent, name, S_IFLNK|0777, target->owner, uid,
+ gid, KERNFS_LINK);
if (!kn)
return ERR_PTR(-ENOMEM);
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index b6b6796e1616..01a0a53ee18e 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -56,8 +56,8 @@ int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
kobject_get_ownership(kobj, &uid, &gid);
- kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, uid, gid,
- kobj, ns);
+ kn = kernfs_create_dir_ns(parent, kobject_name(kobj), 0755, NULL, uid,
+ gid, kobj, ns);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, kobject_name(kobj));
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96881b6..5edc6e180a97 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -291,8 +291,9 @@ int sysfs_add_file_mode_ns(struct kernfs_node *parent,
key = attr->key ?: (struct lock_class_key *)&attr->skey;
#endif
- kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
- PAGE_SIZE, ops, (void *)attr, ns, key);
+ kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner,
+ uid, gid, PAGE_SIZE, ops, (void *)attr, ns,
+ key);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, attr->name);
@@ -326,8 +327,9 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
key = attr->key ?: (struct lock_class_key *)&attr->skey;
#endif
- kn = __kernfs_create_file(parent, attr->name, mode & 0777, uid, gid,
- battr->size, ops, (void *)attr, ns, key);
+ kn = __kernfs_create_file(parent, attr->name, mode & 0777, attr->owner,
+ uid, gid, battr->size, ops, (void *)attr, ns,
+ key);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
sysfs_warn_dup(parent, attr->name);
diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index eeb0e3099421..283607c22eef 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -135,6 +135,7 @@ static int internal_create_group(struct kobject *kobj, int update,
} else {
kn = kernfs_create_dir_ns(kobj->sd, grp->name,
S_IRWXU | S_IRUGO | S_IXUGO,
+ grp->owner,
uid, gid, kobj, NULL);
if (IS_ERR(kn)) {
if (PTR_ERR(kn) == -EEXIST)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index cd968ee2b503..8655abd53c85 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -160,6 +160,7 @@ struct kernfs_node {
unsigned short flags;
umode_t mode;
+ struct module *owner;
struct kernfs_iattrs *iattr;
};
@@ -369,12 +370,14 @@ void kernfs_destroy_root(struct kernfs_root *root);
struct kernfs_node *kernfs_create_dir_ns(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
void *priv, const void *ns);
struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
const char *name);
struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
const char *name, umode_t mode,
+ struct module *owner,
kuid_t uid, kgid_t gid,
loff_t size,
const struct kernfs_ops *ops,
@@ -471,13 +474,15 @@ static inline void kernfs_destroy_root(struct kernfs_root *root) { }
static inline struct kernfs_node *
kernfs_create_dir_ns(struct kernfs_node *parent, const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode, struct module *owner,
+ kuid_t uid, kgid_t gid,
void *priv, const void *ns)
{ return ERR_PTR(-ENOSYS); }
static inline struct kernfs_node *
__kernfs_create_file(struct kernfs_node *parent, const char *name,
- umode_t mode, kuid_t uid, kgid_t gid,
+ umode_t mode, struct module *owner,
+ kuid_t uid, kgid_t gid,
loff_t size, const struct kernfs_ops *ops,
void *priv, const void *ns, struct lock_class_key *key)
{ return ERR_PTR(-ENOSYS); }
@@ -564,7 +569,7 @@ static inline struct kernfs_node *
kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode,
void *priv)
{
- return kernfs_create_dir_ns(parent, name, mode,
+ return kernfs_create_dir_ns(parent, name, mode, parent->owner,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
priv, NULL);
}
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index e3f1e8ac1f85..babbabb460dc 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -30,6 +30,7 @@ enum kobj_ns_type;
struct attribute {
const char *name;
umode_t mode;
+ struct module *owner;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
bool ignore_lockdep:1;
struct lock_class_key *key;
@@ -80,6 +81,7 @@ do { \
* @attrs: Pointer to NULL terminated list of attributes.
* @bin_attrs: Pointer to NULL terminated list of binary attributes.
* Either attrs or bin_attrs or both must be provided.
+ * @module: If set, module responsible for this attribute group
*/
struct attribute_group {
const char *name;
@@ -89,6 +91,7 @@ struct attribute_group {
struct bin_attribute *, int);
struct attribute **attrs;
struct bin_attribute **bin_attrs;
+ struct module *owner;
};
/*
@@ -100,38 +103,52 @@ struct attribute_group {
#define __ATTR(_name, _mode, _show, _store) { \
.attr = {.name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
#define __ATTR_PREALLOC(_name, _mode, _show, _store) { \
.attr = {.name = __stringify(_name), \
- .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode) },\
+ .mode = SYSFS_PREALLOC | VERIFY_OCTAL_PERMISSIONS(_mode),\
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
#define __ATTR_RO(_name) { \
- .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0444, \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
}
#define __ATTR_RO_MODE(_name, _mode) { \
.attr = { .name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
}
#define __ATTR_RW_MODE(_name, _mode) { \
.attr = { .name = __stringify(_name), \
- .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode), \
+ .owner = THIS_MODULE, \
+ }, \
.show = _name##_show, \
.store = _name##_store, \
}
#define __ATTR_WO(_name) { \
- .attr = { .name = __stringify(_name), .mode = 0200 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0200, \
+ .owner = THIS_MODULE, \
+ }, \
.store = _name##_store, \
}
@@ -141,8 +158,11 @@ struct attribute_group {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
#define __ATTR_IGNORE_LOCKDEP(_name, _mode, _show, _store) { \
- .attr = {.name = __stringify(_name), .mode = _mode, \
- .ignore_lockdep = true }, \
+ .attr = {.name = __stringify(_name), \
+ .mode = _mode, \
+ .ignore_lockdep = true, \
+ .owner = THIS_MODULE, \
+ }, \
.show = _show, \
.store = _store, \
}
@@ -159,6 +179,7 @@ static const struct attribute_group *_name##_groups[] = { \
#define ATTRIBUTE_GROUPS(_name) \
static const struct attribute_group _name##_group = { \
.attrs = _name##_attrs, \
+ .owner = THIS_MODULE, \
}; \
__ATTRIBUTE_GROUPS(_name)
@@ -199,20 +220,29 @@ struct bin_attribute {
/* macros to create static binary attributes easier */
#define __BIN_ATTR(_name, _mode, _read, _write, _size) { \
- .attr = { .name = __stringify(_name), .mode = _mode }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = _mode, \
+ .owner = THIS_MODULE, \
+ }, \
.read = _read, \
.write = _write, \
.size = _size, \
}
#define __BIN_ATTR_RO(_name, _size) { \
- .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0444, \
+ .owner = THIS_MODULE, \
+ }, \
.read = _name##_read, \
.size = _size, \
}
#define __BIN_ATTR_WO(_name, _size) { \
- .attr = { .name = __stringify(_name), .mode = 0200 }, \
+ .attr = { .name = __stringify(_name), \
+ .mode = 0200, \
+ .owner = THIS_MODULE, \
+ }, \
.write = _name##_write, \
.size = _size, \
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8afa8690d288..b9886db9ecdd 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3949,7 +3949,7 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
key = &cft->lockdep_key;
#endif
kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
- cgroup_file_mode(cft),
+ cgroup_file_mode(cft), NULL,
GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
0, cft->kf_ops, cft,
NULL, key);
--
2.30.2
On 9/17/21 10:04 PM, Luis Chamberlain wrote:
> A sketch of how this can happen follows:
>
> CPU A CPU B
> whatever_store()
> module_unload
> mutex_lock(foo)
> mutex_lock(foo)
> del_gendisk(zram->disk);
> device_del()
> device_remove_groups()
>
> In this situation whatever_store() is waiting for the mutex foo to
> become unlocked, but that won't happen until module removal is complete.
> But module removal won't complete until the sysfs file being poked
> completes which is waiting for a lock already held.
If I remember correctly I encountered the deadlock scenario described
above for the first time about ten years ago while working on the SCST
project. We solved this deadlock by removing the sysfs attributes from
the module unload code before grabbing mutex_lock(foo), e.g. by calling
sysfs_remove_file(). This works because calling sysfs_remove_file()
multiple times in a row is safe. Is that solution good enough for the
zram driver?
Thanks,
Bart.
Hello,
On Fri, Sep 17, 2021 at 10:04:27PM -0700, Luis Chamberlain wrote:
> If try_module_get() fails we fail the operation on the kernfs node.
>
> We use a try method as a full lock means we'd then make our sysfs
> attributes busy us out from possible module removal, and so userspace
> could force denying module removal, a silly form of "DOS" against module
> removal. A try lock on the module removal ensures we give priority to
> module removal and interacting with sysfs attributes only comes second.
> Using a full lock could mean for instance that if you don't stop poking
> at sysfs files you cannot remove a module.
I find this explanation odd because there's no real equivalent to locking
the module (as opposed to try locking) because you can't wait for the
removal to finish and then grant the lock, so any operation which increases
the reference *has* to be a try method unless the caller already holds a
reference to the same module and thus knows that the module is already
pinned. The code isn't wrong, so maybe just drop the related paragraphs in
the commit message?
> static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> struct kernfs_node *parent,
> const char *name, umode_t mode,
> + struct module *owner,
> kuid_t uid, kgid_t gid,
> unsigned flags)
Is there a particular reason why @owner is added between @mode and @uid?
Sitting between two fs attributes seems a bit awkward. Maybe it can just be
the last one?
Thanks.
--
tejun
On Mon, Sep 20, 2021 at 09:22:32AM -1000, Tejun Heo wrote:
> On Mon, Sep 20, 2021 at 12:15:22PM -0700, Luis Chamberlain wrote:
>
> > Right, the reason I mention the alternative is that we technically don't
> > need to use try in this case since during a kernfs op it is implied the
> > module will be pinned, but we have further motivations to use a try
>
> I'm confused. If the module is already pinned, why are we getting an
> extra reference?
Sorry, I meant its pinned implicitly not through an actual refcount. A module
exit can happen then in that case, it just that the sysfs removal will
have to wait until the active reference for the kernfs ops go down.
> Also, I don't understand how this has that much to do
> with preventing ddoses. I mean, it does cut down the duration of one
> operation but the eventual gating is through whoever acquiring the
> initial reference through try_module_get(), which again is the *only*
> way to acquire a fresh reference.
True. I can just leave that out from the commit log. But it is perhaps
an implicit gain here that using try alows module exit to trump use
of an kernfs op. Maybe's that's sufficient to mention.
Luis
On Mon, Sep 20, 2021 at 02:36:38PM -0700, Bart Van Assche wrote:
> On 9/17/21 10:04 PM, Luis Chamberlain wrote:
> > A sketch of how this can happen follows:
> >
> > CPU A CPU B
> > whatever_store()
> > module_unload
> > mutex_lock(foo)
> > mutex_lock(foo)
> > del_gendisk(zram->disk);
> > device_del()
> > device_remove_groups()
> >
> > In this situation whatever_store() is waiting for the mutex foo to
> > become unlocked, but that won't happen until module removal is complete.
> > But module removal won't complete until the sysfs file being poked
> > completes which is waiting for a lock already held.
>
> If I remember correctly I encountered the deadlock scenario described
> above for the first time about ten years ago while working on the SCST
> project. We solved this deadlock by removing the sysfs attributes from
> the module unload code before grabbing mutex_lock(foo), e.g. by calling
> sysfs_remove_file().
Well the sysfs attributes in zram do tons of funky mucking around so
unfortunately no. It's not the only driver where this can happen. It is
why I decided to work on a generic solution instead.
> This works because calling sysfs_remove_file()
> multiple times in a row is safe. Is that solution good enough for the
> zram driver?
The sysfs attributes are group attributes part of the block, and so are
removed for the driver on a del_gendisk(). So unfortunately no, this
would not be a good solution in this case.
Luis
On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <[email protected]> wrote:
>
> When sysfs attributes use a lock also used on module removal we can
> race to deadlock. This happens when for instance a sysfs file on
> a driver is used, then at the same time we have module removal call
> trigger. The module removal call code holds a lock, and then the sysfs
> file entry waits for the same lock. While holding the lock the module
> removal tries to remove the sysfs entries, but these cannot be removed
> yet as one is waiting for a lock. This won't complete as the lock is
> already held. Likewise module removal cannot complete, and so we deadlock.
>
> This can now be easily reproducible with our sysfs selftest as follows:
>
> ./tools/testing/selftests/sysfs/sysfs.sh -t 0027
>
> To fix this we extend the struct kernfs_node with a module reference and
> use the try_module_get() after kernfs_get_active() is called which
> protects integrity and the existence of the kernfs node during the
> operation.
>
> So long as the kernfs node is protected with kernfs_get_active() we know
> we can rely on its contents. And, as now just documented in the previous
> patch, we also now know that once kernfs_get_active() is called the module
> is also guarded to exist and cannot be removed.
>
> If try_module_get() fails we fail the operation on the kernfs node.
>
> We use a try method as a full lock means we'd then make our sysfs
> attributes busy us out from possible module removal, and so userspace
> could force denying module removal, a silly form of "DOS" against module
> removal. A try lock on the module removal ensures we give priority to
> module removal and interacting with sysfs attributes only comes second.
> Using a full lock could mean for instance that if you don't stop poking
> at sysfs files you cannot remove a module.
>
> Races between removal of sysfs files and the module are not possible
> given sysfs files are created by the same module, and when a sysfs file
> is being used kernfs prevents removal of the sysfs file. So if module
> removal is actually happening the removal would have to wait until
> the sysfs file operation is complete.
>
> This deadlock was first reported with the zram driver, however the live
> patching folks have acknowledged they have observed this as well with
> live patching, when a live patch is removed. I was then able to
> reproduce easily by creating a dedicated selftests.
>
> A sketch of how this can happen follows:
>
> CPU A CPU B
> whatever_store()
> module_unload
> mutex_lock(foo)
> mutex_lock(foo)
> del_gendisk(zram->disk);
> device_del()
> device_remove_groups()
This flow seems possible to trigger with:
echo $dev > /sys/bus/$bus/drivers/$driver/unbind
I am missing why module pinning is part of the solution when it's the
device_del() path that is racing? Module removal is just a more coarse
grained way to trigger unbind => device_del(). Isn't the above a bug
in the driver, not missing synchronization in kernfs? Forgive me if
the unbind question was asked and answered elsewhere, this is my first
time taking a look at this series.
On Mon, Sep 20, 2021 at 01:52:21PM -0700, Dan Williams wrote:
> On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <[email protected]> wrote:
> > This deadlock was first reported with the zram driver, however the live
> > patching folks have acknowledged they have observed this as well with
> > live patching, when a live patch is removed. I was then able to
> > reproduce easily by creating a dedicated selftests.
> >
> > A sketch of how this can happen follows:
> >
> > CPU A CPU B
> > whatever_store()
> > module_unload
> > mutex_lock(foo)
> > mutex_lock(foo)
> > del_gendisk(zram->disk);
> > device_del()
> > device_remove_groups()
>
> This flow seems possible to trigger with:
>
> echo $dev > /sys/bus/$bus/drivers/$driver/unbind
>
> I am missing why module pinning
The aspect of try_module_get() which comes to value to prevent the
deadlock is it ensures kernfs ops do not run once exit is on the way.
> is part of the solution when it's the
> device_del() path that is racing?
But its not, the device_del() path will yield until the kernfs op
completes. It is fine to wait there.
The deadlock happens if a module exit routine uses a lock which is
also used on a sysfs op. If the lock was first held by module exit,
and module exit is waiting for the kernfs op to complete, and the
kernfs op is waiting to hold the same lock then the exit will wait
forever.
> Module removal is just a more coarse
> grained way to trigger unbind => device_del().
Right, but the device_del() path is not sharing a lock with the sysfs op.
> Isn't the above a bug
> in the driver, not missing synchronization in kernfs?
We can certainly take the position as an alternative:
"thou shalt not use a lock on exit which is also used on a syfs op"
However that seems counter intuitive, specially if we can resolve the
issue easily with a try_module_get().
Luis
On Mon, Sep 20, 2021 at 02:55:10PM -0700, Dan Williams wrote:
> On Mon, Sep 20, 2021 at 2:17 PM Luis Chamberlain <[email protected]> wrote:
> >
> > On Mon, Sep 20, 2021 at 01:52:21PM -0700, Dan Williams wrote:
> > > On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <[email protected]> wrote:
> > > > This deadlock was first reported with the zram driver, however the live
> > > > patching folks have acknowledged they have observed this as well with
> > > > live patching, when a live patch is removed. I was then able to
> > > > reproduce easily by creating a dedicated selftests.
> > > >
> > > > A sketch of how this can happen follows:
> > > >
> > > > CPU A CPU B
> > > > whatever_store()
> > > > module_unload
> > > > mutex_lock(foo)
> > > > mutex_lock(foo)
> > > > del_gendisk(zram->disk);
> > > > device_del()
> > > > device_remove_groups()
> > >
> > > This flow seems possible to trigger with:
> > >
> > > echo $dev > /sys/bus/$bus/drivers/$driver/unbind
> > >
> > > I am missing why module pinning
> >
> > The aspect of try_module_get() which comes to value to prevent the
> > deadlock is it ensures kernfs ops do not run once exit is on the way.
> >
> > > is part of the solution when it's the
> > > device_del() path that is racing?
> >
> > But its not, the device_del() path will yield until the kernfs op
> > completes. It is fine to wait there.
> >
> > The deadlock happens if a module exit routine uses a lock which is
> > also used on a sysfs op. If the lock was first held by module exit,
> > and module exit is waiting for the kernfs op to complete, and the
> > kernfs op is waiting to hold the same lock then the exit will wait
> > forever.
> >
> > > Module removal is just a more coarse
> > > grained way to trigger unbind => device_del().
> >
> > Right, but the device_del() path is not sharing a lock with the sysfs op.
>
> The deadlock in the example comes from holding a lock over
> device_del() [...]
No sorry, that is my mistake not making it clear that the mutex held
in the example is on module exit. Or any lock for that matter. That is
these locks are driver specific.
> > > Isn't the above a bug
> > > in the driver, not missing synchronization in kernfs?
> >
> > We can certainly take the position as an alternative:
> >
> > "thou shalt not use a lock on exit which is also used on a syfs op"
> >
> > However that seems counter intuitive, specially if we can resolve the
> > issue easily with a try_module_get().
>
> Again, I don't see how try_module_get() can affect the ABBA failure
> case of holding a lock over device_del() that is also held inside
> sysfs op.
It is not device_del(), it is on module exit. Sorry for this not being
clear before. I'll fix the commit log to make it clearer. The subject
at least was clear but I think the example could be clearer.
Luis
On Mon, Sep 20, 2021 at 2:17 PM Luis Chamberlain <[email protected]> wrote:
>
> On Mon, Sep 20, 2021 at 01:52:21PM -0700, Dan Williams wrote:
> > On Fri, Sep 17, 2021 at 10:05 PM Luis Chamberlain <[email protected]> wrote:
> > > This deadlock was first reported with the zram driver, however the live
> > > patching folks have acknowledged they have observed this as well with
> > > live patching, when a live patch is removed. I was then able to
> > > reproduce easily by creating a dedicated selftests.
> > >
> > > A sketch of how this can happen follows:
> > >
> > > CPU A CPU B
> > > whatever_store()
> > > module_unload
> > > mutex_lock(foo)
> > > mutex_lock(foo)
> > > del_gendisk(zram->disk);
> > > device_del()
> > > device_remove_groups()
> >
> > This flow seems possible to trigger with:
> >
> > echo $dev > /sys/bus/$bus/drivers/$driver/unbind
> >
> > I am missing why module pinning
>
> The aspect of try_module_get() which comes to value to prevent the
> deadlock is it ensures kernfs ops do not run once exit is on the way.
>
> > is part of the solution when it's the
> > device_del() path that is racing?
>
> But its not, the device_del() path will yield until the kernfs op
> completes. It is fine to wait there.
>
> The deadlock happens if a module exit routine uses a lock which is
> also used on a sysfs op. If the lock was first held by module exit,
> and module exit is waiting for the kernfs op to complete, and the
> kernfs op is waiting to hold the same lock then the exit will wait
> forever.
>
> > Module removal is just a more coarse
> > grained way to trigger unbind => device_del().
>
> Right, but the device_del() path is not sharing a lock with the sysfs op.
The deadlock in the example comes from holding a lock over
device_del() that is also taken in a kernfs op. For example, the code
above looks like something that runs from driver.remove(), not
exclusively module_exit(). Yes, module_exit() may trigger
driver.remove() via driver_unregister(), but there's other ways to
trigger driver.remove() that do not involve module_exit().
> > Isn't the above a bug
> > in the driver, not missing synchronization in kernfs?
>
> We can certainly take the position as an alternative:
>
> "thou shalt not use a lock on exit which is also used on a syfs op"
>
> However that seems counter intuitive, specially if we can resolve the
> issue easily with a try_module_get().
Again, I don't see how try_module_get() can affect the ABBA failure
case of holding a lock over device_del() that is also held inside
sysfs op. I agree that the problem is subtle. Does lockdep not
complain about this case? If it's going to be avoided in the core it
seems try_module_get() does not completely cover the hole that
unsuspecting driver writers might fall into.
On Mon, Sep 20, 2021 at 07:53:16AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 17, 2021 at 10:04:27PM -0700, Luis Chamberlain wrote:
> > If try_module_get() fails we fail the operation on the kernfs node.
> >
> > We use a try method as a full lock means we'd then make our sysfs
> > attributes busy us out from possible module removal, and so userspace
> > could force denying module removal, a silly form of "DOS" against module
> > removal. A try lock on the module removal ensures we give priority to
> > module removal and interacting with sysfs attributes only comes second.
> > Using a full lock could mean for instance that if you don't stop poking
> > at sysfs files you cannot remove a module.
>
> I find this explanation odd because there's no real equivalent to locking
> the module (as opposed to try locking)
Actually there is, __module_get() but I suspect some of these users are
probably incorrect and should be be moved to try. The documentation
about "rmmod --wait" for __module_get() is also outdated as that option
is no longer supported. I'll send an update for that later.
> because you can't wait for the
> removal to finish and then grant the lock, so any operation which increases
> the reference *has* to be a try method unless the caller already holds a
> reference to the same module and thus knows that the module is already
> pinned.
Right, the reason I mention the alternative is that we technically don't
need to use try in this case since during a kernfs op it is implied the
module will be pinned, but we have further motivations to use a try
method here: to avoid a possible DOS from module removal by userspace
mucking with ops.
> The code isn't wrong, so maybe just drop the related paragraphs in
> the commit message?
Does it make sense to clarify the above a bit more somehow? Or do think
its not needed?
> > static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> > struct kernfs_node *parent,
> > const char *name, umode_t mode,
> > + struct module *owner,
> > kuid_t uid, kgid_t gid,
> > unsigned flags)
>
> Is there a particular reason why @owner is added between @mode and @uid?
> Sitting between two fs attributes seems a bit awkward. Maybe it can just be
> the last one?
No, I just picked an arbitrary place. Sure I'll move it to the end.
Luis
Hello,
On Mon, Sep 20, 2021 at 12:15:22PM -0700, Luis Chamberlain wrote:
> > I find this explanation odd because there's no real equivalent to locking
> > the module (as opposed to try locking)
>
> Actually there is, __module_get() but I suspect some of these users are
> probably incorrect and should be be moved to try. The documentation
__module_get() is just getting an extra ref when the caller already
has one (or more). It can't be used to freshly acquire a new
reference. There is no equivalence between the relationship between
try_module_get() and __module_get() and the one between spin_trylock()
and spin_lock().
> Right, the reason I mention the alternative is that we technically don't
> need to use try in this case since during a kernfs op it is implied the
> module will be pinned, but we have further motivations to use a try
I'm confused. If the module is already pinned, why are we getting an
extra reference? Also, I don't understand how this has that much to do
with preventing ddoses. I mean, it does cut down the duration of one
operation but the eventual gating is through whoever acquiring the
initial reference through try_module_get(), which again is the *only*
way to acquire a fresh reference.
Thanks.
--
tejun
On Tue, Sep 21, 2021 at 1:24 AM David Laight <[email protected]> wrote:
>
> From: Luis Chamberlain
> > Sent: 17 September 2021 20:47
> >
> > When sysfs attributes use a lock also used on module removal we can
> > race to deadlock. This happens when for instance a sysfs file on
> > a driver is used, then at the same time we have module removal call
> > trigger. The module removal call code holds a lock, and then the sysfs
> > file entry waits for the same lock. While holding the lock the module
> > removal tries to remove the sysfs entries, but these cannot be removed
> > yet as one is waiting for a lock. This won't complete as the lock is
> > already held. Likewise module removal cannot complete, and so we deadlock.
>
> Isn't the real problem the race between a sysfs file action and the
> removal of the sysfs node?
Nope, that is taken care of by kernfs.
> This isn't really related to module unload - except that may
> well remove some sysfs nodes.
Nope, the issue is a deadlock that can happen due to a shared lock on
module removal and a driver sysfs operation.
> This is the same problem as removing any other kind of driver callback.
> There are three basic solutions:
> 1) Use a global lock - not usually useful.
> 2) Have the remove call sleep until any callbacks are complete.
> 3) Have the remove just request removal and have a final
> callback (from a different context).
Kernfs already does a sort of combination of 1) and 2) but 1) is using
atomic reference counts.
> If the remove can sleep (as in 2) then there is a requirement
> on the driver code to not hold any locks across the 'remove'
> that can be acquired during the callbacks.
And this is the part that kernfs has no control over since the removal
and sysfs operation are implementation specific.
> Now, for sysfs, you probably only want to sleep the remove code
> while a read/write is in progress - not just because the node
> is open.
> That probably requires marking an open node 'invalid' and
> deferring delete to close.
This is already done by kernfs.
> None of this requires a reference count on the module.
You are missing the point to the other aspect of the try_module_get(),
it lets you also check if module exit has been entered. By using
try_module_get() you let the module exit trump proceeding with an
operation, therefore also preventing any potential use of a shared
lock on module exit and the driver specific sysfs operation.
Luis