RFC patchset to reduce usage of global locks in kernfs operations.
PATCH-1: Move kernfs_open_file_mutex and kernfs_open_node_lock within
kernfs_node.
PATCH-2: Use per kernfs_node specific lock to access permissions
------------------------------------------------------------------
Changes since v1:
- Introduce PATCH-2 to reduce the contention seen around kernfs_rwsem after
the introduction of PATCH-1.
Imran Khan (2):
kernfs: use kernfs_node specific mutex and spinlock.
kernfs: Reduce contention around global per-fs kernfs_rwsem.
fs/kernfs/dir.c | 6 ++++++
fs/kernfs/file.c | 48 +++++++++++++++---------------------------
fs/kernfs/inode.c | 25 +++++++++++++++-------
include/linux/kernfs.h | 3 +++
4 files changed, 43 insertions(+), 39 deletions(-)
base-commit: ea586a076e8aa606c59b66d86660590f18354b11
--
2.30.2
Right now a global, per-filesystem rwsem (kernfs_rwsem) synchronizes
multiple kernfs operations. On a large system with a few hundred CPUs
and a few hundred applications simultaneously trying to access sysfs,
this results in multiple sys_open() calls contending on kernfs_rwsem
via kernfs_iop_permission and kernfs_dop_revalidate.
- 21.42% 21.34% showgids [kernel.kallsyms] [k] up_read
21.34% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 20.05% link_path_walk
- 9.76% walk_component
lookup_fast
- d_revalidate.part.24
- 9.75% kernfs_dop_revalidate
up_read
- 9.46% inode_permission
- __inode_permission
- 9.46% kernfs_iop_permission
up_read
- 0.83% kernfs_iop_get_link
up_read
- 0.80% lookup_fast
d_revalidate.part.24
kernfs_dop_revalidate
up_read
- 21.31% 21.21% showgids [kernel.kallsyms] [k] down_read
21.21% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 19.78% link_path_walk
- 10.62% inode_permission
- __inode_permission
- 10.62% kernfs_iop_permission
down_read
- 8.45% walk_component
lookup_fast
- d_revalidate.part.24
- 8.45% kernfs_dop_revalidate
down_read
- 0.71% kernfs_iop_get_link
down_read
- 0.72% lookup_fast
- d_revalidate.part.24
- 0.72% kernfs_dop_revalidate
down_read
- 0.71% may_open
inode_permission
__inode_permission
kernfs_iop_permission
down_read
Since permissions are specific to a kernfs_node, we can use a
per-kernfs_node lock to access/modify them. Also use kernfs reference
counting to ensure we are accessing/modifying the permissions of an
existing kernfs_node object.
This change brings the above-mentioned down_read/up_read numbers down
to ~8%, indicating that contention around kernfs_rwsem has been
reduced to about one third of its earlier value.
Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 4 ++++
fs/kernfs/inode.c | 25 +++++++++++++++++--------
include/linux/kernfs.h | 1 +
3 files changed, 22 insertions(+), 8 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index cd68ac30f71b..a6846e5e2cab 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -750,11 +750,13 @@ int kernfs_add_one(struct kernfs_node *kn)
goto out_unlock;
/* Update timestamps on the parent */
+ down_write(&parent->kernfs_iop_rwsem);
ps_iattr = parent->iattr;
if (ps_iattr) {
ktime_get_real_ts64(&ps_iattr->ia_ctime);
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}
+ up_write(&parent->kernfs_iop_rwsem);
up_write(&root->kernfs_rwsem);
@@ -1380,8 +1382,10 @@ static void __kernfs_remove(struct kernfs_node *kn)
/* update timestamps on the parent */
if (ps_iattr) {
+ down_write(&pos->parent->kernfs_iop_rwsem);
ktime_get_real_ts64(&ps_iattr->ia_ctime);
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
+ up_write(&pos->parent->kernfs_iop_rwsem);
}
kernfs_put(pos);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5da..6d375f3b5e17 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -101,9 +101,11 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
int ret;
struct kernfs_root *root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ kernfs_get(kn);
+ down_write(&kn->kernfs_iop_rwsem);
ret = __kernfs_setattr(kn, iattr);
- up_write(&root->kernfs_rwsem);
+ up_write(&kn->kernfs_iop_rwsem);
+ kernfs_put(kn);
return ret;
}
@@ -119,7 +121,8 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
return -EINVAL;
root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ kernfs_get(kn);
+ down_write(&kn->kernfs_iop_rwsem);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -132,7 +135,8 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);
out:
- up_write(&root->kernfs_rwsem);
+ up_write(&kn->kernfs_iop_rwsem);
+ kernfs_put(kn);
return error;
}
@@ -189,12 +193,14 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct kernfs_node *kn = inode->i_private;
struct kernfs_root *root = kernfs_root(kn);
- down_read(&root->kernfs_rwsem);
+ kernfs_get(kn);
+ down_read(&kn->kernfs_iop_rwsem);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
generic_fillattr(&init_user_ns, inode, stat);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ up_read(&kn->kernfs_iop_rwsem);
+ kernfs_put(kn);
return 0;
}
@@ -207,6 +213,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
inode->i_op = &kernfs_iops;
inode->i_generation = kernfs_gen(kn);
+ init_rwsem(&kn->kernfs_iop_rwsem);
set_default_inode_attr(inode, kn->mode);
kernfs_refresh_inode(kn, inode);
@@ -287,12 +294,14 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
kn = inode->i_private;
root = kernfs_root(kn);
- down_read(&root->kernfs_rwsem);
+ kernfs_get(kn);
+ down_read(&kn->kernfs_iop_rwsem);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
ret = generic_permission(&init_user_ns, inode, mask);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ up_read(&kn->kernfs_iop_rwsem);
+ kernfs_put(kn);
return ret;
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5ed4c9ed39af..c0fa319d266b 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -166,6 +166,7 @@ struct kernfs_node {
struct kernfs_iattrs *iattr;
spinlock_t kernfs_open_node_lock;
struct mutex kernfs_open_file_mutex;
+ struct rw_semaphore kernfs_iop_rwsem;
};
/*
--
2.30.2
Right now a global mutex (kernfs_open_file_mutex) protects the list of
kernfs_open_file instances corresponding to a sysfs attribute, so even
if different tasks are opening or closing different sysfs files they
can contend on the osq_lock of this mutex. The contention is more
apparent on large-scale systems with a few hundred CPUs, where most of
the CPUs have running tasks that are opening, accessing or closing
sysfs files at any point in time. Since each list of kernfs_open_file
belongs to a kernfs_open_node instance which in turn corresponds to one
kernfs_node, move the global kernfs_open_file_mutex into kernfs_node so
that it does not block access to the kernfs_open_file lists of other
kernfs_nodes.
Also, since kernfs_node->attr.open points to the kernfs_open_node
instance corresponding to the kernfs_node, we can use a
kernfs_node-specific spinlock in place of the current global spinlock,
i.e. kernfs_open_node_lock. So make this spinlock local to the
kernfs_node instance as well.
Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 2 ++
fs/kernfs/file.c | 48 +++++++++++++++---------------------------
include/linux/kernfs.h | 2 ++
3 files changed, 21 insertions(+), 31 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index e6d9772ddb4c..cd68ac30f71b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -603,6 +603,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
atomic_set(&kn->count, 1);
atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
RB_CLEAR_NODE(&kn->rb);
+ spin_lock_init(&kn->kernfs_open_node_lock);
+ mutex_init(&kn->kernfs_open_file_mutex);
kn->name = name;
kn->mode = mode;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 9414a7a60a9f..4114745d80d5 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -18,20 +18,6 @@
#include "kernfs-internal.h"
-/*
- * There's one kernfs_open_file for each open file and one kernfs_open_node
- * for each kernfs_node with one or more open files.
- *
- * kernfs_node->attr.open points to kernfs_open_node. attr.open is
- * protected by kernfs_open_node_lock.
- *
- * filp->private_data points to seq_file whose ->private points to
- * kernfs_open_file. kernfs_open_files are chained at
- * kernfs_open_node->files, which is protected by kernfs_open_file_mutex.
- */
-static DEFINE_SPINLOCK(kernfs_open_node_lock);
-static DEFINE_MUTEX(kernfs_open_file_mutex);
-
struct kernfs_open_node {
atomic_t refcnt;
atomic_t event;
@@ -526,8 +512,8 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
struct kernfs_open_node *on, *new_on = NULL;
retry:
- mutex_lock(&kernfs_open_file_mutex);
- spin_lock_irq(&kernfs_open_node_lock);
+ mutex_lock(&kn->kernfs_open_file_mutex);
+ spin_lock_irq(&kn->kernfs_open_node_lock);
if (!kn->attr.open && new_on) {
kn->attr.open = new_on;
@@ -540,8 +526,8 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
list_add_tail(&of->list, &on->files);
}
- spin_unlock_irq(&kernfs_open_node_lock);
- mutex_unlock(&kernfs_open_file_mutex);
+ spin_unlock_irq(&kn->kernfs_open_node_lock);
+ mutex_unlock(&kn->kernfs_open_file_mutex);
if (on) {
kfree(new_on);
@@ -577,8 +563,8 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
struct kernfs_open_node *on = kn->attr.open;
unsigned long flags;
- mutex_lock(&kernfs_open_file_mutex);
- spin_lock_irqsave(&kernfs_open_node_lock, flags);
+ mutex_lock(&kn->kernfs_open_file_mutex);
+ spin_lock_irqsave(&kn->kernfs_open_node_lock, flags);
if (of)
list_del(&of->list);
@@ -588,8 +574,8 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
else
on = NULL;
- spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
- mutex_unlock(&kernfs_open_file_mutex);
+ spin_unlock_irqrestore(&kn->kernfs_open_node_lock, flags);
+ mutex_unlock(&kn->kernfs_open_file_mutex);
kfree(on);
}
@@ -733,7 +719,7 @@ static void kernfs_release_file(struct kernfs_node *kn,
* here because drain path may be called from places which can
* cause circular dependency.
*/
- lockdep_assert_held(&kernfs_open_file_mutex);
+ lockdep_assert_held(&kn->kernfs_open_file_mutex);
if (!of->released) {
/*
@@ -752,9 +738,9 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)
struct kernfs_open_file *of = kernfs_of(filp);
if (kn->flags & KERNFS_HAS_RELEASE) {
- mutex_lock(&kernfs_open_file_mutex);
+ mutex_lock(&kn->kernfs_open_file_mutex);
kernfs_release_file(kn, of);
- mutex_unlock(&kernfs_open_file_mutex);
+ mutex_unlock(&kn->kernfs_open_file_mutex);
}
kernfs_put_open_node(kn, of);
@@ -773,15 +759,15 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
return;
- spin_lock_irq(&kernfs_open_node_lock);
+ spin_lock_irq(&kn->kernfs_open_node_lock);
on = kn->attr.open;
if (on)
atomic_inc(&on->refcnt);
- spin_unlock_irq(&kernfs_open_node_lock);
+ spin_unlock_irq(&kn->kernfs_open_node_lock);
if (!on)
return;
- mutex_lock(&kernfs_open_file_mutex);
+ mutex_lock(&kn->kernfs_open_file_mutex);
list_for_each_entry(of, &on->files, list) {
struct inode *inode = file_inode(of->file);
@@ -793,7 +779,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
kernfs_release_file(kn, of);
}
- mutex_unlock(&kernfs_open_file_mutex);
+ mutex_unlock(&kn->kernfs_open_file_mutex);
kernfs_put_open_node(kn, NULL);
}
@@ -922,13 +908,13 @@ void kernfs_notify(struct kernfs_node *kn)
return;
/* kick poll immediately */
- spin_lock_irqsave(&kernfs_open_node_lock, flags);
+ spin_lock_irqsave(&kn->kernfs_open_node_lock, flags);
on = kn->attr.open;
if (on) {
atomic_inc(&on->event);
wake_up_interruptible(&on->poll);
}
- spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+ spin_unlock_irqrestore(&kn->kernfs_open_node_lock, flags);
/* schedule work to kick fsnotify */
spin_lock_irqsave(&kernfs_notify_lock, flags);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 861c4f0f8a29..5ed4c9ed39af 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -164,6 +164,8 @@ struct kernfs_node {
unsigned short flags;
umode_t mode;
struct kernfs_iattrs *iattr;
+ spinlock_t kernfs_open_node_lock;
+ struct mutex kernfs_open_file_mutex;
};
/*
--
2.30.2
On Mon, Jan 03, 2022 at 07:45:43PM +1100, Imran Khan wrote:
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 861c4f0f8a29..5ed4c9ed39af 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -164,6 +164,8 @@ struct kernfs_node {
> unsigned short flags;
> umode_t mode;
> struct kernfs_iattrs *iattr;
> + spinlock_t kernfs_open_node_lock;
> + struct mutex kernfs_open_file_mutex;
Did you just blow up the memory requirements of a system with lots of
kobjects created?
We used to be able to support tens of thousands of scsi devices in a
32bit kernel. With this change, what is the memory difference that just
happened?
There is a tradeoff of memory usage and runtime contention that has to
be made here, and this might be pushing it in the wrong direction for
a lot of systems.
thanks,
greg k-h
On 3/1/22 8:54 pm, Greg KH wrote:
> On Mon, Jan 03, 2022 at 07:45:43PM +1100, Imran Khan wrote:
>> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> index 861c4f0f8a29..5ed4c9ed39af 100644
>> --- a/include/linux/kernfs.h
>> +++ b/include/linux/kernfs.h
>> @@ -164,6 +164,8 @@ struct kernfs_node {
>> unsigned short flags;
>> umode_t mode;
>> struct kernfs_iattrs *iattr;
>> + spinlock_t kernfs_open_node_lock;
>> + struct mutex kernfs_open_file_mutex;
>
> Did you just blow up the memory requirements of a system with lots of
> kobjects created?
> We used to be able to support tens of thousands of scsi devices in a
> 32bit kernel. With this change, what is the memory difference that just
> happened?
>
Indeed, this patch increases kernfs_node size by 36 bytes (28 bytes for
the mutex + 4 bytes for the spinlock). From the current kernfs_node size
of 128 bytes, this will be a ~28% increase in kobject memory consumption.
I can replace the mutex object with a pointer to it, to bring down
the overall increase in size. Will the size change be acceptable then?
> There is a tradeoff of memory usage and runtime contention that has to
> be made here, and this might be pushing it in the wrong direction for
> a lot of systems.
>
Agreed. Could you please suggest whether this should be made configurable
via Kconfig? I understand that this would result in two versions of some
functions, but it will allow systems with large memories to avoid kernfs
contention. We are seeing the launch time of some DB workloads
adversely affected by this contention.
Also, based on the recent movement of kernfs_rwsem into kernfs_root, do
you think the above-mentioned mutex and spinlock can be moved to
kernfs_root as well? Although that change would not help in my current
case, it could avoid similar contention between different users of
kernfs, like cgroup and sysfs.
Thanks.
-- Imran
> thanks,
>
> greg k-h
Hi Greg
On 4/1/22 9:16 am, Imran Khan wrote:
>
>
> On 3/1/22 8:54 pm, Greg KH wrote:
>> On Mon, Jan 03, 2022 at 07:45:43PM +1100, Imran Khan wrote:
>>> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>>> index 861c4f0f8a29..5ed4c9ed39af 100644
>>> --- a/include/linux/kernfs.h
>>> +++ b/include/linux/kernfs.h
>>> @@ -164,6 +164,8 @@ struct kernfs_node {
>>> unsigned short flags;
>>> umode_t mode;
>>> struct kernfs_iattrs *iattr;
>>> + spinlock_t kernfs_open_node_lock;
>>> + struct mutex kernfs_open_file_mutex;
>>
>> Did you just blow up the memory requirements of a system with lots of
>> kobjects created?
>> We used to be able to support tens of thousands of scsi devices in a
>> 32bit kernel. With this change, what is the memory difference that just
>> happened?
>>
> Indeed, this patch increases kernfs_node size by 36 bytes (28 bytes for
> the mutex + 4 bytes for the spinlock). From the current kernfs_node size
> of 128 bytes, this will be a ~28% increase in kobject memory consumption.
> I can replace the mutex object with a pointer to it, to bring down
> the overall increase in size. Will the size change be acceptable then?
>
Please ignore my proposal about using a pointer to the mutex object. It
would reduce the size of the kernfs_node object, but the overall kobject
memory usage would not go down.
Thanks,
Imran
On Tue, Jan 04, 2022 at 09:16:03AM +1100, Imran Khan wrote:
> > There is a tradeoff of memory usage and runtime contention that has to
> > be made here, and this might be pushing it in the wrong direction for
> > a lot of systems.
> >
> Agreed. Could you please suggest whether this should be made configurable
> via Kconfig? I understand that this would result in two versions of some
> functions, but it will allow systems with large memories to avoid kernfs
> contention.
No, that way lies madness and horrible bugs and a codebase no one will
be able to maintain.
> We are seeing the launch time of some DB workloads adversely
> affected by this contention.
What workloads? sysfs should NEVER be in the fast-path of any normal
operation, including booting. What benchmark or real work is having
problems here?
> Also, based on the recent movement of kernfs_rwsem into kernfs_root, do
> you think the above-mentioned mutex and spinlock can be moved to
> kernfs_root as well? Although that change would not help in my current
> case, it could avoid similar contention between different users of
> kernfs, like cgroup and sysfs.
Why would you want to move it if you do not see a measured benefit?
thanks,
greg k-h
Greeting,
FYI, we noticed the following commit (built with gcc-9):
commit: 3dd2a5f81a41be1f22391cc8ef0ac78293de0141 ("[RFC PATCH v2 2/2] kernfs: Reduce contention around global per-fs kernfs_rwsem.")
url: https://github.com/0day-ci/linux/commits/Imran-Khan/kernfs-use-kernfs_node-specific-mutex-and-spinlock/20220103-164752
patch link: https://lore.kernel.org/lkml/[email protected]
in testcase: rcuscale
version:
with following parameters:
runtime: 300s
scale_type: srcud
on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+-----------------------------------------------+------------+------------+
| | 0fc4d2a7a9 | 3dd2a5f81a |
+-----------------------------------------------+------------+------------+
| INFO:trying_to_register_non-static_key | 0 | 60 |
| WARNING:at_kernel/locking/rwsem.c:#down_write | 0 | 60 |
| EIP:down_write | 0 | 60 |
| WARNING:at_kernel/locking/rwsem.c:#up_write | 0 | 60 |
| EIP:up_write | 0 | 60 |
+-----------------------------------------------+------------+------------+
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>
[ 0.740605][ T0] INFO: trying to register non-static key.
[ 0.741504][ T0] The code is fine but needs lockdep annotation, or maybe
[ 0.742530][ T0] you didn't initialize this object before use?
[ 0.743485][ T0] turning off the locking correctness validator.
[ 0.743838][ T0] CPU: 0 PID: 0 Comm: swapper Not tainted 5.16.0-rc6-next-20211224-00002-g3dd2a5f81a41 #7
[ 0.743838][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 0.743838][ T0] Call Trace:
[ 0.743838][ T0] dump_stack_lvl (kbuild/src/rand-2/lib/dump_stack.c:108)
[ 0.743838][ T0] dump_stack (kbuild/src/rand-2/lib/dump_stack.c:114)
[ 0.743838][ T0] register_lock_class (kbuild/src/rand-2/kernel/locking/lockdep.c:952 kbuild/src/rand-2/kernel/locking/lockdep.c:1263)
[ 0.743838][ T0] ? sched_clock_cpu (kbuild/src/rand-2/kernel/sched/clock.c:382)
[ 0.743838][ T0] __lock_acquire (kbuild/src/rand-2/kernel/locking/lockdep.c:4907)
[ 0.743838][ T0] lock_acquire (kbuild/src/rand-2/kernel/locking/lockdep.c:438 kbuild/src/rand-2/kernel/locking/lockdep.c:5641 kbuild/src/rand-2/kernel/locking/lockdep.c:5604)
[ 0.743838][ T0] ? kernfs_add_one (kbuild/src/rand-2/fs/kernfs/dir.c:754)
[ 0.743838][ T0] down_write (kbuild/src/rand-2/kernel/locking/rwsem.c:1278 kbuild/src/rand-2/kernel/locking/rwsem.c:1515)
[ 0.743838][ T0] ? kernfs_add_one (kbuild/src/rand-2/fs/kernfs/dir.c:754)
[ 0.743838][ T0] kernfs_add_one (kbuild/src/rand-2/fs/kernfs/dir.c:754)
[ 0.743838][ T0] kernfs_create_dir_ns (kbuild/src/rand-2/fs/kernfs/dir.c:1009)
[ 0.743838][ T0] sysfs_create_dir_ns (kbuild/src/rand-2/include/linux/err.h:36 kbuild/src/rand-2/fs/sysfs/dir.c:61)
[ 0.743838][ T0] kobject_add_internal (kbuild/src/rand-2/lib/kobject.c:89 (discriminator 11) kbuild/src/rand-2/lib/kobject.c:255 (discriminator 11))
[ 0.743838][ T0] ? kfree_const (kbuild/src/rand-2/mm/util.c:41)
[ 0.743838][ T0] ? kobject_set_name_vargs (kbuild/src/rand-2/lib/kobject.c:312)
[ 0.743838][ T0] kobject_add (kbuild/src/rand-2/lib/kobject.c:390 kbuild/src/rand-2/lib/kobject.c:442)
[ 0.743838][ T0] kobject_create_and_add (kbuild/src/rand-2/lib/kobject.c:815)
[ 0.743838][ T0] mnt_init (kbuild/src/rand-2/fs/namespace.c:4373)
[ 0.743838][ T0] vfs_caches_init (kbuild/src/rand-2/fs/dcache.c:3295)
[ 0.743838][ T0] start_kernel (kbuild/src/rand-2/init/main.c:1123)
[ 0.743838][ T0] ? early_idt_handler_common (kbuild/src/rand-2/arch/x86/kernel/head_32.S:417)
[ 0.743838][ T0] i386_start_kernel (kbuild/src/rand-2/arch/x86/kernel/head32.c:57)
[ 0.743838][ T0] startup_32_smp (kbuild/src/rand-2/arch/x86/kernel/head_32.S:328)
[ 0.743841][ T0] ------------[ cut here ]------------
[ 0.747174][ T0] DEBUG_RWSEMS_WARN_ON(sem->magic != sem): count = 0x0, magic = 0x0, owner = 0x0, curr 0xc24ce5c0, list not empty
[ 0.748936][ T0] WARNING: CPU: 0 PID: 0 at kernel/locking/rwsem.c:1278 down_write (kbuild/src/rand-2/kernel/locking/rwsem.c:1278 kbuild/src/rand-2/kernel/locking/rwsem.c:1515)
[ 0.750268][ T0] Modules linked in:
[ 0.750508][ T0] CPU: 0 PID: 0 Comm: swapper Not tainted 5.16.0-rc6-next-20211224-00002-g3dd2a5f81a41 #7
[ 0.751888][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 0.753284][ T0] EIP: down_write (kbuild/src/rand-2/kernel/locking/rwsem.c:1278 kbuild/src/rand-2/kernel/locking/rwsem.c:1515)
[ 0.753841][ T0] Code: 8b 46 3c 8b 1d e4 26 4e c2 c7 04 24 c8 74 33 c2 89 54 24 08 ba 73 74 33 c2 89 5c 24 14 89 44 24 0c 89 54 24 04 e8 55 e1 f5 ff <0f> 0b 31 c9 ba 01 00 00 00 c7 04 24 01 00 00 00 b8 e8 19 ae c2 e8
All code
========
0: 8b 46 3c mov 0x3c(%rsi),%eax
3: 8b 1d e4 26 4e c2 mov -0x3db1d91c(%rip),%ebx # 0xffffffffc24e26ed
9: c7 04 24 c8 74 33 c2 movl $0xc23374c8,(%rsp)
10: 89 54 24 08 mov %edx,0x8(%rsp)
14: ba 73 74 33 c2 mov $0xc2337473,%edx
19: 89 5c 24 14 mov %ebx,0x14(%rsp)
1d: 89 44 24 0c mov %eax,0xc(%rsp)
21: 89 54 24 04 mov %edx,0x4(%rsp)
25: e8 55 e1 f5 ff callq 0xfffffffffff5e17f
2a:* 0f 0b ud2 <-- trapping instruction
2c: 31 c9 xor %ecx,%ecx
2e: ba 01 00 00 00 mov $0x1,%edx
33: c7 04 24 01 00 00 00 movl $0x1,(%rsp)
3a: b8 e8 19 ae c2 mov $0xc2ae19e8,%eax
3f: e8 .byte 0xe8
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 31 c9 xor %ecx,%ecx
4: ba 01 00 00 00 mov $0x1,%edx
9: c7 04 24 01 00 00 00 movl $0x1,(%rsp)
10: b8 e8 19 ae c2 mov $0xc2ae19e8,%eax
15: e8 .byte 0xe8
[ 0.757174][ T0] EAX: 0000006f EBX: c24ce5c0 ECX: 00000000 EDX: 00000000
[ 0.758206][ T0] ESI: c40110f4 EDI: c4011134 EBP: c24bde4c ESP: c24bde24
[ 0.759202][ T0] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00210282
[ 0.760303][ T0] CR0: 80050033 CR2: ffdd9000 CR3: 02c6f000 CR4: 00040690
[ 0.760509][ T0] Call Trace:
[ 0.761040][ T0] kernfs_add_one (kbuild/src/rand-2/fs/kernfs/dir.c:754)
[ 0.761714][ T0] kernfs_create_dir_ns (kbuild/src/rand-2/fs/kernfs/dir.c:1009)
[ 0.762449][ T0] sysfs_create_dir_ns (kbuild/src/rand-2/include/linux/err.h:36 kbuild/src/rand-2/fs/sysfs/dir.c:61)
[ 0.763842][ T0] kobject_add_internal (kbuild/src/rand-2/lib/kobject.c:89 (discriminator 11) kbuild/src/rand-2/lib/kobject.c:255 (discriminator 11))
[ 0.764643][ T0] ? kfree_const (kbuild/src/rand-2/mm/util.c:41)
[ 0.765371][ T0] ? kobject_set_name_vargs (kbuild/src/rand-2/lib/kobject.c:312)
[ 0.766137][ T0] kobject_add (kbuild/src/rand-2/lib/kobject.c:390 kbuild/src/rand-2/lib/kobject.c:442)
[ 0.766769][ T0] kobject_create_and_add (kbuild/src/rand-2/lib/kobject.c:815)
[ 0.767175][ T0] mnt_init (kbuild/src/rand-2/fs/namespace.c:4373)
[ 0.767784][ T0] vfs_caches_init (kbuild/src/rand-2/fs/dcache.c:3295)
[ 0.768572][ T0] start_kernel (kbuild/src/rand-2/init/main.c:1123)
[ 0.769246][ T0] ? early_idt_handler_common (kbuild/src/rand-2/arch/x86/kernel/head_32.S:417)
[ 0.770508][ T0] i386_start_kernel (kbuild/src/rand-2/arch/x86/kernel/head32.c:57)
[ 0.771192][ T0] startup_32_smp (kbuild/src/rand-2/arch/x86/kernel/head_32.S:328)
[ 0.771873][ T0] irq event stamp: 1153
[ 0.772517][ T0] hardirqs last enabled at (1153): _raw_spin_unlock_irqrestore (kbuild/src/rand-2/arch/x86/include/asm/irqflags.h:45 kbuild/src/rand-2/arch/x86/include/asm/irqflags.h:80 kbuild/src/rand-2/arch/x86/include/asm/irqflags.h:138 kbuild/src/rand-2/include/linux/spinlock_api_smp.h:151 kbuild/src/rand-2/kernel/locking/spinlock.c:194)
[ 0.773841][ T0] hardirqs last disabled at (1152): _raw_spin_lock_irqsave (kbuild/src/rand-2/include/linux/spinlock_api_smp.h:108 kbuild/src/rand-2/kernel/locking/spinlock.c:162)
[ 0.775115][ T0] softirqs last enabled at (0): 0x0
[ 0.776015][ T0] softirqs last disabled at (0): 0x0
[ 0.777174][ T0] ---[ end trace 0000000000000000 ]---
[ 0.777917][ T0] ------------[ cut here ]------------
[ 0.778654][ T0] DEBUG_RWSEMS_WARN_ON(sem->magic != sem): count = 0x1, magic = 0x0, owner = 0xc24ce5c0, curr 0xc24ce5c0, list not empty
To reproduce:
# build kernel
cd linux
cp config-5.16.0-rc6-next-20211224-00002-g3dd2a5f81a41 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-9 CC=gcc-9 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang
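The two warnings above point to an initialization-ordering problem in
PATCH-2: init_rwsem(&kn->kernfs_iop_rwsem) runs in kernfs_init_inode(),
but kernfs_add_one() already takes the semaphore while the node is being
created, before any inode (and hence the rwsem) has been initialized,
which is exactly what lockdep's "non-static key" message and the
DEBUG_RWSEMS magic check complain about. One possible fix, sketched here
as an illustrative hunk rather than a posted patch, is to initialize the
rwsem next to the other per-node locks in __kernfs_new_node():

--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 	RB_CLEAR_NODE(&kn->rb);
 	spin_lock_init(&kn->kernfs_open_node_lock);
 	mutex_init(&kn->kernfs_open_file_mutex);
+	/* init before the first kernfs_add_one() can take it */
+	init_rwsem(&kn->kernfs_iop_rwsem);
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
 	inode->i_op = &kernfs_iops;
 	inode->i_generation = kernfs_gen(kn);
-	init_rwsem(&kn->kernfs_iop_rwsem);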
Hello,
On Tue, Jan 04, 2022 at 08:40:30AM +0100, Greg KH wrote:
> > We are seeing the launch time of some DB workloads adversely
> > affected by this contention.
>
> What workloads? sysfs should NEVER be in the fast-path of any normal
> operation, including booting. What benchmark or real work is having
> problems here?
In most systems, this shouldn't matter at all but sysfs and cgroupfs host a
lot of statistics files which may be read regularly. It is conceivable that
in large enough systems, the current locking scheme doesn't scale well
enough. We should definitely measure the overhead and gains tho.
If this is something necessary, I think one possible solution is using
hashed locks. I know that it isn't a popular choice but it makes sense given
the constraints.
Thanks.
--
tejun
Hi Tejun,
On 7/1/22 7:30 am, Tejun Heo wrote:
> Hello,
>
> On Tue, Jan 04, 2022 at 08:40:30AM +0100, Greg KH wrote:
>>> We are seeing the launch time of some DB workloads adversely
>>> affected by this contention.
>>
>> What workloads? sysfs should NEVER be in the fast-path of any normal
>> operation, including booting. What benchmark or real work is having
>> problems here?
>
> In most systems, this shouldn't matter at all but sysfs and cgroupfs host a
> lot of statistics files which may be read regularly. It is conceivable that
> in large enough systems, the current locking scheme doesn't scale well
> enough. We should definitely measure the overhead and gains tho.
>
> If this is something necessary, I think one possible solution is using
> hashed locks. I know that it isn't a popular choice but it makes sense given
> the constraints.
>
Could you please suggest some current users of hashed locks? I can
check that code and modify my patches accordingly.
As of now I have not found any standard benchmarks/workloads to show the
impact of this contention. We have some in-house DB applications where
the impact can be easily seen. Of course those applications can be
modified to get the needed data from somewhere else or access sysfs less
frequently but nonetheless I am trying to make the current locking
scheme more scalable.
Thanks
-- Imran
On Fri, Jan 07, 2022 at 11:01:55PM +1100, Imran Khan wrote:
> Hi Tejun,
>
> On 7/1/22 7:30 am, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Jan 04, 2022 at 08:40:30AM +0100, Greg KH wrote:
> >>> We are seeing the launch time of some DB workloads adversely
> >>> affected by this contention.
> >>
> >> What workloads? sysfs should NEVER be in the fast-path of any normal
> >> operation, including booting. What benchmark or real work is having
> >> problems here?
> >
> > In most systems, this shouldn't matter at all but sysfs and cgroupfs host a
> > lot of statistics files which may be read regularly. It is conceivable that
> > in large enough systems, the current locking scheme doesn't scale well
> > enough. We should definitely measure the overhead and gains tho.
> >
> > If this is something necessary, I think one possible solution is using
> > hashed locks. I know that it isn't a popular choice but it makes sense given
> > the constraints.
> >
>
> Could you please suggest some current users of hashed locks? I can
> check that code and modify my patches accordingly.
>
> As of now I have not found any standard benchmarks/workloads to show the
> impact of this contention. We have some in-house DB applications where
> the impact can be easily seen. Of course those applications can be
> modified to get the needed data from somewhere else or access sysfs less
> frequently but nonetheless I am trying to make the current locking
> scheme more scalable.
Why are applications hitting sysfs so hard that this is noticeable? What
in it is needed by userspace so badly? And what changed to make this a
requirement of them?
thanks,
greg k-h
Hello,
On Fri, Jan 07, 2022 at 11:01:55PM +1100, Imran Khan wrote:
> Could you please suggest some current users of hashed locks? I can
> check that code and modify my patches accordingly.
include/linux/blockgroup_lock.h seems to be one.
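blockgroup_lock.h keeps a small fixed array of cacheline-aligned
spinlocks and maps each block group onto one of them with a mask, so the
memory cost stays constant no matter how many objects exist. A minimal
sketch of the same pattern applied to the global kernfs_open_file_mutex
(table size and names are illustrative, not taken from any posted patch):

#include <linux/hash.h>
#include <linux/init.h>
#include <linux/kernfs.h>
#include <linux/mutex.h>

#define KERNFS_LOCK_BITS	6	/* 64 buckets; illustrative */
#define NR_KERNFS_LOCKS		(1 << KERNFS_LOCK_BITS)

static struct mutex kernfs_open_file_mutex[NR_KERNFS_LOCKS];

static void __init kernfs_lock_init(void)
{
	int i;

	for (i = 0; i < NR_KERNFS_LOCKS; i++)
		mutex_init(&kernfs_open_file_mutex[i]);
}

/* Hash the node's address to pick one of the shared mutexes. */
static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
{
	return &kernfs_open_file_mutex[hash_ptr(kn, KERNFS_LOCK_BITS)];
}

Two nodes that hash to the same bucket share a mutex, so the worst case
degrades to today's behaviour, but contention is spread across the table
and the cost no longer grows with every kernfs_node.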
> As of now I have not found any standard benchmarks/workloads to show the
> impact of this contention. We have some in-house DB applications where
> the impact can be easily seen. Of course those applications can be
> modified to get the needed data from somewhere else or access sysfs less
> frequently but nonetheless I am trying to make the current locking
> scheme more scalable.
I don't think it needs to show up in one of the common benchmarks but what
the application does should make some sense. Which files are involved in the
contentions?
Thanks.
--
tejun
Hi Tejun,
On 8/1/22 8:25 am, Tejun Heo wrote:
> Hello,
>
> On Fri, Jan 07, 2022 at 11:01:55PM +1100, Imran Khan wrote:
>> Could you please suggest some current users of hashed locks? I can
>> check that code and modify my patches accordingly.
>
> include/linux/blockgroup_lock.h seems to be one.
>
Thanks for this.
>> As of now I have not found any standard benchmarks/workloads to show the
>> impact of this contention. We have some in-house DB applications where
>> the impact can be easily seen. Of course those applications can be
>> modified to get the needed data from somewhere else or access sysfs less
>> frequently but nonetheless I am trying to make the current locking
>> scheme more scalable.
>
> I don't think it needs to show up in one of the common benchmarks but what
> the application does should make some sense. Which files are involved in the
> contentions?
>
The database application has a health monitoring component which
regularly collects stats from sysfs. With a small number of databases this
was not an issue but recently several customers did some consolidation
and ended up having hundreds of databases, all running on the same
server and in those setups the contention became more and more evident.
As more and more customers are consolidating we have started to get more
occurrences of this issue and in this case it all depends on the number of
running databases on the server.
I will have to reach out to the application team to get a list of all sysfs
files being accessed but one of them is
"/sys/class/infiniband/<device>/ports/<port number>/gids/<gid index>".
Thanks
-- Imran
Hello,
On Tue, Jan 11, 2022 at 10:42:31AM +1100, Imran Khan wrote:
> The database application has a health monitoring component which
> regularly collects stats from sysfs. With a small number of databases this
> was not an issue but recently several customers did some consolidation
> and ended up having hundreds of databases, all running on the same
> server and in those setups the contention became more and more evident.
> As more and more customers are consolidating we have started to get more
> occurrences of this issue and in this case it all depends on the number of
> running databases on the server.
>
> I will have to reach out to the application team to get a list of all sysfs
> files being accessed but one of them is
> "/sys/class/infiniband/<device>/ports/<port number>/gids/<gid index>".
I can imagine a similar scenario w/ cgroups with heavy stacking - each
application fetches its own stat regularly which isn't a problem in
isolation but once you put thousands of them on a machine, the shared lock
can get pretty hot, and the cgroup scenario probably is more convincing in
that they'd be hitting different files but the lock still gets hot
because it is shared across them.
Greg, I think the call for better scalability for read operations is
reasonably justified especially for heavy workload stacking which is a valid
use case and likely to become more prevalent. Given the requirements, hashed
locking seems like the best solution here. It doesn't cause noticeable space
overhead and is pretty easy to scale. What do you think?
Thanks.
--
tejun
On Wed, Jan 12, 2022 at 10:08:55AM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Jan 11, 2022 at 10:42:31AM +1100, Imran Khan wrote:
> > The database application has a health monitoring component which
> > regularly collects stats from sysfs. With a small number of databases this
> > was not an issue but recently several customers did some consolidation
> > and ended up having hundreds of databases, all running on the same
> > server and in those setups the contention became more and more evident.
> > As more and more customers are consolidating we have started to get more
> > occurrences of this issue and in this case it all depends on the number of
> > running databases on the server.
> >
> > I will have to reach out to the application team to get a list of all sysfs
> > files being accessed but one of them is
> > "/sys/class/infiniband/<device>/ports/<port number>/gids/<gid index>".
>
> I can imagine a similar scenario w/ cgroups with heavy stacking - each
> application fetches its own stat regularly which isn't a problem in
> isolation but once you put thousands of them on a machine, the shared lock
> can get pretty hot, and the cgroup scenario probably is more convincing in
> that they'd be hitting different files but the lock still gets hot
> because it is shared across them.
>
> Greg, I think the call for better scalability for read operations is
> reasonably justified especially for heavy workload stacking which is a valid
> use case and likely to become more prevalent. Given the requirements, hashed
> locking seems like the best solution here. It doesn't cause noticeable space
> overhead and is pretty easy to scale. What do you think?
I have no objection to changes that remove the lock contention, as long
as they do not add huge additional memory requirements, like the
original submission here did. If using hashed locks is the solution,
wonderful!
thanks,
greg k-h
Hello,
On 13/1/22 7:48 pm, Greg KH wrote:
> On Wed, Jan 12, 2022 at 10:08:55AM -1000, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Jan 11, 2022 at 10:42:31AM +1100, Imran Khan wrote:
>>> The database application has a health monitoring component which
>>> regularly collects stats from sysfs. With a small number of databases this
>>> was not an issue but recently several customers did some consolidation
>>> and ended up having hundreds of databases, all running on the same
>>> server and in those setups the contention became more and more evident.
>>> As more and more customers are consolidating we have started to get more
>>> occurrences of this issue and in this case it all depends on the number of
>>> running databases on the server.
>>>
>>> I will have to reach out to the application team to get a list of all sysfs
>>> files being accessed but one of them is
>>> "/sys/class/infiniband/<device>/ports/<port number>/gids/<gid index>".
>>
>> I can imagine a similar scenario w/ cgroups with heavy stacking - each
>> application fetches its own stat regularly which isn't a problem in
>> isolation but once you put thousands of them on a machine, the shared lock
>> can get pretty hot, and the cgroup scenario probably is more convincing in
>> that they'd be hitting different files but the lock still gets hot
>> because it is shared across them.
>>
>> Greg, I think the call for better scalability for read operations is
>> reasonably justified especially for heavy workload stacking which is a valid
>> use case and likely to become more prevalent. Given the requirements, hashed
>> locking seems like the best solution here. It doesn't cause noticeable space
>> overhead and is pretty easy to scale. What do you think?
>
> I have no objection to changes that remove the lock contention, as long
> as they do not add huge additional memory requirements, like the
> original submission here did. If using hashed locks is the solution,
> wonderful!
>
Thanks Tejun and Greg, for suggestions and reviews so far.
I have sent v3 of this change at [1]. v3 uses hashed locks, and the
overall increase in memory footprint will be around 100K at most (with
the maximum hash table size). I will await your feedback.
[1]:
https://lore.kernel.org/lkml/[email protected]/
Thanks,
-- Imran