2021-11-16 19:43:25

by Minchan Kim

[permalink] [raw]
Subject: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

The kernfs implementation has coarse lock granularity (kernfs_rwsem), so
every kernfs-based filesystem (e.g., sysfs, cgroup, dmabuf) competes for
the same lock. Thus, if one task sleeps for a long time while holding
the lock, the rest have to wait for it. An example is a holder entering
direct reclaim under the lock because it needs a memory allocation.
Fix this with the common technique of releasing the lock, allocating
the memory, and then reacquiring the lock. Fortunately, kernfs nodes
are refcounted, so I hope this is safe.

Signed-off-by: Minchan Kim <[email protected]>
---
fs/kernfs/dir.c | 14 +++++++++++---
fs/kernfs/inode.c | 2 +-
fs/kernfs/kernfs-internal.h | 1 +
3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8e0a1378a4b1..ecdb2975060d 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1119,9 +1119,17 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
up_read(&kernfs_rwsem);
return NULL;
}
- inode = kernfs_get_inode(dir->i_sb, kn);
- if (!inode)
- inode = ERR_PTR(-ENOMEM);
+ kernfs_get(kn);
+ up_read(&kernfs_rwsem);
+ inode = iget_locked(dir->i_sb, kernfs_ino(kn));
+ if (!inode) {
+ kernfs_put(kn);
+ return ERR_PTR(-ENOMEM);
+ }
+ down_read(&kernfs_rwsem);
+ if (inode->i_state & I_NEW)
+ kernfs_init_inode(kn, inode);
+ kernfs_put(kn);
}
/*
* Needed for negative dentry validation.
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index c0eae1725435..6e2004010435 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -195,7 +195,7 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
return 0;
}

-static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
+void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
{
kernfs_get(kn);
inode->i_private = kn;
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index f9cc912c31e1..eef7656f7cd8 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -118,6 +118,7 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
u32 request_mask, unsigned int query_flags);
ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size);
int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
+void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode);

/*
* dir.c
--
2.34.0.rc1.387.gb447b232ab-goog



2021-11-16 19:49:53

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

On Tue, Nov 16, 2021 at 11:43:17AM -0800, Minchan Kim wrote:
> The kernfs implementation has big lock granularity(kernfs_rwsem) so
> every kernfs-based(e.g., sysfs, cgroup, dmabuf) fs are able to compete
> the lock. Thus, if one of userspace goes the sleep under holding
> the lock for a long time, rest of them should wait it. A example is
> the holder goes direct reclaim with the lock since it needs memory
> allocation. Let's fix it at common technique that release the lock
> and then allocate the memory. Fortunately, kernfs looks like have
> an refcount so I hope it's fine.
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> fs/kernfs/dir.c | 14 +++++++++++---
> fs/kernfs/inode.c | 2 +-
> fs/kernfs/kernfs-internal.h | 1 +
> 3 files changed, 13 insertions(+), 4 deletions(-)

What workload hits this lock to cause it to be noticable?

There was a bunch of recent work in this area to make this much more
fine-grained, and the theoretical benchmarks that people created (adding
tens of thousands of scsi disks at boot time) have gotten better.

But in that work, no one could find a real benchmark or use case that
anyone could even notice this type of thing. What do you have that
shows this?

thanks,

greg k-h

2021-11-16 21:36:05

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

On Tue, Nov 16, 2021 at 08:49:46PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Nov 16, 2021 at 11:43:17AM -0800, Minchan Kim wrote:
> > The kernfs implementation has big lock granularity(kernfs_rwsem) so
> > every kernfs-based(e.g., sysfs, cgroup, dmabuf) fs are able to compete
> > the lock. Thus, if one of userspace goes the sleep under holding
> > the lock for a long time, rest of them should wait it. A example is
> > the holder goes direct reclaim with the lock since it needs memory
> > allocation. Let's fix it at common technique that release the lock
> > and then allocate the memory. Fortunately, kernfs looks like have
> > an refcount so I hope it's fine.
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > fs/kernfs/dir.c | 14 +++++++++++---
> > fs/kernfs/inode.c | 2 +-
> > fs/kernfs/kernfs-internal.h | 1 +
> > 3 files changed, 13 insertions(+), 4 deletions(-)
>
> What workload hits this lock to cause it to be noticable?

An app launch: it was dropping frames because the latency was
too long.

>
> There was a bunch of recent work in this area to make this much more
> fine-grained, and the theoritical benchmarks that people created (adding
> 10s of thousands of scsi disks at boot time) have gotten better.
>
> But in that work, no one could find a real benchmark or use case that
> anyone could even notice this type of thing. What do you have that
> shows this?

https://developer.android.com/studio/command-line/perfetto
https://perfetto.dev/docs/data-sources/cpu-scheduling

Android has the perfetto tracing system, which can show where processes
were stuck. In this case it was the lock, since the holder was in the
direct reclaim path.

2021-11-17 06:44:57

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

On Tue, Nov 16, 2021 at 01:36:01PM -0800, Minchan Kim wrote:
> On Tue, Nov 16, 2021 at 08:49:46PM +0100, Greg Kroah-Hartman wrote:
> > On Tue, Nov 16, 2021 at 11:43:17AM -0800, Minchan Kim wrote:
> > > The kernfs implementation has big lock granularity(kernfs_rwsem) so
> > > every kernfs-based(e.g., sysfs, cgroup, dmabuf) fs are able to compete
> > > the lock. Thus, if one of userspace goes the sleep under holding
> > > the lock for a long time, rest of them should wait it. A example is
> > > the holder goes direct reclaim with the lock since it needs memory
> > > allocation. Let's fix it at common technique that release the lock
> > > and then allocate the memory. Fortunately, kernfs looks like have
> > > an refcount so I hope it's fine.
> > >
> > > Signed-off-by: Minchan Kim <[email protected]>
> > > ---
> > > fs/kernfs/dir.c | 14 +++++++++++---
> > > fs/kernfs/inode.c | 2 +-
> > > fs/kernfs/kernfs-internal.h | 1 +
> > > 3 files changed, 13 insertions(+), 4 deletions(-)
> >
> > What workload hits this lock to cause it to be noticable?
>
> A app launching since it was dropping the frame since the
> latency was too long.

How does running a program interact with kernfs filesystems? Which
one(s)?

> > There was a bunch of recent work in this area to make this much more
> > fine-grained, and the theoritical benchmarks that people created (adding
> > 10s of thousands of scsi disks at boot time) have gotten better.
> >
> > But in that work, no one could find a real benchmark or use case that
> > anyone could even notice this type of thing. What do you have that
> > shows this?
>
> https://developer.android.com/studio/command-line/perfetto
> https://perfetto.dev/docs/data-sources/cpu-scheduling

Those are links to a tool, not a test we can run ourselves.

Or how about the output of that tool?

> Android has perfetto tracing system and can show where processes
> were stuck. This case was the lock since holder was in direct reclaim
> path.

Reclaim of what? What is the interaction here with kernfs? Normally
this filesystem is not on any "fast paths" that I know of.

More specifics would be nice :)

thanks,

greg k-h

2021-11-17 07:28:01

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

On Wed, Nov 17, 2021 at 07:44:44AM +0100, Greg Kroah-Hartman wrote:
> On Tue, Nov 16, 2021 at 01:36:01PM -0800, Minchan Kim wrote:
> > On Tue, Nov 16, 2021 at 08:49:46PM +0100, Greg Kroah-Hartman wrote:
> > > On Tue, Nov 16, 2021 at 11:43:17AM -0800, Minchan Kim wrote:
> > > > The kernfs implementation has big lock granularity(kernfs_rwsem) so
> > > > every kernfs-based(e.g., sysfs, cgroup, dmabuf) fs are able to compete
> > > > the lock. Thus, if one of userspace goes the sleep under holding
> > > > the lock for a long time, rest of them should wait it. A example is
> > > > the holder goes direct reclaim with the lock since it needs memory
> > > > allocation. Let's fix it at common technique that release the lock
> > > > and then allocate the memory. Fortunately, kernfs looks like have
> > > > an refcount so I hope it's fine.
> > > >
> > > > Signed-off-by: Minchan Kim <[email protected]>
> > > > ---
> > > > fs/kernfs/dir.c | 14 +++++++++++---
> > > > fs/kernfs/inode.c | 2 +-
> > > > fs/kernfs/kernfs-internal.h | 1 +
> > > > 3 files changed, 13 insertions(+), 4 deletions(-)
> > >
> > > What workload hits this lock to cause it to be noticable?
> >
> > A app launching since it was dropping the frame since the
> > latency was too long.
>
> How does running a program interact with kernfs filesystems? Which
> one(s)?

An app launch involves dma_buf exports, which create kobjects and
add them to kernfs with down_write - kernfs_add_one.

At the same time, on another CPU, a random process was accessing
sysfs: kernfs_iop_lookup was already holding the kernfs_rwsem and
ran under direct reclaim due to the alloc_inode in
kernfs_get_inode.

Therefore, the app is stuck on the lock and loses frames, so the
end user sees the jank.

>
> > > There was a bunch of recent work in this area to make this much more
> > > fine-grained, and the theoritical benchmarks that people created (adding
> > > 10s of thousands of scsi disks at boot time) have gotten better.
> > >
> > > But in that work, no one could find a real benchmark or use case that
> > > anyone could even notice this type of thing. What do you have that
> > > shows this?
> >
> > https://developer.android.com/studio/command-line/perfetto
> > https://perfetto.dev/docs/data-sources/cpu-scheduling
>
> That is links to a tool, not a test we can run ourselves.
>
> Or how about the output of that tool?
>
> > Android has perfetto tracing system and can show where processes
> > were stuck. This case was the lock since holder was in direct reclaim
> > path.
>
> Reclaim of what? What is the interaction here with kernfs? Normally
> this filesystem is not on any "fast paths" that I know of.
>
> More specifics would be nice :)

I hope the above is enough.

2021-11-17 07:39:50

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

On Tue, Nov 16, 2021 at 11:27:56PM -0800, Minchan Kim wrote:
> On Wed, Nov 17, 2021 at 07:44:44AM +0100, Greg Kroah-Hartman wrote:
> > On Tue, Nov 16, 2021 at 01:36:01PM -0800, Minchan Kim wrote:
> > > On Tue, Nov 16, 2021 at 08:49:46PM +0100, Greg Kroah-Hartman wrote:
> > > > On Tue, Nov 16, 2021 at 11:43:17AM -0800, Minchan Kim wrote:
> > > > > The kernfs implementation has big lock granularity(kernfs_rwsem) so
> > > > > every kernfs-based(e.g., sysfs, cgroup, dmabuf) fs are able to compete
> > > > > the lock. Thus, if one of userspace goes the sleep under holding
> > > > > the lock for a long time, rest of them should wait it. A example is
> > > > > the holder goes direct reclaim with the lock since it needs memory
> > > > > allocation. Let's fix it at common technique that release the lock
> > > > > and then allocate the memory. Fortunately, kernfs looks like have
> > > > > an refcount so I hope it's fine.
> > > > >
> > > > > Signed-off-by: Minchan Kim <[email protected]>
> > > > > ---
> > > > > fs/kernfs/dir.c | 14 +++++++++++---
> > > > > fs/kernfs/inode.c | 2 +-
> > > > > fs/kernfs/kernfs-internal.h | 1 +
> > > > > 3 files changed, 13 insertions(+), 4 deletions(-)
> > > >
> > > > What workload hits this lock to cause it to be noticable?
> > >
> > > A app launching since it was dropping the frame since the
> > > latency was too long.
> >
> > How does running a program interact with kernfs filesystems? Which
> > one(s)?
>
> A app launching involves dma_buf exports which creates kobject
> and add it to the kernfs with down_write - kernfs_add_one.

I thought the "create a dma_buf kobject" kernel change was fixed up to
not do that anymore as that was a known performance issue.

Creating kobjects should NOT be on a fast path, if they are, that needs
to be fixed.

> At the same time in other CPU, a random process was accessing
> sysfs and the kernfs_iop_lookup was already hoding the kernfs_rwsem
> and ran under direct reclaim patch due to alloc_inode in
> kerfs_get_inode.

What program is constantly hitting sysfs? sysfs is not for
performance-critical things, right?

> Therefore, the app is stuck on the lock and lose frames so enduser
> sees the jank.

But how does this patch work around it? It seems like you are
special-casing the kobject creation path only.

And is this the case really on 5.15? I thought the kernfs locks were
broken up again to not cause this problem in 5.14 or so.

thanks,

greg k-h

2021-11-17 21:44:06

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

On Wed, Nov 17, 2021 at 08:39:44AM +0100, Greg Kroah-Hartman wrote:
> On Tue, Nov 16, 2021 at 11:27:56PM -0800, Minchan Kim wrote:
> > On Wed, Nov 17, 2021 at 07:44:44AM +0100, Greg Kroah-Hartman wrote:
> > > On Tue, Nov 16, 2021 at 01:36:01PM -0800, Minchan Kim wrote:
> > > > On Tue, Nov 16, 2021 at 08:49:46PM +0100, Greg Kroah-Hartman wrote:
> > > > > On Tue, Nov 16, 2021 at 11:43:17AM -0800, Minchan Kim wrote:
> > > > > > The kernfs implementation has big lock granularity(kernfs_rwsem) so
> > > > > > every kernfs-based(e.g., sysfs, cgroup, dmabuf) fs are able to compete
> > > > > > the lock. Thus, if one of userspace goes the sleep under holding
> > > > > > the lock for a long time, rest of them should wait it. A example is
> > > > > > the holder goes direct reclaim with the lock since it needs memory
> > > > > > allocation. Let's fix it at common technique that release the lock
> > > > > > and then allocate the memory. Fortunately, kernfs looks like have
> > > > > > an refcount so I hope it's fine.
> > > > > >
> > > > > > Signed-off-by: Minchan Kim <[email protected]>
> > > > > > ---
> > > > > > fs/kernfs/dir.c | 14 +++++++++++---
> > > > > > fs/kernfs/inode.c | 2 +-
> > > > > > fs/kernfs/kernfs-internal.h | 1 +
> > > > > > 3 files changed, 13 insertions(+), 4 deletions(-)
> > > > >
> > > > > What workload hits this lock to cause it to be noticable?
> > > >
> > > > A app launching since it was dropping the frame since the
> > > > latency was too long.
> > >
> > > How does running a program interact with kernfs filesystems? Which
> > > one(s)?
> >
> > A app launching involves dma_buf exports which creates kobject
> > and add it to the kernfs with down_write - kernfs_add_one.
>
> I thought the "create a dma_buf kobject" kernel change was fixed up to
> not do that anymore as that was a known performance issue.
>
> Creating kobjects should NOT be on a fast path, if they are, that needs
> to be fixed.

Ccing Hridya

I already mentioned this before, but that's how it is as-is.
It would be great to have that fixed, but kernfs still has the
problem regardless of dmabuf.

For example, process A holds the lock as a reader, tries to
allocate memory, and enters reclaim. Process B tries to take the
lock as a writer (e.g., kernfs_iop_setattr) but has to wait until
process A completes. Next, process C also gets stuck on the lock as
a reader when it tries to access sysfs, since writer B has already
been waiting on the rwsem past the threshold.
Here, process C can play a critical role in the jank:
all it was doing was accessing sysfs, yet it got stuck.

>
> > At the same time in other CPU, a random process was accessing
> > sysfs and the kernfs_iop_lookup was already hoding the kernfs_rwsem
> > and ran under direct reclaim patch due to alloc_inode in
> > kerfs_get_inode.
>
> What program is constantly hitting sysfs? sysfs is not for
> performance-critical things, right?

Constantly hitting sysfs is not itself a problem, since those readers
need telemetry data from sysfs and are not latency sensitive. The
problem is that reading kernfs can cause trouble for everyone else who
wants access once the rwsem gets involved in exclusive-lock mode.
Please see the scenario above.

>
> > Therefore, the app is stuck on the lock and lose frames so enduser
> > sees the jank.
>
> But how does this patch work around it? It seems like you are
> special-casing the kobject creation path only.

True. If another path also allocates memory while holding the
kernfs_rwsem, it could cause the same trouble.

>
> And is this the case really on 5.15? I thought the kernfs locks were
> broken up again to not cause this problem in 5.14 or so.

It happened on 5.10, but the path I mentioned is still vulnerable with
the rwsem, so it could happen on 5.15, too.

2021-11-17 21:45:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

Hello,

On Tue, Nov 16, 2021 at 11:27:56PM -0800, Minchan Kim wrote:
> A app launching involves dma_buf exports which creates kobject
> and add it to the kernfs with down_write - kernfs_add_one.
>
> At the same time in other CPU, a random process was accessing
> sysfs and the kernfs_iop_lookup was already hoding the kernfs_rwsem
> and ran under direct reclaim patch due to alloc_inode in
> kerfs_get_inode.
>
> Therefore, the app is stuck on the lock and lose frames so enduser
> sees the jank.

So, one really low hanging fruit here would be using a separate rwsem per
superblock. Nothing needs synchronization across different users of kernfs
and the locking is shared just because nobody bothered to separate them out
while generalizing it from sysfs.

Thanks.

--
tejun

2021-11-17 22:13:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

Hi Tejun,

On Wed, Nov 17, 2021 at 11:45:46AM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Nov 16, 2021 at 11:27:56PM -0800, Minchan Kim wrote:
> > A app launching involves dma_buf exports which creates kobject
> > and add it to the kernfs with down_write - kernfs_add_one.
> >
> > At the same time in other CPU, a random process was accessing
> > sysfs and the kernfs_iop_lookup was already hoding the kernfs_rwsem
> > and ran under direct reclaim patch due to alloc_inode in
> > kerfs_get_inode.
> >
> > Therefore, the app is stuck on the lock and lose frames so enduser
> > sees the jank.
>
> So, one really low hanging fruit here would be using a separate rwsem per
> superblock. Nothing needs synchronization across different users of kernfs
> and the locking is shared just because nobody bothered to separate them out
> while generalizing it from sysfs.

That's really what I wanted, but I had a question about whether we can
reach the superblock from the kernfs_node all the time, since some
functions take the kernfs_rwsem without inode/sb context.

Is it doable to get the superblock from the kernfs_node in every case?

2021-11-17 22:24:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

Hello,

On Wed, Nov 17, 2021 at 02:13:35PM -0800, Minchan Kim wrote:
> > So, one really low hanging fruit here would be using a separate rwsem per
> > superblock. Nothing needs synchronization across different users of kernfs
> > and the locking is shared just because nobody bothered to separate them out
> > while generalizing it from sysfs.
>
> That's really what I wanted but had a question whether we can access
> superblock from the kernfs_node all the time since there are some
> functions to access the kernfs_rwsem without ionde, sb context.
>
> Is it doable to get the superblock from the kernfs_node all the time?

Ah, right, kernfs_node doesn't point back to kernfs_root. I guess it can go
one of three ways:

a. Follow parent until root kernfs_node and make that guy point to
kernfs_root through its parent field. This isn't great but the hotter
paths all have sb / inode already, I think, so if we do this only in the
really cold paths, it likely isn't too bad.

b. Change the interface so that the callers have to provide kernfs_root. I
don't think this is gonna be a huge problem. There are a few users of
kernfs and they always know their roots.

c. Add a field to kernfs_node so that we can always find kernfs_root.

I think b is likely the cheapest && cleanest.

Thanks.

--
tejun

2021-11-18 01:55:44

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

On Wed, Nov 17, 2021 at 12:23:55PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Nov 17, 2021 at 02:13:35PM -0800, Minchan Kim wrote:
> > > So, one really low hanging fruit here would be using a separate rwsem per
> > > superblock. Nothing needs synchronization across different users of kernfs
> > > and the locking is shared just because nobody bothered to separate them out
> > > while generalizing it from sysfs.
> >
> > That's really what I wanted but had a question whether we can access
> > superblock from the kernfs_node all the time since there are some
> > functions to access the kernfs_rwsem without ionde, sb context.
> >
> > Is it doable to get the superblock from the kernfs_node all the time?
>
> Ah, right, kernfs_node doesn't point back to kernfs_root. I guess it can go
> one of three ways:

Thanks for the suggestion, Tejun.

I found kernfs_root(), and it seems to return the kernfs_root from a
kernfs_node. If that holds in every case, we could put the rwsem in
kernfs_root and the change would be straightforward. Do you see any problem?

>
> a. Follow parent until root kernfs_node and make that guy point to
> kernfs_root through its parent field. This isn't great but the hotter
> paths all have sb / inode already, I think, so if we do this only in the
> really cold paths, it likely isn't too bad.
>
> b. Change the interface so that the callers have to provide kernfs_root. I
> don't think this is gonna be a huge problem. There are a few users of
> kernfs and they always know their roots.
>
> c. Add a field to kernfs_node so that we can always find kernfs_root.
>
> I think b is likely the cheapest && cleanest.
>
> Thanks.
>
> --
> tejun

2021-11-18 16:35:24

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

On Wed, Nov 17, 2021 at 05:55:33PM -0800, Minchan Kim wrote:
> > Ah, right, kernfs_node doesn't point back to kernfs_root. I guess it can go
> > one of three ways:
>
> Thanks for the suggestion, Tejun.
>
> I found kernfs_root and it seems like to return kernfs_root from kernfs_node.
> If it's true all the case, we would put the rwsem in kernfs_root and change
> would be straightforward. Do you see any problem?

Ah, you're right. I forgot that directory kernfs_node always points to the
root so we need at most one walk up the tree to locate kernfs_root. Yeah,
that should definitely work.

Thanks.

--
tejun