2022-12-02 17:24:58

by Brian Foster

[permalink] [raw]
Subject: [PATCH v3 5/5] procfs: use efficient tgid pid search on root readdir

find_ge_pid() walks every allocated id and checks every associated
pid in the namespace for a link to a PIDTYPE_TGID task. If the pid
namespace contains processes with large numbers of threads, this
search doesn't scale and can notably increase getdents() syscall
latency.

For example, on a mostly idle 2.4GHz Intel Xeon running Fedora on
5.19.0-rc2, 'strace -T xfs_io -c readdir /proc' shows the following:

getdents64(... /* 814 entries */, 32768) = 20624 <0.000568>

With the addition of a dummy (i.e. idle) process running that
creates an additional 100k threads, that latency increases to:

getdents64(... /* 815 entries */, 32768) = 20656 <0.011315>

While this may not be noticeable to users in one off /proc scans or
simple usage of ps or top, we have users that report problems caused
by this latency increase in these sort of scaled environments with
custom tooling that makes heavier use of task monitoring.

Optimize the tgid task scanning in proc_pid_readdir() by using the
more efficient find_get_tgid_task() helper. This significantly
improves readdir() latency when the pid namespace is populated with
processes with very large thread counts. For example, the above 100k
idle task test against a patched kernel now results in the
following:

Idle:
getdents64(... /* 861 entries */, 32768) = 21048 <0.000670>

"" + 100k threads:
getdents64(... /* 862 entries */, 32768) = 21096 <0.000959>

... which is a much smaller latency hit after the high thread count
task is started.

Signed-off-by: Brian Foster <[email protected]>
---
fs/proc/base.c | 17 +----------------
1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9e479d7d202b..ac34b6bb7249 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3475,24 +3475,9 @@ struct tgid_iter {
};
static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
{
- struct pid *pid;
-
if (iter.task)
put_task_struct(iter.task);
- rcu_read_lock();
-retry:
- iter.task = NULL;
- pid = find_ge_pid(iter.tgid, ns);
- if (pid) {
- iter.tgid = pid_nr_ns(pid, ns);
- iter.task = pid_task(pid, PIDTYPE_TGID);
- if (!iter.task) {
- iter.tgid += 1;
- goto retry;
- }
- get_task_struct(iter.task);
- }
- rcu_read_unlock();
+ iter.task = find_get_tgid_task(&iter.tgid, ns);
return iter;
}

--
2.37.3


2022-12-12 02:37:50

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] procfs: use efficient tgid pid search on root readdir

On 3/12/22 01:16, Brian Foster wrote:
> find_ge_pid() walks every allocated id and checks every associated
> pid in the namespace for a link to a PIDTYPE_TGID task. If the pid
> namespace contains processes with large numbers of threads, this
> search doesn't scale and can notably increase getdents() syscall
> latency.
>
> For example, on a mostly idle 2.4GHz Intel Xeon running Fedora on
> 5.19.0-rc2, 'strace -T xfs_io -c readdir /proc' shows the following:
>
> getdents64(... /* 814 entries */, 32768) = 20624 <0.000568>
>
> With the addition of a dummy (i.e. idle) process running that
> creates an additional 100k threads, that latency increases to:
>
> getdents64(... /* 815 entries */, 32768) = 20656 <0.011315>
>
> While this may not be noticeable to users in one off /proc scans or
> simple usage of ps or top, we have users that report problems caused
> by this latency increase in these sort of scaled environments with
> custom tooling that makes heavier use of task monitoring.
>
> Optimize the tgid task scanning in proc_pid_readdir() by using the
> more efficient find_get_tgid_task() helper. This significantly
> improves readdir() latency when the pid namespace is populated with
> processes with very large thread counts. For example, the above 100k
> idle task test against a patched kernel now results in the
> following:
>
> Idle:
> getdents64(... /* 861 entries */, 32768) = 21048 <0.000670>
>
> "" + 100k threads:
> getdents64(... /* 862 entries */, 32768) = 21096 <0.000959>
>
> ... which is a much smaller latency hit after the high thread count
> task is started.


This may not sound like much but in the environment where it

was reported it makes quite a difference.


The thing is that the scenario above sounds totally unreal

but apparently it isn't and even if it was think about

many thread group leaders each with even a moderately large

number of threads and the observed overhead problem becomes

clear.


>
> Signed-off-by: Brian Foster <[email protected]>


Reviewed-by: Ian Kent <[email protected]>


Ian

> ---
> fs/proc/base.c | 17 +----------------
> 1 file changed, 1 insertion(+), 16 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 9e479d7d202b..ac34b6bb7249 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3475,24 +3475,9 @@ struct tgid_iter {
> };
> static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter iter)
> {
> - struct pid *pid;
> -
> if (iter.task)
> put_task_struct(iter.task);
> - rcu_read_lock();
> -retry:
> - iter.task = NULL;
> - pid = find_ge_pid(iter.tgid, ns);
> - if (pid) {
> - iter.tgid = pid_nr_ns(pid, ns);
> - iter.task = pid_task(pid, PIDTYPE_TGID);
> - if (!iter.task) {
> - iter.tgid += 1;
> - goto retry;
> - }
> - get_task_struct(iter.task);
> - }
> - rcu_read_unlock();
> + iter.task = find_get_tgid_task(&iter.tgid, ns);
> return iter;
> }
>