2022-10-17 13:58:26

by Vishal Chourasia

Subject: sched/debug: CPU hotplug operation suffers in a large cpu systems

The smt=off operation on a system with 1920 CPUs takes approx. 59 minutes on
v5.14 versus 29 minutes on v5.11, measured using:
# time ppc64_cpu --smt=off

A git bisect between kernel v5.11 and v5.14 pointed to commit 3b87f136f8fc
("sched,debug: Convert sysctl sched_domains to debugfs"). This commit moves
the sched_domain information that was originally exported via sysctl to
debugfs.

Reverting the said commit gives us the expected good result.

Previously, the sched_domain information was exported via procfs (sysctl) at
/proc/sys/kernel/sched_domain/, but now it is exported via debugfs at
/sys/kernel/debug/sched/domains/.

We also observe the regression in kernel v6.0-rc4, and it vanishes after
reverting commit 3b87f136f8fc.

# Output of `time ppc64_cpu --smt=off` on different kernel versions
|-------------------------------------+------------+----------+----------|
| kernel version                      | real       | user     | sys      |
|-------------------------------------+------------+----------+----------|
| v5.11                               | 29m22.007s | 0m0.001s | 0m6.444s |
| v5.14                               | 58m15.719s | 0m0.037s | 0m7.482s |
| v6.0-rc4                            | 59m30.318s | 0m0.055s | 0m7.681s |
| v6.0-rc4 with 3b87f136f8fc reverted | 32m20.486s | 0m0.029s | 0m7.361s |
|-------------------------------------+------------+----------+----------|

A machine with 1920 CPUs was used for the above experiments; the output of
lscpu is included below.

# lscpu
Architecture:           ppc64le
Byte Order:             Little Endian
CPU(s):                 1920
On-line CPU(s) list:    0-1919
Model name:             POWER10 (architected), altivec supported
Model:                  2.0 (pvr 0080 0200)
Thread(s) per core:     8
Core(s) per socket:     14
Socket(s):              17
Physical sockets:       15
Physical chips:         1
Physical cores/chip:    16

Through our experiments we have found that, even when offlining a single CPU,
the functions responsible for exporting sched_domain information take more
time with debugfs than with sysctl.

Experiments using the trace-cmd function-graph plugin show that the execution
times of certain functions common to both scenarios (procfs and debugfs)
differ drastically.

The table below lists the execution times of some of these symbols for the
sysctl (procfs) and debugfs cases.

|--------------------------------+----------------+--------------|
| method                         | sysctl         | debugfs      |
|--------------------------------+----------------+--------------|
| unregister_sysctl_table        | 0.020050 s     | NA           |
| build_sched_domains            | 3.090563 s     | 3.119130 s   |
| register_sched_domain_sysctl   | 0.065487 s     | NA           |
| update_sched_domain_debugfs    | NA             | 2.791232 s   |
| partition_sched_domains_locked | 3.195958 s     | 5.933254 s   |
|--------------------------------+----------------+--------------|

Note: partition_sched_domains_locked internally calls build_sched_domains and
then calls the functions corresponding to whichever interface is currently
used to export the information, i.e. sysctl or debugfs.

The above numbers are from the case where we offlined 1 CPU on a system with
1920 online CPUs.

From the above table, register_sched_domain_sysctl and
unregister_sysctl_table collectively took ~0.085 s, whereas
update_sched_domain_debugfs took ~2.79 s.

Root cause:

The observed regression stems from the way these two pseudo-filesystems handle
creation and deletion of files and directories internally.

update_sched_domain_debugfs builds and exports the sched_domain information
to userspace. It begins by tearing down the per-CPU directories under
/sys/kernel/debug/sched/domains/ one by one for each possible CPU, and then
recreates the per-CPU, per-domain files and directories, again one by one for
each possible CPU.

Excerpt from the trace-cmd output for the debugfs case
...
| update_sched_domain_debugfs() {
+ 14.526 us | debugfs_lookup();
# 1092.64 us | debugfs_remove();
+ 48.408 us | debugfs_create_dir(); - creates per-cpu dir
9.038 us | debugfs_create_dir(); - creates per-domain dir
9.638 us | debugfs_create_ulong(); -+
7.762 us | debugfs_create_ulong(); |
7.776 us | debugfs_create_u64(); |
7.502 us | debugfs_create_u32(); |__ creates per-domain files
7.646 us | debugfs_create_u32(); |
7.702 us | debugfs_create_u32(); |
6.974 us | debugfs_create_str(); |
7.628 us | debugfs_create_file(); -+
... - repeat other domains and cpus
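
For reference, here is a minimal sketch of that rebuild pattern, pieced
together from the description and the trace above (the structure is assumed;
the names sd_sysctl_cpus and sd_dentry come from kernel/sched/debug.c, but
details of the real update_sched_domain_debugfs() may differ):

/*
 * Sketch only: rebuild the whole domains/ tree, one CPU at a time.
 * Every call below is a separate debugfs dentry operation.
 */
static void sketch_rebuild_sched_domain_debugfs(void)
{
        int cpu;

        for_each_cpu(cpu, sd_sysctl_cpus) {
                struct sched_domain *sd;
                struct dentry *d_cpu;
                char buf[32];
                int i = 0;

                /* drop the stale cpuN directory, then recreate it */
                snprintf(buf, sizeof(buf), "cpu%d", cpu);
                debugfs_remove(debugfs_lookup(buf, sd_dentry));
                d_cpu = debugfs_create_dir(buf, sd_dentry);

                /* one sub-directory plus ~8 files per domain level */
                for_each_domain(cpu, sd) {
                        struct dentry *d_sd;

                        snprintf(buf, sizeof(buf), "domain%d", i++);
                        d_sd = debugfs_create_dir(buf, d_cpu);
                        /* ~8 debugfs_create_*(..., d_sd, ...) calls go here */
                }
        }
}

Going by the trace, each CPU costs on the order of 1.1 ms for the
debugfs_remove() alone plus a few hundred microseconds of create calls;
multiplied across 1920 possible CPUs, that roughly accounts for the ~2.79 s
spent in update_sched_domain_debugfs.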

As a first step, we used debugfs_remove_recursive to remove the entries for
each CPU in one go instead of calling debugfs_remove per CPU, but we did not
see any improvement whatsoever.
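
Concretely, the experiment was roughly of this shape (hypothetical sketch,
not the actual diff we tested):

        /*
         * Hypothetical: drop the whole domains/ tree in one call instead of
         * one debugfs_lookup() + debugfs_remove() pair per possible CPU,
         * then recreate the directory and repopulate it as before.
         */
        debugfs_remove_recursive(sd_dentry);
        sd_dentry = debugfs_create_dir("domains", debugfs_sched);

Presumably the cost is per dentry rather than per call, so batching the
removals alone does not change the total amount of work.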

We understand that debugfs does not concern itself with performance, and that
the smt=off operation is not invoked very often, statistically speaking.
However, as the number of CPUs in a system scales, debugfs becomes a massive
performance bottleneck that should not be ignored.

Even on a system with only 240 CPUs, update_sched_domain_debugfs is roughly
17 times as expensive as register_sched_domain_sysctl (about 1600% slower)
when building the sched_domain directories for those 240 CPUs.

# For 240 CPU system
|------------------------------+---------------|
| method                       | time taken    |
|------------------------------+---------------|
| update_sched_domain_debugfs  | 236550.996 us |
| register_sched_domain_sysctl | 13907.940 us  |
|------------------------------+---------------|
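
For reference, the ratio from the table works out to
(236550.996 - 13907.940) / 13907.940 ~= 16, i.e. the debugfs rebuild takes
about 17x the time of the old sysctl registration for the same 240 CPUs.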

Any ideas from the community on how to improve this are much appreciated.

Meanwhile, we will keep posting our progress updates.

-- vishal.c



2022-10-17 14:21:47

by Peter Zijlstra

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems


+GregKH who actually knows about debugfs.

On Mon, Oct 17, 2022 at 06:40:49PM +0530, Vishal Chourasia wrote:
> smt=off operation on system with 1920 CPUs is taking approx 59 mins on v5.14
> versus 29 mins on v5.11 measured using:
> # time ppc64_cpu --smt=off
>
> Doing a git bisect between kernel v5.11 and v5.14 pointed to the commit
> 3b87f136f8fc ("sched,debug: Convert sysctl sched_domains to debugfs"). This
> commit moves sched_domain information that was originally exported using sysctl
> to debugfs.
>
> Reverting the said commit, gives us the expected good result.
>
> Previously sched domain information was exported at procfs(sysctl):
> /proc/sys/kernel/sched_domain/ but now it gets exported at debugfs
> :/sys/kernel/debug/sched/domains/
>
> We also observe regression in kernel v6.0-rc4, which vanishes after reverting
> the commit 3b87f136f8fc
>
> # Output of `time ppc64_cpu --smt=off` on different kernel versions
> |-------------------------------------+------------+----------+----------|
> | kernel version | real | user | sys |
> |-------------------------------------+------------+----------+----------|
> | v5.11 | 29m22.007s | 0m0.001s | 0m6.444s |
> | v5.14 | 58m15.719s | 0m0.037s | 0m7.482s |
> | v6.0-rc4 | 59m30.318s | 0m0.055s | 0m7.681s |
> | v6.0-rc4 with 3b87f136f8fc reverted | 32m20.486s | 0m0.029s | 0m7.361s |
> |-------------------------------------+------------+----------+----------|
>
> Machine with 1920 cpus was used for the above experiments. Output of lscpu is
> added below.
>
> # lscpu
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 1920
> On-line CPU(s) list: 0-1919
> Model name: POWER10 (architected), altivec supported
> Model: 2.0 (pvr 0080 0200)
> Thread(s) per core: 8
> Core(s) per socket: 14
> Socket(s): 17
> Physical sockets: 15
> Physical chips: 1
> Physical cores/chip: 16
>
> Through our experiments we have found that even when offlining 1 cpu, functions
> responsible for exporting sched_domain information took more time in case of
> debugfs relative to sysctl.
>
> Experiments using trace-cmd function-graph plugin have shown execution time for
> certain methods common in both the scenarios (procfs and debugfs) differ
> drastically.
>
> Below table list the execution time for some of the symbols for sysctl(procfs)
> and debugfs case.
>
> |--------------------------------+----------------+--------------|
> | method | sysctl | debugfs |
> |--------------------------------+----------------+--------------|
> | unregister_sysctl_table | 0.020050 s | NA |
> | build_sched_domains | 3.090563 s | 3.119130 s |
> | register_sched_domain_sysctl | 0.065487 s | NA |
> | update_sched_domain_debugfs | NA | 2.791232 s |
> | partition_sched_domains_locked | 3.195958 s | 5.933254 s |
> |--------------------------------+----------------+--------------|
>
> Note: partition_sched_domains_locked internally calls build_sched_domains
> and calls other functions respective to what's being currently used to
> export information i.e. sysctl or debugfs
>
> Above numbers are quoted from the case where we tried offlining 1 cpu in system
> with 1920 online cpus.
>
> From the above table, register_sched_domain_sysctl and
> unregister_sysctl_table_collectively took ~0.085 secs, whereas
> update_sched_domain_debugfs took ~2.79 secs.
>
> Root cause:
>
> The observed regression stems from the way these two pseudo-filesystems handle
> creation and deletion of files and directories internally.
>
> update_sched_domain_debugfs builds and exports sched_domains information to the
> userspace. It begins with removing/tearing down the per-cpu directories present
> in /sys/kernel/debug/sched/domains/ one by one for each possible cpu. And then
> recreate per-cpu per-domain files and directories one by one for each possible
> cpus.
>
> Excerpt from the trace-cmd output for the debugfs case
> ...
> | update_sched_domain_debugfs() {
> + 14.526 us | debugfs_lookup();
> # 1092.64 us | debugfs_remove();
> + 48.408 us | debugfs_create_dir(); - creates per-cpu dir
> 9.038 us | debugfs_create_dir(); - creates per-domain dir
> 9.638 us | debugfs_create_ulong(); -+
> 7.762 us | debugfs_create_ulong(); |
> 7.776 us | debugfs_create_u64(); |
> 7.502 us | debugfs_create_u32(); |__ creates per-domain files
> 7.646 us | debugfs_create_u32(); |
> 7.702 us | debugfs_create_u32(); |
> 6.974 us | debugfs_create_str(); |
> 7.628 us | debugfs_create_file(); -+
> ... - repeat other domains and cpus
>
> As a first step, We used debugfs_remove_recursive to remove entries for each cpu
> in one go instead of calling debugfs_remove per cpu. But, We did not see any
> improvement whatsoever.
>
> We understand debugfs doesn't concern itself with performance, and that smt=off
> operation won't be invoked much often, statistically speaking. But, We should
> understand that as the CPUs in a system will scale debugfs becomes a massive
> performance bottleneck that shouldn't be ignored.
>
> Even in a system with 240 CPUs system, update_sched_domain_debugfs is 1600 times
> slower than register_sched_domain_sysctl when building sched_domain directory
> for 240 CPUs only.
>
> # For 240 CPU system
> |------------------------------+---------------|
> | method | time taken |
> |------------------------------+---------------|
> | update_sched_domain_debugfs | 236550.996 us |
> | register_sched_domain_sysctl | 13907.940 us |
> |------------------------------+---------------|
>
> Any ideas on how to improve here from the community is much appreciated.
>
> Meanwhile, We will keep posting our progress updates.
>
> -- vishal.c


2022-10-17 16:02:19

by Greg Kroah-Hartman

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Mon, Oct 17, 2022 at 04:19:31PM +0200, Peter Zijlstra wrote:
>
> +GregKH who actually knows about debugfs.
>
> On Mon, Oct 17, 2022 at 06:40:49PM +0530, Vishal Chourasia wrote:
> > smt=off operation on system with 1920 CPUs is taking approx 59 mins on v5.14
> > versus 29 mins on v5.11 measured using:
> > # time ppc64_cpu --smt=off
> >
> > Doing a git bisect between kernel v5.11 and v5.14 pointed to the commit
> > 3b87f136f8fc ("sched,debug: Convert sysctl sched_domains to debugfs"). This
> > commit moves sched_domain information that was originally exported using sysctl
> > to debugfs.
> >
> > Reverting the said commit, gives us the expected good result.
> >
> > Previously sched domain information was exported at procfs(sysctl):
> > /proc/sys/kernel/sched_domain/ but now it gets exported at debugfs
> > :/sys/kernel/debug/sched/domains/
> >
> > We also observe regression in kernel v6.0-rc4, which vanishes after reverting
> > the commit 3b87f136f8fc
> >
> > # Output of `time ppc64_cpu --smt=off` on different kernel versions
> > |-------------------------------------+------------+----------+----------|
> > | kernel version | real | user | sys |
> > |-------------------------------------+------------+----------+----------|
> > | v5.11 | 29m22.007s | 0m0.001s | 0m6.444s |
> > | v5.14 | 58m15.719s | 0m0.037s | 0m7.482s |
> > | v6.0-rc4 | 59m30.318s | 0m0.055s | 0m7.681s |
> > | v6.0-rc4 with 3b87f136f8fc reverted | 32m20.486s | 0m0.029s | 0m7.361s |
> > |-------------------------------------+------------+----------+----------|
> >
> > Machine with 1920 cpus was used for the above experiments. Output of lscpu is
> > added below.
> >
> > # lscpu
> > Architecture: ppc64le
> > Byte Order: Little Endian
> > CPU(s): 1920
> > On-line CPU(s) list: 0-1919
> > Model name: POWER10 (architected), altivec supported
> > Model: 2.0 (pvr 0080 0200)
> > Thread(s) per core: 8
> > Core(s) per socket: 14
> > Socket(s): 17
> > Physical sockets: 15
> > Physical chips: 1
> > Physical cores/chip: 16
> >
> > Through our experiments we have found that even when offlining 1 cpu, functions
> > responsible for exporting sched_domain information took more time in case of
> > debugfs relative to sysctl.
> >
> > Experiments using trace-cmd function-graph plugin have shown execution time for
> > certain methods common in both the scenarios (procfs and debugfs) differ
> > drastically.
> >
> > Below table list the execution time for some of the symbols for sysctl(procfs)
> > and debugfs case.
> >
> > |--------------------------------+----------------+--------------|
> > | method | sysctl | debugfs |
> > |--------------------------------+----------------+--------------|
> > | unregister_sysctl_table | 0.020050 s | NA |
> > | build_sched_domains | 3.090563 s | 3.119130 s |
> > | register_sched_domain_sysctl | 0.065487 s | NA |
> > | update_sched_domain_debugfs | NA | 2.791232 s |
> > | partition_sched_domains_locked | 3.195958 s | 5.933254 s |
> > |--------------------------------+----------------+--------------|
> >
> > Note: partition_sched_domains_locked internally calls build_sched_domains
> > and calls other functions respective to what's being currently used to
> > export information i.e. sysctl or debugfs
> >
> > Above numbers are quoted from the case where we tried offlining 1 cpu in system
> > with 1920 online cpus.
> >
> > From the above table, register_sched_domain_sysctl and
> > unregister_sysctl_table_collectively took ~0.085 secs, whereas
> > update_sched_domain_debugfs took ~2.79 secs.
> >
> > Root cause:
> >
> > The observed regression stems from the way these two pseudo-filesystems handle
> > creation and deletion of files and directories internally.

Yes, debugfs is not optimized for speed or memory usage at all. This
happens to be the first code path I have seen that cares about this for
debugfs files.

You can either work on not creating so many debugfs files (do you really
really need all of them all the time?) Or you can work on moving
debugfs to use kernfs as the backend logic, which will save you both
speed and memory usage overall as kernfs is used to being used on
semi-fast paths.

Maybe do both?

hope this helps,

greg k-h

2022-10-18 11:06:39

by Vishal Chourasia

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Mon, Oct 17, 2022 at 04:54:11PM +0200, Greg Kroah-Hartman wrote:
> On Mon, Oct 17, 2022 at 04:19:31PM +0200, Peter Zijlstra wrote:
> >
> > +GregKH who actually knows about debugfs.
> >
> > On Mon, Oct 17, 2022 at 06:40:49PM +0530, Vishal Chourasia wrote:
> > > smt=off operation on system with 1920 CPUs is taking approx 59 mins on v5.14
> > > versus 29 mins on v5.11 measured using:
> > > # time ppc64_cpu --smt=off
> > >
> > >
> > > |--------------------------------+----------------+--------------|
> > > | method | sysctl | debugfs |
> > > |--------------------------------+----------------+--------------|
> > > | unregister_sysctl_table | 0.020050 s | NA |
> > > | build_sched_domains | 3.090563 s | 3.119130 s |
> > > | register_sched_domain_sysctl | 0.065487 s | NA |
> > > | update_sched_domain_debugfs | NA | 2.791232 s |
> > > | partition_sched_domains_locked | 3.195958 s | 5.933254 s |
> > > |--------------------------------+----------------+--------------|
> > >
> > > Note: partition_sched_domains_locked internally calls build_sched_domains
> > > and calls other functions respective to what's being currently used to
> > > export information i.e. sysctl or debugfs
> > >
> > > Above numbers are quoted from the case where we tried offlining 1 cpu in system
> > > with 1920 online cpus.
> > >
> > > From the above table, register_sched_domain_sysctl and
> > > unregister_sysctl_table collectively took ~0.085 secs, whereas
> > > update_sched_domain_debugfs took ~2.79 secs.
> > >
> > > Root cause:
> > >
> > > The observed regression stems from the way these two pseudo-filesystems handle
> > > creation and deletion of files and directories internally.
>
> Yes, debugfs is not optimized for speed or memory usage at all. This
> happens to be the first code path I have seen that cares about this for
> debugfs files.
>
> You can either work on not creating so many debugfs files (do you really
> really need all of them all the time?) Or you can work on moving
> debugfs to use kernfs as the backend logic, which will save you both
> speed and memory usage overall as kernfs is used to being used on
> semi-fast paths.
>
> Maybe do both?
>
> hope this helps,
>
> greg k-h

Yes, we need to create 7-8 files per domain per CPU, which eventually ends up
creating a lot of files.

Peter,

Is there a possibility of reverting back to /proc/sys/kernel/sched_domain/?

-- vishal.c



2022-10-18 11:14:06

by Greg Kroah-Hartman

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Tue, Oct 18, 2022 at 04:07:06PM +0530, Vishal Chourasia wrote:
> On Mon, Oct 17, 2022 at 04:54:11PM +0200, Greg Kroah-Hartman wrote:
> > On Mon, Oct 17, 2022 at 04:19:31PM +0200, Peter Zijlstra wrote:
> > >
> > > +GregKH who actually knows about debugfs.
> > >
> > > On Mon, Oct 17, 2022 at 06:40:49PM +0530, Vishal Chourasia wrote:
> > > > smt=off operation on system with 1920 CPUs is taking approx 59 mins on v5.14
> > > > versus 29 mins on v5.11 measured using:
> > > > # time ppc64_cpu --smt=off
> > > >
> > > >
> > > > |--------------------------------+----------------+--------------|
> > > > | method | sysctl | debugfs |
> > > > |--------------------------------+----------------+--------------|
> > > > | unregister_sysctl_table | 0.020050 s | NA |
> > > > | build_sched_domains | 3.090563 s | 3.119130 s |
> > > > | register_sched_domain_sysctl | 0.065487 s | NA |
> > > > | update_sched_domain_debugfs | NA | 2.791232 s |
> > > > | partition_sched_domains_locked | 3.195958 s | 5.933254 s |
> > > > |--------------------------------+----------------+--------------|
> > > >
> > > > Note: partition_sched_domains_locked internally calls build_sched_domains
> > > > and calls other functions respective to what's being currently used to
> > > > export information i.e. sysctl or debugfs
> > > >
> > > > Above numbers are quoted from the case where we tried offlining 1 cpu in system
> > > > with 1920 online cpus.
> > > >
> > > > From the above table, register_sched_domain_sysctl and
> > > > unregister_sysctl_table collectively took ~0.085 secs, whereas
> > > > update_sched_domain_debugfs took ~2.79 secs.
> > > >
> > > > Root cause:
> > > >
> > > > The observed regression stems from the way these two pseudo-filesystems handle
> > > > creation and deletion of files and directories internally.
> >
> > Yes, debugfs is not optimized for speed or memory usage at all. This
> > happens to be the first code path I have seen that cares about this for
> > debugfs files.
> >
> > You can either work on not creating so many debugfs files (do you really
> > really need all of them all the time?) Or you can work on moving
> > debugfs to use kernfs as the backend logic, which will save you both
> > speed and memory usage overall as kernfs is used to being used on
> > semi-fast paths.
> >
> > Maybe do both?
> >
> > hope this helps,
> >
> > greg k-h
>
> Yes, we need to create 7-8 files per domain per CPU, eventually ending up
> creating a lot of files.

Why do you need to? What tools require these debugfs files to be
present?

And if you only have 7-8 files per CPU, that does not seem like a lot of
files overall (14000-16000)? If you only offline 1 cpu, how is removing
7 or 8 files a bottleneck? Do you really offline 1999 cpus for a 2k
system?

> Is there a possibility of reverting back to /proc/sys/kernel/sched_domain/?

No, these are debugging-only things, they do not belong in /proc/

If you rely on them for real functionality, that's a different story,
but I want to know what tool uses them and for what functionality as
debugfs should never be relied on for normal operation of a system.

thanks,

greg k-h

2022-10-26 07:02:50

by Vishal Chourasia

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Tue, Oct 18, 2022 at 01:04:40PM +0200, Greg Kroah-Hartman wrote:

> Why do you need to? What tools require these debugfs files to be
> present?

We are not entirely sure what applications (if any) might be using this interface.

> And if you only have 7-8 files per CPU, that does not seem like a lot of
> files overall (14000-16000)? If you only offline 1 cpu, how is removing
> 7 or 8 files a bottleneck? Do you really offline 1999 cpus for a 2k
> system?

It's 7-8 files per domain per CPU, so on a system with approx 2k CPUs and
five domains the total file count goes above 70k-80k files. And when we
offline even 1 CPU, the entire directory is rebuilt, resulting in the
creation of all the files again.
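
Rough arithmetic with the numbers above: ~2000 CPUs x 5 domain levels x 8
files per domain ~= 80,000 debugfs entries, each of which is removed and
recreated on every rebuild.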

Thanks

-- vishal.c



2022-10-26 07:27:54

by Greg Kroah-Hartman

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Wed, Oct 26, 2022 at 12:07:01PM +0530, Vishal Chourasia wrote:
> On Tue, Oct 18, 2022 at 01:04:40PM +0200, Greg Kroah-Hartman wrote:
>
> > Why do you need to? What tools require these debugfs files to be
> > present?
>
> We are not entirely sure what applications (if any) might be using this interface.

Then just disable it and see what happens :)

> > And if you only have 7-8 files per CPU, that does not seem like a lot of
> > files overall (14000-16000)? If you only offline 1 cpu, how is removing
> > 7 or 8 files a bottleneck? Do you really offline 1999 cpus for a 2k
> > system?
>
> It's 7-8 files per domain per cpu, so, in a system with approx 2k cpus and five
> domains, the total file count goes above 70k-80k files. And, when we offline 1
> CPU, the entire directory is rebuilt, resulting in creation of all the files
> again.

Perhaps change the logic to not rebuild the whole thing and instead just
remove the required files?

Or as I mentioned before, you can move debugfs to use kernfs, which
should resolve most of these issues automatically. Why not take the
time to do that which will solve the problem no matter what gets added
in the future in other subsystems?

thanks,

greg k-h

2022-10-26 09:17:06

by Peter Zijlstra

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Wed, Oct 26, 2022 at 09:02:28AM +0200, Greg Kroah-Hartman wrote:
> On Wed, Oct 26, 2022 at 12:07:01PM +0530, Vishal Chourasia wrote:
> > On Tue, Oct 18, 2022 at 01:04:40PM +0200, Greg Kroah-Hartman wrote:
> >
> > > Why do you need to? What tools require these debugfs files to be
> > > present?
> >
> > We are not entirely sure what applications (if any) might be using this interface.
>
> Then just disable it and see what happens :)

It's mostly a debug interface for developers. A lot of people complained
when I moved things to debugfs, and I told them their program was broken
for a SCHED_DEBUG=n build anyway, but nobody complained about
this particular thing IIRC.

It's mostly affected by things like hotplug and cpusets, you can
discover the resulting topology by looking at these files.

Also; while we generally try and keep SCHED_DEBUG impact low, it is
still measurable; there are a number of people that run SCHED_DEBUG=n
kernels for the extra little gain.

> > > And if you only have 7-8 files per CPU, that does not seem like a lot of
> > > files overall (14000-16000)? If you only offline 1 cpu, how is removing
> > > 7 or 8 files a bottleneck? Do you really offline 1999 cpus for a 2k
> > > system?
> >
> > It's 7-8 files per domain per cpu, so, in a system with approx 2k cpus and five
> > domains, the total file count goes above 70k-80k files. And, when we offline 1
> > CPU, the entire directory is rebuilt, resulting in creation of all the files
> > again.
>
> Perhaps change the logic to not rebuild the whole thing and instead just
> remove the required files?

Unplugging a single cpu can change the topology and the other cpus might
need to be updated too.

Simplest example would be the SMT case, if you reduce from SMT>1 to SMT1
the SMT domain goes away (because a single CPU domain is as pointless as
it sounds) and that affects the CPU that remains.

Tracking all that is a pain. Simply rebuilding the whole thing is by
*far* the simplest option. And given this all is debug code, simple is
good.

> Or as I mentioned before, you can move debugfs to use kernfs, which
> should resolve most of these issues automatically. Why not take the
> time to do that which will solve the problem no matter what gets added
> in the future in other subsystems?

This sounds like a good approach.

2022-11-08 11:18:40

by Vishal Chourasia

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems


Thanks Greg & Peter for your direction.

While we pursue the idea of basing debugfs on kernfs, we thought about adding
a boot-time parameter that disables the creation and updating of the
sched_domain debugfs files. This would also be useful even once a kernfs-based
solution lands, as users who do not care about these debugfs files would
benefit from a faster CPU hotplug operation.

The sched_domain debugfs files are still created by default; they are only
skipped when the new parameter is passed on the kernel command line.

-- vishal.c

------>8-----------------------------------------------------8<--------------

From f66f66ee05a9f719b58822d13e501d65391dd9d3 Mon Sep 17 00:00:00 2001
From: Vishal Chourasia <[email protected]>
Date: Tue, 8 Nov 2022 14:21:15 +0530
Subject: [PATCH] Add kernel parameter to disable creation of sched_domain
files

For large systems, creation of the sched_domain debug files takes an
unusually long time. In such cases, sched_sd_export can be passed on the
kernel command line at boot to prevent the kernel from creating the
sched_domain files.

This commit adds a kernel command line parameter, sched_sd_export, which can
be used to optionally disable the creation of the sched_domain debug files.
---
kernel/sched/debug.c | 9 ++++++---
kernel/sched/sched.h | 1 +
kernel/sched/topology.c | 11 ++++++++++-
3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bb3d63bdf4ae..bd307847b76a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -279,6 +279,7 @@ static const struct file_operations sched_dynamic_fops = {
#endif /* CONFIG_PREEMPT_DYNAMIC */

__read_mostly bool sched_debug_verbose;
+__read_mostly int sched_debug_export = 1;

static const struct seq_operations sched_debug_sops;

@@ -321,9 +322,11 @@ static __init int sched_init_debug(void)
debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);

- mutex_lock(&sched_domains_mutex);
- update_sched_domain_debugfs();
- mutex_unlock(&sched_domains_mutex);
+ if (likely(sched_debug_export)) {
+ mutex_lock(&sched_domains_mutex);
+ update_sched_domain_debugfs();
+ mutex_unlock(&sched_domains_mutex);
+ }
#endif

#ifdef CONFIG_NUMA_BALANCING
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..a4d06588d876 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2738,6 +2738,7 @@ extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);

#ifdef CONFIG_SCHED_DEBUG
extern bool sched_debug_verbose;
+extern int sched_debug_export;

extern void print_cfs_stats(struct seq_file *m, int cpu);
extern void print_rt_stats(struct seq_file *m, int cpu);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..7bcdbc2f856d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -19,6 +19,13 @@ static int __init sched_debug_setup(char *str)
}
early_param("sched_verbose", sched_debug_setup);

+static int __init sched_debug_disable_export(char *str)
+{
+ sched_debug_export = 0;
+ return 0;
+}
+early_param("sched_sd_export", sched_debug_disable_export);
+
static inline bool sched_debug(void)
{
return sched_debug_verbose;
@@ -152,6 +159,7 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu)
#else /* !CONFIG_SCHED_DEBUG */

# define sched_debug_verbose 0
+# define sched_debug_export 1
# define sched_domain_debug(sd, cpu) do { } while (0)
static inline bool sched_debug(void)
{
@@ -2632,7 +2640,8 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
dattr_cur = dattr_new;
ndoms_cur = ndoms_new;

- update_sched_domain_debugfs();
+ if (likely(sched_debug_export))
+ update_sched_domain_debugfs();
}

/*

base-commit: 7e18e42e4b280c85b76967a9106a13ca61c16179
--
2.31.1




2022-11-08 12:36:31

by Greg Kroah-Hartman

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
>
> Thanks Greg & Peter for your direction.
>
> While we pursue the idea of having debugfs based on kernfs, we thought about
> having a boot time parameter which would disable creating and updating of the
> sched_domain debugfs files and this would also be useful even when the kernfs
> solution kicks in, as users who may not care about these debugfs files would
> benefit from a faster CPU hotplug operation.

Ick, no, you would be adding a new user/kernel api that you will be
required to support for the next 20+ years. Just to get over a
short-term issue before you solve the problem properly.

If you really do not want these debugfs files, just disable debugfs from
your system. That should be a better short-term solution, right?

Or better yet, disable SCHED_DEBUG, why can't you do that?

thanks,

greg k-h

2022-11-08 15:30:41

by Srikar Dronamraju

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

* Greg Kroah-Hartman <[email protected]> [2022-11-08 13:24:39]:

> On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:

Hi Greg,

> >
> > Thanks Greg & Peter for your direction.
> >
> > While we pursue the idea of having debugfs based on kernfs, we thought about
> > having a boot time parameter which would disable creating and updating of the
> > sched_domain debugfs files and this would also be useful even when the kernfs
> > solution kicks in, as users who may not care about these debugfs files would
> > benefit from a faster CPU hotplug operation.
>
> Ick, no, you would be adding a new user/kernel api that you will be
> required to support for the next 20+ years. Just to get over a
> short-term issue before you solve the problem properly.
>
> If you really do not want these debugfs files, just disable debugfs from
> your system. That should be a better short-term solution, right?
>
> Or better yet, disable SCHED_DEBUG, why can't you do that?

Thanks a lot for your quick inputs.

CONFIG_SCHED_DEBUG disables a lot more than just the updating of these
debugfs files. Information like /sys/kernel/debug/sched/debug, along with
other system-wide and per-process information, would be lost when that config
is disabled.

Most users would still be using distribution kernels and most distribution
kernels that I know of seem to have CONFIG_SCHED_DEBUG enabled.

Consider a large system, let's say close to 2000 CPUs, where we are offlining
around 1750 CPUs, for example with ppc64_cpu --smt=1 on powerpc. Even if we
move to a lower-overhead kernfs-based implementation, we would still be
creating and deleting files for every CPU offline. Most users may not even be
aware of these files, yet for the few users who consult them once in a while,
we end up creating and deleting these files for everyone. The overhead grows
roughly quadratically with the number of CPUs, since every offline operation
rebuilds the entries for all of them, and I would assume the maximum number
of CPUs is only going to increase in the future.
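
Rough arithmetic for that scenario: offlining ~1750 CPUs one at a time, with
~70k-80k files torn down and recreated on each change, is on the order of
10^8 individual file creations and removals for a single ppc64_cpu --smt=1
invocation.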

Hence our approach was to reduce the overhead for those users who are sure
they do not depend on these files. We still keep creating the files by
default, so that others who do depend on them are not impacted.

>
> thanks,
>
> greg k-h

--
Thanks and Regards
Srikar Dronamraju

2022-11-08 16:00:31

by Greg Kroah-Hartman

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Tue, Nov 08, 2022 at 08:21:00PM +0530, Srikar Dronamraju wrote:
> * Greg Kroah-Hartman <[email protected]> [2022-11-08 13:24:39]:
>
> > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
>
> Hi Greg,
>
> > >
> > > Thanks Greg & Peter for your direction.
> > >
> > > While we pursue the idea of having debugfs based on kernfs, we thought about
> > > having a boot time parameter which would disable creating and updating of the
> > > sched_domain debugfs files and this would also be useful even when the kernfs
> > > solution kicks in, as users who may not care about these debugfs files would
> > > benefit from a faster CPU hotplug operation.
> >
> > Ick, no, you would be adding a new user/kernel api that you will be
> > required to support for the next 20+ years. Just to get over a
> > short-term issue before you solve the problem properly.
> >
> > If you really do not want these debugfs files, just disable debugfs from
> > your system. That should be a better short-term solution, right?
> >
> > Or better yet, disable SCHED_DEBUG, why can't you do that?
>
> Thanks a lot for your quick inputs.
>
> CONFIG_SCHED_DEBUG disables a lot more stuff than just updation of debugfs
> files. Information like /sys/kernel/debug/sched/debug and system-wide and
> per process wide information would be lost when that config is disabled.
>
> Most users would still be using distribution kernels and most distribution
> kernels that I know of seem to have CONFIG_SCHED_DEBUG enabled.

Then work with the distros to remove that option if it doesn't do well
on very large systems.

Odds are they really do not want that enabled either, but that's not our
issue, that's theirs :)

> In a large system, lets say close to 2000 CPUs and we are offlining around
> 1750 CPUs. For example ppc64_cpu --smt=1 on a powerpc. Even if we move to a
> lesser overhead kernfs based implementation, we would still be creating
> files and deleting files for every CPU offline. Most users may not even be
> aware of these files. However for a few users who may be using these files
> once a while, we end up creating and deleting these files for all users. The
> overhead increases exponentially with the number of CPUs. I would assume the
> max number of CPUs are going to increase in future further.

I understand the issue, you don't have to explain it again. The
scheduler developers like to see these files, and for them it's useful.
Perhaps for distros that is not a useful thing to have around, that's
up to them.

> Hence our approach was to reduce the overhead for those users who are sure
> they don't depend on these files. We still keep the creating of the files as
> the default approach so that others who depend on it are not going to be
> impacted.

No, you are adding a new user/kernel api to the kernel that you then
have to support for the next 20+ years because you haven't fixed the
real issue here.

I think you could have done the kernfs conversion already, it shouldn't
be that complex, right?

Note, when you do it, you might want to move away from returning a raw dentry
from debugfs calls, and instead use an opaque type "debugfs_file" or something
like that, which might make this easier over time.

thanks,

greg k-h

2022-12-12 19:42:53

by Phil Auld

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

Hi,

On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> >
> > Thanks Greg & Peter for your direction.
> >
> > While we pursue the idea of having debugfs based on kernfs, we thought about
> > having a boot time parameter which would disable creating and updating of the
> > sched_domain debugfs files and this would also be useful even when the kernfs
> > solution kicks in, as users who may not care about these debugfs files would
> > benefit from a faster CPU hotplug operation.
>
> Ick, no, you would be adding a new user/kernel api that you will be
> required to support for the next 20+ years. Just to get over a
> short-term issue before you solve the problem properly.

I'm not convinced moving these files from debugfs to kernfs is the right
fix. That will take it from ~50 back to ~20 _minutes_ on these systems.
I don't think either of those numbers is reasonable.

The issue as I see it is the full rebuild for every change with no way to
batch the changes. How about something like the below?

This puts the domains/* files under the sched_verbose flag. About the only
thing under that flag now are the detailed topology discovery printks anyway
so this fits together nicely.

This way the files would be off by default (assuming you don't boot with
sched_verbose) and can be created at runtime by enabling verbose. Multiple
changes could also be batched by disabling it, making the changes, and
re-enabling it, as in the example below.
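
For example (with this patch applied, and assuming debugfs is mounted at
/sys/kernel/debug): write N to /sys/kernel/debug/sched/verbose before a batch
of hotplug operations, which removes the domains/ tree and skips the
per-change rebuilds, then write Y afterwards to regenerate the files once.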

It does not create a new API; it uses one that is already there.

>
> If you really do not want these debugfs files, just disable debugfs from
> your system. That should be a better short-term solution, right?

We do find these files useful at times for debugging issues and looking
at what's going on in the system.

>
> Or better yet, disable SCHED_DEBUG, why can't you do that?

Same with this... useful information with (modulo issues like this)
small cost. There are also tuning knobs that are only available
with SCHED_DEBUG.


Cheers,
Phil

---------------

sched/debug: Put sched/domains files under verbose flag

The debug files under sched/domains can take a long time to regenerate,
especially when updates are done one at a time. Move these files under
the verbose debug flag. Allow changes to verbose to trigger generation
of the files. This lets a user batch the updates but still have the
information available. The detailed topology printk messages are also
under verbose.

Signed-off-by: Phil Auld <[email protected]>
---
kernel/sched/debug.c | 68 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 66 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..2eb51ee3ccab 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops = {

__read_mostly bool sched_debug_verbose;

+static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos);
+
+static int sched_verbose_show(struct seq_file *m, void *v)
+{
+ if (sched_debug_verbose)
+ seq_puts(m,"Y\n");
+ else
+ seq_puts(m,"N\n");
+ return 0;
+}
+
+static int sched_verbose_open(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, sched_verbose_show, NULL);
+}
+
+static const struct file_operations sched_verbose_fops = {
+ .open = sched_verbose_open,
+ .write = sched_verbose_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
static const struct seq_operations sched_debug_sops;

static int sched_debug_open(struct inode *inode, struct file *filp)
@@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
debugfs_sched = debugfs_create_dir("sched", NULL);

debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
- debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
+ debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
#ifdef CONFIG_PREEMPT_DYNAMIC
debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
#endif
@@ -402,15 +427,23 @@ void update_sched_domain_debugfs(void)
if (!debugfs_sched)
return;

+ if (!sched_debug_verbose)
+ return;
+
if (!cpumask_available(sd_sysctl_cpus)) {
if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
return;
cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
}

- if (!sd_dentry)
+ if (!sd_dentry) {
sd_dentry = debugfs_create_dir("domains", debugfs_sched);

+ /* rebuild sd_sysclt_cpus if empty since it gets cleared below */
+ if (cpumask_first(sd_sysctl_cpus) >= nr_cpu_ids)
+ cpumask_copy(sd_sysctl_cpus, cpu_online_mask);
+ }
+
for_each_cpu(cpu, sd_sysctl_cpus) {
struct sched_domain *sd;
struct dentry *d_cpu;
@@ -443,6 +476,37 @@ void dirty_sched_domain_sysctl(int cpu)

#endif /* CONFIG_SMP */

+static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ struct dentry *dentry = filp->f_path.dentry;
+ bool orig = sched_debug_verbose;
+ bool bv;
+ int r;
+
+ r = kstrtobool_from_user(ubuf, cnt, &bv);
+ if (!r) {
+ mutex_lock(&sched_domains_mutex);
+ r = debugfs_file_get(dentry);
+ if (unlikely(r))
+ return r;
+ sched_debug_verbose = bv;
+ debugfs_file_put(dentry);
+
+#ifdef CONFIG_SMP
+ if (sched_debug_verbose && !orig)
+ update_sched_domain_debugfs();
+ else if (!sched_debug_verbose && orig){
+ debugfs_remove(sd_dentry);
+ sd_dentry = NULL;
+ }
+#endif
+
+ mutex_unlock(&sched_domains_mutex);
+ }
+ return cnt;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
{
--
2.31.1



--

2022-12-13 03:32:29

by kernel test robot

Subject: Re: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

Hi Phil,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on linus/master v6.1 next-20221208]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Phil-Auld/Re-sched-debug-CPU-hotplug-operation-suffers-in-a-large-cpu-systems/20221213-031857
patch link: https://lore.kernel.org/r/Y5d%2BZqdxeJD2eIHL%40lorien.usersys.redhat.com
patch subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems
config: arm-randconfig-r046-20221212
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 6e4cea55f0d1104408b26ac574566a0e4de48036)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm cross compiling tool for clang build
# apt-get install binutils-arm-linux-gnueabi
# https://github.com/intel-lab-lkp/linux/commit/0561e916ce1f2d53558afc8f833031dc0c8cf03b
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Phil-Auld/Re-sched-debug-CPU-hotplug-operation-suffers-in-a-large-cpu-systems/20221213-031857
git checkout 0561e916ce1f2d53558afc8f833031dc0c8cf03b
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from kernel/sched/build_utility.c:73:
>> kernel/sched/debug.c:489:15: error: use of undeclared identifier 'sched_domains_mutex'
mutex_lock(&sched_domains_mutex);
^
kernel/sched/debug.c:505:17: error: use of undeclared identifier 'sched_domains_mutex'
mutex_unlock(&sched_domains_mutex);
^
2 errors generated.


vim +/sched_domains_mutex +489 kernel/sched/debug.c

478
479 static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
480 size_t cnt, loff_t *ppos)
481 {
482 struct dentry *dentry = filp->f_path.dentry;
483 bool orig = sched_debug_verbose;
484 bool bv;
485 int r;
486
487 r = kstrtobool_from_user(ubuf, cnt, &bv);
488 if (!r) {
> 489 mutex_lock(&sched_domains_mutex);
490 r = debugfs_file_get(dentry);
491 if (unlikely(r))
492 return r;
493 sched_debug_verbose = bv;
494 debugfs_file_put(dentry);
495

--
0-DAY CI Kernel Test Service
https://01.org/lkp



2022-12-13 07:11:21

by Greg Kroah-Hartman

Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> Hi,
>
> On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > >
> > > Thanks Greg & Peter for your direction.
> > >
> > > While we pursue the idea of having debugfs based on kernfs, we thought about
> > > having a boot time parameter which would disable creating and updating of the
> > > sched_domain debugfs files and this would also be useful even when the kernfs
> > > solution kicks in, as users who may not care about these debugfs files would
> > > benefit from a faster CPU hotplug operation.
> >
> > Ick, no, you would be adding a new user/kernel api that you will be
> > required to support for the next 20+ years. Just to get over a
> > short-term issue before you solve the problem properly.
>
> I'm not convinced moving these files from debugfs to kernfs is the right
> fix. That will take it from ~50 back to ~20 _minutes_ on these systems.
> I don't think either of those numbers is reasonable.
>
> The issue as I see it is the full rebuild for every change with no way to
> batch the changes. How about something like the below?
>
> This puts the domains/* files under the sched_verbose flag. About the only
> thing under that flag now are the detailed topology discovery printks anyway
> so this fits together nicely.
>
> This way the files would be off by default (assuming you don't boot with
> sched_verbose) and can be created at runtime by enabling verbose. Multiple
> changes could also be batched by disabling/makeing changes/re-enabling.
>
> It does not create a new API, uses one that is already there.

The idea seems good, the implementation might need a bit of work :)

> > If you really do not want these debugfs files, just disable debugfs from
> > your system. That should be a better short-term solution, right?
>
> We do find these files useful at times for debugging issue and looking
> at what's going on on the system.
>
> >
> > Or better yet, disable SCHED_DEBUG, why can't you do that?
>
> Same with this... useful information with (modulo issues like this)
> small cost. There are also tuning knobs that are only available
> with SCHED_DEBUG.
>
>
> Cheers,
> Phil
>
> ---------------
>
> sched/debug: Put sched/domains files under verbose flag
>
> The debug files under sched/domains can take a long time to regenerate,
> especially when updates are done one at a time. Move these files under
> the verbose debug flag. Allow changes to verbose to trigger generation
> of the files. This lets a user batch the updates but still have the
> information available. The detailed topology printk messages are also
> under verbose.
>
> Signed-off-by: Phil Auld <[email protected]>
> ---
> kernel/sched/debug.c | 68 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 66 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 1637b65ba07a..2eb51ee3ccab 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops = {
>
> __read_mostly bool sched_debug_verbose;
>
> +static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos);
> +
> +static int sched_verbose_show(struct seq_file *m, void *v)
> +{
> + if (sched_debug_verbose)
> + seq_puts(m,"Y\n");
> + else
> + seq_puts(m,"N\n");
> + return 0;
> +}
> +
> +static int sched_verbose_open(struct inode *inode, struct file *filp)
> +{
> + return single_open(filp, sched_verbose_show, NULL);
> +}
> +
> +static const struct file_operations sched_verbose_fops = {
> + .open = sched_verbose_open,
> + .write = sched_verbose_write,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = seq_release,
> +};
> +
> static const struct seq_operations sched_debug_sops;
>
> static int sched_debug_open(struct inode *inode, struct file *filp)
> @@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
> debugfs_sched = debugfs_create_dir("sched", NULL);
>
> debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
> - debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
> + debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
> #ifdef CONFIG_PREEMPT_DYNAMIC
> debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
> #endif
> @@ -402,15 +427,23 @@ void update_sched_domain_debugfs(void)
> if (!debugfs_sched)
> return;
>
> + if (!sched_debug_verbose)
> + return;
> +
> if (!cpumask_available(sd_sysctl_cpus)) {
> if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
> return;
> cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
> }
>
> - if (!sd_dentry)
> + if (!sd_dentry) {
> sd_dentry = debugfs_create_dir("domains", debugfs_sched);
>
> + /* rebuild sd_sysclt_cpus if empty since it gets cleared below */
> + if (cpumask_first(sd_sysctl_cpus) >= nr_cpu_ids)
> + cpumask_copy(sd_sysctl_cpus, cpu_online_mask);
> + }
> +
> for_each_cpu(cpu, sd_sysctl_cpus) {
> struct sched_domain *sd;
> struct dentry *d_cpu;
> @@ -443,6 +476,37 @@ void dirty_sched_domain_sysctl(int cpu)
>
> #endif /* CONFIG_SMP */
>
> +static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + struct dentry *dentry = filp->f_path.dentry;
> + bool orig = sched_debug_verbose;
> + bool bv;
> + int r;
> +
> + r = kstrtobool_from_user(ubuf, cnt, &bv);
> + if (!r) {
> + mutex_lock(&sched_domains_mutex);
> + r = debugfs_file_get(dentry);
> + if (unlikely(r))
> + return r;
> + sched_debug_verbose = bv;
> + debugfs_file_put(dentry);

Why the get/put of the debugfs dentry? for just this single value?

thanks,

greg k-h

2022-12-13 13:40:47

by Phil Auld

[permalink] [raw]
Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> > Hi,
> >
> > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > > >
> > > > Thanks Greg & Peter for your direction.
> > > >
> > > > While we pursue the idea of having debugfs based on kernfs, we thought about
> > > > having a boot time parameter which would disable creating and updating of the
> > > > sched_domain debugfs files and this would also be useful even when the kernfs
> > > > solution kicks in, as users who may not care about these debugfs files would
> > > > benefit from a faster CPU hotplug operation.
> > >
> > > Ick, no, you would be adding a new user/kernel api that you will be
> > > required to support for the next 20+ years. Just to get over a
> > > short-term issue before you solve the problem properly.
> >
> > I'm not convinced moving these files from debugfs to kernfs is the right
> > fix. That will take it from ~50 back to ~20 _minutes_ on these systems.
> > I don't think either of those numbers is reasonable.
> >
> > The issue as I see it is the full rebuild for every change with no way to
> > batch the changes. How about something like the below?
> >
> > This puts the domains/* files under the sched_verbose flag. About the only
> > thing under that flag now are the detailed topology discovery printks anyway
> > so this fits together nicely.
> >
> > This way the files would be off by default (assuming you don't boot with
> > sched_verbose) and can be created at runtime by enabling verbose. Multiple
> > changes could also be batched by disabling/makeing changes/re-enabling.
> >
> > It does not create a new API, uses one that is already there.
>
> The idea seems good, the implementation might need a bit of work :)

More than the one comment below? Let me know.

>
> > > If you really do not want these debugfs files, just disable debugfs from
> > > your system. That should be a better short-term solution, right?
> >
> > We do find these files useful at times for debugging issue and looking
> > at what's going on on the system.
> >
> > >
> > > Or better yet, disable SCHED_DEBUG, why can't you do that?
> >
> > Same with this... useful information with (modulo issues like this)
> > small cost. There are also tuning knobs that are only available
> > with SCHED_DEBUG.
> >
> >
> > Cheers,
> > Phil
> >
> > ---------------
> >
> > sched/debug: Put sched/domains files under verbose flag
> >
> > The debug files under sched/domains can take a long time to regenerate,
> > especially when updates are done one at a time. Move these files under
> > the verbose debug flag. Allow changes to verbose to trigger generation
> > of the files. This lets a user batch the updates but still have the
> > information available. The detailed topology printk messages are also
> > under verbose.
> >
> > Signed-off-by: Phil Auld <[email protected]>
> > ---
> > kernel/sched/debug.c | 68 ++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 66 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index 1637b65ba07a..2eb51ee3ccab 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops = {
> >
> > __read_mostly bool sched_debug_verbose;
> >
> > +static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
> > + size_t cnt, loff_t *ppos);
> > +
> > +static int sched_verbose_show(struct seq_file *m, void *v)
> > +{
> > + if (sched_debug_verbose)
> > + seq_puts(m,"Y\n");
> > + else
> > + seq_puts(m,"N\n");
> > + return 0;
> > +}
> > +
> > +static int sched_verbose_open(struct inode *inode, struct file *filp)
> > +{
> > + return single_open(filp, sched_verbose_show, NULL);
> > +}
> > +
> > +static const struct file_operations sched_verbose_fops = {
> > + .open = sched_verbose_open,
> > + .write = sched_verbose_write,
> > + .read = seq_read,
> > + .llseek = seq_lseek,
> > + .release = seq_release,
> > +};
> > +
> > static const struct seq_operations sched_debug_sops;
> >
> > static int sched_debug_open(struct inode *inode, struct file *filp)
> > @@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
> > debugfs_sched = debugfs_create_dir("sched", NULL);
> >
> > debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
> > - debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
> > + debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
> > #ifdef CONFIG_PREEMPT_DYNAMIC
> > debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
> > #endif
> > @@ -402,15 +427,23 @@ void update_sched_domain_debugfs(void)
> > if (!debugfs_sched)
> > return;
> >
> > + if (!sched_debug_verbose)
> > + return;
> > +
> > if (!cpumask_available(sd_sysctl_cpus)) {
> > if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
> > return;
> > cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
> > }
> >
> > - if (!sd_dentry)
> > + if (!sd_dentry) {
> > sd_dentry = debugfs_create_dir("domains", debugfs_sched);
> >
> > + /* rebuild sd_sysctl_cpus if empty since it gets cleared below */
> > + if (cpumask_first(sd_sysctl_cpus) >= nr_cpu_ids)
> > + cpumask_copy(sd_sysctl_cpus, cpu_online_mask);
> > + }
> > +
> > for_each_cpu(cpu, sd_sysctl_cpus) {
> > struct sched_domain *sd;
> > struct dentry *d_cpu;
> > @@ -443,6 +476,37 @@ void dirty_sched_domain_sysctl(int cpu)
> >
> > #endif /* CONFIG_SMP */
> >
> > +static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + struct dentry *dentry = filp->f_path.dentry;
> > + bool orig = sched_debug_verbose;
> > + bool bv;
> > + int r;
> > +
> > + r = kstrtobool_from_user(ubuf, cnt, &bv);
> > + if (!r) {
> > + mutex_lock(&sched_domains_mutex);
> > + r = debugfs_file_get(dentry);
> > + if (unlikely(r))
> > + return r;
> > + sched_debug_verbose = bv;
> > + debugfs_file_put(dentry);
>
> Why the get/put of the debugfs dentry? for just this single value?

That's what debugfs_write_file_bool() does, which is where I got that since
that's really what this is doing. I couldn't see a good way to make this
just call that.
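
For reference, the write path of that helper has roughly this shape
(paraphrased from memory, so treat it as approximate rather than a
verbatim copy of fs/debugfs/file.c):

ssize_t debugfs_write_file_bool(struct file *file, const char __user *ubuf,
                                size_t cnt, loff_t *ppos)
{
        bool bv, *val = file->private_data;
        struct dentry *dentry = file->f_path.dentry;
        int r;

        r = kstrtobool_from_user(ubuf, cnt, &bv);
        if (!r) {
                /* pin the debugfs file so it cannot be removed mid-write */
                r = debugfs_file_get(dentry);
                if (unlikely(r))
                        return r;
                *val = bv;
                debugfs_file_put(dentry);
        }

        return cnt;
}

So the get/put in that helper is there to protect the file itself, not the
value being stored.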

I suppose the get/put may not be needed since the only way this should
go away is under that mutex too.

... erm, yeah, that error return with sched_domains_mutex still held is a
problem ... I'll fix that.

Also, this was originally on v6.1-rc7. I can rebase when I repost but I
didn't want to do it on a random commit so I picked (at the time) the latest
tag. Should I just use the head of Linus's tree?

Thanks,
Phil

>
> thanks,
>
> greg k-h
>

--

2022-12-13 14:51:27

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Tue, Dec 13, 2022 at 08:22:58AM -0500, Phil Auld wrote:
> On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> > On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> > > Hi,
> > >
> > > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > > > >
> > > > > Thanks Greg & Peter for your direction.
> > > > >
> > > > > While we pursue the idea of having debugfs based on kernfs, we thought about
> > > > > having a boot time parameter which would disable creating and updating of the
> > > > > sched_domain debugfs files and this would also be useful even when the kernfs
> > > > > solution kicks in, as users who may not care about these debugfs files would
> > > > > benefit from a faster CPU hotplug operation.
> > > >
> > > > Ick, no, you would be adding a new user/kernel api that you will be
> > > > required to support for the next 20+ years. Just to get over a
> > > > short-term issue before you solve the problem properly.
> > >
> > > I'm not convinced moving these files from debugfs to kernfs is the right
> > > fix. That will take it from ~50 back to ~20 _minutes_ on these systems.
> > > I don't think either of those numbers is reasonable.
> > >
> > > The issue as I see it is the full rebuild for every change with no way to
> > > batch the changes. How about something like the below?
> > >
> > > This puts the domains/* files under the sched_verbose flag. About the only
> > > thing under that flag now are the detailed topology discovery printks anyway
> > > so this fits together nicely.
> > >
> > > This way the files would be off by default (assuming you don't boot with
> > > sched_verbose) and can be created at runtime by enabling verbose. Multiple
> > > changes could also be batched by disabling/making changes/re-enabling.
> > >
> > > It does not create a new API, uses one that is already there.
> >
> > The idea seems good, the implementation might need a bit of work :)
>
> More than the one comment below? Let me know.

No idea, resubmit a working patch and I'll review it properly :)

> > > + r = kstrtobool_from_user(ubuf, cnt, &bv);
> > > + if (!r) {
> > > + mutex_lock(&sched_domains_mutex);
> > > + r = debugfs_file_get(dentry);
> > > + if (unlikely(r))
> > > + return r;
> > > + sched_debug_verbose = bv;
> > > + debugfs_file_put(dentry);
> >
> > Why the get/put of the debugfs dentry? for just this single value?
>
> That's what debugfs_write_file_bool() does, which is where I got that since
> that's really what this is doing. I couldn't see a good way to make this
> just call that.
>
> I suppose the get/put may not be needed since the only way this should
> go away is under that mutex too.

Yes, it should not be needed.

> ... erm, yeah, that error return with sched_domains_mutex still held is a
> problem ... I'll fix that.
>
> Also, this was originally on v6.1-rc7. I can rebase when I repost but I
> didn't want to do it on a random commit so I picked (at the time) the latest
> tag. Should I just use the head of Linus's tree?

Yes, or linux-next.

thanks,

greg k-h

2022-12-13 16:05:59

by Phil Auld

[permalink] [raw]
Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Tue, Dec 13, 2022 at 03:31:06PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Dec 13, 2022 at 08:22:58AM -0500, Phil Auld wrote:
> > On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> > > On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> > > > Hi,
> > > >
> > > > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > > > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > > > > >
> > > > > > Thanks Greg & Peter for your direction.
> > > > > >
> > > > > > While we pursue the idea of having debugfs based on kernfs, we thought about
> > > > > > having a boot time parameter which would disable creating and updating of the
> > > > > > sched_domain debugfs files and this would also be useful even when the kernfs
> > > > > > solution kicks in, as users who may not care about these debugfs files would
> > > > > > benefit from a faster CPU hotplug operation.
> > > > >
> > > > > Ick, no, you would be adding a new user/kernel api that you will be
> > > > > required to support for the next 20+ years. Just to get over a
> > > > > short-term issue before you solve the problem properly.
> > > >
> > > > I'm not convinced moving these files from debugfs to kernfs is the right
> > > > fix. That will take it from ~50 back to ~20 _minutes_ on these systems.
> > > > I don't think either of those numbers is reasonable.
> > > >
> > > > The issue as I see it is the full rebuild for every change with no way to
> > > > batch the changes. How about something like the below?
> > > >
> > > > This puts the domains/* files under the sched_verbose flag. About the only
> > > > thing under that flag now are the detailed topology discovery printks anyway
> > > > so this fits together nicely.
> > > >
> > > > This way the files would be off by default (assuming you don't boot with
> > > > sched_verbose) and can be created at runtime by enabling verbose. Multiple
> > > > changes could also be batched by disabling/making changes/re-enabling.
> > > >
> > > > It does not create a new API, uses one that is already there.
> > >
> > > The idea seems good, the implementation might need a bit of work :)
> >
> > More than the one comment below? Let me know.
>
> No idea, resubmit a working patch and I'll review it properly :)
>

Will do.


Thanks,
Phil


--

2022-12-13 23:45:48

by Michael Ellerman

[permalink] [raw]
Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

Phil Auld <[email protected]> writes:
> On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
>> On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
>> > Hi,
>> >
>> > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
>> > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
>> > > >
>> > > > Thanks Greg & Peter for your direction.
>> > > >
>> > > > While we pursue the idea of having debugfs based on kernfs, we thought about
>> > > > having a boot time parameter which would disable creating and updating of the
>> > > > sched_domain debugfs files and this would also be useful even when the kernfs
>> > > > solution kicks in, as users who may not care about these debugfs files would
>> > > > benefit from a faster CPU hotplug operation.
>> > >
>> > > Ick, no, you would be adding a new user/kernel api that you will be
>> > > required to support for the next 20+ years. Just to get over a
>> > > short-term issue before you solve the problem properly.
>> >
>> > I'm not convinced moving these files from debugfs to kernfs is the right
>> > fix. That will take it from ~50 back to ~20 _minutes_ on these systems.
>> > I don't think either of those numbers is reasonable.
>> >
>> > The issue as I see it is the full rebuild for every change with no way to
>> > batch the changes. How about something like the below?
>> >
>> > This puts the domains/* files under the sched_verbose flag. About the only
>> > thing under that flag now are the detailed topology discovery printks anyway
>> > so this fits together nicely.
>> >
>> > This way the files would be off by default (assuming you don't boot with
>> > sched_verbose) and can be created at runtime by enabling verbose. Multiple
>> > changes could also be batched by disabling/making changes/re-enabling.
>> >
>> > It does not create a new API, uses one that is already there.
>>
>> The idea seems good, the implementation might need a bit of work :)
>
> More than the one comment below? Let me know.
>
>>
>> > > If you really do not want these debugfs files, just disable debugfs from
>> > > your system. That should be a better short-term solution, right?
>> >
>> > We do find these files useful at times for debugging issues and looking
>> > at what's going on on the system.
>> >
>> > >
>> > > Or better yet, disable SCHED_DEBUG, why can't you do that?
>> >
>> > Same with this... useful information with (modulo issues like this)
>> > small cost. There are also tuning knobs that are only available
>> > with SCHED_DEBUG.
>> >
>> >
>> > Cheers,
>> > Phil
>> >
>> > ---------------
>> >
>> > sched/debug: Put sched/domains files under verbose flag
>> >
>> > The debug files under sched/domains can take a long time to regenerate,
>> > especially when updates are done one at a time. Move these files under
>> > the verbose debug flag. Allow changes to verbose to trigger generation
>> > of the files. This lets a user batch the updates but still have the
>> > information available. The detailed topology printk messages are also
>> > under verbose.
>> >
>> > Signed-off-by: Phil Auld <[email protected]>
>> > ---
>> > kernel/sched/debug.c | 68 ++++++++++++++++++++++++++++++++++++++++++--
>> > 1 file changed, 66 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> > index 1637b65ba07a..2eb51ee3ccab 100644
>> > --- a/kernel/sched/debug.c
>> > +++ b/kernel/sched/debug.c
>> > @@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops = {
>> >
>> > __read_mostly bool sched_debug_verbose;
>> >
>> > +static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
>> > + size_t cnt, loff_t *ppos);
>> > +
>> > +static int sched_verbose_show(struct seq_file *m, void *v)
>> > +{
>> > + if (sched_debug_verbose)
>> > + seq_puts(m,"Y\n");
>> > + else
>> > + seq_puts(m,"N\n");
>> > + return 0;
>> > +}
>> > +
>> > +static int sched_verbose_open(struct inode *inode, struct file *filp)
>> > +{
>> > + return single_open(filp, sched_verbose_show, NULL);
>> > +}
>> > +
>> > +static const struct file_operations sched_verbose_fops = {
>> > + .open = sched_verbose_open,
>> > + .write = sched_verbose_write,
>> > + .read = seq_read,
>> > + .llseek = seq_lseek,
>> > + .release = seq_release,
>> > +};
>> > +
>> > static const struct seq_operations sched_debug_sops;
>> >
>> > static int sched_debug_open(struct inode *inode, struct file *filp)
>> > @@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
>> > debugfs_sched = debugfs_create_dir("sched", NULL);
>> >
>> > debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
>> > - debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
>> > + debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
>> > #ifdef CONFIG_PREEMPT_DYNAMIC
>> > debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
>> > #endif
>> > @@ -402,15 +427,23 @@ void update_sched_domain_debugfs(void)
>> > if (!debugfs_sched)
>> > return;
>> >
>> > + if (!sched_debug_verbose)
>> > + return;
>> > +
>> > if (!cpumask_available(sd_sysctl_cpus)) {
>> > if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
>> > return;
>> > cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
>> > }
>> >
>> > - if (!sd_dentry)
>> > + if (!sd_dentry) {
>> > sd_dentry = debugfs_create_dir("domains", debugfs_sched);
>> >
>> > + /* rebuild sd_sysctl_cpus if empty since it gets cleared below */
>> > + if (cpumask_first(sd_sysctl_cpus) >= nr_cpu_ids)
>> > + cpumask_copy(sd_sysctl_cpus, cpu_online_mask);
>> > + }
>> > +
>> > for_each_cpu(cpu, sd_sysctl_cpus) {
>> > struct sched_domain *sd;
>> > struct dentry *d_cpu;
>> > @@ -443,6 +476,37 @@ void dirty_sched_domain_sysctl(int cpu)
>> >
>> > #endif /* CONFIG_SMP */
>> >
>> > +static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
>> > + size_t cnt, loff_t *ppos)
>> > +{
>> > + struct dentry *dentry = filp->f_path.dentry;
>> > + bool orig = sched_debug_verbose;
>> > + bool bv;
>> > + int r;
>> > +
>> > + r = kstrtobool_from_user(ubuf, cnt, &bv);
>> > + if (!r) {
>> > + mutex_lock(&sched_domains_mutex);
>> > + r = debugfs_file_get(dentry);
>> > + if (unlikely(r))
>> > + return r;
>> > + sched_debug_verbose = bv;
>> > + debugfs_file_put(dentry);
>>
>> Why the get/put of the debugfs dentry? for just this single value?
>
> That's what debugfs_write_file_bool() does, which is where I got that since
> that's really what this is doing. I couldn't see a good way to make this
> just call that.

I think you can do it like below? Only lightly tested :)

cheers


diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..bc96380cf336 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -280,6 +279,42 @@ static const struct file_operations sched_dynamic_fops = {

__read_mostly bool sched_debug_verbose;

+#ifdef CONFIG_SMP
+static struct dentry *sd_dentry;
+
+static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ ssize_t result;
+ bool orig;
+
+ mutex_lock(&sched_domains_mutex);
+
+ orig = sched_debug_verbose;
+ result = debugfs_write_file_bool(filp, ubuf, cnt, ppos);
+
+ if (sched_debug_verbose && !orig)
+ update_sched_domain_debugfs();
+ else if (!sched_debug_verbose && orig) {
+ debugfs_remove(sd_dentry);
+ sd_dentry = NULL;
+ }
+
+ mutex_unlock(&sched_domains_mutex);
+
+ return result;
+}
+#else
+#define sched_verbose_write debugfs_write_file_bool
+#endif
+
+static const struct file_operations sched_verbose_fops = {
+ .read = debugfs_read_file_bool,
+ .write = sched_verbose_write,
+ .open = simple_open,
+ .llseek = default_llseek,
+};
+
static const struct seq_operations sched_debug_sops;

static int sched_debug_open(struct inode *inode, struct file *filp)
@@ -303,7 +338,7 @@ static __init int sched_init_debug(void)
debugfs_sched = debugfs_create_dir("sched", NULL);

debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
- debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
+ debugfs_create_file_unsafe("verbose", 0644, debugfs_sched, &sched_debug_verbose, &sched_verbose_fops);
#ifdef CONFIG_PREEMPT_DYNAMIC
debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
#endif
@@ -345,7 +380,6 @@ late_initcall(sched_init_debug);
#ifdef CONFIG_SMP

static cpumask_var_t sd_sysctl_cpus;
-static struct dentry *sd_dentry;

static int sd_flags_show(struct seq_file *m, void *v)
{
@@ -402,15 +436,23 @@ void update_sched_domain_debugfs(void)
if (!debugfs_sched)
return;

+ if (!sched_debug_verbose)
+ return;
+
if (!cpumask_available(sd_sysctl_cpus)) {
if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
return;
cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
}

- if (!sd_dentry)
+ if (!sd_dentry) {
sd_dentry = debugfs_create_dir("domains", debugfs_sched);

+ /* rebuild sd_sysctl_cpus if empty since it gets cleared below */
+ if (cpumask_first(sd_sysctl_cpus) >= nr_cpu_ids)
+ cpumask_copy(sd_sysctl_cpus, cpu_online_mask);
+ }
+
for_each_cpu(cpu, sd_sysctl_cpus) {
struct sched_domain *sd;
struct dentry *d_cpu;
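
For what it's worth, with the flag gating both teardown and regeneration
like this, the expensive per-cpu rebuild can be batched entirely from
userspace. A minimal sketch of the idea (a hypothetical helper, not part
of the patch; it assumes debugfs is mounted at /sys/kernel/debug and
SCHED_DEBUG is enabled):

#include <stdio.h>

/* Write 'Y' or 'N' to the sched verbose flag. 'N' removes the
 * sched/domains/ files and makes later rebuilds return early; 'Y'
 * regenerates them once for all online cpus. */
static int set_sched_verbose(char v)
{
        FILE *f = fopen("/sys/kernel/debug/sched/verbose", "w");

        if (!f)
                return -1;      /* debugfs not mounted or file absent */
        fputc(v, f);
        return fclose(f);
}

int main(void)
{
        set_sched_verbose('N');  /* drop the files before hotplug churn */
        /* ... offline/online cpus here, e.g. ppc64_cpu --smt=off ... */
        set_sched_verbose('Y');  /* pay the rebuild cost exactly once */
        return 0;
}

That is, one teardown before the hotplug sequence and one full rebuild
afterwards, instead of one rebuild per offlined cpu.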

2022-12-14 02:41:19

by Phil Auld

[permalink] [raw]
Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

On Wed, Dec 14, 2022 at 10:41:25AM +1100 Michael Ellerman wrote:
> Phil Auld <[email protected]> writes:
> > On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> >> On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> >> > Hi,
> >> >
> >> > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> >> > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> >> > > >
> >> > > > Thanks Greg & Peter for your direction.
> >> > > >
> >> > > > While we pursue the idea of having debugfs based on kernfs, we thought about
> >> > > > having a boot time parameter which would disable creating and updating of the
> >> > > > sched_domain debugfs files and this would also be useful even when the kernfs
> >> > > > solution kicks in, as users who may not care about these debugfs files would
> >> > > > benefit from a faster CPU hotplug operation.
> >> > >
> >> > > Ick, no, you would be adding a new user/kernel api that you will be
> >> > > required to support for the next 20+ years. Just to get over a
> >> > > short-term issue before you solve the problem properly.
> >> >
> >> > I'm not convinced moving these files from debugfs to kernfs is the right
> >> > fix. That will take it from ~50 back to ~20 _minutes_ on these systems.
> >> > I don't think either of those numbers is reasonable.
> >> >
> >> > The issue as I see it is the full rebuild for every change with no way to
> >> > batch the changes. How about something like the below?
> >> >
> >> > This puts the domains/* files under the sched_verbose flag. About the only
> >> > thing under that flag now are the detailed topology discovery printks anyway
> >> > so this fits together nicely.
> >> >
> >> > This way the files would be off by default (assuming you don't boot with
> >> > sched_verbose) and can be created at runtime by enabling verbose. Multiple
> >> > changes could also be batched by disabling/making changes/re-enabling.
> >> >
> >> > It does not create a new API, uses one that is already there.
> >>
> >> The idea seems good, the implementation might need a bit of work :)
> >
> > More than the one comment below? Let me know.
> >
> >>
> >> > > If you really do not want these debugfs files, just disable debugfs from
> >> > > your system. That should be a better short-term solution, right?
> >> >
> >> > We do find these files useful at times for debugging issues and looking
> >> > at what's going on on the system.
> >> >
> >> > >
> >> > > Or better yet, disable SCHED_DEBUG, why can't you do that?
> >> >
> >> > Same with this... useful information with (modulo issues like this)
> >> > small cost. There are also tuning knobs that are only available
> >> > with SCHED_DEBUG.
> >> >
> >> >
> >> > Cheers,
> >> > Phil
> >> >
> >> > ---------------
> >> >
> >> > sched/debug: Put sched/domains files under verbose flag
> >> >
> >> > The debug files under sched/domains can take a long time to regenerate,
> >> > especially when updates are done one at a time. Move these files under
> >> > the verbose debug flag. Allow changes to verbose to trigger generation
> >> > of the files. This lets a user batch the updates but still have the
> >> > information available. The detailed topology printk messages are also
> >> > under verbose.
> >> >
> >> > Signed-off-by: Phil Auld <[email protected]>
> >> > ---
> >> > kernel/sched/debug.c | 68 ++++++++++++++++++++++++++++++++++++++++++--
> >> > 1 file changed, 66 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> >> > index 1637b65ba07a..2eb51ee3ccab 100644
> >> > --- a/kernel/sched/debug.c
> >> > +++ b/kernel/sched/debug.c
> >> > @@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops = {
> >> >
> >> > __read_mostly bool sched_debug_verbose;
> >> >
> >> > +static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
> >> > + size_t cnt, loff_t *ppos);
> >> > +
> >> > +static int sched_verbose_show(struct seq_file *m, void *v)
> >> > +{
> >> > + if (sched_debug_verbose)
> >> > + seq_puts(m,"Y\n");
> >> > + else
> >> > + seq_puts(m,"N\n");
> >> > + return 0;
> >> > +}
> >> > +
> >> > +static int sched_verbose_open(struct inode *inode, struct file *filp)
> >> > +{
> >> > + return single_open(filp, sched_verbose_show, NULL);
> >> > +}
> >> > +
> >> > +static const struct file_operations sched_verbose_fops = {
> >> > + .open = sched_verbose_open,
> >> > + .write = sched_verbose_write,
> >> > + .read = seq_read,
> >> > + .llseek = seq_lseek,
> >> > + .release = seq_release,
> >> > +};
> >> > +
> >> > static const struct seq_operations sched_debug_sops;
> >> >
> >> > static int sched_debug_open(struct inode *inode, struct file *filp)
> >> > @@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
> >> > debugfs_sched = debugfs_create_dir("sched", NULL);
> >> >
> >> > debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
> >> > - debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
> >> > + debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
> >> > #ifdef CONFIG_PREEMPT_DYNAMIC
> >> > debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
> >> > #endif
> >> > @@ -402,15 +427,23 @@ void update_sched_domain_debugfs(void)
> >> > if (!debugfs_sched)
> >> > return;
> >> >
> >> > + if (!sched_debug_verbose)
> >> > + return;
> >> > +
> >> > if (!cpumask_available(sd_sysctl_cpus)) {
> >> > if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
> >> > return;
> >> > cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
> >> > }
> >> >
> >> > - if (!sd_dentry)
> >> > + if (!sd_dentry) {
> >> > sd_dentry = debugfs_create_dir("domains", debugfs_sched);
> >> >
> >> > + /* rebuild sd_sysctl_cpus if empty since it gets cleared below */
> >> > + if (cpumask_first(sd_sysctl_cpus) >= nr_cpu_ids)
> >> > + cpumask_copy(sd_sysctl_cpus, cpu_online_mask);
> >> > + }
> >> > +
> >> > for_each_cpu(cpu, sd_sysctl_cpus) {
> >> > struct sched_domain *sd;
> >> > struct dentry *d_cpu;
> >> > @@ -443,6 +476,37 @@ void dirty_sched_domain_sysctl(int cpu)
> >> >
> >> > #endif /* CONFIG_SMP */
> >> >
> >> > +static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
> >> > + size_t cnt, loff_t *ppos)
> >> > +{
> >> > + struct dentry *dentry = filp->f_path.dentry;
> >> > + bool orig = sched_debug_verbose;
> >> > + bool bv;
> >> > + int r;
> >> > +
> >> > + r = kstrtobool_from_user(ubuf, cnt, &bv);
> >> > + if (!r) {
> >> > + mutex_lock(&sched_domains_mutex);
> >> > + r = debugfs_file_get(dentry);
> >> > + if (unlikely(r))
> >> > + return r;
> >> > + sched_debug_verbose = bv;
> >> > + debugfs_file_put(dentry);
> >>
> >> Why the get/put of the debugfs dentry? for just this single value?
> >
> > That's what debugfs_write_file_bool() does, which is where I got that since
> > that's really what this is doing. I couldn't see a good way to make this
> > just call that.
>
> I think you can do it like below? Only lightly tested :)
>

That simplifies things. Thanks!

I'm testing a new version now but will switch to this method and see what
happens.


Cheers,
Phil


> cheers
>
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 1637b65ba07a..bc96380cf336 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -280,6 +279,42 @@ static const struct file_operations sched_dynamic_fops = {
>
> __read_mostly bool sched_debug_verbose;
>
> +#ifdef CONFIG_SMP
> +static struct dentry *sd_dentry;
> +
> +static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> +{
> + ssize_t result;
> + bool orig;
> +
> + mutex_lock(&sched_domains_mutex);
> +
> + orig = sched_debug_verbose;
> + result = debugfs_write_file_bool(filp, ubuf, cnt, ppos);
> +
> + if (sched_debug_verbose && !orig)
> + update_sched_domain_debugfs();
> + else if (!sched_debug_verbose && orig) {
> + debugfs_remove(sd_dentry);
> + sd_dentry = NULL;
> + }
> +
> + mutex_unlock(&sched_domains_mutex);
> +
> + return result;
> +}
> +#else
> +#define sched_verbose_write debugfs_write_file_bool
> +#endif
> +
> +static const struct file_operations sched_verbose_fops = {
> + .read = debugfs_read_file_bool,
> + .write = sched_verbose_write,
> + .open = simple_open,
> + .llseek = default_llseek,
> +};
> +
> static const struct seq_operations sched_debug_sops;
>
> static int sched_debug_open(struct inode *inode, struct file *filp)
> @@ -303,7 +338,7 @@ static __init int sched_init_debug(void)
> debugfs_sched = debugfs_create_dir("sched", NULL);
>
> debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
> - debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
> + debugfs_create_file_unsafe("verbose", 0644, debugfs_sched, &sched_debug_verbose, &sched_verbose_fops);
> #ifdef CONFIG_PREEMPT_DYNAMIC
> debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
> #endif
> @@ -345,7 +380,6 @@ late_initcall(sched_init_debug);
> #ifdef CONFIG_SMP
>
> static cpumask_var_t sd_sysctl_cpus;
> -static struct dentry *sd_dentry;
>
> static int sd_flags_show(struct seq_file *m, void *v)
> {
> @@ -402,15 +436,23 @@ void update_sched_domain_debugfs(void)
> if (!debugfs_sched)
> return;
>
> + if (!sched_debug_verbose)
> + return;
> +
> if (!cpumask_available(sd_sysctl_cpus)) {
> if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
> return;
> cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
> }
>
> - if (!sd_dentry)
> + if (!sd_dentry) {
> sd_dentry = debugfs_create_dir("domains", debugfs_sched);
>
> + /* rebuild sd_sysctl_cpus if empty since it gets cleared below */
> + if (cpumask_first(sd_sysctl_cpus) >= nr_cpu_ids)
> + cpumask_copy(sd_sysctl_cpus, cpu_online_mask);
> + }
> +
> for_each_cpu(cpu, sd_sysctl_cpus) {
> struct sched_domain *sd;
> struct dentry *d_cpu;
>

--

2023-01-19 16:26:05

by Phil Auld

[permalink] [raw]
Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

Hi Greg, et alia,

On Tue, Dec 13, 2022 at 03:31:06PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Dec 13, 2022 at 08:22:58AM -0500, Phil Auld wrote:

> > >
> > > The idea seems good, the implementation might need a bit of work :)
> >
> > More than the one comment below? Let me know.
>
> No idea, resubmit a working patch and I'll review it properly :)
>

I finally got this posted, twice :(. Sorry for the delay; I ran into
what turned out to be an unrelated issue while testing it, plus
end-of-year holidays and whatnot.

https://lore.kernel.org/lkml/[email protected]/T/#u


Cheers,
Phil

--