It was found that reading /proc/stat could be time-consuming on
systems with a lot of irqs. For example, reading /proc/stat on a
certain 2-socket Skylake server took about 4.6ms because it had over
5k irqs. In that particular case, the majority of the CPU cycles for
reading /proc/stat were spent in the kstat_irqs() function. Therefore,
application performance can be impacted if the application reads
/proc/stat rather frequently.
The "intr" line within /proc/stat contains a sum total of all the irqs
that have happened followed by a list of irq counts for each individual
irq number. In many cases, the first number is good enough. The
individual irq counts may not provide that much more information.
In order to avoid this kind of performance issue, all these individual
irq counts are now separated into a new /proc/stat_irqs file. The
sum total irq count will stay in /proc/stat and be duplicated in
/proc/stat_irqs. Applications that need to look up individual irq counts
will now have to look into /proc/stat_irqs instead of /proc/stat.
Signed-off-by: Waiman Long <[email protected]>
---
fs/proc/stat.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 41 insertions(+), 7 deletions(-)
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 59749df..79e3c03 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -155,11 +155,6 @@ static int show_stat(struct seq_file *p, void *v)
seq_putc(p, '\n');
}
seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
-
- /* sum again ? it could be updated? */
- for_each_irq_nr(j)
- seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
-
seq_printf(p,
"\nctxt %llu\n"
"btime %llu\n"
@@ -181,15 +176,46 @@ static int show_stat(struct seq_file *p, void *v)
return 0;
}
+/*
+ * Showing individual irq counts can be expensive if there are a lot of
+ * irqs. So it is done in a separate procfs file to reduce performance
+ * overhead of reading other statistical counts.
+ */
+static int show_stat_irqs(struct seq_file *p, void *v)
+{
+ int i, j;
+ u64 sum = 0;
+
+ for_each_possible_cpu(i) {
+ sum += kstat_cpu_irqs_sum(i);
+ sum += arch_irq_stat_cpu(i);
+ }
+ sum += arch_irq_stat();
+
+ seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
+
+ for_each_irq_nr(j)
+ seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
+
+ seq_putc(p, '\n');
+
+ return 0;
+}
+
static int stat_open(struct inode *inode, struct file *file)
{
size_t size = 1024 + 128 * num_online_cpus();
- /* minimum size to display an interrupt count : 2 bytes */
- size += 2 * nr_irqs;
return single_open_size(file, show_stat, NULL, size);
}
+static int stat_irqs_open(struct inode *inode, struct file *file)
+{
+ size_t size = 1024 + 16 * nr_irqs;
+
+ return single_open_size(file, show_stat_irqs, NULL, size);
+}
+
static const struct file_operations proc_stat_operations = {
.open = stat_open,
.read = seq_read,
@@ -197,9 +223,17 @@ static int stat_open(struct inode *inode, struct file *file)
.release = single_release,
};
+static const struct file_operations proc_stat_irqs_operations = {
+ .open = stat_irqs_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
static int __init proc_stat_init(void)
{
proc_create("stat", 0, NULL, &proc_stat_operations);
+ proc_create("stat_irqs", 0, NULL, &proc_stat_irqs_operations);
return 0;
}
fs_initcall(proc_stat_init);
--
1.8.3.1
On 04/19/18 10:09, Waiman Long wrote:
> It was found that reading /proc/stat could be time consuming on
> systems with a lot of irqs. For example, reading /proc/stat in a
> certain 2-socket Skylake server took about 4.6ms because it had over
> 5k irqs. In that particular case, the majority of the CPU cycles for
> reading /proc/stat was spent in the kstat_irqs() function. Therefore,
> application performance can be impacted if the application reads
> /proc/stat rather frequently.
>
> The "intr" line within /proc/stat contains a sum total of all the irqs
> that have happened followed by a list of irq counts for each individual
> irq number. In many cases, the first number is good enough. The
> individual irq counts may not provide that much more information.
>
> In order to avoid this kind of performance issue, all these individual
> irq counts are now separated into a new /proc/stat_irqs file. The
> sum total irq count will stay in /proc/stat and be duplicated in
> /proc/stat_irqs. Applications that need to look up individual irq counts
> will now have to look into /proc/stat_irqs instead of /proc/stat.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> fs/proc/stat.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 41 insertions(+), 7 deletions(-)
Also please update Documentation/filesystems/proc.txt.
thanks,
--
~Randy
On 04/19/2018 01:38 PM, Randy Dunlap wrote:
> On 04/19/18 10:09, Waiman Long wrote:
>> It was found that reading /proc/stat could be time consuming on
>> systems with a lot of irqs. For example, reading /proc/stat in a
>> certain 2-socket Skylake server took about 4.6ms because it had over
>> 5k irqs. In that particular case, the majority of the CPU cycles for
>> reading /proc/stat was spent in the kstat_irqs() function. Therefore,
>> application performance can be impacted if the application reads
>> /proc/stat rather frequently.
>>
>> The "intr" line within /proc/stat contains a sum total of all the irqs
>> that have happened followed by a list of irq counts for each individual
>> irq number. In many cases, the first number is good enough. The
>> individual irq counts may not provide that much more information.
>>
>> In order to avoid this kind of performance issue, all these individual
>> irq counts are now separated into a new /proc/stat_irqs file. The
>> sum total irq count will stay in /proc/stat and be duplicated in
>> /proc/stat_irqs. Applications that need to look up individual irq counts
>> will now have to look into /proc/stat_irqs instead of /proc/stat.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> fs/proc/stat.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 41 insertions(+), 7 deletions(-)
> Also please update Documentation/filesystems/proc.txt.
>
> thanks,
Right. I forgot to do that. Will send out v2 to fix that.
-Longman
> Therefore, application performance can be impacted if the application
> reads /proc/stat rather frequently.
[nods]
Text interfaces can be designed in a very stupid way.
> For example, reading /proc/stat in a certain 2-socket Skylake server
> took about 4.6ms because it had over 5k irqs.
Is this top(1)? What is this application doing?
If it needs percpu usage stats, then maybe /proc/stat should be
converted away from single_open() so that core seq_file code doesn't
generate everything at once.
> -
> - /* sum again ? it could be updated? */
> - for_each_irq_nr(j)
> - seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
> -
This is direct userspace breakage.
Proper fix is to start strategic switch away from /proc.
It is a fun toy but its time has come.
On 04/19/2018 03:08 PM, Alexey Dobriyan wrote:
>> Therefore, application performance can be impacted if the application
>> reads /proc/stat rather frequently.
> [nods]
> Text interfaces can be designed in a very stupid way.
>
>> For example, reading /proc/stat in a certain 2-socket Skylake server
>> took about 4.6ms because it had over 5k irqs.
> Is this top(1)? What is this application doing?
> If it needs percpu usage stats, then maybe /proc/stat should be
> converted away from single_open() so that core seq_file code doesn't
> generate everything at once.
The application is actually a database benchmarking tool used by a
customer. The reading of /proc/stat is an artifact of the benchmarking
tool that can actually be turned off. Without doing that, about 20% of
CPU time was spent reading /proc/stat, and the thrashing of cachelines
slowed the benchmark numbers quite significantly. However, I was also
told that there are legitimate cases where reading /proc/stat was
necessary in some of their applications.
>> -
>> - /* sum again ? it could be updated? */
>> - for_each_irq_nr(j)
>> - seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
>> -
> This is direct userspace breakage.
Yes, I am aware of that. That is the cost of improving the performance
of applications that read /proc/stat, but don't need the individual irq
counts.
>
> Proper fix is to start strategic switch away from /proc.
> It is a fun toy but its time has come.
Migration from procfs is easier said than done. Many existing customers
are reluctant to do that.
-Longman
On Thu, 19 Apr 2018 13:09:29 -0400 Waiman Long <[email protected]> wrote:
> It was found that reading /proc/stat could be time consuming on
> systems with a lot of irqs. For example, reading /proc/stat in a
> certain 2-socket Skylake server took about 4.6ms because it had over
> 5k irqs. In that particular case, the majority of the CPU cycles for
> reading /proc/stat was spent in the kstat_irqs() function. Therefore,
> application performance can be impacted if the application reads
> /proc/stat rather frequently.
>
> The "intr" line within /proc/stat contains a sum total of all the irqs
> that have happened followed by a list of irq counts for each individual
> irq number. In many cases, the first number is good enough. The
> individual irq counts may not provide that much more information.
>
> In order to avoid this kind of performance issue, all these individual
> irq counts are now separated into a new /proc/stat_irqs file. The
> sum total irq count will stay in /proc/stat and be duplicated in
> /proc/stat_irqs. Applications that need to look up individual irq counts
> will now have to look into /proc/stat_irqs instead of /proc/stat.
>
(cc /proc maintainer)
It's a non-backward-compatible change. For something which has
existed for so long, it would be a mighty task to demonstrate that no
existing userspace will be disrupted by this change.
So we need to think again. A new interface which omits the per-IRQ
stats might be acceptable.
Or, conceivably, a new /proc knob which disables the per-IRQ stats in
/proc/stat. That would allow operators to opt in to this disabling and
would avoid the need to alter
whatever-application-it-is-that-is-having-trouble. This seems a bit ugly
though.
Also, the changelog is rather vague. "application performance can be
impacted". Well, *are* applications impacted? What is the real-world
performance gain which this change provides, in a real-world workload?
On Thu, Apr 19, 2018 at 03:28:40PM -0400, Waiman Long wrote:
> On 04/19/2018 03:08 PM, Alexey Dobriyan wrote:
> >> Therefore, application performance can be impacted if the application
> >> reads /proc/stat rather frequently.
> > [nods]
> > Text interfaces can be designed in a very stupid way.
> >
> >> For example, reading /proc/stat in a certain 2-socket Skylake server
> >> took about 4.6ms because it had over 5k irqs.
> > Is this top(1)? What is this application doing?
> > If it needs percpu usage stats, then maybe /proc/stat should be
> > converted away from single_open() so that core seq_file code doesn't
> > generate everything at once.
>
> The application is actually a database benchmarking tool used by a
> customer.
So it probably needs lines before "intr" line.
> The reading of /proc/stat is an artifact of the benchmarking
> tool that can actually be turned off. Without doing that, about 20% of
> CPU time were spent reading /proc/stat and the trashing of cachelines
> slowed the benchmark number quite significantly. However, I was also
> told that there are legitimate cases where reading /proc/stat was
> necessary in some of their applications.
>
> >> -
> >> - /* sum again ? it could be updated? */
> >> - for_each_irq_nr(j)
> >> - seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
> >> -
> > This is direct userspace breakage.
>
> Yes, I am aware of that. That is the cost of improving the performance
> of applications that read /proc/stat, but don't need the individual irq
> counts.
Yeah, but all it takes is one script which cares.
I have an idea.
Maintain "maximum registered irq #", it should be much smaller than
"nr_irqs":
intr 4245359 151 0 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 44330 182364 57741 0 0 0 0 0 0 0 0 85 89124 0 0 0 0 0 323360 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Or maintain array of registered irqs and iterate over them only.
I have another idea.
perf record shows mutex_lock/mutex_unlock at the top.
Most of them are irq mutex not seqfile mutex as there are many more
interrupts than reads. Take it once.
On 04/19/2018 03:43 PM, Andrew Morton wrote:
> On Thu, 19 Apr 2018 13:09:29 -0400 Waiman Long <[email protected]> wrote:
>
>> It was found that reading /proc/stat could be time consuming on
>> systems with a lot of irqs. For example, reading /proc/stat in a
>> certain 2-socket Skylake server took about 4.6ms because it had over
>> 5k irqs. In that particular case, the majority of the CPU cycles for
>> reading /proc/stat was spent in the kstat_irqs() function. Therefore,
>> application performance can be impacted if the application reads
>> /proc/stat rather frequently.
>>
>> The "intr" line within /proc/stat contains a sum total of all the irqs
>> that have happened followed by a list of irq counts for each individual
>> irq number. In many cases, the first number is good enough. The
>> individual irq counts may not provide that much more information.
>>
>> In order to avoid this kind of performance issue, all these individual
>> irq counts are now separated into a new /proc/stat_irqs file. The
>> sum total irq count will stay in /proc/stat and be duplicated in
>> /proc/stat_irqs. Applications that need to look up individual irq counts
>> will now have to look into /proc/stat_irqs instead of /proc/stat.
>>
> (cc /proc maintainer)
>
> It's a non-backward-compatible change. For something which has
> existing for so long, it would be a mighty task to demonstrate that no
> existing userspace will be disrupted by this change.
>
> So we need to think again. A new interface which omits the per-IRQ
> stats might be acceptable.
OK, how about a new /proc/stat2 file that is the same as /proc/stat
except that it omits per-IRQ stats. BTW, do you have any suggestion for
a better name?
> Or, conceivably, a new /proc knob which disables the per-IRQ stats in
> /proc/stat. That would allow operators to opt in to this disabling and
> would avoid the need to alter
> whatever-application-it-is-that-is-having-trouble. This seems a bit ugly
> though.
>
> Also, the changelog is rather vague. "application performance can be
> impacted". Well, *are* applications impacted? What is the real-world
> performance gain which this change provides, in a real-world workload?
Will clarify that in the next version.
-Longman
On Thu, Apr 19, 2018 at 12:43:19PM -0700, Andrew Morton wrote:
> On Thu, 19 Apr 2018 13:09:29 -0400 Waiman Long <[email protected]> wrote:
>
> > It was found that reading /proc/stat could be time consuming on
> > systems with a lot of irqs. For example, reading /proc/stat in a
> > certain 2-socket Skylake server took about 4.6ms because it had over
> > 5k irqs. In that particular case, the majority of the CPU cycles for
> > reading /proc/stat was spent in the kstat_irqs() function. Therefore,
> > application performance can be impacted if the application reads
> > /proc/stat rather frequently.
> >
> > The "intr" line within /proc/stat contains a sum total of all the irqs
> > that have happened followed by a list of irq counts for each individual
> > irq number. In many cases, the first number is good enough. The
> > individual irq counts may not provide that much more information.
> >
> > In order to avoid this kind of performance issue, all these individual
> > irq counts are now separated into a new /proc/stat_irqs file. The
> > sum total irq count will stay in /proc/stat and be duplicated in
> > /proc/stat_irqs. Applications that need to look up individual irq counts
> > will now have to look into /proc/stat_irqs instead of /proc/stat.
> >
>
> (cc /proc maintainer)
>
> It's a non-backward-compatible change. For something which has
> existing for so long, it would be a mighty task to demonstrate that no
> existing userspace will be disrupted by this change.
>
> So we need to think again. A new interface which omits the per-IRQ
> stats might be acceptable.
Here is a profile of open+read+close of /proc/stat.
30% is taking a mutex only to print "0".
+ 98.80% 0.04% a.out [kernel.vmlinux] [k] entry_SYSCALL_64
+ 98.75% 0.10% a.out [kernel.vmlinux] [k] do_syscall_64
+ 95.56% 0.04% a.out libc-2.25.so [.] __GI___libc_read
+ 95.09% 0.01% a.out [kernel.vmlinux] [k] sys_read
+ 95.04% 0.03% a.out [kernel.vmlinux] [k] vfs_read
+ 94.98% 0.05% a.out [kernel.vmlinux] [k] proc_reg_read
+ 94.98% 0.00% a.out [kernel.vmlinux] [k] __vfs_read
+ 94.92% 0.06% a.out [kernel.vmlinux] [k] seq_read
+ 94.52% 3.65% a.out [kernel.vmlinux] [k] show_stat
+ 48.62% 2.59% a.out [kernel.vmlinux] [k] kstat_irqs_usr
+ 33.52% 9.55% a.out [kernel.vmlinux] [k] seq_put_decimal_ull
+ 19.63% 19.59% a.out [kernel.vmlinux] [k] memcpy_erms
+ 17.34% 9.53% a.out [kernel.vmlinux] [k] kstat_irqs
- 15.45% 15.43% a.out [kernel.vmlinux] [k] mutex_lock
     15.43% __GI___libc_read
            entry_SYSCALL_64
            do_syscall_64
            sys_read
            vfs_read
            __vfs_read
            proc_reg_read
          - seq_read
             - 15.41% show_stat
                  kstat_irqs_usr
                  mutex_lock
+ 13.32% 13.27% a.out [kernel.vmlinux] [k] mutex_unlock
+ 4.60% 1.35% a.out [kernel.vmlinux] [k] cpumask_next
+ 3.03% 3.03% a.out [kernel.vmlinux] [k] __radix_tree_lookup
+ 2.96% 0.08% a.out [kernel.vmlinux] [k] seq_printf
+ 2.92% 0.02% a.out libc-2.25.so [.] __GI___libc_open
+ 2.89% 0.07% a.out [kernel.vmlinux] [k] seq_vprintf
+ 2.81% 0.70% a.out [kernel.vmlinux] [k] vsnprintf
+ 2.66% 2.66% a.out [kernel.vmlinux] [k] _find_next_bit
+ 2.42% 1.36% a.out [kernel.vmlinux] [k] num_to_str
+ 2.41% 0.19% a.out [kernel.vmlinux] [k] get_idle_time
+ 2.39% 0.02% a.out [kernel.vmlinux] [k] do_sys_open
On Thu, Apr 19, 2018 at 04:21:14PM -0400, Waiman Long wrote:
> On 04/19/2018 03:55 PM, Alexey Dobriyan wrote:
> > On Thu, Apr 19, 2018 at 03:28:40PM -0400, Waiman Long wrote:
> >> On 04/19/2018 03:08 PM, Alexey Dobriyan wrote:
> >>>> Therefore, application performance can be impacted if the application
> >>>> reads /proc/stat rather frequently.
> >>> [nods]
> >>> Text interfaces can be designed in a very stupid way.
> >>>
> >>>> For example, reading /proc/stat in a certain 2-socket Skylake server
> >>>> took about 4.6ms because it had over 5k irqs.
> >>> Is this top(1)? What is this application doing?
> >>> If it needs percpu usage stats, then maybe /proc/stat should be
> >>> converted away from single_open() so that core seq_file code doesn't
> >>> generate everything at once.
> >> The application is actually a database benchmarking tool used by a
> >> customer.
> > So it probably needs lines before "intr" line.
> >
> >> The reading of /proc/stat is an artifact of the benchmarking
> >> tool that can actually be turned off. Without doing that, about 20% of
> >> CPU time were spent reading /proc/stat and the trashing of cachelines
> >> slowed the benchmark number quite significantly. However, I was also
> >> told that there are legitimate cases where reading /proc/stat was
> >> necessary in some of their applications.
> >>
> >>>> -
> >>>> - /* sum again ? it could be updated? */
> >>>> - for_each_irq_nr(j)
> >>>> - seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
> >>>> -
> >>> This is direct userspace breakage.
> >> Yes, I am aware of that. That is the cost of improving the performance
> >> of applications that read /proc/stat, but don't need the individual irq
> >> counts.
> > Yeah, but all it takes is one script which cares.
> >
> > I have an idea.
> >
> > Maintain "maximum registered irq #", it should be much smaller than
> > "nr_irqs":
> >
> > intr 4245359 151 0 0 0 0 0 0 0 38 [...]
>
> Yes, that can probably help.
>
> This is the data from the problematic skylake server:
>
> model name : Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
> 56 sosreport-carevalo.02076935-20180413085327/proc/stat
> Interrupts: 5370
> Interrupts without "0" entries: 1011
>
> There are still quite a large number of non-zero entries, though.
>
> > Or maintain array of registered irqs and iterate over them only.
>
> Right, we can allocate a bitmap of used irqs to do that.
>
> >
> > I have another idea.
> >
> > perf record shows mutex_lock/mutex_unlock at the top.
> > Most of them are irq mutex not seqfile mutex as there are many more
> > interrupts than reads. Take it once.
> >
> How many cpus are in your test system? In that skylake server, it was
> the per-cpu summing operation of the irq counts that was consuming most
> of the time for reading /proc/stat. I think we can certainly try to
> optimize the lock taking.
It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16)
Given that irq registering is a rare operation, maintaining a sorted
array of irqs should be the best option.
> For the time being, I think I am going to have a clone /proc/stat2 as
> suggested in my earlier email. Alternatively, I can put that somewhere
> in sysfs if you have a good idea of where I can put it.
sysfs is strictly one-value-per-file.
> I will also look into ways to optimize the current per-IRQ stats
> handling, but it will come later.
There is always the time-honored way of ioctl(2) to switch irq info off;
/proc supports that.
There are many options.
On 04/19/2018 04:39 PM, Alexey Dobriyan wrote:
> On Thu, Apr 19, 2018 at 04:21:14PM -0400, Waiman Long wrote:
>> On 04/19/2018 03:55 PM, Alexey Dobriyan wrote:
>>> On Thu, Apr 19, 2018 at 03:28:40PM -0400, Waiman Long wrote:
>>>> On 04/19/2018 03:08 PM, Alexey Dobriyan wrote:
>>>>>> Therefore, application performance can be impacted if the application
>>>>>> reads /proc/stat rather frequently.
>>>>> [nods]
>>>>> Text interfaces can be designed in a very stupid way.
>>>>>
>>>>>> For example, reading /proc/stat in a certain 2-socket Skylake server
>>>>>> took about 4.6ms because it had over 5k irqs.
>>>>> Is this top(1)? What is this application doing?
>>>>> If it needs percpu usage stats, then maybe /proc/stat should be
>>>>> converted away from single_open() so that core seq_file code doesn't
>>>>> generate everything at once.
>>>> The application is actually a database benchmarking tool used by a
>>>> customer.
>>> So it probably needs lines before "intr" line.
>>>
>>>> The reading of /proc/stat is an artifact of the benchmarking
>>>> tool that can actually be turned off. Without doing that, about 20% of
>>>> CPU time were spent reading /proc/stat and the trashing of cachelines
>>>> slowed the benchmark number quite significantly. However, I was also
>>>> told that there are legitimate cases where reading /proc/stat was
>>>> necessary in some of their applications.
>>>>
>>>>>> -
>>>>>> - /* sum again ? it could be updated? */
>>>>>> - for_each_irq_nr(j)
>>>>>> - seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
>>>>>> -
>>>>> This is direct userspace breakage.
>>>> Yes, I am aware of that. That is the cost of improving the performance
>>>> of applications that read /proc/stat, but don't need the individual irq
>>>> counts.
>>> Yeah, but all it takes is one script which cares.
>>>
>>> I have an idea.
>>>
>>> Maintain "maximum registered irq #", it should be much smaller than
>>> "nr_irqs":
>>>
>>> intr 4245359 151 0 0 0 0 0 0 0 38 [...]
>> Yes, that can probably help.
>>
>> This is the data from the problematic skylake server:
>>
>> model name : Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
>> 56 sosreport-carevalo.02076935-20180413085327/proc/stat
>> Interrupts: 5370
>> Interrupts without "0" entries: 1011
>>
>> There are still quite a large number of non-zero entries, though.
>>
>>> Or maintain array of registered irqs and iterate over them only.
>> Right, we can allocate a bitmap of used irqs to do that.
>>
>>> I have another idea.
>>>
>>> perf record shows mutex_lock/mutex_unlock at the top.
>>> Most of them are irq mutex not seqfile mutex as there are many more
>>> interrupts than reads. Take it once.
>>>
>> How many cpus are in your test system? In that skylake server, it was
>> the per-cpu summing operation of the irq counts that was consuming most
>> of the time for reading /proc/stat. I think we can certainly try to
>> optimize the lock taking.
> It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16)
> Given that irq registering is rare operation, maintaining sorted array
> of irq should be the best option.
>> For the time being, I think I am going to have a clone /proc/stat2 as
>> suggested in my earlier email. Alternatively, I can put that somewhere
>> in sysfs if you have a good idea of where I can put it.
> sysfs is strictly one-value-per-file.
>
>> I will also look into ways to optimize the current per-IRQ stats
>> handling, but it will come later.
> There is always a time-honored way of ioctl(2) switching irq info off
> /proc supports that.
>
> There are many options.
OK, it is good to know. Do you have any existing code snippet in the
kernel that I can use as a reference on how to do ioctl(2) switching?
I will look into how to optimize the existing per-IRQ stats code first
before venturing into cloning /proc/stat.
Cheers,
Longman
On 04/19/2018 04:39 PM, Alexey Dobriyan wrote:
>>
>> Yes, that can probably help.
>>
>> This is the data from the problematic skylake server:
>>
>> model name : Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
>> 56 sosreport-carevalo.02076935-20180413085327/proc/stat
>> Interrupts: 5370
>> Interrupts without "0" entries: 1011
>>
>> There are still quite a large number of non-zero entries, though.
>>
>>> Or maintain array of registered irqs and iterate over them only.
>> Right, we can allocate a bitmap of used irqs to do that.
>>
>>> I have another idea.
>>>
>>> perf record shows mutex_lock/mutex_unlock at the top.
>>> Most of them are irq mutex not seqfile mutex as there are many more
>>> interrupts than reads. Take it once.
>>>
>> How many cpus are in your test system? In that skylake server, it was
>> the per-cpu summing operation of the irq counts that was consuming most
>> of the time for reading /proc/stat. I think we can certainly try to
>> optimize the lock taking.
> It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16)
> Given that irq registration is a rare operation, maintaining a sorted
> array of irqs should be the best option.
>
BTW, the skylake server is 2-socket 24-core 48-thread.
Cheers,
Longman
Hi,
>>>> Or maintain array of registered irqs and iterate over them only.
>>> Right, we can allocate a bitmap of used irqs to do that.
>>>
>>>> I have another idea.
>>>>
>>>> perf record shows mutex_lock/mutex_unlock at the top.
>>>> Most of them are irq mutex not seqfile mutex as there are many more
>>>> interrupts than reads. Take it once.
>>>>
>>> How many cpus are in your test system? In that skylake server, it was
>>> the per-cpu summing operation of the irq counts that was consuming most
>>> of the time for reading /proc/stat. I think we can certainly try to
>>> optimize the lock taking.
>> It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16)
>> Given that irq registration is a rare operation, maintaining a sorted
>> array of irqs should be the best option.
>>> For the time being, I think I am going to have a clone /proc/stat2 as
>>> suggested in my earlier email. Alternatively, I can put that somewhere
>>> in sysfs if you have a good idea of where I can put it.
>> sysfs is strictly one-value-per-file.
>>
>>> I will also look into ways to optimize the current per-IRQ stats
>>> handling, but it will come later.
>> There is always the time-honored way of using ioctl(2) to switch irq info off.
>> /proc supports that.
>>
>> There are many options.
>
> OK, it is good to know. Do you have any existing code snippet in the
> kernel that I can use as a reference on how to use ioctl(2) switching?
>
> I will look into how to optimize the existing per-IRQ stats code first
> before venturing into cloning /proc/stat.
Can we not just remove per-IRQ stats from /proc/stat (since I gather
from this discussion it isn't scalable), and just have applications
that need per-IRQ stats use /proc/interrupts ?
thanks,
- Joel
On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> Can we not just remove per-IRQ stats from /proc/stat (since I gather
> from this discussion it isn't scalable), and just have applications
> that need per-IRQ stats use /proc/interrupts ?
If you can prove no one is using them in /proc/stat...
On Sat, Apr 21, 2018 at 11:34:22PM +0300, Alexey Dobriyan wrote:
> On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> > Can we not just remove per-IRQ stats from /proc/stat (since I gather
> > from this discussion it isn't scalable), and just have applications
> > that need per-IRQ stats use /proc/interrupts ?
>
> If you can prove no one is using them in /proc/stat...
And you can't even stick WARN into /proc/stat to find out.
On Sat, 21 Apr 2018, Alexey Dobriyan wrote:
> > On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> > > Can we not just remove per-IRQ stats from /proc/stat (since I gather
> > > from this discussion it isn't scalable), and just have applications
> > > that need per-IRQ stats use /proc/interrupts ?
> >
> > If you can prove no one is using them in /proc/stat...
>
> And you can't even stick WARN into /proc/stat to find out.
>
FWIW, removing per irq counts from /proc/stat would break some of our
scripts. We could adapt to that, but everybody else would have to as
well, so I'm afraid it's not going to be possible.
It would probably be better to extract out the stats that you're actually
interested in to a new file.
On Mon, Apr 23, 2018 at 10:54:18PM -0700, David Rientjes wrote:
> On Sat, 21 Apr 2018, Alexey Dobriyan wrote:
>
> > > On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> > > > Can we not just remove per-IRQ stats from /proc/stat (since I gather
> > > > from this discussion it isn't scalable), and just have applications
> > > > that need per-IRQ stats use /proc/interrupts ?
> > >
> > > If you can prove no one is using them in /proc/stat...
> >
> > And you can't even stick WARN into /proc/stat to find out.
> >
>
> FWIW, removing per irq counts from /proc/stat would break some of our
> scripts. We could adapt to that, but everybody else would have to as
> well, so I'm afraid it's not going to be possible.
Excellent!
> It would probably be better to extract out the stats that you're actually
> interested in to a new file.
This is the worst scenario. Individual IRQ stats are going to live in 2 places.
And /proc/stat still would be slow.
On Tue, 24 Apr 2018 09:18:59 +0300 Alexey Dobriyan <[email protected]> wrote:
> On Mon, Apr 23, 2018 at 10:54:18PM -0700, David Rientjes wrote:
> > On Sat, 21 Apr 2018, Alexey Dobriyan wrote:
> >
> > > > On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> > > > > Can we not just remove per-IRQ stats from /proc/stat (since I gather
> > > > > from this discussion it isn't scalable), and just have applications
> > > > > that need per-IRQ stats use /proc/interrupts ?
> > > >
> > > > If you can prove no one is using them in /proc/stat...
> > >
> > > And you can't even stick WARN into /proc/stat to find out.
> > >
> >
> > FWIW, removing per irq counts from /proc/stat would break some of our
> > scripts. We could adapt to that, but everybody else would have to as
> > well, so I'm afraid it's not going to be possible.
>
> Excellent!
>
> > It would probably be better to extract out the stats that you're actually
> > interested in to a new file.
>
> This is the worst scenario. Individual IRQ stats are going to live in 2 places.
> And /proc/stat still would be slow.
No, a new /proc/stat2 which omits the `intr' line would be fast(er).
Although if we're going to do such a thing we might choose to
reorganize and prune /proc/stat2 even further, and actually put some
thought into it - the current one is a bit of a dog's breakfast.
Dumb question(s):
a) if we moved the `intr' line to the very end of /proc/stat, would
anything break? Maybe.
b) if an application were then to read stuff from /proc/stat and
were to stop reading before it read the `intr' line, would
its read from /proc/stat still be slow?
c) if the answer to b) is "yes" then can we change that? Only go
off and prepare the `intr' line if the application is really truly
reading it in?