As newer systems have more and more IRQs and CPUs available, the
performance of reading /proc/stat frequently is getting worse and worse.
Last year, I had proposed a patch to extract the IRQ line of /proc/stat
into a new procfs file. However, that may break existing applications
that depend on the presence of the IRQ line in /proc/stat.
Davidlohr had proposed a new /proc/stat2 file which is almost the same
as /proc/stat except that it has no IRQ line. This would require
applications to change over to the new procfs file.
This patchset takes a different approach to reducing the performance
impact: a sysctl parameter controls the maximum frequency at which the
IRQ count computation is done. As a result, the IRQ counts from
/proc/stat may be a bit stale, but they still provide the needed information.
Waiman Long (2):
/proc/stat: Extract irqs counting code into show_stat_irqs()
/proc/stat: Add sysctl parameter to control irq counts latency
Documentation/sysctl/fs.txt | 16 +++++++++
fs/proc/stat.c | 87 +++++++++++++++++++++++++++++++++++++++------
kernel/sysctl.c | 12 +++++++
3 files changed, 105 insertions(+), 10 deletions(-)
--
1.8.3.1
The code that generates the "intr" line of /proc/stat is now moved
from show_stat() into a new function - show_stat_irqs(). There is no
functional change.
Signed-off-by: Waiman Long <[email protected]>
---
fs/proc/stat.c | 39 +++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 535eda7..4b06f1b 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -79,12 +79,38 @@ static u64 get_iowait_time(int cpu)
#endif
+static u64 compute_stat_irqs_sum(void)
+{
+ int i;
+ u64 sum = 0;
+
+ for_each_possible_cpu(i) {
+ sum += kstat_cpu_irqs_sum(i);
+ sum += arch_irq_stat_cpu(i);
+ }
+ sum += arch_irq_stat();
+ return sum;
+}
+
+/*
+ * Print out the "intr" line of /proc/stat.
+ */
+static void show_stat_irqs(struct seq_file *p)
+{
+ int i;
+
+ seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
+ for_each_irq_nr(i)
+ seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
+
+ seq_putc(p, '\n');
+}
+
static int show_stat(struct seq_file *p, void *v)
{
int i, j;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
u64 guest, guest_nice;
- u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
struct timespec64 boottime;
@@ -105,8 +131,6 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
- sum += kstat_cpu_irqs_sum(i);
- sum += arch_irq_stat_cpu(i);
for (j = 0; j < NR_SOFTIRQS; j++) {
unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -115,7 +139,6 @@ static int show_stat(struct seq_file *p, void *v)
sum_softirq += softirq_stat;
}
}
- sum += arch_irq_stat();
seq_put_decimal_ull(p, "cpu ", nsec_to_clock_t(user));
seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
@@ -154,14 +177,10 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
seq_putc(p, '\n');
}
- seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
-
- /* sum again ? it could be updated? */
- for_each_irq_nr(j)
- seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
+ show_stat_irqs(p);
seq_printf(p,
- "\nctxt %llu\n"
+ "ctxt %llu\n"
"btime %llu\n"
"processes %lu\n"
"procs_running %lu\n"
--
1.8.3.1
On 01/07/2019 10:58 AM, Matthew Wilcox wrote:
> On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
>> Reading /proc/stat can be slow especially if there are many irqs and on
>> systems with many CPUs as summation of all the percpu counts for each
>> of the irqs is required. On some newer systems, there can be more than
>> 1000 irqs per socket.
>>
>> Applications that need to read /proc/stat many times per seconds will
>> easily hit a bottleneck. In reality, the irq counts are seldom looked
>> at. Even those applications that read them don't really need up-to-date
>> information. One way to reduce the performance impact of irq counts
>> computation is to do it less frequently.
>>
>> A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
> No. No, no, no, no, no. No.
>
> Stop adding new sysctls for this kind of thing. It's just a way to shift
> blame from us (who allegedly know what we're doing) to the sysadmin
> (who probably has better things to be doing than keeping up with the
> intricacies of development of every single piece of software running on
> their system).
>
> Let's figure out what this _should_ be. As a strawman, I propose we
> update these stats once a second. That's easy to document and understand.
I am fine with having a fixed value (like 1s) for reporting purposes. It
is just that people may have many different opinions on what the right
value should be. That is why I opted for flexibility in the initial patch.
>
>> @@ -98,7 +105,48 @@ static u64 compute_stat_irqs_sum(void)
>> static void show_stat_irqs(struct seq_file *p)
>> {
>> int i;
>> +#ifdef CONFIG_PROC_SYSCTL
>> + static char *irqs_buf; /* Buffer for irqs values */
>> + static int buflen;
>> + static unsigned long last_jiffies; /* Last buffer update jiffies */
>> + static DEFINE_MUTEX(irqs_mutex);
>> + unsigned int latency = proc_stat_irqs_latency_ms;
>> +
>> + if (latency) {
>> + char *ptr;
>> +
>> + latency = _msecs_to_jiffies(latency);
>> +
>> + mutex_lock(&irqs_mutex);
>> + if (irqs_buf && time_before(jiffies, last_jiffies + latency))
>> + goto print_out;
>> +
>> + /*
>> + * Each irq value may require up to 11 bytes.
>> + */
>> + if (!irqs_buf) {
>> + irqs_buf = kmalloc(nr_irqs * 11 + 32,
>> + GFP_KERNEL | __GFP_ZERO);
> Why are you caching the _output_ of calling sprintf(), rather than caching the
> values of each interrupt?
>
It is just faster to dump the whole string buffer than to redo the
number formatting each time when the values don't change. I can cache
the individual sums instead if that is preferred by most.
Cheers,
Longman
On 01/07/2019 11:14 AM, Matthew Wilcox wrote:
> On Mon, Jan 07, 2019 at 11:07:47AM -0500, Waiman Long wrote:
>>> Why are you caching the _output_ of calling sprintf(), rather than caching the
>>> values of each interrupt?
>>>
>> It is just faster to dump the whole string buffer than redoing the
>> number formatting each time when the values don't change. I can cache
>> the individual sums instead if it is the preferred by most.
> But it also consumes more memory. Can you gather some measurements to
> find out what the performance difference is if you cache the values
> instead of the strings?
I allocate 11 bytes (10 digits + a space) for each unsigned int IRQ value.
I need only 4 bytes for the unsigned value itself, so it is a saving of 7
bytes per IRQ. With 4k irqs, that will be 28k.
I will run some measurements of caching the values versus saving the
string later this week.
Thanks,
Longman
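For comparison, here is a minimal sketch of the value-caching variant Matthew is asking about, written against the fs/proc/stat.c context of the patches in this thread. The names (cached_irqs, irqs_cache_mutex) and the structure are illustrative assumptions, not code from any posted patch:

/* Hypothetical value cache: ~4 bytes per IRQ instead of ~11 for the string. */
static u64 cached_sum;
static unsigned int *cached_irqs;       /* nr_irqs entries */
static unsigned long cache_jiffies;     /* time of the last refresh */
static DEFINE_MUTEX(irqs_cache_mutex);

static void show_stat_irqs_cached_values(struct seq_file *p, unsigned long max_age)
{
        int i;

        mutex_lock(&irqs_cache_mutex);
        if (!cached_irqs)
                cached_irqs = kcalloc(nr_irqs, sizeof(*cached_irqs), GFP_KERNEL);

        /* Recompute only when the cached values are older than max_age jiffies. */
        if (cached_irqs && time_after(jiffies, cache_jiffies + max_age)) {
                cached_sum = compute_stat_irqs_sum();
                for_each_irq_nr(i)
                        cached_irqs[i] = kstat_irqs_usr(i);
                cache_jiffies = jiffies;
        }

        /*
         * Unlike the string cache, the decimal formatting is redone on
         * every read; if the allocation failed we simply report zeroes
         * here (a real patch would fall back to direct summation).
         */
        seq_put_decimal_ull(p, "intr ", cached_sum);
        for_each_irq_nr(i)
                seq_put_decimal_ull(p, " ", cached_irqs ? cached_irqs[i] : 0);
        seq_putc(p, '\n');

        mutex_unlock(&irqs_cache_mutex);
}

The memory and formatting trade-off between this and the string cache is exactly what the measurements promised above would quantify.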
Reading /proc/stat can be slow, especially on systems with many CPUs and
many irqs, as a summation of all the percpu counts for each of the irqs
is required. On some newer systems, there can be more than
1000 irqs per socket.
Applications that need to read /proc/stat many times per second will
easily hit a bottleneck. In reality, the irq counts are seldom looked
at. Even those applications that read them don't really need up-to-date
information. One way to reduce the performance impact of the irq count
computation is to do it less frequently.
A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is added to
control the maximum latency in milliseconds allowed between the time
when the computation was done and the time when the values are reported.
Setting this parameter to an appropriate value will reduce the
performance impact of reading /proc/stat repeatedly. If /proc/stat
is read only once in a while, the irq counts will be accurate. Reading
/proc/stat repeatedly, however, may make the counts somewhat stale.
On a 4-socket 96-core Broadwell system (HT off) with 2824 irqs,
the times for reading /proc/stat 10,000 times with various values of
proc-stat-irqs-latency-ms were:
proc-stat-irqs-latency-ms   elapsed time   sys time
-------------------------   ------------   --------
                        0        11.041s     9.452s
                        1        12.983s    10.314s
                       10         8.452s     5.466s
                      100         8.003s     4.882s
                     1000         8.000s     4.740s
Signed-off-by: Waiman Long <[email protected]>
---
Documentation/sysctl/fs.txt | 16 +++++++++++++++
fs/proc/stat.c | 48 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sysctl.c | 12 ++++++++++++
3 files changed, 76 insertions(+)
diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
index 819caf8..603d1b5 100644
--- a/Documentation/sysctl/fs.txt
+++ b/Documentation/sysctl/fs.txt
@@ -34,6 +34,7 @@ Currently, these files are in /proc/sys/fs:
- overflowgid
- pipe-user-pages-hard
- pipe-user-pages-soft
+- proc-stat-irqs-latency-ms
- protected_fifos
- protected_hardlinks
- protected_regular
@@ -184,6 +185,21 @@ applied.
==============================================================
+proc-stat-irqs-latency-ms:
+
+The maximum latency (in milliseconds) between the time when the IRQ counts
+in the "intr" line of /proc/stat were computed and the time when they
+are reported.
+
+The default is 0, which means the counts are computed every time
+/proc/stat is read. As computing the IRQ counts can be the most
+time-consuming part of accessing /proc/stat, setting a high enough value
+will shorten the time to read it in most cases.
+
+The actual maximum latency is rounded up to the next multiple of jiffies.
+
+==============================================================
+
protected_fifos:
The intent of this protection is to avoid unintentional writes to
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 4b06f1b..52f5845 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -13,6 +13,7 @@
#include <linux/irqnr.h>
#include <linux/sched/cputime.h>
#include <linux/tick.h>
+#include <linux/jiffies.h>
#ifndef arch_irq_stat_cpu
#define arch_irq_stat_cpu(cpu) 0
@@ -21,6 +22,12 @@
#define arch_irq_stat() 0
#endif
+/*
+ * Maximum latency (in ms) of the irq values reported in the "intr" line.
+ * This is converted internally to a multiple of jiffies.
+ */
+unsigned int proc_stat_irqs_latency_ms;
+
#ifdef arch_idle_time
static u64 get_idle_time(int cpu)
@@ -98,7 +105,48 @@ static u64 compute_stat_irqs_sum(void)
static void show_stat_irqs(struct seq_file *p)
{
int i;
+#ifdef CONFIG_PROC_SYSCTL
+ static char *irqs_buf; /* Buffer for irqs values */
+ static int buflen;
+ static unsigned long last_jiffies; /* Last buffer update jiffies */
+ static DEFINE_MUTEX(irqs_mutex);
+ unsigned int latency = proc_stat_irqs_latency_ms;
+
+ if (latency) {
+ char *ptr;
+
+ latency = _msecs_to_jiffies(latency);
+
+ mutex_lock(&irqs_mutex);
+ if (irqs_buf && time_before(jiffies, last_jiffies + latency))
+ goto print_out;
+
+ /*
+ * Each irq value may require up to 11 bytes.
+ */
+ if (!irqs_buf) {
+ irqs_buf = kmalloc(nr_irqs * 11 + 32,
+ GFP_KERNEL | __GFP_ZERO);
+ if (!irqs_buf) {
+ mutex_unlock(&irqs_mutex);
+ goto fallback;
+ }
+ }
+ ptr = irqs_buf;
+ ptr += sprintf(ptr, "intr %llu", compute_stat_irqs_sum());
+ for_each_irq_nr(i)
+ ptr += sprintf(ptr, " %u", kstat_irqs_usr(i));
+ *ptr++ = '\n';
+ buflen = ptr - irqs_buf;
+ last_jiffies = jiffies;
+print_out:
+ seq_write(p, irqs_buf, buflen);
+ mutex_unlock(&irqs_mutex);
+ return;
+ }
+fallback:
+#endif
seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
for_each_irq_nr(i)
seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1825f71..07010c9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -114,6 +114,9 @@
#ifndef CONFIG_MMU
extern int sysctl_nr_trim_pages;
#endif
+#ifdef CONFIG_PROC_FS
+extern unsigned int proc_stat_irqs_latency_ms;
+#endif
/* Constants used for minimum and maximum */
#ifdef CONFIG_LOCKUP_DETECTOR
@@ -1890,6 +1893,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
.proc_handler = proc_dointvec_minmax,
.extra1 = &one,
},
+#ifdef CONFIG_PROC_FS
+ {
+ .procname = "proc-stat-irqs-latency-ms",
+ .data = &proc_stat_irqs_latency_ms,
+ .maxlen = sizeof(proc_stat_irqs_latency_ms),
+ .mode = 0644,
+ .proc_handler = proc_douintvec,
+ },
+#endif
{ }
};
--
1.8.3.1
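For readers unfamiliar with the sysctl plumbing: the kernel/sysctl.c entry above exposes the knob as /proc/sys/fs/proc-stat-irqs-latency-ms (it is listed under /proc/sys/fs in the documentation hunk). A minimal userspace sketch of setting it to 100 ms on a kernel with this patch applied, equivalent to "echo 100 > /proc/sys/fs/proc-stat-irqs-latency-ms":

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Path comes from the "fs" sysctl directory plus the procname in the patch. */
        int fd = open("/proc/sys/fs/proc-stat-irqs-latency-ms", O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, "100\n", 4) != 4)
                perror("write");
        close(fd);
        return 0;
}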
On Mon, Jan 07, 2019 at 07:58:40AM -0800, Matthew Wilcox wrote:
> On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
> > A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
>
> No. No, no, no, no, no. No.
>
> Stop adding new sysctls for this kind of thing. It's just a way to shift
> blame from us (who allegedly know what we're doing) to the sysadmin
> (who probably has better things to be doing than keeping up with the
> intricacies of development of every single piece of software running on
> their system).
>
> Let's figure out what this _should_ be.
Yeah, let's start interrogating about which values specifically this
super sekret application wants.
I assume CPU stats, so a system call which returns CPU statistics in binary form.
> Why are you caching the _output_ of calling sprintf(), rather than caching the
> values of each interrupt?
For output caching, a string is better, but I'm not defending the patch.
On 01/07/2019 11:33 AM, Alexey Dobriyan wrote:
> On Mon, Jan 07, 2019 at 07:58:40AM -0800, Matthew Wilcox wrote:
>> On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
>>> A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
>> No. No, no, no, no, no. No.
>>
>> Stop adding new sysctls for this kind of thing. It's just a way to shift
>> blame from us (who allegedly know what we're doing) to the sysadmin
>> (who probably has better things to be doing than keeping up with the
>> intricacies of development of every single piece of software running on
>> their system).
>>
>> Let's figure out what this _should_ be.
> Yeah, let's start interrogating about which values specifically this
> super sekret applications wants.
>
> I assume CPU stats, so system call which returns CPU statistics in binary form.
The /proc/stat file contains a bunch of statistics for different system
parameters. I doubt the irq values are what most applications that read
/proc/stat are looking for. The problem some customers saw was that
their application performance dropped quite significantly
when moving to a newer system with more CPUs and more irqs. The issue
here is that their applications read /proc/stat at a relatively high
frequency. The irq counting overhead can weigh quite heavily and
slow their applications down. Of course, they can modify their
applications to try not to read /proc/stat that often.
One instance I was notified about recently was the Oracle database, which
used lseek() to move the current file pointer of /proc/stat, causing the
system to recompute the stats every time the file pointer was moved. Of
course, that is not the right way to read a procfs file and they are
probably going to change that. My point is that customers are going to hit
problems like that when they move to larger and newer systems.
Thanks,
Longman
On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
> Reading /proc/stat can be slow especially if there are many irqs and on
> systems with many CPUs as summation of all the percpu counts for each
> of the irqs is required. On some newer systems, there can be more than
> 1000 irqs per socket.
>
> Applications that need to read /proc/stat many times per seconds will
> easily hit a bottleneck. In reality, the irq counts are seldom looked
> at. Even those applications that read them don't really need up-to-date
> information. One way to reduce the performance impact of irq counts
> computation is to do it less frequently.
>
> A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
No. No, no, no, no, no. No.
Stop adding new sysctls for this kind of thing. It's just a way to shift
blame from us (who allegedly know what we're doing) to the sysadmin
(who probably has better things to be doing than keeping up with the
intricacies of development of every single piece of software running on
their system).
Let's figure out what this _should_ be. As a strawman, I propose we
update these stats once a second. That's easy to document and understand.
> @@ -98,7 +105,48 @@ static u64 compute_stat_irqs_sum(void)
> static void show_stat_irqs(struct seq_file *p)
> {
> int i;
> +#ifdef CONFIG_PROC_SYSCTL
> + static char *irqs_buf; /* Buffer for irqs values */
> + static int buflen;
> + static unsigned long last_jiffies; /* Last buffer update jiffies */
> + static DEFINE_MUTEX(irqs_mutex);
> + unsigned int latency = proc_stat_irqs_latency_ms;
> +
> + if (latency) {
> + char *ptr;
> +
> + latency = _msecs_to_jiffies(latency);
> +
> + mutex_lock(&irqs_mutex);
> + if (irqs_buf && time_before(jiffies, last_jiffies + latency))
> + goto print_out;
> +
> + /*
> + * Each irq value may require up to 11 bytes.
> + */
> + if (!irqs_buf) {
> + irqs_buf = kmalloc(nr_irqs * 11 + 32,
> + GFP_KERNEL | __GFP_ZERO);
Why are you caching the _output_ of calling sprintf(), rather than caching the
values of each interrupt?
On Mon, Jan 07, 2019 at 11:07:47AM -0500, Waiman Long wrote:
> > Why are you caching the _output_ of calling sprintf(), rather than caching the
> > values of each interrupt?
> >
> It is just faster to dump the whole string buffer than redoing the
> number formatting each time when the values don't change. I can cache
> the individual sums instead if it is the preferred by most.
But it also consumes more memory. Can you gather some measurements to
find out what the performance difference is if you cache the values
instead of the strings?
On Mon, Jan 7, 2019 at 7:13 AM Waiman Long <[email protected]> wrote:
>
> The code that generates the "intr" line of /proc/stat is now moved
> from show_stat() into a new function - show_stat_irqs(). There is no
> functional change.
>
> Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
-Kees
> ---
> fs/proc/stat.c | 39 +++++++++++++++++++++++++++++----------
> 1 file changed, 29 insertions(+), 10 deletions(-)
>
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 535eda7..4b06f1b 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -79,12 +79,38 @@ static u64 get_iowait_time(int cpu)
>
> #endif
>
> +static u64 compute_stat_irqs_sum(void)
> +{
> + int i;
> + u64 sum = 0;
> +
> + for_each_possible_cpu(i) {
> + sum += kstat_cpu_irqs_sum(i);
> + sum += arch_irq_stat_cpu(i);
> + }
> + sum += arch_irq_stat();
> + return sum;
> +}
> +
> +/*
> + * Print out the "intr" line of /proc/stat.
> + */
> +static void show_stat_irqs(struct seq_file *p)
> +{
> + int i;
> +
> + seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
> + for_each_irq_nr(i)
> + seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
> +
> + seq_putc(p, '\n');
> +}
> +
> static int show_stat(struct seq_file *p, void *v)
> {
> int i, j;
> u64 user, nice, system, idle, iowait, irq, softirq, steal;
> u64 guest, guest_nice;
> - u64 sum = 0;
> u64 sum_softirq = 0;
> unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
> struct timespec64 boottime;
> @@ -105,8 +131,6 @@ static int show_stat(struct seq_file *p, void *v)
> steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
> guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
> guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
> - sum += kstat_cpu_irqs_sum(i);
> - sum += arch_irq_stat_cpu(i);
>
> for (j = 0; j < NR_SOFTIRQS; j++) {
> unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
> @@ -115,7 +139,6 @@ static int show_stat(struct seq_file *p, void *v)
> sum_softirq += softirq_stat;
> }
> }
> - sum += arch_irq_stat();
>
> seq_put_decimal_ull(p, "cpu ", nsec_to_clock_t(user));
> seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
> @@ -154,14 +177,10 @@ static int show_stat(struct seq_file *p, void *v)
> seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
> seq_putc(p, '\n');
> }
> - seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
> -
> - /* sum again ? it could be updated? */
> - for_each_irq_nr(j)
> - seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
> + show_stat_irqs(p);
>
> seq_printf(p,
> - "\nctxt %llu\n"
> + "ctxt %llu\n"
> "btime %llu\n"
> "processes %lu\n"
> "procs_running %lu\n"
> --
> 1.8.3.1
>
--
Kees Cook
On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> As newer systems have more and more IRQs and CPUs available in their
> system, the performance of reading /proc/stat frequently is getting
> worse and worse.
Because the "roll-your-own" per-cpu counter implementaiton has been
optimised for low possible addition overhead on the premise that
summing the counters is rare and isn't a performance issue. This
patchset is a direct indication that this "summing is rare and can
be slow" premise is now invalid.
We have percpu counter infrastructure that trades off a small amount
of addition overhead for zero-cost reading of the counter value.
i.e. why not just convert this whole mess to percpu_counters and
then just use percpu_counter_read_positive()? Then we just don't
care how often userspace reads the /proc file because there is no
summing involved at all...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Jan 7, 2019 at 5:32 PM Dave Chinner <[email protected]> wrote:
>
> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> > As newer systems have more and more IRQs and CPUs available in their
> > system, the performance of reading /proc/stat frequently is getting
> > worse and worse.
>
> Because the "roll-your-own" per-cpu counter implementaiton has been
> optimised for low possible addition overhead on the premise that
> summing the counters is rare and isn't a performance issue. This
> patchset is a direct indication that this "summing is rare and can
> be slow" premise is now invalid.
Focusing on counter performance is, IMHO, missing the mark. Even if
interrupt count collection were made fast, there's *something* in any
particular /proc file that a particular reader doesn't need and that,
by being uselessly collected, needlessly slows that reader. There
should be a general-purpose way for /proc file readers to tell the
kernel which bits of information interest them on a particular read
syscall sequence or particular open(2) or something. Creating a new
proc file for every useful combination of attributes doesn't scale
either.
On 01/07/2019 05:32 PM, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
>> As newer systems have more and more IRQs and CPUs available in their
>> system, the performance of reading /proc/stat frequently is getting
>> worse and worse.
> Because the "roll-your-own" per-cpu counter implementaiton has been
> optimised for low possible addition overhead on the premise that
> summing the counters is rare and isn't a performance issue. This
> patchset is a direct indication that this "summing is rare and can
> be slow" premise is now invalid.
>
> We have percpu counter infrastructure that trades off a small amount
> of addition overhead for zero-cost reading of the counter value.
> i.e. why not just convert this whole mess to percpu_counters and
> then just use percpu_counter_read_positive()? Then we just don't
> care how often userspace reads the /proc file because there is no
> summing involved at all...
>
> Cheers,
>
> Dave.
Yes, percpu_counter_read_positive() is cheap. However, you still need to
pay the price somewhere. In the case of percpu_counter, the update is
more expensive.
I would say the percentage of applications that will hit this problem is
small. But for them, this problem has some significant performance overhead.
Cheers,
Longman
On Mon, Jan 07, 2019 at 05:41:28PM -0500, Daniel Colascione wrote:
> There should be a general-purpose way for /proc file readers to tell
> the kernel which bits of information interest them on a particular read
> syscall sequence or particular open(2) or something.
This opens a Pandora's box full of ioctls. /proc supports ioctls on files
which aren't /proc/$PID and /proc/sys/* though.
So people made a mess with allegedly easy text files and now they
continue to make even more mess adding ioctls on top.
On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
> On 01/07/2019 05:32 PM, Dave Chinner wrote:
> > On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> >> As newer systems have more and more IRQs and CPUs available in their
> >> system, the performance of reading /proc/stat frequently is getting
> >> worse and worse.
> > Because the "roll-your-own" per-cpu counter implementaiton has been
> > optimised for low possible addition overhead on the premise that
> > summing the counters is rare and isn't a performance issue. This
> > patchset is a direct indication that this "summing is rare and can
> > be slow" premise is now invalid.
> >
> > We have percpu counter infrastructure that trades off a small amount
> > of addition overhead for zero-cost reading of the counter value.
> > i.e. why not just convert this whole mess to percpu_counters and
> > then just use percpu_counter_read_positive()? Then we just don't
> > care how often userspace reads the /proc file because there is no
> > summing involved at all...
> >
> > Cheers,
> >
> > Dave.
>
> Yes, percpu_counter_read_positive() is cheap. However, you still need to
> pay the price somewhere. In the case of percpu_counter, the update is
> more expensive.
Ummm, that's exactly what I just said. It's a percpu counter that
solves the "sum is expensive and frequent" problem, just like you
are encountering here. I do not need basic scalability algorithms
explained to me.
> I would say the percentage of applications that will hit this problem is
> small. But for them, this problem has some significant performance overhead.
Well, duh!
What I was suggesting is that you change the per-cpu counter
implementation to the /generic infrastructure/ that solves this
problem, and then determine if the extra update overhead is at all
measurable. If you can't measure any difference in update overhead,
then slapping complexity on the existing counter to attempt to
mitigate the summing overhead is the wrong solution.
Indeed, it may be that you need to use a custom batch scaling curve
for the generic per-cpu counter infrastructure to mitigate the
update overhead, but the fact is we already have generic
infrastructure that solves your problem and so the solution should
be "use the generic infrastructure" until it can be proven not to
work.
i.e. prove the generic infrastructure is not fit for purpose and
cannot be improved sufficiently to work for this use case before
implementing a complex, one-off snowflake counter implementation...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue 08-01-19 13:04:22, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
> > On 01/07/2019 05:32 PM, Dave Chinner wrote:
> > > On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> > >> As newer systems have more and more IRQs and CPUs available in their
> > >> system, the performance of reading /proc/stat frequently is getting
> > >> worse and worse.
> > > Because the "roll-your-own" per-cpu counter implementaiton has been
> > > optimised for low possible addition overhead on the premise that
> > > summing the counters is rare and isn't a performance issue. This
> > > patchset is a direct indication that this "summing is rare and can
> > > be slow" premise is now invalid.
> > >
> > > We have percpu counter infrastructure that trades off a small amount
> > > of addition overhead for zero-cost reading of the counter value.
> > > i.e. why not just convert this whole mess to percpu_counters and
> > > then just use percpu_counter_read_positive()? Then we just don't
> > > care how often userspace reads the /proc file because there is no
> > > summing involved at all...
> > >
> > > Cheers,
> > >
> > > Dave.
> >
> > Yes, percpu_counter_read_positive() is cheap. However, you still need to
> > pay the price somewhere. In the case of percpu_counter, the update is
> > more expensive.
>
> Ummm, that's exactly what I just said. It's a percpu counter that
> solves the "sum is expensive and frequent" problem, just like you
> are encountering here. I do not need basic scalability algorithms
> explained to me.
>
> > I would say the percentage of applications that will hit this problem is
> > small. But for them, this problem has some significant performance overhead.
>
> Well, duh!
>
> What I was suggesting is that you change the per-cpu counter
> implementation to the /generic infrastructure/ that solves this
> problem, and then determine if the extra update overhead is at all
> measurable. If you can't measure any difference in update overhead,
> then slapping complexity on the existing counter to attempt to
> mitigate the summing overhead is the wrong solution.
>
> Indeed, it may be that you need o use a custom batch scaling curve
> for the generic per-cpu coutner infrastructure to mitigate the
> update overhead, but the fact is we already have generic
> infrastructure that solves your problem and so the solution should
> be "use the generic infrastructure" until it can be proven not to
> work.
>
> i.e. prove the generic infrastructure is not fit for purpose and
> cannot be improved sufficiently to work for this use case before
> implementing a complex, one-off snowflake counter implementation...
Completely agreed! Apart from that, I find the conversion to a generic
infrastructure worthwhile even if it doesn't solve the problem at hand
completely. If for no other reason than the sheer code removal, as kstat
is not really used for anything apart from this accounting AFAIR. The
less ad-hoc code we have the better IMHO.
And to the underlying problem: some proc files do not scale on large
machines. Maybe it is time to explain to application writers that
if they are collecting data too aggressively then it won't scale. We can
only do this much. Lying about numbers by hiding updates is, well,
lying and won't solve the underlying problem.
--
Michal Hocko
SUSE Labs
On 01/07/2019 09:04 PM, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
>> On 01/07/2019 05:32 PM, Dave Chinner wrote:
>>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
>>>> As newer systems have more and more IRQs and CPUs available in their
>>>> system, the performance of reading /proc/stat frequently is getting
>>>> worse and worse.
>>> Because the "roll-your-own" per-cpu counter implementaiton has been
>>> optimised for low possible addition overhead on the premise that
>>> summing the counters is rare and isn't a performance issue. This
>>> patchset is a direct indication that this "summing is rare and can
>>> be slow" premise is now invalid.
>>>
>>> We have percpu counter infrastructure that trades off a small amount
>>> of addition overhead for zero-cost reading of the counter value.
>>> i.e. why not just convert this whole mess to percpu_counters and
>>> then just use percpu_counter_read_positive()? Then we just don't
>>> care how often userspace reads the /proc file because there is no
>>> summing involved at all...
>>>
>>> Cheers,
>>>
>>> Dave.
>> Yes, percpu_counter_read_positive() is cheap. However, you still need to
>> pay the price somewhere. In the case of percpu_counter, the update is
>> more expensive.
> Ummm, that's exactly what I just said. It's a percpu counter that
> solves the "sum is expensive and frequent" problem, just like you
> are encountering here. I do not need basic scalability algorithms
> explained to me.
What I am trying to say is that the "sum is expensive and frequent"
premise is only true for a very small percentage of applications. It is
not true for most of them. I am hesitant to add latency to the interrupt
path that will affect all applications.
>> I would say the percentage of applications that will hit this problem is
>> small. But for them, this problem has some significant performance overhead.
> Well, duh!
>
> What I was suggesting is that you change the per-cpu counter
> implementation to the /generic infrastructure/ that solves this
> problem, and then determine if the extra update overhead is at all
> measurable. If you can't measure any difference in update overhead,
> then slapping complexity on the existing counter to attempt to
> mitigate the summing overhead is the wrong solution.
>
> Indeed, it may be that you need o use a custom batch scaling curve
> for the generic per-cpu coutner infrastructure to mitigate the
> update overhead, but the fact is we already have generic
> infrastructure that solves your problem and so the solution should
> be "use the generic infrastructure" until it can be proven not to
> work.
>
> i.e. prove the generic infrastructure is not fit for purpose and
> cannot be improved sufficiently to work for this use case before
> implementing a complex, one-off snowflake counter implementation...
I see your point. I like the deferred summation approach that I am
currently using. If I have to modify the current per-cpu counter
implementation to support that, and probably add counter
grouping support to amortize the overhead, that could be a major
undertaking. This is not a high priority item for me at the moment, so I
may have to wait until I have some spare time.
Thanks,
Longman
On 01/08/2019 11:11 AM, Michal Hocko wrote:
> On Tue 08-01-19 13:04:22, Dave Chinner wrote:
>> On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
>>> On 01/07/2019 05:32 PM, Dave Chinner wrote:
>>>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
>>>>> As newer systems have more and more IRQs and CPUs available in their
>>>>> system, the performance of reading /proc/stat frequently is getting
>>>>> worse and worse.
>>>> Because the "roll-your-own" per-cpu counter implementaiton has been
>>>> optimised for low possible addition overhead on the premise that
>>>> summing the counters is rare and isn't a performance issue. This
>>>> patchset is a direct indication that this "summing is rare and can
>>>> be slow" premise is now invalid.
>>>>
>>>> We have percpu counter infrastructure that trades off a small amount
>>>> of addition overhead for zero-cost reading of the counter value.
>>>> i.e. why not just convert this whole mess to percpu_counters and
>>>> then just use percpu_counter_read_positive()? Then we just don't
>>>> care how often userspace reads the /proc file because there is no
>>>> summing involved at all...
>>>>
>>>> Cheers,
>>>>
>>>> Dave.
>>> Yes, percpu_counter_read_positive() is cheap. However, you still need to
>>> pay the price somewhere. In the case of percpu_counter, the update is
>>> more expensive.
>> Ummm, that's exactly what I just said. It's a percpu counter that
>> solves the "sum is expensive and frequent" problem, just like you
>> are encountering here. I do not need basic scalability algorithms
>> explained to me.
>>
>>> I would say the percentage of applications that will hit this problem is
>>> small. But for them, this problem has some significant performance overhead.
>> Well, duh!
>>
>> What I was suggesting is that you change the per-cpu counter
>> implementation to the /generic infrastructure/ that solves this
>> problem, and then determine if the extra update overhead is at all
>> measurable. If you can't measure any difference in update overhead,
>> then slapping complexity on the existing counter to attempt to
>> mitigate the summing overhead is the wrong solution.
>>
>> Indeed, it may be that you need o use a custom batch scaling curve
>> for the generic per-cpu coutner infrastructure to mitigate the
>> update overhead, but the fact is we already have generic
>> infrastructure that solves your problem and so the solution should
>> be "use the generic infrastructure" until it can be proven not to
>> work.
>>
>> i.e. prove the generic infrastructure is not fit for purpose and
>> cannot be improved sufficiently to work for this use case before
>> implementing a complex, one-off snowflake counter implementation...
> Completely agreed! Apart from that I find that conversion to a generic
> infrastructure worth even if that doesn't solve the problem at hands
> completely. If for no other reasons then the sheer code removal as kstat
> is not really used for anything apart from this accounting AFAIR. The
> less ad-hoc code we have the better IMHO.
>
> And to the underlying problem. Some proc files do not scale on large
> machines. Maybe it is time to explain that to application writers that
> if they are collecting data too agressively then it won't scale. We can
> only do this much. Lying about numbers by hiding updates is, well,
> lying and won't solve the underlying problem.
I would not say it is lying. As I said in the changelog, reading
/proc/stat infrequently will give the right counts. Only when it is read
frequently may the data not be up-to-date. Using
percpu_counter_sum_positive() as suggested by Dave means that
the counts will likely be off by a certain amount too. So it is also a
trade-off between accuracy and performance.
Cheers,
Longman
On 01/08/2019 12:05 PM, Waiman Long wrote:
> On 01/08/2019 11:11 AM, Michal Hocko wrote:
>> On Tue 08-01-19 13:04:22, Dave Chinner wrote:
>>> On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
>>>> On 01/07/2019 05:32 PM, Dave Chinner wrote:
>>>>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
>>>>>> As newer systems have more and more IRQs and CPUs available in their
>>>>>> system, the performance of reading /proc/stat frequently is getting
>>>>>> worse and worse.
>>>>> Because the "roll-your-own" per-cpu counter implementaiton has been
>>>>> optimised for low possible addition overhead on the premise that
>>>>> summing the counters is rare and isn't a performance issue. This
>>>>> patchset is a direct indication that this "summing is rare and can
>>>>> be slow" premise is now invalid.
>>>>>
>>>>> We have percpu counter infrastructure that trades off a small amount
>>>>> of addition overhead for zero-cost reading of the counter value.
>>>>> i.e. why not just convert this whole mess to percpu_counters and
>>>>> then just use percpu_counter_read_positive()? Then we just don't
>>>>> care how often userspace reads the /proc file because there is no
>>>>> summing involved at all...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dave.
>>>> Yes, percpu_counter_read_positive() is cheap. However, you still need to
>>>> pay the price somewhere. In the case of percpu_counter, the update is
>>>> more expensive.
>>> Ummm, that's exactly what I just said. It's a percpu counter that
>>> solves the "sum is expensive and frequent" problem, just like you
>>> are encountering here. I do not need basic scalability algorithms
>>> explained to me.
>>>
>>>> I would say the percentage of applications that will hit this problem is
>>>> small. But for them, this problem has some significant performance overhead.
>>> Well, duh!
>>>
>>> What I was suggesting is that you change the per-cpu counter
>>> implementation to the /generic infrastructure/ that solves this
>>> problem, and then determine if the extra update overhead is at all
>>> measurable. If you can't measure any difference in update overhead,
>>> then slapping complexity on the existing counter to attempt to
>>> mitigate the summing overhead is the wrong solution.
>>>
>>> Indeed, it may be that you need o use a custom batch scaling curve
>>> for the generic per-cpu coutner infrastructure to mitigate the
>>> update overhead, but the fact is we already have generic
>>> infrastructure that solves your problem and so the solution should
>>> be "use the generic infrastructure" until it can be proven not to
>>> work.
>>>
>>> i.e. prove the generic infrastructure is not fit for purpose and
>>> cannot be improved sufficiently to work for this use case before
>>> implementing a complex, one-off snowflake counter implementation...
>> Completely agreed! Apart from that I find that conversion to a generic
>> infrastructure worth even if that doesn't solve the problem at hands
>> completely. If for no other reasons then the sheer code removal as kstat
>> is not really used for anything apart from this accounting AFAIR. The
>> less ad-hoc code we have the better IMHO.
Another point that I want to make is that I don't see the kstat code
ever being removed unless we scrap the whole /proc/stat file.
IRQ count reporting is a performance problem simply because of the
large number of irqs (in the thousands). The other percpu counts are
currently fine as they only go up to hundreds at most.
Cheers,
Longman
>> And to the underlying problem. Some proc files do not scale on large
>> machines. Maybe it is time to explain that to application writers that
>> if they are collecting data too agressively then it won't scale. We can
>> only do this much. Lying about numbers by hiding updates is, well,
>> lying and won't solve the underlying problem.
> I would not say it is lying. As I said in the changelog, reading
> /proc/stat infrequently will give the right counts. Only when it is read
> frequently that the data may not be up-to-date. Using
> percpu_counter_sum_positive() as suggested by Dave will guarantee that
> the counts will likely be off by a certain amount too. So it is also a
> trade-off between accuracy and performance.
>
> Cheers,
> Longman
>
>
On Tue, Jan 08, 2019 at 11:58:26AM -0500, Waiman Long wrote:
> On 01/07/2019 09:04 PM, Dave Chinner wrote:
> > On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
> >> On 01/07/2019 05:32 PM, Dave Chinner wrote:
> >>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> > What I was suggesting is that you change the per-cpu counter
> > implementation to the /generic infrastructure/ that solves this
> > problem, and then determine if the extra update overhead is at all
> > measurable. If you can't measure any difference in update overhead,
> > then slapping complexity on the existing counter to attempt to
> > mitigate the summing overhead is the wrong solution.
> >
> > Indeed, it may be that you need o use a custom batch scaling curve
> > for the generic per-cpu coutner infrastructure to mitigate the
> > update overhead, but the fact is we already have generic
> > infrastructure that solves your problem and so the solution should
> > be "use the generic infrastructure" until it can be proven not to
> > work.
> >
> > i.e. prove the generic infrastructure is not fit for purpose and
> > cannot be improved sufficiently to work for this use case before
> > implementing a complex, one-off snowflake counter implementation...
>
> I see your point. I like the deferred summation approach that I am
> currently using. If I have to modify the current per-cpu counter
> implementation to support that
No! Stop that already. The "deferred counter summation" is exactly
the problem the current algorithm has and exactly the problem the
generic counters /don't have/. Changing the generic percpu counter
algorithm to match this specific hand-rolled implementation is not
desirable as it will break implementations that rely on the
bounded maximum summation deviation of the existing algorithm (e.g.
the ext4 and XFS ENOSPC accounting algorithms).
> and I probably need to add counter
> grouping support to amortize the overhead, that can be a major
The per-cpu counters already have configurable update batching
to amortise the summation update cost across multiple individual
per-cpu updates. You don't need to change the implementation at all,
just tweak the amortisation curve appropriately for the desired
requirements for update scalability (high) vs read accuracy (low).
Let's face it, on large systems where counters are frequently
updated, the resultant sum can be highly inaccurate by the time a
thousand CPU counters have been summed. The generic counters have a
predictable and bounded "fast sum" maximum deviation (batch size *
nr_cpus), so on large machines they are likely to be much more accurate
than "unlocked summing on demand" algorithms.
Cheers,
Dave.
--
Dave Chinner
[email protected]
Greeting,
FYI, we noticed a -4.8% regression of reaim.jobs_per_min due to commit:
commit: 3047027b34b8c6404b509903058b89836093acc7 ("[PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency")
url: https://github.com/0day-ci/linux/commits/Waiman-Long/proc-stat-Reduce-irqs-counting-performance-overhead/20190108-104818
in testcase: reaim
on test machine: 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory
with following parameters:
runtime: 300s
nr_task: 5000
test: shared_memory
cpufreq_governor: performance
ucode: 0x3d
test-description: REAIM is an updated and improved version of AIM 7 benchmark.
test-url: https://sourceforge.net/projects/re-aim-7/
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase/ucode:
gcc-7/performance/x86_64-rhel-7.2/5000/debian-x86_64-2018-04-03.cgz/300s/lkp-hsw-ep5/shared_memory/reaim/0x3d
commit:
51e8bce392 ("/proc/stat: Extract irqs counting code into show_stat_irqs()")
3047027b34 ("/proc/stat: Add sysctl parameter to control irq counts latency")
51e8bce392dd2cc9 3047027b34b8c6404b50990305
---------------- --------------------------
fail:runs %reproduction fail:runs
| | |
1:4 -25% :4 kmsg.igb#:#:#:exceed_max#second
%stddev %change %stddev
\ | \
101.96 +7.5% 109.60 reaim.child_systime
33.32 -1.8% 32.73 reaim.child_utime
5534451 -4.8% 5271308 reaim.jobs_per_min
1106 -4.8% 1054 reaim.jobs_per_min_child
5800927 -4.9% 5517884 reaim.max_jobs_per_min
5.42 +5.0% 5.69 reaim.parent_time
1.51 +5.3% 1.59 reaim.std_dev_time
29374932 -2.8% 28558608 reaim.time.minor_page_faults
1681 +1.6% 1708 reaim.time.percent_of_cpu_this_job_got
3841 +4.5% 4012 reaim.time.system_time
1234 -4.4% 1180 reaim.time.user_time
18500000 -2.7% 18000000 reaim.workload
5495296 ± 9% -9.5% 4970496 meminfo.DirectMap2M
5142 ± 18% -43.2% 2920 ± 46% numa-vmstat.node0.nr_shmem
29.00 ± 32% +56.9% 45.50 ± 10% vmstat.procs.r
67175 ± 37% +66.6% 111910 ± 20% numa-meminfo.node0.AnonHugePages
20591 ± 18% -43.2% 11691 ± 46% numa-meminfo.node0.Shmem
64688 ± 6% -36.8% 40906 ± 19% slabinfo.kmalloc-8.active_objs
64691 ± 6% -36.8% 40908 ± 19% slabinfo.kmalloc-8.num_objs
37.36 ± 7% +11.1% 41.53 ± 4% boot-time.boot
29.15 ± 6% +14.3% 33.31 ± 3% boot-time.dhcp
847.73 ± 9% +12.9% 957.09 ± 4% boot-time.idle
202.50 ±100% +101.7% 408.50 proc-vmstat.nr_mlock
8018 ± 9% -12.3% 7034 ± 2% proc-vmstat.nr_shmem
29175944 -2.8% 28369676 proc-vmstat.numa_hit
29170351 -2.8% 28364111 proc-vmstat.numa_local
5439 ± 5% -18.7% 4423 ± 7% proc-vmstat.pgactivate
30220220 -2.8% 29374906 proc-vmstat.pgalloc_normal
30182224 -2.7% 29368266 proc-vmstat.pgfault
30186671 -2.8% 29341792 proc-vmstat.pgfree
69510 ± 12% -34.2% 45759 ± 33% sched_debug.cfs_rq:/.load.avg
30.21 ± 24% -33.6% 20.05 ± 20% sched_debug.cfs_rq:/.runnable_load_avg.avg
66447 ± 12% -37.6% 41460 ± 37% sched_debug.cfs_rq:/.runnable_weight.avg
12.35 ± 4% +88.0% 23.22 ± 15% sched_debug.cpu.clock.stddev
12.35 ± 4% +88.0% 23.22 ± 15% sched_debug.cpu.clock_task.stddev
30.06 ± 12% -26.5% 22.10 ± 13% sched_debug.cpu.cpu_load[0].avg
29.37 ± 9% -22.6% 22.72 ± 13% sched_debug.cpu.cpu_load[1].avg
28.71 ± 6% -21.1% 22.66 ± 16% sched_debug.cpu.cpu_load[2].avg
17985 -12.0% 15823 ± 2% sched_debug.cpu.curr->pid.max
67478 ± 6% -32.5% 45531 ± 24% sched_debug.cpu.load.avg
10369 ± 49% -100.0% 0.00 sched_debug.cpu.load.min
0.21 ± 34% -100.0% 0.00 sched_debug.cpu.nr_running.min
12.98 -16.3% 10.86 ± 11% perf-stat.i.MPKI
5.712e+09 -3.8% 5.492e+09 perf-stat.i.branch-instructions
1.024e+08 -3.7% 98557208 perf-stat.i.branch-misses
8.17 +0.4 8.58 ± 2% perf-stat.i.cache-miss-rate%
7839589 +10.5% 8659798 perf-stat.i.cache-misses
86324420 +3.8% 89595898 ± 2% perf-stat.i.cache-references
1.55 ± 2% -4.0% 1.49 perf-stat.i.cpi
2290 -1.9% 2246 perf-stat.i.cpu-migrations
4667 -10.9% 4160 perf-stat.i.cycles-between-cache-misses
8.749e+09 -3.9% 8.409e+09 perf-stat.i.dTLB-loads
527660 ± 3% -15.0% 448539 ± 5% perf-stat.i.dTLB-store-misses
5.747e+09 -4.3% 5.499e+09 perf-stat.i.dTLB-stores
53047071 -3.5% 51190942 perf-stat.i.iTLB-load-misses
20576112 -13.7% 17759009 ± 2% perf-stat.i.iTLB-loads
3.207e+10 -3.9% 3.083e+10 perf-stat.i.instructions
0.77 -2.3% 0.75 perf-stat.i.ipc
99933 -3.8% 96127 perf-stat.i.minor-faults
4325719 +5.6% 4568226 perf-stat.i.node-load-misses
52.39 -2.0 50.36 perf-stat.i.node-store-miss-rate%
1411700 +20.9% 1706321 perf-stat.i.node-store-misses
883790 +34.1% 1184836 perf-stat.i.node-stores
99933 -3.8% 96127 perf-stat.i.page-faults
2.69 +7.9% 2.91 perf-stat.overall.MPKI
9.08 +0.6 9.67 ± 2% perf-stat.overall.cache-miss-rate%
1.13 +5.0% 1.19 perf-stat.overall.cpi
4633 -8.6% 4233 perf-stat.overall.cycles-between-cache-misses
0.01 ± 2% -0.0 0.01 ± 4% perf-stat.overall.dTLB-store-miss-rate%
72.05 +2.2 74.24 perf-stat.overall.iTLB-load-miss-rate%
0.88 -4.8% 0.84 perf-stat.overall.ipc
78.12 +1.4 79.52 perf-stat.overall.node-load-miss-rate%
61.49 -2.5 59.01 perf-stat.overall.node-store-miss-rate%
5.688e+09 -3.8% 5.471e+09 perf-stat.ps.branch-instructions
1.02e+08 -3.7% 98177745 perf-stat.ps.branch-misses
7807912 +10.5% 8626353 perf-stat.ps.cache-misses
85999504 +3.8% 89266688 ± 2% perf-stat.ps.cache-references
2282 -1.9% 2239 perf-stat.ps.cpu-migrations
8.713e+09 -3.9% 8.376e+09 perf-stat.ps.dTLB-loads
525761 ± 3% -15.0% 446967 ± 5% perf-stat.ps.dTLB-store-misses
5.723e+09 -4.3% 5.478e+09 perf-stat.ps.dTLB-stores
52823322 -3.5% 50990190 perf-stat.ps.iTLB-load-misses
20490261 -13.7% 17689923 ± 2% perf-stat.ps.iTLB-loads
3.193e+10 -3.8% 3.071e+10 perf-stat.ps.instructions
99560 -3.8% 95786 perf-stat.ps.minor-faults
4308031 +5.6% 4550453 perf-stat.ps.node-load-misses
1405805 +20.9% 1699670 perf-stat.ps.node-store-misses
880319 +34.1% 1180378 perf-stat.ps.node-stores
99560 -3.8% 95786 perf-stat.ps.page-faults
9.664e+12 -2.8% 9.397e+12 perf-stat.total.instructions
33.09 ± 7% -25.0 8.10 ± 45% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
32.21 ± 7% -24.3 7.92 ± 45% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
22.36 ± 7% -17.1 5.24 ± 49% perf-profile.calltrace.cycles-pp.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
10.57 ± 5% -8.3 2.31 ± 48% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64
9.60 ± 4% -7.6 2.03 ± 49% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret
3.68 ± 8% -3.3 0.43 ±102% perf-profile.calltrace.cycles-pp.security_ipc_permission.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
4.37 ± 9% -2.8 1.60 ± 50% perf-profile.calltrace.cycles-pp.ipc_obtain_object_check.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.30 ± 2% -1.9 0.44 ±102% perf-profile.calltrace.cycles-pp.security_sem_semop.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
5.10 ± 3% -1.6 3.46 ± 10% perf-profile.calltrace.cycles-pp.ipc_has_perm.security_ipc_permission.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
3.20 -1.3 1.95 ± 23% perf-profile.calltrace.cycles-pp.avc_has_perm.ipc_has_perm.security_ipc_permission.do_semtimedop.do_syscall_64
2.12 ± 3% -0.4 1.76 ± 4% perf-profile.calltrace.cycles-pp.avc_has_perm.ipc_has_perm.security_sem_semop.do_semtimedop.do_syscall_64
1.03 ± 2% -0.1 0.94 ± 3% perf-profile.calltrace.cycles-pp.__do_page_fault.do_page_fault.page_fault
1.06 ± 2% -0.1 0.97 ± 3% perf-profile.calltrace.cycles-pp.do_page_fault.page_fault
1.06 ± 2% -0.1 0.98 ± 3% perf-profile.calltrace.cycles-pp.page_fault
0.91 ± 3% -0.1 0.82 ± 3% perf-profile.calltrace.cycles-pp.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
0.86 ± 3% -0.1 0.78 ± 3% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
0.61 ± 12% +0.2 0.79 ± 12% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry
0.61 ± 10% +0.2 0.79 ± 10% perf-profile.calltrace.cycles-pp.shm_close.remove_vma.__do_munmap.ksys_shmdt.do_syscall_64
0.67 ± 11% +0.2 0.86 ± 10% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
0.40 ± 57% +0.2 0.62 ± 8% perf-profile.calltrace.cycles-pp.call_rwsem_down_write_failed.down_write.do_shmat.__x64_sys_shmat.do_syscall_64
0.40 ± 57% +0.2 0.62 ± 7% perf-profile.calltrace.cycles-pp.rwsem_down_write_failed.call_rwsem_down_write_failed.down_write.do_shmat.__x64_sys_shmat
0.43 ± 57% +0.2 0.66 ± 7% perf-profile.calltrace.cycles-pp.down_write.do_shmat.__x64_sys_shmat.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.41 ± 57% +0.3 0.67 ± 9% perf-profile.calltrace.cycles-pp.call_rwsem_down_write_failed.down_write.shm_close.remove_vma.__do_munmap
0.40 ± 57% +0.3 0.66 ± 7% perf-profile.calltrace.cycles-pp.down_write.ipcget.__x64_sys_shmget.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.41 ± 57% +0.3 0.67 ± 9% perf-profile.calltrace.cycles-pp.rwsem_down_write_failed.call_rwsem_down_write_failed.down_write.shm_close.remove_vma
0.44 ± 57% +0.3 0.71 ± 9% perf-profile.calltrace.cycles-pp.down_write.shm_close.remove_vma.__do_munmap.ksys_shmdt
0.26 ±100% +0.4 0.67 ± 8% perf-profile.calltrace.cycles-pp.dput.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64
0.28 ±100% +0.4 0.71 ± 7% perf-profile.calltrace.cycles-pp.__might_fault._copy_from_user.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.58 ± 10% +0.4 1.03 ± 6% perf-profile.calltrace.cycles-pp.shmctl_down.ksys_shmctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
0.59 ± 10% +0.5 1.05 ± 6% perf-profile.calltrace.cycles-pp.ksys_shmctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
0.13 ±173% +0.5 0.62 ± 7% perf-profile.calltrace.cycles-pp.call_rwsem_down_write_failed.down_write.ipcget.__x64_sys_shmget.do_syscall_64
0.13 ±173% +0.5 0.62 ± 7% perf-profile.calltrace.cycles-pp.rwsem_down_write_failed.call_rwsem_down_write_failed.down_write.ipcget.__x64_sys_shmget
0.12 ±173% +0.5 0.66 ± 8% perf-profile.calltrace.cycles-pp.dentry_kill.dput.__fput.task_work_run.exit_to_usermode_loop
0.00 +0.6 0.58 ± 7% perf-profile.calltrace.cycles-pp.copy_user_generic_unrolled._copy_from_user.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.6 0.60 ± 7% perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.ksys_shmdt.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.6 0.61 ± 7% perf-profile.calltrace.cycles-pp.wait_consider_task.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.00 +0.6 0.62 ± 4% perf-profile.calltrace.cycles-pp.semctl_down.ksys_semctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.semctl
0.75 ± 8% +0.6 1.38 ± 8% perf-profile.calltrace.cycles-pp.__x64_sys_semop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
0.00 +0.6 0.65 ± 9% perf-profile.calltrace.cycles-pp.ipcperms.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
0.00 +0.6 0.65 ± 8% perf-profile.calltrace.cycles-pp.__dentry_kill.dentry_kill.dput.__fput.task_work_run
0.89 ± 11% +0.7 1.57 ± 8% perf-profile.calltrace.cycles-pp.do_shmat.__x64_sys_shmat.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmat
0.90 ± 11% +0.7 1.58 ± 8% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmat
0.90 ± 10% +0.7 1.58 ± 8% perf-profile.calltrace.cycles-pp.__x64_sys_shmat.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmat
0.00 +0.7 0.68 ± 5% perf-profile.calltrace.cycles-pp.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.7 0.69 ± 6% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.wait
0.00 +0.7 0.69 ± 6% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.wait
0.00 +0.7 0.69 ± 6% perf-profile.calltrace.cycles-pp.__do_sys_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe.wait
0.00 +0.7 0.69 ± 6% perf-profile.calltrace.cycles-pp.kernel_wait4.__do_sys_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe.wait
0.90 ± 11% +0.7 1.59 ± 8% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmat
0.00 +0.7 0.69 ± 5% perf-profile.calltrace.cycles-pp.wait
0.27 ±100% +0.7 0.99 ± 7% perf-profile.calltrace.cycles-pp.perform_atomic_semop.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
0.95 ± 11% +0.7 1.67 ± 8% perf-profile.calltrace.cycles-pp.shmat
0.00 +0.7 0.73 ± 6% perf-profile.calltrace.cycles-pp.ipcget.ksys_semget.do_syscall_64.entry_SYSCALL_64_after_hwframe.semget
0.00 +0.7 0.74 ± 8% perf-profile.calltrace.cycles-pp.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
0.72 ± 9% +0.7 1.46 ± 7% perf-profile.calltrace.cycles-pp._raw_spin_lock.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
0.00 +0.8 0.75 ± 5% perf-profile.calltrace.cycles-pp.ksys_semget.do_syscall_64.entry_SYSCALL_64_after_hwframe.semget
0.00 +0.8 0.76 ± 8% perf-profile.calltrace.cycles-pp.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
0.89 ± 9% +0.8 1.65 ± 9% perf-profile.calltrace.cycles-pp.__do_munmap.ksys_shmdt.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmdt
0.00 +0.8 0.76 ± 5% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.semget
0.00 +0.8 0.76 ± 5% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.semget
0.90 ± 9% +0.8 1.68 ± 9% perf-profile.calltrace.cycles-pp.ksys_shmdt.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmdt
0.00 +0.8 0.78 ± 6% perf-profile.calltrace.cycles-pp.ksys_semctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.semctl
0.00 +0.8 0.80 ± 5% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.semctl
0.00 +0.8 0.80 ± 5% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.semctl
0.00 +0.8 0.85 ± 5% perf-profile.calltrace.cycles-pp.semget
0.97 ± 9% +0.8 1.82 ± 6% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
0.97 ± 9% +0.8 1.82 ± 7% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmctl
0.94 ± 10% +0.9 1.79 ± 6% perf-profile.calltrace.cycles-pp._copy_from_user.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
1.04 ± 9% +0.9 1.91 ± 6% perf-profile.calltrace.cycles-pp.ipcget.__x64_sys_shmget.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmget
1.05 ± 9% +0.9 1.92 ± 5% perf-profile.calltrace.cycles-pp.__x64_sys_shmget.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmget
1.05 ± 9% +0.9 1.93 ± 6% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmget
1.05 ± 9% +0.9 1.93 ± 6% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmget
1.02 ± 9% +0.9 1.92 ± 7% perf-profile.calltrace.cycles-pp.shmctl
1.00 ± 9% +0.9 1.93 ± 9% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmdt
1.00 ± 8% +0.9 1.94 ± 9% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmdt
1.13 ± 8% +0.9 2.08 ± 6% perf-profile.calltrace.cycles-pp.shmget
0.00 +1.0 0.97 ± 5% perf-profile.calltrace.cycles-pp.semctl
1.03 ± 9% +1.0 1.99 ± 5% perf-profile.calltrace.cycles-pp.do_smart_update.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
1.04 ± 8% +1.0 2.03 ± 9% perf-profile.calltrace.cycles-pp.shmdt
2.61 ± 9% +1.3 3.89 ± 14% perf-profile.calltrace.cycles-pp.security_ipc_permission.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
1.65 ± 13% +2.0 3.65 ± 7% perf-profile.calltrace.cycles-pp.security_sem_semop.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
1.61 ± 6% +5.3 6.89 ± 2% perf-profile.calltrace.cycles-pp.idr_find.ipc_obtain_object_check.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
6.45 ± 9% +6.2 12.63 ± 6% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.semop
2.92 ± 5% +8.0 10.88 ± 7% perf-profile.calltrace.cycles-pp.ipc_obtain_object_check.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
9.07 ± 8% +8.8 17.85 ± 6% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.semop
15.56 ± 8% +18.8 34.38 ± 6% perf-profile.calltrace.cycles-pp.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
17.40 ± 8% +20.4 37.82 ± 6% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
17.99 ± 8% +21.0 39.00 ± 6% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.semop
33.85 ± 8% +36.1 69.98 ± 6% perf-profile.calltrace.cycles-pp.semop
reaim.time.percent_of_cpu_this_job_got
1800 +-+------------------------------------------------------------------+
| O |
1780 +-OO O O O |
1760 O-+ O O OO |
| |
1740 +-+ |
| |
1720 +-+ O .+. |
| O O + + |
1700 +-+ O .+ : |
1680 +-+ .+ .+. .+ .+.+ .+ : +. .+.++.|
|.++.+.++ : .++ ++.+ + +.+.++.++ +.+.++.+.+ ++ |
1660 +-+ : + |
| + |
1640 +-+------------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
Thanks,
Rong Chen
On Fri, Jan 18, 2019 at 9:44 PM kernel test robot <[email protected]> wrote:
>
> Greeting,
>
> FYI, we noticed a -4.8% regression of reaim.jobs_per_min due to commit:
>
>
> commit: 3047027b34b8c6404b509903058b89836093acc7 ("[PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency")
> url: https://github.com/0day-ci/linux/commits/Waiman-Long/proc-stat-Reduce-irqs-counting-performance-overhead/20190108-104818
Is this expected? (It also seems like other things in the report below
are faster; I don't understand why this particular regression was
called out.)
-Kees
> [ ... ]
--
Kees Cook
On Tue, Jan 22, 2019 at 09:02:53AM +1300, Kees Cook wrote:
> On Fri, Jan 18, 2019 at 9:44 PM kernel test robot <[email protected]> wrote:
> >
> > Greeting,
> >
> > FYI, we noticed a -4.8% regression of reaim.jobs_per_min due to commit:
> >
> >
> > commit: 3047027b34b8c6404b509903058b89836093acc7 ("[PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency")
> > url: https://github.com/0day-ci/linux/commits/Waiman-Long/proc-stat-Reduce-irqs-counting-performance-overhead/20190108-104818
>
> Is this expected? (It also seems like other things in the report below
> are faster; I don't understand why this particular regression was
> called out.)
No, but the sysctl has been dropped, so the point is moot.
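
For reference, the dropped sysctl was only a knob to rate-limit how often
the expensive per-CPU IRQ summation is redone for /proc/stat readers.
Below is a minimal userspace sketch of that rate-limiting idea, nothing
more; the names (irq_latency_ms, compute_irq_sum, read_irq_sum) and the
default interval are illustrative and are not taken from the actual patch.

#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Stand-in for the sysctl knob: maximum staleness of the cached sum. */
static unsigned int irq_latency_ms = 1000;

static uint64_t cached_irq_sum;
static struct timespec last_update;

/* Placeholder for the expensive per-CPU walk done in the kernel. */
static uint64_t compute_irq_sum(void)
{
	static uint64_t fake;

	return fake += 12345;
}

static uint64_t read_irq_sum(void)
{
	struct timespec now;
	long elapsed_ms;

	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed_ms = (long)(now.tv_sec - last_update.tv_sec) * 1000 +
		     (long)(now.tv_nsec - last_update.tv_nsec) / 1000000;

	/* Recompute at most once per irq_latency_ms, otherwise serve cache. */
	if (elapsed_ms < 0 || elapsed_ms >= (long)irq_latency_ms) {
		cached_irq_sum = compute_irq_sum();
		last_update = now;
	}
	return cached_irq_sum;
}

int main(void)
{
	for (int i = 0; i < 3; i++)
		printf("intr %llu\n", (unsigned long long)read_irq_sum());
	return 0;
}

With a knob like this set to, say, 1000 ms, back-to-back readers pay for
one recomputation per interval instead of one per read, at the cost of the
reported counts being up to one interval stale.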