2015-08-26 03:17:54

by Jason Low

Subject: [PATCH 0/3] timer: Improve itimers scalability

When running a database workload on a 16 socket machine, there were
scalability issues related to itimers.

Commit 1018016c706f addressed the issue with the thread_group_cputimer
spinlock taking up a significant portion of total run time.

This patch series addresses the other issue where a lot of time is spent
trying to acquire the sighand lock. It was found in some cases that
200+ threads were simultaneously contending for the same sighand lock,
reducing throughput by more than 30%.

Jason Low (3):
timer: Optimize fastpath_timer_check()
timer: Check thread timers only when there are active thread timers
timer: Reduce unnecessary sighand lock contention

include/linux/init_task.h | 1 +
include/linux/sched.h | 3 ++
kernel/time/posix-cpu-timers.c | 44 +++++++++++++++++++++++++++------------
3 files changed, 34 insertions(+), 14 deletions(-)

--
1.7.2.5


2015-08-26 03:18:11

by Jason Low

Subject: [PATCH 1/3] timer: Optimize fastpath_timer_check()

In fastpath_timer_check(), the task_cputime() function is always
called to compute the utime and stime values. However, this is not
necessary if there are no per-thread timers to check for. This patch
modifies the code such that we compute the task_cputime values only
when there are per-thread timers set.

Signed-off-by: Jason Low <[email protected]>
---
kernel/time/posix-cpu-timers.c | 15 +++++++--------
1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 892e3da..02596ff 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -1117,16 +1117,15 @@ static inline int task_cputime_expired(const struct task_cputime *sample,
static inline int fastpath_timer_check(struct task_struct *tsk)
{
struct signal_struct *sig;
- cputime_t utime, stime;
-
- task_cputime(tsk, &utime, &stime);

if (!task_cputime_zero(&tsk->cputime_expires)) {
- struct task_cputime task_sample = {
- .utime = utime,
- .stime = stime,
- .sum_exec_runtime = tsk->se.sum_exec_runtime
- };
+ struct task_cputime task_sample;
+ cputime_t utime, stime;
+
+ task_cputime(tsk, &utime, &stime);
+ task_sample.utime = utime;
+ task_sample.stime = stime;
+ task_sample.sum_exec_runtime = tsk->se.sum_exec_runtime;

if (task_cputime_expired(&task_sample, &tsk->cputime_expires))
return 1;
--
1.7.2.5

2015-08-26 03:18:33

by Jason Low

Subject: [PATCH 2/3] timer: Check thread timers only when there are active thread timers

The fastpath_timer_check() function contains logic to check whether any
timers are set by checking !task_cputime_zero(). Similarly, we can do
this before calling check_thread_timers(). In the case where there are
only process-wide timers, this will skip all of the per-thread timer
computations.

Signed-off-by: Jason Low <[email protected]>
---
kernel/time/posix-cpu-timers.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 02596ff..535bef5 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -1168,11 +1168,13 @@ void run_posix_cpu_timers(struct task_struct *tsk)
if (!lock_task_sighand(tsk, &flags))
return;
/*
- * Here we take off tsk->signal->cpu_timers[N] and
- * tsk->cpu_timers[N] all the timers that are firing, and
- * put them on the firing list.
+ * If there are active per-thread timers, take off
+ * tsk->signal->cpu_timers[N] and tsk->cpu_timers[N] all the
+ * timers that are firing, and put them on the firing list.
*/
- check_thread_timers(tsk, &firing);
+ if (!task_cputime_zero(&tsk->cputime_expires))
+ check_thread_timers(tsk, &firing);
+
/*
* If there are any active process wide timers (POSIX 1.b, itimers,
* RLIMIT_CPU) cputimer must be running.
--
1.7.2.5

2015-08-26 03:18:57

by Jason Low

Subject: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

It was found while running a database workload on large systems that
significant time was spent trying to acquire the sighand lock.

The issue was that whenever an itimer expired, many threads ended up
simultaneously trying to send the signal. Most of the time, nothing
happened after acquiring the sighand lock because another thread
had already sent the signal and updated the "next expire" time. The
fastpath_timer_check() didn't help much since the "next expire" time
was updated later.

This patch addresses this by having the thread_group_cputimer structure
maintain a boolean to signify when a thread in the group is already
checking for process wide timers, and adds extra logic in the fastpath
to check the boolean.

Signed-off-by: Jason Low <[email protected]>
---
include/linux/init_task.h | 1 +
include/linux/sched.h | 3 +++
kernel/time/posix-cpu-timers.c | 19 +++++++++++++++++--
3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index d0b380e..3350c77 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -53,6 +53,7 @@ extern struct fs_struct init_fs;
.cputimer = { \
.cputime_atomic = INIT_CPUTIME_ATOMIC, \
.running = 0, \
+ .checking_timer = 0, \
}, \
INIT_PREV_CPUTIME(sig) \
.cred_guard_mutex = \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 119823d..a6c8334 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -619,6 +619,8 @@ struct task_cputime_atomic {
* @cputime_atomic: atomic thread group interval timers.
* @running: non-zero when there are timers running and
* @cputime receives updates.
+ * @checking_timer: non-zero when a thread is in the process of
+ * checking for thread group timers.
*
* This structure contains the version of task_cputime, above, that is
* used for thread group CPU timer calculations.
@@ -626,6 +628,7 @@ struct task_cputime_atomic {
struct thread_group_cputimer {
struct task_cputime_atomic cputime_atomic;
int running;
+ int checking_timer;
};

#include <linux/rwsem.h>
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 535bef5..f3ddf0e 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -962,6 +962,14 @@ static void check_process_timers(struct task_struct *tsk,
unsigned long soft;

/*
+ * Signify that a thread is checking for process timers.
+ * The checking_timer field is only modified in this function,
+ * which is called with the sighand lock held. Thus, we can
+ * just use WRITE_ONCE() without any further locking.
+ */
+ WRITE_ONCE(sig->cputimer.checking_timer, 1);
+
+ /*
* Collect the current process totals.
*/
thread_group_cputimer(tsk, &cputime);
@@ -1015,6 +1023,8 @@ static void check_process_timers(struct task_struct *tsk,
sig->cputime_expires.sched_exp = sched_expires;
if (task_cputime_zero(&sig->cputime_expires))
stop_process_timers(sig);
+
+ WRITE_ONCE(sig->cputimer.checking_timer, 0);
}

/*
@@ -1132,8 +1142,13 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
}

sig = tsk->signal;
- /* Check if cputimer is running. This is accessed without locking. */
- if (READ_ONCE(sig->cputimer.running)) {
+ /*
+ * Check if thread group timers expired if the cputimer is running
+ * and that no other thread in the group is already checking for
+ * thread group cputimers.
+ */
+ if (READ_ONCE(sig->cputimer.running) &&
+ !READ_ONCE(sig->cputimer.checking_timer)) {
struct task_cputime group_sample;

sample_cputime_atomic(&group_sample, &sig->cputimer.cputime_atomic);
--
1.7.2.5

2015-08-26 03:27:11

by Andrew Morton

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

On Tue, 25 Aug 2015 20:17:45 -0700 Jason Low <[email protected]> wrote:

> When running a database workload on a 16 socket machine, there were
> scalability issues related to itimers.
>
> Commit 1018016c706f addressed the issue with the thread_group_cputimer
> spinlock taking up a significant portion of total run time.
>
> This patch series addresses the other issue where a lot of time is spent
> trying to acquire the sighand lock. It was found in some cases that
> 200+ threads were simultaneously contending for the same sighand lock,
> reducing throughput by more than 30%.

Does this imply that the patchset increased the throughput of this
workload by 30%?

And is this test case realistic? If not, what are the benefits on a
real-world workload?

2015-08-26 16:33:57

by Jason Low

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

Hi Andrew,

On Tue, 2015-08-25 at 20:27 -0700, Andrew Morton wrote:
> On Tue, 25 Aug 2015 20:17:45 -0700 Jason Low <[email protected]> wrote:
>
> > When running a database workload on a 16 socket machine, there were
> > scalability issues related to itimers.
> >
> > Commit 1018016c706f addressed the issue with the thread_group_cputimer
> > spinlock taking up a significant portion of total run time.
> >
> > This patch series addresses the other issue where a lot of time is spent
> > trying to acquire the sighand lock. It was found in some cases that
> > 200+ threads were simultaneously contending for the same sighand lock,
> > reducing throughput by more than 30%.
>
> Does this imply that the patchset increased the throughput of this
> workload by 30%?
>
> And is this test case realistic? If not, what are the benefits on a
> real-world workload?

Yes, the test case with the database workload is realistic. We did write
a simple micro-benchmark that just generates the contention in this code
path to quickly test experimental patches, since the database takes
longer to set up and run. However, the performance issues and numbers
mentioned here are for the database workload.

These patches should also be beneficial for other multi-threaded
applications which use process-wide timers, particularly on systems with
a lot of cores.

2015-08-26 16:58:00

by George Spelvin

Subject: Re: [PATCH 1/3] timer: Optimize fastpath_timer_check()

> if (!task_cputime_zero(&tsk->cputime_expires)) {
>+ struct task_cputime task_sample;
>+ cputime_t utime, stime;
>+
>+ task_cputime(tsk, &utime, &stime);
>+ task_sample.utime = utime;
>+ task_sample.stime = stime;
>+ task_sample.sum_exec_runtime = tsk->se.sum_exec_runtime;

Er, task_sample.[us]time are already the correct types.
Why are the local variables necessary? How about:

if (!task_cputime_zero(&tsk->cputime_expires)) {
+ struct task_cputime task_sample;
+
+ task_cputime(tsk, &task_sample.utime, &task_sample.stime);
+ task_sample.sum_exec_runtime = tsk->se.sum_exec_runtime;

2015-08-26 17:04:41

by George Spelvin

Subject: Re: [PATCH 2/3] timer: Check thread timers only when there are active thread timers


- check_thread_timers(tsk, &firing);
+ if (!task_cputime_zero(&tsk->cputime_expires))
+ check_thread_timers(tsk, &firing);

Sincere question; I'm not certain myself: would it make more sense to put
this shortcut into check_thread_timers()?

It seems more like an optimization of that function than something the
caller needs to know about.

2015-08-26 17:11:29

by Oleg Nesterov

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

On 08/26, Jason Low wrote:
>
> Hi Andrew,
>
> On Tue, 2015-08-25 at 20:27 -0700, Andrew Morton wrote:
> > On Tue, 25 Aug 2015 20:17:45 -0700 Jason Low <[email protected]> wrote:
> >
> > > When running a database workload on a 16 socket machine, there were
> > > scalability issues related to itimers.
> > >
> > > Commit 1018016c706f addressed the issue with the thread_group_cputimer
> > > spinlock taking up a significant portion of total run time.
> > >
> > > This patch series addresses the other issue where a lot of time is spent
> > > trying to acquire the sighand lock. It was found in some cases that
> > > 200+ threads were simultaneously contending for the same sighand lock,
> > > reducing throughput by more than 30%.
> >
> > Does this imply that the patchset increased the throughput of this
> > workload by 30%?
> >
> > And is this test case realistic? If not, what are the benefits on a
> > real-world workload?
>
> Yes, the test case with the database workload is realistic.

Can't resist, sorry... to me the very idea to use the process wide posix-
cpu-timers on performance critical application doesn't look realistic ;)

However, I think the patches are fine.


Reviewed-by: Oleg Nesterov <[email protected]>

2015-08-26 17:31:05

by Jason Low

Subject: Re: [PATCH 1/3] timer: Optimize fastpath_timer_check()

On Wed, 2015-08-26 at 12:57 -0400, George Spelvin wrote:
> > if (!task_cputime_zero(&tsk->cputime_expires)) {
> >+ struct task_cputime task_sample;
> >+ cputime_t utime, stime;
> >+
> >+ task_cputime(tsk, &utime, &stime);
> >+ task_sample.utime = utime;
> >+ task_sample.stime = stime;
> >+ task_sample.sum_exec_runtime = tsk->se.sum_exec_runtime;
>
> Er, task_sample.[us]time are already the correct types.
> Why are the local variables necessary? How about:
>
> if (!task_cputime_zero(&tsk->cputime_expires)) {
> + struct task_cputime task_sample;
> +
> + task_cputime(tsk, &task_sample.utime, &task_sample.stime);
> + task_sample.sum_exec_runtime = tsk->se.sum_exec_runtime;

Yes, good point. Now that we're moving the task_cputime() call to after
the task_sample structure is declared, the utime and stime local
variables are not required anymore.

2015-08-26 17:41:19

by Jason Low

Subject: Re: [PATCH 2/3] timer: Check thread timers only when there are active thread timers

On Wed, 2015-08-26 at 13:04 -0400, George Spelvin wrote:
> - check_thread_timers(tsk, &firing);
> + if (!task_cputime_zero(&tsk->cputime_expires))
> + check_thread_timers(tsk, &firing);
>
> Sincere question; I'm not certain myself: would it make more sense to put
> this shortcut into check_thread_timers()?
>
> It seems more like an optimization of that function than something the
> caller needs to know about.

Yes, I also thought it might be better if we add something like:

if (task_cputime_zero(&tsk->cputime_expires))
        return;

in check_thread_timers(). The reason I made it this way though is
because in the next few lines, we do a similar check before calling
check_process_timers(), and I wanted to keep things consistent.

However, perhaps we can consider also moving that
tsk->signal->cputimer.running check into check_process_timers() too.

Thanks for the suggestions.

2015-08-26 17:53:37

by Linus Torvalds

Subject: Re: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

On Tue, Aug 25, 2015 at 8:17 PM, Jason Low <[email protected]> wrote:
>
> This patch addresses this by having the thread_group_cputimer structure
> maintain a boolean to signify when a thread in the group is already
> checking for process wide timers, and adds extra logic in the fastpath
> to check the boolean.

It is not at all obvious why the unlocked read of that variable is
safe, and why there is no race with another thread just about to end
its check_process_timers().

I can well imagine that this is all perfectly safe and fine, but I'd
really like to see that comment about _why_ that's the case, and why a
completely unlocked access without even memory barriers is fine.

Linus

2015-08-26 21:58:04

by Frederic Weisbecker

Subject: Re: [PATCH 1/3] timer: Optimize fastpath_timer_check()

On Tue, Aug 25, 2015 at 08:17:46PM -0700, Jason Low wrote:
> In fastpath_timer_check(), the task_cputime() function is always
> called to compute the utime and stime values. However, this is not
> necessary if there are no per-thread timers to check for. This patch
> modifies the code such that we compute the task_cputime values only
> when there are per-thread timers set.
>
> Signed-off-by: Jason Low <[email protected]>

Reviewed-by: Frederic Weisbecker <[email protected]>

Thanks.

> ---
> kernel/time/posix-cpu-timers.c | 15 +++++++--------
> 1 files changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
> index 892e3da..02596ff 100644
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -1117,16 +1117,15 @@ static inline int task_cputime_expired(const struct task_cputime *sample,
> static inline int fastpath_timer_check(struct task_struct *tsk)
> {
> struct signal_struct *sig;
> - cputime_t utime, stime;
> -
> - task_cputime(tsk, &utime, &stime);
>
> if (!task_cputime_zero(&tsk->cputime_expires)) {
> - struct task_cputime task_sample = {
> - .utime = utime,
> - .stime = stime,
> - .sum_exec_runtime = tsk->se.sum_exec_runtime
> - };
> + struct task_cputime task_sample;
> + cputime_t utime, stime;
> +
> + task_cputime(tsk, &utime, &stime);
> + task_sample.utime = utime;
> + task_sample.stime = stime;
> + task_sample.sum_exec_runtime = tsk->se.sum_exec_runtime;
>
> if (task_cputime_expired(&task_sample, &tsk->cputime_expires))
> return 1;
> --
> 1.7.2.5
>

2015-08-26 22:07:33

by Jason Low

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

On Wed, 2015-08-26 at 19:08 +0200, Oleg Nesterov wrote:
> On 08/26, Jason Low wrote:
> >
> > Hi Andrew,
> >
> > On Tue, 2015-08-25 at 20:27 -0700, Andrew Morton wrote:
> > > On Tue, 25 Aug 2015 20:17:45 -0700 Jason Low <[email protected]> wrote:
> > >
> > > > When running a database workload on a 16 socket machine, there were
> > > > scalability issues related to itimers.
> > > >
> > > > Commit 1018016c706f addressed the issue with the thread_group_cputimer
> > > > spinlock taking up a significant portion of total run time.
> > > >
> > > > This patch series addresses the other issue where a lot of time is spent
> > > > trying to acquire the sighand lock. It was found in some cases that
> > > > 200+ threads were simultaneously contending for the same sighand lock,
> > > > reducing throughput by more than 30%.
> > >
> > > Does this imply that the patchset increased the throughput of this
> > > workload by 30%?
> > >
> > > And is this test case realistic? If not, what are the benefits on a
> > > real-world workload?
> >
> > Yes, the test case with the database workload is realistic.
>
> Can't resist, sorry... to me the very idea to use the process wide posix-
> cpu-timers on performance critical application doesn't look realistic ;)

I will let Hideaki elaborate more regarding the issue at the application
level.

> However, I think the patches are fine.
>
>
> Reviewed-by: Oleg Nesterov <[email protected]>

Thanks for reviewing!

2015-08-26 22:31:58

by Frederic Weisbecker

Subject: Re: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

On Wed, Aug 26, 2015 at 10:53:35AM -0700, Linus Torvalds wrote:
> On Tue, Aug 25, 2015 at 8:17 PM, Jason Low <[email protected]> wrote:
> >
> > This patch addresses this by having the thread_group_cputimer structure
> > maintain a boolean to signify when a thread in the group is already
> > checking for process wide timers, and adds extra logic in the fastpath
> > to check the boolean.
>
> It is not at all obvious why the unlocked read of that variable is
> safe, and why there is no race with another thread just about to end
> its check_process_timers().

The risk is when a next timer is going to expire soon after we relaxed
the "checking" variable due to a recent expiration. The thread which
expires the next timer may still see a stale value on the "checking"
state and therefore delay the timer firing until the new value is seen.
So the worst that can happen is that the timer firing gets delayed for
X jiffies (I guess in practice it's only 1 jiffy).

That said, posix cpu timers already suffer such race because
sig->cputimer.running itself is checked outside the sighand lock anyway.

> I can well imagine that this is all perfectly safe and fine, but I'd
> really like to see that comment about _why_ that's the case, and why a
> completely unlocked access without even memory barriers is fine.

Agreed, there should be a comment about that in the code (that is already full
of undocumented subtleties).

2015-08-26 23:10:30

by Hideaki Kimura

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

Sure, let me elaborate.

Executive summary:
Yes, enabling a process-wide timer in such a large machine is not
wise, but sometimes users/applications cannot avoid it.


The issue was observed actually not in a database itself but in a common
library it links to; gperftools.

The database itself is optimized for many-cores/sockets, so surely it
avoids putting a process-wide timer or other unscalable things. It just
links to libprofiler for an optional feature to profile performance
bottlenecks only when the user turns it on. We of course avoid turning
the feature on except while we debug/tune the database.

However, libprofiler sets the timer even when the client program doesn't
invoke any of its functions: libprofiler does it when the shared library
is loaded. We requested the developer of libprofiler to change the
behavior, but seems like there is a reason to keep that behavior:
https://code.google.com/p/gperftools/issues/detail?id=133
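
For reference, this is roughly the shape of the mechanism: a shared library
can arm a process-wide profiling timer from a constructor, so merely loading
it is enough. The constructor name, 100 Hz interval and empty handler below
are illustrative stand-ins, not gperftools' actual code.

#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* A real profiler would record a stack sample here. */
static void prof_handler(int sig)
{
        (void)sig;
}

/* Runs when the shared object is loaded, before main() ever starts. */
__attribute__((constructor))
static void arm_profiling_timer(void)
{
        struct sigaction sa;
        struct itimerval it;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = prof_handler;
        sigaction(SIGPROF, &sa, NULL);

        /*
         * ITIMER_PROF counts CPU time consumed by the whole process, so
         * every thread contributes and every thread can end up in the
         * posix-cpu-timer paths discussed in this series.
         */
        memset(&it, 0, sizeof(it));
        it.it_interval.tv_usec = 10000; /* 100 Hz, illustrative */
        it.it_value.tv_usec = 10000;
        setitimer(ITIMER_PROF, &it, NULL);
}

Built with something like "cc -shared -fPIC prof_at_load.c -o libprof_at_load.so"
and pulled in via LD_PRELOAD or an ordinary link dependency, this puts every
thread of the process on the itimer path without the application calling
anything.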

Based on this, I think there are two reasons why we should ameliorate
this issue in the kernel layer.


1. In the particular case, it's hard to prevent or even detect the issue
in user space.

We (a team of low-level database and kernel experts) in fact spent a huge
amount of time just to figure out what the bottleneck was, because
nothing measurable happens in user space. I pulled out countless hairs.

Also, the user has to de-link the library from the application to
prevent the itimer installation. Imagine a case where the software is
proprietary. It won't fly.


2. This is just one example. There could be many other such
binaries/libraries that do similar things somewhere in a complex
software stack.

Today we haven't heard of many such cases, but people will start hitting
it once 100s~1,000s of cores become common.


After applying this patchset, we have observed that the performance hit
almost completely went away at least for 240 cores. So, it's quite
beneficial in real world.

--
Hideaki Kimura

2015-08-26 22:56:34

by Frederic Weisbecker

Subject: Re: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

On Tue, Aug 25, 2015 at 08:17:48PM -0700, Jason Low wrote:
> It was found while running a database workload on large systems that
> significant time was spent trying to acquire the sighand lock.
>
> The issue was that whenever an itimer expired, many threads ended up
> simultaneously trying to send the signal. Most of the time, nothing
> happened after acquiring the sighand lock because another thread
> had already sent the signal and updated the "next expire" time. The
> fastpath_timer_check() didn't help much since the "next expire" time
> was updated later.
>
> This patch addresses this by having the thread_group_cputimer structure
> maintain a boolean to signify when a thread in the group is already
> checking for process wide timers, and adds extra logic in the fastpath
> to check the boolean.
>
> Signed-off-by: Jason Low <[email protected]>
> ---
> include/linux/init_task.h | 1 +
> include/linux/sched.h | 3 +++
> kernel/time/posix-cpu-timers.c | 19 +++++++++++++++++--
> 3 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index d0b380e..3350c77 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -53,6 +53,7 @@ extern struct fs_struct init_fs;
> .cputimer = { \
> .cputime_atomic = INIT_CPUTIME_ATOMIC, \
> .running = 0, \
> + .checking_timer = 0, \
> }, \
> INIT_PREV_CPUTIME(sig) \
> .cred_guard_mutex = \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 119823d..a6c8334 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -619,6 +619,8 @@ struct task_cputime_atomic {
> * @cputime_atomic: atomic thread group interval timers.
> * @running: non-zero when there are timers running and
> * @cputime receives updates.
> + * @checking_timer: non-zero when a thread is in the process of
> + * checking for thread group timers.
> *
> * This structure contains the version of task_cputime, above, that is
> * used for thread group CPU timer calculations.
> @@ -626,6 +628,7 @@ struct task_cputime_atomic {
> struct thread_group_cputimer {
> struct task_cputime_atomic cputime_atomic;
> int running;
> + int checking_timer;

How about a flag in the "running" field instead?

1) Space in signal_struct is not as important as in task_struct but it
still matters.

2) We already read the "running" field locklessly. Adding a new field like
checking_timer gets even more complicated. Ideally there should be at
least a paired memory barrier between both. Let's just simplify that
with a single field.

Now concerning the solution for your problem, I'm a bit uncomfortable with
lockless magics like this. When the thread sets checking_timer to 1, there is
no guarantee that the other threads in the process will see it "fast" enough
to avoid the slow path checks. Then there is also the risk that the threads
don't see "fast" enough that checking_timers has toggled to 0 and as a result
a timer may expire late. Now the lockless access of "running" already induces
such race. So if it really solves issues in practice, why not.

2015-08-26 22:57:39

by Jason Low

Subject: Re: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

On Thu, 2015-08-27 at 00:31 +0200, Frederic Weisbecker wrote:
> On Wed, Aug 26, 2015 at 10:53:35AM -0700, Linus Torvalds wrote:
> > On Tue, Aug 25, 2015 at 8:17 PM, Jason Low <[email protected]> wrote:
> > >
> > > This patch addresses this by having the thread_group_cputimer structure
> > > maintain a boolean to signify when a thread in the group is already
> > > checking for process wide timers, and adds extra logic in the fastpath
> > > to check the boolean.
> >
> > It is not at all obvious why the unlocked read of that variable is
> > safe, and why there is no race with another thread just about to end
> > its check_process_timers().
>
> The risk is when a next timer is going to expire soon after we relaxed
> the "checking" variable due to a recent expiration. The thread which
> expires the next timer may still see a stale value on the "checking"
> state and therefore delay the timer firing until the new value is seen.
> So the worst that can happen is that the timer firing gets delayed for
> X jiffies (I guess in practice it's only 1 jiffy).
>
> That said, posix cpu timers already suffer such race because
> sig->cputimer.running itself is checked outside the sighand lock anyway.

Right, in the worst case, the next thread that comes along will handle
it. The current code already expects that level of granularity, so this
shouldn't change anything much. For example, there is almost always a
delay between a timer expiring and when a thread actually calls into
check_process_timers().

> > I can well imagine that this is all perfectly safe and fine, but I'd
> > really like to see that comment about _why_ that's the case, and why a
> > completely unlocked access without even memory barriers is fine.
>
> Agreed, there should be a comment about that in the code (that is already full
> of undocumented subtleties).

Okay, I will add more comments about this.

2015-08-26 23:13:31

by Frederic Weisbecker

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

On Wed, Aug 26, 2015 at 03:53:26PM -0700, Hideaki Kimura wrote:
> Sure, let me elaborate.
>
> Executive summary:
> Yes, enabling a process-wide timer in such a large machine is not wise, but
> sometimes users/applications cannot avoid it.
>
>
> The issue was observed actually not in a database itself but in a common
> library it links to; gperftools.
>
> The database itself is optimized for many-cores/sockets, so surely it avoids
> putting a process-wide timer or other unscalable things. It just links to
> libprofiler for an optional feature to profile performance bottleneck only
> when the user turns it on. We of course avoid turning the feature on unless
> while we debug/tune the database.
>
> However, libprofiler sets the timer even when the client program doesn't
> invoke any of its functions: libprofiler does it when the shared library is
> loaded. We requested the developer of libprofiler to change the behavior,
> but seems like there is a reason to keep that behavior:
> https://code.google.com/p/gperftools/issues/detail?id=133
>
> Based on this, I think there are two reasons why we should ameliorate this
> issue in kernel layer.
>
>
> 1. In the particular case, it's hard to prevent or even detect the issue in
> user space.
>
> We (a team of low-level database and kernel experts) in fact spent huge
> amount of time to just figure out what's the bottleneck there because
> nothing measurable happens in user space. I pulled out countless hairs.
>
> Also, the user has to de-link the library from the application to prevent
> the itimer installation. Imagine a case where the software is proprietary.
> It won't fly.
>
>
> 2. This is just one example. There could be many other such
> binaries/libraries that do similar things somewhere in a complex software
> stack.
>
> Today we haven't heard of many such cases, but people will start hitting it
> once 100s~1,000s of cores become common.
>
>
> After applying this patchset, we have observed that the performance hit
> almost completely went away at least for 240 cores. So, it's quite
> beneficial in real world.

I can easily imagine that a lot of code incidentally uses posix cpu timers when
it's not strictly required. But it doesn't look right to fix the kernel
for that, for this simple reason: posix cpu timers, even after your fix,
should introduce noticeable overhead. All threads of a process with a timer
enqueued account the elapsed cputime in a shared atomic variable. Add to that
the overhead of enqueuing the timer and firing it. There are a bunch of
scalability issues there.

>
> --
> Hideaki Kimura

2015-08-26 23:32:40

by Jason Low

Subject: Re: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

On Thu, 2015-08-27 at 00:56 +0200, Frederic Weisbecker wrote:
> On Tue, Aug 25, 2015 at 08:17:48PM -0700, Jason Low wrote:
> > It was found while running a database workload on large systems that
> > significant time was spent trying to acquire the sighand lock.
> >
> > The issue was that whenever an itimer expired, many threads ended up
> > simultaneously trying to send the signal. Most of the time, nothing
> > happened after acquiring the sighand lock because another thread
> > had already sent the signal and updated the "next expire" time. The
> > fastpath_timer_check() didn't help much since the "next expire" time
> > was updated later.
> >
> > This patch addresses this by having the thread_group_cputimer structure
> > maintain a boolean to signify when a thread in the group is already
> > checking for process wide timers, and adds extra logic in the fastpath
> > to check the boolean.
> >
> > Signed-off-by: Jason Low <[email protected]>
> > ---
> > include/linux/init_task.h | 1 +
> > include/linux/sched.h | 3 +++
> > kernel/time/posix-cpu-timers.c | 19 +++++++++++++++++--
> > 3 files changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > index d0b380e..3350c77 100644
> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -53,6 +53,7 @@ extern struct fs_struct init_fs;
> > .cputimer = { \
> > .cputime_atomic = INIT_CPUTIME_ATOMIC, \
> > .running = 0, \
> > + .checking_timer = 0, \
> > }, \
> > INIT_PREV_CPUTIME(sig) \
> > .cred_guard_mutex = \
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 119823d..a6c8334 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -619,6 +619,8 @@ struct task_cputime_atomic {
> > * @cputime_atomic: atomic thread group interval timers.
> > * @running: non-zero when there are timers running and
> > * @cputime receives updates.
> > + * @checking_timer: non-zero when a thread is in the process of
> > + * checking for thread group timers.
> > *
> > * This structure contains the version of task_cputime, above, that is
> > * used for thread group CPU timer calculations.
> > @@ -626,6 +628,7 @@ struct task_cputime_atomic {
> > struct thread_group_cputimer {
> > struct task_cputime_atomic cputime_atomic;
> > int running;
> > + int checking_timer;
>
> How about a flag in the "running" field instead?
>
> 1) Space in signal_struct is not as important as in task_struct but it
> still matters.

George Spelvin suggested that we convert them to booleans which would
make them take up 2 bytes.

> 2) We already read the "running" field locklessly. Adding a new field like
> checking_timer gets even more complicated. Ideally there should be at
> least a paired memory barrier between both. Let's just simplify that
> with a single field.

hmmm, so having 1 "flag" where we access bits for the "running" and
"checking_timer"?

> Now concerning the solution for your problem, I'm a bit uncomfortable with
> lockless magics like this. When the thread sets checking_timer to 1, there is
> no guarantee that the other threads in the process will see it "fast" enough
> to avoid the slow path checks. Then there is also the risk that the threads
> don't see "fast" enough that checking_timers has toggled to 0 and as a result
> a timer may expire late. Now the lockless access of "running" already induces
> such race. So if it really solves issues in practice, why not.

Perhaps to be safer, we use something like load_acquire() and
store_release() for accessing both the ->running and ->checking_timer
fields?

2015-08-26 23:55:10

by Hideaki Kimura

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability



On 08/26/2015 04:13 PM, Frederic Weisbecker wrote:
> On Wed, Aug 26, 2015 at 03:53:26PM -0700, Hideaki Kimura wrote:
>> Sure, let me elaborate.
>>
>> Executive summary:
>> Yes, enabling a process-wide timer in such a large machine is not wise, but
>> sometimes users/applications cannot avoid it.
>>
>>
>> The issue was observed actually not in a database itself but in a common
>> library it links to; gperftools.
>>
>> The database itself is optimized for many-cores/sockets, so surely it avoids
>> putting a process-wide timer or other unscalable things. It just links to
>> libprofiler for an optional feature to profile performance bottleneck only
>> when the user turns it on. We of course avoid turning the feature on unless
>> while we debug/tune the database.
>>
>> However, libprofiler sets the timer even when the client program doesn't
>> invoke any of its functions: libprofiler does it when the shared library is
>> loaded. We requested the developer of libprofiler to change the behavior,
>> but seems like there is a reason to keep that behavior:
>> https://code.google.com/p/gperftools/issues/detail?id=133
>>
>> Based on this, I think there are two reasons why we should ameliorate this
>> issue in kernel layer.
>>
>>
>> 1. In the particular case, it's hard to prevent or even detect the issue in
>> user space.
>>
>> We (a team of low-level database and kernel experts) in fact spent huge
>> amount of time to just figure out what's the bottleneck there because
>> nothing measurable happens in user space. I pulled out countless hairs.
>>
>> Also, the user has to de-link the library from the application to prevent
>> the itimer installation. Imagine a case where the software is proprietary.
>> It won't fly.
>>
>>
>> 2. This is just one example. There could be many other such
>> binaries/libraries that do similar things somewhere in a complex software
>> stack.
>>
>> Today we haven't heard of many such cases, but people will start hitting it
>> once 100s~1,000s of cores become common.
>>
>>
>> After applying this patchset, we have observed that the performance hit
>> almost completely went away at least for 240 cores. So, it's quite
>> beneficial in real world.
>
> I can easily imagine that a lot of code incidentally uses posix cpu timers when
> it's not strictly required. But it doesn't look right to fix the kernel
> for that, for this simple reason: posix cpu timers, even after your fix,
> should introduce noticeable overhead. All threads of a process with a timer
> enqueued account the elapsed cputime in a shared atomic variable. Add to that
> the overhead of enqueuing the timer and firing it. There are a bunch of
> scalability issues there.

I totally agree that this is not a perfect solution. If there are 10x
more cores and sockets, just the atomic fetch_add might be too expensive.

However, it's comparatively/realistically the best thing we can do
without any drawbacks. We can't magically force all library developers
to write the most scalable code always.

My point is: this is a safety net, and a very effective one.

--
Hideaki Kimura

2015-08-27 04:53:04

by Jason Low

Subject: Re: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

On Wed, 2015-08-26 at 16:32 -0700, Jason Low wrote:

> Perhaps to be safer, we use something like load_acquire() and
> store_release() for accessing both the ->running and ->checking_timer
> fields?

Regarding using barriers, one option could be to pair them between
sig->cputime_expires and the sig->cputimer.checking_timer accesses.

fastpath_timer_check()
{
        ...

        if (READ_ONCE(sig->cputimer.running)) {
                struct task_cputime group_sample;

                sample_cputime_atomic(&group_sample, &sig->cputimer.cputime_atomic);

                if (task_cputime_expired(&group_sample, &sig->cputime_expires)) {
                        /*
                         * Comments
                         */
                        mb();

                        if (!READ_ONCE(sig->cputimer.checking_timer))
                                return 1;
                }
        }
}

check_process_timers()
{
        ...

        WRITE_ONCE(sig->cputimer.checking_timer, 0);

        /*
         * Comments
         */
        mb();

        sig->cputime_expires.prof_exp = expires_to_cputime(prof_expires);
        sig->cputime_expires.virt_exp = expires_to_cputime(virt_expires);
        sig->cputime_expires.sched_exp = sched_expires;

        ...
}

By the time the cputime_expires fields get updated at the end of
check_process_timers(), other threads in the fastpath_timer_check()
should observe the value 0 for READ_ONCE(sig->cputimer.checking_timer).

In the case where threads in the fastpath don't observe the
WRITE_ONCE(checking_timer, 1) early enough, that's fine, since it will
just (unnecessarily) go through the slowpath which is what we also do in
the current code.
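
A compilable userspace model of the pairing described above may make the
intent easier to see: C11 seq_cst fences stand in for the kernel's mb(),
relaxed atomics stand in for READ_ONCE()/WRITE_ONCE(), and the struct is a
simplified stand-in for the signal_struct fields, so this is a sketch of the
idea rather than the kernel code.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct group_timer_model {
        atomic_int checking_timer;      /* models sig->cputimer.checking_timer */
        atomic_ulong prof_exp;          /* models one field of sig->cputime_expires */
};

/* Slow path; in the kernel this runs with the sighand lock held. */
static void check_process_timers_model(struct group_timer_model *g,
                                       unsigned long next_exp)
{
        atomic_store_explicit(&g->checking_timer, 0, memory_order_relaxed);

        /* Pairs with the fence in the fast path: a reader that sees the new
         * expiry below is guaranteed to also see checking_timer == 0. */
        atomic_thread_fence(memory_order_seq_cst);

        atomic_store_explicit(&g->prof_exp, next_exp, memory_order_relaxed);
}

/* Lockless fast path: returns true if the caller should take the
 * sighand lock and run the slow path. */
static bool fastpath_timer_check_model(struct group_timer_model *g,
                                       unsigned long cputime_sample)
{
        unsigned long exp = atomic_load_explicit(&g->prof_exp,
                                                 memory_order_relaxed);

        if (exp && cputime_sample >= exp) {
                atomic_thread_fence(memory_order_seq_cst);

                if (!atomic_load_explicit(&g->checking_timer,
                                          memory_order_relaxed))
                        return true;
        }
        return false;
}

int main(void)
{
        struct group_timer_model g = { 0 };

        atomic_store_explicit(&g.prof_exp, 100, memory_order_relaxed);
        printf("expired, need slow path: %d\n",
               fastpath_timer_check_model(&g, 150));

        atomic_store_explicit(&g.checking_timer, 1, memory_order_relaxed);
        printf("someone already checking: %d\n",
               fastpath_timer_check_model(&g, 150));

        check_process_timers_model(&g, 300);
        printf("after slow path, sample 150: %d\n",
               fastpath_timer_check_model(&g, 150));
        return 0;
}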

2015-08-27 12:53:08

by Frederic Weisbecker

Subject: Re: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

On Wed, Aug 26, 2015 at 04:32:34PM -0700, Jason Low wrote:
> On Thu, 2015-08-27 at 00:56 +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 25, 2015 at 08:17:48PM -0700, Jason Low wrote:
> > > It was found while running a database workload on large systems that
> > > significant time was spent trying to acquire the sighand lock.
> > >
> > > The issue was that whenever an itimer expired, many threads ended up
> > > simultaneously trying to send the signal. Most of the time, nothing
> > > happened after acquiring the sighand lock because another thread
> > > had already sent the signal and updated the "next expire" time. The
> > > fastpath_timer_check() didn't help much since the "next expire" time
> > > was updated later.
> > >
> > > This patch addresses this by having the thread_group_cputimer structure
> > > maintain a boolean to signify when a thread in the group is already
> > > checking for process wide timers, and adds extra logic in the fastpath
> > > to check the boolean.
> > >
> > > Signed-off-by: Jason Low <[email protected]>
> > > ---
> > > include/linux/init_task.h | 1 +
> > > include/linux/sched.h | 3 +++
> > > kernel/time/posix-cpu-timers.c | 19 +++++++++++++++++--
> > > 3 files changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > > index d0b380e..3350c77 100644
> > > --- a/include/linux/init_task.h
> > > +++ b/include/linux/init_task.h
> > > @@ -53,6 +53,7 @@ extern struct fs_struct init_fs;
> > > .cputimer = { \
> > > .cputime_atomic = INIT_CPUTIME_ATOMIC, \
> > > .running = 0, \
> > > + .checking_timer = 0, \
> > > }, \
> > > INIT_PREV_CPUTIME(sig) \
> > > .cred_guard_mutex = \
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 119823d..a6c8334 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -619,6 +619,8 @@ struct task_cputime_atomic {
> > > * @cputime_atomic: atomic thread group interval timers.
> > > * @running: non-zero when there are timers running and
> > > * @cputime receives updates.
> > > + * @checking_timer: non-zero when a thread is in the process of
> > > + * checking for thread group timers.
> > > *
> > > * This structure contains the version of task_cputime, above, that is
> > > * used for thread group CPU timer calculations.
> > > @@ -626,6 +628,7 @@ struct task_cputime_atomic {
> > > struct thread_group_cputimer {
> > > struct task_cputime_atomic cputime_atomic;
> > > int running;
> > > + int checking_timer;
> >
> > How about a flag in the "running" field instead?
> >
> > 1) Space in signal_struct is not as important as in task_struct but it
> > still matters.
>
> George Spelvin suggested that we convert them to booleans which would
> make them take up 2 bytes.
>
> > 2) We already read the "running" field locklessly. Adding a new field like
> > checking_timer gets even more complicated. Ideally there should be at
> > least a paired memory barrier between both. Let's just simplify that
> > with a single field.
>
> hmmm, so having 1 "flag" where we access bits for the "running" and
> "checking_timer"?

Sure, like:

#define CPUTIMER_RUNNING 0x1
#define CPUTIMER_CHECKING 0x2

struct thread_group_cputimer {
struct task_cputime_atomic cputime_atomic;
int status;
}

So from cputimer_running() you just need to check:

if (cputimer->status & CPUTIMER_RUNNING)

And from run_posix_cpu_timer() fast-path:

if (cputimer->status == CPUTIMER_RUNNING)

so that ignores CPUTIMER_CHECKING case.

>
> > Now concerning the solution for your problem, I'm a bit uncomfortable with
> > lockless magics like this. When the thread sets checking_timer to 1, there is
> > no guarantee that the other threads in the process will see it "fast" enough
> > to avoid the slow path checks. Then there is also the risk that the threads
> > don't see "fast" enough that checking_timers has toggled to 0 and as a result
> > a timer may expire late. Now the lockless access of "running" already induces
> > such race. So if it really solves issues in practice, why not.
>
> Perhaps to be safer, we use something like load_acquire() and
> store_release() for accessing both the ->running and ->checking_timer
> fields?

Well it depends on what we want to order them against. If it's a single field
we don't need to order them together, at least.

2015-08-27 13:18:53

by Frederic Weisbecker

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

On Wed, Aug 26, 2015 at 04:45:44PM -0700, Hideaki Kimura wrote:
> I totally agree that this is not a perfect solution. If there are 10x more
> cores and sockets, just the atomic fetch_add might be too expensive.
>
> However, it's comparatively/realistically the best thing we can do without
> any drawbacks. We can't magically force all library developers to write the
> most scalable code always.
>
> My point is: this is a safety net, and a very effective one.

I mean the problem here is that a library uses an unscalable profiling feature,
unconditionally as soon as you load it without even initializing anything. And
this library is used in production.

At first sight, fixing that in the kernel is only a hack that just reduces a bit
the symptoms.

What is the technical issue that prevents from fixing that in the library itself?
Posix timers can be attached anytime.

2015-08-27 14:47:13

by Steven Rostedt

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

On Thu, 27 Aug 2015 15:18:49 +0200
Frederic Weisbecker <[email protected]> wrote:

> On Wed, Aug 26, 2015 at 04:45:44PM -0700, Hideaki Kimura wrote:
> > I totally agree that this is not a perfect solution. If there are 10x more
> > cores and sockets, just the atomic fetch_add might be too expensive.
> >
> > However, it's comparatively/realistically the best thing we can do without
> > any drawbacks. We can't magically force all library developers to write the
> > most scalable code always.
> >
> > My point is: this is a safety net, and a very effective one.
>
> I mean the problem here is that a library uses an unscalable profiling feature,
> unconditionally as soon as you load it without even initializing anything. And
> this library is used in production.
>
> At first sight, fixing that in the kernel is only a hack that just reduces a bit
> the symptoms.
>
> What is the technical issue that prevents from fixing that in the library itself?
> Posix timers can be attached anytime.

I'm curious to what the downside of this patch set is? If we can fix a
problem that should be fixed in userspace, but does not harm the kernel
by doing so, is that bad? (an argument for kdbus? ;-)


As Hideaki noted, this could be a problem in other locations as well
that people have yet to find.

-- Steve

2015-08-27 15:10:10

by Thomas Gleixner

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

On Thu, 27 Aug 2015, Steven Rostedt wrote:
> On Thu, 27 Aug 2015 15:18:49 +0200
> Frederic Weisbecker <[email protected]> wrote:
>
> > On Wed, Aug 26, 2015 at 04:45:44PM -0700, Hideaki Kimura wrote:
> > > I totally agree that this is not a perfect solution. If there are 10x more
> > > cores and sockets, just the atomic fetch_add might be too expensive.
> > >
> > > However, it's comparatively/realistically the best thing we can do without
> > > any drawbacks. We can't magically force all library developers to write the
> > > most scalable code always.
> > >
> > > My point is: this is a safety net, and a very effective one.
> >
> > I mean the problem here is that a library uses an unscalable profiling feature,
> > unconditionally as soon as you load it without even initializing anything. And
> > this library is used in production.
> >
> > At first sight, fixing that in the kernel is only a hack that just reduces a bit
> > the symptoms.
> >
> > What is the technical issue that prevents from fixing that in the library itself?
> > Posix timers can be attached anytime.
>
> I'm curious to what the downside of this patch set is? If we can fix a
> problem that should be fixed in userspace, but does not harm the kernel
> by doing so, is that bad? (an argument for kdbus? ;-)

The patches are not fixing a problem which should be fixed in user
space. They merely avoid lock contention which happens to be prominent
with that particular library. But avoiding lock contention even for 2
threads is a worthwhile exercise if it does not hurt otherwise. And I
can't see anything that hurts with these patches.

Thanks,

tglx

2015-08-27 15:17:31

by Frederic Weisbecker

Subject: Re: [PATCH 0/3] timer: Improve itimers scalability

On Thu, Aug 27, 2015 at 05:09:21PM +0200, Thomas Gleixner wrote:
> On Thu, 27 Aug 2015, Steven Rostedt wrote:
> > On Thu, 27 Aug 2015 15:18:49 +0200
> > Frederic Weisbecker <[email protected]> wrote:
> >
> > > On Wed, Aug 26, 2015 at 04:45:44PM -0700, Hideaki Kimura wrote:
> > > > I totally agree that this is not a perfect solution. If there are 10x more
> > > > cores and sockets, just the atomic fetch_add might be too expensive.
> > > >
> > > > However, it's comparatively/realistically the best thing we can do without
> > > > any drawbacks. We can't magically force all library developers to write the
> > > > most scalable code always.
> > > >
> > > > My point is: this is a safety net, and a very effective one.
> > >
> > > I mean the problem here is that a library uses an unscalable profiling feature,
> > > unconditionally as soon as you load it without even initializing anything. And
> > > this library is used in production.
> > >
> > > At first sight, fixing that in the kernel is only a hack that just reduces a bit
> > > the symptoms.
> > >
> > > What is the technical issue that prevents from fixing that in the library itself?
> > > Posix timers can be attached anytime.
> >
> > I'm curious to what the downside of this patch set is? If we can fix a
> > problem that should be fixed in userspace, but does not harm the kernel
> > by doing so, is that bad? (an argument for kdbus? ;-)
>
> The patches are not fixing a problem which should be fixed in user
> space. They merely avoid lock contention which happens to be prominent
> with that particular library. But avoiding lock contention even for 2
> threads is a worthwhile exercise if it does not hurt otherwise. And I
> can't see anything that hurts with these patches.

Sure it shouldn't really hurt anyway, since the presence of elapsing timers
itself is checked locklessly.

2015-08-27 20:30:00

by Jason Low

Subject: Re: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

On Thu, 2015-08-27 at 14:53 +0200, Frederic Weisbecker wrote:
> On Wed, Aug 26, 2015 at 04:32:34PM -0700, Jason Low wrote:
> > On Thu, 2015-08-27 at 00:56 +0200, Frederic Weisbecker wrote:
> > > On Tue, Aug 25, 2015 at 08:17:48PM -0700, Jason Low wrote:
> > > > It was found while running a database workload on large systems that
> > > > significant time was spent trying to acquire the sighand lock.
> > > >
> > > > The issue was that whenever an itimer expired, many threads ended up
> > > > simultaneously trying to send the signal. Most of the time, nothing
> > > > happened after acquiring the sighand lock because another thread
> > > > had already sent the signal and updated the "next expire" time. The
> > > > fastpath_timer_check() didn't help much since the "next expire" time
> > > > was updated later.
> > > >
> > > > This patch addresses this by having the thread_group_cputimer structure
> > > > maintain a boolean to signify when a thread in the group is already
> > > > checking for process wide timers, and adds extra logic in the fastpath
> > > > to check the boolean.
> > > >
> > > > Signed-off-by: Jason Low <[email protected]>
> > > > ---
> > > > include/linux/init_task.h | 1 +
> > > > include/linux/sched.h | 3 +++
> > > > kernel/time/posix-cpu-timers.c | 19 +++++++++++++++++--
> > > > 3 files changed, 21 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> > > > index d0b380e..3350c77 100644
> > > > --- a/include/linux/init_task.h
> > > > +++ b/include/linux/init_task.h
> > > > @@ -53,6 +53,7 @@ extern struct fs_struct init_fs;
> > > > .cputimer = { \
> > > > .cputime_atomic = INIT_CPUTIME_ATOMIC, \
> > > > .running = 0, \
> > > > + .checking_timer = 0, \
> > > > }, \
> > > > INIT_PREV_CPUTIME(sig) \
> > > > .cred_guard_mutex = \
> > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > index 119823d..a6c8334 100644
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -619,6 +619,8 @@ struct task_cputime_atomic {
> > > > * @cputime_atomic: atomic thread group interval timers.
> > > > * @running: non-zero when there are timers running and
> > > > * @cputime receives updates.
> > > > + * @checking_timer: non-zero when a thread is in the process of
> > > > + * checking for thread group timers.
> > > > *
> > > > * This structure contains the version of task_cputime, above, that is
> > > > * used for thread group CPU timer calculations.
> > > > @@ -626,6 +628,7 @@ struct task_cputime_atomic {
> > > > struct thread_group_cputimer {
> > > > struct task_cputime_atomic cputime_atomic;
> > > > int running;
> > > > + int checking_timer;
> > >
> > > How about a flag in the "running" field instead?
> > >
> > > 1) Space in signal_struct is not as important as in task_struct but it
> > > still matters.
> >
> > George Spelvin suggested that we convert them to booleans which would
> > make them take up 2 bytes.
> >
> > > 2) We already read the "running" field locklessly. Adding a new field like
> > > checking_timer gets even more complicated. Ideally there should be at
> > > least a paired memory barrier between both. Let's just simplify that
> > > with a single field.
> >
> > hmmm, so having 1 "flag" where we access bits for the "running" and
> > "checking_timer"?
>
> Sure, like:
>
> #define CPUTIMER_RUNNING 0x1
> #define CPUTIMER_CHECKING 0x2
>
> struct thread_group_cputimer {
> struct task_cputime_atomic cputime_atomic;
> int status;
> }
>
> So from cputimer_running() you just need to check:
>
> if (cputimer->status & CPUTIMER_RUNNING)
>
> And from run_posix_cpu_timer() fast-path:
>
> if (cputimer->status == CPUTIMER_RUNNING)
>
> so that ignores CPUTIMER_CHECKING case.

Right, having just 1 "status" field can simplify things a bit. The
(cputimer->status == CPUTIMER_RUNNING) check does appear misleading
though, since we're technically not only checking for if the "cputimer
is running".

Maybe something like:

int status = cputimer->status;
if ((status & CPUTIMER_RUNNING) && !(status & CPUTIMER_CHECKING))

makes it more obvious what's going on here.
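
A small standalone model of the single "status" field variant being
discussed, combining the flag layout sketched earlier in the thread with the
fast-path test above; the C11 atomics and all helper names here are
illustrative stand-ins, not code from an actual patch.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define CPUTIMER_RUNNING        0x1
#define CPUTIMER_CHECKING       0x2

struct cputimer_model {
        atomic_int status;      /* replaces the separate running/checking_timer ints */
};

/* Equivalent of the old "is the cputimer running" test. */
static bool cputimer_running(struct cputimer_model *ct)
{
        return atomic_load_explicit(&ct->status, memory_order_relaxed) &
               CPUTIMER_RUNNING;
}

/* Fast-path test: timers are running and no other thread is already
 * inside check_process_timers(). */
static bool fastpath_should_check(struct cputimer_model *ct)
{
        int status = atomic_load_explicit(&ct->status, memory_order_relaxed);

        return (status & CPUTIMER_RUNNING) && !(status & CPUTIMER_CHECKING);
}

int main(void)
{
        struct cputimer_model ct = { 0 };

        atomic_fetch_or(&ct.status, CPUTIMER_RUNNING);
        printf("running=%d check=%d\n",
               cputimer_running(&ct), fastpath_should_check(&ct));

        /* Another thread enters check_process_timers()... */
        atomic_fetch_or(&ct.status, CPUTIMER_CHECKING);
        printf("running=%d check=%d\n",
               cputimer_running(&ct), fastpath_should_check(&ct));

        /* ...and leaves it again. */
        atomic_fetch_and(&ct.status, ~CPUTIMER_CHECKING);
        printf("running=%d check=%d\n",
               cputimer_running(&ct), fastpath_should_check(&ct));
        return 0;
}

Whether the CPUTIMER_CHECKING transitions need stronger ordering than relaxed
is exactly the memory-barrier question raised elsewhere in this thread.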

2015-08-27 21:12:26

by Frederic Weisbecker

Subject: Re: [PATCH 3/3] timer: Reduce unnecessary sighand lock contention

On Thu, Aug 27, 2015 at 01:29:50PM -0700, Jason Low wrote:
> On Thu, 2015-08-27 at 14:53 +0200, Frederic Weisbecker wrote:
> > Sure, like:
> >
> > #define CPUTIMER_RUNNING 0x1
> > #define CPUTIMER_CHECKING 0x2
> >
> > struct thread_group_cputimer {
> > struct task_cputime_atomic cputime_atomic;
> > int status;
> > }
> >
> > So from cputimer_running() you just need to check:
> >
> > if (cputimer->status & CPUTIMER_RUNNING)
> >
> > And from run_posix_cpu_timer() fast-path:
> >
> > if (cputimer->status == CPUTIMER_RUNNING)
> >
> > so that ignores CPUTIMER_CHECKING case.
>
> Right, having just 1 "status" field can simplify things a bit. The
> (cputimer->status == CPUTIMER_RUNNING) check does appear misleading
> though, since we're technically not only checking for if the "cputimer
> is running".
>
> Maybe something like:
>
> int status = cputimer->status;
> if ((status & CPUTIMER_RUNNING) && !(status & CPUTIMER_CHECKING))
>
> makes it more obvious what's going on here.

The result is the same but it may clarify the code indeed.

Thanks.

2015-08-31 15:15:19

by Davidlohr Bueso

Subject: Re: [PATCH 1/3] timer: Optimize fastpath_timer_check()

On Tue, 2015-08-25 at 20:17 -0700, Jason Low wrote:
> In fastpath_timer_check(), the task_cputime() function is always
> called to compute the utime and stime values. However, this is not
> necessary if there are no per-thread timers to check for. This patch
> modifies the code such that we compute the task_cputime values only
> when there are per-thread timers set.
>
> Signed-off-by: Jason Low <[email protected]>

Reviewed-by: Davidlohr Bueso <[email protected]>

2015-08-31 19:44:24

by Jason Low

Subject: Re: [PATCH 1/3] timer: Optimize fastpath_timer_check()

On Mon, 2015-08-31 at 08:15 -0700, Davidlohr Bueso wrote:
> On Tue, 2015-08-25 at 20:17 -0700, Jason Low wrote:
> > In fastpath_timer_check(), the task_cputime() function is always
> > called to compute the utime and stime values. However, this is not
> > necessary if there are no per-thread timers to check for. This patch
> > modifies the code such that we compute the task_cputime values only
> > when there are per-thread timers set.
> >
> > Signed-off-by: Jason Low <[email protected]>
>
> Reviewed-by: Davidlohr Bueso <[email protected]>

Thanks David and Frederic.

I want to mention that this patch has been slightly changed. As
suggested by George, those extra utime, stime local variables are not
needed anymore, and we should be able to directly call:

task_cputime(tsk, &task_sample.utime, &task_sample.stime);

---
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 892e3da..3a8c5b4 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -1117,16 +1117,13 @@ static inline int task_cputime_expired(const struct task_cputime *sample,
static inline int fastpath_timer_check(struct task_struct *tsk)
{
struct signal_struct *sig;
- cputime_t utime, stime;
-
- task_cputime(tsk, &utime, &stime);

if (!task_cputime_zero(&tsk->cputime_expires)) {
- struct task_cputime task_sample = {
- .utime = utime,
- .stime = stime,
- .sum_exec_runtime = tsk->se.sum_exec_runtime
- };
+ struct task_cputime task_sample;
+
+ task_cputime(tsk, &task_sample.utime, &task_sample.stime);
+
+ task_sample.sum_exec_runtime = tsk->se.sum_exec_runtime;

if (task_cputime_expired(&task_sample, &tsk->cputime_expires))
return 1;