From: John Stultz <[email protected]>
So get_seconds() has always been lock free, with the assumption
that accessing a long will be atomic.
However, recently I came across an odd bug where time() access could
occasionally be inconsistent, but only on power7 hardware. The
same code paths on power6 or x86 could not reproduce the issue.
After adding careful debugging checks to any xtime manipulation, and
not seeing any inconsistencies on the kernel side, I realized that
with no locking in the get_seconds() path, it could be that two
sequential calls to time() could be executed out of order on newer
hardware, causing the inconsistency to appear in userland.
After adding the following locking, the issue cannot be reproduced.
I wanted to run this by the Power guys to make sure the theory above
sounds sane.
CC: Paul Mackerras <[email protected]>
CC: Paul E. McKenney <[email protected]>
CC: Anton Blanchard <[email protected]>
CC: Thomas Gleixner <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
kernel/time/timekeeping.c | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 8ad5d57..89c7582 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -975,7 +975,15 @@ EXPORT_SYMBOL_GPL(monotonic_to_bootbased);
unsigned long get_seconds(void)
{
- return xtime.tv_sec;
+ unsigned long seq, now;
+
+ do {
+ seq = read_seqbegin(&xtime_lock);
+
+ now = xtime.tv_sec;
+ } while (read_seqretry(&xtime_lock, seq));
+
+ return now;
}
EXPORT_SYMBOL(get_seconds);
--
1.7.3.2.146.gca209
John Stultz <[email protected]> writes:
> From: John Stultz <[email protected]>
>
> So get_seconds() has always been lock free, with the assumption
> that accessing a long will be atomic.
>
> However, recently I came across an odd bug where time() access could
> occasionally be inconsistent, but only on power7 hardware. The
Shouldn't a single rmb() be enough to avoid that?
If not then I suspect there's a lot more code buggy on that CPU than
just the time.
-Andi
--
[email protected] -- Speaking for myself only
On Tue, 2011-05-03 at 20:11 -0700, John Stultz wrote:
> From: John Stultz <[email protected]>
>
> So get_seconds() has always been lock free, with the assumption
> that accessing a long will be atomic.
>
get_seconds() is used in the x86 machine check handler and there is a
comment saying:
/* We hope get_seconds stays lockless */
This needs to be carefully looked at if locking is introduced to
get_seconds().
- Max
Max Asbock <[email protected]> writes:
> On Tue, 2011-05-03 at 20:11 -0700, John Stultz wrote:
>> From: John Stultz <[email protected]>
>>
>> So get_seconds() has always been lock free, with the assumption
>> that accessing a long will be atomic.
>>
>
> get_seconds() is used in the x86 machine check handler and there is a
> comment saying:
> /* We hope get_seconds stays lockless */
>
> This needs to be carefully looked at if locking is introduced to
> get_seconds().
Yes the seqlock being interrupted by an MCE would deadlock.
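In case it's not obvious why, here is the failure mode as a rough sketch
(not the real mce.c code; mce_get_seconds() is a made-up helper standing in
for whatever wants a timestamp in the MCE path, and pretend it can see
xtime/xtime_lock directly):

        /* some CPU is in the middle of a timekeeping update ... */
        write_seqlock(&xtime_lock);             /* sequence is now odd */
        /* ... and the MCE fires on that same CPU before write_sequnlock() */

        /* hypothetical helper called from the MCE handler */
        static unsigned long mce_get_seconds(void)
        {
                unsigned seq;
                unsigned long secs;

                do {
                        seq = read_seqbegin(&xtime_lock);
                        secs = xtime.tv_sec;
                } while (read_seqretry(&xtime_lock, seq));
                /*
                 * read_seqbegin() spins while the sequence is odd, and the
                 * interrupted writer on this CPU can never finish under us,
                 * so this never returns.
                 */
                return secs;
        }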
-Andi
--
[email protected] -- Speaking for myself only
On Wed, 2011-05-04 at 09:51 -0700, Max Asbock wrote:
> On Tue, 2011-05-03 at 20:11 -0700, John Stultz wrote:
> > From: John Stultz <[email protected]>
> >
> > So get_seconds() has always been lock free, with the assumption
> > that accessing a long will be atomic.
> >
>
> get_seconds() is used in the x86 machine check handler and there is a
> comment saying:
> /* We hope get_seconds stays lockless */
>
> This needs to be carefully looked at if locking is introduced to
> get_seconds().
Ah. Thanks for pointing this out Max.
I'll go ahead and use Andi's suggestion of the rmb();
Patch soon to follow.
thanks
-john
On Tue, 2011-05-03 at 20:52 -0700, Andi Kleen wrote:
> John Stultz <[email protected]> writes:
>
> > From: John Stultz <[email protected]>
> >
> > So get_seconds() has always been lock free, with the assumption
> > that accessing a long will be atomic.
> >
> > However, recently I came across an odd bug where time() access could
> > occasionally be inconsistent, but only on power7 hardware. The
>
> Shouldn't a single rmb() be enough to avoid that?
>
> If not then I suspect there's a lot more code buggy on that CPU than
> just the time.
So interestingly, I've found that the issue was not as complex as I
first assumed. While the rmb() is probably a good idea for
get_seconds(), it alone does not solve the issue I was seeing,
making it clear my theory wasn't correct.
The problem was reported against the 2.6.32-stable kernel, and had not
been seen in later kernels. I had assumed the change to logarithmic time
accumulation basically reduced the window for the issue to be seen,
but it would likely still show up eventually.
When the rmb() alone did not solve this issue, I looked to see why the
locking did resolve it, and then it was clear: The old
update_xtime_cache() function doesn't set the xtime_cache values
atomically.
Now, the xtime_cache writing is done under the xtime_lock, so the
get_seconds() locking resolves the issue, but isn't appropriate since
get_seconds() is called from machine check handlers.
So the fix here for the 2.6.32-stable tree is to just update xtime_cache
in one go as done with the following patch.
I also added the rmb() for good measure, and the rmb() should probably
also go upstream since theoretically there may be a platform that could
do out of order syscalls.
I suspect the reason this hasn't been triggered on x86 or power6 is due
to compiler or processor optimizations reordering the assignments so that,
in effect, the update is atomic. Or maybe the timing window to see the issue is
harder to observe?
Signed-off-by: John Stultz <[email protected]>
Index: linux-2.6.32.y/kernel/time/timekeeping.c
===================================================================
--- linux-2.6.32.y.orig/kernel/time/timekeeping.c 2011-05-04 19:34:21.604314152 -0700
+++ linux-2.6.32.y/kernel/time/timekeeping.c 2011-05-04 19:39:09.972203989 -0700
@@ -168,8 +168,10 @@ int __read_mostly timekeeping_suspended;
static struct timespec xtime_cache __attribute__ ((aligned (16)));
void update_xtime_cache(u64 nsec)
{
- xtime_cache = xtime;
- timespec_add_ns(&xtime_cache, nsec);
+ /* use temporary timespec so xtime_cache is updated atomically */
+ struct timespec ts = xtime;
+ timespec_add_ns(&ts, nsec);
+ xtime_cache = ts;
}
/* must hold xtime_lock */
@@ -859,6 +861,7 @@ EXPORT_SYMBOL_GPL(monotonic_to_bootbased
unsigned long get_seconds(void)
{
+ rmb();
return xtime_cache.tv_sec;
}
EXPORT_SYMBOL(get_seconds);
On Wed, 2011-05-04 at 19:54 -0700, john stultz wrote:
> On Tue, 2011-05-03 at 20:52 -0700, Andi Kleen wrote:
> > John Stultz <[email protected]> writes:
> >
> > > From: John Stultz <[email protected]>
> > >
> > > So get_seconds() has always been lock free, with the assumption
> > > that accessing a long will be atomic.
> > >
> > > However, recently I came across an odd bug where time() access could
> > > occasionally be inconsistent, but only on power7 hardware. The
> >
> > Shouldn't a single rmb() be enough to avoid that?
> >
> > If not then I suspect there's a lot more code buggy on that CPU than
> > just the time.
>
> So interestingly, I've found that the issue was not as complex as I
> first assumed. While the rmb() is probably a good idea for
> get_seconds(), but it alone does not solve the issue I was seeing,
> making it clear my theory wasn't correct.
>
> The problem was reported against the 2.6.32-stable kernel, and had not
> been seen in later kernels. I had assumed the change to logarithmic time
> accumulation basically reduced the window for for the issue to be seen,
> but it would likely still show up eventually.
>
> When the rmb() alone did not solve this issue, I looked to see why the
> locking did resolve it, and then it was clear: The old
> update_xtime_cache() function doesn't set the xtime_cache values
> atomically.
>
> Now, the xtime_cache writing is done under the xtime_lock, so the
> get_seconds() locking resolves the issue, but isn't appropriate since
> get_seconds() is called from machine check handlers.
>
> So the fix here for the 2.6.32-stable tree is to just update xtime_cache
> in one go as done with the following patch.
>
> I also added the rmb() for good measure, and the rmb() should probably
> also go upstream since theoretically there maybe a platform that could
> do out of order syscalls.
>
> I suspect the reason this hasn't been triggered on x86 or power6 is due
> to compiler or processor optimizations reordering the assignment to in
> effect make it atomic. Or maybe the timing window to see the issue is
> harder to observe?
>
>
> Signed-off-by: John Stultz <[email protected]>
>
> Index: linux-2.6.32.y/kernel/time/timekeeping.c
> ===================================================================
> --- linux-2.6.32.y.orig/kernel/time/timekeeping.c 2011-05-04 19:34:21.604314152 -0700
> +++ linux-2.6.32.y/kernel/time/timekeeping.c 2011-05-04 19:39:09.972203989 -0700
> @@ -168,8 +168,10 @@ int __read_mostly timekeeping_suspended;
> static struct timespec xtime_cache __attribute__ ((aligned (16)));
> void update_xtime_cache(u64 nsec)
> {
> - xtime_cache = xtime;
> - timespec_add_ns(&xtime_cache, nsec);
> + /* use temporary timespec so xtime_cache is updated atomically */
Atomic updates are not possible on a 32-bit platform, so this comment is
misleading.
What about a comment saying :
/*
* use temporary variable so get_seconds() cannot catch
* intermediate value (one second backward)
*/
> + struct timespec ts = xtime;
> + timespec_add_ns(&ts, nsec);
> + xtime_cache = ts;
> }
>
> /* must hold xtime_lock */
> @@ -859,6 +861,7 @@ EXPORT_SYMBOL_GPL(monotonic_to_bootbased
>
> unsigned long get_seconds(void)
> {
> + rmb();
Please don't, this makes no sense, and it comes with no comment anyway.
> return xtime_cache.tv_sec;
> }
> EXPORT_SYMBOL(get_seconds);
>
On Thu, 2011-05-05 at 07:44 +0200, Eric Dumazet wrote:
> On Wed, 2011-05-04 at 19:54 -0700, john stultz wrote:
> > On Tue, 2011-05-03 at 20:52 -0700, Andi Kleen wrote:
> > > John Stultz <[email protected]> writes:
> > >
> > > > From: John Stultz <[email protected]>
> > > >
> > > > So get_seconds() has always been lock free, with the assumption
> > > > that accessing a long will be atomic.
> > > >
> > > > However, recently I came across an odd bug where time() access could
> > > > occasionally be inconsistent, but only on power7 hardware. The
> > >
> > > Shouldn't a single rmb() be enough to avoid that?
> > >
> > > If not then I suspect there's a lot more code buggy on that CPU than
> > > just the time.
> >
> > So interestingly, I've found that the issue was not as complex as I
> > first assumed. While the rmb() is probably a good idea for
> > get_seconds(), but it alone does not solve the issue I was seeing,
> > making it clear my theory wasn't correct.
> >
> > The problem was reported against the 2.6.32-stable kernel, and had not
> > been seen in later kernels. I had assumed the change to logarithmic time
> > accumulation basically reduced the window for for the issue to be seen,
> > but it would likely still show up eventually.
> >
> > When the rmb() alone did not solve this issue, I looked to see why the
> > locking did resolve it, and then it was clear: The old
> > update_xtime_cache() function doesn't set the xtime_cache values
> > atomically.
> >
> > Now, the xtime_cache writing is done under the xtime_lock, so the
> > get_seconds() locking resolves the issue, but isn't appropriate since
> > get_seconds() is called from machine check handlers.
> >
> > So the fix here for the 2.6.32-stable tree is to just update xtime_cache
> > in one go as done with the following patch.
> >
> > I also added the rmb() for good measure, and the rmb() should probably
> > also go upstream since theoretically there maybe a platform that could
> > do out of order syscalls.
> >
> > I suspect the reason this hasn't been triggered on x86 or power6 is due
> > to compiler or processor optimizations reordering the assignment to in
> > effect make it atomic. Or maybe the timing window to see the issue is
> > harder to observe?
> >
> >
> > Signed-off-by: John Stultz <[email protected]>
> >
> > Index: linux-2.6.32.y/kernel/time/timekeeping.c
> > ===================================================================
> > --- linux-2.6.32.y.orig/kernel/time/timekeeping.c 2011-05-04 19:34:21.604314152 -0700
> > +++ linux-2.6.32.y/kernel/time/timekeeping.c 2011-05-04 19:39:09.972203989 -0700
> > @@ -168,8 +168,10 @@ int __read_mostly timekeeping_suspended;
> > static struct timespec xtime_cache __attribute__ ((aligned (16)));
> > void update_xtime_cache(u64 nsec)
> > {
> > - xtime_cache = xtime;
> > - timespec_add_ns(&xtime_cache, nsec);
> > + /* use temporary timespec so xtime_cache is updated atomically */
>
> Atomically is not possible on 32bit platform, so this comment is
> misleading.
Well, 32-bit or 64-bit, the time_t .tv_sec portion is a long, so it should be
written atomically.
> What about a comment saying :
> /*
> * use temporary variable so get_seconds() cannot catch
> * intermediate value (one second backward)
> */
Fair enough. Such a comment is an improvement.
> > + struct timespec ts = xtime;
> > + timespec_add_ns(&ts, nsec);
> > + xtime_cache = ts;
> > }
> >
> > /* must hold xtime_lock */
> > @@ -859,6 +861,7 @@ EXPORT_SYMBOL_GPL(monotonic_to_bootbased
> >
> > unsigned long get_seconds(void)
> > {
> > + rmb();
>
> Please dont, this makes no sense, and with no comment anyway.
Would a comment to the effect of "ensure processors don't re-order calls
to get_seconds" help, or is it still too opaque (or even still
nonsense?).
thanks
-john
On Wed, 2011-05-04 at 23:21 -0700, john stultz wrote:
> Would a comment to the effect of "ensure processors don't re-order calls
> to get_seconds" help, or is it still too opaque (or even still
> nonsense?).
>
Where is the corresponding smp_wmb()? It's not obvious, so you must add a
comment to help readers of this code.
On Wed, May 04, 2011 at 11:21:35PM -0700, john stultz wrote:
> On Thu, 2011-05-05 at 07:44 +0200, Eric Dumazet wrote:
> > On Wed, 2011-05-04 at 19:54 -0700, john stultz wrote:
> > > On Tue, 2011-05-03 at 20:52 -0700, Andi Kleen wrote:
> > > > John Stultz <[email protected]> writes:
> > > >
> > > > > From: John Stultz <[email protected]>
> > > > >
> > > > > So get_seconds() has always been lock free, with the assumption
> > > > > that accessing a long will be atomic.
> > > > >
> > > > > However, recently I came across an odd bug where time() access could
> > > > > occasionally be inconsistent, but only on power7 hardware. The
> > > >
> > > > Shouldn't a single rmb() be enough to avoid that?
> > > >
> > > > If not then I suspect there's a lot more code buggy on that CPU than
> > > > just the time.
> > >
> > > So interestingly, I've found that the issue was not as complex as I
> > > first assumed. While the rmb() is probably a good idea for
> > > get_seconds(), but it alone does not solve the issue I was seeing,
> > > making it clear my theory wasn't correct.
> > >
> > > The problem was reported against the 2.6.32-stable kernel, and had not
> > > been seen in later kernels. I had assumed the change to logarithmic time
> > > accumulation basically reduced the window for for the issue to be seen,
> > > but it would likely still show up eventually.
> > >
> > > When the rmb() alone did not solve this issue, I looked to see why the
> > > locking did resolve it, and then it was clear: The old
> > > update_xtime_cache() function doesn't set the xtime_cache values
> > > atomically.
> > >
> > > Now, the xtime_cache writing is done under the xtime_lock, so the
> > > get_seconds() locking resolves the issue, but isn't appropriate since
> > > get_seconds() is called from machine check handlers.
> > >
> > > So the fix here for the 2.6.32-stable tree is to just update xtime_cache
> > > in one go as done with the following patch.
> > >
> > > I also added the rmb() for good measure, and the rmb() should probably
> > > also go upstream since theoretically there maybe a platform that could
> > > do out of order syscalls.
> > >
> > > I suspect the reason this hasn't been triggered on x86 or power6 is due
> > > to compiler or processor optimizations reordering the assignment to in
> > > effect make it atomic. Or maybe the timing window to see the issue is
> > > harder to observe?
> > >
> > >
> > > Signed-off-by: John Stultz <[email protected]>
> > >
> > > Index: linux-2.6.32.y/kernel/time/timekeeping.c
> > > ===================================================================
> > > --- linux-2.6.32.y.orig/kernel/time/timekeeping.c 2011-05-04 19:34:21.604314152 -0700
> > > +++ linux-2.6.32.y/kernel/time/timekeeping.c 2011-05-04 19:39:09.972203989 -0700
> > > @@ -168,8 +168,10 @@ int __read_mostly timekeeping_suspended;
> > > static struct timespec xtime_cache __attribute__ ((aligned (16)));
> > > void update_xtime_cache(u64 nsec)
> > > {
> > > - xtime_cache = xtime;
> > > - timespec_add_ns(&xtime_cache, nsec);
> > > + /* use temporary timespec so xtime_cache is updated atomically */
> >
> > Atomically is not possible on 32bit platform, so this comment is
> > misleading.
>
> Well, 32bit/64bit, the time_t .tv_sec portion is a long, so it should be
> written atomically.
>
> > What about a comment saying :
> > /*
> > * use temporary variable so get_seconds() cannot catch
> > * intermediate value (one second backward)
> > */
>
> Fair enough. Such a comment is an improvement.
>
> > > + struct timespec ts = xtime;
> > > + timespec_add_ns(&ts, nsec);
> > > + xtime_cache = ts;
> > > }
> > >
> > > /* must hold xtime_lock */
> > > @@ -859,6 +861,7 @@ EXPORT_SYMBOL_GPL(monotonic_to_bootbased
> > >
> > > unsigned long get_seconds(void)
> > > {
> > > + rmb();
> >
> > Please dont, this makes no sense, and with no comment anyway.
>
> Would a comment to the effect of "ensure processors don't re-order calls
> to get_seconds" help, or is it still too opaque (or even still
> nonsense?).
A CPU that reordered syscalls reading from or writing to a given memory
location is broken. At least if the CPU does such reordering in a way
that lets the software detect it. There is quite a bit of code out there
that assumes cache coherence, so I sure hope that CPUs don't require
the above memory barrier...
Thanx, Paul
I feel the xtime_lock seqlock is abused these days.
The seqlock abstraction is somewhat lazy/dangerous because write_sequnlock()
does both the seqcount increment and the spinlock release.
I am concerned by the fact that readers might wait for a long time, because
writers hold the whole seqlock, while sometimes they only want to guard
against other writers coming in.
Maybe it's time to separate the two things (the seqcount and the spinlock)
so that a writer can manipulate data in different sections :
- sections while holding the spinlock, allowing "readers" to run
- very small sections enclosed in a pair of seqcount increments, to
synchronize with readers.
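Something like this is what I have in mind, as a very rough sketch (the
names are made up, it assumes the usual timekeeping.c context, and it
glosses over whether every piece of state can really be published that
late) :

        static DEFINE_SPINLOCK(timekeeper_wlock);       /* writer vs writer */
        static seqcount_t timekeeper_seq = SEQCNT_ZERO; /* writer vs reader */

        static void update_wall_time_example(void)
        {
                struct timespec new_xtime, new_wtm;

                spin_lock(&timekeeper_wlock);
                /* long part: compute into temporaries, readers keep running */
                new_xtime = xtime;
                new_wtm = wall_to_monotonic;
                /* ... accumulate cycles, apply adjustments to the copies ... */

                /* short part: publish, only this makes readers retry */
                write_seqcount_begin(&timekeeper_seq);
                xtime = new_xtime;
                wall_to_monotonic = new_wtm;
                write_seqcount_end(&timekeeper_seq);
                spin_unlock(&timekeeper_wlock);
        }

        ktime_t ktime_get_example(void)
        {
                unsigned seq;
                s64 secs, nsecs;

                do {
                        seq = read_seqcount_begin(&timekeeper_seq);
                        secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
                        nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
                        nsecs += timekeeping_get_ns();
                } while (read_seqcount_retry(&timekeeper_seq, seq));

                return ktime_add_ns(ktime_set(secs, 0), nsecs);
        }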
I tried the following hack/patch to get an idea of ktime_get() latencies on
the current linux-2.6 kernel, and they really are too big. ktime_get()
should be as fast as possible [ if hardware permits ], yet it is
sometimes very slow, even on the TSC clocksource.
Note : I disable irqs to make sure the latencies are not caused by an
interrupt.
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index e98cd2e..0a3aec5 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -83,7 +83,7 @@ static inline int write_tryseqlock(seqlock_t *sl)
}
/* Start of read calculation -- fetch last complete writer token */
-static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
+static __always_inline unsigned __read_seqbegin(const seqlock_t *sl, unsigned long *loop)
{
unsigned ret;
@@ -91,6 +91,8 @@ repeat:
ret = sl->sequence;
smp_rmb();
if (unlikely(ret & 1)) {
+ if (loop)
+ (*loop)++;
cpu_relax();
goto repeat;
}
@@ -98,6 +100,11 @@ repeat:
return ret;
}
+static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
+{
+ return __read_seqbegin(sl, NULL);
+}
+
/*
* Test if reader processed invalid data.
*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 8ad5d57..db84027 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -237,20 +237,28 @@ void getnstimeofday(struct timespec *ts)
EXPORT_SYMBOL(getnstimeofday);
+static DEFINE_PER_CPU(unsigned long, loopmax);
ktime_t ktime_get(void)
{
unsigned int seq;
s64 secs, nsecs;
+ unsigned long flags, loop = 0;
WARN_ON(timekeeping_suspended);
+ local_irq_save(flags);
do {
- seq = read_seqbegin(&xtime_lock);
+ seq = __read_seqbegin(&xtime_lock, &loop);
secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
nsecs += timekeeping_get_ns();
} while (read_seqretry(&xtime_lock, seq));
+ if (unlikely(loop > __this_cpu_read(loopmax))) {
+ this_cpu_write(loopmax, loop);
+ pr_err("cpu%d ktime_get() looped %lu times\n", smp_processor_id(), loop);
+ }
+ local_irq_restore(flags);
/*
* Use ktime_set/ktime_add_ns to create a proper ktime on
* 32-bit architectures without CONFIG_KTIME_SCALAR.
Results on my Intel(R) Xeon(R) CPU E5450 @ 3.00GHz machine
[ 7.754018] cpu2 ktime_get() looped 3235 times
[ 7.754021] cpu6 ktime_get() looped 3078 times
[ 7.754024] cpu4 ktime_get() looped 3208 times
[ 7.754026] cpu0 ktime_get() looped 3206 times
[ 9.724026] cpu4 ktime_get() looped 4432 times
[ 9.724029] cpu0 ktime_get() looped 4363 times
[ 9.724032] cpu2 ktime_get() looped 4453 times
[ 9.724035] cpu6 ktime_get() looped 4590 times
[ 230.795003] cpu7 ktime_get() looped 1637 times
[ 230.795001] cpu1 ktime_get() looped 1606 times
[ 231.157003] cpu7 ktime_get() looped 4240 times
[ 231.157002] cpu5 ktime_get() looped 4367 times
[ 231.157002] cpu1 ktime_get() looped 4322 times
[ 231.861003] cpu7 ktime_get() looped 4262 times
[ 234.045002] cpu5 ktime_get() looped 4700 times
[ 234.045002] cpu7 ktime_get() looped 4598 times
[ 234.045000] cpu1 ktime_get() looped 4689 times
[ 234.081003] cpu7 ktime_get() looped 4603 times
[ 234.238003] cpu4 ktime_get() looped 4693 times
[ 234.238029] cpu0 ktime_get() looped 4559 times
[ 275.362000] cpu5 ktime_get() looped 4892 times
[ 275.362002] cpu1 ktime_get() looped 4892 times
[ 279.176002] cpu4 ktime_get() looped 4760 times
[ 291.480003] cpu1 ktime_get() looped 4915 times
[ 294.045027] cpu2 ktime_get() looped 4642 times
[ 294.045029] cpu0 ktime_get() looped 4696 times
[ 294.045002] cpu4 ktime_get() looped 4891 times
Each loop is done with 6 machine instructions, so we have some pretty high values.
On Thu, 5 May 2011, Eric Dumazet wrote:
> I feel xtime_lock seqlock is abused these days.
>
> seqlock abstraction is somewhat lazy/dangerous because write_sequnlock()
> does both the seqcount increment and spinlock release.
>
> I am concerned by fact that readers might wait for long times, because
> writers hold the whole seqlock, while sometime they only want to guard
> other writers to come in.
>
> Maybe it's time to separate the things (the seqcount and the spinlock)
> so that writer can manipulate data in different sections :
> - Sections while holding spinlock, allowing "readers" to run
> - Very small sections enclosed in a pair of seqcount increments, to
> synchronize with readers.
Well, in the case of timekeeping that might be problematic. I'm not
sure whether we can calculate the new values under the spinlock and
then update the timekeeper under the seqlock because we might adjust
the mult/shift pair, which then can result in observable time going
backwards problems. It might be worth a try, though. John ???
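To make that concrete with made-up numbers (using the usual
ns = base + (cycles - cycle_last) * mult form, shift omitted): say the old
state is base = 1000ns at cycle_last = 0 with mult = 1.0 ns/cycle. The
update, started at cycle 100, computes base = 1100ns, cycle_last = 100 and
adjusts mult down to 0.9 ns/cycle, but the "long part" only finishes and
publishes at cycle 140. A reader at cycle 139 still uses the old state and
gets 1000 + 139 * 1.0 = 1139ns; a reader at cycle 141 uses the new state
and gets 1100 + 41 * 0.9 = 1136.9ns, i.e. time went backwards. Today the
whole thing sits inside the seqlock, so readers simply cannot sample that
far into the gap.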
The only thing which really can move right away outside the xtime
seqlock region is calc_global_load().
Thanks,
tglx
On Thu, 2011-05-05 at 16:39 +0200, Thomas Gleixner wrote:
> On Thu, 5 May 2011, Eric Dumazet wrote:
>
> > I feel xtime_lock seqlock is abused these days.
> >
> > seqlock abstraction is somewhat lazy/dangerous because write_sequnlock()
> > does both the seqcount increment and spinlock release.
> >
> > I am concerned by fact that readers might wait for long times, because
> > writers hold the whole seqlock, while sometime they only want to guard
> > other writers to come in.
> >
> > Maybe it's time to separate the things (the seqcount and the spinlock)
> > so that writer can manipulate data in different sections :
> > - Sections while holding spinlock, allowing "readers" to run
> > - Very small sections enclosed in a pair of seqcount increments, to
> > synchronize with readers.
>
> Well, in the case of timekeeping that might be problematic. I'm not
> sure whether we can calculate the new values under the spinlock and
> then update the timekeeper under the seqlock because we might adjust
> the mult/shift pair which then can result in observabcle time going
> backwards problems. It might be worth a try, though. John ???
>
> The only thing which really can move right away outside the xtime
> seqlock region is calc_global_load().
>
That would be a start, but we also could have finer granularity in
locks :
update_vsyscall() has its own protection and could be done outside of
the seqcount inc pair used for ktime_get().
[ but my patch numbers were for a 32bit kernel, so vsyscall is not
accounted for. ]
Another idea would be to prime the cache lines to be dirtied in the cpu cache
before taking locks, and to pack variables better to reduce the number of cache
lines.
On Thu, 5 May 2011, Eric Dumazet wrote:
> On Thu, 2011-05-05 at 16:39 +0200, Thomas Gleixner wrote:
> > On Thu, 5 May 2011, Eric Dumazet wrote:
> >
> > > I feel xtime_lock seqlock is abused these days.
> > >
> > > seqlock abstraction is somewhat lazy/dangerous because write_sequnlock()
> > > does both the seqcount increment and spinlock release.
> > >
> > > I am concerned by fact that readers might wait for long times, because
> > > writers hold the whole seqlock, while sometime they only want to guard
> > > other writers to come in.
> > >
> > > Maybe it's time to separate the things (the seqcount and the spinlock)
> > > so that writer can manipulate data in different sections :
> > > - Sections while holding spinlock, allowing "readers" to run
> > > - Very small sections enclosed in a pair of seqcount increments, to
> > > synchronize with readers.
> >
> > Well, in the case of timekeeping that might be problematic. I'm not
> > sure whether we can calculate the new values under the spinlock and
> > then update the timekeeper under the seqlock because we might adjust
> > the mult/shift pair which then can result in observabcle time going
> > backwards problems. It might be worth a try, though. John ???
> >
> > The only thing which really can move right away outside the xtime
> > seqlock region is calc_global_load().
> >
>
> That would be a start, but we also could have finer granularity in
> locks :
>
> update_vsyscall() has its own protection and could be done outside of
> the seqcount inc pair used for ktime_get().
Yeah, we could move that out, but it might be interesting to add a few
tracepoints into update_wall_time() first to see which part takes the
most time.
> [ but my patch numbers were for a 32bit kernel, so vsyscall is not
> accounted for. ]
:)
> Another idea would be to prime cache lines to be dirtied in cpu cache
> before taking locks, and better pack variables to reduce number of cache
> lines.
Most variables are packed already in struct timekeeper, which should
be pretty cache hot anyway, so I don't know whether we gain much.
Thanks,
tglx
> I suspect the reason this hasn't been triggered on x86 or power6 is due
> to compiler or processor optimizations reordering the assignment to in
> effect make it atomic. Or maybe the timing window to see the issue is
> harder to observe?
On x86 all aligned stores are atomic. So I don't see how this
could be a problem ever.
Are you saying that on P6 only atomic stores are atomic?
If that's the case again I suspect other code will be affected
too.
-Andi
>
On Thu, 2011-05-05 at 01:14 -0700, Paul E. McKenney wrote:
> On Wed, May 04, 2011 at 11:21:35PM -0700, john stultz wrote:
> > On Thu, 2011-05-05 at 07:44 +0200, Eric Dumazet wrote:
> > > On Wed, 2011-05-04 at 19:54 -0700, john stultz wrote:
> > > > On Tue, 2011-05-03 at 20:52 -0700, Andi Kleen wrote:
> > > > > John Stultz <[email protected]> writes:
> > > > >
> > > > > > From: John Stultz <[email protected]>
> > > > > >
> > > > > > So get_seconds() has always been lock free, with the assumption
> > > > > > that accessing a long will be atomic.
> > > > > >
> > > > > > However, recently I came across an odd bug where time() access could
> > > > > > occasionally be inconsistent, but only on power7 hardware. The
> > > > >
> > > > > Shouldn't a single rmb() be enough to avoid that?
> > > > >
> > > > > If not then I suspect there's a lot more code buggy on that CPU than
> > > > > just the time.
> > > >
> > > > So interestingly, I've found that the issue was not as complex as I
> > > > first assumed. While the rmb() is probably a good idea for
> > > > get_seconds(), but it alone does not solve the issue I was seeing,
> > > > making it clear my theory wasn't correct.
> > > >
> > > > The problem was reported against the 2.6.32-stable kernel, and had not
> > > > been seen in later kernels. I had assumed the change to logarithmic time
> > > > accumulation basically reduced the window for for the issue to be seen,
> > > > but it would likely still show up eventually.
> > > >
> > > > When the rmb() alone did not solve this issue, I looked to see why the
> > > > locking did resolve it, and then it was clear: The old
> > > > update_xtime_cache() function doesn't set the xtime_cache values
> > > > atomically.
> > > >
> > > > Now, the xtime_cache writing is done under the xtime_lock, so the
> > > > get_seconds() locking resolves the issue, but isn't appropriate since
> > > > get_seconds() is called from machine check handlers.
> > > >
> > > > So the fix here for the 2.6.32-stable tree is to just update xtime_cache
> > > > in one go as done with the following patch.
> > > >
> > > > I also added the rmb() for good measure, and the rmb() should probably
> > > > also go upstream since theoretically there maybe a platform that could
> > > > do out of order syscalls.
> > > >
> > > > I suspect the reason this hasn't been triggered on x86 or power6 is due
> > > > to compiler or processor optimizations reordering the assignment to in
> > > > effect make it atomic. Or maybe the timing window to see the issue is
> > > > harder to observe?
> > > >
> > > >
> > > > Signed-off-by: John Stultz <[email protected]>
> > > >
> > > > Index: linux-2.6.32.y/kernel/time/timekeeping.c
> > > > ===================================================================
> > > > --- linux-2.6.32.y.orig/kernel/time/timekeeping.c 2011-05-04 19:34:21.604314152 -0700
> > > > +++ linux-2.6.32.y/kernel/time/timekeeping.c 2011-05-04 19:39:09.972203989 -0700
> > > > @@ -168,8 +168,10 @@ int __read_mostly timekeeping_suspended;
> > > > static struct timespec xtime_cache __attribute__ ((aligned (16)));
> > > > void update_xtime_cache(u64 nsec)
> > > > {
> > > > - xtime_cache = xtime;
> > > > - timespec_add_ns(&xtime_cache, nsec);
> > > > + /* use temporary timespec so xtime_cache is updated atomically */
> > >
> > > Atomically is not possible on 32bit platform, so this comment is
> > > misleading.
> >
> > Well, 32bit/64bit, the time_t .tv_sec portion is a long, so it should be
> > written atomically.
> >
> > > What about a comment saying :
> > > /*
> > > * use temporary variable so get_seconds() cannot catch
> > > * intermediate value (one second backward)
> > > */
> >
> > Fair enough. Such a comment is an improvement.
> >
> > > > + struct timespec ts = xtime;
> > > > + timespec_add_ns(&ts, nsec);
> > > > + xtime_cache = ts;
> > > > }
> > > >
> > > > /* must hold xtime_lock */
> > > > @@ -859,6 +861,7 @@ EXPORT_SYMBOL_GPL(monotonic_to_bootbased
> > > >
> > > > unsigned long get_seconds(void)
> > > > {
> > > > + rmb();
> > >
> > > Please dont, this makes no sense, and with no comment anyway.
> >
> > Would a comment to the effect of "ensure processors don't re-order calls
> > to get_seconds" help, or is it still too opaque (or even still
> > nonsense?).
>
> A CPU that reordered syscalls reading from or writing to a given memory
> location is broken. At least if the CPU does such reordering in a way
> that lets the software detect it. There is quite a bit of code out there
> that assumes cache coherence, so I sure hope that CPUs don't require
> the above memory barrier...
Much appreciated. I'll drop it then.
thanks
-john
On Thu, 2011-05-05 at 19:57 +0200, Andi Kleen wrote:
> > I suspect the reason this hasn't been triggered on x86 or power6 is due
> > to compiler or processor optimizations reordering the assignment to in
> > effect make it atomic. Or maybe the timing window to see the issue is
> > harder to observe?
>
> On x86 all aligned stores are atomic. So I don't see how this
> could be a problem ever.
No no. The issue was with the fact that in update_xtime_cache we modify
xtime_cache twice (once setting it possibly backwards to xtime, then
adding in the nsec offset).
Since get_seconds does no locking, this issue should be visible
anywhere, as long as you manage to hit the race window between the first
assignment and the second.
However, in the testing, the issue only showed up on P7, but not P6 or
x86.
My guess was that the code:
xtime_cache.sec = xtime.sec
xtime_cache.nsec = xtime.nsec
xtime_cache.sec = xtime_cache.sec
+ div(xtime_cache.nsec + nsec, NSEC_PER_SEC, &rem);
xtime_cache.nsec = rem
Was getting rearranged to:
xtime_cache.sec = xtime.sec
+ div(xtime.nsec + nsec, NSEC_PER_SEC, &rem);
xtime_cache.nsec = rem
Which makes the xtime_cache.sec update atomic.
But it's just a guess.
thanks
-john
On Thu, 2011-05-05 at 13:17 -0700, john stultz wrote:
> On Thu, 2011-05-05 at 19:57 +0200, Andi Kleen wrote:
> > > I suspect the reason this hasn't been triggered on x86 or power6 is due
> > > to compiler or processor optimizations reordering the assignment to in
> > > effect make it atomic. Or maybe the timing window to see the issue is
> > > harder to observe?
> >
> > On x86 all aligned stores are atomic. So I don't see how this
> > could be a problem ever.
>
> No no. The issue was with the fact that in update_xtime_cache we modify
> xtime_cache twice (once setting it possibly backwards to xtime, then
> adding in the nsec offset).
>
> Since get_seconds does no locking, this issue should be visible
> anywhere, as long as you manage to hit the race window between the first
> assignment and the second.
>
> However, in the testing, the issue only showed up on P7, but not P6 or
> x86.
>
> My guess was that the code:
>
> xtime_cache.sec = xtime.sec
> xtime_cache.nsec = xtime.nsec
> xtime_cache.sec = xtime_cache.sec
> + div(xtime_cache.nsec + nsec, NSEC_PER_SEC, &rem);
> xtime_cache.nsec = rem
>
> Was getting rearranged to:
>
> xtime_cache.sec = xtime.sec
> + div(xtime.nsec + nsec, NSEC_PER_SEC, &rem);
> xtime_cache.nsec = rem
>
>
> Which makes the xtime_cache.sec update atomic.
>
> But its just a guess.
Sure (disassembly could help to check this), but get_seconds() reads
xtime.tv_sec ;)
On Thu, 2011-05-05 at 22:24 +0200, Eric Dumazet wrote:
> On Thu, 2011-05-05 at 13:17 -0700, john stultz wrote:
> > On Thu, 2011-05-05 at 19:57 +0200, Andi Kleen wrote:
> > > > I suspect the reason this hasn't been triggered on x86 or power6 is due
> > > > to compiler or processor optimizations reordering the assignment to in
> > > > effect make it atomic. Or maybe the timing window to see the issue is
> > > > harder to observe?
> > >
> > > On x86 all aligned stores are atomic. So I don't see how this
> > > could be a problem ever.
> >
> > No no. The issue was with the fact that in update_xtime_cache we modify
> > xtime_cache twice (once setting it possibly backwards to xtime, then
> > adding in the nsec offset).
> >
> > Since get_seconds does no locking, this issue should be visible
> > anywhere, as long as you manage to hit the race window between the first
> > assignment and the second.
> >
> > However, in the testing, the issue only showed up on P7, but not P6 or
> > x86.
> >
> > My guess was that the code:
> >
> > xtime_cache.sec = xtime.sec
> > xtime_cache.nsec = xtime.nsec
> > xtime_cache.sec = xtime_cache.sec
> > + div(xtime_cache.nsec + nsec, NSEC_PER_SEC, &rem);
> > xtime_cache.nsec = rem
> >
> > Was getting rearranged to:
> >
> > xtime_cache.sec = xtime.sec
> > + div(xtime.nsec + nsec, NSEC_PER_SEC, &rem);
> > xtime_cache.nsec = rem
> >
> >
> > Which makes the xtime_cache.sec update atomic.
> >
> > But its just a guess.
>
> Sure (disassembly could help to check this), but get_seconds() reads
> xtime.tv_sec ;)
Currently, yes.
But as I mentioned in an earlier mail, the problem was with
2.6.32-stable.
thanks
-john
On Thu, 2011-05-05 at 13:40 -0700, john stultz wrote:
> Currently, yes.
> But as I mentioned in an earlier mail, the problem was with
> 2.6.32-stable.
Oops sorry, I forgot this :(
On Thu, May 05, 2011 at 01:17:01PM -0700, john stultz wrote:
> On Thu, 2011-05-05 at 19:57 +0200, Andi Kleen wrote:
> > > I suspect the reason this hasn't been triggered on x86 or power6 is due
> > > to compiler or processor optimizations reordering the assignment to in
> > > effect make it atomic. Or maybe the timing window to see the issue is
> > > harder to observe?
> >
> > On x86 all aligned stores are atomic. So I don't see how this
> > could be a problem ever.
>
> No no. The issue was with the fact that in update_xtime_cache we modify
> xtime_cache twice (once setting it possibly backwards to xtime, then
> adding in the nsec offset).
Ok makes sense. So the obvious fix is to compute the values first
in temporaries, then write them with appropriate wmb()s?
For the reader it should not make any difference.
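i.e. something along these lines (untested, and the barrier is probably
overkill for get_seconds() itself, since it only ever looks at tv_sec,
which is a single aligned store either way):

        void update_xtime_cache(u64 nsec)
        {
                struct timespec ts = xtime;

                timespec_add_ns(&ts, nsec);
                /*
                 * Each field is written exactly once, with a plain store;
                 * the barrier just keeps the two stores ordered for any
                 * lockless reader that might look at both fields.
                 */
                xtime_cache.tv_nsec = ts.tv_nsec;
                smp_wmb();
                xtime_cache.tv_sec = ts.tv_sec;
        }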
-Andi
> > Another idea would be to prime cache lines to be dirtied in cpu cache
> > before taking locks, and better pack variables to reduce number of cache
> > lines.
>
> Most variables are packed already in struct timekeeper, which should
> be pretty cache hot anyway, so I don't know whether we gain much.
There's actually some potential here. I got a moderate speedup in a
database benchmark with this patch recently. The biggest win
was in the timer interrupt.
All those variables are on separate cache lines.
(needs some cleanups, just for illustration)
-Andi
From: Andi Kleen <[email protected]>
Subject: [PATCH] do prefetches for timer state in timer interrupt
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 76b96d7..2704267 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -856,6 +856,17 @@ void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
{
struct pt_regs *old_regs = set_irq_regs(regs);
+ extern struct timespec xtime;
+ extern struct timespec wall_to_monotonic;
+
+
+ /* Prefetch here to make the ktime_get later faster.
+ * XXX move them all on the same cache line.
+ */
+ prefetch(&xtime_lock);
+ prefetch(&xtime);
+ prefetch(&wall_to_monotonic);
+
/*
* NOTE! We'd better ACK the irq immediately,
* because timer handling can be slow.
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index d27c756..0002e9e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -157,8 +157,8 @@ __cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock);
* - wall_to_monotonic is no longer the boot time, getboottime must be
* used instead.
*/
-static struct timespec xtime __attribute__ ((aligned (16)));
-static struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
+struct timespec xtime __attribute__ ((aligned (16)));
+struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
static struct timespec total_sleep_time;
/*
On Thu, 2011-05-05 at 23:01 +0200, Andi Kleen wrote:
> > > Another idea would be to prime cache lines to be dirtied in cpu cache
> > > before taking locks, and better pack variables to reduce number of cache
> > > lines.
> >
> > Most variables are packed already in struct timekeeper, which should
> > be pretty cache hot anyway, so I don't know whether we gain much.
>
> There's actually some potential here. I got a moderate speedup in a
> database benchmark with this patch recently. The biggest win
> was in the timer interrupt.
>
> All those variables are on separate cache lines.
>
> (needs some cleanups, just for illustration)
>
> -Andi
>
> From: Andi Kleen <[email protected]>
> Subject: [PATCH] do prefetches for timer state in timer interrupt
>
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 76b96d7..2704267 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -856,6 +856,17 @@ void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
> {
> struct pt_regs *old_regs = set_irq_regs(regs);
>
> + extern struct timespec xtime;
> + extern struct timespec wall_to_monotonic;
> +
> +
> + /* Prefetch here to make the ktime_get later faster.
> + * XXX move them all on the same cache line.
> + */
> + prefetch(&xtime_lock);
> + prefetch(&xtime);
> + prefetch(&wall_to_monotonic);
> +
> /*
> * NOTE! We'd better ACK the irq immediately,
> * because timer handling can be slow.
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index d27c756..0002e9e 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -157,8 +157,8 @@ __cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock);
> * - wall_to_monotonic is no longer the boot time, getboottime must be
> * used instead.
> */
> -static struct timespec xtime __attribute__ ((aligned (16)));
> -static struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
> +struct timespec xtime __attribute__ ((aligned (16)));
> +struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
> static struct timespec total_sleep_time;
>
> /*
>
Did you try prefetchw() by any chance?
>
> Did you try prefetchw() by any chance?
I didn't because prefetchw() is the same as prefetch() on Intel machines.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, 5 May 2011, Andi Kleen wrote:
> > > Another idea would be to prime cache lines to be dirtied in cpu cache
> > > before taking locks, and better pack variables to reduce number of cache
> > > lines.
> >
> > Most variables are packed already in struct timekeeper, which should
> > be pretty cache hot anyway, so I don't know whether we gain much.
>
> There's actually some potential here. I got a moderate speedup in a
> database benchmark with this patch recently. The biggest win
Numbers please.
> was in the timer interrupt.
Because this _is_ the timer interrupt ?
> All those variables are on separate cache lines.
Well, then we should make sure that they are not.
And looking at a few kernel images the interesting variables
timekeeper, xtime, wall_to_monotonic are in a consecutive area which
is not really surprising. Further all images have xtime and
wall_to_monotonic in the same cacheline, just xtime_lock is somewhere
else.
> (needs some cleanups, just for illustration)
And what is that cleanup going to look like? Making the timekeeping
internal variables global again is not going to happen. And if you
want prefetching those variables in the timer interrupt, then you want
to prefetch them in any random code path which ends up touching them.
That's the completely wrong approach, really. If stuff takes too long,
then we need to tackle it at the root of the problem and not solve it
by sprinkling magic prefetches all over the place.
Thanks,
tglx
* Thomas Gleixner <[email protected]> wrote:
> On Thu, 5 May 2011, Andi Kleen wrote:
>
> > > > Another idea would be to prime cache lines to be dirtied in cpu cache
> > > > before taking locks, and better pack variables to reduce number of cache
> > > > lines.
> > >
> > > Most variables are packed already in struct timekeeper, which should
> > > be pretty cache hot anyway, so I don't know whether we gain much.
> >
> > There's actually some potential here. I got a moderate speedup in a
> > database benchmark with this patch recently. The biggest win
>
> Numbers please.
I'd suggest creating and publishing a seqlock usage worst-case testcase,
something that runs N threads on an N CPU system.
Then precise measurements have to be done on the source of cache misses, the
total cost of the timer interrupt, etc.
I.e. this should be analyzed and improved properly, not just sloppily slapping
a few prefetches here and there, which won't really *solve* anything ...
Thanks,
Ingo
> Then precise measurements have to be done on the source of cache misses, the
> total cost of the timer interrupt, etc.
What we did was profiling the remote node cache misses
using raw offcore events, and the xtime code was high up.
I cannot share exact numbers for various reasons, but it was
an improvement on a hard to improve workload.
> I.e. this should be analyzed and improved properly, not just sloppily slapping
> a few prefetches here and there, which wont really *solve* anything ...
Well it made things faster. That's definitely something.
I agree the cache lines need to be consolidated in a proper way.
But well placed prefetches by themselves can be a useful tool,
and placing them based on stable profile results is not "sloppy".
-Andi
On Fri, May 06, 2011 at 12:18:27PM +0200, Thomas Gleixner wrote:
> > There's actually some potential here. I got a moderate speedup in a
> > database benchmark with this patch recently. The biggest win
>
> Numbers please.
Low single digits on a hard-to-improve, well-tuned workload.
> Well, then we should make sure that they are not.
Ok I will send a patch.
>
> And looking at a few kernel images the interesting variables
> timekeeper, xtime, wall_to_monotonic are in a consecutive area which
> is not really surprising. Further all images have xtime and
> wall_to_monotonic in the same cacheline, just xtime_lock is somewhere
> else.
I guess it depends a lot on the particular alignment.
But yes it should be ensured.
>
> > (needs some cleanups, just for illustration)
>
> And how's that cleanup going to look like? Making the timekeeping
> internal variables global again is not going to happen. And if you
Why not? You could of course call some function in the file
to do the prefetches, but that would seem dumb to me.
> want prefetching those variables in the timer interrupt, then you want
> to prefetch them in any random code path which ends up touching them.
If that code path shows up in profiling. Only the timer interrupt
did so far.
> That's the completely wrong aproach, really. If stuff takes too long,
> then we need to tackle it at the root of the problem and not solve it
> by sprinkling magic prefetches all over the place.
If you have a better way to make it faster please share it.
-Andi
On Fri, 2011-05-06 at 18:59 +0200, Andi Kleen wrote:
> On Fri, May 06, 2011 at 12:18:27PM +0200, Thomas Gleixner wrote:
> If that code path shows up in profiling. Only the timer interrupt
> did so far.
>
> > That's the completely wrong aproach, really. If stuff takes too long,
> > then we need to tackle it at the root of the problem and not solve it
> > by sprinkling magic prefetches all over the place.
>
> If you have a better way to make it faster please share it.
>
Unless I am very mistaken, it cannot explain my original report of up to
24,000 instructions waiting in ktime_get().
BTW, how many consecutive prefetch() calls can a cpu 'store/execute' before
throwing away hints?
> Unless I am very mistaken, it cannot explain my original report of up to
> 24.000 instructions waiting in ktime_get()
Yes it's a different problem. Sorry for hijacking your thread.
>
> BTW, how many consecutive prefetch() a cpu can 'store/execute' before
> throwing away hints ?
It depends on the CPU. Older ones had a small number only. On the
newer ones it tends to be limited by the various queues to memory,
which are a shared resource (so it also depends on what the other
threads/cores are doing).
As a rule of thumb I normally assume < 50.
-Andi
--
[email protected] -- Speaking for myself only.
On Fri, 2011-05-06 at 18:59 +0200, Andi Kleen wrote:
> If you have a better way to make it faster please share it.
Ideally we could use RCU :)
Hold whatever state in one structure (possibly big, it doesn't
matter) and switch the pointer in the writer once everything is set up in
the new structure.
struct time_keep_blob {
struct timespec xtime;
struct timespec wall_to_monotonic;
...
};
struct time_keep_blob __rcu *xtime_cur;
ktime_t ktime_get(void)
{
const struct time_keep_blob *xp;
s64 secs, nsecs;
rcu_read_lock();
xp = rcu_dereference(xtime_cur);
secs = xp->xtime.tv_sec + xp->wall_to_monotonic.tv_sec;
nsecs = xp->xtime.tv_nsec + xp->wall_to_monotonic.tv_nsec;
nsecs += timekeeping_get_ns(xp);
rcu_read_unlock();
return ktime_add_ns(ktime_set(secs, 0), nsecs);
}
I don't know the timekeeping details, maybe it's necessary to loop if
xtime_cur changes :
ktime_t ktime_get(void)
{
const struct time_keep_blob *xp;
s64 secs, nsecs;
rcu_read_lock();
do {
xp = rcu_dereference(xtime_cur);
secs = xp->xtime.tv_sec + xp->wall_to_monotonic.tv_sec;
nsecs = xp->xtime.tv_nsec + xp->wall_to_monotonic.tv_nsec;
nsecs += timekeeping_get_ns(xp);
} while (rcu_dereference(xtime_cur) != xp);
rcu_read_unlock();
return ktime_add_ns(ktime_set(secs, 0), nsecs);
}
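The writer side would then be something like this (very rough; the real
update runs from the timer interrupt, so it could not just
synchronize_rcu(), and reusing the old blob is only safe after a grace
period, which I completely hand-wave here -- call_rcu(), more buffers,
whatever) :

        static struct time_keep_blob blobs[2];
        static int cur;         /* index of the currently published blob */

        static void timekeeping_publish(void)
        {
                struct time_keep_blob *next = &blobs[cur ^ 1];

                *next = blobs[cur];
                /* ... accumulate cycles, adjust mult, etc. on *next ... */
                rcu_assign_pointer(xtime_cur, next);
                cur ^= 1;
                /*
                 * The old blob may only be reused once no reader can still
                 * reference it, i.e. after a grace period -- glossed over.
                 */
        }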
On Fri, May 06, 2011 at 07:42:47PM +0200, Eric Dumazet wrote:
> On Fri, 2011-05-06 at 18:59 +0200, Andi Kleen wrote:
>
> > If you have a better way to make it faster please share it.
>
> Ideally we could use RCU :)
Hmm, I didn't think my case had a lot of loops in the seqlock -- just
expensive cache misses -- but I should double check.
For the lots of loop case we probably need to understand first why you
iterate that often.
-Andi
On Fri, 2011-05-06 at 19:50 +0200, Andi Kleen wrote:
> On Fri, May 06, 2011 at 07:42:47PM +0200, Eric Dumazet wrote:
> > On Fri, 2011-05-06 at 18:59 +0200, Andi Kleen wrote:
> >
> > > If you have a better way to make it faster please share it.
> >
> > Ideally we could use RCU :)
>
> Hmm, I didn't think my case had a lot of loops in the seqlock -- just
> expensive cache misses -- but I should double check.
>
> For the lots of loop case we probably need to understand first why you
> iterate that often.
Yep, I'll try to investigate this.
Thanks !
On Fri, 2011-05-06 at 21:26 +0200, Eric Dumazet wrote:
> On Fri, 2011-05-06 at 19:50 +0200, Andi Kleen wrote:
> > On Fri, May 06, 2011 at 07:42:47PM +0200, Eric Dumazet wrote:
> > > On Fri, 2011-05-06 at 18:59 +0200, Andi Kleen wrote:
> > >
> > > > If you have a better way to make it faster please share it.
> > >
> > > Ideally we could use RCU :)
> >
> > Hmm, I didn't think my case had a lot of loops in the seqlock -- just
> > expensive cache misses -- but I should double check.
> >
> > For the lots of loop case we probably need to understand first why you
> > iterate that often.
>
> Yep, I'll try to investigate on this
>
So apparently some calls to tick_do_update_jiffies64() are pretty
expensive :
[ 369.334399] maxhold=1191627
[ 369.334450] ------------[ cut here ]------------
[ 369.334505] WARNING: at include/linux/seqlock.h:81 tick_do_update_jiffies64+0x169/0x1c0()
[ 369.334579] Hardware name: ProLiant BL460c G1
[ 369.334631] Modules linked in: xt_hashlimit ipmi_devintf af_packet ipmi_si ipmi_msghandler hpilo tg3 bonding
[ 369.334989] Pid: 0, comm: swapper Tainted: G W 2.6.39-rc6-00097-g6ac1576-dirty #547
[ 369.335000] Call Trace:
[ 369.335000] [<c05b395f>] ? printk+0x18/0x21
[ 369.335000] [<c023b90d>] warn_slowpath_common+0x6d/0xa0
[ 369.335000] [<c02661f9>] ? tick_do_update_jiffies64+0x169/0x1c0
[ 369.335000] [<c02661f9>] ? tick_do_update_jiffies64+0x169/0x1c0
[ 369.335000] [<c023b95d>] warn_slowpath_null+0x1d/0x20
[ 369.335000] [<c02661f9>] tick_do_update_jiffies64+0x169/0x1c0
[ 369.335000] [<c0266c18>] tick_check_idle+0x78/0xa0
[ 369.335000] [<c0241734>] irq_enter+0x54/0x60
[ 369.335000] [<c021ac3c>] smp_apic_timer_interrupt+0x2c/0x90
[ 369.335000] [<c025b452>] ? hrtimer_start+0x22/0x30
[ 369.335000] [<c05b6d2a>] apic_timer_interrupt+0x2a/0x30
[ 369.335000] [<c02098e4>] ? mwait_idle+0x84/0x180
[ 369.335000] [<c02015f6>] cpu_idle+0x46/0x80
[ 369.335000] [<c059f41d>] rest_init+0x5d/0x70
[ 369.335000] [<c07c8710>] start_kernel+0x2d6/0x2dc
[ 369.335000] [<c07c81c2>] ? loglevel+0x1a/0x1a
[ 369.335000] [<c07c80bd>] i386_start_kernel+0xbd/0xc5
[ 369.335000] ---[ end trace 4eaa2a86a8e2da80 ]---
I used the following hack/patch :
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index e98cd2e..1e3ba24 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -32,6 +32,8 @@
typedef struct {
unsigned sequence;
spinlock_t lock;
+ unsigned long long maxhold;
+ unsigned long long t0;
} seqlock_t;
/*
@@ -48,6 +50,7 @@ typedef struct {
do { \
(x)->sequence = 0; \
spin_lock_init(&(x)->lock); \
+ (x)->maxhold = 0; \
} while (0)
#define DEFINE_SEQLOCK(x) \
@@ -61,13 +64,22 @@ static inline void write_seqlock(seqlock_t *sl)
{
spin_lock(&sl->lock);
++sl->sequence;
+ sl->t0 = __native_read_tsc();
smp_wmb();
}
static inline void write_sequnlock(seqlock_t *sl)
{
+ unsigned long long t1;
smp_wmb();
sl->sequence++;
+ t1 = __native_read_tsc();
+ t1 -= sl->t0;
+ if (unlikely(t1 > sl->maxhold)) {
+ sl->maxhold = t1;
+ pr_err("maxhold=%llu\n", t1);
+ WARN_ON(1);
+ }
spin_unlock(&sl->lock);
}
@@ -77,13 +89,14 @@ static inline int write_tryseqlock(seqlock_t *sl)
if (ret) {
++sl->sequence;
+ sl->t0 = __native_read_tsc();
smp_wmb();
}
return ret;
}
/* Start of read calculation -- fetch last complete writer token */
-static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
+static __always_inline unsigned __read_seqbegin(const seqlock_t *sl, unsigned long *loop)
{
unsigned ret;
@@ -91,6 +104,8 @@ repeat:
ret = sl->sequence;
smp_rmb();
if (unlikely(ret & 1)) {
+ if (loop)
+ (*loop)++;
cpu_relax();
goto repeat;
}
@@ -98,6 +113,11 @@ repeat:
return ret;
}
+static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
+{
+ return __read_seqbegin(sl, NULL);
+}
+
/*
* Test if reader processed invalid data.
*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 8ad5d57..db84027 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -237,20 +237,28 @@ void getnstimeofday(struct timespec *ts)
EXPORT_SYMBOL(getnstimeofday);
+static DEFINE_PER_CPU(unsigned long, loopmax);
ktime_t ktime_get(void)
{
unsigned int seq;
s64 secs, nsecs;
+ unsigned long flags, loop = 0;
WARN_ON(timekeeping_suspended);
+ local_irq_save(flags);
do {
- seq = read_seqbegin(&xtime_lock);
+ seq = __read_seqbegin(&xtime_lock, &loop);
secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
nsecs += timekeeping_get_ns();
} while (read_seqretry(&xtime_lock, seq));
+ if (unlikely(loop > __this_cpu_read(loopmax))) {
+ this_cpu_write(loopmax, loop);
+ pr_err("cpu%d ktime_get() looped %lu times\n", smp_processor_id(), loop);
+ }
+ local_irq_restore(flags);
/*
* Use ktime_set/ktime_add_ns to create a proper ktime on
* 32-bit architectures without CONFIG_KTIME_SCALAR.
On Fri, 2011-05-06 at 19:42 +0200, Eric Dumazet wrote:
> On Fri, 2011-05-06 at 18:59 +0200, Andi Kleen wrote:
>
> > If you have a better way to make it faster please share it.
>
> Ideally we could use RCU :)
>
> Have whatever state hold in one structure (possibly big, it doesnt
> matter) and switch pointer in writer once everything is setup in new
> structure.
Yea. RCU is tough though, because we do need the hardware cycle value
that we use to be valid for the current interval stored in the
timekeeper structure (i.e. the cycle value can't be before the current
cycle_last, and it can't be greater than the next cycle_last).
The reason is that, should a frequency correction occur, you might
end up applying the old frequency to a longer interval than desired,
which could cause small timekeeping inconsistencies.
So we really do need a way to ensure that gettimeofday calls fall on one
side or the other of the accumulation loop. We may be able to tighten
that window some, but I don't know if RCU will work.
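In other words the read side has to look roughly like this (the field
names are from memory / for illustration, not necessarily the exact ones
in timekeeping.c):

        static s64 snapshot_ns_example(struct timespec *base)
        {
                cycle_t cycles, last;
                u32 mult, shift;
                unsigned seq;

                do {
                        seq = read_seqbegin(&xtime_lock);
                        /* all of this must come from the same interval ... */
                        *base = xtime;          /* wall time at cycle_last */
                        last  = timekeeper.clock->cycle_last;
                        mult  = timekeeper.mult;
                        shift = timekeeper.shift;
                        /* ... including the clocksource read itself */
                        cycles = timekeeper.clock->read(timekeeper.clock);
                } while (read_seqretry(&xtime_lock, seq));

                return ((s64)(cycles - last) * mult) >> shift;
        }

If the retry loop didn't also cover the clocksource read, an accumulation
could slip in between fetching mult/cycle_last and reading the counter, and
we'd apply the old mult to cycles that already belong to the next interval
(or get a cycle value below the new cycle_last).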
thanks
-john
On Fri, 2011-05-06 at 22:04 +0200, Eric Dumazet wrote:
> On Fri, 2011-05-06 at 21:26 +0200, Eric Dumazet wrote:
> > On Fri, 2011-05-06 at 19:50 +0200, Andi Kleen wrote:
> > > On Fri, May 06, 2011 at 07:42:47PM +0200, Eric Dumazet wrote:
> > > > On Fri, 2011-05-06 at 18:59 +0200, Andi Kleen wrote:
> > > >
> > > > > If you have a better way to make it faster please share it.
> > > >
> > > > Ideally we could use RCU :)
> > >
> > > Hmm, I didn't think my case had a lot of loops in the seqlock -- just
> > > expensive cache misses -- but I should double check.
> > >
> > > For the lots of loop case we probably need to understand first why you
> > > iterate that often.
> >
> > Yep, I'll try to investigate on this
> >
>
> So apparently some calls to tick_do_update_jiffies64() are pretty
> expensive :
So would the easier solution be to just break out timekeeper locking
from the xtime_lock?
So basically we would just add a timekeeper.lock seqlock and use it to
protect only the timekeeping code? We can still keep xtime_lock around
for the tick/jiffies protection (well, until tglx kills jiffies :), but
gettimeofday and friends wouldn't be blocked for so long.
That should be pretty straightforward now that the timekeeper data is
completely static to timekeeping.c.
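A rough sketch of what such a split might look like (hypothetical names, not
an actual patch): timekeeping.c would own a private seqlock for the
timekeeper data, while xtime_lock keeps covering the jiffies/tick state.

/* hypothetical: private lock for the timekeeper data in timekeeping.c */
static DEFINE_SEQLOCK(timekeeper_lock);

/* Read side spins only while the timekeeper data is actually being
 * updated, not while jiffies/tick bookkeeping holds xtime_lock. */
void sketch_getnstimeofday(struct timespec *ts)
{
	unsigned long seq;
	s64 nsecs;

	do {
		seq = read_seqbegin(&timekeeper_lock);
		*ts = xtime;
		nsecs = timekeeping_get_ns();
	} while (read_seqretry(&timekeeper_lock, seq));

	timespec_add_ns(ts, nsecs);
}

/* The write side (update_wall_time and friends) would take
 * write_seqlock(&timekeeper_lock) only around the timekeeper update. */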
thanks
-john
On Fri, 2011-05-06 at 13:24 -0700, john stultz wrote:
> So would the easier solution be to just break out timekeeper locking
> from the xtime_lock?
>
> So basically we would just add a timekeeper.lock seqlock and use it to
> protect only the timekeeping code? We can still keep xtime_lock around
> for the tick/jiffies protection (well, until tglx kills jiffies :), but
> gettimeofday and friends wouldn't be blocked for so long.
>
> That should be pretty straightforward now that the timekeeper data is
> completely static to timekeeping.c.
>
Yes :)
I can see many cpus entering tick_do_update_jiffies64() and all are
calling write_seqlock(&xtime_lock);
Only first one can perform the work, but all others are waiting on the
spinlock, get it, change seqcount, and realize they have nothing to
do...
Meanwhile, a reader must wait until all the writers have finished, because of
the storm of seqcount changes.
The following patch helps. Of course, we should still find out why so many
cpus (on my 8-cpu machine!) are calling tick_do_update_jiffies64() at the
same time...
This is basically what I said in my first mail :
Separate logical sections to reduce windows where readers are blocked/spinning.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index d5097c4..251b2fe 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -56,7 +56,7 @@ static void tick_do_update_jiffies64(ktime_t now)
return;
/* Reevalute with xtime_lock held */
- write_seqlock(&xtime_lock);
+ spin_lock(&xtime_lock.lock);
delta = ktime_sub(now, last_jiffies_update);
if (delta.tv64 >= tick_period.tv64) {
@@ -74,12 +74,15 @@ static void tick_do_update_jiffies64(ktime_t now)
last_jiffies_update = ktime_add_ns(last_jiffies_update,
incr * ticks);
}
+ xtime_lock.sequence++;
+ smp_wmb();
do_timer(++ticks);
-
+ smp_wmb();
+ xtime_lock.sequence++;
/* Keep the tick_next_period variable up to date */
tick_next_period = ktime_add(last_jiffies_update, tick_period);
}
- write_sequnlock(&xtime_lock);
+ spin_unlock(&xtime_lock.lock);
}
/*
On Sat, 2011-05-07 at 00:30 +0200, Eric Dumazet wrote:
> On Fri, 2011-05-06 at 13:24 -0700, john stultz wrote:
>
> > So would the easier solution be to just break out timekeeper locking
> > from the xtime_lock?
> >
> > So basically we would just add a timekeeper.lock seqlock and use it to
> > protect only the timekeeping code? We can still keep xtime_lock around
> > for the tick/jiffies protection (well, until tglx kills jiffies :), but
> > gettimeofday and friends wouldn't be blocked for so long.
> >
> > That should be pretty straightforward now that the timekeeper data is
> > completely static to timekeeping.c.
> >
>
> Yes :)
>
> I can see many cpus entering tick_do_update_jiffies64() and all are
> calling write_seqlock(&xtime_lock);
>
> Only first one can perform the work, but all others are waiting on the
> spinlock, get it, change seqcount, and realize they have nothing to
> do...
Huh. So who is calling tick_do_update_jiffies64 in your case? I know the
sched_tick_timer and tick_nohz_handler check to make sure
tick_do_timer_cpu == cpu to avoid exactly the thundering herd problem
on the jiffies update.
There's other spots that call tick_do_update_jiffies64, but I thought
those were more rare. So there may be something else wrong going on
here.
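For reference, the guard being referred to looks roughly like this
(paraphrased and simplified from the tick handlers, not an exact copy):

static void sketch_tick(ktime_t now)
{
	int cpu = smp_processor_id();

	/* Only the CPU that currently owns the do_timer duty advances
	 * jiffies from its tick; everyone else skips the update, which is
	 * what normally prevents a thundering herd on xtime_lock. */
	if (tick_do_timer_cpu == cpu)
		tick_do_update_jiffies64(now);

	/* per-cpu tick work (accounting, profiling, ...) continues here */
}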
> Meanwhile, a reader must wait until all the writers have finished, because of
> the storm of seqcount changes.
>
> The following patch helps. Of course, we should still find out why so many
> cpus (on my 8-cpu machine!) are calling tick_do_update_jiffies64() at the
> same time...
>
>
> This is basically what I said in my first mail :
>
> Separate logical sections to reduce windows where readers are blocked/spinning.
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index d5097c4..251b2fe 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -56,7 +56,7 @@ static void tick_do_update_jiffies64(ktime_t now)
> return;
>
> /* Reevalute with xtime_lock held */
> - write_seqlock(&xtime_lock);
> + spin_lock(&xtime_lock.lock);
Oof.. No, this is too ugly and really just abuses the seqlock structure.
If you really want to untangle what xtime_lock protects, you need to
introduce a new lock (I suggest in the timekeeper structure) to protect
the timekeeping data.
Then we can refine xtime_lock to also just protect the jiffies/tick
management bits as well if needed.
thanks
-john
On Fri, 2011-05-06 at 15:46 -0700, john stultz wrote:
> On Sat, 2011-05-07 at 00:30 +0200, Eric Dumazet wrote:
> > On Fri, 2011-05-06 at 13:24 -0700, john stultz wrote:
> >
> > > So would the easier solution be to just break out timekeeper locking
> > > from the xtime_lock?
> > >
> > > So basically we would just add a timekeeper.lock seqlock and use it to
> > > protect only the timekeeping code? We can still keep xtime_lock around
> > > for the tick/jiffies protection (well, until tglx kills jiffies :), but
> > > gettimeofday and friends wouldn't be blocked for so long.
> > >
> > > That should be pretty straightforward now that the timekeeper data is
> > > completely static to timekeeping.c.
> > >
> >
> > Yes :)
> >
> > I can see many cpus entering tick_do_update_jiffies64() and all are
> > calling write_seqlock(&xtime_lock);
> >
> > Only first one can perform the work, but all others are waiting on the
> > spinlock, get it, change seqcount, and realize they have nothing to
> > do...
>
> Huh. So who is calling tick_do_update_jiffies64 in your case? I know the
> sched_tick_timer and tick_nohz_handler check to make sure
> tick_do_timer_cpu == cpu to avoid exactly the thundering herd problem
> on the jiffies update.
>
> There's other spots that call tick_do_update_jiffies64, but I thought
> those were more rare. So there may be something else wrong going on
> here.
>
That I can answer :
echo function > /sys/kernel/debug/tracing/current_tracer
echo "do_timer calc_global_load second_overflow ktime_divns do_timestamp1 do_timestamp2 tick_do_update_jiffies64" >/sys/kernel/debug/tracing/set_ftrace_filter
echo 1 > /sys/kernel/debug/tracing/tracing_enabled
usleep 2000000;
echo 0 > /sys/kernel/debug/tracing/tracing_enabled
cat /sys/kernel/debug/tracing/trace >TRACE
(I added do_timestamp1/do_timestamp2) after/before write_seqlock()/write_sequnlock()
<idle>-0 [003] 920.355377: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [006] 920.355377: tick_do_update_jiffies64 <-tick_sched_timer
<idle>-0 [003] 920.355378: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [000] 920.355657: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.355660: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [006] 920.355663: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [006] 920.355666: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [003] 920.355781: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [003] 920.355783: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [004] 920.355796: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.355797: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [000] 920.355808: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [004] 920.355933: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.355934: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.355935: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [000] 920.355935: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [006] 920.356093: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [006] 920.356095: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [006] 920.356198: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.356202: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.356211: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [002] 920.356319: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [002] 920.356321: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [000] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [006] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [002] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [003] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.356373: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [004] 920.356373: do_timer <-tick_do_update_jiffies64
<idle>-0 [004] 920.356374: calc_global_load <-do_timer
<idle>-0 [004] 920.356374: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [006] 920.356375: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [006] 920.356376: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [000] 920.356377: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [000] 920.356377: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [002] 920.356379: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [002] 920.356379: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [003] 920.356380: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [003] 920.356380: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [002] 920.356515: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [002] 920.356517: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [002] 920.356607: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [002] 920.356608: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [000] 920.356716: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.356718: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [006] 920.356720: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [006] 920.356723: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [004] 920.356808: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.356810: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
>
> > Meanwhile, a reader must wait until all the writers have finished, because of
> > the storm of seqcount changes.
> >
> > The following patch helps. Of course, we should still find out why so many
> > cpus (on my 8-cpu machine!) are calling tick_do_update_jiffies64() at the
> > same time...
> >
> >
> > This is basically what I said in my first mail :
> >
> > Separate logical sections to reduce windows where readers are blocked/spinning.
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index d5097c4..251b2fe 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -56,7 +56,7 @@ static void tick_do_update_jiffies64(ktime_t now)
> > return;
> >
> > /* Reevalute with xtime_lock held */
> > - write_seqlock(&xtime_lock);
> > + spin_lock(&xtime_lock.lock);
>
> Oof.. No, this is too ugly and really just abuses the seqlock structure.
>
That was a hack/POC, of course, but we can clean up seqlock.h to provide a
clean abstraction. A seqlock_t should include a seqcount_t and a spinlock_t.
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index e98cd2e..afc00f4 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -29,8 +29,12 @@
#include <linux/spinlock.h>
#include <linux/preempt.h>
-typedef struct {
+typedef struct seqcount {
unsigned sequence;
+} seqcount_t;
+
+typedef struct {
+ seqcount_t seq;
spinlock_t lock;
} seqlock_t;
@@ -53,6 +57,22 @@ typedef struct {
#define DEFINE_SEQLOCK(x) \
seqlock_t x = __SEQLOCK_UNLOCKED(x)
+/*
+ * Sequence counter only version assumes that callers are using their
+ * own mutexing.
+ */
+static inline void write_seqcount_begin(seqcount_t *s)
+{
+ s->sequence++;
+ smp_wmb();
+}
+
+static inline void write_seqcount_end(seqcount_t *s)
+{
+ smp_wmb();
+ s->sequence++;
+}
+
/* Lock out other writers and update the count.
* Acts like a normal spin_lock/unlock.
* Don't need preempt_disable() because that is in the spin_lock already.
@@ -60,14 +80,12 @@ typedef struct {
static inline void write_seqlock(seqlock_t *sl)
{
spin_lock(&sl->lock);
- ++sl->sequence;
- smp_wmb();
+ write_seqcount_begin(&sl->seq);
}
static inline void write_sequnlock(seqlock_t *sl)
{
- smp_wmb();
- sl->sequence++;
+ write_seqcount_end(&sl->seq);
spin_unlock(&sl->lock);
}
@@ -75,10 +93,9 @@ static inline int write_tryseqlock(seqlock_t *sl)
{
int ret = spin_trylock(&sl->lock);
- if (ret) {
- ++sl->sequence;
- smp_wmb();
- }
+ if (ret)
+ write_seqcount_begin(&sl->seq);
+
return ret;
}
@@ -88,7 +105,7 @@ static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
unsigned ret;
repeat:
- ret = sl->sequence;
+ ret = sl->seq.sequence;
smp_rmb();
if (unlikely(ret & 1)) {
cpu_relax();
@@ -107,7 +124,7 @@ static __always_inline int read_seqretry(const seqlock_t *sl, unsigned start)
{
smp_rmb();
- return unlikely(sl->sequence != start);
+ return unlikely(sl->seq.sequence != start);
}
@@ -118,10 +135,6 @@ static __always_inline int read_seqretry(const seqlock_t *sl, unsigned start)
* after the write_seqcount_end().
*/
-typedef struct seqcount {
- unsigned sequence;
-} seqcount_t;
-
#define SEQCNT_ZERO { 0 }
#define seqcount_init(x) do { *(x) = (seqcount_t) SEQCNT_ZERO; } while (0)
@@ -204,22 +217,6 @@ static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
}
-/*
- * Sequence counter only version assumes that callers are using their
- * own mutexing.
- */
-static inline void write_seqcount_begin(seqcount_t *s)
-{
- s->sequence++;
- smp_wmb();
-}
-
-static inline void write_seqcount_end(seqcount_t *s)
-{
- smp_wmb();
- s->sequence++;
-}
-
/**
* write_seqcount_barrier - invalidate in-progress read-side seq operations
* @s: pointer to seqcount_t
> If you really want to untangle what xtime_lock protects, you need to
> introduce a new lock (I suggest in the timekeeper structure) to protect
> the timekeeping data.
>
> Then we can refine xtime_lock to also just protect the jiffies/tick
> management bits as well if needed.
For the moment I am trying to understand the code. It's quite complex and
uses a monolithic seqlock, defeating seqlock power.
That was really my initial point, if you remember my mail...
On Sat, 2011-05-07 at 01:00 +0200, Eric Dumazet wrote:
> On Fri, 2011-05-06 at 15:46 -0700, john stultz wrote:
> > On Sat, 2011-05-07 at 00:30 +0200, Eric Dumazet wrote:
> > > I can see many cpus entering tick_do_update_jiffies64() and all are
> > > calling write_seqlock(&xtime_lock);
> > >
> > > Only first one can perform the work, but all others are waiting on the
> > > spinlock, get it, change seqcount, and realize they have nothing to
> > > do...
> >
> > Huh. So who is calling tick_do_update_jiffies64 in your case? I know the
> > sched_tick_timer and tick_nohz_handler check to make sure
> > tick_do_timer_cpu == cpu to avoid exactly the thundering herd problem
> > on the jiffies update.
> >
> > There's other spots that call tick_do_update_jiffies64, but I thought
> > those were more rare. So there may be something else wrong going on
> > here.
> >
>
> That I can answer :
[snip]
> (I added do_timestamp1/do_timestamp2) after/before write_seqlock()/write_sequnlock()
>
> <idle>-0 [003] 920.355377: do_timestamp1 <-tick_do_update_jiffies64
> <idle>-0 [006] 920.355377: tick_do_update_jiffies64 <-tick_sched_timer
> <idle>-0 [003] 920.355378: do_timestamp2 <-tick_do_update_jiffies64
> <idle>-0 [000] 920.355657: tick_do_update_jiffies64 <-tick_check_idle
> <idle>-0 [000] 920.355660: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
Thomas, any clues why this would be getting hammered?
> > > /* Reevalute with xtime_lock held */
> > > - write_seqlock(&xtime_lock);
> > > + spin_lock(&xtime_lock.lock);
> >
> > Oof.. No, this is too ugly and really just abuses the seqlock structure.
> >
>
> That was a hack/POC, of course, but we can clean up seqlock.h to provide a
> clean abstraction. A seqlock_t should include a seqcount_t and a spinlock_t.
Maybe I'm misunderstanding, but you seem to be trying to create some
sort of a layered lock from the seqlock? I don't quite understand why
you're proposing this instead of actually splitting the lock out?
> > If you really want to untangle what xtime_lock protects, you need to
> > introduce a new lock (I suggest in the timekeeper structure) to protect
> > the timekeeping data.
> >
> > Then we can refine xtime_lock to also just protect the jiffies/tick
> > management bits as well if needed.
>
> For the moment I am trying to understand the code. It's quite complex and
> uses a monolithic seqlock, defeating seqlock power.
Defeating seqlock power? My thoughts are that seqlocks are nice
lightweight reader/writer locks that avoid writer starvation. You seem
to be trying to redefine or extend them to be something else which is
more subtle.
I agree, the code is complex! I'm just not sure adding more
complicated/subtle locking mechanisms is a good solution. Instead I'd
suggest simply splitting up the locks (by using new locks) to reduce the
amount of data that is being protected by a single lock.
But again, maybe I'm misunderstanding you?
thanks
-john
On Fri, 2011-05-06 at 16:28 -0700, john stultz wrote:
> Defeating seqlock power? My thoughts are that seqlocks are nice
> lightweight reader/writer locks that avoid writer starvation. You seem
> to be trying to redefine or extend them to be something else which is
> more subtle.
>
All I am trying to explain is that a seqlock is a compound of two
things: one spinlock to synchronize writers among themselves, and one
seqcount to synchronize readers with a writer.
But the API only exposes the compound operation. The writer always goes
through the full compound locking, while it would be nice to be able to
separate the two steps. Note that because write_seqlock() and
write_sequnlock() are inlined, this would not increase text size.
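A sketch of that two-step write side, assuming the seqlock_t embeds a
seqcount_t as in the cleanup patch above (illustrative only):

static void sketch_writer(seqlock_t *sl)
{
	spin_lock(&sl->lock);		/* serialize against other writers */

	/* Writer-only bookkeeping can happen here; readers are not
	 * forced to retry yet. */

	write_seqcount_begin(&sl->seq);
	/* update only the fields the read side actually samples */
	write_seqcount_end(&sl->seq);

	/* More writer-only work; readers already complete again. */

	spin_unlock(&sl->lock);
}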
Another problem is that the spinlock is in the same cache line as the
seqcount, while a reader doesn't need the spinlock.
> I agree, the code is complex! I'm just not sure adding more
> complicated/subtle locking mechanisms is a good solution. Instead I'd
> suggest simply splitting up the locks (by using new locks) to reduce the
> amount of data that is being protected by a single lock.
>
> But again, maybe I'm misunderstanding you?
Adding locks might be fine, I really don't know yet. It adds yet
another piece of memory to be dirtied, and the total number of cache lines
to be dirtied determines the latency.
I would like a ktime_get() service that is as fast as possible, with
bounded latencies.
Typical workloads need to call it hundreds of thousands of times per second.
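A minimal userspace sketch for putting a rough number on that per-call cost
(clock_gettime(CLOCK_MONOTONIC) goes through the same timekeeping machinery
via the vDSO or syscall path; file name and iteration count are arbitrary):

/* build with: gcc -O2 clkbench.c -o clkbench -lrt */
#include <stdio.h>
#include <time.h>

int main(void)
{
	const long n = 10 * 1000 * 1000;
	struct timespec t0, t1, tmp;
	double ns;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++)
		clock_gettime(CLOCK_MONOTONIC, &tmp);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per clock_gettime() call\n", ns / n);
	return 0;
}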
Hi Eric,
> > Defeating seqlock power? My thoughts are that seqlocks are nice
> > lightweight reader/writer locks that avoid writer starvation. You seem
> > to be trying to redefine or extend them to be something else which is
> > more subtle.
> >
>
> All I am trying to explain is that a seqlock is a compound of two
> things: one spinlock to synchronize writers among themselves, and one
> seqcount to synchronize readers with a writer.
>
> But the API only exposes the compound operation. The writer always goes
> through the full compound locking, while it would be nice to be able to
> separate the two steps. Note that because write_seqlock() and
> write_sequnlock() are inlined, this would not increase text size.
The separate writer and reader idea also came up in the input
subsystem about a year ago, to address the one-writer-many-readers
situation. The patch went a few rounds on lkml
(https://lkml.org/lkml/2010/6/20/87), but the final version included
below got dropped because of an unrelated problem with the (evdev)
use case. Perhaps it is useful in this setting?
Cheers,
Henrik
From bd26d8972130978deb9056a7a6da30b5c049914f Mon Sep 17 00:00:00 2001
From: Henrik Rydberg <[email protected]>
Date: Fri, 25 Jun 2010 01:59:16 +0200
Subject: [PATCH] Introduce mrbuf, a multi-reader buffer mechanism
In spite of the many lock patterns and fifo helpers in the kernel, the
case of a single writer feeding many readers via a circular event
buffer does not seem to be covered. This patch adds mrbuf, a mechanism
for handling multiple concurrent read positions in a shared circular
buffer. Under normal operation, given adequate buffer size, the
operation is lock-less.
---
include/linux/mrbuf.h | 228 +++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 228 insertions(+), 0 deletions(-)
create mode 100644 include/linux/mrbuf.h
diff --git a/include/linux/mrbuf.h b/include/linux/mrbuf.h
new file mode 100644
index 0000000..86ab656
--- /dev/null
+++ b/include/linux/mrbuf.h
@@ -0,0 +1,228 @@
+#ifndef _MRBUF_H
+#define _MRBUF_H
+/*
+ * Multi-reader buffer lock for single writer, concurrent readers
+ *
+ * Copyright (c) 2010 Henrik Rydberg
+ *
+ * The mrbuf is a mechanism for handling multiple concurrent read
+ * positions in a shared circular buffer. The mrbuf does not handle
+ * the actual buffer, which can therefore be of any type.
+ *
+ * The writer does not block, but overwrites older events as the
+ * buffer gets full. Writing is performed in two stages; writing the
+ * data and telling the readers about it (syncing).
+ *
+ * The readers all read from the same shared buffer, but at individual
+ * positions. Reading from the position currently being written to
+ * will cause the reader to retry until the read is coherent. During
+ * normal operation, with sufficient buffer size, the mechanism is
+ * practically lock-less.
+ *
+ * The mrbuf is particularly suitable for input event buffers, where
+ * the single writer must not be starved, and where there are several
+ * threads reading the same data.
+ *
+ * Multi-reader buffer locks are similar to seqlocks, but while
+ * seqlocks have a finite retry frequency, mrbufs virtually never
+ * retry. Like seqlocks, mrbufs are not very cache friendly, and
+ * require the buffer to be valid in all threads.
+ *
+ * Multiple writers and re-entrant readers require additional locking.
+ *
+ * Event queue writer example. Readers are synchronized individually.
+ *
+ * mrbuf_write_lock(&writer);
+ *
+ * buffer[mrbuf_write_at(&writer)] = event_to_write;
+ *
+ * mrbuf_write_unlock(&writer);
+ *
+ * for (i = 0; i < NUM_READERS; i++)
+ * mrbuf_write_sync(&writer, &client[i]->reader);
+ *
+ * Event queue reader example, reading all available events.
+ *
+ * while (!mrbuf_read_empty(&reader)) {
+ * do {
+ * mrbuf_read_try(&reader, &writer);
+ * event_to_read = buffer[mrbuf_read_at(&reader, &writer)];
+ * } while (mrbuf_read_retry(&reader, &writer));
+ * }
+ *
+ */
+
+/**
+ * struct mrbuf_writer - mrbuf writer
+ * @page: the bits of the circular buffer (page = size - 1)
+ * @head: the current writer head
+ * @next: the head to take effect after the lock
+ *
+ * The buffer size must be a power of two, such that page is the set
+ * of bits in which the buffer positions live.
+ *
+ */
+struct mrbuf_writer {
+ unsigned int page;
+ unsigned int head;
+ unsigned int next;
+};
+
+/**
+ * struct mrbuf_reader - mrbuf reader
+ * @last: the last write head seen before trying
+ * @head: the current reader head
+ * @tail: the current reader tail
+ */
+struct mrbuf_reader {
+ unsigned int last;
+ unsigned int head;
+ unsigned int tail;
+};
+
+/**
+ * mrbuf_write_init - initialize writer
+ * @bw: the mrbuf_writer to initialize
+ * @size: the size of the buffer (must be power of two)
+ */
+static inline void mrbuf_write_init(struct mrbuf_writer *bw, unsigned int size)
+{
+ bw->page = size - 1;
+ bw->head = 0;
+ bw->next = 0;
+}
+
+/**
+ * mrbuf_write_lock - lock to write one event
+ * @bw: the mrbuf_writer to lock
+ *
+ * This method prepares to write one event. The write lock does not
+ * sleep and is thus suitable for use in atomic contexts.
+ *
+ */
+static inline void mrbuf_write_lock(struct mrbuf_writer *bw)
+{
+ ++bw->next;
+ smp_wmb();
+}
+
+/**
+ * mrbuf_write_at - return position to write to
+ * @bw: the mrbuf_writer keeping track of the write position
+ *
+ * Returns the buffer position to write to. Must be called from within
+ * the write lock.
+ *
+ */
+static inline unsigned int mrbuf_write_at(const struct mrbuf_writer *bw)
+{
+ return bw->head & bw->page;
+}
+
+/**
+ * mrbuf_write_unlock - unlock writing
+ * @bw: the mrbuf_writer to unlock
+ */
+static inline void mrbuf_write_unlock(struct mrbuf_writer *bw)
+{
+ smp_wmb();
+ ++bw->head;
+}
+
+/**
+ * mrbuf_write_sync - synchronize reader with current writer
+ * @bw: the mrbuf_writer to sync with
+ * @br: the mrbuf_reader to sync
+ *
+ * Synchronize the reader head with the writer head, effectively
+ * telling the reader thread that there is new data to read.
+ *
+ */
+static inline void mrbuf_write_sync(const struct mrbuf_writer *bw,
+ struct mrbuf_reader *br)
+{
+ br->head = bw->head;
+}
+
+/**
+ * mrbuf_read_init - initialize reader
+ * @br: the mrbuf_reader to initialize
+ * @head: the initial reader head
+ * @tail: the initial reader tail
+ *
+ * This function must be called while mrbuf_write_sync() and
+ * mrbuf_read_empty() are out of reach, since the reader state is updated
+ * without coherence guarantees.
+ */
+static inline void mrbuf_read_init(struct mrbuf_reader *br,
+ unsigned int head, unsigned int tail)
+{
+ br->head = head;
+ br->tail = tail;
+ smp_wmb();
+}
+
+/**
+ * mrbuf_read_empty - true if reader is empty
+ * @br: the mrbuf_reader to check
+ */
+static inline bool mrbuf_read_empty(const struct mrbuf_reader *br)
+{
+ return br->head == br->tail;
+}
+
+/**
+ * mrbuf_read_try - try to read data
+ * @br: the mrbuf_reader to try to read from
+ * @bw: the mrbuf_writer keeping track of the write position
+ *
+ * Prepare to read from buffer. The reader must be non-empty before
+ * trying to read from it. If the reader has fallen behind, it catches
+ * up by jumping to the last page being written to.
+ *
+ */
+static inline void mrbuf_read_try(struct mrbuf_reader *br,
+ const struct mrbuf_writer *bw)
+{
+ br->last = bw->head;
+ smp_rmb();
+ br->tail += (br->head - 1 - br->tail) & ~bw->page;
+}
+
+/**
+ * mrbuf_read_at - return position to read from
+ * @br: the mrbuf_reader keeping track of the read position
+ * @bw: the mrbuf_writer keeping track of the buffer size
+ *
+ * Returns the buffer position to read from. Must be called from
+ * within the read try block.
+ *
+ */
+static inline unsigned int mrbuf_read_at(const struct mrbuf_reader *br,
+ const struct mrbuf_writer *bw)
+{
+ return br->tail & bw->page;
+}
+
+/**
+ * mrbuf_read_retry - try to finish read
+ * @br: the mrbuf_reader to try to finish
+ * @bw: the mrbuf_writer keeping track of the write position
+ *
+ * Try to finish the read. Returns false if successful. Otherwise, the
+ * read was incoherent and one must mrbuf_read_try() again. During
+ * normal operation, with adequate buffer size, this method will
+ * always return false.
+ *
+ */
+static inline bool __must_check mrbuf_read_retry(struct mrbuf_reader *br,
+ const struct mrbuf_writer *bw)
+{
+ smp_rmb();
+ if (unlikely(((br->tail - br->last) & bw->page) < bw->next - br->last))
+ return true;
+ ++br->tail;
+ return false;
+}
+
+#endif /* _MRBUF_H */
--
1.6.3.3
* Andi Kleen <[email protected]> wrote:
> > Then precise measurements have to be done on the source of cache misses, the
> > total cost of the timer interrupt, etc.
>
> What we did was profiling the remote node cache misses
> using raw offcore events, and the xtime code was high up.
> I cannot share exact numbers for various reasons, but it was
> an improvement on a hard to improve workload.
You could certainly share the profiling commands you typed, so that
others can reproduce and double check your results.
Thanks,
Ingo
On Fri, 6 May 2011, john stultz wrote:
> On Sat, 2011-05-07 at 01:00 +0200, Eric Dumazet wrote:
> > Le vendredi 06 mai 2011 à 15:46 -0700, john stultz a écrit :
> > > On Sat, 2011-05-07 at 00:30 +0200, Eric Dumazet wrote:
> > > > I can see many cpus entering tick_do_update_jiffies64() and all are
> > > > calling write_seqlock(&xtime_lock);
> > > >
> > > > Only first one can perform the work, but all others are waiting on the
> > > > spinlock, get it, change seqcount, and realize they have nothing to
> > > > do...
> > >
> > > Huh. So who is calling tick_do_update_jiffies64 in your case? I know the
> > > sched_tick_timer and tick_nohz_handler check to make sure
> > > tick_do_timer_cpu == cpu to avoid exactly the thundering herd problem
> > > on the jiffies update.
> > >
> > > There's other spots that call tick_do_update_jiffies64, but I thought
> > > those were more rare. So there may be something else wrong going on
> > > here.
> > >
> >
> > That I can answer :
> [snip]
> > (I added do_timestamp1/do_timestamp2) after/before write_seqlock()/write_sequnlock()
> >
> > <idle>-0 [003] 920.355377: do_timestamp1 <-tick_do_update_jiffies64
> > <idle>-0 [006] 920.355377: tick_do_update_jiffies64 <-tick_sched_timer
> > <idle>-0 [003] 920.355378: do_timestamp2 <-tick_do_update_jiffies64
> > <idle>-0 [000] 920.355657: tick_do_update_jiffies64 <-tick_check_idle
> > <idle>-0 [000] 920.355660: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
>
> Thomas, any clues why this would be getting hammered?
Hmm, the tick-sched code grew quite a few unconditional call sites, and
I'm not sure whether they are correct.
Thanks,
tglx
Move the smp_rmb after the cpu_relax loop in read_seqbegin and add
ACCESS_ONCE to make sure the test and return are consistent.
A multi-threaded core in the lab didn't like the update
from 2.6.35 to 2.6.36, to the point it would hang during
boot when multiple threads were active. Bisection showed
af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867 (clockevents:
Remove the per cpu tick skew) as the culprit and it is
supported with stack traces showing xtime_lock waits including
tick_do_update_jiffies64 and/or update_vsyscall.
Experimentation showed the combination of cpu_relax and smp_rmb
was significantly slowing the progress of other threads sharing
the core, and this patch is effective in avoiding the hang.
A theory is that the rmb is affecting the whole core while the
cpu_relax is causing a resource rebalance flush; together they
cause an interference cadence that is unbroken when the seqlock
reader has interrupts disabled.
At first I was confused why the refactor in
3c22cd5709e8143444a6d08682a87f4c57902df3 (kernel: optimise
seqlock) didn't affect this patch application, but after some
study I realized it affected seqcount, not seqlock. The new seqcount was
not factored back into the seqlock; I defer that to the future.
While the removal of the timer interrupt offset created
contention for the xtime lock while a cpu does the
additional work to update the system clock, the seqlock
implementation with the tight rmb spin loop goes back much
further, and is just waiting for the right trigger.
Cc: <[email protected]>
Signed-off-by: Milton Miller <[email protected]>
---
To the readers of [RFC] time: xtime_lock is held too long:
I initially thought x86 would not see this because rmb would
be a nop, but upon closer inspection X86_PPRO_FENCE will add
a lfence for rmb.
milton
Index: common/include/linux/seqlock.h
===================================================================
--- common.orig/include/linux/seqlock.h 2011-04-06 03:27:02.000000000 -0500
+++ common/include/linux/seqlock.h 2011-04-06 03:35:02.000000000 -0500
@@ -88,12 +88,12 @@ static __always_inline unsigned read_seq
unsigned ret;
repeat:
- ret = sl->sequence;
- smp_rmb();
+ ret = ACCESS_ONCE(sl->sequence);
if (unlikely(ret & 1)) {
cpu_relax();
goto repeat;
}
+ smp_rmb();
return ret;
}
On Thu, 2011-05-12 at 04:13 -0500, Milton Miller wrote:
> Move the smp_rmb after the cpu_relax loop in read_seqbegin and add
> ACCESS_ONCE to make sure the test and return are consistent.
>
> A multi-threaded core in the lab didn't like the update
> from 2.6.35 to 2.6.36, to the point it would hang during
> boot when multiple threads were active. Bisection showed
> af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867 (clockevents:
> Remove the per cpu tick skew) as the culprit and it is
> supported with stack traces showing xtime_lock waits including
> tick_do_update_jiffies64 and/or update_vsyscall.
>
> Experimentation showed the combination of cpu_relax and smp_rmb
> was significantly slowing the progress of other threads sharing
> the core, and this patch is effective in avoiding the hang.
>
> A theory is that the rmb is affecting the whole core while the
> cpu_relax is causing a resource rebalance flush; together they
> cause an interference cadence that is unbroken when the seqlock
> reader has interrupts disabled.
>
> At first I was confused why the refactor in
> 3c22cd5709e8143444a6d08682a87f4c57902df3 (kernel: optimise
> seqlock) didn't affect this patch application, but after some
> study I realized it affected seqcount, not seqlock. The new seqcount was
> not factored back into the seqlock; I defer that to the future.
>
> While the removal of the timer interrupt offset created
> contention for the xtime lock while a cpu does the
> additional work to update the system clock, the seqlock
> implementation with the tight rmb spin loop goes back much
> further, and is just waiting for the right trigger.
>
> Cc: <[email protected]>
> Signed-off-by: Milton Miller <[email protected]>
> ---
>
> To the readers of [RFC] time: xtime_lock is held too long:
>
> I initially thought x86 would not see this because rmb would
> be a nop, but upon closer inspection X86_PPRO_FENCE will add
> a lfence for rmb.
>
> milton
>
> Index: common/include/linux/seqlock.h
> ===================================================================
> --- common.orig/include/linux/seqlock.h 2011-04-06 03:27:02.000000000 -0500
> +++ common/include/linux/seqlock.h 2011-04-06 03:35:02.000000000 -0500
> @@ -88,12 +88,12 @@ static __always_inline unsigned read_seq
> unsigned ret;
>
> repeat:
> - ret = sl->sequence;
> - smp_rmb();
> + ret = ACCESS_ONCE(sl->sequence);
> if (unlikely(ret & 1)) {
> cpu_relax();
> goto repeat;
> }
> + smp_rmb();
>
> return ret;
> }
I fully agree with your analysis. This calls for making the change I
suggested earlier [1] (use a seqcount object in seqlock_t):
typedef struct {
seqcount_t seq;
spinlock_t lock;
} seqlock_t;
I'll submit a patch for 2.6.40
Acked-by: Eric Dumazet <[email protected]>
Thanks
[1] Ref: https://lkml.org/lkml/2011/5/6/351
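With the seqcount embedded in seqlock_t, the read side would only touch
sl->seq; a sketch that also folds in Milton's fix (ACCESS_ONCE plus moving
the rmb out of the spin loop) could look like this -- illustrative, not a
submitted patch:

static __always_inline unsigned sketch_read_seqbegin(const seqlock_t *sl)
{
	unsigned ret;

repeat:
	ret = ACCESS_ONCE(sl->seq.sequence);
	if (unlikely(ret & 1)) {
		cpu_relax();
		goto repeat;
	}
	smp_rmb();

	return ret;
}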
On Thu, May 12, 2011 at 04:13:54AM -0500, Milton Miller wrote:
>
> Move the smp_rmb after the cpu_relax loop in read_seqbegin and add
> ACCESS_ONCE to make sure the test and return are consistent.
>
> A multi-threaded core in the lab didn't like the update
Which core was that?
-Andi