This patch
[email protected], 2003-12-30 15:40:23-08:00, [email protected]
[PATCH] ia32 jiffy wrapping fixes
Causes the voyager boot to hang. The problem is this change:
--- a/arch/i386/kernel/timers/timer_tsc.c Tue Jan 6 09:57:34 2004
+++ b/arch/i386/kernel/timers/timer_tsc.c Tue Jan 6 09:57:34 2004
@@ -141,7 +140,7 @@
#ifndef CONFIG_NUMA
if (!use_tsc)
#endif
- return (unsigned long long)jiffies * (1000000000 / HZ);
+ return (unsigned long long)get_jiffies_64() *
(1000000000 / HZ);
Apart from the fact (that I've whined about before) that this
sched_clock() function should be one of the timer function pointers, so
there isn't this CONFIG_NUMA dependence (unless I can also add a
CONFIG_X86_VOYAGER dependence to it as well), the problem seems to be
some type of bad resonance between the jiffies_64 update and the
xtime_lock in get_jiffies_64(). I think this may indicate that HZ needs
to be reduced to 100 on voyager; however, there is also no need to get
the xtime sequence lock every time we do a jiffies_64 read, since the
only unstable time is when we may be updating both halves of it
non-atomically. Thus, we only need the sequence lock when the bottom
half is zero. This should improve the fast path of get_jiffies_64() for
all x86 arch's.
James
===== kernel/time.c 1.18 vs edited =====
--- 1.18/kernel/time.c Wed Oct 22 00:09:54 2003
+++ edited/kernel/time.c Tue Jan 6 09:20:38 2004
@@ -422,13 +422,20 @@
#if (BITS_PER_LONG < 64)
u64 get_jiffies_64(void)
{
- unsigned long seq;
- u64 ret;
+ u64 ret = jiffies_64;
- do {
- seq = read_seqbegin(&xtime_lock);
+ /* We only have read problems when the lower 32 bits are zero
+ * indicating that we may be in the process of updating the upper
+ * 32 bits */
+ while (unlikely((jiffies_64 & 0xffffffffULL) == 0)) {
+ unsigned long seq = read_seqbegin(&xtime_lock);
+
+ rmb();
ret = jiffies_64;
- } while (read_seqretry(&xtime_lock, seq));
+ rmb();
+ if(!read_seqretry(&xtime_lock, seq))
+ break;
+ }
return ret;
}
James Bottomley <[email protected]> wrote:
>
> This patch
>
>
> [email protected], 2003-12-30 15:40:23-08:00, [email protected]
> [PATCH] ia32 jiffy wrapping fixes
>
> Causes the voyager boot to hang. The problem is this change:
>
> --- a/arch/i386/kernel/timers/timer_tsc.c Tue Jan 6 09:57:34 2004
> +++ b/arch/i386/kernel/timers/timer_tsc.c Tue Jan 6 09:57:34 2004
> @@ -141,7 +140,7 @@
> #ifndef CONFIG_NUMA
> if (!use_tsc)
> #endif
> - return (unsigned long long)jiffies * (1000000000 / HZ);
> + return (unsigned long long)get_jiffies_64() *
> (1000000000 / HZ);
Hm, OK. I hit the same deadlock when running with the "don't require TSCs
to be synchronised in sched_clock()" patch from -mm. The fix for that is
below. I shall accelerate it.
--- 25/arch/i386/kernel/timers/timer_tsc.c~sched_clock-2.6.0-A1-deadlock-fix 2003-12-30 00:45:09.000000000 -0800
+++ 25-akpm/arch/i386/kernel/timers/timer_tsc.c 2003-12-30 00:45:09.000000000 -0800
@@ -140,7 +140,8 @@ unsigned long long sched_clock(void)
#ifndef CONFIG_NUMA
if (!use_tsc)
#endif
- return (unsigned long long)get_jiffies_64() * (1000000000 / HZ);
+ /* jiffies might overflow but this is not a big deal here */
+ return (unsigned long long)jiffies * (1000000000 / HZ);
/* Read the Time Stamp Counter */
rdtscll(this_offset);
_
On Tue, 6 Jan 2004, James Bottomley wrote:
>
> however, there is also no need to get
> the xtime sequence lock every time we do a jiffies_64 read, since the
> only unstable time is when we may be updating both halves of it
> non-atomically. Thus, we only need the sequence lock when the bottom
> half is zero. This should improve the fast path of get_jiffies_64() for
> all x86 arch's.
This is wrong. There is nothing that guarantees that the read has read the
high bits first.
It might have read the low bits first (and gotten 0xffffffff), and then on
another CPU the contents were updated, and now the high bits are one
bigger, and when you read the high bits, you get a value that is off by a
_lot_.
And it's not just 0 and 0xffffffff in the low bits that can be problematic
either: if the CPU that does the "get_jiffies_64()" gets an interrupt, the
thing may be off by more than a count of one.
IF this optimization is really worth it, the code should be something like
this:
#define JIFFY_SLOP (3)
u64 ret = jiffies_64;
u32 low32 = ret;
if ((low32+JIFFY_SLOP) <= JIFFY_SLOP*2) {
... do the seqlock thing ...
}
instead.
Which still avoids the read-lock about 99.9999% of the time, so it may
well be worth it. But somebody should double-check my logic.
Linus
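For concreteness, Linus's fragment filled in as a complete function might look
like the following (a hypothetical, untested sketch; the elided seqlock
fallback is assumed to be the existing read_seqbegin()/read_seqretry() loop):

	#define JIFFY_SLOP (3)

	u64 get_jiffies_64(void)
	{
		u64 ret = jiffies_64;
		u32 low32 = ret;

		/* The unsigned test catches low words within JIFFY_SLOP of
		 * a 32-bit rollover, both just below 0xffffffff and just
		 * above zero; only then is the seqlock needed. */
		if ((low32 + JIFFY_SLOP) <= JIFFY_SLOP * 2) {
			unsigned long seq;

			do {
				seq = read_seqbegin(&xtime_lock);
				ret = jiffies_64;
			} while (read_seqretry(&xtime_lock, seq));
		}
		return ret;
	}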
On Tue, 6 Jan 2004, Andrew Morton wrote:
>
> Hm, OK. I hit the same deadlock when running with the "don't require TSCs
> to be synchronised in sched_clock()" patch from -mm. The fix for that is
> below. I shall accelerate it.
>
> --- 25/arch/i386/kernel/timers/timer_tsc.c~sched_clock-2.6.0-A1-deadlock-fix 2003-12-30 00:45:09.000000000 -0800
> +++ 25-akpm/arch/i386/kernel/timers/timer_tsc.c 2003-12-30 00:45:09.000000000 -0800
> @@ -140,7 +140,8 @@ unsigned long long sched_clock(void)
> #ifndef CONFIG_NUMA
> if (!use_tsc)
> #endif
> - return (unsigned long long)get_jiffies_64() * (1000000000 / HZ);
> + /* jiffies might overflow but this is not a big deal here */
> + return (unsigned long long)jiffies * (1000000000 / HZ);
Augh. If you cast it to "unsigned long long" anyway, why not just use the
right value? It's "jiffies_64".
It has other problems, of course, but at least that makes them slightly
less.
Linus
On Tue, 2004-01-06 at 10:19, Andrew Morton wrote:
> Hm, OK. I hit the same deadlock when running with the "don't require TSCs
> to be synchronised in sched_clock()" patch from -mm. The fix for that is
> below. I shall accelerate it.
Actually, I think we need to know why this is happening, since the use
of these sequence locks is growing. On voyager I just put it down to HZ
== 1000 being a bit much for my old pentium 66MHz processors, but if
you've seen it on a much faster processor, that would tend to indicate
there's some other problem at work here.
James
On Tue, 6 Jan 2004, Andrew Morton wrote:
> James Bottomley <[email protected]> wrote:
> >
> > This patch
> >
> >
> > [email protected], 2003-12-30 15:40:23-08:00, [email protected]
> > [PATCH] ia32 jiffy wrapping fixes
> >
> > Causes the voyager boot to hang. The problem is this change:
> >
> > --- a/arch/i386/kernel/timers/timer_tsc.c Tue Jan 6 09:57:34 2004
> > +++ b/arch/i386/kernel/timers/timer_tsc.c Tue Jan 6 09:57:34 2004
> > @@ -141,7 +140,7 @@
> > #ifndef CONFIG_NUMA
> > if (!use_tsc)
> > #endif
> > - return (unsigned long long)jiffies * (1000000000 / HZ);
> > + return (unsigned long long)get_jiffies_64() *
> > (1000000000 / HZ);
>
> Hm, OK. I hit the same deadlock when running with the "don't require TSCs
> to be synchronised in sched_clock()" patch from -mm. The fix for that is
> below. I shall accelerate it.
>
> --- 25/arch/i386/kernel/timers/timer_tsc.c~sched_clock-2.6.0-A1-deadlock-fix 2003-12-30 00:45:09.000000000 -0800
> +++ 25-akpm/arch/i386/kernel/timers/timer_tsc.c 2003-12-30 00:45:09.000000000 -0800
> @@ -140,7 +140,8 @@ unsigned long long sched_clock(void)
> #ifndef CONFIG_NUMA
> if (!use_tsc)
> #endif
> - return (unsigned long long)get_jiffies_64() * (1000000000 / HZ);
> + /* jiffies might overflow but this is not a big deal here */
> + return (unsigned long long)jiffies * (1000000000 / HZ);
>
> /* Read the Time Stamp Counter */
> rdtscll(this_offset);
Yes, please revert that change. When I tested it, I didn't realize I needed
NUMA or a cleared use_tsc to actually cover the change. And it's indeed not a
big deal.
The proposed get_jiffies_64() change however looks quite fragile to me.
Sorry,
Tim
James Bottomley <[email protected]> wrote:
>
> On Tue, 2004-01-06 at 10:19, Andrew Morton wrote:
> > Hm, OK. I hit the same deadlock when running with the "don't require TSCs
> > to be synchronised in sched_clock()" patch from -mm. The fix for that is
> > below. I shall accelerate it.
>
> Actually, I think we need to know why this is happening, since the use
> of these sequence locks is growing.
That would be nice. Can you get a backtrace?
> On voyager I just put it down to HZ
> == 1000 being a bit much for my old pentium 66MHz processors, but if
> you've seen it on a much faster processor, that would tend to indicate
> there's some other problem at work here.
No, it was much simpler in my case: log_buf_len_setup() was accidentally
enabling interrupts early in boot and we were taking a timer interrupt
while holding a write lock on xtime_lock. sched_clock() was requiring a
read lock and boom.
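The failure mode is generic to seqlocks: the read loop keeps retrying while
the sequence count is odd, so a reader running on the very CPU that holds the
write side can never make progress. Schematically (a sketch of the pattern,
not the actual call chain):

	write_seqlock_irq(&xtime_lock);		/* timer code: count now odd */

	/* interrupts accidentally enabled; a timer interrupt fires and
	 * ends up in sched_clock() -> get_jiffies_64() on this CPU: */
	do {
		seq = read_seqbegin(&xtime_lock);
		ret = jiffies_64;
	} while (read_seqretry(&xtime_lock, seq));	/* always retries */

	write_sequnlock_irq(&xtime_lock);	/* never reached */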
Linus Torvalds <[email protected]> wrote:
>
>
>
> On Tue, 6 Jan 2004, Andrew Morton wrote:
> >
> > Hm, OK. I hit the same deadlock when running with the "don't require TSCs
> > to be synchronised in sched_clock()" patch from -mm. The fix for that is
> > below. I shall accelerate it.
> >
> > --- 25/arch/i386/kernel/timers/timer_tsc.c~sched_clock-2.6.0-A1-deadlock-fix 2003-12-30 00:45:09.000000000 -0800
> > +++ 25-akpm/arch/i386/kernel/timers/timer_tsc.c 2003-12-30 00:45:09.000000000 -0800
> > @@ -140,7 +140,8 @@ unsigned long long sched_clock(void)
> > #ifndef CONFIG_NUMA
> > if (!use_tsc)
> > #endif
> > - return (unsigned long long)get_jiffies_64() * (1000000000 / HZ);
> > + /* jiffies might overflow but this is not a big deal here */
> > + return (unsigned long long)jiffies * (1000000000 / HZ);
>
> Augh. If you cast it to "unsigned long long" anyway, why not just use the
> right value? It's "jiffies_64".
Sounds sane.
> It has other problems, of course, but at least that makes them slightly
> less.
Yes, it's slightly sleazy.
I haven't tested this...
From: Ingo Molnar <[email protected]>
Voyager is getting odd deadlocks due to the taking of xtime_lock in
sched_clock()->get_jiffies_64().
I had this patch queued up to fix a different deadlock, which occurs when we
relax the requirement that TSCs be synchronised across CPUs. But it will
fix James' deadlock too.
arch/i386/kernel/timers/timer_tsc.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletion(-)
diff -puN arch/i386/kernel/timers/timer_tsc.c~sched_clock-2.6.0-A1-deadlock-fix arch/i386/kernel/timers/timer_tsc.c
--- 25/arch/i386/kernel/timers/timer_tsc.c~sched_clock-2.6.0-A1-deadlock-fix 2004-01-06 08:56:36.000000000 -0800
+++ 25-akpm/arch/i386/kernel/timers/timer_tsc.c 2004-01-06 08:58:37.000000000 -0800
@@ -140,7 +140,12 @@ unsigned long long sched_clock(void)
#ifndef CONFIG_NUMA
if (!use_tsc)
#endif
- return (unsigned long long)get_jiffies_64() * (1000000000 / HZ);
+ /*
+ * Avoid xtime_lock deadlocks by not locking the read of
+ * jiffies_64. Can possibly cause glitches on 32-bit rollovers
+ * but the CPU scheduler will recover OK.
+ */
+ return jiffies_64 * (1000000000 / HZ);
/* Read the Time Stamp Counter */
rdtscll(this_offset);
_
On Tue, 6 Jan 2004, Linus Torvalds wrote:
> Augh. If you cast it to "unsigned long long" anyway, why not just use the
> right value? It's "jiffies_64".
Lessons in straight thinking...
We then need to make jiffies_64 volatile, since we violate the rule to
never read it.
Tim
--- linux-2.6.1-rc1/arch/i386/kernel/timers/timer_tsc.c 2003-12-31 14:21:19.000000000 +0100
+++ linux-2.6.1-rc1-j64/arch/i386/kernel/timers/timer_tsc.c 2004-01-06 18:05:24.000000000 +0100
@@ -140,7 +140,8 @@
#ifndef CONFIG_NUMA
if (!use_tsc)
#endif
- return (unsigned long long)get_jiffies_64() * (1000000000 / HZ);
+ /* no locking but a rare wrong value is not a big deal */
+ return (unsigned long long)jiffies_64 * (1000000000 / HZ);
/* Read the Time Stamp Counter */
rdtscll(this_offset);
--- linux-2.6.1-rc1/include/linux/jiffies.h 2003-12-31 14:22:05.000000000 +0100
+++ linux-2.6.1-rc1-j64/include/linux/jiffies.h 2004-01-06 17:56:15.000000000 +0100
@@ -9,11 +9,11 @@
#include <asm/param.h> /* for HZ */
/*
- * The 64-bit value is not volatile - you MUST NOT read it
- * without sampling the sequence number in xtime_lock.
+ * NEVER EVER read jiffies_64 without sampling the sequence number
+ * in xtime_lock.
* get_jiffies_64() will do this for you as appropriate.
*/
-extern u64 jiffies_64;
+extern u64 volatile jiffies_64;
extern unsigned long volatile jiffies;
#if (BITS_PER_LONG < 64)
@@ -21,7 +21,7 @@
#else
static inline u64 get_jiffies_64(void)
{
- return (u64)jiffies;
+ return jiffies_64;
}
#endif
Tim Schmielau <[email protected]> wrote:
>
> #if (BITS_PER_LONG < 64)
> @@ -21,7 +21,7 @@
> #else
> static inline u64 get_jiffies_64(void)
> {
> - return (u64)jiffies;
> + return jiffies_64;
> }
> #endif
hm, why this change? Are you sure that all 64-bit architectures alias
jiffies onto jiffies_64? x86_64 seems not to.
On Tue, 6 Jan 2004, Andrew Morton wrote:
> Tim Schmielau <[email protected]> wrote:
> >
> > #if (BITS_PER_LONG < 64)
> > @@ -21,7 +21,7 @@
> > #else
> > static inline u64 get_jiffies_64(void)
> > {
> > - return (u64)jiffies;
> > + return jiffies_64;
> > }
> > #endif
>
> hm, why this change? Are you sure that all 64-bit architectures alias
> jiffies onto jiffies_64? x86_64 seems not to.
>
It was jiffies_64 in the first place (that's why the function is called
get_jiffies_64(), after all).
I once made it jiffies because jiffies was volatile while jiffies_64 was
not. Now that jiffies_64 becomes volatile as well, I thought we could
re-straighten things.
But we can leave out this chunk, it really shouldn't make any difference.
Tim
[ This is a big rant against using "volatile" on data structures. Feel
free to ignore it, but the fact is, I'm right. You should never EVER use
"volatile" on a data structure. ]
On Tue, 6 Jan 2004, Tim Schmielau wrote:
>
> We then need to make jiffies_64 volatile, since we violate the rule to
> never read it.
No, we should _never_ make anything volatile. There just isn't any reason
to. The compiler will never do any "better" with a volatile, it will only
ever do worse.
If there are memory ordering constraints etc, the compiler won't be able
to handle them anyway, and volatile will be a no-op. That's why we have
"barrier()" and "mb()" and friends.
The _only_ acceptable use of "volatile" is basically:
- in _code_ (not data structures), where we might mark a place as making
a special type of access. For example, in the PCI MMIO read functions,
rather than use inline assembly to force the particular access (which
would work equally well), we can force the pointer to a volatile type.
Similarly, we force this for "test_bit()" macros etc, because they are
documented to be SMP-safe. But it's the _code_ that works that
way, not the data structures.
And this is an important distinction: there are specific pieces of
_code_ that may be SMP-safe (for whatever reasons the writer thinks).
Data structures themselves are never SMP-safe.
Ergo: never mark data structures "volatile". It's a sure sign of a bug
if the thing isn't a memory-mapped IO register (and even then it's
likely a bug, since you really should be using the proper functions).
(Some driver writers use "volatile" for mailboxes that are updated by
DMA from the hardware. It _can_ be correct there, but the fact is, you
might as well put the "volatile" in the code just out of principle).
That said, the "sure sign of a bug" case has one specific sub-case:
- to paste over bugs that you really don't think are worth fixing any
other way. This is why "jiffies" itself is declared volatile: just to
let people write code that does "while (time_before(xxx, jiffies))".
But the "jiffies" case is safe only _exactly_ because it's an atomic read.
You always get a valid value - so it's actually "safe" to mark jiffies as
being volatile. It allows people to be sloppy (bad), but at least it
allows people to be sloppy in a safe way.
In contrast, "jiffies_64" is _not_ an area where you can safely let the
compiler read an unstable value, so "volatile" is fundamentally wrong for
it. You need to have some locking, or to explicitly say "we don't care in
this case", and in both cases it would be wrong to call the thing
"volatile". With locking, it _isn't_ volatile, and with "we don't care",
it had better not make any difference. In either case the "volatile" is
wrong.
We had absolutely _tons_ of bugs in the original networking code, where
clueless people thought that "volatile" somehow means that you don't need
locking. EVERY SINGLE CASE, and I mean EVERY ONE, was a bug.
There are some other cases where the "volatile" keyword is fine, notably
in inline asm, where it has a specific meaning that is pretty well-defined and very
useful. But in all other cases I'd suggest having the volatile be part of
the code, possibly with an explanation _why_ it is ok to use volatile
there unless it is obvious.
Linus
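To make the code-versus-data distinction concrete: if one access site really
does want an unlocked, racy snapshot, the volatile can live in a cast at that
access instead of on the object. A minimal sketch (the helper name is made up;
the idiom is comparable in spirit to what the kernel later wrapped up as
ACCESS_ONCE()):

	extern u64 jiffies_64;		/* plain data: no volatile */

	/*
	 * Hypothetical helper: the one place that wants a racy snapshot
	 * says so explicitly, in code. Note that this only forces the
	 * read; it provides neither 64-bit atomicity on a 32-bit CPU nor
	 * any SMP ordering.
	 */
	static inline u64 jiffies_64_snapshot(void)
	{
		return *(volatile u64 *)&jiffies_64;
	}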
Linus Torvalds wrote:
>
> On Tue, 6 Jan 2004, James Bottomley wrote:
>
>> however, there is also no need to get
>>the xtime sequence lock every time we do a jiffies_64 read, since the
>>only unstable time is when we may be updating both halves of it
>>non-atomically. Thus, we only need the sequence lock when the bottom
>>half is zero. This should improve the fast path of get_jiffies_64() for
>>all x86 arch's.
>>
>
> This is wrong. There is nothing that guarantees that the read has read the
> high bits first.
>
It seems to me that even if the high bits get read first, the 64-bit value can
go "backwards" in time: it can read X from the high 32 bits, then the
clock advances (the high word becomes X+1), and the low bits will read a value
close to zero, resulting in ((X<<32)|0).
> It might have read the low bits first (and gotten 0xffffffff), and then on
> another CPU the contents were updated, and now the high bits are one
> bigger, and when you read the high bits, you get a value that is off by a
> _lot_.
>
> And it's not just 0 and 0xffffffff in the low bits that can be problematic
> either: if the CPU that does the "get_jiffies_64()" gets an interrupt, the
> thing may be off by more than a count of one.
>
> IF this optimization is really worth it, the code should be something like
> this:
>
> #define JIFFY_SLOP (3)
>
> u64 ret = jiffies_64;
> u32 low32 = ret;
>
> if ((low32+JIFFY_SLOP) <= JIFFY_SLOP*2) {
> ... do the seqlock thing ...
> }
>
> instead.
>
What about this instead? I don't like this kind of construction very much, but
it seems that it would avoid the lock altogether:
u32 jiff_high1, jiff_high2, jiff_low
do {
jiff_high1 = ((volatile) jiffies_64) >> 32;
jiff_low = ((volatile) jiffies_64);
jiff_high2 = ((volatile) jiffies_64) >> 32;
}
while(jiff_high1 != jiff_high2);
return (jiff_high1<<32) | jiff_low;
If there is any way to avoid the volatiles there, it would be much cleaner.
Please don't beat me too hard if there is an obvious problem with this
implementation.
This is an old trick that was used on a brain dead DOS API, where there were 2
functions to read date and time, and it was hard to get a real timestamp :)
--
Paulo Marques - http://www.grupopie.com
"In a world without walls and fences who needs windows and gates?"
On Tue, 6 Jan 2004, Paulo Marques wrote:
>
> What about this instead? I don't like this kind of construction very much, but
> it seems that it would avoid the lock altogether:
>
> u32 jiff_high1, jiff_high2, jiff_low
>
> do {
> jiff_high1 = ((volatile) jiffies_64) >> 32;
> jiff_low = ((volatile) jiffies_64);
> jiff_high2 = ((volatile) jiffies_64) >> 32;
> }
> while(jiff_high1 != jiff_high2);
Yes, we used to do things like that at some point (and your code is
buggy: by leaving out the size, the "volatile" cast casts to the implicit
"int" size in C).
It doesn't work in SMP environments, since the "ordering" by volatile is
only a single-CPU ordering. This is just one of the reasons why "volatile"
isn't very useful.
Also, the above still assumes an update ordering between low and high (it
assumes that both sets of bits get written back simultaneously). Which is
not necessarily true on a software _or_ hardware ordering level.
So on the reader side you'd need to add an explicit "rmb()" between the
two reads of the high bits, and on the writer side you'd need to always
make sure that you do an atomic 64-bit write.
Since the "rmb()" is as expensive as the sequence lock, the above wouldn't
much help.
> If there is any way to avoid the volatiles there, it would be much cleaner.
Not only is there a need to avoid them, they don't help in the least,
since you need the explicit CPU ordering macro anyway. And that ordering
macro is the true cost of "seq_read_lock()", so...
In contrast, the reason the simple "assume some values are stable" patch
actually _does_ help is that it doesn't depend on any ordering constraints
at all. So it literally can be done lock-less, because it knows about
something much more fundamental: it depends on the _behaviour_ of the
values.
Basically, in the kernel, the expensive part about any lock is literally
the ordering. There are other things that can be expensive too (if the
lock is bouncing back and forth between CPUs, that becomes really
expensive really quickly), but to a first order and especially on locks
that aren't themselves fundamentally highly contended for, the CPU
serialization implied in the lock is the expensive part.
So converting that serialization to a "lockless" algorithm that still
depends on ordering usually doesn't much help for those locks. The
"ordering" part is still serializing, and the main win ends up being on
64-CPU systems etc where you can at least avoid the cacheline bounces.
Indeed, I think it was 64-CPU systems that caused the sequence lock stuff,
not so much the "normal" 2- and 4-way boxes.
If you want to improve on the sequence lock, you need to take advantage of
inherent data knowledge (in this case the knowledge that you don't care
about the exact value, and that you know how the bits are updated).
Linus
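Spelled out, the barriers are what make the high/low/high trick as heavy as
the seqlock it tries to replace. A sketch of the reader with the ordering made
explicit (assumes i386 little-endian word layout, so the high word sits at
offset 4, and a writer that updates jiffies_64 with a single atomic 64-bit
store):

	u64 get_jiffies_64(void)
	{
		u32 high1, high2, low;

		do {
			high1 = *((volatile u32 *)&jiffies_64 + 1);
			rmb();		/* high must be read before low */
			low = *(volatile u32 *)&jiffies_64;
			rmb();		/* and low before the second high */
			high2 = *((volatile u32 *)&jiffies_64 + 1);
		} while (high1 != high2);

		return ((u64)high1 << 32) | low;
	}

Two rmb()s per call: no cheaper than the read_seqbegin()/read_seqretry() pair
it replaces.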
On Tue, 2004-01-06 at 11:05, Andrew Morton wrote:
> No, it was much simpler in my case: log_buf_len_setup() was accidentally
> enabling interrupts early in boot and we were taking a timer interrupt
> while holding a write lock on xtime_lock. sched_clock() was requiring a
> read lock and boom.
Voyager has no APIC local timer equivalent, so I rebroadcast the timer
tick to all CPUs. However, the local tick is done in this thread with
the xtime_lock held for writing, and it can trigger the scheduler load
balancing, which needs to call sched_clock()... boom.
All in all, this does show that the xtime sequence lock is a bit too
fragile. It also seems to show that we should redo the subarch timer
hooks if we want to make the sequence locks work.
I think there should be two hooks: one called while holding the xtime write
lock for doing clock-adjustment-specific things, and the other called
after we've released the xtime write lock.
James
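A sketch of the split James proposes, in the style of the existing i386
struct timer_opts (fields abridged; the second hook is hypothetical and the
name is made up):

	struct timer_opts {
		int (*init)(char *override);
		void (*mark_offset)(void);	/* runs with xtime_lock write-held */
		unsigned long (*get_offset)(void);
		/* hypothetical: called after xtime_lock has been released,
		 * so it may take read locks, rebroadcast the tick, trigger
		 * load balancing, etc. */
		void (*post_tick)(void);
	};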
Linus Torvalds wrote:
>
> On Tue, 6 Jan 2004, Paulo Marques wrote:
>
>> ....
>
> Yes, we used to do things like that at some point (and your code is
> buggy: by leaving out the size, the "volatile" cast casts to the implicit
> "int" size in C).
>
Ugh, stupid mistake. Just wrote the code too fast... :(
Anyway, thanks for your clarification. People like me who code a lot of
user-space stuff take a while to realize all the implications of SMP, ordering,
preemption, etc., etc.
By the way, after double-checking your logic, it looks good as long as there are
no interrupts that take more than 3 jiffies to complete :)
--
Paulo Marques - http://www.grupopie.com
"In a world without walls and fences who needs windows and gates?"