Turn on CONFIG_HAVE_UNSTABLE_SCHED_CLOCK for SGI_SN.
SGI Altix has unsynchronized itc clocks. This results in rq->clock
occasionally being set to a time in the past by a remote cpu.
Note that it is possible that this problem may exist for other ia64
machines as well, based on the following comment for sched_clock() in
arch/ia64/kernel/head.S:
* Return a CPU-local timestamp in nano-seconds. This timestamp is
* NOT synchronized across CPUs its return value must never be
* compared against the values returned on another CPU. The usage in
* kernel/sched.c ensures that.
Signed-off-by: Dimitri Sivanich <[email protected]>
---
Greg, if everyone is OK with this patch, this should also be applied
to all stable trees starting with 2.6.26.
arch/ia64/Kconfig | 1 +
1 file changed, 1 insertion(+)
Index: linux-2.6/arch/ia64/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/Kconfig 2009-01-06 10:13:13.051918923 -0600
+++ linux-2.6/arch/ia64/Kconfig 2009-01-06 10:13:44.547856328 -0600
@@ -536,6 +536,7 @@ config IA64_MC_ERR_INJECT
config SGI_SN
def_bool y if (IA64_SGI_SN2 || IA64_GENERIC)
+ select HAVE_UNSTABLE_SCHED_CLOCK
config IA64_ESI
bool "ESI (Extensible SAL Interface) support"
On Tue, Jan 06, 2009 at 10:27:41AM -0600, Dimitri Sivanich wrote:
> Turn on CONFIG_HAVE_UNSTABLE_SCHED_CLOCK for SGI_SN.
>
> SGI Altix has unsynchronized itc clocks. This results in rq->clock
> occasionally being set to a time in the past by a remote cpu.
>
> Note that it is possible that this problem may exist for other ia64
> machines as well, based on the following comment for sched_clock() in
> arch/ia64/kernel/head.S:
>
> * Return a CPU-local timestamp in nano-seconds. This timestamp is
> * NOT synchronized across CPUs its return value must never be
> * compared against the values returned on another CPU. The usage in
> * kernel/sched.c ensures that.
>
>
> Signed-off-by: Dimitri Sivanich <[email protected]>
>
> ---
>
> Greg, if everyone is OK with this patch, this should also be applied
> to all stable trees starting with 2.6.26.
That's fine (but 2.6.26 is no longer getting -stable releases).
Just let [email protected] know when this patch is applied in Linus's
tree. You can do this automatically by adding a:
Cc: stable <[email protected]>
to the signed-off-by: area of the patch, and the stable team will be
automatically notified when it goes into Linus's tree.
thanks,
greg k-h
> Turn on CONFIG_HAVE_UNSTABLE_SCHED_CLOCK for SGI_SN.
>
> SGI Altix has unsynchronized itc clocks. This results in rq->clock
> occasionally being set to a time in the past by a remote cpu.
SGI Altix is definitely the worst offender here. On heterogeneneous
systems with different model cpus across nodes the itc rate may differ
by hundreds of megahertz. Even when all cpus are at the same rated
speed, different nodes are driven from different crystals, so the itc
values will slowly drift apart (Linux discovers this from SAL and so
doesn't bother to synchronize them).
> Note that it is possible that this problem may exist for other ia64
> machines as well, based on the following comment for sched_clock() in
> arch/ia64/kernel/head.S:
>
> * Return a CPU-local timestamp in nano-seconds. This timestamp is
> * NOT synchronized across CPUs its return value must never be
> * compared against the values returned on another CPU. The usage in
> * kernel/sched.c ensures that.
When sched_clock() was first created this was part of its definition.
It meant that sched_clock could be simple & fast because it didn't
need to be synchronized across cpus.
It seems that new uses have grown for it that no longer allow that
flexibility.
All ia64 systems are potentially affected ... but perhaps you might
never see the problem on most because the itc clocks are synced as close
as s/w can get them when cpus are brought on line.
-Tony
On Tue, Jan 06, 2009 at 12:15:34PM -0800, Luck, Tony wrote:
> > Turn on CONFIG_HAVE_UNSTABLE_SCHED_CLOCK for SGI_SN.
> >
> > SGI Altix has unsynchronized itc clocks. This results in rq->clock
> > occasionally being set to a time in the past by a remote cpu.
>
> SGI Altix is definitely the worst offender here. On heterogeneneous
> systems with different model cpus across nodes the itc rate may differ
> by hundreds of megahertz. Even when all cpus are at the same rated
> speed, different nodes are driven from different crystals, so the itc
> values will slowly drift apart (Linux discovers this from SAL and so
> doesn't bother to synchronize them).
>
> > Note that it is possible that this problem may exist for other ia64
> > machines as well, based on the following comment for sched_clock() in
> > arch/ia64/kernel/head.S:
> >
> > * Return a CPU-local timestamp in nano-seconds. This timestamp is
> > * NOT synchronized across CPUs its return value must never be
> > * compared against the values returned on another CPU. The usage in
> > * kernel/sched.c ensures that.
>
> When sched_clock() was first created this was part of its definition.
> It meant that sched_clock could be simple & fast because it didn't
> need to be synchronized across cpus.
>
> It seems that new uses have grown for it that no longer allow that
> flexibility.
>
> All ia64 systems are potentially affected ... but perhaps you might
> never see the problem on most because the itc clocks are synced as close
> as s/w can get them when cpus are brought on line.
Do you want Dimitri to resubmit with this set for all IA64 or leave it
as is?
Thanks,
Robin
> > All ia64 systems are potentially affected ... but perhaps you might
> > never see the problem on most because the itc clocks are synced as close
> > as s/w can get them when cpus are brought on line.
>
> Do you want Dimitri to resubmit with this set for all IA64 or leave it
> as is?
I'd like to understand the impact of turning on HAVE_UNSTABLE_SCHED_CLOCK
It looks like both the i386_defconfig and x86_64_defconfig choose this,
so at least ia64 will be hitting the well tested code paths
Have the other architectures just not hit this yet? Or do they all have
"stable" sched_clock() functions?
sched_clock() seemed like such a straightforward thing to begin with. A
quick & easy way to measure a time delta ON THE SAME CPU. I'm not at
all sure why it has been co-opted for general time measurement.
-Tony
On Tue, 2009-01-06 at 12:34 -0800, Luck, Tony wrote:
> > > All ia64 systems are potentially affected ... but perhaps you might
> > > never see the problem on most because the itc clocks are synced as close
> > > as s/w can get them when cpus are brought on line.
> >
> > Do you want Dimitri to resubmit with this set for all IA64 or leave it
> > as is?
>
> I'd like to understand the impact of turning on HAVE_UNSTABLE_SCHED_CLOCK
>
> It looks like both the i386_defconfig and x86_64_defconfig choose this,
> so at least ia64 will be hitting the well tested code paths
>
> Have the other architectures just not hit this yet? Or do they all have
> "stable" sched_clock() functions?
>
>
> sched_clock() seemed like such a straightforward thing to begin with. A
> quick & easy way to measure a time delta ON THE SAME CPU. I'm not at
> all sure why it has been co-opted for general time measurement.
It came from the complication of needing to tell a remote cpu's time due
to remote wakeups in the scheduler.
On Tue, Jan 06, 2009 at 09:57:20PM +0100, Peter Zijlstra wrote:
> On Tue, 2009-01-06 at 12:34 -0800, Luck, Tony wrote:
> > > > All ia64 systems are potentially affected ... but perhaps you might
> > > > never see the problem on most because the itc clocks are synced as close
> > > > as s/w can get them when cpus are brought on line.
> > >
> > > Do you want Dimitri to resubmit with this set for all IA64 or leave it
> > > as is?
> >
> > I'd like to understand the impact of turning on HAVE_UNSTABLE_SCHED_CLOCK
> >
> > It looks like both the i386_defconfig and x86_64_defconfig choose this,
> > so at least ia64 will be hitting the well tested code paths
> >
> > Have the other architectures just not hit this yet? Or do they all have
> > "stable" sched_clock() functions?
> >
> >
> > sched_clock() seemed like such a straightforward thing to begin with. A
> > quick & easy way to measure a time delta ON THE SAME CPU. I'm not at
> > all sure why it has been co-opted for general time measurement.
>
> It came from the complication of needing to tell a remote cpu's time due
> to remote wakeups in the scheduler.
But doesn't scheduler tick advance the rq->clock? Why do the others
need to fiddle with a remote runqueue's clock? When that cpu starts
taking ticks again, it will update it's rq->clock field and start the
processes. I guess I am a lot underinformed about the new scheduler
design.
Robin
On Tue, 2009-01-06 at 16:50 -0600, Robin Holt wrote:
> On Tue, Jan 06, 2009 at 09:57:20PM +0100, Peter Zijlstra wrote:
> > On Tue, 2009-01-06 at 12:34 -0800, Luck, Tony wrote:
> > > > > All ia64 systems are potentially affected ... but perhaps you might
> > > > > never see the problem on most because the itc clocks are synced as close
> > > > > as s/w can get them when cpus are brought on line.
> > > >
> > > > Do you want Dimitri to resubmit with this set for all IA64 or leave it
> > > > as is?
> > >
> > > I'd like to understand the impact of turning on HAVE_UNSTABLE_SCHED_CLOCK
> > >
> > > It looks like both the i386_defconfig and x86_64_defconfig choose this,
> > > so at least ia64 will be hitting the well tested code paths
> > >
> > > Have the other architectures just not hit this yet? Or do they all have
> > > "stable" sched_clock() functions?
> > >
> > >
> > > sched_clock() seemed like such a straightforward thing to begin with. A
> > > quick & easy way to measure a time delta ON THE SAME CPU. I'm not at
> > > all sure why it has been co-opted for general time measurement.
> >
> > It came from the complication of needing to tell a remote cpu's time due
> > to remote wakeups in the scheduler.
>
> But doesn't scheduler tick advance the rq->clock? Why do the others
> need to fiddle with a remote runqueue's clock? When that cpu starts
> taking ticks again, it will update it's rq->clock field and start the
> processes. I guess I am a lot underinformed about the new scheduler
> design.
We try to do better than tick based time accounting these days.
On Wed, Jan 07, 2009 at 12:16:03AM +0100, Peter Zijlstra wrote:
> >
> > But doesn't scheduler tick advance the rq->clock? Why do the others
> > need to fiddle with a remote runqueue's clock? When that cpu starts
> > taking ticks again, it will update it's rq->clock field and start the
> > processes. I guess I am a lot underinformed about the new scheduler
> > design.
>
> We try to do better than tick based time accounting these days.
But if you contain the drift to within one tick, it shouldn't be much
problem to just truncate negative deltas I would have thought? The
time between events on different CPUs is pretty fuzzy at the ns level
anyway, I think ;)
On Wed, Jan 07, 2009 at 04:00:30AM +0100, Nick Piggin wrote:
> On Wed, Jan 07, 2009 at 12:16:03AM +0100, Peter Zijlstra wrote:
> > >
> > > But doesn't scheduler tick advance the rq->clock? Why do the others
> > > need to fiddle with a remote runqueue's clock? When that cpu starts
> > > taking ticks again, it will update it's rq->clock field and start the
> > > processes. I guess I am a lot underinformed about the new scheduler
> > > design.
> >
> > We try to do better than tick based time accounting these days.
>
> But if you contain the drift to within one tick, it shouldn't be much
> problem to just truncate negative deltas I would have thought? The
> time between events on different CPUs is pretty fuzzy at the ns level
> anyway, I think ;)
Unfortunately, not possible on SGI IA64 systems. The cpus on different
nodes (blades) are not required to be the same steppings or core
frequencies. Core frequencies within the SSI can differ by hundreds of
MHz (~25%).
(I don't recall if we support systems with mixed madison & montecito
processors. If so, IIRC, the itc frequencies of these differ by 4X
for the same core frequency).
--- jack
On Wed, 2009-01-07 at 04:00 +0100, Nick Piggin wrote:
> On Wed, Jan 07, 2009 at 12:16:03AM +0100, Peter Zijlstra wrote:
> > >
> > > But doesn't scheduler tick advance the rq->clock? Why do the others
> > > need to fiddle with a remote runqueue's clock? When that cpu starts
> > > taking ticks again, it will update it's rq->clock field and start the
> > > processes. I guess I am a lot underinformed about the new scheduler
> > > design.
> >
> > We try to do better than tick based time accounting these days.
>
> But if you contain the drift to within one tick, it shouldn't be much
> problem to just truncate negative deltas I would have thought? The
> time between events on different CPUs is pretty fuzzy at the ns level
> anyway, I think ;)
That's basically what the HAVE_UNSTABLE_SCHED_CLOCK code does. It takes
a tick timestamp and tries to improve on that by using strict per cpu
sched_clock() deltas.
What we do to obtain remote time, is basically calculate local time and
pull remote time fwd if that was behind.
While doing that, it filters out any backward motion and large fwd leaps
so as to stay no worse than a jiffie clock.
On Wed, Jan 07, 2009 at 08:28:09AM +0100, Peter Zijlstra wrote:
> On Wed, 2009-01-07 at 04:00 +0100, Nick Piggin wrote:
> > On Wed, Jan 07, 2009 at 12:16:03AM +0100, Peter Zijlstra wrote:
> > > >
> > > > But doesn't scheduler tick advance the rq->clock? Why do the others
> > > > need to fiddle with a remote runqueue's clock? When that cpu starts
> > > > taking ticks again, it will update it's rq->clock field and start the
> > > > processes. I guess I am a lot underinformed about the new scheduler
> > > > design.
> > >
> > > We try to do better than tick based time accounting these days.
> >
> > But if you contain the drift to within one tick, it shouldn't be much
> > problem to just truncate negative deltas I would have thought? The
> > time between events on different CPUs is pretty fuzzy at the ns level
> > anyway, I think ;)
>
> That's basically what the HAVE_UNSTABLE_SCHED_CLOCK code does. It takes
> a tick timestamp and tries to improve on that by using strict per cpu
> sched_clock() deltas.
>
> What we do to obtain remote time, is basically calculate local time and
> pull remote time fwd if that was behind.
>
> While doing that, it filters out any backward motion and large fwd leaps
> so as to stay no worse than a jiffie clock.
OK, that's good. I guess the optimisations to remove that code should
have been called HAVE_STABLE_SCHED_CLOCK and have archs turn it on on
a case by case basis.
On Wed, Jan 07, 2009 at 08:28:09AM +0100, Peter Zijlstra wrote:
> That's basically what the HAVE_UNSTABLE_SCHED_CLOCK code does. It takes
> a tick timestamp and tries to improve on that by using strict per cpu
> sched_clock() deltas.
>
> What we do to obtain remote time, is basically calculate local time and
> pull remote time fwd if that was behind.
But what happens when the remote clock then takes a single tick and
pulls forward a different remote clock by both your tick and its tick.
Multiply that by 4096 and I am starting to feel that time is likely to
go racing forward in a really random fashion.
> While doing that, it filters out any backward motion and large fwd leaps
> so as to stay no worse than a jiffie clock.
On Wed, 2009-01-07 at 03:43 -0600, Robin Holt wrote:
> On Wed, Jan 07, 2009 at 08:28:09AM +0100, Peter Zijlstra wrote:
> > That's basically what the HAVE_UNSTABLE_SCHED_CLOCK code does. It takes
> > a tick timestamp and tries to improve on that by using strict per cpu
> > sched_clock() deltas.
> >
> > What we do to obtain remote time, is basically calculate local time and
> > pull remote time fwd if that was behind.
>
> But what happens when the remote clock then takes a single tick and
> pulls forward a different remote clock by both your tick and its tick.
> Multiply that by 4096 and I am starting to feel that time is likely to
> go racing forward in a really random fashion.
I'm having trouble parsing that...
Clock state is kept per-cpu, and locked with a spinlock. When we request
a remote time-stamp we lock both our local and the remote clock state,
compute our local time, and set both the remote and local clock to the
max of local,remote clock.
An interrupt on the remote cpu will spin for its clock state lock before
doing anything - also, the tick interrupt will never bother with remote
clock state.
So I'm not exactly seeing what could go wrong here.
Please read kernel/sched_clock.c, its only 266 lines and could use an
extra eyeball, I'd be glad to answer any questions.
Peter,
On Wed, Jan 07, 2009 at 10:53:57AM +0100, Peter Zijlstra wrote:
>
> Clock state is kept per-cpu, and locked with a spinlock. When we request
Admittedly I have not looked at this possibility too closely, but my initial concern upon looking at sched_clock_cpu() for the UNSTABLE case was the lock_double_clock() and what sort of contention that might cause on larger systems under certain conditions.
On Wed, 2009-01-07 at 07:32 -0600, Dimitri Sivanich wrote:
> Peter,
>
> On Wed, Jan 07, 2009 at 10:53:57AM +0100, Peter Zijlstra wrote:
> >
> > Clock state is kept per-cpu, and locked with a spinlock. When we request
>
> Admittedly I have not looked at this possibility too closely, but my
> initial concern upon looking at sched_clock_cpu() for the UNSTABLE
> case was the lock_double_clock() and what sort of contention that
> might cause on larger systems under certain conditions.
Similar contention would already exist on rq->lock.
On Tue, Jan 06, 2009 at 10:27:41AM -0600, Dimitri Sivanich wrote:
> Turn on CONFIG_HAVE_UNSTABLE_SCHED_CLOCK for SGI_SN.
>
> SGI Altix has unsynchronized itc clocks. This results in rq->clock
> occasionally being set to a time in the past by a remote cpu.
>
> Note that it is possible that this problem may exist for other ia64
> machines as well, based on the following comment for sched_clock() in
> arch/ia64/kernel/head.S:
>
> * Return a CPU-local timestamp in nano-seconds. This timestamp is
> * NOT synchronized across CPUs its return value must never be
> * compared against the values returned on another CPU. The usage in
> * kernel/sched.c ensures that.
>
>
> Signed-off-by: Dimitri Sivanich <[email protected]>
>
> ---
>
> Greg, if everyone is OK with this patch, this should also be applied
> to all stable trees starting with 2.6.26.
What is the status of this going into Linus's tree?
thanks,
greg k-h
> > Greg, if everyone is OK with this patch, this should also be applied
> > to all stable trees starting with 2.6.26.
> What is the status of this going into Linus's tree?
I'm about to ask Linus to pull a modified version that does
a "select HAVE_UNSTABLE_SCHED_CLOCK" for all IA64 systems ...
SGI sched_clock() is a lot less "stable" than most, but all
ia64 systems can hit this as the ar.itc cycle counter is only
s/w synchronized between cpus.
-Tony
On Thu, Jan 15, 2009 at 11:21:05AM -0800, Luck, Tony wrote:
> > > Greg, if everyone is OK with this patch, this should also be applied
> > > to all stable trees starting with 2.6.26.
>
> > What is the status of this going into Linus's tree?
>
> I'm about to ask Linus to pull a modified version that does
> a "select HAVE_UNSTABLE_SCHED_CLOCK" for all IA64 systems ...
> SGI sched_clock() is a lot less "stable" than most, but all
> ia64 systems can hit this as the ar.itc cycle counter is only
> s/w synchronized between cpus.
Thanks, I see it now and will queue it up for the next -stable releases.
greg k-h