The scenario is that I do a remote login to an ARM build server,
use screen to start a sub-shell, in that shell start a largish
compile job, detach from that screen, and from the original login
shell I occasionally monitor the compile job with top or ps or
by attaching to the screen.
With kernels 2.6.37-rc2 and -rc3 this causes the machine to become
very sluggish: top takes forever to start, once started it shows no
activity from the compile job (it's as if it's sleeping on a lock),
and ps also takes forever and shows no activity from the compile job.
Rebooting into 2.6.36 eliminates these issues.
I do pretty much the same thing (remote login -> screen -> compile job)
on other archs, but so far I've only seen the 2.6.37-rc misbehaviour
on ARM EABI, specifically on an IOP n2100. (I have access to other ARM
sub-archs, but haven't had time to test 2.6.37-rc on them yet.)
Has anyone else seen this? Any ideas about the cause?
/Mikael
Mikael Pettersson writes:
> [...]
(Re-followup since I just realised my previous followups were to Rafael's
regressions mailbot rather than the original thread.)
> The bug is still present in 2.6.37-rc4. I'm currently trying to bisect it.
git bisect identified
[305e6835e05513406fa12820e40e4a8ecb63743c] sched: Do not account irq time to current task
as the cause of this regression. Reverting it from 2.6.37-rc4 (requires some
hackery due to subsequent changes in the same area) restores sane behaviour.
The original patch submission talks about irq-heavy scenarios. My case is the
exact opposite: UP, !PREEMPT, NO_HZ, very low irq rate, essentially 100% CPU
bound in userspace but expected to schedule quickly when needed (e.g. running
top or ps or just hitting CR in one shell while another runs a compile job).
I've reproduced the misbehaviour with 2.6.37-rc4 on ARM/mach-iop32x and
ARM/mach-ixp4xx, but ARM/mach-kirkwood does not misbehave, and other archs
(x86 SMP, SPARC64 UP and SMP, PowerPC32 UP, Alpha UP) also do not misbehave.
So it looks like an ARM-only issue, possibly depending on platform specifics.
One difference I noticed between my Kirkwood machine and my ixp4xx and iop32x
machines is that even though all have CONFIG_NO_HZ=y, the timer irq rate is
much higher on Kirkwood, even when the machine is idle.
/Mikael
On Sun, Dec 05, 2010 at 01:32:37PM +0100, Mikael Pettersson wrote:
> [...]
> git bisect identified
>
> [305e6835e05513406fa12820e40e4a8ecb63743c] sched: Do not account irq time to current task
>
> as the cause of this regression. Reverting it from 2.6.37-rc4 (requires some
> hackery due to subsequent changes in the same area) restores sane behaviour.
>
> The original patch submission talks about irq-heavy scenarios. My case is the
> exact opposite: UP, !PREEMPT, NO_HZ, very low irq rate, essentially 100% CPU
> bound in userspace but expected to schedule quickly when needed (e.g. running
> top or ps or just hitting CR in one shell while another runs a compile job).
>
> I've reproduced the misbehaviour with 2.6.37-rc4 on ARM/mach-iop32x and
> ARM/mach-ixp4xx, but ARM/mach-kirkwood does not misbehave, and other archs
> (x86 SMP, SPARC64 UP and SMP, PowerPC32 UP, Alpha UP) also do not misbehave.
>
> So it looks like an ARM-only issue, possibly depending on platform specifics.
>
> One difference I noticed between my Kirkwood machine and my ixp4xx and iop32x
> machines is that even though all have CONFIG_NO_HZ=y, the timer irq rate is
> much higher on Kirkwood, even when the machine is idle.
The above patch you point out is fundamentally broken.
+	rq->clock = sched_clock_cpu(cpu);
+	irq_time = irq_time_cpu(cpu);
+	if (rq->clock - irq_time > rq->clock_task)
+		rq->clock_task = rq->clock - irq_time;
This means that we will only update rq->clock_task if it is smaller than
rq->clock. So, eventually over time, rq->clock_task becomes the maximum
value that rq->clock can ever be. Or in other words, the maximum value
of sched_clock_cpu().
Once that has been reached, although rq->clock will wrap back to zero,
rq->clock_task will not, and so (I think) task execution time accounting
effectively stops dead.
I guess this hasn't been noticed on x86 as they have a 64-bit sched_clock,
and so need to wait a long time for this to be noticed. However, on ARM
where we tend to have 32-bit counters feeding sched_clock(), this value
will wrap far sooner.
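(To make that failure mode concrete, here is a minimal standalone sketch with invented numbers: a sched_clock() that wraps every 16 "ns" stands in for an ARM sched_clock() built from a narrow counter, and irq_time_cpu() is 0 on these configurations. Once the clock wraps, the conditional never fires again and clock_task freezes.)

#include <stdio.h>
#include <stdint.h>

static uint64_t toy_sched_clock(uint64_t t)
{
	return t % 16;		/* wraps back to 0 every 16 time units */
}

int main(void)
{
	uint64_t clock_task = 0, irq_time = 0;	/* irq_time_cpu() == 0 here */

	for (uint64_t t = 0; t < 40; t++) {
		uint64_t clock = toy_sched_clock(t);

		/* the bisected hunk: clock_task only ever moves forward */
		if (clock - irq_time > clock_task)
			clock_task = clock - irq_time;

		printf("t=%2llu clock=%2llu clock_task=%2llu\n",
		       (unsigned long long)t, (unsigned long long)clock,
		       (unsigned long long)clock_task);
	}
	/* once the clock has wrapped (t >= 16), clock_task stays pinned at 15 */
	return 0;
}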
On Sun, Dec 05, 2010 at 01:17:02PM +0000, Russell King - ARM Linux wrote:
> [...]
> The above patch you point out is fundamentally broken.
>
> +	rq->clock = sched_clock_cpu(cpu);
> +	irq_time = irq_time_cpu(cpu);
> +	if (rq->clock - irq_time > rq->clock_task)
> +		rq->clock_task = rq->clock - irq_time;
>
> This means that we will only update rq->clock_task if it is smaller than
> rq->clock. So, eventually over time, rq->clock_task becomes the maximum
> value that rq->clock can ever be. Or in other words, the maximum value
> of sched_clock_cpu().
>
> Once that has been reached, although rq->clock will wrap back to zero,
> rq->clock_task will not, and so (I think) task execution time accounting
> effectively stops dead.
>
> I guess this hasn't been noticed on x86 as they have a 64-bit sched_clock,
> and so need to wait a long time for this to be noticed. However, on ARM
> where we tend to have 32-bit counters feeding sched_clock(), this value
> will wrap far sooner.
I'm not so sure about this - certainly that if() statement looks very
suspicious above. As irq_time_cpu() will always be zero, can you try
removing the conditional?
In any case, sched_clock_cpu() should be resilient against sched_clock()
wrapping. However, your comments about it being iop32x and ixp4xx
(both of which are 32-bit-counter-to-ns based implementations) and
kirkwood being a 32-bit-extended-to-63-bit-counter-to-ns implementation
do make me wonder...
Russell King - ARM Linux writes:
> [...]
> I'm not so sure about this - certainly that if() statement looks very
> suspicious above. As irq_time_cpu() will always be zero, can you try
> removing the conditional?
>
> In any case, sched_clock_cpu() should be resilient against sched_clock()
> wrapping. However, your comments about it being iop32x and ixp4xx
> (both of which are 32-bit-counter-to-ns based implementations) and
> kirkwood being a 32-bit-extended-to-63-bit-counter-to-ns implementation
> do make me wonder...
I ran two tests on my iop32x machine:
1. Made the above-mentioned assignment to rq->clock_task unconditional.
That cured the interactivity regressions.
2. Restored the conditional assignment to rq->clock_task and disabled the
platform-specific sched_clock() so the kernel used the generic one.
That too cured the interactivity regressions.
I then repeated these tests on my ixp4xx machine, with the same results.
I'll try to come up with a fix for the ixp4xx and plat-iop 32-bit sched_clock()s.
On Sun, Dec 05, 2010 at 05:07:36PM +0100, Mikael Pettersson wrote:
> I ran two tests on my iop32x machine:
>
> 1. Made the above-mentioned assignment to rq->clock_task unconditional.
> That cured the interactivity regressions.
>
> 2. Restored the conditional assignment to rq->clock_task and disabled the
> platform-specific sched_clock() so the kernel used the generic one.
> That too cured the interactivity regressions.
>
> I then repeated these tests on my ixp4xx machine, with the same results.
>
> I'll try to come up with a fix for the ixp4xx and plat-iop 32-bit
> sched_clock()s.
I'm not sure that's the correct fix - it looks like sched_clock_cpu()
should already be preventing scheduler clock time going backwards.
Hmm. IOP32x seems to have a 32-bit timer clocked at 200MHz. That means
it wraps once every 21s. However, we have that converted to ns by an
unknown multiplier and shift. It seems that those are chosen to
guarantee that we will cover only 4s without wrapping in the clocksource
conversion. Maybe that's not sufficient?
Could you try looking into sched_clock_cpu(), sched_clock_local() and
sched_clock() to see whether anything odd stands out?
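(For reference, the arithmetic behind those two figures; the second line assumes the ~4s coverage corresponds to the converted ns value only being good for 2^32 ns, which matches the number quoted above.)

#include <stdio.h>

int main(void)
{
	double hz = 200e6;		/* 32-bit timer clocked at 200MHz */
	double two32 = 4294967296.0;	/* 2^32 */

	printf("raw counter wrap: %.1fs\n", two32 / hz);   /* ~21.5s */
	printf("32-bit ns wrap:   %.2fs\n", two32 / 1e9);  /* ~4.29s */
	return 0;
}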
On Sun, Dec 5, 2010 at 6:19 AM, Russell King - ARM Linux
<[email protected]> wrote:
> [...]
> I'm not so sure about this - certainly that if() statement looks very
> suspicious above. As irq_time_cpu() will always be zero, can you try
> removing the conditional?
>
> In any case, sched_clock_cpu() should be resilient against sched_clock()
> wrapping. However, your comments about it being iop32x and ixp4xx
> (both of which are 32-bit-counter-to-ns based implementations) and
> kirkwood being a 32-bit-extended-to-63-bit-counter-to-ns implementation
> do make me wonder...
>
That conditional is based on the assumption that sched_clock_cpu() is a u64
that does not wrap. If that is not true and sched_clock_cpu() wraps at 32
bits, then there are other places in the scheduler which may have problems
as well, where we do curr_time - prev_time kinds of calculations in u64.
For example, update_curr() has:
	delta_exec = (unsigned long)(now - curr->exec_start);
which is based on rq->clock and can end up as a huge positive number in
case of a 32-bit wraparound, as sketched below.
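(A standalone illustration with invented numbers of how that subtraction blows up across a wrap of the underlying clock; the 4,000,000,000ns wrap point is made up.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* a sched_clock() built from a 32-bit counter that wraps at 4e9 ns:
	 * exec_start was stamped just before the wrap, "now" just after it */
	uint64_t exec_start = 3999999900ull;	/* 100ns before the wrap */
	uint64_t now = 100ull;			/* 100ns after it        */

	unsigned long delta_exec = (unsigned long)(now - exec_start);

	/* true delta is 200ns; this prints ~295ms worth of ns on a 32-bit
	 * unsigned long, and an astronomically large value on 64-bit */
	printf("delta_exec = %lu\n", delta_exec);
	return 0;
}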
Having said that, this conditional can be cleaned up to handle the potential
64-bit overflow (even after a long long time) cleanly. But it would be good to
know what exactly is going wrong here.
Thanks,
Venki
On Sun, 2010-12-05 at 16:21 +0000, Russell King - ARM Linux wrote:
> I'm not sure that's the correct fix - it looks like sched_clock_cpu()
> should already be preventing scheduler clock time going backwards.
>
> Hmm. IOP32x seems to have a 32-bit timer clocked at 200MHz. That means
> it wraps once every 21s. However, we have that converted to ns by an
> unknown multiplier and shift. It seems that those are chosen to
> guarantee that we will cover only 4s without wrapping in the clocksource
> conversion. Maybe that's not sufficient?
>
> Could you try looking into sched_clock_cpu(), sched_clock_local() and
> sched_clock() to see whether anything odd stands out?
# git grep HAVE_UNSTABLE_SCHED_CLOCK arch/arm | wc -l
0
That code won't help if you don't enable it ;-)
John Stultz was also looking into making the kernel/sched_clock.c code
deal with short clocks.
On Wed, Dec 08, 2010 at 01:40:15PM +0100, Peter Zijlstra wrote:
> [...]
> # git grep HAVE_UNSTABLE_SCHED_CLOCK arch/arm | wc -l
> 0
Hmm, you're right. In which case it's purely down to sched_clock()
only being able to cover 4s - which seems to be far too small a gap.
I'm not sure that the unstable sched clock stuff makes much sense to
be enabled - we don't have an unstable clock, we just don't have the
required number of bits for the scheduler to work correctly.
Nevertheless, this provides a good way to find this kind of wrap bug.
Even with cnt32_to_63(), we still don't get a 64-bit sched_clock(), so
this bug will still be there. Even with a 64-bit clock, the bug will
still be there. It's basically crap code.
Maybe it's better that on ARM, we just don't implement sched_clock()
at all?
On Wed, 2010-12-08 at 12:55 +0000, Russell King - ARM Linux wrote:
> [...]
> Hmm, you're right. In which case it's purely down to sched_clock()
> only being able to cover 4s - which seems to be far too small a gap.
>
> I'm not sure that the unstable sched clock stuff makes much sense to
> be enabled - we don't have an unstable clock, we just don't have the
> required number of bits for the scheduler to work correctly.
We can perhaps make part of the HAVE_UNSTABLE_SCHED_CLOCK depend on SMP
and only deal with the short wraps (and maybe monotonicity) on UP.
> Nevertheless, this provides a good way to find this kind of wrap bug.
> Even with cnt32_to_63(), we still don't get a 64-bit sched_clock(), so
> this bug will still be there. Even with a 64-bit clock, the bug will
> still be there. It's basically crap code.
You're referring to the clock_task bit from Venki? Yes that needs
fixing.
> Maybe it's better that on ARM, we just don't implement sched_clock()
> at all?
If you have a high res clock source that's cheap to read it would be
better if we can simply fix the infrastructure such that we can make use
of it.
On Wed, Dec 08, 2010 at 03:04:36PM +0100, Peter Zijlstra wrote:
> [...]
> You're referring to the clock_task bit from Venki? Yes that needs
> fixing.
Indeed.
> > Maybe it's better that on ARM, we just don't implement sched_clock()
> > at all?
>
> If you have a high res clock source that's cheap to read it would be
> better if we can simply fix the infrastructure such that we can make use
> of it.
This is the point. We don't have high res clock sources that run for
millennia before wrapping. What we have depends on the platform, and
as this example has found, some platforms have high-res clock sources
but they wrap in as little as 4 seconds. Other platforms will have
slower clock sources which'll run for longer before wrapping.
Essentially, clock sources on ARM so far have been a 32-bit counter
(if you're lucky) clocked at some rate.
So, what I'm saying is that if wrapping in 4 seconds is a problem,
then maybe we shouldn't be providing sched_clock() at all.
Also, if wrapping below 64-bits is also a problem, again, maybe we
shouldn't be providing it there either. Eg:
#define TCR2NS_SCALE_FACTOR 10
unsigned long long sched_clock(void)
{
	unsigned long long v = cnt32_to_63(timer_read());

	return (v * tcr2ns_scale) >> TCR2NS_SCALE_FACTOR;
}
has a maximum of 54 bits - and this seems to be the most we can sanely
get from a 32-bit counter.
On Wed, 2010-12-08 at 14:28 +0000, Russell King - ARM Linux wrote:
> So, what I'm saying is that if wrapping in 4 seconds is a problem,
> then maybe we shouldn't be providing sched_clock() at all.
4 seconds should be perfectly fine; we call it at least every scheduler
tick (HZ) and NO_HZ will either have to limit the max sleep period or
provide its own sleep duration (using a more expensive clock) so we can
recover (much like GTOD already requires).
> Also, if wrapping below 64-bits is also a problem, again, maybe we
> shouldn't be providing it there either. Eg:
That's mostly due to hysterical raisins and we should just fix that; the
simplest way is to use something like kernel/sched_clock.c to use
sched_clock() deltas to make a running u64 value.
Like said, John Stultz was already looking at doing something like that
because there's a number of architectures suffering this same problem
and they're all already using part of the clocksource infrastructure to
implement the sched_clock() interface simply because they only have a
single hardware resource.
One of the problems, I think, is the cycles2ns multiplication of the raw
clock; that makes dealing with wrap-around a lot harder, so I guess we
should deal with the wrap on the raw clock values and then apply
cycles2ns on the delta or somesuch. But I expect the clocksource
infrastructure already has something like that, John?
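(A standalone toy version of that idea, with a 16-bit counter and invented mult/shift values: the wrap is handled with modular arithmetic on the raw counter, and only the small delta is scaled to ns and accumulated into a u64. It relies on being called at least once per wrap period.)

#include <stdio.h>
#include <stdint.h>

#define COUNTER_MASK	0xffffu		/* toy 16-bit hardware counter */

static uint64_t last_cyc;		/* last raw counter sample */
static uint64_t ns_accumulated;		/* running 64-bit ns clock */

static uint64_t toy_sched_clock(uint64_t cyc, uint64_t mult, unsigned shift)
{
	uint64_t delta_cyc = (cyc - last_cyc) & COUNTER_MASK;	/* modular */

	last_cyc = cyc;
	ns_accumulated += (delta_cyc * mult) >> shift;	/* scale only the delta */
	return ns_accumulated;
}

int main(void)
{
	uint64_t mult = 5 << 8;		/* ~5ns per tick: mult=5<<8, shift=8 */
	unsigned shift = 8;

	/* samples taken often enough that the counter wraps at most once
	 * between calls; note the wrap between 0xfff0 and 0x0010 */
	uint64_t samples[] = { 0x0100, 0x8000, 0xfff0, 0x0010, 0x4000 };

	for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("cyc=%#06llx ns=%llu\n",
		       (unsigned long long)samples[i],
		       (unsigned long long)toy_sched_clock(samples[i], mult, shift));
	return 0;
}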
On Wed, Dec 08, 2010 at 03:44:19PM +0100, Peter Zijlstra wrote:
> One of the problems, I think, is the cycles2ns multiplication of the raw
> clock; that makes dealing with wrap-around a lot harder, so I guess we
> should deal with the wrap on the raw clock values and then apply
> cycles2ns on the delta or somesuch. But I expect the clocksource
> infrastructure already has something like that, John?
I've thought about that, but it becomes slightly problematic, as
was shown in one of the examples I provided. If you do the scaling
with a 64-bit multiply and shift, you're always going to end up
with less than a 64-bit result.
I think your idea makes sense though, but I think for it to be able
to cover the full 64-bit range, we need to do the wraparound handling
after scaling. So maybe something like the following:
static unsigned long long last_ns, cur_ns;
static unsigned long long max = (max_read_clock() * mult) >> shift;

unsigned long long sched_clock(void)
{
	unsigned long long cyc = read_clock();
	unsigned long long ns = (cyc * mult) >> shift;
	unsigned long long delta;

	spin_lock(&sched_clock_lock);
	/* advance since the last reading, allowing for one wrap of the
	   scaled value */
	delta = ns - last_ns;
	if (ns < last_ns)
		delta += max;
	last_ns = ns;
	ns = cur_ns += delta;
	spin_unlock(&sched_clock_lock);

	return ns;
}
2010/12/8 Peter Zijlstra <[email protected]>:
> Like said, John Stultz was already looking at doing something like that
> because there's a number of architectures suffering this same problem
> and they're all already using part of the clocksource infrastructure to
> implement the sched_clock() interface simply because they only have a
> single hardware resource.
I was in on that discussion and for the Ux500 this patch was the
outcome:
http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6488/1
It seems to work, mostly; it's based on Nico's Orion code and
just tries to make everything as explicit as possible.
Uwe has also been working on it, I think.
Yours,
Linus Walleij
On Wed, 2010-12-08 at 15:44 +0100, Peter Zijlstra wrote:
> [...]
> Like said, John Stultz was already looking at doing something like that
> because there's a number of architectures suffering this same problem
> and they're all already using part of the clocksource infrastructure to
> implement the sched_clock() interface simply because they only have a
> single hardware resource.
I'm not actively working on it right now, but trying to rework the
sched_clock code so it's more like the generic timekeeping code is on my
list (I'm looking to see if I can bump it up to the front in the near
future).
thanks
-john
As I mentioned earlier in this thread, there are other places in
the scheduler where we do curr_sched_clock - prev_sched_clock kinds of math
in u64, and this 32-bit wraparound will cause problems there as well, for
example accounting a big positive number as exec time for some task. So, we
need a generic fix for sched_clock. This particular code happened to cause
an easy-to-notice failure.
Anyways, here's the patch that should resolve the wraparound problem with
this particular piece of code. I haven't been able to test this with latest
git as my test systems fail to boot with a divide error in
select_task_rq_fair+0x5dd/0x70a
with vanilla git. That's an unrelated bug that has been reported earlier.
I have only tested these sched irq changes on an older, known-to-work kernel.
Peter: Also, this change will cause some conflict with the pending IRQ time
reporting changeset. Let me know how you want to deal with this patch and that
pending patchset.
Thanks,
Venki
[PATCH] update_rq_clock() with irq_time should handle 64 bit wraparound
update_rq_clock() with IRQ_TIME_ACCOUNTING assumed a 64-bit sched_clock that
practically will not wrap around. Change the code to handle wraparounds,
still preventing rq->clock_task from going backwards.
Signed-off-by: Venkatesh Pallipadi <[email protected]>
---
kernel/sched.c | 49 ++++++++++++++++++++++++++++++-------------------
1 files changed, 30 insertions(+), 19 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..29ce549 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -636,21 +636,13 @@ static inline struct task_group *task_group(struct task_struct *p)
#endif /* CONFIG_CGROUP_SCHED */
-static u64 irq_time_cpu(int cpu);
-static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
+static void update_rq_clock_irqtime(struct rq *rq, int cpu);
inline void update_rq_clock(struct rq *rq)
{
if (!rq->skip_clock_update) {
int cpu = cpu_of(rq);
- u64 irq_time;
-
- rq->clock = sched_clock_cpu(cpu);
- irq_time = irq_time_cpu(cpu);
- if (rq->clock - irq_time > rq->clock_task)
- rq->clock_task = rq->clock - irq_time;
-
- sched_irq_time_avg_update(rq, irq_time);
+ update_rq_clock_irqtime(rq, cpu);
}
}
@@ -1983,24 +1975,43 @@ void account_system_vtime(struct task_struct *curr)
}
EXPORT_SYMBOL_GPL(account_system_vtime);
-static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
+/*
+ * Update rq->clock_task making sure that it does not go backwards due to
+ * irq_time subtraction, allowing for possible wraparound.
+ */
+static void update_rq_clock_irqtime(struct rq *rq, int cpu)
{
- if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) {
- u64 delta_irq = curr_irq_time - rq->prev_irq_time;
- rq->prev_irq_time = curr_irq_time;
- sched_rt_avg_update(rq, delta_irq);
+ u64 curr_rq_clock = sched_clock_cpu(cpu);
+ u64 curr_irq_time = irq_time_cpu(cpu);
+
+ if (!sched_clock_irqtime) {
+ rq->clock = sched_clock_cpu(cpu);
+ rq->clock_task = rq->clock;
+ return;
}
+
+ if (unlikely(curr_irq_time - rq->prev_irq_time >
+ curr_rq_clock - rq->clock)) {
+ curr_irq_time = rq->prev_irq_time + curr_rq_clock - rq->clock;
+ }
+
+ rq->clock = curr_rq_clock;
+ rq->clock_task = rq->clock - curr_irq_time;
+
+ if (sched_feat(NONIRQ_POWER))
+ sched_rt_avg_update(rq, curr_irq_time - rq->prev_irq_time);
+
+ rq->prev_irq_time = curr_irq_time;
}
#else
-static u64 irq_time_cpu(int cpu)
+static void update_rq_clock_irqtime(struct rq *rq, int cpu)
{
- return 0;
+ rq->clock = sched_clock_cpu(cpu);
+ rq->clock_task = rq->clock;
}
-static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
-
#endif
#include "sched_idletask.c"
--
1.7.3.1
On Wed, 2010-12-08 at 15:31 -0800, Venkatesh Pallipadi wrote:
> [PATCH] update_rq_clock() with irq_time should handle 64 bit wraparound
>
> update_rq_clock() with IRQ_TIME_ACCOUNTING assumed a 64-bit sched_clock that
> practically will not wrap around. Change the code to handle wraparounds,
> still preventing rq->clock_task from going backwards.
All we should do is deal with the 64wrap, not filter backward motion, I
came up with something like the below.
I think for now the best solution to the early wrap problem for ARM is
for them to select HAVE_UNSTABLE_SCHED_CLOCK, it will mostly deal with
the short wrap by filtering out the occasional negative value.
Then later we can look at cleaning/breaking-up the kernel/sched_clock.c
code to provide smaller bits of functionality and possibly merging it
with some of the clocksource code.
---
kernel/sched.c | 54 +++++++++++++++++++++++++-----------------------------
1 files changed, 25 insertions(+), 29 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 8e885c1..0fb7de8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -642,23 +642,19 @@ static inline struct task_group *task_group(struct task_struct *p)
#endif /* CONFIG_CGROUP_SCHED */
-static u64 irq_time_cpu(int cpu);
-static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
+static void update_rq_clock_task(struct rq *rq, s64 delta);
inline void update_rq_clock(struct rq *rq)
{
- int cpu = cpu_of(rq);
- u64 irq_time;
+ s64 delta;
if (rq->skip_clock_update)
return;
- rq->clock = sched_clock_cpu(cpu);
- irq_time = irq_time_cpu(cpu);
- if (rq->clock - irq_time > rq->clock_task)
- rq->clock_task = rq->clock - irq_time;
+ delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
+ rq->clock += delta;
+ update_rq_clock_task(rq, delta);
- sched_irq_time_avg_update(rq, irq_time);
}
/*
@@ -1817,14 +1813,6 @@ void disable_sched_clock_irqtime(void)
sched_clock_irqtime = 0;
}
-static u64 irq_time_cpu(int cpu)
-{
- if (!sched_clock_irqtime)
- return 0;
-
- return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
-}
-
void account_system_vtime(struct task_struct *curr)
{
unsigned long flags;
@@ -1855,25 +1843,33 @@ void account_system_vtime(struct task_struct *curr)
}
EXPORT_SYMBOL_GPL(account_system_vtime);
-static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
+static inline u64 irq_time_cpu(int cpu)
{
- if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) {
- u64 delta_irq = curr_irq_time - rq->prev_irq_time;
- rq->prev_irq_time = curr_irq_time;
- sched_rt_avg_update(rq, delta_irq);
- }
+ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
}
-#else
-
-static u64 irq_time_cpu(int cpu)
+static void update_rq_clock_task(struct rq *rq, s64 delta)
{
- return 0;
+ s64 irq_delta;
+
+ irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time;
+ rq->prev_irq_time += irq_delta;
+
+ delta -= irq_delta;
+ rq->clock_task += delta;
+
+ if (irq_delta && sched_feat(NONIRQ_POWER))
+ sched_rt_avg_update(rq, irq_delta);
}
-static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
+#else /* CONFIG_IRQ_TIME_ACCOUNTING */
-#endif
+static inline void update_rq_clock_task(struct rq *rq, s64 delta)
+{
+ rq->clock_task += delta;
+}
+
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
#include "sched_idletask.c"
#include "sched_fair.c"
On Thu, Dec 9, 2010 at 4:52 AM, Peter Zijlstra <[email protected]> wrote:
> [...]
> All we should do is deal with the 64wrap, not filter backward motion, I
> came up with something like the below.
The original intent of this check
-	if (rq->clock - irq_time > rq->clock_task)
was to filter the potential backward motion, as otherwise downstream
users of clock_task can see huge positive numbers from curr - prev
calculations.
The same problem will be there with below code, with irq_delta >
delta, clock_task can go backwards which is not good.
+ delta -= irq_delta;
+ rq->clock_task += delta;
The reason for this is rq->clock and irqtime updates kind of happen
independently and specifically, if a rq->clock update happens while we
are in a softirq, we may have this case of going backwards on the next
update.
>
> I think for now the best solution to the early wrap problem for ARM is
> for them to select HAVE_UNSTABLE_SCHED_CLOCK, it will mostly deal with
> the short wrap by filtering out the occasional negative value.
>
> Then later we can look at cleaning/breaking-up the kernel/sched_clock.c
> code to provide smaller bits of functionality and possibly merging it
> with some of the clocksource code.
>
Yes. That should resolve the early wrap.
Thanks,
Venki
On Thu, 2010-12-09 at 09:43 -0800, Venkatesh Pallipadi wrote:
>
> The same problem will be there with below code, with irq_delta >
> delta, clock_task can go backwards which is not good.
> + delta -= irq_delta;
> + rq->clock_task += delta;
>
> The reason for this is rq->clock and irqtime updates kind of happen
> independently and specifically, if a rq->clock update happens while we
> are in a softirq, we may have this case of going backwards on the next
> update.
But how can irq_delta > delta? We measure it using the same clock.
On Thu, Dec 9, 2010 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
> [...]
> But how can irq_delta > delta? We measure it using the same clock.
>
This would be mostly a corner case like:
- softirq start time t1
- rq->clock updated at t2 and rq->clock_task updated at t2 without
accounting for current softirq
- softirq end time t3
- cpu spends most time here in softirq or hardirq
- next rq->clock update at t4 and rq->clock_task update, with delta =
t4-t2 and irq_delta ~= t4 - t1
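(Putting hypothetical numbers on that timeline, as a standalone sketch: all values invented and in ms for readability.)

#include <stdio.h>

int main(void)
{
	/* the softirq runs from t1 until just before t4, and rq->clock was
	 * last updated at t2, in the middle of the softirq */
	long long t1 = 100, t2 = 110, t4 = 200;

	long long delta     = t4 - t2;	/* rq->clock advance:  90  */
	long long irq_delta = t4 - t1;	/* irq time accrued:  100  */

	/* delta - irq_delta = -10: clock_task would move backwards */
	printf("delta=%lld irq_delta=%lld clock_task step=%lld\n",
	       delta, irq_delta, delta - irq_delta);
	return 0;
}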
On Thu, 2010-12-09 at 10:11 -0800, Venkatesh Pallipadi wrote:
> [...]
> This would be mostly a corner case like:
> - softirq start time t1
> - rq->clock updated at t2 and rq->clock_task updated at t2 without
> accounting for current softirq
> - softirq end time t3
> - cpu spends most time here in softirq or hardirq
> - next rq->clock update at t4 and rq->clock_task update, with delta =
> t4-t2 and irq_delta ~= t4 - t1
Ah, something like that would happen when we do a wakeup from
soft/hard-irq context, not an altogether uncommon occurrence.
Wouldn't that be cured by updating the irq-time when asking for it,
something like the below? (on top of my earlier patch)
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1843,8 +1843,9 @@ void account_system_vtime(struct task_st
}
EXPORT_SYMBOL_GPL(account_system_vtime);
-static inline u64 irq_time_cpu(int cpu)
+static u64 irq_time_cpu(int cpu)
{
+ account_system_vtime(current);
return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
}
On Thu, Dec 9, 2010 at 10:55 AM, Peter Zijlstra <[email protected]> wrote:
> [...]
> Ah, something like that would happen when we do a wakeup from
> soft/hard-irq context, not an altogether uncommon occurrence.
>
> Wouldn't that be cured by updating the irq-time when asking for it,
> something like the below? (on top of my earlier patch)
This should mostly work. There would be a small window between
rq->clock = sched_clock_cpu() and account_system_vtime() doing its own
sched_clock_cpu(), and also the overhead of doing this every time, when
the going-backwards case may only happen rarely.
Thanks,
Venki
>
> ---
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -1843,8 +1843,9 @@ void account_system_vtime(struct task_st
> ?}
> ?EXPORT_SYMBOL_GPL(account_system_vtime);
>
> -static inline u64 irq_time_cpu(int cpu)
> +static u64 irq_time_cpu(int cpu)
> ?{
> + ? ? ? account_system_vtime(current);
> ? ? ? ?return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
> ?}
>
>
>
On Thu, 2010-12-09 at 14:21 -0800, Venkatesh Pallipadi wrote:
> This should mostly work. There would be a small window between
> rq->clock = sched_clock_cpu() and account_system_vtime() doing its own
> sched_clock_cpu(), and also the overhead of doing this every time, when
> the going-backwards case may only happen rarely.
Right, so something like the below removes that tiny race by re-using
rq->clock as now. It also removes some overhead by removing the IRQ
fudging (we already know IRQs are disabled).
It does have a few more branches (2 afaict) than your monotonicity path
had, but it seems to me this approach is slightly more accurate.
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1813,19 +1813,10 @@ void disable_sched_clock_irqtime(void)
sched_clock_irqtime = 0;
}
-void account_system_vtime(struct task_struct *curr)
+static void __account_system_vtime(int cpu, u64 now)
{
- unsigned long flags;
- int cpu;
- u64 now, delta;
+ s64 delta;
- if (!sched_clock_irqtime)
- return;
-
- local_irq_save(flags);
-
- cpu = smp_processor_id();
- now = sched_clock_cpu(cpu);
delta = now - per_cpu(irq_start_time, cpu);
per_cpu(irq_start_time, cpu) = now;
/*
@@ -1836,16 +1827,36 @@ void account_system_vtime(struct task_st
*/
if (hardirq_count())
per_cpu(cpu_hardirq_time, cpu) += delta;
- else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
+ else if (in_serving_softirq() && !(current->flags & PF_KSOFTIRQD))
per_cpu(cpu_softirq_time, cpu) += delta;
+}
+
+void account_system_vtime(struct task_struct *curr)
+{
+ unsigned long flags;
+ u64 now;
+ int cpu;
+
+ if (!sched_clock_irqtime)
+ return;
+
+ local_irq_save(flags);
+
+ cpu = smp_processor_id();
+ now = sched_clock_cpu(cpu);
+ __account_system_vtime(cpu, now);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(account_system_vtime);
-static u64 irq_time_cpu(int cpu)
+static u64 irq_time_cpu(struct rq *rq)
{
- account_system_vtime(current);
+ int cpu = cpu_of(rq);
+
+ if (sched_clock_irqtime)
+ __account_system_vtime(cpu, rq->clock);
+
return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
}
@@ -1853,7 +1864,7 @@ static void update_rq_clock_task(struct
{
s64 irq_delta;
- irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time;
+ irq_delta = irq_time_cpu(rq) - rq->prev_irq_time;
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
On Thu, Dec 9, 2010 at 3:16 PM, Peter Zijlstra <[email protected]> wrote:
> [...]
> Right, so something like the below removes that tiny race by re-using
> rq->clock as now. It also removes some overhead by removing the IRQ
> fudging (we already know IRQs are disabled).
>
> It does have a few more branches (2 afaict) than your monotonicity path
> had, but it seems to me this approach is slightly more accurate.
>
Looks fine.
Just to make sure, update_rq_clock() always gets called on current
CPU. Right? The pending patches I have optimize
account_system_vtime() to use this_cpu_write and friends. Want to make
sure this change will still keep that optimization relevant.
On Thu, 2010-12-09 at 15:35 -0800, Venkatesh Pallipadi wrote:
>
> Just to make sure, update_rq_clock() always gets called on current
> CPU. Right?
No, specifically not. If that were the case we wouldn't need the
cross-cpu synced timestamp. Things like load-balancing and
remote-wakeups need to update a remote CPU's clock.
> The pending patches I have optimize
> account_system_vtime() to use this_cpu_write and friends. Want to make
> sure this change will still keep that optimization relevant.
Ah, good point, remote CPUs updating that will mess with the consistency
of the per-cpu timestamps due to non atomic updates :/
Bugger.. making them atomics will make it even more expensive. /me goes
ponder.
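(A toy illustration of why non-atomic remote updates of a 64-bit per-cpu timestamp are a problem on 32-bit: the two word stores are replayed deterministically here, and the struct and field names are invented; which half lands first depends on the compiler/architecture.)

#include <stdio.h>
#include <stdint.h>

struct split64 { uint32_t lo, hi; };	/* a u64 as a 32-bit CPU stores it */

static uint64_t read64(const struct split64 *v)
{
	return ((uint64_t)v->hi << 32) | v->lo;
}

int main(void)
{
	uint64_t old = 0xFFFFFFFFull;		/* ~4.29s worth of ns      */
	uint64_t new = 0x100000000ull;		/* old + 1ns, crosses 2^32 */
	struct split64 t = { (uint32_t)old, (uint32_t)(old >> 32) };

	t.lo = (uint32_t)new;	/* writer: the low word lands first...   */
				/* ...a remote reader samples here...    */
	printf("torn read: %#llx (old %#llx, new %#llx)\n",
	       (unsigned long long)read64(&t),
	       (unsigned long long)old, (unsigned long long)new);
	t.hi = (uint32_t)(new >> 32);	/* ...then the high word lands   */

	/* the torn value (0) is roughly 4.3 seconds in the past */
	return 0;
}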
On Fri, 2010-12-10 at 11:08 +0100, Peter Zijlstra wrote:
> [...]
>
> Ah, good point, remote CPUs updating that will mess with the consistency
> of the per-cpu timestamps due to non atomic updates :/
>
> Bugger.. making them atomics will make it even more expensive. /me goes
> ponder.
OK, so I ended up doing the same thing you did. Still staring at that; 32bit
will go very funny in the head once every so often. One possible
solution would be to ignore the occasional abs(irq_delta) > 2 * delta.
That would however result in an accounting discrepancy such that:
clock_task + irq_time != clock
Thoughts?
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1813,11 +1813,36 @@ void disable_sched_clock_irqtime(void)
sched_clock_irqtime = 0;
}
+static void __account_system_vtime(int cpu, u64 now)
+{
+ s64 delta;
+
+ delta = now - per_cpu(irq_start_time, cpu);
+ per_cpu(irq_start_time, cpu) = now;
+
+ if (hardirq_count())
+ per_cpu(cpu_hardirq_time, cpu) += delta;
+ /*
+ * We do not account for softirq time from ksoftirqd here. We want to
+ * continue accounting softirq time to ksoftirqd thread in that case,
+ * so as not to confuse scheduler with a special task that do not
+ * consume any time, but still wants to run.
+ */
+ else if (in_serving_softirq() && !(current->flags & PF_KSOFTIRQD))
+ per_cpu(cpu_softirq_time, cpu) += delta;
+}
+
+/*
+ * Called before incrementing preempt_count on {soft,}irq_enter
+ * and before decrementing preempt_count on {soft,}irq_exit.
+ *
+ * @curr should at all times be current.
+ */
void account_system_vtime(struct task_struct *curr)
{
unsigned long flags;
+ u64 now;
int cpu;
- u64 now, delta;
if (!sched_clock_irqtime)
return;
@@ -1826,26 +1851,21 @@ void account_system_vtime(struct task_st
cpu = smp_processor_id();
now = sched_clock_cpu(cpu);
- delta = now - per_cpu(irq_start_time, cpu);
- per_cpu(irq_start_time, cpu) = now;
- /*
- * We do not account for softirq time from ksoftirqd here.
- * We want to continue accounting softirq time to ksoftirqd thread
- * in that case, so as not to confuse scheduler with a special task
- * that do not consume any time, but still wants to run.
- */
- if (hardirq_count())
- per_cpu(cpu_hardirq_time, cpu) += delta;
- else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
- per_cpu(cpu_softirq_time, cpu) += delta;
+ __account_system_vtime(cpu, now);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(account_system_vtime);
-static u64 irq_time_cpu(int cpu)
+static u64 irq_time_cpu(struct rq *rq)
{
- account_system_vtime(current);
+ int cpu = cpu_of(rq);
+ /*
+ * See the comment in update_rq_clock_task(), ideally we'd update
+ * the *irq_time values using rq->clock here.
+ *
+ * As it stands, reading this from a remote cpu is buggy on 32bit.
+ */
return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
}
@@ -1853,9 +1873,27 @@ static void update_rq_clock_task(struct
{
s64 irq_delta;
- irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time;
- rq->prev_irq_time += irq_delta;
+ irq_delta = irq_time_cpu(rq) - rq->prev_irq_time;
+
+ /*
+ * Since irq_time is only updated on {soft,}irq_exit, we might run into
+ * this case when a previous update_rq_clock() happened inside a
+ * {soft,}irq region.
+ *
+ * When this happens, we stop ->clock_task and only update the
+ * prev_irq_time stamp to account for the part that fit, so that a next
+ * update will consume the rest. This ensures ->clock_task is
+ * monotonic.
+ *
+ * It does however cause some slight miss-attribution of {soft,}irq
+ * time, a more accurate solution would be to update the irq_time using
+ * the current rq->clock timestamp, except that would require using
+ * atomic ops.
+ */
+ if (irq_delta > delta)
+ irq_delta = delta;
+ rq->prev_irq_time += irq_delta;
delta -= irq_delta;
rq->clock_task += delta;
On Fri, 2010-12-10 at 14:17 +0100, Peter Zijlstra wrote:
>
> OK, so I ended up doing the same you did.. Still staring at that, 32bit
> will go very funny in the head once every so often. One possible
> solution would be to ignore the occasional abs(irq_delta) > 2 * delta.
>
> That would however result in an accounting discrepancy such that:
> clock_task + irq_time != clock
>
> Thoughts?
The brute force solution is a seqcount.. something like so:
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1786,21 +1786,63 @@ static void deactivate_task(struct rq *r
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
/*
- * There are no locks covering percpu hardirq/softirq time.
- * They are only modified in account_system_vtime, on corresponding CPU
- * with interrupts disabled. So, writes are safe.
+ * There are no locks covering percpu hardirq/softirq time. They are only
+ * modified in account_system_vtime, on corresponding CPU with interrupts
+ * disabled. So, writes are safe.
+ *
* They are read and saved off onto struct rq in update_rq_clock().
- * This may result in other CPU reading this CPU's irq time and can
- * race with irq/account_system_vtime on this CPU. We would either get old
- * or new value (or semi updated value on 32 bit) with a side effect of
- * accounting a slice of irq time to wrong task when irq is in progress
- * while we read rq->clock. That is a worthy compromise in place of having
- * locks on each irq in account_system_time.
+ *
+ * This may result in other CPU reading this CPU's irq time and can race with
+ * irq/account_system_vtime on this CPU. We would either get old or new value
+ * with a side effect of accounting a slice of irq time to wrong task when irq
+ * is in progress while we read rq->clock. That is a worthy compromise in place
+ * of having locks on each irq in account_system_time.
*/
static DEFINE_PER_CPU(u64, cpu_hardirq_time);
static DEFINE_PER_CPU(u64, cpu_softirq_time);
-
static DEFINE_PER_CPU(u64, irq_start_time);
+
+#ifndef CONFIG_64BIT
+static DEFINE_PER_CPU(seqcount_t, irq_time_seq);
+
+static inline void irq_time_write_begin(int cpu)
+{
+ write_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+}
+
+static inline void irq_time_write_end(int cpu)
+{
+ write_seqcount_end(&per_cpu(irq_time_seq, cpu));
+}
+
+static inline u64 irq_time_read(int cpu)
+{
+ u64 irq_time;
+ unsigned seq;
+
+ do {
+ seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+ irq_time = per_cpu(cpu_softirq_time, cpu) +
+ per_cpu(cpu_hardirq_time, cpu);
+ } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));
+
+ return irq_time;
+}
+#else /* CONFIG_64BIT */
+static inline void irq_time_write_begin(int cpu)
+{
+}
+
+static inline void irq_time_write_end(int cpu)
+{
+}
+
+static inline u64 irq_time_read(int cpu)
+{
+ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
+}
+#endif /* CONFIG_64BIT */
+
static int sched_clock_irqtime;
void enable_sched_clock_irqtime(void)
@@ -1820,6 +1862,7 @@ static void __account_system_vtime(int c
delta = now - per_cpu(irq_start_time, cpu);
per_cpu(irq_start_time, cpu) = now;
+ irq_time_write_begin(cpu);
if (hardirq_count())
per_cpu(cpu_hardirq_time, cpu) += delta;
/*
@@ -1830,6 +1873,7 @@ static void __account_system_vtime(int c
*/
else if (in_serving_softirq() && !(current->flags & PF_KSOFTIRQD))
per_cpu(cpu_softirq_time, cpu) += delta;
+ irq_time_write_end(cpu);
}
/*
@@ -1859,14 +1903,11 @@ EXPORT_SYMBOL_GPL(account_system_vtime);
static u64 irq_time_cpu(struct rq *rq)
{
- int cpu = cpu_of(rq);
/*
* See the comment in update_rq_clock_task(), ideally we'd update
* the *irq_time values using rq->clock here.
- *
- * As it stands, reading this from a remote cpu is buggy on 32bit.
*/
- return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
+ return irq_time_read(cpu_of(rq));
}
static void update_rq_clock_task(struct rq *rq, s64 delta)
Full patch..
---
Subject: sched: Fix the irqtime code to deal with u64 wraps
From: Peter Zijlstra <[email protected]>
Date: Thu Dec 09 14:15:34 CET 2010
ARM systems have a 32bit sched_clock() [ which needs to be fixed ],
but this exposed a bug in the irq_time code as well, it doesn't deal
with wraps at all.
Fix the irq_time code to deal with u64 wraps by re-writing the code to
only use delta increments, which avoids the whole issue.
Furthermore, solve the problem of 32bit arches reading partial updates
of the u64 time values.
Cc: Venkatesh Pallipadi <[email protected]>
Reported-by: Russell King - ARM Linux <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched.c | 166 +++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 116 insertions(+), 50 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -642,23 +642,19 @@ static inline struct task_group *task_gr
#endif /* CONFIG_CGROUP_SCHED */
-static u64 irq_time_cpu(int cpu);
-static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
+static void update_rq_clock_task(struct rq *rq, s64 delta);
inline void update_rq_clock(struct rq *rq)
{
- int cpu = cpu_of(rq);
- u64 irq_time;
+ s64 delta;
if (rq->skip_clock_update)
return;
- rq->clock = sched_clock_cpu(cpu);
- irq_time = irq_time_cpu(cpu);
- if (rq->clock - irq_time > rq->clock_task)
- rq->clock_task = rq->clock - irq_time;
+ delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
+ rq->clock += delta;
+ update_rq_clock_task(rq, delta);
- sched_irq_time_avg_update(rq, irq_time);
}
/*
@@ -1790,90 +1786,160 @@ static void deactivate_task(struct rq *r
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
/*
- * There are no locks covering percpu hardirq/softirq time.
- * They are only modified in account_system_vtime, on corresponding CPU
- * with interrupts disabled. So, writes are safe.
+ * There are no locks covering percpu hardirq/softirq time. They are only
+ * modified in account_system_vtime, on corresponding CPU with interrupts
+ * disabled. So, writes are safe.
+ *
* They are read and saved off onto struct rq in update_rq_clock().
- * This may result in other CPU reading this CPU's irq time and can
- * race with irq/account_system_vtime on this CPU. We would either get old
- * or new value (or semi updated value on 32 bit) with a side effect of
- * accounting a slice of irq time to wrong task when irq is in progress
- * while we read rq->clock. That is a worthy compromise in place of having
- * locks on each irq in account_system_time.
+ *
+ * This may result in other CPU reading this CPU's irq time and can race with
+ * irq/account_system_vtime on this CPU. We would either get old or new value
+ * with a side effect of accounting a slice of irq time to wrong task when irq
+ * is in progress while we read rq->clock. That is a worthy compromise in place
+ * of having locks on each irq in account_system_time.
*/
static DEFINE_PER_CPU(u64, cpu_hardirq_time);
static DEFINE_PER_CPU(u64, cpu_softirq_time);
-
static DEFINE_PER_CPU(u64, irq_start_time);
-static int sched_clock_irqtime;
-void enable_sched_clock_irqtime(void)
+#ifndef CONFIG_64BIT
+static DEFINE_PER_CPU(seqcount_t, irq_time_seq);
+
+static inline void irq_time_write_begin(int cpu)
{
- sched_clock_irqtime = 1;
+ write_seqcount_begin(&per_cpu(irq_time_seq, cpu));
}
-void disable_sched_clock_irqtime(void)
+static inline void irq_time_write_end(int cpu)
{
- sched_clock_irqtime = 0;
+ write_seqcount_end(&per_cpu(irq_time_seq, cpu));
}
-static u64 irq_time_cpu(int cpu)
+static inline u64 irq_time_read(int cpu)
{
- if (!sched_clock_irqtime)
- return 0;
+ u64 irq_time;
+ unsigned seq;
+
+ do {
+ seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+ irq_time = per_cpu(cpu_softirq_time, cpu) +
+ per_cpu(cpu_hardirq_time, cpu);
+ } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));
+
+ return irq_time;
+}
+#else /* CONFIG_64BIT */
+static inline void irq_time_write_begin(int cpu)
+{
+}
+
+static inline void irq_time_write_end(int cpu)
+{
+}
+static inline u64 irq_time_read(int cpu)
+{
return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
}
+#endif /* CONFIG_64BIT */
+static int sched_clock_irqtime;
+
+void enable_sched_clock_irqtime(void)
+{
+ sched_clock_irqtime = 1;
+}
+
+void disable_sched_clock_irqtime(void)
+{
+ sched_clock_irqtime = 0;
+}
+
+/*
+ * Called before incrementing preempt_count on {soft,}irq_enter
+ * and before decrementing preempt_count on {soft,}irq_exit.
+ */
void account_system_vtime(struct task_struct *curr)
{
unsigned long flags;
+ s64 delta;
int cpu;
- u64 now, delta;
if (!sched_clock_irqtime)
return;
local_irq_save(flags);
-
cpu = smp_processor_id();
- now = sched_clock_cpu(cpu);
- delta = now - per_cpu(irq_start_time, cpu);
- per_cpu(irq_start_time, cpu) = now;
- /*
- * We do not account for softirq time from ksoftirqd here.
- * We want to continue accounting softirq time to ksoftirqd thread
- * in that case, so as not to confuse scheduler with a special task
- * that do not consume any time, but still wants to run.
- */
+ delta = sched_clock_cpu(cpu) - per_cpu(irq_start_time, cpu);
+ per_cpu(irq_start_time, cpu) += delta;
+
+ irq_time_write_begin(cpu);
+
if (hardirq_count())
per_cpu(cpu_hardirq_time, cpu) += delta;
+ /*
+ * We do not account for softirq time from ksoftirqd here. We want to
+ * continue accounting softirq time to ksoftirqd thread in that case,
+ * so as not to confuse scheduler with a special task that do not
+ * consume any time, but still wants to run.
+ */
else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
per_cpu(cpu_softirq_time, cpu) += delta;
+ irq_time_write_end(cpu);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(account_system_vtime);
-static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
+static u64 irq_time_cpu(struct rq *rq)
{
- if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) {
- u64 delta_irq = curr_irq_time - rq->prev_irq_time;
- rq->prev_irq_time = curr_irq_time;
- sched_rt_avg_update(rq, delta_irq);
- }
+ /*
+ * See the comment in update_rq_clock_task(), ideally we'd update
+ * the *irq_time values using rq->clock here.
+ */
+ return irq_time_read(cpu_of(rq));
}
-#else
-
-static u64 irq_time_cpu(int cpu)
+static void update_rq_clock_task(struct rq *rq, s64 delta)
{
- return 0;
+ s64 irq_delta;
+
+ irq_delta = irq_time_cpu(rq) - rq->prev_irq_time;
+
+ /*
+ * Since irq_time is only updated on {soft,}irq_exit, we might run into
+ * this case when a previous update_rq_clock() happened inside a
+ * {soft,}irq region.
+ *
+ * When this happens, we stop ->clock_task and only update the
+ * prev_irq_time stamp to account for the part that fit, so that a next
+ * update will consume the rest. This ensures ->clock_task is
+ * monotonic.
+ *
+ * It does however cause some slight miss-attribution of {soft,}irq
+ * time, a more accurate solution would be to update the irq_time using
+ * the current rq->clock timestamp, except that would require using
+ * atomic ops.
+ */
+ if (irq_delta > delta)
+ irq_delta = delta;
+
+ rq->prev_irq_time += irq_delta;
+ delta -= irq_delta;
+ rq->clock_task += delta;
+
+ if (irq_delta && sched_feat(NONIRQ_POWER))
+ sched_rt_avg_update(rq, irq_delta);
}
-static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
+#else /* CONFIG_IRQ_TIME_ACCOUNTING */
-#endif
+static inline void update_rq_clock_task(struct rq *rq, s64 delta)
+{
+ rq->clock_task += delta;
+}
+
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
#include "sched_idletask.c"
#include "sched_fair.c"
On Fri, Dec 10, 2010 at 02:47:46PM +0100, Peter Zijlstra wrote:
>
> Full patch..
>
> ---
> Subject: sched: Fix the irqtime code to deal with u64 wraps
> From: Peter Zijlstra <[email protected]>
> Date: Thu Dec 09 14:15:34 CET 2010
>
> ARM systems have a 32bit sched_clock() [ which needs to be fixed ],
> but this exposed a bug in the irq_time code as well, it doesn't deal
> with wraps at all.
>
> Fix the irq_time code to deal with u64 wraps by re-writing the code to
> only use delta increments, which avoids the whole issue.
>
> Furthermore, solve the problem of 32bit arches reading partial updates
> of the u64 time values.
>
> Cc: Venkatesh Pallipadi <[email protected]>
> Reported-by: Russell King - ARM Linux <[email protected]>
I think credit should go to Mikael Pettersson, who identified the
interactivity regression and problematical commit. I only pointed
out the dubious nature of the code.
On Fri, 2010-12-10 at 16:50 +0000, Russell King - ARM Linux wrote:
> > Reported-by: Russell King - ARM Linux <[email protected]>
>
> I think credit should go to Mikael Pettersson, who identified the
> interactivity regression and problematical commit. I only pointed
> out the dubious nature of the code.
Ah, thanks, missed that. Updated the patch.
On Fri, 2010-12-10 at 14:47 +0100, Peter Zijlstra wrote:
> Full patch..
> void account_system_vtime(struct task_struct *curr)
> {
> unsigned long flags;
> + s64 delta;
> int cpu;
> - u64 now, delta;
>
> if (!sched_clock_irqtime)
> return;
>
> local_irq_save(flags);
> -
> cpu = smp_processor_id();
> - now = sched_clock_cpu(cpu);
> - delta = now - per_cpu(irq_start_time, cpu);
> - per_cpu(irq_start_time, cpu) = now;
> - /*
> - * We do not account for softirq time from ksoftirqd here.
> - * We want to continue accounting softirq time to ksoftirqd thread
> - * in that case, so as not to confuse scheduler with a special task
> - * that do not consume any time, but still wants to run.
> - */
> + delta = sched_clock_cpu(cpu) - per_cpu(irq_start_time, cpu);
> + per_cpu(irq_start_time, cpu) += delta;
We should add some checkpatch warning to any new per_cpu() use.
delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
__this_cpu_add(irq_start_time, delta);
Also irq_time_write_begin() and irq_time_write_end() could be faster
(called for current cpu)
static inline void irq_time_write_begin(void)
{
__this_cpu_inc(irq_time_seq.sequence);
smp_wmb();
}
static inline void irq_time_write_end(void)
{
smp_wmb();
__this_cpu_inc(irq_time_seq.sequence);
}
On Fri, 2010-12-10 at 18:18 +0100, Eric Dumazet wrote:
> On Fri, 2010-12-10 at 14:47 +0100, Peter Zijlstra wrote:
> Also irq_time_write_begin() and irq_time_write_end() could be faster
> (called for current cpu)
>
> static inline void irq_time_write_begin(void)
> {
> __this_cpu_inc(irq_time_seq.sequence);
> smp_wmb();
> }
>
> static inline void irq_time_write_end(void)
> {
> smp_wmb();
> __this_cpu_inc(irq_time_seq.sequence);
> }
Yeah, but that kinda defeats the purpose of having it implemented in
seqlock.h. Ideally we'd teach gcc about these long pointers and have
something like:
write_seqcount_begin(&this_cpu_read(irq_time_seq));
do the right thing.
On Fri, Dec 10, 2010 at 02:47:46PM +0100, Peter Zijlstra wrote:
> inline void update_rq_clock(struct rq *rq)
> {
> - int cpu = cpu_of(rq);
> - u64 irq_time;
> + s64 delta;
>
> if (rq->skip_clock_update)
> return;
>
> - rq->clock = sched_clock_cpu(cpu);
> - irq_time = irq_time_cpu(cpu);
> - if (rq->clock - irq_time > rq->clock_task)
> - rq->clock_task = rq->clock - irq_time;
> + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
> + rq->clock += delta;
Hmm. Can you tell me how this is different to:
new_clock = sched_clock_cpu(cpu_of(rq));
delta = new_clock - rq->clock;
rq->clock = new_clock;
which I think may be simpler in terms of 64-bit math for 32-bit compilers
to deal with?
In terms of the wrap-around, I don't see this as any different from the
above, as:
rq->clock += sched_clock_cpu(cpu_of(rq)) - rq->clock;
rq->clock = rq->clock + sched_clock_cpu(cpu_of(rq)) - rq->clock;
rq->clock = sched_clock_cpu(cpu_of(rq));
On Fri, 2010-12-10 at 17:56 +0000, Russell King - ARM Linux wrote:
> On Fri, Dec 10, 2010 at 02:47:46PM +0100, Peter Zijlstra wrote:
> > inline void update_rq_clock(struct rq *rq)
> > {
> > - int cpu = cpu_of(rq);
> > - u64 irq_time;
> > + s64 delta;
> >
> > if (rq->skip_clock_update)
> > return;
> >
> > - rq->clock = sched_clock_cpu(cpu);
> > - irq_time = irq_time_cpu(cpu);
> > - if (rq->clock - irq_time > rq->clock_task)
> > - rq->clock_task = rq->clock - irq_time;
> > + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
> > + rq->clock += delta;
>
> Hmm. Can you tell me how this is different to:
>
> new_clock = sched_clock_cpu(cpu_of(rq));
> delta = new_clock - rq->clock;
> rq->clock = new_clock;
>
> which I think may be simpler in terms of 64-bit math for 32-bit compilers
> to deal with?
Its not, I could write it like that, the only reason I didn't is because
it uses an extra variable. If gcc on 32bit targets really generates
hideous code for it I'll happily change it.
> In terms of the wrap-around, I don't see this as any different from the
> above, as:
>
> rq->clock += sched_clock_cpu(cpu_of(rq)) - rq->clock;
> rq->clock = rq->clock + sched_clock_cpu(cpu_of(rq)) - rq->clock;
> rq->clock = sched_clock_cpu(cpu_of(rq));
Correct, its not different. Nor was it meant to be. The only problem it
solves is the u64 wrap failure in:
if (rq->clock - irq_time > rq->clock_task)
There are lots of places in the scheduler that rely on u64 wrap, for now
the easiest thing for ARM would be to select HAVE_UNSTABLE_SCHED_CLOCK
for those platforms that implement a short sched_clock().
While that isn't ideal it is something that makes it work, we can work
on something more suitable for future kernels.
On Fri, 2010-12-10 at 18:49 +0100, Peter Zijlstra wrote:
> On Fri, 2010-12-10 at 18:18 +0100, Eric Dumazet wrote:
> > On Fri, 2010-12-10 at 14:47 +0100, Peter Zijlstra wrote:
>
> > Also irq_time_write_begin() and irq_time_write_end() could be faster
> > (called for current cpu)
> >
> > static inline void irq_time_write_begin(void)
> > {
> > __this_cpu_inc(irq_time_seq.sequence);
> > smp_wmb();
> > }
> >
> > static inline void irq_time_write_end(void)
> > {
> > smp_wmb();
> > __this_cpu_inc(irq_time_seq.sequence);
> > }
>
> Yeah, but that kinda defeats the purpose of having it implemented in
> seqlock.h. Ideally we'd teach gcc about these long pointers and have
> something like:
>
> write_seqcount_begin(&this_cpu_read(irq_time_seq));
>
> do the right thing.
gcc wont be able to do this yet (%fs/%gs selectors)
But we can provide this_cpu_write_seqcount_{begin|end}()
static inline void this_cpu_write_seqcount_begin(seqcount_t *s)
{
__this_cpu_inc(s->sequence);
smp_wmb();
}
On Fri, 10 Dec 2010, Eric Dumazet wrote:
> > Yeah, but that kinda defeats the purpose of having it implemented in
> > seqlock.h. Ideally we'd teach gcc about these long pointers and have
> > something like:
> >
> > write_seqcount_begin(&this_cpu_read(irq_time_seq));
> >
> > do the right thing.
>
> gcc wont be able to do this yet (%fs/%gs selectors)
The kernel can do that using the __percpu annotation.
> But we can provide this_cpu_write_seqcount_{begin|end}()
No we cannot do that. this_cpu ops are for per cpu data and not for locking
values shared between processors. We have a mechanism for passing per cpu
pointers with a corresponding annotation.
> static inline void this_cpu_write_seqcount_begin(seqcount_t *s)
^^^ Would have to be seqcount_t __percpu *s
> {
> __this_cpu_inc(s->sequence);
> smp_wmb();
> }
On Fri, 2010-12-10 at 19:10 +0100, Peter Zijlstra wrote:
> There are lots of places in the scheduler that rely on u64 wrap, for now
> the easiest thing for ARM would be to select HAVE_UNSTABLE_SCHED_CLOCK
> for those platforms that implement a short sched_clock().
>
> While that isn't ideal it is something that makes it work, we can work
> on something more suitable for future kernels.
Either that, or the thing you proposed a while back in this thread.
Since ARM doesn't have NMIs something like that should work just fine.
On Fri, 2010-12-10 at 12:39 -0600, Christoph Lameter wrote:
> On Fri, 10 Dec 2010, Eric Dumazet wrote:
>
> > > Yeah, but that kinda defeats the purpose of having it implemented in
> > > seqlock.h. Ideally we'd teach gcc about these long pointers and have
> > > something like:
> > >
> > > write_seqcount_begin(&this_cpu_read(irq_time_seq));
> > >
> > > do the right thing.
> >
> > gcc wont be able to do this yet (%fs/%gs selectors)
>
> The kernel can do that using the __percpu annotation.
That's not true:
# define __percpu
Its a complete NOP.
> > But we can provide this_cpu_write_seqcount_{begin|end}()
>
> No we cannot do that. this_cpu ops are for per cpu data and not for locking
> values shared between processors. We have a mechanism for passing per cpu
> pointers with a corresponding annotation.
-enoparse, its not locking anything, is a per-cpu sequence count.
On Fri, 2010-12-10 at 19:10 +0100, Peter Zijlstra wrote:
> > > + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
> > > + rq->clock += delta;
> >
> > Hmm. Can you tell me how this is different to:
> >
> > new_clock = sched_clock_cpu(cpu_of(rq));
> > delta = new_clock - rq->clock;
> > rq->clock = new_clock;
> >
> > which I think may be simpler in terms of 64-bit math for 32-bit compilers
> > to deal with?
>
> Its not, I could write it like that, the only reason I didn't is because
> it uses an extra variable. If gcc on 32bit targets really generates
> hideous code for it I'll happily change it.
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1863,6 +1863,7 @@ void account_system_vtime(struct task_st
{
unsigned long flags;
s64 delta;
+ u64 now;
int cpu;
if (!sched_clock_irqtime)
@@ -1870,8 +1871,9 @@ void account_system_vtime(struct task_st
local_irq_save(flags);
cpu = smp_processor_id();
- delta = sched_clock_cpu(cpu) - per_cpu(irq_start_time, cpu);
- per_cpu(irq_start_time, cpu) += delta;
+ now = sched_clock_cpu(cpu);
+ delta = now - per_cpu(irq_start_time, cpu);
+ per_cpu(irq_start_time, cpu) = now;
irq_time_write_begin(cpu);
On i386 (gcc (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5):
Before: account_system_vtime: 160 bytes
After: account_system_vtime: 214 bytes
On Fri, Dec 10, 2010 at 07:10:54PM +0100, Peter Zijlstra wrote:
> On Fri, 2010-12-10 at 17:56 +0000, Russell King - ARM Linux wrote:
> > On Fri, Dec 10, 2010 at 02:47:46PM +0100, Peter Zijlstra wrote:
> > > inline void update_rq_clock(struct rq *rq)
> > > {
> > > - int cpu = cpu_of(rq);
> > > - u64 irq_time;
> > > + s64 delta;
> > >
> > > if (rq->skip_clock_update)
> > > return;
> > >
> > > - rq->clock = sched_clock_cpu(cpu);
> > > - irq_time = irq_time_cpu(cpu);
> > > - if (rq->clock - irq_time > rq->clock_task)
> > > - rq->clock_task = rq->clock - irq_time;
> > > + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
> > > + rq->clock += delta;
> >
> > Hmm. Can you tell me how this is different to:
> >
> > new_clock = sched_clock_cpu(cpu_of(rq));
> > delta = new_clock - rq->clock;
> > rq->clock = new_clock;
> >
> > which I think may be simpler in terms of 64-bit math for 32-bit compilers
> > to deal with?
>
> Its not, I could write it like that, the only reason I didn't is because
> it uses an extra variable. If gcc on 32bit targets really generates
> hideous code for it I'll happily change it.
Well, I can't tell you what kind of code this produces on ARM, as it
doesn't appear to apply to any kernel I've tried. So, I assume it's
against some scheduler development tree rather than Linus' tree?
On Fri, 2010-12-10 at 19:17 +0000, Russell King - ARM Linux wrote:
>
>
> Well, I can't tell you what kind of code this produces on ARM, as it
> doesn't appear to apply to any kernel I've tried. So, I assume it's
> against some scheduler development tree rather than Linus' tree?
Ah yes, my bad, there's some change that got in the way.
---
Subject: sched: Fix the irqtime code to deal with u64 wraps
From: Peter Zijlstra <[email protected]>
Date: Thu Dec 09 14:15:34 CET 2010
ARM systems have a 32bit sched_clock() [ which needs to be fixed ],
but this exposed a bug in the irq_time code as well, it doesn't deal
with wraps at all.
Fix the irq_time code to deal with u64 wraps by re-writing the code to
only use delta increments, which avoids the whole issue.
Furthermore, solve the problem of 32bit arches reading partial updates
of the u64 time values.
Cc: Venkatesh Pallipadi <[email protected]>
Reported-by: Mikael Pettersson <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <new-submission>
---
kernel/sched.c | 172 +++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 119 insertions(+), 53 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -636,22 +636,18 @@ static inline struct task_group *task_gr
#endif /* CONFIG_CGROUP_SCHED */
-static u64 irq_time_cpu(int cpu);
-static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
+static void update_rq_clock_task(struct rq *rq, s64 delta);
-inline void update_rq_clock(struct rq *rq)
+static void update_rq_clock(struct rq *rq)
{
- if (!rq->skip_clock_update) {
- int cpu = cpu_of(rq);
- u64 irq_time;
+ s64 delta;
- rq->clock = sched_clock_cpu(cpu);
- irq_time = irq_time_cpu(cpu);
- if (rq->clock - irq_time > rq->clock_task)
- rq->clock_task = rq->clock - irq_time;
+ if (rq->skip_clock_update)
+ return;
- sched_irq_time_avg_update(rq, irq_time);
- }
+ delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
+ rq->clock += delta;
+ update_rq_clock_task(rq, delta);
}
/*
@@ -1918,90 +1914,160 @@ static void deactivate_task(struct rq *r
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
/*
- * There are no locks covering percpu hardirq/softirq time.
- * They are only modified in account_system_vtime, on corresponding CPU
- * with interrupts disabled. So, writes are safe.
+ * There are no locks covering percpu hardirq/softirq time. They are only
+ * modified in account_system_vtime, on corresponding CPU with interrupts
+ * disabled. So, writes are safe.
+ *
* They are read and saved off onto struct rq in update_rq_clock().
- * This may result in other CPU reading this CPU's irq time and can
- * race with irq/account_system_vtime on this CPU. We would either get old
- * or new value (or semi updated value on 32 bit) with a side effect of
- * accounting a slice of irq time to wrong task when irq is in progress
- * while we read rq->clock. That is a worthy compromise in place of having
- * locks on each irq in account_system_time.
+ *
+ * This may result in other CPU reading this CPU's irq time and can race with
+ * irq/account_system_vtime on this CPU. We would either get old or new value
+ * with a side effect of accounting a slice of irq time to wrong task when irq
+ * is in progress while we read rq->clock. That is a worthy compromise in place
+ * of having locks on each irq in account_system_time.
*/
static DEFINE_PER_CPU(u64, cpu_hardirq_time);
static DEFINE_PER_CPU(u64, cpu_softirq_time);
-
static DEFINE_PER_CPU(u64, irq_start_time);
-static int sched_clock_irqtime;
-void enable_sched_clock_irqtime(void)
+#ifndef CONFIG_64BIT
+static DEFINE_PER_CPU(seqcount_t, irq_time_seq);
+
+static inline void irq_time_write_begin(int cpu)
{
- sched_clock_irqtime = 1;
+ write_seqcount_begin(&per_cpu(irq_time_seq, cpu));
}
-void disable_sched_clock_irqtime(void)
+static inline void irq_time_write_end(int cpu)
{
- sched_clock_irqtime = 0;
+ write_seqcount_end(&per_cpu(irq_time_seq, cpu));
}
-static u64 irq_time_cpu(int cpu)
+static inline u64 irq_time_read(int cpu)
{
- if (!sched_clock_irqtime)
- return 0;
+ u64 irq_time;
+ unsigned seq;
+
+ do {
+ seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu));
+ irq_time = per_cpu(cpu_softirq_time, cpu) +
+ per_cpu(cpu_hardirq_time, cpu);
+ } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq));
+
+ return irq_time;
+}
+#else /* CONFIG_64BIT */
+static inline void irq_time_write_begin(int cpu)
+{
+}
+
+static inline void irq_time_write_end(int cpu)
+{
+}
+static inline u64 irq_time_read(int cpu)
+{
return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
}
+#endif /* CONFIG_64BIT */
+static int sched_clock_irqtime;
+
+void enable_sched_clock_irqtime(void)
+{
+ sched_clock_irqtime = 1;
+}
+
+void disable_sched_clock_irqtime(void)
+{
+ sched_clock_irqtime = 0;
+}
+
+/*
+ * Called before incrementing preempt_count on {soft,}irq_enter
+ * and before decrementing preempt_count on {soft,}irq_exit.
+ */
void account_system_vtime(struct task_struct *curr)
{
unsigned long flags;
+ s64 delta;
int cpu;
- u64 now, delta;
if (!sched_clock_irqtime)
return;
local_irq_save(flags);
-
cpu = smp_processor_id();
- now = sched_clock_cpu(cpu);
- delta = now - per_cpu(irq_start_time, cpu);
- per_cpu(irq_start_time, cpu) = now;
- /*
- * We do not account for softirq time from ksoftirqd here.
- * We want to continue accounting softirq time to ksoftirqd thread
- * in that case, so as not to confuse scheduler with a special task
- * that do not consume any time, but still wants to run.
- */
+ delta = sched_clock_cpu(cpu) - per_cpu(irq_start_time, cpu);
+ per_cpu(irq_start_time, cpu) += delta;
+
+ irq_time_write_begin(cpu);
+
if (hardirq_count())
per_cpu(cpu_hardirq_time, cpu) += delta;
+ /*
+ * We do not account for softirq time from ksoftirqd here. We want to
+ * continue accounting softirq time to ksoftirqd thread in that case,
+ * so as not to confuse scheduler with a special task that do not
+ * consume any time, but still wants to run.
+ */
else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
per_cpu(cpu_softirq_time, cpu) += delta;
+ irq_time_write_end(cpu);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(account_system_vtime);
-static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time)
+static u64 irq_time_cpu(struct rq *rq)
{
- if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) {
- u64 delta_irq = curr_irq_time - rq->prev_irq_time;
- rq->prev_irq_time = curr_irq_time;
- sched_rt_avg_update(rq, delta_irq);
- }
+ /*
+ * See the comment in update_rq_clock_task(), ideally we'd update
+ * the *irq_time values using rq->clock here.
+ */
+ return irq_time_read(cpu_of(rq));
}
-#else
-
-static u64 irq_time_cpu(int cpu)
+static void update_rq_clock_task(struct rq *rq, s64 delta)
{
- return 0;
+ s64 irq_delta;
+
+ irq_delta = irq_time_cpu(rq) - rq->prev_irq_time;
+
+ /*
+ * Since irq_time is only updated on {soft,}irq_exit, we might run into
+ * this case when a previous update_rq_clock() happened inside a
+ * {soft,}irq region.
+ *
+ * When this happens, we stop ->clock_task and only update the
+ * prev_irq_time stamp to account for the part that fit, so that a next
+ * update will consume the rest. This ensures ->clock_task is
+ * monotonic.
+ *
+ * It does however cause some slight miss-attribution of {soft,}irq
+ * time, a more accurate solution would be to update the irq_time using
+ * the current rq->clock timestamp, except that would require using
+ * atomic ops.
+ */
+ if (irq_delta > delta)
+ irq_delta = delta;
+
+ rq->prev_irq_time += irq_delta;
+ delta -= irq_delta;
+ rq->clock_task += delta;
+
+ if (irq_delta && sched_feat(NONIRQ_POWER))
+ sched_rt_avg_update(rq, irq_delta);
}
-static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { }
+#else /* CONFIG_IRQ_TIME_ACCOUNTING */
-#endif
+static inline void update_rq_clock_task(struct rq *rq, s64 delta)
+{
+ rq->clock_task += delta;
+}
+
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
#include "sched_idletask.c"
#include "sched_fair.c"
On Fri, 10 Dec 2010, Peter Zijlstra wrote:
> > > gcc wont be able to do this yet (%fs/%gs selectors)
> >
> > The kernel can do that using the __percpu annotation.
>
> That's not true:
>
> # define __percpu
>
> Its a complete NOP.
The annotation serves for sparse checking. .... If you do not care about
those checks then you can simply pass a percpu pointer in the same form as
a regular pointer.
> > > But we can provide this_cpu_write_seqcount_{begin|end}()
> >
> > > No we cannot do that. this_cpu ops are for per cpu data and not for locking
> > values shared between processors. We have a mechanism for passing per cpu
> > pointers with a corresponding annotation.
>
> -enoparse, its not locking anything, is a per-cpu sequence count.
seqlocks are for synchronization of objects on different processors.
Seems that you do not have that use case in mind. So a seqlock restricted
to a single processor? If so then you wont need any of those smp write
barriers mentioned earlier. A simple compiler barrier() is sufficient.
On Fri, 2010-12-10 at 13:51 -0600, Christoph Lameter wrote:
> On Fri, 10 Dec 2010, Peter Zijlstra wrote:
>
> > > > gcc wont be able to do this yet (%fs/%gs selectors)
> > >
> > > The kernel can do that using the __percpu annotation.
> >
> > That's not true:
> >
> > # define __percpu
> >
> > Its a complete NOP.
>
> The annotation serves for sparse checking. .... If you do not care about
> those checks then you can simply pass a percpu pointer in the same form as
> a regular pointer.
Its not about passing per-cpu pointers, its about passing long pointers.
When I write:
void foo(u64 *bla)
{
(*bla)++;
}
DEFINE_PER_CPU(u64, plop);
void bar(void)
{
foo(__this_cpu_ptr(plop));
}
I want gcc to emit the equivalent to:
__this_cpu_inc(plop); /* incq %fs:(%0) */
Now I guess the C type system will get in the way of this ever working,
since a long pointer would have a distinct type from a regular
pointer :/
The idea is to use 'regular' functions with the per-cpu data in a
transparent manner so as not to have to replicate all logic.
> > > > But we can provide this_cpu_write_seqcount_{begin|end}()
> > >
> > > No we cannot do that. this_cpu ops are for per cpu data and not for locking
> > > values shared between processors. We have a mechanism for passing per cpu
> > > pointers with a corresponding annotation.
> >
> > -enoparse, its not locking anything, is a per-cpu sequence count.
>
> seqlocks are for synchronization of objects on different processors.
>
> Seems that you do not have that use case in mind. So a seqlock restricted
> to a single processor? If so then you wont need any of those smp write
> barriers mentioned earlier. A simple compiler barrier() is sufficient.
The seqcount is sometimes read by different CPUs, but I don't see why we
couldn't do what Eric suggested.
On Fri, 10 Dec 2010, Peter Zijlstra wrote:
> Its not about passing per-cpu pointers, its about passing long pointers.
>
> When I write:
>
> void foo(u64 *bla)
> {
> (*bla)++;
> }
>
> DEFINE_PER_CPU(u64, plop);
>
> void bar(void)
> {
> foo(__this_cpu_ptr(plop));
> }
>
> I want gcc to emit the equivalent to:
>
> __this_cpu_inc(plop); /* incq %fs:(%0) */
>
> Now I guess the C type system will get in the way of this ever working,
> since a long pointer would have a distinct type from a regular
> pointer :/
>
> The idea is to use 'regular' functions with the per-cpu data in a
> transparent manner so as not to have to replicate all logic.
That would mean you would have to pass information in the pointer at
runtime indicating that this particular pointer is a per cpu pointer.
Code for the Itanium arch can do that because it has per cpu virtual
mappings. So you define a virtual area for per cpu data and then map it
differently for each processor. If we would have a different page table
for each processor then we could avoid using segment register and do the
same on x86.
> > Seems that you do not have that use case in mind. So a seqlock restricted
> > to a single processor? If so then you wont need any of those smp write
> > barriers mentioned earlier. A simple compiler barrier() is sufficient.
>
> The seqcount is sometimes read by different CPUs, but I don't see why we
> couldn't do what Eric suggested.
But you would have to define a per cpu seqlock. Each cpu would have
its own seqlock. Then you could have this_cpu_read_seqcount_begin and
friends:
DEFINE_PER_CPU(seqcount, bla);
/* Start of read using pointer to a sequence counter only. */
static inline unsigned this_cpu_read_seqcount_begin(const seqcount_t __percpu *s)
{
unsigned ret;

/* No other processor can be using this lock since it is per cpu */
ret = this_cpu_read(s->sequence);
barrier();
return ret;
}
/*
* Test if reader processed invalid data because sequence number has changed.
*/
static inline int this_cpu_read_seqcount_retry(const seqcount_t __percpu *s, unsigned start)
{
barrier();
return this_cpu_read(s->sequence) != start;
}
/*
* Sequence counter only version assumes that callers are using their
* own mutexing.
*/
static inline void this_cpu_write_seqcount_begin(seqcount_t __percpu *s)
{
__this_cpu_inc(s->sequence);
barrier();
}
static inline void this_cpu_write_seqcount_end(seqcount_t __percpu *s)
{
__this_cpu_inc(s->sequence);
barrier();
}
Then you can do
this_cpu_read_seqcount_begin(&bla)
...
But then this seemed to be a discussion related to ARM. ARM does not have
optimized per cpu accesses.
On Fri, 2010-12-10 at 14:23 -0600, Christoph Lameter wrote:
> On Fri, 10 Dec 2010, Peter Zijlstra wrote:
>
> > Its not about passing per-cpu pointers, its about passing long pointers.
> >
> > When I write:
> >
> > void foo(u64 *bla)
> > {
> > (*bla)++;
> > }
> >
> > DEFINE_PER_CPU(u64, plop);
> >
> > void bar(void)
> > {
> > foo(__this_cpu_ptr(plop));
> > }
> >
> > I want gcc to emit the equivalent to:
> >
> > __this_cpu_inc(plop); /* incq %fs:(%0) */
> >
> > Now I guess the C type system will get in the way of this ever working,
> > since a long pointer would have a distinct type from a regular
> > pointer :/
> >
> > The idea is to use 'regular' functions with the per-cpu data in a
> > transparent manner so as not to have to replicate all logic.
>
> That would mean you would have to pass information in the pointer at
> runtime indicating that this particular pointer is a per cpu pointer.
>
> Code for the Itanium arch can do that because it has per cpu virtual
> mappings. So you define a virtual area for per cpu data and then map it
> differently for each processor. If we would have a different page table
> for each processor then we could avoid using segment register and do the
> same on x86.
I don't think its a runtime issue, its a compile time issue. At compile
time the compiler can see the argument is a long pointer:
%fs:(addr,idx,size), and could propagate this into the caller.
The above example will compute the effective address by doing something
like:
lea %fs:(addr,idx,size),%ebx
and will then do something like
inc (%ebx)
Where it could easily have optimized this into:
inc %fs:(addr,idx,size)
esp when foo would be inlined. If its an actual call-site you need
function overloading because a long pointer has a different signature
from a regular pointer, and that is something C doesn't do.
> > > Seems that you do not have that use case in mind. So a seqlock restricted
> > > to a single processor? If so then you wont need any of those smp write
> > > barriers mentioned earlier. A simple compiler barrier() is sufficient.
> >
> > The seqcount is sometimes read by different CPUs, but I don't see why we
> > couldn't do what Eric suggested.
>
> But you would have to define a per cpu seqlock. Each cpu would have
> its own seqlock. Then you could have this_cpu_read_seqcount_begin and
> friends:
>
> Then you can do
>
> this_cpu_read_seqcount_begin(&bla)
>
Which to me seems to be exactly what Eric proposed..
> But then this seemed to be a discussion related to ARM. ARM does not have
> optimized per cpu accesses.
Nah, there's multiple issues all nicely mangled into one thread ;-)
On Fri, 2010-12-10 at 14:23 -0600, Christoph Lameter wrote:
> On Fri, 10 Dec 2010, Peter Zijlstra wrote:
>
> > Its not about passing per-cpu pointers, its about passing long pointers.
> >
> > When I write:
> >
> > void foo(u64 *bla)
> > {
> > (*bla)++;
> > }
> >
> > DEFINE_PER_CPU(u64, plop);
> >
> > void bar(void)
> > {
> > foo(__this_cpu_ptr(plop));
> > }
> >
> > I want gcc to emit the equivalent to:
> >
> > __this_cpu_inc(plop); /* incq %fs:(%0) */
> >
> > Now I guess the C type system will get in the way of this ever working,
> > since a long pointer would have a distinct type from a regular
> > pointer :/
> >
> > The idea is to use 'regular' functions with the per-cpu data in a
> > transparent manner so as not to have to replicate all logic.
>
> That would mean you would have to pass information in the pointer at
> runtime indicating that this particular pointer is a per cpu pointer.
>
> Code for the Itanium arch can do that because it has per cpu virtual
> mappings. So you define a virtual area for per cpu data and then map it
> differently for each processor. If we would have a different page table
> for each processor then we could avoid using segment register and do the
> same on x86.
>
> > > Seems that you do not have that use case in mind. So a seqlock restricted
> > > to a single processor? If so then you wont need any of those smp write
> > > barriers mentioned earlier. A simple compiler barrier() is sufficient.
> >
> > The seqcount is sometimes read by different CPUs, but I don't see why we
> > couldn't do what Eric suggested.
>
> But you would have to define a per cpu seqlock. Each cpu would have
> its own seqlock. Then you could have this_cpu_read_seqcount_begin and
> friends:
>
>
Yes. It was the idea.
> DEFINE_PER_CPU(seqcount, bla);
>
>
This is in Peter's patch :)
>
>
> /* Start of read using pointer to a sequence counter only. */
> static inline unsigned this_cpu_read_seqcount_begin(const seqcount_t __percpu *s)
> {
> unsigned ret;
>
> /* No other processor can be using this lock since it is per cpu */
> ret = this_cpu_read(s->sequence);
> barrier();
> return ret;
> }
>
> /*
> * Test if reader processed invalid data because sequence number has changed.
> */
> static inline int this_cpu_read_seqcount_retry(const seqcount_t __percpu *s, unsigned start)
> {
> barrier();
> return this_cpu_read(s->sequence) != start;
> }
>
>
> /*
> * Sequence counter only version assumes that callers are using their
> * own mutexing.
> */
> static inline void this_cpu_write_seqcount_begin(seqcount_t __percpu *s)
> {
> __this_cpu_inc(s->sequence);
> barrier();
> }
>
> static inline void this_cpu_write_seqcount_end(seqcount_t __percpu *s)
> {
> __this_cpu_inc(s->sequence);
> barrier();
> }
>
>
> Then you can do
>
> this_cpu_read_seqcount_begin(&bla)
>
> ...
This was exactly my suggestion Christoph.
I am glad you understand it now.
On Fri, 2010-12-10 at 21:39 +0100, Eric Dumazet wrote:
> This was exactly my suggestion Christoph.
>
> I am glad you understand it now.
>
>
By the way, we need smp_wmb(), not barrier(), even if only the "owner cpu"
can write into its 'percpu' seqcount.
There is nothing special about a seqcount being percpu or a 'global'
one. We must have same memory barrier semantics.
this_cpu_write_seqcount_begin(&myseqcount);
this_cpu_add(mydata1, add1);
this_cpu_add(mydata2, add2);
this_cpu_inc(mydata3);
this_cpu_write_seqcount_end(&myseqcount);
We protect the data[1,2,3] set with a seqcount, so need smp_wmb() in
both _begin() and _end()
On Fri, 10 Dec 2010, Eric Dumazet wrote:
>
> By the way, we need smp_wmb(), not barrier(), even if only the "owner cpu"
> can write into its 'percpu' seqcount.
>
> There is nothing special about a seqcount being percpu or a 'global'
> one. We must have same memory barrier semantics.
There is certainly a major difference in that execution of a stream of
instructions on the same cpu is guaranteed to have a coherent view of
the data. That is not affected by interrupts etc.
>
> this_cpu_write_seqcount_begin(&myseqcount);
> this_cpu_add(mydata1, add1);
> this_cpu_add(mydata2, add2);
> this_cpu_inc(mydata3);
> this_cpu_write_seqcount_end(&myseqcount);
>
> We protect the data[1,2,3] set with a seqcount, so need smp_wmb() in
> both _begin() and _end()
There is nothing to protect there since processing is on the same cpu. The
data coherency guarantees of the processor will not allow anything out of
sequence to affect execution. An interrupt f.e. will not cause updates to
mydata1 to get lost.
On Fri, 2010-12-10 at 15:09 -0600, Christoph Lameter wrote:
> On Fri, 10 Dec 2010, Eric Dumazet wrote:
>
> >
> > By the way, we need smp_wmb(), not barrier(), even if only the "owner cpu"
> > can write into its 'percpu' seqcount.
> >
> > There is nothing special about a seqcount being percpu or a 'global'
> > one. We must have same memory barrier semantics.
>
> There is certainly a major difference in that execution of a stream of
> instructions on the same cpu is guaranteed to have a coherent view of
> the data. That is not affected by interrupts etc.
>
We don't care about interrupts. We care about doing a transaction over a
complex set of data, which cannot be done with a single atomic op (we would
need a spinlock/mutex/rwlock otherwise), and we don't want that, for
performance reasons.
> >
> > this_cpu_write_seqcount_begin(&myseqcount);
> > this_cpu_add(mydata1, add1);
> > this_cpu_add(mydata2, add2);
> > this_cpu_inc(mydata3);
> > this_cpu_write_seqcount_end(&myseqcount);
> >
> > We protect the data[1,2,3] set with a seqcount, so need smp_wmb() in
> > both _begin() and _end()
>
> There is nothing to protect there since processing is on the same cpu. The
> data coherency guarantees of the processor will not allow anything out of
> sequence to affect execution. An interrupt f.e. will not cause updates to
> mydata1 to get lost.
>
Please take a look at include/linux/u64_stats_sync.h, maybe you'll
understand the concern about using a seqcount to protect a set of data,
for example a 256 bit counter increment.
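The concern is easiest to see with a counter wider than one machine word: without a sequence count around the whole update, a reader can observe some words from before the increment and some from after. A standalone sketch of the kind of read-retry pattern being described (illustrative names only, not the kernel u64_stats_sync API):

#include <stdint.h>

struct stats {
        unsigned int seq;       /* even = stable, odd = update in progress */
        uint64_t bytes;
        uint64_t packets;
};

static void stats_update(struct stats *s, uint64_t nbytes)
{
        s->seq++;                       /* begin: sequence becomes odd */
        __sync_synchronize();           /* order the bump before the stores */
        s->bytes += nbytes;
        s->packets++;
        __sync_synchronize();           /* order the stores before the bump */
        s->seq++;                       /* end: sequence becomes even again */
}

static void stats_read(const struct stats *s, uint64_t *bytes, uint64_t *packets)
{
        unsigned int start;

        do {
                start = s->seq;
                __sync_synchronize();
                *bytes = s->bytes;
                *packets = s->packets;
                __sync_synchronize();
        } while ((start & 1) || start != s->seq);
}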
On Fri, 10 Dec 2010, Eric Dumazet wrote:
> > There is certainly a major difference in that execution of a stream of
> > instructions on the same cpu is guaranteed to have a coherent view of
> > the data. That is not affected by interrupts etc.
> >
>
> We dont care of interrupts. We care of doing a transaction over a
> complex set of data, that cannot be done using an atomic op (or we need
> a spinlock/mutex/rwlock), and should not because of performance.
So what? The cpu gives you incoherent view of data somewhere when only
processing data from a single cpu?
If you have remote data accesses (loop summing the data?) and you have to
be concerned about data coherence then you CANNOT use this_cpu_ops since
they are not guaranteed to be coherent to other processors.
On Thu, Dec 9, 2010 at 11:41 PM, Venkatesh Pallipadi <[email protected]> wrote:
> On Thu, Dec 9, 2010 at 9:55 AM, Peter Zijlstra <[email protected]> wrote:
>> On Thu, 2010-12-09 at 09:43 -0800, Venkatesh Pallipadi wrote:
>>>
>>> The same problem will be there with below code, with irq_delta >
>>> delta, clock_task can go backwards which is not good.
>>> +       delta -= irq_delta;
>>> +       rq->clock_task += delta;
>>>
>>> The reason for this is rq->clock and irqtime updates kind of happen
>>> independently and specifically, if a rq->clock update happens while we
>>> are in a softirq, we may have this case of going backwards on the next
>>> update.
>>
>> But how can irq_delta > delta?, we measure it using the same clock.
>>
>
> This would be mostly a corner case like:
> - softirq start time t1
> - rq->clock updated at t2 and rq->clock_task updated at t2 without
> accounting for current softirq
> - softirq end time t3
> - cpu spends most time here in softirq or hardirq
> - next rq->clock update at t4 and rq->clock_task update, with delta =
> t4-t2 and irq_delta ~= t4 - t1
^^^
I was curious on this line. Shouldn't this be irq_delta ~= t3 - t1 ?
If so the time going backwards shouldn't happen. If it does happen
then wouldn't it be better to program irq_time_cpu(cpu_of(rq)) to
return t3 instead of t4?
Thanks,
Jack