i'm pleased to announce the v2.6.21.4-rt11 kernel, which can be
downloaded from the usual place:
http://people.redhat.com/mingo/realtime-preempt/
more info about the -rt patchset can be found in the RT wiki:
http://rt.wiki.kernel.org
-rt11 is a bit more experimental than usual: it includes the CFS
scheduler. Several people have suggested the inclusion of CFS into the
-rt tree: the determinism of the CFS scheduler is a nice match to the
determinism offered by PREEMPT_RT. The port of CFS to -rt was done by
Dinakar Guniguntala. Tested on i686 and x86_64.
to build a 2.6.21.4-rt11 tree, the following patches should be applied:
http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.21.4.tar.bz2
http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.4-rt11
Ingo
On Sunday 10 June 2007 15:17, Ingo Molnar wrote:
> -rt11 is a bit more experimental than usual: it includes the CFS
> scheduler.
Great! Finally CFS is included ;)
Right now I'm using a patched kernel (2.6.21.4) with the realtime-preemption
patch, and it works fine, but I noticed something that I think you should know.
There's a problem with mac80211. I'm using "mac80211-8.0.1"
and the "iwlwifi-0.0.25" driver with my "Intel Pro Wireless 3945ABG" card.
When loading the "iwl3945" module, or when an application (like wpa_supplicant,
dhcpcd...) tries to do something with the card, I get this message in dmesg:
BUG: using smp_processor_id() in preemptible [00000000] code:
wpa_supplicant/11659
caller is ieee80211_set_multicast_list+0x40/0x163 [mac80211]
[<c0213b1d>] debug_smp_processor_id+0xad/0xb0
[<f8e860bc>] ieee80211_set_multicast_list+0x40/0x163 [mac80211]
[<c02e8532>] __dev_mc_upload+0x22/0x23
[<c02e8686>] dev_mc_upload+0x24/0x37
[<c02e52f5>] dev_change_flags+0x26/0xf6
[<c031fc5e>] devinet_ioctl+0x539/0x6aa
[<c02db972>] sock_ioctl+0xa2/0x1d5
[<c02db8d0>] sock_ioctl+0x0/0x1d5
[<c018500f>] do_ioctl+0x1f/0x6d
[<c01850ad>] vfs_ioctl+0x50/0x273
[<c0185304>] sys_ioctl+0x34/0x50
[<c01040c6>] sysenter_past_esp+0x5f/0x85
[<c0340000>] pfkey_add+0x7c7/0x8d9
=======================
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<c0213ac4>] .... debug_smp_processor_id+0x54/0xb0
.....[<00000000>] .. ( <= _stext+0x3fefed0c/0xc)
Anyway, the wifi card works fine.
I got rid of this message by commenting out the code of the
function "ieee80211_set_multicast_list()" that's
in "net/mac80211/ieee80211.c", but this isn't a proper fix.
I think you should know about this because kernel 2.6.22 already includes
mac80211.
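For reference, the usual fix for this class of warning is to disable
preemption around the per-CPU access. A minimal sketch of the pattern
(illustrative only; this is not the actual mac80211 fix):
	int cpu;
	cpu = get_cpu();	/* get_cpu() disables preemption */
	/* ... touch per-CPU state that needs a stable CPU id ... */
	put_cpu();		/* put_cpu() re-enables preemption */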
Greetings.
--
Miguel Botón
On Sat, Jun 09, 2007 at 11:05:07PM +0200, Ingo Molnar wrote:
>
> i'm pleased to announce the v2.6.21.4-rt11 kernel, which can be
> downloaded from the usual place:
>
> http://people.redhat.com/mingo/realtime-preempt/
>
> more info about the -rt patchset can be found in the RT wiki:
>
> http://rt.wiki.kernel.org
>
> -rt11 is a bit more experimental than usual: it includes the CFS
> scheduler. Several people have suggested the inclusion of CFS into the
> -rt tree: the determinism of the CFS scheduler is a nice match to the
> determinism offered by PREEMPT_RT. The port of CFS to -rt was done by
> Dinakar Guniguntala. Tested on i686 and x86_64.
>
> to build a 2.6.21.4-rt11 tree, the following patches should be applied:
>
> http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.21.4.tar.bz2
> http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.4-rt11
2.6.21.4-rt12 boots on 4-CPU Opteron and passes several hours of
rcutorture. However, if I simply do "modprobe rcutorture", the kernel
threads do not spread across the CPUs as I would expect them to, even
given CFS. Instead, the readers all stack up on a single CPU, and I
have to use the "taskset" command to spread them out manually. Is there
some config parameter I am missing out on?
Thanx, Paul
* Paul E. McKenney <[email protected]> wrote:
> 2.6.21.4-rt12 boots on 4-CPU Opteron and passes several hours of
> rcutorture. However, if I simply do "modprobe rcutorture", the kernel
> threads do not spread across the CPUs as I would expect them to, even
> given CFS. Instead, the readers all stack up on a single CPU, and I
> have to use the "taskset" command to spread them out manually. Is
> there some config parameter I am missing out on?
hm, what affinity do they start out with? Could they all be pinned to
CPU#0 by default?
Ingo
On Mon, Jun 11, 2007 at 09:36:34AM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > 2.6.21.4-rt12 boots on 4-CPU Opteron and passes several hours of
> > rcutorture. However, if I simply do "modprobe rcutorture", the kernel
> > threads do not spread across the CPUs as I would expect them to, even
> > given CFS. Instead, the readers all stack up on a single CPU, and I
> > have to use the "taskset" command to spread them out manually. Is
> > there some config parameter I am missing out on?
>
> hm, what affinity do they start out with? Could they all be pinned to
> CPU#0 by default?
They start off with affinity masks of 0xf on a 4-CPU system. I would
expect them to load-balance across the four CPUs, but they stay all
on the same CPU until long after I lose patience (many minutes).
Since there are eight readers, I use the following commands:
taskset -p 3 pid1
taskset -p 3 pid2
taskset -p 6 pid3
taskset -p 6 pid4
taskset -p c pid5
taskset -p c pid6
taskset -p 9 pid7
taskset -p 9 pid8
where the "pidn" are all replaced by the pids of the torture readers.
Before I do this, the processes are all sharing a single CPU. After I
do this, they are spread reasonably nicely over the CPUs. I do need to
allow some migration in order to fully test the realtime RCU variants
in the various preemption scenarios.
Thanx, Paul
* Paul E. McKenney <[email protected]> wrote:
> > hm, what affinity do they start out with? Could they all be pinned
> > to CPU#0 by default?
>
> They start off with affinity masks of 0xf on a 4-CPU system. I would
> expect them to load-balance across the four CPUs, but they stay all on
> the same CPU until long after I lose patience (many minutes).
ugh. Would be nice to figure out why this happens. I enabled rcutorture
on a dual-core CPU and all the threads are spread evenly.
Ingo
On Mon, Jun 11, 2007 at 05:38:55PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > > hm, what affinity do they start out with? Could they all be pinned
> > > to CPU#0 by default?
> >
> > They start off with affinity masks of 0xf on a 4-CPU system. I would
> > expect them to load-balance across the four CPUs, but they stay all on
> > the same CPU until long after I lose patience (many minutes).
>
> ugh. Would be nice to figure out why this happens. I enabled rcutorture
> on a dual-core CPU and all the threads are spread evenly.
Here is the /proc/cpuinfo in case this helps. I am starting up a test
on a dual-core CPU to see if that works better.
Thanx, Paul
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 844
stepping : 8
cpu MHz : 1793.105
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall mmxext lm 3dnowext 3dnow
bogomips : 3522.56
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 844
stepping : 8
cpu MHz : 1793.105
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall mmxext lm 3dnowext 3dnow
bogomips : 3579.90
processor : 2
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 844
stepping : 8
cpu MHz : 1793.105
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall mmxext lm 3dnowext 3dnow
bogomips : 3579.90
processor : 3
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 844
stepping : 8
cpu MHz : 1793.105
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall mmxext lm 3dnowext 3dnow
bogomips : 3579.90
On Mon, Jun 11, 2007 at 08:55:27AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 11, 2007 at 05:38:55PM +0200, Ingo Molnar wrote:
> >
> > * Paul E. McKenney <[email protected]> wrote:
> >
> > > > hm, what affinity do they start out with? Could they all be pinned
> > > > to CPU#0 by default?
> > >
> > > They start off with affinity masks of 0xf on a 4-CPU system. I would
> > > expect them to load-balance across the four CPUs, but they stay all on
> > > the same CPU until long after I lose patience (many minutes).
> >
> > ugh. Would be nice to figure out why this happens. I enabled rcutorture
> > on a dual-core CPU and all the threads are spread evenly.
>
> Here is the /proc/cpuinfo in case this helps. I am starting up a test
> on a dual-core CPU to see if that works better.
And this quickly load-balanced to put a pair of readers on each CPU.
Later, it moved one of the readers so that it is now running with
one reader on one of the CPUs, and the remaining three readers on the
other CPU.
Argh... this is with 2.6.21-rt1... Need to reboot with 2.6.21.4-rt12...
Thanx, Paul
On Mon, Jun 11, 2007 at 10:18:06AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 11, 2007 at 08:55:27AM -0700, Paul E. McKenney wrote:
> > On Mon, Jun 11, 2007 at 05:38:55PM +0200, Ingo Molnar wrote:
> > >
> > > * Paul E. McKenney <[email protected]> wrote:
> > >
> > > > > hm, what affinity do they start out with? Could they all be pinned
> > > > > to CPU#0 by default?
> > > >
> > > > They start off with affinity masks of 0xf on a 4-CPU system. I would
> > > > expect them to load-balance across the four CPUs, but they stay all on
> > > > the same CPU until long after I lose patience (many minutes).
> > >
> > > ugh. Would be nice to figure out why this happens. I enabled rcutorture
> > > on a dual-core CPU and all the threads are spread evenly.
> >
> > Here is the /proc/cpuinfo in case this helps. I am starting up a test
> > on a dual-core CPU to see if that works better.
>
> And this quickly load-balanced to put a pair of readers on each CPU.
> Later, it moved one of the readers so that it is now running with
> one reader on one of the CPUs, and the remaining three readers on the
> other CPU.
>
> Argh... this is with 2.6.21-rt1... Need to reboot with 2.6.21.4-rt12...
OK, here are a couple of snapshots from "top" on a two-way system.
It seems to cycle back and forth between these two states.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20126 root 39 19 0 0 0 R 47 0.0 11:38.62 rcu_torture_rea
20129 root 39 19 0 0 0 R 47 0.0 13:28.06 rcu_torture_rea
20127 root 39 19 0 0 0 R 43 0.0 12:39.83 rcu_torture_rea
20128 root 39 19 0 0 0 R 43 0.0 11:50.58 rcu_torture_rea
20121 root 39 19 0 0 0 R 10 0.0 2:59.69 rcu_torture_wri
20123 root 39 19 0 0 0 D 2 0.0 0:28.52 rcu_torture_fak
20125 root 39 19 0 0 0 D 2 0.0 0:28.47 rcu_torture_fak
20122 root 39 19 0 0 0 D 1 0.0 0:28.38 rcu_torture_fak
20124 root 39 19 0 0 0 D 1 0.0 0:28.41 rcu_torture_fak
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20129 root 39 19 0 0 0 R 80 0.0 14:46.56 rcu_torture_rea
20126 root 39 19 0 0 0 R 33 0.0 12:52.70 rcu_torture_rea
20128 root 39 19 0 0 0 R 33 0.0 13:01.50 rcu_torture_rea
20127 root 39 19 0 0 0 R 33 0.0 13:49.68 rcu_torture_rea
20121 root 39 19 0 0 0 R 13 0.0 3:16.82 rcu_torture_wri
20122 root 39 19 0 0 0 R 2 0.0 0:31.16 rcu_torture_fak
20123 root 39 19 0 0 0 R 2 0.0 0:31.25 rcu_torture_fak
20124 root 39 19 0 0 0 D 2 0.0 0:31.23 rcu_torture_fak
20125 root 39 19 0 0 0 R 2 0.0 0:31.25 rcu_torture_fak
12907 root 20 0 12576 1068 796 R 1 0.0 0:08.55 top
The "preferred" state is the first one. But given that the readers
will consume all CPU available to them, the scheduler might not be
able to tell the difference.
Perhaps the fakewriters are confusing the scheduler; I will try again on
a 4-CPU machine, leaving them out.
Thanx, Paul
On Mon, Jun 11, 2007 at 01:44:27PM -0700, Paul E. McKenney wrote:
> On Mon, Jun 11, 2007 at 10:18:06AM -0700, Paul E. McKenney wrote:
> > On Mon, Jun 11, 2007 at 08:55:27AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jun 11, 2007 at 05:38:55PM +0200, Ingo Molnar wrote:
> > > >
> > > > * Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > > > hm, what affinity do they start out with? Could they all be pinned
> > > > > > to CPU#0 by default?
> > > > >
> > > > > They start off with affinity masks of 0xf on a 4-CPU system. I would
> > > > > expect them to load-balance across the four CPUs, but they stay all on
> > > > > the same CPU until long after I lose patience (many minutes).
> > > >
> > > > ugh. Would be nice to figure out why this happens. I enabled rcutorture
> > > > on a dual-core CPU and all the threads are spread evenly.
> > >
> > > Here is the /proc/cpuinfo in case this helps. I am starting up a test
> > > on a dual-core CPU to see if that works better.
> >
> > And this quickly load-balanced to put a pair of readers on each CPU.
> > Later, it moved one of the readers so that it is now running with
> > one reader on one of the CPUs, and the remaining three readers on the
> > other CPU.
> >
> > Argh... this is with 2.6.21-rt1... Need to reboot with 2.6.21.4-rt12...
>
> OK, here are a couple of snapshots from "top" on a two-way system.
> It seems to cycle back and forth between these two states.
And on the 4-CPU box:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3112 root 39 19 0 0 0 R 11.6 0.0 0:44.34 rcu_torture_rea
3114 root 39 19 0 0 0 R 11.6 0.0 0:44.34 rcu_torture_rea
3115 root 39 19 0 0 0 R 11.6 0.0 0:44.34 rcu_torture_rea
3116 root 39 19 0 0 0 R 11.6 0.0 0:44.34 rcu_torture_rea
3109 root 39 19 0 0 0 R 11.3 0.0 0:44.33 rcu_torture_rea
3110 root 39 19 0 0 0 R 11.3 0.0 0:44.33 rcu_torture_rea
3111 root 39 19 0 0 0 R 11.3 0.0 0:44.34 rcu_torture_rea
3113 root 39 19 0 0 0 R 11.3 0.0 0:44.34 rcu_torture_rea
3108 root 39 19 0 0 0 D 6.0 0.0 0:24.35 rcu_torture_wri
All are on CPU zero:
elm3b6:~# cat /proc/3109/stat | awk '{print $(NF-3)}'
0
elm3b6:~# cat /proc/3110/stat | awk '{print $(NF-3)}'
0
elm3b6:~# cat /proc/3111/stat | awk '{print $(NF-3)}'
0
elm3b6:~# cat /proc/3112/stat | awk '{print $(NF-3)}'
0
elm3b6:~# cat /proc/3113/stat | awk '{print $(NF-3)}'
0
elm3b6:~# cat /proc/3114/stat | awk '{print $(NF-3)}'
0
elm3b6:~# cat /proc/3115/stat | awk '{print $(NF-3)}'
0
elm3b6:~# cat /proc/3116/stat | awk '{print $(NF-3)}'
0
elm3b6:~# cat /proc/3108/stat | awk '{print $(NF-3)}'
0
All have their affinity masks at f (allowing them to run on all CPUs):
elm3b6:~# taskset -p 3109
pid 3109's current affinity mask: f
elm3b6:~# taskset -p 3110
pid 3110's current affinity mask: f
elm3b6:~# taskset -p 3111
pid 3111's current affinity mask: f
elm3b6:~# taskset -p 3112
pid 3112's current affinity mask: f
elm3b6:~# taskset -p 3113
pid 3113's current affinity mask: f
elm3b6:~# taskset -p 3114
pid 3114's current affinity mask: f
elm3b6:~# taskset -p 3115
pid 3115's current affinity mask: f
elm3b6:~# taskset -p 3116
pid 3116's current affinity mask: f
elm3b6:~# taskset -p 3108
pid 3108's current affinity mask: f
Not a biggie for me, since I can easily do the taskset commands to
force the processes to spread out, but I am worried that casual users
of rcutorture won't know to do this -- thus not really torturing RCU.
It would not be hard to modify rcutorture to affinity the tasks so as
to spread them, but this seems a bit ugly.
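For illustration, a minimal sketch of what that could look like with the
era's API (a hypothetical helper, not actual rcutorture code; note that
pinning each task to a single CPU like this would also defeat the
migration testing mentioned above):
	/* Hypothetical sketch: spread the reader kthreads round-robin
	 * across the CPUs at creation time (assumes CPUs 0..n-1 online). */
	static void rcu_torture_spread(struct task_struct *t, int idx)
	{
		set_cpus_allowed(t, cpumask_of_cpu(idx % num_online_cpus()));
	}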
Thanx, Paul
On Sat, 2007-06-09 at 23:05 +0200, Ingo Molnar wrote:
> i'm pleased to announce the v2.6.21.4-rt11 kernel, which can be
> downloaded from the usual place:
>
I'm running 2.6.21.4-rt12-cfs-v17 (x86_64), so far no problems. I like
this kernel a lot, it feels quite smooth.
One little thing: no HPET timer is detected. Looking at the patch,
the force-detect code is there, so it should work.
The hpet timer is not available as a clocksource and only one hpet
related message is present in dmesg:
PM: Adding info for No Bus:hpet
This is on an Asus P5LD2-VM motherboard (ICH7)
Relevant config bits:
CONFIG_HPET_TIMER=y
# CONFIG_HPET_EMULATE_RTC is not set
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_HPET_MMAP=y
Should I enable one of the two other options? Any ideas?
Best regards,
- Eric
(Cc:-ed Venki for the force-hpet issue below)
* Eric St-Laurent <[email protected]> wrote:
> On Sat, 2007-06-09 at 23:05 +0200, Ingo Molnar wrote:
> > i'm pleased to announce the v2.6.21.4-rt11 kernel, which can be
> > downloaded from the usual place:
>
> I'm running 2.6.21.4-rt12-cfs-v17 (x86_64), so far no problems. I like
> this kernel a lot, it feels quite smooth.
yeah, that's probably CFS in the works :-) That combined with PREEMPT_RT
makes for a really snappy desktop.
> One little thing: no HPET timer is detected. Looking at the patch,
> the force-detect code is there, so it should work.
>
> The hpet timer is not available as a clocksource and only one hpet
> related message is present in dmesg:
>
> PM: Adding info for No Bus:hpet
>
> This is on an Asus P5LD2-VM motherboard (ICH7)
>
> Relevant config bits:
>
> CONFIG_HPET_TIMER=y
> # CONFIG_HPET_EMULATE_RTC is not set
> CONFIG_HPET=y
> # CONFIG_HPET_RTC_IRQ is not set
> CONFIG_HPET_MMAP=y
>
> Should I enable one of the two other options? Any ideas?
Venki, is this ICH7 board supposed to work with force-hpet?
Ingo
>-----Original Message-----
>From: Ingo Molnar [mailto:[email protected]]
>Sent: Tuesday, June 12, 2007 12:32 AM
>To: Eric St-Laurent
>Cc: [email protected];
>[email protected]; Thomas Gleixner; Dinakar
>Guniguntala; Pallipadi, Venkatesh
>Subject: Re: v2.6.21.4-rt11
>
>
>(Cc:-ed Venki for the force-hpet issue below)
>
>* Eric St-Laurent <[email protected]> wrote:
>
>> On Sat, 2007-06-09 at 23:05 +0200, Ingo Molnar wrote:
>> > i'm pleased to announce the v2.6.21.4-rt11 kernel, which can be
>> > downloaded from the usual place:
>>
>> I'm running 2.6.21.4-rt12-cfs-v17 (x86_64), so far no problems.
>> I like this kernel a lot, it feels quite smooth.
>
>yeah, that's probably CFS in the works :-) That combined with
>PREEMPT_RT makes for a really snappy desktop.
>
>> One little thing: no HPET timer is detected. Looking at the patch,
>> the force-detect code is there, so it should work.
>>
>> The hpet timer is not available as a clocksource and only one hpet
>> related message is present in dmesg:
>>
>> PM: Adding info for No Bus:hpet
>>
>> This is on an Asus P5LD2-VM motherboard (ICH7)
>>
>> Relevant config bits:
>>
>> CONFIG_HPET_TIMER=y
>> # CONFIG_HPET_EMULATE_RTC is not set
>> CONFIG_HPET=y
>> # CONFIG_HPET_RTC_IRQ is not set
>> CONFIG_HPET_MMAP=y
>>
>> Should I enable one of the two other options? Any ideas?
>
>Venki, is this ICH7 board supposed to work with force-hpet?
>
Yes, the force_hpet part should have worked.
Eric: can you send me the output of 'lspci -n' on your system?
We need to double-check we are covering all ICH7 IDs.
Thanks,
Venki
* Paul E. McKenney <[email protected]> wrote:
> Not a biggie for me, since I can easily do the taskset commands to
> force the processes to spread out, but I am worried that casual users
> of rcutorture won't know to do this -- thus not really torturing RCU.
> It would not be hard to modify rcutorture to affinity the tasks so as
> to spread them, but this seems a bit ugly.
does it get any better if you renice them from +19 to 0? (and then back
to +19?)
Ingo
On Tue, Jun 12, 2007 at 11:37:58PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > Not a biggie for me, since I can easily do the taskset commands to
> > force the processes to spread out, but I am worried that casual users
> > of rcutorture won't know to do this -- thus not really torturing RCU.
> > It would not be hard to modify rcutorture to affinity the tasks so as
> > to spread them, but this seems a bit ugly.
>
> does it get any better if you renice them from +19 to 0? (and then back
> to +19?)
Interesting!
That did spread them evenly across two CPUs, but not across all four.
I took a look at CFS, which seems to operate in terms of milliseconds.
Since the rcu_torture_reader() code enters the scheduler on each
iteration, it would not give CFS millisecond-scale bursts of CPU
consumption, perhaps not allowing it to do reasonable load balancing.
So I inserted the following code at the beginning of rcu_torture_reader():
	/* Crude hack: drop to nice 0 and burn ~10 ms of CPU so the load
	 * balancer sees a millisecond-scale burst, then go back to 19. */
	set_user_nice(current, 19);
	set_user_nice(current, 0);
	for (idx = 0; idx < 1000; idx++) {
		udelay(10);
	}
	set_user_nice(current, 19);
This worked much better:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18600 root 39 19 0 0 0 R 50 0.0 0:09.57 rcu_torture_rea
18599 root 39 19 0 0 0 R 50 0.0 0:09.56 rcu_torture_rea
18598 root 39 19 0 0 0 R 49 0.0 0:10.33 rcu_torture_rea
18602 root 39 19 0 0 0 R 49 0.0 0:10.34 rcu_torture_rea
18596 root 39 19 0 0 0 R 47 0.0 0:09.48 rcu_torture_rea
18601 root 39 19 0 0 0 R 46 0.0 0:09.56 rcu_torture_rea
18595 root 39 19 0 0 0 R 45 0.0 0:09.23 rcu_torture_rea
18597 root 39 19 0 0 0 R 44 0.0 0:10.92 rcu_torture_rea
18590 root 39 19 0 0 0 R 10 0.0 0:02.23 rcu_torture_wri
18591 root 39 19 0 0 0 D 2 0.0 0:00.34 rcu_torture_fak
18592 root 39 19 0 0 0 D 2 0.0 0:00.35 rcu_torture_fak
18593 root 39 19 0 0 0 D 2 0.0 0:00.35 rcu_torture_fak
18594 root 39 19 0 0 0 D 2 0.0 0:00.33 rcu_torture_fak
18603 root 15 -5 0 0 0 S 1 0.0 0:00.06 rcu_torture_sta
(The first eight tasks are readers, while the last six tasks are update
and statistics threads that don't consume so much CPU, so the above is
pretty close to optimal.)
I stopped and restarted rcutorture several times, and it spread nicely
each time, at least aside from the time that makewhatis decided to fire
up just as I started rcutorture.
But this is admittedly a -very- crude hack.
One approach would be to make them all spin until a few milliseconds
after the last one was created. I would like to spread the readers
separately from the other tasks, which could be done by taking a two-stage
approach, spreading the writer and fakewriter tasks first, then spreading
the readers. This seems a bit nicer, and I will play with it a bit.
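As a rough sketch of the spin-until-started idea (hypothetical code; the
names and the startup counter are made up purely for illustration):
	/* Hypothetical sketch: each torture thread checks in at startup,
	 * waits until all the threads exist, then burns a short CPU burst
	 * so the load balancer has something to measure. */
	static atomic_t n_started;	/* threads that have started */
	static int n_total;		/* total torture threads created */
	static void rcu_torture_sync_start(void)
	{
		atomic_inc(&n_started);
		while (atomic_read(&n_started) < n_total)
			schedule_timeout_interruptible(1);
		mdelay(10);	/* millisecond-scale burst for the balancer */
	}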
In the meantime, thoughts on more-maintainable ways of making this work?
Thanx, Paul
On Tue, 2007-06-12 at 06:00 -0700, Pallipadi, Venkatesh wrote:
>
> >-----Original Message-----
> Yes, the force_hpet part should have worked.
> Eric: can you send me the output of 'lspci -n' on your system?
> We need to double-check we are covering all ICH7 IDs.
Here it is:
00:00.0 0600: 8086:2770 (rev 02)
00:02.0 0300: 8086:2772 (rev 02)
00:1b.0 0403: 8086:27d8 (rev 01)
00:1c.0 0604: 8086:27d0 (rev 01)
00:1c.1 0604: 8086:27d2 (rev 01)
00:1d.0 0c03: 8086:27c8 (rev 01)
00:1d.1 0c03: 8086:27c9 (rev 01)
00:1d.2 0c03: 8086:27ca (rev 01)
00:1d.3 0c03: 8086:27cb (rev 01)
00:1d.7 0c03: 8086:27cc (rev 01)
00:1e.0 0604: 8086:244e (rev e1)
00:1f.0 0601: 8086:27b8 (rev 01)
00:1f.1 0101: 8086:27df (rev 01)
00:1f.2 0101: 8086:27c0 (rev 01)
00:1f.3 0c05: 8086:27da (rev 01)
01:0a.0 0604: 3388:0021 (rev 11)
02:0c.0 0c03: 1033:0035 (rev 41)
02:0c.1 0c03: 1033:0035 (rev 41)
02:0c.2 0c03: 1033:00e0 (rev 02)
02:0d.0 0c00: 1106:3044 (rev 46)
03:00.0 0200: 8086:109a
Adding the ID for PCI_DEVICE_ID_INTEL_ICH7_0 (27b8) should do the trick.
I've patched my kernel and was ready to test it, but in the meantime I
did a BIOS upgrade (bad idea...) and with the new version the HPET timer
is detected via ACPI.
Unfortunately it seems that downgrading the BIOS is a lot more trouble
than upgrading it, so I cannot easily test the force-enable anymore.
Anyway, it works now. Here is my patch if it's any use to you:
diff -uprN linux-2.6.21.4.orig/arch/i386/kernel/quirks.c linux-2.6.21.4/arch/i386/kernel/quirks.c
--- linux-2.6.21.4.orig/arch/i386/kernel/quirks.c Tue Jun 12 10:03:18 2007
+++ linux-2.6.21.4/arch/i386/kernel/quirks.c Tue Jun 12 10:08:02 2007
@@ -149,6 +149,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_I
ich_force_enable_hpet);
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_1,
ich_force_enable_hpet);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_0,
+ ich_force_enable_hpet);
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_1,
ich_force_enable_hpet);
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_31,
Best regards,
- Eric
On 6/9/07, Ingo Molnar <[email protected]> wrote:
>
> i'm pleased to announce the v2.6.21.4-rt11 kernel, which can be
> downloaded from the usual place:
>
> http://people.redhat.com/mingo/realtime-preempt/
>
> more info about the -rt patchset can be found in the RT wiki:
>
> http://rt.wiki.kernel.org
Not for ARM yet :(
What should I try for the ARM architecture? There are many choices and
I don't know which is the most friendly. By friendly I mean the one that
is likely to be merged and that cooperates with you.
> to build a 2.6.21.4-rt11 tree, the following patches should be applied:
>
> http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.21.4.tar.bz2
> http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.4-rt11
>
Using this patch:
http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.4-rt14
LD init/built-in.o
LD .tmp_vmlinux1
kernel/built-in.o(.text+0xd3f0): In function `do_sys_settimeofday':
: undefined reference to `warp_check_clock_was_changed'
kernel/built-in.o(.text+0x12588): In function `timekeeping_resume':
: undefined reference to `warp_check_clock_was_changed'
kernel/built-in.o(.text+0x132f8): In function `do_sysinfo':
: undefined reference to `__get_nsec_offset'
kernel/built-in.o(.text+0x20a04): In function `ktime_get_ts':
: undefined reference to `__get_nsec_offset'
kernel/built-in.o(.text+0x221c0): In function `$a':
: undefined reference to `warp_check_clock_was_changed'
kernel/built-in.o(.text+0x22208): In function `$a':
: undefined reference to `warp_check_clock_was_changed'
kernel/built-in.o(.text+0x2b9dc): In function `$a':
: undefined reference to `usecs_to_cycles'
make: *** [.tmp_vmlinux1] Error 1
Regards.
--
http://arhuaco.org
http://emQbit.com
On Sun, 2007-06-17 at 11:15 -0500, Nelson Castillo wrote:
> > http://rt.wiki.kernel.org
>
> Not for ARM yet :(
>
> What should I try for the ARM architecture?
ARM has a lot of sub-architectures and not all of them are supported
yet.
> There are many choices and
> I don't know which is the most friendly. By friendly I mean the one that
> is likely to be merged and that cooperates with you.
Which choices do you mean ?
> http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.4-rt14
>
> : undefined reference to `usecs_to_cycles'
> make: *** [.tmp_vmlinux1] Error 1
Which ARM sub arch ?
tglx
On 6/17/07, Thomas Gleixner <[email protected]> wrote:
> On Sun, 2007-06-17 at 11:15 -0500, Nelson Castillo wrote:
> > > http://rt.wiki.kernel.org
> >
> > Not for ARM yet :(
> >
> > What should I try for the ARM architecture?
>
> ARM has a lot of sub architectures and not all of them are supported
> yet.
I see.
> > There are many choices and
> > I don't know which is the most friendly. By friendly I mean the one that
> > is likely to be merged and that cooperates with you.
>
> Which choices do you mean ?
I mean implementations. I've seen a lot of them but I don't know which one
to try (I'm new to RT and the implementation in this thread seems to
be very nice).
>
> > http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.4-rt14
> >
> > : undefined reference to `usecs_to_cycles'
> > make: *** [.tmp_vmlinux1] Error 1
>
> Which ARM sub arch ?
sub arch AT91 -- (Atmel AT91RM9200 processor).
Thanks,
Nelson.-
--
http://arhuaco.org
http://emQbit.com
On Sun, 2007-06-17 at 11:49 -0500, Nelson Castillo wrote:
> > > There are many choices and
> > > I don't know which is the most friendly. By friendly I mean the one that
> > > is likely to be merged and that cooperates with you.
> >
> > Which choices do you mean ?
>
> I mean implementations. I've seen a lot of them but I don't know which one
> to try (I'm new to RT and the implementation in this thread seems to
> be very nice).
Thanks :)
> > > http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.4-rt14
> > >
> > > : undefined reference to `usecs_to_cycles'
> > > make: *** [.tmp_vmlinux1] Error 1
> >
> > Which ARM sub arch ?
>
> sub arch AT91 -- (Atmel AT91RM9200 processor).
It lacks support for the generic timeofday and clock event layers, which
causes the compile breakage.
I'll take a look at the compile errors and ping somebody who is working
on AT91 support to send out the patches ASAP.
tglx
On Sat, Jun 16, 2007 at 09:12:13AM -0700, Paul E. McKenney wrote:
> On Sat, Jun 16, 2007 at 02:14:34PM +0530, Srivatsa Vaddagiri wrote:
> > On Fri, Jun 15, 2007 at 06:16:05PM -0700, Paul E. McKenney wrote:
> > > On Fri, Jun 15, 2007 at 09:55:45PM +0200, Ingo Molnar wrote:
> > > >
> > > > * Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > > > to make sure it's not some effect in -rt causing this. v17 has an
> > > > > > updated load balancing code. (which might or might not affect the
> > > > > > rcutorture problem.)
> > > > >
> > > > > Good point! I will try the following:
> > > > >
> > > > > 1. Stock 2.6.21.5.
> > > > >
> > > > > 2. 2.6.21-rt14.
> > > > >
> > > > > 3. 2.6.21.5 + sched-cfs-v2.6.21.5-v17.patch
> > > > >
> > > > > And quickly, before everyone else jumps on the machines that show the
> > > > > problem. ;-)
> > > >
> > > > thanks! It's enough to check whether modprobe rcutorture still produces
> > > > that weird balancing problem. That clearly has to be fixed ...
> > > >
> > > > And i've Cc:-ed Dmitry and Srivatsa, who are busy hacking this area of
> > > > the CFS code as we speak :-)
> > >
> > > Well, I am not sure that the info I was able to collect will be all
> > > that helpful, but it most certainly does confirm that the balancing
> > > problem that rcutorture produces is indeed weird...
> >
> > Hi Paul,
> > I tried on two machines in our lab and could not recreate your
> > problem.
> >
> > On a 2way x86_64 AMD box and 2.6.21.5+cfsv17:
> >
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 12395 root 39 19 0 0 0 R 50.3 0.0 0:57.62 rcu_torture_rea
> > 12394 root 39 19 0 0 0 R 49.9 0.0 0:57.29 rcu_torture_rea
> > 12396 root 39 19 0 0 0 R 49.9 0.0 0:56.96 rcu_torture_rea
> > 12397 root 39 19 0 0 0 R 49.9 0.0 0:56.90 rcu_torture_rea
> >
> > On a 4way x86_64 Intel Xeon box and 2.6.21.5+cfsv17:
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> > 6258 root 39 19 0 0 0 R 53 0.0 17:29.72 0 rcu_torture_rea
> > 6252 root 39 19 0 0 0 R 49 0.0 17:49.40 3 rcu_torture_rea
> > 6257 root 39 19 0 0 0 R 49 0.0 17:22.49 2 rcu_torture_rea
> > 6256 root 39 19 0 0 0 R 48 0.0 17:50.12 1 rcu_torture_rea
> > 6254 root 39 19 0 0 0 R 48 0.0 17:26.98 0 rcu_torture_rea
> > 6255 root 39 19 0 0 0 R 48 0.0 17:25.74 2 rcu_torture_rea
> > 6251 root 39 19 0 0 0 R 45 0.0 17:47.45 3 rcu_torture_rea
> > 6253 root 39 19 0 0 0 R 45 0.0 17:48.48 1 rcu_torture_rea
> >
> >
> > I will try this on a few more boxes we have on Monday. If I can't recreate it,
> > then I may request you to provide me machine details (or even access to the
> > problem box if it is in IBM labs and if I am allowed to log in!)
>
> elm3b6, ABAT job 95107. There are others, this particular job uses
> 2.6.21.5-cfsv17.
Paul,
I logged into elm3b6 and did some investigation. I think I have
a tentative patch to fix your load-balance problem.
First, an explanation of the problem:
This particular machine, elm3b6, is a 4-CPU, (gasp, yes!) 4-node box, i.e.
each CPU is a node by itself. If you don't have CONFIG_NUMA enabled,
then we won't have cross-node (i.e. cross-cpu) load balancing.
Fortunately, in your case you had CONFIG_NUMA enabled, but you were still
hitting the (gross) load imbalance.
The problem seems to be with idle_balance(). This particular routine,
invoked by schedule() on an idle cpu, walks up the sched-domain hierarchy
and tries to balance in each domain that has the SD_BALANCE_NEWIDLE flag
set. The nodes-level domain (SD_NODE_INIT) however doesn't set this flag,
which means an idle cpu looks for (im)balance within its own node at most
and not beyond. Now, here's the problem: if the idle cpu doesn't find an
imbalance within its node (pulled_tasks = 0), it resets this_rq->next_balance
so that the next balancing activity is deferred for up to a minute
(next_balance = jiffies + 60 * HZ). If an idle cpu calls idle_balance()
again in the next minute and finds no imbalance within its node, it
-again- resets next_balance. In your case, I think this was happening
repeatedly, which made the other CPUs never look for cross-node
(im)balance.
I believe the patch below is correct. With the patch applied, I could
not recreate the imbalance with rcutorture. Let me know whether you
still see the problem with this patch applied on any other machine.
I have CCed others who have worked in this area and request them to review
this patch.
Andrew,
If there is no objection from anyone, I request you to pick this
up for the next -mm release. It has been tested against 2.6.22-rc4-mm2.
idle_balance() can erroneously cause a system-wide imbalance to be
overlooked by resetting rq->next_balance. When called a sufficient number
of times, it can defer system-wide load balancing forever. The patch below
modifies idle_balance() not to mess with ->next_balance. If it indeed turns
out that there is no imbalance even system-wide, rebalance_domains() will
anyway set ->next_balance to happen after a minute.
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Index: linux-2.6.22-rc4/kernel/sched.c
===================================================================
--- linux-2.6.22-rc4.orig/kernel/sched.c 2007-06-18 07:16:49.000000000 -0700
+++ linux-2.6.22-rc4/kernel/sched.c 2007-06-18 07:18:41.000000000 -0700
@@ -2490,27 +2490,16 @@
{
struct sched_domain *sd;
int pulled_task = 0;
- unsigned long next_balance = jiffies + 60 * HZ;
for_each_domain(this_cpu, sd) {
if (sd->flags & SD_BALANCE_NEWIDLE) {
/* If we've pulled tasks over stop searching: */
pulled_task = load_balance_newidle(this_cpu,
this_rq, sd);
- if (time_after(next_balance,
- sd->last_balance + sd->balance_interval))
- next_balance = sd->last_balance
- + sd->balance_interval;
if (pulled_task)
break;
}
}
- if (!pulled_task)
- /*
- * We are going idle. next_balance may be set based on
- * a busy processor. So reset next_balance.
- */
- this_rq->next_balance = next_balance;
}
/*
--
Regards,
vatsa
From: Thomas Gleixner <[email protected]>
Date: Sun, 17 Jun 2007 18:59:18 +0200
> On Sun, 2007-06-17 at 11:49 -0500, Nelson Castillo wrote:
> > > > There are many choices and
> > > > I don't know which is the most friendly. By friendly I mean the one that
> > > > is likely to be merged and that cooperates with you.
> > >
> > > Which choices do you mean ?
> >
> > I mean implementations. I've seen a lot of them but I don't know which one
> > to try (I'm new to RT and the implementation in this thread seems to
> > be very nice).
>
> Thanks :)
>
> > > > http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.4-rt14
> > > >
> > > > : undefined reference to `usecs_to_cycles'
> > > > make: *** [.tmp_vmlinux1] Error 1
> > >
> > > Which ARM sub arch ?
> >
> > sub arch AT91 -- (Atmel AT91RM9200 processor).
>
> It lacks support for the generic timeofday and clock event layers, which
> causes the compile breakage.
I am working on Renesas SuperH platforms.
I faced similar compile errors
because 2.6.21.x on SH does not support GENERIC_TIME yet.
I made a workaround patch. Is this correct?
Thanks,
---
Katsuya Matsubara @ Igel Co., Ltd
[email protected]
diff -cr linux-2.6.21.5-rt14/kernel/hrtimer.c linux-2.6.21.5-rt14-nogt/kernel/hrtimer.c
*** linux-2.6.21.5-rt14/kernel/hrtimer.c 2007-06-18 19:55:55.000000000 +0900
--- linux-2.6.21.5-rt14-nogt/kernel/hrtimer.c 2007-06-16 16:36:10.000000000 +0900
***************
*** 119,127 ****
--- 119,131 ----
do {
seq = read_seqbegin(&xtime_lock);
+ #ifdef CONFIG_GENERIC_TIME
*ts = xtime;
nsecs = __get_nsec_offset();
timespec_add_ns(ts, nsecs);
+ #else
+ getnstimeofday(ts);
+ #endif
tomono = wall_to_monotonic;
} while (read_seqretry(&xtime_lock, seq));
diff -cr linux-2.6.21.5-rt14/kernel/time/ntp.c linux-2.6.21.5-rt14-nogt/kernel/time/ntp.c
*** linux-2.6.21.5-rt14/kernel/time/ntp.c 2007-06-18 19:55:56.000000000 +0900
--- linux-2.6.21.5-rt14-nogt/kernel/time/ntp.c 2007-06-16 16:37:46.000000000 +0900
***************
*** 120,126 ****
--- 120,128 ----
*/
time_interpolator_update(-NSEC_PER_SEC);
time_state = TIME_OOP;
+ #ifdef CONFIG_GENERIC_TIME
warp_check_clock_was_changed();
+ #endif
clock_was_set();
printk(KERN_NOTICE "Clock: inserting leap second "
"23:59:60 UTC\n");
***************
*** 136,142 ****
--- 138,146 ----
*/
time_interpolator_update(NSEC_PER_SEC);
time_state = TIME_WAIT;
+ #ifdef CONFIG_GENERIC_TIME
warp_check_clock_was_changed();
+ #endif
clock_was_set();
printk(KERN_NOTICE "Clock: deleting leap second "
"23:59:59 UTC\n");
diff -cr linux-2.6.21.5-rt14/kernel/time.c linux-2.6.21.5-rt14-nogt/kernel/time.c
*** linux-2.6.21.5-rt14/kernel/time.c 2007-06-18 19:55:56.000000000 +0900
--- linux-2.6.21.5-rt14-nogt/kernel/time.c 2007-06-16 16:36:10.000000000 +0900
***************
*** 135,141 ****
--- 135,143 ----
wall_to_monotonic.tv_sec -= sys_tz.tz_minuteswest * 60;
xtime.tv_sec += sys_tz.tz_minuteswest * 60;
time_interpolator_reset();
+ #ifdef CONFIG_GENERIC_TIME
warp_check_clock_was_changed();
+ #endif
write_sequnlock_irq(&xtime_lock);
clock_was_set();
}
***************
*** 320,326 ****
--- 322,330 ----
time_esterror = NTP_PHASE_LIMIT;
time_interpolator_reset();
}
+ #ifdef CONFIG_GENERIC_TIME
warp_check_clock_was_changed();
+ #endif
write_sequnlock_irq(&xtime_lock);
clock_was_set();
return 0;
diff -cr linux-2.6.21.5-rt14/kernel/timer.c linux-2.6.21.5-rt14-nogt/kernel/timer.c
*** linux-2.6.21.5-rt14/kernel/timer.c 2007-06-18 19:55:56.000000000 +0900
--- linux-2.6.21.5-rt14-nogt/kernel/timer.c 2007-06-16 16:36:10.000000000 +0900
***************
*** 1165,1171 ****
--- 1165,1173 ----
clock->cycle_accumulated = 0;
clock->error = 0;
timekeeping_suspended = 0;
+ #ifdef CONFIG_GENERIC_TIME
warp_check_clock_was_changed();
+ #endif
write_sequnlock_irqrestore(&xtime_lock, flags);
touch_softlockup_watchdog();
***************
*** 1728,1736 ****
--- 1730,1742 ----
* too.
*/
+ #ifdef CONFIG_GENERIC_TIME
tp = xtime;
nsecs = __get_nsec_offset();
timespec_add_ns(&tp, nsecs);
+ #else
+ getnstimeofday(&tp);
+ #endif
tp.tv_sec += wall_to_monotonic.tv_sec;
tp.tv_nsec += wall_to_monotonic.tv_nsec;
On Mon, 18 Jun 2007, Srivatsa Vaddagiri wrote:
> This particular machine, elm3b6, is a 4-CPU, (gasp, yes!) 4-node box, i.e.
> each CPU is a node by itself. If you don't have CONFIG_NUMA enabled,
> then we won't have cross-node (i.e. cross-cpu) load balancing.
> Fortunately, in your case you had CONFIG_NUMA enabled, but you were still
> hitting the (gross) load imbalance.
>
> The problem seems to be with idle_balance(). This particular routine,
> invoked by schedule() on an idle cpu, walks up the sched-domain hierarchy
> and tries to balance in each domain that has the SD_BALANCE_NEWIDLE flag
> set. The nodes-level domain (SD_NODE_INIT) however doesn't set this flag,
> which means an idle cpu looks for (im)balance within its own node at most and
The nodes-level domain looks for internode balances between up to 16
nodes. It is not restricted to a single node. The balancing on the
phys_domain level does balance within a node.
On Mon, Jun 18, 2007 at 09:54:18AM -0700, Christoph Lameter wrote:
> The nodes-level domain looks for internode balances between up to 16
> nodes. It is not restricted to a single node.
I was mostly speaking with the example system in mind (4-node 4-cpu
box), but yes, node-level domain does look for imbalance across max 16
nodes as you mention.
Neither the node nor the all-node domains have SD_BALANCE_NEWIDLE set,
which means idle_balance() will stop looking for imbalance beyond its own
node. Based on the observed balance within its own node, IMO,
idle_balance() should not cause ->next_balance to be reset.
--
Regards,
vatsa
On Mon, Jun 18, 2007 at 08:42:15PM +0530, Srivatsa Vaddagiri wrote:
> If you don't have CONFIG_NUMA enabled,
> then we won't have cross-node (i.e. cross-cpu) load balancing.
Mmm.. that is not correct. I found that disabling CONFIG_NUMA leads
to better load balance on the problem system (i.e. without any patches
applied, 2.6.22-rc4-mm2 leads to good distribution of rcu readers on all
4 cpus).
Anyway, the patch is still needed for scenarios like you originally
tested with.
--
Regards,
vatsa
On Mon, 18 Jun 2007, Srivatsa Vaddagiri wrote:
> On Mon, Jun 18, 2007 at 09:54:18AM -0700, Christoph Lameter wrote:
> > The nodes-level domain looks for internode balances between up to 16
> > nodes. It is not restricted to a single node.
>
> I was mostly speaking with the example system in mind (4-node 4-cpu
> box), but yes, node-level domain does look for imbalance across max 16
> nodes as you mention.
>
> Neither the node nor the all-node domains have SD_BALANCE_NEWIDLE set,
> which means idle_balance() will stop looking for imbalance beyond its
> own node. Based on the observed balance within its own node, IMO,
> idle_balance() should not cause ->next_balance to be reset.
I think the check in idle_balance needs to be modified.
If the domain *does not* have SD_BALANCE_NEWIDLE set then
next_balance must still be set right. Does this patch fix it?
Scheduler: Fix next_interval determination in idle_balance().
The intervals of domains that do not have SD_BALANCE_NEWIDLE must
be considered for the calculation of the time of the next balance.
Otherwise we may defer rebalancing forever.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.22-rc4-mm2/kernel/sched.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/kernel/sched.c 2007-06-18 10:56:31.000000000 -0700
+++ linux-2.6.22-rc4-mm2/kernel/sched.c 2007-06-18 10:57:10.000000000 -0700
@@ -2493,17 +2493,16 @@ static void idle_balance(int this_cpu, s
unsigned long next_balance = jiffies + 60 * HZ;
for_each_domain(this_cpu, sd) {
- if (sd->flags & SD_BALANCE_NEWIDLE) {
+ if (sd->flags & SD_BALANCE_NEWIDLE)
/* If we've pulled tasks over stop searching: */
pulled_task = load_balance_newidle(this_cpu,
this_rq, sd);
- if (time_after(next_balance,
- sd->last_balance + sd->balance_interval))
- next_balance = sd->last_balance
- + sd->balance_interval;
- if (pulled_task)
- break;
- }
+ if (time_after(next_balance,
+ sd->last_balance + sd->balance_interval))
+ next_balance = sd->last_balance
+ + sd->balance_interval;
+ if (pulled_task)
+ break;
}
if (!pulled_task)
/*
On Mon, Jun 18, 2007 at 10:59:21AM -0700, Christoph Lameter wrote:
> I think the check in idle_balance needs to be modified.
>
> If the domain *does not* have SD_BALANCE_NEWIDLE set then
> next_balance must still be set right. Does this patch fix it?
Is the ->next_balance calculation in idle_balance() necessary at all?
rebalance_domains() would have programmed ->next_balance anyway, based
on the nearest next_balance point of all (load-balance'able) domains.
By repeating that calculation in idle_balance, are we covering any corner case?
--
Regards,
vatsa
On Tue, Jun 19, 2007 at 07:22:32AM +0530, Srivatsa Vaddagiri wrote:
> On Mon, Jun 18, 2007 at 10:59:21AM -0700, Christoph Lameter wrote:
> > I think the check in idle_balance needs to be modified.
> >
> > If the domain *does not* have SD_BALANCE_NEWIDLE set then
> > next_balance must still be set right. Does this patch fix it?
>
> Is the ->next_balance calculation in idle_balance() necessary at all?
> rebalance_domains() would have programmed ->next_balance anyway, based
> on the nearest next_balance point of all (load-balance'able) domains.
> By repeating that calculation in idle_balance, are we covering any corner case?
rebalance_domains() would have programmed ->next_balance based on the
'busy' state. And now, as the CPU is going 'idle', this routine
recalculates next_balance based on the 'idle' state.
thanks,
suresh
On Mon, Jun 18, 2007 at 10:59:21AM -0700, Christoph Lameter wrote:
> for_each_domain(this_cpu, sd) {
> - if (sd->flags & SD_BALANCE_NEWIDLE) {
> + if (sd->flags & SD_BALANCE_NEWIDLE)
> /* If we've pulled tasks over stop searching: */
> pulled_task = load_balance_newidle(this_cpu,
> this_rq, sd);
> - if (time_after(next_balance,
> - sd->last_balance + sd->balance_interval))
> - next_balance = sd->last_balance
> - + sd->balance_interval;
> - if (pulled_task)
> - break;
> - }
> + if (time_after(next_balance,
> + sd->last_balance + sd->balance_interval))
> + next_balance = sd->last_balance
> + + sd->balance_interval;
don't we have to do msecs_to_jiffies(sd->balance_interval)?
thanks,
suresh
> + if (pulled_task)
> + break;
> }
> if (!pulled_task)
> /*
On Mon, 18 Jun 2007, Siddha, Suresh B wrote:
> > + if (time_after(next_balance,
> > + sd->last_balance + sd->balance_interval))
> > + next_balance = sd->last_balance
> > + + sd->balance_interval;
>
> don't we have to do msecs_to_jiffies(sd->balance_interval)?
Well, that is certainly a bug. Is this better?
Scheduler: Fix next_interval determination in idle_balance().
The intervals of domains that do not have SD_BALANCE_NEWIDLE must
be considered for the calculation of the time of the next balance.
Otherwise we may defer rebalancing forever.
Siddha also spotted that the conversion of the balance interval
to jiffies is missing. Fix that too.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux-2.6.22-rc4-mm2/kernel/sched.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/kernel/sched.c 2007-06-18 20:41:46.000000000 -0700
+++ linux-2.6.22-rc4-mm2/kernel/sched.c 2007-06-18 20:44:00.000000000 -0700
@@ -2493,17 +2493,18 @@ static void idle_balance(int this_cpu, s
unsigned long next_balance = jiffies + 60 * HZ;
for_each_domain(this_cpu, sd) {
- if (sd->flags & SD_BALANCE_NEWIDLE) {
+ unsigned long interval;
+
+ if (sd->flags & SD_BALANCE_NEWIDLE)
/* If we've pulled tasks over stop searching: */
- pulled_task = load_balance_newidle(this_cpu,
- this_rq, sd);
- if (time_after(next_balance,
- sd->last_balance + sd->balance_interval))
- next_balance = sd->last_balance
- + sd->balance_interval;
- if (pulled_task)
- break;
- }
+ pulled_task = load_balance_newidle(this_cpu,this_rq, sd);
+
+ interval = msecs_to_jiffies(sd->balance_interval);
+ if (time_after(next_balance,
+ sd->last_balance + interval))
+ next_balance = sd->last_balance + interval;
+ if (pulled_task)
+ break;
}
if (!pulled_task)
/*
Katsuya-San,
On Tue, 2007-06-19 at 01:14 +0900, Katsuya MATSUBARA wrote:
> > It lacks support for the generic timeofday and clock event layers, which
> > causes the compile breakage.
>
> I am working on Renesas SuperH platforms.
> I faced the similar compile errors
> because 2.6.21.X in SH does not support GENERIC_TIME yet.
> I made a workaround patch. Is this correct?
Looks good.
Can you in future please use "diff -ur" to generate patches? That's the
usual format.
Thanks,
tglx
On Mon, Jun 18, 2007 at 08:46:03PM -0700, Christoph Lameter wrote:
> @@ -2493,17 +2493,18 @@ static void idle_balance(int this_cpu, s
> unsigned long next_balance = jiffies + 60 * HZ;
>
> for_each_domain(this_cpu, sd) {
> - if (sd->flags & SD_BALANCE_NEWIDLE) {
> + unsigned long interval;
> +
Do we need a :
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
here?
Otherwise the patch looks good and fixes the problem Paul observed earlier.
> + if (sd->flags & SD_BALANCE_NEWIDLE)
> /* If we've pulled tasks over stop searching: */
> - pulled_task = load_balance_newidle(this_cpu,
> - this_rq, sd);
> - if (time_after(next_balance,
> - sd->last_balance + sd->balance_interval))
> - next_balance = sd->last_balance
> - + sd->balance_interval;
> - if (pulled_task)
> - break;
> - }
> + pulled_task = load_balance_newidle(this_cpu,this_rq, sd);
> +
> + interval = msecs_to_jiffies(sd->balance_interval);
> + if (time_after(next_balance,
> + sd->last_balance + interval))
> + next_balance = sd->last_balance + interval;
> + if (pulled_task)
> + break;
> }
> if (!pulled_task)
> /*
--
Regards,
vatsa
* Srivatsa Vaddagiri <[email protected]> wrote:
> On Mon, Jun 18, 2007 at 08:46:03PM -0700, Christoph Lameter wrote:
> > @@ -2493,17 +2493,18 @@ static void idle_balance(int this_cpu, s
> > unsigned long next_balance = jiffies + 60 * HZ;
> >
> > for_each_domain(this_cpu, sd) {
> > - if (sd->flags & SD_BALANCE_NEWIDLE) {
> > + unsigned long interval;
> > +
>
> Do we need a :
>
> if (!(sd->flags & SD_LOAD_BALANCE))
> continue;
>
> here?
>
> Otherwise the patch looks good and fixes the problem Paul observed earlier.
great! I've applied the patch below (added your fix and cleaned it up a
bit) and have released 2.6.21.5-rt17 with it.
Ingo
------------------------------>
From: Christoph Lameter <[email protected]>
Subject: [patch] sched: fix next_interval determination in idle_balance().
The intervals of domains that do not have SD_BALANCE_NEWIDLE must
be considered for the calculation of the time of the next balance.
Otherwise we may defer rebalancing forever.
Siddha also spotted that the conversion of the balance interval
to jiffies is missing. Fix that too.
From: Srivatsa Vaddagiri <[email protected]>
also continue the loop if !(sd->flags & SD_LOAD_BALANCE).
Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -2591,17 +2591,21 @@ static void idle_balance(int this_cpu, s
unsigned long next_balance = jiffies + HZ;
for_each_domain(this_cpu, sd) {
- if (sd->flags & SD_BALANCE_NEWIDLE) {
+ unsigned long interval;
+
+ if (!(sd->flags & SD_LOAD_BALANCE))
+ continue;
+
+ if (sd->flags & SD_BALANCE_NEWIDLE)
/* If we've pulled tasks over stop searching: */
pulled_task = load_balance_newidle(this_cpu,
- this_rq, sd);
- if (time_after(next_balance,
- sd->last_balance + sd->balance_interval))
- next_balance = sd->last_balance
- + sd->balance_interval;
- if (pulled_task)
- break;
- }
+ this_rq, sd);
+
+ interval = msecs_to_jiffies(sd->balance_interval);
+ if (time_after(next_balance, sd->last_balance + interval))
+ next_balance = sd->last_balance + interval;
+ if (pulled_task)
+ break;
}
if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
/*
* Srivatsa Vaddagiri <[email protected]> wrote:
> I believe the patch below is correct. With the patch applied, I could
> not recreate the imbalance with rcutorture. Let me know whether you
> still see the problem with this patch applied on any other machine.
thanks for tracking this down! I've applied Christoph's patch (with your
suggested modification plus a few small cleanups).
I'm wondering, why did this trigger under CFS and not on mainline?
Mainline seems to have a similar problem in idle_balance() too, or am i
misreading it?
Ingo
On Tue, Jun 19, 2007 at 11:04:30AM +0200, Ingo Molnar wrote:
> I'm wondering, why did this trigger under CFS and not on mainline?
I thought Paul had seen the same problem with 2.6.21.5. I will try a
more recent mainline (2.6.22-rc5 maybe) after I get hold of the problem
machine and report later today.
If there is any difference, it should be because of the topology reported
by the low-level platform code. In the problem case, each CPU was being
reported as a separate node (CONFIG_NUMA enabled), which caused
idle_balance() to stop load-balance lookups at the cpu/node level itself.
> Mainline seems to have a similar problem in idle_balance() too, or am i
> misreading it?
--
Regards,
vatsa
On Tue, Jun 19, 2007 at 11:04:30AM +0200, Ingo Molnar wrote:
> I'm wondering, why did this trigger under CFS and not on mainline?
> Mainline seems to have a similar problem in idle_balance() too, or am i
> misreading it?
The problem is very much there in mainline. I could recreate it
with 2.6.22-rc5 (which doesn't have CFS) on the same hardware, with
CONFIG_NUMA enabled.
Let me know if you need anything else clarified.
--
Regards,
vatsa
On Tue, Jun 19, 2007 at 11:04:30AM +0200, Ingo Molnar wrote:
>
> * Srivatsa Vaddagiri <[email protected]> wrote:
>
> > I believe the patch below is correct. With the patch applied, I could
> > not recreate the imbalance with rcutorture. Let me know whether you
> > still see the problem with this patch applied on any other machine.
>
> thanks for tracking this down! I've applied Christoph's patch (with your
> suggested modification plus a few small cleanups).
>
> I'm wondering, why did this trigger under CFS and not on mainline?
> Mainline seems to have a similar problem in idle_balance() too, or am i
> misreading it?
It did in fact trigger under all three of mainline, CFS, and -rt including
CFS -- see below for a couple of emails from last Friday giving results
for these three on the AMD box (where it happened) and on a single-quad
NUMA-Q system (where it did not, at least not with such severity).
That said, there certainly was a time when neither mainline nor -rt
acted this way!
Thanx, Paul
------------------------------------------------------------------------
Date: Fri, 15 Jun 2007 13:06:17 -0700
From: "Paul E. McKenney" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>,
Dinakar Guniguntala <[email protected]>
Subject: Re: v2.6.21.4-rt11
On Fri, Jun 15, 2007 at 08:14:52AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 15, 2007 at 04:45:35PM +0200, Ingo Molnar wrote:
> >
> > Paul,
> >
> > do you still see the load-distribution problem with -rt14? (which
> > includes cfsv17) Or rather ... could you try vanilla cfsv17 instead:
> >
> > http://people.redhat.com/mingo/cfs-scheduler/
> >
> > to make sure it's not some effect in -rt causing this. v17 has an
> > updated load balancing code. (which might or might not affect the
> > rcutorture problem.)
No joy, see below. Strangely hardware dependent. My next step, left
to myself, would be to patch rcutorture.c to cause the readers to dump
the CFS state information every ten seconds or so. My guess is that
the important per-task stuff is:
current->sched_info.pcnt
current->sched_info.cpu_time
current->sched_info.run_delay
current->sched_info.last_arrival
current->sched_info.last_queued
And maybe the runqueue info dumped out by show_schedstat, this last
via new per-CPU tasks.
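Something like the following in the reader loop, perhaps (a hypothetical
sketch; "last_dump" is a made-up per-thread local, and this assumes
CONFIG_SCHEDSTATS so the sched_info fields are populated):
	/* Hypothetical debug hack: dump this reader's sched_info roughly
	 * every ten seconds. */
	if (time_after(jiffies, last_dump + 10 * HZ)) {
		last_dump = jiffies;
		printk(KERN_INFO "rcutorture %d: pcnt=%lu cpu_time=%lu "
		       "run_delay=%lu last_arrival=%lu last_queued=%lu\n",
		       current->pid, current->sched_info.pcnt,
		       current->sched_info.cpu_time,
		       current->sched_info.run_delay,
		       current->sched_info.last_arrival,
		       current->sched_info.last_queued);
	}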
Other thoughts?
Thanx, Paul
> Good point! I will try the following:
>
> 1. Stock 2.6.21.5. 64-bit kernel on AMD Opterons.
All eight readers end up on the same CPU, CPU 1 in this case. And they
stay there (ten minutes).
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3058 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3059 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3060 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3061 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3062 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3063 root 39 19 0 0 0 R 12.7 0.0 0:06.91 rcu_torture_rea
3057 root 39 19 0 0 0 R 12.3 0.0 0:06.91 rcu_torture_rea
3064 root 39 19 0 0 0 R 12.3 0.0 0:06.91 rcu_torture_rea
> 1. Stock 2.6.21.5. 32-bit kernel on NUMA-Q.
Works just fine(!).
> 2. 2.6.21-rt14. 64-bit kernel on AMD Opterons.
All eight readers are spread, but over only two CPUs (0 and 3, in this
case). Persists, usually with 4/4 split, but sometimes with five
tasks on one CPU and three on the other.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3111 root 39 19 0 0 0 R 23.9 0.0 0:27.27 rcu_torture_rea
3114 root 39 19 0 0 0 R 23.9 0.0 0:28.58 rcu_torture_rea
3117 root 39 19 0 0 0 R 23.9 0.0 0:32.40 rcu_torture_rea
3112 root 39 19 0 0 0 R 23.6 0.0 0:28.41 rcu_torture_rea
3110 root 39 19 0 0 0 R 22.9 0.0 0:43.46 rcu_torture_rea
3113 root 39 19 0 0 0 R 22.9 0.0 0:27.28 rcu_torture_rea
3115 root 39 19 0 0 0 R 22.9 0.0 0:33.08 rcu_torture_rea
3116 root 39 19 0 0 0 R 22.6 0.0 0:28.10 rcu_torture_rea
elm3b6:~# for ((i=3110;i<=3117;i++)); do cat /proc/$i/stat | awk '{print $(NF-3)}'; done
3 3 0 3 0 0 0 3
> 2. 2.6.21-rt14. 32-bit kernel on NUMA-Q.
Works just fine.
> 3. 2.6.21.5 + sched-cfs-v2.6.21.5-v17.patch on 64-bit kernel on
AMD Opteron.
All eight readers end up on the same CPU, CPU 2 in this case. And they
stay there (ten minutes).
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3081 root 39 19 0 0 0 R 11.3 0.0 1:31.77 rcu_torture_rea
3082 root 39 19 0 0 0 R 11.3 0.0 1:31.77 rcu_torture_rea
3085 root 39 19 0 0 0 R 11.3 0.0 1:31.78 rcu_torture_rea
3079 root 39 19 0 0 0 R 11.0 0.0 1:31.72 rcu_torture_rea
3080 root 39 19 0 0 0 R 11.0 0.0 1:31.76 rcu_torture_rea
3083 root 39 19 0 0 0 R 11.0 0.0 1:31.76 rcu_torture_rea
3084 root 39 19 0 0 0 R 11.0 0.0 1:31.77 rcu_torture_rea
3086 root 39 19 0 0 0 R 11.0 0.0 1:31.75 rcu_torture_rea
Using "taskset" to pin each process to a pair of CPUs (masks 0x3, 0x6,
0xc, and 0x9) forces them to CPUs 0 and 2 -- previously this had spread
them nicely. So I kept pinning tasks to single CPUs (which defeats
some rcutorture testing) until they did spread, getting the following:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3079 root 39 19 0 0 0 R 49.9 0.0 3:18.91 rcu_torture_rea
3080 root 39 19 0 0 0 R 49.6 0.0 3:15.76 rcu_torture_rea
3086 root 39 19 0 0 0 R 49.6 0.0 3:48.82 rcu_torture_rea
3083 root 39 19 0 0 0 R 49.3 0.0 2:58.02 rcu_torture_rea
3084 root 39 19 0 0 0 R 48.6 0.0 3:00.54 rcu_torture_rea
3081 root 39 19 0 0 0 R 47.9 0.0 3:00.55 rcu_torture_rea
3082 root 39 19 0 0 0 R 44.6 0.0 3:18.89 rcu_torture_rea
3085 root 39 19 0 0 0 R 44.3 0.0 3:07.11 rcu_torture_rea
elm3b6:~# for ((i=3079;i<=3086;i++)); do cat /proc/$i/stat | awk '{print $(NF-3)}'; done
0 0 2 1 3 2 1 3
> 3. 2.6.21.5 + sched-cfs-v2.6.21.5-v17.patch on 32-bit kernel on
NUMA-Q.
Some imbalance:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2263 root 39 19 0 0 0 R 92.4 0.0 2:19.69 rcu_torture_rea
2265 root 39 19 0 0 0 R 49.8 0.0 1:41.84 rcu_torture_rea
2264 root 39 19 0 0 0 R 49.5 0.0 2:11.69 rcu_torture_rea
2261 root 39 19 0 0 0 R 48.8 0.0 2:09.95 rcu_torture_rea
2262 root 39 19 0 0 0 R 48.8 0.0 3:01.42 rcu_torture_rea
2266 root 39 19 0 0 0 R 30.1 0.0 1:47.02 rcu_torture_rea
2260 root 39 19 0 0 0 R 29.8 0.0 2:10.07 rcu_torture_rea
2267 root 39 19 0 0 0 R 29.8 0.0 1:57.34 rcu_torture_rea
elm3b132:~# for ((i=2260;i<=2267;i++)); do cat /proc/$i/stat | awk '{print $(NF-3)}'; done
0 1 3 1 2 2 2 0
Has persisted (with some shuffling of CPUs, see below) for about five
minutes, will let it run for an hour or so to see if it is really serious
about this.
3 0 0 2 1 0 2 3
------------------------------------------------------------------------
Date: Fri, 15 Jun 2007 15:00:17 -0700
From: "Paul E. McKenney" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>,
Dinakar Guniguntala <[email protected]>,
Srivatsa Vaddagiri <[email protected]>,
Dmitry Adamushko <[email protected]>
Subject: Re: v2.6.21.4-rt11
On Fri, Jun 15, 2007 at 10:35:39PM +0200, Ingo Molnar wrote:
>
> (forwarding Paul's mail below to other CFS hackers too.)
>
> ------------>
[ . . . ]
> > 3. 2.6.21.5 + sched-cfs-v2.6.21.5-v17.patch on 32-bit kernel on
> NUMA-Q.
>
> Some imbalance:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2263 root 39 19 0 0 0 R 92.4 0.0 2:19.69 rcu_torture_rea
> 2265 root 39 19 0 0 0 R 49.8 0.0 1:41.84 rcu_torture_rea
> 2264 root 39 19 0 0 0 R 49.5 0.0 2:11.69 rcu_torture_rea
> 2261 root 39 19 0 0 0 R 48.8 0.0 2:09.95 rcu_torture_rea
> 2262 root 39 19 0 0 0 R 48.8 0.0 3:01.42 rcu_torture_rea
> 2266 root 39 19 0 0 0 R 30.1 0.0 1:47.02 rcu_torture_rea
> 2260 root 39 19 0 0 0 R 29.8 0.0 2:10.07 rcu_torture_rea
> 2267 root 39 19 0 0 0 R 29.8 0.0 1:57.34 rcu_torture_rea
>
> elm3b132:~# for ((i=2260;i<=2267;i++)); do cat /proc/$i/stat | awk '{print $(NF-3)}'; done
> 0 1 3 1 2 2 2 0
>
> Has persisted (with some shuffling of CPUs, see below) for about five
> minutes, will let it run for an hour or so to see if it is really serious
> about this.
>
> 3 0 0 2 1 0 2 3
And when I returned after an hour, it had straightened itself out:
1 3 1 2 2 0 3 0
The 64-bit AMD 4-CPU machines have not straightened themselves out
in the past, but will try an extended run over the weekend to see
if load balancing is just a bit on the slow side. ;-)
But I got distracted for an additional hour, and it is imbalanced again:
2 1 2 0 1 1 3 3
Strange...
Thanx, Paul
On Tue, 19 Jun 2007, Ingo Molnar wrote:
> I'm wondering, why did this trigger under CFS and not on mainline?
> Mainline seems to have a similar problem in idle_balance() too, or am i
> misreading it?
Right. The patch needs to go into mainline as well.
On Tue, 19 Jun 2007, Srivatsa Vaddagiri wrote:
> On Tue, Jun 19, 2007 at 11:04:30AM +0200, Ingo Molnar wrote:
> > I'm wondering, why did this trigger under CFS and not on mainline?
> > Mainline seems to have a similar problem in idle_balance() too, or am i
> > misreading it?
>
> The problem is very much there in mainline. I could recreate it
> with 2.6.22-rc5 (which doesn't have CFS) on the same hardware, with
> CONFIG_NUMA enabled.
>
> Let me know if you need anything else clarified.
This is a bugfix that needs to go into 2.6.22.