Hi,
the input layer does a "synchronize_rcu()" after a list_add_tail_rcu(), which
is costing me 1 second of boot time.....
And based on my understanding of the RCU concept, you only need to synchronize on delete,
not on addition... so I think the synchronize is entirely redundant here...
Can I have my second of boot time back please ?
diff --git a/drivers/input/input.c b/drivers/input/input.c
index 1730d73..d69ec56 100644
--- a/drivers/input/input.c
+++ b/drivers/input/input.c
@@ -1544,7 +1544,6 @@ int input_register_handle(struct input_handle *handle)
 		return error;
 
 	list_add_tail_rcu(&handle->d_node, &dev->h_list);
 	mutex_unlock(&dev->mutex);
-	synchronize_rcu();
 
 	/*
 	 * Since we are supposed to be called from ->connect()
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Hi Arjan,
On Wednesday 18 March 2009 21:58:12 Arjan van de Ven wrote:
> Hi,
>
> the input layer does a "synchronize_rcu()" after a list_add_tail_rcu(),
> which is costing me 1 second of boot time.....
> And based on my understanding of the RCU concept, you only need to
> synchronize on delete, not on addition... so I think the synchronize is
> entirely redundant here...
It is there to guarantee that once we registered the handle all
subsequent input events will be delivered through it.
>
> Can I have my second of boot time back please ?
>
Not like this I'm afraid but I will see what I can do.
>
>
> diff --git a/drivers/input/input.c b/drivers/input/input.c
> index 1730d73..d69ec56 100644
> --- a/drivers/input/input.c
> +++ b/drivers/input/input.c
> @@ -1544,7 +1544,6 @@ int input_register_handle(struct input_handle *handle)
>  		return error;
>
>  	list_add_tail_rcu(&handle->d_node, &dev->h_list);
>  	mutex_unlock(&dev->mutex);
> -	synchronize_rcu();
>
>  	/*
>  	 * Since we are supposed to be called from ->connect()
--
Dmitry
On Wed, Mar 18, 2009 at 09:58:12PM -0700, Arjan van de Ven wrote:
> Hi,
>
> the input layer does a "synchronize_rcu()" after a list_add_tail_rcu(), which
> is costing me 1 second of boot time.....
> And based on my understanding of the RCU concept, you only need to synchronize on delete,
> not on addition... so I think the synchronize is entirely redundant here...
>
The more appropriate question is - why is synchronize_rcu() taking
1 second ? Any idea what the other CPUs are doing at the time
of calling synchronize_rcu() ? What driver is this ? How early
in the boot is this happening ?
Thanks
Dipankar
On Thu, 19 Mar 2009 00:23:56 -0700
Dmitry Torokhov <[email protected]> wrote:
> Hi Arjan,
>
> On Wednesday 18 March 2009 21:58:12 Arjan van de Ven wrote:
> > Hi,
> >
> > the input layer does a "synchronize_rcu()" after a
> > list_add_tail_rcu(), which is costing me 1 second of boot time.....
> > And based on my understanding of the RCU concept, you only need to
> > synchronize on delete, not on addition... so I think the
> > synchronize is entirely redundant here...
>
>
> It is there to guarantee that once we registered the handle all
> subsequent input events will be delivered through it.
afaik rcu already guarantees that even without a synchronize;
the only reason you would need a synchronize is to guarantee that
people STOPPED using your memory. Or am I now totally misunderstanding
RCU ?
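For reference, the pattern I have in mind is roughly this (a minimal
sketch with made-up names, not the actual input-core code):

#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

static LIST_HEAD(foo_list);
static DEFINE_SPINLOCK(foo_lock);

struct foo {
	struct list_head node;
	int data;
};

/* addition: publishing a new element needs no grace period */
void foo_add(struct foo *p)
{
	spin_lock(&foo_lock);
	list_add_tail_rcu(&p->node, &foo_list);
	spin_unlock(&foo_lock);
}

/* deletion: the grace period sits between unlink and free, so readers
 * still traversing the old element finish before it is kfree()d */
void foo_del(struct foo *p)
{
	spin_lock(&foo_lock);
	list_del_rcu(&p->node);
	spin_unlock(&foo_lock);
	synchronize_rcu();
	kfree(p);
}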
On Thu, 19 Mar 2009 14:26:28 +0530
Dipankar Sarma <[email protected]> wrote:
> On Wed, Mar 18, 2009 at 09:58:12PM -0700, Arjan van de Ven wrote:
> > Hi,
> >
> > the input layer does a "synchronize_rcu()" after a
> > list_add_tail_rcu(), which is costing me 1 second of boot time.....
> > And based on my understanding of the RCU concept, you only need to
> > synchronize on delete, not on addition... so I think the
> > synchronize is entirely redundant here...
> >
>
> The more appropriate question is - why is synchronize_rcu() taking
> 1 second ? Any idea what the other CPUs are doing at the time
> of calling synchronize_rcu() ?
one cpu is doing a lot of i2c traffic which is a bunch of udelay()s
in loops.. then it does quite a bit of uncached memory access, and
the lot takes quite a while.
> What driver is this ? How early
> in the boot is this happening ?
during kernel boot.
I suppose my question is also more generic.. why synchronize when it's
not needed? At least based on my understanding of RCU (but you're the
expert), you don't need to synchronize for an add, only between a
delete and a (k)free.....
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Thu, Mar 19, 2009 at 07:18:41AM -0700, Arjan van de Ven wrote:
> On Thu, 19 Mar 2009 14:26:28 +0530
> Dipankar Sarma <[email protected]> wrote:
>
> > On Wed, Mar 18, 2009 at 09:58:12PM -0700, Arjan van de Ven wrote:
> > > Hi,
> > >
> > > the input layer does a "synchronize_rcu()" after a
> > > list_add_tail_rcu(), which is costing me 1 second of boot time.....
> > > And based on my understanding of the RCU concept, you only need to
> > > synchronize on delete, not on addition... so I think the
> > > synchronize is entirely redundant here...
> >
> > The more appropriate question is - why is synchronize_rcu() taking
> > 1 second ? Any idea what the other CPUs are doing at the time
> > of calling synchronize_rcu() ?
>
> one cpu is doing a lot of i2c traffic which is a bunch of udelay()s
> in loops.. then it does quite a bit of uncached memory access, and
> the lot takes quite a while.
>
> > What driver is this ? How early
> > in the boot is this happening ?
>
> during kernel boot.
>
> I suppose my question is also more generic.. why synchronize when it's
> not needed? At least based on my understanding of RCU (but you're the
> expert), you don't need to synchronize for an add, only between a
> delete and a (k)free.....
I don't claim to understand the code in question, so it is entirely
possible that the following is irrelevant. But one other reason for
synchronize_rcu() is:
1. Make change.
2. synchronize_rcu()
3. Now you are guaranteed that all CPUs/tasks/whatever currently
running either are not messing with you on the one hand, or
have seen the change on the other.
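In other words (a minimal sketch with hypothetical names, not any
particular subsystem's code):

#include <linux/rcupdate.h>

int handler_ready;	/* readers test this under rcu_read_lock() */

void publish_handler(void)
{
	handler_ready = 1;	/* 1. make change */
	synchronize_rcu();	/* 2. wait for a grace period */
	/*
	 * 3. every RCU read-side critical section that was running
	 * at step 1 has completed by now, so anything still running
	 * either is not messing with us or has seen the change.
	 */
}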
It sounds like you are seeing these delays later in boot, however,
if this is instead happening before the scheduler is operational,
please check to make sure that the following patch is applied:
http://lkml.org/lkml/2009/2/25/558
This patch will also -greatly- improve performance on single-CPU systems,
as it makes synchronize_rcu() essentially a no-op in the single-CPU case
for Classic RCU.
Alternatively, again assuming a single-CPU system, the following patch
also effectively makes synchronize_rcu() a no-op while also cutting
down the kernel's memory footprint:
http://lkml.org/lkml/2009/2/3/333
Thanx, Paul
On Thu, 19 Mar 2009 19:07:50 -0700
"Paul E. McKenney" <[email protected]> wrote:
> On Thu, Mar 19, 2009 at 07:18:41AM -0700, Arjan van de Ven wrote:
> > On Thu, 19 Mar 2009 14:26:28 +0530
> > Dipankar Sarma <[email protected]> wrote:
> >
> > > On Wed, Mar 18, 2009 at 09:58:12PM -0700, Arjan van de Ven wrote:
> > > > Hi,
> > > >
> > > > the input layer does a "synchronize_rcu()" after a
> > > > list_add_tail_rcu(), which is costing me 1 second of boot
> > > > time..... And based on my understanding of the RCU concept, you
> > > > only need to synchronize on delete, not on addition... so I
> > > > think the synchronize is entirely redundant here...
> > >
> > > The more appropriate question is - why is synchronize_rcu() taking
> > > 1 second ? Any idea what the other CPUs are doing at the time
> > > of calling synchronize_rcu() ?
> >
> > one cpu is doing a lot of i2c traffic which is a bunch of udelay()s
> > in loops.. then it does quite a bit of uncached memory access, and
> > the lot takes quite a while.
> >
> > > What driver is this ? How early
> > > in the boot is this happening ?
> >
> > during kernel boot.
> >
> > I suppose my question is also more generic.. why synchronize when
> > it's not needed? At least based on my understanding of RCU (but
> > you're the expert), you don't need to synchronize for an add, only
> > between a delete and a (k)free.....
>
> I don't claim to understand the code in question, so it is entirely
> possible that the following is irrelevant. But one other reason for
> synchronize_rcu() is:
>
> 1. Make change.
>
> 2. synchronize_rcu()
>
> 3. Now you are guaranteed that all CPUs/tasks/whatever
> currently running either are not messing with you on the one hand, or
> have seen the change on the other.
ok so this is for the case where someone is already iterating the list.
I don't see anything in the code that assumes this..
>
> It sounds like you are seeing these delays later in boot, however,
yeah it's during driver init.
>
> Alternatively, again assuming a single-CPU system
single CPU is soooo last decade ;-)
But seriously I no longer have systems that aren't dual core or SMT in
some form...
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Thu, Mar 19, 2009 at 08:20:32PM -0700, Arjan van de Ven wrote:
> On Thu, 19 Mar 2009 19:07:50 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Thu, Mar 19, 2009 at 07:18:41AM -0700, Arjan van de Ven wrote:
> > > On Thu, 19 Mar 2009 14:26:28 +0530
> > > Dipankar Sarma <[email protected]> wrote:
> > >
> > > > On Wed, Mar 18, 2009 at 09:58:12PM -0700, Arjan van de Ven wrote:
> > > > > Hi,
> > > > >
> > > > > the input layer does a "synchronize_rcu()" after a
> > > > > list_add_tail_rcu(), which is costing me 1 second of boot
> > > > > time..... And based on my understanding of the RCU concept, you
> > > > > only need to synchronize on delete, not on addition... so I
> > > > > think the synchronize is entirely redundant here...
> > > >
> > > > The more appropriate question is - why is synchronize_rcu() taking
> > > > 1 second ? Any idea what the other CPUs are doing at the time
> > > > of calling synchronize_rcu() ?
> > >
> > > one cpu is doing a lot of i2c traffic which is a bunch of udelay()s
> > > in loops.. then it does quite a bit of uncached memory access, and
> > > the lot takes quite a while.
> > >
> > > > What driver is this ? How early
> > > > in the boot is this happening ?
> > >
> > > during kernel boot.
> > >
> > > I suppose my question is also more generic.. why synchronize when
> > > it's not needed? At least based on my understanding of RCU (but
> > > you're the expert), you don't need to synchronize for an add, only
> > > between a delete and a (k)free.....
> >
> > I don't claim to understand the code in question, so it is entirely
> > possible that the following is irrelevant. But one other reason for
> > synchronize_rcu() is:
> >
> > 1. Make change.
> >
> > 2. synchronize_rcu()
> >
> > 3. Now you are guaranteed that all CPUs/tasks/whatever
> > currently running either are not messing with you on the one hand, or
> > have seen the change on the other.
>
> ok so this is for the case where someone is already iterating the list.
>
> I don't see anything in the code that assumes this..
I must let the networking guys sort this out.
> > It sounds like you are seeing these delays later in boot, however,
>
> yeah it's during driver init.
>
> > Alternatively, again assuming a single-CPU system
>
> single CPU is soooo last decade ;-)
> But seriously I no longer have systems that aren't dual core or SMT in
> some form...
OK, I will ask the stupid question...
Why not delay bringing up the non-boot CPUs until later in boot?
The first patch in my earlier email (which is in mainline) will shortcut
synchronize_rcu() whenever only one CPU is online, at least
for Classic RCU and Hierarchical RCU.
Thanx, Paul
Paul E. McKenney a écrit :
> On Thu, Mar 19, 2009 at 08:20:32PM -0700, Arjan van de Ven wrote:
>> On Thu, 19 Mar 2009 19:07:50 -0700
>> "Paul E. McKenney" <[email protected]> wrote:
>>
>>> On Thu, Mar 19, 2009 at 07:18:41AM -0700, Arjan van de Ven wrote:
>>>> On Thu, 19 Mar 2009 14:26:28 +0530
>>>> Dipankar Sarma <[email protected]> wrote:
>>>>
>>>>> On Wed, Mar 18, 2009 at 09:58:12PM -0700, Arjan van de Ven wrote:
>>>>>> Hi,
>>>>>>
>>>>>> the input layer does a "synchronize_rcu()" after a
>>>>>> list_add_tail_rcu(), which is costing me 1 second of boot
>>>>>> time..... And based on my understanding of the RCU concept, you
>>>>>> only need to synchronize on delete, not on addition... so I
>>>>>> think the synchronize is entirely redundant here...
>>>>> The more appropriate question is - why is synchronize_rcu() taking
>>>>> 1 second ? Any idea what the other CPUs are doing at the time
>>>>> of calling synchronize_rcu() ?
>>>> one cpu is doing a lot of i2c traffic which is a bunch of udelay()s
>>>> in loops.. then it does quite a bit of uncached memory access, and
>>>> the lot takes quite a while.
>>>>
>>>>> What driver is this ? How early
>>>>> in the boot is this happening ?
>>>> during kernel boot.
>>>>
>>>> I suppose my question is also more generic.. why synchronize when
>>>> it's not needed? At least based on my understanding of RCU (but
>>>> you're the expert), you don't need to synchronize for an add, only
>>>> between a delete and a (k)free.....
>>> I don't claim to understand the code in question, so it is entirely
>>> possible that the following is irrelevant. But one other reason for
>>> synchronize_rcu() is:
>>>
>>> 1. Make change.
>>>
>>> 2. synchronize_rcu()
>>>
>>> 3. Now you are guaranteed that all CPUs/tasks/whatever
>>> currently running either are not messing with you on the one hand, or
>>> have seen the change on the other.
>> ok so this is for the case where someone is already iterating the list.
>>
>> I don't see anything in the code that assumes this..
>
> I must let the networking guys sort this out.
>
>>> It sounds like you are seeing these delays later in boot, however,
>> yeah it's during driver init.
>>
>>> Alternatively, again assuming a single-CPU system
>> single CPU is soooo last decade ;-)
>> But seriously I no longer have systems that aren't dual core or SMT in
>> some form...
>
> OK, I will ask the stupid question...
>
> Why not delay bringing up the non-boot CPUs until later in boot?
> The first patch in my earlier email (which is in mainline) will shortcut
> synchronize_rcu() whenever only one CPU is online, at least
> for Classic RCU and Hierarchical RCU.
>
Hmm... point is to make linux boot as fast as possible, so ...
Use a special variant of udelay() in offending drivers that makes the
appropriate RCU call to increment the quiescent-state counter ?
On Fri, Mar 20, 2009 at 06:28:11AM +0100, Eric Dumazet wrote:
> Paul E. McKenney a écrit :
> Hmm... point is to make linux boot as fast as possible, so ...
>
> Use a special variant of udelay() in offending drivers that makes the
> appropriate RCU call to increment the quiescent-state counter ?
It may not be safe to do this if it is in an RCU read-side critical section.
Thanks
Dipankar
Dipankar Sarma a écrit :
> On Fri, Mar 20, 2009 at 06:28:11AM +0100, Eric Dumazet wrote:
>> Paul E. McKenney a écrit :
>> Hmm... point is to make linux boot as fast as possible, so ...
>>
>> Use a special variant of udelay() in offending drivers that makes the
>> appropriate RCU call to increment the quiescent-state counter ?
>
> It may not be safe to do this if it is in an RCU read-side critical section.
Yes, this is why I suggested a variant of udelay(), not udelay() itself,
used by selected drivers (probably not so many drivers hog the cpu for so long)
void udelay_rcu_quiescent(unsigned long usecs)
{
	preempt_disable();
	/* note a quiescent state on this cpu so that a pending
	 * grace period can complete while we busy-wait */
	rcu_qsctr_inc(smp_processor_id());
	preempt_enable();
	udelay(usecs);
}
Or maybe we have a way to detect we are in an RCU read-side critical section at runtime ?
void __udelay(unsigned long usecs)
{
	if (!in_rcu_critical_section()) {	/* hypothetical test */
		preempt_disable();
		rcu_qsctr_inc(smp_processor_id());
		preempt_enable();
	}
	...
}
Thank you
On Thu, 19 Mar 2009 21:45:41 -0700
"Paul E. McKenney" <[email protected]> wrote:
> > single CPU is soooo last decade ;-)
> > But seriously I no longer have systems that aren't dual core or SMT
> > in some form...
>
> OK, I will ask the stupid question...
>
> Why not delay bringing up the non-boot CPUs until later in boot?
that'd be throwing out the baby with the bathwater... I'm trying to use
the other cpus to do some of the boot work (so that the total goes
faster); not using the other cpus would be counter productive to that.
(As is just sitting in synchronize_rcu() when the other cpu is
working.. hence this discussion ;-)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Fri, Mar 20, 2009 at 06:50:58AM -0700, Arjan van de Ven wrote:
> On Thu, 19 Mar 2009 21:45:41 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > single CPU is soooo last decade ;-)
> > > But seriously I no longer have systems that aren't dual core or SMT
> > > in some form...
> >
> > OK, I will ask the stupid question...
> >
> > Why not delay bringing up the non-boot CPUs until later in boot?
>
> that'd be throwing out the baby with the bathwater... I'm trying to use
> the other cpus to do some of the boot work (so that the total goes
> faster); not using the other cpus would be counter productive to that.
> (As is just sitting in synchronize_rcu() when the other cpu is
> working.. hence this discussion ;-)
OK, so you are definitely running multiple CPUs when the offending
synchronize_rcu() executes, then?
If so, here are some follow-on questions:
1. How many synchronize_rcu() calls are you seeing on the
critical boot path and what value of HZ are you running?
If each synchronize_rcu() is taking (say) tens of jiffies, then,
as Peter Zijlstra notes earlier in this thread, we need to focus
on what is taking too long to get through its RCU read-side
critical sections. Otherwise, if each synchronize_rcu() is
in the 3-5 jiffy range, I may finally be forced to create an
expedited version of the synchronize_rcu() API.
2. If expediting is required, then the code calling synchronize_rcu()
might or might not have any idea whether or not expediting is
appropriate. If it does not, then we would need some sort of way
to tell synchronize_rcu() that it should act more aggressively,
perhaps /proc flag or kernel global variable indicating that
boot is in progress.
No, we do not want to make synchronize_rcu() aggressive all the
time, as this would harm performance and energy efficiency in
the normal runtime situation.
So, if it turns out that synchronize_rcu()'s caller does not
know whether or not expediting is appropriate, can the boot path
manipulate such a flag or variable?
3. Which RCU implementation are you using? CONFIG_CLASSIC_RCU,
CONFIG_TREE_RCU, or CONFIG_PREEMPT_RCU?
Thanx, Paul
On Fri, 20 Mar 2009 07:31:04 -0700
"Paul E. McKenney" <[email protected]> wrote:
> >
> > that'd be throwing out the baby with the bathwater... I'm trying to
> > use the other cpus to do some of the boot work (so that the total
> > goes faster); not using the other cpus would be counter productive
> > to that. (As is just sitting in synchronize_rcu() when the other
> > cpu is working.. hence this discussion ;-)
>
> OK, so you are definitely running multiple CPUs when the offending
> synchronize_rcu() executes, then?
absolutely.
(and I'm using bootgraph.pl in scripts to track who's stalling etc)
>
> If so, here are some follow-on questions:
>
> 1. How many synchronize_rcu() calls are you seeing on the
> critical boot path
I've seen only this (input) one to take a long time
> and what value of HZ are you running?
1000
>
> If each synchronize_rcu() is taking (say) tens of jiffies,
> then, as Peter Zijlstra notes earlier in this thread, we need to focus
> on what is taking too long to get through its RCU read-side
> critical sections
I know that "the other guy" is not optimal and takes waaay too long.
> Otherwise, if each synchronize_rcu() is
> in the 3-5 jiffy range, I may finally be forced to create an
> expedited version of the synchronize_rcu() API.
I think a simplified API for the "add to a list" case might make sense.
Because the request isn't for a full sync for sure...
(independent of that .. the open question is if this specific case is
even needed; I think the code confused "send to others" with "wait
until everyone sees"; afaik synchronize_rcu() has no pushing behavior
at all, nor should it)
>
> 2. If expediting is required, then the code calling
> synchronize_rcu() might or might not have any idea whether or not
> expediting is appropriate. If it does not, then we would need some
> sort of way to tell synchronize_rcu() that it should act more
> aggressively, perhaps /proc flag or kernel global variable indicating
> that boot is in progress.
>
> No, we do not want to make synchronize_rcu() aggressive all
> the time, as this would harm performance and energy efficiency in
> the normal runtime situation.
>
> So, if it turns out that synchronize_rcu()'s caller does not
> know whether or not expediting is appropriate, can the boot
> path manipulate such a flag or variable?
>
> 3. Which RCU implementation are you using? CONFIG_CLASSIC_RCU,
> CONFIG_TREE_RCU, or CONFIG_PREEMPT_RCU?
CLASSIC
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Fri, Mar 20, 2009 at 07:31:04AM -0700, Paul E. McKenney wrote:
> On Fri, Mar 20, 2009 at 06:50:58AM -0700, Arjan van de Ven wrote:
> > On Thu, 19 Mar 2009 21:45:41 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > > single CPU is soooo last decade ;-)
> > > > But seriously I no longer have systems that aren't dual core or SMT
> > > > in some form...
> > >
> > > OK, I will ask the stupid question...
> > >
> > > Why not delay bringing up the non-boot CPUs until later in boot?
> >
> > that'd be throwing out the baby with the bathwater... I'm trying to use
> > the other cpus to do some of the boot work (so that the total goes
> > faster); not using the other cpus would be counter productive to that.
> > (As is just sitting in synchronize_rcu() when the other cpu is
> > working.. hence this discussion ;-)
>
> OK, so you are definitely running multiple CPUs when the offending
> synchronize_rcu() executes, then?
>
> If so, here are some follow-on questions:
>
> 1. How many synchronize_rcu() calls are you seeing on the
> critical boot path and what value of HZ are you running?
>
> If each synchronize_rcu() is taking (say) tens of jiffies, then,
> as Peter Zijlstra notes earlier in this thread, we need to focus
> on what is taking too long to get through its RCU read-side
> critical sections. Otherwise, if each synchronize_rcu() is
> in the 3-5 jiffy range, I may finally be forced to create an
> expedited version of the synchronize_rcu() API.
>
> 2. If expediting is required, then the code calling synchronize_rcu()
> might or might not have any idea whether or not expediting is
> appropriate. If it does not, then we would need some sort of way
> to tell synchronize_rcu() that it should act more aggressively,
> perhaps /proc flag or kernel global variable indicating that
> boot is in progress.
>
> No, we do not want to make synchronize_rcu() aggressive all the
> time, as this would harm performance and energy efficiency in
> the normal runtime situation.
>
> So, if it turns out that synchronize_rcu()'s caller does not
> know whether or not expediting is appropriate, can the boot path
> manipulate such a flag or variable?
>
> 3. Which RCU implementation are you using? CONFIG_CLASSIC_RCU,
> CONFIG_TREE_RCU, or CONFIG_PREEMPT_RCU?
And one other thing... CONFIG_CLASSIC_RCU's synchronize_rcu() normally
runs faster than CONFIG_PREEMPT_RCU, if that helps.
Thanx, Paul
On Fri, Mar 20, 2009 at 11:13:54AM -0700, Arjan van de Ven wrote:
> On Fri, 20 Mar 2009 07:31:04 -0700
> "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > that'd be throwing out the baby with the bathwater... I'm trying to
> > > use the other cpus to do some of the boot work (so that the total
> > > goes faster); not using the other cpus would be counter productive
> > > to that. (As is just sitting in synchronize_rcu() when the other
> > > cpu is working.. hence this discussion ;-)
> >
> > OK, so you are definitely running multiple CPUs when the offending
> > synchronize_rcu() executes, then?
>
> absolutely.
> (and I'm using bootgraph.pl in scripts to track who's stalling etc)
> >
> > If so, here are some follow-on questions:
> >
> > 1. How many synchronize_rcu() calls are you seeing on the
> > critical boot path
>
> I've seen only this (input) one to take a long time
Ouch!!! A -single- synchronize_rcu() taking a full second??? That
indicates breakage.
> > and what value of HZ are you running?
>
> 1000
K, in absence of readers for RCU_CLASSIC, we should see a handful
of milliseconds for synchronize_rcu().
> > If each synchronize_rcu() is taking (say) tens of jiffies,
> > then, as Peter Zijlstra notes earlier in this thread, we need to focus
> > on what is taking too long to get through its RCU read-side
> > critical sections
>
> I know that "the other guy" is not optimal and takes waaay too long.
That could explain why Peter focused on this case. ;-)
> > Otherwise, if each synchronize_rcu() is
> > in the 3-5 jiffy range, I may finally be forced to create an
> > expedited version of the synchronize_rcu() API.
>
> I think a simplified API for the "add to a list" case might make sense.
> Because the request isn't for a full sync for sure...
>
> (independent of that .. the open question is if this specific case is
> even needed; I think the code confused "send to others" with "wait
> until everyone sees"; afaik synchronize_rcu() has no pushing behavior
> at all, nor should it)
Quite possibly, perhaps Dmitry will come up with something.
> > 2. If expediting is required, then the code calling
> > synchronize_rcu() might or might not have any idea whether or not
> > expediting is appropriate. If it does not, then we would need some
> > sort of way to tell synchronize_rcu() that it should act more
> > aggressively, perhaps /proc flag or kernel global variable indicating
> > that boot is in progress.
> >
> > No, we do not want to make synchronize_rcu() aggressive all
> > the time, as this would harm performance and energy efficiency in
> > the normal runtime situation.
> >
> > So, if it turns out that synchronize_rcu()'s caller does not
> > know whether or not expediting is appropriate, can the boot
> > path manipulate such a flag or variable?
> >
> > 3. Which RCU implementation are you using? CONFIG_CLASSIC_RCU,
> > CONFIG_TREE_RCU, or CONFIG_PREEMPT_RCU?
>
> CLASSIC
OK, it usually has the fastest synchronize_rcu() at the moment, though
I will be giving TREE_RCU some more help.
Sounds like I should hold off in favor of Dmitry's and Peter's efforts.
Thanx, Paul
On Fri, 20 Mar 2009 18:27:46 -0700
"Paul E. McKenney" <[email protected]> wrote:
> > I've seen only this (input) one to take a long time
>
> Ouch!!! A -single- synchronize_rcu() taking a full second??? That
> indicates breakage.
I'll do a debug mode tomorrow and visualize it as part of the bootgraph;
hopefully it'll be an interesting result.
I'll share the graph when I have it....
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Thursday 19 March 2009 20:20:32 Arjan van de Ven wrote:
> > I don't claim to understand the code in question, so it is entirely
> > possible that the following is irrelevant. But one other reason for
> > synchronize_rcu() is:
> >
> > 1. Make change.
> >
> > 2. synchronize_rcu()
> >
> > 3. Now you are guaranteed that all CPUs/tasks/whatever
> > currently running either are not messing with you on the one hand, or
> > have seen the change on the other.
>
> ok so this is for the case where someone is already iterating the list.
>
> I don't see anything in the code that assumes this..
This is something that input core guarantees to its users: by the time
input core calls handler->start() or, in its absence, by the time
input_register_handle() returns, events from input drivers will be
passed into the handle being registered, i.e. the presence of the
new item in the list is noticed by all CPUs.
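Simplified, the core relies on something like this (a sketch of the
idea, not the literal input.c code):

#include <linux/input.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>

/* writer side, as in input_register_handle() */
static int register_handle_sketch(struct input_dev *dev,
				  struct input_handle *handle)
{
	mutex_lock(&dev->mutex);
	list_add_tail_rcu(&handle->d_node, &dev->h_list);
	mutex_unlock(&dev->mutex);
	synchronize_rcu();	/* every event pass after this sees the handle */
	return 0;
}

/* reader side: event delivery walks the handle list under RCU */
static void deliver_event_sketch(struct input_dev *dev)
{
	struct input_handle *handle;

	rcu_read_lock();
	list_for_each_entry_rcu(handle, &dev->h_list, d_node)
		;	/* hand the event to this handle */
	rcu_read_unlock();
}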
Now, it is possible to stop using RCU primitives in the input core
but I think that you'd want to figure out why synchronize_rcu()
takes so long first, otherwise you may find another "abuser"
down the road.
--
Dmitry
Dmitry Torokhov a écrit :
> On Thursday 19 March 2009 20:20:32 Arjan van de Ven wrote:
>>> I don't claim to understand the code in question, so it is entirely
>>> possible that the following is irrelevant. But one other reason for
>>> synchronize_rcu() is:
>>>
>>> 1. Make change.
>>>
>>> 2. synchronize_rcu()
>>>
>>> 3. Now you are guaranteed that all CPUs/tasks/whatever
>>> currently running either are not messing with you on the one hand, or
>>> have seen the change on the other.
>> ok so this is for the case where someone is already iterating the list.
>>
>> I don't see anything in the code that assumes this..
>
> This is something that input core guarantees to its users: by the time
> input core calls handler->start() or, in its absence, by the time
> input_register_handle() returns, events from input drivers will be
> passed into the handle being registered, i.e. the presence of the
> new item in the list is noticed by all CPUs.
>
> Now, it is possible to stop using RCU primitives in the input core
> but I think that you'd want to figure out why synchronize_rcu()
> takes so long first, otherwise you may find another "abuser"
> down the road.
>
If a cpu does a loop, it is nearly impossible for synchronize_rcu() to
be fast.
We had the same problem in ksoftirqd(), where I had to add a call
to rcu_qsctr_inc() to unblock other threads blocked in synchronize_rcu()
http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=64ca5ab913f1594ef316556e65f5eae63ff50cee
If a driver does a loop with no call to the scheduler, it might have the same problem
On Sat, Mar 21, 2009 at 10:13:38AM +0100, Eric Dumazet wrote:
> Dmitry Torokhov a écrit :
> > On Thursday 19 March 2009 20:20:32 Arjan van de Ven wrote:
> >>> I don't claim to understand the code in question, so it is entirely
> >>> possible that the following is irrelevant. But one other reason for
> >>> synchronize_rcu() is:
> >>>
> >>> 1. Make change.
> >>>
> >>> 2. synchronize_rcu()
> >>>
> >>> 3. Now you are guaranteed that all CPUs/tasks/whatever
> >>> currently running either are not messing with you on the one hand, or
> >>> have seen the change on the other.
> >> ok so this is for the case where someone is already iterating the list.
> >>
> >> I don't see anything in the code that assumes this..
> >
> > This is something that input core guarantees to its users: by the time
> > input core calls hander->start() or, in its absence, by the time
> > input_register_handle() returns, events from input drivers will be
> > passed into the handle being registered, i.e. the presence of the
> > new item in the list is noticed by all CPUs.
> >
> > Now, it is possible to stop using RCU primitives in the input core
> > but I think that you'd want to figure out why synchronize_rcu()
> > takes so long first, otherwise you may find another "abuser"
> > down the road.
> >
>
> If a cpu does a loop, it is nearly impossible for synchronize_rcu() to
> be fast.
>
> We had the same problem in ksoftirqd(), where I had to add a call
> to rcu_qsctr_inc() to unblock other threads blocked in synchronize_rcu()
>
> http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=64ca5ab913f1594ef316556e65f5eae63ff50cee
>
> If a driver does a loop with no call to the scheduler, it might have the same problem
And hopefully Arjan's promised bootgraph will give us some insights
as to what might be holding up the grace period.
Thanx, Paul
On Fri, Mar 20, 2009 at 09:58:25PM -0700, Arjan van de Ven wrote:
> On Fri, 20 Mar 2009 18:27:46 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > I've seen only this (input) one to take a long time
> >
> > Ouch!!! A -single- synchronize_rcu() taking a full second??? That
> > indicates breakage.
>
> I'll do a debug mode tomorrow and visualize it as part of the bootgraph;
> hopefully it'll be an interesting result.
> I'll share the graph when I have it....
Looking forward to seeing it!!!
Thanx, Paul
On Fri, 20 Mar 2009 18:27:46 -0700
"Paul E. McKenney" <[email protected]> wrote:
> On Fri, Mar 20, 2009 at 11:13:54AM -0700, Arjan van de Ven wrote:
> > On Fri, 20 Mar 2009 07:31:04 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> > > >
> > > > that'd be throwing out the baby with the bathwater... I'm
> > > > trying to use the other cpus to do some of the boot work (so
> > > > that the total goes faster); not using the other cpus would be
> > > > counter productive to that. (As is just sitting in
> > > > synchronize_rcu() when the other cpu is working.. hence this
> > > > discussion ;-)
> > >
> > > OK, so you are definitely running multiple CPUs when the offending
> > > synchronize_rcu() executes, then?
> >
> > absolutely.
> > (and I'm using bootgraph.pl in scripts to track who's stalling etc)
> > >
> > > If so, here are some follow-on questions:
> > >
> > > 1. How many synchronize_rcu() calls are you seeing on the
> > > critical boot path
> >
> > I've seen only this (input) one to take a long time
>
> Ouch!!! A -single- synchronize_rcu() taking a full second??? That
> indicates breakage.
>
> > > and what value of HZ are you running?
> >
> > 1000
>
> K, in absence of readers for RCU_CLASSIC, we should see a handful
> of milliseconds for synchronize_rcu().
I've attached an instrumented bootgraph of what is going on;
the rcu delays are shown as red blocks inside the regular functions
as they initialize......
(svg can be viewed with inkscape, gimp, firefox and various other tools)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Arjan van de Ven a écrit :
> On Fri, 20 Mar 2009 18:27:46 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
>> On Fri, Mar 20, 2009 at 11:13:54AM -0700, Arjan van de Ven wrote:
>>> On Fri, 20 Mar 2009 07:31:04 -0700
>>> "Paul E. McKenney" <[email protected]> wrote:
>>>>> that'd be throwing out the baby with the bathwater... I'm
>>>>> trying to use the other cpus to do some of the boot work (so
>>>>> that the total goes faster); not using the other cpus would be
>>>>> counter productive to that. (As is just sitting in
>>>>> synchronize_rcu() when the other cpu is working.. hence this
>>>>> discussion ;-)
>>>> OK, so you are definitely running multiple CPUs when the offending
>>>> synchronize_rcu() executes, then?
>>> absolutely.
>>> (and I'm using bootgraph.pl in scripts to track who's stalling etc)
>>>> If so, here are some follow-on questions:
>>>>
>>>> 1. How many synchronize_rcu() calls are you seeing on the
>>>> critical boot path
>>> I've seen only this (input) one to take a long time
>> Ouch!!! A -single- synchronize_rcu() taking a full second??? That
>> indicates breakage.
>>
>>>> and what value of HZ are you running?
>>> 1000
>> K, in absence of readers for RCU_CLASSIC, we should see a handful
>> of milliseconds for synchronize_rcu().
>
> I've attached an instrumented bootgraph of what is going on;
> the rcu delays are shown as red blocks inside the regular functions
> as they initialize......
>
> (svg can be viewed with inkscape, gimp, firefox and various other tools)
>
>
Interesting stuff...
I thought you mentioned i2c drivers being the source of the udelays(),
but I can't see them in this svg, unless it's async_probe_hard ?
On Sat, Mar 21, 2009 at 09:26:08PM +0100, Eric Dumazet wrote:
> Arjan van de Ven a écrit :
> > On Fri, 20 Mar 2009 18:27:46 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> >> On Fri, Mar 20, 2009 at 11:13:54AM -0700, Arjan van de Ven wrote:
> >>> On Fri, 20 Mar 2009 07:31:04 -0700
> >>> "Paul E. McKenney" <[email protected]> wrote:
> >>>>> that'd be throwing out the baby with the bathwater... I'm
> >>>>> trying to use the other cpus to do some of the boot work (so
> >>>>> that the total goes faster); not using the other cpus would be
> >>>>> counter productive to that. (As is just sitting in
> >>>>> synchronize_rcu() when the other cpu is working.. hence this
> >>>>> discussion ;-)
> >>>> OK, so you are definitely running multiple CPUs when the offending
> >>>> synchronize_rcu() executes, then?
> >>> absolutely.
> >>> (and I'm using bootgraph.pl in scripts to track who's stalling etc)
> >>>> If so, here are some follow-on questions:
> >>>>
> >>>> 1. How many synchronize_rcu() calls are you seeing on the
> >>>> critical boot path
> >>> I've seen only this (input) one to take a long time
> >> Ouch!!! A -single- synchronize_rcu() taking a full second??? That
> >> indicates breakage.
> >>
> >>>> and what value of HZ are you running?
> >>> 1000
> >> K, in absence of readers for RCU_CLASSIC, we should see a handful
> >> of milliseconds for synchronize_rcu().
> >
> > I've attached an instrumented bootgraph of what is going on;
> > the rcu delays are shown as red blocks inside the regular functions
> > as they initialize......
> >
> > (svg can be viewed with inkscape, gimp, firefox and various other tools)
>
> Interesting stuff...
>
> I thought you mentioned i2c drivers being the source of the udelays(),
> but I can't see them in this svg, unless it's async_probe_hard ?
Arjan, another thought -- if the udelays() are not under rcu_read_lock(),
you should be able to finesse this by using CONFIG_PREEMPT_RCU, which
will happily ignore spinning CPUs as long as they are not in an RCU
read-side critical section.
Thanx, Paul
On Sat, 21 Mar 2009 21:26:08 +0100
Eric Dumazet <[email protected]> wrote:
> >
> > (svg can be viewed with inkscape, gimp, firefox and various other
> > tools)
> >
> >
>
> Interesting stuff...
>
> I thought you mentioned i2c drivers being the source of the udelays(),
> but I can't see them in this svg, unless it's async_probe_hard ?
yeah that one is doing a ton of i2c
(I'm working on optimizing the code in that area still)
>
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sat, 21 Mar 2009 14:07:45 -0700
"Paul E. McKenney" <[email protected]> wrote:
> On Sat, Mar 21, 2009 at 09:26:08PM +0100, Eric Dumazet wrote:
> > Arjan van de Ven a écrit :
> > > On Fri, 20 Mar 2009 18:27:46 -0700
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > >> On Fri, Mar 20, 2009 at 11:13:54AM -0700, Arjan van de Ven wrote:
> > >>> On Fri, 20 Mar 2009 07:31:04 -0700
> > >>> "Paul E. McKenney" <[email protected]> wrote:
> > >>>>> that'd be throwing out the baby with the bathwater... I'm
> > >>>>> trying to use the other cpus to do some of the boot work (so
> > >>>>> that the total goes faster); not using the other cpus would be
> > >>>>> counter productive to that. (As is just sitting in
> > >>>>> synchronize_rcu() when the other cpu is working.. hence this
> > >>>>> discussion ;-)
> > >>>> OK, so you are definitely running multiple CPUs when the
> > >>>> offending synchronize_rcu() executes, then?
> > >>> absolutely.
> > >>> (and I'm using bootgraph.pl in scripts to track who's stalling
> > >>> etc)
> > >>>> If so, here are some follow-on questions:
> > >>>>
> > >>>> 1. How many synchronize_rcu() calls are you seeing on
> > >>>> the critical boot path
> > >>> I've seen only this (input) one to take a long time
> > >> Ouch!!! A -single- synchronize_rcu() taking a full second???
> > >> That indicates breakage.
> > >>
> > >>>> and what value of HZ are you running?
> > >>> 1000
> > >> K, in absence of readers for RCU_CLASSIC, we should see a handful
> > >> of milliseconds for synchronize_rcu().
> > >
> > > I've attached an instrumented bootgraph of what is going on;
> > > the rcu delays are shown as red blocks inside the regular
> > > functions as they initialize......
> > >
> > > (svg can be viewed with inkscape, gimp, firefox and various other
> > > tools)
> >
> > Interesting stuff...
> >
> > I thought you mentioned i2c drivers being the source of the udelays(),
> > but I can't see them in this svg, unless it's async_probe_hard ?
>
> Arjan, another thought -- if the udelays() are not under
> rcu_read_lock(), you should be able to finesse this by using
> CONFIG_PREEMPT_RCU, which will happily ignore spinning CPUs as long
> as they are not in an RCU read-side critical section.
I'll play with that
In the mean time I've reduced the "other" function's time significantly;
so the urgency has gone away some.
It's still "interesting" that even in the "there is only really one
thread running" case the minimum delay seems to be 2700 microseconds
for classic RCU. Especially during bootup that sounds a bit harsh....
(since that is where many "read mostly" cases actually get their
modifications)
>
> Thanx, Paul
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sat, Mar 21, 2009 at 08:40:45PM -0700, Arjan van de Ven wrote:
> On Sat, 21 Mar 2009 14:07:45 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Sat, Mar 21, 2009 at 09:26:08PM +0100, Eric Dumazet wrote:
> > > Arjan van de Ven a écrit :
> > > > On Fri, 20 Mar 2009 18:27:46 -0700
> > > > "Paul E. McKenney" <[email protected]> wrote:
> > > >
> > > >> On Fri, Mar 20, 2009 at 11:13:54AM -0700, Arjan van de Ven wrote:
> > > >>> On Fri, 20 Mar 2009 07:31:04 -0700
> > > >>> "Paul E. McKenney" <[email protected]> wrote:
> > > >>>>> that'd be throwing out the baby with the bathwater... I'm
> > > >>>>> trying to use the other cpus to do some of the boot work (so
> > > >>>>> that the total goes faster); not using the other cpus would be
> > > >>>>> counter productive to that. (As is just sitting in
> > > >>>>> synchronize_rcu() when the other cpu is working.. hence this
> > > >>>>> discussion ;-)
> > > >>>> OK, so you are definitely running multiple CPUs when the
> > > >>>> offending synchronize_rcu() executes, then?
> > > >>> absolutely.
> > > >>> (and I'm using bootgraph.pl in scripts to track who's stalling
> > > >>> etc)
> > > >>>> If so, here are some follow-on questions:
> > > >>>>
> > > >>>> 1. How many synchronize_rcu() calls are you seeing on
> > > >>>> the critical boot path
> > > >>> I've seen only this (input) one to take a long time
> > > >> Ouch!!! A -single- synchronize_rcu() taking a full second???
> > > >> That indicates breakage.
> > > >>
> > > >>>> and what value of HZ are you running?
> > > >>> 1000
> > > >> K, in absence of readers for RCU_CLASSIC, we should see a handful
> > > >> of milliseconds for synchronize_rcu().
> > > >
> > > > I've attached an instrumented bootgraph of what is going on;
> > > > the rcu delays are shown as red blocks inside the regular
> > > > functions as they initialize......
> > > >
> > > > (svg can be viewed with inkscape, gimp, firefox and various other
> > > > tools)
> > >
> > > Interesting stuff...
> > >
> > > I thought you mentioned i2c drivers being the source of the udelays(),
> > > but I can't see them in this svg, unless it's async_probe_hard ?
> >
> > Arjan, another thought -- if the udelays() are not under
> > rcu_read_lock(), you should be able to finesse this by using
> > CONFIG_PREEMPT_RCU, which will happily ignore spinning CPUs as long
> > as they are not in an RCU read-side critical section.
>
> I'll play with that
> In the mean time I've reduced the "other" function's time significantly;
> so the urgency has gone away some.
Good to hear!
> It's still "interesting" that even in the "there is only really one
> thread running" case the minimum delay seems to be 2700 microseconds
> for classic RCU. Especially during bootup that sounds a bit harsh....
> (since that is where many "read mostly" cases actually get their
> modifications)
OK, I'll bite... 2700 microseconds measures exactly what?
Also, "really one thread" means hardware threads or software threads?
If the former, exactly which kernel are you using? The single-CPU
optimization was added in 2.6.29-rc7, commit ID a682604838.
Thanx, Paul
On Sat, 21 Mar 2009 21:38:38 -0700
"Paul E. McKenney" <[email protected]> wrote:
> On Sat, Mar 21, 2009 at 08:40:45PM -0700, Arjan van de Ven wrote:
> > On Sat, 21 Mar 2009 14:07:45 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > On Sat, Mar 21, 2009 at 09:26:08PM +0100, Eric Dumazet wrote:
> > > > Arjan van de Ven a écrit :
> > > > > On Fri, 20 Mar 2009 18:27:46 -0700
> > > > > "Paul E. McKenney" <[email protected]> wrote:
> > > > >
> > > > >> On Fri, Mar 20, 2009 at 11:13:54AM -0700, Arjan van de Ven
> > > > >> wrote:
> > > > >>> On Fri, 20 Mar 2009 07:31:04 -0700
> > > > >>> "Paul E. McKenney" <[email protected]> wrote:
> > > > >>>>> that'd be throwing out the baby with the bathwater... I'm
> > > > >>>>> trying to use the other cpus to do some of the boot work
> > > > >>>>> (so that the total goes faster); not using the other cpus
> > > > >>>>> would be counter productive to that. (As is just sitting
> > > > >>>>> in synchronize_rcu() when the other cpu is working..
> > > > >>>>> hence this discussion ;-)
> > > > >>>> OK, so you are definitely running multiple CPUs when the
> > > > >>>> offending synchronize_rcu() executes, then?
> > > > >>> absolutely.
> > > > >>> (and I'm using bootgraph.pl in scripts to track who's
> > > > >>> stalling etc)
> > > > >>>> If so, here are some follow-on questions:
> > > > >>>>
> > > > >>>> 1. How many synchronize_rcu() calls are you seeing
> > > > >>>> on the critical boot path
> > > > >>> I've seen only this (input) one to take a long time
> > > > >> Ouch!!! A -single- synchronize_rcu() taking a full second???
> > > > >> That indicates breakage.
> > > > >>
> > > > >>>> and what value of HZ are you running?
> > > > >>> 1000
> > > > >> K, in absence of readers for RCU_CLASSIC, we should see a
> > > > >> handful of milliseconds for synchronize_rcu().
> > > > >
> > > > > I've attached an instrumented bootgraph of what is going on;
> > > > > the rcu delays are shown as red blocks inside the regular
> > > > > functions as they initialize......
> > > > >
> > > > > (svg can be viewed with inkscape, gimp, firefox and various
> > > > > other tools)
> > > >
> > > > Interesting stuff...
> > > >
> > > > > I thought you mentioned i2c drivers being the source of the
> > > > > udelays(), but I can't see them in this svg, unless it's
> > > > async_probe_hard ?
> > >
> > > Arjan, another thought -- if the udelays() are not under
> > > rcu_read_lock(), you should be able to finesse this by using
> > > CONFIG_PREEMPT_RCU, which will happily ignore spinning CPUs as
> > > long as they are not in an RCU read-side critical section.
> >
> > I'll play with that
> > In the mean time I've reduced the "other" function's time
> > significantly; so the urgency has gone away some.
>
> Good to hear!
>
> > It's still "interesting" that even in the "there is only really one
> > thread running" case the minimum delay seems to be 2700 microseconds
> > for classic RCU. Especially during bootup that sounds a bit
> > harsh.... (since that is where many "read mostly" cases actually
> > get their modifications)
>
> OK, I'll bite... 2700 microseconds measures exactly what?
I'm measuring the time that the following code takes:
	init_completion(&rcu.completion);
	/* Will wake me after RCU finished. */
	call_rcu(&rcu.head, wakeme_after_rcu);
	/* Wait for it. */
	wait_for_completion(&rcu.completion);
in kernel/rcupdate.c:synchronize_rcu();
(I put markings around it for bootgraph to pick up).
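The markings are essentially timestamped printks around the wait,
roughly like this (a reconstruction, not the exact patch):

	ktime_t start = ktime_get();

	printk("rcu_waiting @ %d\n", raw_smp_processor_id());
	wait_for_completion(&rcu.completion);
	printk("rcu_continuing @ %d after %lld usec\n",
	       raw_smp_processor_id(),
	       ktime_to_us(ktime_sub(ktime_get(), start)));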
it looks like this:
[ 0.196157] rcu_waiting @ 1
[ 0.198978] rcu_continuing @ 1 after 2929 usec
[ 0.199585] rcu_waiting @ 1
[ 0.201973] rcu_continuing @ 1 after 2929 usec
[ 0.208132] rcu_waiting @ 1
[ 0.210905] rcu_continuing @ 1 after 2633 usec
[ 0.258025] rcu_waiting @ 1
[ 0.260910] rcu_continuing @ 1 after 2742 usec
[ 0.260988] rcu_waiting @ 1
[ 0.263910] rcu_continuing @ 1 after 2778 usec
[ 0.263987] rcu_waiting @ 1
[ 0.266910] rcu_continuing @ 1 after 2778 usec
[ 0.273030] rcu_waiting @ 1
[ 0.275912] rcu_continuing @ 1 after 2738 usec
[ 0.636267] rcu_waiting @ 1
[ 0.639531] rcu_continuing @ 1 after 3113 usec
[ 0.639611] rcu_waiting @ 1
[ 0.642006] rcu_continuing @ 1 after 2242 usec
[ 0.642086] rcu_waiting @ 1
[ 0.645407] rcu_continuing @ 1 after 3169 usec
[ 0.645487] rcu_waiting @ 1
[ 0.648007] rcu_continuing @ 1 after 2361 usec
[ 1.176323] rcu_waiting @ 1
[ 1.873021] rcu_continuing @ 1 after 680368 usec
[ 1.873108] rcu_waiting @ 1
[ 2.046045] rcu_continuing @ 1 after 168881 usec
ok so I was not entirely right; there's a few ones that are a bit
shorter than 2700...
>
> Also, "really one thread" means hardware threads or software threads?
one software thread. (as per the bootgraph)
> If the former, exactly which kernel are you using? The single-CPU
> optimization was added in 2.6.29-rc7, commit ID a682604838.
a bit after -rc8, specifically commit
5bee17f18b595937e6beafeee5197868a3f74a06
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sat, Mar 21, 2009 at 09:51:09PM -0700, Arjan van de Ven wrote:
> On Sat, 21 Mar 2009 21:38:38 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Sat, Mar 21, 2009 at 08:40:45PM -0700, Arjan van de Ven wrote:
> > > On Sat, 21 Mar 2009 14:07:45 -0700
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > On Sat, Mar 21, 2009 at 09:26:08PM +0100, Eric Dumazet wrote:
> > > > > > Arjan van de Ven a écrit :
> > > > > > On Fri, 20 Mar 2009 18:27:46 -0700
> > > > > > "Paul E. McKenney" <[email protected]> wrote:
> > > > > >
> > > > > >> On Fri, Mar 20, 2009 at 11:13:54AM -0700, Arjan van de Ven
> > > > > >> wrote:
> > > > > >>> On Fri, 20 Mar 2009 07:31:04 -0700
> > > > > >>> "Paul E. McKenney" <[email protected]> wrote:
> > > > > >>>>> that'd be throwing out the baby with the bathwater... I'm
> > > > > >>>>> trying to use the other cpus to do some of the boot work
> > > > > >>>>> (so that the total goes faster); not using the other cpus
> > > > > >>>>> would be counter productive to that. (As is just sitting
> > > > > >>>>> in synchronize_rcu() when the other cpu is working..
> > > > > >>>>> hence this discussion ;-)
> > > > > >>>> OK, so you are definitely running multiple CPUs when the
> > > > > >>>> offending synchronize_rcu() executes, then?
> > > > > >>> absolutely.
> > > > > >>> (and I'm using bootgraph.pl in scripts to track who's
> > > > > >>> stalling etc)
> > > > > >>>> If so, here are some follow-on questions:
> > > > > >>>>
> > > > > >>>> 1. How many synchronize_rcu() calls are you seeing
> > > > > >>>> on the critical boot path
> > > > > >>> I've seen only this (input) one to take a long time
> > > > > >> Ouch!!! A -single- synchronize_rcu() taking a full second???
> > > > > >> That indicates breakage.
> > > > > >>
> > > > > >>>> and what value of HZ are you running?
> > > > > >>> 1000
> > > > > >> K, in absence of readers for RCU_CLASSIC, we should see a
> > > > > >> handful of milliseconds for synchronize_rcu().
> > > > > >
> > > > > > I've attached an instrumented bootgraph of what is going on;
> > > > > > the rcu delays are shown as red blocks inside the regular
> > > > > > functions as they initialize......
> > > > > >
> > > > > > (svg can be viewed with inkscape, gimp, firefox and various
> > > > > > other tools)
> > > > >
> > > > > Interesting stuff...
> > > > >
> > > > > I thought you mentioned i2c drivers being source of the
> > > > > udelays(), but I cant see them in this svg, unless its
> > > > > async_probe_hard ?
> > > >
> > > > Arjan, another thought -- if the udelays() are not under
> > > > rcu_read_lock(), you should be able to finesse this by using
> > > > CONFIG_PREEMPT_RCU, which will happily ignore spinning CPUs as
> > > > long as they are not in an RCU read-side critical section.
> > >
> > > I'll play with that
> > > In the mean time I've reduced the "other" function's time
> > > significantly; so the urgency has gone away some.
> >
> > Good to hear!
> >
> > > It's still "interesting" that even in the "there is only really one
> > > thread running" case the minimum delay seems to be 2700 microseconds
> > > for classic RCU. Especially during bootup that sounds a bit
> > > harsh.... (since that is where many "read mostly" cases actually
> > > get their modifications)
> >
> > OK, I'll bite... 2700 microseconds measures exactly what?
>
> I'm measuring the time that the following code takes:
>
> init_completion(&rcu.completion);
> /* Will wake me after RCU finished. */
> call_rcu(&rcu.head, wakeme_after_rcu);
> /* Wait for it. */
> wait_for_completion(&rcu.completion);
>
> in kernel/rcupdate.c:synchronize_rcu();
> (I put markings around it for bootgraph to pick up).
> it looks like this:
> [ 0.196157] rcu_waiting @ 1
> [ 0.198978] rcu_continuing @ 1 after 2929 usec
> [ 0.199585] rcu_waiting @ 1
> [ 0.201973] rcu_continuing @ 1 after 2929 usec
> [ 0.208132] rcu_waiting @ 1
> [ 0.210905] rcu_continuing @ 1 after 2633 usec
> [ 0.258025] rcu_waiting @ 1
> [ 0.260910] rcu_continuing @ 1 after 2742 usec
> [ 0.260988] rcu_waiting @ 1
> [ 0.263910] rcu_continuing @ 1 after 2778 usec
> [ 0.263987] rcu_waiting @ 1
> [ 0.266910] rcu_continuing @ 1 after 2778 usec
> [ 0.273030] rcu_waiting @ 1
> [ 0.275912] rcu_continuing @ 1 after 2738 usec
> [ 0.636267] rcu_waiting @ 1
> [ 0.639531] rcu_continuing @ 1 after 3113 usec
> [ 0.639611] rcu_waiting @ 1
> [ 0.642006] rcu_continuing @ 1 after 2242 usec
> [ 0.642086] rcu_waiting @ 1
> [ 0.645407] rcu_continuing @ 1 after 3169 usec
> [ 0.645487] rcu_waiting @ 1
> [ 0.648007] rcu_continuing @ 1 after 2361 usec
> [ 1.176323] rcu_waiting @ 1
> [ 1.873021] rcu_continuing @ 1 after 680368 usec
> [ 1.873108] rcu_waiting @ 1
> [ 2.046045] rcu_continuing @ 1 after 168881 usec
>
> ok so I was not entirely right; there's a few ones that are a bit
> shorter than 2700...
No, my confusion -- I misread as 2700 milliseconds rather than 2700
-microseconds-. 2700 microseconds (or 2.7 milliseconds) is in the
expected range for synchronize_rcu() on an HZ=1000 system. 2.7 seconds
would of course be way out of line.
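(To put numbers on it: at HZ=1000 a jiffy is 1 millisecond, and a
no-readers grace period needs the 3-5 jiffies mentioned earlier, so
3-5 jiffies x 1 ms/jiffy = 3-5 ms -- right around the 2.2-3.2 ms
readings above.)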
The last two you captured above are excessive, of course -- more than
100 milliseconds each.
> > Also, "really one thread" means hardware threads or software threads?
>
> one software thread. (as per the bootgraph)
OK, then the commit called out below would not help you much anyway.
> > If the former, exactly which kernel are you using? The single-CPU
> > optimization was added in 2.6.29-rc7, commit ID a682604838.
>
> a bit after -rc8, specifically commit
> 5bee17f18b595937e6beafeee5197868a3f74a06
How many synchronize_rcu() calls are you seeing on the boot path?
Also, are you running with NO_HZ=y?
Thanx, Paul
On Sat, 21 Mar 2009 22:18:22 -0700
"Paul E. McKenney" <[email protected]> wrote:
> > I'm measuring the time that the following code takes:
> >
> > init_completion(&rcu.completion);
> > /* Will wake me after RCU finished. */
> > call_rcu(&rcu.head, wakeme_after_rcu);
> > /* Wait for it. */
> > wait_for_completion(&rcu.completion);
> >
>
> No, my confusion -- I misread as 2700 milliseconds rather than 2700
> -microseconds-. 2700 microseconds (or 2.7 milliseconds) is in the
> expected range for synchronize_rcu() on an HZ=1000 system. 2.7
> seconds would of course be way out of line.
> > > If the former, exactly which kernel are you using? The single-CPU
> > > optimization was added in 2.6.29-rc7, commit ID a682604838.
> >
> > a bit after -rc8, specifically commit
> > 5bee17f18b595937e6beafeee5197868a3f74a06
>
> How many synchronize_rcu() calls are you seeing on the boot path?
I see 20 that hit the above code path (eg ones that wait) until
userspace starts.
> Also, are you running with NO_HZ=y?
of course... is there any other way ? ;-)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sat, Mar 21, 2009 at 10:53:18PM -0700, Arjan van de Ven wrote:
> On Sat, 21 Mar 2009 22:18:22 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > I'm measuring the time that the following code takes:
> > >
> > > init_completion(&rcu.completion);
> > > /* Will wake me after RCU finished. */
> > > call_rcu(&rcu.head, wakeme_after_rcu);
> > > /* Wait for it. */
> > > wait_for_completion(&rcu.completion);
> > >
>
> >
> > No, my confusion -- I misread as 2700 milliseconds rather than 2700
> > -microseconds-. 2700 microseconds (or 2.7 milliseconds) is in the
> > expected range for synchronize_rcu() on an HZ=1000 system. 2.7
> > seconds would of course be way out of line.
>
> > > > If the former, exactly which kernel are you using? The single-CPU
> > > > optimization was added in 2.6.29-rc7, commit ID a682604838.
> > >
> > > a bit after -rc8, specifically commit
> > > 5bee17f18b595937e6beafeee5197868a3f74a06
> >
> > How many synchronize_rcu() calls are you seeing on the boot path?
>
> I see 20 that hit the above code path (eg ones that wait) until
> userspace starts.
So with well-behaved readers, the full sequence -- 20 waits at roughly
2.7 milliseconds apiece -- would be worth something like 50-60
milliseconds.
> > Also, are you running with NO_HZ=y?
>
> of course... is there any other way ? ;-)
Well, if it does become necessary to make common-case no-readers
execution of synchronize_rcu() go faster, you certainly have made the
correct choice. ;-)
Thanx, Paul
On Sun, 22 Mar 2009 09:53:24 -0700
"Paul E. McKenney" <[email protected]> wrote:
> > > How many synchronize_rcu() calls are you seeing on the boot path?
> >
> > I see 20 that hit the above code path (eg ones that wait) until
> > userspace starts.
>
> So with well-behaved readers, the full sequence would be worth
> something like 50-60 milliseconds.
yeah it's about 10% of the total kernel boot time.. so it does start to
matter
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, Mar 22, 2009 at 12:46:32PM -0700, Arjan van de Ven wrote:
> On Sun, 22 Mar 2009 09:53:24 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > > How many synchronize_rcu() calls are you seeing on the boot path?
> > >
> > > I see 20 that hit the above code path (eg ones that wait) until
> > > userspace starts.
> >
> > So with well-behaved readers, the full sequence would be worth
> > something like 50-60 milliseconds.
>
> yeah it's about 10% of the total kernel boot time.. so it does start to
> matter
Half-second boot, eh? That would indeed be impressive.
Thanx, Paul
On Sun, 22 Mar 2009 13:52:12 -0700
"Paul E. McKenney" <[email protected]> wrote:
> On Sun, Mar 22, 2009 at 12:46:32PM -0700, Arjan van de Ven wrote:
> > On Sun, 22 Mar 2009 09:53:24 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > > > How many synchronize_rcu() calls are you seeing on the boot
> > > > > path?
> > > >
> > > > I see 20 that hit the above code path (eg ones that wait) until
> > > > userspace starts.
> > >
> > > So with well-behaved readers, the full sequence would be worth
> > > something like 50-60 milliseconds.
> >
> > yeah it's about 10% of the total kernel boot time.. so it does
> > start to matter
>
> Half-second boot, eh? That would indeed be impressive.
half a second until calling init.. that's what I have today (with all
drivers built in).. nothing really special needed for it
(well a few small patches that are pending for 2.6.30 ;-)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, Mar 22, 2009 at 03:44:33PM -0700, Arjan van de Ven wrote:
> On Sun, 22 Mar 2009 13:52:12 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Sun, Mar 22, 2009 at 12:46:32PM -0700, Arjan van de Ven wrote:
> > > On Sun, 22 Mar 2009 09:53:24 -0700
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > > > How many synchronize_rcu() calls are you seeing on the boot
> > > > > > path?
> > > > >
> > > > > I see 20 that hit the above code path (eg ones that wait) until
> > > > > userspace starts.
> > > >
> > > > So with well-behaved readers, the full sequence would be worth
> > > > something like 50-60 milliseconds.
> > >
> > > yeah it's about 10% of the total kernel boot time.. so it does
> > > start to matter
> >
> > Half-second boot, eh? That would indeed be impressive.
>
> half a second until calling init.. that's what I have today (with all
> drivers built in).. nothing really special needed for it
> (well a few small patches that are pending for 2.6.30 ;-)
I thought the measurement was until the desktop was running with no more
disk activity. ;-)
But in any case, I will see what I can do about speeding up
synchronize_rcu(). I will likely start with TREE_RCU, and I may need
some sort of indication that boot is in progress.
Thanx, Paul
On Sun, 22 Mar 2009 16:03:31 -0700
"Paul E. McKenney" <[email protected]> wrote:
> On Sun, Mar 22, 2009 at 03:44:33PM -0700, Arjan van de Ven wrote:
> > On Sun, 22 Mar 2009 13:52:12 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > On Sun, Mar 22, 2009 at 12:46:32PM -0700, Arjan van de Ven wrote:
> > > > On Sun, 22 Mar 2009 09:53:24 -0700
> > > > "Paul E. McKenney" <[email protected]> wrote:
> > > >
> > > > > > > How many synchronize_rcu() calls are you seeing on the
> > > > > > > boot path?
> > > > > >
> > > > > > I see 20 that hit the above code path (eg ones that wait)
> > > > > > until userspace starts.
> > > > >
> > > > > So with well-behaved readers, the full sequence would be worth
> > > > > something like 50-60 milliseconds.
> > > >
> > > > yeah it's about 10% of the total kernel boot time.. so it does
> > > > start to matter
> > >
> > > Half-second boot, eh? That would indeed be impressive.
> >
> > half a second until calling init.. that's what I have today (with
> > all drivers built in).. nothing really special needed for it
> > (well a few small patches that are pending for 2.6.30 ;-)
>
> I thought the measurement was until the desktop was running with no
> more disk activity. ;-)
yeah that's like 2 seconds after that ;)
>
> But in any case, I will see what I can do about speeding up
> synchronize_rcu(). I will likely start with TREE_RCU, and I may need
> some sort of indication that boot is in progress.
system_state == SYSTEM_BOOTING is not a very elegant test, but it works
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, Mar 22, 2009 at 04:16:37PM -0700, Arjan van de Ven wrote:
> On Sun, 22 Mar 2009 16:03:31 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Sun, Mar 22, 2009 at 03:44:33PM -0700, Arjan van de Ven wrote:
> > > On Sun, 22 Mar 2009 13:52:12 -0700
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > On Sun, Mar 22, 2009 at 12:46:32PM -0700, Arjan van de Ven wrote:
> > > > > On Sun, 22 Mar 2009 09:53:24 -0700
> > > > > "Paul E. McKenney" <[email protected]> wrote:
> > > > >
> > > > > > > > How many synchronize_rcu() calls are you seeing on the
> > > > > > > > boot path?
> > > > > > >
> > > > > > > I see 20 that hit the above code path (eg ones that wait)
> > > > > > > until userspace starts.
> > > > > >
> > > > > > So with well-behaved readers, the full sequence would be worth
> > > > > > something like 50-60 milliseconds.
> > > > >
> > > > > yeah it's about 10% of the total kernel boot time.. so it does
> > > > > start to matter
> > > >
> > > > Half-second boot, eh? That would indeed be impressive.
> > >
> > > half a second until calling init.. that's what I have today (with
> > > all drivers built in).. nothing really special needed for it
> > > (well a few small patches that are pending for 2.6.30 ;-)
> >
> > I thought the measurement was until the desktop was running with no
> > more disk activity. ;-)
>
> yeah that's like 2 seconds after that ;)
I must confess that I still dream of it booting faster than I perceive.
Of course, this is becoming ever easier to achieve as I continue to
grow older... ;-)
> > But in any case, I will see what I can do about speeding up
> > synchronize_rcu(). I will likely start with TREE_RCU, and I may need
> > some sort of indication that boot is in progress.
>
> system_state == SYSTEM_BOOTING is not a very elegant test but it works
Fair enough... This would be until init, correct?
Thanx, Paul
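For illustration, the kind of boot-time short-circuit being discussed
might look like the sketch below -- assuming non-preemptible RCU and a
single online CPU, and placed in kernel/rcupdate.c; this is only a
sketch of the idea, not the patch that follows:

void synchronize_rcu(void)
{
	struct rcu_synchronize rcu;

	/*
	 * While still booting on a single CPU with non-preemptible
	 * RCU, no reader can be inside a read-side critical section
	 * when this code runs, so the grace period is trivially over.
	 */
	if (system_state == SYSTEM_BOOTING && num_online_cpus() == 1)
		return;

	init_completion(&rcu.completion);
	/* Will wake me after RCU finished. */
	call_rcu(&rcu.head, wakeme_after_rcu);
	/* Wait for it. */
	wait_for_completion(&rcu.completion);
}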
On Sun, Mar 22, 2009 at 06:27:37PM -0700, Paul E. McKenney wrote:
> On Sun, Mar 22, 2009 at 04:16:37PM -0700, Arjan van de Ven wrote:
> > On Sun, 22 Mar 2009 16:03:31 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
[ . . . ]
> > > But in any case, I will see what I can do about speeding up
> > > synchronize_rcu(). I will likely start with TREE_RCU, and I may need
> > > some sort of indication that boot is in progress.
> >
> > system_state == SYSTEM_BOOTING is not a very elegant test but it works
>
> Fair enough... This would be until init, correct?
Here is a first attempt. The idea is to accelerate grace-period
detection when all but one of the CPUs is in nohz state by checking
for nohz CPUs immediately upon starting a grace period, instead of
waiting for three jiffies in that case. Lightly tested.
Not for inclusion.
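(To make the test concrete with made-up numbers: on a 16-CPU system
where 15 CPUs have entered dyntick-idle, cpumask_weight(cpu_online_mask)
- cpumask_weight(nohz_cpu_mask) is 16 - 15 = 1, so
accelerate_almost_idle() below invokes force_quiescent_state()
immediately instead of letting the grace period wait out the usual few
jiffies.)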
Signed-off-by: Paul E. McKenney <[email protected]>
---
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 97ce315..77ce937 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -373,7 +373,26 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 	return rcu_implicit_offline_qs(rdp);
 }
 
-#endif /* #ifdef CONFIG_SMP */
+/*
+ * If almost all the other CPUs are in dyntick-idle mode, invoke
+ * force_quiescent_state() in order to make the RCU grace period
+ * happen more quickly.  The main purpose is to speed up boot
+ * on multiprocessor machines with CONFIG_NO_HZ.
+ */
+static void accelerate_almost_idle(struct rcu_state *rsp)
+{
+	if (cpumask_weight(cpu_online_mask) -
+	    cpumask_weight(nohz_cpu_mask) <= 1)
+		force_quiescent_state(rsp, 0);
+}
+
+#else /* #ifdef CONFIG_SMP */
+
+static void accelerate_almost_idle(struct rcu_state *rsp)
+{
+}
+
+#endif /* #else #ifdef CONFIG_SMP */
 
 #else /* #ifdef CONFIG_NO_HZ */
 
@@ -407,6 +426,10 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 
 #endif /* #ifdef CONFIG_SMP */
 
+static void accelerate_almost_idle(struct rcu_state *rsp)
+{
+}
+
 #endif /* #else #ifdef CONFIG_NO_HZ */
 
 #ifdef CONFIG_RCU_CPU_STALL_DETECTOR
@@ -577,6 +600,7 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
 		rnp->qsmask = rnp->qsmaskinit;
 		rsp->signaled = RCU_SIGNAL_INIT; /* force_quiescent_state OK. */
 		spin_unlock_irqrestore(&rnp->lock, flags);
+		accelerate_almost_idle(rsp);
 		return;
 	}
 
@@ -627,6 +651,7 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
 	rsp->signaled = RCU_SIGNAL_INIT; /* force_quiescent_state now OK. */
 	spin_unlock_irqrestore(&rsp->onofflock, flags);
+	accelerate_almost_idle(rsp);
 }
 
 /*