2014-11-16 08:33:15

by Fabian Frédérick

Subject: frequent lockups in 3.18rc4: revert suggestion

Hi Dave,

        I was reading your report
http://marc.info/?l=linux-kernel&m=141600070111887&w=2

        Have you tried reverting the following patches (all from rc1) ?
       
        c6f4459 v3.18-rc1 smp: Add new wake_up_all_idle_cpus() function
        bb964a9 v3.18-rc1 kernel misc: Replace __get_cpu_var uses
        2ed903c v3.18-rc1 cpuidle: Use wake_up_all_idle_cpus() to wake up all idle cpus

Regards,
Fabian


2014-11-16 20:03:07

by Linus Torvalds

Subject: Re: frequent lockups in 3.18rc4: revert suggestion

On Sun, Nov 16, 2014 at 12:33 AM, Fabian Frederick <[email protected]> wrote:
>
> Have you tried reverting the following patches (all from rc1) ?

Hmm. Any particular reason you're looking at those?

> c6f4459 v3.18-rc1 smp: Add new wake_up_all_idle_cpus() function
> bb964a9 v3.18-rc1 kernel misc: Replace __get_cpu_var uses
> 2ed903c v3.18-rc1 cpuidle: Use wake_up_all_idle_cpus() to wake up all idle cpus

It does strike me that the reschedule IPI is somewhat special in that
we don't try to serialize it at all, on the grounds that a lost IPI is
ok (ie smp_send_reschedule() is very much a special case of IPI). Or
am I mis-remembering? Does that series end up adding a lot more of
those things, rather than using the normal smp_call_function()?

The normal smp_function_mask() thing tries to make sure only one entry
is ever active at a time (even a non-blocking one will use the whole
"queue it on a llist, only send the IPI if the llist was empty", so
this is not about the IPI's being synchronous). The rescheduling
thing is rather special, isn't it?
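
As a rough illustration of the "queue it on a llist, only send the IPI
if the llist was empty" pattern described above, here is a minimal
userspace sketch in C11. It is not the kernel's code: struct work_node,
struct cpu_queue, queue_add(), queue_work_and_kick(), drain_queue() and
count_call() are all hypothetical names, and the ipis_sent counter only
stands in for raising a real IPI.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct work_node {
    struct work_node *next;
    void (*func)(void *arg);
    void *arg;
};

struct cpu_queue {
    _Atomic(struct work_node *) head; /* lock-free list of pending work */
    atomic_int ipis_sent;             /* stand-in for raising an IPI */
};

/* Push a node; return true if the list was empty before the push. */
static bool queue_add(struct cpu_queue *q, struct work_node *n)
{
    assert(n != NULL);
    struct work_node *old = atomic_load(&q->head);
    do {
        n->next = old; /* on CAS failure, old is reloaded and we retry */
    } while (!atomic_compare_exchange_weak(&q->head, &old, n));
    return old == NULL;
}

/* Sender side: "send the IPI" only on the empty -> non-empty transition. */
static void queue_work_and_kick(struct cpu_queue *q, struct work_node *n)
{
    if (queue_add(q, n))
        atomic_fetch_add(&q->ipis_sent, 1);
}

/* Target side: detach the whole list at once, then run every entry
 * (newest first in this sketch). Returns how many entries ran. */
static int drain_queue(struct cpu_queue *q)
{
    struct work_node *n = atomic_exchange(&q->head, NULL);
    int ran = 0;
    while (n) {
        struct work_node *next = n->next;
        n->func(n->arg);
        ran++;
        n = next;
    }
    return ran;
}

/* Example callback: bump the int that arg points to. */
static void count_call(void *arg)
{
    (*(int *)arg)++;
}
```

In this model, queueing two entries before the target drains the list
bumps ipis_sent only once: the single pending IPI covers everything
queued behind it. The reschedule IPI discussed above has no such queue.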

The softlockup thing *did* look like some IPI got lost. Could an IPI
overflow on the RESCHEDULE_VECTOR end up affecting other vectors? It's
been too long since I worked with the APIC (and by "too long", I
obviously mean "thank God I haven't had to" ;^) but there used to be
grouping of the vectors..

Maybe that is all barking up the wrong tree, but I'm wondering why
Fabian picked that particular set of commits. Fabian?

Linus

2014-11-16 20:42:07

by Fabian Frédérick

Subject: Re: frequent lockups in 3.18rc4: revert suggestion



> On 16 November 2014 at 21:03 Linus Torvalds <[email protected]>
> wrote:
>
>
> On Sun, Nov 16, 2014 at 12:33 AM, Fabian Frederick <[email protected]> wrote:
> >
> >         Have you tried reverting the following patches (all from rc1) ?
>
> Hmm. Any particular reason you're looking at those?
>
> >         c6f4459 v3.18-rc1 smp: Add new wake_up_all_idle_cpus() function
> >         bb964a9 v3.18-rc1 kernel misc: Replace __get_cpu_var uses
> >         2ed903c v3.18-rc1 cpuidle: Use wake_up_all_idle_cpus() to wake up all idle cpus
>
> It does strike me that the reschedule IPI is somewhat special in that
> we don't try to serialize it at all, on the grounds that a lost IPI is
> ok (ie smp_send_reschedule() is very much a special case of IPI). Or
> am I mis-remembering? Does that series end up adding a lot more of
> those things, rather than using the normal smp_call_function()?
>
> The normal smp_function_mask() thing tries to make sure only one entry
> is ever active at a time (even a non-blocking one will use the whole
> "queue it on a llist, only send the IPI if the llist was empty", so
> this is not about the IPI's being synchronous).  The rescheduling
> thing is rather special, isn't it?
>
> The softlockup thing *did* look like some IPI got lost. Could an IPI
> overflow on the RESCHEDULE_VECTOR end up affecting other vectors? It's
> been too long since I worked with the APIC (and by "too long", I
> obviously mean "thank God I haven't had to" ;^) but there used to be
> grouping of the vectors..
>
> Maybe that is all barking up the wrong tree, but I'm wondering why
> Fabian picked that particular set of commits. Fabian?

Thomas talked about csd_lock and the last reliable stack function
being smp_call_function_single, so I thought it could be interesting
to bisect directly in smp.c, as I had only read about reverting
mm/memory.c stuff ... Maybe not too original, but who knows? :)
 
Regards,
Fabian

>
>                      Linus

2014-11-17 00:35:59

by Linus Torvalds

Subject: Re: frequent lockups in 3.18rc4: revert suggestion

On Sun, Nov 16, 2014 at 12:42 PM, Fabian Frederick <[email protected]> wrote:
>
> Thomas talked about csd_lock and the last reliable stack function
> being smp_call_function_single, so I thought it could be interesting
> to bisect directly in smp.c, as I had only read about reverting
> mm/memory.c stuff ... Maybe not too original, but who knows? :)

Fair enough.

I'd almost have been more inclined to look at the apic changes,
like commit 4ba2968420fa ("percpu: Resolve ambiguities in
__get_cpu_var/cpumask_var_t") that was horribly buggy. It was fixed in
59f6e2073c72, though, and the end result looks sane, so I don't think
it's that particular thing. The rest seems to be either kvm-related or
just clearly trivial.

Which is why I think even a partial bisection would be nice - as it is
we're kind of just guessing, and I'm not all that confident in the
guesses. Sure, they may be right, but bisection is guaranteed to at
least narrow the suspects down, while guesses *could* hit the jackpot,
but could also be a total waste of time.
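
To make the "guaranteed to at least narrow the suspects down" point
concrete, here is a small C sketch of the halving that bisection does.
It is only a model, not how git bisect is implemented:
bisect_first_bad(), is_bad_fn and bad_from() are hypothetical names,
commits are reduced to indices ordered oldest to newest, and the sketch
assumes a single good-to-bad transition in the range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: true if the kernel built at this commit
 * index shows the bug. In practice this is a build plus a test run. */
typedef bool (*is_bad_fn)(size_t index, void *ctx);

/*
 * good: index of a commit known to work
 * bad:  index of a commit known to fail, with good < bad
 * Returns the index of the first bad commit and, if steps is non-NULL,
 * stores the number of commits that had to be tested.
 */
static size_t bisect_first_bad(size_t good, size_t bad,
                               is_bad_fn test, void *ctx, unsigned *steps)
{
    assert(good < bad);
    unsigned n = 0;
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (test(mid, ctx))
            bad = mid;   /* culprit is at mid or earlier */
        else
            good = mid;  /* culprit is after mid */
        n++;             /* every test roughly halves the suspect range */
    }
    if (steps)
        *steps = n;
    return bad;
}

/* Example predicate: everything from index *ctx onward is bad. */
static bool bad_from(size_t index, void *ctx)
{
    return index >= *(size_t *)ctx;
}
```

Each test roughly halves the remaining range whatever the answer is, so
even stopping partway leaves a much smaller set of suspects. A range of
1000 commits needs at most 10 tests under these assumptions.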

I guess I'm not much of a gambler. I'll take a steady slow guarantee
of progress over a jackpot just about every day.

Linus

2014-11-17 06:04:11

by Fabian Frédérick

Subject: Re: frequent lockups in 3.18rc4: revert suggestion



> On 17 November 2014 at 01:35 Linus Torvalds <[email protected]>
> wrote:
>
>
> On Sun, Nov 16, 2014 at 12:42 PM, Fabian Frederick <[email protected]> wrote:
> >
> > Thomas talked about csd_lock and the last reliable stack function
> > being smp_call_function_single, so I thought it could be interesting
> > to bisect directly in smp.c, as I had only read about reverting
> > mm/memory.c stuff ... Maybe not too original, but who knows? :)
>
> Fair enough.
>
> I'd almost have been more inclined to look at the apic changes,
> like commit 4ba2968420fa ("percpu: Resolve ambiguities in
> __get_cpu_var/cpumask_var_t") that was horribly buggy. It was fixed in
> 59f6e2073c72, though, and the end result looks sane, so I don't think
> it's that particular thing. The rest seems to be either kvm-related or
> just clearly trivial.
>
> Which is why I think even a partial bisection would be nice - as it is
> we're kind of just guessing, and I'm not all that confident in the
> guesses. Sure, they may be right, but bisection is guaranteed to at
> least narrow the suspects down, while guesses *could* hit the jackpot,
> but could also be a total waste of time.
>
> I guess I'm not much of a gambler. I'll take a steady slow guarantee
> of progress over a jackpot just about every day.
>
>                        Linus

Ok Linus, you're not a gambler but honestly, you created
Git and Linux: the best games I know about on earth ;)

Regards,
Fabian