LinuxLists.cc - Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

2009-06-30 23:21:21

Subject: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

Hi,

While playing around with netem (time, not packet count based loss-
bursts) I experienced soft lockups several times - to exclude it was my
modifications causing this I recompiled with the original and it is
still locking up.
I captured several of those traces via the thankfully
still working netconsole.
The simplest policy I could reproduce the error with was:
tc qdisc add dev eth0 root handle 1: netem delay 10ms loss 0

I could not reproduce the error without delay - but that may only be a
timing issue, as the host I was mainly transferring data to was on a
local network.
I could not reproduce the issue on lo.

The time to reproduce the error varied from seconds after executing tc
to several minutes.

Traces 5+6 are made with vanilla 52989765629e7d182b4f146050ebba0abf2cb0b7

The earlier traces are made with parts of my patches applied, and only
included for completeness as I don't believe my modifications were
causing this and all traces are different, so it may give some clues.

Lockdep was enabled but did not diagnose anything relevant (one dvb
warning during bootup).

Any ideas for debugging?

Andres

PS: I also could reproduce the issue without netconsole but in vain of a
serial console could not capture a trace.

Attachments:

config (68.85 kB)
debug_out6.txt (12.24 kB)
debug_out5.txt (10.23 kB)
debug_out1.txt (3.05 kB)
debug_out2.txt (8.21 kB)
debug_out3.txt (9.44 kB)
debug_out4.txt (8.87 kB)
Download all attachments

2009-07-01 18:40:00

by Jarek Poplawski

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

Andres Freund wrote, On 07/01/2009 01:20 AM:

> Hi,

Hi,

>
> While playing around with netem (time, not packet count based loss-
> bursts) I experienced soft lockups several times - to exclude it was my
> modifications causing this I recompiled with the original and it is
> still locking up.
> I captured several of those traces via the thankfully
> still working netconsole.
> The simplest policy I could reproduce the error with was:
> tc qdisc add dev eth0 root handle 1: netem delay 10ms loss 0
>
> I could not reproduce the error without delay - but that may only be a
> timing issue, as the host I was mainly transferring data to was on a
> local network.
> I could not reproduce the issue on lo.
>
> The time to reproduce the error varied from seconds after executing tc
> to several minutes.
>
> Traces 5+6 are made with vanilla 52989765629e7d182b4f146050ebba0abf2cb0b7
>
> The earlier traces are made with parts of my patches applied, and only
> included for completeness as I don't believe my modifications were
> causing this and all traces are different, so it may give some clues.
>
> Lockdep was enabled but did not diagnose anything relevant (one dvb
> warning during bootup).
>
> Any ideas for debugging?

Maybe these traces will be enough, but lockdep report could save time.
If dvb warning triggers every time then lockdep probably turns off
just after (it works this way, unless something was changed). So,
could you try to repeat this without dvb? Btw., did you try this on
some earlier kernel?

Thanks,
Jarek P.

2009-07-01 21:22:24

by Andres Freund

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

Hi,

On 07/01/2009 08:39 PM, Jarek Poplawski wrote:
> Andres Freund wrote, On 07/01/2009 01:20 AM:
>> While playing around with netem (time, not packet count based loss-
>> bursts) I experienced soft lockups several times - to exclude it was my
>> modifications causing this I recompiled with the original and it is
>> still locking up.
>> I captured several of those traces via the thankfully
>> still working netconsole.
>> The simplest policy I could reproduce the error with was:
>> tc qdisc add dev eth0 root handle 1: netem delay 10ms loss 0
>>
>> I could not reproduce the error without delay - but that may only be a
>> timing issue, as the host I was mainly transferring data to was on a
>> local network.
>> I could not reproduce the issue on lo.
>>
>> The time to reproduce the error varied from seconds after executing tc
>> to several minutes.
>>
>> Traces 5+6 are made with vanilla 52989765629e7d182b4f146050ebba0abf2cb0b7
>>
>> The earlier traces are made with parts of my patches applied, and only
>> included for completeness as I don't believe my modifications were
>> causing this and all traces are different, so it may give some clues.
>>
>> Lockdep was enabled but did not diagnose anything relevant (one dvb
>> warning during bootup).
>>
>> Any ideas for debugging?
>
> Maybe these traces will be enough, but lockdep report could save time.
> If dvb warning triggers every time then lockdep probably turns off
> just after (it works this way, unless something was changed). So,
> could you try to repeat this without dvb? Btw., did you try this on
> some earlier kernel?
Yes. Today I could not manage to reproduce it on 2.6.30 but could on
current git...

I *think* I could also provoke the same issue on lo, but I am not
completely sure, as the host I was redirecting netconsole to
unfortunately was not up, so I could not check if it was a similar trace.
It could also have been triggered by some random traffic on eth0... Hard
to say.

Will try without dvb.

Andres

2009-07-02 00:37:35

by Andres Freund

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

On 07/01/2009 11:22 PM, Andres Freund wrote:
> On 07/01/2009 08:39 PM, Jarek Poplawski wrote:
>> Andres Freund wrote, On 07/01/2009 01:20 AM:
>>> While playing around with netem (time, not packet count based loss-
>>> bursts) I experienced soft lockups several times - to exclude it was my
>>> modifications causing this I recompiled with the original and it is
>>> still locking up.
>>> I captured several of those traces via the thankfully
>>> still working netconsole.
>>> The simplest policy I could reproduce the error with was:
>>> tc qdisc add dev eth0 root handle 1: netem delay 10ms loss 0
>>>
>>> I could not reproduce the error without delay - but that may only be a
>>> timing issue, as the host I was mainly transferring data to was on a
>>> local network.
>>> I could not reproduce the issue on lo.
>>>
>>> The time to reproduce the error varied from seconds after executing tc
>>> to several minutes.
>>>
>>> Traces 5+6 are made with vanilla
>>> 52989765629e7d182b4f146050ebba0abf2cb0b7
>>>
>>> The earlier traces are made with parts of my patches applied, and only
>>> included for completeness as I don't believe my modifications were
>>> causing this and all traces are different, so it may give some clues.
>>>
>>> Lockdep was enabled but did not diagnose anything relevant (one dvb
>>> warning during bootup).
>>>
>>> Any ideas for debugging?
>>
>> Maybe these traces will be enough, but lockdep report could save time.
>> If dvb warning triggers every time then lockdep probably turns off
>> just after (it works this way, unless something was changed). So,
>> could you try to repeat this without dvb? Btw., did you try this on
>> some earlier kernel?
> Yes. Today I could not manage to reproduce it on 2.6.30 but could on
> current git...
> Will try without dvb.
So I tried - and I did not catch any lockdep output before the crash.
Unfortunately I do not have another machine on the same local network to
catch any messages after the crash... So I could be missing some warning
(I did synchronous logging though).
Will check with netconsole tomorrow.

Andres

2009-07-02 09:30:47

by Jarek Poplawski

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
...
> So I tried - and I did not catch any lockdep output before the crash.
> Unfortunately I do not have another machine on the same local network to
> catch any messages after the crash... So I could be missing some warning
> (I did synchronous logging though).
> Will check with netconsole tomorrow.

Could you try if this patch changes anything?

Thanks,
Jarek P.
---

net/sched/sch_netem.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2b88295..6001ad9 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -176,8 +176,6 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch)
return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
}

- skb_orphan(skb);
-
/*
* If we need to duplicate packet, then re-insert at top of the
* qdisc tree, since parent queuer expects that only one

2009-07-02 10:12:25

by Jarek Poplawski

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
> ...
> > So I tried - and I did not catch any lockdep output before the crash.
> > Unfortunately I do not have another machine on the same local network to
> > catch any messages after the crash... So I could be missing some warning
> > (I did synchronous logging though).
> > Will check with netconsole tomorrow.
>
> Could you try if this patch changes anything?

...and maybe CONFIG_PACKET_MMAP turned off.

Jarek P.

2009-07-02 10:52:29

by Joao Correia

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

Im getting this one too. On any .31-rc1-gitXX, it pops up a few
seconds after i turn traffic shaping on with tc/htb (no netem here).
Its easily reproducible on my end, turn on traffic shaping, get
traffic flowing, box freezes. No error message, nothing on logs,
although -sometimes- i get lockdep error messages on screen, but
scrolls too fast and box is otherwise unresponsive, so i can't get the
error out.

This does not happen on 2.6.30 at all.

Hope this helps.

(i tried both CONFIG_PACKET_MMAP off and the patch, still happens).

Joao Correia
Centro de Informatica
Universidade da Beira Interior
Portugal

(snip firewall script. TCBIN is just tc, OUTIF is just the outside
interface (gigabit D-Link using r8169 driver))
$TCBIN class add dev $OUTIF parent 1: classid 1:1 htb rate
${UPLINK}kbit ceil ${UPLINK}kbit
$TCBIN class add dev $OUTIF parent 1:1 classid 1:10 htb rate
$[30*$UPLINK/100]kbit ceil $[30*$UPLINK/100]kbit prio 0
$TCBIN class add dev $OUTIF parent 1:1 classid 1:11 htb rate
$[30*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 1
$TCBIN class add dev $OUTIF parent 1:1 classid 1:12 htb rate
$[8*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 2
$TCBIN class add dev $OUTIF parent 1:1 classid 1:13 htb rate
$[8*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 2
$TCBIN class add dev $OUTIF parent 1:1 classid 1:14 htb rate
$[10*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 7 quantum 50000
$TCBIN class add dev $OUTIF parent 1:1 classid 1:15 htb rate
$[13*$UPLINK/100]kbit ceil ${UPLINK}kbit prio 8

$TCBIN qdisc add dev $OUTIF parent 1:12 handle 120: sfq perturb 10
$TCBIN qdisc add dev $OUTIF parent 1:13 handle 130: sfq perturb 10
$TCBIN qdisc add dev $OUTIF parent 1:14 handle 140: sfq perturb 10
$TCBIN qdisc add dev $OUTIF parent 1:15 handle 150: sfq perturb 10

$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 1 handle 1 fw
classid 1:10
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 2 handle 2 fw
classid 1:11
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 3 handle 3 fw
classid 1:12
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 4 handle 4 fw
classid 1:13
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 5 handle 5 fw
classid 1:14
$TCBIN filter add dev $OUTIF parent 1:0 protocol ip prio 6 handle 6 fw
classid 1:15
(snip firewall script)

On Thu, Jul 2, 2009 at 11:12 AM, Jarek Poplawski<[email protected]> wrote:
> On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
>> ...
>> > So I tried - and I did not catch any lockdep output before the crash.
>> > Unfortunately I do not have another machine on the same local network to
>> > catch any messages after the crash... So I could be missing some warning
>> > (I did synchronous logging though).
>> > Will check with netconsole tomorrow.
>>
>> Could you try if this patch changes anything?
>
> ...and maybe CONFIG_PACKET_MMAP turned off.
>
> Jarek P.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/
>

2009-07-02 11:09:54

by Jarek Poplawski

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

On Thu, Jul 02, 2009 at 11:51:56AM +0100, Joao Correia wrote:
> Im getting this one too. On any .31-rc1-gitXX, it pops up a few
> seconds after i turn traffic shaping on with tc/htb (no netem here).
> Its easily reproducible on my end, turn on traffic shaping, get
> traffic flowing, box freezes. No error message, nothing on logs,
> although -sometimes- i get lockdep error messages on screen, but
> scrolls too fast and box is otherwise unresponsive, so i can't get the
> error out.
>
> This does not happen on 2.6.30 at all.
>
> Hope this helps.
>
> (i tried both CONFIG_PACKET_MMAP off and the patch, still happens).

It helps for sure - to save on useless checking, and maybe reproduce
too.

Thank you very much Joao,
Jarek P.

2009-07-02 11:12:19

by Andres Freund

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

On 07/02/2009 12:12 PM, Jarek Poplawski wrote:
> On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
>> ...
>>> So I tried - and I did not catch any lockdep output before the crash.
>>> Unfortunately I do not have another machine on the same local network to
>>> catch any messages after the crash... So I could be missing some warning
>>> (I did synchronous logging though).
>>> Will check with netconsole tomorrow.
>>
>> Could you try if this patch changes anything?
>
> ...and maybe CONFIG_PACKET_MMAP turned off.
Ok. Removed the skb_orphan and turned of CONFIG_PACKET_MMAP. Seemingly
the same game.
I now had another computer to catch the netconsole output. Still no
lockdep warnings.

Unfortunately the other computer was a windows machine with its strange
terminal, so long lines are wrapped at 80cols, but that shouldn't be too
bad.

Andres

Attachments:

dump_no_orphan_no_mmap.log (7.49 kB)

2009-07-02 11:43:57

by Andres Freund

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

On 07/02/2009 01:11 PM, Andres Freund wrote:
> On 07/02/2009 12:12 PM, Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
>>> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
>>> ...
>>>> So I tried - and I did not catch any lockdep output before the crash.
>>>> Unfortunately I do not have another machine on the same local
>>>> network to
>>>> catch any messages after the crash... So I could be missing some
>>>> warning
>>>> (I did synchronous logging though).
>>>> Will check with netconsole tomorrow.
>>>
>>> Could you try if this patch changes anything?
>>
>> ...and maybe CONFIG_PACKET_MMAP turned off.
> Ok. Removed the skb_orphan and turned of CONFIG_PACKET_MMAP. Seemingly
> the same game.
> I now had another computer to catch the netconsole output. Still no
> lockdep warnings.
>
> Unfortunately the other computer was a windows machine with its strange
> terminal, so long lines are wrapped at 80cols, but that shouldn't be too
> bad.
I will start trying to place the issue by testing with existing kernels
between 2.6.30 and now.

Andres

2009-07-02 11:44:11

by Jarek Poplawski

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

On Thu, Jul 02, 2009 at 01:11:56PM +0200, Andres Freund wrote:
> On 07/02/2009 12:12 PM, Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 09:30:31AM +0000, Jarek Poplawski wrote:
>>> On Thu, Jul 02, 2009 at 02:37:24AM +0200, Andres Freund wrote:
>>> ...
>>>> So I tried - and I did not catch any lockdep output before the crash.
>>>> Unfortunately I do not have another machine on the same local network to
>>>> catch any messages after the crash... So I could be missing some warning
>>>> (I did synchronous logging though).
>>>> Will check with netconsole tomorrow.
>>>
>>> Could you try if this patch changes anything?
>>
>> ...and maybe CONFIG_PACKET_MMAP turned off.
> Ok. Removed the skb_orphan and turned of CONFIG_PACKET_MMAP. Seemingly
> the same game.
> I now had another computer to catch the netconsole output. Still no
> lockdep warnings.
>
> Unfortunately the other computer was a windows machine with its strange
> terminal, so long lines are wrapped at 80cols, but that shouldn't be too
> bad.

It's very nice ;-) Alas we can see here only one side, so there is
still some searching needed. I hope I'll be able to reproduce this
later.

Thanks for your work Andres,
Jarek P.

2009-07-02 11:55:00

by Jarek Poplawski

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

On Thu, Jul 02, 2009 at 01:43:49PM +0200, Andres Freund wrote:
...
> I will start trying to place the issue by testing with existing kernels
> between 2.6.30 and now.

If you can afford your time of course this would be very helpful.

Thanks,
Jarek P.

2009-07-02 11:59:46

by Andres Freund

[permalink] [raw]

Subject: Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

On 07/02/2009 01:54 PM, Jarek Poplawski wrote:
> On Thu, Jul 02, 2009 at 01:43:49PM +0200, Andres Freund wrote: ...
>> I will start trying to place the issue by testing with existing
>> kernels between 2.6.30 and now.
> If you can afford your time of course this would be very helpful.
Well. Waiting for the issue to resolve itself would cost time as well ;-)
I wont be able to finish this today, but perhaps some reduction of the
search space will be enough.

I placed it between 2.6.30 and 03347e2592078a90df818670fddf97a33eec70fb
(v2.6.30-5415-g03347e2) so far.

Andres