2006-09-02 20:39:31

by Krzysztof Halasa

[permalink] [raw]
Subject: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

Hi,

FYI: Just enabled kernel lock testers on my old laptop machine
doing "internet services". 2.6.18-rc5, i686. All details available
on request, of course.

There is IP GRE tunnel here running over ADSL connection (USB
Thomson/Alcatel Speedtouch 330, PPP over ATM, in-kernel drivers).
Ethernet is DLink Tulip-based (PC Card 32-bit), probably not
relevant here. Iptables doing mostly ACCEPTs, REJECT and DROPs in
INPUT and FORWARD, there is also MASQUERADE but it probably doesn't
matter. Few ip rules directing some traffic to the GRE tunnel as well.

=======================================================
[ INFO: possible circular locking dependency detected ]
-------------------------------------------------------
swapper/0 is trying to acquire lock:
(&dev->queue_lock){-+..}, at: [<c02c8c46>] dev_queue_xmit+0x56/0x290

but task is already holding lock:
(&dev->_xmit_lock){-+..}, at: [<c02c8e14>] dev_queue_xmit+0x224/0x290

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&dev->_xmit_lock){-+..}:
[<c012e7b6>] lock_acquire+0x76/0xa0
[<c0336241>] _spin_lock_bh+0x31/0x40
[<c02d25a9>] dev_activate+0x69/0x120
[<c02c8149>] dev_open+0x59/0x70
[<c02c6921>] dev_change_flags+0x51/0x110
[<c0302464>] devinet_ioctl+0x484/0x670
[<c030295b>] inet_ioctl+0x6b/0x80
[<c02bc6e8>] sock_ioctl+0x118/0x200
[<c01698b0>] do_ioctl+0x20/0x70
[<c0169957>] vfs_ioctl+0x57/0x290
[<c0169bc9>] sys_ioctl+0x39/0x60
[<c0102c8d>] sysenter_past_esp+0x56/0x8d

-> #0 (&dev->queue_lock){-+..}:
[<c012e7b6>] lock_acquire+0x76/0xa0
[<c03361fc>] _spin_lock+0x2c/0x40
[<c02c8c46>] dev_queue_xmit+0x56/0x290
[<c02e2cfd>] ip_output+0x1ad/0x250
[<c883dea2>] ipgre_tunnel_xmit+0x412/0x740 [ip_gre]
[<c02c73db>] dev_hard_start_xmit+0x1bb/0x220
[<c02c8e2b>] dev_queue_xmit+0x23b/0x290
[<c02e2cfd>] ip_output+0x1ad/0x250
[<c03142bc>] reject+0x37c/0x6d0
[<c0313688>] ipt_do_table+0x2b8/0x330
[<c03137b7>] ipt_hook+0x27/0x30
[<c02d6529>] nf_iterate+0x59/0x80
[<c02d667a>] nf_hook_slow+0x4a/0xc0
[<c02ddb55>] ip_local_deliver+0x175/0x1c0
[<c02dd7bc>] ip_rcv+0x25c/0x480
[<c02c705e>] netif_receive_skb+0x15e/0x1e0
[<c02c8922>] process_backlog+0x82/0x110
[<c02c8a22>] net_rx_action+0x72/0x120
[<c01194b5>] __do_softirq+0x55/0xc0
[<c0104b13>] do_softirq+0x63/0xd0

other info that might help us debug this:

2 locks held by swapper/0:
#0: (&table->lock){-.-+}, at: [<c0313421>] ipt_do_table+0x51/0x330
#1: (&dev->_xmit_lock){-+..}, at: [<c02c8e14>] dev_queue_xmit+0x224/0x290

stack backtrace:
[<c0103522>] show_trace+0x12/0x20
[<c0103b79>] dump_stack+0x19/0x20
[<c012d6c1>] print_circular_bug_tail+0x61/0x70
[<c012e196>] __lock_acquire+0xac6/0xd70
[<c012e7b6>] lock_acquire+0x76/0xa0
[<c03361fc>] _spin_lock+0x2c/0x40
[<c02c8c46>] dev_queue_xmit+0x56/0x290
[<c02e2cfd>] ip_output+0x1ad/0x250
[<c883dea2>] ipgre_tunnel_xmit+0x412/0x740 [ip_gre]
[<c02c73db>] dev_hard_start_xmit+0x1bb/0x220
[<c02c8e2b>] dev_queue_xmit+0x23b/0x290
[<c02e2cfd>] ip_output+0x1ad/0x250
[<c03142bc>] reject+0x37c/0x6d0
[<c0313688>] ipt_do_table+0x2b8/0x330
[<c03137b7>] ipt_hook+0x27/0x30
[<c02d6529>] nf_iterate+0x59/0x80
[<c02d667a>] nf_hook_slow+0x4a/0xc0
[<c02ddb55>] ip_local_deliver+0x175/0x1c0
[<c02dd7bc>] ip_rcv+0x25c/0x480
[<c02c705e>] netif_receive_skb+0x15e/0x1e0
[<c02c8922>] process_backlog+0x82/0x110
[<c02c8a22>] net_rx_action+0x72/0x120
[<c01194b5>] __do_softirq+0x55/0xc0
[<c0104b13>] do_softirq+0x63/0xd0
=======================
[<c0119395>] irq_exit+0x35/0x40
[<c0104c0f>] do_IRQ+0x8f/0xf0
[<c0102ef9>] common_interrupt+0x25/0x2c
[<c01018e9>] cpu_idle+0x39/0x50
[<c010052e>] rest_init+0x1e/0x30
[<c041f76e>] start_kernel+0x25e/0x2c0
[<c0100199>] 0xc0100199

--
Krzysztof Halasa

--
VGER BF report: U 0.5


2006-09-02 21:52:05

by Francois Romieu

[permalink] [raw]
Subject: Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

Krzysztof Halasa <[email protected]> :
[...]
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> -------------------------------------------------------
> swapper/0 is trying to acquire lock:
> (&dev->queue_lock){-+..}, at: [<c02c8c46>] dev_queue_xmit+0x56/0x290
>
> but task is already holding lock:
> (&dev->_xmit_lock){-+..}, at: [<c02c8e14>] dev_queue_xmit+0x224/0x290
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&dev->_xmit_lock){-+..}:
> [<c012e7b6>] lock_acquire+0x76/0xa0
> [<c0336241>] _spin_lock_bh+0x31/0x40
> [<c02d25a9>] dev_activate+0x69/0x120
[...]
> [<c0169957>] vfs_ioctl+0x57/0x290
> [<c0169bc9>] sys_ioctl+0x39/0x60
> [<c0102c8d>] sysenter_past_esp+0x56/0x8d

>
> -> #0 (&dev->queue_lock){-+..}:
> [<c012e7b6>] lock_acquire+0x76/0xa0
> [<c03361fc>] _spin_lock+0x2c/0x40
> [<c02c8c46>] dev_queue_xmit+0x56/0x290
[...]
> [<c01194b5>] __do_softirq+0x55/0xc0
> [<c0104b13>] do_softirq+0x63/0xd0

dev_activate takes BH disabling locks only. How could a softirq happen
on the same cpu and trigger a deadlock ?

--
Ueimor

--
VGER BF report: U 0.500151

2006-09-03 07:53:53

by Krzysztof Halasa

[permalink] [raw]
Subject: Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

Francois Romieu <[email protected]> writes:

> dev_activate takes BH disabling locks only. How could a softirq happen
> on the same cpu and trigger a deadlock ?

That's BTW "possible circular locking dependency" and there is only
one CPU (no HT, single core - an old mobile Celeron P3).

I have to admit this machine has a history of mysterious hangs (black
screen of death), though I think they are related to LAN and maybe disk
activity, not GRE/PPP/ATM/ADSL (RAM tests ok, the hardware is rather
common - i440BX etc. but who knows).
--
Krzysztof Halasa

--
VGER BF report: S 0.999877

2006-09-03 11:16:14

by Herbert Xu

[permalink] [raw]
Subject: Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

[NET]: Drop tx lock in dev_watchdog_up

On Sat, Sep 02, 2006 at 08:39:28PM +0000, Krzysztof Halasa wrote:
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> -------------------------------------------------------
> swapper/0 is trying to acquire lock:
> (&dev->queue_lock){-+..}, at: [<c02c8c46>] dev_queue_xmit+0x56/0x290
>
> but task is already holding lock:
> (&dev->_xmit_lock){-+..}, at: [<c02c8e14>] dev_queue_xmit+0x224/0x290
>
> which lock already depends on the new lock.

This turns out to be a genuine bug. The queue lock and xmit lock are
intentionally taken out of order. Two things are supposed to prevent
dead-locks from occuring:

1) When we hold the queue_lock we're supposed to only do try_lock on the
tx_lock.

2) We always drop the queue_lock after taking the tx_lock and before doing
anything else.

>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&dev->_xmit_lock){-+..}:
> [<c012e7b6>] lock_acquire+0x76/0xa0
> [<c0336241>] _spin_lock_bh+0x31/0x40
> [<c02d25a9>] dev_activate+0x69/0x120

This path obviously breaks assumption 1) and therefore can lead to ABBA
dead-locks.

I've looked at the history and there seems to be no reason for the lock
to be held at all in dev_watchdog_up. The lock appeared in day one and
even there it was unnecessary. In fact, people added __dev_watchdog_up
precisely in order to get around the tx lock there.

The function dev_watchdog_up is already serialised by rtnl_lock since
its only caller dev_activate is always called under it.

So here is a simple patch to remove the tx lock from dev_watchdog_up.
In 2.6.19 we can eliminate the unnecessary __dev_watchdog_up and
replace it with dev_watchdog_up.

Signed-off-by: Herbert Xu <[email protected]>

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 0834c2e..6f91518 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -238,9 +238,7 @@ void __netdev_watchdog_up(struct net_dev

static void dev_watchdog_up(struct net_device *dev)
{
- netif_tx_lock_bh(dev);
__netdev_watchdog_up(dev);
- netif_tx_unlock_bh(dev);
}

static void dev_watchdog_down(struct net_device *dev)

--
VGER BF report: U 0.5

2006-09-03 12:43:25

by Krzysztof Halasa

[permalink] [raw]
Subject: Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

Herbert Xu <[email protected]> writes:

> [NET]: Drop tx lock in dev_watchdog_up
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -238,9 +238,7 @@ void __netdev_watchdog_up(struct net_dev
>
> static void dev_watchdog_up(struct net_device *dev)
> {
> - netif_tx_lock_bh(dev);
> __netdev_watchdog_up(dev);
> - netif_tx_unlock_bh(dev);
> }
>
> static void dev_watchdog_down(struct net_device *dev)

Many thanks for looking into this. The lockdep warning is gone now.
--
Krzysztof Halasa

--
VGER BF report: H 3.43998e-11

2006-09-03 19:27:09

by Larry Finger

[permalink] [raw]
Subject: Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

Herbert Xu wrote:
> [NET]: Drop tx lock in dev_watchdog_up
>
> On Sat, Sep 02, 2006 at 08:39:28PM +0000, Krzysztof Halasa wrote:
>> =======================================================
>> [ INFO: possible circular locking dependency detected ]
>> -------------------------------------------------------
>> swapper/0 is trying to acquire lock:
>> (&dev->queue_lock){-+..}, at: [<c02c8c46>] dev_queue_xmit+0x56/0x290
>>
>> but task is already holding lock:
>> (&dev->_xmit_lock){-+..}, at: [<c02c8e14>] dev_queue_xmit+0x224/0x290
>>
>> which lock already depends on the new lock.
>
> This turns out to be a genuine bug. The queue lock and xmit lock are
> intentionally taken out of order. Two things are supposed to prevent
> dead-locks from occuring:
>
> 1) When we hold the queue_lock we're supposed to only do try_lock on the
> tx_lock.
>
> 2) We always drop the queue_lock after taking the tx_lock and before doing
> anything else.
>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (&dev->_xmit_lock){-+..}:
>> [<c012e7b6>] lock_acquire+0x76/0xa0
>> [<c0336241>] _spin_lock_bh+0x31/0x40
>> [<c02d25a9>] dev_activate+0x69/0x120
>
> This path obviously breaks assumption 1) and therefore can lead to ABBA
> dead-locks.
>
> I've looked at the history and there seems to be no reason for the lock
> to be held at all in dev_watchdog_up. The lock appeared in day one and
> even there it was unnecessary. In fact, people added __dev_watchdog_up
> precisely in order to get around the tx lock there.
>
> The function dev_watchdog_up is already serialised by rtnl_lock since
> its only caller dev_activate is always called under it.
>
> So here is a simple patch to remove the tx lock from dev_watchdog_up.
> In 2.6.19 we can eliminate the unnecessary __dev_watchdog_up and
> replace it with dev_watchdog_up.
>
> Signed-off-by: Herbert Xu <[email protected]>

This patch also fixes a circular locking problem that I found on Linville's wireless-dev tree
involving ieee80211 and wpa_supplicant.

Larry Finger




--
VGER BF report: H 1.06514e-06

2006-09-04 08:46:06

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

Hello!

> This path obviously breaks assumption 1) and therefore can lead to ABBA
> dead-locks.

Yes...


> I've looked at the history and there seems to be no reason for the lock
> to be held at all in dev_watchdog_up. The lock appeared in day one and
> even there it was unnecessary.

Seems, it serializes mod_timer and timer handler to keep timer
in predictable state. Maybe, this is not necessary. A priori, it is required.

Note that in dev_watchdog_down() queue_lock is released before
taking xmit_lock. Probably, this is the thing which was supposed
to be done in dev_watchdog_up() too.

Alexey

--
VGER BF report: U 0.46385

2006-09-04 08:48:38

by Herbert Xu

[permalink] [raw]
Subject: Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

On Mon, Sep 04, 2006 at 12:44:14PM +0400, Alexey Kuznetsov wrote:
>
> Seems, it serializes mod_timer and timer handler to keep timer
> in predictable state. Maybe, this is not necessary. A priori, it is required.
>
> Note that in dev_watchdog_down() queue_lock is released before
> taking xmit_lock. Probably, this is the thing which was supposed
> to be done in dev_watchdog_up() too.

Right, in that case this should definitely be unncessary because both
dev_watchdog_up and dev_watchdog_up are called under RTNL.

The function __dev_watchdog_up could possibly be dodgy though but that's
a different story.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

--
VGER BF report: H 3.52484e-12

2006-09-18 07:23:09

by David Miller

[permalink] [raw]
Subject: Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

From: Herbert Xu <[email protected]>
Date: Sun, 3 Sep 2006 21:15:07 +1000

> So here is a simple patch to remove the tx lock from dev_watchdog_up.
> In 2.6.19 we can eliminate the unnecessary __dev_watchdog_up and
> replace it with dev_watchdog_up.
>
> Signed-off-by: Herbert Xu <[email protected]>

Applied, thanks Herbert.