2010-01-18 21:37:07

by Stefan Richter

[permalink] [raw]
Subject: [PATCH] firewire: net: fix panic in fwnet_write_complete

Date:
From: Stefan Richter <[email protected]>
Subject: firewire: net: fix panic in fwnet_write_complete

In the transmit path of firewire-net (IPv4 over 1394), the following
race condition may occur:
- The networking soft IRQ inserts a datagram into the 1394 async
request transmit DMA.
- The 1394 async transmit completion tasklet runs to finish cleaning
up (unlink datagram from list of pending ones, release skb and
outbound 1394 transaction object) --- before the networking soft IRQ
had a chance to proceed and add the datagram to the list of pending
datagrams.

This caused a panic in the 1394 async transmit completion tasklet when
it dereferenced unitialized list heads:
http://bugzilla.kernel.org/show_bug.cgi?id=15077

The fix is to add checks in the tx soft RQ and in the tasklet to
determine who of these two is the last referrer to the transaction
object. Then handle the cleanup of the object by the last referrer
rather than assuming that the tasklet is always the last one.

There is another similar race: Between said tasklet and fwnet_close,
i.e. at ifdown. However, that race is much less likely to occur in
practice and shall be fixed in a separate update.

Reported-by: Илья Басин <[email protected]>
Signed-off-by: Stefan Richter <[email protected]>
---

Илья, could you give this a try?

drivers/firewire/net.c | 53 ++++++++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 14 deletions(-)

Index: linux-2.6.32.2/drivers/firewire/net.c
===================================================================
--- linux-2.6.32.2.orig/drivers/firewire/net.c
+++ linux-2.6.32.2/drivers/firewire/net.c
@@ -893,20 +893,31 @@ static void fwnet_receive_broadcast(stru

static struct kmem_cache *fwnet_packet_task_cache;

+static void fwnet_free_ptask(struct fwnet_packet_task *ptask)
+{
+ dev_kfree_skb_any(ptask->skb);
+ kmem_cache_free(fwnet_packet_task_cache, ptask);
+}
+
static int fwnet_send_packet(struct fwnet_packet_task *ptask);

static void fwnet_transmit_packet_done(struct fwnet_packet_task *ptask)
{
- struct fwnet_device *dev;
+ struct fwnet_device *dev = ptask->dev;
unsigned long flags;
-
- dev = ptask->dev;
+ bool free;

spin_lock_irqsave(&dev->lock, flags);
- list_del(&ptask->pt_link);
- spin_unlock_irqrestore(&dev->lock, flags);

- ptask->outstanding_pkts--; /* FIXME access inside lock */
+ ptask->outstanding_pkts--;
+
+ /* Check whether we or the networking TX soft-IRQ is last user. */
+ free = (ptask->outstanding_pkts == 0 && !list_empty(&ptask->pt_link));
+
+ if (ptask->outstanding_pkts == 0)
+ list_del(&ptask->pt_link);
+
+ spin_unlock_irqrestore(&dev->lock, flags);

if (ptask->outstanding_pkts > 0) {
u16 dg_size;
@@ -951,10 +962,10 @@ static void fwnet_transmit_packet_done(s
ptask->max_payload = skb->len + RFC2374_FRAG_HDR_SIZE;
}
fwnet_send_packet(ptask);
- } else {
- dev_kfree_skb_any(ptask->skb);
- kmem_cache_free(fwnet_packet_task_cache, ptask);
}
+
+ if (free)
+ fwnet_free_ptask(ptask);
}

static void fwnet_write_complete(struct fw_card *card, int rcode,
@@ -977,6 +988,7 @@ static int fwnet_send_packet(struct fwne
unsigned tx_len;
struct rfc2734_header *bufhdr;
unsigned long flags;
+ bool free;

dev = ptask->dev;
tx_len = ptask->max_payload;
@@ -1022,12 +1034,16 @@ static int fwnet_send_packet(struct fwne
generation, SCODE_100, 0ULL, ptask->skb->data,
tx_len + 8, fwnet_write_complete, ptask);

- /* FIXME race? */
spin_lock_irqsave(&dev->lock, flags);
- list_add_tail(&ptask->pt_link, &dev->broadcasted_list);
+
+ /* If the AT tasklet already ran, we may be last user. */
+ free = (ptask->outstanding_pkts == 0 && list_empty(&ptask->pt_link));
+ if (!free)
+ list_add_tail(&ptask->pt_link, &dev->broadcasted_list);
+
spin_unlock_irqrestore(&dev->lock, flags);

- return 0;
+ goto out;
}

fw_send_request(dev->card, &ptask->transaction,
@@ -1035,12 +1051,19 @@ static int fwnet_send_packet(struct fwne
ptask->generation, ptask->speed, ptask->fifo_addr,
ptask->skb->data, tx_len, fwnet_write_complete, ptask);

- /* FIXME race? */
spin_lock_irqsave(&dev->lock, flags);
- list_add_tail(&ptask->pt_link, &dev->sent_list);
+
+ /* If the AT tasklet already ran, we may be last user. */
+ free = (ptask->outstanding_pkts == 0 && list_empty(&ptask->pt_link));
+ if (!free)
+ list_add_tail(&ptask->pt_link, &dev->sent_list);
+
spin_unlock_irqrestore(&dev->lock, flags);

dev->netdev->trans_start = jiffies;
+ out:
+ if (free)
+ fwnet_free_ptask(ptask);

return 0;
}
@@ -1298,6 +1321,8 @@ static netdev_tx_t fwnet_tx(struct sk_bu
spin_unlock_irqrestore(&dev->lock, flags);

ptask->max_payload = max_payload;
+ INIT_LIST_HEAD(&ptask->pt_link);
+
fwnet_send_packet(ptask);

return NETDEV_TX_OK;


--
Stefan Richter
-=====-==-=- ---= =--=-
http://arcgraph.de/sr/


2010-01-19 00:44:40

by Stefan Richter

[permalink] [raw]
Subject: Re: [PATCH] firewire: net: fix panic in fwnet_write_complete

Stefan Richter wrote:
> The fix is to add checks in the tx soft RQ and in the tasklet to
> determine who of these two is the last referrer to the transaction
> object. Then handle the cleanup of the object by the last referrer
> rather than assuming that the tasklet is always the last one.
...
> Илья, could you give this a try?

My own testing on a dual core box --- peered with another Linux box
which ran the older eth1394 --- worked OK so far for transfers of
massive files (> 4 GiB) back and forth via FTP and ssh running on a text
console.

But in my first attempt to use FTP on X11 --- i.e. with more concurrent
interrupt sources --- the firewire-net box crashed very soon. In that
test I used Dolphin of KDE as FTP client, and the crash already happened
after Dolphin had loaded and displayed the remote home directory and was
peeking into files for preview data. I got the following trace:

------------: cut here ]------------
kernel: BUG at mm/slab.c:2885!
invalid: opcode: 0000 [#1]
PREEMPT:
SMP:
DEBUG_PAGEALLOC:

last: sysfs file:
/sys/devices/pci0000:00/0000:00:1e.0/0000:03:03.0/fw0/units
Modules: linked in:
firewire_net:
firewire_ohci:
firewire_core:
netconsole:
nfs:
lockd:
sunrpc:
i915:
drm_kms_helper:
drm:
i2c_algo_bit:
snd_hda_codec_idt:
snd_hda_intel:
snd_hda_codec:
applesmc:
led_class:
input_polldev:
snd_pcm:
rtc:
i2c_i801:
snd_timer:
sg:
sky2:
snd:
video:
backlight:
snd_page_alloc:
thermal:
output:
button:


Pid: 4267, comm: kio_thumbnail Not tainted 2.6.33-rc4 #2
Mac-F4208EC8/Macmini1,1
EIP: 0060:[<c1080e4f>] EFLAGS: 00010006 CPU: 0
EIP: is at cache_free_debugcheck+0x1e8/0x2e8
EAX: f8efddca EBX: f2008000 ECX: 65727661 EDX: 00c38e0a
ESI: d84156c5 EDI: f707fe40 EBP: f07fddc0 ESP: f07fdd80
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process: kio_thumbnail (pid: 4267, ti=f07fc000 task=f079b6e8
task.ti=f07fc000)
Stack:
f2008e20:
00000000:
0000000c:
c11c6f96:
00640100:
f2008000:
635688c0:
d84156c5:

kernel:
f2008ec0:
00000082:
f2008e18:
00000000:
00000000:
f715ff30:
f707fe40:
f2008e20:

kernel:
f07fddd8:
c108137c:
00000282:
f2008e20:
f2008e3c:
00000060:
f07fdde4:
c11c6f96:

Call: Trace:
? __kfree_skb+0x6e/0x71
? kmem_cache_free+0x56/0xb0
? __kfree_skb+0x6e/0x71
? kfree_skb+0x2b/0x2d
? unix_stream_recvmsg+0x3c3/0x48d
? file_read_actor+0x74/0xcc
? sock_aio_read+0xf2/0x107
? do_sync_read+0x89/0xc7


This was all that made it out via netconsole. This was on 2.6.33-rc4
and I think I am going back to 2.6.32.y to eliminate unrelated .33-rc
flakiness from my testing.
--
Stefan Richter
-=====-==-=- ---= =--==
http://arcgraph.de/sr/

2010-01-26 20:10:09

by Stefan Richter

[permalink] [raw]
Subject: Re: [PATCH] firewire: net: fix panic in fwnet_write_complete

Stefan Richter wrote:
> My own testing on a dual core box --- peered with another Linux box
> which ran the older eth1394 --- worked OK so far for transfers of
> massive files (> 4 GiB) back and forth via FTP and ssh running on a text
> console.
>
> But in my first attempt to use FTP on X11 --- i.e. with more concurrent
> interrupt sources --- the firewire-net box crashed very soon. In that
> test I used Dolphin of KDE as FTP client, and the crash already happened
> after Dolphin had loaded and displayed the remote home directory and was
> peeking into files for preview data. I got the following trace:
>
> ------------: cut here ]------------
> kernel: BUG at mm/slab.c:2885!
[...]
> EIP: is at cache_free_debugcheck+0x1e8/0x2e8
[...]
> Call: Trace:
> ? __kfree_skb+0x6e/0x71
> ? kmem_cache_free+0x56/0xb0
> ? __kfree_skb+0x6e/0x71
> ? kfree_skb+0x2b/0x2d
> ? unix_stream_recvmsg+0x3c3/0x48d
> ? file_read_actor+0x74/0xcc
> ? sock_aio_read+0xf2/0x107
> ? do_sync_read+0x89/0xc7

Hi Илья,

I am going to send a pull request for some other unrelated firewire
fixes to Linus about tomorrow.

firewire-net on the other hand needs still more work than my
fwnet_write_complete patch since you and I now get these kmem cache
corruption related bugs.

What is your impression --- does this first incomplete fix decrease the
likelihood of crashes enough to make it worth to include it in a pull
request already? I haven't done more extensive firewire-net tests since
last week yet, hence it is hard to tell for me how severe the new issue
is in practical use.

(Also, I have no idea yet whether I will be quick or slow to find this
other problem and whether it can be fixed in a manner that is suitable
for a mainline merge before 2.6.33 is going to be released.)
--
Stefan Richter
-=====-==-=- ---= ==-=-
http://arcgraph.de/sr/

2010-01-31 09:44:47

by Ilya Basin

[permalink] [raw]
Subject: Re[2]: [PATCH] firewire: net: fix panic in fwnet_write_complete

SR> Stefan Richter wrote:
>> My own testing on a dual core box --- peered with another Linux box
>> which ran the older eth1394 --- worked OK so far for transfers of
>> massive files (> 4 GiB) back and forth via FTP and ssh running on a text
>> console.
>>
>> But in my first attempt to use FTP on X11 --- i.e. with more concurrent
>> interrupt sources --- the firewire-net box crashed very soon. In that
>> test I used Dolphin of KDE as FTP client, and the crash already happened
>> after Dolphin had loaded and displayed the remote home directory and was
>> peeking into files for preview data. I got the following trace:
>>
>> ------------: cut here ]------------
>> kernel: BUG at mm/slab.c:2885!
SR> [...]
>> EIP: is at cache_free_debugcheck+0x1e8/0x2e8
SR> [...]
>> Call: Trace:
>> ? __kfree_skb+0x6e/0x71
>> ? kmem_cache_free+0x56/0xb0
>> ? __kfree_skb+0x6e/0x71
>> ? kfree_skb+0x2b/0x2d
>> ? unix_stream_recvmsg+0x3c3/0x48d
>> ? file_read_actor+0x74/0xcc
>> ? sock_aio_read+0xf2/0x107
>> ? do_sync_read+0x89/0xc7

SR> Hi Илья,

SR> I am going to send a pull request for some other unrelated firewire
SR> fixes to Linus about tomorrow.

SR> firewire-net on the other hand needs still more work than my
SR> fwnet_write_complete patch since you and I now get these kmem cache
SR> corruption related bugs.

SR> What is your impression --- does this first incomplete fix decrease the
SR> likelihood of crashes enough to make it worth to include it in a pull
SR> request already? I haven't done more extensive firewire-net tests since
SR> last week yet, hence it is hard to tell for me how severe the new issue
SR> is in practical use.

SR> (Also, I have no idea yet whether I will be quick or slow to find this
SR> other problem and whether it can be fixed in a manner that is suitable
SR> for a mainline merge before 2.6.33 is going to be released.)

Hi. Have just found 4 letters from you in gmail spam bin, including
this one. Your fix does decrease the chance of crash. Without the fix
crashes randomly occured within 3 minutes. With the fix - only after
10 minutes of copying. These 2 bugs are probably unrelated.

--

2010-01-31 10:31:53

by Stefan Richter

[permalink] [raw]
Subject: Re: [PATCH] firewire: net: fix panic in fwnet_write_complete

Илья Басин wrote:
> SR> Stefan Richter wrote:
> SR> I am going to send a pull request for some other unrelated firewire
> SR> fixes to Linus about tomorrow.
[...]
> SR> What is your impression --- does this first incomplete fix decrease the
> SR> likelihood of crashes enough to make it worth to include it in a pull
> SR> request already? I haven't done more extensive firewire-net tests since
> SR> last week yet, hence it is hard to tell for me how severe the new issue
> SR> is in practical use.
[...]
> Hi. Have just found 4 letters from you in gmail spam bin, including
> this one.

(One reason may be that my mail service provider occasionally gets
blacklisted in SORBS, perhaps other DNSBLs. Or they distrust kyrillic
letters in mails from .de or whatever.)

> Your fix does decrease the chance of crash. Without the fix
> crashes randomly occured within 3 minutes. With the fix - only after
> 10 minutes of copying. These 2 bugs are probably unrelated.

OK. The patch didn't make it into 2.6.33-rc6 from last Friday but I
will see to it that it gets merged before the final release. I will
also try to find time to learn more about the other bug.
--
Stefan Richter
-=====-==-=- ---= =====
http://arcgraph.de/sr/