2004-04-04 14:16:03

by Marcel Lanz

[permalink] [raw]
Subject: [PANIC] ohci1394 & copy large files

Since 2.6.4 and still in 2.6.5 I get regurarly a Kernel panic if I try
to backup large files (10-35GB) to an external attached disc (200GB/JFS) via ieee1394/sbp2.

Has anyone similar problems ?

Marcel


2004-04-04 14:24:59

by Ben Collins

[permalink] [raw]
Subject: Re: [PANIC] ohci1394 & copy large files

On Sun, Apr 04, 2004 at 04:16:00PM +0200, Marcel Lanz wrote:
> Since 2.6.4 and still in 2.6.5 I get regurarly a Kernel panic if I try
> to backup large files (10-35GB) to an external attached disc (200GB/JFS) via ieee1394/sbp2.
>
> Has anyone similar problems ?

Known issue, fixed in our repo. I still need to sync with Linus once I
iron one more issue and merge some more patches.

--
Debian - http://www.debian.org/
Linux 1394 - http://www.linux1394.org/
Subversion - http://subversion.tigris.org/
WatchGuard - http://www.watchguard.com/

2004-04-04 23:01:47

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PANIC] ohci1394 & copy large files

On Mon, 2004-04-05 at 00:13, Ben Collins wrote:
> On Sun, Apr 04, 2004 at 04:16:00PM +0200, Marcel Lanz wrote:
> > Since 2.6.4 and still in 2.6.5 I get regurarly a Kernel panic if I try
> > to backup large files (10-35GB) to an external attached disc (200GB/JFS) via ieee1394/sbp2.
> >
> > Has anyone similar problems ?
>
> Known issue, fixed in our repo. I still need to sync with Linus once I
> iron one more issue and merge some more patches.

Hi Ben !

I don't want to be too critical or harsh or whatever, but why don't you
just send such fixes right upstream instead of stacking patches for a
while in your repo ? From my experience, such "batching" of patches is
the _wrong_ thing to do, and typically, there is a major useability
issue with sbp2 that could have been "right" in 2.6.5 final and will not
be (so we'll have to wait what ? 1 or 2 monthes more now to have a
release kernel with a reliable sbp2)

Ben.


2004-04-04 23:29:39

by Ben Collins

[permalink] [raw]
Subject: Re: [PANIC] ohci1394 & copy large files

On Mon, Apr 05, 2004 at 09:00:24AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2004-04-05 at 00:13, Ben Collins wrote:
> > On Sun, Apr 04, 2004 at 04:16:00PM +0200, Marcel Lanz wrote:
> > > Since 2.6.4 and still in 2.6.5 I get regurarly a Kernel panic if I try
> > > to backup large files (10-35GB) to an external attached disc (200GB/JFS) via ieee1394/sbp2.
> > >
> > > Has anyone similar problems ?
> >
> > Known issue, fixed in our repo. I still need to sync with Linus once I
> > iron one more issue and merge some more patches.
>
> Hi Ben !
>
> I don't want to be too critical or harsh or whatever, but why don't you
> just send such fixes right upstream instead of stacking patches for a
> while in your repo ? From my experience, such "batching" of patches is
> the _wrong_ thing to do, and typically, there is a major useability
> issue with sbp2 that could have been "right" in 2.6.5 final and will not
> be (so we'll have to wait what ? 1 or 2 monthes more now to have a
> release kernel with a reliable sbp2)

Because the fix was pretty extensive and needed testing. It was
potentially more broken that the problem it was fixing. Sending untested
patches to Linus is far worse than batching a few up and pushing to him.

--
Debian - http://www.debian.org/
Linux 1394 - http://www.linux1394.org/
Subversion - http://subversion.tigris.org/
WatchGuard - http://www.watchguard.com/

2004-04-04 23:34:55

by Randy.Dunlap

[permalink] [raw]
Subject: Re: [PANIC] ohci1394 & copy large files

On Sun, 4 Apr 2004 19:17:46 -0400 Ben Collins <[email protected]> wrote:

| On Mon, Apr 05, 2004 at 09:00:24AM +1000, Benjamin Herrenschmidt wrote:
| > On Mon, 2004-04-05 at 00:13, Ben Collins wrote:
| > > On Sun, Apr 04, 2004 at 04:16:00PM +0200, Marcel Lanz wrote:
| > > > Since 2.6.4 and still in 2.6.5 I get regurarly a Kernel panic if I try
| > > > to backup large files (10-35GB) to an external attached disc (200GB/JFS) via ieee1394/sbp2.
| > > >
| > > > Has anyone similar problems ?
| > >
| > > Known issue, fixed in our repo. I still need to sync with Linus once I
| > > iron one more issue and merge some more patches.
| >
| > Hi Ben !
| >
| > I don't want to be too critical or harsh or whatever, but why don't you
| > just send such fixes right upstream instead of stacking patches for a
| > while in your repo ? From my experience, such "batching" of patches is
| > the _wrong_ thing to do, and typically, there is a major useability
| > issue with sbp2 that could have been "right" in 2.6.5 final and will not
| > be (so we'll have to wait what ? 1 or 2 monthes more now to have a
| > release kernel with a reliable sbp2)
|
| Because the fix was pretty extensive and needed testing. It was
| potentially more broken that the problem it was fixing. Sending untested
| patches to Linus is far worse than batching a few up and pushing to him.

Was (is) it already being tested more extensively in the -mm patches
before going to Linus? Should/could be. E.g., that's what gregkh does,
and ACPI, etc.

--
~Randy

2004-04-05 00:09:17

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PANIC] ohci1394 & copy large files

On Mon, 2004-04-05 at 09:17, Ben Collins wrote:

> Because the fix was pretty extensive and needed testing. It was
> potentially more broken that the problem it was fixing. Sending untested
> patches to Linus is far worse than batching a few up and pushing to him.

Ok, makes sense, it wasn't just a 1-liner quick fix then ;)

Still, from my experience, _very few_ people actually test things in
trees like ieee1394, fbdev, etc... Even my tree isn't what it used
to be for pmacs now that I'm fully in sync upstream.

Even -mm lately haven't been as tested as it used to be (possibly
because of upstream getting better). I find it's quite ok to send
a fix that needs a bit more testing to a Linus -rc1 (but not later),

Ben.



2004-04-05 00:21:38

by Ben Collins

[permalink] [raw]
Subject: Re: [PANIC] ohci1394 & copy large files

On Sun, Apr 04, 2004 at 04:28:18PM -0700, Randy.Dunlap wrote:
> On Sun, 4 Apr 2004 19:17:46 -0400 Ben Collins <[email protected]> wrote:
>
> | On Mon, Apr 05, 2004 at 09:00:24AM +1000, Benjamin Herrenschmidt wrote:
> | > On Mon, 2004-04-05 at 00:13, Ben Collins wrote:
> | > > On Sun, Apr 04, 2004 at 04:16:00PM +0200, Marcel Lanz wrote:
> | > > > Since 2.6.4 and still in 2.6.5 I get regurarly a Kernel panic if I try
> | > > > to backup large files (10-35GB) to an external attached disc (200GB/JFS) via ieee1394/sbp2.
> | > > >
> | > > > Has anyone similar problems ?
> | > >
> | > > Known issue, fixed in our repo. I still need to sync with Linus once I
> | > > iron one more issue and merge some more patches.
> | >
> | > Hi Ben !
> | >
> | > I don't want to be too critical or harsh or whatever, but why don't you
> | > just send such fixes right upstream instead of stacking patches for a
> | > while in your repo ? From my experience, such "batching" of patches is
> | > the _wrong_ thing to do, and typically, there is a major useability
> | > issue with sbp2 that could have been "right" in 2.6.5 final and will not
> | > be (so we'll have to wait what ? 1 or 2 monthes more now to have a
> | > release kernel with a reliable sbp2)
> |
> | Because the fix was pretty extensive and needed testing. It was
> | potentially more broken that the problem it was fixing. Sending untested
> | patches to Linus is far worse than batching a few up and pushing to him.
>
> Was (is) it already being tested more extensively in the -mm patches
> before going to Linus? Should/could be. E.g., that's what gregkh does,
> and ACPI, etc.

That's what generally happens in our own repo, and if I get the chance
to sync them to my bk tree, then that's what happens with -mm too.
Wasn't the case here since I've been swamped for a little over a week.

--
Debian - http://www.debian.org/
Linux 1394 - http://www.linux1394.org/
Subversion - http://subversion.tigris.org/
WatchGuard - http://www.watchguard.com/

2004-04-05 00:24:04

by Ben Collins

[permalink] [raw]
Subject: Re: [PANIC] ohci1394 & copy large files

On Mon, Apr 05, 2004 at 10:07:56AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2004-04-05 at 09:17, Ben Collins wrote:
>
> > Because the fix was pretty extensive and needed testing. It was
> > potentially more broken that the problem it was fixing. Sending untested
> > patches to Linus is far worse than batching a few up and pushing to him.
>
> Ok, makes sense, it wasn't just a 1-liner quick fix then ;)
>
> Still, from my experience, _very few_ people actually test things in
> trees like ieee1394, fbdev, etc... Even my tree isn't what it used
> to be for pmacs now that I'm fully in sync upstream.
>
> Even -mm lately haven't been as tested as it used to be (possibly
> because of upstream getting better). I find it's quite ok to send
> a fix that needs a bit more testing to a Linus -rc1 (but not later),

People that experience problems that I have a fix for in the repo,
usually get pointed to that repo for testing.

In this case I did the fix after -rc2, wasn't comfortable enough to send
it for -rc3 and I think I may wait for 2.6.6-rc1 since there's a harder
to reproduce bug now that I have to fix (I haven't seen it, but a few
people that tested our repo have reported it).

--
Debian - http://www.debian.org/
Linux 1394 - http://www.linux1394.org/
Subversion - http://subversion.tigris.org/
WatchGuard - http://www.watchguard.com/

2004-04-05 05:03:20

by Dmitry Torokhov

[permalink] [raw]
Subject: Re: [PANIC] ohci1394 & copy large files

On Sunday 04 April 2004 09:13 am, Ben Collins wrote:
> On Sun, Apr 04, 2004 at 04:16:00PM +0200, Marcel Lanz wrote:
> > Since 2.6.4 and still in 2.6.5 I get regurarly a Kernel panic if I try
> > to backup large files (10-35GB) to an external attached disc (200GB/JFS) via ieee1394/sbp2.
> >
> > Has anyone similar problems ?
>
> Known issue, fixed in our repo. I still need to sync with Linus once I
> iron one more issue and merge some more patches.
>

I have some concerns that it is completely fixed in your tree - there is still
a race - if hpsb_packet_received arrives before hosb_packet_sent then there is
a chance that the code will try to put the same packet in the completion queue
twice. With SVN tree it will cause kernel BUG in skb code, in BK tree kernel
will just oops.

I wonder what was the reason to convert the code to abuse skbs aside for using
skbs queues and their locking?

Anyway, below is a backport of my patch from SVN to BK tree, I would like to
know if it works for others...

--
Dmitry


===================================================================


[email protected], 2004-04-04 23:52:29-05:00, [email protected]
IEEE1394: Make sure that a packet submitted into completion
queue just once; race between hpsb_packet_sent and
hpsb_packet_received


ieee1394_core.c | 26 ++++++++++++++++++++++----
ieee1394_core.h | 1 +
2 files changed, 23 insertions(+), 4 deletions(-)


===================================================================



diff -Nru a/drivers/ieee1394/ieee1394_core.c b/drivers/ieee1394/ieee1394_core.c
--- a/drivers/ieee1394/ieee1394_core.c Sun Apr 4 23:53:45 2004
+++ b/drivers/ieee1394/ieee1394_core.c Sun Apr 4 23:53:45 2004
@@ -96,7 +96,7 @@
WARN_ON(packet->complete_routine != NULL);
packet->complete_routine = routine;
packet->complete_data = data;
- return;
+ atomic_set(&packet->completion_armed, 1);
}

/**
@@ -412,8 +412,26 @@
}

if (ackcode != ACK_PENDING || !packet->expect_response) {
+ struct hpsb_packet *p;
+ struct list_head *lh;
+ unsigned long flags;
+
atomic_dec(&packet->refcnt);
- list_del(&packet->list);
+
+ /*
+ * Remove packet from host's pending queue
+ * (and not from any other queue)
+ */
+ spin_lock_irqsave(&host->pending_pkt_lock, flags);
+ list_for_each(lh, &host->pending_packets) {
+ p = list_entry(lh, struct hpsb_packet, list);
+ if (p == packet) {
+ list_del(&packet->list);
+ break;
+ }
+ }
+ spin_unlock_irqrestore(&host->pending_pkt_lock, flags);
+
packet->state = hpsb_complete;
queue_packet_complete(packet);
return;
@@ -1003,7 +1021,8 @@

static void queue_packet_complete(struct hpsb_packet *packet)
{
- if (packet->complete_routine != NULL) {
+ if (packet->complete_routine != NULL &&
+ atomic_dec_and_test(&packet->completion_armed)) {
unsigned long flags;

spin_lock_irqsave(&khpsbpkt_lock, flags);
@@ -1013,7 +1032,6 @@
/* Signal the kernel thread to handle this */
up(&khpsbpkt_sig);
}
- return;
}

static int hpsbpkt_thread(void *__hi)
diff -Nru a/drivers/ieee1394/ieee1394_core.h b/drivers/ieee1394/ieee1394_core.h
--- a/drivers/ieee1394/ieee1394_core.h Sun Apr 4 23:53:45 2004
+++ b/drivers/ieee1394/ieee1394_core.h Sun Apr 4 23:53:45 2004
@@ -66,6 +66,7 @@
* packet is completed. */
void (*complete_routine)(void *);
void *complete_data;
+ atomic_t completion_armed;

/* Store jiffies for implementing bus timeouts. */
unsigned long sendtime;

2004-04-05 13:57:19

by Ben Collins

[permalink] [raw]
Subject: Re: [PANIC] ohci1394 & copy large files

On Mon, Apr 05, 2004 at 12:03:10AM -0500, Dmitry Torokhov wrote:
> On Sunday 04 April 2004 09:13 am, Ben Collins wrote:
> > On Sun, Apr 04, 2004 at 04:16:00PM +0200, Marcel Lanz wrote:
> > > Since 2.6.4 and still in 2.6.5 I get regurarly a Kernel panic if I try
> > > to backup large files (10-35GB) to an external attached disc (200GB/JFS) via ieee1394/sbp2.
> > >
> > > Has anyone similar problems ?
> >
> > Known issue, fixed in our repo. I still need to sync with Linus once I
> > iron one more issue and merge some more patches.
> >
>
> I have some concerns that it is completely fixed in your tree - there is still
> a race - if hpsb_packet_received arrives before hosb_packet_sent then there is
> a chance that the code will try to put the same packet in the completion queue
> twice. With SVN tree it will cause kernel BUG in skb code, in BK tree kernel
> will just oops.
>
> I wonder what was the reason to convert the code to abuse skbs aside for using
> skbs queues and their locking?
>
> Anyway, below is a backport of my patch from SVN to BK tree, I would like to
> know if it works for others...

That's the other problem I was talking about. I'm reviewing your patch
today.

--
Debian - http://www.debian.org/
Linux 1394 - http://www.linux1394.org/
Subversion - http://subversion.tigris.org/
WatchGuard - http://www.watchguard.com/