2007-08-15 21:47:49

by Gregor Jasny

Subject: Corrupted filesystem with new Firewire stack

Hi,

today I got the "status write for unknown orb" message during the early
boot sequence. This somehow corrupted my root filesystem, which is
located entirely on the external disk.

Aug 15 23:06:02 Mini kernel: firewire_sbp2: logged in to sbp2 unit fw1.0 (0 retries)
Aug 15 23:06:02 Mini kernel: firewire_sbp2: - management_agent_address: 0xfffff0030000
Aug 15 23:06:02 Mini kernel: firewire_sbp2: - command_block_agent_address: 0xfffff0100000
Aug 15 23:06:02 Mini kernel: firewire_sbp2: - status write address: 0x000100000000
...
Aug 15 23:06:02 Mini kernel: Freeing unused kernel memory: 252k freed
Aug 15 23:06:02 Mini kernel: firewire_sbp2: status write for unknown orb
Aug 15 23:06:02 Mini kernel: firewire_sbp2: sbp2_scsi_abort

After this error and the subsequent timeout, the kernel complained about
missing symbols in the intel agp module, and udev behaved strangely.

After the next reboot I found the following lines in the kernel logfile:

Aug 15 23:09:52 Mini kernel: firewire_core: created new fw device fw1 (0 config rom retries)
Aug 15 23:09:52 Mini kernel: firewire_core: phy config: card 0, new root=ffc1, gap_count=5
Aug 15 23:09:52 Mini kernel: firewire_ohci: context_stop: still active (0x00000411)
Aug 15 23:09:52 Mini kernel: firewire_sbp2: management write failed, rcode 0x11

Sometimes rcode is 0x12:
Aug 13 22:24:13 Mini kernel: firewire_core: phy config: card 0, new root=ffc0, gap_count=5
Aug 13 22:24:13 Mini kernel: firewire_sbp2: management write failed, rcode 0x12
Aug 13 22:24:13 Mini kernel: firewire_sbp2: reconnected to unit fw1.0 (1 retries)

The system is an early Mac Mini with a C2D CPU. The external disk is a
Formax Oxygen 250 with an Oxford OXFW911 chipset. The kernel was a
vanilla 2.6.22.2 with the mactel patches applied.

Is there anything I can do to help debug this problem?

Thanks,
Gregor


2007-08-16 06:27:05

by Stefan Richter

Subject: Re: Corrupted filesystem with new Firewire stack

(full quote for linux1394-devel)

Gregor Jasny wrote at lkml:
> today I got the "status write for unknown orb" message during the early
> boot sequence. This somehow corrupted my root filesystem, which is
> located entirely on the external disk.
>
> Aug 15 23:06:02 Mini kernel: firewire_sbp2: logged in to sbp2 unit fw1.0 (0 retries)
> Aug 15 23:06:02 Mini kernel: firewire_sbp2: - management_agent_address: 0xfffff0030000
> Aug 15 23:06:02 Mini kernel: firewire_sbp2: - command_block_agent_address: 0xfffff0100000
> Aug 15 23:06:02 Mini kernel: firewire_sbp2: - status write address: 0x000100000000
> ...
> Aug 15 23:06:02 Mini kernel: Freeing unused kernel memory: 252k freed
> Aug 15 23:06:02 Mini kernel: firewire_sbp2: status write for unknown orb
> Aug 15 23:06:02 Mini kernel: firewire_sbp2: sbp2_scsi_abort
>
> After this error and the subsequent timeout, the kernel complained about
> missing symbols in the intel agp module, and udev behaved strangely.

There were some similar reports involving that "status write for unknown
orb". I haven't found a way to reproduce it; I noticed it only once in
the logs here so far.

> After the next reboot I found the following lines in the kernel logfile:
>
> Aug 15 23:09:52 Mini kernel: firewire_core: created new fw device fw1 (0 config rom retries)
> Aug 15 23:09:52 Mini kernel: firewire_core: phy config: card 0, new root=ffc1, gap_count=5
> Aug 15 23:09:52 Mini kernel: firewire_ohci: context_stop: still active (0x00000411)
> Aug 15 23:09:52 Mini kernel: firewire_sbp2: management write failed, rcode 0x11
>
> Sometimes rcode is 0x12:
> Aug 13 22:24:13 Mini kernel: firewire_core: phy config: card 0, new root=ffc0, gap_count=5
> Aug 13 22:24:13 Mini kernel: firewire_sbp2: management write failed, rcode 0x12
> Aug 13 22:24:13 Mini kernel: firewire_sbp2: reconnected to unit fw1.0 (1 retries)

As long as it ends in "reconnected to...", failure messages like this
can typically be ignored. It cannot be predicted which failures are
transient and which aren't, hence all are logged.
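
The rule of thumb above can be checked mechanically. The following is a
minimal Python sketch, not part of the firewire drivers: it scans syslog
lines and marks each "management write failed" message as recovered if a
"reconnected to" message follows it. The function name classify_failures
and the matching patterns are assumptions, derived only from the log
excerpts quoted in this thread.

```python
import re

# Patterns inferred from the log excerpts in this thread (an assumption,
# not a stable interface of the firewire_sbp2 driver).
FAIL_RE = re.compile(
    r"firewire_sbp2: management write failed, rcode (0x[0-9a-fA-F]+)"
)
RECONNECT_RE = re.compile(r"firewire_sbp2: reconnected to")


def classify_failures(lines):
    """Return a list of (rcode, recovered) pairs, one per failure message.

    A failure counts as recovered when a "reconnected to" message
    appears somewhere after it in the given lines.
    """
    results = []  # [rcode, recovered] entries, in log order
    pending = []  # entries still waiting for a reconnect message
    for line in lines:
        failure = FAIL_RE.search(line)
        if failure:
            entry = [failure.group(1), False]
            results.append(entry)
            pending.append(entry)
        elif RECONNECT_RE.search(line):
            # A reconnect resolves every failure logged before it.
            for entry in pending:
                entry[1] = True
            pending.clear()
    return [tuple(entry) for entry in results]


if __name__ == "__main__":
    sample = [
        "Aug 13 22:24:13 Mini kernel: firewire_sbp2: management write failed, rcode 0x12",
        "Aug 13 22:24:13 Mini kernel: firewire_sbp2: reconnected to unit fw1.0 (1 retries)",
    ]
    print(classify_failures(sample))
```

A failure reported as not recovered only means that no reconnect message
was present in the lines given; a truncated excerpt can therefore look
worse than the actual situation.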

> The system is an early Mac Mini with a C2D CPU. The external disk is a
> Formax Oxygen 250 with an Oxford OXFW911 chipset. The kernel was a
> vanilla 2.6.22.2 with the mactel patches applied.

Among other hardware, I too have an Intel Mac mini (running x86-64 Linux,
though) and an OXFW911 enclosure with an NTFS-formatted disk in it; I'll
swap the disk and run stress tests with a native filesystem.

> Is there anything I can do to help debug this problem?

Alas, this kind of bug is hard to debug remotely, and the root FS isn't
exactly ideal for such tests... I'll try to remember to Cc you on
potentially related patches though.
--
Stefan Richter
-=====-=-=== =--- =----
http://arcgraph.de/sr/

2007-08-17 12:36:44

by Martin K. Petersen

Subject: Re: Corrupted filesystem with new Firewire stack

>>>>> "Stefan" == Stefan Richter <[email protected]> writes:

Stefan> There were some similar reports involving that "status write
Stefan> for unknown orb". I haven't found a way to reproduce it; I
Stefan> noticed it only once in the logs here so far.

I get those all the time. Just do heavy ext3 I/O to the drive.

Happens here on both a G4 and an Intel Mini. Both running FC7.

Aug 17 08:24:08 mini kernel: firewire_sbp2: status write for unknown orb
Aug 17 08:25:08 mini kernel: firewire_sbp2: sbp2_scsi_abort
Aug 17 08:26:36 mini kernel: firewire_sbp2: status write for unknown orb
Aug 17 08:27:36 mini kernel: firewire_sbp2: sbp2_scsi_abort
Aug 17 08:33:51 mini kernel: firewire_sbp2: status write for unknown orb
Aug 17 08:34:51 mini kernel: firewire_sbp2: sbp2_scsi_abort

Lacie drive in both cases.

--
Martin K. Petersen http://mkp.net/

2007-08-18 13:28:18

by Stefan Richter

Subject: Re: Corrupted filesystem with new Firewire stack

Martin K. Petersen wrote:
>>>>>> "Stefan" == Stefan Richter <[email protected]> writes:
>
> Stefan> There were some similar reports involving that "status write
> Stefan> for unknown orb". I haven't found a way to reproduce it; I
> Stefan> noticed it only once in the logs here so far.
>
> I get those all the time. Just do heavy ext3 I/O to the drive.
>
> Happens here on both a G4 and an intel Mini. Both running FC7.
>
> Aug 17 08:24:08 mini kernel: firewire_sbp2: status write for unknown orb
> Aug 17 08:25:08 mini kernel: firewire_sbp2: sbp2_scsi_abort
> Aug 17 08:26:36 mini kernel: firewire_sbp2: status write for unknown orb
> Aug 17 08:27:36 mini kernel: firewire_sbp2: sbp2_scsi_abort
> Aug 17 08:33:51 mini kernel: firewire_sbp2: status write for unknown orb
> Aug 17 08:34:51 mini kernel: firewire_sbp2: sbp2_scsi_abort
>
> Lacie drive in both cases.

I replaced the HDD in my OXFW911 enclosure and am starting tests now.

While backing up ~80 GB from its current reiserfs partition with
find | cpio, in order to reformat to ext3, I got that error 4 times:

Aug 18 13:24:58 stein ReiserFS: sdd1: Using r5 hash to sort names
Aug 18 13:47:41 stein firewire_sbp2: status write for unknown orb
Aug 18 13:48:11 stein firewire_sbp2: sbp2_scsi_abort
Aug 18 14:18:30 stein firewire_sbp2: status write for unknown orb
Aug 18 14:19:00 stein firewire_sbp2: sbp2_scsi_abort
Aug 18 14:50:08 stein firewire_sbp2: status write for unknown orb
Aug 18 14:50:39 stein firewire_sbp2: sbp2_scsi_abort
Aug 18 14:57:24 stein firewire_sbp2: status write for unknown orb
Aug 18 14:57:54 stein firewire_sbp2: sbp2_scsi_abort

cpio finished with exit code 0. After remounting the disk, I compared
one of the directories within which a command had been aborted. It seems
that the error was properly recovered from.

I will test whether more of these errors occur during write access or
with ext3, but at least I now have a way to get one error per ~20 minutes
of continuous I/O, on average. Then I will proceed to examine the new
stack's sources for potentially related bugs. That will take a while
though, because I have other projects going on at the moment.
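
To quantify how often the error occurs and how long the abort takes to
follow it, the log lines can be paired up with a short script. This is a
hedged Python sketch: abort_delays and error_gaps are hypothetical helper
names, the line pattern is inferred from the excerpts in this thread, and
the year 2007 is an assumption needed because syslog timestamps omit it.

```python
import re
from datetime import datetime

# Matches "Mon DD HH:MM:SS host [kernel: ]firewire_sbp2: message".
# The optional "kernel:" covers both log styles seen in this thread.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+(?:kernel:\s+)?"
    r"firewire_sbp2:\s+(.*)$"
)
ERROR_MSG = "status write for unknown orb"
ABORT_MSG = "sbp2_scsi_abort"


def parse(line, year=2007):
    """Return (timestamp, message) for a firewire_sbp2 line, else None."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, match.group(2)


def abort_delays(lines, year=2007):
    """Seconds between each error message and the abort that follows it."""
    delays = []
    pending = None  # timestamp of the last error without an abort yet
    for line in lines:
        parsed = parse(line, year)
        if parsed is None:
            continue
        stamp, message = parsed
        if message.startswith(ERROR_MSG):
            pending = stamp
        elif message.startswith(ABORT_MSG) and pending is not None:
            delays.append(int((stamp - pending).total_seconds()))
            pending = None
    return delays


def error_gaps(lines, year=2007):
    """Seconds between consecutive error messages."""
    stamps = []
    for line in lines:
        parsed = parse(line, year)
        if parsed is not None and parsed[1].startswith(ERROR_MSG):
            stamps.append(parsed[0])
    return [
        int((later - earlier).total_seconds())
        for earlier, later in zip(stamps, stamps[1:])
    ]
```

Run against the excerpt above, abort_delays gives 30, 30, 31 and 30
seconds, and error_gaps gives 1849, 1898 and 436 seconds. The same
function applied to Martin's excerpt gives 60 seconds for each abort.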
--
Stefan Richter
-=====-=-=== =--- =--=-
http://arcgraph.de/sr/

2007-08-25 16:44:19

by Stefan Richter

Subject: Re: Corrupted filesystem with new Firewire stack

> Martin K. Petersen wrote:
>> Happens here on both a G4 and an intel Mini. Both running FC7.
>>
>> Aug 17 08:24:08 mini kernel: firewire_sbp2: status write for unknown orb
>> Aug 17 08:25:08 mini kernel: firewire_sbp2: sbp2_scsi_abort
>> Aug 17 08:26:36 mini kernel: firewire_sbp2: status write for unknown orb
>> Aug 17 08:27:36 mini kernel: firewire_sbp2: sbp2_scsi_abort
>> Aug 17 08:33:51 mini kernel: firewire_sbp2: status write for unknown orb
>> Aug 17 08:34:51 mini kernel: firewire_sbp2: sbp2_scsi_abort

The infrequent occurrences of this on my test setup vanished after I
applied Kristian's patch. The patch is now in linux1394-2.6.git and in
http://me.in-berlin.de/~s5r6/linux1394/updates/.
--
Stefan Richter
-=====-=-=== =--- ==--=
http://arcgraph.de/sr/