2003-03-13 22:48:22

by Joy Latten

Subject: eth0: Bus master arbitration failure

I was wondering if anyone knew whether this had been resolved,
or if anyone else sees this problem too. I am having the same problem.
However, I am using 2.5.64 kernel and I have tried
both an eepro100 and a 3com-tornado ethernet card.

I use netperf to create a load of packets, and within minutes
I receive a few "eth0: Bus master arbitration failure, status ffff"
and then the machine locks up. (This seems to only happen
with a heavy load.)

I have no problems when I use the 2.4.18-3 SMP kernel (Red Hat 8.0),
and I also used a 2.4.19 kernel a while back with no problems.

I am using a 2-way IBM Netfinity 4500 and a 4-way xSeries 350.
There are no problems on a uniprocessor machine.

Thanks for any info.

Regards,
Joy

>Hello Andrea,
>About 4 hours of heavy load on 2 of my boxes led to a hard lockup.
>Before the lockup there are a lot of messages like:
>"eth0: Bus master arbitration failure, status ffff"
>There are no such problems on 2.4.18rc2aa1 and 2.4.19rc1aa2.
>Both Systems are IBM Netfinity 5100.
>
>[rathamahata@bo linux]$ /sbin/lspci
>00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 05)
>00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 05)
>00:01.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
>00:02.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet LANCE] (rev 44)
>00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 4f)
>00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
>00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
>01:03.0 SCSI storage controller: Adaptec 7899P (rev 01)
>01:03.1 SCSI storage controller: Adaptec 7899P (rev 01)
>01:05.0 RAID bus controller: IBM Netfinity ServeRAID controller
>
>[rathamahata@bo linux]$ cat /proc/cpuinfo
>processor : 0
>vendor_id : GenuineIntel
>cpu family : 6
>model : 8
>model name : Pentium III (Coppermine)
>stepping : 6
>cpu MHz : 996.758
>cache size : 256 KB
>fdiv_bug : no
>hlt_bug : no
>f00f_bug : no
>coma_bug : no
>fpu : yes
>fpu_exception : yes
>cpuid level : 2
>wp : yes
>flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
>bogomips : 1985.74
>
>processor : 1
>vendor_id : GenuineIntel
>cpu family : 6
>model : 8
>model name : Pentium III (Coppermine)
>stepping : 6
>cpu MHz : 996.758
>cache size : 256 KB
>fdiv_bug : no
>hlt_bug : no
>f00f_bug : no
>coma_bug : no
>fpu : yes
>fpu_exception : yes
>cpuid level : 2
>wp : yes
>flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
>bogomips : 1992.29
>
>--
> Best regards,
> Sergey S. Kostyliov <[email protected]>
> Public PGP key: http://sysadminday.org.ru/rathamahata.asc


2003-03-14 13:03:55

by Richard B. Johnson

Subject: Re: eth0: Bus master arbitration failure

On Thu, 13 Mar 2003 [email protected] wrote:

> I was wondering if anyone knew whether this had been resolved,
> or if anyone else sees this problem too. I am having the same problem.
> However, I am using 2.5.64 kernel and I have tried
> both an eepro100 and a 3com-tornado ethernet card.
>
> I use netperf to create a load of packets, and within minutes
> I receive a few "eth0: Bus master arbitration failure, status ffff"
> and then the machine locks up. (This seems to only happen
> with a heavy load.)
>
> I have no problems when I use 2.4.18-3 smp kernel (Redhat 8.0)
> and I have also used a 2.4.19 kernel a while back with no problems.
>
> I am using a 2-way IBM Netfinity 4500 and a 4-way xSeries 350.
> No problems on a uniprocessor.
>
> Thanks for any info.
>
> Regards,
> Joy
>
> >Hello Andrea,
> >About 4 hours of heavy load on 2 of my boxes led to a hard lockup.
> >Before the lockup there are a lot of messages like:
> >"eth0: Bus master arbitration failure, status ffff"
> >There are no such problems on 2.4.18rc2aa1 and 2.4.19rc1aa2.
> >Both Systems are IBM Netfinity 5100.
> >

I think the problem is probably all those printk() calls
within timing-sensitive code (really). A bus master arbitration
failure is supposed to result in a retry; it is not supposed to
be fatal. For kicks, just comment out the printk() and see if
the box starts to work. If that makes it work, an appropriate
permanent fix would be to just keep track of the number of
such failures, just like the dropped-packet and collision counts.
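A minimal sketch of that suggestion, assuming a hypothetical stats structure (`hypo_dev_stats`) and handler (`hypo_handle_csr0`) rather than the real pcnet32 code: the error bit is counted instead of logged, and the handler simply carries on so the hardware can retry. Only the 0x0800 mask comes from the pcnet32 check quoted later in this thread; everything else is illustrative.

```c
#include <stdint.h>

/* Hypothetical per-device error counters, modelled loosely on the
 * dropped-packet and collision counts a driver already keeps. */
struct hypo_dev_stats {
    unsigned long rx_dropped;
    unsigned long collisions;
    unsigned long bus_arb_failures;   /* new counter (hypothetical name) */
};

/* Mask tested by the pcnet32 "Bus master arbitration failure" check. */
#define HYPO_CSR0_BUS_ARB_FAIL 0x0800u

/* Sketch of the interrupt-path check: count the event instead of
 * calling printk() on every occurrence, and do nothing else, so the
 * normal retry can proceed. */
static void hypo_handle_csr0(struct hypo_dev_stats *stats, uint16_t csr0)
{
    if (csr0 & HYPO_CSR0_BUS_ARB_FAIL)
        stats->bus_arb_failures++;
}
```

The count could then be reported through whatever statistics interface the driver already exposes, which keeps the hot path free of console output.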

If removing the printk() doesn't fix it, there may be a spinlock
that is never released on an error exit path.
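To illustrate that second possibility, here is a user-space sketch in which a hypothetical `hypo_spinlock` flag stands in for a real kernel spinlock (actual driver code would use spin_lock()/spin_unlock() or their irqsave variants). The buggy path returns early on an error while still holding the lock; the fixed path routes every exit through a single unlock label, a common idiom in kernel code. The function names are made up for illustration.

```c
#include <stdbool.h>

/* Stand-in for a kernel spinlock that only records whether it is held
 * (hypothetical type; it does not actually spin or exclude anyone). */
struct hypo_spinlock {
    bool held;
};

static void hypo_spin_lock(struct hypo_spinlock *l)   { l->held = true; }
static void hypo_spin_unlock(struct hypo_spinlock *l) { l->held = false; }

/* BUGGY pattern: the early return on error leaves the lock held, so with
 * a real spinlock the next caller on another CPU would spin forever. */
static int hypo_xmit_buggy(struct hypo_spinlock *l, bool error)
{
    hypo_spin_lock(l);
    if (error)
        return -1;              /* lock is never released here */
    /* ... normal work ... */
    hypo_spin_unlock(l);
    return 0;
}

/* Fixed pattern: every exit path goes through one unlock point. */
static int hypo_xmit_fixed(struct hypo_spinlock *l, bool error)
{
    int ret = 0;

    hypo_spin_lock(l);
    if (error) {
        ret = -1;
        goto out_unlock;
    }
    /* ... normal work ... */
out_unlock:
    hypo_spin_unlock(l);
    return ret;
}
```

After an error, the buggy version leaves `held` set while the fixed version always clears it.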

Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2003-03-14 17:28:28

by Joy Latten

Subject: Re: eth0: Bus master arbitration failure


>On Thu, 13 Mar 2003 [email protected] wrote:
>
>> I was wondering if anyone knew whether this had been resolved,
>> or if anyone else sees this problem too. I am having the same problem.
>> However, I am using 2.5.64 kernel and I have tried
>> both an eepro100 and a 3com-tornado ethernet card.
>>
>
>I think the problem is probably all those "printk()" calls
>within timing-sensitive code (really). A Bus master arbitration
>failure is supposed to result in a retry. It is not supposed to
>be fatal. For kicks, just comment out the printk() and see if
>the box starts to work. If that makes it work, an appropriate
>permanent fix would be to just keep track of the number of
>such failures just like the dropped-packet and collision count.
>
>If removing the printk() doesn't fix it, there may be a retained
>spin-lock on an error exit path.
>

I did go and take a look at that printk :-) and realized it was in
pcnet32.c, so it was my pcnet32 card complaining, not
my eepro100 or 3com card. Whew! Sorry about that mistake.
I am going to try installing kdb to see whether it helps
locate where the lockup is occurring.

Thanks,
Joy

2003-03-22 12:22:25

by Sergey S. Kostyliov

Subject: Re: eth0: Bus master arbitration failure

Hello all,

On Friday 14 March 2003 20:41, [email protected] wrote:
> >On Thu, 13 Mar 2003 [email protected] wrote:
> >
> >> I was wondering if anyone knew whether this had been resolved,
> >> or if anyone else sees this problem too. I am having the same problem.
> >> However, I am using 2.5.64 kernel and I have tried
> >> both an eepro100 and a 3com-tornado ethernet card.
> >>
> >
> >I think the problem is probably all those "printk()" calls
> >within timing-sensitive code (really). A Bus master arbitration
> >failure is supposed to result in a retry. It is not supposed to
> >be fatal. For kicks, just comment out the printk() and see if
> >the box starts to work. If that makes it work, an appropriate
> >permanent fix would be to just keep track of the number of
> >such failures just like the dropped-packet and collision count.
> >
> >If removing the printk() doesn't fix it, there may be a retained
> >spin-lock on an error exit path.
> >
>
> I did go and take a look at that printk :-) and realized it was in
> pcnet32.c and that it was my pcnet32 card complaining and not
> my eepro100 or 3com card. Whew! Sorry about that mistake.
> I am going to try and install kdb and see if it will help
> locate where the lockup is occurring.

I just want to clear things up.
This check was in kernels before 2.4.19 too:

    if (csr0 & 0x0800) {
        printk(KERN_ERR "%s: Bus master arbitration failure, status %4.4x.\n",
               dev->name, csr0);
        /* unlike for the lance, there is no restart needed */
    }

But I had never seen either this message or kernel lockups before
2.4.19. And even with kernels newer than 2.4.19, UP systems seem to be
unaffected (there have been no such problems on my UP Netfinity 5100 so far).
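As a quick check of what the quoted test does with the reported value: status ffff has every bit set, including bit 11 (mask 0x0800), so the condition is true and the message is printed. The sketch below reproduces the test and the message format in user space; the helper names are hypothetical, and the format string is copied from the snippet above.

```c
#include <stdio.h>
#include <stdint.h>

/* Mask tested by the pcnet32 check quoted above (bit 11). */
#define CSR0_BUS_ARB_FAIL_MASK 0x0800u

/* Hypothetical helper: nonzero when the driver's check would fire
 * for this CSR0 value. 0xffff sets every bit, so it always fires. */
static int arb_check_fires(uint16_t csr0)
{
    return (csr0 & CSR0_BUS_ARB_FAIL_MASK) != 0;
}

/* Hypothetical helper: formats the same message the driver prints into
 * buf and returns snprintf's result (length of the full message). */
static int format_arb_msg(char *buf, size_t len, const char *name,
                          uint16_t csr0)
{
    return snprintf(buf, len,
                    "%s: Bus master arbitration failure, status %4.4x.\n",
                    name, (unsigned int)csr0);
}
```

With `name` set to "eth0" and `csr0` set to 0xffff, the formatted text is "eth0: Bus master arbitration failure, status ffff." followed by a newline; "%4.4x" zero-pads to four hex digits, so 0x0800 would print as "0800".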

>
> Thanks,
> Joy

--
Best regards,
Sergey S. Kostyliov <[email protected]>
Public PGP key: http://sysadminday.org.ru/rathamahata.asc