LinuxLists.cc - Re: [Linux-ATM-General] Kernel 2.6.10 and 2.4.29 Oops fore200e (fwd)

2005-01-24 18:37:33

Subject: Re: [Linux-ATM-General] Kernel 2.6.10 and 2.4.29 Oops fore200e (fwd)

In message <[email protected]>, Lukasz Trabinski writes:
>Sorry, but I don't understand which line; I am not a kernel guru. :/

look for the following code:

/* retry once again? */
if(--retry > 0) {
schedule();
goto retry_here;
}

change schedule() to udelay(50) and see if things are 'better'.

>It happened on 2.4.29, too. Is it an interrupt problem?

it's calling a routine that might sleep while in the transmit routine.
this is not allowed.
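
For readers who want to see the shape of the suggested change, here is a minimal userspace sketch of a bounded retry loop that busy-waits instead of yielding. It is not the fore200e source: fake_txq, txq_entry_free(), fake_udelay(), fake_send() and MAX_RETRY are hypothetical stand-ins, and fake_udelay() only imitates the kernel's udelay() by spinning on clock().

```c
/* Userspace sketch only. None of these names come from the fore200e driver. */
#include <stdbool.h>
#include <time.h>

#define MAX_RETRY 4          /* illustrative; not the driver's real retry count */

struct fake_txq {
    int busy_polls;          /* how many more polls will report "no free entry" */
};

/* Stand-in for the kernel's udelay(): spin for roughly usec microseconds of
 * CPU time without ever yielding, which is why it is usable where sleeping
 * is forbidden. */
static void fake_udelay(unsigned int usec)
{
    clock_t ticks = (clock_t)((double)usec * CLOCKS_PER_SEC / 1e6);
    clock_t start = clock();

    while (clock() - start < ticks)
        ;                    /* busy-wait */
}

/* Pretend to poll the hardware for a free transmit entry. */
static bool txq_entry_free(struct fake_txq *q)
{
    if (q->busy_polls > 0) {
        q->busy_polls--;
        return false;
    }
    return true;
}

/* Returns 0 if the PDU could be queued, -1 if we gave up after MAX_RETRY polls. */
static int fake_send(struct fake_txq *q)
{
    int retry = MAX_RETRY;

retry_here:
    if (!txq_entry_free(q)) {
        /* retry once again? */
        if (--retry > 0) {
            fake_udelay(50); /* was schedule(), which may sleep */
            goto retry_here;
        }
        return -1;
    }
    return 0;
}
```

The loop still polls at most MAX_RETRY times, so the worst case adds a few short delays per packet rather than an unbounded wait.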

2005-01-24 22:32:00

by Mike Westall

You could also just revert to kernel 2.4.25 or
earlier. Someone who was apparently oblivious
to the fact that device driver send routines
were "routinely" called in irq context and/or
that it was a <very bad thing> to call schedule()
under such circumstances slipped that one in
sometime between 2.4.25 which is OK and 2.4.28
where it is broken.

In 2.4.25 and earlier it was a simple busy wait loop
in which "goto retry_here;" immediately followed
the "if" statement. This was safe, albeit MP unfriendly
because of the spin_lock()/unlock() on each iteration.

I'd say just delete the if and drop the damn
packet.
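
A rough sketch of what "delete the if and drop the packet" could look like, again as a userspace model rather than driver code. The struct drop_txq, its counters and send_or_drop() are made-up names for illustration only; the point is simply that the send path never waits at all.

```c
/* Userspace sketch only. None of these names come from the fore200e driver. */
#include <stdbool.h>

struct drop_txq {
    bool entry_free;             /* is a transmit entry available right now? */
    unsigned long tx_queued;     /* PDUs accepted */
    unsigned long tx_dropped;    /* PDUs discarded because the queue was full */
};

/* No retry and no delay: if there is no free entry, count the drop and return.
 * Returns 0 if the PDU was queued, -1 if it was dropped. */
static int send_or_drop(struct drop_txq *q)
{
    if (!q->entry_free) {
        q->tx_dropped++;
        return -1;
    }
    q->tx_queued++;
    return 0;
}
```

The trade-off against the udelay() variant is that a briefly full queue costs a packet instead of a short spin, and recovery is left to the upper layers.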

At any rate someone who has access to the golden code
should fix this one way or another ASAP because it's
definitely seriously broken the way it is now.

Mike

2005-01-24 22:42:09

by chas williams - CONTRACTOR

the author sent me the latest version of the driver and i
got it applied. the driver does have some useful changes
along with this broken change. i suggest udelay() since
it preserves the author's original intent.

i intend to submit a patch this week. i probably won't
fix the ambassador since i can't test the change.

2005-01-24 23:24:46

by Lukasz Trabinski

On Mon, 24 Jan 2005, chas williams - CONTRACTOR wrote:

> the author sent me the latest version of the driver and i
> got it applied. the driver does have some useful changes
> along with this broken change. i suggest udelay() since
> it preserves the author's original intent.

OK, I have just put the udelay() call into the driver. If the router does not
crash within 5-6 days, it means the driver works fine. I will report back.
In general, the problems (frequent crashes) started when we added more ATM
interfaces/VCs to the router and it began forwarding more traffic and
handling two additional full BGP tables.

--
ŁT

2005-01-30 19:24:18

by Lukasz Trabinski

On Tue, 25 Jan 2005, Lukasz Trabinski wrote:

> OK, I have just put the udelay() call into the driver. If the router does not
> crash within 5-6 days, it means the driver works fine. I will report back.
> In general, the problems (frequent crashes) started when we added more ATM
> interfaces/VCs to the router and it began forwarding more traffic and
> handling two additional full BGP tables.

OK, I think the driver works much better with the udelay() call.

[root@cosmos root]# uptime
20:20:48 up 6 days, 23:25, 1 user, load average: 0.03, 0.03, 0.00

--
*[ Łukasz Trąbiński ]*
SysAdmin @wsisiz.edu.pl

2005-01-30 22:56:04

by chas williams - CONTRACTOR

In message <[email protected]>,Lukasz Trabinski writes:
>OK, I think the driver works much better with the udelay() call.

good to hear. what does atmdiag say about that interface? does it have
a large percentage of tx drops?

2005-01-31 08:48:22

by Lukasz Trabinski

On Sun, 30 Jan 2005, chas williams - CONTRACTOR wrote:

> In message <[email protected]>,Lukasz Trabinski writes:
>> OK, I think the driver works much better with the udelay() call.
>
> good to hear. what does atmdiag say about that interface? does it have
> a large percentage of tx drops?

After 12 hours:

[root@cosmos root]# atmdiag
Itf        TX_okay  TX_err   RX_okay  RX_err  RX_drop
  0 AAL0         0       0         0       0        0
    AAL5  31375820       0  31479406       0        0
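
To put a number on the "percentage of tx drops" question, a small helper like the one below can turn the counters into a ratio. It is only a sketch: error_pct() is a hypothetical name, and it assumes the TX_err column is the counter being asked about.

```c
/* Sketch: share of failed transmissions, given atmdiag-style counters.
 * Assumes "TX_err" is the relevant drop counter, which is a guess. */
static double error_pct(unsigned long long err, unsigned long long okay)
{
    unsigned long long total = okay + err;

    /* Avoid dividing by zero on an idle interface. */
    return total ? 100.0 * (double)err / (double)total : 0.0;
}
```

With the AAL5 counters above (TX_okay 31375820, TX_err 0) the helper returns 0, i.e. no transmit errors were recorded in those 12 hours.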

--
*[ Łukasz Trąbiński ]*
SysAdmin @wsisiz.edu.pl

2005-03-05 12:35:17

by Lukasz Trabinski

On Sun, 30 Jan 2005, chas williams - CONTRACTOR wrote:

Hello again

> good to hear. what does atmdiag say about that interface? does it have
> a large percentage of tx drops?

After one month of running without an oops, we have hit an oops again. It
happens when one or more VCs go down (for example, on the ATM switch).
We have two ATM interfaces (fore_200e, nicstar) on our router:

[root@cosmos root]# lspci |grep ATM
01:01.0 ATM network controller: FORE Systems Inc ForeRunner PCA-200EPC ATM
01:05.0 ATM network controller: Integrated Device Tech IDT77211 ATM Adapter (rev 03)

I have changed schedule() to udelay(50) in both fore_200e and nicstar.
I have also swapped the nicstar ATM card for a second one.
In the log file we can see many messages like this one:

nicstar0: AAL5 CRC error - PDU size mismatch.

ksymoops 2.4.11 on i686 2.4.29. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.29/ (default)
-m /lib/modules/2.4.29/System.map (specified)

CPU: 0
EIP: 0010:[<c01b68f9>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000002
eax: c031ea00 ebx: 00000005 ecx: 00000001 edx: 000003fd
esi: c031eac0 edi: c0305ee3 ebp: 00000005 esp: c02b3e18
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c02b3000)
Stack: 000f4016 c01bbe61 c031eac0 00000005 00000044 0000000d 00000016 c02a4c60
c0305ede 00011c3e 00011c54 c0118452 c02a4c60 c0305ede 00000016 00011c54
00011c54 00000016 f793d480 c011855f 00011c3e 00011c54 00000004 c029a1bc
Call Trace: [<c01bbe61>] [<c0118452>] [<c011855f>] [<c0118893>] [<c01187bf>]
[<f8a1f165>] [<f8a1cc15>] [<f8a1f14f>] [<f8a1c96c>] [<f8a1b7ad>] [<c0109029>]
[<c0109248>] [<c0105330>] [<c010b938>] [<c0105330>] [<c0105359>] [<c01053f2>]
[<c0105000>]
Code: 5b 0f b6 c0 c3 89 f6 0f b7 48 74 8b 40 70 d3 e3 0f b6 04 03

>>EIP; c01b68f9 <serial_in+19/30> <=====

>>eax; c031ea00 <serial_termios_locked+60/100>
>>esi; c031eac0 <async_sercons+0/c0>
>>edi; c0305ee3 <log_buf+1c43/8000>
>>esp; c02b3e18 <init_task_union+1e18/2000>

Trace; c01bbe61 <serial_console_write+81/220>
Trace; c0118452 <__call_console_drivers+62/70>
Trace; c011855f <call_console_drivers+7f/120>
Trace; c0118893 <release_console_sem+53/b0>
Trace; c01187bf <printk+14f/180>
Trace; f8a1f165 <[nicstar]__module_license+4f/130a>
Trace; f8a1cc15 <[nicstar]dequeue_rx+265/1040>
Trace; f8a1f14f <[nicstar]__module_license+39/130a>
Trace; f8a1c96c <[nicstar]process_rsq+2c/70>
Trace; f8a1b7ad <[nicstar]ns_irq_handler+3ad/470>
Trace; c0109029 <handle_IRQ_event+79/b0>
Trace; c0109248 <do_IRQ+98/f0>
Trace; c0105330 <default_idle+0/50>
Trace; c010b938 <call_do_IRQ+5/d>
Trace; c0105330 <default_idle+0/50>
Trace; c0105359 <default_idle+29/50>
Trace; c01053f2 <cpu_idle+52/70>
Trace; c0105000 <_stext+0/0>

Code; c01b68f9 <serial_in+19/30>
00000000 <_EIP>:
Code; c01b68f9 <serial_in+19/30> <=====
0: 5b pop %ebx <=====
Code; c01b68fa <serial_in+1a/30>
1: 0f b6 c0 movzbl %al,%eax
Code; c01b68fd <serial_in+1d/30>
4: c3 ret
Code; c01b68fe <serial_in+1e/30>
5: 89 f6 mov %esi,%esi
Code; c01b6900 <serial_in+20/30>
7: 0f b7 48 74 movzwl 0x74(%eax),%ecx
Code; c01b6904 <serial_in+24/30>
b: 8b 40 70 mov 0x70(%eax),%eax
Code; c01b6907 <serial_in+27/30>
e: d3 e3 shl %cl,%ebx
Code; c01b6909 <serial_in+29/30>
10: 0f b6 04 03 movzbl (%ebx,%eax,1),%eax

Where is the problem: is the patch cord bad, or is the problem on the ATM switch?

--
*[ Łukasz Trąbiński ]*
SysAdmin @wsisiz.edu.pl