2002-06-18 03:28:11

by Zhang Fuxin

Subject: NAPI eepro100 bug fixed

Hi all,

My first NAPI eepro100 patch contained a subtle but fatal race that leads
to a lockup (of the whole machine here, but only of the Ethernet interface
for Paul). This version should be OK. Paul, would you like to give it a try?
I've tested it on my PCs and it seems very stable now; even 50 kpps of
traffic doesn't cause any problems here.

The bug is explained in the comment below. I think other NAPI driver
writers will probably run into it, so I'm listing it here.
/* Disabling interrupts here is necessary!
 * We need to ensure Rx/RxNoBuf ints are disabled while the in-poll
 * flag is set. If an interrupt comes between netif_rx_complete
 * and enable_rx_and_rxnobuf_ints, the following will happen:
 *   netif_rx_complete --> clear RX_SCHED flag
 *     -> ints (e.g. TxDone)
 *        speedo_interrupt
 *          if (netif_rx_schedule_prep(dev))
 *            disable_rx_and_rxnobuf_ints
 *        return
 *     <-
 *   enable_rx_and_rxnobuf_ints
 * Then we will have Rx & RxNoBuf ints enabled while polling!
 * This may lead to endless interrupts and an effective lockup of
 * the whole machine.
 */
spin_lock_irqsave(&sp->lock, flags);
netif_rx_complete(dev);
enable_rx_and_rxnobuf_ints(dev);
spin_unlock_irqrestore(&sp->lock, flags);
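
To make the interleaving easier to follow, here is a minimal userspace sketch of it. It is not driver code: `struct nic_model` and the `model_*` functions are hypothetical names invented for this illustration, and the two booleans only stand in for the RX_SCHED flag and the Rx/RxNoBuf interrupt mask.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the driver state; not kernel code. */
struct nic_model {
    bool rx_sched;        /* stands in for the RX_SCHED ("in poll") flag */
    bool rx_ints_enabled; /* stands in for the Rx/RxNoBuf interrupt mask */
};

/* Models the relevant part of speedo_interrupt(): if polling can be
 * scheduled, mark the device as in poll and mask the Rx interrupts. */
void model_interrupt(struct nic_model *n)
{
    if (!n->rx_sched) {            /* netif_rx_schedule_prep() succeeds */
        n->rx_sched = true;
        n->rx_ints_enabled = false; /* disable_rx_and_rxnobuf_ints() */
    }
}

/* Models the end of the poll routine. When irq_in_window is true, an
 * interrupt is delivered between the two steps (the unprotected case). */
void model_poll_done(struct nic_model *n, bool irq_in_window)
{
    n->rx_sched = false;           /* netif_rx_complete() */
    if (irq_in_window)
        model_interrupt(n);
    n->rx_ints_enabled = true;     /* enable_rx_and_rxnobuf_ints() */
}

/* The invariant from the comment: Rx ints must be off while in poll. */
bool model_is_consistent(const struct nic_model *n)
{
    return !(n->rx_sched && n->rx_ints_enabled);
}

/* Runs one poll completion plus one interrupt and reports whether the
 * invariant still holds afterwards. */
bool model_run(bool irq_in_window)
{
    struct nic_model n = { .rx_sched = true, .rx_ints_enabled = false };

    model_poll_done(&n, irq_in_window);
    if (!irq_in_window)
        model_interrupt(&n);       /* interrupt held off until after the window */
    return model_is_consistent(&n);
}
```

In this model, `model_run(true)` ends with the in-poll flag set and the Rx interrupts enabled, which is the bad state described in the comment, while `model_run(false)` keeps the invariant.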

Sorry for the delay; it is all the World Cup's fault :)

Robert Olsson wrote:

>Paul writes:
> > Man well I tried 2.4.17 kernel
> > eepro100.c driver patched with NAPI and as soon as
> > I route traffic to it it destroys eth0 and eth1 which are the two
> > interfaces that take the traffic.. they just die.. nothing in logs
> > nothing in dmesg, no errors, just all of a sudden no traffic
> > can go in or out those interfaces... sigh..
> > I then took the driver from kernel 2.5.21 and put it in 2.4.17 and compiled
> > after patching with NAPI and had the same problem..
> >
> > You have any idea what the deal is?
> >
> > It just dies instantly..
>
> Honestly no, I just uploaded the eepro100 NAPI patch from Zhang Fuxin; he
> might have some ideas.
>
> I'm struggling with napi variant of the D-LINK sundance driver for 4-port
> board myself.
>
> Cheers.
>
> --ro
>
>



Attachments:
eepro100-napi.tar.gz (7.71 kB)

2002-06-18 04:44:08

by Alexey Kuznetsov

Subject: Re: NAPI eepro100 bug fixed

Hello!

> /* Disabling interrupts here is necessary!
>  * We need to ensure Rx/RxNoBuf ints are disabled while the in-poll
>  * flag is set. If an interrupt comes between netif_rx_complete
>  * and enable_rx_and_rxnobuf_ints, the following will happen:
>  *   netif_rx_complete --> clear RX_SCHED flag
>  *     -> ints (e.g. TxDone)
>  *        speedo_interrupt
>  *          if (netif_rx_schedule_prep(dev))
>  *            disable_rx_and_rxnobuf_ints
>  *        return
>  *     <-
>  *   enable_rx_and_rxnobuf_ints
>  * Then we will have Rx & RxNoBuf ints enabled while polling!
>  * This may lead to endless interrupts and an effective lockup of
>  * the whole machine.
>  */
> spin_lock_irqsave(&sp->lock, flags);
> netif_rx_complete(dev);
> enable_rx_and_rxnobuf_ints(dev);
> spin_unlock_irqrestore(&sp->lock, flags);

You mixed two different driver models; that is the reason for the lockup.

You must ACK the irq in the interrupt handler in some way.
Tulip really does play the trick of deferring the ack to the poll routine,
but it pays for this by _masking_ the irq on each interrupt instead, which
also drops the irq line. See?
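
A rough userspace sketch of the two models, for illustration only. It assumes a level-style irq line that stays asserted while any pending status bit is unmasked; `struct irq_model` and the `handler_*` functions are hypothetical names, not taken from any driver.

```c
#include <stdbool.h>

/* Hypothetical model of a device's interrupt state; not kernel code. */
struct irq_model {
    unsigned status; /* pending interrupt causes */
    unsigned mask;   /* causes currently allowed to assert the line */
};

/* Assumption: the line is asserted while a pending cause is unmasked. */
bool irq_line_asserted(const struct irq_model *d)
{
    return (d->status & d->mask) != 0;
}

/* Model 1: ACK in the interrupt handler, which clears the pending cause. */
void handler_ack(struct irq_model *d)
{
    d->status = 0;
}

/* Model 2 (tulip-style): defer the ACK to poll, but mask in the handler. */
void handler_mask(struct irq_model *d)
{
    d->mask = 0;
}

/* Mixed-up model: the handler returns without acking or masking. */
void handler_neither(struct irq_model *d)
{
    (void)d;
}

/* Counts how often the handler is re-entered for one pending cause,
 * giving up after `limit` calls so a storm cannot loop forever. */
int count_handler_calls(void (*handler)(struct irq_model *), int limit)
{
    struct irq_model d = { .status = 0x1, .mask = 0x1 };
    int calls = 0;

    while (irq_line_asserted(&d) && calls < limit) {
        handler(&d);
        calls++;
    }
    return calls;
}
```

With either `handler_ack` or `handler_mask` the line drops after a single call; with `handler_neither` the loop only stops at `limit`, which is the endless-interrupt case.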

Alexey

2002-06-18 17:04:23

by Robert Olsson

Subject: NAPI eepro100 bug fixed


Zhang Fuxin writes:

> My first NAPI eepro100 contains a subtle but fatal race, which will lead
> to lockup (of the whole machine here, but of ether interface for Paul). This
> version should be ok, Paul, would you like to have a try? I've tested it
> in my


> will meet it, so it is listed here.
> /* Disabling interrupts here is necessary!
>  * We need to ensure Rx/RxNoBuf ints are disabled while the in-poll
>  * flag is set. If an interrupt comes between netif_rx_complete
>  * and enable_rx_and_rxnobuf_ints, the following will happen:
>  *   netif_rx_complete --> clear RX_SCHED flag
>  *     -> ints (e.g. TxDone)
>  *        speedo_interrupt
>  *          if (netif_rx_schedule_prep(dev))
>  *            disable_rx_and_rxnobuf_ints
>  *        return
>  *     <-
>  *   enable_rx_and_rxnobuf_ints
>  * Then we will have Rx & RxNoBuf ints enabled while polling!
>  * This may lead to endless interrupts and an effective lockup of
>  * the whole machine.
>  */
> spin_lock_irqsave(&sp->lock, flags);
> netif_rx_complete(dev);
> enable_rx_and_rxnobuf_ints(dev);
> spin_unlock_irqrestore(&sp->lock, flags);

Thanks!

Yes, as far as I can see this is correct. This race and others are mentioned
in NAPI_HOWTO.txt, and yes, the spinlock can help drivers that use this
type of interrupt acking. Tulip is a candidate for this as well. Let's
see if it solves Paul's problem to start with.

Cheers.
--ro

2002-06-18 17:55:51

by Alexey Kuznetsov

Subject: Re: NAPI eepro100 bug fixed

Hello!

> By disabling irq in speedo_poll, we can be sure this won't happen.
...
> Could you find a flaw?

It will happen just as nicely when the irq arrives on another CPU.
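
A small userspace sketch of why this matters on SMP, under stated assumptions: disabling irqs affects only the CPU that does it, and a spinlock closes the window only if the interrupt handler takes the same lock. `struct smp_model`, `handler_can_enter` and `window_is_open` are hypothetical names for this illustration, and whether the real handler takes `sp->lock` is not something this model can tell you.

```c
#include <stdbool.h>

#define NCPUS 2

/* Hypothetical model of a two-CPU machine; not kernel code. */
struct smp_model {
    bool irqs_off[NCPUS]; /* per-CPU: local interrupts disabled? */
    bool lock_held;       /* shared driver lock held by the poll routine? */
};

/* Can the interrupt handler run on `cpu` and reach the shared driver
 * state right now? A handler that takes the lock would spin instead. */
bool handler_can_enter(const struct smp_model *m, int cpu,
                       bool handler_takes_lock)
{
    if (m->irqs_off[cpu])
        return false;
    if (handler_takes_lock && m->lock_held)
        return false;
    return true;
}

/* CPU 0 sits between netif_rx_complete() and re-enabling the Rx ints,
 * with local irqs off. Returns true if a handler on any CPU can still
 * get into that window. */
bool window_is_open(bool poll_takes_lock, bool handler_takes_lock)
{
    struct smp_model m = {
        .irqs_off = { true, false }, /* only the polling CPU has irqs off */
        .lock_held = poll_takes_lock,
    };

    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (handler_can_enter(&m, cpu, handler_takes_lock))
            return true;
    return false;
}
```

In this model the window stays open with local irq disabling alone, and also when only the poll side takes the lock; it closes only when both the poll routine and the handler use the same lock.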

Alexey