2007-12-11 09:14:39

by Joonwoo Park

[permalink] [raw]
Subject: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

[NET]: Fix Ooops of napi net_rx_action.
Before doing list_move_tail napi poll_list, it should be ensured

Signed-off-by: Joonwoo Park <[email protected]>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 86d6261..74bd5ab 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2207,7 +2207,8 @@ static void net_rx_action(struct softirq_action *h)
* still "owns" the NAPI instance and therefore can
* move the instance around on the list at-will.
*/
- if (unlikely(work == weight))
+ if (unlikely((test_bit(NAPI_STATE_SCHED, &n->state))
+ && (work == weight)))
list_move_tail(&n->poll_list, list);

netpoll_poll_unlock(have);
---

I was able to make it real oops on 2.6.24-rc4 + e1000 napi + smp.
eth0 of my laptop is connected directly to a packet generator.
The packet generator is making unicast udp packets to laptop (approximately 70000pps).
At that time, if the interface is went down, Ooops occurred.

Thanks.
Joonwoo


root@joonwpark-laptop:~# ifconfig eth0 down
BUG: unable to handle kernel paging request at virtual address 00100104
printing eip: c42ecc35 *pdpt = 000000001fc89001 *pde = 0000000000000000
Ooops: 0002 [#1] SMP
Modules linked in:

Pid: 4, comm:ksoftirqd/0 Not tainted (2.6.24-rc4 #27)
EIP: 0060:[<c42ecc35>] EFLAGS: 00010046 CPU: 0
EIP is at net_rx_action+0x1d5/0x1f0
EAX: 00100100 EBX: f77c965c ECX: 00000000 EDX: 00200200
ESI: 00000040 EDI: f77c965c EBP: f7c6df94 ESP: f7c6df64
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process ksoftirqd/0 (pid: 4, ti=f7c6c000 task=f7c68ab0 task.ti=f7c6c000)
Stack: 00000002 00000001 c42ecabe c4598f40 c1d0f5cc 0000c7da 000000ac f77c965c
00000040 00000001 c4543b18 c4598f40 f7c6dfb0 c4032a07 00000004 00000000
00000246 00000000 c4032f60 f7c6dfbc c4032ad7 c459ba80 f6c6dfcc c4032fb9
Call Trace:
[<c40053ca>] show_trace_log_lvl+0x1a/0x30
[<c4005489>] show_stack_log_lvl+0xa9/0xd0
[<c400557a>] show_registers+0xca/0x1c0
[<c4005785>] die+0x115/0x230
[<c4020f0c>] do_page_fault+0x34c/0x7f0
[<c43c9cea>] error_code+0x72/0x78
[<c4032a07>] __do_softirq+0x87/0x60
[<c4032ad7>] do_softirq+0x57/0xe0
[<c4041ee2>] kthread+0x42/0x70
[<c4004fc3>] kernel_thread_helper+0x7/0x14

gdb outputs
(gdb) info line *net_rx_action+0x1d5
Line 157 of "include/linux/list.h"
starts at address 0xc42ecc35 <net_rx_action+469>
and ends at 0xc42ecc38 <net_rx_action+472>.
(gdb) l include/linux/list.h:157
152 * This is only for internal list manipulation where we know
153 * the prev/next entries already!
154 */
155 static inline void __list_del(struct list_head * prev, struct list_head * next)
156 {
157 next->prev = prev;
158 prev->next = next;
159 }
160
161 /**
(gdb) info line *__do_softirq+0x87
Line 130 of "include/linux/rcupdate.h"
starts at address 0xc4032a07 <__do_softirq+135>
and ends at 0xc4032a0a <__do_softirq+138>.
(gdb) l include/linux/rcupdate.h:130
125 struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
126 rdp->passed_quiesc = 1;
127 }
128 static inline void rcu_bh_qsctr_inc(int cpu)
129 {
130 struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
131 rdp->passed_quiesc = 1;
132 }
133
134 extern int rcu_pending(int cpu);
(gdb) info line *do_softirq+0x57
Line 269 of "kernel/softirq.c" starts at address 0xc4032ad2 <do_softirq+82>
and ends at 0xc4032ae0 <tasklet_kill>.
(gdb) l kernel/softirq.c:269
264 local_irq_save(flags);
265
266 pending = local_softirq_pending();
267
268 if (pending)
269 __do_softirq();
270
271 local_irq_restore(flags);
272 }
273


2007-12-11 10:32:29

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

From: "Joonwoo Park" <[email protected]>
Date: Tue, 11 Dec 2007 18:13:34 +0900

Joonwoo-ssi annyoung haseyo,

> [NET]: Fix Ooops of napi net_rx_action.
> Before doing list_move_tail napi poll_list, it should be ensured
>
> Signed-off-by: Joonwoo Park <[email protected]>
> ---
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 86d6261..74bd5ab 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2207,7 +2207,8 @@ static void net_rx_action(struct softirq_action *h)
> * still "owns" the NAPI instance and therefore can
> * move the instance around on the list at-will.
> */
> - if (unlikely(work == weight))
> + if (unlikely((test_bit(NAPI_STATE_SCHED, &n->state))
> + && (work == weight)))
> list_move_tail(&n->poll_list, list);
>
> netpoll_poll_unlock(have);

How can the NAPI_STATE_SCHED bit be cleared externally yet we take
this list_move_tail() code path?

If NAPI_STATE_SCHED is cleared, work will be zero which will never be
equal to 'weight', and this we'll never attempt the list_move_tail().

If something clears NAPI_STATE_SCHED meanwhile, we have a serious race
and your patch is an incomplete bandaid. For example, if it can
happen, then a case like:

if (test_bit(NAPI_STATE_SCHED, &n->state))
... something clears NAPI_STATE_SCHED right now ...
work = n->poll(n, weight);

can crash too.

2007-12-11 12:36:38

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

David Miller <[email protected]> wrote:
>
> How can the NAPI_STATE_SCHED bit be cleared externally yet we take
> this list_move_tail() code path?

His driver is probably buggy. When we had two drivers beginning
with e100 we often forgot to apply fixes to the both of them. Now
that we have three it's even more confusing.

I just checked and indeed e1000e seems to be missing the NAPI fix
that was applied to e1000. Of course it doesn't rule out the
possibility of another NAPI bug in e1000.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-12-11 12:41:59

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

From: Herbert Xu <[email protected]>
Date: Tue, 11 Dec 2007 20:36:21 +0800

> David Miller <[email protected]> wrote:
> >
> > How can the NAPI_STATE_SCHED bit be cleared externally yet we take
> > this list_move_tail() code path?
>
> His driver is probably buggy. When we had two drivers beginning
> with e100 we often forgot to apply fixes to the both of them. Now
> that we have three it's even more confusing.
>
> I just checked and indeed e1000e seems to be missing the NAPI fix
> that was applied to e1000. Of course it doesn't rule out the
> possibility of another NAPI bug in e1000.

Thanks for checking.

Indeed I stuck the huge comment there in net_rx_action() above the
list move to try and explain things to people, so that if you saw a
crash in the list manipulation, you're go check the driver first.

2007-12-11 12:57:22

by Joonwoo Park

[permalink] [raw]
Subject: Re: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

2007/12/11, David Miller <[email protected]>:
> From: "Joonwoo Park" <[email protected]>
> Date: Tue, 11 Dec 2007 18:13:34 +0900
>
> Joonwoo-ssi annyoung haseyo,

Wow Great! :-)

> How can the NAPI_STATE_SCHED bit be cleared externally yet we take
> this list_move_tail() code path?
>
> If NAPI_STATE_SCHED is cleared, work will be zero which will never be
> equal to 'weight', and this we'll never attempt the list_move_tail().
>
> If something clears NAPI_STATE_SCHED meanwhile, we have a serious race
> and your patch is an incomplete bandaid. For example, if it can
> happen, then a case like:
>
> if (test_bit(NAPI_STATE_SCHED, &n->state))
> ... something clears NAPI_STATE_SCHED right now ...
> work = n->poll(n, weight);
>
> can crash too.
>

David,
With your suggestions, I'm feeling a doubt at e1000.
It's seems to me e1000_clean does call netif_rx_complete and returns
non-zero work_done when !netif_running()
So net_rx_action has (work == weight) as true with NAPI_STATE_SCHED cleared.

Here is the e1000_clean.

static int
e1000_clean(struct napi_struct *napi, int budget)
{
/* Keep link state information with original netdev */
if (!netif_carrier_ok(poll_dev))
goto quit_polling;

...

adapter->clean_rx(adapter, &adapter->rx_ring[0],
&work_done, budget);

/* If no Tx and not enough Rx work done, exit the polling mode */
if ((!tx_cleaned && (work_done == 0)) ||
!netif_running(poll_dev)) {
quit_polling:
if (likely(adapter->itr_setting & 3))
e1000_set_itr(adapter);
netif_rx_complete(poll_dev, napi);
e1000_irq_enable(adapter);
}

return work_done;
}

Thanks.
Joonwoo

2007-12-11 23:22:19

by Jesse Brandeburg

[permalink] [raw]
Subject: RE: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

Joonwoo Park wrote:
> /* If no Tx and not enough Rx work done, exit the polling mode */
> if ((!tx_cleaned && (work_done == 0)) ||
> !netif_running(poll_dev)) {
> quit_polling:
> if (likely(adapter->itr_setting & 3))
> e1000_set_itr(adapter);
> netif_rx_complete(poll_dev, napi);
> e1000_irq_enable(adapter);

all drivers using NAPI in 2.6.24+ (NNAPI??) must return zero here, after
calling netif_rx_complete. netif_rx_complete plus work_done != 0 causes
a bug.

> }
>
> return work_done;
> }
>

2007-12-11 23:27:52

by Joonwoo Park

[permalink] [raw]
Subject: Re: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

2007/12/12, Brandeburg, Jesse <[email protected]>:
> Joonwoo Park wrote:
> > /* If no Tx and not enough Rx work done, exit the polling mode */
> > if ((!tx_cleaned && (work_done == 0)) ||
> > !netif_running(poll_dev)) {
> > quit_polling:
> > if (likely(adapter->itr_setting & 3))
> > e1000_set_itr(adapter);
> > netif_rx_complete(poll_dev, napi);
> > e1000_irq_enable(adapter);
>
> all drivers using NAPI in 2.6.24+ (NNAPI??) must return zero here, after
> calling netif_rx_complete. netif_rx_complete plus work_done != 0 causes
> a bug.
>
> > }
> >
> > return work_done;
> > }
> >
>

I'm working another patch for drivers (maybe patches)

Thanks.
Joonwoo

2007-12-11 23:45:06

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

Perhaps we should change the warning to identify the guilty device.


--- a/net/core/dev.c 2007-11-19 09:09:57.000000000 -0800
+++ b/net/core/dev.c 2007-12-07 15:54:03.000000000 -0800
@@ -2196,7 +2196,13 @@ static void net_rx_action(struct softirq
if (test_bit(NAPI_STATE_SCHED, &n->state))
work = n->poll(n, weight);

- WARN_ON_ONCE(work > weight);
+ if (unlikely(work > weight)) {
+ if (net_ratelimit())
+ printk(KERN_WARNING
+ "%s: driver poll bug (work=%d weight=%d)\n",
+ work, weight);
+ work = weight;
+ }

budget -= work;

2007-12-12 00:12:20

by Joonwoo Park

[permalink] [raw]
Subject: Re: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

2007/12/12, Brandeburg, Jesse <[email protected]>:
>
> all drivers using NAPI in 2.6.24+ (NNAPI??) must return zero here, after
> calling netif_rx_complete. netif_rx_complete plus work_done != 0 causes
> a bug.
>

Brandeburg,
Don't we need to return non-zero work_done after netif_rx_complete if
work_done != weight?

Thanks,
Joonwoo

2007-12-12 00:41:56

by Jesse Brandeburg

[permalink] [raw]
Subject: RE: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

Joonwoo Park wrote:
> 2007/12/12, Brandeburg, Jesse <[email protected]>:
>>
>> all drivers using NAPI in 2.6.24+ (NNAPI??) must return zero here,
>> after calling netif_rx_complete. netif_rx_complete plus work_done
>> != 0 causes a bug.
>>
>
> Brandeburg,
> Don't we need to return non-zero work_done after netif_rx_complete if
> work_done != weight?

Actually we just need to make sure we don't return work_done == weight
as it is checked on line 2210 of dev.c.
I should also note that I was wrong above, and that the requirement is
that we MUST not return work_done == weight when indicating
netif_rx_complete.

So, returning 0 when we actually did some work, but are being removed
from the poll list because something like !netif_running, is probably
okay, but I'm sure someone will disagree with me. Maybe returning like
this would be better
diff --git a/drivers/net/e1000/e1000_main.c
b/drivers/net/e1000/e1000_main.c
index 724f067..76d5e3b 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3933,6 +3933,10 @@ quit_polling:
e1000_set_itr(adapter);
netif_rx_complete(poll_dev, napi);
e1000_irq_enable(adapter);
+ if (work_done == weight)
+ return work_done - 1;
+ else
+ return work_done;
}

return work_done;

Jesse

2007-12-12 15:13:16

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

From: "Brandeburg, Jesse" <[email protected]>
Date: Tue, 11 Dec 2007 16:38:37 -0800

> @@ -3933,6 +3933,10 @@ quit_polling:
> e1000_set_itr(adapter);
> netif_rx_complete(poll_dev, napi);
> e1000_irq_enable(adapter);
> + if (work_done == weight)
> + return work_done - 1;
> + else
> + return work_done;

Don't do this.

If you processed "weight" worth of packets, return that
exact value and do not netif_rx_complete() and do not
re-enable interrupts.

That is the only correct fix.

2007-12-16 21:35:27

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] [NET]: Fix Ooops of napi net_rx_action.

From: Stephen Hemminger <[email protected]>
Date: Tue, 11 Dec 2007 15:42:22 -0800

> Perhaps we should change the warning to identify the guilty device.

Applied.

Stephen, you often don't supply a proper signoff line
for one-off changes like this and I find it very irritating.
It doesn't cost you anything to hit one or two keys in your
outgoing email editor to include the signoff line.

So please do so, or I'll silently drop patches without them
in the future.

Thanks.