2007-12-12 04:03:53

by Joonwoo Park

Subject: [PATCH 6/7] [NETDEV]: tehuti Fix possible causing oops of net_rx_action

[NETDEV]: tehuti Fix possible causing oops of net_rx_action

Signed-off-by: Joonwoo Park <[email protected]>
---
drivers/net/tehuti.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/tehuti.c b/drivers/net/tehuti.c
index 21230c9..955e749 100644
--- a/drivers/net/tehuti.c
+++ b/drivers/net/tehuti.c
@@ -305,6 +305,8 @@ static int bdx_poll(struct napi_struct *napi, int budget)

 		netif_rx_complete(dev, napi);
 		bdx_enable_interrupts(priv);
+		if (unlikely(work_done == napi->weight))
+			return work_done - 1;
 	}
 	return work_done;
 }
---


2007-12-12 05:41:58

by Stephen Hemminger

Subject: Re: [PATCH 6/7] [NETDEV]: tehuti Fix possible causing oops of net_rx_action

On Wed, 12 Dec 2007 13:01:27 +0900
"Joonwoo Park" <[email protected]> wrote:

> [NETDEV]: tehuti Fix possible causing oops of net_rx_action
>
> Signed-off-by: Joonwoo Park <[email protected]>
> ---
> drivers/net/tehuti.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/tehuti.c b/drivers/net/tehuti.c
> index 21230c9..955e749 100644
> --- a/drivers/net/tehuti.c
> +++ b/drivers/net/tehuti.c
> @@ -305,6 +305,8 @@ static int bdx_poll(struct napi_struct *napi, int budget)
>
> netif_rx_complete(dev, napi);
> bdx_enable_interrupts(priv);
> + if (unlikely(work_done == napi->weight))
> + return work_done - 1;
> }
> return work_done;
> }

A better fix would be not going over budget in the first place.

--
Stephen Hemminger <[email protected]>

2007-12-12 05:48:29

by Stephen Hemminger

Subject: [RFC] net: napi fix

Isn't this a better fix for all drivers, rather than peppering every
driver with the special case. This is how the logic worked up until
2.6.24.


--- a/net/core/dev.c 2007-12-11 12:16:20.000000000 -0800
+++ b/net/core/dev.c 2007-12-11 21:43:39.000000000 -0800
@@ -2184,7 +2184,7 @@ static void net_rx_action(struct softirq

have = netpoll_poll_lock(n);

- weight = n->weight;
+ weight = min(n->weight, budget);

/* This NAPI_STATE_SCHED test is for avoiding a race
* with netpoll's poll_napi(). Only the entity which

2007-12-12 05:48:45

by Joonwoo Park

Subject: Re: [PATCH 6/7] [NETDEV]: tehuti Fix possible causing oops of net_rx_action

2007/12/12, Stephen Hemminger <[email protected]>:
> On Wed, 12 Dec 2007 13:01:27 +0900
> "Joonwoo Park" <[email protected]> wrote:
>
> > [NETDEV]: tehuti Fix possible causing oops of net_rx_action
> >
> > Signed-off-by: Joonwoo Park <[email protected]>
> > ---
> > drivers/net/tehuti.c | 2 ++
> > 1 files changed, 2 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/net/tehuti.c b/drivers/net/tehuti.c
> > index 21230c9..955e749 100644
> > --- a/drivers/net/tehuti.c
> > +++ b/drivers/net/tehuti.c
> > @@ -305,6 +305,8 @@ static int bdx_poll(struct napi_struct *napi, int budget)
> >
> > netif_rx_complete(dev, napi);
> > bdx_enable_interrupts(priv);
> > + if (unlikely(work_done == napi->weight))
> > + return work_done - 1;
> > }
> > return work_done;
> > }
>
> A better fix would be not going over budget in the first place.
>
> --
> Stephen Hemminger <[email protected]>
>

Stephen,
This is the code of bdx_poll().
Do you mean I should remove the napi_stop stuff?

static int bdx_poll(struct napi_struct *napi, int budget)
{
	...
	work_done = bdx_rx_receive(priv, &priv->rxd_fifo0, budget);
	if ((work_done < budget) ||
	    (priv->napi_stop++ >= 30)) {
		DBG("rx poll is done. backing to isr-driven\n");

		/* from time to time we exit to let NAPI layer release
		 * device lock and allow waiting tasks (eg rmmod) to advance) */
		priv->napi_stop = 0;

		netif_rx_complete(dev, napi);
		bdx_enable_interrupts(priv);
		if (unlikely(work_done == napi->weight))
			return work_done - 1;
	}
	return work_done;
}

Thanks,
Joonwoo

2007-12-12 05:56:13

by Stephen Hemminger

Subject: Re: [PATCH 6/7] [NETDEV]: tehuti Fix possible causing oops of net_rx_action

On Wed, 12 Dec 2007 14:48:27 +0900
"Joonwoo Park" <[email protected]> wrote:

> 2007/12/12, Stephen Hemminger <[email protected]>:
> > On Wed, 12 Dec 2007 13:01:27 +0900
> > "Joonwoo Park" <[email protected]> wrote:
> >
> > > [NETDEV]: tehuti Fix possible causing oops of net_rx_action
> > >
> > > Signed-off-by: Joonwoo Park <[email protected]>
> > > ---
> > > drivers/net/tehuti.c | 2 ++
> > > 1 files changed, 2 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/drivers/net/tehuti.c b/drivers/net/tehuti.c
> > > index 21230c9..955e749 100644
> > > --- a/drivers/net/tehuti.c
> > > +++ b/drivers/net/tehuti.c
> > > @@ -305,6 +305,8 @@ static int bdx_poll(struct napi_struct *napi, int budget)
> > >
> > > netif_rx_complete(dev, napi);
> > > bdx_enable_interrupts(priv);
> > > + if (unlikely(work_done == napi->weight))
> > > + return work_done - 1;
> > > }
> > > return work_done;
> > > }
> >
> > A better fix would be not going over budget in the first place.
> >
> > --
> > Stephen Hemminger <[email protected]>
> >
>
> Stephen,
> This is code of bd_poll().
> Do you mean remove napi_stop stuff?
>
> static int bdx_poll(struct napi_struct *napi, int budget)
> {
> ...
> work_done = bdx_rx_receive(priv, &priv->rxd_fifo0, budget);
> if ((work_done < budget) ||
> (priv->napi_stop++ >= 30)) {

Yes, remove the napi_stop stuff, because current NAPI expects the device
to be constrained only by budget. If you need to stop sooner, just
set the napi weight to be smaller.

> DBG("rx poll is done. backing to isr-driven\n");
>
> /* from time to time we exit to let NAPI layer release
> * device lock and allow waiting tasks (eg rmmod) to advance) */
> priv->napi_stop = 0;
>
> netif_rx_complete(dev, napi);
> bdx_enable_interrupts(priv);

With my posted fix to net_rx_action() the following two lines would not be needed.
> if (unlikely(work_done == napi->weight))
> return work_done - 1;
> }
> return work_done;
> }
>
> Thanks,
> Joonwoo


--
Stephen Hemminger <[email protected]>

2007-12-12 06:05:41

by Joonwoo Park

Subject: Re: [RFC] net: napi fix

2007/12/12, Stephen Hemminger <[email protected]>:
> Isn't this a better fix for all drivers, rather than peppering every
> driver with the special case. This is how the logic worked up until
> 2.6.24.
>
>
> --- a/net/core/dev.c 2007-12-11 12:16:20.000000000 -0800
> +++ b/net/core/dev.c 2007-12-11 21:43:39.000000000 -0800
> @@ -2184,7 +2184,7 @@ static void net_rx_action(struct softirq
>
> have = netpoll_poll_lock(n);
>
> - weight = n->weight;
> + weight = min(n->weight, budget);
>
> /* This NAPI_STATE_SCHED test is for avoiding a race
> * with netpoll's poll_napi(). Only the entity which
>

Stephen,
Could you explain how it fixes the problem?
IMHO your patch cannot solve the problem.
The drivers can still call netif_rx_complete() and net_rx_action() can
then do list_move_tail() as well.
Am I missing something?

Thanks
Joonwoo

2007-12-12 15:18:48

by David Miller

Subject: Re: [PATCH 6/7] [NETDEV]: tehuti Fix possible causing oops of net_rx_action

From: "Joonwoo Park" <[email protected]>
Date: Wed, 12 Dec 2007 13:01:27 +0900

> @@ -305,6 +305,8 @@ static int bdx_poll(struct napi_struct *napi, int budget)
>
> netif_rx_complete(dev, napi);
> bdx_enable_interrupts(priv);
> + if (unlikely(work_done == napi->weight))
> + return work_done - 1;
> }
> return work_done;
> }

Any time you're trying to make a caller "happy" by adjusting
a return value forcefully, it's a hack.

And I stated this in another reply about this issue.

Please do not fix the problem this way.

The correct way to fix this is, if we did process a full
"weight" of work, we should not netif_rx_complete() and
we should not re-enable chip interrupts.

Instead we should return the true "work_done" value and
allow the caller to thus poll us one more time.
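
The rule above can be sketched as a small userspace model. This is only an illustration, not the tehuti driver: struct model_nic, model_rx_receive() and model_poll() are hypothetical names, and the two flag assignments merely stand in for netif_rx_complete() and bdx_enable_interrupts().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a driver's private state; not tehuti's real struct. */
struct model_nic {
	int pending;		/* packets waiting in the RX ring */
	bool scheduled;		/* true while the NAPI context is on the poll list */
	bool irq_enabled;	/* true once chip interrupts are re-enabled */
};

/* Pull up to 'budget' packets off the ring and return how many were taken. */
static int model_rx_receive(struct model_nic *nic, int budget)
{
	int n = nic->pending < budget ? nic->pending : budget;

	nic->pending -= n;
	return n;
}

/*
 * Poll routine following the rule: complete and re-enable interrupts
 * only when strictly less than 'budget' was processed.
 */
static int model_poll(struct model_nic *nic, int budget)
{
	int work_done;

	assert(budget > 0);
	work_done = model_rx_receive(nic, budget);

	if (work_done < budget) {
		nic->scheduled = false;		/* models netif_rx_complete() */
		nic->irq_enabled = true;	/* models bdx_enable_interrupts() */
	}
	return work_done;			/* always the true count */
}
```

When the ring holds exactly 'budget' packets, the first call returns the full count and leaves the context scheduled; the next call returns 0 and only then completes.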

2007-12-12 15:20:50

by David Miller

Subject: Re: [PATCH 6/7] [NETDEV]: tehuti Fix possible causing oops of net_rx_action

From: Stephen Hemminger <[email protected]>
Date: Tue, 11 Dec 2007 21:39:39 -0800

> On Wed, 12 Dec 2007 13:01:27 +0900
> "Joonwoo Park" <[email protected]> wrote:
>
> > [NETDEV]: tehuti Fix possible causing oops of net_rx_action
> >
> > Signed-off-by: Joonwoo Park <[email protected]>
> > ---
> > drivers/net/tehuti.c | 2 ++
> > 1 files changed, 2 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/net/tehuti.c b/drivers/net/tehuti.c
> > index 21230c9..955e749 100644
> > --- a/drivers/net/tehuti.c
> > +++ b/drivers/net/tehuti.c
> > @@ -305,6 +305,8 @@ static int bdx_poll(struct napi_struct *napi, int budget)
> >
> > netif_rx_complete(dev, napi);
> > bdx_enable_interrupts(priv);
> > + if (unlikely(work_done == napi->weight))
> > + return work_done - 1;
> > }
> > return work_done;
> > }
>
> A better fix would be not going over budget in the first place.

That's not the problem.

They are not going over the budget, rather, they are hitting
the budget yet doing netif_rx_complete() as well which is
illegal.

Unless you strictly process less than "weight" packets, you must
not netif_rx_complete() and re-enable chip interrupts.

I can't believe people are trying to fix this bug like this.

2007-12-12 15:21:37

by David Miller

Subject: Re: [RFC] net: napi fix

From: Stephen Hemminger <[email protected]>
Date: Tue, 11 Dec 2007 21:46:34 -0800

> Isn't this a better fix for all drivers, rather than peppering every
> driver with the special case. This is how the logic worked up until
> 2.6.24.

Stephen this is not the problem.

The problem is that the driver is doing a NAPI completion and
re-enabling chip interrupts with work_done == weight, and that is
illegal.

2007-12-12 15:22:27

by David Miller

Subject: Re: [RFC] net: napi fix

From: "Joonwoo Park" <[email protected]>
Date: Wed, 12 Dec 2007 15:05:26 +0900

> Could you explain how it fix the problem?
> IMHO I think your patch cannot solve the problem.
> The drivers can call netif_rx_complete and net_rx_action can do
> list_move_tail also.

Stephen is confused about what the bug is in these drivers,
that's all.

2007-12-12 16:39:24

by Stephen Hemminger

Subject: Re: [PATCH 6/7] [NETDEV]: tehuti Fix possible causing oops of net_rx_action

On Wed, 12 Dec 2007 07:20:34 -0800 (PST)
David Miller <[email protected]> wrote:

> From: Stephen Hemminger <[email protected]>
> Date: Tue, 11 Dec 2007 21:39:39 -0800
>
> > On Wed, 12 Dec 2007 13:01:27 +0900
> > "Joonwoo Park" <[email protected]> wrote:
> >
> > > [NETDEV]: tehuti Fix possible causing oops of net_rx_action
> > >
> > > Signed-off-by: Joonwoo Park <[email protected]>
> > > ---
> > > drivers/net/tehuti.c | 2 ++
> > > 1 files changed, 2 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/drivers/net/tehuti.c b/drivers/net/tehuti.c
> > > index 21230c9..955e749 100644
> > > --- a/drivers/net/tehuti.c
> > > +++ b/drivers/net/tehuti.c
> > > @@ -305,6 +305,8 @@ static int bdx_poll(struct napi_struct *napi, int budget)
> > >
> > > netif_rx_complete(dev, napi);
> > > bdx_enable_interrupts(priv);
> > > + if (unlikely(work_done == napi->weight))
> > > + return work_done - 1;
> > > }
> > > return work_done;
> > > }
> >
> > A better fix would be not going over budget in the first place.
>
> That's not the problem.
>
> They are not going over the budget, rather, they are hitting
> the budget yet doing netif_rx_complete() as well which is
> illegal.
>
> Unless you strictly process less than "weight" packets, you must
> not netif_rx_complete() and re-enable chip interrupts.
>
> I can't believe people are trying to fix this bug like this.

Sorry, I was looking at a different possible problem. The issue
is that if netdev_budget was set smaller (say 128) but device
weight was set larger (say 256). The new code would still allow
the device to do a full swipe (256) packets rather than only
128 as in earlier NAPI. I guess it is an okay behaviour change, because
we don't really guarantee that case.
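
The behaviour change described above can be put into a minimal userspace sketch. The function names are hypothetical; the first mirrors the "weight = n->weight" line that the RFC patch removes, the second the "min(n->weight, budget)" line it adds.

```c
#include <assert.h>

/* Hypothetical model: per-poll limit when only the device weight is used. */
static int poll_limit_weight_only(int weight, int budget)
{
	assert(weight > 0);
	(void)budget;		/* the remaining netdev_budget is ignored here */
	return weight;
}

/* Hypothetical model: per-poll limit with the RFC's min(n->weight, budget). */
static int poll_limit_rfc(int weight, int budget)
{
	assert(weight > 0);
	return weight < budget ? weight : budget;
}
```

With a weight of 256 and a budget of 128, the first function allows a full 256-packet sweep while the second caps the poll at 128.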

The problem with the tehuti driver is the logic around priv->napi_stop.
That whole early stop concept should be removed since it just
duplicates the logic of netdev->weight but breaks the assumptions
in the calling net_rx_action().



--
Stephen Hemminger <[email protected]>

2007-12-12 17:30:59

by Andrew Gallatin

Subject: Re: [RFC] net: napi fix

[I apologize for losing threading, I'm replying from the archives]

> The problem is that the driver is doing a NAPI completion and
> re-enabling chip interrupts with work_done == weight, and that is
> illegal.

The only time at least myri10ge will do this is due to
the !netif_running(netdev) check. Eg, from myri10ge's poll:

work_done = myri10ge_clean_rx_done(mgp, budget);

if (work_done < budget || !netif_running(netdev)) {
netif_rx_complete(netdev, napi);
put_be32(htonl(3), mgp->irq_claim);
}

Is the netif_running() check even required? Is this just
a bad way to solve a race with running NAPI at down() time
that would be better solved by putting a napi_synchronize()
in the driver's down() routine?

I'd rather fix this right than add another check to a
questionable code path.

Thanks,

Drew

2007-12-12 17:38:31

by David Miller

Subject: Re: [RFC] net: napi fix

From: Andrew Gallatin <[email protected]>
Date: Wed, 12 Dec 2007 12:29:23 -0500

> Is the netif_running() check even required?

No, it is not.

When a device is brought down, one of the first things
that happens is that we wait for all pending NAPI polls
to complete, then block any new polls from starting.

2007-12-12 17:47:38

by Andrew Gallatin

Subject: Re: [RFC] net: napi fix

David Miller wrote:
> From: Andrew Gallatin <[email protected]>
> Date: Wed, 12 Dec 2007 12:29:23 -0500
>
>> Is the netif_running() check even required?
>
> No, it is not.
>
> When a device is brought down, one of the first things
> that happens is that we wait for all pending NAPI polls
> to complete, then block any new polls from starting.

Great, thanks. I will submit a patch to remove the bogus
check. This should fix myri10ge properly.


Thank you,

Drew

2007-12-12 18:45:19

by Kok, Auke

Subject: Re: [RFC] net: napi fix

David Miller wrote:
> From: Andrew Gallatin <[email protected]>
> Date: Wed, 12 Dec 2007 12:29:23 -0500
>
>> Is the netif_running() check even required?
>
> No, it is not.
>
> When a device is brought down, one of the first things
> that happens is that we wait for all pending NAPI polls
> to complete, then block any new polls from starting.

I think this was previously (pre-2.6.24) not the case, which is why e1000 et al
has this check as well and that's exactly what is causing most of the
net_rx_action oopses in the first place. Without the netif_running() check
previously the drivers were just unusable with NAPI and prone to many races with
down (i.e. touching some ethtool ioctl which wants to do a reset while routing
small packets at high numbers). that's why we added the netif_running() check in
the first place :)

There might be more drivers lurking that need this change...

Auke

2007-12-13 07:22:30

by Joonwoo Park

Subject: Re: [PATCH 6/7] [NETDEV]: tehuti Fix possible causing oops of net_rx_action

2007/12/13, David Miller <[email protected]>:
> From: "Joonwoo Park" <[email protected]>
> Date: Wed, 12 Dec 2007 13:01:27 +0900
>
>
> Any time you're trying to make a caller "happy" by adjusting
> a return value forcefully, it's a hack.
>
> And I stated this in another reply about this issue.
>
> Please do not fix the problem this way.
>
> The correct way to fix this is, if we did process a full
> "weight" of work, we should not netif_rx_complete() and
> we should not re-enable chip interrupts.
>
> Instead we should return the true "work_done" value and
> allow the caller to thus poll us one more time.
>

Thanks so much for your advice.
I agree, returning work_done itself exactly.
I will rework for these drivers.

Thanks.
Joonwoo

2007-12-13 07:41:35

by Joonwoo Park

Subject: Re: [RFC] net: napi fix

2007/12/13, Kok, Auke <[email protected]>:
> David Miller wrote:
> > From: Andrew Gallatin <[email protected]>
> > Date: Wed, 12 Dec 2007 12:29:23 -0500
> >
> >> Is the netif_running() check even required?
> >
> > No, it is not.
> >
> > When a device is brought down, one of the first things
> > that happens is that we wait for all pending NAPI polls
> > to complete, then block any new polls from starting.
>
> I think this was previously (pre-2.6.24) not the case, which is why e1000 et al
> has this check as well and that's exactly what is causing most of the
> net_rx_action oopses in the first place. Without the netif_running() check
> previously the drivers were just unusable with NAPI and prone to many races with
> down (i.e. touching some ethtool ioctl which wants to do a reset while routing
> small packets at high numbers). that's why we added the netif_running() check in
> the first place :)
>
> There might be more drivers lurking that need this change...
>
> Auke
>

Also in my case, without the netif_running() check, I cannot do ifconfig down.
It got stuck if a packet generator was sending packets.

Thanks
Joonwoo

2007-12-13 13:45:08

by Jarek Poplawski

Subject: Re: [RFC] net: napi fix

On 12-12-2007 19:41, Kok, Auke wrote:
> David Miller wrote:
>> From: Andrew Gallatin <[email protected]>
>> Date: Wed, 12 Dec 2007 12:29:23 -0500
>>
>>> Is the netif_running() check even required?
>> No, it is not.
>>
>> When a device is brought down, one of the first things
>> that happens is that we wait for all pending NAPI polls
>> to complete, then block any new polls from starting.
>
> I think this was previously (pre-2.6.24) not the case, which is why e1000 et al
> has this check as well and that's exactly what is causing most of the
> net_rx_action oopses in the first place. Without the netif_running() check
> previously the drivers were just unusable with NAPI and prone to many races with
> down (i.e. touching some ethtool ioctl which wants to do a reset while routing
> small packets at high numbers). that's why we added the netif_running() check in
> the first place :)
>
> There might be more drivers lurking that need this change...
>

As a matter of fact, since it's "unlikely()" in net_rx_action() anyway,
I wonder what is the main reason or gain of leaving such a tricky
exception, instead of letting drivers to always decide which is the
best moment for napi_complete()? (Or maybe even, in such a case, they
should call some function with this list_move_tail() if it's so
useful?)

Regards,
Jarek P.

2007-12-13 13:50:29

by David Miller

Subject: Re: [RFC] net: napi fix

From: Jarek Poplawski <[email protected]>
Date: Thu, 13 Dec 2007 14:49:53 +0100

> As a matter of fact, since it's "unlikely()" in net_rx_action() anyway,
> I wonder what is the main reason or gain of leaving such a tricky
> exception, instead of letting drivers to always decide which is the
> best moment for napi_complete()? (Or maybe even, in such a case, they
> should call some function with this list_move_tail() if it's so
> useful?)

It is the only sane way to synchronize the list manipulations.

There has to be a way for ->poll() to tell net_rx_action() two things:

1) How much work was completed, so we can adjust 'budget'
2) Was the NAPI quota exhausted? So that we know that
net_rx_action() still "owns" the polling context and
thus can do the list manipulation safely.

And these both need to be encoded into one single return value, thus
the adopted convention that "work == weight" means that the device has
not done a NAPI complete.
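
The convention above can be sketched as a userspace model of the caller's side. All names here are hypothetical and the structure is not the kernel's struct napi_struct; the assert marks the spot where a driver that completed and still returned "weight" would break the caller's assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a NAPI context. */
struct model_napi {
	int weight;	/* per-poll quota */
	int pending;	/* packets waiting in the device */
	bool on_list;	/* still on the poll list, i.e. not completed */
	int requeues;	/* times the caller moved it to the list tail */
};

/* A well-behaved ->poll(): completes only when it did less than 'budget'. */
static int model_poll(struct model_napi *n, int budget)
{
	int work = n->pending < budget ? n->pending : budget;

	n->pending -= work;
	if (work < budget)
		n->on_list = false;	/* models netif_rx_complete() */
	return work;
}

/* One pass of the caller's loop body for a single NAPI context. */
static int model_rx_action_step(struct model_napi *n, int *budget)
{
	int work = model_poll(n, n->weight);

	*budget -= work;		/* (1) account for the work done */
	if (work == n->weight) {	/* (2) quota exhausted, caller owns the context */
		assert(n->on_list);	/* a completed context here is the reported bug */
		n->requeues++;		/* models list_move_tail() */
	}
	return work;
}
```

A single integer return value thus carries both pieces of information: the amount to subtract from the budget, and whether the caller may still touch the list entry.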

2007-12-13 14:09:53

by Jarek Poplawski

Subject: Re: [RFC] net: napi fix

On Thu, Dec 13, 2007 at 05:50:13AM -0800, David Miller wrote:
> From: Jarek Poplawski <[email protected]>
> Date: Thu, 13 Dec 2007 14:49:53 +0100
>
> > As a matter of fact, since it's "unlikely()" in net_rx_action() anyway,
> > I wonder what is the main reason or gain of leaving such a tricky
> > exception, instead of letting drivers to always decide which is the
> > best moment for napi_complete()? (Or maybe even, in such a case, they
> > should call some function with this list_move_tail() if it's so
> > useful?)
>
> It is the only sane way to synchronize the list manipulations.
>
> There has to be a way for ->poll() to tell net_rx_action() two things:
>
> 1) How much work was completed, so we can adjust 'budget'
> 2) Was the NAPI quota exhausted? So that we know that
> net_rx_action() still "owns" the polling context and
> thus can do the list manipulation safely.
>
> And these both need to be encoded into one single return value, thus
> the adopted convention that "work == weight" means that the device has
> not done a NAPI complete.

Thanks! So, I've to rethink this all...

Jarek P.

2007-12-13 14:19:54

by David Miller

Subject: Re: [RFC] net: napi fix

From: Andrew Gallatin <[email protected]>
Date: Thu, 13 Dec 2007 09:13:54 -0500

> If the netif_running() check is indeed required to make a device break
> out of napi polling and respond to an ifconfig down, then I think the
> netif_running() check should be moved up into net_rx_action() to avoid
> potential for driver complexity and bugs like the ones you found.

That, or something like it, definitely sounds reasonable and much
better than putting the check into every driver :-)

2007-12-13 14:20:15

by Andrew Gallatin

Subject: Re: [RFC] net: napi fix

Joonwoo Park wrote:
> 2007/12/13, Kok, Auke <[email protected]>:
>> David Miller wrote:
>>> From: Andrew Gallatin <[email protected]>
>>> Date: Wed, 12 Dec 2007 12:29:23 -0500
>>>
>>>> Is the netif_running() check even required?
>>> No, it is not.
>>>
>>> When a device is brought down, one of the first things
>>> that happens is that we wait for all pending NAPI polls
>>> to complete, then block any new polls from starting.
>> I think this was previously (pre-2.6.24) not the case, which is why e1000 et al
>> has this check as well and that's exactly what is causing most of the
>> net_rx_action oopses in the first place. Without the netif_running() check
>> previously the drivers were just unusable with NAPI and prone to many races with
>> down (i.e. touching some ethtool ioctl which wants to do a reset while routing
>> small packets at high numbers). that's why we added the netif_running() check in
>> the first place :)
>>
>> There might be more drivers lurking that need this change...
>>
>> Auke
>>
>
> Also in my case, without netif_running() check, I cannot do ifconfig down.
> It stucked if packet generator was sending packets.

If the netif_running() check is indeed required to make a device break
out of napi polling and respond to an ifconfig down, then I think the
netif_running() check should be moved up into net_rx_action() to avoid
potential for driver complexity and bugs like the ones you found.

Drew

2007-12-13 16:48:16

by Kok, Auke

Subject: Re: [RFC] net: napi fix

David Miller wrote:
> From: Andrew Gallatin <[email protected]>
> Date: Thu, 13 Dec 2007 09:13:54 -0500
>
>> If the netif_running() check is indeed required to make a device break
>> out of napi polling and respond to an ifconfig down, then I think the
>> netif_running() check should be moved up into net_rx_action() to avoid
>> potential for driver complexity and bugs like the ones you found.
>
> That, or something like it, definitely sounds reasonable and much
> better than putting the check into every driver :-)

hear hear!

Auke

2007-12-13 18:30:30

by Stephen Hemminger

Subject: Re: [RFC] net: napi fix

On Thu, 13 Dec 2007 06:19:38 -0800 (PST)
David Miller <[email protected]> wrote:

> From: Andrew Gallatin <[email protected]>
> Date: Thu, 13 Dec 2007 09:13:54 -0500
>
> > If the netif_running() check is indeed required to make a device break
> > out of napi polling and respond to an ifconfig down, then I think the
> > netif_running() check should be moved up into net_rx_action() to avoid
> > potential for driver complexity and bugs like the ones you found.
>
> That, or something like it, definitely sounds reasonable and much
> better than putting the check into every driver :-)

It is not possible to do netif_running() check in generic code as currently
written because of the case of devices where a single NAPI object is
being used to handle two devices. The association between napi and netdevice
is M to N. There are cases like niu that have multiple NAPI's and one
netdevice; and devices like sky2 that can have one NAPI and 2 netdevice's.

The existing pointer from napi to netdevice is only used by netconsole
now. For devices like sky2 it means that netconsole can't work on the
second port, which is not a big problem. But adding a netif_running()
check would be a big issue.

--
Stephen Hemminger <[email protected]>

2007-12-13 19:09:40

by David Miller

Subject: Re: [RFC] net: napi fix

From: Andrew Gallatin <[email protected]>
Date: Thu, 13 Dec 2007 14:02:25 -0500

> Or perhaps we should just leave things as is.

We should probably add a "disabling" state bit to the
napi struct flags, this will be set by napi_disable()
before it loops trying to set the sched bit.

net_rx_action() can then check this.
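
One possible reading of this suggestion, as a single-threaded userspace sketch. The bit names and helper functions are hypothetical, no atomic operations are modelled, and nothing here is taken from an actual kernel patch.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	MODEL_STATE_SCHED   = 1u << 0,	/* poll is scheduled or running */
	MODEL_STATE_DISABLE = 1u << 1,	/* hypothetical "disabling" bit */
};

/* Hypothetical model of a NAPI context. */
struct model_napi {
	unsigned int state;
	int weight;
};

/* First step of a napi_disable()-like call: announce the intent to disable. */
static void model_disable_begin(struct model_napi *n)
{
	n->state |= MODEL_STATE_DISABLE;
}

/* Caller side: decide whether to put the context back on the poll list. */
static bool model_should_requeue(const struct model_napi *n, int work)
{
	assert(work <= n->weight);
	if (work != n->weight)
		return false;	/* the driver already completed by itself */
	if (n->state & MODEL_STATE_DISABLE)
		return false;	/* someone is waiting to disable: stop polling */
	return true;
}
```

The point of the design is that the check lives in one place in the core, so a busy device cannot keep the disabling side waiting indefinitely and no driver needs its own netif_running() test in ->poll().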

2007-12-13 19:11:05

by Andrew Gallatin

Subject: Re: [RFC] net: napi fix

Stephen Hemminger wrote:
> On Thu, 13 Dec 2007 06:19:38 -0800 (PST)
> David Miller <[email protected]> wrote:
>
>> From: Andrew Gallatin <[email protected]>
>> Date: Thu, 13 Dec 2007 09:13:54 -0500
>>
>>> If the netif_running() check is indeed required to make a device break
>>> out of napi polling and respond to an ifconfig down, then I think the
>>> netif_running() check should be moved up into net_rx_action() to avoid
>>> potential for driver complexity and bugs like the ones you found.
>> That, or something like it, definitely sounds reasonable and much
>> better than putting the check into every driver :-)
>
> It is not possible to do netif_running() check in generic code as currently
> written because of the case of devices where a single NAPI object is
> being used to handle two devices. The association between napi and netdevice
> is M to N. There are cases like niu that have multiple NAPI's and one
> netdevice; and devices like sky2 that can have one NAPI and 2 netdevice's.

Ah, now I see. I forgot that not every device has a 1:1::napi:netdev
relationship.

Could we make an optional *dev_state field in the napi structure?
It would be initialized to __LINK_STATE_START. Devices which have
a 1:1 NAPI:netdevice relationship would set it to &netdev->state.
The generic code would then do a test_bit(__LINK_STATE_START,
napi->dev_state), and 1:1 drivers could remove this check.
M:N drivers would pay for a useless (to them) test_bit, and would
have to provide their own netif_running check to get termination
under heavy load.
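
A rough userspace sketch of this idea follows. It assumes that "initialized to __LINK_STATE_START" means the pointer defaults to a word with that bit permanently set; the MODEL_ names, the struct and the helpers are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LINK_STATE_START 0	/* bit number; stands in for __LINK_STATE_START */

/* A word with the START bit always set, for contexts shared by several devices. */
static const unsigned long model_always_running = 1UL << MODEL_LINK_STATE_START;

/* Hypothetical model of a NAPI context carrying the proposed pointer. */
struct model_napi {
	const unsigned long *dev_state;
};

/* Default: the generic check always passes. */
static void model_napi_init(struct model_napi *n)
{
	n->dev_state = &model_always_running;
}

/* 1:1 drivers would point the context at their netdevice's state word. */
static void model_napi_bind(struct model_napi *n, const unsigned long *netdev_state)
{
	assert(netdev_state != NULL);
	n->dev_state = netdev_state;
}

/* The test generic code would perform instead of a per-driver netif_running(). */
static bool model_napi_dev_running(const struct model_napi *n)
{
	return (*n->dev_state >> MODEL_LINK_STATE_START) & 1UL;
}
```

Contexts that never bind keep passing the test, which matches the "useless (to them) test_bit" cost mentioned for M:N drivers.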

Just an idea, perhaps there is a better way which is less hacky.

Or perhaps we should just leave things as is.

Drew

2007-12-13 19:35:51

by Stephen Hemminger

Subject: Re: [RFC] net: napi fix

David Miller wrote:
> From: Andrew Gallatin <[email protected]>
> Date: Thu, 13 Dec 2007 14:02:25 -0500
>
>
>> Or perhaps we should just leave things as is.
>>
>
> We should probably add a "disabling" state bit to the
> napi struct flags, this will be set by napi_disable()
> before it loops trying to set the sched bit.
>
> net_rx_action() can then check this.
>
How about allowing a return value of -1 from napi_poll and letting the device
check itself?

2007-12-13 20:14:13

by Jarek Poplawski

Subject: Re: [RFC] net: napi fix

David Miller wrote, On 12/13/2007 02:50 PM:

> From: Jarek Poplawski <[email protected]>
> Date: Thu, 13 Dec 2007 14:49:53 +0100
>
>> As a matter of fact, since it's "unlikely()" in net_rx_action() anyway,
>> I wonder what is the main reason or gain of leaving such a tricky
>> exception, instead of letting drivers to always decide which is the
>> best moment for napi_complete()? (Or maybe even, in such a case, they
>> should call some function with this list_move_tail() if it's so
>> useful?)
>
> It is the only sane way to synchronize the list manipulations.
>
> There has to be a way for ->poll() to tell net_rx_action() two things:
>
> 1) How much work was completed, so we can adjust 'budget'


The 'budget' line would stay where it is. IMHO, it's only about this
list_move_tail(). (Probably also doing netpoll_poll_unlock()
during n->poll() could be considered to let the driver even destroy
napi just after napi_complete() - but it's another subject.)

> 2) Was the NAPI quota exhausted? So that we know that
> net_rx_action() still "owns" the polling context and
> thus can do the list manipulation safely.
>
> And these both need to be encoded into one single return value, thus
> the adopted convention that "work == weight" means that the device has
> not done a NAPI complete.

Of course, with some care and explanations to driver maintainers, like in
this case, this all should probably work like it is. But IMHO it would be
easier to remember and maintain if there are some simple rules with no
exceptions, so here e.g. driver always "owns" (with functions like
napi_schedule(), napi_complete(), and maybe napi_move_tail()), and
net_rx_action() only reads the list and runs these functions?!

I see in a nearby thread you would prefer to save some work to drivers
(like this netif_running() check), but I think this all is at the cost
of flexibility, and there will probably appear new problems, when a
driver simply can't wait till the next poll (which btw. looks strange
with all these hotplugging, usb and powersaving).

Regards,
Jarek P.

2007-12-13 20:37:41

by David Miller

Subject: Re: [RFC] net: napi fix

From: Jarek Poplawski <[email protected]>
Date: Thu, 13 Dec 2007 21:16:12 +0100

> I see in a nearby thread you would prefer to save some work to drivers
> (like this netif_running() check), but I think this all is at the cost
> of flexibility, and there will probably appear new problems, when a
> driver simply can't wait till the next poll (which btw. looks strange
> with all these hotplugging, usb and powersaving).

As someone who has actually had to edit the NAPI support of _EVERY_
single driver in the tree I can tell you that code duplication and
subtle semantic differences are a huge issue.

And when you talk about driver flexibility, it's wise to mention that
this comes at the expense of flexibility in the core implementation.
For example, if we export the list handling widget into the ->poll()
routines, god help the person who wants to change how the poll list is
managed in net_rx_action() :-/

So we don't want to export datastructure details like that to the
driver.

2007-12-13 20:38:52

by David Miller

Subject: Re: [RFC] net: napi fix

From: Stephen Hemminger <[email protected]>
Date: Thu, 13 Dec 2007 11:35:07 -0800

> How about allowing a return value of -1 from napi_poll and letting
> device check itself.

It doesn't avoid the code duplication in the ->poll() fast paths.

I don't care, on the other hand, if crap accumulates in non-critical
slow paths like napi_disable() and dev_close(). That's why I'm
suggesting solutions in that area.

2007-12-13 20:47:42

by Stephen Hemminger

Subject: Re: [RFC] net: napi fix

David Miller wrote:
> From: Jarek Poplawski <[email protected]>
> Date: Thu, 13 Dec 2007 21:16:12 +0100
>
>
>> I see in a nearby thread you would prefer to save some work to drivers
>> (like this netif_running() check), but I think this all is at the cost
>> of flexibility, and there will probably appear new problems, when a
>> driver simply can't wait till the next poll (which btw. looks strange
>> with all these hotplugging, usb and powersaving).
>>
>
> As someone who has actually had to edit the NAPI support of _EVERY_
> single driver in the tree I can tell you that code duplication and
> subtle semantic differences are a huge issue.
>
> And when you talk about driver flexibility, it's wise to mention that
> this comes at the expense of flexibility in the core implementation.
> For example, if we export the list handling widget into the ->poll()
> routines, god help the person who wants to change how the poll list is
> managed in net_rx_action() :-/
>
> So we don't want to export datastructure details like that to the
> driver.
>
Also, most of the drivers should/could be doing the same thing. It
seems that driver writers just want to get "creative" and do things
differently. The code is cleaner, safer, and less buggy if every device
uses the interface in the same way.

When I did the initial pass on this, I didn't see a single variation on
NAPI usage that was better than the simple "get N packets and return"
variation. But Dave did way more detailed grunt work on this.
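
A rough sketch of the "get N packets and return" shape, as a userspace
model rather than driver code. The toy_dev structure and toy_poll() are
hypothetical names; the field updates only stand in for receiving a
packet, calling netif_rx_complete() and re-enabling interrupts.

```c
#include <stdbool.h>

/* Hypothetical device state; field names are illustrative only. */
struct toy_dev {
    int rx_pending;      /* packets waiting in the RX ring */
    bool irq_enabled;    /* RX interrupt state */
    bool scheduled;      /* still on the poll list? */
};

/* Process at most 'budget' packets and return the number processed. */
static int toy_poll(struct toy_dev *dev, int budget)
{
    int work_done = 0;

    while (work_done < budget && dev->rx_pending > 0) {
        dev->rx_pending--;          /* stand-in for receiving one packet */
        work_done++;
    }

    /*
     * Complete only when the quota was not used up, so the function
     * never both completes and returns the full budget.
     */
    if (work_done < budget) {
        dev->scheduled = false;     /* stand-in for netif_rx_complete() */
        dev->irq_enabled = true;    /* stand-in for re-enabling interrupts */
    }
    return work_done;
}
```

If the ring empties exactly at the budget, this model stays scheduled and
completes on the next call, which returns 0.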

2007-12-13 21:52:22

by Jarek Poplawski

Subject: Re: [RFC] net: napi fix

Stephen Hemminger wrote, On 12/13/2007 09:41 PM:

> David Miller wrote:
>> From: Jarek Poplawski <[email protected]>
>> Date: Thu, 13 Dec 2007 21:16:12 +0100
>>
>>
>>> I see in a nearby thread you would prefer to save some work to drivers
>>> (like this netif_running() check), but I think this all is at the cost
>>> of flexibility, and there will probably appear new problems, when a
>>> driver simply can't wait till the next poll (which btw. looks strange
>>> with all these hotplugging, usb and powersaving).
>>>
>> As someone who has actually had to edit the NAPI support of _EVERY_
>> single driver in the tree I can tell you that code duplication and
>> subtle semantic differences are a huge issue.
>>
>> And when you talk about driver flexibility, it's wise to mention that
>> this comes at the expense of flexibility in the core implementation.
>> For example, if we export the list handling widget into the ->poll()
>> routines, god help the person who wants to change how the poll list is
>> managed in net_rx_action() :-/
>>
>> So we don't want to export datastructure details like that to the
>> driver.


(I hope you both don't mind I save some 'paper' and do this
2 in 1...)

So, you've seen a few drivers, know this much better than me, and
maybe even thought about why they are all so unnecessarily different...
Of course, if you think that despite those differences they can all
work with a simpler napi api then OK (as long as they don't have to do
any cheating, like with this 'work' here).

> Also, most of the drivers should/could be doing the same thing. It
> seems that driver writers just want to get "creative" and do things
> differently. The code is cleaner, safer, and less buggy if every device
> uses the interface in the same way.
>
> When I did the initial pass on this, I didn't see a single variation on
> NAPI usage that was better than the simple "get N packets and return"
> variation. But Dave did way more detailed grunt work on this.

It seems there are some differences in thinking what is simple/complex.
I think drivers' developers are used to controlling their devices, so
they know better when to turn on/off interrupts. So, maybe a similar model
could be appropriate here. Sometimes doing more looks simpler than doing
less and remembering how and when the rest will be done (like
this netif_running() test). But I hope I'm wrong here, and this will
work after all!

Cheers,
Jarek P.

2007-12-13 22:25:33

by Jarek Poplawski

Subject: Re: [RFC] net: napi fix

David Miller wrote, On 12/13/2007 09:37 PM:
...

> For example, if we export the list handling widget into the ->poll()
> routines, god help the person who wants to change how the poll list is
> managed in net_rx_action() :-/

...I'm afraid I can't understand: I mean doing the same but without
passing this info with 'work == weight': if driver sends this info,
why it can't instead call something like napi_continue() with
this list_move_tail() (and probably additional local_irq_disable()/
enable() - but since it's unlikely()?) which looks much more readable,
and saves one whole unlikely if ()?

Jarek P.

2007-12-13 22:34:53

by David Miller

Subject: Re: [RFC] net: napi fix

From: Jarek Poplawski <[email protected]>
Date: Thu, 13 Dec 2007 23:28:41 +0100

> ...I'm afraid I can't understand: I mean doing the same but without
> passing this info with 'work == weight': if driver sends this info,
> why it can't instead call something like napi_continue() with
> this list_move_tail() (and probably additional local_irq_disable()/
> enable() - but since it's unlikely()?) which looks much more readable,
> and saves one whole unlikely if ()?

Because the poll list is private to net_rx_action() and we don't
want to expose implementation details like that to every
->poll() implementation.

2007-12-13 22:55:39

by Jarek Poplawski

Subject: Re: [RFC] net: napi fix

David Miller wrote, On 12/13/2007 11:34 PM:

> From: Jarek Poplawski <[email protected]>
> Date: Thu, 13 Dec 2007 23:28:41 +0100
>
>> ...I'm afraid I can't understand: I mean doing the same but without
>> passing this info with 'work == weight': if driver sends this info,
>> why it can't instead call something like napi_continue() with
>> this list_move_tail() (and probably additional local_irq_disable()/
>> enable() - but since it's unlikely()?) which looks much more readable,
>> and saves one whole unlikely if ()?
>
> Because the poll list is private to net_rx_action() and we don't
> want to expose implementation details like that to every
> ->poll() implementation.

So, it seems 'we' already failed at that, e.g. by exposing
napi_complete()... OK, no offense, I'll only mention at the end that
there is always the possibility of redefining such a function to {}
with any change of implementation.

Jarek P.

2007-12-14 02:06:53

by Joonwoo Park

Subject: Re: [RFC] net: napi fix

2007/12/13, Andrew Gallatin <[email protected]>:
>
> If the netif_running() check is indeed required to make a device break
> out of napi polling and respond to an ifconfig down, then I think the
> netif_running() check should be moved up into net_rx_action() to avoid
> potential for driver complexity and bugs like the ones you found.
>
> Drew
>

Yep, it looks good.

Joonwoo

2007-12-20 10:41:14

by Robert Olsson

Subject: Re: [RFC] net: napi fix


David Miller writes:

> > Is the netif_running() check even required?
>
> No, it is not.
>
> When a device is brought down, one of the first things
> that happens is that we wait for all pending NAPI polls
> to complete, then block any new polls from starting.

Hello!

Yes, but the reason for the check was to avoid waiting for all pending
polls to complete, so that a server/router could be rebooted even under
high load and DoS. We've experienced some nasty problems with this.

Cheers.
--ro

2007-12-20 11:22:39

by David Miller

Subject: Re: [RFC] net: napi fix

From: Robert Olsson <[email protected]>
Date: Thu, 20 Dec 2007 10:52:17 +0100

> Yes but the reason was not to wait for all pending polls to
> complete so a server/router could be rebooted even under high-
> load and DOS. We've experienced some nasty problems with this.

I know, see the rest of the thread where I agree that
we need to deal with this somehow.

The device is marked down first, and somehow we need to
tip off of that to break out of the NAPI loop. This
"how" is what hasn't been resolved yet.
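
Purely as an illustration of one option floated earlier in the thread
(moving the netif_running() check up into the core loop), here is a
userspace sketch. All names are hypothetical, the dev_running flag only
stands in for netif_running(), and nothing here reflects an agreed-upon
fix.

```c
#include <stdbool.h>

/* Hypothetical model of one NAPI context; not the kernel's structure. */
struct toy_napi {
    bool dev_running;    /* stand-in for netif_running(dev) */
    int pending;         /* packets waiting in the RX ring */
    int weight;          /* per-poll quota */
};

/* Driver side: process up to 'budget' packets, return the count. */
static int toy_napi_poll(struct toy_napi *n, int budget)
{
    int done = n->pending < budget ? n->pending : budget;

    n->pending -= done;
    return done;
}

/*
 * Core side: one pass over a single device.
 * Returns true if the device should stay on the poll list.
 */
static bool toy_core_poll_once(struct toy_napi *n, int *work)
{
    *work = 0;

    /* Device marked down: leave the loop without calling ->poll(). */
    if (!n->dev_running)
        return false;

    *work = toy_napi_poll(n, n->weight);
    return *work == n->weight;
}
```

With the check in the core, the driver's poll routine in this model needs
no down-detection of its own, at the cost of one extra test per pass.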