The recent patch to fix receive-side flow control (11b57f) solved the spinning
thread problem, but it introduced another one. The receive side can stall if:
- [THREAD] xenvif_rx_action sets rx_queue_stopped to true
- [INTERRUPT] an interrupt arrives and sets rx_event to true
- [THREAD] xenvif_kthread then sets rx_event to false
- [THREAD] rx_work_todo no longer returns true
Also, if an interrupt arrives while there is still no room in the ring, it takes
quite a long time until xenvif_rx_action notices. This patch drops those two
variables and reworks rx_work_todo. If the thread finds it can't fit more skbs
into the ring, it saves the last slot estimate in rx_last_skb_slots; otherwise
the field is kept at 0. rx_work_todo then checks that:
- there is something to send to the ring (as before)
- there is space for the topmost packet in the queue
I think that is a more natural and better thing to test than two bools that are
set somewhere else.
Signed-off-by: Zoltan Kiss <[email protected]>
---
drivers/net/xen-netback/common.h | 6 +-----
drivers/net/xen-netback/interface.c | 1 -
drivers/net/xen-netback/netback.c | 16 ++++++----------
3 files changed, 7 insertions(+), 16 deletions(-)
diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 4c76bcb..ae413a2 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -143,11 +143,7 @@ struct xenvif {
char rx_irq_name[IFNAMSIZ+4]; /* DEVNAME-rx */
struct xen_netif_rx_back_ring rx;
struct sk_buff_head rx_queue;
- bool rx_queue_stopped;
- /* Set when the RX interrupt is triggered by the frontend.
- * The worker thread may need to wake the queue.
- */
- bool rx_event;
+ RING_IDX rx_last_skb_slots;
/* This array is allocated seperately as it is large */
struct gnttab_copy *grant_copy_op;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index b9de31e..7669d49 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -100,7 +100,6 @@ static irqreturn_t xenvif_rx_interrupt(int irq, void *dev_id)
{
struct xenvif *vif = dev_id;
- vif->rx_event = true;
xenvif_kick_thread(vif);
return IRQ_HANDLED;
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 2738563..bb241d0 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -477,7 +477,6 @@ static void xenvif_rx_action(struct xenvif *vif)
unsigned long offset;
struct skb_cb_overlay *sco;
bool need_to_notify = false;
- bool ring_full = false;
struct netrx_pending_operations npo = {
.copy = vif->grant_copy_op,
@@ -487,7 +486,7 @@ static void xenvif_rx_action(struct xenvif *vif)
skb_queue_head_init(&rxq);
while ((skb = skb_dequeue(&vif->rx_queue)) != NULL) {
- int max_slots_needed;
+ RING_IDX max_slots_needed;
int i;
/* We need a cheap worse case estimate for the number of
@@ -510,9 +509,10 @@ static void xenvif_rx_action(struct xenvif *vif)
if (!xenvif_rx_ring_slots_available(vif, max_slots_needed)) {
skb_queue_head(&vif->rx_queue, skb);
need_to_notify = true;
- ring_full = true;
+ vif->rx_last_skb_slots = max_slots_needed;
break;
- }
+ } else
+ vif->rx_last_skb_slots = 0;
sco = (struct skb_cb_overlay *)skb->cb;
sco->meta_slots_used = xenvif_gop_skb(skb, &npo);
@@ -523,8 +523,6 @@ static void xenvif_rx_action(struct xenvif *vif)
BUG_ON(npo.meta_prod > ARRAY_SIZE(vif->meta));
- vif->rx_queue_stopped = !npo.copy_prod && ring_full;
-
if (!npo.copy_prod)
goto done;
@@ -1727,8 +1725,8 @@ static struct xen_netif_rx_response *make_rx_response(struct xenvif *vif,
static inline int rx_work_todo(struct xenvif *vif)
{
- return (!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) ||
- vif->rx_event;
+ return !skb_queue_empty(&vif->rx_queue) &&
+ xenvif_rx_ring_slots_available(vif, vif->rx_last_skb_slots);
}
static inline int tx_work_todo(struct xenvif *vif)
@@ -1814,8 +1812,6 @@ int xenvif_kthread(void *data)
if (!skb_queue_empty(&vif->rx_queue))
xenvif_rx_action(vif);
- vif->rx_event = false;
-
if (skb_queue_empty(&vif->rx_queue) &&
netif_queue_stopped(vif->dev))
xenvif_start_queue(vif);
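The reworked check can be modelled outside the kernel. The following is only a
minimal sketch under stated assumptions: struct toy_vif, its counters, and the
toy_* functions are hypothetical stand-ins for struct xenvif, the skb queue,
the shared ring, xenvif_rx_ring_slots_available, rx_work_todo and the loop in
xenvif_rx_action. It ignores the real ring's producer/consumer indices and
event handling, and it assumes every queued skb needs the same number of slots.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct xenvif. */
struct toy_vif {
	unsigned int queued_skbs;       /* models the length of vif->rx_queue */
	unsigned int free_slots;        /* models free request slots in the RX ring */
	unsigned int rx_last_skb_slots; /* estimate for the skb that did not fit, else 0 */
};

/* Models xenvif_rx_ring_slots_available(): is there room for 'needed' slots? */
bool toy_slots_available(const struct toy_vif *vif, unsigned int needed)
{
	return vif->free_slots >= needed;
}

/* Models the reworked rx_work_todo(): work exists only if something is
 * queued AND the skb at the head of the queue would fit in the ring. */
bool toy_rx_work_todo(const struct toy_vif *vif)
{
	return vif->queued_skbs > 0 &&
	       toy_slots_available(vif, vif->rx_last_skb_slots);
}

/* Models the dequeue loop in xenvif_rx_action() for skbs that each need
 * 'slots_per_skb' slots (an assumption made to keep the model small). */
void toy_rx_action(struct toy_vif *vif, unsigned int slots_per_skb)
{
	assert(slots_per_skb > 0);

	while (vif->queued_skbs > 0) {
		if (!toy_slots_available(vif, slots_per_skb)) {
			/* skb stays queued; remember how much room it needs */
			vif->rx_last_skb_slots = slots_per_skb;
			break;
		}
		vif->rx_last_skb_slots = 0;
		vif->free_slots -= slots_per_skb;
		vif->queued_skbs--;
	}
}
```

In this model, an interrupt that frees too few slots leaves toy_rx_work_todo
false, so the thread is not woken for nothing. Once enough slots are free, the
check becomes true without depending on any flag that another context could
clear.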
Any reviews on this one? It fixes an important lockup situation, so
either this or some other fix should go in soon.
> -----Original Message-----
> From: Zoltan Kiss
> Sent: 20 January 2014 12:23
> To: Ian Campbell; Wei Liu; [email protected];
> [email protected]; [email protected]; Jonathan Davies
> Cc: Paul Durrant
> Subject: Re: [PATCH net-next v2] xen-netback: Rework rx_work_todo
>
> Any reviews on this one? It fixes an important lockup situation, so
> either this or some other fix should go in soon.
>
> On 15/01/14 17:11, Zoltan Kiss wrote:
> > @@ -1814,8 +1812,6 @@ int xenvif_kthread(void *data)
> > if (!skb_queue_empty(&vif->rx_queue))
> > xenvif_rx_action(vif);
> >
> > - vif->rx_event = false;
> > -
The minimal patch is to simply move this line up above the previous if clause,
but I'm happy with your patch as it stands, so:
Reviewed-by: Paul Durrant <[email protected]>
> > if (skb_queue_empty(&vif->rx_queue) &&
> > netif_queue_stopped(vif->dev))
> > xenvif_start_queue(vif);
> >
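The stall and the minimal reordering mentioned above can be replayed in a small
deterministic model. This is only a sketch under assumptions: struct
toy_old_vif and the toy_* functions are hypothetical, the ring is assumed to be
full so that xenvif_rx_action copies nothing and sets rx_queue_stopped, and the
frontend's interrupt is assumed to land right after xenvif_rx_action returns.

```c
#include <stdbool.h>

/* Hypothetical model of the pre-patch state kept in struct xenvif. */
struct toy_old_vif {
	bool queue_nonempty;   /* models !skb_queue_empty(&vif->rx_queue) */
	bool rx_queue_stopped;
	bool rx_event;
};

/* Models the pre-patch rx_work_todo(). */
bool toy_old_rx_work_todo(const struct toy_old_vif *vif)
{
	return (vif->queue_nonempty && !vif->rx_queue_stopped) ||
	       vif->rx_event;
}

/* Replays one pass of the thread body with the interleaving from the
 * commit message. 'clear_event_first' selects the reordered variant, in
 * which rx_event is cleared before xenvif_rx_action runs. */
void toy_old_iteration(struct toy_old_vif *vif, bool clear_event_first)
{
	if (clear_event_first)
		vif->rx_event = false;  /* reordered: clear before the work */

	vif->rx_queue_stopped = true;   /* [THREAD] rx_action: ring full */
	vif->rx_event = true;           /* [INTERRUPT] frontend posts requests */

	if (!clear_event_first)
		vif->rx_event = false;  /* original: the event is lost here */
}
```

With the original ordering the event is overwritten and
toy_old_rx_work_todo returns false although the queue is not empty. With the
clear moved ahead of the work, the event survives until the next pass.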
On Wed, Jan 15, 2014 at 05:11:07PM +0000, Zoltan Kiss wrote:
> Signed-off-by: Zoltan Kiss <[email protected]>
Sorry for the delay.
Paul, thanks for reviewing.
Acked-by: Wei Liu <[email protected]>
Wei.
On 20/01/14 16:38, Wei Liu wrote:
> On Wed, Jan 15, 2014 at 05:11:07PM +0000, Zoltan Kiss wrote:
>> Signed-off-by: Zoltan Kiss <[email protected]>
>
> Sorry for the delay.
>
> Paul, thanks for reviewing.
>
> Acked-by: Wei Liu <[email protected]>
Hi,
This patch hasn't made it to net-next yet, maybe because the subject
doesn't suggest that this is a bugfix. I suggest applying it as soon as
possible, otherwise netback will be quite broken.
Zoli
On 04/02/14 19:19, Zoltan Kiss wrote:
> On 20/01/14 16:38, Wei Liu wrote:
>> On Wed, Jan 15, 2014 at 05:11:07PM +0000, Zoltan Kiss wrote:
>>> Signed-off-by: Zoltan Kiss <[email protected]>
>>
>> Sorry for the delay.
>>
>> Paul, thanks for reviewing.
>>
>> Acked-by: Wei Liu <[email protected]>
>
> Hi,
>
> This patch hasn't made it to net-next yet, maybe because the subject
> doesn't suggest that this is a bugfix. I suggest applying it as soon as
> possible, otherwise netback will be quite broken.
I've reposted it with a clearer subject; sorry for being too vague.
Zoli