2014-01-14 19:28:53

by Zoltan Kiss

[permalink] [raw]
Subject: [PATCH net-next] xen-netback: Rework rx_work_todo

The recent patch to fix receive side flow control (11b57f) solved the spinning
thread problem, however caused an another one. The receive side can stall, if:
- xenvif_rx_action sets rx_queue_stopped to false
- interrupt happens, and sets rx_event to true
- then xenvif_kthread sets rx_event to false

Also, through rx_event a malicious guest can force the RX thread to spin. This
patch ditch that two variable, and rework rx_work_todo. If the thread finds it
can't fit more skb's into the ring, it saves the last slot estimation into
rx_last_skb_slots, otherwise it's kept as 0. Then rx_work_todo will check if:
- there is something to send to the ring
- there is space for the topmost packet in the queue

Signed-off-by: Zoltan Kiss <[email protected]>
---
drivers/net/xen-netback/common.h | 6 +-----
drivers/net/xen-netback/interface.c | 1 -
drivers/net/xen-netback/netback.c | 16 ++++++----------
3 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 4c76bcb..ae413a2 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -143,11 +143,7 @@ struct xenvif {
char rx_irq_name[IFNAMSIZ+4]; /* DEVNAME-rx */
struct xen_netif_rx_back_ring rx;
struct sk_buff_head rx_queue;
- bool rx_queue_stopped;
- /* Set when the RX interrupt is triggered by the frontend.
- * The worker thread may need to wake the queue.
- */
- bool rx_event;
+ RING_IDX rx_last_skb_slots;

/* This array is allocated seperately as it is large */
struct gnttab_copy *grant_copy_op;
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index b9de31e..7669d49 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -100,7 +100,6 @@ static irqreturn_t xenvif_rx_interrupt(int irq, void *dev_id)
{
struct xenvif *vif = dev_id;

- vif->rx_event = true;
xenvif_kick_thread(vif);

return IRQ_HANDLED;
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 2738563..bb241d0 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -477,7 +477,6 @@ static void xenvif_rx_action(struct xenvif *vif)
unsigned long offset;
struct skb_cb_overlay *sco;
bool need_to_notify = false;
- bool ring_full = false;

struct netrx_pending_operations npo = {
.copy = vif->grant_copy_op,
@@ -487,7 +486,7 @@ static void xenvif_rx_action(struct xenvif *vif)
skb_queue_head_init(&rxq);

while ((skb = skb_dequeue(&vif->rx_queue)) != NULL) {
- int max_slots_needed;
+ RING_IDX max_slots_needed;
int i;

/* We need a cheap worse case estimate for the number of
@@ -510,9 +509,10 @@ static void xenvif_rx_action(struct xenvif *vif)
if (!xenvif_rx_ring_slots_available(vif, max_slots_needed)) {
skb_queue_head(&vif->rx_queue, skb);
need_to_notify = true;
- ring_full = true;
+ vif->rx_last_skb_slots = max_slots_needed;
break;
- }
+ } else
+ vif->rx_last_skb_slots = 0;

sco = (struct skb_cb_overlay *)skb->cb;
sco->meta_slots_used = xenvif_gop_skb(skb, &npo);
@@ -523,8 +523,6 @@ static void xenvif_rx_action(struct xenvif *vif)

BUG_ON(npo.meta_prod > ARRAY_SIZE(vif->meta));

- vif->rx_queue_stopped = !npo.copy_prod && ring_full;
-
if (!npo.copy_prod)
goto done;

@@ -1727,8 +1725,8 @@ static struct xen_netif_rx_response *make_rx_response(struct xenvif *vif,

static inline int rx_work_todo(struct xenvif *vif)
{
- return (!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) ||
- vif->rx_event;
+ return !skb_queue_empty(&vif->rx_queue) &&
+ xenvif_rx_ring_slots_available(vif, vif->rx_last_skb_slots);
}

static inline int tx_work_todo(struct xenvif *vif)
@@ -1814,8 +1812,6 @@ int xenvif_kthread(void *data)
if (!skb_queue_empty(&vif->rx_queue))
xenvif_rx_action(vif);

- vif->rx_event = false;
-
if (skb_queue_empty(&vif->rx_queue) &&
netif_queue_stopped(vif->dev))
xenvif_start_queue(vif);


2014-01-15 10:37:25

by Wei Liu

[permalink] [raw]
Subject: Re: [PATCH net-next] xen-netback: Rework rx_work_todo

On Tue, Jan 14, 2014 at 07:28:39PM +0000, Zoltan Kiss wrote:
> The recent patch to fix receive side flow control (11b57f) solved the spinning
> thread problem, however caused an another one. The receive side can stall, if:
> - xenvif_rx_action sets rx_queue_stopped to false
> - interrupt happens, and sets rx_event to true
> - then xenvif_kthread sets rx_event to false
>

If you mean "rx_work_todo" returns false.

In this case

(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;

can still be true, can't it?

> Also, through rx_event a malicious guest can force the RX thread to spin. This
> patch ditch that two variable, and rework rx_work_todo. If the thread finds it

This seems to be a bigger problem. Can you elaborate?

Wei.

2014-01-15 11:47:08

by Zoltan Kiss

[permalink] [raw]
Subject: Re: [PATCH net-next] xen-netback: Rework rx_work_todo

On 15/01/14 10:37, Wei Liu wrote:
> On Tue, Jan 14, 2014 at 07:28:39PM +0000, Zoltan Kiss wrote:
>> The recent patch to fix receive side flow control (11b57f) solved the spinning
>> thread problem, however caused an another one. The receive side can stall, if:
>> - xenvif_rx_action sets rx_queue_stopped to false
>> - interrupt happens, and sets rx_event to true
>> - then xenvif_kthread sets rx_event to false
>>
>
> If you mean "rx_work_todo" returns false.
>
> In this case
>
> (!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
>
> can still be true, can't it?
Sorry, I should wrote rx_queue_stopped to true

>
>> Also, through rx_event a malicious guest can force the RX thread to spin. This
>> patch ditch that two variable, and rework rx_work_todo. If the thread finds it
>
> This seems to be a bigger problem. Can you elaborate?
My mistake too. I forgot that rx_action set it to false, so it's not
really a spinning. However the thread should still run xenvif_rx_action
to figure out there is no space in the ring before it sets rx_event to
false. In my patch we can quit earlier.

Zoli

2014-01-15 14:45:22

by Wei Liu

[permalink] [raw]
Subject: Re: [PATCH net-next] xen-netback: Rework rx_work_todo

On Wed, Jan 15, 2014 at 11:47:02AM +0000, Zoltan Kiss wrote:
> On 15/01/14 10:37, Wei Liu wrote:
> >On Tue, Jan 14, 2014 at 07:28:39PM +0000, Zoltan Kiss wrote:
> >>The recent patch to fix receive side flow control (11b57f) solved the spinning
> >>thread problem, however caused an another one. The receive side can stall, if:
> >>- xenvif_rx_action sets rx_queue_stopped to false
> >>- interrupt happens, and sets rx_event to true
> >>- then xenvif_kthread sets rx_event to false
> >>
> >
> >If you mean "rx_work_todo" returns false.
> >
> >In this case
> >
> >(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
> >
> >can still be true, can't it?
> Sorry, I should wrote rx_queue_stopped to true
>

In this case, if rx_queue_stopped is true, then we're expecting frontend
to notify us, right?

rx_queue_stopped is set to true if we cannot make any progress to queue
packet into the ring. In that situation we can expect frontend will send
notification to backend after it goes through the backlog in the ring.
That means rx_event is set to true, and rx_work_todo is true again. So
the ring is actually not stalled in this case as well. Did I miss
something?

> >
> >>Also, through rx_event a malicious guest can force the RX thread to spin. This
> >>patch ditch that two variable, and rework rx_work_todo. If the thread finds it
> >
> >This seems to be a bigger problem. Can you elaborate?
> My mistake too. I forgot that rx_action set it to false, so it's not
> really a spinning. However the thread should still run
> xenvif_rx_action to figure out there is no space in the ring before
> it sets rx_event to false. In my patch we can quit earlier.
>
> Zoli

2014-01-15 14:52:47

by Zoltan Kiss

[permalink] [raw]
Subject: Re: [PATCH net-next] xen-netback: Rework rx_work_todo

On 15/01/14 14:45, Wei Liu wrote:
>>>> The recent patch to fix receive side flow control (11b57f) solved the spinning
>>>> > >>thread problem, however caused an another one. The receive side can stall, if:
>>>> > >>- xenvif_rx_action sets rx_queue_stopped to false
>>>> > >>- interrupt happens, and sets rx_event to true
>>>> > >>- then xenvif_kthread sets rx_event to false
>>>> > >>
>>> > >
>>> > >If you mean "rx_work_todo" returns false.
>>> > >
>>> > >In this case
>>> > >
>>> > >(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
>>> > >
>>> > >can still be true, can't it?
>> >Sorry, I should wrote rx_queue_stopped to true
>> >
> In this case, if rx_queue_stopped is true, then we're expecting frontend
> to notify us, right?
>
> rx_queue_stopped is set to true if we cannot make any progress to queue
> packet into the ring. In that situation we can expect frontend will send
> notification to backend after it goes through the backlog in the ring.
> That means rx_event is set to true, and rx_work_todo is true again. So
> the ring is actually not stalled in this case as well. Did I miss
> something?
>

Yes, we expect the guest to notify us, and it does, and we set rx_event
to true (see second point), but then the thread set it to false (see
third point). Talking with Paul, another solution could be to set
rx_event false before calling xenvif_rx_action. But using
rx_last_skb_slots makes it quicker for the thread to see if it doesn't
have to do anything.

Zoli

2014-01-15 14:59:56

by Wei Liu

[permalink] [raw]
Subject: Re: [PATCH net-next] xen-netback: Rework rx_work_todo

On Wed, Jan 15, 2014 at 02:52:41PM +0000, Zoltan Kiss wrote:
> On 15/01/14 14:45, Wei Liu wrote:
> >>>>The recent patch to fix receive side flow control (11b57f) solved the spinning
> >>>>> >>thread problem, however caused an another one. The receive side can stall, if:
> >>>>> >>- xenvif_rx_action sets rx_queue_stopped to false
> >>>>> >>- interrupt happens, and sets rx_event to true
> >>>>> >>- then xenvif_kthread sets rx_event to false
> >>>>> >>
> >>>> >
> >>>> >If you mean "rx_work_todo" returns false.
> >>>> >
> >>>> >In this case
> >>>> >
> >>>> >(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
> >>>> >
> >>>> >can still be true, can't it?
> >>>Sorry, I should wrote rx_queue_stopped to true
> >>>
> >In this case, if rx_queue_stopped is true, then we're expecting frontend
> >to notify us, right?
> >
> >rx_queue_stopped is set to true if we cannot make any progress to queue
> >packet into the ring. In that situation we can expect frontend will send
> >notification to backend after it goes through the backlog in the ring.
> >That means rx_event is set to true, and rx_work_todo is true again. So
> >the ring is actually not stalled in this case as well. Did I miss
> >something?
> >
>
> Yes, we expect the guest to notify us, and it does, and we set
> rx_event to true (see second point), but then the thread set it to
> false (see third point). Talking with Paul, another solution could
> be to set rx_event false before calling xenvif_rx_action. But using
> rx_last_skb_slots makes it quicker for the thread to see if it
> doesn't have to do anything.
>

OK, this is a better explaination. So actually there's no bug in the
original implementation and your patch is sort of an improvement.

Could you send a new version of this patch with relevant information in
commit message? Talking to people offline is faster, but I would like to
have public discussion and relevant information archived in a searchable
form. Thanks.

Wei.

> Zoli

2014-01-15 15:08:54

by Zoltan Kiss

[permalink] [raw]
Subject: Re: [PATCH net-next] xen-netback: Rework rx_work_todo

On 15/01/14 14:59, Wei Liu wrote:
> On Wed, Jan 15, 2014 at 02:52:41PM +0000, Zoltan Kiss wrote:
>> On 15/01/14 14:45, Wei Liu wrote:
>>>>>> The recent patch to fix receive side flow control (11b57f) solved the spinning
>>>>>>>>> thread problem, however caused an another one. The receive side can stall, if:
>>>>>>>>> - xenvif_rx_action sets rx_queue_stopped to false
>>>>>>>>> - interrupt happens, and sets rx_event to true
>>>>>>>>> - then xenvif_kthread sets rx_event to false
>>>>>>>>>
>>>>>>>
>>>>>>> If you mean "rx_work_todo" returns false.
>>>>>>>
>>>>>>> In this case
>>>>>>>
>>>>>>> (!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
>>>>>>>
>>>>>>> can still be true, can't it?
>>>>> Sorry, I should wrote rx_queue_stopped to true
>>>>>
>>> In this case, if rx_queue_stopped is true, then we're expecting frontend
>>> to notify us, right?
>>>
>>> rx_queue_stopped is set to true if we cannot make any progress to queue
>>> packet into the ring. In that situation we can expect frontend will send
>>> notification to backend after it goes through the backlog in the ring.
>>> That means rx_event is set to true, and rx_work_todo is true again. So
>>> the ring is actually not stalled in this case as well. Did I miss
>>> something?
>>>
>>
>> Yes, we expect the guest to notify us, and it does, and we set
>> rx_event to true (see second point), but then the thread set it to
>> false (see third point). Talking with Paul, another solution could
>> be to set rx_event false before calling xenvif_rx_action. But using
>> rx_last_skb_slots makes it quicker for the thread to see if it
>> doesn't have to do anything.
>>
>
> OK, this is a better explaination. So actually there's no bug in the
> original implementation and your patch is sort of an improvement.
>
> Could you send a new version of this patch with relevant information in
> commit message? Talking to people offline is faster, but I would like to
> have public discussion and relevant information archived in a searchable
> form. Thanks.

No, there is a bug in the original implementation:
- [THREAD] xenvif_rx_action sets rx_queue_stopped to true
- [INTERRUPT] interrupt happens, and sets rx_event to true
- [THREAD] then xenvif_kthread sets rx_event to false
- [THREAD] rx_work_todo never returns true anymore

I will update the explanation and send in the patch again.

Zoli

2014-01-15 16:10:18

by Wei Liu

[permalink] [raw]
Subject: Re: [PATCH net-next] xen-netback: Rework rx_work_todo

On Wed, Jan 15, 2014 at 03:08:12PM +0000, Zoltan Kiss wrote:
> On 15/01/14 14:59, Wei Liu wrote:
> >On Wed, Jan 15, 2014 at 02:52:41PM +0000, Zoltan Kiss wrote:
> >>On 15/01/14 14:45, Wei Liu wrote:
> >>>>>>The recent patch to fix receive side flow control (11b57f) solved the spinning
> >>>>>>>>>thread problem, however caused an another one. The receive side can stall, if:
> >>>>>>>>>- xenvif_rx_action sets rx_queue_stopped to false
> >>>>>>>>>- interrupt happens, and sets rx_event to true
> >>>>>>>>>- then xenvif_kthread sets rx_event to false
> >>>>>>>>>
> >>>>>>>
> >>>>>>>If you mean "rx_work_todo" returns false.
> >>>>>>>
> >>>>>>>In this case
> >>>>>>>
> >>>>>>>(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
> >>>>>>>
> >>>>>>>can still be true, can't it?
> >>>>>Sorry, I should wrote rx_queue_stopped to true
> >>>>>
> >>>In this case, if rx_queue_stopped is true, then we're expecting frontend
> >>>to notify us, right?
> >>>
> >>>rx_queue_stopped is set to true if we cannot make any progress to queue
> >>>packet into the ring. In that situation we can expect frontend will send
> >>>notification to backend after it goes through the backlog in the ring.
> >>>That means rx_event is set to true, and rx_work_todo is true again. So
> >>>the ring is actually not stalled in this case as well. Did I miss
> >>>something?
> >>>
> >>
> >>Yes, we expect the guest to notify us, and it does, and we set
> >>rx_event to true (see second point), but then the thread set it to
> >>false (see third point). Talking with Paul, another solution could
> >>be to set rx_event false before calling xenvif_rx_action. But using
> >>rx_last_skb_slots makes it quicker for the thread to see if it
> >>doesn't have to do anything.
> >>
> >
> >OK, this is a better explaination. So actually there's no bug in the
> >original implementation and your patch is sort of an improvement.
> >
> >Could you send a new version of this patch with relevant information in
> >commit message? Talking to people offline is faster, but I would like to
> >have public discussion and relevant information archived in a searchable
> >form. Thanks.
>
> No, there is a bug in the original implementation:
> - [THREAD] xenvif_rx_action sets rx_queue_stopped to true
> - [INTERRUPT] interrupt happens, and sets rx_event to true
> - [THREAD] then xenvif_kthread sets rx_event to false
> - [THREAD] rx_work_todo never returns true anymore
>

I see what you mean. The interrupt is "lost", that's why it's stalled.

> I will update the explanation and send in the patch again.
>

Thanks.

Wei.

> Zoli