2024-03-14 07:51:44

by Gavin Shan

Subject: [PATCH] virtio_ring: Fix the stale index in available ring

The issue is reported by Yihuang Yu, who ran 'netperf' tests on
NVidia's grace-grace and grace-hopper machines. The 'netperf'
client is started in a VM hosted by the grace-hopper machine,
while the 'netperf' server runs on the grace-grace machine.

The VM is started with virtio-net and vhost enabled. We observe
an error message spew from the VM, followed by a soft-lockup
report. The error message indicates that the data associated with
the descriptor (index: 135) has been released and the queue has
been marked as broken. This eventually leads to an endless effort
to fetch a free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
and to the soft-lockup. The stale index 135 is fetched from the
available ring and published to the used ring by vhost, meaning we
have a disordered write to the available ring element and the
available index.

/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
-accel kvm -machine virt,gic-version=host \
: \
-netdev tap,id=vnet0,vhost=on \
-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \

[ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!

Fix the issue by replacing virtio_wmb(vq->weak_barriers) with the
stronger virtio_mb(false), equivalent to replacing the 'dmb'
instruction with 'dsb' on ARM64. It should work for other
architectures, but a performance loss is expected.

Cc: [email protected]
Reported-by: Yihuang Yu <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
---
drivers/virtio/virtio_ring.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

- /* Descriptors and available array need to be set before we expose the
- * new available array entries. */
- virtio_wmb(vq->weak_barriers);
+ /*
+ * Descriptors and available array need to be set before we expose
+ * the new available array entries. virtio_wmb() should be enough
+ * to ensure the order theoretically. However, a stronger barrier
+ * is needed by ARM64. Otherwise, the stale data can be observed
+ * by the host (vhost). A stronger barrier should work for other
+ * architectures, but performance loss is expected.
+ */
+ virtio_mb(false);
vq->split.avail_idx_shadow++;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);
--
2.44.0



2024-03-14 08:05:49

by Michael S. Tsirkin

Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue is reported by Yihuang Yu who have 'netperf' test on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in the VM hosted by grace-hopper machine,
> while the 'netperf' server is running on grace-grace machine.
>
> The VM is started with virtio-net and vhost has been enabled.
> We observe a error message spew from VM and then soft-lockup
> report. The error message indicates the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to the endless effort
> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordred write to the available ring element and available index.
>
> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> -accel kvm -machine virt,gic-version=host \
> : \
> -netdev tap,id=vnet0,vhost=on \
> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>
> [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> ARM64. It should work for other architectures, but performance loss is
> expected.
>
> Cc: [email protected]
> Reported-by: Yihuang Yu <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>
> ---
> drivers/virtio/virtio_ring.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>
> - /* Descriptors and available array need to be set before we expose the
> - * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> + * Descriptors and available array need to be set before we expose
> + * the new available array entries. virtio_wmb() should be enough
> + * to ensuere the order theoretically. However, a stronger barrier
> + * is needed by ARM64. Otherwise, the stale data can be observed
> + * by the host (vhost). A stronger barrier should work for other
> + * architectures, but performance loss is expected.
> + */
> + virtio_mb(false);


I don't get what is going on here. Any explanation why virtio_wmb is not
enough besides "it does not work"?

> vq->split.avail_idx_shadow++;
> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> vq->split.avail_idx_shadow);
> --
> 2.44.0


2024-03-14 10:16:04

by Gavin Shan

Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/14/24 18:05, Michael S. Tsirkin wrote:
> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
>> The issue is reported by Yihuang Yu who have 'netperf' test on
>> NVidia's grace-grace and grace-hopper machines. The 'netperf'
>> client is started in the VM hosted by grace-hopper machine,
>> while the 'netperf' server is running on grace-grace machine.
>>
>> The VM is started with virtio-net and vhost has been enabled.
>> We observe a error message spew from VM and then soft-lockup
>> report. The error message indicates the data associated with
>> the descriptor (index: 135) has been released, and the queue
>> is marked as broken. It eventually leads to the endless effort
>> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
>> and soft-lockup. The stale index 135 is fetched from the available
>> ring and published to the used ring by vhost, meaning we have
>> disordred write to the available ring element and available index.
>>
>> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>> -accel kvm -machine virt,gic-version=host \
>> : \
>> -netdev tap,id=vnet0,vhost=on \
>> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>>
>> [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>>
>> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
>> virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
>> ARM64. It should work for other architectures, but performance loss is
>> expected.
>>
>> Cc: [email protected]
>> Reported-by: Yihuang Yu <[email protected]>
>> Signed-off-by: Gavin Shan <[email protected]>
>> ---
>> drivers/virtio/virtio_ring.c | 12 +++++++++---
>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index 49299b1f9ec7..7d852811c912 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>
>> - /* Descriptors and available array need to be set before we expose the
>> - * new available array entries. */
>> - virtio_wmb(vq->weak_barriers);
>> + /*
>> + * Descriptors and available array need to be set before we expose
>> + * the new available array entries. virtio_wmb() should be enough
>> + * to ensuere the order theoretically. However, a stronger barrier
>> + * is needed by ARM64. Otherwise, the stale data can be observed
>> + * by the host (vhost). A stronger barrier should work for other
>> + * architectures, but performance loss is expected.
>> + */
>> + virtio_mb(false);
>
>
> I don't get what is going on here. Any explanation why virtio_wmb is not
> enough besides "it does not work"?
>

The change replaces the "dmb" instruction with "dsb". "dsb" is a stronger
barrier than "dmb" because "dsb" ensures that all memory accesses issued
before it have completed by the time the 'dsb' instruction completes.
"dmb", in contrast, doesn't guarantee the completion of those memory
accesses, only their relative order.

So 'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
can be completed before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
The stronger barrier 'dsb' ensures the completion order as we expect.

virtio_wmb(true)        virtio_mb(false)
  virt_wmb                mb
  __smp_wmb               __mb
  dmb(ishst)              dsb(sy)
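
Roughly, the arm64 definitions behind this table look as follows
(paraphrased from arch/arm64/include/asm/barrier.h, trimmed for
illustration):

#define dmb(opt)        asm volatile("dmb " #opt : : : "memory")
#define dsb(opt)        asm volatile("dsb " #opt : : : "memory")

#define __smp_wmb()     dmb(ishst)      /* virtio_wmb(true) lands here  */
#define __mb()          dsb(sy)         /* virtio_mb(false) lands here  */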


Extracted from the ARMv9 specification
=======================================
The DMB instruction is a memory barrier instruction that ensures the relative
order of memory accesses before the barrier with memory accesses after the
barrier. The DMB instruction _does not_ ensure the completion of any of the
memory accesses for which it ensures relative order.

A DSB instruction is a memory barrier that ensures that memory accesses that
occur before the DSB instruction have __completed__ before the completion of
the DSB instruction. In doing this, it acts as a stronger barrier than a DMB
and all ordering that is created by a DMB with specific options is also generated
by a DSB with the same options.

>> vq->split.avail_idx_shadow++;
>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>> vq->split.avail_idx_shadow);
>> --
>> 2.44.0
>

Thanks,
Gavin


2024-03-14 11:50:43

by Michael S. Tsirkin

Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > >
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe a error message spew from VM and then soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordred write to the available ring element and available index.
> > >
> > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > > -accel kvm -machine virt,gic-version=host \
> > > : \
> > > -netdev tap,id=vnet0,vhost=on \
> > > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > >
> > > [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > >
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > >
> > > Cc: [email protected]
> > > Reported-by: Yihuang Yu <[email protected]>
> > > Signed-off-by: Gavin Shan <[email protected]>
> > > ---
> > > drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > - * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > + * Descriptors and available array need to be set before we expose
> > > + * the new available array entries. virtio_wmb() should be enough
> > > + * to ensuere the order theoretically. However, a stronger barrier
> > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > + * by the host (vhost). A stronger barrier should work for other
> > > + * architectures, but performance loss is expected.
> > > + */
> > > + virtio_mb(false);
> >
> >
> > I don't get what is going on here. Any explanation why virtio_wmb is not
> > enough besides "it does not work"?
> >
>
> The change is replacing instruction "dmb" with "dsb". "dsb" is stronger barrier
> than "dmb" because "dsb" ensures that all memory accesses raised before this
> instruction is completed when the 'dsb' instruction completes. However, "dmb"
> doesn't guarantee the order of completion of the memory accesses.
>
> So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
> can be completed before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.

Completed as observed by which CPU?
We have 2 writes that we want observed by another CPU in order.
So if that CPU observes a new value of idx, we want it to see the
new value in the ring.
This is the standard use of smp_wmb().
How are these 2 writes different?

What DMB does is that it seems to ensure that the effects of
'vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow)'
are observed after the effects of
'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
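
Spelled out, the standard pairing looks like this (a minimal user-space
sketch using C11 fences in place of the kernel's smp_wmb()/smp_rmb();
the names are illustrative, not the actual virtio code):

#include <stdatomic.h>

static int ring[256];           /* plays the role of avail->ring */
static atomic_int idx;          /* plays the role of avail->idx  */

/* Producer: fill the ring slot, then publish the new index. */
static void publish(int slot, int head)
{
    ring[slot] = head;
    atomic_thread_fence(memory_order_release);      /* smp_wmb() analogue */
    atomic_store_explicit(&idx, slot + 1, memory_order_relaxed);
}

/* Consumer: read the index, then the slot it covers. */
static int consume(void)
{
    int i = atomic_load_explicit(&idx, memory_order_relaxed);

    if (!i)
        return -1;                                  /* nothing published yet */
    atomic_thread_fence(memory_order_acquire);      /* smp_rmb() analogue */
    return ring[i - 1];         /* must see the value written above */
}

If the consumer sees the new idx, the fences guarantee it also sees the
new ring entry; nothing requires the producer's stores to have
*completed* by any particular point.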




> The stronger barrier 'dsb' ensures the completion order as we expected.
>
> virtio_wmb(true) virt_mb(false)
> virt_wmb mb
> __smp_wmb __mb
> dmb(ishst) dsb(sy)

First, why would you want a non smp barrier when you are
synchronizing with another CPU? This is what smp_ means ...


> Extraced from ARMv9 specificaton
> ================================
> The DMB instruction is a memory barrier instruction that ensures the relative
> order of memory accesses before the barrier with memory accesses after the
> barrier. The DMB instruction _does not_ ensure the completion of any of the
> memory accesses for which it ensures relative order.

Isn't this exactly what we need?

> A DSB instruction is a memory barrier that ensures that memory accesses that
> occur before the DSB instruction have __completed__ before the completion of
> the DSB instruction.

This seems appropriate for when you want to order things more
strongly. I do not get why it's necessary here.

> In doing this, it acts as a stronger barrier than a DMB
> and all ordering that is created by a DMB with specific options is also generated
> by a DSB with the same options.
>
> > > vq->split.avail_idx_shadow++;
> > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > vq->split.avail_idx_shadow);
> > > --
> > > 2.44.0
> >
>
> Thanks,
> Gavin


2024-03-14 12:50:34

by Gavin Shan

Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/14/24 21:50, Michael S. Tsirkin wrote:
> On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
>> On 3/14/24 18:05, Michael S. Tsirkin wrote:
>>> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
>>>> The issue is reported by Yihuang Yu who have 'netperf' test on
>>>> NVidia's grace-grace and grace-hopper machines. The 'netperf'
>>>> client is started in the VM hosted by grace-hopper machine,
>>>> while the 'netperf' server is running on grace-grace machine.
>>>>
>>>> The VM is started with virtio-net and vhost has been enabled.
>>>> We observe a error message spew from VM and then soft-lockup
>>>> report. The error message indicates the data associated with
>>>> the descriptor (index: 135) has been released, and the queue
>>>> is marked as broken. It eventually leads to the endless effort
>>>> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
>>>> and soft-lockup. The stale index 135 is fetched from the available
>>>> ring and published to the used ring by vhost, meaning we have
>>>> disordred write to the available ring element and available index.
>>>>
>>>> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>>> -accel kvm -machine virt,gic-version=host \
>>>> : \
>>>> -netdev tap,id=vnet0,vhost=on \
>>>> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>>>>
>>>> [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>>>>
>>>> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
>>>> virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
>>>> ARM64. It should work for other architectures, but performance loss is
>>>> expected.
>>>>
>>>> Cc: [email protected]
>>>> Reported-by: Yihuang Yu <[email protected]>
>>>> Signed-off-by: Gavin Shan <[email protected]>
>>>> ---
>>>> drivers/virtio/virtio_ring.c | 12 +++++++++---
>>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>> index 49299b1f9ec7..7d852811c912 100644
>>>> --- a/drivers/virtio/virtio_ring.c
>>>> +++ b/drivers/virtio/virtio_ring.c
>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>> - /* Descriptors and available array need to be set before we expose the
>>>> - * new available array entries. */
>>>> - virtio_wmb(vq->weak_barriers);
>>>> + /*
>>>> + * Descriptors and available array need to be set before we expose
>>>> + * the new available array entries. virtio_wmb() should be enough
>>>> + * to ensuere the order theoretically. However, a stronger barrier
>>>> + * is needed by ARM64. Otherwise, the stale data can be observed
>>>> + * by the host (vhost). A stronger barrier should work for other
>>>> + * architectures, but performance loss is expected.
>>>> + */
>>>> + virtio_mb(false);
>>>
>>>
>>> I don't get what is going on here. Any explanation why virtio_wmb is not
>>> enough besides "it does not work"?
>>>
>>
>> The change is replacing instruction "dmb" with "dsb". "dsb" is stronger barrier
>> than "dmb" because "dsb" ensures that all memory accesses raised before this
>> instruction is completed when the 'dsb' instruction completes. However, "dmb"
>> doesn't guarantee the order of completion of the memory accesses.
>>
>> So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
>> can be completed before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
>
> Completed as observed by which CPU?
> We have 2 writes that we want observed by another CPU in order.
> So if CPU observes a new value of idx we want it to see
> new value in ring.
> This is standard use of smp_wmb()
> How are these 2 writes different?
>
> What DMB does, is that is seems to ensure that effects
> of 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
> are observed after effects of
> 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
>
>

Completed as observed by the CPU where the vhost worker is running. I don't
think DMB does the work here. If I'm understanding correctly, DMB ensures the
order of these two writes from the local CPU's standpoint. The written data
can be stored in the local CPU's cache, not yet flushed to DRAM and propagated
to the cache of the far CPU where the vhost worker is running. So DMB doesn't
ensure the written data is observed by the far CPU.

DSB ensures that the written data is observable from the far CPU immediately.

>
>
>> The stronger barrier 'dsb' ensures the completion order as we expected.
>>
>> virtio_wmb(true) virt_mb(false)
>> virt_wmb mb
>> __smp_wmb __mb
>> dmb(ishst) dsb(sy)
>
> First, why would you want a non smp barrier when you are
> synchronizing with another CPU? This is what smp_ means ...
>
>
>> Extraced from ARMv9 specificaton
>> ================================
>> The DMB instruction is a memory barrier instruction that ensures the relative
>> order of memory accesses before the barrier with memory accesses after the
>> barrier. The DMB instruction _does not_ ensure the completion of any of the
>> memory accesses for which it ensures relative order.
>
> Isn't this exactly what we need?
>

I don't think so. DMB guarantees the order of EXECUTION, but DSB guarantees
the order of COMPLETION. After the data store is executed, the written data
can still sit in the local CPU's cache, far from completion. Only when the
instruction has completed is the written data synchronized to the far CPU
where the vhost worker is running.

Here is another paper talking about the difference between DMB and DSB. It
explains the difference between DMB/DSB better than the ARMv9 specification.
However, it's hard for us to understand how DMB/DSB work at the hardware
level based on the specification and the paper alone.

https://ipads.se.sjtu.edu.cn/_media/publications/liuppopp20.pdf
(section 2.2 Order-preserving Options)
(I'm extracting the relevant part as below)

Data Memory Barrier (DMB) prevents reordering of memory accesses across the barrier.
Instead of waiting for previous accesses to become observable in the specified domain,
DMB only maintains the relative order between memory accesses. Meanwhile, DMB does
not block any non-memory access operations.

Data Synchronization Barrier (DSB) prevents reordering of any instructions across
the barrier. DSB will make sure that all masters in the specified domain can observe
the previous operations before issuing any subsequent instructions.

>> A DSB instruction is a memory barrier that ensures that memory accesses that
>> occur before the DSB instruction have __completed__ before the completion of
>> the DSB instruction.
>
> This seems appropriate for when you want to order things more
> strongly. I do not get why it's necessary here.
>

Please refer to the explanations above.

>> In doing this, it acts as a stronger barrier than a DMB
>> and all ordering that is created by a DMB with specific options is also generated
>> by a DSB with the same options.
>>
>>>> vq->split.avail_idx_shadow++;
>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>> vq->split.avail_idx_shadow);
>>>> --
>>>> 2.44.0
>>>

Thanks,
Gavin


2024-03-14 13:00:01

by Michael S. Tsirkin

Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
> On 3/14/24 21:50, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> > > On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > > client is started in the VM hosted by grace-hopper machine,
> > > > > while the 'netperf' server is running on grace-grace machine.
> > > > >
> > > > > The VM is started with virtio-net and vhost has been enabled.
> > > > > We observe a error message spew from VM and then soft-lockup
> > > > > report. The error message indicates the data associated with
> > > > > the descriptor (index: 135) has been released, and the queue
> > > > > is marked as broken. It eventually leads to the endless effort
> > > > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > > ring and published to the used ring by vhost, meaning we have
> > > > > disordred write to the available ring element and available index.
> > > > >
> > > > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > > > > -accel kvm -machine virt,gic-version=host \
> > > > > : \
> > > > > -netdev tap,id=vnet0,vhost=on \
> > > > > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > >
> > > > > [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > >
> > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > > > ARM64. It should work for other architectures, but performance loss is
> > > > > expected.
> > > > >
> > > > > Cc: [email protected]
> > > > > Reported-by: Yihuang Yu <[email protected]>
> > > > > Signed-off-by: Gavin Shan <[email protected]>
> > > > > ---
> > > > > drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > - * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > + * Descriptors and available array need to be set before we expose
> > > > > + * the new available array entries. virtio_wmb() should be enough
> > > > > + * to ensuere the order theoretically. However, a stronger barrier
> > > > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > + * by the host (vhost). A stronger barrier should work for other
> > > > > + * architectures, but performance loss is expected.
> > > > > + */
> > > > > + virtio_mb(false);
> > > >
> > > >
> > > > I don't get what is going on here. Any explanation why virtio_wmb is not
> > > > enough besides "it does not work"?
> > > >
> > >
> > > The change is replacing instruction "dmb" with "dsb". "dsb" is stronger barrier
> > > than "dmb" because "dsb" ensures that all memory accesses raised before this
> > > instruction is completed when the 'dsb' instruction completes. However, "dmb"
> > > doesn't guarantee the order of completion of the memory accesses.
> > >
> > > So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
> > > can be completed before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> >
> > Completed as observed by which CPU?
> > We have 2 writes that we want observed by another CPU in order.
> > So if CPU observes a new value of idx we want it to see
> > new value in ring.
> > This is standard use of smp_wmb()
> > How are these 2 writes different?
> >
> > What DMB does, is that is seems to ensure that effects
> > of 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
> > are observed after effects of
> > 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> >
> >
>
> Completed as observed by the CPU where vhost worker is running. I don't think DMB
> does the work here. If I'm understanding correctly, DMB ensures the order of these
> two writes from the local CPU's standpoint.

No, this makes no sense at all. All memory accesses are in order from
the local CPU's standpoint.

> The written data can be stored in local
> CPU's cache, not flushed to DRAM and propogated to the cache of the far CPU where
> vhost worker is running. So DMB isn't ensuring the write data is observed from the
> far CPU.

No, but it has to ensure *ordering*. So if the index is observed then the
ring is observed too, because there is an MB there when reading.


> DSB ensures that the written data is observable from the far CPU immediately.




> >
> >
> > > The stronger barrier 'dsb' ensures the completion order as we expected.
> > >
> > > virtio_wmb(true) virt_mb(false)
> > > virt_wmb mb
> > > __smp_wmb __mb
> > > dmb(ishst) dsb(sy)
> >
> > First, why would you want a non smp barrier when you are
> > synchronizing with another CPU? This is what smp_ means ...
> >
> >
> > > Extraced from ARMv9 specificaton
> > > ================================
> > > The DMB instruction is a memory barrier instruction that ensures the relative
> > > order of memory accesses before the barrier with memory accesses after the
> > > barrier. The DMB instruction _does not_ ensure the completion of any of the
> > > memory accesses for which it ensures relative order.
> >
> > Isn't this exactly what we need?
> >
>
> I don't think so. DMB gurantees the order of EXECUTION, but DSB gurantees the
> order of COMPLETION. After the data store is executed, the written data can
> be stored in local CPU's cache, far from completion. Only the instruction is
> completed, the written data is synchronized to the far CPU where vhost worker
> is running.
>
> Here is another paper talking about the difference between DMB and DSB. It's
> explaining the difference between DMB/DSB better than ARMv9 specification.
> However, it's hard for us to understand DMB/DSB work from the hardware level
> based on the specification and paper.
>
> https://ipads.se.sjtu.edu.cn/_media/publications/liuppopp20.pdf
> (section 2.2 Order-preserving Options)
> (I'm extracting the relevant part as below)
>
> Data Memory Barrier (DMB) prevents reordering of memory accesses across the barrier.
> Instead of waiting for previous accesses to become observable in the specified domain,
> DMB only maintains the relative order between memory accesses. Meanwhile, DMB does
> not block any non-memory access operations.
>
> Data Synchronization Barrier (DSB) prevents reordering of any instructions across
> the barrier. DSB will make sure that all masters in the specified domain can observe
> the previous operations before issuing any subsequent instructions.
>

What you are describing is that smp_wmb is buggy on this platform.
Possible, but are you sure?
Write a small test and check.
Leave virtio alone.


> > > A DSB instruction is a memory barrier that ensures that memory accesses that
> > > occur before the DSB instruction have __completed__ before the completion of
> > > the DSB instruction.
> >
> > This seems appropriate for when you want to order things more
> > strongly. I do not get why it's necessary here.
> >
>
> Please refer to above explanations.
>
> > > In doing this, it acts as a stronger barrier than a DMB
> > > and all ordering that is created by a DMB with specific options is also generated
> > > by a DSB with the same options.
> > >
> > > > > vq->split.avail_idx_shadow++;
> > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > vq->split.avail_idx_shadow);
> > > > > --
> > > > > 2.44.0
> > > >
>
> Thanks,
> Gavin


2024-03-15 10:46:11

by Gavin Shan

Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring


+ Will, Catalin and Matt from Nvidia

On 3/14/24 22:59, Michael S. Tsirkin wrote:
> On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
>> On 3/14/24 21:50, Michael S. Tsirkin wrote:
>>> On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
>>>> On 3/14/24 18:05, Michael S. Tsirkin wrote:
>>>>> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
>>>>>> The issue is reported by Yihuang Yu who have 'netperf' test on
>>>>>> NVidia's grace-grace and grace-hopper machines. The 'netperf'
>>>>>> client is started in the VM hosted by grace-hopper machine,
>>>>>> while the 'netperf' server is running on grace-grace machine.
>>>>>>
>>>>>> The VM is started with virtio-net and vhost has been enabled.
>>>>>> We observe a error message spew from VM and then soft-lockup
>>>>>> report. The error message indicates the data associated with
>>>>>> the descriptor (index: 135) has been released, and the queue
>>>>>> is marked as broken. It eventually leads to the endless effort
>>>>>> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
>>>>>> and soft-lockup. The stale index 135 is fetched from the available
>>>>>> ring and published to the used ring by vhost, meaning we have
>>>>>> disordred write to the available ring element and available index.
>>>>>>
>>>>>> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>>>>> -accel kvm -machine virt,gic-version=host \
>>>>>> : \
>>>>>> -netdev tap,id=vnet0,vhost=on \
>>>>>> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>>>>>>
>>>>>> [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>>>>>>
>>>>>> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
>>>>>> virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
>>>>>> ARM64. It should work for other architectures, but performance loss is
>>>>>> expected.
>>>>>>
>>>>>> Cc: [email protected]
>>>>>> Reported-by: Yihuang Yu <[email protected]>
>>>>>> Signed-off-by: Gavin Shan <[email protected]>
>>>>>> ---
>>>>>> drivers/virtio/virtio_ring.c | 12 +++++++++---
>>>>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>> - /* Descriptors and available array need to be set before we expose the
>>>>>> - * new available array entries. */
>>>>>> - virtio_wmb(vq->weak_barriers);
>>>>>> + /*
>>>>>> + * Descriptors and available array need to be set before we expose
>>>>>> + * the new available array entries. virtio_wmb() should be enough
>>>>>> + * to ensuere the order theoretically. However, a stronger barrier
>>>>>> + * is needed by ARM64. Otherwise, the stale data can be observed
>>>>>> + * by the host (vhost). A stronger barrier should work for other
>>>>>> + * architectures, but performance loss is expected.
>>>>>> + */
>>>>>> + virtio_mb(false);
>>>>>
>>>>>
>>>>> I don't get what is going on here. Any explanation why virtio_wmb is not
>>>>> enough besides "it does not work"?
>>>>>
>>>>
>>>> The change is replacing instruction "dmb" with "dsb". "dsb" is stronger barrier
>>>> than "dmb" because "dsb" ensures that all memory accesses raised before this
>>>> instruction is completed when the 'dsb' instruction completes. However, "dmb"
>>>> doesn't guarantee the order of completion of the memory accesses.
>>>>
>>>> So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
>>>> can be completed before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
>>>
>>> Completed as observed by which CPU?
>>> We have 2 writes that we want observed by another CPU in order.
>>> So if CPU observes a new value of idx we want it to see
>>> new value in ring.
>>> This is standard use of smp_wmb()
>>> How are these 2 writes different?
>>>
>>> What DMB does, is that is seems to ensure that effects
>>> of 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
>>> are observed after effects of
>>> 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
>>>
>>>
>>
>> Completed as observed by the CPU where vhost worker is running. I don't think DMB
>> does the work here. If I'm understanding correctly, DMB ensures the order of these
>> two writes from the local CPU's standpoint.
>
> No this makes no sense at all. All memory accesses are in order from
> local CPU standpoint.
>

It's true if the compiler doesn't reorder the accesses and a lightweight
barrier like 'dmb' or 'isb' is used. Otherwise, the accesses can still be
reordered on the local CPU, correct?

>> The written data can be stored in local
>> CPU's cache, not flushed to DRAM and propogated to the cache of the far CPU where
>> vhost worker is running. So DMB isn't ensuring the write data is observed from the
>> far CPU.
>
> No but it has to ensure *ordering*. So if index is oberved then ring
> is observed too because there is a MB there when reading.
>

Right. It should work this way in theory. I don't know the difference
between 'dmb' and 'dsb' at the hardware level. The description in the
ARMv9 spec is also ambiguous. Let's see if Matt has comments. Besides,
this issue is only reproducible on NVidia's grace-hopper. We don't hit
the issue on Ampere's CPUs.

>
>> DSB ensures that the written data is observable from the far CPU immediately.
>
>>>
>>>
>>>> The stronger barrier 'dsb' ensures the completion order as we expected.
>>>>
>>>> virtio_wmb(true) virt_mb(false)
>>>> virt_wmb mb
>>>> __smp_wmb __mb
>>>> dmb(ishst) dsb(sy)
>>>
>>> First, why would you want a non smp barrier when you are
>>> synchronizing with another CPU? This is what smp_ means ...
>>>
>>>
>>>> Extraced from ARMv9 specificaton
>>>> ================================
>>>> The DMB instruction is a memory barrier instruction that ensures the relative
>>>> order of memory accesses before the barrier with memory accesses after the
>>>> barrier. The DMB instruction _does not_ ensure the completion of any of the
>>>> memory accesses for which it ensures relative order.
>>>
>>> Isn't this exactly what we need?
>>>
>>
>> I don't think so. DMB gurantees the order of EXECUTION, but DSB gurantees the
>> order of COMPLETION. After the data store is executed, the written data can
>> be stored in local CPU's cache, far from completion. Only the instruction is
>> completed, the written data is synchronized to the far CPU where vhost worker
>> is running.
>>
>> Here is another paper talking about the difference between DMB and DSB. It's
>> explaining the difference between DMB/DSB better than ARMv9 specification.
>> However, it's hard for us to understand DMB/DSB work from the hardware level
>> based on the specification and paper.
>>
>> https://ipads.se.sjtu.edu.cn/_media/publications/liuppopp20.pdf
>> (section 2.2 Order-preserving Options)
>> (I'm extracting the relevant part as below)
>>
>> Data Memory Barrier (DMB) prevents reordering of memory accesses across the barrier.
>> Instead of waiting for previous accesses to become observable in the specified domain,
>> DMB only maintains the relative order between memory accesses. Meanwhile, DMB does
>> not block any non-memory access operations.
>>
>> Data Synchronization Barrier (DSB) prevents reordering of any instructions across
>> the barrier. DSB will make sure that all masters in the specified domain can observe
>> the previous operations before issuing any subsequent instructions.
>>
>
> What you are describing is that smp_wmb is buggy on this platform.
> Poossible but are you sure?
> Write a small test and check.
> Leave virtio alone.
>

Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I
tried to reproduce it with my own driver, where one thread writes to a shared
buffer and another thread reads from it. I haven't hit the out-of-order issue
so far. My driver may be incorrect somewhere, and I will update if I can
reproduce the issue with it in the future.

>
>>>> A DSB instruction is a memory barrier that ensures that memory accesses that
>>>> occur before the DSB instruction have __completed__ before the completion of
>>>> the DSB instruction.
>>>
>>> This seems appropriate for when you want to order things more
>>> strongly. I do not get why it's necessary here.
>>>
>>
>> Please refer to above explanations.
>>
>>>> In doing this, it acts as a stronger barrier than a DMB
>>>> and all ordering that is created by a DMB with specific options is also generated
>>>> by a DSB with the same options.
>>>>
>>>>>> vq->split.avail_idx_shadow++;
>>>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>> vq->split.avail_idx_shadow);
>>>>>> --
>>>>>> 2.44.0
>>>>>

Thanks,
Gavin


2024-03-15 11:06:04

by Michael S. Tsirkin

Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
>
> + Will, Catalin and Matt from Nvidia
>
> On 3/14/24 22:59, Michael S. Tsirkin wrote:
> > On Thu, Mar 14, 2024 at 10:50:15PM +1000, Gavin Shan wrote:
> > > On 3/14/24 21:50, Michael S. Tsirkin wrote:
> > > > On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
> > > > > On 3/14/24 18:05, Michael S. Tsirkin wrote:
> > > > > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > > > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > > > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > > > > client is started in the VM hosted by grace-hopper machine,
> > > > > > > while the 'netperf' server is running on grace-grace machine.
> > > > > > >
> > > > > > > The VM is started with virtio-net and vhost has been enabled.
> > > > > > > We observe a error message spew from VM and then soft-lockup
> > > > > > > report. The error message indicates the data associated with
> > > > > > > the descriptor (index: 135) has been released, and the queue
> > > > > > > is marked as broken. It eventually leads to the endless effort
> > > > > > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > > > > ring and published to the used ring by vhost, meaning we have
> > > > > > > disordred write to the available ring element and available index.
> > > > > > >
> > > > > > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > > > > > > -accel kvm -machine virt,gic-version=host \
> > > > > > > : \
> > > > > > > -netdev tap,id=vnet0,vhost=on \
> > > > > > > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > > > > >
> > > > > > > [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > > > > >
> > > > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > > > > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > > > > > ARM64. It should work for other architectures, but performance loss is
> > > > > > > expected.
> > > > > > >
> > > > > > > Cc: [email protected]
> > > > > > > Reported-by: Yihuang Yu <[email protected]>
> > > > > > > Signed-off-by: Gavin Shan <[email protected]>
> > > > > > > ---
> > > > > > > drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > > > > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > > > - * new available array entries. */
> > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > + /*
> > > > > > > + * Descriptors and available array need to be set before we expose
> > > > > > > + * the new available array entries. virtio_wmb() should be enough
> > > > > > > + * to ensuere the order theoretically. However, a stronger barrier
> > > > > > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > > + * by the host (vhost). A stronger barrier should work for other
> > > > > > > + * architectures, but performance loss is expected.
> > > > > > > + */
> > > > > > > + virtio_mb(false);
> > > > > >
> > > > > >
> > > > > > I don't get what is going on here. Any explanation why virtio_wmb is not
> > > > > > enough besides "it does not work"?
> > > > > >
> > > > >
> > > > > The change is replacing instruction "dmb" with "dsb". "dsb" is stronger barrier
> > > > > than "dmb" because "dsb" ensures that all memory accesses raised before this
> > > > > instruction is completed when the 'dsb' instruction completes. However, "dmb"
> > > > > doesn't guarantee the order of completion of the memory accesses.
> > > > >
> > > > > So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
> > > > > can be completed before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> > > >
> > > > Completed as observed by which CPU?
> > > > We have 2 writes that we want observed by another CPU in order.
> > > > So if CPU observes a new value of idx we want it to see
> > > > new value in ring.
> > > > This is standard use of smp_wmb()
> > > > How are these 2 writes different?
> > > >
> > > > What DMB does, is that is seems to ensure that effects
> > > > of 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
> > > > are observed after effects of
> > > > 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.
> > > >
> > > >
> > >
> > > Completed as observed by the CPU where vhost worker is running. I don't think DMB
> > > does the work here. If I'm understanding correctly, DMB ensures the order of these
> > > two writes from the local CPU's standpoint.
> >
> > No this makes no sense at all. All memory accesses are in order from
> > local CPU standpoint.
> >
>
> It's true if compiler doesn't reorder the accesses, and light-weight barrier
> like 'dmb' and 'isb' is used. Otherwise, the accesses still can be disordered
> on the local CPU, correct?

Compiler issues aside (you can look at the binary and see whether the
writes got reordered), no: the local CPU always observes its own writes
in the order they were issued, on all architectures.

> > > The written data can be stored in local
> > > CPU's cache, not flushed to DRAM and propogated to the cache of the far CPU where
> > > vhost worker is running. So DMB isn't ensuring the write data is observed from the
> > > far CPU.
> >
> > No but it has to ensure *ordering*. So if index is oberved then ring
> > is observed too because there is a MB there when reading.
> >
>
> Right. It should work like this way theoretically. I don't know the difference
> between 'dmb' and 'dsb' from the hardware level. The description in ARMv9 spec
> is also ambiguous. Lets see if Matt will have comments. Besides, this issue is
> only reproducible on NVidia's grace-hopper. We don't hit the issue on Ampere's
> CPUs.
>
> >
> > > DSB ensures that the written data is observable from the far CPU immediately.
> > > >
> > > >
> > > > > The stronger barrier 'dsb' ensures the completion order as we expected.
> > > > >
> > > > > virtio_wmb(true) virt_mb(false)
> > > > > virt_wmb mb
> > > > > __smp_wmb __mb
> > > > > dmb(ishst) dsb(sy)
> > > >
> > > > First, why would you want a non smp barrier when you are
> > > > synchronizing with another CPU? This is what smp_ means ...
> > > >
> > > >
> > > > > Extraced from ARMv9 specificaton
> > > > > ================================
> > > > > The DMB instruction is a memory barrier instruction that ensures the relative
> > > > > order of memory accesses before the barrier with memory accesses after the
> > > > > barrier. The DMB instruction _does not_ ensure the completion of any of the
> > > > > memory accesses for which it ensures relative order.
> > > >
> > > > Isn't this exactly what we need?
> > > >
> > >
> > > I don't think so. DMB gurantees the order of EXECUTION, but DSB gurantees the
> > > order of COMPLETION. After the data store is executed, the written data can
> > > be stored in local CPU's cache, far from completion. Only the instruction is
> > > completed, the written data is synchronized to the far CPU where vhost worker
> > > is running.
> > >
> > > Here is another paper talking about the difference between DMB and DSB. It's
> > > explaining the difference between DMB/DSB better than ARMv9 specification.
> > > However, it's hard for us to understand DMB/DSB work from the hardware level
> > > based on the specification and paper.
> > >
> > > https://ipads.se.sjtu.edu.cn/_media/publications/liuppopp20.pdf
> > > (section 2.2 Order-preserving Options)
> > > (I'm extracting the relevant part as below)
> > >
> > > Data Memory Barrier (DMB) prevents reordering of memory accesses across the barrier.
> > > Instead of waiting for previous accesses to become observable in the specified domain,
> > > DMB only maintains the relative order between memory accesses. Meanwhile, DMB does
> > > not block any non-memory access operations.
> > >
> > > Data Synchronization Barrier (DSB) prevents reordering of any instructions across
> > > the barrier. DSB will make sure that all masters in the specified domain can observe
> > > the previous operations before issuing any subsequent instructions.
> > >
> >
> > What you are describing is that smp_wmb is buggy on this platform.
> > Poossible but are you sure?
> > Write a small test and check.
> > Leave virtio alone.
> >
>
> Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
> to reproduce it with my own driver where one thread writes to the shared buffer
> and another thread reads from the buffer. I don't hit the out-of-order issue so
> far.

Make sure the 2 areas you are accessing are in different cache lines.


> My driver may be not correct somewhere and I will update if I can reproduce
> the issue with my driver in the future.

Then maybe your change is just making virtio slower and masks the bug
that is actually elsewhere?

You don't really need a driver. Here's a simple test: without barriers
assertion will fail. With barriers it will not.
(Warning: didn't bother testing too much, could be buggy.)

---

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define FIRST values[0]
#define SECOND values[64]

volatile int values[100] = {};

void* writer_thread(void* arg) {
    while (1) {
        FIRST++;
        // NEED smp_wmb here
        SECOND++;
    }
}

void* reader_thread(void* arg) {
    while (1) {
        int first = FIRST;
        // NEED smp_rmb here
        int second = SECOND;
        assert(first - second == 1 || first - second == 0);
    }
}

int main() {
    pthread_t writer, reader;

    pthread_create(&writer, NULL, writer_thread, NULL);
    pthread_create(&reader, NULL, reader_thread, NULL);

    pthread_join(writer, NULL);
    pthread_join(reader, NULL);

    return 0;
}
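
To build and run it (assuming GCC; the file name is illustrative):

    gcc -O2 -pthread test.c -o test
    ./test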

--
MST


2024-03-15 11:25:58

by Gavin Shan

Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring


On 3/15/24 21:05, Michael S. Tsirkin wrote:
> On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
>>>> Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
>> to reproduce it with my own driver where one thread writes to the shared buffer
>> and another thread reads from the buffer. I don't hit the out-of-order issue so
>> far.
>
> Make sure the 2 areas you are accessing are in different cache lines.
>

Yes, I already put those 2 areas in separate cache lines.

>
>> My driver may be not correct somewhere and I will update if I can reproduce
>> the issue with my driver in the future.
>
> Then maybe your change is just making virtio slower and masks the bug
> that is actually elsewhere?
>
> You don't really need a driver. Here's a simple test: without barriers
> assertion will fail. With barriers it will not.
> (Warning: didn't bother testing too much, could be buggy.
>
> ---
>
> #include <pthread.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <assert.h>
>
> #define FIRST values[0]
> #define SECOND values[64]
>
> volatile int values[100] = {};
>
> void* writer_thread(void* arg) {
> while (1) {
> FIRST++;
> // NEED smp_wmb here
__asm__ volatile("dmb ishst" : : : "memory");
> SECOND++;
> }
> }
>
> void* reader_thread(void* arg) {
> while (1) {
> int first = FIRST;
> // NEED smp_rmb here
__asm__ volatile("dmb ishld" : : : "memory");
> int second = SECOND;
> assert(first - second == 1 || first - second == 0);
> }
> }
>
> int main() {
> pthread_t writer, reader;
>
> pthread_create(&writer, NULL, writer_thread, NULL);
> pthread_create(&reader, NULL, reader_thread, NULL);
>
> pthread_join(writer, NULL);
> pthread_join(reader, NULL);
>
> return 0;
> }
>

I had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
the assert on both of them. After replacing 'dmb' with 'dsb', I can
still hit the assert on both of them. I need to look at the code more
closely.

[root@virt-mtcollins-02 test]# ./a
a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
Aborted (core dumped)

[root@nvidia-grace-hopper-05 test]# ./a
a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
Aborted (core dumped)

Thanks,
Gavin


2024-03-17 16:50:59

by Michael S. Tsirkin

Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
>
> On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
> > > to reproduce it with my own driver where one thread writes to the shared buffer
> > > and another thread reads from the buffer. I don't hit the out-of-order issue so
> > > far.
> >
> > Make sure the 2 areas you are accessing are in different cache lines.
> >
>
> Yes, I already put those 2 areas to separate cache lines.
>
> >
> > > My driver may be not correct somewhere and I will update if I can reproduce
> > > the issue with my driver in the future.
> >
> > Then maybe your change is just making virtio slower and masks the bug
> > that is actually elsewhere?
> >
> > You don't really need a driver. Here's a simple test: without barriers
> > assertion will fail. With barriers it will not.
> > (Warning: didn't bother testing too much, could be buggy.
> >
> > ---
> >
> > #include <pthread.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <assert.h>
> >
> > #define FIRST values[0]
> > #define SECOND values[64]
> >
> > volatile int values[100] = {};
> >
> > void* writer_thread(void* arg) {
> > while (1) {
> > FIRST++;
> > // NEED smp_wmb here
> __asm__ volatile("dmb ishst" : : : "memory");
> > SECOND++;
> > }
> > }
> >
> > void* reader_thread(void* arg) {
> > while (1) {
> > int first = FIRST;
> > // NEED smp_rmb here
> __asm__ volatile("dmb ishld" : : : "memory");
> > int second = SECOND;
> > assert(first - second == 1 || first - second == 0);
> > }
> > }
> >
> > int main() {
> > pthread_t writer, reader;
> >
> > pthread_create(&writer, NULL, writer_thread, NULL);
> > pthread_create(&reader, NULL, reader_thread, NULL);
> >
> > pthread_join(writer, NULL);
> > pthread_join(reader, NULL);
> >
> > return 0;
> > }
> >
>
> Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> the assert on both of them. After replacing 'dmb' with 'dsb', I can
> hit assert on both of them too. I need to look at the code closely.
>
> [root@virt-mtcollins-02 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
> Aborted (core dumped)
>
> [root@nvidia-grace-hopper-05 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
> Aborted (core dumped)
>
> Thanks,
> Gavin


Actually this test is broken. There is no ordering requirement here; it's
a simple race: the reader can load FIRST, then the writer can run any
number of iterations before the reader loads SECOND, so the assertion can
fail even with perfectly ordered writes. The following works on x86 though
(x86 does not need barriers anyway).


#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#if 0
#define x86_rmb() asm volatile("lfence":::"memory")
#define x86_mb() asm volatile("mfence":::"memory")
#define x86_smb() asm volatile("sfence":::"memory")
#else
#define x86_rmb() asm volatile("":::"memory")
#define x86_mb() asm volatile("":::"memory")
#define x86_smb() asm volatile("":::"memory")
#endif

#define FIRST values[0]
#define SECOND values[640]
#define FLAG values[1280]

volatile unsigned values[2000] = {};

void* writer_thread(void* arg) {
        while (1) {
                /* Now synchronize with reader */
                while (FLAG);
                FIRST++;
                x86_smb();
                SECOND++;
                x86_smb();
                FLAG = 1;
        }
}

void* reader_thread(void* arg) {
        while (1) {
                /* Now synchronize with writer */
                while (!FLAG);
                x86_rmb();
                unsigned first = FIRST;
                x86_rmb();
                unsigned second = SECOND;
                assert(first - second == 1 || first - second == 0);
                FLAG = 0;

                if (!(first % 1000000))
                        printf("%u\n", first);
        }
}

int main() {
        pthread_t writer, reader;

        pthread_create(&writer, NULL, writer_thread, NULL);
        pthread_create(&reader, NULL, reader_thread, NULL);

        pthread_join(writer, NULL);
        pthread_join(reader, NULL);

        return 0;
}
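
For the arm64 machines in question, the same flag-based test would
presumably look like this - an untested sketch on my side, reusing the
dmb variants discussed earlier (the extra barrier before releasing FLAG
is needed because arm64, unlike x86, can reorder a younger store ahead
of older loads):

#include <pthread.h>
#include <assert.h>

#define arm64_smp_wmb() __asm__ volatile("dmb ishst":::"memory")
#define arm64_smp_rmb() __asm__ volatile("dmb ishld":::"memory")

#define FIRST values[0]
#define SECOND values[640]
#define FLAG values[1280]

volatile unsigned values[2000] = {};

void* writer_thread(void* arg) {
        while (1) {
                while (FLAG);            /* wait for reader to consume */
                FIRST++;
                arm64_smp_wmb();         /* order FIRST before SECOND */
                SECOND++;
                arm64_smp_wmb();         /* order SECOND before FLAG */
                FLAG = 1;
        }
}

void* reader_thread(void* arg) {
        while (1) {
                while (!FLAG);           /* wait for writer to publish */
                arm64_smp_rmb();
                unsigned first = FIRST;
                arm64_smp_rmb();
                unsigned second = SECOND;
                assert(first - second == 1 || first - second == 0);
                arm64_smp_rmb();         /* complete the reads before releasing FLAG */
                FLAG = 0;
        }
}

int main() {
        pthread_t writer, reader;

        pthread_create(&writer, NULL, writer_thread, NULL);
        pthread_create(&reader, NULL, reader_thread, NULL);

        pthread_join(writer, NULL);
        pthread_join(reader, NULL);

        return 0;
}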


2024-03-17 23:42:07

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/18/24 02:50, Michael S. Tsirkin wrote:
> On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
>>
>> On 3/15/24 21:05, Michael S. Tsirkin wrote:
>>> On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
>>>>>> Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
>>>> to reproduce it with my own driver where one thread writes to the shared buffer
>>>> and another thread reads from the buffer. I don't hit the out-of-order issue so
>>>> far.
>>>
>>> Make sure the 2 areas you are accessing are in different cache lines.
>>>
>>
>> Yes, I already put those 2 areas to separate cache lines.
>>
>>>
>>>> My driver may be not correct somewhere and I will update if I can reproduce
>>>> the issue with my driver in the future.
>>>
>>> Then maybe your change is just making virtio slower and masks the bug
>>> that is actually elsewhere?
>>>
>>> You don't really need a driver. Here's a simple test: without barriers
>>> assertion will fail. With barriers it will not.
>>> (Warning: didn't bother testing too much, could be buggy.
>>>
>>> ---
>>>
>>> #include <pthread.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <assert.h>
>>>
>>> #define FIRST values[0]
>>> #define SECOND values[64]
>>>
>>> volatile int values[100] = {};
>>>
>>> void* writer_thread(void* arg) {
>>> while (1) {
>>> FIRST++;
>>> // NEED smp_wmb here
>> __asm__ volatile("dmb ishst" : : : "memory");
>>> SECOND++;
>>> }
>>> }
>>>
>>> void* reader_thread(void* arg) {
>>> while (1) {
>>> int first = FIRST;
>>> // NEED smp_rmb here
>> __asm__ volatile("dmb ishld" : : : "memory");
>>> int second = SECOND;
>>> assert(first - second == 1 || first - second == 0);
>>> }
>>> }
>>>
>>> int main() {
>>> pthread_t writer, reader;
>>>
>>> pthread_create(&writer, NULL, writer_thread, NULL);
>>> pthread_create(&reader, NULL, reader_thread, NULL);
>>>
>>> pthread_join(writer, NULL);
>>> pthread_join(reader, NULL);
>>>
>>> return 0;
>>> }
>>>
>>
>> Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
>> the assert on both of them. After replacing 'dmb' with 'dsb', I can
>> hit assert on both of them too. I need to look at the code closely.
>>
>> [root@virt-mtcollins-02 test]# ./a
>> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
>> Aborted (core dumped)
>>
>> [root@nvidia-grace-hopper-05 test]# ./a
>> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
>> Aborted (core dumped)
>>
>> Thanks,
>> Gavin
>
>
> Actually this test is broken. No need for ordering it's a simple race.
> The following works on x86 though (x86 does not need barriers
> though).
>
>
> #include <pthread.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <assert.h>
>
> #if 0
> #define x86_rmb() asm volatile("lfence":::"memory")
> #define x86_mb() asm volatile("mfence":::"memory")
> #define x86_smb() asm volatile("sfence":::"memory")
> #else
> #define x86_rmb() asm volatile("":::"memory")
> #define x86_mb() asm volatile("":::"memory")
> #define x86_smb() asm volatile("":::"memory")
> #endif
>
> #define FIRST values[0]
> #define SECOND values[640]
> #define FLAG values[1280]
>
> volatile unsigned values[2000] = {};
>
> void* writer_thread(void* arg) {
> while (1) {
> /* Now synchronize with reader */
> while(FLAG);
> FIRST++;
> x86_smb();
> SECOND++;
> x86_smb();
> FLAG = 1;
> }
> }
>
> void* reader_thread(void* arg) {
> while (1) {
> /* Now synchronize with writer */
> while(!FLAG);
> x86_rmb();
> unsigned first = FIRST;
> x86_rmb();
> unsigned second = SECOND;
> assert(first - second == 1 || first - second == 0);
> FLAG = 0;
>
> if (!(first %1000000))
> printf("%d\n", first);
> }
> }
>
> int main() {
> pthread_t writer, reader;
>
> pthread_create(&writer, NULL, writer_thread, NULL);
> pthread_create(&reader, NULL, reader_thread, NULL);
>
> pthread_join(writer, NULL);
> pthread_join(reader, NULL);
>
> return 0;
> }
>

I tried it on host and VM of NVidia's grace-hopper. Without the barriers, I
can hit the assert. With the barriers, it works fine without hitting the
assert.

I also wrote some code to mimic the virtio vring last weekend, and it works
well. Back to our original issue: __smp_wmb() is issued by the guest while
__smp_rmb() is executed on the host. The VM and host run at different
exception levels: EL2 vs EL1. I'm not sure whether that is the cause. I need
to modify my code so that __smp_wmb() and __smp_rmb() can be executed from
guest and host respectively.

[gshan@gshan code]$ cat test.h
#ifndef __TEST_H
#define __TEST_H

struct vring_desc {
        uint64_t addr;
        uint32_t len;
        uint16_t flags;
        uint16_t next;
} __attribute__((aligned(4)));

struct vring_avail {
        uint16_t flags;
        uint16_t idx;
        uint16_t ring[];
} __attribute__((aligned(4)));

struct vring_used_elem {
        uint32_t id;
        uint32_t len;
} __attribute__((aligned(4)));

struct vring_used {
        uint16_t flags;
        uint16_t idx;
        struct vring_used_elem ring[];
} __attribute__((aligned(4)));

struct vring {
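        /* The pads below keep the writer's and reader's hot fields in separate cache lines */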
        struct vring_desc *desc;
        struct vring_avail *avail;
        struct vring_used *used;
        uint8_t pad0[64];

        /* Writer */
        uint32_t num;
        uint32_t w_num_free;
        uint32_t w_free_head;
        uint16_t w_avail_idx;
        uint16_t w_last_used_idx;
        uint16_t *w_extra_data;
        uint16_t *w_extra_next;
        uint8_t pad1[64];

        /* Reader */
        uint16_t r_avail_idx;
        uint16_t r_last_avail_idx;
        uint16_t r_last_used_idx;
        uint8_t pad2[64];
};

static inline unsigned int vring_size(unsigned int num, unsigned long align)
{
        return ((sizeof(struct vring_desc) * num +
                 sizeof(uint16_t) * (3 + num) + (align - 1)) & ~(align - 1)) +
               sizeof(uint16_t) * 3 + sizeof(struct vring_used_elem) * num;
}

static inline void __smp_rmb(void)
{
#ifdef WEAK_BARRIER
        __asm__ volatile("dmb ishld" : : : "memory");
#else
        __asm__ volatile("dsb sy" : : : "memory");
#endif
}

static inline void __smp_wmb(void)
{
#ifdef WEAK_BARRIER
        __asm__ volatile("dmb ishst" : : : "memory");
#else
        __asm__ volatile("dsb sy" : : : "memory");
#endif
}

#endif /* __TEST_H */


[gshan@gshan code]$ cat test.c
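/*
 * Build (assumed, not from the original mail):
 *   gcc -O2 -pthread -o test test.c                  # dsb-based barriers
 *   gcc -O2 -pthread -DWEAK_BARRIER -o test test.c   # dmb-based barriers
 * The CPU numbers passed to bind_cpu() assume a large system; adjust to taste.
 */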
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <assert.h>
#include <sched.h>
#include <pthread.h>

#include "test.h"

static struct vring *vring;

static int bind_cpu(int cpuid)
{
        cpu_set_t cpuset;

        CPU_ZERO(&cpuset);
        CPU_SET(cpuid, &cpuset);

        return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

static void write_free_used_desc(void)
{
        uint16_t last_used;
        uint32_t idx;

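        /* Mimic the driver's transmit path: only reclaim used descriptors in batches of 64 */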
        if ((uint16_t)(vring->used->idx - vring->w_last_used_idx) < 64)
                return;

        while (true) {
                if (vring->w_last_used_idx == vring->used->idx)
                        return;

                __smp_rmb();

                /* Retrieve the head */
                last_used = vring->w_last_used_idx & (vring->num - 1);
                idx = vring->used->ring[last_used].id;
                assert(idx < vring->num);
                assert(vring->w_extra_data[idx]);

                /* Reclaim the descriptor */
                vring->w_extra_data[idx] = 0;
                vring->w_extra_next[idx] = vring->w_free_head;
                vring->w_free_head = idx;

                /* Update statistics */
                vring->w_num_free++;
                vring->w_last_used_idx++;
        }
}

static void write_push_desc(void)
{
        uint32_t head = vring->w_free_head;
        uint32_t avail_idx;

        if (vring->w_num_free < 1)
                return;

        /*
         * The data in the descriptor doesn't matter. The idea here
         * is to dirty the cache line.
         */
        vring->desc[head].flags = 1;
        vring->desc[head].addr = 0xffffffffffffffff;
        vring->desc[head].len = 0xffffffff;
        vring->desc[head].next = vring->w_extra_next[head];
        vring->desc[head].flags = 0;

        vring->w_num_free--;
        vring->w_free_head = vring->w_extra_next[head];
        vring->w_extra_data[head] = 1;

        avail_idx = vring->w_avail_idx & (vring->num - 1);
        vring->avail->ring[avail_idx] = head;

        __smp_wmb();

        vring->w_avail_idx++;
        vring->avail->idx = vring->w_avail_idx;
}

static void *write_worker(void *arg)
{
        assert(!bind_cpu(10));

        while (true) {
                write_free_used_desc();
                write_push_desc();
        }

        return NULL;
}

static void read_pull_desc(void)
{
        uint16_t avail_idx, last_avail_idx;
        uint32_t head;

        last_avail_idx = vring->r_last_avail_idx;
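        /* Like vhost: refetch the published avail->idx only when the cached value is used up */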
        if (vring->r_avail_idx == vring->r_last_avail_idx) {
                vring->r_avail_idx = vring->avail->idx;
                if (vring->r_avail_idx == last_avail_idx)
                        return;

                __smp_rmb();
        }

        head = vring->avail->ring[last_avail_idx & (vring->num - 1)];
        assert(head < vring->num);
        vring->r_last_avail_idx++;

        vring->used->ring[vring->r_last_used_idx & (vring->num - 1)].id = head;
        vring->used->ring[vring->r_last_used_idx & (vring->num - 1)].len = 0;
        vring->r_last_used_idx++;

        __smp_wmb();

        vring->used->idx = vring->r_last_used_idx;
}

static void *read_worker(void *arg)
{
        assert(!bind_cpu(60));

        while (true) {
                read_pull_desc();
        }

        return NULL;
}

static void init_vring(unsigned int num, unsigned long align)
{
        unsigned int size, i;

        /* vring */
        vring = malloc(sizeof(*vring));
        assert(vring);
        memset(vring, 0, sizeof(*vring));

        /* Descriptors */
        size = vring_size(num, align);
        vring->desc = (struct vring_desc *)malloc(size);
        assert(vring->desc);
        memset(vring->desc, 0, size);
        vring->avail = (struct vring_avail *)((void *)vring->desc +
                                              num * sizeof(struct vring_desc));
        vring->used = (struct vring_used *)(((unsigned long)&vring->avail->ring[num] +
                                             sizeof(uint16_t) + (align - 1)) & ~(align - 1));

        /* Writer's extra data */
        vring->w_extra_data = malloc(sizeof(uint16_t) * num);
        assert(vring->w_extra_data);
        memset(vring->w_extra_data, 0, sizeof(uint16_t) * num);
        vring->w_extra_next = malloc(sizeof(uint16_t) * num);
        assert(vring->w_extra_next);
        memset(vring->w_extra_next, 0, sizeof(uint16_t) * num);
        for (i = 0; i < num - 1; i++)
                vring->w_extra_next[i] = i + 1;

        /* Statistics */
        vring->num = num;
        vring->w_num_free = num;
        vring->w_free_head = 0;
        vring->w_avail_idx = 0;
        vring->w_last_used_idx = 0;
        vring->r_avail_idx = 0;
        vring->r_last_avail_idx = 0;
        vring->r_last_used_idx = 0;
}

int main(int argc, char **argv)
{
        pthread_t w_tid, r_tid;
        int ret;

        assert(!bind_cpu(0));

        init_vring(256, 64);

        ret = pthread_create(&w_tid, NULL, write_worker, NULL);
        assert(!ret);
        ret = pthread_create(&r_tid, NULL, read_worker, NULL);
        assert(!ret);
        ret = pthread_join(w_tid, NULL);
        assert(!ret);
        ret = pthread_join(r_tid, NULL);
        assert(!ret);

        while (true) {
                sleep(1);
        }

        return 0;
}

Thanks,
Gavin


2024-03-18 07:51:00

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Mon, Mar 18, 2024 at 09:41:45AM +1000, Gavin Shan wrote:
> On 3/18/24 02:50, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
> > >
> > > On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > > > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > > > > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I tried
> > > > > to reproduce it with my own driver where one thread writes to the shared buffer
> > > > > and another thread reads from the buffer. I don't hit the out-of-order issue so
> > > > > far.
> > > >
> > > > Make sure the 2 areas you are accessing are in different cache lines.
> > > >
> > >
> > > Yes, I already put those 2 areas to separate cache lines.
> > >
> > > >
> > > > > My driver may be not correct somewhere and I will update if I can reproduce
> > > > > the issue with my driver in the future.
> > > >
> > > > Then maybe your change is just making virtio slower and masks the bug
> > > > that is actually elsewhere?
> > > >
> > > > You don't really need a driver. Here's a simple test: without barriers
> > > > assertion will fail. With barriers it will not.
> > > > (Warning: didn't bother testing too much, could be buggy.
> > > >
> > > > ---
> > > >
> > > > #include <pthread.h>
> > > > #include <stdio.h>
> > > > #include <stdlib.h>
> > > > #include <assert.h>
> > > >
> > > > #define FIRST values[0]
> > > > #define SECOND values[64]
> > > >
> > > > volatile int values[100] = {};
> > > >
> > > > void* writer_thread(void* arg) {
> > > > while (1) {
> > > > FIRST++;
> > > > // NEED smp_wmb here
> > > __asm__ volatile("dmb ishst" : : : "memory");
> > > > SECOND++;
> > > > }
> > > > }
> > > >
> > > > void* reader_thread(void* arg) {
> > > > while (1) {
> > > > int first = FIRST;
> > > > // NEED smp_rmb here
> > > __asm__ volatile("dmb ishld" : : : "memory");
> > > > int second = SECOND;
> > > > assert(first - second == 1 || first - second == 0);
> > > > }
> > > > }
> > > >
> > > > int main() {
> > > > pthread_t writer, reader;
> > > >
> > > > pthread_create(&writer, NULL, writer_thread, NULL);
> > > > pthread_create(&reader, NULL, reader_thread, NULL);
> > > >
> > > > pthread_join(writer, NULL);
> > > > pthread_join(reader, NULL);
> > > >
> > > > return 0;
> > > > }
> > > >
> > >
> > > Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> > > the assert on both of them. After replacing 'dmb' with 'dsb', I can
> > > hit assert on both of them too. I need to look at the code closely.
> > >
> > > [root@virt-mtcollins-02 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
> > > Aborted (core dumped)
> > >
> > > [root@nvidia-grace-hopper-05 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 0' failed.
> > > Aborted (core dumped)
> > >
> > > Thanks,
> > > Gavin
> >
> >
> > Actually this test is broken. No need for ordering it's a simple race.
> > The following works on x86 though (x86 does not need barriers
> > though).
> >
> >
> > #include <pthread.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <assert.h>
> >
> > #if 0
> > #define x86_rmb() asm volatile("lfence":::"memory")
> > #define x86_mb() asm volatile("mfence":::"memory")
> > #define x86_smb() asm volatile("sfence":::"memory")
> > #else
> > #define x86_rmb() asm volatile("":::"memory")
> > #define x86_mb() asm volatile("":::"memory")
> > #define x86_smb() asm volatile("":::"memory")
> > #endif
> >
> > #define FIRST values[0]
> > #define SECOND values[640]
> > #define FLAG values[1280]
> >
> > volatile unsigned values[2000] = {};
> >
> > void* writer_thread(void* arg) {
> > while (1) {
> > /* Now synchronize with reader */
> > while(FLAG);
> > FIRST++;
> > x86_smb();
> > SECOND++;
> > x86_smb();
> > FLAG = 1;
> > }
> > }
> >
> > void* reader_thread(void* arg) {
> > while (1) {
> > /* Now synchronize with writer */
> > while(!FLAG);
> > x86_rmb();
> > unsigned first = FIRST;
> > x86_rmb();
> > unsigned second = SECOND;
> > assert(first - second == 1 || first - second == 0);
> > FLAG = 0;
> >
> > if (!(first %1000000))
> > printf("%d\n", first);
> > }
> > }
> >
> > int main() {
> > pthread_t writer, reader;
> >
> > pthread_create(&writer, NULL, writer_thread, NULL);
> > pthread_create(&reader, NULL, reader_thread, NULL);
> >
> > pthread_join(writer, NULL);
> > pthread_join(reader, NULL);
> >
> > return 0;
> > }
> >
>
> I tried it on host and VM of NVidia's grace-hopper. Without the barriers, I
> can hit assert. With the barriers, it's working fine without hitting the
> assert.
>
> I also had some code to mimic virtio vring last weekend, and it's just
> working well. Back to our original issue, __smb_wmb() is issued by guest
> while __smb_rmb() is executed on host. The VM and host are running at
> different exception level: EL2 vs EL1. I'm not sure it's the cause. I
> need to modify my code so that __smb_wmb() and __smb_rmb() can be executed
> from guest and host.

It is conceivable that barriers somehow work
differently on grace-hopper. We need to find out more, though.
Can anyone from NVidia chime in?

--
MST


2024-03-18 17:00:10

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue is reported by Yihuang Yu who have 'netperf' test on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in the VM hosted by grace-hopper machine,
> while the 'netperf' server is running on grace-grace machine.
>
> The VM is started with virtio-net and vhost has been enabled.
> We observe a error message spew from VM and then soft-lockup
> report. The error message indicates the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to the endless effort
> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordred write to the available ring element and available index.
>
> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> -accel kvm -machine virt,gic-version=host \
> : \
> -netdev tap,id=vnet0,vhost=on \
> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>
> [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> ARM64. It should work for other architectures, but performance loss is
> expected.
>
> Cc: [email protected]
> Reported-by: Yihuang Yu <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>
> ---
> drivers/virtio/virtio_ring.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>
> - /* Descriptors and available array need to be set before we expose the
> - * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> + * Descriptors and available array need to be set before we expose
> + * the new available array entries. virtio_wmb() should be enough
> + * to ensuere the order theoretically. However, a stronger barrier
> + * is needed by ARM64. Otherwise, the stale data can be observed
> + * by the host (vhost). A stronger barrier should work for other
> + * architectures, but performance loss is expected.
> + */
> + virtio_mb(false);
> vq->split.avail_idx_shadow++;
> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> vq->split.avail_idx_shadow);

Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
here, especially when ordering accesses to coherent memory.

In practice, either the larger timing difference from the DSB or the fact
that you're going from a Store->Store barrier to a full barrier is what
makes things "work" for you. Have you tried, for example, a DMB SY
(e.g. via __smp_mb())?

We definitely shouldn't take changes like this without a proper
explanation of what is going on.

Will

2024-03-19 04:59:44

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/19/24 02:59, Will Deacon wrote:
> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
>> The issue is reported by Yihuang Yu who have 'netperf' test on
>> NVidia's grace-grace and grace-hopper machines. The 'netperf'
>> client is started in the VM hosted by grace-hopper machine,
>> while the 'netperf' server is running on grace-grace machine.
>>
>> The VM is started with virtio-net and vhost has been enabled.
>> We observe a error message spew from VM and then soft-lockup
>> report. The error message indicates the data associated with
>> the descriptor (index: 135) has been released, and the queue
>> is marked as broken. It eventually leads to the endless effort
>> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
>> and soft-lockup. The stale index 135 is fetched from the available
>> ring and published to the used ring by vhost, meaning we have
>> disordred write to the available ring element and available index.
>>
>> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>> -accel kvm -machine virt,gic-version=host \
>> : \
>> -netdev tap,id=vnet0,vhost=on \
>> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>>
>> [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>>
>> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
>> virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
>> ARM64. It should work for other architectures, but performance loss is
>> expected.
>>
>> Cc: [email protected]
>> Reported-by: Yihuang Yu <[email protected]>
>> Signed-off-by: Gavin Shan <[email protected]>
>> ---
>> drivers/virtio/virtio_ring.c | 12 +++++++++---
>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index 49299b1f9ec7..7d852811c912 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>
>> - /* Descriptors and available array need to be set before we expose the
>> - * new available array entries. */
>> - virtio_wmb(vq->weak_barriers);
>> + /*
>> + * Descriptors and available array need to be set before we expose
>> + * the new available array entries. virtio_wmb() should be enough
>> + * to ensuere the order theoretically. However, a stronger barrier
>> + * is needed by ARM64. Otherwise, the stale data can be observed
>> + * by the host (vhost). A stronger barrier should work for other
>> + * architectures, but performance loss is expected.
>> + */
>> + virtio_mb(false);
>> vq->split.avail_idx_shadow++;
>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>> vq->split.avail_idx_shadow);
>
> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> here, especially when ordering accesses to coherent memory.
>
> In practice, either the larger timing different from the DSB or the fact
> that you're going from a Store->Store barrier to a full barrier is what
> makes things "work" for you. Have you tried, for example, a DMB SY
> (e.g. via __smb_mb()).
>
> We definitely shouldn't take changes like this without a proper
> explanation of what is going on.
>

Thanks for your comments, Will.

Yes, DMB should work for us. However, it seems this instruction has issues on
NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
at the hardware level. I agree that replacing DMB with DSB isn't the solution
before we fully understand the root cause.

I tried the possible replacements below. __smp_mb() can avoid the issue, as
__mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) can't.

static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
{
        :
        /* Put entry in available array (but don't update avail->idx until they
         * do sync). */
        avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
        vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

        /* Descriptors and available array need to be set before we expose the
         * new available array entries. */
        // Broken: virtio_wmb(vq->weak_barriers);
        // Broken: __dma_mb();
        // Work:   __mb();
        // Work:   __smp_mb();
        // Work:   __ndelay(100);
        // Work:   __ndelay(10);
        // Broken: __ndelay(9);

        vq->split.avail_idx_shadow++;
        vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
                                                     vq->split.avail_idx_shadow);
        vq->num_added++;

        pr_debug("Added buffer head %i to %p\n", head, vq);
        END_USE(vq);
        :
}

I also tried to measure the time consumed by various barrier-related
instructions using ktime_get_ns(), though the measurement itself likely
accounts for most of the reported time. __smp_mb() is slower than
__smp_wmb() but faster than __mb():

Instruction Range of used time in ns
----------------------------------------------
__smp_wmb() [32 1128032]
__smp_mb() [32 1160096]
__mb() [32 1162496]
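
For cross-reference, my understanding of how these helpers expand on arm64,
based on arch/arm64/include/asm/barrier.h (worth double-checking against the
exact kernel build):

/*
 * __smp_wmb()  ->  dmb ishst   (inner shareable, store-store)  broken here
 * __dma_mb()   ->  dmb osh     (outer shareable, full)         broken here
 * __smp_mb()   ->  dmb ish     (inner shareable, full)         works
 * __mb()       ->  dsb sy      (full system, completion)       works
 */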

Thanks,
Gavin


2024-03-19 06:09:52

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> On 3/19/24 02:59, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > >
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe a error message spew from VM and then soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordred write to the available ring element and available index.
> > >
> > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > > -accel kvm -machine virt,gic-version=host \
> > > : \
> > > -netdev tap,id=vnet0,vhost=on \
> > > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > >
> > > [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > >
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > >
> > > Cc: [email protected]
> > > Reported-by: Yihuang Yu <[email protected]>
> > > Signed-off-by: Gavin Shan <[email protected]>
> > > ---
> > > drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > - * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > + * Descriptors and available array need to be set before we expose
> > > + * the new available array entries. virtio_wmb() should be enough
> > > + * to ensuere the order theoretically. However, a stronger barrier
> > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > + * by the host (vhost). A stronger barrier should work for other
> > > + * architectures, but performance loss is expected.
> > > + */
> > > + virtio_mb(false);
> > > vq->split.avail_idx_shadow++;
> > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > vq->split.avail_idx_shadow);
> >
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> >
> > In practice, either the larger timing different from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smb_mb()).
> >
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> >
>
> Thanks for your comments, Will.
>
> Yes, DMB should work for us. However, it seems this instruction has issues on
> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> from hardware level. I agree it's not the solution to replace DMB with DSB
> before we fully understand the root cause.
>
> I tried the possible replacement like below. __smp_mb() can avoid the issue like
> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>
> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> {
> :
> /* Put entry in available array (but don't update avail->idx until they
> * do sync). */
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>
> /* Descriptors and available array need to be set before we expose the
> * new available array entries. */
> // Broken: virtio_wmb(vq->weak_barriers);
> // Broken: __dma_mb();
> // Work: __mb();
> // Work: __smp_mb();
> // Work: __ndelay(100);
> // Work: __ndelay(10);
> // Broken: __ndelay(9);
>
> vq->split.avail_idx_shadow++;
> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> vq->split.avail_idx_shadow);

What if you stick __ndelay here?


> vq->num_added++;
>
> pr_debug("Added buffer head %i to %p\n", head, vq);
> END_USE(vq);
> :
> }
>
> I also tried to measure the consumed time for various barrier-relative instructions using
> ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
> __smp_wmb() but faster than __mb()
>
> Instruction Range of used time in ns
> ----------------------------------------------
> __smp_wmb() [32 1128032]
> __smp_mb() [32 1160096]
> __mb() [32 1162496]
>
> Thanks,
> Gavin


2024-03-19 06:11:12

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > On 3/19/24 02:59, Will Deacon wrote:
> > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > > client is started in the VM hosted by grace-hopper machine,
> > > > while the 'netperf' server is running on grace-grace machine.
> > > >
> > > > The VM is started with virtio-net and vhost has been enabled.
> > > > We observe a error message spew from VM and then soft-lockup
> > > > report. The error message indicates the data associated with
> > > > the descriptor (index: 135) has been released, and the queue
> > > > is marked as broken. It eventually leads to the endless effort
> > > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > > and soft-lockup. The stale index 135 is fetched from the available
> > > > ring and published to the used ring by vhost, meaning we have
> > > > disordred write to the available ring element and available index.
> > > >
> > > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > > > -accel kvm -machine virt,gic-version=host \
> > > > : \
> > > > -netdev tap,id=vnet0,vhost=on \
> > > > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > >
> > > > [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > >
> > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > > ARM64. It should work for other architectures, but performance loss is
> > > > expected.
> > > >
> > > > Cc: [email protected]
> > > > Reported-by: Yihuang Yu <[email protected]>
> > > > Signed-off-by: Gavin Shan <[email protected]>
> > > > ---
> > > > drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 49299b1f9ec7..7d852811c912 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > - /* Descriptors and available array need to be set before we expose the
> > > > - * new available array entries. */
> > > > - virtio_wmb(vq->weak_barriers);
> > > > + /*
> > > > + * Descriptors and available array need to be set before we expose
> > > > + * the new available array entries. virtio_wmb() should be enough
> > > > + * to ensuere the order theoretically. However, a stronger barrier
> > > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > > + * by the host (vhost). A stronger barrier should work for other
> > > > + * architectures, but performance loss is expected.
> > > > + */
> > > > + virtio_mb(false);
> > > > vq->split.avail_idx_shadow++;
> > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > vq->split.avail_idx_shadow);
> > >
> > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > here, especially when ordering accesses to coherent memory.
> > >
> > > In practice, either the larger timing different from the DSB or the fact
> > > that you're going from a Store->Store barrier to a full barrier is what
> > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > (e.g. via __smb_mb()).
> > >
> > > We definitely shouldn't take changes like this without a proper
> > > explanation of what is going on.
> > >
> >
> > Thanks for your comments, Will.
> >
> > Yes, DMB should work for us. However, it seems this instruction has issues on
> > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> > from hardware level. I agree it's not the solution to replace DMB with DSB
> > before we fully understand the root cause.
> >
> > I tried the possible replacement like below. __smp_mb() can avoid the issue like
> > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> >
> > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > {
> > :
> > /* Put entry in available array (but don't update avail->idx until they
> > * do sync). */
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> >
> > /* Descriptors and available array need to be set before we expose the
> > * new available array entries. */
> > // Broken: virtio_wmb(vq->weak_barriers);
> > // Broken: __dma_mb();
> > // Work: __mb();
> > // Work: __smp_mb();

Did you try __smp_wmb? And wmb?

> > // Work: __ndelay(100);
> > // Work: __ndelay(10);
> > // Broken: __ndelay(9);
> >
> > vq->split.avail_idx_shadow++;
> > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > vq->split.avail_idx_shadow);
>
> What if you stick __ndelay here?

And keep virtio_wmb above?

>
> > vq->num_added++;
> >
> > pr_debug("Added buffer head %i to %p\n", head, vq);
> > END_USE(vq);
> > :
> > }
> >
> > I also tried to measure the consumed time for various barrier-relative instructions using
> > ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
> > __smp_wmb() but faster than __mb()
> >
> > Instruction Range of used time in ns
> > ----------------------------------------------
> > __smp_wmb() [32 1128032]
> > __smp_mb() [32 1160096]
> > __mb() [32 1162496]
> >
> > Thanks,
> > Gavin


2024-03-19 06:14:44

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue is reported by Yihuang Yu who have 'netperf' test on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in the VM hosted by grace-hopper machine,
> while the 'netperf' server is running on grace-grace machine.
>
> The VM is started with virtio-net and vhost has been enabled.
> We observe a error message spew from VM and then soft-lockup
> report. The error message indicates the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to the endless effort
> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordred write to the available ring element and available index.
>
> /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> -accel kvm -machine virt,gic-version=host \
> : \
> -netdev tap,id=vnet0,vhost=on \
> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
>
> [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
>
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> ARM64. It should work for other architectures, but performance loss is
> expected.
>
> Cc: [email protected]
> Reported-by: Yihuang Yu <[email protected]>
> Signed-off-by: Gavin Shan <[email protected]>
> ---
> drivers/virtio/virtio_ring.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>
> - /* Descriptors and available array need to be set before we expose the
> - * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> + * Descriptors and available array need to be set before we expose
> + * the new available array entries. virtio_wmb() should be enough
> + * to ensuere the order theoretically. However, a stronger barrier
> + * is needed by ARM64. Otherwise, the stale data can be observed
> + * by the host (vhost). A stronger barrier should work for other
> + * architectures, but performance loss is expected.
> + */
> + virtio_mb(false);
> vq->split.avail_idx_shadow++;
> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> vq->split.avail_idx_shadow);



Something else to try is to disassemble the code and check that the compiler
is not broken.

It also might help to replace the assignments above with WRITE_ONCE -
technically that has always been the right thing to do; it's just a big
change (has to be done everywhere if done at all), so we never bothered,
and we never hit a compiler that would split or speculate stores ...
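
Something like this for the two stores in question - an illustrative
sketch rather than a tested patch:

        avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
        WRITE_ONCE(vq->split.vring.avail->ring[avail],
                   cpu_to_virtio16(_vq->vdev, head));

        virtio_wmb(vq->weak_barriers);
        vq->split.avail_idx_shadow++;
        WRITE_ONCE(vq->split.vring.avail->idx,
                   cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow));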


> --
> 2.44.0


2024-03-19 06:41:54

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/19/24 16:09, Michael S. Tsirkin wrote:

>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>> index 49299b1f9ec7..7d852811c912 100644
>>>> --- a/drivers/virtio/virtio_ring.c
>>>> +++ b/drivers/virtio/virtio_ring.c
>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>> - /* Descriptors and available array need to be set before we expose the
>>>> - * new available array entries. */
>>>> - virtio_wmb(vq->weak_barriers);
>>>> + /*
>>>> + * Descriptors and available array need to be set before we expose
>>>> + * the new available array entries. virtio_wmb() should be enough
>>>> + * to ensuere the order theoretically. However, a stronger barrier
>>>> + * is needed by ARM64. Otherwise, the stale data can be observed
>>>> + * by the host (vhost). A stronger barrier should work for other
>>>> + * architectures, but performance loss is expected.
>>>> + */
>>>> + virtio_mb(false);
>>>> vq->split.avail_idx_shadow++;
>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>> vq->split.avail_idx_shadow);
>>>
>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>> here, especially when ordering accesses to coherent memory.
>>>
>>> In practice, either the larger timing different from the DSB or the fact
>>> that you're going from a Store->Store barrier to a full barrier is what
>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>> (e.g. via __smb_mb()).
>>>
>>> We definitely shouldn't take changes like this without a proper
>>> explanation of what is going on.
>>>
>>
>> Thanks for your comments, Will.
>>
>> Yes, DMB should work for us. However, it seems this instruction has issues on
>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
>> from hardware level. I agree it's not the solution to replace DMB with DSB
>> before we fully understand the root cause.
>>
>> I tried the possible replacement like below. __smp_mb() can avoid the issue like
>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>
>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>> {
>> :
>> /* Put entry in available array (but don't update avail->idx until they
>> * do sync). */
>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>
>> /* Descriptors and available array need to be set before we expose the
>> * new available array entries. */
>> // Broken: virtio_wmb(vq->weak_barriers);
>> // Broken: __dma_mb();
>> // Work: __mb();
>> // Work: __smp_mb();
>> // Work: __ndelay(100);
>> // Work: __ndelay(10);
>> // Broken: __ndelay(9);
>>
>> vq->split.avail_idx_shadow++;
>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>> vq->split.avail_idx_shadow);
>
> What if you stick __ndelay here?
>

        /* Put entry in available array (but don't update avail->idx until they
         * do sync). */
        avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
        vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

        /* Descriptors and available array need to be set before we expose the
         * new available array entries. */
        virtio_wmb(vq->weak_barriers);
        vq->split.avail_idx_shadow++;
        vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
                                                     vq->split.avail_idx_shadow);
        /* Try __ndelay(x) here as Michael suggested
         *
         * Work:   __ndelay(200); possibly just makes it harder to reproduce
         * Broken: __ndelay(100);
         * Broken: __ndelay(20);
         * Broken: __ndelay(10);
         */
        __ndelay(200);


>
>> vq->num_added++;
>>
>> pr_debug("Added buffer head %i to %p\n", head, vq);
>> END_USE(vq);
>> :
>> }
>>
>> I also tried to measure the consumed time for various barrier-relative instructions using
>> ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
>> __smp_wmb() but faster than __mb()
>>
>> Instruction Range of used time in ns
>> ----------------------------------------------
>> __smp_wmb() [32 1128032]
>> __smp_mb() [32 1160096]
>> __mb() [32 1162496]
>>

Thanks,
Gavin


2024-03-19 06:44:04

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> On 3/19/24 16:09, Michael S. Tsirkin wrote:
>
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > - * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > + * Descriptors and available array need to be set before we expose
> > > > > + * the new available array entries. virtio_wmb() should be enough
> > > > > + * to ensuere the order theoretically. However, a stronger barrier
> > > > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > + * by the host (vhost). A stronger barrier should work for other
> > > > > + * architectures, but performance loss is expected.
> > > > > + */
> > > > > + virtio_mb(false);
> > > > > vq->split.avail_idx_shadow++;
> > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > vq->split.avail_idx_shadow);
> > > >
> > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > here, especially when ordering accesses to coherent memory.
> > > >
> > > > In practice, either the larger timing different from the DSB or the fact
> > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > (e.g. via __smb_mb()).
> > > >
> > > > We definitely shouldn't take changes like this without a proper
> > > > explanation of what is going on.
> > > >
> > >
> > > Thanks for your comments, Will.
> > >
> > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > before we fully understand the root cause.
> > >
> > > I tried the possible replacement like below. __smp_mb() can avoid the issue like
> > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > >
> > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > {
> > > :
> > > /* Put entry in available array (but don't update avail->idx until they
> > > * do sync). */
> > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > >
> > > /* Descriptors and available array need to be set before we expose the
> > > * new available array entries. */
> > > // Broken: virtio_wmb(vq->weak_barriers);
> > > // Broken: __dma_mb();
> > > // Work: __mb();
> > > // Work: __smp_mb();
> > > // Work: __ndelay(100);
> > > // Work: __ndelay(10);
> > > // Broken: __ndelay(9);
> > >
> > > vq->split.avail_idx_shadow++;
> > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > vq->split.avail_idx_shadow);
> >
> > What if you stick __ndelay here?
> >
>
> /* Put entry in available array (but don't update avail->idx until they
> * do sync). */
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>
> /* Descriptors and available array need to be set before we expose the
> * new available array entries. */
> virtio_wmb(vq->weak_barriers);
> vq->split.avail_idx_shadow++;
> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> vq->split.avail_idx_shadow);
> /* Try __ndelay(x) here as Michael suggested
> *
> * Work: __ndelay(200); possiblly make it hard to reproduce
> * Broken: __ndelay(100);
> * Broken: __ndelay(20);
> * Broken: __ndelay(10);
> */
> __ndelay(200);

So we see that just changing the timing masks the race.
What are you using on the host side? vhost or qemu?

>
> >
> > > vq->num_added++;
> > >
> > > pr_debug("Added buffer head %i to %p\n", head, vq);
> > > END_USE(vq);
> > > :
> > > }
> > >
> > > I also tried to measure the consumed time for various barrier-relative instructions using
> > > ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
> > > __smp_wmb() but faster than __mb()
> > >
> > > Instruction Range of used time in ns
> > > ----------------------------------------------
> > > __smp_wmb() [32 1128032]
> > > __smp_mb() [32 1160096]
> > > __mb() [32 1162496]
> > >
>
> Thanks,
> Gavin


2024-03-19 06:50:09

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring


On 3/19/24 16:43, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
>> On 3/19/24 16:09, Michael S. Tsirkin wrote:
>>
>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>> - /* Descriptors and available array need to be set before we expose the
>>>>>> - * new available array entries. */
>>>>>> - virtio_wmb(vq->weak_barriers);
>>>>>> + /*
>>>>>> + * Descriptors and available array need to be set before we expose
>>>>>> + * the new available array entries. virtio_wmb() should be enough
>>>>>> + * to ensuere the order theoretically. However, a stronger barrier
>>>>>> + * is needed by ARM64. Otherwise, the stale data can be observed
>>>>>> + * by the host (vhost). A stronger barrier should work for other
>>>>>> + * architectures, but performance loss is expected.
>>>>>> + */
>>>>>> + virtio_mb(false);
>>>>>> vq->split.avail_idx_shadow++;
>>>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>> vq->split.avail_idx_shadow);
>>>>>
>>>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>>>> here, especially when ordering accesses to coherent memory.
>>>>>
>>>>> In practice, either the larger timing different from the DSB or the fact
>>>>> that you're going from a Store->Store barrier to a full barrier is what
>>>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>>>> (e.g. via __smb_mb()).
>>>>>
>>>>> We definitely shouldn't take changes like this without a proper
>>>>> explanation of what is going on.
>>>>>
>>>>
>>>> Thanks for your comments, Will.
>>>>
>>>> Yes, DMB should work for us. However, it seems this instruction has issues on
>>>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
>>>> from hardware level. I agree it's not the solution to replace DMB with DSB
>>>> before we fully understand the root cause.
>>>>
>>>> I tried the possible replacement like below. __smp_mb() can avoid the issue like
>>>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>>>
>>>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>>>> {
>>>> :
>>>> /* Put entry in available array (but don't update avail->idx until they
>>>> * do sync). */
>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>
>>>> /* Descriptors and available array need to be set before we expose the
>>>> * new available array entries. */
>>>> // Broken: virtio_wmb(vq->weak_barriers);
>>>> // Broken: __dma_mb();
>>>> // Work: __mb();
>>>> // Work: __smp_mb();
>>>> // Work: __ndelay(100);
>>>> // Work: __ndelay(10);
>>>> // Broken: __ndelay(9);
>>>>
>>>> vq->split.avail_idx_shadow++;
>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>> vq->split.avail_idx_shadow);
>>>
>>> What if you stick __ndelay here?
>>>
>>
>> /* Put entry in available array (but don't update avail->idx until they
>> * do sync). */
>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>
>> /* Descriptors and available array need to be set before we expose the
>> * new available array entries. */
>> virtio_wmb(vq->weak_barriers);
>> vq->split.avail_idx_shadow++;
>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>> vq->split.avail_idx_shadow);
>> /* Try __ndelay(x) here as Michael suggested
>> *
>> * Work: __ndelay(200); possiblly make it hard to reproduce
>> * Broken: __ndelay(100);
>> * Broken: __ndelay(20);
>> * Broken: __ndelay(10);
>> */
>> __ndelay(200);
>
> So we see that just changing the timing masks the race.
> What are you using on the host side? vhost or qemu?
>

__ndelay(200) may just make the issue harder to reproduce, as I understand it.
More delay here gives vhost relief, reducing the race.

The issue is only reproducible when vhost is turned on. Otherwise, we
aren't able to hit it.

-netdev tap,id=vnet0,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0

>>
>>>
>>>> vq->num_added++;
>>>>
>>>> pr_debug("Added buffer head %i to %p\n", head, vq);
>>>> END_USE(vq);
>>>> :
>>>> }
>>>>
>>>> I also tried to measure the consumed time for various barrier-relative instructions using
>>>> ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
>>>> __smp_wmb() but faster than __mb()
>>>>
>>>> Instruction Range of used time in ns
>>>> ----------------------------------------------
>>>> __smp_wmb() [32 1128032]
>>>> __smp_mb() [32 1160096]
>>>> __mb() [32 1162496]
>>>>

Thanks,
Gavin


2024-03-19 06:54:37

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/19/24 16:10, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
>> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
>>> On 3/19/24 02:59, Will Deacon wrote:
[...]
>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>> - /* Descriptors and available array need to be set before we expose the
>>>>> - * new available array entries. */
>>>>> - virtio_wmb(vq->weak_barriers);
>>>>> + /*
>>>>> + * Descriptors and available array need to be set before we expose
>>>>> + * the new available array entries. virtio_wmb() should be enough
>>>>> + * to ensuere the order theoretically. However, a stronger barrier
>>>>> + * is needed by ARM64. Otherwise, the stale data can be observed
>>>>> + * by the host (vhost). A stronger barrier should work for other
>>>>> + * architectures, but performance loss is expected.
>>>>> + */
>>>>> + virtio_mb(false);
>>>>> vq->split.avail_idx_shadow++;
>>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>> vq->split.avail_idx_shadow);
>>>>
>>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>>> here, especially when ordering accesses to coherent memory.
>>>>
>>>> In practice, either the larger timing different from the DSB or the fact
>>>> that you're going from a Store->Store barrier to a full barrier is what
>>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>>> (e.g. via __smb_mb()).
>>>>
>>>> We definitely shouldn't take changes like this without a proper
>>>> explanation of what is going on.
>>>>
>>>
>>> Thanks for your comments, Will.
>>>
>>> Yes, DMB should work for us. However, it seems this instruction has issues on
>>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
>>> from hardware level. I agree it's not the solution to replace DMB with DSB
>>> before we fully understand the root cause.
>>>
>>> I tried the possible replacement like below. __smp_mb() can avoid the issue like
>>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>>
>>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>>> {
>>> :
>>> /* Put entry in available array (but don't update avail->idx until they
>>> * do sync). */
>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>
>>> /* Descriptors and available array need to be set before we expose the
>>> * new available array entries. */
>>> // Broken: virtio_wmb(vq->weak_barriers);
>>> // Broken: __dma_mb();
>>> // Work: __mb();
>>> // Work: __smp_mb();
>
> Did you try __smp_wmb ? And wmb?
>

virtio_wmb(false) is equivalent to __smp_wmb(), which is broken.

__wmb() also works. No issue found with it.

>>> // Work: __ndelay(100);
>>> // Work: __ndelay(10);
>>> // Broken: __ndelay(9);
>>>
>>> vq->split.avail_idx_shadow++;
>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>> vq->split.avail_idx_shadow);
>>
>> What if you stick __ndelay here?
>
> And keep virtio_wmb above?
>

The result has been shared through a separate reply.

>>
>>> vq->num_added++;
>>>
>>> pr_debug("Added buffer head %i to %p\n", head, vq);
>>> END_USE(vq);
>>> :
>>> }
>>>
>>> I also tried to measure the consumed time for various barrier-relative instructions using
>>> ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
>>> __smp_wmb() but faster than __mb()
>>>
>>> Instruction Range of used time in ns
>>> ----------------------------------------------
>>> __smp_wmb() [32 1128032]
>>> __smp_mb() [32 1160096]
>>> __mb() [32 1162496]
>>>

Thanks,
Gavin


2024-03-19 07:07:38

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > > - * new available array entries. */
> > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > + /*
> > > > > > + * Descriptors and available array need to be set before we expose
> > > > > > + * the new available array entries. virtio_wmb() should be enough
> > > > > > + * to ensuere the order theoretically. However, a stronger barrier
> > > > > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > + * by the host (vhost). A stronger barrier should work for other
> > > > > > + * architectures, but performance loss is expected.
> > > > > > + */
> > > > > > + virtio_mb(false);
> > > > > > vq->split.avail_idx_shadow++;
> > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > vq->split.avail_idx_shadow);
> > > > >
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > >
> > > > > In practice, either the larger timing different from the DSB or the fact
> > > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smb_mb()).
> > > > >
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > >
> > > >
> > > > Thanks for your comments, Will.
> > > >
> > > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> > > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > > before we fully understand the root cause.
> > > >
> > > > I tried the possible replacement like below. __smp_mb() can avoid the issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > >
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > > :
> > > > /* Put entry in available array (but don't update avail->idx until they
> > > > * do sync). */
> > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > >
> > > > /* Descriptors and available array need to be set before we expose the
> > > > * new available array entries. */
> > > > // Broken: virtio_wmb(vq->weak_barriers);
> > > > // Broken: __dma_mb();
> > > > // Work: __mb();
> > > > // Work: __smp_mb();
> >
> > Did you try __smp_wmb ? And wmb?
> >
>
> virtio_wmb(false) is equivalent to __smb_wmb(), which is broken.
>
> __wmb() works either. No issue found with it.

Oh interesting. So how do smp_mb() and wmb() disassemble on this
platform? Can you please check?


> > > > // Work: __ndelay(100);
> > > > // Work: __ndelay(10);
> > > > // Broken: __ndelay(9);
> > > >
> > > > vq->split.avail_idx_shadow++;
> > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > vq->split.avail_idx_shadow);
> > >
> > > What if you stick __ndelay here?
> >
> > And keep virtio_wmb above?
> >
>
> The result has been shared through a separate reply.
>
> > >
> > > > vq->num_added++;
> > > >
> > > > pr_debug("Added buffer head %i to %p\n", head, vq);
> > > > END_USE(vq);
> > > > :
> > > > }
> > > >
> > > > I also tried to measure the consumed time for various barrier-relative instructions using
> > > > ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
> > > > __smp_wmb() but faster than __mb()
> > > >
> > > > Instruction Range of used time in ns
> > > > ----------------------------------------------
> > > > __smp_wmb() [32 1128032]
> > > > __smp_mb() [32 1160096]
> > > > __mb() [32 1162496]
> > > >
>
> Thanks,
> Gavin


2024-03-19 07:09:29

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
>
> On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > >
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > > > - * new available array entries. */
> > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > + /*
> > > > > > > + * Descriptors and available array need to be set before we expose
> > > > > > > + * the new available array entries. virtio_wmb() should be enough
> > > > > > > + * to ensuere the order theoretically. However, a stronger barrier
> > > > > > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > > + * by the host (vhost). A stronger barrier should work for other
> > > > > > > + * architectures, but performance loss is expected.
> > > > > > > + */
> > > > > > > + virtio_mb(false);
> > > > > > > vq->split.avail_idx_shadow++;
> > > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > > vq->split.avail_idx_shadow);
> > > > > >
> > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > > > here, especially when ordering accesses to coherent memory.
> > > > > >
> > > > > > In practice, either the larger timing different from the DSB or the fact
> > > > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > > (e.g. via __smb_mb()).
> > > > > >
> > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > explanation of what is going on.
> > > > > >
> > > > >
> > > > > Thanks for your comments, Will.
> > > > >
> > > > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> > > > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > > > before we fully understand the root cause.
> > > > >
> > > > > I tried the possible replacement like below. __smp_mb() can avoid the issue like
> > > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > >
> > > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > > {
> > > > > :
> > > > > /* Put entry in available array (but don't update avail->idx until they
> > > > > * do sync). */
> > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > >
> > > > > /* Descriptors and available array need to be set before we expose the
> > > > > * new available array entries. */
> > > > > // Broken: virtio_wmb(vq->weak_barriers);
> > > > > // Broken: __dma_mb();
> > > > > // Work: __mb();
> > > > > // Work: __smp_mb();
> > > > > // Work: __ndelay(100);
> > > > > // Work: __ndelay(10);
> > > > > // Broken: __ndelay(9);
> > > > >
> > > > > vq->split.avail_idx_shadow++;
> > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > vq->split.avail_idx_shadow);
> > > >
> > > > What if you stick __ndelay here?
> > > >
> > >
> > > /* Put entry in available array (but don't update avail->idx until they
> > > * do sync). */
> > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > >
> > > /* Descriptors and available array need to be set before we expose the
> > > * new available array entries. */
> > > virtio_wmb(vq->weak_barriers);
> > > vq->split.avail_idx_shadow++;
> > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > vq->split.avail_idx_shadow);
> > > /* Try __ndelay(x) here as Michael suggested
> > > *
> > > * Work: __ndelay(200); possiblly make it hard to reproduce
> > > * Broken: __ndelay(100);
> > > * Broken: __ndelay(20);
> > > * Broken: __ndelay(10);
> > > */
> > > __ndelay(200);
> >
> > So we see that just changing the timing masks the race.
> > What are you using on the host side? vhost or qemu?
> >
>
> __ndelay(200) may make the issue harder to be reproduce as I understand.
> More delays here will give vhost relief, reducing the race.
>
> The issue is only reproducible when vhost is turned on. Otherwise, we
> aren't able to hit the issue.
>
> -netdev tap,id=vnet0,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0


Given it's vhost, it's also possible that the issue is on the host side.
I wonder what happens if we stick a delay or a stronger barrier
in vhost.c - either here:

/* Make sure buffer is written before we update index. */
smp_wmb();


or here:

/* Only get avail ring entries after they have been
* exposed by guest.
*/
smp_rmb();

?


> > >
> > > >
> > > > > vq->num_added++;
> > > > >
> > > > > pr_debug("Added buffer head %i to %p\n", head, vq);
> > > > > END_USE(vq);
> > > > > :
> > > > > }
> > > > >
> > > > > I also tried to measure the consumed time for various barrier-relative instructions using
> > > > > ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
> > > > > __smp_wmb() but faster than __mb()
> > > > >
> > > > > Instruction Range of used time in ns
> > > > > ----------------------------------------------
> > > > > __smp_wmb() [32 1128032]
> > > > > __smp_mb() [32 1160096]
> > > > > __mb() [32 1162496]
> > > > >
>
> Thanks,
> Gavin


2024-03-19 07:36:50

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Mon, Mar 18, 2024 at 04:59:24PM +0000, Will Deacon wrote:
> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > The issue is reported by Yihuang Yu who have 'netperf' test on
> > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > client is started in the VM hosted by grace-hopper machine,
> > while the 'netperf' server is running on grace-grace machine.
> >
> > The VM is started with virtio-net and vhost has been enabled.
> > We observe a error message spew from VM and then soft-lockup
> > report. The error message indicates the data associated with
> > the descriptor (index: 135) has been released, and the queue
> > is marked as broken. It eventually leads to the endless effort
> > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > and soft-lockup. The stale index 135 is fetched from the available
> > ring and published to the used ring by vhost, meaning we have
> > disordred write to the available ring element and available index.
> >
> > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > -accel kvm -machine virt,gic-version=host \
> > : \
> > -netdev tap,id=vnet0,vhost=on \
> > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> >
> > [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> >
> > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > ARM64. It should work for other architectures, but performance loss is
> > expected.
> >
> > Cc: [email protected]
> > Reported-by: Yihuang Yu <[email protected]>
> > Signed-off-by: Gavin Shan <[email protected]>
> > ---
> > drivers/virtio/virtio_ring.c | 12 +++++++++---
> > 1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 49299b1f9ec7..7d852811c912 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> >
> > - /* Descriptors and available array need to be set before we expose the
> > - * new available array entries. */
> > - virtio_wmb(vq->weak_barriers);
> > + /*
> > + * Descriptors and available array need to be set before we expose
> > + * the new available array entries. virtio_wmb() should be enough
> > + * to ensuere the order theoretically. However, a stronger barrier
> > + * is needed by ARM64. Otherwise, the stale data can be observed
> > + * by the host (vhost). A stronger barrier should work for other
> > + * architectures, but performance loss is expected.
> > + */
> > + virtio_mb(false);
> > vq->split.avail_idx_shadow++;
> > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > vq->split.avail_idx_shadow);
>
> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> here, especially when ordering accesses to coherent memory.
>
> In practice, either the larger timing different from the DSB or the fact
> that you're going from a Store->Store barrier to a full barrier is what
> makes things "work" for you. Have you tried, for example, a DMB SY
> (e.g. via __smb_mb()).
>
> We definitely shouldn't take changes like this without a proper
> explanation of what is going on.
>
> Will

Just making sure: so on this system, how do
smp_wmb() and wmb() differ? smp_wmb() is normally for synchronizing
with a kernel running on another CPU, and we are doing something
unusual in virtio when we use it to synchronize with the host
as opposed to the guest - e.g. CONFIG_SMP is special cased
because of this:

#define virt_wmb() do { kcsan_wmb(); __smp_wmb(); } while (0)

Note __smp_wmb not smp_wmb which would be a NOP on UP.
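
For context, virtio_wmb() picks between the two flavours at run time.
A rough sketch of the definition (paraphrased from
include/linux/virtio_ring.h - check the exact tree):

static inline void virtio_wmb(bool weak_barriers)
{
	if (weak_barriers)
		virt_wmb();	/* __smp_wmb(): dmb(ishst) on arm64 */
	else
		dma_wmb();	/* __dma_wmb(): dmb(oshst) on arm64 */
}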


--
MST


2024-03-19 07:41:57

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/19/24 17:04, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
>> On 3/19/24 16:10, Michael S. Tsirkin wrote:
>>> On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
>>>> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
>>>>> On 3/19/24 02:59, Will Deacon wrote:
>> [...]
>>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>>> - /* Descriptors and available array need to be set before we expose the
>>>>>>> - * new available array entries. */
>>>>>>> - virtio_wmb(vq->weak_barriers);
>>>>>>> + /*
>>>>>>> + * Descriptors and available array need to be set before we expose
>>>>>>> + * the new available array entries. virtio_wmb() should be enough
>>>>>>> + * to ensuere the order theoretically. However, a stronger barrier
>>>>>>> + * is needed by ARM64. Otherwise, the stale data can be observed
>>>>>>> + * by the host (vhost). A stronger barrier should work for other
>>>>>>> + * architectures, but performance loss is expected.
>>>>>>> + */
>>>>>>> + virtio_mb(false);
>>>>>>> vq->split.avail_idx_shadow++;
>>>>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>>> vq->split.avail_idx_shadow);
>>>>>>
>>>>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>>>>> here, especially when ordering accesses to coherent memory.
>>>>>>
>>>>>> In practice, either the larger timing different from the DSB or the fact
>>>>>> that you're going from a Store->Store barrier to a full barrier is what
>>>>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>>>>> (e.g. via __smb_mb()).
>>>>>>
>>>>>> We definitely shouldn't take changes like this without a proper
>>>>>> explanation of what is going on.
>>>>>>
>>>>>
>>>>> Thanks for your comments, Will.
>>>>>
>>>>> Yes, DMB should work for us. However, it seems this instruction has issues on
>>>>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
>>>>> from hardware level. I agree it's not the solution to replace DMB with DSB
>>>>> before we fully understand the root cause.
>>>>>
>>>>> I tried the possible replacement like below. __smp_mb() can avoid the issue like
>>>>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>>>>
>>>>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>>>>> {
>>>>> :
>>>>> /* Put entry in available array (but don't update avail->idx until they
>>>>> * do sync). */
>>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>
>>>>> /* Descriptors and available array need to be set before we expose the
>>>>> * new available array entries. */
>>>>> // Broken: virtio_wmb(vq->weak_barriers);
>>>>> // Broken: __dma_mb();
>>>>> // Work: __mb();
>>>>> // Work: __smp_mb();
>>>
>>> Did you try __smp_wmb ? And wmb?
>>>
>>
>> virtio_wmb(false) is equivalent to __smb_wmb(), which is broken.
>>
>> __wmb() works either. No issue found with it.
>
> Oh interesting. So how do smp_mb() and wmb() disassemble on this
> platform? Can you please check?
>

I don't see that they have been translated incorrectly on Nvidia's grace-hopper:

===> virtio_wmb(vq->weak_barriers)

0xffff8000807b07c8 <+1168>: ldrb w0, [x20, #66]
0xffff8000807b07cc <+1172>: cbz w0, 0xffff8000807b089c <virtqueue_add_split+1380>
0xffff8000807b07d0 <+1176>: dmb ishst // same as __smp_wmb()
:
0xffff8000807b089c <+1380>: dmb oshst // same as __dma_wmb()
0xffff8000807b08a0 <+1384>: b 0xffff8000807b07d4 <virtqueue_add_split+1180>

===> wmb()

0xffff8000807b07c8 <+1168>: dsb st
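
(For reference, disassembly like the above can be produced with
something along these lines - a hypothetical invocation, since the
thread does not say exactly how it was obtained:

gdb vmlinux
(gdb) disassemble virtqueue_add_split

This assumes the kernel was built with debug info.)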


>
>>>>> // Work: __ndelay(100);
>>>>> // Work: __ndelay(10);
>>>>> // Broken: __ndelay(9);
>>>>>
>>>>> vq->split.avail_idx_shadow++;
>>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>> vq->split.avail_idx_shadow);
>>>>
>>>> What if you stick __ndelay here?
>>>
>>> And keep virtio_wmb above?
>>>
>>
>> The result has been shared through a separate reply.
>>
>>>>
>>>>> vq->num_added++;
>>>>>
>>>>> pr_debug("Added buffer head %i to %p\n", head, vq);
>>>>> END_USE(vq);
>>>>> :
>>>>> }
>>>>>
>>>>> I also tried to measure the consumed time for various barrier-relative instructions using
>>>>> ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
>>>>> __smp_wmb() but faster than __mb()
>>>>>
>>>>> Instruction Range of used time in ns
>>>>> ----------------------------------------------
>>>>> __smp_wmb() [32 1128032]
>>>>> __smp_mb() [32 1160096]
>>>>> __mb() [32 1162496]
>>>>>

Thanks,
Gavin


2024-03-19 08:12:04

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/19/24 17:09, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
>>
>> On 3/19/24 16:43, Michael S. Tsirkin wrote:
>>> On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
>>>> On 3/19/24 16:09, Michael S. Tsirkin wrote:
>>>>
>>>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>>>> index 49299b1f9ec7..7d852811c912 100644
>>>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>>>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>>>> - /* Descriptors and available array need to be set before we expose the
>>>>>>>> - * new available array entries. */
>>>>>>>> - virtio_wmb(vq->weak_barriers);
>>>>>>>> + /*
>>>>>>>> + * Descriptors and available array need to be set before we expose
>>>>>>>> + * the new available array entries. virtio_wmb() should be enough
>>>>>>>> + * to ensuere the order theoretically. However, a stronger barrier
>>>>>>>> + * is needed by ARM64. Otherwise, the stale data can be observed
>>>>>>>> + * by the host (vhost). A stronger barrier should work for other
>>>>>>>> + * architectures, but performance loss is expected.
>>>>>>>> + */
>>>>>>>> + virtio_mb(false);
>>>>>>>> vq->split.avail_idx_shadow++;
>>>>>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>>>> vq->split.avail_idx_shadow);
>>>>>>>
>>>>>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>>>>>> here, especially when ordering accesses to coherent memory.
>>>>>>>
>>>>>>> In practice, either the larger timing different from the DSB or the fact
>>>>>>> that you're going from a Store->Store barrier to a full barrier is what
>>>>>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>>>>>> (e.g. via __smb_mb()).
>>>>>>>
>>>>>>> We definitely shouldn't take changes like this without a proper
>>>>>>> explanation of what is going on.
>>>>>>>
>>>>>>
>>>>>> Thanks for your comments, Will.
>>>>>>
>>>>>> Yes, DMB should work for us. However, it seems this instruction has issues on
>>>>>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
>>>>>> from hardware level. I agree it's not the solution to replace DMB with DSB
>>>>>> before we fully understand the root cause.
>>>>>>
>>>>>> I tried the possible replacement like below. __smp_mb() can avoid the issue like
>>>>>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>>>>>
>>>>>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>>>>>> {
>>>>>> :
>>>>>> /* Put entry in available array (but don't update avail->idx until they
>>>>>> * do sync). */
>>>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>>>
>>>>>> /* Descriptors and available array need to be set before we expose the
>>>>>> * new available array entries. */
>>>>>> // Broken: virtio_wmb(vq->weak_barriers);
>>>>>> // Broken: __dma_mb();
>>>>>> // Work: __mb();
>>>>>> // Work: __smp_mb();
>>>>>> // Work: __ndelay(100);
>>>>>> // Work: __ndelay(10);
>>>>>> // Broken: __ndelay(9);
>>>>>>
>>>>>> vq->split.avail_idx_shadow++;
>>>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>>>> vq->split.avail_idx_shadow);
>>>>>
>>>>> What if you stick __ndelay here?
>>>>>
>>>>
>>>> /* Put entry in available array (but don't update avail->idx until they
>>>> * do sync). */
>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>>
>>>> /* Descriptors and available array need to be set before we expose the
>>>> * new available array entries. */
>>>> virtio_wmb(vq->weak_barriers);
>>>> vq->split.avail_idx_shadow++;
>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>> vq->split.avail_idx_shadow);
>>>> /* Try __ndelay(x) here as Michael suggested
>>>> *
>>>> * Work: __ndelay(200); possiblly make it hard to reproduce
>>>> * Broken: __ndelay(100);
>>>> * Broken: __ndelay(20);
>>>> * Broken: __ndelay(10);
>>>> */
>>>> __ndelay(200);
>>>
>>> So we see that just changing the timing masks the race.
>>> What are you using on the host side? vhost or qemu?
>>>
>>
>> __ndelay(200) may make the issue harder to be reproduce as I understand.
>> More delays here will give vhost relief, reducing the race.
>>
>> The issue is only reproducible when vhost is turned on. Otherwise, we
>> aren't able to hit the issue.
>>
>> -netdev tap,id=vnet0,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
>> -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>
>
> Given it's vhost, it's also possible that the issue is host side.
> I wonder what happens if we stick a delay or a stronger barrier
> in vhost.c - either here:
>
> /* Make sure buffer is written before we update index. */
> smp_wmb();
>
>
> or here:
>
> /* Only get avail ring entries after they have been
> * exposed by guest.
> */
> smp_rmb();
>
> ?
>

It's possible. However, I can still hit the issue after both of them are
replaced with '__mb()'. So the issue seems to be on the guest side, where the
written data isn't observed in time by the CPU on the far end (the vhost worker).

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..327b68d21b02 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2529,7 +2529,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* Only get avail ring entries after they have been
* exposed by guest.
*/
- smp_rmb();
+ // smp_rmb();
+ __mb();
}

/* Grab the next descriptor number they're advertising, and increment
@@ -2703,7 +2704,9 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
r = __vhost_add_used_n(vq, heads, count);

/* Make sure buffer is written before we update index. */
- smp_wmb();
+ // smp_wmb();
+ __mb();
+

From the guest:

[ 14.102608] virtio_net virtio0: output.0:id 80 is not a head!


>>>>
>>>>>
>>>>>> vq->num_added++;
>>>>>>
>>>>>> pr_debug("Added buffer head %i to %p\n", head, vq);
>>>>>> END_USE(vq);
>>>>>> :
>>>>>> }
>>>>>>
>>>>>> I also tried to measure the consumed time for various barrier-relative instructions using
>>>>>> ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
>>>>>> __smp_wmb() but faster than __mb()
>>>>>>
>>>>>> Instruction Range of used time in ns
>>>>>> ----------------------------------------------
>>>>>> __smp_wmb() [32 1128032]
>>>>>> __smp_mb() [32 1160096]
>>>>>> __mb() [32 1162496]
>>>>>>

Thanks,
Gavin


2024-03-19 08:29:40

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > > - * new available array entries. */
> > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > + /*
> > > > > > + * Descriptors and available array need to be set before we expose
> > > > > > + * the new available array entries. virtio_wmb() should be enough
> > > > > > + * to ensuere the order theoretically. However, a stronger barrier
> > > > > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > + * by the host (vhost). A stronger barrier should work for other
> > > > > > + * architectures, but performance loss is expected.
> > > > > > + */
> > > > > > + virtio_mb(false);
> > > > > > vq->split.avail_idx_shadow++;
> > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > vq->split.avail_idx_shadow);
> > > > >
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > >
> > > > > In practice, either the larger timing different from the DSB or the fact
> > > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smb_mb()).
> > > > >
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > >
> > > >
> > > > Thanks for your comments, Will.
> > > >
> > > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> > > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > > before we fully understand the root cause.
> > > >
> > > > I tried the possible replacement like below. __smp_mb() can avoid the issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > >
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > > :
> > > > /* Put entry in available array (but don't update avail->idx until they
> > > > * do sync). */
> > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > >
> > > > /* Descriptors and available array need to be set before we expose the
> > > > * new available array entries. */
> > > > // Broken: virtio_wmb(vq->weak_barriers);
> > > > // Broken: __dma_mb();
> > > > // Work: __mb();
> > > > // Work: __smp_mb();
> >
> > Did you try __smp_wmb ? And wmb?
> >
>
> virtio_wmb(false) is equivalent to __smb_wmb(), which is broken.


> __wmb() works either. No issue found with it.

So this is
arch/arm64/include/asm/barrier.h:#define __smp_wmb() dmb(ishst)

versus

arch/arm64/include/asm/barrier.h:#define __wmb() dsb(st)


right?

Really interesting. And you are saying dma_wmb does not work either:

arch/arm64/include/asm/barrier.h:#define __dma_wmb() dmb(oshst)


Really strange.
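
For quick reference, the arm64 mappings involved here, as defined in
arch/arm64/include/asm/barrier.h (summarized):

__smp_wmb() -> dmb(ishst) /* inner shareable, store->store */
__smp_mb() -> dmb(ish) /* inner shareable, full */
__dma_wmb() -> dmb(oshst) /* outer shareable, store->store */
__dma_mb() -> dmb(osh) /* outer shareable, full */
__wmb() -> dsb(st) /* DSB, stores only */
__mb() -> dsb(sy) /* DSB, full system */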




However I found this:
https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Memory-attributes/Cacheable-and-shareable-memory-attributes



Going by this picture, all CPUs are in the inner shareable domain, so
ishst should be enough to synchronize, right?


However, there are two points that give me pause here:


Inner shareable
This represents a shareability domain that can be shared by multiple
processors, but not necessarily all of the agents in the system. A
system might have multiple Inner Shareable domains. An operation that
affects one Inner Shareable domain does not affect other Inner Shareable
domains in the system. An example of such a domain might be a quad-core
Cortex-A57 cluster.


Point 1 - so is it possible that there are multiple inner
shareable domains in this system? With vhost running
inside one and the guest inside another? Does anyone know if that
is the case on NVIDIA Grace Hopper and how to find out?



Outer shareable
An outer shareable (OSH) domain is shared by multiple agents and
can consist of one or more inner shareable domains. An operation that
affects an outer shareable domain also implicitly affects all inner
shareable domains inside it. However, it does not otherwise behave as an
inner shareable operation.


I do not get this last sentence. If it affects all inner domains, then
how does it "not behave as an inner shareable operation"?






>
> > > > // Work: __ndelay(100);
> > > > // Work: __ndelay(10);
> > > > // Broken: __ndelay(9);
> > > >
> > > > vq->split.avail_idx_shadow++;
> > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > vq->split.avail_idx_shadow);
> > >
> > > What if you stick __ndelay here?
> >
> > And keep virtio_wmb above?
> >
>
> The result has been shared through a separate reply.
>
> > >
> > > > vq->num_added++;
> > > >
> > > > pr_debug("Added buffer head %i to %p\n", head, vq);
> > > > END_USE(vq);
> > > > :
> > > > }
> > > >
> > > > I also tried to measure the consumed time for various barrier-relative instructions using
> > > > ktime_get_ns() which should have consumed most of the time. __smb_mb() is slower than
> > > > __smp_wmb() but faster than __mb()
> > > >
> > > > Instruction Range of used time in ns
> > > > ----------------------------------------------
> > > > __smp_wmb() [32 1128032]
> > > > __smp_mb() [32 1160096]
> > > > __mb() [32 1162496]
> > > >
>
> Thanks,
> Gavin


2024-03-19 08:50:05

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 19, 2024 at 06:08:27PM +1000, Gavin Shan wrote:
> On 3/19/24 17:09, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
> > >
> > > On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > > > >
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > > > > > - * new available array entries. */
> > > > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > > > + /*
> > > > > > > > > + * Descriptors and available array need to be set before we expose
> > > > > > > > > + * the new available array entries. virtio_wmb() should be enough
> > > > > > > > > + * to ensuere the order theoretically. However, a stronger barrier
> > > > > > > > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > > > > + * by the host (vhost). A stronger barrier should work for other
> > > > > > > > > + * architectures, but performance loss is expected.
> > > > > > > > > + */
> > > > > > > > > + virtio_mb(false);
> > > > > > > > > vq->split.avail_idx_shadow++;
> > > > > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > > > > vq->split.avail_idx_shadow);
> > > > > > > >
> > > > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > > > > > here, especially when ordering accesses to coherent memory.
> > > > > > > >
> > > > > > > > In practice, either the larger timing different from the DSB or the fact
> > > > > > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > > > > (e.g. via __smb_mb()).
> > > > > > > >
> > > > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > > > explanation of what is going on.
> > > > > > > >
> > > > > > >
> > > > > > > Thanks for your comments, Will.
> > > > > > >
> > > > > > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > > > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> > > > > > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > > > > > before we fully understand the root cause.
> > > > > > >
> > > > > > > I tried the possible replacement like below. __smp_mb() can avoid the issue like
> > > > > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > > > >
> > > > > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > > > > {
> > > > > > > :
> > > > > > > /* Put entry in available array (but don't update avail->idx until they
> > > > > > > * do sync). */
> > > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > > >
> > > > > > > /* Descriptors and available array need to be set before we expose the
> > > > > > > * new available array entries. */
> > > > > > > // Broken: virtio_wmb(vq->weak_barriers);
> > > > > > > // Broken: __dma_mb();
> > > > > > > // Work: __mb();
> > > > > > > // Work: __smp_mb();
> > > > > > > // Work: __ndelay(100);
> > > > > > > // Work: __ndelay(10);
> > > > > > > // Broken: __ndelay(9);
> > > > > > >
> > > > > > > vq->split.avail_idx_shadow++;
> > > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > > vq->split.avail_idx_shadow);
> > > > > >
> > > > > > What if you stick __ndelay here?
> > > > > >
> > > > >
> > > > > /* Put entry in available array (but don't update avail->idx until they
> > > > > * do sync). */
> > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > >
> > > > > /* Descriptors and available array need to be set before we expose the
> > > > > * new available array entries. */
> > > > > virtio_wmb(vq->weak_barriers);
> > > > > vq->split.avail_idx_shadow++;
> > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > vq->split.avail_idx_shadow);
> > > > > /* Try __ndelay(x) here as Michael suggested
> > > > > *
> > > > > * Work: __ndelay(200); possiblly make it hard to reproduce
> > > > > * Broken: __ndelay(100);
> > > > > * Broken: __ndelay(20);
> > > > > * Broken: __ndelay(10);
> > > > > */
> > > > > __ndelay(200);
> > > >
> > > > So we see that just changing the timing masks the race.
> > > > What are you using on the host side? vhost or qemu?
> > > >
> > >
> > > __ndelay(200) may make the issue harder to be reproduce as I understand.
> > > More delays here will give vhost relief, reducing the race.
> > >
> > > The issue is only reproducible when vhost is turned on. Otherwise, we
> > > aren't able to hit the issue.
> > >
> > > -netdev tap,id=vnet0,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
> > > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> >
> >
> > Given it's vhost, it's also possible that the issue is host side.
> > I wonder what happens if we stick a delay or a stronger barrier
> > in vhost.c - either here:
> >
> > /* Make sure buffer is written before we update index. */
> > smp_wmb();
> >
> >
> > or here:
> >
> > /* Only get avail ring entries after they have been
> > * exposed by guest.
> > */
> > smp_rmb();
> >
> > ?
> >
>
> It's possible. However, I still can hit the issue after both of them are
> replaed with '__mb()'. So the issue seems on the guest side, where the
> written data isn't observed in time by the CPU on far end (for vhost worker).
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..327b68d21b02 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2529,7 +2529,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> /* Only get avail ring entries after they have been
> * exposed by guest.
> */
> - smp_rmb();
> + // smp_rmb();
> + __mb();
> }
> /* Grab the next descriptor number they're advertising, and increment
> @@ -2703,7 +2704,9 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
> r = __vhost_add_used_n(vq, heads, count);
> /* Make sure buffer is written before we update index. */
> - smp_wmb();
> + // smp_wmb();
> + __mb();
> +
>
> From the guest:
>
> [ 14.102608] virtio_net virtio0: output.0:id 80 is not a head!


You realize, however, that for that to happen, vhost first has to
use the wrong index?
Again, the fact that tweaking the code helps is suggestive but does
not lead us to the root cause.

So you should be able to detect the bad index when vhost gets it.
Here's a suggestion: stick a valid flag in the desc.
Then vhost should notice if it gets a desc with that flag clear.
Like this, maybe?


diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 6f7e5010a673..2789e51b57e6 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -524,7 +524,7 @@ static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
struct vring_desc_extra *extra = vring->split.desc_extra;
u16 next;

- desc[i].flags = cpu_to_virtio16(vq->vdev, flags);
+ desc[i].flags = cpu_to_virtio16(vq->vdev, flags | (0x1 << 5)); /* Mark it valid */
desc[i].addr = cpu_to_virtio64(vq->vdev, addr);
desc[i].len = cpu_to_virtio32(vq->vdev, len);

@@ -721,6 +721,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
i = virtio16_to_cpu(_vq->vdev, desc[i].next);
} else
i = vring_unmap_one_split(vq, i);
+ desc[i].flags = 0x0;
}

free_indirect:
@@ -775,11 +776,13 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
i = head;

while (vq->split.vring.desc[i].flags & nextflag) {
+ vq->split.vring.desc[i].flags = 0x0;
vring_unmap_one_split(vq, i);
i = vq->split.desc_extra[i].next;
vq->vq.num_free++;
}

+ vq->split.vring.desc[i].flags = 0x0;
vring_unmap_one_split(vq, i);
vq->split.desc_extra[i].next = vq->free_head;
vq->free_head = head;
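
The vhost-side check is not spelled out above; a hypothetical
counterpart in vhost_get_vq_desc() (names and placement assumed, not
taken from this thread) could be:

	/* Hypothetical: after fetching a descriptor, verify the debug
	 * marker that the guest sets in the sketch above. */
	if (!(vhost16_to_cpu(vq, desc.flags) & (0x1 << 5)))
		vq_err(vq, "desc %d fetched without valid flag\n", i);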


2024-03-19 18:22:18

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 19, 2024 at 03:36:31AM -0400, Michael S. Tsirkin wrote:
> On Mon, Mar 18, 2024 at 04:59:24PM +0000, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > >
> > > - /* Descriptors and available array need to be set before we expose the
> > > - * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > + * Descriptors and available array need to be set before we expose
> > > + * the new available array entries. virtio_wmb() should be enough
> > > + * to ensuere the order theoretically. However, a stronger barrier
> > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > + * by the host (vhost). A stronger barrier should work for other
> > > + * architectures, but performance loss is expected.
> > > + */
> > > + virtio_mb(false);
> > > vq->split.avail_idx_shadow++;
> > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > vq->split.avail_idx_shadow);
> >
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> >
> > In practice, either the larger timing different from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smb_mb()).
> >
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
>
> Just making sure: so on this system, how do
> smp_wmb() and wmb() differ? smb_wmb is normally for synchronizing
> with kernel running on another CPU and we are doing something
> unusual in virtio when we use it to synchronize with host
> as opposed to the guest - e.g. CONFIG_SMP is special cased
> because of this:
>
> #define virt_wmb() do { kcsan_wmb(); __smp_wmb(); } while (0)
>
> Note __smp_wmb not smp_wmb which would be a NOP on UP.

I think that should be fine (as long as the NOP is a barrier() for the
compiler).

wmb() uses a DSB, but that is only really relevant to endpoint ordering
with real I/O devices, e.g. if for some reason you need to guarantee
that a write has made it all the way to the device before proceeding.
Even then you're slightly at the mercy of the memory type and the
interconnect not giving an early acknowledgement, so the extra ordering
is rarely needed in practice and we don't even provide it for our I/O
accessors (e.g. writel() and readl()).

So for virtio between two CPUs using coherent memory, DSB is a red
herring.
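
Schematically, what the two sides are doing is the classic
message-passing idiom, for which a DMB in the inner shareable domain
should be sufficient on the writer side (a sketch, not the actual
driver or vhost code):

/* Writer (guest), schematic: */
WRITE_ONCE(avail->ring[idx & (num - 1)], cpu_to_le16(head)); /* data */
smp_wmb(); /* dmb(ishst) on arm64 */
WRITE_ONCE(avail->idx, cpu_to_le16(idx + 1)); /* publish index */

/* Reader (vhost), schematic: */
end = le16_to_cpu(READ_ONCE(avail->idx));
smp_rmb(); /* pairs with the writer's smp_wmb() */
head = le16_to_cpu(READ_ONCE(avail->ring[(end - 1) & (num - 1)]));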

Will

2024-03-19 18:23:19

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> On 3/19/24 02:59, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > >
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe a error message spew from VM and then soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordred write to the available ring element and available index.
> > >
> > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > > -accel kvm -machine virt,gic-version=host \
> > > : \
> > > -netdev tap,id=vnet0,vhost=on \
> > > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > >
> > > [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > >
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > >
> > > Cc: [email protected]
> > > Reported-by: Yihuang Yu <[email protected]>
> > > Signed-off-by: Gavin Shan <[email protected]>
> > > ---
> > > drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > - * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > + * Descriptors and available array need to be set before we expose
> > > + * the new available array entries. virtio_wmb() should be enough
> > > + * to ensuere the order theoretically. However, a stronger barrier
> > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > + * by the host (vhost). A stronger barrier should work for other
> > > + * architectures, but performance loss is expected.
> > > + */
> > > + virtio_mb(false);
> > > vq->split.avail_idx_shadow++;
> > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > vq->split.avail_idx_shadow);
> >
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> >
> > In practice, either the larger timing different from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smb_mb()).
> >
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> >
>
> Thanks for your comments, Will.
>
> Yes, DMB should work for us. However, it seems this instruction has issues on
> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> from hardware level. I agree it's not the solution to replace DMB with DSB
> before we fully understand the root cause.
>
> I tried the possible replacement like below. __smp_mb() can avoid the issue like
> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>
> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> {
> :
> /* Put entry in available array (but don't update avail->idx until they
> * do sync). */
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>
> /* Descriptors and available array need to be set before we expose the
> * new available array entries. */
> // Broken: virtio_wmb(vq->weak_barriers);
> // Broken: __dma_mb();
> // Work: __mb();
> // Work: __smp_mb();

It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
confident are you in that result?

Will

2024-03-19 23:57:27

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/20/24 04:22, Will Deacon wrote:
> On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
>> On 3/19/24 02:59, Will Deacon wrote:
>>>> drivers/virtio/virtio_ring.c | 12 +++++++++---
>>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>> index 49299b1f9ec7..7d852811c912 100644
>>>> --- a/drivers/virtio/virtio_ring.c
>>>> +++ b/drivers/virtio/virtio_ring.c
>>>> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>>> - /* Descriptors and available array need to be set before we expose the
>>>> - * new available array entries. */
>>>> - virtio_wmb(vq->weak_barriers);
>>>> + /*
>>>> + * Descriptors and available array need to be set before we expose
>>>> + * the new available array entries. virtio_wmb() should be enough
>>>> + * to ensuere the order theoretically. However, a stronger barrier
>>>> + * is needed by ARM64. Otherwise, the stale data can be observed
>>>> + * by the host (vhost). A stronger barrier should work for other
>>>> + * architectures, but performance loss is expected.
>>>> + */
>>>> + virtio_mb(false);
>>>> vq->split.avail_idx_shadow++;
>>>> vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>>>> vq->split.avail_idx_shadow);
>>>
>>> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
>>> here, especially when ordering accesses to coherent memory.
>>>
>>> In practice, either the larger timing different from the DSB or the fact
>>> that you're going from a Store->Store barrier to a full barrier is what
>>> makes things "work" for you. Have you tried, for example, a DMB SY
>>> (e.g. via __smb_mb()).
>>>
>>> We definitely shouldn't take changes like this without a proper
>>> explanation of what is going on.
>>>
>>
>> Thanks for your comments, Will.
>>
>> Yes, DMB should work for us. However, it seems this instruction has issues on
>> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
>> from hardware level. I agree it's not the solution to replace DMB with DSB
>> before we fully understand the root cause.
>>
>> I tried the possible replacement like below. __smp_mb() can avoid the issue like
>> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
>>
>> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
>> {
>> :
>> /* Put entry in available array (but don't update avail->idx until they
>> * do sync). */
>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>
>> /* Descriptors and available array need to be set before we expose the
>> * new available array entries. */
>> // Broken: virtio_wmb(vq->weak_barriers);
>> // Broken: __dma_mb();
>> // Work: __mb();
>> // Work: __smp_mb();
>
> It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
> confident are you in that result?
>

Yes, __dma_mb() is even stronger than __smp_mb(). I retried the test, showing
that both __dma_mb() and __smp_mb() work for us. I ran too many tests yesterday
and something may have been mixed up.

Instruction Times the issue was hit in 10 tests
---------------------------------------------
__smp_wmb() 8
__smp_mb() 0
__dma_wmb() 7
__dma_mb() 0
__mb() 0
__wmb() 0

It's strange that __smp_mb() works, but __smp_wmb() fails. It seems we need a
read barrier here. I will try WRITE_ONCE() + __smp_wmb() as suggested by Michael
in another reply. Will update the result soon.
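
A sketch of what that experiment might look like (schematic only; the
actual attempt is not shown in this part of the thread):

	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
	WRITE_ONCE(vq->split.vring.avail->ring[avail],
		   cpu_to_virtio16(_vq->vdev, head));
	__smp_wmb(); /* keep the weak store->store barrier */
	vq->split.avail_idx_shadow++;
	WRITE_ONCE(vq->split.vring.avail->idx,
		   cpu_to_virtio16(_vq->vdev, vq->split.avail_idx_shadow));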

Thanks,
Gavin


2024-03-20 00:50:04

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Wed, Mar 20, 2024 at 09:56:58AM +1000, Gavin Shan wrote:
> On 3/20/24 04:22, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > On 3/19/24 02:59, Will Deacon wrote:
> > > > > drivers/virtio/virtio_ring.c | 12 +++++++++---
> > > > > 1 file changed, 9 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > - /* Descriptors and available array need to be set before we expose the
> > > > > - * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > + * Descriptors and available array need to be set before we expose
> > > > > + * the new available array entries. virtio_wmb() should be enough
> > > > > + * to ensuere the order theoretically. However, a stronger barrier
> > > > > + * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > + * by the host (vhost). A stronger barrier should work for other
> > > > > + * architectures, but performance loss is expected.
> > > > > + */
> > > > > + virtio_mb(false);
> > > > > vq->split.avail_idx_shadow++;
> > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > vq->split.avail_idx_shadow);
> > > >
> > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > here, especially when ordering accesses to coherent memory.
> > > >
> > > > In practice, either the larger timing different from the DSB or the fact
> > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > (e.g. via __smb_mb()).
> > > >
> > > > We definitely shouldn't take changes like this without a proper
> > > > explanation of what is going on.
> > > >
> > >
> > > Thanks for your comments, Will.
> > >
> > > Yes, DMB should work for us. However, it seems this instruction has issues on
> > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB work
> > > at the hardware level. I agree it's not the solution to replace DMB with DSB
> > > before we fully understand the root cause.
> > >
> > > I tried the possible replacements as below. __smp_mb() can avoid the issue like
> > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > >
> > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > {
> > > :
> > > /* Put entry in available array (but don't update avail->idx until they
> > > * do sync). */
> > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > >
> > > /* Descriptors and available array need to be set before we expose the
> > > * new available array entries. */
> > > // Broken: virtio_wmb(vq->weak_barriers);
> > > // Broken: __dma_mb();
> > > // Work: __mb();
> > > // Work: __smp_mb();
> >
> > It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
> > confident are you in that result?
> >
>
> Yes, __dma_mb() is even stronger than __smp_mb(). I retried the test, showing
> that both __dma_mb() and __smp_mb() work for us. I had too many tests yesterday
> and something may have been messed up.
>
> Instruction         Times the issue was hit in 10 tests
> ---------------------------------------------
> __smp_wmb() 8
> __smp_mb() 0
> __dma_wmb() 7
> __dma_mb() 0
> __mb() 0
> __wmb() 0
>
> It's strange that __smp_mb() works, but __smp_wmb() fails. It seems we need a
> read barrier here. I will try WRITE_ONCE() + __smp_wmb() as suggested by Michael
> in another reply. Will update the result soon.
>
> Thanks,
> Gavin


I think you are wasting time with these tests. Even if it helps, what
does this tell us? Try setting a flag as I suggested elsewhere.
Then check it in vhost.
Or here's another idea - possibly easier. Copy the high bits from the index
into the ring itself. Then vhost can check that the head is synchronized with
the index.

Warning: completely untested, not even compiled. But should give you
the idea. If this works btw we should consider making this official in
the spec.


static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 6f7e5010a673..79456706d0bd 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
/* Put entry in available array (but don't update avail->idx until they
* do sync). */
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
- vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
+ u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
+ vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);

/* Descriptors and available array need to be set before we expose the
* new available array entries. */

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..bd8f7c763caa 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
__virtio16 *head, int idx)
{
- return vhost_get_avail(vq, *head,
+ unsigned i = idx;
+ unsigned flag = i & ~(vq->num - 1);
+ unsigned val = vhost_get_avail(vq, *head,
&vq->avail->ring[idx & (vq->num - 1)]);
+ unsigned valflag = val & ~(vq->num - 1);
+
+ WARN_ON(valflag != flag);
+
+ return val & (vq->num - 1);
}

--
MST


2024-03-20 05:24:43

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/20/24 10:49, Michael S. Tsirkin wrote:
> I think you are wasting time with these tests. Even if it helps, what
> does this tell us? Try setting a flag as I suggested elsewhere.
> Then check it in vhost.
> Or here's another idea - possibly easier. Copy the high bits from the index
> into the ring itself. Then vhost can check that the head is synchronized with
> the index.
>
> Warning: completely untested, not even compiled. But should give you
> the idea. If this works btw we should consider making this official in
> the spec.
>
>
> static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 6f7e5010a673..79456706d0bd 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> /* Put entry in available array (but don't update avail->idx until they
> * do sync). */
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> - vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> + u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
>
> /* Descriptors and available array need to be set before we expose the
> * new available array entries. */
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..bd8f7c763caa 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> __virtio16 *head, int idx)
> {
> - return vhost_get_avail(vq, *head,
> + unsigned i = idx;
> + unsigned flag = i & ~(vq->num - 1);
> + unsigned val = vhost_get_avail(vq, *head,
> &vq->avail->ring[idx & (vq->num - 1)]);
> + unsigned valflag = val & ~(vq->num - 1);
> +
> + WARN_ON(valflag != flag);
> +
> + return val & (vq->num - 1);
> }
>

Thanks, Michael. The code is already self-explanatory. Since vq->num is 256, I just
squeezed the last_avail_idx into the high byte. Unfortunately, I'm unable to hit
the WARN_ON(). Does it mean the low byte is stale (or corrupted) while the high
byte is still correct and valid?

avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] =
cpu_to_virtio16(_vq->vdev, head | (avail << 8));


head = vhost16_to_cpu(vq, ring_head);
WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
head = head & 0xff;

One question: does QEMU have any chance of writing data to the available queue
when vhost is enabled? My previous understanding is no: the queue is owned
entirely by vhost instead of QEMU.

Before this patch was posted, I had debugging code to record the last 16 transactions
on the available and used queues, from both the guest and host sides. It did reveal
that a wrong head was fetched from the available queue.
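
(A sketch of how such a fixed-size transaction log might be kept; the struct,
function names, and hook points here are hypothetical:)

	struct ring_dbg {
		u16 idx[16];		/* avail/used index at the time of the operation */
		u16 head[16];		/* head written to or read from the ring */
		unsigned int pos;
	};

	static void ring_dbg_record(struct ring_dbg *log, u16 idx, u16 head)
	{
		/* Overwrite the oldest slot; the dumps below show the
		 * resulting 16-entry windows. */
		log->idx[log->pos % 16] = idx;
		log->head[log->pos % 16] = head;
		log->pos++;
	}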

[ 11.785745] ================ virtqueue_get_buf_ctx_split ================
[ 11.786238] virtio_net virtio0: output.0:id 74 is not a head!
[ 11.786655] head to be released: 036 077
[ 11.786952]
[ 11.786952] avail_idx:
[ 11.787234] 000 63985 <--
[ 11.787237] 001 63986
[ 11.787444] 002 63987
[ 11.787632] 003 63988
[ 11.787821] 004 63989
[ 11.788006] 005 63990
[ 11.788194] 006 63991
[ 11.788381] 007 63992
[ 11.788567] 008 63993
[ 11.788772] 009 63994
[ 11.788957] 010 63995
[ 11.789141] 011 63996
[ 11.789327] 012 63997
[ 11.789515] 013 63998
[ 11.789701] 014 63999
[ 11.789886] 015 64000
[ 11.790068]
[ 11.790068] avail_head:
[ 11.790529] 000 075 <--
[ 11.790718] 001 036
[ 11.790890] 002 077
[ 11.791061] 003 129
[ 11.791231] 004 072
[ 11.791400] 005 130
[ 11.791574] 006 015
[ 11.791748] 007 074
[ 11.791918] 008 130
[ 11.792094] 009 130
[ 11.792263] 010 074
[ 11.792437] 011 015
[ 11.792617] 012 072
[ 11.792788] 013 129
[ 11.792961] 014 077 // The last two heads from guest to host: 077, 036
[ 11.793134] 015 036

[root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost

avail_idx
000 63998
001 64000
002 63954 <---
003 63955
004 63956
005 63974
006 63981
007 63984
008 63986
009 63987
010 63988
011 63989
012 63992
013 63993
014 63995
015 63997

avail_head
000 074
001 015
002 072
003 129
004 074 // The last two heads seen by vhost is: 074, 036
005 036
006 075 <---
007 036
008 077
009 129
010 072
011 130
012 015
013 074
014 130
015 130

used_idx
000 64000
001 63882 <---
002 63889
003 63891
004 63898
005 63936
006 63942
007 63946
008 63949
009 63953
010 63957
011 63981
012 63990
013 63992
014 63993
015 63999

used_head
000 072
001 129
002 074 // The last two heads published to guest is: 074, 036
003 036
004 075 <---
005 036
006 077
007 129
008 072
009 130
010 015
011 074
012 130
013 130
014 074
015 015

Thanks,
Gavin





2024-03-20 07:15:13

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> On 3/20/24 10:49, Michael S. Tsirkin wrote:
> > I think you are wasting time with these tests. Even if it helps, what
> > does this tell us? Try setting a flag as I suggested elsewhere.
> > Then check it in vhost.
> > Or here's another idea - possibly easier. Copy the high bits from the index
> > into the ring itself. Then vhost can check that the head is synchronized with
> > the index.
> >
> > Warning: completely untested, not even compiled. But should give you
> > the idea. If this works btw we should consider making this official in
> > the spec.
> >
> >
> > static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 6f7e5010a673..79456706d0bd 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > /* Put entry in available array (but don't update avail->idx until they
> > * do sync). */
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > - vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > + u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> > /* Descriptors and available array need to be set before we expose the
> > * new available array entries. */
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..bd8f7c763caa 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> > static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> > __virtio16 *head, int idx)
> > {
> > - return vhost_get_avail(vq, *head,
> > + unsigned i = idx;
> > + unsigned flag = i & ~(vq->num - 1);
> > + unsigned val = vhost_get_avail(vq, *head,
> > &vq->avail->ring[idx & (vq->num - 1)]);
> > + unsigned valflag = val & ~(vq->num - 1);
> > +
> > + WARN_ON(valflag != flag);
> > +
> > + return val & (vq->num - 1);
> > }
>
> Thanks, Michael. The code is already self-explanatory.

Apparently not. See below.

> Since vq->num is 256, I just
> squeezed the last_avail_idx into the high byte. Unfortunately, I'm unable to hit
> the WARN_ON(). Does it mean the low byte is stale (or corrupted) while the high
> byte is still correct and valid?


I would find this very surprising.

> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] =
> cpu_to_virtio16(_vq->vdev, head | (avail << 8));
>
>
> head = vhost16_to_cpu(vq, ring_head);
> WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
> head = head & 0xff;

This code misses the point of the test.
The high value you store now is exactly the same each time you
go around the ring. E.g. at beginning of ring you now always
store 0 as high byte. So a stale value will not be detected.

The value you are interested in should change
each time you go a full circle around the ring.
Thus you want exactly the *high byte* of the avail idx -
this is what my patch did - your patch instead
stored and compared the low byte.


The advantage of this debugging patch is that it will detect the issue immediately,
not after the guest has detected the problem in the used ring.
You can then, for example, add code to re-read the value, or dump the whole
ring.
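
(A concrete sketch of that check, hypothetical and modelled on the patch quoted
above: the bits of the avail index above the ring mask ride along in each ring
entry, so the tag changes on every full trip around the ring. It assumes
vq->num is a power of two and that head always fits under the mask:)

	/* Guest side, in virtqueue_add_split(): tag the entry with the
	 * high bits of the avail index that exposes it. */
	u16 mask = vq->split.vring.num - 1;
	u16 tag = vq->split.avail_idx_shadow & ~mask;
	vq->split.vring.avail->ring[avail] =
		cpu_to_virtio16(_vq->vdev, head | tag);

	/* Host side, in vhost_get_avail_head(): the tag must match the
	 * high bits of the index being consumed. */
	u16 val = vhost16_to_cpu(vq, ring_head);
	WARN_ON((val & ~(vq->num - 1)) != (idx & ~(vq->num - 1)));
	head = val & (vq->num - 1);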

> One question: does QEMU have any chance of writing data to the available queue
> when vhost is enabled? My previous understanding is no: the queue is owned
> entirely by vhost instead of QEMU.

It shouldn't do it normally.

> Before this patch was posted, I had debugging code to record the last 16 transactions
> on the available and used queues, from both the guest and host sides. It did reveal
> that a wrong head was fetched from the available queue.

Oh nice, that's a very good hint. And is this still reproducible?

> [ 11.785745] ================ virtqueue_get_buf_ctx_split ================
> [ 11.786238] virtio_net virtio0: output.0:id 74 is not a head!
> [ 11.786655] head to be released: 036 077
> [ 11.786952]
> [ 11.786952] avail_idx:
> [ 11.787234] 000 63985 <--
> [ 11.787237] 001 63986
> [ 11.787444] 002 63987
> [ 11.787632] 003 63988
> [ 11.787821] 004 63989
> [ 11.788006] 005 63990
> [ 11.788194] 006 63991
> [ 11.788381] 007 63992
> [ 11.788567] 008 63993
> [ 11.788772] 009 63994
> [ 11.788957] 010 63995
> [ 11.789141] 011 63996
> [ 11.789327] 012 63997
> [ 11.789515] 013 63998
> [ 11.789701] 014 63999
> [ 11.789886] 015 64000
> [ 11.790068]
> [ 11.790068] avail_head:
> [ 11.790529] 000 075 <--
> [ 11.790718] 001 036
> [ 11.790890] 002 077
> [ 11.791061] 003 129
> [ 11.791231] 004 072
> [ 11.791400] 005 130
> [ 11.791574] 006 015
> [ 11.791748] 007 074
> [ 11.791918] 008 130
> [ 11.792094] 009 130
> [ 11.792263] 010 074
> [ 11.792437] 011 015
> [ 11.792617] 012 072
> [ 11.792788] 013 129
> [ 11.792961] 014 077 // The last two heads from guest to host: 077, 036
> [ 11.793134] 015 036

Maybe dump the avail ring from guest to make sure
it matches the expected contents?

> [root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost
>
> avail_idx
> 000 63998
> 001 64000
> 002 63954 <---
> 003 63955
> 004 63956
> 005 63974
> 006 63981
> 007 63984
> 008 63986
> 009 63987
> 010 63988
> 011 63989
> 012 63992
> 013 63993
> 014 63995
> 015 63997
>
> avail_head
> 000 074
> 001 015
> 002 072
> 003 129
> 004 074 // The last two heads seen by vhost is: 074, 036
> 005 036
> 006 075 <---


And is 074 the previous (stale) value in the ring?


> 007 036
> 008 077
> 009 129
> 010 072
> 011 130
> 012 015
> 013 074
> 014 130
> 015 130



> used_idx
> 000 64000
> 001 63882 <---
> 002 63889
> 003 63891
> 004 63898
> 005 63936
> 006 63942
> 007 63946
> 008 63949
> 009 63953
> 010 63957
> 011 63981
> 012 63990
> 013 63992
> 014 63993
> 015 63999
>
> used_head
> 000 072
> 001 129
> 002 074 // The last two heads published to guest is: 074, 036
> 003 036
> 004 075 <---
> 005 036
> 006 077
> 007 129
> 008 072
> 009 130
> 010 015
> 011 074
> 012 130
> 013 130
> 014 074
> 015 015
>
> Thanks,
> Gavin

I like this debugging patch; it might make sense to
polish it up and include it in production. We'll want it in
debugfs, naturally, not /proc/vhost.

But all in good time.
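
(A minimal sketch of what the debugfs version might look like; the file name,
hook point, and exact state dumped are all hypothetical:)

	#include <linux/debugfs.h>
	#include <linux/seq_file.h>

	static int vhost_ring_dbg_show(struct seq_file *s, void *unused)
	{
		struct vhost_virtqueue *vq = s->private;

		/* Dump whatever the debugging patch records; the cached
		 * and last-seen avail indices are shown as an example. */
		seq_printf(s, "avail_idx %u last_avail_idx %u\n",
			   vq->avail_idx, vq->last_avail_idx);
		return 0;
	}
	DEFINE_SHOW_ATTRIBUTE(vhost_ring_dbg);

	/* Registered once per virtqueue, e.g. when the device is opened:
	 *	debugfs_create_file("ring-state", 0444, dir, vq,
	 *			    &vhost_ring_dbg_fops);
	 */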


>
>


2024-03-20 17:17:18

by Keir Fraser

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
>
> Before this patch was posted, I had debugging code to record the last 16 transactions
> on the available and used queues, from both the guest and host sides. It did reveal
> that a wrong head was fetched from the available queue.
>
> [ 11.785745] ================ virtqueue_get_buf_ctx_split ================
> [ 11.786238] virtio_net virtio0: output.0:id 74 is not a head!
> [ 11.786655] head to be released: 036 077
> [ 11.786952]
> [ 11.786952] avail_idx:
> [ 11.787234] 000 63985 <--
> [ 11.787237] 001 63986
> [ 11.787444] 002 63987
> [ 11.787632] 003 63988
> [ 11.787821] 004 63989
> [ 11.788006] 005 63990
> [ 11.788194] 006 63991
> [ 11.788381] 007 63992
> [ 11.788567] 008 63993
> [ 11.788772] 009 63994
> [ 11.788957] 010 63995
> [ 11.789141] 011 63996
> [ 11.789327] 012 63997
> [ 11.789515] 013 63998
> [ 11.789701] 014 63999
> [ 11.789886] 015 64000

Does the error always occur at such a round idx value?

Here, 64000 == 0xFA00. Maybe coincidence but it's improbable enough to be interesting.

This debug code seems rather useful!

-- Keir



> [ 11.790068]
> [ 11.790068] avail_head:
> [ 11.790529] 000 075 <--
> [ 11.790718] 001 036
> [ 11.790890] 002 077
> [ 11.791061] 003 129
> [ 11.791231] 004 072
> [ 11.791400] 005 130
> [ 11.791574] 006 015
> [ 11.791748] 007 074
> [ 11.791918] 008 130
> [ 11.792094] 009 130
> [ 11.792263] 010 074
> [ 11.792437] 011 015
> [ 11.792617] 012 072
> [ 11.792788] 013 129
> [ 11.792961] 014 077 // The last two heads from guest to host: 077, 036
> [ 11.793134] 015 036
>
> [root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost
>
> avail_idx
> 000 63998
> 001 64000
> 002 63954 <---
> 003 63955
> 004 63956
> 005 63974
> 006 63981
> 007 63984
> 008 63986
> 009 63987
> 010 63988
> 011 63989
> 012 63992
> 013 63993
> 014 63995
> 015 63997
>
> avail_head
> 000 074
> 001 015
> 002 072
> 003 129
> 004 074 // The last two heads seen by vhost is: 074, 036
> 005 036
> 006 075 <---
> 007 036
> 008 077
> 009 129
> 010 072
> 011 130
> 012 015
> 013 074
> 014 130
> 015 130
>
> used_idx
> 000 64000
> 001 63882 <---
> 002 63889
> 003 63891
> 004 63898
> 005 63936
> 006 63942
> 007 63946
> 008 63949
> 009 63953
> 010 63957
> 011 63981
> 012 63990
> 013 63992
> 014 63993
> 015 63999
>
> used_head
> 000 072
> 001 129
> 002 074 // The last two heads published to guest is: 074, 036
> 003 036
> 004 075 <---
> 005 036
> 006 077
> 007 129
> 008 072
> 009 130
> 010 015
> 011 074
> 012 130
> 013 130
> 014 074
> 015 015
>
> Thanks,
> Gavin
>
>
>
>

2024-03-21 12:08:46

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On 3/21/24 03:15, Keir Fraser wrote:
> On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
>>
>> Before this patch was posted, I had debugging code to record the last 16 transactions
>> on the available and used queues, from both the guest and host sides. It did reveal
>> that a wrong head was fetched from the available queue.
>>
>> [ 11.785745] ================ virtqueue_get_buf_ctx_split ================
>> [ 11.786238] virtio_net virtio0: output.0:id 74 is not a head!
>> [ 11.786655] head to be released: 036 077
>> [ 11.786952]
>> [ 11.786952] avail_idx:
>> [ 11.787234] 000 63985 <--
>> [ 11.787237] 001 63986
>> [ 11.787444] 002 63987
>> [ 11.787632] 003 63988
>> [ 11.787821] 004 63989
>> [ 11.788006] 005 63990
>> [ 11.788194] 006 63991
>> [ 11.788381] 007 63992
>> [ 11.788567] 008 63993
>> [ 11.788772] 009 63994
>> [ 11.788957] 010 63995
>> [ 11.789141] 011 63996
>> [ 11.789327] 012 63997
>> [ 11.789515] 013 63998
>> [ 11.789701] 014 63999
>> [ 11.789886] 015 64000
>
> Does the error always occur at such a round idx value?
>
> Here, 64000 == 0xFA00. Maybe coincidence but it's improbable enough to be interesting.
>
> This debug code seems rather useful!
>

Keir, nope, it's just a coincidence. We don't see any such pattern.

Thanks,
Gavin

>
>
>> [ 11.790068]
>> [ 11.790068] avail_head:
>> [ 11.790529] 000 075 <--
>> [ 11.790718] 001 036
>> [ 11.790890] 002 077
>> [ 11.791061] 003 129
>> [ 11.791231] 004 072
>> [ 11.791400] 005 130
>> [ 11.791574] 006 015
>> [ 11.791748] 007 074
>> [ 11.791918] 008 130
>> [ 11.792094] 009 130
>> [ 11.792263] 010 074
>> [ 11.792437] 011 015
>> [ 11.792617] 012 072
>> [ 11.792788] 013 129
>> [ 11.792961] 014 077 // The last two heads from guest to host: 077, 036
>> [ 11.793134] 015 036
>>
>> [root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost
>>
>> avail_idx
>> 000 63998
>> 001 64000
>> 002 63954 <---
>> 003 63955
>> 004 63956
>> 005 63974
>> 006 63981
>> 007 63984
>> 008 63986
>> 009 63987
>> 010 63988
>> 011 63989
>> 012 63992
>> 013 63993
>> 014 63995
>> 015 63997
>>
>> avail_head
>> 000 074
>> 001 015
>> 002 072
>> 003 129
>> 004 074 // The last two heads seen by vhost is: 074, 036
>> 005 036
>> 006 075 <---
>> 007 036
>> 008 077
>> 009 129
>> 010 072
>> 011 130
>> 012 015
>> 013 074
>> 014 130
>> 015 130
>>
>> used_idx
>> 000 64000
>> 001 63882 <---
>> 002 63889
>> 003 63891
>> 004 63898
>> 005 63936
>> 006 63942
>> 007 63946
>> 008 63949
>> 009 63953
>> 010 63957
>> 011 63981
>> 012 63990
>> 013 63992
>> 014 63993
>> 015 63999
>>
>> used_head
>> 000 072
>> 001 129
>> 002 074 // The last two heads published to guest is: 074, 036
>> 003 036
>> 004 075 <---
>> 005 036
>> 006 077
>> 007 129
>> 008 072
>> 009 130
>> 010 015
>> 011 074
>> 012 130
>> 013 130
>> 014 074
>> 015 015
>>
>> Thanks,
>> Gavin
>>
>>
>>
>>
>


2024-03-25 14:33:34

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring


On 3/20/24 17:14, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
>> On 3/20/24 10:49, Michael S. Tsirkin wrote:
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 6f7e5010a673..79456706d0bd 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>>> /* Put entry in available array (but don't update avail->idx until they
>>> * do sync). */
>>> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>>> - vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>>> + u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
>>> + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
>>> /* Descriptors and available array need to be set before we expose the
>>> * new available array entries. */
>>>

Ok, Michael. I continued with my debugging code. It still looks like a
hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
involved in the discussion, as you suggested.

Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70.
Note that I have only one vCPU in my configuration.

Secondly, the debugging code is enhanced so that the available head for
(last_avail_idx - 1) is read twice and recorded. That is, the available
head for one specific available index is read two times in a row. I do
see the available heads differ between the two consecutive reads. More
details are shared below.

From the guest side
===================

virtio_net virtio0: output.0:id 86 is not a head!
head to be released: 047 062 112

avail_idx:
000 49665
001 49666 <--
:
015 49664

avail_head:
000 062
001 047 <--
:
015 112

From the host side
==================

avail_idx
000 49663
001 49666 <---
:

avail_head
000 062 (062)
001 047 (047) <---
:
015 086 (112) // head 086 is returned from the first read,
// but head 112 is returned from the second read

vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 49664

Thanks,
Gavin
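
(A sketch of the double-read instrumentation described above, hypothetical and
modelled on the vhost_get_avail_head() shown earlier in the thread:)

	/* Read the same avail ring slot twice; with a single writer we
	 * should never observe two different heads for one index. */
	__virtio16 h1, h2;

	if (vhost_get_avail(vq, h1, &vq->avail->ring[idx & (vq->num - 1)]) ||
	    vhost_get_avail(vq, h2, &vq->avail->ring[idx & (vq->num - 1)]))
		return -EFAULT;

	if (h1 != h2)
		vq_err(vq, "Inconsistent head in two reads (%u -> %u) for avail_idx %u\n",
		       vhost16_to_cpu(vq, h1), vhost16_to_cpu(vq, h2), idx);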



2024-03-26 07:49:19

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote:
>
> On 3/20/24 17:14, Michael S. Tsirkin wrote:
> > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> > > On 3/20/24 10:49, Michael S. Tsirkin wrote:
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 6f7e5010a673..79456706d0bd 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > /* Put entry in available array (but don't update avail->idx until they
> > > > * do sync). */
> > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > - vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > + u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > > > + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> > > > /* Descriptors and available array need to be set before we expose the
> > > > * new available array entries. */
> > > >
>
> Ok, Michael. I continued with my debugging code. It still looks like a
> hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
> involved in the discussion, as you suggested.

Do you have a support contact at Nvidia to report this?

> Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70.
> Note that I have only one vCPU in my configuration.

Interesting, but is the guest built with CONFIG_SMP set?

> Secondly, the debugging code is enhanced so that the available head for
> (last_avail_idx - 1) is read twice and recorded. That is, the available
> head for one specific available index is read two times in a row. I do
> see the available heads differ between the two consecutive reads. More
> details are shared below.
>
> From the guest side
> ===================
>
> virtio_net virtio0: output.0:id 86 is not a head!
> head to be released: 047 062 112
>
> avail_idx:
> 000 49665
> 001 49666 <--
> :
> 015 49664

what are these #s 49665 and so on?
and how large is the ring?
I am guessing 49664 is the index, the ring size is 16 and
49664 % 16 == 0

> avail_head:


is this the avail ring contents?

> 000 062
> 001 047 <--
> :
> 015 112


What are these arrows pointing at, btw?


> From the host side
> ==================
>
> avail_idx
> 000 49663
> 001 49666 <---
> :
>
> avail_head
> 000 062 (062)
> 001 047 (047) <---
> :
> 015 086 (112) // head 086 is returned from the first read,
> // but head 112 is returned from the second read
>
> vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 49664
>
> Thanks,
> Gavin

OK thanks so this proves it is actually the avail ring value.

--
MST


2024-03-26 09:44:10

by Keir Fraser

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote:
> >
> > On 3/20/24 17:14, Michael S. Tsirkin wrote:
> > > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> > > > On 3/20/24 10:49, Michael S. Tsirkin wrote:
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 6f7e5010a673..79456706d0bd 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > > /* Put entry in available array (but don't update avail->idx until they
> > > > > * do sync). */
> > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > - vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > + u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > > > > + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> > > > > /* Descriptors and available array need to be set before we expose the
> > > > > * new available array entries. */
> > > > >
> >
> > Ok, Michael. I continued with my debugging code. It still looks like a
> > hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
> > involved in the discussion, as you suggested.
>
> Do you have a support contact at Nvidia to report this?
>
> > Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70.
> > Note that I have only one vCPU in my configuration.
>
> Interesting, but is the guest built with CONFIG_SMP set?

arm64 is always built with CONFIG_SMP.

> > Secondly, the debugging code is enhanced so that the available head for
> > (last_avail_idx - 1) is read twice and recorded. That is, the available
> > head for one specific available index is read two times in a row. I do
> > see the available heads differ between the two consecutive reads. More
> > details are shared below.
> >
> > From the guest side
> > ===================
> >
> > virtio_net virtio0: output.0:id 86 is not a head!
> > head to be released: 047 062 112
> >
> > avail_idx:
> > 000 49665
> > 001 49666 <--
> > :
> > 015 49664
>
> what are these #s 49665 and so on?
> and how large is the ring?
> I am guessing 49664 is the index, the ring size is 16 and
> 49664 % 16 == 0

More than that, 49664 % 256 == 0

So again there seems to be an error in the vicinity of roll-over of
the idx low byte, as I observed in the earlier log. Surely this is
more than coincidence?

-- Keir

> > avail_head:
>
>
> is this the avail ring contents?
>
> > 000 062
> > 001 047 <--
> > :
> > 015 112
>
>
> What are these arrows pointing at, btw?
>
>
> > From the host side
> > ==================
> >
> > avail_idx
> > 000 49663
> > 001 49666 <---
> > :
> >
> > avail_head
> > 000 062 (062)
> > 001 047 (047) <---
> > :
> > 015 086 (112) // head 086 is returned from the first read,
> > // but head 112 is returned from the second read
> >
> > vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 49664
> >
> > Thanks,
> > Gavin
>
> OK thanks so this proves it is actually the avail ring value.
>
> --
> MST
>

2024-03-26 11:50:50

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 26, 2024 at 09:38:55AM +0000, Keir Fraser wrote:
> On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > Secondly, the debugging code is enhanced so that the available head for
> > > (last_avail_idx - 1) is read twice and recorded. That is, the available
> > > head for one specific available index is read two times in a row. I do
> > > see the available heads differ between the two consecutive reads. More
> > > details are shared below.
> > >
> > > From the guest side
> > > ===================
> > >
> > > virtio_net virtio0: output.0:id 86 is not a head!
> > > head to be released: 047 062 112
> > >
> > > avail_idx:
> > > 000 49665
> > > 001 49666 <--
> > > :
> > > 015 49664
> >
> > what are these #s 49665 and so on?
> > and how large is the ring?
> > I am guessing 49664 is the index, the ring size is 16 and
> > 49664 % 16 == 0
>
> More than that, 49664 % 256 == 0
>
> So again there seems to be an error in the vicinity of roll-over of
> the idx low byte, as I observed in the earlier log. Surely this is
> more than coincidence?

Yeah, I'd still really like to see the disassembly for both sides of the
protocol here. Gavin, is that something you're able to provide? Worst
case, the host and guest vmlinux objects would be a starting point.

Personally, I'd be fairly surprised if this was a hardware issue.

Will

2024-03-26 16:24:04

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 09:38:55AM +0000, Keir Fraser wrote:
> > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > Secondly, the debugging code is enhanced so that the available head for
> > > > (last_avail_idx - 1) is read twice and recorded. That is, the available
> > > > head for one specific available index is read two times in a row. I do
> > > > see the available heads differ between the two consecutive reads. More
> > > > details are shared below.
> > > >
> > > > From the guest side
> > > > ===================
> > > >
> > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > head to be released: 047 062 112
> > > >
> > > > avail_idx:
> > > > 000 49665
> > > > 001 49666 <--
> > > > :
> > > > 015 49664
> > >
> > > what are these #s 49665 and so on?
> > > and how large is the ring?
> > > I am guessing 49664 is the index, the ring size is 16 and
> > > 49664 % 16 == 0
> >
> > More than that, 49664 % 256 == 0
> >
> > So again there seems to be an error in the vicinity of roll-over of
> > the idx low byte, as I observed in the earlier log. Surely this is
> > more than coincidence?
>
> Yeah, I'd still really like to see the disassembly for both sides of the
> protocol here. Gavin, is that something you're able to provide? Worst
> case, the host and guest vmlinux objects would be a starting point.
>
> Personally, I'd be fairly surprised if this was a hardware issue.

Ok, long shot after eyeballing the vhost code, but does the diff below
help at all? It looks like vhost_vq_avail_empty() can advance the value
saved in 'vq->avail_idx' but without the read barrier, possibly confusing
vhost_get_vq_desc() in polling mode.

Will

--->8

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..87bff710331a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
return false;
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

+ smp_rmb();
return vq->avail_idx == vq->last_avail_idx;
}
EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);


2024-03-26 23:15:32

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring


On 3/27/24 01:46, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:
>
> Ok, long shot after eyeballing the vhost code, but does the diff below
> help at all? It looks like vhost_vq_avail_empty() can advance the value
> saved in 'vq->avail_idx' but without the read barrier, possibly confusing
> vhost_get_vq_desc() in polling mode.
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..87bff710331a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> return false;
> vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>
> + smp_rmb();
> return vq->avail_idx == vq->last_avail_idx;
> }
> EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
>

Thanks, Will. I had already noticed that smp_rmb() was missing in vhost_vq_avail_empty().
The issue still exists after smp_rmb() is added there alone. However, inspired by your
suggestion, I rechecked the code. It seems another smp_rmb() is also missing in
vhost_enable_notify().

With smp_rmb() added to both vhost_vq_avail_empty() and vhost_enable_notify(), I'm unable
to hit the issue. I will run the test a few more times to make sure the issue is really
resolved. After that, I will post formal patches for review.

Thanks,
Gavin
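
(For illustration, the shape of the two additions being described - a sketch
against the vhost code quoted earlier in the thread, not the formal patches:)

	/* vhost_vq_avail_empty(): order any subsequent avail ring reads
	 * after the avail->idx read that exposed the new entries. */
	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
	smp_rmb();
	return vq->avail_idx == vq->last_avail_idx;

	/* vhost_enable_notify(): the same pattern after re-reading
	 * avail->idx when guest notifications are re-enabled. */
	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
	smp_rmb();
	return vq->avail_idx != vq->last_avail_idx;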


2024-03-27 00:01:58

by Gavin Shan

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring


On 3/27/24 09:14, Gavin Shan wrote:
> On 3/27/24 01:46, Will Deacon wrote:
>> On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:
>>
>> Ok, long shot after eyeballing the vhost code, but does the diff below
>> help at all? It looks like vhost_vq_avail_empty() can advance the value
>> saved in 'vq->avail_idx' but without the read barrier, possibly confusing
>> vhost_get_vq_desc() in polling mode.
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 045f666b4f12..87bff710331a 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>>                  return false;
>>          vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>> +       smp_rmb();
>>          return vq->avail_idx == vq->last_avail_idx;
>>   }
>>   EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
>>
>
> Thanks, Will. I had already noticed that smp_rmb() was missing in vhost_vq_avail_empty().
> The issue still exists after smp_rmb() is added there alone. However, inspired by your
> suggestion, I rechecked the code. It seems another smp_rmb() is also missing in
> vhost_enable_notify().
>
> With smp_rmb() added to both vhost_vq_avail_empty() and vhost_enable_notify(), I'm unable
> to hit the issue. I will run the test a few more times to make sure the issue is really
> resolved. After that, I will post formal patches for review.
>

Thanks again, Will. The formal patches have been sent for review.

https://lkml.org/lkml/2024/3/27/40

Thanks,
Gavin


2024-03-27 11:57:16

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] virtio_ring: Fix the stale index in available ring

On Tue, Mar 26, 2024 at 03:46:29PM +0000, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:
> > On Tue, Mar 26, 2024 at 09:38:55AM +0000, Keir Fraser wrote:
> > > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > > Secondly, the debugging code is enhanced so that the available head for
> > > > > (last_avail_idx - 1) is read twice and recorded. That is, the available
> > > > > head for one specific available index is read two times in a row. I do
> > > > > see the available heads differ between the two consecutive reads. More
> > > > > details are shared below.
> > > > >
> > > > > From the guest side
> > > > > ===================
> > > > >
> > > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > > head to be released: 047 062 112
> > > > >
> > > > > avail_idx:
> > > > > 000 49665
> > > > > 001 49666 <--
> > > > > :
> > > > > 015 49664
> > > >
> > > > what are these #s 49665 and so on?
> > > > and how large is the ring?
> > > > I am guessing 49664 is the index, the ring size is 16 and
> > > > 49664 % 16 == 0
> > >
> > > More than that, 49664 % 256 == 0
> > >
> > > So again there seems to be an error in the vicinity of roll-over of
> > > the idx low byte, as I observed in the earlier log. Surely this is
> > > more than coincidence?
> >
> > Yeah, I'd still really like to see the disassembly for both sides of the
> > protocol here. Gavin, is that something you're able to provide? Worst
> > case, the host and guest vmlinux objects would be a starting point.
> >
> > Personally, I'd be fairly surprised if this was a hardware issue.
>
> Ok, long shot after eyeballing the vhost code, but does the diff below
> help at all? It looks like vhost_vq_avail_empty() can advance the value
> saved in 'vq->avail_idx' but without the read barrier, possibly confusing
> vhost_get_vq_desc() in polling mode.
>
> Will
>
> --->8
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..87bff710331a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> return false;
> vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
>
> + smp_rmb();
> return vq->avail_idx == vq->last_avail_idx;
> }
> EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Oh wow, you are right.

We have:

	if (vq->avail_idx == vq->last_avail_idx) {
		if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
			vq_err(vq, "Failed to access avail idx at %p\n",
			       &vq->avail->idx);
			return -EFAULT;
		}
		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);

		if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
			vq_err(vq, "Guest moved used index from %u to %u",
			       last_avail_idx, vq->avail_idx);
			return -EFAULT;
		}

		/* If there's nothing new since last we looked, return
		 * invalid.
		 */
		if (vq->avail_idx == last_avail_idx)
			return vq->num;

		/* Only get avail ring entries after they have been
		 * exposed by guest.
		 */
		smp_rmb();
	}


and so the rmb only happens if avail_idx was not already advanced.

Actually there is a bunch of code duplication where we assign to
avail_idx, too.
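
(One way to kill that duplication - a hypothetical sketch reusing the
vhost_get_avail_idx() helper quoted in the diffs above; any real cleanup may
well end up looking different:)

	/* Refresh the cached vq->avail_idx from the guest and, if it
	 * moved, order all following avail ring reads after the index
	 * read that exposed them. */
	static int vhost_refresh_avail_idx(struct vhost_virtqueue *vq)
	{
		__virtio16 avail_idx;

		if (unlikely(vhost_get_avail_idx(vq, &avail_idx)))
			return -EFAULT;

		vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
		if (vq->avail_idx != vq->last_avail_idx)
			smp_rmb();

		return 0;
	}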

Will, thanks a lot for looking into this! I kept looking at
the virtio side for some reason; the fact that it did not
trigger with qemu should have been a big hint!


--
MST