Thanks, I've applied this to get it into linux-next ASAP. I'd love
to have reviews before sending it on to Linus, though.
On Wed, 13 Sep 2023 14:14:03 +0200
Christoph Hellwig <[email protected]> wrote:
> Thanks, I've applied this to get it into linux-next ASAP. I'd love
> to have reviews before sending it on to Linus, though.
Fully understood, given my past record. ;-)
@Catalin Marinas: I have added you to the list of recipients, because you
spotted some issues with memory barriers in one of my previous attempts.
Petr T
On Wed, Sep 13, 2023 at 02:26:56PM +0200, Petr Tesařík wrote:
> On Wed, 13 Sep 2023 14:14:03 +0200
> Christoph Hellwig <[email protected]> wrote:
> > Thanks, I've applied this to get it into linux-next ASAP. I'd love
> > to have reviews before sending it on to Linus, though.
>
> Fully understood, given my past record. ;-)
>
> @Catalin Marinas: I have added you to the list of recipients, because you
> spotted some issues with memory barriers in one of my previous attempts.
I seem to have forgotten everything about that thread (trying to
remember what I meant in
https://lore.kernel.org/all/[email protected]/ ;)).
What do the smp_wmb() barriers in swiotlb_find_slots() and
swiotlb_dyn_alloc() order? The latter is even more unclear as it's at
the end of the function and the "pairing" comment doesn't help.
I found the "pairing" description in memory-barriers.txt not helping
much and lots of code comments end up stating that some barrier is
paired with another as if it is some magic synchronisation event. In
general a wmb() barrier ends up paired with an rmb() one (or some other
form of dependency, acquire semantics etc.) but they should be described
in isolation first - what memory accesses on a CPU before and after the
*mb() need ordering - and then how this pairing solves the accesses
observability across multiple CPUs. The comments should also describe
things like "ensure payload is visible before flag update" rather than
"the wmb here pairs with the rmb over there".
Presumably in swiotlb_find_slots(), the smp_wmb() needs to ensure the
write to dev->dma_uses_io_tlb is visible before the *retpool write? For
swiotlb_dyn_alloc(), I'm completely lost on what the smp_wmb() orders.
You really need at least two memory accesses on a CPU for the barrier to
make sense.
--
Catalin
On Thu, 14 Sep 2023 19:28:01 +0100
Catalin Marinas <[email protected]> wrote:
> On Wed, Sep 13, 2023 at 02:26:56PM +0200, Petr Tesařík wrote:
> > On Wed, 13 Sep 2023 14:14:03 +0200
> > Christoph Hellwig <[email protected]> wrote:
> > > Thanks, I've applied this to get it into linux-next ASAP. I'd love
> > > to have reviews before sending it on to Linus, though.
> >
> > Fully understood, given my past record. ;-)
> >
> > @Catalin Marinas: I have added you to the list of recipients, because you
> > spotted some issues with memory barriers in one of my previous attempts.
>
> I seem to have forgotten everything about that thread (trying to
> remember what I meant in
> https://lore.kernel.org/all/[email protected]/ ;)).
Oh... Then I'm very much indebted to you for looking at this patch
nevertheless.
What happened is this: I originally wrote code which maintained a
per-device list of actively used swiotlb pools. The idea was rejected
in an internal review at Huawei, so I reverted to a simple flag to
reduce at least the impact on devices that do not use swiotlb. There
might be some left-overs...
> What do the smp_wmb() barriers in swiotlb_find_slots() and
> swiotlb_dyn_alloc() order? The latter is even more unclear as it's at
> the end of the function and the "pairing" comment doesn't help.
By the time swiotlb_find_slots() returns a valid slot index, the new
value of dev->dma_uses_io_tlb must be visible by all CPUs in
is_swiotlb_buffer(). The index is used to calculate the bounce buffer
address returned to device drivers. This address may be passed to
another CPU and used as an argument to is_swiotlb_buffer().
I am not sure that the smp_wmb() in swiotlb_dyn_alloc() is needed, but
let me explain my thinking. Even if the dev->dma_uses_io_tlb flag is
set, is_swiotlb_buffer() returns false unless the buffer address is
found in the RCU list of swiotlb pools, which is walked without taking
any lock. In some iterations of the patch series, there was a direct
call to swiotlb_dyn_alloc(), and a smp_wmb() was necessary to make sure
that the list of swiotlb pools cannot be observed as empty by another
CPU. You have already explained to me that taking a spin lock in
add_mem_pool() is not sufficient, because it does not invalidate a
value that might have been speculatively read by another CPU before
entering the critical section. OTOH swiotlb_dyn_alloc() is always
called through schedule_work() in the current version. If that
implicitly provides necessary barriers, this smp_wmb() can be removed.
FTR list_add_rcu() alone is not sufficient. It adds the new element
using rcu_assign_pointer(), which modifies the forward link with
smp_store_release(). However, list_for_each_entry_rcu() reads the head
with list_entry_rcu(), which is a simple READ_ONCE(); there is no
ACQUIRE semantics.
Actually, I wonder when a newly added RCU list element is guaranteed to
be visible by all CPUs. Existing documentation deals mainly with
element removal, explaining that both the old state and the new state
of an RCU list are valid after addition. Is it assumed that the old
state of an RCU list after addition is valid indefinitely?
> I found the "pairing" description in memory-barriers.txt not helping
> much and lots of code comments end up stating that some barrier is
> paired with another as if it is some magic synchronisation event. In
> general a wmb() barrier ends up paired with an rmb() one (or some other
> form of dependency, acquire semantics etc.) but they should be described
> in isolation first - what memory accesses on a CPU before and after the
> *mb() need ordering - and then how this pairing solves the accesses
> observability across multiple CPUs. The comments should also describe
> things like "ensure payload is visible before flag update" rather than
> "the wmb here pairs with the rmb over there".
Good point. I will improve my memory barrier comments accordingly.
> Presumably in swiotlb_find_slots(), the smp_wmb() needs to ensure the
> write to dev->dma_uses_io_tlb is visible before the *retpool write? For
> swiotlb_dyn_alloc(), I'm completely lost on what the smp_wmb() orders.
> You really need at least two memory accesses on a CPU for the barrier to
> make sense.
Thank you again for your time!
Petr T
On Fri, Sep 15, 2023 at 11:13:43AM +0200, Petr Tesařík wrote:
> On Thu, 14 Sep 2023 19:28:01 +0100
> Catalin Marinas <[email protected]> wrote:
> > What do the smp_wmb() barriers in swiotlb_find_slots() and
> > swiotlb_dyn_alloc() order? The latter is even more unclear as it's at
> > the end of the function and the "pairing" comment doesn't help.
>
> By the time swiotlb_find_slots() returns a valid slot index, the new
> value of dev->dma_uses_io_tlb must be visible by all CPUs in
> is_swiotlb_buffer(). The index is used to calculate the bounce buffer
> address returned to device drivers. This address may be passed to
> another CPU and used as an argument to is_swiotlb_buffer().
Ah, I remember now. So the smp_wmb() ensures that dma_uses_io_tlb is
seen by other CPUs before the slot address (presumably passed via other
memory write). It may be worth updating the comment in the code (I'm
sure I'll forget it in a month's time). The smp_rmb() before READ_ONCE()
in this patch is also needed for the same reasons (ordering after the
read of the address passed to is_swiotlb_buffer()).
BTW, you may want to use WRITE_ONCE() when setting dma_uses_io_tlb (it
also matches the READ_ONCE() in is_swiotlb_buffer()). Or you can use
smp_store_mb() (but check its semantics first).
> I am not sure that the smp_wmb() in swiotlb_dyn_alloc() is needed, but
> let me explain my thinking. Even if the dev->dma_uses_io_tlb flag is
> set, is_swiotlb_buffer() returns false unless the buffer address is
> found in the RCU list of swiotlb pools, which is walked without taking
> any lock. In some iterations of the patch series, there was a direct
> call to swiotlb_dyn_alloc(), and a smp_wmb() was necessary to make sure
> that the list of swiotlb pools cannot be observed as empty by another
> CPU. You have already explained to me that taking a spin lock in
> add_mem_pool() is not sufficient, because it does not invalidate a
> value that might have been speculatively read by another CPU before
> entering the critical section. OTOH swiotlb_dyn_alloc() is always
> called through schedule_work() in the current version. If that
> implicitly provides necessary barriers, this smp_wmb() can be removed.
Normally you'd need a barrier between pool update and flag update:
        list_add_rcu(&pool->node, &mem->pools);
        smp_wmb();
        WRITE_ONCE(dev->dma_uses_io_tlb, true);
The lock around the mem->pools update doesn't help since it only has
release semantics on the unlock path (which doesn't prevent the
dma_uses_io_tlb write from being reordered before).
On the reader side, you need to make sure that if the dma_uses_io_tlb is
true, the mem->pools access was not speculated and read earlier as
empty, so another dma_rmb():
        if (READ_ONCE(dev->dma_uses_io_tlb)) {
                dma_rmb();
                swiotlb_find_pool(...);
        }
That's missing in the code currently (and rcu_read_lock() only has
acquire semantics).
However, what confuses me is that mem->pools is populated asynchronously
via schedule_work(). Not sure how the barriers help since the work can
be scheduled on any CPU at any point after, potentially after
dma_uses_io_tlb has been updated.
On the transient dev->dma_io_tlb_pools updated in swiotlb_find_slots(),
you'd need the barriers as I mentioned above (unless I misunderstood how
this pool searching works; not entirely sure why swiotlb_find_pool()
searches both).
> FTR list_add_rcu() alone is not sufficient. It adds the new element
> using rcu_assign_pointer(), which modifies the forward link with
> smp_store_release(). However, list_for_each_entry_rcu() reads the head
> with list_entry_rcu(), which is a simple READ_ONCE(); there is no
> ACQUIRE semantics.
>
> Actually, I wonder when a newly added RCU list element is guaranteed to
> be visible by all CPUs. Existing documentation deals mainly with
> element removal, explaining that both the old state and the new state
> of an RCU list are valid after addition. Is it assumed that the old
> state of an RCU list after addition is valid indefinitely?
When some write becomes visible to other CPUs, I don't think the
architectures define precisely (more like in vague terms as
"eventually"). That's why we can't talk about barriers in relation to a
single write as the barrier doesn't make a write visible but rather
makes it visible _before_ a subsequent write becomes visible (whenever
that may be).
Anyway, as above, I think the current code is still missing some
barriers between dma_uses_io_tlb update or read and the actual pool
writes. But I haven't figured out how this works with the asynchronous
swiotlb_dyn_alloc().
--
Catalin
Hi Catalin,
thank you for your reply!
On Fri, 15 Sep 2023 18:09:28 +0100
Catalin Marinas <[email protected]> wrote:
> On Fri, Sep 15, 2023 at 11:13:43AM +0200, Petr Tesařík wrote:
> > On Thu, 14 Sep 2023 19:28:01 +0100
> > Catalin Marinas <[email protected]> wrote:
> > > What do the smp_wmb() barriers in swiotlb_find_slots() and
> > > swiotlb_dyn_alloc() order? The latter is even more unclear as it's at
> > > the end of the function and the "pairing" comment doesn't help.
> >
> > By the time swiotlb_find_slots() returns a valid slot index, the new
> > value of dev->dma_uses_io_tlb must be visible by all CPUs in
> > is_swiotlb_buffer(). The index is used to calculate the bounce buffer
> > address returned to device drivers. This address may be passed to
> > another CPU and used as an argument to is_swiotlb_buffer().
>
> Ah, I remember now. So the smp_wmb() ensures that dma_uses_io_tlb is
> seen by other CPUs before the slot address (presumably passed via other
> memory write). It may be worth updating the comment in the code (I'm
> sure I'll forget it in a month's time). The smp_rmb() before READ_ONCE()
> in this patch is also needed for the same reasons (ordering after the
> read of the address passed to is_swiotlb_buffer()).
I agree. The comment should definitely mention the implicitly assumed
memory write done by device drivers, because it is not obvious which
two writes are being ordered.
> BTW, you may want to use WRITE_ONCE() when setting dma_uses_io_tlb (it
> also matches the READ_ONCE() in is_swiotlb_buffer()). Or you can use
> smp_store_mb() (but check its semantics first).
I can use WRITE_ONCE(), although I believe it does not make much
difference thanks to the barrier provided by smp_wmb().
Using smp_store_mb() should also work, but it inserts a full memory
barrier, which is more than is needed IMO.
> > I am not sure that the smp_wmb() in swiotlb_dyn_alloc() is needed, but
> > let me explain my thinking. Even if the dev->dma_uses_io_tlb flag is
> > set, is_swiotlb_buffer() returns false unless the buffer address is
> > found in the RCU list of swiotlb pools, which is walked without taking
> > any lock. In some iterations of the patch series, there was a direct
> > call to swiotlb_dyn_alloc(), and a smp_wmb() was necessary to make sure
> > that the list of swiotlb pools cannot be observed as empty by another
> > CPU. You have already explained to me that taking a spin lock in
> > add_mem_pool() is not sufficient, because it does not invalidate a
> > value that might have been speculatively read by another CPU before
> > entering the critical section. OTOH swiotlb_dyn_alloc() is always
> > called through schedule_work() in the current version. If that
> > implicitly provides necessary barriers, this smp_wmb() can be removed.
>
> Normally you'd need a barrier between pool update and flag update:
>
> list_add_rcu(&pool->node, &mem->pools);
> smp_wmb();
> WRITE_ONCE(dev->dma_uses_io_tlb, true);
I agree. However, if is_swiotlb_buffer() gets an address allocated from
a swiotlb pool that was added here, then it also had to be returned by
swiotlb_find_slots(), where smp_wmb() makes sure that all writes are
visible before the above-mentioned assumed write of the buffer pointer
into a private data field, performed by driver code.
> The lock around the mem->pools update doesn't help since it only has
> release semantics on the unlock path (which doesn't prevent the
> dma_uses_io_tlb write from being reordered before).
Yes.
> On the reader side, you need to make sure that if the dma_uses_io_tlb is
> true, the mem->pools access was not speculated and read earlier as
> empty, so another dma_rmb():
>
> if (READ_ONCE(dev->dma_uses_io_tlb)) {
> dma_rmb();
I think this is a typo, and you mean smp_rmb(); at least I can't see
which consistent DMA memory would be accessed here.
> swiotlb_find_pool(...);
> }
>
> That's missing in the code currently (and rcu_read_lock() only has
> acquire semantics).
>
> However, what confuses me is that mem->pools is populated
> asynchronously via schedule_work(). Not sure how the barriers help
> since the work can be scheduled on any CPU at any point after,
> potentially after dma_uses_io_tlb has been updated.
Ah... You may have a point after all if this sequence of events is
possible:
- CPU 0 writes new value to mem->pools->next in swiotlb_dyn_alloc().
- CPU 1 observes the new value in swiotlb_find_slots(), even though it
is not guaranteed by any barrier, allocates a slot and sets the
dev->dma_uses_io_tlb flag.
- CPU 1 (driver code) writes the returned buffer address into its
private struct. This write is ordered after dev->dma_uses_io_tlb
thanks to the smp_wmb() in swiotlb_find_slots().
- CPU 2 (driver code) reads the buffer address, and DMA core passes it
to is_swiotlb_buffer(), which contains smp_rmb().
- IIUC CPU 2 is guaranteed to observe the new value of
dev->dma_uses_io_tlb, but it may still use the old value of
mem->pools->next, because the write on CPU 0 was not ordered
against anything. The fact that the new value was observed by CPU 1
does not mean that it is also observed by CPU 2.
If this is what can happen, then I'm not sure how these events can be
properly ordered in a lockless fashion.
> On the transient dev->dma_io_tlb_pools updated in
> swiotlb_find_slots(), you'd need the barriers as I mentioned above
> (unless I misunderstood how this pool searching works; not entirely
> sure why swiotlb_find_pool() searches both).
The transient pool list is updated only in swiotlb_find_slots() before
setting dev->dma_uses_io_tlb. AFAICS this write is already
ordered with the same smp_wmb().
>[...]
> > Actually, I wonder when a newly added RCU list element is
> > guaranteed to be visible by all CPUs. Existing documentation deals
> > mainly with element removal, explaining that both the old state and
> > the new state of an RCU list are valid after addition. Is it
> > assumed that the old state of an RCU list after addition is valid
> > indefinitely?
>
> When some write becomes visible to other CPUs, I don't think the
> architectures define precisely (more like in vague terms as
> "eventually"). That's why we can't talk about barriers in relation to
> a single write as the barrier doesn't make a write visible but rather
> makes it visible _before_ a subsequent write becomes visible (whenever
> that may be).
This is understood. It's just that list_add_rcu() updates the next
pointer in the previous list entry using smp_store_release(), ordering
writes on this next pointer, but I don't see the matching loads with
acquire semantics anywhere. It's probably required to detect RCU grace
period...
Anyway, I was merely wondering when the new list entry is visible on
all CPUs, and your answer seems to confirm that the RCU list primitives
do not provide any guarantees. IOW if correct operation requires that
the new value is observed before something else (like here), it needs
additional synchronization.
Petr T
On Sun, Sep 17, 2023 at 11:47:41AM +0200, Petr Tesařík wrote:
> On Fri, 15 Sep 2023 18:09:28 +0100
> Catalin Marinas <[email protected]> wrote:
> > On Fri, Sep 15, 2023 at 11:13:43AM +0200, Petr Tesařík wrote:
> > > On Thu, 14 Sep 2023 19:28:01 +0100
> > > Catalin Marinas <[email protected]> wrote:
> > > > What do the smp_wmb() barriers in swiotlb_find_slots() and
> > > > swiotlb_dyn_alloc() order? The latter is even more unclear as it's at
> > > > the end of the function and the "pairing" comment doesn't help.
> > >
> > > By the time swiotlb_find_slots() returns a valid slot index, the new
> > > value of dev->dma_uses_io_tlb must be visible by all CPUs in
> > > is_swiotlb_buffer(). The index is used to calculate the bounce buffer
> > > address returned to device drivers. This address may be passed to
> > > another CPU and used as an argument to is_swiotlb_buffer().
> >
> > Ah, I remember now. So the smp_wmb() ensures that dma_uses_io_tlb is
> > seen by other CPUs before the slot address (presumably passed via other
> > memory write). It may be worth updating the comment in the code (I'm
> > sure I'll forget it in a month's time). The smp_rmb() before READ_ONCE()
> > in this patch is also needed for the same reasons (ordering after the
> > read of the address passed to is_swiotlb_buffer()).
>
> I agree. The comment should definitely mention the implicitly assumed
> memory write done by device drivers, because it is not obvious which
> two writes are being ordered.
>
> > BTW, you may want to use WRITE_ONCE() when setting dma_uses_io_tlb (it
> > also matches the READ_ONCE() in is_swiotlb_buffer()). Or you can use
> > smp_store_mb() (but check its semantics first).
>
> I can use WRITE_ONCE(), although I believe it does not make much
> difference thanks to the barrier provided by smp_wmb().
WRITE_ONCE() is about atomicity rather than ordering (and avoiding
compiler optimisations messing things). While I don't see the compiler
generating multiple accesses for a boolean write, using these accessors
also helps tools like kcsan.
> Using smp_store_mb() should also work, but it inserts a full memory
> barrier, which is more than is needed IMO.
True.
> > > I am not sure that the smp_wmb() in swiotlb_dyn_alloc() is needed, but
> > > let me explain my thinking. Even if the dev->dma_uses_io_tlb flag is
> > > set, is_swiotlb_buffer() returns false unless the buffer address is
> > > found in the RCU list of swiotlb pools, which is walked without taking
> > > any lock. In some iterations of the patch series, there was a direct
> > > call to swiotlb_dyn_alloc(), and a smp_wmb() was necessary to make sure
> > > that the list of swiotlb pools cannot be observed as empty by another
> > > CPU. You have already explained to me that taking a spin lock in
> > > add_mem_pool() is not sufficient, because it does not invalidate a
> > > value that might have been speculatively read by another CPU before
> > > entering the critical section. OTOH swiotlb_dyn_alloc() is always
> > > called through schedule_work() in the current version. If that
> > > implicitly provides necessary barriers, this smp_wmb() can be removed.
> >
> > Normally you'd need a barrier between pool update and flag update:
> >
> > list_add_rcu(&pool->node, &mem->pools);
> > smp_wmb();
> > WRITE_ONCE(dev->dma_uses_io_tlb, true);
>
> I agree. However, if is_swiotlb_buffer() gets an address allocated from
> a swiotlb pool that was added here, then it also had to be returned by
> swiotlb_find_slots(), where smp_wmb() makes sure that all writes are
> visible before the above-mentioned assumed write of the buffer pointer
> into a private data field, performed by driver code.
You are right, we have a dependency on this assumed write, so it should
be fine (no need to sprinkle more barriers).
> > The lock around the mem->pools update doesn't help since it only has
> > release semantics on the unlock path (which doesn't prevent the
> > dma_uses_io_tlb write from being reordered before).
>
> Yes.
>
> > On the reader side, you need to make sure that if the dma_uses_io_tlb is
> > true, the mem->pools access was not speculated and read earlier as
> > empty, so another dma_rmb():
> >
> > if (READ_ONCE(dev->dma_uses_io_tlb)) {
> > dma_rmb();
>
> I think this is a typo, and you mean smp_rmb(); at least I can't see
> which consistent DMA memory would be accessed here.
Yes, smp_rmb().
> > swiotlb_find_pool(...);
> > }
> >
> > That's missing in the code currently (and rcu_read_lock() only has
> > acquire semantics).
> >
> > However, what confuses me is that mem->pools is populated
> > asynchronously via schedule_work(). Not sure how the barriers help
> > since the work can be scheduled on any CPU at any point after,
> > potentially after dma_uses_io_tlb has been updated.
>
> Ah... You may have a point after all if this sequence of events is
> possible:
>
> - CPU 0 writes new value to mem->pools->next in swiotlb_dyn_alloc().
>
> - CPU 1 observes the new value in swiotlb_find_slots(), even though it
> is not guaranteed by any barrier, allocates a slot and sets the
> dev->dma_uses_io_tlb flag.
>
> - CPU 1 (driver code) writes the returned buffer address into its
> private struct. This write is ordered after dev->dma_uses_io_tlb
> thanks to the smp_wmb() in swiotlb_find_slots().
>
> - CPU 2 (driver code) reads the buffer address, and DMA core passes it
> to is_swiotlb_buffer(), which contains smp_rmb().
>
> - IIUC CPU 2 is guaranteed to observe the new value of
> dev->dma_uses_io_tlb, but it may still use the old value of
> mem->pools->next, because the write on CPU 0 was not ordered
> against anything. The fact that the new value was observed by CPU 1
> does not mean that it is also observed by CPU 2.
Yes, that's possible. On CPU 1 there is a control dependency between the
read of mem->pools->next and the write of dev->dma_uses_io_tlb but I
don't think this is sufficient to claim multi-copy atomicity (if CPU 1
sees mem->pools->next write by CPU 0, CPU 2 must see it as well), at
least not on all architectures supported by Linux. memory-barriers.txt
says that a full barrier on CPU 1 is needed between the read and write,
i.e. smp_mb() before WRITE_ONCE(dev->dma_uses_io_tlb). You could add it
just before "goto found" in swiotlb_find_slots() since it's only needed
on this path.
Another thing I noticed - the write in add_mem_pool() to mem->nslabs is
not ordered with list_add_rcu(). I assume swiotlb_find_slots() doesn't
need to access it since it just walks the mem->pools list.
> > On the transient dev->dma_io_tlb_pools updated in
> > swiotlb_find_slots(), you'd need the barriers as I mentioned above
> > (unless I misunderstood how this pool searching works; not entirely
> > sure why swiotlb_find_pool() searches both).
>
> The transient pool list is updated only in swiotlb_find_slots() before
> setting dev->dma_uses_io_tlb. AFAICS this write is already
> ordered with the same smp_wmb().
Yes, I think that's fine.
> >[...]
> > > Actually, I wonder when a newly added RCU list element is
> > > guaranteed to be visible by all CPUs. Existing documentation deals
> > > mainly with element removal, explaining that both the old state and
> > > the new state of an RCU list are valid after addition. Is it
> > > assumed that the old state of an RCU list after addition is valid
> > > indefinitely?
> >
> > When some write becomes visible to other CPUs, I don't think the
> > architectures define precisely (more like in vague terms as
> > "eventually"). That's why we can't talk about barriers in relation to
> > a single write as the barrier doesn't make a write visible but rather
> > makes it visible _before_ a subsequent write becomes visible (whenever
> > that may be).
>
> This is understood. It's just that list_add_rcu() updates the next
> pointer in the previous list entry using smp_store_release(), ordering
> writes on this next pointer, but I don't see the matching loads with
> acquire semantics anywhere. It's probably required to detect RCU grace
> period...
I don't have the history of the release semantics here but one of the
reasons is to ensure that any structure initialisation is visible before
the pointer is added to the list (e.g. the pool structure populated by
swiotlb_alloc_pool() before add_mem_pool()). On the reader side, you
have an address dependency as you can't read the content of the
structure without first reading the pointer (rcu_dereference() provides
an implicit address dependency).
--
Catalin
Hi Catalin,
thanks again for your reply. I'm sorry for being slow. This world of
weakly ordered memory models is complex, and I was too distracted most
of this week, but I hope I have finally wrapped my head around it.
On Mon, 18 Sep 2023 16:45:34 +0100
Catalin Marinas <[email protected]> wrote:
> On Sun, Sep 17, 2023 at 11:47:41AM +0200, Petr Tesařík wrote:
> > On Fri, 15 Sep 2023 18:09:28 +0100
> > Catalin Marinas <[email protected]> wrote:
> > > On Fri, Sep 15, 2023 at 11:13:43AM +0200, Petr Tesařík wrote:
> > > > On Thu, 14 Sep 2023 19:28:01 +0100
> > > > Catalin Marinas <[email protected]> wrote:
> > > > > What do the smp_wmb() barriers in swiotlb_find_slots() and
> > > > > swiotlb_dyn_alloc() order? The latter is even more unclear as it's at
> > > > > the end of the function and the "pairing" comment doesn't help.
> > > >
> > > > By the time swiotlb_find_slots() returns a valid slot index, the new
> > > > value of dev->dma_uses_io_tlb must be visible by all CPUs in
> > > > is_swiotlb_buffer(). The index is used to calculate the bounce buffer
> > > > address returned to device drivers. This address may be passed to
> > > > another CPU and used as an argument to is_swiotlb_buffer().
> > >
> > > Ah, I remember now. So the smp_wmb() ensures that dma_uses_io_tlb is
> > > seen by other CPUs before the slot address (presumably passed via other
> > > memory write). It may be worth updating the comment in the code (I'm
> > > sure I'll forget it in a month's time). The smp_rmb() before READ_ONCE()
> > > in this patch is also needed for the same reasons (ordering after the
> > > read of the address passed to is_swiotlb_buffer()).
>[...]
> > > BTW, you may want to use WRITE_ONCE() when setting dma_uses_io_tlb (it
> > > also matches the READ_ONCE() in is_swiotlb_buffer()). Or you can use
> > > smp_store_mb() (but check its semantics first).
> >
> > I can use WRITE_ONCE(), although I believe it does not make much
> > difference thanks to the barrier provided by smp_wmb().
>
> WRITE_ONCE() is about atomicity rather than ordering (and avoiding
> compiler optimisations messing things). While I don't see the compiler
> generating multiple accesses for a boolean write, using these accessors
> also helps tools like kcsan.
While I still believe a simple assignment works just fine here, I agree
that WRITE_ONCE() is better. It can prevent potential bugs if someone
ever turns the boolean into something else.
>[...]
> > Ah... You may have a point after all if this sequence of events is
> > possible:
> >
> > - CPU 0 writes new value to mem->pools->next in swiotlb_dyn_alloc().
> >
> > - CPU 1 observes the new value in swiotlb_find_slots(), even though it
> > is not guaranteed by any barrier, allocates a slot and sets the
> > dev->dma_uses_io_tlb flag.
> >
> > - CPU 1 (driver code) writes the returned buffer address into its
> > private struct. This write is ordered after dev->dma_uses_io_tlb
> > thanks to the smp_wmb() in swiotlb_find_slots().
> >
> > - CPU 2 (driver code) reads the buffer address, and DMA core passes it
> > to is_swiotlb_buffer(), which contains smp_rmb().
> >
> > - IIUC CPU 2 is guaranteed to observe the new value of
> > dev->dma_uses_io_tlb, but it may still use the old value of
> > mem->pools->next, because the write on CPU 0 was not ordered
> > against anything. The fact that the new value was observed by CPU 1
> > does not mean that it is also observed by CPU 2.
>
> Yes, that's possible. On CPU 1 there is a control dependency between the
> read of mem->pools->next and the write of dev->dma_uses_io_tlb but I
> don't think this is sufficient to claim multi-copy atomicity (if CPU 1
> sees mem->pools->next write by CPU 0, CPU 2 must see it as well), at
> least not on all architectures supported by Linux. memory-barriers.txt
> says that a full barrier on CPU 1 is needed between the read and write,
> i.e. smp_mb() before WRITE_ONCE(dev->dma_uses_io_tlb). You could add it
> just before "goto found" in swiotlb_find_slots() since it's only needed
> on this path.
Let me check my understanding. This smp_mb() is not needed to make sure
that the write to dev->dma_uses_io_tlb cannot be visible before the
read of mem->pools->next. Since stores are not speculated, that
ordering is provided by the control dependency alone.
But a general barrier ensures that a third CPU which observes the write
to dev->dma_uses_io_tlb also observes the write to mem->pools->next that
this CPU read. Makes sense.
I think I can send a v2 of my patch now, with abundant comments on the
memory barriers.
> Another thing I noticed - the write in add_mem_pool() to mem->nslabs is
> not ordered with list_add_rcu(). I assume swiotlb_find_slots() doesn't
> need to access it since it just walks the mem->pools list.
That's correct. Writes to mem->nslabs are known to be racy, but it
doesn't matter. This is explained in commit 1aaa736815eb ("swiotlb:
allocate a new memory pool when existing pools are full"):
- swiotlb_tbl_map_single() and is_swiotlb_active() only check for non-zero
value. This is ensured by the existence of the default memory pool,
allocated at boot.
- The exact value is used only for non-critical purposes (debugfs, kernel
messages).
Petr T
Hi Catalin,
On Fri, 22 Sep 2023 15:31:29 +0200
Petr Tesařík <[email protected]> wrote:
>[...]
> On Mon, 18 Sep 2023 16:45:34 +0100
> Catalin Marinas <[email protected]> wrote:
>
> > On Sun, Sep 17, 2023 at 11:47:41AM +0200, Petr Tesařík wrote:
>[...]
> > > Ah... You may have a point after all if this sequence of events is
> > > possible:
> > >
> > > - CPU 0 writes new value to mem->pools->next in swiotlb_dyn_alloc().
> > >
> > > - CPU 1 observes the new value in swiotlb_find_slots(), even though it
> > > is not guaranteed by any barrier, allocates a slot and sets the
> > > dev->dma_uses_io_tlb flag.
> > >
> > > - CPU 1 (driver code) writes the returned buffer address into its
> > > private struct. This write is ordered after dev->dma_uses_io_tlb
> > > thanks to the smp_wmb() in swiotlb_find_slots().
> > >
> > > - CPU 2 (driver code) reads the buffer address, and DMA core passes it
> > > to is_swiotlb_buffer(), which contains smp_rmb().
> > >
> > > - IIUC CPU 2 is guaranteed to observe the new value of
> > > dev->dma_uses_io_tlb, but it may still use the old value of
> > > mem->pools->next, because the write on CPU 0 was not ordered
> > > against anything. The fact that the new value was observed by CPU 1
> > > does not mean that it is also observed by CPU 2.
> >
> > Yes, that's possible. On CPU 1 there is a control dependency between the
> > read of mem->pools->next and the write of dev->dma_uses_io_tlb but I
> > don't think this is sufficient to claim multi-copy atomicity (if CPU 1
> > sees mem->pools->next write by CPU 0, CPU 2 must see it as well), at
> > least not on all architectures supported by Linux. memory-barriers.txt
> > says that a full barrier on CPU 1 is needed between the read and write,
> > i.e. smp_mb() before WRITE_ONCE(dev->dma_uses_io_tlb). You could add it
> > just before "goto found" in swiotlb_find_slots() since it's only needed
> > on this path.
>
> Let me check my understanding. This smp_mb() is not needed to make sure
> that the write to dev->dma_uses_io_tlb cannot be visible before the
> read of mem->pools->next. Since stores are not speculated, that
> ordering is provided by the control dependency alone.
>
> But a general barrier ensures that a third CPU will observe the write to
> mem->pools->next after the read of mem->pools->next. Makes sense.
Now that I'm writing the patch, I get your idea to replace WRITE_ONCE()
with smp_store_release(). Since a full memory barrier is required for
multicopy atomicity, it is not "more than I need". Instead, the
ordering constraints may possibly be restricted to "CPUs participating
in a release-acquire chain" if I also replace READ_ONCE() in
is_swiotlb_buffer() with smp_load_acquire().
I believe it does not matter that the CPU which writes a new value to
mem->pools->next in swiotlb_dyn_alloc() does not participate in the
chain, because the value must have been visible to the CPU which
executes swiotlb_find_slots() and which does participate in the chain.
Let me double-check this thinking with a litmus test.
> I think I can send a v2 of my patch now, with abundant comments on the
> memory barriers.
Eh, this must be delayed a bit again...
Petr T
On Fri, 22 Sep 2023 19:12:13 +0200
Petr Tesařík <[email protected]> wrote:
> Hi Catalin,
>
> On Fri, 22 Sep 2023 15:31:29 +0200
> Petr Tesařík <[email protected]> wrote:
>
> >[...]
> > On Mon, 18 Sep 2023 16:45:34 +0100
> > Catalin Marinas <[email protected]> wrote:
> >
> > > On Sun, Sep 17, 2023 at 11:47:41AM +0200, Petr Tesařík wrote:
> >[...]
> > > > Ah... You may have a point after all if this sequence of events is
> > > > possible:
> > > >
> > > > - CPU 0 writes new value to mem->pools->next in swiotlb_dyn_alloc().
> > > >
> > > > - CPU 1 observes the new value in swiotlb_find_slots(), even though it
> > > > is not guaranteed by any barrier, allocates a slot and sets the
> > > > dev->dma_uses_io_tlb flag.
> > > >
> > > > - CPU 1 (driver code) writes the returned buffer address into its
> > > > private struct. This write is ordered after dev->dma_uses_io_tlb
> > > > thanks to the smp_wmb() in swiotlb_find_slots().
> > > >
> > > > - CPU 2 (driver code) reads the buffer address, and DMA core passes it
> > > > to is_swiotlb_buffer(), which contains smp_rmb().
> > > >
> > > > - IIUC CPU 2 is guaranteed to observe the new value of
> > > > dev->dma_uses_io_tlb, but it may still use the old value of
> > > > mem->pools->next, because the write on CPU 0 was not ordered
> > > > against anything. The fact that the new value was observed by CPU 1
> > > > does not mean that it is also observed by CPU 2.
> > >
> > > Yes, that's possible. On CPU 1 there is a control dependency between the
> > > read of mem->pools->next and the write of dev->dma_uses_io_tlb but I
> > > don't think this is sufficient to claim multi-copy atomicity (if CPU 1
> > > sees mem->pools->next write by CPU 0, CPU 2 must see it as well), at
> > > least not on all architectures supported by Linux. memory-barriers.txt
> > > says that a full barrier on CPU 1 is needed between the read and write,
> > > i.e. smp_mb() before WRITE_ONCE(dev->dma_uses_io_tlb). You could add it
> > > just before "goto found" in swiotlb_find_slots() since it's only needed
> > > on this path.
> >
> > Let me check my understanding. This smp_mb() is not needed to make sure
> > that the write to dev->dma_uses_io_tlb cannot be visible before the
> > read of mem->pools->next. Since stores are not speculated, that
> > ordering is provided by the control dependency alone.
> >
> > But a general barrier ensures that a third CPU will observe the write to
> > mem->pools->next after the read of mem->pools->next. Makes sense.
>
> Now that I'm writing the patch, I get your idea to replace WRITE_ONCE()
> with smp_store_release(). Since a full memory barrier is required for
> multicopy atomicity, it is not "more than I need". Instead, the
> ordering constraints may possibly be restricted to "CPUs participating
> in a release-acquire chain" if I also replace READ_ONCE() in
> is_swiotlb_buffer() with smp_load_acquire().
>
> I believe it does not matter that the CPU which writes a new value to
> mem->pools->next in swiotlb_dyn_alloc() does not participate in the
> chain, because the value must have been visible to the CPU which
> executes swiotlb_find_slots() and which does participate in the chain.
>
> Let me double-check this thinking with a litmus test.
Hm. I didn't have much luck with smp_store_release(), because I need
to ensure ordering of the SUBSEQUENT store (by a device driver).
However, inserting smp_mb() _after_ WRITE_ONCE(dev->dma_uses_io_tlb)
seems to be enough to ensure proper ordering. I could even remove the
write memory barrier in swiotlb_dyn_alloc().
This is my first time using herd7, so I can only hope I got everything
right. FWIW this is my litmus test:
C swiotlb-new-pool

(*
 * Result: Never
 *
 * Check that a newly allocated pool is always visible when the corresponding
 * swiotlb buffer is visible.
 *)

{}

P0(int *pool)
{
	/* swiotlb_dyn_alloc() */
	WRITE_ONCE(*pool, 999);
}

P1(int *pool, int *flag, int *buf)
{
	/* swiotlb_find_slots() */
	int r0 = READ_ONCE(*pool);
	if (r0) {
		WRITE_ONCE(*flag, 1);
		smp_mb();
	}

	/* device driver (presumed) */
	WRITE_ONCE(*buf, r0);
}

P2(int *pool, int *flag, int *buf)
{
	/* device driver (presumed) */
	int r1 = READ_ONCE(*buf);

	/* is_swiotlb_buffer() */
	int r2;
	int r3;

	smp_rmb();
	r2 = READ_ONCE(*flag);
	if (r2) {
		r3 = READ_ONCE(*pool);
	}
}

exists (2:r1<>0 /\ 2:r3=0) (* Not flagged or pool not found. *)
Petr T
On Fri, Sep 22, 2023 at 08:20:57PM +0200, Petr Tesařík wrote:
> On Fri, 22 Sep 2023 19:12:13 +0200
> Petr Tesařík <[email protected]> wrote:
> > On Fri, 22 Sep 2023 15:31:29 +0200
> > Petr Tesařík <[email protected]> wrote:
> > > On Mon, 18 Sep 2023 16:45:34 +0100
> > > Catalin Marinas <[email protected]> wrote:
> > > > On Sun, Sep 17, 2023 at 11:47:41AM +0200, Petr Tesařík wrote:
> > > > > Ah... You may have a point after all if this sequence of events is
> > > > > possible:
> > > > >
> > > > > - CPU 0 writes new value to mem->pools->next in swiotlb_dyn_alloc().
> > > > >
> > > > > - CPU 1 observes the new value in swiotlb_find_slots(), even though it
> > > > > is not guaranteed by any barrier, allocates a slot and sets the
> > > > > dev->dma_uses_io_tlb flag.
> > > > >
> > > > > - CPU 1 (driver code) writes the returned buffer address into its
> > > > > private struct. This write is ordered after dev->dma_uses_io_tlb
> > > > > thanks to the smp_wmb() in swiotlb_find_slots().
> > > > >
> > > > > - CPU 2 (driver code) reads the buffer address, and DMA core passes it
> > > > > to is_swiotlb_buffer(), which contains smp_rmb().
> > > > >
> > > > > - IIUC CPU 2 is guaranteed to observe the new value of
> > > > > dev->dma_uses_io_tlb, but it may still use the old value of
> > > > > mem->pools->next, because the write on CPU 0 was not ordered
> > > > > against anything. The fact that the new value was observed by CPU 1
> > > > > does not mean that it is also observed by CPU 2.
> > > >
> > > > Yes, that's possible. On CPU 1 there is a control dependency between the
> > > > read of mem->pools->next and the write of dev->dma_uses_io_tlb but I
> > > > don't think this is sufficient to claim multi-copy atomicity (if CPU 1
> > > > sees mem->pools->next write by CPU 0, CPU 2 must see it as well), at
> > > > least not on all architectures supported by Linux. memory-barriers.txt
> > > > says that a full barrier on CPU 1 is needed between the read and write,
> > > > i.e. smp_mb() before WRITE_ONCE(dev->dma_uses_io_tlb). You could add it
> > > > just before "goto found" in swiotlb_find_slots() since it's only needed
> > > > on this path.
> > >
> > > Let me check my understanding. This smp_mb() is not needed to make sure
> > > that the write to dev->dma_uses_io_tlb cannot be visible before the
> > > read of mem->pools->next. Since stores are not speculated, that
> > > ordering is provided by the control dependency alone.
> > >
> > > But a general barrier ensures that a third CPU will observe the write to
> > > mem->pools->next after the read of mem->pools->next. Makes sense.
> >
> > Now that I'm writing the patch, I get your idea to replace WRITE_ONCE()
> > with smp_store_release(). Since a full memory barrier is required for
> > multicopy atomicity, it is not "more than I need". Instead, the
> > ordering constraints may possibly be restricted to "CPUs participating
> > in a release-acquire chain" if I also replace READ_ONCE() in
> > is_swiotlb_buffer() with smp_load_acquire().
> >
> > I believe it does not matter that the CPU which writes a new value to
> > mem->pools->next in swiotlb_dyn_alloc() does not participate in the
> > chain, because the value must have been visible to the CPU which
> > executes swiotlb_find_slots() and which does participate in the chain.
> >
> > Let me double-check this thinking with a litmus test.
>
> Hm. I didn't have much luck with smp_store_release(), because I need
> to ensure ordering of the SUBSEQUENT store (by a device driver).
>
> However, inserting smp_mb() _after_ WRITE_ONCE(dev->dma_uses_io_tlb)
> seems to be enough to ensure proper ordering. I could even remove the
> write memory barrier in swiotlb_dyn_alloc().
The smp_wmb() in swiotlb_dyn_alloc() should be removed; it doesn't help
anything.
> This is my first time using herd7, so I can only hope I got everything
> right. FWIW this is my litmus test:
Nice, easier to reason on a smaller test.
> C swiotlb-new-pool
>
> (*
> * Result: Never
> *
> * Check that a newly allocated pool is always visible when the corresponding
> * swiotlb buffer is visible.
> *)
>
> {}
>
> P0(int *pool)
> {
> 	/* swiotlb_dyn_alloc() */
> 	WRITE_ONCE(*pool, 999);
> }
>
> P1(int *pool, int *flag, int *buf)
> {
> 	/* swiotlb_find_slots() */
> 	int r0 = READ_ONCE(*pool);
> 	if (r0) {
> 		WRITE_ONCE(*flag, 1);
> 		smp_mb();
I think in the current code, that's the smp_wmb() just before the
presumed driver write. IIUC, smp_wmb() is not sufficient to ensure that
the WRITE_ONCE() on P0 is also observed; it would need to be smp_mb().
Nor would smp_store_release() instead of WRITE_ONCE(*flag, 1) be
sufficient.
My initial thought was to place an smp_mb() just before WRITE_ONCE()
above as it matches the multicopy atomicity description in
memory-barriers.txt. But since we have the presumed driver write anyway,
we can use that as the write on P1, read-from by P2, to ensure the
global visibility of the write on P0.
> 	}
>
> 	/* device driver (presumed) */
> 	WRITE_ONCE(*buf, r0);
> }
>
> P2(int *pool, int *flag, int *buf)
> {
> 	/* device driver (presumed) */
> 	int r1 = READ_ONCE(*buf);
>
> 	/* is_swiotlb_buffer() */
> 	int r2;
> 	int r3;
>
> 	smp_rmb();
> 	r2 = READ_ONCE(*flag);
> 	if (r2) {
> 		r3 = READ_ONCE(*pool);
> 	}
> }
>
> exists (2:r1<>0 /\ 2:r3=0) (* Not flagged or pool not found. *)
>
> Petr T
I guess a v2 of this patch would only need to change the smp_wmb() in
swiotlb_find_slots() (and the original fix). But write some comments;
I'll forget everything in a week.
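Roughly, that change would look like this; a sketch only, against
swiotlb_find_slots() as in the original fix, not a tested patch:

```c
found:
	/* ... */
	WRITE_ONCE(dev->dma_uses_io_tlb, true);
	/*
	 * Was smp_wmb(). A full barrier is needed so that the earlier
	 * read of mem->pools is transitively ordered before this store:
	 * any CPU that observes dev->dma_uses_io_tlb == true (via the
	 * pairing smp_rmb() in is_swiotlb_buffer()) is then also
	 * guaranteed to observe the newly added pool.
	 */
	smp_mb();
```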
--
Catalin