Hello.
The following upstream commits:
aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
break the ath9k-based Wi-Fi access point for me. The AP emits beacons, but no client can connect to it, either from the very beginning or shortly after startup. These are the only symptoms I've noticed (i.e., no BUG/WARNING messages in `dmesg`, etc.).
The hardware is:
```
$ dmesg | grep -i swiotlb
[ 0.426785] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
BIOS Information
Vendor: American Megatrends Inc.
Version: P1.50
Release Date: 04/16/2018
Base Board Information
Manufacturer: ASRock
Product Name: J3710-ITX
02:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network Adapter (rev 01)
Subsystem: Lite-On Communications Inc Device 6621
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 17
Region 0: Memory at 81400000 (64-bit, non-prefetchable) [size=512K]
Expansion ROM at 81480000 [disabled] [size=64K]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/4 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [140 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [160 v1] Device Serial Number 00-00-00-00-00-00-00-00
Kernel driver in use: ath9k
Kernel modules: ath9k
```
These commits appeared in v5.17 and v5.16.15, and both kernels are broken for me. I'm pretty confident these commits make the difference since I've built both v5.17 and v5.16.15 without them, and it fixed the issue.
The machine also has another Wi-Fi card that acts as an 802.11ax AP, and it is not affected:
```
01:00.0 Unclassified device [0002]: MEDIATEK Corp. MT7915E 802.11ax PCI Express Wireless Network Adapter (prog-if 80)
```
So, I do understand this might be an issue with regard to SG I/O handling in ath9k, hence the relevant people in Cc.
Please advise on how to deal with this. Both Olha (in Cc) and I will be glad to test patches if needed. In case any extra info is required, please also let me know.
Thanks.
--
Oleksandr Natalenko (post-factum)
On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy <[email protected]> wrote:
>
> On 2022-03-23 17:27, Linus Torvalds wrote:
> >
> > I'm assuming that the ath9k issue is that it gives DMA mapping a big
> > enough area to handle any possible packet size, and just expects -
> > quite reasonably - smaller packets to only fill the part they need.
> >
> > Which that "info leak" patch obviously breaks entirely.
>
> Except that's the exact case which the new patch is addressing
Not "addressing". Breaking.
Which is why it will almost certainly get reverted.
Not doing DMA to the whole area seems to be quite the sane thing to do
for things like network packets, and overwriting the part that didn't
get DMA'd with zeroes seems to be exactly the wrong thing here.
So the SG_IO - and other random untrusted block command sources - data
leak will almost certainly have to be addressed differently. Possibly
by simply allocating the area with GFP_ZERO to begin with.
Linus
On 2022-03-23 19:16, Linus Torvalds wrote:
> On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy <[email protected]> wrote:
>>
>> On 2022-03-23 17:27, Linus Torvalds wrote:
>>>
>>> I'm assuming that the ath9k issue is that it gives DMA mapping a big
>>> enough area to handle any possible packet size, and just expects -
>>> quite reasonably - smaller packets to only fill the part they need.
>>>
>>> Which that "info leak" patch obviously breaks entirely.
>>
>> Except that's the exact case which the new patch is addressing
>
> Not "addressing". Breaking.
>
> Which is why it will almost certainly get reverted.
>
> Not doing DMA to the whole area seems to be quite the sane thing to do
> for things like network packets, and overwriting the part that didn't
> get DMA'd with zeroes seems to be exactly the wrong thing here.
>
> So the SG_IO - and other random untrusted block command sources - data
> leak will almost certainly have to be addressed differently. Possibly
> by simply allocating the area with GFP_ZERO to begin with.
Er, the point of the block layer case is that the whole area *is* zeroed to
begin with, and a latent memory corruption problem in SWIOTLB itself
replaces those zeros with random other kernel data unexpectedly. Let me
try illustrating some sequences for clarity...
Expected behaviour/without SWIOTLB:
Memory
---------------------------------------------------
start 12345678
dma_map(DMA_FROM_DEVICE) no-op
device writes partial data 12ABC678 <- ABC
dma_unmap(DMA_FROM_DEVICE) 12ABC678
SWIOTLB previously:
Memory Bounce buffer
---------------------------------------------------
start 12345678 xxxxxxxx
dma_map(DMA_FROM_DEVICE) no-op
device writes partial data 12345678 xxABCxxx <- ABC
dma_unmap(DMA_FROM_DEVICE) xxABCxxx <- xxABCxxx
SWIOTLB Now:
Memory Bounce buffer
---------------------------------------------------
start 12345678 xxxxxxxx
dma_map(DMA_FROM_DEVICE) 12345678 -> 12345678
device writes partial data 12345678 12ABC678 <- ABC
dma_unmap(DMA_FROM_DEVICE) 12ABC678 <- 12ABC678
Now, sure we can prevent any actual information leakage by initialising
the bounce buffer slot with zeros, but then we're just corrupting the
not-written-to parts of the mapping with zeros instead of anyone else's
old data. That's still fundamentally not OK. The only thing SWIOTLB can
do to be correct is treat DMA_FROM_DEVICE as a read-modify-write of the
entire mapping, because it has no way to know how much of it is actually
going to be modified.
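To see the difference concretely, here is a tiny self-contained userspace
sketch of the two SWIOTLB sequences above - plain arrays stand in for the
mapping and the bounce slot; this is illustrative only, not kernel code:
```
#include <stdio.h>
#include <string.h>

/* Model of the device writing "ABC" into the middle of the buffer. */
static void device_writes_partial(char *buf)
{
	memcpy(buf + 2, "ABC", 3);
}

int main(void)
{
	char mem[9] = "12345678";
	char slot[9] = {0};

	/* Previously: the slot starts with a previous user's leftovers. */
	memset(slot, 'x', 8);
	device_writes_partial(slot);     /* slot:  xxABCxxx            */
	memcpy(mem, slot, 8);            /* unmap bounces whole slot   */
	printf("previously: %s\n", mem); /* prints xxABCxxx - corrupt  */

	/* Now: dma_map() pre-fills the slot from the original buffer. */
	memcpy(mem, "12345678", 8);
	memcpy(slot, mem, 8);            /* map copies mem -> slot     */
	device_writes_partial(slot);     /* slot:  12ABC678            */
	memcpy(mem, slot, 8);
	printf("now:        %s\n", mem); /* prints 12ABC678 - intact   */
	return 0;
}
```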
I'll admit I still never quite grasped the reason for also adding the
override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I
think by that point we were increasingly tired and confused and starting
to second-guess ourselves (well, I was, at least). I don't think it's
wrong per se, but as I said I do think it can bite anyone who's been
doing dma_sync_*() wrong but getting away with it until now. If
ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a
partial revert of just that one hunk.
Thanks,
Robin.
On 2022-03-23 17:27, Linus Torvalds wrote:
> On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko
> <[email protected]> wrote:
>>
>> The following upstream commits:
>>
>> aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
>> ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
>>
>> break the ath9k-based Wi-Fi access point for me. The AP emits beacons, but
>> no client can connect to it, either from the very beginning or
>> shortly after startup. These are the only symptoms I've noticed (i.e.,
>> no BUG/WARNING messages in `dmesg`, etc.).
>
> Funky, but clearly true:
>
>> These commits appeared in v5.17 and v5.16.15, and both kernels are
>> broken for me. I'm pretty confident these commits make the difference
>> since I've built both v5.17 and v5.16.15 without them, and it fixed
>> the issue.
>
> Can you double-check (or just explicitly confirm if you already did
> that test) that you need to revert *both* of those commits, and it's
> the later "rework" fix that triggers it?
>
>> So, I do understand this might be an issue with regard to SG I/O
>> handling in ath9k, hence the relevant people in Cc.
>
> Yeah, almost certainly an ath9k bug, but a regression is a regression,
> so if people can't find the issue in ath9k, we'll have to revert those
> commits.
>
> Honestly, I personally think they were a bit draconian to begin with,
> and didn't limit their effects sufficiently.
>
> I'm assuming that the ath9k issue is that it gives DMA mapping a big
> enough area to handle any possible packet size, and just expects -
> quite reasonably - smaller packets to only fill the part they need.
>
> Which that "info leak" patch obviously breaks entirely.
Except that's the exact case which the new patch is addressing - by
copying the whole original area into the SWIOTLB bounce buffer to begin
with, if we bounce the whole lot back after the device has only updated
part of it, the non-updated parts now get overwritten with the same
original contents, rather than whatever random crap happened to be left
in the SWIOTLB buffer by its previous user. I'm extremely puzzled how
any driver could somehow be dependent on non-device-written data getting
replaced with random crap, given that it wouldn't happen with a real
IOMMU, or if SWIOTLB just didn't need to bounce, and the data would
hardly be deterministic either.
I think I can see how aa6f8dcbab47 might increase the severity of a
driver bug where it calls dma_sync_*_for_device() on part of a
DMA_FROM_DEVICE mapping that the device *has* written to, without having
called a corresponding dma_sync_*_for_cpu() first - previously that
would have had no effect, but now SWIOTLB will effectively behave more
like an eagerly-prefetching non-coherent cache and write back old data
over new - but if ddbd89deb7d3 alone makes a difference then something
really weird must be going on.
Has anyone run a sanity check with CONFIG_DMA_API_DEBUG enabled to see
if that flags anything up?
Robin.
On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko
<[email protected]> wrote:
>
> The following upstream commits:
>
> aa6f8dcbab47 swiotlb: rework "fix info leak with DMA_FROM_DEVICE"
> ddbd89deb7d3 swiotlb: fix info leak with DMA_FROM_DEVICE
>
> break the ath9k-based Wi-Fi access point for me. The AP emits beacons, but
> no client can connect to it, either from the very beginning or
> shortly after startup. These are the only symptoms I've noticed (i.e.,
> no BUG/WARNING messages in `dmesg`, etc.).
Funky, but clearly true:
> These commits appeared in v5.17 and v5.16.15, and both kernels are
> broken for me. I'm pretty confident these commits make the difference
> since I've built both v5.17 and v5.16.15 without them, and it fixed
> the issue.
Can you double-check (or just explicitly confirm if you already did
that test) that you need to revert *both* of those commits, and it's
the later "rework" fix that triggers it?
> So, I do understand this might be an issue with regard to SG I/O
> handling in ath9k, hence the relevant people in Cc.
Yeah, almost certainly an ath9k bug, but a regression is a regression,
so if people can't find the issue in ath9k, we'll have to revert those
commits.
Honestly, I personally think they were a bit draconian to begin with,
and didn't limit their effects sufficiently.
I'm assuming that the ath9k issue is that it gives DMA mapping a big
enough area to handle any possible packet size, and just expects -
quite reasonably - smaller packets to only fill the part they need.
Which that "info leak" patch obviously breaks entirely.
So my expectation is that this is something we'll just revert, but it
would be really good to have the ath9k people double-check.
Linus
On Wed, 23 Mar 2022 20:54:08 +0000
Robin Murphy <[email protected]> wrote:
> On 2022-03-23 19:16, Linus Torvalds wrote:
> > On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy <[email protected]> wrote:
> >>
> >> On 2022-03-23 17:27, Linus Torvalds wrote:
> >>>
> >>> I'm assuming that the ath9k issue is that it gives DMA mapping a big
> >>> enough area to handle any possible packet size, and just expects -
> >>> quite reasonably - smaller packets to only fill the part they need.
> >>>
> >>> Which that "info leak" patch obviously breaks entirely.
> >>
> >> Except that's the exact case which the new patch is addressing
> >
> > Not "addressing". Breaking.
> >
> > Which is why it will almost certainly get reverted.
> >
> > Not doing DMA to the whole area seems to be quite the sane thing to do
> > for things like network packets, and overwriting the part that didn't
> > get DMA'd with zeroes seems to be exactly the wrong thing here.
> >
> > So the SG_IO - and other random untrusted block command sources - data
> > leak will almost certainly have to be addressed differently. Possibly
> > by simply allocating the area with GFP_ZERO to begin with.
>
> Er, the point of the block layer case is that the whole area *is* zeroed to
> begin with, and a latent memory corruption problem in SWIOTLB itself
> replaces those zeros with random other kernel data unexpectedly. Let me
> try illustrating some sequences for clarity...
>
> Expected behaviour/without SWIOTLB:
> Memory
> ---------------------------------------------------
> start 12345678
> dma_map(DMA_FROM_DEVICE) no-op
> device writes partial data 12ABC678 <- ABC
> dma_unmap(DMA_FROM_DEVICE) 12ABC678
>
>
> SWIOTLB previously:
> Memory Bounce buffer
> ---------------------------------------------------
> start 12345678 xxxxxxxx
> dma_map(DMA_FROM_DEVICE) no-op
> device writes partial data 12345678 xxABCxxx <- ABC
> dma_unmap(DMA_FROM_DEVICE) xxABCxxx <- xxABCxxx
>
>
> SWIOTLB Now:
> Memory Bounce buffer
> ---------------------------------------------------
> start 12345678 xxxxxxxx
> dma_map(DMA_FROM_DEVICE) 12345678 -> 12345678
> device writes partial data 12345678 12ABC678 <- ABC
> dma_unmap(DMA_FROM_DEVICE) 12ABC678 <- 12ABC678
>
>
> Now, sure we can prevent any actual information leakage by initialising
> the bounce buffer slot with zeros, but then we're just corrupting the
> not-written-to parts of the mapping with zeros instead of anyone else's
> old data. That's still fundamentally not OK. The only thing SWIOTLB can
> do to be correct is treat DMA_FROM_DEVICE as a read-modify-write of the
> entire mapping, because it has no way to know how much of it is actually
> going to be modified.
>
Very nice explanation! Thanks!
> I'll admit I still never quite grasped the reason for also adding the
> override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I
> think by that point we were increasingly tired and confused and starting
> to second-guess ourselves (well, I was, at least).
I raised the question of whether we need to do the same for
swiotlb_sync_single_for_device(), based on my understanding of the
DMA API documentation. I had the following scenario in mind:
SWIOTLB without the sync_single:
Memory Bounce buffer Owner
--------------------------------------------------------------------------
start 12345678 xxxxxxxx C
dma_map(DMA_FROM_DEVICE) 12345678 -> 12345678 C->D
device writes partial data 12345678 12ABC678 <- ABC D
sync_for_cpu(DMA_FROM_DEVICE) 12ABC678 <- 12ABC678 D->C
cpu modifies buffer 66666666 12ABC678 C
sync_for_device(DMA_FROM_DEVICE) 66666666 12ABC678 C->D
device writes partial data 66666666 1EFGC678 <-EFG D
dma_unmap(DMA_FROM_DEVICE) 1EFGC678 <- 1EFGC678 D->C
Legend: in Owner column C stands for cpu and D for device.
Without swiotlb, I believe we should have arrived at 6EFG6666. To get the
same result, IMHO, we need to do a sync in sync_for_device().
And aa6f8dcbab47 is an imperfect solution to that (because of size).
> I don't think it's
> wrong per se, but as I said I do think it can bite anyone who's been
> doing dma_sync_*() wrong but getting away with it until now.
I fully agree.
> If
> ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a
> partial revert of just that one hunk.
>
I'm not against being pragmatic and doing the partial revert. But as
explained above, I do believe for correctness of swiotlb we ultimately
do need that change. So if the revert is the short term solution,
what should be our mid-term road-map?
Regards,
Halil
> Thanks,
> Robin.
On 2022-03-24 16:31, Christoph Hellwig wrote:
> On Thu, Mar 24, 2022 at 05:29:12PM +0100, Maxime Bizon wrote:
>>> I'm looking into this; but in the interest of a speedy resolution of
>>> the regression I would be in favour of merging that partial revert
>>> and reinstating it if/when we identify (and fix) any bugs in ath9k :)
>>
>> This looks fishy:
>>
>> ath9k/recv.c
>>
>> /* We will now give hardware our shiny new allocated skb */
>> new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
>> common->rx_bufsize, dma_type);
>> if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
>> dev_kfree_skb_any(requeue_skb);
>> goto requeue_drop_frag;
>> }
>>
>> /* Unmap the frame */
>> dma_unmap_single(sc->dev, bf->bf_buf_addr,
>> common->rx_bufsize, dma_type);
>>
>> bf->bf_mpdu = requeue_skb;
>> bf->bf_buf_addr = new_buf_addr;
>
> Creating a new mapping for the same buffer before unmapping the
> previous one does look rather bogus. But it does not fit the
> pattern where reverting the sync_single changes makes the driver
> work again.
OK, you made me look :)
Now that it's obvious what to look for, I can only conclude that during
the stanza in ath_edma_get_buffers(), the device is still writing to the
buffer while ownership has been transferred to the CPU, and whatever got
written while ath9k_hw_process_rxdesc_edma() was running then gets wiped
out by the subsequent sync_for_device, which currently resets the
SWIOTLB slot to the state that sync_for_cpu copied out. By the letter of
the DMA API that's not allowed, but on the other hand I'm not sure if we
even have a good idiom for "I can't tell if the device has finished with
this buffer or not unless I look at it" :/
Robin.
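In code terms, the sequence Robin describes is roughly this (a trimmed
paraphrase of the ath_edma_get_buffers() flow quoted above, not the
verbatim driver source):
```
/* Paraphrased from ath_edma_get_buffers() - illustration only. */

/* CPU takes ownership to peek at the descriptor status... */
dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
			common->rx_bufsize, DMA_FROM_DEVICE);

/* ...but the device may still be writing this buffer right now. */
ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
if (ret == -EINPROGRESS) {
	/*
	 * Ownership goes back to the device. With aa6f8dcbab47 applied,
	 * SWIOTLB re-copies the (possibly stale) CPU view into the bounce
	 * slot here, wiping out whatever the device wrote while
	 * ath9k_hw_process_rxdesc_edma() was running.
	 */
	dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
				   common->rx_bufsize, DMA_FROM_DEVICE);
	return false;
}
```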
On Thu, 24 Mar 2022 16:52:31 +0000
Robin Murphy <[email protected]> wrote:
> > Creating a new mapping for the same buffer before unmapping the
> > previous one does look rather bogus. But it does not fit the
> > pattern where reverting the sync_single changes makes the driver
> > work again.
>
> OK, you made me look :)
>
> Now that it's obvious what to look for, I can only conclude that during
> the stanza in ath_edma_get_buffers(), the device is still writing to the
> buffer while ownership has been transferred to the CPU, and whatever got
> written while ath9k_hw_process_rxdesc_edma() was running then gets wiped
> out by the subsequent sync_for_device, which currently resets the
> SWIOTLB slot to the state that sync_for_cpu copied out. By the letter of
> the DMA API that's not allowed, but on the other hand I'm not sure if we
> even have a good idiom for "I can't tell if the device has finished with
> this buffer or not unless I look at it" :/
I agree with your analysis. Especially with the latter part (where you
state that we don't have a good idiom for that use case).
I believe, a stronger statement is also true: it is fundamentally
impossible to accommodate use cases where the device and the cpu need
concurrent access to a dma buffer, if the dma buffer isn't in dma
coherent memory.
If the dma buffer is in dma coherent memory, and we don't need swiotlb,
then we don't need the dma_sync functionality.
Specifically for swiotlb, if the swiotlb buffer is in dma coherent
memory, the driver could peek the swiotlb buffer, but I have no idea if
we can provide a remotely sane API for that. The point is the driver
would have to peek not via a pointer to the original buffer, but via a
suitable pointer to the bounce buffer, which would probably be
considered valid as long as the mapping is valid. Honestly, IMHO quite
ugly, but I see no other way.
Regards,
Halil
Robin Murphy <[email protected]> writes:
> On 2022-03-24 10:25, Oleksandr Natalenko wrote:
>> Hello.
>>
>> On Thursday 24 March 2022 6:57:32 CET Christoph Hellwig wrote:
>>> On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:
>>>> I'll admit I still never quite grasped the reason for also adding the
>>>> override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think
>>>> by that point we were increasingly tired and confused and starting to
>>>> second-guess ourselves (well, I was, at least). I don't think it's wrong
>>>> per se, but as I said I do think it can bite anyone who's been doing
>>>> dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3
>>>> alone turns out to work OK then I'd be inclined to try a partial revert of
>>>> just that one hunk.
>>>
>>> Agreed. Let's try that first.
>>>
>>> Oleksandr, can you try the patch below:
>>>
>>>
>>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>>> index 6db1c475ec827..6c350555e5a1c 100644
>>> --- a/kernel/dma/swiotlb.c
>>> +++ b/kernel/dma/swiotlb.c
>>> @@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
>>> void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
>>> size_t size, enum dma_data_direction dir)
>>> {
>>> - /*
>>> - * Unconditional bounce is necessary to avoid corruption on
>>> - * sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
>>> - * the whole lengt of the bounce buffer.
>>> - */
>>> - swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
>>> - BUG_ON(!valid_dma_direction(dir));
>>> + if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> + swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
>>> + else
>>> + BUG_ON(dir != DMA_FROM_DEVICE);
>>> }
>>>
>>> void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
>>>
>>
>> With this patch the AP works for me.
>
> Cool, thanks for confirming. So I think ath9k probably is doing
> something dodgy with dma_sync_*(), but if Linus prefers to make the
> above change rather than wait for that to get figured out, I believe
> that should be fine.
I'm looking into this; but in the interest of a speedy resolution of the
regression I would be in favour of merging that partial revert and
reinstating it if/when we identify (and fix) any bugs in ath9k :)
-Toke
Linus Torvalds <[email protected]> writes:
> On Thu, Mar 24, 2022 at 10:07 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>>
>> Right, but is that sync_for_device call really needed?
>
> Well, imagine that you have a non-cache-coherent DMA (not bounce
> buffers - just bad hardware)...
>
> So the driver first does that dma_sync_single_for_cpu() for the CPU to
> see the current state (for the non-cache-coherent case it would just
> invalidate caches).
>
> The driver then examines the command buffer state, sees that it's
> still in progress, and does that "return -EINPROGRESS".
>
> It's actually very natural in that situation to flush the caches from
> the CPU side again. And so dma_sync_single_for_device() is a fairly
> reasonable thing to do in that situation.
>
> But it doesn't seem *required*, no. The CPU caches only have a copy of
> the data in them, no writeback needed (and writeback wouldn't work
> since DMA from the device may be in progress).
>
> So I don't think the dma_sync_single_for_device() is *wrong* per se,
> because the CPU didn't actually do any modifications.
>
> But yes, I think it's unnecessary - because any later CPU accesses
> would need that dma_sync_single_for_cpu() anyway, which should
> invalidate any stale caches.
OK, the above was basically how I understood it. Thank you for
confirming!
> And it clearly doesn't work in a bounce-buffer situation, but honestly
> I don't think a "CPU modified buffers concurrently with DMA" can
> *ever* work in that situation, so I think it's wrong for a bounce
> buffer model to ever do anything in the dma_sync_single_for_device()
> situation.
Right.
> Does removing that dma_sync_single_for_device() actually fix the
> problem for the ath driver?
I am hoping Oleksandr can help answer that since my own ath9k hardware
is currently on strike :(
-Toke
On 2022-03-24 10:25, Oleksandr Natalenko wrote:
> Hello.
>
> On Thursday 24 March 2022 6:57:32 CET Christoph Hellwig wrote:
>> On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:
>>> I'll admit I still never quite grasped the reason for also adding the
>>> override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think
>>> by that point we were increasingly tired and confused and starting to
>>> second-guess ourselves (well, I was, at least). I don't think it's wrong
>>> per se, but as I said I do think it can bite anyone who's been doing
>>> dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3
>>> alone turns out to work OK then I'd be inclined to try a partial revert of
>>> just that one hunk.
>>
>> Agreed. Let's try that first.
>>
>> Oleksandr, can you try the patch below:
>>
>>
>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>> index 6db1c475ec827..6c350555e5a1c 100644
>> --- a/kernel/dma/swiotlb.c
>> +++ b/kernel/dma/swiotlb.c
>> @@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
>> void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
>> size_t size, enum dma_data_direction dir)
>> {
>> - /*
>> - * Unconditional bounce is necessary to avoid corruption on
>> - * sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
>> - * the whole lengt of the bounce buffer.
>> - */
>> - swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
>> - BUG_ON(!valid_dma_direction(dir));
>> + if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
>> + swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
>> + else
>> + BUG_ON(dir != DMA_FROM_DEVICE);
>> }
>>
>> void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
>>
>
> With this patch the AP works for me.
Cool, thanks for confirming. So I think ath9k probably is doing
something dodgy with dma_sync_*(), but if Linus prefers to make the
above change rather than wait for that to get figured out, I believe
that should be fine.
The crucial part of the "rework" patch is that we'll unconditionally
initialise the SWIOTLB bounce slot as it's allocated in
swiotlb_tbl_map_single(), regardless of DMA_ATTR_SKIP_CPU_SYNC. As long
as that happens, we're safe in terms of leaking data from previous
mappings, and any possibility for incorrect sync usage to lose
newly-written DMA data is at least no worse than it always has been. Most
of the confusion was around how the proposed DMA_ATTR_OVERWRITE attribute
would need to interact with DMA_ATTR_SKIP_CPU_SYNC to remain safe but
still have any useful advantage, so unless and until anyone wants to
revisit that, this should remain comparatively simple to reason about.
Cheers,
Robin.
Hello.
On Wednesday 23 March 2022 18:27:21 CET Linus Torvalds wrote:
> On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko
> <[email protected]> wrote:
> > These commits appeared in v5.17 and v5.16.15, and both kernels are
> > broken for me. I'm pretty confident these commits make the difference
> > since I've built both v5.17 and v5.16.15 without them, and it fixed
> > the issue.
>
> Can you double-check (or just explicitly confirm if you already did
> that test) that you need to revert *both* of those commits, and it's
> the later "rework" fix that triggers it?
I can confirm that if I revert aa6f8dcbab47 only, but leave ddbd89deb7d3 in place, the AP works. So it seems that the later "rework" triggers the issue for me.
Thanks.
--
Oleksandr Natalenko (post-factum)
On Thu, Mar 24, 2022 at 10:07 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Right, but is that sync_for_device call really needed?
Well, imagine that you have a non-cache-coherent DMA (not bounce
buffers - just bad hardware)...
So the driver first does that dma_sync_single_for_cpu() for the CPU to
see the current state (for the non-cache-coherent case it would just
invalidate caches).
The driver then examines the command buffer state, sees that it's
still in progress, and does that "return -EINPROGRESS".
It's actually very natural in that situation to flush the caches from
the CPU side again. And so dma_sync_single_for_device() is a fairly
reasonable thing to do in that situation.
But it doesn't seem *required*, no. The CPU caches only have a copy of
the data in them, no writeback needed (and writeback wouldn't work
since DMA from the device may be in progress).
So I don't think the dma_sync_single_for_device() is *wrong* per se,
because the CPU didn't actually do any modifications.
But yes, I think it's unnecessary - because any later CPU accesses
would need that dma_sync_single_for_cpu() anyway, which should
invalidate any stale caches.
And it clearly doesn't work in a bounce-buffer situation, but honestly
I don't think a "CPU modified buffers concurrently with DMA" can
*ever* work in that situation, so I think it's wrong for a bounce
buffer model to ever do anything in the dma_sync_single_for_device()
situation.
Does removing that dma_sync_single_for_device() actually fix the
problem for the ath driver?
There's a fair number of those dma_sync_single_for_device() things all
over. Could we find mis-uses and warn about them some way? It seems to
be a very natural thing to do in this context, but bounce buffering
does make them very fragile.
Linus
On 25.03.22 08:12, Oleksandr Natalenko wrote:
> On Thursday 24 March 2022 18:07:29 CET Toke Høiland-Jørgensen wrote:
>> Right, but is that sync_for_device call really needed? AFAICT, that
>> ath9k_hw_process_rxdesc_edma() invocation doesn't actually modify any of
>> the data when it returns EINPROGRESS, so could we just skip it? Like
>> the patch below? Or am I misunderstanding the semantics here?
>>
>> diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
>> index 0c0624a3b40d..19244d4c0ada 100644
>> --- a/drivers/net/wireless/ath/ath9k/recv.c
>> +++ b/drivers/net/wireless/ath/ath9k/recv.c
>> @@ -647,12 +647,8 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
>> common->rx_bufsize, DMA_FROM_DEVICE);
>>
>> ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
>> - if (ret == -EINPROGRESS) {
>> - /*let device gain the buffer again*/
>> - dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
>> - common->rx_bufsize, DMA_FROM_DEVICE);
>> + if (ret == -EINPROGRESS)
>> return false;
>> - }
>>
>> __skb_unlink(skb, &rx_edma->rx_fifo);
>> if (ret == -EINVAL) {
>
> With this patch and both ddbd89deb7d3+aa6f8dcbab47 in place the AP works for me.
TWIMC: If anyone needs more testers or something, I noticed two bug
reports in bko about this problem:
https://bugzilla.kernel.org/show_bug.cgi?id=215703
https://bugzilla.kernel.org/show_bug.cgi?id=215698
I'll point both to this discussion and the patch.
Ciao, Thorsten
On 2022-03-25 10:25, Maxime Bizon wrote:
>
> On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
>
>>
>> It's actually very natural in that situation to flush the caches from
>> the CPU side again. And so dma_sync_single_for_device() is a fairly
>> reasonable thing to do in that situation.
>>
>
> In the non-cache-coherent scenario, and assuming dma_map() did an
> initial cache invalidation, you can write this:
>
> rx_buffer_complete_1(buf)
> {
> invalidate_cache(buf, size)
> if (!is_ready(buf))
> return;
> <proceed with receive>
> }
>
> or
>
> rx_buffer_complete_2(buf)
> {
> if (!is_ready(buf)) {
> invalidate_cache(buf, size)
> return;
> }
> <proceed with receive>
> }
>
> The latter is preferred for performance because dma_map() did the
> initial invalidate.
>
> Of course you could write:
>
> rx_buffer_complete_3(buf)
> {
> invalidate_cache(buf, size)
> if (!is_ready(buf)) {
> invalidate_cache(buf, size)
> return;
> }
>
> <proceed with receive>
> }
>
>
> but it's a waste of CPU cycles
>
> So I'd be very cautious assuming sync_for_cpu() and sync_for_device()
> are both doing invalidation in existing implementation of arch DMA ops,
> implementers may have taken some liberty around DMA-API to avoid
> unnecessary cache operation (not to blame them).
Right, if you have speculatively-prefetching caches, you have to
invalidate DMA_FROM_DEVICE in unmap/sync_for_cpu, since a cache may have
pulled in a snapshot of partly-written data at any point beforehand. But
if you don't, then you can simply invalidate up-front in
map/sync_for_device to tie in with the other directions, and trust that
it stays that way for the duration.
What muddies the waters a bit is that the opposite combination
sync_for_cpu(DMA_TO_DEVICE) really *should* always be a no-op, and I for
one have already made the case for eliding that in code elsewhere, but
it doesn't necessarily hold for the inverse here, hence why I'm not sure
there even is a robust common solution for peeking at a live
DMA_FROM_DEVICE buffer.
Robin.
> For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
>
> sync_single_for_device()
> => __dma_page_cpu_to_dev()
> => dma_cache_maint_page(op=dmac_map_area)
> => cpu_cache.dma_map_area()
>
> sync_single_for_cpu()
> => __dma_page_dev_to_cpu()
> => dma_cache_maint_page(op=dmac_unmap_area)
> => cpu_cache.dma_unmap_area()
>
> dma_map_area() always does cache invalidate.
>
> But for a couple of CPU variant, dma_unmap_area() is a noop, so
> sync_for_cpu() does nothing.
>
> Toke's patch will break ath9k on those platforms (mostly silent
> breakage, rx corruption leading to bad performance)
>
>
>> There's a fair number of those dma_sync_single_for_device() things
>> all over. Could we find mis-uses and warn about them some way? It
>> seems to be a very natural thing to do in this context, but bounce
>> buffering does make them very fragile.
>
> At least in network drivers, there are at least two patterns:
>
> 1) The issue at hand, hardware mixing rx_status and data inside the
> same area. Usually very old hardware, very quick grep in network
> drivers only revealed slicoss.c. Probably would have gone unnoticed if
> ath9k hardware wasn't so common.
>
>
> 2) The very common "copy break" pattern. If a received packet is
> smaller than a certain threshold, the driver rx path is changed to do:
>
> sync_for_cpu()
> alloc_small_skb()
> memcpy(small_skb, rx_buffer_data)
> sync_for_device()
>
> Original skb is left in the hardware, this reduces memory wasted.
>
> This pattern is completely valid wrt DMA-API, the buffer is always
> either owned by CPU or device.
>
>
On 2022-03-25 15:25, Halil Pasic wrote:
> On Thu, 24 Mar 2022 19:02:16 +0100
> Halil Pasic <[email protected]> wrote:
>
>>> I'll admit I still never quite grasped the reason for also adding the
>>> override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I
>>> think by that point we were increasingly tired and confused and starting
>>> to second-guess ourselves (well, I was, at least).
>>
>> I raised the question of whether we need to do the same for
>> swiotlb_sync_single_for_device(), based on my understanding of the
>> DMA API documentation. I had the following scenario in mind:
>>
>> SWIOTLB without the sync_single:
>> Memory Bounce buffer Owner
>> --------------------------------------------------------------------------
>> start 12345678 xxxxxxxx C
>> dma_map(DMA_FROM_DEVICE) 12345678 -> 12345678 C->D
>> device writes partial data 12345678 12ABC678 <- ABC D
>> sync_for_cpu(DMA_FROM_DEVICE) 12ABC678 <- 12ABC678 D->C
>> cpu modifies buffer 66666666 12ABC678 C
>> sync_for_device(DMA_FROM_DEVICE) 66666666 12ABC678 C->D
>> device writes partial data 66666666 1EFGC678 <-EFG D
>> dma_unmap(DMA_FROM_DEVICE) 1EFGC678 <- 1EFGC678 D->C
>>
>> Legend: in Owner column C stands for cpu and D for device.
>>
>> Without swiotlb, I believe we should have arrived at 6EFG6666. To get the
>> same result, IMHO, we need to do a sync in sync_for_device().
>> And aa6f8dcbab47 is an imperfect solution to that (because of size).
>>
>
> @Robin, Christoph: Do we consider this a valid scenario?
Aha, I see it now (turns out diagrams really do help!) - so essentially
the original situation but with buffer recycling thrown into the mix as
well... I think it's technically valid, but do we know if anything's
actually doing that in a way which ends up affected? For sure it would
be nice to know that we had all bases covered without having to audit
whether we need to, but if it's fundamentally incompatible with what
other code expects, that we know *is* being widely used, and however
questionable it may be we don't have an easy fix for, then we're in a
bit of a tough spot :(
Thanks,
Robin.
On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
>
> It's actually very natural in that situation to flush the caches from
> the CPU side again. And so dma_sync_single_for_device() is a fairly
> reasonable thing to do in that situation.
>
In the non-cache-coherent scenario, and assuming dma_map() did an
initial cache invalidation, you can write this:
rx_buffer_complete_1(buf)
{
invalidate_cache(buf, size)
if (!is_ready(buf))
return;
<proceed with receive>
}
or
rx_buffer_complete_2(buf)
{
if (!is_ready(buf)) {
invalidate_cache(buf, size)
return;
}
<proceed with receive>
}
The latter is preferred for performance because dma_map() did the
initial invalidate.
Of course you could write:
rx_buffer_complete_3(buf)
{
invalidate_cache(buf, size)
if (!is_ready(buf)) {
invalidate_cache(buf, size)
return;
}
<proceed with receive>
}
but it's a waste of CPU cycles
So I'd be very cautious assuming sync_for_cpu() and sync_for_device()
are both doing invalidation in existing implementation of arch DMA ops,
implementers may have taken some liberty around DMA-API to avoid
unnecessary cache operation (not to blame them).
For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
sync_single_for_device()
=> __dma_page_cpu_to_dev()
=> dma_cache_maint_page(op=dmac_map_area)
=> cpu_cache.dma_map_area()
sync_single_for_cpu()
=> __dma_page_dev_to_cpu()
=> dma_cache_maint_page(op=dmac_unmap_area)
=> cpu_cache.dma_unmap_area()
dma_map_area() always does cache invalidate.
But for a couple of CPU variant, dma_unmap_area() is a noop, so
sync_for_cpu() does nothing.
Toke's patch will break ath9k on those platforms (mostly silent
breakage, rx corruption leading to bad performance)
> There's a fair number of those dma_sync_single_for_device() things
> all over. Could we find mis-uses and warn about them some way? It
> seems to be a very natural thing to do in this context, but bounce
> buffering does make them very fragile.
At least in network drivers, there are at least two patterns:
1) The issue at hand, hardware mixing rx_status and data inside the
same area. Usually very old hardware, very quick grep in network
drivers only revealed slicoss.c. Probably would have gone unnoticed if
ath9k hardware wasn't so common.
2) The very common "copy break" pattern. If a received packet is
smaller than a certain threshold, the driver rx path is changed to do:
sync_for_cpu()
alloc_small_skb()
memcpy(small_skb, rx_buffer_data)
sync_for_device()
Original skb is left in the hardware, this reduces memory wasted.
This pattern is completely valid wrt DMA-API, the buffer is always
either owned by CPU or device.
--
Maxime
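A sketch of pattern 2 in code, under the assumption stated above that the
CPU only ever reads the buffer (hypothetical driver snippet - the helper
name and parameters are invented for illustration):
```
/* Hypothetical "copy break" rx path: small packet, recycle the buffer. */
static struct sk_buff *rx_copy_break(struct net_device *ndev,
				     struct device *dev, dma_addr_t buf_dma,
				     void *buf_cpu, unsigned int pkt_len)
{
	struct sk_buff *skb;

	/* Transfer ownership to the CPU to read the freshly DMA'd data. */
	dma_sync_single_for_cpu(dev, buf_dma, pkt_len, DMA_FROM_DEVICE);

	/* Copy the small packet out into a right-sized skb. */
	skb = netdev_alloc_skb(ndev, pkt_len);
	if (skb)
		skb_put_data(skb, buf_cpu, pkt_len);

	/* The CPU never wrote the buffer; hand it straight back. */
	dma_sync_single_for_device(dev, buf_dma, pkt_len, DMA_FROM_DEVICE);
	return skb;
}
```
Ownership alternates cleanly here - the buffer is always owned by either
the CPU or the device, never both at once - which is why this pattern is
valid wrt the DMA API while the ath9k peek is not.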
On Thu, Mar 24, 2022 at 07:31:58PM +0100, Halil Pasic wrote:
> I agree with your analysis. Especially with the latter part (where you
> state that we don't have a good idiom for that use case).
>
> I believe, a stronger statement is also true: it is fundamentally
> impossible to accommodate use cases where the device and the cpu need
> concurrent access to a dma buffer, if the dma buffer isn't in dma
> coherent memory.
Yes, and that is also clearly stated in the DMA API document. We
only have two platforms that do not support DMA coherent memory:
one being the oldest PARISC platforms, and the other coldfire.
The first has drivers carefully written to actually support that,
the second only has a single driver using DMA that does manual
global cache flushes (while pretending it supports coherent memory).
> If the dma buffer is in dma coherent memory, and we don't need swiotlb,
> then we don't need the dma_sync functionality.
Yes.
On Thu, 24 Mar 2022 19:02:16 +0100
Halil Pasic <[email protected]> wrote:
> > I'll admit I still never quite grasped the reason for also adding the
> > override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I
> > think by that point we were increasingly tired and confused and starting
> > to second-guess ourselves (well, I was, at least).
>
> I raised the question of whether we need to do the same for
> swiotlb_sync_single_for_device(), based on my understanding of the
> DMA API documentation. I had the following scenario in mind:
>
> SWIOTLB without the sync_single:
> Memory Bounce buffer Owner
> --------------------------------------------------------------------------
> start 12345678 xxxxxxxx C
> dma_map(DMA_FROM_DEVICE) 12345678 -> 12345678 C->D
> device writes partial data 12345678 12ABC678 <- ABC D
> sync_for_cpu(DMA_FROM_DEVICE) 12ABC678 <- 12ABC678 D->C
> cpu modifies buffer 66666666 12ABC678 C
> sync_for_device(DMA_FROM_DEVICE) 66666666 12ABC678 C->D
> device writes partial data 66666666 1EFGC678 <-EFG D
> dma_unmap(DMA_FROM_DEVICE) 1EFGC678 <- 1EFGC678 D->C
>
> Legend: in Owner column C stands for cpu and D for device.
>
> Without swiotlb, I believe we should have arrived at 6EFG6666. To get the
> same result, IMHO, we need to do a sync in sync_for_device().
> And aa6f8dcbab47 is an imperfect solution to that (because of size).
>
@Robin, Christoph: Do we consider this a valid scenario?
Regards,
Halil
On 2022-03-25 16:25, Toke Høiland-Jørgensen wrote:
> Maxime Bizon <[email protected]> writes:
>
>> On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
>>
>>>
>>> It's actually very natural in that situation to flush the caches from
>>> the CPU side again. And so dma_sync_single_for_device() is a fairly
>>> reasonable thing to do in that situation.
>>>
>>
>> In the non-cache-coherent scenario, and assuming dma_map() did an
>> initial cache invalidation, you can write this:
>>
>> rx_buffer_complete_1(buf)
>> {
>> invalidate_cache(buf, size)
>> if (!is_ready(buf))
>> return;
>> <proceed with receive>
>> }
>>
>> or
>>
>> rx_buffer_complete_2(buf)
>> {
>> if (!is_ready(buf)) {
>> invalidate_cache(buf, size)
>> return;
>> }
>> <proceed with receive>
>> }
>>
>> The latter is preferred for performance because dma_map() did the
>> initial invalidate.
>>
>> Of course you could write:
>>
>> rx_buffer_complete_3(buf)
>> {
>> invalidate_cache(buf, size)
>> if (!is_ready(buf)) {
>> invalidate_cache(buf, size)
>> return;
>> }
>>
>> <proceed with receive>
>> }
>>
>>
>> but it's a waste of CPU cycles
>>
>> So I'd be very cautious assuming sync_for_cpu() and sync_for_device()
>> are both doing invalidation in existing implementation of arch DMA ops,
>> implementers may have taken some liberty around DMA-API to avoid
>> unnecessary cache operation (not to blame them).
>
> I sense an implicit "and the driver can't (or shouldn't) influence
> this" here, right?
Right, drivers don't get a choice of how a given DMA API implementation
works.
>> For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
>>
>> sync_single_for_device()
>> => __dma_page_cpu_to_dev()
>> => dma_cache_maint_page(op=dmac_map_area)
>> => cpu_cache.dma_map_area()
>>
>> sync_single_for_cpu()
>> => __dma_page_dev_to_cpu()
>> => dma_cache_maint_page(op=dmac_unmap_area)
>> => cpu_cache.dma_unmap_area()
>>
>> dma_map_area() always does cache invalidate.
>>
>> But for a couple of CPU variant, dma_unmap_area() is a noop, so
>> sync_for_cpu() does nothing.
>>
>> Toke's patch will break ath9k on those platforms (mostly silent
>> breakage, rx corruption leading to bad performance)
>
> Okay, so that would be bad obviously. So if I'm reading you correctly
> (cf my question above), we can't fix this properly from the driver side,
> and we should go with the partial SWIOTLB revert instead?
Do you have any other way of telling if DMA is idle, or temporarily
pausing it before the sync_for_cpu, such that you could honour the
notion of ownership transfer properly? As mentioned elsewhere I suspect
the only "real" fix if you really do need to allow concurrent access is
to use the coherent DMA API for buffers rather than streaming mappings,
but that's obviously some far more significant surgery.
Robin.
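For comparison, the coherent-API route Robin mentions would look roughly
like this (a sketch only, borrowing the sc->dev and rx_bufsize names from
the ath9k code quoted earlier; buffer management details omitted):
```
void *cpu_addr;
dma_addr_t dma_handle;

/* Allocate the rx buffer from coherent memory once, up front. */
cpu_addr = dma_alloc_coherent(sc->dev, common->rx_bufsize,
			      &dma_handle, GFP_KERNEL);

/*
 * CPU and device now share one coherent view: the driver may peek at
 * the descriptor status at any time, with no dma_sync_*() calls at all.
 * The cost is copying packet data out of this buffer on every receive,
 * since it cannot be handed up the stack and recycled.
 */
```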
On Thu, Mar 24, 2022 at 07:02:16PM +0100, Halil Pasic wrote:
> > If
> > ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a
> > partial revert of just that one hunk.
> >
>
> I'm not against being pragmatic and doing the partial revert. But as
> explained above, I do believe for correctness of swiotlb we ultimately
> do need that change. So if the revert is the short term solution,
> what should be our mid-term road-map?
Unless I'm misunderstanding this thread we found the bug in ath9k
and have a fix for that now?
On Fri, Mar 25, 2022 at 3:25 AM Maxime Bizon <[email protected]> wrote:
>
> In the non-cache-coherent scenario, and assuming dma_map() did an
> initial cache invalidation, you can write this:
.. but the problem is that the dma mapping code is supposed to just
work, and the driver isn't supposed to know or care whether dma is
coherent or not, or using bounce buffers or not.
And currently it doesn't work.
Because what that ath9k driver does is "natural", but it's wrong for
the bounce buffer case.
And I think the problem is squarely on the dma-mapping side for two reasons:
(a) this used to work, now it doesn't, and it's unclear how many
other drivers are affected
(b) the dma-mapping naming and calling conventions are horrible and
actively misleading
That (a) is a big deal. The reason the ath9k issue was found quickly
is very likely *NOT* because ath9k is the only thing affected. No,
it's because ath9k is relatively common.
Just grep for dma_sync_single_for_device() and ask yourself: how many
of those other drivers have you ever even HEARD of, much less be able
to test?
And that's just one "dma_sync" function. Admittedly it's likely one of
the more common ones, but still..
Now, (b) is why I think driver writers get this so wrong - or, in this
case, possibly the dma-mapping code itself.
The naming - and even the documentation(!!!) - implies that what ath9k
does IS THE RIGHT THING TO DO.
The documentation clearly states:
"Before giving the memory to the device, dma_sync_single_for_device() needs
to be called, and before reading memory written by the device,
dma_sync_single_for_cpu(), just like for streaming DMA mappings that are
reused"
and ath9k obviously did exactly that, even with a comment to the effect.
And I think ath9k is actually right here, but the documentation is so
odd and weak that it's the dma-mapping code that was buggy.
So the dma mapping layer literally broke the documented behavior, and
then Christoph goes and says (in another email in this discussion):
"Unless I'm misunderstanding this thread we found the bug in ath9k
and have a fix for that now?"
which I think is a gross mis-characterization of the whole issue, and
ignores *BOTH* of (a) and (b).
So what's the way forward here?
I personally think we need to
- revert commit aa6f8dcbab47 for the simple reason that it is known
to break one driver. But it is unknown how many other drivers are
affected.
Even if you think aa6f8dcbab47 was the right thing to do (and I
don't - see later), the fact is that it's new behavior that the dma
bounce buffer code hasn't done in the past, and clearly confuses
things.
- think very carefully about the ath9k case.
We have a patch that fixes it for the bounce buffer case, but you
seem to imply that it might actually break non-coherent cases:
"So I'd be very cautious assuming sync_for_cpu() and sync_for_device()
are both doing invalidation in existing implementation of arch DMA ops,
implementers may have taken some liberty around DMA-API to avoid
unnecessary cache operation (not to blame them)"
so who knows what other dma situations it might break?
Because if some non-coherent mapping infrastructure assumes that
*only* sync_for_device() will actually flush-and-invalidate caches
(because the platform thinks that once they are flushed, getting them
back to the CPU doesn't need any special ops), then you're right:
Toke's ath9k patch will just result in cache coherency issues on those
platforms instead.
- think even *more* about what the ath9k situation means for the dma
mapping naming and documentation.
I basically think the DMA syncing has at least three cases (and a
fourth combination that makes no sense):
(1) The CPU has actively written to memory, and wants to give that
data to the device.
This is "dma_sync_single_for_device(DMA_TO_DEVICE)".
A cache-coherent thing needs to do nothing.
A non-coherent thing needs to do a cache "writeback" (and probably
will flush)
A bounce buffer implementation needs to copy *to* the bounce buffer
(2) The CPU now wants to see any state written by the device since
the last sync
This is "dma_sync_single_for_cpu(DMA_FROM_DEVICE)".
A bounce-buffer implementation needs to copy *from* the bounce buffer.
A cache-coherent implementation needs to do nothing.
A non-coherent implementation maybe needs to do nothing (ie it
assumes that previous ops have flushed the cache, and just accessing
the data will bring the right thing back into it). Or it could just
flush the cache.
(3) The CPU has seen the state, but wants to leave it to the device
This is "dma_sync_single_for_device(DMA_FROM_DEVICE)".
A bounce buffer implementation needs to NOT DO ANYTHING (this is
the current ath9k bug - copying to the bounce buffer is wrong)
A cache coherent implementation needs to do nothing
A non-coherent implementation needs to flush the cache again, but
not necessarily do a writeback-flush if there is some cheaper form
(assuming it does nothing in the "CPU now wants to see any state" case
because it depends on the data not having been in the caches)
(4) There is a fourth case: dma_sync_single_for_cpu(DMA_TO_DEVICE)
which maybe should generate a warning because it seems to make no
sense? I can't think of a case where this would be an issue - the data
is specifically for the device, but it's synced "for the CPU"?
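Collapsed into code, a bounce-buffer implementation under this reading
would behave something like the sketch below (proposed semantics only,
not the current swiotlb code; DMA_BIDIRECTIONAL omitted for brevity):
```
/* Sketch: bounce-slot action per (sync target, DMA direction) pair. */
static void bounce_sync(void *orig, void *slot, size_t size,
			bool for_device, enum dma_data_direction dir)
{
	if (for_device && dir == DMA_TO_DEVICE)
		memcpy(slot, orig, size);  /* case 1: CPU wrote, push it  */
	else if (!for_device && dir == DMA_FROM_DEVICE)
		memcpy(orig, slot, size);  /* case 2: pull device's data  */
	else if (for_device && dir == DMA_FROM_DEVICE)
		;                          /* case 3: do NOTHING (ath9k)  */
	else
		WARN_ON_ONCE(1);           /* case 4: seems nonsensical   */
}
```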
Do people agree? Or am I missing something?
But I don't think the documentation lays out these cases, and I don't
think the naming is great.
I also don't think that we can *change* the naming. That's water under
the bridge. It is what it is. So I think people need to really agree
on the semantics (did I get them entirely wrong above?) and try to
think about ways to maybe give warnings for things that make no sense.
Based on my suggested understanding of what the DMA layer should do,
the ath9k code is actually doing exactly the right thing. It is doing
dma_sync_single_for_device(DMA_FROM_DEVICE);
and based on my four cases above, the bounce buffer code must do
nothing, because "for_device()" together with "FROM_DEVICE" clearly
says that all the data is coming *from* the device, and copying any
bounce buffers is wrong.
In other words, I think commit aa6f8dcbab47 ("swiotlb: rework 'fix
info leak with DMA_FROM_DEVICE'") is fundamentally wrong. It doesn't
just break ath9k, it fundamentally break that "case 3" above. It's
doing a DMA_TO_DEVICE copy, even though it was a DMA_FROM_DEVICE sync.
So I really think that "revert aa6f8dcbab47" is not only inevitable
because of practical worries about what it breaks, but because that
commit was just entirely and utterly WRONG.
But having written this long wall of text, I'm now slightly worried
that I'm just confused, and am trying to just convince myself.
So please: can people think about this a bit more, and try to shoot
down the above argument and show that I'm just being silly?
And if I'm right, can we please document this and try really hard to
come up with some sanity checks (perhaps based on some "dma buffer
state" debug code?)
Linus
Christoph Hellwig <[email protected]> writes:
> On Thu, Mar 24, 2022 at 07:02:16PM +0100, Halil Pasic wrote:
>> > If
>> > ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a
>> > partial revert of just that one hunk.
>> >
>>
>> I'm not against being pragmatic and doing the partial revert. But as
>> explained above, I do believe for correctness of swiotlb we ultimately
>> do need that change. So if the revert is the short term solution,
>> what should be our mid-term road-map?
>
> Unless I'm misunderstanding this thread we found the bug in ath9k
> and have a fix for that now?
According to Maxime's comment on the other subthread, that ath9k patch
wouldn't work on all platforms (and constitutes a bit of a violation of
the DMA API ownership abstraction). So not quite, I think?
-Toke
Robin Murphy <[email protected]> writes:
> On 2022-03-25 16:25, Toke Høiland-Jørgensen wrote:
>> Maxime Bizon <[email protected]> writes:
>>
>>> On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
>>>
>>>>
>>>> It's actually very natural in that situation to flush the caches from
>>>> the CPU side again. And so dma_sync_single_for_device() is a fairly
>>>> reasonable thing to do in that situation.
>>>>
>>>
>>> In the non-cache-coherent scenario, and assuming dma_map() did an
>>> initial cache invalidation, you can write this:
>>>
>>> rx_buffer_complete_1(buf)
>>> {
>>> invalidate_cache(buf, size)
>>> if (!is_ready(buf))
>>> return;
>>> <proceed with receive>
>>> }
>>>
>>> or
>>>
>>> rx_buffer_complete_2(buf)
>>> {
>>> if (!is_ready(buf)) {
>>> invalidate_cache(buf, size)
>>> return;
>>> }
>>> <proceed with receive>
>>> }
>>>
>>> The latter is preferred for performance because dma_map() did the
>>> initial invalidate.
>>>
>>> Of course you could write:
>>>
>>> rx_buffer_complete_3(buf)
>>> {
>>> invalidate_cache(buf, size)
>>> if (!is_ready(buf)) {
>>> invalidate_cache(buf, size)
>>> return;
>>> }
>>>
>>> <proceed with receive>
>>> }
>>>
>>>
>>> but it's a waste of CPU cycles
>>>
>>> So I'd be very cautious assuming sync_for_cpu() and sync_for_device()
>>> are both doing invalidation in existing implementation of arch DMA ops,
>>> implementers may have taken some liberty around DMA-API to avoid
>>> unnecessary cache operation (not to blame them).
>>
>> I sense an implicit "and the driver can't (or shouldn't) influence
>> this" here, right?
>
> Right, drivers don't get a choice of how a given DMA API implementation
> works.
>
>>> For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
>>>
>>> sync_single_for_device()
>>> => __dma_page_cpu_to_dev()
>>> => dma_cache_maint_page(op=dmac_map_area)
>>> => cpu_cache.dma_map_area()
>>>
>>> sync_single_for_cpu()
>>> => __dma_page_dev_to_cpu()
>>> => dma_cache_maint_page(op=dmac_unmap_area)
>>> => cpu_cache.dma_unmap_area()
>>>
>>> dma_map_area() always does cache invalidate.
>>>
>>> But for a couple of CPU variant, dma_unmap_area() is a noop, so
>>> sync_for_cpu() does nothing.
>>>
>>> Toke's patch will break ath9k on those platforms (mostly silent
>>> breakage, rx corruption leading to bad performance)
>>
>> Okay, so that would be bad obviously. So if I'm reading you correctly
>> (cf my question above), we can't fix this properly from the driver side,
>> and we should go with the partial SWIOTLB revert instead?
>
> Do you have any other way of telling if DMA is idle, or temporarily
> pausing it before the sync_for_cpu, such that you could honour the
> notion of ownership transfer properly?
I'll go check with someone who has a better grasp of how the hardware
works, but I don't think so...
> As mentioned elsewhere I suspect the only "real" fix if you really do
> need to allow concurrent access is to use the coherent DMA API for
> buffers rather than streaming mappings, but that's obviously some far
> more significant surgery.
That would imply copying the packet data out of that (persistent)
coherent mapping each time we do a recv operation, though, right? That
would be quite a performance hit...
If all we need is a way to make dma_sync_single_for_cpu() guarantee a
cache invalidation, why can't we just add a separate version that does
that (dma_sync_single_for_cpu_peek() or something)? Using that with the
patch I posted earlier should be enough to resolve the issue, AFAICT?
-Toke
On Fri, Mar 25, 2022 at 12:14 PM Robin Murphy <[email protected]> wrote:
>
> Note "between the DMA transfers", and not "during the DMA transfers".
> The fundamental assumption of the streaming API is that only one thing
> is ever accessing the mapping at any given time, which is what the whole
> notion of ownership is about.
Well, but that ignores reality.
Any documentation that ignores the "CPU will want to see the
intermediate state" is by definition garbage, because that is clearly
a simple fact.
We don't write documentation for fantasy.
Linus
On 2022-03-25 18:30, Linus Torvalds wrote:
> On Fri, Mar 25, 2022 at 3:25 AM Maxime Bizon <[email protected]> wrote:
>>
>> In the non-cache-coherent scenario, and assuming dma_map() did an
>> initial cache invalidation, you can write this:
>
> .. but the problem is that the dma mapping code is supposed to just
> work, and the driver isn't supposed to know or care whether dma is
> coherent or not, or using bounce buffers or not.
>
> And currently it doesn't work.
>
> Because what that ath9k driver does is "natural", but it's wrong for
> the bounce buffer case.
>
> And I think the problem is squarely on the dma-mapping side for two reasons:
>
> (a) this used to work, now it doesn't, and it's unclear how many
> other drivers are affected
>
> (b) the dma-mapping naming and calling conventions are horrible and
> actively misleading
>
> That (a) is a big deal. The reason the ath9k issue was found quickly
> is very likely *NOT* because ath9k is the only thing affected. No,
> it's because ath9k is relatively common.
>
> Just grep for dma_sync_single_for_device() and ask yourself: how many
> of those other drivers have you ever even HEARD of, much less be able
> to test?
>
> And that's just one "dma_sync" function. Admittedly it's likely one of
> the more common ones, but still..
>
> Now, (b) is why I think drivers might get this so wrong - or, in this
> case, possibly the dma-mapping code itself.
>
> The naming - and even the documentation(!!!) - implies that what ath9k
> does IS THE RIGHT THING TO DO.
>
> The documentation clearly states:
>
> "Before giving the memory to the device, dma_sync_single_for_device() needs
> to be called, and before reading memory written by the device,
> dma_sync_single_for_cpu(), just like for streaming DMA mappings that are
> reused"
Except that's documentation for the non-coherent allocation API, rather
than the streaming API in question here. I'll refrain from commenting on
having at least 3 DMA APIs, with the same set of sync functions serving
two of them, and just stand back a safe distance...
Anyway, the appropriate part of that document is probably:
"You must do this:
- Before reading values that have been written by DMA from the device
(use the DMA_FROM_DEVICE direction)"
I'm not saying it constitutes *good* documentation, but I would note how
it says "have been written", and not "are currently being written".
Similarly from the HOWTO:
"If you need to use the same streaming DMA region multiple times and
touch the data in between the DMA transfers, the buffer needs to be
synced properly..."
Note "between the DMA transfers", and not "during the DMA transfers".
The fundamental assumption of the streaming API is that only one thing
is ever accessing the mapping at any given time, which is what the whole
notion of ownership is about.
Thanks,
Robin.
On Fri, Mar 25, 2022 at 12:26 PM Oleksandr Natalenko
<[email protected]> wrote:
>
> On Friday 25 March 2022 19:30:21 CET Linus Torvalds wrote:
> > The reason the ath9k issue was found quickly
> > is very likely *NOT* because ath9k is the only thing affected. No,
> > it's because ath9k is relatively common.
>
> Indeed. But having a wife who complains about a non-working Wi-Fi printer definitely helps in finding the issue too.
Well, maybe we should credit her in the eventual resolution (whatever
it ends up being).
Although probably not using that exact wording.
Linus
On Fri, 2022-03-25 at 13:47 -0700, Linus Torvalds wrote:
> On Fri, Mar 25, 2022 at 1:38 PM Johannes Berg <[email protected]> wrote:
> >
> > > (2) The CPU now wants to see any state written by the device since
> > > the last sync
> > >
> > > This is "dma_sync_single_for_cpu(DMA_FROM_DEVICE)".
> > >
> > > A bounce-buffer implementation needs to copy *from* the bounce buffer.
> > >
> > > A cache-coherent implementation needs to do nothing.
> > >
> > > A non-coherent implementation maybe needs to do nothing (ie it
> > > assumes that previous ops have flushed the cache, and just accessing
> > > the data will bring the right thing back into it). Or it could just
> > > flush the cache.
> >
> > Doesn't that just need to *invalidate* the cache, rather than *flush*
> > it?
>
> Yes. I should have been more careful.
Well I see now that you said 'cache "writeback"' in (1), and 'flush' in
(2), so perhaps you were thinking of the same, and I'm just calling it
"flush" and "invalidate" respectively?
> That said, I think "invalidate without writeback" is a really
> dangerous operation (it can generate some *really* hard to debug
> memory state), so on the whole I think you should always strive to
> just do "flush-and-invalidate".
Hmm. Yeah, can't really disagree with that.
However, this seems to be the wrong spot to flush (writeback?) the
cache, as we're trying to get data from the device, not potentially
overwrite the data that the device wrote because we have a dirty
cacheline. Hmm. Then again, how could we possibly have a dirty
cacheline?
Which starts to clarify in my mind why we have a sort of (implied)
ownership model: if the CPU dirties a cacheline while the device has
ownership then the cache writeback might overwrite the DMA data. So it's
easier to think of it as "CPU has ownership" and "device has ownership",
but evidently that simple model breaks down in real-world cases such as
ath9k where the CPU wants to look, but not write, and the device
continues doing DMA at the same time.
Now in that case the cache wouldn't be considered dirty either since the
CPU was just reading, but who knows? Hence the suggestion of just
invalidate, not flush.
> If the core has support for "invalidate clean cache lines only", then
> that's possibly a good alternative.
Well if you actually did dirty the cacheline, then you have a bug one
way or the other, and it's going to be really hard to debug - either you
lose the CPU write, or you lose the device write, there's no way you're
not losing one of them?
ath9k doesn't write, of course, so hopefully the core wouldn't write
back what it must think of as clean cachelines, even if the device
modified the memory underneath already.
So really while I agree with your semantics, I was previously privately
suggesting to Toke we should probably have something like

        dma_single_cpu_peek()
        // read buffer and check if it was done
        dma_single_cpu_peek_finish()

which really is equivalent to the current

        dma_sync_single_for_cpu(DMA_FROM_DEVICE)
        // ...
        dma_sync_single_for_device(DMA_FROM_DEVICE)

that ath9k does, but makes it clearer that you really can't write to the
buffer... but, water under the bridge, I guess.
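To make that concrete, here is a minimal sketch of what such peek helpers could look like as thin wrappers over the existing sync calls; dma_single_cpu_peek() and dma_single_cpu_peek_finish() are the hypothetical names from the text above, not an existing kernel API:

```
#include <linux/dma-mapping.h>

/* Hypothetical sketch: the same two operations ath9k already does,
 * under names that make the read-only "peek" semantics explicit. */
static inline void dma_single_cpu_peek(struct device *dev,
                                       dma_addr_t addr, size_t size)
{
        /* Make device-written data visible to the CPU for reading. */
        dma_sync_single_for_cpu(dev, addr, size, DMA_FROM_DEVICE);
}

static inline void dma_single_cpu_peek_finish(struct device *dev,
                                              dma_addr_t addr, size_t size)
{
        /* Give the buffer back to the device; the CPU must not have
         * written to it between peek and peek_finish. */
        dma_sync_single_for_device(dev, addr, size, DMA_FROM_DEVICE);
}
```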
Thinking about the ownership model again - it seems that we need to at
least modify that ownership model in the sense that we have *write*
ownership that we pass around, not just "ownership". But if we do that,
then we need to clarify which operations pass write ownership and which
don't.
So the operations
(1) dma_sync_single_for_device(DMA_TO_DEVICE)
(2) dma_sync_single_for_cpu(DMA_FROM_DEVICE)
(3) dma_sync_single_for_device(DMA_FROM_DEVICE)
really only (1) passes write ownership to the device, but then you can't
get it back?
But that cannot be true, because ath9k maps the buffer as
DMA_BIDIRECTIONAL, and then eventually might want to recycle it.
Actually though, perhaps passing write ownership back to the CPU isn't
an operation that the DMA API needs to worry about - if the CPU has read
ownership and the driver knows separately that the device is no longer
accessing it, then basically the CPU already got write ownership, and
passes that back using (1)?
> > Then, however, we need to define what happens if you pass
> > DMA_BIDIRECTIONAL to the sync_for_cpu() and sync_for_device() functions,
> > which adds two more cases? Or maybe we eventually just think that's not
> > valid at all, since you have to specify how you're (currently?) using
> > the buffer, which can't be DMA_BIDIRECTIONAL?
>
> Ugh. Do we actually have cases that do it?
Yes, a few.
> That sounds really odd for
> a "sync" operation. It sounds very reasonable for _allocating_ DMA,
> but for syncing I'm left scratching my head what the semantics would
> be.
I agree.
> But yes, if we do and people come up with semantics for it, those
> semantics should be clearly documented.
I'm not sure? I'm wondering if this isn't just because - like me
initially - people misunderstood the direction argument, or didn't
understand it well enough, and then just passed the same value as for
the map()/unmap()?
You have to pass the size to all of them too, after all ... but I'm
speculating.
> And if we don't - or people can't come up with semantics for it - we
> should actively warn about it and not have some code that does odd
> things that we don't know what they mean.
Makes sense.
> But it sounds like you agree with my analysis, just not with some of
> my bad/incorrect word choices.
Yes.
johannes
I've been thinking of the case where a descriptor ring has
to be in non-coherent memory (eg because that is all there is).
The receive ring processing isn't actually that difficult.
The driver has to fill a cache line full of new buffer
descriptors in memory but without assigning the first
buffer to the hardware.
Then it has to do a cache line write of just that line.
Then it can assign ownership of the first buffer and
finally do a second cache line write.
(The first explicit write can be skipped if the cache
writes are known to be atomic.)
It then must not dirty that cache line.
To check for new frames it must invalidate the cache
line that contains the 'next to be filled' descriptor
and then read that cache line.
This will contain info about one or more receive frames.
But the hardware is still doing updates.
But both these operations can be happening at the same
time on different parts of the buffer.
So you need to know a 'cache line size' for the mapping
and be able to do writebacks and invalidates for parts
of the buffer, not just all of it.
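A rough sketch of that receive-side sequence follows; the descriptor layout, the DESC_HW_OWNED convention, and the cache_wb_range()/cache_inv_range() helpers are all invented for illustration, and a real driver would go through the DMA API or arch cache ops:

```
#include <linux/types.h>

struct rx_desc { u32 flags; u32 buf_addr; };    /* invented layout */
#define DESC_HW_OWNED   0x1

extern void cache_wb_range(void *p, size_t len);        /* writeback */
extern void cache_inv_range(void *p, size_t len);       /* invalidate */

/* Refill one cache line worth of descriptors (n per line). */
static void rx_refill_line(struct rx_desc *line, int n)
{
        int i;

        /* Fill the line with new descriptors, but don't assign the
         * first buffer to the hardware yet; the hardware won't look
         * past a CPU-owned first entry. */
        for (i = 1; i < n; i++)
                line[i].flags = DESC_HW_OWNED;
        line[0].flags = 0;

        /* First cache line write of just that line. */
        cache_wb_range(line, n * sizeof(*line));

        /* Assign ownership of the first buffer... */
        line[0].flags = DESC_HW_OWNED;

        /* ...and do the second cache line write. From here on the CPU
         * must not dirty that cache line. */
        cache_wb_range(line, n * sizeof(*line));
}

/* Check for new frames in the 'next to be filled' line. */
static bool rx_poll_line(struct rx_desc *line, int n)
{
        /* Invalidate and re-read just this line; the hardware may
         * still be updating other parts of the ring. Assumes the
         * hardware clears the ownership bit once it fills a buffer. */
        cache_inv_range(line, n * sizeof(*line));
        return !(line[0].flags & DESC_HW_OWNED);
}
```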
The transmit side is harder.
It either requires waiting for all pending transmits to
finish or splitting a single transmit into enough fragments
that its descriptors end on a cache line boundary.
But again, if the interface is busy, you want the cpu
to be able to update one cache line of transmit descriptors
while the device is writing transmit completion status
to the previous cache line.
I don't think that is materially different for non-coherent
memory or bounce buffers.
But partial flush/invalidate is needed.
David
On Fri, Mar 25, 2022 at 2:13 PM Johannes Berg <[email protected]> wrote:
>
> Well I see now that you said 'cache "writeback"' in (1), and 'flush' in
> (2), so perhaps you were thinking of the same, and I'm just calling it
> "flush" and "invalidate" respectively?
Yeah, so I mentally tend to think of the operations as just
"writeback" (which doesn't invalidate) and "flush" (which is a
writeback-invalidate).
Which explains my word-usage, but isn't great, because it's not
well-defined. And I'm not even consistent about it.
> However, this seems to be the wrong spot to flush (writeback?) the
> cache, as we're trying to get data from the device, not potentially
> overwrite the data that the device wrote because we have a dirty
> cacheline. Hmm. Then again, how could we possibly have a dirty
> cacheline?
So in my model, (1) is the only case where there's actively dirty data
that needs to be written back. That's the "CPU wrote the data to
memory, and wants to transfer it to the device" case.
In (2) and (3), the only question is whether possibly clean cachelines
contain - or will contain - stale data.
And then exactly when you actually invalidate is up to you.
For example, in
(2) The CPU now wants to see any state written by the device
you have multiple options. You could invalidate any stale cache lines.
Or you could say "We wrote them back and invalidated them in (1), so
we don't need to invalidate them now".
And in
(3) The CPU looked at the data while it was in flight and is now done with it.
you can (again) decide to do nothing at all, BUT ONLY IF (2) chose
that "invalidate" option. Because if you made your (2) depend on that
"they were already invalidated", then (3) has to invalidate the CPU
caches so that a subsequent (2) will work right.
So these are all somewhat interconnected.
You can do just "writeback" in (1), but then you _have_ to do
"invalidate" in (2), and in that case you don't have to do anything at
all in (3).
Or maybe your CPU only has that "writeback-and-invalidate" operation,
so you decide that (2) should be a no-op, and (3) - which is
presumably less common than (2) - also does that writeback-invalidate
thing.
Or we can also say that (3) is not allowed at all - so the ath9k case
is actively wrong and we should warn about that case - but that again
constrains what you can do in (2) and now that previous optimization
is not valid.
And it's worth noting that if your CPU may do cache fills as a result
of speculative accesses (or just sufficiently out of order), then the
whole argument that "I invalidated those lines earlier, so I don't
need to invalidate them now" is just garbage.
Fun, isn't it?
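To pin down one self-consistent combination from the above: writeback in (1), invalidate in (2), nothing in (3). Sketched for a hypothetical non-coherent arch with made-up arch_wb_cache()/arch_inv_cache() primitives (note that because (2) invalidates every time, this variant stays correct even with speculative cache fills):

```
#include <linux/types.h>
#include <linux/dma-direction.h>

/* Made-up primitives for this sketch. */
extern void arch_wb_cache(void *buf, size_t size);
extern void arch_inv_cache(void *buf, size_t size);

static void sketch_sync_for_device(void *buf, size_t size,
                                   enum dma_data_direction dir)
{
        if (dir == DMA_TO_DEVICE)               /* case (1) */
                arch_wb_cache(buf, size);       /* write back dirty CPU data */
        /* case (3), DMA_FROM_DEVICE: nothing to do, because case (2)
         * below invalidates on every sync_for_cpu */
}

static void sketch_sync_for_cpu(void *buf, size_t size,
                                enum dma_data_direction dir)
{
        if (dir == DMA_FROM_DEVICE)             /* case (2) */
                arch_inv_cache(buf, size);      /* drop stale (clean) lines */
}
```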
> Which starts to clarify in my mind why we have a sort of (implied)
> ownership model: if the CPU dirties a cacheline while the device has
> ownership then the cache writeback might overwrite the DMA data.
Right, I think that "if the CPU dirties the cacheline while the device
has ownership, then the data is going to be undefined".
And btw, it does actually potentially happen for real - imagine a user
mmap'ing the IO buffer while IO is in flight. The kernel can't control
for that (well, we can make things read-only, and some uses do), but
then it's often a question of "you have to dirty that area and do the
IO again, because the last attempt sent out undefined data".
And note how this "undefined data" situation can happen even with
coherent IO, so this part isn't even about cache coherency - it's
literally about just about memory accesses being in some undefined
order.
So it *can* be valid to send inconsistent data, but that should be
considered the really odd case.
> So it's
> easier to think of it as "CPU has ownership" and "device has ownership",
> but evidently that simple model breaks down in real-world cases such as
> ath9k where the CPU wants to look, but not write, and the device
> continues doing DMA at the same time.
Yeah, and see above about how the CPU could even write (but honestly,
that isn't valid in the general case, it really requires that kind of
active "we can fix it up later" thing).
> Well if you actually did dirty the cacheline, then you have a bug one
> way or the other, and it's going to be really hard to debug - either you
> lose the CPU write, or you lose the device write, there's no way you're
> not losing one of them?
Again, see above. Losing the CPU write is really bad, because then you
can't even recover by re-doing the operation.
So yes, when looking at only the "single operation" case, it looks
like "lose one or the other". But in the bigger picture, one is more
important than the other.
> So the operations
> (1) dma_sync_single_for_device(DMA_TO_DEVICE)
> (2) dma_sync_single_for_cpu(DMA_FROM_DEVICE)
> (3) dma_sync_single_for_device(DMA_FROM_DEVICE)
>
> really only (1) passes write ownership to the device, but then you can't
> get it back?
Well, you get it back by just checking that the IO is done. Once the
IO is done, the CPU owns the area again.
And the "IO is done" is often some entirely independent status in some
entirely different place.
But it *could* be something that requires a CPU read from that DMA
area. But it's a CPU _read_, so you don't need write ownership for
that.
That's why there is only one DMA_TO_DEVICE, and there are two
DMA_FROM_DEVICE cases.
The DMA_TO_DEVICE cannot have a "let me write in the middle" situation.
But the DMA_FROM_DEVICE has that "let me read in the middle, and
decide if it's done or not", so you can have a looping read, and
that's where (3) comes in.
You can't have a looping write for one operation (but you can
obviously have several independent write operations - that's what the
whole "are you done" is all about)
> But that cannot be true, because ath9k maps the buffer as
> DMA_BIDIRECTIONAL, and then eventually might want to recycle it.
See above. Both cases have "the device is done with this", but they
are fundamentally different situations.
> > That sounds really odd for
> > a "sync" operation. It sounds very reasonable for _allocating_ DMA,
> > but for syncing I'm left scratching my head what the semantics would
> > be.
>
> I agree.
>
> > But yes, if we do and people come up with semantics for it, those
> > semantics should be clearly documented.
>
> I'm not sure? I'm wondering if this isn't just because - like me
> initially - people misunderstood the direction argument, or didn't
> understand it well enough, and then just passed the same value as for
> the map()/unmap()?
Yeah, the solution may well be "grep for it, and pick the right
direction once the docs are clear".
Linus
On Fri, 25 Mar 2022 11:27:41 +0000
Robin Murphy <[email protected]> wrote:
> What muddies the waters a bit is that the opposite combination
> sync_for_cpu(DMA_TO_DEVICE) really *should* always be a no-op, and I for
> one have already made the case for eliding that in code elsewhere, but
> it doesn't necessarily hold for the inverse here, hence why I'm not sure
> there even is a robust common solution for peeking at a live
> DMA_FROM_DEVICE buffer.
In https://lkml.org/lkml/2022/3/24/739 I also argued that a robust
common solution for peeking at a live DMA_FROM_DEVICE buffer is
probably not possible, at least not with the current programming model
as described by Documentation/core-api/dma-api.rst.
Namely, AFAIU the programming model is based on exclusive ownership: the
buffer is either owned by the device, which means the CPU(s) are not
allowed to *access* it, or it is owned by the CPU(s), and the device is
not allowed to *access* it. Do we agree on this?
Considering what Linus said here https://lkml.org/lkml/2022/3/24/775
I understand that if the idea that dma_sync_*_for_{cpu,device} always
transfers ownership to the cpu and device respectively is abandoned,
and we re-define ownership in the sense that only the owner may write
but the non-owner is allowed to read, then it may be possible to make
the scenario under discussion work.
The scenario in pseudo code:

/* when invoked, the device might be doing DMA into buf */
rx_buf_complete(buf)
{
        prepare_peek(buf, DMA_FROM_DEVICE);
        if (!is_ready(buf)) {
                /* let device gain the buffer again */
                peek_done_not_ready(buf, DMA_FROM_DEVICE);
                return false;
        }
        peek_done_ready(buf, DMA_FROM_DEVICE);
        process_buff(buf, DMA_FROM_DEVICE);
}
IMHO it is pretty obvious that prepare_peek() has to update the
cpu copy of the data *without* transferring ownership to the CPU. Since
the owner is still the device, it is legit for the device to keep
modifying the buffer via DMA. In case of the swiotlb, we would copy the
content of the bounce buffer to the orig buffer, possibly after
invalidating caches, and for non-swiotlb we would invalidate caches. So
prepare_peek() could actually be something like

        dma_sync_single_for_cpu(buf, DMA_FROM_DEVICE,
                                DMA_ATTR_NO_OWNERSHIP_TRANSFER)

which would most likely end up being functionally the same as without
the flag, since my guess is that the ownership is only tracked in our
heads.
For peek_done_not_ready() there is conceptually nothing to do, because
the device retained ownership. Thus we would either have to mandate
peek_done_not_ready() being a nop, or non-existent (that is
what Toke's patch does in the specific case), or we would have to
mandate that dma_sync_*_for_*() has no side effects under certain
circumstances. The former looks simpler to me, especially with swiotlb.
But we would also be fine if the cache isn't dirty, because the CPU
didn't write (as pointed out by Linus), and we were able to detect that
and avoid flushing a clean cache; or if we were to track ownership and
avoid flushing caches when there was no ownership transfer. But to
avoid these bad flushes, at least for swiotlb, we would either have to
track cache ownership, or even worse track dirtiness (for which we
would have to extend the API and make the drivers tell us that the
cache, i.e. the original buffer, got dirtied).
Since the device has ownership when peek_done_not_ready() is invoked,
we might need to transfer ownership to the CPU in peek_done_ready().
This could again be a dma_sync_for_cpu() with a flag, which when supplied
tells the dma API that no sync (cache invalidate) is needed because the
driver guarantees that the whole mapping was sufficiently sync-ed by
prepare_peek(). Please notice that the whole scheme is based on the
driver knowing that the whole DMA is done by examining the buffer, and
deciding based on whatever it sees.
Some of the ongoing discussion seems to ignore this whole ownership biz.
My feeling is: the notion of ownership is useful. If both sides end up
modifying (and eventually flushing) we are in trouble IMHO; ownership
avoids that. But if the conclusion ends up being that ownership does
not matter, then we should make sure it is purged from the documentation,
because otherwise it will confuse the hell out of people who read
documentation and care about programming models. People like me.
Regards,
Halil
On Sat, Mar 26, 2022 at 9:06 AM Toke Høiland-Jørgensen <[email protected]> wrote:
>
> I was also toying with the idea of having a copy-based peek helper like:
>
> u32 data = dma_peek_word(buf, offset)
I really don't think you can or want to have a word-based one.
That said, I like the *name* of that thing.
I think a lot of confusion comes from the very subtle naming of
fundamentally having a lot of odd conditions with
- two different "directions of the sync" - ie who it is that cares:
dma_sync_single_for_{cpu,device}
- three different "direction of the data" - ie who it is that writes the data:
DMA_FROM_DEVICE / DMA_TO_DEVICE / DMA_BIDIRECTIONAL
so you have six possible combinations, three of which seem insane and
not useful, and of the three that are actually possible, some are very
unusual (it's exactly that "device is the one writing, but we want to
sync the dma area for the device" case).
I do not think it helps that not only do we have this combinatorial
naming, we also use _different_ names. We say "for device" and "for
cpu", but then when we specify who does the writing, we don't say "cpu
vs device", we just specify the direction instead (FROM_DEVICE means
the device did the writing, TO_DEVICE means that the CPU did the
writing).
Anyway, I spent a lot of time looking at this, and I am now personally
convinced that commit aa6f8dcbab47 (swiotlb: rework "fix info leak
with DMA_FROM_DEVICE") was just completely buggy, and was buggy
exactly because it was fundamentally confused even about which
direction the bounce was happening.
I have reverted it in my tree, and I tried to write a comprehensive
summary about why it was wrong.
What I *didn't* do in that commit was to argue against the naming, and
try to enumerate all the different valid cases.
Because I think naming matters, and I think the current dma_sync()
interfaces are horribly confusing exactly due to those naming
combinatorials.
But I think "peek" is a good name, not because I think reading one
work is a valid thing (you want to often peek more than that), but
because it seems much more intuitive than
"dma_sync_for_cpu(DMA_FROM_DEVICE)".
Similarly, I would think that "flush" is a much better word for
"dma_sync_for_device(DMA_FROM_CPU)".
I don't know what a good word for
"dma_sync_for_device(DMA_FROM_DEVICE)" is, but maybe "forget" would
come closest - we want the CPU to "forget" what it peeked.
Anyway, I have reverted that commit, and I think it was wrong both in
spirit and in implementation, and I'll ask Greg to remove it from
stable.
And I think the security argument was entirely bogus, because the
whole security argument was based on an incorrect understanding of the
direction of the data.
But hey, I may currently be convinced that revert is the right thing
to do, BUT I've been wrong before, and I'll happily change my mind if
somebody makes a really cogent argument.
Linus
On Fri, 25 Mar 2022 22:13:08 +0100
Johannes Berg <[email protected]> wrote:
> > > Then, however, we need to define what happens if you pass
> > > DMA_BIDIRECTIONAL to the sync_for_cpu() and sync_for_device() functions,
> > > which adds two more cases? Or maybe we eventually just think that's not
> > > valid at all, since you have to specify how you're (currently?) using
> > > the buffer, which can't be DMA_BIDIRECTIONAL?
> >
> > Ugh. Do we actually have cases that do it?
>
> Yes, a few.
>
> > That sounds really odd for
> > a "sync" operation. It sounds very reasonable for _allocating_ DMA,
> > but for syncing I'm left scratching my head what the semantics would
> > be.
>
> I agree.
>
> > But yes, if we do and people come up with semantics for it, those
> > semantics should be clearly documented.
>
> I'm not sure? I'm wondering if this isn't just because - like me
> initially - people misunderstood the direction argument, or didn't
> understand it well enough, and then just passed the same value as for
> the map()/unmap()?
I don't think you misunderstood the direction argument and its usage. I
haven't finished thinking about Linus's proposal, but I'm pretty
confident about my understanding of the current semantics of the
direction. According to the documentation, you do have to pass in the
very same direction that was specified when the mapping was created. A
quote from the documentation:
"""
void
dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle,
size_t size,
enum dma_data_direction direction)
void
dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle,
size_t size,
enum dma_data_direction direction)
void
dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
int nents,
enum dma_data_direction direction)
void
dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg,
int nents,
enum dma_data_direction direction)
Synchronise a single contiguous or scatter/gather mapping for the CPU
and device. With the sync_sg API, all the parameters must be the same
as those passed into the single mapping API. With the sync_single API,
you can use dma_handle and size parameters that aren't identical to
those passed into the single mapping API to do a partial sync.
"""
(Documentation/core-api/dma-api.rst)
The key here is "sync_sg API, all the parameters must be the same
as those passed into the single mapping API", but I have to admit,
I don't understand the *single* in here. The intended meaning of the
last sentence is that one can do a partial sync by choosing
dma_handle_sync and size_sync such that

        dma_handle_mapping <= dma_handle_sync
                            < dma_handle_mapping + size_mapping

and

        dma_handle_sync + size_sync <= dma_handle_mapping + size_mapping.

But the direction has to remain the same.
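For illustration, a partial sync within those bounds might look like this (the sizes and offsets are made up):

```
#include <linux/dma-mapping.h>

/* Illustration only: handle was returned by
 * dma_map_single(dev, buf, 4096, DMA_FROM_DEVICE). The partial sync
 * stays inside [handle, handle + 4096) and keeps the mapping's
 * direction. */
static void peek_at_slice(struct device *dev, dma_addr_t handle)
{
        dma_sync_single_for_cpu(dev, handle + 256, 512, DMA_FROM_DEVICE);
}
```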
BTW, the current documented definition of the direction is about the
data transfer direction between memory and the device, and how the CPU
is interacting with the memory is not in scope. A quote from the
documentation:
"""
======================= =============================================
DMA_NONE no direction (used for debugging)
DMA_TO_DEVICE data is going from the memory to the device
DMA_FROM_DEVICE data is coming from the device to the memory
DMA_BIDIRECTIONAL direction isn't known
======================= =============================================
"""
(Documentation/core-api/dma-api.rst)
My feeling is, that re-defining the dma direction is not a good idea. But
I don't think my opinion has much weight here.
@Christoph, Robin: What do you think?
Regards,
Halil
From: Linus Torvalds
> Sent: 26 March 2022 18:39
>
> On Sat, Mar 26, 2022 at 9:06 AM Toke Høiland-Jørgensen <[email protected]> wrote:
> >
> > I was also toying with the idea of having a copy-based peek helper like:
> >
> > u32 data = dma_peek_word(buf, offset)
>
> I really don't think you can or want to have a word-based one.
>
> That said, I like the *name* of that thing.
>
> I think a lot of confusion comes from the very subtle naming of
> fundamentally having a lot of odd conditions with
>
> - two different "directions of the sync" - ie who it is that cares:
>
> dma_sync_single_for_{cpu,device}
>
> - three different "direction of the data" - ie who it is that writes the data:
>
> DMA_FROM_DEVICE / DMA_TO_DEVICE / DMA_BIDIRECTIONAL
>
> so you have six possible combinations, three of which seem insane and
> not useful, and of the three that are actually possible, some are very
> unusual (it's exactly that "device is the one writing, but we want to
> sync the dma area for the device").
Another 2c :-)
Is the idea of 'buffer ownership' even a good one?
Perhaps the whole thing would be better described in terms of
what happens when bounce buffers are used.
So there are notionally two buffers.
One accessed by the cpu, the other by the device.
There are then just 3 things that happen:
1) Dirty data may get moved to the other buffer at any time.
So the driver must not dirty buffers (cache lines) that the
device might write to.
2) The cpu has to request data be copied to the device buffer
after updating the cpu buffer.
This makes the cpu buffer 'not dirty' so copies (1) can no
longer happen.
3) The cpu has to request data be copied from the device buffer
before looking at the data.
All copies affect a dma-cache-line sized block of data (which
might be device dependent).
An optimised version of (2) that doesn't actually do the copy
can be implemented for use prior to read requests.
For cache-coherent memory only (1) happens and (2) and (3)
are no operations.
For non-coherent memory (2) is write-back-and-invalidate and
(3) might just be an invalidate.
For bounce buffers all are actual copies - and additional
operations might be needed for device access to the bounce
buffer itself.
For security reasons bounce buffers may need initialising.
But this would be done when they are allocated.
David
On Thu, 24 Mar 2022 12:26:53 -0700
Linus Torvalds <[email protected]> wrote:
> So I don't think the dma_sync_single_for_device() is *wrong* per se,
> because the CPU didn't actually do any modifications.
>
> But yes, I think it's unnecessary - because any later CPU accesses
> would need that dma_sync_single_for_cpu() anyway, which should
> invalidate any stale caches.
>
> And it clearly doesn't work in a bounce-buffer situation, but honestly
> I don't think a "CPU modified buffers concurrently with DMA" can
> *ever* work in that situation, so I think it's wrong for a bounce
> buffer model to ever do anything in the dma_sync_single_for_device()
> situation.
I agree that CPU modified buffers *concurrently* with DMA can never work,
and I believe the ownership model was conceived to prevent this
situation. But a CPU can modify the buffer *after* DMA has written to
it, while the mapping is still alive. For example one could do one
partial read from the device, *after* the DMA is done,
sync_for_cpu(DMA_FROM_DEVICE), examine, then zero out the entire buffer,
sync_for_device(DMA_FROM_DEVICE), make the device do partial DMA, do
dma_unmap and expect no residue from the first DMA. That is, expect that
the zeroing out was effective.
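Spelled out with the usual API calls, that sequence would be something like (the names are invented for the example):

```
#include <linux/dma-mapping.h>
#include <linux/string.h>

static void reuse_mapping_example(struct device *dev, dma_addr_t handle,
                                  void *buf, size_t size)
{
        /* the first (partial) DMA write has completed */
        dma_sync_single_for_cpu(dev, handle, size, DMA_FROM_DEVICE);
        /* ... examine the partially filled buffer ... */

        memset(buf, 0, size);   /* CPU zeroes the entire buffer */
        dma_sync_single_for_device(dev, handle, size, DMA_FROM_DEVICE);

        /* ... device does a second partial DMA write ... */

        dma_unmap_single(dev, handle, size, DMA_FROM_DEVICE);
        /* expectation: no residue of the first DMA is visible in buf */
}
```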
The point I'm trying to make is: if concurrent access were even
permitted (it isn't, because of ownership), swiotlb wouldn't know if we
are dealing with the *concurrent* case, which is completely bogus, or
with the *sequential* case, which at least in theory could work. And if
we don't do anything on the sync_for_device(DMA_FROM_DEVICE) we render
the zeroing out of the entire buffer from my example ineffective,
because it would not reach the bounce buffer, and on dma_unmap we would
overwrite the original buffer with the content of the bounce buffer.
Regards,
Halil
On Sat, Mar 26, 2022 at 3:38 PM David Laight <[email protected]> wrote:
>
> Is the idea of 'buffer ownership' even a good one?
I do think it might be best to not think in those terms, but literally
just in data movement terms.
Because the "buffer ownership" mental model is clearly confused, when
data transfer might be ongoing, but the CPU might need to just look at
"what's going on right now" without actually taking any ownership of
the buffer.
Linus
On Sat, Mar 26, 2022 at 10:06 PM Linus Torvalds
<[email protected]> wrote:
>
> On Sat, Mar 26, 2022 at 8:49 PM Halil Pasic <[email protected]> wrote:
> >
> > I agree that CPU modified buffers *concurrently* with DMA can never work,
> > and I believe the ownership model was conceived to prevent this
> > situation.
>
> But that just means that the "ownership" model is garbage, and cannot
> handle this REAL LIFE situation.
Just to clarify: I obviously agree that the "both sides modify
concurrently" obviously cannot work with bounce buffers.
People still do want to do that, but they'll limit themselves to
actual cache-coherent DMA when they do so (or do nasty uncached
accesses but at least no bounce buffering).
But the "bounce ownership back and forth" model comes up empty when
the CPU wants to read while the DMA is still going on. And that not
only can work, but *has* worked.
You could have a new "get me a non-ownership copy" operation of
course, but that hits the problem of "which existing drivers need it?"
We have no idea, outside of ath9k.
This is why I believe we have to keep the existing semantics in a way
that keep ath9k - and any number of unknown other drivers - happy.
And then for the cases where you want to introduce the zeroing because
you don't know how much data the DMA returned - those are the ones
you'll have to mark some way.
Linus
On Sat, 26 Mar 2022 22:06:15 -0700
Linus Torvalds <[email protected]> wrote:
> On Sat, Mar 26, 2022 at 8:49 PM Halil Pasic <[email protected]> wrote:
> >
> > I agree that CPU modified buffers *concurrently* with DMA can never work,
> > and I believe the ownership model was conceived to prevent this
> > situation.
>
> But that just means that the "ownership" model is garbage, and cannot
> handle this REAL LIFE situation.
>
> Here's the deal: if somebody makes a model that is counter-factual,
> you have exactly two choices:
>
> - fix the damn model
>
> - live in a la-la-land that isn't reality
>
> Which choice do you pick?
I pick "fix the damn model". This email of mine was supposed to discuss how
the model can be fixed:
https://www.spinics.net/lists/linux-wireless/msg222442.html
>
> And I'll be brutally honest: if you pick the la-la-land one, I don't
> think we can really discuss this any more.
I completely agree. Never intended to pick the la-la-land one.
>
> > But a CPU can modify the buffer *after* DMA has written to
> > it, while the mapping is still alive.
>
> Yes.
>
> But you are making ALL your arguments based on that "ownership" model
> that we now know is garbage.
Sorry, I'm not very knowledgeable when it comes all the different
hardware out there.
>
> If you make your arguments based on garbage, the end result _might_
> work just by happenstance, but the arguments sure aren't very
> convincing, are they?
No it is not. I have to admit, I did see some value in talking about the
model that is described by the current documentation, for two reasons:
1) Not everybody has great knowledge of all the hardware out there, and
there might be people other than me who based their work on that broken
model, and thus wrote code that is correct within the broken model
but not correct within the fixed model. That does not seem to be the
case here, but I was not able to tell.
2) To fix the model, we have to understand both reality and the model.
Or we have to replace it with an entirely new one.
>
> So let's make it really clear that the arguments must not be based on
> some "ownership" model that you just admitted cannot handle the very
> real case of real and common hardware.
>
> Ok?
Sure. It was never my intention to base my argument on false assumptions.
>
> > For example one could do one
> > partial read from the device, *after* the DMA is done,
> > sync_for_cpu(DMA_FROM_DEVICE), examine, then zero out the entire buffer,
> > sync_for_device(DMA_FROM_DEVICE)
>
> So the result you want to get to I can believe in, but your sequence
> of getting there is untenable, since it depends on breaking other
> cases that are more important than your utterly broken hardware that
> you don't even know how much data it filled.
>
I agree, and that is the very reason I said I'm not against the
partial revert (see
https://www.spinics.net/lists/linux-wireless/msg222326.html).
Hey, I don't even know if there is a single usage of what I described.
In fact I asked the community whether this is even something legit. What
I know is that the current (broken) model does allow the scenario.
> And I fundamentally also happen to think that since the CPU just wrote
> that buffer, and *that* write is what you want to sync with the
> device, then that DMA_FROM_DEVICE was just pure fantasy to begin with.
>
Not sync with the device, but with the memory. And it is supposed to
happen after the DMA has completed, and not while the DMA is ongoing.
But I'm clearly not knowledgeable enough to have this discussion. I'm
afraid if I continue, I will just drag the community down.
> So that code sequence you quote is wrong. You are literally trying to
> re-introduce the semantics that we already know broke the ath9k
> driver.
>
> Instead, let me give you a couple of alternative scenarios.
>
> Alternative 1:
>
> - ok, so the CPU wrote zeroes to the area, and we want to tell the
> DMA mapping code that it needs to make that CPU write visible to the
> device.
Not make it visible to the device, but actually make it reach RAM
instead of remaining in cache. My most basic mental model is:
+----------------+
|                |       +---------+        +--------+
|     +--------+ |       |         |  DMA   |        |
| CPU | CACHE  | | <---> |   RAM   | <----> | DEVICE |
|     +--------+ |       |         |        |        |
|                |       +---------+        +--------+
+----------------+
>
> - Ahh, you mean "sync_for_device(DMA_TO_DEVICE)"?
>
> - Yes.
>
> The "for_device()" tells us that afterwards, the device can access
> the memory (we synced it for the device).
>
> And the DMA_TO_DEVICE tells us that we're transferring the zeroes
> we wrote on the CPU to the device bounce buffer.
>
> Now the device got those zeroes, and it can overwrite them
> (partially), and everything is fine
>
> - And then we need to use "sync_for_cpu(DMA_FROM_DEVICE)" when we
> want to see the result once it's all done?
>
> - Yes.
>
> - Splendid. Except I don't like how you mix DMA_TO_DEVICE and
> DMA_FROM_DEVICE and you made the dma_alloc() not use DMA_BIDIRECTIONAL
>
> - Yeah, that's ugly, but it matches reality, and it would "just work" today.
>
It is certainly an option. The tricky part is that one is supposed to use
DMA_TO_DEVICE even if the device does not read RAM but only writes it,
and thus the direction of the data flow of the direct memory access (DMA)
is from the device to the memory (RAM).
> Alternative 2:
>
> - Ok, so the CPU doesn't even want to write to the area AT ALL, but
> we know we have a broken device that might not fill all of the bounce
> buffer, and we don't want to see old stale bounce buffer contents that
> could come from some other operation entirely and leak sensitive data
> that wasn't for us.
>
> - Ahh, so you really want to just clear the bounce buffer before IO?
>
> - Yes. So let's introduce a "clear_bounce_buffer()" operation before
> starting DMA. Let's call it "sync_for_device(DMA_NONE)" and teach the
> non-bounce-buffer dmas mapping entities to just ignore it, because
> they don't have a bounce buffer that can contain stale data.
>
> - Sounds good.
>
It is an option.
> Alternative 3:
>
> - Ok, you have me stumped. I can't come up with something else sane.
>
I have tired to in
https://www.spinics.net/lists/linux-wireless/msg222442.html
> Anyway, notice what's common about these alternatives? They are based
> not on some "we have a broken model", but on trying to solve the
> actual real-life problem case.
Yes.
>
> I'm more than happy to hear other alternatives.
>
> But the alternative I am _not_ willing to entertain is "Yeah, we have
> a model of ownership, and that can't handle ath9k because that one
> wants to do CPU reads while DMA is possibly active, so ath9k is
> broken".
>
> Can we please agree on that?
>
Yes.
AFAIU the main reason we should postulate that the ath9k code is correct
is that it would be prohibitively expensive not to do so. Because we
can comparatively easily change ath9k, but we can't be sure about other
uses of sync_for_device unless we audit all of them. That does make
perfect sense to me.
IMHO that also means that the whole "ownership" concept is beyond saving,
and that we should rewrite the corresponding parts of the documentation.
Thus my effort at saving it was misguided.
Given how this unfolded, I intend to let the more knowledgeable people
hash out the new model, and to avoid dragging down the community by
being too vocal about my opinion.
For the record, I believe that the partial revert proposed here
https://www.spinics.net/lists/linux-wireless/msg222300.html
would have been a wiser choice, than a complete revert of commit
aa6f8dcbab47 ("swiotlb: rework "fix info leak with DMA_FROM_DEVICE"").
The reason why is in the commit message, and was also pointed out by
Robin.
Regards,
Halil
On Sun, Mar 27, 2022 at 4:37 PM Halil Pasic <[email protected]> wrote:
>
>
> For the record, I believe that the partial revert proposed here
> https://www.spinics.net/lists/linux-wireless/msg222300.html
> would have been a wiser choice, than a complete revert of commit
> aa6f8dcbab47 ("swiotlb: rework "fix info leak with DMA_FROM_DEVICE"").
Yeah, the revert is basically my standard "this doesn't work,
discussion is still ongoing" thing.
I agree that the revert then brought back that DMA_ATTR_SKIP_CPU_SYNC
complexity.
So that part of commit aa6f8dcbab47 was probably all good.
I somehow missed that Oleksandr had a tested-by for that smaller change too.
Linus
On Sun, Mar 27, 2022 at 8:24 AM David Laight <[email protected]> wrote:
>
> Aren't bounce buffers just a more extreme case of non-coherent
> memory accesses?
No.
In fact, this whole change came about exactly because bounce buffers
are different.
The difference is that bounce buffers have that (wait for it) bounce
buffer, which can have stale contents.
> They just need explicit memory copies rather than just cache
> writeback and invalidate operations.
That's the thing - the memory copies introduce entirely new issues.
I really think that instead of making up abstract rules ("ownership"
or "bounce buffers are just extreme cases of non-coherency") we should
make the rules very much practical and down to earth, and write out
exactly what they *do*.
The whole "sync DMA" is odd and abstract enough as a concept on its
own, we shouldn't then make the rules for it odd and abstract. We
should make them very very practical.
So I will propose that we really make it very much about practical
concerns, and we document things as
(a) the "sync" operation has by definition a "whose _future_ access
do we sync for" notion.
So "dma_sync_single_for_cpu()" says that the future accesses to
this dma area is for the CPU.
Note how it does *NOT* say that the "CPU owns the are". That's
bullsh*t, and we now know it's BS.
(b) but the sync operation also has a "who wrote the data we're syncing"
Note that this is *not* "who accessed or owned it last", because
that's nonsensical: if we're syncing for the CPU, then the only reason
to do so is because we expect that the last access was by the device,
so specifying that separately would just be redundant and stupid.
But specifying who *wrote* to the area is meaningful and matters.
It matters for the non-coherent cache case (because of writeback
issues), but it also matters for the bounce buffer case (because it
determines which way we should copy).
Note how this makes sense: a "sync" operation is clearly about taking
some past state, and making it palatable for a future use. The past
state is pretty much defined by who wrote the data, and then we can
use that and the "the next thing to access it" to determine what we
need to do about the sync.
It is *NOT* about "ownership".
So let's go through the cases, and I'm going to ignore the "CPU caches
are coherent with device DMA" case because that's always going to be a
no-op wrt data movement (but it will still generally need a memory
barrier, which I will mention here and then ignore going forward).
Syncing for *CPU* accesses (ie dma_sync_single_for_cpu()) has four
choices I can see:
- nobody wrote the data at all (aka DMA_NONE).
This is nonsensical and should warn. If nobody wrote to it, why
would the CPU ever validly access it?
Maybe you should have written "memset(buffer, 0, size)" instead?
- the CPU wrote the data in the first place (aka DMA_TO_DEVICE)
This is a no-op (possibly a memory barrier), because even stale CPU
caches are fine, and even if it was in a bounce buffer, the original
CPU-side data is fine.
- the device wrote the data (aka DMA_FROM_DEVICE)
This is just the regular case of a device having written something,
and the CPU wants to see it.
It obviously needs real work, but it's simple and straightforward.
For non-coherent caches, it needs a cache invalidate. For a bounce
buffer, it needs a copy from the bounce buffer to the "real" buffer.
- it's not clear who wrote the data (aka DMA_BIDIRECTIONAL)
This is not really doable for a bounce buffer - we just don't know
which buffer contents are valid.
I think it's very very questionable for non-coherent caches too,
but "writeback and invalidate" probably can't hurt.
So probably warn about it, and do whatever we used to do historically.
Syncing for device accesses (ie dma_sync_single_for_device()) also has
the same four choices I can see, but obviously does different things:
- nobody wrote the data at all (aka DMA_NONE)
This sounds as nonsensical as the CPU case, but maybe isn't.
We may not have any previous explicit writes, but we *do* have that
"insecure and possibly stale buffer contents" bounce buffer thing on
the device side.
So with a bounce buffer, it's actually somewhat sane to say
"initialize the bounce buffer to a known state".
Because bounce buffers *are* special. Unlike even the "noncoherent
caches" issue, they have that entirely *other* hidden state in the
form of the bounce buffer itself.
Discuss.
- the CPU wrote the data in the first place (aka DMA_TO_DEVICE).
This is the regular and common case of "we have data on the CPU
side that is written to the device".
Again, needs work, but is simple and straightforward.
For non-coherent caches, we need a writeback on the CPU. For a
bounce buffer, we need to copy from the regular buffer to the bounce
buffer.
- the device wrote the data in the first place (aka DMA_FROM_DEVICE)
This is the case that we hit on ath9k. It's *not* obvious, but when
we write this out this way, I really think the semantics are pretty
clear.
For non-coherent caches, we may need an "invalidate". For a bounce
buffer, it's a no-op (because the bounce buffer already contains the
data)
- it's not clear who wrote the data (aka DMA_BIDIRECTIONAL)
This is again not really doable for a bounce buffer. We don't know
which buffer contains the right data, we should warn about it and do
whatever we used to do historically.
Again, it's very questionable for non-coherent caches too, but
"writeback and invalidate" probably at least can't hurt.
So hey, that's my thinking. The whole "ownership" model is, I think,
obviously untenable.
But just going through and listing the different cases and making them
explicit I think explains exactly what the different situations are,
and that then makes it fairly clear what the different combinations
should do.
No?
Linus
On Sun, Mar 27, 2022 at 4:52 PM Halil Pasic <[email protected]> wrote:
>
> I have no intention of pursuing this. When fixing the information leak,
> I happened to realize, that a somewhat similar situation can emerge when
> mappings are reused. It seemed like an easy fix, so I asked the swiotlb
> maintainers, and they agreed. It ain't my field of expertise, and the
> drivers I'm interested in don't need this functionality.
Ok.
That said, I think you are putting yourself down when you said in an
earlier email that you aren't very knowledgeable in this area.
I think the fact that you *did* think of this other similar situation
is actually very interesting, and it's something people probably
_haven't_ been thinking about.
So I think your first commit fixes the straightforward and common case
where you do that "map / partial dma / unmap" case.
And that straightforward case is probably all that the disk IO case
ever really triggers, which is presumably why those "drivers I'm
interested in don't need this functionality" don't need anything else?
And yes, your second commit didn't work, but hey, whatever. The whole
"multiple operations on the same double buffering allocation"
situation is something I don't think people have necessarily thought
about enough.
And by that I don't mean you. I mean very much the whole history of
our dma mapping code.
I then get opinionated and probably too forceful, but please don't
take it as being about you - it's about just my frustration with that
code - and if it comes off too negative then please accept my
apologies.
Linus
On Fri, Mar 25, 2022 at 11:46:09AM -0700, Linus Torvalds wrote:
> I think my list of three different sync cases (not just two! It's not
> just about whether to sync for the CPU or the device, it's also about
> what direction the data itself is taking) is correct.
>
> But maybe I'm wrong.
At the high level you are correct. It is all about which direction
the data is taking. That is the direction argument that all the
map/unmap/sync call take. The sync calls then just toggle the ownership.
You seem to hate that ownership concept, but I don't see how things
could work without it; we're going to be in trouble without having
that. And yes, a peek operation could work in some cases, but it would
have to be at the cache line granularity.
arch/arc/mm/dma.c has a really good comment on how these transfers relate
to actual cache operations, btw:
*
*          |   map          ==  for_device     |   unmap     ==  for_cpu
*          |----------------------------------------------------------------
* TO_DEV   |   writeback        writeback      |   none          none
* FROM_DEV |   invalidate       invalidate     |   invalidate*   invalidate*
* BIDIR    |   writeback+inv    writeback+inv  |   invalidate    invalidate
*
* [*] needed for CPU speculative prefetches
On Sun, Mar 27, 2022 at 12:23 PM Linus Torvalds
<[email protected]> wrote:
>
> So I will propose that we really make it very much about practical
> concerns, and we document things as
>
> (a) the "sync" operation has by definition a "whose _future_ access
> do we sync for" notion.
>
> So "dma_sync_single_for_cpu()" says that the future accesses to
> this dma area is for the CPU.
>
> Note how it does *NOT* say that the "CPU owns the are". That's
> bullsh*t, and we now know it's BS.
>
> (b) but the sync operation also has a "who wrote the data we're
> syncing" notion.
>
> Note that this is *not* "who accessed or owned it last", because
> that's nonsensical: if we're syncing for the CPU, then the only reason
> to do so is because we expect that the last access was by the device,
> so specifying that separately would just be redundant and stupid.
>
> But specifying who *wrote* to the area is meaningful and matters.
We could also simply require that the bounce buffer code *remember*
who wrote to it last.
So when the ath9k driver does
- setup:
    bf->bf_buf_addr = dma_map_single(sc->dev, skb->data,
                                     common->rx_bufsize,
                                     DMA_FROM_DEVICE);
we clear the bounce buffer and remember that the state of the bounce
buffer is "device wrote to it" (because DMA_FROM_DEVICE).
Then, we have an interrupt or other event, and ath9k does
- rc event:
    dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
                            common->rx_bufsize, DMA_FROM_DEVICE);
    ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
    if (ret == -EINPROGRESS) {
            /* let the device gain the buffer again */
            dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
                                       common->rx_bufsize, DMA_FROM_DEVICE);
            return false;
    }
and the first dma_sync_single_for_cpu() now sees "Ok, I want the CPU
buffer, and I remember that the device wrote to it, so I will copy
from the bounce buffer". It's still DMA_FROM_DEVICE, so that "the
device wrote to it" doesn't change.
When the CPU then decides "ok, that wasn't it", and does that
dma_sync_single_for_device(), the bounce buffer code goes "Ok, the
last operation was that the device wrote to the buffer, so the bounce
buffer is still valid and I should do nothing".
Voila, no ath9k breakage, and it all still makes perfect sense.
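A purely illustrative sketch of that bookkeeping - this is not the
actual swiotlb code, and all of the names are made up:
```
enum last_writer { WRITER_NONE, WRITER_CPU, WRITER_DEVICE };

struct bounce_slot {
        void *orig;             /* the driver's buffer */
        void *bounce;           /* the bounce buffer */
        size_t size;
        enum last_writer last;  /* who wrote the current contents */
};

static void bounce_sync_for_cpu(struct bounce_slot *s)
{
        /* The device wrote last, so copy its data back for the CPU. */
        if (s->last == WRITER_DEVICE)
                memcpy(s->orig, s->bounce, s->size);
}

static void bounce_sync_for_device(struct bounce_slot *s)
{
        /*
         * If the device wrote last (the ath9k "wasn't for me, hand it
         * back" case), the bounce buffer is already valid: do nothing
         * rather than clobber it with the driver's buffer.
         */
        if (s->last != WRITER_DEVICE)
                memcpy(s->bounce, s->orig, s->size);
}
```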
And that sounds like an even more logical model than the one where we
tell the bounce buffer code what the previous operation was, but it
involves basically the DMA mapping code remembering what the last
direction was. That makes perfect sense to me, but it's certainly not
what the DMA mapping code has traditionally done, which makes me
nervous that it would just expose a _lot_ of other drivers that do odd
things.
The "good news" (if there is such a thing) is that usually the
direction doesn't actually change. So if you use DMA_FROM_DEVICE
initially, you'll continue to use that. So there is probably basically
never any difference between "what was the previous operation" and
"what is the next operation".
So maybe practically speaking, we don't care.
Anyway, I do think we have choices here on how to describe things.
I do think that the "DMA code doesn't have to remember" model has the
advantage that remembering is always an added complexity, and
operations that behave differently depending on previous history are
always a bit harder to think about because of that. Which is why I
think that model I outlined in the previous email is probably the most
straightforward one.
Linus
From: Christoph Hellwig
> Sent: 28 March 2022 07:37
>
> On Fri, Mar 25, 2022 at 11:46:09AM -0700, Linus Torvalds wrote:
> > I think my list of three different sync cases (not just two! It's not
> > just about whether to sync for the CPU or the device, it's also about
> > what direction the data itself is taking) is correct.
> >
> > But maybe I'm wrong.
>
> At the high level you are correct. It is all about which direction
> the data is taking. That is the direction argument that all the
> map/unmap/sync calls take. The sync calls then just toggle the ownership.
> You seem to hate that ownership concept, but I don't see how things
> could work without it; we're going to be in trouble without that
> ownership. And yes, a peek operation could work in some cases, but it
> would have to be at cache line granularity.
I don't think it is really 'ownership' but more about who has
write access.
Only one side can have write access (to a cache line [1]) at any
one time.
Read access is different.
You need a 'synchronise' action to pick up newly written data.
This might be a data copy, cache flush or cache invalidate.
It only needs to affect the area that needs to be read - not the
full buffer.
Partial cache flush/invalidate will almost certainly speed
up receipt of short network packets that are copied into a
new skb - leaving the old one mapped for another receive.
[1] The cache line size might be a property of the device
and DMA subsystem, not just the CPU.
I have used hardware where the effective size was 1kB.
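A sketch of such a partial synchronisation using the existing
range-sync call; hdr_len, pkt_len and hdr_says_short_packet() are
hypothetical placeholders:
```
/* Look at just the header first; the rest may still be in flight. */
dma_sync_single_range_for_cpu(dev, handle, 0, hdr_len, DMA_FROM_DEVICE);

if (hdr_says_short_packet(buf)) {
        /* Sync and copy only the payload into a fresh skb... */
        dma_sync_single_range_for_cpu(dev, handle, hdr_len,
                                      pkt_len - hdr_len, DMA_FROM_DEVICE);
        /* ...leaving the original buffer mapped for another receive. */
}
```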
David
On Sun, 2022-03-27 at 05:15 +0200, Halil Pasic wrote:
>
> The key here is "sync_sg API, all the parameters must be the same
> as those passed into the single mapping API", but I have to admit,
> I don't understand the *single* in here.
>
Hah. So I wasn't imagining things after all.
However, as the rest of the thread shows, this still means it's all
broken ... :)
> The intended meaning of the
> last sentence is that one can do a partial sync by choosing
> dma_handle_sync and size_sync such that dma_handle_mapping <=
> dma_handle_sync < dma_handle_mapping + size_mapping and
> dma_handle_sync + size_sync <= dma_handle_mapping + size_mapping.
> But the direction has to remain the same.
Right.
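Concretely, the partial sync described above would look something
like this (sizes arbitrary; the synced range lies wholly inside the
mapped range, and the direction matches the mapping):
```
/* Map 4096 bytes... */
dma_addr_t map = dma_map_single(dev, buf, 4096, DMA_FROM_DEVICE);

/* ...then sync only bytes [256, 256 + 128) of that mapping. */
dma_sync_single_range_for_cpu(dev, map, 256, 128, DMA_FROM_DEVICE);
```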
> BTW, the currently documented definition of the direction is about the
> data transfer direction between memory and the device; how the CPU
> is interacting with the memory is not in scope. A quote from the
> documentation:
>
> """
> ======================= =============================================
> DMA_NONE no direction (used for debugging)
> DMA_TO_DEVICE data is going from the memory to the device
> DMA_FROM_DEVICE data is coming from the device to the memory
> DMA_BIDIRECTIONAL direction isn't known
> ======================= =============================================
> """
> (Documentation/core-api/dma-api.rst)
>
> My feeling is that re-defining the DMA direction is not a good idea. But
> I don't think my opinion has much weight here.
However, this basically means that the direction argument to the flush
APIs is completely useless, and we do have to define something
new/else...
johannes
On Sun, 27 Mar 2022 17:30:01 -0700
Linus Torvalds <[email protected]> wrote:
> On Sun, Mar 27, 2022 at 4:52 PM Halil Pasic <[email protected]> wrote:
> >
> > I have no intention of pursuing this. When fixing the information leak,
> > I happened to realize that a somewhat similar situation can emerge when
> > mappings are reused. It seemed like an easy fix, so I asked the swiotlb
> > maintainers, and they agreed. It ain't my field of expertise, and the
> > drivers I'm interested in don't need this functionality.
>
> Ok.
>
> That said, I think you are putting yourself down when you said in an
> earlier email that you aren't very knowledgeable in this area.
>
> I think the fact that you *did* think of this other similar situation
> is actually very interesting, and it's something people probably
> _haven't_ been thinking about.
Thank you!
>
> So I think your first commit fixes the straightforward and common case
> where you do that "map / partial dma / unmap" sequence.
>
> And that straightforward case is probably all that the disk IO case
> ever really triggers, which is presumably why those "drivers I'm
> interested in don't need this functionality" don't need anything else?
>
I agree.
> And yes, your second commit didn't work, but hey, whatever. The whole
> "multiple operations on the same double buffering allocation"
> situation is something I don't think people have necessarily thought
> about enough.
>
> And by that I don't mean you. I mean very much the whole history of
> our dma mapping code.
>
I agree. We are in the process of catching up! :) My idea was to aid
the process as a relatively naive pair of eyes: somebody who hasn't read
any data sheets describing non-cache-coherent DMA, and has never
programmed a DMA engine. It is a fairly common problem that, for the very
knowledgeable, certain things seem obvious, self-explanatory or trivial,
but for the less knowledgeable they are not. And knowledge can create bias.
> I then get opinionated and probably too forceful, but please don't
> take it as being about you - it's just my frustration with that
> code - and if it comes off too negative then please accept my
> apologies.
I have to admit, I did feel a little uncomfortable, and I did look for
an exit strategy. I do believe that people in your position occasionally
have to get forceful, and even abrasive, to maintain efficiency. I
try not to ignore the social aspect of things, but I do get carried away
occasionally.
Your last paragraph especially is very encouraging and welcome. Thank
you!
Regards,
Halil
[..]
On Sat, 26 Mar 2022 22:21:03 -0700
Linus Torvalds <[email protected]> wrote:
> On Sat, Mar 26, 2022 at 10:06 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Sat, Mar 26, 2022 at 8:49 PM Halil Pasic <[email protected]> wrote:
> > >
> > > I agree that the CPU modifying buffers *concurrently* with DMA can
> > > never work, and I believe the ownership model was conceived to
> > > prevent this situation.
> >
> > But that just means that the "ownership" model is garbage, and cannot
> > handle this REAL LIFE situation.
>
> Just to clarify: I obviously agree that the "both sides modify
> concurrently" obviously cannot work with bounce buffers.
>
> People still do want to do that, but they'll limit themselves to
> actual cache-coherent DMA when they do so (or do nasty uncached
> accesses but at least no bounce buffering).
Thanks for the explanation!
>
> But the "bounce ownership back and forth" model comes up empty when
> the CPU wants to read while the DMA is still going on. And that not
> only can work, but *has* worked.
>
> You could have a new "get me a non-ownership copy" operation of
> course,
Yes, https://www.spinics.net/lists/linux-wireless/msg222442.html was
mostly about exploring that idea.
> but that hits the problem of "which existing drivers need it?"
>
> We have no idea, outside of ath9k.
>
> This is why I believe we have to keep the existing semantics in a way
> that keep ath9k - and any number of unknown other drivers - happy.
I agree.
>
> And then for the cases where you want to introduce the zeroing because
> you don't know how much data the DMA returned - those are the ones
> you'll have to mark some way.
I have no intention of pursuing this. When fixing the information leak,
I happened to realize that a somewhat similar situation can emerge when
mappings are reused. It seemed like an easy fix, so I asked the swiotlb
maintainers, and they agreed. It ain't my field of expertise, and the
drivers I'm interested in don't need this functionality.
Regards,
Halil
>
> Linus
On Mon, 2022-03-28 at 11:50 +0200, Johannes Berg wrote:
> No, I worded that badly - the direction isn't useless, but thinking of it
> in terms of a buffer property rather than data movement is inaccurate.
> So then if we need something else to indicate how data was expected to
> be moved, the direction argument becomes useless, since it's not a
> buffer property but rather a temporal thing on a specific place that
> expected certain data movement.
>
Yeah, umm. I should've read the whole thread from the weekend first; sorry
for the noise.
johannes
On Mon, 2022-03-28 at 11:48 +0200, Johannes Berg wrote:
>
> However, this basically means that the direction argument to the flush
> APIs are completely useless, and we do have to define something
> new/else...
>
No, I worded that badly - the direction isn't useless, but thinking of it
in terms of a buffer property rather than data movement is inaccurate.
So then if we need something else to indicate how data was expected to
be moved, the direction argument becomes useless, since it's not a
buffer property but rather a temporal thing on a specific place that
expected certain data movement.
johannes
On Mon, 28 Mar 2022 08:37:23 +0200
Christoph Hellwig <[email protected]> wrote:
> On Fri, Mar 25, 2022 at 11:46:09AM -0700, Linus Torvalds wrote:
> > I think my list of three different sync cases (not just two! It's not
> > just about whether to sync for the CPU or the device, it's also about
> > what direction the data itself is taking) is correct.
> >
> > But maybe I'm wrong.
>
> At the high level you are correct. It is all about which direction
> the data is taking. That is the direction argument that all the
> map/unmap/sync calls take. The sync calls then just toggle the ownership.
> You seem to hate that ownership concept, but I don't see how things
> could work without it; we're going to be in trouble without that
> ownership. And yes, a peek operation could work in some cases, but it
> would have to be at cache line granularity.
>
> arch/arc/mm/dma.c has a really good comment on how these transfers relate
> to actual cache operations, btw:
>
> *
> * | map == for_device | unmap == for_cpu
> * |----------------------------------------------------------------
> * TO_DEV | writeback writeback | none none
> * FROM_DEV | invalidate invalidate | invalidate* invalidate*
Very interesting! Does that mean that
    memset(buf, 0, size);
    handle = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);
    /* the device does a partial write */
    dma_unmap_single(dev, handle, size, DMA_FROM_DEVICE);
may boil down to being the same as without the memset(), because the
effect of the memset() may get thrown away by the cache invalidate?
That is, in this scenario, we could actually leak the original content of
the buffer if we have a non-DMA-coherent cache?
Regards,
Halil
> * BIDIR | writeback+inv writeback+inv | invalidate invalidate
> *
> * [*] needed for CPU speculative prefetches