Hi Tom,
We've found a bug where systems that have the AMD SME turned on are
not able to run RDMA work loads. It seems the kernel is automatically
encrypting VMA's pointing at PCI BAR memory created by
io_remap_pfn_range() - adding a prot_decrypted() causes things to
start working.
To me this is surprising, before I go adding random prot_decrypted()
into the RDMA subsystem can you confirm this is actually how things
are expected to work?
Is RDMA missing something? I don't see anything special in VFIO for
instance and the two are very similar - does VFIO work with SME, eg
DPDK or something unrelated to virtualization?
Is there a reason not to just add prot_decrypted() to
io_remap_pfn_range()? Is there use cases where a caller actually wants
encrypted io memory?
I saw your original patch series edited a few drivers this way, but
not nearly enough. So I feel like I'm missing something.. Does vfio
work with SME? I couldn't find any sign of it calling prot_decrypted()
either?
(BTW, I don't have any AMD SME systems to test on here, I'm getting
this bug report from deployed system, running a distro kernel)
Thanks,
Jason
On 10/19/20 12:00 PM, Jason Gunthorpe wrote:
> On Mon, Oct 19, 2020 at 11:36:16AM -0500, Tom Lendacky wrote:
>
>>> Is RDMA missing something? I don't see anything special in VFIO for
>>> instance and the two are very similar - does VFIO work with SME, eg
>>> DPDK or something unrelated to virtualization?
>>
>> If user space is mapping un-encrypted memory, then, yes, it would seem
>> that there is a gap in the support where the pgprot_decrypted() would be
>> needed in order to override the protection map.
>
> It isn't "memory" it is PCI BAR pages, eg memory mapped IO
Right, I understand that.
>
>>> Is there a reason not to just add prot_decrypted() to
>>> io_remap_pfn_range()? Is there use cases where a caller actually wants
>>> encrypted io memory?
>>
>> As long as you never have physical memory / ram being mapped in this path,
>> it seems that applying pgprot_decrypted() would be ok.
>
> I think the word 'io' implies this is the case..
Heh, you would think so, but I found quite a few things that used ioremap
instead of memremap when developing this.
>
> Let me make a patch for this avenue then, I think it is not OK to add
> pgprot_decrypted to every driver.. We already have the special
> distinction with io and non-io remap, that seems better.
Yup, seems reasonable.
>
>>> I saw your original patch series edited a few drivers this way, but
>>> not nearly enough. So I feel like I'm missing something.. Does vfio
>>> work with SME? I couldn't find any sign of it calling prot_decrypted()
>>> either?
>>
>> I haven't tested SME with VFIO/DPDK.
>
> Hum, I assume it is broken also. Actually quite a swath of drivers
> and devices will be broken under this :\
Not sure what you mean by the last statement - in general or when running
under VFIO/DPDK? In general, traditional in kernel drivers work just fine
under SME without any changes.
Thanks,
Tom
>
> Jason
>
On Mon, Oct 19, 2020 at 12:11:36PM -0500, Tom Lendacky wrote:
> > Hum, I assume it is broken also. Actually quite a swath of drivers
> > and devices will be broken under this :\
>
> Not sure what you mean by the last statement - in general or when running
> under VFIO/DPDK? In general, traditional in kernel drivers work just fine
> under SME without any changes.
Split user/kernel drivers are common enough. Looks like maybe ~50
drivers in the kernel potentially are mmaping IO memory so would be
broken here.
Looking for pgprot_noncached() or pgprot_writecombine() around VMA
mappings is a pretty good clue it is working on IO memory.
I checked through the infiniband ones and they seem to be using
io_remap vs remap properly, but other places may need fixing.
Jason
On 10/19/20 10:25 AM, Jason Gunthorpe wrote:
> Hi Tom,
Hi Jason,
>
> We've found a bug where systems that have the AMD SME turned on are
> not able to run RDMA work loads. It seems the kernel is automatically
> encrypting VMA's pointing at PCI BAR memory created by
> io_remap_pfn_range() - adding a prot_decrypted() causes things to
> start working.
>
> To me this is surprising, before I go adding random prot_decrypted()
> into the RDMA subsystem can you confirm this is actually how things
> are expected to work?
Yes, currently, the idea is that anything being done in user space is
mapped encrypted.
>
> Is RDMA missing something? I don't see anything special in VFIO for
> instance and the two are very similar - does VFIO work with SME, eg
> DPDK or something unrelated to virtualization?
If user space is mapping un-encrypted memory, then, yes, it would seem
that there is a gap in the support where the pgprot_decrypted() would be
needed in order to override the protection map.
>
> Is there a reason not to just add prot_decrypted() to
> io_remap_pfn_range()? Is there use cases where a caller actually wants
> encrypted io memory?
As long as you never have physical memory / ram being mapped in this path,
it seems that applying pgprot_decrypted() would be ok.
>
> I saw your original patch series edited a few drivers this way, but
> not nearly enough. So I feel like I'm missing something.. Does vfio
> work with SME? I couldn't find any sign of it calling prot_decrypted()
> either?
I haven't tested SME with VFIO/DPDK.
>
> (BTW, I don't have any AMD SME systems to test on here, I'm getting
> this bug report from deployed system, running a distro kernel)
As a work around, if the system has support for TSME (transparent SME),
then that can be enabled (it is a BIOS option that the BIOS vendor would
have had to expose) to encrypt all of the system memory without requiring
SME support.
Thanks,
Tom
>
> Thanks,
> Jason
>
On Mon, Oct 19, 2020 at 11:36:16AM -0500, Tom Lendacky wrote:
> > Is RDMA missing something? I don't see anything special in VFIO for
> > instance and the two are very similar - does VFIO work with SME, eg
> > DPDK or something unrelated to virtualization?
>
> If user space is mapping un-encrypted memory, then, yes, it would seem
> that there is a gap in the support where the pgprot_decrypted() would be
> needed in order to override the protection map.
It isn't "memory" it is PCI BAR pages, eg memory mapped IO
> > Is there a reason not to just add prot_decrypted() to
> > io_remap_pfn_range()? Is there use cases where a caller actually wants
> > encrypted io memory?
>
> As long as you never have physical memory / ram being mapped in this path,
> it seems that applying pgprot_decrypted() would be ok.
I think the word 'io' implies this is the case..
Let me make a patch for this avenue then, I think it is not OK to add
pgprot_decrypted to every driver.. We already have the special
distinction with io and non-io remap, that seems better.
> > I saw your original patch series edited a few drivers this way, but
> > not nearly enough. So I feel like I'm missing something.. Does vfio
> > work with SME? I couldn't find any sign of it calling prot_decrypted()
> > either?
>
> I haven't tested SME with VFIO/DPDK.
Hum, I assume it is broken also. Actually quite a swath of drivers
and devices will be broken under this :\
Jason
On Mon, Oct 19, 2020 at 11:36:16AM -0500, Tom Lendacky wrote:
> > io_remap_pfn_range()? Is there use cases where a caller actually wants
> > encrypted io memory?
>
> As long as you never have physical memory / ram being mapped in this path,
> it seems that applying pgprot_decrypted() would be ok.
I made a patch along these lines:
https://github.com/jgunthorpe/linux/commits/amd_sme_fix
Just waiting for the 0-day bots to check it
I now have a report that SME works OK but when the same test is done
inside a VM with SEV it fails again - is there something else needed
for the SEV case?
This would be using VFIO with qemu and KVM to assign the PCI device to
the guest, it seems the guest kernel driver is able to use the device
but the guest userspace fails.
Regards,
Jason
On 10/21/20 6:59 AM, Jason Gunthorpe wrote:
> On Mon, Oct 19, 2020 at 11:36:16AM -0500, Tom Lendacky wrote:
>
>>> io_remap_pfn_range()? Is there use cases where a caller actually wants
>>> encrypted io memory?
>>
>> As long as you never have physical memory / ram being mapped in this path,
>> it seems that applying pgprot_decrypted() would be ok.
>
> I made a patch along these lines:
>
> https://github.com/jgunthorpe/linux/commit/fc990842983f3530b72fcceafed84bd6075174a1
>
> Just waiting for the 0-day bots to check it
>
> I now have a report that SME works OK but when the same test is done
> inside a VM with SEV it fails again - is there something else needed
> for the SEV case?
Probably. I would assume that it is getting past the MMIO issue, since the
above patch should cover SEV, too. But, with SEV, all DMA to and from the
guest is unencrypted. I'm not familiar with how the DMA is setup and
performed in this situation, but if the DMA is occurring to userspace
buffers that are mapped as encrypted, then the resulting access will be
ciphertext (either reading unencrypted data from the device as encrypted
or writing encrypted data to the device that should be unencrypted). There
isn't currently an API to allow userspace to change its mapping from
encrypted to unencrypted.
>
> This would be using VFIO with qemu and KVM to assign the PCI device to
> the guest, it seems the guest kernel driver is able to use the device
> but the guest userspace fails.
In the kernel, the SWIOTLB support is used to bounce the data from
encrypted to unencrypted and vice-versa.
Thanks,
Tom
>
> Regards,
> Jason
>
On Wed, Oct 21, 2020 at 10:30:23AM -0500, Tom Lendacky wrote:
> On 10/21/20 6:59 AM, Jason Gunthorpe wrote:
> > On Mon, Oct 19, 2020 at 11:36:16AM -0500, Tom Lendacky wrote:
> >
> >>> io_remap_pfn_range()? Is there use cases where a caller actually wants
> >>> encrypted io memory?
> >>
> >> As long as you never have physical memory / ram being mapped in this path,
> >> it seems that applying pgprot_decrypted() would be ok.
> >
> > I made a patch along these lines:
> >
> > https://github.com/jgunthorpe/linux/commit/fc990842983f3530b72fcceafed84bd6075174a1
> >
> > Just waiting for the 0-day bots to check it
> >
> > I now have a report that SME works OK but when the same test is done
> > inside a VM with SEV it fails again - is there something else needed
> > for the SEV case?
>
> Probably. I would assume that it is getting past the MMIO issue, since the
> above patch should cover SEV, too. But, with SEV, all DMA to and from the
> guest is unencrypted. I'm not familiar with how the DMA is setup and
> performed in this situation, but if the DMA is occurring to userspace
> buffers that are mapped as encrypted, then the resulting access will be
> ciphertext (either reading unencrypted data from the device as encrypted
> or writing encrypted data to the device that should be unencrypted). There
> isn't currently an API to allow userspace to change its mapping from
> encrypted to unencrypted.
Oh, interesting.. Yes the issue is no userspace DMA stuff uses the DMA
API correctly (because it is in userspace)
So SWIOTLB tricks don't work, I wish the dma_map could fail for these
situations
I would have guessed it used some vIOMMU and setup decrpytion just
like the host does..
Thanks,
Jason
On Wed, Oct 21, 2020 at 01:03:22PM -0300, Jason Gunthorpe wrote:
> Oh, interesting.. Yes the issue is no userspace DMA stuff uses the DMA
> API correctly (because it is in userspace)
>
> So SWIOTLB tricks don't work, I wish the dma_map could fail for these
> situations
Userspace DMA by definition also does not use dma_map..
On Tue, Oct 27, 2020 at 08:43:57AM +0000, Christoph Hellwig wrote:
> On Wed, Oct 21, 2020 at 01:03:22PM -0300, Jason Gunthorpe wrote:
> > Oh, interesting.. Yes the issue is no userspace DMA stuff uses the DMA
> > API correctly (because it is in userspace)
> >
> > So SWIOTLB tricks don't work, I wish the dma_map could fail for these
> > situations
>
> Userspace DMA by definition also does not use dma_map..
? Sure it does, ib_dma_map_sg_attrs() is what RDMA uses
What all the userspace users skip is the dma_sync*() - that would
require a kernel call which defeats the point.
So, my desire is some flag to dma_map_sg() that says
'user space mapping no dma_sync_*'
ie dma_sync_* is a NOP
Then things like SWIOTLB on the SEV system can fail with an error code
instead of malfunctioning
If FOLL_LONGERM is some estimate of this pattern then we have these users:
- drivers/infiniband
- v4l
- vdpa
- xdp
- rds
- habana labs
Jason