Hi,
I'm building a small accelerator card that should provide crypto
primitives, and I'm wondering how large data transfers from and to
userspace are supposed to work -- especially if these are file backed
and larger than available memory.
For testing, I've created an 8GB random file, and used kcapi-dgst on it:
$ strace kcapi-dgst -c sha256 -i test8G.bin --hex
[...]
openat(AT_FDCWD, 0x7ffc7e4b5896, O_RDONLY|O_CLOEXEC) = 6
fstat(6, 0x7ffc7e4a5da0) = 0
mmap(NULL, 8589934592, PROT_READ, MAP_SHARED, 6, 0) = 0x7f8d911cf000
accept(3, NULL, NULL) = 7
sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE) = 2147479552
vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE) = 2147479552
vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE) = 2147479552
vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
sendmsg(7, 0x7ffc7e4a5ca0, MSG_MORE) = 2147479552
vmsplice(5, 0x7ffc7e4a5d00, 1, SPLICE_F_MORE|SPLICE_F_GIFT) = 4095
splice(4, NULL, 7, NULL, 4095, SPLICE_F_MORE) = 4095
sendto(7, 0x7f8f911ceffc, 4, MSG_MORE, NULL, 0) = 4
recvmsg(7, 0x7ffc7e4a5cd0, 0) = 32
fstat(1, 0x7ffc7e4a5bc0) = 0
munmap(0x7f8d911cf000, 0) = -1 EINVAL (Invalid argument)
This seems wrong to me:
- Every sendmsg() call transfers 2147479552 bytes, i.e. 2GB - 4kB. That
probably makes sense when trying to keep every transfer page aligned.
- Each vmsplice()/splice() pair transfers 4095 bytes -- that would likely
trigger a copy and leave the file offset unaligned afterwards.
- The last sendto() call then cleans up the remaining four bytes and
still uses MSG_MORE.
- The munmap() call is passed a length of 0 and fails with EINVAL, so the
mapping is never released.
Is that the optimal way to transfer data from disk to an ahash?
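The byte counts at least add up. Here is a quick sanity check, assuming that
kcapi-dgst asks for at most INT_MAX bytes per request and that the kernel
caps a single send at INT_MAX rounded down to a page boundary. Both are my
guesses from the trace, with 4kB pages; split_transfer() and struct split
are just names I made up for this check:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on my x86-64 workstation. */
#define PAGE_SZ 4096ULL

/* Assumption: one send is capped at INT_MAX rounded down to a page. */
#define KERNEL_CAP ((uint64_t)INT_MAX & ~(PAGE_SZ - 1))

struct split {
	uint64_t full_chunks;	/* number of INT_MAX-sized requests */
	uint64_t per_sendmsg;	/* bytes accepted by each sendmsg() */
	uint64_t per_splice;	/* leftover handed to vmsplice()/splice() */
	uint64_t tail;		/* bytes left for the final sendto() */
};

/* Assumption: userspace requests at most INT_MAX bytes at a time. */
static struct split split_transfer(uint64_t total)
{
	const uint64_t request = (uint64_t)INT_MAX;
	struct split s;

	s.full_chunks = total / request;
	s.per_sendmsg = KERNEL_CAP;
	s.per_splice = request - KERNEL_CAP;
	s.tail = total % request;

	/* Every byte must be accounted for exactly once. */
	assert(s.full_chunks * (s.per_sendmsg + s.per_splice) + s.tail == total);
	return s;
}
```

For the 8GB file this gives four requests of 2147479552 + 4095 bytes and a
tail of 4 bytes, which matches the trace. So the odd sizes look like a
side effect of the request size rather than of the file.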
Now, my PCIe device can operate directly on DMA memory. The way I've
understood the crypto API, the "src" scatterlist can be mapped using
dma_map_sg(), so the data must somehow be in DMA-capable memory at that
point. That makes me suspect that the data was copied several times in
between, as the result of mmap() is unsuitable for DMA.
crypto+mm questions so far:
- How does flow control work for the 2GB sendmsg(mmap()) if the data
needs to be made available for DMA -- presumably I can't dma_map_sg()
all of the pages if I only have 4 GB of physical memory?
- Is there a zerocopy path for disk->crypto that can be used with
large data blobs?
- Are there suitable paths for crypto->disk (for encryption and
compression)?
- If the device implements PCIe Address Translation and Page Request
Interface, can I use the IOMMU to pin pages instead of doing that in a
driver, i.e. can a crypto driver indicate that the scatterlist can refer
to virtual memory that need not be pinned or even present yet, and can
this be used to avoid copies or partial mappings?
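For comparison, the copying baseline I would fall back to looks roughly
like the sketch below: read the file into a fixed 64kB buffer and push each
chunk into the AF_ALG hash socket with MSG_MORE, so only one buffer's worth
of data is in flight at a time. sha256_fd() and the buffer size are my own
choices, not anything taken from libkcapi, and error handling is minimal:

```c
#include <assert.h>
#include <linux/if_alg.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hash everything readable from in_fd with SHA-256 via AF_ALG, using a
 * fixed-size bounce buffer. Returns 0 on success, -1 on any error.
 * (Hypothetical helper; name and buffer size are illustrative only.)
 */
static int sha256_fd(int in_fd, unsigned char digest[32])
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type = "hash",
		.salg_name = "sha256",
	};
	unsigned char buf[65536];
	int tfm, op = -1, ret = -1;
	ssize_t n;

	assert(digest != NULL);

	tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
	if (tfm < 0)
		return -1;
	if (bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		goto out;
	op = accept(tfm, NULL, NULL);
	if (op < 0)
		goto out;

	while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
		ssize_t off = 0;

		while (off < n) {
			/* MSG_MORE: more data follows, do not finalize yet. */
			ssize_t sent = send(op, buf + off, (size_t)(n - off),
					    MSG_MORE);
			if (sent < 0)
				goto out;
			off += sent;
		}
	}
	if (n < 0)
		goto out;

	/* Reading from the op socket finalizes the hash. */
	if (read(op, digest, 32) != 32)
		goto out;
	ret = 0;
out:
	if (op >= 0)
		close(op);
	close(tfm);
	return ret;
}
```

This bounds memory use, but it costs one copy from the page cache into the
buffer per chunk, which is exactly what I would like to avoid.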
Crypto only questions so far:
- The ahash interface still seems to expect the result to be filled
out on return, when I kind of expected it to wait for me to invoke a
completion callback. Am I missing something, or do I need to suspend the
current thread and wake it up from an interrupt? Can I somehow report
completion from an interrupt handler? Does it make sense to make
interrupts CPU affine?
- The result pointer for ahash points to vmalloc()ed memory -- is
there a way to get a DMA buffer instead? (Not that there's a performance
difference here, but space in the result DMA buffer is another resource
I need to track otherwise.)
- The POWER9 NX driver has a separate interface for gzip
compression/decompression of large blobs. Is there a technical reason
why it cannot implement the crypto API?
Basically my goal is to have fast gzip compression and decompression
support with the same interface on both of my workstations, one of which
has an FPGA card, and the other has two POWER9 CPUs with NX. :)
Simon