Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage of
__copy_user_nocache() with explicit cache management in these unaligned
cases.
Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.
Fixes: 5de490daec8b ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: <[email protected]>
Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
[toshi: trailing bytes flush only needed in the 4B misalign case]
Signed-off-by: Dan Williams <[email protected]>
---
v2: Change the condition for flushing the last cacheline of the
destination from 8-byte to 4-byte misalignment (Toshi)
arch/x86/include/asm/pmem.h | 41 ++++++++++++++++++++++++++++++-----------
1 file changed, 30 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 2c1ebeb4d737..cf4e68faedc4 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -55,7 +55,8 @@ static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
  * @size: number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
- * instruction.
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
  */
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
@@ -69,15 +70,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
 		clwb(p);
 }
 
-/*
- * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
- * iterators, so for other types (bvec & kvec) we must do a cache write-back.
- */
-static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
-{
-	return iter_is_iovec(i) == false;
-}
-
 /**
  * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
  * @addr: PMEM destination address
@@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
 	/* TODO: skip the write-back by always using non-temporal stores */
 	len = copy_from_iter_nocache(addr, bytes, i);
 
-	if (__iter_needs_pmem_wb(i))
+	/*
+	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
+	 * non-temporal stores for the bulk of the transfer, but we need
+	 * to manually flush if the transfer is unaligned. In the
+	 * non-iovec case the entire destination needs to be flushed.
+	 */
+	if (iter_is_iovec(i)) {
+		unsigned long dest = (unsigned long) addr;
+
+		/*
+		 * If the destination is not 8-byte aligned then
+		 * __copy_user_nocache (on x86_64) uses cached copies
+		 */
+		if (dest & 8) {
+			arch_wb_cache_pmem(addr, 1);
+			dest = ALIGN(dest, 8);
+		}
+
+		/*
+		 * If the remaining transfer length, after accounting
+		 * for destination alignment, is not 4-byte aligned
+		 * then __copy_user_nocache() falls back to cached
+		 * copies for the trailing bytes in the final cacheline
+		 * of the transfer.
+		 */
+		if ((bytes - (dest - (unsigned long) addr)) & 4)
+			arch_wb_cache_pmem(addr + bytes - 1, 1);
+	} else
 		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
> Subject: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-
> bypass assumptions
>
> Before we rework the "pmem api" to stop abusing __copy_user_nocache()
> for memcpy_to_pmem() we need to fix cases where we may strand dirty data
> in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
> for arbitrary data transfers from userspace. There is no guarantee that
> these transfers, performed by dax_iomap_actor(), will have aligned
> destinations or aligned transfer lengths. Backstop the usage of
> __copy_user_nocache() with explicit cache management in these unaligned
> cases.
>
> Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
> that is saved for a later patch that moves the entirety of the "pmem
> api" into the pmem driver directly.
:
> ---
> v2: Change the condition for flushing the last cacheline of the
> destination from 8-byte to 4-byte misalignment (Toshi)
:
> arch/x86/include/asm/pmem.h | 41 ++++++++++++++++++++++++++++++----
:
> @@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
> 	/* TODO: skip the write-back by always using non-temporal stores */
> 	len = copy_from_iter_nocache(addr, bytes, i);
>
> -	if (__iter_needs_pmem_wb(i))
> +	/*
> +	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
> +	 * non-temporal stores for the bulk of the transfer, but we need
> +	 * to manually flush if the transfer is unaligned. In the
> +	 * non-iovec case the entire destination needs to be flushed.
> +	 */
> +	if (iter_is_iovec(i)) {
> +		unsigned long dest = (unsigned long) addr;
> +
> +		/*
> +		 * If the destination is not 8-byte aligned then
> +		 * __copy_user_nocache (on x86_64) uses cached copies
> +		 */
> +		if (dest & 8) {
> +			arch_wb_cache_pmem(addr, 1);
> +			dest = ALIGN(dest, 8);
> +		}
> +
> +		/*
> +		 * If the remaining transfer length, after accounting
> +		 * for destination alignment, is not 4-byte aligned
> +		 * then __copy_user_nocache() falls back to cached
> +		 * copies for the trailing bytes in the final cacheline
> +		 * of the transfer.
> +		 */
> +		if ((bytes - (dest - (unsigned long) addr)) & 4)
> +			arch_wb_cache_pmem(addr + bytes - 1, 1);
> +	} else
> 		arch_wb_cache_pmem(addr, bytes);
>
> 	return len;
Thanks for the update. I think the alignment check should be based on
the following note in copy_user_nocache.
* Note: Cached memory copy is used when destination or size is not
* naturally aligned. That is:
* - Require 8-byte alignment when size is 8 bytes or larger.
* - Require 4-byte alignment when size is 4 bytes.
So, I think the code may be something like this. I also made the following changes:
- Mask with 7, not 8.
- ALIGN with cacheline size, instead of 8.
- Add a (bytes > flushed) test, since the subtraction is done with unsigned longs and a negative result wraps around to a large positive value.
if (bytes < 8) {
	if ((dest & 3) || (bytes != 4))
		arch_wb_cache_pmem(addr, 1);
} else {
	if (dest & 7) {
		dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
		arch_wb_cache_pmem(addr, 1);
	}
	flushed = dest - (unsigned long) addr;
	if ((bytes > flushed) && ((bytes - flushed) & 7))
		arch_wb_cache_pmem(addr + bytes - 1, 1);
}
Thanks,
-Toshi
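For readers following along, a minimal userspace sketch of the checks proposed above, run against a few hypothetical inputs; wb_line() only reports which line arch_wb_cache_pmem() would be asked to write back, and a 64-byte clflush size is assumed rather than read from boot_cpu_data:

#include <stdio.h>
#include <stdint.h>

#define CLFLUSH_SIZE	64UL
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* stand-in for arch_wb_cache_pmem(): report the line that would be written back */
static void wb_line(uintptr_t addr)
{
	printf("  write back cacheline containing %#lx\n", (unsigned long) addr);
}

/* the decision logic proposed above, for a copy of 'bytes' to 'dest' */
static void flush_unaligned(uintptr_t dest, size_t bytes)
{
	uintptr_t addr = dest;
	size_t flushed;

	printf("dest=%#lx bytes=%zu\n", (unsigned long) dest, bytes);
	if (bytes < 8) {
		/* below 8 bytes only an aligned 4-byte store bypasses the cache */
		if ((dest & 3) || (bytes != 4))
			wb_line(addr);
	} else {
		/* a misaligned head is written with cached stores */
		if (dest & 7) {
			dest = ALIGN(dest, CLFLUSH_SIZE);
			wb_line(addr);
		}
		/* trailing bytes may also be written with cached stores */
		flushed = dest - addr;
		if ((bytes > flushed) && ((bytes - flushed) & 7))
			wb_line(addr + bytes - 1);
	}
}

int main(void)
{
	flush_unaligned(0x1000, 256);	/* aligned start and length: no write-back */
	flush_unaligned(0x1004, 17);	/* misaligned head, tail in the same line: one */
	flush_unaligned(0x1004, 256);	/* misaligned head and trailing bytes: two */
	return 0;
}

With these inputs the 17-byte case gets a single write-back, because rounding dest up to the line boundary makes 'flushed' exceed 'bytes'; the reason for rounding to the line size rather than to 8 bytes is discussed below.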
On Mon, Apr 10, 2017 at 11:53 AM, Kani, Toshimitsu <[email protected]> wrote:
>> Subject: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-
>> bypass assumptions
>>
>> Before we rework the "pmem api" to stop abusing __copy_user_nocache()
>> for memcpy_to_pmem() we need to fix cases where we may strand dirty data
>> in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
>> for arbitrary data transfers from userspace. There is no guarantee that
>> these transfers, performed by dax_iomap_actor(), will have aligned
>> destinations or aligned transfer lengths. Backstop the usage of
>> __copy_user_nocache() with explicit cache management in these unaligned
>> cases.
>>
>> Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
>> that is saved for a later patch that moves the entirety of the "pmem
>> api" into the pmem driver directly.
> :
>> ---
>> v2: Change the condition for flushing the last cacheline of the
>> destination from 8-byte to 4-byte misalignment (Toshi)
> :
>> arch/x86/include/asm/pmem.h | 41 ++++++++++++++++++++++++++++++----
> :
>> @@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
>> 	/* TODO: skip the write-back by always using non-temporal stores */
>> 	len = copy_from_iter_nocache(addr, bytes, i);
>>
>> -	if (__iter_needs_pmem_wb(i))
>> +	/*
>> +	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
>> +	 * non-temporal stores for the bulk of the transfer, but we need
>> +	 * to manually flush if the transfer is unaligned. In the
>> +	 * non-iovec case the entire destination needs to be flushed.
>> +	 */
>> +	if (iter_is_iovec(i)) {
>> +		unsigned long dest = (unsigned long) addr;
>> +
>> +		/*
>> +		 * If the destination is not 8-byte aligned then
>> +		 * __copy_user_nocache (on x86_64) uses cached copies
>> +		 */
>> +		if (dest & 8) {
>> +			arch_wb_cache_pmem(addr, 1);
>> +			dest = ALIGN(dest, 8);
>> +		}
>> +
>> +		/*
>> +		 * If the remaining transfer length, after accounting
>> +		 * for destination alignment, is not 4-byte aligned
>> +		 * then __copy_user_nocache() falls back to cached
>> +		 * copies for the trailing bytes in the final cacheline
>> +		 * of the transfer.
>> +		 */
>> +		if ((bytes - (dest - (unsigned long) addr)) & 4)
>> +			arch_wb_cache_pmem(addr + bytes - 1, 1);
>> +	} else
>> 		arch_wb_cache_pmem(addr, bytes);
>>
>> 	return len;
>
> Thanks for the update. I think the alignment check should be based on
> the following note in copy_user_nocache.
>
> * Note: Cached memory copy is used when destination or size is not
> * naturally aligned. That is:
> * - Require 8-byte alignment when size is 8 bytes or larger.
> * - Require 4-byte alignment when size is 4 bytes.
>
> So, I think the code may be something like this. I also made the following changes:
Thanks!
> - Mask with 7, not 8.
Yes, good catch.
> - ALIGN with cacheline size, instead of 8.
> - Add (bytes > flushed) test since calculation with unsigned long still results in a negative
> value (as a positive value).
>
> if (bytes < 8) {
> 	if ((dest & 3) || (bytes != 4))
> 		arch_wb_cache_pmem(addr, 1);
> } else {
> 	if (dest & 7) {
> 		dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
Why align the destination to the next cacheline? As far as I can see
the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
to the next 8-byte boundary.
> > Thanks for the update. I think the alignment check should be based on
> > the following note in copy_user_nocache.
> >
> > * Note: Cached memory copy is used when destination or size is not
> > * naturally aligned. That is:
> > * - Require 8-byte alignment when size is 8 bytes or larger.
> > * - Require 4-byte alignment when size is 4 bytes.
> >
> > So, I think the code may be something like this. I also made the following
> changes:
>
> Thanks!
>
> > - Mask with 7, not 8.
>
> Yes, good catch.
>
> > - ALIGN with cacheline size, instead of 8.
> > - Add (bytes > flushed) test since calculation with unsigned long still results
> in a negative
> > value (as a positive value).
> >
> > if (bytes < 8) {
> > 	if ((dest & 3) || (bytes != 4))
> > 		arch_wb_cache_pmem(addr, 1);
> > } else {
> > 	if (dest & 7) {
> > 		dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
>
> Why align the destination to the next cacheline? As far as I can see
> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
> to the next 8-byte boundary.
The clflush here flushes the full cacheline. So we do not need to flush
the same cacheline again when the unaligned tail is in the same line.
Thanks,
-Toshi
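To put numbers on that, a standalone sketch (assuming a 64-byte clflush size and a hypothetical 17-byte copy to address 0x1004): rounding the head up only to the next 8-byte boundary leaves the tail check armed and flushes the same line twice, while rounding to the line size does not.

#include <stdio.h>
#include <stdint.h>

#define ALIGN(x, a)	(((x) + (a) - 1) & ~((uintptr_t) (a) - 1))

int main(void)
{
	uintptr_t addr = 0x1004;	/* misaligned destination */
	size_t bytes = 17;		/* transfer length, tail not 8-byte aligned */

	/* rounding up to 8: the tail test fires and flushes the same
	 * 64-byte line (0x1000-0x103f) the head flush already covered */
	size_t flushed8 = ALIGN(addr, 8) - addr;	/* 4 */
	/* rounding up to the clflush size: everything up to the next
	 * line boundary counts as covered by the head flush */
	size_t flushed64 = ALIGN(addr, 64) - addr;	/* 60 */

	printf("8-byte ALIGN: flushed=%zu, tail flush: %s\n", flushed8,
	       (bytes > flushed8) && ((bytes - flushed8) & 7) ? "yes (redundant)" : "no");
	printf("line ALIGN:   flushed=%zu, tail flush: %s\n", flushed64,
	       (bytes > flushed64) && ((bytes - flushed64) & 7) ? "yes" : "no (same line)");
	return 0;
}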
On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu <[email protected]> wrote:
>> > Thanks for the update. I think the alignment check should be based on
>> > the following note in copy_user_nocache.
>> >
>> > * Note: Cached memory copy is used when destination or size is not
>> > * naturally aligned. That is:
>> > * - Require 8-byte alignment when size is 8 bytes or larger.
>> > * - Require 4-byte alignment when size is 4 bytes.
>> >
>> > So, I think the code may be something like this. I also made the following
>> changes:
>>
>> Thanks!
>>
>> > - Mask with 7, not 8.
>>
>> Yes, good catch.
>>
>> > - ALIGN with cacheline size, instead of 8.
>> > - Add (bytes > flushed) test since calculation with unsigned long still results
>> in a negative
>> > value (as a positive value).
>> >
>> > if (bytes < 8) {
>> > 	if ((dest & 3) || (bytes != 4))
>> > 		arch_wb_cache_pmem(addr, 1);
>> > } else {
>> > 	if (dest & 7) {
>> > 		dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
>>
>> Why align the destination to the next cacheline? As far as I can see
>> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
>> to the next 8-byte boundary.
>
> The clflush here flushes for the cacheline size. So, we do not need to flush
> the same cacheline again when the unaligned tail is in the same line.
Ok, makes sense. Last question, can't we reduce the check to be:
if ((bytes > flushed) && ((bytes - flushed) & 3))
...since if 'bytes' was 4-byte aligned we would have performed
non-temporal stores.
Can I add your Signed-off-by: on v3?
> On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu <[email protected]>
> wrote:
> >> > Thanks for the update. I think the alignment check should be based on
> >> > the following note in copy_user_nocache.
> >> >
> >> > * Note: Cached memory copy is used when destination or size is not
> >> > * naturally aligned. That is:
> >> > * - Require 8-byte alignment when size is 8 bytes or larger.
> >> > * - Require 4-byte alignment when size is 4 bytes.
> >> >
> >> > So, I think the code may be something like this. I also made the following
> >> changes:
> >>
> >> Thanks!
> >>
> >> > - Mask with 7, not 8.
> >>
> >> Yes, good catch.
> >>
> >> > - ALIGN with cacheline size, instead of 8.
> >> > - Add (bytes > flushed) test since calculation with unsigned long still
> results
> >> in a negative
> >> > value (as a positive value).
> >> >
> >> > if (bytes < 8) {
> >> > 	if ((dest & 3) || (bytes != 4))
> >> > 		arch_wb_cache_pmem(addr, 1);
> >> > } else {
> >> > 	if (dest & 7) {
> >> > 		dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
> >>
> >> Why align the destination to the next cacheline? As far as I can see
> >> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
> >> to the next 8-byte boundary.
> >
> > The clflush here flushes for the cacheline size. So, we do not need to flush
> > the same cacheline again when the unaligned tail is in the same line.
>
> Ok, makes sense. Last question, can't we reduce the check to be:
>
> if ((bytes > flushed) && ((bytes - flushed) & 3))
>
> ...since if 'bytes' was 4-byte aligned we would have performed
> non-temporal stores.
That is not the documented behavior of copy_user_nocache, but as long as the pmem
version of copy_user_nocache follows the same implemented behavior, yes, that
works.
> Can I add your Signed-off-by: on v3?
Sure.
Thanks,
-Toshi
On Mon, Apr 10, 2017 at 2:45 PM, Kani, Toshimitsu <[email protected]> wrote:
>> On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu <[email protected]>
>> wrote:
>> >> > Thanks for the update. I think the alignment check should be based on
>> >> > the following note in copy_user_nocache.
>> >> >
>> >> > * Note: Cached memory copy is used when destination or size is not
>> >> > * naturally aligned. That is:
>> >> > * - Require 8-byte alignment when size is 8 bytes or larger.
>> >> > * - Require 4-byte alignment when size is 4 bytes.
>> >> >
>> >> > So, I think the code may be something like this. I also made the following
>> >> changes:
>> >>
>> >> Thanks!
>> >>
>> >> > - Mask with 7, not 8.
>> >>
>> >> Yes, good catch.
>> >>
>> >> > - ALIGN with cacheline size, instead of 8.
>> >> > - Add (bytes > flushed) test since calculation with unsigned long still
>> results
>> >> in a negative
>> >> > value (as a positive value).
>> >> >
>> >> > if (bytes < 8) {
>> >> > 	if ((dest & 3) || (bytes != 4))
>> >> > 		arch_wb_cache_pmem(addr, 1);
>> >> > } else {
>> >> > 	if (dest & 7) {
>> >> > 		dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
>> >>
>> >> Why align the destination to the next cacheline? As far as I can see
>> >> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
>> >> to the next 8-byte boundary.
>> >
>> > The clflush here flushes for the cacheline size. So, we do not need to flush
>> > the same cacheline again when the unaligned tail is in the same line.
>>
>> Ok, makes sense. Last question, can't we reduce the check to be:
>>
>> if ((bytes > flushed) && ((bytes - flushed) & 3))
>>
>> ...since if 'bytes' was 4-byte aligned we would have performed
>> non-temporal stores.
>
> That is not documented behavior of copy_user_nocache, but as long as the pmem
> version of copy_user_nocache follows the same implemented behavior, yes, that
> works.
Hmm, sorry, this comment confuses me; I'm only referring to the current
version of __copy_user_nocache, not the new pmem version. The way I
read the current code, we only ever jump to the cached copy loop
(.L_1b_cache_copy_loop) if the trailing byte-count is 4-byte
misaligned.
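Concretely (hypothetical trailing lengths, following the reading above where whole 8-byte and 4-byte chunks are written with non-temporal stores and only a 1-3 byte remainder takes the cached path):

#include <stdio.h>
#include <stddef.h>

/* model of the trailing-byte handling in __copy_user_nocache() as read above */
static int tail_uses_cached_stores(size_t tail)
{
	return (tail & 3) != 0;
}

int main(void)
{
	size_t tails[] = { 12, 13 };
	size_t i;

	for (i = 0; i < sizeof(tails) / sizeof(tails[0]); i++)
		printf("trailing %zu bytes: %s\n", tails[i],
		       tail_uses_cached_stores(tails[i]) ?
		       "cached stores used, flush the final cacheline" :
		       "all non-temporal, no flush needed");
	return 0;
}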
> >> > The clflush here flushes for the cacheline size. So, we do not need to
> flush
> >> > the same cacheline again when the unaligned tail is in the same line.
> >>
> >> Ok, makes sense. Last question, can't we reduce the check to be:
> >>
> >> if ((bytes > flushed) && ((bytes - flushed) & 3))
> >>
> >> ...since if 'bytes' was 4-byte aligned we would have performed
> >> non-temporal stores.
> >
> > That is not documented behavior of copy_user_nocache, but as long as the
> pmem
> > version of copy_user_nocache follows the same implemented behavior, yes,
> that
> > works.
>
> Hmm, sorry this comment confuses me, I'm only referring to the current
> version of __copy_user_nocache not the new pmem version. The way I
> read the current code we only ever jump to the cached copy loop
> (.L_1b_cache_copy_loop) if the trailing byte-count is 4-byte
> misaligned.
Yes, you are right, and that is how the code is implemented. I added this trailing
4-byte handling for the >=8B case, which is shared with the <8B case, since it was
easy to do, but I considered it a bonus. For the function to state that it handles
4-byte alignment in the >=8B case as well, it would also need to handle a 4B-aligned
destination; otherwise it is inconsistent. Since I did not see much point in supporting
such a case, I simply documented in the Note that 8-byte alignment is required for
the >=8B case.
Thanks,
-Toshi
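Pulling the points above together (mask with 7 rather than 8, round the head up to the clflush size, guard the subtraction with a bytes > flushed test, and use the relaxed '& 3' tail check), the flush backstop would look roughly like the sketch below. This is only an illustration of the discussion, not the actual v3 patch: CLFLUSH_SIZE and the empty arch_wb_cache_pmem() stub stand in for boot_cpu_data.x86_clflush_size and the real clwb loop.

#include <stddef.h>
#include <stdint.h>

#define CLFLUSH_SIZE	64UL		/* stand-in for boot_cpu_data.x86_clflush_size */
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* stub: the real helper writes back every cacheline in the range with clwb */
static void arch_wb_cache_pmem(void *addr, size_t size)
{
	(void) addr;
	(void) size;
}

/*
 * Write back the cachelines that __copy_user_nocache() may have dirtied
 * for an unaligned transfer, per the alignment rules discussed above.
 */
static void flush_unaligned_pmem(void *addr, size_t bytes)
{
	unsigned long dest = (unsigned long) addr;
	unsigned long flushed;

	if (bytes < 8) {
		/* only an aligned 4-byte copy bypasses the cache entirely */
		if ((dest & 3) || (bytes != 4))
			arch_wb_cache_pmem(addr, 1);
	} else {
		/* a misaligned head is written with cached stores */
		if (dest & 7) {
			dest = ALIGN(dest, CLFLUSH_SIZE);
			arch_wb_cache_pmem(addr, 1);
		}
		/* a 1-3 byte tail is written with cached stores */
		flushed = dest - (unsigned long) addr;
		if (bytes > flushed && ((bytes - flushed) & 3))
			arch_wb_cache_pmem((char *) addr + bytes - 1, 1);
	}
}

int main(void)
{
	_Alignas(64) char buf[256];

	flush_unaligned_pmem(buf + 5, 100);	/* misaligned head plus a cached 1-byte tail */
	return 0;
}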