LinuxLists.cc - [PATCH v4] memregion: Add cpu_cache_invalidate

2022-10-25 20:40:04

Subject: [PATCH v4] memregion: Add cpu_cache_invalidate_memregion() interface

From: Davidlohr Bueso <[email protected]>

With CXL security features, global CPU cache flushing nvdimm requirements
are no longer specific to that subsystem, even beyond the scope of
security_ops. CXL will need such semantics for features not necessarily
limited to persistent memory.

The functionality this is enabling is to be able to instantaneously
secure erase potentially terabytes of memory at once and the kernel
needs to be sure that none of the data from before the erase is still
present in the cache. It is also used when unlocking a memory device
where speculative reads and firmware accesses could have cached poison
from before the device was unlocked. Lastly this facility is used when
mapping new devices, or new capacity into an established physical
address range. I.e. when the driver switches DeviceA mapping AddressX to
DeviceB mapping AddressX then any cached data from DeviceA:AddressX
needs to be invalidated.

This capability is typically only used once per-boot (for unlock), or
once per bare metal provisioning event (secure erase), like when handing
off the system to another tenant or decommissioning a device. It may
also be used for dynamic CXL region provisioning.

Users must first call cpu_cache_has_invalidate_memregion() to know whether
this functionality is available on the architecture. Only enable it on
x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX
guests may trigger a virtualization exception and may need proper handling
to recover. See:

e2efb6359e62 ("ACPICA: Avoid cache flush inside virtual machines")

This global cache invalidation facility,
cpu_cache_invalidate_memregion(), is exported to modules since NVDIMM
and CXL support can be built as a module. However, the intent is that
this facility is not available outside of specific "device-memory" use
cases. To that end the API is scoped to a new "DEVMEM" module namespace
that only applies to the NVDIMM and CXL subsystems.

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: [email protected]
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
Co-developed-by: Dan Williams <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Changes since v3: https://lore.kernel.org/r/[email protected]
- Add more details about the use of this facility in the memory
provisioning flow in the changelog
- Introduce the DEVMEM symbol namespace to indicate that this facility is not
for general consumption (Dave H)
- Update the kernel-doc for cpu_cache_invalidate_memregion() to clarify
that it is not to be used outside of the CXL and NVDIMM subsystems (Dave H)

arch/x86/Kconfig | 1 +
arch/x86/mm/pat/set_memory.c | 15 +++++++++++++++
drivers/acpi/nfit/intel.c | 43 ++++++++++++++++++++----------------------
include/linux/memregion.h | 37 ++++++++++++++++++++++++++++++++++++
lib/Kconfig | 3 +++
5 files changed, 76 insertions(+), 23 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 67745ceab0db..b68661d0633b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,6 +69,7 @@ config X86
select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
select ARCH_HAS_CACHE_LINE_SIZE
+ select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION if X86_64
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEBUG_VM_PGTABLE if !X86_PAE
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 97342c42dda8..8650bb6481a8 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size)
EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
#endif

+#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
+bool cpu_cache_has_invalidate_memregion(void)
+{
+ return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
+}
+EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);
+
+int cpu_cache_invalidate_memregion(int res_desc)
+{
+ wbinvd_on_all_cpus();
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cpu_cache_invalidate_memregion, DEVMEM);
+#endif
+
static void __cpa_flush_all(void *arg)
{
unsigned long cache = (unsigned long)arg;
diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index 8dd792a55730..fa0e57e35162 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -3,6 +3,7 @@
#include <linux/libnvdimm.h>
#include <linux/ndctl.h>
#include <linux/acpi.h>
+#include <linux/memregion.h>
#include <asm/smp.h>
#include "intel.h"
#include "nfit.h"
@@ -190,8 +191,6 @@ static int intel_security_change_key(struct nvdimm *nvdimm,
}
}

-static void nvdimm_invalidate_cache(void);
-
static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
const struct nvdimm_key_data *key_data)
{
@@ -213,6 +212,9 @@ static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
if (!test_bit(NVDIMM_INTEL_UNLOCK_UNIT, &nfit_mem->dsm_mask))
return -ENOTTY;

+ if (!cpu_cache_has_invalidate_memregion())
+ return -EINVAL;
+
memcpy(nd_cmd.cmd.passphrase, key_data->data,
sizeof(nd_cmd.cmd.passphrase));
rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
@@ -228,7 +230,7 @@ static int __maybe_unused intel_security_unlock(struct nvdimm *nvdimm,
}

/* DIMM unlocked, invalidate all CPU caches before we read it */
- nvdimm_invalidate_cache();
+ cpu_cache_invalidate_memregion(IORES_DESC_PERSISTENT_MEMORY);

return 0;
}
@@ -297,8 +299,11 @@ static int __maybe_unused intel_security_erase(struct nvdimm *nvdimm,
if (!test_bit(cmd, &nfit_mem->dsm_mask))
return -ENOTTY;

+ if (!cpu_cache_has_invalidate_memregion())
+ return -EINVAL;
+
/* flush all cache before we erase DIMM */
- nvdimm_invalidate_cache();
+ cpu_cache_invalidate_memregion(IORES_DESC_PERSISTENT_MEMORY);
memcpy(nd_cmd.cmd.passphrase, key->data,
sizeof(nd_cmd.cmd.passphrase));
rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
@@ -318,7 +323,7 @@ static int __maybe_unused intel_security_erase(struct nvdimm *nvdimm,
}

/* DIMM erased, invalidate all CPU caches before we read it */
- nvdimm_invalidate_cache();
+ cpu_cache_invalidate_memregion(IORES_DESC_PERSISTENT_MEMORY);
return 0;
}

@@ -341,6 +346,9 @@ static int __maybe_unused intel_security_query_overwrite(struct nvdimm *nvdimm)
if (!test_bit(NVDIMM_INTEL_QUERY_OVERWRITE, &nfit_mem->dsm_mask))
return -ENOTTY;

+ if (!cpu_cache_has_invalidate_memregion())
+ return -EINVAL;
+
rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
if (rc < 0)
return rc;
@@ -355,7 +363,7 @@ static int __maybe_unused intel_security_query_overwrite(struct nvdimm *nvdimm)
}

/* flush all cache before we make the nvdimms available */
- nvdimm_invalidate_cache();
+ cpu_cache_invalidate_memregion(IORES_DESC_PERSISTENT_MEMORY);
return 0;
}

@@ -380,8 +388,11 @@ static int __maybe_unused intel_security_overwrite(struct nvdimm *nvdimm,
if (!test_bit(NVDIMM_INTEL_OVERWRITE, &nfit_mem->dsm_mask))
return -ENOTTY;

+ if (!cpu_cache_has_invalidate_memregion())
+ return -EINVAL;
+
/* flush all cache before we erase DIMM */
- nvdimm_invalidate_cache();
+ cpu_cache_invalidate_memregion(IORES_DESC_PERSISTENT_MEMORY);
memcpy(nd_cmd.cmd.passphrase, nkey->data,
sizeof(nd_cmd.cmd.passphrase));
rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
@@ -401,22 +412,6 @@ static int __maybe_unused intel_security_overwrite(struct nvdimm *nvdimm,
}
}

-/*
- * TODO: define a cross arch wbinvd equivalent when/if
- * NVDIMM_FAMILY_INTEL command support arrives on another arch.
- */
-#ifdef CONFIG_X86
-static void nvdimm_invalidate_cache(void)
-{
- wbinvd_on_all_cpus();
-}
-#else
-static void nvdimm_invalidate_cache(void)
-{
- WARN_ON_ONCE("cache invalidation required after unlock\n");
-}
-#endif
-
static const struct nvdimm_security_ops __intel_security_ops = {
.get_flags = intel_security_flags,
.freeze = intel_security_freeze,
@@ -775,3 +770,5 @@ static const struct nvdimm_fw_ops __intel_fw_ops = {
};

const struct nvdimm_fw_ops *intel_fw_ops = &__intel_fw_ops;
+
+MODULE_IMPORT_NS(DEVMEM);
diff --git a/include/linux/memregion.h b/include/linux/memregion.h
index c04c4fd2e209..6667af64840d 100644
--- a/include/linux/memregion.h
+++ b/include/linux/memregion.h
@@ -20,4 +20,41 @@ static inline void memregion_free(int id)
{
}
#endif
+
+/**
+ * cpu_cache_invalidate_memregion - drop any CPU cached data for
+ * memregions described by @res_desc
+ * @res_desc: one of the IORES_DESC_* types
+ *
+ * Perform cache maintenance after a memory event / operation that
+ * changes the contents of physical memory in a cache-incoherent manner.
+ * For example, device memory technologies like NVDIMM and CXL have
+ * device secure erase, or dynamic region provision features where such
+ * semantics.
+ *
+ * Limit the functionality to architectures that have an efficient way
+ * to writeback and invalidate potentially terabytes of memory at once.
+ * Note that this routine may or may not write back any dirty contents
+ * while performing the invalidation. It is only exported for the
+ * explicit usage of the NVDIMM and CXL modules in the 'DEVMEM' symbol
+ * namespace.
+ *
+ * Returns 0 on success or negative error code on a failure to perform
+ * the cache maintenance.
+ */
+#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
+int cpu_cache_invalidate_memregion(int res_desc);
+bool cpu_cache_has_invalidate_memregion(void);
+#else
+static inline bool cpu_cache_has_invalidate_memregion(void)
+{
+ return false;
+}
+
+int cpu_cache_invalidate_memregion(int res_desc)
+{
+ WARN_ON_ONCE("CPU cache invalidation required");
+ return -EINVAL;
+}
+#endif
#endif /* _MEMREGION_H_ */
diff --git a/lib/Kconfig b/lib/Kconfig
index 9bbf8a4b2108..9eb514abcdec 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -672,6 +672,9 @@ config ARCH_HAS_PMEM_API
config MEMREGION
bool

+config ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
+ bool
+
config ARCH_HAS_MEMREMAP_COMPAT_ALIGN
bool

2022-10-27 17:21:02

by Dave Hansen

[permalink] [raw]

Subject: Re: [PATCH v4] memregion: Add cpu_cache_invalidate_memregion() interface

On 10/25/22 13:05, Dan Williams wrote:
> Users must first call cpu_cache_has_invalidate_memregion() to know whether
> this functionality is available on the architecture. Only enable it on
> x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX
> guests may trigger a virtualization exception and may need proper handling
> to recover. See:
That sentence doesn't quite parse correctly. Does it need to be "and
may trigger..."?

> This global cache invalidation facility,
> cpu_cache_invalidate_memregion(), is exported to modules since NVDIMM
> and CXL support can be built as a module. However, the intent is that
> this facility is not available outside of specific "device-memory" use
> cases. To that end the API is scoped to a new "DEVMEM" module namespace
> that only applies to the NVDIMM and CXL subsystems.

Oh, thank $DEITY you're trying to limit the amount of code that has
access to this thing.

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 67745ceab0db..b68661d0633b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -69,6 +69,7 @@ config X86
> select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
> select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
> select ARCH_HAS_CACHE_LINE_SIZE
> + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION if X86_64

What is 64-bit only about this?

I don't expect to have a lot of NVDIMMs or CXL devices on 32-bit
kernels, but it would be nice to remove this if it's not strictly
needed. Or, to add a changelog nugget that says:

Restrict this to X86_64 kernels. It would probably work on 32-
bit, but there is no practical reason to use 32-bit kernels and
no one is testing them.

> select ARCH_HAS_CURRENT_STACK_POINTER
> select ARCH_HAS_DEBUG_VIRTUAL
> select ARCH_HAS_DEBUG_VM_PGTABLE if !X86_PAE
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 97342c42dda8..8650bb6481a8 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size)
> EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
> #endif
>
> +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
> +bool cpu_cache_has_invalidate_memregion(void)
> +{
> + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
> +}
> +EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);
> +
> +int cpu_cache_invalidate_memregion(int res_desc)
> +{
> + wbinvd_on_all_cpus();
> + return 0;
> +}

Does this maybe also deserve a:

WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion());

in case one of the cpu_cache_invalidate_memregion() paths missed a
cpu_cache_has_invalidate_memregion() check?

> +/**
> + * cpu_cache_invalidate_memregion - drop any CPU cached data for
> + * memregions described by @res_desc
> + * @res_desc: one of the IORES_DESC_* types
> + *
> + * Perform cache maintenance after a memory event / operation that
> + * changes the contents of physical memory in a cache-incoherent manner.
> + * For example, device memory technologies like NVDIMM and CXL have
> + * device secure erase, or dynamic region provision features where such
> + * semantics.

s/where/with/ ?

> + * Limit the functionality to architectures that have an efficient way
> + * to writeback and invalidate potentially terabytes of memory at once.
> + * Note that this routine may or may not write back any dirty contents
> + * while performing the invalidation. It is only exported for the
> + * explicit usage of the NVDIMM and CXL modules in the 'DEVMEM' symbol
> + * namespace.
> + *
> + * Returns 0 on success or negative error code on a failure to perform
> + * the cache maintenance.
> + */

WBINVD is a scary beast. But, there's also no better alternative in the
architecture. I don't think any of my comments above are deal breakers,
so from the x86 side:

Acked-by: Dave Hansen <[email protected]>

2022-10-27 18:17:33

by Dan Williams

[permalink] [raw]

Subject: Re: [PATCH v4] memregion: Add cpu_cache_invalidate_memregion() interface

Dave Hansen wrote:
> On 10/25/22 13:05, Dan Williams wrote:
> > Users must first call cpu_cache_has_invalidate_memregion() to know whether
> > this functionality is available on the architecture. Only enable it on
> > x86-64 via the wbinvd() hammer. Hypervisors are not supported as TDX
> > guests may trigger a virtualization exception and may need proper handling
> > to recover. See:
> That sentence doesn't quite parse correctly. Does it need to be "and
> may trigger..."?

Yes, but I think this can be said more generally, and with more detail:

"Only enable on bare metal x86-64 vai the wbinvd() hammer. It is already
the case that wbinvd() is problematic to allow in VMs due its global
performance impact and KVM, for example, has been known to just trap and
ignore the call. With confidential computing guest execution of wbinvd
may even trigger an exception. As guests should not be messing with the
bare metal address map cpu_cache_has_invalidate_memregion() returns
false in those environments."

> > This global cache invalidation facility,
> > cpu_cache_invalidate_memregion(), is exported to modules since NVDIMM
> > and CXL support can be built as a module. However, the intent is that
> > this facility is not available outside of specific "device-memory" use
> > cases. To that end the API is scoped to a new "DEVMEM" module namespace
> > that only applies to the NVDIMM and CXL subsystems.
>
> Oh, thank $DEITY you're trying to limit the amount of code that has
> access to this thing.
>
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 67745ceab0db..b68661d0633b 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -69,6 +69,7 @@ config X86
> > select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
> > select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
> > select ARCH_HAS_CACHE_LINE_SIZE
> > + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION if X86_64
>
> What is 64-bit only about this?
>
> I don't expect to have a lot of NVDIMMs or CXL devices on 32-bit
> kernels, but it would be nice to remove this if it's not strictly
> needed. Or, to add a changelog nugget that says:
>
> Restrict this to X86_64 kernels. It would probably work on 32-
> bit, but there is no practical reason to use 32-bit kernels and
> no one is testing them.

I had to go recall this myself, but it looks like this is unnecessarily
cargo culting the stance of ARCH_HAS_PMEM_API that arose from this
change:

96601adb7451 x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB

...that observed that on pure 32-bit x86 CPUs that non-temporal stores
had weaker guarantees about whether writes would bypass the CPU cache.
However, that commit is so old that it even talks about the interactions
with the pcommit instruction. Suffice to say there is no X86_64
dependency for wbinvd, I'll drop the dependency.

>
> > select ARCH_HAS_CURRENT_STACK_POINTER
> > select ARCH_HAS_DEBUG_VIRTUAL
> > select ARCH_HAS_DEBUG_VM_PGTABLE if !X86_PAE
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index 97342c42dda8..8650bb6481a8 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -330,6 +330,21 @@ void arch_invalidate_pmem(void *addr, size_t size)
> > EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
> > #endif
> >
> > +#ifdef CONFIG_ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
> > +bool cpu_cache_has_invalidate_memregion(void)
> > +{
> > + return !cpu_feature_enabled(X86_FEATURE_HYPERVISOR);
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cpu_cache_has_invalidate_memregion, DEVMEM);
> > +
> > +int cpu_cache_invalidate_memregion(int res_desc)
> > +{
> > + wbinvd_on_all_cpus();
> > + return 0;
> > +}
>
> Does this maybe also deserve a:
>
> WARN_ON_ONCE(!cpu_cache_has_invalidate_memregion());
>
> in case one of the cpu_cache_invalidate_memregion() paths missed a
> cpu_cache_has_invalidate_memregion() check?

Yeah, good idea.

>
> > +/**
> > + * cpu_cache_invalidate_memregion - drop any CPU cached data for
> > + * memregions described by @res_desc
> > + * @res_desc: one of the IORES_DESC_* types
> > + *
> > + * Perform cache maintenance after a memory event / operation that
> > + * changes the contents of physical memory in a cache-incoherent manner.
> > + * For example, device memory technologies like NVDIMM and CXL have
> > + * device secure erase, or dynamic region provision features where such
> > + * semantics.
>
> s/where/with/ ?

Yes.

>
> > + * Limit the functionality to architectures that have an efficient way
> > + * to writeback and invalidate potentially terabytes of memory at once.
> > + * Note that this routine may or may not write back any dirty contents
> > + * while performing the invalidation. It is only exported for the
> > + * explicit usage of the NVDIMM and CXL modules in the 'DEVMEM' symbol
> > + * namespace.
> > + *
> > + * Returns 0 on success or negative error code on a failure to perform
> > + * the cache maintenance.
> > + */
>
> WBINVD is a scary beast. But, there's also no better alternative in the
> architecture. I don't think any of my comments above are deal breakers,
> so from the x86 side:
>
> Acked-by: Dave Hansen <[email protected]>

Thanks Dave, this unblocks a significant amount of CXL work.

2022-10-27 20:16:08

by Davidlohr Bueso

[permalink] [raw]

Subject: Re: [PATCH v4] memregion: Add cpu_cache_invalidate_memregion() interface

On Thu, 27 Oct 2022, Dave Hansen wrote:

>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 67745ceab0db..b68661d0633b 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -69,6 +69,7 @@ config X86
>> select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
>> select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
>> select ARCH_HAS_CACHE_LINE_SIZE
>> + select ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION if X86_64
>
>What is 64-bit only about this?
>
>I don't expect to have a lot of NVDIMMs or CXL devices on 32-bit
>kernels, but it would be nice to remove this if it's not strictly
>needed. Or, to add a changelog nugget that says:
>
> Restrict this to X86_64 kernels. It would probably work on 32-
> bit, but there is no practical reason to use 32-bit kernels and
> no one is testing them.

Yes, this was to further limit the potential users.