2022-04-30 12:20:46

by Martin Fernandez

[permalink] [raw]
Subject: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

Show for each node if every memory descriptor in that node has the
EFI_MEMORY_CPU_CRYPTO attribute.

fwupd project plans to use it as part of a check to see if the users
have properly configured memory hardware encryption
capabilities. fwupd's people have seen cases where it seems like there
is memory encryption because all the hardware is capable of doing it,
but on a closer look there is not, either because of system firmware
or because some component requires updating to enable the feature.

It's planned to make it part of a specification that can be passed to
people purchasing hardware

These checks will run at every boot. The specification is called Host
Security ID: https://fwupd.github.io/libfwupdplugin/hsi.html.

We choosed to do it a per-node basis because although an ABI that
shows that the whole system memory is capable of encryption would be
useful for the fwupd use case, doing it in a per-node basis gives also
the capability to the user to target allocations from applications to
NUMA nodes which have encryption capabilities.

I did some tests for some of the functions introduced (and modified)
in e820.c. Sadly KUnit is not able to test __init functions and data
so I had some warnings during the linking. There is a KUnit patch
already to fix that [1]. I wanted to wait for it to be merged but it
is taking more time than I expected so I'm sending this without tests
for now. I'm planning to add unit tests in the future to the e820
range update rework and e820_update_table.

[1] https://lore.kernel.org/lkml/[email protected]/T/


Changes since v7:

Less kerneldocs

Less verbosity in the e820 code


Changes since v6:

Fixes in __e820__handle_range_update

Const correctness in e820.c

Correct alignment in memblock.h

Rework memblock_overlaps_region


Changes since v5:

Refactor e820__range_{update, remove, set_crypto_capable} in order to
avoid code duplication.

Warn the user when a node has both encryptable and non-encryptable
regions.

Check that e820_table has enough size to store both current e820_table
and EFI memmap.


Changes since v4:

Add enum to represent the cryptographic capabilities in e820:
e820_crypto_capabilities.

Revert __e820__range_update, only adding the new argument for
__e820__range_add about crypto capabilities.

Add a function __e820__range_update_crypto similar to
__e820__range_update but to only update this new field.


Changes since v3:

Update date in Doc/ABI file.

More information about the fwupd usecase and the rationale behind
doing it in a per-NUMA-node.


Changes since v2:

e820__range_mark_crypto -> e820__range_mark_crypto_capable.

In e820__range_remove: Create a region with crypto capabilities
instead of creating one without it and then mark it.


Changes since v1:

Modify __e820__range_update to update the crypto capabilities of a
range; now this function will change the crypto capability of a range
if it's called with the same old_type and new_type. Rework
efi_mark_e820_regions_as_crypto_capable based on this.

Update do_add_efi_memmap to mark the regions as it creates them.

Change the type of crypto_capable in e820_entry from bool to u8.

Fix e820__update_table changes.

Remove memblock_add_crypto_capable. Now you have to add the region and
mark it then.

Better place for crypto_capable in pglist_data.

Martin Fernandez (8):
mm/memblock: Tag memblocks with crypto capabilities
mm/mmzone: Tag pg_data_t with crypto capabilities
x86/e820: Add infrastructure to refactor e820__range_{update,remove}
x86/e820: Refactor __e820__range_update
x86/e820: Refactor e820__range_remove
x86/e820: Tag e820_entry with crypto capabilities
x86/efi: Mark e820_entries as crypto capable from EFI memmap
drivers/node: Show in sysfs node's crypto capabilities

Documentation/ABI/testing/sysfs-devices-node | 10 +
arch/x86/include/asm/e820/api.h | 1 +
arch/x86/include/asm/e820/types.h | 12 +-
arch/x86/kernel/e820.c | 388 ++++++++++++++-----
arch/x86/platform/efi/efi.c | 37 ++
drivers/base/node.c | 10 +
include/linux/memblock.h | 5 +
include/linux/mmzone.h | 3 +
mm/memblock.c | 62 +++
mm/page_alloc.c | 1 +
10 files changed, 431 insertions(+), 98 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-devices-node

--
2.30.2


2022-05-01 16:52:56

by Martin Fernandez

[permalink] [raw]
Subject: [PATCH v8 3/8] x86/e820: Add infrastructure to refactor e820__range_{update,remove}

__e820__range_update and e820__range_remove had a very similar flow in
its implementation with a few lines different from each other, the
lines that actually perform the modification over the e820_table. The
similiraties were found in the checks for the different cases on how
each entry intersects with the given range (if it does at all). These
checks were very presice and error prone so it was not a good idea to
have them in both places.

Since I need to add a third one, similar to this two, in this and the
following patches I'll propose a refactor of these functions.

In this patch I introduce:

- A new type e820_entry_updater that will carry three callbacks, those
callbacks will decide WHEN to perform actions over the e820_table and
WHAY actions are going to be performed.

Note that there is a void pointer "data". This pointer will carry
useful information for the callbacks, like the type that we want to
update in e820__range_update or if we want to check the type in
e820__range_remove. Check it out in the next patches where I do the
rework of __e820__range_update and e820__range_remove.

- A new function __e820__handle_range_update that has the flow of the
original two functions to refactor. Together with e820_entry_updater
will perform the desired update on the input table.

On version 6 of this patch some people pointed out that this solution
was over-complicated. Mike Rapoport suggested a another solution [1].

I took a look at that, and although it is indeed simpler it's more
confusing at the same time. I think is manageable to have a single
function to update or remove sections of the table (what Mike did),
but when I added the functionality to also update the crypto_capable
it became really hard to manage.

I think that the approach presented in this patch it's complex but is
easier to read, to extend and to test.

[1] https://git.kernel.org/rppt/h/x86/e820-update-range

Signed-off-by: Martin Fernandez <[email protected]>

--------------------------------------------------

Changes from v7 to v8:

(Some) Callbacks of e820_entry_updater can now be NULL to avoid
defining empty functions

Remove kerneldocs in favor of plain comments just to explain what the
functions do.
---
arch/x86/kernel/e820.c | 127 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index f267205f2d5a..e0fa3ab755a5 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -459,6 +459,133 @@ static int __init append_e820_table(struct boot_e820_entry *entries, u32 nr_entr
return __append_e820_table(entries, nr_entries);
}

+/**
+ * struct e820_entry_updater - Helper type for
+ * __e820__handle_range_update().
+ * @should_update: Return true if @entry needs to be updated, false
+ * otherwise.
+ * @update: Apply desired actions to an @entry that is inside the
+ * range and satisfies @should_update. Can be set to NULL to avoid empty functions.
+ * @new: Create new entry in the table with information gathered from
+ * @original and @data. Can be set to NULL to avoid empty functions.
+ *
+ * Each function corresponds to an action that
+ * __e820__handle_range_update() does. Callbacks need to cast @data back
+ * to the corresponding type.
+ */
+struct e820_entry_updater {
+ bool (*should_update)(const struct e820_entry *entry, const void *data);
+ void (*update)(struct e820_entry *entry, const void *data);
+ void (*new)(struct e820_table *table, u64 new_start, u64 new_size,
+ const struct e820_entry *original, const void *data);
+};
+
+/*
+ * Helper for __e820__handle_range_update to handle the case where
+ * neither the entry completely covers the range nor the range
+ * completely covers the entry.
+ */
+static u64 __init
+__e820__handle_intersected_range_update(struct e820_table *table,
+ u64 start,
+ u64 size,
+ struct e820_entry *entry,
+ const struct e820_entry_updater *updater,
+ const void *data)
+{
+ u64 end;
+ u64 entry_end = entry->addr + entry->size;
+ u64 inner_start;
+ u64 inner_end;
+ u64 updated_size = 0;
+
+ if (size > (ULLONG_MAX - start))
+ size = ULLONG_MAX - start;
+
+ end = start + size;
+ inner_start = max(start, entry->addr);
+ inner_end = min(end, entry_end);
+
+ /* Range and entry do intersect and... */
+ if (inner_start < inner_end) {
+ /* Entry is on the left */
+ if (entry->addr < inner_start) {
+ /* Resize current entry */
+ entry->size = inner_start - entry->addr;
+ /* Entry is on the right */
+ } else {
+ /* Resize and move current section */
+ entry->addr = inner_end;
+ entry->size = entry_end - inner_end;
+ }
+ if (updater->new != NULL)
+ /* Create new entry with intersected region */
+ updater->new(table, inner_start, inner_end - inner_start, entry, data);
+
+ updated_size += inner_end - inner_start;
+ } /* Else: [start, end) doesn't cover entry */
+
+ return updated_size;
+}
+
+/*
+ * Update the table @table in [@start, @start + @size) doing the
+ * actions given in @updater.
+ */
+static u64 __init
+__e820__handle_range_update(struct e820_table *table,
+ u64 start,
+ u64 size,
+ const struct e820_entry_updater *updater,
+ const void *data)
+{
+ u64 updated_size = 0;
+ u64 end;
+ unsigned int i;
+
+ if (size > (ULLONG_MAX - start))
+ size = ULLONG_MAX - start;
+
+ end = start + size;
+
+ for (i = 0; i < table->nr_entries; i++) {
+ struct e820_entry *entry = &table->entries[i];
+ u64 entry_end = entry->addr + entry->size;
+
+ if (updater->should_update(entry, data)) {
+ /* Range completely covers entry */
+ if (entry->addr >= start && entry_end <= end) {
+ updated_size += entry->size;
+ if (updater->update != NULL)
+ updater->update(entry, data);
+ /* Entry completely covers range */
+ } else if (start > entry->addr && end < entry_end) {
+ /* Resize current entry */
+ entry->size = start - entry->addr;
+
+ if (updater->new != NULL)
+ /* Create new entry with intersection region */
+ updater->new(table, start, size, entry, data);
+
+ /*
+ * Create a new entry for the leftover
+ * of the current entry
+ */
+ __e820__range_add(table, end, entry_end - end,
+ entry->type);
+
+ updated_size += size;
+ } else {
+ updated_size +=
+ __e820__handle_intersected_range_update(table, start, size,
+ entry, updater, data);
+ }
+ }
+ }
+
+ return updated_size;
+}
+
static u64 __init
__e820__range_update(struct e820_table *table, u64 start, u64 size, enum e820_type old_type, enum e820_type new_type)
{
--
2.30.2

2022-05-02 22:25:35

by Martin Fernandez

[permalink] [raw]
Subject: [PATCH v8 4/8] x86/e820: Refactor __e820__range_update

Refactor __e820__range_update with the introduction of
e820_type_updater_data, indented to be used as the void pointer in the
e820_entry_updater callbacks, and the implementation of the callbacks
to perform the update of the type in a range of a e820_table.

Signed-off-by: Martin Fernandez <[email protected]>
---
arch/x86/kernel/e820.c | 119 +++++++++++++++++++++--------------------
1 file changed, 62 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index e0fa3ab755a5..36a22c0a2199 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -586,80 +586,85 @@ __e820__handle_range_update(struct e820_table *table,
return updated_size;
}

-static u64 __init
-__e820__range_update(struct e820_table *table, u64 start, u64 size, enum e820_type old_type, enum e820_type new_type)
-{
- u64 end;
- unsigned int i;
- u64 real_updated_size = 0;
-
- BUG_ON(old_type == new_type);
-
- if (size > (ULLONG_MAX - start))
- size = ULLONG_MAX - start;
+/*
+ * Type helper for the e820_entry_updater callbacks.
+ */
+struct e820_type_updater_data {
+ enum e820_type old_type;
+ enum e820_type new_type;
+};

- end = start + size;
- printk(KERN_DEBUG "e820: update [mem %#010Lx-%#010Lx] ", start, end - 1);
- e820_print_type(old_type);
- pr_cont(" ==> ");
- e820_print_type(new_type);
- pr_cont("\n");
+static bool __init type_updater__should_update(const struct e820_entry *entry,
+ const void *data)
+{
+ const struct e820_type_updater_data *type_updater_data = data;

- for (i = 0; i < table->nr_entries; i++) {
- struct e820_entry *entry = &table->entries[i];
- u64 final_start, final_end;
- u64 entry_end;
+ return entry->type == type_updater_data->old_type;
+}

- if (entry->type != old_type)
- continue;
+static void __init type_updater__update(struct e820_entry *entry,
+ const void *data)
+{
+ const struct e820_type_updater_data *type_updater_data = data;

- entry_end = entry->addr + entry->size;
+ entry->type = type_updater_data->new_type;
+}

- /* Completely covered by new range? */
- if (entry->addr >= start && entry_end <= end) {
- entry->type = new_type;
- real_updated_size += entry->size;
- continue;
- }
+static void __init type_updater__new(struct e820_table *table, u64 new_start,
+ u64 new_size,
+ const struct e820_entry *original,
+ const void *data)
+{
+ const struct e820_type_updater_data *type_updater_data = data;

- /* New range is completely covered? */
- if (entry->addr < start && entry_end > end) {
- __e820__range_add(table, start, size, new_type);
- __e820__range_add(table, end, entry_end - end, entry->type);
- entry->size = start - entry->addr;
- real_updated_size += size;
- continue;
- }
+ __e820__range_add(table, new_start, new_size,
+ type_updater_data->new_type);
+}

- /* Partially covered: */
- final_start = max(start, entry->addr);
- final_end = min(end, entry_end);
- if (final_start >= final_end)
- continue;
+static u64 __init __e820__range_update(struct e820_table *table, u64 start,
+ u64 size, enum e820_type old_type,
+ enum e820_type new_type)
+{
+ struct e820_entry_updater updater = {
+ .should_update = type_updater__should_update,
+ .update = type_updater__update,
+ .new = type_updater__new
+ };

- __e820__range_add(table, final_start, final_end - final_start, new_type);
+ struct e820_type_updater_data data = {
+ .old_type = old_type,
+ .new_type = new_type
+ };

- real_updated_size += final_end - final_start;
+ BUG_ON(old_type == new_type);

- /*
- * Left range could be head or tail, so need to update
- * its size first:
- */
- entry->size -= final_end - final_start;
- if (entry->addr < final_start)
- continue;
+ printk(KERN_DEBUG "e820: update [mem %#018Lx-%#018Lx] ", start,
+ start + size - 1);
+ e820_print_type(old_type);
+ pr_cont(" ==> ");
+ e820_print_type(new_type);
+ pr_cont("\n");

- entry->addr = final_end;
- }
- return real_updated_size;
+ return __e820__handle_range_update(table, start, size, &updater, &data);
}

-u64 __init e820__range_update(u64 start, u64 size, enum e820_type old_type, enum e820_type new_type)
+/*
+ * Update type of addresses in [@start, @start + @size) from @old_type
+ * to @new_type in e820_table.
+ */
+u64 __init e820__range_update(u64 start, u64 size, enum e820_type old_type,
+ enum e820_type new_type)
{
return __e820__range_update(e820_table, start, size, old_type, new_type);
}

-static u64 __init e820__range_update_kexec(u64 start, u64 size, enum e820_type old_type, enum e820_type new_type)
+/*
+ * Update type of addresses in [@start, @start + @size) from @old_type
+ * to @new_type in e820_table_kexec.
+ */
+static u64 __init e820__range_update_kexec(u64 start, u64 size,
+ enum e820_type old_type,
+ enum e820_type new_type)
{
return __e820__range_update(e820_table_kexec, start, size, old_type, new_type);
}
--
2.30.2

2022-05-02 23:06:39

by Martin Fernandez

[permalink] [raw]
Subject: [PATCH v8 8/8] drivers/node: Show in sysfs node's crypto capabilities

Show in each node in sysfs if its memory is able to do be encrypted by
the CPU; on EFI systems: if all its memory is marked with
EFI_MEMORY_CPU_CRYPTO in the EFI memory map.

Signed-off-by: Martin Fernandez <[email protected]>
---
Documentation/ABI/testing/sysfs-devices-node | 10 ++++++++++
drivers/base/node.c | 10 ++++++++++
2 files changed, 20 insertions(+)
create mode 100644 Documentation/ABI/testing/sysfs-devices-node

diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
new file mode 100644
index 000000000000..0e95420bd7c5
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-node
@@ -0,0 +1,10 @@
+What: /sys/devices/system/node/nodeX/crypto_capable
+Date: April 2022
+Contact: Martin Fernandez <[email protected]>
+Users: fwupd (https://fwupd.org)
+Description:
+ This value is 1 if all system memory in this node is
+ capable of being protected with the CPU's memory
+ cryptographic capabilities. It is 0 otherwise.
+ On EFI systems the node will be marked with
+ EFI_MEMORY_CPU_CRYPTO.
\ No newline at end of file
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ec8bb24a5a22..1df15ea03c27 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -560,11 +560,21 @@ static ssize_t node_read_distance(struct device *dev,
}
static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);

+static ssize_t crypto_capable_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct pglist_data *pgdat = NODE_DATA(dev->id);
+
+ return sysfs_emit(buf, "%d\n", pgdat->crypto_capable);
+}
+static DEVICE_ATTR_RO(crypto_capable);
+
static struct attribute *node_dev_attrs[] = {
&dev_attr_meminfo.attr,
&dev_attr_numastat.attr,
&dev_attr_distance.attr,
&dev_attr_vmstat.attr,
+ &dev_attr_crypto_capable.attr,
NULL
};

--
2.30.2

2022-05-02 23:19:13

by Martin Fernandez

[permalink] [raw]
Subject: [PATCH v8 1/8] mm/memblock: Tag memblocks with crypto capabilities

Add the capability to mark regions of the memory memory_type able of
hardware memory encryption.

Also add the capability to query if all regions of a memory node are
able to do hardware memory encryption to call it when initializing the
nodes. Warn the user if a node has both encryptable and
non-encryptable regions.

Signed-off-by: Martin Fernandez <[email protected]>
---
include/linux/memblock.h | 5 ++++
mm/memblock.c | 62 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 67 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 50ad19662a32..00c4f1a20335 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -40,6 +40,7 @@ extern unsigned long long max_possible_pfn;
* via a driver, and never indicated in the firmware-provided memory map as
* system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
* kernel resource tree.
+ * @MEMBLOCK_CRYPTO_CAPABLE: capable of hardware encryption
*/
enum memblock_flags {
MEMBLOCK_NONE = 0x0, /* No special request */
@@ -47,6 +48,7 @@ enum memblock_flags {
MEMBLOCK_MIRROR = 0x2, /* mirrored region */
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
MEMBLOCK_DRIVER_MANAGED = 0x8, /* always detected via a driver */
+ MEMBLOCK_CRYPTO_CAPABLE = 0x10, /* capable of hardware encryption */
};

/**
@@ -120,6 +122,9 @@ int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
bool memblock_overlaps_region(struct memblock_type *type,
phys_addr_t base, phys_addr_t size);
+bool memblock_node_is_crypto_capable(int nid);
+int memblock_mark_crypto_capable(phys_addr_t base, phys_addr_t size);
+int memblock_clear_crypto_capable(phys_addr_t base, phys_addr_t size);
int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index e4f03a6e8e56..d6399835b155 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -191,6 +191,40 @@ bool __init_memblock memblock_overlaps_region(struct memblock_type *type,
return i < type->cnt;
}

+/**
+ * memblock_node_is_crypto_capable - get if whole node is capable
+ * of encryption
+ * @nid: number of node
+ *
+ * Iterate over all memory memblock_type and find if all regions under
+ * node @nid are capable of hardware encryption.
+ *
+ * Return:
+ * true if every region in @nid is capable of encryption, false
+ * otherwise.
+ */
+bool __init_memblock memblock_node_is_crypto_capable(int nid)
+{
+ struct memblock_region *region;
+ int crypto_capables = 0;
+ int not_crypto_capables = 0;
+
+ for_each_mem_region(region) {
+ if (memblock_get_region_node(region) == nid) {
+ if (region->flags & MEMBLOCK_CRYPTO_CAPABLE)
+ crypto_capables++;
+ else
+ not_crypto_capables++;
+ }
+ }
+
+ if (crypto_capables > 0 && not_crypto_capables > 0)
+ pr_warn("Node %d has %d regions that are encryptable and %d regions that aren't",
+ nid, not_crypto_capables, crypto_capables);
+
+ return crypto_capables > 0 && not_crypto_capables == 0;
+}
+
/**
* __memblock_find_range_bottom_up - find free area utility in bottom-up
* @start: start of candidate range
@@ -891,6 +925,34 @@ static int __init_memblock memblock_setclr_flag(phys_addr_t base,
return 0;
}

+/**
+ * memblock_mark_crypto_capable - Mark memory regions capable of hardware
+ * encryption with flag MEMBLOCK_CRYPTO_CAPABLE.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_crypto_capable(phys_addr_t base,
+ phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 1, MEMBLOCK_CRYPTO_CAPABLE);
+}
+
+/**
+ * memblock_clear_crypto_capable - Clear flag MEMBLOCK_CRYPTO for a
+ * specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_crypto_capable(phys_addr_t base,
+ phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 0, MEMBLOCK_CRYPTO_CAPABLE);
+}
+
/**
* memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
* @base: the base phys addr of the region
--
2.30.2

2022-05-02 23:39:24

by Martin Fernandez

[permalink] [raw]
Subject: [PATCH v8 7/8] x86/efi: Mark e820_entries as crypto capable from EFI memmap

Add a function to iterate over the EFI Memory Map and mark the regions
tagged with EFI_MEMORY_CPU_CRYPTO in the e820_table; and call it from
efi_init if add_efi_memmap is disabled.

Also modify do_add_efi_memmap to mark the regions there.

If add_efi_memmap is false, also check that the e820_table has enough
size to (possibly) store also the EFI memmap.

Signed-off-by: Martin Fernandez <[email protected]>
---
arch/x86/platform/efi/efi.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 147c30a81f15..3efa1c620c75 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -184,6 +184,8 @@ static void __init do_add_efi_memmap(void)
}

e820__range_add(start, size, e820_type);
+ if (md->attribute & EFI_MEMORY_CPU_CRYPTO)
+ e820__range_set_crypto_capable(start, size);
}
e820__update_table(e820_table);
}
@@ -441,6 +443,34 @@ static int __init efi_config_init(const efi_config_table_type_t *arch_tables)
return ret;
}

+static void __init efi_mark_e820_regions_as_crypto_capable(void)
+{
+ efi_memory_desc_t *md;
+
+ /*
+ * Calling e820__range_set_crypto_capable several times
+ * creates a bunch of entries in the E820 table. They probably
+ * will get merged when calling update_table but we need the
+ * space there anyway
+ */
+ if (efi.memmap.nr_map + e820_table->nr_entries >= E820_MAX_ENTRIES) {
+ pr_err_once("E820 table is not large enough to fit EFI memmap; not marking entries as crypto capable\n");
+ return;
+ }
+
+ for_each_efi_memory_desc(md) {
+ if (md->attribute & EFI_MEMORY_CPU_CRYPTO)
+ e820__range_set_crypto_capable(md->phys_addr,
+ md->num_pages << EFI_PAGE_SHIFT);
+ }
+
+ /*
+ * We added and modified regions so it's good to update the
+ * table to merge/sort
+ */
+ e820__update_table(e820_table);
+}
+
void __init efi_init(void)
{
if (IS_ENABLED(CONFIG_X86_32) &&
@@ -494,6 +524,13 @@ void __init efi_init(void)
set_bit(EFI_RUNTIME_SERVICES, &efi.flags);
efi_clean_memmap();

+ /*
+ * If add_efi_memmap then there is no need to mark the regions
+ * again
+ */
+ if (!add_efi_memmap)
+ efi_mark_e820_regions_as_crypto_capable();
+
if (efi_enabled(EFI_DBG))
efi_print_memmap();
}
--
2.30.2

2022-05-02 23:47:10

by Martin Fernandez

[permalink] [raw]
Subject: [PATCH v8 2/8] mm/mmzone: Tag pg_data_t with crypto capabilities

Add a new member in the pg_data_t struct to tell whether the node
corresponding to that pg_data_t is able to do hardware memory
encryption.

This will be read from sysfs.

Signed-off-by: Martin Fernandez <[email protected]>
---
include/linux/mmzone.h | 3 +++
mm/page_alloc.c | 1 +
2 files changed, 4 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 46ffab808f03..89054af9e599 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -886,6 +886,9 @@ typedef struct pglist_data {
struct task_struct *kcompactd;
bool proactive_compact_trigger;
#endif
+
+ bool crypto_capable;
+
/*
* This is a per-node reserve of pages that are not available
* to userspace allocations.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e42038382c1..a244151045b4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7699,6 +7699,7 @@ static void __init free_area_init_node(int nid)
pgdat->node_id = nid;
pgdat->node_start_pfn = start_pfn;
pgdat->per_cpu_nodestats = NULL;
+ pgdat->crypto_capable = memblock_node_is_crypto_capable(nid);

if (start_pfn != end_pfn) {
pr_info("Initmem setup node %d [mem %#018Lx-%#018Lx]\n", nid,
--
2.30.2

2022-05-03 01:08:48

by Martin Fernandez

[permalink] [raw]
Subject: [PATCH v8 5/8] x86/e820: Refactor e820__range_remove

Refactor e820__range_remove with the introduction of
e820_remover_data, indented to be used as the void pointer in the
e820_entry_updater callbacks, and the implementation of the callbacks
remove a range in the e820_table.

Signed-off-by: Martin Fernandez <[email protected]>
---
arch/x86/kernel/e820.c | 94 ++++++++++++++++++------------------------
1 file changed, 41 insertions(+), 53 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 36a22c0a2199..0e5aa13ebdb8 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -669,66 +669,54 @@ static u64 __init e820__range_update_kexec(u64 start, u64 size,
return __e820__range_update(e820_table_kexec, start, size, old_type, new_type);
}

-/* Remove a range of memory from the E820 table: */
-u64 __init e820__range_remove(u64 start, u64 size, enum e820_type old_type, bool check_type)
-{
- int i;
- u64 end;
- u64 real_removed_size = 0;
-
- if (size > (ULLONG_MAX - start))
- size = ULLONG_MAX - start;
-
- end = start + size;
- printk(KERN_DEBUG "e820: remove [mem %#010Lx-%#010Lx] ", start, end - 1);
- if (check_type)
- e820_print_type(old_type);
- pr_cont("\n");
-
- for (i = 0; i < e820_table->nr_entries; i++) {
- struct e820_entry *entry = &e820_table->entries[i];
- u64 final_start, final_end;
- u64 entry_end;
-
- if (check_type && entry->type != old_type)
- continue;
+/*
+ * Type helper for the e820_entry_updater callbacks.
+ */
+struct e820_remover_data {
+ enum e820_type old_type;
+ bool check_type;
+};

- entry_end = entry->addr + entry->size;
+static bool __init remover__should_update(const struct e820_entry *entry,
+ const void *data)
+{
+ const struct e820_remover_data *remover_data = data;

- /* Completely covered? */
- if (entry->addr >= start && entry_end <= end) {
- real_removed_size += entry->size;
- memset(entry, 0, sizeof(*entry));
- continue;
- }
+ return !remover_data->check_type ||
+ entry->type == remover_data->old_type;
+}

- /* Is the new range completely covered? */
- if (entry->addr < start && entry_end > end) {
- e820__range_add(end, entry_end - end, entry->type);
- entry->size = start - entry->addr;
- real_removed_size += size;
- continue;
- }
+static void __init remover__update(struct e820_entry *entry, const void *data)
+{
+ memset(entry, 0, sizeof(*entry));
+}

- /* Partially covered: */
- final_start = max(start, entry->addr);
- final_end = min(end, entry_end);
- if (final_start >= final_end)
- continue;
+/*
+ * Remove [@start, @start + @size) from e820_table. If @check_type is
+ * true remove only entries with type @old_type.
+ */
+u64 __init e820__range_remove(u64 start, u64 size, enum e820_type old_type,
+ bool check_type)
+{
+ struct e820_entry_updater updater = {
+ .should_update = remover__should_update,
+ .update = remover__update,
+ .new = NULL
+ };

- real_removed_size += final_end - final_start;
+ struct e820_remover_data data = {
+ .check_type = check_type,
+ .old_type = old_type
+ };

- /*
- * Left range could be head or tail, so need to update
- * the size first:
- */
- entry->size -= final_end - final_start;
- if (entry->addr < final_start)
- continue;
+ printk(KERN_DEBUG "e820: remove [mem %#018Lx-%#018Lx] ", start,
+ start + size - 1);
+ if (check_type)
+ e820_print_type(old_type);
+ pr_cont("\n");

- entry->addr = final_end;
- }
- return real_removed_size;
+ return __e820__handle_range_update(e820_table, start, size, &updater,
+ &data);
}

void __init e820__update_table_print(void)
--
2.30.2

2022-05-05 07:20:45

by Martin Fernandez

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On 5/4/22, Borislav Petkov <[email protected]> wrote:
> On Fri, Apr 29, 2022 at 05:17:09PM -0300, Martin Fernandez wrote:
>> Show for each node if every memory descriptor in that node has the
>> EFI_MEMORY_CPU_CRYPTO attribute.
>>
>> fwupd project plans to use it as part of a check to see if the users
>> have properly configured memory hardware encryption
>> capabilities. fwupd's people have seen cases where it seems like there
>> is memory encryption because all the hardware is capable of doing it,
>> but on a closer look there is not, either because of system firmware
>> or because some component requires updating to enable the feature.
>
> Hm, so in the sysfs patch you have:
>
> + This value is 1 if all system memory in this node is
> + capable of being protected with the CPU's memory
> + cryptographic capabilities.
>
> So this says the node is capable - so what is fwupd going to report -
> that the memory is capable?
>
> From your previous paragraph above it sounds to me like you wanna
> say whether memory encryption is active or not, not that the node is
> capable.
>
> Or what is the use case?

The use case is to know if a user is using hardware encryption or
not. This new sysfs file plus knowing if tme/sev is active you can be
pretty sure about that.

>> It's planned to make it part of a specification that can be passed to
>> people purchasing hardware
>
> So people are supposed to run that fwupd on that new hw to check whether
> they can use memory encryption?

Yes

>> These checks will run at every boot. The specification is called Host
>> Security ID: https://fwupd.github.io/libfwupdplugin/hsi.html.
>>
>> We choosed to do it a per-node basis because although an ABI that
>> shows that the whole system memory is capable of encryption would be
>> useful for the fwupd use case, doing it in a per-node basis gives also
>> the capability to the user to target allocations from applications to
>> NUMA nodes which have encryption capabilities.
>
> That's another hmmm: what systems do not do full system memory
> encryption and do only per-node?
>
> From those I know, you encrypt the whole memory on the whole system and
> that's it. Even if it is a hypervisor which runs a lot of guests, you
> still want the hypervisor itself to run encrypted, i.e., what's called
> SME in AMD's variant.

Dave Hansen pointed those out in a previuos patch serie, here is the
quote:

> CXL devices will have normal RAM on them, be exposed as "System RAM" and
> they won't have encryption capabilities. I think these devices were
> probably the main motivation for EFI_MEMORY_CPU_CRYPTO.

2022-05-06 08:11:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On Fri, Apr 29, 2022 at 05:17:09PM -0300, Martin Fernandez wrote:
> Show for each node if every memory descriptor in that node has the
> EFI_MEMORY_CPU_CRYPTO attribute.
>
> fwupd project plans to use it as part of a check to see if the users
> have properly configured memory hardware encryption
> capabilities. fwupd's people have seen cases where it seems like there
> is memory encryption because all the hardware is capable of doing it,
> but on a closer look there is not, either because of system firmware
> or because some component requires updating to enable the feature.

Hm, so in the sysfs patch you have:

+ This value is 1 if all system memory in this node is
+ capable of being protected with the CPU's memory
+ cryptographic capabilities.

So this says the node is capable - so what is fwupd going to report -
that the memory is capable?

From your previous paragraph above it sounds to me like you wanna
say whether memory encryption is active or not, not that the node is
capable.

Or what is the use case?

> It's planned to make it part of a specification that can be passed to
> people purchasing hardware

So people are supposed to run that fwupd on that new hw to check whether
they can use memory encryption?

> These checks will run at every boot. The specification is called Host
> Security ID: https://fwupd.github.io/libfwupdplugin/hsi.html.
>
> We choosed to do it a per-node basis because although an ABI that
> shows that the whole system memory is capable of encryption would be
> useful for the fwupd use case, doing it in a per-node basis gives also
> the capability to the user to target allocations from applications to
> NUMA nodes which have encryption capabilities.

That's another hmmm: what systems do not do full system memory
encryption and do only per-node?

From those I know, you encrypt the whole memory on the whole system and
that's it. Even if it is a hypervisor which runs a lot of guests, you
still want the hypervisor itself to run encrypted, i.e., what's called
SME in AMD's variant.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-05-07 05:27:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On 5/6/22 11:25, Boris Petkov wrote:
> On May 6, 2022 6:14:00 PM UTC, Dave Hansen <[email protected]>
> wrote:
>> But, this interface will *work* both for the uniform and
>> non-uniform systems alike.
> And what would that additional information that some "node" -
> whatever "node" means nowadays - is not encrypted give you?

Tying it to the node ties it to the NUMA ABIs. For instance, it lets
you say: "allocate memory with encryption capabilities" with a
set_mempolicy() to nodes that are enumerated as encryption-capable.

Imagine that we have a non-uniform system: some memory supports TDX (or
SEV-SNP) and some doesn't. QEMU calls mmap() to allocate some guest
memory and then its ioctl()s to get its addresses stuffed into EPT/NPT.
The memory might be allocated from anywhere, CPU_CRYPTO-capable or not.
VM creation will fail because the (hardware-enforced) security checks
can't be satisfied on non-CPU_CRYPTO memory.

Userspace has no recourse to fix this. It's just stuck. In that case,
the *kernel* needs to be responsible for ensuring that the backing
physical memory supports TDX (or SEV).

This node attribute punts the problem back out to userspace. It gives
userspace the ability to steer allocations to compatible NUMA nodes. If
something goes wrong, they can use other NUMA ABIs to inspect the
situation, like /proc/$pid/numa_maps.

2022-05-07 22:49:18

by Mario Limonciello

[permalink] [raw]
Subject: RE: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

[Public]



> -----Original Message-----
> From: Borislav Petkov <[email protected]>
> Sent: Friday, May 6, 2022 07:44
> To: Martin Fernandez <[email protected]>
> Cc: [email protected]; [email protected]; platform-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do
> encryption
>
> On Wed, May 04, 2022 at 02:18:30PM -0300, Martin Fernandez wrote:
> > The use case is to know if a user is using hardware encryption or
> > not. This new sysfs file plus knowing if tme/sev is active you can be
> > pretty sure about that.
>
> Then please explain it in detail and in the text so that it is clear. As
> it is now, the reader is left wondering what that file is supposed to
> state.
>
> > Dave Hansen pointed those out in a previuos patch serie, here is the
> > quote:
> >
> > > CXL devices will have normal RAM on them, be exposed as "System
> RAM" and
> > > they won't have encryption capabilities. I think these devices were
> > > probably the main motivation for EFI_MEMORY_CPU_CRYPTO.
>
> So this would mean that if a system doesn't have CXL devices and has
> TME/SME/SEV-* enabled, then it is running with encrypted memory.
>
> Which would then also mean, you don't need any of that code - you only
> need to enumerate CXL devices which, it seems, do not support memory
> encryption, and then state that memory encryption is enabled on the
> whole system, except for the memory of those devices.
>
> I.e.,
>
> $ dmesg | grep -i SME
> [ 1.783650] AMD Memory Encryption Features active: SME
>
> Done - memory is encrypted on the whole system.
>
> We could export it into /proc/cpuinfo so that you don't have to grep
> dmesg and problem solved.
>

Actually we solved that already for SME. Kernel only exposes the feature
in /proc/cpuinfo if it's active now.
See kernel commit 08f253ec3767bcfafc5d32617a92cee57c63968e.

Fwupd code has been changed to match it too. It will only trust the presence of
sme flag with kernel 5.18.0 and newer.

https://github.com/fwupd/fwupd/commit/53a49b4ac1815572f242f85a1a1cc52a2d7ed50c

2022-05-08 12:13:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On Wed, May 04, 2022 at 02:18:30PM -0300, Martin Fernandez wrote:
> The use case is to know if a user is using hardware encryption or
> not. This new sysfs file plus knowing if tme/sev is active you can be
> pretty sure about that.

Then please explain it in detail and in the text so that it is clear. As
it is now, the reader is left wondering what that file is supposed to
state.

> Dave Hansen pointed those out in a previuos patch serie, here is the
> quote:
>
> > CXL devices will have normal RAM on them, be exposed as "System RAM" and
> > they won't have encryption capabilities. I think these devices were
> > probably the main motivation for EFI_MEMORY_CPU_CRYPTO.

So this would mean that if a system doesn't have CXL devices and has
TME/SME/SEV-* enabled, then it is running with encrypted memory.

Which would then also mean, you don't need any of that code - you only
need to enumerate CXL devices which, it seems, do not support memory
encryption, and then state that memory encryption is enabled on the
whole system, except for the memory of those devices.

I.e.,

$ dmesg | grep -i SME
[ 1.783650] AMD Memory Encryption Features active: SME

Done - memory is encrypted on the whole system.

We could export it into /proc/cpuinfo so that you don't have to grep
dmesg and problem solved.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-05-08 20:33:18

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On May 6, 2022 4:00:57 PM UTC, Dan Williams <[email protected]> wrote:
>On Fri, May 6, 2022 at 8:32 AM Dave Hansen <[email protected]> wrote:
>>
>> On 5/6/22 05:44, Borislav Petkov wrote:
>> >> Dave Hansen pointed those out in a previuos patch serie, here is the
>> >> quote:
>> >>
>> >>> CXL devices will have normal RAM on them, be exposed as "System RAM" and
>> >>> they won't have encryption capabilities. I think these devices were
>> >>> probably the main motivation for EFI_MEMORY_CPU_CRYPTO.
>> > So this would mean that if a system doesn't have CXL devices and has
>> > TME/SME/SEV-* enabled, then it is running with encrypted memory.
>> >
>> > Which would then also mean, you don't need any of that code - you only
>> > need to enumerate CXL devices which, it seems, do not support memory
>> > encryption, and then state that memory encryption is enabled on the
>> > whole system, except for the memory of those devices.
>>
>> CXL devices are just the easiest example to explain, but they are not
>> the only problem.
>>
>> For example, Intel NVDIMMs don't support TDX (or MKTME with integrity)
>> since TDX requires integrity protection and NVDIMMs don't have metadata
>> space available.
>>
>> Also, if this were purely a CXL problem, I would have expected this to
>> have been dealt with in the CXL spec alone. But, this series is
>> actually driven by an ACPI spec. That tells me that we'll see these
>> mismatched encryption capabilities in many more places than just CXL
>> devices.
>
>Yes, the problem is that encryption capabilities cut across multiple
>specifications. For example, you might need to consult a CPU
>vendor-specific manual, ACPI, EFI, PCI, and CXL specifications for a
>single security feature.

So here's the deal: we can say in the kernel that memory encryption is enabled and active. But then all those different devices and so on, can or cannot support encryption. IO devices do not support encryption either, afaict. And there you don't have node granularity etc. So you can't do this per node thing anyway. Or you do it and it becomes insufficient soin after.

But that is not the question - they don't wanna say in fwupd whether every transaction was encrypted or not - they wanna say that encryption is active. And that we can give them now.

Thx.

--
Sent from a small device: formatting sux and brevity is inevitable.

2022-05-09 01:26:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On 5/6/22 05:44, Borislav Petkov wrote:
>> Dave Hansen pointed those out in a previuos patch serie, here is the
>> quote:
>>
>>> CXL devices will have normal RAM on them, be exposed as "System RAM" and
>>> they won't have encryption capabilities. I think these devices were
>>> probably the main motivation for EFI_MEMORY_CPU_CRYPTO.
> So this would mean that if a system doesn't have CXL devices and has
> TME/SME/SEV-* enabled, then it is running with encrypted memory.
>
> Which would then also mean, you don't need any of that code - you only
> need to enumerate CXL devices which, it seems, do not support memory
> encryption, and then state that memory encryption is enabled on the
> whole system, except for the memory of those devices.

CXL devices are just the easiest example to explain, but they are not
the only problem.

For example, Intel NVDIMMs don't support TDX (or MKTME with integrity)
since TDX requires integrity protection and NVDIMMs don't have metadata
space available.

Also, if this were purely a CXL problem, I would have expected this to
have been dealt with in the CXL spec alone. But, this series is
actually driven by an ACPI spec. That tells me that we'll see these
mismatched encryption capabilities in many more places than just CXL
devices.


2022-05-09 02:13:35

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On May 6, 2022 6:43:39 PM UTC, Dave Hansen <[email protected]> wrote:
>On 5/6/22 11:25, Boris Petkov wrote:
>> On May 6, 2022 6:14:00 PM UTC, Dave Hansen <[email protected]>
>> wrote:
>>> But, this interface will *work* both for the uniform and
>>> non-uniform systems alike.
>> And what would that additional information that some "node" -
>> whatever "node" means nowadays - is not encrypted give you?
>
>Tying it to the node ties it to the NUMA ABIs. For instance, it lets
>you say: "allocate memory with encryption capabilities" with a
>set_mempolicy() to nodes that are enumerated as encryption-capable.

I was expecting something along those lines...

>Imagine that we have a non-uniform system: some memory supports TDX (or
>SEV-SNP) and some doesn't. QEMU calls mmap() to allocate some guest
>memory and then its ioctl()s to get its addresses stuffed into EPT/NPT.
> The memory might be allocated from anywhere, CPU_CRYPTO-capable or not.
> VM creation will fail because the (hardware-enforced) security checks
>can't be satisfied on non-CPU_CRYPTO memory.
>
>Userspace has no recourse to fix this. It's just stuck. In that case,
> the *kernel* needs to be responsible for ensuring that the backing
>physical memory supports TDX (or SEV).
>
>This node attribute punts the problem back out to userspace. It gives
>userspace the ability to steer allocations to compatible NUMA nodes. If
>something goes wrong, they can use other NUMA ABIs to inspect the
>situation, like /proc/$pid/numa_maps.

That's all fine and dandy but I still don't see the *actual*, real-life use case of why something would request memory of particular encryption capabilities. Don't get me wrong - I'm not saying there are not such use cases - I'm saying we should go all the way and fully define properly *why* we're doing this whole hoopla.

Remember - this all started with "i wanna say that mem enc is active" and now we're so far deep down the rabbit hole...

--
Sent from a small device: formatting sux and brevity is inevitable.

2022-05-09 04:33:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On 5/6/22 10:55, Boris Petkov wrote:
> So here's the deal: we can say in the kernel that memory encryption
> is enabled and active. But then all those different devices and so
> on, can or cannot support encryption. IO devices do not support
> encryption either, afaict.

At least on MKTME platforms, if a device does DMA to a physical address
with the KeyID bits set, it gets memory encryption. That's because all
the encryption magic is done in the memory controller itself. The CPU's
memory controller doesn't actually care if the access comes from a
device or a CPU as long as the right physical bits are set.

The reason we're talking about this in terms of CXL devices is that CXL
devices have their *OWN* memory controllers. Those memory controllers
might or might not support encryption.

> But that is not the question - they don't wanna say in fwupd whether
> every transaction was encrypted or not - they wanna say that
> encryption is active. And that we can give them now.

The reason we went down this per-node thing instead of something
system-wide is EFI_MEMORY_CPU_CRYPTO. It's in the standard because EFI
systems are not expected to have uniform crypto capabilities across the
entire memory map. Some memory will be capable of CPU crypto and some not.

As an example, if I were to build a system today with TDX and NVDIMMs,
I'd probably mark the RAM as EFI_MEMORY_CPU_CRYPTO=1 and the NVDIMMs as
EFI_MEMORY_CPU_CRYPTO=0.

I think you're saying that current AMD SEV systems have no need for
EFI_MEMORY_CPU_CRYPTO since their encryption capabilities *ARE* uniform.
I'm not challenging that at all. This interface is total overkill for
systems with guaranteed uniform encryption capabilities.

But, this interface will *work* both for the uniform and non-uniform
systems alike.

2022-05-09 08:46:46

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On Fri, May 6, 2022 at 8:32 AM Dave Hansen <[email protected]> wrote:
>
> On 5/6/22 05:44, Borislav Petkov wrote:
> >> Dave Hansen pointed those out in a previuos patch serie, here is the
> >> quote:
> >>
> >>> CXL devices will have normal RAM on them, be exposed as "System RAM" and
> >>> they won't have encryption capabilities. I think these devices were
> >>> probably the main motivation for EFI_MEMORY_CPU_CRYPTO.
> > So this would mean that if a system doesn't have CXL devices and has
> > TME/SME/SEV-* enabled, then it is running with encrypted memory.
> >
> > Which would then also mean, you don't need any of that code - you only
> > need to enumerate CXL devices which, it seems, do not support memory
> > encryption, and then state that memory encryption is enabled on the
> > whole system, except for the memory of those devices.
>
> CXL devices are just the easiest example to explain, but they are not
> the only problem.
>
> For example, Intel NVDIMMs don't support TDX (or MKTME with integrity)
> since TDX requires integrity protection and NVDIMMs don't have metadata
> space available.
>
> Also, if this were purely a CXL problem, I would have expected this to
> have been dealt with in the CXL spec alone. But, this series is
> actually driven by an ACPI spec. That tells me that we'll see these
> mismatched encryption capabilities in many more places than just CXL
> devices.

Yes, the problem is that encryption capabilities cut across multiple
specifications. For example, you might need to consult a CPU
vendor-specific manual, ACPI, EFI, PCI, and CXL specifications for a
single security feature.

2022-05-09 09:38:30

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On May 6, 2022 6:14:00 PM UTC, Dave Hansen <[email protected]> wrote:
>But, this interface will *work* both for the uniform and non-uniform
>systems alike.

And what would that additional information that some "node" - whatever "node" means nowadays - is not encrypted give you?

Note that the fwupd use case is to be able to say that memory encryption is active - nothing more.

--
Sent from a small device: formatting sux and brevity is inevitable.

2022-05-09 18:54:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

... adding some KVM/TDX folks

On 5/6/22 12:02, Boris Petkov wrote:
>> This node attribute punts the problem back out to userspace. It
>> gives userspace the ability to steer allocations to compatible NUMA
>> nodes. If something goes wrong, they can use other NUMA ABIs to
>> inspect the situation, like /proc/$pid/numa_maps.
> That's all fine and dandy but I still don't see the *actual*,
> real-life use case of why something would request memory of
> particular encryption capabilities. Don't get me wrong - I'm not
> saying there are not such use cases - I'm saying we should go all the
> way and fully define properly *why* we're doing this whole hoopla.

Let's say TDX is running on a system with mixed encryption
capabilities*. Some NUMA nodes support TDX and some don't. If that
happens, your guest RAM can come from anywhere. When the host kernel
calls into the TDX module to add pages to the guest (via
TDH.MEM.PAGE.ADD) it might get an error back from the TDX module. At
that point, the host kernel is stuck. It's got a partially created
guest and no recourse to fix the error.

This new ABI provides a way to avoid that situation in the first place.
Userspace can look at sysfs to figure out which NUMA nodes support
"encryption" (aka. TDX) and can use the existing NUMA policy ABI to
avoid TDH.MEM.PAGE.ADD failures.

So, here's the question for the TDX folks: are these mixed-capability
systems a problem for you? Does this ABI help you fix the problem?
Will your userspace (qemu and friends) actually use consume from this ABI?

* There are three ways we might hit a system with this issue:
1. NVDIMMs that don't support TDX, like lack of memory integrity
protection.
2. CXL-attached memory controllers that can't do encryption at all
3. Nominally TDX-compatible memory that was not covered/converted by
the kernel for some reason (memory hot-add, or ran out of TDMR
resources)

2022-05-10 00:18:38

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On 5/9/22 15:17, Borislav Petkov wrote:
>
>> This new ABI provides a way to avoid that situation in the first place.
>> Userspace can look at sysfs to figure out which NUMA nodes support
>> "encryption" (aka. TDX) and can use the existing NUMA policy ABI to
>> avoid TDH.MEM.PAGE.ADD failures.
>>
>> So, here's the question for the TDX folks: are these mixed-capability
>> systems a problem for you? Does this ABI help you fix the problem?
> What I'm not really sure too is, is per-node granularity ok? I guess it
> is but let me ask it anyway...

I think nodes are the only sane granularity.

tl;dr: Zones might work in theory but have no existing useful ABI around
them and too many practical problems. Nodes are the only other real
option without inventing something new and fancy.

--

What about zones (or any sub-node granularity really)?

Folks have, for instance, discussed adding new memory zones for this
purpose: have ZONE_NORMAL, and then ZONE_UNENCRYPTABLE (or something
similar). Zones are great because they have their own memory allocation
pools and can be targeted directly from within the kernel using things
like GFP_DMA. If you run out of ZONE_FOO, you can theoretically just
reclaim ZONE_FOO.

But, even a single new zone isn't necessarily good enough. What if we
have some ZONE_NORMAL that's encryption-capable and some that's not?
The same goes for ZONE_MOVABLE. We'd probably need at least:

ZONE_NORMAL
ZONE_NORMAL_UNENCRYPTABLE
ZONE_MOVABLE
ZONE_MOVABLE_UNENCRYPTABLE

Also, zones are (mostly) not exposed to userspace. If we want userspace
to be able to specify encryption capabilities, we're talking about new
ABI for enumeration and policy specification.

Why node granularity?

First, for the majority of cases, nodes "just work". ACPI systems with
an "HMAT" table already separate out different performance classes of
memory into different Proximity Domains (PXMs) which the kernel maps
into NUMA nodes.

This means that for NVDIMMs or virtually any CXL memory regions (one or
more CXL devices glued together) we can think of, they already get their
own NUMA node. Those nodes have their own zones (implicitly) and can
lean on the existing NUMA ABI for enumeration and policy creation.

Basically, the firmware creates the NUMA nodes for the kernel. All the
kernel has to do is acknowledge which of them can do encryption or not.

The one place where nodes fall down is if a memory hot-add occurs within
an existing node and the newly hot-added memory does not match the
encryption capabilities of the existing memory. The kernel basically
has two options in that case:
* Throw away the memory until the next reboot where the system might be
reconfigured in a way to support more uniform capabilities (this is
actually *likely* for a reboot of a TDX system)
* Create a synthetic NUMA node to hold it

Neither one of those is a horrible option. Throwing the memory away is
the most likely way TDX will handle this situation if it pops up. For
now, the folks building TDX-capable BIOSes claim emphatically that such
a system won't be built.

2022-05-10 00:19:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On Mon, May 09, 2022 at 11:47:43AM -0700, Dave Hansen wrote:
> ... adding some KVM/TDX folks

+ AMD SEV folks as they're going to probably need something like that
too.

> On 5/6/22 12:02, Boris Petkov wrote:
> >> This node attribute punts the problem back out to userspace. It
> >> gives userspace the ability to steer allocations to compatible NUMA
> >> nodes. If something goes wrong, they can use other NUMA ABIs to
> >> inspect the situation, like /proc/$pid/numa_maps.
> > That's all fine and dandy but I still don't see the *actual*,
> > real-life use case of why something would request memory of
> > particular encryption capabilities. Don't get me wrong - I'm not
> > saying there are not such use cases - I'm saying we should go all the
> > way and fully define properly *why* we're doing this whole hoopla.
>
> Let's say TDX is running on a system with mixed encryption
> capabilities*. Some NUMA nodes support TDX and some don't. If that
> happens, your guest RAM can come from anywhere. When the host kernel
> calls into the TDX module to add pages to the guest (via
> TDH.MEM.PAGE.ADD) it might get an error back from the TDX module. At
> that point, the host kernel is stuck. It's got a partially created
> guest and no recourse to fix the error.

Thanks for that detailed use case, btw!

> This new ABI provides a way to avoid that situation in the first place.
> Userspace can look at sysfs to figure out which NUMA nodes support
> "encryption" (aka. TDX) and can use the existing NUMA policy ABI to
> avoid TDH.MEM.PAGE.ADD failures.
>
> So, here's the question for the TDX folks: are these mixed-capability
> systems a problem for you? Does this ABI help you fix the problem?

What I'm not really sure too is, is per-node granularity ok? I guess it
is but let me ask it anyway...

> Will your userspace (qemu and friends) actually use consume from this ABI?

Same question for SEV folks - do you guys think this interface would
make sense for the SEV side of things?

> * There are three ways we might hit a system with this issue:
> 1. NVDIMMs that don't support TDX, like lack of memory integrity
> protection.
> 2. CXL-attached memory controllers that can't do encryption at all
> 3. Nominally TDX-compatible memory that was not covered/converted by
> the kernel for some reason (memory hot-add, or ran out of TDMR
> resources)

And I think some of those might be of interest to the AMD side of things
too.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-05-16 13:40:10

by Richard Hughes

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On Fri, 6 May 2022 at 20:02, Boris Petkov <[email protected]> wrote:
> Remember - this all started with "i wanna say that mem enc is active" and now we're so far deep down the rabbit hole...

This is still something consumers need; at the moment users have no
idea if data is *actually* being encrypted. I think Martin has done an
admirable job going down the rabbit hole to add this functionality in
the proper manner -- so it's actually accurate and useful for other
use cases to that of fwupd.

At the moment my professional advice to people asking about Intel
memory encryption is to assume there is none, as there's no way of
verifying that it's actually enabled and working. This is certainly a
shame for something so promising, touted as an enterprise security
feature.

Richard

2022-05-18 07:53:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On Mon, May 16, 2022 at 09:39:06AM +0100, Richard Hughes wrote:
> This is still something consumers need; at the moment users have no
> idea if data is *actually* being encrypted.

As it was already pointed out - that's in /proc/cpuinfo.

> I think Martin has done an admirable job going down the rabbit hole
> to add this functionality in the proper manner -- so it's actually
> accurate and useful for other use cases to that of fwupd.

Only after I scratched the surface as to why this is needed.

> At the moment my professional advice to people asking about Intel
> memory encryption

Well, what kind of memory encryption? Host, guest?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-05-18 18:31:42

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On Wed, May 18, 2022 at 12:53 AM Borislav Petkov <[email protected]> wrote:
>
> On Mon, May 16, 2022 at 09:39:06AM +0100, Richard Hughes wrote:
> > This is still something consumers need; at the moment users have no
> > idea if data is *actually* being encrypted.
>
> As it was already pointed out - that's in /proc/cpuinfo.

For TME you still need to compare it against the EFI memory map as
there are exclusion ranges for things like persistent memory. Given
that persistent memory can be forced into volatile "System RAM"
operation by various command line options and driver overrides, you
need to at least trim the assumptions of what is encrypted to the
default "conventional memory" conveyed by platform firmware / BIOS.

2022-05-18 20:25:43

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

On Wed, May 18, 2022 at 11:28:49AM -0700, Dan Williams wrote:
> On Wed, May 18, 2022 at 12:53 AM Borislav Petkov <[email protected]> wrote:
> >
> > On Mon, May 16, 2022 at 09:39:06AM +0100, Richard Hughes wrote:
> > > This is still something consumers need; at the moment users have no
> > > idea if data is *actually* being encrypted.
> >
> > As it was already pointed out - that's in /proc/cpuinfo.
>
> For TME you still need to compare it against the EFI memory map as
> there are exclusion ranges for things like persistent memory. Given
> that persistent memory can be forced into volatile "System RAM"
> operation by various command line options and driver overrides, you
> need to at least trim the assumptions of what is encrypted to the
> default "conventional memory" conveyed by platform firmware / BIOS.

So SME/SEV also has some exceptions to which memory is encrypted and
which not. Doing device IO would be one example where you simply cannot
encrypt.

But that wasn't the original question - the original question is whether
memory encryption is enabled on the system.

Now, the nodes way of describing what is encrypted and what not is
not enough either when you want to determine whether an arbitrary
transaction is being done encrypted or not. You can do silly things
as mapping a page decrypted even if the underlying hardware can do
encryption and every other page is encrypted and still think that that
page is encrypted too. But that would be a lie.

So the whole problem space needs to be specified with a lot more detail
as to what exact information userspace is going to need and how we can
provide it to it.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette