= Intro =
The patchset brings enabling of Intel Multi-Key Total Memory Encryption.
It consists of changes into multiple subsystems:
* Core MM: infrastructure for allocation pages, dealing with encrypted VMAs
and providing API setup encrypted mappings.
* arch/x86: feature enumeration, program keys into hardware, setup
page table entries for encrypted pages and more.
* Key management service: setup and management of encryption keys.
* DMA/IOMMU: dealing with encrypted memory on IO side.
* KVM: interaction with virtualization side.
* Documentation: description of APIs and usage examples.
The patchset is huge. This submission aims to give view to the full picture and
get feedback on the overall design. The patchset will be split into more
digestible pieces later.
Please review. Any feedback is welcome.
= Overview =
Multi-Key Total Memory Encryption (MKTME)[1] is a technology that allows
transparent memory encryption in upcoming Intel platforms. It uses a new
instruction (PCONFIG) for key setup and selects a key for individual pages by
repurposing physical address bits in the page tables.
These patches add support for MKTME into the existing kernel keyring subsystem
and add a new mprotect_encrypt() system call that can be used by applications
to encrypt anonymous memory with keys obtained from the keyring.
This architecture supports encrypting both normal, volatile DRAM and persistent
memory. However, these patches do not implement persistent memory support. We
anticipate adding that support next.
== Hardware Background ==
MKTME is built on top of an existing single-key technology called TME. TME
encrypts all system memory using a single key generated by the CPU on every
boot of the system. TME provides mitigation against physical attacks, such as
physically removing a DIMM or watching memory bus traffic.
MKTME enables the use of multiple encryption keys[2], allowing selection of the
encryption key per-page using the page tables. Encryption keys are programmed
into each memory controller and the same set of keys is available to all
entities on the system with access to that memory (all cores, DMA engines,
etc...).
MKTME inherits many of the mitigations against hardware attacks from TME. Like
TME, MKTME does not mitigate vulnerable or malicious operating systems or
virtual machine managers. MKTME offers additional mitigations when compared to
TME.
TME and MKTME use the AES encryption algorithm in the AES-XTS mode. This mode,
typically used for block-based storage devices, takes the physical address of
the data into account when encrypting each block. This ensures that the
effective key is different for each block of memory. Moving encrypted content
across physical address results in garbage on read, mitigating block-relocation
attacks. This property is the reason many of the discussed attacks require
control of a shared physical page to be handed from the victim to the attacker.
== MKTME-Provided Mitigations ==
MKTME adds a few mitigations against attacks that are not mitigated when using
TME alone. The first set are mitigations against software attacks that are
familiar today:
* Kernel Mapping Attacks: information disclosures that leverage the
kernel direct map are mitigated against disclosing user data.
* Freed Data Leak Attacks: removing an encryption key from the
hardware mitigates future user information disclosure.
The next set are attacks that depend on specialized hardware, such as an “evil
DIMM” or a DDR interposer:
* Cross-Domain Replay Attack: data is captured from one domain
(guest) and replayed to another at a later time.
* Cross-Domain Capture and Delayed Compare Attack: data is captured
and later analyzed to discover secrets.
* Key Wear-out Attack: data is captured and analyzed in order to
Weaken the AES encryption itself.
More details on these attacks are below.
=== Kernel Mapping Attacks ===
Information disclosure vulnerabilities leverage the kernel direct map because
many vulnerabilities involve manipulation of kernel data structures (examples:
CVE-2017-7277, CVE-2017-9605). We normally think of these bugs as leaking
valuable *kernel* data, but they can leak application data when application
pages are recycled for kernel use.
With this MKTME implementation, there is a direct map created for each MKTME
KeyID which is used whenever the kernel needs to access plaintext. But, all
kernel data structures are accessed via the direct map for KeyID-0. Thus,
memory reads which are not coordinated with the KeyID get garbage (for example,
accessing KeyID-4 data with the KeyID-0 mapping).
This means that if sensitive data encrypted using MKTME is leaked via the
KeyID-0 direct map, ciphertext decrypted with the wrong key will be disclosed.
To disclose plaintext, an attacker must “pivot” to the correct direct mapping,
which is non-trivial because there are no kernel data structures in the
KeyID!=0 direct mapping.
=== Freed Data Leak Attack ===
The kernel has a history of bugs around uninitialized data. Usually, we think
of these bugs as leaking sensitive kernel data, but they can also be used to
leak application secrets.
MKTME can help mitigate the case where application secrets are leaked:
* App (or VM) places a secret in a page
* App exits or frees memory to kernel allocator
* Page added to allocator free list
* Attacker reallocates page to a purpose where it can read the page
Now, imagine MKTME was in use on the memory being leaked. The data can only be
leaked as long as the key is programmed in the hardware. If the key is
de-programmed, like after all pages are freed after a guest is shut down, any
future reads will just see ciphertext.
Basically, the key is a convenient choke-point: you can be more confident that
data encrypted with it is inaccessible once the key is removed.
=== Cross-Domain Replay Attack ===
MKTME mitigates cross-domain replay attacks where an attacker replaces an
encrypted block owned by one domain with a block owned by another domain.
MKTME does not prevent this replacement from occurring, but it does mitigate
plaintext from being disclosed if the domains use different keys.
With TME, the attack could be executed by:
* A victim places secret in memory, at a given physical address.
Note: AES-XTS is what restricts the attack to being performed at a
single physical address instead of across different physical
addresses
* Attacker captures victim secret’s ciphertext
* Later on, after victim frees the physical address, attacker gains
ownership
* Attacker puts the ciphertext at the address and get the secret
plaintext
But, due to the presumably different keys used by the attacker and the victim,
the attacker can not successfully decrypt old ciphertext.
=== Cross-Domain Capture and Delayed Compare Attack ===
This is also referred to as a kind of dictionary attack.
Similarly, MKTME protects against cross-domain capture-and-compare attacks.
Consider the following scenario:
* A victim places a secret in memory, at a known physical address
* Attacker captures victim’s ciphertext
* Attacker gains control of the target physical address, perhaps
after the victim’s VM is shut down or its memory reclaimed.
* Attacker computes and writes many possible plaintexts until new
ciphertext matches content captured previously.
Secrets which have low (plaintext) entropy are more vulnerable to this attack
because they reduce the number of possible plaintexts an attacker has to
compute and write.
The attack will not work if attacker and victim uses different keys.
=== Key Wear-out Attack ===
Repeated use of an encryption key might be used by an attacker to infer
information about the key or the plaintext, weakening the encryption. The
higher the bandwidth of the encryption engine, the more vulnerable the key is
to wear-out. The MKTME memory encryption hardware works at the speed of the
memory bus, which has high bandwidth.
Such a weakness has been demonstrated[3] on a theoretical cipher with similar
properties as AES-XTS.
An attack would take the following steps:
* Victim system is using TME with AES-XTS-128
* Attacker repeatedly captures ciphertext/plaintext pairs (can be
Performed with online hardware attack like an interposer).
* Attacker compels repeated use of the key under attack for a
sustained time period without a system reboot[4].
* Attacker discovers a cipertext collision (two plaintexts
translating to the same ciphertext)
* Attacker can induce controlled modifications to the targeted
plaintext by modifying the colliding ciphertext
MKTME mitigates key wear-out in two ways:
* Keys can be rotated periodically to mitigate wear-out. Since TME
keys are generated at boot, rotation of TME keys requires a
reboot. In contrast, MKTME allows rotation while the system is
booted. An application could implement a policy to rotate keys at
a frequency which is not feasible to attack.
* In the case that MKTME is used to encrypt two guests’ memory with
two different keys, an attack on one guest’s key would not weaken
the key used in the second guest.
--
[1] https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf
[2] The MKTME architecture supports up to 16 bits of KeyIDs, so a
maximum of 65535 keys on top of the “TME key” at KeyID-0. The
first implementation is expected to support 5 bits, making 63 keys
available to applications. However, this is not guaranteed. The
number of available keys could be reduced if, for instance,
additional physical address space is desired over additional
KeyIDs.
[3] http://web.cs.ucdavis.edu/~rogaway/papers/offsets.pdf
[4] This sustained time required for an attack could vary from days
to years depending on the attacker’s goals.
Alison Schofield (33):
x86/pconfig: Set a valid encryption algorithm for all MKTME commands
keys/mktme: Introduce a Kernel Key Service for MKTME
keys/mktme: Preparse the MKTME key payload
keys/mktme: Instantiate and destroy MKTME keys
keys/mktme: Move the MKTME payload into a cache aligned structure
keys/mktme: Strengthen the entropy of CPU generated MKTME keys
keys/mktme: Set up PCONFIG programming targets for MKTME keys
keys/mktme: Program MKTME keys into the platform hardware
keys/mktme: Set up a percpu_ref_count for MKTME keys
keys/mktme: Require CAP_SYS_RESOURCE capability for MKTME keys
keys/mktme: Store MKTME payloads if cmdline parameter allows
acpi: Remove __init from acpi table parsing functions
acpi/hmat: Determine existence of an ACPI HMAT
keys/mktme: Require ACPI HMAT to register the MKTME Key Service
acpi/hmat: Evaluate topology presented in ACPI HMAT for MKTME
keys/mktme: Do not allow key creation in unsafe topologies
keys/mktme: Support CPU hotplug for MKTME key service
keys/mktme: Find new PCONFIG targets during memory hotplug
keys/mktme: Program new PCONFIG targets with MKTME keys
keys/mktme: Support memory hotplug for MKTME keys
mm: Generalize the mprotect implementation to support extensions
syscall/x86: Wire up a system call for MKTME encryption keys
x86/mm: Set KeyIDs in encrypted VMAs for MKTME
mm: Add the encrypt_mprotect() system call for MKTME
x86/mm: Keep reference counts on encrypted VMAs for MKTME
mm: Restrict MKTME memory encryption to anonymous VMAs
selftests/x86/mktme: Test the MKTME APIs
x86/mktme: Overview of Multi-Key Total Memory Encryption
x86/mktme: Document the MKTME provided security mitigations
x86/mktme: Document the MKTME kernel configuration requirements
x86/mktme: Document the MKTME Key Service API
x86/mktme: Document the MKTME API for anonymous memory encryption
x86/mktme: Demonstration program using the MKTME APIs
Jacob Pan (3):
iommu/vt-d: Support MKTME in DMA remapping
x86/mm: introduce common code for mem encryption
x86/mm: Use common code for DMA memory encryption
Kai Huang (2):
mm, x86: export several MKTME variables
kvm, x86, mmu: setup MKTME keyID to spte for given PFN
Kirill A. Shutemov (24):
mm: Do no merge VMAs with different encryption KeyIDs
mm: Add helpers to setup zero page mappings
mm/ksm: Do not merge pages with different KeyIDs
mm/page_alloc: Unify alloc_hugepage_vma()
mm/page_alloc: Handle allocation for encrypted memory
mm/khugepaged: Handle encrypted pages
x86/mm: Mask out KeyID bits from page table entry pfn
x86/mm: Introduce variables to store number, shift and mask of KeyIDs
x86/mm: Preserve KeyID on pte_modify() and pgprot_modify()
x86/mm: Detect MKTME early
x86/mm: Add a helper to retrieve KeyID for a page
x86/mm: Add a helper to retrieve KeyID for a VMA
x86/mm: Add hooks to allocate and free encrypted pages
x86/mm: Map zero pages into encrypted mappings correctly
x86/mm: Rename CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING
x86/mm: Allow to disable MKTME after enumeration
x86/mm: Calculate direct mapping size
x86/mm: Implement syncing per-KeyID direct mappings
x86/mm: Handle encrypted memory in page_to_virt() and __pa()
mm/page_ext: Export lookup_page_ext() symbol
mm/rmap: Clear vma->anon_vma on unlink_anon_vmas()
x86/mm: Disable MKTME on incompatible platform configurations
x86/mm: Disable MKTME if not all system memory supports encryption
x86: Introduce CONFIG_X86_INTEL_MKTME
.../admin-guide/kernel-parameters.rst | 1 +
.../admin-guide/kernel-parameters.txt | 11 +
Documentation/x86/mktme/index.rst | 13 +
.../x86/mktme/mktme_configuration.rst | 17 +
Documentation/x86/mktme/mktme_demo.rst | 53 ++
Documentation/x86/mktme/mktme_encrypt.rst | 57 ++
Documentation/x86/mktme/mktme_keys.rst | 96 +++
Documentation/x86/mktme/mktme_mitigations.rst | 150 ++++
Documentation/x86/mktme/mktme_overview.rst | 57 ++
Documentation/x86/x86_64/mm.txt | 4 +
arch/alpha/include/asm/page.h | 2 +-
arch/x86/Kconfig | 29 +-
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/intel-family.h | 2 +
arch/x86/include/asm/intel_pconfig.h | 14 +-
arch/x86/include/asm/mem_encrypt.h | 29 +
arch/x86/include/asm/mktme.h | 93 +++
arch/x86/include/asm/page.h | 4 +
arch/x86/include/asm/page_32.h | 1 +
arch/x86/include/asm/page_64.h | 4 +-
arch/x86/include/asm/pgtable.h | 19 +
arch/x86/include/asm/pgtable_types.h | 23 +-
arch/x86/include/asm/setup.h | 6 +
arch/x86/kernel/cpu/intel.c | 58 +-
arch/x86/kernel/head64.c | 4 +
arch/x86/kernel/setup.c | 3 +
arch/x86/kvm/mmu.c | 18 +-
arch/x86/mm/Makefile | 3 +
arch/x86/mm/init_64.c | 68 ++
arch/x86/mm/kaslr.c | 11 +-
arch/x86/mm/mem_encrypt_common.c | 28 +
arch/x86/mm/mktme.c | 630 ++++++++++++++
drivers/acpi/hmat/hmat.c | 67 ++
drivers/acpi/tables.c | 10 +-
drivers/firmware/efi/efi.c | 25 +-
drivers/iommu/intel-iommu.c | 29 +-
fs/dax.c | 3 +-
fs/exec.c | 4 +-
fs/userfaultfd.c | 7 +-
include/asm-generic/pgtable.h | 8 +
include/keys/mktme-type.h | 39 +
include/linux/acpi.h | 9 +-
include/linux/dma-direct.h | 4 +-
include/linux/efi.h | 1 +
include/linux/gfp.h | 51 +-
include/linux/intel-iommu.h | 9 +-
include/linux/mem_encrypt.h | 23 +-
include/linux/migrate.h | 14 +-
include/linux/mm.h | 27 +-
include/linux/page_ext.h | 11 +-
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 2 +
kernel/sys_ni.c | 2 +
mm/compaction.c | 3 +
mm/huge_memory.c | 6 +-
mm/khugepaged.c | 10 +
mm/ksm.c | 17 +
mm/madvise.c | 2 +-
mm/memory.c | 3 +-
mm/mempolicy.c | 30 +-
mm/migrate.c | 4 +-
mm/mlock.c | 2 +-
mm/mmap.c | 31 +-
mm/mprotect.c | 98 ++-
mm/page_alloc.c | 50 ++
mm/page_ext.c | 5 +
mm/rmap.c | 4 +-
mm/userfaultfd.c | 3 +-
security/keys/Makefile | 1 +
security/keys/mktme_keys.c | 768 ++++++++++++++++++
.../selftests/x86/mktme/encrypt_tests.c | 433 ++++++++++
.../testing/selftests/x86/mktme/flow_tests.c | 266 ++++++
tools/testing/selftests/x86/mktme/key_tests.c | 526 ++++++++++++
.../testing/selftests/x86/mktme/mktme_test.c | 300 +++++++
76 files changed, 4301 insertions(+), 122 deletions(-)
create mode 100644 Documentation/x86/mktme/index.rst
create mode 100644 Documentation/x86/mktme/mktme_configuration.rst
create mode 100644 Documentation/x86/mktme/mktme_demo.rst
create mode 100644 Documentation/x86/mktme/mktme_encrypt.rst
create mode 100644 Documentation/x86/mktme/mktme_keys.rst
create mode 100644 Documentation/x86/mktme/mktme_mitigations.rst
create mode 100644 Documentation/x86/mktme/mktme_overview.rst
create mode 100644 arch/x86/include/asm/mktme.h
create mode 100644 arch/x86/mm/mem_encrypt_common.c
create mode 100644 arch/x86/mm/mktme.c
create mode 100644 include/keys/mktme-type.h
create mode 100644 security/keys/mktme_keys.c
create mode 100644 tools/testing/selftests/x86/mktme/encrypt_tests.c
create mode 100644 tools/testing/selftests/x86/mktme/flow_tests.c
create mode 100644 tools/testing/selftests/x86/mktme/key_tests.c
create mode 100644 tools/testing/selftests/x86/mktme/mktme_test.c
--
2.20.1
If all pages in the VMA got unmapped there's no reason to link it into
original anon VMA hierarchy: it cannot possibly share any pages with
other VMA.
Set vma->anon_vma to NULL on unlink_anon_vmas(). With the change VMA
can be reused. The new anon VMA will be allocated on the first fault.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/rmap.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index b30c7c71d1d9..4ec2aee7baa3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -400,8 +400,10 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
list_del(&avc->same_vma);
anon_vma_chain_free(avc);
}
- if (vma->anon_vma)
+ if (vma->anon_vma) {
vma->anon_vma->degree--;
+ vma->anon_vma = NULL;
+ }
unlock_anon_vma_root(root);
/*
--
2.20.1
From: Alison Schofield <[email protected]>
MKTME Key service maintains the hardware key tables. These key tables
are package scoped per the MKTME hardware definition. This means that
each physical package on the system needs its key table programmed.
These physical packages are the targets of the new PCONFIG programming
command. So, introduce a PCONFIG targets bitmap as well as a CPU mask
that includes the lead CPUs capable of programming the targets.
The lead CPU mask will be used every time a new key is programmed into
the hardware.
Keep the PCONFIG targets bit map around for future use during hotplug
events.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
security/keys/mktme_keys.c | 42 ++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/security/keys/mktme_keys.c b/security/keys/mktme_keys.c
index 9fdf482ea3e6..b5b44decfd3e 100644
--- a/security/keys/mktme_keys.c
+++ b/security/keys/mktme_keys.c
@@ -2,6 +2,7 @@
/* Documentation/x86/mktme_keys.rst */
+#include <linux/cpu.h>
#include <linux/init.h>
#include <linux/key.h>
#include <linux/key-type.h>
@@ -17,6 +18,8 @@
static DEFINE_SPINLOCK(mktme_lock);
struct kmem_cache *mktme_prog_cache; /* Hardware programming cache */
+unsigned long *mktme_target_map; /* Pconfig programming targets */
+cpumask_var_t mktme_leadcpus; /* One lead CPU per pconfig target */
/* 1:1 Mapping between Userspace Keys (struct key) and Hardware KeyIDs */
struct mktme_mapping {
@@ -303,6 +306,33 @@ struct key_type key_type_mktme = {
.destroy = mktme_destroy_key,
};
+static void mktme_update_pconfig_targets(void)
+{
+ int cpu, target_id;
+
+ cpumask_clear(mktme_leadcpus);
+ bitmap_clear(mktme_target_map, 0, sizeof(mktme_target_map));
+
+ for_each_online_cpu(cpu) {
+ target_id = topology_physical_package_id(cpu);
+ if (!__test_and_set_bit(target_id, mktme_target_map))
+ __cpumask_set_cpu(cpu, mktme_leadcpus);
+ }
+}
+
+static int mktme_alloc_pconfig_targets(void)
+{
+ if (!alloc_cpumask_var(&mktme_leadcpus, GFP_KERNEL))
+ return -ENOMEM;
+
+ mktme_target_map = bitmap_alloc(topology_max_packages(), GFP_KERNEL);
+ if (!mktme_target_map) {
+ free_cpumask_var(mktme_leadcpus);
+ return -ENOMEM;
+ }
+ return 0;
+}
+
static int __init init_mktme(void)
{
int ret;
@@ -320,9 +350,21 @@ static int __init init_mktme(void)
if (!mktme_prog_cache)
goto free_map;
+ /* Hardware programming targets */
+ if (mktme_alloc_pconfig_targets())
+ goto free_cache;
+
+ /* Initialize first programming targets */
+ mktme_update_pconfig_targets();
+
ret = register_key_type(&key_type_mktme);
if (!ret)
return ret; /* SUCCESS */
+
+ free_cpumask_var(mktme_leadcpus);
+ bitmap_free(mktme_target_map);
+free_cache:
+ kmem_cache_destroy(mktme_prog_cache);
free_map:
kvfree(mktme_map);
--
2.20.1
From: Alison Schofield <[email protected]>
MKTME (Multi-Key Total Memory Encryption) is a technology that allows
transparent memory encryption in upcoming Intel platforms. MKTME will
support multiple encryption domains, each having their own key.
The MKTME key service will manage the hardware encryption keys. It
will map Userspace Keys to Hardware KeyIDs and program the hardware
with the user requested encryption options.
Here the mapping structure and associated helpers are introduced,
as well as the key service initialization and registration.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
security/keys/Makefile | 1 +
security/keys/mktme_keys.c | 98 ++++++++++++++++++++++++++++++++++++++
2 files changed, 99 insertions(+)
create mode 100644 security/keys/mktme_keys.c
diff --git a/security/keys/Makefile b/security/keys/Makefile
index 9cef54064f60..28799be801a9 100644
--- a/security/keys/Makefile
+++ b/security/keys/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_ASYMMETRIC_KEY_TYPE) += keyctl_pkey.o
obj-$(CONFIG_BIG_KEYS) += big_key.o
obj-$(CONFIG_TRUSTED_KEYS) += trusted.o
obj-$(CONFIG_ENCRYPTED_KEYS) += encrypted-keys/
+obj-$(CONFIG_X86_INTEL_MKTME) += mktme_keys.o
diff --git a/security/keys/mktme_keys.c b/security/keys/mktme_keys.c
new file mode 100644
index 000000000000..b5e8289f041b
--- /dev/null
+++ b/security/keys/mktme_keys.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-3.0
+
+/* Documentation/x86/mktme_keys.rst */
+
+#include <linux/init.h>
+#include <linux/key.h>
+#include <linux/key-type.h>
+#include <linux/mm.h>
+#include <keys/user-type.h>
+
+#include "internal.h"
+
+/* 1:1 Mapping between Userspace Keys (struct key) and Hardware KeyIDs */
+struct mktme_mapping {
+ unsigned int mapped_keyids;
+ struct key *key[];
+};
+
+struct mktme_mapping *mktme_map;
+
+static inline long mktme_map_size(void)
+{
+ long size = 0;
+
+ size += sizeof(*mktme_map);
+ size += sizeof(mktme_map->key[0]) * (mktme_nr_keyids + 1);
+ return size;
+}
+
+int mktme_map_alloc(void)
+{
+ mktme_map = kvzalloc(mktme_map_size(), GFP_KERNEL);
+ if (!mktme_map)
+ return -ENOMEM;
+ return 0;
+}
+
+int mktme_reserve_keyid(struct key *key)
+{
+ int i;
+
+ if (mktme_map->mapped_keyids == mktme_nr_keyids)
+ return 0;
+
+ for (i = 1; i <= mktme_nr_keyids; i++) {
+ if (mktme_map->key[i] == 0) {
+ mktme_map->key[i] = key;
+ mktme_map->mapped_keyids++;
+ return i;
+ }
+ }
+ return 0;
+}
+
+void mktme_release_keyid(int keyid)
+{
+ mktme_map->key[keyid] = 0;
+ mktme_map->mapped_keyids--;
+}
+
+int mktme_keyid_from_key(struct key *key)
+{
+ int i;
+
+ for (i = 1; i <= mktme_nr_keyids; i++) {
+ if (mktme_map->key[i] == key)
+ return i;
+ }
+ return 0;
+}
+
+struct key_type key_type_mktme = {
+ .name = "mktme",
+ .describe = user_describe,
+};
+
+static int __init init_mktme(void)
+{
+ int ret;
+
+ /* Verify keys are present */
+ if (mktme_nr_keyids < 1)
+ return 0;
+
+ /* Mapping of Userspace Keys to Hardware KeyIDs */
+ if (mktme_map_alloc())
+ return -ENOMEM;
+
+ ret = register_key_type(&key_type_mktme);
+ if (!ret)
+ return ret; /* SUCCESS */
+
+ kvfree(mktme_map);
+
+ return -ENOMEM;
+}
+
+late_initcall(init_mktme);
--
2.20.1
From: Alison Schofield <[email protected]>
Platforms that need to confirm the presence of an HMAT table
can use this function that simply reports the HMATs existence.
This is added in support of the Multi-Key Total Memory Encryption
(MKTME), a feature on future Intel platforms. These platforms will
need to confirm an HMAT is present at init time.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/acpi/hmat/hmat.c | 13 +++++++++++++
include/linux/acpi.h | 4 ++++
2 files changed, 17 insertions(+)
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index 96b7d39a97c6..38e3341f569f 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -664,3 +664,16 @@ static __init int hmat_init(void)
return 0;
}
subsys_initcall(hmat_init);
+
+bool acpi_hmat_present(void)
+{
+ struct acpi_table_header *tbl;
+ acpi_status status;
+
+ status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return false;
+
+ acpi_put_table(tbl);
+ return true;
+}
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 75078fc9b6b3..fe3ad4ca5bb3 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1339,4 +1339,8 @@ acpi_platform_notify(struct device *dev, enum kobject_action action)
}
#endif
+#ifdef CONFIG_X86_INTEL_MKTME
+extern bool acpi_hmat_present(void);
+#endif /* CONFIG_X86_INTEL_MKTME */
+
#endif /*_LINUX_ACPI_H*/
--
2.20.1
Add new config option to enabled/disable Multi-Key Total Memory
Encryption support.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ce9642e2c31b..4d2cfee50102 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1533,6 +1533,27 @@ config AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
If set to N, then the encryption of system memory can be
activated with the mem_encrypt=on command line option.
+config X86_INTEL_MKTME
+ bool "Intel Multi-Key Total Memory Encryption"
+ select DYNAMIC_PHYSICAL_MASK
+ select PAGE_EXTENSION
+ select X86_MEM_ENCRYPT_COMMON
+ depends on X86_64 && CPU_SUP_INTEL && !KASAN
+ depends on KEYS
+ depends on !MEMORY_HOTPLUG_DEFAULT_ONLINE
+ depends on ACPI_HMAT
+ ---help---
+ Say yes to enable support for Multi-Key Total Memory Encryption.
+ This requires an Intel processor that has support of the feature.
+
+ Multikey Total Memory Encryption (MKTME) is a technology that allows
+ transparent memory encryption in upcoming Intel platforms.
+
+ MKTME is built on top of TME. TME allows encryption of the entirety
+ of system memory using a single key. MKTME allows having multiple
+ encryption domains, each having own key -- different memory pages can
+ be encrypted with different keys.
+
# Common NUMA Features
config NUMA
bool "Numa Memory Allocation and Scheduler Support"
@@ -2207,7 +2228,7 @@ config RANDOMIZE_MEMORY
config MEMORY_PHYSICAL_PADDING
hex "Physical memory mapping padding" if EXPERT
- depends on RANDOMIZE_MEMORY
+ depends on RANDOMIZE_MEMORY || X86_INTEL_MKTME
default "0xa" if MEMORY_HOTPLUG
default "0x0"
range 0x1 0x40 if MEMORY_HOTPLUG
--
2.20.1
From: Alison Schofield <[email protected]>
When a new PCONFIG target is added to an MKTME platform, its
key table needs to be programmed to match the key tables across
the entire platform. This type of newly added PCONFIG target
may appear during a memory hotplug event.
This key programming path will differ from the normal key
programming path in that it will only program a single PCONFIG
target, AND, it will only do that programming if allowed.
Allowed means that either user type keys are stored, or, no
user type keys are currently programmed.
So, after checking if programming is allowable, this helper
function will program the one new PCONFIG target, with all
the currently programmed keys.
This will be used in MKTME's memory notifier callback supporting
MEM_GOING_ONLINE events.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
security/keys/mktme_keys.c | 44 ++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/security/keys/mktme_keys.c b/security/keys/mktme_keys.c
index 2c975c48fe44..489dddb8c623 100644
--- a/security/keys/mktme_keys.c
+++ b/security/keys/mktme_keys.c
@@ -582,6 +582,50 @@ static int mktme_get_new_pconfig_target(void)
return new_target;
}
+static int mktme_program_new_pconfig_target(int new_pkg)
+{
+ struct mktme_payload *payload;
+ int cpu, keyid, ret;
+
+ /*
+ * Only program new target when user type keys are stored or,
+ * no user type keys are currently programmed.
+ */
+ if (!mktme_storekeys &&
+ (bitmap_weight(mktme_bitmap_user_type, mktme_nr_keyids)))
+ return -EPERM;
+
+ /* Set mktme_leadcpus to only include new target */
+ cpumask_clear(mktme_leadcpus);
+ for_each_online_cpu(cpu) {
+ if (topology_physical_package_id(cpu) == new_pkg) {
+ __cpumask_set_cpu(cpu, mktme_leadcpus);
+ break;
+ }
+ }
+ /* Program the stored keys into the new key table */
+ for (keyid = 1; keyid <= mktme_nr_keyids; keyid++) {
+ /*
+ * When a KeyID slot is not in use, the corresponding key
+ * pointer is 0. '-1' is an intermediate state where the
+ * key is on it's way out, but not gone yet. Program '-1's.
+ */
+ if (mktme_map->key[keyid] == 0)
+ continue;
+
+ payload = &mktme_key_store[keyid];
+ ret = mktme_program_keyid(keyid, payload);
+ if (ret != MKTME_PROG_SUCCESS) {
+ /* Quit on first failure to program key table */
+ pr_debug("mktme: %s\n", mktme_error[ret].msg);
+ ret = -ENOKEY;
+ break;
+ }
+ }
+ mktme_update_pconfig_targets(); /* Restore mktme_leadcpus */
+ return ret;
+}
+
static int __init init_mktme(void)
{
int ret, cpuhp;
--
2.20.1
From: Kai Huang <[email protected]>
Setup keyID to SPTE, which will be eventually programmed to shadow MMU
or EPT table, according to page's associated keyID, so that guest is
able to use correct keyID to access guest memory.
Note current shadow_me_mask doesn't suit MKTME's needs, since for MKTME
there's no fixed memory encryption mask, but can vary from keyID 1 to
maximum keyID, therefore shadow_me_mask remains 0 for MKTME.
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kvm/mmu.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d9c7b45d231f..bfee0c194161 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2899,6 +2899,22 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
#define SET_SPTE_WRITE_PROTECTED_PT BIT(0)
#define SET_SPTE_NEED_REMOTE_TLB_FLUSH BIT(1)
+static u64 get_phys_encryption_mask(kvm_pfn_t pfn)
+{
+#ifdef CONFIG_X86_INTEL_MKTME
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ return 0;
+
+ page = pfn_to_page(pfn);
+
+ return ((u64)page_keyid(page)) << mktme_keyid_shift;
+#else
+ return shadow_me_mask;
+#endif
+}
+
static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned pte_access, int level,
gfn_t gfn, kvm_pfn_t pfn, bool speculative,
@@ -2945,7 +2961,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
pte_access &= ~ACC_WRITE_MASK;
if (!kvm_is_mmio_pfn(pfn))
- spte |= shadow_me_mask;
+ spte |= get_phys_encryption_mask(pfn);
spte |= (u64)pfn << PAGE_SHIFT;
--
2.20.1
From: Jacob Pan <[email protected]>
Both Intel MKTME and AMD SME have needs to support DMA address
translation with encryption related bits. Common functions are
introduced in this patch to keep DMA generic code abstracted.
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 4 ++++
arch/x86/mm/Makefile | 1 +
arch/x86/mm/mem_encrypt_common.c | 28 ++++++++++++++++++++++++++++
3 files changed, 33 insertions(+)
create mode 100644 arch/x86/mm/mem_encrypt_common.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 62cfb381fee3..ce9642e2c31b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1505,11 +1505,15 @@ config X86_CPA_STATISTICS
config ARCH_HAS_MEM_ENCRYPT
def_bool y
+config X86_MEM_ENCRYPT_COMMON
+ def_bool n
+
config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
select DYNAMIC_PHYSICAL_MASK
select ARCH_USE_MEMREMAP_PROT
+ select X86_MEM_ENCRYPT_COMMON
---help---
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4ebee899c363..89dddbc01b1b 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -55,3 +55,4 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
obj-$(CONFIG_X86_INTEL_MKTME) += mktme.o
+obj-$(CONFIG_X86_MEM_ENCRYPT_COMMON) += mem_encrypt_common.o
diff --git a/arch/x86/mm/mem_encrypt_common.c b/arch/x86/mm/mem_encrypt_common.c
new file mode 100644
index 000000000000..2adee65eec46
--- /dev/null
+++ b/arch/x86/mm/mem_encrypt_common.c
@@ -0,0 +1,28 @@
+#include <linux/mm.h>
+#include <linux/mem_encrypt.h>
+#include <asm/mktme.h>
+
+/*
+ * Encryption bits need to be set and cleared for both Intel MKTME and
+ * AMD SME when converting between DMA address and physical address.
+ */
+dma_addr_t __mem_encrypt_dma_set(dma_addr_t daddr, phys_addr_t paddr)
+{
+ unsigned long keyid;
+
+ if (sme_active())
+ return __sme_set(daddr);
+ keyid = page_keyid(pfn_to_page(__phys_to_pfn(paddr)));
+
+ return (daddr & ~mktme_keyid_mask) | (keyid << mktme_keyid_shift);
+}
+EXPORT_SYMBOL_GPL(__mem_encrypt_dma_set);
+
+phys_addr_t __mem_encrypt_dma_clear(phys_addr_t paddr)
+{
+ if (sme_active())
+ return __sme_clr(paddr);
+
+ return paddr & ~mktme_keyid_mask;
+}
+EXPORT_SYMBOL_GPL(__mem_encrypt_dma_clear);
--
2.20.1
From: Alison Schofield <[email protected]>
MKTME, Multi-Key Total Memory Encryption, is a feature on Intel
platforms. The ACPI HMAT table can be used to verify that the
platform topology is safe for the usage of MKTME.
The kernel must be capable of programming every memory controller
on the platform. This means that there must be a CPU online, in
the same proximity domain of each memory controller.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/acpi/hmat/hmat.c | 54 ++++++++++++++++++++++++++++++++++++++++
include/linux/acpi.h | 1 +
2 files changed, 55 insertions(+)
diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c
index 38e3341f569f..936a403c0694 100644
--- a/drivers/acpi/hmat/hmat.c
+++ b/drivers/acpi/hmat/hmat.c
@@ -677,3 +677,57 @@ bool acpi_hmat_present(void)
acpi_put_table(tbl);
return true;
}
+
+static int mktme_parse_proximity_domains(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_proximity_domain *mar = (void *)header;
+ struct acpi_hmat_structure *hdr = (void *)header;
+
+ const struct cpumask *tmp_mask;
+
+ if (!hdr || hdr->type != ACPI_HMAT_TYPE_PROXIMITY)
+ return -EINVAL;
+
+ if (mar->header.length != sizeof(*mar)) {
+ pr_warn("MKTME: invalid header length in HMAT\n");
+ return -1;
+ }
+ /*
+ * Require a valid processor proximity domain.
+ * This will catch memory only physical packages with
+ * no processor capable of programming the key table.
+ */
+ if (!(mar->flags & ACPI_HMAT_PROCESSOR_PD_VALID)) {
+ pr_warn("MKTME: no valid processor proximity domain\n");
+ return -1;
+ }
+ /* Require an online CPU in the processor proximity domain. */
+ tmp_mask = cpumask_of_node(pxm_to_node(mar->processor_PD));
+ if (!cpumask_intersects(tmp_mask, cpu_online_mask)) {
+ pr_warn("MKTME: no online CPU in proximity domain\n");
+ return -1;
+ }
+ return 0;
+}
+
+/* Returns true if topology is safe for MKTME key creation */
+bool mktme_hmat_evaluate(void)
+{
+ struct acpi_table_header *tbl;
+ bool ret = true;
+ acpi_status status;
+
+ status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return -EINVAL;
+
+ if (acpi_table_parse_entries(ACPI_SIG_HMAT,
+ sizeof(struct acpi_table_hmat),
+ ACPI_HMAT_TYPE_PROXIMITY,
+ mktme_parse_proximity_domains, 0) < 0) {
+ ret = false;
+ }
+ acpi_put_table(tbl);
+ return ret;
+}
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index fe3ad4ca5bb3..82b270dfb785 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1341,6 +1341,7 @@ acpi_platform_notify(struct device *dev, enum kobject_action action)
#ifdef CONFIG_X86_INTEL_MKTME
extern bool acpi_hmat_present(void);
+extern bool mktme_hmat_evaluate(void);
#endif /* CONFIG_X86_INTEL_MKTME */
#endif /*_LINUX_ACPI_H*/
--
2.20.1
From: Alison Schofield <[email protected]>
Today mprotect is implemented to support legacy mprotect behavior
plus an extension for memory protection keys. Make it more generic
so that it can support additional extensions in the future.
This is done is preparation for adding a new system call for memory
encyption keys. The intent is that the new encrypted mprotect will be
another extension to legacy mprotect.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/mprotect.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e768cd656a48..23e680f4b1d5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,6 +35,8 @@
#include "internal.h"
+#define NO_KEY -1
+
static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable, int prot_numa)
@@ -452,9 +454,9 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
}
/*
- * pkey==-1 when doing a legacy mprotect()
+ * When pkey==NO_KEY we get legacy mprotect behavior here.
*/
-static int do_mprotect_pkey(unsigned long start, size_t len,
+static int do_mprotect_ext(unsigned long start, size_t len,
unsigned long prot, int pkey)
{
unsigned long nstart, end, tmp, reqprot;
@@ -578,7 +580,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
unsigned long, prot)
{
- return do_mprotect_pkey(start, len, prot, -1);
+ return do_mprotect_ext(start, len, prot, NO_KEY);
}
#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -586,7 +588,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
unsigned long, prot, int, pkey)
{
- return do_mprotect_pkey(start, len, prot, pkey);
+ return do_mprotect_ext(start, len, prot, pkey);
}
SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
--
2.20.1
From: Alison Schofield <[email protected]>
The ACPI HMAT will be used by the MKTME key service to identify
topologies that support the safe programming of encryption keys.
Those decisions will happen at key creation time and during
hotplug events.
To enable this, we at least need to have the ACPI HMAT present
at init time. If it's not present, do not register the type.
If the HMAT is not present, failure looks like this:
[ ] MKTME: Registration failed. ACPI HMAT not present.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
security/keys/mktme_keys.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/security/keys/mktme_keys.c b/security/keys/mktme_keys.c
index bcd68850048f..f5fc6cccc81b 100644
--- a/security/keys/mktme_keys.c
+++ b/security/keys/mktme_keys.c
@@ -2,6 +2,7 @@
/* Documentation/x86/mktme_keys.rst */
+#include <linux/acpi.h>
#include <linux/cred.h>
#include <linux/cpu.h>
#include <linux/init.h>
@@ -490,6 +491,12 @@ static int __init init_mktme(void)
if (mktme_nr_keyids < 1)
return 0;
+ /* Require an ACPI HMAT to identify MKTME safe topologies */
+ if (!acpi_hmat_present()) {
+ pr_warn("MKTME: Registration failed. ACPI HMAT not present.\n");
+ return -EINVAL;
+ }
+
/* Mapping of Userspace Keys to Hardware KeyIDs */
if (mktme_map_alloc())
return -ENOMEM;
--
2.20.1
We don't need to have separate implementations of alloc_hugepage_vma()
for NUMA and non-NUMA. Using variant based on alloc_pages_vma() we would
cover both cases.
This is preparation patch for allocation encrypted pages.
alloc_pages_vma() will handle allocation of encrypted pages. With this
change we don' t need to cover alloc_hugepage_vma() separately.
The change makes typo in Alpha's implementation of
__alloc_zeroed_user_highpage() visible. Fix it too.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/alpha/include/asm/page.h | 2 +-
include/linux/gfp.h | 6 ++----
2 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index f3fb2848470a..9a6fbb5269f3 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -18,7 +18,7 @@ extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
- alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vmaddr)
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
extern void copy_page(void * _to, void * _from);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fdab7de7490d..b101aa294157 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -511,21 +511,19 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
struct vm_area_struct *vma, unsigned long addr,
int node, bool hugepage);
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
- alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
#else
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
alloc_pages(gfp_mask, order)
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
- alloc_pages(gfp_mask, order)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+ alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
--
2.20.1
From: Alison Schofield <[email protected]>
MKTME (Multi-Key Total Memory Encryption) key payloads may include
data encryption keys, tweak keys, and additional entropy bits. These
are used to program the MKTME encryption hardware. By default, the
kernel destroys this payload data once the hardware is programmed.
However, in order to fully support Memory Hotplug, saving the key data
becomes important. The MKTME Key Service cannot allow a new memory
controller to come online unless it can program the Key Table to match
the Key Tables of all existing memory controllers.
With CPU generated keys (a.k.a. random keys or ephemeral keys) the
saving of user key data is not an issue. The kernel and MKTME hardware
can generate strong encryption keys without recalling any user supplied
data.
With USER directed keys (a.k.a. user type) saving the key programming
data (data and tweak key) becomes an issue. The data and tweak keys
are required to program those keys on a new physical package.
In preparation for adding support for onlining new memory:
Add an 'mktme_key_store' where key payloads are stored.
Add 'mktme_storekeys' kernel command line parameter that, when
present, allows the kernel to store user type key payloads.
Add 'mktme_bitmap_user_type' to recall when USER type keys are in
use. If no USER type keys are currently in use, new memory
may be brought online, despite the absence of 'mktme_storekeys'.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
.../admin-guide/kernel-parameters.rst | 1 +
.../admin-guide/kernel-parameters.txt | 11 ++++
security/keys/mktme_keys.c | 51 ++++++++++++++++++-
3 files changed, 61 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index b8d0bc07ed0a..1b62b86d0666 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -120,6 +120,7 @@ parameter is applicable::
Documentation/m68k/kernel-options.txt.
MDA MDA console support is enabled.
MIPS MIPS architecture is enabled.
+ MKTME Multi-Key Total Memory Encryption is enabled.
MOUSE Appropriate mouse support is enabled.
MSI Message Signaled Interrupts (PCI).
MTD MTD (Memory Technology Device) support is enabled.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2b8ee90bb644..38ea0ace9533 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2544,6 +2544,17 @@
in the "bleeding edge" mini2440 support kernel at
http://repo.or.cz/w/linux-2.6/mini2440.git
+ mktme_storekeys [X86, MKTME] When CONFIG_X86_INTEL_MKTME is set
+ this parameter allows the kernel to store the user
+ specified MKTME key payload. Storing this payload
+ means that the MKTME Key Service can always allow
+ the addition of new physical packages. If the
+ mktme_storekeys parameter is not present, users key
+ data will not be stored, and new physical packages
+ may only be added to the system if no user type
+ MKTME keys are programmed.
+ See Documentation/x86/mktme.rst
+
mminit_loglevel=
[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
diff --git a/security/keys/mktme_keys.c b/security/keys/mktme_keys.c
index 4b2d3dc1843a..bcd68850048f 100644
--- a/security/keys/mktme_keys.c
+++ b/security/keys/mktme_keys.c
@@ -22,6 +22,9 @@ static DEFINE_SPINLOCK(mktme_lock);
struct kmem_cache *mktme_prog_cache; /* Hardware programming cache */
unsigned long *mktme_target_map; /* Pconfig programming targets */
cpumask_var_t mktme_leadcpus; /* One lead CPU per pconfig target */
+static bool mktme_storekeys; /* True if key payloads may be stored */
+unsigned long *mktme_bitmap_user_type; /* Shows presence of user type keys */
+struct mktme_payload *mktme_key_store; /* Payload storage if allowed */
/* 1:1 Mapping between Userspace Keys (struct key) and Hardware KeyIDs */
struct mktme_mapping {
@@ -124,6 +127,27 @@ struct mktme_payload {
u8 tweak_key[MKTME_AES_XTS_SIZE];
};
+void mktme_store_payload(int keyid, struct mktme_payload *payload)
+{
+ /* Always remember if this key is of type "user" */
+ if ((payload->keyid_ctrl & 0xff) == MKTME_KEYID_SET_KEY_DIRECT)
+ set_bit(keyid, mktme_bitmap_user_type);
+ /*
+ * Always store the control fields to program newly
+ * onlined packages with RANDOM or NO_ENCRYPT keys.
+ */
+ mktme_key_store[keyid].keyid_ctrl = payload->keyid_ctrl;
+
+ /* Only store "user" type data and tweak keys if allowed */
+ if (mktme_storekeys &&
+ ((payload->keyid_ctrl & 0xff) == MKTME_KEYID_SET_KEY_DIRECT)) {
+ memcpy(mktme_key_store[keyid].data_key, payload->data_key,
+ MKTME_AES_XTS_SIZE);
+ memcpy(mktme_key_store[keyid].tweak_key, payload->tweak_key,
+ MKTME_AES_XTS_SIZE);
+ }
+}
+
struct mktme_hw_program_info {
struct mktme_key_program *key_program;
int *status;
@@ -270,9 +294,10 @@ int mktme_instantiate_key(struct key *key, struct key_preparsed_payload *prep)
0, GFP_KERNEL))
goto err_out;
- if (!mktme_program_keyid(keyid, payload))
+ if (!mktme_program_keyid(keyid, payload)) {
+ mktme_store_payload(keyid, payload);
return MKTME_PROG_SUCCESS;
-
+ }
percpu_ref_exit(&encrypt_count[keyid]);
err_out:
spin_lock_irqsave(&mktme_lock, flags);
@@ -487,10 +512,25 @@ static int __init init_mktme(void)
if (!encrypt_count)
goto free_targets;
+ /* Detect presence of user type keys */
+ mktme_bitmap_user_type = bitmap_zalloc(mktme_nr_keyids, GFP_KERNEL);
+ if (!mktme_bitmap_user_type)
+ goto free_encrypt;
+
+ /* Store key payloads if allowable */
+ mktme_key_store = kzalloc(sizeof(mktme_key_store[0]) *
+ (mktme_nr_keyids + 1), GFP_KERNEL);
+ if (!mktme_key_store)
+ goto free_bitmap;
+
ret = register_key_type(&key_type_mktme);
if (!ret)
return ret; /* SUCCESS */
+ kfree(mktme_key_store);
+free_bitmap:
+ bitmap_free(mktme_bitmap_user_type);
+free_encrypt:
kvfree(encrypt_count);
free_targets:
free_cpumask_var(mktme_leadcpus);
@@ -504,3 +544,10 @@ static int __init init_mktme(void)
}
late_initcall(init_mktme);
+
+static int mktme_enable_storekeys(char *__unused)
+{
+ mktme_storekeys = true;
+ return 1;
+}
+__setup("mktme_storekeys", mktme_enable_storekeys);
--
2.20.1
From: Alison Schofield <[email protected]>
The MKTME (Multi-Key Total Memory Encryption) Key Service needs
a reference count on encrypted VMAs. This reference count is used
to determine when a hardware encryption KeyID is no longer in use
and can be freed and reassigned to another Userspace Key.
The MKTME Key service does the percpu_ref_init and _kill, so
these gets/puts on encrypted VMA's can be considered the
intermediaries in the lifetime of the key.
Increment/decrement the reference count during encrypt_mprotect()
system call for initial or updated encryption on a VMA.
Piggy back on the vm_area_dup/free() helpers. If the VMAs being
duplicated, or freed are encrypted, adjust the reference count.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 5 +++++
arch/x86/mm/mktme.c | 37 ++++++++++++++++++++++++++++++++++--
include/linux/mm.h | 2 ++
kernel/fork.c | 2 ++
4 files changed, 44 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index 0e6df07f1921..14da002d2e85 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -16,6 +16,11 @@ extern int mktme_keyid_shift;
extern void mprotect_set_encrypt(struct vm_area_struct *vma, int newkeyid,
unsigned long start, unsigned long end);
+/* MTKME encrypt_count for VMAs */
+extern struct percpu_ref *encrypt_count;
+extern void vma_get_encrypt_ref(struct vm_area_struct *vma);
+extern void vma_put_encrypt_ref(struct vm_area_struct *vma);
+
DECLARE_STATIC_KEY_FALSE(mktme_enabled_key);
static inline bool mktme_enabled(void)
{
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index 91b49e88ca3f..df70651816a1 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -66,11 +66,12 @@ void mprotect_set_encrypt(struct vm_area_struct *vma, int newkeyid,
if (oldkeyid == newkeyid)
return;
-
+ vma_put_encrypt_ref(vma);
newprot = pgprot_val(vma->vm_page_prot);
newprot &= ~mktme_keyid_mask;
newprot |= (unsigned long)newkeyid << mktme_keyid_shift;
vma->vm_page_prot = __pgprot(newprot);
+ vma_get_encrypt_ref(vma);
/*
* The VMA doesn't have any inherited pages.
@@ -79,6 +80,18 @@ void mprotect_set_encrypt(struct vm_area_struct *vma, int newkeyid,
unlink_anon_vmas(vma);
}
+void vma_get_encrypt_ref(struct vm_area_struct *vma)
+{
+ if (vma_keyid(vma))
+ percpu_ref_get(&encrypt_count[vma_keyid(vma)]);
+}
+
+void vma_put_encrypt_ref(struct vm_area_struct *vma)
+{
+ if (vma_keyid(vma))
+ percpu_ref_put(&encrypt_count[vma_keyid(vma)]);
+}
+
/* Prepare page to be used for encryption. Called from page allocator. */
void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
{
@@ -102,6 +115,22 @@ void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
page++;
}
+
+ /*
+ * Make sure the KeyID cannot be freed until the last page that
+ * uses the KeyID is gone.
+ *
+ * This is required because the page may live longer than VMA it
+ * is mapped into (i.e. in get_user_pages() case) and having
+ * refcounting per-VMA is not enough.
+ *
+ * Taking a reference per-4K helps in case if the page will be
+ * split after the allocation. free_encrypted_page() will balance
+ * out the refcount even if the page was split and freed as bunch
+ * of 4K pages.
+ */
+
+ percpu_ref_get_many(&encrypt_count[keyid], 1 << order);
}
/*
@@ -110,7 +139,9 @@ void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
*/
void free_encrypted_page(struct page *page, int order)
{
- int i;
+ int i, keyid;
+
+ keyid = page_keyid(page);
/*
* The hardware/CPU does not enforce coherency between mappings
@@ -125,6 +156,8 @@ void free_encrypted_page(struct page *page, int order)
lookup_page_ext(page)->keyid = 0;
page++;
}
+
+ percpu_ref_put_many(&encrypt_count[keyid], 1 << order);
}
static int sync_direct_mapping_pte(unsigned long keyid,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7f52d053826..00c0fd70816b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2831,6 +2831,8 @@ static inline void mprotect_set_encrypt(struct vm_area_struct *vma,
int newkeyid,
unsigned long start,
unsigned long end) {}
+static inline void vma_get_encrypt_ref(struct vm_area_struct *vma) {}
+static inline void vma_put_encrypt_ref(struct vm_area_struct *vma) {}
#endif /* CONFIG_X86_INTEL_MKTME */
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9dcd18aa210b..f0e35ed76f5a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -342,12 +342,14 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
if (new) {
*new = *orig;
INIT_LIST_HEAD(&new->anon_vma_chain);
+ vma_get_encrypt_ref(new);
}
return new;
}
void vm_area_free(struct vm_area_struct *vma)
{
+ vma_put_encrypt_ref(vma);
kmem_cache_free(vm_area_cachep, vma);
}
--
2.20.1
Zero pages are never encrypted. Keep KeyID-0 for them.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/pgtable.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 50b3e2d963c9..59c3dd50b8d5 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -803,6 +803,19 @@ static inline unsigned long pmd_index(unsigned long address)
*/
#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot))
+#define mk_zero_pte mk_zero_pte
+static inline pte_t mk_zero_pte(unsigned long addr, pgprot_t prot)
+{
+ extern unsigned long zero_pfn;
+ pte_t entry;
+
+ prot.pgprot &= ~mktme_keyid_mask;
+ entry = pfn_pte(zero_pfn, prot);
+ entry = pte_mkspecial(entry);
+
+ return entry;
+}
+
/*
* the pte page can be thought of an array like this: pte_t[PTRS_PER_PTE]
*
@@ -1133,6 +1146,12 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot))
+#define mk_zero_pmd(zero_page, prot) \
+({ \
+ prot.pgprot &= ~mktme_keyid_mask; \
+ pmd_mkhuge(mk_pmd(zero_page, prot)); \
+})
+
#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
extern int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
--
2.20.1
From: Alison Schofield <[email protected]>
Implement memory encryption for MKTME (Multi-Key Total Memory
Encryption) with a new system call that is an extension of the
legacy mprotect() system call.
In encrypt_mprotect the caller must pass a handle to a previously
allocated and programmed MKTME encryption key. The key can be
obtained through the kernel key service type "mktme". The caller
must have KEY_NEED_VIEW permission on the key.
MKTME places an additional restriction on the protected data:
The length of the data must be page aligned. This is in addition
to the existing mprotect restriction that the addr must be page
aligned.
encrypt_mprotect() will lookup the hardware keyid for the given
userspace key. It will use previously defined helpers to insert
that keyid in the VMAs during legacy mprotect() execution.
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/exec.c | 4 +--
include/linux/mm.h | 3 +-
mm/mprotect.c | 68 +++++++++++++++++++++++++++++++++++++++++-----
3 files changed, 65 insertions(+), 10 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 2e0033348d8e..695c121b34b3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -755,8 +755,8 @@ int setup_arg_pages(struct linux_binprm *bprm,
vm_flags |= mm->def_flags;
vm_flags |= VM_STACK_INCOMPLETE_SETUP;
- ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
- vm_flags);
+ ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end, vm_flags,
+ -1);
if (ret)
goto out_unlock;
BUG_ON(prev != vma);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c027044de9bf..a7f52d053826 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1634,7 +1634,8 @@ extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long
int dirty_accountable, int prot_numa);
extern int mprotect_fixup(struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
- unsigned long end, unsigned long newflags);
+ unsigned long end, unsigned long newflags,
+ int newkeyid);
/*
* doesn't attempt to fault and will return short.
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 23e680f4b1d5..38d766b5cc20 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,6 +28,7 @@
#include <linux/ksm.h>
#include <linux/uaccess.h>
#include <linux/mm_inline.h>
+#include <linux/key.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
@@ -347,7 +348,8 @@ static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,
int
mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
- unsigned long start, unsigned long end, unsigned long newflags)
+ unsigned long start, unsigned long end, unsigned long newflags,
+ int newkeyid)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long oldflags = vma->vm_flags;
@@ -357,7 +359,14 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
int error;
int dirty_accountable = 0;
- if (newflags == oldflags) {
+ /*
+ * Flags match and Keyids match or we have NO_KEY.
+ * This _fixup is usually called from do_mprotect_ext() except
+ * for one special case: caller fs/exec.c/setup_arg_pages()
+ * In that case, newkeyid is passed as -1 (NO_KEY).
+ */
+ if (newflags == oldflags &&
+ (newkeyid == vma_keyid(vma) || newkeyid == NO_KEY)) {
*pprev = vma;
return 0;
}
@@ -423,6 +432,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
}
success:
+ if (newkeyid != NO_KEY)
+ mprotect_set_encrypt(vma, newkeyid, start, end);
/*
* vm_flags and vm_page_prot are protected by the mmap_sem
* held in write mode.
@@ -454,10 +465,15 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
}
/*
- * When pkey==NO_KEY we get legacy mprotect behavior here.
+ * do_mprotect_ext() supports the legacy mprotect behavior plus extensions
+ * for Protection Keys and Memory Encryption Keys. These extensions are
+ * mutually exclusive and the behavior is:
+ * (pkey==NO_KEY && keyid==NO_KEY) ==> legacy mprotect
+ * (pkey is valid) ==> legacy mprotect plus Protection Key extensions
+ * (keyid is valid) ==> legacy mprotect plus Encryption Key extensions
*/
static int do_mprotect_ext(unsigned long start, size_t len,
- unsigned long prot, int pkey)
+ unsigned long prot, int pkey, int keyid)
{
unsigned long nstart, end, tmp, reqprot;
struct vm_area_struct *vma, *prev;
@@ -555,7 +571,8 @@ static int do_mprotect_ext(unsigned long start, size_t len,
tmp = vma->vm_end;
if (tmp > end)
tmp = end;
- error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
+ error = mprotect_fixup(vma, &prev, nstart, tmp, newflags,
+ keyid);
if (error)
goto out;
nstart = tmp;
@@ -580,7 +597,7 @@ static int do_mprotect_ext(unsigned long start, size_t len,
SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
unsigned long, prot)
{
- return do_mprotect_ext(start, len, prot, NO_KEY);
+ return do_mprotect_ext(start, len, prot, NO_KEY, NO_KEY);
}
#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -588,7 +605,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
unsigned long, prot, int, pkey)
{
- return do_mprotect_ext(start, len, prot, pkey);
+ return do_mprotect_ext(start, len, prot, pkey, NO_KEY);
}
SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
@@ -637,3 +654,40 @@ SYSCALL_DEFINE1(pkey_free, int, pkey)
}
#endif /* CONFIG_ARCH_HAS_PKEYS */
+
+#ifdef CONFIG_X86_INTEL_MKTME
+
+extern int mktme_keyid_from_key(struct key *key);
+
+SYSCALL_DEFINE4(encrypt_mprotect, unsigned long, start, size_t, len,
+ unsigned long, prot, key_serial_t, serial)
+{
+ key_ref_t key_ref;
+ struct key *key;
+ int ret, keyid;
+
+ /* MKTME restriction */
+ if (!PAGE_ALIGNED(len))
+ return -EINVAL;
+
+ /*
+ * key_ref prevents the destruction of the key
+ * while the memory encryption is being set up.
+ */
+
+ key_ref = lookup_user_key(serial, 0, KEY_NEED_VIEW);
+ if (IS_ERR(key_ref))
+ return PTR_ERR(key_ref);
+
+ key = key_ref_to_ptr(key_ref);
+ keyid = mktme_keyid_from_key(key);
+ if (!keyid) {
+ key_ref_put(key_ref);
+ return -EINVAL;
+ }
+ ret = do_mprotect_ext(start, len, prot, NO_KEY, keyid);
+ key_ref_put(key_ref);
+ return ret;
+}
+
+#endif /* CONFIG_X86_INTEL_MKTME */
--
2.20.1
From: Alison Schofield <[email protected]>
This is a draft for poweron testing.
I'm assuming it needs to be in Intel-next to be available for poweron.
It is not in the selftest Makefiles.
COMPILE w keyutils library ==> cc -o mktest mktme_test.c -lkeyutils
Usage: mktme_test [options]...
-a Run ALL tests
-t <testnum> Run one <testnum> test
-l List available tests
-h, -? Show this help
mktest -l
[ 1] Keys: Add each type key
[ 2] Flow: One simple roundtrip
[ 3] Keys: Valid Payload Options
[ 4] Keys: Invalid Payload Options
[ 5] Keys: Add Key Descriptor Field
[ 6] Keys: Add Multiple Same
[ 7] Keys: Change payload, auto update
[ 8] Keys: Update, explicit update
[ 9] Keys: Update, Clear
[10] Keys: Add, Invalidate Keys
[11] Keys: Add, Revoke Keys
[12] Keys: Keyctl Describe
[13] Keys: Clear
[14] Keys: No Encrypt
[15] Keys: Unique KeyIDs
[16] Keys: Get Max KeyIDs
[17] Encrypt: Parameter Alignment
[18] Encrypt: Change Protections
[19] Encrypt: Swap Keys
[20] Encrypt: Counters Same Key
[21] Encrypt: Counters Diff Key
[22] Encrypt: Counters Holes
[23] Flow: Switch key no data
[24] Flow: Switch key multi VMAs
[25] Flow: Switch No Key to Any Key
[26] Flow: madvise
[27] Flow: Invalidate In Use Key
Signed-off-by: Alison Schofield <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
.../selftests/x86/mktme/encrypt_tests.c | 433 ++++++++++++++
.../testing/selftests/x86/mktme/flow_tests.c | 266 +++++++++
tools/testing/selftests/x86/mktme/key_tests.c | 526 ++++++++++++++++++
.../testing/selftests/x86/mktme/mktme_test.c | 300 ++++++++++
4 files changed, 1525 insertions(+)
create mode 100644 tools/testing/selftests/x86/mktme/encrypt_tests.c
create mode 100644 tools/testing/selftests/x86/mktme/flow_tests.c
create mode 100644 tools/testing/selftests/x86/mktme/key_tests.c
create mode 100644 tools/testing/selftests/x86/mktme/mktme_test.c
diff --git a/tools/testing/selftests/x86/mktme/encrypt_tests.c b/tools/testing/selftests/x86/mktme/encrypt_tests.c
new file mode 100644
index 000000000000..735d5da89d29
--- /dev/null
+++ b/tools/testing/selftests/x86/mktme/encrypt_tests.c
@@ -0,0 +1,433 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* x86 MKTME Encrypt API Tests */
+
+/* Address & length parameters to encrypt_mprotect() must be page aligned */
+void test_param_alignment(void)
+{
+ size_t datalen = PAGE_SIZE * 2;
+ key_serial_t key;
+ int ret, i;
+ char *buf;
+
+ key = add_key("mktme", "keyname", options_CPU_long,
+ strlen(options_CPU_long), KEY_SPEC_THREAD_KEYRING);
+
+ if (key == -1) {
+ perror("test_param_alignment");
+ return;
+ }
+ buf = (char *)mmap(NULL, datalen, PROT_NONE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+
+ /* Fail if addr is not page aligned */
+ ret = syscall(sys_encrypt_mprotect, buf + 100, datalen / 2, PROT_NONE,
+ key);
+ if (!ret)
+ fprintf(stderr, "Error: addr is not page aligned\n");
+
+ /* Fail if len is not page aligned */
+ ret = syscall(sys_encrypt_mprotect, buf, 9, PROT_NONE, key);
+ if (!ret)
+ fprintf(stderr, "Error: len is not page aligned.");
+
+ /* Fail if both addr and len are not page aligned */
+ ret = syscall(sys_encrypt_mprotect, buf + 100, datalen + 100,
+ PROT_READ | PROT_WRITE, key);
+ if (!ret)
+ fprintf(stderr, "Error: addr and len are not page aligned\n");
+
+ /* Success if both addr and len are page aligned */
+ ret = syscall(sys_encrypt_mprotect, buf, datalen,
+ PROT_READ | PROT_WRITE, key);
+
+ if (ret)
+ fprintf(stderr, "Fail: addr and len are both page aligned\n");
+
+ ret = munmap(buf, datalen);
+
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Error: invalidate failed on key [%d]\n", key);
+}
+
+/*
+ * Do encrypt_mprotect and follow with classic mprotects.
+ * KeyID should remain unchanged.
+ */
+void test_change_protections(void)
+{
+ unsigned int keyid, check_keyid;
+ key_serial_t key;
+ void *ptra;
+ int ret, i;
+
+ const int prots[] = {
+ PROT_NONE, PROT_READ, PROT_WRITE, PROT_EXEC,
+ PROT_READ | PROT_WRITE, PROT_READ | PROT_EXEC,
+ };
+
+ key = add_key("mktme", "testkey", options_CPU_long,
+ strlen(options_CPU_long), KEY_SPEC_THREAD_KEYRING);
+ if (key == -1) {
+ perror(__func__);
+ return;
+ }
+ ptra = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE,
+ -1, 0);
+ if (!ptra) {
+ fprintf(stderr, "Error: mmap failed.");
+ goto revoke_key;
+ }
+ /* Encrypt Memory */
+ ret = syscall(sys_encrypt_mprotect, ptra, PAGE_SIZE, PROT_NONE, key);
+ if (ret)
+ fprintf(stderr, "Error: encrypt_mprotect [%d]\n", ret);
+
+ /* Remember the assigned KeyID */
+ keyid = find_smaps_keyid((unsigned long)ptra);
+
+ /* Classic mprotects() should not change KeyID. */
+ for (i = 0; i < ARRAY_SIZE(prots); i++) {
+ ret = mprotect(ptra, PAGE_SIZE, prots[i]);
+ if (ret)
+ fprintf(stderr, "Error: encrypt_mprotect [%d]\n", ret);
+
+ check_keyid = find_smaps_keyid((unsigned long)ptra);
+ if (keyid != check_keyid)
+ fprintf(stderr, "Error: keyid change not expected\n");
+ };
+free_memory:
+ ret = munmap(ptra, PAGE_SIZE);
+revoke_key:
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Error: invalidate failed. [%d]\n", key);
+}
+
+/*
+ * Make one mapping and create a bunch of keys.
+ * Encrypt that one mapping repeatedly with different keys.
+ * Verify the KeyID changes in smaps.
+ */
+void test_key_swap(void)
+{
+ unsigned int prev_keyid, next_keyid;
+ int maxswaps = max_keyids / 2; /* Not too many swaps */
+ key_serial_t key[maxswaps];
+ long size = PAGE_SIZE;
+ int keys_available = 0;
+ char name[12];
+ void *ptra;
+ int i, ret;
+
+ for (i = 0; i < maxswaps; i++) {
+ sprintf(name, "mk_swap_%d", i);
+ key[i] = add_key("mktme", name, options_CPU_long,
+ strlen(options_CPU_long),
+ KEY_SPEC_THREAD_KEYRING);
+ if (key[i] == -1) {
+ perror(__func__);
+ goto free_keys;
+ } else {
+ keys_available++;
+ }
+ }
+
+ printf(" Info: created %d keys\n", keys_available);
+ ptra = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (!ptra) {
+ perror("mmap");
+ goto free_keys;
+ }
+ prev_keyid = 0;
+
+ for (i = 0; i < keys_available; i++) {
+ ret = syscall(sys_encrypt_mprotect, ptra, size,
+ PROT_NONE, key[i]);
+ if (ret) {
+ perror("encrypt_mprotect");
+ goto free_memory;
+ }
+
+ next_keyid = find_smaps_keyid((unsigned long)ptra);
+ if (prev_keyid == next_keyid)
+ fprintf(stderr, "Error %s: expected new keyid\n",
+ __func__);
+ prev_keyid = next_keyid;
+ }
+free_memory:
+ ret = munmap(ptra, size);
+
+free_keys:
+ for (i = 0; i < keys_available; i++) {
+ if (keyctl(KEYCTL_INVALIDATE, key[i]) == -1)
+ perror(__func__);
+ }
+}
+
+/*
+ * These may not be doing as orig planned. Need to check that key is
+ * invalidated and then gets destroyed when last map is removed.
+ */
+void test_counters_same(void)
+{
+ key_serial_t key;
+ int count = 4;
+ void *ptr[count];
+ int ret, i;
+
+ /* Get 4 pieces of memory */
+ i = count;
+ while (i--) {
+ ptr[i] = mmap(NULL, PAGE_SIZE, PROT_NONE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (!ptr[i])
+ perror("mmap");
+ }
+ /* Protect with same key */
+ key = add_key("mktme", "mk_same", options_USER, strlen(options_USER),
+ KEY_SPEC_THREAD_KEYRING);
+
+ if (key == -1) {
+ perror("add_key");
+ goto free_mem;
+ }
+ i = count;
+ while (i--) {
+ ret = syscall(sys_encrypt_mprotect, ptr[i], PAGE_SIZE,
+ PROT_NONE, key);
+ if (ret)
+ perror("encrypt_mprotect");
+ }
+ /* Discard Key & Unmap Memory (order irrelevant) */
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Error: invalidate failed.\n");
+free_mem:
+ i = count;
+ while (i--)
+ ret = munmap(ptr[i], PAGE_SIZE);
+}
+
+void test_counters_diff(void)
+{
+ int prot = PROT_READ | PROT_WRITE;
+ long size = PAGE_SIZE;
+ int ret, i;
+ int loop = 4;
+ char name[12];
+ void *ptr[loop];
+ key_serial_t diffkey[loop];
+
+ i = loop;
+ while (i--)
+ ptr[i] = mmap(NULL, size, prot, MAP_ANONYMOUS | MAP_PRIVATE,
+ -1, 0);
+ i = loop;
+ while (i--) {
+ sprintf(name, "cheese_%d", i);
+ diffkey[i] = add_key("mktme", name, options_USER,
+ strlen(options_USER),
+ KEY_SPEC_THREAD_KEYRING);
+ ret = syscall(sys_encrypt_mprotect, ptr[i], size, prot,
+ diffkey[i]);
+ if (ret)
+ perror("encrypt_mprotect");
+ }
+
+ i = loop;
+ while (i--)
+ ret = munmap(ptr[i], PAGE_SIZE);
+
+ i = loop;
+ while (i--) {
+ if (keyctl(KEYCTL_INVALIDATE, diffkey[i]) == -1)
+ fprintf(stderr, "Error: invalidate failed key:%d\n",
+ diffkey[i]);
+ }
+}
+
+void test_counters_holes(void)
+{
+ int prot = PROT_READ | PROT_WRITE;
+ long size = PAGE_SIZE;
+ int ret, i;
+ int loop = 6;
+ void *ptr[loop];
+ key_serial_t samekey;
+
+ samekey = add_key("mktme", "gouda", options_CPU_long,
+ strlen(options_CPU_long), KEY_SPEC_THREAD_KEYRING);
+
+ i = loop;
+ while (i--) {
+ ptr[i] = mmap(NULL, size, prot, MAP_ANONYMOUS | MAP_PRIVATE,
+ -1, 0);
+ if (i % 2) {
+ ret = syscall(sys_encrypt_mprotect, ptr[i], size, prot,
+ samekey);
+ if (ret)
+ perror("mprotect error");
+ }
+ }
+
+ i = loop;
+ while (i--)
+ ret = munmap(ptr[i], size);
+
+ if (keyctl(KEYCTL_INVALIDATE, samekey) == -1)
+ fprintf(stderr, "Error: invalidate failed\n");
+}
+
+/*
+ * Try on SIMICs. See is SIMICs 'a1a1' thing does the trick.
+ * May need real hardware.
+ * One buffer -> encrypt entirety w one key
+ * Same buffer -> encrypt in pieces w different keys
+ */
+void test_split(void)
+{
+ int prot = PROT_READ | PROT_WRITE;
+ int ret, i;
+ int pieces = 10;
+ size_t len = PAGE_SIZE;
+ char name[12];
+ char *buf;
+ key_serial_t firstkey;
+ key_serial_t diffkey[pieces];
+
+ /* get one piece of memory, protect it, memset it */
+ buf = (char *)mmap(NULL, len, PROT_NONE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+
+ firstkey = add_key("mktme", "firstkey", options_CPU_long,
+ strlen(options_CPU_long),
+ KEY_SPEC_THREAD_KEYRING);
+
+ ret = syscall(sys_encrypt_mprotect, buf, len, PROT_READ | PROT_WRITE,
+ firstkey);
+
+ if (ret) {
+ printf("firstkey mprotect error:%d\n", ret);
+ goto free_mem;
+ }
+
+ memset(buf, 9, len);
+ /*
+ * Encrypt pieces of buf with different encryption keys.
+ * Expect to see the data in those pieces zero'd
+ */
+ for (i = 0; i < pieces; i++) {
+ sprintf(name, "cheese_%d", i);
+ diffkey[i] = add_key("mktme", name, options_CPU_long,
+ strlen(options_CPU_long),
+ KEY_SPEC_THREAD_KEYRING);
+ ret = syscall(sys_encrypt_mprotect, (buf + (i * len)), len,
+ PROT_READ | PROT_WRITE, diffkey[i]);
+ if (ret)
+ printf("diff key mprotect error:%d\n", ret);
+ else
+ printf("done protecting w i:%d key[%d]\n", i,
+ diffkey[i]);
+ }
+ printf("SIMICs - this should NOT be all 'f's.\n");
+ for (i = 0; i < len; i++)
+ printf("-%x", buf[i]);
+ printf("\n");
+
+ getchar();
+ i = pieces;
+ for (i = 0; i < pieces; i++) {
+ if (keyctl(KEYCTL_INVALIDATE, diffkey[i]) == -1)
+ fprintf(stderr, "invalidate failed key:%d\n",
+ diffkey[i]);
+ }
+ if (keyctl(KEYCTL_INVALIDATE, firstkey) == -1)
+ fprintf(stderr, "invalidate failed on key:%d\n", firstkey);
+free_mem:
+ ret = munmap(buf, len);
+}
+
+void test_well_suited(void)
+{
+ int prot;
+ long size = PAGE_SIZE;
+ int ret, i;
+ int loop = 6;
+ void *ptr[loop];
+ key_serial_t key;
+ void *addr, *first;
+
+ /* mmap alternating protections so that we get loop# of vma's */
+ i = loop;
+ /* map the first one */
+ first = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+
+ addr = first + PAGE_SIZE;
+ i--;
+ while (i--) {
+ prot = (i % 2) ? PROT_READ : PROT_WRITE;
+ ptr[i] = mmap(addr, size, prot, MAP_ANONYMOUS | MAP_PRIVATE,
+ -1, 0);
+ addr = addr + PAGE_SIZE;
+ }
+ /* Protect with same key */
+ key = add_key("mktme", "mk_suited954", options_USER,
+ strlen(options_USER), KEY_SPEC_THREAD_KEYRING);
+
+ /* Changing FLAGS and adding KEY */
+ ret = syscall(sys_encrypt_mprotect, ptr[0], (loop * PAGE_SIZE),
+ PROT_EXEC, key);
+ if (ret)
+ fprintf(stderr, "Error: encrypt_mprotect [%d]\n", ret);
+
+ i = loop;
+ while (i--)
+ ret = munmap(ptr[i], size);
+
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Error: invalidate failed\n");
+}
+
+void test_not_suited(int argc, char *argv[])
+{
+ int prot;
+ int protA = PROT_READ;
+ int protB = PROT_WRITE;
+ int flagsA = MAP_ANONYMOUS | MAP_PRIVATE;
+ int flagsB = MAP_SHARED | MAP_ANONYMOUS;
+ int flags;
+ int ret, i;
+ int loop = 6;
+ void *ptr[loop];
+ key_serial_t key;
+
+ printf("loop count [%d]\n", loop);
+
+ /* mmap alternating protections so that we get loop# of vma's */
+ i = loop;
+ while (i--) {
+ prot = (i % 2) ? PROT_READ : PROT_WRITE;
+ if (i == 2)
+ flags = flagsB;
+ else
+ flags = flagsA;
+ ptr[i] = mmap(NULL, PAGE_SIZE, prot, flags, -1, 0);
+ }
+
+ /* protect with same key */
+ key = add_key("mktme", "mk_notsuited", options_CPU_long,
+ strlen(options_CPU_long), KEY_SPEC_THREAD_KEYRING);
+
+ /* Changing FLAGS and adding KEY */
+ ret = syscall(sys_encrypt_mprotect, ptr[0], (loop * PAGE_SIZE),
+ PROT_EXEC, key);
+ if (!ret)
+ fprintf(stderr, "Error: expected encrypt_mprotect to fail.\n");
+
+ i = loop;
+ while (i--)
+ ret = munmap(ptr[i], PAGE_SIZE);
+
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Error: invalidate failed.\n");
+}
+
diff --git a/tools/testing/selftests/x86/mktme/flow_tests.c b/tools/testing/selftests/x86/mktme/flow_tests.c
new file mode 100644
index 000000000000..87b17d3bf142
--- /dev/null
+++ b/tools/testing/selftests/x86/mktme/flow_tests.c
@@ -0,0 +1,266 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * x86 MKTME: API Tests
+ *
+ * Flow Tests either
+ * 1) Validate some interaction between the 2 API's: Key & Encrypt
+ * 2) or, Validate code flows, scenarios, known/fixed issues.
+ */
+
+/*
+ * Userspace Keys with outstanding memory mappings can be discarded,
+ * (discarded == revoke, invalidate, expire, unlink)
+ * The paired KeyID will not be freed for reuse until the last memory
+ * mapping is unmapped.
+ */
+void test_discard_in_use_key(void)
+{
+ key_serial_t key;
+ void *ptra;
+ int ret;
+
+ key = add_key("mktme", "discard-test", options_CPU_long,
+ strlen(options_CPU_long), KEY_SPEC_THREAD_KEYRING);
+
+ if (key == -1) {
+ perror("add key");
+ return;
+ }
+ ptra = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE,
+ -1, 0);
+ if (!ptra) {
+ fprintf(stderr, "Error: mmap failed. ");
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Error: invalidate failed. Key:%d\n",
+ key);
+ return;
+ }
+ ret = syscall(sys_encrypt_mprotect, ptra, PAGE_SIZE, PROT_NONE, key);
+ if (ret) {
+ fprintf(stderr, "Error: encrypt_mprotect: %d\n", ret);
+ goto free_memory;
+ }
+ if (keyctl(KEYCTL_INVALIDATE, key) != 0)
+ fprintf(stderr, "Error: test_revoke_in_use_key\n");
+free_memory:
+ ret = munmap(ptra, PAGE_SIZE);
+}
+
+/* TODO: Can this be made useful? Used to reproduce a trace in Kai's setup. */
+void test_kai_madvise(void)
+{
+ key_serial_t key;
+ void *ptra;
+ int ret;
+
+ key = add_key("mktme", "testkey", options_USER, strlen(options_USER),
+ KEY_SPEC_THREAD_KEYRING);
+
+ if (key == -1) {
+ perror("add_key");
+ return;
+ }
+
+ /* TODO wanted MAP_FIXED here - but kept failing to mmap */
+ ptra = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (!ptra) {
+ perror("failed to mmap");
+ goto revoke_key;
+ }
+
+ ret = madvise(ptra, PAGE_SIZE, MADV_MERGEABLE);
+ if (ret)
+ perror("madvise err mergeable");
+
+ if ((madvise(ptra, PAGE_SIZE, MADV_HUGEPAGE)) != 0)
+ perror("madvise err hugepage");
+
+ if ((madvise(ptra, PAGE_SIZE, MADV_DONTFORK)) != 0)
+ perror("madvise err dontfork");
+
+ ret = syscall(sys_encrypt_mprotect, ptra, PAGE_SIZE, PROT_NONE, key);
+ if (ret)
+ perror("mprotect error");
+
+ ret = munmap(ptra, PAGE_SIZE);
+revoke_key:
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "invalidate failed on key [%d]\n", key);
+}
+
+void test_one_simple_round_trip(void)
+{
+ long size = PAGE_SIZE * 10;
+ key_serial_t key;
+ void *ptra;
+ int ret;
+
+ key = add_key("mktme", "testkey", options_USER, strlen(options_USER),
+ KEY_SPEC_THREAD_KEYRING);
+
+ if (key == -1) {
+ perror("add_key");
+ return;
+ }
+
+ ptra = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (!ptra) {
+ perror("failed to mmap");
+ goto revoke_key;
+ }
+
+ ret = syscall(sys_encrypt_mprotect, ptra, size, PROT_NONE, key);
+ if (ret)
+ perror("mprotect error");
+
+ ret = munmap(ptra, size);
+revoke_key:
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "revoke failed on key [%d]\n", key);
+}
+
+void test_switch_key_no_data(void)
+{
+ key_serial_t keyA, keyB;
+ int ret, i;
+ void *buf;
+
+ /*
+ * Program 2 keys: Protect with one, protect with other
+ */
+ keyA = add_key("mktme", "keyA", options_USER, strlen(options_USER),
+ KEY_SPEC_THREAD_KEYRING);
+ if (keyA == -1) {
+ perror("add_key");
+ return;
+ }
+ keyB = add_key("mktme", "keyB", options_CPU_long,
+ strlen(options_CPU_long), KEY_SPEC_THREAD_KEYRING);
+ if (keyB == -1) {
+ perror("add_key");
+ return;
+ }
+ buf = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE,
+ -1, 0);
+ if (!buf) {
+ perror("mmap error");
+ goto revoke_key;
+ }
+ ret = syscall(sys_encrypt_mprotect, buf, PAGE_SIZE, PROT_NONE, keyA);
+ if (ret)
+ perror("mprotect error");
+
+ ret = syscall(sys_encrypt_mprotect, buf, PAGE_SIZE, PROT_NONE, keyB);
+ if (ret)
+ perror("mprotect error");
+
+free_memory:
+ ret = munmap(buf, PAGE_SIZE);
+revoke_key:
+ if (keyctl(KEYCTL_INVALIDATE, keyA) == -1)
+ printf("revoke failed on key [%d]\n", keyA);
+ if (keyctl(KEYCTL_INVALIDATE, keyB) == -1)
+ printf("revoke failed on key [%d]\n", keyB);
+}
+
+void test_switch_key_mult_vmas(void)
+{
+ int prot = PROT_READ | PROT_WRITE;
+ long size = PAGE_SIZE;
+ int ret, i;
+ int loop = 12;
+ void *ptr[loop];
+ key_serial_t firstkey;
+ key_serial_t nextkey;
+
+ firstkey = add_key("mktme", "gouda", options_CPU_long,
+ strlen(options_CPU_long), KEY_SPEC_THREAD_KEYRING);
+ nextkey = add_key("mktme", "ricotta", options_CPU_long,
+ strlen(options_CPU_long), KEY_SPEC_THREAD_KEYRING);
+
+ i = loop;
+ while (i--) {
+ ptr[i] = mmap(NULL, size, PROT_NONE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (i % 2) {
+ ret = syscall(sys_encrypt_mprotect, ptr[i],
+ size, prot, firstkey);
+ if (ret)
+ perror("mprotect error");
+ }
+ }
+ i = loop;
+ while (i--) {
+ if (i % 2) {
+ ret = syscall(sys_encrypt_mprotect, ptr[i], size, prot,
+ nextkey);
+ if (ret)
+ perror("mprotect error");
+ }
+ }
+ i = loop;
+ while (i--)
+ ret = munmap(ptr[i], size);
+
+ if (keyctl(KEYCTL_INVALIDATE, nextkey) == -1)
+ fprintf(stderr, "invalidate failed key %d\n", nextkey);
+ if (keyctl(KEYCTL_INVALIDATE, firstkey) == -1)
+ fprintf(stderr, "invalidate failed key %d\n", firstkey);
+}
+
+/* Write to buf with no encrypt key, then encrypt buf */
+void test_switch_key0_to_key(void)
+{
+ key_serial_t key;
+ size_t datalen = PAGE_SIZE;
+ char *buf_1, *buf_2;
+ int ret, i;
+
+ key = add_key("mktme", "keyA", options_USER, strlen(options_USER),
+ KEY_SPEC_THREAD_KEYRING);
+ if (key == -1) {
+ perror("add_key");
+ return;
+ }
+ buf_1 = (char *)mmap(NULL, datalen, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (!buf_1) {
+ perror("failed to mmap");
+ goto inval_key;
+ }
+ buf_2 = (char *)mmap(NULL, datalen, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (!buf_2) {
+ perror("failed to mmap");
+ goto inval_key;
+ }
+ memset(buf_1, 9, datalen);
+ memset(buf_2, 9, datalen);
+
+ ret = syscall(sys_encrypt_mprotect, buf_1, datalen,
+ PROT_READ | PROT_WRITE, key);
+ if (ret)
+ perror("mprotect error");
+
+ if (!memcmp(buf_1, buf_2, sizeof(buf_1)))
+ fprintf(stderr, "Error: bufs should not have matched\n");
+
+free_memory:
+ ret = munmap(buf_1, datalen);
+ ret = munmap(buf_2, datalen);
+inval_key:
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "invalidate failed on key [%d]\n", key);
+}
+
+void test_zero_page(void)
+{
+ /*
+ * write access to the zero page, gets replaced with a newly
+ * allocated page.
+ * Can this be seen in smaps?
+ */
+}
+
diff --git a/tools/testing/selftests/x86/mktme/key_tests.c b/tools/testing/selftests/x86/mktme/key_tests.c
new file mode 100644
index 000000000000..ff4c18dbf533
--- /dev/null
+++ b/tools/testing/selftests/x86/mktme/key_tests.c
@@ -0,0 +1,526 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Testing payload options
+ *
+ * Invalid options should return -EINVAL, not a Key.
+ * TODO This is just checking for the Key.
+ * Add a check for the actual -EINVAL return.
+ *
+ * Invalid option cases are grouped based on why they are invalid.
+ * Valid option cases are one large array of expected goodness
+ *
+ */
+const char *bad_type_tail = "algorithm=aes-xts-128 key=12345678123456781234567812345678 tweak=12345678123456781234567812345678";
+const char *bad_type[] = {
+ "type=", /* missing */
+ "type=cpu, type=cpu", /* duplicate good */
+ "type=cpu, type=user",
+ "type=user, type=user",
+ "type=user, type=cpu",
+ "type=cp", /* spelling */
+ "type=cpus",
+ "type=pu",
+ "type=cpucpu",
+ "type=useruser",
+ "type=use",
+ "type=users",
+ "type=used",
+ "type=User", /* case */
+ "type=USER",
+ "type=UsEr",
+ "type=CPU",
+ "type=Cpu",
+};
+
+const char *bad_alg_tail = "type=cpu";
+const char *bad_algorithm[] = {
+ "algorithm=",
+ "algorithm=aes-xts-12",
+ "algorithm=aes-xts-128aes-xts-128",
+ "algorithm=es-xts-128",
+ "algorithm=bad",
+ "algorithm=aes-xts-128-xxxx",
+ "algorithm=xxx-aes-xts-128",
+};
+
+const char *bad_key_tail = "type=cpu algorithm=aes-xts-128 tweak=12345678123456781234567812345678";
+const char *bad_key[] = {
+ "key=",
+ "key=0",
+ "key=ababababababab",
+ "key=blah",
+ "key=0123333456789abcdef",
+ "key=abracadabra",
+ "key=-1",
+};
+
+const char *bad_tweak_tail = "type=cpu algorithm=aes-xts-128 key=12345678123456781234567812345678";
+const char *bad_tweak[] = {
+ "tweak=",
+ "tweak=ab",
+ "tweak=bad",
+ "tweak=-1",
+ "tweak=000000000000000",
+};
+
+/* Bad, missing, repeating tokens and bad overall payload length */
+const char *bad_other[] = {
+ "",
+ " ",
+ "a ",
+ "algorithm= tweak= type= key=",
+ "key=aaaaaaaaaaaaaaaa tweak=aaaaaaaaaaaaaaaa type=cpu",
+ "algorithm=aes-xts-128 tweak=0000000000000000 tweak=aaaaaaaaaaaaaaaa key=0000000000000000 type=cpu",
+ "algorithm=aes-xts-128 tweak=0000000000000000 key=0000000000000000 key=0000000000000000 type=cpu",
+ "algorithm=aes-xts-128 tweak=0000000000000000 key=0000000000000000 type=cpu type=cpu",
+ "algorithm=aes-xts-128 tweak=0000000000000000 key=0000000000000000 type=cpu type=user",
+ "tweak=0000000000000000011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+};
+
+void test_invalid_options(const char *bad_options[], unsigned int size,
+ const char *good_tail, char *descrip)
+{
+ key_serial_t key[size];
+ char options[512];
+ char name[15];
+ int i, ret;
+
+ for (i = 0; i < size; i++) {
+ sprintf(name, "mk_inv_%d", i);
+ sprintf(options, "%s %s", bad_options[i], good_tail);
+
+ key[i] = add_key("mktme", name, options,
+ strlen(options),
+ KEY_SPEC_THREAD_KEYRING);
+ if (key[i] > 0)
+ fprintf(stderr, "Error %s: [%s] accepted.\n",
+ descrip, bad_options[i]);
+ }
+ for (i = 0; i < size; i++) {
+ if (key[i] > 0) {
+ ret = keyctl(KEYCTL_INVALIDATE, key[i]);
+ if (ret == -1)
+ fprintf(stderr, "Key invalidate failed: [%d]\n",
+ key[i]);
+ }
+ }
+}
+
+void test_keys_invalid_options(void)
+{
+ test_invalid_options(bad_type, ARRAY_SIZE(bad_type),
+ bad_type_tail, "Invalid Type Option");
+ test_invalid_options(bad_algorithm, ARRAY_SIZE(bad_algorithm),
+ bad_alg_tail, "Invalid Algorithm Option");
+ test_invalid_options(bad_key, ARRAY_SIZE(bad_key),
+ bad_key_tail, "Invalid Key Option");
+ test_invalid_options(bad_tweak, ARRAY_SIZE(bad_tweak),
+ bad_tweak_tail, "Invalid Tweak Option");
+ test_invalid_options(bad_other, ARRAY_SIZE(bad_other),
+ NULL, "Invalid Option");
+}
+
+const char *valid_options[] = {
+ "algorithm=aes-xts-128 type=user key=0123456789abcdef0123456789abcdef tweak=abababababababababababababababab",
+ "algorithm=aes-xts-128 type=user tweak=0123456789abcdef0123456789abcdef key=abababababababababababababababab",
+ "algorithm=aes-xts-128 type=user key=01010101010101010101010101010101 tweak=0123456789abcdef0123456789abcdef",
+ "algorithm=aes-xts-128 tweak=01010101010101010101010101010101 type=user key=0123456789abcdef0123456789abcdef",
+ "algorithm=aes-xts-128 key=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa tweak=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa type=user",
+ "algorithm=aes-xts-128 tweak=aaaaaaaaaaaaaaaa0000000000000000 key=aaaaaaaaaaaaaaaa0000000000000000 type=user",
+ "algorithm=aes-xts-128 type=cpu key=aaaaaaaaaaaaaaaa0123456789abcdef tweak=abababaaaaaaaaaaaaaaaaababababab",
+ "algorithm=aes-xts-128 type=cpu tweak=0123456aaaaaaaaaaaaaaaa789abcdef key=abababaaaaaaaaaaaaaaaaababababab",
+ "algorithm=aes-xts-128 type=cpu key=010101aaaaaaaaaaaaaaaa0101010101 tweak=01234567aaaaaaaaaaaaaaaa89abcdef",
+ "algorithm=aes-xts-128 tweak=01010101aaaaaaaaaaaaaaaa01010101 type=cpu key=012345aaaaaaaaaaaaaaaa6789abcdef",
+ "algorithm=aes-xts-128 key=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa tweak=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa type=cpu",
+ "algorithm=aes-xts-128 tweak=00000000000000000000000000000000 type=cpu",
+ "algorithm=aes-xts-128 key=00000000000000000000000000000000 type=cpu",
+ "algorithm=aes-xts-128 type=cpu",
+ "algorithm=aes-xts-128 tweak=00000000000000000000000000000000 key=00000000000000000000000000000000 type=cpu",
+ "algorithm=aes-xts-128 tweak=00000000000000000000000000000000 key=00000000000000000000000000000000 type=cpu",
+};
+
+void test_keys_valid_options(void)
+{
+ char name[15];
+ int i, ret;
+ key_serial_t key[ARRAY_SIZE(valid_options)];
+
+ for (i = 0; i < ARRAY_SIZE(valid_options); i++) {
+ sprintf(name, "mk_val_%d", i);
+ key[i] = add_key("mktme", name, valid_options[i],
+ strlen(valid_options[i]),
+ KEY_SPEC_THREAD_KEYRING);
+ if (key[i] <= 0)
+ fprintf(stderr, "Fail valid option: [%s]\n",
+ valid_options[i]);
+ }
+ for (i = 0; i < ARRAY_SIZE(valid_options); i++) {
+ if (key[i] > 0) {
+ ret = keyctl(KEYCTL_INVALIDATE, key[i]);
+ if (ret)
+ fprintf(stderr, "Invalidate failed key[%d]\n",
+ key[i]);
+ }
+ }
+}
+
+/*
+ * key_serial_t add_key(const char *type, const char *description,
+ * const void *payload, size_t plen,
+ * key_serial_t keyring);
+ *
+ * The Kernel Key Service should validate this. But, let's validate
+ * some basic syntax. MKTME Keys does NOT propose a description based
+ * on type and payload if no description is provided. (Some other key
+ * types do make that 'proposal'.)
+ */
+
+void test_keys_descriptor(void)
+{
+ key_serial_t key;
+
+ key = add_key("mktme", NULL, options_CPU_long, strlen(options_CPU_long),
+ KEY_SPEC_THREAD_KEYRING);
+
+ if (errno != EINVAL)
+ fprintf(stderr, "Fail: expected EINVAL with NULL descriptor\n");
+
+ if (key > 0)
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Key invalidate failed: %s\n",
+ strerror(errno));
+
+ key = add_key("mktme", "", options_CPU_long, strlen(options_CPU_long),
+ KEY_SPEC_THREAD_KEYRING);
+
+ if (errno != EINVAL)
+ fprintf(stderr,
+ "Fail: expected EINVAL with empty descriptor\n");
+
+ if (key > 0)
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Key invalidate failed: %s\n",
+ strerror(errno));
+}
+
+/*
+ * Test: Add multiple keys with with same descriptor
+ *
+ * Expect that the same Key Handle (key_serial_t) will be returned
+ * on each subsequent request for the same key. This is treated like
+ * a key update.
+ */
+
+void test_keys_add_mult_same(void)
+{
+ int i, inval, num_keys = 5;
+ key_serial_t key[num_keys];
+
+ for (i = 1; i <= num_keys; i++) {
+ key[i] = add_key("mktme", "multiple_keys",
+ options_USER,
+ strlen(options_USER),
+ KEY_SPEC_THREAD_KEYRING);
+
+ if (i > 1)
+ if (key[i] != key[i - 1]) {
+ fprintf(stderr, "Fail: expected same key.\n");
+ inval = i; /* maybe i keys to invalidate */
+ goto out;
+ }
+ }
+ inval = 1; /* if all works correctly, only 1 key to invalidate */
+out:
+ for (i = 1; i <= inval; i++) {
+ if (keyctl(KEYCTL_INVALIDATE, key[i]) == -1)
+ fprintf(stderr, "Key invalidate failed: %s\n",
+ strerror(errno));
+ }
+}
+
+/*
+ * Add two keys with the same descriptor but different payloads.
+ * The result should be one key with the payload from the second
+ * add_key() request. Key Service recognizes the duplicate
+ * descriptor and allows the payload to be updated.
+ *
+ * mktme key type chooses not to support the keyctl read command.
+ * This means we cannot read the key payloads back to compare.
+ * That piece can only be verified in debug mode.
+ */
+void test_keys_change_payload(void)
+{
+ key_serial_t key_a, key_b;
+
+ key_a = add_key("mktme", "changepay", options_USER,
+ strlen(options_USER), KEY_SPEC_THREAD_KEYRING);
+ if (key_a == -1) {
+ fprintf(stderr, "Failed to add test key_a: %s\n",
+ strerror(errno));
+ return;
+ }
+ key_b = add_key("mktme", "changepay", options_CPU_long,
+ strlen(options_CPU_long), KEY_SPEC_THREAD_KEYRING);
+ if (key_b == -1) {
+ fprintf(stderr, "Failed to add test key_b: %s\n",
+ strerror(errno));
+ goto out;
+ }
+ if (key_a != key_b) {
+ fprintf(stderr, "Fail: expected same key, got new key.\n");
+ if (keyctl(KEYCTL_INVALIDATE, key_b) == -1)
+ fprintf(stderr, "Key invalidate failed: %s\n",
+ strerror(errno));
+ }
+out:
+ if (keyctl(KEYCTL_INVALIDATE, key_a) == -1)
+ fprintf(stderr, "Key invalidate failed: %s\n", strerror(errno));
+}
+
+/* Add a key, then discard via method parameter: revoke or invalidate */
+void test_keys_add_discard(int method)
+{
+ key_serial_t key;
+ int i;
+
+ key = add_key("mktme", "mtest_add_discard", options_USER,
+ strlen(options_USER), KEY_SPEC_THREAD_KEYRING);
+ if (key < 0)
+ perror("add_key");
+
+ if (keyctl(method, key) == -1)
+ fprintf(stderr, "Key %s failed: %s\n",
+ ((method == KEYCTL_INVALIDATE) ? "invalidate"
+ : "revoke"), strerror(errno));
+}
+
+void test_keys_add_invalidate(void)
+{
+ test_keys_add_discard(KEYCTL_INVALIDATE);
+}
+
+void test_keys_add_revoke(void)
+{
+ if (remove_gc_delay()) {
+ fprintf(stderr, "Skipping REVOKE test. Cannot set gc_delay.\n");
+ return;
+ }
+ test_keys_add_discard(KEYCTL_REVOKE);
+ restore_gc_delay();
+}
+
+void test_keys_describe(void)
+{
+ key_serial_t key;
+ char buf[256];
+ int ret;
+
+ key = add_key("mktme", "describe_this_key", options_USER,
+ strlen(options_USER), KEY_SPEC_THREAD_KEYRING);
+
+ if (key == -1) {
+ fprintf(stderr, "Add_key failed.\n");
+ return;
+ }
+ if (keyctl(KEYCTL_DESCRIBE, key, buf, sizeof(buf)) == -1) {
+ fprintf(stderr, "%s: KEYCTL_DESCRIBE failed\n", __func__);
+ goto revoke_key;
+ }
+ if (strncmp(buf, "mktme", 5))
+ fprintf(stderr, "Error: mktme descriptor missing.\n");
+
+revoke_key:
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Key invalidate failed: %s\n", strerror(errno));
+}
+
+void test_keys_update_explicit(void)
+{
+ key_serial_t key;
+
+ key = add_key("mktme", "testkey", options_USER, strlen(options_USER),
+ KEY_SPEC_SESSION_KEYRING);
+
+ if (key == -1) {
+ perror("add_key");
+ return;
+ }
+ if (keyctl(KEYCTL_UPDATE, key, options_CPU_long,
+ strlen(options_CPU_long)) == -1)
+ fprintf(stderr, "Error: Update key failed\n");
+
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Key invalidate failed: %s\n", strerror(errno));
+}
+
+void test_keys_update_clear(void)
+{
+ key_serial_t key;
+
+ key = add_key("mktme", "testkey", options_USER, strlen(options_USER),
+ KEY_SPEC_SESSION_KEYRING);
+
+ if (keyctl(KEYCTL_UPDATE, key, options_CLEAR,
+ strlen(options_CLEAR)) == -1)
+ fprintf(stderr, "update: clear key failed\n");
+
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Key invalidate failed: %s\n", strerror(errno));
+}
+
+void test_keys_no_encrypt(void)
+{
+ key_serial_t key;
+
+ key = add_key("mktme", "no_encrypt_key", options_NOENCRYPT,
+ strlen(options_USER), KEY_SPEC_SESSION_KEYRING);
+
+ if (key == -1) {
+ fprintf(stderr, "Error: add_key type=no_encrypt failed.\n");
+ return;
+ }
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Key invalidate failed: %s\n", strerror(errno));
+}
+
+void test_keys_unique_keyid(void)
+{
+ /*
+ * exists[] array must be of mktme_nr_keyids + 1 size, else the
+ * uniqueness test will fail. OK for max_keyids under test to be
+ * less than mktme_nr_keyids.
+ */
+ unsigned int exists[max_keyids + 1];
+ unsigned int keyids[max_keyids + 1];
+ key_serial_t key[max_keyids + 1];
+ void *ptr[max_keyids + 1];
+ int keys_available = 0;
+ char name[12];
+ int i, ret;
+
+ /* Get as many keys as possible */
+ for (i = 1; i <= max_keyids; i++) {
+ sprintf(name, "mk_unique_%d", i);
+ key[i] = add_key("mktme", name, options_CPU_short,
+ strlen(options_CPU_short),
+ KEY_SPEC_THREAD_KEYRING);
+ if (key[i] > 0)
+ keys_available++;
+ }
+ /* Create mappings, encrypt them, and find the assigned KeyIDs */
+ for (i = 1; i <= keys_available; i++) {
+ ptr[i] = mmap(NULL, PAGE_SIZE, PROT_NONE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ ret = syscall(sys_encrypt_mprotect, ptr[i], PAGE_SIZE,
+ PROT_NONE, key[i]);
+ keyids[i] = find_smaps_keyid((unsigned long)ptr[i]);
+ }
+ /* Verify the KeyID's are unique */
+ memset(exists, 0, sizeof(exists));
+ for (i = 1; i <= keys_available; i++) {
+ if (exists[keyids[i]])
+ fprintf(stderr, "Error: duplicate keyid %d\n",
+ keyids[i]);
+ exists[keyids[i]] = 1;
+ }
+
+ /* Clean up */
+ for (i = 1; i <= keys_available; i++) {
+ ret = munmap(ptr[i], PAGE_SIZE);
+ if (keyctl(KEYCTL_INVALIDATE, key[i]) == -1)
+ fprintf(stderr, "Invalidate failed Serial:%d\n",
+ key[i]);
+ }
+ sleep(1); /* Rest a bit while keys get freed. */
+}
+
+void test_keys_get_max_keyids(void)
+{
+ key_serial_t key[max_keyids + 1];
+ int keys_available = 0;
+ char name[12];
+ int i, ret;
+
+ for (i = 1; i <= max_keyids; i++) {
+ sprintf(name, "mk_get63_%d", i);
+ key[i] = add_key("mktme", name, options_CPU_short,
+ strlen(options_CPU_short),
+ KEY_SPEC_THREAD_KEYRING);
+ if (key[i] > 0)
+ keys_available++;
+ }
+
+ fprintf(stderr, " Info: got %d of %d system keys\n",
+ keys_available, max_keyids);
+
+ for (i = 1; i <= keys_available; i++) {
+ if (keyctl(KEYCTL_INVALIDATE, key[i]) == -1)
+ fprintf(stderr, "Invalidate failed Serial:%d\n",
+ key[i]);
+ }
+ sleep(1); /* Rest a bit while keys get freed. */
+}
+
+/*
+ * TODO: Run out of keys, release 1, grab it, repeat
+ * This test in not completed and is not in the run list.
+ */
+void test_keys_max_out(void)
+{
+ key_serial_t key[max_keyids + 1];
+ int keys_available;
+ char name[12];
+ int i, ret;
+
+ /* Get all the keys or as many as possible: keys_available */
+ for (i = 1; i <= max_keyids; i++) {
+ sprintf(name, "mk_max_%d", i);
+ key[i] = add_key("mktme", name, options_CPU_short,
+ strlen(options_CPU_short),
+ KEY_SPEC_THREAD_KEYRING);
+ if (key[i] < 0) {
+ fprintf(stderr, "failed to get key[%d]\n", i);
+ continue;
+ }
+ }
+ keys_available = i - 1;
+ if (keys_available < max_keyids)
+ printf("Error: only got %d keys, expected %d\n",
+ keys_available, max_keyids);
+
+ for (i = 1; i <= keys_available; i++) {
+ if (keyctl(KEYCTL_INVALIDATE, key[i]) == -1)
+ fprintf(stderr, "Invalidate failed key:%d\n", key[i]);
+ }
+}
+
+/* Add each type of key */
+void test_keys_add_each_type(void)
+{
+ key_serial_t key;
+ int i;
+
+ const char *options[] = {
+ options_CPU_short, options_CPU_long, options_USER,
+ options_CLEAR, options_NOENCRYPT
+ };
+ static const char *opt_name[] = {
+ "add_key cpu_short", "add_key cpu_long", "add_key user",
+ "add_key clear", "add_key no-encrypt"
+ };
+
+ for (i = 0; i < ARRAY_SIZE(options); i++) {
+ key = add_key("mktme", opt_name[i], options[i],
+ strlen(options[i]), KEY_SPEC_SESSION_KEYRING);
+
+ if (key == -1) {
+ perror(opt_name[i]);
+ } else {
+ perror(opt_name[i]);
+ if (keyctl(KEYCTL_INVALIDATE, key) == -1)
+ fprintf(stderr, "Key invalidate failed: %d\n",
+ key);
+ }
+ }
+}
diff --git a/tools/testing/selftests/x86/mktme/mktme_test.c b/tools/testing/selftests/x86/mktme/mktme_test.c
new file mode 100644
index 000000000000..6409ccf94d4a
--- /dev/null
+++ b/tools/testing/selftests/x86/mktme/mktme_test.c
@@ -0,0 +1,300 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Tests x86 MKTME Multi-Key Memory Protection
+ *
+ * COMPILE w keyutils library ==> cc -o mktest mktme_test.c -lkeyutils
+ *
+ * Test requires capability of CAP_SYS_RESOURCE, or CAP_SYS_ADMIN.
+ * $ sudo setcap 'CAP_SYS_RESOURCE+ep' mktest
+ *
+ * Some tests may require root privileges because the test needs to
+ * remove the garbage collection delay /proc/sys/kernel/keys/gc_delay
+ * while testing. This keeps the tests (and system) from appearing to
+ * be out of keys when keys are simply awaiting the next scheduled
+ * garbage collection.
+ *
+ * Documentation/x86/mktme.rst
+ *
+ * There are examples in here of:
+ * * how to use the Kernel Key Service MKTME API to allocate keys
+ * * how to use the MKTME Memory Encryption API to encrypt memory
+ *
+ * Adding Tests:
+ * o Each test should run independently and clean up after itself.
+ * o There are no dependencies among tests.
+ * o Tests that use a lot of keys, should consider adding sleep(),
+ * so that the next test isn't key-starved.
+ * o Make no assumptions about the order in which tests will run.
+ * o There are shared defines that can be used for setting
+ * payload options.
+ */
+#include <sys/fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <errno.h>
+#include <keyutils.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+#define PAGE_SIZE sysconf(_SC_PAGE_SIZE)
+#define sys_encrypt_mprotect 335
+
+/* TODO get this from kernel. Add to /proc/sys/kernel/keys/ */
+int max_keyids = 63;
+
+/* Use these pre-defined options to simplify the add_key() setup */
+char *options_CPU_short = "algorithm=aes-xts-128 type=cpu";
+char *options_CPU_long = "algorithm=aes-xts-128 type=cpu key=12345678912345671234567891234567 tweak=12345678912345671234567891234567";
+char *options_USER = "algorithm=aes-xts-128 type=user key=12345678912345671234567891234567 tweak=12345678912345671234567891234567";
+char *options_CLEAR = "type=clear";
+char *options_NOENCRYPT = "type=no-encrypt";
+
+/* Helper to check Encryption_KeyID in proc/self/smaps */
+static FILE *seek_to_smaps_entry(unsigned long addr)
+{
+ FILE *file;
+ char *line = NULL;
+ size_t size = 0;
+ unsigned long start, end;
+ char perms[5];
+ unsigned long offset;
+ char dev[32];
+ unsigned long inode;
+ char path[BUFSIZ];
+
+ file = fopen("/proc/self/smaps", "r");
+ if (!file) {
+ perror("fopen smaps");
+ _exit(1);
+ }
+ while (getline(&line, &size, file) > 0) {
+ if (sscanf(line, "%lx-%lx %s %lx %s %lu %s\n",
+ &start, &end, perms, &offset, dev, &inode, path) < 6)
+ goto next;
+
+ if (start <= addr && addr < end)
+ goto out;
+next:
+ free(line);
+ line = NULL;
+ size = 0;
+ }
+ fclose(file);
+ file = NULL;
+out:
+ free(line);
+ return file;
+}
+
+/* Find the KeyID for this addr from /proc/self/smaps */
+unsigned int find_smaps_keyid(unsigned long addr)
+{
+ unsigned int keyid = 0;
+ char *line = NULL;
+ size_t size = 0;
+ FILE *smaps;
+
+ smaps = seek_to_smaps_entry(addr);
+ if (!smaps) {
+ printf("Unable to parse /proc/self/smaps\n");
+ goto out;
+ }
+ while (getline(&line, &size, smaps) > 0) {
+ if (!strstr(line, "KeyID:")) {
+ free(line);
+ line = NULL;
+ size = 0;
+ continue;
+ }
+ if (sscanf(line, "KeyID: %5u\n", &keyid) < 1)
+ printf("Unable to parse smaps for KeyID:%s\n", line);
+ break;
+ }
+out:
+ free(line);
+ fclose(smaps);
+ return keyid;
+}
+
+/*
+ * Set the garbage collection delay to 0, so that keys are quickly
+ * available for re-use while running the selftests.
+ *
+ * Most tests use INVALIDATE to remove a key, which has no delay by
+ * design. But, revoke, unlink, and timeout still have a delay, so
+ * they should use this.
+ */
+char current_gc_delay[10] = {0};
+static inline int remove_gc_delay(void)
+{
+ int fd;
+
+ fd = open("/proc/sys/kernel/keys/gc_delay", O_RDWR | O_NONBLOCK);
+ if (fd < 0) {
+ perror("Failed to open /proc/sys/kernel/keys/gc_delay");
+ return -1;
+ }
+ if (read(fd, current_gc_delay, sizeof(current_gc_delay)) <= 0) {
+ perror("Failed to read /proc/sys/kernel/keys/gc_delay");
+ close(fd);
+ return -1;
+ }
+ lseek(fd, 0, SEEK_SET);
+ if (write(fd, "0", sizeof(char)) != sizeof(char)) {
+ perror("Failed to write temp_gc_delay to gc_delay\n");
+ close(fd);
+ return -1;
+ }
+ close(fd);
+ return 0;
+}
+
+static inline void restore_gc_delay(void)
+{
+ int fd;
+
+ fd = open("/proc/sys/kernel/keys/gc_delay", O_RDWR | O_NONBLOCK);
+ if (fd < 0) {
+ perror("Failed to open /proc/sys/kernel/keys/gc_delay");
+ return;
+ }
+ if (write(fd, current_gc_delay, strlen(current_gc_delay)) !=
+ strlen(current_gc_delay)) {
+ perror("Failed to restore gc_delay\n");
+ close(fd);
+ return;
+ }
+ close(fd);
+}
+
+/*
+ * The tests are sorted into 3 categories:
+ * key_test encrypt_test focus on their specific API
+ * flow_tests are special flows and regression tests of prior issue.
+ */
+
+#include "key_tests.c"
+#include "encrypt_tests.c"
+#include "flow_tests.c"
+
+struct tlist {
+ const char *name;
+ void (*func)();
+};
+
+static const struct tlist mktme_tests[] = {
+{"Keys: Add each type key", test_keys_add_each_type },
+{"Flow: One simple roundtrip", test_one_simple_round_trip },
+{"Keys: Valid Payload Options", test_keys_valid_options },
+{"Keys: Invalid Payload Options", test_keys_invalid_options },
+{"Keys: Add Key Descriptor Field", test_keys_descriptor },
+{"Keys: Add Multiple Same", test_keys_add_mult_same },
+{"Keys: Change payload, auto update", test_keys_change_payload },
+{"Keys: Update, explicit update", test_keys_update_explicit },
+{"Keys: Update, Clear", test_keys_update_clear },
+{"Keys: Add, Invalidate Keys", test_keys_add_invalidate },
+{"Keys: Add, Revoke Keys", test_keys_add_revoke },
+{"Keys: Keyctl Describe", test_keys_describe },
+{"Keys: Clear", test_keys_update_clear },
+{"Keys: No Encrypt", test_keys_no_encrypt },
+{"Keys: Unique KeyIDs", test_keys_unique_keyid },
+{"Keys: Get Max KeyIDs", test_keys_get_max_keyids },
+{"Encrypt: Parameter Alignment", test_param_alignment },
+{"Encrypt: Change Protections", test_change_protections },
+{"Encrypt: Swap Keys", test_key_swap },
+{"Encrypt: Counters Same Key", test_counters_same },
+{"Encrypt: Counters Diff Key", test_counters_diff },
+{"Encrypt: Counters Holes", test_counters_holes },
+/*
+{"Encrypt: Split", test_split },
+{"Encrypt: Well Suited", test_well_suited },
+{"Encrypt: Not Suited", test_not_suited },
+*/
+{"Flow: Switch key no data", test_switch_key_no_data },
+{"Flow: Switch key multi VMAs", test_switch_key_mult_vmas },
+{"Flow: Switch No Key to Any Key", test_switch_key0_to_key },
+{"Flow: madvise", test_kai_madvise },
+{"Flow: Invalidate In Use Key", test_discard_in_use_key },
+};
+
+void print_usage(void)
+{
+ fprintf(stderr, "Usage: mktme_test [options]...\n"
+ " -a Run ALL tests\n"
+ " -t <testnum> Run one <testnum> test\n"
+ " -l List available tests\n"
+ " -h, -? Show this help\n"
+ );
+}
+
+int main(int argc, char *argv[])
+{
+ int test_selected = -1;
+ char printtest[12];
+ int trace = 0;
+ int i, c, err;
+ char *temp;
+
+ /*
+ * TODO: Default case needs to run 'selftests' - a
+ * curated set of tests that validate functionality but
+ * don't hog resources.
+ */
+ c = getopt(argc, argv, "at:lph?");
+ switch (c) {
+ case 'a':
+ test_selected = -1;
+ printf("Test Selected [ALL]\n");
+ break;
+ case 't':
+ test_selected = strtoul(optarg, &temp, 10);
+ printf("Test Selected [%d]\n", test_selected);
+ break;
+ case 'l':
+ for (i = 0; i < ARRAY_SIZE(mktme_tests); i++)
+ printf("[%2d] %s\n", i + 1,
+ mktme_tests[i].name);
+ exit(0);
+ break;
+ case 'p':
+ trace = 1;
+ case 'h':
+ case '?':
+ default:
+ print_usage();
+ exit(0);
+ }
+
+/*
+ * if (!cpu_has_mktme()) {
+ * printf("MKTME not supported on this system.\n");
+ * exit(0);
+ * }
+ */
+ if (trace) {
+ printf("Pausing: start trace on PID[%d]\n", (int)getpid());
+ getchar();
+ }
+
+ if (test_selected == -1) {
+ for (i = 0; i < ARRAY_SIZE(mktme_tests); i++) {
+ printf("[%2d] %s\n", i + 1, mktme_tests[i].name);
+ mktme_tests[i].func();
+ }
+ printf("\nTests Completed\n");
+
+ } else {
+ if (test_selected <= ARRAY_SIZE(mktme_tests)) {
+ printf("[%2d] %s\n", test_selected,
+ mktme_tests[test_selected - 1].name);
+ mktme_tests[test_selected - 1].func();
+ printf("\nTest Completed\n");
+ }
+ }
+ exit(0);
+}
--
2.20.1
Hook up into page allocator to allocate and free encrypted page
properly.
The hardware/CPU does not enforce coherency between mappings of the same
physical page with different KeyIDs or encryption keys.
We are responsible for cache management.
Flush cache on allocating encrypted page and on returning the page to
the free pool.
prep_encrypted_page() also takes care about zeroing the page. We have to
do this after KeyID is set for the page.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mktme.h | 17 +++++++++++++
arch/x86/mm/mktme.c | 49 ++++++++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+)
diff --git a/arch/x86/include/asm/mktme.h b/arch/x86/include/asm/mktme.h
index b5afa31b4526..6e604126f0bc 100644
--- a/arch/x86/include/asm/mktme.h
+++ b/arch/x86/include/asm/mktme.h
@@ -40,6 +40,23 @@ static inline int vma_keyid(struct vm_area_struct *vma)
return __vma_keyid(vma);
}
+#define prep_encrypted_page prep_encrypted_page
+void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero);
+static inline void prep_encrypted_page(struct page *page, int order,
+ int keyid, bool zero)
+{
+ if (keyid)
+ __prep_encrypted_page(page, order, keyid, zero);
+}
+
+#define HAVE_ARCH_FREE_PAGE
+void free_encrypted_page(struct page *page, int order);
+static inline void arch_free_page(struct page *page, int order)
+{
+ if (page_keyid(page))
+ free_encrypted_page(page, order);
+}
+
#else
#define mktme_keyid_mask ((phys_addr_t)0)
#define mktme_nr_keyids 0
diff --git a/arch/x86/mm/mktme.c b/arch/x86/mm/mktme.c
index d4a1a9e9b1c0..43489c098e60 100644
--- a/arch/x86/mm/mktme.c
+++ b/arch/x86/mm/mktme.c
@@ -1,4 +1,5 @@
#include <linux/mm.h>
+#include <linux/highmem.h>
#include <asm/mktme.h>
/* Mask to extract KeyID from physical address. */
@@ -37,3 +38,51 @@ int __vma_keyid(struct vm_area_struct *vma)
pgprotval_t prot = pgprot_val(vma->vm_page_prot);
return (prot & mktme_keyid_mask) >> mktme_keyid_shift;
}
+
+/* Prepare page to be used for encryption. Called from page allocator. */
+void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
+{
+ int i;
+
+ /*
+ * The hardware/CPU does not enforce coherency between mappings
+ * of the same physical page with different KeyIDs or
+ * encryption keys. We are responsible for cache management.
+ */
+ clflush_cache_range(page_address(page), PAGE_SIZE * (1UL << order));
+
+ for (i = 0; i < (1 << order); i++) {
+ /* All pages coming out of the allocator should have KeyID 0 */
+ WARN_ON_ONCE(lookup_page_ext(page)->keyid);
+ lookup_page_ext(page)->keyid = keyid;
+
+ /* Clear the page after the KeyID is set. */
+ if (zero)
+ clear_highpage(page);
+
+ page++;
+ }
+}
+
+/*
+ * Handles freeing of encrypted page.
+ * Called from page allocator on freeing encrypted page.
+ */
+void free_encrypted_page(struct page *page, int order)
+{
+ int i;
+
+ /*
+ * The hardware/CPU does not enforce coherency between mappings
+ * of the same physical page with different KeyIDs or
+ * encryption keys. We are responsible for cache management.
+ */
+ clflush_cache_range(page_address(page), PAGE_SIZE * (1UL << order));
+
+ for (i = 0; i < (1 << order); i++) {
+ /* Check if the page has reasonable KeyID */
+ WARN_ON_ONCE(lookup_page_ext(page)->keyid > mktme_nr_keyids);
+ lookup_page_ext(page)->keyid = 0;
+ page++;
+ }
+}
--
2.20.1
On Wed, May 08, 2019 at 05:43:20PM +0300, Kirill A. Shutemov wrote:
> = Intro =
>
> The patchset brings enabling of Intel Multi-Key Total Memory Encryption.
> It consists of changes into multiple subsystems:
>
> * Core MM: infrastructure for allocation pages, dealing with encrypted VMAs
> and providing API setup encrypted mappings.
> * arch/x86: feature enumeration, program keys into hardware, setup
> page table entries for encrypted pages and more.
> * Key management service: setup and management of encryption keys.
> * DMA/IOMMU: dealing with encrypted memory on IO side.
> * KVM: interaction with virtualization side.
> * Documentation: description of APIs and usage examples.
>
> The patchset is huge. This submission aims to give view to the full picture and
> get feedback on the overall design. The patchset will be split into more
> digestible pieces later.
>
> Please review. Any feedback is welcome.
It would be nice to have a brief usage description in cover letter rather
than in the last patches in the series ;-)
> = Overview =
>
> Multi-Key Total Memory Encryption (MKTME)[1] is a technology that allows
> transparent memory encryption in upcoming Intel platforms. It uses a new
> instruction (PCONFIG) for key setup and selects a key for individual pages by
> repurposing physical address bits in the page tables.
>
> These patches add support for MKTME into the existing kernel keyring subsystem
> and add a new mprotect_encrypt() system call that can be used by applications
> to encrypt anonymous memory with keys obtained from the keyring.
>
> This architecture supports encrypting both normal, volatile DRAM and persistent
> memory. However, these patches do not implement persistent memory support. We
> anticipate adding that support next.
>
> == Hardware Background ==
>
> MKTME is built on top of an existing single-key technology called TME. TME
> encrypts all system memory using a single key generated by the CPU on every
> boot of the system. TME provides mitigation against physical attacks, such as
> physically removing a DIMM or watching memory bus traffic.
>
> MKTME enables the use of multiple encryption keys[2], allowing selection of the
> encryption key per-page using the page tables. Encryption keys are programmed
> into each memory controller and the same set of keys is available to all
> entities on the system with access to that memory (all cores, DMA engines,
> etc...).
>
> MKTME inherits many of the mitigations against hardware attacks from TME. Like
> TME, MKTME does not mitigate vulnerable or malicious operating systems or
> virtual machine managers. MKTME offers additional mitigations when compared to
> TME.
>
> TME and MKTME use the AES encryption algorithm in the AES-XTS mode. This mode,
> typically used for block-based storage devices, takes the physical address of
> the data into account when encrypting each block. This ensures that the
> effective key is different for each block of memory. Moving encrypted content
> across physical address results in garbage on read, mitigating block-relocation
> attacks. This property is the reason many of the discussed attacks require
> control of a shared physical page to be handed from the victim to the attacker.
>
> == MKTME-Provided Mitigations ==
>
> MKTME adds a few mitigations against attacks that are not mitigated when using
> TME alone. The first set are mitigations against software attacks that are
> familiar today:
>
> * Kernel Mapping Attacks: information disclosures that leverage the
> kernel direct map are mitigated against disclosing user data.
> * Freed Data Leak Attacks: removing an encryption key from the
> hardware mitigates future user information disclosure.
>
> The next set are attacks that depend on specialized hardware, such as an “evil
> DIMM” or a DDR interposer:
>
> * Cross-Domain Replay Attack: data is captured from one domain
> (guest) and replayed to another at a later time.
> * Cross-Domain Capture and Delayed Compare Attack: data is captured
> and later analyzed to discover secrets.
> * Key Wear-out Attack: data is captured and analyzed in order to
> Weaken the AES encryption itself.
>
> More details on these attacks are below.
>
> === Kernel Mapping Attacks ===
>
> Information disclosure vulnerabilities leverage the kernel direct map because
> many vulnerabilities involve manipulation of kernel data structures (examples:
> CVE-2017-7277, CVE-2017-9605). We normally think of these bugs as leaking
> valuable *kernel* data, but they can leak application data when application
> pages are recycled for kernel use.
>
> With this MKTME implementation, there is a direct map created for each MKTME
> KeyID which is used whenever the kernel needs to access plaintext. But, all
> kernel data structures are accessed via the direct map for KeyID-0. Thus,
> memory reads which are not coordinated with the KeyID get garbage (for example,
> accessing KeyID-4 data with the KeyID-0 mapping).
>
> This means that if sensitive data encrypted using MKTME is leaked via the
> KeyID-0 direct map, ciphertext decrypted with the wrong key will be disclosed.
> To disclose plaintext, an attacker must “pivot” to the correct direct mapping,
> which is non-trivial because there are no kernel data structures in the
> KeyID!=0 direct mapping.
>
> === Freed Data Leak Attack ===
>
> The kernel has a history of bugs around uninitialized data. Usually, we think
> of these bugs as leaking sensitive kernel data, but they can also be used to
> leak application secrets.
>
> MKTME can help mitigate the case where application secrets are leaked:
>
> * App (or VM) places a secret in a page
> * App exits or frees memory to kernel allocator
> * Page added to allocator free list
> * Attacker reallocates page to a purpose where it can read the page
>
> Now, imagine MKTME was in use on the memory being leaked. The data can only be
> leaked as long as the key is programmed in the hardware. If the key is
> de-programmed, like after all pages are freed after a guest is shut down, any
> future reads will just see ciphertext.
>
> Basically, the key is a convenient choke-point: you can be more confident that
> data encrypted with it is inaccessible once the key is removed.
>
> === Cross-Domain Replay Attack ===
>
> MKTME mitigates cross-domain replay attacks where an attacker replaces an
> encrypted block owned by one domain with a block owned by another domain.
> MKTME does not prevent this replacement from occurring, but it does mitigate
> plaintext from being disclosed if the domains use different keys.
>
> With TME, the attack could be executed by:
> * A victim places secret in memory, at a given physical address.
> Note: AES-XTS is what restricts the attack to being performed at a
> single physical address instead of across different physical
> addresses
> * Attacker captures victim secret’s ciphertext
> * Later on, after victim frees the physical address, attacker gains
> ownership
> * Attacker puts the ciphertext at the address and get the secret
> plaintext
>
> But, due to the presumably different keys used by the attacker and the victim,
> the attacker can not successfully decrypt old ciphertext.
>
> === Cross-Domain Capture and Delayed Compare Attack ===
>
> This is also referred to as a kind of dictionary attack.
>
> Similarly, MKTME protects against cross-domain capture-and-compare attacks.
> Consider the following scenario:
> * A victim places a secret in memory, at a known physical address
> * Attacker captures victim’s ciphertext
> * Attacker gains control of the target physical address, perhaps
> after the victim’s VM is shut down or its memory reclaimed.
> * Attacker computes and writes many possible plaintexts until new
> ciphertext matches content captured previously.
>
> Secrets which have low (plaintext) entropy are more vulnerable to this attack
> because they reduce the number of possible plaintexts an attacker has to
> compute and write.
>
> The attack will not work if attacker and victim uses different keys.
>
> === Key Wear-out Attack ===
>
> Repeated use of an encryption key might be used by an attacker to infer
> information about the key or the plaintext, weakening the encryption. The
> higher the bandwidth of the encryption engine, the more vulnerable the key is
> to wear-out. The MKTME memory encryption hardware works at the speed of the
> memory bus, which has high bandwidth.
>
> Such a weakness has been demonstrated[3] on a theoretical cipher with similar
> properties as AES-XTS.
>
> An attack would take the following steps:
> * Victim system is using TME with AES-XTS-128
> * Attacker repeatedly captures ciphertext/plaintext pairs (can be
> Performed with online hardware attack like an interposer).
> * Attacker compels repeated use of the key under attack for a
> sustained time period without a system reboot[4].
> * Attacker discovers a cipertext collision (two plaintexts
> translating to the same ciphertext)
> * Attacker can induce controlled modifications to the targeted
> plaintext by modifying the colliding ciphertext
>
> MKTME mitigates key wear-out in two ways:
> * Keys can be rotated periodically to mitigate wear-out. Since TME
> keys are generated at boot, rotation of TME keys requires a
> reboot. In contrast, MKTME allows rotation while the system is
> booted. An application could implement a policy to rotate keys at
> a frequency which is not feasible to attack.
> * In the case that MKTME is used to encrypt two guests’ memory with
> two different keys, an attack on one guest’s key would not weaken
> the key used in the second guest.
>
> --
>
> [1] https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf
> [2] The MKTME architecture supports up to 16 bits of KeyIDs, so a
> maximum of 65535 keys on top of the “TME key” at KeyID-0. The
> first implementation is expected to support 5 bits, making 63 keys
> available to applications. However, this is not guaranteed. The
> number of available keys could be reduced if, for instance,
> additional physical address space is desired over additional
> KeyIDs.
> [3] http://web.cs.ucdavis.edu/~rogaway/papers/offsets.pdf
> [4] This sustained time required for an attack could vary from days
> to years depending on the attacker’s goals.
>
> Alison Schofield (33):
> x86/pconfig: Set a valid encryption algorithm for all MKTME commands
> keys/mktme: Introduce a Kernel Key Service for MKTME
> keys/mktme: Preparse the MKTME key payload
> keys/mktme: Instantiate and destroy MKTME keys
> keys/mktme: Move the MKTME payload into a cache aligned structure
> keys/mktme: Strengthen the entropy of CPU generated MKTME keys
> keys/mktme: Set up PCONFIG programming targets for MKTME keys
> keys/mktme: Program MKTME keys into the platform hardware
> keys/mktme: Set up a percpu_ref_count for MKTME keys
> keys/mktme: Require CAP_SYS_RESOURCE capability for MKTME keys
> keys/mktme: Store MKTME payloads if cmdline parameter allows
> acpi: Remove __init from acpi table parsing functions
> acpi/hmat: Determine existence of an ACPI HMAT
> keys/mktme: Require ACPI HMAT to register the MKTME Key Service
> acpi/hmat: Evaluate topology presented in ACPI HMAT for MKTME
> keys/mktme: Do not allow key creation in unsafe topologies
> keys/mktme: Support CPU hotplug for MKTME key service
> keys/mktme: Find new PCONFIG targets during memory hotplug
> keys/mktme: Program new PCONFIG targets with MKTME keys
> keys/mktme: Support memory hotplug for MKTME keys
> mm: Generalize the mprotect implementation to support extensions
> syscall/x86: Wire up a system call for MKTME encryption keys
> x86/mm: Set KeyIDs in encrypted VMAs for MKTME
> mm: Add the encrypt_mprotect() system call for MKTME
> x86/mm: Keep reference counts on encrypted VMAs for MKTME
> mm: Restrict MKTME memory encryption to anonymous VMAs
> selftests/x86/mktme: Test the MKTME APIs
> x86/mktme: Overview of Multi-Key Total Memory Encryption
> x86/mktme: Document the MKTME provided security mitigations
> x86/mktme: Document the MKTME kernel configuration requirements
> x86/mktme: Document the MKTME Key Service API
> x86/mktme: Document the MKTME API for anonymous memory encryption
> x86/mktme: Demonstration program using the MKTME APIs
>
> Jacob Pan (3):
> iommu/vt-d: Support MKTME in DMA remapping
> x86/mm: introduce common code for mem encryption
> x86/mm: Use common code for DMA memory encryption
>
> Kai Huang (2):
> mm, x86: export several MKTME variables
> kvm, x86, mmu: setup MKTME keyID to spte for given PFN
>
> Kirill A. Shutemov (24):
> mm: Do no merge VMAs with different encryption KeyIDs
> mm: Add helpers to setup zero page mappings
> mm/ksm: Do not merge pages with different KeyIDs
> mm/page_alloc: Unify alloc_hugepage_vma()
> mm/page_alloc: Handle allocation for encrypted memory
> mm/khugepaged: Handle encrypted pages
> x86/mm: Mask out KeyID bits from page table entry pfn
> x86/mm: Introduce variables to store number, shift and mask of KeyIDs
> x86/mm: Preserve KeyID on pte_modify() and pgprot_modify()
> x86/mm: Detect MKTME early
> x86/mm: Add a helper to retrieve KeyID for a page
> x86/mm: Add a helper to retrieve KeyID for a VMA
> x86/mm: Add hooks to allocate and free encrypted pages
> x86/mm: Map zero pages into encrypted mappings correctly
> x86/mm: Rename CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING
> x86/mm: Allow to disable MKTME after enumeration
> x86/mm: Calculate direct mapping size
> x86/mm: Implement syncing per-KeyID direct mappings
> x86/mm: Handle encrypted memory in page_to_virt() and __pa()
> mm/page_ext: Export lookup_page_ext() symbol
> mm/rmap: Clear vma->anon_vma on unlink_anon_vmas()
> x86/mm: Disable MKTME on incompatible platform configurations
> x86/mm: Disable MKTME if not all system memory supports encryption
> x86: Introduce CONFIG_X86_INTEL_MKTME
>
> .../admin-guide/kernel-parameters.rst | 1 +
> .../admin-guide/kernel-parameters.txt | 11 +
> Documentation/x86/mktme/index.rst | 13 +
> .../x86/mktme/mktme_configuration.rst | 17 +
> Documentation/x86/mktme/mktme_demo.rst | 53 ++
> Documentation/x86/mktme/mktme_encrypt.rst | 57 ++
> Documentation/x86/mktme/mktme_keys.rst | 96 +++
> Documentation/x86/mktme/mktme_mitigations.rst | 150 ++++
> Documentation/x86/mktme/mktme_overview.rst | 57 ++
> Documentation/x86/x86_64/mm.txt | 4 +
> arch/alpha/include/asm/page.h | 2 +-
> arch/x86/Kconfig | 29 +-
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/x86/include/asm/intel-family.h | 2 +
> arch/x86/include/asm/intel_pconfig.h | 14 +-
> arch/x86/include/asm/mem_encrypt.h | 29 +
> arch/x86/include/asm/mktme.h | 93 +++
> arch/x86/include/asm/page.h | 4 +
> arch/x86/include/asm/page_32.h | 1 +
> arch/x86/include/asm/page_64.h | 4 +-
> arch/x86/include/asm/pgtable.h | 19 +
> arch/x86/include/asm/pgtable_types.h | 23 +-
> arch/x86/include/asm/setup.h | 6 +
> arch/x86/kernel/cpu/intel.c | 58 +-
> arch/x86/kernel/head64.c | 4 +
> arch/x86/kernel/setup.c | 3 +
> arch/x86/kvm/mmu.c | 18 +-
> arch/x86/mm/Makefile | 3 +
> arch/x86/mm/init_64.c | 68 ++
> arch/x86/mm/kaslr.c | 11 +-
> arch/x86/mm/mem_encrypt_common.c | 28 +
> arch/x86/mm/mktme.c | 630 ++++++++++++++
> drivers/acpi/hmat/hmat.c | 67 ++
> drivers/acpi/tables.c | 10 +-
> drivers/firmware/efi/efi.c | 25 +-
> drivers/iommu/intel-iommu.c | 29 +-
> fs/dax.c | 3 +-
> fs/exec.c | 4 +-
> fs/userfaultfd.c | 7 +-
> include/asm-generic/pgtable.h | 8 +
> include/keys/mktme-type.h | 39 +
> include/linux/acpi.h | 9 +-
> include/linux/dma-direct.h | 4 +-
> include/linux/efi.h | 1 +
> include/linux/gfp.h | 51 +-
> include/linux/intel-iommu.h | 9 +-
> include/linux/mem_encrypt.h | 23 +-
> include/linux/migrate.h | 14 +-
> include/linux/mm.h | 27 +-
> include/linux/page_ext.h | 11 +-
> include/linux/syscalls.h | 2 +
> include/uapi/asm-generic/unistd.h | 4 +-
> kernel/fork.c | 2 +
> kernel/sys_ni.c | 2 +
> mm/compaction.c | 3 +
> mm/huge_memory.c | 6 +-
> mm/khugepaged.c | 10 +
> mm/ksm.c | 17 +
> mm/madvise.c | 2 +-
> mm/memory.c | 3 +-
> mm/mempolicy.c | 30 +-
> mm/migrate.c | 4 +-
> mm/mlock.c | 2 +-
> mm/mmap.c | 31 +-
> mm/mprotect.c | 98 ++-
> mm/page_alloc.c | 50 ++
> mm/page_ext.c | 5 +
> mm/rmap.c | 4 +-
> mm/userfaultfd.c | 3 +-
> security/keys/Makefile | 1 +
> security/keys/mktme_keys.c | 768 ++++++++++++++++++
> .../selftests/x86/mktme/encrypt_tests.c | 433 ++++++++++
> .../testing/selftests/x86/mktme/flow_tests.c | 266 ++++++
> tools/testing/selftests/x86/mktme/key_tests.c | 526 ++++++++++++
> .../testing/selftests/x86/mktme/mktme_test.c | 300 +++++++
> 76 files changed, 4301 insertions(+), 122 deletions(-)
> create mode 100644 Documentation/x86/mktme/index.rst
> create mode 100644 Documentation/x86/mktme/mktme_configuration.rst
> create mode 100644 Documentation/x86/mktme/mktme_demo.rst
> create mode 100644 Documentation/x86/mktme/mktme_encrypt.rst
> create mode 100644 Documentation/x86/mktme/mktme_keys.rst
> create mode 100644 Documentation/x86/mktme/mktme_mitigations.rst
> create mode 100644 Documentation/x86/mktme/mktme_overview.rst
> create mode 100644 arch/x86/include/asm/mktme.h
> create mode 100644 arch/x86/mm/mem_encrypt_common.c
> create mode 100644 arch/x86/mm/mktme.c
> create mode 100644 include/keys/mktme-type.h
> create mode 100644 security/keys/mktme_keys.c
> create mode 100644 tools/testing/selftests/x86/mktme/encrypt_tests.c
> create mode 100644 tools/testing/selftests/x86/mktme/flow_tests.c
> create mode 100644 tools/testing/selftests/x86/mktme/key_tests.c
> create mode 100644 tools/testing/selftests/x86/mktme/mktme_test.c
>
> --
> 2.20.1
>
--
Sincerely yours,
Mike.
On Wed, May 29, 2019 at 10:30:07AM +0300, Mike Rapoport wrote:
> On Wed, May 08, 2019 at 05:43:20PM +0300, Kirill A. Shutemov wrote:
> > = Intro =
> >
> > The patchset brings enabling of Intel Multi-Key Total Memory Encryption.
> > It consists of changes into multiple subsystems:
> >
> > * Core MM: infrastructure for allocation pages, dealing with encrypted VMAs
> > and providing API setup encrypted mappings.
> > * arch/x86: feature enumeration, program keys into hardware, setup
> > page table entries for encrypted pages and more.
> > * Key management service: setup and management of encryption keys.
> > * DMA/IOMMU: dealing with encrypted memory on IO side.
> > * KVM: interaction with virtualization side.
> > * Documentation: description of APIs and usage examples.
> >
> > The patchset is huge. This submission aims to give view to the full picture and
> > get feedback on the overall design. The patchset will be split into more
> > digestible pieces later.
> >
> > Please review. Any feedback is welcome.
>
> It would be nice to have a brief usage description in cover letter rather
> than in the last patches in the series ;-)
>
Thanks for making it all the way to the last patches in the set ;)
Yes, we will certainly include that usage model in the cover letters
of future patchsets.
Alison
On Wed, May 08, 2019 at 05:43:33PM +0300, Kirill A. Shutemov wrote:
> +/* Prepare page to be used for encryption. Called from page allocator. */
> +void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
> +{
> + int i;
> +
> + /*
> + * The hardware/CPU does not enforce coherency between mappings
> + * of the same physical page with different KeyIDs or
> + * encryption keys. We are responsible for cache management.
> + */
On alloc we should flush the unencrypted (key=0) range, while on free
(below) we should flush the encrypted (key!=0) range.
But I seem to have missed where page_address() does the right thing
here.
> + clflush_cache_range(page_address(page), PAGE_SIZE * (1UL << order));
> +
> + for (i = 0; i < (1 << order); i++) {
> + /* All pages coming out of the allocator should have KeyID 0 */
> + WARN_ON_ONCE(lookup_page_ext(page)->keyid);
> + lookup_page_ext(page)->keyid = keyid;
> +
So presumably page_address() is affected by this keyid, and the below
clear_highpage() then accesses the 'right' location?
> + /* Clear the page after the KeyID is set. */
> + if (zero)
> + clear_highpage(page);
> +
> + page++;
> + }
> +}
> +
> +/*
> + * Handles freeing of encrypted page.
> + * Called from page allocator on freeing encrypted page.
> + */
> +void free_encrypted_page(struct page *page, int order)
> +{
> + int i;
> +
> + /*
> + * The hardware/CPU does not enforce coherency between mappings
> + * of the same physical page with different KeyIDs or
> + * encryption keys. We are responsible for cache management.
> + */
I still don't like that comment much; yes the hardware doesn't do it,
and yes we have to do it, but it doesn't explain the actual scheme
employed to do so.
> + clflush_cache_range(page_address(page), PAGE_SIZE * (1UL << order));
> +
> + for (i = 0; i < (1 << order); i++) {
> + /* Check if the page has reasonable KeyID */
> + WARN_ON_ONCE(lookup_page_ext(page)->keyid > mktme_nr_keyids);
It should also check keyid > 0, so maybe:
(unsigned)(keyid - 1) > keyids-1
instead?
> + lookup_page_ext(page)->keyid = 0;
> + page++;
> + }
> +}
> --
> 2.20.1
>
On Fri, Jun 14, 2019 at 11:34:09AM +0200, Peter Zijlstra wrote:
> On Wed, May 08, 2019 at 05:43:33PM +0300, Kirill A. Shutemov wrote:
>
> > + lookup_page_ext(page)->keyid = keyid;
> > + lookup_page_ext(page)->keyid = 0;
Also, perhaps paranoid; but do we want something like:
static inline void page_set_keyid(struct page *page, int keyid)
{
/* ensure nothing creeps after changing the keyid */
barrier();
WRITE_ONCE(lookup_page_ext(page)->keyid, keyid);
barrier();
/* ensure nothing creeps before changing the keyid */
}
And this is very much assuming there is no concurrency through the
allocator locks.
On Wed, May 08, 2019 at 05:44:05PM +0300, Kirill A. Shutemov wrote:
> diff --git a/fs/exec.c b/fs/exec.c
> index 2e0033348d8e..695c121b34b3 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -755,8 +755,8 @@ int setup_arg_pages(struct linux_binprm *bprm,
> vm_flags |= mm->def_flags;
> vm_flags |= VM_STACK_INCOMPLETE_SETUP;
>
> - ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
> - vm_flags);
> + ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end, vm_flags,
> + -1);
You added a nice NO_KEY helper a few patches back, maybe use it?
> if (ret)
> goto out_unlock;
> BUG_ON(prev != vma);
On Wed, May 08, 2019 at 05:44:05PM +0300, Kirill A. Shutemov wrote:
> @@ -347,7 +348,8 @@ static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,
>
> int
> mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> - unsigned long start, unsigned long end, unsigned long newflags)
> + unsigned long start, unsigned long end, unsigned long newflags,
> + int newkeyid)
> {
> struct mm_struct *mm = vma->vm_mm;
> unsigned long oldflags = vma->vm_flags;
> @@ -357,7 +359,14 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> int error;
> int dirty_accountable = 0;
>
> - if (newflags == oldflags) {
> + /*
> + * Flags match and Keyids match or we have NO_KEY.
> + * This _fixup is usually called from do_mprotect_ext() except
> + * for one special case: caller fs/exec.c/setup_arg_pages()
> + * In that case, newkeyid is passed as -1 (NO_KEY).
> + */
> + if (newflags == oldflags &&
> + (newkeyid == vma_keyid(vma) || newkeyid == NO_KEY)) {
> *pprev = vma;
> return 0;
> }
> @@ -423,6 +432,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> }
>
> success:
> + if (newkeyid != NO_KEY)
> + mprotect_set_encrypt(vma, newkeyid, start, end);
> /*
> * vm_flags and vm_page_prot are protected by the mmap_sem
> * held in write mode.
> @@ -454,10 +465,15 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> }
>
> /*
> - * When pkey==NO_KEY we get legacy mprotect behavior here.
> + * do_mprotect_ext() supports the legacy mprotect behavior plus extensions
> + * for Protection Keys and Memory Encryption Keys. These extensions are
> + * mutually exclusive and the behavior is:
> + * (pkey==NO_KEY && keyid==NO_KEY) ==> legacy mprotect
> + * (pkey is valid) ==> legacy mprotect plus Protection Key extensions
> + * (keyid is valid) ==> legacy mprotect plus Encryption Key extensions
> */
> static int do_mprotect_ext(unsigned long start, size_t len,
> - unsigned long prot, int pkey)
> + unsigned long prot, int pkey, int keyid)
> {
> unsigned long nstart, end, tmp, reqprot;
> struct vm_area_struct *vma, *prev;
> @@ -555,7 +571,8 @@ static int do_mprotect_ext(unsigned long start, size_t len,
> tmp = vma->vm_end;
> if (tmp > end)
> tmp = end;
> - error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
> + error = mprotect_fixup(vma, &prev, nstart, tmp, newflags,
> + keyid);
> if (error)
> goto out;
> nstart = tmp;
I've missed the part where pkey && keyid results in a WARN or error or
whatever.
On Wed, May 08, 2019 at 05:44:06PM +0300, Kirill A. Shutemov wrote:
> From: Alison Schofield <[email protected]>
>
> The MKTME (Multi-Key Total Memory Encryption) Key Service needs
> a reference count on encrypted VMAs. This reference count is used
> to determine when a hardware encryption KeyID is no longer in use
> and can be freed and reassigned to another Userspace Key.
>
> The MKTME Key service does the percpu_ref_init and _kill, so
> these gets/puts on encrypted VMA's can be considered the
> intermediaries in the lifetime of the key.
>
> Increment/decrement the reference count during encrypt_mprotect()
> system call for initial or updated encryption on a VMA.
>
> Piggy back on the vm_area_dup/free() helpers. If the VMAs being
> duplicated, or freed are encrypted, adjust the reference count.
That all talks about VMAs, but...
> @@ -102,6 +115,22 @@ void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
>
> page++;
> }
> +
> + /*
> + * Make sure the KeyID cannot be freed until the last page that
> + * uses the KeyID is gone.
> + *
> + * This is required because the page may live longer than VMA it
> + * is mapped into (i.e. in get_user_pages() case) and having
> + * refcounting per-VMA is not enough.
> + *
> + * Taking a reference per-4K helps in case if the page will be
> + * split after the allocation. free_encrypted_page() will balance
> + * out the refcount even if the page was split and freed as bunch
> + * of 4K pages.
> + */
> +
> + percpu_ref_get_many(&encrypt_count[keyid], 1 << order);
> }
>
> /*
> @@ -110,7 +139,9 @@ void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
> */
> void free_encrypted_page(struct page *page, int order)
> {
> - int i;
> + int i, keyid;
> +
> + keyid = page_keyid(page);
>
> /*
> * The hardware/CPU does not enforce coherency between mappings
> @@ -125,6 +156,8 @@ void free_encrypted_page(struct page *page, int order)
> lookup_page_ext(page)->keyid = 0;
> page++;
> }
> +
> + percpu_ref_put_many(&encrypt_count[keyid], 1 << order);
> }
counts pages, what gives?
On Wed, May 08, 2019 at 05:43:20PM +0300, Kirill A. Shutemov wrote:
> = Intro =
>
> The patchset brings enabling of Intel Multi-Key Total Memory Encryption.
> It consists of changes into multiple subsystems:
>
> * Core MM: infrastructure for allocation pages, dealing with encrypted VMAs
> and providing API setup encrypted mappings.
That wasn't eye-bleeding bad. With exception of the refcounting; that
looks like something that can easily go funny without people noticing.
> * arch/x86: feature enumeration, program keys into hardware, setup
> page table entries for encrypted pages and more.
That seemed incomplete (pageattr seems to be a giant hole).
> * Key management service: setup and management of encryption keys.
> * DMA/IOMMU: dealing with encrypted memory on IO side.
Just minor nits, someone else would have to look at this.
> * KVM: interaction with virtualization side.
You really want to limit the damage random modules can do. They have no
business writing to the mktme variables.
> * Documentation: description of APIs and usage examples.
Didn't bother with those; if the Changelogs are inadequate to make sense
of the patches documentation isn't the right place to fix things.
> The patchset is huge. This submission aims to give view to the full picture and
> get feedback on the overall design. The patchset will be split into more
> digestible pieces later.
>
> Please review. Any feedback is welcome.
I still can't tell if this is worth the complexity :-/
Yes, there's a lot of words, but it doesn't mean anything to me, that
is, nothing here makes me want to build my kernel with this 'feature'
enabled.
On Fri, Jun 14, 2019 at 11:34:09AM +0200, Peter Zijlstra wrote:
> On Wed, May 08, 2019 at 05:43:33PM +0300, Kirill A. Shutemov wrote:
>
> > +/* Prepare page to be used for encryption. Called from page allocator. */
> > +void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
> > +{
> > + int i;
> > +
> > + /*
> > + * The hardware/CPU does not enforce coherency between mappings
> > + * of the same physical page with different KeyIDs or
> > + * encryption keys. We are responsible for cache management.
> > + */
>
> On alloc we should flush the unencrypted (key=0) range, while on free
> (below) we should flush the encrypted (key!=0) range.
>
> But I seem to have missed where page_address() does the right thing
> here.
As you've seen by now, it will be addressed later in the patchset. I'll
update the changelog to indicate that page_address() handles KeyIDs
correctly.
> > + clflush_cache_range(page_address(page), PAGE_SIZE * (1UL << order));
> > +
> > + for (i = 0; i < (1 << order); i++) {
> > + /* All pages coming out of the allocator should have KeyID 0 */
> > + WARN_ON_ONCE(lookup_page_ext(page)->keyid);
> > + lookup_page_ext(page)->keyid = keyid;
> > +
>
> So presumably page_address() is affected by this keyid, and the below
> clear_highpage() then accesses the 'right' location?
Yes. clear_highpage() -> kmap_atomic() -> page_address().
> > + /* Clear the page after the KeyID is set. */
> > + if (zero)
> > + clear_highpage(page);
> > +
> > + page++;
> > + }
> > +}
> > +
> > +/*
> > + * Handles freeing of encrypted page.
> > + * Called from page allocator on freeing encrypted page.
> > + */
> > +void free_encrypted_page(struct page *page, int order)
> > +{
> > + int i;
> > +
> > + /*
> > + * The hardware/CPU does not enforce coherency between mappings
> > + * of the same physical page with different KeyIDs or
> > + * encryption keys. We are responsible for cache management.
> > + */
>
> I still don't like that comment much; yes the hardware doesn't do it,
> and yes we have to do it, but it doesn't explain the actual scheme
> employed to do so.
Fair enough. I'll do better.
> > + clflush_cache_range(page_address(page), PAGE_SIZE * (1UL << order));
> > +
> > + for (i = 0; i < (1 << order); i++) {
> > + /* Check if the page has reasonable KeyID */
> > + WARN_ON_ONCE(lookup_page_ext(page)->keyid > mktme_nr_keyids);
>
> It should also check keyid > 0, so maybe:
>
> (unsigned)(keyid - 1) > keyids-1
>
> instead?
Makes sense.
--
Kirill A. Shutemov
On Fri, Jun 14, 2019 at 01:04:58PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 14, 2019 at 11:34:09AM +0200, Peter Zijlstra wrote:
> > On Wed, May 08, 2019 at 05:43:33PM +0300, Kirill A. Shutemov wrote:
> >
> > > + lookup_page_ext(page)->keyid = keyid;
>
> > > + lookup_page_ext(page)->keyid = 0;
>
> Also, perhaps paranoid; but do we want something like:
>
> static inline void page_set_keyid(struct page *page, int keyid)
> {
> /* ensure nothing creeps after changing the keyid */
> barrier();
> WRITE_ONCE(lookup_page_ext(page)->keyid, keyid);
> barrier();
> /* ensure nothing creeps before changing the keyid */
> }
>
> And this is very much assuming there is no concurrency through the
> allocator locks.
There's no concurrency for this page: it has been off the free list, but
have not yet passed on to user. Nobody else sees the page before
allocation is finished.
And barriers/WRITE_ONCE() looks excessive to me. It's just yet another bit
of page's metadata and I don't see why it's has to be handled in a special
way.
Does it relax your paranoia? :P
--
Kirill A. Shutemov
On Fri, Jun 14, 2019 at 04:28:36PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 14, 2019 at 01:04:58PM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 14, 2019 at 11:34:09AM +0200, Peter Zijlstra wrote:
> > > On Wed, May 08, 2019 at 05:43:33PM +0300, Kirill A. Shutemov wrote:
> > >
> > > > + lookup_page_ext(page)->keyid = keyid;
> >
> > > > + lookup_page_ext(page)->keyid = 0;
> >
> > Also, perhaps paranoid; but do we want something like:
> >
> > static inline void page_set_keyid(struct page *page, int keyid)
> > {
> > /* ensure nothing creeps after changing the keyid */
> > barrier();
> > WRITE_ONCE(lookup_page_ext(page)->keyid, keyid);
> > barrier();
> > /* ensure nothing creeps before changing the keyid */
> > }
> >
> > And this is very much assuming there is no concurrency through the
> > allocator locks.
>
> There's no concurrency for this page: it has been off the free list, but
> have not yet passed on to user. Nobody else sees the page before
> allocation is finished.
>
> And barriers/WRITE_ONCE() looks excessive to me. It's just yet another bit
> of page's metadata and I don't see why it's has to be handled in a special
> way.
>
> Does it relax your paranoia? :P
Not really, it all 'works' because clflush_cache_range() includes mb()
and page_address() has an address dependency on the store, and there are
no other sites that will ever change 'keyid', which is all kind of
fragile.
At the very least that should be explicitly called out in a comment.
On Fri, Jun 14, 2019 at 01:47:32PM +0200, Peter Zijlstra wrote:
> On Wed, May 08, 2019 at 05:44:05PM +0300, Kirill A. Shutemov wrote:
> > diff --git a/fs/exec.c b/fs/exec.c
> > index 2e0033348d8e..695c121b34b3 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -755,8 +755,8 @@ int setup_arg_pages(struct linux_binprm *bprm,
> > vm_flags |= mm->def_flags;
> > vm_flags |= VM_STACK_INCOMPLETE_SETUP;
> >
> > - ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
> > - vm_flags);
> > + ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end, vm_flags,
> > + -1);
>
> You added a nice NO_KEY helper a few patches back, maybe use it?
Sure, done.
(I hesitated to define NO_KEY in mm.h initially. Put it there now.
We'll see how that looks it next round.)
>
> > if (ret)
> > goto out_unlock;
> > BUG_ON(prev != vma);
On Fri, Jun 14, 2019 at 01:54:24PM +0200, Peter Zijlstra wrote:
> On Wed, May 08, 2019 at 05:44:06PM +0300, Kirill A. Shutemov wrote:
> > From: Alison Schofield <[email protected]>
> >
> > The MKTME (Multi-Key Total Memory Encryption) Key Service needs
> > a reference count on encrypted VMAs. This reference count is used
> > to determine when a hardware encryption KeyID is no longer in use
> > and can be freed and reassigned to another Userspace Key.
> >
> > The MKTME Key service does the percpu_ref_init and _kill, so
> > these gets/puts on encrypted VMA's can be considered the
> > intermediaries in the lifetime of the key.
> >
> > Increment/decrement the reference count during encrypt_mprotect()
> > system call for initial or updated encryption on a VMA.
> >
> > Piggy back on the vm_area_dup/free() helpers. If the VMAs being
> > duplicated, or freed are encrypted, adjust the reference count.
>
> That all talks about VMAs, but...
>
> > @@ -102,6 +115,22 @@ void __prep_encrypted_page(struct page *page, int order, int keyid, bool zero)
> >
> > page++;
> > }
> > +
> > + /*
> > + * Make sure the KeyID cannot be freed until the last page that
> > + * uses the KeyID is gone.
> > + *
> > + * This is required because the page may live longer than VMA it
> > + * is mapped into (i.e. in get_user_pages() case) and having
> > + * refcounting per-VMA is not enough.
> > + *
> > + * Taking a reference per-4K helps in case if the page will be
> > + * split after the allocation. free_encrypted_page() will balance
> > + * out the refcount even if the page was split and freed as bunch
> > + * of 4K pages.
> > + */
> > +
> > + percpu_ref_get_many(&encrypt_count[keyid], 1 << order);
> > }
snip
>
> counts pages, what gives?
Yeah. Comments are confusing. We implemented the refcounting w VMA's in
mind, and then added the page counting. I'll update the comments and
dig around some more based on your overall concerns about the
refcounting you mentioned in the cover letter.
On Fri, Jun 14, 2019 at 03:43:35PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 14, 2019 at 04:28:36PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jun 14, 2019 at 01:04:58PM +0200, Peter Zijlstra wrote:
> > > On Fri, Jun 14, 2019 at 11:34:09AM +0200, Peter Zijlstra wrote:
> > > > On Wed, May 08, 2019 at 05:43:33PM +0300, Kirill A. Shutemov wrote:
> > > >
> > > > > + lookup_page_ext(page)->keyid = keyid;
> > >
> > > > > + lookup_page_ext(page)->keyid = 0;
> > >
> > > Also, perhaps paranoid; but do we want something like:
> > >
> > > static inline void page_set_keyid(struct page *page, int keyid)
> > > {
> > > /* ensure nothing creeps after changing the keyid */
> > > barrier();
> > > WRITE_ONCE(lookup_page_ext(page)->keyid, keyid);
> > > barrier();
> > > /* ensure nothing creeps before changing the keyid */
> > > }
> > >
> > > And this is very much assuming there is no concurrency through the
> > > allocator locks.
> >
> > There's no concurrency for this page: it has been off the free list, but
> > have not yet passed on to user. Nobody else sees the page before
> > allocation is finished.
> >
> > And barriers/WRITE_ONCE() looks excessive to me. It's just yet another bit
> > of page's metadata and I don't see why it's has to be handled in a special
> > way.
> >
> > Does it relax your paranoia? :P
>
> Not really, it all 'works' because clflush_cache_range() includes mb()
> and page_address() has an address dependency on the store, and there are
> no other sites that will ever change 'keyid', which is all kind of
> fragile.
Hm. I don't follow how the mb() in clflush_cache_range() relevant...
Any following access of page's memory by kernel will go through
page_keyid() and therefore I believe there's always address dependency on
the store.
Am I missing something?
> At the very least that should be explicitly called out in a comment.
>
--
Kirill A. Shutemov
On Fri, Jun 14, 2019 at 01:51:37PM +0200, Peter Zijlstra wrote:
> On Wed, May 08, 2019 at 05:44:05PM +0300, Kirill A. Shutemov wrote:
snip
> > /*
> > - * When pkey==NO_KEY we get legacy mprotect behavior here.
> > + * do_mprotect_ext() supports the legacy mprotect behavior plus extensions
> > + * for Protection Keys and Memory Encryption Keys. These extensions are
> > + * mutually exclusive and the behavior is:
> > + * (pkey==NO_KEY && keyid==NO_KEY) ==> legacy mprotect
> > + * (pkey is valid) ==> legacy mprotect plus Protection Key extensions
> > + * (keyid is valid) ==> legacy mprotect plus Encryption Key extensions
> > */
> > static int do_mprotect_ext(unsigned long start, size_t len,
> > - unsigned long prot, int pkey)
> > + unsigned long prot, int pkey, int keyid)
> > {
snip
>
> I've missed the part where pkey && keyid results in a WARN or error or
> whatever.
>
I wasn't so sure about that since do_mprotect_ext()
is the call 'behind' the system calls.
legacy mprotect always calls with: NO_KEY, NO_KEY
pkey_mprotect always calls with: pkey, NO_KEY
encrypt_mprotect always calls with NO_KEY, keyid
Would a check on those arguments be debug only
to future proof this?
On Fri, Jun 14, 2019 at 05:32:31PM -0700, Alison Schofield wrote:
> On Fri, Jun 14, 2019 at 01:51:37PM +0200, Peter Zijlstra wrote:
> > On Wed, May 08, 2019 at 05:44:05PM +0300, Kirill A. Shutemov wrote:
> snip
> > > /*
> > > - * When pkey==NO_KEY we get legacy mprotect behavior here.
> > > + * do_mprotect_ext() supports the legacy mprotect behavior plus extensions
> > > + * for Protection Keys and Memory Encryption Keys. These extensions are
> > > + * mutually exclusive and the behavior is:
Well, here it states that the extentions are mutually exclusive.
> > > + * (pkey==NO_KEY && keyid==NO_KEY) ==> legacy mprotect
> > > + * (pkey is valid) ==> legacy mprotect plus Protection Key extensions
> > > + * (keyid is valid) ==> legacy mprotect plus Encryption Key extensions
> > > */
> > > static int do_mprotect_ext(unsigned long start, size_t len,
> > > - unsigned long prot, int pkey)
> > > + unsigned long prot, int pkey, int keyid)
> > > {
>
> snip
>
> >
> > I've missed the part where pkey && keyid results in a WARN or error or
> > whatever.
> >
> I wasn't so sure about that since do_mprotect_ext()
> is the call 'behind' the system calls.
>
> legacy mprotect always calls with: NO_KEY, NO_KEY
> pkey_mprotect always calls with: pkey, NO_KEY
> encrypt_mprotect always calls with NO_KEY, keyid
>
> Would a check on those arguments be debug only
> to future proof this?
But you then don't check that, anywhere, afaict.
On Sat, Jun 15, 2019 at 01:41:31AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 14, 2019 at 03:43:35PM +0200, Peter Zijlstra wrote:
> > Not really, it all 'works' because clflush_cache_range() includes mb()
> > and page_address() has an address dependency on the store, and there are
> > no other sites that will ever change 'keyid', which is all kind of
> > fragile.
>
> Hm. I don't follow how the mb() in clflush_cache_range() relevant...
>
> Any following access of page's memory by kernel will go through
> page_keyid() and therefore I believe there's always address dependency on
> the store.
>
> Am I missing something?
The dependency doesn't help with prior calls; consider:
addr = page_address(page);
*addr = foo;
page->key_id = bar;
addr2 = page_address(page);
Without a barrier() between '*addr = foo' and 'page->key_id = bar', the
compiler is allowed to reorder these stores.
Now, the clflush stuff we do, that already hard orders things -- we need
to be done writing before we start flushing -- so we can/do rely on
that, but we should explicitly mention that.
Now, for the second part, addr2 must observe bar, because of the address
dependency, the compiler is not allowed mess that up, but again, that is
something we should put in a comment.
On Wed, May 8, 2019 at 7:44 AM Kirill A. Shutemov
<[email protected]> wrote:
>
> From: Alison Schofield <[email protected]>
>
> Implement memory encryption for MKTME (Multi-Key Total Memory
> Encryption) with a new system call that is an extension of the
> legacy mprotect() system call.
>
> In encrypt_mprotect the caller must pass a handle to a previously
> allocated and programmed MKTME encryption key. The key can be
> obtained through the kernel key service type "mktme". The caller
> must have KEY_NEED_VIEW permission on the key.
>
> MKTME places an additional restriction on the protected data:
> The length of the data must be page aligned. This is in addition
> to the existing mprotect restriction that the addr must be page
> aligned.
I still find it bizarre that this is conflated with mprotect().
I also remain entirely unconvinced that MKTME on anonymous memory is
useful in the long run. There will inevitably be all kinds of fancy
new CPU features that make the underlying MKTME mechanisms much more
useful. For example, some way to bind a key to a VM, or a way to
*sanely* encrypt persistent memory. By making this thing a syscall
that does more than just MKTME, you're adding combinatorial complexity
(you forget pkey!) and you're tying other functionality (change of
protection) to this likely-to-be-deprecated interface.
This is part of why I much prefer the idea of making this style of
MKTME a driver or some other non-intrusive interface. Then, once
everyone gets tired of it, the driver can just get turned off with no
side effects.
--Andy
On 6/17/19 8:07 AM, Andy Lutomirski wrote:
> I still find it bizarre that this is conflated with mprotect().
This needs to be in the changelog. But, for better or worse, it's
following the mprotect_pkey() pattern.
Other than the obvious "set the key on this memory", we're looking for
two other properties: atomicity (ensuring there is no transient state
where the memory is usable without the desired properties) and that it
is usable on existing allocations.
For atomicity, we have a model where we can allocate things with
PROT_NONE, then do mprotect_pkey() and mprotect_encrypt() (plus any
future features), then the last mprotect_*() call takes us from
PROT_NONE to the desired end permisions. We could just require a plain
old mprotect() to do that instead of embedding mprotect()-like behavior
in these, of course, but that isn't the path we're on at the moment with
mprotect_pkey().
So, for this series it's just a matter of whether we do this:
ptr = mmap(..., PROT_NONE);
mprotect_pkey(protect_key, ptr, PROT_NONE);
mprotect_encrypt(encr_key, ptr, PROT_READ|PROT_WRITE);
// good to go
or this:
ptr = mmap(..., PROT_NONE);
mprotect_pkey(protect_key, ptr, PROT_NONE);
sys_encrypt(key, ptr);
mprotect(ptr, PROT_READ|PROT_WRITE);
// good to go
I actually don't care all that much which one we end up with. It's not
like the extra syscall in the second options means much.
> This is part of why I much prefer the idea of making this style of
> MKTME a driver or some other non-intrusive interface. Then, once
> everyone gets tired of it, the driver can just get turned off with no
> side effects.
I like the concept, but not where it leads. I'd call it the 'hugetlbfs
approach". :) Hugetblfs certainly go us huge pages, but it's continued
to be a parallel set of code with parallel bugs and parallel
implementations of many VM features. It's not that you can't implement
new things on hugetlbfs, it's that you *need* to. You never get them
for free.
For instance, if we do a driver, how do we get large pages? How do we
swap/reclaim the pages? How do we do NUMA affinity? How do we
eventually stack it on top of persistent memory filesystems or Device
DAX? With a driver approach, I think we're stuck basically
reimplementing things or gluing them back together. Nothing comes for free.
With this approach, we basically start with our normal, full feature set
(modulo weirdo interactions like with KSM).
On Mon, Jun 17, 2019 at 8:28 AM Dave Hansen <[email protected]> wrote:
>
> On 6/17/19 8:07 AM, Andy Lutomirski wrote:
> > I still find it bizarre that this is conflated with mprotect().
>
> This needs to be in the changelog. But, for better or worse, it's
> following the mprotect_pkey() pattern.
>
> Other than the obvious "set the key on this memory", we're looking for
> two other properties: atomicity (ensuring there is no transient state
> where the memory is usable without the desired properties) and that it
> is usable on existing allocations.
>
> For atomicity, we have a model where we can allocate things with
> PROT_NONE, then do mprotect_pkey() and mprotect_encrypt() (plus any
> future features), then the last mprotect_*() call takes us from
> PROT_NONE to the desired end permisions. We could just require a plain
> old mprotect() to do that instead of embedding mprotect()-like behavior
> in these, of course, but that isn't the path we're on at the moment with
> mprotect_pkey().
>
> So, for this series it's just a matter of whether we do this:
>
> ptr = mmap(..., PROT_NONE);
> mprotect_pkey(protect_key, ptr, PROT_NONE);
> mprotect_encrypt(encr_key, ptr, PROT_READ|PROT_WRITE);
> // good to go
>
> or this:
>
> ptr = mmap(..., PROT_NONE);
> mprotect_pkey(protect_key, ptr, PROT_NONE);
> sys_encrypt(key, ptr);
> mprotect(ptr, PROT_READ|PROT_WRITE);
> // good to go
>
> I actually don't care all that much which one we end up with. It's not
> like the extra syscall in the second options means much.
The benefit of the second one is that, if sys_encrypt is absent, it
just works. In the first model, programs need a fallback because
they'll segfault of mprotect_encrypt() gets ENOSYS.
>
> > This is part of why I much prefer the idea of making this style of
> > MKTME a driver or some other non-intrusive interface. Then, once
> > everyone gets tired of it, the driver can just get turned off with no
> > side effects.
>
> I like the concept, but not where it leads. I'd call it the 'hugetlbfs
> approach". :) Hugetblfs certainly go us huge pages, but it's continued
> to be a parallel set of code with parallel bugs and parallel
> implementations of many VM features. It's not that you can't implement
> new things on hugetlbfs, it's that you *need* to. You never get them
> for free.
Fair enough, but...
>
> For instance, if we do a driver, how do we get large pages? How do we
> swap/reclaim the pages? How do we do NUMA affinity?
Those all make sense.
> How do we
> eventually stack it on top of persistent memory filesystems or Device
> DAX?
How do we stack anonymous memory on top of persistent memory or Device
DAX? I'm confused.
Just to throw this out there, what if we had a new device /dev/xpfo
and MKTME were one of its features. You open /dev/xpfo, optionally do
an ioctl to set a key, and them map it. The pages you get are
unmapped entirely from the direct map, and you get a PFNMAP VMA with
all its limitations. This seems much more useful -- it's limited, but
it's limited *because the kernel can't accidentally read it*.
I think that, in the long run, we're going to have to either expand
the core mm's concept of what "memory" is or just have a whole
parallel set of mechanisms for memory that doesn't work like memory.
We're already accumulating a set of things that are backed by memory
but aren't usable as memory. SGX EPC pages and SEV pages come to mind.
They are faster when they're in big contiguous chunks (well, not SGX
AFAIK, but maybe some day), they have NUMA node affinity, and they
show up in page tables, but the hardware restricts who can read and
write them. If Intel isn't planning to do something like this with
the MKTME hardware, I'll eat my hat.
I expect that some day normal memory will be able to be repurposed as
SGX pages on the fly, and that will also look a lot more like SEV or
XPFO than like the this model of MKTME.
So, if we upstream MKTME as anonymous memory with a magic config
syscall, I predict that, in a few years, it will be end up inheriting
all downsides of both approaches with few of the upsides. Programs
like QEMU will need to learn to manipulate pages that can't be
accessed outside the VM without special VM buy-in, so the fact that
MKTME pages are fully functional and can be GUP-ed won't be very
useful. And the VM will learn about all these things, but MKTME won't
really fit in.
And, one of these days, someone will come up with a version of XPFO
that could actually be upstreamed, and it seems entirely plausible
that it will be totally incompatible with MKTME-as-anonymous-memory
and that users of MKTME will actually get *worse* security.
Tom Lendacky, could you take a look down in the message to the talk of
SEV? I want to make sure I'm not misrepresenting what it does today.
...
>> I actually don't care all that much which one we end up with. It's not
>> like the extra syscall in the second options means much.
>
> The benefit of the second one is that, if sys_encrypt is absent, it
> just works. In the first model, programs need a fallback because
> they'll segfault of mprotect_encrypt() gets ENOSYS.
Well, by the time they get here, they would have already had to allocate
and set up the encryption key. I don't think this would really be the
"normal" malloc() path, for instance.
>> How do we
>> eventually stack it on top of persistent memory filesystems or Device
>> DAX?
>
> How do we stack anonymous memory on top of persistent memory or Device
> DAX? I'm confused.
If our interface to MKTME is:
fd = open("/dev/mktme");
ptr = mmap(fd);
Then it's hard to combine with an interface which is:
fd = open("/dev/dax123");
ptr = mmap(fd);
Where if we have something like mprotect() (or madvise() or something
else taking pointer), we can just do:
fd = open("/dev/anything987");
ptr = mmap(fd);
sys_encrypt(ptr);
Now, we might not *do* it that way for dax, for instance, but I'm just
saying that if we go the /dev/mktme route, we never get a choice.
> I think that, in the long run, we're going to have to either expand
> the core mm's concept of what "memory" is or just have a whole
> parallel set of mechanisms for memory that doesn't work like memory.
...
> I expect that some day normal memory will be able to be repurposed as
> SGX pages on the fly, and that will also look a lot more like SEV or
> XPFO than like the this model of MKTME.
I think you're drawing the line at pages where the kernel can manage
contents vs. not manage contents. I'm not sure that's the right
distinction to make, though. The thing that is important is whether the
kernel can manage the lifetime and location of the data in the page.
Basically: Can the kernel choose where the page comes from and get the
page back when it wants?
I really don't like the current state of things like with SEV or with
KVM direct device assignment where the physical location is quite locked
down and the kernel really can't manage the memory. I'm trying really
hard to make sure future hardware is more permissive about such things.
My hope is that these are a temporary blip and not the new normal.
> So, if we upstream MKTME as anonymous memory with a magic config
> syscall, I predict that, in a few years, it will be end up inheriting
> all downsides of both approaches with few of the upsides. Programs
> like QEMU will need to learn to manipulate pages that can't be
> accessed outside the VM without special VM buy-in, so the fact that
> MKTME pages are fully functional and can be GUP-ed won't be very
> useful. And the VM will learn about all these things, but MKTME won't
> really fit in.
Kai Huang (who is on cc) has been doing the QEMU enabling and might want
to weigh in. I'd also love to hear from the AMD folks in case I'm not
grokking some aspect of SEV.
But, my understanding is that, even today, neither QEMU nor the kernel
can see SEV-encrypted guest memory. So QEMU should already understand
how to not interact with guest memory. I _assume_ it's also already
doing this with anonymous memory, without needing /dev/sme or something.
> And, one of these days, someone will come up with a version of XPFO
> that could actually be upstreamed, and it seems entirely plausible
> that it will be totally incompatible with MKTME-as-anonymous-memory
> and that users of MKTME will actually get *worse* security.
I'm not following here. XPFO just means that we don't keep the direct
map around all the time for all memory. If XPFO and
MKTME-as-anonymous-memory were both in play, I think we'd just be
creating/destroying the MKTME-enlightened direct map instead of a
vanilla one.
On Mon, Jun 17, 2019 at 11:37 AM Dave Hansen <[email protected]> wrote:
>
> Tom Lendacky, could you take a look down in the message to the talk of
> SEV? I want to make sure I'm not misrepresenting what it does today.
> ...
>
>
> >> I actually don't care all that much which one we end up with. It's not
> >> like the extra syscall in the second options means much.
> >
> > The benefit of the second one is that, if sys_encrypt is absent, it
> > just works. In the first model, programs need a fallback because
> > they'll segfault of mprotect_encrypt() gets ENOSYS.
>
> Well, by the time they get here, they would have already had to allocate
> and set up the encryption key. I don't think this would really be the
> "normal" malloc() path, for instance.
>
> >> How do we
> >> eventually stack it on top of persistent memory filesystems or Device
> >> DAX?
> >
> > How do we stack anonymous memory on top of persistent memory or Device
> > DAX? I'm confused.
>
> If our interface to MKTME is:
>
> fd = open("/dev/mktme");
> ptr = mmap(fd);
>
> Then it's hard to combine with an interface which is:
>
> fd = open("/dev/dax123");
> ptr = mmap(fd);
>
> Where if we have something like mprotect() (or madvise() or something
> else taking pointer), we can just do:
>
> fd = open("/dev/anything987");
> ptr = mmap(fd);
> sys_encrypt(ptr);
I'm having a hard time imagining that ever working -- wouldn't it blow
up if someone did:
fd = open("/dev/anything987");
ptr1 = mmap(fd);
ptr2 = mmap(fd);
sys_encrypt(ptr1);
So I think it really has to be:
fd = open("/dev/anything987");
ioctl(fd, ENCRYPT_ME);
mmap(fd);
But I really expect that the encryption of a DAX device will actually
be a block device setting and won't look like this at all. It'll be
more like dm-crypt except without device mapper.
>
> Now, we might not *do* it that way for dax, for instance, but I'm just
> saying that if we go the /dev/mktme route, we never get a choice.
>
> > I think that, in the long run, we're going to have to either expand
> > the core mm's concept of what "memory" is or just have a whole
> > parallel set of mechanisms for memory that doesn't work like memory.
> ...
> > I expect that some day normal memory will be able to be repurposed as
> > SGX pages on the fly, and that will also look a lot more like SEV or
> > XPFO than like the this model of MKTME.
>
> I think you're drawing the line at pages where the kernel can manage
> contents vs. not manage contents. I'm not sure that's the right
> distinction to make, though. The thing that is important is whether the
> kernel can manage the lifetime and location of the data in the page.
The kernel can manage the location of EPC pages, for example, but only
under extreme constraints right now. The draft SGX driver can and
does swap them out and swap them back in, potentially at a different
address.
>
> Basically: Can the kernel choose where the page comes from and get the
> page back when it wants?
>
> I really don't like the current state of things like with SEV or with
> KVM direct device assignment where the physical location is quite locked
> down and the kernel really can't manage the memory. I'm trying really
> hard to make sure future hardware is more permissive about such things.
> My hope is that these are a temporary blip and not the new normal.
>
> > So, if we upstream MKTME as anonymous memory with a magic config
> > syscall, I predict that, in a few years, it will be end up inheriting
> > all downsides of both approaches with few of the upsides. Programs
> > like QEMU will need to learn to manipulate pages that can't be
> > accessed outside the VM without special VM buy-in, so the fact that
> > MKTME pages are fully functional and can be GUP-ed won't be very
> > useful. And the VM will learn about all these things, but MKTME won't
> > really fit in.
>
> Kai Huang (who is on cc) has been doing the QEMU enabling and might want
> to weigh in. I'd also love to hear from the AMD folks in case I'm not
> grokking some aspect of SEV.
>
> But, my understanding is that, even today, neither QEMU nor the kernel
> can see SEV-encrypted guest memory. So QEMU should already understand
> how to not interact with guest memory. I _assume_ it's also already
> doing this with anonymous memory, without needing /dev/sme or something.
Let's find out!
If it's using anonymous memory, it must be very strange, since it
would more or less have to be mmapped PROT_NONE. The thing that makes
anonymous memory in particular seem so awkward to me is that it's
fundamentally identified by it's *address*, which means it makes no
sense if it's not mapped.
>
> > And, one of these days, someone will come up with a version of XPFO
> > that could actually be upstreamed, and it seems entirely plausible
> > that it will be totally incompatible with MKTME-as-anonymous-memory
> > and that users of MKTME will actually get *worse* security.
>
> I'm not following here. XPFO just means that we don't keep the direct
> map around all the time for all memory. If XPFO and
> MKTME-as-anonymous-memory were both in play, I think we'd just be
> creating/destroying the MKTME-enlightened direct map instead of a
> vanilla one.
What I'm saying is that I can imagine XPFO also wanting to be
something other than anonymous memory. I don't think we'll ever want
regular MAP_ANONYMOUS to enable XPFO by default because the
performance will suck. Doing this seems odd:
ptr = mmap(MAP_ANONYMOUS);
sys_xpfo_a_pointer(ptr);
So I could imagine:
ptr = mmap(MAP_ANONYMOUS | MAP_XPFO);
or
fd = open("/dev/xpfo"); (or fd = memfd_create(..., XPFO);
ptr = mmap(fd);
I'm thinking that XPFO is a *lot* simpler under the hood if we just
straight-up don't support GUP on it. Maybe we should call this
"strong XPFO". Similarly, the kinds of things that want MKTME may
also want the memory to be entirely absent from the direct map. And
the things that use SEV (as I understand it) *can't* usefully use the
memory for normal IO via GUP or copy_to/from_user(), so these things
all have a decent amount in common.
Another down side of anonymous memory (in my head, anyway -- QEMU
people should chime in) is that it seems awkward to use it for IO
techniques in which the back-end isn't in the QEMU process. If
there's an fd involved, you can pass it around, feed it to things like
vfio, etc. If there's no fd, it's stuck in the creating process.
And another silly argument: if we had /dev/mktme, then we could
possibly get away with avoiding all the keyring stuff entirely.
Instead, you open /dev/mktme and you get your own key under the hook.
If you want two keys, you open /dev/mktme twice. If you want some
other program to be able to see your memory, you pass it the fd.
I hope this email isn't too rambling :)
--Andy
>> Where if we have something like mprotect() (or madvise() or something
>> else taking pointer), we can just do:
>>
>> fd = open("/dev/anything987");
>> ptr = mmap(fd);
>> sys_encrypt(ptr);
>
> I'm having a hard time imagining that ever working -- wouldn't it blow
> up if someone did:
>
> fd = open("/dev/anything987");
> ptr1 = mmap(fd);
> ptr2 = mmap(fd);
> sys_encrypt(ptr1);
>
> So I think it really has to be:
> fd = open("/dev/anything987");
> ioctl(fd, ENCRYPT_ME);
> mmap(fd);
Yeah, shared mappings are annoying. :)
But, let's face it, nobody is going to do what you suggest in the
ptr1/ptr2 example. It doesn't make any logical sense because it's
effectively asking to read the memory with two different keys. I
_believe_ fscrypt has similar issues and just punts on them by saying
"don't do that".
We can also quite easily figure out what's going on. It's a very simple
rule to kill a process that tries to fault a page in whose KeyID doesn't
match the VMA under which it is faulted in, and also require that no
pages are faulted in under VMAs which have their key changed.
>> Now, we might not *do* it that way for dax, for instance, but I'm just
>> saying that if we go the /dev/mktme route, we never get a choice.
>>
>>> I think that, in the long run, we're going to have to either expand
>>> the core mm's concept of what "memory" is or just have a whole
>>> parallel set of mechanisms for memory that doesn't work like memory.
>> ...
>>> I expect that some day normal memory will be able to be repurposed as
>>> SGX pages on the fly, and that will also look a lot more like SEV or
>>> XPFO than like the this model of MKTME.
>>
>> I think you're drawing the line at pages where the kernel can manage
>> contents vs. not manage contents. I'm not sure that's the right
>> distinction to make, though. The thing that is important is whether the
>> kernel can manage the lifetime and location of the data in the page.
>
> The kernel can manage the location of EPC pages, for example, but only
> under extreme constraints right now. The draft SGX driver can and
> does swap them out and swap them back in, potentially at a different
> address.
The kernel can't put arbitrary data in EPC pages and can't use normal
memory for EPC. To me, that puts them clearly on the side of being
unmanageable by the core mm code.
For instance, there's no way we could mix EPC pages in the same 'struct
zone' with non-EPC pages. Not only are they not in the direct map, but
they never *can* be, even for a second.
>>> And, one of these days, someone will come up with a version of XPFO
>>> that could actually be upstreamed, and it seems entirely plausible
>>> that it will be totally incompatible with MKTME-as-anonymous-memory
>>> and that users of MKTME will actually get *worse* security.
>>
>> I'm not following here. XPFO just means that we don't keep the direct
>> map around all the time for all memory. If XPFO and
>> MKTME-as-anonymous-memory were both in play, I think we'd just be
>> creating/destroying the MKTME-enlightened direct map instead of a
>> vanilla one.
>
> What I'm saying is that I can imagine XPFO also wanting to be
> something other than anonymous memory. I don't think we'll ever want
> regular MAP_ANONYMOUS to enable XPFO by default because the
> performance will suck.
It will certainly suck for some things. But, does it suck if the kernel
never uses the direct map for the XPFO memory? If it were for KVM guest
memory for a guest using direct device assignment, we might not even
ever notice.
> I'm thinking that XPFO is a *lot* simpler under the hood if we just
> straight-up don't support GUP on it. Maybe we should call this
> "strong XPFO". Similarly, the kinds of things that want MKTME may
> also want the memory to be entirely absent from the direct map. And
> the things that use SEV (as I understand it) *can't* usefully use the
> memory for normal IO via GUP or copy_to/from_user(), so these things
> all have a decent amount in common.
OK, so basically, you're thinking about new memory management
infrastructure that a memory-allocating-app can opt into where they get
a reduced kernel feature set, but also increased security guarantees?
The main insight thought is that some hardware features *already*
impose (some of) this reduced feature set?
FWIW, I don't think many folks will go for the no-GUP rule. It's one
thing to say no-GUPs for SGX pages which can't have I/O done on them in
the first place, but it's quite another to tell folks that sendfile() no
longer works without bounce buffers.
MKTME's security guarantees are very different than something like SEV.
Since the kernel is in the trust boundary, it *can* do fun stuff like
RDMA which is a heck of a lot faster than bounce buffering. Let's say a
franken-system existed with SEV and MKTME. It isn't even clear to me
that *everyone* would pick SEV over MKTME. IOW, I'm not sure the MKTME
model necessarily goes away given the presence of SEV.
> And another silly argument: if we had /dev/mktme, then we could
> possibly get away with avoiding all the keyring stuff entirely.
> Instead, you open /dev/mktme and you get your own key under the hook.
> If you want two keys, you open /dev/mktme twice. If you want some
> other program to be able to see your memory, you pass it the fd.
We still like the keyring because it's one-stop-shopping as the place
that *owns* the hardware KeyID slots. Those are global resources and
scream for a single global place to allocate and manage them. The
hardware slots also need to be shared between any anonymous and
file-based users, no matter what the APIs for the anonymous side.
On Mon, 2019-06-17 at 11:27 -0700, Dave Hansen wrote:
> Tom Lendacky, could you take a look down in the message to the talk of
> SEV? I want to make sure I'm not misrepresenting what it does today.
> ...
>
>
> > > I actually don't care all that much which one we end up with. It's not
> > > like the extra syscall in the second options means much.
> >
> > The benefit of the second one is that, if sys_encrypt is absent, it
> > just works. In the first model, programs need a fallback because
> > they'll segfault of mprotect_encrypt() gets ENOSYS.
>
> Well, by the time they get here, they would have already had to allocate
> and set up the encryption key. I don't think this would really be the
> "normal" malloc() path, for instance.
>
> > > How do we
> > > eventually stack it on top of persistent memory filesystems or Device
> > > DAX?
> >
> > How do we stack anonymous memory on top of persistent memory or Device
> > DAX? I'm confused.
>
> If our interface to MKTME is:
>
> fd = open("/dev/mktme");
> ptr = mmap(fd);
>
> Then it's hard to combine with an interface which is:
>
> fd = open("/dev/dax123");
> ptr = mmap(fd);
>
> Where if we have something like mprotect() (or madvise() or something
> else taking pointer), we can just do:
>
> fd = open("/dev/anything987");
> ptr = mmap(fd);
> sys_encrypt(ptr);
>
> Now, we might not *do* it that way for dax, for instance, but I'm just
> saying that if we go the /dev/mktme route, we never get a choice.
>
> > I think that, in the long run, we're going to have to either expand
> > the core mm's concept of what "memory" is or just have a whole
> > parallel set of mechanisms for memory that doesn't work like memory.
>
> ...
> > I expect that some day normal memory will be able to be repurposed as
> > SGX pages on the fly, and that will also look a lot more like SEV or
> > XPFO than like the this model of MKTME.
>
> I think you're drawing the line at pages where the kernel can manage
> contents vs. not manage contents. I'm not sure that's the right
> distinction to make, though. The thing that is important is whether the
> kernel can manage the lifetime and location of the data in the page.
>
> Basically: Can the kernel choose where the page comes from and get the
> page back when it wants?
>
> I really don't like the current state of things like with SEV or with
> KVM direct device assignment where the physical location is quite locked
> down and the kernel really can't manage the memory. I'm trying really
> hard to make sure future hardware is more permissive about such things.
> My hope is that these are a temporary blip and not the new normal.
>
> > So, if we upstream MKTME as anonymous memory with a magic config
> > syscall, I predict that, in a few years, it will be end up inheriting
> > all downsides of both approaches with few of the upsides. Programs
> > like QEMU will need to learn to manipulate pages that can't be
> > accessed outside the VM without special VM buy-in, so the fact that
> > MKTME pages are fully functional and can be GUP-ed won't be very
> > useful. And the VM will learn about all these things, but MKTME won't
> > really fit in.
>
> Kai Huang (who is on cc) has been doing the QEMU enabling and might want
> to weigh in. I'd also love to hear from the AMD folks in case I'm not
> grokking some aspect of SEV.
>
> But, my understanding is that, even today, neither QEMU nor the kernel
> can see SEV-encrypted guest memory. So QEMU should already understand
> how to not interact with guest memory. I _assume_ it's also already
> doing this with anonymous memory, without needing /dev/sme or something.
Correct neither Qemu nor kernel can see SEV-encrypted guest memory. Qemu requires guest's
cooperation when it needs to interacts with guest, i.e. to support virtual DMA (of virtual devices
in SEV-guest), qemu requires SEV-guest to setup bounce buffer (which will not be SEV-encrypted
memory, but shared memory can be accessed from host side too), so that guest kernel can copy DMA
data from bounce buffer to its own SEV-encrypted memory after qemu/host kernel puts DMA data to
bounce buffer.
And yes from my reading (better to have AMD guys to confirm) SEV guest uses anonymous memory, but it
also pins all guest memory (by calling GUP from KVM -- SEV specifically introduced 2 KVM ioctls for
this purpose), since SEV architecturally cannot support swapping, migraiton of SEV-encrypted guest
memory, because SME/SEV also uses physical address as "tweak", and there's no way that kernel can
get or use SEV-guest's memory encryption key. In order to swap/migrate SEV-guest memory, we need SGX
EPC eviction/reload similar thing, which SEV doesn't have today.
From this perspective, I think driver proposal kinda makes sense since we already have security
feature which uses normal memory some kind like "device memory" (no swap, no migration, etc), so it
makes sense that MKTME just follows that (although from HW MKTME can support swap, page migration,
etc). The downside of driver proposal for MKTME I think is, like Dave mentioned, it's hard (or not
sure whether it is possible) to extend to support NVDIMM (and file backed guest memory), since for
virtual NVDIMM, Qemu needs to call mmap against fd of NVDIMM.
Thanks,
-Kai
On Mon, 2019-06-17 at 12:12 -0700, Andy Lutomirski wrote:
> On Mon, Jun 17, 2019 at 11:37 AM Dave Hansen <[email protected]> wrote:
> >
> > Tom Lendacky, could you take a look down in the message to the talk of
> > SEV? I want to make sure I'm not misrepresenting what it does today.
> > ...
> >
> >
> > > > I actually don't care all that much which one we end up with. It's not
> > > > like the extra syscall in the second options means much.
> > >
> > > The benefit of the second one is that, if sys_encrypt is absent, it
> > > just works. In the first model, programs need a fallback because
> > > they'll segfault of mprotect_encrypt() gets ENOSYS.
> >
> > Well, by the time they get here, they would have already had to allocate
> > and set up the encryption key. I don't think this would really be the
> > "normal" malloc() path, for instance.
> >
> > > > How do we
> > > > eventually stack it on top of persistent memory filesystems or Device
> > > > DAX?
> > >
> > > How do we stack anonymous memory on top of persistent memory or Device
> > > DAX? I'm confused.
> >
> > If our interface to MKTME is:
> >
> > fd = open("/dev/mktme");
> > ptr = mmap(fd);
> >
> > Then it's hard to combine with an interface which is:
> >
> > fd = open("/dev/dax123");
> > ptr = mmap(fd);
> >
> > Where if we have something like mprotect() (or madvise() or something
> > else taking pointer), we can just do:
> >
> > fd = open("/dev/anything987");
> > ptr = mmap(fd);
> > sys_encrypt(ptr);
>
> I'm having a hard time imagining that ever working -- wouldn't it blow
> up if someone did:
>
> fd = open("/dev/anything987");
> ptr1 = mmap(fd);
> ptr2 = mmap(fd);
> sys_encrypt(ptr1);
>
> So I think it really has to be:
> fd = open("/dev/anything987");
> ioctl(fd, ENCRYPT_ME);
> mmap(fd);
This requires "/dev/anything987" to support ENCRYPT_ME ioctl, right?
So to support NVDIMM (DAX), we need to add ENCRYPT_ME ioctl to DAX?
>
> But I really expect that the encryption of a DAX device will actually
> be a block device setting and won't look like this at all. It'll be
> more like dm-crypt except without device mapper.
Are you suggesting not to support MKTME for DAX, or adding MKTME support to dm-crypt?
Thanks,
-Kai
On Mon, Jun 17, 2019 at 5:05 PM Kai Huang <[email protected]> wrote:
>
> On Mon, 2019-06-17 at 12:12 -0700, Andy Lutomirski wrote:
> > On Mon, Jun 17, 2019 at 11:37 AM Dave Hansen <[email protected]> wrote:
> > >
> > > Tom Lendacky, could you take a look down in the message to the talk of
> > > SEV? I want to make sure I'm not misrepresenting what it does today.
> > > ...
> > >
> > >
> > > > > I actually don't care all that much which one we end up with. It's not
> > > > > like the extra syscall in the second options means much.
> > > >
> > > > The benefit of the second one is that, if sys_encrypt is absent, it
> > > > just works. In the first model, programs need a fallback because
> > > > they'll segfault of mprotect_encrypt() gets ENOSYS.
> > >
> > > Well, by the time they get here, they would have already had to allocate
> > > and set up the encryption key. I don't think this would really be the
> > > "normal" malloc() path, for instance.
> > >
> > > > > How do we
> > > > > eventually stack it on top of persistent memory filesystems or Device
> > > > > DAX?
> > > >
> > > > How do we stack anonymous memory on top of persistent memory or Device
> > > > DAX? I'm confused.
> > >
> > > If our interface to MKTME is:
> > >
> > > fd = open("/dev/mktme");
> > > ptr = mmap(fd);
> > >
> > > Then it's hard to combine with an interface which is:
> > >
> > > fd = open("/dev/dax123");
> > > ptr = mmap(fd);
> > >
> > > Where if we have something like mprotect() (or madvise() or something
> > > else taking pointer), we can just do:
> > >
> > > fd = open("/dev/anything987");
> > > ptr = mmap(fd);
> > > sys_encrypt(ptr);
> >
> > I'm having a hard time imagining that ever working -- wouldn't it blow
> > up if someone did:
> >
> > fd = open("/dev/anything987");
> > ptr1 = mmap(fd);
> > ptr2 = mmap(fd);
> > sys_encrypt(ptr1);
> >
> > So I think it really has to be:
> > fd = open("/dev/anything987");
> > ioctl(fd, ENCRYPT_ME);
> > mmap(fd);
>
> This requires "/dev/anything987" to support ENCRYPT_ME ioctl, right?
>
> So to support NVDIMM (DAX), we need to add ENCRYPT_ME ioctl to DAX?
Yes and yes, or we do it with layers -- see below.
I don't see how we can credibly avoid this. If we try to do MKTME
behind the DAX driver's back, aren't we going to end up with cache
coherence problems?
>
> >
> > But I really expect that the encryption of a DAX device will actually
> > be a block device setting and won't look like this at all. It'll be
> > more like dm-crypt except without device mapper.
>
> Are you suggesting not to support MKTME for DAX, or adding MKTME support to dm-crypt?
I'm proposing exposing it by an interface that looks somewhat like
dm-crypt. Either we could have a way to create a device layered on
top of the DAX devices that exposes a decrypted view or we add a way
to tell the DAX device to kindly use MKTME with such-and-such key.
If there is demand for a way to have an fscrypt-like thing on top of
DAX where different files use different keys, I suppose that could be
done too, but it will need filesystem or VFS help.
>
> > And another silly argument: if we had /dev/mktme, then we could
> > possibly get away with avoiding all the keyring stuff entirely.
> > Instead, you open /dev/mktme and you get your own key under the hook.
> > If you want two keys, you open /dev/mktme twice. If you want some
> > other program to be able to see your memory, you pass it the fd.
>
> We still like the keyring because it's one-stop-shopping as the place
> that *owns* the hardware KeyID slots. Those are global resources and
> scream for a single global place to allocate and manage them. The
> hardware slots also need to be shared between any anonymous and
> file-based users, no matter what the APIs for the anonymous side.
MKTME driver (who creates /dev/mktme) can also be the one-stop-shopping. I think whether to choose
keyring to manage MKTME key should be based on whether we need/should take advantage of existing key
retention service functionalities. For example, with key retention service we can
revoke/invalidate/set expiry for a key (not sure whether MKTME needs those although), and we have
several keyrings -- thread specific keyring, process specific keyring, user specific keyring, etc,
thus we can control who can/cannot find the key, etc. I think managing MKTME key in MKTME driver
doesn't have those advantages.
Thanks,
-Kai
On 6/17/19 6:59 PM, Kai Huang wrote:
> On Mon, 2019-06-17 at 11:27 -0700, Dave Hansen wrote:
>> Tom Lendacky, could you take a look down in the message to the talk of
>> SEV? I want to make sure I'm not misrepresenting what it does today.
Sorry, I'm traveling this week, so responses will be delayed...
>> ...
>>
>>
>>>> I actually don't care all that much which one we end up with. It's not
>>>> like the extra syscall in the second options means much.
>>>
>>> The benefit of the second one is that, if sys_encrypt is absent, it
>>> just works. In the first model, programs need a fallback because
>>> they'll segfault of mprotect_encrypt() gets ENOSYS.
>>
>> Well, by the time they get here, they would have already had to allocate
>> and set up the encryption key. I don't think this would really be the
>> "normal" malloc() path, for instance.
>>
>>>> How do we
>>>> eventually stack it on top of persistent memory filesystems or Device
>>>> DAX?
>>>
>>> How do we stack anonymous memory on top of persistent memory or Device
>>> DAX? I'm confused.
>>
>> If our interface to MKTME is:
>>
>> fd = open("/dev/mktme");
>> ptr = mmap(fd);
>>
>> Then it's hard to combine with an interface which is:
>>
>> fd = open("/dev/dax123");
>> ptr = mmap(fd);
>>
>> Where if we have something like mprotect() (or madvise() or something
>> else taking pointer), we can just do:
>>
>> fd = open("/dev/anything987");
>> ptr = mmap(fd);
>> sys_encrypt(ptr);
>>
>> Now, we might not *do* it that way for dax, for instance, but I'm just
>> saying that if we go the /dev/mktme route, we never get a choice.
>>
>>> I think that, in the long run, we're going to have to either expand
>>> the core mm's concept of what "memory" is or just have a whole
>>> parallel set of mechanisms for memory that doesn't work like memory.
>>
>> ...
>>> I expect that some day normal memory will be able to be repurposed as
>>> SGX pages on the fly, and that will also look a lot more like SEV or
>>> XPFO than like the this model of MKTME.
>>
>> I think you're drawing the line at pages where the kernel can manage
>> contents vs. not manage contents. I'm not sure that's the right
>> distinction to make, though. The thing that is important is whether the
>> kernel can manage the lifetime and location of the data in the page.
>>
>> Basically: Can the kernel choose where the page comes from and get the
>> page back when it wants?
>>
>> I really don't like the current state of things like with SEV or with
>> KVM direct device assignment where the physical location is quite locked
>> down and the kernel really can't manage the memory. I'm trying really
>> hard to make sure future hardware is more permissive about such things.
>> My hope is that these are a temporary blip and not the new normal.
>>
>>> So, if we upstream MKTME as anonymous memory with a magic config
>>> syscall, I predict that, in a few years, it will be end up inheriting
>>> all downsides of both approaches with few of the upsides. Programs
>>> like QEMU will need to learn to manipulate pages that can't be
>>> accessed outside the VM without special VM buy-in, so the fact that
>>> MKTME pages are fully functional and can be GUP-ed won't be very
>>> useful. And the VM will learn about all these things, but MKTME won't
>>> really fit in.
>>
>> Kai Huang (who is on cc) has been doing the QEMU enabling and might want
>> to weigh in. I'd also love to hear from the AMD folks in case I'm not
>> grokking some aspect of SEV.
>>
>> But, my understanding is that, even today, neither QEMU nor the kernel
>> can see SEV-encrypted guest memory. So QEMU should already understand
>> how to not interact with guest memory. I _assume_ it's also already
>> doing this with anonymous memory, without needing /dev/sme or something.
>
> Correct neither Qemu nor kernel can see SEV-encrypted guest memory. Qemu requires guest's
> cooperation when it needs to interacts with guest, i.e. to support virtual DMA (of virtual devices
> in SEV-guest), qemu requires SEV-guest to setup bounce buffer (which will not be SEV-encrypted
> memory, but shared memory can be accessed from host side too), so that guest kernel can copy DMA
> data from bounce buffer to its own SEV-encrypted memory after qemu/host kernel puts DMA data to
> bounce buffer.
That is correct. an SEV guest must use un-encrypted memory if it wishes
for the hypervisor to be able to see it. Also, to support DMA into the
guest, the target memory must be un-encrypted, which SEV does through the
DMA api and SWIOTLB.
SME must also use bounce buffers if the device does not support 48-bit
DMA, since the encryption bit is bit 47. Any device supporting DMA above
the encryption bit position can perform DMA without bounce buffers under
SME.
>
> And yes from my reading (better to have AMD guys to confirm) SEV guest uses anonymous memory, but it
> also pins all guest memory (by calling GUP from KVM -- SEV specifically introduced 2 KVM ioctls for
> this purpose), since SEV architecturally cannot support swapping, migraiton of SEV-encrypted guest
> memory, because SME/SEV also uses physical address as "tweak", and there's no way that kernel can
> get or use SEV-guest's memory encryption key. In order to swap/migrate SEV-guest memory, we need SGX
> EPC eviction/reload similar thing, which SEV doesn't have today.
Yes, all the guest memory is currently pinned by calling GUP when creating
an SEV guest. This is to prevent page migration by the kernel. However,
the support to do page migration is available in the 0.17 version of the
SEV API via the COPY commmand. The COPY commannd allows for copying of
an encrypted page from one location to another via the PSP. The support
for this is not yet in the Linux kernel, but we are working on it. This
would remove the requirement of having to pin the guest memory.
The SEV API also supports migration of memory via the SEND_* and RECEIVE_*
APIs to support live migration.
Swapping, however, is not yet supported.
Thanks,
Tom
>
> From this perspective, I think driver proposal kinda makes sense since we already have security
> feature which uses normal memory some kind like "device memory" (no swap, no migration, etc), so it
> makes sense that MKTME just follows that (although from HW MKTME can support swap, page migration,
> etc). The downside of driver proposal for MKTME I think is, like Dave mentioned, it's hard (or not
> sure whether it is possible) to extend to support NVDIMM (and file backed guest memory), since for
> virtual NVDIMM, Qemu needs to call mmap against fd of NVDIMM.
>
> Thanks,
> -Kai
>
> > >
> > > I'm having a hard time imagining that ever working -- wouldn't it blow
> > > up if someone did:
> > >
> > > fd = open("/dev/anything987");
> > > ptr1 = mmap(fd);
> > > ptr2 = mmap(fd);
> > > sys_encrypt(ptr1);
> > >
> > > So I think it really has to be:
> > > fd = open("/dev/anything987");
> > > ioctl(fd, ENCRYPT_ME);
> > > mmap(fd);
> >
> > This requires "/dev/anything987" to support ENCRYPT_ME ioctl, right?
> >
> > So to support NVDIMM (DAX), we need to add ENCRYPT_ME ioctl to DAX?
>
> Yes and yes, or we do it with layers -- see below.
>
> I don't see how we can credibly avoid this. If we try to do MKTME
> behind the DAX driver's back, aren't we going to end up with cache
> coherence problems?
I am not sure whether I understand correctly but how is cache coherence problem related to putting
MKTME concept to different layers? To make MKTME work with DAX/NVDIMM, I think no matter which layer
MKTME concept resides, eventually we need to put keyID into PTE which maps to NVDIMM, and kernel
needs to manage cache coherence for NVDIMM just like for normal memory showed in this series?
Thanks,
-Kai
On Mon, Jun 17, 2019 at 6:34 PM Lendacky, Thomas
<[email protected]> wrote:
>
> On 6/17/19 6:59 PM, Kai Huang wrote:
> > On Mon, 2019-06-17 at 11:27 -0700, Dave Hansen wrote:
> >
> > And yes from my reading (better to have AMD guys to confirm) SEV guest uses anonymous memory, but it
> > also pins all guest memory (by calling GUP from KVM -- SEV specifically introduced 2 KVM ioctls for
> > this purpose), since SEV architecturally cannot support swapping, migraiton of SEV-encrypted guest
> > memory, because SME/SEV also uses physical address as "tweak", and there's no way that kernel can
> > get or use SEV-guest's memory encryption key. In order to swap/migrate SEV-guest memory, we need SGX
> > EPC eviction/reload similar thing, which SEV doesn't have today.
>
> Yes, all the guest memory is currently pinned by calling GUP when creating
> an SEV guest.
Ick.
What happens if QEMU tries to read the memory? Does it just see
ciphertext? Is cache coherency lost if QEMU writes it?
On Mon, Jun 17, 2019 at 6:35 PM Kai Huang <[email protected]> wrote:
>
>
> > > >
> > > > I'm having a hard time imagining that ever working -- wouldn't it blow
> > > > up if someone did:
> > > >
> > > > fd = open("/dev/anything987");
> > > > ptr1 = mmap(fd);
> > > > ptr2 = mmap(fd);
> > > > sys_encrypt(ptr1);
> > > >
> > > > So I think it really has to be:
> > > > fd = open("/dev/anything987");
> > > > ioctl(fd, ENCRYPT_ME);
> > > > mmap(fd);
> > >
> > > This requires "/dev/anything987" to support ENCRYPT_ME ioctl, right?
> > >
> > > So to support NVDIMM (DAX), we need to add ENCRYPT_ME ioctl to DAX?
> >
> > Yes and yes, or we do it with layers -- see below.
> >
> > I don't see how we can credibly avoid this. If we try to do MKTME
> > behind the DAX driver's back, aren't we going to end up with cache
> > coherence problems?
>
> I am not sure whether I understand correctly but how is cache coherence problem related to putting
> MKTME concept to different layers? To make MKTME work with DAX/NVDIMM, I think no matter which layer
> MKTME concept resides, eventually we need to put keyID into PTE which maps to NVDIMM, and kernel
> needs to manage cache coherence for NVDIMM just like for normal memory showed in this series?
>
I mean is that, to avoid cache coherence problems, something has to
prevent user code from mapping the same page with two different key
ids. If the entire MKTME mechanism purely layers on top of DAX,
something needs to prevent the underlying DAX device from being mapped
at the same time as the MKTME-decrypted view. This is obviously
doable, but it's not automatic.
On Mon, Jun 17, 2019 at 5:48 PM Kai Huang <[email protected]> wrote:
>
>
> >
> > > And another silly argument: if we had /dev/mktme, then we could
> > > possibly get away with avoiding all the keyring stuff entirely.
> > > Instead, you open /dev/mktme and you get your own key under the hook.
> > > If you want two keys, you open /dev/mktme twice. If you want some
> > > other program to be able to see your memory, you pass it the fd.
> >
> > We still like the keyring because it's one-stop-shopping as the place
> > that *owns* the hardware KeyID slots. Those are global resources and
> > scream for a single global place to allocate and manage them. The
> > hardware slots also need to be shared between any anonymous and
> > file-based users, no matter what the APIs for the anonymous side.
>
> MKTME driver (who creates /dev/mktme) can also be the one-stop-shopping. I think whether to choose
> keyring to manage MKTME key should be based on whether we need/should take advantage of existing key
> retention service functionalities. For example, with key retention service we can
> revoke/invalidate/set expiry for a key (not sure whether MKTME needs those although), and we have
> several keyrings -- thread specific keyring, process specific keyring, user specific keyring, etc,
> thus we can control who can/cannot find the key, etc. I think managing MKTME key in MKTME driver
> doesn't have those advantages.
>
Trying to evaluate this with the current proposed code is a bit odd, I
think. Suppose you create a thread-specific key and then fork(). The
child can presumably still use the key regardless of whether the child
can nominally access the key in the keyring because the PTEs are still
there.
More fundamentally, in some sense, the current code has no semantics.
Associating a key with memory and "encrypting" it doesn't actually do
anything unless you are attacking the memory bus but you haven't
compromised the kernel. There's no protection against a guest that
can corrupt its EPT tables, there's no protection against kernel bugs
(*especially* if the duplicate direct map design stays), and there
isn't even any fd or other object around by which you can only access
the data if you can see the key.
I'm also wondering whether the kernel will always be able to be a
one-stop shop for key allocation -- if the MKTME hardware gains
interesting new uses down the road, who knows how key allocation will
work?
On 6/17/19 8:40 PM, Andy Lutomirski wrote:
> On Mon, Jun 17, 2019 at 6:34 PM Lendacky, Thomas
> <[email protected]> wrote:
>>
>> On 6/17/19 6:59 PM, Kai Huang wrote:
>>> On Mon, 2019-06-17 at 11:27 -0700, Dave Hansen wrote:
>
>>>
>>> And yes from my reading (better to have AMD guys to confirm) SEV guest uses anonymous memory, but it
>>> also pins all guest memory (by calling GUP from KVM -- SEV specifically introduced 2 KVM ioctls for
>>> this purpose), since SEV architecturally cannot support swapping, migraiton of SEV-encrypted guest
>>> memory, because SME/SEV also uses physical address as "tweak", and there's no way that kernel can
>>> get or use SEV-guest's memory encryption key. In order to swap/migrate SEV-guest memory, we need SGX
>>> EPC eviction/reload similar thing, which SEV doesn't have today.
>>
>> Yes, all the guest memory is currently pinned by calling GUP when creating
>> an SEV guest.
>
> Ick.
>
> What happens if QEMU tries to read the memory? Does it just see
> ciphertext? Is cache coherency lost if QEMU writes it?
If QEMU tries to read the memory is would just see ciphertext. I'll
double check on the write situation, but I think you would end up with
a cache coherency issue because the write by QEMU would be with the
hypervisor key and tagged separately in the cache from the guest cache
entry. SEV provides confidentiality of guest memory from the hypervisor,
it doesn't prevent the hypervisor from trashing the guest memory.
Thanks,
Tom
>
On Mon, 2019-06-17 at 18:50 -0700, Andy Lutomirski wrote:
> On Mon, Jun 17, 2019 at 5:48 PM Kai Huang <[email protected]> wrote:
> >
> >
> > >
> > > > And another silly argument: if we had /dev/mktme, then we could
> > > > possibly get away with avoiding all the keyring stuff entirely.
> > > > Instead, you open /dev/mktme and you get your own key under the hook.
> > > > If you want two keys, you open /dev/mktme twice. If you want some
> > > > other program to be able to see your memory, you pass it the fd.
> > >
> > > We still like the keyring because it's one-stop-shopping as the place
> > > that *owns* the hardware KeyID slots. Those are global resources and
> > > scream for a single global place to allocate and manage them. The
> > > hardware slots also need to be shared between any anonymous and
> > > file-based users, no matter what the APIs for the anonymous side.
> >
> > MKTME driver (who creates /dev/mktme) can also be the one-stop-shopping. I think whether to
> > choose
> > keyring to manage MKTME key should be based on whether we need/should take advantage of existing
> > key
> > retention service functionalities. For example, with key retention service we can
> > revoke/invalidate/set expiry for a key (not sure whether MKTME needs those although), and we
> > have
> > several keyrings -- thread specific keyring, process specific keyring, user specific keyring,
> > etc,
> > thus we can control who can/cannot find the key, etc. I think managing MKTME key in MKTME driver
> > doesn't have those advantages.
> >
>
> Trying to evaluate this with the current proposed code is a bit odd, I
> think. Suppose you create a thread-specific key and then fork(). The
> child can presumably still use the key regardless of whether the child
> can nominally access the key in the keyring because the PTEs are still
> there.
Right. This is a little bit odd, although virtualization (Qemu, which is the main use case of MKTME
at least so far) doesn't use fork().
>
> More fundamentally, in some sense, the current code has no semantics.
> Associating a key with memory and "encrypting" it doesn't actually do
> anything unless you are attacking the memory bus but you haven't
> compromised the kernel. There's no protection against a guest that
> can corrupt its EPT tables, there's no protection against kernel bugs
> (*especially* if the duplicate direct map design stays), and there
> isn't even any fd or other object around by which you can only access
> the data if you can see the key.
I am not saying managing MKTME key/keyID in key retention service is definitely better, but it seems
all those you mentioned are not related to whether to choose key retention service to manage MKTME
key/keyID? Or you are saying it doesn't matter we manage key/keyID in key retention service or in
MKTME driver, since MKTME barely have any security benefits (besides physical attack)?
>
> I'm also wondering whether the kernel will always be able to be a
> one-stop shop for key allocation -- if the MKTME hardware gains
> interesting new uses down the road, who knows how key allocation will
> work?
I by now don't have any use case which requires to manage key/keyID specifically for its own use,
rather than letting kernel to manage keyID allocation. Please inspire us if you have any potential.
Thanks,
-Kai
On Mon, 2019-06-17 at 18:43 -0700, Andy Lutomirski wrote:
> On Mon, Jun 17, 2019 at 6:35 PM Kai Huang <[email protected]> wrote:
> >
> >
> > > > >
> > > > > I'm having a hard time imagining that ever working -- wouldn't it blow
> > > > > up if someone did:
> > > > >
> > > > > fd = open("/dev/anything987");
> > > > > ptr1 = mmap(fd);
> > > > > ptr2 = mmap(fd);
> > > > > sys_encrypt(ptr1);
> > > > >
> > > > > So I think it really has to be:
> > > > > fd = open("/dev/anything987");
> > > > > ioctl(fd, ENCRYPT_ME);
> > > > > mmap(fd);
> > > >
> > > > This requires "/dev/anything987" to support ENCRYPT_ME ioctl, right?
> > > >
> > > > So to support NVDIMM (DAX), we need to add ENCRYPT_ME ioctl to DAX?
> > >
> > > Yes and yes, or we do it with layers -- see below.
> > >
> > > I don't see how we can credibly avoid this. If we try to do MKTME
> > > behind the DAX driver's back, aren't we going to end up with cache
> > > coherence problems?
> >
> > I am not sure whether I understand correctly but how is cache coherence problem related to
> > putting
> > MKTME concept to different layers? To make MKTME work with DAX/NVDIMM, I think no matter which
> > layer
> > MKTME concept resides, eventually we need to put keyID into PTE which maps to NVDIMM, and kernel
> > needs to manage cache coherence for NVDIMM just like for normal memory showed in this series?
> >
>
> I mean is that, to avoid cache coherence problems, something has to
> prevent user code from mapping the same page with two different key
> ids. If the entire MKTME mechanism purely layers on top of DAX,
> something needs to prevent the underlying DAX device from being mapped
> at the same time as the MKTME-decrypted view. This is obviously
> doable, but it's not automatic.
Assuming I am understanding the context correctly, yes from this perspective it seems having
sys_encrypt is annoying, and having ENCRYPT_ME should be better. But Dave said "nobody is going to
do what you suggest in the ptr1/ptr2 example"?
Thanks,
-Kai
On Mon, Jun 17, 2019 at 6:40 PM Andy Lutomirski <[email protected]> wrote:
>
> On Mon, Jun 17, 2019 at 6:34 PM Lendacky, Thomas
> <[email protected]> wrote:
> >
> > On 6/17/19 6:59 PM, Kai Huang wrote:
> > > On Mon, 2019-06-17 at 11:27 -0700, Dave Hansen wrote:
>
> > >
> > > And yes from my reading (better to have AMD guys to confirm) SEV guest uses anonymous memory, but it
> > > also pins all guest memory (by calling GUP from KVM -- SEV specifically introduced 2 KVM ioctls for
> > > this purpose), since SEV architecturally cannot support swapping, migraiton of SEV-encrypted guest
> > > memory, because SME/SEV also uses physical address as "tweak", and there's no way that kernel can
> > > get or use SEV-guest's memory encryption key. In order to swap/migrate SEV-guest memory, we need SGX
> > > EPC eviction/reload similar thing, which SEV doesn't have today.
> >
> > Yes, all the guest memory is currently pinned by calling GUP when creating
> > an SEV guest.
>
> Ick.
>
> What happens if QEMU tries to read the memory? Does it just see
> ciphertext? Is cache coherency lost if QEMU writes it?
I should add: is the current interface that SEV uses actually good, or
should the kernel try to do something differently? I've spent exactly
zero time looking at SEV APIs or at how QEMU manages its memory.
On Mon, Jun 17, 2019 at 7:11 PM Kai Huang <[email protected]> wrote:
>
> On Mon, 2019-06-17 at 18:50 -0700, Andy Lutomirski wrote:
> > On Mon, Jun 17, 2019 at 5:48 PM Kai Huang <[email protected]> wrote:
> > >
> > >
> > > >
> > > > > And another silly argument: if we had /dev/mktme, then we could
> > > > > possibly get away with avoiding all the keyring stuff entirely.
> > > > > Instead, you open /dev/mktme and you get your own key under the hook.
> > > > > If you want two keys, you open /dev/mktme twice. If you want some
> > > > > other program to be able to see your memory, you pass it the fd.
> > > >
> > > > We still like the keyring because it's one-stop-shopping as the place
> > > > that *owns* the hardware KeyID slots. Those are global resources and
> > > > scream for a single global place to allocate and manage them. The
> > > > hardware slots also need to be shared between any anonymous and
> > > > file-based users, no matter what the APIs for the anonymous side.
> > >
> > > MKTME driver (who creates /dev/mktme) can also be the one-stop-shopping. I think whether to
> > > choose
> > > keyring to manage MKTME key should be based on whether we need/should take advantage of existing
> > > key
> > > retention service functionalities. For example, with key retention service we can
> > > revoke/invalidate/set expiry for a key (not sure whether MKTME needs those although), and we
> > > have
> > > several keyrings -- thread specific keyring, process specific keyring, user specific keyring,
> > > etc,
> > > thus we can control who can/cannot find the key, etc. I think managing MKTME key in MKTME driver
> > > doesn't have those advantages.
> > >
> >
> > Trying to evaluate this with the current proposed code is a bit odd, I
> > think. Suppose you create a thread-specific key and then fork(). The
> > child can presumably still use the key regardless of whether the child
> > can nominally access the key in the keyring because the PTEs are still
> > there.
>
> Right. This is a little bit odd, although virtualization (Qemu, which is the main use case of MKTME
> at least so far) doesn't use fork().
>
> >
> > More fundamentally, in some sense, the current code has no semantics.
> > Associating a key with memory and "encrypting" it doesn't actually do
> > anything unless you are attacking the memory bus but you haven't
> > compromised the kernel. There's no protection against a guest that
> > can corrupt its EPT tables, there's no protection against kernel bugs
> > (*especially* if the duplicate direct map design stays), and there
> > isn't even any fd or other object around by which you can only access
> > the data if you can see the key.
>
> I am not saying managing MKTME key/keyID in key retention service is definitely better, but it seems
> all those you mentioned are not related to whether to choose key retention service to manage MKTME
> key/keyID? Or you are saying it doesn't matter we manage key/keyID in key retention service or in
> MKTME driver, since MKTME barely have any security benefits (besides physical attack)?
Mostly the latter. I think it's very hard to evaluate whether a given
key allocation model makes sense given that MKTME provides such weak
security benefits. TME has obvious security benefits, as does
encryption of persistent memory, but this giant patch set isn't needed
for plain TME and it doesn't help with persistent memory.
>
> >
> > I'm also wondering whether the kernel will always be able to be a
> > one-stop shop for key allocation -- if the MKTME hardware gains
> > interesting new uses down the road, who knows how key allocation will
> > work?
>
> I by now don't have any use case which requires to manage key/keyID specifically for its own use,
> rather than letting kernel to manage keyID allocation. Please inspire us if you have any potential.
>
Other than compliance, I can't think of much reason that using
multiple keys is useful, regardless of how their allocated. The only
thing I've thought of is that, with multiple keys, you can use PCONFIG
to remove one and flush caches and the data is most definitely gone.
On the other hand, you can just zero the memory and the data is just
as gone even without any encryption.
On Tue, Jun 18, 2019 at 02:23:31PM +1200, Kai Huang wrote:
> Assuming I am understanding the context correctly, yes from this perspective it seems having
> sys_encrypt is annoying, and having ENCRYPT_ME should be better. But Dave said "nobody is going to
> do what you suggest in the ptr1/ptr2 example"?
You have to phrase that as: 'nobody who knows what he's doing is going
to do that', which leaves lots of people and fuzzers.
Murphy states that if it is possible, someone _will_ do it. And this
being something that causes severe data corruption on persistent
storage,...
On 6/18/19 2:12 AM, Peter Zijlstra wrote:
> On Tue, Jun 18, 2019 at 02:23:31PM +1200, Kai Huang wrote:
>> Assuming I am understanding the context correctly, yes from this perspective it seems having
>> sys_encrypt is annoying, and having ENCRYPT_ME should be better. But Dave said "nobody is going to
>> do what you suggest in the ptr1/ptr2 example"?
>
> You have to phrase that as: 'nobody who knows what he's doing is going
> to do that', which leaves lots of people and fuzzers.
>
> Murphy states that if it is possible, someone _will_ do it. And this
> being something that causes severe data corruption on persistent
> storage,...
I actually think it's not a big deal at all to avoid the corruption that
would occur if it were allowed. But, if you're even asking to map the
same data with two different keys, you're *asking* for data corruption.
What we're doing here is continuing to preserve cache coherency and
ensuring an early failure.
We'd need two rules:
1. A page must not be faulted into a VMA if the page's page_keyid()
is not consistent with the VMA's
2. Upon changing the VMA's KeyID, all underlying PTEs must either be
checked or zapped.
If the rules are broken, we SIGBUS. Andy's suggestion has the same
basic requirements. But, with his scheme, the error can be to the
ioctl() instead of in the form of a SIGBUS. I guess that makes the
fuzzers' lives a bit easier.
BTW, note that we don't have any problems with the current anonymous
implementation and fork() because we zap at the encryption syscall.
On 6/17/19 5:15 PM, Andy Lutomirski wrote:
>>> But I really expect that the encryption of a DAX device will actually
>>> be a block device setting and won't look like this at all. It'll be
>>> more like dm-crypt except without device mapper.
>> Are you suggesting not to support MKTME for DAX, or adding MKTME support to dm-crypt?
> I'm proposing exposing it by an interface that looks somewhat like
> dm-crypt. Either we could have a way to create a device layered on
> top of the DAX devices that exposes a decrypted view or we add a way
> to tell the DAX device to kindly use MKTME with such-and-such key.
I think this basically implies that we need to settle (or at least
present) on an interface for storage (FS-DAX, Device DAX, page cache)
before we merge one for anonymous memory.
That sounds like a reasonable exercise.
On 6/17/19 6:50 PM, Andy Lutomirski wrote:
> I'm also wondering whether the kernel will always be able to be a
> one-stop shop for key allocation -- if the MKTME hardware gains
> interesting new uses down the road, who knows how key allocation will
> work?
I can't share all the details on LKML, of course, but I can at least say
that this model of allocating KeyID slots will continue to be used for a
number of generations.
On Tue, Jun 18, 2019 at 07:09:36AM -0700, Dave Hansen wrote:
> On 6/18/19 2:12 AM, Peter Zijlstra wrote:
> > On Tue, Jun 18, 2019 at 02:23:31PM +1200, Kai Huang wrote:
> >> Assuming I am understanding the context correctly, yes from this perspective it seems having
> >> sys_encrypt is annoying, and having ENCRYPT_ME should be better. But Dave said "nobody is going to
> >> do what you suggest in the ptr1/ptr2 example"?
> >
> > You have to phrase that as: 'nobody who knows what he's doing is going
> > to do that', which leaves lots of people and fuzzers.
> >
> > Murphy states that if it is possible, someone _will_ do it. And this
> > being something that causes severe data corruption on persistent
> > storage,...
>
> I actually think it's not a big deal at all to avoid the corruption that
> would occur if it were allowed. But, if you're even asking to map the
> same data with two different keys, you're *asking* for data corruption.
> What we're doing here is continuing to preserve cache coherency and
> ensuring an early failure.
>
> We'd need two rules:
> 1. A page must not be faulted into a VMA if the page's page_keyid()
> is not consistent with the VMA's
> 2. Upon changing the VMA's KeyID, all underlying PTEs must either be
> checked or zapped.
>
> If the rules are broken, we SIGBUS. Andy's suggestion has the same
> basic requirements. But, with his scheme, the error can be to the
> ioctl() instead of in the form of a SIGBUS. I guess that makes the
> fuzzers' lives a bit easier.
I see a problem with the scheme: if we don't have a way to decide if the
key is right for the file, user without access to the right key is able to
prevent legitimate user from accessing the file. Attacker just need read
access to the encrypted file to prevent any legitimate use to access it.
The problem applies to ioctl() too.
To make sense of it we must have a way to distinguish right key from
wrong. I don't see obvious solution with the current hardware design.
--
Kirill A. Shutemov
On 6/18/19 9:15 AM, Kirill A. Shutemov wrote:
>> We'd need two rules:
>> 1. A page must not be faulted into a VMA if the page's page_keyid()
>> is not consistent with the VMA's
>> 2. Upon changing the VMA's KeyID, all underlying PTEs must either be
>> checked or zapped.
>>
>> If the rules are broken, we SIGBUS. Andy's suggestion has the same
>> basic requirements. But, with his scheme, the error can be to the
>> ioctl() instead of in the form of a SIGBUS. I guess that makes the
>> fuzzers' lives a bit easier.
> I see a problem with the scheme: if we don't have a way to decide if the
> key is right for the file, user without access to the right key is able to
> prevent legitimate user from accessing the file. Attacker just need read
> access to the encrypted file to prevent any legitimate use to access it.
I think you're bringing up a separate issue.
We were talking about how you resolve a conflict when someone attempts
to use two *different* keyids to decrypt the data in the API and what
the resulting API interaction looks like.
You're describing the situation where one of those is the wrong *key*
(not keyid). That's a subtly different scenario and requires different
handling (or no handling IMNHO).
> On Jun 18, 2019, at 9:22 AM, Dave Hansen <[email protected]> wrote:
>
> On 6/18/19 9:15 AM, Kirill A. Shutemov wrote:
>>> We'd need two rules:
>>> 1. A page must not be faulted into a VMA if the page's page_keyid()
>>> is not consistent with the VMA's
>>> 2. Upon changing the VMA's KeyID, all underlying PTEs must either be
>>> checked or zapped.
>>>
>>> If the rules are broken, we SIGBUS. Andy's suggestion has the same
>>> basic requirements. But, with his scheme, the error can be to the
>>> ioctl() instead of in the form of a SIGBUS. I guess that makes the
>>> fuzzers' lives a bit easier.
>> I see a problem with the scheme: if we don't have a way to decide if the
>> key is right for the file, user without access to the right key is able to
>> prevent legitimate user from accessing the file. Attacker just need read
>> access to the encrypted file to prevent any legitimate use to access it.
>
> I think you're bringing up a separate issue.
>
> We were talking about how you resolve a conflict when someone attempts
> to use two *different* keyids to decrypt the data in the API and what
> the resulting API interaction looks like.
>
> You're describing the situation where one of those is the wrong *key*
> (not keyid). That's a subtly different scenario and requires different
> handling (or no handling IMNHO).
I think we’re quibbling over details before we look at the big questions:
Should MKTME+DAX encrypt the entire volume or should it encrypt individual files? Or both?
If it encrypts individual files, should the fs be involved at all? Should there be metadata that can check whether a given key is the correct key?
If it encrypts individual files, is it even conceptually possible to avoid corruption if the fs is not involved? After all, many filesystems think that they can move data blocks, compute checksums, journal data, etc.
I think Dave is right that there should at least be a somewhat credible proposal for how this could fit together.
On 6/18/19 9:36 AM, Andy Lutomirski wrote:
> Should MKTME+DAX encrypt the entire volume or should it encrypt individual files? Or both?
Our current thought is that there should be two modes: One for entire
DAX namespaces and one for filesystem DAX that would allow it to be at
the file level. More likely, we would mirror fscrypt and do it at the
directory level.
> If it encrypts individual files, should the fs be involved at all?
> Should there be metadata that can check whether a given key is the
> correct key?
FWIW, this is a question for the fs guys. Their guidance so far has
been "do what fscrypt does", and fscrypt does not protect against
incorrect keys being specified. See:
https://www.kernel.org/doc/html/v5.1/filesystems/fscrypt.html
Which says:
> Currently, fscrypt does not prevent a user from maliciously providing
> an incorrect key for another user’s existing encrypted files. A
> protection against this is planned.
> If it encrypts individual files, is it even conceptually possible to
> avoid corruption if the fs is not involved? After all, many
> filesystems think that they can move data blocks, compute checksums,
> journal data, etc.
Yes, exactly. Thankfully, fscrypt has thought about this already and
has infrastructure for this. For instance:
> Online defragmentation of encrypted files is not supported. The
> EXT4_IOC_MOVE_EXT and F2FS_IOC_MOVE_RANGE ioctls will fail with
> EOPNOTSUPP.