2021-08-15 01:04:40

by Gavin Shan

Subject: [PATCH v4 00/15] Support Asynchronous Page Fault

There are two stages of page faults. The guest kernel is responsible for
handling stage-1 page faults, while the host kernel takes care of stage-2
page faults. When the guest traps to the host because of a stage-2 page
fault, the guest is suspended until the requested memory (page) is
populated. Sometimes, populating the requested page isn't cheap and can
take hundreds of milliseconds in extreme cases. Similarly, the guest has
to wait until the requested memory is ready in the scenario of post-copy
live migration.

This series introduces a feature (Asynchronous Page Fault) to improve the
situation, so that the guest needn't wait in these scenarios. With it, the
overall performance of the guest is improved. This series depends on the
"SDEI virtualization" feature and QEMU changes. All code changes can be
found on github:

https://github.com/gwshan/linux ("kvm/arm64_sdei") # SDEI virtualization
https://github.com/gwshan/linux ("kvm/arm64_apf") # This series + "sdei"
https://github.com/gwshan/qemu ("kvm/arm64_apf") # QEMU code changes

The design details can be found in the last patch. Generally, the feature
is driven by two notifications: page-not-present and page-ready. They are
delivered from the host to the guest via an SDEI event and a PPI
respectively. Each notification is always associated with a token, which
identifies the notification. The token is passed through memory shared
between host and guest. Besides, the SMCCC and ioctl interfaces are used
by the guest and VMM respectively to configure, enable, disable, and even
migrate the functionality.
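
For reference, below is a minimal sketch of the per-vCPU shared buffer,
inferred from how the guest accesses it in PATCH[14]; the exact layout is
defined by "struct kvm_vcpu_pv_apf_data" in the series and may differ.

    /* Sketch only: fields inferred from guest-side usage */
    struct kvm_vcpu_pv_apf_data {
            __u32 reason;   /* KVM_PV_REASON_PAGE_NOT_PRESENT or
                             * KVM_PV_REASON_PAGE_READY; cleared by guest */
            __u32 token;    /* identifies the in-flight async page fault */
            __u32 enabled;  /* feature enabled on this vCPU */
    };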

When the guest traps to the host because of a stage-2 page fault and the
requested page can't be populated immediately, the host raises a
page-not-present notification and sends it to the guest through the
dedicated SDEI event (0x40400001). Meanwhile, a (background) worker is
started to populate the requested page. On receiving the SDEI event, the
guest marks the currently running process with a special flag
(TIF_ASYNC_PF) and associates it with a pre-allocated waitqueue. At the
same time, a (reschedule) IPI is sent to the current CPU. After the SDEI
event is acknowledged by the guest, the (reschedule) IPI is delivered and
causes a context switch from the process tagged with TIF_ASYNC_PF to
another process.

Later on, a page-ready notification is sent to the guest after the
requested page has been populated by the (background) worker. On receiving
the interrupt, the guest uses the associated token to locate the process
that was previously suspended because of page-not-present. The flag
(TIF_ASYNC_PF) is cleared for the suspended process and it's woken up.
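
The guest-side wait is sketched below. It approximates what PATCH[13] adds
to the kernel-to-user return path; the helper name and exact placement are
assumptions, while the flag and wait queue usage follow PATCH[14].

    /* Sketch only: block until the page-ready notification clears
     * TIF_ASYNC_PF and wakes the swait queue recorded in thread.data.
     */
    static void kvm_async_pf_wait(void)
    {
            struct swait_queue_head *wq = READ_ONCE(current->thread.data);
            DECLARE_SWAITQUEUE(wait);

            if (!wq)
                    return;

            for (;;) {
                    prepare_to_swait_exclusive(wq, &wait, TASK_UNINTERRUPTIBLE);
                    if (!test_thread_flag(TIF_ASYNC_PF))
                            break;
                    schedule();
            }
            finish_swait(wq, &wait);
    }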

The series is organized as below:

PATCH[01-04] make the GFN hash table management generic so that it
             can be shared by x86/arm64.
PATCH[05-06] are preparatory work to support asynchronous page fault.
PATCH[07-08] support asynchronous page fault on the host side.
PATCH[09-11] support the ioctl and SMCCC interfaces for the functionality.
PATCH[12-14] support asynchronous page fault on the guest side.
PATCH[15]    adds a document to explain the design and internals.

Testing
=======

The tests are carried out using the program "testsuite", which I wrote
myself. The program basically does two things: (a) starts a thread that
allocates all the available memory and writes to it the specified number
of times; (b) optionally starts a parallel thread that does calculation
while the memory is being written.
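
Neither "testsuite" nor its source is part of this series; the sketch
below merely illustrates the shape of the workload described above (all
names and helpers are made up).

    /* Sketch only: one thread rewrites all available memory N times
     * while a sibling thread measures how much calculation it gets done.
     */
    static void *writer(void *arg)
    {
            long i, n = (long)arg;
            size_t size = available_memory();      /* hypothetical helper */
            char *buf = malloc(size);

            for (i = 0; i < n; i++)
                    memset(buf, i, size);          /* touch every page */

            free(buf);
            return NULL;
    }

    static void *calculator(void *arg)
    {
            volatile unsigned long *count = arg;   /* calculation capacity */

            while (!done)                          /* hypothetical flag */
                    (*count)++;
            return NULL;
    }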

Besides, there are two testing scenarios: (a) the QEMU process is put into
a cgroup where its memory is limited. With this, we should see asynchronous
page fault activity. The total time for "testsuite" to finish is measured,
and the calculation capacity is also measured if the corresponding thread
is started. (b) The total time and calculation capacity are measured while
the workload is being migrated.

(a) Running "testsuite" with a (cgroup) memory limit on the QEMU process.
    When the calculation thread isn't started, the consumed time increases
    slightly because of the overhead introduced by asynchronous page
    fault. The total time drops by ~40% when the calculation thread is
    started, meaning the parallelism is greatly improved by asynchronous
    page fault.

vCPU: 1 Memory: 1024MB cgroupv2.limit: 512MB
Command: testsuite test async_pf -l 1 [-t] -q

Time-     Calculation-   Time+     Calculation+   Output
--------------------------------------------------------------------
14.592s                  15.010s                  +2.8%
15.726s                  15.185s                  +3.4%
15.742s                  15.192s                  +3.4%
15.827s                  15.270s                  +3.5%
15.831s                  15.291s                  +3.4%
27.880s   2108m          16.539s   1104m          -40.6%  -47.6%
27.972s   2111m          16.588s   1110m          -40.6%  -47.4%
28.020s   2114m          16.656s   1117m          -40.5%  -47.1%
28.227s   2135m          16.722s   1105m          -40.7%  -48.2%
28.918s   2194m          16.767s   1113m          -42.0%  -49.2%

Asynchronous page faults: 55000

(b) Migrating the workload ("testsuite"). The total time drops a bit,
    since the migration completes in a very short period (~1.5s) and there
    isn't much asynchronous page fault activity during the migration. It's
    beneficial to migration performance overall if the guest experiences a
    high workload. However, when the guest doesn't have a high workload,
    the total time increases by ~14% because of the overhead introduced by
    asynchronous page fault.

vCPU: 1 Memory: 1024MB cgroupv2.limit: unlimited
Command: testsuite test async_pf -l 50 [-t] -q

Time-     Calculation-   Time+     Calculation+   Output
--------------------------------------------------------------------
11.132s                  12.655s                  +13.6%
11.135s                  12.707s                  +14.1%
11.143s                  12.728s                  +14.2%
11.167s                  12.746s                  +14.1%
11.172s                  12.821s                  +14.7%
27.308s   2252m          25.827s   2131m          -5.4%
27.440s   2275m          26.517s   2333m          -3.3%
28.069s   2364m          26.520s   2356m          -5.5%
28.777s   2427m          26.726s   2383m          -7.1%
28.915s   2452m          27.632s   2508m          -4.4%

migrate.total_time: ~1.6s
Asynchronous page faults: ~100 times

Changelog
=========
v4:
* Rebase to v5.14.rc5 and retest (Gavin)
v3:
* Rebase to v5.13.rc1 (Gavin)
* Drop the patches from Will to detect the SMCCC KVM service (Gavin)
* Retest and recapture the benchmarks (Gavin)
v2:
* Rebase to v5.11.rc6 (Gavin)
* Split the patches (James)
* Allocate "struct kvm_arch_async_control" dynamically and use
  it to check if the feature has been enabled. The kernel
  option (CONFIG_KVM_ASYNC_PF) isn't used. (James)
* Add a document to explain the design (James)
* Make GFN hash table management generic (James)
* Add ioctl commands to support migration (Gavin)

Gavin Shan (15):
KVM: async_pf: Move struct kvm_async_pf around
KVM: async_pf: Add helper function to check completion queue
KVM: async_pf: Make GFN slot management generic
KVM: x86: Use generic async PF slot management
KVM: arm64: Export kvm_handle_user_mem_abort()
KVM: arm64: Add paravirtualization header files
KVM: arm64: Support page-not-present notification
KVM: arm64: Support page-ready notification
KVM: arm64: Support async PF hypercalls
KVM: arm64: Support async PF ioctl commands
KVM: arm64: Export async PF capability
arm64: Detect async PF para-virtualization feature
arm64: Reschedule process on async PF
arm64: Enable async PF
KVM: arm64: Add async PF document

Documentation/virt/kvm/arm/apf.rst     | 143 +++++++
Documentation/virt/kvm/arm/index.rst   |   1 +
arch/arm64/Kconfig                     |  11 +
arch/arm64/include/asm/esr.h           |   6 +
arch/arm64/include/asm/kvm_emulate.h   |  27 +-
arch/arm64/include/asm/kvm_host.h      |  85 ++++
arch/arm64/include/asm/kvm_para.h      |  37 ++
arch/arm64/include/asm/processor.h     |   1 +
arch/arm64/include/asm/thread_info.h   |   4 +-
arch/arm64/include/uapi/asm/Kbuild     |   2 -
arch/arm64/include/uapi/asm/kvm.h      |  19 +
arch/arm64/include/uapi/asm/kvm_para.h |  23 ++
arch/arm64/include/uapi/asm/kvm_sdei.h |   1 +
arch/arm64/kernel/Makefile             |   1 +
arch/arm64/kernel/kvm.c                | 452 +++++++++++++++++++++
arch/arm64/kernel/signal.c             |  17 +
arch/arm64/kvm/Kconfig                 |   2 +
arch/arm64/kvm/Makefile                |   1 +
arch/arm64/kvm/arm.c                   |  37 +-
arch/arm64/kvm/async_pf.c              | 533 +++++++++++++++++++++++++
arch/arm64/kvm/hypercalls.c            |   5 +
arch/arm64/kvm/mmu.c                   |  76 +++-
arch/arm64/kvm/sdei.c                  |   5 +
arch/x86/include/asm/kvm_host.h        |   2 -
arch/x86/kvm/Kconfig                   |   1 +
arch/x86/kvm/mmu/mmu.c                 |   2 +-
arch/x86/kvm/x86.c                     |  88 +---
include/linux/arm-smccc.h              |  15 +
include/linux/kvm_host.h               |  72 +++-
include/uapi/linux/kvm.h               |   3 +
virt/kvm/Kconfig                       |   3 +
virt/kvm/async_pf.c                    |  95 ++++-
virt/kvm/kvm_main.c                    |   4 +-
33 files changed, 1621 insertions(+), 153 deletions(-)
create mode 100644 Documentation/virt/kvm/arm/apf.rst
create mode 100644 arch/arm64/include/asm/kvm_para.h
create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
create mode 100644 arch/arm64/kernel/kvm.c
create mode 100644 arch/arm64/kvm/async_pf.c

--
2.23.0


2021-08-15 01:04:40

by Gavin Shan

Subject: [PATCH v4 15/15] KVM: arm64: Add async PF document

This adds a document to explain the interface for asynchronous page
fault and how it works in general.

Signed-off-by: Gavin Shan <[email protected]>
---
Documentation/virt/kvm/arm/apf.rst   | 143 +++++++++++++++++++++
Documentation/virt/kvm/arm/index.rst |   1 +
2 files changed, 144 insertions(+)
create mode 100644 Documentation/virt/kvm/arm/apf.rst

diff --git a/Documentation/virt/kvm/arm/apf.rst b/Documentation/virt/kvm/arm/apf.rst
new file mode 100644
index 000000000000..4f5c01b6699f
--- /dev/null
+++ b/Documentation/virt/kvm/arm/apf.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Asynchronous Page Fault Support for arm64
+=========================================
+
+There are two stages of page faults when KVM is enabled as an accelerator
+for the guest. The guest is responsible for handling the stage-1 page faults,
+while the host handles the stage-2 page faults. While a stage-2 page fault is
+being handled, the guest is suspended until the requested page is ready. It
+could take several milliseconds, or even hundreds of milliseconds in extreme
+situations, because I/O might be required to move the requested page from
+disk to DRAM. The guest does not do any work while it is suspended. The
+asynchronous page fault feature is introduced to take advantage of the
+suspension period and improve overall performance.
+
+There are two paths used to fulfil asynchronous page faults, called the
+control path and the data path. The control path allows the VMM or guest to
+configure the functionality, while the notifications are delivered on the
+data path. The notifications are classified into page-not-present and
+page-ready notifications.
+
+Data Path
+---------
+
+There are two types of notifications delivered from the host to the guest on
+the data path: page-not-present and page-ready. They are delivered through an
+SDEI event and a (PPI) interrupt respectively. Besides, there is a buffer
+shared between the host and guest that carries the reason and a sequential
+token, which is used to identify the asynchronous page fault. The reason and
+token residing in the shared buffer are written by the host, and read and
+cleared by the guest. An asynchronous page fault is delivered and completed
+as below.
+
+(1) When an asynchronous page fault starts, a (workqueue) worker is created
+    and queued to the vCPU's pending queue. The worker makes the requested
+    page ready and resident in DRAM in the background. The shared buffer is
+    updated with the reason and sequential token. After that, an SDEI event
+    is sent to the guest as the page-not-present notification.
+
+(2) When the SDEI event is received by the guest, the current process is
+    tagged with TIF_ASYNC_PF and associated with a wait queue. The process
+    is set up to reschedule itself on switching from kernel to user mode.
+    After that, a reschedule IPI is sent to the current CPU and the received
+    SDEI event is acknowledged. Note that the IPI is delivered once the
+    acknowledgment of the SDEI event is received by the host.
+
+(3) On the host, the worker is dequeued from the vCPU's pending queue and
+    enqueued to its completion queue when the requested page becomes ready.
+    Meanwhile, a KVM_REQ_ASYNC_PF request is sent to the vCPU if the worker
+    is the first element enqueued to the completion queue.
+
+(4) With a pending KVM_REQ_ASYNC_PF request, the first worker in the
+    completion queue is dequeued and destroyed. Meanwhile, a (PPI) interrupt
+    is sent to the guest with the updated reason and token in the shared
+    buffer.
+
+(5) When the (PPI) interrupt is received by the guest, the affected process
+    is located using the token and woken up after its TIF_ASYNC_PF tag is
+    cleared. After that, the interrupt is acknowledged through the SMCCC
+    interface. If any workers remain in the completion queue, the next one
+    is dequeued and destroyed, and another (PPI) interrupt is sent to the
+    guest.
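+
+Putting the pieces together, a condensed guest-side view of one round trip
+is sketched below. The helper names are taken from later patches in this
+series; the sequence is illustrative rather than exact::
+
+  /* SDEI handler: page-not-present notification */
+  token = apf_data.token;                 /* read from shared buffer   */
+  kvm_async_pf_add_task(current, token);  /* tag TIF_ASYNC_PF, pick wq */
+  apf_data.reason = apf_data.token = 0;   /* clear for the next event  */
+  smp_send_reschedule(smp_processor_id());
+
+  /* PPI handler: page-ready notification */
+  token = apf_data.token;
+  kvm_async_pf_remove_task(token);        /* clear flag, wake process  */
+  apf_data.reason = apf_data.token = 0;
+  /* acknowledge via ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ_ACK */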
+
+Control Path
+------------
+
+The configurations are passed through the SMCCC or ioctl interfaces. The
+SDEI event and (PPI) interrupt are owned by the VMM, so the SDEI event and
+interrupt numbers are configured through ioctl commands on a per-vCPU basis.
+Besides, the functionality may be enabled and configured through the ioctl
+interface by the VMM during migration, as sketched after the following list:
+
+ * KVM_ARM_ASYNC_PF_CMD_GET_VERSION
+
+   Returns the current version of the feature supported by the host. It is
+   made up of major, minor and revision fields, each one byte in length.
+
+ * KVM_ARM_ASYNC_PF_CMD_GET_SDEI
+
+   Retrieve the SDEI event number used for the page-not-present
+   notification, so that it can be configured on the destination VM in
+   the migration scenario.
+
+ * KVM_ARM_ASYNC_PF_GET_IRQ
+
+   Retrieve the IRQ (PPI) number used for the page-ready notification, so
+   that it can be configured on the destination VM in the migration
+   scenario.
+
+ * KVM_ARM_ASYNC_PF_CMD_GET_CONTROL
+
+   Retrieve the address of the control block, so that it can be configured
+   on the destination VM in the migration scenario.
+
+ * KVM_ARM_ASYNC_PF_CMD_SET_SDEI
+
+   Used by the VMM to configure the number of the SDEI event, which is used
+   by the host to deliver the page-not-present notification. This is used
+   when the VM is started or migrated.
+
+ * KVM_ARM_ASYNC_PF_CMD_SET_IRQ
+
+   Used by the VMM to configure the number of the (PPI) interrupt, which is
+   used by the host to deliver the page-ready notification. This is used
+   when the VM is started or migrated.
+
+ * KVM_ARM_ASYNC_PF_CMD_SET_CONTROL
+
+   Set the control block on the destination VM in the migration scenario.
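+
+A hedged sketch of how the VMM might use these commands on migration is
+given below. The "struct kvm_arm_async_pf_cmd" layout and the
+KVM_ARM_ASYNC_PF_CMD ioctl number are placeholders for illustration; the
+real uapi is defined earlier in this series::
+
+  /* Sketch only: copy async PF state from a source to a destination vCPU */
+  struct kvm_arm_async_pf_cmd cmd = { .cmd = KVM_ARM_ASYNC_PF_CMD_GET_SDEI };
+
+  ioctl(src_vcpu_fd, KVM_ARM_ASYNC_PF_CMD, &cmd);  /* read on the source */
+  cmd.cmd = KVM_ARM_ASYNC_PF_CMD_SET_SDEI;
+  ioctl(dst_vcpu_fd, KVM_ARM_ASYNC_PF_CMD, &cmd);  /* replay on the dest */
+  /* repeat for the IRQ number and the control block address */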
+
+The other configurations are passed through the SMCCC interface. The host
+exports the capability through a KVM vendor specific service, which is
+identified by ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID. There are several
+functions defined for this:
+
+ * ARM_SMCCC_KVM_FUNC_ASYNC_PF_VERSION
+
+   Returns the current version of the feature supported by the host. It is
+   made up of major, minor and revision fields, each one byte in length
+   (see the sketch at the end of this document).
+
+ * ARM_SMCCC_KVM_FUNC_ASYNC_PF_SLOTS
+
+   Returns the size of the hashed GFN table. It is used by the guest to
+   size its table of waiting processes.
+
+ * ARM_SMCCC_KVM_FUNC_ASYNC_PF_SDEI
+ * ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ
+
+   Used by the guest to retrieve the SDEI event and (PPI) interrupt numbers
+   that were configured by the VMM.
+
+ * ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE
+
+   Used by the guest to enable or disable the feature on the specific vCPU.
+   The argument is made up of the shared buffer address and flags. The
+   shared buffer is written by the host to indicate the reason for the
+   delivered asynchronous page fault and a token (sequence number) to
+   identify it. Two flags are supported: KVM_ASYNC_PF_ENABLED enables or
+   disables the feature, while KVM_ASYNC_PF_SEND_ALWAYS allows the
+   page-not-present notification to be delivered regardless of the guest's
+   state. Otherwise, the notification is delivered only when the guest is
+   in user mode.
+
+ * ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ_ACK
+
+   Used by the guest to acknowledge the completion of the page-ready
+   notification.
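+
+As a worked example of the version encoding used by the VERSION calls above
+(one byte per field, with the major number in bits [23:16], matching the
+guest-side check in this series), a validity test might look like::
+
+  /* Sketch only: accept v1.0.0 or higher */
+  static bool apf_version_ok(unsigned long version)
+  {
+          unsigned int major = (version >> 16) & 0xFF;
+
+          /* no bits may be set above the major field */
+          return !(version & ~0xFFFFFFUL) && major >= 1;
+  }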
diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
index 78a9b670aafe..f43b5fe25f61 100644
--- a/Documentation/virt/kvm/arm/index.rst
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -7,6 +7,7 @@ ARM
.. toctree::
:maxdepth: 2

+ apf
hyp-abi
psci
pvtime
--
2.23.0

2021-08-15 01:05:27

by Gavin Shan

Subject: [PATCH v4 12/15] arm64: Detect async PF para-virtualization feature

This implements kvm_para_available() to check whether para-virtualization
features are available. Besides, kvm_para_has_feature() is enhanced to
detect the asynchronous page fault para-virtualization feature. These two
functions are going to be used by the guest kernel to enable asynchronous
page fault.

This also adds a kernel option (CONFIG_KVM_GUEST), which is the umbrella
for optimizations related to KVM para-virtualization.
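
A minimal usage sketch (this mirrors how the helpers are consumed by the
"arm64: Enable async PF" patch later in this series):

    if (kvm_para_available() &&
        kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) {
            /* async PF PV feature is present; probe details via SMCCC */
    }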

Signed-off-by: Gavin Shan <[email protected]>
---
arch/arm64/Kconfig                     | 11 +++++++++++
arch/arm64/include/asm/kvm_para.h      | 12 +++++++++++-
arch/arm64/include/uapi/asm/kvm_para.h |  2 ++
3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fdcd54d39c1e..6dceae6ed7d3 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1081,6 +1081,17 @@ config PARAVIRT_TIME_ACCOUNTING

If in doubt, say N here.

+config KVM_GUEST
+ bool "KVM Guest Support"
+ depends on PARAVIRT
+ default y
+ help
+ This option enables various optimizations for running under the KVM
+ hypervisor. Overhead for the kernel when not running inside KVM should
+ be minimal.
+
+ If in doubt, say Y.
+
config KEXEC
depends on PM_SLEEP_SMP
select KEXEC_CORE
diff --git a/arch/arm64/include/asm/kvm_para.h b/arch/arm64/include/asm/kvm_para.h
index 0ea481dd1c7a..8f39c60a6619 100644
--- a/arch/arm64/include/asm/kvm_para.h
+++ b/arch/arm64/include/asm/kvm_para.h
@@ -3,6 +3,8 @@
#define _ASM_ARM_KVM_PARA_H

#include <uapi/asm/kvm_para.h>
+#include <linux/of.h>
+#include <asm/hypervisor.h>

static inline bool kvm_check_and_clear_guest_paused(void)
{
@@ -11,7 +13,12 @@ static inline bool kvm_check_and_clear_guest_paused(void)

static inline unsigned int kvm_arch_para_features(void)
{
- return 0;
+ unsigned int features = 0;
+
+ if (kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_ASYNC_PF))
+ features |= (1 << KVM_FEATURE_ASYNC_PF);
+
+ return features;
}

static inline unsigned int kvm_arch_para_hints(void)
@@ -21,6 +28,9 @@ static inline unsigned int kvm_arch_para_hints(void)

static inline bool kvm_para_available(void)
{
+ if (IS_ENABLED(CONFIG_KVM_GUEST))
+ return true;
+
return false;
}

diff --git a/arch/arm64/include/uapi/asm/kvm_para.h b/arch/arm64/include/uapi/asm/kvm_para.h
index 162325e2638f..70bbc7d1ec75 100644
--- a/arch/arm64/include/uapi/asm/kvm_para.h
+++ b/arch/arm64/include/uapi/asm/kvm_para.h
@@ -4,6 +4,8 @@

#include <linux/types.h>

+#define KVM_FEATURE_ASYNC_PF 0
+
/* Async PF */
#define KVM_ASYNC_PF_ENABLED (1 << 0)
#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1)
--
2.23.0

2021-08-15 01:05:38

by Gavin Shan

Subject: [PATCH v4 14/15] arm64: Enable async PF

This enables asynchronous page fault on the guest side. The design
is highlighted as below:

* A per-vCPU shared memory region, represented by
  "struct kvm_vcpu_pv_apf_data", is allocated. The reason and
  token associated with the received asynchronous page fault
  notifications are delivered through it.

* A per-vCPU table, represented by "struct kvm_apf_table", is
  allocated. The process on which the page-not-present notification
  is received is added to the table so that it can reschedule
  itself on switching from kernel to user mode. Afterwards, the
  process, identified by the token, is removed from the table and
  put into runnable state when the page-ready notification is
  received.

* During CPU hotplug, the (private) SDEI event is expected to be
  enabled or disabled on the affected CPU by the SDEI client driver.
  The (PPI) interrupt is enabled or disabled on the affected CPU
  by this driver itself. When the system is going to reboot, the
  SDEI event is disabled and unregistered, and the (PPI) interrupt
  is disabled.

* The SDEI event and (PPI) interrupt numbers are retrieved from the
  host through the SMCCC interface. Besides, the version of the
  asynchronous page fault support is validated when the feature is
  enabled on the guest.

* The feature is disabled on the guest when the boot parameter
  "no-kvmapf" is specified.

Signed-off-by: Gavin Shan <[email protected]>
---
arch/arm64/kernel/Makefile |   1 +
arch/arm64/kernel/kvm.c    | 452 +++++++++++++++++++++++++++++++++++++
2 files changed, 453 insertions(+)
create mode 100644 arch/arm64/kernel/kvm.c

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 3f1490bfb938..f0c1a6a7eaa7 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_ACPI) += acpi.o
obj-$(CONFIG_ACPI_NUMA) += acpi_numa.o
obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL) += acpi_parking_protocol.o
obj-$(CONFIG_PARAVIRT) += paravirt.o
+obj-$(CONFIG_KVM_GUEST) += kvm.o
obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o \
diff --git a/arch/arm64/kernel/kvm.c b/arch/arm64/kernel/kvm.c
new file mode 100644
index 000000000000..effe8dc7e921
--- /dev/null
+++ b/arch/arm64/kernel/kvm.c
@@ -0,0 +1,452 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Asynchronous page fault support.
+ *
+ * Copyright (C) 2021 Red Hat, Inc.
+ *
+ * Author(s): Gavin Shan <[email protected]>
+ */
+
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/of.h>
+#include <linux/of_fdt.h>
+#include <linux/arm-smccc.h>
+#include <linux/kvm_para.h>
+#include <linux/arm_sdei.h>
+#include <linux/acpi.h>
+#include <linux/cpuhotplug.h>
+#include <linux/reboot.h>
+
+struct kvm_apf_task {
+ unsigned int token;
+ struct task_struct *task;
+ struct swait_queue_head wq;
+};
+
+struct kvm_apf_table {
+ raw_spinlock_t lock;
+ unsigned int count;
+ struct kvm_apf_task tasks[0];
+};
+
+static bool async_pf_available = true;
+static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_data) __aligned(64);
+static struct kvm_apf_table __percpu *apf_tables;
+static unsigned int apf_tasks;
+static unsigned int apf_sdei_num;
+static unsigned int apf_ppi_num;
+static int apf_irq;
+
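+/*
+ * Register @task as waiting on @token. Returns true if the task was
+ * newly added to the table; false if the table is full or the task
+ * already has an async PF outstanding (whose token must match).
+ */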
+static bool kvm_async_pf_add_task(struct task_struct *task,
+ unsigned int token)
+{
+ struct kvm_apf_table *table = this_cpu_ptr(apf_tables);
+ unsigned int i, index = apf_tasks;
+ bool ret = false;
+
+ raw_spin_lock(&table->lock);
+
+ if (WARN_ON(table->count >= apf_tasks))
+ goto unlock;
+
+ for (i = 0; i < apf_tasks; i++) {
+ if (!table->tasks[i].task) {
+ if (index == apf_tasks) {
+ ret = true;
+ index = i;
+ }
+ } else if (table->tasks[i].task == task) {
+ WARN_ON(table->tasks[i].token != token);
+ ret = false;
+ break;
+ }
+ }
+
+ if (!ret)
+ goto unlock;
+
+ task->thread.data = &table->tasks[index].wq;
+ set_tsk_thread_flag(task, TIF_ASYNC_PF);
+
+ table->count++;
+ table->tasks[index].task = task;
+ table->tasks[index].token = token;
+
+unlock:
+ raw_spin_unlock(&table->lock);
+ return ret;
+}
+
+static inline void kvm_async_pf_remove_one_task(struct kvm_apf_table *table,
+ unsigned int index)
+{
+ clear_tsk_thread_flag(table->tasks[index].task, TIF_ASYNC_PF);
+ WRITE_ONCE(table->tasks[index].task->thread.data, NULL);
+
+ table->count--;
+ table->tasks[index].task = NULL;
+ table->tasks[index].token = 0;
+
+ swake_up_one(&table->tasks[index].wq);
+}
+
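+/*
+ * Wake the task waiting on @token and drop it from the table. The
+ * special token UINT_MAX wakes every waiter. Returns true if a
+ * matching task was found (always true for a broadcast).
+ */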
+static bool kvm_async_pf_remove_task(unsigned int token)
+{
+ struct kvm_apf_table *table = this_cpu_ptr(apf_tables);
+ unsigned int i;
+ bool ret = (token == UINT_MAX);
+
+ raw_spin_lock(&table->lock);
+
+ for (i = 0; i < apf_tasks; i++) {
+ if (!table->tasks[i].task)
+ continue;
+
+ /* Wakeup all */
+ if (token == UINT_MAX) {
+ kvm_async_pf_remove_one_task(table, i);
+ continue;
+ }
+
+ if (table->tasks[i].token == token) {
+ kvm_async_pf_remove_one_task(table, i);
+ ret = true;
+ break;
+ }
+ }
+
+ raw_spin_unlock(&table->lock);
+
+ return ret;
+}
+
+static int kvm_async_pf_sdei_handler(unsigned int event,
+ struct pt_regs *regs,
+ void *arg)
+{
+ unsigned int reason = __this_cpu_read(apf_data.reason);
+ unsigned int token = __this_cpu_read(apf_data.token);
+ bool ret;
+
+ if (reason != KVM_PV_REASON_PAGE_NOT_PRESENT) {
+ pr_warn("%s: Bogus notification (%d, 0x%08x)\n",
+ __func__, reason, token);
+ return -EINVAL;
+ }
+
+ ret = kvm_async_pf_add_task(current, token);
+ __this_cpu_write(apf_data.token, 0);
+ __this_cpu_write(apf_data.reason, 0);
+
+ if (!ret)
+ return -ENOSPC;
+
+ smp_send_reschedule(smp_processor_id());
+
+ return 0;
+}
+
+static irqreturn_t kvm_async_pf_irq_handler(int irq, void *dev_id)
+{
+ unsigned int reason = __this_cpu_read(apf_data.reason);
+ unsigned int token = __this_cpu_read(apf_data.token);
+ struct arm_smccc_res res;
+
+ if (reason != KVM_PV_REASON_PAGE_READY) {
+ pr_warn("%s: Bogus interrupt %d (%d, 0x%08x)\n",
+ __func__, irq, reason, token);
+ return IRQ_HANDLED;
+ }
+
+ kvm_async_pf_remove_task(token);
+
+ __this_cpu_write(apf_data.token, 0);
+ __this_cpu_write(apf_data.reason, 0);
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
+ ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ_ACK, &res);
+
+ return IRQ_HANDLED;
+}
+
+static int __init kvm_async_pf_available(char *arg)
+{
+ async_pf_available = false;
+
+ return 0;
+}
+early_param("no-kvmapf", kvm_async_pf_available);
+
+static void kvm_async_pf_disable(void)
+{
+ struct arm_smccc_res res;
+ u32 enabled = __this_cpu_read(apf_data.enabled);
+
+ if (!enabled)
+ return;
+
+ /* Disable the functionality */
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
+ ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE,
+ 0, 0, &res);
+ if (res.a0 != SMCCC_RET_SUCCESS) {
+ pr_warn("%s: Error %ld to disable on CPU%d\n",
+ __func__, res.a0, smp_processor_id());
+ return;
+ }
+
+ __this_cpu_write(apf_data.enabled, 0);
+
+ pr_info("Async PF disabled on CPU%d\n", smp_processor_id());
+}
+
+static void kvm_async_pf_enable(void)
+{
+ struct arm_smccc_res res;
+ u32 enabled = __this_cpu_read(apf_data.enabled);
+ u64 val = virt_to_phys(this_cpu_ptr(&apf_data));
+
+ if (enabled)
+ return;
+
+ val |= KVM_ASYNC_PF_ENABLED;
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
+ ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE,
+ (u32)val, (u32)(val >> 32), &res);
+ if (res.a0 != SMCCC_RET_SUCCESS) {
+ pr_warn("%s: Error %ld to enable CPU%d\n",
+ __func__, res.a0, smp_processor_id());
+ return;
+ }
+
+ __this_cpu_write(apf_data.enabled, 1);
+
+ pr_info("Async PF enabled on CPU%d\n", smp_processor_id());
+}
+
+static void kvm_async_pf_cpu_disable(void *info)
+{
+ disable_percpu_irq(apf_irq);
+ kvm_async_pf_disable();
+}
+
+static void kvm_async_pf_cpu_enable(void *info)
+{
+ enable_percpu_irq(apf_irq, IRQ_TYPE_LEVEL_HIGH);
+ kvm_async_pf_enable();
+}
+
+static int kvm_async_pf_cpu_reboot_notify(struct notifier_block *nb,
+ unsigned long code,
+ void *unused)
+{
+ if (code == SYS_RESTART) {
+ sdei_event_disable(apf_sdei_num);
+ sdei_event_unregister(apf_sdei_num);
+
+ on_each_cpu(kvm_async_pf_cpu_disable, NULL, 1);
+ }
+
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_async_pf_cpu_reboot_nb = {
+ .notifier_call = kvm_async_pf_cpu_reboot_notify,
+};
+
+static int kvm_async_pf_cpu_online(unsigned int cpu)
+{
+ kvm_async_pf_cpu_enable(NULL);
+
+ return 0;
+}
+
+static int kvm_async_pf_cpu_offline(unsigned int cpu)
+{
+ kvm_async_pf_cpu_disable(NULL);
+
+ return 0;
+}
+
+static int __init kvm_async_pf_check_version(void)
+{
+ struct arm_smccc_res res;
+
+ /*
+ * Check the version; v1.0.0 or higher is required to support
+ * the functionality.
+ */
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
+ ARM_SMCCC_KVM_FUNC_ASYNC_PF_VERSION, &res);
+ if (res.a0 != SMCCC_RET_SUCCESS) {
+ pr_warn("%s: Error %ld to get version\n",
+ __func__, res.a0);
+ return -EPERM;
+ }
+
+ if ((res.a1 & 0xFFFFFFFFFF000000) ||
+ ((res.a1 & 0xFF0000) >> 16) < 0x1) {
+ pr_warn("%s: Invalid version (0x%016lx)\n",
+ __func__, res.a1);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int __init kvm_async_pf_info(void)
+{
+ struct arm_smccc_res res;
+
+ /* Retrieve number of tokens */
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
+ ARM_SMCCC_KVM_FUNC_ASYNC_PF_SLOTS, &res);
+ if (res.a0 != SMCCC_RET_SUCCESS) {
+ pr_warn("%s: Error %ld to get token number\n",
+ __func__, res.a0);
+ return -EPERM;
+ }
+
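+ /* Size the sleeper table at twice the host's GFN hash table size */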
+ apf_tasks = res.a1 * 2;
+
+ /* Retrieve SDEI event number */
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
+ ARM_SMCCC_KVM_FUNC_ASYNC_PF_SDEI, &res);
+ if (res.a0 != SMCCC_RET_SUCCESS) {
+ pr_warn("%s: Error %ld to get SDEI event number\n",
+ __func__, res.a0);
+ return -EPERM;
+ }
+
+ apf_sdei_num = res.a1;
+
+ /* Retrieve (PPI) interrupt number */
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
+ ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ, &res);
+ if (res.a0 != SMCCC_RET_SUCCESS) {
+ pr_warn("%s: Error %ld to get IRQ\n",
+ __func__, res.a0);
+ return -EPERM;
+ }
+
+ apf_ppi_num = res.a1;
+
+ return 0;
+}
+
+static int __init kvm_async_pf_init(void)
+{
+ struct kvm_apf_table *table;
+ size_t size;
+ int cpu, i, ret;
+
+ if (!kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) ||
+ !async_pf_available)
+ return -EPERM;
+
+ ret = kvm_async_pf_check_version();
+ if (ret)
+ return ret;
+
+ ret = kvm_async_pf_info();
+ if (ret)
+ return ret;
+
+ /* Allocate and initialize the sleeper table */
+ size = sizeof(struct kvm_apf_table) +
+ apf_tasks * sizeof(struct kvm_apf_task);
+ apf_tables = __alloc_percpu(size, 0);
+ if (!apf_tables) {
+ pr_warn("%s: Unable to alloc async PF table\n",
+ __func__);
+ return -ENOMEM;
+ }
+
+ for_each_possible_cpu(cpu) {
+ table = per_cpu_ptr(apf_tables, cpu);
+ raw_spin_lock_init(&table->lock);
+ for (i = 0; i < apf_tasks; i++)
+ init_swait_queue_head(&table->tasks[i].wq);
+ }
+
+ /*
+ * Initialize SDEI event for page-not-present notification.
+ * The SDEI event number should have been retrieved from
+ * the host.
+ */
+ ret = sdei_event_register(apf_sdei_num,
+ kvm_async_pf_sdei_handler, NULL);
+ if (ret) {
+ pr_warn("%s: Error %d to register SDEI event\n",
+ __func__, ret);
+ ret = -EIO;
+ goto release_tables;
+ }
+
+ ret = sdei_event_enable(apf_sdei_num);
+ if (ret) {
+ pr_warn("%s: Error %d to enable SDEI event\n",
+ __func__, ret);
+ goto unregister_event;
+ }
+
+ /*
+ * Initialize interrupt for page-ready notification. The
+ * interrupt number and its properties should have been
+ * retrieved from the ACPI:APFT table.
+ */
+ apf_irq = acpi_register_gsi(NULL, apf_ppi_num,
+ ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_HIGH);
+ if (apf_irq <= 0) {
+ ret = -EIO;
+ pr_warn("%s: Error %d to register IRQ\n",
+ __func__, apf_irq);
+ goto disable_event;
+ }
+
+ ret = request_percpu_irq(apf_irq, kvm_async_pf_irq_handler,
+ "Asynchronous Page Fault", &apf_data);
+ if (ret) {
+ pr_warn("%s: Error %d to request IRQ\n",
+ __func__, ret);
+ goto unregister_irq;
+ }
+
+ register_reboot_notifier(&kvm_async_pf_cpu_reboot_nb);
+ ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+ "arm/kvm:online", kvm_async_pf_cpu_online,
+ kvm_async_pf_cpu_offline);
+ if (ret < 0) {
+ pr_warn("%s: Error %d to install cpu hotplug callbacks\n",
+ __func__, ret);
+ goto release_irq;
+ }
+
+ /* Enable async PF on the online CPUs */
+ on_each_cpu(kvm_async_pf_cpu_enable, NULL, 1);
+
+ return 0;
+
+release_irq:
+ free_percpu_irq(apf_irq, &apf_data);
+unregister_irq:
+ acpi_unregister_gsi(apf_ppi_num);
+disable_event:
+ sdei_event_disable(apf_sdei_num);
+unregister_event:
+ sdei_event_unregister(apf_sdei_num);
+release_tables:
+ free_percpu(apf_tables);
+
+ return ret;
+}
+
+static int __init kvm_guest_init(void)
+{
+ return kvm_async_pf_init();
+}
+
+fs_initcall(kvm_guest_init);
--
2.23.0

2021-08-16 17:07:30

by Vitaly Kuznetsov

Subject: Re: [PATCH v4 14/15] arm64: Enable async PF

Gavin Shan <[email protected]> writes:

> This enables asynchronous page fault from guest side. The design
> is highlighted as below:
>
> * The per-vCPU shared memory region, which is represented by
> "struct kvm_vcpu_pv_apf_data", is allocated. The reason and
> token associated with the received notifications of asynchronous
> page fault are delivered through it.
>
> * A per-vCPU table, which is represented by "struct kvm_apf_table",
> is allocated. The process, on which the page-not-present notification
> is received, is added into the table so that it can reschedule
> itself on switching from kernel to user mode. Afterwards, the
> process, identified by token, is removed from the table and put
> into runnable state when page-ready notification is received.
>
> * During CPU hotplug, the (private) SDEI event is expected to be
> enabled or disabled on the affected CPU by SDEI client driver.
> The (PPI) interrupt is enabled or disabled on the affected CPU
> by ourself. When the system is going to reboot, the SDEI event
> is disabled and unregistered and the (PPI) interrupt is disabled.
>
> * The SDEI event and (PPI) interrupt number are retrieved from host
> through SMCCC interface. Besides, the version of the asynchronous
> page fault is validated when the feature is enabled on the guest.
>
> * The feature is disabled on guest when boot parameter "no-kvmapf"
> is specified.

Documentation/admin-guide/kernel-parameters.txt states this one is
x86-only:

no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
fault handling.

Makes sense to update in this patch, I believe.

>
> Signed-off-by: Gavin Shan <[email protected]>
> ---
> arch/arm64/kernel/Makefile | 1 +
> arch/arm64/kernel/kvm.c | 452 +++++++++++++++++++++++++++++++++++++
> 2 files changed, 453 insertions(+)
> create mode 100644 arch/arm64/kernel/kvm.c
>
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index 3f1490bfb938..f0c1a6a7eaa7 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -59,6 +59,7 @@ obj-$(CONFIG_ACPI) += acpi.o
> obj-$(CONFIG_ACPI_NUMA) += acpi_numa.o
> obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL) += acpi_parking_protocol.o
> obj-$(CONFIG_PARAVIRT) += paravirt.o
> +obj-$(CONFIG_KVM_GUEST) += kvm.o
> obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
> obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
> obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o \
> diff --git a/arch/arm64/kernel/kvm.c b/arch/arm64/kernel/kvm.c
> new file mode 100644
> index 000000000000..effe8dc7e921
> --- /dev/null
> +++ b/arch/arm64/kernel/kvm.c
> @@ -0,0 +1,452 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Asynchronous page fault support.
> + *
> + * Copyright (C) 2021 Red Hat, Inc.
> + *
> + * Author(s): Gavin Shan <[email protected]>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/spinlock.h>
> +#include <linux/slab.h>
> +#include <linux/interrupt.h>
> +#include <linux/irq.h>
> +#include <linux/of.h>
> +#include <linux/of_fdt.h>
> +#include <linux/arm-smccc.h>
> +#include <linux/kvm_para.h>
> +#include <linux/arm_sdei.h>
> +#include <linux/acpi.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/reboot.h>
> +
> +struct kvm_apf_task {
> + unsigned int token;
> + struct task_struct *task;
> + struct swait_queue_head wq;
> +};
> +
> +struct kvm_apf_table {
> + raw_spinlock_t lock;
> + unsigned int count;
> + struct kvm_apf_task tasks[0];
> +};
> +
> +static bool async_pf_available = true;
> +static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_data) __aligned(64);
> +static struct kvm_apf_table __percpu *apf_tables;
> +static unsigned int apf_tasks;
> +static unsigned int apf_sdei_num;
> +static unsigned int apf_ppi_num;
> +static int apf_irq;
> +
> +static bool kvm_async_pf_add_task(struct task_struct *task,
> + unsigned int token)
> +{
> + struct kvm_apf_table *table = this_cpu_ptr(apf_tables);
> + unsigned int i, index = apf_tasks;
> + bool ret = false;
> +
> + raw_spin_lock(&table->lock);
> +
> + if (WARN_ON(table->count >= apf_tasks))
> + goto unlock;
> +
> + for (i = 0; i < apf_tasks; i++) {
> + if (!table->tasks[i].task) {
> + if (index == apf_tasks) {
> + ret = true;
> + index = i;
> + }
> + } else if (table->tasks[i].task == task) {
> + WARN_ON(table->tasks[i].token != token);
> + ret = false;
> + break;
> + }
> + }
> +
> + if (!ret)
> + goto unlock;
> +
> + task->thread.data = &table->tasks[index].wq;
> + set_tsk_thread_flag(task, TIF_ASYNC_PF);
> +
> + table->count++;
> + table->tasks[index].task = task;
> + table->tasks[index].token = token;
> +
> +unlock:
> + raw_spin_unlock(&table->lock);
> + return ret;
> +}
> +
> +static inline void kvm_async_pf_remove_one_task(struct kvm_apf_table *table,
> + unsigned int index)
> +{
> + clear_tsk_thread_flag(table->tasks[index].task, TIF_ASYNC_PF);
> + WRITE_ONCE(table->tasks[index].task->thread.data, NULL);
> +
> + table->count--;
> + table->tasks[index].task = NULL;
> + table->tasks[index].token = 0;
> +
> + swake_up_one(&table->tasks[index].wq);
> +}
> +
> +static bool kvm_async_pf_remove_task(unsigned int token)
> +{
> + struct kvm_apf_table *table = this_cpu_ptr(apf_tables);
> + unsigned int i;
> + bool ret = (token == UINT_MAX);
> +
> + raw_spin_lock(&table->lock);
> +
> + for (i = 0; i < apf_tasks; i++) {
> + if (!table->tasks[i].task)
> + continue;
> +
> + /* Wakeup all */
> + if (token == UINT_MAX) {
> + kvm_async_pf_remove_one_task(table, i);
> + continue;
> + }
> +
> + if (table->tasks[i].token == token) {
> + kvm_async_pf_remove_one_task(table, i);
> + ret = true;
> + break;
> + }
> + }
> +
> + raw_spin_unlock(&table->lock);
> +
> + return ret;
> +}
> +
> +static int kvm_async_pf_sdei_handler(unsigned int event,
> + struct pt_regs *regs,
> + void *arg)
> +{
> + unsigned int reason = __this_cpu_read(apf_data.reason);
> + unsigned int token = __this_cpu_read(apf_data.token);
> + bool ret;
> +
> + if (reason != KVM_PV_REASON_PAGE_NOT_PRESENT) {
> + pr_warn("%s: Bogus notification (%d, 0x%08x)\n",
> + __func__, reason, token);
> + return -EINVAL;
> + }
> +
> + ret = kvm_async_pf_add_task(current, token);
> + __this_cpu_write(apf_data.token, 0);
> + __this_cpu_write(apf_data.reason, 0);
> +
> + if (!ret)
> + return -ENOSPC;
> +
> + smp_send_reschedule(smp_processor_id());
> +
> + return 0;
> +}
> +
> +static irqreturn_t kvm_async_pf_irq_handler(int irq, void *dev_id)
> +{
> + unsigned int reason = __this_cpu_read(apf_data.reason);
> + unsigned int token = __this_cpu_read(apf_data.token);
> + struct arm_smccc_res res;
> +
> + if (reason != KVM_PV_REASON_PAGE_READY) {
> + pr_warn("%s: Bogus interrupt %d (%d, 0x%08x)\n",
> + __func__, irq, reason, token);

Spurious interrupt or bogus APF reason set? Could be both, I believe.

> + return IRQ_HANDLED;
> + }
> +
> + kvm_async_pf_remove_task(token);
> +
> + __this_cpu_write(apf_data.token, 0);
> + __this_cpu_write(apf_data.reason, 0);
> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ_ACK, &res);
> +
> + return IRQ_HANDLED;
> +}
> +
> +static int __init kvm_async_pf_available(char *arg)
> +{
> + async_pf_available = false;
> +
> + return 0;
> +}
> +early_param("no-kvmapf", kvm_async_pf_available);
> +
> +static void kvm_async_pf_disable(void)
> +{
> + struct arm_smccc_res res;
> + u32 enabled = __this_cpu_read(apf_data.enabled);
> +
> + if (!enabled)
> + return;
> +
> + /* Disable the functionality */
> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE,
> + 0, 0, &res);
> + if (res.a0 != SMCCC_RET_SUCCESS) {
> + pr_warn("%s: Error %ld to disable on CPU%d\n",
> + __func__, res.a0, smp_processor_id());
> + return;
> + }
> +
> + __this_cpu_write(apf_data.enabled, 0);
> +
> + pr_info("Async PF disabled on CPU%d\n", smp_processor_id());

Nitpicking: x86 uses

"setup async PF for cpu %d\n" and
"disable async PF for cpu %d\n"

which are not ideal maybe but in any case it would probably make sense
to be consistent across arches.


> +}
> +
> +static void kvm_async_pf_enable(void)
> +{
> + struct arm_smccc_res res;
> + u32 enabled = __this_cpu_read(apf_data.enabled);
> + u64 val = virt_to_phys(this_cpu_ptr(&apf_data));
> +
> + if (enabled)
> + return;
> +
> + val |= KVM_ASYNC_PF_ENABLED;
> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE,
> + (u32)val, (u32)(val >> 32), &res);
> + if (res.a0 != SMCCC_RET_SUCCESS) {
> + pr_warn("%s: Error %ld to enable CPU%d\n",
> + __func__, res.a0, smp_processor_id());
> + return;
> + }
> +
> + __this_cpu_write(apf_data.enabled, 1);
> +
> + pr_info("Async PF enabled on CPU%d\n", smp_processor_id());
> +}
> +
> +static void kvm_async_pf_cpu_disable(void *info)
> +{
> + disable_percpu_irq(apf_irq);
> + kvm_async_pf_disable();
> +}
> +
> +static void kvm_async_pf_cpu_enable(void *info)
> +{
> + enable_percpu_irq(apf_irq, IRQ_TYPE_LEVEL_HIGH);
> + kvm_async_pf_enable();
> +}
> +
> +static int kvm_async_pf_cpu_reboot_notify(struct notifier_block *nb,
> + unsigned long code,
> + void *unused)
> +{
> + if (code == SYS_RESTART) {
> + sdei_event_disable(apf_sdei_num);
> + sdei_event_unregister(apf_sdei_num);
> +
> + on_each_cpu(kvm_async_pf_cpu_disable, NULL, 1);
> + }
> +
> + return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block kvm_async_pf_cpu_reboot_nb = {
> + .notifier_call = kvm_async_pf_cpu_reboot_notify,
> +};
> +
> +static int kvm_async_pf_cpu_online(unsigned int cpu)
> +{
> + kvm_async_pf_cpu_enable(NULL);
> +
> + return 0;
> +}
> +
> +static int kvm_async_pf_cpu_offline(unsigned int cpu)
> +{
> + kvm_async_pf_cpu_disable(NULL);
> +
> + return 0;
> +}
> +
> +static int __init kvm_async_pf_check_version(void)
> +{
> + struct arm_smccc_res res;
> +
> + /*
> + * Check the version and v1.0.0 or higher version is required
> + * to support the functionality.
> + */
> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_VERSION, &res);
> + if (res.a0 != SMCCC_RET_SUCCESS) {
> + pr_warn("%s: Error %ld to get version\n",
> + __func__, res.a0);
> + return -EPERM;
> + }
> +
> + if ((res.a1 & 0xFFFFFFFFFF000000) ||
> + ((res.a1 & 0xFF0000) >> 16) < 0x1) {
> + pr_warn("%s: Invalid version (0x%016lx)\n",
> + __func__, res.a1);
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +static int __init kvm_async_pf_info(void)
> +{
> + struct arm_smccc_res res;
> +
> + /* Retrieve number of tokens */
> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_SLOTS, &res);
> + if (res.a0 != SMCCC_RET_SUCCESS) {
> + pr_warn("%s: Error %ld to get token number\n",
> + __func__, res.a0);
> + return -EPERM;
> + }
> +
> + apf_tasks = res.a1 * 2;
> +
> + /* Retrieve SDEI event number */
> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_SDEI, &res);
> + if (res.a0 != SMCCC_RET_SUCCESS) {
> + pr_warn("%s: Error %ld to get SDEI event number\n",
> + __func__, res.a0);
> + return -EPERM;
> + }
> +
> + apf_sdei_num = res.a1;
> +
> + /* Retrieve (PPI) interrupt number */
> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ, &res);
> + if (res.a0 != SMCCC_RET_SUCCESS) {
> + pr_warn("%s: Error %ld to get IRQ\n",
> + __func__, res.a0);
> + return -EPERM;
> + }
> +
> + apf_ppi_num = res.a1;
> +
> + return 0;
> +}
> +
> +static int __init kvm_async_pf_init(void)
> +{
> + struct kvm_apf_table *table;
> + size_t size;
> + int cpu, i, ret;
> +
> + if (!kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) ||
> + !async_pf_available)
> + return -EPERM;
> +
> + ret = kvm_async_pf_check_version();
> + if (ret)
> + return ret;
> +
> + ret = kvm_async_pf_info();
> + if (ret)
> + return ret;
> +
> + /* Allocate and initialize the sleeper table */
> + size = sizeof(struct kvm_apf_table) +
> + apf_tasks * sizeof(struct kvm_apf_task);
> + apf_tables = __alloc_percpu(size, 0);
> + if (!apf_tables) {
> + pr_warn("%s: Unable to alloc async PF table\n",
> + __func__);
> + return -ENOMEM;
> + }
> +
> + for_each_possible_cpu(cpu) {
> + table = per_cpu_ptr(apf_tables, cpu);
> + raw_spin_lock_init(&table->lock);
> + for (i = 0; i < apf_tasks; i++)
> + init_swait_queue_head(&table->tasks[i].wq);
> + }
> +
> + /*
> + * Initialize SDEI event for page-not-present notification.
> + * The SDEI event number should have been retrieved from
> + * the host.
> + */
> + ret = sdei_event_register(apf_sdei_num,
> + kvm_async_pf_sdei_handler, NULL);
> + if (ret) {
> + pr_warn("%s: Error %d to register SDEI event\n",
> + __func__, ret);
> + ret = -EIO;
> + goto release_tables;
> + }
> +
> + ret = sdei_event_enable(apf_sdei_num);
> + if (ret) {
> + pr_warn("%s: Error %d to enable SDEI event\n",
> + __func__, ret);
> + goto unregister_event;
> + }
> +
> + /*
> + * Initialize interrupt for page-ready notification. The
> + * interrupt number and its properties should have been
> + * retrieved from the ACPI:APFT table.
> + */
> + apf_irq = acpi_register_gsi(NULL, apf_ppi_num,
> + ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_HIGH);
> + if (apf_irq <= 0) {
> + ret = -EIO;
> + pr_warn("%s: Error %d to register IRQ\n",
> + __func__, apf_irq);
> + goto disable_event;
> + }
> +
> + ret = request_percpu_irq(apf_irq, kvm_async_pf_irq_handler,
> + "Asynchronous Page Fault", &apf_data);
> + if (ret) {
> + pr_warn("%s: Error %d to request IRQ\n",
> + __func__, ret);
> + goto unregister_irq;
> + }
> +
> + register_reboot_notifier(&kvm_async_pf_cpu_reboot_nb);
> + ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> + "arm/kvm:online", kvm_async_pf_cpu_online,
> + kvm_async_pf_cpu_offline);
> + if (ret < 0) {
> + pr_warn("%s: Error %d to install cpu hotplug callbacks\n",
> + __func__, ret);
> + goto release_irq;
> + }
> +
> + /* Enable async PF on the online CPUs */
> + on_each_cpu(kvm_async_pf_cpu_enable, NULL, 1);
> +
> + return 0;
> +
> +release_irq:
> + free_percpu_irq(apf_irq, &apf_data);
> +unregister_irq:
> + acpi_unregister_gsi(apf_ppi_num);
> +disable_event:
> + sdei_event_disable(apf_sdei_num);
> +unregister_event:
> + sdei_event_unregister(apf_sdei_num);
> +release_tables:
> + free_percpu(apf_tables);
> +
> + return ret;
> +}
> +
> +static int __init kvm_guest_init(void)
> +{
> + return kvm_async_pf_init();
> +}
> +
> +fs_initcall(kvm_guest_init);

--
Vitaly

2021-08-17 10:51:11

by Gavin Shan

Subject: Re: [PATCH v4 14/15] arm64: Enable async PF

Hi Vitaly,

On 8/17/21 3:05 AM, Vitaly Kuznetsov wrote:
> Gavin Shan <[email protected]> writes:
>
>> This enables asynchronous page fault from guest side. The design
>> is highlighted as below:
>>
>> * The per-vCPU shared memory region, which is represented by
>> "struct kvm_vcpu_pv_apf_data", is allocated. The reason and
>> token associated with the received notifications of asynchronous
>> page fault are delivered through it.
>>
>> * A per-vCPU table, which is represented by "struct kvm_apf_table",
>> is allocated. The process, on which the page-not-present notification
>> is received, is added into the table so that it can reschedule
>> itself on switching from kernel to user mode. Afterwards, the
>> process, identified by token, is removed from the table and put
>> into runnable state when page-ready notification is received.
>>
>> * During CPU hotplug, the (private) SDEI event is expected to be
>> enabled or disabled on the affected CPU by SDEI client driver.
>> The (PPI) interrupt is enabled or disabled on the affected CPU
>> by ourself. When the system is going to reboot, the SDEI event
>> is disabled and unregistered and the (PPI) interrupt is disabled.
>>
>> * The SDEI event and (PPI) interrupt number are retrieved from host
>> through SMCCC interface. Besides, the version of the asynchronous
>> page fault is validated when the feature is enabled on the guest.
>>
>> * The feature is disabled on guest when boot parameter "no-kvmapf"
>> is specified.
>
> Documentation/admin-guide/kernel-parameters.txt states this one is
> x86-only:
>
> no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
> fault handling.
>
> makes sense to update in this patch I believe.
>

Yes, I will update in next revision.

>>
>> Signed-off-by: Gavin Shan <[email protected]>
>> ---
>> arch/arm64/kernel/Makefile | 1 +
>> arch/arm64/kernel/kvm.c | 452 +++++++++++++++++++++++++++++++++++++
>> 2 files changed, 453 insertions(+)
>> create mode 100644 arch/arm64/kernel/kvm.c
>>
>> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
>> index 3f1490bfb938..f0c1a6a7eaa7 100644
>> --- a/arch/arm64/kernel/Makefile
>> +++ b/arch/arm64/kernel/Makefile
>> @@ -59,6 +59,7 @@ obj-$(CONFIG_ACPI) += acpi.o
>> obj-$(CONFIG_ACPI_NUMA) += acpi_numa.o
>> obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL) += acpi_parking_protocol.o
>> obj-$(CONFIG_PARAVIRT) += paravirt.o
>> +obj-$(CONFIG_KVM_GUEST) += kvm.o
>> obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
>> obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
>> obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o \
>> diff --git a/arch/arm64/kernel/kvm.c b/arch/arm64/kernel/kvm.c
>> new file mode 100644
>> index 000000000000..effe8dc7e921
>> --- /dev/null
>> +++ b/arch/arm64/kernel/kvm.c
>> @@ -0,0 +1,452 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Asynchronous page fault support.
>> + *
>> + * Copyright (C) 2021 Red Hat, Inc.
>> + *
>> + * Author(s): Gavin Shan <[email protected]>
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/slab.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/irq.h>
>> +#include <linux/of.h>
>> +#include <linux/of_fdt.h>
>> +#include <linux/arm-smccc.h>
>> +#include <linux/kvm_para.h>
>> +#include <linux/arm_sdei.h>
>> +#include <linux/acpi.h>
>> +#include <linux/cpuhotplug.h>
>> +#include <linux/reboot.h>
>> +
>> +struct kvm_apf_task {
>> + unsigned int token;
>> + struct task_struct *task;
>> + struct swait_queue_head wq;
>> +};
>> +
>> +struct kvm_apf_table {
>> + raw_spinlock_t lock;
>> + unsigned int count;
>> + struct kvm_apf_task tasks[0];
>> +};
>> +
>> +static bool async_pf_available = true;
>> +static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_data) __aligned(64);
>> +static struct kvm_apf_table __percpu *apf_tables;
>> +static unsigned int apf_tasks;
>> +static unsigned int apf_sdei_num;
>> +static unsigned int apf_ppi_num;
>> +static int apf_irq;
>> +
>> +static bool kvm_async_pf_add_task(struct task_struct *task,
>> + unsigned int token)
>> +{
>> + struct kvm_apf_table *table = this_cpu_ptr(apf_tables);
>> + unsigned int i, index = apf_tasks;
>> + bool ret = false;
>> +
>> + raw_spin_lock(&table->lock);
>> +
>> + if (WARN_ON(table->count >= apf_tasks))
>> + goto unlock;
>> +
>> + for (i = 0; i < apf_tasks; i++) {
>> + if (!table->tasks[i].task) {
>> + if (index == apf_tasks) {
>> + ret = true;
>> + index = i;
>> + }
>> + } else if (table->tasks[i].task == task) {
>> + WARN_ON(table->tasks[i].token != token);
>> + ret = false;
>> + break;
>> + }
>> + }
>> +
>> + if (!ret)
>> + goto unlock;
>> +
>> + task->thread.data = &table->tasks[index].wq;
>> + set_tsk_thread_flag(task, TIF_ASYNC_PF);
>> +
>> + table->count++;
>> + table->tasks[index].task = task;
>> + table->tasks[index].token = token;
>> +
>> +unlock:
>> + raw_spin_unlock(&table->lock);
>> + return ret;
>> +}
>> +
>> +static inline void kvm_async_pf_remove_one_task(struct kvm_apf_table *table,
>> + unsigned int index)
>> +{
>> + clear_tsk_thread_flag(table->tasks[index].task, TIF_ASYNC_PF);
>> + WRITE_ONCE(table->tasks[index].task->thread.data, NULL);
>> +
>> + table->count--;
>> + table->tasks[index].task = NULL;
>> + table->tasks[index].token = 0;
>> +
>> + swake_up_one(&table->tasks[index].wq);
>> +}
>> +
>> +static bool kvm_async_pf_remove_task(unsigned int token)
>> +{
>> + struct kvm_apf_table *table = this_cpu_ptr(apf_tables);
>> + unsigned int i;
>> + bool ret = (token == UINT_MAX);
>> +
>> + raw_spin_lock(&table->lock);
>> +
>> + for (i = 0; i < apf_tasks; i++) {
>> + if (!table->tasks[i].task)
>> + continue;
>> +
>> + /* Wakeup all */
>> + if (token == UINT_MAX) {
>> + kvm_async_pf_remove_one_task(table, i);
>> + continue;
>> + }
>> +
>> + if (table->tasks[i].token == token) {
>> + kvm_async_pf_remove_one_task(table, i);
>> + ret = true;
>> + break;
>> + }
>> + }
>> +
>> + raw_spin_unlock(&table->lock);
>> +
>> + return ret;
>> +}
>> +
>> +static int kvm_async_pf_sdei_handler(unsigned int event,
>> + struct pt_regs *regs,
>> + void *arg)
>> +{
>> + unsigned int reason = __this_cpu_read(apf_data.reason);
>> + unsigned int token = __this_cpu_read(apf_data.token);
>> + bool ret;
>> +
>> + if (reason != KVM_PV_REASON_PAGE_NOT_PRESENT) {
>> + pr_warn("%s: Bogus notification (%d, 0x%08x)\n",
>> + __func__, reason, token);
>> + return -EINVAL;
>> + }
>> +
>> + ret = kvm_async_pf_add_task(current, token);
>> + __this_cpu_write(apf_data.token, 0);
>> + __this_cpu_write(apf_data.reason, 0);
>> +
>> + if (!ret)
>> + return -ENOSPC;
>> +
>> + smp_send_reschedule(smp_processor_id());
>> +
>> + return 0;
>> +}
>> +
>> +static irqreturn_t kvm_async_pf_irq_handler(int irq, void *dev_id)
>> +{
>> + unsigned int reason = __this_cpu_read(apf_data.reason);
>> + unsigned int token = __this_cpu_read(apf_data.token);
>> + struct arm_smccc_res res;
>> +
>> + if (reason != KVM_PV_REASON_PAGE_READY) {
>> + pr_warn("%s: Bogus interrupt %d (%d, 0x%08x)\n",
>> + __func__, irq, reason, token);
>
> Spurrious interrupt or bogus APF reason set? Could be both I belive.
>

Yes, it could be both, and the message can be more specific, like below:

pr_warn("%s: Wrong interrupt (%d) or state (%d 0x%08x) received\n",
__func__, irq, reason, token);

>> + return IRQ_HANDLED;
>> + }
>> +
>> + kvm_async_pf_remove_task(token);
>> +
>> + __this_cpu_write(apf_data.token, 0);
>> + __this_cpu_write(apf_data.reason, 0);
>> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
>> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ_ACK, &res);
>> +
>> + return IRQ_HANDLED;
>> +}
>> +
>> +static int __init kvm_async_pf_available(char *arg)
>> +{
>> + async_pf_available = false;
>> +
>> + return 0;
>> +}
>> +early_param("no-kvmapf", kvm_async_pf_available);
>> +
>> +static void kvm_async_pf_disable(void)
>> +{
>> + struct arm_smccc_res res;
>> + u32 enabled = __this_cpu_read(apf_data.enabled);
>> +
>> + if (!enabled)
>> + return;
>> +
>> + /* Disable the functionality */
>> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
>> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE,
>> + 0, 0, &res);
>> + if (res.a0 != SMCCC_RET_SUCCESS) {
>> + pr_warn("%s: Error %ld to disable on CPU%d\n",
>> + __func__, res.a0, smp_processor_id());
>> + return;
>> + }
>> +
>> + __this_cpu_write(apf_data.enabled, 0);
>> +
>> + pr_info("Async PF disabled on CPU%d\n", smp_processor_id());
>
> Nitpicking: x86 uses
>
> "setup async PF for cpu %d\n" and
> "disable async PF for cpu %d\n"
>
> which are not ideal maybe but in any case it would probably make sense
> to be consistent across arches.
>

Yes, it's worthwhile to do so :)

>
>> +}
>> +
>> +static void kvm_async_pf_enable(void)
>> +{
>> + struct arm_smccc_res res;
>> + u32 enabled = __this_cpu_read(apf_data.enabled);
>> + u64 val = virt_to_phys(this_cpu_ptr(&apf_data));
>> +
>> + if (enabled)
>> + return;
>> +
>> + val |= KVM_ASYNC_PF_ENABLED;
>> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
>> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE,
>> + (u32)val, (u32)(val >> 32), &res);
>> + if (res.a0 != SMCCC_RET_SUCCESS) {
>> + pr_warn("%s: Error %ld to enable CPU%d\n",
>> + __func__, res.a0, smp_processor_id());
>> + return;
>> + }
>> +
>> + __this_cpu_write(apf_data.enabled, 1);
>> +
>> + pr_info("Async PF enabled on CPU%d\n", smp_processor_id());
>> +}
>> +
>> +static void kvm_async_pf_cpu_disable(void *info)
>> +{
>> + disable_percpu_irq(apf_irq);
>> + kvm_async_pf_disable();
>> +}
>> +
>> +static void kvm_async_pf_cpu_enable(void *info)
>> +{
>> + enable_percpu_irq(apf_irq, IRQ_TYPE_LEVEL_HIGH);
>> + kvm_async_pf_enable();
>> +}
>> +
>> +static int kvm_async_pf_cpu_reboot_notify(struct notifier_block *nb,
>> + unsigned long code,
>> + void *unused)
>> +{
>> + if (code == SYS_RESTART) {
>> + sdei_event_disable(apf_sdei_num);
>> + sdei_event_unregister(apf_sdei_num);
>> +
>> + on_each_cpu(kvm_async_pf_cpu_disable, NULL, 1);
>> + }
>> +
>> + return NOTIFY_DONE;
>> +}
>> +
>> +static struct notifier_block kvm_async_pf_cpu_reboot_nb = {
>> + .notifier_call = kvm_async_pf_cpu_reboot_notify,
>> +};
>> +
>> +static int kvm_async_pf_cpu_online(unsigned int cpu)
>> +{
>> + kvm_async_pf_cpu_enable(NULL);
>> +
>> + return 0;
>> +}
>> +
>> +static int kvm_async_pf_cpu_offline(unsigned int cpu)
>> +{
>> + kvm_async_pf_cpu_disable(NULL);
>> +
>> + return 0;
>> +}
>> +
>> +static int __init kvm_async_pf_check_version(void)
>> +{
>> + struct arm_smccc_res res;
>> +
>> + /*
>> + * Check the version and v1.0.0 or higher version is required
>> + * to support the functionality.
>> + */
>> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
>> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_VERSION, &res);
>> + if (res.a0 != SMCCC_RET_SUCCESS) {
>> + pr_warn("%s: Error %ld to get version\n",
>> + __func__, res.a0);
>> + return -EPERM;
>> + }
>> +
>> + if ((res.a1 & 0xFFFFFFFFFF000000) ||
>> + ((res.a1 & 0xFF0000) >> 16) < 0x1) {
>> + pr_warn("%s: Invalid version (0x%016lx)\n",
>> + __func__, res.a1);
>> + return -EINVAL;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int __init kvm_async_pf_info(void)
>> +{
>> + struct arm_smccc_res res;
>> +
>> + /* Retrieve number of tokens */
>> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
>> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_SLOTS, &res);
>> + if (res.a0 != SMCCC_RET_SUCCESS) {
>> + pr_warn("%s: Error %ld to get token number\n",
>> + __func__, res.a0);
>> + return -EPERM;
>> + }
>> +
>> + apf_tasks = res.a1 * 2;
>> +
>> + /* Retrieve SDEI event number */
>> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
>> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_SDEI, &res);
>> + if (res.a0 != SMCCC_RET_SUCCESS) {
>> + pr_warn("%s: Error %ld to get SDEI event number\n",
>> + __func__, res.a0);
>> + return -EPERM;
>> + }
>> +
>> + apf_sdei_num = res.a1;
>> +
>> + /* Retrieve (PPI) interrupt number */
>> + arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
>> + ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ, &res);
>> + if (res.a0 != SMCCC_RET_SUCCESS) {
>> + pr_warn("%s: Error %ld to get IRQ\n",
>> + __func__, res.a0);
>> + return -EPERM;
>> + }
>> +
>> + apf_ppi_num = res.a1;
>> +
>> + return 0;
>> +}
>> +
>> +static int __init kvm_async_pf_init(void)
>> +{
>> + struct kvm_apf_table *table;
>> + size_t size;
>> + int cpu, i, ret;
>> +
>> + if (!kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) ||
>> + !async_pf_available)
>> + return -EPERM;
>> +
>> + ret = kvm_async_pf_check_version();
>> + if (ret)
>> + return ret;
>> +
>> + ret = kvm_async_pf_info();
>> + if (ret)
>> + return ret;
>> +
>> + /* Allocate and initialize the sleeper table */
>> + size = sizeof(struct kvm_apf_table) +
>> + apf_tasks * sizeof(struct kvm_apf_task);
>> + apf_tables = __alloc_percpu(size, 0);
>> + if (!apf_tables) {
>> + pr_warn("%s: Unable to alloc async PF table\n",
>> + __func__);
>> + return -ENOMEM;
>> + }
>> +
>> + for_each_possible_cpu(cpu) {
>> + table = per_cpu_ptr(apf_tables, cpu);
>> + raw_spin_lock_init(&table->lock);
>> + for (i = 0; i < apf_tasks; i++)
>> + init_swait_queue_head(&table->tasks[i].wq);
>> + }
>> +
>> + /*
>> + * Initialize SDEI event for page-not-present notification.
>> + * The SDEI event number should have been retrieved from
>> + * the host.
>> + */
>> + ret = sdei_event_register(apf_sdei_num,
>> + kvm_async_pf_sdei_handler, NULL);
>> + if (ret) {
>> + pr_warn("%s: Error %d to register SDEI event\n",
>> + __func__, ret);
>> + ret = -EIO;
>> + goto release_tables;
>> + }
>> +
>> + ret = sdei_event_enable(apf_sdei_num);
>> + if (ret) {
>> + pr_warn("%s: Error %d to enable SDEI event\n",
>> + __func__, ret);
>> + goto unregister_event;
>> + }
>> +
>> + /*
>> + * Initialize interrupt for page-ready notification. The
>> + * interrupt number and its properties should have been
>> + * retrieved from the ACPI:APFT table.
>> + */
>> + apf_irq = acpi_register_gsi(NULL, apf_ppi_num,
>> + ACPI_LEVEL_SENSITIVE, ACPI_ACTIVE_HIGH);
>> + if (apf_irq <= 0) {
>> + ret = -EIO;
>> + pr_warn("%s: Error %d to register IRQ\n",
>> + __func__, apf_irq);
>> + goto disable_event;
>> + }
>> +
>> + ret = request_percpu_irq(apf_irq, kvm_async_pf_irq_handler,
>> + "Asynchronous Page Fault", &apf_data);
>> + if (ret) {
>> + pr_warn("%s: Error %d to request IRQ\n",
>> + __func__, ret);
>> + goto unregister_irq;
>> + }
>> +
>> + register_reboot_notifier(&kvm_async_pf_cpu_reboot_nb);
>> + ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>> + "arm/kvm:online", kvm_async_pf_cpu_online,
>> + kvm_async_pf_cpu_offline);
>> + if (ret < 0) {
>> + pr_warn("%s: Error %d to install cpu hotplug callbacks\n",
>> + __func__, ret);
>> + goto release_irq;
>> + }
>> +
>> + /* Enable async PF on the online CPUs */
>> + on_each_cpu(kvm_async_pf_cpu_enable, NULL, 1);
>> +
>> + return 0;
>> +
>> +release_irq:
>> + free_percpu_irq(apf_irq, &apf_data);
>> +unregister_irq:
>> + acpi_unregister_gsi(apf_ppi_num);
>> +disable_event:
>> + sdei_event_disable(apf_sdei_num);
>> +unregister_event:
>> + sdei_event_unregister(apf_sdei_num);
>> +release_tables:
>> + free_percpu(apf_tables);
>> +
>> + return ret;
>> +}
>> +
>> +static int __init kvm_guest_init(void)
>> +{
>> + return kvm_async_pf_init();
>> +}
>> +
>> +fs_initcall(kvm_guest_init);
>

Thanks,
Gavin

2021-11-11 10:39:56

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v4 15/15] KVM: arm64: Add async PF document

Hi Gavin,

On 8/15/21 2:59 AM, Gavin Shan wrote:
> This adds document to explain the interface for asynchronous page
> fault and how it works in general.
>
> Signed-off-by: Gavin Shan <[email protected]>
> ---
> Documentation/virt/kvm/arm/apf.rst | 143 +++++++++++++++++++++++++++
> Documentation/virt/kvm/arm/index.rst | 1 +
> 2 files changed, 144 insertions(+)
> create mode 100644 Documentation/virt/kvm/arm/apf.rst
>
> diff --git a/Documentation/virt/kvm/arm/apf.rst b/Documentation/virt/kvm/arm/apf.rst
> new file mode 100644
> index 000000000000..4f5c01b6699f
> --- /dev/null
> +++ b/Documentation/virt/kvm/arm/apf.rst
> @@ -0,0 +1,143 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Asynchronous Page Fault Support for arm64
> +=========================================
> +
> +There are two stages of page faults when KVM module is enabled as accelerator
> +to the guest. The guest is responsible for handling the stage-1 page faults,
> +while the host handles the stage-2 page faults. During the period of handling
> +the stage-2 page faults, the guest is suspended until the requested page is
> +ready. It could take several milliseconds, even hundreds of milliseconds in
s/It could take several milliseconds, even hundreds of milliseconds/It
could take up to hundreds of milliseconds
> +extreme situations because I/O might be required to move the requested page
> +from disk to DRAM. The guest does not do any work when it is suspended. The
> +feature (Asynchronous Page Fault) is introduced to take advantage of the
s/The feature (Asynchronous Page Fault)/ The Asynchronous Page Fault
feature improves the overall performance by allowing the guest
to reschedule ... ?
> +suspending period and to improve the overall performance.
> +
> +There are two paths in order to fulfil the asynchronous page fault, called
> +as control path and data path.
The asynchronous page fault is implemented upon a control path and a
data path?
The control path allows the VMM or guest to
> +configure the functionality, while the notifications are delivered in data
> +path. The notifications are classified into page-not-present and page-ready
> +notifications.
> +
> +Data Path
> +---------
> +
> +There are two types of notifications delivered from host to guest in the
> +data path: page-not-present and page-ready notification. They are delivered
> +through SDEI event and (PPI) interrupt separately.
s/separately/respectively
Besides, there is a shared
> +buffer between host and guest to indicate the reason and sequential token,
s/to indicate/that indicates
Can you clarify 'reason'?
Also a sequential token is used ...
> +which is used to identify the asynchronous page fault. The reason and token
> +resident in the shared buffer is written by host, read and cleared by guest.
s/is/are
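
IIUC the shared buffer mentioned here is the per-CPU apf_data structure
from the guest patches. Roughly (a sketch only; the struct name and exact
layout are assumptions, the field names come from the guest code earlier
in the series):

    struct kvm_apf_data {
            u32     reason;         /* written by host, cleared by guest */
            u32     token;          /* identifies the async page fault */
            u32     enabled;        /* guest-side enable flag */
    };
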
> +An asynchronous page fault is delivered and completed as below.
> +
> +(1) When an asynchronous page fault starts, a (workqueue) worker is created
> + and queued to the vCPU's pending queue. The worker makes the requested
> + page ready and resident to DRAM in the background. The shared buffer is
> + updated with reason and sequential token. After that, SDEI event is sent
> + to guest as page-not-present notification.
This gives the impression the SDEI event is sent after the worker
completes the job. I think you should rephrase.
> +
> +(2) When the SDEI event is received on guest, the current process is tagged
> + with TIF_ASYNC_PF and associated with a wait queue. The process is ready
> + to keep rescheduling itself on switching from kernel to user mode. After
The above sentence sounds a bit cryptic to me: ~ it waits to be
rescheduled later?
> + that, a reschedule IPI is sent to current CPU and the received SDEI event
> + is acknowledged. Note that the IPI is delivered when the acknowledgment
> + on the SDEI event is received on host.
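
To illustrate, the guest side of this step presumably looks like below
(a sketch only; kvm_async_pf_add_task() and the reason constant are
assumptions, mirroring kvm_async_pf_remove_task() used by the interrupt
handler):

    static int kvm_async_pf_sdei_handler(u32 event, struct pt_regs *regs,
                                         void *arg)
    {
            u32 reason = __this_cpu_read(apf_data.reason);
            u32 token = __this_cpu_read(apf_data.token);

            /* Only page-not-present is delivered through the SDEI event */
            if (reason != KVM_PV_REASON_PAGE_NOT_PRESENT)
                    return -EINVAL;

            /* Park the current process until its page becomes ready */
            kvm_async_pf_add_task(current, token);
            set_tsk_thread_flag(current, TIF_ASYNC_PF);

            __this_cpu_write(apf_data.token, 0);
            __this_cpu_write(apf_data.reason, 0);

            /* The IPI causes a context switch once the event is acked */
            smp_send_reschedule(smp_processor_id());

            return 0;
    }
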
> +
> +(3) On the host, the worker is dequeued from the vCPU's pending queue and
> + enqueued to its completion queue when the requested page becomes ready.
> + In the mean while, KVM_REQ_ASYNC_PF request is sent the vCPU if the
s/In the mean while/In the meantime/ (here and below)
> + worker is the first element enqueued to the completion queue.
I think you should restate the intent of this KVM_REQ_ASYNC_PF
request, i.e. to notify that the page is ready.
> +
> +(4) With pending KVM_REQ_ASYNC_PF request, the first worker in the completion
> + queue is dequeued and destroyed. In the mean while, a (PPI) interrupt is

> + sent to guest with updated reason and token in the shared buffer.
> +
> +(5) When the (PPI) interrupt is received on guest, the affected process is
> + located using the token and waken up after its TIF_ASYNC_PF tag is cleared.
> + After that, the interrupt is acknowledged through SMCCC interface. The
> + workers in the completion queue is dequeued and destroyed if any workers
the worker
Isn't it destroyed even if no other workers exist?
> + exist, and another (PPI) interrupt is sent to the guest.
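
The acknowledgment at the end of this step is the SMCCC call already
visible in the guest interrupt handler quoted earlier in the thread:

    arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
                         ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ_ACK, &res);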

I think you should briefly recall the motivation for the SDEI and PPI
mechanisms used for the two notifications, maybe by drawing an analogy
with the x86 implementation?
> +
> +Control Path
> +------------
> +
> +The configurations are passed through SMCCC or ioctl interface. The SDEI
> +event and (PPI) interrupt are owned by VMM, so the SDEI event and interrupt
> +numbers are configured through ioctl command on per-vCPU basis.
The "owned" terminology looks weird here. Do you mean the SDEI event
number and the PPI ID are defined by the VMM userspace?

Besides,
> +the functionality might be enabled and configured through ioctl interface
> +by VMM during migration:
> +
> + * KVM_ARM_ASYNC_PF_CMD_GET_VERSION
> +
> + Returns the current version of the feature, supported by the host. It is
> + made up of major, minor and revision fields. Each field is one byte in
> + length.
> +
> + * KVM_ARM_ASYNC_PF_CMD_GET_SDEI:
> +
> + Retrieve the SDEI event number, used for page-not-present notification,
> + so that it can be configured on destination VM in the scenario of
> + migration.
> +
> + * KVM_ARM_ASYNC_PF_GET_IRQ:
> +
> + Retrieve the IRQ (PPI) number, used for page-ready notification, so that
> + it can be configured on destination VM in the scenario of migration.
> +
> + * KVM_ARM_ASYNC_PF_CMD_GET_CONTROL
> +
> + Retrieve the address of control block, so that it can be configured on
> + destination VM in the scenario of migration.
> +
> + * KVM_ARM_ASYNC_PF_CMD_SET_SDEI:
> +
> + Used by VMM to configure number of SDEI event, which is used to deliver
> + page-not-present notification by host. This is used when VM is started
> + or migrated.
> +
> + * KVM_ARM_ASYNC_PF_CMD_SET_IRQ
> +
> + Used by VMM to configure number of (PPI) interrupt, which is used to
> + deliver page-ready notification by host. This is used when VM is started
> + or migrated.
> +
> + * KVM_ARM_ASYNC_PF_CMD_SET_CONTROL
> +
> + Set the control block on the destination VM in the scenario of migration.
What is the size of this control block?
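
To make the migration flow concrete, the VMM would presumably do
something like below on each vCPU (a rough sketch; the ioctl plumbing
isn't shown in this document, so the async_pf_cmd_get/set() wrappers
are hypothetical):

    /* On the source VM: save the per-vCPU async PF state */
    sdei    = async_pf_cmd_get(vcpu, KVM_ARM_ASYNC_PF_CMD_GET_SDEI);
    irq     = async_pf_cmd_get(vcpu, KVM_ARM_ASYNC_PF_GET_IRQ);
    control = async_pf_cmd_get(vcpu, KVM_ARM_ASYNC_PF_CMD_GET_CONTROL);

    /* On the destination VM: restore it before resuming the guest */
    async_pf_cmd_set(vcpu, KVM_ARM_ASYNC_PF_CMD_SET_SDEI, sdei);
    async_pf_cmd_set(vcpu, KVM_ARM_ASYNC_PF_CMD_SET_IRQ, irq);
    async_pf_cmd_set(vcpu, KVM_ARM_ASYNC_PF_CMD_SET_CONTROL, control);
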
> +
> +The other configurations are passed through SMCCC interface. The host exports
> +the capability through KVM vendor specific service, which is identified by
> +ARM_SMCCC_KVM_FUNC_ASYNC_PF_FUNC_ID. There are several functions defined for
> +this:
> +
> + * ARM_SMCCC_KVM_FUNC_ASYNC_PF_VERSION
> +
> + Returns the current version of the feature, supported by the host. It is
> + made up of major, minor and revision fields. Each field is one byte in
> + length.
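
This matches the check in kvm_async_pf_check_version() earlier in the
series, which reads the major version from bits [23:16]. The returned
value presumably decodes as below (a sketch; the minor/revision
positions are inferred from the one-byte-per-field layout):

    u8 major    = (res.a1 >> 16) & 0xff;
    u8 minor    = (res.a1 >> 8) & 0xff;
    u8 revision = res.a1 & 0xff;
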
> +
> + * ARM_SMCCC_KVM_FUNC_ASYNC_PF_SLOTS
> +
> + Returns the size of the hashed GFN table. It is used by guest to set up
by the guest
> + the capacity of waiting process table.
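
On the guest side this is consumed by kvm_async_pf_info(), which sizes
the waiting-task table at twice the reported slot count:

    apf_tasks = res.a1 * 2;
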
> +
> + * ARM_SMCCC_KVM_FUNC_ASYNC_PF_SDEI
> + * ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ
> +
> + Used by the guest to retrieve the SDEI event and (PPI) interrupt number
> + that are configured by VMM.
How does the guest recognize which SDEI event number it shall register?
Same question for the PPI. What if we were to expose several SDEI
events to the guest?
> +
> + * ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE
> +
> + Used by the guest to enable or disable the feature on the specific vCPU.
> + The argument is made up of shared buffer and flags. The shared buffer
> + is written by host to indicate the reason about the delivered asynchronous
> + page fault and token (sequence number) to identify that. There are two
> + flags are supported: KVM_ASYNC_PF_ENABLED is used to enable or disable
> + the feature. KVM_ASYNC_PF_SEND_ALWAYS allows to deliver page-not-present
> + notification regardless of the guest's state. Otherwise, the notification
> + is delivered only when the guest is in user mode.
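
As in kvm_async_pf_enable() from the guest patch, the argument is the
physical address of the shared buffer with the flags ORed into it,
split across two 32-bit arguments:

    u64 val = virt_to_phys(this_cpu_ptr(&apf_data)) | KVM_ASYNC_PF_ENABLED;

    arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_ASYNC_PF_FUNC_ID,
                         ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE,
                         (u32)val, (u32)(val >> 32), &res);
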
> +
> + * ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ_ACK

How does it compare to x86? I mean, there is a huge number of IOCTLs
and SMCCC calls to achieve the functionality. Was the x86 implementation
as invasive as this one?

Thanks

Eric
> +
> + Used by the guest to acknowledge the completion of page-ready notification.
> diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
> index 78a9b670aafe..f43b5fe25f61 100644
> --- a/Documentation/virt/kvm/arm/index.rst
> +++ b/Documentation/virt/kvm/arm/index.rst
> @@ -7,6 +7,7 @@ ARM
> .. toctree::
> :maxdepth: 2
>
> + apf
> hyp-abi
> psci
> pvtime
>