2017-08-02 16:09:29

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

Changes since v9:
- Rebase to 4.13-rc3.
- Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
functional dependencies on this patch so the series can go through a different tree
(and it actually belongs to x86 if I got Ingo's comment right).
- Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
- A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
hyperv_flush_tlb_others() [Andy Shevchenko]
- Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
reported by kbuild test robot (#include <asm/io.h>)
- Add Steven's 'Reviewed-by:' to PATCH9.

Original description:

Hyper-V supports hypercalls for doing local and remote TLB flushing and
gives its guests hints about when using a hypercall is preferred. While doing
hypercalls for local TLB flushes is probably not practical (and is not
being suggested by modern Hyper-V versions), a remote TLB flush with a
hypercall brings a significant improvement.

To test the series I wrote a special 'TLB thrasher': on a 16 vCPU guest I
created 32 threads, each doing 100000 mmap/munmaps on some big file (a
sketch of such a test program follows the results below). Here are the results:

Before:
# time ./pthread_mmap ./randfile
real 3m33.118s
user 0m3.698s
sys 3m16.624s

After:
# time ./pthread_mmap ./randfile
real 2m19.920s
user 0m2.662s
sys 2m9.948s
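
The test program itself was not posted with the series; a minimal sketch of
such a thrasher (thread count and iteration count from the description above,
mapping size and error handling are assumptions) could look like this:

/*
 * Hypothetical 'TLB thrasher' sketch: NR_THREADS threads each perform
 * NR_ITERATIONS mmap/munmap cycles on the file given as argv[1]. Every
 * munmap() of a populated mapping forces remote TLB flushes on the other
 * vCPUs, which is exactly the path this series optimizes.
 * Build with: gcc -O2 -pthread pthread_mmap.c -o pthread_mmap
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define NR_THREADS	32
#define NR_ITERATIONS	100000
#define MAP_LEN		(16UL << 20)	/* assumed mapping size */

static int fd;

static void *thrash(void *unused)
{
	int i;

	for (i = 0; i < NR_ITERATIONS; i++) {
		char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		if (p == MAP_FAILED)
			exit(1);
		p[0] = 1;		/* populate at least one page */
		munmap(p, MAP_LEN);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t threads[NR_THREADS];
	int i;

	if (argc < 2 || (fd = open(argv[1], O_RDWR)) < 0)
		return 1;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, thrash, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}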

This series brings a number of small improvements along the way: a fast
hypercall implementation (used for event signaling), a rep hypercall
implementation, and a Hyper-V tracing subsystem (which only traces the newly
added remote TLB flush for now).

Vitaly Kuznetsov (9):
x86/hyper-v: include hyperv/ only when CONFIG_HYPERV is set
x86/hyper-v: make hv_do_hypercall() inline
x86/hyper-v: fast hypercall implementation
hyper-v: use fast hypercall for HVCALL_SIGNAL_EVENT
x86/hyper-v: implement rep hypercalls
hyper-v: globalize vp_index
x86/hyper-v: use hypercall for remote TLB flush
x86/hyper-v: support extended CPU ranges for TLB flush hypercalls
tracing/hyper-v: trace hyperv_mmu_flush_tlb_others()

MAINTAINERS | 1 +
arch/x86/Kbuild | 2 +-
arch/x86/hyperv/Makefile | 2 +-
arch/x86/hyperv/hv_init.c | 90 ++++++------
arch/x86/hyperv/mmu.c | 272 ++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/mshyperv.h | 147 ++++++++++++++++++-
arch/x86/include/asm/trace/hyperv.h | 40 ++++++
arch/x86/include/uapi/asm/hyperv.h | 17 +++
arch/x86/kernel/cpu/mshyperv.c | 1 +
drivers/hv/channel_mgmt.c | 20 +--
drivers/hv/connection.c | 7 +-
drivers/hv/hv.c | 9 --
drivers/hv/hyperv_vmbus.h | 11 --
drivers/hv/vmbus_drv.c | 17 ---
drivers/pci/host/pci-hyperv.c | 54 +------
include/linux/hyperv.h | 17 +--
16 files changed, 533 insertions(+), 174 deletions(-)
create mode 100644 arch/x86/hyperv/mmu.c
create mode 100644 arch/x86/include/asm/trace/hyperv.h

--
2.13.3


2017-08-02 16:09:32

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 1/9] x86/hyper-v: include hyperv/ only when CONFIG_HYPERV is set

Code in arch/x86/hyperv/ is only needed when CONFIG_HYPERV is set; the
'basic' support and detection lives in arch/x86/kernel/cpu/mshyperv.c,
which is included when CONFIG_HYPERVISOR_GUEST is set.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
---
arch/x86/Kbuild | 2 +-
arch/x86/include/asm/mshyperv.h | 7 ++++++-
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index 586b786b3edf..3e6f64073005 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -8,7 +8,7 @@ obj-$(CONFIG_KVM) += kvm/
obj-$(CONFIG_XEN) += xen/

# Hyper-V paravirtualization support
-obj-$(CONFIG_HYPERVISOR_GUEST) += hyperv/
+obj-$(subst m,y,$(CONFIG_HYPERV)) += hyperv/

# lguest paravirtualization support
obj-$(CONFIG_LGUEST_GUEST) += lguest/
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 2b58c8c1eeaa..baea2679a0aa 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -173,7 +173,12 @@ void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);
void hyperv_cleanup(void);
-#endif
+#else /* CONFIG_HYPERV */
+static inline void hyperv_init(void) {}
+static inline bool hv_is_hypercall_page_setup(void) { return false; }
+static inline void hyperv_cleanup(void) {}
+#endif /* CONFIG_HYPERV */
+
#ifdef CONFIG_HYPERV_TSCPAGE
struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
--
2.13.3

2017-08-02 16:09:37

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 2/9] x86/hyper-v: make hv_do_hypercall() inline

We have only three call sites for hv_do_hypercall() and we're going to
change HVCALL_SIGNAL_EVENT to a fast hypercall, so we can inline this
function for optimization.

The Hyper-V top level functional specification states that the r9-r11
registers and flags may be clobbered by the hypervisor during a hypercall;
with inlining this becomes important, so add the clobbers.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
---
arch/x86/hyperv/hv_init.c | 54 ++++-------------------------------------
arch/x86/include/asm/mshyperv.h | 40 ++++++++++++++++++++++++++++++
drivers/hv/connection.c | 2 ++
include/linux/hyperv.h | 1 -
4 files changed, 47 insertions(+), 50 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 5b882cc0c0e9..691603ee9179 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -75,7 +75,8 @@ static struct clocksource hyperv_cs_msr = {
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};

-static void *hypercall_pg;
+void *hv_hypercall_pg;
+EXPORT_SYMBOL_GPL(hv_hypercall_pg);
struct clocksource *hyperv_cs;
EXPORT_SYMBOL_GPL(hyperv_cs);

@@ -102,15 +103,15 @@ void hyperv_init(void)
guest_id = generate_guest_id(0, LINUX_VERSION_CODE, 0);
wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id);

- hypercall_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL_RX);
- if (hypercall_pg == NULL) {
+ hv_hypercall_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL_RX);
+ if (hv_hypercall_pg == NULL) {
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
return;
}

rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
hypercall_msr.enable = 1;
- hypercall_msr.guest_physical_address = vmalloc_to_pfn(hypercall_pg);
+ hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

/*
@@ -170,51 +171,6 @@ void hyperv_cleanup(void)
}
EXPORT_SYMBOL_GPL(hyperv_cleanup);

-/*
- * hv_do_hypercall- Invoke the specified hypercall
- */
-u64 hv_do_hypercall(u64 control, void *input, void *output)
-{
- u64 input_address = (input) ? virt_to_phys(input) : 0;
- u64 output_address = (output) ? virt_to_phys(output) : 0;
-#ifdef CONFIG_X86_64
- u64 hv_status = 0;
-
- if (!hypercall_pg)
- return (u64)ULLONG_MAX;
-
- __asm__ __volatile__("mov %0, %%r8" : : "r" (output_address) : "r8");
- __asm__ __volatile__("call *%3" : "=a" (hv_status) :
- "c" (control), "d" (input_address),
- "m" (hypercall_pg));
-
- return hv_status;
-
-#else
-
- u32 control_hi = control >> 32;
- u32 control_lo = control & 0xFFFFFFFF;
- u32 hv_status_hi = 1;
- u32 hv_status_lo = 1;
- u32 input_address_hi = input_address >> 32;
- u32 input_address_lo = input_address & 0xFFFFFFFF;
- u32 output_address_hi = output_address >> 32;
- u32 output_address_lo = output_address & 0xFFFFFFFF;
-
- if (!hypercall_pg)
- return (u64)ULLONG_MAX;
-
- __asm__ __volatile__ ("call *%8" : "=d"(hv_status_hi),
- "=a"(hv_status_lo) : "d" (control_hi),
- "a" (control_lo), "b" (input_address_hi),
- "c" (input_address_lo), "D"(output_address_hi),
- "S"(output_address_lo), "m" (hypercall_pg));
-
- return hv_status_lo | ((u64)hv_status_hi << 32);
-#endif /* !x86_64 */
-}
-EXPORT_SYMBOL_GPL(hv_do_hypercall);
-
void hyperv_report_panic(struct pt_regs *regs)
{
static bool panic_reported;
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index baea2679a0aa..6fa5e342cc86 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -3,6 +3,7 @@

#include <linux/types.h>
#include <linux/atomic.h>
+#include <asm/io.h>
#include <asm/hyperv.h>

/*
@@ -168,6 +169,45 @@ void hv_remove_crash_handler(void);

#if IS_ENABLED(CONFIG_HYPERV)
extern struct clocksource *hyperv_cs;
+extern void *hv_hypercall_pg;
+
+static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
+{
+ u64 input_address = input ? virt_to_phys(input) : 0;
+ u64 output_address = output ? virt_to_phys(output) : 0;
+ u64 hv_status;
+ register void *__sp asm(_ASM_SP);
+
+#ifdef CONFIG_X86_64
+ if (!hv_hypercall_pg)
+ return U64_MAX;
+
+ __asm__ __volatile__("mov %4, %%r8\n"
+ "call *%5"
+ : "=a" (hv_status), "+r" (__sp),
+ "+c" (control), "+d" (input_address)
+ : "r" (output_address), "m" (hv_hypercall_pg)
+ : "cc", "memory", "r8", "r9", "r10", "r11");
+#else
+ u32 input_address_hi = upper_32_bits(input_address);
+ u32 input_address_lo = lower_32_bits(input_address);
+ u32 output_address_hi = upper_32_bits(output_address);
+ u32 output_address_lo = lower_32_bits(output_address);
+
+ if (!hv_hypercall_pg)
+ return U64_MAX;
+
+ __asm__ __volatile__("call *%7"
+ : "=A" (hv_status),
+ "+c" (input_address_lo), "+r" (__sp)
+ : "A" (control),
+ "b" (input_address_hi),
+ "D"(output_address_hi), "S"(output_address_lo),
+ "m" (hv_hypercall_pg)
+ : "cc", "memory");
+#endif /* !x86_64 */
+ return hv_status;
+}

void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 59c11ff90d12..45e806e3112f 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -32,6 +32,8 @@
#include <linux/hyperv.h>
#include <linux/export.h>
#include <asm/hyperv.h>
+#include <asm/mshyperv.h>
+
#include "hyperv_vmbus.h"


diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index b7d7bbec74e0..6608a71e7d79 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1187,7 +1187,6 @@ int vmbus_allocate_mmio(struct resource **new, struct hv_device *device_obj,
bool fb_overlap_ok);
void vmbus_free_mmio(resource_size_t start, resource_size_t size);
int vmbus_cpu_number_to_vp_number(int cpu_number);
-u64 hv_do_hypercall(u64 control, void *input, void *output);

/*
* GUID definitions of various offer types - services offered to the guest.
--
2.13.3

2017-08-02 16:09:47

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 3/9] x86/hyper-v: fast hypercall implementation

Hyper-V supports 'fast' hypercalls when all parameters are passed through
registers. Implement an inline version of the simplest of these calls: a
hypercall with one 8-byte input and no output.
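
For illustration only (not part of this patch): a caller with a single 8-byte
argument would use the helper roughly as below; 'event_param' is a made-up
stand-in, and the next patch converts HvSignalEvent to exactly this pattern.

	u64 status;

	/* the control word (code | HV_HYPERCALL_FAST_BIT) is built inside */
	status = hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, event_param);

	/* the low 16 bits of the return value carry the HV_STATUS_* code */
	if ((status & 0xffff) != HV_STATUS_SUCCESS)
		pr_err("HvSignalEvent failed: 0x%llx\n", status);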

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
---
arch/x86/include/asm/mshyperv.h | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 6fa5e342cc86..e484255bd9de 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -209,6 +209,40 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
return hv_status;
}

+#define HV_HYPERCALL_FAST_BIT BIT(16)
+
+/* Fast hypercall with 8 bytes of input and no output */
+static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
+{
+ u64 hv_status, control = (u64)code | HV_HYPERCALL_FAST_BIT;
+ register void *__sp asm(_ASM_SP);
+
+#ifdef CONFIG_X86_64
+ {
+ __asm__ __volatile__("call *%4"
+ : "=a" (hv_status), "+r" (__sp),
+ "+c" (control), "+d" (input1)
+ : "m" (hv_hypercall_pg)
+ : "cc", "r8", "r9", "r10", "r11");
+ }
+#else
+ {
+ u32 input1_hi = upper_32_bits(input1);
+ u32 input1_lo = lower_32_bits(input1);
+
+ __asm__ __volatile__ ("call *%5"
+ : "=A"(hv_status),
+ "+c"(input1_lo),
+ "+r"(__sp)
+ : "A" (control),
+ "b" (input1_hi),
+ "m" (hv_hypercall_pg)
+ : "cc", "edi", "esi");
+ }
+#endif
+ return hv_status;
+}
+
void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);
--
2.13.3

2017-08-02 16:09:56

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 4/9] hyper-v: use fast hypercall for HVCALL_SIGNAL_EVENT

We need to pass only 8 bytes of input for HvSignalEvent, which makes it a
perfect fit for a fast hypercall. hv_input_signal_event_buffer is not needed
any more and the channel's sig_event becomes a plain u64 holding the
connection id.
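
As a sketch of what the 8-byte input now encodes (layout follows the removed
hv_input_signal_event struct; illustration only):

	/*
	 * channel->sig_event is the raw HvSignalEvent input, passed in a
	 * register by hv_do_fast_hypercall8():
	 *
	 *   bits  0-31  connection id (VMBUS_EVENT_CONNECTION_ID, or the id
	 *               from the offer on hosts newer than WS2008)
	 *   bits 32-47  flag number (always 0 here)
	 *   bits 48-63  reserved, must be 0
	 *
	 * so a plain u64 initialized to the connection id is sufficient and
	 * the aligned sig_buf is not needed any more.
	 */
	newchannel->sig_event = VMBUS_EVENT_CONNECTION_ID;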

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
---
drivers/hv/channel_mgmt.c | 13 ++-----------
drivers/hv/connection.c | 2 +-
include/linux/hyperv.h | 15 +--------------
3 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 4bbb8dea4727..fd2b6c67f781 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -805,21 +805,12 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
/*
* Setup state for signalling the host.
*/
- newchannel->sig_event = (struct hv_input_signal_event *)
- (ALIGN((unsigned long)
- &newchannel->sig_buf,
- HV_HYPERCALL_PARAM_ALIGN));
-
- newchannel->sig_event->connectionid.asu32 = 0;
- newchannel->sig_event->connectionid.u.id = VMBUS_EVENT_CONNECTION_ID;
- newchannel->sig_event->flag_number = 0;
- newchannel->sig_event->rsvdz = 0;
+ newchannel->sig_event = VMBUS_EVENT_CONNECTION_ID;

if (vmbus_proto_version != VERSION_WS2008) {
newchannel->is_dedicated_interrupt =
(offer->is_dedicated_interrupt != 0);
- newchannel->sig_event->connectionid.u.id =
- offer->connection_id;
+ newchannel->sig_event = offer->connection_id;
}

memcpy(&newchannel->offermsg, offer,
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 45e806e3112f..37ecf514189e 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -408,6 +408,6 @@ void vmbus_set_event(struct vmbus_channel *channel)
if (!channel->is_dedicated_interrupt)
vmbus_send_interrupt(child_relid);

- hv_do_hypercall(HVCALL_SIGNAL_EVENT, channel->sig_event, NULL);
+ hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
}
EXPORT_SYMBOL_GPL(vmbus_set_event);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 6608a71e7d79..c472bd43bdd7 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -677,18 +677,6 @@ union hv_connection_id {
} u;
};

-/* Definition of the hv_signal_event hypercall input structure. */
-struct hv_input_signal_event {
- union hv_connection_id connectionid;
- u16 flag_number;
- u16 rsvdz;
-};
-
-struct hv_input_signal_event_buffer {
- u64 align8;
- struct hv_input_signal_event event;
-};
-
enum hv_numa_policy {
HV_BALANCED = 0,
HV_LOCALIZED,
@@ -770,8 +758,7 @@ struct vmbus_channel {
} callback_mode;

bool is_dedicated_interrupt;
- struct hv_input_signal_event_buffer sig_buf;
- struct hv_input_signal_event *sig_event;
+ u64 sig_event;

/*
* Starting with win8, this field will be used to specify
--
2.13.3

2017-08-02 16:10:10

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 6/9] hyper-v: globalize vp_index

To support implementing remote TLB flushing on Hyper-V with a hypercall
we need to make vp_index available outside of the vmbus module. Rename and
globalize.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
---
arch/x86/hyperv/hv_init.c | 34 +++++++++++++++++++++++++-
arch/x86/include/asm/mshyperv.h | 24 ++++++++++++++++++
drivers/hv/channel_mgmt.c | 7 +++---
drivers/hv/connection.c | 3 ++-
drivers/hv/hv.c | 9 -------
drivers/hv/hyperv_vmbus.h | 11 ---------
drivers/hv/vmbus_drv.c | 17 -------------
drivers/pci/host/pci-hyperv.c | 54 +++--------------------------------------
include/linux/hyperv.h | 1 -
9 files changed, 65 insertions(+), 95 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 691603ee9179..e93b9a0b1b10 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -26,6 +26,8 @@
#include <linux/mm.h>
#include <linux/clockchips.h>
#include <linux/hyperv.h>
+#include <linux/slab.h>
+#include <linux/cpuhotplug.h>

#ifdef CONFIG_HYPERV_TSCPAGE

@@ -80,6 +82,20 @@ EXPORT_SYMBOL_GPL(hv_hypercall_pg);
struct clocksource *hyperv_cs;
EXPORT_SYMBOL_GPL(hyperv_cs);

+u32 *hv_vp_index;
+EXPORT_SYMBOL_GPL(hv_vp_index);
+
+static int hv_cpu_init(unsigned int cpu)
+{
+ u64 msr_vp_index;
+
+ hv_get_vp_index(msr_vp_index);
+
+ hv_vp_index[smp_processor_id()] = msr_vp_index;
+
+ return 0;
+}
+
/*
* This function is to be invoked early in the boot sequence after the
* hypervisor has been detected.
@@ -95,6 +111,16 @@ void hyperv_init(void)
if (x86_hyper != &x86_hyper_ms_hyperv)
return;

+ /* Allocate percpu VP index */
+ hv_vp_index = kmalloc_array(num_possible_cpus(), sizeof(*hv_vp_index),
+ GFP_KERNEL);
+ if (!hv_vp_index)
+ return;
+
+ if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
+ hv_cpu_init, NULL) < 0)
+ goto free_vp_index;
+
/*
* Setup the hypercall page and enable hypercalls.
* 1. Register the guest ID
@@ -106,7 +132,7 @@ void hyperv_init(void)
hv_hypercall_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL_RX);
if (hv_hypercall_pg == NULL) {
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
- return;
+ goto free_vp_index;
}

rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
@@ -149,6 +175,12 @@ void hyperv_init(void)
hyperv_cs = &hyperv_cs_msr;
if (ms_hyperv.features & HV_X64_MSR_TIME_REF_COUNT_AVAILABLE)
clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
+
+ return;
+
+free_vp_index:
+ kfree(hv_vp_index);
+ hv_vp_index = NULL;
}

/*
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index efa1860276b5..efd2f80d3353 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -282,6 +282,30 @@ static inline u64 hv_do_rep_hypercall(u16 code, u16 rep_count, u16 varhead_size,
return status;
}

+/*
+ * Hypervisor's notion of virtual processor ID is different from
+ * Linux' notion of CPU ID. This information can only be retrieved
+ * in the context of the calling CPU. Setup a map for easy access
+ * to this information.
+ */
+extern u32 *hv_vp_index;
+
+/**
+ * hv_cpu_number_to_vp_number() - Map CPU to VP.
+ * @cpu_number: CPU number in Linux terms
+ *
+ * This function returns the mapping between the Linux processor
+ * number and the hypervisor's virtual processor number, useful
+ * in making hypercalls and such that talk about specific
+ * processors.
+ *
+ * Return: Virtual processor number in Hyper-V terms
+ */
+static inline int hv_cpu_number_to_vp_number(int cpu_number)
+{
+ return hv_vp_index[cpu_number];
+}
+
void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index fd2b6c67f781..dc590195a74e 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -599,7 +599,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
*/
channel->numa_node = 0;
channel->target_cpu = 0;
- channel->target_vp = hv_context.vp_index[0];
+ channel->target_vp = hv_cpu_number_to_vp_number(0);
return;
}

@@ -683,7 +683,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
}

channel->target_cpu = cur_cpu;
- channel->target_vp = hv_context.vp_index[cur_cpu];
+ channel->target_vp = hv_cpu_number_to_vp_number(cur_cpu);
}

static void vmbus_wait_for_unload(void)
@@ -1219,8 +1219,7 @@ struct vmbus_channel *vmbus_get_outgoing_channel(struct vmbus_channel *primary)
return outgoing_channel;
}

- cur_cpu = hv_context.vp_index[get_cpu()];
- put_cpu();
+ cur_cpu = hv_cpu_number_to_vp_number(smp_processor_id());
list_for_each_safe(cur, tmp, &primary->sc_list) {
cur_channel = list_entry(cur, struct vmbus_channel, sc_list);
if (cur_channel->state != CHANNEL_OPENED_STATE)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 37ecf514189e..f41901f80b64 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -96,7 +96,8 @@ static int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo,
* the CPU attempting to connect may not be CPU 0.
*/
if (version >= VERSION_WIN8_1) {
- msg->target_vcpu = hv_context.vp_index[smp_processor_id()];
+ msg->target_vcpu =
+ hv_cpu_number_to_vp_number(smp_processor_id());
vmbus_connection.connect_cpu = smp_processor_id();
} else {
msg->target_vcpu = 0;
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 2ea12207caa0..8267439dd1ee 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -234,7 +234,6 @@ int hv_synic_init(unsigned int cpu)
union hv_synic_siefp siefp;
union hv_synic_sint shared_sint;
union hv_synic_scontrol sctrl;
- u64 vp_index;

/* Setup the Synic's message page */
hv_get_simp(simp.as_uint64);
@@ -276,14 +275,6 @@ int hv_synic_init(unsigned int cpu)
hv_context.synic_initialized = true;

/*
- * Setup the mapping between Hyper-V's notion
- * of cpuid and Linux' notion of cpuid.
- * This array will be indexed using Linux cpuid.
- */
- hv_get_vp_index(vp_index);
- hv_context.vp_index[cpu] = (u32)vp_index;
-
- /*
* Register the per-cpu clockevent source.
*/
if (ms_hyperv.features & HV_X64_MSR_SYNTIMER_AVAILABLE)
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 1b6a5e0dfa75..49569f8fe038 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -229,17 +229,6 @@ struct hv_context {
struct hv_per_cpu_context __percpu *cpu_context;

/*
- * Hypervisor's notion of virtual processor ID is different from
- * Linux' notion of CPU ID. This information can only be retrieved
- * in the context of the calling CPU. Setup a map for easy access
- * to this information:
- *
- * vp_index[a] is the Hyper-V's processor ID corresponding to
- * Linux cpuid 'a'.
- */
- u32 vp_index[NR_CPUS];
-
- /*
* To manage allocations in a NUMA node.
* Array indexed by numa node ID.
*/
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index ed84e96715a0..c7e7d6db2d21 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1451,23 +1451,6 @@ void vmbus_free_mmio(resource_size_t start, resource_size_t size)
}
EXPORT_SYMBOL_GPL(vmbus_free_mmio);

-/**
- * vmbus_cpu_number_to_vp_number() - Map CPU to VP.
- * @cpu_number: CPU number in Linux terms
- *
- * This function returns the mapping between the Linux processor
- * number and the hypervisor's virtual processor number, useful
- * in making hypercalls and such that talk about specific
- * processors.
- *
- * Return: Virtual processor number in Hyper-V terms
- */
-int vmbus_cpu_number_to_vp_number(int cpu_number)
-{
- return hv_context.vp_index[cpu_number];
-}
-EXPORT_SYMBOL_GPL(vmbus_cpu_number_to_vp_number);
-
static int vmbus_acpi_add(struct acpi_device *device)
{
acpi_status result;
diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 415dcc69a502..aba041438566 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -562,52 +562,6 @@ static void put_pcichild(struct hv_pci_dev *hv_pcidev,
static void get_hvpcibus(struct hv_pcibus_device *hv_pcibus);
static void put_hvpcibus(struct hv_pcibus_device *hv_pcibus);

-
-/*
- * Temporary CPU to vCPU mapping to address transitioning
- * vmbus_cpu_number_to_vp_number() being migrated to
- * hv_cpu_number_to_vp_number() in a separate patch. Once that patch
- * has been picked up in the main line, remove this code here and use
- * the official code.
- */
-static struct hv_tmpcpumap
-{
- bool initialized;
- u32 vp_index[NR_CPUS];
-} hv_tmpcpumap;
-
-static void hv_tmpcpumap_init_cpu(void *_unused)
-{
- int cpu = smp_processor_id();
- u64 vp_index;
-
- hv_get_vp_index(vp_index);
-
- hv_tmpcpumap.vp_index[cpu] = vp_index;
-}
-
-static void hv_tmpcpumap_init(void)
-{
- if (hv_tmpcpumap.initialized)
- return;
-
- memset(hv_tmpcpumap.vp_index, -1, sizeof(hv_tmpcpumap.vp_index));
- on_each_cpu(hv_tmpcpumap_init_cpu, NULL, true);
- hv_tmpcpumap.initialized = true;
-}
-
-/**
- * hv_tmp_cpu_nr_to_vp_nr() - Convert Linux CPU nr to Hyper-V vCPU nr
- *
- * Remove once vmbus_cpu_number_to_vp_number() has been converted to
- * hv_cpu_number_to_vp_number() and replace callers appropriately.
- */
-static u32 hv_tmp_cpu_nr_to_vp_nr(int cpu)
-{
- return hv_tmpcpumap.vp_index[cpu];
-}
-
-
/**
* devfn_to_wslot() - Convert from Linux PCI slot to Windows
* @devfn: The Linux representation of PCI slot
@@ -971,7 +925,7 @@ static void hv_irq_unmask(struct irq_data *data)
var_size = 1 + HV_VP_SET_BANK_COUNT_MAX;

for_each_cpu_and(cpu, dest, cpu_online_mask) {
- cpu_vmbus = hv_tmp_cpu_nr_to_vp_nr(cpu);
+ cpu_vmbus = hv_cpu_number_to_vp_number(cpu);

if (cpu_vmbus >= HV_VP_SET_BANK_COUNT_MAX * 64) {
dev_err(&hbus->hdev->device,
@@ -986,7 +940,7 @@ static void hv_irq_unmask(struct irq_data *data)
} else {
for_each_cpu_and(cpu, dest, cpu_online_mask) {
params->int_target.vp_mask |=
- (1ULL << hv_tmp_cpu_nr_to_vp_nr(cpu));
+ (1ULL << hv_cpu_number_to_vp_number(cpu));
}
}

@@ -1063,7 +1017,7 @@ static u32 hv_compose_msi_req_v2(
*/
cpu = cpumask_first_and(affinity, cpu_online_mask);
int_pkt->int_desc.processor_array[0] =
- hv_tmp_cpu_nr_to_vp_nr(cpu);
+ hv_cpu_number_to_vp_number(cpu);
int_pkt->int_desc.processor_count = 1;

return sizeof(*int_pkt);
@@ -2490,8 +2444,6 @@ static int hv_pci_probe(struct hv_device *hdev,
return -ENOMEM;
hbus->state = hv_pcibus_init;

- hv_tmpcpumap_init();
-
/*
* The PCI bus "domain" is what is called "segment" in ACPI and
* other specs. Pull it from the instance ID, to get something
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index c472bd43bdd7..e2a4fa57f110 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1173,7 +1173,6 @@ int vmbus_allocate_mmio(struct resource **new, struct hv_device *device_obj,
resource_size_t size, resource_size_t align,
bool fb_overlap_ok);
void vmbus_free_mmio(resource_size_t start, resource_size_t size);
-int vmbus_cpu_number_to_vp_number(int cpu_number);

/*
* GUID definitions of various offer types - services offered to the guest.
--
2.13.3

2017-08-02 16:10:01

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 5/9] x86/hyper-v: implement rep hypercalls

Rep hypercalls are normal hypercalls which perform multiple actions at
once. Hyper-V guarantees to return execution to the caller in no more
than 50us, and the caller needs to use hypercall continuation. Touch the
NMI watchdog between hypercall invocations.

This is going to be used for the HvFlushVirtualAddressList hypercall for
remote TLB flushing.
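
A rough sketch of the control word and the continuation handling (bit
positions as defined by the new HV_HYPERCALL_* macros; the variable header
width is per the TLFS):

	/*
	 * 64-bit hypercall control word for a rep hypercall:
	 *
	 *   bits  0-15  call code
	 *   bit     16  'fast' flag (not used for rep calls here)
	 *   bits 17+    variable header size, in 8-byte units
	 *   bits 32-43  rep count (total number of list elements)
	 *   bits 48-59  rep start index (resume point for a continuation)
	 *
	 * The returned status carries the result code in bits 0-15 and
	 * 'reps completed' in bits 32-43; hv_do_rep_hypercall() copies the
	 * latter into the rep start field and re-issues the hypercall until
	 * all elements are processed, touching the NMI watchdog in between.
	 */
	control = (u64)code |
		  ((u64)varhead_size << HV_HYPERCALL_VARHEAD_OFFSET) |
		  ((u64)rep_count << HV_HYPERCALL_REP_COMP_OFFSET);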

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
---
arch/x86/include/asm/mshyperv.h | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index e484255bd9de..efa1860276b5 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -3,6 +3,7 @@

#include <linux/types.h>
#include <linux/atomic.h>
+#include <linux/nmi.h>
#include <asm/io.h>
#include <asm/hyperv.h>

@@ -209,7 +210,13 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
return hv_status;
}

+#define HV_HYPERCALL_RESULT_MASK GENMASK_ULL(15, 0)
#define HV_HYPERCALL_FAST_BIT BIT(16)
+#define HV_HYPERCALL_VARHEAD_OFFSET 17
+#define HV_HYPERCALL_REP_COMP_OFFSET 32
+#define HV_HYPERCALL_REP_COMP_MASK GENMASK_ULL(43, 32)
+#define HV_HYPERCALL_REP_START_OFFSET 48
+#define HV_HYPERCALL_REP_START_MASK GENMASK_ULL(59, 48)

/* Fast hypercall with 8 bytes of input and no output */
static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
@@ -243,6 +250,38 @@ static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
return hv_status;
}

+/*
+ * Rep hypercalls. Callers of this functions are supposed to ensure that
+ * rep_count and varhead_size comply with Hyper-V hypercall definition.
+ */
+static inline u64 hv_do_rep_hypercall(u16 code, u16 rep_count, u16 varhead_size,
+ void *input, void *output)
+{
+ u64 control = code;
+ u64 status;
+ u16 rep_comp;
+
+ control |= (u64)varhead_size << HV_HYPERCALL_VARHEAD_OFFSET;
+ control |= (u64)rep_count << HV_HYPERCALL_REP_COMP_OFFSET;
+
+ do {
+ status = hv_do_hypercall(control, input, output);
+ if ((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS)
+ return status;
+
+ /* Bits 32-43 of status have 'Reps completed' data. */
+ rep_comp = (status & HV_HYPERCALL_REP_COMP_MASK) >>
+ HV_HYPERCALL_REP_COMP_OFFSET;
+
+ control &= ~HV_HYPERCALL_REP_START_MASK;
+ control |= (u64)rep_comp << HV_HYPERCALL_REP_START_OFFSET;
+
+ touch_nmi_watchdog();
+ } while (rep_comp < rep_count);
+
+ return status;
+}
+
void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);
--
2.13.3

2017-08-02 16:10:27

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 7/9] x86/hyper-v: use hypercall for remote TLB flush

A Hyper-V host can suggest that its guests use a hypercall for doing remote
TLB flushes; this is supposed to work faster than IPIs.

Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
we need to put the input somewhere in memory, and we don't really want to
allocate memory on each call, so we pre-allocate per-cpu memory areas on
boot.

pv_ops patching happens very early, so we need to separate
hyperv_setup_mmu_ops() and hyper_alloc_mmu().

It is possible and easy to implement local TLB flushing too, and there is
even a hint for that. However, I don't see room for optimization on the
host side, as both a hypercall and a native TLB flush will result in a
vmexit. The hint is also not set on modern Hyper-V versions.
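
As a sketch of the pre-allocated hypercall input this patch introduces (the
authoritative definitions are in the mmu.c hunk below):

	/* one page per CPU, allocated once when the TLB flush hint is set */
	pcpu_flush = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);

	/* ... in the flush path, filled with interrupts disabled ... */
	flush = this_cpu_ptr(pcpu_flush);

	/*
	 * Layout of the HvFlushVirtualAddressSpace/List input
	 * (struct hv_flush_pcpu below):
	 *
	 *   u64 address_space;    pa of mm->pgd, or 0 together with
	 *                         HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES
	 *   u64 flags;            HV_FLUSH_* flags
	 *   u64 processor_mask;   one bit per target VP (<= 64 vCPUs)
	 *   u64 gva_list[];       rep list for the _LIST variant
	 *
	 * Each gva_list entry is a page-aligned GVA whose low 12 bits encode
	 * the number of *additional* pages to flush, e.g. flushing 5 pages
	 * starting at 0x7f0000000000 becomes the single entry
	 * 0x7f0000000004 (that page plus 4 more).
	 */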

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
---
arch/x86/hyperv/Makefile | 2 +-
arch/x86/hyperv/hv_init.c | 2 +
arch/x86/hyperv/mmu.c | 138 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/mshyperv.h | 3 +
arch/x86/include/uapi/asm/hyperv.h | 7 ++
arch/x86/kernel/cpu/mshyperv.c | 1 +
6 files changed, 152 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/hyperv/mmu.c

diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 171ae09864d7..367a8203cfcf 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1 +1 @@
-obj-y := hv_init.o
+obj-y := hv_init.o mmu.o
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e93b9a0b1b10..1a8eb550c40f 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -140,6 +140,8 @@ void hyperv_init(void)
hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

+ hyper_alloc_mmu();
+
/*
* Register Hyper-V specific clocksource.
*/
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
new file mode 100644
index 000000000000..9419a20b1d75
--- /dev/null
+++ b/arch/x86/hyperv/mmu.c
@@ -0,0 +1,138 @@
+#define pr_fmt(fmt) "Hyper-V: " fmt
+
+#include <linux/hyperv.h>
+#include <linux/log2.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include <asm/fpu/api.h>
+#include <asm/mshyperv.h>
+#include <asm/msr.h>
+#include <asm/tlbflush.h>
+
+/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
+struct hv_flush_pcpu {
+ u64 address_space;
+ u64 flags;
+ u64 processor_mask;
+ u64 gva_list[];
+};
+
+/* Each gva in gva_list encodes up to 4096 pages to flush */
+#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
+
+static struct hv_flush_pcpu __percpu *pcpu_flush;
+
+/*
+ * Fills in gva_list starting from offset. Returns the number of items added.
+ */
+static inline int fill_gva_list(u64 gva_list[], int offset,
+ unsigned long start, unsigned long end)
+{
+ int gva_n = offset;
+ unsigned long cur = start, diff;
+
+ do {
+ diff = end > cur ? end - cur : 0;
+
+ gva_list[gva_n] = cur & PAGE_MASK;
+ /*
+ * Lower 12 bits encode the number of additional
+ * pages to flush (in addition to the 'cur' page).
+ */
+ if (diff >= HV_TLB_FLUSH_UNIT)
+ gva_list[gva_n] |= ~PAGE_MASK;
+ else if (diff)
+ gva_list[gva_n] |= (diff - 1) >> PAGE_SHIFT;
+
+ cur += HV_TLB_FLUSH_UNIT;
+ gva_n++;
+
+ } while (cur < end);
+
+ return gva_n - offset;
+}
+
+static void hyperv_flush_tlb_others(const struct cpumask *cpus,
+ const struct flush_tlb_info *info)
+{
+ int cpu, vcpu, gva_n, max_gvas;
+ struct hv_flush_pcpu *flush;
+ u64 status = U64_MAX;
+ unsigned long flags;
+
+ if (!pcpu_flush || !hv_hypercall_pg)
+ goto do_native;
+
+ if (cpumask_empty(cpus))
+ return;
+
+ local_irq_save(flags);
+
+ flush = this_cpu_ptr(pcpu_flush);
+
+ if (info->mm) {
+ flush->address_space = virt_to_phys(info->mm->pgd);
+ flush->flags = 0;
+ } else {
+ flush->address_space = 0;
+ flush->flags = HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES;
+ }
+
+ flush->processor_mask = 0;
+ if (cpumask_equal(cpus, cpu_present_mask)) {
+ flush->flags |= HV_FLUSH_ALL_PROCESSORS;
+ } else {
+ for_each_cpu(cpu, cpus) {
+ vcpu = hv_cpu_number_to_vp_number(cpu);
+ if (vcpu >= 64)
+ goto do_native;
+
+ __set_bit(vcpu, (unsigned long *)
+ &flush->processor_mask);
+ }
+ }
+
+ /*
+ * We can flush not more than max_gvas with one hypercall. Flush the
+ * whole address space if we were asked to do more.
+ */
+ max_gvas = (PAGE_SIZE - sizeof(*flush)) / sizeof(flush->gva_list[0]);
+
+ if (info->end == TLB_FLUSH_ALL) {
+ flush->flags |= HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY;
+ status = hv_do_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE,
+ flush, NULL);
+ } else if (info->end &&
+ ((info->end - info->start)/HV_TLB_FLUSH_UNIT) > max_gvas) {
+ status = hv_do_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE,
+ flush, NULL);
+ } else {
+ gva_n = fill_gva_list(flush->gva_list, 0,
+ info->start, info->end);
+ status = hv_do_rep_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST,
+ gva_n, 0, flush, NULL);
+ }
+
+ local_irq_restore(flags);
+
+ if (!(status & HV_HYPERCALL_RESULT_MASK))
+ return;
+do_native:
+ native_flush_tlb_others(cpus, info);
+}
+
+void hyperv_setup_mmu_ops(void)
+{
+ if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED) {
+ pr_info("Using hypercall for remote TLB flush\n");
+ pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+ }
+}
+
+void hyper_alloc_mmu(void)
+{
+ if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED)
+ pcpu_flush = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
+}
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index efd2f80d3353..0d4b01c5e438 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -307,6 +307,8 @@ static inline int hv_cpu_number_to_vp_number(int cpu_number)
}

void hyperv_init(void);
+void hyperv_setup_mmu_ops(void);
+void hyper_alloc_mmu(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);
void hyperv_cleanup(void);
@@ -314,6 +316,7 @@ void hyperv_cleanup(void);
static inline void hyperv_init(void) {}
static inline bool hv_is_hypercall_page_setup(void) { return false; }
static inline void hyperv_cleanup(void) {}
+static inline void hyperv_setup_mmu_ops(void) {}
#endif /* CONFIG_HYPERV */

#ifdef CONFIG_HYPERV_TSCPAGE
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index 127ddadee1a5..a6fdd3b82b4a 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -242,6 +242,8 @@
(~((1ull << HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT) - 1))

/* Declare the various hypercall operations. */
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003
#define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008
#define HVCALL_POST_MESSAGE 0x005c
#define HVCALL_SIGNAL_EVENT 0x005d
@@ -259,6 +261,11 @@
#define HV_PROCESSOR_POWER_STATE_C2 2
#define HV_PROCESSOR_POWER_STATE_C3 3

+#define HV_FLUSH_ALL_PROCESSORS BIT(0)
+#define HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES BIT(1)
+#define HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY BIT(2)
+#define HV_FLUSH_USE_EXTENDED_RANGE_FORMAT BIT(3)
+
/* hypercall status code */
#define HV_STATUS_SUCCESS 0
#define HV_STATUS_INVALID_HYPERCALL_CODE 2
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 70e717fccdd6..daefd67a66c7 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -249,6 +249,7 @@ static void __init ms_hyperv_init_platform(void)
* Setup the hook to get control post apic initialization.
*/
x86_platform.apic_post_init = hyperv_init;
+ hyperv_setup_mmu_ops();
#endif
}

--
2.13.3

2017-08-02 16:10:38

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 8/9] x86/hyper-v: support extended CPU ranges for TLB flush hypercalls

Hyper-V hosts may support more than 64 vCPUs; in this case we need to use
the HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX/LIST_EX hypercalls.
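
A short illustration of the sparse VP set encoding used by the _EX calls
(example values only, not part of the patch):

	/*
	 * Instead of a single 64-bit processor_mask, the _EX calls take a
	 * variable-size hv_vp_set: each VP lands in bank 'vp / 64' at bit
	 * 'vp % 64', and valid_bank_mask records which banks are present.
	 *
	 * Example for target VPs {3, 70, 130}:
	 *
	 *   bank_contents[0] = 0x0000000000000008   (VP 3)
	 *   bank_contents[1] = 0x0000000000000040   (VP 70)
	 *   bank_contents[2] = 0x0000000000000004   (VP 130)
	 *   valid_bank_mask  = 0x7, nr_bank = 3
	 *
	 * The variable header passed to the rep hypercall is the hv_vp_set
	 * (format + valid_bank_mask) plus the banks, hence the 'nr_bank + 2'
	 * varhead_size used below.
	 */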

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
---
arch/x86/hyperv/mmu.c | 133 ++++++++++++++++++++++++++++++++++++-
arch/x86/include/uapi/asm/hyperv.h | 10 +++
2 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 9419a20b1d75..51b44be03f50 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -18,11 +18,25 @@ struct hv_flush_pcpu {
u64 gva_list[];
};

+/* HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressListEx hypercalls */
+struct hv_flush_pcpu_ex {
+ u64 address_space;
+ u64 flags;
+ struct {
+ u64 format;
+ u64 valid_bank_mask;
+ u64 bank_contents[];
+ } hv_vp_set;
+ u64 gva_list[];
+};
+
/* Each gva in gva_list encodes up to 4096 pages to flush */
#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)

static struct hv_flush_pcpu __percpu *pcpu_flush;

+static struct hv_flush_pcpu_ex __percpu *pcpu_flush_ex;
+
/*
* Fills in gva_list starting from offset. Returns the number of items added.
*/
@@ -53,6 +67,34 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
}

+/* Return the number of banks in the resulting vp_set */
+static inline int cpumask_to_vp_set(struct hv_flush_pcpu_ex *flush,
+ const struct cpumask *cpus)
+{
+ int cpu, vcpu, vcpu_bank, vcpu_offset, nr_bank = 1;
+
+ /*
+ * Some banks may end up being empty but this is acceptable.
+ */
+ for_each_cpu(cpu, cpus) {
+ vcpu = hv_cpu_number_to_vp_number(cpu);
+ vcpu_bank = vcpu / 64;
+ vcpu_offset = vcpu % 64;
+
+ /* valid_bank_mask can represent up to 64 banks */
+ if (vcpu_bank >= 64)
+ return 0;
+
+ __set_bit(vcpu_offset, (unsigned long *)
+ &flush->hv_vp_set.bank_contents[vcpu_bank]);
+ if (vcpu_bank >= nr_bank)
+ nr_bank = vcpu_bank + 1;
+ }
+ flush->hv_vp_set.valid_bank_mask = GENMASK_ULL(nr_bank - 1, 0);
+
+ return nr_bank;
+}
+
static void hyperv_flush_tlb_others(const struct cpumask *cpus,
const struct flush_tlb_info *info)
{
@@ -122,17 +164,102 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
native_flush_tlb_others(cpus, info);
}

+static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
+ const struct flush_tlb_info *info)
+{
+ int nr_bank = 0, max_gvas, gva_n;
+ struct hv_flush_pcpu_ex *flush;
+ u64 status = U64_MAX;
+ unsigned long flags;
+
+ if (!pcpu_flush_ex || !hv_hypercall_pg)
+ goto do_native;
+
+ if (cpumask_empty(cpus))
+ return;
+
+ local_irq_save(flags);
+
+ flush = this_cpu_ptr(pcpu_flush_ex);
+
+ if (info->mm) {
+ flush->address_space = virt_to_phys(info->mm->pgd);
+ flush->flags = 0;
+ } else {
+ flush->address_space = 0;
+ flush->flags = HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES;
+ }
+
+ flush->hv_vp_set.valid_bank_mask = 0;
+
+ if (!cpumask_equal(cpus, cpu_present_mask)) {
+ flush->hv_vp_set.format = HV_GENERIC_SET_SPARCE_4K;
+ nr_bank = cpumask_to_vp_set(flush, cpus);
+ }
+
+ if (!nr_bank) {
+ flush->hv_vp_set.format = HV_GENERIC_SET_ALL;
+ flush->flags |= HV_FLUSH_ALL_PROCESSORS;
+ }
+
+ /*
+ * We can flush not more than max_gvas with one hypercall. Flush the
+ * whole address space if we were asked to do more.
+ */
+ max_gvas =
+ (PAGE_SIZE - sizeof(*flush) - nr_bank *
+ sizeof(flush->hv_vp_set.bank_contents[0])) /
+ sizeof(flush->gva_list[0]);
+
+ if (info->end == TLB_FLUSH_ALL) {
+ flush->flags |= HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY;
+ status = hv_do_rep_hypercall(
+ HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX,
+ 0, nr_bank + 2, flush, NULL);
+ } else if (info->end &&
+ ((info->end - info->start)/HV_TLB_FLUSH_UNIT) > max_gvas) {
+ status = hv_do_rep_hypercall(
+ HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX,
+ 0, nr_bank + 2, flush, NULL);
+ } else {
+ gva_n = fill_gva_list(flush->gva_list, nr_bank,
+ info->start, info->end);
+ status = hv_do_rep_hypercall(
+ HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX,
+ gva_n, nr_bank + 2, flush, NULL);
+ }
+
+ local_irq_restore(flags);
+
+ if (!(status & HV_HYPERCALL_RESULT_MASK))
+ return;
+do_native:
+ native_flush_tlb_others(cpus, info);
+}
+
void hyperv_setup_mmu_ops(void)
{
- if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED) {
+ if (!(ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED))
+ return;
+
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+
+ if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED)) {
pr_info("Using hypercall for remote TLB flush\n");
pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;
- setup_clear_cpu_cap(X86_FEATURE_PCID);
+ } else {
+ pr_info("Using ext hypercall for remote TLB flush\n");
+ pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others_ex;
}
}

void hyper_alloc_mmu(void)
{
- if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED)
+ if (!(ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED))
+ return;
+
+ if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
pcpu_flush = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
+ else
+ pcpu_flush_ex = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
}
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index a6fdd3b82b4a..7032f4d8dff3 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -149,6 +149,9 @@
*/
#define HV_X64_DEPRECATING_AEOI_RECOMMENDED (1 << 9)

+/* Recommend using the newer ExProcessorMasks interface */
+#define HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED (1 << 11)
+
/*
* HV_VP_SET available
*/
@@ -245,6 +248,8 @@
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003
#define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX 0x0013
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX 0x0014
#define HVCALL_POST_MESSAGE 0x005c
#define HVCALL_SIGNAL_EVENT 0x005d

@@ -266,6 +271,11 @@
#define HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY BIT(2)
#define HV_FLUSH_USE_EXTENDED_RANGE_FORMAT BIT(3)

+enum HV_GENERIC_SET_FORMAT {
+ HV_GENERIC_SET_SPARCE_4K,
+ HV_GENERIC_SET_ALL,
+};
+
/* hypercall status code */
#define HV_STATUS_SUCCESS 0
#define HV_STATUS_INVALID_HYPERCALL_CODE 2
--
2.13.3

2017-08-02 16:10:53

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v10 9/9] tracing/hyper-v: trace hyperv_mmu_flush_tlb_others()

Add a Hyper-V tracing subsystem and trace hyperv_mmu_flush_tlb_others().
Tracing is done the same way it is done for xen_mmu_flush_tlb_others().

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
---
MAINTAINERS | 1 +
arch/x86/hyperv/mmu.c | 7 +++++++
arch/x86/include/asm/trace/hyperv.h | 40 +++++++++++++++++++++++++++++++++++++
3 files changed, 48 insertions(+)
create mode 100644 arch/x86/include/asm/trace/hyperv.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 567343b8ffaa..cbe88f3ea193 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6258,6 +6258,7 @@ M: Stephen Hemminger <[email protected]>
L: [email protected]
S: Maintained
F: arch/x86/include/asm/mshyperv.h
+F: arch/x86/include/asm/trace/hyperv.h
F: arch/x86/include/uapi/asm/hyperv.h
F: arch/x86/kernel/cpu/mshyperv.c
F: arch/x86/hyperv
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 51b44be03f50..39e7f6e50919 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -10,6 +10,9 @@
#include <asm/msr.h>
#include <asm/tlbflush.h>

+#define CREATE_TRACE_POINTS
+#include <asm/trace/hyperv.h>
+
/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
struct hv_flush_pcpu {
u64 address_space;
@@ -103,6 +106,8 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
u64 status = U64_MAX;
unsigned long flags;

+ trace_hyperv_mmu_flush_tlb_others(cpus, info);
+
if (!pcpu_flush || !hv_hypercall_pg)
goto do_native;

@@ -172,6 +177,8 @@ static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
u64 status = U64_MAX;
unsigned long flags;

+ trace_hyperv_mmu_flush_tlb_others(cpus, info);
+
if (!pcpu_flush_ex || !hv_hypercall_pg)
goto do_native;

diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
new file mode 100644
index 000000000000..4253bca99989
--- /dev/null
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -0,0 +1,40 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hyperv
+
+#if !defined(_TRACE_HYPERV_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HYPERV_H
+
+#include <linux/tracepoint.h>
+
+#if IS_ENABLED(CONFIG_HYPERV)
+
+TRACE_EVENT(hyperv_mmu_flush_tlb_others,
+ TP_PROTO(const struct cpumask *cpus,
+ const struct flush_tlb_info *info),
+ TP_ARGS(cpus, info),
+ TP_STRUCT__entry(
+ __field(unsigned int, ncpus)
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, addr)
+ __field(unsigned long, end)
+ ),
+ TP_fast_assign(__entry->ncpus = cpumask_weight(cpus);
+ __entry->mm = info->mm;
+ __entry->addr = info->start;
+ __entry->end = info->end;
+ ),
+ TP_printk("ncpus %d mm %p addr %lx, end %lx",
+ __entry->ncpus, __entry->mm,
+ __entry->addr, __entry->end)
+ );
+
+#endif /* CONFIG_HYPERV */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH asm/trace/
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE hyperv
+#endif /* _TRACE_HYPERV_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
2.13.3

2017-08-10 11:58:47

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

Vitaly Kuznetsov <[email protected]> writes:

> Changes since v9:
> - Rebase to 4.13-rc3.
> - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
> functional dependencies on this patch so the series can go through a different tree
> (and it actually belongs to x86 if I got Ingo's comment right).
> - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
> - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
> hyperv_flush_tlb_others() [Andy Shevchenko]
> - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
> reported by kbuild test robot (#include <asm/io.h>)
> - Add Steven's 'Reviewed-by:' to PATCH9.

Thomas, Ingo, Greg,

do I get it right that the intention is to take this series through x86
tree? (See: https://www.spinics.net/lists/kernel/msg2561174.html) If so,
is there anything else I need to do to get it accepted?

Thanks,

--
Vitaly

2017-08-10 15:12:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements


* Vitaly Kuznetsov <[email protected]> wrote:

> Vitaly Kuznetsov <[email protected]> writes:
>
> > Changes since v9:
> > - Rebase to 4.13-rc3.
> > - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
> > functional dependencies on this patch so the series can go through a different tree
> > (and it actually belongs to x86 if I got Ingo's comment right).
> > - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
> > - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
> > hyperv_flush_tlb_others() [Andy Shevchenko]
> > - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
> > reported by kbuild test robot (#include <asm/io.h>)
> > - Add Steven's 'Reviewed-by:' to PATCH9.
>
> Thomas, Ingo, Greg,
>
> do I get it right that the intention is to take this series through x86
> tree? (See: https://www.spinics.net/lists/kernel/msg2561174.html) If so,
> is there anything else I need to do to get it accepted?

Yeah, the patches are arch/x86/-heavy, so that would be the ideal workflow - it's
just that the series coincided with the x86 maintainers' vacation time!

I've picked them up now into tip:x86/platform (they look good to me) and will push
them out after some testing.

Thanks,

Ingo

2017-08-10 15:17:45

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

Ingo Molnar <[email protected]> writes:

> * Vitaly Kuznetsov <[email protected]> wrote:
>
>> Vitaly Kuznetsov <[email protected]> writes:
>>
>> > Changes since v9:
>> > - Rebase to 4.13-rc3.
>> > - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
>> > functional dependencies on this patch so the series can go through a different tree
>> > (and it actually belongs to x86 if I got Ingo's comment right).
>> > - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
>> > - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
>> > hyperv_flush_tlb_others() [Andy Shevchenko]
>> > - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
>> > reported by kbuild test robot (#include <asm/io.h>)
>> > - Add Steven's 'Reviewed-by:' to PATCH9.
>>
>> Thomas, Ingo, Greg,
>>
>> do I get it right that the intention is to take this series through x86
>> tree? (See: https://www.spinics.net/lists/kernel/msg2561174.html) If so,
>> is there anything else I need to do to get it accepted?
>
> Yeah, the patches are arch/x86/-heavy, so that would be the ideal workflow - it's
> just that the series coincided with x86 maintainers vacation time!
>
> I've picked them up now into tip:x86/platform (they look good to me) and will push
> them out after some testing.
>

Great, thanks!

--
Vitaly

2017-08-10 16:03:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements


I'm getting this build failure with this series:

arch/x86/hyperv/mmu.c: In function ‘hyperv_setup_mmu_ops’:
arch/x86/hyperv/mmu.c:256:3: error: ‘pv_mmu_ops’ undeclared (first use in this function)
pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;
^

with the attached (rand-)config.

Thanks,

Ingo


Attachments:
(No filename) (321.00 B)
config (151.88 kB)

Subject: [tip:x86/platform] x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set

Commit-ID: 79cadff2d92bb8b1448f6dba6861d15adc3dc4cb
Gitweb: http://git.kernel.org/tip/79cadff2d92bb8b1448f6dba6861d15adc3dc4cb
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:13 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Aug 2017 16:50:22 +0200

x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set

Code in arch/x86/hyperv/ is only needed when CONFIG_HYPERV is set; the
'basic' support and detection lives in arch/x86/kernel/cpu/mshyperv.c,
which is included when CONFIG_HYPERVISOR_GUEST is set.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Kbuild | 2 +-
arch/x86/include/asm/mshyperv.h | 7 ++++++-
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index 586b786..3e6f640 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -8,7 +8,7 @@ obj-$(CONFIG_KVM) += kvm/
obj-$(CONFIG_XEN) += xen/

# Hyper-V paravirtualization support
-obj-$(CONFIG_HYPERVISOR_GUEST) += hyperv/
+obj-$(subst m,y,$(CONFIG_HYPERV)) += hyperv/

# lguest paravirtualization support
obj-$(CONFIG_LGUEST_GUEST) += lguest/
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 2b58c8c..baea267 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -173,7 +173,12 @@ void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);
void hyperv_cleanup(void);
-#endif
+#else /* CONFIG_HYPERV */
+static inline void hyperv_init(void) {}
+static inline bool hv_is_hypercall_page_setup(void) { return false; }
+static inline void hyperv_cleanup(void) {}
+#endif /* CONFIG_HYPERV */
+
#ifdef CONFIG_HYPERV_TSCPAGE
struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)

Subject: [tip:x86/platform] x86/hyper-v: Make hv_do_hypercall() inline

Commit-ID: fc53662f13b889a5a1c069e79ee1e3d4534df132
Gitweb: http://git.kernel.org/tip/fc53662f13b889a5a1c069e79ee1e3d4534df132
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:14 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Aug 2017 16:50:22 +0200

x86/hyper-v: Make hv_do_hypercall() inline

We have only three call sites for hv_do_hypercall() and we're going to
change HVCALL_SIGNAL_EVENT to a fast hypercall, so we can inline this
function for optimization.

The Hyper-V top level functional specification states that the r9-r11
registers and flags may be clobbered by the hypervisor during a hypercall;
with inlining this becomes important, so add the clobbers.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/hyperv/hv_init.c | 54 ++++-------------------------------------
arch/x86/include/asm/mshyperv.h | 40 ++++++++++++++++++++++++++++++
drivers/hv/connection.c | 2 ++
include/linux/hyperv.h | 1 -
4 files changed, 47 insertions(+), 50 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 5b882cc..691603e 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -75,7 +75,8 @@ static struct clocksource hyperv_cs_msr = {
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};

-static void *hypercall_pg;
+void *hv_hypercall_pg;
+EXPORT_SYMBOL_GPL(hv_hypercall_pg);
struct clocksource *hyperv_cs;
EXPORT_SYMBOL_GPL(hyperv_cs);

@@ -102,15 +103,15 @@ void hyperv_init(void)
guest_id = generate_guest_id(0, LINUX_VERSION_CODE, 0);
wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id);

- hypercall_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL_RX);
- if (hypercall_pg == NULL) {
+ hv_hypercall_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL_RX);
+ if (hv_hypercall_pg == NULL) {
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
return;
}

rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
hypercall_msr.enable = 1;
- hypercall_msr.guest_physical_address = vmalloc_to_pfn(hypercall_pg);
+ hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

/*
@@ -170,51 +171,6 @@ void hyperv_cleanup(void)
}
EXPORT_SYMBOL_GPL(hyperv_cleanup);

-/*
- * hv_do_hypercall- Invoke the specified hypercall
- */
-u64 hv_do_hypercall(u64 control, void *input, void *output)
-{
- u64 input_address = (input) ? virt_to_phys(input) : 0;
- u64 output_address = (output) ? virt_to_phys(output) : 0;
-#ifdef CONFIG_X86_64
- u64 hv_status = 0;
-
- if (!hypercall_pg)
- return (u64)ULLONG_MAX;
-
- __asm__ __volatile__("mov %0, %%r8" : : "r" (output_address) : "r8");
- __asm__ __volatile__("call *%3" : "=a" (hv_status) :
- "c" (control), "d" (input_address),
- "m" (hypercall_pg));
-
- return hv_status;
-
-#else
-
- u32 control_hi = control >> 32;
- u32 control_lo = control & 0xFFFFFFFF;
- u32 hv_status_hi = 1;
- u32 hv_status_lo = 1;
- u32 input_address_hi = input_address >> 32;
- u32 input_address_lo = input_address & 0xFFFFFFFF;
- u32 output_address_hi = output_address >> 32;
- u32 output_address_lo = output_address & 0xFFFFFFFF;
-
- if (!hypercall_pg)
- return (u64)ULLONG_MAX;
-
- __asm__ __volatile__ ("call *%8" : "=d"(hv_status_hi),
- "=a"(hv_status_lo) : "d" (control_hi),
- "a" (control_lo), "b" (input_address_hi),
- "c" (input_address_lo), "D"(output_address_hi),
- "S"(output_address_lo), "m" (hypercall_pg));
-
- return hv_status_lo | ((u64)hv_status_hi << 32);
-#endif /* !x86_64 */
-}
-EXPORT_SYMBOL_GPL(hv_do_hypercall);
-
void hyperv_report_panic(struct pt_regs *regs)
{
static bool panic_reported;
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index baea267..6fa5e34 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -3,6 +3,7 @@

#include <linux/types.h>
#include <linux/atomic.h>
+#include <asm/io.h>
#include <asm/hyperv.h>

/*
@@ -168,6 +169,45 @@ void hv_remove_crash_handler(void);

#if IS_ENABLED(CONFIG_HYPERV)
extern struct clocksource *hyperv_cs;
+extern void *hv_hypercall_pg;
+
+static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
+{
+ u64 input_address = input ? virt_to_phys(input) : 0;
+ u64 output_address = output ? virt_to_phys(output) : 0;
+ u64 hv_status;
+ register void *__sp asm(_ASM_SP);
+
+#ifdef CONFIG_X86_64
+ if (!hv_hypercall_pg)
+ return U64_MAX;
+
+ __asm__ __volatile__("mov %4, %%r8\n"
+ "call *%5"
+ : "=a" (hv_status), "+r" (__sp),
+ "+c" (control), "+d" (input_address)
+ : "r" (output_address), "m" (hv_hypercall_pg)
+ : "cc", "memory", "r8", "r9", "r10", "r11");
+#else
+ u32 input_address_hi = upper_32_bits(input_address);
+ u32 input_address_lo = lower_32_bits(input_address);
+ u32 output_address_hi = upper_32_bits(output_address);
+ u32 output_address_lo = lower_32_bits(output_address);
+
+ if (!hv_hypercall_pg)
+ return U64_MAX;
+
+ __asm__ __volatile__("call *%7"
+ : "=A" (hv_status),
+ "+c" (input_address_lo), "+r" (__sp)
+ : "A" (control),
+ "b" (input_address_hi),
+ "D"(output_address_hi), "S"(output_address_lo),
+ "m" (hv_hypercall_pg)
+ : "cc", "memory");
+#endif /* !x86_64 */
+ return hv_status;
+}

void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 59c11ff..45e806e 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -32,6 +32,8 @@
#include <linux/hyperv.h>
#include <linux/export.h>
#include <asm/hyperv.h>
+#include <asm/mshyperv.h>
+
#include "hyperv_vmbus.h"


diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index b7d7bbe..6608a71 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1187,7 +1187,6 @@ int vmbus_allocate_mmio(struct resource **new, struct hv_device *device_obj,
bool fb_overlap_ok);
void vmbus_free_mmio(resource_size_t start, resource_size_t size);
int vmbus_cpu_number_to_vp_number(int cpu_number);
-u64 hv_do_hypercall(u64 control, void *input, void *output);

/*
* GUID definitions of various offer types - services offered to the guest.

Subject: [tip:x86/platform] x86/hyper-v: Introduce fast hypercall implementation

Commit-ID: 6a8edbd0c54ae266b12f4f63e406313481c9d4bc
Gitweb: http://git.kernel.org/tip/6a8edbd0c54ae266b12f4f63e406313481c9d4bc
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:15 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Aug 2017 16:50:22 +0200

x86/hyper-v: Introduce fast hypercall implementation

Hyper-V supports 'fast' hypercalls when all parameters are passed through
registers. Implement an inline version of the simplest of these calls:
a hypercall with one 8-byte input and no output.
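
For illustration (a sketch, not part of the patch; 'connection_id' is a
placeholder value): with HV_HYPERCALL_FAST_BIT (bit 16) set in the control
word the 8-byte input travels in registers rather than memory, so a caller
simply does:

	u64 status = hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, connection_id);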

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mshyperv.h | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 6fa5e34..e484255 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -209,6 +209,40 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
return hv_status;
}

+#define HV_HYPERCALL_FAST_BIT BIT(16)
+
+/* Fast hypercall with 8 bytes of input and no output */
+static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
+{
+ u64 hv_status, control = (u64)code | HV_HYPERCALL_FAST_BIT;
+ register void *__sp asm(_ASM_SP);
+
+#ifdef CONFIG_X86_64
+ {
+ __asm__ __volatile__("call *%4"
+ : "=a" (hv_status), "+r" (__sp),
+ "+c" (control), "+d" (input1)
+ : "m" (hv_hypercall_pg)
+ : "cc", "r8", "r9", "r10", "r11");
+ }
+#else
+ {
+ u32 input1_hi = upper_32_bits(input1);
+ u32 input1_lo = lower_32_bits(input1);
+
+ __asm__ __volatile__ ("call *%5"
+ : "=A"(hv_status),
+ "+c"(input1_lo),
+ "+r"(__sp)
+ : "A" (control),
+ "b" (input1_hi),
+ "m" (hv_hypercall_pg)
+ : "cc", "edi", "esi");
+ }
+#endif
+ return hv_status;
+}
+
void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);

Subject: [tip:x86/platform] x86/hyper-v: Implement rep hypercalls

Commit-ID: 806c89273bab0c8af0202a6fb6279f36042cb2e6
Gitweb: http://git.kernel.org/tip/806c89273bab0c8af0202a6fb6279f36042cb2e6
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:17 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Aug 2017 16:50:22 +0200

x86/hyper-v: Implement rep hypercalls

Rep hypercalls are normal hypercalls which perform multiple actions at
once. Hyper-V guarantees to return execution to the caller within 50us,
and the caller needs to use hypercall continuation to resume the remaining
reps. Touch the NMI watchdog between hypercall invocations.

This is going to be used for HvFlushVirtualAddressList hypercall for
remote TLB flushing.
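
As a hedged illustration of the continuation scheme (the numbers are made
up): for a rep hypercall issued with rep_count = 100 where Hyper-V returns
to the guest after completing 64 reps, the sequence looks like:

	/*
	 * 1st call: control = code | (100ULL << HV_HYPERCALL_REP_COMP_OFFSET)
	 *           -> status reports 64 reps completed (bits 32-43)
	 * 2nd call: same control, plus rep start index 64 in bits 48-59
	 *           -> status reports 100 reps completed, the loop terminates
	 */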

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mshyperv.h | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index e484255..efa1860 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -3,6 +3,7 @@

#include <linux/types.h>
#include <linux/atomic.h>
+#include <linux/nmi.h>
#include <asm/io.h>
#include <asm/hyperv.h>

@@ -209,7 +210,13 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
return hv_status;
}

+#define HV_HYPERCALL_RESULT_MASK GENMASK_ULL(15, 0)
#define HV_HYPERCALL_FAST_BIT BIT(16)
+#define HV_HYPERCALL_VARHEAD_OFFSET 17
+#define HV_HYPERCALL_REP_COMP_OFFSET 32
+#define HV_HYPERCALL_REP_COMP_MASK GENMASK_ULL(43, 32)
+#define HV_HYPERCALL_REP_START_OFFSET 48
+#define HV_HYPERCALL_REP_START_MASK GENMASK_ULL(59, 48)

/* Fast hypercall with 8 bytes of input and no output */
static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
@@ -243,6 +250,38 @@ static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1)
return hv_status;
}

+/*
+ * Rep hypercalls. Callers of this functions are supposed to ensure that
+ * rep_count and varhead_size comply with Hyper-V hypercall definition.
+ */
+static inline u64 hv_do_rep_hypercall(u16 code, u16 rep_count, u16 varhead_size,
+ void *input, void *output)
+{
+ u64 control = code;
+ u64 status;
+ u16 rep_comp;
+
+ control |= (u64)varhead_size << HV_HYPERCALL_VARHEAD_OFFSET;
+ control |= (u64)rep_count << HV_HYPERCALL_REP_COMP_OFFSET;
+
+ do {
+ status = hv_do_hypercall(control, input, output);
+ if ((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS)
+ return status;
+
+ /* Bits 32-43 of status have 'Reps completed' data. */
+ rep_comp = (status & HV_HYPERCALL_REP_COMP_MASK) >>
+ HV_HYPERCALL_REP_COMP_OFFSET;
+
+ control &= ~HV_HYPERCALL_REP_START_MASK;
+ control |= (u64)rep_comp << HV_HYPERCALL_REP_START_OFFSET;
+
+ touch_nmi_watchdog();
+ } while (rep_comp < rep_count);
+
+ return status;
+}
+
void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);

Subject: [tip:x86/platform] hyper-v: Use fast hypercall for HVCALL_SIGNAL_EVENT

Commit-ID: 057841713cfff62b4485cdd2b245f05b7ea3ba16
Gitweb: http://git.kernel.org/tip/057841713cfff62b4485cdd2b245f05b7ea3ba16
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:16 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Aug 2017 16:50:22 +0200

hyper-v: Use fast hypercall for HVCALL_SIGNAL_EVENT

We need to pass only 8 bytes of input for HvSignalEvent, which makes it a
perfect fit for a fast hypercall. hv_input_signal_event_buffer is not
needed any more, and the channel's sig_event field becomes a plain u64 for
convenience.
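
For reference (the layout follows the hv_input_signal_event structure
being removed below), the 8 bytes handed to the fast hypercall are laid
out as:

	/*
	 * bits  0-31: connection ID
	 * bits 32-47: flag number (0 for the channels set up here)
	 * bits 48-63: reserved, must be zero
	 *
	 * so storing just the connection ID in a u64 sig_event is enough.
	 */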

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
drivers/hv/channel_mgmt.c | 13 ++-----------
drivers/hv/connection.c | 2 +-
include/linux/hyperv.h | 15 +--------------
3 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 4bbb8de..fd2b6c6 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -805,21 +805,12 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
/*
* Setup state for signalling the host.
*/
- newchannel->sig_event = (struct hv_input_signal_event *)
- (ALIGN((unsigned long)
- &newchannel->sig_buf,
- HV_HYPERCALL_PARAM_ALIGN));
-
- newchannel->sig_event->connectionid.asu32 = 0;
- newchannel->sig_event->connectionid.u.id = VMBUS_EVENT_CONNECTION_ID;
- newchannel->sig_event->flag_number = 0;
- newchannel->sig_event->rsvdz = 0;
+ newchannel->sig_event = VMBUS_EVENT_CONNECTION_ID;

if (vmbus_proto_version != VERSION_WS2008) {
newchannel->is_dedicated_interrupt =
(offer->is_dedicated_interrupt != 0);
- newchannel->sig_event->connectionid.u.id =
- offer->connection_id;
+ newchannel->sig_event = offer->connection_id;
}

memcpy(&newchannel->offermsg, offer,
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 45e806e..37ecf51 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -408,6 +408,6 @@ void vmbus_set_event(struct vmbus_channel *channel)
if (!channel->is_dedicated_interrupt)
vmbus_send_interrupt(child_relid);

- hv_do_hypercall(HVCALL_SIGNAL_EVENT, channel->sig_event, NULL);
+ hv_do_fast_hypercall8(HVCALL_SIGNAL_EVENT, channel->sig_event);
}
EXPORT_SYMBOL_GPL(vmbus_set_event);
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 6608a71..c472bd4 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -677,18 +677,6 @@ union hv_connection_id {
} u;
};

-/* Definition of the hv_signal_event hypercall input structure. */
-struct hv_input_signal_event {
- union hv_connection_id connectionid;
- u16 flag_number;
- u16 rsvdz;
-};
-
-struct hv_input_signal_event_buffer {
- u64 align8;
- struct hv_input_signal_event event;
-};
-
enum hv_numa_policy {
HV_BALANCED = 0,
HV_LOCALIZED,
@@ -770,8 +758,7 @@ struct vmbus_channel {
} callback_mode;

bool is_dedicated_interrupt;
- struct hv_input_signal_event_buffer sig_buf;
- struct hv_input_signal_event *sig_event;
+ u64 sig_event;

/*
* Starting with win8, this field will be used to specify

Subject: [tip:x86/platform] hyper-v: Globalize vp_index

Commit-ID: 7415aea6072bab15969b6c3c5b2a193d88095326
Gitweb: http://git.kernel.org/tip/7415aea6072bab15969b6c3c5b2a193d88095326
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:18 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Aug 2017 16:50:23 +0200

hyper-v: Globalize vp_index

To support implementing remote TLB flushing on Hyper-V with a hypercall,
we need to make vp_index available outside of the vmbus module. Rename
and globalize it.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/hyperv/hv_init.c | 34 +++++++++++++++++++++++++-
arch/x86/include/asm/mshyperv.h | 24 ++++++++++++++++++
drivers/hv/channel_mgmt.c | 7 +++---
drivers/hv/connection.c | 3 ++-
drivers/hv/hv.c | 9 -------
drivers/hv/hyperv_vmbus.h | 11 ---------
drivers/hv/vmbus_drv.c | 17 -------------
drivers/pci/host/pci-hyperv.c | 54 +++--------------------------------------
include/linux/hyperv.h | 1 -
9 files changed, 65 insertions(+), 95 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 691603e..e93b9a0 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -26,6 +26,8 @@
#include <linux/mm.h>
#include <linux/clockchips.h>
#include <linux/hyperv.h>
+#include <linux/slab.h>
+#include <linux/cpuhotplug.h>

#ifdef CONFIG_HYPERV_TSCPAGE

@@ -80,6 +82,20 @@ EXPORT_SYMBOL_GPL(hv_hypercall_pg);
struct clocksource *hyperv_cs;
EXPORT_SYMBOL_GPL(hyperv_cs);

+u32 *hv_vp_index;
+EXPORT_SYMBOL_GPL(hv_vp_index);
+
+static int hv_cpu_init(unsigned int cpu)
+{
+ u64 msr_vp_index;
+
+ hv_get_vp_index(msr_vp_index);
+
+ hv_vp_index[smp_processor_id()] = msr_vp_index;
+
+ return 0;
+}
+
/*
* This function is to be invoked early in the boot sequence after the
* hypervisor has been detected.
@@ -95,6 +111,16 @@ void hyperv_init(void)
if (x86_hyper != &x86_hyper_ms_hyperv)
return;

+ /* Allocate percpu VP index */
+ hv_vp_index = kmalloc_array(num_possible_cpus(), sizeof(*hv_vp_index),
+ GFP_KERNEL);
+ if (!hv_vp_index)
+ return;
+
+ if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
+ hv_cpu_init, NULL) < 0)
+ goto free_vp_index;
+
/*
* Setup the hypercall page and enable hypercalls.
* 1. Register the guest ID
@@ -106,7 +132,7 @@ void hyperv_init(void)
hv_hypercall_pg = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL_RX);
if (hv_hypercall_pg == NULL) {
wrmsrl(HV_X64_MSR_GUEST_OS_ID, 0);
- return;
+ goto free_vp_index;
}

rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
@@ -149,6 +175,12 @@ register_msr_cs:
hyperv_cs = &hyperv_cs_msr;
if (ms_hyperv.features & HV_X64_MSR_TIME_REF_COUNT_AVAILABLE)
clocksource_register_hz(&hyperv_cs_msr, NSEC_PER_SEC/100);
+
+ return;
+
+free_vp_index:
+ kfree(hv_vp_index);
+ hv_vp_index = NULL;
}

/*
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index efa1860..efd2f80 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -282,6 +282,30 @@ static inline u64 hv_do_rep_hypercall(u16 code, u16 rep_count, u16 varhead_size,
return status;
}

+/*
+ * Hypervisor's notion of virtual processor ID is different from
+ * Linux' notion of CPU ID. This information can only be retrieved
+ * in the context of the calling CPU. Setup a map for easy access
+ * to this information.
+ */
+extern u32 *hv_vp_index;
+
+/**
+ * hv_cpu_number_to_vp_number() - Map CPU to VP.
+ * @cpu_number: CPU number in Linux terms
+ *
+ * This function returns the mapping between the Linux processor
+ * number and the hypervisor's virtual processor number, useful
+ * in making hypercalls and such that talk about specific
+ * processors.
+ *
+ * Return: Virtual processor number in Hyper-V terms
+ */
+static inline int hv_cpu_number_to_vp_number(int cpu_number)
+{
+ return hv_vp_index[cpu_number];
+}
+
void hyperv_init(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index fd2b6c6..dc59019 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -599,7 +599,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
*/
channel->numa_node = 0;
channel->target_cpu = 0;
- channel->target_vp = hv_context.vp_index[0];
+ channel->target_vp = hv_cpu_number_to_vp_number(0);
return;
}

@@ -683,7 +683,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type)
}

channel->target_cpu = cur_cpu;
- channel->target_vp = hv_context.vp_index[cur_cpu];
+ channel->target_vp = hv_cpu_number_to_vp_number(cur_cpu);
}

static void vmbus_wait_for_unload(void)
@@ -1219,8 +1219,7 @@ struct vmbus_channel *vmbus_get_outgoing_channel(struct vmbus_channel *primary)
return outgoing_channel;
}

- cur_cpu = hv_context.vp_index[get_cpu()];
- put_cpu();
+ cur_cpu = hv_cpu_number_to_vp_number(smp_processor_id());
list_for_each_safe(cur, tmp, &primary->sc_list) {
cur_channel = list_entry(cur, struct vmbus_channel, sc_list);
if (cur_channel->state != CHANNEL_OPENED_STATE)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 37ecf51..f41901f 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -96,7 +96,8 @@ static int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo,
* the CPU attempting to connect may not be CPU 0.
*/
if (version >= VERSION_WIN8_1) {
- msg->target_vcpu = hv_context.vp_index[smp_processor_id()];
+ msg->target_vcpu =
+ hv_cpu_number_to_vp_number(smp_processor_id());
vmbus_connection.connect_cpu = smp_processor_id();
} else {
msg->target_vcpu = 0;
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index 2ea1220..8267439 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -234,7 +234,6 @@ int hv_synic_init(unsigned int cpu)
union hv_synic_siefp siefp;
union hv_synic_sint shared_sint;
union hv_synic_scontrol sctrl;
- u64 vp_index;

/* Setup the Synic's message page */
hv_get_simp(simp.as_uint64);
@@ -276,14 +275,6 @@ int hv_synic_init(unsigned int cpu)
hv_context.synic_initialized = true;

/*
- * Setup the mapping between Hyper-V's notion
- * of cpuid and Linux' notion of cpuid.
- * This array will be indexed using Linux cpuid.
- */
- hv_get_vp_index(vp_index);
- hv_context.vp_index[cpu] = (u32)vp_index;
-
- /*
* Register the per-cpu clockevent source.
*/
if (ms_hyperv.features & HV_X64_MSR_SYNTIMER_AVAILABLE)
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 1b6a5e0..49569f8 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -229,17 +229,6 @@ struct hv_context {
struct hv_per_cpu_context __percpu *cpu_context;

/*
- * Hypervisor's notion of virtual processor ID is different from
- * Linux' notion of CPU ID. This information can only be retrieved
- * in the context of the calling CPU. Setup a map for easy access
- * to this information:
- *
- * vp_index[a] is the Hyper-V's processor ID corresponding to
- * Linux cpuid 'a'.
- */
- u32 vp_index[NR_CPUS];
-
- /*
* To manage allocations in a NUMA node.
* Array indexed by numa node ID.
*/
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index ed84e96..c7e7d6d 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1451,23 +1451,6 @@ void vmbus_free_mmio(resource_size_t start, resource_size_t size)
}
EXPORT_SYMBOL_GPL(vmbus_free_mmio);

-/**
- * vmbus_cpu_number_to_vp_number() - Map CPU to VP.
- * @cpu_number: CPU number in Linux terms
- *
- * This function returns the mapping between the Linux processor
- * number and the hypervisor's virtual processor number, useful
- * in making hypercalls and such that talk about specific
- * processors.
- *
- * Return: Virtual processor number in Hyper-V terms
- */
-int vmbus_cpu_number_to_vp_number(int cpu_number)
-{
- return hv_context.vp_index[cpu_number];
-}
-EXPORT_SYMBOL_GPL(vmbus_cpu_number_to_vp_number);
-
static int vmbus_acpi_add(struct acpi_device *device)
{
acpi_status result;
diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 415dcc6..aba0414 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -562,52 +562,6 @@ static void put_pcichild(struct hv_pci_dev *hv_pcidev,
static void get_hvpcibus(struct hv_pcibus_device *hv_pcibus);
static void put_hvpcibus(struct hv_pcibus_device *hv_pcibus);

-
-/*
- * Temporary CPU to vCPU mapping to address transitioning
- * vmbus_cpu_number_to_vp_number() being migrated to
- * hv_cpu_number_to_vp_number() in a separate patch. Once that patch
- * has been picked up in the main line, remove this code here and use
- * the official code.
- */
-static struct hv_tmpcpumap
-{
- bool initialized;
- u32 vp_index[NR_CPUS];
-} hv_tmpcpumap;
-
-static void hv_tmpcpumap_init_cpu(void *_unused)
-{
- int cpu = smp_processor_id();
- u64 vp_index;
-
- hv_get_vp_index(vp_index);
-
- hv_tmpcpumap.vp_index[cpu] = vp_index;
-}
-
-static void hv_tmpcpumap_init(void)
-{
- if (hv_tmpcpumap.initialized)
- return;
-
- memset(hv_tmpcpumap.vp_index, -1, sizeof(hv_tmpcpumap.vp_index));
- on_each_cpu(hv_tmpcpumap_init_cpu, NULL, true);
- hv_tmpcpumap.initialized = true;
-}
-
-/**
- * hv_tmp_cpu_nr_to_vp_nr() - Convert Linux CPU nr to Hyper-V vCPU nr
- *
- * Remove once vmbus_cpu_number_to_vp_number() has been converted to
- * hv_cpu_number_to_vp_number() and replace callers appropriately.
- */
-static u32 hv_tmp_cpu_nr_to_vp_nr(int cpu)
-{
- return hv_tmpcpumap.vp_index[cpu];
-}
-
-
/**
* devfn_to_wslot() - Convert from Linux PCI slot to Windows
* @devfn: The Linux representation of PCI slot
@@ -971,7 +925,7 @@ static void hv_irq_unmask(struct irq_data *data)
var_size = 1 + HV_VP_SET_BANK_COUNT_MAX;

for_each_cpu_and(cpu, dest, cpu_online_mask) {
- cpu_vmbus = hv_tmp_cpu_nr_to_vp_nr(cpu);
+ cpu_vmbus = hv_cpu_number_to_vp_number(cpu);

if (cpu_vmbus >= HV_VP_SET_BANK_COUNT_MAX * 64) {
dev_err(&hbus->hdev->device,
@@ -986,7 +940,7 @@ static void hv_irq_unmask(struct irq_data *data)
} else {
for_each_cpu_and(cpu, dest, cpu_online_mask) {
params->int_target.vp_mask |=
- (1ULL << hv_tmp_cpu_nr_to_vp_nr(cpu));
+ (1ULL << hv_cpu_number_to_vp_number(cpu));
}
}

@@ -1063,7 +1017,7 @@ static u32 hv_compose_msi_req_v2(
*/
cpu = cpumask_first_and(affinity, cpu_online_mask);
int_pkt->int_desc.processor_array[0] =
- hv_tmp_cpu_nr_to_vp_nr(cpu);
+ hv_cpu_number_to_vp_number(cpu);
int_pkt->int_desc.processor_count = 1;

return sizeof(*int_pkt);
@@ -2490,8 +2444,6 @@ static int hv_pci_probe(struct hv_device *hdev,
return -ENOMEM;
hbus->state = hv_pcibus_init;

- hv_tmpcpumap_init();
-
/*
* The PCI bus "domain" is what is called "segment" in ACPI and
* other specs. Pull it from the instance ID, to get something
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index c472bd4..e2a4fa5 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1173,7 +1173,6 @@ int vmbus_allocate_mmio(struct resource **new, struct hv_device *device_obj,
resource_size_t size, resource_size_t align,
bool fb_overlap_ok);
void vmbus_free_mmio(resource_size_t start, resource_size_t size);
-int vmbus_cpu_number_to_vp_number(int cpu_number);

/*
* GUID definitions of various offer types - services offered to the guest.

Subject: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

Commit-ID: 88b46342eb037d35decda4d651cfee5216f4f822
Gitweb: http://git.kernel.org/tip/88b46342eb037d35decda4d651cfee5216f4f822
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:19 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Aug 2017 16:50:23 +0200

x86/hyper-v: Use hypercall for remote TLB flush

The Hyper-V host can suggest that the guest use a hypercall for remote TLB
flushes; this is supposed to be faster than IPIs.

Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
we need to put the input somewhere in memory, and we don't really want to
allocate memory on each call, so we pre-allocate per-cpu memory areas on
boot.

pv_ops patching happens very early, so we need to separate
hyperv_setup_mmu_ops() and hyper_alloc_mmu().

It is possible and easy to implement local TLB flushing too, and there is
even a hint for it. However, I don't see room for optimization on the host
side as both a hypercall and a native TLB flush result in a vmexit. The
hint is also not set on modern Hyper-V versions.
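
As a worked illustration of the gva_list encoding used below (the
addresses are made up): each 64-bit entry holds a page-aligned GVA, and
its low 12 bits hold the number of additional consecutive pages to flush,
so a single entry covers at most 4096 pages (HV_TLB_FLUSH_UNIT):

	/*
	 * Flushing 0x7f0000000000 .. 0x7f0000005000 (five 4K pages) yields
	 * one entry:
	 *     gva_list[0] = 0x7f0000000000 | 4;    (first page + 4 more)
	 * A range longer than HV_TLB_FLUSH_UNIT sets the low 12 bits to all
	 * ones and continues with another entry for the remainder.
	 */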

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/hyperv/Makefile | 2 +-
arch/x86/hyperv/hv_init.c | 2 +
arch/x86/hyperv/mmu.c | 138 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/mshyperv.h | 3 +
arch/x86/include/uapi/asm/hyperv.h | 7 ++
arch/x86/kernel/cpu/mshyperv.c | 1 +
6 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 171ae09..367a820 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1 +1 @@
-obj-y := hv_init.o
+obj-y := hv_init.o mmu.o
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e93b9a0..1a8eb55 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -140,6 +140,8 @@ void hyperv_init(void)
hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

+ hyper_alloc_mmu();
+
/*
* Register Hyper-V specific clocksource.
*/
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
new file mode 100644
index 0000000..9419a20
--- /dev/null
+++ b/arch/x86/hyperv/mmu.c
@@ -0,0 +1,138 @@
+#define pr_fmt(fmt) "Hyper-V: " fmt
+
+#include <linux/hyperv.h>
+#include <linux/log2.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include <asm/fpu/api.h>
+#include <asm/mshyperv.h>
+#include <asm/msr.h>
+#include <asm/tlbflush.h>
+
+/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
+struct hv_flush_pcpu {
+ u64 address_space;
+ u64 flags;
+ u64 processor_mask;
+ u64 gva_list[];
+};
+
+/* Each gva in gva_list encodes up to 4096 pages to flush */
+#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
+
+static struct hv_flush_pcpu __percpu *pcpu_flush;
+
+/*
+ * Fills in gva_list starting from offset. Returns the number of items added.
+ */
+static inline int fill_gva_list(u64 gva_list[], int offset,
+ unsigned long start, unsigned long end)
+{
+ int gva_n = offset;
+ unsigned long cur = start, diff;
+
+ do {
+ diff = end > cur ? end - cur : 0;
+
+ gva_list[gva_n] = cur & PAGE_MASK;
+ /*
+ * Lower 12 bits encode the number of additional
+ * pages to flush (in addition to the 'cur' page).
+ */
+ if (diff >= HV_TLB_FLUSH_UNIT)
+ gva_list[gva_n] |= ~PAGE_MASK;
+ else if (diff)
+ gva_list[gva_n] |= (diff - 1) >> PAGE_SHIFT;
+
+ cur += HV_TLB_FLUSH_UNIT;
+ gva_n++;
+
+ } while (cur < end);
+
+ return gva_n - offset;
+}
+
+static void hyperv_flush_tlb_others(const struct cpumask *cpus,
+ const struct flush_tlb_info *info)
+{
+ int cpu, vcpu, gva_n, max_gvas;
+ struct hv_flush_pcpu *flush;
+ u64 status = U64_MAX;
+ unsigned long flags;
+
+ if (!pcpu_flush || !hv_hypercall_pg)
+ goto do_native;
+
+ if (cpumask_empty(cpus))
+ return;
+
+ local_irq_save(flags);
+
+ flush = this_cpu_ptr(pcpu_flush);
+
+ if (info->mm) {
+ flush->address_space = virt_to_phys(info->mm->pgd);
+ flush->flags = 0;
+ } else {
+ flush->address_space = 0;
+ flush->flags = HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES;
+ }
+
+ flush->processor_mask = 0;
+ if (cpumask_equal(cpus, cpu_present_mask)) {
+ flush->flags |= HV_FLUSH_ALL_PROCESSORS;
+ } else {
+ for_each_cpu(cpu, cpus) {
+ vcpu = hv_cpu_number_to_vp_number(cpu);
+ if (vcpu >= 64)
+ goto do_native;
+
+ __set_bit(vcpu, (unsigned long *)
+ &flush->processor_mask);
+ }
+ }
+
+ /*
+ * We can flush not more than max_gvas with one hypercall. Flush the
+ * whole address space if we were asked to do more.
+ */
+ max_gvas = (PAGE_SIZE - sizeof(*flush)) / sizeof(flush->gva_list[0]);
+
+ if (info->end == TLB_FLUSH_ALL) {
+ flush->flags |= HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY;
+ status = hv_do_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE,
+ flush, NULL);
+ } else if (info->end &&
+ ((info->end - info->start)/HV_TLB_FLUSH_UNIT) > max_gvas) {
+ status = hv_do_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE,
+ flush, NULL);
+ } else {
+ gva_n = fill_gva_list(flush->gva_list, 0,
+ info->start, info->end);
+ status = hv_do_rep_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST,
+ gva_n, 0, flush, NULL);
+ }
+
+ local_irq_restore(flags);
+
+ if (!(status & HV_HYPERCALL_RESULT_MASK))
+ return;
+do_native:
+ native_flush_tlb_others(cpus, info);
+}
+
+void hyperv_setup_mmu_ops(void)
+{
+ if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED) {
+ pr_info("Using hypercall for remote TLB flush\n");
+ pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+ }
+}
+
+void hyper_alloc_mmu(void)
+{
+ if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED)
+ pcpu_flush = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
+}
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index efd2f80..0d4b01c 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -307,6 +307,8 @@ static inline int hv_cpu_number_to_vp_number(int cpu_number)
}

void hyperv_init(void);
+void hyperv_setup_mmu_ops(void);
+void hyper_alloc_mmu(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);
void hyperv_cleanup(void);
@@ -314,6 +316,7 @@ void hyperv_cleanup(void);
static inline void hyperv_init(void) {}
static inline bool hv_is_hypercall_page_setup(void) { return false; }
static inline void hyperv_cleanup(void) {}
+static inline void hyperv_setup_mmu_ops(void) {}
#endif /* CONFIG_HYPERV */

#ifdef CONFIG_HYPERV_TSCPAGE
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index 127ddad..a6fdd3b 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -242,6 +242,8 @@
(~((1ull << HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT) - 1))

/* Declare the various hypercall operations. */
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003
#define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008
#define HVCALL_POST_MESSAGE 0x005c
#define HVCALL_SIGNAL_EVENT 0x005d
@@ -259,6 +261,11 @@
#define HV_PROCESSOR_POWER_STATE_C2 2
#define HV_PROCESSOR_POWER_STATE_C3 3

+#define HV_FLUSH_ALL_PROCESSORS BIT(0)
+#define HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES BIT(1)
+#define HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY BIT(2)
+#define HV_FLUSH_USE_EXTENDED_RANGE_FORMAT BIT(3)
+
/* hypercall status code */
#define HV_STATUS_SUCCESS 0
#define HV_STATUS_INVALID_HYPERCALL_CODE 2
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 70e717f..daefd67 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -249,6 +249,7 @@ static void __init ms_hyperv_init_platform(void)
* Setup the hook to get control post apic initialization.
*/
x86_platform.apic_post_init = hyperv_init;
+ hyperv_setup_mmu_ops();
#endif
}


2017-08-10 17:00:22

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

Ingo Molnar <[email protected]> writes:

> I'm getting this build failure with this series:
>
> arch/x86/hyperv/mmu.c: In function ‘hyperv_setup_mmu_ops’:
> arch/x86/hyperv/mmu.c:256:3: error: ‘pv_mmu_ops’ undeclared (first use in this
> function)
> pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;
> ^
>
> with the attached (rand-)config.
>

> # CONFIG_PARAVIRT is not set

Ouch, CONFIG_PARAVIRT is definitely required for the new feature. Sorry :-(

I think the best way to handle this (also with the upcoming PV spinlocks
for Hyper-V in mind) is something like

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index c29cd5387a35..50b89ea0e60f 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -3,6 +3,7 @@ menu "Microsoft Hyper-V guest support"
config HYPERV
tristate "Microsoft Hyper-V client drivers"
depends on X86 && ACPI && PCI && X86_LOCAL_APIC && HYPERVISOR_GUEST
+ select PARAVIRT
help
Select this option to run Linux as a Hyper-V client operating
system.

added to PATCH7 of the series. In case nobody objects, would you like me
to resend the patch or do the whole v11 submission?

Thanks and sorry for the breakage,

--
Vitaly

Subject: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

Commit-ID: 2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
Gitweb: http://git.kernel.org/tip/2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:19 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 10 Aug 2017 20:16:44 +0200

x86/hyper-v: Use hypercall for remote TLB flush

The Hyper-V host can suggest that the guest use a hypercall for remote TLB
flushes; this is supposed to be faster than IPIs.

Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
we need to put the input somewhere in memory, and we don't really want to
allocate memory on each call, so we pre-allocate per-cpu memory areas on
boot.

pv_ops patching happens very early, so we need to separate
hyperv_setup_mmu_ops() and hyper_alloc_mmu().

It is possible and easy to implement local TLB flushing too, and there is
even a hint for it. However, I don't see room for optimization on the host
side as both a hypercall and a native TLB flush result in a vmexit. The
hint is also not set on modern Hyper-V versions.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/hyperv/Makefile | 2 +-
arch/x86/hyperv/hv_init.c | 2 +
arch/x86/hyperv/mmu.c | 138 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/mshyperv.h | 3 +
arch/x86/include/uapi/asm/hyperv.h | 7 ++
arch/x86/kernel/cpu/mshyperv.c | 1 +
drivers/hv/Kconfig | 1 +
7 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index 171ae09..367a820 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -1 +1 @@
-obj-y := hv_init.o
+obj-y := hv_init.o mmu.o
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e93b9a0..1a8eb55 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -140,6 +140,8 @@ void hyperv_init(void)
hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);

+ hyper_alloc_mmu();
+
/*
* Register Hyper-V specific clocksource.
*/
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
new file mode 100644
index 0000000..9419a20
--- /dev/null
+++ b/arch/x86/hyperv/mmu.c
@@ -0,0 +1,138 @@
+#define pr_fmt(fmt) "Hyper-V: " fmt
+
+#include <linux/hyperv.h>
+#include <linux/log2.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include <asm/fpu/api.h>
+#include <asm/mshyperv.h>
+#include <asm/msr.h>
+#include <asm/tlbflush.h>
+
+/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
+struct hv_flush_pcpu {
+ u64 address_space;
+ u64 flags;
+ u64 processor_mask;
+ u64 gva_list[];
+};
+
+/* Each gva in gva_list encodes up to 4096 pages to flush */
+#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)
+
+static struct hv_flush_pcpu __percpu *pcpu_flush;
+
+/*
+ * Fills in gva_list starting from offset. Returns the number of items added.
+ */
+static inline int fill_gva_list(u64 gva_list[], int offset,
+ unsigned long start, unsigned long end)
+{
+ int gva_n = offset;
+ unsigned long cur = start, diff;
+
+ do {
+ diff = end > cur ? end - cur : 0;
+
+ gva_list[gva_n] = cur & PAGE_MASK;
+ /*
+ * Lower 12 bits encode the number of additional
+ * pages to flush (in addition to the 'cur' page).
+ */
+ if (diff >= HV_TLB_FLUSH_UNIT)
+ gva_list[gva_n] |= ~PAGE_MASK;
+ else if (diff)
+ gva_list[gva_n] |= (diff - 1) >> PAGE_SHIFT;
+
+ cur += HV_TLB_FLUSH_UNIT;
+ gva_n++;
+
+ } while (cur < end);
+
+ return gva_n - offset;
+}
+
+static void hyperv_flush_tlb_others(const struct cpumask *cpus,
+ const struct flush_tlb_info *info)
+{
+ int cpu, vcpu, gva_n, max_gvas;
+ struct hv_flush_pcpu *flush;
+ u64 status = U64_MAX;
+ unsigned long flags;
+
+ if (!pcpu_flush || !hv_hypercall_pg)
+ goto do_native;
+
+ if (cpumask_empty(cpus))
+ return;
+
+ local_irq_save(flags);
+
+ flush = this_cpu_ptr(pcpu_flush);
+
+ if (info->mm) {
+ flush->address_space = virt_to_phys(info->mm->pgd);
+ flush->flags = 0;
+ } else {
+ flush->address_space = 0;
+ flush->flags = HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES;
+ }
+
+ flush->processor_mask = 0;
+ if (cpumask_equal(cpus, cpu_present_mask)) {
+ flush->flags |= HV_FLUSH_ALL_PROCESSORS;
+ } else {
+ for_each_cpu(cpu, cpus) {
+ vcpu = hv_cpu_number_to_vp_number(cpu);
+ if (vcpu >= 64)
+ goto do_native;
+
+ __set_bit(vcpu, (unsigned long *)
+ &flush->processor_mask);
+ }
+ }
+
+ /*
+ * We can flush not more than max_gvas with one hypercall. Flush the
+ * whole address space if we were asked to do more.
+ */
+ max_gvas = (PAGE_SIZE - sizeof(*flush)) / sizeof(flush->gva_list[0]);
+
+ if (info->end == TLB_FLUSH_ALL) {
+ flush->flags |= HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY;
+ status = hv_do_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE,
+ flush, NULL);
+ } else if (info->end &&
+ ((info->end - info->start)/HV_TLB_FLUSH_UNIT) > max_gvas) {
+ status = hv_do_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE,
+ flush, NULL);
+ } else {
+ gva_n = fill_gva_list(flush->gva_list, 0,
+ info->start, info->end);
+ status = hv_do_rep_hypercall(HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST,
+ gva_n, 0, flush, NULL);
+ }
+
+ local_irq_restore(flags);
+
+ if (!(status & HV_HYPERCALL_RESULT_MASK))
+ return;
+do_native:
+ native_flush_tlb_others(cpus, info);
+}
+
+void hyperv_setup_mmu_ops(void)
+{
+ if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED) {
+ pr_info("Using hypercall for remote TLB flush\n");
+ pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+ }
+}
+
+void hyper_alloc_mmu(void)
+{
+ if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED)
+ pcpu_flush = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
+}
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index efd2f80..0d4b01c 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -307,6 +307,8 @@ static inline int hv_cpu_number_to_vp_number(int cpu_number)
}

void hyperv_init(void);
+void hyperv_setup_mmu_ops(void);
+void hyper_alloc_mmu(void);
void hyperv_report_panic(struct pt_regs *regs);
bool hv_is_hypercall_page_setup(void);
void hyperv_cleanup(void);
@@ -314,6 +316,7 @@ void hyperv_cleanup(void);
static inline void hyperv_init(void) {}
static inline bool hv_is_hypercall_page_setup(void) { return false; }
static inline void hyperv_cleanup(void) {}
+static inline void hyperv_setup_mmu_ops(void) {}
#endif /* CONFIG_HYPERV */

#ifdef CONFIG_HYPERV_TSCPAGE
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index 127ddad..a6fdd3b 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -242,6 +242,8 @@
(~((1ull << HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT) - 1))

/* Declare the various hypercall operations. */
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003
#define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008
#define HVCALL_POST_MESSAGE 0x005c
#define HVCALL_SIGNAL_EVENT 0x005d
@@ -259,6 +261,11 @@
#define HV_PROCESSOR_POWER_STATE_C2 2
#define HV_PROCESSOR_POWER_STATE_C3 3

+#define HV_FLUSH_ALL_PROCESSORS BIT(0)
+#define HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES BIT(1)
+#define HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY BIT(2)
+#define HV_FLUSH_USE_EXTENDED_RANGE_FORMAT BIT(3)
+
/* hypercall status code */
#define HV_STATUS_SUCCESS 0
#define HV_STATUS_INVALID_HYPERCALL_CODE 2
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 70e717f..daefd67 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -249,6 +249,7 @@ static void __init ms_hyperv_init_platform(void)
* Setup the hook to get control post apic initialization.
*/
x86_platform.apic_post_init = hyperv_init;
+ hyperv_setup_mmu_ops();
#endif
}

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index c29cd53..50b89ea 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -3,6 +3,7 @@ menu "Microsoft Hyper-V guest support"
config HYPERV
tristate "Microsoft Hyper-V client drivers"
depends on X86 && ACPI && PCI && X86_LOCAL_APIC && HYPERVISOR_GUEST
+ select PARAVIRT
help
Select this option to run Linux as a Hyper-V client operating
system.

2017-08-10 18:56:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Thu, Aug 10, 2017 at 11:21:49AM -0700, tip-bot for Vitaly Kuznetsov wrote:
> Commit-ID: 2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
> Gitweb: http://git.kernel.org/tip/2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
> Author: Vitaly Kuznetsov <[email protected]>
> AuthorDate: Wed, 2 Aug 2017 18:09:19 +0200
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Thu, 10 Aug 2017 20:16:44 +0200
>
> x86/hyper-v: Use hypercall for remote TLB flush
>
> Hyper-V host can suggest us to use hypercall for doing remote TLB flush,
> this is supposed to work faster than IPIs.
>
> Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
> we need to put the input somewhere in memory and we don't really want to
> have memory allocation on each call so we pre-allocate per cpu memory areas
> on boot.
>
> pv_ops patching is happening very early so we need to separate
> hyperv_setup_mmu_ops() and hyper_alloc_mmu().
>
> It is possible and easy to implement local TLB flushing too and there is
> even a hint for that. However, I don't see a room for optimization on the
> host side as both hypercall and native tlb flush will result in vmexit. The
> hint is also not set on modern Hyper-V versions.

Hold on.. if we don't IPI for TLB invalidation. What serializes our
software page table walkers like fast_gup() ?

2017-08-10 18:59:18

by KY Srinivasan

[permalink] [raw]
Subject: RE: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush



> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Thursday, August 10, 2017 11:57 AM
> To: Simon Xiao <[email protected]>; Haiyang Zhang
> <[email protected]>; Jork Loeser <[email protected]>;
> Stephen Hemminger <[email protected]>; torvalds@linux-
> foundation.org; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; KY Srinivasan
> <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB
> flush
>
> On Thu, Aug 10, 2017 at 11:21:49AM -0700, tip-bot for Vitaly Kuznetsov
> wrote:
> > Commit-ID: 2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
> > Gitweb: http://git.kernel.org/tip/2ffd9e33ce4af4e8cfa3e17bf493defe8474e2eb
> > Author: Vitaly Kuznetsov <[email protected]>
> > AuthorDate: Wed, 2 Aug 2017 18:09:19 +0200
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Thu, 10 Aug 2017 20:16:44 +0200
> >
> > x86/hyper-v: Use hypercall for remote TLB flush
> >
> > Hyper-V host can suggest us to use hypercall for doing remote TLB flush,
> > this is supposed to work faster than IPIs.
> >
> > Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls
> > we need to put the input somewhere in memory and we don't really want
> to
> > have memory allocation on each call so we pre-allocate per cpu memory
> areas
> > on boot.
> >
> > pv_ops patching is happening very early so we need to separate
> > hyperv_setup_mmu_ops() and hyper_alloc_mmu().
> >
> > It is possible and easy to implement local TLB flushing too and there is
> > even a hint for that. However, I don't see a room for optimization on the
> > host side as both hypercall and native tlb flush will result in vmexit. The
> > hint is also not set on modern Hyper-V versions.
>
> Hold on.. if we don't IPI for TLB invalidation. What serializes our
> software page table walkers like fast_gup() ?

Hypervisor may implement this functionality via an IPI.

K. Y

2017-08-10 19:08:27

by Jork Loeser

[permalink] [raw]
Subject: RE: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

> -----Original Message-----
> From: KY Srinivasan


> > -----Original Message-----
> > From: Peter Zijlstra [mailto:[email protected]]
> > Sent: Thursday, August 10, 2017 11:57 AM
> > To: Simon Xiao <[email protected]>; Haiyang Zhang
> > <[email protected]>; Jork Loeser <[email protected]>;
> > Stephen Hemminger <[email protected]>; torvalds@linux-
> > foundation.org; [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; KY Srinivasan
> > <[email protected]>; [email protected]
> > Cc: [email protected]
> > Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote
> > TLB flush

> > Hold on.. if we don't IPI for TLB invalidation. What serializes our
> > software page table walkers like fast_gup() ?
>
> Hypervisor may implement this functionality via an IPI.
>
> K. Y

HvFlushVirtualAddressList() states:
This call guarantees that by the time control returns back to the caller, the observable effects of all flushes on the specified virtual processors have occurred.

HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding sparse target VP lists.

Is this enough of a guarantee, or do you see other races?

Regards,
Jork


2017-08-10 19:27:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Thu, Aug 10, 2017 at 07:08:22PM +0000, Jork Loeser wrote:

> > > Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush
>
> > > Hold on.. if we don't IPI for TLB invalidation. What serializes our
> > > software page table walkers like fast_gup() ?
> >
> > Hypervisor may implement this functionality via an IPI.
> >
> > K. Y
>
> HvFlushVirtualAddressList() states:
> This call guarantees that by the time control returns back to the
> caller, the observable effects of all flushes on the specified virtual
> processors have occurred.
>
> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding sparse target VP lists.
>
> Is this enough of a guarantee, or do you see other races?

That's nowhere near enough. We need the remote CPU to have completed any
guest IF section that was in progress at the time of the call.

So if a host IPI can interrupt a guest while the guest has IF cleared,
and we then process the host IPI -- clear the TLBs -- before resuming the
guest, which still has IF cleared, we've got a problem.

Because at that point, our software page-table walker, that relies on IF
being clear to guarantee the page-tables exist, because it holds off the
TLB invalidate and thereby the freeing of the pages, gets its pages
ripped out from under it.
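
For context, a minimal sketch of the lockless walk being described
(assuming the native x86 scheme of that time; this is not the real
mm/gup.c code and the function name is made up):

	static int gup_fast_sketch(unsigned long addr)
	{
		unsigned long flags;
		int found = 0;

		local_irq_save(flags);
		/*
		 * Walk pgd -> pud -> pmd -> pte without taking locks. This is
		 * only safe because a native flush_tlb_others() sends an IPI
		 * and waits for it; that IPI cannot be serviced here while
		 * interrupts are off, so the page-table pages cannot be freed
		 * while the walk is in progress.
		 */
		local_irq_restore(flags);
		return found;
	}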

2017-08-11 01:15:23

by Jork Loeser

[permalink] [raw]
Subject: RE: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

> -----Original Message-----
> From: Peter Zijlstra [mailto:[email protected]]
> Sent: Thursday, August 10, 2017 12:28
> To: Jork Loeser <[email protected]>
> Cc: KY Srinivasan <[email protected]>; Simon Xiao <[email protected]>;
> Haiyang Zhang <[email protected]>; Stephen Hemminger
> <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

> > > > Hold on.. if we don't IPI for TLB invalidation. What serializes
> > > > our software page table walkers like fast_gup() ?
> > >
> > > Hypervisor may implement this functionality via an IPI.
> > >
> > > K. Y
> >
> > HvFlushVirtualAddressList() states:
> > This call guarantees that by the time control returns back to the
> > caller, the observable effects of all flushes on the specified virtual
> > processors have occurred.
> >
> > HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding
> sparse target VP lists.
> >
> > Is this enough of a guarantee, or do you see other races?
>
> That's nowhere near enough. We need the remote CPU to have completed any
> guest IF section that was in progress at the time of the call.
>
> So if a host IPI can interrupt a guest while the guest has IF cleared, and we then
> process the host IPI -- clear the TLBs -- before resuming the guest, which still has
> IF cleared, we've got a problem.
>
> Because at that point, our software page-table walker, that relies on IF being
> clear to guarantee the page-tables exist, because it holds off the TLB invalidate
> and thereby the freeing of the pages, gets its pages ripped out from under it.

I see, IF is used as a locking mechanism for the pages. Would CONFIG_HAVE_RCU_TABLE_FREE be an option for x86? There are caveats (statically enabled, RCU for page-free), yet if the resulting perf is still a gain it would be worthwhile for Hyper-V targeted kernels.

Regards,
Jork

2017-08-11 09:03:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, Aug 11, 2017 at 01:15:18AM +0000, Jork Loeser wrote:

> > > HvFlushVirtualAddressList() states:
> > > This call guarantees that by the time control returns back to the
> > > caller, the observable effects of all flushes on the specified virtual
> > > processors have occurred.
> > >
> > > HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding
> > > sparse target VP lists.
> > >
> > > Is this enough of a guarantee, or do you see other races?
> >
> > That's nowhere near enough. We need the remote CPU to have completed any
> > guest IF section that was in progress at the time of the call.
> >
> > So if a host IPI can interrupt a guest while the guest has IF cleared, and we then
> > process the host IPI -- clear the TLBs -- before resuming the guest, which still has
> > IF cleared, we've got a problem.
> >
> > Because at that point, our software page-table walker, that relies on IF being
> > clear to guarantee the page-tables exist, because it holds off the TLB invalidate
> > and thereby the freeing of the pages, gets its pages ripped out from under it.
>
> I see, IF is used as a locking mechanism for the pages. Would
> CONFIG_HAVE_RCU_TABLE_FREE be an option for x86? There are caveats
> (statically enabled, RCU for page-free), yet if the resulting perf is
> still a gain it would be worthwhile for Hyper-V targeted kernels.

I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
would make it work again), but this was some years ago and I cannot
readily find those emails.

Kirill would you have any opinions?

2017-08-11 09:23:17

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

Peter Zijlstra <[email protected]> writes:

> On Thu, Aug 10, 2017 at 07:08:22PM +0000, Jork Loeser wrote:
>
>> > > Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush
>>
>> > > Hold on.. if we don't IPI for TLB invalidation. What serializes our
>> > > software page table walkers like fast_gup() ?
>> >
>> > Hypervisor may implement this functionality via an IPI.
>> >
>> > K. Y
>>
>> HvFlushVirtualAddressList() states:
>> This call guarantees that by the time control returns back to the
>> caller, the observable effects of all flushes on the specified virtual
>> processors have occurred.
>>
>> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding sparse target VP lists.
>>
>> Is this enough of a guarantee, or do you see other races?
>
> That's nowhere near enough. We need the remote CPU to have completed any
> guest IF section that was in progress at the time of the call.
>
> So if a host IPI can interrupt a guest while the guest has IF cleared,
> and we then process the host IPI -- clear the TLBs -- before resuming the
> guest, which still has IF cleared, we've got a problem.
>
> Because at that point, our software page-table walker, that relies on IF
> being clear to guarantee the page-tables exist, because it holds off the
> TLB invalidate and thereby the freeing of the pages, gets its pages
> ripped out from under it.

Oh, I see your concern. Hyper-V, however, is not the first x86
hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
too. Briefly looking at xen_flush_tlb_others() I don't see anything
special, do we know how serialization is achieved there?

--
Vitaly

2017-08-11 10:56:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, Aug 11, 2017 at 11:23:10AM +0200, Vitaly Kuznetsov wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > On Thu, Aug 10, 2017 at 07:08:22PM +0000, Jork Loeser wrote:
> >
> >> > > Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush
> >>
> >> > > Hold on.. if we don't IPI for TLB invalidation. What serializes our
> >> > > software page table walkers like fast_gup() ?
> >> >
> >> > Hypervisor may implement this functionality via an IPI.
> >> >
> >> > K. Y
> >>
> >> HvFlushVirtualAddressList() states:
> >> This call guarantees that by the time control returns back to the
> >> caller, the observable effects of all flushes on the specified virtual
> >> processors have occurred.
> >>
> >> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding sparse target VP lists.
> >>
> >> Is this enough of a guarantee, or do you see other races?
> >
> > That's nowhere near enough. We need the remote CPU to have completed any
> > guest IF section that was in progress at the time of the call.
> >
> > So if a host IPI can interrupt a guest while the guest has IF cleared,
> > and we then process the host IPI -- clear the TLBs -- before resuming the
> > guest, which still has IF cleared, we've got a problem.
> >
> > Because at that point, our software page-table walker, that relies on IF
> > being clear to guarantee the page-tables exist, because it holds off the
> > TLB invalidate and thereby the freeing of the pages, gets its pages
> > ripped out from under it.
>
> Oh, I see your concern. Hyper-V, however, is not the first x86
> hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
> too. Briefly looking at xen_flush_tlb_others() I don't see anything
> special, do we know how serialization is achieved there?

No idea on how Xen works, I always just hope it goes away :-) But lets
ask some Xen folks.

2017-08-11 11:05:50

by Andrew Cooper

[permalink] [raw]
Subject: Re: [Xen-devel] [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On 11/08/17 11:56, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 11:23:10AM +0200, Vitaly Kuznetsov wrote:
>> Peter Zijlstra <[email protected]> writes:
>>
>>> On Thu, Aug 10, 2017 at 07:08:22PM +0000, Jork Loeser wrote:
>>>
>>>>>> Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush
>>>>>> Hold on.. if we don't IPI for TLB invalidation. What serializes our
>>>>>> software page table walkers like fast_gup() ?
>>>>> Hypervisor may implement this functionality via an IPI.
>>>>>
>>>>> K. Y
>>>> HvFlushVirtualAddressList() states:
>>>> This call guarantees that by the time control returns back to the
>>>> caller, the observable effects of all flushes on the specified virtual
>>>> processors have occurred.
>>>>
>>>> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding sparse target VP lists.
>>>>
>>>> Is this enough of a guarantee, or do you see other races?
>>> That's nowhere near enough. We need the remote CPU to have completed any
>>> guest IF section that was in progress at the time of the call.
>>>
>>> So if a host IPI can interrupt a guest while the guest has IF cleared,
>>> and we then process the host IPI -- clear the TLBs -- before resuming the
>>> guest, which still has IF cleared, we've got a problem.
>>>
>>> Because at that point, our software page-table walker, that relies on IF
>>> being clear to guarantee the page-tables exist, because it holds off the
>>> TLB invalidate and thereby the freeing of the pages, gets its pages
>>> ripped out from under it.
>> Oh, I see your concern. Hyper-V, however, is not the first x86
>> hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
>> too. Briefly looking at xen_flush_tlb_others() I don't see anything
>> special, do we know how serialization is achieved there?
> No idea on how Xen works, I always just hope it goes away :-) But lets
> ask some Xen folks.

How is the software pagewalker relying on IF being clear safe at all (on
native, let alone under virtualisation)? Hardware has no architectural
requirement to keep entries in the TLB.

In the virtualisation case, at any point the vcpu can be scheduled on a
different pcpu even during a critical region like that, so the TLB
really can empty itself under your feet.

~Andrew

2017-08-11 11:29:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, Aug 11, 2017 at 11:03:36AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 01:15:18AM +0000, Jork Loeser wrote:
>
> > > > HvFlushVirtualAddressList() states:
> > > > This call guarantees that by the time control returns back to the
> > > > caller, the observable effects of all flushes on the specified virtual
> > > > processors have occurred.
> > > >
> > > > HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding
> > > > sparse target VP lists.
> > > >
> > > > Is this enough of a guarantee, or do you see other races?
> > >
> > > That's nowhere near enough. We need the remote CPU to have completed any
> > > guest IF section that was in progress at the time of the call.
> > >
> > > So if a host IPI can interrupt a guest while the guest has IF cleared, and we then
> > > process the host IPI -- clear the TLBs -- before resuming the guest, which still has
> > > IF cleared, we've got a problem.
> > >
> > > Because at that point, our software page-table walker, that relies on IF being
> > > clear to guarantee the page-tables exist, because it holds off the TLB invalidate
> > > and thereby the freeing of the pages, gets its pages ripped out from under it.
> >
> > I see, IF is used as a locking mechanism for the pages. Would
> > CONFIG_HAVE_RCU_TABLE_FREE be an option for x86? There are caveats
> > (statically enabled, RCU for page-free), yet if the resulting perf is
> > still a gain it would be worthwhile for Hyper-V targeted kernels.
>
> I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
> would make it work again), but this was some years ago and I cannot
> readily find those emails.
>
> Kirill would you have any opinions?

I guess we can try this. The main question is what would be performance
implications of such move.

--
Kirill A. Shutemov

2017-08-11 12:07:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Xen-devel] [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, Aug 11, 2017 at 12:05:45PM +0100, Andrew Cooper wrote:
> >> Oh, I see your concern. Hyper-V, however, is not the first x86
> >> hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
> >> too. Briefly looking at xen_flush_tlb_others() I don't see anything
> >> special, do we know how serialization is achieved there?
> > No idea on how Xen works, I always just hope it goes away :-) But lets
> > ask some Xen folks.
>
> How is the software pagewalker relying on IF being clear safe at all (on
> native, let alone under virtualisation)? Hardware has no architectural
> requirement to keep entries in the TLB.

No, but it _can_, therefore when we unhook pages we _must_ invalidate.

It goes like:

CPU0 CPU1

unhook page
cli
traverse page tables
TLB invalidate ---> <IF clear, therefore CPU0 waits>
sti
<IPI>
TLB invalidate
<------ complete
</IPI>
free page

So the CPU1 page-table walker gets an existence guarantee of the
page-tables by clearing IF.

> In the virtualisation case, at any point the vcpu can be scheduled on a
> different pcpu even during a critical region like that, so the TLB
> really can empty itself under your feet.

Not the point.
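
A minimal sketch of the lockless-walker pattern the diagram above describes (illustration only, not the kernel's actual fast_gup() code; walk_page_tables() is a made-up helper): disabling interrupts holds off the TLB-invalidate IPI, and the unmapping CPU frees the page-table pages only after that IPI has completed, so the walk cannot see freed tables.

#include <linux/irqflags.h>
#include <linux/mm.h>

/* Illustrative sketch only; not the real fast_gup() implementation. */
static int walker_lookup(struct mm_struct *mm, unsigned long addr,
			 struct page **page)
{
	unsigned long flags;
	int ret;

	local_irq_save(flags);	/* "cli": the flush IPI is now held off */
	/*
	 * Until local_irq_restore(), the page tables covering 'addr'
	 * cannot be freed: the unmap path on another CPU must first
	 * complete its TLB-invalidate IPI to this CPU, and that IPI
	 * cannot be delivered while interrupts are off here.
	 */
	ret = walk_page_tables(mm, addr, page);	/* hypothetical helper */
	local_irq_restore(flags);	/* "sti": any pending IPI runs now */

	return ret;
}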

2017-08-11 12:22:31

by Juergen Gross

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On 11/08/17 12:56, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 11:23:10AM +0200, Vitaly Kuznetsov wrote:
>> Peter Zijlstra <[email protected]> writes:
>>
>>> On Thu, Aug 10, 2017 at 07:08:22PM +0000, Jork Loeser wrote:
>>>
>>>>>> Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush
>>>>
>>>>>> Hold on.. if we don't IPI for TLB invalidation. What serializes our
>>>>>> software page table walkers like fast_gup() ?
>>>>>
>>>>> Hypervisor may implement this functionality via an IPI.
>>>>>
>>>>> K. Y
>>>>
>>>> HvFlushVirtualAddressList() states:
>>>> This call guarantees that by the time control returns back to the
>>>> caller, the observable effects of all flushes on the specified virtual
>>>> processors have occurred.
>>>>
>>>> HvFlushVirtualAddressListEx() refers to HvFlushVirtualAddressList() as adding sparse target VP lists.
>>>>
>>>> Is this enough of a guarantee, or do you see other races?
>>>
>>> That's nowhere near enough. We need the remote CPU to have completed any
>>> guest IF section that was in progress at the time of the call.
>>>
>>> So if a host IPI can interrupt a guest while the guest has IF cleared,
>>> and we then process the host IPI -- clear the TLBs -- before resuming the
>>> guest, which still has IF cleared, we've got a problem.
>>>
>>> Because at that point, our software page-table walker, that relies on IF
>>> being clear to guarantee the page-tables exist, because it holds off the
>>> TLB invalidate and thereby the freeing of the pages, gets its pages
>>> ripped out from under it.
>>
>> Oh, I see your concern. Hyper-V, however, is not the first x86
>> hypervisor trying to avoid IPIs on remote TLB flush, Xen does this
>> too. Briefly looking at xen_flush_tlb_others() I don't see anything
>> special, do we know how serialization is achieved there?
>
> No idea on how Xen works, I always just hope it goes away :-) But lets
> ask some Xen folks.

Wait - the TLB can be cleared at any time, as Andrew was pointing out.
No cpu can rely on an address being accessible just because IF is being
cleared. All that matters is the existing and valid page table entry.

So clearing IF on a cpu isn't meant to secure the TLB from being
cleared, but just to avoid interrupts (as the name of the flag is
suggesting).

In the Xen case the hypervisor does the following:

- it checks whether any of the vcpus specified in the cpumask of the
flush request is running on any physical cpu
- if any running vcpu is found an IPI will be sent to the physical cpu
and the hypervisor will do the TLB flush there
- any vcpu addressed by the flush and not running will be flagged to
flush its TLB when being scheduled the next time

This ensures no TLB entry to be flushed can be used after return of
xen_flush_tlb_others().


Juergen
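
A rough pseudo-code rendering of the three steps Juergen lists above (every identifier here is invented for illustration; this is not Xen source):

/* Hypervisor-side sketch of the flush behaviour described above. */
static void xen_like_flush_tlb_others(const vcpumask_t *vcpus)
{
	struct vcpu *v;

	for_each_vcpu_in_mask(v, vcpus) {
		if (vcpu_is_running(v))
			/* flush immediately via IPI to the physical CPU */
			ipi_flush_tlb(vcpu_to_pcpu(v));
		else
			/* flush lazily when the vcpu is next scheduled */
			set_bit(VCPU_NEEDS_TLB_FLUSH, &v->pause_flags);
	}

	/* return only once all IPI'd physical CPUs completed their flush */
	wait_for_flush_ipis();
}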

2017-08-11 12:35:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, Aug 11, 2017 at 02:22:25PM +0200, Juergen Gross wrote:
> Wait - the TLB can be cleared at any time, as Andrew was pointing out.
> No cpu can rely on an address being accessible just because IF is being
> cleared. All that matters is the existing and valid page table entry.
>
> So clearing IF on a cpu isn't meant to secure the TLB from being
> cleared, but just to avoid interrupts (as the name of the flag is
> suggesting).

Yes, but by holding off the TLB invalidate IPI, we hold off the freeing
of the concurrently unhooked page-table.

> In the Xen case the hypervisor does the following:
>
> - it checks whether any of the vcpus specified in the cpumask of the
> flush request is running on any physical cpu
> - if any running vcpu is found an IPI will be sent to the physical cpu
> and the hypervisor will do the TLB flush there

And this will preempt a vcpu which could have IF cleared, right?

> - any vcpu addressed by the flush and not running will be flagged to
> flush its TLB when being scheduled the next time
>
> This ensures no TLB entry to be flushed can be used after return of
> xen_flush_tlb_others().

But that is not a sufficient guarantee. We need the IF to hold off the
TLB invalidate and thereby hold off the freeing of our page-table pages.

2017-08-11 12:46:48

by Juergen Gross

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On 11/08/17 14:35, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 02:22:25PM +0200, Juergen Gross wrote:
>> Wait - the TLB can be cleared at any time, as Andrew was pointing out.
>> No cpu can rely on an address being accessible just because IF is being
>> cleared. All that matters is the existing and valid page table entry.
>>
>> So clearing IF on a cpu isn't meant to secure the TLB from being
>> cleared, but just to avoid interrupts (as the name of the flag is
>> suggesting).
>
> Yes, but by holding off the TLB invalidate IPI, we hold off the freeing
> of the concurrently unhooked page-table.
>
>> In the Xen case the hypervisor does the following:
>>
>> - it checks whether any of the vcpus specified in the cpumask of the
>> flush request is running on any physical cpu
>> - if any running vcpu is found an IPI will be sent to the physical cpu
>> and the hypervisor will do the TLB flush there
>
> And this will preempt a vcpu which could have IF cleared, right?
>
>> - any vcpu addressed by the flush and not running will be flagged to
>> flush its TLB when being scheduled the next time
>>
>> This ensures no TLB entry to be flushed can be used after return of
>> xen_flush_tlb_others().
>
> But that is not a sufficient guarantee. We need the IF to hold off the
> TLB invalidate and thereby hold off the freeing of our page-table pages.

Aah, okay. Now I understand the problem. The TLB isn't the issue but the
IPI is serving two purposes here: TLB flushing (which is allowed to
happen at any time) and serialization regarding access to critical pages
(which seems to be broken in the Xen case as you suggest).

Juergen

>

2017-08-11 12:55:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, Aug 11, 2017 at 02:46:41PM +0200, Juergen Gross wrote:
> Aah, okay. Now I understand the problem. The TLB isn't the issue but the
> IPI is serving two purposes here: TLB flushing (which is allowed to
> happen at any time) and serialization regarding access to critical pages
> (which seems to be broken in the Xen case as you suggest).

Indeed, and now hyper-v as well.

2017-08-11 13:07:37

by Juergen Gross

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On 11/08/17 14:54, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 02:46:41PM +0200, Juergen Gross wrote:
>> Aah, okay. Now I understand the problem. The TLB isn't the issue but the
>> IPI is serving two purposes here: TLB flushing (which is allowed to
>> happen at any time) and serialization regarding access to critical pages
>> (which seems to be broken in the Xen case as you suggest).
>
> Indeed, and now hyper-v as well.

Is it possible to distinguish between non-critical calls of
flush_tlb_others() (which should be the majority IMHO) and critical ones
regarding above problem? I guess the only problem is the case when a
page table can be freed because its last valid entry is gone, right?

We might want to add a serialization flag to indicate flushing _and_
serialization via IPI should be performed.


Juergen

2017-08-11 13:39:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, Aug 11, 2017 at 03:07:29PM +0200, Juergen Gross wrote:
> On 11/08/17 14:54, Peter Zijlstra wrote:
> > On Fri, Aug 11, 2017 at 02:46:41PM +0200, Juergen Gross wrote:
> >> Aah, okay. Now I understand the problem. The TLB isn't the issue but the
> >> IPI is serving two purposes here: TLB flushing (which is allowed to
> >> happen at any time) and serialization regarding access to critical pages
> >> (which seems to be broken in the Xen case as you suggest).
> >
> > Indeed, and now hyper-v as well.
>
> Is it possible to distinguish between non-critical calls of
> flush_tlb_others() (which should be the majority IMHO) and critical ones
> regarding above problem? I guess the only problem is the case when a
> page table can be freed because its last valid entry is gone, right?
>
> We might want to add a serialization flag to indicate flushing _and_
> serialization via IPI should be performed.

Possible, but not trivial. Especially things like transparent huge pages,
which swizzle PMDs around, make things tricky.

The by far easiest solution is to switch over to HAVE_RCU_TABLE_FREE
when either Xen or Hyper-V is doing this. Ideally it would not have a
significant performance hit (needs testing) and we can simply always do
this when PARAVIRT, or otherwise we need to get creative with
static_keys or something.
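
One way to read the "static_keys" remark above: keep tlb_remove_page() as the default and switch to the RCU-deferred path only when a hypervisor that flushes without IPIs is detected at boot. A sketch (assuming CONFIG_HAVE_RCU_TABLE_FREE is built in; the key and hook names are made up, this is not a proposed patch):

#include <linux/jump_label.h>
#include <asm/tlb.h>

DEFINE_STATIC_KEY_FALSE(paravirt_tlb_rcu_free);

/* Hypothetical hook, called from e.g. Hyper-V/Xen setup code when TLB
 * flush hypercalls replace the IPI-based flush. */
void __init enable_paravirt_tlb_rcu_free(void)
{
	static_branch_enable(&paravirt_tlb_rcu_free);
}

/* Sketch of the free path for a page-table page: */
static void free_pt_page(struct mmu_gather *tlb, struct page *page)
{
	if (static_branch_unlikely(&paravirt_tlb_rcu_free))
		tlb_remove_table(tlb, page);	/* RCU-deferred free */
	else
		tlb_remove_page(tlb, page);	/* freed after the flush IPI */
}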

2017-08-11 16:16:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra <[email protected]> wrote:
>
> I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
> would make it work again), but this was some years ago and I cannot
> readily find those emails.

I think the only time we really talked about HAVE_RCU_TABLE_FREE for
x86 (at least that I was cc'd on) was not because of RCU freeing, but
because we just wanted to use the generic page table lookup code on
x86 *despite* not using RCU freeing.

And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.

There was only passing mention of maybe making x86 use RCU, but the
discussion was really about why the IF flag meant that x86 didn't need
to, iirc.

I don't recall us ever discussing *really* making x86 use RCU.

Linus

2017-08-11 16:26:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote:
> On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
> > would make it work again), but this was some years ago and I cannot
> > readily find those emails.
>
> I think the only time we really talked about HAVE_RCU_TABLE_FREE for
> x86 (at least that I was cc'd on) was not because of RCU freeing, but
> because we just wanted to use the generic page table lookup code on
> x86 *despite* not using RCU freeing.
>
> And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.
>
> There was only passing mention of maybe making x86 use RCU, but the
> discussion was really about why the IF flag meant that x86 didn't need
> to, iirc.
>
> I don't recall us ever discussing *really* making x86 use RCU.

Google finds me this:

https://lwn.net/Articles/500188/

Which includes:

http://www.mail-archive.com/[email protected]/msg72918.html

which does as was suggested here, selects HAVE_RCU_TABLE_FREE for
PARAVIRT_TLB_FLUSH.

But yes, this is very much virt specific nonsense, native would never
need this.

2017-08-14 13:20:56

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

Peter Zijlstra <[email protected]> writes:

> On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote:
>> On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra <[email protected]> wrote:
>> >
>> > I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
>> > would make it work again), but this was some years ago and I cannot
>> > readily find those emails.
>>
>> I think the only time we really talked about HAVE_RCU_TABLE_FREE for
>> x86 (at least that I was cc'd on) was not because of RCU freeing, but
>> because we just wanted to use the generic page table lookup code on
>> x86 *despite* not using RCU freeing.
>>
>> And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.
>>
>> There was only passing mention of maybe making x86 use RCU, but the
>> discussion was really about why the IF flag meant that x86 didn't need
>> to, iirc.
>>
>> I don't recall us ever discussing *really* making x86 use RCU.
>
> Google finds me this:
>
> https://lwn.net/Articles/500188/
>
> Which includes:
>
> http://www.mail-archive.com/[email protected]/msg72918.html
>
> which does as was suggested here, selects HAVE_RCU_TABLE_FREE for
> PARAVIRT_TLB_FLUSH.
>
> But yes, this is very much virt specific nonsense, native would never
> need this.

In case we decide to go HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled
kernels (as it seems to be the easiest/fastest way to fix Xen PV) - what
do you think about the required testing? Any suggestion for a
specifically crafted micro benchmark in addition to standard
ebizzy/kernbench/...?

Additionally, I see another option for us: enable 'rcu table free' on
boot (e.g. by taking tlb_remove_table to pv_ops and doing boot-time
patching for it) so bare metal and other hypervisors are not affected
by the change.

--
Vitaly

2017-08-16 00:02:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [Xen-devel] [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On Fri, 11 Aug 2017 14:07:14 +0200
Peter Zijlstra <[email protected]> wrote:

> It goes like:
>
> CPU0 CPU1
>
> unhook page
> cli
> traverse page tables
> TLB invalidate ---> <IF clear, therefore CPU0 waits>
> sti
> <IPI>
> TLB invalidate
> <------ complete

I guess the important part here is the above "complete". CPU0 doesn't
proceed until it receives it. Thus it does act like
cli~rcu_read_lock(), sti~rcu_read_unlock(), and "TLB invalidate" is
equivalent to synchronize_rcu().

[ this response is for clarification for the casual observer of this
thread ;-) ]

-- Steve

> </IPI>
> free page
>
> So the CPU1 page-table walker gets an existence guarantee of the
> page-tables by clearing IF.

2017-08-16 16:42:55

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

Vitaly Kuznetsov <[email protected]> writes:

> Peter Zijlstra <[email protected]> writes:
>
>> On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote:
>>> On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra <[email protected]> wrote:
>>> >
>>> > I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
>>> > would make it work again), but this was some years ago and I cannot
>>> > readily find those emails.
>>>
>>> I think the only time we really talked about HAVE_RCU_TABLE_FREE for
>>> x86 (at least that I was cc'd on) was not because of RCU freeing, but
>>> because we just wanted to use the generic page table lookup code on
>>> x86 *despite* not using RCU freeing.
>>>
>>> And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.
>>>
>>> There was only passing mention of maybe making x86 use RCU, but the
>>> discussion was really about why the IF flag meant that x86 didn't need
>>> to, iirc.
>>>
>>> I don't recall us ever discussing *really* making x86 use RCU.
>>
>> Google finds me this:
>>
>> https://lwn.net/Articles/500188/
>>
>> Which includes:
>>
>> http://www.mail-archive.com/[email protected]/msg72918.html
>>
>> which does as was suggested here, selects HAVE_RCU_TABLE_FREE for
>> PARAVIRT_TLB_FLUSH.
>>
>> But yes, this is very much virt specific nonsense, native would never
>> need this.
>
> In case we decide to go HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled
> kernels (as it seems to be the easiest/fastest way to fix Xen PV) - what
> do you think about the required testing? Any suggestion for a
> specifically crafted micro benchmark in addition to standard
> ebizzy/kernbench/...?

In the meantime I tested HAVE_RCU_TABLE_FREE with kernbench (enablement
patch I used is attached; I know that it breaks other architectures) on
bare metal with PARAVIRT enabled in config. The results are:

6-CPU host:

Average Half load -j 3 Run (std deviation):
CURRENT HAVE_RCU_TABLE_FREE
======= ===================
Elapsed Time 400.498 (0.179679) Elapsed Time 399.909 (0.162853)
User Time 1098.72 (0.278536) User Time 1097.59 (0.283894)
System Time 100.301 (0.201629) System Time 99.736 (0.196254)
Percent CPU 299 (0) Percent CPU 299 (0)
Context Switches 5774.1 (69.2121) Context Switches 5744.4 (79.4162)
Sleeps 87621.2 (78.1093) Sleeps 87586.1 (99.7079)

Average Optimal load -j 24 Run (std deviation):
CURRENT HAVE_RCU_TABLE_FREE
======= ===================
Elapsed Time 219.03 (0.652534) Elapsed Time 218.959 (0.598674)
User Time 1119.51 (21.3284) User Time 1118.81 (21.7793)
System Time 100.499 (0.389308) System Time 99.8335 (0.251423)
Percent CPU 432.5 (136.974) Percent CPU 432.45 (136.922)
Context Switches 81827.4 (78029.5) Context Switches 81818.5 (78051)
Sleeps 97124.8 (9822.4) Sleeps 97207.9 (9955.04)

16-CPU host:

Average Half load -j 8 Run (std deviation):
CURRENT HAVE_RCU_TABLE_FREE
======= ===================
Elapsed Time 213.538 (3.7891) Elapsed Time 212.5 (3.10939)
User Time 1306.4 (1.83399) User Time 1307.65 (1.01364)
System Time 194.59 (0.864378) System Time 195.478 (0.794588)
Percent CPU 702.6 (13.5388) Percent CPU 707 (11.1131)
Context Switches 21189.2 (1199.4) Context Switches 21288.2 (552.388)
Sleeps 89390.2 (482.325) Sleeps 89677 (277.06)

Average Optimal load -j 64 Run (std deviation):
CURRENT HAVE_RCU_TABLE_FREE
======= ===================
Elapsed Time 137.866 (0.787928) Elapsed Time 138.438 (0.218792)
User Time 1488.92 (192.399) User Time 1489.92 (192.135)
System Time 234.981 (42.5806) System Time 236.09 (42.8138)
Percent CPU 1057.1 (373.826) Percent CPU 1057.1 (369.114)
Context Switches 187514 (175324) Context Switches 187358 (175060)
Sleeps 112633 (24535.5) Sleeps 111743 (23297.6)

As you can see, there's no notable difference. I'll think of a
microbenchmark though.

>
> Additionally, I see another option for us: enable 'rcu table free' on
> boot (e.g. by taking tlb_remove_table to pv_ops and doing boot-time
> patching for it) so bare metal and other hypervisors are not affected
> by the change.

It seems there's no need for that and we can keep things simple...

--
Vitaly


Attachments:
0001-x86-enable-RCU-based-table-free-when-PARAVIRT.patch (2.66 kB)

2017-08-16 21:43:52

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

On 08/16/2017 12:42 PM, Vitaly Kuznetsov wrote:
> Vitaly Kuznetsov <[email protected]> writes:
>
>> Peter Zijlstra <[email protected]> writes:
>>
>>> On Fri, Aug 11, 2017 at 09:16:29AM -0700, Linus Torvalds wrote:
>>>> On Fri, Aug 11, 2017 at 2:03 AM, Peter Zijlstra <[email protected]> wrote:
>>>>> I'm sure we talked about using HAVE_RCU_TABLE_FREE for x86 (and yes that
>>>>> would make it work again), but this was some years ago and I cannot
>>>>> readily find those emails.
>>>> I think the only time we really talked about HAVE_RCU_TABLE_FREE for
>>>> x86 (at least that I was cc'd on) was not because of RCU freeing, but
>>>> because we just wanted to use the generic page table lookup code on
>>>> x86 *despite* not using RCU freeing.
>>>>
>>>> And we just ended up renaming HAVE_GENERIC_RCU_GUP as HAVE_GENERIC_GUP.
>>>>
>>>> There was only passing mention of maybe making x86 use RCU, but the
>>>> discussion was really about why the IF flag meant that x86 didn't need
>>>> to, iirc.
>>>>
>>>> I don't recall us ever discussing *really* making x86 use RCU.
>>> Google finds me this:
>>>
>>> https://lwn.net/Articles/500188/
>>>
>>> Which includes:
>>>
>>> http://www.mail-archive.com/[email protected]/msg72918.html
>>>
>>> which does as was suggested here, selects HAVE_RCU_TABLE_FREE for
>>> PARAVIRT_TLB_FLUSH.
>>>
>>> But yes, this is very much virt specific nonsense, native would never
>>> need this.
>> In case we decide to go HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled
>> kernels (as it seems to be the easiest/fastest way to fix Xen PV) - what
>> do you think about the required testing? Any suggestion for a
>> specifically crafted micro benchmark in addition to standard
>> ebizzy/kernbench/...?
> In the meantime I tested HAVE_RCU_TABLE_FREE with kernbench (enablement
> patch I used is attached; I know that it breaks other architectures) on
> bare metal with PARAVIRT enabled in config. The results are:
>
> 6-CPU host:
>
> Average Half load -j 3 Run (std deviation):
> CURRENT HAVE_RCU_TABLE_FREE
> ======= ===================
> Elapsed Time 400.498 (0.179679) Elapsed Time 399.909 (0.162853)
> User Time 1098.72 (0.278536) User Time 1097.59 (0.283894)
> System Time 100.301 (0.201629) System Time 99.736 (0.196254)
> Percent CPU 299 (0) Percent CPU 299 (0)
> Context Switches 5774.1 (69.2121) Context Switches 5744.4 (79.4162)
> Sleeps 87621.2 (78.1093) Sleeps 87586.1 (99.7079)
>
> Average Optimal load -j 24 Run (std deviation):
> CURRENT HAVE_RCU_TABLE_FREE
> ======= ===================
> Elapsed Time 219.03 (0.652534) Elapsed Time 218.959 (0.598674)
> User Time 1119.51 (21.3284) User Time 1118.81 (21.7793)
> System Time 100.499 (0.389308) System Time 99.8335 (0.251423)
> Percent CPU 432.5 (136.974) Percent CPU 432.45 (136.922)
> Context Switches 81827.4 (78029.5) Context Switches 81818.5 (78051)
> Sleeps 97124.8 (9822.4) Sleeps 97207.9 (9955.04)
>
> 16-CPU host:
>
> Average Half load -j 8 Run (std deviation):
> CURRENT HAVE_RCU_TABLE_FREE
> ======= ===================
> Elapsed Time 213.538 (3.7891) Elapsed Time 212.5 (3.10939)
> User Time 1306.4 (1.83399) User Time 1307.65 (1.01364)
> System Time 194.59 (0.864378) System Time 195.478 (0.794588)
> Percent CPU 702.6 (13.5388) Percent CPU 707 (11.1131)
> Context Switches 21189.2 (1199.4) Context Switches 21288.2 (552.388)
> Sleeps 89390.2 (482.325) Sleeps 89677 (277.06)
>
> Average Optimal load -j 64 Run (std deviation):
> CURRENT HAVE_RCU_TABLE_FREE
> ======= ===================
> Elapsed Time 137.866 (0.787928) Elapsed Time 138.438 (0.218792)
> User Time 1488.92 (192.399) User Time 1489.92 (192.135)
> System Time 234.981 (42.5806) System Time 236.09 (42.8138)
> Percent CPU 1057.1 (373.826) Percent CPU 1057.1 (369.114)
> Context Switches 187514 (175324) Context Switches 187358 (175060)
> Sleeps 112633 (24535.5) Sleeps 111743 (23297.6)
>
> As you can see, there's no notable difference. I'll think of a
> microbenchmark though.

FWIW, I was about to send a very similar patch (but with only Xen-PV
enabling RCU-based free by default) and saw similar results with
kernbench, both Xen PV and baremetal.

>> Additionally, I see another option for us: enable 'rcu table free' on
>> boot (e.g. by taking tlb_remove_table to pv_ops and doing boot-time
>> patching for it) so bare metal and other hypervisors are not affected
>> by the change.
> It seems there's no need for that and we can keep things simple...
>
> -- Vitaly
>
> 0001-x86-enable-RCU-based-table-free-when-PARAVIRT.patch
>
>
> >From daf5117706920aebe793d1239fccac2edd4d680c Mon Sep 17 00:00:00 2001
> From: Vitaly Kuznetsov <[email protected]>
> Date: Mon, 14 Aug 2017 16:05:05 +0200
> Subject: [PATCH] x86: enable RCU based table free when PARAVIRT
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/mm/pgtable.c | 16 ++++++++++++++++
> mm/memory.c | 5 +++++
> 3 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 781521b7cf9e..9c1666ea04c9 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -167,6 +167,7 @@ config X86
> select HAVE_PERF_REGS
> select HAVE_PERF_USER_STACK_DUMP
> select HAVE_REGS_AND_STACK_ACCESS_API
> + select HAVE_RCU_TABLE_FREE if SMP && PARAVIRT
> select HAVE_RELIABLE_STACKTRACE if X86_64 && FRAME_POINTER && STACK_VALIDATION
> select HAVE_STACK_VALIDATION if X86_64
> select HAVE_SYSCALL_TRACEPOINTS
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 508a708eb9a6..b6aaab9fb3b8 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -56,7 +56,11 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
> {
> pgtable_page_dtor(pte);
> paravirt_release_pte(page_to_pfn(pte));
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> + tlb_remove_table(tlb, pte);
> +#else
> tlb_remove_page(tlb, pte);
> +#endif
> }
>
> #if CONFIG_PGTABLE_LEVELS > 2
> @@ -72,21 +76,33 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
> tlb->need_flush_all = 1;
> #endif
> pgtable_pmd_page_dtor(page);
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> + tlb_remove_table(tlb, page);
> +#else
> tlb_remove_page(tlb, page);
> +#endif
> }
>
> #if CONFIG_PGTABLE_LEVELS > 3
> void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
> {
> paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> + tlb_remove_table(tlb, virt_to_page(pud));
> +#else
> tlb_remove_page(tlb, virt_to_page(pud));
> +#endif
> }
>
> #if CONFIG_PGTABLE_LEVELS > 4
> void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
> {
> paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
> + tlb_remove_table(tlb, virt_to_page(p4d));
> +#else
> tlb_remove_page(tlb, virt_to_page(p4d));
> +#endif

This can probably be factored out.

> }
> #endif /* CONFIG_PGTABLE_LEVELS > 4 */
> #endif /* CONFIG_PGTABLE_LEVELS > 3 */
> diff --git a/mm/memory.c b/mm/memory.c
> index e158f7ac6730..18d6671b6ae2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -329,6 +329,11 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
> * See the comment near struct mmu_table_batch.
> */
>
> +static void __tlb_remove_table(void *table)
> +{
> + free_page_and_swap_cache(table);
> +}
> +

This needs to be a per-arch routine (e.g. see arch/arm64/include/asm/tlb.h).

-boris
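
For completeness, a minimal sketch of the per-arch hook Boris refers to, modelled on the arm64 definition he points at (placing it in arch/x86/include/asm/tlb.h and this exact form are assumptions here, not the code that was eventually merged):

/* arch/x86/include/asm/tlb.h (sketch) */
static inline void __tlb_remove_table(void *table)
{
	/*
	 * x86 page-table pages need no extra teardown at this point;
	 * hand the page back once the RCU grace period has elapsed.
	 * Relies on free_page_and_swap_cache() from <linux/swap.h>.
	 */
	free_page_and_swap_cache(table);
}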


2017-08-17 07:58:30

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [tip:x86/platform] x86/hyper-v: Use hypercall for remote TLB flush

Boris Ostrovsky <[email protected]> writes:

> On 08/16/2017 12:42 PM, Vitaly Kuznetsov wrote:
>> Vitaly Kuznetsov <[email protected]> writes:
>>
>>> In case we decide to go HAVE_RCU_TABLE_FREE for all PARAVIRT-enabled
>>> kernels (as it seems to be the easiest/fastest way to fix Xen PV) - what
>>> do you think about the required testing? Any suggestion for a
>>> specifically crafted micro benchmark in addition to standard
>>> ebizzy/kernbench/...?
>> In the meantime I tested HAVE_RCU_TABLE_FREE with kernbench (enablement
>> patch I used is attached; I know that it breaks other architectures) on
>> bare metal with PARAVIRT enabled in config. The results are:
>>
>>...
>>
>> As you can see, there's no notable difference. I'll think of a
>> microbenchmark though.
>
> FWIW, I was about to send a very similar patch (but with only Xen-PV
> enabling RCU-based free by default) and saw similar results with
> kernbench, both Xen PV and baremetal.
>

Thanks for the confirmation,

I'd go with enabling it for PARAVIRT as we will need it for Hyper-V too.

<snip>

>>
>> #if CONFIG_PGTABLE_LEVELS > 4
>> void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
>> {
>> paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
>> +#ifdef CONFIG_HAVE_RCU_TABLE_FREE
>> + tlb_remove_table(tlb, virt_to_page(p4d));
>> +#else
>> tlb_remove_page(tlb, virt_to_page(p4d));
>> +#endif
>
> This can probably be factored out.
>
>> }
>> #endif /* CONFIG_PGTABLE_LEVELS > 4 */
>> #endif /* CONFIG_PGTABLE_LEVELS > 3 */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index e158f7ac6730..18d6671b6ae2 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -329,6 +329,11 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
>> * See the comment near struct mmu_table_batch.
>> */
>>
>> +static void __tlb_remove_table(void *table)
>> +{
>> + free_page_and_swap_cache(table);
>> +}
>> +
>
> This needs to be a per-arch routine (e.g. see arch/arm64/include/asm/tlb.h).
>

Yea, this was a quick-and-dirty x86-only patch.

--
Vitaly

2017-08-31 11:44:04

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

Vitaly Kuznetsov <[email protected]> writes:

> Changes since v9:
> - Rebase to 4.13-rc3.
> - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
> functional dependencies on this patch so the series can go through a different tree
> (and it actually belongs to x86 if I got Ingo's comment right).
> - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
> - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
> hyperv_flush_tlb_others() [Andy Shevchenko]
> - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
> reported by kbuild test robot (#include <asm/io.h>)
> - Add Steven's 'Reviewed-by:' to PATCH9.

Ingo,

this series ended up being 'almost merged' - you merged all but the last
two patches to 'x86/platform' branch when Peter reported an issue with
pagetable walkers. Now as we have 'x86/mm: Enable RCU based page table
freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)' merged to 'x86/mm' this is
resolved and we can hopefully proceed with this series. Could you please
let me know if I need to resend these last two patches or if you can
take them from v10?

Thanks,

--
Vitaly

2017-08-31 12:23:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements


* Vitaly Kuznetsov <[email protected]> wrote:

> > Changes since v9:
> > - Rebase to 4.13-rc3.
> > - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
> > functional dependencies on this patch so the series can go through a different tree
> > (and it actually belongs to x86 if I got Ingo's comment right).
> > - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
> > - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
> > hyperv_flush_tlb_others() [Andy Shevchenko]
> > - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
> > reported by kbuild test robot (#include <asm/io.h>)
> > - Add Steven's 'Reviewed-by:' to PATCH9.
>
> Ingo,
>
> this series ended up being 'almost merged' - you merged all but the last
> two patches to 'x86/platform' branch when Peter reported an issue with
> pagetable walkers. Now as we have 'x86/mm: Enable RCU based page table
> freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)' merged to 'x86/mm' this is
> resolved and we can hopefully proceed with this series. Could you please
> let me know if I need to resend these last two patches or if you can
> take them from v10?

Ok, I have merged tip:x86/mm into tip:x86/platform to pick up the dependency and
have applied patches #9 and #10. Will push it all out after testing.

Thanks,

Ingo

2017-08-31 14:53:11

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

Ingo Molnar <[email protected]> writes:

> * Vitaly Kuznetsov <[email protected]> wrote:
>
>> > Changes since v9:
>> > - Rebase to 4.13-rc3.
>> > - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
>> > functional dependencies on this patch so the series can go through a different tree
>> > (and it actually belongs to x86 if I got Ingo's comment right).
>> > - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
>> > - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
>> > hyperv_flush_tlb_others() [Andy Shevchenko]
>> > - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
>> > reported by kbuild test robot (#include <asm/io.h>)
>> > - Add Steven's 'Reviewed-by:' to PATCH9.
>>
>> Ingo,
>>
>> this series ended up being 'almost merged' - you merged all but the last
>> two patches to 'x86/platform' branch when Peter reported an issue with
>> pagetable walkers. Now as we have 'x86/mm: Enable RCU based page table
>> freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)' merged to 'x86/mm' this is
>> resolved and we can hopefully proceed with this series. Could you please
>> let me know if I need to resend these last two patches or if you can
>> take them from v10?
>
> Ok, I have merged tip:x86/mm into tip:x86/platform to pick up the dependency and
> have applied patches #9 and #10.

Hope you meant '#8 and #9' as v10 had only 9 patches :-)

> Will push it all out after testing.
>

Thanks!

--
Vitaly

2017-08-31 20:02:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements


* Vitaly Kuznetsov <[email protected]> wrote:

> Ingo Molnar <[email protected]> writes:
>
> > * Vitaly Kuznetsov <[email protected]> wrote:
> >
> >> > Changes since v9:
> >> > - Rebase to 4.13-rc3.
> >> > - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
> >> > functional dependencies on this patch so the series can go through a different tree
> >> > (and it actually belongs to x86 if I got Ingo's comment right).
> >> > - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
> >> > - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
> >> > hyperv_flush_tlb_others() [Andy Shevchenko]
> >> > - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
> >> > reported by kbuild test robot (#include <asm/io.h>)
> >> > - Add Steven's 'Reviewed-by:' to PATCH9.
> >>
> >> Ingo,
> >>
> >> this series ended up being 'almost merged' - you merged all but the last
> >> two patches to 'x86/platform' branch when Peter reported an issue with
> >> pagetable walkers. Now as we have 'x86/mm: Enable RCU based page table
> >> freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)' merged to 'x86/mm' this is
> >> resolved and we can hopefully proceed with this series. Could you please
> >> let me know if I need to resend these last two patches or if you can
> >> take them from v10?
> >
> > Ok, I have merged tip:x86/mm into tip:x86/platform to pick up the dependency and
> > have applied patches #9 and #10.
>
> Hope you meant '#8 and #9' as v10 had only 9 patches :-)

Yes ;-)

Thanks,

Ingo

Subject: [tip:x86/platform] x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls

Commit-ID: 628f54cc6451d2706ba8a56763dbf93be02aaa80
Gitweb: http://git.kernel.org/tip/628f54cc6451d2706ba8a56763dbf93be02aaa80
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:20 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 31 Aug 2017 14:20:36 +0200

x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls

Hyper-V hosts may support more than 64 vCPUs, we need to use
HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX/LIST_EX hypercalls in this
case.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/hyperv/mmu.c | 133 ++++++++++++++++++++++++++++++++++++-
arch/x86/include/uapi/asm/hyperv.h | 10 +++
2 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 9419a20..51b44be 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -18,11 +18,25 @@ struct hv_flush_pcpu {
u64 gva_list[];
};

+/* HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressListEx hypercalls */
+struct hv_flush_pcpu_ex {
+ u64 address_space;
+ u64 flags;
+ struct {
+ u64 format;
+ u64 valid_bank_mask;
+ u64 bank_contents[];
+ } hv_vp_set;
+ u64 gva_list[];
+};
+
/* Each gva in gva_list encodes up to 4096 pages to flush */
#define HV_TLB_FLUSH_UNIT (4096 * PAGE_SIZE)

static struct hv_flush_pcpu __percpu *pcpu_flush;

+static struct hv_flush_pcpu_ex __percpu *pcpu_flush_ex;
+
/*
* Fills in gva_list starting from offset. Returns the number of items added.
*/
@@ -53,6 +67,34 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
}

+/* Return the number of banks in the resulting vp_set */
+static inline int cpumask_to_vp_set(struct hv_flush_pcpu_ex *flush,
+ const struct cpumask *cpus)
+{
+ int cpu, vcpu, vcpu_bank, vcpu_offset, nr_bank = 1;
+
+ /*
+ * Some banks may end up being empty but this is acceptable.
+ */
+ for_each_cpu(cpu, cpus) {
+ vcpu = hv_cpu_number_to_vp_number(cpu);
+ vcpu_bank = vcpu / 64;
+ vcpu_offset = vcpu % 64;
+
+ /* valid_bank_mask can represent up to 64 banks */
+ if (vcpu_bank >= 64)
+ return 0;
+
+ __set_bit(vcpu_offset, (unsigned long *)
+ &flush->hv_vp_set.bank_contents[vcpu_bank]);
+ if (vcpu_bank >= nr_bank)
+ nr_bank = vcpu_bank + 1;
+ }
+ flush->hv_vp_set.valid_bank_mask = GENMASK_ULL(nr_bank - 1, 0);
+
+ return nr_bank;
+}
+
static void hyperv_flush_tlb_others(const struct cpumask *cpus,
const struct flush_tlb_info *info)
{
@@ -122,17 +164,102 @@ do_native:
native_flush_tlb_others(cpus, info);
}

+static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
+ const struct flush_tlb_info *info)
+{
+ int nr_bank = 0, max_gvas, gva_n;
+ struct hv_flush_pcpu_ex *flush;
+ u64 status = U64_MAX;
+ unsigned long flags;
+
+ if (!pcpu_flush_ex || !hv_hypercall_pg)
+ goto do_native;
+
+ if (cpumask_empty(cpus))
+ return;
+
+ local_irq_save(flags);
+
+ flush = this_cpu_ptr(pcpu_flush_ex);
+
+ if (info->mm) {
+ flush->address_space = virt_to_phys(info->mm->pgd);
+ flush->flags = 0;
+ } else {
+ flush->address_space = 0;
+ flush->flags = HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES;
+ }
+
+ flush->hv_vp_set.valid_bank_mask = 0;
+
+ if (!cpumask_equal(cpus, cpu_present_mask)) {
+ flush->hv_vp_set.format = HV_GENERIC_SET_SPARCE_4K;
+ nr_bank = cpumask_to_vp_set(flush, cpus);
+ }
+
+ if (!nr_bank) {
+ flush->hv_vp_set.format = HV_GENERIC_SET_ALL;
+ flush->flags |= HV_FLUSH_ALL_PROCESSORS;
+ }
+
+ /*
+ * We can flush not more than max_gvas with one hypercall. Flush the
+ * whole address space if we were asked to do more.
+ */
+ max_gvas =
+ (PAGE_SIZE - sizeof(*flush) - nr_bank *
+ sizeof(flush->hv_vp_set.bank_contents[0])) /
+ sizeof(flush->gva_list[0]);
+
+ if (info->end == TLB_FLUSH_ALL) {
+ flush->flags |= HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY;
+ status = hv_do_rep_hypercall(
+ HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX,
+ 0, nr_bank + 2, flush, NULL);
+ } else if (info->end &&
+ ((info->end - info->start)/HV_TLB_FLUSH_UNIT) > max_gvas) {
+ status = hv_do_rep_hypercall(
+ HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX,
+ 0, nr_bank + 2, flush, NULL);
+ } else {
+ gva_n = fill_gva_list(flush->gva_list, nr_bank,
+ info->start, info->end);
+ status = hv_do_rep_hypercall(
+ HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX,
+ gva_n, nr_bank + 2, flush, NULL);
+ }
+
+ local_irq_restore(flags);
+
+ if (!(status & HV_HYPERCALL_RESULT_MASK))
+ return;
+do_native:
+ native_flush_tlb_others(cpus, info);
+}
+
void hyperv_setup_mmu_ops(void)
{
- if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED) {
+ if (!(ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED))
+ return;
+
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+
+ if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED)) {
pr_info("Using hypercall for remote TLB flush\n");
pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others;
- setup_clear_cpu_cap(X86_FEATURE_PCID);
+ } else {
+ pr_info("Using ext hypercall for remote TLB flush\n");
+ pv_mmu_ops.flush_tlb_others = hyperv_flush_tlb_others_ex;
}
}

void hyper_alloc_mmu(void)
{
- if (ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED)
+ if (!(ms_hyperv.hints & HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED))
+ return;
+
+ if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
pcpu_flush = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
+ else
+ pcpu_flush_ex = __alloc_percpu(PAGE_SIZE, PAGE_SIZE);
}
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index a6fdd3b..7032f4d 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -149,6 +149,9 @@
*/
#define HV_X64_DEPRECATING_AEOI_RECOMMENDED (1 << 9)

+/* Recommend using the newer ExProcessorMasks interface */
+#define HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED (1 << 11)
+
/*
* HV_VP_SET available
*/
@@ -245,6 +248,8 @@
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002
#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003
#define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX 0x0013
+#define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX 0x0014
#define HVCALL_POST_MESSAGE 0x005c
#define HVCALL_SIGNAL_EVENT 0x005d

@@ -266,6 +271,11 @@
#define HV_FLUSH_NON_GLOBAL_MAPPINGS_ONLY BIT(2)
#define HV_FLUSH_USE_EXTENDED_RANGE_FORMAT BIT(3)

+enum HV_GENERIC_SET_FORMAT {
+ HV_GENERIC_SET_SPARCE_4K,
+ HV_GENERIC_SET_ALL,
+};
+
/* hypercall status code */
#define HV_STATUS_SUCCESS 0
#define HV_STATUS_INVALID_HYPERCALL_CODE 2
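
A quick worked example of the sparse VP-set encoding above (my illustration, not part of the commit, and it assumes the Hyper-V VP numbers equal the Linux CPU numbers): with CPUs 3 and 71 in the mask, cpumask_to_vp_set() puts CPU 3 into bank 0 (3 / 64), bit 3, and CPU 71 into bank 1 (71 / 64), bit 7 (71 % 64), so nr_bank becomes 2 and valid_bank_mask becomes GENMASK_ULL(1, 0) == 0x3. The rep hypercall is then issued with a variable header of nr_bank + 2 64-bit words (format, valid_bank_mask and the two bank_contents entries), followed by the gva_list in the LIST_EX case.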

Subject: [tip:x86/platform] tracing/hyper-v: Trace hyperv_mmu_flush_tlb_others()

Commit-ID: 773b79f7a7c7839fb9d09c0e206734173a8b0a6b
Gitweb: http://git.kernel.org/tip/773b79f7a7c7839fb9d09c0e206734173a8b0a6b
Author: Vitaly Kuznetsov <[email protected]>
AuthorDate: Wed, 2 Aug 2017 18:09:21 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 31 Aug 2017 14:20:37 +0200

tracing/hyper-v: Trace hyperv_mmu_flush_tlb_others()

Add Hyper-V tracing subsystem and trace hyperv_mmu_flush_tlb_others().
Tracing is done the same way we do xen_mmu_flush_tlb_others().

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Stephen Hemminger <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Haiyang Zhang <[email protected]>
Cc: Jork Loeser <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Simon Xiao <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
MAINTAINERS | 1 +
arch/x86/hyperv/mmu.c | 7 +++++++
arch/x86/include/asm/trace/hyperv.h | 40 +++++++++++++++++++++++++++++++++++++
3 files changed, 48 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index b3eadf3..9fcffdf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6258,6 +6258,7 @@ M: Stephen Hemminger <[email protected]>
L: [email protected]
S: Maintained
F: arch/x86/include/asm/mshyperv.h
+F: arch/x86/include/asm/trace/hyperv.h
F: arch/x86/include/uapi/asm/hyperv.h
F: arch/x86/kernel/cpu/mshyperv.c
F: arch/x86/hyperv
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 51b44be..39e7f6e 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -10,6 +10,9 @@
#include <asm/msr.h>
#include <asm/tlbflush.h>

+#define CREATE_TRACE_POINTS
+#include <asm/trace/hyperv.h>
+
/* HvFlushVirtualAddressSpace, HvFlushVirtualAddressList hypercalls */
struct hv_flush_pcpu {
u64 address_space;
@@ -103,6 +106,8 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
u64 status = U64_MAX;
unsigned long flags;

+ trace_hyperv_mmu_flush_tlb_others(cpus, info);
+
if (!pcpu_flush || !hv_hypercall_pg)
goto do_native;

@@ -172,6 +177,8 @@ static void hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
u64 status = U64_MAX;
unsigned long flags;

+ trace_hyperv_mmu_flush_tlb_others(cpus, info);
+
if (!pcpu_flush_ex || !hv_hypercall_pg)
goto do_native;

diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
new file mode 100644
index 0000000..4253bca
--- /dev/null
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -0,0 +1,40 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hyperv
+
+#if !defined(_TRACE_HYPERV_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HYPERV_H
+
+#include <linux/tracepoint.h>
+
+#if IS_ENABLED(CONFIG_HYPERV)
+
+TRACE_EVENT(hyperv_mmu_flush_tlb_others,
+ TP_PROTO(const struct cpumask *cpus,
+ const struct flush_tlb_info *info),
+ TP_ARGS(cpus, info),
+ TP_STRUCT__entry(
+ __field(unsigned int, ncpus)
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, addr)
+ __field(unsigned long, end)
+ ),
+ TP_fast_assign(__entry->ncpus = cpumask_weight(cpus);
+ __entry->mm = info->mm;
+ __entry->addr = info->start;
+ __entry->end = info->end;
+ ),
+ TP_printk("ncpus %d mm %p addr %lx, end %lx",
+ __entry->ncpus, __entry->mm,
+ __entry->addr, __entry->end)
+ );
+
+#endif /* CONFIG_HYPERV */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH asm/trace/
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE hyperv
+#endif /* _TRACE_HYPERV_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
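
Once merged, the new tracepoint is reachable through the standard ftrace interface, for example (standard tracefs paths, shown purely for illustration):

# echo 1 > /sys/kernel/debug/tracing/events/hyperv/hyperv_mmu_flush_tlb_others/enable
# cat /sys/kernel/debug/tracing/trace_pipe

Each event records the number of target CPUs, the mm pointer and the start/end of the flushed range, matching the TP_printk() format above.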

2017-11-06 11:08:19

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

2017-11-06 18:10 GMT+08:00 Vitaly Kuznetsov <[email protected]>:
> Wanpeng Li <[email protected]> writes:
>
>> 2017-11-06 17:14 GMT+08:00 Vitaly Kuznetsov <[email protected]>:
>>> Wanpeng Li <[email protected]> writes:
>>>
>>>> 2017-08-03 0:09 GMT+08:00 Vitaly Kuznetsov <[email protected]>:
>>>>> Changes since v9:
>>>>> - Rebase to 4.13-rc3.
>>>>> - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
>>>>> functional dependencies on this patch so the series can go through a different tree
>>>>> (and it actually belongs to x86 if I got Ingo's comment right).
>>>>> - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
>>>>> - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
>>>>> hyperv_flush_tlb_others() [Andy Shevchenko]
>>>>> - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
>>>>> reported by kbuild test robot (#include <asm/io.h>)
>>>>> - Add Steven's 'Reviewed-by:' to PATCH9.
>>>>>
>>>>> Original description:
>>>>>
>>>>> Hyper-V supports hypercalls for doing local and remote TLB flushing and
>>>>> gives its guests hints when using hypercall is preferred. While doing
>>>>> hypercalls for local TLB flushes is probably not practical (and is not
>>>>> being suggested by modern Hyper-V versions) remote TLB flush with a
>>>>> hypercall brings significant improvement.
>>>>>
>>>>> To test the series I wrote a special 'TLB trasher': on a 16 vCPU guest I
>>>>> was creating 32 threads which were doing 100000 mmap/munmaps each on some
>>>>> big file. Here are the results:
>>>>>
>>>>> Before:
>>>>> # time ./pthread_mmap ./randfile
>>>>> real 3m33.118s
>>>>> user 0m3.698s
>>>>> sys 3m16.624s
>>>>>
>>>>> After:
>>>>> # time ./pthread_mmap ./randfile
>>>>> real 2m19.920s
>>>>> user 0m2.662s
>>>>> sys 2m9.948s
>>>>>
>>>>> This series brings a number of small improvements along the way: fast
>>>>> hypercall implementation and using it for event signaling, rep hypercalls
>>>>> implementation, hyperv tracing subsystem (which only traces the newly added
>>>>> remote TLB flush for now).
>>>>>
>>>>
>>>> Hi Vitaly,
>>>>
>>>> Could you attach your benchmark? I'm interested in to try the
>>>> implementation in paravirt kvm.
>>>>
>>>
>>> Oh, this would be cool) I briefly discussed the idea with Radim (one of
>>> KVM maintainers) during the last KVM Forum and he wasn't opposed to the
>>> idea. Need to talk to Paolo too. Good thing is that we have everything
>>
>> I talk with Paolo today and he points this feature to me, so I believe
>> he likes it. :) In addition,
>> https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
>> I search Hypervisor Top Level Functional Specification v5.0b.pdf
>> document but didn't find a section introduce the Hyper-V:
>> paravirtualized remote TLB flushing and hypercall stuff, could you
>> point out?
>>
>
> It's there, search for
> HvFlushVirtualAddressSpace/HvFlushVirtualAddressSpaceEx and
> HvFlushVirtualAddressList/HvFlushVirtualAddressListEx.

Got it, thanks.

Regards,
Wanpeng Li

2017-11-06 10:12:06

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

Wanpeng Li <[email protected]> writes:

> 2017-11-06 17:14 GMT+08:00 Vitaly Kuznetsov <[email protected]>:
>> Wanpeng Li <[email protected]> writes:
>>
>>> 2017-08-03 0:09 GMT+08:00 Vitaly Kuznetsov <[email protected]>:
>>>> Changes since v9:
>>>> - Rebase to 4.13-rc3.
>>>> - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
>>>> functional dependencies on this patch so the series can go through a different tree
>>>> (and it actually belongs to x86 if I got Ingo's comment right).
>>>> - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
>>>> - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
>>>> hyperv_flush_tlb_others() [Andy Shevchenko]
>>>> - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
>>>> reported by kbuild test robot (#include <asm/io.h>)
>>>> - Add Steven's 'Reviewed-by:' to PATCH9.
>>>>
>>>> Original description:
>>>>
>>>> Hyper-V supports hypercalls for doing local and remote TLB flushing and
>>>> gives its guests hints when using hypercall is preferred. While doing
>>>> hypercalls for local TLB flushes is probably not practical (and is not
>>>> being suggested by modern Hyper-V versions) remote TLB flush with a
>>>> hypercall brings significant improvement.
>>>>
>>>> To test the series I wrote a special 'TLB trasher': on a 16 vCPU guest I
>>>> was creating 32 threads which were doing 100000 mmap/munmaps each on some
>>>> big file. Here are the results:
>>>>
>>>> Before:
>>>> # time ./pthread_mmap ./randfile
>>>> real 3m33.118s
>>>> user 0m3.698s
>>>> sys 3m16.624s
>>>>
>>>> After:
>>>> # time ./pthread_mmap ./randfile
>>>> real 2m19.920s
>>>> user 0m2.662s
>>>> sys 2m9.948s
>>>>
>>>> This series brings a number of small improvements along the way: fast
>>>> hypercall implementation and using it for event signaling, rep hypercalls
>>>> implementation, hyperv tracing subsystem (which only traces the newly added
>>>> remote TLB flush for now).
>>>>
>>>
>>> Hi Vitaly,
>>>
>>> Could you attach your benchmark? I'm interested in trying the
>>> implementation in paravirt KVM.
>>>
>>
>> Oh, this would be cool) I briefly discussed the idea with Radim (one of
>> KVM maintainers) during the last KVM Forum and he wasn't opposed to the
>> idea. Need to talk to Paolo too. Good thing is that we have everything
>
> I talked with Paolo today and he pointed this feature out to me, so I
> believe he likes it. :) In addition, I searched the Hypervisor Top Level
> Functional Specification v5.0b.pdf from
> https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
> but didn't find a section introducing the Hyper-V paravirtualized remote
> TLB flushing and hypercall stuff. Could you point it out?
>

It's there, search for
HvFlushVirtualAddressSpace/HvFlushVirtualAddressSpaceEx and
HvFlushVirtualAddressList/HvFlushVirtualAddressListEx.
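
For reference, the input the guest passes to these hypercalls is just a
small structure: the target address space, the flush flags and a mask of
the virtual processors to flush (the Ex variants take a sparse VP set
instead of a plain mask). Below is a minimal sketch, roughly matching what
arch/x86/hyperv/mmu.c in this series uses; the field names are illustrative
and the TLFS has the authoritative definition:

	#include <linux/types.h>

	/* Illustrative only; see the TLFS for the exact layout. */
	struct hv_flush_pcpu {
		__u64 address_space;	/* CR3 of the address space to flush */
		__u64 flags;		/* HV_FLUSH_* flags */
		__u64 processor_mask;	/* bitmap of target virtual processors */
		__u64 gva_list[];	/* only for HvFlushVirtualAddressList */
	};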

--
Vitaly


2017-11-06 09:59:14

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

2017-11-06 17:14 GMT+08:00 Vitaly Kuznetsov <[email protected]>:
> Wanpeng Li <[email protected]> writes:
>
>> 2017-08-03 0:09 GMT+08:00 Vitaly Kuznetsov <[email protected]>:
>>> Changes since v9:
>>> - Rebase to 4.13-rc3.
>>> - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
>>> functional dependencies on this patch so the series can go through a different tree
>>> (and it actually belongs to x86 if I got Ingo's comment right).
>>> - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
>>> - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
>>> hyperv_flush_tlb_others() [Andy Shevchenko]
>>> - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
>>> reported by kbuild test robot (#include <asm/io.h>)
>>> - Add Steven's 'Reviewed-by:' to PATCH9.
>>>
>>> Original description:
>>>
>>> Hyper-V supports hypercalls for doing local and remote TLB flushing and
>>> gives its guests hints when using hypercall is preferred. While doing
>>> hypercalls for local TLB flushes is probably not practical (and is not
>>> being suggested by modern Hyper-V versions) remote TLB flush with a
>>> hypercall brings significant improvement.
>>>
>>> To test the series I wrote a special 'TLB trasher': on a 16 vCPU guest I
>>> was creating 32 threads which were doing 100000 mmap/munmaps each on some
>>> big file. Here are the results:
>>>
>>> Before:
>>> # time ./pthread_mmap ./randfile
>>> real 3m33.118s
>>> user 0m3.698s
>>> sys 3m16.624s
>>>
>>> After:
>>> # time ./pthread_mmap ./randfile
>>> real 2m19.920s
>>> user 0m2.662s
>>> sys 2m9.948s
>>>
>>> This series brings a number of small improvements along the way: fast
>>> hypercall implementation and using it for event signaling, rep hypercalls
>>> implementation, hyperv tracing subsystem (which only traces the newly added
>>> remote TLB flush for now).
>>>
>>
>> Hi Vitaly,
>>
>> Could you attach your benchmark? I'm interested in trying the
>> implementation in paravirt KVM.
>>
>
> Oh, this would be cool) I briefly discussed the idea with Radim (one of
> KVM maintainers) during the last KVM Forum and he wasn't opposed to the
> idea. Need to talk to Paolo too. Good thing is that we have everything

I talked with Paolo today and he pointed this feature out to me, so I
believe he likes it. :) In addition, I searched the Hypervisor Top Level
Functional Specification v5.0b.pdf from
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
but didn't find a section introducing the Hyper-V paravirtualized remote
TLB flushing and hypercall stuff. Could you point it out?

Regards,
Wanpeng Li

> in place for guests now (HAVE_RCU_TABLE_FREE is enabled globally on x86).
>
> Please see the microbenchmark attached. Adjust the defines at the
> beginning to match your needs. It is nothing smart, basically just a TLB
> thrasher.
>
> In theory, the best result is achieved when we're overcommitting the
> host by running multiple vCPUs on each pCPU. In this case PV TLB flush
> avoids touching vCPUs which are not scheduled and avoids the wait on the
> initiating CPU.
>
> --
> Vitaly
>


2017-11-06 09:21:23

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

/*
 * pthread_mmap.c - a simple TLB thrasher: every thread repeatedly maps,
 * reads and unmaps chunks of a file, so that munmap() keeps forcing
 * remote TLB flushes on the CPUs running the other threads.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

#define nthreads  48
#define pagecount 16384
#define nrounds   1000
#define nchunks   20
#define PAGE_SIZE 4096

int fd;			/* the file we mmap(), shared by all threads */
unsigned long v;	/* sink for the reads that touch the mappings */

void *threadf(void *ptr)
{
	unsigned long *addr[nchunks];
	int i, j;
	struct timespec ts = {0};

	(void)ptr;	/* unused */

	/* Small random delay to desynchronize the threads a bit. */
	ts.tv_nsec = random() % 1024;

	for (j = 0; j < nrounds; j++) {
		/* Map nchunks windows of the file. */
		for (i = 0; i < nchunks; i++) {
			addr[i] = mmap(NULL, PAGE_SIZE * pagecount, PROT_READ,
				       MAP_SHARED, fd, i * PAGE_SIZE);
			if (addr[i] == MAP_FAILED) {
				fprintf(stderr, "mmap\n");
				exit(1);
			}
		}

		nanosleep(&ts, NULL);

		/* Touch each mapping so TLB entries actually exist. */
		for (i = 0; i < nchunks; i++)
			v += *addr[i];

		nanosleep(&ts, NULL);

		/*
		 * Unmap everything; this is what triggers the remote TLB
		 * flushes we want to measure.
		 */
		for (i = 0; i < nchunks; i++)
			munmap(addr[i], PAGE_SIZE * pagecount);
	}

	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t thr[nthreads];
	int i;

	srandom(time(NULL));

	if (argc < 2) {
		fprintf(stderr, "usage: %s <some-big-file>\n", argv[0]);
		exit(1);
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		fprintf(stderr, "open\n");
		exit(1);
	}

	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&thr[i], NULL, threadf, NULL)) {
			fprintf(stderr, "pthread_create\n");
			exit(1);
		}
	}

	for (i = 0; i < nthreads; i++) {
		if (pthread_join(thr[i], NULL)) {
			fprintf(stderr, "pthread_join\n");
			exit(1);
		}
	}

	return 0;
}


Attachments:
pthread_mmap.c (1.51 kB)
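
To reproduce the numbers from the cover letter, build it with pthread
support and point it at some big file, e.g. (compiler flags and file name
are just an example):

	$ gcc -O2 -pthread pthread_mmap.c -o pthread_mmap
	$ time ./pthread_mmap ./randfile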

2017-11-06 08:45:55

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH v10 0/9] Hyper-V: paravirtualized remote TLB flushing and hypercall improvements

2017-08-03 0:09 GMT+08:00 Vitaly Kuznetsov <[email protected]>:
> Changes since v9:
> - Rebase to 4.13-rc3.
> - Drop PATCH1 as it was already taken by Greg to char-misc tree. There're no
> functional dependencies on this patch so the series can go through a different tree
> (and it actually belongs to x86 if I got Ingo's comment right).
> - Add in missing void return type in PATCH1 [Colin King, Ingo Molnar, Greg KH]
> - A few minor fixes in what is now PATCH7: add pr_fmt, tiny style fix in
> hyperv_flush_tlb_others() [Andy Shevchenko]
> - Fix "error: implicit declaration of function 'virt_to_phys'" in PATCH2
> reported by kbuild test robot (#include <asm/io.h>)
> - Add Steven's 'Reviewed-by:' to PATCH9.
>
> Original description:
>
> Hyper-V supports hypercalls for doing local and remote TLB flushing and
> gives its guests hints when using hypercall is preferred. While doing
> hypercalls for local TLB flushes is probably not practical (and is not
> being suggested by modern Hyper-V versions) remote TLB flush with a
> hypercall brings significant improvement.
>
> To test the series I wrote a special 'TLB trasher': on a 16 vCPU guest I
> was creating 32 threads which were doing 100000 mmap/munmaps each on some
> big file. Here are the results:
>
> Before:
> # time ./pthread_mmap ./randfile
> real 3m33.118s
> user 0m3.698s
> sys 3m16.624s
>
> After:
> # time ./pthread_mmap ./randfile
> real 2m19.920s
> user 0m2.662s
> sys 2m9.948s
>
> This series brings a number of small improvements along the way: fast
> hypercall implementation and using it for event signaling, rep hypercalls
> implementation, hyperv tracing subsystem (which only traces the newly added
> remote TLB flush for now).
>

Hi Vitaly,

Could you attach your benchmark? I'm interested in trying the
implementation in paravirt KVM.

Regards,
Wanpeng Li
