Hello,
This series fixes PM hibernation for HVM guests running on the Xen hypervisor.
A running guest can now be hibernated and resumed successfully at a
later time. The fixes are added to the block and network device drivers,
i.e. xen-blkfront and xen-netfront. Any other driver that needs S4 support
and does not have it yet can follow the same method of introducing
freeze/thaw/restore callbacks.
The patches have been tested against the upstream kernel and Xen 4.11.
Large-scale testing has also been done on Xen-based Amazon EC2 instances. All of
this testing involved running a memory-exhausting workload in the background.
Guest hibernation does not require any support from the hypervisor, so the
guest has complete control over its state. Guest-initiated hibernation can
therefore overcome infrastructure restrictions on saving guest state.
These patches were sent out as an RFC before, and all the feedback has been
incorporated into the patches. The last RFC v3 can be found here:
https://lkml.org/lkml/2020/2/14/2789
Known issues:
1. KASLR causes intermittent hibernation failures. The VM fails to resume and
has to be restarted. I will investigate this issue separately; it shouldn't
be a blocker for this patch series.
2. During hibernation, I sometimes observed that freezing of tasks fails due
to a busy XFS workqueue [xfs-cil/xfs-sync]. This is also intermittent, maybe 1
out of 200 runs, and hibernation is aborted in this case. Retrying hibernation
may work. This is a known issue with hibernation and some filesystems such as
XFS; it has been discussed by the community for years without an effective
resolution at this point.
Testing how-to:
---------------
1. Set up the Xen hypervisor on a physical machine [I used Ubuntu 16.04 +
upstream xen-4.11].
2. Bring up an HVM guest with a kernel compiled with the hibernation patches
[I used Ubuntu 18.04 netboot bionic images and also Amazon Linux on-prem images].
3. Create a swap file at least as large as RAM.
4. Update the grub parameters and reboot.
5. Trigger PM hibernation from within the VM.
Example:
Set up a file-backed swap space. The swap file size must be >= the total memory on the system:
sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
sudo chmod 600 /swap
sudo mkswap /swap
sudo swapon /swap
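The count=4096 above is only right for a guest with up to 4 GiB of RAM. As a minimal sketch, assuming the MemTotal line of /proc/meminfo is a good enough measure of "total memory", the dd count can be derived like this. The helper names swap_mib_needed() and dd_command() are hypothetical and only for illustration.

```python
def swap_mib_needed(meminfo_text):
    """Return the swap file size in MiB needed to cover MemTotal."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # /proc/meminfo reports the value in kB
            return (kib + 1023) // 1024  # round up to whole MiB
    raise ValueError("MemTotal not found")


def dd_command(meminfo_text, path="/swap"):
    """Build the dd invocation from the example with a computed count."""
    count = swap_mib_needed(meminfo_text)
    return ("sudo dd if=/dev/zero of={} bs=$(( 1024 * 1024 )) count={}"
            .format(path, count))
```

In practice the text would come from open("/proc/meminfo").read(); rounding up keeps the swap file from ending up slightly smaller than memory.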
Update resume device/resume offset in grub if using swap file:
resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1
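Note that resume_offset is interpreted in PAGE_SIZE units, while a FIBMAP lookup on the swap file returns a block number in filesystem-block units. The two are equal only when the filesystem block size matches the page size. The sketch below shows the conversion and how the parameters above could be assembled. The function names and the 4096-byte defaults are assumptions for illustration, not part of this series.

```python
def resume_offset_pages(fibmap_block, fs_block_size=4096, page_size=4096):
    """Convert a FIBMAP physical block number into PAGE_SIZE units."""
    byte_offset = fibmap_block * fs_block_size
    if byte_offset % page_size:
        raise ValueError("swap header is not page aligned")
    return byte_offset // page_size


def resume_cmdline(device, fibmap_block, fs_block_size=4096, page_size=4096):
    """Build the resume-related kernel parameters for the grub config."""
    offset = resume_offset_pages(fibmap_block, fs_block_size, page_size)
    return "resume={} resume_offset={} no_console_suspend=1".format(device, offset)
```

With a 4 KiB filesystem block size the FIBMAP value can be used unchanged, which is the case the example parameters assume.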
Execute:
--------
sudo pm-hibernate
OR
echo reboot > /sys/power/disk && echo disk > /sys/power/state
Compute resume offset code:
"
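#!/usr/bin/env python3
import sys
import array
import fcntl
# FIBMAP ioctl number; maps a logical file block to a physical block (needs root)
FIBMAP = 0x01
# swap file
f = open(sys.argv[1], 'rb')
# in: logical block 0 of the file, out: its physical block number
buf = array.array('i', [0])
fcntl.ioctl(f.fileno(), FIBMAP, buf)
f.close()
# the value is in filesystem blocks; resume_offset expects PAGE_SIZE units
print(buf[0])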
"
Anchal Agarwal (5):
x86/xen: Introduce new function to map HYPERVISOR_shared_info on
Resume
genirq: Shutdown irq chips in suspend/resume during hibernation
xen: Introduce wrapper for save/restore sched clock offset
xen: Update sched clock offset to avoid system instability in
hibernation
PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
Munehisa Kamata (7):
xen/manage: keep track of the on-going suspend mode
xenbus: add freeze/thaw/restore callbacks support
x86/xen: add system core suspend and resume callbacks
xen-blkfront: add callbacks for PM suspend and hibernation
xen-netfront: add callbacks for PM suspend and hibernation
xen/time: introduce xen_{save,restore}_steal_clock
x86/xen: save and restore steal clock
arch/x86/xen/enlighten_hvm.c | 8 ++
arch/x86/xen/suspend.c | 72 ++++++++++++++++++
arch/x86/xen/time.c | 18 ++++-
arch/x86/xen/xen-ops.h | 3 +
drivers/block/xen-blkfront.c | 122 ++++++++++++++++++++++++++++--
drivers/net/xen-netfront.c | 98 +++++++++++++++++++++++-
drivers/xen/events/events_base.c | 1 +
drivers/xen/manage.c | 73 ++++++++++++++++++
drivers/xen/time.c | 29 ++++++-
drivers/xen/xenbus/xenbus_probe.c | 99 +++++++++++++++++++-----
include/linux/irq.h | 2 +
include/xen/xen-ops.h | 8 ++
include/xen/xenbus.h | 3 +
kernel/irq/chip.c | 2 +-
kernel/irq/internals.h | 1 +
kernel/irq/pm.c | 31 +++++---
kernel/power/user.c | 6 +-
17 files changed, 536 insertions(+), 40 deletions(-)
--
2.24.1.AMZN
From: Munehisa Kamata <[email protected]>
Guest hibernation is different from Xen suspend/resume/live migration.
Xen save/restore does not use pm_ops, which guest hibernation needs.
Hibernation in the guest follows the ACPI path and is guest initiated; the
hibernation image is saved within the guest. The latter modes, by contrast,
are Xen toolstack assisted, and image creation/storage is under the
control of the hypervisor/host machine.
To differentiate between Xen suspend and PM hibernation, keep track
of the on-going suspend mode, mainly by using a new PM notifier.
Introduce simple functions that report the on-going suspend mode
so that other Xen-related code can behave differently according to the
current suspend mode.
Since Xen suspend doesn't have a corresponding PM event, its main logic
is modified to acquire pm_mutex and set the current mode.
Though acquiring pm_mutex is still the right thing to do, we may
see a deadlock if PM hibernation is interrupted by Xen suspend.
PM hibernation depends on the xenwatch thread to process xenbus state
transactions, but the thread will sleep waiting for pm_mutex, which is
already held by the PM hibernation context in that scenario. Xen shutdown
code may need some changes to avoid the issue.
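The mode tracking described above can be modelled in user space as follows. This is only a sketch of the state changes, with the PM event names taken from the patch. It leaves out pm_mutex entirely, and the class and method names are hypothetical.

```python
from enum import Enum


class SuspendMode(Enum):
    NO_SUSPEND = 0
    XEN_SUSPEND = 1
    PM_SUSPEND = 2
    PM_HIBERNATION = 3


# PM notifier events handled by the patch, keyed by name.
_EVENT_TO_MODE = {
    "PM_SUSPEND_PREPARE": SuspendMode.PM_SUSPEND,
    "PM_HIBERNATION_PREPARE": SuspendMode.PM_HIBERNATION,
    "PM_RESTORE_PREPARE": SuspendMode.PM_HIBERNATION,
    "PM_POST_SUSPEND": SuspendMode.NO_SUSPEND,
    "PM_POST_RESTORE": SuspendMode.NO_SUSPEND,
    "PM_POST_HIBERNATION": SuspendMode.NO_SUSPEND,
}


class SuspendTracker:
    """User-space model of the suspend_mode variable in manage.c."""

    def __init__(self):
        self.mode = SuspendMode.NO_SUSPEND

    def pm_notify(self, event):
        # Models xen_pm_notifier(): the PM core reports the transition.
        if event not in _EVENT_TO_MODE:
            raise ValueError("unknown PM event: " + event)
        self.mode = _EVENT_TO_MODE[event]

    def xen_suspend_begin(self):
        # Xen suspend has no PM event, so do_suspend() sets the mode itself.
        self.mode = SuspendMode.XEN_SUSPEND

    def xen_suspend_end(self):
        self.mode = SuspendMode.NO_SUSPEND

    def is_xen_suspend(self):
        return self.mode is SuspendMode.XEN_SUSPEND

    def is_pm_suspend(self):
        return self.mode is SuspendMode.PM_SUSPEND

    def is_pm_hibernation(self):
        return self.mode is SuspendMode.PM_HIBERNATION
```

Other Xen code then only needs the three is_*() style queries, which is what the xen_suspend_mode_is_*() helpers provide.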
[Anchal Changelog: Code refactoring]
Signed-off-by: Anchal Agarwal <[email protected]>
Signed-off-by: Munehisa Kamata <[email protected]>
---
drivers/xen/manage.c | 73 +++++++++++++++++++++++++++++++++++++++++++
include/xen/xen-ops.h | 3 ++
2 files changed, 76 insertions(+)
diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..0b30ab522b77 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -14,6 +14,7 @@
#include <linux/freezer.h>
#include <linux/syscore_ops.h>
#include <linux/export.h>
+#include <linux/suspend.h>
#include <xen/xen.h>
#include <xen/xenbus.h>
@@ -40,6 +41,31 @@ enum shutdown_state {
/* Ignore multiple shutdown requests. */
static enum shutdown_state shutting_down = SHUTDOWN_INVALID;
+enum suspend_modes {
+ NO_SUSPEND = 0,
+ XEN_SUSPEND,
+ PM_SUSPEND,
+ PM_HIBERNATION,
+};
+
+/* Protected by pm_mutex */
+static enum suspend_modes suspend_mode = NO_SUSPEND;
+
+bool xen_suspend_mode_is_xen_suspend(void)
+{
+ return suspend_mode == XEN_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_suspend(void)
+{
+ return suspend_mode == PM_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_hibernation(void)
+{
+ return suspend_mode == PM_HIBERNATION;
+}
+
struct suspend_info {
int cancelled;
};
@@ -99,6 +125,10 @@ static void do_suspend(void)
int err;
struct suspend_info si;
+ lock_system_sleep();
+
+ suspend_mode = XEN_SUSPEND;
+
shutting_down = SHUTDOWN_SUSPEND;
err = freeze_processes();
@@ -162,6 +192,10 @@ static void do_suspend(void)
thaw_processes();
out:
shutting_down = SHUTDOWN_INVALID;
+
+ suspend_mode = NO_SUSPEND;
+
+ unlock_system_sleep();
}
#endif /* CONFIG_HIBERNATE_CALLBACKS */
@@ -387,3 +421,42 @@ int xen_setup_shutdown_event(void)
EXPORT_SYMBOL_GPL(xen_setup_shutdown_event);
subsys_initcall(xen_setup_shutdown_event);
+
+static int xen_pm_notifier(struct notifier_block *notifier,
+ unsigned long pm_event, void *unused)
+{
+ switch (pm_event) {
+ case PM_SUSPEND_PREPARE:
+ suspend_mode = PM_SUSPEND;
+ break;
+ case PM_HIBERNATION_PREPARE:
+ case PM_RESTORE_PREPARE:
+ suspend_mode = PM_HIBERNATION;
+ break;
+ case PM_POST_SUSPEND:
+ case PM_POST_RESTORE:
+ case PM_POST_HIBERNATION:
+ /* Set back to the default */
+ suspend_mode = NO_SUSPEND;
+ break;
+ default:
+ pr_warn("Receive unknown PM event 0x%lx\n", pm_event);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static struct notifier_block xen_pm_notifier_block = {
+ .notifier_call = xen_pm_notifier
+};
+
+static int xen_setup_pm_notifier(void)
+{
+ if (!xen_hvm_domain())
+ return -ENODEV;
+
+ return register_pm_notifier(&xen_pm_notifier_block);
+}
+
+subsys_initcall(xen_setup_pm_notifier);
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 095be1d66f31..4ffe031adfc7 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -40,6 +40,9 @@ u64 xen_steal_clock(int cpu);
int xen_setup_shutdown_event(void);
+bool xen_suspend_mode_is_xen_suspend(void);
+bool xen_suspend_mode_is_pm_suspend(void);
+bool xen_suspend_mode_is_pm_hibernation(void);
extern unsigned long *xen_contiguous_bitmap;
#if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
--
2.24.1.AMZN
From: Munehisa Kamata <[email protected]>
Since commit b3e96c0c7562 ("xen: use freeze/restore/thaw PM events for
suspend/resume/chkpt"), xenbus uses PMSG_FREEZE, PMSG_THAW and
PMSG_RESTORE events for Xen suspend. However, they're actually assigned
to xenbus_dev_suspend(), xenbus_dev_cancel() and xenbus_dev_resume()
respectively, and only suspend and resume callbacks are supported at
driver level. To support PM suspend and PM hibernation, modify the bus
level PM callbacks to invoke not only device driver's suspend/resume but
also freeze/thaw/restore.
Note that we'll use freeze/restore callbacks even for PM suspend, whereas
suspend/resume callbacks are normally used in that case, because the
existing xenbus device drivers already have suspend/resume callbacks
specifically designed for Xen suspend. This lets the device
drivers keep their existing callbacks without modification.
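The resulting dispatch can be summarised with a small lookup table. This is a user-space sketch of which xenbus_driver callback each bus-level callback ends up invoking. The function name driver_callback() is hypothetical, and the real code also handles otherend watches and error paths.

```python
# PM message -> bus-level callback, as assigned since commit b3e96c0c7562.
PMSG_TO_BUS_CALLBACK = {
    "PMSG_FREEZE": "xenbus_dev_suspend",
    "PMSG_THAW": "xenbus_dev_cancel",
    "PMSG_RESTORE": "xenbus_dev_resume",
}

# bus-level callback -> (driver callback for Xen suspend,
#                        driver callback for PM suspend/hibernation)
BUS_TO_DRIVER_CALLBACK = {
    "xenbus_dev_suspend": ("suspend", "freeze"),
    "xenbus_dev_cancel": (None, "thaw"),  # Xen suspend: nothing to do
    "xenbus_dev_resume": ("resume", "restore"),
}


def driver_callback(pmsg, xen_suspend):
    """Return the driver callback name for a PM message, or None."""
    bus_cb = PMSG_TO_BUS_CALLBACK[pmsg]
    for_xen, for_pm = BUS_TO_DRIVER_CALLBACK[bus_cb]
    return for_xen if xen_suspend else for_pm
```

For example, driver_callback("PMSG_FREEZE", xen_suspend=False) selects the new freeze callback, while the same message during Xen suspend still reaches the driver's existing suspend callback.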
[Anchal Changelog: Refactored the callbacks code]
Signed-off-by: Agarwal Anchal <[email protected]>
Signed-off-by: Munehisa Kamata <[email protected]>
---
drivers/xen/xenbus/xenbus_probe.c | 99 +++++++++++++++++++++++++------
include/xen/xenbus.h | 3 +
2 files changed, 84 insertions(+), 18 deletions(-)
diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
index 8c4d05b687b7..1589b9b2cb56 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -49,6 +49,7 @@
#include <linux/io.h>
#include <linux/slab.h>
#include <linux/module.h>
+#include <linux/suspend.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -599,27 +600,44 @@ int xenbus_dev_suspend(struct device *dev)
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
-
+ bool xen_suspend = xen_suspend_mode_is_xen_suspend();
DPRINTK("%s", xdev->nodename);
if (dev->driver == NULL)
return 0;
drv = to_xenbus_driver(dev->driver);
- if (drv->suspend)
- err = drv->suspend(xdev);
- if (err)
- pr_warn("suspend %s failed: %i\n", dev_name(dev), err);
+
+ if (xen_suspend) {
+ if (drv->suspend)
+ err = drv->suspend(xdev);
+ } else {
+ if (drv->freeze) {
+ err = drv->freeze(xdev);
+ if (!err) {
+ free_otherend_watch(xdev);
+ free_otherend_details(xdev);
+ return 0;
+ }
+ }
+ }
+
+ if (err) {
+ pr_warn("%s %s failed: %i\n", xen_suspend ?
+ "suspend" : "freeze", dev_name(dev), err);
+ return err;
+ }
+
return 0;
}
EXPORT_SYMBOL_GPL(xenbus_dev_suspend);
int xenbus_dev_resume(struct device *dev)
{
- int err;
+ int err = 0;
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
-
+ bool xen_suspend = xen_suspend_mode_is_xen_suspend();
DPRINTK("%s", xdev->nodename);
if (dev->driver == NULL)
@@ -627,24 +645,32 @@ int xenbus_dev_resume(struct device *dev)
drv = to_xenbus_driver(dev->driver);
err = talk_to_otherend(xdev);
if (err) {
- pr_warn("resume (talk_to_otherend) %s failed: %i\n",
+ pr_warn("%s (talk_to_otherend) %s failed: %i\n",
+ xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
- xdev->state = XenbusStateInitialising;
+ if (xen_suspend) {
+ xdev->state = XenbusStateInitialising;
+ if (drv->resume)
+ err = drv->resume(xdev);
+ } else {
+ if (drv->restore)
+ err = drv->restore(xdev);
+ }
- if (drv->resume) {
- err = drv->resume(xdev);
- if (err) {
- pr_warn("resume %s failed: %i\n", dev_name(dev), err);
- return err;
- }
+ if (err) {
+ pr_warn("%s %s failed: %i\n",
+ xen_suspend ? "resume" : "restore",
+ dev_name(dev), err);
+ return err;
}
err = watch_otherend(xdev);
if (err) {
- pr_warn("resume (watch_otherend) %s failed: %d.\n",
+ pr_warn("%s (watch_otherend) %s failed: %d.\n",
+ xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
@@ -655,8 +681,45 @@ EXPORT_SYMBOL_GPL(xenbus_dev_resume);
int xenbus_dev_cancel(struct device *dev)
{
- /* Do nothing */
- DPRINTK("cancel");
+ int err = 0;
+ struct xenbus_driver *drv;
+ struct xenbus_device *xdev
+ = container_of(dev, struct xenbus_device, dev);
+ bool xen_suspend = xen_suspend_mode_is_xen_suspend();
+
+ if (xen_suspend) {
+ /* Do nothing */
+ DPRINTK("cancel");
+ return 0;
+ }
+
+ DPRINTK("%s", xdev->nodename);
+
+ if (dev->driver == NULL)
+ return 0;
+ drv = to_xenbus_driver(dev->driver);
+ err = talk_to_otherend(xdev);
+ if (err) {
+ pr_warn("thaw (talk_to_otherend) %s failed: %d.\n",
+ dev_name(dev), err);
+ return err;
+ }
+
+ if (drv->thaw) {
+ err = drv->thaw(xdev);
+ if (err) {
+ pr_warn("thaw %s failed: %i\n", dev_name(dev), err);
+ return err;
+ }
+ }
+
+ err = watch_otherend(xdev);
+ if (err) {
+ pr_warn("thaw (watch_otherend) %s failed: %d.\n",
+ dev_name(dev), err);
+ return err;
+ }
+
return 0;
}
EXPORT_SYMBOL_GPL(xenbus_dev_cancel);
diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index 5a8315e6d8a6..8da964763255 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -104,6 +104,9 @@ struct xenbus_driver {
int (*remove)(struct xenbus_device *dev);
int (*suspend)(struct xenbus_device *dev);
int (*resume)(struct xenbus_device *dev);
+ int (*freeze)(struct xenbus_device *dev);
+ int (*thaw)(struct xenbus_device *dev);
+ int (*restore)(struct xenbus_device *dev);
int (*uevent)(struct xenbus_device *, struct kobj_uevent_env *);
struct device_driver driver;
int (*read_otherend_details)(struct xenbus_device *dev);
--
2.24.1.AMZN
Introduce a small function which re-uses the shared page's PA allocated
during guest initialization in reserve_shared_info() instead of
allocating a new page during the resume flow.
It also maps the shared_info page by calling
xen_hvm_init_shared_info().
Signed-off-by: Anchal Agarwal <[email protected]>
---
arch/x86/xen/enlighten_hvm.c | 7 +++++++
arch/x86/xen/xen-ops.h | 1 +
2 files changed, 8 insertions(+)
diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index e138f7de52d2..75b1ec7a0fcd 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -27,6 +27,13 @@
static unsigned long shared_info_pfn;
+void xen_hvm_map_shared_info(void)
+{
+ xen_hvm_init_shared_info();
+ if (shared_info_pfn)
+ HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
+}
+
void xen_hvm_init_shared_info(void)
{
struct xen_add_to_physmap xatp;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 45a441c33d6d..d84c357994bd 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -56,6 +56,7 @@ void xen_enable_syscall(void);
void xen_vcpu_restore(void);
void xen_callback_vector(void);
+void xen_hvm_map_shared_info(void);
void xen_hvm_init_shared_info(void);
void xen_unplug_emulated_devices(void);
--
2.24.1.AMZN
From: Munehisa Kamata <[email protected]>
Add Xen PVHVM specific system core callbacks for PM suspend and
hibernation support. The callbacks suspend and resume Xen
primitives, like shared_info, pvclock and grant table. Note that
Xen suspend handles them in a different manner, but the system
core callbacks are also called from the Xen suspend context. So if the
callbacks are called from the Xen suspend context, return immediately.
Signed-off-by: Agarwal Anchal <[email protected]>
Signed-off-by: Munehisa Kamata <[email protected]>
---
arch/x86/xen/enlighten_hvm.c | 1 +
arch/x86/xen/suspend.c | 53 ++++++++++++++++++++++++++++++++++++
include/xen/xen-ops.h | 3 ++
3 files changed, 57 insertions(+)
diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 75b1ec7a0fcd..138e71786e03 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -204,6 +204,7 @@ static void __init xen_hvm_guest_init(void)
if (xen_feature(XENFEAT_hvm_callback_vector))
xen_have_vector_callback = 1;
+ xen_setup_syscore_ops();
xen_hvm_smp_init();
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
xen_unplug_emulated_devices();
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 1d83152c761b..784c4484100b 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -2,17 +2,22 @@
#include <linux/types.h>
#include <linux/tick.h>
#include <linux/percpu-defs.h>
+#include <linux/syscore_ops.h>
+#include <linux/kernel_stat.h>
#include <xen/xen.h>
#include <xen/interface/xen.h>
+#include <xen/interface/memory.h>
#include <xen/grant_table.h>
#include <xen/events.h>
+#include <xen/xen-ops.h>
#include <asm/cpufeatures.h>
#include <asm/msr-index.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/page.h>
#include <asm/fixmap.h>
+#include <asm/pvclock.h>
#include "xen-ops.h"
#include "mmu.h"
@@ -82,3 +87,51 @@ void xen_arch_suspend(void)
on_each_cpu(xen_vcpu_notify_suspend, NULL, 1);
}
+
+static int xen_syscore_suspend(void)
+{
+ struct xen_remove_from_physmap xrfp;
+ int ret;
+
+ /* Xen suspend does similar work in its own logic */
+ if (xen_suspend_mode_is_xen_suspend())
+ return 0;
+
+ xrfp.domid = DOMID_SELF;
+ xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
+
+ ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrfp);
+ if (!ret)
+ HYPERVISOR_shared_info = &xen_dummy_shared_info;
+
+ return ret;
+}
+
+static void xen_syscore_resume(void)
+{
+ /* Xen suspend does similar work in its own logic */
+ if (xen_suspend_mode_is_xen_suspend())
+ return;
+
+ /* No need to setup vcpu_info as it's already moved off */
+ xen_hvm_map_shared_info();
+
+ pvclock_resume();
+
+ gnttab_resume();
+}
+
+/*
+ * These callbacks will be called with interrupts disabled and when having only
+ * one CPU online.
+ */
+static struct syscore_ops xen_hvm_syscore_ops = {
+ .suspend = xen_syscore_suspend,
+ .resume = xen_syscore_resume
+};
+
+void __init xen_setup_syscore_ops(void)
+{
+ if (xen_hvm_domain())
+ register_syscore_ops(&xen_hvm_syscore_ops);
+}
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 4ffe031adfc7..89b1e88712d6 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -43,6 +43,9 @@ int xen_setup_shutdown_event(void);
bool xen_suspend_mode_is_xen_suspend(void);
bool xen_suspend_mode_is_pm_suspend(void);
bool xen_suspend_mode_is_pm_hibernation(void);
+
+void xen_setup_syscore_ops(void);
+
extern unsigned long *xen_contiguous_bitmap;
#if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
--
2.24.1.AMZN
Many legacy device drivers do not implement power management (PM)
functions which means that interrupts requested by these drivers stay
in active state when the kernel is hibernated.
This does not matter on bare metal and on most hypervisors because the
interrupt is restored on resume without any noticeable side effects as
it stays connected to the same physical or virtual interrupt line.
The XEN interrupt mechanism is different as it maintains a mapping
between the Linux interrupt number and a XEN event channel. If the
interrupt stays active on hibernation this mapping is preserved but
there is unfortunately no guarantee that on resume the same event
channels are reassigned to these devices. This can result in event
channel conflicts which prevent the affected devices from being
restored correctly.
One way to solve this would be to add the necessary power management
functions to all affected legacy device drivers, but that's a
questionable effort which does not provide any benefits on non-XEN
environments.
The least intrusive and most efficient solution is to provide a
mechanism which allows the core interrupt code to tear down these
interrupts on hibernation and bring them back up again on resume. This
allows the XEN event channel mechanism to assign an arbitrary event
channel on resume without affecting the functionality of these
devices.
Fortunately all these device interrupts are handled by a dedicated XEN
interrupt chip so the chip can be marked that all interrupts connected
to it are handled this way. This is pretty much in line with the other
interrupt chip specific quirks, e.g. IRQCHIP_MASK_ON_SUSPEND.
Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
it to the core interrupt suspend/resume paths.
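The decision the core takes per interrupt can be sketched as below. This is a user-space model of the branch added to suspend_device_irq() and resume_irq(), not kernel code. The flag values mirror include/linux/irq.h with this patch applied, and the function names suspend_actions() and resume_action() are hypothetical.

```python
IRQCHIP_MASK_ON_SUSPEND = 1 << 2
IRQCHIP_SHUTDOWN_ON_SUSPEND = 1 << 9  # new flag introduced by this patch


def suspend_actions(chip_flags):
    """Ordered core actions for a non-wakeup interrupt on suspend."""
    if chip_flags & IRQCHIP_SHUTDOWN_ON_SUSPEND:
        # Full teardown, so the event channel can be re-assigned later.
        return ["irq_shutdown"]
    actions = ["__disable_irq"]
    if chip_flags & IRQCHIP_MASK_ON_SUSPEND:
        actions.append("mask_irq")
    return actions


def resume_action(chip_flags):
    """Core action used to bring the interrupt back on resume."""
    if chip_flags & IRQCHIP_SHUTDOWN_ON_SUSPEND:
        return "__irq_startup"
    return "__enable_irq"
```

A chip without the new flag keeps exactly the previous behaviour, so only xen_pirq_chip is affected by the change.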
Signed-off-by: Anchal Agarwal <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
drivers/xen/events/events_base.c | 1 +
include/linux/irq.h | 2 ++
kernel/irq/chip.c | 2 +-
kernel/irq/internals.h | 1 +
kernel/irq/pm.c | 31 ++++++++++++++++++++++---------
5 files changed, 27 insertions(+), 10 deletions(-)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 3a791c8485d0..decf65bd3451 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1613,6 +1613,7 @@ static struct irq_chip xen_pirq_chip __read_mostly = {
.irq_set_affinity = set_affinity_irq,
.irq_retrigger = retrigger_dynirq,
+ .flags = IRQCHIP_SHUTDOWN_ON_SUSPEND,
};
static struct irq_chip xen_percpu_chip __read_mostly = {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 8d5bc2c237d7..94cb8c994d06 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -542,6 +542,7 @@ struct irq_chip {
* IRQCHIP_EOI_THREADED: Chip requires eoi() on unmask in threaded mode
* IRQCHIP_SUPPORTS_LEVEL_MSI Chip can provide two doorbells for Level MSIs
* IRQCHIP_SUPPORTS_NMI: Chip can deliver NMIs, only for root irqchips
+ * IRQCHIP_SHUTDOWN_ON_SUSPEND: Shutdown non wake irqs in the suspend path
*/
enum {
IRQCHIP_SET_TYPE_MASKED = (1 << 0),
@@ -553,6 +554,7 @@ enum {
IRQCHIP_EOI_THREADED = (1 << 6),
IRQCHIP_SUPPORTS_LEVEL_MSI = (1 << 7),
IRQCHIP_SUPPORTS_NMI = (1 << 8),
+ IRQCHIP_SHUTDOWN_ON_SUSPEND = (1 << 9),
};
#include <linux/irqdesc.h>
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 41e7e37a0928..fd59489ff14b 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -233,7 +233,7 @@ __irq_startup_managed(struct irq_desc *desc, struct cpumask *aff, bool force)
}
#endif
-static int __irq_startup(struct irq_desc *desc)
+int __irq_startup(struct irq_desc *desc)
{
struct irq_data *d = irq_desc_get_irq_data(desc);
int ret = 0;
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 7db284b10ac9..b6fca5eacff7 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -80,6 +80,7 @@ extern void __enable_irq(struct irq_desc *desc);
extern int irq_activate(struct irq_desc *desc);
extern int irq_activate_and_startup(struct irq_desc *desc, bool resend);
extern int irq_startup(struct irq_desc *desc, bool resend, bool force);
+extern int __irq_startup(struct irq_desc *desc);
extern void irq_shutdown(struct irq_desc *desc);
extern void irq_shutdown_and_deactivate(struct irq_desc *desc);
diff --git a/kernel/irq/pm.c b/kernel/irq/pm.c
index 8f557fa1f4fe..dc48a25f1756 100644
--- a/kernel/irq/pm.c
+++ b/kernel/irq/pm.c
@@ -85,16 +85,25 @@ static bool suspend_device_irq(struct irq_desc *desc)
}
desc->istate |= IRQS_SUSPENDED;
- __disable_irq(desc);
-
/*
- * Hardware which has no wakeup source configuration facility
- * requires that the non wakeup interrupts are masked at the
- * chip level. The chip implementation indicates that with
- * IRQCHIP_MASK_ON_SUSPEND.
+ * Some irq chips (e.g. XEN PIRQ) require a full shutdown on suspend
+ * as some of the legacy drivers(e.g. floppy) do nothing during the
+ * suspend path
*/
- if (irq_desc_get_chip(desc)->flags & IRQCHIP_MASK_ON_SUSPEND)
- mask_irq(desc);
+ if (irq_desc_get_chip(desc)->flags & IRQCHIP_SHUTDOWN_ON_SUSPEND) {
+ irq_shutdown(desc);
+ } else {
+ __disable_irq(desc);
+
+ /*
+ * Hardware which has no wakeup source configuration facility
+ * requires that the non wakeup interrupts are masked at the
+ * chip level. The chip implementation indicates that with
+ * IRQCHIP_MASK_ON_SUSPEND.
+ */
+ if (irq_desc_get_chip(desc)->flags & IRQCHIP_MASK_ON_SUSPEND)
+ mask_irq(desc);
+ }
return true;
}
@@ -152,7 +161,11 @@ static void resume_irq(struct irq_desc *desc)
irq_state_set_masked(desc);
resume:
desc->istate &= ~IRQS_SUSPENDED;
- __enable_irq(desc);
+
+ if (irq_desc_get_chip(desc)->flags & IRQCHIP_SHUTDOWN_ON_SUSPEND)
+ __irq_startup(desc);
+ else
+ __enable_irq(desc);
}
static void resume_irqs(bool want_early)
--
2.24.1.AMZN
From: Munehisa Kamata <[email protected]>
S4 power transition states are much different from Xen
suspend/resume. The former is visible to the guest, and frontend drivers should
be aware of the state transitions and be able to take appropriate
actions when needed. In the transition to S4 we need to make sure that at least
all the in-flight blkif requests get completed, since they probably contain
bits of the guest's memory image and that's not going to get saved any
other way. Hence, re-issuing in-flight requests, as is done for Xen resume,
will not work here. This is in contrast to Xen suspend, where we need to
freeze with as little processing as possible to avoid dirtying RAM late in
the migration cycle and we know that in-flight data can wait.
Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. All frontend drivers that need to use PM_HIBERNATION/PM_SUSPEND
events need to implement these xenbus_driver callbacks. The freeze handler
stops the block-layer queue and disconnects the frontend from the backend while
freeing ring_info and associated resources. Before disconnecting from the
backend, we need to prevent any new IO from being queued and wait for existing
IO to complete. Freezing/unfreezing the queues guarantees that there are no
requests in use on the shared ring. However, for sanity we should check the
state of the ring before disconnecting to make sure that there are no
outstanding requests to be processed on the ring. The restore handler
re-allocates ring_info, unquiesces and unfreezes the queue and re-connects to
the backend, so that the rest of the kernel can continue to use the block device
transparently.
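The connection states that the frontend moves through can be modelled as a small transition table. This is a simplified user-space sketch. The state names follow the BLKIF_STATE_* values in the patch, while the event names and the step()/run() helpers are hypothetical and only label what happens at each point.

```python
TRANSITIONS = {
    # freeze handler starts: queue frozen and quiesced
    ("CONNECTED", "freeze"): "FREEZING",
    # ring still has work: freeze is aborted with -EBUSY
    ("FREEZING", "ring_busy"): "CONNECTED",
    # backend reached Closed: rings freed, waiter completed
    ("FREEZING", "backend_closed"): "FROZEN",
    # a stray Closed while thawing/restoring is ignored
    ("FROZEN", "backend_closed"): "FROZEN",
    # thaw/restore reconnects to the backend
    ("FROZEN", "backend_connected"): "CONNECTED",
}


def step(state, event):
    """Apply one event; reject transitions the model does not know."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError("unexpected {} in state {}".format(event, state))


def run(events, state="CONNECTED"):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

A successful hibernate and resume cycle is then run(["freeze", "backend_closed", "backend_connected"]), which ends back in CONNECTED without re-issuing any request.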
Note: For older backends, if a backend doesn't have commit 12ea729645ace
("xen/blkback: unmap all persistent grants when frontend gets disconnected"),
the frontend may see a massive amount of grant table warnings when freeing
resources.
[ 36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
[ 36.855089] xen:grant_table: WARNING: g.e. 0x112 still in use!
In this case, persistent grants would need to be disabled.
[Anchal Changelog: Removed timeout/request during blkfront freeze.
Reworked the whole patch to work with blk-mq and incorporate upstream's
comments]
Signed-off-by: Anchal Agarwal <[email protected]>
Signed-off-by: Munehisa Kamata <[email protected]>
---
drivers/block/xen-blkfront.c | 122 +++++++++++++++++++++++++++++++++--
1 file changed, 115 insertions(+), 7 deletions(-)
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 3b889ea950c2..464863ed7093 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -48,6 +48,8 @@
#include <linux/list.h>
#include <linux/workqueue.h>
#include <linux/sched/mm.h>
+#include <linux/completion.h>
+#include <linux/delay.h>
#include <xen/xen.h>
#include <xen/xenbus.h>
@@ -80,6 +82,8 @@ enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
BLKIF_STATE_SUSPENDED,
+ BLKIF_STATE_FREEZING,
+ BLKIF_STATE_FROZEN
};
struct grant {
@@ -219,6 +223,7 @@ struct blkfront_info
struct list_head requests;
struct bio_list bio_list;
struct list_head info_list;
+ struct completion wait_backend_disconnected;
};
static unsigned int nr_minors;
@@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
info->sector_size = sector_size;
info->physical_sector_size = physical_sector_size;
blkif_set_queue_limits(info);
+ init_completion(&info->wait_backend_disconnected);
return 0;
}
@@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
case XEN_SCSI_DISK5_MAJOR:
case XEN_SCSI_DISK6_MAJOR:
case XEN_SCSI_DISK7_MAJOR:
- *offset = (*minor / PARTS_PER_DISK) +
+ *offset = (*minor / PARTS_PER_DISK) +
((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
EMULATED_SD_DISK_NAME_OFFSET;
*minor = *minor +
@@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
case XEN_SCSI_DISK13_MAJOR:
case XEN_SCSI_DISK14_MAJOR:
case XEN_SCSI_DISK15_MAJOR:
- *offset = (*minor / PARTS_PER_DISK) +
+ *offset = (*minor / PARTS_PER_DISK) +
((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
EMULATED_SD_DISK_NAME_OFFSET;
*minor = *minor +
@@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
unsigned int i;
struct blkfront_ring_info *rinfo;
+ if (info->connected == BLKIF_STATE_FREEZING)
+ goto free_rings;
/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
@@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
if (info->rq)
blk_mq_stop_hw_queues(info->rq);
+free_rings:
for_each_rinfo(info, rinfo, i)
blkif_free_ring(rinfo);
@@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
struct blkfront_info *info = rinfo->dev_info;
- if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
- return IRQ_HANDLED;
+ if (unlikely(info->connected != BLKIF_STATE_CONNECTED
+ && info->connected != BLKIF_STATE_FREEZING)){
+ return IRQ_HANDLED;
+ }
spin_lock_irqsave(&rinfo->ring_lock, flags);
again:
@@ -2027,6 +2038,7 @@ static int blkif_recover(struct blkfront_info *info)
unsigned int segs;
struct blkfront_ring_info *rinfo;
+ bool frozen = info->connected == BLKIF_STATE_FROZEN;
blkfront_gather_backend_features(info);
/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
blkif_set_queue_limits(info);
@@ -2048,6 +2060,9 @@ static int blkif_recover(struct blkfront_info *info)
kick_pending_request_queues(rinfo);
}
+ if (frozen)
+ return 0;
+
list_for_each_entry_safe(req, n, &info->requests, queuelist) {
/* Requeue pending requests (flush or discard) */
list_del_init(&req->queuelist);
@@ -2364,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
return;
case BLKIF_STATE_SUSPENDED:
+ case BLKIF_STATE_FROZEN:
/*
* If we are recovering from suspension, we need to wait
* for the backend to announce it's features before
@@ -2481,12 +2497,36 @@ static void blkback_changed(struct xenbus_device *dev,
break;
case XenbusStateClosed:
- if (dev->state == XenbusStateClosed)
+ if (dev->state == XenbusStateClosed) {
+ if (info->connected == BLKIF_STATE_FREEZING) {
+ blkif_free(info, 0);
+ info->connected = BLKIF_STATE_FROZEN;
+ complete(&info->wait_backend_disconnected);
+ break;
+ }
+
break;
+ }
+
+ /*
+ * We may somehow receive backend's Closed again while thawing
+ * or restoring and it causes thawing or restoring to fail.
+ * Ignore such unexpected state regardless of the backend state.
+ */
+ if (info->connected == BLKIF_STATE_FROZEN) {
+ dev_dbg(&dev->dev,
+ "ignore the backend's Closed state: %s",
+ dev->nodename);
+ break;
+ }
/* fall through */
case XenbusStateClosing:
- if (info)
- blkfront_closing(info);
+ if (info) {
+ if (info->connected == BLKIF_STATE_FREEZING)
+ xenbus_frontend_closed(dev);
+ else
+ blkfront_closing(info);
+ }
break;
}
}
@@ -2630,6 +2670,71 @@ static void blkif_release(struct gendisk *disk, fmode_t mode)
mutex_unlock(&blkfront_mutex);
}
+static int blkfront_freeze(struct xenbus_device *dev)
+{
+ unsigned int i;
+ struct blkfront_info *info = dev_get_drvdata(&dev->dev);
+ struct blkfront_ring_info *rinfo;
+ /* A reasonable timeout, the same as used in xenbus_dev_shutdown() */
+ unsigned int timeout = 5 * HZ;
+ unsigned long flags;
+ int err = 0;
+
+ info->connected = BLKIF_STATE_FREEZING;
+
+ blk_mq_freeze_queue(info->rq);
+ blk_mq_quiesce_queue(info->rq);
+
+ for_each_rinfo(info, rinfo, i) {
+ /* No more gnttab callback work. */
+ gnttab_cancel_free_callback(&rinfo->callback);
+ /* Flush gnttab callback work. Must be done with no locks held. */
+ flush_work(&rinfo->work);
+ }
+
+ for_each_rinfo(info, rinfo, i) {
+ spin_lock_irqsave(&rinfo->ring_lock, flags);
+ if (RING_FULL(&rinfo->ring)
+ || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
+ xenbus_dev_error(dev, err, "Hibernation Failed.
+ The ring is still busy");
+ info->connected = BLKIF_STATE_CONNECTED;
+ spin_unlock_irqrestore(&rinfo->ring_lock, flags);
+ return -EBUSY;
+ }
+ spin_unlock_irqrestore(&rinfo->ring_lock, flags);
+ }
+ /* Kick the backend to disconnect */
+ xenbus_switch_state(dev, XenbusStateClosing);
+
+ /*
+ * We don't want to move forward before the frontend is disconnected
+ * from the backend cleanly.
+ */
+ timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
+ timeout);
+ if (!timeout) {
+ err = -EBUSY;
+ xenbus_dev_error(dev, err, "Freezing timed out; "
+ "the device may be in an inconsistent state");
+ }
+
+ return err;
+}
+
+static int blkfront_restore(struct xenbus_device *dev)
+{
+ struct blkfront_info *info = dev_get_drvdata(&dev->dev);
+ int err = 0;
+
+ err = talk_to_blkback(dev, info);
+ blk_mq_unquiesce_queue(info->rq);
+ blk_mq_unfreeze_queue(info->rq);
+ if (!err)
+ blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
+ return err;
+}
+
static const struct block_device_operations xlvbd_block_fops =
{
.owner = THIS_MODULE,
@@ -2653,6 +2758,9 @@ static struct xenbus_driver blkfront_driver = {
.resume = blkfront_resume,
.otherend_changed = blkback_changed,
.is_ready = blkfront_is_ready,
+ .freeze = blkfront_freeze,
+ .thaw = blkfront_restore,
+ .restore = blkfront_restore
};
static void purge_persistent_grants(struct blkfront_info *info)
--
2.24.1.AMZN
From: Munehisa Kamata <[email protected]>
Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. The freeze handler simply disconnects the frontend from the
backend and frees resources associated with queues after disabling the
net_device from the system. The restore handler just changes the
frontend state and lets the xenbus handler re-allocate the resources
and re-connect to the backend. This can be performed transparently to
the rest of the system. The handlers are used for both PM suspend and
hibernation so that we can keep the existing suspend/resume callbacks
for Xen suspend without modification. Freezing netfront devices is
normally expected to finish within a few hundred milliseconds, but in
rare cases it can take more than 5 seconds and hit the hard-coded
timeout, depending on the backend state, which may be congested and/or
have a complex configuration. While this is rare, a longer default
timeout (10 seconds) seems more reasonable here to avoid hitting it.
Also, make the timeout configurable via a module parameter so that we
can cover broader setups than the ones we currently know.
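The freeze/restore handshake described above can be sketched as a small
user-space model. This is only an illustration, not the driver code: the
enum bus_state values, the action names and the helper functions are
hypothetical stand-ins for the xenbus states and for what
netback_changed() and netfront_freeze() do in this patch.

```c
#include <assert.h>
#include <errno.h>

/* Mirrors enum netif_freeze_state from the patch. */
enum freeze_state { UNFROZEN, FREEZING, FROZEN };
/* Simplified, hypothetical stand-in for the xenbus states involved. */
enum bus_state { ST_INITIALISING, ST_CONNECTED, ST_CLOSING, ST_CLOSED };
/* What the backend-change handler does in response. */
enum action { ACT_NONE, ACT_COMPLETE_WAITER, ACT_FRONTEND_CLOSED };

/* Models the Closed/Closing branches of netback_changed(). */
static enum action on_backend_change(enum freeze_state fz,
				     enum bus_state frontend,
				     enum bus_state backend)
{
	if (backend == ST_CLOSED && frontend == ST_CLOSED)
		/* Wake the freeze handler only if it is actually waiting. */
		return fz == FREEZING ? ACT_COMPLETE_WAITER : ACT_NONE;

	if (backend == ST_CLOSED || backend == ST_CLOSING) {
		/* Stale Closed/Closing seen while re-connecting: ignore. */
		if (fz == FROZEN && frontend == ST_INITIALISING)
			return ACT_NONE;
		return ACT_FRONTEND_CLOSED;
	}
	return ACT_NONE;
}

/* Models netfront_freeze(): returns 0, or -EBUSY on timeout. */
static int model_freeze(enum freeze_state *fz, int backend_closed_in_time)
{
	/* Model assumption: freeze starts from an unfrozen device. */
	assert(*fz == UNFROZEN);
	*fz = FREEZING;
	if (!backend_closed_in_time)
		return -EBUSY;	/* as in the patch, state stays FREEZING */
	*fz = FROZEN;
	return 0;
}
```

In this model the waiter is completed only for the (FREEZING, Closed,
Closed) combination, and a frozen frontend that has already moved to
Initialising ignores a late Closed or Closing from the backend.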
[Anchal changelog: Variable name fix and checkpatch.pl fixes]
Signed-off-by: Anchal Agarwal <[email protected]>
Signed-off-by: Munehisa Kamata <[email protected]>
---
drivers/net/xen-netfront.c | 98 +++++++++++++++++++++++++++++++++++++-
1 file changed, 97 insertions(+), 1 deletion(-)
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 482c6c8b0fb7..65edcdd6e05f 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -43,6 +43,7 @@
#include <linux/moduleparam.h>
#include <linux/mm.h>
#include <linux/slab.h>
+#include <linux/completion.h>
#include <net/ip.h>
#include <xen/xen.h>
@@ -56,6 +57,12 @@
#include <xen/interface/memory.h>
#include <xen/interface/grant_table.h>
+enum netif_freeze_state {
+ NETIF_FREEZE_STATE_UNFROZEN,
+ NETIF_FREEZE_STATE_FREEZING,
+ NETIF_FREEZE_STATE_FROZEN,
+};
+
/* Module parameters */
#define MAX_QUEUES_DEFAULT 8
static unsigned int xennet_max_queues;
@@ -63,6 +70,12 @@ module_param_named(max_queues, xennet_max_queues, uint, 0644);
MODULE_PARM_DESC(max_queues,
"Maximum number of queues per virtual interface");
+static unsigned int netfront_freeze_timeout_secs = 10;
+module_param_named(freeze_timeout_secs,
+ netfront_freeze_timeout_secs, uint, 0644);
+MODULE_PARM_DESC(freeze_timeout_secs,
+ "timeout when freezing netfront device in seconds");
+
static const struct ethtool_ops xennet_ethtool_ops;
struct netfront_cb {
@@ -160,6 +173,10 @@ struct netfront_info {
struct netfront_stats __percpu *tx_stats;
atomic_t rx_gso_checksum_fixup;
+
+ int freeze_state;
+
+ struct completion wait_backend_disconnected;
};
struct netfront_rx_info {
@@ -721,6 +738,21 @@ static int xennet_close(struct net_device *dev)
return 0;
}
+static int xennet_disable_interrupts(struct net_device *dev)
+{
+ struct netfront_info *np = netdev_priv(dev);
+ unsigned int num_queues = dev->real_num_tx_queues;
+ unsigned int queue_index;
+ struct netfront_queue *queue;
+
+ for (queue_index = 0; queue_index < num_queues; ++queue_index) {
+ queue = &np->queues[queue_index];
+ disable_irq(queue->tx_irq);
+ disable_irq(queue->rx_irq);
+ }
+ return 0;
+}
+
static void xennet_move_rx_slot(struct netfront_queue *queue, struct sk_buff *skb,
grant_ref_t ref)
{
@@ -1301,6 +1333,8 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
np->queues = NULL;
+ init_completion(&np->wait_backend_disconnected);
+
err = -ENOMEM;
np->rx_stats = netdev_alloc_pcpu_stats(struct netfront_stats);
if (np->rx_stats == NULL)
@@ -1794,6 +1828,50 @@ static int xennet_create_queues(struct netfront_info *info,
return 0;
}
+static int netfront_freeze(struct xenbus_device *dev)
+{
+ struct netfront_info *info = dev_get_drvdata(&dev->dev);
+ unsigned long timeout = netfront_freeze_timeout_secs * HZ;
+ int err = 0;
+
+ xennet_disable_interrupts(info->netdev);
+
+ netif_device_detach(info->netdev);
+
+ info->freeze_state = NETIF_FREEZE_STATE_FREEZING;
+
+ /* Kick the backend to disconnect */
+ xenbus_switch_state(dev, XenbusStateClosing);
+
+ /* We don't want to move forward before the frontend is disconnected
+ * from the backend cleanly.
+ */
+ timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
+ timeout);
+ if (!timeout) {
+ err = -EBUSY;
+ xenbus_dev_error(dev, err, "Freezing timed out; "
+ "the device may be in an inconsistent state");
+ return err;
+ }
+
+ /* Tear down queues */
+ xennet_disconnect_backend(info);
+ xennet_destroy_queues(info);
+
+ info->freeze_state = NETIF_FREEZE_STATE_FROZEN;
+
+ return err;
+}
+
+static int netfront_restore(struct xenbus_device *dev)
+{
+ /* Kick the backend to re-connect */
+ xenbus_switch_state(dev, XenbusStateInitialising);
+
+ return 0;
+}
+
/* Common code used when first setting up, and when resuming. */
static int talk_to_netback(struct xenbus_device *dev,
struct netfront_info *info)
@@ -1999,6 +2077,8 @@ static int xennet_connect(struct net_device *dev)
spin_unlock_bh(&queue->rx_lock);
}
+ np->freeze_state = NETIF_FREEZE_STATE_UNFROZEN;
+
return 0;
}
@@ -2036,10 +2116,23 @@ static void netback_changed(struct xenbus_device *dev,
break;
case XenbusStateClosed:
- if (dev->state == XenbusStateClosed)
+ if (dev->state == XenbusStateClosed) {
+ /* dpm context is waiting for the backend */
+ if (np->freeze_state == NETIF_FREEZE_STATE_FREEZING)
+ complete(&np->wait_backend_disconnected);
break;
+ }
+
/* Fall through - Missed the backend's CLOSING state. */
case XenbusStateClosing:
+ /* We may see unexpected Closed or Closing from the backend.
+ * Just ignore it so that the frontend is not prevented from
+ * being re-connected in the case of PM suspend or hibernation.
+ */
+ if (np->freeze_state == NETIF_FREEZE_STATE_FROZEN &&
+ dev->state == XenbusStateInitialising) {
+ break;
+ }
xenbus_frontend_closed(dev);
break;
}
@@ -2186,6 +2279,9 @@ static struct xenbus_driver netfront_driver = {
.probe = netfront_probe,
.remove = xennet_remove,
.resume = netfront_resume,
+ .freeze = netfront_freeze,
+ .thaw = netfront_restore,
+ .restore = netfront_restore,
.otherend_changed = netback_changed,
};
--
2.24.1.AMZN
From: Munehisa Kamata <[email protected]>
Currently, the steal time accounting code in the scheduler expects the
steal clock callback to provide a monotonically increasing value. If the
accounting code receives a smaller value than the previous one, it uses a
negative value to calculate steal time, which results in incorrectly
updated idle and steal time accounting. This breaks userspace tools which
read /proc/stat.
top - 08:05:35 up 2:12, 3 users, load average: 0.00, 0.07, 0.23
Tasks: 80 total, 1 running, 79 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,30100.0%id, 0.0%wa, 0.0%hi, 0.0%si,-1253874204672.0%st
This can actually happen when a Xen PVHVM guest gets restored from
hibernation, because such a restored guest is just a fresh domain from
Xen's perspective and the time information in runstate info starts over
from scratch.
This patch introduces xen_save_steal_clock(), which saves the current
values in runstate info into per-cpu variables. Its counterpart,
xen_restore_steal_clock(), sets an offset if it finds that the current
values in runstate info are smaller than the previous ones.
xen_steal_clock() is also modified to use the offset to ensure that the
scheduler only sees a monotonically increasing number.
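As an illustration of the offset logic described above, here is a
minimal user-space sketch for a single CPU. It is not the kernel code:
the struct and function names are hypothetical, and the raw field merely
stands in for the runstate-derived value that __xen_steal_clock()
returns in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-CPU model of the per-cpu variables in the patch. */
struct steal_model {
	uint64_t raw;    /* stand-in for the runstate-derived steal time */
	uint64_t prev;   /* models xen_prev_steal_clock */
	uint64_t offset; /* models xen_steal_clock_offset */
};

/* Value the scheduler would see: raw reading plus the current offset. */
static uint64_t model_steal_clock(const struct steal_model *m)
{
	return m->raw + m->offset;
}

/* Before hibernation: remember the last value handed to the scheduler. */
static void model_save(struct steal_model *m)
{
	m->prev = model_steal_clock(m);
}

/* After restore: raw may have restarted from zero in the fresh domain. */
static void model_restore(struct steal_model *m)
{
	if (m->prev > m->raw)
		m->offset = m->prev - m->raw;	/* bridge the gap */
	else
		m->offset = 0;			/* no warp needed */
	/* The scheduler-visible value never drops below the saved one. */
	assert(model_steal_clock(m) >= m->prev);
}
```

For example, saving at a raw value of 1000 and restoring when the raw
value has restarted at 40 gives an offset of 960, so the visible clock
resumes at 1000 and keeps growing from there.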
Signed-off-by: Munehisa Kamata <[email protected]>
Signed-off-by: Anchal Agarwal <[email protected]>
---
drivers/xen/time.c | 29 ++++++++++++++++++++++++++++-
include/xen/xen-ops.h | 2 ++
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 0968859c29d0..3560222cc0dd 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -23,6 +23,9 @@ static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
static DEFINE_PER_CPU(u64[4], old_runstate_time);
+static DEFINE_PER_CPU(u64, xen_prev_steal_clock);
+static DEFINE_PER_CPU(u64, xen_steal_clock_offset);
+
/* return an consistent snapshot of 64-bit time/counter value */
static u64 get64(const u64 *p)
{
@@ -149,7 +152,7 @@ bool xen_vcpu_stolen(int vcpu)
return per_cpu(xen_runstate, vcpu).state == RUNSTATE_runnable;
}
-u64 xen_steal_clock(int cpu)
+static u64 __xen_steal_clock(int cpu)
{
struct vcpu_runstate_info state;
@@ -157,6 +160,30 @@ u64 xen_steal_clock(int cpu)
return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
}
+u64 xen_steal_clock(int cpu)
+{
+ return __xen_steal_clock(cpu) + per_cpu(xen_steal_clock_offset, cpu);
+}
+
+void xen_save_steal_clock(int cpu)
+{
+ per_cpu(xen_prev_steal_clock, cpu) = xen_steal_clock(cpu);
+}
+
+void xen_restore_steal_clock(int cpu)
+{
+ u64 steal_clock = __xen_steal_clock(cpu);
+
+ if (per_cpu(xen_prev_steal_clock, cpu) > steal_clock) {
+ /* Need to update the offset */
+ per_cpu(xen_steal_clock_offset, cpu) =
+ per_cpu(xen_prev_steal_clock, cpu) - steal_clock;
+ } else {
+ /* Avoid unnecessary steal clock warp */
+ per_cpu(xen_steal_clock_offset, cpu) = 0;
+ }
+}
+
void xen_setup_runstate_info(int cpu)
{
struct vcpu_register_runstate_memory_area area;
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 89b1e88712d6..74fb5eb3aad8 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -37,6 +37,8 @@ void xen_time_setup_guest(void);
void xen_manage_runstate_time(int action);
void xen_get_runstate_snapshot(struct vcpu_runstate_info *res);
u64 xen_steal_clock(int cpu);
+void xen_save_steal_clock(int cpu);
+void xen_restore_steal_clock(int cpu);
int xen_setup_shutdown_event(void);
--
2.24.1.AMZN
From: Munehisa Kamata <[email protected]>
Save steal clock values of all present CPUs in the system core ops
suspend callback. Also, restore the boot CPU's steal clock in the system
core resume callback. For non-boot CPUs, restore after they're brought
up, because runstate info for non-boot CPUs is not active until then.
Signed-off-by: Munehisa Kamata <[email protected]>
Signed-off-by: Anchal Agarwal <[email protected]>
---
arch/x86/xen/suspend.c | 13 ++++++++++++-
arch/x86/xen/time.c | 3 +++
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 784c4484100b..dae0f74f5390 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -91,12 +91,20 @@ void xen_arch_suspend(void)
static int xen_syscore_suspend(void)
{
struct xen_remove_from_physmap xrfp;
- int ret;
+ int cpu, ret;
/* Xen suspend does similar stuffs in its own logic */
if (xen_suspend_mode_is_xen_suspend())
return 0;
+ for_each_present_cpu(cpu) {
+ /*
+ * Nonboot CPUs are already offline, but the last copy of
+ * runstate info is still accessible.
+ */
+ xen_save_steal_clock(cpu);
+ }
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
@@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
pvclock_resume();
+ /* Nonboot CPUs will be resumed when they're brought up */
+ xen_restore_steal_clock(smp_processor_id());
+
gnttab_resume();
}
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index c8897aad13cd..33d754564b09 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -545,6 +545,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
{
int cpu = smp_processor_id();
xen_setup_runstate_info(cpu);
+ if (cpu)
+ xen_restore_steal_clock(cpu);
+
/*
* xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
* doing it xen_hvm_cpu_notify (which gets called by smp_init during
--
2.24.1.AMZN
On Tue, 2020-05-19 at 23:26 +0000, Anchal Agarwal wrote:
> Signed-off--by: Thomas Gleixner <[email protected]>
The Signed-off-by line needs to be fixed (hint: you have --)
Balbir Singh
Introduce wrappers for save/restore xen_sched_clock_offset to be
used by PM hibernation code to avoid system instability during resume.
Signed-off-by: Anchal Agarwal <[email protected]>
---
arch/x86/xen/time.c | 15 +++++++++++++--
arch/x86/xen/xen-ops.h | 2 ++
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 33d754564b09..1fc2beb7a6c1 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -386,12 +386,23 @@ static const struct pv_time_ops xen_time_ops __initconst = {
static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
static u64 xen_clock_value_saved;
+/* This is needed to maintain a monotonic clock value during PM hibernation */
+void xen_save_sched_clock_offset(void)
+{
+ xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+}
+
+void xen_restore_sched_clock_offset(void)
+{
+ xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+}
+
void xen_save_time_memory_area(void)
{
struct vcpu_register_time_memory_area t;
int ret;
- xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+ xen_save_sched_clock_offset();
if (!xen_clock)
return;
@@ -434,7 +445,7 @@ void xen_restore_time_memory_area(void)
out:
/* Need pvclock_resume() before using xen_clocksource_read(). */
pvclock_resume();
- xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+ xen_restore_sched_clock_offset();
}
static void xen_setup_vsyscall_time_info(void)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index d84c357994bd..9f49124df033 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -72,6 +72,8 @@ void xen_save_time_memory_area(void);
void xen_restore_time_memory_area(void);
void xen_init_time_ops(void);
void xen_hvm_init_time_ops(void);
+void xen_save_sched_clock_offset(void);
+void xen_restore_sched_clock_offset(void);
irqreturn_t xen_debug_interrupt(int irq, void *dev_id);
--
2.24.1.AMZN
Save/restore xen_sched_clock_offset in syscore suspend/resume during PM
hibernation. Commit 867cefb4cb1012 ("xen: Fix x86 sched_clock() interface
for xen") fixes Xen guest time handling during migration. A similar issue
is seen during PM hibernation when the system runs a CPU-intensive
workload. After resume, pvclock resets the value to 0; however,
xen_sched_clock_offset is never updated. System instability is then seen
during resume from hibernation when the system is under heavy CPU load:
since xen_sched_clock_offset is not updated, the system does not see a
monotonic clock value and the scheduler thinks that heavy CPU hog tasks
need more time on the CPU, causing the system to freeze.
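The arithmetic behind the save/restore pair can be sketched in user
space as follows. This is a model under the assumption that the
scheduler clock is computed as the clocksource reading minus
xen_sched_clock_offset, which is what the two helpers imply; the struct
and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct sched_clock_model {
	uint64_t clocksource; /* stand-in for xen_clocksource_read() */
	uint64_t offset;      /* models xen_sched_clock_offset */
	uint64_t saved;       /* models xen_clock_value_saved */
};

/* Assumed definition of the scheduler clock: reading minus offset. */
static uint64_t model_sched_clock(const struct sched_clock_model *m)
{
	return m->clocksource - m->offset;
}

/* Suspend side: remember the scheduler-visible value. */
static void model_save_offset(struct sched_clock_model *m)
{
	m->saved = m->clocksource - m->offset;
}

/* Resume side: pick the offset that makes the clock continue from saved. */
static void model_restore_offset(struct sched_clock_model *m)
{
	m->offset = m->clocksource - m->saved;
	/* Holds even if the subtraction wrapped (unsigned arithmetic). */
	assert(model_sched_clock(m) == m->saved);
}
```

If the clocksource restarts near zero after resume, the new offset wraps
around as an unsigned value, but the scheduler-visible clock still
continues from the saved value instead of jumping backwards.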
Signed-off-by: Anchal Agarwal <[email protected]>
---
arch/x86/xen/suspend.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index dae0f74f5390..7e5275944810 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -105,6 +105,8 @@ static int xen_syscore_suspend(void)
xen_save_steal_clock(cpu);
}
+ xen_save_sched_clock_offset();
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
@@ -126,6 +128,12 @@ static void xen_syscore_resume(void)
pvclock_resume();
+ /*
+ * Restore xen_sched_clock_offset during resume to maintain
+ * monotonic clock value
+ */
+ xen_restore_sched_clock_offset();
+
/* Nonboot CPUs will be resumed when they're brought up */
xen_restore_steal_clock(smp_processor_id());
--
2.24.1.AMZN
From: Aleksei Besogonov <[email protected]>
The SNAPSHOT_SET_SWAP_AREA ioctl is supposed to be used to set the
hibernation offset on a running kernel to enable hibernating to a swap file.
However, it doesn't actually update the swsusp_resume_block variable. As
a result, the hibernation fails at the last step (after all the data is
written out) in the validation of the swap signature in
mark_swapfiles().
Before this patch, the command line processing was the only place where
swsusp_resume_block was set.
[Changelog: Resolved patch conflict as code fragmented to
snapshot_set_swap_area]
Signed-off-by: Aleksei Besogonov <[email protected]>
Signed-off-by: Munehisa Kamata <[email protected]>
Signed-off-by: Anchal Agarwal <[email protected]>
---
kernel/power/user.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 7959449765d9..1afa1f0a223e 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -235,8 +235,12 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
return -EINVAL;
}
data->swap = swap_type_of(swdev, offset, NULL);
- if (data->swap < 0)
+ if (data->swap < 0) {
return -ENODEV;
+ } else {
+ swsusp_resume_device = swdev;
+ swsusp_resume_block = offset;
+ }
return 0;
}
--
2.24.1.AMZN
Thanks. Looks like I sent an old one without the fix. I have resent the patch.
On Tue, 2020-05-19 at 23:26 +0000, Anchal Agarwal wrote:
> Signed-off--by: Thomas Gleixner <[email protected]>
The Signed-off-by line needs to be fixed (hint: you have --)
Balbir Singh
Hi Anchal,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v5.7-rc6]
[cannot apply to xen-tip/linux-next tip/irq/core tip/auto-latest next-20200519]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Anchal-Agarwal/Fix-PM-hibernation-in-Xen-guests/20200520-073211
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 03fb3acae4be8a6b680ffedb220a8b6c07260b40
config: x86_64-randconfig-a016-20200519 (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project e6658079aca6d971b4e9d7137a3a2ecbc9c34aec)
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install x86_64 cross compiling tool for clang build
# apt-get install binutils-x86-64-linux-gnu
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <[email protected]>
All error/warnings (new ones prefixed by >>, old ones prefixed by <<):
>> drivers/block/xen-blkfront.c:2699:30: warning: missing terminating '"' character [-Winvalid-pp-token]
xenbus_dev_error(dev, err, "Hibernation Failed.
^
>> drivers/block/xen-blkfront.c:2699:30: error: expected expression
drivers/block/xen-blkfront.c:2700:26: warning: missing terminating '"' character [-Winvalid-pp-token]
The ring is still busy");
^
>> drivers/block/xen-blkfront.c:2726:1: error: function definition is not allowed here
{
^
>> drivers/block/xen-blkfront.c:2762:10: error: use of undeclared identifier 'blkfront_restore'
.thaw = blkfront_restore,
^
drivers/block/xen-blkfront.c:2763:13: error: use of undeclared identifier 'blkfront_restore'
.restore = blkfront_restore
^
drivers/block/xen-blkfront.c:2767:1: error: function definition is not allowed here
{
^
drivers/block/xen-blkfront.c:2800:1: error: function definition is not allowed here
{
^
drivers/block/xen-blkfront.c:2822:1: error: function definition is not allowed here
{
^
>> drivers/block/xen-blkfront.c:2863:13: error: use of undeclared identifier 'xlblk_init'
module_init(xlblk_init);
^
drivers/block/xen-blkfront.c:2867:1: error: function definition is not allowed here
{
^
>> drivers/block/xen-blkfront.c:2874:13: error: use of undeclared identifier 'xlblk_exit'
module_exit(xlblk_exit);
^
>> drivers/block/xen-blkfront.c:2880:24: error: expected '}'
MODULE_ALIAS("xenblk");
^
drivers/block/xen-blkfront.c:2674:1: note: to match this '{'
{
^
>> drivers/block/xen-blkfront.c:2738:45: warning: ISO C90 forbids mixing declarations and code [-Wdeclaration-after-statement]
static const struct block_device_operations xlvbd_block_fops =
^
3 warnings and 11 errors generated.
vim +2699 drivers/block/xen-blkfront.c
2672
2673 static int blkfront_freeze(struct xenbus_device *dev)
2674 {
2675 unsigned int i;
2676 struct blkfront_info *info = dev_get_drvdata(&dev->dev);
2677 struct blkfront_ring_info *rinfo;
2678 /* This would be reasonable timeout as used in xenbus_dev_shutdown() */
2679 unsigned int timeout = 5 * HZ;
2680 unsigned long flags;
2681 int err = 0;
2682
2683 info->connected = BLKIF_STATE_FREEZING;
2684
2685 blk_mq_freeze_queue(info->rq);
2686 blk_mq_quiesce_queue(info->rq);
2687
2688 for_each_rinfo(info, rinfo, i) {
2689 /* No more gnttab callback work. */
2690 gnttab_cancel_free_callback(&rinfo->callback);
2691 /* Flush gnttab callback work. Must be done with no locks held. */
2692 flush_work(&rinfo->work);
2693 }
2694
2695 for_each_rinfo(info, rinfo, i) {
2696 spin_lock_irqsave(&rinfo->ring_lock, flags);
2697 if (RING_FULL(&rinfo->ring)
2698 || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
> 2699 xenbus_dev_error(dev, err, "Hibernation Failed.
2700 The ring is still busy");
2701 info->connected = BLKIF_STATE_CONNECTED;
2702 spin_unlock_irqrestore(&rinfo->ring_lock, flags);
2703 return -EBUSY;
2704 }
2705 spin_unlock_irqrestore(&rinfo->ring_lock, flags);
2706 }
2707 /* Kick the backend to disconnect */
2708 xenbus_switch_state(dev, XenbusStateClosing);
2709
2710 /*
2711 * We don't want to move forward before the frontend is diconnected
2712 * from the backend cleanly.
2713 */
2714 timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
2715 timeout);
2716 if (!timeout) {
2717 err = -EBUSY;
2718 xenbus_dev_error(dev, err, "Freezing timed out;"
2719 "the device may become inconsistent state");
2720 }
2721
2722 return err;
2723 }
2724
2725 static int blkfront_restore(struct xenbus_device *dev)
> 2726 {
2727 struct blkfront_info *info = dev_get_drvdata(&dev->dev);
2728 int err = 0;
2729
2730 err = talk_to_blkback(dev, info);
2731 blk_mq_unquiesce_queue(info->rq);
2732 blk_mq_unfreeze_queue(info->rq);
2733 if (!err)
2734 blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
2735 return err;
2736 }
2737
> 2738 static const struct block_device_operations xlvbd_block_fops =
2739 {
2740 .owner = THIS_MODULE,
2741 .open = blkif_open,
2742 .release = blkif_release,
2743 .getgeo = blkif_getgeo,
2744 .ioctl = blkif_ioctl,
2745 .compat_ioctl = blkdev_compat_ptr_ioctl,
2746 };
2747
2748
2749 static const struct xenbus_device_id blkfront_ids[] = {
2750 { "vbd" },
2751 { "" }
2752 };
2753
2754 static struct xenbus_driver blkfront_driver = {
2755 .ids = blkfront_ids,
2756 .probe = blkfront_probe,
2757 .remove = blkfront_remove,
2758 .resume = blkfront_resume,
2759 .otherend_changed = blkback_changed,
2760 .is_ready = blkfront_is_ready,
2761 .freeze = blkfront_freeze,
> 2762 .thaw = blkfront_restore,
2763 .restore = blkfront_restore
2764 };
2765
2766 static void purge_persistent_grants(struct blkfront_info *info)
> 2767 {
2768 unsigned int i;
2769 unsigned long flags;
2770 struct blkfront_ring_info *rinfo;
2771
2772 for_each_rinfo(info, rinfo, i) {
2773 struct grant *gnt_list_entry, *tmp;
2774
2775 spin_lock_irqsave(&rinfo->ring_lock, flags);
2776
2777 if (rinfo->persistent_gnts_c == 0) {
2778 spin_unlock_irqrestore(&rinfo->ring_lock, flags);
2779 continue;
2780 }
2781
2782 list_for_each_entry_safe(gnt_list_entry, tmp, &rinfo->grants,
2783 node) {
2784 if (gnt_list_entry->gref == GRANT_INVALID_REF ||
2785 gnttab_query_foreign_access(gnt_list_entry->gref))
2786 continue;
2787
2788 list_del(&gnt_list_entry->node);
2789 gnttab_end_foreign_access(gnt_list_entry->gref, 0, 0UL);
2790 rinfo->persistent_gnts_c--;
2791 gnt_list_entry->gref = GRANT_INVALID_REF;
2792 list_add_tail(&gnt_list_entry->node, &rinfo->grants);
2793 }
2794
2795 spin_unlock_irqrestore(&rinfo->ring_lock, flags);
2796 }
2797 }
2798
2799 static void blkfront_delay_work(struct work_struct *work)
2800 {
2801 struct blkfront_info *info;
2802 bool need_schedule_work = false;
2803
2804 mutex_lock(&blkfront_mutex);
2805
2806 list_for_each_entry(info, &info_list, info_list) {
2807 if (info->feature_persistent) {
2808 need_schedule_work = true;
2809 mutex_lock(&info->mutex);
2810 purge_persistent_grants(info);
2811 mutex_unlock(&info->mutex);
2812 }
2813 }
2814
2815 if (need_schedule_work)
2816 schedule_delayed_work(&blkfront_work, HZ * 10);
2817
2818 mutex_unlock(&blkfront_mutex);
2819 }
2820
2821 static int __init xlblk_init(void)
> 2822 {
2823 int ret;
2824 int nr_cpus = num_online_cpus();
2825
2826 if (!xen_domain())
2827 return -ENODEV;
2828
2829 if (!xen_has_pv_disk_devices())
2830 return -ENODEV;
2831
2832 if (register_blkdev(XENVBD_MAJOR, DEV_NAME)) {
2833 pr_warn("xen_blk: can't get major %d with name %s\n",
2834 XENVBD_MAJOR, DEV_NAME);
2835 return -ENODEV;
2836 }
2837
2838 if (xen_blkif_max_segments < BLKIF_MAX_SEGMENTS_PER_REQUEST)
2839 xen_blkif_max_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
2840
2841 if (xen_blkif_max_ring_order > XENBUS_MAX_RING_GRANT_ORDER) {
2842 pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
2843 xen_blkif_max_ring_order, XENBUS_MAX_RING_GRANT_ORDER);
2844 xen_blkif_max_ring_order = XENBUS_MAX_RING_GRANT_ORDER;
2845 }
2846
2847 if (xen_blkif_max_queues > nr_cpus) {
2848 pr_info("Invalid max_queues (%d), will use default max: %d.\n",
2849 xen_blkif_max_queues, nr_cpus);
2850 xen_blkif_max_queues = nr_cpus;
2851 }
2852
2853 INIT_DELAYED_WORK(&blkfront_work, blkfront_delay_work);
2854
2855 ret = xenbus_register_frontend(&blkfront_driver);
2856 if (ret) {
2857 unregister_blkdev(XENVBD_MAJOR, DEV_NAME);
2858 return ret;
2859 }
2860
2861 return 0;
2862 }
> 2863 module_init(xlblk_init);
2864
2865
2866 static void __exit xlblk_exit(void)
2867 {
2868 cancel_delayed_work_sync(&blkfront_work);
2869
2870 xenbus_unregister_driver(&blkfront_driver);
2871 unregister_blkdev(XENVBD_MAJOR, DEV_NAME);
2872 kfree(minors);
2873 }
> 2874 module_exit(xlblk_exit);
2875
2876 MODULE_DESCRIPTION("Xen virtual block device frontend");
2877 MODULE_LICENSE("GPL");
2878 MODULE_ALIAS_BLOCKDEV_MAJOR(XENVBD_MAJOR);
2879 MODULE_ALIAS("xen:vbd");
> 2880 MODULE_ALIAS("xenblk");
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
Hi Anchal,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linus/master]
[also build test ERROR on v5.7-rc6]
[cannot apply to xen-tip/linux-next tip/irq/core tip/auto-latest next-20200519]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Anchal-Agarwal/Fix-PM-hibernation-in-Xen-guests/20200520-073211
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 03fb3acae4be8a6b680ffedb220a8b6c07260b40
config: x86_64-rhel (attached as .config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <[email protected]>
All error/warnings (new ones prefixed by >>, old ones prefixed by <<):
drivers/block/xen-blkfront.c: In function 'blkfront_freeze':
>> drivers/block/xen-blkfront.c:2699:30: warning: missing terminating " character
xenbus_dev_error(dev, err, "Hibernation Failed.
^
>> drivers/block/xen-blkfront.c:2699:30: error: missing terminating " character
xenbus_dev_error(dev, err, "Hibernation Failed.
^~~~~~~~~~~~~~~~~~~~
>> drivers/block/xen-blkfront.c:2700:4: error: 'The' undeclared (first use in this function)
The ring is still busy");
^~~
drivers/block/xen-blkfront.c:2700:4: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/block/xen-blkfront.c:2700:8: error: expected ')' before 'ring'
The ring is still busy");
^~~~
drivers/block/xen-blkfront.c:2700:26: warning: missing terminating " character
The ring is still busy");
^
drivers/block/xen-blkfront.c:2700:26: error: missing terminating " character
The ring is still busy");
^~~
>> drivers/block/xen-blkfront.c:2704:2: error: expected ';' before '}' token
}
^
vim +2699 drivers/block/xen-blkfront.c
2672
2673 static int blkfront_freeze(struct xenbus_device *dev)
2674 {
2675 unsigned int i;
2676 struct blkfront_info *info = dev_get_drvdata(&dev->dev);
2677 struct blkfront_ring_info *rinfo;
2678 /* This would be reasonable timeout as used in xenbus_dev_shutdown() */
2679 unsigned int timeout = 5 * HZ;
2680 unsigned long flags;
2681 int err = 0;
2682
2683 info->connected = BLKIF_STATE_FREEZING;
2684
2685 blk_mq_freeze_queue(info->rq);
2686 blk_mq_quiesce_queue(info->rq);
2687
2688 for_each_rinfo(info, rinfo, i) {
2689 /* No more gnttab callback work. */
2690 gnttab_cancel_free_callback(&rinfo->callback);
2691 /* Flush gnttab callback work. Must be done with no locks held. */
2692 flush_work(&rinfo->work);
2693 }
2694
2695 for_each_rinfo(info, rinfo, i) {
2696 spin_lock_irqsave(&rinfo->ring_lock, flags);
2697 if (RING_FULL(&rinfo->ring)
2698 || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
> 2699 xenbus_dev_error(dev, err, "Hibernation Failed.
> 2700 The ring is still busy");
2701 info->connected = BLKIF_STATE_CONNECTED;
2702 spin_unlock_irqrestore(&rinfo->ring_lock, flags);
2703 return -EBUSY;
> 2704 }
2705 spin_unlock_irqrestore(&rinfo->ring_lock, flags);
2706 }
2707 /* Kick the backend to disconnect */
2708 xenbus_switch_state(dev, XenbusStateClosing);
2709
2710 /*
2711 * We don't want to move forward before the frontend is diconnected
2712 * from the backend cleanly.
2713 */
2714 timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
2715 timeout);
2716 if (!timeout) {
2717 err = -EBUSY;
2718 xenbus_dev_error(dev, err, "Freezing timed out;"
2719 "the device may become inconsistent state");
2720 }
2721
2722 return err;
2723 }
2724
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
From: Munehisa Kamata <[email protected]>
S4 power transition states are much different from Xen
suspend/resume. The former is visible to the guest, and frontend drivers
should be aware of the state transitions and be able to take appropriate
actions when needed. In the transition to S4 we need to make sure that at
least all the in-flight blkif requests get completed, since they probably
contain bits of the guest's memory image and that's not going to get saved
any other way. Hence, re-issuing of in-flight requests as in the case of
Xen resume will not work here. This is in contrast to Xen suspend, where we
need to freeze with as little processing as possible to avoid dirtying RAM
late in the migration cycle and we know that in-flight data can wait.
Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
events, need to implement these xenbus_driver callbacks. The freeze handler
stops block-layer queue and disconnect the frontend from the backend while
freeing ring_info and associated resources. Before disconnecting from the
backend, we need to prevent any new IO from being queued and wait for
existing IO to complete. Freeze/unfreeze of the queues will guarantee that
there are no requests in use on the shared ring. However, for sanity we
should check state of the ring before disconnecting to make sure that there
are no outstanding requests to be processed on the ring. The restore
handler re-allocates ring_info, unquiesces and unfreezes the queue
and re-connect to the backend, so that rest of the kernel can continue
to use the block device transparently.
Note: For older backends, if a backend doesn't have commit 12ea729645ace
("xen/blkback: unmap all persistent grants when frontend gets disconnected"),
the frontend may see a massive number of grant table warnings when freeing
resources.
[ 36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
[ 36.855089] xen:grant_table: WARNING: g.e. 0x112 still in use!
In this case, persistent grants would need to be disabled.
[Anchal Changelog: Removed timeout/request during blkfront freeze.
Reworked the whole patch to work with blk-mq and incorporate upstream's
comments]
Fixes: Build errors reported by kbuild due to linebreak
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Anchal Agarwal <[email protected]>
Signed-off-by: Munehisa Kamata <[email protected]>
---
drivers/block/xen-blkfront.c | 118 +++++++++++++++++++++++++++++++++--
1 file changed, 112 insertions(+), 6 deletions(-)
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 3b889ea950c2..34b0e51697b6 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -48,6 +48,8 @@
#include <linux/list.h>
#include <linux/workqueue.h>
#include <linux/sched/mm.h>
+#include <linux/completion.h>
+#include <linux/delay.h>
#include <xen/xen.h>
#include <xen/xenbus.h>
@@ -80,6 +82,8 @@ enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
BLKIF_STATE_SUSPENDED,
+ BLKIF_STATE_FREEZING,
+ BLKIF_STATE_FROZEN
};
struct grant {
@@ -219,6 +223,7 @@ struct blkfront_info
struct list_head requests;
struct bio_list bio_list;
struct list_head info_list;
+ struct completion wait_backend_disconnected;
};
static unsigned int nr_minors;
@@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
info->sector_size = sector_size;
info->physical_sector_size = physical_sector_size;
blkif_set_queue_limits(info);
+ init_completion(&info->wait_backend_disconnected);
return 0;
}
@@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
case XEN_SCSI_DISK5_MAJOR:
case XEN_SCSI_DISK6_MAJOR:
case XEN_SCSI_DISK7_MAJOR:
- *offset = (*minor / PARTS_PER_DISK) +
+ *offset = (*minor / PARTS_PER_DISK) +
((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
EMULATED_SD_DISK_NAME_OFFSET;
*minor = *minor +
@@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
case XEN_SCSI_DISK13_MAJOR:
case XEN_SCSI_DISK14_MAJOR:
case XEN_SCSI_DISK15_MAJOR:
- *offset = (*minor / PARTS_PER_DISK) +
+ *offset = (*minor / PARTS_PER_DISK) +
((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
EMULATED_SD_DISK_NAME_OFFSET;
*minor = *minor +
@@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
unsigned int i;
struct blkfront_ring_info *rinfo;
+ if (info->connected == BLKIF_STATE_FREEZING)
+ goto free_rings;
/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
@@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
if (info->rq)
blk_mq_stop_hw_queues(info->rq);
+free_rings:
for_each_rinfo(info, rinfo, i)
blkif_free_ring(rinfo);
@@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
struct blkfront_info *info = rinfo->dev_info;
- if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
+ if (unlikely(info->connected != BLKIF_STATE_CONNECTED
+ && info->connected != BLKIF_STATE_FREEZING)){
return IRQ_HANDLED;
+ }
spin_lock_irqsave(&rinfo->ring_lock, flags);
again:
@@ -2027,6 +2038,7 @@ static int blkif_recover(struct blkfront_info *info)
unsigned int segs;
struct blkfront_ring_info *rinfo;
+ bool frozen = info->connected == BLKIF_STATE_FROZEN;
blkfront_gather_backend_features(info);
/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
blkif_set_queue_limits(info);
@@ -2048,6 +2060,9 @@ static int blkif_recover(struct blkfront_info *info)
kick_pending_request_queues(rinfo);
}
+ if (frozen)
+ return 0;
+
list_for_each_entry_safe(req, n, &info->requests, queuelist) {
/* Requeue pending requests (flush or discard) */
list_del_init(&req->queuelist);
@@ -2364,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
return;
case BLKIF_STATE_SUSPENDED:
+ case BLKIF_STATE_FROZEN:
/*
* If we are recovering from suspension, we need to wait
* for the backend to announce it's features before
@@ -2481,12 +2497,36 @@ static void blkback_changed(struct xenbus_device *dev,
break;
case XenbusStateClosed:
- if (dev->state == XenbusStateClosed)
+ if (dev->state == XenbusStateClosed) {
+ if (info->connected == BLKIF_STATE_FREEZING) {
+ blkif_free(info, 0);
+ info->connected = BLKIF_STATE_FROZEN;
+ complete(&info->wait_backend_disconnected);
+ break;
+ }
+
+ break;
+ }
+
+ /*
+ * We may somehow receive backend's Closed again while thawing
+ * or restoring and it causes thawing or restoring to fail.
+ * Ignore such unexpected state regardless of the backend state.
+ */
+ if (info->connected == BLKIF_STATE_FROZEN) {
+ dev_dbg(&dev->dev,
+ "ignore the backend's Closed state: %s",
+ dev->nodename);
break;
+ }
/* fall through */
case XenbusStateClosing:
- if (info)
- blkfront_closing(info);
+ if (info) {
+ if (info->connected == BLKIF_STATE_FREEZING)
+ xenbus_frontend_closed(dev);
+ else
+ blkfront_closing(info);
+ }
break;
}
}
@@ -2630,6 +2670,69 @@ static void blkif_release(struct gendisk *disk, fmode_t mode)
mutex_unlock(&blkfront_mutex);
}
+static int blkfront_freeze(struct xenbus_device *dev)
+{
+ unsigned int i;
+ struct blkfront_info *info = dev_get_drvdata(&dev->dev);
+ struct blkfront_ring_info *rinfo;
+ /* This would be reasonable timeout as used in xenbus_dev_shutdown() */
+ unsigned int timeout = 5 * HZ;
+ unsigned long flags;
+ int err = 0;
+
+ info->connected = BLKIF_STATE_FREEZING;
+
+ blk_mq_freeze_queue(info->rq);
+ blk_mq_quiesce_queue(info->rq);
+
+ for_each_rinfo(info, rinfo, i) {
+ /* No more gnttab callback work. */
+ gnttab_cancel_free_callback(&rinfo->callback);
+ /* Flush gnttab callback work. Must be done with no locks held. */
+ flush_work(&rinfo->work);
+ }
+
+ for_each_rinfo(info, rinfo, i) {
+ spin_lock_irqsave(&rinfo->ring_lock, flags);
+ if (RING_FULL(&rinfo->ring)
+ || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
+ xenbus_dev_error(dev, err, "Hibernation Failed.The ring is still busy");
+ info->connected = BLKIF_STATE_CONNECTED;
+ spin_unlock_irqrestore(&rinfo->ring_lock, flags);
+ return -EBUSY;
+ }
+ spin_unlock_irqrestore(&rinfo->ring_lock, flags);
+ }
+ /* Kick the backend to disconnect */
+ xenbus_switch_state(dev, XenbusStateClosing);
+
+ /*
+ * We don't want to move forward before the frontend is diconnected
+ * from the backend cleanly.
+ */
+ timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
+ timeout);
+ if (!timeout) {
+ err = -EBUSY;
+ xenbus_dev_error(dev, err, "Freezing timed out;"
+ "the device may become inconsistent state");
+ }
+ return err;
+}
+
+static int blkfront_restore(struct xenbus_device *dev)
+{
+ struct blkfront_info *info = dev_get_drvdata(&dev->dev);
+ int err = 0;
+
+ err = talk_to_blkback(dev, info);
+ blk_mq_unquiesce_queue(info->rq);
+ blk_mq_unfreeze_queue(info->rq);
+ if (!err)
+ blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
+ return err;
+}
+
static const struct block_device_operations xlvbd_block_fops =
{
.owner = THIS_MODULE,
@@ -2653,6 +2756,9 @@ static struct xenbus_driver blkfront_driver = {
.resume = blkfront_resume,
.otherend_changed = blkback_changed,
.is_ready = blkfront_is_ready,
+ .freeze = blkfront_freeze,
+ .thaw = blkfront_restore,
+ .restore = blkfront_restore
};
static void purge_persistent_grants(struct blkfront_info *info)
--
2.24.1.AMZN
> @@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
> case XEN_SCSI_DISK5_MAJOR:
> case XEN_SCSI_DISK6_MAJOR:
> case XEN_SCSI_DISK7_MAJOR:
> - *offset = (*minor / PARTS_PER_DISK) +
> + *offset = (*minor / PARTS_PER_DISK) +
> ((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
> EMULATED_SD_DISK_NAME_OFFSET;
> *minor = *minor +
> @@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
> case XEN_SCSI_DISK13_MAJOR:
> case XEN_SCSI_DISK14_MAJOR:
> case XEN_SCSI_DISK15_MAJOR:
> - *offset = (*minor / PARTS_PER_DISK) +
> + *offset = (*minor / PARTS_PER_DISK) +
> ((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
> EMULATED_SD_DISK_NAME_OFFSET;
> *minor = *minor +
These seem like whitespace fixes? If so, they should be in a separate patch.
Balbir
Hi Anchal,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[also build test WARNING on v5.7-rc6]
[cannot apply to xen-tip/linux-next tip/irq/core tip/auto-latest next-20200519]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
url: https://github.com/0day-ci/linux/commits/Anchal-Agarwal/Fix-PM-hibernation-in-Xen-guests/20200520-073211
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 03fb3acae4be8a6b680ffedb220a8b6c07260b40
config: x86_64-allmodconfig (attached as .config)
reproduce:
# apt-get install sparse
# sparse version: v0.6.1-193-gb8fad4bc-dirty
# save the attached .config to linux build tree
make C=1 ARCH=x86_64 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'
:::::: branch date: 11 hours ago
:::::: commit date: 11 hours ago
If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <[email protected]>
sparse warnings: (new ones prefixed by >>)
>> drivers/block/xen-blkfront.c:2700:0: sparse: sparse: missing terminating " character
drivers/block/xen-blkfront.c:2701:0: sparse: sparse: missing terminating " character
drivers/block/xen-blkfront.c:2700:25: sparse: sparse: Expected ) in function call
drivers/block/xen-blkfront.c:2700:25: sparse: sparse: got The
# https://github.com/0day-ci/linux/commit/1997467d18e784a64ee0fe00875492e9605f6147
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 1997467d18e784a64ee0fe00875492e9605f6147
vim +2700 drivers/block/xen-blkfront.c
9f27ee59503865 Jeremy Fitzhardinge 2007-07-17 2672
1997467d18e784 Munehisa Kamata 2020-05-19 2673 static int blkfront_freeze(struct xenbus_device *dev)
1997467d18e784 Munehisa Kamata 2020-05-19 2674 {
1997467d18e784 Munehisa Kamata 2020-05-19 2675 unsigned int i;
1997467d18e784 Munehisa Kamata 2020-05-19 2676 struct blkfront_info *info = dev_get_drvdata(&dev->dev);
1997467d18e784 Munehisa Kamata 2020-05-19 2677 struct blkfront_ring_info *rinfo;
1997467d18e784 Munehisa Kamata 2020-05-19 2678 /* This would be reasonable timeout as used in xenbus_dev_shutdown() */
1997467d18e784 Munehisa Kamata 2020-05-19 2679 unsigned int timeout = 5 * HZ;
1997467d18e784 Munehisa Kamata 2020-05-19 2680 unsigned long flags;
1997467d18e784 Munehisa Kamata 2020-05-19 2681 int err = 0;
1997467d18e784 Munehisa Kamata 2020-05-19 2682
1997467d18e784 Munehisa Kamata 2020-05-19 2683 info->connected = BLKIF_STATE_FREEZING;
1997467d18e784 Munehisa Kamata 2020-05-19 2684
1997467d18e784 Munehisa Kamata 2020-05-19 2685 blk_mq_freeze_queue(info->rq);
1997467d18e784 Munehisa Kamata 2020-05-19 2686 blk_mq_quiesce_queue(info->rq);
1997467d18e784 Munehisa Kamata 2020-05-19 2687
1997467d18e784 Munehisa Kamata 2020-05-19 2688 for_each_rinfo(info, rinfo, i) {
1997467d18e784 Munehisa Kamata 2020-05-19 2689 /* No more gnttab callback work. */
1997467d18e784 Munehisa Kamata 2020-05-19 2690 gnttab_cancel_free_callback(&rinfo->callback);
1997467d18e784 Munehisa Kamata 2020-05-19 2691 /* Flush gnttab callback work. Must be done with no locks held. */
1997467d18e784 Munehisa Kamata 2020-05-19 2692 flush_work(&rinfo->work);
1997467d18e784 Munehisa Kamata 2020-05-19 2693 }
1997467d18e784 Munehisa Kamata 2020-05-19 2694
1997467d18e784 Munehisa Kamata 2020-05-19 2695 for_each_rinfo(info, rinfo, i) {
1997467d18e784 Munehisa Kamata 2020-05-19 2696 spin_lock_irqsave(&rinfo->ring_lock, flags);
1997467d18e784 Munehisa Kamata 2020-05-19 2697 if (RING_FULL(&rinfo->ring)
1997467d18e784 Munehisa Kamata 2020-05-19 2698 || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
1997467d18e784 Munehisa Kamata 2020-05-19 2699 xenbus_dev_error(dev, err, "Hibernation Failed.
1997467d18e784 Munehisa Kamata 2020-05-19 @2700 The ring is still busy");
1997467d18e784 Munehisa Kamata 2020-05-19 2701 info->connected = BLKIF_STATE_CONNECTED;
1997467d18e784 Munehisa Kamata 2020-05-19 2702 spin_unlock_irqrestore(&rinfo->ring_lock, flags);
1997467d18e784 Munehisa Kamata 2020-05-19 2703 return -EBUSY;
1997467d18e784 Munehisa Kamata 2020-05-19 2704 }
1997467d18e784 Munehisa Kamata 2020-05-19 2705 spin_unlock_irqrestore(&rinfo->ring_lock, flags);
1997467d18e784 Munehisa Kamata 2020-05-19 2706 }
1997467d18e784 Munehisa Kamata 2020-05-19 2707 /* Kick the backend to disconnect */
1997467d18e784 Munehisa Kamata 2020-05-19 2708 xenbus_switch_state(dev, XenbusStateClosing);
1997467d18e784 Munehisa Kamata 2020-05-19 2709
1997467d18e784 Munehisa Kamata 2020-05-19 2710 /*
1997467d18e784 Munehisa Kamata 2020-05-19 2711 * We don't want to move forward before the frontend is diconnected
1997467d18e784 Munehisa Kamata 2020-05-19 2712 * from the backend cleanly.
1997467d18e784 Munehisa Kamata 2020-05-19 2713 */
1997467d18e784 Munehisa Kamata 2020-05-19 2714 timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
1997467d18e784 Munehisa Kamata 2020-05-19 2715 timeout);
1997467d18e784 Munehisa Kamata 2020-05-19 2716 if (!timeout) {
1997467d18e784 Munehisa Kamata 2020-05-19 2717 err = -EBUSY;
1997467d18e784 Munehisa Kamata 2020-05-19 2718 xenbus_dev_error(dev, err, "Freezing timed out;"
1997467d18e784 Munehisa Kamata 2020-05-19 2719 "the device may become inconsistent state");
1997467d18e784 Munehisa Kamata 2020-05-19 2720 }
1997467d18e784 Munehisa Kamata 2020-05-19 2721
1997467d18e784 Munehisa Kamata 2020-05-19 2722 return err;
1997467d18e784 Munehisa Kamata 2020-05-19 2723 }
1997467d18e784 Munehisa Kamata 2020-05-19 2724
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
On Tue, May 19, 2020 at 11:27:50PM +0000, Anchal Agarwal wrote:
> From: Munehisa Kamata <[email protected]>
>
> S4 power transition states are quite different from Xen
> suspend/resume. The former is visible to the guest, so frontend drivers
> should be aware of the state transitions and be able to take appropriate
> actions when needed. In the transition to S4 we need to make sure that at
> least all the in-flight blkif requests get completed, since they probably
> contain bits of the guest's memory image and that's not going to get saved
> any other way. Hence, re-issuing in-flight requests, as is done for Xen
> resume, will not work here. This is in contrast to xen-suspend, where we
> need to freeze with as little processing as possible to avoid dirtying RAM
> late in the migration cycle and we know that in-flight data can wait.
>
> Add freeze, thaw and restore callbacks for PM suspend and hibernation
> support. All frontend drivers that need to use PM_HIBERNATION/PM_SUSPEND
> events need to implement these xenbus_driver callbacks. The freeze handler
> stops the block-layer queue and disconnects the frontend from the backend
> while freeing ring_info and associated resources. Before disconnecting from
> the backend, we need to prevent any new IO from being queued and wait for
> existing IO to complete. Freeze/unfreeze of the queues guarantees that
> there are no requests in use on the shared ring. However, for sanity we
> should check the state of the ring before disconnecting to make sure that
> there are no outstanding requests to be processed on the ring. The restore
> handler re-allocates ring_info, unquiesces and unfreezes the queue, and
> re-connects to the backend, so that the rest of the kernel can continue to
> use the block device transparently.
>
> Note: For older backends, if a backend doesn't have commit 12ea729645ace
> ("xen/blkback: unmap all persistent grants when frontend gets disconnected"),
> the frontend may see a massive number of grant table warnings when freeing
> resources.
> [ 36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
> [ 36.855089] xen:grant_table: WARNING: g.e. 0x112 still in use!
>
> In this case, persistent grants would need to be disabled.
>
> [Anchal Changelog: Removed timeout/request during blkfront freeze.
> Reworked the whole patch to work with blk-mq and incorporate upstream's
> comments]
Please tag versions using vX, and it would be helpful if you could list
the specific changes that you made between versions. There were
3 RFC versions IIRC, and there's no log of the changes between them.
>
> Signed-off-by: Anchal Agarwal <[email protected]>
> Signed-off-by: Munehisa Kamata <[email protected]>
> ---
> drivers/block/xen-blkfront.c | 122 +++++++++++++++++++++++++++++++++--
> 1 file changed, 115 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 3b889ea950c2..464863ed7093 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -48,6 +48,8 @@
> #include <linux/list.h>
> #include <linux/workqueue.h>
> #include <linux/sched/mm.h>
> +#include <linux/completion.h>
> +#include <linux/delay.h>
>
> #include <xen/xen.h>
> #include <xen/xenbus.h>
> @@ -80,6 +82,8 @@ enum blkif_state {
> BLKIF_STATE_DISCONNECTED,
> BLKIF_STATE_CONNECTED,
> BLKIF_STATE_SUSPENDED,
> + BLKIF_STATE_FREEZING,
> + BLKIF_STATE_FROZEN
Nit: adding a terminating ',' would prevent further additions from
having to modify this line.
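For illustration, a minimal userspace sketch of the enum with the terminating comma. Only the enumerators visible in the hunk above are listed, and the helper function is hypothetical (it does not exist in the driver); it is there only so the sketch can be compiled and exercised.

```c
/* Userspace sketch; enumerators copied from the hunk above. */
enum blkif_state {
	BLKIF_STATE_DISCONNECTED,
	BLKIF_STATE_CONNECTED,
	BLKIF_STATE_SUSPENDED,
	BLKIF_STATE_FREEZING,
	BLKIF_STATE_FROZEN,	/* terminating ',': the next addition is a one-line diff */
};

/* Hypothetical helper (not in the driver): is a freeze in progress or complete? */
int blkif_state_is_freezing_or_frozen(enum blkif_state s)
{
	return s == BLKIF_STATE_FREEZING || s == BLKIF_STATE_FROZEN;
}
```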
> };
>
> struct grant {
> @@ -219,6 +223,7 @@ struct blkfront_info
> struct list_head requests;
> struct bio_list bio_list;
> struct list_head info_list;
> + struct completion wait_backend_disconnected;
> };
>
> static unsigned int nr_minors;
> @@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
> info->sector_size = sector_size;
> info->physical_sector_size = physical_sector_size;
> blkif_set_queue_limits(info);
> + init_completion(&info->wait_backend_disconnected);
>
> return 0;
> }
> @@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
> case XEN_SCSI_DISK5_MAJOR:
> case XEN_SCSI_DISK6_MAJOR:
> case XEN_SCSI_DISK7_MAJOR:
> - *offset = (*minor / PARTS_PER_DISK) +
> + *offset = (*minor / PARTS_PER_DISK) +
> ((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
> EMULATED_SD_DISK_NAME_OFFSET;
> *minor = *minor +
> @@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
> case XEN_SCSI_DISK13_MAJOR:
> case XEN_SCSI_DISK14_MAJOR:
> case XEN_SCSI_DISK15_MAJOR:
> - *offset = (*minor / PARTS_PER_DISK) +
> + *offset = (*minor / PARTS_PER_DISK) +
Unrelated changes, please split to a pre-patch.
> ((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
> EMULATED_SD_DISK_NAME_OFFSET;
> *minor = *minor +
> @@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> unsigned int i;
> struct blkfront_ring_info *rinfo;
>
> + if (info->connected == BLKIF_STATE_FREEZING)
> + goto free_rings;
> /* Prevent new requests being issued until we fix things up. */
> info->connected = suspend ?
> BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
> @@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> if (info->rq)
> blk_mq_stop_hw_queues(info->rq);
>
> +free_rings:
> for_each_rinfo(info, rinfo, i)
> blkif_free_ring(rinfo);
>
> @@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
> struct blkfront_info *info = rinfo->dev_info;
>
> - if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> - return IRQ_HANDLED;
> + if (unlikely(info->connected != BLKIF_STATE_CONNECTED
> + && info->connected != BLKIF_STATE_FREEZING)){
Extra tab and missing space between '){'. Also my preference would be
for the && to go at the end of the previous line, like it's done
elsewhere in the file.
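A minimal userspace sketch of the layout being asked for, with the '&&' at the end of the first line and the continuation aligned under the condition. The predicate name is hypothetical and stands in for the early-return check in blkif_interrupt(); unlikely() is omitted because it is a kernel macro.

```c
/* Userspace sketch; enumerators copied from the patch. */
enum blkif_state {
	BLKIF_STATE_DISCONNECTED,
	BLKIF_STATE_CONNECTED,
	BLKIF_STATE_SUSPENDED,
	BLKIF_STATE_FREEZING,
	BLKIF_STATE_FROZEN,
};

/*
 * Hypothetical predicate: returns 1 when the interrupt handler should
 * return early, i.e. the device is neither connected nor freezing.
 */
int blkif_irq_should_bail(enum blkif_state connected)
{
	if (connected != BLKIF_STATE_CONNECTED &&
	    connected != BLKIF_STATE_FREEZING)
		return 1;

	return 0;
}
```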
> + return IRQ_HANDLED;
> + }
>
> spin_lock_irqsave(&rinfo->ring_lock, flags);
> again:
> @@ -2027,6 +2038,7 @@ static int blkif_recover(struct blkfront_info *info)
> unsigned int segs;
> struct blkfront_ring_info *rinfo;
>
> + bool frozen = info->connected == BLKIF_STATE_FROZEN;
Please put this together with the rest of the variable definitions,
and leave the empty line as a split between variable definitions and
code. I've already requested this on RFC v3 but you seem to have
dropped some of the requests I've made there.
> blkfront_gather_backend_features(info);
> /* Reset limits changed by blk_mq_update_nr_hw_queues(). */
> blkif_set_queue_limits(info);
> @@ -2048,6 +2060,9 @@ static int blkif_recover(struct blkfront_info *info)
> kick_pending_request_queues(rinfo);
> }
>
> + if (frozen)
> + return 0;
> +
> list_for_each_entry_safe(req, n, &info->requests, queuelist) {
> /* Requeue pending requests (flush or discard) */
> list_del_init(&req->queuelist);
> @@ -2364,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
>
> return;
> case BLKIF_STATE_SUSPENDED:
> + case BLKIF_STATE_FROZEN:
> /*
> * If we are recovering from suspension, we need to wait
> * for the backend to announce it's features before
> @@ -2481,12 +2497,36 @@ static void blkback_changed(struct xenbus_device *dev,
> break;
>
> case XenbusStateClosed:
> - if (dev->state == XenbusStateClosed)
> + if (dev->state == XenbusStateClosed) {
> + if (info->connected == BLKIF_STATE_FREEZING) {
> + blkif_free(info, 0);
> + info->connected = BLKIF_STATE_FROZEN;
> + complete(&info->wait_backend_disconnected);
> + break;
There's no need for the break here, you can rely on the break below.
> + }
> +
> break;
> + }
> +
> + /*
> + * We may somehow receive backend's Closed again while thawing
> + * or restoring and it causes thawing or restoring to fail.
> + * Ignore such unexpected state regardless of the backend state.
> + */
> + if (info->connected == BLKIF_STATE_FROZEN) {
I think you can join this with the previous dev->state == XenbusStateClosed?
Also, won't the device be in the Closed state already if it's in state
frozen?
> + dev_dbg(&dev->dev,
> + "ignore the backend's Closed state: %s",
> + dev->nodename);
> + break;
> + }
> /* fall through */
> case XenbusStateClosing:
> - if (info)
> - blkfront_closing(info);
> + if (info) {
> + if (info->connected == BLKIF_STATE_FREEZING)
> + xenbus_frontend_closed(dev);
> + else
> + blkfront_closing(info);
> + }
> break;
> }
> }
> @@ -2630,6 +2670,71 @@ static void blkif_release(struct gendisk *disk, fmode_t mode)
> mutex_unlock(&blkfront_mutex);
> }
>
> +static int blkfront_freeze(struct xenbus_device *dev)
> +{
> + unsigned int i;
> + struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> + struct blkfront_ring_info *rinfo;
> + /* This would be reasonable timeout as used in xenbus_dev_shutdown() */
> + unsigned int timeout = 5 * HZ;
> + unsigned long flags;
> + int err = 0;
> +
> + info->connected = BLKIF_STATE_FREEZING;
> +
> + blk_mq_freeze_queue(info->rq);
> + blk_mq_quiesce_queue(info->rq);
> +
> + for_each_rinfo(info, rinfo, i) {
> + /* No more gnttab callback work. */
> + gnttab_cancel_free_callback(&rinfo->callback);
> + /* Flush gnttab callback work. Must be done with no locks held. */
> + flush_work(&rinfo->work);
> + }
> +
> + for_each_rinfo(info, rinfo, i) {
> + spin_lock_irqsave(&rinfo->ring_lock, flags);
> + if (RING_FULL(&rinfo->ring)
> + || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {
'||' should go at the end of the previous line.
> + xenbus_dev_error(dev, err, "Hibernation Failed.
> + The ring is still busy");
> + info->connected = BLKIF_STATE_CONNECTED;
> + spin_unlock_irqrestore(&rinfo->ring_lock, flags);
You need to unfreeze the queues here, or else the device will be in a
blocked state AFAICT.
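The shape of the requested error path can be sketched in userspace as follows. This is a model under stated assumptions, not driver code: struct fake_queue, struct fake_ring and all function names are hypothetical stand-ins for the request queue, the shared ring and the blk_mq_* calls. The point is only that the -EBUSY return undoes the freeze and quiesce before bailing out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; these are not kernel types. */
struct fake_queue {
	int freeze_depth;	/* models the blk_mq_freeze_queue() nesting */
	bool quiesced;		/* models blk_mq_quiesce_queue() */
};

struct fake_ring {
	bool busy;		/* models RING_FULL() || unconsumed responses */
};

static void fake_freeze_and_quiesce(struct fake_queue *q)
{
	q->freeze_depth++;
	q->quiesced = true;
}

static void fake_unquiesce_and_unfreeze(struct fake_queue *q)
{
	q->quiesced = false;
	q->freeze_depth--;
}

/* Model of the freeze path: undo the freeze if any ring is still busy. */
int freeze_model(struct fake_queue *q, const struct fake_ring *rings,
		 int nr_rings)
{
	int i;

	fake_freeze_and_quiesce(q);

	for (i = 0; i < nr_rings; i++) {
		if (rings[i].busy) {
			/* Error path: leave the queue usable again. */
			fake_unquiesce_and_unfreeze(q);
			return -EBUSY;
		}
	}

	return 0;
}
```

With a busy ring the model returns -EBUSY and the queue ends up neither frozen nor quiesced; with idle rings it returns 0 and the queue stays frozen for the rest of the hibernation sequence.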
> + return -EBUSY;
> + }
> + spin_unlock_irqrestore(&rinfo->ring_lock, flags);
> + }
This block has indentation all messed up.
> + /* Kick the backend to disconnect */
> + xenbus_switch_state(dev, XenbusStateClosing);
> +
> + /*
> + * We don't want to move forward before the frontend is diconnected
> + * from the backend cleanly.
> + */
> + timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
> + timeout);
> + if (!timeout) {
> + err = -EBUSY;
Note err is only used here, and I think could just be dropped.
> + xenbus_dev_error(dev, err, "Freezing timed out;"
> + "the device may become inconsistent state");
Leaving the device in this state is quite bad, as it's in a closed
state and with the queues frozen. You should make an attempt to
restore things to a working state.
> + }
> +
> + return err;
> +}
> +
> +static int blkfront_restore(struct xenbus_device *dev)
> +{
> + struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> + int err = 0;
> +
> + err = talk_to_blkback(dev, info);
> + blk_mq_unquiesce_queue(info->rq);
> + blk_mq_unfreeze_queue(info->rq);
> + if (!err)
> + blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);
Bad indentation. Also shouldn't you first update the queues and then
unfreeze them?
Thanks, Roger.
A gentle ping on this whole patch series.
Thanks,
Anchal
Hello,
This series fixes PM hibernation for HVM guests running on the Xen hypervisor.
A running guest can now be hibernated and resumed successfully at a
later time. The fixes for PM hibernation are added to the block and
network device drivers, i.e. xen-blkfront and xen-netfront. Any other driver
that needs S4 support and does not already have it can follow the same
method of introducing freeze/thaw/restore callbacks.
The patches have been tested against the upstream kernel and Xen 4.11. Large
scale testing was also done on Xen-based Amazon EC2 instances. All this testing
involved running a memory-exhausting workload in the background.
Guest hibernation does not involve any support from the hypervisor, so
the guest has complete control over its state. Infrastructure
restrictions on saving guest state can be overcome by guest-initiated
hibernation.
These patches were sent out as RFC before and all the feedback has been
incorporated in the patches. The last RFC v3 can be found here:
https://lkml.org/lkml/2020/2/14/2789
Known issues:
1. KASLR causes intermittent hibernation failures. The VM fails to resume and
has to be restarted. I will investigate this issue separately; it shouldn't
be a blocker for this patch series.
2. During hibernation, I sometimes observed that freezing of tasks fails due
to busy XFS workqueues [xfs-cil/xfs-sync]. This is also intermittent, maybe 1
out of 200 runs, and hibernation is aborted in this case. Retrying hibernation
may work. Also, this is a known issue with hibernation: problems with some
filesystems like XFS have been discussed by the community for years without an
effective resolution at this point.
Testing How to:
---------------
1. Set up the Xen hypervisor on a physical machine [I used Ubuntu 16.04 + upstream
xen-4.11].
2. Bring up an HVM guest with a kernel compiled with the hibernation patches
[I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem images].
3. Create a swap file with size = RAM size.
4. Update the grub parameters and reboot.
5. Trigger pm-hibernation from within the VM.
Example:
Set up a file-backed swap space. Swap file size>=Total memory on the system
sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
sudo chmod 600 /swap
sudo mkswap /swap
sudo swapon /swap
Update resume device/resume offset in grub if using swap file:
resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1
Execute:
--------
sudo pm-hibernate
OR
echo reboot > /sys/power/disk && echo disk > /sys/power/state
Compute resume offset code:
"
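#!/usr/bin/env python3
import sys
import array
import fcntl

# FIBMAP ioctl: maps a block number within the file to a device block number
FIBMAP = 0x01

# The swap file is the first argument; FIBMAP needs root privileges
with open(sys.argv[1], 'rb') as f:
    # In: block 0 of the file. Out: the device block that holds it.
    buf = array.array('L', [0])
    fcntl.ioctl(f.fileno(), FIBMAP, buf)
print(buf[0])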
"
Anchal Agarwal (5):
x86/xen: Introduce new function to map HYPERVISOR_shared_info on
Resume
genirq: Shutdown irq chips in suspend/resume during hibernation
xen: Introduce wrapper for save/restore sched clock offset
xen: Update sched clock offset to avoid system instability in
hibernation
PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
Munehisa Kamata (7):
xen/manage: keep track of the on-going suspend mode
xenbus: add freeze/thaw/restore callbacks support
x86/xen: add system core suspend and resume callbacks
xen-blkfront: add callbacks for PM suspend and hibernation
xen-netfront: add callbacks for PM suspend and hibernation
xen/time: introduce xen_{save,restore}_steal_clock
x86/xen: save and restore steal clock
arch/x86/xen/enlighten_hvm.c | 8 ++
arch/x86/xen/suspend.c | 72 ++++++++++++++++++
arch/x86/xen/time.c | 18 ++++-
arch/x86/xen/xen-ops.h | 3 +
drivers/block/xen-blkfront.c | 122 ++++++++++++++++++++++++++++--
drivers/net/xen-netfront.c | 98 +++++++++++++++++++++++-
drivers/xen/events/events_base.c | 1 +
drivers/xen/manage.c | 73 ++++++++++++++++++
drivers/xen/time.c | 29 ++++++-
drivers/xen/xenbus/xenbus_probe.c | 99 +++++++++++++++++++-----
include/linux/irq.h | 2 +
include/xen/xen-ops.h | 8 ++
include/xen/xenbus.h | 3 +
kernel/irq/chip.c | 2 +-
kernel/irq/internals.h | 1 +
kernel/irq/pm.c | 31 +++++---
kernel/power/user.c | 6 +-
17 files changed, 536 insertions(+), 40 deletions(-)
--
2.24.1.AMZN
On 5/19/20 7:24 PM, Anchal Agarwal wrote:
>
> +enum suspend_modes {
> + NO_SUSPEND = 0,
> + XEN_SUSPEND,
> + PM_SUSPEND,
> + PM_HIBERNATION,
> +};
> +
> +/* Protected by pm_mutex */
> +static enum suspend_modes suspend_mode = NO_SUSPEND;
> +
> +bool xen_suspend_mode_is_xen_suspend(void)
> +{
> + return suspend_mode == XEN_SUSPEND;
> +}
> +
> +bool xen_suspend_mode_is_pm_suspend(void)
> +{
> + return suspend_mode == PM_SUSPEND;
> +}
> +
> +bool xen_suspend_mode_is_pm_hibernation(void)
> +{
> + return suspend_mode == PM_HIBERNATION;
> +}
> +
I don't see these last two used anywhere. Are you, in fact,
distinguishing between PM suspend and hibernation?
(I would also probably shorten the name a bit, perhaps
xen_is_pv/pm_suspend()?)
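One possible reading of the shorter names, as a userspace sketch. The assumptions are loud here: the setter is hypothetical and exists only so the sketch can be exercised, pm_mutex locking is left out, and folding PM_SUSPEND and PM_HIBERNATION into a single query is only appropriate if callers really never need to tell them apart.

```c
#include <stdbool.h>

enum suspend_modes {
	NO_SUSPEND = 0,
	XEN_SUSPEND,
	PM_SUSPEND,
	PM_HIBERNATION,
};

/* In the patch this is protected by pm_mutex; locking is omitted here. */
static enum suspend_modes suspend_mode = NO_SUSPEND;

/* Hypothetical setter, only so the sketch can be driven from a test. */
void xen_set_suspend_mode(enum suspend_modes mode)
{
	suspend_mode = mode;
}

/* Shorter name for xen_suspend_mode_is_xen_suspend(). */
bool xen_is_pv_suspend(void)
{
	return suspend_mode == XEN_SUSPEND;
}

/* Single query covering both PM modes (assumption: no caller distinguishes them). */
bool xen_is_pm_suspend(void)
{
	return suspend_mode == PM_SUSPEND || suspend_mode == PM_HIBERNATION;
}
```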
-boris
On 5/19/20 7:25 PM, Anchal Agarwal wrote:
>
> int xenbus_dev_resume(struct device *dev)
> {
> - int err;
> + int err = 0;
That's not necessary.
> struct xenbus_driver *drv;
> struct xenbus_device *xdev
> = container_of(dev, struct xenbus_device, dev);
> -
> + bool xen_suspend = xen_suspend_mode_is_xen_suspend();
> DPRINTK("%s", xdev->nodename);
>
> if (dev->driver == NULL)
> @@ -627,24 +645,32 @@ int xenbus_dev_resume(struct device *dev)
> drv = to_xenbus_driver(dev->driver);
> err = talk_to_otherend(xdev);
> if (err) {
> - pr_warn("resume (talk_to_otherend) %s failed: %i\n",
> + pr_warn("%s (talk_to_otherend) %s failed: %i\n",
Please use dev_warn() everywhere, we just had a bunch of patches that
replaced pr_warn(). In fact, this is one of the lines that got changed.
>
> int xenbus_dev_cancel(struct device *dev)
> {
> - /* Do nothing */
> - DPRINTK("cancel");
> + int err = 0;
Again, no need to initialize.
> + struct xenbus_driver *drv;
> + struct xenbus_device *xdev
> + = container_of(dev, struct xenbus_device, dev);
xendev please to be consistent with other code. And use to_xenbus_device().
-boris
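For readers following along, the to_xenbus_device() request amounts to hiding the container_of() arithmetic behind one named helper. Below is a minimal userspace sketch of that pattern. All model_* names and the struct layouts are hypothetical stand-ins, not the kernel definitions.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(); hypothetical name. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for struct device and struct xenbus_device. */
struct model_device {
	int id;
};

struct model_xenbus_device {
	const char *nodename;
	struct model_device dev;	/* embedded, as in the quoted code */
};

/* One named helper instead of open-coded container_of() at every call site. */
static struct model_xenbus_device *
model_to_xenbus_device(struct model_device *dev)
{
	return model_container_of(dev, struct model_xenbus_device, dev);
}
```

A callback that only receives the embedded struct model_device pointer can then recover the enclosing object with a single call, and every call site stays consistent.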
On 5/19/20 7:25 PM, Anchal Agarwal wrote:
> Introduce a small function which re-uses shared page's PA allocated
> during guest initialization time in reserve_shared_info() and not
> allocate new page during resume flow.
> It also does the mapping of shared_info_page by calling
> xen_hvm_init_shared_info() to use the function.
>
> Signed-off-by: Anchal Agarwal <[email protected]>
> ---
> arch/x86/xen/enlighten_hvm.c | 7 +++++++
> arch/x86/xen/xen-ops.h | 1 +
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> index e138f7de52d2..75b1ec7a0fcd 100644
> --- a/arch/x86/xen/enlighten_hvm.c
> +++ b/arch/x86/xen/enlighten_hvm.c
> @@ -27,6 +27,13 @@
>
> static unsigned long shared_info_pfn;
>
> +void xen_hvm_map_shared_info(void)
> +{
> + xen_hvm_init_shared_info();
> + if (shared_info_pfn)
> + HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
> +}
> +
AFAICT it is only called once so I don't see a need for new routine.
And is it possible for shared_info_pfn to be NULL in resume path (which
is where this is called)?
-boris
On 5/19/20 7:26 PM, Anchal Agarwal wrote:
> From: Munehisa Kamata <[email protected]>
>
> Add Xen PVHVM specific system core callbacks for PM suspend and
> hibernation support. The callbacks suspend and resume Xen
> primitives, like shared_info, pvclock and grant table. Note that
> Xen suspend can handle them in a different manner, but system
> core callbacks are called from the context.
I don't think I understand that last sentence.
> So if the callbacks
> are called from Xen suspend context, return immediately.
>
> +
> +static int xen_syscore_suspend(void)
> +{
> + struct xen_remove_from_physmap xrfp;
> + int ret;
> +
> + /* Xen suspend does similar stuffs in its own logic */
> + if (xen_suspend_mode_is_xen_suspend())
> + return 0;
> +
> + xrfp.domid = DOMID_SELF;
> + xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
> +
> + ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrfp);
> + if (!ret)
> + HYPERVISOR_shared_info = &xen_dummy_shared_info;
> +
> + return ret;
> +}
> +
> +static void xen_syscore_resume(void)
> +{
> + /* Xen suspend does similar stuffs in its own logic */
> + if (xen_suspend_mode_is_xen_suspend())
> + return;
> +
> + /* No need to setup vcpu_info as it's already moved off */
> + xen_hvm_map_shared_info();
> +
> + pvclock_resume();
> +
> + gnttab_resume();
Do you call gnttab_suspend() in pm suspend path?
> +}
> +
> +/*
> + * These callbacks will be called with interrupts disabled and when having only
> + * one CPU online.
> + */
> +static struct syscore_ops xen_hvm_syscore_ops = {
> + .suspend = xen_syscore_suspend,
> + .resume = xen_syscore_resume
> +};
> +
> +void __init xen_setup_syscore_ops(void)
> +{
> + if (xen_hvm_domain())
Have you tested this (the whole feature, not just this patch) with PVH
guest BTW? And PVH dom0 for that matter?
-boris
> + register_syscore_ops(&xen_hvm_syscore_ops);
> +}
On 5/19/20 7:26 PM, Anchal Agarwal wrote:
> Many legacy device drivers do not implement power management (PM)
> functions which means that interrupts requested by these drivers stay
> in active state when the kernel is hibernated.
>
> This does not matter on bare metal and on most hypervisors because the
> interrupt is restored on resume without any noticeable side effects as
> it stays connected to the same physical or virtual interrupt line.
>
> The XEN interrupt mechanism is different as it maintains a mapping
> between the Linux interrupt number and a XEN event channel. If the
> interrupt stays active on hibernation this mapping is preserved but
> there is unfortunately no guarantee that on resume the same event
> channels are reassigned to these devices. This can result in event
> channel conflicts which prevent the affected devices from being
> restored correctly.
>
> One way to solve this would be to add the necessary power management
> functions to all affected legacy device drivers, but that's a
> questionable effort which does not provide any benefits on non-XEN
> environments.
>
> The least intrusive and most efficient solution is to provide a
> mechanism which allows the core interrupt code to tear down these
> interrupts on hibernation and bring them back up again on resume. This
> allows the XEN event channel mechanism to assign an arbitrary event
> channel on resume without affecting the functionality of these
> devices.
>
> Fortunately all these device interrupts are handled by a dedicated XEN
> interrupt chip so the chip can be marked that all interrupts connected
> to it are handled this way. This is pretty much in line with the other
> interrupt chip specific quirks, e.g. IRQCHIP_MASK_ON_SUSPEND.
>
> Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
> it in the core interrupt suspend/resume paths.
>
> Signed-off-by: Anchal Agarwal <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
Since Thomas wrote this patch I think it should also have "From: " him.
-boris
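The quirk-flag idea in the commit message above can be sketched as a small userspace model: a chip-level flag tells the core suspend path to tear the interrupt down, and the resume path restarts it on whatever event channel is then available. The model_* names, the flag values and the struct fields are assumptions made for illustration. This is not the code from the patch.

```c
#include <stdbool.h>

/* Hypothetical flag values; the real flags are defined in include/linux/irq.h. */
#define MODEL_IRQCHIP_MASK_ON_SUSPEND		(1u << 0)
#define MODEL_IRQCHIP_SHUTDOWN_ON_SUSPEND	(1u << 1)

struct model_irq_chip {
	unsigned int flags;
};

struct model_irq_desc {
	const struct model_irq_chip *chip;
	bool started;	/* interrupt is set up and bound to an event channel */
	bool suspended;	/* torn down by the suspend path, restart on resume */
	int evtchn;	/* backing event channel, -1 when there is none */
};

/* Suspend path: tear down interrupts whose chip carries the quirk flag. */
static void model_suspend_irq(struct model_irq_desc *desc)
{
	if (!desc->started)
		return;
	if (desc->chip->flags & MODEL_IRQCHIP_SHUTDOWN_ON_SUSPEND) {
		desc->started = false;
		desc->suspended = true;
		desc->evtchn = -1;
	}
}

/* Resume path: restart only what the suspend path shut down. */
static void model_resume_irq(struct model_irq_desc *desc, int new_evtchn)
{
	if (!desc->suspended)
		return;
	desc->evtchn = new_evtchn;	/* any free channel is acceptable */
	desc->started = true;
	desc->suspended = false;
}
```

The point of keying the behaviour on the chip rather than on each driver is that legacy drivers without PM callbacks need no changes, and interrupts on other chips are left alone.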
On 5/19/20 7:28 PM, Anchal Agarwal wrote:
> From: Munehisa Kamata <[email protected]>
>
> Currently, steal time accounting code in scheduler expects steal clock
> callback to provide monotonically increasing value. If the accounting
> code receives a smaller value than previous one, it uses a negative
> value to calculate steal time and results in incorrectly updated idle
> and steal time accounting. This breaks userspace tools which read
> /proc/stat.
>
> top - 08:05:35 up 2:12, 3 users, load average: 0.00, 0.07, 0.23
> Tasks: 80 total, 1 running, 79 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,30100.0%id, 0.0%wa, 0.0%hi, 0.0%si,-1253874204672.0%st
>
> This can actually happen when a Xen PVHVM guest gets restored from
> hibernation, because such a restored guest is just a fresh domain from
> Xen perspective and the time information in runstate info starts over
> from scratch.
>
> This patch introduces xen_save_steal_clock() which saves current values
> in runstate info into per-cpu variables. Its counterpart,
> xen_restore_steal_clock(), sets offset if it found the current values in
> runstate info are smaller than previous ones. xen_steal_clock() is also
> modified to use the offset to ensure that scheduler only sees
> monotonically increasing number.
>
> Signed-off-by: Munehisa Kamata <[email protected]>
> Signed-off-by: Anchal Agarwal <[email protected]>
> ---
> drivers/xen/time.c | 29 ++++++++++++++++++++++++++++-
> include/xen/xen-ops.h | 2 ++
> 2 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
> index 0968859c29d0..3560222cc0dd 100644
> --- a/drivers/xen/time.c
> +++ b/drivers/xen/time.c
> @@ -23,6 +23,9 @@ static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>
> static DEFINE_PER_CPU(u64[4], old_runstate_time);
>
> +static DEFINE_PER_CPU(u64, xen_prev_steal_clock);
> +static DEFINE_PER_CPU(u64, xen_steal_clock_offset);
Can you use old_runstate_time here? It is used to solve a similar
problem for pv suspend, isn't it?
-boris
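The save/restore/offset scheme described in the commit message can be sketched in userspace as follows. It is a simplified single-CPU model with hypothetical names (model_*); the real patch works on per-cpu runstate data and its exact arithmetic may differ.

```c
#include <stdint.h>

struct model_cpu_steal {
	uint64_t raw;		/* value reported by the hypervisor; restarts after restore */
	uint64_t prev;		/* value saved before hibernation */
	uint64_t offset;	/* added to raw so readers never see a decrease */
};

/* What the scheduler would read in this model. */
static uint64_t model_steal_clock(const struct model_cpu_steal *s)
{
	return s->raw + s->offset;
}

/* Called on the suspend side: remember the last value readers saw. */
static void model_save_steal_clock(struct model_cpu_steal *s)
{
	s->prev = model_steal_clock(s);
}

/* Called on the resume side: compensate only if the clock went backwards. */
static void model_restore_steal_clock(struct model_cpu_steal *s)
{
	if (model_steal_clock(s) < s->prev)
		s->offset = s->prev - s->raw;
}
```

In this model a guest that saved 1000 and comes back as a fresh domain reporting 50 ends up with an offset of 950, so the next read returns 1000 again and grows from there.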
On 5/19/20 7:28 PM, Anchal Agarwal wrote:
> From: Munehisa Kamata <[email protected]>
>
> Save steal clock values of all present CPUs in the system core ops
> suspend callbacks. Also, restore a boot CPU's steal clock in the system
> core resume callback. For non-boot CPUs, restore after they're brought
> up, because runstate info for non-boot CPUs are not active until then.
>
> Signed-off-by: Munehisa Kamata <[email protected]>
> Signed-off-by: Anchal Agarwal <[email protected]>
> ---
> arch/x86/xen/suspend.c | 13 ++++++++++++-
> arch/x86/xen/time.c | 3 +++
> 2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
> index 784c4484100b..dae0f74f5390 100644
> --- a/arch/x86/xen/suspend.c
> +++ b/arch/x86/xen/suspend.c
> @@ -91,12 +91,20 @@ void xen_arch_suspend(void)
> static int xen_syscore_suspend(void)
> {
> struct xen_remove_from_physmap xrfp;
> - int ret;
> + int cpu, ret;
>
> /* Xen suspend does similar stuffs in its own logic */
> if (xen_suspend_mode_is_xen_suspend())
> return 0;
>
> + for_each_present_cpu(cpu) {
> + /*
> + * Nonboot CPUs are already offline, but the last copy of
> + * runstate info is still accessible.
> + */
> + xen_save_steal_clock(cpu);
> + }
> +
> xrfp.domid = DOMID_SELF;
> xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
>
> @@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
>
> pvclock_resume();
Doesn't make any difference but I think since this patch is where you
are dealing with clock then pvclock_resume() should be added here and
not in the earlier patch.
-boris
>
> + /* Nonboot CPUs will be resumed when they're brought up */
> + xen_restore_steal_clock(smp_processor_id());
> +
> gnttab_resume();
> }
>
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index c8897aad13cd..33d754564b09 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -545,6 +545,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
> {
> int cpu = smp_processor_id();
> xen_setup_runstate_info(cpu);
> + if (cpu)
> + xen_restore_steal_clock(cpu);
> +
> /*
> * xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
> * doing it xen_hvm_cpu_notify (which gets called by smp_init during
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
On 5/19/20 7:26 PM, Anchal Agarwal wrote:
> Many legacy device drivers do not implement power management (PM)
> functions which means that interrupts requested by these drivers stay
> in active state when the kernel is hibernated.
>
> This does not matter on bare metal and on most hypervisors because the
> interrupt is restored on resume without any noticeable side effects as
> it stays connected to the same physical or virtual interrupt line.
>
> The XEN interrupt mechanism is different as it maintains a mapping
> between the Linux interrupt number and a XEN event channel. If the
> interrupt stays active on hibernation this mapping is preserved but
> there is unfortunately no guarantee that on resume the same event
> channels are reassigned to these devices. This can result in event
> channel conflicts which prevent the affected devices from being
> restored correctly.
>
> One way to solve this would be to add the necessary power management
> functions to all affected legacy device drivers, but that's a
> questionable effort which does not provide any benefits on non-XEN
> environments.
>
> The least intrusive and most efficient solution is to provide a
> mechanism which allows the core interrupt code to tear down these
> interrupts on hibernation and bring them back up again on resume. This
> allows the XEN event channel mechanism to assign an arbitrary event
> channel on resume without affecting the functionality of these
> devices.
>
> Fortunately all these device interrupts are handled by a dedicated XEN
> interrupt chip so the chip can be marked that all interrupts connected
> to it are handled this way. This is pretty much in line with the other
> interrupt chip specific quirks, e.g. IRQCHIP_MASK_ON_SUSPEND.
>
> Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
> it in the core interrupt suspend/resume paths.
>
> Signed-off-by: Anchal Agarwal <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
Since Thomas wrote this patch I think it should also have "From: " him.
That sounds about right. I will update it next round and add Tested-by.
-boris
- Anchal
On 5/19/20 7:24 PM, Anchal Agarwal wrote:
>
> +enum suspend_modes {
> + NO_SUSPEND = 0,
> + XEN_SUSPEND,
> + PM_SUSPEND,
> + PM_HIBERNATION,
> +};
> +
> +/* Protected by pm_mutex */
> +static enum suspend_modes suspend_mode = NO_SUSPEND;
> +
> +bool xen_suspend_mode_is_xen_suspend(void)
> +{
> + return suspend_mode == XEN_SUSPEND;
> +}
> +
> +bool xen_suspend_mode_is_pm_suspend(void)
> +{
> + return suspend_mode == PM_SUSPEND;
> +}
> +
> +bool xen_suspend_mode_is_pm_hibernation(void)
> +{
> + return suspend_mode == PM_HIBERNATION;
> +}
> +
I don't see these last two used anywhere. Are you, in fact,
distinguishing between PM suspend and hibernation?
Yes, I am, unless there is a better way to distinguish at runtime that I haven't figured out yet.
The initial design was to have separate states for separate modes. Currently, PM_HIBERNATION is handled
by !xen_suspend. However, if a case arises where we need to set the suspend_mode, it's available via
this interface. This is basically to support PM* ops via the ACPI path. Since PM_SUSPEND is not handled by the series,
the code can be removed and added back later. Any comments?
(I would also probably shorten the name a bit, perhaps
xen_is_pv/pm_suspend()?)
Sure. Will fix in my next round of post.
-boris
Thanks,
Anchal
On 6/1/20 5:00 PM, Agarwal, Anchal wrote:
>
>
> I don't see these last two used anywhere. Are you, in fact,
> distinguishing between PM suspend and hibernation?
>
> Yes, I am, unless there is a better way to distinguish at runtime that I haven't figured out yet.
> The initial design was to have separate states for separate modes. Currently, PM_HIBERNATION is handled
> by !xen_suspend. However, if a case arises where we need to set the suspend_mode, it's available via
> this interface. This is basically to support PM* ops via the ACPI path. Since PM_SUSPEND is not handled by the series,
> the code can be removed and added back later. Any comments?
Yes, if this is not being handled then I don't see any reason for this
code to be there.
-boris
On 5/19/20 7:25 PM, Anchal Agarwal wrote:
>
> int xenbus_dev_resume(struct device *dev)
> {
> - int err;
> + int err = 0;
That's not necessary.
ACK.
> struct xenbus_driver *drv;
> struct xenbus_device *xdev
> = container_of(dev, struct xenbus_device, dev);
> -
> + bool xen_suspend = xen_suspend_mode_is_xen_suspend();
> DPRINTK("%s", xdev->nodename);
>
> if (dev->driver == NULL)
> @@ -627,24 +645,32 @@ int xenbus_dev_resume(struct device *dev)
> drv = to_xenbus_driver(dev->driver);
> err = talk_to_otherend(xdev);
> if (err) {
> - pr_warn("resume (talk_to_otherend) %s failed: %i\n",
> + pr_warn("%s (talk_to_otherend) %s failed: %i\n",
Please use dev_warn() everywhere, we just had a bunch of patches that
replaced pr_warn(). In fact, this is one of the lines that got changed.
ACK. Will send fixes in next series
>
> int xenbus_dev_cancel(struct device *dev)
> {
> - /* Do nothing */
> - DPRINTK("cancel");
> + int err = 0;
Again, no need to initialize.
ACK.
> + struct xenbus_driver *drv;
> + struct xenbus_device *xdev
> + = container_of(dev, struct xenbus_device, dev);
xendev please to be consistent with other code. And use to_xenbus_device().
ACK.
-boris
I will put the fixes in next round of patches.
Thanks,
Anchal
On 5/19/20 7:26 PM, Anchal Agarwal wrote:
> From: Munehisa Kamata <[email protected]>
>
> Add Xen PVHVM specific system core callbacks for PM suspend and
> hibernation support. The callbacks suspend and resume Xen
> primitives, like shared_info, pvclock and grant table. Note that
> Xen suspend can handle them in a different manner, but system
> core callbacks are called from the context.
I don't think I understand that last sentence.
It looks like a cryptic way of stating that xen_suspend calls syscore_suspend.
So, if these syscore ops get called during xen_suspend, do not do anything: check whether the mode is Xen suspend
and return from there. These syscore_ops are specifically for domU hibernation.
I must admit, I may have overlooked the lack of explanation of some implicit details in the original commit msg.
> So if the callbacks
> are called from Xen suspend context, return immediately.
>
> +
> +static int xen_syscore_suspend(void)
> +{
> + struct xen_remove_from_physmap xrfp;
> + int ret;
> +
> + /* Xen suspend does similar stuffs in its own logic */
> + if (xen_suspend_mode_is_xen_suspend())
> + return 0;
> +
> + xrfp.domid = DOMID_SELF;
> + xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
> +
> + ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrfp);
> + if (!ret)
> + HYPERVISOR_shared_info = &xen_dummy_shared_info;
> +
> + return ret;
> +}
> +
> +static void xen_syscore_resume(void)
> +{
> + /* Xen suspend does similar stuffs in its own logic */
> + if (xen_suspend_mode_is_xen_suspend())
> + return;
> +
> + /* No need to setup vcpu_info as it's already moved off */
> + xen_hvm_map_shared_info();
> +
> + pvclock_resume();
> +
> + gnttab_resume();
Do you call gnttab_suspend() in pm suspend path?
No, since it does nothing for HVM guests. The unmap_frames is only applicable for PV guests, right?
> +}
> +
> +/*
> + * These callbacks will be called with interrupts disabled and when having only
> + * one CPU online.
> + */
> +static struct syscore_ops xen_hvm_syscore_ops = {
> + .suspend = xen_syscore_suspend,
> + .resume = xen_syscore_resume
> +};
> +
> +void __init xen_setup_syscore_ops(void)
> +{
> + if (xen_hvm_domain())
Have you tested this (the whole feature, not just this patch) with PVH
guest BTW? And PVH dom0 for that matter?
No, I haven't. The whole series has only been tested with HVM/PVHVM guests.
-boris
Thanks,
Anchal
> + register_syscore_ops(&xen_hvm_syscore_ops);
> +}
On Sat, May 30, 2020 at 07:44:06PM -0400, Boris Ostrovsky wrote:
> On 5/19/20 7:28 PM, Anchal Agarwal wrote:
> > From: Munehisa Kamata <[email protected]>
> >
> > Save steal clock values of all present CPUs in the system core ops
> > suspend callbacks. Also, restore a boot CPU's steal clock in the system
> > core resume callback. For non-boot CPUs, restore after they're brought
> > up, because runstate info for non-boot CPUs are not active until then.
> >
> > Signed-off-by: Munehisa Kamata <[email protected]>
> > Signed-off-by: Anchal Agarwal <[email protected]>
> > ---
> > arch/x86/xen/suspend.c | 13 ++++++++++++-
> > arch/x86/xen/time.c | 3 +++
> > 2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
> > index 784c4484100b..dae0f74f5390 100644
> > --- a/arch/x86/xen/suspend.c
> > +++ b/arch/x86/xen/suspend.c
> > @@ -91,12 +91,20 @@ void xen_arch_suspend(void)
> > static int xen_syscore_suspend(void)
> > {
> > struct xen_remove_from_physmap xrfp;
> > - int ret;
> > + int cpu, ret;
> >
> > /* Xen suspend does similar stuffs in its own logic */
> > if (xen_suspend_mode_is_xen_suspend())
> > return 0;
> >
> > + for_each_present_cpu(cpu) {
> > + /*
> > + * Nonboot CPUs are already offline, but the last copy of
> > + * runstate info is still accessible.
> > + */
> > + xen_save_steal_clock(cpu);
> > + }
> > +
> > xrfp.domid = DOMID_SELF;
> > xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
> >
> > @@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
> >
> > pvclock_resume();
>
>
> Doesn't make any difference but I think since this patch is where you
> are dealing with clock then pvclock_resume() should be added here and
> not in the earlier patch.
>
>
> -boris
I think the reason it is in the previous patch is that it was a part
of syscore_resume and the steal clock fix came in later.
It could be moved to this patch, which deals with all the clock stuff.
-Anchal
>
>
> >
> > + /* Nonboot CPUs will be resumed when they're brought up */
> > + xen_restore_steal_clock(smp_processor_id());
> > +
> > gnttab_resume();
> > }
> >
> > diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> > index c8897aad13cd..33d754564b09 100644
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -545,6 +545,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
> > {
> > int cpu = smp_processor_id();
> > xen_setup_runstate_info(cpu);
> > + if (cpu)
> > + xen_restore_steal_clock(cpu);
> > +
> > /*
> > * xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
> > * doing it xen_hvm_cpu_notify (which gets called by smp_init during
>
>
>
On Sat, May 30, 2020 at 07:02:01PM -0400, Boris Ostrovsky wrote:
> On 5/19/20 7:25 PM, Anchal Agarwal wrote:
> > Introduce a small function which re-uses shared page's PA allocated
> > during guest initialization time in reserve_shared_info() and not
> > allocate new page during resume flow.
> > It also does the mapping of shared_info_page by calling
> > xen_hvm_init_shared_info() to use the function.
> >
> > Signed-off-by: Anchal Agarwal <[email protected]>
> > ---
> > arch/x86/xen/enlighten_hvm.c | 7 +++++++
> > arch/x86/xen/xen-ops.h | 1 +
> > 2 files changed, 8 insertions(+)
> >
> > diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> > index e138f7de52d2..75b1ec7a0fcd 100644
> > --- a/arch/x86/xen/enlighten_hvm.c
> > +++ b/arch/x86/xen/enlighten_hvm.c
> > @@ -27,6 +27,13 @@
> >
> > static unsigned long shared_info_pfn;
> >
> > +void xen_hvm_map_shared_info(void)
> > +{
> > + xen_hvm_init_shared_info();
> > + if (shared_info_pfn)
> > + HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
> > +}
> > +
>
>
> AFAICT it is only called once so I don't see a need for new routine.
>
>
HYPERVISOR_shared_info can only be mapped in this scope without refactoring
much of the code.
> And is it possible for shared_info_pfn to be NULL in resume path (which
> is where this is called)?
>
>
I don't think it should be. It is a sanity check, but I don't think it's needed there
because hibernation will fail in any case if it is NULL.
However, HYPERVISOR_shared_info does need to be re-mapped on resume as it's been
pointed at a dummy address on suspend. It's also safe in case the va changes.
Does that answer your question?
> -boris
-Anchal
>
>
On 6/3/20 6:40 PM, Agarwal, Anchal wrote:
> On 5/19/20 7:26 PM, Anchal Agarwal wrote:
> > From: Munehisa Kamata <[email protected]>
> >
> > Add Xen PVHVM specific system core callbacks for PM suspend and
> > hibernation support. The callbacks suspend and resume Xen
> > primitives, like shared_info, pvclock and grant table. Note that
> > Xen suspend can handle them in a different manner, but system
> > core callbacks are called from the context.
>
>
> I don't think I understand that last sentence.
>
> It looks like a cryptic way of stating that xen_suspend calls syscore_suspend.
> So, if these syscore ops get called during xen_suspend, do not do anything: check whether the mode is Xen suspend
> and return from there. These syscore_ops are specifically for domU hibernation.
> I must admit, I may have overlooked the lack of explanation of some implicit details in the original commit msg.
>
> > So if the callbacks
> > are called from Xen suspend context, return immediately.
> >
>
>
> > +
> > +static int xen_syscore_suspend(void)
> > +{
> > + struct xen_remove_from_physmap xrfp;
> > + int ret;
> > +
> > + /* Xen suspend does similar stuffs in its own logic */
> > + if (xen_suspend_mode_is_xen_suspend())
> > + return 0;
With your explanation now making this clearer, is this check really
necessary? From what I see we are in XEN_SUSPEND mode when
lock_system_sleep() lock is taken, meaning that we can't initialize
hibernation.
> > +
> > + xrfp.domid = DOMID_SELF;
> > + xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
> > +
> > + ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrfp);
> > + if (!ret)
> > + HYPERVISOR_shared_info = &xen_dummy_shared_info;
> > +
> > + return ret;
> > +}
> > +
> > +static void xen_syscore_resume(void)
> > +{
> > + /* Xen suspend does similar stuffs in its own logic */
> > + if (xen_suspend_mode_is_xen_suspend())
> > + return;
> > +
> > + /* No need to setup vcpu_info as it's already moved off */
> > + xen_hvm_map_shared_info();
> > +
> > + pvclock_resume();
> > +
> > + gnttab_resume();
>
>
> Do you call gnttab_suspend() in pm suspend path?
> No, since it does nothing for HVM guests. The unmap_frames is only applicable for PV guests, right?
You should call it nevertheless. It will decide whether or not anything
needs to be done.
-boris
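The review point here is about where the decision lives: the suspend callback calls the grant-table suspend routine unconditionally, and that routine checks whether there is anything to do. A minimal userspace sketch of that split follows. The model_* names and the auto_translated field are assumptions of the model, not the kernel's gnttab code.

```c
#include <stdbool.h>

struct model_gnttab {
	bool auto_translated;	/* true for HVM-style guests in this model */
	bool frames_mapped;
};

/* Callee decides: only guests that are not auto-translated unmap frames. */
static int model_gnttab_suspend(struct model_gnttab *gt)
{
	if (!gt->auto_translated)
		gt->frames_mapped = false;
	return 0;
}

/* Caller stays free of guest-type checks and always makes the call. */
static int model_syscore_suspend(struct model_gnttab *gt)
{
	return model_gnttab_suspend(gt);
}
```

Keeping the guest-type test inside the callee means the suspend and resume paths stay symmetric, and a later change to what the routine does for HVM guests needs no change at the call site.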
On 6/4/20 7:03 PM, Anchal Agarwal wrote:
> On Sat, May 30, 2020 at 07:02:01PM -0400, Boris Ostrovsky wrote:
>> On 5/19/20 7:25 PM, Anchal Agarwal wrote:
>>> Introduce a small function which re-uses shared page's PA allocated
>>> during guest initialization time in reserve_shared_info() and not
>>> allocate new page during resume flow.
>>> It also does the mapping of shared_info_page by calling
>>> xen_hvm_init_shared_info() to use the function.
>>>
>>> Signed-off-by: Anchal Agarwal <[email protected]>
>>> ---
>>> arch/x86/xen/enlighten_hvm.c | 7 +++++++
>>> arch/x86/xen/xen-ops.h | 1 +
>>> 2 files changed, 8 insertions(+)
>>>
>>> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
>>> index e138f7de52d2..75b1ec7a0fcd 100644
>>> --- a/arch/x86/xen/enlighten_hvm.c
>>> +++ b/arch/x86/xen/enlighten_hvm.c
>>> @@ -27,6 +27,13 @@
>>>
>>> static unsigned long shared_info_pfn;
>>>
>>> +void xen_hvm_map_shared_info(void)
>>> +{
>>> + xen_hvm_init_shared_info();
>>> + if (shared_info_pfn)
>>> + HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
>>> +}
>>> +
>>
>> AFAICT it is only called once so I don't see a need for new routine.
>>
>>
> HYPERVISOR_shared_info can only be mapped in this scope without refactoring
> much of the code.
Refactoring what? All I am suggesting is
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -124,7 +124,9 @@ static void xen_syscore_resume(void)
return;
/* No need to setup vcpu_info as it's already moved off */
- xen_hvm_map_shared_info();
+ xen_hvm_init_shared_info();
+ if (shared_info_pfn)
+ HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
pvclock_resume();
>> And is it possible for shared_info_pfn to be NULL in resume path (which
>> is where this is called)?
>>
>>
> I don't think it should be. It is a sanity check, but I don't think it's needed there
> because hibernation will fail in any case if it is NULL.
If shared_info_pfn is NULL you'd have problems long before hibernation
started. We set it in xen_hvm_guest_init() and never touch again.
In fact, I'd argue that it should be __ro_after_init.
> However, HYPERVISOR_shared_info does need to be re-mapped on resume as it's been
> pointed at a dummy address on suspend. It's also safe in case the va changes.
> Does that answer your question?
I wasn't arguing whether HYPERVISOR_shared_info needs to be set, I was
only saying that shared_info_pfn doesn't need to be tested.
-boris
On Fri, Jun 05, 2020 at 05:24:37PM -0400, Boris Ostrovsky wrote:
> On 6/3/20 6:40 PM, Agarwal, Anchal wrote:
> > On 5/19/20 7:26 PM, Anchal Agarwal wrote:
> > > From: Munehisa Kamata <[email protected]>
> > >
> > > Add Xen PVHVM specific system core callbacks for PM suspend and
> > > hibernation support. The callbacks suspend and resume Xen
> > > primitives, like shared_info, pvclock and grant table. Note that
> > > Xen suspend can handle them in a different manner, but system
> > > core callbacks are called from the context.
> >
> >
> > I don't think I understand that last sentence.
> >
> > It looks like a cryptic way of stating that xen_suspend calls syscore_suspend.
> > So, if these syscore ops get called during xen_suspend, do not do anything: check whether the mode is Xen suspend
> > and return from there. These syscore_ops are specifically for domU hibernation.
> > I must admit, I may have overlooked the lack of explanation of some implicit details in the original commit msg.
> >
> > > So if the callbacks
> > > are called from Xen suspend context, return immediately.
> > >
> >
> >
> > > +
> > > +static int xen_syscore_suspend(void)
> > > +{
> > > + struct xen_remove_from_physmap xrfp;
> > > + int ret;
> > > +
> > > + /* Xen suspend does similar stuffs in its own logic */
> > > + if (xen_suspend_mode_is_xen_suspend())
> > > + return 0;
>
>
> With your explanation now making this clearer, is this check really
> necessary? From what I see we are in XEN_SUSPEND mode when
> lock_system_sleep() lock is taken, meaning that we can't initialize
> hibernation.
>
I see. Sounds plausible. I will fix both the code and commit message
for better readability. Thanks for catching this.
>
> > > +
> > > + xrfp.domid = DOMID_SELF;
> > > + xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
> > > +
> > > + ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrfp);
> > > + if (!ret)
> > > + HYPERVISOR_shared_info = &xen_dummy_shared_info;
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +static void xen_syscore_resume(void)
> > > +{
> > > + /* Xen suspend does similar stuffs in its own logic */
> > > + if (xen_suspend_mode_is_xen_suspend())
> > > + return;
> > > +
> > > + /* No need to setup vcpu_info as it's already moved off */
> > > + xen_hvm_map_shared_info();
> > > +
> > > + pvclock_resume();
> > > +
> > > + gnttab_resume();
> >
> >
> > Do you call gnttab_suspend() in pm suspend path?
> > No, since it does nothing for HVM guests. The unmap_frames is only applicable for PV guests, right?
>
>
> You should call it nevertheless. It will decide whether or not anything
> needs to be done.
Will fix it in V2.
>
>
> -boris
>
Thanks,
Anchal
>
On Fri, Jun 05, 2020 at 05:39:54PM -0400, Boris Ostrovsky wrote:
> On 6/4/20 7:03 PM, Anchal Agarwal wrote:
> > On Sat, May 30, 2020 at 07:02:01PM -0400, Boris Ostrovsky wrote:
> >> On 5/19/20 7:25 PM, Anchal Agarwal wrote:
> >>> Introduce a small function which re-uses shared page's PA allocated
> >>> during guest initialization time in reserve_shared_info() and not
> >>> allocate new page during resume flow.
> >>> It also does the mapping of shared_info_page by calling
> >>> xen_hvm_init_shared_info() to use the function.
> >>>
> >>> Signed-off-by: Anchal Agarwal <[email protected]>
> >>> ---
> >>> arch/x86/xen/enlighten_hvm.c | 7 +++++++
> >>> arch/x86/xen/xen-ops.h | 1 +
> >>> 2 files changed, 8 insertions(+)
> >>>
> >>> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> >>> index e138f7de52d2..75b1ec7a0fcd 100644
> >>> --- a/arch/x86/xen/enlighten_hvm.c
> >>> +++ b/arch/x86/xen/enlighten_hvm.c
> >>> @@ -27,6 +27,13 @@
> >>>
> >>> static unsigned long shared_info_pfn;
> >>>
> >>> +void xen_hvm_map_shared_info(void)
> >>> +{
> >>> + xen_hvm_init_shared_info();
> >>> + if (shared_info_pfn)
> >>> + HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
> >>> +}
> >>> +
> >>
> >> AFAICT it is only called once so I don't see a need for new routine.
> >>
> >>
> > HYPERVISOR_shared_info can only be mapped in this scope without refactoring
> > much of the code.
>
>
> Refactoring what? All I am suggesting is
>
shared_info_pfn does not seem to be in scope here; its scope is limited
to enlighten_hvm.c. That's the reason I introduced a new function there.
> --- a/arch/x86/xen/suspend.c
> +++ b/arch/x86/xen/suspend.c
> @@ -124,7 +124,9 @@ static void xen_syscore_resume(void)
> return;
>
> /* No need to setup vcpu_info as it's already moved off */
> - xen_hvm_map_shared_info();
> + xen_hvm_init_shared_info();
> + if (shared_info_pfn)
> + HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
>
> pvclock_resume();
>
> >> And is it possible for shared_info_pfn to be NULL in resume path (which
> >> is where this is called)?
> >>
> >>
> > I don't think it can be. It is just a sanity check, and I don't think it's
> > needed there, because hibernation would fail anyway if that were the case.
>
>
> If shared_info_pfn is NULL you'd have problems long before hibernation
> started. We set it in xen_hvm_guest_init() and never touch again.
>
>
> In fact, I'd argue that it should be __ro_after_init.
>
>
I agree, and I should have mentioned that I will remove that check; it's not
necessary, as this gets mapped very early in the boot process.
> > However, HYPERVISOR_shared_info does need to be re-mapped on resume, as it's
> > been pointed at a dummy address on suspend. It's also safe in case the va changes.
> > Does that answer your question?
>
>
> I wasn't arguing whether HYPERVISOR_shared_info needs to be set, I was
> only saying that shared_info_pfn doesn't need to be tested.
>
Got it. :)
>
> -boris
>
Thanks,
Anchal
>
On 6/8/20 12:52 PM, Anchal Agarwal wrote:
>
>>>>> +void xen_hvm_map_shared_info(void)
>>>>> +{
>>>>> + xen_hvm_init_shared_info();
>>>>> + if (shared_info_pfn)
>>>>> + HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
>>>>> +}
>>>>> +
>>>> AFAICT it is only called once so I don't see a need for new routine.
>>>>
>>>>
>>> HYPERVISOR_shared_info can only be mapped in this scope without refactoring
>>> much of the code.
>>
>> Refactoring what? All I am suggesting is
>>
> shared_info_pfn does not seem to be in scope here; its scope is limited
> to enlighten_hvm.c. That's the reason I introduced a new function there.
OK, that's a good point.
-boris
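The scoping point that settled this exchange can be sketched in user space. The example below is only an illustration under stated assumptions: it compresses what would be two source files into one, all `mock_` names and `MOCK_PAGE_SHIFT` are hypothetical, and the address arithmetic merely imitates `PFN_PHYS()` rather than reproducing the kernel's mapping.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_PAGE_SHIFT 12	/* assumed 4 KiB pages, for illustration only */

/* --- would live in enlighten_hvm.c: file-scope, invisible elsewhere --- */
static unsigned long shared_info_pfn;
static uintptr_t mock_shared_info;	/* stand-in for HYPERVISOR_shared_info */

/* Set once during guest init and never touched again. */
void mock_guest_init(unsigned long pfn) { shared_info_pfn = pfn; }

/* Exported wrapper: the only way other files can use the static pfn. */
void mock_map_shared_info(void)
{
	/* No runtime NULL test; a zero pfn would mean init itself was broken. */
	assert(shared_info_pfn != 0);
	mock_shared_info = (uintptr_t)shared_info_pfn << MOCK_PAGE_SHIFT;
}

uintptr_t mock_shared_info_addr(void) { return mock_shared_info; }

/* --- would live in suspend.c: sees the function, not the variable --- */
void mock_syscore_resume(void)
{
	mock_map_shared_info();
}
```

Open-coding the assignment in the resume path would require dropping `static` from the variable, which is why a small wrapper in the defining file is the less invasive option.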