2013-06-27 10:35:57

by David Vrabel

[permalink] [raw]
Subject: [PATCHv6 0/5] xen: maintain an accurate persistent clock in more cases

Xen guests use the Xen wallclock as their persistent clock. This is a
software only clock in the hypervisor that is used by guests instead
of a real hardware RTC.

The kernel has limited support for updating the persistent clock or
RTC when NTP is synced. This has the following limitations:

* The persistent clock is not updated on step changes. This leaves a
window where it will be incorrect (while NTP resyncs).

* Xen guests use the Xen wallclock as their persistent clock. dom0
maintains this clock so it is persistent for domUs but not dom0
itself.

These limitations mean that guests started before NTP is synchronized
will start with an incorrect wallclock time and the hardware RTC will
not be updated (as on bare metal).

These series fixes the above limitations and depends on "x86: increase
precision of x86_platform.get/set_wallclock()" which was previously
posted.

Changes in v6:

Fix hrtimers_resume() to handle resuming with two or more online CPUs
instead of adding a hrtimers_late_resume(). Add a flag to the
pvclock_gtod notififcations for the clock being set and go back to
using this to sync the Xen wallclock.

Changes in v5:

Dropped the change to disable non-boot CPUs during suspend on Xen as
migration downtime was too poor. Instead, provide
hrtimers_late_resume() for use by Xen's resume code to replace the
call of clock_was_set(). Fix two unused variable warnings.

Changes in v4:

Add a new clock_was_set notifier chain. Use this instead of direct
calls to clock_was_set() from the timekeeping code. Use this notifier
and a new timer to synchronize the Xen wallclock.

Changes in v3:

Don't peek at the timekeeper internals (use __current_kernel_time()
instead). Use the native set_wallclock hook in dom0.

Changes in v2:

Reworked to use the pvclock_gtod notifier to sync the wallclock (this
looked similar to what a KVM host does). update_persistent_clock()
will now only update the CMOS RTC.

David


2013-06-27 10:35:59

by David Vrabel

[permalink] [raw]
Subject: [PATCH 5/5] x86/xen: sync the CMOS RTC as well as the Xen wallclock

From: David Vrabel <[email protected]>

Adjustments to Xen's persistent clock via update_persistent_clock()
don't actually persist, as the Xen wallclock is a software only clock
and modifications to it do not modify the underlying CMOS RTC.

The x86_platform.set_wallclock hook can be used to keep the hardware
RTC synchronized (as on bare metal). Because the Xen wallclock is now
kept synchronized by pvclock_gtod notifier, xen_set_wallclock() need
not do this and dom0 can simply use the native implementation.

Signed-off-by: David Vrabel <[email protected]>
---
arch/x86/xen/time.c | 32 ++++++++++++++++----------------
1 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 3364850..20e395a 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -199,37 +199,35 @@ static void xen_get_wallclock(struct timespec *now)

static int xen_set_wallclock(const struct timespec *now)
{
- struct xen_platform_op op;
-
- /* do nothing for domU */
- if (!xen_initial_domain())
- return -1;
-
- op.cmd = XENPF_settime;
- op.u.settime.secs = now->tv_sec;
- op.u.settime.nsecs = now->tv_nsec;
- op.u.settime.system_time = xen_clocksource_read();
-
- return HYPERVISOR_dom0_op(&op);
+ return -1;
}

static int xen_pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
void *priv)
{
+ static struct timespec next;
struct timespec now;
struct xen_platform_op op;

- if (!was_set)
- return NOTIFY_OK;
-
now = __current_kernel_time();

+ if (!was_set && timespec_compare(&now, &next) < 0)
+ return NOTIFY_OK;
+
op.cmd = XENPF_settime;
op.u.settime.secs = now.tv_sec;
op.u.settime.nsecs = now.tv_nsec;
op.u.settime.system_time = xen_clocksource_read();

(void)HYPERVISOR_dom0_op(&op);
+
+ /*
+ * Don't update the wallclock for another 11 minutes. This is
+ * the same period as the sync_cmos_clock() work.
+ */
+ next = now;
+ next.tv_sec += 11*60;
+
return NOTIFY_OK;
}

@@ -513,7 +511,9 @@ void __init xen_init_time_ops(void)

x86_platform.calibrate_tsc = xen_tsc_khz;
x86_platform.get_wallclock = xen_get_wallclock;
- x86_platform.set_wallclock = xen_set_wallclock;
+ /* Dom0 uses the native method to set the hardware RTC. */
+ if (!xen_initial_domain())
+ x86_platform.set_wallclock = xen_set_wallclock;
}

#ifdef CONFIG_XEN_PVHVM
--
1.7.2.5

2013-06-27 10:35:58

by David Vrabel

[permalink] [raw]
Subject: [PATCH 4/5] x86/xen: sync the wallclock when the system time is set

From: David Vrabel <[email protected]>

Currently the Xen wallclock is only updated every 11 minutes if NTP is
synchronized to its clock source (using the sync_cmos_clock() work).
If a guest is started before NTP is synchronized it may see an
incorrect wallclock time.

Use the pvclock_gtod notifier chain to receive a notification when the
system time has changed and update the wallclock to match.

This chain is called on every timer tick and we want to avoid an extra
(expensive) hypercall on every tick. Because dom0 has historically
never provided a very accurate wallclock and guests do not expect one,
we can do this simply: the wallclock is only updated if the clock was
set.

Signed-off-by: David Vrabel <[email protected]>
---
arch/x86/xen/time.c | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index a1947ac..3364850 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -14,6 +14,7 @@
#include <linux/kernel_stat.h>
#include <linux/math64.h>
#include <linux/gfp.h>
+#include <linux/pvclock_gtod.h>

#include <asm/pvclock.h>
#include <asm/xen/hypervisor.h>
@@ -212,6 +213,30 @@ static int xen_set_wallclock(const struct timespec *now)
return HYPERVISOR_dom0_op(&op);
}

+static int xen_pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
+ void *priv)
+{
+ struct timespec now;
+ struct xen_platform_op op;
+
+ if (!was_set)
+ return NOTIFY_OK;
+
+ now = __current_kernel_time();
+
+ op.cmd = XENPF_settime;
+ op.u.settime.secs = now.tv_sec;
+ op.u.settime.nsecs = now.tv_nsec;
+ op.u.settime.system_time = xen_clocksource_read();
+
+ (void)HYPERVISOR_dom0_op(&op);
+ return NOTIFY_OK;
+}
+
+static struct notifier_block xen_pvclock_gtod_notifier = {
+ .notifier_call = xen_pvclock_gtod_notify,
+};
+
static struct clocksource xen_clocksource __read_mostly = {
.name = "xen",
.rating = 400,
@@ -473,6 +498,9 @@ static void __init xen_time_init(void)
xen_setup_runstate_info(cpu);
xen_setup_timer(cpu);
xen_setup_cpu_clockevents();
+
+ if (xen_initial_domain())
+ pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
}

void __init xen_init_time_ops(void)
--
1.7.2.5

2013-06-27 10:36:42

by David Vrabel

[permalink] [raw]
Subject: [PATCH 3/5] time: indicate that the clock was set in the pvclock gtod notifier chain

From: David Vrabel <[email protected]>

If the clock was set (stepped), set the action parameter to functions
in the pvclock gtod notifier chain to non-zero. This allows the
callee to only do work if the clock was stepped.

This will be used on Xen as the synchronization of the Xen wallclock
to the control domain's (dom0) system time will be done with this
notifier and updating on every timer tick is unnecessary and too
expensive.

Signed-off-by: David Vrabel <[email protected]>
---
include/linux/pvclock_gtod.h | 7 +++++++
kernel/time/timekeeping.c | 30 ++++++++++++++++++------------
2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/include/linux/pvclock_gtod.h b/include/linux/pvclock_gtod.h
index 0ca7582..a71d2db 100644
--- a/include/linux/pvclock_gtod.h
+++ b/include/linux/pvclock_gtod.h
@@ -3,6 +3,13 @@

#include <linux/notifier.h>

+/*
+ * The pvclock gtod notifier is called when the system time is updated
+ * and is used to keep guest time synchronized with host time.
+ *
+ * The 'action' parameter in the notifier function is false (0), or
+ * true (non-zero) if system time was stepped.
+ */
extern int pvclock_gtod_register_notifier(struct notifier_block *nb);
extern int pvclock_gtod_unregister_notifier(struct notifier_block *nb);

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 7aed2b0..dd1b94d 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -28,6 +28,7 @@

#define TK_CLEAR_NTP (1 << 0)
#define TK_MIRROR (1 << 1)
+#define TK_CLOCK_WAS_SET (1 << 2)

static struct timekeeper timekeeper;
static DEFINE_RAW_SPINLOCK(timekeeper_lock);
@@ -203,9 +204,9 @@ static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk)

static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);

-static void update_pvclock_gtod(struct timekeeper *tk)
+static void update_pvclock_gtod(struct timekeeper *tk, bool was_set)
{
- raw_notifier_call_chain(&pvclock_gtod_chain, 0, tk);
+ raw_notifier_call_chain(&pvclock_gtod_chain, was_set, tk);
}

/**
@@ -219,7 +220,7 @@ int pvclock_gtod_register_notifier(struct notifier_block *nb)

raw_spin_lock_irqsave(&timekeeper_lock, flags);
ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
- update_pvclock_gtod(tk);
+ update_pvclock_gtod(tk, true);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

return ret;
@@ -251,7 +252,7 @@ static void timekeeping_update(struct timekeeper *tk, unsigned action)
ntp_clear();
}
update_vsyscall(tk);
- update_pvclock_gtod(tk);
+ update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);

if (action & TK_MIRROR)
memcpy(&shadow_timekeeper, &timekeeper, sizeof(timekeeper));
@@ -511,7 +512,7 @@ int do_settimeofday(const struct timespec *tv)

tk_set_xtime(tk, tv);

- timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -555,7 +556,7 @@ int timekeeping_inject_offset(struct timespec *ts)
tk_set_wall_to_mono(tk, timespec_sub(tk->wall_to_monotonic, *ts));

error: /* even if we error out, we forwarded the time, so call update */
- timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -636,7 +637,7 @@ static int change_clocksource(void *data)
if (old->disable)
old->disable(old);
}
- timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -875,7 +876,7 @@ void timekeeping_inject_sleeptime(struct timespec *delta)

__timekeeping_inject_sleeptime(tk, delta);

- timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -957,7 +958,7 @@ static void timekeeping_resume(void)
tk->cycle_last = clock->cycle_last = cycle_now;
tk->ntp_error = 0;
timekeeping_suspended = 0;
- timekeeping_update(tk, TK_MIRROR);
+ timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

@@ -1239,9 +1240,10 @@ out_adjust:
* It also calls into the NTP code to handle leapsecond processing.
*
*/
-static inline void accumulate_nsecs_to_secs(struct timekeeper *tk)
+static inline bool accumulate_nsecs_to_secs(struct timekeeper *tk)
{
u64 nsecps = (u64)NSEC_PER_SEC << tk->shift;
+ unsigned action = 0;

while (tk->xtime_nsec >= nsecps) {
int leap;
@@ -1264,8 +1266,10 @@ static inline void accumulate_nsecs_to_secs(struct timekeeper *tk)
__timekeeping_set_tai_offset(tk, tk->tai_offset - leap);

clock_was_set_delayed();
+ action = TK_CLOCK_WAS_SET;
}
}
+ return action;
}

/**
@@ -1350,6 +1354,7 @@ static void update_wall_time(void)
struct timekeeper *tk = &shadow_timekeeper;
cycle_t offset;
int shift = 0, maxshift;
+ unsigned action;
unsigned long flags;

raw_spin_lock_irqsave(&timekeeper_lock, flags);
@@ -1402,7 +1407,7 @@ static void update_wall_time(void)
* Finally, make sure that after the rounding
* xtime_nsec isn't larger than NSEC_PER_SEC
*/
- accumulate_nsecs_to_secs(tk);
+ action = accumulate_nsecs_to_secs(tk);

write_seqcount_begin(&timekeeper_seq);
/* Update clock->cycle_last with the new value */
@@ -1418,7 +1423,7 @@ static void update_wall_time(void)
* updating.
*/
memcpy(real_tk, tk, sizeof(*tk));
- timekeeping_update(real_tk, 0);
+ timekeeping_update(real_tk, action);
write_seqcount_end(&timekeeper_seq);
out:
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -1680,6 +1685,7 @@ int do_adjtimex(struct timex *txc)

if (tai != orig_tai) {
__timekeeping_set_tai_offset(tk, tai);
+ update_pvclock_gtod(tk, true);
clock_was_set_delayed();
}
write_seqcount_end(&timekeeper_seq);
--
1.7.2.5

2013-06-27 10:36:41

by David Vrabel

[permalink] [raw]
Subject: [PATCH 2/5] time: pass flags instead of multiple bools to timekeeping_update()

From: David Vrabel <[email protected]>

Instead of passing multiple bools to timekeeping_updated(), define
flags and use a single 'action' parameter. It is then more obvious
what each timekeeping_update() call does.

Signed-off-by: David Vrabel <[email protected]>
---
kernel/time/timekeeping.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index baeeb5c..7aed2b0 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -26,6 +26,9 @@
#include "tick-internal.h"
#include "ntp_internal.h"

+#define TK_CLEAR_NTP (1 << 0)
+#define TK_MIRROR (1 << 1)
+
static struct timekeeper timekeeper;
static DEFINE_RAW_SPINLOCK(timekeeper_lock);
static seqcount_t timekeeper_seq;
@@ -241,16 +244,16 @@ int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);

/* must hold timekeeper_lock */
-static void timekeeping_update(struct timekeeper *tk, bool clearntp, bool mirror)
+static void timekeeping_update(struct timekeeper *tk, unsigned action)
{
- if (clearntp) {
+ if (action & TK_CLEAR_NTP) {
tk->ntp_error = 0;
ntp_clear();
}
update_vsyscall(tk);
update_pvclock_gtod(tk);

- if (mirror)
+ if (action & TK_MIRROR)
memcpy(&shadow_timekeeper, &timekeeper, sizeof(timekeeper));
}

@@ -508,7 +511,7 @@ int do_settimeofday(const struct timespec *tv)

tk_set_xtime(tk, tv);

- timekeeping_update(tk, true, true);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -552,7 +555,7 @@ int timekeeping_inject_offset(struct timespec *ts)
tk_set_wall_to_mono(tk, timespec_sub(tk->wall_to_monotonic, *ts));

error: /* even if we error out, we forwarded the time, so call update */
- timekeeping_update(tk, true, true);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -633,7 +636,7 @@ static int change_clocksource(void *data)
if (old->disable)
old->disable(old);
}
- timekeeping_update(tk, true, true);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -872,7 +875,7 @@ void timekeeping_inject_sleeptime(struct timespec *delta)

__timekeeping_inject_sleeptime(tk, delta);

- timekeeping_update(tk, true, true);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -954,7 +957,7 @@ static void timekeeping_resume(void)
tk->cycle_last = clock->cycle_last = cycle_now;
tk->ntp_error = 0;
timekeeping_suspended = 0;
- timekeeping_update(tk, false, true);
+ timekeeping_update(tk, TK_MIRROR);
write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

@@ -1415,7 +1418,7 @@ static void update_wall_time(void)
* updating.
*/
memcpy(real_tk, tk, sizeof(*tk));
- timekeeping_update(real_tk, false, false);
+ timekeeping_update(real_tk, 0);
write_seqcount_end(&timekeeper_seq);
out:
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
--
1.7.2.5

2013-06-27 10:37:11

by David Vrabel

[permalink] [raw]
Subject: [PATCH 1/5] hrtimers: support resuming with two or more CPUs online (but stopped)

From: David Vrabel <[email protected]>

hrtimers_resume() only reprograms the timers for the current CPU as it
assumes that all other CPUs are offline at this point in the resume
process. If other CPUs are online then their timers will not be
corrected and they may fire at the wrong time.

When running as a Xen guest, this assumption is not true. Non-boot
CPUs are only stopped with IRQs disabled instead of offlining them.
This is a performance optimization as disabling the CPUs would add an
unacceptable amount of additional downtime during a live migration (>
200 ms for a 4 VCPU guest).

hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
as the other CPUs will be stopped with IRQs disabled. Instead, defer
the call to the next softirq.

Signed-off-by: David Vrabel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
drivers/xen/manage.c | 3 ---
kernel/hrtimer.c | 15 ++++++++++++---
2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 412b96c..421da85 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -166,9 +166,6 @@ out_resume:

dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);

- /* Make sure timer events get retriggered on all CPUs */
- clock_was_set();
-
out_thaw:
#ifdef CONFIG_PREEMPT
thaw_processes();
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index fd4b13b..e86827e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -773,15 +773,24 @@ void clock_was_set(void)

/*
* During resume we might have to reprogram the high resolution timer
- * interrupt (on the local CPU):
+ * interrupt on all online CPUs. However, all other CPUs will be
+ * stopped with IRQs interrupts disabled so the clock_was_set() call
+ * must be deferred to the softirq.
+ *
+ * The one-shot timer has already been programmed to fire immediately
+ * (see tick_resume_oneshot()) and this interrupt will trigger the
+ * softirq to run early enough to correctly reprogram the timers on
+ * all CPUs.
*/
void hrtimers_resume(void)
{
+ struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
WARN_ONCE(!irqs_disabled(),
KERN_INFO "hrtimers_resume() called with IRQs enabled!");

- retrigger_next_event(NULL);
- timerfd_clock_was_set();
+ cpu_base->clock_was_set = 1;
+ __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
}

static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
--
1.7.2.5

2013-06-27 17:37:24

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 3/5] time: indicate that the clock was set in the pvclock gtod notifier chain

On 06/27/2013 03:35 AM, David Vrabel wrote:
> From: David Vrabel <[email protected]>
>
> If the clock was set (stepped), set the action parameter to functions
> in the pvclock gtod notifier chain to non-zero. This allows the
> callee to only do work if the clock was stepped.
>
> This will be used on Xen as the synchronization of the Xen wallclock
> to the control domain's (dom0) system time will be done with this
> notifier and updating on every timer tick is unnecessary and too
> expensive.
>
> Signed-off-by: David Vrabel <[email protected]>

Looks pretty good. Minor note below.


>
> @@ -1239,9 +1240,10 @@ out_adjust:
> * It also calls into the NTP code to handle leapsecond processing.
> *
> */
> -static inline void accumulate_nsecs_to_secs(struct timekeeper *tk)
> +static inline bool accumulate_nsecs_to_secs(struct timekeeper *tk)
Shouldn't this be unsigned instead of bool?


thanks
-john

2013-06-27 17:39:10

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 2/5] time: pass flags instead of multiple bools to timekeeping_update()

On 06/27/2013 03:35 AM, David Vrabel wrote:
> From: David Vrabel <[email protected]>
>
> Instead of passing multiple bools to timekeeping_updated(), define
> flags and use a single 'action' parameter. It is then more obvious
> what each timekeeping_update() call does.
>
> Signed-off-by: David Vrabel <[email protected]>
> ---
> kernel/time/timekeeping.c | 21 ++++++++++++---------
> 1 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index baeeb5c..7aed2b0 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -26,6 +26,9 @@
> #include "tick-internal.h"
> #include "ntp_internal.h"
>
> +#define TK_CLEAR_NTP (1 << 0)
> +#define TK_MIRROR (1 << 1)
> +
> static struct timekeeper timekeeper;
> static DEFINE_RAW_SPINLOCK(timekeeper_lock);
> static seqcount_t timekeeper_seq;
> @@ -241,16 +244,16 @@ int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
> EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
>
> /* must hold timekeeper_lock */
> -static void timekeeping_update(struct timekeeper *tk, bool clearntp, bool mirror)
> +static void timekeeping_update(struct timekeeper *tk, unsigned action)

Nit: Mind making this "unsigned int" just for consistency sake with the
rest of the code?

thanks
-john

2013-06-28 10:20:12

by David Vrabel

[permalink] [raw]
Subject: Re: [PATCH 3/5] time: indicate that the clock was set in the pvclock gtod notifier chain

On 27/06/13 18:37, John Stultz wrote:
> On 06/27/2013 03:35 AM, David Vrabel wrote:
>> From: David Vrabel <[email protected]>
>>
>> If the clock was set (stepped), set the action parameter to functions
>> in the pvclock gtod notifier chain to non-zero. This allows the
>> callee to only do work if the clock was stepped.
>>
>> This will be used on Xen as the synchronization of the Xen wallclock
>> to the control domain's (dom0) system time will be done with this
>> notifier and updating on every timer tick is unnecessary and too
>> expensive.
>>
>> Signed-off-by: David Vrabel <[email protected]>
>
> Looks pretty good. Minor note below.

I've fixed this and the other minor change you requested.

You can pull the wallclock-v7 branch from

git://xenbits.xen.org/people/dvrabel/linux.git

or, if you prefer, I can repost the series.

David

2013-06-28 13:14:47

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCHv6 0/5] xen: maintain an accurate persistent clock in more cases

On Thu, 27 Jun 2013, David Vrabel wrote:
> These series fixes the above limitations and depends on "x86: increase
> precision of x86_platform.get/set_wallclock()" which was previously
> posted.

So I'd like to merge that in the following way:

I pick up patches 1-3 and stick them into tip timers/for-xen and merge
that branch into timers/core. When picking up 1/6, I'll drop the xen
part of that, so timers/core will not hold any xen specific bits.

Then the xen folks can pull timers/for-xen and apply the xen specific
stuff on top.

Thanks,

tglx

2013-06-28 15:02:17

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCHv6 0/5] xen: maintain an accurate persistent clock in more cases

On Fri, Jun 28, 2013 at 03:14:42PM +0200, Thomas Gleixner wrote:
> On Thu, 27 Jun 2013, David Vrabel wrote:
> > These series fixes the above limitations and depends on "x86: increase
> > precision of x86_platform.get/set_wallclock()" which was previously
> > posted.
>
> So I'd like to merge that in the following way:
>
> I pick up patches 1-3 and stick them into tip timers/for-xen and merge
> that branch into timers/core. When picking up 1/6, I'll drop the xen
> part of that, so timers/core will not hold any xen specific bits.
>
> Then the xen folks can pull timers/for-xen and apply the xen specific
> stuff on top.

Wouldn't it be easier for you to pick the "xen part of that" as well?
I am OK with you doing that and it all going through your tree.

But if it is easier for you to do the way you said - I can do that too.
Just tell me when to pull timers/for-xen
>
> Thanks,
>
> tglx
>
>

2013-06-28 15:12:39

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCHv6 0/5] xen: maintain an accurate persistent clock in more cases

On Fri, 28 Jun 2013, Konrad Rzeszutek Wilk wrote:
> On Fri, Jun 28, 2013 at 03:14:42PM +0200, Thomas Gleixner wrote:
> > On Thu, 27 Jun 2013, David Vrabel wrote:
> > > These series fixes the above limitations and depends on "x86: increase
> > > precision of x86_platform.get/set_wallclock()" which was previously
> > > posted.
> >
> > So I'd like to merge that in the following way:
> >
> > I pick up patches 1-3 and stick them into tip timers/for-xen and merge
> > that branch into timers/core. When picking up 1/6, I'll drop the xen
> > part of that, so timers/core will not hold any xen specific bits.
> >
> > Then the xen folks can pull timers/for-xen and apply the xen specific
> > stuff on top.
>
> Wouldn't it be easier for you to pick the "xen part of that" as well?
> I am OK with you doing that and it all going through your tree.

I can do that, if that's not conflicting with other xen/x86 stuff
outside of timers/core.

Thanks,

tglx

2013-06-28 15:38:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 5/5] x86/xen: sync the CMOS RTC as well as the Xen wallclock

On Thu, 27 Jun 2013, David Vrabel wrote:

> From: David Vrabel <[email protected]>
>
> Adjustments to Xen's persistent clock via update_persistent_clock()
> don't actually persist, as the Xen wallclock is a software only clock
> and modifications to it do not modify the underlying CMOS RTC.
>
> The x86_platform.set_wallclock hook can be used to keep the hardware
> RTC synchronized (as on bare metal). Because the Xen wallclock is now
> kept synchronized by pvclock_gtod notifier, xen_set_wallclock() need
> not do this and dom0 can simply use the native implementation.

I can understand that part, but ...

> static int xen_pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
> void *priv)
> {
> + static struct timespec next;
> struct timespec now;
> struct xen_platform_op op;
>
> - if (!was_set)
> - return NOTIFY_OK;
> -
> now = __current_kernel_time();
>
> + if (!was_set && timespec_compare(&now, &next) < 0)
> + return NOTIFY_OK;
> +
> op.cmd = XENPF_settime;
> op.u.settime.secs = now.tv_sec;
> op.u.settime.nsecs = now.tv_nsec;
> op.u.settime.system_time = xen_clocksource_read();
>
> (void)HYPERVISOR_dom0_op(&op);
> +
> + /*
> + * Don't update the wallclock for another 11 minutes. This is
> + * the same period as the sync_cmos_clock() work.
> + */
> + next = now;
> + next.tv_sec += 11*60;
> +

How is this related to the changelog? /me is confused .....

2013-06-28 15:49:57

by David Vrabel

[permalink] [raw]
Subject: Re: [PATCH 5/5] x86/xen: sync the CMOS RTC as well as the Xen wallclock

On 28/06/13 16:38, Thomas Gleixner wrote:
> On Thu, 27 Jun 2013, David Vrabel wrote:
>
>> From: David Vrabel <[email protected]>
>>
>> Adjustments to Xen's persistent clock via update_persistent_clock()
>> don't actually persist, as the Xen wallclock is a software only clock
>> and modifications to it do not modify the underlying CMOS RTC.
>>
>> The x86_platform.set_wallclock hook can be used to keep the hardware
>> RTC synchronized (as on bare metal). Because the Xen wallclock is now
>> kept synchronized by pvclock_gtod notifier, xen_set_wallclock() need
>> not do this and dom0 can simply use the native implementation.
>
> I can understand that part, but ...
>
>> static int xen_pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
>> void *priv)
>> {
>> + static struct timespec next;
>> struct timespec now;
>> struct xen_platform_op op;
>>
>> - if (!was_set)
>> - return NOTIFY_OK;
>> -
>> now = __current_kernel_time();
>>
>> + if (!was_set && timespec_compare(&now, &next) < 0)
>> + return NOTIFY_OK;
>> +
>> op.cmd = XENPF_settime;
>> op.u.settime.secs = now.tv_sec;
>> op.u.settime.nsecs = now.tv_nsec;
>> op.u.settime.system_time = xen_clocksource_read();
>>
>> (void)HYPERVISOR_dom0_op(&op);
>> +
>> + /*
>> + * Don't update the wallclock for another 11 minutes. This is
>> + * the same period as the sync_cmos_clock() work.
>> + */
>> + next = now;
>> + next.tv_sec += 11*60;
>> +
>
> How is this related to the changelog? /me is confused .....

Before:

Xen wallclock set when time is stepped.
Xen wallclock set every 11 minutes (by sync_cmos_clock()).
Hardware RTC never set.

After:

Xen wallclock set when time is stepped.
Xen wallclock set every 11 minutes (in pvclock gtod notifier).
Hardware RTC set every 11 minutes (by sync_cmos_clock()).

I'll update the changelog to be more descriptive:

Adjustments to Xen's persistent clock via update_persistent_clock()
don't actually persist, as the Xen wallclock is a software only clock
and modifications to it do not modify the underlying CMOS RTC.

The x86_platform.set_wallclock hook can be used to keep the hardware
RTC synchronized (as on bare metal). If (in dom0) we make the Xen
wallclock periodically synchronized by the pvclock_gtod notifier, the
set_wallclock hook need not update the Xen wallclock and the native
implementation can be used.

Is that better?

David

2013-06-28 16:09:19

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 5/5] x86/xen: sync the CMOS RTC as well as the Xen wallclock

On Fri, 28 Jun 2013, David Vrabel wrote:
>
> Before:
>
> Xen wallclock set when time is stepped.
> Xen wallclock set every 11 minutes (by sync_cmos_clock()).
> Hardware RTC never set.
>
> After:
>
> Xen wallclock set when time is stepped.
> Xen wallclock set every 11 minutes (in pvclock gtod notifier).

Ah, you are emulating the sync_cmos_clock() behaviour for the xen
wallclock via the periodic pvclock_gtod notifier call.

> Hardware RTC set every 11 minutes (by sync_cmos_clock()).
>
> I'll update the changelog to be more descriptive:
>
> Adjustments to Xen's persistent clock via update_persistent_clock()
> don't actually persist, as the Xen wallclock is a software only clock
> and modifications to it do not modify the underlying CMOS RTC.
>
> The x86_platform.set_wallclock hook can be used to keep the hardware
> RTC synchronized (as on bare metal). If (in dom0) we make the Xen
> wallclock periodically synchronized by the pvclock_gtod notifier, the
> set_wallclock hook need not update the Xen wallclock and the native
> implementation can be used.

Yep. I'll pick that up.

Thanks,

tglx

2013-06-28 16:20:13

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCHv6 0/5] xen: maintain an accurate persistent clock in more cases

On Fri, Jun 28, 2013 at 05:12:35PM +0200, Thomas Gleixner wrote:
> On Fri, 28 Jun 2013, Konrad Rzeszutek Wilk wrote:
> > On Fri, Jun 28, 2013 at 03:14:42PM +0200, Thomas Gleixner wrote:
> > > On Thu, 27 Jun 2013, David Vrabel wrote:
> > > > These series fixes the above limitations and depends on "x86: increase
> > > > precision of x86_platform.get/set_wallclock()" which was previously
> > > > posted.
> > >
> > > So I'd like to merge that in the following way:
> > >
> > > I pick up patches 1-3 and stick them into tip timers/for-xen and merge
> > > that branch into timers/core. When picking up 1/6, I'll drop the xen
> > > part of that, so timers/core will not hold any xen specific bits.
> > >
> > > Then the xen folks can pull timers/for-xen and apply the xen specific
> > > stuff on top.
> >
> > Wouldn't it be easier for you to pick the "xen part of that" as well?
> > I am OK with you doing that and it all going through your tree.
>
> I can do that, if that's not conflicting with other xen/x86 stuff
> outside of timers/core.

I got one change in there and it seems to apply cleanly (woot!).
But lets be on a safe side. There are also some Xen ARM changes that
are fiddling with arch/x86/xen/time.c (different maintainer)) - so both
me and Stefano can pull from the timers/for-xen and not have to worry about
conflict resolution.

Thanks!
>
> Thanks,
>
> tglx
>

2013-06-28 16:51:20

by David Vrabel

[permalink] [raw]
Subject: Re: [PATCH 5/5] x86/xen: sync the CMOS RTC as well as the Xen wallclock

On 28/06/13 17:09, Thomas Gleixner wrote:
> On Fri, 28 Jun 2013, David Vrabel wrote:
>>
>> Before:
>>
>> Xen wallclock set when time is stepped.
>> Xen wallclock set every 11 minutes (by sync_cmos_clock()).
>> Hardware RTC never set.
>>
>> After:
>>
>> Xen wallclock set when time is stepped.
>> Xen wallclock set every 11 minutes (in pvclock gtod notifier).
>
> Ah, you are emulating the sync_cmos_clock() behaviour for the xen
> wallclock via the periodic pvclock_gtod notifier call.
>
>> Hardware RTC set every 11 minutes (by sync_cmos_clock()).
>>
>> I'll update the changelog to be more descriptive:
>>
>> Adjustments to Xen's persistent clock via update_persistent_clock()
>> don't actually persist, as the Xen wallclock is a software only clock
>> and modifications to it do not modify the underlying CMOS RTC.
>>
>> The x86_platform.set_wallclock hook can be used to keep the hardware
>> RTC synchronized (as on bare metal). If (in dom0) we make the Xen
>> wallclock periodically synchronized by the pvclock_gtod notifier, the
>> set_wallclock hook need not update the Xen wallclock and the native
>> implementation can be used.
>
> Yep. I'll pick that up.

If it helps, I've made this change as well as splitting the xen part of
the hrtimers patch out into a separate commit.

The following changes since commit 52efa9eb9ea36b457b23b1b5d3dd9ce64d110715:

x86: increase precision of x86_platform.get/set_wallclock() (2013-06-26 18:07:16 +0100)

are available in the git repository at:
git://xenbits.xen.org/people/dvrabel/linux.git wallclock-v8

David Vrabel (6):
hrtimers: support resuming with two or more CPUs online (but stopped)
time: pass flags instead of multiple bools to timekeeping_update()
time: indicate that the clock was set in the pvclock gtod notifier chain
xen: remove unnecessary call to clock_was_set() during resume
x86/xen: sync the wallclock when the system time is set
x86/xen: sync the CMOS RTC as well as the Xen wallclock

arch/x86/xen/time.c | 42 +++++++++++++++++++++++++++++++++++-------
drivers/xen/manage.c | 3 ---
include/linux/pvclock_gtod.h | 7 +++++++
kernel/hrtimer.c | 15 ++++++++++++---
kernel/time/timekeeping.c | 39 ++++++++++++++++++++++++---------------
5 files changed, 78 insertions(+), 28 deletions(-)

David

2013-06-28 19:00:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCHv6 0/5] xen: maintain an accurate persistent clock in more cases

On Fri, 28 Jun 2013, Konrad Rzeszutek Wilk wrote:
> On Fri, Jun 28, 2013 at 05:12:35PM +0200, Thomas Gleixner wrote:
> > On Fri, 28 Jun 2013, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Jun 28, 2013 at 03:14:42PM +0200, Thomas Gleixner wrote:
> > > > On Thu, 27 Jun 2013, David Vrabel wrote:
> > > > > These series fixes the above limitations and depends on "x86: increase
> > > > > precision of x86_platform.get/set_wallclock()" which was previously
> > > > > posted.
> > > >
> > > > So I'd like to merge that in the following way:
> > > >
> > > > I pick up patches 1-3 and stick them into tip timers/for-xen and merge
> > > > that branch into timers/core. When picking up 1/6, I'll drop the xen
> > > > part of that, so timers/core will not hold any xen specific bits.
> > > >
> > > > Then the xen folks can pull timers/for-xen and apply the xen specific
> > > > stuff on top.
> > >
> > > Wouldn't it be easier for you to pick the "xen part of that" as well?
> > > I am OK with you doing that and it all going through your tree.
> >
> > I can do that, if that's not conflicting with other xen/x86 stuff
> > outside of timers/core.
>
> I got one change in there and it seems to apply cleanly (woot!).
> But lets be on a safe side. There are also some Xen ARM changes that
> are fiddling with arch/x86/xen/time.c (different maintainer)) - so both
> me and Stefano can pull from the timers/for-xen and not have to worry about
> conflict resolution.

Hrmpf, just noticed, that it depends on other stuff in timers/core:

ce0b098: x86: Fix vrtc_get_time/set_mmss to use new timespec interface
3565184: x86: Increase precision of x86_platform.get/set_wallclock()

/me rumages for frozen shark.

2013-06-28 20:41:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 5/5] x86/xen: sync the CMOS RTC as well as the Xen wallclock

On Fri, 28 Jun 2013, David Vrabel wrote:
> On 28/06/13 17:09, Thomas Gleixner wrote:
> > Yep. I'll pick that up.
>
> If it helps, I've made this change as well as splitting the xen part of
> the hrtimers patch out into a separate commit.

Fun. I already have that same split in my pending queue :)

Thanks,

tglx

Subject: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

Commit-ID: 7c4c3a0f18ba57ea2a2985034532303d2929902a
Gitweb: http://git.kernel.org/tip/7c4c3a0f18ba57ea2a2985034532303d2929902a
Author: David Vrabel <[email protected]>
AuthorDate: Thu, 27 Jun 2013 11:35:44 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Jun 2013 23:15:06 +0200

hrtimers: Support resuming with two or more CPUs online (but stopped)

hrtimers_resume() only reprograms the timers for the current CPU as it
assumes that all other CPUs are offline at this point in the resume
process. If other CPUs are online then their timers will not be
corrected and they may fire at the wrong time.

When running as a Xen guest, this assumption is not true. Non-boot
CPUs are only stopped with IRQs disabled instead of offlining them.
This is a performance optimization as disabling the CPUs would add an
unacceptable amount of additional downtime during a live migration (>
200 ms for a 4 VCPU guest).

hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
as the other CPUs will be stopped with IRQs disabled. Instead, defer
the call to the next softirq.

[ tglx: Separated the xen change out ]

Signed-off-by: David Vrabel <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: John Stultz <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/hrtimer.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index fd4b13b..e86827e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -773,15 +773,24 @@ void clock_was_set(void)

/*
* During resume we might have to reprogram the high resolution timer
- * interrupt (on the local CPU):
+ * interrupt on all online CPUs. However, all other CPUs will be
+ * stopped with IRQs interrupts disabled so the clock_was_set() call
+ * must be deferred to the softirq.
+ *
+ * The one-shot timer has already been programmed to fire immediately
+ * (see tick_resume_oneshot()) and this interrupt will trigger the
+ * softirq to run early enough to correctly reprogram the timers on
+ * all CPUs.
*/
void hrtimers_resume(void)
{
+ struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
WARN_ONCE(!irqs_disabled(),
KERN_INFO "hrtimers_resume() called with IRQs enabled!");

- retrigger_next_event(NULL);
- timerfd_clock_was_set();
+ cpu_base->clock_was_set = 1;
+ __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
}

static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)

Subject: [tip:timers/core] xen: Remove clock_was_set() call in the resume path

Commit-ID: 0eb071651474952c8b6daecd36b378e2d01be22c
Gitweb: http://git.kernel.org/tip/0eb071651474952c8b6daecd36b378e2d01be22c
Author: David Vrabel <[email protected]>
AuthorDate: Thu, 27 Jun 2013 11:35:44 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Jun 2013 23:15:06 +0200

xen: Remove clock_was_set() call in the resume path

commit 359cdd3f866(xen: maintain clock offset over save/restore) added
a clock_was_set() call into the xen resume code to propagate the
system time changes. With the modified hrtimer resume code, which
makes sure that all cpus are notified this call is not longer necessary.

[ tglx: Separated it from the hrtimer change ]

Signed-off-by: David Vrabel <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: John Stultz <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

---
drivers/xen/manage.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 412b96c..421da85 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -166,9 +166,6 @@ out_resume:

dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);

- /* Make sure timer events get retriggered on all CPUs */
- clock_was_set();
-
out_thaw:
#ifdef CONFIG_PREEMPT
thaw_processes();

Subject: [tip:timers/core] timekeeping: Pass flags instead of multiple bools to timekeeping_update()

Commit-ID: 04397fe94ad65289884b9862b6a0c722ececaadf
Gitweb: http://git.kernel.org/tip/04397fe94ad65289884b9862b6a0c722ececaadf
Author: David Vrabel <[email protected]>
AuthorDate: Thu, 27 Jun 2013 11:35:45 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Jun 2013 23:15:06 +0200

timekeeping: Pass flags instead of multiple bools to timekeeping_update()

Instead of passing multiple bools to timekeeping_updated(), define
flags and use a single 'action' parameter. It is then more obvious
what each timekeeping_update() call does.

Signed-off-by: David Vrabel <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: John Stultz <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/time/timekeeping.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 838fc07..d8b23a9 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -27,6 +27,9 @@
#include "ntp_internal.h"
#include "timekeeping_internal.h"

+#define TK_CLEAR_NTP (1 << 0)
+#define TK_MIRROR (1 << 1)
+
static struct timekeeper timekeeper;
static DEFINE_RAW_SPINLOCK(timekeeper_lock);
static seqcount_t timekeeper_seq;
@@ -242,16 +245,16 @@ int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);

/* must hold timekeeper_lock */
-static void timekeeping_update(struct timekeeper *tk, bool clearntp, bool mirror)
+static void timekeeping_update(struct timekeeper *tk, unsigned int action)
{
- if (clearntp) {
+ if (action & TK_CLEAR_NTP) {
tk->ntp_error = 0;
ntp_clear();
}
update_vsyscall(tk);
update_pvclock_gtod(tk);

- if (mirror)
+ if (action & TK_MIRROR)
memcpy(&shadow_timekeeper, &timekeeper, sizeof(timekeeper));
}

@@ -509,7 +512,7 @@ int do_settimeofday(const struct timespec *tv)

tk_set_xtime(tk, tv);

- timekeeping_update(tk, true, true);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -553,7 +556,7 @@ int timekeeping_inject_offset(struct timespec *ts)
tk_set_wall_to_mono(tk, timespec_sub(tk->wall_to_monotonic, *ts));

error: /* even if we error out, we forwarded the time, so call update */
- timekeeping_update(tk, true, true);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -643,7 +646,7 @@ static int change_clocksource(void *data)
module_put(new->owner);
}
}
- timekeeping_update(tk, true, true);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -884,7 +887,7 @@ void timekeeping_inject_sleeptime(struct timespec *delta)

__timekeeping_inject_sleeptime(tk, delta);

- timekeeping_update(tk, true, true);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -966,7 +969,7 @@ static void timekeeping_resume(void)
tk->cycle_last = clock->cycle_last = cycle_now;
tk->ntp_error = 0;
timekeeping_suspended = 0;
- timekeeping_update(tk, false, true);
+ timekeeping_update(tk, TK_MIRROR);
write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

@@ -1419,7 +1422,7 @@ static void update_wall_time(void)
* updating.
*/
memcpy(real_tk, tk, sizeof(*tk));
- timekeeping_update(real_tk, false, false);
+ timekeeping_update(real_tk, 0);
write_seqcount_end(&timekeeper_seq);
out:
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

Subject: [tip:timers/core] timekeeping: Indicate that clock was set in the pvclock gtod notifier

Commit-ID: 780427f0e113b4c77dfff4d258c05a902cdb0eb9
Gitweb: http://git.kernel.org/tip/780427f0e113b4c77dfff4d258c05a902cdb0eb9
Author: David Vrabel <[email protected]>
AuthorDate: Thu, 27 Jun 2013 11:35:46 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Jun 2013 23:15:06 +0200

timekeeping: Indicate that clock was set in the pvclock gtod notifier

If the clock was set (stepped), set the action parameter to functions
in the pvclock gtod notifier chain to non-zero. This allows the
callee to only do work if the clock was stepped.

This will be used on Xen as the synchronization of the Xen wallclock
to the control domain's (dom0) system time will be done with this
notifier and updating on every timer tick is unnecessary and too
expensive.

Signed-off-by: David Vrabel <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: John Stultz <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/pvclock_gtod.h | 7 +++++++
kernel/time/timekeeping.c | 30 ++++++++++++++++++------------
2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/include/linux/pvclock_gtod.h b/include/linux/pvclock_gtod.h
index 0ca7582..a71d2db 100644
--- a/include/linux/pvclock_gtod.h
+++ b/include/linux/pvclock_gtod.h
@@ -3,6 +3,13 @@

#include <linux/notifier.h>

+/*
+ * The pvclock gtod notifier is called when the system time is updated
+ * and is used to keep guest time synchronized with host time.
+ *
+ * The 'action' parameter in the notifier function is false (0), or
+ * true (non-zero) if system time was stepped.
+ */
extern int pvclock_gtod_register_notifier(struct notifier_block *nb);
extern int pvclock_gtod_unregister_notifier(struct notifier_block *nb);

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index d8b23a9..846d0a1 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -29,6 +29,7 @@

#define TK_CLEAR_NTP (1 << 0)
#define TK_MIRROR (1 << 1)
+#define TK_CLOCK_WAS_SET (1 << 2)

static struct timekeeper timekeeper;
static DEFINE_RAW_SPINLOCK(timekeeper_lock);
@@ -204,9 +205,9 @@ static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk)

static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);

-static void update_pvclock_gtod(struct timekeeper *tk)
+static void update_pvclock_gtod(struct timekeeper *tk, bool was_set)
{
- raw_notifier_call_chain(&pvclock_gtod_chain, 0, tk);
+ raw_notifier_call_chain(&pvclock_gtod_chain, was_set, tk);
}

/**
@@ -220,7 +221,7 @@ int pvclock_gtod_register_notifier(struct notifier_block *nb)

raw_spin_lock_irqsave(&timekeeper_lock, flags);
ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
- update_pvclock_gtod(tk);
+ update_pvclock_gtod(tk, true);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

return ret;
@@ -252,7 +253,7 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
ntp_clear();
}
update_vsyscall(tk);
- update_pvclock_gtod(tk);
+ update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET);

if (action & TK_MIRROR)
memcpy(&shadow_timekeeper, &timekeeper, sizeof(timekeeper));
@@ -512,7 +513,7 @@ int do_settimeofday(const struct timespec *tv)

tk_set_xtime(tk, tv);

- timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -556,7 +557,7 @@ int timekeeping_inject_offset(struct timespec *ts)
tk_set_wall_to_mono(tk, timespec_sub(tk->wall_to_monotonic, *ts));

error: /* even if we error out, we forwarded the time, so call update */
- timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -646,7 +647,7 @@ static int change_clocksource(void *data)
module_put(new->owner);
}
}
- timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -887,7 +888,7 @@ void timekeeping_inject_sleeptime(struct timespec *delta)

__timekeeping_inject_sleeptime(tk, delta);

- timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);
+ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);

write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -969,7 +970,7 @@ static void timekeeping_resume(void)
tk->cycle_last = clock->cycle_last = cycle_now;
tk->ntp_error = 0;
timekeeping_suspended = 0;
- timekeeping_update(tk, TK_MIRROR);
+ timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
write_seqcount_end(&timekeeper_seq);
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

@@ -1243,9 +1244,10 @@ out_adjust:
* It also calls into the NTP code to handle leapsecond processing.
*
*/
-static inline void accumulate_nsecs_to_secs(struct timekeeper *tk)
+static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
{
u64 nsecps = (u64)NSEC_PER_SEC << tk->shift;
+ unsigned int action = 0;

while (tk->xtime_nsec >= nsecps) {
int leap;
@@ -1268,8 +1270,10 @@ static inline void accumulate_nsecs_to_secs(struct timekeeper *tk)
__timekeeping_set_tai_offset(tk, tk->tai_offset - leap);

clock_was_set_delayed();
+ action = TK_CLOCK_WAS_SET;
}
}
+ return action;
}

/**
@@ -1354,6 +1358,7 @@ static void update_wall_time(void)
struct timekeeper *tk = &shadow_timekeeper;
cycle_t offset;
int shift = 0, maxshift;
+ unsigned int action;
unsigned long flags;

raw_spin_lock_irqsave(&timekeeper_lock, flags);
@@ -1406,7 +1411,7 @@ static void update_wall_time(void)
* Finally, make sure that after the rounding
* xtime_nsec isn't larger than NSEC_PER_SEC
*/
- accumulate_nsecs_to_secs(tk);
+ action = accumulate_nsecs_to_secs(tk);

write_seqcount_begin(&timekeeper_seq);
/* Update clock->cycle_last with the new value */
@@ -1422,7 +1427,7 @@ static void update_wall_time(void)
* updating.
*/
memcpy(real_tk, tk, sizeof(*tk));
- timekeeping_update(real_tk, 0);
+ timekeeping_update(real_tk, action);
write_seqcount_end(&timekeeper_seq);
out:
raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
@@ -1684,6 +1689,7 @@ int do_adjtimex(struct timex *txc)

if (tai != orig_tai) {
__timekeeping_set_tai_offset(tk, tai);
+ update_pvclock_gtod(tk, true);
clock_was_set_delayed();
}
write_seqcount_end(&timekeeper_seq);

Subject: [tip:timers/core] x86: xen: Sync the wallclock when the system time is set

Commit-ID: 5584880e44e49c587059801faa2a9f7d22619c48
Gitweb: http://git.kernel.org/tip/5584880e44e49c587059801faa2a9f7d22619c48
Author: David Vrabel <[email protected]>
AuthorDate: Thu, 27 Jun 2013 11:35:47 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Jun 2013 23:15:06 +0200

x86: xen: Sync the wallclock when the system time is set

Currently the Xen wallclock is only updated every 11 minutes if NTP is
synchronized to its clock source (using the sync_cmos_clock() work).
If a guest is started before NTP is synchronized it may see an
incorrect wallclock time.

Use the pvclock_gtod notifier chain to receive a notification when the
system time has changed and update the wallclock to match.

This chain is called on every timer tick and we want to avoid an extra
(expensive) hypercall on every tick. Because dom0 has historically
never provided a very accurate wallclock and guests do not expect one,
we can do this simply: the wallclock is only updated if the clock was
set.

Signed-off-by: David Vrabel <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: John Stultz <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/xen/time.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index a1947ac..3364850 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -14,6 +14,7 @@
#include <linux/kernel_stat.h>
#include <linux/math64.h>
#include <linux/gfp.h>
+#include <linux/pvclock_gtod.h>

#include <asm/pvclock.h>
#include <asm/xen/hypervisor.h>
@@ -212,6 +213,30 @@ static int xen_set_wallclock(const struct timespec *now)
return HYPERVISOR_dom0_op(&op);
}

+static int xen_pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
+ void *priv)
+{
+ struct timespec now;
+ struct xen_platform_op op;
+
+ if (!was_set)
+ return NOTIFY_OK;
+
+ now = __current_kernel_time();
+
+ op.cmd = XENPF_settime;
+ op.u.settime.secs = now.tv_sec;
+ op.u.settime.nsecs = now.tv_nsec;
+ op.u.settime.system_time = xen_clocksource_read();
+
+ (void)HYPERVISOR_dom0_op(&op);
+ return NOTIFY_OK;
+}
+
+static struct notifier_block xen_pvclock_gtod_notifier = {
+ .notifier_call = xen_pvclock_gtod_notify,
+};
+
static struct clocksource xen_clocksource __read_mostly = {
.name = "xen",
.rating = 400,
@@ -473,6 +498,9 @@ static void __init xen_time_init(void)
xen_setup_runstate_info(cpu);
xen_setup_timer(cpu);
xen_setup_cpu_clockevents();
+
+ if (xen_initial_domain())
+ pvclock_gtod_register_notifier(&xen_pvclock_gtod_notifier);
}

void __init xen_init_time_ops(void)

Subject: [tip:timers/core] x86: xen: Sync the CMOS RTC as well as the Xen wallclock

Commit-ID: 47433b8c9d7480a3eebd99df38e857ce85a37cee
Gitweb: http://git.kernel.org/tip/47433b8c9d7480a3eebd99df38e857ce85a37cee
Author: David Vrabel <[email protected]>
AuthorDate: Thu, 27 Jun 2013 11:35:48 +0100
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 28 Jun 2013 23:15:07 +0200

x86: xen: Sync the CMOS RTC as well as the Xen wallclock

Adjustments to Xen's persistent clock via update_persistent_clock()
don't actually persist, as the Xen wallclock is a software only clock
and modifications to it do not modify the underlying CMOS RTC.

The x86_platform.set_wallclock hook is there to keep the hardware RTC
synchronized. On a guest this is pointless.

On Dom0 we can use the native implementaion which actually updates the
hardware RTC, but we still need to keep the software emulation of RTC
for the guests up to date. The subscription to the pvclock_notifier
allows us to emulate this easily. The notifier is called at every tick
and when the clock was set.

Right now we only use that notifier when the clock was set, but due to
the fact that it is called periodically from the timekeeping update
code, we can utilize it to emulate the NTP driven drift compensation
of update_persistant_clock() for the Xen wall (software) clock.

Add a 11 minutes periodic update to the pvclock_gtod notifier callback
to achieve that. The static variable 'next' which maintains that 11
minutes update cycle is protected by the core code serialization so
there is no need to add a Xen specific serialization mechanism.

[ tglx: Massaged changelog and added a few comments ]

Signed-off-by: David Vrabel <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: John Stultz <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/xen/time.c | 45 ++++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 3364850..7a5671b 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -199,37 +199,42 @@ static void xen_get_wallclock(struct timespec *now)

static int xen_set_wallclock(const struct timespec *now)
{
- struct xen_platform_op op;
-
- /* do nothing for domU */
- if (!xen_initial_domain())
- return -1;
-
- op.cmd = XENPF_settime;
- op.u.settime.secs = now->tv_sec;
- op.u.settime.nsecs = now->tv_nsec;
- op.u.settime.system_time = xen_clocksource_read();
-
- return HYPERVISOR_dom0_op(&op);
+ return -1;
}

-static int xen_pvclock_gtod_notify(struct notifier_block *nb, unsigned long was_set,
- void *priv)
+static int xen_pvclock_gtod_notify(struct notifier_block *nb,
+ unsigned long was_set, void *priv)
{
- struct timespec now;
- struct xen_platform_op op;
+ /* Protected by the calling core code serialization */
+ static struct timespec next_sync;

- if (!was_set)
- return NOTIFY_OK;
+ struct xen_platform_op op;
+ struct timespec now;

now = __current_kernel_time();

+ /*
+ * We only take the expensive HV call when the clock was set
+ * or when the 11 minutes RTC synchronization time elapsed.
+ */
+ if (!was_set && timespec_compare(&now, &next_sync) < 0)
+ return NOTIFY_OK;
+
op.cmd = XENPF_settime;
op.u.settime.secs = now.tv_sec;
op.u.settime.nsecs = now.tv_nsec;
op.u.settime.system_time = xen_clocksource_read();

(void)HYPERVISOR_dom0_op(&op);
+
+ /*
+ * Move the next drift compensation time 11 minutes
+ * ahead. That's emulating the sync_cmos_clock() update for
+ * the hardware RTC.
+ */
+ next_sync = now;
+ next_sync.tv_sec += 11 * 60;
+
return NOTIFY_OK;
}

@@ -513,7 +518,9 @@ void __init xen_init_time_ops(void)

x86_platform.calibrate_tsc = xen_tsc_khz;
x86_platform.get_wallclock = xen_get_wallclock;
- x86_platform.set_wallclock = xen_set_wallclock;
+ /* Dom0 uses the native method to set the hardware RTC. */
+ if (!xen_initial_domain())
+ x86_platform.set_wallclock = xen_set_wallclock;
}

#ifdef CONFIG_XEN_PVHVM

2013-07-05 09:30:18

by Artem Savkov

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

This commit brings up a warning about a potential deadlock in
smp_call_function_many() discussed previously:
https://lkml.org/lkml/2013/4/18/546

[ 4363.082047] PM: Restoring platform NVS memory
[ 4363.082471] ------------[ cut here ]------------
[ 4363.083800] WARNING: CPU: 0 PID: 3977 at kernel/smp.c:385 smp_call_function_many+0xbd/0x2c0()
[ 4363.085789] Modules linked in:
[ 4363.086542] CPU: 0 PID: 3977 Comm: pm-suspend Tainted: G W 3.10.0-next-20130705 #126
[ 4363.088634] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 4363.090402] 0000000000000181 ffff88001f403d98 ffffffff81d2e2b7 ffffffff821cdb2e
[ 4363.092366] 0000000000000000 ffff88001f403dd8 ffffffff810a278c ffff88001f403dd8
[ 4363.094215] 0000000000000000 0000000000000000 0000000000000000 ffffffff82561898
[ 4363.096094] Call Trace:
[ 4363.096656] <IRQ> [<ffffffff81d2e2b7>] dump_stack+0x59/0x82
[ 4363.098259] [<ffffffff810a278c>] warn_slowpath_common+0x8c/0xc0
[ 4363.099762] [<ffffffff810a27da>] warn_slowpath_null+0x1a/0x20
[ 4363.100925] [<ffffffff81118fcd>] smp_call_function_many+0xbd/0x2c0
[ 4363.101937] [<ffffffff8110f0dd>] ? trace_hardirqs_on+0xd/0x10
[ 4363.102870] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
[ 4363.103737] [<ffffffff811193bb>] smp_call_function+0x3b/0x50
[ 4363.104609] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
[ 4363.105455] [<ffffffff8111943b>] on_each_cpu+0x3b/0x120
[ 4363.106249] [<ffffffff810d743c>] clock_was_set+0x1c/0x30
[ 4363.107168] [<ffffffff810d825c>] run_hrtimer_softirq+0x2c/0x40
[ 4363.108055] [<ffffffff810acf26>] __do_softirq+0x216/0x450
[ 4363.108865] [<ffffffff810ad297>] irq_exit+0x77/0xb0
[ 4363.109618] [<ffffffff81d414da>] smp_apic_timer_interrupt+0x4a/0x60
[ 4363.110569] [<ffffffff81d40032>] apic_timer_interrupt+0x72/0x80
[ 4363.111467] <EOI> [<ffffffff8110eb24>] ? mark_held_locks+0x134/0x160
[ 4363.112481] [<ffffffff810f52af>] ? arch_suspend_enable_irqs+0x2f/0x40
[ 4363.113457] [<ffffffff810f528e>] ? arch_suspend_enable_irqs+0xe/0x40
[ 4363.113910] [<ffffffff810f54c1>] suspend_enter+0x1d1/0x270
[ 4363.114320] [<ffffffff810f5794>] suspend_devices_and_enter+0x1a4/0x390
[ 4363.114887] [<ffffffff810f5a74>] enter_state+0xf4/0x150
[ 4363.115288] [<ffffffff810f5aeb>] pm_suspend+0x1b/0x70
[ 4363.115662] [<ffffffff810f452b>] state_store+0xeb/0xf0
[ 4363.116055] [<ffffffff815d3fc7>] kobj_attr_store+0x17/0x20
[ 4363.116468] [<ffffffff8126f113>] sysfs_write_file+0xb3/0x100
[ 4363.116890] [<ffffffff811f0aca>] vfs_write+0xda/0x170
[ 4363.117274] [<ffffffff811f10b2>] SyS_write+0x62/0xa0
[ 4363.117645] [<ffffffff815e01fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 4363.118118] [<ffffffff81d3f399>] system_call_fastpath+0x16/0x1b
[ 4363.118558] ---[ end trace 87cc49a77badea1e ]---
[ 4363.119093] Enabling non-boot CPUs ...


On Fri, Jun 28, 2013 at 02:18:37PM -0700, tip-bot for David Vrabel wrote:
> Commit-ID: 7c4c3a0f18ba57ea2a2985034532303d2929902a
> Gitweb: http://git.kernel.org/tip/7c4c3a0f18ba57ea2a2985034532303d2929902a
> Author: David Vrabel <[email protected]>
> AuthorDate: Thu, 27 Jun 2013 11:35:44 +0100
> Committer: Thomas Gleixner <[email protected]>
> CommitDate: Fri, 28 Jun 2013 23:15:06 +0200
>
> hrtimers: Support resuming with two or more CPUs online (but stopped)
>
> hrtimers_resume() only reprograms the timers for the current CPU as it
> assumes that all other CPUs are offline at this point in the resume
> process. If other CPUs are online then their timers will not be
> corrected and they may fire at the wrong time.
>
> When running as a Xen guest, this assumption is not true. Non-boot
> CPUs are only stopped with IRQs disabled instead of offlining them.
> This is a performance optimization as disabling the CPUs would add an
> unacceptable amount of additional downtime during a live migration (>
> 200 ms for a 4 VCPU guest).
>
> hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
> as the other CPUs will be stopped with IRQs disabled. Instead, defer
> the call to the next softirq.
>
> [ tglx: Separated the xen change out ]
>
> Signed-off-by: David Vrabel <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> kernel/hrtimer.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index fd4b13b..e86827e 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -773,15 +773,24 @@ void clock_was_set(void)
>
> /*
> * During resume we might have to reprogram the high resolution timer
> - * interrupt (on the local CPU):
> + * interrupt on all online CPUs. However, all other CPUs will be
> + * stopped with IRQs interrupts disabled so the clock_was_set() call
> + * must be deferred to the softirq.
> + *
> + * The one-shot timer has already been programmed to fire immediately
> + * (see tick_resume_oneshot()) and this interrupt will trigger the
> + * softirq to run early enough to correctly reprogram the timers on
> + * all CPUs.
> */
> void hrtimers_resume(void)
> {
> + struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
> +
> WARN_ONCE(!irqs_disabled(),
> KERN_INFO "hrtimers_resume() called with IRQs enabled!");
>
> - retrigger_next_event(NULL);
> - timerfd_clock_was_set();
> + cpu_base->clock_was_set = 1;
> + __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
> }
>
> static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Regards,
Artem

2013-07-05 10:07:16

by David Vrabel

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

On 05/07/13 10:30, Artem Savkov wrote:
> This commit brings up a warning about a potential deadlock in
> smp_call_function_many() discussed previously:
> https://lkml.org/lkml/2013/4/18/546

Can we just avoid the wait in clock_was_set()? Something like this?

8<------------------------------------------------------
hrtimers: do not wait for other CPUs in clock_was_set()

Calling on_each_cpu() and waiting in a softirq causes a WARNing about
a potential deadlock.

Because hrtimers are per-CPU, it is sufficient to ensure that all
other CPUs' timers are reprogrammed as soon as possible and before the
next softirq on that CPU. There is no need to wait for this to be
complete on all CPUs.

on_each_cpu() works by IPIing all CPUs which will ensure
retrigger_next_event() will be called as earlier as possible and
before the next softirq.

Signed-off-by: David Vrabel <[email protected]>
---
kernel/hrtimer.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index e86827e..fd5d391 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -766,7 +766,7 @@ void clock_was_set(void)
{
#ifdef CONFIG_HIGH_RES_TIMERS
/* Retrigger the CPU local events everywhere */
- on_each_cpu(retrigger_next_event, NULL, 1);
+ on_each_cpu(retrigger_next_event, NULL, 0);
#endif
timerfd_clock_was_set();
}
--
1.7.2.5


>
> [ 4363.082047] PM: Restoring platform NVS memory
> [ 4363.082471] ------------[ cut here ]------------
> [ 4363.083800] WARNING: CPU: 0 PID: 3977 at kernel/smp.c:385 smp_call_function_many+0xbd/0x2c0()
> [ 4363.085789] Modules linked in:
> [ 4363.086542] CPU: 0 PID: 3977 Comm: pm-suspend Tainted: G W 3.10.0-next-20130705 #126
> [ 4363.088634] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> [ 4363.090402] 0000000000000181 ffff88001f403d98 ffffffff81d2e2b7 ffffffff821cdb2e
> [ 4363.092366] 0000000000000000 ffff88001f403dd8 ffffffff810a278c ffff88001f403dd8
> [ 4363.094215] 0000000000000000 0000000000000000 0000000000000000 ffffffff82561898
> [ 4363.096094] Call Trace:
> [ 4363.096656] <IRQ> [<ffffffff81d2e2b7>] dump_stack+0x59/0x82
> [ 4363.098259] [<ffffffff810a278c>] warn_slowpath_common+0x8c/0xc0
> [ 4363.099762] [<ffffffff810a27da>] warn_slowpath_null+0x1a/0x20
> [ 4363.100925] [<ffffffff81118fcd>] smp_call_function_many+0xbd/0x2c0
> [ 4363.101937] [<ffffffff8110f0dd>] ? trace_hardirqs_on+0xd/0x10
> [ 4363.102870] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
> [ 4363.103737] [<ffffffff811193bb>] smp_call_function+0x3b/0x50
> [ 4363.104609] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
> [ 4363.105455] [<ffffffff8111943b>] on_each_cpu+0x3b/0x120
> [ 4363.106249] [<ffffffff810d743c>] clock_was_set+0x1c/0x30
> [ 4363.107168] [<ffffffff810d825c>] run_hrtimer_softirq+0x2c/0x40
> [ 4363.108055] [<ffffffff810acf26>] __do_softirq+0x216/0x450
> [ 4363.108865] [<ffffffff810ad297>] irq_exit+0x77/0xb0
> [ 4363.109618] [<ffffffff81d414da>] smp_apic_timer_interrupt+0x4a/0x60
> [ 4363.110569] [<ffffffff81d40032>] apic_timer_interrupt+0x72/0x80
> [ 4363.111467] <EOI> [<ffffffff8110eb24>] ? mark_held_locks+0x134/0x160
> [ 4363.112481] [<ffffffff810f52af>] ? arch_suspend_enable_irqs+0x2f/0x40
> [ 4363.113457] [<ffffffff810f528e>] ? arch_suspend_enable_irqs+0xe/0x40
> [ 4363.113910] [<ffffffff810f54c1>] suspend_enter+0x1d1/0x270
> [ 4363.114320] [<ffffffff810f5794>] suspend_devices_and_enter+0x1a4/0x390
> [ 4363.114887] [<ffffffff810f5a74>] enter_state+0xf4/0x150
> [ 4363.115288] [<ffffffff810f5aeb>] pm_suspend+0x1b/0x70
> [ 4363.115662] [<ffffffff810f452b>] state_store+0xeb/0xf0
> [ 4363.116055] [<ffffffff815d3fc7>] kobj_attr_store+0x17/0x20
> [ 4363.116468] [<ffffffff8126f113>] sysfs_write_file+0xb3/0x100
> [ 4363.116890] [<ffffffff811f0aca>] vfs_write+0xda/0x170
> [ 4363.117274] [<ffffffff811f10b2>] SyS_write+0x62/0xa0
> [ 4363.117645] [<ffffffff815e01fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 4363.118118] [<ffffffff81d3f399>] system_call_fastpath+0x16/0x1b
> [ 4363.118558] ---[ end trace 87cc49a77badea1e ]---
> [ 4363.119093] Enabling non-boot CPUs ...
>
>
> On Fri, Jun 28, 2013 at 02:18:37PM -0700, tip-bot for David Vrabel wrote:
>> Commit-ID: 7c4c3a0f18ba57ea2a2985034532303d2929902a
>> Gitweb: http://git.kernel.org/tip/7c4c3a0f18ba57ea2a2985034532303d2929902a
>> Author: David Vrabel <[email protected]>
>> AuthorDate: Thu, 27 Jun 2013 11:35:44 +0100
>> Committer: Thomas Gleixner <[email protected]>
>> CommitDate: Fri, 28 Jun 2013 23:15:06 +0200
>>
>> hrtimers: Support resuming with two or more CPUs online (but stopped)
>>
>> hrtimers_resume() only reprograms the timers for the current CPU as it
>> assumes that all other CPUs are offline at this point in the resume
>> process. If other CPUs are online then their timers will not be
>> corrected and they may fire at the wrong time.
>>
>> When running as a Xen guest, this assumption is not true. Non-boot
>> CPUs are only stopped with IRQs disabled instead of offlining them.
>> This is a performance optimization as disabling the CPUs would add an
>> unacceptable amount of additional downtime during a live migration (>
>> 200 ms for a 4 VCPU guest).
>>
>> hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
>> as the other CPUs will be stopped with IRQs disabled. Instead, defer
>> the call to the next softirq.
>>
>> [ tglx: Separated the xen change out ]
>>
>> Signed-off-by: David Vrabel <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: John Stultz <[email protected]>
>> Cc: <[email protected]>
>> Link: http://lkml.kernel.org/r/[email protected]
>> Signed-off-by: Thomas Gleixner <[email protected]>
>> ---
>> kernel/hrtimer.c | 15 ++++++++++++---
>> 1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
>> index fd4b13b..e86827e 100644
>> --- a/kernel/hrtimer.c
>> +++ b/kernel/hrtimer.c
>> @@ -773,15 +773,24 @@ void clock_was_set(void)
>>
>> /*
>> * During resume we might have to reprogram the high resolution timer
>> - * interrupt (on the local CPU):
>> + * interrupt on all online CPUs. However, all other CPUs will be
>> + * stopped with IRQs interrupts disabled so the clock_was_set() call
>> + * must be deferred to the softirq.
>> + *
>> + * The one-shot timer has already been programmed to fire immediately
>> + * (see tick_resume_oneshot()) and this interrupt will trigger the
>> + * softirq to run early enough to correctly reprogram the timers on
>> + * all CPUs.
>> */
>> void hrtimers_resume(void)
>> {
>> + struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
>> +
>> WARN_ONCE(!irqs_disabled(),
>> KERN_INFO "hrtimers_resume() called with IRQs enabled!");
>>
>> - retrigger_next_event(NULL);
>> - timerfd_clock_was_set();
>> + cpu_base->clock_was_set = 1;
>> + __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
>> }
>>
>> static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>

2013-07-05 10:09:22

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

On Fri, 5 Jul 2013, Artem Savkov wrote:

> This commit brings up a warning about a potential deadlock in
> smp_call_function_many() discussed previously:
> https://lkml.org/lkml/2013/4/18/546

> On Fri, Jun 28, 2013 at 02:18:37PM -0700, tip-bot for David Vrabel wrote:
> > Commit-ID: 7c4c3a0f18ba57ea2a2985034532303d2929902a

It's not caused by this commit. The problem was there before. We call
clock_was_set() from softirq context because we cannot call it from
the timer interrupt.

So that new WARN on in the smp code requires us to move that call out
from softirq context.

Does the patch below fix it ?

Thanks,

tglx

---
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index e86827e..24c6f3b 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -721,6 +721,18 @@ static int hrtimer_switch_to_hres(void)
return 1;
}

+static void clock_was_set_work(struct work_struct *work)
+{
+ clock_was_set();
+}
+
+static DECLARE_WORK(hrtimer_work, clock_was_set_work);
+
+static void softirq_clock_was_set(void)
+{
+ schedule_work(&hrtimer_work);
+}
+
/*
* Called from timekeeping code to reprogramm the hrtimer interrupt
* device. If called from the timer interrupt context we defer it to
@@ -748,7 +760,10 @@ static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer,
}
static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base) { }
static inline void retrigger_next_event(void *arg) { }
-
+static void softirq_clock_was_set(void)
+{
+ clock_was_set();
+}
#endif /* CONFIG_HIGH_RES_TIMERS */

/*
@@ -1445,7 +1460,7 @@ static void run_hrtimer_softirq(struct softirq_action *h)

if (cpu_base->clock_was_set) {
cpu_base->clock_was_set = 0;
- clock_was_set();
+ softirq_clock_was_set();
}

hrtimer_peek_ahead_timers();

2013-07-05 10:10:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

On Fri, 5 Jul 2013, David Vrabel wrote:

> On 05/07/13 10:30, Artem Savkov wrote:
> > This commit brings up a warning about a potential deadlock in
> > smp_call_function_many() discussed previously:
> > https://lkml.org/lkml/2013/4/18/546
>
> Can we just avoid the wait in clock_was_set()? Something like this?
>
> 8<------------------------------------------------------
> hrtimers: do not wait for other CPUs in clock_was_set()
>
> Calling on_each_cpu() and waiting in a softirq causes a WARNing about
> a potential deadlock.
>
> Because hrtimers are per-CPU, it is sufficient to ensure that all
> other CPUs' timers are reprogrammed as soon as possible and before the
> next softirq on that CPU. There is no need to wait for this to be
> complete on all CPUs.

Cute. Did not think about that!

> on_each_cpu() works by IPIing all CPUs which will ensure
> retrigger_next_event() will be called as earlier as possible and
> before the next softirq.
>
> Signed-off-by: David Vrabel <[email protected]>
> ---
> kernel/hrtimer.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index e86827e..fd5d391 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -766,7 +766,7 @@ void clock_was_set(void)
> {
> #ifdef CONFIG_HIGH_RES_TIMERS
> /* Retrigger the CPU local events everywhere */
> - on_each_cpu(retrigger_next_event, NULL, 1);
> + on_each_cpu(retrigger_next_event, NULL, 0);
> #endif
> timerfd_clock_was_set();
> }
> --
> 1.7.2.5
>
>
> >
> > [ 4363.082047] PM: Restoring platform NVS memory
> > [ 4363.082471] ------------[ cut here ]------------
> > [ 4363.083800] WARNING: CPU: 0 PID: 3977 at kernel/smp.c:385 smp_call_function_many+0xbd/0x2c0()
> > [ 4363.085789] Modules linked in:
> > [ 4363.086542] CPU: 0 PID: 3977 Comm: pm-suspend Tainted: G W 3.10.0-next-20130705 #126
> > [ 4363.088634] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> > [ 4363.090402] 0000000000000181 ffff88001f403d98 ffffffff81d2e2b7 ffffffff821cdb2e
> > [ 4363.092366] 0000000000000000 ffff88001f403dd8 ffffffff810a278c ffff88001f403dd8
> > [ 4363.094215] 0000000000000000 0000000000000000 0000000000000000 ffffffff82561898
> > [ 4363.096094] Call Trace:
> > [ 4363.096656] <IRQ> [<ffffffff81d2e2b7>] dump_stack+0x59/0x82
> > [ 4363.098259] [<ffffffff810a278c>] warn_slowpath_common+0x8c/0xc0
> > [ 4363.099762] [<ffffffff810a27da>] warn_slowpath_null+0x1a/0x20
> > [ 4363.100925] [<ffffffff81118fcd>] smp_call_function_many+0xbd/0x2c0
> > [ 4363.101937] [<ffffffff8110f0dd>] ? trace_hardirqs_on+0xd/0x10
> > [ 4363.102870] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
> > [ 4363.103737] [<ffffffff811193bb>] smp_call_function+0x3b/0x50
> > [ 4363.104609] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
> > [ 4363.105455] [<ffffffff8111943b>] on_each_cpu+0x3b/0x120
> > [ 4363.106249] [<ffffffff810d743c>] clock_was_set+0x1c/0x30
> > [ 4363.107168] [<ffffffff810d825c>] run_hrtimer_softirq+0x2c/0x40
> > [ 4363.108055] [<ffffffff810acf26>] __do_softirq+0x216/0x450
> > [ 4363.108865] [<ffffffff810ad297>] irq_exit+0x77/0xb0
> > [ 4363.109618] [<ffffffff81d414da>] smp_apic_timer_interrupt+0x4a/0x60
> > [ 4363.110569] [<ffffffff81d40032>] apic_timer_interrupt+0x72/0x80
> > [ 4363.111467] <EOI> [<ffffffff8110eb24>] ? mark_held_locks+0x134/0x160
> > [ 4363.112481] [<ffffffff810f52af>] ? arch_suspend_enable_irqs+0x2f/0x40
> > [ 4363.113457] [<ffffffff810f528e>] ? arch_suspend_enable_irqs+0xe/0x40
> > [ 4363.113910] [<ffffffff810f54c1>] suspend_enter+0x1d1/0x270
> > [ 4363.114320] [<ffffffff810f5794>] suspend_devices_and_enter+0x1a4/0x390
> > [ 4363.114887] [<ffffffff810f5a74>] enter_state+0xf4/0x150
> > [ 4363.115288] [<ffffffff810f5aeb>] pm_suspend+0x1b/0x70
> > [ 4363.115662] [<ffffffff810f452b>] state_store+0xeb/0xf0
> > [ 4363.116055] [<ffffffff815d3fc7>] kobj_attr_store+0x17/0x20
> > [ 4363.116468] [<ffffffff8126f113>] sysfs_write_file+0xb3/0x100
> > [ 4363.116890] [<ffffffff811f0aca>] vfs_write+0xda/0x170
> > [ 4363.117274] [<ffffffff811f10b2>] SyS_write+0x62/0xa0
> > [ 4363.117645] [<ffffffff815e01fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [ 4363.118118] [<ffffffff81d3f399>] system_call_fastpath+0x16/0x1b
> > [ 4363.118558] ---[ end trace 87cc49a77badea1e ]---
> > [ 4363.119093] Enabling non-boot CPUs ...
> >
> >
> > On Fri, Jun 28, 2013 at 02:18:37PM -0700, tip-bot for David Vrabel wrote:
> >> Commit-ID: 7c4c3a0f18ba57ea2a2985034532303d2929902a
> >> Gitweb: http://git.kernel.org/tip/7c4c3a0f18ba57ea2a2985034532303d2929902a
> >> Author: David Vrabel <[email protected]>
> >> AuthorDate: Thu, 27 Jun 2013 11:35:44 +0100
> >> Committer: Thomas Gleixner <[email protected]>
> >> CommitDate: Fri, 28 Jun 2013 23:15:06 +0200
> >>
> >> hrtimers: Support resuming with two or more CPUs online (but stopped)
> >>
> >> hrtimers_resume() only reprograms the timers for the current CPU as it
> >> assumes that all other CPUs are offline at this point in the resume
> >> process. If other CPUs are online then their timers will not be
> >> corrected and they may fire at the wrong time.
> >>
> >> When running as a Xen guest, this assumption is not true. Non-boot
> >> CPUs are only stopped with IRQs disabled instead of offlining them.
> >> This is a performance optimization as disabling the CPUs would add an
> >> unacceptable amount of additional downtime during a live migration (>
> >> 200 ms for a 4 VCPU guest).
> >>
> >> hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
> >> as the other CPUs will be stopped with IRQs disabled. Instead, defer
> >> the call to the next softirq.
> >>
> >> [ tglx: Separated the xen change out ]
> >>
> >> Signed-off-by: David Vrabel <[email protected]>
> >> Cc: Konrad Rzeszutek Wilk <[email protected]>
> >> Cc: John Stultz <[email protected]>
> >> Cc: <[email protected]>
> >> Link: http://lkml.kernel.org/r/[email protected]
> >> Signed-off-by: Thomas Gleixner <[email protected]>
> >> ---
> >> kernel/hrtimer.c | 15 ++++++++++++---
> >> 1 file changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> >> index fd4b13b..e86827e 100644
> >> --- a/kernel/hrtimer.c
> >> +++ b/kernel/hrtimer.c
> >> @@ -773,15 +773,24 @@ void clock_was_set(void)
> >>
> >> /*
> >> * During resume we might have to reprogram the high resolution timer
> >> - * interrupt (on the local CPU):
> >> + * interrupt on all online CPUs. However, all other CPUs will be
> >> + * stopped with IRQs interrupts disabled so the clock_was_set() call
> >> + * must be deferred to the softirq.
> >> + *
> >> + * The one-shot timer has already been programmed to fire immediately
> >> + * (see tick_resume_oneshot()) and this interrupt will trigger the
> >> + * softirq to run early enough to correctly reprogram the timers on
> >> + * all CPUs.
> >> */
> >> void hrtimers_resume(void)
> >> {
> >> + struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
> >> +
> >> WARN_ONCE(!irqs_disabled(),
> >> KERN_INFO "hrtimers_resume() called with IRQs enabled!");
> >>
> >> - retrigger_next_event(NULL);
> >> - timerfd_clock_was_set();
> >> + cpu_base->clock_was_set = 1;
> >> + __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
> >> }
> >>
> >> static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at http://www.tux.org/lkml/
> >>
> >
>
>

2013-07-05 10:25:37

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

On Fri, 5 Jul 2013, David Vrabel wrote:

You failed to CC Artem :(

> On 05/07/13 10:30, Artem Savkov wrote:
> > This commit brings up a warning about a potential deadlock in
> > smp_call_function_many() discussed previously:
> > https://lkml.org/lkml/2013/4/18/546
>
> Can we just avoid the wait in clock_was_set()? Something like this?
>
> 8<------------------------------------------------------
> hrtimers: do not wait for other CPUs in clock_was_set()
>
> Calling on_each_cpu() and waiting in a softirq causes a WARNing about
> a potential deadlock.
>
> Because hrtimers are per-CPU, it is sufficient to ensure that all
> other CPUs' timers are reprogrammed as soon as possible and before the
> next softirq on that CPU. There is no need to wait for this to be
> complete on all CPUs.
>
> on_each_cpu() works by IPIing all CPUs which will ensure
> retrigger_next_event() will be called as earlier as possible and
> before the next softirq.
>
> Signed-off-by: David Vrabel <[email protected]>
> ---
> kernel/hrtimer.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index e86827e..fd5d391 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -766,7 +766,7 @@ void clock_was_set(void)
> {
> #ifdef CONFIG_HIGH_RES_TIMERS
> /* Retrigger the CPU local events everywhere */
> - on_each_cpu(retrigger_next_event, NULL, 1);
> + on_each_cpu(retrigger_next_event, NULL, 0);
> #endif
> timerfd_clock_was_set();
> }
> --
> 1.7.2.5
>
>
> >
> > [ 4363.082047] PM: Restoring platform NVS memory
> > [ 4363.082471] ------------[ cut here ]------------
> > [ 4363.083800] WARNING: CPU: 0 PID: 3977 at kernel/smp.c:385 smp_call_function_many+0xbd/0x2c0()
> > [ 4363.085789] Modules linked in:
> > [ 4363.086542] CPU: 0 PID: 3977 Comm: pm-suspend Tainted: G W 3.10.0-next-20130705 #126
> > [ 4363.088634] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> > [ 4363.090402] 0000000000000181 ffff88001f403d98 ffffffff81d2e2b7 ffffffff821cdb2e
> > [ 4363.092366] 0000000000000000 ffff88001f403dd8 ffffffff810a278c ffff88001f403dd8
> > [ 4363.094215] 0000000000000000 0000000000000000 0000000000000000 ffffffff82561898
> > [ 4363.096094] Call Trace:
> > [ 4363.096656] <IRQ> [<ffffffff81d2e2b7>] dump_stack+0x59/0x82
> > [ 4363.098259] [<ffffffff810a278c>] warn_slowpath_common+0x8c/0xc0
> > [ 4363.099762] [<ffffffff810a27da>] warn_slowpath_null+0x1a/0x20
> > [ 4363.100925] [<ffffffff81118fcd>] smp_call_function_many+0xbd/0x2c0
> > [ 4363.101937] [<ffffffff8110f0dd>] ? trace_hardirqs_on+0xd/0x10
> > [ 4363.102870] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
> > [ 4363.103737] [<ffffffff811193bb>] smp_call_function+0x3b/0x50
> > [ 4363.104609] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
> > [ 4363.105455] [<ffffffff8111943b>] on_each_cpu+0x3b/0x120
> > [ 4363.106249] [<ffffffff810d743c>] clock_was_set+0x1c/0x30
> > [ 4363.107168] [<ffffffff810d825c>] run_hrtimer_softirq+0x2c/0x40
> > [ 4363.108055] [<ffffffff810acf26>] __do_softirq+0x216/0x450
> > [ 4363.108865] [<ffffffff810ad297>] irq_exit+0x77/0xb0
> > [ 4363.109618] [<ffffffff81d414da>] smp_apic_timer_interrupt+0x4a/0x60
> > [ 4363.110569] [<ffffffff81d40032>] apic_timer_interrupt+0x72/0x80
> > [ 4363.111467] <EOI> [<ffffffff8110eb24>] ? mark_held_locks+0x134/0x160
> > [ 4363.112481] [<ffffffff810f52af>] ? arch_suspend_enable_irqs+0x2f/0x40
> > [ 4363.113457] [<ffffffff810f528e>] ? arch_suspend_enable_irqs+0xe/0x40
> > [ 4363.113910] [<ffffffff810f54c1>] suspend_enter+0x1d1/0x270
> > [ 4363.114320] [<ffffffff810f5794>] suspend_devices_and_enter+0x1a4/0x390
> > [ 4363.114887] [<ffffffff810f5a74>] enter_state+0xf4/0x150
> > [ 4363.115288] [<ffffffff810f5aeb>] pm_suspend+0x1b/0x70
> > [ 4363.115662] [<ffffffff810f452b>] state_store+0xeb/0xf0
> > [ 4363.116055] [<ffffffff815d3fc7>] kobj_attr_store+0x17/0x20
> > [ 4363.116468] [<ffffffff8126f113>] sysfs_write_file+0xb3/0x100
> > [ 4363.116890] [<ffffffff811f0aca>] vfs_write+0xda/0x170
> > [ 4363.117274] [<ffffffff811f10b2>] SyS_write+0x62/0xa0
> > [ 4363.117645] [<ffffffff815e01fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [ 4363.118118] [<ffffffff81d3f399>] system_call_fastpath+0x16/0x1b
> > [ 4363.118558] ---[ end trace 87cc49a77badea1e ]---
> > [ 4363.119093] Enabling non-boot CPUs ...
> >
> >
> > On Fri, Jun 28, 2013 at 02:18:37PM -0700, tip-bot for David Vrabel wrote:
> >> Commit-ID: 7c4c3a0f18ba57ea2a2985034532303d2929902a
> >> Gitweb: http://git.kernel.org/tip/7c4c3a0f18ba57ea2a2985034532303d2929902a
> >> Author: David Vrabel <[email protected]>
> >> AuthorDate: Thu, 27 Jun 2013 11:35:44 +0100
> >> Committer: Thomas Gleixner <[email protected]>
> >> CommitDate: Fri, 28 Jun 2013 23:15:06 +0200
> >>
> >> hrtimers: Support resuming with two or more CPUs online (but stopped)
> >>
> >> hrtimers_resume() only reprograms the timers for the current CPU as it
> >> assumes that all other CPUs are offline at this point in the resume
> >> process. If other CPUs are online then their timers will not be
> >> corrected and they may fire at the wrong time.
> >>
> >> When running as a Xen guest, this assumption is not true. Non-boot
> >> CPUs are only stopped with IRQs disabled instead of offlining them.
> >> This is a performance optimization as disabling the CPUs would add an
> >> unacceptable amount of additional downtime during a live migration (>
> >> 200 ms for a 4 VCPU guest).
> >>
> >> hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
> >> as the other CPUs will be stopped with IRQs disabled. Instead, defer
> >> the call to the next softirq.
> >>
> >> [ tglx: Separated the xen change out ]
> >>
> >> Signed-off-by: David Vrabel <[email protected]>
> >> Cc: Konrad Rzeszutek Wilk <[email protected]>
> >> Cc: John Stultz <[email protected]>
> >> Cc: <[email protected]>
> >> Link: http://lkml.kernel.org/r/[email protected]
> >> Signed-off-by: Thomas Gleixner <[email protected]>
> >> ---
> >> kernel/hrtimer.c | 15 ++++++++++++---
> >> 1 file changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> >> index fd4b13b..e86827e 100644
> >> --- a/kernel/hrtimer.c
> >> +++ b/kernel/hrtimer.c
> >> @@ -773,15 +773,24 @@ void clock_was_set(void)
> >>
> >> /*
> >> * During resume we might have to reprogram the high resolution timer
> >> - * interrupt (on the local CPU):
> >> + * interrupt on all online CPUs. However, all other CPUs will be
> >> + * stopped with IRQs interrupts disabled so the clock_was_set() call
> >> + * must be deferred to the softirq.
> >> + *
> >> + * The one-shot timer has already been programmed to fire immediately
> >> + * (see tick_resume_oneshot()) and this interrupt will trigger the
> >> + * softirq to run early enough to correctly reprogram the timers on
> >> + * all CPUs.
> >> */
> >> void hrtimers_resume(void)
> >> {
> >> + struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
> >> +
> >> WARN_ONCE(!irqs_disabled(),
> >> KERN_INFO "hrtimers_resume() called with IRQs enabled!");
> >>
> >> - retrigger_next_event(NULL);
> >> - timerfd_clock_was_set();
> >> + cpu_base->clock_was_set = 1;
> >> + __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
> >> }
> >>
> >> static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at http://www.tux.org/lkml/
> >>
> >
>
>

2013-07-05 10:36:53

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

On Fri, 5 Jul 2013, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, David Vrabel wrote:
>
> You failed to CC Artem :(
>

Second thought.

Where is that warning? Can't find it in neither in Linus tree nor in tip

> > > [ 4363.083800] WARNING: CPU: 0 PID: 3977 at kernel/smp.c:385 smp_call_function_many+0xbd/0x2c0()

All I see there is:
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress && !early_boot_irqs_disabled);

Thanks,

tglx

2013-07-05 10:38:58

by Artem Savkov

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

Well with David's patch the warning is still hit as path is the same and
WARN_ON_ONCE in question doesn't check wait, should it?

Thomas' patch does seem to work.

On Fri, Jul 05, 2013 at 12:25:34PM +0200, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, David Vrabel wrote:
>
> You failed to CC Artem :(
>
> > On 05/07/13 10:30, Artem Savkov wrote:
> > > This commit brings up a warning about a potential deadlock in
> > > smp_call_function_many() discussed previously:
> > > https://lkml.org/lkml/2013/4/18/546
> >
> > Can we just avoid the wait in clock_was_set()? Something like this?
> >
> > 8<------------------------------------------------------
> > hrtimers: do not wait for other CPUs in clock_was_set()
> >
> > Calling on_each_cpu() and waiting in a softirq causes a WARNing about
> > a potential deadlock.
> >
> > Because hrtimers are per-CPU, it is sufficient to ensure that all
> > other CPUs' timers are reprogrammed as soon as possible and before the
> > next softirq on that CPU. There is no need to wait for this to be
> > complete on all CPUs.
> >
> > on_each_cpu() works by IPIing all CPUs which will ensure
> > retrigger_next_event() will be called as earlier as possible and
> > before the next softirq.
> >
> > Signed-off-by: David Vrabel <[email protected]>
> > ---
> > kernel/hrtimer.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> > index e86827e..fd5d391 100644
> > --- a/kernel/hrtimer.c
> > +++ b/kernel/hrtimer.c
> > @@ -766,7 +766,7 @@ void clock_was_set(void)
> > {
> > #ifdef CONFIG_HIGH_RES_TIMERS
> > /* Retrigger the CPU local events everywhere */
> > - on_each_cpu(retrigger_next_event, NULL, 1);
> > + on_each_cpu(retrigger_next_event, NULL, 0);
> > #endif
> > timerfd_clock_was_set();
> > }
> > --
> > 1.7.2.5
> >
> >
> > >
> > > [ 4363.082047] PM: Restoring platform NVS memory
> > > [ 4363.082471] ------------[ cut here ]------------
> > > [ 4363.083800] WARNING: CPU: 0 PID: 3977 at kernel/smp.c:385 smp_call_function_many+0xbd/0x2c0()
> > > [ 4363.085789] Modules linked in:
> > > [ 4363.086542] CPU: 0 PID: 3977 Comm: pm-suspend Tainted: G W 3.10.0-next-20130705 #126
> > > [ 4363.088634] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> > > [ 4363.090402] 0000000000000181 ffff88001f403d98 ffffffff81d2e2b7 ffffffff821cdb2e
> > > [ 4363.092366] 0000000000000000 ffff88001f403dd8 ffffffff810a278c ffff88001f403dd8
> > > [ 4363.094215] 0000000000000000 0000000000000000 0000000000000000 ffffffff82561898
> > > [ 4363.096094] Call Trace:
> > > [ 4363.096656] <IRQ> [<ffffffff81d2e2b7>] dump_stack+0x59/0x82
> > > [ 4363.098259] [<ffffffff810a278c>] warn_slowpath_common+0x8c/0xc0
> > > [ 4363.099762] [<ffffffff810a27da>] warn_slowpath_null+0x1a/0x20
> > > [ 4363.100925] [<ffffffff81118fcd>] smp_call_function_many+0xbd/0x2c0
> > > [ 4363.101937] [<ffffffff8110f0dd>] ? trace_hardirqs_on+0xd/0x10
> > > [ 4363.102870] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
> > > [ 4363.103737] [<ffffffff811193bb>] smp_call_function+0x3b/0x50
> > > [ 4363.104609] [<ffffffff810d7030>] ? hrtimer_wakeup+0x30/0x30
> > > [ 4363.105455] [<ffffffff8111943b>] on_each_cpu+0x3b/0x120
> > > [ 4363.106249] [<ffffffff810d743c>] clock_was_set+0x1c/0x30
> > > [ 4363.107168] [<ffffffff810d825c>] run_hrtimer_softirq+0x2c/0x40
> > > [ 4363.108055] [<ffffffff810acf26>] __do_softirq+0x216/0x450
> > > [ 4363.108865] [<ffffffff810ad297>] irq_exit+0x77/0xb0
> > > [ 4363.109618] [<ffffffff81d414da>] smp_apic_timer_interrupt+0x4a/0x60
> > > [ 4363.110569] [<ffffffff81d40032>] apic_timer_interrupt+0x72/0x80
> > > [ 4363.111467] <EOI> [<ffffffff8110eb24>] ? mark_held_locks+0x134/0x160
> > > [ 4363.112481] [<ffffffff810f52af>] ? arch_suspend_enable_irqs+0x2f/0x40
> > > [ 4363.113457] [<ffffffff810f528e>] ? arch_suspend_enable_irqs+0xe/0x40
> > > [ 4363.113910] [<ffffffff810f54c1>] suspend_enter+0x1d1/0x270
> > > [ 4363.114320] [<ffffffff810f5794>] suspend_devices_and_enter+0x1a4/0x390
> > > [ 4363.114887] [<ffffffff810f5a74>] enter_state+0xf4/0x150
> > > [ 4363.115288] [<ffffffff810f5aeb>] pm_suspend+0x1b/0x70
> > > [ 4363.115662] [<ffffffff810f452b>] state_store+0xeb/0xf0
> > > [ 4363.116055] [<ffffffff815d3fc7>] kobj_attr_store+0x17/0x20
> > > [ 4363.116468] [<ffffffff8126f113>] sysfs_write_file+0xb3/0x100
> > > [ 4363.116890] [<ffffffff811f0aca>] vfs_write+0xda/0x170
> > > [ 4363.117274] [<ffffffff811f10b2>] SyS_write+0x62/0xa0
> > > [ 4363.117645] [<ffffffff815e01fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > > [ 4363.118118] [<ffffffff81d3f399>] system_call_fastpath+0x16/0x1b
> > > [ 4363.118558] ---[ end trace 87cc49a77badea1e ]---
> > > [ 4363.119093] Enabling non-boot CPUs ...
> > >
> > >
> > > On Fri, Jun 28, 2013 at 02:18:37PM -0700, tip-bot for David Vrabel wrote:
> > >> Commit-ID: 7c4c3a0f18ba57ea2a2985034532303d2929902a
> > >> Gitweb: http://git.kernel.org/tip/7c4c3a0f18ba57ea2a2985034532303d2929902a
> > >> Author: David Vrabel <[email protected]>
> > >> AuthorDate: Thu, 27 Jun 2013 11:35:44 +0100
> > >> Committer: Thomas Gleixner <[email protected]>
> > >> CommitDate: Fri, 28 Jun 2013 23:15:06 +0200
> > >>
> > >> hrtimers: Support resuming with two or more CPUs online (but stopped)
> > >>
> > >> hrtimers_resume() only reprograms the timers for the current CPU as it
> > >> assumes that all other CPUs are offline at this point in the resume
> > >> process. If other CPUs are online then their timers will not be
> > >> corrected and they may fire at the wrong time.
> > >>
> > >> When running as a Xen guest, this assumption is not true. Non-boot
> > >> CPUs are only stopped with IRQs disabled instead of offlining them.
> > >> This is a performance optimization as disabling the CPUs would add an
> > >> unacceptable amount of additional downtime during a live migration (>
> > >> 200 ms for a 4 VCPU guest).
> > >>
> > >> hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
> > >> as the other CPUs will be stopped with IRQs disabled. Instead, defer
> > >> the call to the next softirq.
> > >>
> > >> [ tglx: Separated the xen change out ]
> > >>
> > >> Signed-off-by: David Vrabel <[email protected]>
> > >> Cc: Konrad Rzeszutek Wilk <[email protected]>
> > >> Cc: John Stultz <[email protected]>
> > >> Cc: <[email protected]>
> > >> Link: http://lkml.kernel.org/r/[email protected]
> > >> Signed-off-by: Thomas Gleixner <[email protected]>
> > >> ---
> > >> kernel/hrtimer.c | 15 ++++++++++++---
> > >> 1 file changed, 12 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> > >> index fd4b13b..e86827e 100644
> > >> --- a/kernel/hrtimer.c
> > >> +++ b/kernel/hrtimer.c
> > >> @@ -773,15 +773,24 @@ void clock_was_set(void)
> > >>
> > >> /*
> > >> * During resume we might have to reprogram the high resolution timer
> > >> - * interrupt (on the local CPU):
> > >> + * interrupt on all online CPUs. However, all other CPUs will be
> > >> + * stopped with IRQs interrupts disabled so the clock_was_set() call
> > >> + * must be deferred to the softirq.
> > >> + *
> > >> + * The one-shot timer has already been programmed to fire immediately
> > >> + * (see tick_resume_oneshot()) and this interrupt will trigger the
> > >> + * softirq to run early enough to correctly reprogram the timers on
> > >> + * all CPUs.
> > >> */
> > >> void hrtimers_resume(void)
> > >> {
> > >> + struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
> > >> +
> > >> WARN_ONCE(!irqs_disabled(),
> > >> KERN_INFO "hrtimers_resume() called with IRQs enabled!");
> > >>
> > >> - retrigger_next_event(NULL);
> > >> - timerfd_clock_was_set();
> > >> + cpu_base->clock_was_set = 1;
> > >> + __raise_softirq_irqoff(HRTIMER_SOFTIRQ);
> > >> }
> > >>
> > >> static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
> > >> --
> > >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > >> the body of a message to [email protected]
> > >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >> Please read the FAQ at http://www.tux.org/lkml/
> > >>
> > >
> >
> >
>

--
Regards,
Artem

2013-07-05 10:46:16

by Artem Savkov

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

On Fri, Jul 05, 2013 at 12:36:50PM +0200, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, David Vrabel wrote:
> > You failed to CC Artem :(
> Second thought.
> Where is that warning? Can't find it in neither in Linus tree nor in tip
> > > > [ 4363.083800] WARNING: CPU: 0 PID: 3977 at kernel/smp.c:385 smp_call_function_many+0xbd/0x2c0()
> All I see there is:
> WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
> && !oops_in_progress && !early_boot_irqs_disabled);

It's in linux-next.git, not sure where it was merged from
commit 7f4a1d40a6311b7cb52d0f0061cb904d041c6dbc
Author: Chuansheng Liu <[email protected]>
Date: Wed Jul 3 10:20:26 2013 +1000

smp: Give WARN()ing when calling smp_call_function_many()/single() in serving irq


--
Regards,
Artem

2013-07-05 13:26:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

On Fri, 5 Jul 2013, Artem Savkov wrote:
> On Fri, Jul 05, 2013 at 12:36:50PM +0200, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, Thomas Gleixner wrote:
> > > On Fri, 5 Jul 2013, David Vrabel wrote:
> > > You failed to CC Artem :(
> > Second thought.
> > Where is that warning? Can't find it in neither in Linus tree nor in tip
> > > > > [ 4363.083800] WARNING: CPU: 0 PID: 3977 at kernel/smp.c:385 smp_call_function_many+0xbd/0x2c0()
> > All I see there is:
> > WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
> > && !oops_in_progress && !early_boot_irqs_disabled);
>
> It's in linux-next.git, not sure where it was merged from

via akpm.

2013-07-05 13:46:07

by David Vrabel

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

On 05/07/13 11:25, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, David Vrabel wrote:
>
> You failed to CC Artem :(
>
>> On 05/07/13 10:30, Artem Savkov wrote:
>>> This commit brings up a warning about a potential deadlock in
>>> smp_call_function_many() discussed previously:
>>> https://lkml.org/lkml/2013/4/18/546
>>
>> Can we just avoid the wait in clock_was_set()? Something like this?
>>
>> 8<------------------------------------------------------
>> hrtimers: do not wait for other CPUs in clock_was_set()
>>
>> Calling on_each_cpu() and waiting in a softirq causes a WARNing about
>> a potential deadlock.
>>
>> Because hrtimers are per-CPU, it is sufficient to ensure that all
>> other CPUs' timers are reprogrammed as soon as possible and before the
>> next softirq on that CPU. There is no need to wait for this to be
>> complete on all CPUs.

Unfortunately this doesn't look sufficient. on_each_cpu(..., 0) may
still wait for other calls to complete before queuing the calls due to
the use of a single set of per-CPU csd data.

David

2013-07-05 13:51:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [tip:timers/core] hrtimers: Support resuming with two or more CPUs online (but stopped)

On Fri, 5 Jul 2013, David Vrabel wrote:
> On 05/07/13 11:25, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, David Vrabel wrote:
> >
> > You failed to CC Artem :(
> >
> >> On 05/07/13 10:30, Artem Savkov wrote:
> >>> This commit brings up a warning about a potential deadlock in
> >>> smp_call_function_many() discussed previously:
> >>> https://lkml.org/lkml/2013/4/18/546
> >>
> >> Can we just avoid the wait in clock_was_set()? Something like this?
> >>
> >> 8<------------------------------------------------------
> >> hrtimers: do not wait for other CPUs in clock_was_set()
> >>
> >> Calling on_each_cpu() and waiting in a softirq causes a WARNing about
> >> a potential deadlock.
> >>
> >> Because hrtimers are per-CPU, it is sufficient to ensure that all
> >> other CPUs' timers are reprogrammed as soon as possible and before the
> >> next softirq on that CPU. There is no need to wait for this to be
> >> complete on all CPUs.
>
> Unfortunately this doesn't look sufficient. on_each_cpu(..., 0) may
> still wait for other calls to complete before queuing the calls due to
> the use of a single set of per-CPU csd data.

Hrmpf. I'll fix it in the non elegant way :(

Thanks,

tglx