2023-12-18 07:42:24

by Peter Hilber

[permalink] [raw]
Subject: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

RFC v3 updates
--------------

This series implements a driver for a virtio-rtc device conforming to spec
RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
the PTP clock driver already present before.

This patch series depends on the patch series "treewide: Use clocksource id
for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
series on top of mainline.

Overview
--------

This patch series adds the virtio_rtc module, and related bugfixes. The
virtio_rtc module implements a driver compatible with the proposed Virtio
RTC device specification [1]. The Virtio RTC (Real Time Clock) device
provides information about current time. The device can provide different
clocks, e.g. for the UTC or TAI time standards, or for physical time
elapsed since some past epoch. The driver can read the clocks with simple
or more accurate methods. Optionally, the driver can set an alarm.

The series first fixes some bugs in the get_device_system_crosststamp()
interpolation code, which is required for reliable virtio_rtc operation.
Then, add the virtio_rtc implementation.

For the Virtio RTC device, there is currently a proprietary implementation,
which has been used for provisional testing.

PTP clock interface
-------------------

virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
If both the Virtio RTC device and this driver have special support for the
current clocksource, time synchronization programs can use
cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
with single-digit ns precision is possible with a quiescent reference clock
(from the Virtio RTC device). This works even when the Virtio device
response is slow compared to ptp_kvm hypercalls.

The following illustrates a test using PTP_SYS_OFFSET_PRECISE, with
interspersed strace log and chrony [2] refclocks log, on arm64. In the
example, chrony tracks a virtio_rtc PTP clock ("PHCV", /dev/ptp0). The raw
offset between the virtio_rtc clock and CLOCK_REALTIME is 0 to 1 ns. At the
device side, the Virtio RTC device artificially delays both the clock read
request, and the response, by 50 ms. Cross-timestamp interpolation still
works with this delay. chrony also monitors a ptp_kvm clock ("PHCK",
/dev/ptp3) for comparison, which yields a similar offset.

ioctl(5</dev/ptp3>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.000329>
===============================================================================
Date (UTC) Time Refid DP L P Raw offset Cooked offset Disp.
===============================================================================
2023-06-29 18:49:55.595742 PHCK 0 N 0 1.000000e-09 8.717931e-10 5.500e-08
2023-06-29 18:49:55.595742 PHCK - N - - 8.717931e-10 5.500e-08
ioctl(6</dev/ptp0>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.101545>
2023-06-29 18:49:56.147766 PHCV 0 N 0 1.000000e-09 8.801870e-10 5.500e-08
2023-06-29 18:49:56.147766 PHCV - N - - 8.801870e-10 5.500e-08
ioctl(5</dev/ptp3>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.000195>
2023-06-29 18:49:56.202446 PHCK 0 N 0 1.000000e-09 7.364180e-10 5.500e-08
2023-06-29 18:49:56.202446 PHCK - N - - 7.364180e-10 5.500e-08
ioctl(6</dev/ptp0>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.101484>
2023-06-29 18:49:56.754641 PHCV 0 N 0 0.000000e+00 -2.617368e-10 5.500e-08
2023-06-29 18:49:56.754641 PHCV - N - - -2.617368e-10 5.500e-08
ioctl(5</dev/ptp3>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.000270>
2023-06-29 18:49:56.809282 PHCK 0 N 0 1.000000e-09 7.779321e-10 5.500e-08
2023-06-29 18:49:56.809282 PHCK - N - - 7.779321e-10 5.500e-08
ioctl(6</dev/ptp0>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.101510>
2023-06-29 18:49:57.361376 PHCV 0 N 0 0.000000e+00 -2.198794e-10 5.500e-08
2023-06-29 18:49:57.361376 PHCV - N - - -2.198794e-10 5.500e-08

This patch series only adds special support for the Arm Generic Timer
clocksource. At the driver side, it should be easy to support more
clocksources.

Fallback PTP clock interface
----------------------------

Without special support for the current clocksource, time synchronization
programs can still use ioctl PTP_SYS_OFFSET_EXTENDED2 aka
PTP_SYS_OFFSET_EXTENDED. In this case, precision will generally be worse
and will depend on the Virtio device response characteristics.

The following illustrates a test using PTP_SYS_OFFSET_EXTENDED, with
interspersed strace log and chrony refclocks log, on x86-64 (with `ts'
values omitted):

ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
===============================================================================
Date (UTC) Time Refid DP L P Raw offset Cooked offset Disp.
===============================================================================
2023-06-28 14:11:26.697782 PHCV 0 N 0 3.318200e-05 3.450891e-05 4.611e-06
2023-06-28 14:11:26.697782 PHCV - N - - 3.450891e-05 4.611e-06
ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
2023-06-28 14:11:27.208763 PHCV 0 N 0 -3.792800e-05 -4.023965e-05 4.611e-06
2023-06-28 14:11:27.208763 PHCV - N - - -4.023965e-05 4.611e-06
ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
2023-06-28 14:11:27.722818 PHCV 0 N 0 -3.328600e-05 -3.134404e-05 4.611e-06
2023-06-28 14:11:27.722818 PHCV - N - - -3.134404e-05 4.611e-06
ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
2023-06-28 14:11:28.233572 PHCV 0 N 0 -4.966900e-05 -4.584331e-05 4.611e-06
2023-06-28 14:11:28.233572 PHCV - N - - -4.584331e-05 4.611e-06
ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
2023-06-28 14:11:28.742737 PHCV 0 N 0 4.902700e-05 5.361388e-05 4.611e-06
2023-06-28 14:11:28.742737 PHCV - N - - 5.361388e-05 4.611e-06

PTP clock setup
---------------

The following udev rule can be used to get a symlink /dev/ptp_virtio to the
UTC clock:

SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP UTC", SYMLINK += "ptp_virtio"

The following chrony configuration directive can then be added in
/etc/chrony/chrony.conf to synchronize to the Virtio UTC clock:

refclock PHC /dev/ptp_virtio refid PHCV poll -1 dpoll -1

RTC interface
-------------

This patch series adds virtio_rtc as a generic Virtio driver, including
both a PTP clock driver and an RTC class driver.

Feedback is greatly appreciated.

[1] https://lore.kernel.org/virtio-comment/[email protected]/
[2] https://chrony.tuxfamily.org/
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://github.com/OpenSynergy/linux.git virtio-rtc-v3-on-master

v3:

- Update to conform to virtio spec RFC v3 (no significant behavioral
changes).

- Add RTC class driver with alarm according to virtio spec RFC v3.

- For cross-timestamp corner case fix, switch back to v1 style closed
interval test (Thomas Gleixner).

v2:

- Depend on patch series "treewide: Use clocksource id for
get_device_system_crosststamp()" to avoid requiring a clocksource pointer
with get_device_system_crosststamp().

- Assume Arm Generic Timer will use CP15 virtual counter. Drop
arm_arch_timer helper functions (Marc Zyngier).

- Improve cross-timestamp fixes problem description and implementation
(John Stultz).


Peter Hilber (7):
timekeeping: Fix cross-timestamp interpolation on counter wrap
timekeeping: Fix cross-timestamp interpolation corner case decision
timekeeping: Fix cross-timestamp interpolation for non-x86
virtio_rtc: Add module and driver core
virtio_rtc: Add PTP clocks
virtio_rtc: Add Arm Generic Timer cross-timestamping
virtio_rtc: Add RTC class driver

MAINTAINERS | 7 +
drivers/virtio/Kconfig | 62 ++
drivers/virtio/Makefile | 5 +
drivers/virtio/virtio_rtc_arm.c | 22 +
drivers/virtio/virtio_rtc_class.c | 269 +++++
drivers/virtio/virtio_rtc_driver.c | 1357 ++++++++++++++++++++++++++
drivers/virtio/virtio_rtc_internal.h | 122 +++
drivers/virtio/virtio_rtc_ptp.c | 342 +++++++
include/uapi/linux/virtio_rtc.h | 230 +++++
kernel/time/timekeeping.c | 24 +-
10 files changed, 2428 insertions(+), 12 deletions(-)
create mode 100644 drivers/virtio/virtio_rtc_arm.c
create mode 100644 drivers/virtio/virtio_rtc_class.c
create mode 100644 drivers/virtio/virtio_rtc_driver.c
create mode 100644 drivers/virtio/virtio_rtc_internal.h
create mode 100644 drivers/virtio/virtio_rtc_ptp.c
create mode 100644 include/uapi/linux/virtio_rtc.h


base-commit: 2c41b211d72c1eb350c7629a8c85234fef0d12c1
--
2.40.1



2023-12-18 07:42:46

by Peter Hilber

[permalink] [raw]
Subject: [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping

Add PTP_SYS_OFFSET_PRECISE2 support on platforms using the Arm Generic
Timer.

Always report the CP15 virtual counter as the HW counter in use by
arm_arch_timer, since the Linux kernel's usage of the Arm Generic Timer
should always be compatible with this.

Signed-off-by: Peter Hilber <[email protected]>
---

Notes:
v2:

- Depend on prerequisite patch series "treewide: Use clocksource id for
get_device_system_crosststamp()".

- Return clocksource id instead of calling dropped arm_arch_timer helpers.

- Always report the CP15 virtual counter to be in use by arm_arch_timer,
since distinction of Arm physical and virtual counter appears unneeded
after discussion with Marc Zyngier.

drivers/virtio/Kconfig | 13 +++++++++++++
drivers/virtio/Makefile | 1 +
drivers/virtio/virtio_rtc_arm.c | 22 ++++++++++++++++++++++
3 files changed, 36 insertions(+)
create mode 100644 drivers/virtio/virtio_rtc_arm.c

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 8542b2f20201..d35c728778d2 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -205,6 +205,19 @@ config VIRTIO_RTC_PTP

If unsure, say Y.

+config VIRTIO_RTC_ARM
+ bool "Virtio RTC cross-timestamping using Arm Generic Timer"
+ default y
+ depends on VIRTIO_RTC_PTP && ARM_ARCH_TIMER
+ help
+ This enables Virtio RTC cross-timestamping using the Arm Generic Timer.
+ It only has an effect if the Virtio RTC device also supports this. The
+ cross-timestamp is available through the PTP clock driver precise
+ cross-timestamp ioctl (PTP_SYS_OFFSET_PRECISE2 or
+ PTP_SYS_OFFSET_PRECISE).
+
+ If unsure, say Y.
+
endif # VIRTIO_RTC

endif # VIRTIO_MENU
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 4d48cbcae6bb..781dff9f8822 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_VIRTIO_DMA_SHARED_BUFFER) += virtio_dma_buf.o
obj-$(CONFIG_VIRTIO_RTC) += virtio_rtc.o
virtio_rtc-y := virtio_rtc_driver.o
virtio_rtc-$(CONFIG_VIRTIO_RTC_PTP) += virtio_rtc_ptp.o
+virtio_rtc-$(CONFIG_VIRTIO_RTC_ARM) += virtio_rtc_arm.o
diff --git a/drivers/virtio/virtio_rtc_arm.c b/drivers/virtio/virtio_rtc_arm.c
new file mode 100644
index 000000000000..5185b130b3f1
--- /dev/null
+++ b/drivers/virtio/virtio_rtc_arm.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Provides cross-timestamp params for Arm.
+ *
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ */
+
+#include <linux/clocksource_ids.h>
+
+#include <uapi/linux/virtio_rtc.h>
+
+#include "virtio_rtc_internal.h"
+
+/* see header for doc */
+
+int viortc_hw_xtstamp_params(u16 *hw_counter, enum clocksource_ids *cs_id)
+{
+ *hw_counter = VIRTIO_RTC_COUNTER_ARM_VIRT;
+ *cs_id = CSID_ARM_ARCH_COUNTER;
+
+ return 0;
+}
--
2.40.1


2023-12-18 07:43:09

by Peter Hilber

[permalink] [raw]
Subject: [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks

Expose the virtio_rtc clocks as PTP clocks to userspace, similar to
ptp_kvm. virtio_rtc can expose multiple clocks, e.g. a UTC clock and a
monotonic clock. Userspace should distinguish different clocks through the
name assigned by the driver. A udev rule such as the following can be used
to get a symlink /dev/ptp_virtio to the UTC clock:

SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP UTC", SYMLINK += "ptp_virtio"

The preferred PTP clock reading method is ioctl PTP_SYS_OFFSET_PRECISE2,
through the ptp_clock_info.getcrosststamp() op. For now,
PTP_SYS_OFFSET_PRECISE2 will return -EOPNOTSUPP through a weak function.
PTP_SYS_OFFSET_PRECISE2 requires cross-timestamping support for specific
clocksources, which will be added in the following. If the clocksource
specific code is enabled, check that the Virtio RTC device supports the
respective HW counter before obtaining an actual cross-timestamp from the
Virtio device.

The Virtio RTC device response time may be higher than the timekeeper
seqcount increment interval. Therefore, obtain the cross-timestamp before
calling get_device_system_crosststamp().

As a fallback, support the ioctl PTP_SYS_OFFSET_EXTENDED2 for all
platforms.

Assume that concurrency issues during PTP clock removal are avoided by the
posix_clock framework.

Kconfig recursive dependencies prevent virtio_rtc from implicitly enabling
PTP_1588_CLOCK, therefore just warn the user if PTP_1588_CLOCK is not
available. Since virtio_rtc should in the future also expose clocks as RTC
class devices, do not have VIRTIO_RTC depend on PTP_1588_CLOCK.

Signed-off-by: Peter Hilber <[email protected]>
---

Notes:
v3:

- don't guard cross-timestamping with feature bit (spec v3)

- reduce clock id to 16 bits (spec v3)

v2:

- Depend on prerequisite patch series "treewide: Use clocksource id for
get_device_system_crosststamp()".

- Check clocksource id before sending crosststamp message to device.

- Do not support multiple hardware counters at runtime any more, since
distinction of Arm physical and virtual counter appears unneeded after
discussion with Marc Zyngier.

drivers/virtio/Kconfig | 23 +-
drivers/virtio/Makefile | 1 +
drivers/virtio/virtio_rtc_driver.c | 131 +++++++++-
drivers/virtio/virtio_rtc_internal.h | 46 ++++
drivers/virtio/virtio_rtc_ptp.c | 342 +++++++++++++++++++++++++++
5 files changed, 539 insertions(+), 4 deletions(-)
create mode 100644 drivers/virtio/virtio_rtc_ptp.c

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 834dd14bc070..8542b2f20201 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -179,11 +179,32 @@ config VIRTIO_RTC
depends on PTP_1588_CLOCK_OPTIONAL
help
This driver provides current time from a Virtio RTC device. The driver
- provides the time through one or more clocks.
+ provides the time through one or more clocks. The Virtio RTC PTP
+ clocks must be enabled to expose the clocks to userspace.

To compile this code as a module, choose M here: the module will be
called virtio_rtc.

If unsure, say M.

+if VIRTIO_RTC
+
+comment "WARNING: Consider enabling VIRTIO_RTC_PTP."
+ depends on !VIRTIO_RTC_PTP
+
+comment "Enable PTP_1588_CLOCK in order to enable VIRTIO_RTC_PTP."
+ depends on PTP_1588_CLOCK=n
+
+config VIRTIO_RTC_PTP
+ bool "Virtio RTC PTP clocks"
+ default y
+ depends on PTP_1588_CLOCK
+ help
+ This exposes any Virtio RTC clocks as PTP Hardware Clocks (PHCs) to
+ userspace.
+
+ If unsure, say Y.
+
+endif # VIRTIO_RTC
+
endif # VIRTIO_MENU
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index f760414ed6ab..4d48cbcae6bb 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_VIRTIO_MEM) += virtio_mem.o
obj-$(CONFIG_VIRTIO_DMA_SHARED_BUFFER) += virtio_dma_buf.o
obj-$(CONFIG_VIRTIO_RTC) += virtio_rtc.o
virtio_rtc-y := virtio_rtc_driver.o
+virtio_rtc-$(CONFIG_VIRTIO_RTC_PTP) += virtio_rtc_ptp.o
diff --git a/drivers/virtio/virtio_rtc_driver.c b/drivers/virtio/virtio_rtc_driver.c
index ef1ea14b3bec..c331b7383285 100644
--- a/drivers/virtio/virtio_rtc_driver.c
+++ b/drivers/virtio/virtio_rtc_driver.c
@@ -35,11 +35,16 @@ struct viortc_vq {
* struct viortc_dev - virtio_rtc device data
* @vdev: virtio device
* @vqs: virtqueues
+ * @clocks_to_unregister: Clock references, which are only used during device
+ * removal.
+ * For other uses, there would be a race between device
+ * creation and setting the pointers here.
* @num_clocks: # of virtio_rtc clocks
*/
struct viortc_dev {
struct virtio_device *vdev;
struct viortc_vq vqs[VIORTC_MAX_NR_QUEUES];
+ struct viortc_ptp_clock **clocks_to_unregister;
u16 num_clocks;
};

@@ -626,6 +631,109 @@ int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
* init, deinit
*/

+/**
+ * viortc_init_ptp_clock() - init and register PTP clock
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @clock_type: virtio_rtc clock type
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_init_ptp_clock(struct viortc_dev *viortc, u16 vio_clk_id,
+ u16 clock_type)
+{
+ struct device *dev = &viortc->vdev->dev;
+ char ptp_clock_name[PTP_CLOCK_NAME_LEN];
+ const char *type_name;
+ /* fit prefix + u16 in decimal */
+ char type_name_buf[5 + 5 + 1];
+ struct viortc_ptp_clock *vio_ptp;
+
+ switch (clock_type) {
+ case VIRTIO_RTC_CLOCK_UTC:
+ type_name = "UTC";
+ break;
+ case VIRTIO_RTC_CLOCK_TAI:
+ type_name = "TAI";
+ break;
+ case VIRTIO_RTC_CLOCK_MONO:
+ type_name = "monotonic";
+ break;
+ default:
+ snprintf(type_name_buf, sizeof(type_name_buf), "type %hu",
+ clock_type);
+ type_name = type_name_buf;
+ }
+
+ snprintf(ptp_clock_name, PTP_CLOCK_NAME_LEN, "Virtio PTP %s",
+ type_name);
+
+ vio_ptp = viortc_ptp_register(viortc, dev, vio_clk_id, ptp_clock_name);
+ if (IS_ERR(vio_ptp)) {
+ dev_err(dev, "failed to register PTP clock '%s'\n",
+ ptp_clock_name);
+ return PTR_ERR(vio_ptp);
+ }
+
+ viortc->clocks_to_unregister[vio_clk_id] = vio_ptp;
+
+ if (!vio_ptp)
+ dev_warn(dev, "clock %d is not exposed to userspace\n",
+ vio_clk_id);
+
+ return 0;
+}
+
+/**
+ * viortc_init_clock() - init local representation of virtio_rtc clock
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ *
+ * Initializes PHC and/or RTC class device to represent virtio_rtc clock.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_init_clock(struct viortc_dev *viortc, u16 vio_clk_id)
+{
+ u16 clock_type;
+ int ret;
+
+ ret = viortc_clock_cap(viortc, vio_clk_id, &clock_type);
+ if (ret)
+ return ret;
+
+ if (IS_ENABLED(CONFIG_VIRTIO_RTC_PTP)) {
+ ret = viortc_init_ptp_clock(viortc, vio_clk_id, clock_type);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+/**
+ * viortc_clocks_exit() - unregister PHCs
+ * @viortc: device data
+ */
+static void viortc_clocks_exit(struct viortc_dev *viortc)
+{
+ unsigned int i;
+ struct viortc_ptp_clock *vio_ptp;
+
+ for (i = 0; i < viortc->num_clocks; i++) {
+ vio_ptp = viortc->clocks_to_unregister[i];
+
+ if (!vio_ptp)
+ continue;
+
+ viortc->clocks_to_unregister[i] = NULL;
+
+ WARN_ON(viortc_ptp_unregister(vio_ptp, &viortc->vdev->dev));
+ }
+}
+
/**
* viortc_clocks_init() - init local representations of virtio_rtc clocks
* @viortc: device data
@@ -637,6 +745,7 @@ static int viortc_clocks_init(struct viortc_dev *viortc)
{
int ret;
u16 num_clocks;
+ unsigned int i;

ret = viortc_cfg(viortc, &num_clocks);
if (ret)
@@ -649,8 +758,22 @@ static int viortc_clocks_init(struct viortc_dev *viortc)

viortc->num_clocks = num_clocks;

- /* In the future, PTP clocks will be initialized here. */
- (void)viortc_clock_cap;
+ viortc->clocks_to_unregister =
+ devm_kcalloc(&viortc->vdev->dev, num_clocks,
+ sizeof(*viortc->clocks_to_unregister), GFP_KERNEL);
+ if (!viortc->clocks_to_unregister)
+ return -ENOMEM;
+
+ for (i = 0; i < num_clocks; i++) {
+ ret = viortc_init_clock(viortc, i);
+ if (ret)
+ goto err_free_clocks;
+ }
+
+ return 0;
+
+err_free_clocks:
+ viortc_clocks_exit(viortc);

return ret;
}
@@ -734,7 +857,9 @@ static int viortc_probe(struct virtio_device *vdev)
*/
static void viortc_remove(struct virtio_device *vdev)
{
- /* In the future, PTP clocks will be deinitialized here. */
+ struct viortc_dev *viortc = vdev->priv;
+
+ viortc_clocks_exit(viortc);

virtio_reset_device(vdev);
vdev->config->del_vqs(vdev);
diff --git a/drivers/virtio/virtio_rtc_internal.h b/drivers/virtio/virtio_rtc_internal.h
index 9267661b8030..0dedced4aeae 100644
--- a/drivers/virtio/virtio_rtc_internal.h
+++ b/drivers/virtio/virtio_rtc_internal.h
@@ -9,6 +9,7 @@
#define _VIRTIO_RTC_INTERNAL_H_

#include <linux/types.h>
+#include <linux/ptp_clock_kernel.h>

/* driver core IFs */

@@ -20,4 +21,49 @@ int viortc_read_cross(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
bool *supported);

+/* PTP IFs */
+
+struct viortc_ptp_clock;
+
+#if IS_ENABLED(CONFIG_VIRTIO_RTC_PTP)
+
+struct viortc_ptp_clock *viortc_ptp_register(struct viortc_dev *viortc,
+ struct device *parent_dev,
+ u16 vio_clk_id,
+ const char *ptp_clock_name);
+int viortc_ptp_unregister(struct viortc_ptp_clock *vio_ptp,
+ struct device *parent_dev);
+
+#else
+
+static inline struct viortc_ptp_clock *
+viortc_ptp_register(struct viortc_dev *viortc, struct device *parent_dev,
+ u16 vio_clk_id, const char *ptp_clock_name)
+{
+ return NULL;
+}
+
+static inline int viortc_ptp_unregister(struct viortc_ptp_clock *vio_ptp,
+ struct device *parent_dev)
+{
+ return -ENODEV;
+}
+
+#endif
+
+/* HW counter IFs */
+
+/**
+ * viortc_hw_xtstamp_params() - get HW-specific xtstamp params
+ * @hw_counter: virtio_rtc HW counter type
+ * @cs_id: clocksource id corresponding to hw_counter
+ *
+ * Gets the HW-specific xtstamp params. Returns an error if the driver cannot
+ * support xtstamp.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_hw_xtstamp_params(u16 *hw_counter, enum clocksource_ids *cs_id);
+
#endif /* _VIRTIO_RTC_INTERNAL_H_ */
diff --git a/drivers/virtio/virtio_rtc_ptp.c b/drivers/virtio/virtio_rtc_ptp.c
new file mode 100644
index 000000000000..4592cd070772
--- /dev/null
+++ b/drivers/virtio/virtio_rtc_ptp.c
@@ -0,0 +1,342 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Expose virtio_rtc clocks as PTP clocks.
+ *
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ *
+ * Derived from ptp_kvm_common.c, virtual PTP 1588 clock for use with KVM
+ * guests.
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ */
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/ptp_clock_kernel.h>
+
+#include <uapi/linux/virtio_rtc.h>
+
+#include "virtio_rtc_internal.h"
+
+/**
+ * struct viortc_ptp_clock - PTP clock abstraction
+ * @ptp_clock: PTP clock handle
+ * @viortc: virtio_rtc device data
+ * @ptp_info: PTP clock description
+ * @vio_clk_id: virtio_rtc clock id
+ * @have_cross: device supports crosststamp with available HW counter
+ */
+struct viortc_ptp_clock {
+ struct ptp_clock *ptp_clock;
+ struct viortc_dev *viortc;
+ struct ptp_clock_info ptp_info;
+ u16 vio_clk_id;
+ bool have_cross;
+};
+
+/**
+ * struct viortc_ptp_cross_ctx - context for get_device_system_crosststamp()
+ * @device_time: device clock reading
+ * @system_counterval: HW counter value at device_time
+ *
+ * Provides the already obtained crosststamp to get_device_system_crosststamp().
+ */
+struct viortc_ptp_cross_ctx {
+ ktime_t device_time;
+ struct system_counterval_t system_counterval;
+};
+
+/* Weak function in case get_device_system_crosststamp() is not supported */
+int __weak viortc_hw_xtstamp_params(u16 *hw_counter,
+ enum clocksource_ids *cs_id)
+{
+ return -EOPNOTSUPP;
+}
+
+/**
+ * viortc_ptp_get_time_fn() - callback for get_device_system_crosststamp()
+ * @device_time: device clock reading
+ * @system_counterval: HW counter value at device_time
+ * @ctx: context with already obtained crosststamp
+ *
+ * Return: zero (success).
+ */
+static int viortc_ptp_get_time_fn(ktime_t *device_time,
+ struct system_counterval_t *system_counterval,
+ void *ctx)
+{
+ struct viortc_ptp_cross_ctx *vio_ctx = ctx;
+
+ *device_time = vio_ctx->device_time;
+ *system_counterval = vio_ctx->system_counterval;
+
+ return 0;
+}
+
+/**
+ * viortc_ptp_do_xtstamp() - get crosststamp from device
+ * @vio_ptp: virtio_rtc PTP clock
+ * @ctx: context for get_device_system_crosststamp()
+ * @hw_counter: virtio_rtc HW counter type
+ * @cs_id: clocksource id corresponding to hw_counter
+ *
+ * Reads HW-specific crosststamp from device.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_ptp_do_xtstamp(struct viortc_ptp_clock *vio_ptp,
+ struct viortc_ptp_cross_ctx *ctx,
+ u16 hw_counter, enum clocksource_ids cs_id)
+{
+ u64 ns;
+ u64 max_ns;
+ int ret;
+
+ ctx->system_counterval.cs_id = cs_id;
+
+ ret = viortc_read_cross(vio_ptp->viortc, vio_ptp->vio_clk_id,
+ hw_counter, &ns,
+ &ctx->system_counterval.cycles);
+ if (ret)
+ return ret;
+
+ max_ns = (u64)ktime_to_ns(KTIME_MAX);
+ if (ns > max_ns)
+ return -EINVAL;
+
+ ctx->device_time = ns_to_ktime(ns);
+
+ return 0;
+}
+
+/*
+ * PTP clock operations
+ */
+
+/**
+ * viortc_ptp_getcrosststamp() - PTP clock getcrosststamp op
+ * @vio_ptp: virtio_rtc PTP clock
+ * @xtstamp: crosststamp
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_ptp_getcrosststamp(struct ptp_clock_info *ptp,
+ struct system_device_crosststamp *xtstamp)
+{
+ struct viortc_ptp_clock *vio_ptp =
+ container_of(ptp, struct viortc_ptp_clock, ptp_info);
+ int ret;
+ struct system_time_snapshot history_begin;
+ struct viortc_ptp_cross_ctx ctx;
+ enum clocksource_ids cs_id;
+ u16 hw_counter;
+
+ if (!vio_ptp->have_cross)
+ return -EOPNOTSUPP;
+
+ ret = viortc_hw_xtstamp_params(&hw_counter, &cs_id);
+ if (ret)
+ return ret;
+
+ ktime_get_snapshot(&history_begin);
+ if (history_begin.cs_id != cs_id)
+ return -EOPNOTSUPP;
+
+ /*
+ * Getting the timestamp can take many milliseconds with a slow Virtio
+ * device. This is too long for viortc_ptp_get_time_fn() passed to
+ * get_device_system_crosststamp(), which has to usually return before
+ * the timekeeper seqcount increases (every tick or so).
+ *
+ * So, get the actual cross-timestamp first.
+ */
+ ret = viortc_ptp_do_xtstamp(vio_ptp, &ctx, hw_counter, cs_id);
+ if (ret)
+ return ret;
+
+ ret = get_device_system_crosststamp(viortc_ptp_get_time_fn, &ctx,
+ &history_begin, xtstamp);
+ if (ret)
+ pr_debug("%s: get_device_system_crosststamp() returned %d\n",
+ __func__, ret);
+
+ return ret;
+}
+
+/** viortc_ptp_adjfine() - unsupported PTP clock adjfine op */
+static int viortc_ptp_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
+{
+ return -EOPNOTSUPP;
+}
+
+/** viortc_ptp_adjtime() - unsupported PTP clock adjtime op */
+static int viortc_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
+{
+ return -EOPNOTSUPP;
+}
+
+/** viortc_ptp_settime64() - unsupported PTP clock settime64 op */
+static int viortc_ptp_settime64(struct ptp_clock_info *ptp,
+ const struct timespec64 *ts)
+{
+ return -EOPNOTSUPP;
+}
+
+/**
+ * viortc_ptp_gettimex64() - PTP clock gettimex64 op
+ *
+ * Context: Process context.
+ */
+static int viortc_ptp_gettimex64(struct ptp_clock_info *ptp,
+ struct timespec64 *ts,
+ struct ptp_system_timestamp *sts)
+{
+ struct viortc_ptp_clock *vio_ptp =
+ container_of(ptp, struct viortc_ptp_clock, ptp_info);
+ u64 ns;
+ int ret;
+
+ ptp_read_system_prets(sts);
+ ret = viortc_read(vio_ptp->viortc, vio_ptp->vio_clk_id, &ns);
+ ptp_read_system_postts(sts);
+
+ if (ret)
+ return ret;
+
+ if (ns > (u64)S64_MAX)
+ return -EINVAL;
+
+ *ts = ns_to_timespec64((s64)ns);
+
+ return 0;
+}
+
+/** viortc_ptp_enable() - unsupported PTP clock enable op */
+static int viortc_ptp_enable(struct ptp_clock_info *ptp,
+ struct ptp_clock_request *rq, int on)
+{
+ return -EOPNOTSUPP;
+}
+
+/**
+ * viortc_ptp_info_template - ptp_clock_info template
+ *
+ * The .name member will be set for individual virtio_rtc PTP clocks.
+ */
+static const struct ptp_clock_info viortc_ptp_info_template = {
+ .owner = THIS_MODULE,
+ /* .name is set according to clock type */
+ .adjfine = viortc_ptp_adjfine,
+ .adjtime = viortc_ptp_adjtime,
+ .gettimex64 = viortc_ptp_gettimex64,
+ .settime64 = viortc_ptp_settime64,
+ .enable = viortc_ptp_enable,
+ .getcrosststamp = viortc_ptp_getcrosststamp,
+};
+
+/**
+ * viortc_ptp_unregister() - PTP clock unregistering wrapper
+ * @vio_ptp: virtio_rtc PTP clock
+ * @parent_dev: parent device of PTP clock
+ *
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_ptp_unregister(struct viortc_ptp_clock *vio_ptp,
+ struct device *parent_dev)
+{
+ int ret = ptp_clock_unregister(vio_ptp->ptp_clock);
+
+ if (!ret)
+ devm_kfree(parent_dev, vio_ptp);
+
+ return ret;
+}
+
+/**
+ * viortc_ptp_get_cross_cap() - get xtstamp support info from device
+ * @viortc: virtio_rtc device data
+ * @vio_ptp: virtio_rtc PTP clock abstraction
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_ptp_get_cross_cap(struct viortc_dev *viortc,
+ struct viortc_ptp_clock *vio_ptp)
+{
+ int ret;
+ enum clocksource_ids cs_id;
+ u16 hw_counter;
+ bool xtstamp_supported;
+
+ ret = viortc_hw_xtstamp_params(&hw_counter, &cs_id);
+ if (ret) {
+ vio_ptp->have_cross = false;
+ return 0;
+ }
+
+ ret = viortc_cross_cap(viortc, vio_ptp->vio_clk_id, hw_counter,
+ &xtstamp_supported);
+ if (ret)
+ return ret;
+
+ vio_ptp->have_cross = xtstamp_supported;
+
+ return 0;
+}
+
+/**
+ * viortc_ptp_register() - prepare and register PTP clock
+ * @viortc: virtio_rtc device data
+ * @parent_dev: parent device for PTP clock
+ * @vio_clk_id: id of virtio_rtc clock which backs PTP clock
+ * @ptp_clock_name: PTP clock name
+ *
+ * Context: Process context.
+ * Return: Pointer on success, ERR_PTR() otherwise; NULL if PTP clock support
+ * not available.
+ */
+struct viortc_ptp_clock *viortc_ptp_register(struct viortc_dev *viortc,
+ struct device *parent_dev,
+ u16 vio_clk_id,
+ const char *ptp_clock_name)
+{
+ struct viortc_ptp_clock *vio_ptp;
+ struct ptp_clock *ptp_clock;
+ ssize_t len;
+ int ret;
+
+ vio_ptp = devm_kzalloc(parent_dev, sizeof(*vio_ptp), GFP_KERNEL);
+ if (!vio_ptp)
+ return ERR_PTR(-ENOMEM);
+
+ vio_ptp->viortc = viortc;
+ vio_ptp->vio_clk_id = vio_clk_id;
+ vio_ptp->ptp_info = viortc_ptp_info_template;
+ len = strscpy(vio_ptp->ptp_info.name, ptp_clock_name,
+ sizeof(vio_ptp->ptp_info.name));
+ if (len < 0) {
+ ret = len;
+ goto err_free_dev;
+ }
+
+ ret = viortc_ptp_get_cross_cap(viortc, vio_ptp);
+ if (ret)
+ goto err_free_dev;
+
+ ptp_clock = ptp_clock_register(&vio_ptp->ptp_info, parent_dev);
+ if (IS_ERR(ptp_clock))
+ goto err_on_register;
+
+ vio_ptp->ptp_clock = ptp_clock;
+
+ return vio_ptp;
+
+err_on_register:
+ ret = PTR_ERR(ptp_clock);
+
+err_free_dev:
+ devm_kfree(parent_dev, vio_ptp);
+ return ERR_PTR(ret);
+}
--
2.40.1


2023-12-18 07:43:57

by Peter Hilber

[permalink] [raw]
Subject: [RFC PATCH v3 2/7] timekeeping: Fix cross-timestamp interpolation corner case decision

The cycle_between() helper checks if parameter test is in the open interval
(before, after). Colloquially speaking, this also applies to the counter
wrap-around special case before > after. get_device_system_crosststamp()
currently uses cycle_between() at the first call site to decide whether to
interpolate for older counter readings.

get_device_system_crosststamp() has the following problem with
cycle_between() testing against an open interval: Assume that, by chance,
cycles == tk->tkr_mono.cycle_last (in the following, "cycle_last" for
brevity). Then, cycle_between() at the first call site, with effective
argument values cycle_between(cycle_last, cycles, now), returns false,
enabling interpolation. During interpolation,
get_device_system_crosststamp() will then call cycle_between() at the
second call site (if a history_begin was supplied). The effective argument
values are cycle_between(history_begin->cycles, cycles, cycles), since
system_counterval.cycles == interval_start == cycles, per the assumption.
Due to the test against the open interval, cycle_between() returns false
again. This causes get_device_system_crosststamp() to return -EINVAL.

This failure should be avoided, since get_device_system_crosststamp() works
both when cycles follows cycle_last (no interpolation), and when cycles
precedes cycle_last (interpolation). For the case cycles == cycle_last,
interpolation is actually unneeded.

Fix this by changing cycle_between() into timestamp_in_interval(), which
now checks against the closed interval, rather than the open interval.

This changes the get_device_system_crosststamp() behavior for three corner
cases:

1. Bypass interpolation in the case cycles == tk->tkr_mono.cycle_last,
fixing the problem described above.

2. At the first timestamp_in_interval() call site, cycles == now no longer
causes failure.

3. At the second timestamp_in_interval() call site, history_begin->cycles
== system_counterval.cycles no longer causes failure.
adjust_historical_crosststamp() also works for this corner case,
where partial_history_cycles == total_history_cycles.

These behavioral changes should not cause any problems.

Fixes: 2c756feb18d9 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <[email protected]>
---

Notes:
v3:

- switch back to v1 style closed interval test (Thomas Gleixner)
- document effect of closed interval test on corner cases
- do not carry "Acked-by: John Stultz <[email protected]>" due to above
changes

v2:

- try to clarify problem description (John Stultz)
- simplify fix

kernel/time/timekeeping.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 08a3d0052baa..24ffd681aa23 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1180,13 +1180,15 @@ static int adjust_historical_crosststamp(struct system_time_snapshot *history,
}

/*
- * cycle_between - true if test occurs chronologically between before and after
+ * timestamp_in_interval - true if ts is chronologically in [start, end]
+ *
+ * True if ts occurs chronologically at or after start, and before or at end.
*/
-static bool cycle_between(u64 before, u64 test, u64 after)
+static bool timestamp_in_interval(u64 start, u64 end, u64 ts)
{
- if (test > before && test < after)
+ if (ts >= start && ts <= end)
return true;
- if (before > after && (test > before || test < after))
+ if (start > end && (ts >= start || ts <= end))
return true;
return false;
}
@@ -1247,7 +1249,7 @@ int get_device_system_crosststamp(int (*get_time_fn)
*/
now = tk_clock_read(&tk->tkr_mono);
interval_start = tk->tkr_mono.cycle_last;
- if (!cycle_between(interval_start, cycles, now)) {
+ if (!timestamp_in_interval(interval_start, now, cycles)) {
clock_was_set_seq = tk->clock_was_set_seq;
cs_was_changed_seq = tk->cs_was_changed_seq;
cycles = interval_start;
@@ -1278,13 +1280,13 @@ int get_device_system_crosststamp(int (*get_time_fn)
bool discontinuity;

/*
- * Check that the counter value occurs after the provided
+ * Check that the counter value is not before the provided
* history reference and that the history doesn't cross a
* clocksource change
*/
if (!history_begin ||
- !cycle_between(history_begin->cycles,
- system_counterval.cycles, cycles) ||
+ !timestamp_in_interval(history_begin->cycles,
+ cycles, system_counterval.cycles) ||
history_begin->cs_was_changed_seq != cs_was_changed_seq)
return -EINVAL;
partial_history_cycles = cycles - system_counterval.cycles;
--
2.40.1


2023-12-18 07:45:36

by Peter Hilber

[permalink] [raw]
Subject: [RFC PATCH v3 4/7] virtio_rtc: Add module and driver core

Add the virtio_rtc module and driver core. The virtio_rtc module implements
a driver compatible with the proposed Virtio RTC device specification.
The Virtio RTC (Real Time Clock) device provides information about current
time. The device can provide different clocks, e.g. for the UTC or TAI time
standards, or for physical time elapsed since some past epoch. The driver
can read the clocks with simple or more accurate methods.

Implement the core, which interacts with the Virtio RTC device. Apart from
this, the core does not expose functionality outside of the virtio_rtc
module. A follow-up patch will expose PTP clocks.

Provide synchronous messaging, which is enough for the expected time
synchronization use cases through PTP clocks (similar to ptp_kvm) or RTC
Class driver.

Signed-off-by: Peter Hilber <[email protected]>
---

Notes:
v3:

- merge readq and controlq into a single requestq (spec v3)

- don't guard cross-timestamping with feature bit (spec v3)

- pad message headers to 64 bits (spec v3)

- reduce clock id to 16 bits (spec v3)

- change Virtio status codes (spec v3)

- use 'VIRTIO_RTC_REQ_' prefix for request messages (spec v3)

MAINTAINERS | 7 +
drivers/virtio/Kconfig | 13 +
drivers/virtio/Makefile | 2 +
drivers/virtio/virtio_rtc_driver.c | 761 +++++++++++++++++++++++++++
drivers/virtio/virtio_rtc_internal.h | 23 +
include/uapi/linux/virtio_rtc.h | 144 +++++
6 files changed, 950 insertions(+)
create mode 100644 drivers/virtio/virtio_rtc_driver.c
create mode 100644 drivers/virtio/virtio_rtc_internal.h
create mode 100644 include/uapi/linux/virtio_rtc.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b589218605b4..0c157a19bbfd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23200,6 +23200,13 @@ S: Maintained
F: drivers/nvdimm/nd_virtio.c
F: drivers/nvdimm/virtio_pmem.c

+VIRTIO RTC DRIVER
+M: Peter Hilber <[email protected]>
+L: [email protected]
+S: Maintained
+F: drivers/virtio/virtio_rtc_*
+F: include/uapi/linux/virtio_rtc.h
+
VIRTIO SOUND DRIVER
M: Anton Yakovlev <[email protected]>
M: "Michael S. Tsirkin" <[email protected]>
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 0a53a61231c2..834dd14bc070 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -173,4 +173,17 @@ config VIRTIO_DMA_SHARED_BUFFER
This option adds a flavor of dma buffers that are backed by
virtio resources.

+config VIRTIO_RTC
+ tristate "Virtio RTC driver"
+ depends on VIRTIO
+ depends on PTP_1588_CLOCK_OPTIONAL
+ help
+ This driver provides current time from a Virtio RTC device. The driver
+ provides the time through one or more clocks.
+
+ To compile this code as a module, choose M here: the module will be
+ called virtio_rtc.
+
+ If unsure, say M.
+
endif # VIRTIO_MENU
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 8e98d24917cc..f760414ed6ab 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -12,3 +12,5 @@ obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
obj-$(CONFIG_VIRTIO_VDPA) += virtio_vdpa.o
obj-$(CONFIG_VIRTIO_MEM) += virtio_mem.o
obj-$(CONFIG_VIRTIO_DMA_SHARED_BUFFER) += virtio_dma_buf.o
+obj-$(CONFIG_VIRTIO_RTC) += virtio_rtc.o
+virtio_rtc-y := virtio_rtc_driver.o
diff --git a/drivers/virtio/virtio_rtc_driver.c b/drivers/virtio/virtio_rtc_driver.c
new file mode 100644
index 000000000000..ef1ea14b3bec
--- /dev/null
+++ b/drivers/virtio/virtio_rtc_driver.c
@@ -0,0 +1,761 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * virtio_rtc driver core
+ *
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ */
+
+#include <linux/completion.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/module.h>
+
+#include <uapi/linux/virtio_rtc.h>
+
+#include "virtio_rtc_internal.h"
+
+/* virtqueue order */
+enum {
+ VIORTC_REQUESTQ,
+ VIORTC_MAX_NR_QUEUES,
+};
+
+/**
+ * struct viortc_vq - virtqueue abstraction
+ * @vq: virtqueue
+ * @lock: protects access to vq
+ */
+struct viortc_vq {
+ struct virtqueue *vq;
+ spinlock_t lock;
+};
+
+/**
+ * struct viortc_dev - virtio_rtc device data
+ * @vdev: virtio device
+ * @vqs: virtqueues
+ * @num_clocks: # of virtio_rtc clocks
+ */
+struct viortc_dev {
+ struct virtio_device *vdev;
+ struct viortc_vq vqs[VIORTC_MAX_NR_QUEUES];
+ u16 num_clocks;
+};
+
+/**
+ * struct viortc_msg - Message requested by driver, responded by device.
+ * @viortc: device data
+ * @req: request buffer
+ * @resp: response buffer
+ * @responded: vqueue callback signals response reception
+ * @refcnt: Message reference count, message and buffers will be deallocated
+ * once 0. refcnt is decremented in the vqueue callback and in the
+ * thread waiting on the responded completion.
+ * If a message response wait function times out, the message will be
+ * freed upon late reception (refcnt will reach 0 in the callback), or
+ * device removal.
+ * @req_size: size of request in bytes
+ * @resp_cap: maximum size of response in bytes
+ * @resp_actual_size: actual size of response
+ */
+struct viortc_msg {
+ struct viortc_dev *viortc;
+ void *req;
+ void *resp;
+ struct completion responded;
+ refcount_t refcnt;
+ unsigned int req_size;
+ unsigned int resp_cap;
+ unsigned int resp_actual_size;
+};
+
+/**
+ * viortc_msg_init() - Allocate and initialize requestq message.
+ * @viortc: device data
+ * @msg_type: virtio_rtc message type
+ * @req_size: size of request buffer to be allocated
+ * @resp_cap: size of response buffer to be allocated
+ *
+ * Initializes the message refcnt to 2. The refcnt will be decremented once in
+ * the virtqueue callback, and once in the thread waiting on the message (on
+ * completion or timeout).
+ *
+ * Context: Process context.
+ * Return: non-NULL on success.
+ */
+static struct viortc_msg *viortc_msg_init(struct viortc_dev *viortc,
+ u16 msg_type, unsigned int req_size,
+ unsigned int resp_cap)
+{
+ struct viortc_msg *msg;
+ struct device *dev = &viortc->vdev->dev;
+ struct virtio_rtc_req_head *req_head;
+
+ msg = devm_kzalloc(dev, sizeof(*msg), GFP_KERNEL);
+ if (!msg)
+ return NULL;
+
+ init_completion(&msg->responded);
+
+ msg->req = devm_kzalloc(dev, req_size, GFP_KERNEL);
+ if (!msg->req)
+ goto err_free_msg;
+
+ req_head = msg->req;
+
+ msg->resp = devm_kzalloc(dev, resp_cap, GFP_KERNEL);
+ if (!msg->resp)
+ goto err_free_msg_req;
+
+ msg->viortc = viortc;
+ msg->req_size = req_size;
+ msg->resp_cap = resp_cap;
+
+ refcount_set(&msg->refcnt, 2);
+
+ req_head->msg_type = virtio_cpu_to_le(msg_type, req_head->msg_type);
+
+ return msg;
+
+err_free_msg_req:
+ devm_kfree(dev, msg->req);
+
+err_free_msg:
+ devm_kfree(dev, msg);
+
+ return NULL;
+}
+
+/**
+ * viortc_msg_release() - Decrement message refcnt, potentially free message.
+ * @msg: message requested by driver
+ *
+ * Context: Any context.
+ */
+static void viortc_msg_release(struct viortc_msg *msg)
+{
+ if (refcount_dec_and_test(&msg->refcnt)) {
+ struct device *dev = &msg->viortc->vdev->dev;
+
+ devm_kfree(dev, msg->req);
+ devm_kfree(dev, msg->resp);
+ devm_kfree(dev, msg);
+ }
+}
+
+/**
+ * viortc_do_cb() - generic virtqueue callback logic
+ * @vq: virtqueue
+ * @handle_buf: function to process a used buffer
+ *
+ * Context: virtqueue callback, typically interrupt. Takes and releases vq lock.
+ */
+static void viortc_do_cb(struct virtqueue *vq,
+ void (*handle_buf)(void *token, unsigned int len,
+ struct virtqueue *vq,
+ struct viortc_vq *viortc_vq,
+ struct viortc_dev *viortc))
+{
+ struct viortc_dev *viortc = vq->vdev->priv;
+ struct viortc_vq *viortc_vq;
+ bool cb_enabled = true;
+ unsigned long flags;
+ spinlock_t *lock;
+ unsigned int len;
+ void *token;
+
+ viortc_vq = &viortc->vqs[vq->index];
+ lock = &viortc_vq->lock;
+
+ for (;;) {
+ spin_lock_irqsave(lock, flags);
+
+ if (cb_enabled) {
+ virtqueue_disable_cb(vq);
+ cb_enabled = false;
+ }
+
+ token = virtqueue_get_buf(vq, &len);
+ if (!token) {
+ if (virtqueue_enable_cb(vq)) {
+ spin_unlock_irqrestore(lock, flags);
+ return;
+ }
+ cb_enabled = true;
+ }
+
+ spin_unlock_irqrestore(lock, flags);
+
+ if (token)
+ handle_buf(token, len, vq, viortc_vq, viortc);
+ }
+}
+
+/**
+ * viortc_requestq_hdlr() - process a requestq used buffer
+ * @token: token identifying the buffer
+ * @len: bytes written by device
+ * @vq: virtqueue
+ * @viortc_vq: device specific data for virtqueue
+ * @viortc: device data
+ *
+ * Signals completion for each received message.
+ *
+ * Context: virtqueue callback
+ */
+static void viortc_requestq_hdlr(void *token, unsigned int len,
+ struct virtqueue *vq,
+ struct viortc_vq *viortc_vq,
+ struct viortc_dev *viortc)
+{
+ struct viortc_msg *msg = token;
+
+ msg->resp_actual_size = len;
+
+ /*
+ * completion waiter must see our msg metadata, but complete() does not
+ * guarantee a memory barrier
+ */
+ smp_wmb();
+
+ complete(&msg->responded);
+ viortc_msg_release(msg);
+}
+
+/**
+ * viortc_cb_requestq() - callback for requestq
+ * @vq: virtqueue
+ *
+ * Context: virtqueue callback
+ */
+static void viortc_cb_requestq(struct virtqueue *vq)
+{
+ viortc_do_cb(vq, viortc_requestq_hdlr);
+}
+
+/**
+ * viortc_get_resp_errno() - converts virtio_rtc errnos to system errnos
+ * @resp_head: message response header
+ *
+ * Return: negative system errno, or 0
+ */
+static int viortc_get_resp_errno(struct virtio_rtc_resp_head *resp_head)
+{
+ switch (virtio_le_to_cpu(resp_head->status)) {
+ case VIRTIO_RTC_S_OK:
+ return 0;
+ case VIRTIO_RTC_S_EOPNOTSUPP:
+ return -EOPNOTSUPP;
+ case VIRTIO_RTC_S_EINVAL:
+ return -EINVAL;
+ case VIRTIO_RTC_S_ENODEV:
+ return -ENODEV;
+ case VIRTIO_RTC_S_EIO:
+ default:
+ return -EIO;
+ }
+}
+
+/**
+ * viortc_msg_xfer() - send message request, wait until message response
+ * @vq: virtqueue
+ * @msg: message with driver request
+ * @timeout_jiffies: message response timeout, 0 for no timeout
+ *
+ * Context: Process context. Takes and releases vq.lock. May sleep.
+ */
+static int viortc_msg_xfer(struct viortc_vq *vq, struct viortc_msg *msg,
+ unsigned long timeout_jiffies)
+{
+ int ret;
+ unsigned long flags;
+ struct scatterlist out_sg[1];
+ struct scatterlist in_sg[1];
+ struct scatterlist *sgs[2] = { out_sg, in_sg };
+ bool notify;
+
+ sg_init_one(out_sg, msg->req, msg->req_size);
+ sg_init_one(in_sg, msg->resp, msg->resp_cap);
+
+ spin_lock_irqsave(&vq->lock, flags);
+
+ ret = virtqueue_add_sgs(vq->vq, sgs, 1, 1, msg, GFP_ATOMIC);
+ if (ret) {
+ spin_unlock_irqrestore(&vq->lock, flags);
+ /*
+ * Release in place of the response callback, which will never
+ * come.
+ */
+ viortc_msg_release(msg);
+ return ret;
+ }
+
+ notify = virtqueue_kick_prepare(vq->vq);
+
+ spin_unlock_irqrestore(&vq->lock, flags);
+
+ if (notify)
+ virtqueue_notify(vq->vq);
+
+ if (timeout_jiffies) {
+ long timeout_ret;
+
+ timeout_ret = wait_for_completion_interruptible_timeout(
+ &msg->responded, timeout_jiffies);
+
+ if (!timeout_ret)
+ return -ETIMEDOUT;
+ else if (timeout_ret < 0)
+ return (int)timeout_ret;
+ } else {
+ ret = wait_for_completion_interruptible(&msg->responded);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Ensure we can read message metadata written in the virtqueue
+ * callback.
+ */
+ smp_rmb();
+
+ /*
+ * There is not yet a case where returning a short message would make
+ * sense, so consider any deviation an error.
+ */
+ if (msg->resp_actual_size != msg->resp_cap)
+ return -EINVAL;
+
+ return viortc_get_resp_errno(msg->resp);
+}
+
+/*
+ * common message handle macros for messages of different types
+ */
+
+/**
+ * VIORTC_DECLARE_MSG_HDL_ONSTACK() - declare message handle on stack
+ * @hdl: message handle name
+ * @msg_suf_lowerc: message type suffix in lowercase
+ * @msg_suf_upperc: message type suffix in uppercase
+ */
+#define VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, msg_suf_lowerc, msg_suf_upperc) \
+ struct { \
+ struct viortc_msg *msg; \
+ struct virtio_rtc_req_##msg_suf_lowerc *req; \
+ struct virtio_rtc_resp_##msg_suf_lowerc *resp; \
+ unsigned int req_size; \
+ unsigned int resp_cap; \
+ u16 msg_type; \
+ } hdl = { \
+ NULL, \
+ NULL, \
+ NULL, \
+ sizeof(struct virtio_rtc_req_##msg_suf_lowerc), \
+ sizeof(struct virtio_rtc_resp_##msg_suf_lowerc), \
+ VIRTIO_RTC_REQ_##msg_suf_upperc, \
+ }
+
+/**
+ * VIORTC_MSG() - extract message from message handle
+ *
+ * Return: struct viortc_msg
+ */
+#define VIORTC_MSG(hdl) ((hdl).msg)
+
+/**
+ * VIORTC_MSG_INIT() - initialize message handle
+ * @hdl: message handle
+ * @viortc: device data (struct viortc_dev *)
+ *
+ * Context: Process context.
+ * Return: 0 on success, -ENOMEM otherwise.
+ */
+#define VIORTC_MSG_INIT(hdl, viortc) \
+ ({ \
+ typeof(hdl) *_hdl = &(hdl); \
+ \
+ _hdl->msg = viortc_msg_init((viortc), _hdl->msg_type, \
+ _hdl->req_size, _hdl->resp_cap); \
+ if (_hdl->msg) { \
+ _hdl->req = _hdl->msg->req; \
+ _hdl->resp = _hdl->msg->resp; \
+ } \
+ _hdl->msg ? 0 : -ENOMEM; \
+ })
+
+/**
+ * VIORTC_MSG_WRITE() - write a request message field
+ * @hdl: message handle
+ * @dest_member: request message field name
+ * @src_ptr: pointer to data of compatible type
+ *
+ * Writes the field in little-endian format.
+ */
+#define VIORTC_MSG_WRITE(hdl, dest_member, src_ptr) \
+ do { \
+ typeof(hdl) _hdl = (hdl); \
+ typeof(src_ptr) _src_ptr = (src_ptr); \
+ \
+ /* Sanity check: must match the member's type */ \
+ typecheck(typeof(_hdl.req->dest_member), *_src_ptr); \
+ \
+ _hdl.req->dest_member = \
+ virtio_cpu_to_le(*_src_ptr, _hdl.req->dest_member); \
+ } while (0)
+
+/**
+ * VIORTC_MSG_READ() - read from a response message field
+ * @hdl: message handle
+ * @src_member: response message field name
+ * @dest_ptr: pointer to data of compatible type
+ *
+ * Converts from little-endian format and writes to dest_ptr.
+ */
+#define VIORTC_MSG_READ(hdl, src_member, dest_ptr) \
+ do { \
+ typeof(dest_ptr) _dest_ptr = (dest_ptr); \
+ \
+ /* Sanity check: must match the member's type */ \
+ typecheck(typeof((hdl).resp->src_member), *_dest_ptr); \
+ \
+ *_dest_ptr = virtio_le_to_cpu((hdl).resp->src_member); \
+ } while (0)
+
+/*
+ * read requests
+ */
+
+/** timeout for clock readings, where timeouts are considered non-fatal */
+#define VIORTC_MSG_READ_TIMEOUT (msecs_to_jiffies(60 * 1000))
+
+/**
+ * viortc_read() - VIRTIO_RTC_REQ_READ wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @reading: clock reading [ns]
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_read(struct viortc_dev *viortc, u16 vio_clk_id, u64 *reading)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, read, READ);
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ VIORTC_MSG_READ_TIMEOUT);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, clock_reading, reading);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/**
+ * viortc_read_cross() - VIRTIO_RTC_REQ_READ_CROSS wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @hw_counter: virtio_rtc HW counter type
+ * @reading: clock reading [ns]
+ * @cycles: HW counter cycles during clock reading
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_read_cross(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
+ u64 *reading, u64 *cycles)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, read_cross, READ_CROSS);
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+ VIORTC_MSG_WRITE(hdl, hw_counter, &hw_counter);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ VIORTC_MSG_READ_TIMEOUT);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, clock_reading, reading);
+ VIORTC_MSG_READ(hdl, counter_cycles, cycles);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/*
+ * control requests
+ */
+
+/**
+ * viortc_cfg() - VIRTIO_RTC_REQ_CFG wrapper
+ * @viortc: device data
+ * @num_clocks: # of virtio_rtc clocks
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_cfg(struct viortc_dev *viortc, u16 *num_clocks)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, cfg, CFG);
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ 0);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, num_clocks, num_clocks);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/**
+ * viortc_clock_cap() - VIRTIO_RTC_REQ_CLOCK_CAP wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @type: virtio_rtc clock type
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_clock_cap(struct viortc_dev *viortc, u16 vio_clk_id,
+ u16 *type)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, clock_cap, CLOCK_CAP);
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ 0);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, type, type);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/**
+ * viortc_cross_cap() - VIRTIO_RTC_REQ_CROSS_CAP wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @hw_counter: virtio_rtc HW counter type
+ * @supported: xtstamping is supported for the vio_clk_id/hw_counter pair
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
+ bool *supported)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, cross_cap, CROSS_CAP);
+ u8 flags;
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+ VIORTC_MSG_WRITE(hdl, hw_counter, &hw_counter);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ 0);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, flags, &flags);
+ *supported = !!(flags & VIRTIO_RTC_FLAG_CROSS_CAP);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/*
+ * init, deinit
+ */
+
+/**
+ * viortc_clocks_init() - init local representations of virtio_rtc clocks
+ * @viortc: device data
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_clocks_init(struct viortc_dev *viortc)
+{
+ int ret;
+ u16 num_clocks;
+
+ ret = viortc_cfg(viortc, &num_clocks);
+ if (ret)
+ return ret;
+
+ if (num_clocks < 1) {
+ dev_err(&viortc->vdev->dev, "device reported 0 clocks\n");
+ return -ENODEV;
+ }
+
+ viortc->num_clocks = num_clocks;
+
+ /* In the future, PTP clocks will be initialized here. */
+ (void)viortc_clock_cap;
+
+ return ret;
+}
+
+/**
+ * viortc_init_vqs() - init virtqueues
+ * @viortc: device data
+ *
+ * Inits virtqueues and associated data.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_init_vqs(struct viortc_dev *viortc)
+{
+ int ret;
+ struct virtio_device *vdev = viortc->vdev;
+ const char *names[VIORTC_MAX_NR_QUEUES];
+ vq_callback_t *callbacks[VIORTC_MAX_NR_QUEUES];
+ struct virtqueue *vqs[VIORTC_MAX_NR_QUEUES];
+ int nr_queues;
+
+ nr_queues = VIORTC_REQUESTQ + 1;
+ names[VIORTC_REQUESTQ] = "requestq";
+ callbacks[VIORTC_REQUESTQ] = viortc_cb_requestq;
+
+ ret = virtio_find_vqs(vdev, nr_queues, vqs, callbacks, names, NULL);
+ if (ret)
+ return ret;
+
+ viortc->vqs[VIORTC_REQUESTQ].vq = vqs[VIORTC_REQUESTQ];
+ spin_lock_init(&viortc->vqs[VIORTC_REQUESTQ].lock);
+
+ return 0;
+}
+
+/**
+ * viortc_probe() - probe a virtio_rtc virtio device
+ * @vdev: virtio device
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_probe(struct virtio_device *vdev)
+{
+ struct viortc_dev *viortc;
+ int ret;
+
+ viortc = devm_kzalloc(&vdev->dev, sizeof(*viortc), GFP_KERNEL);
+ if (!viortc)
+ return -ENOMEM;
+
+ vdev->priv = viortc;
+ viortc->vdev = vdev;
+
+ ret = viortc_init_vqs(viortc);
+ if (ret)
+ return ret;
+
+ virtio_device_ready(vdev);
+
+ /* Ready vdev for use by frontend devices initialized next. */
+ smp_wmb();
+
+ ret = viortc_clocks_init(viortc);
+ if (ret)
+ goto err_reset_vdev;
+
+ return 0;
+
+err_reset_vdev:
+ virtio_reset_device(vdev);
+ vdev->config->del_vqs(vdev);
+
+ return ret;
+}
+
+/**
+ * viortc_remove() - remove a virtio_rtc virtio device
+ * @vdev: virtio device
+ */
+static void viortc_remove(struct virtio_device *vdev)
+{
+ /* In the future, PTP clocks will be deinitialized here. */
+
+ virtio_reset_device(vdev);
+ vdev->config->del_vqs(vdev);
+}
+
+static struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_CLOCK, VIRTIO_DEV_ANY_ID },
+ { 0 },
+};
+MODULE_DEVICE_TABLE(virtio, id_table);
+
+static struct virtio_driver virtio_rtc_drv = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .probe = viortc_probe,
+ .remove = viortc_remove,
+};
+
+module_virtio_driver(virtio_rtc_drv);
+
+MODULE_DESCRIPTION("Virtio RTC driver");
+MODULE_AUTHOR("OpenSynergy GmbH");
+MODULE_LICENSE("GPL");
diff --git a/drivers/virtio/virtio_rtc_internal.h b/drivers/virtio/virtio_rtc_internal.h
new file mode 100644
index 000000000000..9267661b8030
--- /dev/null
+++ b/drivers/virtio/virtio_rtc_internal.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * virtio_rtc internal interfaces
+ *
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ */
+
+#ifndef _VIRTIO_RTC_INTERNAL_H_
+#define _VIRTIO_RTC_INTERNAL_H_
+
+#include <linux/types.h>
+
+/* driver core IFs */
+
+struct viortc_dev;
+
+int viortc_read(struct viortc_dev *viortc, u16 vio_clk_id, u64 *reading);
+int viortc_read_cross(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
+ u64 *reading, u64 *cycles);
+int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
+ bool *supported);
+
+#endif /* _VIRTIO_RTC_INTERNAL_H_ */
diff --git a/include/uapi/linux/virtio_rtc.h b/include/uapi/linux/virtio_rtc.h
new file mode 100644
index 000000000000..f469276562d7
--- /dev/null
+++ b/include/uapi/linux/virtio_rtc.h
@@ -0,0 +1,144 @@
+/* SPDX-License-Identifier: ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) */
+/*
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ */
+
+#ifndef _LINUX_VIRTIO_RTC_H
+#define _LINUX_VIRTIO_RTC_H
+
+#include <linux/types.h>
+
+/* read request message types */
+
+#define VIRTIO_RTC_REQ_READ 0x0001
+#define VIRTIO_RTC_REQ_READ_CROSS 0x0002
+
+/* control request message types */
+
+#define VIRTIO_RTC_REQ_CFG 0x1000
+#define VIRTIO_RTC_REQ_CLOCK_CAP 0x1001
+#define VIRTIO_RTC_REQ_CROSS_CAP 0x1002
+
+/* Message headers */
+
+/** common request header */
+struct virtio_rtc_req_head {
+ __le16 msg_type;
+ __u8 reserved[6];
+};
+
+/** common response header */
+struct virtio_rtc_resp_head {
+#define VIRTIO_RTC_S_OK 0
+#define VIRTIO_RTC_S_EOPNOTSUPP 2
+#define VIRTIO_RTC_S_ENODEV 3
+#define VIRTIO_RTC_S_EINVAL 4
+#define VIRTIO_RTC_S_EIO 5
+ __u8 status;
+ __u8 reserved[7];
+};
+
+/* read requests */
+
+/* VIRTIO_RTC_REQ_READ message */
+
+struct virtio_rtc_req_read {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+ __u8 reserved[6];
+};
+
+struct virtio_rtc_resp_read {
+ struct virtio_rtc_resp_head head;
+ __le64 clock_reading;
+};
+
+/* VIRTIO_RTC_REQ_READ_CROSS message */
+
+struct virtio_rtc_req_read_cross {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+/** Arm Generic Timer Virtual Count */
+#define VIRTIO_RTC_COUNTER_ARM_VIRT 0
+/** Arm Generic Timer Physical Count */
+#define VIRTIO_RTC_COUNTER_ARM_PHYS 1
+/** x86 Time Stamp Counter */
+#define VIRTIO_RTC_COUNTER_X86_TSC 2
+ __le16 hw_counter;
+ __u8 reserved[4];
+};
+
+struct virtio_rtc_resp_read_cross {
+ struct virtio_rtc_resp_head head;
+ __le64 clock_reading;
+ __le64 counter_cycles;
+};
+
+/* control requests */
+
+/* VIRTIO_RTC_REQ_CFG message */
+
+struct virtio_rtc_req_cfg {
+ struct virtio_rtc_req_head head;
+ /* no request params */
+};
+
+struct virtio_rtc_resp_cfg {
+ struct virtio_rtc_resp_head head;
+ /** # of clocks -> clock ids < num_clocks are valid */
+ __le16 num_clocks;
+ __u8 reserved[6];
+};
+
+/* VIRTIO_RTC_REQ_CLOCK_CAP message */
+
+struct virtio_rtc_req_clock_cap {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+ __u8 reserved[6];
+};
+
+struct virtio_rtc_resp_clock_cap {
+ struct virtio_rtc_resp_head head;
+#define VIRTIO_RTC_CLOCK_UTC 0
+#define VIRTIO_RTC_CLOCK_TAI 1
+#define VIRTIO_RTC_CLOCK_MONO 2
+ __le16 type;
+ __u8 reserved[6];
+};
+
+/* VIRTIO_RTC_REQ_CROSS_CAP message */
+
+struct virtio_rtc_req_cross_cap {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+ __le16 hw_counter;
+ __u8 reserved[4];
+};
+
+struct virtio_rtc_resp_cross_cap {
+ struct virtio_rtc_resp_head head;
+#define VIRTIO_RTC_FLAG_CROSS_CAP (1 << 0)
+ __u8 flags;
+ __u8 reserved[7];
+};
+
+/** Union of request types for requestq */
+union virtio_rtc_req_requestq {
+ struct virtio_rtc_req_read read;
+ struct virtio_rtc_req_read_cross read_cross;
+ struct virtio_rtc_req_cfg cfg;
+ struct virtio_rtc_req_clock_cap clock_cap;
+ struct virtio_rtc_req_cross_cap cross_cap;
+};
+
+/** Union of response types for requestq */
+union virtio_rtc_resp_requestq {
+ struct virtio_rtc_resp_read read;
+ struct virtio_rtc_resp_read_cross read_cross;
+ struct virtio_rtc_resp_cfg cfg;
+ struct virtio_rtc_resp_clock_cap clock_cap;
+ struct virtio_rtc_resp_cross_cap cross_cap;
+};
+
+#endif /* _LINUX_VIRTIO_RTC_H */
--
2.40.1


Subject: [tip: timers/core] timekeeping: Fix cross-timestamp interpolation corner case decision

The following commit has been merged into the timers/core branch of tip:

Commit-ID: 87a41130881995f82f7adbafbfeddaebfb35f0ef
Gitweb: https://git.kernel.org/tip/87a41130881995f82f7adbafbfeddaebfb35f0ef
Author: Peter Hilber <[email protected]>
AuthorDate: Mon, 18 Dec 2023 08:38:40 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Mon, 19 Feb 2024 12:18:51 +01:00

timekeeping: Fix cross-timestamp interpolation corner case decision

The cycle_between() helper checks if parameter test is in the open interval
(before, after). Colloquially speaking, this also applies to the counter
wrap-around special case before > after. get_device_system_crosststamp()
currently uses cycle_between() at the first call site to decide whether to
interpolate for older counter readings.

get_device_system_crosststamp() has the following problem with
cycle_between() testing against an open interval: Assume that, by chance,
cycles == tk->tkr_mono.cycle_last (in the following, "cycle_last" for
brevity). Then, cycle_between() at the first call site, with effective
argument values cycle_between(cycle_last, cycles, now), returns false,
enabling interpolation. During interpolation,
get_device_system_crosststamp() will then call cycle_between() at the
second call site (if a history_begin was supplied). The effective argument
values are cycle_between(history_begin->cycles, cycles, cycles), since
system_counterval.cycles == interval_start == cycles, per the assumption.
Due to the test against the open interval, cycle_between() returns false
again. This causes get_device_system_crosststamp() to return -EINVAL.

This failure should be avoided, since get_device_system_crosststamp() works
both when cycles follows cycle_last (no interpolation), and when cycles
precedes cycle_last (interpolation). For the case cycles == cycle_last,
interpolation is actually unneeded.

Fix this by changing cycle_between() into timestamp_in_interval(), which
now checks against the closed interval, rather than the open interval.

This changes the get_device_system_crosststamp() behavior for three corner
cases:

1. Bypass interpolation in the case cycles == tk->tkr_mono.cycle_last,
fixing the problem described above.

2. At the first timestamp_in_interval() call site, cycles == now no longer
causes failure.

3. At the second timestamp_in_interval() call site, history_begin->cycles
== system_counterval.cycles no longer causes failure.
adjust_historical_crosststamp() also works for this corner case,
where partial_history_cycles == total_history_cycles.

These behavioral changes should not cause any problems.

Fixes: 2c756feb18d9 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
kernel/time/timekeeping.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 8f35455..4e9f2f8 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1180,13 +1180,15 @@ static int adjust_historical_crosststamp(struct system_time_snapshot *history,
}

/*
- * cycle_between - true if test occurs chronologically between before and after
+ * timestamp_in_interval - true if ts is chronologically in [start, end]
+ *
+ * True if ts occurs chronologically at or after start, and before or at end.
*/
-static bool cycle_between(u64 before, u64 test, u64 after)
+static bool timestamp_in_interval(u64 start, u64 end, u64 ts)
{
- if (test > before && test < after)
+ if (ts >= start && ts <= end)
return true;
- if (before > after && (test > before || test < after))
+ if (start > end && (ts >= start || ts <= end))
return true;
return false;
}
@@ -1246,7 +1248,7 @@ int get_device_system_crosststamp(int (*get_time_fn)
*/
now = tk_clock_read(&tk->tkr_mono);
interval_start = tk->tkr_mono.cycle_last;
- if (!cycle_between(interval_start, cycles, now)) {
+ if (!timestamp_in_interval(interval_start, now, cycles)) {
clock_was_set_seq = tk->clock_was_set_seq;
cs_was_changed_seq = tk->cs_was_changed_seq;
cycles = interval_start;
@@ -1277,13 +1279,13 @@ int get_device_system_crosststamp(int (*get_time_fn)
bool discontinuity;

/*
- * Check that the counter value occurs after the provided
+ * Check that the counter value is not before the provided
* history reference and that the history doesn't cross a
* clocksource change
*/
if (!history_begin ||
- !cycle_between(history_begin->cycles,
- system_counterval.cycles, cycles) ||
+ !timestamp_in_interval(history_begin->cycles,
+ cycles, system_counterval.cycles) ||
history_begin->cs_was_changed_seq != cs_was_changed_seq)
return -EINVAL;
partial_history_cycles = cycles - system_counterval.cycles;

2024-03-07 17:08:03

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> RFC v3 updates
> --------------
>
> This series implements a driver for a virtio-rtc device conforming to spec
> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> the PTP clock driver already present before.
>
> This patch series depends on the patch series "treewide: Use clocksource id
> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> series on top of mainline.
>
> Overview
> --------
>
> This patch series adds the virtio_rtc module, and related bugfixes. The
> virtio_rtc module implements a driver compatible with the proposed Virtio
> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> provides information about current time. The device can provide different
> clocks, e.g. for the UTC or TAI time standards, or for physical time
> elapsed since some past epoch. 

Hm, should we allow UTC? If you tell me the time in UTC, then
(sometimes) I still don't actually know what the time is, because some
UTC seconds occur twice. UTC only makes sense if you provide the TAI
offset, surely? Should the virtio_rtc specification make it mandatory
to provide such?

Otherwise you're just designing it to allow crappy hypervisors to
expose incomplete information.

> PTP clock interface
> -------------------
>
> virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
> If both the Virtio RTC device and this driver have special support for the
> current clocksource, time synchronization programs can use
> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
> with single-digit ns precision is possible with a quiescent reference clock
> (from the Virtio RTC device). This works even when the Virtio device
> response is slow compared to ptp_kvm hypercalls.

Is PTP the right mechanism for this? As I understand it, PTP is a way
to precisely synchronize one clock with another. But in the case of
virt guests synchronizing against the host, it isn't really *another*
clock. It really is the *same* underlying clock. As the host clock
varies with temperature, for example, so does the guest clock. The only
difference is an offset and (on x86 perhaps) a mathematical scaling of
the frequency.

I was looking at this another way, when I came across this virtio-rtc
work.

My idea was just for the hypervisor to expose its own timekeeping
information — the counter/TSC value and TAI time at a given moment,
frequency of the counter, and the precision of both that frequency
(±PPM) and the TAI timestamp (±µs).

By putting that in a host/guest shared data structure with a seqcount
for lockless updates, we can update it as time synchronization on the
host is refined, and we can even cleanly handle live migration where
the guest ends up on a completely different host. It allows for use
cases which *really* care (e.g. timestamping financial transactions) to
ensure that there is never even a moment of getting *wrong* timestamps
if they haven't yet resynced after a migration.

Now I'm trying to work out if I should attempt to reconcile with your
existing virtio-rtc work, or just decide that virtio-rtc isn't trying
to solve the actual problem that we have, and go ahead with something
different... ?


Attachments:
smime.p7s (5.83 kB)

2024-03-08 10:34:54

by Peter Hilber

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 07.03.24 15:02, David Woodhouse wrote:
> On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>> RFC v3 updates
>> --------------
>>
>> This series implements a driver for a virtio-rtc device conforming to spec
>> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
>> the PTP clock driver already present before.
>>
>> This patch series depends on the patch series "treewide: Use clocksource id
>> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
>> series on top of mainline.
>>
>> Overview
>> --------
>>
>> This patch series adds the virtio_rtc module, and related bugfixes. The
>> virtio_rtc module implements a driver compatible with the proposed Virtio
>> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
>> provides information about current time. The device can provide different
>> clocks, e.g. for the UTC or TAI time standards, or for physical time
>> elapsed since some past epoch.
>
> Hm, should we allow UTC? If you tell me the time in UTC, then
> (sometimes) I still don't actually know what the time is, because some
> UTC seconds occur twice. UTC only makes sense if you provide the TAI
> offset, surely? Should the virtio_rtc specification make it mandatory
> to provide such?
>
> Otherwise you're just designing it to allow crappy hypervisors to
> expose incomplete information.
>

Hi David,

(adding [email protected] for spec discussion),

thank you for your insightful comments. I think I take a broadly similar
view. The reason why the current spec and driver is like this is that I
took a pragmatic approach at first and only included features which work
out-of-the-box for the current Linux ecosystem.

The current virtio_rtc features work similar to ptp_kvm, and therefore can
work out-of-the-box with time sync daemons such as chrony.

As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as
well, I am afraid that

- in some (embedded) scenarios, the TAI clock may not be available

- crappy hypervisors will pass off the UTC clock as the TAI clock.

For the same reasons, I am also not sure about adding a *mandatory* TAI
offset to each readout. I don't know user-space software which would
leverage this already (at least not through the PTP clock interface). And
why would such software not go straight for the TAI clock instead?

How about adding a requirement to the spec that the virtio-rtc device
SHOULD expose the TAI clock whenever it is available - would this address
your concerns?

>> PTP clock interface
>> -------------------
>>
>> virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
>> If both the Virtio RTC device and this driver have special support for the
>> current clocksource, time synchronization programs can use
>> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
>> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
>> with single-digit ns precision is possible with a quiescent reference clock
>> (from the Virtio RTC device). This works even when the Virtio device
>> response is slow compared to ptp_kvm hypercalls.
>
> Is PTP the right mechanism for this? As I understand it, PTP is a way
> to precisely synchronize one clock with another. But in the case of
> virt guests synchronizing against the host, it isn't really *another*
> clock. It really is the *same* underlying clock. As the host clock
> varies with temperature, for example, so does the guest clock. The only
> difference is an offset and (on x86 perhaps) a mathematical scaling of
> the frequency.
>
> I was looking at this another way, when I came across this virtio-rtc
> work.
>
> My idea was just for the hypervisor to expose its own timekeeping
> information — the counter/TSC value and TAI time at a given moment,
> frequency of the counter, and the precision of both that frequency
> (±PPM) and the TAI timestamp (±µs).
>
> By putting that in a host/guest shared data structure with a seqcount
> for lockless updates, we can update it as time synchronization on the
> host is refined, and we can even cleanly handle live migration where
> the guest ends up on a completely different host. It allows for use
> cases which *really* care (e.g. timestamping financial transactions) to
> ensure that there is never even a moment of getting *wrong* timestamps
> if they haven't yet resynced after a migration.

I considered a similar approach as well, but integrating that with the
kernel timekeeping seemed too much effort for the first step. However,
reading the clock from user space would be much simpler.

>
> Now I'm trying to work out if I should attempt to reconcile with your
> existing virtio-rtc work, or just decide that virtio-rtc isn't trying
> to solve the actual problem that we have, and go ahead with something
> different... ?
>

We are certainly interested into the discussed, say, "virtual timekeeper"
mechanism, which would also solve a lot of problems for us (especially if
it would be integrated with kernel timekeeping). Even without Linux kernel
timekeeping, the virtual timekeeper would be useful to us for guests with
simpler timekeeping, and potentially for user space applications.

Our current intent is to at first try to upstream the current (RFC spec v3)
feature set. I think the virtual timekeeper would be suitable as an
optional feature of virtio_rtc (with Virtio, this could easily be added
after initial upstreaming). It is also possible to have a virtio-rtc device
only implement the virtual timekeeper, but not the other clock reading
methods, if these are of no interest.

Best regards,

Peter

2024-03-08 18:37:31

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> On 07.03.24 15:02, David Woodhouse wrote:
> > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> > > RFC v3 updates
> > > --------------
> > >
> > > This series implements a driver for a virtio-rtc device conforming to spec
> > > RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> > > the PTP clock driver already present before.
> > >
> > > This patch series depends on the patch series "treewide: Use clocksource id
> > > for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> > > series on top of mainline.
> > >
> > > Overview
> > > --------
> > >
> > > This patch series adds the virtio_rtc module, and related bugfixes. The
> > > virtio_rtc module implements a driver compatible with the proposed Virtio
> > > RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> > > provides information about current time. The device can provide different
> > > clocks, e.g. for the UTC or TAI time standards, or for physical time
> > > elapsed since some past epoch.
> >
> > Hm, should we allow UTC? If you tell me the time in UTC, then
> > (sometimes) I still don't actually know what the time is, because some
> > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > offset, surely? Should the virtio_rtc specification make it mandatory
> > to provide such?
> >
> > Otherwise you're just designing it to allow crappy hypervisors to
> > expose incomplete information.
> >
>
> Hi David,
>
> (adding [email protected] for spec discussion),
>
> thank you for your insightful comments. I think I take a broadly similar
> view. The reason why the current spec and driver is like this is that I
> took a pragmatic approach at first and only included features which work
> out-of-the-box for the current Linux ecosystem.
>
> The current virtio_rtc features work similar to ptp_kvm, and therefore can
> work out-of-the-box with time sync daemons such as chrony.
>
> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as
> well, I am afraid that
>
> - in some (embedded) scenarios, the TAI clock may not be available
>
> - crappy hypervisors will pass off the UTC clock as the TAI clock.
>
> For the same reasons, I am also not sure about adding a *mandatory* TAI
> offset to each readout. I don't know user-space software which would
> leverage this already (at least not through the PTP clock interface). And
> why would such software not go straight for the TAI clock instead?
>
> How about adding a requirement to the spec that the virtio-rtc device
> SHOULD expose the TAI clock whenever it is available - would this address
> your concerns?

I think that would be too easy for implementors to miss, or decide not
to obey. Or to get *wrong*, by exposing a TAI clock but actually
putting UTC in it.

I think I prefer to mandate the tai_offset field with the UTC clock.
Crappy implementations will just set it to zero, but at least that
gives a clear signal to the guests that it's *their* problem to
resolve.




> > > PTP clock interface
> > > -------------------
> > >
> > > virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
> > > If both the Virtio RTC device and this driver have special support for the
> > > current clocksource, time synchronization programs can use
> > > cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
> > > PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
> > > with single-digit ns precision is possible with a quiescent reference clock
> > > (from the Virtio RTC device). This works even when the Virtio device
> > > response is slow compared to ptp_kvm hypercalls.
> >
> > Is PTP the right mechanism for this? As I understand it, PTP is a way
> > to precisely synchronize one clock with another. But in the case of
> > virt guests synchronizing against the host, it isn't really *another*
> > clock. It really is the *same* underlying clock. As the host clock
> > varies with temperature, for example, so does the guest clock. The only
> > difference is an offset and (on x86 perhaps) a mathematical scaling of
> > the frequency.
> >
> > I was looking at this another way, when I came across this virtio-rtc
> > work.
> >
> > My idea was just for the hypervisor to expose its own timekeeping
> > information — the counter/TSC value and TAI time at a given moment,
> > frequency of the counter, and the precision of both that frequency
> > (±PPM) and the TAI timestamp (±µs).
> >
> > By putting that in a host/guest shared data structure with a seqcount
> > for lockless updates, we can update it as time synchronization on the
> > host is refined, and we can even cleanly handle live migration where
> > the guest ends up on a completely different host. It allows for use
> > cases which *really* care (e.g. timestamping financial transactions) to
> > ensure that there is never even a moment of getting *wrong* timestamps
> > if they haven't yet resynced after a migration.
>
> I considered a similar approach as well, but integrating that with the
> kernel timekeeping seemed too much effort for the first step. However,
> reading the clock from user space would be much simpler.

Right. In fact my *first* use case was userspace, specifically in the
context of https://github.com/aws/clock-bound — but anything we design
for this absolutely has to be usable for kernel timekeeping too.

It's also critical to solve the Live Migration problem.

But is it so hard to integrate into the kernel timekeeping? My plan
would have given us effectively an infinite number of cross-reads of
the realtime clock vs. TSC. You don't have to actually read from a
virtio device; you just read the TSC and do the maths, using the values
in the shared memory region. Couldn't that be used to present a PTP
device to the guest kernel just the same as you do at the moment?

You could probably even simulate PPS with it. Typically with PPS we
have to catch the hardware interrupt and then read the TSC as soon as
possible thereafter. With this, you'd be able to *calculate* the TSC
value at the start of the next second, and wouldn't have to suffer the
real hardware latency :)

> >
> > Now I'm trying to work out if I should attempt to reconcile with your
> > existing virtio-rtc work, or just decide that virtio-rtc isn't trying
> > to solve the actual problem that we have, and go ahead with something
> > different... ?
> >
>
> We are certainly interested into the discussed, say, "virtual timekeeper"
> mechanism, which would also solve a lot of problems for us (especially if
> it would be integrated with kernel timekeeping). Even without Linux kernel
> timekeeping, the virtual timekeeper would be useful to us for guests with
> simpler timekeeping, and potentially for user space applications.
>
> Our current intent is to at first try to upstream the current (RFC spec v3)
> feature set. I think the virtual timekeeper would be suitable as an
> optional feature of virtio_rtc (with Virtio, this could easily be added
> after initial upstreaming). It is also possible to have a virtio-rtc device
> only implement the virtual timekeeper, but not the other clock reading
> methods, if these are of no interest.

Yeah, that might make sense. I was thinking of a simple ACPI/DT device
exposing a page of memory and *maybe* an interrupt for when an update
happens. (With the caveat that the interrupt would always occur too
late by definition, so it's no substitute for using the seqlock
correctly in applications that *really* care and are going to get fined
millions of dollars for mis-timestamping their transactions.)

But using the virtio-rtc device as the vehicle for that shared memory
page is reasonable too. It's not even mutually exclusive; we could
expose the *same* data structure in memory via whatever mechanisms we
wanted.

One other thing to note is I think we're being very naïve about the TSC
on x86 hosts. Theoretically, the TSC for every vCPU might run at a
different frequency, and even if they run at the same frequency they
might be offset from each other. I'm happy to be naïve but I think we
should be *explicitly* so, and just say for example that it's defined
against vCPU0 so if other vCPUs are different then all bets are off.

We *can* cope with TSC frequencies changing. Fundamentally, that's the
whole *point*; NTP calibrates itself as the underlying frequency does
change due to temperature changes, etc. — so a deliberate frequency
scaling, or even a live migration, are just a slightly special case of
the same thing.

One thing I have added to the memory region is a migration counter. In
the ideal case, guests will be happy just to use the hypervisor's
synchronization. But in some cases the guests may want to do NTP (or
PPS, PTP or something else) for themselves, to have more precise
timekeeping than the host. Even if the host is advertising itself as
stratum 16, the guest still needs to know of *migration*, because it
has to consider itself unsynchronized when that happens.


Attachments:
smime.p7s (5.83 kB)

2024-03-11 18:48:55

by Peter Hilber

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 08.03.24 13:33, David Woodhouse wrote:
> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
>> On 07.03.24 15:02, David Woodhouse wrote:
>>> On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>>>> RFC v3 updates
>>>> --------------
>>>>
>>>> This series implements a driver for a virtio-rtc device conforming to
>>>> spec
>>>> RFC v3 [1]. It now includes an RTC class driver with alarm, in
>>>> addition to
>>>> the PTP clock driver already present before.
>>>>
>>>> This patch series depends on the patch series "treewide: Use
>>>> clocksource id
>>>> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
>>>> series on top of mainline.
>>>>
>>>> Overview
>>>> --------
>>>>
>>>> This patch series adds the virtio_rtc module, and related bugfixes.
>>>> The
>>>> virtio_rtc module implements a driver compatible with the proposed
>>>> Virtio
>>>> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
>>>> provides information about current time. The device can provide
>>>> different
>>>> clocks, e.g. for the UTC or TAI time standards, or for physical time
>>>> elapsed since some past epoch.
>>>
>>> Hm, should we allow UTC? If you tell me the time in UTC, then
>>> (sometimes) I still don't actually know what the time is, because some
>>> UTC seconds occur twice. UTC only makes sense if you provide the TAI
>>> offset, surely? Should the virtio_rtc specification make it mandatory
>>> to provide such?
>>>
>>> Otherwise you're just designing it to allow crappy hypervisors to
>>> expose incomplete information.
>>>
>>
>> Hi David,
>>
>> (adding [email protected] for spec discussion),
>>
>> thank you for your insightful comments. I think I take a broadly similar
>> view. The reason why the current spec and driver is like this is that I
>> took a pragmatic approach at first and only included features which work
>> out-of-the-box for the current Linux ecosystem.
>>
>> The current virtio_rtc features work similar to ptp_kvm, and therefore
>> can
>> work out-of-the-box with time sync daemons such as chrony.
>>
>> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
>> as
>> well, I am afraid that
>>
>> - in some (embedded) scenarios, the TAI clock may not be available
>>
>> - crappy hypervisors will pass off the UTC clock as the TAI clock.
>>
>> For the same reasons, I am also not sure about adding a *mandatory* TAI
>> offset to each readout. I don't know user-space software which would
>> leverage this already (at least not through the PTP clock interface).
>> And
>> why would such software not go straight for the TAI clock instead?
>>
>> How about adding a requirement to the spec that the virtio-rtc device
>> SHOULD expose the TAI clock whenever it is available - would this
>> address
>> your concerns?
>
> I think that would be too easy for implementors to miss, or decide not
> to obey. Or to get *wrong*, by exposing a TAI clock but actually
> putting UTC in it.
>
> I think I prefer to mandate the tai_offset field with the UTC clock.
> Crappy implementations will just set it to zero, but at least that
> gives a clear signal to the guests that it's *their* problem to
> resolve.

To me there are some open questions regarding how this would work. Is there
a use case for this with the v3 clock reading methods, or would it be
enough to address this with the Virtio timekeeper?

Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
best alongside some additional information about leap seconds. I am not
aware about any user-space user. In addition, leap second smearing should
also be addressed.

>
>
>
>
>>>> PTP clock interface
>>>> -------------------
>>>>
>>>> virtio_rtc exposes clocks as PTP clocks to userspace, similar to
>>>> ptp_kvm.
>>>> If both the Virtio RTC device and this driver have special support for
>>>> the
>>>> current clocksource, time synchronization programs can use
>>>> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
>>>> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time
>>>> synchronization
>>>> with single-digit ns precision is possible with a quiescent reference
>>>> clock
>>>> (from the Virtio RTC device). This works even when the Virtio device
>>>> response is slow compared to ptp_kvm hypercalls.
>>>
>>> Is PTP the right mechanism for this? As I understand it, PTP is a way
>>> to precisely synchronize one clock with another. But in the case of
>>> virt guests synchronizing against the host, it isn't really *another*
>>> clock. It really is the *same* underlying clock. As the host clock
>>> varies with temperature, for example, so does the guest clock. The only
>>> difference is an offset and (on x86 perhaps) a mathematical scaling of
>>> the frequency.
>>>
>>> I was looking at this another way, when I came across this virtio-rtc
>>> work.
>>>
>>> My idea was just for the hypervisor to expose its own timekeeping
>>> information — the counter/TSC value and TAI time at a given moment,
>>> frequency of the counter, and the precision of both that frequency
>>> (±PPM) and the TAI timestamp (±µs).
>>>
>>> By putting that in a host/guest shared data structure with a seqcount
>>> for lockless updates, we can update it as time synchronization on the
>>> host is refined, and we can even cleanly handle live migration where
>>> the guest ends up on a completely different host. It allows for use
>>> cases which *really* care (e.g. timestamping financial transactions) to
>>> ensure that there is never even a moment of getting *wrong* timestamps
>>> if they haven't yet resynced after a migration.
>>
>> I considered a similar approach as well, but integrating that with the
>> kernel timekeeping seemed too much effort for the first step. However,
>> reading the clock from user space would be much simpler.
>
> Right. In fact my *first* use case was userspace, specifically in the
> context of https://github.com/aws/clock-bound — but anything we design
> for this absolutely has to be usable for kernel timekeeping too.
>
> It's also critical to solve the Live Migration problem.
>
> But is it so hard to integrate into the kernel timekeeping? My plan
> would have given us effectively an infinite number of cross-reads of
> the realtime clock vs. TSC. You don't have to actually read from a
> virtio device; you just read the TSC and do the maths, using the values
> in the shared memory region. Couldn't that be used to present a PTP
> device to the guest kernel just the same as you do at the moment?

Yes, and it would also decrease the clock reading overhead (saving at least
the Virtio response interrupt and associated scheduling). Would make sense
to me.

To be clear, with "kernel timekeeping integration" I meant to have
timekeeping.c derive the time directly from the Virtio timekeeper.

>
> You could probably even simulate PPS with it. Typically with PPS we
> have to catch the hardware interrupt and then read the TSC as soon as
> possible thereafter. With this, you'd be able to *calculate* the TSC
> value at the start of the next second, and wouldn't have to suffer the
> real hardware latency :)
>
>>>
>>> Now I'm trying to work out if I should attempt to reconcile with your
>>> existing virtio-rtc work, or just decide that virtio-rtc isn't trying
>>> to solve the actual problem that we have, and go ahead with something
>>> different... ?
>>>
>>
>> We are certainly interested into the discussed, say, "virtual
>> timekeeper"
>> mechanism, which would also solve a lot of problems for us (especially
>> if
>> it would be integrated with kernel timekeeping). Even without Linux
>> kernel
>> timekeeping, the virtual timekeeper would be useful to us for guests
>> with
>> simpler timekeeping, and potentially for user space applications.
>>
>> Our current intent is to at first try to upstream the current (RFC spec
>> v3)
>> feature set. I think the virtual timekeeper would be suitable as an
>> optional feature of virtio_rtc (with Virtio, this could easily be added
>> after initial upstreaming). It is also possible to have a virtio-rtc
>> device
>> only implement the virtual timekeeper, but not the other clock reading
>> methods, if these are of no interest.
>
> Yeah, that might make sense. I was thinking of a simple ACPI/DT device
> exposing a page of memory and *maybe* an interrupt for when an update
> happens. (With the caveat that the interrupt would always occur too
> late by definition, so it's no substitute for using the seqlock
> correctly in applications that *really* care and are going to get fined
> millions of dollars for mis-timestamping their transactions.)
>
> But using the virtio-rtc device as the vehicle for that shared memory
> page is reasonable too. It's not even mutually exclusive; we could
> expose the *same* data structure in memory via whatever mechanisms we
> wanted.
>
> One other thing to note is I think we're being very naïve about the TSC
> on x86 hosts. Theoretically, the TSC for every vCPU might run at a
> different frequency, and even if they run at the same frequency they
> might be offset from each other. I'm happy to be naïve but I think we
> should be *explicitly* so, and just say for example that it's defined
> against vCPU0 so if other vCPUs are different then all bets are off.

ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
have an opinion on how to represent this in a platform-independent way.

Thank you for the comments,

Peter

>
> We *can* cope with TSC frequencies changing. Fundamentally, that's the
> whole *point*; NTP calibrates itself as the underlying frequency does
> change due to temperature changes, etc. — so a deliberate frequency
> scaling, or even a live migration, are just a slightly special case of
> the same thing.
>
> One thing I have added to the memory region is a migration counter. In
> the ideal case, guests will be happy just to use the hypervisor's
> synchronization. But in some cases the guests may want to do NTP (or
> PPS, PTP or something else) for themselves, to have more precise
> timekeeping than the host. Even if the host is advertising itself as
> stratum 16, the guest still needs to know of *migration*, because it
> has to consider itself unsynchronized when that happens

2024-03-12 17:16:08

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> On 08.03.24 13:33, David Woodhouse wrote:
> > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > (sometimes) I still don't actually know what the time is, because some
> > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > offset, surely? Should the virtio_rtc specification make it mandatory
> > > > to provide such?
> > > >
> > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > expose incomplete information.
> > > >
> > >
> > > Hi David,
> > >
> > > (adding [email protected] for spec discussion),
> > >
> > > thank you for your insightful comments. I think I take a broadly similar
> > > view. The reason why the current spec and driver is like this is that I
> > > took a pragmatic approach at first and only included features which work
> > > out-of-the-box for the current Linux ecosystem.
> > >
> > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > can work out-of-the-box with time sync daemons such as chrony.
> > >
> > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > as well, I am afraid that
> > >
> > > - in some (embedded) scenarios, the TAI clock may not be available
> > >
> > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > >
> > > For the same reasons, I am also not sure about adding a *mandatory* TAI
> > > offset to each readout. I don't know user-space software which would
> > > leverage this already (at least not through the PTP clock interface).
> > > And why would such software not go straight for the TAI clock instead?
> > >
> > > How about adding a requirement to the spec that the virtio-rtc device
> > > SHOULD expose the TAI clock whenever it is available - would this
> > > address your concerns?
> >
> > I think that would be too easy for implementors to miss, or decide not
> > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > putting UTC in it.
> >
> > I think I prefer to mandate the tai_offset field with the UTC clock.
> > Crappy implementations will just set it to zero, but at least that
> > gives a clear signal to the guests that it's *their* problem to
> > resolve.
>
> To me there are some open questions regarding how this would work. Is there
> a use case for this with the v3 clock reading methods, or would it be
> enough to address this with the Virtio timekeeper?
>
> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> best alongside some additional information about leap seconds. I am not
> aware about any user-space user. In addition, leap second smearing should
> also be addressed.
>

Is there even a standard yet for leap-smearing? Will it be linear over
1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
think is what Google does? Meta does something different again, don't
they?

Exposing UTC as the only clock reference is bad enough; when leap
seconds happen there's a whole second during which you don't *know*
which second it is. It seems odd to me, for a precision clock to be
deliberately ambiguous about what the time is!

But if the virtio-rtc clock is defined as UTC and then expose something
*different* in it, that's even worse. You potentially end up providing
inaccurate time for a whole *day* leading up to the leap second.

I think you're right that leap second smearing should be addressed. At
the very least, by making it clear that the virtio-rtc clock which
advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
yet-to-be-defined variant.

Please make it explicit that any hypervisor which wants to advertise a
smeared clock shall define a new type which specifies the precise
smearing algorithm and cannot be conflated with the one you're defining
here.

> > One other thing to note is I think we're being very naïve about the TSC
> > on x86 hosts. Theoretically, the TSC for every vCPU might run at a
> > different frequency, and even if they run at the same frequency they
> > might be offset from each other. I'm happy to be naïve but I think we
> > should be *explicitly* so, and just say for example that it's defined
> > against vCPU0 so if other vCPUs are different then all bets are off.
>
> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
> have an opinion on how to represent this in a platform-independent way.

Well, it doesn't have a notion of TSCs either; you include that by
implicit reference don't you?


Attachments:
smime.p7s (5.83 kB)

2024-03-13 09:46:22

by Peter Hilber

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 12.03.24 18:15, David Woodhouse wrote:
> On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
>> On 08.03.24 13:33, David Woodhouse wrote:
>>> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
>>>> On 07.03.24 15:02, David Woodhouse wrote:
>>>>> Hm, should we allow UTC? If you tell me the time in UTC, then
>>>>> (sometimes) I still don't actually know what the time is, because some
>>>>> UTC seconds occur twice. UTC only makes sense if you provide the TAI
>>>>> offset, surely? Should the virtio_rtc specification make it mandatory
>>>>> to provide such?
>>>>>
>>>>> Otherwise you're just designing it to allow crappy hypervisors to
>>>>> expose incomplete information.
>>>>>
>>>>
>>>> Hi David,
>>>>
>>>> (adding [email protected] for spec discussion),
>>>>
>>>> thank you for your insightful comments. I think I take a broadly similar
>>>> view. The reason why the current spec and driver is like this is that I
>>>> took a pragmatic approach at first and only included features which work
>>>> out-of-the-box for the current Linux ecosystem.
>>>>
>>>> The current virtio_rtc features work similar to ptp_kvm, and therefore
>>>> can work out-of-the-box with time sync daemons such as chrony.
>>>>
>>>> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
>>>> as well, I am afraid that
>>>>
>>>> - in some (embedded) scenarios, the TAI clock may not be available
>>>>
>>>> - crappy hypervisors will pass off the UTC clock as the TAI clock.
>>>>
>>>> For the same reasons, I am also not sure about adding a *mandatory* TAI
>>>> offset to each readout. I don't know user-space software which would
>>>> leverage this already (at least not through the PTP clock interface).
>>>> And why would such software not go straight for the TAI clock instead?
>>>>
>>>> How about adding a requirement to the spec that the virtio-rtc device
>>>> SHOULD expose the TAI clock whenever it is available - would this
>>>> address your concerns?
>>>
>>> I think that would be too easy for implementors to miss, or decide not
>>> to obey. Or to get *wrong*, by exposing a TAI clock but actually
>>> putting UTC in it.
>>>
>>> I think I prefer to mandate the tai_offset field with the UTC clock.
>>> Crappy implementations will just set it to zero, but at least that
>>> gives a clear signal to the guests that it's *their* problem to
>>> resolve.
>>
>> To me there are some open questions regarding how this would work. Is there
>> a use case for this with the v3 clock reading methods, or would it be
>> enough to address this with the Virtio timekeeper?
>>
>> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
>> best alongside some additional information about leap seconds. I am not
>> aware about any user-space user. In addition, leap second smearing should
>> also be addressed.
>>
>
> Is there even a standard yet for leap-smearing? Will it be linear over
> 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
> think is what Google does? Meta does something different again, don't
> they?
>
> Exposing UTC as the only clock reference is bad enough; when leap
> seconds happen there's a whole second during which you don't *know*
> which second it is. It seems odd to me, for a precision clock to be
> deliberately ambiguous about what the time is!

Just to be clear, the device can perfectly expose only a TAI reference
clock (or both UTC and TAI), the spec is just completely open about this,
as it tries to work for diverse use cases.

>
> But if the virtio-rtc clock is defined as UTC and then expose something
> *different* in it, that's even worse. You potentially end up providing
> inaccurate time for a whole *day* leading up to the leap second.
>
> I think you're right that leap second smearing should be addressed. At
> the very least, by making it clear that the virtio-rtc clock which
> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
> yet-to-be-defined variant.
>

Agreed.

> Please make it explicit that any hypervisor which wants to advertise a
> smeared clock shall define a new type which specifies the precise
> smearing algorithm and cannot be conflated with the one you're defining
> here.
>

I will add a requirement that the UTC clock can never have smeared/smoothed
leap seconds.

I think that not every vendor would bother to first add a definition of a
smearing algorithm. Also, I think in some cases knowing the precise
smearing algorithm might not be important (when having the same time as the
hypervisor is enough and accuracy w.r.t. actual time is less important).

So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
now could catch every UTC-like clock which smears/smoothes leap seconds,
where the vendor cannot be bothered to add the smearing algorithm to spec
and implementations.

As for UTC-SLS, this *could* also be added, although [1] says

It is inappropriate to use Internet-Drafts as reference material or
to cite them other than as "work in progress."

[1] https://datatracker.ietf.org/doc/html/draft-kuhn-leapsecond-00

>>> One other thing to note is I think we're being very naïve about the TSC
>>> on x86 hosts. Theoretically, the TSC for every vCPU might run at a
>>> different frequency, and even if they run at the same frequency they
>>> might be offset from each other. I'm happy to be naïve but I think we
>>> should be *explicitly* so, and just say for example that it's defined
>>> against vCPU0 so if other vCPUs are different then all bets are off.
>>
>> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
>> have an opinion on how to represent this in a platform-independent way.
>
> Well, it doesn't have a notion of TSCs either; you include that by
> implicit reference don't you?

I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
so the device might not even know which vCPUs there are. E.g. there is even
interest to make virtio-rtc work as part of the virtio-net device (which
might be implemented in hardware).

Thanks for the comments,

Peter

2024-03-13 11:18:20

by Alexandre Belloni

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 13/03/2024 10:45:54+0100, Peter Hilber wrote:
> > Exposing UTC as the only clock reference is bad enough; when leap
> > seconds happen there's a whole second during which you don't *know*
> > which second it is. It seems odd to me, for a precision clock to be
> > deliberately ambiguous about what the time is!
>
> Just to be clear, the device can perfectly expose only a TAI reference
> clock (or both UTC and TAI), the spec is just completely open about this,
> as it tries to work for diverse use cases.
>
> >
> > But if the virtio-rtc clock is defined as UTC and then expose something
> > *different* in it, that's even worse. You potentially end up providing
> > inaccurate time for a whole *day* leading up to the leap second.
> >
> > I think you're right that leap second smearing should be addressed. At
> > the very least, by making it clear that the virtio-rtc clock which
> > advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
> > yet-to-be-defined variant.
> >
>
> Agreed.
>
> > Please make it explicit that any hypervisor which wants to advertise a
> > smeared clock shall define a new type which specifies the precise
> > smearing algorithm and cannot be conflated with the one you're defining
> > here.
> >
>
> I will add a requirement that the UTC clock can never have smeared/smoothed
> leap seconds.
>
> I think that not every vendor would bother to first add a definition of a
> smearing algorithm. Also, I think in some cases knowing the precise
> smearing algorithm might not be important (when having the same time as the
> hypervisor is enough and accuracy w.r.t. actual time is less important).
>
> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
> now could catch every UTC-like clock which smears/smoothes leap seconds,
> where the vendor cannot be bothered to add the smearing algorithm to spec
> and implementations.
>

I still don't know anything about virtio but under Linux, an RTC is
always UTC (or localtime when dual booting but let's not care) and never
accounts for leap seconds. Having an RTC and RTC driver behaving
differently would be super inconvenient. Why don't you leave this to
userspace?

I guess I'm still questioning whether this is the correct interface to
expose the host system time instead of an actual RTC.

--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

2024-03-13 12:45:47

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote:
> On 12.03.24 18:15, David Woodhouse wrote:
> > On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> > > On 08.03.24 13:33, David Woodhouse wrote:
> > > > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > > > (sometimes) I still don't actually know what the time is, because some
> > > > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > > > offset, surely? Should the virtio_rtc specification make it mandatory
> > > > > > to provide such?
> > > > > >
> > > > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > > > expose incomplete information.
> > > > > >
> > > > >
> > > > > Hi David,
> > > > >
> > > > > (adding [email protected] for spec discussion),
> > > > >
> > > > > thank you for your insightful comments. I think I take a broadly similar
> > > > > view. The reason why the current spec and driver is like this is that I
> > > > > took a pragmatic approach at first and only included features which work
> > > > > out-of-the-box for the current Linux ecosystem.
> > > > >
> > > > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > > > can work out-of-the-box with time sync daemons such as chrony.
> > > > >
> > > > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > > > as well, I am afraid that
> > > > >
> > > > > - in some (embedded) scenarios, the TAI clock may not be available
> > > > >
> > > > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > > > >
> > > > > For the same reasons, I am also not sure about adding a *mandatory* TAI
> > > > > offset to each readout. I don't know user-space software which would
> > > > > leverage this already (at least not through the PTP clock interface).
> > > > > And why would such software not go straight for the TAI clock instead?
> > > > >
> > > > > How about adding a requirement to the spec that the virtio-rtc device
> > > > > SHOULD expose the TAI clock whenever it is available - would this
> > > > > address your concerns?
> > > >
> > > > I think that would be too easy for implementors to miss, or decide not
> > > > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > > > putting UTC in it.
> > > >
> > > > I think I prefer to mandate the tai_offset field with the UTC clock.
> > > > Crappy implementations will just set it to zero, but at least that
> > > > gives a clear signal to the guests that it's *their* problem to
> > > > resolve.
> > >
> > > To me there are some open questions regarding how this would work. Is there
> > > a use case for this with the v3 clock reading methods, or would it be
> > > enough to address this with the Virtio timekeeper?
> > >
> > > Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> > > best alongside some additional information about leap seconds. I am not
> > > aware about any user-space user. In addition, leap second smearing should
> > > also be addressed.
> > >
> >
> > Is there even a standard yet for leap-smearing? Will it be linear over
> > 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
> > think is what Google does? Meta does something different again, don't
> > they?
> >
> > Exposing UTC as the only clock reference is bad enough; when leap
> > seconds happen there's a whole second during which you don't *know*
> > which second it is. It seems odd to me, for a precision clock to be
> > deliberately ambiguous about what the time is!
>
> Just to be clear, the device can perfectly expose only a TAI reference
> clock (or both UTC and TAI), the spec is just completely open about this,
> as it tries to work for diverse use cases.

As long as the guest *knows* what it's getting, sure.

> >
> > But if the virtio-rtc clock is defined as UTC and then expose something
> > *different* in it, that's even worse. You potentially end up providing
> > inaccurate time for a whole *day* leading up to the leap second.
> >
> > I think you're right that leap second smearing should be addressed. At
> > the very least, by making it clear that the virtio-rtc clock which
> > advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
> > yet-to-be-defined variant.
> >
>
> Agreed.
>
> > Please make it explicit that any hypervisor which wants to advertise a
> > smeared clock shall define a new type which specifies the precise
> > smearing algorithm and cannot be conflated with the one you're defining
> > here.
> >
>
> I will add a requirement that the UTC clock can never have smeared/smoothed
> leap seconds.

Thanks.

> I think that not every vendor would bother to first add a definition of a
> smearing algorithm. Also, I think in some cases knowing the precise
> smearing algorithm might not be important (when having the same time as the
> hypervisor is enough and accuracy w.r.t. actual time is less important).
>
> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
> now could catch every UTC-like clock which smears/smoothes leap seconds,
> where the vendor cannot be bothered to add the smearing algorithm to spec
> and implementations.

Please $DEITY no.

Surely the whole point of this effort is to provide guests with precise
and *unambiguous* knowledge of what the time is?

Using UTC is bad enough, because for a UTC timestamp in the middle of a
leap second the guest can't know know *which* occurrence of that leap
second it is, so it might be wrong by a second. To resolve that
ambiguity needs a leap indicator and/or tai_offset field.

But if you allow and encourage the use of smeared time without even a
specification of *how* it's smeared... that's even worse. You have an
unknown inaccuracy of up to a second for whole periods of time around a
leap second. That's surely the *antithesis* of what we're trying to do
here? Without an actual definition of the smearing, how is a guest
actually supposed to know what time it is?

(I suppose you could add a tai_offset_nanoseconds field? I don't know
that I want to *encourage* that thought process...)

> As for UTC-SLS, this *could* also be added, although [1] says
>
>         It is inappropriate to use Internet-Drafts as reference material or
>         to cite them other than as "work in progress."
>
> [1] https://datatracker.ietf.org/doc/html/draft-kuhn-leapsecond-00
>
> > > > One other thing to note is I think we're being very naïve about the TSC
> > > > on x86 hosts. Theoretically, the TSC for every vCPU might run at a
> > > > different frequency, and even if they run at the same frequency they
> > > > might be offset from each other. I'm happy to be naïve but I think we
> > > > should be *explicitly* so, and just say for example that it's defined
> > > > against vCPU0 so if other vCPUs are different then all bets are off.
> > >
> > > ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
> > > have an opinion on how to represent this in a platform-independent way.
> >
> > Well, it doesn't have a notion of TSCs either; you include that by
> > implicit reference don't you?
>
> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
> so the device might not even know which vCPUs there are. E.g. there is even
> interest to make virtio-rtc work as part of the virtio-net device (which
> might be implemented in hardware).

Sure, but those implementations aren't going to offer the TSC pairing
at all, are they?


Attachments:
smime.p7s (5.83 kB)

2024-03-13 12:58:40

by Alexandre Belloni

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 13/03/2024 12:29:38+0000, David Woodhouse wrote:
> On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote:
> >
> > I still don't know anything about virtio but under Linux, an RTC is
> > always UTC (or localtime when dual booting but let's not care) and never
> > accounts for leap seconds. Having an RTC and RTC driver behaving
> > differently would be super inconvenient. Why don't you leave this to
> > userspace?
>
> Well yes, we don't need to expose *anything* from the hypervisor and we
> can leave it all to guest userspace. We can run NTP on every single one
> of *hundreds* of guests, leaving them all to duplicate the work of
> calibrating the *same* underlying oscillator.
>

Really, I see the point of sharing the time accurately between the host
and the guest and not duplicating the effort. This is not what I am
objecting to.

> I thought we were trying to avoid that, by having the hypervisor tell
> them what the time was. If we're going to do that, we need it to be
> sufficiently precise (and some clients want to *know* the precision),
> and above all we need it to be *unambiguous*.
>
> If the hypervisor says that the time is 3692217600.001, then the guest
> doesn't actually know *which* 3692217600.001 it is, and thus it still
> doesn't know the time to an accuracy better than 1 second.
>

The RTC subsystem has a 1 second resolution and this is not going to
change because there is no point doing so for the hardware, to get the
time precisely, UIE MUST be used there is no vendor that will just
support reading the time/date or timestamp as this is way too imprecise.

> And if we start allowing the hypervisor to smear clocks in some other
> underspecified ways, then we end up with errors of up to 1 second in
> the clock for long periods of time *around* the leap second.
>
> We need to avoid that ambiguity.

Exactly my point and I said, reading an RTC is always UTC and never
handles leap seconds so if userspace doesn't handle the leap second and
updates the RTC, too bad. There are obviously no RTC that will smear
clock unless instructed to, so the hypervisor must not smear the clock.

Note that this is not an issue for the subsystem because if you are not
capable to track an accurate clock, the RTC itself will have drifted by
much more than a second in between leap second inclusions.

>
> > I guess I'm still questioning whether this is the correct interface to
> > expose the host system time instead of an actual RTC.
>
> If an RTC device is able to report '23:59:60' as the time of day, I
> suppose that *could* resolve the ambiguity. But talking to a device is
> slow; we want guests to be able to know the time — accurately — with a
> simple counter/tsc read and some arithmetic. Which means *paired* reads
> of 'RTC' and the counter, and a precise indication of the counter
> frequency.

23:59:60 is not and will never be allowed in the RTC subsystem as this
is an invalid value for the hardware.

The TSC or whatever CPU counter/clock that is used to keep the system
time is not an RTC, I don't get why it has to be exposed as such to the
guests. PTP is fine and precise, RTC is not.

--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

2024-03-13 14:16:43

by Peter Hilber

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 13.03.24 12:18, Alexandre Belloni wrote:
> On 13/03/2024 10:45:54+0100, Peter Hilber wrote:
>>> Exposing UTC as the only clock reference is bad enough; when leap
>>> seconds happen there's a whole second during which you don't *know*
>>> which second it is. It seems odd to me, for a precision clock to be
>>> deliberately ambiguous about what the time is!
>>
>> Just to be clear, the device can perfectly expose only a TAI reference
>> clock (or both UTC and TAI), the spec is just completely open about this,
>> as it tries to work for diverse use cases.
>>
>>>
>>> But if the virtio-rtc clock is defined as UTC and then expose something
>>> *different* in it, that's even worse. You potentially end up providing
>>> inaccurate time for a whole *day* leading up to the leap second.
>>>
>>> I think you're right that leap second smearing should be addressed. At
>>> the very least, by making it clear that the virtio-rtc clock which
>>> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
>>> yet-to-be-defined variant.
>>>
>>
>> Agreed.
>>
>>> Please make it explicit that any hypervisor which wants to advertise a
>>> smeared clock shall define a new type which specifies the precise
>>> smearing algorithm and cannot be conflated with the one you're defining
>>> here.
>>>
>>
>> I will add a requirement that the UTC clock can never have smeared/smoothed
>> leap seconds.
>>
>> I think that not every vendor would bother to first add a definition of a
>> smearing algorithm. Also, I think in some cases knowing the precise
>> smearing algorithm might not be important (when having the same time as the
>> hypervisor is enough and accuracy w.r.t. actual time is less important).
>>
>> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
>> now could catch every UTC-like clock which smears/smoothes leap seconds,
>> where the vendor cannot be bothered to add the smearing algorithm to spec
>> and implementations.
>>
>
> I still don't know anything about virtio but under Linux, an RTC is
> always UTC (or localtime when dual booting but let's not care) and never
> accounts for leap seconds. Having an RTC and RTC driver behaving
> differently would be super inconvenient. Why don't you leave this to
> userspace?
>
> I guess I'm still questioning whether this is the correct interface to
> expose the host system time instead of an actual RTC.
>

virtio_rtc only registers RTC class devices for virtio_rtc clock type UTC,
so adding an UTC_SMEARED clock type would avoid leap seconds for the RTC
class.

But I understand that there are more concerns and I will re-consider. From
my POV CLOCK_{REALTIME,MONOTONIC}_ALARM support is very important however.

So the only alternative to me seems to be adding an alternative alarmtimer
backend (and time synchronization through user space).

Thanks for the comment,

Peter

2024-03-13 14:51:18

by Alexandre Belloni

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 13/03/2024 14:06:42+0000, David Woodhouse wrote:
> If you're asking why patch 7/7 in Peter's series exists to expose the
> virtio clock through RTC, and you're not particularly interested in the
> first six, I suppose that's a fair question. As is the question of "why
> is it called virtio_rtc not virtio_ptp?".
>

Exactly my question, thanks :)

> But let me turn it around: if the kernel has access to this virtio
> device and *not* any other RTC, why *wouldn't* the kernel use the time
> from it? The fact that it can optionally *also* provide paired readings
> with the CPU counter doesn't actually *hurt* for the RTC use case, does
> it?

As long as it doesn't behave differently from the other RTC, I'm fine
with this. This is important because I don't want to carry any special
infrastructure for this driver or to have to special case this driver
later on because it is incompatible with some evolution of the
subsystem.

--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

2024-03-13 15:36:11

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote:
>
> I still don't know anything about virtio but under Linux, an RTC is
> always UTC (or localtime when dual booting but let's not care) and never
> accounts for leap seconds. Having an RTC and RTC driver behaving
> differently would be super inconvenient. Why don't you leave this to
> userspace?

Well yes, we don't need to expose *anything* from the hypervisor and we
can leave it all to guest userspace. We can run NTP on every single one
of *hundreds* of guests, leaving them all to duplicate the work of
calibrating the *same* underlying oscillator.

I thought we were trying to avoid that, by having the hypervisor tell
them what the time was. If we're going to do that, we need it to be
sufficiently precise (and some clients want to *know* the precision),
and above all we need it to be *unambiguous*.

If the hypervisor says that the time is 3692217600.001, then the guest
doesn't actually know *which* 3692217600.001 it is, and thus it still
doesn't know the time to an accuracy better than 1 second.

And if we start allowing the hypervisor to smear clocks in some other
underspecified ways, then we end up with errors of up to 1 second in
the clock for long periods of time *around* the leap second.

We need to avoid that ambiguity.

> I guess I'm still questioning whether this is the correct interface to
> expose the host system time instead of an actual RTC.

If an RTC device is able to report '23:59:60' as the time of day, I
suppose that *could* resolve the ambiguity. But talking to a device is
slow; we want guests to be able to know the time — accurately — with a
simple counter/tsc read and some arithmetic. Which means *paired* reads
of 'RTC' and the counter, and a precise indication of the counter
frequency.


Attachments:
smime.p7s (5.83 kB)

2024-03-13 18:12:40

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote:
> The TSC or whatever CPU counter/clock that is used to keep the system
> time is not an RTC, I don't get why it has to be exposed as such to the
> guests. PTP is fine and precise, RTC is not.

Ah, I see. But the point of the virtio_rtc is not really to expose that
CPU counter. The point is to report the wallclock time, just like an
actual RTC. The real difference is the *precision*.

The virtio_rtc device has a facility to *also* expose the counter,
because that's what we actually need to gain that precision...

Applications don't read the RTC every time they want to know what the
time is. These days, they don't even make a system call; it's done
entirely in userspace mode. The kernel exposes some shared memory,
essentially saying "the counter was X at time Y, and runs at Z Hz".
Then applications just read the CPU counter and do some arithmetic.

As we require more and more precision in the calibration, it becomes
important to get *paired* readings of the CPU counter and the wallclock
time at precisely the same moment. If the guest has to read one and
then the other, potentially taking interrupts, getting preempted and
suffering steal/SMI time in the middle, that introduces an error which
is increasingly significant as we increasingly care about precision.

Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest
kernels having to repeat readings over time and perform the calibration
as the underlying hardware oscillator frequency (Z) drifts with
temperature. I'm trying to get him to let the hypervisor expose the
calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ.
Which aside from reducing the duplication of effort, will *also* fix
the problem of live migration where *all* those things suffer a step
change and leave the guest with an inaccurate clock but not knowing it.

But that part isn't relevant to the RTC, as you say. RTC doesn't care
about that level of precision; it's just what the system uses to know
roughly what year it is, when it powers up. (And isn't even really
trusted for that, which is a large part of why I added the
X509_V_FLAG_NO_CHECK_TIME flag to OpenSSL, so Secure Boot doesn't break
when the RTC is catastrophically wrong :)

If you're asking why patch 7/7 in Peter's series exists to expose the
virtio clock through RTC, and you're not particularly interested in the
first six, I suppose that's a fair question. As is the question of "why
is it called virtio_rtc not virtio_ptp?".

But let me turn it around: if the kernel has access to this virtio
device and *not* any other RTC, why *wouldn't* the kernel use the time
from it? The fact that it can optionally *also* provide paired readings
with the CPU counter doesn't actually *hurt* for the RTC use case, does
it?




Attachments:
smime.p7s (5.83 kB)

2024-03-13 18:56:11

by Peter Hilber

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 13.03.24 13:45, David Woodhouse wrote:
> On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote:
>> On 12.03.24 18:15, David Woodhouse wrote:
>>> On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
>>>> On 08.03.24 13:33, David Woodhouse wrote:
>>>>> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
>>>>>> On 07.03.24 15:02, David Woodhouse wrote:
>>>>>>> Hm, should we allow UTC? If you tell me the time in UTC, then
>>>>>>> (sometimes) I still don't actually know what the time is, because some
>>>>>>> UTC seconds occur twice. UTC only makes sense if you provide the TAI
>>>>>>> offset, surely? Should the virtio_rtc specification make it mandatory
>>>>>>> to provide such?
>>>>>>>
>>>>>>> Otherwise you're just designing it to allow crappy hypervisors to
>>>>>>> expose incomplete information.
>>>>>>>
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> (adding [email protected] for spec discussion),
>>>>>>
>>>>>> thank you for your insightful comments. I think I take a broadly similar
>>>>>> view. The reason why the current spec and driver is like this is that I
>>>>>> took a pragmatic approach at first and only included features which work
>>>>>> out-of-the-box for the current Linux ecosystem.
>>>>>>
>>>>>> The current virtio_rtc features work similar to ptp_kvm, and therefore
>>>>>> can work out-of-the-box with time sync daemons such as chrony.
>>>>>>
>>>>>> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
>>>>>> as well, I am afraid that
>>>>>>
>>>>>> - in some (embedded) scenarios, the TAI clock may not be available
>>>>>>
>>>>>> - crappy hypervisors will pass off the UTC clock as the TAI clock.
>>>>>>
>>>>>> For the same reasons, I am also not sure about adding a *mandatory* TAI
>>>>>> offset to each readout. I don't know user-space software which would
>>>>>> leverage this already (at least not through the PTP clock interface).
>>>>>> And why would such software not go straight for the TAI clock instead?
>>>>>>
>>>>>> How about adding a requirement to the spec that the virtio-rtc device
>>>>>> SHOULD expose the TAI clock whenever it is available - would this
>>>>>> address your concerns?
>>>>>
>>>>> I think that would be too easy for implementors to miss, or decide not
>>>>> to obey. Or to get *wrong*, by exposing a TAI clock but actually
>>>>> putting UTC in it.
>>>>>
>>>>> I think I prefer to mandate the tai_offset field with the UTC clock.
>>>>> Crappy implementations will just set it to zero, but at least that
>>>>> gives a clear signal to the guests that it's *their* problem to
>>>>> resolve.
>>>>
>>>> To me there are some open questions regarding how this would work. Is there
>>>> a use case for this with the v3 clock reading methods, or would it be
>>>> enough to address this with the Virtio timekeeper?
>>>>
>>>> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
>>>> best alongside some additional information about leap seconds. I am not
>>>> aware about any user-space user. In addition, leap second smearing should
>>>> also be addressed.
>>>>
>>>
>>> Is there even a standard yet for leap-smearing? Will it be linear over
>>> 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
>>> think is what Google does? Meta does something different again, don't
>>> they?
>>>
>>> Exposing UTC as the only clock reference is bad enough; when leap
>>> seconds happen there's a whole second during which you don't *know*
>>> which second it is. It seems odd to me, for a precision clock to be
>>> deliberately ambiguous about what the time is!
>>
>> Just to be clear, the device can perfectly expose only a TAI reference
>> clock (or both UTC and TAI), the spec is just completely open about this,
>> as it tries to work for diverse use cases.
>
> As long as the guest *knows* what it's getting, sure.
>
>>>
>>> But if the virtio-rtc clock is defined as UTC and then expose something
>>> *different* in it, that's even worse. You potentially end up providing
>>> inaccurate time for a whole *day* leading up to the leap second.
>>>
>>> I think you're right that leap second smearing should be addressed. At
>>> the very least, by making it clear that the virtio-rtc clock which
>>> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
>>> yet-to-be-defined variant.
>>>
>>
>> Agreed.
>>
>>> Please make it explicit that any hypervisor which wants to advertise a
>>> smeared clock shall define a new type which specifies the precise
>>> smearing algorithm and cannot be conflated with the one you're defining
>>> here.
>>>
>>
>> I will add a requirement that the UTC clock can never have smeared/smoothed
>> leap seconds.
>
> Thanks.
>
>> I think that not every vendor would bother to first add a definition of a
>> smearing algorithm. Also, I think in some cases knowing the precise
>> smearing algorithm might not be important (when having the same time as the
>> hypervisor is enough and accuracy w.r.t. actual time is less important).
>>
>> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
>> now could catch every UTC-like clock which smears/smoothes leap seconds,
>> where the vendor cannot be bothered to add the smearing algorithm to spec
>> and implementations.
>
> Please $DEITY no.
>
> Surely the whole point of this effort is to provide guests with precise
> and *unambiguous* knowledge of what the time is?

I would say, a fundamental point of this effort is to enable such
implementations, and to detect if a device is promising to support this.

Where we might differ is as to whether the Virtio clock *for every
implementation* has to be *continuously* accurate w.r.t. a time standard,
or whether *for some implementations* it could be enough that all guests in
the local system have the same, precise local notion of time, which might
be off from the actual time standard.

Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...

With your described use case the UTC_SMEARED clock should of course not be
used. The UTC_SMEARED clock would get a distinct name through udev, like
/dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
detected.

>
> Using UTC is bad enough, because for a UTC timestamp in the middle of a
> leap second the guest can't know know *which* occurrence of that leap
> second it is, so it might be wrong by a second. To resolve that
> ambiguity needs a leap indicator and/or tai_offset field.

I agree that virtio-rtc should communicate this. The question is, what
exactly, and for which clock read request?

As for PTP clocks:

- It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.

- The clock_adjtime(2) tai_offset and return value could be set (if
upstream will accept this). Would this help? As discussed, user space
would need to interpret this (and currently no dynamic POSIX clock sets
this).

>
> But if you allow and encourage the use of smeared time without even a
> specification of *how* it's smeared... that's even worse. You have an
> unknown inaccuracy of up to a second for whole periods of time around a
> leap second. That's surely the *antithesis* of what we're trying to do
> here? Without an actual definition of the smearing, how is a guest
> actually supposed to know what time it is?

As discussed above, I think in some use cases it is enough for the guest to
have a precise notion of time shared with the other guests.

>
> (I suppose you could add a tai_offset_nanoseconds field? I don't know
> that I want to *encourage* that thought process...)
>
>> As for UTC-SLS, this *could* also be added, although [1] says
>>
>> It is inappropriate to use Internet-Drafts as reference material or
>> to cite them other than as "work in progress."
>>
>> [1] https://datatracker.ietf.org/doc/html/draft-kuhn-leapsecond-00
>>
>>>>> One other thing to note is I think we're being very naïve about the TSC
>>>>> on x86 hosts. Theoretically, the TSC for every vCPU might run at a
>>>>> different frequency, and even if they run at the same frequency they
>>>>> might be offset from each other. I'm happy to be naïve but I think we
>>>>> should be *explicitly* so, and just say for example that it's defined
>>>>> against vCPU0 so if other vCPUs are different then all bets are off.
>>>>
>>>> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
>>>> have an opinion on how to represent this in a platform-independent way.
>>>
>>> Well, it doesn't have a notion of TSCs either; you include that by
>>> implicit reference don't you?
>>
>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
>> so the device might not even know which vCPUs there are. E.g. there is even
>> interest to make virtio-rtc work as part of the virtio-net device (which
>> might be implemented in hardware).
>
> Sure, but those implementations aren't going to offer the TSC pairing
> at all, are they?
>

They could offer an Intel ART pairing (some physical PTP NICs are already
doing this, look for the convert_art_to_tsc() users).

Thanks for the comments,

Peter

2024-03-13 19:06:18

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 13 March 2024 17:50:48 GMT, Peter Hilber <[email protected]> wrote:
>On 13.03.24 13:45, David Woodhouse wrote:
>> Surely the whole point of this effort is to provide guests with precise
>> and *unambiguous* knowledge of what the time is?
>
>I would say, a fundamental point of this effort is to enable such
>implementations, and to detect if a device is promising to support this.
>
>Where we might differ is as to whether the Virtio clock *for every
>implementation* has to be *continuously* accurate w.r.t. a time standard,
>or whether *for some implementations* it could be enough that all guests in
>the local system have the same, precise local notion of time, which might
>be off from the actual time standard.

That makes sense, but remember I don't just want {X, Y, Z} but *also* the error bounds of ±deltaY and ±deltaZ too.

So your example just boils down to "I'm calling it UTC, and it's really precise, but we make no promises about its *accuracy*". And that's fine.

>Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...

KVM is not an exemplar of good time practices.
Not in *any* respect :)

>With your described use case the UTC_SMEARED clock should of course not be
>used. The UTC_SMEARED clock would get a distinct name through udev, like
>/dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
>detected.

As long as it's clear to all concerned that this is fundamentally not usable as an accurate time source, and is only for the local-sync case you described, sure.

>> Using UTC is bad enough, because for a UTC timestamp in the middle of a
>> leap second the guest can't know know *which* occurrence of that leap
>> second it is, so it might be wrong by a second. To resolve that
>> ambiguity needs a leap indicator and/or tai_offset field.
>
>I agree that virtio-rtc should communicate this. The question is, what
>exactly, and for which clock read request?

Are we now conflating software architecture (and Linux in particular) with "hardware" design?

To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.

>As for PTP clocks:
>
>- It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.
>
>- The clock_adjtime(2) tai_offset and return value could be set (if
> upstream will accept this). Would this help? As discussed, user space
> would need to interpret this (and currently no dynamic POSIX clock sets
> this).

Hm, maybe?


>>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
>>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
>>> so the device might not even know which vCPUs there are. E.g. there is even
>>> interest to make virtio-rtc work as part of the virtio-net device (which
>>> might be implemented in hardware).
>>
>> Sure, but those implementations aren't going to offer the TSC pairing
>> at all, are they?
>>
>
>They could offer an Intel ART pairing (some physical PTP NICs are already
>doing this, look for the convert_art_to_tsc() users).

Right, but isn't that software's problem? The time pairing is defined against the ART in that case.

2024-03-13 19:23:15

by Peter Hilber

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 13.03.24 15:06, David Woodhouse wrote:
> On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote:
>> The TSC or whatever CPU counter/clock that is used to keep the system
>> time is not an RTC, I don't get why it has to be exposed as such to the
>> guests. PTP is fine and precise, RTC is not.
>
> Ah, I see. But the point of the virtio_rtc is not really to expose that
> CPU counter. The point is to report the wallclock time, just like an
> actual RTC. The real difference is the *precision*.
>
> The virtio_rtc device has a facility to *also* expose the counter,
> because that's what we actually need to gain that precision...
>
> Applications don't read the RTC every time they want to know what the
> time is. These days, they don't even make a system call; it's done
> entirely in userspace mode. The kernel exposes some shared memory,
> essentially saying "the counter was X at time Y, and runs at Z Hz".
> Then applications just read the CPU counter and do some arithmetic.
>
> As we require more and more precision in the calibration, it becomes
> important to get *paired* readings of the CPU counter and the wallclock
> time at precisely the same moment. If the guest has to read one and
> then the other, potentially taking interrupts, getting preempted and
> suffering steal/SMI time in the middle, that introduces an error which
> is increasingly significant as we increasingly care about precision.
>
> Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest
> kernels having to repeat readings over time and perform the calibration
> as the underlying hardware oscillator frequency (Z) drifts with
> temperature. I'm trying to get him to let the hypervisor expose the
> calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ.
> Which aside from reducing the duplication of effort, will *also* fix
> the problem of live migration where *all* those things suffer a step
> change and leave the guest with an inaccurate clock but not knowing it.

I am already convinced that this would work significantly better than the
{X,Y} pair (but would be a bit more effort to implement):

1. when accessed by user space, obviously

2. when backing the PTP clock, it saves CPU time and makes non-paired
reads more precise.

I would just prefer to try upstreaming the {X,Y} pairing first. I think the
{X,Y,Z...} pairing could be discussed and developed in parallel.

Thanks for the comments,

Peter

2024-03-13 20:12:21

by Andrew Lunn

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

> As long as it doesn't behave differently from the other RTC, I'm fine
> with this. This is important because I don't want to carry any special
> infrastructure for this driver or to have to special case this driver
> later on because it is incompatible with some evolution of the
> subsystem.

Maybe deliberately throw away all the sub-second accuracy when
accessing the time via the RTC API? That helps to make it look like an
RTC. And when doing the rounding, add a constant offset of 10ms to
emulate the virtual i2c bus it is hanging off, like most RTCs?

Andrew

2024-03-14 09:13:56

by Peter Hilber

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 13.03.24 21:12, Andrew Lunn wrote:
>> As long as it doesn't behave differently from the other RTC, I'm fine
>> with this. This is important because I don't want to carry any special
>> infrastructure for this driver or to have to special case this driver
>> later on because it is incompatible with some evolution of the
>> subsystem.
>
> Maybe deliberately throw away all the sub-second accuracy when
> accessing the time via the RTC API? That helps to make it look like an
> RTC. And when doing the rounding, add a constant offset of 10ms to
> emulate the virtual i2c bus it is hanging off, like most RTCs?
>
> Andrew

The truncating to whole seconds is already done. As to the offset, I do not
understand how this would help. I can read out my CMOS RTC in ~50 us.

Thanks for the comment,

Peter

2024-03-14 10:14:00

by Peter Hilber

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Now CC'ing the previous commenters to the virtio-rtc spec draft, since
this discussion is mostly about the spec, and the Virtio mailing lists
still seem to be in a migration hiatus...

On 13.03.24 19:18, David Woodhouse wrote:
> On 13 March 2024 17:50:48 GMT, Peter Hilber <[email protected]> wrote:
>> On 13.03.24 13:45, David Woodhouse wrote:
>>> Surely the whole point of this effort is to provide guests with precise
>>> and *unambiguous* knowledge of what the time is?
>>
>> I would say, a fundamental point of this effort is to enable such
>> implementations, and to detect if a device is promising to support this.
>>
>> Where we might differ is as to whether the Virtio clock *for every
>> implementation* has to be *continuously* accurate w.r.t. a time standard,
>> or whether *for some implementations* it could be enough that all guests in
>> the local system have the same, precise local notion of time, which might
>> be off from the actual time standard.
>
> That makes sense, but remember I don't just want {X, Y, Z} but *also* the error bounds of ±deltaY and ±deltaZ too.
>
> So your example just boils down to "I'm calling it UTC, and it's really precise, but we make no promises about its *accuracy*". And that's fine.
>
>> Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...
>
> KVM is not an exemplar of good time practices.
> Not in *any* respect :)
>
>> With your described use case the UTC_SMEARED clock should of course not be
>> used. The UTC_SMEARED clock would get a distinct name through udev, like
>> /dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
>> detected.
>
> As long as it's clear to all concerned that this is fundamentally not usable as an accurate time source, and is only for the local-sync case you described, sure.
>
>>> Using UTC is bad enough, because for a UTC timestamp in the middle of a
>>> leap second the guest can't know know *which* occurrence of that leap
>>> second it is, so it might be wrong by a second. To resolve that
>>> ambiguity needs a leap indicator and/or tai_offset field.
>>
>> I agree that virtio-rtc should communicate this. The question is, what
>> exactly, and for which clock read request?
>
> Are we now conflating software architecture (and Linux in particular) with "hardware" design?
>
> To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.
>

As Virtio is extensible (unlike hardware), my approach is to mostly specify
only what also has a PoC user and a use case.

>> As for PTP clocks:
>>
>> - It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.
>>
>> - The clock_adjtime(2) tai_offset and return value could be set (if
>> upstream will accept this). Would this help? As discussed, user space
>> would need to interpret this (and currently no dynamic POSIX clock sets
>> this).
>
> Hm, maybe?
>
>
>>>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
>>>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
>>>> so the device might not even know which vCPUs there are. E.g. there is even
>>>> interest to make virtio-rtc work as part of the virtio-net device (which
>>>> might be implemented in hardware).
>>>
>>> Sure, but those implementations aren't going to offer the TSC pairing
>>> at all, are they?
>>>
>>
>> They could offer an Intel ART pairing (some physical PTP NICs are already
>> doing this, look for the convert_art_to_tsc() users).
>
> Right, but isn't that software's problem? The time pairing is defined against the ART in that case.

My point was that such a device would then not necessarily have an idea
what vCPU 0 is. But let's just say that this will be phrased as a SHOULD
best-effort requirement anyway.

Thanks for the comments,

Peter

2024-03-14 14:20:22

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On 14 March 2024 11:13:37 CET, Peter Hilber <[email protected]> wrote:
>> To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.
>>
>
>As Virtio is extensible (unlike hardware), my approach is to mostly specify
>only what also has a PoC user and a use case.

If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on day one. Otherwise, as I said in my first response, I can go do that as a separate device and decide that virtio_rtc doesn't meet our needs (especially for maintaining accuracy over LM).

My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that it's extensible but we don't want a v1.0 of the spec, implemented by various hypervisors, which still leaves guests not knowing what the actual time is. That would not be good. And even UTC without a leap second indicator has that problem.

2024-03-19 14:01:31

by Peter Hilber

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

While the virtio-comment list is not available, now also CC'ing Parav,
which may be interested in this virtio-rtc spec related discussion thread.

On 14.03.24 15:19, David Woodhouse wrote:
> On 14 March 2024 11:13:37 CET, Peter Hilber <[email protected]> wrote:
>>> To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.
>>>
>>
>> As Virtio is extensible (unlike hardware), my approach is to mostly specify
>> only what also has a PoC user and a use case.
>
> If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on day one. Otherwise, as I said in my first response, I can go do that as a separate device and decide that virtio_rtc doesn't meet our needs (especially for maintaining accuracy over LM).

We plan to add

- leap second indication,

- UTC-to-TAI offset,

- clock smearing indication (including the noon-to-noon linear smearing
variant which seems to be somewhat popular), and

- clock accuracy indication

to the initial spec and to the PoC implementation.

However, due to resource restrictions, we cannot ourselves add the
memory-mapped clock to the initial spec.

Everyone is very welcome to contribute the memory-mapped clock to the spec,
and I think it might then still make it to the initial version.

>
> My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that it's extensible but we don't want a v1.0 of the spec, implemented by various hypervisors, which still leaves guests not knowing what the actual time is. That would not be good. And even UTC without a leap second indicator has that problem.

Agreed. That should be addressed by the above changes.

Best regards,

Peter

2024-03-20 17:23:13

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On Tue, 2024-03-19 at 14:47 +0100, Peter Hilber wrote:
> While the virtio-comment list is not available, now also CC'ing Parav,
> which may be interested in this virtio-rtc spec related discussion thread.
>
> On 14.03.24 15:19, David Woodhouse wrote:
> > On 14 March 2024 11:13:37 CET, Peter Hilber <[email protected]> wrote:
> > > > To a certain extent, as long as the virtio-rtc device is designed to expose time precisely and unambiguously, it's less important if the Linux kernel *today* can use that. Although of course we should strive for that. Let's be...well, *unambiguous*, I suppose... that we've changed topics to discuss that though.
> > > >
> > >
> > > As Virtio is extensible (unlike hardware), my approach is to mostly specify
> > > only what also has a PoC user and a use case.
> >
> > If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on day one. Otherwise, as I said in my first response, I can go do that as a separate device and decide that virtio_rtc doesn't meet our needs (especially for maintaining accuracy over LM).
>
> We plan to add
>
> - leap second indication,
>
> - UTC-to-TAI offset,
>
> - clock smearing indication (including the noon-to-noon linear smearing
>   variant which seems to be somewhat popular), and
>
> - clock accuracy indication
>
> to the initial spec and to the PoC implementation.

Sounds good, thanks! I look forward to seeing the new revision. I'm
hoping Julien can give feedback on the clock accuracy parts.

> However, due to resource restrictions, we cannot ourselves add the
> memory-mapped clock to the initial spec.
>
> Everyone is very welcome to contribute the memory-mapped clock to the spec,
> and I think it might then still make it to the initial version.

Makes sense. That is my primary target, so I'm *hoping* we can converge
and get that into your initial spec, otherwise for expediency I'm going
to have to define an ACPI or DT or PCI device of our own and expose the
memory region through that instead.

(Even if I have to do that in the short term to stop the bleeding with
customers' clocks and live migration, I'd still aspire to migrate to a
virtio_rtc version of it in future)


Attachments:
smime.p7s (5.83 kB)

2024-06-15 07:51:12

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping

On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>
> +int viortc_hw_xtstamp_params(u16 *hw_counter, enum clocksource_ids *cs_id)
> +{
> +       *hw_counter = VIRTIO_RTC_COUNTER_ARM_VIRT;

Hm, but what if it isn't? I think you need to put this in
drivers/clocksource/arm_arch_timer.c where it can do something like
kvm_arch_ptp_get_crosststamp() does to decide:

if (arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI)
ptp_counter = KVM_PTP_VIRT_COUNTER;
else
ptp_counter = KVM_PTP_PHYS_COUNTER;


> +       *cs_id = CSID_ARM_ARCH_COUNTER;
> +
> +       return 0;
> +}


Attachments:
smime.p7s (5.83 kB)

2024-06-15 08:01:52

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks

On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>
> +       ret = viortc_hw_xtstamp_params(&hw_counter, &cs_id);
> +       if (ret)
> +               return ret;
> +
> +       ktime_get_snapshot(&history_begin);
> +       if (history_begin.cs_id != cs_id)
> +               return -EOPNOTSUPP;

I think you have to call ktime_get_snapshot() anyway to get a snapshot
from before your crosststamp? But I still don't much like the fact that
you need to use it to work out which cs_id is being used.

Shouldn't get_device_system_crosststamp() pass that to its get_time_fn
as a hint?

On x86, you are likely to find that history_begin.cs_id is the KVM
clock, so this will return -EOPNOTSUPP and userspace will have to fall
back to PTP_SYS_OFFSET. I note the KVM PTP clock actually *converts* a
TSC-based crosststamp to kvmclock µs for itself, so that it can report 
*cs_id = CSID_X86_KVM_CLK. Not sure how I feel about that though. I'm
inclined to suggest that it shouldn't, as anyone who wants accurate
timekeeping shouldn't be using the KVM clock anyway.

But we should at least be relatively consistent about it.


Attachments:
smime.p7s (5.83 kB)

2024-06-15 14:38:31

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> RFC v3 updates
> --------------
>
> This series implements a driver for a virtio-rtc device conforming to spec
> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> the PTP clock driver already present before.
>
> This patch series depends on the patch series "treewide: Use clocksource id
> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> series on top of mainline.
>
> Overview
> --------
>
> This patch series adds the virtio_rtc module, and related bugfixes. The
> virtio_rtc module implements a driver compatible with the proposed Virtio
> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> provides information about current time. The device can provide different
> clocks, e.g. for the UTC or TAI time standards, or for physical time
> elapsed since some past epoch. The driver can read the clocks with simple
> or more accurate methods. Optionally, the driver can set an alarm.
>
> The series first fixes some bugs in the get_device_system_crosststamp()
> interpolation code, which is required for reliable virtio_rtc operation.
> Then, add the virtio_rtc implementation.
>
> For the Virtio RTC device, there is currently a proprietary implementation,
> which has been used for provisional testing.

As discussed before, I don't think it makes sense to design a new high-
precision virtual clock which only gets it right *most* of the time. We
absolutely need to address the issue of live migration.

When live migration occurs, the guest's time precision suffers in two
ways.

First, even when migrating to a supposedly identical host the precise
rate of the underlying counter (TSC, arch counter, etc.) can change
within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural
changes that NTP normally manages to track, this is a *step* change,
potentially from -50PPM to +50PPM from one host to the next.

Second, the accuracy of the counter as preserved across migration is
limited by the accuracy of each host's NTP synchronization. So there is
also a step change in the value of the counter itself.

At the moment of migration, the guest's timekeeping should be
considered invalid. Any previous cross-timestamps are meaningless, and
blindly using the previously-known relationship between the counter and
real time can lead to problems such as corruption in distributed
databases, fines for mis-timestamped transactions, and other general
unhappiness.

We obviously can't get a new timestamp from the virtio_rtc device every
time an application wants to know the time reliably. We don't even want
*system* calls for that, which is why we have it in a vDSO.

We can take the same approach to warning the guest about clock
disruption due to live migration. A shared memory region from the
virtual clock device even be mapped all the way to userspace, for those
applications which need precise and *reliable* time to check a
'disruption' marker in it, and do whatever is appropriate (e.g. abort
transactions and wait for time to resync) when it happens.

We can do better than just letting the guest know about disruption, of
course. We can put the actual counter→realtime relationship into the
memory region too. As well as being able to provide a PTP driver with
this, the applications which care about *reliable* timestamps can mmap
the page directly and use it, vDSO-style, to have accurate timestamps
even from the first cycle after migration.

When disruption is signalled, the guest needs to throw away any
*additional* refinement that it's done with NTP/PTP/PPS/etc. and revert
to knowing nothing more than what the hypervisor advertises here.

Here's a first attempt at defining such a memory structure. For now
I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
think it makes most sense for this to be part of the virtio_rtc spec.
Ultimately it doesn't matter *how* the guest finds the memory region.

Very preliminary QEMU hacks at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
(I still need to do the KVM side helper for actually filling in the
host clock information, converted to the *guest* TSC)

This is the guest side. H aving heckled your assumption that we can use
the virt counter on Arm, I concede I'm doing the same thing for now.
The structure itself is at the end, or see
https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock

From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001
From: David Woodhouse <[email protected]>
Date: Mon, 10 Jun 2024 15:10:11 +0100
Subject: [PATCH] ptp: Add vDSO-style vmclock support

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Signed-off-by: David Woodhouse <[email protected]>
---
drivers/ptp/Kconfig | 13 ++
drivers/ptp/Makefile | 1 +
drivers/ptp/ptp_vmclock.c | 404 +++++++++++++++++++++++++++++++++++
include/uapi/linux/vmclock.h | 102 +++++++++
4 files changed, 520 insertions(+)
create mode 100644 drivers/ptp/ptp_vmclock.c
create mode 100644 include/uapi/linux/vmclock.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..ace6d58c1781 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
To compile this driver as a module, choose M here: the module
will be called ptp_kvm.

+config PTP_1588_CLOCK_VMCLOCK
+ tristate "Virtual machine PTP clock"
+ depends on X86_TSC || ARM_ARCH_TIMER
+ depends on PTP_1588_CLOCK && ACPI
+ default y
+ help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE) += ptp_dte.o
obj-$(CONFIG_PTP_1588_CLOCK_INES) += ptp_ines.o
obj-$(CONFIG_PTP_1588_CLOCK_PCH) += ptp_pch.o
obj-$(CONFIG_PTP_1588_CLOCK_KVM) += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK) += ptp_vmclock.o
obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
ptp-qoriq-y += ptp_qoriq.o
ptp-qoriq-$(CONFIG_DEBUG_FS) += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index 000000000000..3c4e027090c5
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,404 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/ptp_kvm.h>
+#include <linux/platform_device.h>
+#include <linux/miscdevice.h>
+#include <linux/acpi.h>
+#include <uapi/linux/vmclock.h>
+
+#include <linux/ptp_clock_kernel.h>
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+ phys_addr_t phys_addr;
+ struct vmclock_abi *clk;
+ struct miscdevice miscdev;
+ struct ptp_clock_info ptp_clock_info;
+ struct ptp_clock *ptp_clock;
+ enum clocksource_ids cs_id;
+ int index;
+ char *name;
+};
+
+static int vmclock_get_crosststamp(struct vmclock_state *st,
+ struct system_counterval_t *system_counter,
+ struct timespec64 *tspec)
+{
+ uint64_t cycle, delta, seq, delta_s32, delta_s64, delta_s32b;
+ uint64_t delta_lo, delta_hi, period_lo, period_hi, frac_sec_lo, frac_sec_hi;
+ int ret = 0;
+
+#ifdef CONFIG_X86
+ if (check_tsc_unstable())
+ return -EINVAL;
+#endif
+
+ preempt_disable_notrace();
+
+ do {
+
+ seq = st->clk->seq_count & ~1ULL;
+ virt_rmb();
+
+ if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE) {
+ ret = -EINVAL;
+ virt_rmb();
+ continue;
+ }
+
+ cycle = get_cycles();
+
+ tspec->tv_sec = st->clk->utc_time_sec;
+ frac_sec_lo = st->clk->utc_time_frac_sec & 0xffffffff;
+ frac_sec_hi = st->clk->utc_time_frac_sec >> 32;
+
+ delta = cycle - st->clk->counter_value;
+
+ delta_lo = delta & 0xffffffff;
+ delta_hi = delta >> 32;
+ period_lo = st->clk->counter_period_frac_sec & 0xffffffff;
+ period_hi = st->clk->counter_period_frac_sec >> 32;
+
+ /* Delta in units of (second >> 32) */
+ delta_s32 = delta_lo * period_hi;
+
+ /* Avoid overflow by skimming off the full seconds */
+ tspec->tv_sec += delta_s32 >> 32;
+ delta_s32 &= 0xffffffff;
+
+ if (delta_hi) {
+ tspec->tv_sec += delta_hi * period_hi;
+ delta_s32b = delta_hi * period_lo;
+
+ tspec->tv_sec += delta_s32b >> 32;
+ delta_s32 += delta_s32b & 0xffffffff;
+ }
+
+ delta_s32 += frac_sec_hi;
+
+ delta_s64 = (period_lo * delta_lo) + frac_sec_lo;
+ delta_s32 += delta_s64 >> 32;
+ delta_s64 &= 0xffffffff;
+
+ tspec->tv_sec += delta_s32 >> 32;
+ delta_s32 &= 0xffffffff;
+
+ tspec->tv_nsec = ((delta_s32 * NSEC_PER_SEC) +
+ ((delta_s64 * NSEC_PER_SEC) >> 32)) >> 32;
+
+ ret = 0;
+
+ virt_rmb();
+ } while (seq != st->clk->seq_count);
+
+ preempt_enable_notrace();
+
+ if (ret)
+ return ret;
+
+ if (system_counter) {
+ system_counter->cycles = cycle;
+ system_counter->cs_id = st->cs_id;
+ }
+
+ return 0;
+}
+
+static int ptp_vmclock_get_time_fn(ktime_t *device_time,
+ struct system_counterval_t *system_counter,
+ void *ctx)
+{
+ struct vmclock_state *st = ctx;
+ struct timespec64 tspec;
+ int ret;
+
+ ret = vmclock_get_crosststamp(st, system_counter, &tspec);
+ if (!ret)
+ *device_time = timespec64_to_ktime(tspec);
+
+ return ret;
+}
+
+
+static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp,
+ struct system_device_crosststamp *xtstamp)
+{
+ struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+ ptp_clock_info);
+
+ return get_device_system_crosststamp(ptp_vmclock_get_time_fn, st,
+ NULL, xtstamp);
+}
+
+/*
+ * PTP clock operations
+ */
+
+static int ptp_vmclock_adjfine(struct ptp_clock_info *ptp, long delta)
+{
+ return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_adjtime(struct ptp_clock_info *ptp, s64 delta)
+{
+ return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_settime(struct ptp_clock_info *ptp,
+ const struct timespec64 *ts)
+{
+ return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
+{
+ struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+ ptp_clock_info);
+
+ return vmclock_get_crosststamp(st, NULL, ts);
+}
+
+static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
+ struct ptp_clock_request *rq, int on)
+{
+ return -EOPNOTSUPP;
+}
+
+static const struct ptp_clock_info ptp_vmclock_info = {
+ .owner = THIS_MODULE,
+ .max_adj = 0,
+ .n_ext_ts = 0,
+ .n_pins = 0,
+ .pps = 0,
+ .adjfine = ptp_vmclock_adjfine,
+ .adjtime = ptp_vmclock_adjtime,
+ .gettime64 = ptp_vmclock_gettime,
+ .settime64 = ptp_vmclock_settime,
+ .enable = ptp_vmclock_enable,
+ .getcrosststamp = ptp_vmclock_getcrosststamp,
+};
+
+static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
+{
+ struct vmclock_state *st = container_of(fp->private_data,
+ struct vmclock_state, miscdev);
+
+ if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ)
+ return -EROFS;
+
+ if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff)
+ return -EINVAL;
+
+ if (io_remap_pfn_range(vma, vma->vm_start,
+ st->phys_addr >> PAGE_SHIFT, PAGE_SIZE,
+ vma->vm_page_prot))
+ return -EAGAIN;
+
+ return 0;
+}
+
+static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vmclock_state *st = container_of(fp->private_data,
+ struct vmclock_state, miscdev);
+ size_t max_count;
+ int32_t seq;
+
+ if (*ppos >= PAGE_SIZE)
+ return 0;
+
+ max_count = PAGE_SIZE - *ppos;
+ if (count > max_count)
+ count = max_count;
+
+ do {
+ seq = st->clk->seq_count & ~1ULL;
+ virt_rmb();
+
+ if (copy_to_user(buf, ((char *)st->clk) + *ppos, count))
+ return -EFAULT;
+
+ virt_rmb();
+ } while (seq != st->clk->seq_count);
+
+ *ppos += count;
+ return count;
+}
+
+static const struct file_operations vmclock_miscdev_fops = {
+ .mmap = vmclock_miscdev_mmap,
+ .read = vmclock_miscdev_read,
+};
+
+/* module operations */
+
+static void vmclock_remove(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct vmclock_state *st = dev_get_drvdata(dev);
+
+ if (st->ptp_clock)
+ ptp_clock_unregister(st->ptp_clock);
+
+ if (st->miscdev.minor == MISC_DYNAMIC_MINOR)
+ misc_deregister(&st->miscdev);
+}
+
+static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st)
+{
+ struct acpi_buffer parsed = { ACPI_ALLOCATE_BUFFER };
+ struct acpi_device *adev = ACPI_COMPANION(dev);
+ union acpi_object *obj;
+ acpi_status status;
+
+ status = acpi_evaluate_object(adev->handle, "ADDR", NULL, &parsed);
+ if (ACPI_FAILURE(status)) {
+ ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+ return -ENODEV;
+ }
+ obj = parsed.pointer;
+ if (!obj || obj->type != ACPI_TYPE_PACKAGE || obj->package.count != 2 ||
+ obj->package.elements[0].type != ACPI_TYPE_INTEGER ||
+ obj->package.elements[1].type != ACPI_TYPE_INTEGER)
+ return -EINVAL;
+
+ st->phys_addr = (obj->package.elements[0].integer.value << 0) |
+ (obj->package.elements[1].integer.value << 32);
+
+ return 0;
+}
+
+static void vmclock_put_idx(void *data)
+{
+ struct vmclock_state *st = data;
+
+ ida_free(&vmclock_ida, st->index);
+}
+
+static int vmclock_probe(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct vmclock_state *st;
+ int ret;
+
+ st = devm_kzalloc(dev, sizeof (*st), GFP_KERNEL);
+ if (!st)
+ return -ENOMEM;
+
+ if (has_acpi_companion(dev))
+ ret = vmclock_probe_acpi(dev, st);
+ else
+ ret = -EINVAL; /* Only ACPI for now */
+
+ if (ret) {
+ dev_info(dev, "Failed to obtain physical address: %d\n", ret);
+ goto out;
+ }
+
+ st->clk = devm_memremap(dev, st->phys_addr, sizeof(*st->clk),
+ MEMREMAP_WB);
+ if (IS_ERR(st->clk)) {
+ ret = PTR_ERR(st->clk);
+ dev_info(dev, "failed to map shared memory\n");
+ st->clk = NULL;
+ goto out;
+ }
+
+ if (st->clk->magic != VMCLOCK_MAGIC ||
+ st->clk->size < sizeof(*st->clk) ||
+ st->clk->version != 1) {
+ dev_info(dev, "vmclock magic fields invalid\n");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (IS_ENABLED(CONFIG_ARM64) &&
+ st->clk->counter_id == VMCLOCK_COUNTER_ARM_VCNT) {
+ /* Can we check it's the virtual counter? */
+ st->cs_id = CSID_ARM_ARCH_COUNTER;
+ } else if (IS_ENABLED(CONFIG_X86) &&
+ st->clk->counter_id == VMCLOCK_COUNTER_X86_TSC) {
+ st->cs_id = CSID_X86_TSC;
+ }
+
+ ret = ida_alloc(&vmclock_ida, GFP_KERNEL);
+ if (ret < 0)
+ goto out;
+
+ st->index = ret;
+ ret = devm_add_action_or_reset(&pdev->dev, vmclock_put_idx, st);
+ if (ret)
+ goto out;
+
+ st->name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "vmclock%d", st->index);
+ if (!st->name) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* If the structure is big enough, it can be mapped to userspace */
+ if (st->clk->size >= PAGE_SIZE) {
+ st->miscdev.minor = MISC_DYNAMIC_MINOR;
+ st->miscdev.fops = &vmclock_miscdev_fops;
+ st->miscdev.name = st->name;
+
+ ret = misc_register(&st->miscdev);
+ if (ret)
+ goto out;
+ }
+
+ /* If there is valid clock information, register a PTP clock */
+ if (st->cs_id) {
+ st->ptp_clock_info = ptp_vmclock_info;
+ strncpy(st->ptp_clock_info.name, st->name, sizeof(st->ptp_clock_info.name));
+ st->ptp_clock = ptp_clock_register(&st->ptp_clock_info, dev);
+
+ if (IS_ERR(st->ptp_clock)) {
+ ret = PTR_ERR(st->ptp_clock);
+ st->ptp_clock = NULL;
+ vmclock_remove(pdev);
+ goto out;
+ }
+ }
+
+ dev_set_drvdata(dev, st);
+
+ out:
+ return ret;
+}
+
+static const struct acpi_device_id vmclock_acpi_ids[] = {
+ { "VMCLOCK", 0 },
+ {}
+};
+MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids);
+
+static struct platform_driver vmclock_platform_driver = {
+ .probe = vmclock_probe,
+ .remove_new = vmclock_remove,
+ .driver = {
+ .name = "vmclock",
+ .acpi_match_table = vmclock_acpi_ids,
+ },
+};
+
+module_platform_driver(vmclock_platform_driver)
+
+MODULE_AUTHOR("David Woodhouse <[email protected]>");
+MODULE_DESCRIPTION("PTP clock using VMCLOCK");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
new file mode 100644
index 000000000000..cac24536c5c8
--- /dev/null
+++ b/include/uapi/linux/vmclock.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
+
+/*
+ * This structure provides a vDSO-style clock to VM guests, exposing the
+ * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
+ * counter, etc.) and real time. It is designed to address the problem of
+ * live migration, which other clock enlightenments do not.
+ *
+ * When a guest is live migrated, this affects the clock in two ways.
+ *
+ * First, even between identical hosts the actual frequency of the underlying
+ * counter will change within the tolerances of its specification (typically
+ * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
+ * same host, but can be tracked by NTP as it generally varies slowly. With
+ * live migration there is a step change in the frequency, with no warning.
+ *
+ * Second, there may be a step change in the value of the counter itself, as
+ * its accuracy is limited by the precision of the NTP synchronization on the
+ * source and destination hosts.
+ *
+ * So any calibration (NTP, PTP, etc.) which the guest has done on the source
+ * host before migration is invalid, and needs to be redone on the new host.
+ *
+ * In its most basic mode, this structure provides only an indication to the
+ * guest that live migration has occurred. This allows the guest to know that
+ * its clock is invalid and take remedial action. For applications that need
+ * reliable accurate timestamps (e.g. distributed databases), the structure
+ * can be mapped all the way to userspace. This allows the application to see
+ * directly for itself that the clock is disrupted and take appropriate
+ * action, even when using a vDSO-style method to get the time instead of a
+ * system call.
+ *
+ * In its more advanced mode. this structure can also be used to expose the
+ * precise relationship of the CPU counter to real time, as calibrated by the
+ * host. This means that userspace applications can have accurate time
+ * immediately after live migration, rather than having to pause operations
+ * and wait for NTP to recover. This mode does, of course, rely on the
+ * counter being reliable and consistent across CPUs.
+ */
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+struct vmclock_abi {
+ uint32_t magic;
+#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
+ uint16_t size; /* Size of page containing this structure */
+ uint16_t version; /* 1 */
+
+ /* Sequence lock. Low bit means an update is in progress. */
+ uint64_t seq_count;
+
+ /*
+ * This field changes to another non-repeating value when the CPU
+ * counter is disrupted, for example on live migration.
+ */
+ uint64_t disruption_marker;
+
+ uint8_t clock_status;
+#define VMCLOCK_STATUS_UNKNOWN 0
+#define VMCLOCK_STATUS_INITIALIZING 1
+#define VMCLOCK_STATUS_SYNCHRONIZED 2
+#define VMCLOCK_STATUS_FREERUNNING 3
+#define VMCLOCK_STATUS_UNRELIABLE 4
+
+ uint8_t counter_id;
+#define VMCLOCK_COUNTER_INVALID 0
+#define VMCLOCK_COUNTER_X86_TSC 1
+#define VMCLOCK_COUNTER_ARM_VCNT 2
+
+ uint8_t pad[3];
+
+ /*
+ * UTC on its own is non-monotonic and ambiguous.
+ *
+ * Inform the guest about the TAI offset, so that it can have an
+ * actual monotonic and reliable time reference if it needs it.
+ *
+ * Also indicate a nearby leap second, if one exists. Unlike in
+ * NTP, this may indicate a leap second in the past in order to
+ * indicate that it *has* been taken into account.
+ */
+ int8_t leapsecond_direction;
+ int16_t tai_offset_sec;
+ uint64_t leapsecond_counter_value;
+ uint64_t leapsecond_tai_time;
+
+ /* Paired values of counter and UTC at a given point in time. */
+ uint64_t counter_value;
+ uint64_t utc_time_sec;
+ uint64_t utc_time_frac_sec;
+
+ /* Counter frequency, and error margin. Units of (second >> 64) */
+ uint64_t counter_period_frac_sec;
+ uint64_t counter_period_error_rate_frac_sec;
+
+ /* Error margin of UTC reading above (± picoseconds) */
+ uint64_t utc_time_maxerror_picosec;
+};
--
2.44.0



Attachments:
smime.p7s (5.83 kB)