2008-11-19 12:09:21

by Patrick Ohly

Subject: hardware time stamping with extra skb->hwtstamp

This patch series was discussed before on linux-netdev ("hardware
time stamping + igb example implementation"). Since then I have
rebased against net-next and addressed all comments sent so far,
except Octavian's suggestion to include more information in the
packet which is bounced back to the application.

As suggested by David, I'm now also including linux-kernel because:
* patch 2 adds a new user space API (documentation and example
program included, but no man page patch yet)
* patch 8 extends the clocksource struct (mostly adds code, but
also adds one branch to reading the clock, which may affect
gettimeofday)
* patch 10 adds generic code for time synchronization (not
network specific, so people not subscribed to linux-netdev
might have comments)

The open question on linux-netdev was whether struct sk_buff
should be extended to hold the additional hardware time
stamp. The previous patch avoided that at the cost of much more
complicated code and side effects on normal time stamping.

This patch now adds an 8-byte field unconditionally. The
implementation is a lot more straightforward. The user space
API was already designed to cover this case, so it remained
unchanged.

There's one unsolved problem, though: time synchronization with
PTP (the use case I'm working on) requires a transformation of
raw hardware time stamps into system time. Currently this is done
at the socket level by finding the device which created the time
stamp and letting it do the transformation. This fails for
incoming packets because skb->rt points to the "lo" device.

Perhaps the interface number can be used to find the real
hardware device. Alternatively the conversion could be done when
generating the time stamp (might also be more accurate), but then
another 8-byte field is needed. Delta encoding won't help because
one cannot assume that hardware time stamps track system time
closely enough.

I'm posting the patch despite this problem so that the discussion
can move forward. There are other TODOs anyway; in particular the
igb extension is just a proof-of-concept.

--
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.


2008-11-19 12:09:54

by Patrick Ohly

Subject: [RFC PATCH 01/11] put_cmsg_compat + SO_TIMESTAMP[NS]: use same name for value as caller

In __sock_recv_timestamp() the names SCM_TIMESTAMP[NS] are used. They
have the same value as SO_TIMESTAMP[NS], so this is a purely cosmetic change.
---
net/compat.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/compat.c b/net/compat.c
index 67fb6a3..6ce1a1c 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -226,14 +226,14 @@ int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *dat
return 0; /* XXX: return error? check spec. */
}

- if (level == SOL_SOCKET && type == SO_TIMESTAMP) {
+ if (level == SOL_SOCKET && type == SCM_TIMESTAMP) {
struct timeval *tv = (struct timeval *)data;
ctv.tv_sec = tv->tv_sec;
ctv.tv_usec = tv->tv_usec;
data = &ctv;
len = sizeof(ctv);
}
- if (level == SOL_SOCKET && type == SO_TIMESTAMPNS) {
+ if (level == SOL_SOCKET && type == SCM_TIMESTAMPNS) {
struct timespec *ts = (struct timespec *)data;
cts.tv_sec = ts->tv_sec;
cts.tv_nsec = ts->tv_nsec;
--
1.6.0.4

2008-11-19 12:10:26

by Patrick Ohly

Subject: [RFC PATCH 02/11] net: new user space API for time stamping of incoming and outgoing packets

User space can request hardware and/or software time stamping.
Reporting of the result(s) via a new control message is enabled
separately for each field in the message because some of the
fields may require additional computation and thus cause overhead.

When a TX timestamp operation is requested, the TX skb will be cloned
and the clone will be time stamped (in hardware or software) and added
to the error queue of the socket associated with the skb, if there is
one.

The actual TX timestamp will reach userspace as an RX timestamp on the
cloned packet. If timestamping is requested and the device driver does
not do it (the driver could potentially have used hardware
timestamping), it is done in software after the device's
hard_start_xmit routine.

TODO: add SO_TIMESTAMPING define also to other platforms
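
For orientation, the core of the new API from user space looks roughly
like this (a condensed sketch only; the setup of sock, buf, addr and msg
is omitted, and the complete, buildable version is the timestamping.c
example added by this patch):

    int flags = SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_TX_SOFTWARE |
                SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
    struct cmsghdr *cmsg;

    /* request TX time stamps and their reporting via control messages */
    if (setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags)) < 0)
        perror("setsockopt SO_TIMESTAMPING");

    sendto(sock, buf, len, 0, (struct sockaddr *)&addr, sizeof(addr));

    /* the sent packet is bounced back on the error queue with time stamps attached */
    if (recvmsg(sock, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) >= 0)
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg))
            if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_TIMESTAMPING) {
                struct timespec *ts = (struct timespec *)CMSG_DATA(cmsg);
                /* ts[0] = software, ts[1] = transformed hardware, ts[2] = raw hardware */
            }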
---
Documentation/networking/timestamping.txt | 178 ++++++++
.../networking/timestamping/timestamping.c | 469 ++++++++++++++++++++
arch/x86/include/asm/socket.h | 3 +
include/linux/errqueue.h | 1 +
include/linux/sockios.h | 3 +
include/net/timestamping.h | 95 ++++
6 files changed, 749 insertions(+), 0 deletions(-)
create mode 100644 Documentation/networking/timestamping.txt
create mode 100644 Documentation/networking/timestamping/timestamping.c
create mode 100644 include/net/timestamping.h

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
new file mode 100644
index 0000000..3ae60b5
--- /dev/null
+++ b/Documentation/networking/timestamping.txt
@@ -0,0 +1,178 @@
+The existing interfaces for getting network packets time stamped are:
+
+* SO_TIMESTAMP
+ Generate time stamp for each incoming packet using the (not necessarily
+ monotonic!) system time. The result is returned via recvmsg() in a
+ control message as timeval (usec resolution).
+
+* SO_TIMESTAMPNS
+ Same time stamping mechanism as SO_TIMESTAMP, but returns result as
+ timespec (nsec resolution).
+
+* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
+ Only for multicasts: approximate send time stamp by receiving the looped
+ packet and using its receive time stamp.
+
+The following interface complements the existing ones: receive time
+stamps can be generated and returned for arbitrary packets and much
+closer to the point where the packet is really sent. Time stamps can
+be generated in software (as before) or in hardware (if the hardware
+has such a feature).
+
+SO_TIMESTAMPING:
+
+Instructs the socket layer which kind of information is wanted. The
+parameter is an integer with some of the following bits set. Setting
+other bits is an error and doesn't change the current state.
+
+SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
+SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or
+ fails, then do it in software
+SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp
+ as generated by the hardware
+SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or
+ fails, then do it in software
+SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
+SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to
+ the system time base
+SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in
+ software
+
+SOF_TIMESTAMPING_TX/RX determine how time stamps are generated.
+SOF_TIMESTAMPING_RAW/SYS determine how they are reported in the
+following control message:
+ struct scm_timestamping {
+ struct timespec systime;
+ struct timespec hwtimetrans;
+ struct timespec hwtimeraw;
+ };
+
+recvmsg() can be used to get this control message for regular incoming
+packets. For send time stamps the outgoing packet is looped back to
+the socket's error queue with the send time stamp(s) attached. It can
+be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the
+original outgoing packet data including all headers prepended down to
+and including the link layer, the scm_timestamping control message and
+a sock_extended_err control message with ee_errno==ENOMSG and
+ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
+bounced packet is ready for reading as far as select() is concerned.
+
+All three values correspond to the same event in time, but were
+generated in different ways. Each of these values may be empty (= all
+zero), in which case no such value was available. If the application
+is not interested in some of these values, they can be left blank to
+avoid the potential overhead of calculating them.
+
+systime is the value of the system time at that moment. This
+corresponds to the value also returned via SO_TIMESTAMP[NS]. If the
+time stamp was generated by hardware, then this field is
+empty. Otherwise it is filled in if SOF_TIMESTAMPING_SOFTWARE is
+set.
+
+hwtimeraw is the original hardware time stamp. Filled in if
+SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its
+relation to system time should be made.
+
+hwtimetrans is the hardware time stamp transformed so that it
+corresponds as closely as possible to system time. This correlation is
+not perfect; as a consequence, the order of packets received via
+different NICs sorted by their hwtimetrans may differ from the order in
+which they were actually received. hwtimetrans may be non-monotonic
+even for the same NIC.
+Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support
+by the network device and will be empty without that support.
+
+
+SIOCSHWTSTAMP:
+
+Hardware time stamping must also be initialized for each device driver
+that is expected to do hardware time stamping. The parameter is:
+
+struct hwtstamp_config {
+ int flags; /**< no flags defined right now, must be zero */
+ int tx_type; /**< HWTSTAMP_TX_* */
+ int rx_filter_type; /**< HWTSTAMP_FILTER_* */
+};
+
+Desired behavior is passed into the kernel and to a specific device by
+calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
+ifr_data points to a struct hwtstamp_config. The tx_type and
+rx_filter_type are hints to the driver about what it is expected to do. If
+the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets.
+
+A driver which supports hardware time stamping shall update the struct
+with the actual, possibly more permissive configuration. If the
+requested packets cannot be time stamped, then nothing should be
+changed and ERANGE shall be returned (in contrast to EINVAL, which
+indicates that SIOCSHWTSTAMP is not supported at all).
+
+Only a process with admin rights may change the configuration. User
+space is responsible for ensuring that multiple processes don't interfere
+with each other and that the settings are reset.
+
+/** possible values for hwtstamp_config->tx_type */
+enum {
+ /**
+ * no outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done
+ */
+ HWTSTAMP_TX_OFF,
+
+ /**
+ * enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet
+ */
+ HWTSTAMP_TX_ON,
+};
+
+/** possible values for hwtstamp_config->rx_filter_type */
+enum {
+ /** time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /** time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /** return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /** PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+
+ ...
+};
+
+
+DEVICE IMPLEMENTATION
+
+A driver which supports hardware time stamping must support the
+SIOCSHWTSTAMP ioctl. Time stamps for received packets must be stored
+in the skb with skb_hwtstamp_set().
+
+Time stamps for outgoing packets are to be generated as follows:
+- In hard_start_xmit(), check if skb_hwtstamp_check_tx_hardware()
+ returns non-zero. If yes, then the driver is expected
+ to do hardware time stamping.
+- If this is possible for the skb and requested, then declare
+ that the driver is doing the time stamping by calling
+ skb_hwtstamp_tx_in_progress(). A driver not supporting
+ hardware time stamping doesn't do that. A driver must never
+ touch sk_buff::tstamp! It is used to store how time stamping
+ for an outgoing packet is to be done.
+- As soon as the driver has sent the packet and/or obtained a
+ hardware time stamp for it, it passes the time stamp back by
+ calling skb_hwtstamp_tx() with the original skb, the raw
+ hardware time stamp and a handle to the device (necessary
+ to convert the hardware time stamp to system time). If obtaining
+ the hardware time stamp somehow fails, then the driver should
+ not fall back to software time stamping. The rationale is that
+ this would occur at a later time in the processing pipeline
+ than other software time stamping and therefore could lead
+ to unexpected deltas between time stamps.
+- If the driver did not call skb_hwtstamp_tx_in_progress(), then
+ dev_hard_start_xmit() checks whether software time stamping
+ is wanted as fallback and potentially generates the time stamp.
diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c
new file mode 100644
index 0000000..7ffc466
--- /dev/null
+++ b/Documentation/networking/timestamping/timestamping.c
@@ -0,0 +1,469 @@
+/**
+ * This program demonstrates how the various time stamping features in
+ * the Linux kernel work. It emulates the behavior of a PTP
+ * implementation in stand-alone master mode by sending PTPv1 Sync
+ * multicasts once every second. It looks for similar packets, but
+ * beyond that doesn't actually implement PTP.
+ *
+ * Outgoing packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support.
+ *
+ * Incoming packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support, SIOCGSTAMP[NS] (per-socket time stamp) and
+ * SO_TIMESTAMP[NS].
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <string.h>
+
+#include <sys/time.h>
+#include <sys/socket.h>
+#include <sys/select.h>
+#include <sys/ioctl.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+
+#include "asm/types.h"
+#include "net/timestamping.h"
+#include "linux/errqueue.h"
+
+#ifndef SO_TIMESTAMPNS
+# define SO_TIMESTAMPNS 35
+#endif
+
+#ifndef SIOCGSTAMPNS
+# define SIOCGSTAMPNS 0x8907
+#endif
+
+static void usage(const char *error)
+{
+ if (error)
+ printf("invalid option: %s\n", error);
+ printf("timestamping interface (IP_MULTICAST_LOOP|SO_TIMESTAMP|SO_TIMESTAMPNS|SOF_TIMESTAMPING_TX_HARDWARE|SOF_TIMESTAMPING_TX_SOFTWARE|SOF_TIMESTAMPING_RX_HARDWARE|SOF_TIMESTAMPING_RX_SOFTWARE|SOF_TIMESTAMPING_SOFTWARE|SOF_TIMESTAMPING_SYS_HARDWARE|SOF_TIMESTAMPING_RAW_HARDWARE|SIOCGSTAMP|SIOCGSTAMPNS)*\n");
+ exit(1);
+}
+
+static void bail(const char *error)
+{
+ printf("%s: %s\n", error, strerror(errno));
+ exit(1);
+}
+
+static const unsigned char sync[] = {
+ 0x00,0x01, 0x00,0x01,
+ 0x5f,0x44, 0x46,0x4c,
+ 0x54,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x01,0x01,
+
+ /* fake uuid */
+ 0x00,0x01,
+ 0x02,0x03, 0x04,0x05,
+
+ 0x00,0x01, 0x00,0x37,
+ 0x00,0x00, 0x00,0x08,
+ 0x00,0x00, 0x00,0x00,
+ 0x49,0x05, 0xcd,0x01,
+ 0x29,0xb1, 0x8d,0xb0,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x01,
+
+ /* fake uuid */
+ 0x00,0x01,
+ 0x02,0x03, 0x04,0x05,
+
+ 0x00,0x00, 0x00,0x37,
+ 0x00,0x00, 0x00,0x04,
+ 0x44,0x46, 0x4c,0x54,
+ 0x00,0x00, 0xf0,0x60,
+ 0x00,0x01, 0x00,0x00,
+ 0x00,0x00, 0x00,0x01,
+ 0x00,0x00, 0xf0,0x60,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x04,
+ 0x44,0x46, 0x4c,0x54,
+ 0x00,0x01,
+
+ /* fake uuid */
+ 0x00,0x01,
+ 0x02,0x03, 0x04,0x05,
+
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00
+};
+
+static void sendpacket(int sock, struct sockaddr *addr, socklen_t addr_len)
+{
+ struct timeval now;
+ int res;
+
+ res = sendto(sock, sync, sizeof(sync), 0,
+ addr, addr_len);
+ gettimeofday(&now, 0);
+ if (res < 0)
+ printf("%s: %s\n", "send", strerror(errno));
+ else
+ printf("%ld.%06ld: sent %d bytes\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res);
+}
+
+static void recvpacket(int sock, int recvmsg_flags,
+ int siocgstamp, int siocgstampns)
+{
+ char data[256];
+ struct timeval now;
+ struct msghdr msg;
+ struct iovec entry;
+ struct sockaddr_in from_addr;
+ struct {
+ struct cmsghdr cm;
+ char control[512];
+ } control;
+ int res;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_iov = &entry;
+ msg.msg_iovlen = 1;
+ entry.iov_base = data;
+ entry.iov_len = sizeof(data);
+ msg.msg_name = (caddr_t)&from_addr;
+ msg.msg_namelen = sizeof(from_addr);
+ msg.msg_control = &control;
+ msg.msg_controllen = sizeof(control);
+
+ res = recvmsg(sock, &msg, recvmsg_flags|MSG_DONTWAIT);
+ gettimeofday(&now, 0);
+ if (res < 0) {
+ printf("%s %s: %s\n",
+ "recvmsg",
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ strerror(errno));
+ } else {
+ struct cmsghdr *cmsg;
+ struct timeval tv;
+ struct timespec ts;
+
+ printf("%ld.%06ld: received %s data, %d bytes from %s, %d bytes control messages\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ res,
+ inet_ntoa(from_addr.sin_addr),
+ msg.msg_controllen);
+ for (cmsg = CMSG_FIRSTHDR(&msg);
+ cmsg;
+ cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+ printf(" cmsg len %d: ", cmsg->cmsg_len);
+ switch (cmsg->cmsg_level) {
+ case SOL_SOCKET:
+ printf("SOL_SOCKET ");
+ switch (cmsg->cmsg_type) {
+ case SO_TIMESTAMP: {
+ struct timeval *stamp =
+ (struct timeval *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMP %ld.%06ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_usec);
+ break;
+ }
+ case SO_TIMESTAMPNS: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPNS %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ case SO_TIMESTAMPING: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPING ");
+ printf("SW %ld.%09ld ",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ stamp++;
+ printf("HW transformed %ld.%09ld ",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ stamp++;
+ printf("HW raw %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ case IPPROTO_IP:
+ printf("IPPROTO_IP ");
+ switch (cmsg->cmsg_type) {
+ case IP_RECVERR: {
+ struct sock_extended_err *err =
+ (struct sock_extended_err *)CMSG_DATA(cmsg);
+ printf("IP_RECVERR ee_errno '%s' ee_origin %d => %s",
+ strerror(err->ee_errno),
+ err->ee_origin,
+#ifdef SO_EE_ORIGIN_TIMESTAMPING
+ err->ee_origin == SO_EE_ORIGIN_TIMESTAMPING ?
+ "bounced packet" : "unexpected origin"
+#else
+ "probably SO_EE_ORIGIN_TIMESTAMPING"
+#endif
+ );
+ if (res < sizeof(sync))
+ printf(" => truncated data?!");
+ else if (!memcmp(sync, data + res - sizeof(sync),
+ sizeof(sync)))
+ printf(" => GOT OUR DATA BACK (HURRAY!)");
+ break;
+ }
+ case IP_PKTINFO: {
+ struct in_pktinfo *pktinfo =
+ (struct in_pktinfo *)CMSG_DATA(cmsg);
+ printf("IP_PKTINFO interface index %u",
+ pktinfo->ipi_ifindex);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ default:
+ printf("level %d type %d",
+ cmsg->cmsg_level,
+ cmsg->cmsg_type);
+ break;
+ }
+ printf("\n");
+ }
+
+ if (siocgstamp) {
+ if (ioctl(sock, SIOCGSTAMP, &tv))
+ printf(" %s: %s\n", "SIOCGSTAMP", strerror(errno));
+ else
+ printf("SIOCGSTAMP %ld.%06ld\n",
+ (long)tv.tv_sec,
+ (long)tv.tv_usec);
+ }
+ if (siocgstampns) {
+ if (ioctl(sock, SIOCGSTAMPNS, &ts))
+ printf(" %s: %s\n", "SIOCGSTAMPNS", strerror(errno));
+ else
+ printf("SIOCGSTAMPNS %ld.%09ld\n",
+ (long)ts.tv_sec,
+ (long)ts.tv_nsec);
+ }
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int so_timestamping_flags = 0;
+ int so_timestamp = 0;
+ int so_timestampns = 0;
+ int siocgstamp = 0;
+ int siocgstampns = 0;
+ int ip_multicast_loop = 0;
+ char *interface;
+ int i;
+ int enabled = 1;
+ int sock;
+ struct ifreq device;
+ struct ifreq hwtstamp;
+ struct hwtstamp_config hwconfig, hwconfig_requested;
+ struct sockaddr_in addr;
+ struct ip_mreq imr;
+ struct in_addr iaddr;
+ int val;
+ socklen_t len;
+ struct timeval next;
+
+ if (argc < 2)
+ usage(0);
+ interface = argv[1];
+
+ for (i = 2; i < argc; i++ ) {
+ if (!strcasecmp(argv[i], "SO_TIMESTAMP")) {
+ so_timestamp = 1;
+ } else if (!strcasecmp(argv[i], "SO_TIMESTAMPNS")) {
+ so_timestampns = 1;
+ } else if (!strcasecmp(argv[i], "SIOCGSTAMP")) {
+ siocgstamp = 1;
+ } else if (!strcasecmp(argv[i], "SIOCGSTAMPNS")) {
+ siocgstampns = 1;
+ } else if (!strcasecmp(argv[i], "IP_MULTICAST_LOOP")) {
+ ip_multicast_loop = 1;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_HARDWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_SOFTWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_SOFTWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_HARDWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_SOFTWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_SOFTWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SOFTWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_SOFTWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SYS_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_SYS_HARDWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RAW_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_RAW_HARDWARE;
+ } else {
+ usage(argv[i]);
+ }
+ }
+
+ sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
+ if (sock < 0)
+ bail("socket");
+
+ memset(&device, 0, sizeof(device));
+ strncpy(device.ifr_name, interface, sizeof(device.ifr_name));
+ if (ioctl(sock, SIOCGIFADDR, &device) < 0)
+ bail("getting interface IP address");
+
+ memset(&hwtstamp, 0, sizeof(hwtstamp));
+ strncpy(hwtstamp.ifr_name, interface, sizeof(hwtstamp.ifr_name));
+ hwtstamp.ifr_data = (void *)&hwconfig;
+ memset(&hwconfig, 0, sizeof(hwconfig));
+ hwconfig.tx_type =
+ (so_timestamping_flags & SOF_TIMESTAMPING_TX_HARDWARE) ?
+ HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
+ hwconfig.rx_filter_type =
+ (so_timestamping_flags & SOF_TIMESTAMPING_RX_HARDWARE) ?
+ HWTSTAMP_FILTER_PTP_V1_L4_SYNC : HWTSTAMP_FILTER_NONE;
+ hwconfig_requested = hwconfig;
+ if (ioctl(sock, SIOCSHWTSTAMP, &hwtstamp) < 0) {
+ if ((errno == EINVAL || errno == ENOTSUP) &&
+ hwconfig_requested.tx_type == HWTSTAMP_TX_OFF &&
+ hwconfig_requested.rx_filter_type == HWTSTAMP_FILTER_NONE)
+ printf("SIOCSHWTSTAMP: disabling hardware time stamping not possible\n");
+ else
+ bail("SIOCSHWTSTAMP");
+ }
+ printf("SIOCSHWTSTAMP: tx_type %d requested, got %d; rx_filter_type %d requested, got %d\n",
+ hwconfig_requested.tx_type, hwconfig.tx_type,
+ hwconfig_requested.rx_filter_type, hwconfig.rx_filter_type);
+
+ /* bind to PTP port */
+ addr.sin_family = AF_INET;
+ addr.sin_addr.s_addr = htonl(INADDR_ANY);
+ addr.sin_port = htons(319 /* PTP event port */);
+ if (bind(sock, (struct sockaddr*)&addr, sizeof(struct sockaddr_in)) < 0)
+ bail("bind");
+
+ /* set multicast group for outgoing packets */
+ inet_aton("224.0.1.130", &iaddr); /* alternate PTP domain 1 */
+ addr.sin_addr = iaddr;
+ imr.imr_multiaddr.s_addr = iaddr.s_addr;
+ imr.imr_interface.s_addr = ((struct sockaddr_in *)&device.ifr_addr)->sin_addr.s_addr;
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF,
+ &imr.imr_interface.s_addr, sizeof(struct in_addr)) < 0)
+ bail("set multicast");
+
+ /* join multicast group, loop our own packet */
+ if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &imr, sizeof(struct ip_mreq)) < 0)
+ bail("join multicast group");
+
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_LOOP, &ip_multicast_loop, sizeof(enabled)) < 0) {
+ bail("loop multicast");
+ }
+
+ /* set socket options for time stamping */
+ if (so_timestamp &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMP");
+
+ if (so_timestampns &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMPNS");
+
+ if (so_timestamping_flags &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &so_timestamping_flags, sizeof(so_timestamping_flags)) < 0)
+ bail("setsockopt SO_TIMESTAMPING");
+
+ /* request IP_PKTINFO for debugging purposes */
+ if (setsockopt(sock, SOL_IP, IP_PKTINFO, &enabled, sizeof(enabled)) < 0)
+ printf("%s: %s\n", "setsockopt IP_PKTINFO", strerror(errno));
+
+ /* verify socket options */
+ len = sizeof(val);
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMP", strerror(errno));
+ else
+ printf("SO_TIMESTAMP %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPNS", strerror(errno));
+ else
+ printf("SO_TIMESTAMPNS %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &val, &len) < 0) {
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPING", strerror(errno));
+ } else {
+ printf("SO_TIMESTAMPING %d\n", val);
+ if (val != so_timestamping_flags)
+ printf(" not the expected value %d\n", so_timestamping_flags);
+ }
+
+ /* send packets forever every five seconds */
+ gettimeofday(&next, 0);
+ next.tv_sec = (next.tv_sec + 1) / 5 * 5;
+ next.tv_usec = 0;
+ while(1) {
+ struct timeval now;
+ struct timeval delta;
+ long delta_us;
+ int res;
+ fd_set readfs, errorfs;
+
+ gettimeofday(&now, 0);
+ delta_us = (long)(next.tv_sec - now.tv_sec) * 1000000 +
+ (long)(next.tv_usec - now.tv_usec);
+ if (delta_us > 0) {
+ /* continue waiting for timeout or data */
+ delta.tv_sec = delta_us / 1000000;
+ delta.tv_usec = delta_us % 1000000;
+
+ FD_ZERO(&readfs);
+ FD_ZERO(&errorfs);
+ FD_SET(sock, &readfs);
+ FD_SET(sock, &errorfs);
+ printf("%ld.%06ld: select %ldus\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ delta_us);
+ res = select(sock + 1, &readfs, 0, &errorfs, &delta);
+ gettimeofday(&now, 0);
+ printf("%ld.%06ld: select returned: %d, %s\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res,
+ res < 0 ? strerror(errno) : "success");
+ if (res > 0) {
+ if (FD_ISSET(sock, &readfs))
+ printf("ready for reading\n");
+ if (FD_ISSET(sock, &errorfs))
+ printf("has error\n");
+ recvpacket(sock, 0,
+ siocgstamp,
+ siocgstampns);
+ recvpacket(sock, MSG_ERRQUEUE,
+ siocgstamp,
+ siocgstampns);
+ }
+ } else {
+ /* write one packet */
+ sendpacket(sock, (struct sockaddr *)&addr, sizeof(addr));
+ next.tv_sec += 5;
+ continue;
+ }
+ }
+
+ return 0;
+}
diff --git a/arch/x86/include/asm/socket.h b/arch/x86/include/asm/socket.h
index 8ab9cc8..79e1f6c 100644
--- a/arch/x86/include/asm/socket.h
+++ b/arch/x86/include/asm/socket.h
@@ -54,4 +54,7 @@

#define SO_MARK 36

+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING
+
#endif /* _ASM_X86_SOCKET_H */
diff --git a/include/linux/errqueue.h b/include/linux/errqueue.h
index 92f8d4f..86d88dd 100644
--- a/include/linux/errqueue.h
+++ b/include/linux/errqueue.h
@@ -16,6 +16,7 @@ struct sock_extended_err
#define SO_EE_ORIGIN_LOCAL 1
#define SO_EE_ORIGIN_ICMP 2
#define SO_EE_ORIGIN_ICMP6 3
+#define SO_EE_ORIGIN_TIMESTAMPING 4

#define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))

diff --git a/include/linux/sockios.h b/include/linux/sockios.h
index abef759..209ee22 100644
--- a/include/linux/sockios.h
+++ b/include/linux/sockios.h
@@ -122,6 +122,9 @@
#define SIOCBRADDIF 0x89a2 /* add interface to bridge */
#define SIOCBRDELIF 0x89a3 /* remove interface from bridge */

+/* hardware time stamping: parameters in net/timestamping.h */
+#define SIOCSHWTSTAMP 0x89b0
+
/* Device private ioctl calls */

/*
diff --git a/include/net/timestamping.h b/include/net/timestamping.h
new file mode 100644
index 0000000..c271caa
--- /dev/null
+++ b/include/net/timestamping.h
@@ -0,0 +1,95 @@
+#ifndef _NET_TIMESTAMPING_H
+#define _NET_TIMESTAMPING_H
+
+#include <linux/socket.h> /* for SO_TIMESTAMPING */
+
+/**
+ * user space linux/socket.h might not have these defines yet:
+ * provide fallback
+ */
+#if !defined(__KERNEL__) && !defined(SO_TIMESTAMPING)
+# define SO_TIMESTAMPING 37
+# define SCM_TIMESTAMPING SO_TIMESTAMPING
+#endif
+
+/** %SO_TIMESTAMPING gets an integer bit field comprised of these values */
+enum {
+ SOF_TIMESTAMPING_TX_HARDWARE = (1<<0),
+ SOF_TIMESTAMPING_TX_SOFTWARE = (1<<1),
+ SOF_TIMESTAMPING_RX_HARDWARE = (1<<2),
+ SOF_TIMESTAMPING_RX_SOFTWARE = (1<<3),
+ SOF_TIMESTAMPING_SOFTWARE = (1<<4),
+ SOF_TIMESTAMPING_SYS_HARDWARE = (1<<5),
+ SOF_TIMESTAMPING_RAW_HARDWARE = (1<<6),
+ SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_RAW_HARDWARE - 1) | SOF_TIMESTAMPING_RAW_HARDWARE
+};
+
+#if !defined(__KERNEL__) && !defined(SIOCSHWTSTAMP)
+# define SIOCSHWTSTAMP 0x89b0
+#endif
+
+/** %SIOCSHWTSTAMP expects a struct ifreq with a ifr_data pointer to this struct */
+struct hwtstamp_config {
+ int flags; /**< no flags defined right now, must be zero */
+ int tx_type; /**< one of HWTSTAMP_TX_* */
+ int rx_filter_type; /**< one of HWTSTAMP_FILTER_* */
+};
+
+/** possible values for hwtstamp_config->tx_type */
+enum {
+ /**
+ * no outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done
+ */
+ HWTSTAMP_TX_OFF,
+
+ /**
+ * enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet
+ */
+ HWTSTAMP_TX_ON,
+};
+
+/** possible values for hwtstamp_config->rx_filter_type */
+enum {
+ /** time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /** time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /** return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /** PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+ /** PTP v1, UDP, Sync packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
+ /** PTP v1, UDP, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
+ /** PTP v2, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
+ /** PTP v2, UDP, Sync packet */
+ HWTSTAMP_FILTER_PTP_V2_L4_SYNC,
+ /** PTP v2, UDP, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ,
+
+ /** 802.AS1, Ethernet, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V2_L2_EVENT,
+ /** 802.AS1, Ethernet, Sync packet */
+ HWTSTAMP_FILTER_PTP_V2_L2_SYNC,
+ /** 802.AS1, Ethernet, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ,
+
+ /** PTP v2/802.AS1, any layer, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V2_EVENT,
+ /** PTP v2/802.AS1, any layer, Sync packet */
+ HWTSTAMP_FILTER_PTP_V2_SYNC,
+ /** PTP v2/802.AS1, any layer, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V2_DELAY_REQ,
+};
+
+#endif /* _NET_TIMESTAMPING_H */
--
1.6.0.4

2008-11-19 12:10:46

by Patrick Ohly

Subject: [RFC PATCH 03/11] net: infrastructure for hardware time stamping

The new sk_buff->hwtstamp is used to transport time stamping
instructions to the device driver (outgoing packets) and to
return raw hardware time stamps back to user space (incoming
or sent packets).

Implements TX time stamping in software if the device driver
doesn't support hardware time stamping.

The new semantic for hardware/software time stamping around
net_device->hard_start_xmit() is based on two assumptions about
existing network device drivers which don't support hardware
time stamping and know nothing about it:
- they leave the skb->hwtstamp field unmodified
- they keep the connection to the originating socket in skb->sk
alive, i.e., don't call skb_orphan()

Given that hwtstamp is a new field, the first assumption is safe.
The second is only true for some drivers. As a result, software
TX time stamping currently works with the bnx2 driver, but not
with the unmodified igb driver (the two drivers this patch series
was tested with).
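
To make the intended calling convention concrete, a driver with hardware
support would use the new helpers roughly as follows (an illustrative
sketch only, not taken from a real driver; the foo_hw_* functions are
placeholders):

static int foo_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        if (skb_hwtstamp_check_tx_hardware(skb) && foo_hw_can_timestamp(skb)) {
                /* prevents the software fallback in dev_hard_start_xmit() */
                skb_hwtstamp_set_tx_in_progress(skb);
                foo_hw_request_tx_timestamp(skb);
        }
        /* must not overwrite skb->hwtstamp: it carries the request flags */
        return foo_hw_queue_for_tx(skb, dev);
}

/* called later, e.g. from the TX completion interrupt */
static void foo_tx_complete(struct net_device *dev, struct sk_buff *skb, u64 raw)
{
        ktime_t hwtstamp = foo_hw_raw_to_ktime(raw);

        /* clones the skb and queues the clone on the socket's error queue */
        skb_hwtstamp_tx(skb, hwtstamp, dev);
}

A driver that additionally implements net_device->hwtstamp_raw2sys lets
the stack fill in the transformed time stamp requested via
SOF_TIMESTAMPING_SYS_HARDWARE.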
---
include/linux/netdevice.h | 11 ++++
include/linux/skbuff.h | 136 ++++++++++++++++++++++++++++++++++++++++++++-
net/core/dev.c | 23 +++++++-
net/core/skbuff.c | 72 ++++++++++++++++++++++++
4 files changed, 239 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 12d7f44..24bea0c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -763,6 +763,17 @@ struct net_device
/* for setting kernel sock attribute on TCP connection setup */
#define GSO_MAX_SIZE 65536
unsigned int gso_max_size;
+
+#define HAVE_HW_TIME_STAMP
+ /* Transforms original raw hardware time stamp to
+ * system time base. Always required when supporting
+ * hardware time stamping.
+ *
+ * Returns empty stamp (= all zero) if conversion wasn't
+ * possible.
+ */
+ ktime_t (*hwtstamp_raw2sys)(struct net_device *dev,
+ ktime_t hwstamp);
};
#define to_net_dev(d) container_of(d, struct net_device, dev)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a01b6f8..c8004eb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -200,12 +200,42 @@ typedef unsigned int sk_buff_data_t;
typedef unsigned char *sk_buff_data_t;
#endif

+
+/**
+ * sk_buff_hwtstamp - hardware/software time stamping instructions
+ * (for outgoing packets) or result (for packets processed by the
+ * network device)
+ *
+ * @hwtstamp: hardware time stamp; software time stamps are stored
+ * in skb->tstamp
+ * @tstamp_tx_hardware: generate hardware time stamp
+ * @tstamp_tx_software: generate software time stamp
+ * @tstamp_tx_in_progress: device driver is going to provide hardware
+ * time stamp
+ */
+union sk_buff_hwtstamp
+{
+ ktime_t hwtstamp;
+ struct {
+ __u64 hwtstamp_padding:60,
+ tstamp_tx_hardware:1,
+ tstamp_tx_software:1,
+ tstamp_tx_in_progress:1;
+ };
+};
+
/**
* struct sk_buff - socket buffer
* @next: Next buffer in list
* @prev: Previous buffer in list
* @sk: Socket we are owned by
- * @tstamp: Time we arrived
+ * @tstamp: Time we arrived: generated by ktime_get_real() and
+ * thus is recorded in system time.
+ * @hwtstamp: Time we arrived or were sent: generated by the
+ * network device and therefore not directly related to
+ * system time. For outgoing packets the time stamp
+ * is not valid yet. Instead the union is used to
+ * transport time stamping requests to the device.
* @dev: Device we arrived on/are leaving by
* @transport_header: Transport layer header
* @network_header: Network layer header
@@ -266,6 +296,7 @@ struct sk_buff {

struct sock *sk;
ktime_t tstamp;
+ union sk_buff_hwtstamp hwtstamp;
struct net_device *dev;

union {
@@ -1700,6 +1731,11 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,

extern void skb_init(void);

+static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
+{
+ return skb->tstamp;
+}
+
/**
* skb_get_timestamp - get timestamp from a skb
* @skb: skb to get stamp from
@@ -1714,6 +1750,11 @@ static inline void skb_get_timestamp(const struct sk_buff *skb, struct timeval *
*stamp = ktime_to_timeval(skb->tstamp);
}

+static inline void skb_get_timestampns(const struct sk_buff *skb, struct timespec *stamp)
+{
+ *stamp = ktime_to_timespec(skb->tstamp);
+}
+
static inline void __net_timestamp(struct sk_buff *skb)
{
skb->tstamp = ktime_get_real();
@@ -1729,6 +1770,99 @@ static inline ktime_t net_invalid_timestamp(void)
return ktime_set(0, 0);
}

+/**
+ * skb_hwtstamp_available - checks whether the time stamp value has
+ * been set (= non-zero) and really came from hardware
+ *
+ * Only works for packets which have been processed by the device
+ * driver.
+ */
+static inline int skb_hwtstamp_available(const struct sk_buff *skb)
+{
+ return skb->hwtstamp.hwtstamp.tv64 != 0;
+}
+
+/**
+ * skb_hwtstamp_set - stores a time stamp generated by hardware in the skb
+ * @skb: time stamp is stored here
+ * @hwtstamp: original, untransformed hardware time stamp
+ */
+static inline void skb_hwtstamp_set(struct sk_buff *skb,
+ ktime_t hwtstamp)
+{
+ skb->hwtstamp.hwtstamp = hwtstamp;
+}
+
+/**
+ * skb_hwtstamp_raw - fills the timespec with the original, "raw" time
+ * stamp as generated by the hardware when it processed the packet
+ *
+ * Returns 1 if such a hardware time stamp is available and fills in the
+ * timespec. Otherwise it returns 0 and doesn't modify the timespec.
+ */
+int skb_hwtstamp_raw(const struct sk_buff *skb, struct timespec *ts);
+
+/**
+ * skb_hwtstamp_transformed - fills the timespec with the hardware
+ * time stamp generated when the hardware processed the packet,
+ * transformed to system time
+ *
+ * Beware that this transformation is not perfect: packet A received on
+ * interface 1 before packet B on interface 2 might have a higher
+ * transformed time stamp.
+ *
+ * Returns 1 if a transformed hardware time stamp is available and fills
+ * in the timespec, 0 otherwise; in the latter case the timespec is left
+ * unchanged.
+ */
+int skb_hwtstamp_transformed(const struct sk_buff *skb, struct timespec *ts);
+
+static inline int skb_hwtstamp_check_tx_hardware(struct sk_buff *skb)
+{
+ return skb->hwtstamp.tstamp_tx_hardware;
+}
+
+static inline int skb_hwtstamp_check_tx_software(struct sk_buff *skb)
+{
+ return skb->hwtstamp.tstamp_tx_software;
+}
+
+static inline int skb_hwtstamp_check_tx_in_progress(struct sk_buff *skb)
+{
+ return skb->hwtstamp.tstamp_tx_in_progress;
+}
+
+static inline void skb_hwtstamp_set_tx_in_progress(struct sk_buff *skb)
+{
+ skb->hwtstamp.tstamp_tx_in_progress = 1;
+}
+
+/**
+ * skb_hwtstamp_tx - queue clone of skb with send time stamp
+ * @orig_skb: the original outgoing packet
+ * @stamp: either raw hardware time stamp or result of ktime_get_real()
+ * @dev: NULL if time stamp from ktime_get_real(), otherwise device
+ * which generated the hardware time stamp; the device may or
+ * may not implement the system time<->hardware time mapping
+ * functions
+ *
+ * This function will not actually timestamp the skb, but, if the skb has a
+ * socket associated, clone the skb, timestamp it, and queue it to the error
+ * queue of the socket. Errors are silently ignored.
+ */
+void skb_hwtstamp_tx(struct sk_buff *orig_skb,
+ ktime_t stamp,
+ struct net_device *dev);
+
+/**
+ * skb_tx_software_timestamp - software fallback for send time stamping
+ */
+static inline void skb_tx_software_timestamp(struct sk_buff *skb)
+{
+ if (skb_hwtstamp_check_tx_software(skb) &&
+ !skb_hwtstamp_check_tx_in_progress(skb))
+ skb_hwtstamp_tx(skb, ktime_get_real(), NULL);
+}
+
extern __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len);
extern __sum16 __skb_checksum_complete(struct sk_buff *skb);

diff --git a/net/core/dev.c b/net/core/dev.c
index e08c0fc..b4b8eb8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1658,6 +1658,8 @@ static int dev_gso_segment(struct sk_buff *skb)
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq)
{
+ int rc;
+
if (likely(!skb->next)) {
if (!list_empty(&ptype_all))
dev_queue_xmit_nit(skb, dev);
@@ -1669,13 +1671,29 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
goto gso;
}

- return dev->hard_start_xmit(skb, dev);
+ rc = dev->hard_start_xmit(skb, dev);
+ /*
+ * TODO: if skb_orphan() was called by
+ * dev->hard_start_xmit() (for example, the unmodified
+ * igb driver does that; bnx2 doesn't), then
+ * skb_tx_software_timestamp() will be unable to send
+ * back the time stamp.
+ *
+ * How can this be prevented? Always create another
+ * reference to the socket before calling
+ * dev->hard_start_xmit()? Prevent that skb_orphan()
+ * does anything in dev->hard_start_xmit() by clearing
+ * the skb destructor before the call and restoring it
+ * afterwards, then doing the skb_orphan() ourselves?
+ */
+ if (likely(!rc))
+ skb_tx_software_timestamp(skb);
+ return rc;
}

gso:
do {
struct sk_buff *nskb = skb->next;
- int rc;

skb->next = nskb->next;
nskb->next = NULL;
@@ -1685,6 +1703,7 @@ gso:
skb->next = nskb;
return rc;
}
+ skb_tx_software_timestamp(skb);
if (unlikely(netif_tx_queue_stopped(txq) && skb->next))
return NETDEV_TX_BUSY;
} while (skb->next);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 267185a..38360d8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -44,6 +44,7 @@
#include <linux/in.h>
#include <linux/inet.h>
#include <linux/slab.h>
+#include <linux/inetdevice.h>
#include <linux/netdevice.h>
#ifdef CONFIG_NET_CLS_ACT
#include <net/pkt_sched.h>
@@ -55,6 +56,7 @@
#include <linux/rtnetlink.h>
#include <linux/init.h>
#include <linux/scatterlist.h>
+#include <linux/errqueue.h>

#include <net/protocol.h>
#include <net/dst.h>
@@ -496,6 +498,7 @@ EXPORT_SYMBOL(skb_recycle_check);
static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
{
new->tstamp = old->tstamp;
+ new->hwtstamp = old->hwtstamp;
new->dev = old->dev;
new->transport_header = old->transport_header;
new->network_header = old->network_header;
@@ -2395,6 +2398,41 @@ err:

EXPORT_SYMBOL_GPL(skb_segment);

+int skb_hwtstamp_raw(const struct sk_buff *skb, struct timespec *ts)
+{
+ if (skb_hwtstamp_available(skb)) {
+ *ts = ktime_to_timespec(skb->hwtstamp.hwtstamp);
+ return 1;
+ }
+ return 0;
+}
+
+EXPORT_SYMBOL_GPL(skb_hwtstamp_raw);
+
+int skb_hwtstamp_transformed(const struct sk_buff *skb, struct timespec *ts)
+{
+ struct rtable *rt;
+ struct in_device *idev;
+ struct net_device *netdev;
+
+ if (skb_hwtstamp_available(skb) &&
+ (rt = skb->rtable) != NULL &&
+ (idev = rt->idev) != NULL &&
+ (netdev = idev->dev) != NULL &&
+ netdev->hwtstamp_raw2sys) {
+ ktime_t hwtstamp_sys =
+ netdev->hwtstamp_raw2sys(netdev,
+ skb->hwtstamp.hwtstamp);
+ if (hwtstamp_sys.tv64) {
+ *ts = ktime_to_timespec(hwtstamp_sys);
+ return 1;
+ }
+ }
+ return 0;
+}
+
+EXPORT_SYMBOL_GPL(skb_hwtstamp_transformed);
+
void __init skb_init(void)
{
skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
@@ -2601,6 +2639,40 @@ int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer)
return elt;
}

+void skb_hwtstamp_tx(struct sk_buff *orig_skb,
+ ktime_t stamp,
+ struct net_device *dev)
+{
+ struct sock *sk = orig_skb->sk;
+ struct sock_exterr_skb *serr;
+ struct sk_buff *skb;
+ int err = -ENOMEM;
+
+ if (!sk)
+ return;
+
+ skb = skb_clone(orig_skb, GFP_ATOMIC);
+ if (!skb)
+ return;
+
+ if (dev) {
+ skb->hwtstamp.hwtstamp = stamp;
+ } else {
+ skb->tstamp = stamp;
+ skb->hwtstamp.hwtstamp.tv64 = 0;
+ }
+
+ serr = SKB_EXT_ERR(skb);
+ memset(serr, 0, sizeof(*serr));
+ serr->ee.ee_errno = ENOMSG;
+ serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
+ err = sock_queue_err_skb(sk, skb);
+ if (err)
+ kfree_skb(skb);
+}
+EXPORT_SYMBOL_GPL(skb_hwtstamp_tx);
+
+
/**
* skb_partial_csum_set - set up and verify partial csum values for packet
* @skb: the skb to set
--
1.6.0.4

2008-11-19 12:11:15

by Patrick Ohly

Subject: [RFC PATCH 04/11] net: socket infrastructure for SO_TIMESTAMPING

The overlap with the old SO_TIMESTAMP[NS] options is handled so
that time stamping in software (net_enable_timestamp()) is
enabled when SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE
is set. It's disabled if all of these are off.
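
The rule implemented below can be summarized as (pseudo-code, for
illustration only):

    /* software time stamping of incoming packets must stay enabled as long
     * as at least one of these per-socket flags is set */
    need_net_timestamp = sock_flag(sk, SOCK_TIMESTAMP) ||            /* SO_TIMESTAMP[NS] */
                         sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);

    /* so sock_enable_timestamp(sk, flag) calls net_enable_timestamp() only
     * if the other flag was not already set, and sock_disable_timestamp()
     * calls net_disable_timestamp() only once both flags are clear */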
---
include/net/sock.h | 37 ++++++++++++++++++++++--
net/compat.c | 19 ++++++++----
net/core/sock.c | 79 ++++++++++++++++++++++++++++++++++++++++++++--------
net/socket.c | 75 ++++++++++++++++++++++++++++++++++++-------------
4 files changed, 168 insertions(+), 42 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 0a63894..0f71951 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -158,7 +158,7 @@ struct sock_common {
* @sk_allocation: allocation mode
* @sk_sndbuf: size of send buffer in bytes
* @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
- * %SO_OOBINLINE settings
+ * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
* @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
* @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
@@ -488,6 +488,13 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_TIMESTAMPING_TX_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_TX_HARDWARE */
+ SOCK_TIMESTAMPING_TX_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_TX_SOFTWARE */
+ SOCK_TIMESTAMPING_RX_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RX_HARDWARE */
+ SOCK_TIMESTAMPING_RX_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RX_SOFTWARE */
+ SOCK_TIMESTAMPING_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_SOFTWARE */
+ SOCK_TIMESTAMPING_RAW_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RAW_HARDWARE */
+ SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_SYS_HARDWARE */
};

static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -1342,13 +1349,37 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
{
ktime_t kt = skb->tstamp;

- if (sock_flag(sk, SOCK_RCVTSTAMP))
+ /*
+ * generate control messages if receive time stamping requested
+ * or if time stamp available (RX hardware or TX software/hardware
+ * case) and reporting via SO_TIMESTAMPING enabled
+ */
+ if ((sock_flag(sk, SOCK_RCVTSTAMP) ||
+ sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE)) ||
+ (kt.tv64 && sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE)) ||
+ (skb_hwtstamp_available(skb) &&
+ (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE) ||
+ sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE))))
__sock_recv_timestamp(msg, sk, skb);
else
sk->sk_stamp = kt;
}

/**
+ * sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
+ * @msg: outgoing packet
+ * @sk: socket sending this packet
+ * @stamp_tx: filled with instructions for time stamping
+ *
+ * Currently only depends on SOCK_TIMESTAMPING* flags. Returns error code if
+ * parameters are invalid.
+ */
+extern int sock_tx_timestamp(struct msghdr *msg,
+ struct sock *sk,
+ union sk_buff_hwtstamp *stamp_tx);
+
+
+/**
* sk_eat_skb - Release a skb if it is no longer needed
* @sk: socket to eat this skb from
* @skb: socket buffer to eat
@@ -1416,7 +1447,7 @@ static inline struct sock *skb_steal_sock(struct sk_buff *skb)
return NULL;
}

-extern void sock_enable_timestamp(struct sock *sk);
+extern void sock_enable_timestamp(struct sock *sk, int flag);
extern int sock_get_timestamp(struct sock *, struct timeval __user *);
extern int sock_get_timestampns(struct sock *, struct timespec __user *);

diff --git a/net/compat.c b/net/compat.c
index 6ce1a1c..954377e 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -216,7 +216,7 @@ Efault:
int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *data)
{
struct compat_timeval ctv;
- struct compat_timespec cts;
+ struct compat_timespec cts[3];
struct compat_cmsghdr __user *cm = (struct compat_cmsghdr __user *) kmsg->msg_control;
struct compat_cmsghdr cmhdr;
int cmlen;
@@ -233,12 +233,17 @@ int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *dat
data = &ctv;
len = sizeof(ctv);
}
- if (level == SOL_SOCKET && type == SCM_TIMESTAMPNS) {
+ if (level == SOL_SOCKET &&
+ (type == SCM_TIMESTAMPNS || type == SCM_TIMESTAMPING)) {
+ int count = type == SCM_TIMESTAMPNS ? 1 : 3;
+ int i;
struct timespec *ts = (struct timespec *)data;
- cts.tv_sec = ts->tv_sec;
- cts.tv_nsec = ts->tv_nsec;
+ for (i = 0; i < count; i++) {
+ cts[i].tv_sec = ts[i].tv_sec;
+ cts[i].tv_nsec = ts[i].tv_nsec;
+ }
data = &cts;
- len = sizeof(cts);
+ len = sizeof(cts[0]) * count;
}

cmlen = CMSG_COMPAT_LEN(len);
@@ -455,7 +460,7 @@ int compat_sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
struct timeval tv;

if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
tv = ktime_to_timeval(sk->sk_stamp);
if (tv.tv_sec == -1)
return err;
@@ -479,7 +484,7 @@ int compat_sock_get_timestampns(struct sock *sk, struct timespec __user *usersta
struct timespec ts;

if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
ts = ktime_to_timespec(sk->sk_stamp);
if (ts.tv_sec == -1)
return err;
diff --git a/net/core/sock.c b/net/core/sock.c
index 38de9c3..1a4895a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -120,6 +120,7 @@
#include <net/net_namespace.h>
#include <net/request_sock.h>
#include <net/sock.h>
+#include <net/timestamping.h>
#include <net/xfrm.h>
#include <linux/ipsec.h>

@@ -257,11 +258,14 @@ static void sock_warn_obsolete_bsdism(const char *name)
}
}

-static void sock_disable_timestamp(struct sock *sk)
+static void sock_disable_timestamp(struct sock *sk, int flag)
{
- if (sock_flag(sk, SOCK_TIMESTAMP)) {
- sock_reset_flag(sk, SOCK_TIMESTAMP);
- net_disable_timestamp();
+ if (sock_flag(sk, flag)) {
+ sock_reset_flag(sk, flag);
+ if (!sock_flag(sk, SOCK_TIMESTAMP) &&
+ !sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE)) {
+ net_disable_timestamp();
+ }
}
}

@@ -616,13 +620,36 @@ set_rcvbuf:
else
sock_set_flag(sk, SOCK_RCVTSTAMPNS);
sock_set_flag(sk, SOCK_RCVTSTAMP);
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
} else {
sock_reset_flag(sk, SOCK_RCVTSTAMP);
sock_reset_flag(sk, SOCK_RCVTSTAMPNS);
}
break;

+ case SO_TIMESTAMPING:
+ if (val & ~SOF_TIMESTAMPING_MASK) {
+ ret = -EINVAL;
+ break;
+ }
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_TX_HARDWARE,
+ val & SOF_TIMESTAMPING_TX_HARDWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE,
+ val & SOF_TIMESTAMPING_TX_SOFTWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_RX_HARDWARE,
+ val & SOF_TIMESTAMPING_RX_HARDWARE);
+ if (val & SOF_TIMESTAMPING_RX_SOFTWARE)
+ sock_enable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);
+ else
+ sock_disable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_SOFTWARE,
+ val & SOF_TIMESTAMPING_SOFTWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE,
+ val & SOF_TIMESTAMPING_SYS_HARDWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE,
+ val & SOF_TIMESTAMPING_RAW_HARDWARE);
+ break;
+
case SO_RCVLOWAT:
if (val < 0)
val = INT_MAX;
@@ -768,6 +795,24 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
v.val = sock_flag(sk, SOCK_RCVTSTAMPNS);
break;

+ case SO_TIMESTAMPING:
+ v.val = 0;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_TX_HARDWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE))
+ v.val |= SOF_TIMESTAMPING_TX_SOFTWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RX_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_RX_HARDWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE))
+ v.val |= SOF_TIMESTAMPING_RX_SOFTWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE))
+ v.val |= SOF_TIMESTAMPING_SOFTWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_SYS_HARDWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_RAW_HARDWARE;
+ break;
+
case SO_RCVTIMEO:
lv=sizeof(struct timeval);
if (sk->sk_rcvtimeo == MAX_SCHEDULE_TIMEOUT) {
@@ -969,7 +1014,8 @@ void sk_free(struct sock *sk)
rcu_assign_pointer(sk->sk_filter, NULL);
}

- sock_disable_timestamp(sk);
+ sock_disable_timestamp(sk, SOCK_TIMESTAMP);
+ sock_disable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);

if (atomic_read(&sk->sk_omem_alloc))
printk(KERN_DEBUG "%s: optmem leakage (%d bytes) detected.\n",
@@ -1783,7 +1829,7 @@ int sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
{
struct timeval tv;
if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
tv = ktime_to_timeval(sk->sk_stamp);
if (tv.tv_sec == -1)
return -ENOENT;
@@ -1799,7 +1845,7 @@ int sock_get_timestampns(struct sock *sk, struct timespec __user *userstamp)
{
struct timespec ts;
if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
ts = ktime_to_timespec(sk->sk_stamp);
if (ts.tv_sec == -1)
return -ENOENT;
@@ -1811,11 +1857,20 @@ int sock_get_timestampns(struct sock *sk, struct timespec __user *userstamp)
}
EXPORT_SYMBOL(sock_get_timestampns);

-void sock_enable_timestamp(struct sock *sk)
+void sock_enable_timestamp(struct sock *sk, int flag)
{
- if (!sock_flag(sk, SOCK_TIMESTAMP)) {
- sock_set_flag(sk, SOCK_TIMESTAMP);
- net_enable_timestamp();
+ if (!sock_flag(sk, flag)) {
+ sock_set_flag(sk, flag);
+ /*
+ * we just set one of the two flags which require net
+ * time stamping, but time stamping might have been on
+ * already because of the other one
+ */
+ if (!sock_flag(sk,
+ flag == SOCK_TIMESTAMP ?
+ SOCK_TIMESTAMPING_RX_SOFTWARE :
+ SOCK_TIMESTAMP))
+ net_enable_timestamp();
}
}

diff --git a/net/socket.c b/net/socket.c
index d7128b7..1ad00e1 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -545,6 +545,18 @@ void sock_release(struct socket *sock)
sock->file = NULL;
}

+int sock_tx_timestamp(struct msghdr *msg, struct sock *sk,
+ union sk_buff_hwtstamp *tstamp_tx)
+{
+ tstamp_tx->hwtstamp.tv64 = 0;
+ tstamp_tx->tstamp_tx_hardware =
+ sock_flag(sk, SOCK_TIMESTAMPING_TX_HARDWARE);
+ tstamp_tx->tstamp_tx_software =
+ sock_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE);
+ return 0;
+}
+EXPORT_SYMBOL(sock_tx_timestamp);
+
static inline int __sock_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *msg, size_t size)
{
@@ -601,26 +613,49 @@ int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
struct sk_buff *skb)
{
- ktime_t kt = skb->tstamp;
-
- if (!sock_flag(sk, SOCK_RCVTSTAMPNS)) {
- struct timeval tv;
- /* Race occurred between timestamp enabling and packet
- receiving. Fill in the current time for now. */
- if (kt.tv64 == 0)
- kt = ktime_get_real();
- skb->tstamp = kt;
- tv = ktime_to_timeval(kt);
- put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP, sizeof(tv), &tv);
- } else {
- struct timespec ts;
- /* Race occurred between timestamp enabling and packet
- receiving. Fill in the current time for now. */
- if (kt.tv64 == 0)
- kt = ktime_get_real();
- skb->tstamp = kt;
- ts = ktime_to_timespec(kt);
- put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPNS, sizeof(ts), &ts);
+ int need_software_tstamp = sock_flag(sk, SOCK_RCVTSTAMP);
+
+ /* Race occurred between timestamp enabling and packet
+ receiving. Fill in the current time for now. */
+ if (need_software_tstamp && skb->tstamp.tv64 == 0)
+ __net_timestamp(skb);
+
+ if (need_software_tstamp) {
+ if (!sock_flag(sk, SOCK_RCVTSTAMPNS)) {
+ struct timeval tv;
+ skb_get_timestamp(skb, &tv);
+ put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+ sizeof(tv), &tv);
+ } else {
+ struct timespec ts;
+ skb_get_timestampns(skb, &ts);
+ put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPNS,
+ sizeof(ts), &ts);
+ }
+ }
+
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE) ||
+ sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE) ||
+ sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE)) {
+ struct timespec ts[3];
+ int empty = 1;
+ memset(ts, 0, sizeof(ts));
+ if (skb->tstamp.tv64 &&
+ sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE)) {
+ skb_get_timestampns(skb, ts + 0);
+ empty = 0;
+ }
+ if (skb_hwtstamp_available(skb)) {
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE) &&
+ skb_hwtstamp_transformed(skb, ts + 1))
+ empty = 0;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE) &&
+ skb_hwtstamp_raw(skb, ts + 2))
+ empty = 0;
+ }
+ if (!empty)
+ put_cmsg(msg, SOL_SOCKET,
+ SCM_TIMESTAMPING, sizeof(ts), &ts);
}
}

--
1.6.0.4

2008-11-19 12:11:37

by Patrick Ohly

Subject: [RFC PATCH 05/11] ip: support for TX timestamps on UDP and RAW sockets

Instructions for time stamping outgoing packets are taken from the
socket layer and later copied into the new skb.
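
Condensed, the UDP send path below becomes (extracted from the hunks
that follow, shown here only to make the flow explicit):

    /* in udp_sendmsg(): capture the socket's TX time stamping flags ... */
    ipc.tstamp_tx.hwtstamp.tv64 = 0;
    err = sock_tx_timestamp(msg, sk, &ipc.tstamp_tx);
    if (err)
        return err;

    /* ... and in ip_append_data() (ip_output.c, which receives ipc as a
     * pointer) copy them into each newly allocated skb */
    skb->hwtstamp = ipc->tstamp_tx;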
---
include/net/ip.h | 1 +
net/can/raw.c | 6 ++++++
net/ipv4/icmp.c | 2 ++
net/ipv4/ip_output.c | 2 ++
net/ipv4/raw.c | 1 +
net/ipv4/udp.c | 4 ++++
6 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index bc026ec..9bc2b65 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -55,6 +55,7 @@ struct ipcm_cookie
__be32 addr;
int oif;
struct ip_options *opt;
+ union sk_buff_hwtstamp tstamp_tx;
};

#define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/net/can/raw.c b/net/can/raw.c
index 6e0663f..d4a38e3 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -618,6 +618,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
struct raw_sock *ro = raw_sk(sk);
struct sk_buff *skb;
struct net_device *dev;
+ union sk_buff_hwtstamp tstamp_tx;
int ifindex;
int err;

@@ -639,6 +640,10 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
if (!dev)
return -ENXIO;

+ err = sock_tx_timestamp(msg, sk, &tstamp_tx);
+ if (err < 0)
+ return err;
+
skb = sock_alloc_send_skb(sk, size, msg->msg_flags & MSG_DONTWAIT,
&err);
if (!skb) {
@@ -654,6 +659,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
}
skb->dev = dev;
skb->sk = sk;
+ skb->hwtstamp = tstamp_tx;

err = can_send(skb, ro->loopback);

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 21e497e..ba739f4 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -375,6 +375,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
inet->tos = ip_hdr(skb)->tos;
daddr = ipc.addr = rt->rt_src;
ipc.opt = NULL;
+ ipc.tstamp_tx.hwtstamp.tv64 = 0;
if (icmp_param->replyopts.optlen) {
ipc.opt = &icmp_param->replyopts;
if (ipc.opt->srr)
@@ -532,6 +533,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
inet_sk(sk)->tos = tos;
ipc.addr = iph->saddr;
ipc.opt = &icmp_param.replyopts;
+ ipc.tstamp_tx.hwtstamp.tv64 = 0;

{
struct flowi fl = {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 46d7be2..1498848 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -940,6 +940,7 @@ alloc_new_skb:
skb->ip_summed = csummode;
skb->csum = 0;
skb_reserve(skb, hh_len);
+ skb->hwtstamp = ipc->tstamp_tx;

/*
* Find where to start putting bytes.
@@ -1354,6 +1355,7 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar

daddr = ipc.addr = rt->rt_src;
ipc.opt = NULL;
+ ipc.tstamp_tx.hwtstamp.tv64 = 0;

if (replyopts.opt.optlen) {
ipc.opt = &replyopts.opt;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 998fcff..9115ed5 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -493,6 +493,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,

ipc.addr = inet->saddr;
ipc.opt = NULL;
+ ipc.tstamp_tx.hwtstamp.tv64 = 0;
ipc.oif = sk->sk_bound_dev_if;

if (msg->msg_controllen) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index fea2d87..32c4e98 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -573,6 +573,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
return -EOPNOTSUPP;

ipc.opt = NULL;
+ ipc.tstamp_tx.hwtstamp.tv64 = 0;

if (up->pending) {
/*
@@ -620,6 +621,9 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
ipc.addr = inet->saddr;

ipc.oif = sk->sk_bound_dev_if;
+ err = sock_tx_timestamp(msg, sk, &ipc.tstamp_tx);
+ if (err)
+ return err;
if (msg->msg_controllen) {
err = ip_cmsg_send(sock_net(sk), msg, &ipc);
if (err)
--
1.6.0.4

2008-11-19 12:11:52

by Patrick Ohly

[permalink] [raw]
Subject: [RFC PATCH 06/11] net: pass new SIOCSHWTSTAMP through to device drivers

---
fs/compat_ioctl.c | 1 +
net/core/dev.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 5235c67..a5001a6 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -2555,6 +2555,7 @@ HANDLE_IOCTL(SIOCSIFMAP, dev_ifsioc)
HANDLE_IOCTL(SIOCGIFADDR, dev_ifsioc)
HANDLE_IOCTL(SIOCSIFADDR, dev_ifsioc)
HANDLE_IOCTL(SIOCSIFHWBROADCAST, dev_ifsioc)
+HANDLE_IOCTL(SIOCSHWTSTAMP, dev_ifsioc)

/* ioctls used by appletalk ddp.c */
HANDLE_IOCTL(SIOCATALKDIFADDR, dev_ifsioc)
diff --git a/net/core/dev.c b/net/core/dev.c
index b4b8eb8..4f61f5e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3650,6 +3650,7 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd)
cmd == SIOCSMIIREG ||
cmd == SIOCBRADDIF ||
cmd == SIOCBRDELIF ||
+ cmd == SIOCSHWTSTAMP ||
cmd == SIOCWANDEV) {
err = -EOPNOTSUPP;
if (dev->do_ioctl) {
@@ -3805,6 +3806,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
case SIOCBONDCHANGEACTIVE:
case SIOCBRADDIF:
case SIOCBRDELIF:
+ case SIOCSHWTSTAMP:
if (!capable(CAP_NET_ADMIN))
return -EPERM;
/* fall through */
--
1.6.0.4

2008-11-19 12:12:20

by Patrick Ohly

[permalink] [raw]
Subject: [RFC PATCH 08/11] clocksource: allow usage independent of timekeeping.c

So far struct clocksource acted as the interface between time/timekeeping
and hardware. This patch generalizes the concept so that the same
interface can also be used in other contexts.

The only change as far as kernel/time/timekeeping is concerned is that
the hardware access can be done either with or without passing
the clocksource pointer as context. This is necessary in those
cases when there is more than one instance of the hardware.

The extensions in this patch add code which turns the raw cycle count
provided by hardware into a continuously increasing time value. This
reuses fields also used by timekeeping.c. Because of slightly different
semantics (__get_nsec_offset does not update cycle_last, clocksource_read_ns
does that transparently) timekeeping.c was not modified to use the
generalized code.

The new code does no locking of the clocksource. This is the responsibility
of the caller.
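
As a usage sketch (the foo_* names are placeholders; the igb patches later in
this series do essentially this), a driver with its own free-running counter
would wire up and use the clocksource roughly like this:

	static cycle_t foo_read_clock(struct clocksource *cs)
	{
		struct foo_adapter *adapter =
			container_of(cs, struct foo_adapter, clock);

		return foo_read_raw_counter(adapter);	/* raw cycles */
	}

	/* at probe time; here the counter is assumed to tick in ns */
	adapter->clock.read_clock = foo_read_clock;
	adapter->clock.mask = (u64)(s64)-1;
	adapter->clock.mult = 1;
	adapter->clock.shift = 0;
	clocksource_init_time(&adapter->clock, ktime_to_ns(ktime_get_real()));

	/* later: continuously increasing nanosecond clock ... */
	u64 now_ns = clocksource_read_time(&adapter->clock);
	/* ... or map a raw stamp latched by the hardware onto that clock */
	u64 stamp_ns = clocksource_cyc2time(&adapter->clock, raw_stamp);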
---
include/linux/clocksource.h | 119 ++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 118 insertions(+), 1 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index f88d32f..5435bd2 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -24,6 +24,9 @@ struct clocksource;
/**
* struct clocksource - hardware abstraction for a free running counter
* Provides mostly state-free accessors to the underlying hardware.
+ * Also provides utility functions which convert the underlying
+ * hardware cycle values into a non-decreasing count of nanoseconds
+ * ("time").
*
* @name: ptr to clocksource name
* @list: list head for registration
@@ -43,6 +46,9 @@ struct clocksource;
* The ideal clocksource. A must-use where
* available.
* @read: returns a cycle value
+ * @read_clock: alternative to read which gets a pointer to the clock
+ * source so that the same code can read different clocks;
+ * either read or read_clock must be set
* @mask: bitmask for two's complement
* subtraction of non 64 bit counters
* @mult: cycle to nanosecond multiplier (adjusted by NTP)
@@ -62,6 +68,7 @@ struct clocksource {
struct list_head list;
int rating;
cycle_t (*read)(void);
+ cycle_t (*read_clock)(struct clocksource *cs);
cycle_t mask;
u32 mult;
u32 mult_orig;
@@ -170,7 +177,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
*/
static inline cycle_t clocksource_read(struct clocksource *cs)
{
- return cs->read();
+ return (cs->read ? cs->read() : cs->read_clock(cs));
}

/**
@@ -190,6 +197,116 @@ static inline s64 cyc2ns(struct clocksource *cs, cycle_t cycles)
}

/**
+ * clocksource_read_ns - get nanoseconds since last call of this function
+ * (never negative)
+ * @cs: Pointer to clocksource
+ *
+ * When the underlying cycle counter runs over, this will be handled
+ * correctly as long as it does not run over more than once between
+ * calls.
+ *
+ * The first call to this function for a new clock source initializes
+ * the time tracking and returns bogus results.
+ */
+static inline s64 clocksource_read_ns(struct clocksource *cs)
+{
+ cycle_t cycle_now, cycle_delta;
+ s64 ns_offset;
+
+ /* read clocksource: */
+ cycle_now = clocksource_read(cs);
+
+ /* calculate the delta since the last clocksource_read_ns: */
+ cycle_delta = (cycle_now - cs->cycle_last) & cs->mask;
+
+ /* convert to nanoseconds: */
+ ns_offset = cyc2ns(cs, cycle_delta);
+
+ /* update time stamp of clocksource_read_ns call: */
+ cs->cycle_last = cycle_now;
+
+ return ns_offset;
+}
+
+/**
+ * clocksource_init_time - initialize a clock source for use with
+ * %clocksource_read_time() and
+ * %clocksource_cyc2time()
+ * @cs: Pointer to clocksource.
+ * @start_tstamp: Arbitrary initial time stamp.
+ *
+ * After this call the current cycle register (roughly) corresponds to
+ * the initial time stamp. Every call to %clocksource_read_time()
+ * increments the time stamp counter by the number of elapsed
+ * nanoseconds.
+ */
+static inline void clocksource_init_time(struct clocksource *cs,
+ u64 start_tstamp)
+{
+ cs->cycle_last = clocksource_read(cs);
+ cs->xtime_nsec = start_tstamp;
+}
+
+/**
+ * clocksource_read_time - return nanoseconds since %clocksource_init_time()
+ * plus the initial time stamp
+ * @cs: Pointer to clocksource.
+ *
+ * In other words, keeps track of time since the same epoch as
+ * the function which generated the initial time stamp. Don't mix
+ * with calls to %clocksource_read_ns()!
+ */
+static inline u64 clocksource_read_time(struct clocksource *cs)
+{
+ u64 nsec;
+
+ /* increment time by nanoseconds since last call */
+ nsec = clocksource_read_ns(cs);
+ nsec += cs->xtime_nsec;
+ cs->xtime_nsec = nsec;
+
+ return nsec;
+}
+
+/**
+ * clocksource_cyc2time - convert an absolute cycle time stamp to same
+ * time base as values returned by
+ * %clocksource_read_time()
+ * @cs: Pointer to clocksource.
+ * @cycle_tstamp: a value returned by cs->read()
+ *
+ * Cycle time stamps are converted correctly as long as they
+ * fall into the time interval [-1/2 max cycle count, +1/2 max cycle count],
+ * with "max cycle count" == cs->mask+1.
+ *
+ * This avoids situations where a cycle time stamp is generated, the
+ * current cycle counter is updated, and then when transforming the
+ * time stamp the value is treated as if it was in the future. Always
+ * updating the cycle counter would also work, but incur additional
+ * overhead.
+ */
+static inline u64 clocksource_cyc2time(struct clocksource *cs,
+ cycle_t cycle_tstamp)
+{
+ u64 cycle_delta = (cycle_tstamp - cs->cycle_last) & cs->mask;
+ u64 nsec;
+
+ /*
+ * Instead of always treating cycle_tstamp as more recent
+ * than cs->cycle_last, detect when it is too far in the
+ * future and treat it as old time stamp instead.
+ */
+ if (cycle_delta > cs->mask / 2) {
+ cycle_delta = (cs->cycle_last - cycle_tstamp) & cs->mask;
+ nsec = cs->xtime_nsec - cyc2ns(cs, cycle_delta);
+ } else {
+ nsec = cyc2ns(cs, cycle_delta) + cs->xtime_nsec;
+ }
+
+ return nsec;
+}
+
+/**
* clocksource_calculate_interval - Calculates a clocksource interval struct
*
* @c: Pointer to clocksource.
--
1.6.0.4

2008-11-19 12:12:55

by Patrick Ohly

[permalink] [raw]
Subject: [RFC PATCH 09/11] igb: infrastructure for hardware time stamping

Adds register definitions and a clocksource accessing the
NIC time.
---
drivers/net/igb/e1000_regs.h | 28 +++++++++++
drivers/net/igb/igb.h | 3 +
drivers/net/igb/igb_main.c | 105 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 136 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index 95523af..37f9d55 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -75,6 +75,34 @@
#define E1000_FCRTH 0x02168 /* Flow Control Receive Threshold High - RW */
#define E1000_RDFPCQ(_n) (0x02430 + (0x4 * (_n)))
#define E1000_FCRTV 0x02460 /* Flow Control Refresh Timer Value - RW */
+
+/* IEEE 1588 TIMESYNCH */
+#define E1000_TSYNCTXCTL 0x0B614
+#define E1000_TSYNCRXCTL 0x0B620
+#define E1000_TSYNCRXCFG 0x05F50
+
+#define E1000_SYSTIML 0x0B600
+#define E1000_SYSTIMH 0x0B604
+#define E1000_TIMINCA 0x0B608
+
+#define E1000_RXMTRL 0x0B634
+#define E1000_RXSTMPL 0x0B624
+#define E1000_RXSTMPH 0x0B628
+#define E1000_RXSATRL 0x0B62C
+#define E1000_RXSATRH 0x0B630
+
+#define E1000_TXSTMPL 0x0B618
+#define E1000_TXSTMPH 0x0B61C
+
+#define E1000_ETQF0 0x05CB0
+#define E1000_ETQF1 0x05CB4
+#define E1000_ETQF2 0x05CB8
+#define E1000_ETQF3 0x05CBC
+#define E1000_ETQF4 0x05CC0
+#define E1000_ETQF5 0x05CC4
+#define E1000_ETQF6 0x05CC8
+#define E1000_ETQF7 0x05CCC
+
/* Split and Replication RX Control - RW */
/*
* Convenience macros
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 4ff6f05..2938ab3 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -34,6 +34,8 @@
#include "e1000_mac.h"
#include "e1000_82575.h"

+#include <linux/clocksource.h>
+
struct igb_adapter;

#ifdef CONFIG_IGB_LRO
@@ -262,6 +264,7 @@ struct igb_adapter {
struct napi_struct napi;
struct pci_dev *pdev;
struct net_device_stats net_stats;
+ struct clocksource clock;

/* structs defined in e1000_hw.h */
struct e1000_hw hw;
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index be8e2b8..ee39aee 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -180,6 +180,54 @@ MODULE_DESCRIPTION("Intel(R) Gigabit Ethernet Network Driver");
MODULE_LICENSE("GPL");
MODULE_VERSION(DRV_VERSION);

+/**
+ * Scale the NIC clock cycle by a large factor so that
+ * relatively small clock corrections can be added or
+ * subtracted at each clock tick. The drawbacks of a
+ * large factor are a) that the clock register overflows
+ * more quickly (not such a big deal) and b) that the
+ * increment per tick has to fit into 24 bits.
+ *
+ * Note that
+ * TIMINCA = IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS *
+ * IGB_TSYNC_SCALE
+ * TIMINCA += TIMINCA * adjustment [ppm] / 1e9
+ *
+ * The base scale factor is intentionally a power of two
+ * so that the division in clocksource can be done with
+ * a shift.
+ */
+#define IGB_TSYNC_SHIFT (19)
+#define IGB_TSYNC_SCALE (1<<IGB_TSYNC_SHIFT)
+
+/**
+ * The duration of one clock cycle of the NIC.
+ *
+ * @todo This hard-coded value is part of the specification and might change
+ * in future hardware revisions. Add revision check.
+ */
+#define IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS 16
+
+#if (IGB_TSYNC_SCALE * IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS) >= (1<<24)
+# error IGB_TSYNC_SCALE and/or IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS are too large to fit into TIMINCA
+#endif
+
+/**
+ * igb_read_clock - read raw cycle counter (to be used by clocksource)
+ */
+static cycle_t igb_read_clock(struct clocksource *cs)
+{
+ struct igb_adapter *adapter =
+ container_of(cs, struct igb_adapter, clock);
+ struct e1000_hw *hw = &adapter->hw;
+ u64 stamp;
+
+ stamp = rd32(E1000_SYSTIML);
+ stamp |= (u64)rd32(E1000_SYSTIMH) << 32ULL;
+
+ return stamp;
+}
+
#ifdef DEBUG
/**
* igb_get_hw_dev_name - return device name string
@@ -190,6 +238,27 @@ char *igb_get_hw_dev_name(struct e1000_hw *hw)
struct igb_adapter *adapter = hw->back;
return adapter->netdev->name;
}
+
+/**
+ * igb_get_time_str - format current NIC and system time as string
+ */
+static char *igb_get_time_str(struct igb_adapter *adapter,
+ char buffer[160])
+{
+ struct timespec nic = ns_to_timespec(clocksource_read_time(&adapter->clock));
+ struct timespec sys;
+ struct timespec delta;
+ getnstimeofday(&sys);
+
+ delta = timespec_sub(nic, sys);
+
+ sprintf(buffer, "NIC %ld.%09lus, SYS %ld.%09lus, NIC-SYS %lds + %09luns",
+ (long)nic.tv_sec, nic.tv_nsec,
+ (long)sys.tv_sec, sys.tv_nsec,
+ (long)delta.tv_sec, delta.tv_nsec);
+
+ return buffer;
+}
#endif

/**
@@ -1274,6 +1343,42 @@ static int __devinit igb_probe(struct pci_dev *pdev,
}
#endif

+ /*
+ * Initialize hardware timer: we keep it running just in case
+ * that some program needs it later on.
+ */
+ memset(&adapter->clock, 0, sizeof(adapter->clock));
+ adapter->clock.read_clock = igb_read_clock;
+ adapter->clock.mask = (u64)(s64)-1;
+ adapter->clock.mult = 1;
+ adapter->clock.shift = IGB_TSYNC_SHIFT;
+ wr32(E1000_TIMINCA, (1<<24) | IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS * IGB_TSYNC_SCALE);
+#if 0
+ /*
+ * Avoid rollover while we initialize by resetting the time counter.
+ */
+ wr32(E1000_SYSTIML, 0x00000000);
+ wr32(E1000_SYSTIMH, 0x00000000);
+#else
+ /*
+ * Set registers so that rollover occurs soon to test this.
+ */
+ wr32(E1000_SYSTIML, 0x00000000);
+ wr32(E1000_SYSTIMH, 0xFF800000);
+#endif
+ wrfl();
+ clocksource_init_time(&adapter->clock, ktime_to_ns(ktime_get_real()));
+
+#ifdef DEBUG
+ {
+ char buffer[160];
+ printk(KERN_DEBUG
+ "igb: %s: hw %p initialized timer\n",
+ igb_get_time_str(adapter, buffer),
+ &adapter->hw);
+ }
+#endif
+
dev_info(&pdev->dev, "Intel(R) Gigabit Ethernet Network Connection\n");
/* print bus type/speed/width info */
dev_info(&pdev->dev, "%s: (PCIe:%s:%s) %pM\n",
--
1.6.0.4

2008-11-19 12:12:39

by Patrick Ohly

[permalink] [raw]
Subject: [RFC PATCH 11/11] igb: use clocksync to implement hardware time stamping

Both TX and RX hardware time stamping are implemented. Due to
hardware limitations it is not possible to verify reliably which
packet was time stamped when multiple were pending for sending; this
could be solved by only allowing one packet marked for hardware time
stamping into the queue (not implemented yet).

RX time stamping relies on the flag in the packet descriptor which
marks packets that were time stamped. In "all packet" mode this flag
is not set. TODO: also support that mode (even though it'll suffer
from race conditions).
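
For reference, configuring this from user space looks roughly like the
following (sketch only; hwtstamp_config/SIOCSHWTSTAMP come from the user
space API patch in this series, the interface name is just an example, and
CAP_NET_ADMIN is required):

	struct hwtstamp_config config = {
		.flags = 0,			/* reserved, must be zero */
		.tx_type = HWTSTAMP_TX_ON,
		.rx_filter_type = HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
	};
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", sizeof(ifr.ifr_name));
	ifr.ifr_data = (void *)&config;
	if (ioctl(fd, SIOCSHWTSTAMP, &ifr) < 0)
		perror("SIOCSHWTSTAMP");
	/* on return, config.rx_filter_type reports what the driver really
	   programmed (igb may answer HWTSTAMP_FILTER_SOME or fall back to
	   HWTSTAMP_FILTER_ALL, see below) */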
---
drivers/net/igb/e1000_82575.h | 1 +
drivers/net/igb/e1000_defines.h | 1 +
drivers/net/igb/e1000_regs.h | 40 +++++++
drivers/net/igb/igb.h | 2 +
drivers/net/igb/igb_main.c | 239 +++++++++++++++++++++++++++++++++++++--
5 files changed, 275 insertions(+), 8 deletions(-)

diff --git a/drivers/net/igb/e1000_82575.h b/drivers/net/igb/e1000_82575.h
index c1928b5..dd32a6f 100644
--- a/drivers/net/igb/e1000_82575.h
+++ b/drivers/net/igb/e1000_82575.h
@@ -116,6 +116,7 @@ union e1000_adv_tx_desc {
};

/* Adv Transmit Descriptor Config Masks */
+#define E1000_ADVTXD_MAC_TSTAMP 0x00080000 /* IEEE1588 Timestamp packet */
#define E1000_ADVTXD_DTYP_CTXT 0x00200000 /* Advanced Context Descriptor */
#define E1000_ADVTXD_DTYP_DATA 0x00300000 /* Advanced Data Descriptor */
#define E1000_ADVTXD_DCMD_IFCS 0x02000000 /* Insert FCS (Ethernet CRC) */
diff --git a/drivers/net/igb/e1000_defines.h b/drivers/net/igb/e1000_defines.h
index ce70068..2a19698 100644
--- a/drivers/net/igb/e1000_defines.h
+++ b/drivers/net/igb/e1000_defines.h
@@ -104,6 +104,7 @@
#define E1000_RXD_STAT_UDPCS 0x10 /* UDP xsum calculated */
#define E1000_RXD_STAT_TCPCS 0x20 /* TCP xsum calculated */
#define E1000_RXD_STAT_DYNINT 0x800 /* Pkt caused INT via DYNINT */
+#define E1000_RXD_STAT_TS 0x10000 /* Pkt was time stamped */
#define E1000_RXD_ERR_CE 0x01 /* CRC Error */
#define E1000_RXD_ERR_SE 0x02 /* Symbol Error */
#define E1000_RXD_ERR_SEQ 0x04 /* Sequence Error */
diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index 37f9d55..7b561a1 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -78,9 +78,37 @@

/* IEEE 1588 TIMESYNCH */
#define E1000_TSYNCTXCTL 0x0B614
+#define E1000_TSYNCTXCTL_VALID (1<<0)
+#define E1000_TSYNCTXCTL_ENABLED (1<<4)
#define E1000_TSYNCRXCTL 0x0B620
+#define E1000_TSYNCRXCTL_VALID (1<<0)
+#define E1000_TSYNCRXCTL_ENABLED (1<<4)
+enum {
+ E1000_TSYNCRXCTL_TYPE_L2_V2 = 0,
+ E1000_TSYNCRXCTL_TYPE_L4_V1 = (1<<1),
+ E1000_TSYNCRXCTL_TYPE_L2_L4_V2 = (1<<2),
+ E1000_TSYNCRXCTL_TYPE_ALL = (1<<3),
+ E1000_TSYNCRXCTL_TYPE_EVENT_V2 = (1<<3) | (1<<1),
+};
#define E1000_TSYNCRXCFG 0x05F50
+enum {
+ E1000_TSYNCRXCFG_PTP_V1_SYNC_MESSAGE = 0<<0,
+ E1000_TSYNCRXCFG_PTP_V1_DELAY_REQ_MESSAGE = 1<<0,
+ E1000_TSYNCRXCFG_PTP_V1_FOLLOWUP_MESSAGE = 2<<0,
+ E1000_TSYNCRXCFG_PTP_V1_DELAY_RESP_MESSAGE = 3<<0,
+ E1000_TSYNCRXCFG_PTP_V1_MANAGEMENT_MESSAGE = 4<<0,

+ E1000_TSYNCRXCFG_PTP_V2_SYNC_MESSAGE = 0<<8,
+ E1000_TSYNCRXCFG_PTP_V2_DELAY_REQ_MESSAGE = 1<<8,
+ E1000_TSYNCRXCFG_PTP_V2_PATH_DELAY_REQ_MESSAGE = 2<<8,
+ E1000_TSYNCRXCFG_PTP_V2_PATH_DELAY_RESP_MESSAGE = 3<<8,
+ E1000_TSYNCRXCFG_PTP_V2_FOLLOWUP_MESSAGE = 8<<8,
+ E1000_TSYNCRXCFG_PTP_V2_DELAY_RESP_MESSAGE = 9<<8,
+ E1000_TSYNCRXCFG_PTP_V2_PATH_DELAY_FOLLOWUP_MESSAGE = 0xA<<8,
+ E1000_TSYNCRXCFG_PTP_V2_ANNOUNCE_MESSAGE = 0xB<<8,
+ E1000_TSYNCRXCFG_PTP_V2_SIGNALLING_MESSAGE = 0xC<<8,
+ E1000_TSYNCRXCFG_PTP_V2_MANAGEMENT_MESSAGE = 0xD<<8,
+};
#define E1000_SYSTIML 0x0B600
#define E1000_SYSTIMH 0x0B604
#define E1000_TIMINCA 0x0B608
@@ -103,6 +131,18 @@
#define E1000_ETQF6 0x05CC8
#define E1000_ETQF7 0x05CCC

+/* Filtering Registers */
+#define E1000_SAQF(_n) (0x5980 + 4 * (_n))
+#define E1000_DAQF(_n) (0x59A0 + 4 * (_n))
+#define E1000_SPQF(_n) (0x59C0 + 4 * (_n))
+#define E1000_FTQF(_n) (0x59E0 + 4 * (_n))
+#define E1000_SAQF0 E1000_SAQF(0)
+#define E1000_DAQF0 E1000_DAQF(0)
+#define E1000_SPQF0 E1000_SPQF(0)
+#define E1000_FTQF0 E1000_FTQF(0)
+#define E1000_SYNQF(_n) (0x055FC + (4 * (_n))) /* SYN Packet Queue Fltr */
+#define E1000_ETQF(_n) (0x05CB0 + (4 * (_n))) /* EType Queue Fltr */
+
/* Split and Replication RX Control - RW */
/*
* Convenience macros
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 2938ab3..86ef1a2 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -35,6 +35,7 @@
#include "e1000_82575.h"

#include <linux/clocksource.h>
+#include <linux/clocksync.h>

struct igb_adapter;

@@ -265,6 +266,7 @@ struct igb_adapter {
struct pci_dev *pdev;
struct net_device_stats net_stats;
struct clocksource clock;
+ struct clocksync sync;

/* structs defined in e1000_hw.h */
struct e1000_hw hw;
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index ee39aee..44e8ac5 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -228,6 +228,13 @@ static cycle_t igb_read_clock(struct clocksource *cs)
return stamp;
}

+static ktime_t igb_hwtstamp_raw2sys(struct net_device *netdev,
+ ktime_t stamp)
+{
+ struct igb_adapter *adapter = netdev_priv(netdev);
+ return clocksync_hw2sys(&adapter->sync, ktime_to_ns(stamp));
+}
+
#ifdef DEBUG
/**
* igb_get_hw_dev_name - return device name string
@@ -245,6 +252,7 @@ char *igb_get_hw_dev_name(struct e1000_hw *hw)
static char *igb_get_time_str(struct igb_adapter *adapter,
char buffer[160])
{
+ cycle_t hw = clocksource_read(&adapter->clock);
struct timespec nic = ns_to_timespec(clocksource_read_time(&adapter->clock));
struct timespec sys;
struct timespec delta;
@@ -252,7 +260,8 @@ static char *igb_get_time_str(struct igb_adapter *adapter,

delta = timespec_sub(nic, sys);

- sprintf(buffer, "NIC %ld.%09lus, SYS %ld.%09lus, NIC-SYS %lds + %09luns",
+ sprintf(buffer, "HW %llu, NIC %ld.%09lus, SYS %ld.%09lus, NIC-SYS %lds + %09luns",
+ hw,
(long)nic.tv_sec, nic.tv_nsec,
(long)sys.tv_sec, sys.tv_nsec,
(long)delta.tv_sec, delta.tv_nsec);
@@ -1369,6 +1378,19 @@ static int __devinit igb_probe(struct pci_dev *pdev,
wrfl();
clocksource_init_time(&adapter->clock, ktime_to_ns(ktime_get_real()));

+ /*
+ * Synchronize our NIC clock against system wall clock. NIC
+ * time stamp reading requires ~3us per sample, each sample
+ * was pretty stable even under load => only require 10
+ * samples for each offset comparison.
+ */
+ memset(&adapter->sync, 0, sizeof(adapter->sync));
+ adapter->sync.clock = &adapter->clock;
+ adapter->sync.systime = ktime_get_real;
+ adapter->sync.num_samples = 10;
+ clocksync_update(&adapter->sync, 0);
+ netdev->hwtstamp_raw2sys = igb_hwtstamp_raw2sys;
+
#ifdef DEBUG
{
char buffer[160];
@@ -2738,6 +2760,7 @@ set_itr_now:
#define IGB_TX_FLAGS_VLAN 0x00000002
#define IGB_TX_FLAGS_TSO 0x00000004
#define IGB_TX_FLAGS_IPV4 0x00000008
+#define IGB_TX_FLAGS_TSTAMP 0x00000010
#define IGB_TX_FLAGS_VLAN_MASK 0xffff0000
#define IGB_TX_FLAGS_VLAN_SHIFT 16

@@ -2958,6 +2981,9 @@ static inline void igb_tx_queue_adv(struct igb_adapter *adapter,
if (tx_flags & IGB_TX_FLAGS_VLAN)
cmd_type_len |= E1000_ADVTXD_DCMD_VLE;

+ if (tx_flags & IGB_TX_FLAGS_TSTAMP)
+ cmd_type_len |= E1000_ADVTXD_MAC_TSTAMP;
+
if (tx_flags & IGB_TX_FLAGS_TSO) {
cmd_type_len |= E1000_ADVTXD_DCMD_TSE;

@@ -3070,7 +3096,27 @@ static int igb_xmit_frame_ring_adv(struct sk_buff *skb,
/* this is a hard error */
return NETDEV_TX_BUSY;
}
- skb_orphan(skb);
+
+ /*
+ * TODO: check that there currently is no other packet with
+ * time stamping in the queue
+ *
+ * when doing time stamping, keep the connection to the socket
+ * a while longer, it is still needed by skb_hwtstamp_tx(), either
+ * in igb_clean_tx_irq() or
+ */
+ if (skb_hwtstamp_check_tx_hardware(skb)) {
+ skb_hwtstamp_set_tx_in_progress(skb);
+ tx_flags |= IGB_TX_FLAGS_TSTAMP;
+ } else if (!skb_hwtstamp_check_tx_software(skb)) {
+ /*
+ * TODO: can this be solved in dev.c:dev_hard_start_xmit()?
+ * There are probably unmodified drivers which do something
+ * like this and thus don't work in combination with
+ * SOF_TIMESTAMPING_TX_SOFTWARE.
+ */
+ skb_orphan(skb);
+ }

if (adapter->vlgrp && vlan_tx_tag_present(skb)) {
tx_flags |= IGB_TX_FLAGS_VLAN;
@@ -3761,6 +3807,28 @@ static bool igb_clean_tx_irq(struct igb_ring *tx_ring)
skb->len;
total_packets += segs;
total_bytes += bytecount;
+
+ /*
+ * if we were asked to do hardware
+ * stamping and such a time stamp is
+ * available, then it must have been
+ * for this one here because we only
+ * allow only one such packet into the
+ * queue
+ */
+ if (skb_hwtstamp_check_tx_hardware(skb)) {
+ u32 valid = rd32(E1000_TSYNCTXCTL) & E1000_TSYNCTXCTL_VALID;
+ if (valid) {
+ u64 tstamp = rd32(E1000_TXSTMPL);
+ tstamp |= (u64)rd32(E1000_TXSTMPH) << 32;
+ clocksync_update(&adapter->sync, tstamp);
+ skb_hwtstamp_tx(skb,
+ ns_to_ktime(clocksource_cyc2time(&adapter->clock,
+ tstamp)),
+ netdev);
+ }
+ skb_orphan(skb);
+ }
}

igb_unmap_and_free_tx_resource(adapter, buffer_info);
@@ -3943,6 +4011,7 @@ static bool igb_clean_rx_irq_adv(struct igb_ring *rx_ring,
{
struct igb_adapter *adapter = rx_ring->adapter;
struct net_device *netdev = adapter->netdev;
+ struct e1000_hw *hw = &adapter->hw;
struct pci_dev *pdev = adapter->pdev;
union e1000_adv_rx_desc *rx_desc , *next_rxd;
struct igb_buffer *buffer_info , *next_buffer;
@@ -4032,6 +4101,38 @@ send_up:
goto next_desc;
}

+ /*
+ * If this bit is set, then the RX registers contain
+ * the time stamp. No other packet will be time
+ * stamped until we read these registers, so read the
+ * registers to make them available again. Because
+ * only one packet can be time stamped at a time, we
+ * know that the register values must belong to this
+ * one here and therefore we don't need to compare
+ * any of the additional attributes stored for it.
+ *
+ * TODO: can time stamping be triggered (thus locking
+ * the registers) without the packet reaching this point
+ * here? In that case RX time stamping would get stuck.
+ *
+ * TODO: in "time stamp all packets" mode this bit is
+ * not set. Need a global flag for this mode and then
+ * always read the registers. Cannot be done without
+ * a race condition.
+ */
+ if (staterr & E1000_RXD_STAT_TS) {
+ u64 tstamp;
+
+ WARN(!(rd32(E1000_TSYNCRXCTL) & E1000_TSYNCRXCTL_VALID),
+ "igb: no RX time stamp available for time stamped packet");
+ tstamp = rd32(E1000_RXSTMPL);
+ tstamp |= (u64)rd32(E1000_RXSTMPH) << 32;
+ clocksync_update(&adapter->sync, tstamp);
+ skb_hwtstamp_set(skb,
+ ns_to_ktime(clocksource_cyc2time(&adapter->clock,
+ tstamp)));
+ }
+
if (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) {
dev_kfree_skb_irq(skb);
goto next_desc;
@@ -4226,12 +4327,32 @@ static int igb_mii_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
* @ifreq:
* @cmd:
*
- * Currently cannot enable any kind of hardware time stamping, but
- * supports SIOCSHWTSTAMP in general.
+ * Outgoing time stamping can be enabled and disabled. Play nice and
+ * disable it when requested, although it shouldn't cause any overhead
+ * when no packet needs it. At most one packet in the queue may be
+ * marked for time stamping, otherwise it would be impossible to tell
+ * for sure to which packet the hardware time stamp belongs.
+ *
+ * Incoming time stamping has to be configured via the hardware
+ * filters. Not all combinations are supported, in particular event
+ * type has to be specified. Matching the kind of event packet is
+ * not supported, with the exception of "all V2 events regardless of
+ * level 2 or 4".
+ *
**/
static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
{
+ struct igb_adapter *adapter = netdev_priv(netdev);
+ struct e1000_hw *hw = &adapter->hw;
struct hwtstamp_config config;
+ u32 tsync_tx_ctl_bit = E1000_TSYNCTXCTL_ENABLED;
+ u32 tsync_rx_ctl_bit = E1000_TSYNCRXCTL_ENABLED;
+ u32 tsync_rx_ctl_type = 0;
+ u32 tsync_rx_cfg = 0;
+ int is_l4 = 0;
+ int is_l2 = 0;
+ short port = 319; /* PTP */
+ u32 regval;

if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
return -EFAULT;
@@ -4240,11 +4361,113 @@ static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int
if (config.flags)
return -EINVAL;

- if (config.tx_type == HWTSTAMP_TX_OFF &&
- config.rx_filter_type == HWTSTAMP_FILTER_NONE)
- return 0;
+ switch (config.tx_type) {
+ case HWTSTAMP_TX_OFF:
+ tsync_tx_ctl_bit = 0;
+ break;
+ case HWTSTAMP_TX_ON:
+ tsync_tx_ctl_bit = E1000_TSYNCTXCTL_ENABLED;
+ break;
+ default:
+ return -ERANGE;
+ }
+
+ switch (config.rx_filter_type) {
+ case HWTSTAMP_FILTER_NONE:
+ tsync_rx_ctl_bit = 0;
+ break;
+ case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
+ case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
+ case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
+ case HWTSTAMP_FILTER_ALL:
+ /*
+ * register TSYNCRXCFG must be set, therefore it is not
+ * possible to time stamp both Sync and Delay_Req messages
+ * => fall back to time stamping all packets
+ */
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_ALL;
+ config.rx_filter_type = HWTSTAMP_FILTER_ALL;
+ break;
+ case HWTSTAMP_FILTER_PTP_V1_L4_SYNC:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L4_V1;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V1_SYNC_MESSAGE;
+ is_l4 = 1;
+ break;
+ case HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L4_V1;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V1_DELAY_REQ_MESSAGE;
+ is_l4 = 1;
+ break;
+ case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
+ case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L2_L4_V2;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V2_SYNC_MESSAGE;
+ is_l2 = 1;
+ is_l4 = 1;
+ config.rx_filter_type = HWTSTAMP_FILTER_SOME;
+ break;
+ case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
+ case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L2_L4_V2;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V2_DELAY_REQ_MESSAGE;
+ is_l2 = 1;
+ is_l4 = 1;
+ config.rx_filter_type = HWTSTAMP_FILTER_SOME;
+ break;
+ case HWTSTAMP_FILTER_PTP_V2_EVENT:
+ case HWTSTAMP_FILTER_PTP_V2_SYNC:
+ case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_EVENT_V2;
+ config.rx_filter_type = HWTSTAMP_FILTER_PTP_V2_EVENT;
+ is_l2 = 1;
+ break;
+ default:
+ return -ERANGE;
+ }
+
+ /* enable/disable TX */
+ regval = rd32(E1000_TSYNCTXCTL);
+ regval = (regval & ~E1000_TSYNCTXCTL_ENABLED) | tsync_tx_ctl_bit;
+ wr32(E1000_TSYNCTXCTL, regval);
+
+ /* enable/disable RX, define which PTP packets are time stamped */
+ regval = rd32(E1000_TSYNCRXCTL);
+ regval = (regval & ~E1000_TSYNCRXCTL_ENABLED) | tsync_rx_ctl_bit;
+ regval = (regval & ~0xE) | tsync_rx_ctl_type;
+ wr32(E1000_TSYNCRXCTL, regval);
+ wr32(E1000_TSYNCRXCFG, tsync_rx_cfg);
+
+ /*
+ * Ethertype Filter Queue Filter[0][15:0] = 0x88F7 (Ethertype to filter on)
+ * Ethertype Filter Queue Filter[0][26] = 0x1 (Enable filter)
+ * Ethertype Filter Queue Filter[0][30] = 0x1 (Enable Timestamping)
+ */
+ wr32(E1000_ETQF0, is_l2 ? 0x440088f7 : 0);
+
+ /* L4 Queue Filter[0]: only filter by source and destination port */
+ wr32(E1000_SPQF0, htons(port));
+ wr32(E1000_IMIREXT(0), is_l4 ?
+ ((1<<12) | (1<<19) /* bypass size and control flags */) : 0);
+ wr32(E1000_IMIR(0), is_l4 ?
+ (htons(port)
+ | (0<<16) /* immediate interrupt disabled */
+ | 0 /* (1<<17) bit cleared: do not bypass destination port check */)
+ : 0);
+ wr32(E1000_FTQF0, is_l4 ?
+ (0x11 /* UDP */
+ | (1<<15) /* VF not compared */
+ | (1<<27) /* Enable Timestamping */
+ | (7<<28) /* only source port filter enabled, source/target address and protocol masked */ )
+ : ( (1<<15) | (15<<28) /* all mask bits set = filter not enabled */));
+
+ wrfl();
+
+ /* clear TX/RX time stamp registers, just to be sure */
+ regval = rd32(E1000_TXSTMPH);
+ regval = rd32(E1000_RXSTMPH);

- return -ERANGE;
+ return copy_to_user(ifr->ifr_data, &config, sizeof(config)) ?
+ -EFAULT : 0;
}

/**
--
1.6.0.4

2008-11-19 12:13:24

by Patrick Ohly

[permalink] [raw]
Subject: [RFC PATCH 07/11] igb: stub support for SIOCSHWTSTAMP

---
drivers/net/igb/igb_main.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 89ffc07..be8e2b8 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -34,6 +34,7 @@
#include <linux/ipv6.h>
#include <net/checksum.h>
#include <net/ip6_checksum.h>
+#include <net/timestamping.h>
#include <linux/mii.h>
#include <linux/ethtool.h>
#include <linux/if_vlan.h>
@@ -4115,6 +4116,33 @@ static int igb_mii_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
}

/**
+ * igb_hwtstamp_ioctl - control hardware time stamping
+ * @netdev:
+ * @ifreq:
+ * @cmd:
+ *
+ * Currently cannot enable any kind of hardware time stamping, but
+ * supports SIOCSHWTSTAMP in general.
+ **/
+static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
+{
+ struct hwtstamp_config config;
+
+ if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
+ return -EFAULT;
+
+ /* reserved for future extensions */
+ if (config.flags)
+ return -EINVAL;
+
+ if (config.tx_type == HWTSTAMP_TX_OFF &&
+ config.rx_filter_type == HWTSTAMP_FILTER_NONE)
+ return 0;
+
+ return -ERANGE;
+}
+
+/**
* igb_ioctl -
* @netdev:
* @ifreq:
@@ -4127,6 +4155,8 @@ static int igb_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
case SIOCGMIIREG:
case SIOCSMIIREG:
return igb_mii_ioctl(netdev, ifr, cmd);
+ case SIOCSHWTSTAMP:
+ return igb_hwtstamp_ioctl(netdev, ifr, cmd);
default:
return -EOPNOTSUPP;
}
--
1.6.0.4

2008-11-19 12:13:39

by Patrick Ohly

[permalink] [raw]
Subject: [RFC PATCH 10/11] time sync: generic infrastructure to map between time stamps generated by a clock source and system time

Currently only mapping from clock source to system time is implemented.
The interface could have been made more versatile by not depending on a clock source,
but this wasn't done to avoid writing glue code elsewhere.

The method implemented here is the one used and analyzed under the name
"assisted PTP" in the LCI PTP paper:
http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf
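
A typical user wires it up roughly like this (sketch; the igb patch in this
series does the same, and hwtime_ns stands for a hardware stamp that was
already converted with clocksource_cyc2time()):

	/* once, after the clocksource has been initialized */
	adapter->sync.clock = &adapter->clock;
	adapter->sync.systime = ktime_get_real;
	adapter->sync.num_samples = 10;
	clocksync_update(&adapter->sync, 0);	/* force first measurement */

	/* per time stamped packet: refresh offset/skew (at most once per
	   second) and map the hardware stamp onto system time */
	clocksync_update(&adapter->sync, hwtime_ns);
	ktime_t sys_stamp = clocksync_hw2sys(&adapter->sync, hwtime_ns);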
---
include/linux/clocksync.h | 139 +++++++++++++++++++++++++++++++++++++++++++++
kernel/time/Makefile | 2 +-
kernel/time/clocksync.c | 117 +++++++++++++++++++++++++++++++++++++
3 files changed, 257 insertions(+), 1 deletions(-)
create mode 100644 include/linux/clocksync.h
create mode 100644 kernel/time/clocksync.c

diff --git a/include/linux/clocksync.h b/include/linux/clocksync.h
new file mode 100644
index 0000000..e8c8fad
--- /dev/null
+++ b/include/linux/clocksync.h
@@ -0,0 +1,139 @@
+/*
+ * Utility code which helps transforming between hardware time stamps
+ * generated by a clocksource and system time. The clocksource is
+ * assumed to return monotonically increasing time (but this code does
+ * its best to compensate if that is not the case) whereas system time
+ * may jump.
+ */
+#ifndef _LINUX_CLOCKSYNC_H
+#define _LINUX_CLOCKSYNC_H
+
+#include <linux/clocksource.h>
+#include <linux/ktime.h>
+
+/**
+ * struct clocksync - stores state and configuration for the two clocks
+ *
+ * Initialize to zero, then set clock, systime, num_samples.
+ *
+ * Transformation between HW time and system time is done with:
+ * HW time transformed = HW time + offset +
+ * (HW time - last_update) * skew / CLOCKSYNC_SKEW_RESOLUTION
+ *
+ * @clock: the source for HW time stamps (%clocksource_read_time)
+ * @systime: function returning current system time (ktime_get
+ * for monotonic time, or ktime_get_real for wall clock)
+ * @num_samples: number of times that HW time and system time are to
+ * be compared when determining their offset
+ * @offset: (system time - HW time) at the time of the last update
+ * @skew: average (system time - HW time) / delta HW time *
+ * CLOCKSYNC_SKEW_RESOLUTION
+ * @last_update: last HW time stamp when clock offset was measured
+ */
+struct clocksync {
+ struct clocksource *clock;
+ ktime_t (*systime)(void);
+ int num_samples;
+
+ s64 offset;
+ s64 skew;
+ u64 last_update;
+};
+
+/**
+ * CLOCKSYNC_SKEW_RESOLUTION - fixed point arithmetic scale factor for skew
+ *
+ * Usually one would measure skew in ppb (parts per billion, 1e9), but
+ * using a power of two simplifies the math.
+ */
+#define CLOCKSYNC_SKEW_RESOLUTION (((s64)1)<<30)
+
+/**
+ * clocksync_hw2sys - transform HW time stamp into corresponding system time
+ * @sync: context for clock sync
+ * @hwtstamp: the result of %clocksource_read_time or
+ * %clocksource_cyc2time
+ */
+static inline ktime_t clocksync_hw2sys(struct clocksync *sync,
+ u64 hwtstamp)
+{
+ u64 nsec;
+
+ nsec = hwtstamp + sync->offset;
+ nsec += (s64)(hwtstamp - sync->last_update) * sync->skew /
+ CLOCKSYNC_SKEW_RESOLUTION;
+
+ return ns_to_ktime(nsec);
+}
+
+/**
+ * clocksync_offset - measure current (system time - HW time) offset
+ * @sync: context for clock sync
+ * @offset: average offset during sample period returned here
+ * @hwtstamp: average HW time during sample period returned here
+ *
+ * Returns number of samples used. Might be zero (= no result) in the
+ * unlikely case that system time was monotonically decreasing for all
+ * samples (= broken).
+ */
+int clocksync_offset(struct clocksync *sync,
+ s64 *offset,
+ u64 *hwtstamp);
+
+/**
+ * clocksync_update - update offset and skew by measuring current offset
+ * @sync: context for clock sync
+ * @hwtstamp: the result of %clocksource_read_time or
+ * %clocksource_cyc2time, pass zero to force update
+ *
+ * Updates are only done at most once per second.
+ */
+static inline void clocksync_update(struct clocksync *sync,
+ u64 hwtstamp)
+{
+ s64 offset;
+ u64 average_time;
+
+ if (hwtstamp &&
+ (s64)(hwtstamp - sync->last_update) < NSEC_PER_SEC)
+ return;
+
+ if (!clocksync_offset(sync, &offset, &average_time))
+ return;
+
+ printk(KERN_DEBUG
+ "average offset: %lld\n", offset);
+
+ if (!sync->last_update) {
+ sync->last_update = average_time;
+ sync->offset = offset;
+ sync->skew = 0;
+ } else {
+ s64 delta_nsec = average_time - sync->last_update;
+
+ /* avoid division by negative or small deltas */
+ if (delta_nsec >= 10000) {
+ s64 delta_offset_nsec = offset - sync->offset;
+ s64 skew = delta_offset_nsec *
+ CLOCKSYNC_SKEW_RESOLUTION /
+ delta_nsec;
+
+ /**
+ * Calculate new overall skew as 4/16 the
+ * old value and 12/16 the new one. This is
+ * a rather arbitrary tradeoff between
+ * only using the latest measurement (0/16 and
+ * 16/16) and even more weight on past measurements.
+ */
+#define CLOCKSYNC_NEW_SKEW_PER_16 12
+ sync->skew =
+ ((16 - CLOCKSYNC_NEW_SKEW_PER_16) * sync->skew +
+ CLOCKSYNC_NEW_SKEW_PER_16 * skew) /
+ 16;
+ sync->last_update = average_time;
+ sync->offset = offset;
+ }
+ }
+}
+
+#endif /* _LINUX_CLOCKSYNC_H */
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 905b0b5..6279fb0 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -1,4 +1,4 @@
-obj-y += timekeeping.o ntp.o clocksource.o jiffies.o timer_list.o
+obj-y += timekeeping.o ntp.o clocksource.o jiffies.o timer_list.o clocksync.o

obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD) += clockevents.o
obj-$(CONFIG_GENERIC_CLOCKEVENTS) += tick-common.o
diff --git a/kernel/time/clocksync.c b/kernel/time/clocksync.c
new file mode 100644
index 0000000..470ef11
--- /dev/null
+++ b/kernel/time/clocksync.c
@@ -0,0 +1,117 @@
+/*
+ * Utility code which helps transforming between hardware time stamps
+ * generated by a clocksource and system time.
+ *
+ * Copyright (C) 2008 Intel, Patrick Ohly ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/clocksync.h>
+#include <linux/module.h>
+
+int clocksync_offset(struct clocksync *sync,
+ s64 *offset,
+ u64 *hwtstamp)
+{
+ u64 starthw = 0, endhw = 0;
+ struct {
+ s64 offset;
+ s64 duration_sys;
+ } buffer[10], sample, *samples;
+ int counter = 0, i;
+ int used;
+ int index;
+ int num_samples = sync->num_samples;
+
+ if (num_samples > sizeof(buffer)/sizeof(buffer[0])) {
+ samples = kmalloc(sizeof(*samples) * num_samples, GFP_ATOMIC);
+ if (!samples) {
+ samples = buffer;
+ num_samples = sizeof(buffer)/sizeof(buffer[0]);
+ }
+ } else {
+ samples = buffer;
+ }
+
+ /* run until we have enough valid samples, but do not try forever */
+ i = 0;
+ counter = 0;
+ while (1) {
+ u64 ts;
+ ktime_t start, end;
+
+ start = sync->systime();
+ ts = clocksource_read_time(sync->clock);
+ end = sync->systime();
+
+ if (!i) {
+ starthw = ts;
+ }
+
+ /* ignore negative durations */
+ sample.duration_sys = ktime_to_ns(ktime_sub(end, start));
+ if (sample.duration_sys >= 0) {
+ /*
+ * assume symmetric delay to and from HW: average system time
+ * corresponds to measured HW time
+ */
+ sample.offset = ktime_to_ns(ktime_add(end, start)) / 2 -
+ ts;
+
+ /* simple insertion sort based on duration */
+ index = counter - 1;
+ while (index >= 0) {
+ if(samples[index].duration_sys < sample.duration_sys) {
+ break;
+ }
+ samples[index + 1] = samples[index];
+ index--;
+ }
+ samples[index + 1] = sample;
+ counter++;
+ }
+
+ i++;
+ if (counter >= num_samples || i >= 100000) {
+ endhw = ts;
+ break;
+ }
+ }
+
+ *hwtstamp = (endhw + starthw) / 2;
+
+ /* remove outliers by only using 75% of the samples */
+ used = counter * 3 / 4;
+ if (!used) {
+ used = counter;
+ }
+ if (used) {
+ /* calculate average */
+ s64 off = 0;
+ for (index = 0; index < used; index++) {
+ off += samples[index].offset;
+ }
+ off /= used;
+ *offset = off;
+ }
+
+ if (samples && samples != buffer)
+ kfree(samples);
+
+ return used;
+}
+
+EXPORT_SYMBOL_GPL(clocksync_offset);
--
1.6.0.4

2008-11-19 15:21:24

by Patrick Ohly

[permalink] [raw]
Subject: Re: [RFC PATCH 03/11] net: infrastructure for hardware time stamping

On Wed, 2008-11-19 at 12:08 +0000, Ohly, Patrick wrote:
> + struct sock_exterr_skb *serr;
[...]
> + memset(serr, 0, sizeof(serr));

Before someone else mentions it: this was meant to be "sizeof(*serr)" of
course.

--
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

2008-11-20 01:15:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH 10/11] time sync: generic infrastructure to map between time stamps generated by a clock source and system time


These patches add a lot of quite large inlined functions.

On Wed, 19 Nov 2008 13:08:47 +0100
Patrick Ohly <[email protected]> wrote:

> +static inline void clocksync_update(struct clocksync *sync,
> + u64 hwtstamp)
> +{
> + s64 offset;
> + u64 average_time;
> +
> + if (hwtstamp &&
> + (s64)(hwtstamp - sync->last_update) < NSEC_PER_SEC)
> + return;
> +
> + if (!clocksync_offset(sync, &offset, &average_time))
> + return;
> +
> + printk(KERN_DEBUG
> + "average offset: %lld\n", offset);
> +
> + if (!sync->last_update) {
> + sync->last_update = average_time;
> + sync->offset = offset;
> + sync->skew = 0;
> + } else {
> + s64 delta_nsec = average_time - sync->last_update;
> +
> + /* avoid division by negative or small deltas */
> + if (delta_nsec >= 10000) {
> + s64 delta_offset_nsec = offset - sync->offset;
> + s64 skew = delta_offset_nsec *
> + CLOCKSYNC_SKEW_RESOLUTION /
> + delta_nsec;
> +
> + /**
> + * Calculate new overall skew as 4/16 the
> + * old value and 12/16 the new one. This is
> + * a rather arbitrary tradeoff between
> + * only using the latest measurement (0/16 and
> + * 16/16) and even more weight on past measurements.
> + */
> +#define CLOCKSYNC_NEW_SKEW_PER_16 12
> + sync->skew =
> + ((16 - CLOCKSYNC_NEW_SKEW_PER_16) * sync->skew +
> + CLOCKSYNC_NEW_SKEW_PER_16 * skew) /
> + 16;
> + sync->last_update = average_time;
> + sync->offset = offset;
> + }
> + }
> +}

This one is a champ.

The token '/**' is used exclusively to introduce kerneldoc-formatted
comments. Please check the patches for comments which are incorrectly
thus-tagged.

Please cc [email protected] on patches which affect the
kernel's userspace interfaces.

2008-11-20 07:09:23

by Patrick Ohly

[permalink] [raw]
Subject: RE: [RFC PATCH 10/11] time sync: generic infrastructure to map between time stamps generated by a clock source and system time

Andrew wrote:
> These patches add a lot of quite large inlined functions.

Right, I'll need to clean this up once it is clear which code
is really going to be needed.

> On Wed, 19 Nov 2008 13:08:47 +0100
> Patrick Ohly <[email protected]> wrote:
>
> > +static inline void clocksync_update(struct clocksync *sync,
> > + u64 hwtstamp)
> > +{
> > + s64 offset;
> > + u64 average_time;
> > +
> > + if (hwtstamp &&
> > + (s64)(hwtstamp - sync->last_update) < NSEC_PER_SEC)
> > + return;
> > +

In this example, the check is going to avoid a function call when
inlined in most of the cases. This was the motivation for making
the function inline in the first place. The rest of it should be
split off into a non-inline helper function. A "likely()" should
be added, too.
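
Something along these lines, where __clocksync_update is just a placeholder
name for the out-of-line part in kernel/time/clocksync.c:

	static inline void clocksync_update(struct clocksync *sync, u64 hwtstamp)
	{
		/* fast path: less than one second since the last update */
		if (likely(hwtstamp &&
			   (s64)(hwtstamp - sync->last_update) < NSEC_PER_SEC))
			return;

		__clocksync_update(sync, hwtstamp);
	}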

> The token '/**' is used exclusively to introduce kerneldoc-formatted
> comments. Please check the patches for comments which are incorrectly
> thus-tagged.

Sorry for that, will do. Old habits die hard.

> Please cc [email protected] on patches which affect the
> kernel's userspace interfaces.

Okay.

Bye, Patrick

2008-12-05 21:05:43

by john stultz

[permalink] [raw]
Subject: Re: [RFC PATCH 08/11] clocksource: allow usage independent of timekeeping.c

On Wed, Nov 19, 2008 at 4:08 AM, Patrick Ohly <[email protected]> wrote:
> So far struct clocksource acted as the interface between time/timekeeping
> and hardware. This patch generalizes the concept so that the same
> interface can also be used in other contexts.

Hey Patrick,
Sorry for not noticing this thread earlier!

> The only change as far as kernel/time/timekeeping is concerned is that
> the hardware access can be done either with or without passing
> the clocksource pointer as context. This is necessary in those
> cases when there is more than one instance of the hardware.

So as a heads up, the bit about passing the clocksource to the
read_clock() function looks very similar to a bit of what Magnus Damm
was recently working on.

> The extensions in this patch add code which turns the raw cycle count
> provided by hardware into a continuously increasing time value. This
> reuses fields also used by timekeeping.c. Because of slightly different
> semantics (__get_nsec_offset does not update cycle_last, clocksource_read_ns
> does that transparently) timekeeping.c was not modified to use the
> generalized code.

Hrm.. I'm a little wary here. Your patch basically creates new
semantics to how the clocksource structure is used, which will likely
cause confusion. I'll agree that the clocksource structure has been
somewhat more cluttered with timekeeping-isms than I'd prefer, so
maybe your patches give us the need to clean it up and better separate
the hardware clocksource accessor information and the timekeeping
state.

So to be clear, let me see if I understand your needs from your patch:

1) Need an interface to a counter, whose interface monotonically increases.
2) Need to translate the counter to nanoseconds and nanoseconds back
to the counter
3) The counter will likely not be registered for use in timekeeping
4) The counter's sense of time will not be steered via frequency adjustments.

Is that about the right set of assumptions?

So if we break the clocksource structure into two portions (ignore
name details for now)

struct counter {
char* name,
u32 mult,
u32 shift,
cycle_t mask,
cycle_t (*read)(struct counter*);
cycle_t (*vread)(void);

/* bits needed here for real monotonic interface, more on that below */

/* other arch specific needs */
}

struct timeclock {
struct counter* counter,
u32 adjusted_mult,
cycle_t cycle_last,
u32 flags;
u64 xtime_nsec;
s64 error;
/* other timekeeping bits go here */
}

So given that, do you think you'd be ok with using just the first
counter structure?

Now there's sort of a larger problem I've glossed over. Specifically in
assumption #1 up there. The bit about the interface to the monotonic
counter. Now many hardware counters wrap, and some wrap fairly
quickly. This means we need to have some sort of infrastructure to
periodically accumulate cycles into some "cycle store" value. As long
as the cycle store is 64bits wide, we probably don't have to worry
about overflows (if I recall 64bits at 1GHz gives us ~500 years).

Now, currently the timekeeping core does this for the active in-use
clocksource. However, if we have a number of counter structs that are
being used in different contexts, maybe three registered for
timekeeping, and a few more for different types of timestamping (maybe
audio, networking, maybe even performance counters?), we suddenly have
to do the accumulation step on a quite a few counters to avoid
wrapping.

You dodged this accumulation infrastructure in your patch, by just
accumulating at read time. This works, as long as you can guarantee
that read events occur more often than the wrap frequency. And in most
cases that's probably not too hard, but with some in-development
work, like the -rt patches, kernel work (even most interrupt
processing) can be deferred by high priority tasks for an unlimited
amount of time.

So this requires thinking this through maybe a bit more, trying to
figure out how to create a guaranteed accumulation frequency, but only
do so on counters that are really actively in use (we don't want to
accumulate on counters that no one cares about). It's probably not too
much work, but we may want to consider other approaches as well.
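
One possible shape (sketch with made-up foo_* names): every driver exporting
such a counter arms a work item that samples it well within the wrap period,
so the read-time accumulation never misses a rollover even when nobody else
reads the clock:

	static void foo_clock_watchdog(struct work_struct *work)
	{
		struct foo_adapter *adapter =
			container_of(work, struct foo_adapter, clock_watchdog.work);

		/* folds elapsed cycles into the 64 bit nanosecond count */
		clocksource_read_time(&adapter->clock);
		schedule_delayed_work(&adapter->clock_watchdog,
				      FOO_CLOCK_WATCHDOG_PERIOD);
	}

Of course a work item can itself be starved by RT tasks, so this only
narrows the window rather than closing it completely.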

Another issue that multiple clocksources can cause, is dealing with
time intervals between clocksources. Different clocksources may be
driven by different crystals, so they will drift apart. Also since the
clocksource used for timekeeping is adjusted by adjtimex(), you'll
likely have to deal with small differences in system time intervals
and clocksource time intervals.

I see you've maybe tried to address some of this with the following
time_sync patch, but I'm not sure I've totally grokked that patch yet.


Anyway, sorry to ramble on so much. I'm really interested in your
work, it's really interesting! But we might want to make sure the right
changes are being made to the right place so we don't get too much
confusion with the same variables meaning different things in
different contexts.

thanks
-john

2008-12-11 12:11:58

by Patrick Ohly

[permalink] [raw]
Subject: Re: [RFC PATCH 08/11] clocksource: allow usage independent of timekeeping.c

On Fri, 2008-12-05 at 21:05 +0000, john stultz wrote:
> On Wed, Nov 19, 2008 at 4:08 AM, Patrick Ohly <[email protected]> wrote:
> > So far struct clocksource acted as the interface between time/timekeeping
> > and hardware. This patch generalizes the concept so that the same
> > interface can also be used in other contexts.
>
> Hey Patrick,
> Sorry for not noticing this thread earlier!

No problem, it's not holding up anything. The question of how to extend
skb hasn't been settled either. Thanks for taking the time to consider
it.

> > The extensions in this patch add code which turns the raw cycle count
> > provided by hardware into a continuously increasing time value. This
> > reuses fields also used by timekeeping.c. Because of slightly different
> > semantics (__get_nsec_offset does not update cycle_last, clocksource_read_ns
> > does that transparently) timekeeping.c was not modified to use the
> > generalized code.
>
> Hrm.. I'm a little wary here. Your patch basically creates new
> semantics to how the clocksource structure is used, which will likely
> cause confusion.

That's true. I could keep the code separate, if that helps. I just
didn't want to duplicate the whole structure definition.

> I'll agree that the clocksource structure has been
> somewhat more cluttered with timekeeping-isms than I'd prefer, so
> maybe your patches give us the need to clean it up and better separate
> the hardware clocksource accessor information and the timekeeping
> state.
>
> So to be clear, let me see if I understand your needs from your patch:
>
> > 1) Need an interface to a counter, whose interface monotonically increases.
> 2) Need to translate the counter to nanoseconds and nanoseconds back
> to the counter

There are two additional ways of using the counter:
* Get nanosecond delay measurements (clocksource_read_ns). Calling this
"resets" the counter.
* Get a continuously increasing timer value
(clocksource_init_time/clocksource_read_time). The clock is only reset
when calling clocksource_init_time().

The two are mutually exclusive because clocksource_read_time() depends
on clocksource_read_ns(). If this is too confusing, then
clocksource_read_ns() could be turned into an internal helper function.
I left it in the header because there might be other uses for it. The
rest of the patches only needs clocksource_read_time().
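
Roughly, the two modes look like this (sketch; the do_something_* calls just
stand for whatever the caller wants to measure):

	s64 delay_ns;
	u64 t1, t2;

	/* mode 1: delay measurement; each call returns the nanoseconds
	   elapsed since the previous call */
	clocksource_read_ns(cs);		/* arm */
	do_something_to_measure();
	delay_ns = clocksource_read_ns(cs);

	/* mode 2: free-running clock; epoch chosen once at init time */
	clocksource_init_time(cs, ktime_to_ns(ktime_get_real()));
	t1 = clocksource_read_time(cs);
	do_something_else();
	t2 = clocksource_read_time(cs);		/* t2 - t1 == elapsed ns */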

Nanoseconds never have to be converted back to the counter. That
wouldn't be possible anyway (hardware counter might roll over, whereas
the clock counts nanoseconds in a 64 bit value and thus will last longer
than the hardware it runs on).

> 3) The counter will likely not be registered for use in timekeeping
> 4) The counter's sense of time will not be steered via frequency adjustments.
>
> Is that about the right set of assumptions?

About right ;-)

> So if we break the clocksource structure into two portions (ignore
> name details for now)
>
> struct counter {
> char* name,
> u32 mult,
> u32 shift,
> cycle_t mask,
> cycle_t (*read)(struct counter*);
> cycle_t (*vread)(void);
>
> /* bits needed here for real monotonic interface, more on that below */
>
> /* other arch specific needs */
> }
>
> struct timeclock {
> struct counter* counter,
> u32 adjusted_mult,
> cycle_t cycle_last,
> u32 flags;
> u64 xtime_nsec;
> s64 error;
> /* other timekeeping bits go here */
> }
>
> So given that, do you think you'd be ok with using just the first
> counter structure?

Some additional members must be moved to struct counter:
* cycle_last (for the overflow handling)
* xtime_nsec (for the continuously increasing timer)

Apart from those, the first struct is okay.

> Now there's sort of larger problem I've glossed over. Specifically in
> assumption #1 up there. The bit about the interface to the monotonic
> counter. Now many hardware counters wrap, and some wrap fairly
> quickly.
[...]
> You dodged this accumulation infrastructure in your patch, by just
> accumulating at read time. This works, as long as you can guarantee
> that read events occur more often than the wrap frequency.

Exactly. My plan was that the user of such a custom clocksource is
responsible for querying it often enough so that clocksource_read_ns()
can detect the wrap around. This works in the context of PTP (which
causes regular events). Network driver developers must be a bit careful
when there is no active PTP daemon: either they reinitialize the timer
when it starts to get used or probe it automatically after certain
delays.

> And in most
> cases that's probably not too hard, but with some in-development
> work, like the -rt patches, kernel work (even most interrupt
> processing) can be deferred by high priority tasks for an unlimited
> amount of time.

I'm not sure what can be done in such a case. Use decent hardware which
doesn't wrap around so quickly, I guess. It's not an issue with the
Intel NIC (sorry for the advertising... ;-)

> Another issue that multiple clocksources can cause, is dealing with
> time intervals between clocksources. Different clocksources may be
> driven by different crystals, so they will drift apart. Also since the
> clocksource used for timekeeping is adjusted by adjtimex(), you'll
> likely have to deal with small differences in system time intervals
> and clocksource time intervals.
>
> I see you've maybe tried to address some of this with the following
> time_sync patch, but I'm not sure I've totally grokked that patch yet.

The clocksource API extension and the time sync code are independent at
the moment: the time sync code assumes that it gets two, usually
increasing timer values and tries to match them by measuring skew and
drift between them. If the timer values jump, then the sync code adapts
these values accordingly.
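
As a made-up numeric illustration (ignoring the 12/16 averaging in
clocksync_update()): if the measured (system time - HW time) offset grows by
1 ms over 10 s of HW time, the stored skew becomes about
1e6 / 1e10 * CLOCKSYNC_SKEW_RESOLUTION, i.e. roughly 107000, and a HW stamp
taken 5 s after the last update then gets offset + 5e9 * 107000 / 2^30,
i.e. offset plus about 0.5 ms, added when it is mapped to system time.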

I don't think it will be necessary to add something like adjtimex() to a
clocksource. Either the hardware supports it natively (like the Intel
NIC does, sorry again), or the current time sync deals with frequency
changes by adapting the drift factor.

> Anyway, sorry to ramble on so much. I'm really interested in your
> work, its really interesting! But we might want to make sure the right
> changes are being made to the right place so we don't get too much
> confusion with the same variables meaning different things in
> different contexts.

Thanks for your comments. I agree that splitting the structures would
help. But the variables really have the same meaning. They are just used
in different functions.

--
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

2008-12-11 22:23:57

by john stultz

[permalink] [raw]
Subject: Re: [RFC PATCH 08/11] clocksource: allow usage independent of timekeeping.c

On Thu, 2008-12-11 at 13:11 +0100, Patrick Ohly wrote:
> On Fri, 2008-12-05 at 21:05 +0000, john stultz wrote:
> > On Wed, Nov 19, 2008 at 4:08 AM, Patrick Ohly <[email protected]> wrote:
> > > The extensions in this patch add code which turns the raw cycle count
> > > provided by hardware into a continuously increasing time value. This
> > > reuses fields also used by timekeeping.c. Because of slightly different
> > > semantics (__get_nsec_offset does not update cycle_last, clocksource_read_ns
> > > does that transparently) timekeeping.c was not modified to use the
> > > generalized code.
> >
> > Hrm.. I'm a little wary here. Your patch basically creates new
> > semantics to how the clocksource structure is used, which will likely
> > cause confusion.
>
> That's true. I could keep the code separate, if that helps. I just
> didn't want to duplicate the whole structure definition.

I think either keeping it separate, using your own structure, or
properly splitting out the counter / time-clock interface would be the
way to go.

> > I'll agree that the clocksource structure has been
> > somewhat more cluttered with timekeeping-isms then I'd prefer, so
> > maybe your patches give us the need to clean it up and better separate
> > the hardware clocksource accessor information and the timekeeping
> > state.
> >
> > So to be clear, let me see if I understand your needs from your patch:
> >
> > 1) Need an interface to a counter, whose value monotonically increases.
> > 2) Need to translate the counter to nanoseconds and nanoseconds back
> > to the counter
>
> There are two additional ways of using the counter:
> * Get nanosecond delay measurements (clocksource_read_ns). Calling this
> "resets" the counter.

Just so I understand, do you mean clocksource_read_ns() returns the
number of nanoseconds since the last call to clocksource_read_ns() ?

That seems like an odd interface to define, since effectively you're
storing state inside the interface.

Why exactly is this useful, as opposed to creating a monotonically
increasing function which can be sampled, with the state managed by the
users of the interface?


> * Get a continuously increasing timer value
> (clocksource_init_time/clocksource_read_time). The clock is only reset
> when calling clocksource_init_time().

So a monotonic 64bit wide counter. Close to what I described above. Is
there actually a need for it to reset ever?


> The two are mutually exclusive because clocksource_read_time() depends
> on clocksource_read_ns(). If this is too confusing, then
> clocksource_read_ns() could be turned into an internal helper function.
> I left it in the header because there might be other uses for it. The
> rest of the patches only needs clocksource_read_time().

Yea. It seems like an odd interface, as the internal state seems to
limit its use.

> Nanoseconds never have to be converted back to the counter. That
> wouldn't be possible anyway (hardware counter might roll over, whereas
> the clock counts nanoseconds in a 64 bit value and thus will last longer
> than the hardware it runs on).

Right, but if it's a monotonically increasing 64 bit counter, rollover
isn't likely an issue. I think we're basically communicating the same
idea here; the question is just whether you want the interface to
provide nanoseconds or cycles.


> > 3) The counter will likely not be registered for use in timekeeping
> > 4) The counter's sense of time will not be steered via frequency adjustments.
> >
> > Is that about the right set of assumptions?
>
> About right ;-)
>
> > So if we break the clocksource structure into two portions (ignore
> > name details for now)
> >
> > struct counter {
> > 	char *name;
> > 	u32 mult;
> > 	u32 shift;
> > 	cycle_t mask;
> > 	cycle_t (*read)(struct counter *);
> > 	cycle_t (*vread)(void);
> >
> > 	/* bits needed here for real monotonic interface, more on that below */
> >
> > 	/* other arch specific needs */
> > };
> >
> > struct timeclock {
> > 	struct counter *counter;
> > 	u32 adjusted_mult;
> > 	cycle_t cycle_last;
> > 	u32 flags;
> > 	u64 xtime_nsec;
> > 	s64 error;
> > 	/* other timekeeping bits go here */
> > };
> >
> > So given that, do you think you'd be ok with using just the first
> > counter structure?
>
> Some additional members must be moved to struct counter:
> * cycle_last (for the overflow handling)
> * xtime_nsec (for the continuously increasing timer)

Hmm. I'd still prefer those values to be stored elsewhere. As you add
state to the structure, that limits how the structure can be used. For
instance, if cycle_last and xtime_nsec are in the counter structure,
then that means one counter could not be used for both timekeeping and
the hardware time-stamping you're doing.

Instead that state should be stored in the timekeeping and timestamping
structures respectively.
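
To sketch what I mean (placeholder names only, not a proposal for the
final layout): the counter stays stateless, and each consumer keeps its
own cycle_last and accumulated nanoseconds, so the same piece of
hardware can back both users at once:

	struct counter;			/* read(), mask, mult, shift only */

	struct timekeeper_state {	/* system timekeeping */
		struct counter *counter;
		cycle_t cycle_last;
		u64 xtime_nsec;
		/* NTP adjustment, error accounting, ... */
	};

	struct timestamp_state {	/* hardware time stamping */
		struct counter *counter;	/* may be the same counter */
		cycle_t cycle_last;
		u64 nsec;		/* free-running nanosecond count */
	};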

> Apart from those, the first struct is okay.
>
> > Now there's sort of larger problem I've glossed over. Specifically in
> > assumption #1 up there. The bit about the interface to the monotonic
> > counter. Now many hardware counters wrap, and some wrap fairly
> > quickly.
> [...]
> > You dodged this accumulation infrastructure in your patch, by just
> > accumulating at read time. This works, as long as you can guarantee
> > that read events occur more often than the wrap frequency.
>
> Exactly. My plan was that the user of such a custom clocksource is
> responsible for querying it often enough so that clocksource_read_ns()
> can detect the wrap around.

Right, however my point quoted below was that this will likely break in
the -rt kernel, since those users may be deferred for an undefined amount
of time. So we'll need to do something here.


> > And in most
> > cases that's probably not too hard, but with some in-development
> > work, like the -rt patches, kernel work (even most interrupt
> > processing) can be deferred by high priority tasks for an unlimited
> > amount of time.
>
> I'm not sure what can be done in such a case. Use decent hardware which
> doesn't wrap around so quickly, I guess. It's not an issue with the
> Intel NIC (sorry for the advertising... ;-)

Well, I think it would be good to create an infrastructure that will work
on most hardware.

And I think it can work, but in order to make it work cleanly, we'll
have to have some form of accumulation infrastructure, which cannot be
deferred.

However, some careful thought will be needed here, so that we don't
create latencies by wasting time sampling unused hardware counters in
the hardirq context.


> > Another issue that multiple clocksources can cause, is dealing with
> > time intervals between clocksources. Different clocksources may be
> > driven by different crystals, so they will drift apart. Also since the
> > clocksource used for timekeeping is adjusted by adjtimex(), you'll
> > likely have to deal with small differences in system time intervals
> > and clocksource time intervals.
> >
> > I see you've maybe tried to address some of this with the following
> > time_sync patch, but I'm not sure I've totally grokked that patch yet.
>
> The clocksource API extension and the time sync code are independent at
> the moment: the time sync code assumes that it gets two timer values,
> usually increasing, and tries to match them by measuring skew and
> drift between them. If the timer values jump, then the sync code adapts
> these values accordingly.

Ok. I'll have to spend some more time on that patch, but it sounds like
you're handling the issue.


> > Anyway, sorry to ramble on so much. I'm really interested in your
> > work, it's really interesting! But we might want to make sure the right
> > changes are being made to the right place so we don't get too much
> > confusion with the same variables meaning different things in
> > different contexts.
>
> Thanks for your comments. I agree that splitting the structures would
> help. But the variables really have the same meaning. They are just used
> in different functions.

Err, you might be misunderstanding their current meaning. However, it's
not your fault, as the naming is not as clear as I'd like. For instance,
xtime_nsec stores the sub-nanoseconds (shifted up by clocksource->shift)
not represented in the xtime value.
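
For illustration (simplified, not the verbatim timekeeping code): the
accumulation step keeps the value in "shifted nanoseconds" precisely so
that the fraction below one nanosecond is carried over instead of being
lost between updates:

	/* cycle_delta cycles become (cycle_delta * mult) shifted nanoseconds */
	clock->xtime_nsec += (u64)cycle_delta * clock->mult;

	while (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
		clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift;
		xtime.tv_sec++;
	}
	/* whatever remains, including the sub-nanosecond part, stays in
	 * xtime_nsec until the next update */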

So yes, while you likely want to keep similar state as the timekeeping
core does, I really think splitting it out fully is going to be the way
to go.

Thanks for the consideration of my comments! I look forward to your
future patches!
-john

2008-12-12 08:50:33

by Patrick Ohly

[permalink] [raw]
Subject: Re: [RFC PATCH 08/11] clocksource: allow usage independent of timekeeping.c

On Thu, 2008-12-11 at 22:23 +0000, john stultz wrote:
> On Thu, 2008-12-11 at 13:11 +0100, Patrick Ohly wrote:
> > That's true. I could keep the code separate, if that helps. I just
> > didn't want to duplicate the whole structure definition.
>
> I think either keeping it separate, using your own structure, or
> properly splitting out the counter / time-clock interface would be the
> way to go.

Okay, will do that. I'll try to do it in such a way that the clocksource
code can later be rewritten to use the same definition.

> > There are two additional ways of using the counter:
> > * Get nanosecond delay measurements (clocksource_read_ns). Calling this
> > "resets" the counter.
>
> Just so I understand, do you mean clocksource_read_ns() returns the
> number of nanoseconds since the last call to clocksource_read_ns()?

Yes.

> Why exactly is this useful, as opposed to creating a monotonically
> increasing function which can be sampled, with the state managed by the
> users of the interface?

The monotonically increasing function is already based on a stateful
function which calculates the delta; calculating the original delta
based on a derived value didn't seem right. But I don't really care much
about this part of the API, so I'll just make it internal.

> > * Get a continuously increasing timer value
> > (clocksource_init_time/clocksource_read_time). The clock is only reset
> > when calling clocksource_init_time().
>
> So a monotonic 64bit wide counter. Close to what I described above. Is
> there actually a need for it to reset ever?

Perhaps. A device might decide to reset the time each time hardware time
stamping is activated.

> > Some additional members must be moved to struct counter:
> > * cycle_last (for the overflow handling)
> > * xtime_nsec (for the continuously increasing timer)
>
> Hmm. I'd still prefer those values to be stored elsewhere. As you add
> state to the structure, that limits how the structure can be used. For
> instance, if cycle_last and xtime_nsec are in the counter structure,
> then that means one counter could not be used for both timekeeping and
> the hardware time-stamping you're doing.

The clean solution would be:
* struct cyclecounter: abstract API to access hardware cycle counter.
  The cycle counter may roll over relatively quickly. The implementor
  needs to provide information about the width of the counter and its
  frequency.
* struct timecounter: turns cycles from one cyclecounter into a
  nanosecond count. Must detect and deal with cycle counter overflows.
  Uses a 64 bit counter for time, so it itself doesn't overflow (unless
  we build hardware that runs for a *really* long time).

Now, should struct timecounter contain a struct cyclecounter or a
pointer to it? A pointer is more flexible, but overkill for the usage I
had in mind. I'll use a pointer anyway, just in case.
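
In code, this would look roughly like the following (a first sketch of
what I intend to post, so details may still change):

	/* abstract hardware counter; no usage state is kept here */
	struct cyclecounter {
		cycle_t (*read)(const struct cyclecounter *cc);
		cycle_t mask;	/* width of the counter, as a bit mask */
		u32 mult;	/* cycle -> nanosecond multiplier */
		u32 shift;	/* cycle -> nanosecond shift */
	};

	/* turns cycles into a non-wrapping nanosecond count; all the
	 * usage state lives here, not in the cyclecounter */
	struct timecounter {
		const struct cyclecounter *cc;
		cycle_t cycle_last;
		u64 nsec;
	};

	/* must be called at least once per counter wrap period */
	static u64 timecounter_read(struct timecounter *tc)
	{
		cycle_t now = tc->cc->read(tc->cc);
		cycle_t delta = (now - tc->cycle_last) & tc->cc->mask;

		tc->cycle_last = now;
		tc->nsec += ((u64)delta * tc->cc->mult) >> tc->cc->shift;
		return tc->nsec;
	}

Using a pointer also means the same cyclecounter instance could be
handed to a second consumer later without duplicating the hardware
access code.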

> Instead that state should be stored in the timekeeping and timestamping
> structures respectively.

I'm not sure whether timestamping can be separated from timekeeping: it
depends on the same cycle counter state as timekeeping does.

> > > You dodged this accumulation infrastructure in your patch, by just
> > > accumulating at read time. This works, as long as you can guarantee
> > > that read events occur more often than the wrap frequency.
> >
> > Exactly. My plan was that the user of such a custom clocksource is
> > responsible for querying it often enough so that clocksource_read_ns()
> > can detect the wrap around.
>
> Right, however my point quoted below was that this will likely break in
> the -rt kernel, since those users may be deferred for an undefined amount
> of time. So we'll need to do something here.

If the code isn't called often enough to deal with the regular PTP Sync
messages (sent every two seconds), then such a system would already have
quite a few other problems.
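
(For a sense of scale: even a 32 bit counter running at 125 MHz wraps
only about every 2^32 / 125 MHz, i.e. roughly 34 seconds, so reading it
once per two second Sync interval still leaves a comfortable margin; the
margin only gets tight with much narrower or much faster counters.)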

> > > And in most
> > > cases that's probably not too hard, but with some in-developement
> > > work, like the -rt patches, kernel work (even most interrupt
> > > processing) can be deferred by high priority tasks for an unlimited
> > > amount of time.
> >
> > I'm not sure what can be done in such a case. Use decent hardware which
> > doesn't wrap around so quickly, I guess. It's not an issue with the
> > Intel NIC (sorry for the advertising... ;-)
>
> Well, I think it would be good to create an infrastructure that will work
> on most hardware.

Most hardware doesn't have hardware time stamping. Is there any hardware
which has hardware time stamping, but only with such a limited counter
that we run into this problem?

I agree that this problem needs to be taken into account now (while
designing these data structures) and be addressed as soon as it becomes
necessary - but not sooner. Otherwise we might end up with dead code
that isn't used at all.

> And I think it can work, but in order to make it work cleanly, we'll
> have to have some form of accumulation infrastructure, which cannot be
> deferred.
>
> However, some careful thought will be needed here, so that we don't
> create latencies by wasting time sampling unused hardware counters in
> the hardirq context.

Currently the structures are owned by the device driver which owns the
hardware. Perhaps the device driver could register the structure with
such an accumulation infrastructure if the driver itself cannot
guarantee that it will check the cycle counter often enough. Concurrent
access to the cycle counter hardware and state could make this tricky.
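
Something along these lines, purely hypothetical (the names and the work
queue usage are my invention, reusing the timecounter sketch from my
earlier mail, and locking against the time stamping path is omitted):

	#include <linux/workqueue.h>

	struct hwtstamp_state {
		struct timecounter tc;
		struct delayed_work overflow_work;
		unsigned long check_interval;	/* jiffies, < wrap period */
	};

	static void overflow_check(struct work_struct *work)
	{
		struct hwtstamp_state *st =
			container_of(work, struct hwtstamp_state,
				     overflow_work.work);

		/* reading often enough keeps the masked delta unambiguous */
		timecounter_read(&st->tc);
		schedule_delayed_work(&st->overflow_work, st->check_interval);
	}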

This goes into areas where I have no experience at all, so I would
depend on others to provide that code.

--
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.