2008-12-15 14:55:34

by Patrick Ohly

Subject: hardware time stamping with optional structs in data area

This is the third iteration of a patch series which adds a user space
API for hardware time stamping of network packets and the
infrastructure that implements that API. The igb driver is used as
an example of how a network driver can use this new infrastructure.
The patches are based on net-next as of today.

This proposal is based on adding just a short flag field to struct
sk_buff itself, with the actual new information stored in the data
area if indicated by the flags. This is based on the suggestion by
Oliver Hartkopp and modelled after the way skb_shared_info is
stored. The previous two revisions of the patch reused skb->tstamp
and added skb->hwtstamp, respectively.

The latest approach is more flexible: it allows adding more optional
information in the future. It turned out to be a bit harder to use by
drivers, though. The igb driver allocates the skb for incoming packets
in advance. At that time it is not yet known which packets will need
the extra space for hardware time stamps. The current example igb
patch just allocates that space for all packets, all the time. More
efficient solutions would be possible.
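
To illustrate the general idea of "a flag field in the buffer header plus
optional structs in the data area", here is a minimal user-space sketch.
It is not code from this series: the names demo_buf, demo_hwtstamp and
DEMO_OPT_HWTSTAMP are hypothetical, and the layout (optional struct placed
after the payload, rounded up for alignment) is an assumption chosen for
the example, not the layout the patches use.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flag: "a hardware time stamp struct follows the payload". */
#define DEMO_OPT_HWTSTAMP (1u << 0)

/* Hypothetical optional struct stored in the data area. */
struct demo_hwtstamp {
	long long raw_ns;
};

/* Hypothetical buffer header: only the small flag field lives here. */
struct demo_buf {
	unsigned char *data;
	size_t len;
	unsigned int optional;
};

/* Round the payload length up so that the optional struct is aligned. */
size_t demo_optional_offset(size_t len)
{
	size_t a = sizeof(struct demo_hwtstamp);

	return (len + a - 1) / a * a;
}

/* Allocate payload space, plus room for the optional struct if requested. */
int demo_buf_alloc(struct demo_buf *b, size_t len, unsigned int optional)
{
	size_t total = len;

	if (optional & DEMO_OPT_HWTSTAMP)
		total = demo_optional_offset(len) + sizeof(struct demo_hwtstamp);
	b->data = calloc(1, total ? total : 1);
	if (!b->data)
		return -1;
	b->len = len;
	b->optional = optional;
	return 0;
}

/* Return the optional struct, or NULL if the flag says it is absent. */
struct demo_hwtstamp *demo_buf_hwtstamp(struct demo_buf *b)
{
	if (!(b->optional & DEMO_OPT_HWTSTAMP))
		return NULL;
	return (struct demo_hwtstamp *)(b->data + demo_optional_offset(b->len));
}
```

The sketch also shows the cost mentioned above: whoever allocates the
buffer has to decide up front whether to reserve the extra space.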

The functions which allocate an skb had to be modified so that
the new flags can be passed down to __netdev_alloc_skb(). I renamed
functions as necessary, so that existing code should continue to
work without changes. In sock_alloc_send_skb_flags() (formerly
known as sock_alloc_send_skb_pskb()) I removed some dead code (see
patch 04/12).

As discussed with John Stultz, the changes to clocksource.[ch]
now add new structures as needed. John, I chose not to add new files
for these. Please let me know whether you agree with that approach,
see patch 09/12 for details.

The user space API is still as it was designed for the previous two
implementations. IMHO this shows that it was flexible enough to cope
with changes in the implementation. On the other hand some of its
aspects (user space has fine-grained control over what information
is gathered and provided by the kernel) might be seen as overkill
now that drivers can add this information more easily.

I think the API still makes sense because the additional information
might be used for optimizations in the future (drivers could
skip time stamp conversion if no-one is interested).

Also fixed in this iteration:
* comment style, too many/large inline functions (pointed out by Andrew Morton)
* __KERNEL__ instead of __kernel__, user space API now in linux/net_tstamp.h (David Miller)

Bye, Patrick

Documentation/networking/timestamping.txt | 180 ++++++++
Documentation/networking/timestamping/.gitignore | 1 +
Documentation/networking/timestamping/Makefile | 3 +
.../networking/timestamping/timestamping.c | 469 ++++++++++++++++++++
arch/x86/include/asm/socket.h | 3 +
drivers/net/igb/e1000_82575.h | 1 +
drivers/net/igb/e1000_defines.h | 1 +
drivers/net/igb/e1000_regs.h | 68 +++
drivers/net/igb/igb.h | 8 +
drivers/net/igb/igb_main.c | 397 ++++++++++++++++-
fs/compat_ioctl.c | 1 +
include/linux/clocksource.h | 99 ++++
include/linux/clocksync.h | 85 ++++
include/linux/errqueue.h | 1 +
include/linux/net_tstamp.h | 104 +++++
include/linux/skbuff.h | 196 ++++++++-
include/linux/sockios.h | 3 +
include/net/ip.h | 1 +
include/net/sock.h | 57 ++-
kernel/time/Makefile | 2 +-
kernel/time/clocksource.c | 76 ++++
kernel/time/clocksync.c | 182 ++++++++
net/can/raw.c | 14 +-
net/compat.c | 19 +-
net/core/dev.c | 36 ++-
net/core/skbuff.c | 139 +++++-
net/core/sock.c | 129 +++---
net/ipv4/icmp.c | 2 +
net/ipv4/ip_output.c | 18 +-
net/ipv4/raw.c | 1 +
net/ipv4/udp.c | 4 +
net/socket.c | 86 +++-
32 files changed, 2258 insertions(+), 128 deletions(-)


2008-12-15 14:56:09

by Patrick Ohly

Subject: [RFC PATCH 01/12] net: new user space API for time stamping of incoming and outgoing packets

User space can request hardware and/or software time stamping.
Reporting of the result(s) via a new control message is enabled
separately for each field in the message because some of the
fields may require additional computation and thus cause overhead.
User space can tell the different kinds of time stamps apart
and choose what suits its needs.
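
As a small sketch of how user space might compose the request, the
following helper builds a flag word and checks it against the mask. The
bit values are copied from include/linux/net_tstamp.h in this patch (real
code would include that header instead); the two helper functions are
hypothetical and exist only for this example.

```c
/* Bit values as defined in include/linux/net_tstamp.h by this patch. */
enum {
	SOF_TIMESTAMPING_TX_HARDWARE = (1 << 0),
	SOF_TIMESTAMPING_TX_SOFTWARE = (1 << 1),
	SOF_TIMESTAMPING_RX_HARDWARE = (1 << 2),
	SOF_TIMESTAMPING_RX_SOFTWARE = (1 << 3),
	SOF_TIMESTAMPING_SOFTWARE = (1 << 4),
	SOF_TIMESTAMPING_SYS_HARDWARE = (1 << 5),
	SOF_TIMESTAMPING_RAW_HARDWARE = (1 << 6),
	SOF_TIMESTAMPING_MASK =
		(SOF_TIMESTAMPING_RAW_HARDWARE - 1) | SOF_TIMESTAMPING_RAW_HARDWARE
};

/* Hypothetical helper: 1 if only known bits are set, 0 otherwise. */
int tstamp_flags_valid(int flags)
{
	return (flags & ~SOF_TIMESTAMPING_MASK) == 0;
}

/*
 * Hypothetical helper: receive-side request that generates time stamps
 * in hardware with software fallback and reports all three fields.
 */
int tstamp_flags_rx_all(void)
{
	return SOF_TIMESTAMPING_RX_HARDWARE |
	       SOF_TIMESTAMPING_RX_SOFTWARE |
	       SOF_TIMESTAMPING_SOFTWARE |
	       SOF_TIMESTAMPING_SYS_HARDWARE |
	       SOF_TIMESTAMPING_RAW_HARDWARE;
}
```

The resulting integer would be passed to
setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags)), as
the test program in this patch does.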

When a TX timestamp operation is requested, the TX skb will be cloned
and the clone will be time stamped (in hardware or software) and added
to the socket error queue of the skb, if the skb has a socket
associated with it.

The actual TX timestamp will reach userspace as a RX timestamp on the
cloned packet. If timestamping is requested and no timestamping is
done in the device driver (potentially this may use hardware
timestamping), it will be done in software after the device's
hard_start_xmit routine.
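
On the receiving end, whether for a regular packet or for a bounced TX
packet read with recvmsg(MSG_ERRQUEUE), the time stamps arrive as one
control message. The following sketch shows one way to pull them out of a
struct msghdr. The struct name tstamp_triplet and both helpers are
hypothetical; the field order follows struct scm_timestamping as described
in timestamping.txt, and the fallback value 37 for SCM_TIMESTAMPING is the
x86 value from this patch.

```c
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
# define SCM_TIMESTAMPING 37	/* x86 value from this patch */
#endif

/* Same layout as struct scm_timestamping in timestamping.txt. */
struct tstamp_triplet {
	struct timespec systime;
	struct timespec hwtimetrans;
	struct timespec hwtimeraw;
};

/* An all-zero value means "no such time stamp available". */
int timespec_is_empty(const struct timespec *ts)
{
	return ts->tv_sec == 0 && ts->tv_nsec == 0;
}

/*
 * Walk the control messages of an already received msghdr and copy the
 * first SCM_TIMESTAMPING payload into *out.
 * Returns 1 if one was found, 0 otherwise.
 */
int extract_timestamps(struct msghdr *msg, struct tstamp_triplet *out)
{
	struct cmsghdr *cmsg;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level != SOL_SOCKET ||
		    cmsg->cmsg_type != SCM_TIMESTAMPING)
			continue;
		if (cmsg->cmsg_len < CMSG_LEN(sizeof(*out)))
			continue;	/* too short to hold three timespecs */
		memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
		return 1;
	}
	return 0;
}
```

A caller would run extract_timestamps() on the msghdr filled in by
recvmsg() and then use timespec_is_empty() to see which of the three
values the kernel actually provided.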

TODO: add SO_TIMESTAMPING define also to other platforms

Signed-off-by: Patrick Ohly <[email protected]>
---
Documentation/networking/timestamping.txt | 178 ++++++++
Documentation/networking/timestamping/.gitignore | 1 +
Documentation/networking/timestamping/Makefile | 3 +
.../networking/timestamping/timestamping.c | 469 ++++++++++++++++++++
arch/x86/include/asm/socket.h | 3 +
include/linux/errqueue.h | 1 +
include/linux/net_tstamp.h | 104 +++++
include/linux/sockios.h | 3 +
8 files changed, 762 insertions(+), 0 deletions(-)
create mode 100644 Documentation/networking/timestamping.txt
create mode 100644 Documentation/networking/timestamping/.gitignore
create mode 100644 Documentation/networking/timestamping/Makefile
create mode 100644 Documentation/networking/timestamping/timestamping.c
create mode 100644 include/linux/net_tstamp.h

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
new file mode 100644
index 0000000..a681a65
--- /dev/null
+++ b/Documentation/networking/timestamping.txt
@@ -0,0 +1,178 @@
+The existing interfaces for getting network packets time stamped are:
+
+* SO_TIMESTAMP
+ Generate time stamp for each incoming packet using the (not necessarily
+ monotonic!) system time. Result is returned via recvmsg() in a
+ control message as timeval (usec resolution).
+
+* SO_TIMESTAMPNS
+ Same time stamping mechanism as SO_TIMESTAMP, but returns result as
+ timespec (nsec resolution).
+
+* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
+ Only for multicasts: approximate send time stamp by receiving the looped
+ packet and using its receive time stamp.
+
+The following interface complements the existing ones: receive time
+stamps can be generated and returned for arbitrary packets and much
+closer to the point where the packet is really sent. Time stamps can
+be generated in software (as before) or in hardware (if the hardware
+has such a feature).
+
+SO_TIMESTAMPING:
+
+Instructs the socket layer which kind of information is wanted. The
+parameter is an integer with some of the following bits set. Setting
+other bits is an error and doesn't change the current state.
+
+SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
+SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or
+ fails, then do it in software
+SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp
+ as generated by the hardware
+SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or
+ fails, then do it in software
+SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
+SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to
+ the system time base
+SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in
+ software
+
+SOF_TIMESTAMPING_TX/RX determine how time stamps are generated.
+SOF_TIMESTAMPING_RAW/SYS determine how they are reported in the
+following control message:
+ struct scm_timestamping {
+ struct timespec systime;
+ struct timespec hwtimetrans;
+ struct timespec hwtimeraw;
+ };
+
+recvmsg() can be used to get this control message for regular incoming
+packets. For send time stamps the outgoing packet is looped back to
+the socket's error queue with the send time stamp(s) attached. It can
+be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the
+original outgoing packet data including all headers prepended down to
+and including the link layer, the scm_timestamping control message and
+a sock_extended_err control message with ee_errno==ENOMSG and
+ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
+bounced packet is ready for reading as far as select() is concerned.
+
+All three values correspond to the same event in time, but were
+generated in different ways. Each of these values may be empty (= all
+zero), in which case no such value was available. If the application
+is not interested in some of these values, they can be left blank to
+avoid the potential overhead of calculating them.
+
+systime is the value of the system time at that moment. This
+corresponds to the value also returned via SO_TIMESTAMP[NS]. If the
+time stamp was generated by hardware, then this field is
+empty. Otherwise it is filled in if SOF_TIMESTAMPING_SOFTWARE is
+set.
+
+hwtimeraw is the original hardware time stamp. Filled in if
+SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its
+relation to system time should be made.
+
+hwtimetrans is the hardware time stamp transformed so that it
+corresponds as well as possible to system time. This correlation is
+not perfect; as a consequence, sorting packets received via different
+NICs by their hwtimetrans may differ from the order in which they were
+received. hwtimetrans may be non-monotonic even for the same NIC.
+Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support
+by the network device and will be empty without that support.
+
+
+SIOCSHWTSTAMP:
+
+Hardware time stamping must also be initialized for each device driver
+that is expected to do hardware time stamping. The parameter is:
+
+struct hwtstamp_config {
+ int flags; /* no flags defined right now, must be zero */
+ int tx_type; /* HWTSTAMP_TX_* */
+ int rx_filter; /* HWTSTAMP_FILTER_* */
+};
+
+Desired behavior is passed into the kernel and to a specific device by
+calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
+ifr_data points to a struct hwtstamp_config. The tx_type and
+rx_filter are hints to the driver what it is expected to do. If
+the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets.
+
+A driver which supports hardware time stamping shall update the struct
+with the actual, possibly more permissive configuration. If the
+requested packets cannot be time stamped, then nothing should be
+changed and ERANGE shall be returned (in contrast to EINVAL, which
+indicates that SIOCSHWTSTAMP is not supported at all).
+
+Only a process with admin rights may change the configuration. User
+space is responsible to ensure that multiple processes don't interfere
+with each other and that the settings are reset.
+
+/* possible values for hwtstamp_config->tx_type */
+enum {
+ /*
+ * no outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done
+ */
+ HWTSTAMP_TX_OFF,
+
+ /*
+ * enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet
+ */
+ HWTSTAMP_TX_ON,
+};
+
+/* possible values for hwtstamp_config->rx_filter */
+enum {
+ /* time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /* time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /* return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /* PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+
+ ...
+};
+
+
+DEVICE IMPLEMENTATION
+
+A driver which supports hardware time stamping must support the
+SIOCSHWTSTAMP ioctl. Time stamps for received packets must be stored
+in the skb with skb_hwtstamp_set().
+
+Time stamps for outgoing packets are to be generated as follows:
+- In hard_start_xmit(), check if skb_hwtstamp_check_tx_hardware()
+ returns non-zero. If yes, then the driver is expected
+ to do hardware time stamping.
+- If this is possible for the skb and requested, then declare
+ that the driver is doing the time stamping by calling
+ skb_hwtstamp_tx_in_progress(). A driver not supporting
+ hardware time stamping doesn't do that. A driver must never
+ touch sk_buff::tstamp! It is used to store how time stamping
+ for outgoing packets is to be done.
+- As soon as the driver has sent the packet and/or obtained a
+ hardware time stamp for it, it passes the time stamp back by
+ calling skb_hwtstamp_tx() with the original skb, the raw
+ hardware time stamp and a handle to the device (necessary
+ to convert the hardware time stamp to system time). If obtaining
+ the hardware time stamp somehow fails, then the driver should
+ not fall back to software time stamping. The rationale is that
+ this would occur at a later time in the processing pipeline
+ than other software time stamping and therefore could lead
+ to unexpected deltas between time stamps.
+- If the driver did not call skb_hwtstamp_tx_in_progress(), then
+ dev_hard_start_xmit() checks whether software time stamping
+ is wanted as fallback and potentially generates the time stamp.
diff --git a/Documentation/networking/timestamping/.gitignore b/Documentation/networking/timestamping/.gitignore
new file mode 100644
index 0000000..71e81eb
--- /dev/null
+++ b/Documentation/networking/timestamping/.gitignore
@@ -0,0 +1 @@
+timestamping
diff --git a/Documentation/networking/timestamping/Makefile b/Documentation/networking/timestamping/Makefile
new file mode 100644
index 0000000..ce170d1
--- /dev/null
+++ b/Documentation/networking/timestamping/Makefile
@@ -0,0 +1,3 @@
+CPPFLAGS = -I../../../include
+
+timestamping: timestamping.c
diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c
new file mode 100644
index 0000000..26d2e25
--- /dev/null
+++ b/Documentation/networking/timestamping/timestamping.c
@@ -0,0 +1,469 @@
+/*
+ * This program demonstrates how the various time stamping features in
+ * the Linux kernel work. It emulates the behavior of a PTP
+ * implementation in stand-alone master mode by sending PTPv1 Sync
+ * multicasts once every five seconds. It looks for similar packets, but
+ * beyond that doesn't actually implement PTP.
+ *
+ * Outgoing packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support.
+ *
+ * Incoming packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support, SIOCGSTAMP[NS] (per-socket time stamp) and
+ * SO_TIMESTAMP[NS].
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <string.h>
+
+#include <sys/time.h>
+#include <sys/socket.h>
+#include <sys/select.h>
+#include <sys/ioctl.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+
+#include "asm/types.h"
+#include "linux/net_tstamp.h"
+#include "linux/errqueue.h"
+
+#ifndef SO_TIMESTAMPNS
+# define SO_TIMESTAMPNS 35
+#endif
+
+#ifndef SIOCGSTAMPNS
+# define SIOCGSTAMPNS 0x8907
+#endif
+
+static void usage(const char *error)
+{
+ if (error)
+ printf("invalid option: %s\n", error);
+ printf("timestamping interface (IP_MULTICAST_LOOP|SO_TIMESTAMP|SO_TIMESTAMPNS|SOF_TIMESTAMPING_TX_HARDWARE|SOF_TIMESTAMPING_TX_SOFTWARE|SOF_TIMESTAMPING_RX_HARDWARE|SOF_TIMESTAMPING_RX_SOFTWARE|SOF_TIMESTAMPING_SOFTWARE|SOF_TIMESTAMPING_SYS_HARDWARE|SOF_TIMESTAMPING_RAW_HARDWARE|SIOCGSTAMP|SIOCGSTAMPNS)*\n");
+ exit(1);
+}
+
+static void bail(const char *error)
+{
+ printf("%s: %s\n", error, strerror(errno));
+ exit(1);
+}
+
+static const unsigned char sync[] = {
+ 0x00,0x01, 0x00,0x01,
+ 0x5f,0x44, 0x46,0x4c,
+ 0x54,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x01,0x01,
+
+ /* fake uuid */
+ 0x00,0x01,
+ 0x02,0x03, 0x04,0x05,
+
+ 0x00,0x01, 0x00,0x37,
+ 0x00,0x00, 0x00,0x08,
+ 0x00,0x00, 0x00,0x00,
+ 0x49,0x05, 0xcd,0x01,
+ 0x29,0xb1, 0x8d,0xb0,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x01,
+
+ /* fake uuid */
+ 0x00,0x01,
+ 0x02,0x03, 0x04,0x05,
+
+ 0x00,0x00, 0x00,0x37,
+ 0x00,0x00, 0x00,0x04,
+ 0x44,0x46, 0x4c,0x54,
+ 0x00,0x00, 0xf0,0x60,
+ 0x00,0x01, 0x00,0x00,
+ 0x00,0x00, 0x00,0x01,
+ 0x00,0x00, 0xf0,0x60,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x04,
+ 0x44,0x46, 0x4c,0x54,
+ 0x00,0x01,
+
+ /* fake uuid */
+ 0x00,0x01,
+ 0x02,0x03, 0x04,0x05,
+
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00
+};
+
+static void sendpacket(int sock, struct sockaddr *addr, socklen_t addr_len)
+{
+ struct timeval now;
+ int res;
+
+ res = sendto(sock, sync, sizeof(sync), 0,
+ addr, addr_len);
+ gettimeofday(&now, 0);
+ if (res < 0)
+ printf("%s: %s\n", "send", strerror(errno));
+ else
+ printf("%ld.%06ld: sent %d bytes\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res);
+}
+
+static void recvpacket(int sock, int recvmsg_flags,
+ int siocgstamp, int siocgstampns)
+{
+ char data[256];
+ struct timeval now;
+ struct msghdr msg;
+ struct iovec entry;
+ struct sockaddr_in from_addr;
+ struct {
+ struct cmsghdr cm;
+ char control[512];
+ } control;
+ int res;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_iov = &entry;
+ msg.msg_iovlen = 1;
+ entry.iov_base = data;
+ entry.iov_len = sizeof(data);
+ msg.msg_name = (caddr_t)&from_addr;
+ msg.msg_namelen = sizeof(from_addr);
+ msg.msg_control = &control;
+ msg.msg_controllen = sizeof(control);
+
+ res = recvmsg(sock, &msg, recvmsg_flags|MSG_DONTWAIT);
+ gettimeofday(&now, 0);
+ if (res < 0) {
+ printf("%s %s: %s\n",
+ "recvmsg",
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ strerror(errno));
+ } else {
+ struct cmsghdr *cmsg;
+ struct timeval tv;
+ struct timespec ts;
+
+ printf("%ld.%06ld: received %s data, %d bytes from %s, %zu bytes control messages\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ res,
+ inet_ntoa(from_addr.sin_addr),
+ (size_t)msg.msg_controllen);
+ for (cmsg = CMSG_FIRSTHDR(&msg);
+ cmsg;
+ cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+ printf(" cmsg len %zu: ", (size_t)cmsg->cmsg_len);
+ switch (cmsg->cmsg_level) {
+ case SOL_SOCKET:
+ printf("SOL_SOCKET ");
+ switch (cmsg->cmsg_type) {
+ case SO_TIMESTAMP: {
+ struct timeval *stamp =
+ (struct timeval *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMP %ld.%06ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_usec);
+ break;
+ }
+ case SO_TIMESTAMPNS: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPNS %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ case SO_TIMESTAMPING: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPING ");
+ printf("SW %ld.%09ld ",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ stamp++;
+ printf("HW transformed %ld.%09ld ",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ stamp++;
+ printf("HW raw %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ case IPPROTO_IP:
+ printf("IPPROTO_IP ");
+ switch (cmsg->cmsg_type) {
+ case IP_RECVERR: {
+ struct sock_extended_err *err =
+ (struct sock_extended_err *)CMSG_DATA(cmsg);
+ printf("IP_RECVERR ee_errno '%s' ee_origin %d => %s",
+ strerror(err->ee_errno),
+ err->ee_origin,
+#ifdef SO_EE_ORIGIN_TIMESTAMPING
+ err->ee_origin == SO_EE_ORIGIN_TIMESTAMPING ?
+ "bounced packet" : "unexpected origin"
+#else
+ "probably SO_EE_ORIGIN_TIMESTAMPING"
+#endif
+ );
+ if (res < sizeof(sync))
+ printf(" => truncated data?!");
+ else if (!memcmp(sync, data + res - sizeof(sync),
+ sizeof(sync)))
+ printf(" => GOT OUR DATA BACK (HURRAY!)");
+ break;
+ }
+ case IP_PKTINFO: {
+ struct in_pktinfo *pktinfo =
+ (struct in_pktinfo *)CMSG_DATA(cmsg);
+ printf("IP_PKTINFO interface index %u",
+ pktinfo->ipi_ifindex);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ default:
+ printf("level %d type %d",
+ cmsg->cmsg_level,
+ cmsg->cmsg_type);
+ break;
+ }
+ printf("\n");
+ }
+
+ if (siocgstamp) {
+ if (ioctl(sock, SIOCGSTAMP, &tv))
+ printf(" %s: %s\n", "SIOCGSTAMP", strerror(errno));
+ else
+ printf("SIOCGSTAMP %ld.%06ld\n",
+ (long)tv.tv_sec,
+ (long)tv.tv_usec);
+ }
+ if (siocgstampns) {
+ if (ioctl(sock, SIOCGSTAMPNS, &ts))
+ printf(" %s: %s\n", "SIOCGSTAMPNS", strerror(errno));
+ else
+ printf("SIOCGSTAMPNS %ld.%09ld\n",
+ (long)ts.tv_sec,
+ (long)ts.tv_nsec);
+ }
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int so_timestamping_flags = 0;
+ int so_timestamp = 0;
+ int so_timestampns = 0;
+ int siocgstamp = 0;
+ int siocgstampns = 0;
+ int ip_multicast_loop = 0;
+ char *interface;
+ int i;
+ int enabled = 1;
+ int sock;
+ struct ifreq device;
+ struct ifreq hwtstamp;
+ struct hwtstamp_config hwconfig, hwconfig_requested;
+ struct sockaddr_in addr;
+ struct ip_mreq imr;
+ struct in_addr iaddr;
+ int val;
+ socklen_t len;
+ struct timeval next;
+
+ if (argc < 2)
+ usage(0);
+ interface = argv[1];
+
+ for (i = 2; i < argc; i++ ) {
+ if (!strcasecmp(argv[i], "SO_TIMESTAMP")) {
+ so_timestamp = 1;
+ } else if (!strcasecmp(argv[i], "SO_TIMESTAMPNS")) {
+ so_timestampns = 1;
+ } else if (!strcasecmp(argv[i], "SIOCGSTAMP")) {
+ siocgstamp = 1;
+ } else if (!strcasecmp(argv[i], "SIOCGSTAMPNS")) {
+ siocgstampns = 1;
+ } else if (!strcasecmp(argv[i], "IP_MULTICAST_LOOP")) {
+ ip_multicast_loop = 1;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_HARDWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_SOFTWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_SOFTWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_HARDWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_SOFTWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_SOFTWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SOFTWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_SOFTWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SYS_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_SYS_HARDWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RAW_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_RAW_HARDWARE;
+ } else {
+ usage(argv[i]);
+ }
+ }
+
+ sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
+ if (sock < 0)
+ bail("socket");
+
+ memset(&device, 0, sizeof(device));
+ strncpy(device.ifr_name, interface, sizeof(device.ifr_name));
+ if (ioctl(sock, SIOCGIFADDR, &device) < 0)
+ bail("getting interface IP address");
+
+ memset(&hwtstamp, 0, sizeof(hwtstamp));
+ strncpy(hwtstamp.ifr_name, interface, sizeof(hwtstamp.ifr_name));
+ hwtstamp.ifr_data = (void *)&hwconfig;
+ memset(&hwconfig, 0, sizeof(hwconfig));
+ hwconfig.tx_type =
+ (so_timestamping_flags & SOF_TIMESTAMPING_TX_HARDWARE) ?
+ HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
+ hwconfig.rx_filter =
+ (so_timestamping_flags & SOF_TIMESTAMPING_RX_HARDWARE) ?
+ HWTSTAMP_FILTER_PTP_V1_L4_SYNC : HWTSTAMP_FILTER_NONE;
+ hwconfig_requested = hwconfig;
+ if (ioctl(sock, SIOCSHWTSTAMP, &hwtstamp) < 0) {
+ if ((errno == EINVAL || errno == ENOTSUP) &&
+ hwconfig_requested.tx_type == HWTSTAMP_TX_OFF &&
+ hwconfig_requested.rx_filter == HWTSTAMP_FILTER_NONE)
+ printf("SIOCSHWTSTAMP: disabling hardware time stamping not possible\n");
+ else
+ bail("SIOCSHWTSTAMP");
+ }
+ printf("SIOCSHWTSTAMP: tx_type %d requested, got %d; rx_filter %d requested, got %d\n",
+ hwconfig_requested.tx_type, hwconfig.tx_type,
+ hwconfig_requested.rx_filter, hwconfig.rx_filter);
+
+ /* bind to PTP port */
+ addr.sin_family = AF_INET;
+ addr.sin_addr.s_addr = htonl(INADDR_ANY);
+ addr.sin_port = htons(319 /* PTP event port */);
+ if (bind(sock, (struct sockaddr*)&addr, sizeof(struct sockaddr_in)) < 0)
+ bail("bind");
+
+ /* set multicast group for outgoing packets */
+ inet_aton("224.0.1.130", &iaddr); /* alternate PTP domain 1 */
+ addr.sin_addr = iaddr;
+ imr.imr_multiaddr.s_addr = iaddr.s_addr;
+ imr.imr_interface.s_addr = ((struct sockaddr_in *)&device.ifr_addr)->sin_addr.s_addr;
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF,
+ &imr.imr_interface.s_addr, sizeof(struct in_addr)) < 0)
+ bail("set multicast");
+
+ /* join multicast group, loop our own packet */
+ if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &imr, sizeof(struct ip_mreq)) < 0)
+ bail("join multicast group");
+
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_LOOP, &ip_multicast_loop, sizeof(enabled)) < 0) {
+ bail("loop multicast");
+ }
+
+ /* set socket options for time stamping */
+ if (so_timestamp &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMP");
+
+ if (so_timestampns &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMPNS");
+
+ if (so_timestamping_flags &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &so_timestamping_flags, sizeof(so_timestamping_flags)) < 0)
+ bail("setsockopt SO_TIMESTAMPING");
+
+ /* request IP_PKTINFO for debugging purposes */
+ if (setsockopt(sock, SOL_IP, IP_PKTINFO, &enabled, sizeof(enabled)) < 0)
+ printf("%s: %s\n", "setsockopt IP_PKTINFO", strerror(errno));
+
+ /* verify socket options */
+ len = sizeof(val);
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMP", strerror(errno));
+ else
+ printf("SO_TIMESTAMP %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPNS", strerror(errno));
+ else
+ printf("SO_TIMESTAMPNS %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &val, &len) < 0) {
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPING", strerror(errno));
+ } else {
+ printf("SO_TIMESTAMPING %d\n", val);
+ if (val != so_timestamping_flags)
+ printf(" not the expected value %d\n", so_timestamping_flags);
+ }
+
+ /* send packets forever every five seconds */
+ gettimeofday(&next, 0);
+ next.tv_sec = (next.tv_sec + 1) / 5 * 5;
+ next.tv_usec = 0;
+ while(1) {
+ struct timeval now;
+ struct timeval delta;
+ long delta_us;
+ int res;
+ fd_set readfs, errorfs;
+
+ gettimeofday(&now, 0);
+ delta_us = (long)(next.tv_sec - now.tv_sec) * 1000000 +
+ (long)(next.tv_usec - now.tv_usec);
+ if (delta_us > 0) {
+ /* continue waiting for timeout or data */
+ delta.tv_sec = delta_us / 1000000;
+ delta.tv_usec = delta_us % 1000000;
+
+ FD_ZERO(&readfs);
+ FD_ZERO(&errorfs);
+ FD_SET(sock, &readfs);
+ FD_SET(sock, &errorfs);
+ printf("%ld.%06ld: select %ldus\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ delta_us);
+ res = select(sock + 1, &readfs, 0, &errorfs, &delta);
+ gettimeofday(&now, 0);
+ printf("%ld.%06ld: select returned: %d, %s\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res,
+ res < 0 ? strerror(errno) : "success");
+ if (res > 0) {
+ if (FD_ISSET(sock, &readfs))
+ printf("ready for reading\n");
+ if (FD_ISSET(sock, &errorfs))
+ printf("has error\n");
+ recvpacket(sock, 0,
+ siocgstamp,
+ siocgstampns);
+ recvpacket(sock, MSG_ERRQUEUE,
+ siocgstamp,
+ siocgstampns);
+ }
+ } else {
+ /* write one packet */
+ sendpacket(sock, (struct sockaddr *)&addr, sizeof(addr));
+ next.tv_sec += 5;
+ continue;
+ }
+ }
+
+ return 0;
+}
diff --git a/arch/x86/include/asm/socket.h b/arch/x86/include/asm/socket.h
index 8ab9cc8..79e1f6c 100644
--- a/arch/x86/include/asm/socket.h
+++ b/arch/x86/include/asm/socket.h
@@ -54,4 +54,7 @@

#define SO_MARK 36

+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING
+
#endif /* _ASM_X86_SOCKET_H */
diff --git a/include/linux/errqueue.h b/include/linux/errqueue.h
index 92f8d4f..86d88dd 100644
--- a/include/linux/errqueue.h
+++ b/include/linux/errqueue.h
@@ -16,6 +16,7 @@ struct sock_extended_err
#define SO_EE_ORIGIN_LOCAL 1
#define SO_EE_ORIGIN_ICMP 2
#define SO_EE_ORIGIN_ICMP6 3
+#define SO_EE_ORIGIN_TIMESTAMPING 4

#define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))

diff --git a/include/linux/net_tstamp.h b/include/linux/net_tstamp.h
new file mode 100644
index 0000000..bb89beb
--- /dev/null
+++ b/include/linux/net_tstamp.h
@@ -0,0 +1,104 @@
+#ifndef _NET_TIMESTAMPING_H
+#define _NET_TIMESTAMPING_H
+
+#include <linux/socket.h> /* for SO_TIMESTAMPING */
+
+/*
+ * user space linux/socket.h might not have these defines yet:
+ * provide fallback
+ */
+#if !defined(__KERNEL__) && !defined(SO_TIMESTAMPING)
+# define SO_TIMESTAMPING 37
+# define SCM_TIMESTAMPING SO_TIMESTAMPING
+#endif
+
+/* SO_TIMESTAMPING gets an integer bit field comprised of these values */
+enum {
+ SOF_TIMESTAMPING_TX_HARDWARE = (1<<0),
+ SOF_TIMESTAMPING_TX_SOFTWARE = (1<<1),
+ SOF_TIMESTAMPING_RX_HARDWARE = (1<<2),
+ SOF_TIMESTAMPING_RX_SOFTWARE = (1<<3),
+ SOF_TIMESTAMPING_SOFTWARE = (1<<4),
+ SOF_TIMESTAMPING_SYS_HARDWARE = (1<<5),
+ SOF_TIMESTAMPING_RAW_HARDWARE = (1<<6),
+ SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_RAW_HARDWARE - 1) | SOF_TIMESTAMPING_RAW_HARDWARE
+};
+
+#if !defined(__KERNEL__) && !defined(SIOCSHWTSTAMP)
+# define SIOCSHWTSTAMP 0x89b0
+#endif
+
+/**
+ * struct hwtstamp_config - %SIOCSHWTSTAMP parameter
+ *
+ * @flags: no flags defined right now, must be zero
+ * @tx_type: one of HWTSTAMP_TX_*
+ * @rx_filter: one of HWTSTAMP_FILTER_*
+ *
+ * %SIOCSHWTSTAMP expects a &struct ifreq with a ifr_data pointer
+ * to this structure.
+ */
+struct hwtstamp_config {
+ int flags;
+ int tx_type;
+ int rx_filter;
+};
+
+/* possible values for hwtstamp_config->tx_type */
+enum {
+ /*
+ * No outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done.
+ */
+ HWTSTAMP_TX_OFF,
+
+ /*
+ * Enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting %SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet.
+ */
+ HWTSTAMP_TX_ON,
+};
+
+/* possible values for hwtstamp_config->rx_filter */
+enum {
+ /* time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /* time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /* return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /* PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+ /* PTP v1, UDP, Sync packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
+ /* PTP v1, UDP, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
+ /* PTP v2, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
+ /* PTP v2, UDP, Sync packet */
+ HWTSTAMP_FILTER_PTP_V2_L4_SYNC,
+ /* PTP v2, UDP, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ,
+
+ /* 802.AS1, Ethernet, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V2_L2_EVENT,
+ /* 802.AS1, Ethernet, Sync packet */
+ HWTSTAMP_FILTER_PTP_V2_L2_SYNC,
+ /* 802.AS1, Ethernet, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ,
+
+ /* PTP v2/802.AS1, any layer, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V2_EVENT,
+ /* PTP v2/802.AS1, any layer, Sync packet */
+ HWTSTAMP_FILTER_PTP_V2_SYNC,
+ /* PTP v2/802.AS1, any layer, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V2_DELAY_REQ,
+};
+
+#endif /* _NET_TIMESTAMPING_H */
diff --git a/include/linux/sockios.h b/include/linux/sockios.h
index abef759..241f179 100644
--- a/include/linux/sockios.h
+++ b/include/linux/sockios.h
@@ -122,6 +122,9 @@
#define SIOCBRADDIF 0x89a2 /* add interface to bridge */
#define SIOCBRDELIF 0x89a3 /* remove interface from bridge */

+/* hardware time stamping: parameters in linux/net_tstamp.h */
+#define SIOCSHWTSTAMP 0x89b0
+
/* Device private ioctl calls */

/*
--
1.5.5.3

2008-12-15 14:56:35

by Patrick Ohly

Subject: [RFC PATCH 03/12] net: socket infrastructure for SO_TIMESTAMPING

The overlap with the old SO_TIMESTAMP[NS] options is handled so
that time stamping in software (net_enable_timestamp()) is
enabled when SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE
is set. It's disabled if all of these are off.
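
The rule above can be modelled in user space. This is only a sketch:
struct model_sock, the FLAG_* constants and net_timestamp_users are
hypothetical stand-ins for the socket flags and for the global state
behind net_enable_timestamp()/net_disable_timestamp(), not the kernel
code itself. It shows that the global enable is taken once per socket,
no matter how many of the two flags are set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two per-socket flags named above. */
enum { FLAG_TIMESTAMP, FLAG_TIMESTAMPING_RX_SOFTWARE, FLAG_COUNT };

struct model_sock {
	bool flag[FLAG_COUNT];
};

/* Stand-in for the global user count behind net_enable_timestamp(). */
static int net_timestamp_users;

static void model_enable(struct model_sock *sk, int flag)
{
	int other = flag == FLAG_TIMESTAMP ?
		FLAG_TIMESTAMPING_RX_SOFTWARE : FLAG_TIMESTAMP;

	assert(flag >= 0 && flag < FLAG_COUNT);
	if (sk->flag[flag])
		return;
	sk->flag[flag] = true;
	/* only the first of the two flags takes the global reference */
	if (!sk->flag[other])
		net_timestamp_users++;
}

static void model_disable(struct model_sock *sk, int flag)
{
	assert(flag >= 0 && flag < FLAG_COUNT);
	if (!sk->flag[flag])
		return;
	sk->flag[flag] = false;
	/* only the last of the two flags drops the global reference */
	if (!sk->flag[FLAG_TIMESTAMP] &&
	    !sk->flag[FLAG_TIMESTAMPING_RX_SOFTWARE])
		net_timestamp_users--;
}
```

Enabling both flags and then disabling one of them leaves
net_timestamp_users at 1 in this model; it drops to 0 only after the
second flag is cleared.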

Signed-off-by: Patrick Ohly <[email protected]>
---
include/net/sock.h | 40 ++++++++++++++++++++++--
net/compat.c | 19 +++++++----
net/core/sock.c | 79 ++++++++++++++++++++++++++++++++++++++++-------
net/socket.c | 86 ++++++++++++++++++++++++++++++++++++++++------------
4 files changed, 182 insertions(+), 42 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5a3a151..36807e4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -158,7 +158,7 @@ struct sock_common {
* @sk_allocation: allocation mode
* @sk_sndbuf: size of send buffer in bytes
* @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
- * %SO_OOBINLINE settings
+ * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
* @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
* @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
@@ -488,6 +488,13 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_TIMESTAMPING_TX_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_TX_HARDWARE */
+ SOCK_TIMESTAMPING_TX_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_TX_SOFTWARE */
+ SOCK_TIMESTAMPING_RX_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RX_HARDWARE */
+ SOCK_TIMESTAMPING_RX_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RX_SOFTWARE */
+ SOCK_TIMESTAMPING_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_SOFTWARE */
+ SOCK_TIMESTAMPING_RAW_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RAW_HARDWARE */
+ SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_SYS_HARDWARE */
};

static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -1342,13 +1349,40 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
{
ktime_t kt = skb->tstamp;

- if (sock_flag(sk, SOCK_RCVTSTAMP))
+ /*
+ * generate control messages if
+ * - receive time stamping in software requested (SOCK_RCVTSTAMP
+ * or SOCK_TIMESTAMPING_RX_SOFTWARE)
+ * - software time stamp available and wanted (SOCK_TIMESTAMPING_SOFTWARE)
+ * - hardware time stamps available and wanted (SOCK_TIMESTAMPING_SYS_HARDWARE
+ * or SOCK_TIMESTAMPING_RAW_HARDWARE)
+ */
+ if (sock_flag(sk, SOCK_RCVTSTAMP) ||
+ sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE) ||
+ (kt.tv64 && sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE)) ||
+ ((skb->optional & SKB_FLAGS_OPTIONAL_HWTSTAMPS) &&
+ (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE) ||
+ sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE))))
__sock_recv_timestamp(msg, sk, skb);
else
sk->sk_stamp = kt;
}

/**
+ * sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
+ * @msg: outgoing packet
+ * @sk: socket sending this packet
+ * @shtx: filled with instructions for time stamping
+ *
+ * Currently only depends on SOCK_TIMESTAMPING* flags. Returns error code if
+ * parameters are invalid.
+ */
+extern int sock_tx_timestamp(struct msghdr *msg,
+ struct sock *sk,
+ union skb_shared_tx *shtx);
+
+
+/**
* sk_eat_skb - Release a skb if it is no longer needed
* @sk: socket to eat this skb from
* @skb: socket buffer to eat
@@ -1416,7 +1450,7 @@ static inline struct sock *skb_steal_sock(struct sk_buff *skb)
return NULL;
}

-extern void sock_enable_timestamp(struct sock *sk);
+extern void sock_enable_timestamp(struct sock *sk, int flag);
extern int sock_get_timestamp(struct sock *, struct timeval __user *);
extern int sock_get_timestampns(struct sock *, struct timespec __user *);

diff --git a/net/compat.c b/net/compat.c
index a3a2ba0..fb01ed9 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -216,7 +216,7 @@ Efault:
int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *data)
{
struct compat_timeval ctv;
- struct compat_timespec cts;
+ struct compat_timespec cts[3];
struct compat_cmsghdr __user *cm = (struct compat_cmsghdr __user *) kmsg->msg_control;
struct compat_cmsghdr cmhdr;
int cmlen;
@@ -233,12 +233,17 @@ int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *dat
data = &ctv;
len = sizeof(ctv);
}
- if (level == SOL_SOCKET && type == SCM_TIMESTAMPNS) {
+ if (level == SOL_SOCKET &&
+ (type == SCM_TIMESTAMPNS || type == SCM_TIMESTAMPING)) {
+ int count = type == SCM_TIMESTAMPNS ? 1 : 3;
+ int i;
struct timespec *ts = (struct timespec *)data;
- cts.tv_sec = ts->tv_sec;
- cts.tv_nsec = ts->tv_nsec;
+ for (i = 0; i < count; i++) {
+ cts[i].tv_sec = ts[i].tv_sec;
+ cts[i].tv_nsec = ts[i].tv_nsec;
+ }
data = &cts;
- len = sizeof(cts);
+ len = sizeof(cts[0]) * count;
}

cmlen = CMSG_COMPAT_LEN(len);
@@ -455,7 +460,7 @@ int compat_sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
struct timeval tv;

if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
tv = ktime_to_timeval(sk->sk_stamp);
if (tv.tv_sec == -1)
return err;
@@ -479,7 +484,7 @@ int compat_sock_get_timestampns(struct sock *sk, struct timespec __user *usersta
struct timespec ts;

if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
ts = ktime_to_timespec(sk->sk_stamp);
if (ts.tv_sec == -1)
return err;
diff --git a/net/core/sock.c b/net/core/sock.c
index ac4f0e7..35b4f4c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -120,6 +120,7 @@
#include <net/net_namespace.h>
#include <net/request_sock.h>
#include <net/sock.h>
+#include <linux/net_tstamp.h>
#include <net/xfrm.h>
#include <linux/ipsec.h>

@@ -255,11 +256,14 @@ static void sock_warn_obsolete_bsdism(const char *name)
}
}

-static void sock_disable_timestamp(struct sock *sk)
+static void sock_disable_timestamp(struct sock *sk, int flag)
{
- if (sock_flag(sk, SOCK_TIMESTAMP)) {
- sock_reset_flag(sk, SOCK_TIMESTAMP);
- net_disable_timestamp();
+ if (sock_flag(sk, flag)) {
+ sock_reset_flag(sk, flag);
+ if (!sock_flag(sk, SOCK_TIMESTAMP) &&
+ !sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE)) {
+ net_disable_timestamp();
+ }
}
}

@@ -618,13 +622,36 @@ set_rcvbuf:
else
sock_set_flag(sk, SOCK_RCVTSTAMPNS);
sock_set_flag(sk, SOCK_RCVTSTAMP);
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
} else {
sock_reset_flag(sk, SOCK_RCVTSTAMP);
sock_reset_flag(sk, SOCK_RCVTSTAMPNS);
}
break;

+ case SO_TIMESTAMPING:
+ if (val & ~SOF_TIMESTAMPING_MASK) {
+ ret = -EINVAL;
+ break;
+ }
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_TX_HARDWARE,
+ val & SOF_TIMESTAMPING_TX_HARDWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE,
+ val & SOF_TIMESTAMPING_TX_SOFTWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_RX_HARDWARE,
+ val & SOF_TIMESTAMPING_RX_HARDWARE);
+ if (val & SOF_TIMESTAMPING_RX_SOFTWARE)
+ sock_enable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);
+ else
+ sock_disable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_SOFTWARE,
+ val & SOF_TIMESTAMPING_SOFTWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE,
+ val & SOF_TIMESTAMPING_SYS_HARDWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE,
+ val & SOF_TIMESTAMPING_RAW_HARDWARE);
+ break;
+
case SO_RCVLOWAT:
if (val < 0)
val = INT_MAX;
@@ -770,6 +797,24 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
v.val = sock_flag(sk, SOCK_RCVTSTAMPNS);
break;

+ case SO_TIMESTAMPING:
+ v.val = 0;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_TX_HARDWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE))
+ v.val |= SOF_TIMESTAMPING_TX_SOFTWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RX_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_RX_HARDWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE))
+ v.val |= SOF_TIMESTAMPING_RX_SOFTWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE))
+ v.val |= SOF_TIMESTAMPING_SOFTWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_SYS_HARDWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_RAW_HARDWARE;
+ break;
+
case SO_RCVTIMEO:
lv=sizeof(struct timeval);
if (sk->sk_rcvtimeo == MAX_SCHEDULE_TIMEOUT) {
@@ -971,7 +1016,8 @@ void sk_free(struct sock *sk)
rcu_assign_pointer(sk->sk_filter, NULL);
}

- sock_disable_timestamp(sk);
+ sock_disable_timestamp(sk, SOCK_TIMESTAMP);
+ sock_disable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);

if (atomic_read(&sk->sk_omem_alloc))
printk(KERN_DEBUG "%s: optmem leakage (%d bytes) detected.\n",
@@ -1789,7 +1835,7 @@ int sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
{
struct timeval tv;
if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
tv = ktime_to_timeval(sk->sk_stamp);
if (tv.tv_sec == -1)
return -ENOENT;
@@ -1805,7 +1851,7 @@ int sock_get_timestampns(struct sock *sk, struct timespec __user *userstamp)
{
struct timespec ts;
if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
ts = ktime_to_timespec(sk->sk_stamp);
if (ts.tv_sec == -1)
return -ENOENT;
@@ -1817,11 +1863,20 @@ int sock_get_timestampns(struct sock *sk, struct timespec __user *userstamp)
}
EXPORT_SYMBOL(sock_get_timestampns);

-void sock_enable_timestamp(struct sock *sk)
+void sock_enable_timestamp(struct sock *sk, int flag)
{
- if (!sock_flag(sk, SOCK_TIMESTAMP)) {
- sock_set_flag(sk, SOCK_TIMESTAMP);
- net_enable_timestamp();
+ if (!sock_flag(sk, flag)) {
+ sock_set_flag(sk, flag);
+ /*
+ * we just set one of the two flags which require net
+ * time stamping, but time stamping might have been on
+ * already because of the other one
+ */
+ if (!sock_flag(sk,
+ flag == SOCK_TIMESTAMP ?
+ SOCK_TIMESTAMPING_RX_SOFTWARE :
+ SOCK_TIMESTAMP))
+ net_enable_timestamp();
}
}

diff --git a/net/socket.c b/net/socket.c
index e9d65ea..669b739 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -545,6 +545,20 @@ void sock_release(struct socket *sock)
sock->file = NULL;
}

+int sock_tx_timestamp(struct msghdr *msg, struct sock *sk,
+ union skb_shared_tx *shtx)
+{
+ shtx->flags = 0;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_HARDWARE)) {
+ shtx->hardware = 1;
+ }
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE)) {
+ shtx->software = 1;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(sock_tx_timestamp);
+
static inline int __sock_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *msg, size_t size)
{
@@ -595,33 +609,65 @@ int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
return result;
}

+static int ktime2ts(ktime_t kt, struct timespec *ts)
+{
+ if (kt.tv64) {
+ *ts = ktime_to_timespec(kt);
+ return 1;
+ } else {
+ return 0;
+ }
+}
+
/*
* called from sock_recv_timestamp() if sock_flag(sk, SOCK_RCVTSTAMP)
*/
void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
struct sk_buff *skb)
{
- ktime_t kt = skb->tstamp;
-
- if (!sock_flag(sk, SOCK_RCVTSTAMPNS)) {
- struct timeval tv;
- /* Race occurred between timestamp enabling and packet
- receiving. Fill in the current time for now. */
- if (kt.tv64 == 0)
- kt = ktime_get_real();
- skb->tstamp = kt;
- tv = ktime_to_timeval(kt);
- put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP, sizeof(tv), &tv);
- } else {
- struct timespec ts;
- /* Race occurred between timestamp enabling and packet
- receiving. Fill in the current time for now. */
- if (kt.tv64 == 0)
- kt = ktime_get_real();
- skb->tstamp = kt;
- ts = ktime_to_timespec(kt);
- put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPNS, sizeof(ts), &ts);
+ int need_software_tstamp = sock_flag(sk, SOCK_RCVTSTAMP);
+ struct timespec ts[3];
+ int empty = 1;
+ struct skb_shared_hwtstamps *shhwtstamps =
+ skb_hwtstamps(skb);
+
+ /* Race occurred between timestamp enabling and packet
+ receiving. Fill in the current time for now. */
+ if (need_software_tstamp && skb->tstamp.tv64 == 0)
+ __net_timestamp(skb);
+
+ if (need_software_tstamp) {
+ if (!sock_flag(sk, SOCK_RCVTSTAMPNS)) {
+ struct timeval tv;
+ skb_get_timestamp(skb, &tv);
+ put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+ sizeof(tv), &tv);
+ } else {
+ struct timespec ts;
+ skb_get_timestampns(skb, &ts);
+ put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPNS,
+ sizeof(ts), &ts);
+ }
+ }
+
+
+ memset(ts, 0, sizeof(ts));
+ if (skb->tstamp.tv64 &&
+ sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE)) {
+ skb_get_timestampns(skb, ts + 0);
+ empty = 0;
+ }
+ if (shhwtstamps) {
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE) &&
+ ktime2ts(shhwtstamps->syststamp, ts + 1))
+ empty = 0;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE) &&
+ ktime2ts(shhwtstamps->hwtstamp, ts + 2))
+ empty = 0;
}
+ if (!empty)
+ put_cmsg(msg, SOL_SOCKET,
+ SCM_TIMESTAMPING, sizeof(ts), &ts);
}

EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
--
1.5.5.3

2008-12-15 14:56:52

by Patrick Ohly

Subject: [RFC PATCH 02/12] net: infrastructure for hardware time stamping

Instead of adding new members to struct sk_buff this
patch introduces and uses a generic mechanism for
extending skb: additional structures are allocated
at the end of the data area, similar to the skb_shared_info.
One new member of skb holds the information which of the
optional structures are present, with one bit per
structure. This allows fast checks whether certain
information is present.

The actual address of an optional structure
is found by using a hard-coded ordering of these
structures and adding up the size and alignment padding
of the preceding structs.
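
As a rough illustration of that lookup, here is a stand-alone sketch.
The names opt_sizes, align_up and opt_offset and the sizes of 16 bytes
are assumptions made for the example; unlike the helper in this patch,
the loop is bounded and reports a missing structure instead of relying
on the caller to check first. The mask-based rounding only works for
sizes that are powers of two.

```c
#include <assert.h>
#include <stddef.h>

/* One bit per optional structure, in their hard-coded order. */
enum { OPT_HWTSTAMPS = 1 << 0, OPT_TX = 1 << 1, OPT_NUM = 2 };

/* Assumed sizes; they must be powers of two for the rounding below. */
static const size_t opt_sizes[OPT_NUM] = { 16, 16 };

/* Round offset up to the next multiple of size (a power of two). */
static size_t align_up(size_t offset, size_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (offset + size - 1) & ~(size - 1);
}

/*
 * Offset of the structure selected by 'wanted', where 'base' is the
 * offset just behind the fixed shared info and 'present' is the bit
 * mask of allocated structures. Returns (size_t)-1 if 'wanted' is
 * not present.
 */
static size_t opt_offset(unsigned int present, unsigned int wanted,
			 size_t base)
{
	size_t offset = base;

	for (int i = 0; i < OPT_NUM; i++) {
		unsigned int bit = 1u << i;

		if (!(present & bit))
			continue;
		offset = align_up(offset, opt_sizes[i]);
		if (bit == wanted)
			return offset;
		offset += opt_sizes[i];
	}
	return (size_t)-1;
}
```

With a base of 100 and both structures present, the first one lands at
offset 112 and the second one at 128 in this model.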

The new struct skb_shared_tx is used to transport time stamping
instructions to the device driver (outgoing packets). The
resulting hardware time stamps are returned via struct
skb_shared_hwtstamps (incoming or sent packets), in all
formats possibly needed by the rest of the kernel and
user space (original raw hardware time stamp and converted
to system time base). This replaces the problematic callbacks
into the network driver used in earlier revisions of this patch.

Conceptually the two structs are independent and use
different bits in the new flags field. This avoids the
problem that dev_hard_start_xmit() cannot distinguish
reliably between outgoing and incoming packets (it is
called for looped multicast packets). But to avoid copying
sent data, the space reserved for skb_shared_tx is
increased so that this space can be reused for skb_shared_hwtstamps
when sending back the packet to the originating socket.
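
The space reuse can be pictured with a small user-space model. All
model_* names and SLOT_SIZE are hypothetical, and a single slot stands
in for the reserved area: the slot is sized for the larger of the two
structures, so the time stamps can overwrite the instructions in place
and only the flag bits change.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { OPT_HWTSTAMPS = 1 << 0, OPT_TX = 1 << 1 };

/* Simplified stand-ins for skb_shared_hwtstamps and skb_shared_tx. */
struct model_hwtstamps { int64_t hwtstamp; int64_t syststamp; };
struct model_tx { uint8_t flags; };

/* The slot must be able to hold either structure. */
#define SLOT_SIZE (sizeof(struct model_hwtstamps) > sizeof(struct model_tx) ? \
		   sizeof(struct model_hwtstamps) : sizeof(struct model_tx))

struct model_skb {
	uint8_t optional;		/* which structure the slot holds */
	unsigned char slot[SLOT_SIZE];
};

/* Overwrite the TX instructions with the resulting time stamps. */
static void model_store_hwtstamps(struct model_skb *skb,
				  const struct model_hwtstamps *hw)
{
	assert(skb->optional & OPT_TX);
	memcpy(skb->slot, hw, sizeof(*hw));
	skb->optional = (uint8_t)((skb->optional & ~OPT_TX) | OPT_HWTSTAMPS);
}

/* Copy the time stamps out; returns 0 if the slot holds none. */
static int model_load_hwtstamps(const struct model_skb *skb,
				struct model_hwtstamps *hw)
{
	if (!(skb->optional & OPT_HWTSTAMPS))
		return 0;
	memcpy(hw, skb->slot, sizeof(*hw));
	return 1;
}
```

The model copies with memcpy() to stay independent of the slot's
alignment; the patch itself casts the pointer instead.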

TX time stamping is implemented in software if the device driver
doesn't support hardware time stamping.

The new semantics for hardware/software time stamping around
net_device->hard_start_xmit() are based on two assumptions about
existing network device drivers which don't support hardware
time stamping and know nothing about it:
- they leave the new skb_shared_tx struct unmodified
- they keep the connection to the originating socket in skb->sk
alive, i.e., don't call skb_orphan()

Given that skb_shared_tx is new, the first assumption is safe.
The second is only true for some drivers. As a result, software
TX time stamping currently works with the bnx2 driver, but not
with the unmodified igb driver (the two drivers this patch series
was tested with).
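
The fallback decision taken after the driver's transmit function
returns can be sketched as follows. The model_* names are hypothetical,
and the assumption that a driver claims a packet by setting in_progress
only when a hardware time stamp was requested is mine, not something
this patch enforces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for union skb_shared_tx. */
struct model_tx {
	unsigned int hardware:1,
		     software:1,
		     in_progress:1;
};

/*
 * What a driver with hardware support is assumed to do before it
 * transmits: claim the packet so that the stack stays out of the way.
 */
static void model_driver_claims(struct model_tx *tx)
{
	assert(tx->hardware);
	tx->in_progress = 1;
}

/*
 * True if the stack should generate a software time stamp itself
 * after the transmit function returned rc (0 means success).
 */
static bool model_needs_software_tstamp(const struct model_tx *tx, int rc)
{
	return rc == 0 && tx != NULL && tx->software && !tx->in_progress;
}
```

A driver that knows nothing about time stamping never touches the
structure, so in_progress stays 0 and the software path is taken
whenever the socket asked for it.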

Signed-off-by: Patrick Ohly <[email protected]>
---
include/linux/skbuff.h | 196 ++++++++++++++++++++++++++++++++++++++++++++++--
net/core/dev.c | 34 ++++++++-
net/core/skbuff.c | 139 ++++++++++++++++++++++++++++------
3 files changed, 338 insertions(+), 31 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index acf17af..7f58b55 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -156,6 +156,105 @@ struct skb_shared_info {
#endif
};

+#define HAVE_HW_TIME_STAMP
+
+/**
+ * skb_shared_hwtstamps - optional hardware time stamps
+ *
+ * @hwtstamp: hardware time stamp transformed into duration
+ * since arbitrary point in time
+ * @syststamp: hwtstamp transformed to system time base
+ *
+ * Software time stamps generated by ktime_get_real() are stored in
+ * skb->tstamp. The relation between the different kinds of time
+ * stamps is as follows:
+ *
+ * syststamp and tstamp can be compared against each other in
+ * arbitrary combinations. The accuracy of a
+ * syststamp/tstamp/"syststamp from other device" comparison is
+ * limited by the accuracy of the transformation into system time
+ * base. This depends on the device driver and its underlying
+ * hardware.
+ *
+ * hwtstamps can only be compared against other hwtstamps from
+ * the same device.
+ *
+ * This additional structure has to be allocated together with
+ * the data buffer and is shared between clones.
+ */
+struct skb_shared_hwtstamps {
+ ktime_t hwtstamp;
+ ktime_t syststamp;
+};
+
+/**
+ * skb_shared_tx - optional instructions for time stamping of outgoing packets
+ *
+ * @hardware: generate hardware time stamp
+ * @software: generate software time stamp
+ * @in_progress: device driver is going to provide
+ * hardware time stamp
+ *
+ * This additional structure has to be allocated together with the
+ * data buffer and is shared between clones. Its space is reused
+ * in skb_tstamp_tx() for skb_shared_hwtstamps and therefore it
+ * has to be larger than strictly necessary (handled in skbuff.c).
+ */
+union skb_shared_tx {
+ struct {
+ __u8 hardware:1,
+ software:1,
+ in_progress:1;
+ };
+ __u8 flags;
+};
+
+/*
+ * Flags which control how &struct sk_buff is to be/was allocated.
+ * The &struct skb_shared_info always comes at sk_buff->end, then
+ * all of the optional structs in the order defined by their
+ * flags. Each structure is aligned so that it is at a multiple
+ * of its own size. Putting structs with less strict alignment
+ * requirements at the end increases the chance that no padding
+ * is needed.
+ *
+ * SKB_FLAGS_OPTIONAL_TX could be combined with SKB_FLAGS_OPTIONAL_HWTSTAMPS
+ * (outgoing packets have &union skb_shared_tx, incoming
+ * &struct skb_shared_hwtstamps), but telling apart one from
+ * the other is ambiguous: when a multicast packet is looped back,
+ * it has to be considered incoming, but it then passes through
+ * dev_hard_start_xmit() once more. Better avoid such ambiguities,
+ * in particular as it doesn't save any space. One additional byte
+ * is needed in any case.
+ *
+ * SKB_FLAGS_CLONE replaces the true/false integer fclone parameter in
+ * __alloc_skb(). Clones are marked as before in sk_buff->cloned.
+ *
+ * Similarly, SKB_FLAGS_NOBLOCK is used in place of a special noblock
+ * parameter in sock_alloc_send_skb().
+ *
+ * When adding optional structs, remember to update skb_optional_sizes
+ * in skbuff.c!
+ */
+enum {
+ /*
+ * one byte holds the lower order flags in struct sk_buff,
+ * so we could add more structs without additional costs
+ */
+ SKB_FLAGS_OPTIONAL_HWTSTAMPS = 1 << 0,
+ SKB_FLAGS_OPTIONAL_TX = 1 << 1,
+
+ /* number of bits used for optional structures */
+ SKB_FLAGS_OPTIONAL_NUM = 2,
+
+ /*
+ * the following flags only affect how the skb is allocated,
+ * they are not stored like the ones above
+ */
+ SKB_FLAGS_CLONE = 1 << 8,
+ SKB_FLAGS_NOBLOCK = 1 << 9,
+};
+
/* We divide dataref into two halves. The higher 16 bits hold references
* to the payload part of skb->data. The lower 16 bits hold references to
* the entire skb->data. A clone of a headerless skb holds the length of
@@ -228,6 +327,8 @@ typedef unsigned char *sk_buff_data_t;
* @ip_summed: Driver fed us an IP checksum
* @priority: Packet queueing priority
* @users: User count - see {datagram,tcp}.c
+ * @optional: a combination of SKB_FLAGS_OPTIONAL_* flags, indicates
+ * which of the corresponding structs were allocated
* @protocol: Packet protocol from driver
* @truesize: Buffer size
* @head: Head of buffer
@@ -305,6 +406,8 @@ struct sk_buff {
ipvs_property:1,
peeked:1,
nf_trace:1;
+ /* not all of the bits in optional are used */
+ __u8 optional;
__be16 protocol;

void (*destructor)(struct sk_buff *skb);
@@ -374,18 +477,18 @@ extern void skb_dma_unmap(struct device *dev, struct sk_buff *skb,

extern void kfree_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
-extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+extern struct sk_buff *__alloc_skb_flags(unsigned int size,
+ gfp_t priority, int flags, int node);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 0, -1);
+ return __alloc_skb_flags(size, priority, 0, -1);
}

static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, -1);
+ return __alloc_skb_flags(size, priority, SKB_FLAGS_CLONE, -1);
}

extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -469,6 +572,29 @@ static inline unsigned char *skb_end_pointer(const struct sk_buff *skb)
#define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))

/**
+ * __skb_get_optional - returns pointer to the requested structure
+ *
+ * @optional: one of the SKB_FLAGS_OPTIONAL_* constants
+ *
+ * The caller must check that the structure is actually in the skb.
+ */
+extern void *__skb_get_optional(struct sk_buff *skb, int optional);
+
+static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
+{
+ return (skb->optional & SKB_FLAGS_OPTIONAL_HWTSTAMPS) ?
+ __skb_get_optional(skb, SKB_FLAGS_OPTIONAL_HWTSTAMPS) :
+ NULL;
+}
+
+static inline union skb_shared_tx *skb_tx(struct sk_buff *skb)
+{
+ return (skb->optional & SKB_FLAGS_OPTIONAL_TX) ?
+ __skb_get_optional(skb, SKB_FLAGS_OPTIONAL_TX) :
+ NULL;
+}
+
+/**
* skb_queue_empty - check if a queue is empty
* @list: queue head
*
@@ -1399,8 +1525,33 @@ static inline struct sk_buff *__dev_alloc_skb(unsigned int length,

extern struct sk_buff *dev_alloc_skb(unsigned int length);

-extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
- unsigned int length, gfp_t gfp_mask);
+/**
+ * __netdev_alloc_skb_internal - allocate an skbuff for rx on a specific device
+ * @dev: network device to receive on
+ * @length: length to allocate
+ * @flags: SKB_FLAGS_* mask
+ * @gfp_mask: get_free_pages mask, passed to alloc_skb
+ *
+ * Allocate a new &sk_buff and assign it a usage count of one. The
+ * buffer has unspecified headroom built in. Users should allocate
+ * the headroom they think they need without accounting for the
+ * built in space. The built in space is used for optimisations.
+ *
+ * %NULL is returned if there is no free memory.
+ *
+ * This function takes the full set of parameters. There are aliases
+ * with a smaller number of parameters.
+ */
+extern struct sk_buff *__netdev_alloc_skb_internal(struct net_device *dev,
+ unsigned int length, int flags,
+ gfp_t gfp_mask);
+
+static inline struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
+ unsigned int length,
+ gfp_t gfp_mask)
+{
+ return __netdev_alloc_skb_internal(dev, length, 0, gfp_mask);
+}

/**
* netdev_alloc_skb - allocate an skbuff for rx on a specific device
@@ -1418,7 +1569,12 @@ extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
static inline struct sk_buff *netdev_alloc_skb(struct net_device *dev,
unsigned int length)
{
- return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
+ return __netdev_alloc_skb_internal(dev, length, 0, GFP_ATOMIC);
+}
+static inline struct sk_buff *netdev_alloc_skb_flags(struct net_device *dev,
+ unsigned int length, int flags)
+{
+ return __netdev_alloc_skb_internal(dev, length, flags, GFP_ATOMIC);
}

extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
@@ -1733,6 +1889,11 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,

extern void skb_init(void);

+static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
+{
+ return skb->tstamp;
+}
+
/**
* skb_get_timestamp - get timestamp from a skb
* @skb: skb to get stamp from
@@ -1747,6 +1908,11 @@ static inline void skb_get_timestamp(const struct sk_buff *skb, struct timeval *
*stamp = ktime_to_timeval(skb->tstamp);
}

+static inline void skb_get_timestampns(const struct sk_buff *skb, struct timespec *stamp)
+{
+ *stamp = ktime_to_timespec(skb->tstamp);
+}
+
static inline void __net_timestamp(struct sk_buff *skb)
{
skb->tstamp = ktime_get_real();
@@ -1762,6 +1928,22 @@ static inline ktime_t net_invalid_timestamp(void)
return ktime_set(0, 0);
}

+/**
+ * skb_tstamp_tx - queue clone of skb with send time stamps
+ * @orig_skb: the original outgoing packet
+ * @hwtstamps: hardware time stamps, may be NULL if not available
+ *
+ * If the skb has a socket associated, then this function clones the
+ * skb (thus sharing the actual data and optional structures), stores
+ * the optional hardware time stamping information (if non NULL) or
+ * generates a software time stamp (otherwise), then queues the clone
+ * to the error queue of the socket. Errors are silently ignored.
+ *
+ * May only be called on skbs which have a skb_shared_tx!
+ */
+extern void skb_tstamp_tx(struct sk_buff *orig_skb,
+ struct skb_shared_hwtstamps *hwtstamps);
+
extern __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len);
extern __sum16 __skb_checksum_complete(struct sk_buff *skb);

diff --git a/net/core/dev.c b/net/core/dev.c
index f54cac7..94d95a8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1657,12 +1657,25 @@ static int dev_gso_segment(struct sk_buff *skb)
return 0;
}

+static void tstamp_tx(struct sk_buff *skb)
+{
+ union skb_shared_tx *shtx =
+ skb_tx(skb);
+ if (unlikely(shtx &&
+ shtx->software &&
+ !shtx->in_progress)) {
+ skb_tstamp_tx(skb, NULL);
+ }
+}
+
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq)
{
const struct net_device_ops *ops = dev->netdev_ops;
+ int rc;

prefetch(&dev->netdev_ops->ndo_start_xmit);
+
if (likely(!skb->next)) {
if (!list_empty(&ptype_all))
dev_queue_xmit_nit(skb, dev);
@@ -1674,13 +1687,29 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
goto gso;
}

- return ops->ndo_start_xmit(skb, dev);
+ rc = ops->ndo_start_xmit(skb, dev);
+ /*
+ * TODO: if skb_orphan() was called by
+ * dev->hard_start_xmit() (for example, the unmodified
+ * igb driver does that; bnx2 doesn't), then
+ * skb_tstamp_tx() will be unable to send
+ * back the time stamp.
+ *
+ * How can this be prevented? Always create another
+ * reference to the socket before calling
+ * dev->hard_start_xmit()? Prevent that skb_orphan()
+ * does anything in dev->hard_start_xmit() by clearing
+ * the skb destructor before the call and restoring it
+ * afterwards, then doing the skb_orphan() ourselves?
+ */
+ if (likely(!rc))
+ tstamp_tx(skb);
+ return rc;
}

gso:
do {
struct sk_buff *nskb = skb->next;
- int rc;

skb->next = nskb->next;
nskb->next = NULL;
@@ -1690,6 +1719,7 @@ gso:
skb->next = nskb;
return rc;
}
+ tstamp_tx(skb);
if (unlikely(netif_tx_queue_stopped(txq) && skb->next))
return NETDEV_TX_BUSY;
} while (skb->next);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b1f6287..6f5fcc7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -55,6 +55,7 @@
#include <linux/rtnetlink.h>
#include <linux/init.h>
#include <linux/scatterlist.h>
+#include <linux/errqueue.h>

#include <net/protocol.h>
#include <net/dst.h>
@@ -155,6 +156,49 @@ void skb_truesize_bug(struct sk_buff *skb)
}
EXPORT_SYMBOL(skb_truesize_bug);

+/*
+ * The size of each struct that corresponds to a SKB_FLAGS_OPTIONAL_*
+ * flag.
+ */
+static const unsigned int skb_optional_sizes[] =
+{
+ /*
+ * hwtstamps and tx are special: the space allocated for tx
+ * is reused for hwtstamps in skb_tstamp_tx(). This avoids copying
+ * the complete packet data.
+ *
+ * max() cannot be used here because it contains a code block,
+ * which gcc doesn't accept.
+ */
+#define MAX_SHARED_TIMESTAMPING ((sizeof(struct skb_shared_hwtstamps) > \
+ sizeof(union skb_shared_tx)) ? \
+ sizeof(struct skb_shared_hwtstamps) : \
+ sizeof(union skb_shared_tx))
+
+ MAX_SHARED_TIMESTAMPING,
+ MAX_SHARED_TIMESTAMPING
+};
+
+void *__skb_get_optional(struct sk_buff *skb, int optional)
+{
+ unsigned int offset = (unsigned int)(skb_end_pointer(skb) - skb->head +
+ sizeof(struct skb_shared_info));
+ int i = 0;
+
+ while(1) {
+ if (skb->optional & (1 << i)) {
+ unsigned int struct_size = skb_optional_sizes[i];
+ offset = (offset + struct_size - 1) & ~(struct_size - 1);
+ if ((1 << i) == optional)
+ break;
+ offset += struct_size;
+ }
+ i++;
+ }
+ return skb->head + offset;
+}
+EXPORT_SYMBOL(__skb_get_optional);
+
/* Allocate a new skbuff. We do this ourselves so we can fill in a few
* 'private' fields and also do memory statistics to find all the
* [BEEP] leaks.
@@ -162,9 +206,10 @@ EXPORT_SYMBOL(skb_truesize_bug);
*/

/**
- * __alloc_skb - allocate a network buffer
+ * __alloc_skb_flags - allocate a network buffer
* @size: size to allocate
* @gfp_mask: allocation mask
+ * @flags: SKB_FLAGS_* mask
* @fclone: allocate from fclone cache instead of head cache
* and allocate a cloned (child) skb
* @node: numa node to allocate memory on
@@ -176,13 +221,16 @@ EXPORT_SYMBOL(skb_truesize_bug);
* Buffers may only be allocated from interrupts using a @gfp_mask of
* %GFP_ATOMIC.
*/
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+struct sk_buff *__alloc_skb_flags(unsigned int size, gfp_t gfp_mask,
+ int flags, int node)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+ int fclone = flags & SKB_FLAGS_CLONE;
+ unsigned int total_size;
+ int i;

cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

@@ -192,7 +240,15 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
goto out;

size = SKB_DATA_ALIGN(size);
- data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
+ total_size = size + sizeof(struct skb_shared_info);
+ for (i = 0; i < SKB_FLAGS_OPTIONAL_NUM; i++) {
+ if (flags & (1 << i)) {
+ unsigned int struct_size = skb_optional_sizes[i];
+ total_size = (total_size + struct_size - 1) & ~(struct_size - 1);
+ total_size += struct_size;
+ }
+ }
+ data = kmalloc_node_track_caller(total_size,
gfp_mask, node);
if (!data)
goto nodata;
@@ -228,6 +284,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,

child->fclone = SKB_FCLONE_UNAVAILABLE;
}
+ skb->optional = flags;
+
out:
return skb;
nodata:
@@ -236,26 +294,14 @@ nodata:
goto out;
}

-/**
- * __netdev_alloc_skb - allocate an skbuff for rx on a specific device
- * @dev: network device to receive on
- * @length: length to allocate
- * @gfp_mask: get_free_pages mask, passed to alloc_skb
- *
- * Allocate a new &sk_buff and assign it a usage count of one. The
- * buffer has unspecified headroom built in. Users should allocate
- * the headroom they think they need without accounting for the
- * built in space. The built in space is used for optimisations.
- *
- * %NULL is returned if there is no free memory.
- */
-struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
- unsigned int length, gfp_t gfp_mask)
+struct sk_buff *__netdev_alloc_skb_internal(struct net_device *dev,
+ unsigned int length, int flags,
+ gfp_t gfp_mask)
{
int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
struct sk_buff *skb;

- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+ skb = __alloc_skb_flags(length + NET_SKB_PAD, gfp_mask, flags, node);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -548,6 +594,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
n->cloned = 1;
n->nohdr = 0;
n->destructor = NULL;
+ C(optional);
C(iif);
C(tail);
C(end);
@@ -2743,6 +2790,54 @@ int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer)
return elt;
}

+void skb_tstamp_tx(struct sk_buff *orig_skb,
+ struct skb_shared_hwtstamps *hwtstamps)
+{
+ struct sock *sk = orig_skb->sk;
+ struct sock_exterr_skb *serr;
+ struct sk_buff *skb;
+ int err = -ENOMEM;
+ union skb_shared_tx *shtx =
+ skb_tx(orig_skb);
+
+ if (!sk)
+ return;
+
+ skb = skb_clone(orig_skb, GFP_ATOMIC);
+ if (!skb)
+ return;
+
+ if (hwtstamps) {
+ /*
+ * reuse the existing space for time stamping
+ * instructions for storing the results
+ */
+ struct skb_shared_hwtstamps *shhwtstamps =
+ (struct skb_shared_hwtstamps *)shtx;
+ *shhwtstamps = *hwtstamps;
+ skb->optional = (skb->optional &
+ ~SKB_FLAGS_OPTIONAL_TX) |
+ SKB_FLAGS_OPTIONAL_HWTSTAMPS;
+ } else {
+ /*
+ * no hardware time stamps available,
+ * so keep the skb_shared_tx and only
+ * store software time stamp
+ */
+ skb->tstamp = ktime_get_real();
+ }
+
+ serr = SKB_EXT_ERR(skb);
+ memset(serr, 0, sizeof(*serr));
+ serr->ee.ee_errno = ENOMSG;
+ serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
+ err = sock_queue_err_skb(sk, skb);
+ if (err)
+ kfree_skb(skb);
+}
+EXPORT_SYMBOL_GPL(skb_tstamp_tx);
+
+
/**
* skb_partial_csum_set - set up and verify partial csum values for packet
* @skb: the skb to set
@@ -2782,8 +2877,8 @@ EXPORT_SYMBOL(___pskb_trim);
EXPORT_SYMBOL(__kfree_skb);
EXPORT_SYMBOL(kfree_skb);
EXPORT_SYMBOL(__pskb_pull_tail);
-EXPORT_SYMBOL(__alloc_skb);
-EXPORT_SYMBOL(__netdev_alloc_skb);
+EXPORT_SYMBOL(__alloc_skb_flags);
+EXPORT_SYMBOL(__netdev_alloc_skb_internal);
EXPORT_SYMBOL(pskb_copy);
EXPORT_SYMBOL(pskb_expand_head);
EXPORT_SYMBOL(skb_checksum);
--
1.5.5.3

2008-12-15 14:57:13

by Patrick Ohly

Subject: [RFC PATCH 04/12] sockets: allow allocating skb with optional structures

The internal sock_alloc_send_pskb() is now exposed
as sock_alloc_send_skb_flags() and takes a flags
parameter with additional instructions for how the
skb is to be allocated. This is necessary for adding
send time stamping information to outgoing packets.

sock_alloc_send_skb() is turned into a simple wrapper
which preserves compatibility with code that
passes a boolean "noblock" instead of the flags
bitmask.

sock_alloc_send_pskb() was never called with non-zero
data_len, therefore the obsolete code and parameter
were removed.
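
The compatibility wrapper described above can be sketched in isolation.
This is only an illustration: the flag values and the names
alloc_with_flags() and alloc_legacy() are made up for the sketch and
stand in for sock_alloc_send_skb_flags() and sock_alloc_send_skb().

```c
/* Hypothetical flag values, for illustration only. */
#define SKB_FLAGS_NOBLOCK     (1 << 0)
#define SKB_FLAGS_OPTIONAL_TX (1 << 1)

/*
 * Stand-in for the flags-based allocator. It only reports the
 * flags it was given so that the mapping can be inspected.
 */
static inline int alloc_with_flags(unsigned long size, int flags)
{
	(void)size;
	return flags;
}

/*
 * Legacy entry point: any non-zero "noblock" value collapses to
 * the single NOBLOCK bit, no other flag can leak through.
 */
static inline int alloc_legacy(unsigned long size, int noblock)
{
	return alloc_with_flags(size, noblock ? SKB_FLAGS_NOBLOCK : 0);
}
```

Existing callers that pass arbitrary non-zero integers for "noblock"
keep working because the wrapper normalizes the value before it
reaches the bitmask.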

Signed-off-by: Patrick Ohly <[email protected]>
---
include/net/sock.h | 17 +++++++++++++----
net/core/sock.c | 50 +++++++-------------------------------------------
2 files changed, 20 insertions(+), 47 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 36807e4..6cb120c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -948,10 +948,19 @@ extern int sock_setsockopt(struct socket *sock, int level,
extern int sock_getsockopt(struct socket *sock, int level,
int op, char __user *optval,
int __user *optlen);
-extern struct sk_buff *sock_alloc_send_skb(struct sock *sk,
- unsigned long size,
- int noblock,
- int *errcode);
+extern struct sk_buff *sock_alloc_send_skb_flags(struct sock *sk,
+ unsigned long size,
+ int flags,
+ int *errcode);
+inline static struct sk_buff *sock_alloc_send_skb(struct sock *sk,
+ unsigned long size,
+ int noblock,
+ int *errcode)
+{
+ return sock_alloc_send_skb_flags(sk, size,
+ noblock ? SKB_FLAGS_NOBLOCK : 0,
+ errcode);
+}
extern void *sock_kmalloc(struct sock *sk, int size,
gfp_t priority);
extern void sock_kfree_s(struct sock *sk, void *mem, int size);
diff --git a/net/core/sock.c b/net/core/sock.c
index 35b4f4c..64f959e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1304,10 +1304,9 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
* Generic send/receive buffer handlers
*/

-static struct sk_buff *sock_alloc_send_pskb(struct sock *sk,
+struct sk_buff *sock_alloc_send_skb_flags(struct sock *sk,
unsigned long header_len,
- unsigned long data_len,
- int noblock, int *errcode)
+ int flags, int *errcode)
{
struct sk_buff *skb;
gfp_t gfp_mask;
@@ -1318,7 +1317,9 @@ static struct sk_buff *sock_alloc_send_pskb(struct sock *sk,
if (gfp_mask & __GFP_WAIT)
gfp_mask |= __GFP_REPEAT;

- timeo = sock_sndtimeo(sk, noblock);
+ timeo = sock_sndtimeo(sk,
+ (flags & SKB_FLAGS_NOBLOCK) ?
+ MSG_DONTWAIT : 0);
while (1) {
err = sock_error(sk);
if (err != 0)
@@ -1329,39 +1330,8 @@ static struct sk_buff *sock_alloc_send_pskb(struct sock *sk,
goto failure;

if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
- skb = alloc_skb(header_len, gfp_mask);
+ skb = __alloc_skb_flags(header_len, gfp_mask, flags, -1);
if (skb) {
- int npages;
- int i;
-
- /* No pages, we're done... */
- if (!data_len)
- break;
-
- npages = (data_len + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
- skb->truesize += data_len;
- skb_shinfo(skb)->nr_frags = npages;
- for (i = 0; i < npages; i++) {
- struct page *page;
- skb_frag_t *frag;
-
- page = alloc_pages(sk->sk_allocation, 0);
- if (!page) {
- err = -ENOBUFS;
- skb_shinfo(skb)->nr_frags = i;
- kfree_skb(skb);
- goto failure;
- }
-
- frag = &skb_shinfo(skb)->frags[i];
- frag->page = page;
- frag->page_offset = 0;
- frag->size = (data_len >= PAGE_SIZE ?
- PAGE_SIZE :
- data_len);
- data_len -= PAGE_SIZE;
- }
-
/* Full success... */
break;
}
@@ -1388,12 +1358,6 @@ failure:
return NULL;
}

-struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
- int noblock, int *errcode)
-{
- return sock_alloc_send_pskb(sk, size, 0, noblock, errcode);
-}
-
static void __lock_sock(struct sock *sk)
{
DEFINE_WAIT(wait);
@@ -2327,7 +2291,7 @@ subsys_initcall(proto_init);
EXPORT_SYMBOL(sk_alloc);
EXPORT_SYMBOL(sk_free);
EXPORT_SYMBOL(sk_send_sigurg);
-EXPORT_SYMBOL(sock_alloc_send_skb);
+EXPORT_SYMBOL(sock_alloc_send_skb_flags);
EXPORT_SYMBOL(sock_init_data);
EXPORT_SYMBOL(sock_kfree_s);
EXPORT_SYMBOL(sock_kmalloc);
--
1.5.5.3

2008-12-15 14:57:49

by Patrick Ohly

Subject: [RFC PATCH 07/12] net: pass new SIOCSHWTSTAMP through to device drivers


Signed-off-by: Patrick Ohly <[email protected]>
---
fs/compat_ioctl.c | 1 +
net/core/dev.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 5235c67..a5001a6 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -2555,6 +2555,7 @@ HANDLE_IOCTL(SIOCSIFMAP, dev_ifsioc)
HANDLE_IOCTL(SIOCGIFADDR, dev_ifsioc)
HANDLE_IOCTL(SIOCSIFADDR, dev_ifsioc)
HANDLE_IOCTL(SIOCSIFHWBROADCAST, dev_ifsioc)
+HANDLE_IOCTL(SIOCSHWTSTAMP, dev_ifsioc)

/* ioctls used by appletalk ddp.c */
HANDLE_IOCTL(SIOCATALKDIFADDR, dev_ifsioc)
diff --git a/net/core/dev.c b/net/core/dev.c
index 94d95a8..dc7b4fc 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3675,6 +3675,7 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd)
cmd == SIOCSMIIREG ||
cmd == SIOCBRADDIF ||
cmd == SIOCBRDELIF ||
+ cmd == SIOCSHWTSTAMP ||
cmd == SIOCWANDEV) {
err = -EOPNOTSUPP;
if (ops->ndo_do_ioctl) {
@@ -3829,6 +3830,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
case SIOCBONDCHANGEACTIVE:
case SIOCBRADDIF:
case SIOCBRDELIF:
+ case SIOCSHWTSTAMP:
if (!capable(CAP_NET_ADMIN))
return -EPERM;
/* fall through */
--
1.5.5.3

2008-12-15 14:57:34

by Patrick Ohly

Subject: [RFC PATCH 06/12] debug: NULL pointer check in ip_output


Signed-off-by: Patrick Ohly <[email protected]>
---
net/ipv4/ip_output.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index ed92f0b..03a6706 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -950,8 +950,14 @@ alloc_new_skb:
skb->ip_summed = csummode;
skb->csum = 0;
skb_reserve(skb, hh_len);
- if (ipc->shtx.flags)
- *skb_tx(skb) = ipc->shtx;
+ if (ipc->shtx.flags) {
+ if (skb_tx(skb))
+ *skb_tx(skb) = ipc->shtx;
+ else
+ printk(KERN_DEBUG
+ "ERROR: skb with flags %x and no tx ptr\n",
+ ipc->shtx.flags);
+ }

/*
* Find where to start putting bytes.
--
1.5.5.3

2008-12-15 14:58:11

by Patrick Ohly

Subject: [RFC PATCH 09/12] clocksource: allow usage independent of timekeeping.c

So far struct clocksource acted as the interface between time/timekeeping.c
and hardware. This patch generalizes the concept so that a similar
interface can also be used in other contexts. For that it introduces
new structures and related functions *without* touching the existing
struct clocksource.

The reasons for adding these new structures to clocksource.[ch] are
* the APIs are clearly related
* struct clocksource could be cleaned up to use the new structs
* avoids proliferation of files with similar names (timesource.h?
timecounter.h?)

As outlined in the discussion with John Stultz, this patch adds
* struct cyclecounter: stateless API to hardware which counts clock cycles
* struct timecounter: stateful utility code built on a cyclecounter which
provides a nanosecond counter
* only the function to read the nanosecond counter; deltas are used internally
and not exposed to users of timecounter
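
To illustrate the mult/shift conversion that a cyclecounter carries,
here is a minimal user space sketch. The struct and function names are
invented for this example and only mirror the fields described above.

```c
#include <stdint.h>

/* Mirrors the conversion-related fields of a cycle counter. */
struct cyclecounter_sketch {
	uint64_t mask;	/* (1ULL << counter_bits) - 1 */
	uint32_t mult;	/* cycle to nanosecond multiplier */
	uint32_t shift;	/* cycle to nanosecond divisor (power of two) */
};

/*
 * ns = cycles * mult / 2^shift, truncated towards zero.
 * The product can overflow 64 bits for very large cycle counts,
 * so this is only meant for deltas, not absolute counter values.
 */
static inline uint64_t sketch_cyc2ns(const struct cyclecounter_sketch *cc,
				     uint64_t cycles)
{
	return (cycles * cc->mult) >> cc->shift;
}
```

For example, a 16 MHz counter advances by 62.5 ns per cycle, which can
be expressed as mult = 125 and shift = 1.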

The code does no locking of the shared state. timecounter_read() must
be called at least once per wrap-around period of the cycle counter so
that wrap arounds are detected. Both are the responsibility of the
timecounter user.
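
The wrap-around handling can be tried out in user space with a narrow
counter. The following is a sketch under simplifying assumptions: the
names are invented, the raw counter value is passed in instead of being
read from hardware, and there is no locking.

```c
#include <stdint.h>

struct tc_sketch {
	uint64_t mask;		/* counter width as a bitmask */
	uint64_t cycle_last;	/* raw counter value at the last read */
	uint64_t nsec;		/* accumulated nanoseconds */
	uint32_t mult, shift;	/* cycle to nanosecond conversion */
};

/*
 * Advance the nanosecond count using a fresh raw counter value.
 * The masked subtraction yields the right delta across one wrap
 * around, but not across two or more.
 */
static inline uint64_t tc_sketch_read(struct tc_sketch *tc, uint64_t cycle_now)
{
	uint64_t delta = (cycle_now - tc->cycle_last) & tc->mask;

	tc->cycle_last = cycle_now;
	tc->nsec += (delta * tc->mult) >> tc->shift;
	return tc->nsec;
}

/*
 * Convert a raw stamp taken near cycle_last. A delta of more than
 * half the counter range is interpreted as a stamp from the past.
 */
static inline uint64_t tc_sketch_cyc2time(const struct tc_sketch *tc,
					  uint64_t stamp)
{
	uint64_t delta = (stamp - tc->cycle_last) & tc->mask;

	if (delta > tc->mask / 2) {
		delta = (tc->cycle_last - stamp) & tc->mask;
		return tc->nsec - ((delta * tc->mult) >> tc->shift);
	}
	return tc->nsec + ((delta * tc->mult) >> tc->shift);
}
```

With an 8 bit counter, moving from raw value 250 to raw value 4 is
seen as 10 elapsed cycles. A stamp of 2 taken shortly before that read
is then mapped to a time slightly in the past rather than almost one
full counter period in the future.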

Signed-off-by: Patrick Ohly <[email protected]>
---
include/linux/clocksource.h | 99 +++++++++++++++++++++++++++++++++++++++++++
kernel/time/clocksource.c | 76 +++++++++++++++++++++++++++++++++
2 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index f88d32f..d379189 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -22,8 +22,107 @@ typedef u64 cycle_t;
struct clocksource;

/**
+ * struct cyclecounter - hardware abstraction for a free running counter
+ * Provides completely state-free accessors to the underlying hardware.
+ * Depending on which hardware it reads, the cycle counter may wrap
+ * around quickly. Locking rules (if necessary) have to be defined
+ * by the implementor and user of specific instances of this API.
+ *
+ * @read: returns the current cycle value
+ * @mask: bitmask for two's complement
+ * subtraction of non 64 bit counters,
+ * see CLOCKSOURCE_MASK() helper macro
+ * @mult: cycle to nanosecond multiplier
+ * @shift: cycle to nanosecond divisor (power of two)
+ */
+struct cyclecounter {
+ cycle_t (*read)(const struct cyclecounter *cc);
+ cycle_t mask;
+ u32 mult;
+ u32 shift;
+};
+
+/**
+ * struct timecounter - layer above a %struct cyclecounter which counts nanoseconds
+ * Contains the state needed by timecounter_read() to detect
+ * cycle counter wrap around. Initialize with
+ * timecounter_init(). Also used to convert cycle counts into the
+ * corresponding nanosecond counts with timecounter_cyc2time(). Users
+ * of this code are responsible for initializing the underlying
+ * cycle counter hardware, locking issues and reading the time
+ * more often than the cycle counter wraps around. The nanosecond
+ * counter will only wrap around after ~585 years.
+ *
+ * @cc: the cycle counter used by this instance
+ * @cycle_last: most recent cycle counter value seen by timecounter_read()
+ * @nsec: accumulated nanosecond count, updated by timecounter_read()
+ */
+struct timecounter {
+ const struct cyclecounter *cc;
+ cycle_t cycle_last;
+ u64 nsec;
+};
+
+/**
+ * cyclecounter_cyc2ns - converts cycle counter cycles to nanoseconds
+ * @cc: Pointer to cycle counter.
+ * @cycles: Cycles
+ *
+ * XXX - This could use some mult_lxl_ll() asm optimization. Same code
+ * as in cyc2ns, but with unsigned result.
+ */
+static inline u64 cyclecounter_cyc2ns(const struct cyclecounter *cc, cycle_t cycles)
+{
+ u64 ret = (u64)cycles;
+ ret = (ret * cc->mult) >> cc->shift;
+ return ret;
+}
+
+/**
+ * timecounter_init - initialize a time counter
+ * @tc: Pointer to time counter which is to be initialized/reset
+ * @cc: A cycle counter, ready to be used.
+ * @start_tstamp: Arbitrary initial time stamp.
+ *
+ * After this call the current cycle register (roughly) corresponds to
+ * the initial time stamp. Every call to timecounter_read() increments
+ * the time stamp counter by the number of elapsed nanoseconds.
+ */
+extern void timecounter_init(struct timecounter *tc,
+ const struct cyclecounter *cc,
+ u64 start_tstamp);
+
+/**
+ * timecounter_read - return nanoseconds elapsed since timecounter_init()
+ * plus the initial time stamp
+ * @tc: Pointer to time counter.
+ *
+ * In other words, keeps track of time since the same epoch as
+ * the function which generated the initial time stamp.
+ */
+extern u64 timecounter_read(struct timecounter *tc);
+
+/**
+ * timecounter_cyc2time - convert a cycle counter to same
+ * time base as values returned by
+ * timecounter_read()
+ * @tc: Pointer to time counter.
+ * @cycle_tstamp: a value returned by tc->cc->read()
+ *
+ * Cycle counts are converted correctly as long as they
+ * fall into the interval [-1/2 max cycle count, +1/2 max cycle count],
+ * with "max cycle count" == tc->cc->mask+1.
+ *
+ * This allows conversion of cycle counter values which were generated
+ * in the past.
+ */
+extern u64 timecounter_cyc2time(struct timecounter *tc,
+ cycle_t cycle_tstamp);
+
+/**
* struct clocksource - hardware abstraction for a free running counter
* Provides mostly state-free accessors to the underlying hardware.
+ * This is the structure used for system time.
*
* @name: ptr to clocksource name
* @list: list head for registration
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 9ed2eec..0d7a2cb 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -31,6 +31,82 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>

+void timecounter_init(struct timecounter *tc,
+ const struct cyclecounter *cc,
+ u64 start_tstamp)
+{
+ tc->cc = cc;
+ tc->cycle_last = cc->read(cc);
+ tc->nsec = start_tstamp;
+}
+EXPORT_SYMBOL(timecounter_init);
+
+/**
+ * timecounter_read_delta - get nanoseconds since last call of this function
+ * @tc: Pointer to time counter
+ *
+ * When the underlying cycle counter runs over, this will be handled
+ * correctly as long as it does not run over more than once between
+ * calls.
+ *
+ * The first call to this function for a new time counter initializes
+ * the time tracking and returns bogus results.
+ */
+static u64 timecounter_read_delta(struct timecounter *tc)
+{
+ cycle_t cycle_now, cycle_delta;
+ u64 ns_offset;
+
+ /* read cycle counter: */
+ cycle_now = tc->cc->read(tc->cc);
+
+ /* calculate the delta since the last timecounter_read_delta(): */
+ cycle_delta = (cycle_now - tc->cycle_last) & tc->cc->mask;
+
+ /* convert to nanoseconds: */
+ ns_offset = cyclecounter_cyc2ns(tc->cc, cycle_delta);
+
+ /* update time stamp of timecounter_read_delta() call: */
+ tc->cycle_last = cycle_now;
+
+ return ns_offset;
+}
+
+u64 timecounter_read(struct timecounter *tc)
+{
+ u64 nsec;
+
+ /* increment time by nanoseconds since last call */
+ nsec = timecounter_read_delta(tc);
+ nsec += tc->nsec;
+ tc->nsec = nsec;
+
+ return nsec;
+}
+EXPORT_SYMBOL(timecounter_read);
+
+u64 timecounter_cyc2time(struct timecounter *tc,
+ cycle_t cycle_tstamp)
+{
+ u64 cycle_delta = (cycle_tstamp - tc->cycle_last) & tc->cc->mask;
+ u64 nsec;
+
+ /*
+ * Instead of always treating cycle_tstamp as more recent
+ * than tc->cycle_last, detect when it is too far in the
+ * future and treat it as old time stamp instead.
+ */
+ if (cycle_delta > tc->cc->mask / 2) {
+ cycle_delta = (tc->cycle_last - cycle_tstamp) & tc->cc->mask;
+ nsec = tc->nsec - cyclecounter_cyc2ns(tc->cc, cycle_delta);
+ } else {
+ nsec = cyclecounter_cyc2ns(tc->cc, cycle_delta) + tc->nsec;
+ }
+
+ return nsec;
+}
+EXPORT_SYMBOL(timecounter_cyc2time);
+
/* XXX - Would like a better way for initializing curr_clocksource */
extern struct clocksource clocksource_jiffies;

--
1.5.5.3

2008-12-15 14:58:54

by Patrick Ohly

Subject: [RFC PATCH 08/12] igb: stub support for SIOCSHWTSTAMP


Signed-off-by: Patrick Ohly <[email protected]>
---
drivers/net/igb/igb_main.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 25df7c9..e9ad560 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -34,6 +34,7 @@
#include <linux/ipv6.h>
#include <net/checksum.h>
#include <net/ip6_checksum.h>
+#include <linux/net_tstamp.h>
#include <linux/mii.h>
#include <linux/ethtool.h>
#include <linux/if_vlan.h>
@@ -4121,6 +4122,33 @@ static int igb_mii_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
}

/**
+ * igb_hwtstamp_ioctl - control hardware time stamping
+ * @netdev:
+ * @ifreq:
+ * @cmd:
+ *
+ * Currently cannot enable any kind of hardware time stamping, but
+ * supports SIOCSHWTSTAMP in general.
+ **/
+static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
+{
+ struct hwtstamp_config config;
+
+ if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
+ return -EFAULT;
+
+ /* reserved for future extensions */
+ if (config.flags)
+ return -EINVAL;
+
+ if (config.tx_type == HWTSTAMP_TX_OFF &&
+ config.rx_filter == HWTSTAMP_FILTER_NONE)
+ return 0;
+
+ return -ERANGE;
+}
+
+/**
* igb_ioctl -
* @netdev:
* @ifreq:
@@ -4133,6 +4161,8 @@ static int igb_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
case SIOCGMIIREG:
case SIOCSMIIREG:
return igb_mii_ioctl(netdev, ifr, cmd);
+ case SIOCSHWTSTAMP:
+ return igb_hwtstamp_ioctl(netdev, ifr, cmd);
default:
return -EOPNOTSUPP;
}
--
1.5.5.3

2008-12-15 14:58:39

by Patrick Ohly

Subject: [RFC PATCH 12/12] igb: use clocksync to implement hardware time stamping

Both TX and RX hardware time stamping are implemented. Due to
hardware limitations it is not possible to verify reliably which
packet was time stamped when multiple were pending for sending; this
could be solved by only allowing one packet marked for hardware time
stamping into the queue (not implemented yet).
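
One way the "only one time stamped packet in the queue" restriction
could look is sketched below. This is not part of the patch: the names
are hypothetical, and it uses C11 atomics in user space, whereas driver
code would rely on the kernel's own primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Set while a packet marked for HW time stamping is in flight. */
static atomic_flag tstamp_slot = ATOMIC_FLAG_INIT;

/*
 * Called on the transmit path for a packet that requests a
 * hardware time stamp. Returns true if the packet may be marked,
 * false if another marked packet is still pending.
 */
static inline bool tstamp_slot_try_claim(void)
{
	return !atomic_flag_test_and_set(&tstamp_slot);
}

/*
 * Called from TX cleanup once the time stamp of the pending
 * packet has been read (or the packet has been dropped).
 */
static inline void tstamp_slot_release(void)
{
	atomic_flag_clear(&tstamp_slot);
}
```

A packet that fails to claim the slot could either be sent without a
hardware time stamp or be held back. Which of the two is preferable is
a policy question that the sketch leaves open.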

RX time stamping relies on the flag in the packet descriptor which
marks packets that were time stamped. In "all packet" mode this flag
is not set. TODO: also support that mode (even though it'll suffer
from race conditions).

Allocation of RX buffers is not optimal yet: the extra space for
hardware time stamps is always allocated. Either this should only
be done when HW time stamping is enabled (implies reallocation of
buffers) or packets with HW time stamps should be copied into a
larger buffer (implies higher overhead for those packets).

Signed-off-by: Patrick Ohly <[email protected]>
---
drivers/net/igb/e1000_82575.h | 1 +
drivers/net/igb/e1000_defines.h | 1 +
drivers/net/igb/e1000_regs.h | 40 ++++++
drivers/net/igb/igb.h | 4 +
drivers/net/igb/igb_main.c | 275 +++++++++++++++++++++++++++++++++++++--
5 files changed, 312 insertions(+), 9 deletions(-)

diff --git a/drivers/net/igb/e1000_82575.h b/drivers/net/igb/e1000_82575.h
index c1928b5..dd32a6f 100644
--- a/drivers/net/igb/e1000_82575.h
+++ b/drivers/net/igb/e1000_82575.h
@@ -116,6 +116,7 @@ union e1000_adv_tx_desc {
};

/* Adv Transmit Descriptor Config Masks */
+#define E1000_ADVTXD_MAC_TSTAMP 0x00080000 /* IEEE1588 Timestamp packet */
#define E1000_ADVTXD_DTYP_CTXT 0x00200000 /* Advanced Context Descriptor */
#define E1000_ADVTXD_DTYP_DATA 0x00300000 /* Advanced Data Descriptor */
#define E1000_ADVTXD_DCMD_IFCS 0x02000000 /* Insert FCS (Ethernet CRC) */
diff --git a/drivers/net/igb/e1000_defines.h b/drivers/net/igb/e1000_defines.h
index c5fe784..587f424 100644
--- a/drivers/net/igb/e1000_defines.h
+++ b/drivers/net/igb/e1000_defines.h
@@ -104,6 +104,7 @@
#define E1000_RXD_STAT_UDPCS 0x10 /* UDP xsum calculated */
#define E1000_RXD_STAT_TCPCS 0x20 /* TCP xsum calculated */
#define E1000_RXD_STAT_DYNINT 0x800 /* Pkt caused INT via DYNINT */
+#define E1000_RXD_STAT_TS 0x10000 /* Pkt was time stamped */
#define E1000_RXD_ERR_CE 0x01 /* CRC Error */
#define E1000_RXD_ERR_SE 0x02 /* Symbol Error */
#define E1000_RXD_ERR_SEQ 0x04 /* Sequence Error */
diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index d225601..215d4d6 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -78,9 +78,37 @@

/* IEEE 1588 TIMESYNCH */
#define E1000_TSYNCTXCTL 0x0B614
+#define E1000_TSYNCTXCTL_VALID (1<<0)
+#define E1000_TSYNCTXCTL_ENABLED (1<<4)
#define E1000_TSYNCRXCTL 0x0B620
+#define E1000_TSYNCRXCTL_VALID (1<<0)
+#define E1000_TSYNCRXCTL_ENABLED (1<<4)
+enum {
+ E1000_TSYNCRXCTL_TYPE_L2_V2 = 0,
+ E1000_TSYNCRXCTL_TYPE_L4_V1 = (1<<1),
+ E1000_TSYNCRXCTL_TYPE_L2_L4_V2 = (1<<2),
+ E1000_TSYNCRXCTL_TYPE_ALL = (1<<3),
+ E1000_TSYNCRXCTL_TYPE_EVENT_V2 = (1<<3) | (1<<1),
+};
#define E1000_TSYNCRXCFG 0x05F50
+enum {
+ E1000_TSYNCRXCFG_PTP_V1_SYNC_MESSAGE = 0<<0,
+ E1000_TSYNCRXCFG_PTP_V1_DELAY_REQ_MESSAGE = 1<<0,
+ E1000_TSYNCRXCFG_PTP_V1_FOLLOWUP_MESSAGE = 2<<0,
+ E1000_TSYNCRXCFG_PTP_V1_DELAY_RESP_MESSAGE = 3<<0,
+ E1000_TSYNCRXCFG_PTP_V1_MANAGEMENT_MESSAGE = 4<<0,

+ E1000_TSYNCRXCFG_PTP_V2_SYNC_MESSAGE = 0<<8,
+ E1000_TSYNCRXCFG_PTP_V2_DELAY_REQ_MESSAGE = 1<<8,
+ E1000_TSYNCRXCFG_PTP_V2_PATH_DELAY_REQ_MESSAGE = 2<<8,
+ E1000_TSYNCRXCFG_PTP_V2_PATH_DELAY_RESP_MESSAGE = 3<<8,
+ E1000_TSYNCRXCFG_PTP_V2_FOLLOWUP_MESSAGE = 8<<8,
+ E1000_TSYNCRXCFG_PTP_V2_DELAY_RESP_MESSAGE = 9<<8,
+ E1000_TSYNCRXCFG_PTP_V2_PATH_DELAY_FOLLOWUP_MESSAGE = 0xA<<8,
+ E1000_TSYNCRXCFG_PTP_V2_ANNOUNCE_MESSAGE = 0xB<<8,
+ E1000_TSYNCRXCFG_PTP_V2_SIGNALLING_MESSAGE = 0xC<<8,
+ E1000_TSYNCRXCFG_PTP_V2_MANAGEMENT_MESSAGE = 0xD<<8,
+};
#define E1000_SYSTIML 0x0B600
#define E1000_SYSTIMH 0x0B604
#define E1000_TIMINCA 0x0B608
@@ -103,6 +131,18 @@
#define E1000_ETQF6 0x05CC8
#define E1000_ETQF7 0x05CCC

+/* Filtering Registers */
+#define E1000_SAQF(_n) (0x5980 + 4 * (_n))
+#define E1000_DAQF(_n) (0x59A0 + 4 * (_n))
+#define E1000_SPQF(_n) (0x59C0 + 4 * (_n))
+#define E1000_FTQF(_n) (0x59E0 + 4 * (_n))
+#define E1000_SAQF0 E1000_SAQF(0)
+#define E1000_DAQF0 E1000_DAQF(0)
+#define E1000_SPQF0 E1000_SPQF(0)
+#define E1000_FTQF0 E1000_FTQF(0)
+#define E1000_SYNQF(_n) (0x055FC + (4 * (_n))) /* SYN Packet Queue Fltr */
+#define E1000_ETQF(_n) (0x05CB0 + (4 * (_n))) /* EType Queue Fltr */
+
/* Split and Replication RX Control - RW */
/*
* Convenience macros
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 2db4c64..65863d0 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -35,6 +35,8 @@
#include "e1000_82575.h"

#include <linux/clocksource.h>
+#include <linux/clocksync.h>
+#include <linux/net_tstamp.h>

struct igb_adapter;

@@ -266,6 +268,8 @@ struct igb_adapter {
struct net_device_stats net_stats;
struct cyclecounter cycles;
struct timecounter clock;
+ struct clocksync sync;
+ struct hwtstamp_config hwtstamp_config;

/* structs defined in e1000_hw.h */
struct e1000_hw hw;
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index c2b6da6..f4dc7b8 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -252,7 +252,8 @@ static char *igb_get_time_str(struct igb_adapter *adapter,

delta = timespec_sub(nic, sys);

- sprintf(buffer, "NIC %ld.%09lus, SYS %ld.%09lus, NIC-SYS %lds + %09luns",
+ sprintf(buffer, "HW %llu, NIC %ld.%09lus, SYS %ld.%09lus, NIC-SYS %lds + %09luns",
+ hw,
(long)nic.tv_sec, nic.tv_nsec,
(long)sys.tv_sec, sys.tv_nsec,
(long)delta.tv_sec, delta.tv_nsec);
@@ -1382,6 +1383,18 @@ static int __devinit igb_probe(struct pci_dev *pdev,
wrfl();
timecounter_init(&adapter->clock, &adapter->cycles, ktime_to_ns(ktime_get_real()));

+ /*
+ * Synchronize our NIC clock against system wall clock. NIC
+ * time stamp reading requires ~3us per sample, each sample
+ * was pretty stable even under load => only require 10
+ * samples for each offset comparison.
+ */
+ memset(&adapter->sync, 0, sizeof(adapter->sync));
+ adapter->sync.clock = &adapter->clock;
+ adapter->sync.systime = ktime_get_real;
+ adapter->sync.num_samples = 10;
+ clocksync_update(&adapter->sync, 0);
+
#ifdef DEBUG
{
char buffer[160];
@@ -2744,6 +2757,7 @@ set_itr_now:
#define IGB_TX_FLAGS_VLAN 0x00000002
#define IGB_TX_FLAGS_TSO 0x00000004
#define IGB_TX_FLAGS_IPV4 0x00000008
+#define IGB_TX_FLAGS_TSTAMP 0x00000010
#define IGB_TX_FLAGS_VLAN_MASK 0xffff0000
#define IGB_TX_FLAGS_VLAN_SHIFT 16

@@ -2964,6 +2978,9 @@ static inline void igb_tx_queue_adv(struct igb_adapter *adapter,
if (tx_flags & IGB_TX_FLAGS_VLAN)
cmd_type_len |= E1000_ADVTXD_DCMD_VLE;

+ if (tx_flags & IGB_TX_FLAGS_TSTAMP)
+ cmd_type_len |= E1000_ADVTXD_MAC_TSTAMP;
+
if (tx_flags & IGB_TX_FLAGS_TSO) {
cmd_type_len |= E1000_ADVTXD_DCMD_TSE;

@@ -3054,6 +3071,7 @@ static int igb_xmit_frame_ring_adv(struct sk_buff *skb,
unsigned int len;
u8 hdr_len = 0;
int tso = 0;
+ union skb_shared_tx *shtx;

len = skb_headlen(skb);

@@ -3076,7 +3094,28 @@ static int igb_xmit_frame_ring_adv(struct sk_buff *skb,
/* this is a hard error */
return NETDEV_TX_BUSY;
}
- skb_orphan(skb);
+
+ /*
+ * TODO: check that there currently is no other packet with
+ * time stamping in the queue
+ *
+ * when doing time stamping, keep the connection to the socket
+ * a while longer, it is still needed by skb_tstamp_tx(), either
+ * in igb_clean_tx_irq() or
+ */
+ shtx = skb_tx(skb);
+ if (shtx && shtx->hardware) {
+ shtx->in_progress = 1;
+ tx_flags |= IGB_TX_FLAGS_TSTAMP;
+ } else if (!shtx) {
+ /*
+ * TODO: can this be solved in dev.c:dev_hard_start_xmit()?
+ * There are probably unmodified driver which do something
+ * like this and thus don't work in combination with
+ * SOF_TIMESTAMPING_TX_SOFTWARE.
+ */
+ skb_orphan(skb);
+ }

if (adapter->vlgrp && vlan_tx_tag_present(skb)) {
tx_flags |= IGB_TX_FLAGS_VLAN;
@@ -3760,6 +3799,8 @@ static bool igb_clean_tx_irq(struct igb_ring *tx_ring)

if (skb) {
unsigned int segs, bytecount;
+ union skb_shared_tx *shtx;
+
/* gso_segs is currently only valid for tcp */
segs = skb_shinfo(skb)->gso_segs ?: 1;
/* multiply data chunks by size of headers */
@@ -3767,6 +3808,35 @@ static bool igb_clean_tx_irq(struct igb_ring *tx_ring)
skb->len;
total_packets += segs;
total_bytes += bytecount;
+
+ /*
+ * if we were asked to do hardware
+ * stamping and such a time stamp is
+ * available, then it must have been
+ * for this one here because we only
+ * allow one such packet into the
+ * queue
+ */
+ shtx = skb_tx(skb);
+ if (shtx && shtx->hardware) {
+ u32 valid = rd32(E1000_TSYNCTXCTL) & E1000_TSYNCTXCTL_VALID;
+ if (valid) {
+ u64 regval = rd32(E1000_TXSTMPL);
+ u64 ns;
+ struct skb_shared_hwtstamps shhwtstamps;
+
+ memset(&shhwtstamps, 0, sizeof(shhwtstamps));
+ regval |= (u64)rd32(E1000_TXSTMPH) << 32;
+ ns = timecounter_cyc2time(&adapter->clock,
+ regval);
+ clocksync_update(&adapter->sync, ns);
+ shhwtstamps.hwtstamp = ns_to_ktime(ns);
+ shhwtstamps.syststamp =
+ clocksync_hw2sys(&adapter->sync, ns);
+ skb_tstamp_tx(skb, &shhwtstamps);
+ }
+ skb_orphan(skb);
+ }
}

igb_unmap_and_free_tx_resource(adapter, buffer_info);
@@ -3949,6 +4019,7 @@ static bool igb_clean_rx_irq_adv(struct igb_ring *rx_ring,
{
struct igb_adapter *adapter = rx_ring->adapter;
struct net_device *netdev = adapter->netdev;
+ struct e1000_hw *hw = &adapter->hw;
struct pci_dev *pdev = adapter->pdev;
union e1000_adv_rx_desc *rx_desc , *next_rxd;
struct igb_buffer *buffer_info , *next_buffer;
@@ -4040,6 +4111,50 @@ send_up:
goto next_desc;
}

+ /*
+ * If this bit is set, then the RX registers contain
+ * the time stamp. No other packet will be time
+ * stamped until we read these registers, so read the
+ * registers to make them available again. Because
+ * only one packet can be time stamped at a time, we
+ * know that the register values must belong to this
+ * one here and therefore we don't need to compare
+ * any of the additional attributes stored for it.
+ *
+ * If nothing went wrong, then it should have a
+ * skb_shared_tx that we can turn into a
+ * skb_shared_hwtstamps.
+ *
+ * TODO: can time stamping be triggered (thus locking
+ * the registers) without the packet reaching this point
+ * here? In that case RX time stamping would get stuck.
+ *
+ * TODO: in "time stamp all packets" mode this bit is
+ * not set. Need a global flag for this mode and then
+ * always read the registers. Cannot be done without
+ * a race condition.
+ */
+ if (staterr & E1000_RXD_STAT_TS) {
+ u64 regval;
+ u64 ns;
+ struct skb_shared_hwtstamps *shhwtstamps =
+ (struct skb_shared_hwtstamps *)skb_tx(skb);
+
+ WARN(!(rd32(E1000_TSYNCRXCTL) & E1000_TSYNCRXCTL_VALID),
+ "igb: no RX time stamp available for time stamped packet");
+ regval = rd32(E1000_RXSTMPL);
+ regval |= (u64)rd32(E1000_RXSTMPH) << 32;
+ if (shhwtstamps) {
+ ns = timecounter_cyc2time(&adapter->clock, regval);
+ clocksync_update(&adapter->sync, ns);
+ memset(shhwtstamps, 0, sizeof(*shhwtstamps));
+ shhwtstamps->hwtstamp = ns_to_ktime(ns);
+ shhwtstamps->syststamp = clocksync_hw2sys(&adapter->sync, ns);
+ skb->optional = (skb->optional & ~SKB_FLAGS_OPTIONAL_TX) |
+ SKB_FLAGS_OPTIONAL_HWTSTAMPS;
+ }
+ }
+
if (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) {
dev_kfree_skb_irq(skb);
goto next_desc;
@@ -4141,8 +4256,26 @@ static void igb_alloc_rx_buffers_adv(struct igb_ring *rx_ring,
else
bufsz = adapter->rx_buffer_len;
bufsz += NET_IP_ALIGN;
- skb = netdev_alloc_skb(netdev, bufsz);

+ /*
+ * Always allocate the extra space for hardware
+ * time stamps because even if hardware time stamping
+ * is off right now, at the time when the buffer is
+ * used it might be on.
+ *
+ * TODO: only allocate the extra space if
+ * needed and when hardware timestamping is
+ * enabled, reallocate the buffers without it.
+ *
+ * If only a few packets will get time stamped,
+ * then the extra space is passed through
+ * the kernel as empty skb_shared_tx (has the
+ * same size as skb_shared_hwtstamps) and thus
+ * wasted.
+ */
+ skb = netdev_alloc_skb_flags(netdev, bufsz,
+ 1 /* adapter->hwtstamp_config.rx_filter != HWTSTAMP_FILTER_NONE */ ?
+ SKB_FLAGS_OPTIONAL_TX : 0);
if (!skb) {
adapter->alloc_rx_buff_failed++;
goto no_buffers;
@@ -4233,12 +4366,32 @@ static int igb_mii_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
* @ifreq:
* @cmd:
*
- * Currently cannot enable any kind of hardware time stamping, but
- * supports SIOCSHWTSTAMP in general.
+ * Outgoing time stamping can be enabled and disabled. Play nice and
+ * disable it when requested, although it shouldn't cause any overhead
+ * when no packet needs it. At most one packet in the queue may be
+ * marked for time stamping, otherwise it would be impossible to tell
+ * for sure to which packet the hardware time stamp belongs.
+ *
+ * Incoming time stamping has to be configured via the hardware
+ * filters. Not all combinations are supported, in particular event
+ * type has to be specified. Matching the kind of event packet is
+ * not supported, with the exception of "all V2 events regardless of
+ * level 2 or 4".
+ *
**/
static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
{
+ struct igb_adapter *adapter = netdev_priv(netdev);
+ struct e1000_hw *hw = &adapter->hw;
struct hwtstamp_config config;
+ u32 tsync_tx_ctl_bit = E1000_TSYNCTXCTL_ENABLED;
+ u32 tsync_rx_ctl_bit = E1000_TSYNCRXCTL_ENABLED;
+ u32 tsync_rx_ctl_type = 0;
+ u32 tsync_rx_cfg = 0;
+ int is_l4 = 0;
+ int is_l2 = 0;
+ short port = 319; /* PTP */
+ u32 regval;

if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
return -EFAULT;
@@ -4247,11 +4400,115 @@ static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int
if (config.flags)
return -EINVAL;

- if (config.tx_type == HWTSTAMP_TX_OFF &&
- config.rx_filter == HWTSTAMP_FILTER_NONE)
- return 0;
+ switch (config.tx_type) {
+ case HWTSTAMP_TX_OFF:
+ tsync_tx_ctl_bit = 0;
+ break;
+ case HWTSTAMP_TX_ON:
+ tsync_tx_ctl_bit = E1000_TSYNCTXCTL_ENABLED;
+ break;
+ default:
+ return -ERANGE;
+ }
+
+ switch (config.rx_filter) {
+ case HWTSTAMP_FILTER_NONE:
+ tsync_rx_ctl_bit = 0;
+ break;
+ case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
+ case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
+ case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
+ case HWTSTAMP_FILTER_ALL:
+ /*
+ * register TSYNCRXCFG must be set, therefore it is not
+ * possible to time stamp both Sync and Delay_Req messages
+ * => fall back to time stamping all packets
+ */
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_ALL;
+ config.rx_filter = HWTSTAMP_FILTER_ALL;
+ break;
+ case HWTSTAMP_FILTER_PTP_V1_L4_SYNC:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L4_V1;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V1_SYNC_MESSAGE;
+ is_l4 = 1;
+ break;
+ case HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L4_V1;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V1_DELAY_REQ_MESSAGE;
+ is_l4 = 1;
+ break;
+ case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
+ case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L2_L4_V2;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V2_SYNC_MESSAGE;
+ is_l2 = 1;
+ is_l4 = 1;
+ config.rx_filter = HWTSTAMP_FILTER_SOME;
+ break;
+ case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
+ case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L2_L4_V2;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V2_DELAY_REQ_MESSAGE;
+ is_l2 = 1;
+ is_l4 = 1;
+ config.rx_filter = HWTSTAMP_FILTER_SOME;
+ break;
+ case HWTSTAMP_FILTER_PTP_V2_EVENT:
+ case HWTSTAMP_FILTER_PTP_V2_SYNC:
+ case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_EVENT_V2;
+ config.rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT;
+ is_l2 = 1;
+ break;
+ default:
+ return -ERANGE;
+ }
+
+ /* enable/disable TX */
+ regval = rd32(E1000_TSYNCTXCTL);
+ regval = (regval & ~E1000_TSYNCTXCTL_ENABLED) | tsync_tx_ctl_bit;
+ wr32(E1000_TSYNCTXCTL, regval);
+
+ /* enable/disable RX, define which PTP packets are time stamped */
+ regval = rd32(E1000_TSYNCRXCTL);
+ regval = (regval & ~E1000_TSYNCRXCTL_ENABLED) | tsync_rx_ctl_bit;
+ regval = (regval & ~0xE) | tsync_rx_ctl_type;
+ wr32(E1000_TSYNCRXCTL, regval);
+ wr32(E1000_TSYNCRXCFG, tsync_rx_cfg);
+
+ /*
+ * Ethertype Filter Queue Filter[0][15:0] = 0x88F7 (Ethertype to filter on)
+ * Ethertype Filter Queue Filter[0][26] = 0x1 (Enable filter)
+ * Ethertype Filter Queue Filter[0][30] = 0x1 (Enable Timestamping)
+ */
+ wr32(E1000_ETQF0, is_l2 ? 0x440088f7 : 0);
+
+ /* L4 Queue Filter[0]: only filter by source and destination port */
+ wr32(E1000_SPQF0, htons(port));
+ wr32(E1000_IMIREXT(0), is_l4 ?
+ ((1<<12) | (1<<19) /* bypass size and control flags */) : 0);
+ wr32(E1000_IMIR(0), is_l4 ?
+ (htons(port)
+ | (0<<16) /* immediate interrupt disabled */
+ | 0 /* (1<<17) bit cleared: do not bypass destination port check */)
+ : 0);
+ wr32(E1000_FTQF0, is_l4 ?
+ (0x11 /* UDP */
+ | (1<<15) /* VF not compared */
+ | (1<<27) /* Enable Timestamping */
+ | (7<<28) /* only source port filter enabled, source/target address and protocol masked */ )
+ : ( (1<<15) | (15<<28) /* all mask bits set = filter not enabled */));
+
+ wrfl();
+
+ adapter->hwtstamp_config = config;
+
+ /* clear TX/RX time stamp registers, just to be sure */
+ regval = rd32(E1000_TXSTMPH);
+ regval = rd32(E1000_RXSTMPH);

- return -ERANGE;
+ return copy_to_user(ifr->ifr_data, &config, sizeof(config)) ?
+ -EFAULT : 0;
}

/**
--
1.5.5.3

2008-12-15 14:59:22

by Patrick Ohly

Subject: [RFC PATCH 11/12] time sync: generic infrastructure to map between time stamps generated by a time counter and system time

Currently only mapping from time counter to system time is implemented.
The interface could have been made more versatile by not depending on a time counter,
but this wasn't done to avoid writing glue code elsewhere.

The method implemented here is the one used and analyzed under the name
"assisted PTP" in the LCI PTP paper:
http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf
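
For readers who want to try the mapping outside the kernel, the
transformation documented in clocksync.h can be sketched in plain C.
This is only an illustration under stated assumptions: struct
sync_state and hw_to_sys() are hypothetical user-space stand-ins for
struct clocksync and clocksync_hw2sys(), and all values are taken to
be nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point scale for skew, mirroring CLOCKSYNC_SKEW_RESOLUTION (2^30). */
#define SKEW_RESOLUTION ((int64_t)1 << 30)

/* Hypothetical user-space stand-in for the state kept in struct clocksync. */
struct sync_state {
	int64_t offset;       /* (system time - HW time) at last update, ns */
	int64_t skew;         /* drift per HW nanosecond, scaled by 2^30 */
	uint64_t last_update; /* HW time of the last offset measurement, ns */
};

/* Map a HW time stamp to system time: constant offset plus linear drift. */
static uint64_t hw_to_sys(const struct sync_state *s, uint64_t hw)
{
	int64_t elapsed;

	assert(s);
	/* signed difference, so stamps older than last_update also work */
	elapsed = (int64_t)(hw - s->last_update);

	return hw + (uint64_t)s->offset +
	       (uint64_t)(elapsed * s->skew / SKEW_RESOLUTION);
}
```

With offset = 1000, last_update = 5000 and a skew of 2^30 / 1024, a
stamp taken 2048 ns after the last update is corrected by 2 ns on top
of the offset.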

Signed-off-by: Patrick Ohly <[email protected]>
---
include/linux/clocksync.h | 85 +++++++++++++++++++++
kernel/time/Makefile | 2 +-
kernel/time/clocksync.c | 182 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 268 insertions(+), 1 deletions(-)
create mode 100644 include/linux/clocksync.h
create mode 100644 kernel/time/clocksync.c

diff --git a/include/linux/clocksync.h b/include/linux/clocksync.h
new file mode 100644
index 0000000..d2f93f8
--- /dev/null
+++ b/include/linux/clocksync.h
@@ -0,0 +1,85 @@
+/*
+ * Utility code which helps transforming between hardware time stamps
+ * generated by a clocksource and system time. The clocksource is
+ * assumed to return monotonically increasing time (but this code does
+ * its best to compensate if that is not the case) whereas system time
+ * may jump.
+ */
+#ifndef _LINUX_CLOCKSYNC_H
+#define _LINUX_CLOCKSYNC_H
+
+#include <linux/clocksource.h>
+#include <linux/ktime.h>
+
+/**
+ * struct clocksync - stores state and configuration for the two clocks
+ *
+ * Initialize to zero, then set clock, systime, num_samples.
+ *
+ * Transformation between HW time and system time is done with:
+ * HW time transformed = HW time + offset +
+ * (HW time - last_update) * skew / CLOCKSYNC_SKEW_RESOLUTION
+ *
+ * @clock: the source for HW time stamps (%timecounter_read)
+ * @systime: function returning current system time (ktime_get
+ * for monotonic time, or ktime_get_real for wall clock)
+ * @num_samples: number of times that HW time and system time are to
+ * be compared when determining their offset
+ * @offset: (system time - HW time) at the time of the last update
+ * @skew: average (system time - HW time) / delta HW time *
+ * CLOCKSYNC_SKEW_RESOLUTION
+ * @last_update: last HW time stamp when clock offset was measured
+ */
+struct clocksync {
+ struct timecounter *clock;
+ ktime_t (*systime)(void);
+ int num_samples;
+
+ s64 offset;
+ s64 skew;
+ u64 last_update;
+};
+
+/**
+ * clocksync_hw2sys - transform HW time stamp into corresponding system time
+ * @sync: context for clock sync
+ * @hwtstamp: the result of timecounter_read() or
+ * timecounter_cyc2time()
+ */
+extern ktime_t clocksync_hw2sys(struct clocksync *sync,
+ u64 hwtstamp);
+
+/**
+ * clocksync_offset - measure current (system time - HW time) offset
+ * @sync: context for clock sync
+ * @offset: average offset during sample period returned here
+ * @hwtstamp: average HW time during sample period returned here
+ *
+ * Returns number of samples used. Might be zero (= no result) in the
+ * unlikely case that system time was monotonically decreasing for all
+ * samples (= broken).
+ */
+extern int clocksync_offset(struct clocksync *sync,
+ s64 *offset,
+ u64 *hwtstamp);
+
+extern void __clocksync_update(struct clocksync *sync,
+ u64 hwtstamp);
+
+/**
+ * clocksync_update - update offset and skew by measuring current offset
+ * @sync: context for clock sync
+ * @hwtstamp: the result of timecounter_read() or
+ * timecounter_cyc2time(), pass zero to force update
+ *
+ * Updates are only done at most once per second.
+ */
+static inline void clocksync_update(struct clocksync *sync,
+ u64 hwtstamp)
+{
+ if (!hwtstamp ||
+ (s64)(hwtstamp - sync->last_update) >= NSEC_PER_SEC)
+ __clocksync_update(sync, hwtstamp);
+}
+
+#endif /* _LINUX_CLOCKSYNC_H */
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 905b0b5..6279fb0 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -1,4 +1,4 @@
-obj-y += timekeeping.o ntp.o clocksource.o jiffies.o timer_list.o
+obj-y += timekeeping.o ntp.o clocksource.o jiffies.o timer_list.o clocksync.o

obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD) += clockevents.o
obj-$(CONFIG_GENERIC_CLOCKEVENTS) += tick-common.o
diff --git a/kernel/time/clocksync.c b/kernel/time/clocksync.c
new file mode 100644
index 0000000..6b73089
--- /dev/null
+++ b/kernel/time/clocksync.c
@@ -0,0 +1,182 @@
+/*
+ * Utility code which helps transforming between hardware time stamps
+ * generated by a timecounter and system time.
+ *
+ * Copyright (C) 2008 Intel, Patrick Ohly ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/clocksync.h>
+#include <linux/module.h>
+
+/*
+ * fixed point arithmetic scale factor for skew
+ *
+ * Usually one would measure skew in ppb (parts per billion, 1e9), but
+ * using a power-of-two factor simplifies the math.
+ */
+#define CLOCKSYNC_SKEW_RESOLUTION (((s64)1)<<30)
+
+ktime_t clocksync_hw2sys(struct clocksync *sync,
+ u64 hwtstamp)
+{
+ u64 nsec;
+
+ nsec = hwtstamp + sync->offset;
+ nsec += (s64)(hwtstamp - sync->last_update) * sync->skew /
+ CLOCKSYNC_SKEW_RESOLUTION;
+
+ return ns_to_ktime(nsec);
+}
+EXPORT_SYMBOL(clocksync_hw2sys);
+
+int clocksync_offset(struct clocksync *sync,
+ s64 *offset,
+ u64 *hwtstamp)
+{
+ u64 starthw = 0, endhw = 0;
+ struct {
+ s64 offset;
+ s64 duration_sys;
+ } buffer[10], sample, *samples;
+ int counter = 0, i;
+ int used;
+ int index;
+ int num_samples = sync->num_samples;
+
+ if (num_samples > sizeof(buffer)/sizeof(buffer[0])) {
+ samples = kmalloc(sizeof(*samples) * num_samples, GFP_ATOMIC);
+ if (!samples) {
+ samples = buffer;
+ num_samples = sizeof(buffer)/sizeof(buffer[0]);
+ }
+ } else {
+ samples = buffer;
+ }
+
+ /* run until we have enough valid samples, but do not try forever */
+ i = 0;
+ counter = 0;
+ while (1) {
+ u64 ts;
+ ktime_t start, end;
+
+ start = sync->systime();
+ ts = timecounter_read(sync->clock);
+ end = sync->systime();
+
+ if (!i) {
+ starthw = ts;
+ }
+
+ /* ignore negative durations */
+ sample.duration_sys = ktime_to_ns(ktime_sub(end, start));
+ if (sample.duration_sys >= 0) {
+ /*
+ * assume symmetric delay to and from HW: average system time
+ * corresponds to measured HW time
+ */
+ sample.offset = ktime_to_ns(ktime_add(end, start)) / 2 -
+ ts;
+
+ /* simple insertion sort based on duration */
+ index = counter - 1;
+ while (index >= 0) {
+ if(samples[index].duration_sys < sample.duration_sys) {
+ break;
+ }
+ samples[index + 1] = samples[index];
+ index--;
+ }
+ samples[index + 1] = sample;
+ counter++;
+ }
+
+ i++;
+ if (counter >= num_samples || i >= 100000) {
+ endhw = ts;
+ break;
+ }
+ }
+
+ *hwtstamp = (endhw + starthw) / 2;
+
+ /* remove outliers by only using 75% of the samples */
+ used = counter * 3 / 4;
+ if (!used) {
+ used = counter;
+ }
+ if (used) {
+ /* calculate average */
+ s64 off = 0;
+ for (index = 0; index < used; index++) {
+ off += samples[index].offset;
+ }
+ off /= used;
+ *offset = off;
+ }
+
+ if (samples && samples != buffer)
+ kfree(samples);
+
+ return used;
+}
+EXPORT_SYMBOL(clocksync_offset);
+
+void __clocksync_update(struct clocksync *sync,
+ u64 hwtstamp)
+{
+ s64 offset;
+ u64 average_time;
+
+ if (!clocksync_offset(sync, &offset, &average_time))
+ return;
+
+ printk(KERN_DEBUG
+ "average offset: %lld\n", offset);
+
+ if (!sync->last_update) {
+ sync->last_update = average_time;
+ sync->offset = offset;
+ sync->skew = 0;
+ } else {
+ s64 delta_nsec = average_time - sync->last_update;
+
+ /* avoid division by negative or small deltas */
+ if (delta_nsec >= 10000) {
+ s64 delta_offset_nsec = offset - sync->offset;
+ s64 skew = delta_offset_nsec *
+ CLOCKSYNC_SKEW_RESOLUTION /
+ delta_nsec;
+
+ /*
+ * Calculate new overall skew as 4/16 the
+ * old value and 12/16 the new one. This is
+ * a rather arbitrary tradeoff between
+ * only using the latest measurement (0/16 and
+ * 16/16) and even more weight on past measurements.
+ */
+#define CLOCKSYNC_NEW_SKEW_PER_16 12
+ sync->skew =
+ ((16 - CLOCKSYNC_NEW_SKEW_PER_16) * sync->skew +
+ CLOCKSYNC_NEW_SKEW_PER_16 * skew) /
+ 16;
+ sync->last_update = average_time;
+ sync->offset = offset;
+ }
+ }
+}
+EXPORT_SYMBOL(__clocksync_update);
--
1.5.5.3

2008-12-15 14:59:45

by Patrick Ohly

Subject: [RFC PATCH 10/12] igb: access to NIC time

Adds the register definitions and code to read the time
register.
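
The read itself is just a matter of combining two 32-bit halves. A
minimal sketch, runnable in user space, assuming a hypothetical
accessor type reg_read_fn and a fake register file in place of rd32();
whether the hardware latches the high half when the low half is read
is not stated here, the order simply follows igb_read_clock():

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets as defined in e1000_regs.h by this patch. */
#define SYSTIML 0x0B600
#define SYSTIMH 0x0B604

/* Hypothetical accessor: returns the 32-bit value of one register. */
typedef uint32_t (*reg_read_fn)(uint32_t reg);

/* Fake register file so the helper can be exercised without hardware. */
static uint32_t fake_regs(uint32_t reg)
{
	switch (reg) {
	case SYSTIML:
		return 0x89abcdefu;
	case SYSTIMH:
		return 0x01234567u;
	default:
		return 0;
	}
}

/* Combine the low and high halves into one 64-bit cycle value. */
static uint64_t read_nic_time(reg_read_fn rd)
{
	uint64_t stamp;

	assert(rd);
	stamp = rd(SYSTIML);                  /* low half first */
	stamp |= (uint64_t)rd(SYSTIMH) << 32; /* then the high half */

	return stamp;
}
```

Calling read_nic_time(fake_regs) yields 0x0123456789abcdef for the
fake values above.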

Signed-off-by: Patrick Ohly <[email protected]>
---
drivers/net/igb/e1000_regs.h | 28 +++++++++++
drivers/net/igb/igb.h | 4 ++
drivers/net/igb/igb_main.c | 106 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index bdf5d83..d225601 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -75,6 +75,34 @@
#define E1000_FCRTH 0x02168 /* Flow Control Receive Threshold High - RW */
#define E1000_RDFPCQ(_n) (0x02430 + (0x4 * (_n)))
#define E1000_FCRTV 0x02460 /* Flow Control Refresh Timer Value - RW */
+
+/* IEEE 1588 TIMESYNCH */
+#define E1000_TSYNCTXCTL 0x0B614
+#define E1000_TSYNCRXCTL 0x0B620
+#define E1000_TSYNCRXCFG 0x05F50
+
+#define E1000_SYSTIML 0x0B600
+#define E1000_SYSTIMH 0x0B604
+#define E1000_TIMINCA 0x0B608
+
+#define E1000_RXMTRL 0x0B634
+#define E1000_RXSTMPL 0x0B624
+#define E1000_RXSTMPH 0x0B628
+#define E1000_RXSATRL 0x0B62C
+#define E1000_RXSATRH 0x0B630
+
+#define E1000_TXSTMPL 0x0B618
+#define E1000_TXSTMPH 0x0B61C
+
+#define E1000_ETQF0 0x05CB0
+#define E1000_ETQF1 0x05CB4
+#define E1000_ETQF2 0x05CB8
+#define E1000_ETQF3 0x05CBC
+#define E1000_ETQF4 0x05CC0
+#define E1000_ETQF5 0x05CC4
+#define E1000_ETQF6 0x05CC8
+#define E1000_ETQF7 0x05CCC
+
/* Split and Replication RX Control - RW */
/*
* Convenience macros
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 2121b8b..2db4c64 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -34,6 +34,8 @@
#include "e1000_mac.h"
#include "e1000_82575.h"

+#include <linux/clocksource.h>
+
struct igb_adapter;

#ifdef CONFIG_IGB_LRO
@@ -262,6 +264,8 @@ struct igb_adapter {
struct napi_struct napi;
struct pci_dev *pdev;
struct net_device_stats net_stats;
+ struct cyclecounter cycles;
+ struct timecounter clock;

/* structs defined in e1000_hw.h */
struct e1000_hw hw;
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index e9ad560..c2b6da6 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -179,6 +179,54 @@ MODULE_DESCRIPTION("Intel(R) Gigabit Ethernet Network Driver");
MODULE_LICENSE("GPL");
MODULE_VERSION(DRV_VERSION);

+/**
+ * Scale the NIC clock cycle by a large factor so that
+ * relatively small clock corrections can be added or
+ * subtracted at each clock tick. The drawbacks of a
+ * large factor are a) that the clock register overflows
+ * more quickly (not such a big deal) and b) that the
+ * increment per tick has to fit into 24 bits.
+ *
+ * Note that
+ * TIMINCA = IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS *
+ * IGB_TSYNC_SCALE
+ * TIMINCA += TIMINCA * adjustment [ppb] / 1e9
+ *
+ * The base scale factor is intentionally a power of two
+ * so that the division in %struct timecounter can be done with
+ * a shift.
+ */
+#define IGB_TSYNC_SHIFT (19)
+#define IGB_TSYNC_SCALE (1<<IGB_TSYNC_SHIFT)
+
+/**
+ * The duration of one clock cycle of the NIC.
+ *
+ * @todo This hard-coded value is part of the specification and might change
+ * in future hardware revisions. Add revision check.
+ */
+#define IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS 16
+
+#if (IGB_TSYNC_SCALE * IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS) >= (1<<24)
+# error IGB_TSYNC_SCALE and/or IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS are too large to fit into TIMINCA
+#endif
+
+/**
+ * igb_read_clock - read raw cycle counter (to be used by time counter)
+ */
+static cycle_t igb_read_clock(const struct cyclecounter *tc)
+{
+ struct igb_adapter *adapter =
+ container_of(tc, struct igb_adapter, cycles);
+ struct e1000_hw *hw = &adapter->hw;
+ u64 stamp;
+
+ stamp = rd32(E1000_SYSTIML);
+ stamp |= (u64)rd32(E1000_SYSTIMH) << 32ULL;
+
+ return stamp;
+}
+
#ifdef DEBUG
/**
* igb_get_hw_dev_name - return device name string
@@ -189,6 +237,28 @@ char *igb_get_hw_dev_name(struct e1000_hw *hw)
struct igb_adapter *adapter = hw->back;
return adapter->netdev->name;
}
+
+/**
+ * igb_get_time_str - format current NIC and system time as string
+ */
+static char *igb_get_time_str(struct igb_adapter *adapter,
+ char buffer[160])
+{
+ cycle_t hw = adapter->cycles.read(&adapter->cycles);
+ struct timespec nic = ns_to_timespec(timecounter_read(&adapter->clock));
+ struct timespec sys;
+ struct timespec delta;
+ getnstimeofday(&sys);
+
+ delta = timespec_sub(nic, sys);
+
+ sprintf(buffer, "NIC %ld.%09lus, SYS %ld.%09lus, NIC-SYS %lds + %09luns",
+ (long)nic.tv_sec, nic.tv_nsec,
+ (long)sys.tv_sec, sys.tv_nsec,
+ (long)delta.tv_sec, delta.tv_nsec);
+
+ return buffer;
+}
#endif

/**
@@ -1286,6 +1356,42 @@ static int __devinit igb_probe(struct pci_dev *pdev,
}
#endif

+ /*
+ * Initialize hardware timer: we keep it running just in case
+ * that some program needs it later on.
+ */
+ memset(&adapter->cycles, 0, sizeof(adapter->cycles));
+ adapter->cycles.read = igb_read_clock;
+ adapter->cycles.mask = CLOCKSOURCE_MASK(64);
+ adapter->cycles.mult = 1;
+ adapter->cycles.shift = IGB_TSYNC_SHIFT;
+ wr32(E1000_TIMINCA, (1<<24) | IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS * IGB_TSYNC_SCALE);
+#if 0
+ /*
+ * Avoid rollover while we initialize by resetting the time counter.
+ */
+ wr32(E1000_SYSTIML, 0x00000000);
+ wr32(E1000_SYSTIMH, 0x00000000);
+#else
+ /*
+ * Set registers so that rollover occurs soon to test this.
+ */
+ wr32(E1000_SYSTIML, 0x00000000);
+ wr32(E1000_SYSTIMH, 0xFF800000);
+#endif
+ wrfl();
+ timecounter_init(&adapter->clock, &adapter->cycles, ktime_to_ns(ktime_get_real()));
+
+#ifdef DEBUG
+ {
+ char buffer[160];
+ printk(KERN_DEBUG
+ "igb: %s: hw %p initialized timer\n",
+ igb_get_time_str(adapter, buffer),
+ &adapter->hw);
+ }
+#endif
+
dev_info(&pdev->dev, "Intel(R) Gigabit Ethernet Network Connection\n");
/* print bus type/speed/width info */
dev_info(&pdev->dev, "%s: (PCIe:%s:%s) %pM\n",
--
1.5.5.3

2008-12-15 15:00:05

by Patrick Ohly

Subject: [RFC PATCH 05/12] ip: support for TX timestamps on UDP and RAW sockets

Instructions for time stamping outgoing packets are taken from the
socket layer and later copied into the new skb.

Signed-off-by: Patrick Ohly <[email protected]>
---
Documentation/networking/timestamping.txt | 2 ++
include/net/ip.h | 1 +
net/can/raw.c | 14 +++++++++++---
net/ipv4/icmp.c | 2 ++
net/ipv4/ip_output.c | 12 ++++++++++--
net/ipv4/raw.c | 1 +
net/ipv4/udp.c | 4 ++++
7 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index a681a65..0e58b45 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -56,6 +56,8 @@ and including the link layer, the scm_timestamping control message and
a sock_extended_err control message with ee_errno==ENOMSG and
ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
bounced packet is ready for reading as far as select() is concerned.
+If the outgoing packet has to be fragmented, then only the first
+fragment is time stamped and returned to the sending socket.

All three values correspond to the same event in time, but were
generated in different ways. Each of these values may be empty (= all
diff --git a/include/net/ip.h b/include/net/ip.h
index 1086813..4ac7577 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -55,6 +55,7 @@ struct ipcm_cookie
__be32 addr;
int oif;
struct ip_options *opt;
+ union skb_shared_tx shtx;
};

#define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/net/can/raw.c b/net/can/raw.c
index 27aab63..1f2111d 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -618,6 +618,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
struct raw_sock *ro = raw_sk(sk);
struct sk_buff *skb;
struct net_device *dev;
+ union skb_shared_tx shtx;
int ifindex;
int err;

@@ -639,8 +640,14 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
if (!dev)
return -ENXIO;

- skb = sock_alloc_send_skb(sk, size, msg->msg_flags & MSG_DONTWAIT,
- &err);
+ err = sock_tx_timestamp(msg, sk, &shtx);
+ if (err < 0)
+ return err;
+
+ skb = sock_alloc_send_skb_flags(sk, size,
+ ((msg->msg_flags & MSG_DONTWAIT) ? SKB_FLAGS_NOBLOCK : 0) |
+ (shtx.flags ? SKB_FLAGS_OPTIONAL_TX : 0),
+ &err);
if (!skb)
goto put_dev;

@@ -649,7 +656,8 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
goto free_skb;
skb->dev = dev;
skb->sk = sk;
-
+ if (shtx.flags)
+ *skb_tx(skb) = shtx;
err = can_send(skb, ro->loopback);

dev_put(dev);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 705b33b..382800a 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -375,6 +375,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
inet->tos = ip_hdr(skb)->tos;
daddr = ipc.addr = rt->rt_src;
ipc.opt = NULL;
+ ipc.shtx.flags = 0;
if (icmp_param->replyopts.optlen) {
ipc.opt = &icmp_param->replyopts;
if (ipc.opt->srr)
@@ -532,6 +533,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
inet_sk(sk)->tos = tos;
ipc.addr = iph->saddr;
ipc.opt = &icmp_param.replyopts;
+ ipc.shtx.flags = 0;

{
struct flowi fl = {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 8ebe86d..ed92f0b 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -923,9 +923,11 @@ alloc_new_skb:
alloclen += rt->u.dst.trailer_len;

if (transhdrlen) {
- skb = sock_alloc_send_skb(sk,
+ skb = sock_alloc_send_skb_flags(sk,
alloclen + hh_len + 15,
- (flags & MSG_DONTWAIT), &err);
+ ((flags & MSG_DONTWAIT) ? SKB_FLAGS_NOBLOCK : 0) |
+ (ipc->shtx.flags ? SKB_FLAGS_OPTIONAL_TX : 0),
+ &err);
} else {
skb = NULL;
if (atomic_read(&sk->sk_wmem_alloc) <=
@@ -935,6 +937,9 @@ alloc_new_skb:
sk->sk_allocation);
if (unlikely(skb == NULL))
err = -ENOBUFS;
+ else
+ /* only the initial fragment is time stamped */
+ ipc->shtx.flags = 0;
}
if (skb == NULL)
goto error;
@@ -945,6 +950,8 @@ alloc_new_skb:
skb->ip_summed = csummode;
skb->csum = 0;
skb_reserve(skb, hh_len);
+ if (ipc->shtx.flags)
+ *skb_tx(skb) = ipc->shtx;

/*
* Find where to start putting bytes.
@@ -1364,6 +1371,7 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar

daddr = ipc.addr = rt->rt_src;
ipc.opt = NULL;
+ ipc.shtx.flags = 0;

if (replyopts.opt.optlen) {
ipc.opt = &replyopts.opt;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index dff8bc4..f774651 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -493,6 +493,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,

ipc.addr = inet->saddr;
ipc.opt = NULL;
+ ipc.shtx.flags = 0;
ipc.oif = sk->sk_bound_dev_if;

if (msg->msg_controllen) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cf5ab05..bbf1a6d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -573,6 +573,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
return -EOPNOTSUPP;

ipc.opt = NULL;
+ ipc.shtx.flags = 0;

if (up->pending) {
/*
@@ -620,6 +621,9 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
ipc.addr = inet->saddr;

ipc.oif = sk->sk_bound_dev_if;
+ err = sock_tx_timestamp(msg, sk, &ipc.shtx);
+ if (err)
+ return err;
if (msg->msg_controllen) {
err = ip_cmsg_send(sock_net(sk), msg, &ipc);
if (err)
--
1.5.5.3

2008-12-15 16:27:32

by john stultz

Subject: Re: [RFC PATCH 09/12] clocksource: allow usage independent of timekeeping.c

On Mon, 2008-12-15 at 15:54 +0100, Patrick Ohly wrote:
> So far struct clocksource acted as the interface between time/timekeeping.c
> and hardware. This patch generalizes the concept so that a similar
> interface can also be used in other contexts. For that it introduces
> new structures and related functions *without* touching the existing
> struct clocksource.
>
> The reasons for adding these new structures to clocksource.[ch] are
> * the APIs are clearly related
> * struct clocksource could be cleaned up to use the new structs
> * avoids proliferation of files with similar names (timesource.h?
> timecounter.h?)
>
> As outlined in the discussion with John Stultz, this patch adds
> * struct cyclecounter: stateless API to hardware which counts clock cycles
> * struct timecounter: stateful utility code built on a cyclecounter which
> provides a nanosecond counter
> * only the function to read the nanosecond counter; deltas are used internally
> and not exposed to users of timecounter
>
> The code does no locking of the shared state. It must be called at least
> as often as the cycle counter wraps around to detect these wrap arounds.
> Both are the responsibility of the timecounter user.
>
> Signed-off-by: Patrick Ohly <[email protected]>


Nice. The cyclecounter struct can work as a good base that I can shift
the clocksource bits over to as I clean that up.

We will probably want to split this out down the road, but for now it's
small enough and related enough that I think it's fine in the
clocksource.h/c.

Also since Magnus has been working on it, do enable/disable accessors
in the cyclecounter struct make sense for your hardware as well?

Also the corner cases on overflows (how we manage the state, should
reads be deferred for too long) will need to be addressed, but I guess
we can solve that when it becomes an issue. Just to be clear: none of
the hardware you're submitting this round has wrapping issues? Or is
that not the case?

Otherwise,
Acked-by: John Stultz <[email protected]>

thanks
-john


> ---
> include/linux/clocksource.h | 99 +++++++++++++++++++++++++++++++++++++++++++
> kernel/time/clocksource.c | 76 +++++++++++++++++++++++++++++++++
> 2 files changed, 175 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index f88d32f..d379189 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -22,8 +22,107 @@ typedef u64 cycle_t;
> struct clocksource;
>
> /**
> + * struct cyclecounter - hardware abstraction for a free running counter
> + * Provides completely state-free accessors to the underlying hardware.
> + * Depending on which hardware it reads, the cycle counter may wrap
> + * around quickly. Locking rules (if necessary) have to be defined
> + * by the implementor and user of specific instances of this API.
> + *
> + * @read: returns the current cycle value
> + * @mask: bitmask for two's complement
> + * subtraction of non 64 bit counters,
> + * see CLOCKSOURCE_MASK() helper macro
> + * @mult: cycle to nanosecond multiplier
> + * @shift: cycle to nanosecond divisor (power of two)
> + */
> +struct cyclecounter {
> + cycle_t (*read)(const struct cyclecounter *cc);
> + cycle_t mask;
> + u32 mult;
> + u32 shift;
> +};
> +
> +/**
> + * struct timecounter - layer above a %struct cyclecounter which counts nanoseconds
> + * Contains the state needed by timecounter_read() to detect
> + * cycle counter wrap around. Initialize with
> + * timecounter_init(). Also used to convert cycle counts into the
> + * corresponding nanosecond counts with timecounter_cyc2time(). Users
> + * of this code are responsible for initializing the underlying
> + * cycle counter hardware, locking issues and reading the time
> + * more often than the cycle counter wraps around. The nanosecond
> + * counter will only wrap around after ~585 years.
> + *
> + * @cc: the cycle counter used by this instance
> + * @cycle_last: most recent cycle counter value seen by timecounter_read()
> + * @nsec:
> + */
> +struct timecounter {
> + const struct cyclecounter *cc;
> + cycle_t cycle_last;
> + u64 nsec;
> +};
> +
> +/**
> + * cyclecounter_cyc2ns - converts cycle counter cycles to nanoseconds
> + * @cc: Pointer to cycle counter.
> + * @cycles: Cycles
> + *
> + * XXX - This could use some mult_lxl_ll() asm optimization. Same code
> + * as in cyc2ns, but with unsigned result.
> + */
> +static inline u64 cyclecounter_cyc2ns(const struct cyclecounter *cc, cycle_t cycles)
> +{
> + u64 ret = (u64)cycles;
> + ret = (ret * cc->mult) >> cc->shift;
> + return ret;
> +}
> +
> +/**
> + * timecounter_init - initialize a time counter
> + * @tc: Pointer to time counter which is to be initialized/reset
> + * @cc: A cycle counter, ready to be used.
> + * @start_tstamp: Arbitrary initial time stamp.
> + *
> + * After this call the current cycle register (roughly) corresponds to
> + * the initial time stamp. Every call to timecounter_read() increments
> + * the time stamp counter by the number of elapsed nanoseconds.
> + */
> +extern void timecounter_init(struct timecounter *tc,
> + const struct cyclecounter *cc,
> + u64 start_tstamp);
> +
> +/**
> + * timecounter_read - return nanoseconds elapsed since timecounter_init()
> + * plus the initial time stamp
> + * @tc: Pointer to time counter.
> + *
> + * In other words, keeps track of time since the same epoch as
> + * the function which generated the initial time stamp.
> + */
> +extern u64 timecounter_read(struct timecounter *tc);
> +
> +/**
> + * timecounter_cyc2time - convert a cycle counter to same
> + * time base as values returned by
> + * timecounter_read()
> + * @tc: Pointer to time counter.
> + * @cycle_tstamp: a value returned by tc->cc->read()
> + *
> + * Cycle counts are converted correctly as long as they
> + * fall into the interval [-1/2 max cycle count, +1/2 max cycle count],
> + * with "max cycle count" == tc->cc->mask+1.
> + *
> + * This allows conversion of cycle counter values which were generated
> + * in the past.
> + */
> +extern u64 timecounter_cyc2time(struct timecounter *tc,
> + cycle_t cycle_tstamp);
> +
> +/**
> * struct clocksource - hardware abstraction for a free running counter
> * Provides mostly state-free accessors to the underlying hardware.
> + * This is the structure used for system time.
> *
> * @name: ptr to clocksource name
> * @list: list head for registration
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 9ed2eec..0d7a2cb 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -31,6 +31,82 @@
> #include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
> #include <linux/tick.h>
>
> +void timecounter_init(struct timecounter *tc,
> + const struct cyclecounter *cc,
> + u64 start_tstamp)
> +{
> + tc->cc = cc;
> + tc->cycle_last = cc->read(cc);
> + tc->nsec = start_tstamp;
> +}
> +EXPORT_SYMBOL(timecounter_init);
> +
> +/**
> + * timecounter_read_delta - get nanoseconds since last call of this function
> + * @tc: Pointer to time counter
> + *
> + * When the underlying cycle counter runs over, this will be handled
> + * correctly as long as it does not run over more than once between
> + * calls.
> + *
> + * The first call to this function for a new time counter initializes
> + * the time tracking and returns bogus results.
> + */
> +static u64 timecounter_read_delta(struct timecounter *tc)
> +{
> + cycle_t cycle_now, cycle_delta;
> + u64 ns_offset;
> +
> + /* read cycle counter: */
> + cycle_now = tc->cc->read(tc->cc);
> +
> + /* calculate the delta since the last timecounter_read_delta(): */
> + cycle_delta = (cycle_now - tc->cycle_last) & tc->cc->mask;
> +
> + /* convert to nanoseconds: */
> + ns_offset = cyclecounter_cyc2ns(tc->cc, cycle_delta);
> +
> + /* update time stamp of timecounter_read_delta() call: */
> + tc->cycle_last = cycle_now;
> +
> + return ns_offset;
> +}
> +
> +u64 timecounter_read(struct timecounter *tc)
> +{
> + u64 nsec;
> +
> + /* increment time by nanoseconds since last call */
> + nsec = timecounter_read_delta(tc);
> + nsec += tc->nsec;
> + tc->nsec = nsec;
> +
> + return nsec;
> +}
> +EXPORT_SYMBOL(timecounter_read);
> +
> +u64 timecounter_cyc2time(struct timecounter *tc,
> + cycle_t cycle_tstamp)
> +{
> + u64 cycle_delta = (cycle_tstamp - tc->cycle_last) & tc->cc->mask;
> + u64 nsec;
> +
> + /*
> + * Instead of always treating cycle_tstamp as more recent
> + * than tc->cycle_last, detect when it is too far in the
> + * future and treat it as old time stamp instead.
> + */
> + if (cycle_delta > tc->cc->mask / 2) {
> + cycle_delta = (tc->cycle_last - cycle_tstamp) & tc->cc->mask;
> + nsec = tc->nsec - cyclecounter_cyc2ns(tc->cc, cycle_delta);
> + } else {
> + nsec = cyclecounter_cyc2ns(tc->cc, cycle_delta) + tc->nsec;
> + }
> +
> + return nsec;
> +}
> +EXPORT_SYMBOL(timecounter_cyc2time);
> +
> /* XXX - Would like a better way for initializing curr_clocksource */
> extern struct clocksource clocksource_jiffies;
>
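
For reference, the past/future decision in timecounter_cyc2time() can be tried out in user space. What follows is only a sketch under simplifying assumptions: the names sketch_counter and sketch_cyc2time are made up for illustration, one cycle is taken to be exactly one nanosecond (so cyclecounter_cyc2ns() drops out), and the counter is only 16 bits wide to keep the numbers small.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct timecounter; 1 cycle == 1 ns (assumption). */
struct sketch_counter {
	uint64_t mask;        /* e.g. 0xffff for a 16 bit counter */
	uint64_t cycle_last;  /* counter value at the last read */
	uint64_t nsec;        /* time in ns that corresponds to cycle_last */
};

/* Same decision as in the patch: stamps more than half the range ahead are old. */
uint64_t sketch_cyc2time(const struct sketch_counter *tc, uint64_t cycle_tstamp)
{
	uint64_t delta = (cycle_tstamp - tc->cycle_last) & tc->mask;

	if (delta > tc->mask / 2) {
		/* too far in the future: measure backwards from cycle_last instead */
		delta = (tc->cycle_last - cycle_tstamp) & tc->mask;
		return tc->nsec - delta;
	}
	return tc->nsec + delta;
}
```

With mask 0xffff, cycle_last 0x0010 and nsec 1000000, a stamp of 0x0020 lands 16 ns after the last read, while 0xfff0 is more than half the range ahead and is therefore placed 32 ns before it.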

2008-12-15 16:45:49

by Patrick Ohly

Subject: Re: [RFC PATCH 09/12] clocksource: allow usage independent of timekeeping.c

On Mon, 2008-12-15 at 16:26 +0000, John Stultz wrote:
[cyclecounter/timecounter]
> Nice. The cyclecounter struct can work as a good base that I can shift
> the clocksource bits over to as I clean that up.
>
> We will probably want to split this out down the road, but for now it's
> small enough and related enough that I think it's fine in the
> clocksource.h/c.
>
> Also since Magnus has been working on it, does enable/disable accessors
> in the cyclecounter struct make sense for your hardware as well?

I don't think so. The usage model of the cyclecounter is that the
hardware is owned by someone who initializes and controls it, including
enable/disable. The abstract API with the read method is just there so
that common utility code can access the hardware in a uniform way.

For example, the igb driver owns and uses the NIC time register.
Disabling the NIC timer should be done together with disabling the NIC.
This is different from traditional clocksources which are independent
and controlled by the timing subsystem.

> Also the corner cases on overflows (how we manage the state, should
> reads be deferred for too long) will need to be addressed, but I guess
> we can solve that when it becomes an issue. Just to be clear: none of
> the hardware you're submitting this round has wrapping issues?

It has 64 bit registers, so there is indeed no wrapping issue.
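
For hardware with narrower registers the masked subtraction in timecounter_read_delta() is what absorbs a single wrap. A minimal sketch, assuming a 32 bit counter; sketch_timecounter and sketch_read_delta are hypothetical names, and the cycle value is passed in instead of being read through cc->read():

```c
#include <stdint.h>

/* Hypothetical reduced timecounter: only the fields needed for the delta. */
struct sketch_timecounter {
	uint64_t mask;        /* 0xffffffff for a 32 bit counter */
	uint64_t cycle_last;  /* value seen by the previous call */
};

/* Cycles elapsed since the previous call; correct across at most one wrap. */
uint64_t sketch_read_delta(struct sketch_timecounter *tc, uint64_t cycle_now)
{
	/* unsigned subtraction plus mask yields the distance modulo 2^bits */
	uint64_t delta = (cycle_now - tc->cycle_last) & tc->mask;

	tc->cycle_last = cycle_now;
	return delta;
}
```

Going from 0xfffffff0 to 0x00000010 yields 0x20 cycles, as expected. A counter that wraps all the way around to the same value between two calls yields 0, which is the "more than once" case that the comment in the patch excludes.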

> Otherwise,
> Acked-by: John Stultz <[email protected]>

Thanks, will add that.

Can you (or someone else) also look at clocksync.[ch]? David wanted to
have that independently reviewed, too, before including it in
netdev-next.

2008-12-15 21:53:44

by Herbert Xu

Subject: Re: [RFC PATCH 02/12] net: infrastructure for hardware time stamping

Patrick Ohly <[email protected]> wrote:
> @@ -305,6 +406,8 @@ struct sk_buff {
> ipvs_property:1,
> peeked:1,
> nf_trace:1;
> + /* not all of the bits in optional are used */
> + __u8 optional;
> __be16 protocol;

You do realise that this is going to grow the sk_buff by at least
4 bytes and not 1?

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2008-12-16 07:57:21

by Patrick Ohly

Subject: Re: [RFC PATCH 02/12] net: infrastructure for hardware time stamping

On Mon, 2008-12-15 at 21:53 +0000, Herbert Xu wrote:
> Patrick Ohly <[email protected]> wrote:
> > @@ -305,6 +406,8 @@ struct sk_buff {
> > ipvs_property:1,
> > peeked:1,
> > nf_trace:1;
> > + /* not all of the bits in optional are used */
> > + __u8 optional;
> > __be16 protocol;
>
> You do realise that this is going to grow the sk_buff by at least
> 4 bytes and not 1?

Yes. I should have been more explicit about that when talking about
"adding one byte". At least it's better than adding 8 bytes of
additional data, as in the previous patch.

I haven't checked it, but was told that sk_buff is already tightly
packed. It didn't look like there was a better place to put the byte
either.
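
The effect is easy to reproduce outside the kernel. The following is only an illustration with a made-up miniature layout (sketch_before and sketch_after are hypothetical and not the real struct sk_buff); it assumes a typical ABI where 16 bit members are 2-byte aligned and 32 bit members are 4-byte aligned:

```c
#include <stddef.h>
#include <stdint.h>

/* Tightly packed stand-in: 4 + 2 + 2 + 4 bytes, no padding. */
struct sketch_before {
	uint32_t flags;      /* stands in for the packed bit fields */
	uint16_t protocol;
	uint16_t queue;
	uint32_t mark;
};

/* Same layout with a one-byte member squeezed in after the flags. */
struct sketch_after {
	uint32_t flags;
	uint8_t  optional;   /* the new byte */
	uint16_t protocol;   /* one byte of padding is inserted in front */
	uint16_t queue;
	uint32_t mark;       /* two more bytes of padding are inserted in front */
};

/* Growth in bytes caused by the single new member. */
size_t sketch_growth(void)
{
	return sizeof(struct sketch_after) - sizeof(struct sketch_before);
}
```

Under these assumptions sketch_growth() returns 4: one byte for the member itself and three bytes of alignment padding.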

--
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

2009-01-16 10:36:23

by Patrick Ohly

Subject: Re: hardware time stamping with optional structs in data area

On Mon, 2008-12-15 at 15:54 +0100, Patrick Ohly wrote:
> This is the third iteration of a patch series which adds a user space
> API for hardware time stamping of network packets and the
> infrastructure that implements that API. The igb driver is used as
> example for how a network driver can use this new infrastructure.

A new year, another month => time to bring this up once more. The latest
revision of the patch series hasn't brought up further requests for
improvement. There still is growing demand for this feature.

David, do you think it is ready to get included?

John acknowledged the changes to clocksource.h. clocksync is independent
of everything else and not active unless called; if driver developers
don't find it useful, it can be replaced/removed. That leaves the API,
which hasn't triggered any comments so far. As proof that it works as
intended I adapted PTPd to use it:
http://github.com/pohly/ptpd/tree/master

If further discussion of the API is necessary to get this into mainline,
perhaps including the patches in netdev-next would help to encourage
that discussion?

Is there anything I can do myself, like rebasing against the latest
netdev-next?

Bye, Patrick

2009-01-16 19:00:45

by David Miller

Subject: Re: hardware time stamping with optional structs in data area

From: Patrick Ohly <[email protected]>
Date: Fri, 16 Jan 2009 11:36:04 +0100

> David, do you think it is ready to get included?

Resubmit the patch set before asking such questions.

2009-01-21 10:07:54

by Patrick Ohly

Subject: Re: hardware time stamping with optional structs in data area

On Fri, 2009-01-16 at 21:00 +0200, David Miller wrote:
> From: Patrick Ohly <[email protected]>
> Date: Fri, 16 Jan 2009 11:36:04 +0100
>
> > David, do you think it is ready to get included?
>
> Resubmit the patch set before asking such questions.

Okay. I didn't want to spam the lists with just rebased patches if there
were more fundamental objections against the current approach. I hope
that this is no longer the case, so I'll post the current, rebased patch
set as follow-up to this mail (because the description in the original
mail of this thread still applies).

I personally consider the core infrastructure patches ready. If there
are further comments I'd be happy to work on those, of course. The igb
driver patches are more experimental, please don't include them.

I rebased the patches against net-next-2.6 as of today.

I tested them with the modified PTPd with and without hardware support
on x86. With a 64 bit kernel and user space both work. With 32 bit user
space on a 64 bit kernel software-only time stamping works (thanks to
the socket's compatibility layer), hardware support doesn't: the ifreq
is passed to the right device driver, but the data pointer from a 32 bit
process is not interpreted correctly by a 64 bit driver. If there is a
way to handle this then please let me know - I didn't see how a device
driver could distinguish between a 32 and 64 bit user process.

With a 32 bit kernel software time stamping works. I couldn't test
hardware support because I couldn't get the igb driver to work. Even
without any of the patches it failed to transmit packets (tested with
net-next-2.6 sources and original Ubuntu 8.10 installation). I need to
look into this problem further, but don't want to hold up the review of
the patches.

--
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

2009-01-21 10:10:55

by Patrick Ohly

Subject: [PATCH NET-NEXT 01/12] net: new user space API for time stamping of incoming and outgoing packets

User space can request hardware and/or software time stamping.
Reporting of the result(s) via a new control message is enabled
separately for each field in the message because some of the
fields may require additional computation and thus cause overhead.
User space can tell the different kinds of time stamps apart
and choose what suits its needs.

When a TX timestamp operation is requested, the TX skb will be cloned
and the clone will be time stamped (in hardware or software) and added
to the socket error queue of the skb, if the skb has a socket
associated with it.

The actual TX timestamp will reach userspace as a RX timestamp on the
cloned packet. If timestamping is requested and no timestamping is
done in the device driver (potentially this may use hardware
timestamping), it will be done in software after the device's
hard_start_xmit routine.
---
Documentation/networking/timestamping.txt | 178 ++++++++
Documentation/networking/timestamping/.gitignore | 1 +
Documentation/networking/timestamping/Makefile | 3 +
.../networking/timestamping/timestamping.c | 469 ++++++++++++++++++++
arch/alpha/include/asm/socket.h | 3 +
arch/arm/include/asm/socket.h | 3 +
arch/avr32/include/asm/socket.h | 3 +
arch/blackfin/include/asm/socket.h | 3 +
arch/cris/include/asm/socket.h | 3 +
arch/h8300/include/asm/socket.h | 3 +
arch/ia64/include/asm/socket.h | 3 +
arch/mips/include/asm/socket.h | 3 +
arch/parisc/include/asm/socket.h | 3 +
arch/powerpc/include/asm/socket.h | 3 +
arch/s390/include/asm/socket.h | 3 +
arch/sh/include/asm/socket.h | 3 +
arch/sparc/include/asm/socket.h | 3 +
arch/x86/include/asm/socket.h | 3 +
include/asm-frv/socket.h | 3 +
include/asm-m32r/socket.h | 3 +
include/asm-m68k/socket.h | 3 +
include/asm-mn10300/socket.h | 3 +
include/asm-xtensa/socket.h | 3 +
include/linux/errqueue.h | 1 +
include/linux/net_tstamp.h | 112 +++++
include/linux/sockios.h | 3 +
26 files changed, 824 insertions(+), 0 deletions(-)
create mode 100644 Documentation/networking/timestamping.txt
create mode 100644 Documentation/networking/timestamping/.gitignore
create mode 100644 Documentation/networking/timestamping/Makefile
create mode 100644 Documentation/networking/timestamping/timestamping.c
create mode 100644 include/linux/net_tstamp.h
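
As a note on "choose what suits its needs": an application that wants a value comparable to system time might prefer the transformed hardware stamp and fall back to the software stamp. The sketch below is one possible policy, not part of the patch. The struct mirrors the three-field layout given in timestamping.txt; sketch_scm_timestamping, sketch_ts_empty and sketch_pick_system_based are hypothetical names.

```c
#include <stddef.h>
#include <time.h>

/* Mirrors the control message layout described in timestamping.txt. */
struct sketch_scm_timestamping {
	struct timespec systime;      /* software stamp, system time */
	struct timespec hwtimetrans;  /* hardware stamp, system time base */
	struct timespec hwtimeraw;    /* hardware stamp, raw */
};

/* An all-zero value means "not available". */
static int sketch_ts_empty(const struct timespec *ts)
{
	return ts->tv_sec == 0 && ts->tv_nsec == 0;
}

/*
 * Pick a stamp that is comparable to system time: transformed hardware
 * time first, then the software stamp. The raw hardware stamp is never
 * returned because its relation to system time is unknown.
 */
const struct timespec *
sketch_pick_system_based(const struct sketch_scm_timestamping *t)
{
	if (!sketch_ts_empty(&t->hwtimetrans))
		return &t->hwtimetrans;
	if (!sketch_ts_empty(&t->systime))
		return &t->systime;
	return NULL;
}
```

A caller would apply this to the payload of the SCM_TIMESTAMPING control message and treat a NULL result as "no usable time stamp".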

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
new file mode 100644
index 0000000..a681a65
--- /dev/null
+++ b/Documentation/networking/timestamping.txt
@@ -0,0 +1,178 @@
+The existing interfaces for getting network packets time stamped are:
+
+* SO_TIMESTAMP
+ Generate time stamp for each incoming packet using the (not necessarily
+ monotonic!) system time. Result is returned via recvmsg() in a
+ control message as timeval (usec resolution).
+
+* SO_TIMESTAMPNS
+ Same time stamping mechanism as SO_TIMESTAMP, but returns result as
+ timespec (nsec resolution).
+
+* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
+ Only for multicasts: approximate send time stamp by receiving the looped
+ packet and using its receive time stamp.
+
+The following interface complements the existing ones: receive time
+stamps can be generated and returned for arbitrary packets and much
+closer to the point where the packet is really sent. Time stamps can
+be generated in software (as before) or in hardware (if the hardware
+has such a feature).
+
+SO_TIMESTAMPING:
+
+Instructs the socket layer which kind of information is wanted. The
+parameter is an integer with some of the following bits set. Setting
+other bits is an error and doesn't change the current state.
+
+SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
+SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or
+ fails, then do it in software
+SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp
+ as generated by the hardware
+SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or
+ fails, then do it in software
+SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
+SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to
+ the system time base
+SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in
+ software
+
+SOF_TIMESTAMPING_TX/RX determine how time stamps are generated.
+SOF_TIMESTAMPING_RAW/SYS determine how they are reported in the
+following control message:
+ struct scm_timestamping {
+ struct timespec systime;
+ struct timespec hwtimetrans;
+ struct timespec hwtimeraw;
+ };
+
+recvmsg() can be used to get this control message for regular incoming
+packets. For send time stamps the outgoing packet is looped back to
+the socket's error queue with the send time stamp(s) attached. It can
+be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the
+original outgoing packet data including all headers prepended down to
+and including the link layer, the scm_timestamping control message and
+a sock_extended_err control message with ee_errno==ENOMSG and
+ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
+bounced packet is ready for reading as far as select() is concerned.
+
+All three values correspond to the same event in time, but were
+generated in different ways. Each of these values may be empty (= all
+zero), in which case no such value was available. If the application
+is not interested in some of these values, they can be left blank to
+avoid the potential overhead of calculating them.
+
+systime is the value of the system time at that moment. This
+corresponds to the value also returned via SO_TIMESTAMP[NS]. If the
+time stamp was generated by hardware, then this field is
+empty. Otherwise it is filled in if SOF_TIMESTAMPING_SOFTWARE is
+set.
+
+hwtimeraw is the original hardware time stamp. Filled in if
+SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its
+relation to system time should be made.
+
+hwtimetrans is the hardware time stamp transformed so that it
+corresponds as closely as possible to system time. This correlation is
+not perfect; as a consequence, sorting packets received via different
+NICs by their hwtimetrans may differ from the order in which they were
+received. hwtimetrans may be non-monotonic even for the same NIC.
+Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support
+by the network device and will be empty without that support.
+
+
+SIOCSHWTSTAMP:
+
+Hardware time stamping must also be initialized for each device driver
+that is expected to do hardware time stamping. The parameter is:
+
+struct hwtstamp_config {
+ int flags; /* no flags defined right now, must be zero */
+ int tx_type; /* HWTSTAMP_TX_* */
+ int rx_filter; /* HWTSTAMP_FILTER_* */
+};
+
+Desired behavior is passed into the kernel and to a specific device by
+calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
+ifr_data points to a struct hwtstamp_config. The tx_type and
+rx_filter are hints to the driver what it is expected to do. If
+the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets.
+
+A driver which supports hardware time stamping shall update the struct
+with the actual, possibly more permissive configuration. If the
+requested packets cannot be time stamped, then nothing should be
+changed and ERANGE shall be returned (in contrast to EINVAL, which
+indicates that SIOCSHWTSTAMP is not supported at all).
+
+Only a process with admin rights may change the configuration. User
+space is responsible to ensure that multiple processes don't interfere
+with each other and that the settings are reset.
+
+/* possible values for hwtstamp_config->tx_type */
+enum {
+ /*
+ * no outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done
+ */
+ HWTSTAMP_TX_OFF,
+
+ /*
+ * enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting SOF_TIMESTAMPING_TX_HARDWARE
+ * before sending the packet
+ */
+ HWTSTAMP_TX_ON,
+};
+
+/* possible values for hwtstamp_config->rx_filter */
+enum {
+ /* time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /* time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /* return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /* PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+
+ ...
+};
+
+
+DEVICE IMPLEMENTATION
+
+A driver which supports hardware time stamping must support the
+SIOCSHWTSTAMP ioctl. Time stamps for received packets must be stored
+in the skb with skb_hwtstamp_set().
+
+Time stamps for outgoing packets are to be generated as follows:
+- In hard_start_xmit(), check if skb_hwtstamp_check_tx_hardware()
+ returns non-zero. If yes, then the driver is expected
+ to do hardware time stamping.
+- If this is possible for the skb and requested, then declare
+ that the driver is doing the time stamping by calling
+ skb_hwtstamp_tx_in_progress(). A driver not supporting
+ hardware time stamping doesn't do that. A driver must never
+ touch sk_buff::tstamp! It is used to store how time stamping
+ for outgoing packets is to be done.
+- As soon as the driver has sent the packet and/or obtained a
+ hardware time stamp for it, it passes the time stamp back by
+ calling skb_hwtstamp_tx() with the original skb, the raw
+ hardware time stamp and a handle to the device (necessary
+ to convert the hardware time stamp to system time). If obtaining
+ the hardware time stamp somehow fails, then the driver should
+ not fall back to software time stamping. The rationale is that
+ this would occur at a later time in the processing pipeline
+ than other software time stamping and therefore could lead
+ to unexpected deltas between time stamps.
+- If the driver did not call skb_hwtstamp_tx_in_progress(), then
+ dev_hard_start_xmit() checks whether software time stamping
+ is wanted as fallback and potentially generates the time stamp.
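
The SIOCSHWTSTAMP rule from timestamping.txt above (report the actual, possibly more permissive filter, or change nothing and return ERANGE) could look roughly like this on the driver side. This is a sketch only: the enum is a local stand-in for the constants listed in the documentation (the real values live in linux/net_tstamp.h), and sketch_hw and sketch_set_rx_filter are hypothetical.

```c
#include <errno.h>

/* Local stand-ins for the HWTSTAMP_FILTER_* constants named in the doc. */
enum sketch_rx_filter {
	SKETCH_FILTER_NONE,
	SKETCH_FILTER_ALL,
	SKETCH_FILTER_SOME,
	SKETCH_FILTER_PTP_V1_L4_EVENT,
};

/* Hypothetical description of what a NIC can do. */
struct sketch_hw {
	int can_stamp_all;        /* can time stamp every incoming packet */
	int can_stamp_ptp_v1_l4;  /* can filter PTP v1 UDP event packets */
};

/*
 * Returns 0 and stores the filter actually in effect in *rx_filter, or
 * returns -ERANGE and leaves *rx_filter untouched if the requested
 * packets cannot be time stamped.
 */
int sketch_set_rx_filter(const struct sketch_hw *hw, int *rx_filter)
{
	switch (*rx_filter) {
	case SKETCH_FILTER_NONE:
		return 0;
	case SKETCH_FILTER_PTP_V1_L4_EVENT:
		if (hw->can_stamp_ptp_v1_l4)
			return 0;
		/* fall through: try the more permissive setting */
	case SKETCH_FILTER_ALL:
		if (hw->can_stamp_all) {
			*rx_filter = SKETCH_FILTER_ALL;
			return 0;
		}
		return -ERANGE;
	default:
		/* SKETCH_FILTER_SOME is a return value only, not a request */
		return -ERANGE;
	}
}
```

A NIC that can only stamp everything turns a PTP v1 request into SKETCH_FILTER_ALL, so user space can see from the updated struct that it will receive more time stamps than it asked for.
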
diff --git a/Documentation/networking/timestamping/.gitignore b/Documentation/networking/timestamping/.gitignore
new file mode 100644
index 0000000..71e81eb
--- /dev/null
+++ b/Documentation/networking/timestamping/.gitignore
@@ -0,0 +1 @@
+timestamping
diff --git a/Documentation/networking/timestamping/Makefile b/Documentation/networking/timestamping/Makefile
new file mode 100644
index 0000000..ce170d1
--- /dev/null
+++ b/Documentation/networking/timestamping/Makefile
@@ -0,0 +1,3 @@
+CPPFLAGS = -I../../../include
+
+timestamping: timestamping.c
diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c
new file mode 100644
index 0000000..26d2e25
--- /dev/null
+++ b/Documentation/networking/timestamping/timestamping.c
@@ -0,0 +1,469 @@
+/*
+ * This program demonstrates how the various time stamping features in
+ * the Linux kernel work. It emulates the behavior of a PTP
+ * implementation in stand-alone master mode by sending PTPv1 Sync
+ * multicasts once every five seconds. It looks for similar packets, but
+ * beyond that doesn't actually implement PTP.
+ *
+ * Outgoing packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support.
+ *
+ * Incoming packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support, SIOCGSTAMP[NS] (per-socket time stamp) and
+ * SO_TIMESTAMP[NS].
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <string.h>
+
+#include <sys/time.h>
+#include <sys/socket.h>
+#include <sys/select.h>
+#include <sys/ioctl.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+
+#include "asm/types.h"
+#include "linux/net_tstamp.h"
+#include "linux/errqueue.h"
+
+#ifndef SO_TIMESTAMPNS
+# define SO_TIMESTAMPNS 35
+#endif
+
+#ifndef SIOCGSTAMPNS
+# define SIOCGSTAMPNS 0x8907
+#endif
+
+static void usage(const char *error)
+{
+ if (error)
+ printf("invalid option: %s\n", error);
+ printf("timestamping interface (IP_MULTICAST_LOOP|SO_TIMESTAMP|SO_TIMESTAMPNS|SOF_TIMESTAMPING_TX_HARDWARE|SOF_TIMESTAMPING_TX_SOFTWARE|SOF_TIMESTAMPING_RX_HARDWARE|SOF_TIMESTAMPING_RX_SOFTWARE|SOF_TIMESTAMPING_SOFTWARE|SOF_TIMESTAMPING_SYS_HARDWARE|SOF_TIMESTAMPING_RAW_HARDWARE|SIOCGSTAMP|SIOCGSTAMPNS)*\n");
+ exit(1);
+}
+
+static void bail(const char *error)
+{
+ printf("%s: %s\n", error, strerror(errno));
+ exit(1);
+}
+
+static const unsigned char sync[] = {
+ 0x00,0x01, 0x00,0x01,
+ 0x5f,0x44, 0x46,0x4c,
+ 0x54,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x01,0x01,
+
+ /* fake uuid */
+ 0x00,0x01,
+ 0x02,0x03, 0x04,0x05,
+
+ 0x00,0x01, 0x00,0x37,
+ 0x00,0x00, 0x00,0x08,
+ 0x00,0x00, 0x00,0x00,
+ 0x49,0x05, 0xcd,0x01,
+ 0x29,0xb1, 0x8d,0xb0,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x01,
+
+ /* fake uuid */
+ 0x00,0x01,
+ 0x02,0x03, 0x04,0x05,
+
+ 0x00,0x00, 0x00,0x37,
+ 0x00,0x00, 0x00,0x04,
+ 0x44,0x46, 0x4c,0x54,
+ 0x00,0x00, 0xf0,0x60,
+ 0x00,0x01, 0x00,0x00,
+ 0x00,0x00, 0x00,0x01,
+ 0x00,0x00, 0xf0,0x60,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x04,
+ 0x44,0x46, 0x4c,0x54,
+ 0x00,0x01,
+
+ /* fake uuid */
+ 0x00,0x01,
+ 0x02,0x03, 0x04,0x05,
+
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00,
+ 0x00,0x00, 0x00,0x00
+};
+
+static void sendpacket(int sock, struct sockaddr *addr, socklen_t addr_len)
+{
+ struct timeval now;
+ int res;
+
+ res = sendto(sock, sync, sizeof(sync), 0,
+ addr, addr_len);
+ gettimeofday(&now, 0);
+ if (res < 0)
+ printf("%s: %s\n", "send", strerror(errno));
+ else
+ printf("%ld.%06ld: sent %d bytes\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res);
+}
+
+static void recvpacket(int sock, int recvmsg_flags,
+ int siocgstamp, int siocgstampns)
+{
+ char data[256];
+ struct timeval now;
+ struct msghdr msg;
+ struct iovec entry;
+ struct sockaddr_in from_addr;
+ struct {
+ struct cmsghdr cm;
+ char control[512];
+ } control;
+ int res;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_iov = &entry;
+ msg.msg_iovlen = 1;
+ entry.iov_base = data;
+ entry.iov_len = sizeof(data);
+ msg.msg_name = (caddr_t)&from_addr;
+ msg.msg_namelen = sizeof(from_addr);
+ msg.msg_control = &control;
+ msg.msg_controllen = sizeof(control);
+
+ res = recvmsg(sock, &msg, recvmsg_flags|MSG_DONTWAIT);
+ gettimeofday(&now, 0);
+ if (res < 0) {
+ printf("%s %s: %s\n",
+ "recvmsg",
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ strerror(errno));
+ } else {
+ struct cmsghdr *cmsg;
+ struct timeval tv;
+ struct timespec ts;
+
+ printf("%ld.%06ld: received %s data, %d bytes from %s, %d bytes control messages\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ res,
+ inet_ntoa(from_addr.sin_addr),
+ (int)msg.msg_controllen);
+ for (cmsg = CMSG_FIRSTHDR(&msg);
+ cmsg;
+ cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+ printf(" cmsg len %d: ", (int)cmsg->cmsg_len);
+ switch (cmsg->cmsg_level) {
+ case SOL_SOCKET:
+ printf("SOL_SOCKET ");
+ switch (cmsg->cmsg_type) {
+ case SO_TIMESTAMP: {
+ struct timeval *stamp =
+ (struct timeval *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMP %ld.%06ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_usec);
+ break;
+ }
+ case SO_TIMESTAMPNS: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPNS %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ case SO_TIMESTAMPING: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPING ");
+ printf("SW %ld.%09ld ",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ stamp++;
+ printf("HW transformed %ld.%09ld ",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ stamp++;
+ printf("HW raw %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ case IPPROTO_IP:
+ printf("IPPROTO_IP ");
+ switch (cmsg->cmsg_type) {
+ case IP_RECVERR: {
+ struct sock_extended_err *err =
+ (struct sock_extended_err *)CMSG_DATA(cmsg);
+ printf("IP_RECVERR ee_errno '%s' ee_origin %d => %s",
+ strerror(err->ee_errno),
+ err->ee_origin,
+#ifdef SO_EE_ORIGIN_TIMESTAMPING
+ err->ee_origin == SO_EE_ORIGIN_TIMESTAMPING ?
+ "bounced packet" : "unexpected origin"
+#else
+ "probably SO_EE_ORIGIN_TIMESTAMPING"
+#endif
+ );
+ if (res < sizeof(sync))
+ printf(" => truncated data?!");
+ else if (!memcmp(sync, data + res - sizeof(sync),
+ sizeof(sync)))
+ printf(" => GOT OUR DATA BACK (HURRAY!)");
+ break;
+ }
+ case IP_PKTINFO: {
+ struct in_pktinfo *pktinfo =
+ (struct in_pktinfo *)CMSG_DATA(cmsg);
+ printf("IP_PKTINFO interface index %u",
+ pktinfo->ipi_ifindex);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ default:
+ printf("level %d type %d",
+ cmsg->cmsg_level,
+ cmsg->cmsg_type);
+ break;
+ }
+ printf("\n");
+ }
+
+ if (siocgstamp) {
+ if (ioctl(sock, SIOCGSTAMP, &tv))
+ printf(" %s: %s\n", "SIOCGSTAMP", strerror(errno));
+ else
+ printf("SIOCGSTAMP %ld.%06ld\n",
+ (long)tv.tv_sec,
+ (long)tv.tv_usec);
+ }
+ if (siocgstampns) {
+ if (ioctl(sock, SIOCGSTAMPNS, &ts))
+ printf(" %s: %s\n", "SIOCGSTAMPNS", strerror(errno));
+ else
+ printf("SIOCGSTAMPNS %ld.%09ld\n",
+ (long)ts.tv_sec,
+ (long)ts.tv_nsec);
+ }
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int so_timestamping_flags = 0;
+ int so_timestamp = 0;
+ int so_timestampns = 0;
+ int siocgstamp = 0;
+ int siocgstampns = 0;
+ int ip_multicast_loop = 0;
+ char *interface;
+ int i;
+ int enabled = 1;
+ int sock;
+ struct ifreq device;
+ struct ifreq hwtstamp;
+ struct hwtstamp_config hwconfig, hwconfig_requested;
+ struct sockaddr_in addr;
+ struct ip_mreq imr;
+ struct in_addr iaddr;
+ int val;
+ socklen_t len;
+ struct timeval next;
+
+ if (argc < 2)
+ usage(0);
+ interface = argv[1];
+
+ for (i = 2; i < argc; i++ ) {
+ if (!strcasecmp(argv[i], "SO_TIMESTAMP")) {
+ so_timestamp = 1;
+ } else if (!strcasecmp(argv[i], "SO_TIMESTAMPNS")) {
+ so_timestampns = 1;
+ } else if (!strcasecmp(argv[i], "SIOCGSTAMP")) {
+ siocgstamp = 1;
+ } else if (!strcasecmp(argv[i], "SIOCGSTAMPNS")) {
+ siocgstampns = 1;
+ } else if (!strcasecmp(argv[i], "IP_MULTICAST_LOOP")) {
+ ip_multicast_loop = 1;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_HARDWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_SOFTWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_SOFTWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_HARDWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_SOFTWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_SOFTWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SOFTWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_SOFTWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SYS_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_SYS_HARDWARE;
+ } else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RAW_HARDWARE")) {
+ so_timestamping_flags |= SOF_TIMESTAMPING_RAW_HARDWARE;
+ } else {
+ usage(argv[i]);
+ }
+ }
+
+ sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
+ if (sock < 0)
+ bail("socket");
+
+ memset(&device, 0, sizeof(device));
+ strncpy(device.ifr_name, interface, sizeof(device.ifr_name));
+ if (ioctl(sock, SIOCGIFADDR, &device) < 0)
+ bail("getting interface IP address");
+
+ memset(&hwtstamp, 0, sizeof(hwtstamp));
+ strncpy(hwtstamp.ifr_name, interface, sizeof(hwtstamp.ifr_name));
+ hwtstamp.ifr_data = (void *)&hwconfig;
+ memset(&hwconfig, 0, sizeof(hwconfig));
+ hwconfig.tx_type =
+ (so_timestamping_flags & SOF_TIMESTAMPING_TX_HARDWARE) ?
+ HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
+ hwconfig.rx_filter =
+ (so_timestamping_flags & SOF_TIMESTAMPING_RX_HARDWARE) ?
+ HWTSTAMP_FILTER_PTP_V1_L4_SYNC : HWTSTAMP_FILTER_NONE;
+ hwconfig_requested = hwconfig;
+ if (ioctl(sock, SIOCSHWTSTAMP, &hwtstamp) < 0) {
+ if ((errno == EINVAL || errno == ENOTSUP) &&
+ hwconfig_requested.tx_type == HWTSTAMP_TX_OFF &&
+ hwconfig_requested.rx_filter == HWTSTAMP_FILTER_NONE)
+ printf("SIOCSHWTSTAMP: disabling hardware time stamping not possible\n");
+ else
+ bail("SIOCSHWTSTAMP");
+ }
+ printf("SIOCSHWTSTAMP: tx_type %d requested, got %d; rx_filter %d requested, got %d\n",
+ hwconfig_requested.tx_type, hwconfig.tx_type,
+ hwconfig_requested.rx_filter, hwconfig.rx_filter);
+
+ /* bind to PTP port */
+ addr.sin_family = AF_INET;
+ addr.sin_addr.s_addr = htonl(INADDR_ANY);
+ addr.sin_port = htons(319 /* PTP event port */);
+ if (bind(sock, (struct sockaddr*)&addr, sizeof(struct sockaddr_in)) < 0)
+ bail("bind");
+
+ /* set multicast group for outgoing packets */
+ inet_aton("224.0.1.130", &iaddr); /* alternate PTP domain 1 */
+ addr.sin_addr = iaddr;
+ imr.imr_multiaddr.s_addr = iaddr.s_addr;
+ imr.imr_interface.s_addr = ((struct sockaddr_in *)&device.ifr_addr)->sin_addr.s_addr;
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF,
+ &imr.imr_interface.s_addr, sizeof(struct in_addr)) < 0)
+ bail("set multicast");
+
+ /* join multicast group, loop our own packet */
+ if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &imr, sizeof(struct ip_mreq)) < 0)
+ bail("join multicast group");
+
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_LOOP, &ip_multicast_loop, sizeof(enabled)) < 0) {
+ bail("loop multicast");
+ }
+
+ /* set socket options for time stamping */
+ if (so_timestamp &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMP");
+
+ if (so_timestampns &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMPNS");
+
+ if (so_timestamping_flags &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &so_timestamping_flags, sizeof(so_timestamping_flags)) < 0)
+ bail("setsockopt SO_TIMESTAMPING");
+
+ /* request IP_PKTINFO for debugging purposes */
+ if (setsockopt(sock, SOL_IP, IP_PKTINFO, &enabled, sizeof(enabled)) < 0)
+ printf("%s: %s\n", "setsockopt IP_PKTINFO", strerror(errno));
+
+ /* verify socket options */
+ len = sizeof(val);
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMP", strerror(errno));
+ else
+ printf("SO_TIMESTAMP %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPNS", strerror(errno));
+ else
+ printf("SO_TIMESTAMPNS %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &val, &len) < 0) {
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPING", strerror(errno));
+ } else {
+ printf("SO_TIMESTAMPING %d\n", val);
+ if (val != so_timestamping_flags)
+ printf(" not the expected value %d\n", so_timestamping_flags);
+ }
+
+ /* send packets forever every five seconds */
+ gettimeofday(&next, 0);
+ next.tv_sec = (next.tv_sec + 1) / 5 * 5;
+ next.tv_usec = 0;
+ while(1) {
+ struct timeval now;
+ struct timeval delta;
+ long delta_us;
+ int res;
+ fd_set readfs, errorfs;
+
+ gettimeofday(&now, 0);
+ delta_us = (long)(next.tv_sec - now.tv_sec) * 1000000 +
+ (long)(next.tv_usec - now.tv_usec);
+ if (delta_us > 0) {
+ /* continue waiting for timeout or data */
+ delta.tv_sec = delta_us / 1000000;
+ delta.tv_usec = delta_us % 1000000;
+
+ FD_ZERO(&readfs);
+ FD_ZERO(&errorfs);
+ FD_SET(sock, &readfs);
+ FD_SET(sock, &errorfs);
+ printf("%ld.%06ld: select %ldus\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ delta_us);
+ res = select(sock + 1, &readfs, 0, &errorfs, &delta);
+ gettimeofday(&now, 0);
+ printf("%ld.%06ld: select returned: %d, %s\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res,
+ res < 0 ? strerror(errno) : "success");
+ if (res > 0) {
+ if (FD_ISSET(sock, &readfs))
+ printf("ready for reading\n");
+ if (FD_ISSET(sock, &errorfs))
+ printf("has error\n");
+ recvpacket(sock, 0,
+ siocgstamp,
+ siocgstampns);
+ recvpacket(sock, MSG_ERRQUEUE,
+ siocgstamp,
+ siocgstampns);
+ }
+ } else {
+ /* write one packet */
+ sendpacket(sock, (struct sockaddr *)&addr, sizeof(addr));
+ next.tv_sec += 5;
+ continue;
+ }
+ }
+
+ return 0;
+}
diff --git a/arch/alpha/include/asm/socket.h b/arch/alpha/include/asm/socket.h
index a1057c2..5135249 100644
--- a/arch/alpha/include/asm/socket.h
+++ b/arch/alpha/include/asm/socket.h
@@ -61,6 +61,9 @@
#define SO_SECURITY_ENCRYPTION_NETWORK 21

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

/* O_NONBLOCK clashes with the bits used for socket types. Therefore we
* have to define SOCK_NONBLOCK to a different value here.
diff --git a/arch/arm/include/asm/socket.h b/arch/arm/include/asm/socket.h
index 6817be9..5bd9130 100644
--- a/arch/arm/include/asm/socket.h
+++ b/arch/arm/include/asm/socket.h
@@ -53,5 +53,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_SOCKET_H */
diff --git a/arch/avr32/include/asm/socket.h b/arch/avr32/include/asm/socket.h
index 35863f2..707877b 100644
--- a/arch/avr32/include/asm/socket.h
+++ b/arch/avr32/include/asm/socket.h
@@ -53,5 +53,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* __ASM_AVR32_SOCKET_H */
diff --git a/arch/blackfin/include/asm/socket.h b/arch/blackfin/include/asm/socket.h
index 2ca702e..a51a777 100644
--- a/arch/blackfin/include/asm/socket.h
+++ b/arch/blackfin/include/asm/socket.h
@@ -52,5 +52,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_SOCKET_H */
diff --git a/arch/cris/include/asm/socket.h b/arch/cris/include/asm/socket.h
index 9df0ca8..ce30092 100644
--- a/arch/cris/include/asm/socket.h
+++ b/arch/cris/include/asm/socket.h
@@ -55,6 +55,9 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_SOCKET_H */

diff --git a/arch/h8300/include/asm/socket.h b/arch/h8300/include/asm/socket.h
index da2520d..660efa4 100644
--- a/arch/h8300/include/asm/socket.h
+++ b/arch/h8300/include/asm/socket.h
@@ -53,5 +53,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_SOCKET_H */
diff --git a/arch/ia64/include/asm/socket.h b/arch/ia64/include/asm/socket.h
index d5ef0aa..6afb493 100644
--- a/arch/ia64/include/asm/socket.h
+++ b/arch/ia64/include/asm/socket.h
@@ -62,5 +62,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/mips/include/asm/socket.h b/arch/mips/include/asm/socket.h
index facc2d7..06ad2bc 100644
--- a/arch/mips/include/asm/socket.h
+++ b/arch/mips/include/asm/socket.h
@@ -74,6 +74,9 @@ To add: #define SO_REUSEPORT 0x0200 /* Allow local address and port reuse. */
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#ifdef __KERNEL__

diff --git a/arch/parisc/include/asm/socket.h b/arch/parisc/include/asm/socket.h
index fba402c..885472b 100644
--- a/arch/parisc/include/asm/socket.h
+++ b/arch/parisc/include/asm/socket.h
@@ -54,6 +54,9 @@

#define SO_MARK 0x401f

+#define SO_TIMESTAMPING 0x4020
+#define SCM_TIMESTAMPING SO_TIMESTAMPING
+
/* O_NONBLOCK clashes with the bits used for socket types. Therefore we
* have to define SOCK_NONBLOCK to a different value here.
*/
diff --git a/arch/powerpc/include/asm/socket.h b/arch/powerpc/include/asm/socket.h
index f5a4e16..bd15a06 100644
--- a/arch/powerpc/include/asm/socket.h
+++ b/arch/powerpc/include/asm/socket.h
@@ -60,5 +60,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/asm/socket.h b/arch/s390/include/asm/socket.h
index c786ab6..0cc8f3a 100644
--- a/arch/s390/include/asm/socket.h
+++ b/arch/s390/include/asm/socket.h
@@ -61,5 +61,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_SOCKET_H */
diff --git a/arch/sh/include/asm/socket.h b/arch/sh/include/asm/socket.h
index 6d4bf65..b4697cb 100644
--- a/arch/sh/include/asm/socket.h
+++ b/arch/sh/include/asm/socket.h
@@ -53,5 +53,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* __ASM_SH_SOCKET_H */
diff --git a/arch/sparc/include/asm/socket.h b/arch/sparc/include/asm/socket.h
index bf50d0c..982a12f 100644
--- a/arch/sparc/include/asm/socket.h
+++ b/arch/sparc/include/asm/socket.h
@@ -50,6 +50,9 @@

#define SO_MARK 0x0022

+#define SO_TIMESTAMPING 0x0023
+#define SCM_TIMESTAMPING SO_TIMESTAMPING
+
/* Security levels - as per NRL IPv6 - don't actually do anything */
#define SO_SECURITY_AUTHENTICATION 0x5001
#define SO_SECURITY_ENCRYPTION_TRANSPORT 0x5002
diff --git a/arch/x86/include/asm/socket.h b/arch/x86/include/asm/socket.h
index 8ab9cc8..ca8bf2c 100644
--- a/arch/x86/include/asm/socket.h
+++ b/arch/x86/include/asm/socket.h
@@ -54,4 +54,7 @@

#define SO_MARK 36

+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING
+
#endif /* _ASM_X86_SOCKET_H */
diff --git a/include/asm-frv/socket.h b/include/asm-frv/socket.h
index e51ca67..0539926 100644
--- a/include/asm-frv/socket.h
+++ b/include/asm-frv/socket.h
@@ -53,6 +53,9 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_SOCKET_H */

diff --git a/include/asm-m32r/socket.h b/include/asm-m32r/socket.h
index 9a0e200..d100bae 100644
--- a/include/asm-m32r/socket.h
+++ b/include/asm-m32r/socket.h
@@ -53,5 +53,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_M32R_SOCKET_H */
diff --git a/include/asm-m68k/socket.h b/include/asm-m68k/socket.h
index dbc64e9..1f425c6 100644
--- a/include/asm-m68k/socket.h
+++ b/include/asm-m68k/socket.h
@@ -53,5 +53,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_SOCKET_H */
diff --git a/include/asm-mn10300/socket.h b/include/asm-mn10300/socket.h
index 80af9c4..a633a06 100644
--- a/include/asm-mn10300/socket.h
+++ b/include/asm-mn10300/socket.h
@@ -53,5 +53,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _ASM_SOCKET_H */
diff --git a/include/asm-xtensa/socket.h b/include/asm-xtensa/socket.h
index 6100682..5ada734 100644
--- a/include/asm-xtensa/socket.h
+++ b/include/asm-xtensa/socket.h
@@ -64,5 +64,8 @@
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS

#define SO_MARK 36
+
+#define SO_TIMESTAMPING 37
+#define SCM_TIMESTAMPING SO_TIMESTAMPING

#endif /* _XTENSA_SOCKET_H */
diff --git a/include/linux/errqueue.h b/include/linux/errqueue.h
index 92f8d4f..86d88dd 100644
--- a/include/linux/errqueue.h
+++ b/include/linux/errqueue.h
@@ -16,6 +16,7 @@ struct sock_extended_err
#define SO_EE_ORIGIN_LOCAL 1
#define SO_EE_ORIGIN_ICMP 2
#define SO_EE_ORIGIN_ICMP6 3
+#define SO_EE_ORIGIN_TIMESTAMPING 4

#define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1))

diff --git a/include/linux/net_tstamp.h b/include/linux/net_tstamp.h
new file mode 100644
index 0000000..27d177e
--- /dev/null
+++ b/include/linux/net_tstamp.h
@@ -0,0 +1,112 @@
+/*
+ * Userspace API for hardware time stamping of network packets
+ *
+ * (C) Copyright 2008,2009 Intel Corporation
+ * Author: Patrick Ohly <[email protected]>
+ *
+ */
+
+#ifndef _NET_TIMESTAMPING_H
+#define _NET_TIMESTAMPING_H
+
+#include <linux/socket.h> /* for SO_TIMESTAMPING */
+
+/*
+ * user space linux/socket.h might not have these defines yet:
+ * provide fallback
+ */
+#if !defined(__KERNEL__) && !defined(SO_TIMESTAMPING)
+# define SO_TIMESTAMPING 37
+# define SCM_TIMESTAMPING SO_TIMESTAMPING
+#endif
+
+/* SO_TIMESTAMPING gets an integer bit field comprised of these values */
+enum {
+ SOF_TIMESTAMPING_TX_HARDWARE = (1<<0),
+ SOF_TIMESTAMPING_TX_SOFTWARE = (1<<1),
+ SOF_TIMESTAMPING_RX_HARDWARE = (1<<2),
+ SOF_TIMESTAMPING_RX_SOFTWARE = (1<<3),
+ SOF_TIMESTAMPING_SOFTWARE = (1<<4),
+ SOF_TIMESTAMPING_SYS_HARDWARE = (1<<5),
+ SOF_TIMESTAMPING_RAW_HARDWARE = (1<<6),
+ SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_RAW_HARDWARE - 1) | SOF_TIMESTAMPING_RAW_HARDWARE
+};
+
+#if !defined(__KERNEL__) && !defined(SIOCSHWTSTAMP)
+# define SIOCSHWTSTAMP 0x89b0
+#endif
+
+/**
+ * struct hwtstamp_config - %SIOCSHWTSTAMP parameter
+ *
+ * @flags: no flags defined right now, must be zero
+ * @tx_type: one of HWTSTAMP_TX_*
+ * @rx_filter: one of HWTSTAMP_FILTER_*
+ *
+ * %SIOCSHWTSTAMP expects a &struct ifreq with an ifr_data pointer
+ * to this structure.
+ */
+struct hwtstamp_config {
+ int flags;
+ int tx_type;
+ int rx_filter;
+};
+
+/* possible values for hwtstamp_config->tx_type */
+enum {
+ /*
+ * No outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done.
+ */
+ HWTSTAMP_TX_OFF,
+
+ /*
+ * Enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting %SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet.
+ */
+ HWTSTAMP_TX_ON,
+};
+
+/* possible values for hwtstamp_config->rx_filter */
+enum {
+ /* time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /* time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /* return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /* PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+ /* PTP v1, UDP, Sync packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
+ /* PTP v1, UDP, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
+ /* PTP v2, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
+ /* PTP v2, UDP, Sync packet */
+ HWTSTAMP_FILTER_PTP_V2_L4_SYNC,
+ /* PTP v2, UDP, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ,
+
+ /* 802.AS1, Ethernet, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V2_L2_EVENT,
+ /* 802.AS1, Ethernet, Sync packet */
+ HWTSTAMP_FILTER_PTP_V2_L2_SYNC,
+ /* 802.AS1, Ethernet, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ,
+
+ /* PTP v2/802.AS1, any layer, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V2_EVENT,
+ /* PTP v2/802.AS1, any layer, Sync packet */
+ HWTSTAMP_FILTER_PTP_V2_SYNC,
+ /* PTP v2/802.AS1, any layer, Delay_req packet */
+ HWTSTAMP_FILTER_PTP_V2_DELAY_REQ,
+};
+
+#endif /* _NET_TIMESTAMPING_H */
diff --git a/include/linux/sockios.h b/include/linux/sockios.h
index abef759..241f179 100644
--- a/include/linux/sockios.h
+++ b/include/linux/sockios.h
@@ -122,6 +122,9 @@
#define SIOCBRADDIF 0x89a2 /* add interface to bridge */
#define SIOCBRDELIF 0x89a3 /* remove interface from bridge */

+/* hardware time stamping: parameters in linux/net_tstamp.h */
+#define SIOCSHWTSTAMP 0x89b0
+
/* Device private ioctl calls */

/*
--
1.5.5.3

2009-01-21 10:11:44

by Patrick Ohly

Subject: [PATCH NET-NEXT 02/12] net: infrastructure for hardware time stamping

Instead of adding new members to struct sk_buff this
patch introduces and uses a generic mechanism for
extending skb: additional structures are allocated
at the end of the data area, similar to the skb_shared_info.
One new member of skb records which of the
optional structures are present, with one bit per
structure. This allows fast checks whether certain
information is present.

The actual address of an optional structure
is found by using a hard-coded ordering of these
structures and adding up the size and alignment padding
of the preceding structs.
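
The address computation described above can be sketched in user space
roughly as follows. This is only an illustration, not the kernel code
from this patch: optional_offset(), align_optional() and the sizes in
optional_sizes[] are hypothetical stand-ins, structs smaller than 8
bytes are assumed to have a power-of-two size, and the loop is bounded
so that asking for an absent struct returns an error value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes of the optional structs, indexed by flag bit. */
static const unsigned int optional_sizes[] = { 16, 16 };
#define OPTIONAL_NUM (sizeof(optional_sizes) / sizeof(optional_sizes[0]))

/* Round offset up to a multiple of min(struct_size, 8). */
static unsigned int align_optional(unsigned int offset,
				   unsigned int struct_size)
{
	unsigned int align;

	assert(struct_size > 0);
	align = (struct_size < 8u ? struct_size : 8u) - 1;
	return (offset + align) & ~align;
}

/*
 * Return the byte offset of the struct selected by "wanted" (a single
 * flag bit). "present" is the mask of allocated structs and "base" is
 * the offset just past the fixed part of the data area.
 * Returns (unsigned int)-1 if the struct is not present.
 */
static unsigned int optional_offset(unsigned int base, unsigned int present,
				    unsigned int wanted)
{
	unsigned int offset = base;
	unsigned int i;

	for (i = 0; i < OPTIONAL_NUM; i++) {
		unsigned int bit = 1u << i;

		if (!(present & bit))
			continue;
		/* each present struct is aligned, then skipped */
		offset = align_optional(offset, optional_sizes[i]);
		if (bit == wanted)
			return offset;
		offset += optional_sizes[i];
	}
	return (unsigned int)-1;
}
```

With both structs present, the second one is found by skipping the
aligned first one; with only the second present, it moves up into the
first slot.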

The new struct skb_shared_tx is used to transport time stamping
instructions to the device driver (outgoing packets). The
resulting hardware time stamps are returned via struct
skb_shared_hwtstamps (incoming or sent packets), in all
formats possibly needed by the rest of the kernel and
user space (original raw hardware time stamp and converted
to system time base). This replaces the problematic callbacks
into the network driver used in earlier revisions of this patch.

Conceptually the two structs are independent and use
different bits in the new flags fields. This avoids the
problem that dev_hard_start_xmit() cannot distinguish
reliably between outgoing and incoming packets (it is
called for looped multicast packets). But to avoid copying
sent data, the space reserved for skb_shared_tx is
increased so that this space can be reused for skb_shared_hwtstamps
when sending back the packet to the originating socket.
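
As a rough user-space model of that reuse (a sketch under simplifying
assumptions, not the code in skbuff.c): the slot is sized for the
larger of the two structs, the result is written over the instructions,
and the flag byte of the clone is switched from "TX" to "HWTSTAMPS".
The names hwtstamps_model, tx_model, OPT_*, SLOT_SIZE and
reuse_tx_slot() are made up for this example, and the time stamps are
plain 64 bit integers instead of ktime_t.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the structs in this patch; field types are simplified. */
struct hwtstamps_model {
	int64_t hwtstamp;
	int64_t syststamp;
};

struct tx_model {
	uint8_t flags;
};

enum {
	OPT_HWTSTAMPS = 1 << 0,
	OPT_TX = 1 << 1,
};

/* Slot large enough for either struct, chosen at compile time. */
#define SLOT_SIZE (sizeof(struct hwtstamps_model) > sizeof(struct tx_model) ? \
		   sizeof(struct hwtstamps_model) : sizeof(struct tx_model))

/*
 * Overwrite the TX instruction slot with the time stamping result and
 * return the flag byte the clone should carry afterwards.
 */
static uint8_t reuse_tx_slot(void *slot, uint8_t optional,
			     const struct hwtstamps_model *result)
{
	assert(optional & OPT_TX);
	memcpy(slot, result, sizeof(*result));
	return (uint8_t)((optional & ~OPT_TX) | OPT_HWTSTAMPS);
}
```

Only the flag byte of the clone changes; the packet data itself is
not copied.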

TX time stamping is implemented in software if the device driver
doesn't support hardware time stamping.

The new semantics for hardware/software time stamping around
net_device->hard_start_xmit() are based on two assumptions about
existing network device drivers which don't support hardware
time stamping and know nothing about it:
- they leave the new skb_shared_tx struct unmodified
- they keep the connection to the originating socket in skb->sk
alive, i.e., don't call skb_orphan()

Given that skb_shared_tx is new, the first assumption is safe.
The second is only true for some drivers. As a result, software
TX time stamping currently works with the bnx2 driver, but not
with the unmodified igb driver (the two drivers this patch series
was tested with).
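
The fallback rule can be summarized by the following user-space
sketch. It folds the check in tstamp_tx() and the socket check in
skb_tstamp_tx() into one hypothetical helper,
needs_software_tstamp(); struct tx_request is a simplified model of
skb_shared_tx, and the "has_socket" parameter stands for skb->sk still
being set after the driver returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the time stamping instructions for one packet. */
struct tx_request {
	unsigned int hardware:1;	/* hardware time stamp requested */
	unsigned int software:1;	/* software time stamp requested */
	unsigned int in_progress:1;	/* driver will deliver a hw stamp */
};

/*
 * Decide whether the core should generate a software TX time stamp
 * after the driver's transmit function returned xmit_rc.
 */
static bool needs_software_tstamp(const struct tx_request *tx, int xmit_rc,
				  bool has_socket)
{
	if (!tx)
		return false;	/* no instructions were allocated */
	if (xmit_rc != 0)
		return false;	/* packet was not accepted by the driver */
	if (!has_socket)
		return false;	/* orphaned: nowhere to queue the stamp */
	return tx->software && !tx->in_progress;
}
```

A driver that knows nothing about time stamping never sets
in_progress, so the software path is taken whenever it was requested
and the socket is still attached.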
---
include/linux/skbuff.h | 196 ++++++++++++++++++++++++++++++++++++++++++++++--
net/core/dev.c | 33 ++++++++-
net/core/skbuff.c | 153 ++++++++++++++++++++++++++++++++------
3 files changed, 351 insertions(+), 31 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cf2cb50..8286f76 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -156,6 +156,105 @@ struct skb_shared_info {
#endif
};

+#define HAVE_HW_TIME_STAMP
+
+/**
+ * skb_shared_hwtstamps - optional hardware time stamps
+ *
+ * @hwtstamp: hardware time stamp transformed into duration
+ * since arbitrary point in time
+ * @syststamp: hwtstamp transformed to system time base
+ *
+ * Software time stamps generated by ktime_get_real() are stored in
+ * skb->tstamp. The relation between the different kinds of time
+ * stamps is as follows:
+ *
+ * syststamp and tstamp can be compared against each other in
+ * arbitrary combinations. The accuracy of a
+ * syststamp/tstamp/"syststamp from other device" comparison is
+ * limited by the accuracy of the transformation into system time
+ * base. This depends on the device driver and its underlying
+ * hardware.
+ *
+ * hwtstamps can only be compared against other hwtstamps from
+ * the same device.
+ *
+ * This additional structure has to be allocated together with
+ * the data buffer and is shared between clones.
+ */
+struct skb_shared_hwtstamps {
+ ktime_t hwtstamp;
+ ktime_t syststamp;
+};
+
+/**
+ * skb_shared_tx - optional instructions for time stamping of outgoing packets
+ *
+ * @hardware: generate hardware time stamp
+ * @software: generate software time stamp
+ * @in_progress: device driver is going to provide
+ * hardware time stamp
+ *
+ * This additional structure has to be allocated together with the
+ * data buffer and is shared between clones. Its space is reused
+ * in skb_tstamp_tx() for skb_shared_hwtstamps and therefore it
+ * has to be larger than strictly necessary (handled in skbuff.c).
+ */
+union skb_shared_tx {
+ struct {
+ __u8 hardware:1,
+ software:1,
+ in_progress:1;
+ };
+ __u8 flags;
+};
+
+/*
+ * Flags which control how &struct sk_buff is to be/was allocated.
+ * The &struct skb_shared_info always comes at sk_buff->end, then
+ * all of the optional structs in the order defined by their
+ * flags. Each structure is aligned so that it is at a multiple
+ * of 8 or its own size, whichever is smaller. Putting structs
+ * with less strict alignment requirements at the end increases
+ * the chance that no padding is needed.
+ *
+ * SKB_FLAGS_OPTIONAL_TX could be combined with SKB_FLAGS_OPTIONAL_HWTSTAMPS
+ * (outgoing packets have &union skb_shared_tx, incoming
+ * &struct skb_shared_hwtstamps), but telling apart one from
+ * the other is ambiguous: when a multicast packet is looped back,
+ * it has to be considered incoming, but it then passes through
+ * dev_hard_start_xmit() once more. Better avoid such ambiguities,
+ * in particular as it doesn't save any space. One additional byte
+ * is needed in any case.
+ *
+ * SKB_FLAGS_CLONE replaces the true/false integer fclone parameter in
+ * __alloc_skb(). Clones are marked as before in sk_buff->cloned.
+ *
+ * Similarly, SKB_FLAGS_NOBLOCK is used in place of a special noblock
+ * parameter in sock_alloc_send_skb().
+ *
+ * When adding optional structs, remember to update skb_optional_sizes
+ * in skbuff.c!
+ */
+enum {
+ /*
+ * one byte holds the lower order flags in struct sk_buff,
+ * so we could add more structs without additional costs
+ */
+ SKB_FLAGS_OPTIONAL_HWTSTAMPS = 1 << 0,
+ SKB_FLAGS_OPTIONAL_TX = 1 << 1,
+
+ /* number of bits used for optional structures */
+ SKB_FLAGS_OPTIONAL_NUM = 2,
+
+ /*
+ * the following flags only affect how the skb is allocated,
+ * they are not stored like the ones above
+ */
+ SKB_FLAGS_CLONE = 1 << 8,
+ SKB_FLAGS_NOBLOCK = 1 << 9,
+};
+
/* We divide dataref into two halves. The higher 16 bits hold references
* to the payload part of skb->data. The lower 16 bits hold references to
* the entire skb->data. A clone of a headerless skb holds the length of
@@ -228,6 +327,8 @@ typedef unsigned char *sk_buff_data_t;
* @ip_summed: Driver fed us an IP checksum
* @priority: Packet queueing priority
* @users: User count - see {datagram,tcp}.c
+ * @optional: a combination of SKB_FLAGS_OPTIONAL_* flags, indicates
+ * which of the corresponding structs were allocated
* @protocol: Packet protocol from driver
* @truesize: Buffer size
* @head: Head of buffer
@@ -305,6 +406,8 @@ struct sk_buff {
ipvs_property:1,
peeked:1,
nf_trace:1;
+ /* not all of the bits in optional are used */
+ __u8 optional;
__be16 protocol;

void (*destructor)(struct sk_buff *skb);
@@ -374,18 +477,18 @@ extern void skb_dma_unmap(struct device *dev, struct sk_buff *skb,

extern void kfree_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
-extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+extern struct sk_buff *__alloc_skb_flags(unsigned int size,
+ gfp_t priority, int flags, int node);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 0, -1);
+ return __alloc_skb_flags(size, priority, 0, -1);
}

static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, -1);
+ return __alloc_skb_flags(size, priority, SKB_FLAGS_CLONE, -1);
}

extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -469,6 +572,29 @@ static inline unsigned char *skb_end_pointer(const struct sk_buff *skb)
#define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))

/**
+ * __skb_get_optional - returns pointer to the requested structure
+ *
+ * @optional: one of the SKB_FLAGS_OPTIONAL_* constants
+ *
+ * The caller must check that the structure is actually in the skb.
+ */
+extern void *__skb_get_optional(struct sk_buff *skb, int optional);
+
+static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
+{
+ return (skb->optional & SKB_FLAGS_OPTIONAL_HWTSTAMPS) ?
+ __skb_get_optional(skb, SKB_FLAGS_OPTIONAL_HWTSTAMPS) :
+ NULL;
+}
+
+static inline union skb_shared_tx *skb_tx(struct sk_buff *skb)
+{
+ return (skb->optional & SKB_FLAGS_OPTIONAL_TX) ?
+ __skb_get_optional(skb, SKB_FLAGS_OPTIONAL_TX) :
+ NULL;
+}
+
+/**
* skb_queue_empty - check if a queue is empty
* @list: queue head
*
@@ -1399,8 +1525,33 @@ static inline struct sk_buff *__dev_alloc_skb(unsigned int length,

extern struct sk_buff *dev_alloc_skb(unsigned int length);

-extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
- unsigned int length, gfp_t gfp_mask);
+/**
+ * __netdev_alloc_skb_internal - allocate an skbuff for rx on a specific device
+ * @dev: network device to receive on
+ * @length: length to allocate
+ * @flags: SKB_FLAGS_* mask
+ * @gfp_mask: get_free_pages mask, passed to alloc_skb
+ *
+ * Allocate a new &sk_buff and assign it a usage count of one. The
+ * buffer has unspecified headroom built in. Users should allocate
+ * the headroom they think they need without accounting for the
+ * built in space. The built in space is used for optimisations.
+ *
+ * %NULL is returned if there is no free memory.
+ *
+ * This function takes the full set of parameters. There are aliases
+ * with a smaller number of parameters.
+ */
+extern struct sk_buff *__netdev_alloc_skb_internal(struct net_device *dev,
+ unsigned int length, int flags,
+ gfp_t gfp_mask);
+
+static inline struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
+ unsigned int length,
+ gfp_t gfp_mask)
+{
+ return __netdev_alloc_skb_internal(dev, length, 0, gfp_mask);
+}

/**
* netdev_alloc_skb - allocate an skbuff for rx on a specific device
@@ -1418,7 +1569,12 @@ extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
static inline struct sk_buff *netdev_alloc_skb(struct net_device *dev,
unsigned int length)
{
- return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
+ return __netdev_alloc_skb_internal(dev, length, 0, GFP_ATOMIC);
+}
+static inline struct sk_buff *netdev_alloc_skb_flags(struct net_device *dev,
+ unsigned int length, int flags)
+{
+ return __netdev_alloc_skb_internal(dev, length, flags, GFP_ATOMIC);
}

extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
@@ -1735,6 +1891,11 @@ static inline void skb_copy_to_linear_data_offset(struct sk_buff *skb,

extern void skb_init(void);

+static inline ktime_t skb_get_ktime(const struct sk_buff *skb)
+{
+ return skb->tstamp;
+}
+
/**
* skb_get_timestamp - get timestamp from a skb
* @skb: skb to get stamp from
@@ -1749,6 +1910,11 @@ static inline void skb_get_timestamp(const struct sk_buff *skb, struct timeval *
*stamp = ktime_to_timeval(skb->tstamp);
}

+static inline void skb_get_timestampns(const struct sk_buff *skb, struct timespec *stamp)
+{
+ *stamp = ktime_to_timespec(skb->tstamp);
+}
+
static inline void __net_timestamp(struct sk_buff *skb)
{
skb->tstamp = ktime_get_real();
@@ -1764,6 +1930,22 @@ static inline ktime_t net_invalid_timestamp(void)
return ktime_set(0, 0);
}

+/**
+ * skb_tstamp_tx - queue clone of skb with send time stamps
+ * @orig_skb: the original outgoing packet
+ * @hwtstamps: hardware time stamps, may be NULL if not available
+ *
+ * If the skb has a socket associated, then this function clones the
+ * skb (thus sharing the actual data and optional structures), stores
+ * the optional hardware time stamping information (if non NULL) or
+ * generates a software time stamp (otherwise), then queues the clone
+ * to the error queue of the socket. Errors are silently ignored.
+ *
+ * May only be called on skbs which have a skb_shared_tx!
+ */
+extern void skb_tstamp_tx(struct sk_buff *orig_skb,
+ struct skb_shared_hwtstamps *hwtstamps);
+
extern __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len);
extern __sum16 __skb_checksum_complete(struct sk_buff *skb);

diff --git a/net/core/dev.c b/net/core/dev.c
index 4464240..93b2ac9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1658,10 +1658,22 @@ static int dev_gso_segment(struct sk_buff *skb)
return 0;
}

+static void tstamp_tx(struct sk_buff *skb)
+{
+ union skb_shared_tx *shtx =
+ skb_tx(skb);
+ if (unlikely(shtx &&
+ shtx->software &&
+ !shtx->in_progress)) {
+ skb_tstamp_tx(skb, NULL);
+ }
+}
+
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq)
{
const struct net_device_ops *ops = dev->netdev_ops;
+ int rc;

prefetch(&dev->netdev_ops->ndo_start_xmit);
if (likely(!skb->next)) {
@@ -1675,13 +1687,29 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
goto gso;
}

- return ops->ndo_start_xmit(skb, dev);
+ rc = ops->ndo_start_xmit(skb, dev);
+ /*
+ * TODO: if skb_orphan() was called by
+ * dev->hard_start_xmit() (for example, the unmodified
+ * igb driver does that; bnx2 doesn't), then
+ * skb_tstamp_tx() will be unable to send
+ * back the time stamp.
+ *
+ * How can this be prevented? Always create another
+ * reference to the socket before calling
+ * dev->hard_start_xmit()? Prevent that skb_orphan()
+ * does anything in dev->hard_start_xmit() by clearing
+ * the skb destructor before the call and restoring it
+ * afterwards, then doing the skb_orphan() ourselves?
+ */
+ if (likely(!rc))
+ tstamp_tx(skb);
+ return rc;
}

gso:
do {
struct sk_buff *nskb = skb->next;
- int rc;

skb->next = nskb->next;
nskb->next = NULL;
@@ -1691,6 +1719,7 @@ gso:
skb->next = nskb;
return rc;
}
+ tstamp_tx(skb);
if (unlikely(netif_tx_queue_stopped(txq) && skb->next))
return NETDEV_TX_BUSY;
} while (skb->next);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b8d0abb..57ad138 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -55,6 +55,7 @@
#include <linux/rtnetlink.h>
#include <linux/init.h>
#include <linux/scatterlist.h>
+#include <linux/errqueue.h>

#include <net/protocol.h>
#include <net/dst.h>
@@ -155,6 +156,63 @@ void skb_truesize_bug(struct sk_buff *skb)
}
EXPORT_SYMBOL(skb_truesize_bug);

+/*
+ * The size of each struct that corresponds to a SKB_FLAGS_OPTIONAL_*
+ * flag.
+ */
+static const unsigned int skb_optional_sizes[] =
+{
+ /*
+ * hwtstamps and tx are special: the space allocated for tx
+ * is reused for hwtstamps in skb_tstamp_tx(). This avoids copying
+ * the complete packet data.
+ *
+ * max() cannot be used here because it contains a code block,
+ * which gcc doesn't accept.
+ */
+#define MAX_SHARED_TIMESTAMPING ((sizeof(struct skb_shared_hwtstamps) > \
+ sizeof(union skb_shared_tx)) ? \
+ sizeof(struct skb_shared_hwtstamps) : \
+ sizeof(union skb_shared_tx))
+
+ MAX_SHARED_TIMESTAMPING,
+ MAX_SHARED_TIMESTAMPING
+};
+
+/**
+ * __skb_align_optional - increase byte offset for next optional struct
+ * @offset: gets increased if necessary
+ * @struct_size: size of the struct which is to be placed at offset
+ *
+ * Default alignment is 8 bytes. Smaller structs require less alignment,
+ * otherwise the compiler would have added padding at their end.
+ */
+static inline unsigned int __skb_align_optional(unsigned int offset, unsigned int struct_size)
+{
+ unsigned int align = min(struct_size, 8u) - 1;
+ return (offset + align) & ~align;
+}
+
+void *__skb_get_optional(struct sk_buff *skb, int optional)
+{
+ unsigned int offset = (unsigned int)(skb_end_pointer(skb) - skb->head +
+ sizeof(struct skb_shared_info));
+ int i = 0;
+
+ while(1) {
+ if (skb->optional & (1 << i)) {
+ unsigned int struct_size = skb_optional_sizes[i];
+ offset = __skb_align_optional(offset, struct_size);
+ if ((1 << i) == optional)
+ break;
+ offset += struct_size;
+ }
+ i++;
+ }
+ return skb->head + offset;
+}
+EXPORT_SYMBOL(__skb_get_optional);
+
/* Allocate a new skbuff. We do this ourselves so we can fill in a few
* 'private' fields and also do memory statistics to find all the
* [BEEP] leaks.
@@ -162,9 +220,10 @@ EXPORT_SYMBOL(skb_truesize_bug);
*/

/**
- * __alloc_skb - allocate a network buffer
+ * __alloc_skb_flags - allocate a network buffer
* @size: size to allocate
* @gfp_mask: allocation mask
+ * @flags: SKB_FLAGS_* mask
* @fclone: allocate from fclone cache instead of head cache
* and allocate a cloned (child) skb
* @node: numa node to allocate memory on
@@ -176,13 +235,16 @@ EXPORT_SYMBOL(skb_truesize_bug);
* Buffers may only be allocated from interrupts using a @gfp_mask of
* %GFP_ATOMIC.
*/
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+struct sk_buff *__alloc_skb_flags(unsigned int size, gfp_t gfp_mask,
+ int flags, int node)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
+ int fclone = flags & SKB_FLAGS_CLONE;
+ unsigned int total_size;
+ int i;

cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

@@ -192,7 +254,15 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
goto out;

size = SKB_DATA_ALIGN(size);
- data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
+ total_size = size + sizeof(struct skb_shared_info);
+ for (i = 0; i < SKB_FLAGS_OPTIONAL_NUM; i++) {
+ if (flags & (1 << i)) {
+ unsigned int struct_size = skb_optional_sizes[i];
+ total_size = __skb_align_optional(total_size, struct_size);
+ total_size += struct_size;
+ }
+ }
+ data = kmalloc_node_track_caller(total_size,
gfp_mask, node);
if (!data)
goto nodata;
@@ -228,6 +298,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,

child->fclone = SKB_FCLONE_UNAVAILABLE;
}
+ skb->optional = flags;
+
out:
return skb;
nodata:
@@ -236,26 +308,14 @@ nodata:
goto out;
}

-/**
- * __netdev_alloc_skb - allocate an skbuff for rx on a specific device
- * @dev: network device to receive on
- * @length: length to allocate
- * @gfp_mask: get_free_pages mask, passed to alloc_skb
- *
- * Allocate a new &sk_buff and assign it a usage count of one. The
- * buffer has unspecified headroom built in. Users should allocate
- * the headroom they think they need without accounting for the
- * built in space. The built in space is used for optimisations.
- *
- * %NULL is returned if there is no free memory.
- */
-struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
- unsigned int length, gfp_t gfp_mask)
+struct sk_buff *__netdev_alloc_skb_internal(struct net_device *dev,
+ unsigned int length, int flags,
+ gfp_t gfp_mask)
{
int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
struct sk_buff *skb;

- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+ skb = __alloc_skb_flags(length + NET_SKB_PAD, gfp_mask, flags, node);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
@@ -548,6 +608,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
n->cloned = 1;
n->nohdr = 0;
n->destructor = NULL;
+ C(optional);
C(iif);
C(tail);
C(end);
@@ -2847,6 +2908,54 @@ int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer)
return elt;
}

+void skb_tstamp_tx(struct sk_buff *orig_skb,
+ struct skb_shared_hwtstamps *hwtstamps)
+{
+ struct sock *sk = orig_skb->sk;
+ struct sock_exterr_skb *serr;
+ struct sk_buff *skb;
+ int err = -ENOMEM;
+ union skb_shared_tx *shtx =
+ skb_tx(orig_skb);
+
+ if (!sk)
+ return;
+
+ skb = skb_clone(orig_skb, GFP_ATOMIC);
+ if (!skb)
+ return;
+
+ if (hwtstamps) {
+ /*
+ * reuse the existing space for time stamping
+ * instructions for storing the results
+ */
+ struct skb_shared_hwtstamps *shhwtstamps =
+ (struct skb_shared_hwtstamps *)shtx;
+ *shhwtstamps = *hwtstamps;
+ skb->optional = (skb->optional &
+ ~SKB_FLAGS_OPTIONAL_TX) |
+ SKB_FLAGS_OPTIONAL_HWTSTAMPS;
+ } else {
+ /*
+ * no hardware time stamps available,
+ * so keep the skb_shared_tx and only
+ * store software time stamp
+ */
+ skb->tstamp = ktime_get_real();
+ }
+
+ serr = SKB_EXT_ERR(skb);
+ memset(serr, 0, sizeof(*serr));
+ serr->ee.ee_errno = ENOMSG;
+ serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
+ err = sock_queue_err_skb(sk, skb);
+ if (err)
+ kfree_skb(skb);
+}
+EXPORT_SYMBOL_GPL(skb_tstamp_tx);
+
+
/**
* skb_partial_csum_set - set up and verify partial csum values for packet
* @skb: the skb to set
@@ -2886,8 +2995,8 @@ EXPORT_SYMBOL(___pskb_trim);
EXPORT_SYMBOL(__kfree_skb);
EXPORT_SYMBOL(kfree_skb);
EXPORT_SYMBOL(__pskb_pull_tail);
-EXPORT_SYMBOL(__alloc_skb);
-EXPORT_SYMBOL(__netdev_alloc_skb);
+EXPORT_SYMBOL(__alloc_skb_flags);
+EXPORT_SYMBOL(__netdev_alloc_skb_internal);
EXPORT_SYMBOL(pskb_copy);
EXPORT_SYMBOL(pskb_expand_head);
EXPORT_SYMBOL(skb_checksum);
--
1.5.5.3

2009-01-21 10:12:15

by Patrick Ohly

Subject: [PATCH NET-NEXT 03/12] net: socket infrastructure for SO_TIMESTAMPING

The overlap with the old SO_TIMESTAMP[NS] options is handled so
that time stamping in software (net_enable_timestamp()) is
enabled when SO_TIMESTAMP[NS] and/or SOF_TIMESTAMPING_RX_SOFTWARE
is set. It's disabled if all of these are off.
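
Expressed as a small user-space sketch (an illustration of the rule
only, not the code in net/core/sock.c): struct sock_model and
wants_software_rx_stamps() are hypothetical, and RX_SOFTWARE_BIT
copies the value of SOF_TIMESTAMPING_RX_SOFTWARE from the
net_tstamp.h in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same value as SOF_TIMESTAMPING_RX_SOFTWARE in this patch series. */
#define RX_SOFTWARE_BIT (1 << 3)

/* The three per-socket settings which can ask for software RX stamps. */
struct sock_model {
	bool rcvtstamp;		/* SO_TIMESTAMP */
	bool rcvtstampns;	/* SO_TIMESTAMPNS */
	int timestamping;	/* SO_TIMESTAMPING bit field */
};

/* Software time stamping stays enabled while any of the three is set. */
static bool wants_software_rx_stamps(const struct sock_model *sk)
{
	assert(sk != NULL);
	return sk->rcvtstamp || sk->rcvtstampns ||
	       (sk->timestamping & RX_SOFTWARE_BIT) != 0;
}
```

The core would then enable software time stamping when this goes from
false to true for a socket, and disable it again on the opposite
transition.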
---
include/net/sock.h | 40 ++++++++++++++++++++++--
net/compat.c | 19 +++++++----
net/core/sock.c | 79 ++++++++++++++++++++++++++++++++++++++++-------
net/socket.c | 86 ++++++++++++++++++++++++++++++++++++++++------------
4 files changed, 182 insertions(+), 42 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 5a3a151..36807e4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -158,7 +158,7 @@ struct sock_common {
* @sk_allocation: allocation mode
* @sk_sndbuf: size of send buffer in bytes
* @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
- * %SO_OOBINLINE settings
+ * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
* @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
* @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
@@ -488,6 +488,13 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_TIMESTAMPING_TX_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_TX_HARDWARE */
+ SOCK_TIMESTAMPING_TX_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_TX_SOFTWARE */
+ SOCK_TIMESTAMPING_RX_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RX_HARDWARE */
+ SOCK_TIMESTAMPING_RX_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RX_SOFTWARE */
+ SOCK_TIMESTAMPING_SOFTWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_SOFTWARE */
+ SOCK_TIMESTAMPING_RAW_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_RAW_HARDWARE */
+ SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SO_TIMESTAMPING %SOF_TIMESTAMPING_SYS_HARDWARE */
};

static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -1342,13 +1349,40 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
{
ktime_t kt = skb->tstamp;

- if (sock_flag(sk, SOCK_RCVTSTAMP))
+ /*
+ * generate control messages if
+ * - receive time stamping in software requested (SOCK_RCVTSTAMP
+ * or SOCK_TIMESTAMPING_RX_SOFTWARE)
+ * - software time stamp available and wanted (SOCK_TIMESTAMPING_SOFTWARE)
+ * - hardware time stamps available and wanted (SOCK_TIMESTAMPING_SYS_HARDWARE
+ * or SOCK_TIMESTAMPING_RAW_HARDWARE)
+ */
+ if (sock_flag(sk, SOCK_RCVTSTAMP) ||
+ sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE) ||
+ (kt.tv64 && sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE)) ||
+ ((skb->optional & SKB_FLAGS_OPTIONAL_HWTSTAMPS) &&
+ (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE) ||
+ sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE))))
__sock_recv_timestamp(msg, sk, skb);
else
sk->sk_stamp = kt;
}

/**
+ * sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
+ * @msg: outgoing packet
+ * @sk: socket sending this packet
+ * @shtx: filled with instructions for time stamping
+ *
+ * Currently only depends on SOCK_TIMESTAMPING* flags. Returns error code if
+ * parameters are invalid.
+ */
+extern int sock_tx_timestamp(struct msghdr *msg,
+ struct sock *sk,
+ union skb_shared_tx *shtx);
+
+
+/**
* sk_eat_skb - Release a skb if it is no longer needed
* @sk: socket to eat this skb from
* @skb: socket buffer to eat
@@ -1416,7 +1450,7 @@ static inline struct sock *skb_steal_sock(struct sk_buff *skb)
return NULL;
}

-extern void sock_enable_timestamp(struct sock *sk);
+extern void sock_enable_timestamp(struct sock *sk, int flag);
extern int sock_get_timestamp(struct sock *, struct timeval __user *);
extern int sock_get_timestampns(struct sock *, struct timespec __user *);

diff --git a/net/compat.c b/net/compat.c
index a3a2ba0..fb01ed9 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -216,7 +216,7 @@ Efault:
int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *data)
{
struct compat_timeval ctv;
- struct compat_timespec cts;
+ struct compat_timespec cts[3];
struct compat_cmsghdr __user *cm = (struct compat_cmsghdr __user *) kmsg->msg_control;
struct compat_cmsghdr cmhdr;
int cmlen;
@@ -233,12 +233,17 @@ int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *dat
data = &ctv;
len = sizeof(ctv);
}
- if (level == SOL_SOCKET && type == SCM_TIMESTAMPNS) {
+ if (level == SOL_SOCKET &&
+ (type == SCM_TIMESTAMPNS || type == SCM_TIMESTAMPING)) {
+ int count = type == SCM_TIMESTAMPNS ? 1 : 3;
+ int i;
struct timespec *ts = (struct timespec *)data;
- cts.tv_sec = ts->tv_sec;
- cts.tv_nsec = ts->tv_nsec;
+ for (i = 0; i < count; i++) {
+ cts[i].tv_sec = ts[i].tv_sec;
+ cts[i].tv_nsec = ts[i].tv_nsec;
+ }
data = &cts;
- len = sizeof(cts);
+ len = sizeof(cts[0]) * count;
}

cmlen = CMSG_COMPAT_LEN(len);
@@ -455,7 +460,7 @@ int compat_sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
struct timeval tv;

if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
tv = ktime_to_timeval(sk->sk_stamp);
if (tv.tv_sec == -1)
return err;
@@ -479,7 +484,7 @@ int compat_sock_get_timestampns(struct sock *sk, struct timespec __user *usersta
struct timespec ts;

if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
ts = ktime_to_timespec(sk->sk_stamp);
if (ts.tv_sec == -1)
return err;
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a0d08..2846274 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -120,6 +120,7 @@
#include <net/net_namespace.h>
#include <net/request_sock.h>
#include <net/sock.h>
+#include <linux/net_tstamp.h>
#include <net/xfrm.h>
#include <linux/ipsec.h>

@@ -255,11 +256,14 @@ static void sock_warn_obsolete_bsdism(const char *name)
}
}

-static void sock_disable_timestamp(struct sock *sk)
+static void sock_disable_timestamp(struct sock *sk, int flag)
{
- if (sock_flag(sk, SOCK_TIMESTAMP)) {
- sock_reset_flag(sk, SOCK_TIMESTAMP);
- net_disable_timestamp();
+ if (sock_flag(sk, flag)) {
+ sock_reset_flag(sk, flag);
+ if (!sock_flag(sk, SOCK_TIMESTAMP) &&
+ !sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE)) {
+ net_disable_timestamp();
+ }
}
}

@@ -614,13 +618,36 @@ set_rcvbuf:
else
sock_set_flag(sk, SOCK_RCVTSTAMPNS);
sock_set_flag(sk, SOCK_RCVTSTAMP);
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
} else {
sock_reset_flag(sk, SOCK_RCVTSTAMP);
sock_reset_flag(sk, SOCK_RCVTSTAMPNS);
}
break;

+ case SO_TIMESTAMPING:
+ if (val & ~SOF_TIMESTAMPING_MASK) {
+ ret = -EINVAL;
+ break;
+ }
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_TX_HARDWARE,
+ val & SOF_TIMESTAMPING_TX_HARDWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE,
+ val & SOF_TIMESTAMPING_TX_SOFTWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_RX_HARDWARE,
+ val & SOF_TIMESTAMPING_RX_HARDWARE);
+ if (val & SOF_TIMESTAMPING_RX_SOFTWARE)
+ sock_enable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);
+ else
+ sock_disable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_SOFTWARE,
+ val & SOF_TIMESTAMPING_SOFTWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE,
+ val & SOF_TIMESTAMPING_SYS_HARDWARE);
+ sock_valbool_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE,
+ val & SOF_TIMESTAMPING_RAW_HARDWARE);
+ break;
+
case SO_RCVLOWAT:
if (val < 0)
val = INT_MAX;
@@ -766,6 +793,24 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
v.val = sock_flag(sk, SOCK_RCVTSTAMPNS);
break;

+ case SO_TIMESTAMPING:
+ v.val = 0;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_TX_HARDWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE))
+ v.val |= SOF_TIMESTAMPING_TX_SOFTWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RX_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_RX_HARDWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE))
+ v.val |= SOF_TIMESTAMPING_RX_SOFTWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE))
+ v.val |= SOF_TIMESTAMPING_SOFTWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_SYS_HARDWARE;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE))
+ v.val |= SOF_TIMESTAMPING_RAW_HARDWARE;
+ break;
+
case SO_RCVTIMEO:
lv=sizeof(struct timeval);
if (sk->sk_rcvtimeo == MAX_SCHEDULE_TIMEOUT) {
@@ -967,7 +1012,8 @@ void sk_free(struct sock *sk)
rcu_assign_pointer(sk->sk_filter, NULL);
}

- sock_disable_timestamp(sk);
+ sock_disable_timestamp(sk, SOCK_TIMESTAMP);
+ sock_disable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);

if (atomic_read(&sk->sk_omem_alloc))
printk(KERN_DEBUG "%s: optmem leakage (%d bytes) detected.\n",
@@ -1785,7 +1831,7 @@ int sock_get_timestamp(struct sock *sk, struct timeval __user *userstamp)
{
struct timeval tv;
if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
tv = ktime_to_timeval(sk->sk_stamp);
if (tv.tv_sec == -1)
return -ENOENT;
@@ -1801,7 +1847,7 @@ int sock_get_timestampns(struct sock *sk, struct timespec __user *userstamp)
{
struct timespec ts;
if (!sock_flag(sk, SOCK_TIMESTAMP))
- sock_enable_timestamp(sk);
+ sock_enable_timestamp(sk, SOCK_TIMESTAMP);
ts = ktime_to_timespec(sk->sk_stamp);
if (ts.tv_sec == -1)
return -ENOENT;
@@ -1813,11 +1859,20 @@ int sock_get_timestampns(struct sock *sk, struct timespec __user *userstamp)
}
EXPORT_SYMBOL(sock_get_timestampns);

-void sock_enable_timestamp(struct sock *sk)
+void sock_enable_timestamp(struct sock *sk, int flag)
{
- if (!sock_flag(sk, SOCK_TIMESTAMP)) {
- sock_set_flag(sk, SOCK_TIMESTAMP);
- net_enable_timestamp();
+ if (!sock_flag(sk, flag)) {
+ sock_set_flag(sk, flag);
+ /*
+ * we just set one of the two flags which require net
+ * time stamping, but time stamping might have been on
+ * already because of the other one
+ */
+ if (!sock_flag(sk,
+ flag == SOCK_TIMESTAMP ?
+ SOCK_TIMESTAMPING_RX_SOFTWARE :
+ SOCK_TIMESTAMP))
+ net_enable_timestamp();
}
}

diff --git a/net/socket.c b/net/socket.c
index 2c730fc..70c3a27 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -545,6 +545,20 @@ void sock_release(struct socket *sock)
sock->file = NULL;
}

+int sock_tx_timestamp(struct msghdr *msg, struct sock *sk,
+ union skb_shared_tx *shtx)
+{
+ shtx->flags = 0;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_HARDWARE)) {
+ shtx->hardware = 1;
+ }
+ if (sock_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE)) {
+ shtx->software = 1;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(sock_tx_timestamp);
+
static inline int __sock_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *msg, size_t size)
{
@@ -595,33 +609,65 @@ int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
return result;
}

+static int ktime2ts(ktime_t kt, struct timespec *ts)
+{
+ if (kt.tv64) {
+ *ts = ktime_to_timespec(kt);
+ return 1;
+ } else {
+ return 0;
+ }
+}
+
/*
* called from sock_recv_timestamp() if sock_flag(sk, SOCK_RCVTSTAMP)
*/
void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
struct sk_buff *skb)
{
- ktime_t kt = skb->tstamp;
-
- if (!sock_flag(sk, SOCK_RCVTSTAMPNS)) {
- struct timeval tv;
- /* Race occurred between timestamp enabling and packet
- receiving. Fill in the current time for now. */
- if (kt.tv64 == 0)
- kt = ktime_get_real();
- skb->tstamp = kt;
- tv = ktime_to_timeval(kt);
- put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP, sizeof(tv), &tv);
- } else {
- struct timespec ts;
- /* Race occurred between timestamp enabling and packet
- receiving. Fill in the current time for now. */
- if (kt.tv64 == 0)
- kt = ktime_get_real();
- skb->tstamp = kt;
- ts = ktime_to_timespec(kt);
- put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPNS, sizeof(ts), &ts);
+ int need_software_tstamp = sock_flag(sk, SOCK_RCVTSTAMP);
+ struct timespec ts[3];
+ int empty = 1;
+ struct skb_shared_hwtstamps *shhwtstamps =
+ skb_hwtstamps(skb);
+
+ /* Race occurred between timestamp enabling and packet
+ receiving. Fill in the current time for now. */
+ if (need_software_tstamp && skb->tstamp.tv64 == 0)
+ __net_timestamp(skb);
+
+ if (need_software_tstamp) {
+ if (!sock_flag(sk, SOCK_RCVTSTAMPNS)) {
+ struct timeval tv;
+ skb_get_timestamp(skb, &tv);
+ put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+ sizeof(tv), &tv);
+ } else {
+ struct timespec ts;
+ skb_get_timestampns(skb, &ts);
+ put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPNS,
+ sizeof(ts), &ts);
+ }
+ }
+
+
+ memset(ts, 0, sizeof(ts));
+ if (skb->tstamp.tv64 &&
+ sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE)) {
+ skb_get_timestampns(skb, ts + 0);
+ empty = 0;
+ }
+ if (shhwtstamps) {
+ if (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE) &&
+ ktime2ts(shhwtstamps->syststamp, ts + 1))
+ empty = 0;
+ if (sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE) &&
+ ktime2ts(shhwtstamps->hwtstamp, ts + 2))
+ empty = 0;
}
+ if (!empty)
+ put_cmsg(msg, SOL_SOCKET,
+ SCM_TIMESTAMPING, sizeof(ts), &ts);
}

EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
--
1.5.5.3
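
For readers of the SCM_TIMESTAMPING control message: __sock_recv_timestamp() above fills an array of three struct timespec, where slot 0 is the software time stamp, slot 1 the hardware time stamp transformed to system time and slot 2 the raw hardware time stamp. Slots that are not available stay zeroed. A minimal sketch of how a receiver might check which slots are filled; the enum and function names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Slot order as written by __sock_recv_timestamp(); names are hypothetical. */
enum ts_slot {
	TS_SOFTWARE = 0,	/* skb->tstamp */
	TS_SYS_HARDWARE = 1,	/* hardware stamp mapped to system time */
	TS_RAW_HARDWARE = 2,	/* untransformed hardware stamp */
	TS_SLOTS = 3
};

/* Returns a bitmask with bit N set if slot N holds a non-zero time stamp. */
static unsigned int ts_valid_mask(const struct timespec *ts)
{
	unsigned int mask = 0;
	int i;

	assert(ts != NULL);
	for (i = 0; i < TS_SLOTS; i++) {
		if (ts[i].tv_sec != 0 || ts[i].tv_nsec != 0)
			mask |= 1u << i;
	}
	return mask;
}
```

A caller would pass the payload of the control message (CMSG_DATA) after checking that its length covers all three entries.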

2009-01-21 10:13:14

by Patrick Ohly

Subject: [PATCH NET-NEXT 08/12] igb: stub support for SIOCSHWTSTAMP

---
drivers/net/igb/igb_main.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 022794e..d26dacb 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -34,6 +34,7 @@
#include <linux/ipv6.h>
#include <net/checksum.h>
#include <net/ip6_checksum.h>
+#include <linux/net_tstamp.h>
#include <linux/mii.h>
#include <linux/ethtool.h>
#include <linux/if_vlan.h>
@@ -4146,6 +4147,33 @@ static int igb_mii_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
}

/**
+ * igb_hwtstamp_ioctl - control hardware time stamping
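+ * @netdev: network interface device structure
+ * @ifr: interface request; ifr_data points to a struct hwtstamp_config
+ * @cmd: ioctl command, expected to be SIOCSHWTSTAMP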
+ *
+ * Currently cannot enable any kind of hardware time stamping, but
+ * supports SIOCSHWTSTAMP in general.
+ **/
+static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
+{
+ struct hwtstamp_config config;
+
+ if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
+ return -EFAULT;
+
+ /* reserved for future extensions */
+ if (config.flags)
+ return -EINVAL;
+
+ if (config.tx_type == HWTSTAMP_TX_OFF &&
+ config.rx_filter == HWTSTAMP_FILTER_NONE)
+ return 0;
+
+ return -ERANGE;
+}
+
+/**
* igb_ioctl -
* @netdev:
* @ifreq:
@@ -4158,6 +4186,8 @@ static int igb_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
case SIOCGMIIREG:
case SIOCSMIIREG:
return igb_mii_ioctl(netdev, ifr, cmd);
+ case SIOCSHWTSTAMP:
+ return igb_hwtstamp_ioctl(netdev, ifr, cmd);
default:
return -EOPNOTSUPP;
}
--
1.5.5.3

2009-01-21 10:12:39

by Patrick Ohly

Subject: [PATCH NET-NEXT 04/12] sockets: allow allocating skb with optional structures

The internal sock_alloc_send_pskb() is now exposed
as sock_alloc_send_skb_flags() and takes a flags
parameter with additional instructions for how the
skb is to be allocated. This is necessary for adding
send time stamping information to outgoing packets.

sock_alloc_send_skb() is turned into a simple wrapper
which preserves compatibility with code that
passes a boolean "noblock" instead of the flags
bitmask.

sock_alloc_send_pskb() was never called with non-zero
data_len, therefore the obsolete code and parameter
were removed.
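
The compatibility wrapper can be illustrated outside the kernel. The following is a sketch under assumptions: the MODEL_FLAGS_* bit values and struct model_request are hypothetical, the real SKB_FLAGS_* constants are defined elsewhere in this series. It shows why the wrapper maps "noblock" explicitly instead of passing it through: callers may hand in any non-zero value, for example MSG_DONTWAIT.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values standing in for the SKB_FLAGS_* constants. */
#define MODEL_FLAGS_NOBLOCK	(1 << 0)
#define MODEL_FLAGS_OPTIONAL	(1 << 1)

struct model_request {
	unsigned long size;
	bool nonblocking;	/* derived from MODEL_FLAGS_NOBLOCK */
	bool want_optional;	/* caller asked for optional structures */
};

/* New-style entry point: all allocation instructions travel in one bitmask. */
static struct model_request model_alloc_flags(unsigned long size, int flags)
{
	struct model_request req;

	/* unknown bits would indicate a caller passing a raw boolean */
	assert((flags & ~(MODEL_FLAGS_NOBLOCK | MODEL_FLAGS_OPTIONAL)) == 0);
	req.size = size;
	req.nonblocking = (flags & MODEL_FLAGS_NOBLOCK) != 0;
	req.want_optional = (flags & MODEL_FLAGS_OPTIONAL) != 0;
	return req;
}

/* Old-style wrapper: any non-zero "noblock" maps to the single NOBLOCK bit. */
static struct model_request model_alloc(unsigned long size, int noblock)
{
	return model_alloc_flags(size, noblock ? MODEL_FLAGS_NOBLOCK : 0);
}
```

Existing callers keep using the two-state interface, while new callers can request the optional structures through the flags variant.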
---
include/net/sock.h | 17 +++++++++++++----
net/core/sock.c | 50 +++++++-------------------------------------------
2 files changed, 20 insertions(+), 47 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 36807e4..6cb120c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -948,10 +948,19 @@ extern int sock_setsockopt(struct socket *sock, int level,
extern int sock_getsockopt(struct socket *sock, int level,
int op, char __user *optval,
int __user *optlen);
-extern struct sk_buff *sock_alloc_send_skb(struct sock *sk,
- unsigned long size,
- int noblock,
- int *errcode);
+extern struct sk_buff *sock_alloc_send_skb_flags(struct sock *sk,
+ unsigned long size,
+ int flags,
+ int *errcode);
+inline static struct sk_buff *sock_alloc_send_skb(struct sock *sk,
+ unsigned long size,
+ int noblock,
+ int *errcode)
+{
+ return sock_alloc_send_skb_flags(sk, size,
+ noblock ? SKB_FLAGS_NOBLOCK : 0,
+ errcode);
+}
extern void *sock_kmalloc(struct sock *sk, int size,
gfp_t priority);
extern void sock_kfree_s(struct sock *sk, void *mem, int size);
diff --git a/net/core/sock.c b/net/core/sock.c
index 2846274..c6618d0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1300,10 +1300,9 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
* Generic send/receive buffer handlers
*/

-static struct sk_buff *sock_alloc_send_pskb(struct sock *sk,
+struct sk_buff *sock_alloc_send_skb_flags(struct sock *sk,
unsigned long header_len,
- unsigned long data_len,
- int noblock, int *errcode)
+ int flags, int *errcode)
{
struct sk_buff *skb;
gfp_t gfp_mask;
@@ -1314,7 +1313,9 @@ static struct sk_buff *sock_alloc_send_pskb(struct sock *sk,
if (gfp_mask & __GFP_WAIT)
gfp_mask |= __GFP_REPEAT;

- timeo = sock_sndtimeo(sk, noblock);
+ timeo = sock_sndtimeo(sk,
+ (flags & SKB_FLAGS_NOBLOCK) ?
+ MSG_DONTWAIT : 0);
while (1) {
err = sock_error(sk);
if (err != 0)
@@ -1325,39 +1326,8 @@ static struct sk_buff *sock_alloc_send_pskb(struct sock *sk,
goto failure;

if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
- skb = alloc_skb(header_len, gfp_mask);
+ skb = __alloc_skb_flags(header_len, gfp_mask, flags, -1);
if (skb) {
- int npages;
- int i;
-
- /* No pages, we're done... */
- if (!data_len)
- break;
-
- npages = (data_len + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
- skb->truesize += data_len;
- skb_shinfo(skb)->nr_frags = npages;
- for (i = 0; i < npages; i++) {
- struct page *page;
- skb_frag_t *frag;
-
- page = alloc_pages(sk->sk_allocation, 0);
- if (!page) {
- err = -ENOBUFS;
- skb_shinfo(skb)->nr_frags = i;
- kfree_skb(skb);
- goto failure;
- }
-
- frag = &skb_shinfo(skb)->frags[i];
- frag->page = page;
- frag->page_offset = 0;
- frag->size = (data_len >= PAGE_SIZE ?
- PAGE_SIZE :
- data_len);
- data_len -= PAGE_SIZE;
- }
-
/* Full success... */
break;
}
@@ -1384,12 +1354,6 @@ failure:
return NULL;
}

-struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
- int noblock, int *errcode)
-{
- return sock_alloc_send_pskb(sk, size, 0, noblock, errcode);
-}
-
static void __lock_sock(struct sock *sk)
{
DEFINE_WAIT(wait);
@@ -2323,7 +2287,7 @@ subsys_initcall(proto_init);
EXPORT_SYMBOL(sk_alloc);
EXPORT_SYMBOL(sk_free);
EXPORT_SYMBOL(sk_send_sigurg);
-EXPORT_SYMBOL(sock_alloc_send_skb);
+EXPORT_SYMBOL(sock_alloc_send_skb_flags);
EXPORT_SYMBOL(sock_init_data);
EXPORT_SYMBOL(sock_kfree_s);
EXPORT_SYMBOL(sock_kmalloc);
--
1.5.5.3

2009-01-21 10:13:47

by Patrick Ohly

Subject: [PATCH NET-NEXT 07/12] net: pass new SIOCSHWTSTAMP through to device drivers

---
fs/compat_ioctl.c | 1 +
net/core/dev.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 5235c67..a5001a6 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -2555,6 +2555,7 @@ HANDLE_IOCTL(SIOCSIFMAP, dev_ifsioc)
HANDLE_IOCTL(SIOCGIFADDR, dev_ifsioc)
HANDLE_IOCTL(SIOCSIFADDR, dev_ifsioc)
HANDLE_IOCTL(SIOCSIFHWBROADCAST, dev_ifsioc)
+HANDLE_IOCTL(SIOCSHWTSTAMP, dev_ifsioc)

/* ioctls used by appletalk ddp.c */
HANDLE_IOCTL(SIOCATALKDIFADDR, dev_ifsioc)
diff --git a/net/core/dev.c b/net/core/dev.c
index 93b2ac9..370c9f1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3864,6 +3864,7 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd)
cmd == SIOCSMIIREG ||
cmd == SIOCBRADDIF ||
cmd == SIOCBRDELIF ||
+ cmd == SIOCSHWTSTAMP ||
cmd == SIOCWANDEV) {
err = -EOPNOTSUPP;
if (ops->ndo_do_ioctl) {
@@ -4018,6 +4019,7 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
case SIOCBONDCHANGEACTIVE:
case SIOCBRADDIF:
case SIOCBRDELIF:
+ case SIOCSHWTSTAMP:
if (!capable(CAP_NET_ADMIN))
return -EPERM;
/* fall through */
--
1.5.5.3

2009-01-21 10:14:23

by Patrick Ohly

Subject: [PATCH NET-NEXT 06/12] debug: NULL pointer check in ip_output

---
net/ipv4/ip_output.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index ed92f0b..03a6706 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -950,8 +950,14 @@ alloc_new_skb:
skb->ip_summed = csummode;
skb->csum = 0;
skb_reserve(skb, hh_len);
- if (ipc->shtx.flags)
- *skb_tx(skb) = ipc->shtx;
+ if (ipc->shtx.flags) {
+ if (skb_tx(skb))
+ *skb_tx(skb) = ipc->shtx;
+ else
+ printk(KERN_DEBUG
+ "ERROR: skb with flags %x and no tx ptr\n",
+ ipc->shtx.flags);
+ }

/*
* Find where to start putting bytes.
--
1.5.5.3

2009-01-21 10:14:50

by Patrick Ohly

Subject: [PATCH NET-NEXT 11/12] time sync: generic infrastructure to map between time stamps generated by a time counter and system time

Currently only mapping from time counter to system time is implemented.
The interface could have been made more versatile by not depending on a time counter,
but this wasn't done to avoid writing glue code elsewhere.

The method implemented here is the one used and analyzed under the name
"assisted PTP" in the LCI PTP paper:
http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf
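
The mapping itself is a linear correction: system time = HW time + offset + (HW time - last_update) * skew / 2^30, as documented for struct clocksync in this patch. A small user-space sketch with standard integer types; struct model_sync and model_hw2sys() are hypothetical names that mirror clocksync_hw2sys(), not the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* same fixed-point scale as CLOCKSYNC_SKEW_RESOLUTION in the patch */
#define MODEL_SKEW_RESOLUTION ((int64_t)1 << 30)

struct model_sync {
	int64_t offset;		/* system time - HW time at last_update */
	int64_t skew;		/* drift per HW nanosecond, scaled by 2^30 */
	uint64_t last_update;	/* HW time of the last offset measurement */
};

/* Map a HW time stamp (in ns) to system time (in ns). */
static uint64_t model_hw2sys(const struct model_sync *s, uint64_t hw)
{
	int64_t elapsed;

	assert(s != NULL);
	/* may be negative for stamps taken before the last update */
	elapsed = (int64_t)(hw - s->last_update);
	return hw + (uint64_t)s->offset +
	       (uint64_t)(elapsed * s->skew / MODEL_SKEW_RESOLUTION);
}
```

With offset = 1000, last_update = 5000 and skew = 2^20 (one extra nanosecond per 1024 ns of HW time), a HW stamp of 1029000 lies 1024000 ns after the last update, so the skew term adds 1000 and the result is 1029000 + 1000 + 1000 = 1031000.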
---
include/linux/clocksync.h | 85 +++++++++++++++++++
kernel/time/Makefile | 2 +-
kernel/time/clocksync.c | 196 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 282 insertions(+), 1 deletions(-)
create mode 100644 include/linux/clocksync.h
create mode 100644 kernel/time/clocksync.c

diff --git a/include/linux/clocksync.h b/include/linux/clocksync.h
new file mode 100644
index 0000000..d2f93f8
--- /dev/null
+++ b/include/linux/clocksync.h
@@ -0,0 +1,85 @@
+/*
+ * Utility code which helps transforming between hardware time stamps
+ * generated by a timecounter and system time. The timecounter is
+ * assumed to return monotonically increasing time (but this code does
+ * its best to compensate if that is not the case) whereas system time
+ * may jump.
+ */
+#ifndef _LINUX_CLOCKSYNC_H
+#define _LINUX_CLOCKSYNC_H
+
+#include <linux/clocksource.h>
+#include <linux/ktime.h>
+
+/**
+ * struct clocksync - stores state and configuration for the two clocks
+ *
+ * Initialize to zero, then set clock, systime, num_samples.
+ *
+ * Transformation between HW time and system time is done with:
+ * HW time transformed = HW time + offset +
+ * (HW time - last_update) * skew / CLOCKSYNC_SKEW_RESOLUTION
+ *
+ * @clock: the source for HW time stamps (%timecounter_read)
+ * @systime: function returning current system time (ktime_get
+ * for monotonic time, or ktime_get_real for wall clock)
+ * @num_samples: number of times that HW time and system time are to
+ * be compared when determining their offset
+ * @offset: (system time - HW time) at the time of the last update
+ * @skew: average (system time - HW time) / delta HW time *
+ * CLOCKSYNC_SKEW_RESOLUTION
+ * @last_update: last HW time stamp when clock offset was measured
+ */
+struct clocksync {
+ struct timecounter *clock;
+ ktime_t (*systime)(void);
+ int num_samples;
+
+ s64 offset;
+ s64 skew;
+ u64 last_update;
+};
+
+/**
+ * clocksync_hw2sys - transform HW time stamp into corresponding system time
+ * @sync: context for clock sync
+ * @hwtstamp: the result of timecounter_read() or
+ * timecounter_cyc2time()
+ */
+extern ktime_t clocksync_hw2sys(struct clocksync *sync,
+ u64 hwtstamp);
+
+/**
+ * clocksync_offset - measure current (system time - HW time) offset
+ * @sync: context for clock sync
+ * @offset: average offset during sample period returned here
+ * @hwtstamp: average HW time during sample period returned here
+ *
+ * Returns number of samples used. Might be zero (= no result) in the
+ * unlikely case that system time was monotonically decreasing for all
+ * samples (= broken).
+ */
+extern int clocksync_offset(struct clocksync *sync,
+ s64 *offset,
+ u64 *hwtstamp);
+
+extern void __clocksync_update(struct clocksync *sync,
+ u64 hwtstamp);
+
+/**
+ * clocksync_update - update offset and skew by measuring current offset
+ * @sync: context for clock sync
+ * @hwtstamp: the result of timecounter_read() or
+ * timecounter_cyc2time(), pass zero to force update
+ *
+ * Updates are done at most once per second.
+ */
+static inline void clocksync_update(struct clocksync *sync,
+ u64 hwtstamp)
+{
+ if (!hwtstamp ||
+ (s64)(hwtstamp - sync->last_update) >= NSEC_PER_SEC)
+ __clocksync_update(sync, hwtstamp);
+}
+
+#endif /* _LINUX_CLOCKSYNC_H */
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 905b0b5..6279fb0 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -1,4 +1,4 @@
-obj-y += timekeeping.o ntp.o clocksource.o jiffies.o timer_list.o
+obj-y += timekeeping.o ntp.o clocksource.o jiffies.o timer_list.o clocksync.o

obj-$(CONFIG_GENERIC_CLOCKEVENTS_BUILD) += clockevents.o
obj-$(CONFIG_GENERIC_CLOCKEVENTS) += tick-common.o
diff --git a/kernel/time/clocksync.c b/kernel/time/clocksync.c
new file mode 100644
index 0000000..0a2e54b
--- /dev/null
+++ b/kernel/time/clocksync.c
@@ -0,0 +1,196 @@
+/*
+ * Utility code which helps transforming between hardware time stamps
+ * generated by a timecounter and system time.
+ *
+ * Copyright (C) 2008 Intel, Patrick Ohly ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/clocksync.h>
+#include <linux/module.h>
+#include <linux/math64.h>
+
+/*
+ * fixed point arithmetic scale factor for skew
+ *
+ * Usually one would measure skew in ppb (parts per billion, 1e9), but
+ * using a power of 2 as scale factor simplifies the math.
+ */
+#define CLOCKSYNC_SKEW_RESOLUTION (((s64)1)<<30)
+
+ktime_t clocksync_hw2sys(struct clocksync *sync,
+ u64 hwtstamp)
+{
+ u64 nsec;
+
+ nsec = hwtstamp + sync->offset;
+ nsec += (s64)(hwtstamp - sync->last_update) * sync->skew /
+ CLOCKSYNC_SKEW_RESOLUTION;
+
+ return ns_to_ktime(nsec);
+}
+EXPORT_SYMBOL(clocksync_hw2sys);
+
+int clocksync_offset(struct clocksync *sync,
+ s64 *offset,
+ u64 *hwtstamp)
+{
+ u64 starthw = 0, endhw = 0;
+ struct {
+ s64 offset;
+ s64 duration_sys;
+ } buffer[10], sample, *samples;
+ int counter = 0, i;
+ int used;
+ int index;
+ int num_samples = sync->num_samples;
+
+ if (num_samples > sizeof(buffer)/sizeof(buffer[0])) {
+ samples = kmalloc(sizeof(*samples) * num_samples, GFP_ATOMIC);
+ if (!samples) {
+ samples = buffer;
+ num_samples = sizeof(buffer)/sizeof(buffer[0]);
+ }
+ } else {
+ samples = buffer;
+ }
+
+ /* run until we have enough valid samples, but do not try forever */
+ i = 0;
+ counter = 0;
+ while (1) {
+ u64 ts;
+ ktime_t start, end;
+
+ start = sync->systime();
+ ts = timecounter_read(sync->clock);
+ end = sync->systime();
+
+ if (!i) {
+ starthw = ts;
+ }
+
+ /* ignore negative durations */
+ sample.duration_sys = ktime_to_ns(ktime_sub(end, start));
+ if (sample.duration_sys >= 0) {
+ /*
+ * assume symmetric delay to and from HW: average system time
+ * corresponds to measured HW time
+ */
+ sample.offset = ktime_to_ns(ktime_add(end, start)) / 2 -
+ ts;
+
+ /* simple insertion sort based on duration */
+ index = counter - 1;
+ while (index >= 0) {
+ if(samples[index].duration_sys < sample.duration_sys) {
+ break;
+ }
+ samples[index + 1] = samples[index];
+ index--;
+ }
+ samples[index + 1] = sample;
+ counter++;
+ }
+
+ i++;
+ if (counter >= num_samples || i >= 100000) {
+ endhw = ts;
+ break;
+ }
+ }
+
+ *hwtstamp = (endhw + starthw) / 2;
+
+ /* remove outliers by only using 75% of the samples */
+ used = counter * 3 / 4;
+ if (!used) {
+ used = counter;
+ }
+ if (used) {
+ /* calculate average */
+ s64 off = 0;
+ for (index = 0; index < used; index++) {
+ off += samples[index].offset;
+ }
+ *offset = div_s64(off, used);
+ }
+
+ if (samples && samples != buffer)
+ kfree(samples);
+
+ return used;
+}
+EXPORT_SYMBOL(clocksync_offset);
+
+void __clocksync_update(struct clocksync *sync,
+ u64 hwtstamp)
+{
+ s64 offset;
+ u64 average_time;
+
+ if (!clocksync_offset(sync, &offset, &average_time))
+ return;
+
+ printk(KERN_DEBUG
+ "average offset: %lld\n", offset);
+
+ if (!sync->last_update) {
+ sync->last_update = average_time;
+ sync->offset = offset;
+ sync->skew = 0;
+ } else {
+ s64 delta_nsec = average_time - sync->last_update;
+
+ /* avoid division by negative or small deltas */
+ if (delta_nsec >= 10000) {
+ s64 delta_offset_nsec = offset - sync->offset;
+ s64 skew; /* delta_offset_nsec * CLOCKSYNC_SKEW_RESOLUTION /
+ delta_nsec */
+ u64 divisor;
+
+ /* div_s64() is limited to 32 bit divisor */
+ skew = delta_offset_nsec * CLOCKSYNC_SKEW_RESOLUTION;
+ divisor = delta_nsec;
+ while (unlikely(divisor >= ((s64)1) << 32)) {
+ /* divide both by 2; beware, the result of
+ right-shifting a negative value is
+ implementation-defined, so a shift is
+ only used for the positive divisor */
+ skew = div_s64(skew, 2);
+ divisor >>= 1;
+ }
+ skew = div_s64(skew, divisor);
+
+ /*
+ * Calculate new overall skew as 4/16 the
+ * old value and 12/16 the new one. This is
+ * a rather arbitrary tradeoff between
+ * only using the latest measurement (0/16 and
+ * 16/16) and even more weight on past measurements.
+ */
+#define CLOCKSYNC_NEW_SKEW_PER_16 12
+ sync->skew =
+ div_s64((16 - CLOCKSYNC_NEW_SKEW_PER_16) *
+ sync->skew +
+ CLOCKSYNC_NEW_SKEW_PER_16 * skew,
+ 16);
+ sync->last_update = average_time;
+ sync->offset = offset;
+ }
+ }
+}
+EXPORT_SYMBOL(__clocksync_update);
--
1.5.5.3
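
The outlier handling in clocksync_offset() boils down to two steps: keep the samples sorted by how long the measurement took, then average only the fastest 75% of them. A user-space sketch of just that part; struct model_sample and both functions are hypothetical names, and the sampling loop around timecounter_read() is left out.

```c
#include <assert.h>
#include <stdint.h>

struct model_sample {
	int64_t offset;		/* system time - HW time, in ns */
	int64_t duration;	/* how long taking the sample took, in ns */
};

/*
 * Insert s into samples[0..count-1], which is sorted by ascending
 * duration. The array must have room for count + 1 entries.
 * Returns the new number of entries.
 */
static int model_insert(struct model_sample *samples, int count,
			struct model_sample s)
{
	int i = count - 1;

	assert(count >= 0);
	while (i >= 0 && samples[i].duration >= s.duration) {
		samples[i + 1] = samples[i];
		i--;
	}
	samples[i + 1] = s;
	return count + 1;
}

/*
 * Average the offsets of the fastest 75% of the samples.
 * Returns the number of samples used, zero if there were none.
 */
static int model_average(const struct model_sample *samples, int count,
			 int64_t *offset)
{
	int used = count * 3 / 4;
	int64_t sum = 0;
	int i;

	if (!used)
		used = count;
	if (!used)
		return 0;
	for (i = 0; i < used; i++)
		sum += samples[i].offset;
	*offset = sum / used;
	return used;
}
```

Slow samples are the ones most likely to have been disturbed between the two system time reads, which is why they are sorted to the end and dropped.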

2009-01-21 10:15:22

by Patrick Ohly

Subject: [PATCH NET-NEXT 10/12] igb: access to NIC time

Adds the register definitions and code to read the time
register.
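
The arithmetic behind the scaled NIC clock can be checked in user space. This sketch assumes, based on the comments in this patch, that the counter advances by TIMINCA on every NIC clock tick and that nanoseconds are recovered by shifting right by the scale factor. The model_* names are hypothetical, and the register access itself (rd32 of SYSTIML and SYSTIMH) is replaced by plain function arguments.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_TSYNC_SHIFT	19	/* same value as IGB_TSYNC_SHIFT */
#define MODEL_CYCLE_NS		16	/* same value as IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS */

/* Combine the low and high 32-bit halves into one 64-bit counter value. */
static uint64_t model_combine(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* Increment added to the counter per NIC clock tick. */
static uint32_t model_timinca(void)
{
	uint32_t inc = (uint32_t)MODEL_CYCLE_NS << MODEL_TSYNC_SHIFT;

	/* the increment has to fit into 24 bits */
	assert(inc < (1u << 24));
	return inc;
}

/* Convert a scaled counter value to nanoseconds (assumed: plain shift). */
static uint64_t model_counter_to_ns(uint64_t counter)
{
	return counter >> MODEL_TSYNC_SHIFT;
}
```

With these values the increment is 16 * 2^19 = 2^23, which leaves one bit of headroom below the 24-bit limit for frequency adjustments.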
---
drivers/net/igb/e1000_regs.h | 28 +++++++++++
drivers/net/igb/igb.h | 4 ++
drivers/net/igb/igb_main.c | 106 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index bdf5d83..d225601 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -75,6 +75,34 @@
#define E1000_FCRTH 0x02168 /* Flow Control Receive Threshold High - RW */
#define E1000_RDFPCQ(_n) (0x02430 + (0x4 * (_n)))
#define E1000_FCRTV 0x02460 /* Flow Control Refresh Timer Value - RW */
+
+/* IEEE 1588 TIMESYNCH */
+#define E1000_TSYNCTXCTL 0x0B614
+#define E1000_TSYNCRXCTL 0x0B620
+#define E1000_TSYNCRXCFG 0x05F50
+
+#define E1000_SYSTIML 0x0B600
+#define E1000_SYSTIMH 0x0B604
+#define E1000_TIMINCA 0x0B608
+
+#define E1000_RXMTRL 0x0B634
+#define E1000_RXSTMPL 0x0B624
+#define E1000_RXSTMPH 0x0B628
+#define E1000_RXSATRL 0x0B62C
+#define E1000_RXSATRH 0x0B630
+
+#define E1000_TXSTMPL 0x0B618
+#define E1000_TXSTMPH 0x0B61C
+
+#define E1000_ETQF0 0x05CB0
+#define E1000_ETQF1 0x05CB4
+#define E1000_ETQF2 0x05CB8
+#define E1000_ETQF3 0x05CBC
+#define E1000_ETQF4 0x05CC0
+#define E1000_ETQF5 0x05CC4
+#define E1000_ETQF6 0x05CC8
+#define E1000_ETQF7 0x05CCC
+
/* Split and Replication RX Control - RW */
/*
* Convenience macros
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 5a27825..2cf2c1a 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -34,6 +34,8 @@
#include "e1000_mac.h"
#include "e1000_82575.h"

+#include <linux/clocksource.h>
+
struct igb_adapter;

#ifdef CONFIG_IGB_LRO
@@ -262,6 +264,8 @@ struct igb_adapter {
struct napi_struct napi;
struct pci_dev *pdev;
struct net_device_stats net_stats;
+ struct cyclecounter cycles;
+ struct timecounter clock;

/* structs defined in e1000_hw.h */
struct e1000_hw hw;
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index d26dacb..7ab6faf 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -179,6 +179,54 @@ MODULE_DESCRIPTION("Intel(R) Gigabit Ethernet Network Driver");
MODULE_LICENSE("GPL");
MODULE_VERSION(DRV_VERSION);

+/**
+ * Scale the NIC clock cycle by a large factor so that
+ * relatively small clock corrections can be added or
+ * subtracted at each clock tick. The drawbacks of a
+ * large factor are a) that the clock register overflows
+ * more quickly (not such a big deal) and b) that the
+ * increment per tick has to fit into 24 bits.
+ *
+ * Note that
+ * TIMINCA = IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS *
+ * IGB_TSYNC_SCALE
+ * TIMINCA += TIMINCA * adjustment [ppm] / 1e9
+ *
+ * The base scale factor is intentionally a power of two
+ * so that the division in %struct timecounter can be done with
+ * a shift.
+ */
+#define IGB_TSYNC_SHIFT (19)
+#define IGB_TSYNC_SCALE (1<<IGB_TSYNC_SHIFT)
+
+/**
+ * The duration of one clock cycle of the NIC.
+ *
+ * @todo This hard-coded value is part of the specification and might change
+ * in future hardware revisions. Add revision check.
+ */
+#define IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS 16
+
+#if (IGB_TSYNC_SCALE * IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS) >= (1<<24)
+# error IGB_TSYNC_SCALE and/or IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS are too large to fit into TIMINCA
+#endif
+
+/**
+ * igb_read_clock - read raw cycle counter (to be used by time counter)
+ */
+static cycle_t igb_read_clock(const struct cyclecounter *tc)
+{
+ struct igb_adapter *adapter =
+ container_of(tc, struct igb_adapter, cycles);
+ struct e1000_hw *hw = &adapter->hw;
+ u64 stamp;
+
+ stamp = rd32(E1000_SYSTIML);
+ stamp |= (u64)rd32(E1000_SYSTIMH) << 32ULL;
+
+ return stamp;
+}
+
#ifdef DEBUG
/**
* igb_get_hw_dev_name - return device name string
@@ -189,6 +237,28 @@ char *igb_get_hw_dev_name(struct e1000_hw *hw)
struct igb_adapter *adapter = hw->back;
return adapter->netdev->name;
}
+
+/**
+ * igb_get_time_str - format current NIC and system time as string
+ */
+static char *igb_get_time_str(struct igb_adapter *adapter,
+ char buffer[160])
+{
+ cycle_t hw = adapter->cycles.read(&adapter->cycles);
+ struct timespec nic = ns_to_timespec(timecounter_read(&adapter->clock));
+ struct timespec sys;
+ struct timespec delta;
+ getnstimeofday(&sys);
+
+ delta = timespec_sub(nic, sys);
+
+ sprintf(buffer, "NIC %ld.%09lus, SYS %ld.%09lus, NIC-SYS %lds + %09luns",
+ (long)nic.tv_sec, nic.tv_nsec,
+ (long)sys.tv_sec, sys.tv_nsec,
+ (long)delta.tv_sec, delta.tv_nsec);
+
+ return buffer;
+}
#endif

/**
@@ -1322,6 +1392,42 @@ static int __devinit igb_probe(struct pci_dev *pdev,
}
#endif

+ /*
+ * Initialize hardware timer: we keep it running in case
+ * some program needs it later on.
+ */
+ memset(&adapter->cycles, 0, sizeof(adapter->cycles));
+ adapter->cycles.read = igb_read_clock;
+ adapter->cycles.mask = CLOCKSOURCE_MASK(64);
+ adapter->cycles.mult = 1;
+ adapter->cycles.shift = IGB_TSYNC_SHIFT;
+ wr32(E1000_TIMINCA, (1<<24) | IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS * IGB_TSYNC_SCALE);
+#if 0
+ /*
+ * Avoid rollover while we initialize by resetting the time counter.
+ */
+ wr32(E1000_SYSTIML, 0x00000000);
+ wr32(E1000_SYSTIMH, 0x00000000);
+#else
+ /*
+ * Set registers so that rollover occurs soon to test this.
+ */
+ wr32(E1000_SYSTIML, 0x00000000);
+ wr32(E1000_SYSTIMH, 0xFF800000);
+#endif
+ wrfl();
+ timecounter_init(&adapter->clock, &adapter->cycles, ktime_to_ns(ktime_get_real()));
+
+#ifdef DEBUG
+ {
+ char buffer[160];
+ printk(KERN_DEBUG
+ "igb: %s: hw %p initialized timer\n",
+ igb_get_time_str(adapter, buffer),
+ &adapter->hw);
+ }
+#endif
+
dev_info(&pdev->dev, "Intel(R) Gigabit Ethernet Network Connection\n");
/* print bus type/speed/width info */
dev_info(&pdev->dev, "%s: (PCIe:%s:%s) %pM\n",
--
1.5.5.3
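
The TIMINCA arithmetic described in the igb_main.c comment of this patch can be checked in user space. The following is a minimal sketch, not driver code: the helper names are hypothetical, the constants are copied from the patch, and the adjustment is treated as parts per billion, which is what the division by 1e9 in the comment implies (the comment itself labels the unit as ppm).

```c
#include <assert.h>
#include <stdint.h>

#define TSYNC_SHIFT 19u                 /* IGB_TSYNC_SHIFT in the patch */
#define TSYNC_SCALE (1u << TSYNC_SHIFT) /* IGB_TSYNC_SCALE in the patch */
#define CYCLE_NS    16u                 /* IGB_TSYNC_CYCLE_TIME_IN_NANOSECONDS */

/* Value added to SYSTIM on every NIC clock tick (low 24 bits of TIMINCA). */
uint32_t timinca_increment(void)
{
	uint32_t inc = CYCLE_NS * TSYNC_SCALE;

	/* Same bound as the #error check in the patch. */
	assert(inc < (1u << 24));
	return inc;
}

/*
 * Hypothetical helper: increment after a frequency adjustment.
 * Assumption: the adjustment is given in parts per billion.
 */
uint32_t timinca_adjusted(int32_t ppb)
{
	int64_t inc = timinca_increment();

	return (uint32_t)(inc + inc * ppb / 1000000000LL);
}

/* What cyclecounter_cyc2ns() computes with mult = 1, shift = 19. */
uint64_t systim_to_ns(uint64_t systim)
{
	return (systim * 1u) >> TSYNC_SHIFT;
}

/* Whole seconds until a 64 bit SYSTIM that starts at zero wraps around. */
uint64_t systim_wrap_seconds(void)
{
	uint64_t ns = UINT64_MAX >> TSYNC_SHIFT;

	return ns / 1000000000ull;
}
```

With these constants the increment is 16 * 2^19 = 2^23, so 1000 ticks convert to 16000 ns and a counter starting at zero wraps after about 35184 seconds (a bit under ten hours). Under the same arithmetic, the test preset SYSTIMH = 0xFF800000 leaves 2^55 raw units, i.e. 2^36 ns or roughly 69 seconds, before the rollover.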

2009-01-21 10:15:49

by Patrick Ohly

Subject: [PATCH NET-NEXT 09/12] clocksource: allow usage independent of timekeeping.c

So far struct clocksource acted as the interface between time/timekeeping.c
and hardware. This patch generalizes the concept so that a similar
interface can also be used in other contexts. For that it introduces
new structures and related functions *without* touching the existing
struct clocksource.

The reasons for adding these new structures to clocksource.[ch] are
* the APIs are clearly related
* struct clocksource could be cleaned up to use the new structs
* avoids proliferation of files with similar names (timesource.h?
timecounter.h?)

As outlined in the discussion with John Stultz, this patch adds
* struct cyclecounter: stateless API to hardware which counts clock cycles
* struct timecounter: stateful utility code built on a cyclecounter which
provides a nanosecond counter
* only the function to read the nanosecond counter; deltas are used internally
and not exposed to users of timecounter

The code does no locking of the shared state. It must be called at least
as often as the cycle counter wraps around so that these wrap-arounds are
detected. Both are the responsibility of the timecounter user.

Acked-by: John Stultz <[email protected]>
---
include/linux/clocksource.h | 99 +++++++++++++++++++++++++++++++++++++++++++
kernel/time/clocksource.c | 76 +++++++++++++++++++++++++++++++++
2 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index f88d32f..d379189 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -22,8 +22,107 @@ typedef u64 cycle_t;
struct clocksource;

/**
+ * struct cyclecounter - hardware abstraction for a free running counter
+ * Provides completely state-free accessors to the underlying hardware.
+ * Depending on which hardware it reads, the cycle counter may wrap
+ * around quickly. Locking rules (if necessary) have to be defined
+ * by the implementor and user of specific instances of this API.
+ *
+ * @read: returns the current cycle value
+ * @mask: bitmask for two's complement
+ * subtraction of non 64 bit counters,
+ * see CLOCKSOURCE_MASK() helper macro
+ * @mult: cycle to nanosecond multiplier
+ * @shift: cycle to nanosecond divisor (power of two)
+ */
+struct cyclecounter {
+ cycle_t (*read)(const struct cyclecounter *cc);
+ cycle_t mask;
+ u32 mult;
+ u32 shift;
+};
+
+/**
+ * struct timecounter - layer above a %struct cyclecounter which counts nanoseconds
+ * Contains the state needed by timecounter_read() to detect
+ * cycle counter wrap around. Initialize with
+ * timecounter_init(). Also used to convert cycle counts into the
+ * corresponding nanosecond counts with timecounter_cyc2time(). Users
+ * of this code are responsible for initializing the underlying
+ * cycle counter hardware, locking issues and reading the time
+ * more often than the cycle counter wraps around. The nanosecond
+ * counter will only wrap around after ~585 years.
+ *
+ * @cc: the cycle counter used by this instance
+ * @cycle_last: most recent cycle counter value seen by timecounter_read()
+ * @nsec: continuously increasing nanosecond count, updated by timecounter_read()
+ */
+struct timecounter {
+ const struct cyclecounter *cc;
+ cycle_t cycle_last;
+ u64 nsec;
+};
+
+/**
+ * cyclecounter_cyc2ns - converts cycle counter cycles to nanoseconds
+ * @cc: Pointer to cycle counter.
+ * @cycles: Cycles
+ *
+ * XXX - This could use some mult_lxl_ll() asm optimization. Same code
+ * as in cyc2ns, but with unsigned result.
+ */
+static inline u64 cyclecounter_cyc2ns(const struct cyclecounter *cc, cycle_t cycles)
+{
+ u64 ret = (u64)cycles;
+ ret = (ret * cc->mult) >> cc->shift;
+ return ret;
+}
+
+/**
+ * timecounter_init - initialize a time counter
+ * @tc: Pointer to time counter which is to be initialized/reset
+ * @cc: A cycle counter, ready to be used.
+ * @start_tstamp: Arbitrary initial time stamp.
+ *
+ * After this call the current cycle register (roughly) corresponds to
+ * the initial time stamp. Every call to timecounter_read() increments
+ * the time stamp counter by the number of elapsed nanoseconds.
+ */
+extern void timecounter_init(struct timecounter *tc,
+ const struct cyclecounter *cc,
+ u64 start_tstamp);
+
+/**
+ * timecounter_read - return nanoseconds elapsed since timecounter_init()
+ * plus the initial time stamp
+ * @tc: Pointer to time counter.
+ *
+ * In other words, keeps track of time since the same epoch as
+ * the function which generated the initial time stamp.
+ */
+extern u64 timecounter_read(struct timecounter *tc);
+
+/**
+ * timecounter_cyc2time - convert a cycle counter to same
+ * time base as values returned by
+ * timecounter_read()
+ * @tc: Pointer to time counter.
+ * @cycle_tstamp: a value returned by tc->cc->read()
+ *
+ * Cycle counts are converted correctly as long as they
+ * fall into the interval [-1/2 max cycle count, +1/2 max cycle count],
+ * with "max cycle count" == tc->cc->mask+1.
+ *
+ * This allows conversion of cycle counter values which were generated
+ * in the past.
+ */
+extern u64 timecounter_cyc2time(struct timecounter *tc,
+ cycle_t cycle_tstamp);
+
+/**
* struct clocksource - hardware abstraction for a free running counter
* Provides mostly state-free accessors to the underlying hardware.
+ * This is the structure used for system time.
*
* @name: ptr to clocksource name
* @list: list head for registration
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 9ed2eec..0d7a2cb 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -31,6 +31,82 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>

+void timecounter_init(struct timecounter *tc,
+ const struct cyclecounter *cc,
+ u64 start_tstamp)
+{
+ tc->cc = cc;
+ tc->cycle_last = cc->read(cc);
+ tc->nsec = start_tstamp;
+}
+EXPORT_SYMBOL(timecounter_init);
+
+/**
+ * timecounter_read_delta - get nanoseconds since last call of this function
+ * @tc: Pointer to time counter
+ *
+ * When the underlying cycle counter runs over, this will be handled
+ * correctly as long as it does not run over more than once between
+ * calls.
+ *
+ * The first call to this function for a new time counter initializes
+ * the time tracking and returns bogus results.
+ */
+static u64 timecounter_read_delta(struct timecounter *tc)
+{
+ cycle_t cycle_now, cycle_delta;
+ u64 ns_offset;
+
+ /* read cycle counter: */
+ cycle_now = tc->cc->read(tc->cc);
+
+ /* calculate the delta since the last timecounter_read_delta(): */
+ cycle_delta = (cycle_now - tc->cycle_last) & tc->cc->mask;
+
+ /* convert to nanoseconds: */
+ ns_offset = cyclecounter_cyc2ns(tc->cc, cycle_delta);
+
+ /* update time stamp of timecounter_read_delta() call: */
+ tc->cycle_last = cycle_now;
+
+ return ns_offset;
+}
+
+u64 timecounter_read(struct timecounter *tc)
+{
+ u64 nsec;
+
+ /* increment time by nanoseconds since last call */
+ nsec = timecounter_read_delta(tc);
+ nsec += tc->nsec;
+ tc->nsec = nsec;
+
+ return nsec;
+}
+EXPORT_SYMBOL(timecounter_read);
+
+u64 timecounter_cyc2time(struct timecounter *tc,
+ cycle_t cycle_tstamp)
+{
+ u64 cycle_delta = (cycle_tstamp - tc->cycle_last) & tc->cc->mask;
+ u64 nsec;
+
+ /*
+ * Instead of always treating cycle_tstamp as more recent
+ * than tc->cycle_last, detect when it is too far in the
+ * future and treat it as old time stamp instead.
+ */
+ if (cycle_delta > tc->cc->mask / 2) {
+ cycle_delta = (tc->cycle_last - cycle_tstamp) & tc->cc->mask;
+ nsec = tc->nsec - cyclecounter_cyc2ns(tc->cc, cycle_delta);
+ } else {
+ nsec = cyclecounter_cyc2ns(tc->cc, cycle_delta) + tc->nsec;
+ }
+
+ return nsec;
+}
+EXPORT_SYMBOL(timecounter_cyc2time);
+
/* XXX - Would like a better way for initializing curr_clocksource */
extern struct clocksource clocksource_jiffies;

--
1.5.5.3
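
The wrap-around handling in timecounter_read() and timecounter_cyc2time() is easier to follow with a small counter. Below is a user space sketch that mirrors the logic of this patch under stated assumptions: the struct layouts are copied from the patch with stdint types, while fake_cycles, fake_read() and the tc_* function names are hypothetical stand-ins for real hardware and for the kernel functions.

```c
#include <assert.h>
#include <stdint.h>

struct cyclecounter {
	uint64_t (*read)(const struct cyclecounter *cc);
	uint64_t mask;
	uint32_t mult;
	uint32_t shift;
};

struct timecounter {
	const struct cyclecounter *cc;
	uint64_t cycle_last;
	uint64_t nsec;
};

/* Hypothetical stand-in for a hardware register; tests set it directly. */
uint64_t fake_cycles;

uint64_t fake_read(const struct cyclecounter *cc)
{
	return fake_cycles & cc->mask;
}

uint64_t cyc2ns(const struct cyclecounter *cc, uint64_t cycles)
{
	return (cycles * cc->mult) >> cc->shift;
}

void tc_init(struct timecounter *tc, const struct cyclecounter *cc,
	     uint64_t start_tstamp)
{
	assert(cc && cc->read);
	tc->cc = cc;
	tc->cycle_last = cc->read(cc);
	tc->nsec = start_tstamp;
}

/* Masked subtraction yields the right delta across one wrap-around. */
uint64_t tc_read(struct timecounter *tc)
{
	uint64_t now = tc->cc->read(tc->cc);
	uint64_t delta = (now - tc->cycle_last) & tc->cc->mask;

	tc->cycle_last = now;
	tc->nsec += cyc2ns(tc->cc, delta);
	return tc->nsec;
}

/* Deltas above half the counter range are treated as stamps from the past. */
uint64_t tc_cyc2time(const struct timecounter *tc, uint64_t stamp)
{
	uint64_t delta = (stamp - tc->cycle_last) & tc->cc->mask;

	if (delta > tc->cc->mask / 2) {
		delta = (tc->cycle_last - stamp) & tc->cc->mask;
		return tc->nsec - cyc2ns(tc->cc, delta);
	}
	return tc->nsec + cyc2ns(tc->cc, delta);
}
```

For example, with an 8 bit counter (mask 0xFF) at 10 ns per cycle, initialized at cycle 250 with time stamp 1000, a later read at cycle 260 sees the raw value 4, computes a delta of 10 cycles and returns 1100. A raw stamp of 2 is then converted to 1080 (two cycles in the past) and a raw stamp of 9 to 1150.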

2009-01-21 10:16:18

by Patrick Ohly

Subject: [PATCH NET-NEXT 12/12] igb: use clocksync to implement hardware time stamping

Both TX and RX hardware time stamping are implemented. Due to
hardware limitations it is not possible to verify reliably which
packet was time stamped when multiple were pending for sending; this
could be solved by only allowing one packet marked for hardware time
stamping into the queue (not implemented yet).

RX time stamping relies on the flag in the packet descriptor which
marks packets that were time stamped. In "all packet" mode this flag
is not set. TODO: also support that mode (even though it'll suffer
from race conditions).

Allocation of RX buffers is not optimal yet: the extra space for
hardware time stamps is always allocated. Either this should only
be done when HW time stamping is enabled (implies reallocation of
buffers) or packets with HW time stamps should be copied into a
larger buffer (implies higher overhead for those packets).
---
drivers/net/igb/e1000_82575.h | 1 +
drivers/net/igb/e1000_defines.h | 1 +
drivers/net/igb/e1000_regs.h | 40 ++++++
drivers/net/igb/igb.h | 4 +
drivers/net/igb/igb_main.c | 275 +++++++++++++++++++++++++++++++++++++--
5 files changed, 312 insertions(+), 9 deletions(-)

diff --git a/drivers/net/igb/e1000_82575.h b/drivers/net/igb/e1000_82575.h
index c1928b5..dd32a6f 100644
--- a/drivers/net/igb/e1000_82575.h
+++ b/drivers/net/igb/e1000_82575.h
@@ -116,6 +116,7 @@ union e1000_adv_tx_desc {
};

/* Adv Transmit Descriptor Config Masks */
+#define E1000_ADVTXD_MAC_TSTAMP 0x00080000 /* IEEE1588 Timestamp packet */
#define E1000_ADVTXD_DTYP_CTXT 0x00200000 /* Advanced Context Descriptor */
#define E1000_ADVTXD_DTYP_DATA 0x00300000 /* Advanced Data Descriptor */
#define E1000_ADVTXD_DCMD_IFCS 0x02000000 /* Insert FCS (Ethernet CRC) */
diff --git a/drivers/net/igb/e1000_defines.h b/drivers/net/igb/e1000_defines.h
index 40d0342..8909252 100644
--- a/drivers/net/igb/e1000_defines.h
+++ b/drivers/net/igb/e1000_defines.h
@@ -104,6 +104,7 @@
#define E1000_RXD_STAT_UDPCS 0x10 /* UDP xsum calculated */
#define E1000_RXD_STAT_TCPCS 0x20 /* TCP xsum calculated */
#define E1000_RXD_STAT_DYNINT 0x800 /* Pkt caused INT via DYNINT */
+#define E1000_RXD_STAT_TS 0x10000 /* Pkt was time stamped */
#define E1000_RXD_ERR_CE 0x01 /* CRC Error */
#define E1000_RXD_ERR_SE 0x02 /* Symbol Error */
#define E1000_RXD_ERR_SEQ 0x04 /* Sequence Error */
diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index d225601..215d4d6 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -78,9 +78,37 @@

/* IEEE 1588 TIMESYNCH */
#define E1000_TSYNCTXCTL 0x0B614
+#define E1000_TSYNCTXCTL_VALID (1<<0)
+#define E1000_TSYNCTXCTL_ENABLED (1<<4)
#define E1000_TSYNCRXCTL 0x0B620
+#define E1000_TSYNCRXCTL_VALID (1<<0)
+#define E1000_TSYNCRXCTL_ENABLED (1<<4)
+enum {
+ E1000_TSYNCRXCTL_TYPE_L2_V2 = 0,
+ E1000_TSYNCRXCTL_TYPE_L4_V1 = (1<<1),
+ E1000_TSYNCRXCTL_TYPE_L2_L4_V2 = (1<<2),
+ E1000_TSYNCRXCTL_TYPE_ALL = (1<<3),
+ E1000_TSYNCRXCTL_TYPE_EVENT_V2 = (1<<3) | (1<<1),
+};
#define E1000_TSYNCRXCFG 0x05F50
+enum {
+ E1000_TSYNCRXCFG_PTP_V1_SYNC_MESSAGE = 0<<0,
+ E1000_TSYNCRXCFG_PTP_V1_DELAY_REQ_MESSAGE = 1<<0,
+ E1000_TSYNCRXCFG_PTP_V1_FOLLOWUP_MESSAGE = 2<<0,
+ E1000_TSYNCRXCFG_PTP_V1_DELAY_RESP_MESSAGE = 3<<0,
+ E1000_TSYNCRXCFG_PTP_V1_MANAGEMENT_MESSAGE = 4<<0,

+ E1000_TSYNCRXCFG_PTP_V2_SYNC_MESSAGE = 0<<8,
+ E1000_TSYNCRXCFG_PTP_V2_DELAY_REQ_MESSAGE = 1<<8,
+ E1000_TSYNCRXCFG_PTP_V2_PATH_DELAY_REQ_MESSAGE = 2<<8,
+ E1000_TSYNCRXCFG_PTP_V2_PATH_DELAY_RESP_MESSAGE = 3<<8,
+ E1000_TSYNCRXCFG_PTP_V2_FOLLOWUP_MESSAGE = 8<<8,
+ E1000_TSYNCRXCFG_PTP_V2_DELAY_RESP_MESSAGE = 9<<8,
+ E1000_TSYNCRXCFG_PTP_V2_PATH_DELAY_FOLLOWUP_MESSAGE = 0xA<<8,
+ E1000_TSYNCRXCFG_PTP_V2_ANNOUNCE_MESSAGE = 0xB<<8,
+ E1000_TSYNCRXCFG_PTP_V2_SIGNALLING_MESSAGE = 0xC<<8,
+ E1000_TSYNCRXCFG_PTP_V2_MANAGEMENT_MESSAGE = 0xD<<8,
+};
#define E1000_SYSTIML 0x0B600
#define E1000_SYSTIMH 0x0B604
#define E1000_TIMINCA 0x0B608
@@ -103,6 +131,18 @@
#define E1000_ETQF6 0x05CC8
#define E1000_ETQF7 0x05CCC

+/* Filtering Registers */
+#define E1000_SAQF(_n) (0x5980 + 4 * (_n))
+#define E1000_DAQF(_n) (0x59A0 + 4 * (_n))
+#define E1000_SPQF(_n) (0x59C0 + 4 * (_n))
+#define E1000_FTQF(_n) (0x59E0 + 4 * (_n))
+#define E1000_SAQF0 E1000_SAQF(0)
+#define E1000_DAQF0 E1000_DAQF(0)
+#define E1000_SPQF0 E1000_SPQF(0)
+#define E1000_FTQF0 E1000_FTQF(0)
+#define E1000_SYNQF(_n) (0x055FC + (4 * (_n))) /* SYN Packet Queue Fltr */
+#define E1000_ETQF(_n) (0x05CB0 + (4 * (_n))) /* EType Queue Fltr */
+
/* Split and Replication RX Control - RW */
/*
* Convenience macros
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 2cf2c1a..9972801 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -35,6 +35,8 @@
#include "e1000_82575.h"

#include <linux/clocksource.h>
+#include <linux/clocksync.h>
+#include <linux/net_tstamp.h>

struct igb_adapter;

@@ -266,6 +268,8 @@ struct igb_adapter {
struct net_device_stats net_stats;
struct cyclecounter cycles;
struct timecounter clock;
+ struct clocksync sync;
+ struct hwtstamp_config hwtstamp_config;

/* structs defined in e1000_hw.h */
struct e1000_hw hw;
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 7ab6faf..71c38cd 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -252,7 +252,8 @@ static char *igb_get_time_str(struct igb_adapter *adapter,

delta = timespec_sub(nic, sys);

- sprintf(buffer, "NIC %ld.%09lus, SYS %ld.%09lus, NIC-SYS %lds + %09luns",
+ sprintf(buffer, "HW %llu, NIC %ld.%09lus, SYS %ld.%09lus, NIC-SYS %lds + %09luns",
+ hw,
(long)nic.tv_sec, nic.tv_nsec,
(long)sys.tv_sec, sys.tv_nsec,
(long)delta.tv_sec, delta.tv_nsec);
@@ -1418,6 +1419,18 @@ static int __devinit igb_probe(struct pci_dev *pdev,
wrfl();
timecounter_init(&adapter->clock, &adapter->cycles, ktime_to_ns(ktime_get_real()));

+ /*
+ * Synchronize our NIC clock against system wall clock. NIC
+ * time stamp reading requires ~3us per sample, each sample
+ * was pretty stable even under load => only require 10
+ * samples for each offset comparison.
+ */
+ memset(&adapter->sync, 0, sizeof(adapter->sync));
+ adapter->sync.clock = &adapter->clock;
+ adapter->sync.systime = ktime_get_real;
+ adapter->sync.num_samples = 10;
+ clocksync_update(&adapter->sync, 0);
+
#ifdef DEBUG
{
char buffer[160];
@@ -2776,6 +2789,7 @@ set_itr_now:
#define IGB_TX_FLAGS_VLAN 0x00000002
#define IGB_TX_FLAGS_TSO 0x00000004
#define IGB_TX_FLAGS_IPV4 0x00000008
+#define IGB_TX_FLAGS_TSTAMP 0x00000010
#define IGB_TX_FLAGS_VLAN_MASK 0xffff0000
#define IGB_TX_FLAGS_VLAN_SHIFT 16

@@ -3001,6 +3015,9 @@ static inline void igb_tx_queue_adv(struct igb_adapter *adapter,
if (tx_flags & IGB_TX_FLAGS_VLAN)
cmd_type_len |= E1000_ADVTXD_DCMD_VLE;

+ if (tx_flags & IGB_TX_FLAGS_TSTAMP)
+ cmd_type_len |= E1000_ADVTXD_MAC_TSTAMP;
+
if (tx_flags & IGB_TX_FLAGS_TSO) {
cmd_type_len |= E1000_ADVTXD_DCMD_TSE;

@@ -3092,6 +3109,7 @@ static int igb_xmit_frame_ring_adv(struct sk_buff *skb,
unsigned int len;
u8 hdr_len = 0;
int tso = 0;
+ union skb_shared_tx *shtx;

len = skb_headlen(skb);

@@ -3114,7 +3132,28 @@ static int igb_xmit_frame_ring_adv(struct sk_buff *skb,
/* this is a hard error */
return NETDEV_TX_BUSY;
}
- skb_orphan(skb);
+
+ /*
+ * TODO: check that there currently is no other packet with
+ * time stamping in the queue
+ *
+ * when doing time stamping, keep the connection to the socket
+ * a while longer, it is still needed by skb_hwtstamp_tx(), either
+ * in igb_clean_tx_irq() or
+ */
+ shtx = skb_tx(skb);
+ if (shtx && shtx->hardware) {
+ shtx->in_progress = 1;
+ tx_flags |= IGB_TX_FLAGS_TSTAMP;
+ } else if (!shtx) {
+ /*
+ * TODO: can this be solved in dev.c:dev_hard_start_xmit()?
+ * There are probably unmodified drivers which do something
+ * like this and thus don't work in combination with
+ * SOF_TIMESTAMPING_TX_SOFTWARE.
+ */
+ skb_orphan(skb);
+ }

if (adapter->vlgrp && vlan_tx_tag_present(skb)) {
tx_flags |= IGB_TX_FLAGS_VLAN;
@@ -3794,6 +3833,8 @@ static bool igb_clean_tx_irq(struct igb_ring *tx_ring)

if (skb) {
unsigned int segs, bytecount;
+ union skb_shared_tx *shtx;
+
/* gso_segs is currently only valid for tcp */
segs = skb_shinfo(skb)->gso_segs ?: 1;
/* multiply data chunks by size of headers */
@@ -3801,6 +3842,35 @@ static bool igb_clean_tx_irq(struct igb_ring *tx_ring)
skb->len;
total_packets += segs;
total_bytes += bytecount;
+
+ /*
+ * if we were asked to do hardware
+ * stamping and such a time stamp is
+ * available, then it must have been
+ * for this one here because we allow
+ * only one such packet into the
+ * queue
+ */
+ shtx = skb_tx(skb);
+ if (shtx && shtx->hardware) {
+ u32 valid = rd32(E1000_TSYNCTXCTL) & E1000_TSYNCTXCTL_VALID;
+ if (valid) {
+ u64 regval = rd32(E1000_TXSTMPL);
+ u64 ns;
+ struct skb_shared_hwtstamps shhwtstamps;
+
+ memset(&shhwtstamps, 0, sizeof(shhwtstamps));
+ regval |= (u64)rd32(E1000_TXSTMPH) << 32;
+ ns = timecounter_cyc2time(&adapter->clock,
+ regval);
+ clocksync_update(&adapter->sync, ns);
+ shhwtstamps.hwtstamp = ns_to_ktime(ns);
+ shhwtstamps.syststamp =
+ clocksync_hw2sys(&adapter->sync, ns);
+ skb_tstamp_tx(skb, &shhwtstamps);
+ }
+ skb_orphan(skb);
+ }
}

igb_unmap_and_free_tx_resource(adapter, buffer_info);
@@ -3974,6 +4044,7 @@ static bool igb_clean_rx_irq_adv(struct igb_ring *rx_ring,
{
struct igb_adapter *adapter = rx_ring->adapter;
struct net_device *netdev = adapter->netdev;
+ struct e1000_hw *hw = &adapter->hw;
struct pci_dev *pdev = adapter->pdev;
union e1000_adv_rx_desc *rx_desc , *next_rxd;
struct igb_buffer *buffer_info , *next_buffer;
@@ -4065,6 +4136,50 @@ send_up:
goto next_desc;
}

+ /*
+ * If this bit is set, then the RX registers contain
+ * the time stamp. No other packet will be time
+ * stamped until we read these registers, so read the
+ * registers to make them available again. Because
+ * only one packet can be time stamped at a time, we
+ * know that the register values must belong to this
+ * one here and therefore we don't need to compare
+ * any of the additional attributes stored for it.
+ *
+ * If nothing went wrong, then it should have a
+ * skb_shared_tx that we can turn into a
+ * skb_shared_hwtstamps.
+ *
+ * TODO: can time stamping be triggered (thus locking
+ * the registers) without the packet reaching this point
+ * here? In that case RX time stamping would get stuck.
+ *
+ * TODO: in "time stamp all packets" mode this bit is
+ * not set. Need a global flag for this mode and then
+ * always read the registers. Cannot be done without
+ * a race condition.
+ */
+ if (staterr & E1000_RXD_STAT_TS) {
+ u64 regval;
+ u64 ns;
+ struct skb_shared_hwtstamps *shhwtstamps =
+ (struct skb_shared_hwtstamps *)skb_tx(skb);
+
+ WARN(!(rd32(E1000_TSYNCRXCTL) & E1000_TSYNCRXCTL_VALID),
+ "igb: no RX time stamp available for time stamped packet");
+ regval = rd32(E1000_RXSTMPL);
+ regval |= (u64)rd32(E1000_RXSTMPH) << 32;
+ if (shhwtstamps) {
+ ns = timecounter_cyc2time(&adapter->clock, regval);
+ clocksync_update(&adapter->sync, ns);
+ memset(shhwtstamps, 0, sizeof(*shhwtstamps));
+ shhwtstamps->hwtstamp = ns_to_ktime(ns);
+ shhwtstamps->syststamp = clocksync_hw2sys(&adapter->sync, ns);
+ skb->optional = (skb->optional & ~SKB_FLAGS_OPTIONAL_TX) |
+ SKB_FLAGS_OPTIONAL_HWTSTAMPS;
+ }
+ }
+
if (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) {
dev_kfree_skb_irq(skb);
goto next_desc;
@@ -4166,8 +4281,26 @@ static void igb_alloc_rx_buffers_adv(struct igb_ring *rx_ring,
else
bufsz = adapter->rx_buffer_len;
bufsz += NET_IP_ALIGN;
- skb = netdev_alloc_skb(netdev, bufsz);

+ /*
+ * Always allocate the extra space for hardware
+ * time stamps because even if hardware time stamping
+ * is off right now, at the time when the buffer is
+ * used it might be on.
+ *
+ * TODO: only allocate the extra space if
+ * needed and when hardware timestamping is
+ * enabled, reallocate the buffers without it.
+ *
+ * If only a few packets will get time stamped,
+ * then the extra space is passed through
+ * the kernel as empty skb_shared_tx (has the
+ * same size as skb_shared_hwtstamps) and thus
+ * wasted.
+ */
+ skb = netdev_alloc_skb_flags(netdev, bufsz,
+ 1 /* adapter->hwtstamp_config.rx_filter != HWTSTAMP_FILTER_NONE */ ?
+ SKB_FLAGS_OPTIONAL_TX : 0);
if (!skb) {
adapter->alloc_rx_buff_failed++;
goto no_buffers;
@@ -4258,12 +4391,32 @@ static int igb_mii_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
* @ifreq:
* @cmd:
*
- * Currently cannot enable any kind of hardware time stamping, but
- * supports SIOCSHWTSTAMP in general.
+ * Outgoing time stamping can be enabled and disabled. Play nice and
+ * disable it when requested, although it shouldn't cause any overhead
+ * when no packet needs it. At most one packet in the queue may be
+ * marked for time stamping, otherwise it would be impossible to tell
+ * for sure to which packet the hardware time stamp belongs.
+ *
+ * Incoming time stamping has to be configured via the hardware
+ * filters. Not all combinations are supported, in particular event
+ * type has to be specified. Matching the kind of event packet is
+ * not supported, with the exception of "all V2 events regardless of
+ * level 2 or 4".
+ *
**/
static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int cmd)
{
+ struct igb_adapter *adapter = netdev_priv(netdev);
+ struct e1000_hw *hw = &adapter->hw;
struct hwtstamp_config config;
+ u32 tsync_tx_ctl_bit = E1000_TSYNCTXCTL_ENABLED;
+ u32 tsync_rx_ctl_bit = E1000_TSYNCRXCTL_ENABLED;
+ u32 tsync_rx_ctl_type = 0;
+ u32 tsync_rx_cfg = 0;
+ int is_l4 = 0;
+ int is_l2 = 0;
+ short port = 319; /* PTP */
+ u32 regval;

if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
return -EFAULT;
@@ -4272,11 +4425,115 @@ static int igb_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr, int
if (config.flags)
return -EINVAL;

- if (config.tx_type == HWTSTAMP_TX_OFF &&
- config.rx_filter == HWTSTAMP_FILTER_NONE)
- return 0;
+ switch (config.tx_type) {
+ case HWTSTAMP_TX_OFF:
+ tsync_tx_ctl_bit = 0;
+ break;
+ case HWTSTAMP_TX_ON:
+ tsync_tx_ctl_bit = E1000_TSYNCTXCTL_ENABLED;
+ break;
+ default:
+ return -ERANGE;
+ }
+
+ switch (config.rx_filter) {
+ case HWTSTAMP_FILTER_NONE:
+ tsync_rx_ctl_bit = 0;
+ break;
+ case HWTSTAMP_FILTER_PTP_V1_L4_EVENT:
+ case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
+ case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
+ case HWTSTAMP_FILTER_ALL:
+ /*
+ * register TSYNCRXCFG must be set, therefore it is not
+ * possible to time stamp both Sync and Delay_Req messages
+ * => fall back to time stamping all packets
+ */
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_ALL;
+ config.rx_filter = HWTSTAMP_FILTER_ALL;
+ break;
+ case HWTSTAMP_FILTER_PTP_V1_L4_SYNC:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L4_V1;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V1_SYNC_MESSAGE;
+ is_l4 = 1;
+ break;
+ case HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L4_V1;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V1_DELAY_REQ_MESSAGE;
+ is_l4 = 1;
+ break;
+ case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
+ case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L2_L4_V2;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V2_SYNC_MESSAGE;
+ is_l2 = 1;
+ is_l4 = 1;
+ config.rx_filter = HWTSTAMP_FILTER_SOME;
+ break;
+ case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
+ case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_L2_L4_V2;
+ tsync_rx_cfg = E1000_TSYNCRXCFG_PTP_V2_DELAY_REQ_MESSAGE;
+ is_l2 = 1;
+ is_l4 = 1;
+ config.rx_filter = HWTSTAMP_FILTER_SOME;
+ break;
+ case HWTSTAMP_FILTER_PTP_V2_EVENT:
+ case HWTSTAMP_FILTER_PTP_V2_SYNC:
+ case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
+ tsync_rx_ctl_type = E1000_TSYNCRXCTL_TYPE_EVENT_V2;
+ config.rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT;
+ is_l2 = 1;
+ break;
+ default:
+ return -ERANGE;
+ }
+
+ /* enable/disable TX */
+ regval = rd32(E1000_TSYNCTXCTL);
+ regval = (regval & ~E1000_TSYNCTXCTL_ENABLED) | tsync_tx_ctl_bit;
+ wr32(E1000_TSYNCTXCTL, regval);
+
+ /* enable/disable RX, define which PTP packets are time stamped */
+ regval = rd32(E1000_TSYNCRXCTL);
+ regval = (regval & ~E1000_TSYNCRXCTL_ENABLED) | tsync_rx_ctl_bit;
+ regval = (regval & ~0xE) | tsync_rx_ctl_type;
+ wr32(E1000_TSYNCRXCTL, regval);
+ wr32(E1000_TSYNCRXCFG, tsync_rx_cfg);
+
+ /*
+ * Ethertype Filter Queue Filter[0][15:0] = 0x88F7 (Ethertype to filter on)
+ * Ethertype Filter Queue Filter[0][26] = 0x1 (Enable filter)
+ * Ethertype Filter Queue Filter[0][30] = 0x1 (Enable Timestamping)
+ */
+ wr32(E1000_ETQF0, is_l2 ? 0x440088f7 : 0);
+
+ /* L4 Queue Filter[0]: only filter by source and destination port */
+ wr32(E1000_SPQF0, htons(port));
+ wr32(E1000_IMIREXT(0), is_l4 ?
+ ((1<<12) | (1<<19) /* bypass size and control flags */) : 0);
+ wr32(E1000_IMIR(0), is_l4 ?
+ (htons(port)
+ | (0<<16) /* immediate interrupt disabled */
+ | 0 /* (1<<17) bit cleared: do not bypass destination port check */)
+ : 0);
+ wr32(E1000_FTQF0, is_l4 ?
+ (0x11 /* UDP */
+ | (1<<15) /* VF not compared */
+ | (1<<27) /* Enable Timestamping */
+ | (7<<28) /* only source port filter enabled, source/target address and protocol masked */ )
+ : ( (1<<15) | (15<<28) /* all mask bits set = filter not enabled */));
+
+ wrfl();
+
+ adapter->hwtstamp_config = config;
+
+ /* clear TX/RX time stamp registers, just to be sure */
+ regval = rd32(E1000_TXSTMPH);
+ regval = rd32(E1000_RXSTMPH);

- return -ERANGE;
+ return copy_to_user(ifr->ifr_data, &config, sizeof(config)) ?
+ -EFAULT : 0;
}

/**
--
1.5.5.3

2009-01-21 10:16:43

by Patrick Ohly

Subject: [PATCH NET-NEXT 05/12] ip: support for TX timestamps on UDP and RAW sockets

Instructions for time stamping outgoing packets are taken from the
socket layer and later copied into the new skb.
---
Documentation/networking/timestamping.txt | 2 ++
include/net/ip.h | 1 +
net/can/raw.c | 14 +++++++++++---
net/ipv4/icmp.c | 2 ++
net/ipv4/ip_output.c | 12 ++++++++++--
net/ipv4/raw.c | 1 +
net/ipv4/udp.c | 4 ++++
7 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index a681a65..0e58b45 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -56,6 +56,8 @@ and including the link layer, the scm_timestamping control message and
a sock_extended_err control message with ee_errno==ENOMSG and
ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
bounced packet is ready for reading as far as select() is concerned.
+If the outgoing packet has to be fragmented, then only the first
+fragment is time stamped and returned to the sending socket.

All three values correspond to the same event in time, but were
generated in different ways. Each of these values may be empty (= all
diff --git a/include/net/ip.h b/include/net/ip.h
index 1086813..4ac7577 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -55,6 +55,7 @@ struct ipcm_cookie
__be32 addr;
int oif;
struct ip_options *opt;
+ union skb_shared_tx shtx;
};

#define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
diff --git a/net/can/raw.c b/net/can/raw.c
index 27aab63..1f2111d 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -618,6 +618,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
struct raw_sock *ro = raw_sk(sk);
struct sk_buff *skb;
struct net_device *dev;
+ union skb_shared_tx shtx;
int ifindex;
int err;

@@ -639,8 +640,14 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
if (!dev)
return -ENXIO;

- skb = sock_alloc_send_skb(sk, size, msg->msg_flags & MSG_DONTWAIT,
- &err);
+ err = sock_tx_timestamp(msg, sk, &shtx);
+ if (err < 0)
+ return err;
+
+ skb = sock_alloc_send_skb_flags(sk, size,
+ ((msg->msg_flags & MSG_DONTWAIT) ? SKB_FLAGS_NOBLOCK : 0) |
+ (shtx.flags ? SKB_FLAGS_OPTIONAL_TX : 0),
+ &err);
if (!skb)
goto put_dev;

@@ -649,7 +656,8 @@ static int raw_sendmsg(struct kiocb *iocb, struct socket *sock,
goto free_skb;
skb->dev = dev;
skb->sk = sk;
-
+ if (shtx.flags)
+ *skb_tx(skb) = shtx;
err = can_send(skb, ro->loopback);

dev_put(dev);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 705b33b..382800a 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -375,6 +375,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
inet->tos = ip_hdr(skb)->tos;
daddr = ipc.addr = rt->rt_src;
ipc.opt = NULL;
+ ipc.shtx.flags = 0;
if (icmp_param->replyopts.optlen) {
ipc.opt = &icmp_param->replyopts;
if (ipc.opt->srr)
@@ -532,6 +533,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
inet_sk(sk)->tos = tos;
ipc.addr = iph->saddr;
ipc.opt = &icmp_param.replyopts;
+ ipc.shtx.flags = 0;

{
struct flowi fl = {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 8ebe86d..ed92f0b 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -923,9 +923,11 @@ alloc_new_skb:
alloclen += rt->u.dst.trailer_len;

if (transhdrlen) {
- skb = sock_alloc_send_skb(sk,
+ skb = sock_alloc_send_skb_flags(sk,
alloclen + hh_len + 15,
- (flags & MSG_DONTWAIT), &err);
+ ((flags & MSG_DONTWAIT) ? SKB_FLAGS_NOBLOCK : 0) |
+ (ipc->shtx.flags ? SKB_FLAGS_OPTIONAL_TX : 0),
+ &err);
} else {
skb = NULL;
if (atomic_read(&sk->sk_wmem_alloc) <=
@@ -935,6 +937,9 @@ alloc_new_skb:
sk->sk_allocation);
if (unlikely(skb == NULL))
err = -ENOBUFS;
+ else
+ /* only the initial fragment is time stamped */
+ ipc->shtx.flags = 0;
}
if (skb == NULL)
goto error;
@@ -945,6 +950,8 @@ alloc_new_skb:
skb->ip_summed = csummode;
skb->csum = 0;
skb_reserve(skb, hh_len);
+ if (ipc->shtx.flags)
+ *skb_tx(skb) = ipc->shtx;

/*
* Find where to start putting bytes.
@@ -1364,6 +1371,7 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar

daddr = ipc.addr = rt->rt_src;
ipc.opt = NULL;
+ ipc.shtx.flags = 0;

if (replyopts.opt.optlen) {
ipc.opt = &replyopts.opt;
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index dff8bc4..f774651 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -493,6 +493,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,

ipc.addr = inet->saddr;
ipc.opt = NULL;
+ ipc.shtx.flags = 0;
ipc.oif = sk->sk_bound_dev_if;

if (msg->msg_controllen) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cf5ab05..bbf1a6d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -573,6 +573,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
return -EOPNOTSUPP;

ipc.opt = NULL;
+ ipc.shtx.flags = 0;

if (up->pending) {
/*
@@ -620,6 +621,9 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
ipc.addr = inet->saddr;

ipc.oif = sk->sk_bound_dev_if;
+ err = sock_tx_timestamp(msg, sk, &ipc.shtx);
+ if (err)
+ return err;
if (msg->msg_controllen) {
err = ip_cmsg_send(sock_net(sk), msg, &ipc);
if (err)
--
1.5.5.3

2009-01-21 10:33:52

by Ingo Molnar

Subject: Re: [PATCH NET-NEXT 11/12] time sync: generic infrastructure to map between time stamps generated by a time counter and system time


* Patrick Ohly <[email protected]> wrote:

> Currently only mapping from time counter to system time is implemented.
> The interface could have been made more versatile by not depending on a time counter,
> but this wasn't done to avoid writing glue code elsewhere.
>
> The method implemented here is the one used and analyzed under the name
> "assisted PTP" in the LCI PTP paper:
> http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf
> ---
> include/linux/clocksync.h | 85 +++++++++++++++++++
> kernel/time/Makefile | 2 +-
> kernel/time/clocksync.c | 196 +++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 282 insertions(+), 1 deletions(-)
> create mode 100644 include/linux/clocksync.h
> create mode 100644 kernel/time/clocksync.c

hm, these bits have less than casual impact - i think they need to go via
the timer tree.

Ingo

2009-01-21 14:42:27

by Patrick Ohly

Subject: Re: [PATCH NET-NEXT 11/12] time sync: generic infrastructure to map between time stamps generated by a time counter and system time

On Wed, 2009-01-21 at 10:33 +0000, Ingo Molnar wrote:
> * Patrick Ohly <[email protected]> wrote:
>
> > Currently only mapping from time counter to system time is implemented.
> > The interface could have been made more versatile by not depending on a time counter,
> > but this wasn't done to avoid writing glue code elsewhere.
> >
> > The method implemented here is the one used and analyzed under the name
> > "assisted PTP" in the LCI PTP paper:
> > http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf
> > ---
> > include/linux/clocksync.h | 85 +++++++++++++++++++
> > kernel/time/Makefile | 2 +-
> > kernel/time/clocksync.c | 196 +++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 282 insertions(+), 1 deletions(-)
> > create mode 100644 include/linux/clocksync.h
> > create mode 100644 kernel/time/clocksync.c
>
> hm, these bits have less than casual impact - i think they need to go via
> the timer tree.

I agree that they should be reviewed by experts in that area. Patches 11
and 09 (which 11 depends on and which was already reviewed by John) are
independent of the rest of the patch series and could be included in the
timer tree. On the other hand that code is only called by the example
igb driver in this patch series, which won't compile without the timer
changes.

Please let me know if I should pursue the inclusion separately and if
so, how the inclusion in the two trees can be coordinated.

--
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.

2009-01-26 05:04:53

by David Miller

Subject: Re: hardware time stamping with optional structs in data area

From: Patrick Ohly <[email protected]>
Date: Wed, 21 Jan 2009 11:07:37 +0100

> I tested them with the modified PTPd with and without hardware support
> on x86. With 64 bit kernel and user space both work. With 32 bit user
> space on a 64 bit kernel software-only time stamping works (thanks to
> the socket's compatibility layer), hardware support doesn't: the ifreq
> is passed to the right device driver, but the data pointer from a 32 bit
> process is not interpreted correctly by a 64 bit driver. If there is a
> way to handle this then please let me know - I didn't see how a device
> driver could distinguish between a 32 and 64 bit user process.

See fs/compat_ioctl.c:dev_ifsioc() for how to handle the
"32-bit process under 64-bit kernel" issue wrt. struct ifreq.

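To illustrate the compat issue: a 32-bit process passes a request whose
data pointer is only 4 bytes wide, so a 64-bit kernel has to copy the
request into its native layout and widen the pointer before the driver
sees it. The user-space sketch below shows only that widening step. The
struct and function names are hypothetical and far simpler than the real
struct ifreq; inside the kernel the pointer conversion is done with
helpers such as compat_ptr().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-bit view of the request: the user pointer is 4 bytes. */
struct ifreq32_sketch {
	char name[16];
	uint32_t data;
};

/* Hypothetical native view: the pointer has the kernel's own width. */
struct ifreq_native_sketch {
	char name[16];
	void *data;
};

static_assert(sizeof(((struct ifreq32_sketch *)0)->data) == 4,
	      "a 32-bit process hands over a 4 byte pointer");

/* Copy the name unchanged and zero-extend the 32-bit pointer value. */
void widen_ifreq(const struct ifreq32_sketch *in,
		 struct ifreq_native_sketch *out)
{
	memcpy(out->name, in->name, sizeof(out->name));
	out->data = (void *)(uintptr_t)in->data;
}
```

A driver that only ever sees the native layout does not need to know
which kind of process issued the request.
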
Next, I don't like that loop in the SKB allocator to deal with the
variable size. You're computing what amounts to a constant
calculation.

Just consolidate the array into a direct conversion table. You only
have 2 bits defined so you only need an array of 4 entries. Pass the
optional flag bits directly in as the index of that table.
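
A minimal sketch of such a table, for illustration only. The structure
names, bit names and the helper are hypothetical stand-ins rather than
the definitions from the patch, and letting the compiler lay out a
combined struct is just one possible way to obtain the fourth entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two optional structures. */
struct opt_tx {
	unsigned char flags;
};

struct opt_hwtstamps {
	long long hwtstamp;
	long long syststamp;
};

/* Both present: the compiler inserts the padding between the two. */
struct opt_both {
	struct opt_tx tx;
	struct opt_hwtstamps hw;
};

enum {
	OPT_TX        = 1 << 0,
	OPT_HWTSTAMPS = 1 << 1,
	OPT_MASK      = OPT_TX | OPT_HWTSTAMPS
};

/* One entry per combination of the two flag bits. */
static const size_t opt_size[4] = {
	[0]                      = 0,
	[OPT_TX]                 = sizeof(struct opt_tx),
	[OPT_HWTSTAMPS]          = sizeof(struct opt_hwtstamps),
	[OPT_TX | OPT_HWTSTAMPS] = sizeof(struct opt_both),
};

static_assert(sizeof(opt_size) / sizeof(opt_size[0]) == OPT_MASK + 1,
	      "one entry per flag combination");

/* The flag bits are used directly as the table index. */
size_t opt_extra_bytes(unsigned int flags)
{
	return opt_size[flags & OPT_MASK];
}
```

The allocator would then add opt_extra_bytes(flags) to the requested
size without any loop.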

2009-01-26 20:40:20

by Patrick Ohly

Subject: Re: hardware time stamping with optional structs in data area

On Mon, 2009-01-26 at 07:04 +0200, David Miller wrote:
> From: Patrick Ohly <[email protected]>
> Date: Wed, 21 Jan 2009 11:07:37 +0100
> See fs/compat_ioctl.c:dev_ifsioc() for how to handle the
> "32-bit process under 64-bit kernel" issue wrt. struct ifreq

Thanks, will do.

> Just consolidate the array into a direct conversion table. You only
> have 2 bits defined so you only need an array of 4 entries. Pass the
> optional flag bits directly in as the index of that table.

How can I get some code executed during the initialization of the IP
stack which initializes the table, before any sk_buff gets allocated?

The content is constant, but writing it down as static initializers
using just preprocessor macros would be difficult and/or ugly - that's
why I haven't done it already.

Bye, Patrick

2009-01-27 01:22:21

by David Miller

Subject: Re: hardware time stamping with optional structs in data area

From: Patrick Ohly <[email protected]>
Date: Mon, 26 Jan 2009 21:39:52 +0100

> On Mon, 2009-01-26 at 07:04 +0200, David Miller wrote:
> > Just consolidate the array into a direct conversion table. You only
> > have 2 bits defined so you only need an array of 4 entries. Pass the
> > optional flag bits directly in as the index of that table.
>
> How can I get some code executed during the initialization of the IP
> stack which initializes the table, before any sk_buff gets allocated?
>
> The content is constant, but writing it down as static initializers
> using just preprocessor macros would be difficult and/or ugly - that's
> why I haven't done it already.

It's 4 entries... really. You can initialize them simply, perhaps
with some fancy macro used by the initializers.

2009-01-27 15:24:21

by Patrick Ohly

Subject: Re: hardware time stamping with optional structs in data area

On Tue, 2009-01-27 at 03:22 +0200, David Miller wrote:
> From: Patrick Ohly <[email protected]>
> Date: Mon, 26 Jan 2009 21:39:52 +0100
>
> > On Mon, 2009-01-26 at 07:04 +0200, David Miller wrote:
> > > Just consolidate the array into a direct conversion table. You only
> > > have 2 bits defined so you only need an array of 4 entries. Pass the
> > > optional flag bits directly in as the index of that table.
> >
> > How can I get some code executed during the initialization of the IP
> > stack which initializes the table, before any sk_buff gets allocated?
> >
> > The content is constant, but writing it down as static initializers
> > using just preprocessor macros would be difficult and/or ugly - that's
> > why I haven't done it already.
>
> It's 4 entries... really.

True - at this time. But what if this extension mechanism turns out to
be useful and we end up with more optional structures? I was hoping that
this might be the case and thus tried to make it easy to add more
structures.

> You can initialize them simply, perhaps
> with some fancy macro used by the initializers.

Unfortunately recursion in macros is not possible. One needs duplicated
macro definitions, one for each additional structure. But perhaps my
preprocessor code fu is lacking today and there is a simpler
solution :-/

Anyway, below is example code with three structures which initializes a
size array indexed by a bit mask. I find it truly ugly and hard to
understand - and I wrote it. gcc -S shows that the size arrays contain
the expected values.

Please let me know what approach is preferred and I'll revise the patch
accordingly:
1. array with four hard-coded values
2. init code (where?)
3. macro initialization of array
4. simpler initialization solution

---------------------------------------------------------------
struct A {
int a;
};

struct B {
char b;
};

struct C {
long long c;
};

enum {
BIT_A,
BIT_B,
BIT_C
};

#define MIN(_a, _b) ((_a) < (_b) ? (_a) : (_b))

/* round up _previous number of bytes so that a struct of _structsize
bytes is properly aligned at an 8 byte boundary or the structure size,
whichever is smaller */
#define ALIGN(_previous, _structsize) \
(((_previous) + MIN(_structsize, 8) - 1) & ~(MIN(_structsize, 8) - 1))

/* number of bytes for struct X alone
* @_flags: bit mask of BIT_ values
* @_bitname: name of bit enum for X
* @_structname: structure name of X, without struct
*/
#define SIZE_X(_flags, _bitname, _structname) \
(((_flags) & (1<<_bitname)) ? sizeof(struct _structname) : 0)

/* number of bytes needed for B (if present) with or without A in
* front of it
* @_flags: bit mask of BIT_A and BIT_B
*/
#define SIZE_B(_flags) \
(((_flags) & (1<<BIT_B)) ? \
(ALIGN(SIZE_X(_flags, BIT_A, A), SIZE_X(_flags, BIT_B, B)) + sizeof(struct B)) : \
SIZE_X(_flags, BIT_A, A))

/* number of bytes needed for C (if present) with combinations of
* struct A, B in front
* @_flags: bit mask of BIT_A, BIT_B, BIT_C
*/
#define SIZE_C(_flags) \
(((_flags) & (1<<BIT_C)) ? \
(ALIGN(SIZE_B(_flags), SIZE_X(_flags, BIT_C, C)) + sizeof(struct C)) : \
SIZE_B(_flags))

/* number of bytes needed to store combinations of A, B, C, in this
order */
int size[] = { 0,

SIZE_X((1<<BIT_A), BIT_A, A),

SIZE_B((1<<BIT_B)),
SIZE_B((1<<BIT_B)|(1<<BIT_A)),

SIZE_C((1<<BIT_C)),
SIZE_C((1<<BIT_C)|(1<<BIT_A)),
SIZE_C((1<<BIT_C)|(1<<BIT_B)),
SIZE_C((1<<BIT_C)|(1<<BIT_B)|(1<<BIT_A)) };

#define LIST_A(_flags) SIZE_C(_flags), SIZE_C((_flags)|(1<<BIT_A))
#define LIST_B(_flags) LIST_A(_flags), LIST_A((_flags)|(1<<BIT_B))
#define LIST_C(_flags) LIST_B(_flags), LIST_B((_flags)|(1<<BIT_C))

/* same as size */
int size2[] = { LIST_C(0) };
---------------------------------------------------------------
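
For comparison, option 2 from the list above (computing the sizes at run
time) could look roughly like the sketch below. It repeats the three
example structures and applies the same rule as the ALIGN() macro: each
structure is aligned to its own size or to 8 bytes, whichever is
smaller, which only works for power-of-two sizes. NUM_BITS, struct_size
and layout_size() are names made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

struct A { int a; };
struct B { char b; };
struct C { long long c; };

enum { BIT_A, BIT_B, BIT_C, NUM_BITS };

/* Sizes in storage order A, B, C. */
static const size_t struct_size[NUM_BITS] = {
	[BIT_A] = sizeof(struct A),
	[BIT_B] = sizeof(struct B),
	[BIT_C] = sizeof(struct C),
};

static_assert(NUM_BITS <= 8 * sizeof(unsigned int),
	      "all bits must fit into the flags word");

/* Bytes needed to store the structures selected by flags, in order. */
size_t layout_size(unsigned int flags)
{
	size_t total = 0;

	for (int bit = 0; bit < NUM_BITS; bit++) {
		size_t align;

		if (!(flags & (1u << bit)))
			continue;
		/* align to min(size, 8), as ALIGN() does */
		align = struct_size[bit] < 8 ? struct_size[bit] : 8;
		total = (total + align - 1) & ~(align - 1);
		total += struct_size[bit];
	}
	return total;
}
```

Calling layout_size() for every mask from 0 to 7 should reproduce the
entries of the macro-generated arrays, which makes it usable as a
cross-check even if the table itself stays static.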

Bye, Patrick

2009-01-28 09:08:47

by Herbert Xu

Subject: Re: hardware time stamping with optional structs in data area

Patrick Ohly <[email protected]> wrote:
>
> True - at this time. But what if this extension mechanism turns out to
> be useful and we end up with more optional structures? I was hoping that
> this might be the case and thus tried to make it easy to add more
> structures.

You're putting the extension in the skb->end area, right?

How big are the time stamps? If they're not that big, why don't
we put it into the shinfo structure itself? For the common case,
we have plenty of space due to kmalloc padding anyway.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2009-01-28 09:52:32

by Patrick Ohly

Subject: Re: hardware time stamping with optional structs in data area

On Wed, 2009-01-28 at 11:08 +0200, Herbert Xu wrote:
> Patrick Ohly <[email protected]> wrote:
> >
> > True - at this time. But what if this extension mechanism turns out to
> > be useful and we end up with more optional structures? I was hoping that
> > this might be the case and thus tried to make it easy to add more
> > structures.
>
> You're putting the extension in the skb->end area, right?

Right.

> How big are the time stamps? If they're not that big, why don't
> we put it into the shinfo structure itself? For the common case,
> we have plenty of space due to kmalloc padding anyway.

Two 64 bit fields have to be added for time stamps plus 3 bits for flags
(for time stamping instructions, currently in skb_shared_tx).

Putting that into shinfo should work fine. I thought extending that
structure with information that isn't needed for all packets was as bad
as extending sk_buff itself. If that isn't the case, then extending
shinfo definitely is the simplest solution.
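
As a rough sketch of what that addition amounts to: two 64 bit values
plus a one-byte flag field holding the three instruction bits. The type
names, field names and bit names below are hypothetical, chosen only to
make the sizes concrete; the real layout in shinfo would also be subject
to the compiler's padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: the two 64 bit time stamp fields. */
struct hwtstamps_sketch {
	int64_t hwtstamp;
	int64_t syststamp;
};

/* Hypothetical: three instruction bits, also readable as one byte. */
union tx_flags_sketch {
	struct {
		uint8_t hardware:1,
			software:1,
			in_progress:1;
	};
	uint8_t flags;
};

static_assert(sizeof(struct hwtstamps_sketch) == 16, "two 64 bit fields");
static_assert(sizeof(union tx_flags_sketch) == 1, "flag bits fit in a byte");

/* Bytes added before any padding the enclosing struct may require. */
size_t shinfo_growth_unpadded(void)
{
	return sizeof(struct hwtstamps_sketch) + sizeof(union tx_flags_sketch);
}
```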

Bye, Patrick

2009-01-28 09:54:49

by Herbert Xu

Subject: Re: hardware time stamping with optional structs in data area

On Wed, Jan 28, 2009 at 10:52:13AM +0100, Patrick Ohly wrote:
>
> Putting that into shinfo should work fine. I thought extending that
> structure with information that isn't needed for all packets was as bad
> as extending sk_buff itself. If that isn't the case, then extending
> shinfo definitely is the simplest solution.

Not at all, the sk_buff has its own slab while skb->head uses
kmalloc. The latter has loads of free space due to padding for
common MTUs such as 1500 or header-only.
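
The padding argument can be made concrete with a small model. The sketch
below assumes an allocator that rounds every request up to the next
power of two, which is a simplification of the real kmalloc size
classes, and it ignores headroom. The function names are made up for
this example.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power-of-two size class (at least 32 bytes) that fits bytes. */
size_t size_class(size_t bytes)
{
	size_t cls = 32;

	/* keep the sketch far away from overflow in the shift below */
	assert(bytes <= ((size_t)1 << 30));
	while (cls < bytes)
		cls <<= 1;
	return cls;
}

/* Unused bytes left in the buffer that holds packet data plus shinfo. */
size_t alloc_slack(size_t data_bytes, size_t shinfo_bytes)
{
	size_t needed = data_bytes + shinfo_bytes;

	return size_class(needed) - needed;
}
```

With a purely hypothetical 256 bytes of shared info, a 1500 byte payload
needs 1756 bytes and lands in the 2048 byte class, leaving 292 bytes
unused; in this model a few extra fields in shinfo would not change the
size class.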

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt