Today in Linux, select() and poll() operate in jiffies resolution
(granularity), meaning an effective resolution of 1 millisecond (HZ=1000) to
10 milliseconds (HZ=100). Delays shorter than this are not possible, and all
delays are in multiples of this granularity.
The effect is that applications that want (on average) to specify more
accurate delays (for example multimedia or other interactive apps) just
cannot do that; this creates more than needed jitter.
With this patch series, the internals of select() and poll() interfaces are
changed such that they work on the nanosecond level (using hrtimers). The
userspace interface for select() is in microseconds, for pselect() and
ppoll() this is in nanoseconds.
[actual behavior obviously depends on what resolution the hardware timers work, on
modern PCs this is pretty good though]
To show this effect I made a test application to measure the error made
in the select() timing.
For example, a userspace application asking for a 1200 microsecond delay, on
a HZ=1000 kernel, will in practice get a 1997 microsecond delay, a delta of
almost 800 microseconds (which is of course a high percentage of 1200). The
extreme case is asking for 1 microsecond, and getting 998 microseconds
delay... with the patch we get a 250 times improvement in behavior (!).
A graph of various inputs with the jitter can be seen at
http://www.tglx.de/~arjan/select_benefits.png
One thing to note is that on my machine, the current select() implementation
will return after 1997 microseconds when asked for 1999 microseconds; this
can be seen in a zoom in of the graph above:
http://www.tglx.de/~arjan/zoom.png
E.g. select() is returning too early in current Linux kernels; and this is
also fixed (by nature) by this patch series.
In the graph there's a 4 microsecond delta for most data points, this is
basically the measurement overhead (C-state exit, a few system calls, a
loop and some math).
About the patches:
Patch 1: Introduces infrastructure where select() and poll() start tracking
the end time in a "struct timespec" rather than in jiffies, so in
nanosecond resolution.
Patch 2: Uses the now available end time in nanoseconds to calculate at the
end of a select()/ppoll() how much time is left and returns this
in a high resolution rather than jiffies granularity
Patch 3: Introduces a schedule_hrtimeout(); a high resolution version of the
equivalent schedule_timeout() function.
Patch 4: Converts over select() to schedule_hrtimeout()
Patch 5: Converts over poll() to schedule_hrtimeout()
Note:
even though poll() (as opposed to ppoll()) only accepts milliseconds
as userspace interface, the behavior will still improve because the current
time no longer needs to be rounded up to the next jiffie, so on average
a 500 milliseconds behavior improvement.
Note 2:
I'm a bit unsure about the restart code and how that works; I'd like to
request a bit of help on figuring out how that code is supposed to work.
Future work:
I'd like to get rid of the jiffies timeout entirely over time, and
only use hrtimers (makes the code a lot nicer) but that's for now
a separate step, first I'd like to see how this change pans out.
From: Arjan van de Ven <[email protected]>
Date: Thu, 28 Aug 2008 13:40:29 -0700
Subject: [PATCH] select: return accurate remainer in select() and ppoll()
On Linux, select() and ppoll() write back the time remaining into the timeout
value. Now that we have a nanosecond level end time, we can also give a very
accurate remainder back; this patch adds the logic to do this.
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/compat.c | 39 ++++++++++++++++++++++++++++++---------
fs/select.c | 47 ++++++++++++++++++++++++++++++++++++-----------
2 files changed, 66 insertions(+), 20 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 7c9b2a1..b7c68f1 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1615,11 +1615,22 @@ asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
if (tvp) {
struct compat_timeval rtv;
+ struct timespec now;
if (current->personality & STICKY_TIMEOUTS)
goto sticky;
rtv.tv_usec = jiffies_to_usecs(do_div((*(u64*)&timeout), HZ));
rtv.tv_sec = timeout;
+
+ ktime_get_ts(&now);
+ now = timespec_sub(end_time, now);
+ if (now.tv_sec < 0 || now.tv_nsec < 0) {
+ memset(&rtv, 0, sizeof(rtv));
+ } else {
+ rtv.tv_sec = now.tv_sec;
+ rtv.tv_usec = now.tv_nsec / NSEC_PER_USEC;
+ }
+
if (compat_timeval_compare(&rtv, &tv) >= 0)
rtv = tv;
if (copy_to_user(tvp, &rtv, sizeof(rtv))) {
@@ -1697,16 +1708,20 @@ asmlinkage long compat_sys_pselect7(int n, compat_ulong_t __user *inp,
if (tsp) {
struct compat_timespec rts;
+ struct timespec now;
if (current->personality & STICKY_TIMEOUTS)
goto sticky;
- rts.tv_sec = timeout / HZ;
- rts.tv_nsec = (timeout % HZ) * (NSEC_PER_SEC/HZ);
- if (rts.tv_nsec >= NSEC_PER_SEC) {
- rts.tv_sec++;
- rts.tv_nsec -= NSEC_PER_SEC;
+ ktime_get_ts(&now);
+ now = timespec_sub(end_time, now);
+ if (now.tv_sec < 0 || now.tv_nsec < 0) {
+ memset(&rts, 0, sizeof(rts));
+ } else {
+ rts.tv_nsec = now.tv_nsec;
+ rts.tv_sec = now.tv_sec;
}
+
if (compat_timespec_compare(&rts, &ts) >= 0)
rts = ts;
if (copy_to_user(tsp, &rts, sizeof(rts))) {
@@ -1818,13 +1833,19 @@ asmlinkage long compat_sys_ppoll(struct pollfd __user *ufds,
if (tsp && timeout >= 0) {
struct compat_timespec rts;
+ struct timespec now;
if (current->personality & STICKY_TIMEOUTS)
goto sticky;
- /* Yes, we know it's actually an s64, but it's also positive. */
- rts.tv_nsec = jiffies_to_usecs(do_div((*(u64*)&timeout), HZ)) *
- 1000;
- rts.tv_sec = timeout;
+ ktime_get_ts(&now);
+ now = timespec_sub(end_time, now);
+ if (now.tv_sec < 0 || now.tv_nsec < 0) {
+ memset(&rts, 0, sizeof(rts));
+ } else {
+ rts.tv_nsec = now.tv_nsec;
+ rts.tv_sec = now.tv_sec;
+ }
+
if (compat_timespec_compare(&rts, &ts) >= 0)
rts = ts;
if (copy_to_user(tsp, &rts, sizeof(rts))) {
diff --git a/fs/select.c b/fs/select.c
index 6a6e33a..3ce5943 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -409,13 +409,24 @@ asmlinkage long sys_select(int n, fd_set __user *inp, fd_set __user *outp,
if (tvp) {
struct timeval rtv;
+ struct timespec now;
if (current->personality & STICKY_TIMEOUTS)
goto sticky;
- rtv.tv_usec = jiffies_to_usecs(do_div((*(u64*)&timeout), HZ));
- rtv.tv_sec = timeout;
+
+ ktime_get_ts(&now);
+ now = timespec_sub(end_time, now);
+ if (now.tv_sec < 0 || now.tv_nsec < 0) {
+ rtv.tv_usec = 0;
+ rtv.tv_sec = 0;
+ } else {
+ rtv.tv_usec = now.tv_nsec / NSEC_PER_USEC;
+ rtv.tv_sec = now.tv_sec;
+ }
+
if (timeval_compare(&rtv, &tv) >= 0)
rtv = tv;
+
if (copy_to_user(tvp, &rtv, sizeof(rtv))) {
sticky:
/*
@@ -481,15 +492,24 @@ asmlinkage long sys_pselect7(int n, fd_set __user *inp, fd_set __user *outp,
ret = core_sys_select(n, inp, outp, exp, &timeout, &end_time);
if (tsp) {
- struct timespec rts;
+ struct timespec rts, now;
if (current->personality & STICKY_TIMEOUTS)
goto sticky;
- rts.tv_nsec = jiffies_to_usecs(do_div((*(u64*)&timeout), HZ)) *
- 1000;
- rts.tv_sec = timeout;
+
+ ktime_get_ts(&now);
+ now = timespec_sub(end_time, now);
+ if (now.tv_sec < 0 || now.tv_nsec < 0) {
+ rts.tv_nsec = 0;
+ rts.tv_sec = 0;
+ } else {
+ rts.tv_nsec = now.tv_nsec;
+ rts.tv_sec = now.tv_sec;
+ }
+
if (timespec_compare(&rts, &ts) >= 0)
rts = ts;
+
if (copy_to_user(tsp, &rts, sizeof(rts))) {
sticky:
/*
@@ -847,16 +867,21 @@ asmlinkage long sys_ppoll(struct pollfd __user *ufds, unsigned int nfds,
sigprocmask(SIG_SETMASK, &sigsaved, NULL);
if (tsp && timeout >= 0) {
- struct timespec rts;
+ struct timespec rts, now;
if (current->personality & STICKY_TIMEOUTS)
goto sticky;
- /* Yes, we know it's actually an s64, but it's also positive. */
- rts.tv_nsec = jiffies_to_usecs(do_div((*(u64*)&timeout), HZ)) *
- 1000;
- rts.tv_sec = timeout;
+
+ ktime_get_ts(&now);
+ now = timespec_sub(end_time, now);
+ if (now.tv_sec < 0 || now.tv_nsec < 0)
+ memset(&rts, 0, sizeof(rts));
+ else
+ rts = now;
+
if (timespec_compare(&rts, &ts) >= 0)
rts = ts;
+
if (copy_to_user(tsp, &rts, sizeof(rts))) {
sticky:
/*
--
1.5.5.1
From: Arjan van de Ven <[email protected]>
Date: Thu, 28 Aug 2008 13:44:03 -0700
Subject: [PATCH] select: introduce a schedule_hrtimeout() function
This patch adds a schedule_hrtimeout() function, to be used by select() and poll()
in a later patch. This function works similar to schedule_timeout() in most ways,
but takes a timespec rather than jiffies.
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/hrtimer.h | 3 ++
kernel/hrtimer.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 60 insertions(+), 0 deletions(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 6d93dce..526b62f 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -346,6 +346,9 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);
extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
struct task_struct *tsk);
+extern signed long schedule_hrtimeout(struct timespec *time, int mode);
+
+
/* Soft interrupt function to run the hrtimer queues: */
extern void hrtimer_run_queues(void);
extern void hrtimer_run_pending(void);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index b8e4dce..0ab7b04 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1678,3 +1678,60 @@ void __init hrtimers_init(void)
#endif
}
+
+/**
+ * schedule_hrtimeout - sleep until timeout
+ * @timeout: timeout value (timespec)
+ * @mode: timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL
+ *
+ * Make the current task sleep until @timeout time has
+ * elapsed. The routine will return immediately unless
+ * the current task state has been set (see set_current_state()).
+ *
+ * You can set the task state as follows -
+ *
+ * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
+ * pass before the routine returns. The routine will return 0
+ *
+ * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
+ * delivered to the current task. In this case a positivee value will
+ * be returned or 0 if the timer expired in time
+ *
+ * The current task state is guaranteed to be TASK_RUNNING when this
+ * routine returns.
+ *
+ * In all cases the return value is guaranteed to be non-negative.
+ */
+signed long __sched schedule_hrtimeout(struct timespec *time, int mode)
+{
+ struct hrtimer_sleeper t;
+ ktime_t expire;
+
+ expire = timespec_to_ktime(*time);
+
+ if (ktime_to_ns(expire) <= 0) {
+ __set_current_state(TASK_RUNNING);
+ return 0;
+ }
+
+ hrtimer_init_on_stack(&t.timer, CLOCK_MONOTONIC, mode);
+ t.timer.expires = expire;
+
+ hrtimer_init_sleeper(&t, current);
+
+ hrtimer_start(&t.timer, t.timer.expires, mode);
+ if (!hrtimer_active(&t.timer))
+ t.task = NULL;
+
+ if (likely(t.task))
+ schedule();
+
+ hrtimer_cancel(&t.timer);
+
+
+ __set_current_state(TASK_RUNNING);
+
+ return t.task == NULL;
+}
+EXPORT_SYMBOL_GPL(schedule_hrtimeout);
+
--
1.5.5.1
From: Arjan van de Ven <[email protected]>
Date: Thu, 28 Aug 2008 13:45:38 -0700
Subject: [PATCH] select: make select() use schedule_hrtimeout()
now that we have schedule_hrtimeout(), make select() use it.
But only for short delays; really long delays are assumed to not
need the highres level of accuracy but rather want the regular
timer behavior for now.
This is only done for delays shorter than 1 second; currently
the assumption is that longer delays have no desire to get the
higher accuracy.
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/select.c | 41 +++++++++++++++++++++++++++++------------
1 files changed, 29 insertions(+), 12 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 3ce5943..62b9767 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -24,6 +24,7 @@
#include <linux/fdtable.h>
#include <linux/fs.h>
#include <linux/rcupdate.h>
+#include <linux/hrtimer.h>
#include <asm/uaccess.h>
@@ -187,6 +188,7 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
struct poll_wqueues table;
poll_table *wait;
int retval, i;
+ int use_hr = 0;
rcu_read_lock();
retval = max_select_fd(n, fds);
@@ -196,6 +198,10 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
return retval;
n = retval;
+ /* Use hrtimers only for short timeouts for now */
+ if (*timeout < HZ && end_time->tv_sec != 0)
+ use_hr = 1;
+
poll_initwait(&table);
wait = &table.pt;
if (!*timeout)
@@ -266,20 +272,31 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
break;
}
- if (*timeout < 0) {
- /* Wait indefinitely */
- __timeout = MAX_SCHEDULE_TIMEOUT;
- } else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT - 1)) {
- /* Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in a loop */
- __timeout = MAX_SCHEDULE_TIMEOUT - 1;
- *timeout -= __timeout;
+ if (use_hr) {
+ struct timespec now;
+ ktime_t expire;
+ schedule_hrtimeout(end_time, HRTIMER_MODE_ABS);
+ getnstimeofday(&now);
+ expire = ktime_sub(timespec_to_ktime(*end_time), timespec_to_ktime(now));
+ if (ktime_to_ns(expire) <= 0)
+ break;
+
} else {
- __timeout = *timeout;
- *timeout = 0;
+ if (*timeout < 0) {
+ /* Wait indefinitely */
+ __timeout = MAX_SCHEDULE_TIMEOUT;
+ } else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT - 1)) {
+ /* Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in a loop */
+ __timeout = MAX_SCHEDULE_TIMEOUT - 1;
+ *timeout -= __timeout;
+ } else {
+ __timeout = *timeout;
+ *timeout = 0;
+ }
+ __timeout = schedule_timeout(__timeout);
+ if (*timeout >= 0)
+ *timeout += __timeout;
}
- __timeout = schedule_timeout(__timeout);
- if (*timeout >= 0)
- *timeout += __timeout;
}
__set_current_state(TASK_RUNNING);
--
1.5.5.1
From: Arjan van de Ven <[email protected]>
Date: Thu, 28 Aug 2008 13:46:47 -0700
Subject: [PATCH] select: make poll() use schedule_hrtimeout() as well
similar to the select() patch, use schedule_hrtimeout() in the
poll() codepath.
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/select.c | 46 ++++++++++++++++++++++++++++++----------------
1 files changed, 30 insertions(+), 16 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 62b9767..e881b97 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -631,6 +631,10 @@ static int do_poll(unsigned int nfds, struct poll_list *list,
{
int count = 0;
poll_table* pt = &wait->pt;
+ int use_hr = 0;
+
+ if (*timeout < HZ && *timeout > 0)
+ use_hr = 1;
/* Optimise the no-wait case */
if (!(*timeout))
@@ -673,24 +677,34 @@ static int do_poll(unsigned int nfds, struct poll_list *list,
if (count || !*timeout)
break;
- if (*timeout < 0) {
- /* Wait indefinitely */
- __timeout = MAX_SCHEDULE_TIMEOUT;
- } else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT-1)) {
- /*
- * Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in
- * a loop
- */
- __timeout = MAX_SCHEDULE_TIMEOUT - 1;
- *timeout -= __timeout;
+ if (use_hr) {
+ struct timespec now;
+ ktime_t expire;
+ schedule_hrtimeout(end_time, HRTIMER_MODE_ABS);
+ getnstimeofday(&now);
+ expire = ktime_sub(timespec_to_ktime(*end_time), timespec_to_ktime(now));
+ if (ktime_to_ns(expire) <= 0)
+ break;
} else {
- __timeout = *timeout;
- *timeout = 0;
- }
+ if (*timeout < 0) {
+ /* Wait indefinitely */
+ __timeout = MAX_SCHEDULE_TIMEOUT;
+ } else if (unlikely(*timeout >= (s64)MAX_SCHEDULE_TIMEOUT-1)) {
+ /*
+ * Wait for longer than MAX_SCHEDULE_TIMEOUT. Do it in
+ * a loop
+ */
+ __timeout = MAX_SCHEDULE_TIMEOUT - 1;
+ *timeout -= __timeout;
+ } else {
+ __timeout = *timeout;
+ *timeout = 0;
+ }
- __timeout = schedule_timeout(__timeout);
- if (*timeout >= 0)
- *timeout += __timeout;
+ __timeout = schedule_timeout(__timeout);
+ if (*timeout >= 0)
+ *timeout += __timeout;
+ }
}
__set_current_state(TASK_RUNNING);
return count;
--
1.5.5.1
From: Arjan van de Ven <[email protected]>
Date: Wed, 27 Aug 2008 11:24:52 -0700
Subject: [PATCH] select: add a timespec version of the timeout to select/poll
This patch adds the end_time parameter to the various pieces in the select()
and poll() paths; but does not use it yet (that is for later patches in
the series). This end_time is the timespec accuracy of the end time (eg
nanoseconds) rather than the current jiffies-juggling that happens all
over the poll/select code. It's also an absolute time, so we can always
see how long there is to go just by looking at the delta between the
end time and the current time.
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/compat.c | 28 ++++++++++++++++++++----
fs/select.c | 56 ++++++++++++++++++++++++++++++++++++++++---------
include/linux/poll.h | 6 ++--
3 files changed, 71 insertions(+), 19 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index c9d1472..7c9b2a1 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1513,7 +1513,8 @@ int compat_set_fd_set(unsigned long nr, compat_ulong_t __user *ufdset,
((unsigned long) (MAX_SCHEDULE_TIMEOUT / HZ)-1)
int compat_core_sys_select(int n, compat_ulong_t __user *inp,
- compat_ulong_t __user *outp, compat_ulong_t __user *exp, s64 *timeout)
+ compat_ulong_t __user *outp, compat_ulong_t __user *exp, s64 *timeout,
+ struct timespec *end_time)
{
fd_set_bits fds;
void *bits;
@@ -1560,7 +1561,7 @@ int compat_core_sys_select(int n, compat_ulong_t __user *inp,
zero_fd_set(n, fds.res_out);
zero_fd_set(n, fds.res_ex);
- ret = do_select(n, &fds, timeout);
+ ret = do_select(n, &fds, timeout, end_time);
if (ret < 0)
goto out;
@@ -1589,6 +1590,7 @@ asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
s64 timeout = -1;
struct compat_timeval tv;
int ret;
+ struct timespec end_time;
if (tvp) {
if (copy_from_user(&tv, tvp, sizeof(tv)))
@@ -1604,9 +1606,12 @@ asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
timeout = DIV_ROUND_UP(tv.tv_usec, 1000000/HZ);
timeout += tv.tv_sec * HZ;
}
+ ktime_get_ts(&end_time);
+ timespec_add_ns(&end_time, tv.tv_usec*1000);
+ end_time.tv_sec += tv.tv_sec;
}
- ret = compat_core_sys_select(n, inp, outp, exp, &timeout);
+ ret = compat_core_sys_select(n, inp, outp, exp, &timeout, &end_time);
if (tvp) {
struct compat_timeval rtv;
@@ -1645,6 +1650,8 @@ asmlinkage long compat_sys_pselect7(int n, compat_ulong_t __user *inp,
sigset_t ksigmask, sigsaved;
s64 timeout = MAX_SCHEDULE_TIMEOUT;
struct compat_timespec ts;
+ struct timespec end_time;
+
int ret;
if (tsp) {
@@ -1653,6 +1660,11 @@ asmlinkage long compat_sys_pselect7(int n, compat_ulong_t __user *inp,
if (ts.tv_sec < 0 || ts.tv_nsec < 0)
return -EINVAL;
+ ktime_get_ts(&end_time);
+ end_time.tv_sec += ts.tv_sec;
+ timespec_add_ns(&end_time, ts.tv_nsec);
+ } else {
+ memset(&end_time, 0, sizeof(end_time));
}
if (sigmask) {
@@ -1679,7 +1691,7 @@ asmlinkage long compat_sys_pselect7(int n, compat_ulong_t __user *inp,
}
}
- ret = compat_core_sys_select(n, inp, outp, exp, &timeout);
+ ret = compat_core_sys_select(n, inp, outp, exp, &timeout, &end_time);
} while (!ret && !timeout && tsp && (ts.tv_sec || ts.tv_nsec));
@@ -1755,6 +1767,7 @@ asmlinkage long compat_sys_ppoll(struct pollfd __user *ufds,
compat_sigset_t ss32;
sigset_t ksigmask, sigsaved;
struct compat_timespec ts;
+ struct timespec end_time;
s64 timeout = -1;
int ret;
@@ -1767,6 +1780,11 @@ asmlinkage long compat_sys_ppoll(struct pollfd __user *ufds,
an s64. Otherwise the compiler bitches at us */
timeout = DIV_ROUND_UP(ts.tv_nsec, 1000000000/HZ);
timeout += ts.tv_sec * HZ;
+ ktime_get_ts(&end_time);
+ end_time.tv_sec += ts.tv_sec;
+ timespec_add_ns(&end_time, ts.tv_nsec);
+ } else {
+ memset(&end_time, 0, sizeof(end_time));
}
if (sigmask) {
@@ -1780,7 +1798,7 @@ asmlinkage long compat_sys_ppoll(struct pollfd __user *ufds,
sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved);
}
- ret = do_sys_poll(ufds, nfds, &timeout);
+ ret = do_sys_poll(ufds, nfds, &timeout, &end_time);
/* We can restart this syscall, usually */
if (ret == -EINTR) {
diff --git a/fs/select.c b/fs/select.c
index da0e882..6a6e33a 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -182,7 +182,7 @@ get_max:
#define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR)
#define POLLEX_SET (POLLPRI)
-int do_select(int n, fd_set_bits *fds, s64 *timeout)
+int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time)
{
struct poll_wqueues table;
poll_table *wait;
@@ -300,7 +300,7 @@ int do_select(int n, fd_set_bits *fds, s64 *timeout)
((unsigned long) (MAX_SCHEDULE_TIMEOUT / HZ)-1)
int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
- fd_set __user *exp, s64 *timeout)
+ fd_set __user *exp, s64 *timeout, struct timespec *end_time)
{
fd_set_bits fds;
void *bits;
@@ -351,7 +351,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
zero_fd_set(n, fds.res_out);
zero_fd_set(n, fds.res_ex);
- ret = do_select(n, &fds, timeout);
+ ret = do_select(n, &fds, timeout, end_time);
if (ret < 0)
goto out;
@@ -380,6 +380,9 @@ asmlinkage long sys_select(int n, fd_set __user *inp, fd_set __user *outp,
s64 timeout = -1;
struct timeval tv;
int ret;
+ struct timespec end_time;
+
+ memset(&end_time, 0, sizeof(end_time));
if (tvp) {
if (copy_from_user(&tv, tvp, sizeof(tv)))
@@ -388,6 +391,11 @@ asmlinkage long sys_select(int n, fd_set __user *inp, fd_set __user *outp,
if (tv.tv_sec < 0 || tv.tv_usec < 0)
return -EINVAL;
+ ktime_get_ts(&end_time);
+ timespec_add_ns(&end_time, tv.tv_usec*1000);
+ end_time.tv_sec += tv.tv_sec;
+
+
/* Cast to u64 to make GCC stop complaining */
if ((u64)tv.tv_sec >= (u64)MAX_INT64_SECONDS)
timeout = -1; /* infinite */
@@ -397,7 +405,7 @@ asmlinkage long sys_select(int n, fd_set __user *inp, fd_set __user *outp,
}
}
- ret = core_sys_select(n, inp, outp, exp, &timeout);
+ ret = core_sys_select(n, inp, outp, exp, &timeout, &end_time);
if (tvp) {
struct timeval rtv;
@@ -434,15 +442,22 @@ asmlinkage long sys_pselect7(int n, fd_set __user *inp, fd_set __user *outp,
s64 timeout = MAX_SCHEDULE_TIMEOUT;
sigset_t ksigmask, sigsaved;
struct timespec ts;
+ struct timespec end_time;
int ret;
+ memset(&end_time, 0, sizeof(end_time));
+
if (tsp) {
+ ktime_get_ts(&end_time);
if (copy_from_user(&ts, tsp, sizeof(ts)))
return -EFAULT;
if (ts.tv_sec < 0 || ts.tv_nsec < 0)
return -EINVAL;
+ end_time.tv_sec += ts.tv_sec;
+ timespec_add_ns(&end_time, ts.tv_nsec);
+
/* Cast to u64 to make GCC stop complaining */
if ((u64)ts.tv_sec >= (u64)MAX_INT64_SECONDS)
timeout = -1; /* infinite */
@@ -463,7 +478,7 @@ asmlinkage long sys_pselect7(int n, fd_set __user *inp, fd_set __user *outp,
sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved);
}
- ret = core_sys_select(n, inp, outp, exp, &timeout);
+ ret = core_sys_select(n, inp, outp, exp, &timeout, &end_time);
if (tsp) {
struct timespec rts;
@@ -574,7 +589,8 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
}
static int do_poll(unsigned int nfds, struct poll_list *list,
- struct poll_wqueues *wait, s64 *timeout)
+ struct poll_wqueues *wait, s64 *timeout,
+ struct timespec *end_time)
{
int count = 0;
poll_table* pt = &wait->pt;
@@ -646,7 +662,8 @@ static int do_poll(unsigned int nfds, struct poll_list *list,
#define N_STACK_PPS ((sizeof(stack_pps) - sizeof(struct poll_list)) / \
sizeof(struct pollfd))
-int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds, s64 *timeout)
+int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds, s64 *timeout,
+ struct timespec *end_time)
{
struct poll_wqueues table;
int err = -EFAULT, fdcount, len, size;
@@ -686,7 +703,7 @@ int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds, s64 *timeout)
}
poll_initwait(&table);
- fdcount = do_poll(nfds, head, &table, timeout);
+ fdcount = do_poll(nfds, head, &table, timeout, end_time);
poll_freewait(&table);
for (walk = head; walk; walk = walk->next) {
@@ -716,8 +733,15 @@ static long do_restart_poll(struct restart_block *restart_block)
int nfds = restart_block->arg1;
s64 timeout = ((s64)restart_block->arg3<<32) | (s64)restart_block->arg2;
int ret;
+ struct timespec end_time;
+
+ memset(&end_time, 0, sizeof(end_time));
+ if (timeout) {
+ ktime_get_ts(&end_time);
+ timespec_add_ns(&end_time, timeout * NSEC_PER_SEC/HZ);
+ }
- ret = do_sys_poll(ufds, nfds, &timeout);
+ ret = do_sys_poll(ufds, nfds, &timeout, &end_time);
if (ret == -EINTR) {
restart_block->fn = do_restart_poll;
restart_block->arg2 = timeout & 0xFFFFFFFF;
@@ -732,6 +756,7 @@ asmlinkage long sys_poll(struct pollfd __user *ufds, unsigned int nfds,
{
s64 timeout_jiffies;
int ret;
+ struct timespec end_time;
if (timeout_msecs > 0) {
#if HZ > 1000
@@ -741,12 +766,15 @@ asmlinkage long sys_poll(struct pollfd __user *ufds, unsigned int nfds,
else
#endif
timeout_jiffies = msecs_to_jiffies(timeout_msecs) + 1;
+ ktime_get_ts(&end_time);
+ timespec_add_ns(&end_time, timeout_msecs*1000*1000);
} else {
/* Infinite (< 0) or no (0) timeout */
timeout_jiffies = timeout_msecs;
+ memset(&end_time, 0, sizeof(end_time));
}
- ret = do_sys_poll(ufds, nfds, &timeout_jiffies);
+ ret = do_sys_poll(ufds, nfds, &timeout_jiffies, &end_time);
if (ret == -EINTR) {
struct restart_block *restart_block;
restart_block = ¤t_thread_info()->restart_block;
@@ -768,6 +796,7 @@ asmlinkage long sys_ppoll(struct pollfd __user *ufds, unsigned int nfds,
sigset_t ksigmask, sigsaved;
struct timespec ts;
s64 timeout = -1;
+ struct timespec end_time;
int ret;
if (tsp) {
@@ -781,6 +810,11 @@ asmlinkage long sys_ppoll(struct pollfd __user *ufds, unsigned int nfds,
timeout = DIV_ROUND_UP(ts.tv_nsec, NSEC_PER_SEC/HZ);
timeout += ts.tv_sec * HZ;
}
+ ktime_get_ts(&end_time);
+ end_time.tv_sec += ts.tv_sec;
+ timespec_add_ns(&end_time, ts.tv_nsec);
+ } else {
+ memset(&end_time, 0, sizeof(end_time));
}
if (sigmask) {
@@ -794,7 +828,7 @@ asmlinkage long sys_ppoll(struct pollfd __user *ufds, unsigned int nfds,
sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved);
}
- ret = do_sys_poll(ufds, nfds, &timeout);
+ ret = do_sys_poll(ufds, nfds, &timeout, &end_time);
/* We can restart this syscall, usually */
if (ret == -EINTR) {
diff --git a/include/linux/poll.h b/include/linux/poll.h
index ef45382..ba9d4ce 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -114,11 +114,11 @@ void zero_fd_set(unsigned long nr, unsigned long *fdset)
#define MAX_INT64_SECONDS (((s64)(~((u64)0)>>1)/HZ)-1)
-extern int do_select(int n, fd_set_bits *fds, s64 *timeout);
+extern int do_select(int n, fd_set_bits *fds, s64 *timeout, struct timespec *end_time);
extern int do_sys_poll(struct pollfd __user * ufds, unsigned int nfds,
- s64 *timeout);
+ s64 *timeout, struct timespec *end_time);
extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
- fd_set __user *exp, s64 *timeout);
+ fd_set __user *exp, s64 *timeout, struct timespec *end_time);
#endif /* KERNEL */
--
1.5.5.1
On Friday 29 August 2008, Arjan van de Ven wrote:
> This is only done for delays shorter than 1 second; currently
> the assumption is that longer delays have no desire to get the
> higher accuracy.
What is the tradeoff here? Where is the advantage in using the
traditional timer interface for longer timeouts?
Arnd <><
On Friday 29 August 2008, Arjan van de Ven wrote:
> With this patch series, the internals of select() and poll() interfaces are
> changed such that they work on the nanosecond level (using hrtimers). The
> userspace interface for select() is in microseconds, for pselect() and
> ppoll() this is in nanoseconds.
Nice work, seems a lot better thought through than my previous attempt [1],
which unfortunately never saw any reactions.
> even though poll() (as opposed to ppoll()) only accepts milliseconds
> as userspace interface, the behavior will still improve because the current
> time no longer needs to be rounded up to the next jiffie, so on average
> a 500 milliseconds behavior improvement.
That would be 500 microseconds I think.
> I'd like to get rid of the jiffies timeout entirely over time, and
> only use hrtimers (makes the code a lot nicer) but that's for now
> a separate step, first I'd like to see how this change pans out.
You mean entirely entirely getting rid of jiffies in the kernel, or just
with user facing interfaces like epoll_wait, rt_sigtimedwait etc?
I think if we want to replace more code with hr timers, we will need
something like dwmw2's range timers based on that in order to keep the
wakeup rate low. These are still jiffies based, which doesn't help
your case, but they should be easy to change to ktime or timespec.
Arnd <><
[1] http://www.ussg.iu.edu/hypermail/linux/kernel/0703.0/1319.html
[2] http://lwn.net/Articles/291159/
On Fri, 29 Aug 2008 17:54:27 +0200
Arnd Bergmann <[email protected]> wrote:
> On Friday 29 August 2008, Arjan van de Ven wrote:
>
> > With this patch series, the internals of select() and poll()
> > interfaces are changed such that they work on the nanosecond level
> > (using hrtimers). The userspace interface for select() is in
> > microseconds, for pselect() and ppoll() this is in nanoseconds.
>
> Nice work, seems a lot better thought through than my previous
> attempt [1], which unfortunately never saw any reactions.
>
> > even though poll() (as opposed to ppoll()) only accepts milliseconds
> > as userspace interface, the behavior will still improve because the
> > current time no longer needs to be rounded up to the next jiffie,
> > so on average a 500 milliseconds behavior improvement.
>
> That would be 500 microseconds I think.
>
> > I'd like to get rid of the jiffies timeout entirely over time, and
> > only use hrtimers (makes the code a lot nicer) but that's for now
> > a separate step, first I'd like to see how this change pans out.
>
> You mean entirely entirely getting rid of jiffies in the kernel, or
> just with user facing interfaces like epoll_wait, rt_sigtimedwait etc?
I meant actually just in the select() and poll() code; right now it is
sort of dual-brained, I'd like to eventually be able to get rid of that
and just use hrtimers for select() and poll() always
--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Fri, 29 Aug 2008, Arjan van de Ven wrote:
>
> now that we have schedule_hrtimeout(), make select() use it.
> But only for short delays; really long delays are assumed to not
> need the highres level of accuracy but rather want the regular
> timer behavior for now.
This is _really_ ugly.
Can't you just do this relaxation in "schedule_hrtimeout()" instead, and
just document the fact that the "high resolution" of hrtimeout is relative
to the length of the timeout.
It's not that select() doesn't care, it's that *NOBODY* cares. If somebody
asks for a timeout of one second and 2 microseconds, the two microseconds
simply don't matter. Ever. But if somebody asks for a timeout of 12
microseconds, individual microseconds probably _do_ matter.
So if you want high-resolution select/poll, then get rid of the "use_hr"
logic entirely, and just do it unconditionally. Then, relax the scheduler
timeouts in the scheduler.
(But, that's probably _generally_ true. Even now, when people do
"schedule_timeout()", there's a big difference between asking for two
ticks and asking for two seconds. The latter should probably try to round
to a nice timer tick basis for power reasons).
Linus
On Fri, 2008-08-29 at 08:08 -0700, Arjan van de Ven wrote:
> rom: Arjan van de Ven <[email protected]>
> Date: Thu, 28 Aug 2008 13:45:38 -0700
> Subject: [PATCH] select: make select() use schedule_hrtimeout()
>
> now that we have schedule_hrtimeout(), make select() use it.
> But only for short delays; really long delays are assumed to not
> need the highres level of accuracy but rather want the regular
> timer behavior for now.
>
> This is only done for delays shorter than 1 second; currently
> the assumption is that longer delays have no desire to get the
> higher accuracy.
Do you have any basis to make this "short delays" assumption? I would
think if regular timers can be used for longer than a second, people
will naturally assume anything else that's high res will also ..
The futex_wait() has code which looks a lot like your
schedule_hrtimeout() , (you might want to replace your code there
instead) , and it doesn't seem to have the short delays restriction ..
Daniel
> "schedule_timeout()", there's a big difference between asking for two
> ticks and asking for two seconds. The latter should probably try to round
> to a nice timer tick basis for power reasons).
I disagree - that is fixing the problem in the wrong place. The timer
structure needs an accuracy field of some form that the existing timer
functions initialise to 0.
On a heavily loaded system with things like network events the behaviour
of the overall system is too complex to do the job well except at the
timer level which seems all. At the virtualisation level the hypervisor
needs to be doing the work to merge timer events between guests for power
management.
Once the timers have an accuracy representation there doesn't need to be
any real difference anywhere in the stack.
Alan
On Fri, 29 Aug 2008 09:20:18 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:
>
>
> On Fri, 29 Aug 2008, Arjan van de Ven wrote:
> >
> > now that we have schedule_hrtimeout(), make select() use it.
> > But only for short delays; really long delays are assumed to not
> > need the highres level of accuracy but rather want the regular
> > timer behavior for now.
>
> This is _really_ ugly.
>
> Can't you just do this relaxation in "schedule_hrtimeout()" instead,
> and just document the fact that the "high resolution" of hrtimeout is
> relative to the length of the timeout.
yes we can
(sorry, bad pun)
> It's not that select() doesn't care, it's that *NOBODY* cares. If
> somebody asks for a timeout of one second and 2 microseconds, the two
> microseconds simply don't matter. Ever. But if somebody asks for a
> timeout of 12 microseconds, individual microseconds probably _do_
> matter.
ok I'll do it this way; I wanted to avoid changing behavior for long
delays (some people think there's too much overhead in hrtimers or
something) but i'm totally fine with not doing this bifurbicated
braintwist at all.
>
> So if you want high-resolution select/poll, then get rid of the
> "use_hr" logic entirely, and just do it unconditionally. Then, relax
> the scheduler timeouts in the scheduler.
ok, no big deal, see below
>
> (But, that's probably _generally_ true. Even now, when people do
> "schedule_timeout()", there's a big difference between asking for two
> ticks and asking for two seconds. The latter should probably try to
> round to a nice timer tick basis for power reasons).
yes that's a separate story; this is what "range timers" are about and
David Woodhouse and I are working on that as well
(but this select stuff is basically a prerequisite to use that)
On Fri, 29 Aug 2008, Alan Cox wrote:
>
> > "schedule_timeout()", there's a big difference between asking for two
> > ticks and asking for two seconds. The latter should probably try to round
> > to a nice timer tick basis for power reasons).
>
> I disagree - that is fixing the problem in the wrong place. The timer
> structure needs an accuracy field of some form that the existing timer
> functions initialise to 0.
I do agree that we could do that too, but you miss one big issue: even if
we were to add an accuracy field inside the kernel, there is no such field
in the user interfaces.
We just pass timevals (and sometimes timespecs) around, and no, they don't
have any way to specify accuracy.
Yeah, we could use the high bits in the usec/nsec words, but then older
kernels would basically do random things, so that would be a horrible
interface.
The other thing to do would be to just add totally new system calls with
totally new interfaces, but (a) nobody would use them anyway and (b) it's
simply not worth it.
So given that reality, and _if_ we want to support nice high-resolution
sleeping by select/poll, the only reasonable thing to do is to estimate
some kind of expected accuracy from the existing timeval/timespec.
And the only reasonable way to do that is to just look at the range. You
can probably do something fairly trivial with
/* Estimate expected accuracy in ns from a timeval */
unsigned long estimate_accuracy(struct timeval *tv)
{
/*
* Tens of ms if we're looking at seconds, even
* more for 10s+ sleeping
*/
if (tv->tv_sec) {
/* Tenths of seconds for long sleeps */
if (tv->tv_sec > 10)
return 100000000;
/*
* Tens of ms for second-granularity sleeps. This,
* btw, is the historical Linux 100Hz timer range.
*/
return 10000000;
}
/* Single msecs if we're looking at milliseconds */
if (tv->tv_usec > 1000)
return 1000000;
/* Aim for tenths of msecs otherwise */
return 100000;
}
and yes, it's just a heuristic, but it's probably not a horribly stupid
one or a very unreasonable one.
Linus
On Fri, 29 Aug 2008 10:26:02 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:
>
>
> On Fri, 29 Aug 2008, Alan Cox wrote:
> >
> > > "schedule_timeout()", there's a big difference between asking for
> > > two ticks and asking for two seconds. The latter should probably
> > > try to round to a nice timer tick basis for power reasons).
> >
> > I disagree - that is fixing the problem in the wrong place. The
> > timer structure needs an accuracy field of some form that the
> > existing timer functions initialise to 0.
>
> I do agree that we could do that too, but you miss one big issue:
> even if we were to add an accuracy field inside the kernel, there is
> no such field in the user interfaces.
>
> We just pass timevals (and sometimes timespecs) around, and no, they
> don't have any way to specify accuracy.
>
> Yeah, we could use the high bits in the usec/nsec words, but then
> older kernels would basically do random things, so that would be a
> horrible interface.
>
> The other thing to do would be to just add totally new system calls
> with totally new interfaces, but (a) nobody would use them anyway and
> (b) it's simply not worth it.
>
> So given that reality, and _if_ we want to support nice
> high-resolution sleeping by select/poll, the only reasonable thing to
> do is to estimate some kind of expected accuracy from the existing
> timeval/timespec.
>
> And the only reasonable way to do that is to just look at the range.
> You can probably do something fairly trivial with
that works
one of the things we were thinking about was to also have a field in
the task struct (so inherited over fork/exec) that is the default
"slack", which can be set via a prctl, so that an admin can do a "nice"
like thing and run certain known silly apps at different granularity
(and a bunch of smart tricks in some key apps that we'll then do)
--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
> I do agree that we could do that too, but you miss one big issue: even if
> we were to add an accuracy field inside the kernel, there is no such field
> in the user interfaces.
Most of the timers we want to aggregate with are kernel originated and
those also need accuracy fields (tcp retransmit is fairly precision at
high speed while tcp keepalive is 'yeah whatever' sort of urgency.
At the point we are merging timers we want to consider user kernel and in
a virtualised environment across all guest.
> We just pass timevals (and sometimes timespecs) around, and no, they don't
> have any way to specify accuracy.
Agreed for userspace.
> Yeah, we could use the high bits in the usec/nsec words, but then older
> kernels would basically do random things, so that would be a horrible
> interface.
Gak.... no...
> And the only reasonable way to do that is to just look at the range. You
> can probably do something fairly trivial with
Agreed except that I'd assume real time tasks implied high accuracy,
otherwise one assumes they'd have not have been real time - and probably
putting those constants into sysfs somewhere.
Alan
On Fri, 29 Aug 2008, Alan Cox wrote:
>
> Most of the timers we want to aggregate with are kernel originated and
> those also need accuracy fields (tcp retransmit is fairly precision at
> high speed while tcp keepalive is 'yeah whatever' sort of urgency.
Actually, I disagree.
They are _both_ of the exact same type: the precision depends on how long
the timeout is.
TCP retransmit timers are a perfect example. If the timeout is long (which
is quite common if you end up having multiple retransmits), you _really_
don't care about how precise it is.
And keepalives are 'yeah whatever' exactly because they are so long, not
because they are inherently uninteresting per se.
I suspect you could find some kernel-generated timer that doesn't fit that
pattern, but I can't really think of any.
> Agreed except that I'd assume real time tasks implied high accuracy,
> otherwise one assumes they'd have not have been real time - and probably
> putting those constants into sysfs somewhere.
And yes, I do agree that the heuristic could well involve other
characteristics of the process in question. And probably characteristics
of the machine itself (ie some general kind of "power mode" where timers
are simply not considered critical if you want to be in low-power mode).
Linus
> And yes, I do agree that the heuristic could well involve other
> characteristics of the process in question. And probably characteristics
> of the machine itself (ie some general kind of "power mode" where timers
> are simply not considered critical if you want to be in low-power mode).
HAL and dbus already deal with this sort of stuff, including now having a
notion of 'on battery' and the like so if its in sysfs it ceases to be
the kernels problem ;)
On Fri, 29 Aug 2008, Arjan van de Ven wrote:
> Today in Linux, select() and poll() operate in jiffies resolution
> (granularity), meaning an effective resolution of 1 millisecond (HZ=1000) to
> 10 milliseconds (HZ=100). Delays shorter than this are not possible, and all
> delays are in multiples of this granularity.
Would a similar change also be useful for epoll_wait()? It's unfortunate
that there's no way to use epoll with more than millisecond granularity,
but it seems like the same arguments could be made as with poll().
Brian
On Fri, 29 Aug 2008 08:06:30 -0700 Arjan van de Ven <[email protected]> wrote:
> + memset(&end_time, 0, sizeof(end_time));
It's a bit dorkward using memset to clear out a structure which
zillions of code sites already know contains only two members.
It could be that two plain old writes is more efficient. (and it could
be that this is how the compiler implements the memset anyway?)
But I'd suggest that adding a new inlined timespec_zero(timespec*) or
timespec_set(timespec*,int,int) would be nicer.
On Fri, 29 Aug 2008, Andrew Morton wrote:
>
> It could be that two plain old writes is more efficient. (and it could
> be that this is how the compiler implements the memset anyway?)
gcc indeed knows about memset(), and will generate just two stores. No
worries,
Linus
On Fri, 29 Aug 2008 19:33:09 +0100
Alan Cox <[email protected]> wrote:
> > And yes, I do agree that the heuristic could well involve other
> > characteristics of the process in question. And probably
> > characteristics of the machine itself (ie some general kind of
> > "power mode" where timers are simply not considered critical if you
> > want to be in low-power mode).
>
> HAL and dbus already deal with this sort of stuff, including now
> having a notion of 'on battery' and the like so if its in sysfs it
> ceases to be the kernels problem ;)
(side note: "on battery" is the worst possible measure for doing 'do I
need to be power sensitive', but you knew that already)
--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org