2002-07-13 08:32:01

by Tim Schmielau

[permalink] [raw]
Subject: What is supposed to replace clock_t?

Hi Linus,

the log message to your patch that splits in-kernel HZ und user-level HZ
mentions the "broken interfaces that still use 'clock_t'". And indeed,
these interfaces are broken now, since some of them now wrap after 49
days, while others wrap after 497 days.

My goal with the "> 497 days uptime patch" was to hide internal overflows
within the kernel, so that every exported value wraps exactly when the
number of _exported_ bits does not suffice to hold the true value.

However, with the new divisor of 10 between internal and external time
values this would now require most internal time values to be stored in
>= 36 bit wide variables (i.e., 64 bit).
Then we could, of course, also extend the exported values where exported
as text, only keeping binary interfaces as 'legacy interfaces'.

Could you please state whether this is your intended direction to go?

Tim


2002-07-13 18:10:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: What is supposed to replace clock_t?



On Sat, 13 Jul 2002, Tim Schmielau wrote:
>
> the log message to your patch that splits in-kernel HZ und user-level HZ
> mentions the "broken interfaces that still use 'clock_t'". And indeed,
> these interfaces are broken now, since some of them now wrap after 49
> days, while others wrap after 497 days.

clock_t is fundamentally broken partly because of the size issue (ie we're
fixed to a 32-bit clock_t on x86, and the 497 day thing is fundamental,
while the 49 day thing is just because I was lazy and didn't want to
bother with 64-bit divisions etc for the broken stuff).

The only sane interface is a seconds-based one, either like /proc/uptime
(ie ASCII floating point representation) or a mixed integer representation
like timeval/timespec where you have seconds and micro/nanoseconds
separately.

> My goal with the "> 497 days uptime patch" was to hide internal overflows
> within the kernel, so that every exported value wraps exactly when the
> number of _exported_ bits does not suffice to hold the true value.

That's something we should strive for, but I also think we should avoid
using the clock_t format at all, and give alternate representations (for
example, leave the broken clock_t representation in /proc/<pid>/stat
alone, and just add a _sane_ seconds-based thing in the much more readable
and parseable /proc/<pid>/status file.

Not all of the olf clock_t-based stuff is worth trying to fix up, imho.

> However, with the new divisor of 10 between internal and external time
> values this would now require most internal time values to be stored in
> >= 36 bit wide variables (i.e., 64 bit).

I agree. We should just make the internal jiffies be 64-bit. That's
already true for the "true" jiffy count, but it's simply not true for some
other things (process counters etc). That's a separate issue from the
fundamental brokenness of

> Then we could, of course, also extend the exported values where exported
> as text, only keeping binary interfaces as 'legacy interfaces'.

Many of the binary interfaces are perfectly fine. In fact, there are very
few binary interfaces that are fundamentally broken, the obvious one being
the "times()" system call that nobody actually uses any more. Fixing
"jiffies_to_timeval()" and friends to do the right thing for 64-bit
jiffies will make a lot of the binary interfaces (ie gettimeofday/select/
getrusage etc) just do the right thing.

I'd like to do this gradually, though. If you're interested, how about
just slowly migrating to a

typedef u64 jiffies_t;

and slowly making the _internal_ stuff be 64-bit clean?

Linus

2002-07-13 20:13:22

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: What is supposed to replace clock_t?

Linus Torvalds writes:

> The only sane interface is a seconds-based one, either like /proc/uptime
> (ie ASCII floating point representation) or a mixed integer representation
> like timeval/timespec where you have seconds and micro/nanoseconds
> separately.

Anything wrong with 64-bit nanoseconds? It's easy to work with,
being an integer type, and it survives the year 2038.

> That's something we should strive for, but I also think we should avoid
> using the clock_t format at all, and give alternate representations (for
> example, leave the broken clock_t representation in /proc/<pid>/stat
> alone, and just add a _sane_ seconds-based thing in the much more readable
> and parseable /proc/<pid>/status file.

Other than the parentheses issue, /proc/<pid>/stat can
be handled with sscanf. Nobody dinks with the format.

People "correct" the spelling and formatting in the
fancy /proc files. Is it "SigCgt" or "SigCat"? That
depends on the kernel version; somebody "fixed" the
spelling.

There isn't any BNF for any /proc file. The raw files
are easy to handle, but one can only guess at what
others will assume about the format of a fancy one.

Take /proc/cpuinfo for example. Long ago, it was like
this:

foo some value
bar 1 2 3 4 5
baz 8000:0

Then it turned into this:

foo : some value
bar : 1 2 3 4 5
baz : 8000:0
uh oh : 69

Who could have guessed?

Stuff broke. Formatted files are too damn tempting
to muck with. People don't touch the ugly files.

2002-07-13 20:28:10

by Thunder from the hill

[permalink] [raw]
Subject: Re: What is supposed to replace clock_t?

Hi,

On Sat, 13 Jul 2002, Albert D. Cahalan wrote:
> foo some value
> bar 1 2 3 4 5
> baz 8000:0
>
> Then it turned into this:
>
> foo : some value
> bar : 1 2 3 4 5
> baz : 8000:0
> uh oh : 69

$buf = <MyFile>;
$buf =~ s/\s*\:\s*/\t/g;
printf("%s\n", $buf);

This should bring you back the old tab-delimited format. Use it to fix
your parsers.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------

2002-07-14 05:37:54

by Zack Weinberg

[permalink] [raw]
Subject: Re: What is supposed to replace clock_t?

Linus Torvalds wrote:
...
> Many of the binary interfaces are perfectly fine. In fact, there are
> very few binary interfaces that are fundamentally broken, the
> obvious one being the "times()" system call that nobody actually
> uses any more.

Er, no; people do still use times(). It is (as far as I know) the
only way for a process to determine how much wall-clock, user CPU, and
system CPU time it has consumed, all at once. If you use getrusage()
instead, you also have to call gettimeofday() to get the wall-clock
time, which at least doubles the overhead. [Both getrusage and
gettimeofday are somewhat more expensive than times, but I'm pretty
sure the trap cost dominates.] Profiling code tries very hard to have
as little overhead as possible.

As an application programmer, I would be perfectly happy to use an
extended getrusage() that gave me wall-clock time as a struct timeval.
But please do provide such an interface before deprecating times().

[While we're at it, it would be nice if the kernel would provide
useful values for more of the fields of struct rusage.]

zw

2002-07-14 18:41:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: What is supposed to replace clock_t?



On Sat, 13 Jul 2002, Albert D. Cahalan wrote:
> Linus Torvalds writes:
>
> > The only sane interface is a seconds-based one, either like /proc/uptime
> > (ie ASCII floating point representation) or a mixed integer representation
> > like timeval/timespec where you have seconds and micro/nanoseconds
> > separately.
>
> Anything wrong with 64-bit nanoseconds? It's easy to work with,
> being an integer type, and it survives the year 2038.

That still counts as being "seconds-based" in my book - the problem with
clock_t (and jiffies) has always been that it has been based not on a
globally defined time-standard, but on an implementation issue.

And we want to be able to change the implementation issue at will.

Linus

2002-07-16 10:51:31

by Tim Schmielau

[permalink] [raw]
Subject: [patch 1/2] Re: What is supposed to replace clock_t?

On Sat, 13 Jul 2002, Linus Torvalds wrote:

> (for example, leave the broken clock_t representation in /proc/<pid>/stat
> alone, and just add a _sane_ seconds-based thing in the much more readable
> and parseable /proc/<pid>/status file.
>
> Not all of the olf clock_t-based stuff is worth trying to fix up, imho.

However, I also don't see a reason not to fix /proc/<pid>/stat, either.
(see my following patch 2/3)

> We should just make the internal jiffies be 64-bit. That's
> already true for the "true" jiffy count, but it's simply not true for some
> other things (process counters etc).
[...]
> Fixing "jiffies_to_timeval()" and friends to do the right thing for 64-bit
> jiffies will make a lot of the binary interfaces (ie gettimeofday/select/
> getrusage etc) just do the right thing.

Do we actually need to correct anything in gettimeofday() and select() ?
The timeout in select() probably shouldn't exceed 24 days, and
gettimeofday() only uses jiffies to account for a few lost jiffies, as far
as I can tell.
The real work seems to be getrusage(), since this requires having 64 bit
per process statistics first.

>
> I'd like to do this gradually, though. If you're interested, how about
> just slowly migrating to a
>
> typedef u64 jiffies_t;
>
> and slowly making the _internal_ stuff be 64-bit clean?


sure, first hunk of patches comes right now. Part 1/2 is simple:

- introduce jiffies_t as u64
- introduce jiffies_to_user_HZ()
- use 64 bit jiffies for uptime

Since we certainly don't want to have a jiffies_64_to_clock_t_64()
function, I made jiffies_to_user_HZ() return an unsigned long long,
which is what we finally want to feed to printk anyways.

Tim
(email: "Tim Schmielau" <[email protected]>)


--- linux-2.5.25/include/linux/types.h Fri Jun 21 00:53:49 2002
+++ linux-2.5.25-j64/include/linux/types.h Tue Jul 16 08:18:05 2002
@@ -25,6 +25,7 @@
typedef __kernel_suseconds_t suseconds_t;

#ifdef __KERNEL__
+typedef u64 jiffies_t;
typedef __kernel_uid32_t uid_t;
typedef __kernel_gid32_t gid_t;
typedef __kernel_uid16_t uid16_t;

--- linux-2.5.25/include/linux/times.h Sat Jul 13 08:40:21 2002
+++ linux-2.5.25-j64/include/linux/times.h Tue Jul 16 09:15:32 2002
@@ -1,8 +1,16 @@
#ifndef _LINUX_TIMES_H
#define _LINUX_TIMES_H

+#include <asm/div64.h>
+
#ifdef __KERNEL__
# define jiffies_to_clock_t(x) ((x) / (HZ / USER_HZ))
+
+static inline unsigned long long jiffies_to_user_HZ(jiffies_t x)
+{
+ do_div(x, HZ / USER_HZ);
+ return x;
+}
#endif

struct tms {

--- linux-2.5.25/include/linux/jiffies.h Fri Jun 21 00:53:42 2002
+++ linux-2.5.25-j64/include/linux/jiffies.h Tue Jul 16 09:47:15 2002
@@ -2,14 +2,37 @@
#define _LINUX_JIFFIES_H

#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <asm/system.h>
#include <asm/param.h> /* for HZ */

/*
* The 64-bit value is not volatile - you MUST NOT read it
- * without holding read_lock_irq(&xtime_lock)
+ * without holding read_lock_irq(&xtime_lock).
+ * get_jiffies64() will handle this for you as appropriate.
*/
-extern u64 jiffies_64;
+extern jiffies_t jiffies_64;
extern unsigned long volatile jiffies;
+
+/*
+ * We assume only 64 bit architectures have BITS_PER_LONG == 64 and
+ * that these access longs atomically.
+ */
+static inline jiffies_t get_jiffies_64(void)
+{
+#if BITS_PER_LONG < 64
+ extern rwlock_t xtime_lock;
+ unsigned long flags;
+ jiffies_t tmp;
+
+ read_lock_irqsave(&xtime_lock, flags);
+ tmp = jiffies_64;
+ read_unlock_irqrestore(&xtime_lock, flags);
+ return tmp;
+#else
+ return (jiffies_t) jiffies;
+#endif
+}

/*
* These inlines deal with timer wrapping correctly. You are

--- linux-2.5.25/kernel/timer.c Sat Jul 13 08:40:21 2002
+++ linux-2.5.25-j64/kernel/timer.c Tue Jul 16 08:22:12 2002
@@ -24,8 +24,10 @@
#include <linux/interrupt.h>
#include <linux/tqueue.h>
#include <linux/kernel_stat.h>
+#include <linux/jiffies.h>

#include <asm/uaccess.h>
+#include <asm/div64.h>

struct kernel_stat kstat;

@@ -922,13 +924,16 @@
asmlinkage long sys_sysinfo(struct sysinfo *info)
{
struct sysinfo val;
+ u64 uptime;
unsigned long mem_total, sav_total;
unsigned int mem_unit, bitcount;

memset((char *)&val, 0, sizeof(struct sysinfo));

read_lock_irq(&xtime_lock);
- val.uptime = jiffies / HZ;
+ uptime = jiffies_64;
+ do_div(uptime, HZ);
+ val.uptime = (unsigned long) uptime;

val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT);
val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT);

--- linux-2.5.25/fs/proc/proc_misc.c Sat Jul 13 08:40:20 2002
+++ linux-2.5.25-j64/fs/proc/proc_misc.c Tue Jul 16 09:50:36 2002
@@ -37,10 +37,12 @@
#include <linux/smp_lock.h>
#include <linux/seq_file.h>
#include <linux/times.h>
+#include <linux/jiffies.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/io.h>
+#include <asm/div64.h>


#define LOAD_INT(x) ((x) >> FSHIFT)
@@ -94,34 +96,27 @@
static int uptime_read_proc(char *page, char **start, off_t off,
int count, int *eof, void *data)
{
- unsigned long uptime;
- unsigned long idle;
+ u64 uptime, idle;
+ unsigned long uptime_remainder, idle_remainder;
int len;

- uptime = jiffies;
- idle = init_task.utime + init_task.stime;
+ uptime = get_jiffies_64();
+ uptime_remainder = (unsigned long) do_div(uptime, HZ);
+ idle = (u64) init_task.utime + init_task.stime;
+ idle_remainder = (unsigned long) do_div(idle, HZ);

- /* The formula for the fraction parts really is ((t * 100) / HZ) % 100, but
- that would overflow about every five days at HZ == 100.
- Therefore the identity a = (a / b) * b + a % b is used so that it is
- calculated as (((t / HZ) * 100) + ((t % HZ) * 100) / HZ) % 100.
- The part in front of the '+' always evaluates as 0 (mod 100). All divisions
- in the above formulas are truncating. For HZ being a power of 10, the
- calculations simplify to the version in the #else part (if the printf
- format is adapted to the same number of digits as zeroes in HZ.
- */
#if HZ!=100
len = sprintf(page,"%lu.%02lu %lu.%02lu\n",
- uptime / HZ,
- (((uptime % HZ) * 100) / HZ) % 100,
- idle / HZ,
- (((idle % HZ) * 100) / HZ) % 100);
+ (unsigned long) uptime,
+ (uptime_remainder * 100) / HZ,
+ (unsigned long) idle,
+ (idle_remainder * 100) / HZ);
#else
len = sprintf(page,"%lu.%02lu %lu.%02lu\n",
- uptime / HZ,
- uptime % HZ,
- idle / HZ,
- idle % HZ);
+ (unsigned long) uptime,
+ uptime_remainder,
+ (unsigned long) idle,
+ idle_remainder;
#endif
return proc_calc_metrics(page, start, off, count, eof, len);
}
@@ -278,7 +273,7 @@
{
int i, len;
extern unsigned long total_forks;
- unsigned long jif = jiffies;
+ jiffies_t jif = get_jiffies_64();
unsigned int sum = 0, user = 0, nice = 0, system = 0;
int major, disk;

@@ -295,21 +290,22 @@
#endif
}

- len = sprintf(page, "cpu %u %u %u %lu\n",
+ len = sprintf(page, "cpu %u %u %u %llu\n",
jiffies_to_clock_t(user),
jiffies_to_clock_t(nice),
jiffies_to_clock_t(system),
- jiffies_to_clock_t(jif * num_online_cpus() - (user + nice + system)));
+ jiffies_to_user_HZ(jif * num_online_cpus()
+ - user - nice - system));
for (i = 0 ; i < NR_CPUS; i++){
if (!cpu_online(i)) continue;
- len += sprintf(page + len, "cpu%d %u %u %u %lu\n",
+ len += sprintf(page + len, "cpu%d %u %u %u %llu\n",
i,
jiffies_to_clock_t(kstat.per_cpu_user[i]),
jiffies_to_clock_t(kstat.per_cpu_nice[i]),
jiffies_to_clock_t(kstat.per_cpu_system[i]),
- jiffies_to_clock_t(jif - ( kstat.per_cpu_user[i] \
- + kstat.per_cpu_nice[i] \
- + kstat.per_cpu_system[i])));
+ jiffies_to_user_HZ(jif - kstat.per_cpu_user[i]
+ - kstat.per_cpu_nice[i]
+ - kstat.per_cpu_system[i]));
}
len += sprintf(page + len,
"page %u %u\n"
@@ -346,12 +342,13 @@
}
}

+ do_div(jif, HZ);
len += sprintf(page + len,
"\nctxt %lu\n"
"btime %lu\n"
"processes %lu\n",
nr_context_switches(),
- xtime.tv_sec - jif / HZ,
+ xtime.tv_sec - (unsigned long) jif,
total_forks);

return proc_calc_metrics(page, start, off, count, eof, len);



2002-07-16 10:57:37

by Tim Schmielau

[permalink] [raw]
Subject: [patch 2/2] Re: What is supposed to replace clock_t?

- extend start_time in task_struct to 64 bits

Together with the previous patch the most obvious 49 day wraparounds
should be gone now. For wraparounds to happen, a single process needs to
get more that 49 days of cpu time.
Fixing this (i.e., the per-process statistics) will be a little bit more
difficult because of the locking issues, so this will happen later.

Tim


--- linux-2.5.25/include/linux/sched.h Sat Jul 13 08:40:21 2002
+++ linux-2.5.25-j64/include/linux/sched.h Tue Jul 16 09:01:09 2002
@@ -310,7 +310,7 @@
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
unsigned long utime, stime, cutime, cstime;
- unsigned long start_time;
+ jiffies_t start_time;
long per_cpu_utime[NR_CPUS], per_cpu_stime[NR_CPUS];
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;

--- linux-2.5.25/kernel/fork.c Sat Jul 13 08:40:21 2002
+++ linux-2.5.25-j64/kernel/fork.c Tue Jul 16 09:42:02 2002
@@ -25,6 +25,7 @@
#include <linux/binfmts.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/jiffies.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -700,7 +701,7 @@
#endif
p->array = NULL;
p->lock_depth = -1; /* -1 = no lock */
- p->start_time = jiffies;
+ p->start_time = get_jiffies_64();

INIT_LIST_HEAD(&p->local_pages);


--- linux-2.5.25/fs/proc/array.c Sat Jul 13 08:40:20 2002
+++ linux-2.5.25-j64/fs/proc/array.c Tue Jul 16 09:02:46 2002
@@ -346,7 +346,7 @@
ppid = task->pid ? task->real_parent->pid : 0;
read_unlock(&tasklist_lock);
res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
-%lu %lu %lu %lu %lu %ld %ld %ld %ld %ld %ld %lu %lu %ld %lu %lu %lu %lu %lu \
+%lu %lu %lu %lu %lu %ld %ld %ld %ld %ld %ld %llu %lu %ld %lu %lu %lu %lu %lu \
%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
task->pid,
task->comm,
@@ -369,7 +369,7 @@
nice,
0UL /* removed */,
jiffies_to_clock_t(task->it_real_value),
- jiffies_to_clock_t(task->start_time),
+ jiffies_to_user_HZ(task->start_time),
vsize,
mm ? mm->rss : 0, /* you might want to shift this left 3 */
task->rlim[RLIMIT_RSS].rlim_cur,

--- linux-2.5.25/mm/oom_kill.c Sat Jul 13 08:40:21 2002
+++ linux-2.5.25-j64/mm/oom_kill.c Tue Jul 16 08:37:52 2002
@@ -73,7 +73,7 @@
* but we don't care _that_ much...
*/
cpu_time = (p->utime + p->stime) >> (SHIFT_HZ + 3);
- run_time = (jiffies - p->start_time) >> (SHIFT_HZ + 10);
+ run_time = (int)((get_jiffies_64() - p->start_time) >> (SHIFT_HZ + 10));

points /= int_sqrt(cpu_time);
points /= int_sqrt(int_sqrt(run_time));

--- linux-2.5.25/kernel/acct.c Sat Jul 13 08:40:21 2002
+++ linux-2.5.25-j64/kernel/acct.c Tue Jul 16 09:56:08 2002
@@ -50,6 +50,7 @@
#include <linux/file.h>
#include <linux/tty.h>
#include <asm/uaccess.h>
+#include <asm/div64.h>

/*
* These constants control the amount of freespace that suspend and
@@ -299,6 +300,7 @@
mm_segment_t fs;
unsigned long vsize;
unsigned long flim;
+ u64 elapsed;

/*
* First check to see if there is enough free_space to continue
@@ -316,9 +318,11 @@
strncpy(ac.ac_comm, current->comm, ACCT_COMM);
ac.ac_comm[ACCT_COMM - 1] = '\0';

- ac.ac_btime = CT_TO_SECS(current->start_time) +
- (xtime.tv_sec - (jiffies / HZ));
- ac.ac_etime = encode_comp_t(jiffies - current->start_time);
+ elapsed = get_jiffies_64() - current->start_time;
+ ac.ac_etime = encode_comp_t(elapsed < (unsigned long) -1l ?
+ (unsigned long) elapsed : (unsigned long) -1l);
+ do_div(elapsed, HZ);
+ ac.ac_btime = xtime.tv_sec - (__u32) elapsed;
ac.ac_utime = encode_comp_t(current->utime);
ac.ac_stime = encode_comp_t(current->stime);
ac.ac_uid = current->uid;