2005-11-15 03:30:47

by Shailabh Nagar

Subject: [Patch 1/4] Delay accounting: Initialization

delayacct-init.patch

Set up data structures for collecting per-task delay statistics, which
measure how long a task had to wait for CPU, block I/O and page fault
completion. The collection of statistics and the interface are in other
patches, for easier review and selection.

Signed-off-by: Shailabh Nagar <[email protected]>

 include/linux/delayacct.h |   57 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h     |   11 +++++++++++
 init/Kconfig              |   13 +++++++++++++
 init/main.c               |    2 ++
 kernel/fork.c             |    2 ++
 5 files changed, 85 insertions(+)

Index: linux-2.6.14/init/Kconfig
===================================================================
--- linux-2.6.14.orig/init/Kconfig
+++ linux-2.6.14/init/Kconfig
@@ -162,6 +162,19 @@ config BSD_PROCESS_ACCT_V3
 	  for processing it. A preliminary version of these tools is available
 	  at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.
 
+config DELAY_ACCT
+	bool "Enable per-task delay accounting (EXPERIMENTAL)"
+	help
+	  Collect information on time spent by a task waiting for system
+	  resources like CPU, block I/O completion and page fault servicing.
+	  Such statistics can help in setting a task's priorities relative
+	  to other tasks for CPU, I/O, RSS limits, etc.
+
+	  Unlike BSD process accounting, this information is available
+	  continuously during the lifetime of a task.
+
+	  Say N if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---
Index: linux-2.6.14/include/linux/sched.h
===================================================================
--- linux-2.6.14.orig/include/linux/sched.h
+++ linux-2.6.14/include/linux/sched.h
@@ -497,6 +497,14 @@ struct sched_info {
 extern struct file_operations proc_schedstat_operations;
 #endif
 
+#ifdef CONFIG_DELAY_ACCT
+struct task_delay_info {
+	spinlock_t lock;
+
+	/* Add stats in pairs: uint64_t delay, uint32_t count */
+};
+#endif
+
 enum idle_type
 {
 	SCHED_IDLE,
@@ -813,6 +821,9 @@ struct task_struct {
 	int cpuset_mems_generation;
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
+#ifdef CONFIG_DELAY_ACCT
+	struct task_delay_info delays;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6.14/kernel/fork.c
===================================================================
--- linux-2.6.14.orig/kernel/fork.c
+++ linux-2.6.14/kernel/fork.c
@@ -42,6 +42,7 @@
 #include <linux/profile.h>
 #include <linux/rmap.h>
 #include <linux/acct.h>
+#include <linux/delayacct.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -934,6 +935,7 @@ static task_t *copy_process(unsigned lon
 	if (p->binfmt && !try_module_get(p->binfmt->module))
 		goto bad_fork_cleanup_put_domain;
 
+	delayacct_init(p);
 	p->did_exec = 0;
 	copy_flags(clone_flags, p);
 	p->pid = pid;
Index: linux-2.6.14/init/main.c
===================================================================
--- linux-2.6.14.orig/init/main.c
+++ linux-2.6.14/init/main.c
@@ -47,6 +47,7 @@
 #include <linux/rmap.h>
 #include <linux/mempolicy.h>
 #include <linux/key.h>
+#include <linux/delayacct.h>
 #include <net/sock.h>
 
 #include <asm/io.h>
@@ -537,6 +538,7 @@ asmlinkage void __init start_kernel(void
 	proc_root_init();
 #endif
 	cpuset_init();
+	delayacct_init(&init_task);
 
 	check_bugs();

Index: linux-2.6.14/include/linux/delayacct.h
===================================================================
--- /dev/null
+++ linux-2.6.14/include/linux/delayacct.h
@@ -0,0 +1,57 @@
+/* delayacct.h - for delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
+ * Based on earlier work by Hubertus Franke, IBM Corp. 2003
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKDELAYS_H
+#define _LINUX_TASKDELAYS_H
+
+#include <linux/types.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_DELAY_ACCT
+
+static inline void delayacct_init(struct task_struct *tsk)
+{
+	memset(&tsk->delays, 0, sizeof(tsk->delays));
+	spin_lock_init(&tsk->delays.lock);
+}
+
+#define delayacct_def_var(ts) unsigned long long ts
+static inline void delayacct_timestamp(unsigned long long *ts)
+{
+	*ts = sched_clock();
+}
+
+/* Because of hardware timer drift on SMP systems, and because a task may
+ * continue on a different CPU than the one where start_ts was taken,
+ * end_ts may be earlier than start_ts by some usecs. In that case we
+ * ignore the difference and add nothing to the total.
+ */
+#ifdef CONFIG_SMP
+#define test_ts_integrity(start_ts, end_ts) (likely((end_ts) > (start_ts)))
+#else
+#define test_ts_integrity(start_ts, end_ts) (1)
+#endif
+
+#else
+
+static inline void delayacct_init(struct task_struct *tsk)
+{}
+#define delayacct_def_var(ts)
+static inline void delayacct_timestamp(unsigned long long *ts)
+{}
+
+#endif /* CONFIG_DELAY_ACCT */
+
+
+#endif /* _LINUX_TASKDELAYS_H */
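
For reviewers, a rough sketch of how an accounting site could use the
helpers above. This is illustrative only, not part of the patch: the
wrapper and the blkio_delay/blkio_count fields are made up here; the
actual collection points and stat fields come in the later patches of
this series.

/* Illustrative only: time one block I/O wait and fold it into the
 * per-task totals, following the "uint64_t delay, uint32_t count"
 * pairing described in struct task_delay_info above. */
static inline void delayacct_blkio_wait(struct task_struct *tsk)
{
	delayacct_def_var(start);
	delayacct_def_var(end);

	delayacct_timestamp(&start);
	io_schedule();		/* wait for block I/O completion */
	delayacct_timestamp(&end);

	if (test_ts_integrity(start, end)) {
		spin_lock(&tsk->delays.lock);
		tsk->delays.blkio_delay += end - start;
		tsk->delays.blkio_count++;
		spin_unlock(&tsk->delays.lock);
	}
}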


2005-11-15 04:20:38

by Andrew Morton

Subject: Re: [Patch 1/4] Delay accounting: Initialization

Shailabh Nagar <[email protected]> wrote:
>
> + *ts = sched_clock();

I'm not sure that it's kosher to use sched_clock() for fine-grained
timestamping like this. Ingo had issues with it last time this happened?

<too lazy to read all the code> Do you normalise these numbers in some
manner before presenting them to userspace? If so, by what means?

2005-11-15 04:25:28

by Parag Warudkar

Subject: Re: [Patch 1/4] Delay accounting: Initialization


On Nov 14, 2005, at 11:35 PM, Shailabh Nagar wrote:

> +/* Because of hardware timer drift on SMP systems, and because a task may
> + * continue on a different CPU than the one where start_ts was taken,
> + * end_ts may be earlier than start_ts by some usecs. In that case we
> + * ignore the difference and add nothing to the total.

Curious as to when this would occur. Probably for tasks running on an
SMP machine for a very short period of time (timer drift should
hopefully not be that high) and switching CPUs in that short period?

> +config STATS_CONNECTOR
> +config DELAY_ACCT

Probably TASK_DELAY_STATS_CONNECTOR and TASK_DELAY_ACCOUNTING are
better names?

> @@ -813,6 +821,9 @@ struct task_struct {
> 	int cpuset_mems_generation;
> #endif
> 	atomic_t fs_excl;	/* holding fs exclusive resources */
> +#ifdef CONFIG_DELAY_ACCT
> +	struct task_delay_info delays;
> +#endif
> };

Does this mean that, whether or not per-task delay accounting is used,
we have a constant overhead of sizeof(spinlock_t) + 2*sizeof(uint32_t)
+ 2*sizeof(uint64_t) bytes going into struct task_struct? Is it
possible/beneficial to use struct task_delay_info *delays instead and
allocate it only if a task wants to use the information?

Parag

2005-11-15 12:01:01

by Marcelo Tosatti

Subject: Re: [Patch 1/4] Delay accounting: Initialization

On Mon, Nov 14, 2005 at 08:20:17PM -0800, Andrew Morton wrote:
> Shailabh Nagar <[email protected]> wrote:
> >
> > + *ts = sched_clock();
>
> I'm not sure that it's kosher to use sched_clock() for fine-grained
> timestamping like this. Ingo had issues with it last time this happened?

If the system boots with use_rtc == 0 you're going to get jiffies based
resolution from sched_clock(). I have a 1GHz Pentium 3 around here which
does that.

Maybe use do_gettimeofday() for such systems?

Would be nice to have a sort of per-arch overridable "gettime()" function?

> <too lazy to read all the code> Do you normalise these numbers in some
> manner before presenting them to userspace? If so, by what means?

2005-11-15 15:05:08

by Shailabh Nagar

Subject: Re: [Patch 1/4] Delay accounting: Initialization

Andrew Morton wrote:
> Shailabh Nagar <[email protected]> wrote:
>
>>+ *ts = sched_clock();
>
>
> I'm not sure that it's kosher to use sched_clock() for fine-grained
> timestamping like this. Ingo had issues with it last time this happened?
>
> <too lazy to read all the code> Do you normalise these numbers in some
> manner before presenting them to userspace? If so, by what means?

The cpu delay data collected by schedstats (which is jiffies-based)
is normalized to nanosecs. The timestamps based on sched_clock() are
exported as-is. As Marcelo pointed out, that's not good enough, since
sched_clock() itself could return jiffies-based resolution. So some
normalization will be needed for that data as well.
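
For concreteness, the normalization in question is just a
jiffies-to-nanoseconds scaling, along these lines (a sketch, assuming a
jiffies-granular count and an HZ that divides NSEC_PER_SEC evenly):

/* Sketch only: scale a jiffies-granular delay to nanoseconds. */
static inline unsigned long long delay_jiffies_to_ns(unsigned long j)
{
	return (unsigned long long)j * (NSEC_PER_SEC / HZ);
}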



2005-11-15 15:15:48

by Shailabh Nagar

Subject: Re: [Patch 1/4] Delay accounting: Initialization

Marcelo Tosatti wrote:
> On Mon, Nov 14, 2005 at 08:20:17PM -0800, Andrew Morton wrote:
>
>>Shailabh Nagar <[email protected]> wrote:
>>
>>>+ *ts = sched_clock();
>>
>>I'm not sure that it's kosher to use sched_clock() for fine-grained
>>timestamping like this. Ingo had issues with it last time this happened?
>
>
> If the system boots with use_rtc == 0 you're going to get jiffies based
> resolution from sched_clock(). I have a 1GHz Pentium 3 around here which
> does that.

Good point, thanks. This reemphasizes the need for better normalization
at output time.

> Maybe use do_gettimeofday() for such systems?

Perhaps getnstimeofday(), so resolution isn't reduced to msec level
unnecessarily. In these patches, userspace takes responsibility for
handling wraparound, so delivering reasonably high-resolution delay
data from the kernel is preferable.

>
> Would be nice to have a sort of per-arch overridable "gettime()" function?
>

Provided as part of this patch ?


>><too lazy to read all the code> Do you normalise these numbers in some
>>manner before presenting them to userspace? If so, by what means?



2005-11-15 17:32:19

by Marcelo Tosatti

Subject: Re: [Patch 1/4] Delay accounting: Initialization

On Tue, Nov 15, 2005 at 10:19:17AM -0500, Shailabh Nagar wrote:
> Marcelo Tosatti wrote:
> > On Mon, Nov 14, 2005 at 08:20:17PM -0800, Andrew Morton wrote:
> >
> >>Shailabh Nagar <[email protected]> wrote:
> >>
> >>>+ *ts = sched_clock();
> >>
> >>I'm not sure that it's kosher to use sched_clock() for fine-grained
> >>timestamping like this. Ingo had issues with it last time this happened?

Maybe Ingo had some issue other than !use_rtc ? Better check.

> > If the system boots with use_rtc == 0 you're going to get jiffies based
> > resolution from sched_clock(). I have a 1GHz Pentium 3 around here which
> > does that.
>
> Good point, thanks. This reemphasizes the need for better normalization
> at output time.

> > Maybe use do_gettimeofday() for such systems?
>
> > Perhaps getnstimeofday(), so resolution isn't reduced to msec level
> > unnecessarily. In these patches, userspace takes responsibility for
> > handling wraparound, so delivering reasonably high-resolution delay
> > data from the kernel is preferable.
>
> >
> > Would be nice to have a sort of per-arch overridable "gettime()" function?
> >
>
> Provided as part of this patch ?

Yep, think so. My comment meant that it's nice to hide away
architecture-specific code from generic code, so you don't have to add
#ifdef's and such.

Not sure about the nicest way to do that.

2005-11-15 17:46:18

by Marcelo Tosatti

Subject: Re: [Patch 1/4] Delay accounting: Initialization

On Tue, Nov 15, 2005 at 10:20:35AM -0200, Marcelo Tosatti wrote:
> On Tue, Nov 15, 2005 at 10:19:17AM -0500, Shailabh Nagar wrote:
> > Marcelo Tosatti wrote:
> > > On Mon, Nov 14, 2005 at 08:20:17PM -0800, Andrew Morton wrote:
> > >
> > >>Shailabh Nagar <[email protected]> wrote:
> > >>
> > >>>+ *ts = sched_clock();
> > >>
> > >>I'm not sure that it's kosher to use sched_clock() for fine-grained
> > >>timestamping like this. Ingo had issues with it last time this happened?
>
> Maybe Ingo had some issue other than !use_rtc ? Better check.
>
> > > If the system boots with use_rtc == 0 you're going to get jiffies based
> > > resolution from sched_clock(). I have a 1GHz Pentium 3 around here which
> > > does that.
> >
> > Good point, thanks. This reemphasizes the need for better normalization
> > at output time.
>
> > > Maybe use do_gettimeofday() for such systems?
> >
> > Perhaps getnstimeofday(), so resolution isn't reduced to msec level unnecessarily.
Yep.
> > In these patches, userspace takes responsibility for handling
> > wraparound, so delivering reasonably high-resolution delay data from
> > the kernel is preferable.
> >
> > >
> > > Would be nice to have a sort of per-arch overridable "gettime()" function?
> > >
> >
> > Provided as part of this patch ?
>
> Yep, think so.

Actually I don't think there is any kind of "get_timestamp()" style
API now. Maybe it would be nice to create one? It's generic
functionality after all.

blktrace is using sched_clock() too.
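
One possible shape for it, sketched here as a weak generic definition
that an architecture could override with a strong one; the name
delayacct_get_ts() is made up:

#include <linux/time.h>

/* Hypothetical generic timestamp helper: nanosecond units, with a
 * safe default via getnstimeofday(); an arch with a reliable
 * fine-grained clock could supply its own definition. */
unsigned long long __attribute__((weak)) delayacct_get_ts(void)
{
	struct timespec ts;

	getnstimeofday(&ts);
	return (unsigned long long)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}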


> My comment meant that it's nice to hide away architecture-specific
> code from generic code, so you don't have to add #ifdef's and such.
>
> Not sure about the nicest way to do that.

2005-11-15 22:25:19

by Shailabh Nagar

Subject: Re: [Patch 1/4] Delay accounting: Initialization

Parag Warudkar wrote:
>
> On Nov 14, 2005, at 11:35 PM, Shailabh Nagar wrote:
>
>> +/* Because of hardware timer drift on SMP systems, and because a task may
>> + * continue on a different CPU than the one where start_ts was taken,
>> + * end_ts may be earlier than start_ts by some usecs. In that case we
>> + * ignore the difference and add nothing to the total.
>
>
> Curious as to when would this occur. Probably for tasks running on a
> SMP machine for a very short period of time (timer drift should not be
> hopefully that high) and switching CPUs in that short period of time?

Possibly. Also, the simpler case of wraparound needs to be handled. Since
one delay sample isn't significant, dropping it seemed the safest bet.


>> +config STATS_CONNECTOR
>> +config DELAY_ACCT
>
>
> Probably TASK_DELAY_STATS_CONNECTOR and TASK_DELAY_ACCOUNTING are
> better names?

TASK_DELAY_ACCOUNTING is better, since that's what the protected code
does.

STATS_CONNECTOR can be used to transmit stats other than per-task
delays (the current patch also transmits cpu run time). Also, the
overall per-task accounting solution whose discussion Andrew proposed
may deviate from delays. So how about TASK_STATS_CONNECTOR?

Will fix in next round.


>> @@ -813,6 +821,9 @@ struct task_struct {
>> 	int cpuset_mems_generation;
>> #endif
>> 	atomic_t fs_excl;	/* holding fs exclusive resources */
>> +#ifdef CONFIG_DELAY_ACCT
>> +	struct task_delay_info delays;
>> +#endif
>> };
>
>
> Does this mean that, whether or not per-task delay accounting is used,
> we have a constant overhead of sizeof(spinlock_t) + 2*sizeof(uint32_t)
> + 2*sizeof(uint64_t) bytes going into struct task_struct? Is it
> possible/beneficial to use struct task_delay_info *delays instead and
> allocate it only if a task wants to use the information?
>

Doing so would have value in the case where the feature is configured but
no one ever registers to listen for it. The cost of doing this would be
- adding more code to the fork path to allocate conditionally
- making the collection of the delays conditional on a similar check
- cache pollution from following an extra pointer in the pgflt/io_schedule paths
I'm not sure this really matters for these two code paths. A rough
sketch of the conditional allocation follows.
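
Roughly, on the fork path, it would look like this. Note this is just
a sketch: delayacct_listeners() is a made-up placeholder for whatever
"anyone listening?" check we'd use, and it assumes ->delays becomes a
pointer:

/* Sketch: allocate the delay struct only when someone is listening.
 * delayacct_listeners() is hypothetical. */
static inline int delayacct_alloc(struct task_struct *tsk)
{
	tsk->delays = NULL;
	if (!delayacct_listeners())
		return 0;
	tsk->delays = kmalloc(sizeof(*tsk->delays), GFP_KERNEL);
	if (!tsk->delays)
		return -ENOMEM;
	memset(tsk->delays, 0, sizeof(*tsk->delays));
	spin_lock_init(&tsk->delays->lock);
	return 0;
}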

Even if one does this, once the first listener registers, all future tasks
(and even the current ones) will have to go ahead and allocate the structure,
and accounting of delays will have to switch to unconditional mode. This is
because the delay data has cumulative value...future listeners will be
interested in data collected earlier (as long as the task is still running).
And once the first listener registers, you can no longer be sure no one's
interested in the future.

Another alternative is to let userland control the overhead of allocation and
collection completely through a /proc/sys/kernel/delayacct variable.
When it's switched on, it triggers an allocation for all existing tasks in the
system, turns on allocation in fork() for future tasks, and starts collection
of the stats. When turned off, collection of stats stops, as does allocation
for future tasks (not worth going in and deallocating structs for existing
tasks).

Does this seem worth it ?

-- Shailabh


> Parag
>


2005-11-15 22:53:51

by Parag Warudkar

Subject: Re: [Patch 1/4] Delay accounting: Initialization


On Nov 15, 2005, at 5:29 PM, Shailabh Nagar wrote:
>> Curious as to when this would occur. Probably for tasks running on an
>> SMP machine for a very short period of time (timer drift should
>> hopefully not be that high) and switching CPUs in that short period?
>
> Possibly. Also, the simpler case of wraparound needs to be handled.
> Since one delay sample isn't significant, dropping it seemed the
> safest bet.

Yep, better to report zero than invalid values.

>>> +config STATS_CONNECTOR
>>> +config DELAY_ACCT
>>
>> Probably TASK_DELAY_STATS_CONNECTOR and TASK_DELAY_ACCOUNTING are
>> better names?
>
> TASK_DELAY_ACCOUNTING is better, since that's what the protected code
> does.
>
> STATS_CONNECTOR can be used to transmit stats other than per-task
> delays (the current patch also transmits cpu run time). Also, the
> overall per-task accounting solution whose discussion Andrew proposed
> may deviate from delays. So how about TASK_STATS_CONNECTOR?

Sounds good.

>> Does this mean that, whether or not per-task delay accounting is used,
>> we have a constant overhead of sizeof(spinlock_t) + 2*sizeof(uint32_t)
>> + 2*sizeof(uint64_t) bytes going into struct task_struct? Is it
>> possible/beneficial to use struct task_delay_info *delays instead and
>> allocate it only if a task wants to use the information?
>
> Doing so would have value in the case where the feature is configured
> but no one ever registers to listen for it.

Precisely. Such a feature will be used only occasionally, I suppose.
I didn't read the code deeply, but are any scheduling decisions altered
based on this data? If not, it makes sense not to account unless
required.

I think it should be possible to do it on demand, per process, instead
of forcing the accounting on _all_ processes, which cumulatively
becomes a sizeable overhead.

Per-process activation of this feature will add significant value IMHO.
(Of course, if that's possible in the first place.)

> The cost of doing this would be
> - adding more code to the fork path to allocate conditionally

Just an unlikely branch for the normal code path - not a big deal.
Also I am thinking it could be handled outside of fork?

> - making the collection of the delays conditional on a similar check

Weighing this against the actual accounting - I think it's a win.

> - cache pollution from following an extra pointer in the
> pgflt/io_schedule paths
> I'm not sure this really matters for these two code paths.

Possibly.

> Even if one does this, once the first listener registers, all future
> tasks (and even the current ones) will have to go ahead and allocate
> the structure, and accounting of delays will have to switch to
> unconditional mode. This is because the delay data has cumulative
> value...future listeners will be interested in data collected earlier
> (as long as the task is still running). And once the first listener
> registers, you can no longer be sure no one's interested in the
> future.

Is it possible to do it per process? Forcing it on all processes is
what I was trying to avoid, given the feature's usage pattern.

> Another alternative is to let userland control the overhead of
> allocation and collection completely through a
> /proc/sys/kernel/delayacct variable. When it's switched on, it
> triggers an allocation for all existing tasks in the system, turns on
> allocation in fork() for future tasks, and starts collection of the
> stats. When turned off, collection of stats stops, as does allocation
> for future tasks (not worth going in and deallocating structs for
> existing tasks).
>
> Does this seem worth it ?

Definitely not unless we can do it per process and on demand.

> -- Shailabh
>

Cheers

Parag

2005-11-16 00:41:59

by Shailabh Nagar

Subject: Re: [Patch 1/4] Delay accounting: Initialization

Parag Warudkar wrote:
>
> On Nov 15, 2005, at 5:29 PM, Shailabh Nagar wrote:
<snip>
>
>>> Does this mean that, whether or not per-task delay accounting is used,
>>> we have a constant overhead of sizeof(spinlock_t) + 2*sizeof(uint32_t)
>>> + 2*sizeof(uint64_t) bytes going into struct task_struct? Is it
>>> possible/beneficial to use struct task_delay_info *delays instead and
>>> allocate it only if a task wants to use the information?
>>
>> Doing so would have value in the case where the feature is configured
>> but no one ever registers to listen for it.
>
> Precisely. Such a feature will be used only occasionally, I suppose.
> I didn't read the code deeply, but are any scheduling decisions
> altered based on this data? If not, it makes sense not to account
> unless required.
>
> I think it should be possible to do it on demand, per process, instead
> of forcing the accounting on _all_ processes, which cumulatively
> becomes a sizeable overhead.
>
> Per-process activation of this feature will add significant value
> IMHO. (Of course, if that's possible in the first place.)


Per-task activation is useful/possible only for long-running tasks.
If one is trying to gather stats for a user-defined grouping of tasks,
then requiring monitoring to be turned on individually would involve
too much overhead and inaccuracy.

>
>> The cost of doing this would be
>> - adding more code to the fork path to allocate conditionally
>
> Just an unlikely branch for the normal code path - not a big deal.
> Also I am thinking it could be handled outside of fork?

Only if per-task activation is done - that's probably what you meant?

>
>> - making the collection of the delays conditional on a similar check
>
> Weighing this against the actual accounting - I think it's a win.

Hmmm..since there is locking involved in the stats collection, this is
starting to make a lot of sense.

>
>> - cache pollution from following an extra pointer in the
>> pgflt/io_schedule paths
>> I'm not sure this really matters for these two code paths.
>
> Possibly.
>
>> Even if one does this, once the first listener registers, all future
>> tasks (and even the current ones) will have to go ahead and allocate
>> the structure, and accounting of delays will have to switch to
>> unconditional mode. This is because the delay data has cumulative
>> value...future listeners will be interested in data collected earlier
>> (as long as the task is still running). And once the first listener
>> registers, you can no longer be sure no one's interested in the
>> future.
>
> Is it possible to do it per process? Forcing it on all processes is
> what I was trying to avoid, given the feature's usage pattern.
>
>> Another alternative is to let userland control the overhead of
>> allocation and collection completely through a
>> /proc/sys/kernel/delayacct variable. When it's switched on, it
>> triggers an allocation for all existing tasks in the system, turns on
>> allocation in fork() for future tasks, and starts collection of the
>> stats. When turned off, collection of stats stops, as does allocation
>> for future tasks (not worth going in and deallocating structs for
>> existing tasks).
>
>> Does this seem worth it ?
>
> Definitely not unless we can do it per process and on demand.

Per-task doesn't seem like a good idea. Neither does allowing dynamic
switching off/on of the allocation of the delay struct.

So how about this:

Have /proc/sys/kernel/delayacct and a corresponding kernel boot parameter (for setting
the switch early) which control just the collection of data. Allocation always happens.
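
For concreteness, something like the sketch below; the flag name and
default are made up, and the sysctl side would just expose the same
integer through a standard proc_dointvec entry:

#include <linux/init.h>
#include <linux/kernel.h>

/* Sketch: global collection switch, settable early via "delayacct="
 * on the kernel command line and later via /proc/sys/kernel/delayacct.
 * Allocation of the per-task struct is unconditional either way. */
int delayacct_on = 0;

static int __init delayacct_setup(char *str)
{
	delayacct_on = simple_strtoul(str, NULL, 0) != 0;
	return 1;
}
__setup("delayacct=", delayacct_setup);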


>> -- Shailabh
>>
>
> Cheers
>
> Parag
>

2005-11-16 01:07:09

by Peter Chubb

Subject: Re: [Patch 1/4] Delay accounting: Initialization

>>>>> "Andrew" == Andrew Morton <[email protected]> writes:

Andrew> Shailabh Nagar <[email protected]> wrote:
>> + *ts = sched_clock();

Andrew> I'm not sure that it's kosher to use sched_clock() for
Andrew> fine-grained timestamping like this. Ingo had issues with it
Andrew> last time this happened?

It wasn't Ingo, it was Andi Kleen... for my Microstate Accounting
patches, which do very similar things to Shailabh's patchset, but
use /proc and a system call instead (following Solaris's lead).

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*

2005-11-16 01:44:33

by Shailabh Nagar

Subject: Re: [Patch 1/4] Delay accounting: Initialization

Peter Chubb wrote:
>>>>>>"Andrew" == Andrew Morton <[email protected]> writes:
>
>
> Andrew> Shailabh Nagar <[email protected]> wrote:
>
>>> + *ts = sched_clock();
>
>
> Andrew> I'm not sure that it's kosher to use sched_clock() for
> Andrew> fine-grained timestamping like this. Ingo had issues with it
> Andrew> last time this happened?
>
> It wasn't Ingo, it was Andi Kleen... for my Microstate Accounting
> patches, which do very similar things to Shailabh's patchset, but
> use /proc and a system call instead (following Solaris's lead).
>

Were these the comments from Andi to which you refer:
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0503.1/1237.html

The objections to microstate overhead seemed to stem from the syscall
overhead, not the use of sched_clock() per se.


Andi, Ingo,

Are there problems with using sched_clock() for timestamping if one is
prepared to live with the timestamps not necessarily being
nanosecond-accurate ? I'm trying to search the archives etc., but if
you can respond with any quick comments, that'd be very helpful.


Thanks,
Shailabh

2005-11-16 01:49:04

by Andi Kleen

Subject: Re: [Patch 1/4] Delay accounting: Initialization

On Wednesday 16 November 2005 02:48, Shailabh Nagar wrote:

>
> Are there problems with using sched_clock() for timestamping if one
> is prepared to live with the timestamps not necessarily being
> nanosecond-accurate ? I'm trying to search the archives etc., but if
> you can respond with any quick comments, that'd be very helpful.

First, it can be relatively slow on P4s (hundreds of cycles).

On other systems it can run at different frequencies on different CPUs,
so you must never assume a timestamp from one CPU is comparable with
one from another CPU (the scheduler carefully avoids this).

If you need a stable timestamp across multiple CPUs, don't use it.

In general do_gettimeofday() is much safer. do_gettimeofday() shouldn't
be that much slower for the case where the TSC works, and where it
doesn't there is no other alternative.
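
Something along the lines of this sketch (the helper name is made up;
microsecond units, comparable across CPUs):

/* Sketch: a do_gettimeofday()-based timestamp in microseconds. */
static inline unsigned long long delayacct_tv_stamp(void)
{
	struct timeval tv;

	do_gettimeofday(&tv);
	return (unsigned long long)tv.tv_sec * 1000000ULL + tv.tv_usec;
}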

-Andi

2005-11-16 01:53:11

by Peter Chubb

Subject: Re: [Patch 1/4] Delay accounting: Initialization

>>>>> "Shailabh" == Shailabh Nagar <[email protected]> writes:

Shailabh> Peter Chubb wrote:
>>>>>>> "Andrew" == Andrew Morton <[email protected]> writes:
>>
Andrew> Shailabh Nagar <[email protected]> wrote:
>>
>>>> + *ts = sched_clock();
>>
Andrew> I'm not sure that it's kosher to use sched_clock() for
Andrew> fine-grained timestamping like this. Ingo had issues with it
Andrew> last time this happened?
>> It wasn't Ingo, it was Andi Kleen... for my Microstate Accounting
>> patches, which do very similar things to Shailabh's patchset, but
>> use /proc and a system call instead (following Solaris's lead).
>>

Shailabh> Were these the comments from Andi to which you refer:
Shailabh> http://www.uwsg.indiana.edu/hypermail/linux/kernel/0503.1/1237.html

Shailabh> The objections to microstate overhead seemed to stem from
Shailabh> the syscall overhead, not the use of sched_clock() per se.

The objection was to the use of rdtsc (which is sched_clock() on
platforms that have rdtsc) in fastpaths.


Anyway, I think it's time to repost the microstate patches.
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*

2005-11-16 02:43:07

by Parag Warudkar

Subject: Re: [Patch 1/4] Delay accounting: Initialization


On Nov 15, 2005, at 7:45 PM, Shailabh Nagar wrote:

> So how about this:
>
> Have /proc/sys/kernel/delayacct and a corresponding kernel boot
> parameter (for setting the switch early) which control just the
> collection of data. Allocation always happens.

Yep, if per-task isn't worth doing, that sounds reasonable.

Parag