2012-10-11 18:53:39

by Prarit Bhargava

Subject: RCU NOHZ, tsc, and clock_gettime

I've been tracking an odd bug that may involve the RCU NOHZ code and
just want to know if you have any ideas on debugging and/or what might be
wrong. Note the bug happens on *BOTH* upstream and the current RHEL6 tree.
The data in this email is from running on RHEL6 because that's what I happen
to be running ATM. The result, however, is _identical_ to that of linux.git
latest.

The attached program compares userspace TSC reads to the time returned from
the REALTIME_CLOCK[1]. The test does the following

read tsc1
get REALTIME_CLOCK value
read tsc2

and then does a comparison between the tsc read and the REALTIME_CLOCK value
to see if they are in sync with each other.

[I'm leaving out the guts of the analysis here. It is sufficient to show
examples of "good" data and "bad" data IMO.]

On a good run, we see little variance in between the values:

0 144 0.1
1 138 1.8
2 147 -2.9
3 138 0.6
4 141 0.2
5 138 -1.2
6 147 -0.4
7 144 0.4
8 144 -0.6
9 144 -0.9
10 144 2.1
11 138 -0.7
12 144 -0.4
13 144 0.5
14 144 0.6
15 144 -0.8
16 141 1.5
17 147 1.9
18 141 -0.1
19 141 0.7
20 144 0.3
21 144 -0.4
22 138 0.2
23 141 -2.1
24 144 -1.0
25 141 1.2
26 141 -0.5
27 144 -0.2
28 138 0.7
29 144 -0.6
n: 30, slope: 0.50 (1.99 GHz), dev: 1.1 ns, max: 2.9 ns


On a bad run, there is a lot of variance between the values:

0 144 -346.0
1 138 1410.8
2 138 -806.9
3 141 4006.6
4 147 -3996.1
5 138 -255.8
6 144 -22.2
7 138 -22.4
8 144 102.7
9 141 218.0
10 138 -11.6
11 138 -4.9
12 138 -26.2
13 138 -51.2
14 141 -280.0
15 144 -1120.5
16 144 -13.8
17 138 -15.9
18 144 -46.2
19 138 -46.7
20 138 -453.4
21 138 2062.7
22 141 -125.4
23 138 -453.4
24 138 1050.1
25 138 -643.6
26 138 14.3
27 138 7.7
28 138 -80.6
29 141 -50.3
n: 30, slope: 0.50 (1.99 GHz), dev: 1231.4 ns, max: 4006.6 ns


It was noted by the bug reporter that specifying "nohz=off" resolved the
problem. I tested with "nohz=off" and AFAICT it fixes the issue. I started
out debugging by assuming that delays in the c-state transitions were not being
properly accounted for in the timing calculations.

I ran a baseline test on an unmodified kernel (with no extra boot options) and
confirmed that powertop shows the CPUs entering deep c-states while the test was
running for 300 runs.

I then instrumented the PM QoS and the power management code (specifically
cpuidle). I put in a large # of printk's to monitor the CPU transitions, and
monitored the power states via powertop in order to verify that the system was
behaving correctly wrt PM QoS.

If you modify the tstsc script to run 300 times with this modified kernel, and
run powertop in the middle of the script, you will see that the processors do
NOT enter deep c-states. **This means that PM QoS is doing its job correctly**.

After this I decided to use "idle=poll" as a boot parameter. This kernel
argument forces a polling idle loop on each cpu, which should prevent the
cpu from entering a deep c-state.

When running with idle=poll, powertop indicates that the processors NEVER
enter a deep c-state (or at least AFAICS). Running the tstsc test again
results in the *SAME* failure as with the regular kernel.

So this means, IMO, that the problem lies within some other aspect of the
NOHZ code. I started backing out some pieces of the NOHZ code trying to
see what caused the problems with the test, and finally got to the RCU NOHZ
code. (Yeah ... this wasn't the best thing to do ... and it results in an
almost nonfunctional kernel, especially with filesystems ...)

When I do that, surprisingly the problem goes away. That is, the test
functions as it should. I do not see any problems in the calculations, and
at the same time I can confirm that I'm seeing c-state transitions via
powertop.

I've narrowed a brute-force code removal down to code in
dyntick_save_progress_counter()
and rcu_implicit_dynticks_qs(), but don't have enough knowledge of the NOHZ
RCU code to get my hands around the problem. Admittedly I'm still trying to
wrap my head around dynticks and its usage by reading the code. I was
wondering if anyone might have an idea of what could be wrong? I'm certainly
willing to continue to debug.

P.

[1] Please note that switching to the MONOTONIC or MONOTONIC_RAW clocks
also results in the same problem.



/* gcc tstsc.c -o tstsc -lrt -lm */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <limits.h>
#include <math.h>
#include <inttypes.h>
#include <sched.h>

#define SAMPLES 30
#define SAMPLES_PER_SEC 10
#define SAMPLE_LOOPS 40

struct sample {
        uint64_t tsc;
        uint64_t ts;
        int delay;
};

static uint64_t rdtsc(void)
{
#if 1
        uint32_t a, d;
        __asm__ __volatile__("rdtsc" : "=a" (a), "=d" (d));
        return (uint64_t)d << 32 | a;
#else
        return __builtin_ia32_rdtsc();
#endif
}


static void get_sample(struct sample *sample)
{
        int i, delay, mindelay;
        struct timespec ts = {0, 0}, ts1;
        uint64_t tsc = 0, tsc1, tsc2;

        /* take SAMPLE_LOOPS bracketed readings and keep the one with the
         * tightest rdtsc bracket around clock_gettime() */
        for (i = 0, mindelay = INT_MAX; i < SAMPLE_LOOPS; i++) {
                tsc1 = rdtsc();
                clock_gettime(CLOCK_REALTIME, &ts1);
                tsc2 = rdtsc();
                delay = tsc2 - tsc1;
                if (delay <= mindelay) {
                        ts = ts1;
                        tsc = (tsc1 + tsc2) / 2;
                        mindelay = delay;
                }
        }

        sample->tsc = tsc; /* the current time read directly from the tsc */
        /* the kernel's representation of time */
        sample->ts = (uint64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
        sample->delay = mindelay;
}

/* taken/modified from chrony/regress.c */
void
regress
(double *x,  /* independent variable */
 double *y,  /* measured data */
 int n,      /* number of data points */

 /* And now the results */

 double *b0, /* estimated y axis intercept */
 double *b1, /* estimated slope */
 double *s2  /* estimated variance of the residuals */

 /* Could add correlation stuff later if required */
)
{
        double P, Q, U, V, W;
        double diff;
        double u, ui;
        int i;

        assert(n >= 3);

        W = U = 0;
        for (i = 0; i < n; i++) {
                U += x[i];
                W += 1.0;
        }

        u = U / W;

        /* Calculate statistics from data */
        P = Q = V = 0.0;
        for (i = 0; i < n; i++) {
                ui = x[i] - u;
                P += y[i];
                Q += y[i] * ui;
                V += ui * ui;
        }

        *b1 = Q / V;
        *b0 = (P / W) - (*b1) * u;

        *s2 = 0.0;
        for (i = 0; i < n; i++) {
                diff = y[i] - *b0 - *b1 * x[i];
                *s2 += diff * diff;
        }

        *s2 /= (double)(n - 2);
}

char buf[2] = { '0', '\n' };

int main(void)
{
        struct sample samples[SAMPLES];
        double x[SAMPLES], y[SAMPLES], slope, intercept, offset, variance, max_offset;
        int i;
        FILE *fd = NULL;
        size_t retval;
        struct sched_param param;

        /* change this to 0 if you want to avoid using PM QoS */
#if 1
        param.sched_priority = sched_get_priority_max(SCHED_RR);
        if (sched_setscheduler(0, SCHED_RR, &param) == -1)
                perror("sched_setscheduler");

        fd = fopen("/dev/cpu_dma_latency", "w+");
        if (!fd) {
                printf("Cannot open /dev/cpu_dma_latency ... test may be skewed!\n");
        } else {
                /* write an ASCII zero and keep the file open so the PM QoS
                 * request stays in force for the duration of the run */
                retval = fwrite(buf, sizeof(buf), 1, fd);
                fflush(fd);
        }
#endif

#if 1
        usleep(100000);
#endif
        for (i = 0; i < SAMPLES; i++) {
                printf("getting data for sample set %d\n", i);
                get_sample(samples + i);
#if 0
                usleep(rand() % 2000000 / SAMPLES_PER_SEC);
#else
                usleep(1000000 / SAMPLES_PER_SEC);
#endif
        }

        for (i = 0; i < SAMPLES; i++) {
                x[i] = samples[i].tsc - samples[0].tsc;
                y[i] = samples[i].ts - samples[0].ts;
        }

        /* x = average_tsc[i] - average_tsc[0], y = ts[i] - ts[0] */
        regress(x, y, SAMPLES, &intercept, &slope, &variance);
        max_offset = 0.0;

        for (i = 0; i < SAMPLES; i++) {
                offset = x[i] * slope + intercept - y[i];
                if (fabs(offset) > max_offset)
                        max_offset = fabs(offset);
                /* delay = delta(tsc2 - tsc1),
                 * offset = (average_tsc - ts) in ns
                 */
                printf("%5d %5d %9.1f\n", i, samples[i].delay, offset);
        }

        printf("n: %d, slope: %.2f (%.2f GHz), dev: %.1f ns, max: %.1f ns\n", SAMPLES,
               slope, 1.0 / slope, sqrt(variance), max_offset);
        if (fd)
                fclose(fd);
        return 0;
}


2012-10-11 19:55:09

by john stultz

Subject: Re: RCU NOHZ, tsc, and clock_gettime

On 10/11/2012 11:52 AM, Prarit Bhargava wrote:
> I've been tracking an odd bug that may involve the RCU NOHZ code and
> just want to know if you have any ideas on debugging and/or what might be
> wrong. Note the bug happens on *BOTH* upstream and the current RHEL6 tree.
> The data in this email is from running on RHEL6 because that's what I happen
> to be running ATM. The result, however, is _identical_ to that of linux.git
> latest.
>
> The attached program compares userspace TSC reads to the time returned from
> the REALTIME_CLOCK[1]. The test does the following
>
> read tsc1
> get REALTIME_CLOCK value
> read tsc2
>
> and then does a comparison between the tsc read and the REALTIME_CLOCK value
> to see if they are in sync with each other.
>
> [I'm leaving out the guts of the analysis here. It is sufficient to show
> examples of "good" data and "bad" data IMO.]
>
> On a good run, we see little variance in between the values:
>
> 0 144 0.1
> 1 138 1.8
> 2 147 -2.9
[snip]
> 29 144 -0.6
> n: 30, slope: 0.50 (1.99 GHz), dev: 1.1 ns, max: 2.9 ns
>
>
> On a bad run, there is a lot of variance between the values:
>
> 0 144 -346.0
> 1 138 1410.8
> 2 138 -806.9
> 3 141 4006.6
> 4 147 -3996.1
[snip]
> 29 141 -50.3
> n: 30, slope: 0.50 (1.99 GHz), dev: 1231.4 ns, max: 4006.6 ns


Do you see the same noisy variance when instead of doing:
rdtsc()
clock_gettime()
rdtsc()

you do:
clock_gettime()
clock_gettime()

And calculate the delta of the timestamp results?
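
(Roughly, and untested -- e.g. dropped into the attached get_sample() loop
in place of the rdtsc() pair:

        struct timespec ts1, ts2;
        int64_t delta_ns;

        clock_gettime(CLOCK_REALTIME, &ts1);
        clock_gettime(CLOCK_REALTIME, &ts2);

        /* delta between two back-to-back kernel time reads, in ns */
        delta_ns = (int64_t)(ts2.tv_sec - ts1.tv_sec) * 1000000000 +
                   (ts2.tv_nsec - ts1.tv_nsec);

That takes the TSC out of the picture entirely, so if the noise is still
there it points at clock_gettime() itself.)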

Also does this behavior change if you select different clocksources on
the system?
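
For completeness, switching it from userspace is just a write to sysfs
(needs root). Something like this sketch, assuming the usual
/sys/devices/system/clocksource/clocksource0 nodes:

        /* available_clocksource lists the candidates; this writes one of
         * them back into current_clocksource, e.g. "hpet" */
        static int set_clocksource(const char *name)
        {
                FILE *f = fopen("/sys/devices/system/clocksource/"
                                "clocksource0/current_clocksource", "w");

                if (!f)
                        return -1;
                fprintf(f, "%s\n", name);
                return fclose(f);
        }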

> It was noted by the bug reporter that specifying "nohz=off" resolved the
> problem. I tested with "nohz=off" and AFAICT it fixes the issue. I started
> out debugging by assuming that delays in the c-state transitions were not being
> properly accounted for in the timing calculations.
>
> I ran a baseline test on an unmodified kernel (with no extra boot options) and
> confirmed that powertop shows the CPUs entering deep c-states while the test was
> running for 300 runs.
>
> I then instrumented the PM QoS and the power management code (specifically
> cpuidle). I put in a large # of printk's to monitor the CPU transitions, and
> monitored the power states via powertop in order to verify that the system was
> behaving correctly wrt PM QoS.
>
> If you modify the tstsc script to run 300 times with this modified kernel, and
> run powertop in the middle of the script, you will see that the processors do
> NOT enter deep c-states. **This means that PM QoS is doing its job correctly**.

So it's not clear here: do you see the same noisier latencies when using
PM QoS to limit deep c-states?

Finally, how many cpus are on the machine you see this with? Does the
effect go away with maxcpus=1?

It looks like the TSC clocksource is being used, just because the
latencies are so low, but is the same clocksource being used in all of
these cases?

One possibility is that if the cpu we're doing our timekeeping
accumulation on is different than the one running the test, we might go
into deeper idle for longer periods of time. Then when we accumulate
time, we have more than a single tick to accumulate and that might
require holding the timekeeper/xtime lock for longer times.

And the max 2.9 ns variance seems particularly low, given that we do call
update_vsyscall every so often, and that should block clock_gettime()
callers while we update the vsyscall data. Could it be that the test is
too short to see the locking effect, so you're just getting lucky, and
that adding nohz is jostling the regularity of the execution so you then
see the lock wait times? If you increase the samples and sample loops
by 1000, does that change the behavior?
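
(With the attached test that's just the defines at the top, e.g.:

        #define SAMPLES 30000
        #define SAMPLE_LOOPS 40000

though note that at 10 samples per second a 30000-sample run takes close
to an hour.)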

thanks
-john

2012-10-11 21:08:11

by Paul E. McKenney

Subject: Re: RCU NOHZ, tsc, and clock_gettime

On Thu, Oct 11, 2012 at 12:51:44PM -0700, John Stultz wrote:
> On 10/11/2012 11:52 AM, Prarit Bhargava wrote:
> >I've been tracking an odd bug that may involve the RCU NOHZ code and
> >just want to know if you have any ideas on debugging and/or what might be
> >wrong. Note the bug happens on *BOTH* upstream and the current RHEL6 tree.
> >The data in this email is from running on RHEL6 because that's what I happen
> >to be running ATM. The result, however, is _identical_ to that of linux.git
> >latest.
> >
> >The attached program compares userspace TSC reads to the time returned from
> >the REALTIME_CLOCK[1]. The test does the following
> >
> >read tsc1
> >get REALTIME_CLOCK value
> >read tsc2
> >
> >and then does a comparison between the tsc read and the REALTIME_CLOCK value
> >to see if they are in sync with each other.
> >
> >[I'm leaving out the guts of the analysis here. It is sufficient to show
> >examples of "good" data and "bad" data IMO.]
> >
> >On a good run, we see little variance in between the values:
> >
> > 0 144 0.1
> > 1 138 1.8
> > 2 147 -2.9
> [snip]
> > 29 144 -0.6
> >n: 30, slope: 0.50 (1.99 GHz), dev: 1.1 ns, max: 2.9 ns
> >
> >
> >On a bad run, there is a lot of variance between the values:
> >
> > 0 144 -346.0
> > 1 138 1410.8
> > 2 138 -806.9
> > 3 141 4006.6
> > 4 147 -3996.1
> [snip]
> > 29 141 -50.3
> >n: 30, slope: 0.50 (1.99 GHz), dev: 1231.4 ns, max: 4006.6 ns
>
>
> Do you see the same noisy variance when instead of doing:
> rdtsc()
> clock_gettime()
> rdtsc()
>
> you do:
> clock_gettime()
> clock_gettime()
>
> And calculate the delta of the timestamp results?
>
> Also does this behavior change if you select different clocksources
> on the system?
>
> >It was noted by the bug reporter that specifying "nohz=off" resolved the
> >problem. I tested with "nohz=off" and AFAICT it fixes the issue. I started
> >out debugging by assuming that delays in the c-state transitions were not being
> >properly accounted for in the timing calculations.
> >
> >I ran a baseline test on an unmodified kernel (with no extra boot options) and
> >confirmed that powertop shows the CPUs entering deep c-states while the test was
> >running for 300 runs.
> >
> >I then instrumented the PM QoS and the power management code (specifically
> >cpuidle). I put in a large # of printk's to monitor the CPU transitions, and
> >monitored the power states via powertop in order to verify that the system was
> >behaving correctly wrt PM QoS.
> >
> >If you modify the tstsc script to run 300 times with this modified kernel, and
> >run powertop in the middle of the script, you will see that the processors do
> >NOT enter deep c-states. **This means that PM QoS is doing its job correctly**.
>
> So it's not clear here: do you see the same noisier latencies when
> using PM QoS to limit deep c-states?
>
> Finally, how many cpus are on the machine you see this with? Does
> the effect go away with maxcpus=1?

Also, what is the value of NR_CPUS? And exactly which kernel.org kernel
are you using?

The effect of removing the two functions you noted (on 3.6 and earlier)
is to prevent RCU from checking for dyntick-idle CPUs, likely incurring
a cache miss for each CPU with interrupts disabled. If you have a lot
of CPUs (or even if NR_CPUS is large and you have a smaller number of
CPUs), this can result in user-space-visible delays.
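
For reference, a rough paraphrase (from memory, not verbatim, and with the
offline-CPU handling elided) of what those two hooks do in 3.6-era
kernel/rcutree.c:

/* Snapshot the target CPU's dynticks counter; an even value means the
 * CPU is in dyntick-idle mode, which is an immediate quiescent state. */
static int dyntick_save_progress_counter(struct rcu_data *rdp)
{
        rdp->dynticks_snap = atomic_add_return(0, &rdp->dynticks->dynticks);
        return (rdp->dynticks_snap & 0x1) == 0;
}

/* Re-read the counter later in the grace period; if it changed, the CPU
 * passed through (or is still in) dyntick-idle since the snapshot. */
static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
{
        unsigned int curr = (unsigned int)atomic_add_return(0, &rdp->dynticks->dynticks);
        unsigned int snap = (unsigned int)rdp->dynticks_snap;

        if ((curr & 0x1) == 0 || UINT_CMP_GE(curr, snap + 2))
                return 1;

        /* ... otherwise fall through to the offline-CPU checks ... */
        return 0;
}

The atomic_add_return(0, ...) is a full-barrier read of another CPU's
per-CPU rcu_dynticks structure, which is where the per-CPU cache misses
come from when the grace-period machinery scans the CPUs.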

Thanx, Paul

> It looks like the TSC clocksource is being used, just because the
> latencies are so low, but is the same clocksource being used in all
> of these cases?
>
> One possibility is that if the cpu we're doing our timekeeping
> accumulation on is different than the one running the test, we might
> go into deeper idle for longer periods of time. Then when we
> accumulate time, we have more than a single tick to accumulate and
> that might require holding the timekeeper/xtime lock for longer
> times.
>
> And the max 2.9 ns variance seems particularly low, given that we do
> call update_vsyscall every so often, and that should block
> clock_gettime() callers while we update the vsyscall data. Could it
> be that the test is too short to see the locking effect, so you're
> just getting lucky, and that adding nohz is jostling the regularity
> of the execution so you then see the lock wait times? If you
> increase the samples and sample loops by 1000, does that change the
> behavior?
>
> thanks
> -john

2012-10-12 15:41:18

by Prarit Bhargava

Subject: Re: RCU NOHZ, tsc, and clock_gettime



On 10/11/2012 04:21 PM, Paul E. McKenney wrote:
> On Thu, Oct 11, 2012 at 12:51:44PM -0700, John Stultz wrote:
>> On 10/11/2012 11:52 AM, Prarit Bhargava wrote:
>>> I've been tracking an odd bug that may involve the RCU NOHZ code and
>>> just want to know if you have any ideas on debugging and/or what might be
>>> wrong. Note the bug happens on *BOTH* upstream and the current RHEL6 tree.
>>> The data in this email is from running on RHEL6 because that's what I happen
>>> to be running ATM. The result, however, is _identical_ to that of linux.git
>>> latest.
>>>
>>> The attached program compares userspace TSC reads to the time returned from
>>> the REALTIME_CLOCK[1]. The test does the following
>>>
>>> read tsc1
>>> get REALTIME_CLOCK value
>>> read tsc2
>>>
>>> and then does a comparison between the tsc read and the REALTIME_CLOCK value
>>> to see if they are in sync with each other.
>>>
>>> [I'm leaving out the guts of the analysis here. It is sufficient to show
>>> examples of "good" data and "bad" data IMO.]
>>>
>>> On a good run, we see little variance in between the values:
>>>
>>> 0 144 0.1
>>> 1 138 1.8
>>> 2 147 -2.9
>> [snip]
>>> 29 144 -0.6
>>> n: 30, slope: 0.50 (1.99 GHz), dev: 1.1 ns, max: 2.9 ns
>>>
>>>
>>> On a bad run, there is a lot of variance between the values:
>>>
>>> 0 144 -346.0
>>> 1 138 1410.8
>>> 2 138 -806.9
>>> 3 141 4006.6
>>> 4 147 -3996.1
>> [snip]
>>> 29 141 -50.3
>>> n: 30, slope: 0.50 (1.99 GHz), dev: 1231.4 ns, max: 4006.6 ns
>>
>>
>> Do you see the same noisy variance when instead of doing:
>> rdtsc()
>> clock_gettime()
>> rdtsc()
>>
>> you do:
>> clock_gettime()
>> clock_gettime()
>>
>> And calculate the delta of the timestamp results?

I do not see the noisy variance when comparing clock_gettime() to clock_gettime().

>>
>> Also does this behavior change if you select different clocksources
>> on the system?

No, if the clocksource is the hpet, I still see the large variance, i.e. the
behaviour does not change.

>>
>>> It was noted by the bug reporter that specifying "nohz=off" resolved the
>>> problem. I tested with "nohz=off" and AFAICT it fixes the issue. I started
>>> out debugging by assuming that delays in the c-state transitions were not being
>>> properly accounted for in the timing calculations.
>>>
>>> I ran a baseline test on an unmodified kernel (with no extra boot options) and
>>> confirmed that powertop shows the CPUs entering deep c-states while the test was
>>> running for 300 runs.
>>>
>>> I then instrumented the PM QoS and the power management code (specifically
>>> cpuidle). I put in a large # of printk's to monitor the CPU transitions, and
>>> monitored the power states via powertop in order to verify that the system was
>>> behaving correctly wrt PM QoS.
>>>
>>> If you modify the tstsc script to run 300 times with this modified kernel, and
>>> run powertop in the middle of the script, you will see that the processors do
>>> NOT enter deep c-states. **This means that PM QoS is doing its job correctly**.
>>
>> So it's not clear here: do you see the same noisier latencies when
>> using PM QoS to limit deep c-states?

I see the same noisier latencies when using PM QoS to limit the deep c-state
transitions.

>>
>> Finally, how many cpus are on the machine you see this with? Does
>> the effect go away with maxcpus=1?


24 physical/48 logical, 2G/core RAM

The large variance is still there if maxcpus=1.

>
> Also, what is the value of NR_CPUS? And exactly which kernel.org kernel
> are you using?

NR_CPUS=4096

I'm using the "main" kernel.org tree,

git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

as of commit 4f1cd91497774488ed16119ec3f54b3daf1561de

(I recompiled this morning and re-ran the tests and they still show a large
variance)

>
> The effect of removing the two functions you noted (on 3.6 and earlier)
> is to prevent RCU from checking for dyntick-idle CPUs, likely incurring
> a cache miss for each CPU with interrupts disabled. If you have a lot
> of CPUs (or even if NR_CPUS is large and you have a smaller number of
> CPUs), this can result in user-space-visible delays.
>

Interesting. I'll lower NR_CPUS and see if that shows an improvement.

> Thanx, Paul
>
>> It looks like the TSC clocksource is being used, just because the
>> latencies are so low, but is the same clocksource being used in all
>> of these cases?


Yes, the TSC is always being used.

>>
>> One possibility is that if the cpu we're doing our timekeeping
>> accumulation on is different than the one running the test, we might
>> go into deeper idle for longer periods of time. Then when we
>> accumulate time, we have more than a single tick to accumulate and
>> that might require holding the timekeeper/xtime lock for longer
>> times.
>>
>> And the max 2.9 ns variance seems particularly low, given that we do
>> call update_vsyscall every so often, and that should block
>> clock_gettime() callers while we update the vsyscall data. Could it
>> be that the test is too short to see the locking effect, so you're
>> just getting lucky, and that adding nohz is jostling the regularity
>> of the execution so you then see the lock wait times? If you
>> increase the samples and sample loops by 1000, does that change the
>> behavior?

That's a possibility, although I suspect that this has more to do with not
executing the RCU NOHZ code, given that we don't see a problem with the
clock_gettime() vs clock_gettime() test. I wonder if not executing the RCU NOHZ
code somehow introduces a "regularity" with execution that results in the CPU
always being in C0/polling when the test is run?

P.

2012-10-12 18:27:11

by Prarit Bhargava

Subject: Re: RCU NOHZ, tsc, and clock_gettime


> The effect of removing the two functions you noted (on 3.6 and earlier)
> is to prevent RCU from checking for dyntick-idle CPUs, likely incurring
> a cache miss for each CPU with interrupts disabled. If you have a lot
> of CPUs (or even if NR_CPUS is large and you have a smaller number of
> CPUs), this can result in user-space-visible delays.
>

Paul,

I built a kernel with NR_CPUS=48 and booted on a 48 cpu (logical) system. I do
not see a difference in the test -- the variance is AFAICT just as large as if I
had run with NR_CPUS=4096.

P.

2012-10-14 22:51:52

by Paul E. McKenney

Subject: Re: RCU NOHZ, tsc, and clock_gettime

On Fri, Oct 12, 2012 at 02:27:01PM -0400, Prarit Bhargava wrote:
>
> > The effect of removing the two functions you noted (on 3.6 and earlier)
> > is to prevent RCU from checking for dyntick-idle CPUs, likely incurring
> > a cache miss for each CPU with interrupts disabled. If you have a lot
> > of CPUs (or even if NR_CPUS is large and you have a smaller number of
> > CPUs), this can result in user-space-visible delays.
> >
>
> Paul,
>
> I built a kernel with NR_CPUS=48 and booted on a 48 cpu (logical) system. I do
> not see a difference in the test -- the variance is AFAICT just as large as if I
> had run with NR_CPUS=4096.

OK -- have you applied John Stultz's suggestions?

Thanx, Paul

2012-10-15 14:57:27

by Paul E. McKenney

Subject: Re: RCU NOHZ, tsc, and clock_gettime

On Fri, Oct 12, 2012 at 11:40:44AM -0400, Prarit Bhargava wrote:
> On 10/11/2012 04:21 PM, Paul E. McKenney wrote:
> > On Thu, Oct 11, 2012 at 12:51:44PM -0700, John Stultz wrote:

[ . . . ]

> >> One possibility is that if the cpu we're doing our timekeeping
> >> accumulation on is different than the one running the test, we might
> >> go into deeper idle for longer periods of time. Then when we
> >> accumulate time, we have more than a single tick to accumulate and
> >> that might require holding the timekeeper/xtime lock for longer
> >> times.
> >>
> >> And the max 2.9 ns variance seems particularly low, given that we do
> >> call update_vsyscall every so often, and that should block
> >> clock_gettime() callers while we update the vsyscall data. Could it
> >> be that the test is too short to see the locking effect, so you're
> >> just getting lucky, and that adding nohz is jostling the regularity
> >> of the execution so you then see the lock wait times? If you
> >> increase the samples and sample loops by 1000, does that change the
> >> behavior?
>
> That's a possibility, although I suspect that this has more to do with not
> executing the RCU NOHZ code, given that we don't see a problem with the
> clock_gettime() vs clock_gettime() test. I wonder if not executing the RCU NOHZ
> code somehow introduces a "regularity" with execution that results in the CPU
> always being in C0/polling when the test is run?

I don't know about regularity being caused by omitting those RCU
functions, but omitting them when CONFIG_NO_HZ=y can certainly result
in system hangs due to grace periods never completing. It is also
not at all clear to me how RCU knows that your userspace code is using
clock_gettime() instead of the rdtsc instruction.

But it looks like you were collecting only 30 samples at 10 samples
per second. It also looks like you are seeing deviations of a few
microseconds, which could easily be caused by any number of things,
including scheduling-clock interrupts, device interrupts, any number
of IPIs, and who knows what all else. Furthermore, it is quite likely
that your usleep() system calls are correlated with the scheduler-clock
interrupt, which would pretty much destroy any statistical significance
that your tests might otherwise have.
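
(FWIW, the attached program already carries a randomized inter-sample
delay behind an #if 0:

        usleep(rand() % 2000000 / SAMPLES_PER_SEC);

Enabling that in place of the fixed usleep(1000000 / SAMPLES_PER_SEC),
plus seeding with srand(), would at least decorrelate the sampling from
the fixed 100ms period, assuming that is where the correlation comes
from.)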

So for the time being, I have to assume that if you ran a test that
had a good chance of hitting a random sample of interrupts, then you
would see interrupt overhead on all of your tests.

Thanx, Paul

2012-11-12 23:27:50

by john stultz

Subject: Re: RCU NOHZ, tsc, and clock_gettime

On 10/12/2012 08:40 AM, Prarit Bhargava wrote:
>
>>> One possibility is that if the cpu we're doing our timekeeping
>>> accumulation on is different than the one running the test, we might
>>> go into deeper idle for longer periods of time. Then when we
>>> accumulate time, we have more than a single tick to accumulate and
>>> that might require holding the timekeeper/xtime lock for longer
>>> times.
>>>
>>> And the max 2.9 ns variance seems particularly low, given that we do
>>> call update_vsyscall every so often, and that should block
>>> clock_gettime() callers while we update the vsyscall data. Could it
>>> be that the test is too short to see the locking effect, so you're
>>> just getting lucky, and that adding nohz is jostling the regularity
>>> of the execution so you then see the lock wait times? If you
>>> increase the samples and sample loops by 1000, does that change the
>>> behavior?
> That's a possibility, although I suspect that this has more to do with not
> executing the RCU NOHZ code, given that we don't see a problem with the
> clock_gettime() vs clock_gettime() test. I wonder if not executing the RCU NOHZ
> code somehow introduces a "regularity" with execution that results in the CPU
> always being in C0/polling when the test is run?
Hey Prarit,
Just back from being on leave, and wanted to check in on this. Did
you ever get to run with an increased sample size to see how that
affected things? It's exactly your point that the non-NOHZ case could
align the execution of a short run in a way that you always see good
results, whereas with NOHZ the alignment might not be the same, so you
see periodic delays from timer interrupts, etc.

Anyway, let me know if this got resolved or not.

thanks
-john

2012-11-13 15:18:42

by Prarit Bhargava

Subject: Re: RCU NOHZ, tsc, and clock_gettime



On 11/12/2012 06:27 PM, John Stultz wrote:

> Hey Prarit,
> Just back from being on leave, and wanted to check in on this. Did you ever
> get to run with an increased sample size to see how that affected things? It's
> exactly your point that the non-NOHZ case could align the execution of a short
> run in a way that you always see good results, whereas with NOHZ the alignment
> might not be the same, so you see periodic delays from timer interrupts, etc.
>
> Anyway, let me know if this got resolved or not.

Hey John,

I never did narrow this down, although this "disappears" with the move to the
lower resolution patches in upstream. I have not, however, done any testing to
see if the situation is actually resolved or if we just reduced the problem by a
factor of 1000 ;)

P.