Date: Mon, 6 Feb 2012 14:25:46 +0100
From: Stephane Eranian <eranian@google.com>
To: linux-kernel@vger.kernel.org
Cc: acme@redhat.com, peterz@infradead.org, mingo@elte.hu,
        robert.richter@amd.com, eric.dumazet@gmail.com
Subject: [BUG] perf: perf sched warning possibly due to clock granularity on
 AMD
Message-ID: <20120206132546.GA30854@quad>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4524
Lines: 166

Hi,

I am running 3.3.30-rc2 on an AMD Bulldozer system in 64-bit mode.
I was testing perf sched and I ran into an issue. That issue seems
to exist only on AMD and not on Intel systems. It is not PMU related
because I am doing tracing.

I am running a simple ping pong test to stress the context switch
code. Two processes exchanging a byte through a pipe (program provided
below).

Then, I capture a trace using perf sched, and I run sched lat, but
on certain runs, I get:

   $ perf sched rec pong 2
   $ perf sched lat
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: Timestamp below last timeslice flush
     Warning: TimesFound 4934 unknown events!
     Is this an older tool processing a perf.data file generated by a more recent tool?

I did some investigation in the kernel to figure out what
could cause this problem. I found out that with this measurement, it is
possible to get multiple consecutive samples with the SAME timestamp.
That is possible on AMD because, apparently, local_clock() ends up calling,
a sched_clock() function that is only based on jiffies. That means, the
granularity is a tick (1ms in my case). On Intel systems, at least on mines,
local_clock() ends up calling a routine which uses TSC and thus has a much
greater precision. This is all decided by the sched_clock_stable variable. It
is set at init time for Intel and not used on AMD. Yet, recent AMD systems do
have support for NONSTOP_TSC. It is not clear to me why that would not be enough
(unless it is a cross-socket synchronization issue).

One thing is clear, is that if I enable sched_clock_stable on my AMD system, then
the perf sched warning disappears.

I believe there is really a bug in perf. I am sure there are systems out there
which cannot provide a very fine grain timestamp, thus it is possible to get samples
with identical timestamps. I don't quite understand the sample re-ordering code in
perf but it seems sensitive to timer granularity. I think it should not.

In summary, two issues:
  - Why is sched_clock_stable not set or even tested on recent AMD systems?
  - perf should not rely on fine granularity timestamps in the re-ordering code

Any comments?

#include <sys/types.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <errno.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
#include <syscall.h>
#include <err.h>

#define MAX_CPUS	64
#define NR_CPU_BITS	(MAX_CPUS>>3)
int
pin_cpu(pid_t pid, unsigned int cpu)
{
	uint64_t my_mask[NR_CPU_BITS];

	if (cpu >= MAX_CPUS)
		errx(1, "this program supports only up to %d CPUs", MAX_CPUS);

	my_mask[cpu>>6] = 1ULL << (cpu&63);

	return syscall(__NR_sched_setaffinity, pid, sizeof(my_mask), &my_mask);
}

static volatile int quit;
void sig_handler(int n)
{
	quit = 1;
}

static void
do_child(int fr, int fw)
{
	char c;
	ssize_t ret;

	for(;;) {
		ret = read(fr, &c, 1);	
		if (ret < 0)
			break;
		ret = write(fw, "c", 1);
		if (ret < 0)
			break;
		
	}
	printf("child exited\n");
	exit(0);
}


int
main(int argc, char **argv)
{
	int ret;
	int pr[2], pw[2];
	int which_cpu;
	pid_t pid;
	uint64_t nctx = 0;
	ssize_t nbytes;
	char c = '0';
	int delay;

	delay = argc > 1 ? atoi(argv[1]) : 10;
	
	srandom(getpid());
	which_cpu = random() % sysconf(_SC_NPROCESSORS_ONLN);

	ret = pipe(pr);
	if (ret)
		err(1, "cannot create read pipe");

	ret = pipe(pw);
	if (ret)
		err(1, "cannot create write pipe");

	ret = pin_cpu(getpid(), which_cpu);
	if (ret)
		err(1, "cannot pin to CPU%d", which_cpu);

	printf("Both processes pinned to CPU%d, running for %ds\n", which_cpu, delay);

	switch(pid=fork()) {
		case -1:
			err(1, "cannot create child");
		case 0:
			close(pr[1]); close(pw[0]);
			do_child(pr[0], pw[1]);
			exit(1);
	}

	close(pr[0]);
	close(pw[1]);

	signal(SIGALRM, sig_handler);
	alarm(delay);

	while(!quit) {
		nbytes = write(pr[1], "c", 1);
		nbytes = read(pw[0], &c, 1);	
		nctx++;
	}
	close(pr[1]);
	close(pw[0]);
	return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/