Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755265Ab2BFNZy (ORCPT ); Mon, 6 Feb 2012 08:25:54 -0500 Received: from mail-ww0-f42.google.com ([74.125.82.42]:61714 "EHLO mail-ww0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755243Ab2BFNZw (ORCPT ); Mon, 6 Feb 2012 08:25:52 -0500 Date: Mon, 6 Feb 2012 14:25:46 +0100 From: Stephane Eranian To: linux-kernel@vger.kernel.org Cc: acme@redhat.com, peterz@infradead.org, mingo@elte.hu, robert.richter@amd.com, eric.dumazet@gmail.com Subject: [BUG] perf: perf sched warning possibly due to clock granularity on AMD Message-ID: <20120206132546.GA30854@quad> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4524 Lines: 166 Hi, I am running 3.3.30-rc2 on an AMD Bulldozer system in 64-bit mode. I was testing perf sched and I ran into an issue. That issue seems to exist only on AMD and not on Intel systems. It is not PMU related because I am doing tracing. I am running a simple ping pong test to stress the context switch code. Two processes exchanging a byte through a pipe (program provided below). Then, I capture a trace using perf sched, and I run sched lat, but on certain runs, I get: $ perf sched rec pong 2 $ perf sched lat Warning: Timestamp below last timeslice flush Warning: Timestamp below last timeslice flush Warning: Timestamp below last timeslice flush Warning: Timestamp below last timeslice flush Warning: Timestamp below last timeslice flush Warning: Timestamp below last timeslice flush Warning: Timestamp below last timeslice flush Warning: Timestamp below last timeslice flush Warning: TimesFound 4934 unknown events! Is this an older tool processing a perf.data file generated by a more recent tool? I did some investigation in the kernel to figure out what could cause this problem. I found out that with this measurement, it is possible to get multiple consecutive samples with the SAME timestamp. That is possible on AMD because, apparently, local_clock() ends up calling, a sched_clock() function that is only based on jiffies. That means, the granularity is a tick (1ms in my case). On Intel systems, at least on mines, local_clock() ends up calling a routine which uses TSC and thus has a much greater precision. This is all decided by the sched_clock_stable variable. It is set at init time for Intel and not used on AMD. Yet, recent AMD systems do have support for NONSTOP_TSC. It is not clear to me why that would not be enough (unless it is a cross-socket synchronization issue). One thing is clear, is that if I enable sched_clock_stable on my AMD system, then the perf sched warning disappears. I believe there is really a bug in perf. I am sure there are systems out there which cannot provide a very fine grain timestamp, thus it is possible to get samples with identical timestamps. I don't quite understand the sample re-ordering code in perf but it seems sensitive to timer granularity. I think it should not. In summary, two issues: - Why is sched_clock_stable not set or even tested on recent AMD systems? - perf should not rely on fine granularity timestamps in the re-ordering code Any comments? #include #include #include #include #include #include #include #include #include #include #include #define MAX_CPUS 64 #define NR_CPU_BITS (MAX_CPUS>>3) int pin_cpu(pid_t pid, unsigned int cpu) { uint64_t my_mask[NR_CPU_BITS]; if (cpu >= MAX_CPUS) errx(1, "this program supports only up to %d CPUs", MAX_CPUS); my_mask[cpu>>6] = 1ULL << (cpu&63); return syscall(__NR_sched_setaffinity, pid, sizeof(my_mask), &my_mask); } static volatile int quit; void sig_handler(int n) { quit = 1; } static void do_child(int fr, int fw) { char c; ssize_t ret; for(;;) { ret = read(fr, &c, 1); if (ret < 0) break; ret = write(fw, "c", 1); if (ret < 0) break; } printf("child exited\n"); exit(0); } int main(int argc, char **argv) { int ret; int pr[2], pw[2]; int which_cpu; pid_t pid; uint64_t nctx = 0; ssize_t nbytes; char c = '0'; int delay; delay = argc > 1 ? atoi(argv[1]) : 10; srandom(getpid()); which_cpu = random() % sysconf(_SC_NPROCESSORS_ONLN); ret = pipe(pr); if (ret) err(1, "cannot create read pipe"); ret = pipe(pw); if (ret) err(1, "cannot create write pipe"); ret = pin_cpu(getpid(), which_cpu); if (ret) err(1, "cannot pin to CPU%d", which_cpu); printf("Both processes pinned to CPU%d, running for %ds\n", which_cpu, delay); switch(pid=fork()) { case -1: err(1, "cannot create child"); case 0: close(pr[1]); close(pw[0]); do_child(pr[0], pw[1]); exit(1); } close(pr[0]); close(pw[1]); signal(SIGALRM, sig_handler); alarm(delay); while(!quit) { nbytes = write(pr[1], "c", 1); nbytes = read(pw[0], &c, 1); nctx++; } close(pr[1]); close(pw[0]); return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/