From: Wang Nan
Subject: [RFC PATCH v2 3/3] perf record: Find tail pointer through size at end of event
Date: Mon, 7 Dec 2015 13:28:37 +0000
Message-ID: <1449494917-62970-4-git-send-email-wangnan0@huawei.com>
In-Reply-To: <1449494917-62970-1-git-send-email-wangnan0@huawei.com>
References: <1449063499-236703-1-git-send-email-wangnan0@huawei.com>
 <1449494917-62970-1-git-send-email-wangnan0@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

This is an RFC patch which roughly shows the usage of
PERF_SAMPLE_SIZE_AT_END, introduced by the previous patches.
In this prototype we can use 'perf record' to capture data in an
overwritable ring buffer:

 # ./perf record -g --call-graph=dwarf,128 -m 1 -e dummy -e syscalls:*/overwrite/ dd if=/dev/zero of=/dev/null count=4096 bs=1
 4096+0 records in
 4096+0 records out
 4096 bytes (4.1 kB) copied, 8.54486 s, 0.5 kB/s
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 0.082 MB perf.data (9 samples) ]

 # ./perf script -F comm,event
 dd syscalls:sys_enter_write:
 dd syscalls:sys_exit_write:
 dd syscalls:sys_enter_write:
 dd syscalls:sys_exit_write:
 dd syscalls:sys_enter_write:
 dd syscalls:sys_exit_write:
 dd syscalls:sys_enter_close:
 dd syscalls:sys_exit_close:
 dd syscalls:sys_enter_exit_group:

 # ls -l ./perf.data
 -rw------- 1 root root 497438 Dec 7 12:48 ./perf.data

In this case perf uses one page for an overwritable ring buffer. The
result is that only the last 9 samples (note the sys_enter_exit_group)
are captured.

 # ./perf record -g --call-graph=dwarf,128 -m 1 -e dummy -e syscalls:*/overwrite/ dd if=/dev/zero of=/dev/null count=8192 bs=1
 8192+0 records in
 8192+0 records out
 8192 bytes (8.2 kB) copied, 16.9867 s, 0.5 kB/s
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 0.082 MB perf.data (9 samples) ]

 # ls -l ./perf.data
 -rw------- 1 root root 497438 Dec 7 12:51 ./perf.data

Issuing more syscalls doesn't cause the perf.data size to increase:
still only 9 samples are captured.

record__search_read_start() is the core function I want to show in this
patch. It walks backward through the whole ring buffer and locates the
head of each event through the 'size' field stored at its tail, until it
finds the oldest event still in the buffer, then starts dumping from
there. Other parts in this patch are dirty tricks and should be
rewritten.

A limitation in this patch: there must be a '-e dummy' event, with
overwrite not set, as the first event. The following command causes an
error:

 # ./perf record -e syscalls:*/overwrite/ ...
This is because we need this dummy event to capture tracking events
like fork, mmap, etc. We'll go back to the kernel side and enforce the
'PERF_SAMPLE_SIZE_AT_END' flag as mentioned in the previous email:

 1. Don't allow events with that flag and events without it to be mixed
    and attached to one ring buffer;

 2. Append the size to tracking events (no-sample events), so an event
    with attr.tracking == 1 and PERF_SAMPLE_SIZE_AT_END set is
    acceptable.

From this perf size work I find the following improvements should be
made:

 1. Overwrite should not be an attribute of the evlist, but an
    attribute of each evsel;

 2. The principle of a 'channel' should be introduced, so different
    events can output to different places. This would be useful to:

    1) separate overwrite mappings from normal mappings;

    2) allow outputting tracking events (through a dummy event) to an
       isolated mmap area, so it would be possible for 'perf record' to
       run as a daemon which periodically synthesizes those events
       without parsing samples;

    3) allow an eBPF output event with the no-buffer attribute set, so
       eBPF programs would be able to control the behavior of perf, for
       example, shutting down events.
Signed-off-by: Wang Nan
---
 tools/perf/builtin-record.c | 73 +++++++++++++++++++++++++++++++++++++++++++--
 tools/perf/util/evlist.c    | 42 +++++++++++++++++++-------
 tools/perf/util/evsel.c     |  3 ++
 3 files changed, 105 insertions(+), 13 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ec4135c..3817a9a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -74,6 +74,54 @@ static int process_synthesized_event(struct perf_tool *tool,
 	return record__write(rec, event, event->header.size);
 }
 
+static int record__search_read_start(struct perf_mmap *md, u64 head, u64 *pstart)
+{
+	unsigned char *data = md->base + page_size;
+	u64 evt_head = head;
+	u16 *pevt_size;
+
+	while (true) {
+		struct perf_event_header *pheader;
+
+		pevt_size = (void *)&data[(evt_head - sizeof(*pevt_size)) & md->mask];
+
+		if (*pevt_size % sizeof(u16) != 0) {
+			pr_err("strange event size: %d\n", (int)(*pevt_size));
+			return -1;
+		}
+
+		if (!*pevt_size) {
+			if (evt_head) {
+				pr_err("size is 0 but evt_head (0x%lx) not 0\n",
+				       (unsigned long)evt_head);
+				return -1;
+			}
+			break;
+		}
+
+		if (evt_head < *pevt_size)
+			break;
+
+		evt_head -= *pevt_size;
+		if (evt_head + (md->mask + 1) < head) {
+			evt_head += *pevt_size;
+			pr_debug("evt_head=%lx, head=%lx, size=%d\n", evt_head, head, *pevt_size);
+			break;
+		}
+
+		pheader = (struct perf_event_header *)(&data[evt_head & md->mask]);
+
+		if (pheader->size != *pevt_size) {
+			pr_err("size mismatch: %d vs %d\n",
+			       (int)pheader->size, (int)(*pevt_size));
+			return -1;
+		}
+	}
+
+	*pstart = evt_head;
+	return 0;
+}
+
 static int record__mmap_read(struct record *rec, int idx)
 {
 	struct perf_mmap *md = &rec->evlist->mmap[idx];
@@ -83,6 +131,17 @@ static int record__mmap_read(struct record *rec, int idx)
 	unsigned long size;
 	void *buf;
 	int rc = 0;
+	bool overwrite = rec->evlist->overwrite;
+	int err;
+
+	if (idx >= rec->evlist->nr_mmaps)
+		overwrite = !overwrite;
+
+	if (overwrite) {
+		err = record__search_read_start(md, head, &old);
+		if (err)
+			return 0;
+	}
 
 	if (old == head)
 		return 0;
@@ -400,7 +459,7 @@ static struct perf_event_header finished_round_event = {
 	.type = PERF_RECORD_FINISHED_ROUND,
 };
 
-static int record__mmap_read_all(struct record *rec)
+static int __record__mmap_read_all(struct record *rec, bool next_half)
 {
 	u64 bytes_written = rec->bytes_written;
 	int i;
@@ -408,9 +467,10 @@ static int record__mmap_read_all(struct record *rec)
 
 	for (i = 0; i < rec->evlist->nr_mmaps; i++) {
 		struct auxtrace_mmap *mm = &rec->evlist->mmap[i].auxtrace_mmap;
+		int idx = i + (next_half ? rec->evlist->nr_mmaps : 0);
 
-		if (rec->evlist->mmap[i].base) {
-			if (record__mmap_read(rec, i) != 0) {
+		if (rec->evlist->mmap[idx].base) {
+			if (record__mmap_read(rec, idx) != 0) {
 				rc = -1;
 				goto out;
 			}
@@ -434,6 +494,11 @@ out:
 	return rc;
 }
 
+static int record__mmap_read_all(struct record *rec)
+{
+	return __record__mmap_read_all(rec, false);
+}
+
 static void record__init_features(struct record *rec)
 {
 	struct perf_session *session = rec->session;
@@ -727,6 +792,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	}
 	auxtrace_snapshot_enabled = 0;
 
+	__record__mmap_read_all(rec, true);
+
 	if (forks && workload_exec_errno) {
 		char msg[STRERR_BUFSIZE];
 		const char *emsg = strerror_r(workload_exec_errno, msg, sizeof(msg));
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 8dd59aa..a3421ee 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -800,7 +800,9 @@ void perf_evlist__mmap_consume(struct perf_evlist *evlist, int idx)
 {
 	struct perf_mmap *md = &evlist->mmap[idx];
 
-	if (!evlist->overwrite) {
+	if ((!evlist->overwrite && (idx < evlist->nr_mmaps)) ||
+	    (evlist->overwrite && (idx >= evlist->nr_mmaps))) {
+
 		u64 old = md->prev;
 
 		perf_mmap__write_tail(md, old);
@@ -855,7 +857,7 @@ void perf_evlist__munmap(struct perf_evlist *evlist)
 	if (evlist->mmap == NULL)
 		return;
 
-	for (i = 0; i < evlist->nr_mmaps; i++)
+	for (i = 0; i < 2 * evlist->nr_mmaps; i++)
 		__perf_evlist__munmap(evlist, i);
 
 	zfree(&evlist->mmap);
@@ -866,7 +868,7 @@ static int perf_evlist__alloc_mmap(struct perf_evlist *evlist)
 	evlist->nr_mmaps = cpu_map__nr(evlist->cpus);
 	if (cpu_map__empty(evlist->cpus))
 		evlist->nr_mmaps = thread_map__nr(evlist->threads);
-	evlist->mmap = zalloc(evlist->nr_mmaps * sizeof(struct perf_mmap));
+	evlist->mmap = zalloc(2 * evlist->nr_mmaps * sizeof(struct perf_mmap));
 	return evlist->mmap != NULL ? 0 : -ENOMEM;
 }
 
@@ -897,6 +899,7 @@ static int __perf_evlist__mmap(struct perf_evlist *evlist, int idx,
 	evlist->mmap[idx].mask = mp->mask;
 	evlist->mmap[idx].base = mmap(NULL, evlist->mmap_len, mp->prot,
 				      MAP_SHARED, fd, 0);
+
 	if (evlist->mmap[idx].base == MAP_FAILED) {
 		pr_debug2("failed to mmap perf event ring buffer, error %d\n",
 			  errno);
@@ -911,18 +914,31 @@ static int __perf_evlist__mmap(struct perf_evlist *evlist, int idx,
 	return 0;
 }
 
-static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-				       struct mmap_params *mp, int cpu,
-				       int thread, int *output)
+static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int _idx,
+				       struct mmap_params *_mp, int cpu,
+				       int thread, int *output1, int *output2)
 {
 	struct perf_evsel *evsel;
+	int *output;
+	struct mmap_params new_mp = *_mp;
 
 	evlist__for_each(evlist, evsel) {
 		int fd;
+		int idx = _idx;
+		struct mmap_params *mp = _mp;
 
 		if (evsel->system_wide && thread)
 			continue;
 
+		if (evsel->overwrite ^ evlist->overwrite) {
+			output = output2;
+			new_mp.prot ^= PROT_WRITE;
+			mp = &new_mp;
+			idx = _idx + evlist->nr_mmaps;
+		} else {
+			output = output1;
+		}
+
 		fd = FD(evsel, cpu, thread);
 
 		if (*output == -1) {
@@ -936,6 +952,8 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 			perf_evlist__mmap_get(evlist, idx);
 		}
 
+		if (evsel->overwrite)
+			goto skip_poll_add;
 		/*
 		 * The system_wide flag causes a selected event to be opened
 		 * always without a pid.  Consequently it will never get a
@@ -949,6 +967,8 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 			return -1;
 		}
 
+skip_poll_add:
+
 		if (evsel->attr.read_format & PERF_FORMAT_ID) {
 			if (perf_evlist__id_add_fd(evlist, evsel, cpu, thread,
 						   fd) < 0)
@@ -970,14 +990,15 @@ static int perf_evlist__mmap_per_cpu(struct perf_evlist *evlist,
 
 	pr_debug2("perf event ring buffer mmapped per cpu\n");
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
-		int output = -1;
+		int output1 = -1;
+		int output2 = -1;
 
 		auxtrace_mmap_params__set_idx(&mp->auxtrace_mp, evlist, cpu,
 					      true);
 
 		for (thread = 0; thread < nr_threads; thread++) {
 			if (perf_evlist__mmap_per_evsel(evlist, cpu, mp, cpu,
-							thread, &output))
+							thread, &output1, &output2))
 				goto out_unmap;
 		}
 	}
@@ -998,13 +1019,14 @@ static int perf_evlist__mmap_per_thread(struct perf_evlist *evlist,
 
 	pr_debug2("perf event ring buffer mmapped per thread\n");
 	for (thread = 0; thread < nr_threads; thread++) {
-		int output = -1;
+		int output1 = -1;
+		int output2 = -1;
 
 		auxtrace_mmap_params__set_idx(&mp->auxtrace_mp, evlist, thread,
 					      false);
 
 		if (perf_evlist__mmap_per_evsel(evlist, thread, mp, 0, thread,
-						&output))
+						&output1, &output2))
 			goto out_unmap;
 	}
 
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 3386437..8e40da9 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -910,6 +910,9 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
 	 * it overloads any global configuration.
 	 */
 	apply_config_terms(evsel, opts);
+
+	if (evsel->overwrite)
+		perf_evsel__set_sample_bit(evsel, SIZE_AT_END);
 }
 
 static int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)
-- 
1.8.3.4