From: Wang Nan
Subject: [RFC PATCH v2 3/3] perf record: Find tail pointer through size at end of event
Date: Mon, 7 Dec 2015 13:28:37 +0000
Message-ID: <1449494917-62970-4-git-send-email-wangnan0@huawei.com>
In-Reply-To: <1449494917-62970-1-git-send-email-wangnan0@huawei.com>
References: <1449063499-236703-1-git-send-email-wangnan0@huawei.com>
 <1449494917-62970-1-git-send-email-wangnan0@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

This is an RFC patch which roughly shows the usage of
PERF_SAMPLE_SIZE_AT_END, introduced by the previous patches.
In this prototype we can use 'perf record' to capture data in an
overwritable ring buffer:

 # ./perf record -g --call-graph=dwarf,128 -m 1 -e dummy -e syscalls:*/overwrite/ dd if=/dev/zero of=/dev/null count=4096 bs=1
 4096+0 records in
 4096+0 records out
 4096 bytes (4.1 kB) copied, 8.54486 s, 0.5 kB/s
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 0.082 MB perf.data (9 samples) ]

 # ./perf script -F comm,event
 dd syscalls:sys_enter_write:
 dd syscalls:sys_exit_write:
 dd syscalls:sys_enter_write:
 dd syscalls:sys_exit_write:
 dd syscalls:sys_enter_write:
 dd syscalls:sys_exit_write:
 dd syscalls:sys_enter_close:
 dd syscalls:sys_exit_close:
 dd syscalls:sys_enter_exit_group:

 # ls -l ./perf.data
 -rw------- 1 root root 497438 Dec 7 12:48 ./perf.data

In this case perf uses one page for an overwritable ring buffer. The
result is that only the last 9 samples (note the sys_enter_exit_group)
are captured.

 # ./perf record -g --call-graph=dwarf,128 -m 1 -e dummy -e syscalls:*/overwrite/ dd if=/dev/zero of=/dev/null count=8192 bs=1
 8192+0 records in
 8192+0 records out
 8192 bytes (8.2 kB) copied, 16.9867 s, 0.5 kB/s
 [ perf record: Woken up 1 times to write data ]
 [ perf record: Captured and wrote 0.082 MB perf.data (9 samples) ]

 # ls -l ./perf.data
 -rw------- 1 root root 497438 Dec 7 12:51 ./perf.data

Issuing more syscalls doesn't cause the perf.data size to increase:
still only 9 samples are captured.

record__search_read_start() is the core function I want to show in this
patch. It walks backward through the whole ring buffer and locates the
head of each event through the 'size' field stored at its tail, until it
finds the oldest event still in the buffer, then starts dumping from
there. Other parts in this patch are dirty tricks and should be
rewritten.

A limitation in this patch: there must be a '-e dummy' event, with
overwrite not set, as the first event. The following command causes an
error:

 # ./perf record -e syscalls:*/overwrite/ ...
This is because we need this dummy event to capture tracking events
like fork, mmap, etc. We'll go back to the kernel side and enforce the
'PERF_SAMPLE_SIZE_AT_END' flag as mentioned in the previous email:

 1. Don't allow events with that flag and events without it to be mixed
    and attached to one ring buffer;

 2. Append the size to tracking events (no-sample events), so an event
    with attr.tracking == 1 and PERF_SAMPLE_SIZE_AT_END set is
    acceptable.

From this perf size work I find the following improvements should be
made:

 1. Overwrite should not be an attribute of the evlist, but an
    attribute of each evsel;

 2. The principle of a 'channel' should be introduced, so different
    events can output to different places. This would be useful to:

    1) separate overwrite mappings from normal mappings;

    2) allow outputting tracking events (through a dummy event) to an
       isolated mmap area, so it would be possible for 'perf record' to
       run as a daemon which periodically synthesizes those events
       without parsing samples;

    3) allow an eBPF output event with the no-buffer attribute set, so
       eBPF programs would be able to control the behavior of perf, for
       example, shutting down events.
Signed-off-by: Wang Nan
---
 tools/perf/builtin-record.c | 73 +++++++++++++++++++++++++++++++++++++++++++--
 tools/perf/util/evlist.c    | 42 +++++++++++++++++++-------
 tools/perf/util/evsel.c     |  3 ++
 3 files changed, 105 insertions(+), 13 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ec4135c..3817a9a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -74,6 +74,54 @@ static int process_synthesized_event(struct perf_tool *tool,
 	return record__write(rec, event, event->header.size);
 }
 
+static int record__search_read_start(struct perf_mmap *md, u64 head, u64 *pstart)
+{
+	unsigned char *data = md->base + page_size;
+	u64 evt_head = head;
+	u16 *pevt_size;
+
+	while (true) {
+		struct perf_event_header *pheader;
+
+		pevt_size = (void *)&data[(evt_head - sizeof(*pevt_size)) & md->mask];
+
+		if (*pevt_size % sizeof(u16) != 0) {
+			pr_err("strange event size: %d\n", (int)(*pevt_size));
+			return -1;
+		}
+
+		if (!*pevt_size) {
+			if (evt_head) {
+				pr_err("size is 0 but evt_head (0x%lx) not 0\n",
+				       (unsigned long)evt_head);
+				return -1;
+			}
+			break;
+		}
+
+		if (evt_head < *pevt_size)
+			break;
+
+		evt_head -= *pevt_size;
+		if (evt_head + (md->mask + 1) < head) {
+			evt_head += *pevt_size;
+			pr_debug("evt_head=%lx, head=%lx, size=%d\n", evt_head, head, *pevt_size);
+			break;
+		}
+
+		pheader = (struct perf_event_header *)(&data[evt_head & md->mask]);
+
+		if (pheader->size != *pevt_size) {
+			pr_err("size mismatch: %d vs %d\n",
+			       (int)pheader->size, (int)(*pevt_size));
+			return -1;
+		}
+	}
+
+	*pstart = evt_head;
+	return 0;
+}
+
 static int record__mmap_read(struct record *rec, int idx)
 {
 	struct perf_mmap *md = &rec->evlist->mmap[idx];
@@ -83,6 +131,17 @@ static int record__mmap_read(struct record *rec, int idx)
 	unsigned long size;
 	void *buf;
 	int rc = 0;
+	bool overwrite = rec->evlist->overwrite;
+	int err;
+
+	if (idx >= rec->evlist->nr_mmaps)
+		overwrite = !overwrite;
+
+	if (overwrite) {
+		err = record__search_read_start(md, head, &old);
+		if (err)
+			return 0;
+	}
 
 	if (old == head)
 		return 0;
@@ -400,7 +459,7 @@ static struct perf_event_header finished_round_event = {
 	.type = PERF_RECORD_FINISHED_ROUND,
 };
 
-static int record__mmap_read_all(struct record *rec)
+static int __record__mmap_read_all(struct record *rec, bool next_half)
 {
 	u64 bytes_written = rec->bytes_written;
 	int i;
@@ -408,9 +467,10 @@ static int record__mmap_read_all(struct record *rec)
 
 	for (i = 0; i < rec->evlist->nr_mmaps; i++) {
 		struct auxtrace_mmap *mm = &rec->evlist->mmap[i].auxtrace_mmap;
+		int idx = i + (next_half ? rec->evlist->nr_mmaps : 0);
 
-		if (rec->evlist->mmap[i].base) {
-			if (record__mmap_read(rec, i) != 0) {
+		if (rec->evlist->mmap[idx].base) {
+			if (record__mmap_read(rec, idx) != 0) {
 				rc = -1;
 				goto out;
 			}
@@ -434,6 +494,11 @@ out:
 	return rc;
 }
 
+static int record__mmap_read_all(struct record *rec)
+{
+	return __record__mmap_read_all(rec, false);
+}
+
 static void record__init_features(struct record *rec)
 {
 	struct perf_session *session = rec->session;
@@ -727,6 +792,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	}
 	auxtrace_snapshot_enabled = 0;
 
+	__record__mmap_read_all(rec, true);
+
 	if (forks && workload_exec_errno) {
 		char msg[STRERR_BUFSIZE];
 		const char *emsg = strerror_r(workload_exec_errno, msg, sizeof(msg));
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 8dd59aa..a3421ee 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -800,7 +800,9 @@ void perf_evlist__mmap_consume(struct perf_evlist *evlist, int idx)
 {
 	struct perf_mmap *md = &evlist->mmap[idx];
 
-	if (!evlist->overwrite) {
+	if ((!evlist->overwrite && (idx < evlist->nr_mmaps)) ||
+	    (evlist->overwrite && (idx >= evlist->nr_mmaps))) {
+
 		u64 old = md->prev;
 
 		perf_mmap__write_tail(md, old);
@@ -855,7 +857,7 @@ void perf_evlist__munmap(struct perf_evlist *evlist)
 	if (evlist->mmap == NULL)
 		return;
 
-	for (i = 0; i < evlist->nr_mmaps; i++)
+	for (i = 0; i < 2 * evlist->nr_mmaps; i++)
 		__perf_evlist__munmap(evlist, i);
 
 	zfree(&evlist->mmap);
@@ -866,7 +868,7 @@ static int perf_evlist__alloc_mmap(struct perf_evlist *evlist)
 	evlist->nr_mmaps = cpu_map__nr(evlist->cpus);
 	if (cpu_map__empty(evlist->cpus))
 		evlist->nr_mmaps = thread_map__nr(evlist->threads);
-	evlist->mmap = zalloc(evlist->nr_mmaps * sizeof(struct perf_mmap));
+	evlist->mmap = zalloc(2 * evlist->nr_mmaps * sizeof(struct perf_mmap));
 	return evlist->mmap != NULL ? 0 : -ENOMEM;
 }
 
@@ -897,6 +899,7 @@ static int __perf_evlist__mmap(struct perf_evlist *evlist, int idx,
 	evlist->mmap[idx].mask = mp->mask;
 	evlist->mmap[idx].base = mmap(NULL, evlist->mmap_len, mp->prot,
 				      MAP_SHARED, fd, 0);
+
 	if (evlist->mmap[idx].base == MAP_FAILED) {
 		pr_debug2("failed to mmap perf event ring buffer, error %d\n",
 			  errno);
@@ -911,18 +914,31 @@ static int __perf_evlist__mmap(struct perf_evlist *evlist, int idx,
 	return 0;
 }
 
-static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-				       struct mmap_params *mp, int cpu,
-				       int thread, int *output)
+static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int _idx,
+				       struct mmap_params *_mp, int cpu,
+				       int thread, int *output1, int *output2)
 {
 	struct perf_evsel *evsel;
+	int *output;
+	struct mmap_params new_mp = *_mp;
 
 	evlist__for_each(evlist, evsel) {
 		int fd;
+		int idx = _idx;
+		struct mmap_params *mp = _mp;
 
 		if (evsel->system_wide && thread)
 			continue;
 
+		if (evsel->overwrite ^ evlist->overwrite) {
+			output = output2;
+			new_mp.prot ^= PROT_WRITE;
+			mp = &new_mp;
+			idx = _idx + evlist->nr_mmaps;
+		} else {
+			output = output1;
+		}
+
 		fd = FD(evsel, cpu, thread);
 
 		if (*output == -1) {
@@ -936,6 +952,8 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 			perf_evlist__mmap_get(evlist, idx);
 		}
 
+		if (evsel->overwrite)
+			goto skip_poll_add;
 		/*
 		 * The system_wide flag causes a selected event to be opened
 		 * always without a pid.  Consequently it will never get a
@@ -949,6 +967,8 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 			return -1;
 		}
 
+skip_poll_add:
+
 		if (evsel->attr.read_format & PERF_FORMAT_ID) {
 			if (perf_evlist__id_add_fd(evlist, evsel, cpu, thread,
 						   fd) < 0)
@@ -970,14 +990,15 @@ static int perf_evlist__mmap_per_cpu(struct perf_evlist *evlist,
 
 	pr_debug2("perf event ring buffer mmapped per cpu\n");
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
-		int output = -1;
+		int output1 = -1;
+		int output2 = -1;
 
 		auxtrace_mmap_params__set_idx(&mp->auxtrace_mp, evlist, cpu,
 					      true);
 
 		for (thread = 0; thread < nr_threads; thread++) {
 			if (perf_evlist__mmap_per_evsel(evlist, cpu, mp, cpu,
-							thread, &output))
+							thread, &output1, &output2))
 				goto out_unmap;
 		}
 	}
@@ -998,13 +1019,14 @@ static int perf_evlist__mmap_per_thread(struct perf_evlist *evlist,
 
 	pr_debug2("perf event ring buffer mmapped per thread\n");
 	for (thread = 0; thread < nr_threads; thread++) {
-		int output = -1;
+		int output1 = -1;
+		int output2 = -1;
 
 		auxtrace_mmap_params__set_idx(&mp->auxtrace_mp, evlist, thread,
 					      false);
 
 		if (perf_evlist__mmap_per_evsel(evlist, thread, mp, 0, thread,
-						&output))
+						&output1, &output2))
 			goto out_unmap;
 	}
 
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 3386437..8e40da9 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -910,6 +910,9 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
 	 * it overloads any global configuration.
 	 */
 	apply_config_terms(evsel, opts);
+
+	if (evsel->overwrite)
+		perf_evsel__set_sample_bit(evsel, SIZE_AT_END);
 }
 
 static int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)
-- 
1.8.3.4