2021-02-03 14:02:10

by Alexander Antonov

[permalink] [raw]
Subject: [PATCH v4 0/5] perf stat: Introduce iostat mode to provide I/O performance metrics

The previous version can be found at:
v3: https://lkml.kernel.org/r/[email protected]/
Changes in this revision are:
v3 -> v4:
- Addressed comment from Namhyung Kim:
1. Removed NULL-termination of root ports list

The previous version can be found at:
v2: https://lkml.kernel.org/r/[email protected]

Changes in this revision are:
v2 -> v3:
- Addressed comments from Namhyung Kim:
1. Removed perf_device pointer from evsel structure. Use priv field instead
2. Renamed 'iiostat' to 'iostat'
3. Renamed 'show' mode to 'list' mode
4. Renamed iiostat_delete_root_ports() to iiostat_release() and
iostat_show_root_ports() to iostat_list()

The previous version can be found at:
v1: https://lkml.kernel.org/r/[email protected]

Changes in this revision are:
v1 -> v2:
- Addressed comment from Arnaldo Carvalho de Melo:
1. Using 'perf iiostat' subcommand instead of 'perf stat --iiostat':
- Added perf-iiostat.sh script to use short command
- Updated manual pages to get help for 'perf iiostat'
- Added 'perf-iiostat' to perf's gitignore file

Mode is intended to provide four I/O performance metrics in MB per each
root port:
- Inbound Read: I/O devices below root port read from the host memory
- Inbound Write: I/O devices below root port write to the host memory
- Outbound Read: CPU reads from I/O devices below root port
- Outbound Write: CPU writes to I/O devices below root port

Each metric requiries only one uncore event which increments at every 4B
transfer in corresponding direction. The formulas to compute metrics
are generic:
#EventCount * 4B / (1024 * 1024)

Note: iostat introduces new perf data aggregation mode - per PCIe root port
hence -e and -M options are not supported.

Usage examples:

1. List all PCIe root ports (example for 2-S platform):
$ perf iostat list
S0-uncore_iio_0<0000:00>
S1-uncore_iio_0<0000:80>
S0-uncore_iio_1<0000:17>
S1-uncore_iio_1<0000:85>
S0-uncore_iio_2<0000:3a>
S1-uncore_iio_2<0000:ae>
S0-uncore_iio_3<0000:5d>
S1-uncore_iio_3<0000:d7>

2. Collect metrics for all PCIe root ports:
$ perf iostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
357708+0 records in
357707+0 records out
375083606016 bytes (375 GB, 349 GiB) copied, 215.974 s, 1.7 GB/s

Performance counter stats for 'system wide':

port Inbound Read(MB) Inbound Write(MB) Outbound Read(MB) Outbound Write(MB)
0000:00 1 0 2 3
0000:80 0 0 0 0
0000:17 352552 43 0 21
0000:85 0 0 0 0
0000:3a 3 0 0 0
0000:ae 0 0 0 0
0000:5d 0 0 0 0
0000:d7 0 0 0 0

3. Collect metrics for comma separated list of PCIe root ports:
$ perf iostat 0000:17,0:3a -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
357708+0 records in
357707+0 records out
375083606016 bytes (375 GB, 349 GiB) copied, 197.08 s, 1.9 GB/s

Performance counter stats for 'system wide':

port Inbound Read(MB) Inbound Write(MB) Outbound Read(MB) Outbound Write(MB)
0000:17 358559 44 0 22
0000:3a 3 2 0 0

197.081983474 seconds time elapsed


Alexander Antonov (5):
perf stat: Add AGGR_PCIE_PORT mode
perf stat: Basic support for iostat in perf
perf stat: Helper functions for PCIe root ports list in iostat mode
perf stat: Enable iostat mode for x86 platforms
perf: Update .gitignore file

tools/perf/.gitignore | 1 +
tools/perf/Documentation/perf-iostat.txt | 88 ++++
tools/perf/Makefile.perf | 5 +-
tools/perf/arch/x86/util/Build | 1 +
tools/perf/arch/x86/util/iostat.c | 469 ++++++++++++++++++
tools/perf/builtin-stat.c | 36 +-
tools/perf/command-list.txt | 1 +
tools/perf/perf-iostat.sh | 12 +
tools/perf/util/iostat.h | 32 ++
.../scripting-engines/trace-event-python.c | 3 +-
tools/perf/util/stat-display.c | 53 +-
tools/perf/util/stat-shadow.c | 11 +-
tools/perf/util/stat.c | 4 +-
tools/perf/util/stat.h | 2 +
14 files changed, 710 insertions(+), 8 deletions(-)
create mode 100644 tools/perf/Documentation/perf-iostat.txt
create mode 100644 tools/perf/arch/x86/util/iostat.c
create mode 100644 tools/perf/perf-iostat.sh
create mode 100644 tools/perf/util/iostat.h


base-commit: b145b0eb2031a620ca010174240963e4d2c6ce26
--
2.19.1


2021-02-03 14:05:32

by Alexander Antonov

[permalink] [raw]
Subject: [PATCH v4 4/5] perf stat: Enable iostat mode for x86 platforms

This functionality is based on recently introduced sysfs attributes
for Intel® Xeon® Scalable processor family (code name Skylake-SP):
Commit bb42b3d39781 ("perf/x86/intel/uncore: Expose an Uncore unit to
IIO PMON mapping")

Mode is intended to provide four I/O performance metrics in MB per each
PCIe root port:
- Inbound Read: I/O devices below root port read from the host memory
- Inbound Write: I/O devices below root port write to the host memory
- Outbound Read: CPU reads from I/O devices below root port
- Outbound Write: CPU writes to I/O devices below root port

Each metric requiries only one uncore event which increments at every 4B
transfer in corresponding direction. The formulas to compute metrics
are generic:
#EventCount * 4B / (1024 * 1024)

Signed-off-by: Alexander Antonov <[email protected]>
---
tools/perf/Documentation/perf-iostat.txt | 88 ++++++
tools/perf/Makefile.perf | 5 +-
tools/perf/arch/x86/util/Build | 1 +
tools/perf/arch/x86/util/iostat.c | 345 +++++++++++++++++++++++
tools/perf/command-list.txt | 1 +
tools/perf/perf-iostat.sh | 12 +
6 files changed, 451 insertions(+), 1 deletion(-)
create mode 100644 tools/perf/Documentation/perf-iostat.txt
create mode 100644 tools/perf/perf-iostat.sh

diff --git a/tools/perf/Documentation/perf-iostat.txt b/tools/perf/Documentation/perf-iostat.txt
new file mode 100644
index 000000000000..165176944031
--- /dev/null
+++ b/tools/perf/Documentation/perf-iostat.txt
@@ -0,0 +1,88 @@
+perf-iostat(1)
+===============
+
+NAME
+----
+perf-iostat - Show I/O performance metrics
+
+SYNOPSIS
+--------
+[verse]
+'perf iostat' list
+'perf iostat' <ports> -- <command> [<options>]
+
+DESCRIPTION
+-----------
+Mode is intended to provide four I/O performance metrics per each PCIe root port:
+
+- Inbound Read - I/O devices below root port read from the host memory, in MB
+
+- Inbound Write - I/O devices below root port write to the host memory, in MB
+
+- Outbound Read - CPU reads from I/O devices below root port, in MB
+
+- Outbound Write - CPU writes to I/O devices below root port, in MB
+
+OPTIONS
+-------
+<command>...::
+ Any command you can specify in a shell.
+
+list::
+ List all PCIe root ports.
+
+<ports>::
+ Select the root ports for monitoring. Comma-separated list is supported.
+
+EXAMPLES
+--------
+
+1. List all PCIe root ports (example for 2-S platform):
+
+ $ perf iostat list
+ S0-uncore_iio_0<0000:00>
+ S1-uncore_iio_0<0000:80>
+ S0-uncore_iio_1<0000:17>
+ S1-uncore_iio_1<0000:85>
+ S0-uncore_iio_2<0000:3a>
+ S1-uncore_iio_2<0000:ae>
+ S0-uncore_iio_3<0000:5d>
+ S1-uncore_iio_3<0000:d7>
+
+2. Collect metrics for all PCIe root ports:
+
+ $ perf iostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
+ 357708+0 records in
+ 357707+0 records out
+ 375083606016 bytes (375 GB, 349 GiB) copied, 215.974 s, 1.7 GB/s
+
+ Performance counter stats for 'system wide':
+
+ port Inbound Read(MB) Inbound Write(MB) Outbound Read(MB) Outbound Write(MB)
+ 0000:00 1 0 2 3
+ 0000:80 0 0 0 0
+ 0000:17 352552 43 0 21
+ 0000:85 0 0 0 0
+ 0000:3a 3 0 0 0
+ 0000:ae 0 0 0 0
+ 0000:5d 0 0 0 0
+ 0000:d7 0 0 0 0
+
+3. Collect metrics for comma-separated list of PCIe root ports:
+
+ $ perf iostat 0000:17,0:3a -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
+ 357708+0 records in
+ 357707+0 records out
+ 375083606016 bytes (375 GB, 349 GiB) copied, 197.08 s, 1.9 GB/s
+
+ Performance counter stats for 'system wide':
+
+ port Inbound Read(MB) Inbound Write(MB) Outbound Read(MB) Outbound Write(MB)
+ 0000:17 358559 44 0 22
+ 0000:3a 3 2 0 0
+
+ 197.081983474 seconds time elapsed
+
+SEE ALSO
+--------
+linkperf:perf-stat[1]
\ No newline at end of file
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 902c792f326a..b4ab48cc01e3 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -267,6 +267,7 @@ SCRIPT_SH =

SCRIPT_SH += perf-archive.sh
SCRIPT_SH += perf-with-kcore.sh
+SCRIPT_SH += perf-iostat.sh

grep-libs = $(filter -l%,$(1))
strip-libs = $(filter-out -l%,$(1))
@@ -890,6 +891,8 @@ endif
$(INSTALL) $(OUTPUT)perf-archive -t '$(DESTDIR_SQ)$(perfexec_instdir_SQ)'
$(call QUIET_INSTALL, perf-with-kcore) \
$(INSTALL) $(OUTPUT)perf-with-kcore -t '$(DESTDIR_SQ)$(perfexec_instdir_SQ)'
+ $(call QUIET_INSTALL, perf-iostat) \
+ $(INSTALL) $(OUTPUT)perf-iostat -t '$(DESTDIR_SQ)$(perfexec_instdir_SQ)'
ifndef NO_LIBAUDIT
$(call QUIET_INSTALL, strace/groups) \
$(INSTALL) -d -m 755 '$(DESTDIR_SQ)$(STRACE_GROUPS_INSTDIR_SQ)'; \
@@ -954,7 +957,7 @@ python-clean:
$(python-clean)

clean:: $(LIBTRACEEVENT)-clean $(LIBAPI)-clean $(LIBBPF)-clean $(LIBSUBCMD)-clean $(LIBPERF)-clean config-clean fixdep-clean python-clean
- $(call QUIET_CLEAN, core-objs) $(RM) $(LIBPERF_A) $(OUTPUT)perf-archive $(OUTPUT)perf-with-kcore $(LANG_BINDINGS)
+ $(call QUIET_CLEAN, core-objs) $(RM) $(LIBPERF_A) $(OUTPUT)perf-archive $(OUTPUT)perf-with-kcore $(OUTPUT)perf-iostat $(LANG_BINDINGS)
$(Q)find $(if $(OUTPUT),$(OUTPUT),.) -name '*.o' -delete -o -name '\.*.cmd' -delete -o -name '\.*.d' -delete
$(Q)$(RM) $(OUTPUT).config-detected
$(call QUIET_CLEAN, core-progs) $(RM) $(ALL_PROGRAMS) perf perf-read-vdso32 perf-read-vdsox32 $(OUTPUT)pmu-events/jevents $(OUTPUT)$(LIBJVMTI).so
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 47f9c56e744f..5b650bebee78 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -6,6 +6,7 @@ perf-y += perf_regs.o
perf-y += group.o
perf-y += machine.o
perf-y += event.o
+perf-y += iostat.o

perf-$(CONFIG_DWARF) += dwarf-regs.o
perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/iostat.c b/tools/perf/arch/x86/util/iostat.c
index 961e540106e6..468a9696a12d 100644
--- a/tools/perf/arch/x86/util/iostat.c
+++ b/tools/perf/arch/x86/util/iostat.c
@@ -27,6 +27,44 @@
#include "util/counts.h"
#include "path.h"

+#ifndef MAX_PATH
+#define MAX_PATH 1024
+#endif
+
+#define UNCORE_IIO_PMU_PATH "devices/uncore_iio_%d"
+#define SYSFS_UNCORE_PMU_PATH "%s/"UNCORE_IIO_PMU_PATH
+#define PLATFORM_MAPPING_PATH UNCORE_IIO_PMU_PATH"/die%d"
+
+enum iostat_mode_t {
+ IOSTAT_NONE = -1,
+ IOSTAT_RUN = 0,
+ IOSTAT_LIST = 1
+};
+
+static enum iostat_mode_t iostat_mode = IOSTAT_NONE;
+
+/*
+ * Each metric requiries one IIO event which increments at every 4B transfer
+ * in corresponding direction. The formulas to compute metrics are generic:
+ * #EventCount * 4B / (1024 * 1024)
+ */
+static const char * const iostat_metrics[] = {
+ "Inbound Read(MB)",
+ "Inbound Write(MB)",
+ "Outbound Read(MB)",
+ "Outbound Write(MB)",
+};
+
+static inline int iostat_metrics_count(void)
+{
+ return sizeof(iostat_metrics) / sizeof(char *);
+}
+
+static const char *iostat_metric_by_idx(int idx)
+{
+ return *(iostat_metrics + idx % iostat_metrics_count());
+}
+
struct iio_root_port {
u32 domain;
u8 bus;
@@ -122,3 +160,310 @@ static int iio_root_ports_list_insert(struct iio_root_ports_list *list,
}
return 0;
}
+
+static int iio_mapping(u8 pmu_idx, struct iio_root_ports_list * const list)
+{
+ char *buf;
+ char path[MAX_PATH];
+ u32 domain;
+ u8 bus;
+ struct iio_root_port *rp;
+ size_t size;
+ int ret;
+
+ for (int die = 0; die < cpu__max_node(); die++) {
+ scnprintf(path, MAX_PATH, PLATFORM_MAPPING_PATH, pmu_idx, die);
+ if (sysfs__read_str(path, &buf, &size) < 0) {
+ if (pmu_idx)
+ goto out;
+ pr_err("Mode iostat is not supported\n");
+ return -1;
+ }
+ ret = sscanf(buf, "%04x:%02hhx", &domain, &bus);
+ free(buf);
+ if (ret != 2) {
+ pr_err("Invalid mapping data: iio_%d; die%d\n",
+ pmu_idx, die);
+ return -1;
+ }
+ rp = iio_root_port_new(domain, bus, die, pmu_idx);
+ if (!rp || iio_root_ports_list_insert(list, rp)) {
+ free(rp);
+ return -ENOMEM;
+ }
+ }
+out:
+ return 0;
+}
+
+static u8 iio_pmu_count(void)
+{
+ u8 pmu_idx = 0;
+ char path[MAX_PATH];
+ const char *sysfs = sysfs__mountpoint();
+
+ if (sysfs) {
+ for (;; pmu_idx++) {
+ snprintf(path, sizeof(path), SYSFS_UNCORE_PMU_PATH,
+ sysfs, pmu_idx);
+ if (access(path, F_OK) != 0)
+ break;
+ }
+ }
+ return pmu_idx;
+}
+
+static int iio_root_ports_scan(struct iio_root_ports_list **list)
+{
+ int ret = -ENOMEM;
+ struct iio_root_ports_list *tmp_list;
+ u8 pmu_count = iio_pmu_count();
+
+ if (!pmu_count) {
+ pr_err("Unsupported uncore pmu configuration\n");
+ return -1;
+ }
+
+ tmp_list = iio_root_ports_list_new();
+ if (!tmp_list)
+ goto err;
+
+ for (u8 pmu_idx = 0; pmu_idx < pmu_count; pmu_idx++) {
+ ret = iio_mapping(pmu_idx, tmp_list);
+ if (ret)
+ break;
+ }
+err:
+ if (!ret)
+ *list = tmp_list;
+ else
+ iio_root_ports_list_free(tmp_list);
+
+ return ret;
+}
+
+static int iio_root_port_parse_str(u32 *domain, u8 *bus, char *str)
+{
+ int ret;
+ regex_t regex;
+ /*
+ * Expected format domain:bus:
+ * Valid domain range [0:ffff]
+ * Valid bus range [0:ff]
+ * Example: 0000:af, 0:3d, 01:7
+ */
+ regcomp(&regex, "^([a-f0-9A-F]{1,}):([a-f0-9A-F]{1,2})$", REG_EXTENDED);
+ ret = regexec(&regex, str, 0, NULL, 0);
+ if (ret || sscanf(str, "%08x:%02hhx", domain, bus) != 2)
+ pr_warning("Unrecognized root port format: %s\n"
+ "Please use the following format:\n"
+ "\t [domain]:[bus]\n"
+ "\t for example: 0000:3d\n", str);
+
+ regfree(&regex);
+ return ret;
+}
+
+static int iio_root_ports_list_filter(struct iio_root_ports_list **list,
+ const char *filter)
+{
+ char *tok, *tmp, *filter_copy = NULL;
+ struct iio_root_port *rp;
+ u32 domain;
+ u8 bus;
+ int ret = -ENOMEM;
+ struct iio_root_ports_list *tmp_list = iio_root_ports_list_new();
+
+ if (!tmp_list)
+ goto err;
+
+ filter_copy = strdup(filter);
+ if (!filter_copy)
+ goto err;
+
+ for (tok = strtok_r(filter_copy, ",", &tmp); tok;
+ tok = strtok_r(NULL, ",", &tmp)) {
+ if (!iio_root_port_parse_str(&domain, &bus, tok)) {
+ rp = iio_root_port_find_by_notation(*list, domain, bus);
+ if (rp) {
+ (*list)->rps[rp->idx] = NULL;
+ ret = iio_root_ports_list_insert(tmp_list, rp);
+ if (ret) {
+ free(rp);
+ goto err;
+ }
+ } else if (!iio_root_port_find_by_notation(tmp_list,
+ domain, bus))
+ pr_warning("Root port %04x:%02x were not found\n",
+ domain, bus);
+ }
+ }
+
+ if (tmp_list->nr_entries == 0) {
+ pr_err("Requested root ports were not found\n");
+ ret = -EINVAL;
+ }
+err:
+ iio_root_ports_list_free(*list);
+ if (ret)
+ iio_root_ports_list_free(tmp_list);
+ else
+ *list = tmp_list;
+
+ free(filter_copy);
+ return ret;
+}
+
+static int iostat_event_group(struct evlist *evl,
+ struct iio_root_ports_list *list)
+{
+ int ret;
+ int idx;
+ const char *iostat_cmd_template =
+ "{uncore_iio_%x/event=0x83,umask=0x04,ch_mask=0xF,fc_mask=0x07/,\
+ uncore_iio_%x/event=0x83,umask=0x01,ch_mask=0xF,fc_mask=0x07/,\
+ uncore_iio_%x/event=0xc0,umask=0x04,ch_mask=0xF,fc_mask=0x07/,\
+ uncore_iio_%x/event=0xc0,umask=0x01,ch_mask=0xF,fc_mask=0x07/}";
+ const int len_template = strlen(iostat_cmd_template) + 1;
+ struct evsel *evsel = NULL;
+ int metrics_count = iostat_metrics_count();
+ char *iostat_cmd = calloc(len_template, 1);
+
+ if (!iostat_cmd)
+ return -ENOMEM;
+
+ for (idx = 0; idx < list->nr_entries; idx++) {
+ sprintf(iostat_cmd, iostat_cmd_template,
+ list->rps[idx]->pmu_idx, list->rps[idx]->pmu_idx,
+ list->rps[idx]->pmu_idx, list->rps[idx]->pmu_idx);
+ ret = parse_events(evl, iostat_cmd, NULL);
+ if (ret)
+ goto err;
+ }
+
+ evlist__for_each_entry(evl, evsel) {
+ evsel->priv = list->rps[evsel->idx / metrics_count];
+ }
+ list->nr_entries = 0;
+err:
+ iio_root_ports_list_free(list);
+ free(iostat_cmd);
+ return ret;
+}
+
+int iostat_parse(const struct option *opt, const char *str,
+ int unset __maybe_unused)
+{
+ int ret;
+ struct iio_root_ports_list *list;
+ struct evlist *evl = *(struct evlist **)opt->value;
+ struct perf_stat_config *config = (struct perf_stat_config *)opt->data;
+
+ if (evl->core.nr_entries > 0) {
+ pr_err("Unsupported event configuration\n");
+ return -1;
+ }
+ config->metric_only = true;
+ config->aggr_mode = AGGR_PCIE_PORT;
+ config->iostat_run = true;
+ ret = iio_root_ports_scan(&list);
+ if (ret)
+ return ret;
+
+ if (!str) {
+ iostat_mode = IOSTAT_RUN;
+ } else if (!strcmp(str, "list")) {
+ iostat_mode = IOSTAT_LIST;
+ } else {
+ iostat_mode = IOSTAT_RUN;
+ ret = iio_root_ports_list_filter(&list, str);
+ if (ret)
+ return ret;
+ }
+ return iostat_event_group(evl, list);
+}
+
+void iostat_prefix(struct perf_stat_config *config,
+ struct evlist *evlist,
+ char *prefix, struct timespec *ts)
+{
+ struct iio_root_port *rp = evlist->selected->priv;
+
+ if (rp) {
+ if (ts)
+ sprintf(prefix, "%6lu.%09lu%s%04x:%02x%s",
+ ts->tv_sec, ts->tv_nsec,
+ config->csv_sep, rp->domain, rp->bus,
+ config->csv_sep);
+ else
+ sprintf(prefix, "%04x:%02x%s", rp->domain, rp->bus,
+ config->csv_sep);
+ }
+}
+
+void iostat_print_metric(struct perf_stat_config *config, struct evsel *evsel,
+ struct perf_stat_output_ctx *out)
+{
+ double iostat_value = 0;
+ u64 prev_count_val = 0;
+ const char *iostat_metric = iostat_metric_by_idx(evsel->idx);
+ u8 die = ((struct iio_root_port *)evsel->priv)->die;
+ struct perf_counts_values *count = perf_counts(evsel->counts, die, 0);
+
+ if (count->run && count->ena) {
+ if (evsel->prev_raw_counts && !out->force_header) {
+ struct perf_counts_values *prev_count =
+ perf_counts(evsel->prev_raw_counts, die, 0);
+
+ prev_count_val = prev_count->val;
+ prev_count->val = count->val;
+ }
+ iostat_value = (count->val - prev_count_val) /
+ ((double) count->run / count->ena);
+ }
+ out->print_metric(config, out->ctx, NULL, "%8.0f", iostat_metric,
+ iostat_value / (256 * 1024));
+}
+
+int iostat_list(struct evlist *evlist, struct perf_stat_config *config)
+{
+ struct evsel *evsel;
+ struct iio_root_port *rp = NULL;
+ bool is_list_mode = (iostat_mode == IOSTAT_LIST);
+
+ if (config->aggr_mode != AGGR_PCIE_PORT) {
+ pr_err("Unsupported event configuration\n");
+ return -1;
+ }
+
+ if (is_list_mode || verbose) {
+ evlist__for_each_entry(evlist, evsel) {
+ if (!evsel->priv) {
+ pr_err("Unsupported event configuration\n");
+ return -1;
+ }
+ if (rp != evsel->priv) {
+ rp = evsel->priv;
+ iio_root_port_show(config->output, rp);
+ }
+ }
+ }
+ /* Stop iostat in list-mode*/
+ config->iostat_run = !is_list_mode;
+ if (is_list_mode)
+ iostat_release(evlist);
+ return 0;
+}
+
+void iostat_release(struct evlist *evlist)
+{
+ struct evsel *evsel;
+ struct iio_root_port *rp = NULL;
+
+ evlist__for_each_entry(evlist, evsel) {
+ if (rp != evsel->priv) {
+ rp = evsel->priv;
+ free(evsel->priv);
+ }
+ }
+}
diff --git a/tools/perf/command-list.txt b/tools/perf/command-list.txt
index bc6c585f74fc..620db1e8e6a7 100644
--- a/tools/perf/command-list.txt
+++ b/tools/perf/command-list.txt
@@ -14,6 +14,7 @@ perf-config mainporcelain common
perf-evlist mainporcelain common
perf-ftrace mainporcelain common
perf-inject mainporcelain common
+perf-iostat mainporcelain common
perf-kallsyms mainporcelain common
perf-kmem mainporcelain common
perf-kvm mainporcelain common
diff --git a/tools/perf/perf-iostat.sh b/tools/perf/perf-iostat.sh
new file mode 100644
index 000000000000..e562f252d56f
--- /dev/null
+++ b/tools/perf/perf-iostat.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# perf iostat
+# Alexander Antonov <[email protected]>
+
+if [[ "$1" == "list" ]] || [[ "$1" =~ ([a-f0-9A-F]{1,}):([a-f0-9A-F]{1,2})(,)? ]]; then
+ DELIMITER="="
+else
+ DELIMITER=" "
+fi
+
+perf stat --iostat$DELIMITER$*
--
2.19.1

2021-02-03 14:05:54

by Alexander Antonov

[permalink] [raw]
Subject: [PATCH v4 5/5] perf: Update .gitignore file

After a "make -C tools/perf", git reports the following untracked file:
perf-iostat

Add this generated file to perf's .gitignore file.

Signed-off-by: Alexander Antonov <[email protected]>
---
tools/perf/.gitignore | 1 +
1 file changed, 1 insertion(+)

diff --git a/tools/perf/.gitignore b/tools/perf/.gitignore
index bf1252dc2cb0..421f27e2b9af 100644
--- a/tools/perf/.gitignore
+++ b/tools/perf/.gitignore
@@ -19,6 +19,7 @@ perf.data.old
output.svg
perf-archive
perf-with-kcore
+perf-iostat
tags
TAGS
cscope*
--
2.19.1

2021-02-03 14:06:11

by Alexander Antonov

[permalink] [raw]
Subject: [PATCH v4 2/5] perf stat: Basic support for iostat in perf

Add basic flow for a new iostat mode in perf. Mode is intended to
provide four I/O performance metrics per each PCIe root port: Inbound Read,
Inbound Write, Outbound Read, Outbound Write.

The actual code to compute the metrics and attribute it to
root port is in follow-on patches.

Signed-off-by: Alexander Antonov <[email protected]>
---
tools/perf/builtin-stat.c | 31 ++++++++++++++++++++++++++
tools/perf/util/iostat.h | 32 +++++++++++++++++++++++++++
tools/perf/util/stat-display.c | 40 +++++++++++++++++++++++++++++++++-
tools/perf/util/stat-shadow.c | 11 +++++++++-
tools/perf/util/stat.h | 1 +
5 files changed, 113 insertions(+), 2 deletions(-)
create mode 100644 tools/perf/util/iostat.h

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 60fdb6a0805f..66c913692120 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -65,6 +65,7 @@
#include "util/target.h"
#include "util/time-utils.h"
#include "util/top.h"
+#include "util/iostat.h"
#include "asm/bug.h"

#include <linux/time64.h>
@@ -186,6 +187,7 @@ static struct perf_stat_config stat_config = {
.metric_only_len = METRIC_ONLY_LEN,
.walltime_nsecs_stats = &walltime_nsecs_stats,
.big_num = true,
+ .iostat_run = false,
};

static inline void diff_timespec(struct timespec *r, struct timespec *a,
@@ -723,6 +725,14 @@ static int parse_metric_groups(const struct option *opt,
return metricgroup__parse_groups(opt, str, &stat_config.metric_events);
}

+__weak int iostat_parse(const struct option *opt __maybe_unused,
+ const char *str __maybe_unused,
+ int unset __maybe_unused)
+{
+ pr_err("iostat mode is not supported\n");
+ return -1;
+}
+
static struct option stat_options[] = {
OPT_BOOLEAN('T', "transaction", &transaction_run,
"hardware transaction statistics"),
@@ -803,6 +813,8 @@ static struct option stat_options[] = {
OPT_CALLBACK('M', "metrics", &evsel_list, "metric/metric group list",
"monitor specified metrics or metric groups (separated by ,)",
parse_metric_groups),
+ OPT_CALLBACK_OPTARG(0, "iostat", &evsel_list, &stat_config, "root port",
+ "measure PCIe metrics per root port", iostat_parse),
OPT_END()
};

@@ -1131,6 +1143,12 @@ __weak void arch_topdown_group_warn(void)
{
}

+__weak int iostat_list(struct evlist *evlist __maybe_unused,
+ struct perf_stat_config *config __maybe_unused)
+{
+ return 0;
+}
+
/*
* Add default attributes, if there were no attributes specified or
* if -d/--detailed, -d -d or -d -d -d is used:
@@ -1682,6 +1700,10 @@ static void setup_system_wide(int forks)
}
}

+__weak void iostat_release(struct evlist *evlist __maybe_unused)
+{
+}
+
int cmd_stat(int argc, const char **argv)
{
const char * const stat_usage[] = {
@@ -1858,6 +1880,12 @@ int cmd_stat(int argc, const char **argv)
goto out;
}

+ if (stat_config.iostat_run) {
+ status = iostat_list(evsel_list, &stat_config);
+ if (status || !stat_config.iostat_run)
+ goto out;
+ }
+
if (add_default_attributes())
goto out;

@@ -2008,6 +2036,9 @@ int cmd_stat(int argc, const char **argv)
perf_stat__exit_aggr_mode();
perf_evlist__free_stats(evsel_list);
out:
+ if (stat_config.iostat_run)
+ iostat_release(evsel_list);
+
zfree(&stat_config.walltime_run);

if (smi_cost && smi_reset)
diff --git a/tools/perf/util/iostat.h b/tools/perf/util/iostat.h
new file mode 100644
index 000000000000..b34ebedfd5e6
--- /dev/null
+++ b/tools/perf/util/iostat.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * perf iostat
+ *
+ * Copyright (C) 2020, Intel Corporation
+ *
+ * Authors: Alexander Antonov <[email protected]>
+ */
+
+#ifndef _IOSTAT_H
+#define _IOSTAT_H
+
+#include <subcmd/parse-options.h>
+#include "util/stat.h"
+#include "util/parse-events.h"
+#include "util/evlist.h"
+
+struct option;
+struct perf_stat_config;
+struct evlist;
+struct timespec;
+
+int iostat_parse(const struct option *opt, const char *str,
+ int unset __maybe_unused);
+void iostat_prefix(struct perf_stat_config *config, struct evlist *evlist,
+ char *prefix, struct timespec *ts);
+void iostat_print_metric(struct perf_stat_config *config, struct evsel *evsel,
+ struct perf_stat_output_ctx *out);
+int iostat_list(struct evlist *evlist, struct perf_stat_config *config);
+void iostat_release(struct evlist *evlist);
+
+#endif /* _IOSTAT_H */
diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index db1bec115d0b..de78cf6962b9 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -16,6 +16,7 @@
#include <linux/ctype.h>
#include "cgroup.h"
#include <api/fs/fs.h>
+#include "iostat.h"

#define CNTR_NOT_SUPPORTED "<not supported>"
#define CNTR_NOT_COUNTED "<not counted>"
@@ -302,6 +303,11 @@ static void print_metric_header(struct perf_stat_config *config,
struct outstate *os = ctx;
char tbuf[1024];

+ /* In case of iostat, print metric header for first root port only */
+ if (config->iostat_run &&
+ os->evsel->priv != os->evsel->evlist->selected->priv)
+ return;
+
if (!valid_only_metric(unit))
return;
unit = fixunit(tbuf, os->evsel, unit);
@@ -936,6 +942,8 @@ static void print_metric_headers(struct perf_stat_config *config,
fputs("time,", config->output);
fputs(aggr_header_csv[config->aggr_mode], config->output);
}
+ if (config->iostat_run && !config->interval && !config->csv_output)
+ fprintf(config->output, " port ");

/* Print metrics headers only */
evlist__for_each_entry(evlist, counter) {
@@ -954,6 +962,13 @@ static void print_metric_headers(struct perf_stat_config *config,
fputc('\n', config->output);
}

+__weak void iostat_prefix(struct perf_stat_config *config __maybe_unused,
+ struct evlist *evlist __maybe_unused,
+ char *prefix __maybe_unused,
+ struct timespec *ts __maybe_unused)
+{
+}
+
static void print_interval(struct perf_stat_config *config,
struct evlist *evlist,
char *prefix, struct timespec *ts)
@@ -966,7 +981,10 @@ static void print_interval(struct perf_stat_config *config,
if (config->interval_clear)
puts(CONSOLE_CLEAR);

- sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, config->csv_sep);
+ if (!config->iostat_run)
+ sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec,
+ ts->tv_nsec,
+ config->csv_sep);

if ((num_print_interval == 0 && !config->csv_output) || config->interval_clear) {
switch (config->aggr_mode) {
@@ -996,6 +1014,7 @@ static void print_interval(struct perf_stat_config *config,
fprintf(output, " counts %*s events\n", unit_width, "unit");
break;
case AGGR_PCIE_PORT:
+ fprintf(output, "# time port ");
break;
case AGGR_GLOBAL:
default:
@@ -1174,6 +1193,10 @@ perf_evlist__print_counters(struct evlist *evlist,
int interval = config->interval;
struct evsel *counter;
char buf[64], *prefix = NULL;
+ void *perf_device = NULL;
+
+ if (config->iostat_run)
+ evlist->selected = evlist__first(evlist);

if (interval)
print_interval(config, evlist, prefix = buf, ts);
@@ -1222,6 +1245,21 @@ perf_evlist__print_counters(struct evlist *evlist,
}
break;
case AGGR_PCIE_PORT:
+ counter = evlist__first(evlist);
+ perf_evlist__set_selected(evlist, counter);
+ iostat_prefix(config, evlist, prefix = buf, ts);
+ fprintf(config->output, "%s", prefix);
+ evlist__for_each_entry(evlist, counter) {
+ perf_device = evlist->selected->priv;
+ if (perf_device && perf_device != counter->priv) {
+ perf_evlist__set_selected(evlist, counter);
+ iostat_prefix(config, evlist, prefix, ts);
+ fprintf(config->output, "\n%s", prefix);
+ }
+ print_counter_aggr(config, counter, prefix);
+ if ((counter->idx + 1) == evlist->core.nr_entries)
+ fputc('\n', config->output);
+ }
break;
case AGGR_UNSET:
default:
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 2c41d47f6f83..083a450c6dc7 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -9,6 +9,7 @@
#include "expr.h"
#include "metricgroup.h"
#include <linux/zalloc.h>
+#include "iostat.h"

/*
* AGGR_GLOBAL: Use CPU 0
@@ -814,6 +815,12 @@ static void generic_metric(struct perf_stat_config *config,
zfree(&pctx.ids[i].name);
}

+__weak void iostat_print_metric(struct perf_stat_config *config __maybe_unused,
+ struct evsel *evsel __maybe_unused,
+ struct perf_stat_output_ctx *out __maybe_unused)
+{
+}
+
void perf_stat__print_shadow_stats(struct perf_stat_config *config,
struct evsel *evsel,
double avg, int cpu,
@@ -829,7 +836,9 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
struct metric_event *me;
int num = 1;

- if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
+ if (config->iostat_run) {
+ iostat_print_metric(config, evsel, out);
+ } else if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
total = runtime_stat_avg(st, STAT_CYCLES, ctx, cpu);

if (total) {
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index c7544c28c02a..c2a2b28effd6 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -107,6 +107,7 @@ struct perf_stat_config {
bool big_num;
bool no_merge;
bool walltime_run_table;
+ bool iostat_run;
FILE *output;
unsigned int interval;
unsigned int timeout;
--
2.19.1

2021-02-03 14:08:27

by Alexander Antonov

[permalink] [raw]
Subject: [PATCH v4 3/5] perf stat: Helper functions for PCIe root ports list in iostat mode

Introduce helper functions to control PCIe root ports list.
These helpers will be used in the follow-up patch.

Signed-off-by: Alexander Antonov <[email protected]>
---
tools/perf/arch/x86/util/iostat.c | 124 ++++++++++++++++++++++++++++++
1 file changed, 124 insertions(+)
create mode 100644 tools/perf/arch/x86/util/iostat.c

diff --git a/tools/perf/arch/x86/util/iostat.c b/tools/perf/arch/x86/util/iostat.c
new file mode 100644
index 000000000000..961e540106e6
--- /dev/null
+++ b/tools/perf/arch/x86/util/iostat.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * perf iostat
+ *
+ * Copyright (C) 2020, Intel Corporation
+ *
+ * Authors: Alexander Antonov <[email protected]>
+ */
+
+#include <api/fs/fs.h>
+#include <linux/kernel.h>
+#include <linux/err.h>
+#include <limits.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <regex.h>
+#include "util/cpumap.h"
+#include "util/debug.h"
+#include "util/iostat.h"
+#include "util/counts.h"
+#include "path.h"
+
+struct iio_root_port {
+ u32 domain;
+ u8 bus;
+ u8 die;
+ u8 pmu_idx;
+ int idx;
+};
+
+struct iio_root_ports_list {
+ struct iio_root_port **rps;
+ int nr_entries;
+};
+
+static void iio_root_port_show(FILE *output,
+ const struct iio_root_port * const rp)
+{
+ if (output && rp)
+ fprintf(output, "S%d-uncore_iio_%d<%04x:%02x>\n",
+ rp->die, rp->pmu_idx, rp->domain, rp->bus);
+}
+
+static struct iio_root_port *iio_root_port_new(u32 domain, u8 bus,
+ u8 die, u8 pmu_idx)
+{
+ struct iio_root_port *p = calloc(1, sizeof(*p));
+
+ if (p) {
+ p->domain = domain;
+ p->bus = bus;
+ p->die = die;
+ p->pmu_idx = pmu_idx;
+ }
+ return p;
+}
+
+static struct iio_root_ports_list *iio_root_ports_list_new(void)
+{
+ struct iio_root_ports_list *list = calloc(1, sizeof(*list));
+
+ if (list) {
+ list->rps = calloc(1, sizeof(struct iio_root_port *));
+ if (!list->rps) {
+ free(list);
+ list = NULL;
+ }
+ }
+ return list;
+}
+
+static void iio_root_ports_list_free(struct iio_root_ports_list *list)
+{
+ int idx;
+
+ if (list) {
+ for (idx = 0; idx < list->nr_entries; idx++)
+ free(list->rps[idx]);
+ free(list->rps);
+ free(list);
+ }
+}
+
+static struct iio_root_port *iio_root_port_find_by_notation(
+ const struct iio_root_ports_list * const list, u32 domain, u8 bus)
+{
+ int idx;
+ struct iio_root_port *rp;
+
+ if (list) {
+ for (idx = 0; idx < list->nr_entries; idx++) {
+ rp = list->rps[idx];
+ if (rp && rp->domain == domain && rp->bus == bus)
+ return rp;
+ }
+ }
+ return NULL;
+}
+
+static int iio_root_ports_list_insert(struct iio_root_ports_list *list,
+ struct iio_root_port * const rp)
+{
+ struct iio_root_port **tmp_buf;
+
+ if (list && rp) {
+ rp->idx = list->nr_entries++;
+ tmp_buf = realloc(list->rps,
+ list->nr_entries * sizeof(*list->rps));
+ if (!tmp_buf) {
+ pr_err("Failed to realloc memory\n");
+ return -ENOMEM;
+ }
+ tmp_buf[rp->idx] = rp;
+ list->rps = tmp_buf;
+ }
+ return 0;
+}
--
2.19.1

2021-02-03 20:58:42

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] perf stat: Introduce iostat mode to provide I/O performance metrics

Em Wed, Feb 03, 2021 at 04:58:25PM +0300, Alexander Antonov escreveu:
> The previous version can be found at:
> v3: https://lkml.kernel.org/r/[email protected]/
> Changes in this revision are:
> v3 -> v4:
> - Addressed comment from Namhyung Kim:
> 1. Removed NULL-termination of root ports list

Hi Namhyung,

Are you ok with this patchkit now?

Thanks,

- Arnaldo

> The previous version can be found at:
> v2: https://lkml.kernel.org/r/[email protected]
>
> Changes in this revision are:
> v2 -> v3:
> - Addressed comments from Namhyung Kim:
> 1. Removed perf_device pointer from evsel structure. Use priv field instead
> 2. Renamed 'iiostat' to 'iostat'
> 3. Renamed 'show' mode to 'list' mode
> 4. Renamed iiostat_delete_root_ports() to iiostat_release() and
> iostat_show_root_ports() to iostat_list()
>
> The previous version can be found at:
> v1: https://lkml.kernel.org/r/[email protected]
>
> Changes in this revision are:
> v1 -> v2:
> - Addressed comment from Arnaldo Carvalho de Melo:
> 1. Using 'perf iiostat' subcommand instead of 'perf stat --iiostat':
> - Added perf-iiostat.sh script to use short command
> - Updated manual pages to get help for 'perf iiostat'
> - Added 'perf-iiostat' to perf's gitignore file
>
> Mode is intended to provide four I/O performance metrics in MB per each
> root port:
> - Inbound Read: I/O devices below root port read from the host memory
> - Inbound Write: I/O devices below root port write to the host memory
> - Outbound Read: CPU reads from I/O devices below root port
> - Outbound Write: CPU writes to I/O devices below root port
>
> Each metric requiries only one uncore event which increments at every 4B
> transfer in corresponding direction. The formulas to compute metrics
> are generic:
> #EventCount * 4B / (1024 * 1024)
>
> Note: iostat introduces new perf data aggregation mode - per PCIe root port
> hence -e and -M options are not supported.
>
> Usage examples:
>
> 1. List all PCIe root ports (example for 2-S platform):
> $ perf iostat list
> S0-uncore_iio_0<0000:00>
> S1-uncore_iio_0<0000:80>
> S0-uncore_iio_1<0000:17>
> S1-uncore_iio_1<0000:85>
> S0-uncore_iio_2<0000:3a>
> S1-uncore_iio_2<0000:ae>
> S0-uncore_iio_3<0000:5d>
> S1-uncore_iio_3<0000:d7>
>
> 2. Collect metrics for all PCIe root ports:
> $ perf iostat -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
> 357708+0 records in
> 357707+0 records out
> 375083606016 bytes (375 GB, 349 GiB) copied, 215.974 s, 1.7 GB/s
>
> Performance counter stats for 'system wide':
>
> port Inbound Read(MB) Inbound Write(MB) Outbound Read(MB) Outbound Write(MB)
> 0000:00 1 0 2 3
> 0000:80 0 0 0 0
> 0000:17 352552 43 0 21
> 0000:85 0 0 0 0
> 0000:3a 3 0 0 0
> 0000:ae 0 0 0 0
> 0000:5d 0 0 0 0
> 0000:d7 0 0 0 0
>
> 3. Collect metrics for comma separated list of PCIe root ports:
> $ perf iostat 0000:17,0:3a -- dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct
> 357708+0 records in
> 357707+0 records out
> 375083606016 bytes (375 GB, 349 GiB) copied, 197.08 s, 1.9 GB/s
>
> Performance counter stats for 'system wide':
>
> port Inbound Read(MB) Inbound Write(MB) Outbound Read(MB) Outbound Write(MB)
> 0000:17 358559 44 0 22
> 0000:3a 3 2 0 0
>
> 197.081983474 seconds time elapsed
>
>
> Alexander Antonov (5):
> perf stat: Add AGGR_PCIE_PORT mode
> perf stat: Basic support for iostat in perf
> perf stat: Helper functions for PCIe root ports list in iostat mode
> perf stat: Enable iostat mode for x86 platforms
> perf: Update .gitignore file
>
> tools/perf/.gitignore | 1 +
> tools/perf/Documentation/perf-iostat.txt | 88 ++++
> tools/perf/Makefile.perf | 5 +-
> tools/perf/arch/x86/util/Build | 1 +
> tools/perf/arch/x86/util/iostat.c | 469 ++++++++++++++++++
> tools/perf/builtin-stat.c | 36 +-
> tools/perf/command-list.txt | 1 +
> tools/perf/perf-iostat.sh | 12 +
> tools/perf/util/iostat.h | 32 ++
> .../scripting-engines/trace-event-python.c | 3 +-
> tools/perf/util/stat-display.c | 53 +-
> tools/perf/util/stat-shadow.c | 11 +-
> tools/perf/util/stat.c | 4 +-
> tools/perf/util/stat.h | 2 +
> 14 files changed, 710 insertions(+), 8 deletions(-)
> create mode 100644 tools/perf/Documentation/perf-iostat.txt
> create mode 100644 tools/perf/arch/x86/util/iostat.c
> create mode 100644 tools/perf/perf-iostat.sh
> create mode 100644 tools/perf/util/iostat.h
>
>
> base-commit: b145b0eb2031a620ca010174240963e4d2c6ce26
> --
> 2.19.1
>

--

- Arnaldo

2021-02-04 12:27:29

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] perf stat: Basic support for iostat in perf

On Wed, Feb 3, 2021 at 10:58 PM Alexander Antonov
<[email protected]> wrote:
>
> Add basic flow for a new iostat mode in perf. Mode is intended to
> provide four I/O performance metrics per each PCIe root port: Inbound Read,
> Inbound Write, Outbound Read, Outbound Write.
>
> The actual code to compute the metrics and attribute it to
> root port is in follow-on patches.
>
> Signed-off-by: Alexander Antonov <[email protected]>
> ---
> tools/perf/builtin-stat.c | 31 ++++++++++++++++++++++++++
> tools/perf/util/iostat.h | 32 +++++++++++++++++++++++++++
> tools/perf/util/stat-display.c | 40 +++++++++++++++++++++++++++++++++-
> tools/perf/util/stat-shadow.c | 11 +++++++++-
> tools/perf/util/stat.h | 1 +
> 5 files changed, 113 insertions(+), 2 deletions(-)
> create mode 100644 tools/perf/util/iostat.h
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 60fdb6a0805f..66c913692120 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -65,6 +65,7 @@
> #include "util/target.h"
> #include "util/time-utils.h"
> #include "util/top.h"
> +#include "util/iostat.h"
> #include "asm/bug.h"
>
> #include <linux/time64.h>
> @@ -186,6 +187,7 @@ static struct perf_stat_config stat_config = {
> .metric_only_len = METRIC_ONLY_LEN,
> .walltime_nsecs_stats = &walltime_nsecs_stats,
> .big_num = true,
> + .iostat_run = false,
> };
>
> static inline void diff_timespec(struct timespec *r, struct timespec *a,
> @@ -723,6 +725,14 @@ static int parse_metric_groups(const struct option *opt,
> return metricgroup__parse_groups(opt, str, &stat_config.metric_events);
> }
>
> +__weak int iostat_parse(const struct option *opt __maybe_unused,
> + const char *str __maybe_unused,
> + int unset __maybe_unused)
> +{
> + pr_err("iostat mode is not supported\n");
> + return -1;
> +}
> +
> static struct option stat_options[] = {
> OPT_BOOLEAN('T', "transaction", &transaction_run,
> "hardware transaction statistics"),
> @@ -803,6 +813,8 @@ static struct option stat_options[] = {
> OPT_CALLBACK('M', "metrics", &evsel_list, "metric/metric group list",
> "monitor specified metrics or metric groups (separated by ,)",
> parse_metric_groups),
> + OPT_CALLBACK_OPTARG(0, "iostat", &evsel_list, &stat_config, "root port",
> + "measure PCIe metrics per root port", iostat_parse),

Can we make the help string and default argument more generic?
Something like "measure IO metrics provided by arch/platform"
and the default value being "default". :)


> OPT_END()
> };
>
> @@ -1131,6 +1143,12 @@ __weak void arch_topdown_group_warn(void)
> {
> }
>
> +__weak int iostat_list(struct evlist *evlist __maybe_unused,
> + struct perf_stat_config *config __maybe_unused)
> +{
> + return 0;
> +}
> +
> /*
> * Add default attributes, if there were no attributes specified or
> * if -d/--detailed, -d -d or -d -d -d is used:
> @@ -1682,6 +1700,10 @@ static void setup_system_wide(int forks)
> }
> }
>
> +__weak void iostat_release(struct evlist *evlist __maybe_unused)
> +{
> +}
> +
> int cmd_stat(int argc, const char **argv)
> {
> const char * const stat_usage[] = {
> @@ -1858,6 +1880,12 @@ int cmd_stat(int argc, const char **argv)
> goto out;
> }
>
> + if (stat_config.iostat_run) {
> + status = iostat_list(evsel_list, &stat_config);

I think it's unnatural to call iostat_list() unconditionally here.
How about this?

status = iostat_prepare(...);
if (status < 0)
goto out;

if (status == IOSTAT_LIST)
iostat_list(...);
else
...


> + if (status || !stat_config.iostat_run)
> + goto out;
> + }
> +
> if (add_default_attributes())
> goto out;
>
> @@ -2008,6 +2036,9 @@ int cmd_stat(int argc, const char **argv)
> perf_stat__exit_aggr_mode();
> perf_evlist__free_stats(evsel_list);
> out:
> + if (stat_config.iostat_run)
> + iostat_release(evsel_list);
> +
> zfree(&stat_config.walltime_run);
>
> if (smi_cost && smi_reset)
> diff --git a/tools/perf/util/iostat.h b/tools/perf/util/iostat.h
> new file mode 100644
> index 000000000000..b34ebedfd5e6
> --- /dev/null
> +++ b/tools/perf/util/iostat.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * perf iostat
> + *
> + * Copyright (C) 2020, Intel Corporation
> + *
> + * Authors: Alexander Antonov <[email protected]>
> + */
> +
> +#ifndef _IOSTAT_H
> +#define _IOSTAT_H
> +
> +#include <subcmd/parse-options.h>
> +#include "util/stat.h"
> +#include "util/parse-events.h"
> +#include "util/evlist.h"
> +
> +struct option;
> +struct perf_stat_config;
> +struct evlist;
> +struct timespec;
> +
> +int iostat_parse(const struct option *opt, const char *str,
> + int unset __maybe_unused);
> +void iostat_prefix(struct perf_stat_config *config, struct evlist *evlist,
> + char *prefix, struct timespec *ts);
> +void iostat_print_metric(struct perf_stat_config *config, struct evsel *evsel,
> + struct perf_stat_output_ctx *out);
> +int iostat_list(struct evlist *evlist, struct perf_stat_config *config);
> +void iostat_release(struct evlist *evlist);
> +
> +#endif /* _IOSTAT_H */
> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> index db1bec115d0b..de78cf6962b9 100644
> --- a/tools/perf/util/stat-display.c
> +++ b/tools/perf/util/stat-display.c
> @@ -16,6 +16,7 @@
> #include <linux/ctype.h>
> #include "cgroup.h"
> #include <api/fs/fs.h>
> +#include "iostat.h"
>
> #define CNTR_NOT_SUPPORTED "<not supported>"
> #define CNTR_NOT_COUNTED "<not counted>"
> @@ -302,6 +303,11 @@ static void print_metric_header(struct perf_stat_config *config,
> struct outstate *os = ctx;
> char tbuf[1024];
>
> + /* In case of iostat, print metric header for first root port only */
> + if (config->iostat_run &&
> + os->evsel->priv != os->evsel->evlist->selected->priv)
> + return;
> +
> if (!valid_only_metric(unit))
> return;
> unit = fixunit(tbuf, os->evsel, unit);
> @@ -936,6 +942,8 @@ static void print_metric_headers(struct perf_stat_config *config,
> fputs("time,", config->output);
> fputs(aggr_header_csv[config->aggr_mode], config->output);
> }
> + if (config->iostat_run && !config->interval && !config->csv_output)
> + fprintf(config->output, " port ");

It's too specific to the current implementation.
Let's make it a callback or a weak function.


>
> /* Print metrics headers only */
> evlist__for_each_entry(evlist, counter) {
> @@ -954,6 +962,13 @@ static void print_metric_headers(struct perf_stat_config *config,
> fputc('\n', config->output);
> }
>
> +__weak void iostat_prefix(struct perf_stat_config *config __maybe_unused,
> + struct evlist *evlist __maybe_unused,
> + char *prefix __maybe_unused,
> + struct timespec *ts __maybe_unused)
> +{
> +}
> +
> static void print_interval(struct perf_stat_config *config,
> struct evlist *evlist,
> char *prefix, struct timespec *ts)
> @@ -966,7 +981,10 @@ static void print_interval(struct perf_stat_config *config,
> if (config->interval_clear)
> puts(CONSOLE_CLEAR);
>
> - sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, config->csv_sep);
> + if (!config->iostat_run)
> + sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec,
> + ts->tv_nsec,
> + config->csv_sep);
>
> if ((num_print_interval == 0 && !config->csv_output) || config->interval_clear) {
> switch (config->aggr_mode) {
> @@ -996,6 +1014,7 @@ static void print_interval(struct perf_stat_config *config,
> fprintf(output, " counts %*s events\n", unit_width, "unit");
> break;
> case AGGR_PCIE_PORT:
> + fprintf(output, "# time port ");

Ditto.

> break;
> case AGGR_GLOBAL:
> default:
> @@ -1174,6 +1193,10 @@ perf_evlist__print_counters(struct evlist *evlist,
> int interval = config->interval;
> struct evsel *counter;
> char buf[64], *prefix = NULL;
> + void *perf_device = NULL;
> +
> + if (config->iostat_run)
> + evlist->selected = evlist__first(evlist);
>
> if (interval)
> print_interval(config, evlist, prefix = buf, ts);
> @@ -1222,6 +1245,21 @@ perf_evlist__print_counters(struct evlist *evlist,
> }
> break;
> case AGGR_PCIE_PORT:

Ditto. Something like iostat_print_counters().

> + counter = evlist__first(evlist);
> + perf_evlist__set_selected(evlist, counter);
> + iostat_prefix(config, evlist, prefix = buf, ts);
> + fprintf(config->output, "%s", prefix);
> + evlist__for_each_entry(evlist, counter) {
> + perf_device = evlist->selected->priv;
> + if (perf_device && perf_device != counter->priv) {
> + perf_evlist__set_selected(evlist, counter);
> + iostat_prefix(config, evlist, prefix, ts);
> + fprintf(config->output, "\n%s", prefix);
> + }
> + print_counter_aggr(config, counter, prefix);

I'm not sure but do you assume each counter has different priv?
I don't know if the output is correct (like call iostat_prefix() once
and call print_counter_aggr() twice) when they have same one.

> + if ((counter->idx + 1) == evlist->core.nr_entries)
> + fputc('\n', config->output);

Can we just move this out of the loop?

Thanks,
Namhyung


> + }
> break;
> case AGGR_UNSET:
> default:
> diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
> index 2c41d47f6f83..083a450c6dc7 100644
> --- a/tools/perf/util/stat-shadow.c
> +++ b/tools/perf/util/stat-shadow.c
> @@ -9,6 +9,7 @@
> #include "expr.h"
> #include "metricgroup.h"
> #include <linux/zalloc.h>
> +#include "iostat.h"
>
> /*
> * AGGR_GLOBAL: Use CPU 0
> @@ -814,6 +815,12 @@ static void generic_metric(struct perf_stat_config *config,
> zfree(&pctx.ids[i].name);
> }
>
> +__weak void iostat_print_metric(struct perf_stat_config *config __maybe_unused,
> + struct evsel *evsel __maybe_unused,
> + struct perf_stat_output_ctx *out __maybe_unused)
> +{
> +}
> +
> void perf_stat__print_shadow_stats(struct perf_stat_config *config,
> struct evsel *evsel,
> double avg, int cpu,
> @@ -829,7 +836,9 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
> struct metric_event *me;
> int num = 1;
>
> - if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
> + if (config->iostat_run) {
> + iostat_print_metric(config, evsel, out);
> + } else if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
> total = runtime_stat_avg(st, STAT_CYCLES, ctx, cpu);
>
> if (total) {
> diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
> index c7544c28c02a..c2a2b28effd6 100644
> --- a/tools/perf/util/stat.h
> +++ b/tools/perf/util/stat.h
> @@ -107,6 +107,7 @@ struct perf_stat_config {
> bool big_num;
> bool no_merge;
> bool walltime_run_table;
> + bool iostat_run;
> FILE *output;
> unsigned int interval;
> unsigned int timeout;
> --
> 2.19.1
>

2021-02-05 00:06:46

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v4 3/5] perf stat: Helper functions for PCIe root ports list in iostat mode

On Wed, Feb 3, 2021 at 10:58 PM Alexander Antonov
<[email protected]> wrote:
>
> Introduce helper functions to control PCIe root ports list.
> These helpers will be used in the follow-up patch.
>
> Signed-off-by: Alexander Antonov <[email protected]>
> ---
> tools/perf/arch/x86/util/iostat.c | 124 ++++++++++++++++++++++++++++++
> 1 file changed, 124 insertions(+)
> create mode 100644 tools/perf/arch/x86/util/iostat.c
>
> diff --git a/tools/perf/arch/x86/util/iostat.c b/tools/perf/arch/x86/util/iostat.c
> new file mode 100644
> index 000000000000..961e540106e6
> --- /dev/null
> +++ b/tools/perf/arch/x86/util/iostat.c
> @@ -0,0 +1,124 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * perf iostat
> + *
> + * Copyright (C) 2020, Intel Corporation
> + *
> + * Authors: Alexander Antonov <[email protected]>
> + */
> +
> +#include <api/fs/fs.h>
> +#include <linux/kernel.h>
> +#include <linux/err.h>
> +#include <limits.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <dirent.h>
> +#include <unistd.h>
> +#include <stdlib.h>
> +#include <regex.h>
> +#include "util/cpumap.h"
> +#include "util/debug.h"
> +#include "util/iostat.h"
> +#include "util/counts.h"
> +#include "path.h"
> +
> +struct iio_root_port {
> + u32 domain;
> + u8 bus;
> + u8 die;
> + u8 pmu_idx;
> + int idx;
> +};
> +
> +struct iio_root_ports_list {
> + struct iio_root_port **rps;
> + int nr_entries;
> +};
> +
> +static void iio_root_port_show(FILE *output,
> + const struct iio_root_port * const rp)
> +{
> + if (output && rp)
> + fprintf(output, "S%d-uncore_iio_%d<%04x:%02x>\n",
> + rp->die, rp->pmu_idx, rp->domain, rp->bus);
> +}
> +
> +static struct iio_root_port *iio_root_port_new(u32 domain, u8 bus,
> + u8 die, u8 pmu_idx)
> +{
> + struct iio_root_port *p = calloc(1, sizeof(*p));
> +
> + if (p) {
> + p->domain = domain;
> + p->bus = bus;
> + p->die = die;
> + p->pmu_idx = pmu_idx;
> + }
> + return p;
> +}
> +
> +static struct iio_root_ports_list *iio_root_ports_list_new(void)
> +{
> + struct iio_root_ports_list *list = calloc(1, sizeof(*list));
> +
> + if (list) {
> + list->rps = calloc(1, sizeof(struct iio_root_port *));

This seems unnecessary now.

Thanks,
Namhyung


> + if (!list->rps) {
> + free(list);
> + list = NULL;
> + }
> + }
> + return list;
> +}
> +
> +static void iio_root_ports_list_free(struct iio_root_ports_list *list)
> +{
> + int idx;
> +
> + if (list) {
> + for (idx = 0; idx < list->nr_entries; idx++)
> + free(list->rps[idx]);
> + free(list->rps);
> + free(list);
> + }
> +}
> +
> +static struct iio_root_port *iio_root_port_find_by_notation(
> + const struct iio_root_ports_list * const list, u32 domain, u8 bus)
> +{
> + int idx;
> + struct iio_root_port *rp;
> +
> + if (list) {
> + for (idx = 0; idx < list->nr_entries; idx++) {
> + rp = list->rps[idx];
> + if (rp && rp->domain == domain && rp->bus == bus)
> + return rp;
> + }
> + }
> + return NULL;
> +}
> +
> +static int iio_root_ports_list_insert(struct iio_root_ports_list *list,
> + struct iio_root_port * const rp)
> +{
> + struct iio_root_port **tmp_buf;
> +
> + if (list && rp) {
> + rp->idx = list->nr_entries++;
> + tmp_buf = realloc(list->rps,
> + list->nr_entries * sizeof(*list->rps));
> + if (!tmp_buf) {
> + pr_err("Failed to realloc memory\n");
> + return -ENOMEM;
> + }
> + tmp_buf[rp->idx] = rp;
> + list->rps = tmp_buf;
> + }
> + return 0;
> +}
> --
> 2.19.1
>

2021-02-05 00:09:00

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] perf stat: Introduce iostat mode to provide I/O performance metrics

Hi Arnaldo,

On Thu, Feb 4, 2021 at 5:55 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> Em Wed, Feb 03, 2021 at 04:58:25PM +0300, Alexander Antonov escreveu:
> > The previous version can be found at:
> > v3: https://lkml.kernel.org/r/[email protected]/
> > Changes in this revision are:
> > v3 -> v4:
> > - Addressed comment from Namhyung Kim:
> > 1. Removed NULL-termination of root ports list
>
> Hi Namhyung,
>
> Are you ok with this patchkit now?

I've left some more comments.

Thanks,
Namhyung

2021-02-08 12:19:16

by Alexander Antonov

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] perf stat: Basic support for iostat in perf


On 2/4/2021 3:22 PM, Namhyung Kim wrote:
> On Wed, Feb 3, 2021 at 10:58 PM Alexander Antonov
> <[email protected]> wrote:
>> Add basic flow for a new iostat mode in perf. Mode is intended to
>> provide four I/O performance metrics per each PCIe root port: Inbound Read,
>> Inbound Write, Outbound Read, Outbound Write.
>>
>> The actual code to compute the metrics and attribute it to
>> root port is in follow-on patches.
>>
>> Signed-off-by: Alexander Antonov <[email protected]>
>> ---
>> tools/perf/builtin-stat.c | 31 ++++++++++++++++++++++++++
>> tools/perf/util/iostat.h | 32 +++++++++++++++++++++++++++
>> tools/perf/util/stat-display.c | 40 +++++++++++++++++++++++++++++++++-
>> tools/perf/util/stat-shadow.c | 11 +++++++++-
>> tools/perf/util/stat.h | 1 +
>> 5 files changed, 113 insertions(+), 2 deletions(-)
>> create mode 100644 tools/perf/util/iostat.h
>>
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 60fdb6a0805f..66c913692120 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -65,6 +65,7 @@
>> #include "util/target.h"
>> #include "util/time-utils.h"
>> #include "util/top.h"
>> +#include "util/iostat.h"
>> #include "asm/bug.h"
>>
>> #include <linux/time64.h>
>> @@ -186,6 +187,7 @@ static struct perf_stat_config stat_config = {
>> .metric_only_len = METRIC_ONLY_LEN,
>> .walltime_nsecs_stats = &walltime_nsecs_stats,
>> .big_num = true,
>> + .iostat_run = false,
>> };
>>
>> static inline void diff_timespec(struct timespec *r, struct timespec *a,
>> @@ -723,6 +725,14 @@ static int parse_metric_groups(const struct option *opt,
>> return metricgroup__parse_groups(opt, str, &stat_config.metric_events);
>> }
>>
>> +__weak int iostat_parse(const struct option *opt __maybe_unused,
>> + const char *str __maybe_unused,
>> + int unset __maybe_unused)
>> +{
>> + pr_err("iostat mode is not supported\n");
>> + return -1;
>> +}
>> +
>> static struct option stat_options[] = {
>> OPT_BOOLEAN('T', "transaction", &transaction_run,
>> "hardware transaction statistics"),
>> @@ -803,6 +813,8 @@ static struct option stat_options[] = {
>> OPT_CALLBACK('M', "metrics", &evsel_list, "metric/metric group list",
>> "monitor specified metrics or metric groups (separated by ,)",
>> parse_metric_groups),
>> + OPT_CALLBACK_OPTARG(0, "iostat", &evsel_list, &stat_config, "root port",
>> + "measure PCIe metrics per root port", iostat_parse),
> Can we make the help string and default argument more generic?
> Something like "measure IO metrics provided by arch/platform"
> and the default value being "default". :)
>
Do you mean using "default" instead of "root port"?
What about the faceless "I/O unit"? :)
>> OPT_END()
>> };
>>
>> @@ -1131,6 +1143,12 @@ __weak void arch_topdown_group_warn(void)
>> {
>> }
>>
>> +__weak int iostat_list(struct evlist *evlist __maybe_unused,
>> + struct perf_stat_config *config __maybe_unused)
>> +{
>> + return 0;
>> +}
>> +
>> /*
>> * Add default attributes, if there were no attributes specified or
>> * if -d/--detailed, -d -d or -d -d -d is used:
>> @@ -1682,6 +1700,10 @@ static void setup_system_wide(int forks)
>> }
>> }
>>
>> +__weak void iostat_release(struct evlist *evlist __maybe_unused)
>> +{
>> +}
>> +
>> int cmd_stat(int argc, const char **argv)
>> {
>> const char * const stat_usage[] = {
>> @@ -1858,6 +1880,12 @@ int cmd_stat(int argc, const char **argv)
>> goto out;
>> }
>>
>> + if (stat_config.iostat_run) {
>> + status = iostat_list(evsel_list, &stat_config);
> I think it's unnatural to call iostat_list() unconditionally here.
> How about this?
>
> status = iostat_prepare(...);
> if (status < 0)
> goto out;
>
> if (status == IOSTAT_LIST)
> iostat_list(...);
> else
> ...
I think it's applicable.
In case of 'list' option we will just print list of root ports and exit.
Also listing of root ports is available in verbose mode. In this case we
will
print list and start the collection.
>
>> + if (status || !stat_config.iostat_run)
>> + goto out;
>> + }
>> +
>> if (add_default_attributes())
>> goto out;
>>
>> @@ -2008,6 +2036,9 @@ int cmd_stat(int argc, const char **argv)
>> perf_stat__exit_aggr_mode();
>> perf_evlist__free_stats(evsel_list);
>> out:
>> + if (stat_config.iostat_run)
>> + iostat_release(evsel_list);
>> +
>> zfree(&stat_config.walltime_run);
>>
>> if (smi_cost && smi_reset)
>> diff --git a/tools/perf/util/iostat.h b/tools/perf/util/iostat.h
>> new file mode 100644
>> index 000000000000..b34ebedfd5e6
>> --- /dev/null
>> +++ b/tools/perf/util/iostat.h
>> @@ -0,0 +1,32 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * perf iostat
>> + *
>> + * Copyright (C) 2020, Intel Corporation
>> + *
>> + * Authors: Alexander Antonov <[email protected]>
>> + */
>> +
>> +#ifndef _IOSTAT_H
>> +#define _IOSTAT_H
>> +
>> +#include <subcmd/parse-options.h>
>> +#include "util/stat.h"
>> +#include "util/parse-events.h"
>> +#include "util/evlist.h"
>> +
>> +struct option;
>> +struct perf_stat_config;
>> +struct evlist;
>> +struct timespec;
>> +
>> +int iostat_parse(const struct option *opt, const char *str,
>> + int unset __maybe_unused);
>> +void iostat_prefix(struct perf_stat_config *config, struct evlist *evlist,
>> + char *prefix, struct timespec *ts);
>> +void iostat_print_metric(struct perf_stat_config *config, struct evsel *evsel,
>> + struct perf_stat_output_ctx *out);
>> +int iostat_list(struct evlist *evlist, struct perf_stat_config *config);
>> +void iostat_release(struct evlist *evlist);
>> +
>> +#endif /* _IOSTAT_H */
>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
>> index db1bec115d0b..de78cf6962b9 100644
>> --- a/tools/perf/util/stat-display.c
>> +++ b/tools/perf/util/stat-display.c
>> @@ -16,6 +16,7 @@
>> #include <linux/ctype.h>
>> #include "cgroup.h"
>> #include <api/fs/fs.h>
>> +#include "iostat.h"
>>
>> #define CNTR_NOT_SUPPORTED "<not supported>"
>> #define CNTR_NOT_COUNTED "<not counted>"
>> @@ -302,6 +303,11 @@ static void print_metric_header(struct perf_stat_config *config,
>> struct outstate *os = ctx;
>> char tbuf[1024];
>>
>> + /* In case of iostat, print metric header for first root port only */
>> + if (config->iostat_run &&
>> + os->evsel->priv != os->evsel->evlist->selected->priv)
>> + return;
>> +
>> if (!valid_only_metric(unit))
>> return;
>> unit = fixunit(tbuf, os->evsel, unit);
>> @@ -936,6 +942,8 @@ static void print_metric_headers(struct perf_stat_config *config,
>> fputs("time,", config->output);
>> fputs(aggr_header_csv[config->aggr_mode], config->output);
>> }
>> + if (config->iostat_run && !config->interval && !config->csv_output)
>> + fprintf(config->output, " port ");
> It's too specific to the current implementation.
> Let's make it a callback or a weak function.
Okay,
This and other similar blocks will be updated.
>
>> /* Print metrics headers only */
>> evlist__for_each_entry(evlist, counter) {
>> @@ -954,6 +962,13 @@ static void print_metric_headers(struct perf_stat_config *config,
>> fputc('\n', config->output);
>> }
>>
>> +__weak void iostat_prefix(struct perf_stat_config *config __maybe_unused,
>> + struct evlist *evlist __maybe_unused,
>> + char *prefix __maybe_unused,
>> + struct timespec *ts __maybe_unused)
>> +{
>> +}
>> +
>> static void print_interval(struct perf_stat_config *config,
>> struct evlist *evlist,
>> char *prefix, struct timespec *ts)
>> @@ -966,7 +981,10 @@ static void print_interval(struct perf_stat_config *config,
>> if (config->interval_clear)
>> puts(CONSOLE_CLEAR);
>>
>> - sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, config->csv_sep);
>> + if (!config->iostat_run)
>> + sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec,
>> + ts->tv_nsec,
>> + config->csv_sep);
>>
>> if ((num_print_interval == 0 && !config->csv_output) || config->interval_clear) {
>> switch (config->aggr_mode) {
>> @@ -996,6 +1014,7 @@ static void print_interval(struct perf_stat_config *config,
>> fprintf(output, " counts %*s events\n", unit_width, "unit");
>> break;
>> case AGGR_PCIE_PORT:
>> + fprintf(output, "# time port ");
> Ditto.
>
>> break;
>> case AGGR_GLOBAL:
>> default:
>> @@ -1174,6 +1193,10 @@ perf_evlist__print_counters(struct evlist *evlist,
>> int interval = config->interval;
>> struct evsel *counter;
>> char buf[64], *prefix = NULL;
>> + void *perf_device = NULL;
>> +
>> + if (config->iostat_run)
>> + evlist->selected = evlist__first(evlist);
>>
>> if (interval)
>> print_interval(config, evlist, prefix = buf, ts);
>> @@ -1222,6 +1245,21 @@ perf_evlist__print_counters(struct evlist *evlist,
>> }
>> break;
>> case AGGR_PCIE_PORT:
> Ditto. Something like iostat_print_counters().
>
>> + counter = evlist__first(evlist);
>> + perf_evlist__set_selected(evlist, counter);
>> + iostat_prefix(config, evlist, prefix = buf, ts);
>> + fprintf(config->output, "%s", prefix);
>> + evlist__for_each_entry(evlist, counter) {
>> + perf_device = evlist->selected->priv;
>> + if (perf_device && perf_device != counter->priv) {
>> + perf_evlist__set_selected(evlist, counter);
>> + iostat_prefix(config, evlist, prefix, ts);
>> + fprintf(config->output, "\n%s", prefix);
>> + }
>> + print_counter_aggr(config, counter, prefix);
> I'm not sure but do you assume each counter has different priv?
> I don't know if the output is correct (like call iostat_prefix() once
> and call print_counter_aggr() twice) when they have same one.
There are 4 counters which are related to single 'priv' field (it's used
for root
port object in iostat mode). This means if platform has, for example, 5 root
ports we will have 20 counters in summary. And print_counter_aggr() will be
called for each counter.
I call iostat_prefix() before the loop to print first root port and then
iostat_prefix() will be called when next counter is related to other
root port.
>
>> + if ((counter->idx + 1) == evlist->core.nr_entries)
>> + fputc('\n', config->output);
> Can we just move this out of the loop?
>
Yes, you are right, we can. I will update it.

Thank you,
Alexander
>
>
>> + }
>> break;
>> case AGGR_UNSET:
>> default:
>> diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
>> index 2c41d47f6f83..083a450c6dc7 100644
>> --- a/tools/perf/util/stat-shadow.c
>> +++ b/tools/perf/util/stat-shadow.c
>> @@ -9,6 +9,7 @@
>> #include "expr.h"
>> #include "metricgroup.h"
>> #include <linux/zalloc.h>
>> +#include "iostat.h"
>>
>> /*
>> * AGGR_GLOBAL: Use CPU 0
>> @@ -814,6 +815,12 @@ static void generic_metric(struct perf_stat_config *config,
>> zfree(&pctx.ids[i].name);
>> }
>>
>> +__weak void iostat_print_metric(struct perf_stat_config *config __maybe_unused,
>> + struct evsel *evsel __maybe_unused,
>> + struct perf_stat_output_ctx *out __maybe_unused)
>> +{
>> +}
>> +
>> void perf_stat__print_shadow_stats(struct perf_stat_config *config,
>> struct evsel *evsel,
>> double avg, int cpu,
>> @@ -829,7 +836,9 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
>> struct metric_event *me;
>> int num = 1;
>>
>> - if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
>> + if (config->iostat_run) {
>> + iostat_print_metric(config, evsel, out);
>> + } else if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
>> total = runtime_stat_avg(st, STAT_CYCLES, ctx, cpu);
>>
>> if (total) {
>> diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
>> index c7544c28c02a..c2a2b28effd6 100644
>> --- a/tools/perf/util/stat.h
>> +++ b/tools/perf/util/stat.h
>> @@ -107,6 +107,7 @@ struct perf_stat_config {
>> bool big_num;
>> bool no_merge;
>> bool walltime_run_table;
>> + bool iostat_run;
>> FILE *output;
>> unsigned int interval;
>> unsigned int timeout;
>> --
>> 2.19.1
>>

2021-02-08 12:20:11

by Alexander Antonov

[permalink] [raw]
Subject: Re: [PATCH v4 3/5] perf stat: Helper functions for PCIe root ports list in iostat mode


On 2/4/2021 3:32 PM, Namhyung Kim wrote:
> On Wed, Feb 3, 2021 at 10:58 PM Alexander Antonov
> <[email protected]> wrote:
>> Introduce helper functions to control PCIe root ports list.
>> These helpers will be used in the follow-up patch.
>>
>> Signed-off-by: Alexander Antonov <[email protected]>
>> ---
>> tools/perf/arch/x86/util/iostat.c | 124 ++++++++++++++++++++++++++++++
>> 1 file changed, 124 insertions(+)
>> create mode 100644 tools/perf/arch/x86/util/iostat.c
>>
>> diff --git a/tools/perf/arch/x86/util/iostat.c b/tools/perf/arch/x86/util/iostat.c
>> new file mode 100644
>> index 000000000000..961e540106e6
>> --- /dev/null
>> +++ b/tools/perf/arch/x86/util/iostat.c
>> @@ -0,0 +1,124 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * perf iostat
>> + *
>> + * Copyright (C) 2020, Intel Corporation
>> + *
>> + * Authors: Alexander Antonov <[email protected]>
>> + */
>> +
>> +#include <api/fs/fs.h>
>> +#include <linux/kernel.h>
>> +#include <linux/err.h>
>> +#include <limits.h>
>> +#include <stdio.h>
>> +#include <string.h>
>> +#include <errno.h>
>> +#include <sys/types.h>
>> +#include <sys/stat.h>
>> +#include <fcntl.h>
>> +#include <dirent.h>
>> +#include <unistd.h>
>> +#include <stdlib.h>
>> +#include <regex.h>
>> +#include "util/cpumap.h"
>> +#include "util/debug.h"
>> +#include "util/iostat.h"
>> +#include "util/counts.h"
>> +#include "path.h"
>> +
>> +struct iio_root_port {
>> + u32 domain;
>> + u8 bus;
>> + u8 die;
>> + u8 pmu_idx;
>> + int idx;
>> +};
>> +
>> +struct iio_root_ports_list {
>> + struct iio_root_port **rps;
>> + int nr_entries;
>> +};
>> +
>> +static void iio_root_port_show(FILE *output,
>> + const struct iio_root_port * const rp)
>> +{
>> + if (output && rp)
>> + fprintf(output, "S%d-uncore_iio_%d<%04x:%02x>\n",
>> + rp->die, rp->pmu_idx, rp->domain, rp->bus);
>> +}
>> +
>> +static struct iio_root_port *iio_root_port_new(u32 domain, u8 bus,
>> + u8 die, u8 pmu_idx)
>> +{
>> + struct iio_root_port *p = calloc(1, sizeof(*p));
>> +
>> + if (p) {
>> + p->domain = domain;
>> + p->bus = bus;
>> + p->die = die;
>> + p->pmu_idx = pmu_idx;
>> + }
>> + return p;
>> +}
>> +
>> +static struct iio_root_ports_list *iio_root_ports_list_new(void)
>> +{
>> + struct iio_root_ports_list *list = calloc(1, sizeof(*list));
>> +
>> + if (list) {
>> + list->rps = calloc(1, sizeof(struct iio_root_port *));
> This seems unnecessary now.
>
> Thanks,
> Namhyung
>

Yes, you are right. Will be fixed.

Thank you,
Alexander
>> + if (!list->rps) {
>> + free(list);
>> + list = NULL;
>> + }
>> + }
>> + return list;
>> +}
>> +
>> +static void iio_root_ports_list_free(struct iio_root_ports_list *list)
>> +{
>> + int idx;
>> +
>> + if (list) {
>> + for (idx = 0; idx < list->nr_entries; idx++)
>> + free(list->rps[idx]);
>> + free(list->rps);
>> + free(list);
>> + }
>> +}
>> +
>> +static struct iio_root_port *iio_root_port_find_by_notation(
>> + const struct iio_root_ports_list * const list, u32 domain, u8 bus)
>> +{
>> + int idx;
>> + struct iio_root_port *rp;
>> +
>> + if (list) {
>> + for (idx = 0; idx < list->nr_entries; idx++) {
>> + rp = list->rps[idx];
>> + if (rp && rp->domain == domain && rp->bus == bus)
>> + return rp;
>> + }
>> + }
>> + return NULL;
>> +}
>> +
>> +static int iio_root_ports_list_insert(struct iio_root_ports_list *list,
>> + struct iio_root_port * const rp)
>> +{
>> + struct iio_root_port **tmp_buf;
>> +
>> + if (list && rp) {
>> + rp->idx = list->nr_entries++;
>> + tmp_buf = realloc(list->rps,
>> + list->nr_entries * sizeof(*list->rps));
>> + if (!tmp_buf) {
>> + pr_err("Failed to realloc memory\n");
>> + return -ENOMEM;
>> + }
>> + tmp_buf[rp->idx] = rp;
>> + list->rps = tmp_buf;
>> + }
>> + return 0;
>> +}
>> --
>> 2.19.1
>>

2021-02-10 10:26:06

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] perf stat: Basic support for iostat in perf

On Mon, Feb 8, 2021 at 8:55 PM Alexander Antonov
<[email protected]> wrote:
>
>
> On 2/4/2021 3:22 PM, Namhyung Kim wrote:
> > On Wed, Feb 3, 2021 at 10:58 PM Alexander Antonov
> > <[email protected]> wrote:
> >> Add basic flow for a new iostat mode in perf. Mode is intended to
> >> provide four I/O performance metrics per each PCIe root port: Inbound Read,
> >> Inbound Write, Outbound Read, Outbound Write.
> >>
> >> The actual code to compute the metrics and attribute it to
> >> root port is in follow-on patches.
> >>
> >> Signed-off-by: Alexander Antonov <[email protected]>
> >> ---
> >> tools/perf/builtin-stat.c | 31 ++++++++++++++++++++++++++
> >> tools/perf/util/iostat.h | 32 +++++++++++++++++++++++++++
> >> tools/perf/util/stat-display.c | 40 +++++++++++++++++++++++++++++++++-
> >> tools/perf/util/stat-shadow.c | 11 +++++++++-
> >> tools/perf/util/stat.h | 1 +
> >> 5 files changed, 113 insertions(+), 2 deletions(-)
> >> create mode 100644 tools/perf/util/iostat.h
> >>
> >> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> >> index 60fdb6a0805f..66c913692120 100644
> >> --- a/tools/perf/builtin-stat.c
> >> +++ b/tools/perf/builtin-stat.c
> >> @@ -65,6 +65,7 @@
> >> #include "util/target.h"
> >> #include "util/time-utils.h"
> >> #include "util/top.h"
> >> +#include "util/iostat.h"
> >> #include "asm/bug.h"
> >>
> >> #include <linux/time64.h>
> >> @@ -186,6 +187,7 @@ static struct perf_stat_config stat_config = {
> >> .metric_only_len = METRIC_ONLY_LEN,
> >> .walltime_nsecs_stats = &walltime_nsecs_stats,
> >> .big_num = true,
> >> + .iostat_run = false,
> >> };
> >>
> >> static inline void diff_timespec(struct timespec *r, struct timespec *a,
> >> @@ -723,6 +725,14 @@ static int parse_metric_groups(const struct option *opt,
> >> return metricgroup__parse_groups(opt, str, &stat_config.metric_events);
> >> }
> >>
> >> +__weak int iostat_parse(const struct option *opt __maybe_unused,
> >> + const char *str __maybe_unused,
> >> + int unset __maybe_unused)
> >> +{
> >> + pr_err("iostat mode is not supported\n");
> >> + return -1;
> >> +}
> >> +
> >> static struct option stat_options[] = {
> >> OPT_BOOLEAN('T', "transaction", &transaction_run,
> >> "hardware transaction statistics"),
> >> @@ -803,6 +813,8 @@ static struct option stat_options[] = {
> >> OPT_CALLBACK('M', "metrics", &evsel_list, "metric/metric group list",
> >> "monitor specified metrics or metric groups (separated by ,)",
> >> parse_metric_groups),
> >> + OPT_CALLBACK_OPTARG(0, "iostat", &evsel_list, &stat_config, "root port",
> >> + "measure PCIe metrics per root port", iostat_parse),
> > Can we make the help string and default argument more generic?
> > Something like "measure IO metrics provided by arch/platform"
> > and the default value being "default". :)
> >
> Do you mean using "default" instead of "root port"?
> What about the faceless "I/O unit"? :)

Being a generic command, I cannot expect how it can be used later.
So I'd suggest a more general name.


> >> OPT_END()
> >> };
> >>
> >> @@ -1131,6 +1143,12 @@ __weak void arch_topdown_group_warn(void)
> >> {
> >> }
> >>
> >> +__weak int iostat_list(struct evlist *evlist __maybe_unused,
> >> + struct perf_stat_config *config __maybe_unused)
> >> +{
> >> + return 0;
> >> +}
> >> +
> >> /*
> >> * Add default attributes, if there were no attributes specified or
> >> * if -d/--detailed, -d -d or -d -d -d is used:
> >> @@ -1682,6 +1700,10 @@ static void setup_system_wide(int forks)
> >> }
> >> }
> >>
> >> +__weak void iostat_release(struct evlist *evlist __maybe_unused)
> >> +{
> >> +}
> >> +
> >> int cmd_stat(int argc, const char **argv)
> >> {
> >> const char * const stat_usage[] = {
> >> @@ -1858,6 +1880,12 @@ int cmd_stat(int argc, const char **argv)
> >> goto out;
> >> }
> >>
> >> + if (stat_config.iostat_run) {
> >> + status = iostat_list(evsel_list, &stat_config);
> > I think it's unnatural to call iostat_list() unconditionally here.
> > How about this?
> >
> > status = iostat_prepare(...);
> > if (status < 0)
> > goto out;
> >
> > if (status == IOSTAT_LIST)
> > iostat_list(...);
> > else
> > ...
> I think it's applicable.
> In case of 'list' option we will just print list of root ports and exit.
> Also listing of root ports is available in verbose mode. In this case we
> will
> print list and start the collection.

ok.

> >
> >> + if (status || !stat_config.iostat_run)
> >> + goto out;
> >> + }
> >> +
> >> if (add_default_attributes())
> >> goto out;
> >>
> >> @@ -2008,6 +2036,9 @@ int cmd_stat(int argc, const char **argv)
> >> perf_stat__exit_aggr_mode();
> >> perf_evlist__free_stats(evsel_list);
> >> out:
> >> + if (stat_config.iostat_run)
> >> + iostat_release(evsel_list);
> >> +
> >> zfree(&stat_config.walltime_run);
> >>
> >> if (smi_cost && smi_reset)
> >> diff --git a/tools/perf/util/iostat.h b/tools/perf/util/iostat.h
> >> new file mode 100644
> >> index 000000000000..b34ebedfd5e6
> >> --- /dev/null
> >> +++ b/tools/perf/util/iostat.h
> >> @@ -0,0 +1,32 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +/*
> >> + * perf iostat
> >> + *
> >> + * Copyright (C) 2020, Intel Corporation
> >> + *
> >> + * Authors: Alexander Antonov <[email protected]>
> >> + */
> >> +
> >> +#ifndef _IOSTAT_H
> >> +#define _IOSTAT_H
> >> +
> >> +#include <subcmd/parse-options.h>
> >> +#include "util/stat.h"
> >> +#include "util/parse-events.h"
> >> +#include "util/evlist.h"
> >> +
> >> +struct option;
> >> +struct perf_stat_config;
> >> +struct evlist;
> >> +struct timespec;
> >> +
> >> +int iostat_parse(const struct option *opt, const char *str,
> >> + int unset __maybe_unused);
> >> +void iostat_prefix(struct perf_stat_config *config, struct evlist *evlist,
> >> + char *prefix, struct timespec *ts);
> >> +void iostat_print_metric(struct perf_stat_config *config, struct evsel *evsel,
> >> + struct perf_stat_output_ctx *out);
> >> +int iostat_list(struct evlist *evlist, struct perf_stat_config *config);
> >> +void iostat_release(struct evlist *evlist);
> >> +
> >> +#endif /* _IOSTAT_H */
> >> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> >> index db1bec115d0b..de78cf6962b9 100644
> >> --- a/tools/perf/util/stat-display.c
> >> +++ b/tools/perf/util/stat-display.c
> >> @@ -16,6 +16,7 @@
> >> #include <linux/ctype.h>
> >> #include "cgroup.h"
> >> #include <api/fs/fs.h>
> >> +#include "iostat.h"
> >>
> >> #define CNTR_NOT_SUPPORTED "<not supported>"
> >> #define CNTR_NOT_COUNTED "<not counted>"
> >> @@ -302,6 +303,11 @@ static void print_metric_header(struct perf_stat_config *config,
> >> struct outstate *os = ctx;
> >> char tbuf[1024];
> >>
> >> + /* In case of iostat, print metric header for first root port only */
> >> + if (config->iostat_run &&
> >> + os->evsel->priv != os->evsel->evlist->selected->priv)
> >> + return;
> >> +
> >> if (!valid_only_metric(unit))
> >> return;
> >> unit = fixunit(tbuf, os->evsel, unit);
> >> @@ -936,6 +942,8 @@ static void print_metric_headers(struct perf_stat_config *config,
> >> fputs("time,", config->output);
> >> fputs(aggr_header_csv[config->aggr_mode], config->output);
> >> }
> >> + if (config->iostat_run && !config->interval && !config->csv_output)
> >> + fprintf(config->output, " port ");
> > It's too specific to the current implementation.
> > Let's make it a callback or a weak function.
> Okay,
> This and other similar blocks will be updated.

ok

> >
> >> /* Print metrics headers only */
> >> evlist__for_each_entry(evlist, counter) {
> >> @@ -954,6 +962,13 @@ static void print_metric_headers(struct perf_stat_config *config,
> >> fputc('\n', config->output);
> >> }
> >>
> >> +__weak void iostat_prefix(struct perf_stat_config *config __maybe_unused,
> >> + struct evlist *evlist __maybe_unused,
> >> + char *prefix __maybe_unused,
> >> + struct timespec *ts __maybe_unused)
> >> +{
> >> +}
> >> +
> >> static void print_interval(struct perf_stat_config *config,
> >> struct evlist *evlist,
> >> char *prefix, struct timespec *ts)
> >> @@ -966,7 +981,10 @@ static void print_interval(struct perf_stat_config *config,
> >> if (config->interval_clear)
> >> puts(CONSOLE_CLEAR);
> >>
> >> - sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec, ts->tv_nsec, config->csv_sep);
> >> + if (!config->iostat_run)
> >> + sprintf(prefix, "%6lu.%09lu%s", ts->tv_sec,
> >> + ts->tv_nsec,
> >> + config->csv_sep);
> >>
> >> if ((num_print_interval == 0 && !config->csv_output) || config->interval_clear) {
> >> switch (config->aggr_mode) {
> >> @@ -996,6 +1014,7 @@ static void print_interval(struct perf_stat_config *config,
> >> fprintf(output, " counts %*s events\n", unit_width, "unit");
> >> break;
> >> case AGGR_PCIE_PORT:
> >> + fprintf(output, "# time port ");
> > Ditto.
> >
> >> break;
> >> case AGGR_GLOBAL:
> >> default:
> >> @@ -1174,6 +1193,10 @@ perf_evlist__print_counters(struct evlist *evlist,
> >> int interval = config->interval;
> >> struct evsel *counter;
> >> char buf[64], *prefix = NULL;
> >> + void *perf_device = NULL;
> >> +
> >> + if (config->iostat_run)
> >> + evlist->selected = evlist__first(evlist);
> >>
> >> if (interval)
> >> print_interval(config, evlist, prefix = buf, ts);
> >> @@ -1222,6 +1245,21 @@ perf_evlist__print_counters(struct evlist *evlist,
> >> }
> >> break;
> >> case AGGR_PCIE_PORT:
> > Ditto. Something like iostat_print_counters().
> >
> >> + counter = evlist__first(evlist);
> >> + perf_evlist__set_selected(evlist, counter);
> >> + iostat_prefix(config, evlist, prefix = buf, ts);
> >> + fprintf(config->output, "%s", prefix);
> >> + evlist__for_each_entry(evlist, counter) {
> >> + perf_device = evlist->selected->priv;
> >> + if (perf_device && perf_device != counter->priv) {
> >> + perf_evlist__set_selected(evlist, counter);
> >> + iostat_prefix(config, evlist, prefix, ts);
> >> + fprintf(config->output, "\n%s", prefix);
> >> + }
> >> + print_counter_aggr(config, counter, prefix);
> > I'm not sure but do you assume each counter has different priv?
> > I don't know if the output is correct (like call iostat_prefix() once
> > and call print_counter_aggr() twice) when they have same one.
> There are 4 counters which are related to single 'priv' field (it's used
> for root
> port object in iostat mode). This means if platform has, for example, 5 root
> ports we will have 20 counters in summary. And print_counter_aggr() will be
> called for each counter.
> I call iostat_prefix() before the loop to print first root port and then
> iostat_prefix() will be called when next counter is related to other
> root port.

Thanks for the explanation.

> >
> >> + if ((counter->idx + 1) == evlist->core.nr_entries)
> >> + fputc('\n', config->output);
> > Can we just move this out of the loop?
> >
> Yes, you are right, we can. I will update it.

Thanks,
Namhyung


> >
> >
> >> + }
> >> break;
> >> case AGGR_UNSET:
> >> default:
> >> diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
> >> index 2c41d47f6f83..083a450c6dc7 100644
> >> --- a/tools/perf/util/stat-shadow.c
> >> +++ b/tools/perf/util/stat-shadow.c
> >> @@ -9,6 +9,7 @@
> >> #include "expr.h"
> >> #include "metricgroup.h"
> >> #include <linux/zalloc.h>
> >> +#include "iostat.h"
> >>
> >> /*
> >> * AGGR_GLOBAL: Use CPU 0
> >> @@ -814,6 +815,12 @@ static void generic_metric(struct perf_stat_config *config,
> >> zfree(&pctx.ids[i].name);
> >> }
> >>
> >> +__weak void iostat_print_metric(struct perf_stat_config *config __maybe_unused,
> >> + struct evsel *evsel __maybe_unused,
> >> + struct perf_stat_output_ctx *out __maybe_unused)
> >> +{
> >> +}
> >> +
> >> void perf_stat__print_shadow_stats(struct perf_stat_config *config,
> >> struct evsel *evsel,
> >> double avg, int cpu,
> >> @@ -829,7 +836,9 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
> >> struct metric_event *me;
> >> int num = 1;
> >>
> >> - if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
> >> + if (config->iostat_run) {
> >> + iostat_print_metric(config, evsel, out);
> >> + } else if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
> >> total = runtime_stat_avg(st, STAT_CYCLES, ctx, cpu);
> >>
> >> if (total) {
> >> diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
> >> index c7544c28c02a..c2a2b28effd6 100644
> >> --- a/tools/perf/util/stat.h
> >> +++ b/tools/perf/util/stat.h
> >> @@ -107,6 +107,7 @@ struct perf_stat_config {
> >> bool big_num;
> >> bool no_merge;
> >> bool walltime_run_table;
> >> + bool iostat_run;
> >> FILE *output;
> >> unsigned int interval;
> >> unsigned int timeout;
> >> --
> >> 2.19.1
> >>

2021-03-09 07:53:09

by liuqi (BA)

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] perf stat: Enable iostat mode for x86 platforms

Hi Alexander,

On 2021/2/3 21:58, Alexander Antonov wrote:
> This functionality is based on recently introduced sysfs attributes
> for Intel® Xeon® Scalable processor family (code name Skylake-SP):
> Commit bb42b3d39781 ("perf/x86/intel/uncore: Expose an Uncore unit to
> IIO PMON mapping")
>
> Mode is intended to provide four I/O performance metrics in MB per each
> PCIe root port:
> - Inbound Read: I/O devices below root port read from the host memory
> - Inbound Write: I/O devices below root port write to the host memory
> - Outbound Read: CPU reads from I/O devices below root port
> - Outbound Write: CPU writes to I/O devices below root port
>
> Each metric requiries only one uncore event which increments at every 4B
> transfer in corresponding direction. The formulas to compute metrics
> are generic:
> #EventCount * 4B / (1024 * 1024)
>
> Signed-off-by: Alexander Antonov<[email protected]>
> ---
> tools/perf/Documentation/perf-iostat.txt | 88 ++++++
> tools/perf/Makefile.perf | 5 +-
> tools/perf/arch/x86/util/Build | 1 +
> tools/perf/arch/x86/util/iostat.c | 345 +++++++++++++++++++++++
> tools/perf/command-list.txt | 1 +
> tools/perf/perf-iostat.sh | 12 +
> 6 files changed, 451 insertions(+), 1 deletion(-)
> create mode 100644 tools/perf/Documentation/perf-iostat.txt
> create mode 100644 tools/perf/perf-iostat.sh
>
> diff --git a/tools/perf/Documentation/perf-iostat.txt b/tools/perf/Documentation/perf-iostat.txt
> new file mode 100644
> index 000000000000..165176944031
> --- /dev/null
> +++ b/tools/perf/Documentation/perf-iostat.txt
> @@ -0,0 +1,88 @@
> +perf-iostat(1)
> +===============
> +
> +NAME
> +----
> +perf-iostat - Show I/O performance metrics
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'perf iostat' list
> +'perf iostat' <ports> -- <command> [<options>]
> +
> +DESCRIPTION
> +-----------
> +Mode is intended to provide four I/O performance metrics per each PCIe root port:
> +
> +- Inbound Read - I/O devices below root port read from the host memory, in MB
> +
> +- Inbound Write - I/O devices below root port write to the host memory, in MB
> +
> +- Outbound Read - CPU reads from I/O devices below root port, in MB
> +
> +- Outbound Write - CPU writes to I/O devices below root port, in MB
> +
> +OPTIONS
> +-------
> +<command>...::
> + Any command you can specify in a shell.
> +
> +list::
> + List all PCIe root ports.

I noticed that "iostat" commond and cmd_iostat() callback function is
not registered in cmd_struct in perf.c. So I think "perf iostat list"
perhaps can not work properly.

I also test this patchset on x86 platform, and here is the log:

root@ubuntu:/home/lq# ./perf iostat list
perf: 'iostat' is not a perf-command. See 'perf --help'.
root@ubuntu:/home/lq# ./perf stat --iostat
^C
Performance counter stats for 'system wide':

port Inbound Read(MB) Inbound Write(MB) Outbound
Read(MB) Outbound Write(MB)
0000:00 0 0 0
0
0000:80 0 0 0
0
0000:17 0 0 0
0
0000:85 0 0 0
0
0000:3a 0 0 0
0
0000:ae 0 0 0
0
0000:5d 0 0 0
0
0000:d7 0 0 0
0

0.611303832 seconds time elapsed


root@ubuntu:/home/lq# ./perf stat --iostat=0000:17
^C
Performance counter stats for 'system wide':

port Inbound Read(MB) Inbound Write(MB) Outbound
Read(MB) Outbound Write(MB)
0000:17 0 0 0
0

0.521317572 seconds time elapsed

So how does following perf iostat list work, did I miss something?

Thanks,
Qi

> +
> +<ports>::
> + Select the root ports for monitoring. Comma-separated list is supported.
> +
> +EXAMPLES
> +--------
> +
> +1. List all PCIe root ports (example for 2-S platform):
> +
> + $ perf iostat list
> + S0-uncore_iio_0<0000:00>
> + S1-uncore_iio_0<0000:80>
> + S0-uncore_iio_1<0000:17>
> + S1-uncore_iio_1<0000:85>
> + S0-uncore_iio_2<0000:3a>
> + S1-uncore_iio_2<0000:ae>
> + S0-uncore_iio_3<0000:5d>
> + S1-uncore_iio_3<0000:d7>

2021-03-10 16:22:15

by Alexander Antonov

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] perf stat: Enable iostat mode for x86 platforms


On 3/9/2021 10:51 AM, liuqi (BA) wrote:
> Hi Alexander,
>
> On 2021/2/3 21:58, Alexander Antonov wrote:
>> This functionality is based on recently introduced sysfs attributes
>> for Intel® Xeon® Scalable processor family (code name Skylake-SP):
>> Commit bb42b3d39781 ("perf/x86/intel/uncore: Expose an Uncore unit to
>> IIO PMON mapping")
>>
>> Mode is intended to provide four I/O performance metrics in MB per each
>> PCIe root port:
>>   - Inbound Read: I/O devices below root port read from the host memory
>>   - Inbound Write: I/O devices below root port write to the host memory
>>   - Outbound Read: CPU reads from I/O devices below root port
>>   - Outbound Write: CPU writes to I/O devices below root port
>>
>> Each metric requiries only one uncore event which increments at every 4B
>> transfer in corresponding direction. The formulas to compute metrics
>> are generic:
>>      #EventCount * 4B / (1024 * 1024)
>>
>> Signed-off-by: Alexander Antonov<[email protected]>
>> ---
>>   tools/perf/Documentation/perf-iostat.txt |  88 ++++++
>>   tools/perf/Makefile.perf                 |   5 +-
>>   tools/perf/arch/x86/util/Build           |   1 +
>>   tools/perf/arch/x86/util/iostat.c        | 345 +++++++++++++++++++++++
>>   tools/perf/command-list.txt              |   1 +
>>   tools/perf/perf-iostat.sh                |  12 +
>>   6 files changed, 451 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/perf/Documentation/perf-iostat.txt
>>   create mode 100644 tools/perf/perf-iostat.sh
>>
>> diff --git a/tools/perf/Documentation/perf-iostat.txt
>> b/tools/perf/Documentation/perf-iostat.txt
>> new file mode 100644
>> index 000000000000..165176944031
>> --- /dev/null
>> +++ b/tools/perf/Documentation/perf-iostat.txt
>> @@ -0,0 +1,88 @@
>> +perf-iostat(1)
>> +===============
>> +
>> +NAME
>> +----
>> +perf-iostat - Show I/O performance metrics
>> +
>> +SYNOPSIS
>> +--------
>> +[verse]
>> +'perf iostat' list
>> +'perf iostat' <ports> -- <command> [<options>]
>> +
>> +DESCRIPTION
>> +-----------
>> +Mode is intended to provide four I/O performance metrics per each
>> PCIe root port:
>> +
>> +- Inbound Read   - I/O devices below root port read from the host
>> memory, in MB
>> +
>> +- Inbound Write  - I/O devices below root port write to the host
>> memory, in MB
>> +
>> +- Outbound Read  - CPU reads from I/O devices below root port, in MB
>> +
>> +- Outbound Write - CPU writes to I/O devices below root port, in MB
>> +
>> +OPTIONS
>> +-------
>> +<command>...::
>> +    Any command you can specify in a shell.
>> +
>> +list::
>> +    List all PCIe root ports.
>
> I noticed that "iostat" commond and cmd_iostat() callback function is
> not registered in cmd_struct in perf.c. So I think "perf iostat list"
> perhaps can not work properly.
>
> I also test this patchset on x86 platform, and here is the log:
>
> root@ubuntu:/home/lq# ./perf iostat list
> perf: 'iostat' is not a perf-command. See 'perf --help'.
> root@ubuntu:/home/lq# ./perf stat --iostat
> ^C
>  Performance counter stats for 'system wide':
>
>    port             Inbound Read(MB)    Inbound Write(MB) Outbound
> Read(MB)   Outbound Write(MB)
> 0000:00                    0 0                    0                  0
> 0000:80                    0 0                    0                  0
> 0000:17                    0 0                    0                  0
> 0000:85                    0 0                    0                  0
> 0000:3a                    0 0                    0                  0
> 0000:ae                    0 0                    0                  0
> 0000:5d                    0 0                    0                  0
> 0000:d7                    0 0                    0                  0
>
>        0.611303832 seconds time elapsed
>
>
> root@ubuntu:/home/lq# ./perf stat --iostat=0000:17
> ^C
>  Performance counter stats for 'system wide':
>
>    port             Inbound Read(MB)    Inbound Write(MB) Outbound
> Read(MB)   Outbound Write(MB)
> 0000:17                    0 0                    0                  0
>
>        0.521317572 seconds time elapsed
>
> So how does following perf iostat list work, did I miss something?
>
> Thanks,
> Qi
>
Hello,

The 'iostat' mode uses aliases mechanism in perf same as 'perf archive' and
in this case you don't need to add function callback into cmd_struct.
For example, the command 'perf iostat list' will be converted to
'perf stat --iostat=list'.

After building the perf tool you should have two shell scripts in tools/perf
directory and one of them is executable, for example:
# make -C tools/perf
# ls -l tools/perf/perf-iostat*
-rwxr-xr-x 1 root root 290 Mar 10 18:17 perf-iostat
-rw-r--r-- 1 root root 290 Feb  3 15:14 perf-iostat.sh

It should be possible to run 'perf iostat' from build directory:
# cd tools/perf
# ./perf iostat list
S0-uncore_iio_0<0000:00>
S1-uncore_iio_0<0000:80>
S0-uncore_iio_1<0000:17>
S1-uncore_iio_1<0000:85>
S0-uncore_iio_2<0000:3a>
S1-uncore_iio_2<0000:ae>
S0-uncore_iio_3<0000:5d>
S1-uncore_iio_3<0000:d7>

Also you can copy 'perf-iostat' to ~/libexec/perf-core/ or just run
'make install'

# make install
# cp perf /usr/bin/
# ls -lh ~/libexec/perf-core/
total 24K
-rwxr-xr-x 1 root root 1.4K Mar 10 18:17 perf-archive
-rwxr-xr-x 1 root root  290 Mar 10 18:17 perf-iostat
-rwxr-xr-x 1 root root 6.3K Mar 10 18:17 perf-with-kcore
drwxr-xr-x 4 root root 4.0K Nov  5  2019 scripts
drwxr-xr-x 4 root root 4.0K Mar 10 18:17 tests
# perf iostat 0000:17 -I 1000 --interval-count 2
#           time    port             Inbound Read(MB)    Inbound
Write(MB)    Outbound Read(MB)   Outbound Write(MB)
     1.000220341 0000:17                    0 0                   
0                    0
     2.000569569 0000:17                    0 0                   
0                    0

Actually, Arnaldo has explained before how does aliases mechanism work.

I hope it will solve your issue. Otherwise, please email.

Thank you,
Alexander
>> +
>> +<ports>::
>> +    Select the root ports for monitoring. Comma-separated list is
>> supported.
>> +
>> +EXAMPLES
>> +--------
>> +
>> +1. List all PCIe root ports (example for 2-S platform):
>> +
>> +   $ perf iostat list
>> +   S0-uncore_iio_0<0000:00>
>> +   S1-uncore_iio_0<0000:80>
>> +   S0-uncore_iio_1<0000:17>
>> +   S1-uncore_iio_1<0000:85>
>> +   S0-uncore_iio_2<0000:3a>
>> +   S1-uncore_iio_2<0000:ae>
>> +   S0-uncore_iio_3<0000:5d>
>> +   S1-uncore_iio_3<0000:d7>
>

2021-03-11 07:05:23

by liuqi (BA)

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] perf stat: Enable iostat mode for x86 platforms



On 2021/3/11 0:19, Alexander Antonov wrote:
>
> On 3/9/2021 10:51 AM, liuqi (BA) wrote:
>> Hi Alexander,
>>
>> On 2021/2/3 21:58, Alexander Antonov wrote:
>>> This functionality is based on recently introduced sysfs attributes
>>> for Intel® Xeon® Scalable processor family (code name Skylake-SP):
>>> Commit bb42b3d39781 ("perf/x86/intel/uncore: Expose an Uncore unit to
>>> IIO PMON mapping")
>>>
>>> Mode is intended to provide four I/O performance metrics in MB per each
>>> PCIe root port:
>>>   - Inbound Read: I/O devices below root port read from the host memory
>>>   - Inbound Write: I/O devices below root port write to the host memory
>>>   - Outbound Read: CPU reads from I/O devices below root port
>>>   - Outbound Write: CPU writes to I/O devices below root port
>>>
>>> Each metric requiries only one uncore event which increments at every 4B
>>> transfer in corresponding direction. The formulas to compute metrics
>>> are generic:
>>>      #EventCount * 4B / (1024 * 1024)
>>>
>>> Signed-off-by: Alexander Antonov<[email protected]>
>>> ---
>>>   tools/perf/Documentation/perf-iostat.txt |  88 ++++++
>>>   tools/perf/Makefile.perf                 |   5 +-
>>>   tools/perf/arch/x86/util/Build           |   1 +
>>>   tools/perf/arch/x86/util/iostat.c        | 345 +++++++++++++++++++++++
>>>   tools/perf/command-list.txt              |   1 +
>>>   tools/perf/perf-iostat.sh                |  12 +
>>>   6 files changed, 451 insertions(+), 1 deletion(-)
>>>   create mode 100644 tools/perf/Documentation/perf-iostat.txt
>>>   create mode 100644 tools/perf/perf-iostat.sh
>>>
>>> diff --git a/tools/perf/Documentation/perf-iostat.txt
>>> b/tools/perf/Documentation/perf-iostat.txt
>>> new file mode 100644
>>> index 000000000000..165176944031
>>> --- /dev/null
>>> +++ b/tools/perf/Documentation/perf-iostat.txt
>>> @@ -0,0 +1,88 @@
>>> +perf-iostat(1)
>>> +===============
>>> +
>>> +NAME
>>> +----
>>> +perf-iostat - Show I/O performance metrics
>>> +
>>> +SYNOPSIS
>>> +--------
>>> +[verse]
>>> +'perf iostat' list
>>> +'perf iostat' <ports> -- <command> [<options>]
>>> +
>>> +DESCRIPTION
>>> +-----------
>>> +Mode is intended to provide four I/O performance metrics per each
>>> PCIe root port:
>>> +
>>> +- Inbound Read   - I/O devices below root port read from the host
>>> memory, in MB
>>> +
>>> +- Inbound Write  - I/O devices below root port write to the host
>>> memory, in MB
>>> +
>>> +- Outbound Read  - CPU reads from I/O devices below root port, in MB
>>> +
>>> +- Outbound Write - CPU writes to I/O devices below root port, in MB
>>> +
>>> +OPTIONS
>>> +-------
>>> +<command>...::
>>> +    Any command you can specify in a shell.
>>> +
>>> +list::
>>> +    List all PCIe root ports.
>>
>> I noticed that "iostat" commond and cmd_iostat() callback function is
>> not registered in cmd_struct in perf.c. So I think "perf iostat list"
>> perhaps can not work properly.
>>
>> I also test this patchset on x86 platform, and here is the log:
>>
>> root@ubuntu:/home/lq# ./perf iostat list
>> perf: 'iostat' is not a perf-command. See 'perf --help'.
>> root@ubuntu:/home/lq# ./perf stat --iostat
>> ^C
>>  Performance counter stats for 'system wide':
>>
>>    port             Inbound Read(MB)    Inbound Write(MB) Outbound
>> Read(MB)   Outbound Write(MB)
>> 0000:00                    0 0                    0                  0
>> 0000:80                    0 0                    0                  0
>> 0000:17                    0 0                    0                  0
>> 0000:85                    0 0                    0                  0
>> 0000:3a                    0 0                    0                  0
>> 0000:ae                    0 0                    0                  0
>> 0000:5d                    0 0                    0                  0
>> 0000:d7                    0 0                    0                  0
>>
>>        0.611303832 seconds time elapsed
>>
>>
>> root@ubuntu:/home/lq# ./perf stat --iostat=0000:17
>> ^C
>>  Performance counter stats for 'system wide':
>>
>>    port             Inbound Read(MB)    Inbound Write(MB) Outbound
>> Read(MB)   Outbound Write(MB)
>> 0000:17                    0 0                    0                  0
>>
>>        0.521317572 seconds time elapsed
>>
>> So how does following perf iostat list work, did I miss something?
>>
>> Thanks,
>> Qi
>>
> Hello,
>
> The 'iostat' mode uses aliases mechanism in perf same as 'perf archive' and
> in this case you don't need to add function callback into cmd_struct.
> For example, the command 'perf iostat list' will be converted to
> 'perf stat --iostat=list'.
>
> After building the perf tool you should have two shell scripts in
> tools/perf
> directory and one of them is executable, for example:
> # make -C tools/perf
> # ls -l tools/perf/perf-iostat*
> -rwxr-xr-x 1 root root 290 Mar 10 18:17 perf-iostat
> -rw-r--r-- 1 root root 290 Feb  3 15:14 perf-iostat.sh
>
> It should be possible to run 'perf iostat' from build directory:
> # cd tools/perf
> # ./perf iostat list
> S0-uncore_iio_0<0000:00>
> S1-uncore_iio_0<0000:80>
> S0-uncore_iio_1<0000:17>
> S1-uncore_iio_1<0000:85>
> S0-uncore_iio_2<0000:3a>
> S1-uncore_iio_2<0000:ae>
> S0-uncore_iio_3<0000:5d>
> S1-uncore_iio_3<0000:d7>
>
> Also you can copy 'perf-iostat' to ~/libexec/perf-core/ or just run
> 'make install'
>
> # make install
> # cp perf /usr/bin/
> # ls -lh ~/libexec/perf-core/
> total 24K
> -rwxr-xr-x 1 root root 1.4K Mar 10 18:17 perf-archive
> -rwxr-xr-x 1 root root  290 Mar 10 18:17 perf-iostat
> -rwxr-xr-x 1 root root 6.3K Mar 10 18:17 perf-with-kcore
> drwxr-xr-x 4 root root 4.0K Nov  5  2019 scripts
> drwxr-xr-x 4 root root 4.0K Mar 10 18:17 tests
> # perf iostat 0000:17 -I 1000 --interval-count 2
> #           time    port             Inbound Read(MB)    Inbound
> Write(MB)    Outbound Read(MB)   Outbound Write(MB)
>      1.000220341 0000:17                    0 0 0                    0
>      2.000569569 0000:17                    0 0 0                    0
>
> Actually, Arnaldo has explained before how does aliases mechanism work.
>
> I hope it will solve your issue. Otherwise, please email.
>
> Thank you,
> Alexander

Hi Alexander,

I try it again on x86 platform and it works, thanks a lot:)

Thanks,
Qi
>>> +
>>> +<ports>::
>>> +    Select the root ports for monitoring. Comma-separated list is
>>> supported.
>>> +
>>> +EXAMPLES
>>> +--------
>>> +
>>> +1. List all PCIe root ports (example for 2-S platform):
>>> +
>>> +   $ perf iostat list
>>> +   S0-uncore_iio_0<0000:00>
>>> +   S1-uncore_iio_0<0000:80>
>>> +   S0-uncore_iio_1<0000:17>
>>> +   S1-uncore_iio_1<0000:85>
>>> +   S0-uncore_iio_2<0000:3a>
>>> +   S1-uncore_iio_2<0000:ae>
>>> +   S0-uncore_iio_3<0000:5d>
>>> +   S1-uncore_iio_3<0000:d7>
>>
>
> .