Try to reduce the data size of the perf command. Before these patches,
a stripped non-debug binary was:
$ size -A perf
perf :
section size addr
.interp 28 848
.note.gnu.property 32 880
.note.gnu.build-id 36 912
.note.ABI-tag 32 948
.gnu.hash 24628 984
.dynsym 88920 25616
.dynstr 70193 114536
.gnu.version 7410 184730
.gnu.version_r 800 192144
.rela.dyn 460824 192944
.rela.plt 14784 653768
.init 23 671744
.plt 9872 671776
.plt.got 24 681648
.text 2279182 681680
.noinstr.text 476 2960864
.fini 9 2961340
.rodata 7042922 2961408
.eh_frame_hdr 42844 10004332
.eh_frame 226496 10047176
.tbss 48 10279720
.init_array 16 10279720
.fini_array 8 10279736
.data.rel.ro 53376 10279744
.dynamic 736 10333120
.got 328 10333856
.got.plt 4952 10334184
.data 391088 10339136
.bss 285776 10730240
.comment 31 0
Total 11005894
And after:
perf :
section size addr
.interp 28 848
.note.gnu.property 32 880
.note.gnu.build-id 36 912
.note.ABI-tag 32 948
.gnu.hash 24628 984
.dynsym 88944 25616
.dynstr 70217 114560
.gnu.version 7412 184778
.gnu.version_r 816 192192
.rela.dyn 460824 193008
.rela.plt 14808 653832
.init 23 671744
.plt 9888 671776
.plt.got 24 681664
.text 2280446 681696
.noinstr.text 476 2962144
.fini 9 2962620
.rodata 7048746 2965504
.eh_frame_hdr 42852 10014252
.eh_frame 226568 10057104
.tbss 48 10285640
.init_array 16 10285640
.fini_array 8 10285656
.data.rel.ro 301408 10285664
.dynamic 736 10587072
.got 328 10587808
.got.plt 4960 10588136
.data 100464 10593152
.bss 22512 10693632
.comment 31 0
Total 10707320
The binary has shrunk by 298,574 bytes. The .bss, which doesn't count
toward file size, is reduced by 263,254 bytes. At runtime this could
reduce the footprint by up to 561,828 bytes. This is still just a
fraction of the .rodata section's size of 7,048,746 bytes, which
mainly contains the converted json events. The .rodata section
needn't all be mapped at the same time.
The changes largely remove static variables, replacing them with local
or dynamically allocated memory. A common issue was keeping path
buffers in statics purely so that a non-stack pointer could be
returned, so those APIs were changed to have the caller pass a buffer
in.
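As a rough illustration of that pattern (hypothetical names, not code
from the series):
```
#include <limits.h>
#include <stdio.h>

/* Before: a static buffer, kept only so that a non-stack pointer can
 * be returned. Not thread safe, and it permanently occupies PATH_MAX
 * bytes of .bss. */
static const char *get_config_path_old(void)
{
	static char path[PATH_MAX];

	snprintf(path, sizeof(path), "%s/%s", "/etc/perf", "config");
	return path;
}

/* After: the caller passes the buffer in, so the storage only exists
 * (on the caller's stack or heap) for as long as it is needed. */
static const char *get_config_path(char *buf, size_t sz)
{
	snprintf(buf, sz, "%s/%s", "/etc/perf", "config");
	return buf;
}
```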
v2. Address review comments from Namhyung, thanks!
Ian Rogers (16):
perf header: Make nodes dynamic in write_mem_topology
perf test x86: insn-x86 test data is immutable so mark it const
perf test x86: intel-pt-test data is immutable so mark it const
perf trace: Make some large static arrays const
perf trace beauty: Make MSR arrays const
tools api fs: Avoid large static PATH_MAX arrays
tools lib api fs tracing_path: Remove two unused MAX_PATH paths
perf daemon: Dynamically allocate path to perf
perf lock: Dynamically allocate lockhash_table
perf timechart: Make large arrays dynamic
perf probe: Dynamically allocate params memory
perf path: Make mkpath thread safe
perf scripting-engines: Move static to local variable
tools api fs: Dynamically allocate cgroupfs mount point cache
perf test pmu: Avoid 2 static path arrays
libsubcmd: Avoid two path statics
tools/lib/api/fs/cgroup.c | 17 ++-
tools/lib/api/fs/fs.c | 25 +++-
tools/lib/api/fs/tracing_path.c | 17 +--
tools/lib/subcmd/exec-cmd.c | 35 +++--
tools/perf/arch/x86/tests/insn-x86.c | 10 +-
tools/perf/arch/x86/tests/intel-pt-test.c | 14 +-
tools/perf/builtin-config.c | 4 +-
tools/perf/builtin-daemon.c | 44 +++---
tools/perf/builtin-help.c | 4 +-
tools/perf/builtin-lock.c | 20 ++-
tools/perf/builtin-probe.c | 133 ++++++++++--------
tools/perf/builtin-timechart.c | 48 +++++--
tools/perf/builtin-trace.c | 33 +++--
tools/perf/tests/pmu.c | 17 +--
tools/perf/trace/beauty/beauty.h | 2 +-
.../perf/trace/beauty/tracepoints/x86_msr.sh | 6 +-
tools/perf/util/cache.h | 2 +-
tools/perf/util/config.c | 3 +-
tools/perf/util/header.c | 41 +++---
tools/perf/util/path.c | 35 +----
.../util/scripting-engines/trace-event-perl.c | 4 +-
.../scripting-engines/trace-event-python.c | 5 +-
22 files changed, 297 insertions(+), 222 deletions(-)
--
2.41.0.rc0.172.g3f132b7071-goog
This allows the movement of some sizeable data arrays (168,624 bytes)
to .data.rel.ro. Without PIE, or without the strings, they could be
moved to .rodata.
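For illustration only (not part of the patch), a minimal sketch of why
const data containing pointers ends up in .data.rel.ro under PIE while
pointer-free const data can live in .rodata:
```
struct entry {
	int len;
	const char *name;	/* a pointer needs a load-time relocation under PIE */
};

/* The embedded pointers must be fixed up when the binary is loaded,
 * so with -fPIE/-pie this array typically lands in .data.rel.ro
 * (made read-only after relocation when RELRO is enabled). */
static const struct entry with_pointers[] = {
	{ 3, "foo" },
	{ 5, "hello" },
};

/* No pointers, so no relocations: this can go straight into .rodata. */
static const int just_numbers[] = { 1, 2, 3, 4 };
```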
Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/arch/x86/tests/insn-x86.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/tools/perf/arch/x86/tests/insn-x86.c b/tools/perf/arch/x86/tests/insn-x86.c
index 735257d205b5..7b5eb8baf0f2 100644
--- a/tools/perf/arch/x86/tests/insn-x86.c
+++ b/tools/perf/arch/x86/tests/insn-x86.c
@@ -18,14 +18,14 @@ struct test_data {
const char *asm_rep;
};
-struct test_data test_data_32[] = {
+const struct test_data test_data_32[] = {
#include "insn-x86-dat-32.c"
{{0x0f, 0x01, 0xee}, 3, 0, NULL, NULL, "0f 01 ee \trdpkru"},
{{0x0f, 0x01, 0xef}, 3, 0, NULL, NULL, "0f 01 ef \twrpkru"},
{{0}, 0, 0, NULL, NULL, NULL},
};
-struct test_data test_data_64[] = {
+const struct test_data test_data_64[] = {
#include "insn-x86-dat-64.c"
{{0x0f, 0x01, 0xee}, 3, 0, NULL, NULL, "0f 01 ee \trdpkru"},
{{0x0f, 0x01, 0xef}, 3, 0, NULL, NULL, "0f 01 ef \twrpkru"},
@@ -97,7 +97,7 @@ static int get_branch(const char *branch_str)
return -1;
}
-static int test_data_item(struct test_data *dat, int x86_64)
+static int test_data_item(const struct test_data *dat, int x86_64)
{
struct intel_pt_insn intel_pt_insn;
int op, branch, ret;
@@ -147,9 +147,9 @@ static int test_data_item(struct test_data *dat, int x86_64)
return 0;
}
-static int test_data_set(struct test_data *dat_set, int x86_64)
+static int test_data_set(const struct test_data *dat_set, int x86_64)
{
- struct test_data *dat;
+ const struct test_data *dat;
int ret = 0;
for (dat = dat_set; dat->expected_length; dat++) {
--
2.41.0.rc0.172.g3f132b7071-goog
Use a single stack allocated buffer and avoid 8,192 bytes in .bss.
Signed-off-by: Ian Rogers <[email protected]>
---
tools/lib/subcmd/exec-cmd.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)
diff --git a/tools/lib/subcmd/exec-cmd.c b/tools/lib/subcmd/exec-cmd.c
index 5dbea456973e..7739b5217cf6 100644
--- a/tools/lib/subcmd/exec-cmd.c
+++ b/tools/lib/subcmd/exec-cmd.c
@@ -36,38 +36,40 @@ static int is_absolute_path(const char *path)
return path[0] == '/';
}
-static const char *get_pwd_cwd(void)
+static const char *get_pwd_cwd(char *buf, size_t sz)
{
- static char cwd[PATH_MAX + 1];
char *pwd;
struct stat cwd_stat, pwd_stat;
- if (getcwd(cwd, PATH_MAX) == NULL)
+ if (getcwd(buf, sz) == NULL)
return NULL;
pwd = getenv("PWD");
- if (pwd && strcmp(pwd, cwd)) {
- stat(cwd, &cwd_stat);
+ if (pwd && strcmp(pwd, buf)) {
+ stat(buf, &cwd_stat);
if (!stat(pwd, &pwd_stat) &&
pwd_stat.st_dev == cwd_stat.st_dev &&
pwd_stat.st_ino == cwd_stat.st_ino) {
- strlcpy(cwd, pwd, PATH_MAX);
+ strlcpy(buf, pwd, sz);
}
}
- return cwd;
+ return buf;
}
-static const char *make_nonrelative_path(const char *path)
+static const char *make_nonrelative_path(char *buf, size_t sz, const char *path)
{
- static char buf[PATH_MAX + 1];
-
if (is_absolute_path(path)) {
- if (strlcpy(buf, path, PATH_MAX) >= PATH_MAX)
+ if (strlcpy(buf, path, sz) >= sz)
die("Too long path: %.*s", 60, path);
} else {
- const char *cwd = get_pwd_cwd();
+ const char *cwd = get_pwd_cwd(buf, sz);
+
if (!cwd)
die("Cannot determine the current working directory");
- if (snprintf(buf, PATH_MAX, "%s/%s", cwd, path) >= PATH_MAX)
+
+ if (strlen(cwd) + strlen(path) + 2 >= sz)
die("Too long path: %.*s", 60, path);
+
+ strcat(buf, "/");
+ strcat(buf, path);
}
return buf;
}
@@ -133,8 +135,11 @@ static void add_path(char **out, const char *path)
if (path && *path) {
if (is_absolute_path(path))
astrcat(out, path);
- else
- astrcat(out, make_nonrelative_path(path));
+ else {
+ char buf[PATH_MAX];
+
+ astrcat(out, make_nonrelative_path(buf, sizeof(buf), path));
+ }
astrcat(out, ":");
}
--
2.41.0.rc0.172.g3f132b7071-goog
This allows the movement of 5,808 bytes from .data to .rodata.
Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/arch/x86/tests/intel-pt-test.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/tools/perf/arch/x86/tests/intel-pt-test.c b/tools/perf/arch/x86/tests/intel-pt-test.c
index 70b7f79396b1..09d61fa736e3 100644
--- a/tools/perf/arch/x86/tests/intel-pt-test.c
+++ b/tools/perf/arch/x86/tests/intel-pt-test.c
@@ -22,7 +22,7 @@
* @new_ctx: expected new packet context
* @ctx_unchanged: the packet context must not change
*/
-static struct test_data {
+static const struct test_data {
int len;
u8 bytes[INTEL_PT_PKT_MAX_SZ];
enum intel_pt_pkt_ctx ctx;
@@ -186,7 +186,7 @@ static struct test_data {
{0, {0}, 0, {0, 0, 0}, 0, 0 },
};
-static int dump_packet(struct intel_pt_pkt *packet, u8 *bytes, int len)
+static int dump_packet(const struct intel_pt_pkt *packet, const u8 *bytes, int len)
{
char desc[INTEL_PT_PKT_DESC_MAX];
int ret, i;
@@ -206,14 +206,14 @@ static int dump_packet(struct intel_pt_pkt *packet, u8 *bytes, int len)
return TEST_OK;
}
-static void decoding_failed(struct test_data *d)
+static void decoding_failed(const struct test_data *d)
{
pr_debug("Decoding failed!\n");
pr_debug("Decoding: ");
dump_packet(&d->packet, d->bytes, d->len);
}
-static int fail(struct test_data *d, struct intel_pt_pkt *packet, int len,
+static int fail(const struct test_data *d, struct intel_pt_pkt *packet, int len,
enum intel_pt_pkt_ctx new_ctx)
{
decoding_failed(d);
@@ -242,7 +242,7 @@ static int fail(struct test_data *d, struct intel_pt_pkt *packet, int len,
return TEST_FAIL;
}
-static int test_ctx_unchanged(struct test_data *d, struct intel_pt_pkt *packet,
+static int test_ctx_unchanged(const struct test_data *d, struct intel_pt_pkt *packet,
enum intel_pt_pkt_ctx ctx)
{
enum intel_pt_pkt_ctx old_ctx = ctx;
@@ -258,7 +258,7 @@ static int test_ctx_unchanged(struct test_data *d, struct intel_pt_pkt *packet,
return TEST_OK;
}
-static int test_one(struct test_data *d)
+static int test_one(const struct test_data *d)
{
struct intel_pt_pkt packet;
enum intel_pt_pkt_ctx ctx = d->ctx;
@@ -307,7 +307,7 @@ static int test_one(struct test_data *d)
*/
int test__intel_pt_pkt_decoder(struct test_suite *test __maybe_unused, int subtest __maybe_unused)
{
- struct test_data *d = data;
+ const struct test_data *d = data;
int ret;
for (d = data; d->len; d++) {
--
2.41.0.rc0.172.g3f132b7071-goog
Change struct fs to hold a pointer to a dynamically allocated path
rather than an embedded array. This reduces the size of fs__entries
from 24,768 bytes to 240 bytes. Read paths into a stack-allocated
array and strdup them. Fix an off-by-1 with fscanf's %<num>s in
fs__read_mounts, caught by address sanitizer.
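For context, a minimal illustration of that off-by-one (a hedged
example, not the perf code): "%<N>s" stores up to N characters plus a
terminating NUL, so the destination buffer needs N + 1 bytes.
```
#include <stdio.h>

int main(void)
{
	/* Wrong: "%4096s" may write 4096 characters plus the NUL
	 * terminator, one byte past the end of a 4096-byte buffer:
	 * char word[4096]; */

	/* Right: one extra byte for the terminator. */
	char word[4096 + 1];

	if (scanf("%4096s", word) == 1)
		printf("%s\n", word);
	return 0;
}
```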
Signed-off-by: Ian Rogers <[email protected]>
---
tools/lib/api/fs/fs.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/tools/lib/api/fs/fs.c b/tools/lib/api/fs/fs.c
index 82f53d81a7a7..22d34a0be8b4 100644
--- a/tools/lib/api/fs/fs.c
+++ b/tools/lib/api/fs/fs.c
@@ -88,7 +88,7 @@ static const char * const bpf_fs__known_mountpoints[] = {
struct fs {
const char *name;
const char * const *mounts;
- char path[PATH_MAX];
+ char *path;
bool found;
bool checked;
long magic;
@@ -151,17 +151,23 @@ static bool fs__read_mounts(struct fs *fs)
bool found = false;
char type[100];
FILE *fp;
+ char path[PATH_MAX + 1];
fp = fopen("/proc/mounts", "r");
if (fp == NULL)
- return NULL;
+ return false;
while (!found &&
fscanf(fp, "%*s %" STR(PATH_MAX) "s %99s %*s %*d %*d\n",
- fs->path, type) == 2) {
+ path, type) == 2) {
- if (strcmp(type, fs->name) == 0)
+ if (strcmp(type, fs->name) == 0) {
+ free(fs->path);
+ fs->path = strdup(path);
+ if (!fs->path)
+ return false;
found = true;
+ }
}
fclose(fp);
@@ -188,8 +194,11 @@ static bool fs__check_mounts(struct fs *fs)
ptr = fs->mounts;
while (*ptr) {
if (fs__valid_mount(*ptr, fs->magic) == 0) {
+ free(fs->path);
+ fs->path = strdup(*ptr);
+ if (!fs->path)
+ return false;
fs->found = true;
- strcpy(fs->path, *ptr);
return true;
}
ptr++;
@@ -227,10 +236,12 @@ static bool fs__env_override(struct fs *fs)
if (!override_path)
return false;
+ free(fs->path);
+ fs->path = strdup(override_path);
+ if (!fs->path)
+ return false;
fs->found = true;
fs->checked = true;
- strncpy(fs->path, override_path, sizeof(fs->path) - 1);
- fs->path[sizeof(fs->path) - 1] = '\0';
return true;
}
--
2.41.0.rc0.172.g3f132b7071-goog
Allows the movement of 33,128 bytes from .data to .data.rel.ro.
Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/builtin-trace.c | 33 +++++++++++++++++++--------------
1 file changed, 19 insertions(+), 14 deletions(-)
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index b49d3abb1203..62c7c99a0fe4 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -914,7 +914,7 @@ static size_t syscall_arg__scnprintf_getrandom_flags(char *bf, size_t size,
#include "trace/beauty/socket_type.c"
#include "trace/beauty/waitid_options.c"
-static struct syscall_fmt syscall_fmts[] = {
+static const struct syscall_fmt syscall_fmts[] = {
{ .name = "access",
.arg = { [1] = { .scnprintf = SCA_ACCMODE, /* mode */ }, }, },
{ .name = "arch_prctl",
@@ -1176,18 +1176,21 @@ static int syscall_fmt__cmp(const void *name, const void *fmtp)
return strcmp(name, fmt->name);
}
-static struct syscall_fmt *__syscall_fmt__find(struct syscall_fmt *fmts, const int nmemb, const char *name)
+static const struct syscall_fmt *__syscall_fmt__find(const struct syscall_fmt *fmts,
+ const int nmemb,
+ const char *name)
{
return bsearch(name, fmts, nmemb, sizeof(struct syscall_fmt), syscall_fmt__cmp);
}
-static struct syscall_fmt *syscall_fmt__find(const char *name)
+static const struct syscall_fmt *syscall_fmt__find(const char *name)
{
const int nmemb = ARRAY_SIZE(syscall_fmts);
return __syscall_fmt__find(syscall_fmts, nmemb, name);
}
-static struct syscall_fmt *__syscall_fmt__find_by_alias(struct syscall_fmt *fmts, const int nmemb, const char *alias)
+static const struct syscall_fmt *__syscall_fmt__find_by_alias(const struct syscall_fmt *fmts,
+ const int nmemb, const char *alias)
{
int i;
@@ -1199,7 +1202,7 @@ static struct syscall_fmt *__syscall_fmt__find_by_alias(struct syscall_fmt *fmts
return NULL;
}
-static struct syscall_fmt *syscall_fmt__find_by_alias(const char *alias)
+static const struct syscall_fmt *syscall_fmt__find_by_alias(const char *alias)
{
const int nmemb = ARRAY_SIZE(syscall_fmts);
return __syscall_fmt__find_by_alias(syscall_fmts, nmemb, alias);
@@ -1224,7 +1227,7 @@ struct syscall {
bool nonexistent;
struct tep_format_field *args;
const char *name;
- struct syscall_fmt *fmt;
+ const struct syscall_fmt *fmt;
struct syscall_arg_fmt *arg_fmt;
};
@@ -1673,7 +1676,7 @@ static int syscall__alloc_arg_fmts(struct syscall *sc, int nr_args)
return 0;
}
-static struct syscall_arg_fmt syscall_arg_fmts__by_name[] = {
+static const struct syscall_arg_fmt syscall_arg_fmts__by_name[] = {
{ .name = "msr", .scnprintf = SCA_X86_MSR, .strtoul = STUL_X86_MSR, },
{ .name = "vector", .scnprintf = SCA_X86_IRQ_VECTORS, .strtoul = STUL_X86_IRQ_VECTORS, },
};
@@ -1684,13 +1687,14 @@ static int syscall_arg_fmt__cmp(const void *name, const void *fmtp)
return strcmp(name, fmt->name);
}
-static struct syscall_arg_fmt *
-__syscall_arg_fmt__find_by_name(struct syscall_arg_fmt *fmts, const int nmemb, const char *name)
+static const struct syscall_arg_fmt *
+__syscall_arg_fmt__find_by_name(const struct syscall_arg_fmt *fmts, const int nmemb,
+ const char *name)
{
return bsearch(name, fmts, nmemb, sizeof(struct syscall_arg_fmt), syscall_arg_fmt__cmp);
}
-static struct syscall_arg_fmt *syscall_arg_fmt__find_by_name(const char *name)
+static const struct syscall_arg_fmt *syscall_arg_fmt__find_by_name(const char *name)
{
const int nmemb = ARRAY_SIZE(syscall_arg_fmts__by_name);
return __syscall_arg_fmt__find_by_name(syscall_arg_fmts__by_name, nmemb, name);
@@ -1735,8 +1739,9 @@ syscall_arg_fmt__init_array(struct syscall_arg_fmt *arg, struct tep_format_field
* 7 unsigned long
*/
arg->scnprintf = SCA_FD;
- } else {
- struct syscall_arg_fmt *fmt = syscall_arg_fmt__find_by_name(field->name);
+ } else {
+ const struct syscall_arg_fmt *fmt =
+ syscall_arg_fmt__find_by_name(field->name);
if (fmt) {
arg->scnprintf = fmt->scnprintf;
@@ -4458,7 +4463,7 @@ static void evsel__set_syscall_arg_fmt(struct evsel *evsel, const char *name)
struct syscall_arg_fmt *fmt = evsel__syscall_arg_fmt(evsel);
if (fmt) {
- struct syscall_fmt *scfmt = syscall_fmt__find(name);
+ const struct syscall_fmt *scfmt = syscall_fmt__find(name);
if (scfmt) {
int skip = 0;
@@ -4525,7 +4530,7 @@ static int trace__parse_events_option(const struct option *opt, const char *str,
int len = strlen(str) + 1, err = -1, list, idx;
char *strace_groups_dir = system_path(STRACE_GROUPS_DIR);
char group_name[PATH_MAX];
- struct syscall_fmt *fmt;
+ const struct syscall_fmt *fmt;
if (strace_groups_dir == NULL)
return -1;
--
2.41.0.rc0.172.g3f132b7071-goog
Allows the movement of 46,072 bytes from .data to .data.rel.ro.
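For illustration only (not from the patch), the difference the extra
const makes, with hypothetical MSR names:
```
/* The array of pointers is itself writable, so it lives in .data. */
static const char *msr_names_writable[] = { "IA32_TSC", "IA32_APIC_BASE" };

/* Making the pointers const as well lets the array move to
 * .data.rel.ro (it still needs relocations for the string addresses,
 * so it cannot reach .rodata under PIE). */
static const char * const msr_names[] = { "IA32_TSC", "IA32_APIC_BASE" };
```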
Signed-off-by: Ian Rogers <[email protected]>
---
tools/perf/trace/beauty/beauty.h | 2 +-
tools/perf/trace/beauty/tracepoints/x86_msr.sh | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/tools/perf/trace/beauty/beauty.h b/tools/perf/trace/beauty/beauty.h
index 4c59edddd6a8..3d12bf0f6d07 100644
--- a/tools/perf/trace/beauty/beauty.h
+++ b/tools/perf/trace/beauty/beauty.h
@@ -11,7 +11,7 @@ struct strarray {
u64 offset;
int nr_entries;
const char *prefix;
- const char **entries;
+ const char * const *entries;
};
#define DEFINE_STRARRAY(array, _prefix) struct strarray strarray__##array = { \
diff --git a/tools/perf/trace/beauty/tracepoints/x86_msr.sh b/tools/perf/trace/beauty/tracepoints/x86_msr.sh
index 0078689963e0..fa3c4418e856 100755
--- a/tools/perf/trace/beauty/tracepoints/x86_msr.sh
+++ b/tools/perf/trace/beauty/tracepoints/x86_msr.sh
@@ -13,7 +13,7 @@ x86_msr_index=${arch_x86_header_dir}/msr-index.h
# Just the ones starting with 0x00000 so as to have a simple
# array.
-printf "static const char *x86_MSRs[] = {\n"
+printf "static const char * const x86_MSRs[] = {\n"
regex='^[[:space:]]*#[[:space:]]*define[[:space:]]+MSR_([[:alnum:]][[:alnum:]_]+)[[:space:]]+(0x00000[[:xdigit:]]+)[[:space:]]*.*'
grep -E $regex ${x86_msr_index} | grep -E -v 'MSR_(ATOM|P[46]|IA32_(TSC_DEADLINE|UCODE_REV)|IDT_FCR4)' | \
sed -r "s/$regex/\2 \1/g" | sort -n | \
@@ -24,7 +24,7 @@ printf "};\n\n"
regex='^[[:space:]]*#[[:space:]]*define[[:space:]]+MSR_([[:alnum:]][[:alnum:]_]+)[[:space:]]+(0xc0000[[:xdigit:]]+)[[:space:]]*.*'
printf "#define x86_64_specific_MSRs_offset "
grep -E $regex ${x86_msr_index} | sed -r "s/$regex/\2/g" | sort -n | head -1
-printf "static const char *x86_64_specific_MSRs[] = {\n"
+printf "static const char * const x86_64_specific_MSRs[] = {\n"
grep -E $regex ${x86_msr_index} | \
sed -r "s/$regex/\2 \1/g" | grep -E -vw 'K6_WHCR' | sort -n | \
xargs printf "\t[%s - x86_64_specific_MSRs_offset] = \"%s\",\n"
@@ -33,7 +33,7 @@ printf "};\n\n"
regex='^[[:space:]]*#[[:space:]]*define[[:space:]]+MSR_([[:alnum:]][[:alnum:]_]+)[[:space:]]+(0xc0010[[:xdigit:]]+)[[:space:]]*.*'
printf "#define x86_AMD_V_KVM_MSRs_offset "
grep -E $regex ${x86_msr_index} | sed -r "s/$regex/\2/g" | sort -n | head -1
-printf "static const char *x86_AMD_V_KVM_MSRs[] = {\n"
+printf "static const char * const x86_AMD_V_KVM_MSRs[] = {\n"
grep -E $regex ${x86_msr_index} | \
sed -r "s/$regex/\2 \1/g" | sort -n | \
xargs printf "\t[%s - x86_AMD_V_KVM_MSRs_offset] = \"%s\",\n"
--
2.41.0.rc0.172.g3f132b7071-goog
FWIW I think the whole patchkit could be replaced with a one liner
that disables THP for the BSS segment. I suspect that would be roughly
equivalent for memory consumption because 4K pages that are never
touched would never be allocated.
-Andi
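(For reference, one form such a one-liner could take - a hedged sketch
using madvise(MADV_NOHUGEPAGE) over the BSS range, relying on the
linker-provided __bss_start/_end symbols; this is not code from the
thread:)
```
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

extern char __bss_start[], _end[];	/* provided by the default GNU ld script */

static void bss_disable_thp(void)
{
	uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = (uintptr_t)__bss_start & ~(page - 1);
	uintptr_t stop = ((uintptr_t)_end + page - 1) & ~(page - 1);

	/* Ask the kernel not to back this range with transparent huge pages. */
	madvise((void *)start, stop - start, MADV_NOHUGEPAGE);
}
```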
On Mon, May 29, 2023 at 9:57 PM Andi Kleen <[email protected]> wrote:
>
>
> FWIW I think the whole patchkit could be replaced with a one liner
> that disables THP for the BSS segment. I suspect that would be roughly
> equivalent for memory consumption because 4K pages that are never
> touched would never be allocated.
>
> -Andi
So, it is worth reading some of the comments in the code to see why a
wider clean-up is worthwhile:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/path.c?h=perf-tools#n7
"It's obviously not thread-safe. Sue me."
(a comment stemming back to the git origins of the code base)
BSS won't count toward file size, which is what the patches were
primarily going after - but on checking the numbers, I had
miscalculated when reading size's output, which I'm not familiar with.
The numbers are still improved, but I now see only a 37kb saving, with
5kb more in .rodata. Something, but not much. .data.rel.ro is larger,
which imo is good, but those pages will still be dirtied, so it's a
moot point wrt file size and memory overhead.
For huge pages I thought it was correct that things are aligned by max
page size which I thought on x86-64 was 2MB, so I tried:
EXTRA_LDFLAGS="-z max-page-size=4096"
but it made no difference to anything, and with:
EXTRA_CFLAGS="-Wl,-z,max-page-size=4096"
EXTRA_CXXFLAGS="-Wl,-z,max-page-size=4096"
file size just got worse.
Thanks,
Ian
> BSS won't count toward file size, which is what the patches were
> primarily going after - but on checking the numbers, I had
> miscalculated when reading size's output, which I'm not familiar with.
> The numbers are still improved, but I now see only a 37kb saving, with
> 5kb more in .rodata. Something, but not much. .data.rel.ro is larger,
> which imo is good, but those pages will still be dirtied, so it's a
> moot point wrt file size and memory overhead.
Given the way perf is written (lots of separate code depending on a
single high-level switch), most pages probably won't be dirtied.
>
> For huge pages I thought it was correct that things are aligned by max
> page size which I thought on x86-64 was 2MB, so I tried:
> EXTRA_LDFLAGS="-z max-page-size=4096"
> but it made no difference to anything, and with:
> EXTRA_CFLAGS="-Wl,-z,max-page-size=4096"
> EXTRA_CXXFLAGS="-Wl,-z,max-page-size=4096"
> file size just got worse.
The default alignment to 2MB was dropped in the GNU toolchain in 2018 or
so.
-Andi
On Tue, May 30, 2023 at 12:59 AM Andi Kleen <[email protected]> wrote:
>
> > BSS won't count toward file size, which is what the patches were
> > primarily going after - but on checking the numbers, I had
> > miscalculated when reading size's output, which I'm not familiar with.
> > The numbers are still improved, but I now see only a 37kb saving, with
> > 5kb more in .rodata. Something, but not much. .data.rel.ro is larger,
> > which imo is good, but those pages will still be dirtied, so it's a
> > moot point wrt file size and memory overhead.
>
> Given the way perf is written (lots of separate code depending on a
> single high-level switch), most pages probably won't be dirtied.
For data, everything is relocated when perf is loaded. Setting a
breakpoint on main and then dumping smaps (edited for brevity), I see:
```
555555554000-5555555f8000 r--p 00000000 fe:01 32936368
/tmp/perf/perf
Size: 656 kB
Pss: 656 kB
Pss_Dirty: 0 kB
5555555f8000-555555828000 r-xp 000a4000 fe:01 32936368
/tmp/perf/perf
Size: 2240 kB
Pss: 32 kB
Pss_Dirty: 8 kB
555555828000-555555f23000 r--p 002d4000 fe:01 32936368
/tmp/perf/perf
Size: 7148 kB
Pss: 64 kB
Pss_Dirty: 0 kB
555555f23000-555555f6d000 r--p 009cf000 fe:01 32936368
/tmp/perf/perf
Size: 296 kB
Pss: 288 kB
Pss_Dirty: 288 kB
555555f6d000-555555f87000 rw-p 00a19000 fe:01 32936368
/tmp/perf/perf
Size: 104 kB
Pss: 104 kB
Pss_Dirty: 104 kB
```
These are roughly header, text, .rodata, .data.rel.ro, .data. So at
the point we enter main we have 392kB of dirty pages in .data.rel.ro
and .data.
For x86 a large contributor to the relocations comes from the insn-x86.c test:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/arch/x86/tests/insn-x86.c?h=perf-tools-next#n21
The test_data_32 and test_data_64 arrays are 75,024 bytes and 93,600
bytes respectively and are in .data.rel.ro; they account for nearly
40% of it.
In gdb at main entry:
```
(gdb) p test_data_32[0]
$1 = {data = "\017\061", '\000' <repeats 12 times>, expected_length =
2, expected_rel = 0,
expected_op_str = 0x555555866adc "", expected_branch_str = 0x555555866adc "",
asm_rep = 0x55555586fa2a "0f 31", ' ' <repeats 16 times>, "\trdtsc "}
```
You can see that all the strings in test_data_32 have been relocated
(even though we haven't run any part of perf yet) and are pointing to
data in .rodata. To avoid these relocations in the output of
jevents.py (pmu-events.c), all the strings are merged into one big
string and the offsets within that string are stored - no relocations
means everything goes in the nice, non-dirty .rodata. As the data in
the insn-x86.c test is also generated, a similar trick could be
performed there. There is also the possibility of separating all the
perf builtins into libraries...
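As a rough sketch of that big-string-plus-offsets trick (hypothetical
data, not the generated pmu-events.c):
```
/* An array of string pointers: every pointer needs a load-time
 * relocation under PIE, which keeps the array out of .rodata. */
static const char *names_with_relocs[] = { "cycles", "instructions" };

/* Instead, concatenate the strings into one blob and store integer
 * offsets. No pointers means no relocations, so both arrays can stay
 * in .rodata. */
static const char big_string[] =
	"cycles\0"
	"instructions";

static const unsigned int name_offsets[] = {
	0,	/* "cycles" */
	7,	/* "instructions" = strlen("cycles") + 1 */
};

static inline const char *get_name(unsigned int idx)
{
	return &big_string[name_offsets[idx]];
}
```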
Thanks,
Ian
> >
> > For huge pages I thought it was correct that things are aligned by max
> > page size which I thought on x86-64 was 2MB, so I tried:
> > EXTRA_LDFLAGS="-z max-page-size=4096"
> > but it made no difference to anything, and with:
> > EXTRA_CFLAGS="-Wl,-z,max-page-size=4096"
> > EXTRA_CXXFLAGS="-Wl,-z,max-page-size=4096"
> > file size just got worse.
>
> The default alignment to 2MB was dropped in the GNU toolchain in 2018 or
> so.
>
> -Andi