2012-06-22 13:36:58

by Ingo Molnar

Subject: [GIT PULL] perf fixes

Linus,

Please pull the latest perf-urgent-for-linus git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf-urgent-for-linus

HEAD: 6921a575c9f26f7ea274aaea3b78967810ce5513 Merge branch 'tip/perf/urgent-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent

Thanks,

Ingo

------------------>
Arnaldo Carvalho de Melo (1):
      perf tools: Fix synthesizing tracepoint names from the perf.data headers

David Ahern (1):
      perf tools: Fix endianity swapping for adds_features bitmask

Salman Qazi (1):
      perf: Use css_tryget() to avoid propping up css refcount

Stephane Eranian (1):
      perf stat: Fix default output file

Steven Rostedt (1):
      ftrace: Make all inline tags also include notrace


 include/linux/compiler-gcc.h           |  6 ++--
 kernel/events/core.c                   | 10 +++++--
 tools/perf/builtin-stat.c              |  8 +++++-
 tools/perf/util/header.c               | 48 +++++++++++++++++++++++++++-----
 tools/perf/util/include/linux/bitops.h |  2 ++
 tools/perf/util/session.c              | 10 +++++++
 tools/perf/util/session.h              |  1 +
 7 files changed, 71 insertions(+), 14 deletions(-)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index e5834aa..6a6d7ae 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -47,9 +47,9 @@
*/
#if !defined(CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING) || \
!defined(CONFIG_OPTIMIZE_INLINING) || (__GNUC__ < 4)
-# define inline inline __attribute__((always_inline))
-# define __inline__ __inline__ __attribute__((always_inline))
-# define __inline __inline __attribute__((always_inline))
+# define inline inline __attribute__((always_inline)) notrace
+# define __inline__ __inline__ __attribute__((always_inline)) notrace
+# define __inline __inline __attribute__((always_inline)) notrace
#else
/* A lot of inline functions can cause havoc with function tracing */
# define inline inline notrace
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f85c015..d7d71d6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -253,9 +253,9 @@ perf_cgroup_match(struct perf_event *event)
return !event->cgrp || event->cgrp == cpuctx->cgrp;
}

-static inline void perf_get_cgroup(struct perf_event *event)
+static inline bool perf_tryget_cgroup(struct perf_event *event)
{
- css_get(&event->cgrp->css);
+ return css_tryget(&event->cgrp->css);
}

static inline void perf_put_cgroup(struct perf_event *event)
@@ -484,7 +484,11 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
event->cgrp = cgrp;

/* must be done before we fput() the file */
- perf_get_cgroup(event);
+ if (!perf_tryget_cgroup(event)) {
+ event->cgrp = NULL;
+ ret = -ENOENT;
+ goto out;
+ }

/*
* all events in a group must monitor
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2625899..07b5c77 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1179,6 +1179,12 @@ int cmd_stat(int argc, const char **argv, const char *prefix __used)
fprintf(stderr, "cannot use both --output and --log-fd\n");
usage_with_options(stat_usage, options);
}
+
+ if (output_fd < 0) {
+ fprintf(stderr, "argument to --log-fd must be a > 0\n");
+ usage_with_options(stat_usage, options);
+ }
+
if (!output) {
struct timespec tm;
mode = append_file ? "a" : "w";
@@ -1190,7 +1196,7 @@ int cmd_stat(int argc, const char **argv, const char *prefix __used)
}
clock_gettime(CLOCK_REALTIME, &tm);
fprintf(output, "# started on %s\n", ctime(&tm.tv_sec));
- } else if (output_fd != 2) {
+ } else if (output_fd > 0) {
mode = append_file ? "a" : "w";
output = fdopen(output_fd, mode);
if (!output) {
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 2dd5edf..e909d43 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1942,7 +1942,6 @@ int perf_file_header__read(struct perf_file_header *header,
else
return -1;
} else if (ph->needs_swap) {
- unsigned int i;
/*
* feature bitmap is declared as an array of unsigned longs --
* not good since its size can differ between the host that
@@ -1958,14 +1957,17 @@ int perf_file_header__read(struct perf_file_header *header,
* file), punt and fallback to the original behavior --
* clearing all feature bits and setting buildid.
*/
- for (i = 0; i < BITS_TO_LONGS(HEADER_FEAT_BITS); ++i)
- header->adds_features[i] = bswap_64(header->adds_features[i]);
+ mem_bswap_64(&header->adds_features,
+ BITS_TO_U64(HEADER_FEAT_BITS));

if (!test_bit(HEADER_HOSTNAME, header->adds_features)) {
- for (i = 0; i < BITS_TO_LONGS(HEADER_FEAT_BITS); ++i) {
- header->adds_features[i] = bswap_64(header->adds_features[i]);
- header->adds_features[i] = bswap_32(header->adds_features[i]);
- }
+ /* unswap as u64 */
+ mem_bswap_64(&header->adds_features,
+ BITS_TO_U64(HEADER_FEAT_BITS));
+
+ /* unswap as u32 */
+ mem_bswap_32(&header->adds_features,
+ BITS_TO_U32(HEADER_FEAT_BITS));
}

if (!test_bit(HEADER_HOSTNAME, header->adds_features)) {
@@ -2091,6 +2093,35 @@ static int read_attr(int fd, struct perf_header *ph,
return ret <= 0 ? -1 : 0;
}

+static int perf_evsel__set_tracepoint_name(struct perf_evsel *evsel)
+{
+ struct event_format *event = trace_find_event(evsel->attr.config);
+ char bf[128];
+
+ if (event == NULL)
+ return -1;
+
+ snprintf(bf, sizeof(bf), "%s:%s", event->system, event->name);
+ evsel->name = strdup(bf);
+ if (event->name == NULL)
+ return -1;
+
+ return 0;
+}
+
+static int perf_evlist__set_tracepoint_names(struct perf_evlist *evlist)
+{
+ struct perf_evsel *pos;
+
+ list_for_each_entry(pos, &evlist->entries, node) {
+ if (pos->attr.type == PERF_TYPE_TRACEPOINT &&
+ perf_evsel__set_tracepoint_name(pos))
+ return -1;
+ }
+
+ return 0;
+}
+
int perf_session__read_header(struct perf_session *session, int fd)
{
struct perf_header *header = &session->header;
@@ -2172,6 +2203,9 @@ int perf_session__read_header(struct perf_session *session, int fd)

lseek(fd, header->data_offset, SEEK_SET);

+ if (perf_evlist__set_tracepoint_names(session->evlist))
+ goto out_delete_evlist;
+
header->frozen = 1;
return 0;
out_errno:
diff --git a/tools/perf/util/include/linux/bitops.h b/tools/perf/util/include/linux/bitops.h
index f1584833..587a230 100644
--- a/tools/perf/util/include/linux/bitops.h
+++ b/tools/perf/util/include/linux/bitops.h
@@ -8,6 +8,8 @@
#define BITS_PER_LONG __WORDSIZE
#define BITS_PER_BYTE 8
#define BITS_TO_LONGS(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
+#define BITS_TO_U64(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u64))
+#define BITS_TO_U32(nr) DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u32))

#define for_each_set_bit(bit, addr, size) \
for ((bit) = find_first_bit((addr), (size)); \
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 2600916..c3e399b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -442,6 +442,16 @@ static void perf_tool__fill_defaults(struct perf_tool *tool)
tool->finished_round = process_finished_round_stub;
}
}
+
+void mem_bswap_32(void *src, int byte_size)
+{
+ u32 *m = src;
+ while (byte_size > 0) {
+ *m = bswap_32(*m);
+ byte_size -= sizeof(u32);
+ ++m;
+ }
+}

void mem_bswap_64(void *src, int byte_size)
{
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 7a5434c..0c702e3 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -80,6 +80,7 @@ struct branch_info *machine__resolve_bstack(struct machine *self,
bool perf_session__has_traces(struct perf_session *self, const char *msg);

void mem_bswap_64(void *src, int byte_size);
+void mem_bswap_32(void *src, int byte_size);
void perf_event__attr_swap(struct perf_event_attr *attr);

int perf_session__create_kernel_maps(struct perf_session *self);


2012-06-22 18:07:26

by Linus Torvalds

Subject: Re: [GIT PULL] perf fixes

On Fri, Jun 22, 2012 at 6:36 AM, Ingo Molnar <[email protected]> wrote:
>
> Steven Rostedt (1):
>      ftrace: Make all inline tags also include notrace

Btw, this is something I've been wondering about: function call
tracing and inlining seems to be fundamentally incompatible.

And gcc can (and does, depending on version and details) inline pretty
much any static function, whether we mark it inline or not.

Now, there's no question that we don't want inlined functions to be
traced, but that actually means that the *logical* thing would be to
try to somehow tell gcc to not ever do the whole stupid mcount thing
for functions that *might* be inlined - and at least be consistent
about it.

IOW, is there some way to get the mcount thing to only happen for
functions that either have their address taken, or have external
visibility?

Because that mcount thing is expensive as hell, if people haven't
noticed (and I'm not talking about just the call instruction that I
think we can stub out - it changes code generation in other ways too).
And it looks like distros enable it by default, which annoys my
performance-optimizing soul deeply.

So doing it a bit less would be lovely.

Linus

2012-06-22 18:48:52

by Hagen Paul Pfeifer

Subject: Re: [GIT PULL] perf fixes

* Linus Torvalds | 2012-06-22 11:07:01 [-0700]:

>And gcc can (and does, depending on version and details) inline pretty
>much any static function, whether we mark it inline or not.
>
>Now, there's no question that we don't want inlined functions to be
>traced, but that actually means that the *logical* thing would be to
>try to somehow tell gcc to not ever do the whole stupid mcount thing
>for functions that *might* be inlined - and at least be consistent
>about it.
>
>IOW, is there some way to get the mcount thing to only happen for
>functions that either have their address taken, or have external
>visibility?
>
>Because that mcount thing is expensive as hell, if people haven't
>noticed (and I'm not talking about just the call instruction that I
>think we can stub out - it changes code generation in other ways too).
>And it looks like distros enable it by default, which annoys my
>performance-optimizing soul deeply.

Isn't it stubbed out already? It's already replaced by nops at boot time by
ftrace_code_disable() and friends!? But yes, there may be spots where the
additional mcount() call prevents optimization.

Hagen

2012-06-22 18:50:46

by Steven Rostedt

Subject: Re: [GIT PULL] perf fixes

On Fri, 2012-06-22 at 11:07 -0700, Linus Torvalds wrote:
> On Fri, Jun 22, 2012 at 6:36 AM, Ingo Molnar <[email protected]> wrote:
> >
> > Steven Rostedt (1):
> > ftrace: Make all inline tags also include notrace
>
> Btw, this is something I've been wondering about: function call
> tracing and inlining seems to be fundamentally incompatible.

True, the -pg option never adds the mcount call to any function that
gets inlined. But just an FYI, the alternative -finstrument-functions,
which traces both entry and exit of the function, instruments even
inlined functions. Which is one of the reasons I totally avoided it.
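
For anyone who hasn't played with it: -finstrument-functions emits calls to
a pair of hooks around every function body, inlined or not. A minimal
user-space sketch (not kernel code, demo names made up), built with
something like "gcc -O2 -finstrument-functions demo.c":

#include <stdio.h>

/* The hooks themselves must not be instrumented, or they recurse. */
void __cyg_profile_func_enter(void *fn, void *call_site)
	__attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *fn, void *call_site)
	__attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *fn, void *call_site)
{
	fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

void __cyg_profile_func_exit(void *fn, void *call_site)
{
	fprintf(stderr, "exit  %p (called from %p)\n", fn, call_site);
}

/* Even if gcc inlines this into main(), it still gets enter/exit hooks. */
static inline int add_one(int x)
{
	return x + 1;
}

int main(void)
{
	return add_one(41) - 42;
}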

>
> And gcc can (and does, depending on version and details) inline pretty
> much any static function, whether we mark it inline or not.

Right, which means that those do not get traced either.

>
> Now, there's no question that we don't want inlined functions to be
> traced, but that actually means that the *logical* thing would be to
> try to somehow tell gcc to not ever do the whole stupid mcount thing
> for functions that *might* be inlined - and at least be consistent
> about it.

Hmm, I'm not sure how to tell gcc that :-/
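
The closest thing we have today is the per-function opt-out, which is what
the patch in Ingo's pull folds into the inline keyword itself. Roughly (a
sketch, not the verbatim kernel headers):

/* notrace expands to gcc's no_instrument_function attribute, and the
 * compiler-gcc.h change above makes every inline tag carry it too. */
#define notrace		__attribute__((no_instrument_function))
#define inline		inline __attribute__((always_inline)) notrace

/* A helper marked like this never gets an mcount/fentry call site,
 * whether or not gcc actually inlines it.  (Name made up.) */
static inline int example_helper(int x)
{
	return x + 1;
}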

>
> IOW, is there some way to get the mcount thing to only happen for
> functions that either have their address taken, or have external
> visibility?
>
> Because that mcount thing is expensive as hell, if people haven't
> noticed (and I'm not talking about just the call instruction that I
> think we can stub out

It is stubbed out. Has been since day one, when DYNAMIC_FTRACE is
supported and enabled.

> - it changes code generation in other ways too).

One thing it does, which I hate, is that it enables (forces) frame
pointers.

> And it looks like distros enable it by default, which annoys my
> performance-optimizing soul deeply.

We have been working on an alternative: -mfentry. I have working code that
is still in the testing phase (and looking good!).

When you add -mfentry along with -pg, then instead of calling mcount (which
comes after the frame pointer has been set up) it calls fentry as the very
first instruction in the function. It should not interfere with any other
code generation.

-mfentry has been supported since gcc 4.6.0, and only for x86 (thanks to
Andi Kleen for doing this). Here's my first email that described it a
little (I've been working on various versions, but it has settled down
recently):

https://lkml.org/lkml/2011/2/9/271
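
If you want to see the difference for yourself, something like this is
enough (a sketch; needs gcc >= 4.6 on x86, file and function names made up):

/* demo.c -- any non-trivial function will do */
int demo(int x)
{
	if (!x)
		return 0;
	return x * 2;
}

/*
 * gcc -O2 -pg          -S demo.c   ->  "call mcount" after the function
 *                                      prologue has been set up
 * gcc -O2 -pg -mfentry -S demo.c   ->  "call __fentry__" as the very first
 *                                      instruction, before any prologue
 */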

Back then, one of the issues I worried about was its interaction with
kprobes. As kprobes are commonly inserted at the first instruction of a
function, and a kprobe could not be placed on an ftrace nop, this would
cause issues for users wanting to insert their probe at the beginning of
the function.

But Masami has already helped me fix this up. Both the kprobes issue and
the fentry-without-framepointers code are working well. I was going to
push it for 3.7 so that I can validate it a bit more.

Here's my last RFC patch set for the kprobes/ftrace work. Where kprobes
can actually be optimized by using the ftrace infrastructure.

https://lkml.org/lkml/2012/6/12/668

-- Steve


>
> So doing it a bit less would be lovely.
>
> Linus

2012-06-22 18:53:08

by Linus Torvalds

Subject: Re: [GIT PULL] perf fixes

On Fri, Jun 22, 2012 at 11:38 AM, Hagen Paul Pfeifer <[email protected]> wrote:
>>
>>Because that mcount thing is expensive as hell, if people haven't
>>noticed (and I'm not talking about just the call instruction that I
>>think we can stub out - it changes code generation in other ways too).
>>And it looks like distros enable it by default, which annoys my
>>performance-optimizing soul deeply.
>
> Isn't it stubbed out already? It's already replaced by nops at boot time by
> ftrace_code_disable() and friends!? But yes, there may be spots where the
> additional mcount() call prevents optimization.

So even stubbed out, it's quite noticeable. The call causes the
function prologue to change quite a bit.

That's actually especially true with newer versions of gcc that
*finally* seem to have done the "don't always generate the full
prologue if some case doesn't need it" optimization. So functions that
have early-out conditions (quite common) will exit before even having
done the prologue, and without doing the whole frame pointer setup
etc.

Except if mcount generation is on. Then gcc will always do the
prologue and frame pointer setup before doing the mcount, because
mcount wants it.

So it really isn't just the extra call instruction.
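
A trivial made-up example of the pattern I mean:

struct thing;				/* opaque, just for illustration */
int do_expensive_work(struct thing *t);	/* made-up callee */

int maybe_do_work(struct thing *t)
{
	if (!t)		/* early out: newer gcc can return from here with no frame setup */
		return 0;

	return do_expensive_work(t);
}

With mcount generation on, gcc does the whole prologue and frame pointer
setup before it even gets to test 't'.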

I may be more sensitive to this than most, because I look at profiles
and the function prologue just looks very ugly with the call mcount
thing. Ugh.

Linus

2012-06-22 19:06:35

by Hagen Paul Pfeifer

Subject: Re: [GIT PULL] perf fixes

* Linus Torvalds | 2012-06-22 11:52:43 [-0700]:

>So even stubbed out, it's quite noticeable. The call causes the
>function prologue to change quite a bit.
>
>That's actually especially true with newer versions of gcc that
>*finally* seem to have done the "don't always generate the full
>prologue if some case doesn't need it" optimization. So functions that
>have early-out conditions (quite common) will exit before even having
>done the prologue, and without doing the whole frame pointer setup
>etc.
>
>Except if mcount generation is on. Then gcc will always do the
>prologue and frame pointer setup before doing the mcount, because
>mcount wants it.
>
>So it really isn't just the extra call instruction.
>
>I may be more sensitive to this than most, because I look at profiles
>and the function prologue just looks very ugly with the call mcount
>thing. Ugh.

Yes, ugh. Even Steven's -mfentry replacement does not change things here.
Maybe this performance problem should be addressed on another level:
distributors should build their kernels with ftrace support disabled. Kprobes
and function tracing should be sufficient for 98% of their customers.

I see no way around this performance issue on the compiler (gcc) side. Maybe
you/Steven should restart a G+ poll ... ;)

Hagen

2012-06-22 19:55:01

by Steven Rostedt

Subject: Re: [GIT PULL] perf fixes

On Fri, 2012-06-22 at 21:06 +0200, Hagen Paul Pfeifer wrote:

> >I may be more sensitive to this than most, because I look at profiles
> >and the function prologue just looks very ugly with the call mcount
> >thing. Ugh.
>
> Yes, ugh. Even Steven's -mfentry replacement does not change things here.

Why doesn't -mfentry help here? The link I showed still had frame
pointers enabled. With -mfentry, frame pointers do not need to be
enabled. And my latest patches do not automatically enable frame
pointers when enabling function tracing if -mfentry is supported.

I just ran a bunch of compiles against kernel/sched/core.c:

no pg, no mfentry, no fp:

0000000000000882 <schedule>:
882: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
889: 00 00
887: R_X86_64_32S current_task
88b: 57 push %rdi
88c: 48 8b 10 mov (%rax),%rdx
88f: 48 85 d2 test %rdx,%rdx
892: 74 45 je 8d9 <schedule+0x57>
894: 48 83 b8 60 06 00 00 cmpq $0x0,0x660(%rax)
89b: 00
89c: 75 3b jne 8d9 <schedule+0x57>
89e: 48 8b b8 60 0e 00 00 mov 0xe60(%rax),%rdi
8a5: 31 c0 xor %eax,%eax
8a7: 48 85 ff test %rdi,%rdi
8aa: 74 1a je 8c6 <schedule+0x44>
8ac: 48 8d 57 08 lea 0x8(%rdi),%rdx
8b0: 48 39 57 08 cmp %rdx,0x8(%rdi)
8b4: b0 01 mov $0x1,%al
8b6: 75 0e jne 8c6 <schedule+0x44>
8b8: 48 8d 47 18 lea 0x18(%rdi),%rax
8bc: 48 39 47 18 cmp %rax,0x18(%rdi)
8c0: 0f 95 c0 setne %al
8c3: 0f b6 c0 movzbl %al,%eax
8c6: 85 c0 test %eax,%eax
8c8: 74 0f je 8d9 <schedule+0x57>
8ca: 48 85 ff test %rdi,%rdi
8cd: 74 0a je 8d9 <schedule+0x57>
8cf: be 01 00 00 00 mov $0x1,%esi
8d4: e8 00 00 00 00 callq 8d9 <schedule+0x57>
8d5: R_X86_64_PC32 blk_flush_plug_list-0x4
8d9: e8 70 f9 ff ff callq 24e <__schedule>
8de: 5e pop %rsi
8df: c3 retq


no pg, no mfentry, with fp:

00000000000008cb <schedule>:
8cb: 55 push %rbp
8cc: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
8d3: 00 00
8d1: R_X86_64_32S current_task
8d5: 48 8b 10 mov (%rax),%rdx
8d8: 48 89 e5 mov %rsp,%rbp
8db: 48 85 d2 test %rdx,%rdx
8de: 74 45 je 925 <schedule+0x5a>
8e0: 48 83 b8 60 06 00 00 cmpq $0x0,0x660(%rax)
8e7: 00
8e8: 75 3b jne 925 <schedule+0x5a>
8ea: 48 8b b8 60 0e 00 00 mov 0xe60(%rax),%rdi
8f1: 31 c0 xor %eax,%eax
8f3: 48 85 ff test %rdi,%rdi
8f6: 74 1a je 912 <schedule+0x47>
8f8: 48 8d 57 08 lea 0x8(%rdi),%rdx
8fc: 48 39 57 08 cmp %rdx,0x8(%rdi)
900: b0 01 mov $0x1,%al
902: 75 0e jne 912 <schedule+0x47>
904: 48 8d 47 18 lea 0x18(%rdi),%rax
908: 48 39 47 18 cmp %rax,0x18(%rdi)
90c: 0f 95 c0 setne %al
90f: 0f b6 c0 movzbl %al,%eax
912: 85 c0 test %eax,%eax
914: 74 0f je 925 <schedule+0x5a>
916: 48 85 ff test %rdi,%rdi
919: 74 0a je 925 <schedule+0x5a>
91b: be 01 00 00 00 mov $0x1,%esi
920: e8 00 00 00 00 callq 925 <schedule+0x5a>
921: R_X86_64_PC32 blk_flush_plug_list-0x4
925: e8 41 f9 ff ff callq 26b <__schedule>
92a: 5d pop %rbp
92b: c3 retq

The above is our baseline. Now let's look at the current -pg:

with pg, no mfentry, with fp:

000000000000090c <schedule>:
90c: 55 push %rbp
90d: 48 89 e5 mov %rsp,%rbp
910: e8 00 00 00 00 callq 915 <schedule+0x9>
911: R_X86_64_PC32 mcount-0x4
915: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
91c: 00 00
91a: R_X86_64_32S current_task
91e: 48 8b 10 mov (%rax),%rdx
921: 48 85 d2 test %rdx,%rdx
924: 74 45 je 96b <schedule+0x5f>
926: 48 83 b8 60 06 00 00 cmpq $0x0,0x660(%rax)
92d: 00
92e: 75 3b jne 96b <schedule+0x5f>
930: 48 8b b8 60 0e 00 00 mov 0xe60(%rax),%rdi
937: 31 c0 xor %eax,%eax
939: 48 85 ff test %rdi,%rdi
93c: 74 1a je 958 <schedule+0x4c>
93e: 48 8d 57 08 lea 0x8(%rdi),%rdx
942: 48 39 57 08 cmp %rdx,0x8(%rdi)
946: b0 01 mov $0x1,%al
948: 75 0e jne 958 <schedule+0x4c>
94a: 48 8d 47 18 lea 0x18(%rdi),%rax
94e: 48 39 47 18 cmp %rax,0x18(%rdi)
952: 0f 95 c0 setne %al
955: 0f b6 c0 movzbl %al,%eax
958: 85 c0 test %eax,%eax
95a: 74 0f je 96b <schedule+0x5f>
95c: 48 85 ff test %rdi,%rdi
95f: 74 0a je 96b <schedule+0x5f>
961: be 01 00 00 00 mov $0x1,%esi
966: e8 00 00 00 00 callq 96b <schedule+0x5f>
967: R_X86_64_PC32 blk_flush_plug_list-0x4
96b: e8 37 f9 ff ff callq 2a7 <__schedule>
970: 5d pop %rbp
971: c3 retq

Looks like %rsp is saved into %rbp here (the frame pointer setup), in addition
to the call to mcount.

-pg must have frame pointers when -mfentry is not included, so there is
no 'with pg, no mfentry, no fp'. Now let's look at -mfentry:

with pg, with mfentry, with fp:

000000000000090c <schedule>:
90c: e8 00 00 00 00 callq 911 <schedule+0x5>
90d: R_X86_64_PC32 __fentry__-0x4
911: 55 push %rbp
912: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
919: 00 00
917: R_X86_64_32S current_task
91b: 48 8b 10 mov (%rax),%rdx
91e: 48 89 e5 mov %rsp,%rbp
921: 48 85 d2 test %rdx,%rdx
924: 74 45 je 96b <schedule+0x5f>
926: 48 83 b8 60 06 00 00 cmpq $0x0,0x660(%rax)
92d: 00
92e: 75 3b jne 96b <schedule+0x5f>
930: 48 8b b8 60 0e 00 00 mov 0xe60(%rax),%rdi
937: 31 c0 xor %eax,%eax
939: 48 85 ff test %rdi,%rdi
93c: 74 1a je 958 <schedule+0x4c>
93e: 48 8d 57 08 lea 0x8(%rdi),%rdx
942: 48 39 57 08 cmp %rdx,0x8(%rdi)
946: b0 01 mov $0x1,%al
948: 75 0e jne 958 <schedule+0x4c>
94a: 48 8d 47 18 lea 0x18(%rdi),%rax
94e: 48 39 47 18 cmp %rax,0x18(%rdi)
952: 0f 95 c0 setne %al
955: 0f b6 c0 movzbl %al,%eax
958: 85 c0 test %eax,%eax
95a: 74 0f je 96b <schedule+0x5f>
95c: 48 85 ff test %rdi,%rdi
95f: 74 0a je 96b <schedule+0x5f>
961: be 01 00 00 00 mov $0x1,%esi
966: e8 00 00 00 00 callq 96b <schedule+0x5f>
967: R_X86_64_PC32 blk_flush_plug_list-0x4
96b: e8 37 f9 ff ff callq 2a7 <__schedule>
970: 5d pop %rbp
971: c3 retq


It is identical to the non-pg build with frame pointers, except that we added
a call to fentry at the start of the function.

with pg, with mfentry, no fp:

00000000000008c3 <schedule>:
8c3: e8 00 00 00 00 callq 8c8 <schedule+0x5>
8c4: R_X86_64_PC32 __fentry__-0x4
8c8: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
8cf: 00 00
8cd: R_X86_64_32S current_task
8d1: 57 push %rdi
8d2: 48 8b 10 mov (%rax),%rdx
8d5: 48 85 d2 test %rdx,%rdx
8d8: 74 45 je 91f <schedule+0x5c>
8da: 48 83 b8 60 06 00 00 cmpq $0x0,0x660(%rax)
8e1: 00
8e2: 75 3b jne 91f <schedule+0x5c>
8e4: 48 8b b8 60 0e 00 00 mov 0xe60(%rax),%rdi
8eb: 31 c0 xor %eax,%eax
8ed: 48 85 ff test %rdi,%rdi
8f0: 74 1a je 90c <schedule+0x49>
8f2: 48 8d 57 08 lea 0x8(%rdi),%rdx
8f6: 48 39 57 08 cmp %rdx,0x8(%rdi)
8fa: b0 01 mov $0x1,%al
8fc: 75 0e jne 90c <schedule+0x49>
8fe: 48 8d 47 18 lea 0x18(%rdi),%rax
902: 48 39 47 18 cmp %rax,0x18(%rdi)
906: 0f 95 c0 setne %al
909: 0f b6 c0 movzbl %al,%eax
90c: 85 c0 test %eax,%eax
90e: 74 0f je 91f <schedule+0x5c>
910: 48 85 ff test %rdi,%rdi
913: 74 0a je 91f <schedule+0x5c>
915: be 01 00 00 00 mov $0x1,%esi
91a: e8 00 00 00 00 callq 91f <schedule+0x5c>
91b: R_X86_64_PC32 blk_flush_plug_list-0x4
91f: e8 66 f9 ff ff callq 28a <__schedule>
924: 5e pop %rsi
925: c3 retq

Now here's the big difference from -pg. This is identical to compiling
without frame pointers with the exception of the fentry call at the
start of the function.

Now what's the issue with function prologues with -mfentry?

-- Steve

2012-06-22 23:18:47

by Steven Rostedt

Subject: Re: [GIT PULL] perf fixes

On Fri, 2012-06-22 at 22:08 +0200, Hagen Paul Pfeifer wrote:
> Rephrase "do not help": it helps for the framepointer aspect, no
> doubt. So mfentry is superior in that aspect. But it still generates a
> function call.

It generates a function call that at boot up is converted to a 5 byte
nop. The output ends up being:

00000000000008c3 <schedule>:
8c3: 0f 1f 44 00 00 nop
8c8: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
8cf: 00 00
8cd: R_X86_64_32S current_task
8d1: 57 push %rdi
8d2: 48 8b 10 mov (%rax),%rdx
8d5: 48 85 d2 test %rdx,%rdx
8d8: 74 45 je 91f <schedule+0x5c>
8da: 48 83 b8 60 06 00 00 cmpq $0x0,0x660(%rax)
8e1: 00
8e2: 75 3b jne 91f <schedule+0x5c>
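
(Conceptually the conversion is nothing more than this; a very simplified
sketch, not the actual ftrace code, and the patching helper is made up:)

/* Each "call __fentry__" site is recorded in the __mcount_loc section at
 * link time; at boot, dynamic ftrace walks that table and rewrites every
 * site to the 5-byte nop shown above. */
static const unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

static void nop_out_call_sites(unsigned long *start, unsigned long *end)
{
	unsigned long *site;

	for (site = start; site < end; site++)
		patch_text_bytes((void *)*site, nop5, sizeof(nop5)); /* hypothetical helper */
}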

> Which in turn affects code generation.

How so? It's not a C function call (like what -finstrument-functions
produces). It's an assembly function call. The only difference between
having ftrace enabled and ftrace disabled with -mfentry is that you get
a 5 byte nop at the start of each traceable function. Sure, it might put
a little pressure on the icache, but from the benchmarks I've run, the
impact has all been within the noise.

I've been told that it doesn't even hurt the pipeline. But I've Cc'd hpa
and Arjan for their comments. How much impact does a 5 byte nop at the
start of each function really have on the normal operations of the
kernel?

> That was the second concern of Linus, regarding mcount.

Again, the only difference is those 5 bytes. There's no other code
generation difference that I know of.

-- Steve


2012-06-23 00:51:55

by Arjan van de Ven

Subject: Re: [GIT PULL] perf fixes

On 6/22/2012 4:18 PM, Steven Rostedt wrote:
> How so? It's not a C function call (like what -finstrument-functions
> produces). It's an assembly function call. The only difference between
> having ftrace enabled and ftrace disabled with -mfentry is that you get
> a 5 byte nop at the start of each traceable function. Sure, it might put
> a little pressure on the icache, but from the benchmarks I've run, the
> impact has all been within the noise.
>
> I've been told that it doesn't even hurt the pipeline. But I've Cc'd hpa
> and Arjan for their comments. How much impact does a 5 byte nop at the
> start of each function really have on the normal operations of the
> kernel?
>

if it's truly an official nop, it will take decoder bandwidth obviously
(the decoders can do 3 to 4 instructions per cycle, depending on the
brand/model of CPU and the total size in bytes of these instructions).
likewise, at the end of the out-of-order pipeline, NOPs may take a
retirement slot (again 3 to 4 instructions per cycle)

the icache cost is there as well, and if the NOP actually changes cpu flags
(some of the less fortunate ones do), that can create a false data dependency.


I would also worry about the compiler being able to inline a function
containing one of these, but that's a compiler thing, not a CPU type of
thing.

2012-06-23 01:57:43

by Steven Rostedt

Subject: Re: [GIT PULL] perf fixes

On Fri, 2012-06-22 at 17:51 -0700, Arjan van de Ven wrote:

> I would also worry about the compiler being able to inline a function
> containing one of these, but that's a compiler thing, not a CPU type of
> thing.

You missed the beginning of the thread where we talked about the fact
that gcc will not put a mcount (or fentry) in an inlined function, so
that's not a worry.

-- Steve

2012-06-23 18:25:45

by H. Peter Anvin

Subject: Re: [GIT PULL] perf fixes

On 06/22/2012 05:51 PM, Arjan van de Ven wrote:
>
> if it's truly an official nop, it will take decoder bandwidth obviously
> (the decoders can do 3 to 4 instructions per cycle, depending on the
> brand/model of CPU and the total size in bytes of these instructions).
> likewise, at the end of the out-of-order pipeline, NOPs may take a
> retirement slot (again 3 to 4 instructions per cycle)
>
> the icache cost is there as well, and if the NOP actually changes cpu flags
> (some of the less fortunate ones do), that can create a false data dependency.
>

If it changes CPU flags it's most definitely not a NOP.

Either way, the 0F 1F ops are official, and listed in the SDM.

-hpa