The current use of atomics can lead to test failures, as tests (such as
tests/shell/record.sh) search for samples with "test_loop" as the
top-most stack frame, but find frames related to the atomic operation
(e.g. __aarch64_ldadd4_relax).
This change simply removes the "count" variable, as it is not necessary.
Fixes: 1962ab6f6e0b ("perf test workload thloop: Make count increments atomic")
Signed-off-by: Nick Forrington <[email protected]>
---
tools/perf/tests/workloads/thloop.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/tools/perf/tests/workloads/thloop.c b/tools/perf/tests/workloads/thloop.c
index af05269c2eb8..457b29f91c3e 100644
--- a/tools/perf/tests/workloads/thloop.c
+++ b/tools/perf/tests/workloads/thloop.c
@@ -7,7 +7,6 @@
#include "../tests.h"
static volatile sig_atomic_t done;
-static volatile unsigned count;
/* We want to check this symbol in perf report */
noinline void test_loop(void);
@@ -19,8 +18,7 @@ static void sighandler(int sig __maybe_unused)
noinline void test_loop(void)
{
- while (!done)
- __atomic_fetch_add(&count, 1, __ATOMIC_RELAXED);
+ while (!done);
}
static void *thfunc(void *arg)
--
2.42.0
On 02/11/2023 16:22, Nick Forrington wrote:
> The current use of atomics can lead to test failures, as tests (such as
> tests/shell/record.sh) search for samples with "test_loop" as the
> top-most stack frame, but find frames related to the atomic operation
> (e.g. __aarch64_ldadd4_relax).
>
> This change simply removes the "count" variable, as it is not necessary.
>
> Fixes: 1962ab6f6e0b ("perf test workload thloop: Make count increments atomic")
> Signed-off-by: Nick Forrington <[email protected]>
> ---
> tools/perf/tests/workloads/thloop.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/tools/perf/tests/workloads/thloop.c b/tools/perf/tests/workloads/thloop.c
> index af05269c2eb8..457b29f91c3e 100644
> --- a/tools/perf/tests/workloads/thloop.c
> +++ b/tools/perf/tests/workloads/thloop.c
> @@ -7,7 +7,6 @@
> #include "../tests.h"
>
> static volatile sig_atomic_t done;
> -static volatile unsigned count;
>
> /* We want to check this symbol in perf report */
> noinline void test_loop(void);
> @@ -19,8 +18,7 @@ static void sighandler(int sig __maybe_unused)
>
> noinline void test_loop(void)
> {
> - while (!done)
> - __atomic_fetch_add(&count, 1, __ATOMIC_RELAXED);
> + while (!done);
> }
>
> static void *thfunc(void *arg)
Reviewed-by: James Clark <[email protected]>
Em Fri, Nov 03, 2023 at 09:14:45AM +0000, James Clark escreveu:
> On 02/11/2023 16:22, Nick Forrington wrote:
> > The current use of atomics can lead to test failures, as tests (such as
> > tests/shell/record.sh) search for samples with "test_loop" as the
> > top-most stack frame, but find frames related to the atomic operation
> > (e.g. __aarch64_ldadd4_relax).
> > This change simply removes the "count" variable, as it is not necessary.
> > Fixes: 1962ab6f6e0b ("perf test workload thloop: Make count increments atomic")
> > Signed-off-by: Nick Forrington <[email protected]>
> > +++ b/tools/perf/tests/workloads/thloop.c
> > @@ -7,7 +7,6 @@
> > #include "../tests.h"
> > static volatile sig_atomic_t done;
> > -static volatile unsigned count;
> > /* We want to check this symbol in perf report */
> > noinline void test_loop(void);
> > @@ -19,8 +18,7 @@ static void sighandler(int sig __maybe_unused)
> > noinline void test_loop(void)
> > {
> > - while (!done)
> > - __atomic_fetch_add(&count, 1, __ATOMIC_RELAXED);
> > + while (!done);
> > }
> > static void *thfunc(void *arg)
> Reviewed-by: James Clark <[email protected]>
Thanks, applied to perf-tools-next.
- Arnaldo
On Thu, 2 Nov 2023, Nick Forrington wrote:
> The current use of atomics can lead to test failures, as tests (such as
> tests/shell/record.sh) search for samples with "test_loop" as the
> top-most stack frame, but find frames related to the atomic operation
> (e.g. __aarch64_ldadd4_relax).
>
> This change simply removes the "count" variable, as it is not necessary.
Hello.
I believe that it was there to prevent the compiler to optimize the loop
out or some reason like that. Hopefully, it will work even without that
on all architectures with all compilers that are used for building perf...
I also think that the tests should be designed in a more robust way, so
that they aren't directly dependent on exact stack frames, e.g. let's look
at the "inet_pton" testcase...
The inet_pton test result check algorithm is designed to rely on exact
stacktrace, without a possibility to specify something like "we want this
and that in the stack trace, but otherwise, it does not matter which
wrappers are aroung". Then there must be the following code to adjust
the expected output exactly per architecture:
echo "ping[][0-9 \.:]+$event_name: \([[:xdigit:]]+\)" > $expected
echo ".*inet_pton\+0x[[:xdigit:]]+[[:space:]]\($libc|inlined\)$" >> $expected
case "$(uname -m)" in
s390x)
eventattr='call-graph=dwarf,max-stack=4'
echo "(__GI_)?getaddrinfo\+0x[[:xdigit:]]+[[:space:]]\($libc|inlined\)$" >> $expected
echo "main\+0x[[:xdigit:]]+[[:space:]]\(.*/bin/ping.*\)$" >> $expected
;;
ppc64|ppc64le)
eventattr='max-stack=4'
echo "gaih_inet.*\+0x[[:xdigit:]]+[[:space:]]\($libc\)$" >> $expected
echo "getaddrinfo\+0x[[:xdigit:]]+[[:space:]]\($libc\)$" >> $expected
echo ".*(\+0x[[:xdigit:]]+|\[unknown\])[[:space:]]\(.*/bin/ping.*\)$" >> $expected
;;
*)
eventattr='max-stack=3'
echo ".*(\+0x[[:xdigit:]]+|\[unknown\])[[:space:]]\(.*/bin/ping.*\)$" >> $expected
;;
esac
Of course, since it relies on libc version, then we need patches, such as
1f85d016768f ("perf test record+probe_libc_inet_pton: Fix call chain match on x86_64")
311693ce81c9 ("perf test record+probe_libc_inet_pton: Fix call chain match on s390")
fb710ddee75f ("perf test record_probe_libc_inet_pton: Fix test on s/390 where 'text_to_binary_address' now appears on the backtrace")
...
which break the test when used with older libc...
I think that a better design of such test is either probing some program
of ourselves that has predictable and stable function call sequence or
be more robust in checking the stack trace.
Thoughts?
Michael
>
> Fixes: 1962ab6f6e0b ("perf test workload thloop: Make count increments atomic")
> Signed-off-by: Nick Forrington <[email protected]>
> ---
> tools/perf/tests/workloads/thloop.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/tools/perf/tests/workloads/thloop.c b/tools/perf/tests/workloads/thloop.c
> index af05269c2eb8..457b29f91c3e 100644
> --- a/tools/perf/tests/workloads/thloop.c
> +++ b/tools/perf/tests/workloads/thloop.c
> @@ -7,7 +7,6 @@
> #include "../tests.h"
>
> static volatile sig_atomic_t done;
> -static volatile unsigned count;
>
> /* We want to check this symbol in perf report */
> noinline void test_loop(void);
> @@ -19,8 +18,7 @@ static void sighandler(int sig __maybe_unused)
>
> noinline void test_loop(void)
> {
> - while (!done)
> - __atomic_fetch_add(&count, 1, __ATOMIC_RELAXED);
> + while (!done);
> }
>
> static void *thfunc(void *arg)
> --
> 2.42.0
>
>
>
Hi all,
On Fri, Nov 24, 2023 at 08:57:52PM +0100, Michael Petlan wrote:
> On Thu, 2 Nov 2023, Nick Forrington wrote:
> > The current use of atomics can lead to test failures, as tests (such as
> > tests/shell/record.sh) search for samples with "test_loop" as the
> > top-most stack frame, but find frames related to the atomic operation
> > (e.g. __aarch64_ldadd4_relax).
I am confused by above description. As I went through the script
record.sh, which is the only test invoking the program 'test_loop',
but I don't find any test is related with stack frame.
Do I miss anything? I went through record.sh but no clue why the
failure is caused by stack frame. All the testings use command:
if ! perf report -i "${perfdata}" -q | grep -q "${testsym}"
...
fi
@Nick, could you narrow down which specific test case causing the
failure.
[...]
> I believe that it was there to prevent the compiler to optimize the loop
> out or some reason like that. Hopefully, it will work even without that
> on all architectures with all compilers that are used for building perf...
Agreed.
As said above, I'd like to step back a bit for making clear what's the
exactly failure caused by the program.
Thanks,
Leo
On 25/11/2023 03:05, Leo Yan wrote:
> Hi all,
>
> On Fri, Nov 24, 2023 at 08:57:52PM +0100, Michael Petlan wrote:
>> On Thu, 2 Nov 2023, Nick Forrington wrote:
>>> The current use of atomics can lead to test failures, as tests (such as
>>> tests/shell/record.sh) search for samples with "test_loop" as the
>>> top-most stack frame, but find frames related to the atomic operation
>>> (e.g. __aarch64_ldadd4_relax).
> I am confused by above description. As I went through the script
> record.sh, which is the only test invoking the program 'test_loop',
> but I don't find any test is related with stack frame.
>
> Do I miss anything? I went through record.sh but no clue why the
> failure is caused by stack frame. All the testings use command:
>
> if ! perf report -i "${perfdata}" -q | grep -q "${testsym}"
> ...
> fi
>
> @Nick, could you narrow down which specific test case causing the
> failure.
>
> [...]
All checks for ${testsym} in record.sh (including the example you
provide) can fail, as the expected symbol (test_loop) is not the
top-most function on the stack (and therefore not the symbol associated
with the sample).
Example perf report output:
# Overhead Command Shared Object Symbol
# ........ ....... ..................... .............................
#
99.53% perf perf [.] __aarch64_ldadd4_relax
...
You can see the issue when recording/reporting with call stacks:
# Children Self Command Shared Object Symbol
# ........ ........ ....... .....................
..........................................................
#
99.52% 99.52% perf perf [.]
__aarch64_ldadd4_relax
|
|--49.77%--0xffffb905a5dc
| 0xffffb8ff0aec
| thfunc
| test_loop
| __aarch64_ldadd4_relax
...
>
>> I believe that it was there to prevent the compiler to optimize the loop
>> out or some reason like that. Hopefully, it will work even without that
>> on all architectures with all compilers that are used for building perf...
> Agreed.
>
> As said above, I'd like to step back a bit for making clear what's the
> exactly failure caused by the program.
I don't think this loop could be sensibly optimised away, as it depends
on "done", which is defined at file scope (and assigned by a signal
handler).
Cheers,
Nick
>
> Thanks,
> Leo
>
On Sat, Nov 25, 2023 at 07:10:25PM +0000, Nick Forrington wrote:
[...]
> > @Nick, could you narrow down which specific test case causing the
> > failure.
> >
> > [...]
>
> All checks for ${testsym} in record.sh (including the example you provide)
> can fail, as the expected symbol (test_loop) is not the top-most function on
> the stack (and therefore not the symbol associated with the sample).
>
>
> Example perf report output:
>
> # Overhead Command Shared Object Symbol
> # ........ ....... ..................... .............................
> #
> 99.53% perf perf [.] __aarch64_ldadd4_relax
Thanks for confirmation.
I cannot reproduce the issue on my Juno board with Debian (buster).
It's likely because I don't use GCC toolchain with enabling
'-moutline-atomics' AArch64 flag [1].
> ...
>
>
> You can see the issue when recording/reporting with call stacks:
>
> # Children Self Command Shared Object Symbol
> # ........ ........ ....... .....................
> ..........................................................
> #
> 99.52% 99.52% perf perf [.]
> __aarch64_ldadd4_relax
> |
> |--49.77%--0xffffb905a5dc
> | 0xffffb8ff0aec
> | thfunc
> | test_loop
> | __aarch64_ldadd4_relax
> ...
>
> >
> > > I believe that it was there to prevent the compiler to optimize the loop
> > > out or some reason like that. Hopefully, it will work even without that
> > > on all architectures with all compilers that are used for building perf...
> > Agreed.
> >
> > As said above, I'd like to step back a bit for making clear what's the
> > exactly failure caused by the program.
>
> I don't think this loop could be sensibly optimised away, as it depends on
> "done", which is defined at file scope (and assigned by a signal handler).
I verified your patch with 'perf annotate'.
The disassembly on Arm64 is:
noinline void test_loop(void)
{
stp x29, x30, [sp, #-32]!
mov x29, sp
adrp x0, options+0x4c0
ldr x0, [x0, #3664]
ldr x1, [x0]
str x1, [sp, #24]
mov x1, #0x0 // #0
while (!done);
nop
20: adrp x0, spacing.22251
add x0, x0, #0xa04
ldr w0, [x0]
cmp w0, #0x0
↑ b.eq 20
}
The disassembly on x86_64 is:
noinline void test_loop(void)
{
endbr64
push %rbp
mov %rsp,%rbp
sub $0x10,%rsp
mov %fs:0x28,%rax
mov %rax,-0x8(%rbp)
xor %eax,%eax
while (!done);
nop
1c: mov done,%eax
test %eax,%eax
↑ je 1c
}
Maybe the commit log caused a bit confusion, the problem is after
enabling "-moutline-atomics" on aarch64, the overhead is altered into
the linked __aarch64_ldadd4_relax() function, test_loop() cannot be
sampled anymore, but it's not about stack tracing.
Anyway, the patch is fine for me.
Thanks,
Leo
[1] https://stackoverflow.com/questions/65239845/how-to-enable-mno-outline-atomics-aarch64-flag
On 24/11/2023 19:57, Michael Petlan wrote:
> On Thu, 2 Nov 2023, Nick Forrington wrote:
>> The current use of atomics can lead to test failures, as tests (such as
>> tests/shell/record.sh) search for samples with "test_loop" as the
>> top-most stack frame, but find frames related to the atomic operation
>> (e.g. __aarch64_ldadd4_relax).
>>
>> This change simply removes the "count" variable, as it is not necessary.
>
> Hello.
>
> I believe that it was there to prevent the compiler to optimize the loop
> out or some reason like that. Hopefully, it will work even without that
> on all architectures with all compilers that are used for building perf...
>
If that optimisation ever happens and there is a concrete case, I think
it should be treated like any other test regression and fixed at that
time when it's known what the specifics of that issue are. As Nick says
in a later comment, the loop can't be optimised out because it depends
on the done variable and the function is noinline.
> I also think that the tests should be designed in a more robust way, so
> that they aren't directly dependent on exact stack frames, e.g. let's look
> at the "inet_pton" testcase...
This one doesn't look like it's dependent on exact stack frames, but
just one frame at the end, which unless something is actually broken in
Perf causing a failure, I don't see how could be made any more robust.
>
> The inet_pton test result check algorithm is designed to rely on exact
> stacktrace, without a possibility to specify something like "we want this
> and that in the stack trace, but otherwise, it does not matter which
> wrappers are aroung". Then there must be the following code to adjust
> the expected output exactly per architecture:
>
> echo "ping[][0-9 \.:]+$event_name: \([[:xdigit:]]+\)" > $expected
> echo ".*inet_pton\+0x[[:xdigit:]]+[[:space:]]\($libc|inlined\)$" >> $expected
> case "$(uname -m)" in
> s390x)
> eventattr='call-graph=dwarf,max-stack=4'
> echo "(__GI_)?getaddrinfo\+0x[[:xdigit:]]+[[:space:]]\($libc|inlined\)$" >> $expected
> echo "main\+0x[[:xdigit:]]+[[:space:]]\(.*/bin/ping.*\)$" >> $expected
> ;;
> ppc64|ppc64le)
> eventattr='max-stack=4'
> echo "gaih_inet.*\+0x[[:xdigit:]]+[[:space:]]\($libc\)$" >> $expected
> echo "getaddrinfo\+0x[[:xdigit:]]+[[:space:]]\($libc\)$" >> $expected
> echo ".*(\+0x[[:xdigit:]]+|\[unknown\])[[:space:]]\(.*/bin/ping.*\)$" >> $expected
> ;;
> *)
> eventattr='max-stack=3'
> echo ".*(\+0x[[:xdigit:]]+|\[unknown\])[[:space:]]\(.*/bin/ping.*\)$" >> $expected
> ;;
> esac
>
> Of course, since it relies on libc version, then we need patches, such as
> 1f85d016768f ("perf test record+probe_libc_inet_pton: Fix call chain match on x86_64")
> 311693ce81c9 ("perf test record+probe_libc_inet_pton: Fix call chain match on s390")
> fb710ddee75f ("perf test record_probe_libc_inet_pton: Fix test on s/390 where 'text_to_binary_address' now appears on the backtrace")
> ...
> which break the test when used with older libc...
>
I definitely think that relying on external libraries for stacks in
tests can be annoying to keep up to date, but that's exactly what we
just removed from test_loop.
For the inet_pton test it seems to be specifically testing probing libc,
so maybe making the test program ourselves would make the test much less
valuable. And then as long as all the tests are not like that and there
is only this one to keep up to date, maybe it's not that bad.
> I think that a better design of such test is either probing some program
> of ourselves that has predictable and stable function call sequence or
> be more robust in checking the stack trace.
>
> Thoughts?
>
> Michael
>
>>
>> Fixes: 1962ab6f6e0b ("perf test workload thloop: Make count increments atomic")
>> Signed-off-by: Nick Forrington <[email protected]>
>> ---
>> tools/perf/tests/workloads/thloop.c | 4 +---
>> 1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/tools/perf/tests/workloads/thloop.c b/tools/perf/tests/workloads/thloop.c
>> index af05269c2eb8..457b29f91c3e 100644
>> --- a/tools/perf/tests/workloads/thloop.c
>> +++ b/tools/perf/tests/workloads/thloop.c
>> @@ -7,7 +7,6 @@
>> #include "../tests.h"
>>
>> static volatile sig_atomic_t done;
>> -static volatile unsigned count;
>>
>> /* We want to check this symbol in perf report */
>> noinline void test_loop(void);
>> @@ -19,8 +18,7 @@ static void sighandler(int sig __maybe_unused)
>>
>> noinline void test_loop(void)
>> {
>> - while (!done)
>> - __atomic_fetch_add(&count, 1, __ATOMIC_RELAXED);
>> + while (!done);
>> }
>>
>> static void *thfunc(void *arg)
>> --
>> 2.42.0
>>
>>
>>
>
>
Em Sun, Nov 26, 2023 at 03:41:14PM +0800, Leo Yan escreveu:
> Maybe the commit log caused a bit confusion, the problem is after
We'll have the Link pointing to this discussion.
> enabling "-moutline-atomics" on aarch64, the overhead is altered into
> the linked __aarch64_ldadd4_relax() function, test_loop() cannot be
> sampled anymore, but it's not about stack tracing.
>
> Anyway, the patch is fine for me.
I'm taking this as an Acked-by: Leo
But probably this could be even a Tested-by, ok?
- Arnaldo
On Mon, Nov 27, 2023 at 10:20:47AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Sun, Nov 26, 2023 at 03:41:14PM +0800, Leo Yan escreveu:
> > Maybe the commit log caused a bit confusion, the problem is after
>
> We'll have the Link pointing to this discussion.
Okay, it's a good tracking.
> > enabling "-moutline-atomics" on aarch64, the overhead is altered into
> > the linked __aarch64_ldadd4_relax() function, test_loop() cannot be
> > sampled anymore, but it's not about stack tracing.
> >
> > Anyway, the patch is fine for me.
>
> I'm taking this as an Acked-by: Leo
>
> But probably this could be even a Tested-by, ok?
Yes, here is my test tag:
Tested-by: Leo Yan <[email protected]>