2015-04-24 13:32:28

by Yunlong Song

[permalink] [raw]
Subject: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

[Profiling Background]
Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
we find a case that ext4 is better than f2fs in random write under the test of
"iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
the overall sys_write (262144/64=4096). We want to find out the call stack during
this specific 1% sys_write. Our idea is to record the stack in a certain time period
and since the specific 1% case takes up 60% time, the total number of records of its
stack should also takes up 60% of the total records, then we can recognize those stacks
and figure out what the I/O stack of f2fs is doing in the 1% case.

[Profiling Problem]

Although perf can record the events (with call stack) of a specified pid, e.g. using
"perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
and the CPU is scheduled to other process. As a result, perf will not record any events
of iozone until iozone's context is recovered and the CPU is scheduled to continue
processing the sys_write of iozone. This obeys our initial idea which is described in
[Profiling Background], since we cannot recognize the call stacks of the specific 1% case
by using the ratio of the record number.

[Alternative Solution without Perf]
We can use /proc/#pid/stack to record the stack in a certain period (e.g. 1ms) of iozone,
no matter whether iozone is interrupted or not. However, we have not taken a deep sight
into this, since we want to use perf to do this kind of thing.

[Question about Perf]
So we have a question that "How does perf still record the stack of a specified pid even
when that process is interrupted and CPU is scheduled to other process?"

--
Thanks,
Yunlong Song


2015-04-24 13:49:31

by Yunlong Song

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 2015/4/24 21:31, Yunlong Song wrote:
> [Profiling Background]
> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
> we find a case that ext4 is better than f2fs in random write under the test of
> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
> the overall sys_write (262144/64=4096). We want to find out the call stack during
> this specific 1% sys_write. Our idea is to record the stack in a certain time period
> and since the specific 1% case takes up 60% time, the total number of records of its
> stack should also takes up 60% of the total records, then we can recognize those stacks
> and figure out what the I/O stack of f2fs is doing in the 1% case.
>
> [Profiling Problem]
>
> Although perf can record the events (with call stack) of a specified pid, e.g. using
> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
> and the CPU is scheduled to other process. As a result, perf will not record any events
> of iozone until iozone's context is recovered and the CPU is scheduled to continue
> processing the sys_write of iozone.

>This obeys our initial idea which is described in [Profiling Background],

This "disobeys" our initial idea which is described in [Profiling Background],

since we cannot recognize the call stacks of the specific 1% case
> by using the ratio of the record number.
>
> [Alternative Solution without Perf]
> We can use /proc/#pid/stack to record the stack in a certain period (e.g. 1ms) of iozone,
> no matter whether iozone is interrupted or not. However, we have not taken a deep sight
> into this, since we want to use perf to do this kind of thing.
>
> [Question about Perf]
> So we have a question that "How does perf still record the stack of a specified pid even
> when that process is interrupted and CPU is scheduled to other process?"
>


--
Thanks,
Yunlong Song

2015-04-24 13:49:52

by David Ahern

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 4/24/15 7:31 AM, Yunlong Song wrote:
> [Profiling Problem]
>
> Although perf can record the events (with call stack) of a specified pid, e.g. using
> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted

That command tells perf to *only* collect data for iozone. This command:

perf record -g -a -- iozone ....

tells perf to collect samples system wide as long as the iozone process
is alive.

2015-04-24 13:56:22

by Yunlong Song

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 2015/4/24 21:49, David Ahern wrote:
> On 4/24/15 7:31 AM, Yunlong Song wrote:
>> [Profiling Problem]
>>
>> Although perf can record the events (with call stack) of a specified pid, e.g. using
>> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
>
> That command tells perf to *only* collect data for iozone. This command:
>
> perf record -g -a -- iozone ....
>
> tells perf to collect samples system wide as long as the iozone process is alive.
>
>
>

But we only want the records of iozone rather than the overall system, since other process
will interfere the targeted I/O stack of the specific 1% case, and we cannot figure out the
"clean" specific stack which takes up 60% time.

--
Thanks,
Yunlong Song

2015-04-24 13:58:51

by David Ahern

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 4/24/15 7:31 AM, Yunlong Song wrote:
> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
> we find a case that ext4 is better than f2fs in random write under the test of
> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
> the overall sys_write (262144/64=4096). We want to find out the call stack during
> this specific 1% sys_write. Our idea is to record the stack in a certain time period
> and since the specific 1% case takes up 60% time, the total number of records of its
> stack should also takes up 60% of the total records, then we can recognize those stacks
> and figure out what the I/O stack of f2fs is doing in the 1% case.

And to address this specific profiling problem have you tried:

perf trace record -- iozone ...
perf trace -i perf.data -S

2015-04-25 14:04:18

by Yunlong Song

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 2015/4/24 21:49, Yunlong Song wrote:
> On 2015/4/24 21:31, Yunlong Song wrote:
>> [Profiling Background]
>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>> we find a case that ext4 is better than f2fs in random write under the test of
>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>> and since the specific 1% case takes up 60% time, the total number of records of its
>> stack should also takes up 60% of the total records, then we can recognize those stacks
>> and figure out what the I/O stack of f2fs is doing in the 1% case.
>>
>> [Profiling Problem]
>>
>> Although perf can record the events (with call stack) of a specified pid, e.g. using
>> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
>> and the CPU is scheduled to other process. As a result, perf will not record any events
>> of iozone until iozone's context is recovered and the CPU is scheduled to continue
>> processing the sys_write of iozone.
>
>> This obeys our initial idea which is described in [Profiling Background],
>
> This "disobeys" our initial idea which is described in [Profiling Background],
>
> since we cannot recognize the call stacks of the specific 1% case
>> by using the ratio of the record number.
>>
>> [Alternative Solution without Perf]
>> We can use /proc/#pid/stack to record the stack in a certain period (e.g. 1ms) of iozone,
>> no matter whether iozone is interrupted or not. However, we have not taken a deep sight
>> into this, since we want to use perf to do this kind of thing.
>>
>> [Question about Perf]
>> So we have a question that "How does perf still record the stack of a specified pid even
>> when that process is interrupted and CPU is scheduled to other process?"
>>
>
>

Here is another example:

$ cat test.c
void func_loop(void)
{
int i, sum = 0;
for (i = 0; i < 100000; i++)
sum += 1;
}

void func_sleep(void)
{
sleep(1);
}

int main(void)
{
while (1) {
func_loop();
func_sleep();
}
return 0;
}

$ gcc test.c -o test -O0
$ ./test
$ ps aux | grep test
Yunlong+ 133798 0.0 0.0 4056 348 pts/7 S+ 21:52 0:00 ./test
$ perf record -g -p 133798 sleep 5
$ perf report -g --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 16 of event 'cycles'
# Event count (approx.): 2439331
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ................. ...............................
#
79.46% 76.48% test test [.] func_loop
|
---func_loop
|
|--97.07%-- main
| __libc_start_main
--2.93%-- [...]

77.14% 0.00% test libc-2.19.so [.] __libc_start_main
|
---__libc_start_main

77.14% 0.00% test test [.] main
|
---main
__libc_start_main

20.54% 0.00% test [unknown] [.] 0000000000000000
|
---0

20.54% 0.00% test libc-2.19.so [.] __GI___libc_nanosleep
|
---__GI___libc_nanosleep
0

20.54% 0.00% test [unknown] [k] 0x00007f2fa07ef6b0
|
---0x7f2fa07ef6b0
__GI___libc_nanosleep
0

20.54% 0.00% test [kernel.kallsyms] [k] system_call_fastpath
|
---system_call_fastpath
0x7f2fa07ef6b0
__GI___libc_nanosleep

As shown above, the func_loop is regarded as the main operation and its call stack
is showed on the top of perf report. However, we know that func_sleep is really the
main operation from the view of time duration indeed. And if we want to its actual
call stack in the real time, we can see the result as following:

$ while true ; do cat /proc/133798/stack ; sleep 0.1 ; done
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff

As shown above, func_sleep is regarded as the main operation from the view of time and
its call stack is printed in real time.

--
Thanks,
Yunlong Song

2015-04-25 14:05:50

by Yunlong Song

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 2015/4/24 21:58, David Ahern wrote:
> On 4/24/15 7:31 AM, Yunlong Song wrote:
>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>> we find a case that ext4 is better than f2fs in random write under the test of
>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>> and since the specific 1% case takes up 60% time, the total number of records of its
>> stack should also takes up 60% of the total records, then we can recognize those stacks
>> and figure out what the I/O stack of f2fs is doing in the 1% case.
>
> And to address this specific profiling problem have you tried:
>
> perf trace record -- iozone ...
> perf trace -i perf.data -S
>
>
>
>

But this only shows the system call like strace, but we want the call stack of kernel functions
in fact.

--
Thanks,
Yunlong Song

2015-04-25 15:54:02

by David Ahern

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 4/25/15 8:05 AM, Yunlong Song wrote:
> On 2015/4/24 21:58, David Ahern wrote:
>> On 4/24/15 7:31 AM, Yunlong Song wrote:
>>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>>> we find a case that ext4 is better than f2fs in random write under the test of
>>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>>> and since the specific 1% case takes up 60% time, the total number of records of its
>>> stack should also takes up 60% of the total records, then we can recognize those stacks
>>> and figure out what the I/O stack of f2fs is doing in the 1% case.
>>
>> And to address this specific profiling problem have you tried:
>>
>> perf trace record -- iozone ...
>> perf trace -i perf.data -S
>>
>>
>>
>>
>
> But this only shows the system call like strace, but we want the call stack of kernel functions
> in fact.
>

We haven't added the callchain option yet; on the to-do list.

perf trace record -g -- iozone ...
perf trace -i perf.data -s
--> summary of system calls, max/min/average times

perf trace -i perf.data --duration 10.0 -T
--> note the timestamp where the write took a "long" time

perf script
--> search down to *around* the time of interest; you want the syscall
entry; timestamp is for exit

2015-05-05 21:53:44

by Rabin Vincent

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On Sat, Apr 25, 2015 at 09:53:57AM -0600, David Ahern wrote:
> On 4/25/15 8:05 AM, Yunlong Song wrote:
> >But this only shows the system call like strace, but we want the call
> >stack of kernel functions in fact.
> >
> We haven't added the callchain option yet; on the to-do list.
>
> perf trace record -g -- iozone ...
> perf trace -i perf.data -s
> --> summary of system calls, max/min/average times
>
> perf trace -i perf.data --duration 10.0 -T
> --> note the timestamp where the write took a "long" time
>
> perf script
> --> search down to *around* the time of interest; you want the syscall
> entry; timestamp is for exit

Now if I understood the use case right, what Yulong Song wants to know
is what the iozone process is doing in the kernel (i.e. the stacktrace of why
exactly it goes to sleep / what it's waiting for) during these
sys_writes which take a long time.

The commands above will identify the sys_write which takes time but only
provide the stacktrace at the entry and exit of the syscall, but this do
not show why the process blocked or what it did inside the system call.

So a way to get what is required for the use case would be to make the
following changes to the above sequence:

(1) include the sched:* events when perf trace record is run

(2) around the time of interest, look at the kernel stack st the sched:switch
events between the entry and the exit. This will show what the process was
waiting for when it when it blocked. The stacktraces at the
stat_runtime events in the process may also be useful to understand what
was going on.

Example:

$ perf trace record -g -e sched:* -- dd if=/dev/zero of=/x bs=10M count=100 conv=fsync
$ perf trace -i perf.data -s

dd (147), 364 events, 94.3%, 0.000 msec

syscall calls min avg max stddev
--------------- -------- --------- --------- --------- ------
write 63 266.019 316.896 963.413 4.69%
...

$ perf trace -i perf.data --duration 960 -T
91916.354 (963.413 ms): dd/147 write(arg0: 1, arg1: 139729327423488, arg2: 10485760, arg3: 582, arg4: 100, arg5: 72340172838076673) = 10485760

$ perf script
...

dd 147 [002] 90.952941: raw_syscalls:sys_enter: NR 1 (1, 7f1544ed0000, a00000, 246, 64, 101010101010101)
211a00 syscall_trace_enter_phase2 ([kernel.kallsyms])
7d79c3 tracesys_phase2 ([kernel.kallsyms])
ed20 [unknown] (/lib/libpthread-2.16.so)

(... trimmed everything except the longest block: ...)

dd 147 [002] 91.204723: sched:sched_switch: prev_comm=dd prev_pid=147 prev_prio=120 prev_state=D ==> next_comm=swapper/2 next_pid=0 next_prio=120
7d1121 __schedule ([kernel.kallsyms])
7d1817 schedule ([kernel.kallsyms])
4578a5 jbd2_log_wait_commit ([kernel.kallsyms])
4546ff jbd2_log_do_checkpoint ([kernel.kallsyms])
45489f __jbd2_log_wait_for_space ([kernel.kallsyms])
44c2d8 start_this_handle ([kernel.kallsyms])
44c7b6 jbd2__journal_start ([kernel.kallsyms])
42fc3b __ext4_journal_start_sb ([kernel.kallsyms])
4012e9 ext4_write_begin ([kernel.kallsyms])
33795f generic_perform_write ([kernel.kallsyms])
338b68 __generic_file_write_iter ([kernel.kallsyms])
3f4371 ext4_file_write_iter ([kernel.kallsyms])
381ae1 __vfs_write ([kernel.kallsyms])
381f07 vfs_write ([kernel.kallsyms])
3825dd sys_write ([kernel.kallsyms])
7d7a1a tracesys_phase2 ([kernel.kallsyms])
ed20 [unknown] (/lib/libpthread-2.16.so)

swapper 0 [002] 91.875573: sched:sched_wakeup: comm=dd pid=147 prio=120 success=1 target_cpu=002

2015-05-05 22:24:58

by David Ahern

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 5/5/15 3:53 PM, Rabin Vincent wrote:
> The commands above will identify the sys_write which takes time but only
> provide the stacktrace at the entry and exit of the syscall, but this do
> not show why the process blocked or what it did inside the system call.
>
> So a way to get what is required for the use case would be to make the
> following changes to the above sequence:
>
> (1) include the sched:* events when perf trace record is run
>
> (2) around the time of interest, look at the kernel stack st the sched:switch
> events between the entry and the exit. This will show what the process was
> waiting for when it when it blocked. The stacktraces at the
> stat_runtime events in the process may also be useful to understand what
> was going on.

agreed.

2015-05-06 04:10:28

by Yunlong Song

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 2015/4/25 23:53, David Ahern wrote:
> On 4/25/15 8:05 AM, Yunlong Song wrote:
>> On 2015/4/24 21:58, David Ahern wrote:
>>> On 4/24/15 7:31 AM, Yunlong Song wrote:
>>>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>>>> we find a case that ext4 is better than f2fs in random write under the test of
>>>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>>>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>>>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>>>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>>>> and since the specific 1% case takes up 60% time, the total number of records of its
>>>> stack should also takes up 60% of the total records, then we can recognize those stacks
>>>> and figure out what the I/O stack of f2fs is doing in the 1% case.
>>>
>>> And to address this specific profiling problem have you tried:
>>>
>>> perf trace record -- iozone ...
>>> perf trace -i perf.data -S
>>>
>>>
>>>
>>>
>>
>> But this only shows the system call like strace, but we want the call stack of kernel functions
>> in fact.
>>
>
> We haven't added the callchain option yet; on the to-do list.
>
> perf trace record -g -- iozone ...
> perf trace -i perf.data -s
> --> summary of system calls, max/min/average times
>
> perf trace -i perf.data --duration 10.0 -T
> --> note the timestamp where the write took a "long" time
>
> perf script
> --> search down to *around* the time of interest; you want the syscall entry; timestamp is for exit
>
> .
>

Hi, David,

It's almost what we want, we are eager to see it can work as a callchain option, since it's really a useful
tool in analyzing latency of I/O performance in production case.

--
Thanks,
Yunlong Song

2015-05-06 04:14:12

by Yunlong Song

[permalink] [raw]
Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

On 2015/5/6 5:53, Rabin Vincent wrote:
> On Sat, Apr 25, 2015 at 09:53:57AM -0600, David Ahern wrote:
>> On 4/25/15 8:05 AM, Yunlong Song wrote:
>>> But this only shows the system call like strace, but we want the call
>>> stack of kernel functions in fact.
>>>
>> We haven't added the callchain option yet; on the to-do list.
>>
>> perf trace record -g -- iozone ...
>> perf trace -i perf.data -s
>> --> summary of system calls, max/min/average times
>>
>> perf trace -i perf.data --duration 10.0 -T
>> --> note the timestamp where the write took a "long" time
>>
>> perf script
>> --> search down to *around* the time of interest; you want the syscall
>> entry; timestamp is for exit
>
> Now if I understood the use case right, what Yulong Song wants to know
> is what the iozone process is doing in the kernel (i.e. the stacktrace of why
> exactly it goes to sleep / what it's waiting for) during these
> sys_writes which take a long time.
>
> The commands above will identify the sys_write which takes time but only
> provide the stacktrace at the entry and exit of the syscall, but this do
> not show why the process blocked or what it did inside the system call.
>
> So a way to get what is required for the use case would be to make the
> following changes to the above sequence:
>
> (1) include the sched:* events when perf trace record is run
>
> (2) around the time of interest, look at the kernel stack st the sched:switch
> events between the entry and the exit. This will show what the process was
> waiting for when it when it blocked. The stacktraces at the
> stat_runtime events in the process may also be useful to understand what
> was going on.
>
> Example:
>
> $ perf trace record -g -e sched:* -- dd if=/dev/zero of=/x bs=10M count=100 conv=fsync
> $ perf trace -i perf.data -s
>
> dd (147), 364 events, 94.3%, 0.000 msec
>
> syscall calls min avg max stddev
> --------------- -------- --------- --------- --------- ------
> write 63 266.019 316.896 963.413 4.69%
> ...
>
> $ perf trace -i perf.data --duration 960 -T
> 91916.354 (963.413 ms): dd/147 write(arg0: 1, arg1: 139729327423488, arg2: 10485760, arg3: 582, arg4: 100, arg5: 72340172838076673) = 10485760
>
> $ perf script
> ...
>
> dd 147 [002] 90.952941: raw_syscalls:sys_enter: NR 1 (1, 7f1544ed0000, a00000, 246, 64, 101010101010101)
> 211a00 syscall_trace_enter_phase2 ([kernel.kallsyms])
> 7d79c3 tracesys_phase2 ([kernel.kallsyms])
> ed20 [unknown] (/lib/libpthread-2.16.so)
>
> (... trimmed everything except the longest block: ...)
>
> dd 147 [002] 91.204723: sched:sched_switch: prev_comm=dd prev_pid=147 prev_prio=120 prev_state=D ==> next_comm=swapper/2 next_pid=0 next_prio=120
> 7d1121 __schedule ([kernel.kallsyms])
> 7d1817 schedule ([kernel.kallsyms])
> 4578a5 jbd2_log_wait_commit ([kernel.kallsyms])
> 4546ff jbd2_log_do_checkpoint ([kernel.kallsyms])
> 45489f __jbd2_log_wait_for_space ([kernel.kallsyms])
> 44c2d8 start_this_handle ([kernel.kallsyms])
> 44c7b6 jbd2__journal_start ([kernel.kallsyms])
> 42fc3b __ext4_journal_start_sb ([kernel.kallsyms])
> 4012e9 ext4_write_begin ([kernel.kallsyms])
> 33795f generic_perform_write ([kernel.kallsyms])
> 338b68 __generic_file_write_iter ([kernel.kallsyms])
> 3f4371 ext4_file_write_iter ([kernel.kallsyms])
> 381ae1 __vfs_write ([kernel.kallsyms])
> 381f07 vfs_write ([kernel.kallsyms])
> 3825dd sys_write ([kernel.kallsyms])
> 7d7a1a tracesys_phase2 ([kernel.kallsyms])
> ed20 [unknown] (/lib/libpthread-2.16.so)
>
> swapper 0 [002] 91.875573: sched:sched_wakeup: comm=dd pid=147 prio=120 success=1 target_cpu=002
>
> .
>

Good, it's clearer with "sched:*" event.

--
Thanks,
Yunlong Song