Subject: cpu-freq: running the perf increases the data rate?

[Please keep me in CC, as I'm not subscribed to the list.]

Hi all,

I have an application which measures the data rate over the PCIe
interface. I'm getting a lower data rate on one of my Linux x86
systems.
When I change the scaling_governor from "powersave" to "performance"
for each CPU, the PCIe data rate improves slightly.
In parallel, I started profiling the workload with perf. Whenever I
run the profiling command "perf stat -a -d -p <PID>", the application
surprisingly achieves an excellent data rate over PCIe, but when I
kill the perf command, the PCIe data rate drops again. I am really
confused by this behavior. Any clues?
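For reference, the governor change described above can be scripted roughly as follows. This is a minimal sketch, assuming the standard cpufreq sysfs layout under /sys/devices/system/cpu and root privileges. set_governor is a hypothetical helper, and its root parameter exists only so the function can be pointed at a test directory.

```python
import glob
import os

CPU_ROOT = "/sys/devices/system/cpu"


def set_governor(governor, root=CPU_ROOT):
    """Write `governor` to every CPU's cpufreq scaling_governor file.

    Checks each policy's scaling_available_governors first and returns the
    names of the CPUs that were changed. Requires root on a real system.
    """
    pattern = os.path.join(root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    paths = sorted(glob.glob(pattern))
    # Validate every CPU before writing anything, so a bad name changes nothing.
    for path in paths:
        avail = os.path.join(os.path.dirname(path), "scaling_available_governors")
        with open(avail) as f:
            available = f.read().split()
        if governor not in available:
            raise ValueError(f"{governor!r} not available for {path}: {available}")
    changed = []
    for path in paths:
        with open(path, "w") as f:
            f.write(governor)
        # .../cpuN/cpufreq/scaling_governor -> "cpuN"
        changed.append(path.split(os.sep)[-3])
    return changed
```

Calling set_governor("performance") as root has the same effect as writing "performance" into each scaling_governor file by hand.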


Also, I noticed that my system does not have the 'cpuinfo_cur_freq'
sysfs file. Is that okay?
cat: /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq: No such file or directory
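The cpufreq core creates cpuinfo_cur_freq only when the driver can read the frequency back from hardware, and the file is normally readable only by root. Its absence is therefore not necessarily a problem, though whether that is the reason here is an assumption. The current frequency can usually be read from scaling_cur_freq instead. The sketch below uses a hypothetical helper, read_cur_freqs_khz. It prefers cpuinfo_cur_freq and falls back to scaling_cur_freq; both files report kHz.

```python
import glob
import os

CPU_ROOT = "/sys/devices/system/cpu"


def read_cur_freqs_khz(root=CPU_ROOT):
    """Return {"cpuN": frequency_in_kHz} for every CPU with a cpufreq policy.

    Prefers cpuinfo_cur_freq (hardware read-back, usually root-only) and
    falls back to scaling_cur_freq when the former is absent or unreadable.
    """
    freqs = {}
    for cpu_dir in sorted(glob.glob(os.path.join(root, "cpu[0-9]*"))):
        for name in ("cpuinfo_cur_freq", "scaling_cur_freq"):
            path = os.path.join(cpu_dir, "cpufreq", name)
            try:
                with open(path) as f:
                    freqs[os.path.basename(cpu_dir)] = int(f.read().strip())
                break  # got a value, skip the fallback
            except (OSError, ValueError):
                continue  # missing, permission denied, or unparsable
    return freqs
```

For example, printing `{cpu}: {khz / 1000:.0f} MHz` for each item of read_cur_freqs_khz() gives a quick per-CPU view that can be compared with and without perf running.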


--
Thanks,


2020-08-28 12:37:18

by Artem Bityutskiy

Subject: Re: cpu-freq: running the perf increases the data rate?

On Thu, 2020-08-27 at 22:25 +0530, Subhashini Rao Beerisetty wrote:
> I have an application which measures the data rate over the PCIe
> interface. I'm getting a lower data rate on one of my Linux x86
> systems.

Could you give some more description? Do you have a PCIe device
reading one RAM buffer and then writing to another RAM buffer? Or does
it generate some data and write it to a RAM buffer? Presumably it uses
DMA? How much is the CPU involved in the process? Are we talking about
transferring a few kilobytes or gigabytes?

> When I change the scaling_governor from "powersave" to "performance"
> for each CPU, the PCIe data rate improves slightly.

This definitely makes your CPU(s) run at maximum speed, but depending
on the platform and settings, it may also affect C-states. Are the
CPU(s) generally idle while you measure, or busy (involved in the
test)? You could run 'turbostat' while measuring the bandwidth to get
some CPU statistics (e.g., whether C-states happen during the PCIe
test and how busy the CPUs are).
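Alongside turbostat, the per-CPU cpuidle counters in sysfs give a rough answer to whether C-states happen during the test. Each /sys/devices/system/cpu/cpuN/cpuidle/stateM directory contains name, usage (the number of times the state was entered) and time (the total residency in microseconds). This minimal sketch takes a snapshot before and after a test run and reports the difference. The helper names cstate_snapshot and cstate_delta are hypothetical.

```python
import glob
import os

CPU_ROOT = "/sys/devices/system/cpu"


def _read(path):
    with open(path) as f:
        return f.read().strip()


def cstate_snapshot(root=CPU_ROOT):
    """Return {(cpu, state_name): (usage, time_us)} for every cpuidle state."""
    snap = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cpuidle", "state[0-9]*")
    for state_dir in glob.glob(pattern):
        cpu = state_dir.split(os.sep)[-3]  # .../cpuN/cpuidle/stateM -> "cpuN"
        name = _read(os.path.join(state_dir, "name"))
        usage = int(_read(os.path.join(state_dir, "usage")))
        time_us = int(_read(os.path.join(state_dir, "time")))
        snap[(cpu, name)] = (usage, time_us)
    return snap


def cstate_delta(before, after):
    """Per (cpu, state): entries and residency (us) between two snapshots."""
    return {key: (after[key][0] - before[key][0], after[key][1] - before[key][1])
            for key in after if key in before}
```

Taking one snapshot, running the PCIe test for a fixed time, and taking another, once with perf and once without, shows which states the CPUs actually entered in each case.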

> In parallel, I started profiling the workload with perf. Whenever I
> run the profiling command "perf stat -a -d -p <PID>", the application
> surprisingly achieves an excellent data rate over PCIe, but when I
> kill the perf command, the PCIe data rate drops again. I am really
> confused by this behavior. Any clues?

Well, one possible reason that comes to mind: you get rid of C-states
when you run perf, and this increases the PCIe bandwidth. You could
simply try disabling C-states (there are sysfs knobs) and check.
Turbostat could be useful to check this: with and without perf, run
something like 'turbostat sleep 10' (which measures for 10 seconds)
while your PCIe test is running.
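One way to try this is through the cpuidle sysfs knobs. Writing 1 to /sys/devices/system/cpu/cpuN/cpuidle/stateM/disable disables that idle state on that CPU, and writing 0 re-enables it. The sketch below is an illustration rather than a tested tool. It disables every state whose exit latency (the latency file, in microseconds) exceeds a threshold and re-enables the rest, so passing a very large threshold undoes it. limit_cstates is a hypothetical name, and root privileges are assumed.

```python
import glob
import os

CPU_ROOT = "/sys/devices/system/cpu"


def limit_cstates(max_latency_us, root=CPU_ROOT):
    """Disable cpuidle states with exit latency > max_latency_us, enable the rest.

    Returns a sorted list of (cpu, state_name) pairs that are now disabled.
    """
    disabled = []
    pattern = os.path.join(root, "cpu[0-9]*", "cpuidle", "state[0-9]*")
    for state_dir in sorted(glob.glob(pattern)):
        with open(os.path.join(state_dir, "latency")) as f:
            latency_us = int(f.read().strip())
        with open(os.path.join(state_dir, "name")) as f:
            name = f.read().strip()
        turn_off = latency_us > max_latency_us
        with open(os.path.join(state_dir, "disable"), "w") as f:
            f.write("1" if turn_off else "0")
        if turn_off:
            # .../cpuN/cpuidle/stateM -> "cpuN"
            disabled.append((state_dir.split(os.sep)[-3], name))
    return sorted(disabled)
```

A threshold-based approach keeps shallow states available while removing the deep ones. limit_cstates(0) leaves only zero-latency states, such as POLL if the idle driver provides one, at the cost of higher power consumption. limit_cstates(10**9) re-enables everything.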

But I am really just guessing here; I do not know enough about your
test and the system (e.g., "a Linux x86" system can be so many things,
such as an Intel or AMD server or a mobile device)...


Subject: Re: cpu-freq: running the perf increases the data rate?

On Fri, Aug 28, 2020 at 6:04 PM Artem Bityutskiy wrote:
>
> On Thu, 2020-08-27 at 22:25 +0530, Subhashini Rao Beerisetty wrote:
> > I have an application which measures the data rate over the PCIe
> > interface. I'm getting a lower data rate on one of my Linux x86
> > systems.
>
> Could you give some more description? Do you have a PCIe device
> reading one RAM buffer and then writing to another RAM buffer? Or does
> it generate some data and write it to a RAM buffer? Presumably it uses
> DMA? How much is the CPU involved in the process? Are we talking about
> transferring a few kilobytes or gigabytes?
Thanks a lot for your help and reply.
Regarding the hardware setup: a Xilinx PCIe FPGA endpoint is connected
to the host CPU via the PCIe bus.
The Xilinx PCIe FPGA endpoint has a DMA-REF block, which provides a
mechanism to transfer data by DMA at the maximum rate between host CPU
memory and a FIFO in the DMA-REF block.
The host software sets up some data in its memory, transfers the data
to the DMA-REF's FIFO, and then reads it back into a different
location in host memory. This is repeated in a loop. A register in
the DMA-REF block gives an indication of the transfer speed.
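To cross-check the DMA-REF register, the host side of that loop can also be timed in software. This sketch assumes a hypothetical round_trip() callable that performs one host-to-FIFO transfer of payload_bytes followed by the read-back. Neither that function nor the driver interface behind it comes from this thread. The sketch counts both directions of each iteration.

```python
import time


def measure_throughput_mb_s(round_trip, payload_bytes, iterations=1000,
                            clock=time.perf_counter):
    """Run round_trip() `iterations` times and return the achieved rate in MB/s.

    Each iteration is assumed to move payload_bytes to the device and the
    same amount back, so both directions are counted.
    """
    start = clock()
    for _ in range(iterations):
        round_trip()
    elapsed = clock() - start
    return 2 * payload_bytes * iterations / elapsed / 1e6
```

Comparing this figure with deep C-states enabled and disabled is one way to confirm the effect independently of the hardware register. The clock parameter exists only so the arithmetic can be tested deterministically.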


>
> > When I change the scaling_governor from "powersave" to "performance"
> > for each CPU, the PCIe data rate improves slightly.
>
> This definitely makes your CPU(s) run at maximum speed, but depending
> on the platform and settings, it may also affect C-states. Are the
> CPU(s) generally idle while you measure, or busy (involved in the
> test)? You could run 'turbostat' while measuring the bandwidth to get
> some CPU statistics (e.g., whether C-states happen during the PCIe
> test and how busy the CPUs are).
>
> > In parallel, I started profiling the workload with perf. Whenever I
> > run the profiling command "perf stat -a -d -p <PID>", the application
> > surprisingly achieves an excellent data rate over PCIe, but when I
> > kill the perf command, the PCIe data rate drops again. I am really
> > confused by this behavior. Any clues?
>
> Well, one possible reason that comes to mind: you get rid of C-states
> when you run perf, and this increases the PCIe bandwidth. You could
> simply try disabling C-states (there are sysfs knobs) and check.
> Turbostat could be useful to check this: with and without perf, run
> something like 'turbostat sleep 10' (which measures for 10 seconds)
> while your PCIe test is running.
Disabling the C-states improved the throughput a lot; thanks a lot for
pointing this out. Could you please explain in more detail how
disabling C-states improves the throughput?
As you suggested, I collected the turbostat logs with and without perf
while running the PCIe test and attached them.
On my system, only 'performance' and 'powersave' are listed in
scaling_available_governors. The other governors ('userspace',
'ondemand' and 'schedutil') are not listed. What might be the reason
for this?
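A likely explanation, assuming the system uses the intel_pstate driver in its default active mode: in that mode the driver offers only its own "performance" and "powersave" policies. The generic governors (userspace, ondemand, schedutil and so on) are available only with a driver that uses them, such as intel_pstate in passive mode (reported as intel_cpufreq) or acpi-cpufreq. The sketch below uses a hypothetical helper name and reads which driver is in use, so that assumption can be checked.

```python
import os

CPU_ROOT = "/sys/devices/system/cpu"


def cpufreq_driver_info(cpu="cpu0", root=CPU_ROOT):
    """Return (scaling_driver, list_of_available_governors) for one CPU."""
    base = os.path.join(root, cpu, "cpufreq")
    with open(os.path.join(base, "scaling_driver")) as f:
        driver = f.read().strip()
    with open(os.path.join(base, "scaling_available_governors")) as f:
        governors = f.read().split()
    return driver, governors
```

If cpufreq_driver_info() reports intel_pstate, recent kernels also expose /sys/devices/system/cpu/intel_pstate/status, which shows whether the driver is in active or passive mode.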

>
> But I am really just guessing here; I do not know enough about your
> test and the system (e.g., "a Linux x86" system can be so many things,
> such as an Intel or AMD server or a mobile device)...
It's an Intel Atom processor.
>
>


Attachments:
trubostat_with_perf.txt (2.57 kB)
trubostat_without_perf.txt (2.58 kB)